- Fix page refcount leak when looking up specific inodes
introduced by metabuf reworking.
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmXbQekRHHhpYW5nQGtl
cm5lbC5vcmcACgkQUXZn5Zlu5qpkyg//cjKnuzI2gHL3ff1o0jiTMD11Y/EmwzQS
C+JgE8WDMjUlVQpbwYnFWo/6se8qvk/fwTXPcRc45piF8YNZVchwsagxO626Kab7
0xTX7ZUgjuth6ahrkAJZkJNMgO4mYf928uYB6EVWaTb4iF0iT6glEOsSVAR/cssL
wYKbtgv3OP9t6fQcN/XL31hRkqr2MPk0y5Q27KyT4zo4lrx3xyah7Ndo3aEK/RcM
+6FUwqRiDsgDF/Ga65ylDvEp9eA03OFNHBn4DrORe3B9KV75NkmSJf/8QVEceNV/
9D072Hvt7/iyOq53AxWH3Jvp7aro/i0rvAHPbXZX4RVyqcJxaLYCQyrBFvQL0/Ie
B+793Iua8zbkQCbZ85LpTGrxAb5WydlSJp10AuHTr2MO+wS8bBqf96Jp9x1MuS9D
vqq7jbjwuZfnFUjpzu49GF6htG+WRVgY5TDzU6IAr3izXuqVxmz+zw5zExbGMmKm
2S5lb+q68DOBP4YaelAQHh97k0pYDW3eQ6GhDD2FA4P/DOr8p3vszZFBeaqaL70v
WS8z1wzOgEaleTRN4iRFlMCTvRLjOtDEaUNquohcHqJLe7w/DF7gzCU+0//8dVjD
cZieFX8uZDtzzcjrlYWeTo8oHd0q2WYixbCN8P92/YXUFIVpfdl85ugyr9KvkZxd
jKzpAy2LbSw=
=/Bfj
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fix from Gao Xiang:
- Fix page refcount leak when looking up specific inodes
introduced by metabuf reworking
* tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix refcount on the metabuf used for inode lookup
and a fix for erofs failure exit breakage (had been there since
way back).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZdrkZAAKCRBZ7Krx/gZQ
67D8AP0eM68yZvbThA/Hb5iElDh3Aogt1AW/QAu9/alkDVHr+wD+PKqhamC8WXGk
b1QZ5AOHQFwzkzdF4738fdbujquBWQE=
=Ra0D
-----END PGP SIGNATURE-----
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"A couple of fixes - revert of regression from this cycle and a fix for
erofs failure exit breakage (had been there since way back)"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
erofs: fix handling kern_mount() failure
Revert "get rid of DCACHE_GENOCIDE"
In erofs_find_target_block() when erofs_dirnamecmp() returns 0,
we do not assign the target metabuf. This causes the caller
erofs_namei()'s erofs_put_metabuf() at the end to be not effective
leaving the refcount on the page.
As the page from metabuf (buf->page) is never put, such page cannot be
migrated or reclaimed. Fix it now by putting the metabuf from
previous loop and assigning the current metabuf to target before
returning so caller erofs_namei() can do the final put as it was
intended.
Fixes: 500edd0956 ("erofs: use meta buffers for inode lookup")
Cc: <stable@vger.kernel.org> # 5.18+
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240221210348.3667795-1-dhavale@google.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
if you have a variable that holds NULL or a pointer to live struct mount,
do not shove ERR_PTR() into it - not if you later treat "not NULL" as
"holds a pointer to object".
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Even with inplace decompression, sometimes very few temporary buffers
may be still needed for a single decompression shot (e.g. 16 pages for
64k sliding window or 4 pages for 16k sliding window). In low-memory
scenarios, it would be better to try to allocate with GFP_NOWAIT on
readahead first. That can help reduce the time spent on page allocation
under durative memory pressure.
Here are detailed performance numbers under multi-app launch benchmark
workload [1] on ARM64 Android devices (8-core CPU and 8GB of memory)
running a 5.15 LTS kernel with EROFS of 4k pclusters:
+----------------------------------------------+
| LZ4 | vanilla | patched | diff |
|----------------+---------+---------+---------|
| Average (ms) | 3364 | 2684 | -20.21% | [64k sliding window]
|----------------+---------+---------+---------|
| Average (ms) | 2079 | 1610 | -22.56% | [16k sliding window]
+----------------------------------------------+
The total size of system images for 4k pclusters is almost unchanged:
(64k sliding window) 9,117,044 KB
(16k sliding window) 9,113,096 KB
Therefore, in addition to switch the sliding window from 64k to 16k,
after applying this patch, it can eventually save 52.14% (3364 -> 1610)
on average with no memory reservation. That is particularly useful for
embedded devices with limited resources.
[1] https://lore.kernel.org/r/20240109074143.4138783-1-guochunhai@vivo.com
Suggested-by: Gao Xiang <xiang@kernel.org>
Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Link: https://lore.kernel.org/r/20240126140142.201718-1-hsiangkao@linux.alibaba.com
I encountered a race issue after lengthy (~594647 secs) stress tests on
a 64k-page arm64 VM with several 4k-block EROFS images. The timing
is like below:
z_erofs_try_inplace_io z_erofs_fill_bio_vec
cmpxchg(&compressed_bvecs[].page,
NULL, ..)
[access bufvec]
compressed_bvecs[] = *bvec;
Previously, z_erofs_submit_queue() just accessed bufvec->page only, so
other fields in bufvec didn't matter. After the subpage block support
is landed, .offset and .end can be used too, but filling bufvec isn't
an atomic operation which can cause inconsistency.
Let's use a spinlock to keep the atomicity of each bufvec. More
specifically, just reuse the existing spinlock `pcl->obj.lockref.lock`
since it's rarely used (also it takes a short time if even used) as long
as the pcluster has a reference.
Fixes: 192351616a ("erofs: support I/O submission for sub-page compressed blocks")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Link: https://lore.kernel.org/r/20240125120039.3228103-1-hsiangkao@linux.alibaba.com
Clean up some leftovers since there is no way for EROFS to be called
again from a reclaim context.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240124031945.130782-1-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZabMrQAKCRCRxhvAZXjc
ovnUAQDgCOonb1tjtTvC8s8IMDUEoaVYZI91KVfsZQSJYN1sdQD+KfJmX1BhJnWG
l0cEffGfnWGXMZkZqDgLPHUIPzFrmws=
=1b3j
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8.netfs' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull netfs updates from Christian Brauner:
"This extends the netfs helper library that network filesystems can use
to replace their own implementations. Both afs and 9p are ported. cifs
is ready as well but the patches are way bigger and will be routed
separately once this is merged. That will remove lots of code as well.
The overal goal is to get high-level I/O and knowledge of the page
cache and ouf of the filesystem drivers. This includes knowledge about
the existence of pages and folios
The pull request converts afs and 9p. This removes about 800 lines of
code from afs and 300 from 9p. For 9p it is now possible to do writes
in larger than a page chunks. Additionally, multipage folio support
can be turned on for 9p. Separate patches exist for cifs removing
another 2000+ lines. I've included detailed information in the
individual pulls I took.
Summary:
- Add NFS-style (and Ceph-style) locking around DIO vs buffered I/O
calls to prevent these from happening at the same time.
- Support for direct and unbuffered I/O.
- Support for write-through caching in the page cache.
- O_*SYNC and RWF_*SYNC writes use write-through rather than writing
to the page cache and then flushing afterwards.
- Support for write-streaming.
- Support for write grouping.
- Skip reads for which the server could only return zeros or EOF.
- The fscache module is now part of the netfs library and the
corresponding maintainer entry is updated.
- Some helpers from the fscache subsystem are renamed to mark them as
belonging to the netfs library.
- Follow-up fixes for the netfs library.
- Follow-up fixes for the 9p conversion"
* tag 'vfs-6.8.netfs' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (50 commits)
netfs: Fix wrong #ifdef hiding wait
cachefiles: Fix signed/unsigned mixup
netfs: Fix the loop that unmarks folios after writing to the cache
netfs: Fix interaction between write-streaming and cachefiles culling
netfs: Count DIO writes
netfs: Mark netfs_unbuffered_write_iter_locked() static
netfs: Fix proc/fs/fscache symlink to point to "netfs" not "../netfs"
netfs: Rearrange netfs_io_subrequest to put request pointer first
9p: Use length of data written to the server in preference to error
9p: Do a couple of cleanups
9p: Fix initialisation of netfs_inode for 9p
cachefiles: Fix __cachefiles_prepare_write()
9p: Use netfslib read/write_iter
afs: Use the netfs write helpers
netfs: Export the netfs_sreq tracepoint
netfs: Optimise away reads above the point at which there can be no data
netfs: Implement a write-through caching option
netfs: Provide a launder_folio implementation
netfs: Provide a writepages implementation
netfs, cachefiles: Pass upper bound length to allow expansion
...
- Fix a "BUG: kernel NULL pointer dereference" issue due to
inconsistent on-disk indices of compressed inodes against
per-sb `available_compr_algs` generated by Syzkaller;
- Don't use certain unnecessary folio_*() helpers if the folio
type (page cache) is known.
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmWpO4cRHHhpYW5nQGtl
cm5lbC5vcmcACgkQUXZn5Zlu5qrJ7w//UpMasVxNpnZCsaWntDhp8AM9+wQZjosM
sc0B1sFjuISQuGfjVEpnlabSudzRRGKI/0R55M8/woa8fuSXJiRNou+bv9Ogi+Aa
CJ4E4+TSCGq98rjuuM9gb5L7V36pBp0PtxgANzKskHcq5w5JUNG6f6nhNQqnvRUG
M7hBvzzLLz3fRPZZFzdu5S8ekwuBrq8K/PBM7PFfDgbl5IZ0cjLXXIdx61MXTro9
FGGJSRbJsUYg6+sqb0YWmluW4CBiwe7crovp6IaPBU0744Ga+jGyTNrOWAGjW42e
7glsM5MClTfmv17LJK3jV1Dg8EPkKtrhpeTCdECnWnuAyLGKFOT4juNc68GzCieR
sSRR+WhmF/B2msAvyH4+gcaULCMAhLiVL1Yf1sfaxC1walEuyEM0EPWEHhAEGXjA
BpT6+EZBbYdh24hpyNSNWy/xGMHuiUFy7940yII0o/9cvEbMXNPtIHxA09mOH08X
1tWgLlsLJ69ApIFYD3TkP9yNj22HrxRCQByKvYEe9JsmxwqDayXUP5FQLv1NPNMm
ds36PDbNpxAM/cBnQcfPbZSODSWOCkLIHtmOvFP12tiixMG7yc4KY14Wuj3ZyHYr
T16BZLlcdobHPapSsxzEQqPTgAYBcvh+6PHXfwnLsoXSYQXoxaUQMX1JREnmC3+I
4nMpKIp3qpY=
=knvn
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-6.8-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
- Fix a "BUG: kernel NULL pointer dereference" issue due to
inconsistent on-disk indices of compressed inodes against
per-sb `available_compr_algs` generated by Syzkaller
- Don't use certain unnecessary folio_*() helpers if the folio
type (page cache) is known
* tag 'erofs-for-6.8-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: Don't use certain unnecessary folio_*() functions
erofs: fix inconsistent per-file compression format
EROFS can select compression algorithms on a per-file basis, and each
per-file compression algorithm needs to be marked in the on-disk
superblock for initialization.
However, syzkaller can generate inconsistent crafted images that use
an unsupported algorithmtype for specific inodes, e.g. use MicroLZMA
algorithmtype even it's not set in `sbi->available_compr_algs`. This
can lead to an unexpected "BUG: kernel NULL pointer dereference" if
the corresponding decompressor isn't built-in.
Fix this by checking against `sbi->available_compr_algs` for each
m_algorithmformat request. Incorrect !erofs_sb_has_compr_cfgs preset
bitmap is now fixed together since it was harmless previously.
Reported-by: <bugreport@ubisectech.com>
Fixes: 8f89926290 ("erofs: get compression algorithms directly on mapping")
Fixes: 622ceaddb7 ("erofs: lzma compression support")
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Link: https://lore.kernel.org/r/20240113150602.1471050-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Syzbot reported a KMSAN warning,
erofs: (device loop0): z_erofs_lz4_decompress_mem: failed to decompress -12 in[46, 4050] out[917]
=====================================================
BUG: KMSAN: uninit-value in hex_dump_to_buffer+0xae9/0x10f0 lib/hexdump.c:194
..
print_hex_dump+0x13d/0x3e0 lib/hexdump.c:276
z_erofs_lz4_decompress_mem fs/erofs/decompressor.c:252 [inline]
z_erofs_lz4_decompress+0x257e/0x2a70 fs/erofs/decompressor.c:311
z_erofs_decompress_pcluster fs/erofs/zdata.c:1290 [inline]
z_erofs_decompress_queue+0x338c/0x6460 fs/erofs/zdata.c:1372
z_erofs_runqueue+0x36cd/0x3830
z_erofs_read_folio+0x435/0x810 fs/erofs/zdata.c:1843
The root cause is that the printed decompressed buffer may be filled
incompletely due to decompression failure. Since they were once only
used for debugging, get rid of them now.
Reported-and-tested-by: syzbot+6c746eea496f34b3161d@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/000000000000321c24060d7cfa1c@google.com
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231227151903.2900413-1-hsiangkao@linux.alibaba.com
In order to reduce memory footprints even further, let's allow
partially filled compressed bvecs for readahead to bail out later.
Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231221062341.23901-1-zbestahu@gmail.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Let's just disable cached decompression and inplace I/Os for partial
pages as the first step in order to enable sub-page block initial
support. In other words, currently it works primarily based on
temporary short-lived pages. Don't expect too much in terms of
performance.
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231206091057.87027-6-hsiangkao@linux.alibaba.com
Sub-page block support is still unusable even with previous commits if
interlaced PLAIN pclusters exist. Such pclusters can be found if the
fragment feature is enabled.
This commit tries to handle "the head part" of interlaced PLAIN
pclusters first: it was once explained in commit fdffc091e6 ("erofs:
support interlaced uncompressed data for compressed files").
It uses a unique way for both shifted and interlaced PLAIN pclusters.
As an added bonus, PLAIN pclusters larger than the block size is also
supported now for the upcoming large lclusters.
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231206091057.87027-5-hsiangkao@linux.alibaba.com
Previously, the block size always equaled to PAGE_SIZE, therefore
`lclusterbits` couldn't be less than 12.
Since sub-page compressed blocks are now considered, `lobits` for
a lcluster in each pack cannot always be `lclusterbits` as before.
Otherwise, there is no enough room for the special value
`Z_EROFS_LI_D0_CBLKCNT`.
To support smaller block sizes, `lobits` for each compacted lcluster is
now calculated as:
lobits = max(lclusterbits, ilog2(Z_EROFS_LI_D0_CBLKCNT) + 1)
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231206091057.87027-4-hsiangkao@linux.alibaba.com
Currently, compressed sizes are recorded in pages using `pclusterpages`,
However, for tailpacking pclusters, `tailpacking_size` is used instead.
This approach doesn't work when dealing with sub-page blocks. To address
this, let's switch them to the unified `pclustersize` in bytes.
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231206091057.87027-3-hsiangkao@linux.alibaba.com
Add a basic I/O submission path first to support sub-page blocks:
- Temporary short-lived pages will be used entirely;
- In-place I/O pages can be used partially, but compressed pages need
to be able to be mapped in contiguous virtual memory.
As a start, currently cache decompression is explicitly disabled for
sub-page blocks, which will be supported in the future.
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231206091057.87027-2-hsiangkao@linux.alibaba.com
Currently EROFS can map another compressed buffer for inplace
decompression, that was used to handle the cases that some pages of
compressed data are actually not in-place I/O.
However, like most simple LZ77 algorithms, LZ4 expects the compressed
data is arranged at the end of the decompressed buffer and it
explicitly uses memmove() to handle overlapping:
__________________________________________________________
|_ direction of decompression --> ____ |_ compressed data _|
Although EROFS arranges compressed data like this, it typically maps two
individual virtual buffers so the relative order is uncertain.
Previously, it was hardly observed since LZ4 only uses memmove() for
short overlapped literals and x86/arm64 memmove implementations seem to
completely cover it up and they don't have this issue. Juhyung reported
that EROFS data corruption can be found on a new Intel x86 processor.
After some analysis, it seems that recent x86 processors with the new
FSRM feature expose this issue with "rep movsb".
Let's strictly use the decompressed buffer for lz4 inplace
decompression for now. Later, as an useful improvement, we could try
to tie up these two buffers together in the correct order.
Reported-and-tested-by: Juhyung Park <qkrwngud825@gmail.com>
Closes: https://lore.kernel.org/r/CAD14+f2AVKf8Fa2OO1aAUdDNTDsVzzR6ctU_oJSmTyd6zSYR2Q@mail.gmail.com
Fixes: 0ffd71bcc3 ("staging: erofs: introduce LZ4 decompression inplace")
Fixes: 598162d050 ("erofs: support decompress big pcluster for lz4 backend")
Cc: stable <stable@vger.kernel.org> # 5.4+
Tested-by: Yifan Zhao <zhaoyifan@sjtu.edu.cn>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231206045534.3920847-1-hsiangkao@linux.alibaba.com
Both MicroLZMA and DEFLATE algorithms can use short-lived pages on
demand for the overlapped inplace I/O decompression.
However, those short-lived pages are actually added to
`be->compressed_pages`. Thus, it should be checked instead of
`pcl->compressed_bvecs`.
The LZ4 algorithm doesn't work like this, so it won't be impacted.
Fixes: 67139e36d9 ("erofs: introduce `z_erofs_parse_in_bvecs'")
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231128180431.4116991-1-hsiangkao@linux.alibaba.com
After commit 1c7f49a767 ("erofs: tidy up EROFS on-disk naming"),
there is a unique `union erofs_inode_i_u` so that we could parse
the union directly.
Besides, it also replaces `inode->i_sb` with `sb` for simplicity.
Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20231109111822.17944-1-mengferry@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZUpEaAAKCRCRxhvAZXjc
ounBAQCAoS66gnOZ+k4kOWwB2zZ1Ueh3dPFC7IcEZ+pwFS8hpAEAxUQxV0TSWf5l
W/1oKRtAJyuSYvehHeMUSJmHVBiM8w4=
=bNm0
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.fsid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fanotify fsid updates from Christian Brauner:
"This work is part of the plan to enable fanotify to serve as a drop-in
replacement for inotify. While inotify is availabe on all filesystems,
fanotify currently isn't.
In order to support fanotify on all filesystems two things are needed:
(1) all filesystems need to support AT_HANDLE_FID
(2) all filesystems need to report a non-zero f_fsid
This contains (1) and allows filesystems to encode non-decodable file
handlers for fanotify without implementing any exportfs operations by
encoding a file id of type FILEID_INO64_GEN from i_ino and
i_generation.
Filesystems that want to opt out of encoding non-decodable file ids
for fanotify that don't support NFS export can do so by providing an
empty export_operations struct.
This also partially addresses (2) by generating f_fsid for simple
filesystems as well as freevxfs. Remaining filesystems will be dealt
with by separate patches.
Finally, this contains the patch from the current exportfs maintainers
which moves exportfs under vfs with Chuck, Jeff, and Amir as
maintainers and vfs.git as tree"
* tag 'vfs-6.7.fsid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
MAINTAINERS: create an entry for exportfs
fs: fix build error with CONFIG_EXPORTFS=m or not defined
freevxfs: derive f_fsid from bdev->bd_dev
fs: report f_fsid from s_dev for "simple" filesystems
exportfs: support encoding non-decodeable file handles by default
exportfs: define FILEID_INO64_GEN* file handle types
exportfs: make ->encode_fh() a mandatory method for NFS export
exportfs: add helpers to check if filesystem can encode/decode file handles
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series "Fixes and cleanups to compaction".
- Joel Fernandes has a patchset ("Optimize mremap during mutual
alignment within PMD") which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested.
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i the
following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series "Do not try to access unaccepted memory" Adrian Hunter
provides some fixups for the recently-added "unaccepted memory' feature.
To increase the feature's checking coverage. "Plug a few gaps where
RAM is exposed without checking if it is unaccepted memory".
- In the series "cleanups for lockless slab shrink" Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code.
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series "use refcount+RCU method to implement
lockless slab shrink".
- David Hildenbrand contributes some maintenance work for the rmap code
in the series "Anon rmap cleanups".
- Kefeng Wang does more folio conversions and some maintenance work in
the migration code. Series "mm: migrate: more folio conversion and
unification".
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series "Add and use bdev_getblk()".
- In the series "Use nth_page() in place of direct struct page
manipulation" Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames.
- In the series "mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO" has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of gigantic
pages are in use.
- Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
rationalization and folio conversions in the hugetlb code.
- Yin Fengwei has improved mlock()'s handling of large folios in the
series "support large folio for mlock"
- In the series "Expose swapcache stat for memcg v1" Liu Shixin has
added statistics for memcg v1 users which are available (and useful)
under memcg v2.
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named "MDWE
without inheritance".
- Kefeng Wang has provided the series "mm: convert numa balancing
functions to use a folio" which does what it says.
- In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
makes is possible for a process to propagate KSM treatment across
exec().
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use "high
bandwidth memory" in addition to Optane Data Center Persistent Memory
Modules (DCPMM). The series is named "memory tiering: calculate
abstract distance based on ACPI HMAT"
- In the series "Smart scanning mode for KSM" Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans.
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
series "mm: memcg: fix tracking of pending stats updates values".
- In the series "Implement IOCTL to get and optionally clear info about
PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
us to atomically read-then-clear page softdirty state. This is mainly
used by CRIU.
- Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
- a bunch of relatively minor maintenance tweaks to this code.
- Matthew Wilcox has increased the use of the VMA lock over file-backed
page faults in the series "Handle more faults under the VMA lock". Some
rationalizations of the fault path became possible as a result.
- In the series "mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
and folio conversions.
- In the series "various improvements to the GUP interface" Lorenzo
Stoakes has simplified and improved the GUP interface with an eye to
providing groundwork for future improvements.
- Andrey Konovalov has sent along the series "kasan: assorted fixes and
improvements" which does those things.
- Some page allocator maintenance work from Kemeng Shi in the series
"Two minor cleanups to break_down_buddy_pages".
- In thes series "New selftest for mm" Breno Leitao has developed
another MM self test which tickles a race we had between madvise() and
page faults.
- In the series "Add folio_end_read" Matthew Wilcox provides cleanups
and an optimization to the core pagecache code.
- Nhat Pham has added memcg accounting for hugetlb memory in the series
"hugetlb memcg accounting".
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series "Abstract vma_merge() and split_vma()".
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series "Fix page_owner's use of free timestamps".
- Lorenzo Stoakes has fixed the handling of new mappings of sealed files
in the series "permit write-sealed memfd read-only shared mappings".
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series "Batch hugetlb vmemmap modification operations".
- Some buffer_head folio conversions and cleanups from Matthew Wilcox in
the series "Finish the create_empty_buffers() transition".
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the series
"mm: PCP high auto-tuning".
- Roman Gushchin has contributed the patchset "mm: improve performance
of accounted kernel memory allocations" which improves their performance
by ~30% as measured by a micro-benchmark.
- folio conversions from Kefeng Wang in the series "mm: convert page
cpupid functions to folios".
- Some kmemleak fixups in Liu Shixin's series "Some bugfix about
kmemleak".
- Qi Zheng has improved our handling of memoryless nodes by keeping them
off the allocation fallback list. This is done in the series "handle
memoryless nodes more appropriately".
- khugepaged conversions from Vishal Moola in the series "Some
khugepaged folio conversions".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA
jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y
FgeUPAD1oasg6CP+INZvCj34waNxwAc=
=E+Y4
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series 'Fixes and cleanups to compaction'
- Joel Fernandes has a patchset ('Optimize mremap during mutual
alignment within PMD') which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i
the following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series 'Do not try to access unaccepted memory' Adrian
Hunter provides some fixups for the recently-added 'unaccepted
memory' feature. To increase the feature's checking coverage. 'Plug
a few gaps where RAM is exposed without checking if it is
unaccepted memory'
- In the series 'cleanups for lockless slab shrink' Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series 'use refcount+RCU method to
implement lockless slab shrink'
- David Hildenbrand contributes some maintenance work for the rmap
code in the series 'Anon rmap cleanups'
- Kefeng Wang does more folio conversions and some maintenance work
in the migration code. Series 'mm: migrate: more folio conversion
and unification'
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series 'Add and use bdev_getblk()'
- In the series 'Use nth_page() in place of direct struct page
manipulation' Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames
- In the series 'mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO' has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of
gigantic pages are in use
- Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
rationalization and folio conversions in the hugetlb code
- Yin Fengwei has improved mlock()'s handling of large folios in the
series 'support large folio for mlock'
- In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
added statistics for memcg v1 users which are available (and
useful) under memcg v2
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named 'MDWE
without inheritance'
- Kefeng Wang has provided the series 'mm: convert numa balancing
functions to use a folio' which does what it says
- In the series 'mm/ksm: add fork-exec support for prctl' Stefan
Roesch makes is possible for a process to propagate KSM treatment
across exec()
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use 'high
bandwidth memory' in addition to Optane Data Center Persistent
Memory Modules (DCPMM). The series is named 'memory tiering:
calculate abstract distance based on ACPI HMAT'
- In the series 'Smart scanning mode for KSM' Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in
the series 'mm: memcg: fix tracking of pending stats updates
values'
- In the series 'Implement IOCTL to get and optionally clear info
about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
which permits us to atomically read-then-clear page softdirty
state. This is mainly used by CRIU
- Hugh Dickins contributed the series 'shmem,tmpfs: general
maintenance', a bunch of relatively minor maintenance tweaks to
this code
- Matthew Wilcox has increased the use of the VMA lock over
file-backed page faults in the series 'Handle more faults under the
VMA lock'. Some rationalizations of the fault path became possible
as a result
- In the series 'mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()' David Hildenbrand has implemented some
cleanups and folio conversions
- In the series 'various improvements to the GUP interface' Lorenzo
Stoakes has simplified and improved the GUP interface with an eye
to providing groundwork for future improvements
- Andrey Konovalov has sent along the series 'kasan: assorted fixes
and improvements' which does those things
- Some page allocator maintenance work from Kemeng Shi in the series
'Two minor cleanups to break_down_buddy_pages'
- In thes series 'New selftest for mm' Breno Leitao has developed
another MM self test which tickles a race we had between madvise()
and page faults
- In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
and an optimization to the core pagecache code
- Nhat Pham has added memcg accounting for hugetlb memory in the
series 'hugetlb memcg accounting'
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series 'Abstract vma_merge() and split_vma()'
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series 'Fix page_owner's use of free timestamps'
- Lorenzo Stoakes has fixed the handling of new mappings of sealed
files in the series 'permit write-sealed memfd read-only shared
mappings'
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series 'Batch hugetlb vmemmap modification operations'
- Some buffer_head folio conversions and cleanups from Matthew Wilcox
in the series 'Finish the create_empty_buffers() transition'
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the
series 'mm: PCP high auto-tuning'
- Roman Gushchin has contributed the patchset 'mm: improve
performance of accounted kernel memory allocations' which improves
their performance by ~30% as measured by a micro-benchmark
- folio conversions from Kefeng Wang in the series 'mm: convert page
cpupid functions to folios'
- Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
kmemleak'
- Qi Zheng has improved our handling of memoryless nodes by keeping
them off the allocation fallback list. This is done in the series
'handle memoryless nodes more appropriately'
- khugepaged conversions from Vishal Moola in the series 'Some
khugepaged folio conversions'"
[ bcachefs conflicts with the dynamically allocated shrinkers have been
resolved as per Stephen Rothwell in
https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/
with help from Qi Zheng.
The clone3 test filtering conflict was half-arsed by yours truly ]
* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
mm/damon/sysfs: update monitoring target regions for online input commit
mm/damon/sysfs: remove requested targets when online-commit inputs
selftests: add a sanity check for zswap
Documentation: maple_tree: fix word spelling error
mm/vmalloc: fix the unchecked dereference warning in vread_iter()
zswap: export compression failure stats
Documentation: ubsan: drop "the" from article title
mempolicy: migration attempt to match interleave nodes
mempolicy: mmap_lock is not needed while migrating folios
mempolicy: alloc_pages_mpol() for NUMA policy without vma
mm: add page_rmappable_folio() wrapper
mempolicy: remove confusing MPOL_MF_LAZY dead code
mempolicy: mpol_shared_policy_init() without pseudo-vma
mempolicy trivia: use pgoff_t in shared mempolicy tree
mempolicy trivia: slightly more consistent naming
mempolicy trivia: delete those ancient pr_debug()s
mempolicy: fix migrate_pages(2) syscall return nr_failed
kernfs: drop shared NUMA mempolicy hooks
hugetlbfs: drop shared NUMA mempolicy pretence
mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
...
- Fix inode metadata space layout documentation;
- Avoid warning MicroLZMA format anymore;
- Fix erofs_insert_workgroup() lockref usage;
- Some cleanups.
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmVBMVoRHHhpYW5nQGtl
cm5lbC5vcmcACgkQUXZn5Zlu5qq8iA/+NdvaKPFvexKuJkW2n9t4UaDEPqS6qobk
B1uIaXxDhz6RdhDyhLA7ysQQi1Z8g0t/MXgCWZGM1NXJ995Y0VE+dFlT+a5QPLiu
4rz8bnZpdUHVF0jIyoru/nPZQWXsYBh2AeYLKPi+nvQoCRtoRsvdKXxv76iEF45o
zE2548TXqogL3Q04oH040fRzVwYROqnF85q9uCPNFS/Sh/swxyob9waBijlzbECS
GLjyJ+ZsQBBG95SzwNEPSFek5+od9ty2UKQwDJfqABOhbQT2Li6r2t0afyrlqDQY
m7VvvmzkFe5uYwAkuMkAOxhnW+eBzxz/whk+8W3dw/lm4dDbe0LuCrGXVdivdf3F
dO5ywSEGJnOH1FqNF1ubAWdzJ+qoaTupuOpwrGM9/lHGgxCIVjW06XzKfOv/AfaQ
VR2SPTrcA6eulXsLGopVho9dtHBUQZIAmJjQcWIyCpYm15NmElDXsybN/Os2Hgo0
d2Tsx9fgsN25xRsIBPxiV6qsEXqx56Nce1U4k0VV4ali81Y22nqq6rV4AFWz8KNJ
LhYFcL5YgO5RxYEpoQGkt4hyeDkxt4lc/cYqCr8E8kpNEuz0P3O27NcLnSP7CJVQ
pBYfSoV5tm+SyK4toeacG+HJTnohNyLE4vIXN96awoCBfJiickY7ZxJNuf6wRbOA
4Y+aMq/lQXQ=
=Yum+
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"Nothing exciting lands for this cycle, since we're still busying in
developing support for sub-page blocks and large-folios of compressed
data for new scenarios on Android.
In this cycle, MicroLZMA format is marked as stable, and there are
minor cleanups around documentation and codebase. In addition, it also
fixes incorrect lockref usage in erofs_insert_workgroup().
Summary:
- Fix inode metadata space layout documentation
- Avoid warning for MicroLZMA format anymore
- Fix erofs_insert_workgroup() lockref usage
- Some cleanups"
* tag 'erofs-for-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix erofs_insert_workgroup() lockref usage
erofs: tidy up redundant includes
erofs: get rid of ROOT_NID()
erofs: simplify compression configuration parser
erofs: don't warn MicroLZMA format anymore
erofs: fix inode metadata space layout description in documentation
As Linus pointed out [1], lockref_put_return() is fundamentally
designed to be something that can fail. It behaves as a fastpath-only
thing, and the failure case needs to be handled anyway.
Actually, since the new pcluster was just allocated without being
populated, it won't be accessed by others until it is inserted into
XArray, so lockref helpers are actually unneeded here.
Let's just set the proper reference count on initializing.
[1] https://lore.kernel.org/r/CAHk-=whCga8BeQnJ3ZBh_Hfm9ctba_wpF444LpwRybVNMzO6Dw@mail.gmail.com
Fixes: 7674a42f35 ("erofs: use struct lockref to replace handcrafted approach")
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20231031060524.1103921-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Move erofs_load_compr_cfgs() into decompressor.c as well as introduce
a callback instead of a hard-coded switch for each algorithm for
simplicity.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231022130957.11398-1-xiang@kernel.org
The LZMA algorithm support has been landed for more than one year since
Linux 5.16. Besides, the new XZ Utils 5.4 has been available in most
Linux distributions.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231021020137.1646959-1-hsiangkao@linux.alibaba.com
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppYgAKCRCRxhvAZXjc
okIHAP9anLz1QDyMLH12ASuHjgBc0Of3jcB6NB97IWGpL4O21gEA46ohaD+vcJuC
YkBLU3lXqQ87nfu28ExFAzh10hG2jwM=
=m4pB
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode time accessor updates from Christian Brauner:
"This finishes the conversion of all inode time fields to accessor
functions as discussed on list. Changing timestamps manually as we
used to do before is error prone. Using accessors function makes this
robust.
It does not contain the switch of the time fields to discrete 64 bit
integers to replace struct timespec and free up space in struct inode.
But after this, the switch can be trivially made and the patch should
only affect the vfs if we decide to do it"
* tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits)
fs: rename inode i_atime and i_mtime fields
security: convert to new timestamp accessors
selinux: convert to new timestamp accessors
apparmor: convert to new timestamp accessors
sunrpc: convert to new timestamp accessors
mm: convert to new timestamp accessors
bpf: convert to new timestamp accessors
ipc: convert to new timestamp accessors
linux: convert to new timestamp accessors
zonefs: convert to new timestamp accessors
xfs: convert to new timestamp accessors
vboxsf: convert to new timestamp accessors
ufs: convert to new timestamp accessors
udf: convert to new timestamp accessors
ubifs: convert to new timestamp accessors
tracefs: convert to new timestamp accessors
sysv: convert to new timestamp accessors
squashfs: convert to new timestamp accessors
server: convert to new timestamp accessors
client: convert to new timestamp accessors
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppWAAKCRCRxhvAZXjc
okB2AP4jjoRErJBwj245OIDJqzoj4m4UVOVd0MH2AkiSpANczwD/TToChdpusY2y
qAYg1fQoGMbDVlb7Txaj9qI9ieCf9w0=
=2PXg
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs xattr updates from Christian Brauner:
"The 's_xattr' field of 'struct super_block' currently requires a
mutable table of 'struct xattr_handler' entries (although each handler
itself is const). However, no code in vfs actually modifies the
tables.
This changes the type of 's_xattr' to allow const tables, and modifies
existing file systems to move their tables to .rodata. This is
desirable because these tables contain entries with function pointers
in them; moving them to .rodata makes it considerably less likely to
be modified accidentally or maliciously at runtime"
* tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
const_structs.checkpatch: add xattr_handler
net: move sockfs_xattr_handlers to .rodata
shmem: move shmem_xattr_handlers to .rodata
overlayfs: move xattr tables to .rodata
xfs: move xfs_xattr_handlers to .rodata
ubifs: move ubifs_xattr_handlers to .rodata
squashfs: move squashfs_xattr_handlers to .rodata
smb: move cifs_xattr_handlers to .rodata
reiserfs: move reiserfs_xattr_handlers to .rodata
orangefs: move orangefs_xattr_handlers to .rodata
ocfs2: move ocfs2_xattr_handlers and ocfs2_xattr_handler_map to .rodata
ntfs3: move ntfs_xattr_handlers to .rodata
nfs: move nfs4_xattr_handlers to .rodata
kernfs: move kernfs_xattr_handlers to .rodata
jfs: move jfs_xattr_handlers to .rodata
jffs2: move jffs2_xattr_handlers to .rodata
hfsplus: move hfsplus_xattr_handlers to .rodata
hfs: move hfs_xattr_handlers to .rodata
gfs2: move gfs2_xattr_handlers_max to .rodata
fuse: move fuse_xattr_handlers to .rodata
...
Rename the default helper for encoding FILEID_INO32_GEN* file handles to
generic_encode_ino32_fh() and convert the filesystems that used the
default implementation to use the generic helper explicitly.
After this change, exportfs_encode_inode_fh() no longer has a default
implementation to encode FILEID_INO32_GEN* file handles.
This is a step towards allowing filesystems to encode non-decodeable
file handles for fanotify without having to implement any
export_operations.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231023180801.2953446-3-amir73il@gmail.com
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Convert erofs to use bdev_open_by_path() and pass the handle around.
CC: Gao Xiang <xiang@kernel.org>
CC: Chao Yu <chao@kernel.org>
CC: linux-erofs@lists.ozlabs.org
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-21-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
This makes it harder for accidental or malicious changes to
erofs_xattr_handlers or xattr_handler_map at runtime.
Cc: Gao Xiang <xiang@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Yue Hu <huyue2@coolpad.com>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: linux-erofs@lists.ozlabs.org
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Link: https://lore.kernel.org/r/20230930050033.41174-9-wedsonaf@gmail.com
Acked-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When stressing microLZMA EROFS images with the new global compressed
deduplication feature enabled (`-Ededupe`), I found some short-lived
temporary pages weren't properly released, which could slowly cause
unexpected OOMs hours later.
Let's fix it now (LZ4 and DEFLATE don't have this issue.)
Fixes: 5c2a64252c ("erofs: introduce partial-referenced pclusters")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230907050542.97152-1-hsiangkao@linux.alibaba.com
- Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
reduces the special-case code for handling hugetlb pages in GUP. It
also speeds up GUP handling of transparent hugepages.
- Peng Zhang provides some maple tree speedups ("Optimize the fast path
of mas_store()").
- Sergey Senozhatsky has improved te performance of zsmalloc during
compaction (zsmalloc: small compaction improvements").
- Domenico Cerasuolo has developed additional selftest code for zswap
("selftests: cgroup: add zswap test program").
- xu xin has doe some work on KSM's handling of zero pages. These
changes are mainly to enable the user to better understand the
effectiveness of KSM's treatment of zero pages ("ksm: support tracking
KSM-placed zero-pages").
- Jeff Xu has fixes the behaviour of memfd's
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").
- David Howells has fixed an fscache optimization ("mm, netfs, fscache:
Stop read optimisation when folio removed from pagecache").
- Axel Rasmussen has given userfaultfd the ability to simulate memory
poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD").
- Miaohe Lin has contributed some routine maintenance work on the
memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
check").
- Peng Zhang has contributed some maintenance work on the maple tree
code ("Improve the validation for maple tree and some cleanup").
- Hugh Dickins has optimized the collapsing of shmem or file pages into
THPs ("mm: free retracted page table by RCU").
- Jiaqi Yan has a patch series which permits us to use the healthy
subpages within a hardware poisoned huge page for general purposes
("Improve hugetlbfs read on HWPOISON hugepages").
- Kemeng Shi has done some maintenance work on the pagetable-check code
("Remove unused parameters in page_table_check").
- More folioification work from Matthew Wilcox ("More filesystem folio
conversions for 6.6"), ("Followup folio conversions for zswap"). And
from ZhangPeng ("Convert several functions in page_io.c to use a
folio").
- page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").
- Baoquan He has converted some architectures to use the GENERIC_IOREMAP
ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take
GENERIC_IOREMAP way").
- Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration").
- Better maple tree lockdep checking from Liam Howlett ("More strict
maple tree lockdep"). Liam also developed some efficiency improvements
("Reduce preallocations for maple tree").
- Cleanup and optimization to the secondary IOMMU TLB invalidation, from
Alistair Popple ("Invalidate secondary IOMMU TLB on permission
upgrade").
- Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
for arm64").
- Kemeng Shi provides some maintenance work on the compaction code ("Two
minor cleanups for compaction").
- Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most
file-backed faults under the VMA lock").
- Aneesh Kumar contributes code to use the vmemmap optimization for DAX
on ppc64, under some circumstances ("Add support for DAX vmemmap
optimization for ppc64").
- page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
data in page_ext"), ("minor cleanups to page_ext header").
- Some zswap cleanups from Johannes Weiner ("mm: zswap: three
cleanups").
- kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").
- VMA handling cleanups from Kefeng Wang ("mm: convert to
vma_is_initial_heap/stack()").
- DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
address ranges and DAMON monitoring targets").
- Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").
- Liam Howlett has improved the maple tree node replacement code
("maple_tree: Change replacement strategy").
- ZhangPeng has a general code cleanup - use the K() macro more widely
("cleanup with helper macro K()").
- Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap
on memory feature on ppc64").
- pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
in page_alloc"), ("Two minor cleanups for get pageblock migratetype").
- Vishal Moola introduces a memory descriptor for page table tracking,
"struct ptdesc" ("Split ptdesc from struct page").
- memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
for vm.memfd_noexec").
- MM include file rationalization from Hugh Dickins ("arch: include
asm/cacheflush.h in asm/hugetlb.h").
- THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
output").
- kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
object_cache instead of kmemleak_initialized").
- More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
and _folio_order").
- A VMA locking scalability improvement from Suren Baghdasaryan
("Per-VMA lock support for swap and userfaults").
- pagetable handling cleanups from Matthew Wilcox ("New page table range
API").
- A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
using page->private on tail pages for THP_SWAP + cleanups").
- Cleanups and speedups to the hugetlb fault handling from Matthew
Wilcox ("Change calling convention for ->huge_fault").
- Matthew Wilcox has also done some maintenance work on the MM subsystem
documentation ("Improve mm documentation").
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO1JUQAKCRDdBJ7gKXxA
jrMwAP47r/fS8vAVT3zp/7fXmxaJYTK27CTAM881Gw1SDhFM/wEAv8o84mDenCg6
Nfio7afS1ncD+hPYT8947UnLxTgn+ww=
=Afws
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Some swap cleanups from Ma Wupeng ("fix WARN_ON in
add_to_avail_list")
- Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
reduces the special-case code for handling hugetlb pages in GUP. It
also speeds up GUP handling of transparent hugepages.
- Peng Zhang provides some maple tree speedups ("Optimize the fast path
of mas_store()").
- Sergey Senozhatsky has improved te performance of zsmalloc during
compaction (zsmalloc: small compaction improvements").
- Domenico Cerasuolo has developed additional selftest code for zswap
("selftests: cgroup: add zswap test program").
- xu xin has doe some work on KSM's handling of zero pages. These
changes are mainly to enable the user to better understand the
effectiveness of KSM's treatment of zero pages ("ksm: support
tracking KSM-placed zero-pages").
- Jeff Xu has fixes the behaviour of memfd's
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").
- David Howells has fixed an fscache optimization ("mm, netfs, fscache:
Stop read optimisation when folio removed from pagecache").
- Axel Rasmussen has given userfaultfd the ability to simulate memory
poisoning ("add UFFDIO_POISON to simulate memory poisoning with
UFFD").
- Miaohe Lin has contributed some routine maintenance work on the
memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
check").
- Peng Zhang has contributed some maintenance work on the maple tree
code ("Improve the validation for maple tree and some cleanup").
- Hugh Dickins has optimized the collapsing of shmem or file pages into
THPs ("mm: free retracted page table by RCU").
- Jiaqi Yan has a patch series which permits us to use the healthy
subpages within a hardware poisoned huge page for general purposes
("Improve hugetlbfs read on HWPOISON hugepages").
- Kemeng Shi has done some maintenance work on the pagetable-check code
("Remove unused parameters in page_table_check").
- More folioification work from Matthew Wilcox ("More filesystem folio
conversions for 6.6"), ("Followup folio conversions for zswap"). And
from ZhangPeng ("Convert several functions in page_io.c to use a
folio").
- page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").
- Baoquan He has converted some architectures to use the
GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
architectures to take GENERIC_IOREMAP way").
- Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration").
- Better maple tree lockdep checking from Liam Howlett ("More strict
maple tree lockdep"). Liam also developed some efficiency
improvements ("Reduce preallocations for maple tree").
- Cleanup and optimization to the secondary IOMMU TLB invalidation,
from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
upgrade").
- Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
for arm64").
- Kemeng Shi provides some maintenance work on the compaction code
("Two minor cleanups for compaction").
- Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
most file-backed faults under the VMA lock").
- Aneesh Kumar contributes code to use the vmemmap optimization for DAX
on ppc64, under some circumstances ("Add support for DAX vmemmap
optimization for ppc64").
- page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
data in page_ext"), ("minor cleanups to page_ext header").
- Some zswap cleanups from Johannes Weiner ("mm: zswap: three
cleanups").
- kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").
- VMA handling cleanups from Kefeng Wang ("mm: convert to
vma_is_initial_heap/stack()").
- DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
address ranges and DAMON monitoring targets").
- Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").
- Liam Howlett has improved the maple tree node replacement code
("maple_tree: Change replacement strategy").
- ZhangPeng has a general code cleanup - use the K() macro more widely
("cleanup with helper macro K()").
- Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
memmap on memory feature on ppc64").
- pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
in page_alloc"), ("Two minor cleanups for get pageblock
migratetype").
- Vishal Moola introduces a memory descriptor for page table tracking,
"struct ptdesc" ("Split ptdesc from struct page").
- memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
for vm.memfd_noexec").
- MM include file rationalization from Hugh Dickins ("arch: include
asm/cacheflush.h in asm/hugetlb.h").
- THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
output").
- kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
object_cache instead of kmemleak_initialized").
- More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
and _folio_order").
- A VMA locking scalability improvement from Suren Baghdasaryan
("Per-VMA lock support for swap and userfaults").
- pagetable handling cleanups from Matthew Wilcox ("New page table
range API").
- A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
using page->private on tail pages for THP_SWAP + cleanups").
- Cleanups and speedups to the hugetlb fault handling from Matthew
Wilcox ("Change calling convention for ->huge_fault").
- Matthew Wilcox has also done some maintenance work on the MM
subsystem documentation ("Improve mm documentation").
* tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
maple_tree: shrink struct maple_tree
maple_tree: clean up mas_wr_append()
secretmem: convert page_is_secretmem() to folio_is_secretmem()
nios2: fix flush_dcache_page() for usage from irq context
hugetlb: add documentation for vma_kernel_pagesize()
mm: add orphaned kernel-doc to the rst files.
mm: fix clean_record_shared_mapping_range kernel-doc
mm: fix get_mctgt_type() kernel-doc
mm: fix kernel-doc warning from tlb_flush_rmaps()
mm: remove enum page_entry_size
mm: allow ->huge_fault() to be called without the mmap_lock held
mm: move PMD_ORDER to pgtable.h
mm: remove checks for pte_index
memcg: remove duplication detection for mem_cgroup_uncharge_swap
mm/huge_memory: work on folio->swap instead of page->private when splitting folio
mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
mm/swap: use dedicated entry for swap in folio
mm/swap: stop using page->private on tail pages for THP_SWAP
selftests/mm: fix WARNING comparing pointer to 0
selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
...
- Support xattr bloom filter to optimize negative xattr lookups;
- Support DEFLATE compression algorithm as an alternative;
- Fix a regression that ztailpacking pclusters don't release properly;
- Avoid warning dedupe and fragments features anymore;
- Some folio conversions and cleanups.
-----BEGIN PGP SIGNATURE-----
iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCZOvhIBEceGlhbmdAa2Vy
bmVsLm9yZwAKCRA5NzHcH7XmBFgqAP4/gcxH5vhgxMunxmgBkSxMFBQf/W7CfOiN
QkGHjSKl8gEA78EBwAJ3vDJ1JgQRTb9/9UBrtW7n2hzj/eVS/LIyYQI=
=o3Bx
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"In this cycle, a xattr bloom filter feature is introduced to speed up
negative xattr lookups, which was originally suggested by Alexander
for Composefs use cases.
Additionally, the DEFLATE algorithm is now supported, which can be
used together with hardware accelerators for our cloud workloads. Each
supported compression algorithm can be selected on a per-file basis
for specific access patterns too.
There are also some random fixes and cleanups as usual:
- Support xattr bloom filter to optimize negative xattr lookups
- Support DEFLATE compression algorithm as an alternative
- Fix a regression that ztailpacking pclusters don't release properly
- Avoid warning dedupe and fragments features anymore
- Some folio conversions and cleanups"
* tag 'erofs-for-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: release ztailpacking pclusters properly
erofs: don't warn dedupe and fragments features anymore
erofs: adapt folios for z_erofs_read_folio()
erofs: adapt folios for z_erofs_readahead()
erofs: get rid of fe->backmost for cache decompression
erofs: drop z_erofs_page_mark_eio()
erofs: tidy up z_erofs_do_read_page()
erofs: move preparation logic into z_erofs_pcluster_begin()
erofs: avoid obsolete {collector,collection} terms
erofs: simplify z_erofs_read_fragment()
erofs: remove redundant erofs_fs_type declaration in super.c
erofs: add necessary kmem_cache_create flags for erofs inode cache
erofs: clean up redundant comment and adjust code alignment
erofs: refine warning messages for zdata I/Os
erofs: boost negative xattr lookup with bloom filter
erofs: update on-disk format for xattr name filter
erofs: DEFLATE compression support
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTKAAKCRCRxhvAZXjc
oifJAQCzi/p+AdQu8LA/0XvR7fTwaq64ZDCibU4BISuLGT2kEgEAuGbuoFZa0rs2
XYD/s4+gi64p9Z01MmXm2XO1pu3GPg0=
=eJz5
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs timestamp updates from Christian Brauner:
"This adds VFS support for multi-grain timestamps and converts tmpfs,
xfs, ext4, and btrfs to use them. This carries acks from all relevant
filesystems.
The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot of metadata updates, down to around 1 per
jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache.
Even with NFSv4, a lot of exported filesystems don't properly support
a change attribute and are subject to the same problems with timestamp
granularity. Other applications have similar issues with timestamps
(e.g., backup applications).
If we were to always use fine-grained timestamps, that would improve
the situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.
This introduces fine-grained timestamps that are used when they are
actively queried.
This uses the 31st bit of the ctime tv_nsec field to indicate that
something has queried the inode for the mtime or ctime. When this flag
is set, on the next mtime or ctime update, the kernel will fetch a
fine-grained timestamp instead of the usual coarse-grained one.
As POSIX generally mandates that when the mtime changes, the ctime
must also change the kernel always stores normalized ctime values, so
only the first 30 bits of the tv_nsec field are ever used.
Filesytems can opt into this behavior by setting the FS_MGTIME flag in
the fstype. Filesystems that don't set this flag will continue to use
coarse-grained timestamps.
Various preparatory changes, fixes and cleanups are included:
- Fixup all relevant places where POSIX requires updating ctime
together with mtime. This is a wide-range of places and all
maintainers provided necessary Acks.
- Add new accessors for inode->i_ctime directly and change all
callers to rely on them. Plain accesses to inode->i_ctime are now
gone and it is accordingly rename to inode->__i_ctime and commented
as requiring accessors.
- Extend generic_fillattr() to pass in a request mask mirroring in a
sense the statx() uapi. This allows callers to pass in a request
mask to only get a subset of attributes filled in.
- Rework timestamp updates so it's possible to drop the @now
parameter the update_time() inode operation and associated helpers.
- Add inode_update_timestamps() and convert all filesystems to it
removing a bunch of open-coding"
* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
tmpfs: add support for multigrain timestamps
fs: add infrastructure for multigrain timestamps
fs: drop the timespec64 argument from update_time
xfs: have xfs_vn_update_time gets its own timestamp
fat: make fat_update_time get its own timestamp
fat: remove i_version handling from fat_update_time
ubifs: have ubifs_update_time use inode_update_timestamps
btrfs: have it use inode_update_timestamps
fs: drop the timespec64 arg from generic_update_time
fs: pass the request_mask to generic_fillattr
fs: remove silly warning from current_time
gfs2: fix timestamp handling on quota inodes
fs: rename i_ctime field to __i_ctime
selinux: convert to ctime accessor functions
security: convert to ctime accessor functions
apparmor: convert to ctime accessor functions
sunrpc: convert to ctime accessor functions
...
Remove the unnecessary encoding of page order into an enum and pass the
page order directly. That lets us get rid of pe_order().
The switch constructs have to be changed to if/else constructs to prevent
GCC from warning on builds with 3-level page tables where PMD_ORDER and
PUD_ORDER have the same value.
If you are looking at this commit because your driver stopped compiling,
look at the previous commit as well and audit your driver to be sure it
doesn't depend on mmap_lock being held in its ->huge_fault method.
[willy@infradead.org: use "order %u" to match the (non dev_t) style]
Link: https://lkml.kernel.org/r/ZOUYekbtTv+n8hYf@casper.infradead.org
Link: https://lkml.kernel.org/r/20230818202335.2739663-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently ztailpacking pclusters are chained with FOLLOWED_NOINPLACE and
not recorded into the managed_pslots XArray.
After commit 7674a42f35 ("erofs: use struct lockref to replace
handcrafted approach"), ztailpacking pclusters won't be freed with
erofs_workgroup_put() anymore, which will cause the following issue:
BUG erofs_pcluster-1 (Tainted: G OE ): Objects remaining in erofs_pcluster-1 on __kmem_cache_shutdown()
Use z_erofs_free_pcluster() directly to free ztailpacking pclusters.
Fixes: 7674a42f35 ("erofs: use struct lockref to replace handcrafted approach")
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230822110530.96831-1-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
The `dedupe` and `fragments` features have been merged for a year.
They are mostly stable now.
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: sunshijie <sunshijie@xiaomi.com>
Link: https://lore.kernel.org/r/20230821041737.2673401-1-sunshijie@xiaomi.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
It's a straight-forward conversion except that readahead_folio()
will do folio_put() in advance but it doesn't matter since folios
are still locked.
As before, since file-backed folios (pages for now) are locked, so
we could temporarily use folio->private as an internal counter to
indicate split parts of each folio for the corresponding pclusters
to decompress.
When such counter becomes zero, the folio will be finally unlocked
(see compress.h and z_erofs_onlinepage_endio()).
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230817082813.81180-7-hsiangkao@linux.alibaba.com
EROFS_MAP_FULL_MAPPED is more accurate to decide if caching the last
incomplete pcluster for later read or not.
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230817082813.81180-6-hsiangkao@linux.alibaba.com
Some preparation logic should be part of z_erofs_pcluster_begin()
instead of z_erofs_do_read_page(). Let's move now.
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230817082813.81180-3-hsiangkao@linux.alibaba.com
{collector,collection} were once reserved in order to indicate different
runtime logical extent instance of multi-reference pclusters.
However, de-duplicated decompression has been landed in a more flexable
way, thus `struct z_erofs_collection` was formally removed in commit
87ca34a706 ("erofs: get rid of `struct z_erofs_collection'").
Let's handle the remaining leftovers, for example:
`z_erofs_collector_begin` => `z_erofs_pcluster_begin`
`z_erofs_collector_end` => `z_erofs_pcluster_end`
as well as some comments. No logic changes.
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230817082813.81180-2-hsiangkao@linux.alibaba.com
As erofs_fs_type has been declared in internal.h, there is no use to
declare repeatedly in super.c.
Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
eviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230815094849.53249-3-mengferry@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Don't warn users since -EINTR can be returned due to user interruption.
Also suppress warning messages of readmore.
Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230809060637.21311-1-mengferry@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Optimise the negative xattr lookup with bloom filter.
The bit value for the bloom filter map has a reverse semantics for
compatibility. That is, the bit value of 0 indicates existence, while
the bit value of 1 indicates the absence of corresponding xattr.
The initial version is _only_ enabled when xattr_filter_reserved is
zero. The filter map internals may change in the future, in which case
the reserved flag will be set non-zero and we don't need bothering the
compatible bits again at that time. For now disable the optimization if
this reserved flag is non-zero.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230722094538.11754-3-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
The xattr name bloom filter feature is going to be introduced to speed
up the negative xattr lookup, e.g. system.posix_acl_[access|default]
lookup when running "ls -lR" workload.
There are some commonly used extended attributes (n) and the total
number of these is approximately 30.
trusted.overlay.opaque
trusted.overlay.redirect
trusted.overlay.origin
trusted.overlay.impure
trusted.overlay.nlink
trusted.overlay.upper
trusted.overlay.metacopy
trusted.overlay.protattr
user.overlay.opaque
user.overlay.redirect
user.overlay.origin
user.overlay.impure
user.overlay.nlink
user.overlay.upper
user.overlay.metacopy
user.overlay.protattr
security.evm
security.ima
security.selinux
security.SMACK64
security.SMACK64IPIN
security.SMACK64IPOUT
security.SMACK64EXEC
security.SMACK64TRANSMUTE
security.SMACK64MMAP
security.apparmor
security.capability
system.posix_acl_access
system.posix_acl_default
user.mime_type
Given the number of bits of the bloom filter (m) is 32, the optimal
value for the number of the hash functions (k) is 1 (ln2 * m/n = 0.74).
The single hash function is implemented as:
xxh32(name, strlen(name), EROFS_XATTR_FILTER_SEED + index)
where `index` represents the index of corresponding predefined short name
prefix, while `name` represents the name string after stripping the above
predefined name prefix.
The constant magic number EROFS_XATTR_FILTER_SEED, i.e. 0x25BBE08F, is
used to give a better spread when mapping these 30 extended attributes
into 32-bit bloom filter as:
bit 0: security.ima
bit 1:
bit 2: trusted.overlay.nlink
bit 3:
bit 4: user.overlay.nlink
bit 5: trusted.overlay.upper
bit 6: user.overlay.origin
bit 7: trusted.overlay.protattr
bit 8: security.apparmor
bit 9: user.overlay.protattr
bit 10: user.overlay.opaque
bit 11: security.selinux
bit 12: security.SMACK64TRANSMUTE
bit 13: security.SMACK64
bit 14: security.SMACK64MMAP
bit 15: user.overlay.impure
bit 16: security.SMACK64IPIN
bit 17: trusted.overlay.redirect
bit 18: trusted.overlay.origin
bit 19: security.SMACK64IPOUT
bit 20: trusted.overlay.opaque
bit 21: system.posix_acl_default
bit 22:
bit 23: user.mime_type
bit 24: trusted.overlay.impure
bit 25: security.SMACK64EXEC
bit 26: user.overlay.redirect
bit 27: user.overlay.upper
bit 28: security.evm
bit 29: security.capability
bit 30: system.posix_acl_access
bit 31: trusted.overlay.metacopy, user.overlay.metacopy
h_name_filter is introduced to the on-disk per-inode xattr header to
place the corresponding xattr name filter, where bit value 1 indicates
non-existence for compatibility.
This feature is indicated by EROFS_FEATURE_COMPAT_XATTR_FILTER
compatible feature bit.
Reserve one byte in on-disk superblock as the on-disk format for xattr
name filter may change in the future. With this flag we don't need
bothering these compatible bits again at that time.
Suggested-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230722094538.11754-2-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Add DEFLATE compression as the 3rd supported algorithm.
DEFLATE is a popular generic-purpose compression algorithm for quite
long time (many advanced formats like gzip, zlib, zip, png are all
based on that) as Apple documentation written "If you require
interoperability with non-Apple devices, use COMPRESSION_ZLIB. [1]".
Due to its popularity, there are several hardware on-market DEFLATE
accelerators, such as (s390) DFLTCC, (Intel) IAA/QAT, (HiSilicon) ZIP
accelerator, etc. In addition, there are also several high-performence
IP cores and even open-source FPGA approches available for DEFLATE.
Therefore, it's useful to support DEFLATE compression in order to find
a way to utilize these accelerators for asynchronous I/Os and get
benefits from these later.
Besides, it's a good choice to trade off between compression ratios
and performance compared to LZ4 and LZMA. The DEFLATE core format is
simple as well as easy to understand, therefore the code size of its
decompressor is small even for the bootloader use cases. The runtime
memory consumption is quite limited too (e.g. 32K + ~7K for each zlib
stream). As usual, EROFS ourperforms similar approaches too.
Alternatively, DEFLATE could still be used for some specific files
since EROFS supports multiple compression algorithms in one image.
[1] https://developer.apple.com/documentation/compression/compression_algorithm
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230810154859.118330-1-hsiangkao@linux.alibaba.com
generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.
Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When handling deduplicated compressed data, there can be multiple
decompressed extents pointing to the same compressed data in one shot.
In such cases, the bvecs which belong to the longest extent will be
selected as the primary bvecs for real decompressors to decode and the
other duplicated bvecs will be directly copied from the primary bvecs.
Previously, only relative offsets of the longest extent were checked to
decompress the primary bvecs. On rare occasions, it can be incorrect
if there are several extents with the same start relative offset.
As a result, some short bvecs could be selected for decompression and
then cause data corruption.
For example, as Shijie Sun reported off-list, considering the following
extents of a file:
117: 903345.. 915250 | 11905 : 385024.. 389120 | 4096
...
119: 919729.. 930323 | 10594 : 385024.. 389120 | 4096
...
124: 968881.. 980786 | 11905 : 385024.. 389120 | 4096
The start relative offset is the same: 2225, but extent 119 (919729..
930323) is shorter than the others.
Let's restrict the bvec length in addition to the start offset if bvecs
are not full.
Reported-by: Shijie Sun <sunshijie@xiaomi.com>
Fixes: 5c2a64252c ("erofs: introduce partial-referenced pclusters")
Tested-by Shijie Sun <sunshijie@xiaomi.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230719065459.60083-1-hsiangkao@linux.alibaba.com
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.
Acked-by: Gao Xiang <xiang@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230705190309.579783-37-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
DAX can be used to share page cache between VMs, reducing guest memory
overhead. And chunk based data format is widely used for VM and
container image. So enable dax support for it, make erofs better used
for VM scenarios.
Fixes: c5aa903a59 ("erofs: support reading chunk-based uncompressed files")
Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230711062130.7860-1-yinxin.x@bytedance.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
z_erofs_do_read_page() may loop infinitely due to the inappropriate
truncation in the below statement. Since the offset is 64 bits and min_t()
truncates the result to 32 bits. The solution is to replace unsigned int
with a 64-bit type, such as erofs_off_t.
cur = end - min_t(unsigned int, offset + end - map->m_la, end);
- For example:
- offset = 0x400160000
- end = 0x370
- map->m_la = 0x160370
- offset + end - map->m_la = 0x400000000
- offset + end - map->m_la = 0x00000000 (truncated as unsigned int)
- Expected result:
- cur = 0
- Actual result:
- cur = 0x370
Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Fixes: 3883a79abd ("staging: erofs: introduce VLE decompression support")
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230710093410.44071-1-guochunhai@vivo.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
z_erofs_pcluster_readmore() may take a long time to loop when the page
offset is large enough, which is unnecessary should be prevented.
For example, when the following case is encountered, it will loop 4691368
times, taking about 27 seconds:
- offset = 19217289215
- inode_size = 1442672
Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Fixes: 386292919c ("erofs: introduce readmore decompression strategy")
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230710042531.28761-1-guochunhai@vivo.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Use memcpy_to_page() instead of open-coding them.
In addition, add a missing flush_dcache_page() even though almost all
modern architectures clear `PG_dcache_clean` flag for new file cache
pages so that it doesn't change anything in practice.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230627161240.331-2-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8dwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpilGD/9Yys1oxIXJpRf00fzrylAlBthRxMjFQVWw
zAut106hAQiBHvU8IkmGA3MvEFVHxtzwYhHI7IR8K3aZBIqscweCqmVI9JyogJw9
U9Twnzel47VmuKdM94FeoN+hbj1fP8EWTjzmy67/zEEfFCdmHvNlMi3lSrGYIpFy
39LxTB99Y4UarM5PtWbes37GYYljzMSWKuo4AfBkvq1eQa+sZ0Vq2xAABKq3UM7f
apqhgHtkJooRePDP0eQp+kAyyVMgW2jIK+oIdJDxNF3CKTu2w40RzaYz6fp+jVSU
H4R/xS59GW4/xql+VBJDh/qJg9K62DPPYjlW8BmSR8+IjvfFpsyH3/MacE50CD3P
20fs/Mnj49H79fDrQEHJI53cOOb2EmUitbwLbvOcColNTPpt8loBtdQxjF2RMU8R
Nyort9DJPFclYCxky1LYg1CNEC2Ln4Zy/jD47wPvqRmOQphOoVlV/hPnOEqvjaZC
49Vn70W2DeE9cXvYI7ha+XIg6/oj+Gs3iusEbV08Ci7EAtXgI+ZUUsQ97K8UNiUh
h2lqSJtuI7lBpYP9sf+BeCch5UCC+xGYyTdoM5f58lehWBBPtbs0g7S9RyRyOYxe
n+yxEUo3dAGzJ/xsKAjinbZfeWIpr0b1TkAh4w3Cq/BKzRr9Bp8lBAxYuancbQ+Y
1ADPteUOTA==
=zP4Y
-----END PGP SIGNATURE-----
Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull request via Keith:
- Various cleanups all around (Irvin, Chaitanya, Christophe)
- Better struct packing (Christophe JAILLET)
- Reduce controller error logs for optional commands (Keith)
- Support for >=64KiB block sizes (Daniel Gomez)
- Fabrics fixes and code organization (Max, Chaitanya, Daniel
Wagner)
- bcache updates via Coly:
- Fix a race at init time (Mingzhe Zou)
- Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)
- use page pinning in the block layer for dio (David)
- convert old block dio code to page pinning (David, Christoph)
- cleanups for pktcdvd (Andy)
- cleanups for rnbd (Guoqing)
- use the unchecked __bio_add_page() for the initial single page
additions (Johannes)
- fix overflows in the Amiga partition handling code (Michael)
- improve mq-deadline zoned device support (Bart)
- keep passthrough requests out of the IO schedulers (Christoph, Ming)
- improve support for flush requests, making them less special to deal
with (Christoph)
- add bdev holder ops and shutdown methods (Christoph)
- fix the name_to_dev_t() situation and use cases (Christoph)
- decouple the block open flags from fmode_t (Christoph)
- ublk updates and cleanups, including adding user copy support (Ming)
- BFQ sanity checking (Bart)
- convert brd from radix to xarray (Pankaj)
- constify various structures (Thomas, Ivan)
- more fine grained persistent reservation ioctl capability checks
(Jingbo)
- misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
Jordy, Li, Min, Yu, Zhong, Waiman)
* tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
scsi/sg: don't grab scsi host module reference
ext4: Fix warning in blkdev_put()
block: don't return -EINVAL for not found names in devt_from_devname
cdrom: Fix spectre-v1 gadget
block: Improve kernel-doc headers
blk-mq: don't insert passthrough request into sw queue
bsg: make bsg_class a static const structure
ublk: make ublk_chr_class a static const structure
aoe: make aoe_class a static const structure
block/rnbd: make all 'class' structures const
block: fix the exclusive open mask in disk_scan_partitions
block: add overflow checks for Amiga partition support
block: change all __u32 annotations to __be32 in affs_hardblocks.h
block: fix signed int overflow in Amiga partition support
block: add capacity validation in bdev_add_partition()
block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
block: disallow Persistent Reservation on partitions
reiserfs: fix blkdev_put() warning from release_journal_dev()
block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
block: document the holder argument to blkdev_get_by_path
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8QgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpupIEADKEZvpxDyaxHjYZFFeoSJRkh+AEJHe0Xtr
J5vUL8t8zmAV3F7i8XaoAEcR0dC0VQcoTc8fAOty71+5hsc7gvtyyNjqU/YWRVqK
Xr+VJuSJ+OGx3MzpRWEkepagfPyqP5cyyCOK6gqIgqzc3IwqkR/3QHVRc6oR8YbY
AQd7tqm2fQXK9WDHEy5hcaQeqb9uKZjQQoZejpPPerpJM+9RMgKxpCGtnLLIUhr/
sgl7KyLIQPBmveO2vfOR+dmsJBqsLqneqkXDKMAIfpeVEEkHHAlCH4E5Ne1XUS+s
ie4If+reuyn1Ktt5Ry1t7w2wr8cX1fcay3K28tgwjE2Bvremc5YnYgb3pyUDW38f
tXXkpg/eTXd/Pn0Crpagoa9zJ927tt5JXIO1/PagPEP1XOqUuthshDFsrVqfqbs+
36gqX2JWB4NJTg9B9KBHA3+iVCJyZLjUqOqws7hOJOvhQytZVm/IwkGBg1Slhe1a
J5WemBlqX8lTgXz0nM7cOhPYTZeKe6hazCcb5VwxTUTj9SGyYtsMfqqTwRJO9kiF
j1VzbOAgExDYe+GvfqOFPh9VqZho66+DyOD/Xtca4eH7oYyHSmP66o8nhRyPBPZA
maBxQhUkPQn4/V/0fL2TwIdWYKsbj8bUyINKPZ2L35YfeICiaYIctTwNJxtRmItB
M3VxWD3GZQ==
=KhW4
-----END PGP SIGNATURE-----
Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux
Pull splice updates from Jens Axboe:
"This kills off ITER_PIPE to avoid a race between truncate,
iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio
with unpinned/unref'ed pages from an O_DIRECT splice read. This causes
memory corruption.
Instead, we either use (a) filemap_splice_read(), which invokes the
buffered file reading code and splices from the pagecache into the
pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads
into it and then pushes the filled pages into the pipe; or (c) handle
it in filesystem-specific code.
Summary:
- Rename direct_splice_read() to copy_splice_read()
- Simplify the calculations for the number of pages to be reclaimed
in copy_splice_read()
- Turn do_splice_to() into a helper, vfs_splice_read(), so that it
can be used by overlayfs and coda to perform the checks on the
lower fs
- Make vfs_splice_read() jump to copy_splice_read() to handle
direct-I/O and DAX
- Provide shmem with its own splice_read to handle non-existent pages
in the pagecache. We don't want a ->read_folio() as we don't want
to populate holes, but filemap_get_pages() requires it
- Provide overlayfs with its own splice_read to call down to a lower
layer as overlayfs doesn't provide ->read_folio()
- Provide coda with its own splice_read to call down to a lower layer
as coda doesn't provide ->read_folio()
- Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs
and random files as they just copy to the output buffer and don't
splice pages
- Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3,
ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation
- Make cifs use filemap_splice_read()
- Replace pointers to generic_file_splice_read() with pointers to
filemap_splice_read() as DIO and DAX are handled in the caller;
filesystems can still provide their own alternate ->splice_read()
op
- Remove generic_file_splice_read()
- Remove ITER_PIPE and its paraphernalia as generic_file_splice_read
was the only user"
* tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits)
splice: kdoc for filemap_splice_read() and copy_splice_read()
iov_iter: Kill ITER_PIPE
splice: Remove generic_file_splice_read()
splice: Use filemap_splice_read() instead of generic_file_splice_read()
cifs: Use filemap_splice_read()
trace: Convert trace/seq to use copy_splice_read()
zonefs: Provide a splice-read wrapper
xfs: Provide a splice-read wrapper
orangefs: Provide a splice-read wrapper
ocfs2: Provide a splice-read wrapper
ntfs3: Provide a splice-read wrapper
nfs: Provide a splice-read wrapper
f2fs: Provide a splice-read wrapper
ext4: Provide a splice-read wrapper
ecryptfs: Provide a splice-read wrapper
ceph: Provide a splice-read wrapper
afs: Provide a splice-read wrapper
9p: Add splice_read wrapper
net: Make sock_splice_read() use copy_splice_read() by default
tty, proc, kernfs, random: Use copy_splice_read()
...
Several trivial cleanups which aren't quite necessary to split:
- Rename lcluster load functions as well as justify full indexes
since they are typically used for global deduplication for
compressed data;
- Avoid unnecessary lines, comments for simplicity.
No logic changes.
Reviewed-by: Guo Xuenan <guoxuenan@huaweicloud.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230615064421.103178-1-hsiangkao@linux.alibaba.com
Current check for atomic context is not sufficient as
z_erofs_decompressqueue_endio can be called under rcu lock
from blk_mq_flush_plug_list(). See the stacktrace [1]
In such case we should hand off the decompression work for async
processing rather than trying to do sync decompression in current
context. Patch fixes the detection by checking for
rcu_read_lock_any_held() and while at it use more appropriate
!in_task() check than in_atomic().
Background: Historically erofs would always schedule a kworker for
decompression which would incur the scheduling cost regardless of
the context. But z_erofs_decompressqueue_endio() may not always
be in atomic context and we could actually benefit from doing the
decompression in z_erofs_decompressqueue_endio() if we are in
thread context, for example when running with dm-verity.
This optimization was later added in patch [2] which has shown
improvement in performance benchmarks.
==============================================
[1] Problem stacktrace
[name:core&]BUG: sleeping function called from invalid context at kernel/locking/mutex.c:291
[name:core&]in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1615, name: CpuMonitorServi
[name:core&]preempt_count: 0, expected: 0
[name:core&]RCU nest depth: 1, expected: 0
CPU: 7 PID: 1615 Comm: CpuMonitorServi Tainted: G S W OE 6.1.25-android14-5-maybe-dirty-mainline #1
Hardware name: MT6897 (DT)
Call trace:
dump_backtrace+0x108/0x15c
show_stack+0x20/0x30
dump_stack_lvl+0x6c/0x8c
dump_stack+0x20/0x48
__might_resched+0x1fc/0x308
__might_sleep+0x50/0x88
mutex_lock+0x2c/0x110
z_erofs_decompress_queue+0x11c/0xc10
z_erofs_decompress_kickoff+0x110/0x1a4
z_erofs_decompressqueue_endio+0x154/0x180
bio_endio+0x1b0/0x1d8
__dm_io_complete+0x22c/0x280
clone_endio+0xe4/0x280
bio_endio+0x1b0/0x1d8
blk_update_request+0x138/0x3a4
blk_mq_plug_issue_direct+0xd4/0x19c
blk_mq_flush_plug_list+0x2b0/0x354
__blk_flush_plug+0x110/0x160
blk_finish_plug+0x30/0x4c
read_pages+0x2fc/0x370
page_cache_ra_unbounded+0xa4/0x23c
page_cache_ra_order+0x290/0x320
do_sync_mmap_readahead+0x108/0x2c0
filemap_fault+0x19c/0x52c
__do_fault+0xc4/0x114
handle_mm_fault+0x5b4/0x1168
do_page_fault+0x338/0x4b4
do_translation_fault+0x40/0x60
do_mem_abort+0x60/0xc8
el0_da+0x4c/0xe0
el0t_64_sync_handler+0xd4/0xfc
el0t_64_sync+0x1a0/0x1a4
[2] Link: https://lore.kernel.org/all/20210317035448.13921-1-huangjianan@oppo.com/
Reported-by: Will Shiu <Will.Shiu@mediatek.com>
Suggested-by: Gao Xiang <xiang@kernel.org>
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Alexandre Mergnat <amergnat@baylibre.com>
Link: https://lore.kernel.org/r/20230621220848.3379029-1-dhavale@google.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
There's a callback styled xattr parser, i.e. xattr_foreach(), which is
shared among listxattr and getxattr.
Convert it to two separate xattr parsers to serve listxattr and getxattr
for better readability.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613074114.120115-6-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Make inline_{list,get}xattr() as well as inline_xattr_iter_begin()
unified as erofs_xattr_iter_inline(), and shared_{list,get}xattr()
unified as erofs_xattr_iter_shared().
After these changes, both erofs_xattr_iter_{inline,shared}() return 0 on
success, and negative error on failure.
One thing worth noting is that, the logic of returning it->buffer_ofs
when there's no shared xattrs in shared_listxattr() is moved to
erofs_listxattr() to make the unification possible. The only difference
is that, semantically the old behavior will return ENOATTR rather than
it->buffer_ofs if ENOATTR encountered when listxattr is parsing upon a
specific shared xattr, while now the new behavior will return
it->buffer_ofs in this case. This is not an issue, as listxattr upon a
specific xattr won't return ENOATTR.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613074114.120115-5-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Since now xattr_iter structures have been unified, make the size of the
read data stored in buffer_ofs. Don't bother reusing buffer_size for
this use, which may be confusing.
This is in preparation for the following further cleanup.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613074114.120115-4-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Unify xattr_iter/listxattr_iter/getxattr_iter structures into
erofs_xattr_iter structure.
This is in preparation for the following further cleanup.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613074114.120115-3-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Replace blkaddr/ofs with pos in 'struct erofs_xattr_iter'.
After erofs_bread() is introduced to replace raw page cache APIs for
metadata I/Os handling, xattr_iter_fixup() is no longer needed anymore.
In addition, it is also unnecessary to check if the iterated position is
span over the block boundary as absolute offset is used instead of
blkaddr + offset pairs.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613074114.120115-2-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
In compact 4B, two adjacent lclusters are packed together as a unit to
form on-disk indexes for effective random access, as below:
(amortized = 4, vcnt = 2)
_____________________________________________
|___@_____ encoded bits __________|_ blkaddr _|
0 . amortized * vcnt = 8
. .
. . amortized * vcnt - 4 = 4
. .
.____________________________.
|_type (2 bits)_|_clusterofs_|
Therefore, encoded bits for each pack are 32 bits (4 bytes). IOWs,
since each lcluster can get 16 bits for its type and clusterofs, the
maximum supported lclustersize for compact 4B format is 16k (14 bits).
Fix this to enable compact 4B format for 16k lclusters (blocks), which
is tested on an arm64 server with 16k page size.
Fixes: 152a333a58 ("staging: erofs: add compacted compression indexes support")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230601112341.56960-1-hsiangkao@linux.alibaba.com
buf->inode is constant once initialized during erofs_buf's lifetime.
Thus call erofs_init_metabuf() and erofs_bread() separately to avoid
the repetition of assigning buf->inode when iterating xattrs.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230601024347.108469-2-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Let's avoid the current handcrafted lockref although `struct lockref`
inclusion usually increases extra 4 bytes with an explicit spinlock if
CONFIG_DEBUG_SPINLOCK is off.
Apart from the size difference, note that the meaning of refcount is
also changed to active users. IOWs, it doesn't take an extra refcount
for XArray tree insertion.
I don't observe any significant performance difference at least on
our cloud compute server but the new one indeed simplifies the
overall codebase a bit.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Link: https://lore.kernel.org/r/20230529123727.79943-1-hsiangkao@linux.alibaba.com
The only overlap between the block open flags mapped into the fmode_t and
other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new
blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and
->ioctl and stop abusing fmode_t.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current interface for exclusive opens is rather confusing as it
requires both the FMODE_EXCL flag and a holder. Remove the need to pass
FMODE_EXCL and just key off the exclusive open off a non-NULL holder.
For blkdev_put this requires adding the holder argument, which provides
better debug checking that only the holder actually releases the hold,
but at the same time allows removing the now superfluous mode argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and
installed in the block_device for exclusive claims. It will be used to
allow the block layer to call back into the user of the block device for
thing like notification of a removed device or a device resize.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch gets rid of erofs_try_to_free_cached_page() and fold it
into .release_folio().
It also moves managed inode operations into zdata.c, which simplifies
the code a bit. No logic changes.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Link: https://lore.kernel.org/r/20230526201459.128169-5-hsiangkao@linux.alibaba.com
After heavily stressing EROFS with several images which include a
hand-crafted image of repeated patterns for more than 46 days, I found
two chains could be linked with each other almost simultaneously and
form a loop so that the entire loop won't be submitted. As a
consequence, the corresponding file pages will remain locked forever.
It can be _only_ observed on data-deduplicated compressed images.
For example, consider two chains with five pclusters in total:
Chain 1: 2->3->4->5 -- The tail pcluster is 5;
Chain 2: 5->1->2 -- The tail pcluster is 2.
Chain 2 could link to Chain 1 with pcluster 5; and Chain 1 could link
to Chain 2 at the same time with pcluster 2.
Since hooked chains are all linked locklessly now, I have no idea how
to simply avoid the race. Instead, let's avoid hooked chains completely
until I could work out a proper way to fix this and end users finally
tell us that it's needed to add it back.
Actually, this optimization can be found with multi-threaded workloads
(especially even more often on deduplicated compressed images), yet I'm
not sure about the overall system impacts of not having this compared
with implementation complexity.
Fixes: 267f2492c8 ("erofs: introduce multi-reference pclusters (fully-referenced)")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Link: https://lore.kernel.org/r/20230526201459.128169-4-hsiangkao@linux.alibaba.com
On-stack pagepool is used so that short-lived temporary pages could be
shared within a single I/O request (e.g. among multiple pclusters).
Moving the remaining frontend-related uses into
z_erofs_decompress_frontend to avoid too many arguments.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Link: https://lore.kernel.org/r/20230526201459.128169-3-hsiangkao@linux.alibaba.com
If non-bootstrap bvecs cannot be kept in place (very rarely), an extra
short-lived page is allocated.
Let's just allocate it immediately rather than do unnecessary -EAGAIN
return first and retry as a cleanup. Also it's unnecessary to use
__GFP_NOFAIL here since we could gracefully fail out this case instead.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Link: https://lore.kernel.org/r/20230526201459.128169-2-hsiangkao@linux.alibaba.com
`end` parameter is no needed since it's pointless for !backmost, we can
handle it with backmost internally. And we only expand the trailing
edge, so the newstart can be replaced with ->headoffset.
Also, remove linux/prefetch.h inclusion since that is not used anymore
after commit 386292919c ("erofs: introduce readmore decompression
strategy").
Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230525072605.17857-1-zbestahu@gmail.com
[ Gao Xiang: update commit description. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
The struct member is only used to add REQ_RAHEAD during I/O submission.
So it is cleaner to pass it as a parameter than keep it in the struct.
Also, rename function z_erofs_get_sync_decompress_policy() to
z_erofs_is_sync_decompress() for better clarity and conciseness.
Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230524063944.1655-1-zbestahu@gmail.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
No need this helper since it's just a simple wrapper for decompress
method and only one caller. So, let's fold in directly instead.
Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230426084449.12781-1-zbestahu@gmail.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
As Sandeep shown [1], high priority RT per-cpu kthreads are
typically helpful for Android scenarios to minimize the scheduling
latencies.
Switch EROFS_FS_PCPU_KTHREAD_HIPRI on by default if
EROFS_FS_PCPU_KTHREAD is on since it's the typical use cases for
EROFS_FS_PCPU_KTHREAD.
Also clean up unneeded sched_set_normal().
[1] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230522092141.124290-1-hsiangkao@linux.alibaba.com