Merge tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fix from Steve French:
"Fix net namespace refcount use after free issue"
* tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: Fix use-after-free of network namespace.
Merge tag 'v6.12-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
"Four fixes, all also marked for stable:
- fix two potential use after free issues
- fix OOM issue with many simultaneous requests
- fix missing error check in RPC pipe handling"
* tag 'v6.12-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: check outstanding simultaneous SMB operations
ksmbd: fix slab-use-after-free in smb3_preauth_hash_rsp
ksmbd: fix slab-use-after-free in ksmbd_smb2_session_create
ksmbd: Fix the missing xa_store error check
Merge tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more one-liners that fix some user visible problems:
- use correct range when clearing qgroup reservations after COW
- properly reset freed delayed ref list head
- fix ro/rw subvolume mounts to be backward compatible with old and
new mount API"
* tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix the length of reserved qgroup to free
btrfs: reinitialize delayed ref list after deleting it from the list
btrfs: fix per-subvolume RO/RW flags with new mount API
Some trivial syzbot fixes, two more serious btree fixes found by looping
single_devices.ktest small_nodes:
- Topology error on split after merge, where we accidentally picked the
node being deleted for the pivot, resulting in an assertion pop
- New nodes being preallocated were left on the freedlist, unlocked,
resulting in them sometimes being accidentally freed: this dated from
pre-cycle detector, when we could leave them locked. This should have
resulted in more explosions and fireworks, but turned out to be
surprisingly hard to hit because the preallocated nodes were being
used right away.
The fix for this is bigger than we'd like - reworking btree list
handling was a bit invasive - but we've now got more assertions and
it's well tested.
- Also another mishandled transaction restart fix (in
btree_node_prefetch) - we're almost done with those.
Merge tag 'bcachefs-2024-11-07' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Some trivial syzbot fixes, two more serious btree fixes found by
looping single_devices.ktest small_nodes:
- Topology error on split after merge, where we accidentally picked
the node being deleted for the pivot, resulting in an assertion pop
- New nodes being preallocated were left on the freedlist, unlocked,
resulting in them sometimes being accidentally freed: this dated
from pre-cycle detector, when we could leave them locked. This
should have resulted in more explosions and fireworks, but turned
out to be surprisingly hard to hit because the preallocated nodes
were being used right away.
The fix for this is bigger than we'd like - reworking btree list
handling was a bit invasive - but we've now got more assertions and
it's well tested.
- Also another mishandled transaction restart fix (in
btree_node_prefetch) - we're almost done with those"
* tag 'bcachefs-2024-11-07' of git://evilpiepirate.org/bcachefs:
bcachefs: Fix UAF in __promote_alloc() error path
bcachefs: Change OPT_STR max to be 1 less than the size of choices array
bcachefs: btree_cache.freeable list fixes
bcachefs: check the invalid parameter for perf test
bcachefs: add check NULL return of bio_kmalloc in journal_read_bucket
bcachefs: Ensure BCH_FS_may_go_rw is set before exiting recovery
bcachefs: Fix topology errors on split after merge
bcachefs: Ancient versions with bad bkey_formats are no longer supported
bcachefs: Fix error handling in bch2_btree_node_prefetch()
bcachefs: Fix null ptr deref in bucket_gen_get()
If we error in data_update_init() after adding to the rhashtable of
outstanding promotes, kfree_rcu() is required.
Reported-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Change OPT_STR max value to be 1 less than the "ARRAY_SIZE" of "_choices"
array. As a result, remove -1 from (opt->max-1) in bch2_opt_to_text.
The "_choices" array is a null-terminated array, so computing the maximum
using "ARRAY_SIZE" without subtracting 1 yields an incorrect result. Since
bch2_opt_validate doesn't subtract 1, as bch2_opt_to_text does, values
bigger than the actual maximum would pass through option validation.
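
The off-by-one described above can be illustrated with a minimal userspace sketch. The names demo_choices, DEMO_OPT_MAX and demo_opt_validate are hypothetical and are not the bcachefs definitions; the sketch only assumes that the maximum is an exclusive upper bound on the index.

```c
#include <stdbool.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical choices list; the trailing NULL terminates it. */
static const char * const demo_choices[] = { "none", "lz4", "zstd", NULL };

/*
 * Exclusive upper bound for a valid index. ARRAY_SIZE() counts the NULL
 * terminator as well, so it must be subtracted here once, instead of in
 * each user of the maximum.
 */
#define DEMO_OPT_MAX (ARRAY_SIZE(demo_choices) - 1)

/* Returns true only if v indexes a real choice string. */
static bool demo_opt_validate(size_t v)
{
	return v < DEMO_OPT_MAX;
}
```

With the bound computed this way, index 3 (the NULL terminator) is rejected by validation rather than being passed on to code that would dereference it as a string.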
Reported-by: syzbot+bee87a0c3291c06aa8c6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bee87a0c3291c06aa8c6
Fixes: 63c4b25453 ("bcachefs: Better superblock opt validation")
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When allocating new btree nodes, we were leaving them on the freeable
list - unlocked - allowing them to be reclaimed: ouch.
Additionally, bch2_btree_node_free_never_used() ->
bch2_btree_node_hash_remove was putting it on the freelist, while
bch2_btree_node_free_never_used() was putting it back on the btree
update reserve list - ouch.
Originally, the code was written to always keep btree nodes on a list -
live or freeable - and this worked when new nodes were kept locked.
But now with the cycle detector, we can't keep nodes locked that aren't
tracked by the cycle detector; and this is fine as long as they're not
reachable.
We also have better and more robust leak detection now, with memory
allocation profiling, so the original justification no longer applies.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The perf_test does not check whether the number of iterations or the
number of threads is zero. If nr_thread is 0, the perf test will keep
waiting for a wakeup. If the iteration count is 0, it will cause a
division-by-zero exception. This can be reproduced by:
echo "rand_insert 0 1" > /sys/fs/bcachefs/${uuid}/perf_test
or
echo "rand_insert 1 0" > /sys/fs/bcachefs/${uuid}/perf_test
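
A minimal sketch of the kind of check described, in plain userspace C. The helper names and the exact signature are assumptions made for illustration and are not the kernel code.

```c
#include <errno.h>

/*
 * Hypothetical validation helper: reject a zero iteration count or a
 * zero thread count before any work is started.
 */
static int demo_perf_test_check(unsigned long long nr, unsigned int nr_threads)
{
	if (nr == 0 || nr_threads == 0)
		return -EINVAL;
	return 0;
}

/*
 * Average cost per iteration. The division is only reached after nr is
 * known to be non-zero.
 */
static int demo_ns_per_iter(unsigned long long total_ns, unsigned long long nr,
			    unsigned int nr_threads, unsigned long long *out)
{
	int ret = demo_perf_test_check(nr, nr_threads);

	if (ret)
		return ret;
	*out = total_ns / nr;
	return 0;
}
```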
Fixes: 1c6fdbd8f2 ("bcachefs: Initial commit")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bio_kmalloc() may return NULL, which would cause a NULL pointer dereference.
Add a NULL check for the return value of bio_kmalloc() in journal_read_bucket().
Signed-off-by: Pei Xiao <xiaopei01@kylinos.cn>
Fixes: ac10a9611d ("bcachefs: Some fixes for building in userspace")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If BCH_FS_may_go_rw is not yet set, it indicates to the transaction
commit path that updates should be done via the list of journal replay
keys.
This must be set before multithreaded use commences.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a btree split picks a pivot that's being deleted by a btree node
merge, we're going to have problems.
Fix this by checking if the pivot is being deleted, the same as we check
for deletions in journal replay keys.
Found by single_devices.ktest small_nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Syzbot found an assertion pop, by generating an ancient filesystem
version with an invalid bkey_format (with fields that can overflow) as
well as packed keys that aren't representable unpacked.
This breaks key comparisons in all sorts of painful ways.
Filesystems have been automatically rewriting nodes with such invalid
formats for years; we can safely drop support for them.
Reported-by: syzbot+8a0109511de9d4b61217@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bucket_gen() checks if we're looking up a valid bucket and returns NULL
otherwise, but bucket_gen_get() was failing to check; other callers were
correct.
Also do a bit of cleanup on callers.
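
The pattern at issue, a lookup that may return NULL and a wrapper that must check it, can be sketched as follows. The struct layout and the demo_ names are hypothetical and only loosely modelled on the description above.

```c
#include <stddef.h>

/* Hypothetical device with a per-bucket generation array. */
struct demo_dev {
	size_t first_bucket;	/* first valid bucket index */
	size_t nbuckets;	/* one past the last valid bucket index */
	unsigned char *gens;	/* indexed by bucket number */
};

/* Returns a pointer to the generation, or NULL for an invalid bucket. */
static unsigned char *demo_bucket_gen(struct demo_dev *d, size_t b)
{
	if (b < d->first_bucket || b >= d->nbuckets)
		return NULL;
	return d->gens + b;
}

/* The wrapper checks for NULL instead of dereferencing blindly. */
static int demo_bucket_gen_get(struct demo_dev *d, size_t b)
{
	unsigned char *gen = demo_bucket_gen(d, b);

	return gen ? *gen : -1;
}
```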
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
seq_printf() is costly: on a system with n CPUs, reading /proc/softirqs
yields 10*n decimal values, and the extra cost of parsing the format string
grows linearly with the number of CPUs. Replacing seq_printf() with
seq_put_decimal_ull_width() gives a significant performance improvement.
On an 8-CPU system, reading /proc/softirqs shows a ~40% performance
gain with this patch.
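
To illustrate why a dedicated decimal writer avoids the per-value format parsing, here is a userspace sketch of a width-padded decimal formatter. demo_put_decimal_width() is a hypothetical stand-in and does not reproduce the signature or behaviour of seq_put_decimal_ull_width().

```c
#include <stddef.h>

/*
 * Write num in decimal, right-aligned in a field of at least `width`
 * characters, and NUL-terminate. Returns the number of characters
 * written (excluding the NUL), or 0 if the buffer is too small.
 * No format string is parsed at any point.
 */
static size_t demo_put_decimal_width(char *buf, size_t size,
				     unsigned long long num, unsigned int width)
{
	char tmp[20];	/* 2^64 - 1 has 20 decimal digits */
	size_t len = 0, pad, n = 0;

	/* Produce digits in reverse order. */
	do {
		tmp[len++] = (char)('0' + num % 10);
		num /= 10;
	} while (num);

	pad = width > len ? width - len : 0;
	if (pad + len + 1 > size)
		return 0;

	while (pad--)
		buf[n++] = ' ';
	while (len)
		buf[n++] = tmp[--len];
	buf[n] = '\0';
	return n;
}
```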
Signed-off-by: David Wang <00107082@163.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I noticed that recently, simple operations like "make" started
failing on NFSv3 mounts of ext4 exports. Network capture shows that
READDIRPLUS operated correctly but READDIR failed with
NFS3ERR_INVAL. The vfs_llseek() call returned EINVAL when it is
passed a non-zero starting directory cookie.
I bisected to commit c689bdd3bf ("nfsd: further centralize
protocol version checks.").
Turns out that nfsd3_proc_readdir() does not call fh_verify() before
it calls nfsd_readdir(), so the new fhp->fh_64bit_cookies boolean is
not set properly. This leaves the NFSD_MAY_64BIT_COOKIE unset when
the directory is opened.
For ext4, this causes the wrong "max file size" value to be used
when sanity checking the incoming directory cookie (which is a seek
offset value).
The fhp->fh_64bit_cookies boolean is /always/ properly initialized
after nfsd_open() returns. There doesn't seem to be a reason for the
generic NFSD open helper to handle the f_mode fix-up for
directories, so just move that to the one caller that tries to open
an S_IFDIR with NFSD_MAY_64BIT_COOKIE.
Suggested-by: NeilBrown <neilb@suse.de>
Fixes: c689bdd3bf ("nfsd: further centralize protocol version checks.")
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The dealloc flag may be cleared and the extent won't reach the disk in
the error paths of cow_file_range. The reserved qgroup space is freed in
commit 30479f31d4 ("btrfs: fix qgroup reserve leaks in
cow_file_range"). However, the length of untouched region to free needs
to be adjusted with the correct remaining region size.
Fixes: 30479f31d4 ("btrfs: fix qgroup reserve leaks in cow_file_range")
CC: stable@vger.kernel.org # 6.11+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At insert_delayed_ref() if we need to update the action of an existing
ref to BTRFS_DROP_DELAYED_REF, we delete the ref from its ref head's
ref_add_list using list_del(), which leaves the ref's add_list member
not reinitialized, as list_del() sets the next and prev members of the
list to LIST_POISON1 and LIST_POISON2, respectively.
If later we end up calling drop_delayed_ref() against the ref, which can
happen during merging or when destroying delayed refs due to a transaction
abort, we can trigger a crash since at drop_delayed_ref() we call
list_empty() against the ref's add_list, which returns false since
the list was not reinitialized after the list_del() and as a consequence
we call list_del() again at drop_delayed_ref(). This results in an
invalid list access since the next and prev members are set to poison
pointers, resulting in a splat if CONFIG_LIST_HARDENED and
CONFIG_DEBUG_LIST are set or invalid poison pointer dereferences
otherwise.
So fix this by deleting from the list with list_del_init() instead.
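
The difference between the two deletion helpers can be shown with a small userspace re-implementation of a circular doubly linked list. This is a sketch only: the dl_ names and the poison values are stand-ins chosen for illustration, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

struct dl_head {
	struct dl_head *next, *prev;
};

/* Stand-in poison values; any non-NULL invalid pointers would do. */
#define DL_POISON1 ((struct dl_head *)0x100)
#define DL_POISON2 ((struct dl_head *)0x122)

static void dl_init(struct dl_head *h)
{
	h->next = h;
	h->prev = h;
}

/* An entry counts as "empty" only when it points at itself. */
static bool dl_empty(const struct dl_head *h)
{
	return h->next == h;
}

/* Insert n right after head. */
static void dl_add(struct dl_head *n, struct dl_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void dl_unlink(struct dl_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* Like list_del(): the entry is left poisoned, so dl_empty() is false. */
static void dl_del(struct dl_head *e)
{
	dl_unlink(e);
	e->next = DL_POISON1;
	e->prev = DL_POISON2;
}

/* Like list_del_init(): the entry is reinitialized, so dl_empty() is true. */
static void dl_del_init(struct dl_head *e)
{
	dl_unlink(e);
	dl_init(e);
}
```

After dl_del() a later "if (!dl_empty(&e)) dl_del(&e);" would unlink through the poison pointers, which is the double deletion described above; after dl_del_init() the same check is skipped.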
Fixes: 1d57ee9416 ("btrfs: improve delayed refs iterations")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With util-linux 2.40.2, the 'mount' utility is already utilizing the new
mount API. e.g:
# strace mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test/
...
fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0
fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0
fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0
fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0
fsmount(3, FSMOUNT_CLOEXEC, 0) = 4
mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 /* MS_??? */, userns_fd=0}, 32) = 0
move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0
But this leads to a new problem, that per-subvolume RO/RW mount no
longer works, if the initial mount is RO:
# mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test
# mount -o rw,subvol=subv2 /dev/test/scratch1 /mnt/scratch
# mount | grep mnt
/dev/mapper/test-scratch1 on /mnt/test type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=256,subvol=/subv1)
/dev/mapper/test-scratch1 on /mnt/scratch type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=257,subvol=/subv2)
# touch /mnt/scratch/foobar
touch: cannot touch '/mnt/scratch/foobar': Read-only file system
This is a common use case on distros.
[CAUSE]
We have a workaround for remount to handle the RO->RW change, but if the
mount is using the new mount API, we do not do that, and rely on the
mount tool NOT to set the ro flag.
But that's not what the mount tool does with the new API:
fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0
fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0
fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0 <<<< Setting RO flag for super block
fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0
fsmount(3, FSMOUNT_CLOEXEC, 0) = 4
mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 /* MS_??? */, userns_fd=0}, 32) = 0
move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0
This means we will set the super block RO at the first mount.
Later RW mount will not try to reconfigure the fs to RW because the
mount tool is already using the new API.
This totally breaks the per-subvolume RO/RW mount behavior.
[FIX]
Do not skip the reconfiguration even if using the new API. The old
comments are just expecting any mount tool to properly skip the RO flag
set even if we specify "ro", which is not the reality.
Update the comments regarding the backward compatibility on the kernel
level so it works with old and new mount utilities.
CC: stable@vger.kernel.org # 6.8+
Fixes: f044b31867 ("btrfs: handle the ro->rw transition for mounting different subvolumes")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable Fixes:
* Fix KMSAN warning in decode_getfattr_attrs()
Other Bugfixes:
* Handle -ENOTCONN in xs_tcp_setup_socket()
* NFSv3: only use NFS timeout for MOUNT when protocols are compatible
* Fix attribute delegation behavior on exclusive create and a/mtime changes
* Fix localio to cope with racing nfs_local_probe()
* Avoid i_lock contention in nfs_clear_invalid_mapping()
Merge tag 'nfs-for-6.12-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"These are mostly fixes that came up during the nfs bakeathon the other
week.
Stable Fixes:
- Fix KMSAN warning in decode_getfattr_attrs()
Other Bugfixes:
- Handle -ENOTCONN in xs_tcp_setup_socket()
- NFSv3: only use NFS timeout for MOUNT when protocols are compatible
- Fix attribute delegation behavior on exclusive create and a/mtime
changes
- Fix localio to cope with racing nfs_local_probe()
- Avoid i_lock contention in nfs_clear_invalid_mapping()"
* tag 'nfs-for-6.12-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
nfs: avoid i_lock contention in nfs_clear_invalid_mapping
nfs_common: fix localio to cope with racing nfs_local_probe()
NFS: Further fixes to attribute delegation a/mtime changes
NFS: Fix attribute delegation behaviour on exclusive create
nfs: Fix KMSAN warning in decode_getfattr_attrs()
NFSv3: only use NFS timeout for MOUNT when protocols are compatible
sunrpc: handle -ENOTCONN in xs_tcp_setup_socket()
The commit 78ff640819 ("vfs: Convert tracefs to use the new mount API")
broke the gid setting when set by fstab or other mount utility.
It is ignored when it is set. Fix the code so that it recognises the
option again and will honor the settings on mount at boot up.
Update the internal documentation and create a selftest to make sure
it doesn't break again in the future.
Merge tag 'tracefs-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracefs fixes from Steven Rostedt:
"Fix tracefs mount options.
Commit 78ff640819 ("vfs: Convert tracefs to use the new mount API")
broke the gid setting when set by fstab or other mount utility. It is
ignored when it is set. Fix the code so that it recognises the option
again and will honor the settings on mount at boot up.
Update the internal documentation and create a selftest to make sure
it doesn't break again in the future"
* tag 'tracefs-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/selftests: Add tracefs mount options test
tracing: Document tracefs gid mount option
tracing: Fix tracefs mount options
If a client sends many simultaneous SMB operations to ksmbd, it exhausts too
much memory through the "ksmbd_work_cache". This causes an OOM issue.
ksmbd has a credit mechanism, but it can't handle this problem. This patch
adds a check on whether the number of outstanding requests exceeds max
credits, assuming that one SMB request consumes at least one credit.
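
The idea of bounding outstanding requests by the credit limit can be sketched with C11 atomics. The struct and function names below are hypothetical and do not reflect the ksmbd data structures.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-connection state. */
struct demo_conn {
	atomic_int outstanding;	/* requests currently being processed */
	int max_credits;	/* each request costs at least one credit */
};

/*
 * Account for a new request. Returns false, and undoes the accounting,
 * if accepting it would exceed the credit limit.
 */
static bool demo_request_begin(struct demo_conn *c)
{
	if (atomic_fetch_add(&c->outstanding, 1) + 1 > c->max_credits) {
		atomic_fetch_sub(&c->outstanding, 1);
		return false;
	}
	return true;
}

/* Called once when an accepted request has been fully processed. */
static void demo_request_end(struct demo_conn *c)
{
	atomic_fetch_sub(&c->outstanding, 1);
}
```

Since no work item is allocated for a rejected request, memory use per connection stays bounded by the credit limit in this sketch.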
Cc: stable@vger.kernel.org # v5.15+
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
ksmbd_user_session_put() should be called after smb3_preauth_hash_rsp().
This avoids freeing the session before smb3_preauth_hash_rsp() is called.
Cc: stable@vger.kernel.org # v5.15+
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
There is a race condition between ksmbd_smb2_session_create and
ksmbd_expire_session. This patch adds the missing sessions_table_lock
when adding/deleting a session from the global session table.
Cc: stable@vger.kernel.org # v5.15+
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Multi-threaded buffered reads to the same file exposed significant
inode spinlock contention in nfs_clear_invalid_mapping().
Eliminate this spinlock contention by first checking the flags without
locking, using smp_rmb and smp_load_acquire accordingly, and only then
taking the spinlock and double-checking these inode flags.
Also refactor nfs_set_cache_invalid() slightly to use
smp_store_release() to pair with nfs_clear_invalid_mapping()'s
smp_load_acquire().
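
The check-then-lock-and-recheck pattern can be sketched in userspace with C11 atomics and a pthread mutex. This is an illustration under assumptions: the struct, the flag value and the demo_ function names are hypothetical, and C11 acquire/release operations stand in for the kernel barriers.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_INVALID_DATA 0x1u

/* Hypothetical inode-like object. */
struct demo_inode {
	pthread_mutex_t lock;
	atomic_uint cache_validity;
	unsigned int invalidations;	/* protected by lock */
};

/* Writer side: publish the flag with release semantics. */
static void demo_set_cache_invalid(struct demo_inode *ino, unsigned int flags)
{
	atomic_fetch_or_explicit(&ino->cache_validity, flags,
				 memory_order_release);
}

/* Returns true if this caller performed the invalidation. */
static bool demo_clear_invalid_mapping(struct demo_inode *ino)
{
	unsigned int v;
	bool need;

	/* Fast path: lockless check, paired with the release above. */
	v = atomic_load_explicit(&ino->cache_validity, memory_order_acquire);
	if (!(v & DEMO_INVALID_DATA))
		return false;

	/* Slow path: take the lock and check again, another thread may
	 * have cleared the flag in the meantime. */
	pthread_mutex_lock(&ino->lock);
	v = atomic_load_explicit(&ino->cache_validity, memory_order_relaxed);
	need = (v & DEMO_INVALID_DATA) != 0;
	if (need) {
		atomic_fetch_and_explicit(&ino->cache_validity,
					  ~DEMO_INVALID_DATA,
					  memory_order_relaxed);
		ino->invalidations++;
	}
	pthread_mutex_unlock(&ino->lock);
	return need;
}
```

Readers that find the flag clear never touch the lock, which is where the contention reduction in this pattern comes from.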
While this fix is beneficial for all multi-threaded buffered reads
issued by an NFS client, this issue was identified in the context of
surprisingly low LOCALIO performance with 4K multi-threaded buffered
read IO. This fix dramatically speeds up LOCALIO performance:
before: read: IOPS=1583k, BW=6182MiB/s (6482MB/s)(121GiB/20002msec)
after: read: IOPS=3046k, BW=11.6GiB/s (12.5GB/s)(232GiB/20001msec)
Fixes: 17dfeb9113 ("NFS: Fix races in nfs_revalidate_mapping")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
Fix the possibility of racing nfs_local_probe() resulting in:
list_add double add: new=ffff8b99707f9f58, prev=ffff8b99707f9f58, next=ffffffffc0f30000.
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:35!
Add nfs_uuid_init() to properly initialize all nfs_uuid_t members
(particularly its list_head).
Switch nfs_uuid_begin() to return bool: it returns false if
nfs_uuid_t is already in-use (its list_head is on a list). Update
nfs_local_probe() to return early if the nfs_client's cl_uuid
(nfs_uuid_t) is in-use.
Also, switch nfs_uuid_begin() from using list_add_tail_rcu() to
list_add_tail() -- rculist was used in an earlier version of the
localio code that had a lockless nfs_uuid_lookup interface.
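
A sketch of the "already in use" check, using a small userspace list. The ul_ and demo_uuid names are hypothetical, and the locking that real code would need around the list is deliberately left out.

```c
#include <stdbool.h>

struct ul_head {
	struct ul_head *next, *prev;
};

/* Hypothetical stand-in for an object that may sit on one global list. */
struct demo_uuid {
	struct ul_head list;
};

static void ul_init(struct ul_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool ul_empty(const struct ul_head *h)
{
	return h->next == h;
}

static void ul_add_tail(struct ul_head *n, struct ul_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Must be called before first use so the emptiness check is meaningful. */
static void demo_uuid_init(struct demo_uuid *u)
{
	ul_init(&u->list);
}

/* Returns false instead of double-adding when u is already on a list. */
static bool demo_uuid_begin(struct demo_uuid *u, struct ul_head *head)
{
	if (!ul_empty(&u->list))
		return false;
	ul_add_tail(&u->list, head);
	return true;
}
```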
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
When asked to set both an atime and an mtime to the current system time,
ensure that the setting is atomic by calling inode_update_timestamps()
only once with the appropriate flags.
Fixes: e12912d941 ("NFSv4: Add support for delegated atime and mtime attributes")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
When the client does an exclusive create and the server decides to store
the verifier in the timestamps, a SETATTR is subsequently sent to fix up
those timestamps. When that is the case, suppress the exceptions for
attribute delegations in nfs4_bitmap_copy_adjust().
Fixes: 32215c1f89 ("NFSv4: Don't request atime/mtime/size if they are delegated to us")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
If a timeout is specified in the mount options, it currently applies to
both the NFS protocol and (with v3) the MOUNT protocol. This is
sensible when they both use the same underlying protocol, or those
protocols are compatible w.r.t timeouts as RDMA and TCP are.
However if, for example, NFS is using TCP and MOUNT is using UDP then
using the same timeout doesn't make much sense.
If you run
mount -o vers=3,proto=tcp,mountproto=udp,timeo=600,retrans=5 \
server:/path /mountpoint
then the timeo=600 which was intended for the NFS/TCP request will
apply to the MOUNT/UDP requests with the result that there will only be
one request sent (because UDP has a maximum timeout of 60 seconds).
This is not what a reasonable person might expect.
This patch disables the sharing of timeout information in cases where
the underlying protocols are not compatible.
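
The compatibility rule described above can be sketched as a small predicate. The enum, struct and function names are hypothetical; the only rule encoded is the one stated in the text, that TCP and RDMA are compatible with each other with respect to timeouts while UDP is not compatible with either.

```c
#include <stdbool.h>

enum demo_xprt { DEMO_XPRT_UDP, DEMO_XPRT_TCP, DEMO_XPRT_RDMA };

/* Hypothetical timeout settings; set == false means "use defaults". */
struct demo_timeout {
	unsigned int timeo;
	unsigned int retrans;
	bool set;
};

/* Same protocol, or both connection-oriented (TCP/RDMA): compatible. */
static bool demo_timeouts_compatible(enum demo_xprt nfs, enum demo_xprt mnt)
{
	bool nfs_is_udp = nfs == DEMO_XPRT_UDP;
	bool mnt_is_udp = mnt == DEMO_XPRT_UDP;

	return nfs_is_udp == mnt_is_udp;
}

/* Timeout the MOUNT request should use, given the NFS mount options. */
static struct demo_timeout demo_mount_timeout(enum demo_xprt nfs,
					      enum demo_xprt mnt,
					      struct demo_timeout nfs_to)
{
	struct demo_timeout defaults = { 0, 0, false };

	return demo_timeouts_compatible(nfs, mnt) ? nfs_to : defaults;
}
```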
Fixes: c9301cb35b ("nfs: hornor timeo and retrans option when mounting NFSv3")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
Recently, we got a customer report that CIFS triggers an oops while
reconnecting to a server. [0]
The workload runs on Kubernetes, and some pods mount CIFS servers
in non-root network namespaces. The problem rarely happened, but
it was always while the pod was dying.
The root cause is wrong reference counting for the network namespace.
CIFS uses kernel sockets, which do not hold a refcnt on the netns that
the socket belongs to. That means CIFS must ensure the socket is
always freed before its netns; otherwise, use-after-free happens.
The repro steps are roughly:
1. mount CIFS in a non-root netns
2. drop packets from the netns
3. destroy the netns
4. unmount CIFS
We can reproduce the issue quickly with the script [1] below and see
the splat [2] if CONFIG_NET_NS_REFCNT_TRACKER is enabled.
When the socket is TCP, it is hard to guarantee the netns lifetime
without holding refcnt due to async timers.
Let's hold netns refcnt for each socket as done for SMC in commit
9744d2bf19 ("smc: Fix use-after-free in tcp_write_timer_handler().").
Note that we need to move put_net() from cifs_put_tcp_session() to
clean_demultiplex_info(); otherwise, __sock_create() still could touch a
freed netns while cifsd tries to reconnect from cifs_demultiplex_thread().
Also, maybe_get_net() cannot be put just before __sock_create() because
the code is not under RCU and there is a small chance that the same
address happened to be reallocated to another netns.
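
The lifetime rule can be illustrated with a userspace sketch in which every socket holds its own reference on the namespace. All demo_ names are hypothetical; the sketch models only the reference counting, not sockets, timers or RCU.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Counts destroyed namespaces, for observation only. */
static int demo_netns_freed;

struct demo_netns {
	atomic_int refcnt;
};

struct demo_sock {
	struct demo_netns *net;
};

/* The creator starts out holding one reference. */
static struct demo_netns *demo_netns_create(void)
{
	struct demo_netns *ns = malloc(sizeof(*ns));

	if (ns)
		atomic_init(&ns->refcnt, 1);
	return ns;
}

static void demo_get_net(struct demo_netns *ns)
{
	atomic_fetch_add(&ns->refcnt, 1);
}

/* Frees the namespace when the last reference is dropped. */
static void demo_put_net(struct demo_netns *ns)
{
	if (atomic_fetch_sub(&ns->refcnt, 1) == 1) {
		free(ns);
		demo_netns_freed++;
	}
}

/* The socket takes its own reference, so the netns outlives it. */
static struct demo_sock *demo_sock_create(struct demo_netns *ns)
{
	struct demo_sock *sk = malloc(sizeof(*sk));

	if (!sk)
		return NULL;
	demo_get_net(ns);
	sk->net = ns;
	return sk;
}

/* Drop the socket first, then the reference it was holding. */
static void demo_sock_release(struct demo_sock *sk)
{
	struct demo_netns *ns = sk->net;

	free(sk);
	demo_put_net(ns);
}
```

In this sketch the namespace survives its original owner dropping the last external reference for as long as a socket still exists, which is the ordering the patch needs to guarantee.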
[0]:
CIFS: VFS: \\XXXXXXXXXXX has not responded in 15 seconds. Reconnecting...
CIFS: Serverclose failed 4 times, giving up
Unable to handle kernel paging request at virtual address 14de99e461f84a07
Mem abort info:
ESR = 0x0000000096000004
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x04: level 0 translation fault
Data abort info:
ISV = 0, ISS = 0x00000004
CM = 0, WnR = 0
[14de99e461f84a07] address between user and kernel address ranges
Internal error: Oops: 0000000096000004 [#1] SMP
Modules linked in: cls_bpf sch_ingress nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver tcp_diag inet_diag veth xt_state xt_connmark nf_conntrack_netlink xt_nat xt_statistic xt_MASQUERADE xt_mark xt_addrtype ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment nft_compat nf_tables nfnetlink overlay nls_ascii nls_cp437 sunrpc vfat fat aes_ce_blk aes_ce_cipher ghash_ce sm4_ce_cipher sm4 sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 sha1_ce ena button sch_fq_codel loop fuse configfs dmi_sysfs sha2_ce sha256_arm64 dm_mirror dm_region_hash dm_log dm_mod dax efivarfs
CPU: 5 PID: 2690970 Comm: cifsd Not tainted 6.1.103-109.184.amzn2023.aarch64 #1
Hardware name: Amazon EC2 r7g.4xlarge/, BIOS 1.0 11/1/2018
pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : fib_rules_lookup+0x44/0x238
lr : __fib_lookup+0x64/0xbc
sp : ffff8000265db790
x29: ffff8000265db790 x28: 0000000000000000 x27: 000000000000bd01
x26: 0000000000000000 x25: ffff000b4baf8000 x24: ffff00047b5e4580
x23: ffff8000265db7e0 x22: 0000000000000000 x21: ffff00047b5e4500
x20: ffff0010e3f694f8 x19: 14de99e461f849f7 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000000000 x12: 3f92800abd010002
x11: 0000000000000001 x10: ffff0010e3f69420 x9 : ffff800008a6f294
x8 : 0000000000000000 x7 : 0000000000000006 x6 : 0000000000000000
x5 : 0000000000000001 x4 : ffff001924354280 x3 : ffff8000265db7e0
x2 : 0000000000000000 x1 : ffff0010e3f694f8 x0 : ffff00047b5e4500
Call trace:
fib_rules_lookup+0x44/0x238
__fib_lookup+0x64/0xbc
ip_route_output_key_hash_rcu+0x2c4/0x398
ip_route_output_key_hash+0x60/0x8c
tcp_v4_connect+0x290/0x488
__inet_stream_connect+0x108/0x3d0
inet_stream_connect+0x50/0x78
kernel_connect+0x6c/0xac
generic_ip_connect+0x10c/0x6c8 [cifs]
__reconnect_target_unlocked+0xa0/0x214 [cifs]
reconnect_dfs_server+0x144/0x460 [cifs]
cifs_reconnect+0x88/0x148 [cifs]
cifs_readv_from_socket+0x230/0x430 [cifs]
cifs_read_from_socket+0x74/0xa8 [cifs]
cifs_demultiplex_thread+0xf8/0x704 [cifs]
kthread+0xd0/0xd4
Code: aa0003f8 f8480f13 eb18027f 540006c0 (b9401264)
[1]:
CIFS_CRED="/root/cred.cifs"
CIFS_USER="Administrator"
CIFS_PASS="Password"
CIFS_IP="X.X.X.X"
CIFS_PATH="//${CIFS_IP}/Users/Administrator/Desktop/CIFS_TEST"
CIFS_MNT="/mnt/smb"
DEV="enp0s3"
cat <<EOF > ${CIFS_CRED}
username=${CIFS_USER}
password=${CIFS_PASS}
domain=EXAMPLE.COM
EOF
unshare -n bash -c "
mkdir -p ${CIFS_MNT}
ip netns attach root 1
ip link add eth0 type veth peer veth0 netns root
ip link set eth0 up
ip -n root link set veth0 up
ip addr add 192.168.0.2/24 dev eth0
ip -n root addr add 192.168.0.1/24 dev veth0
ip route add default via 192.168.0.1 dev eth0
ip netns exec root sysctl net.ipv4.ip_forward=1
ip netns exec root iptables -t nat -A POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE
mount -t cifs ${CIFS_PATH} ${CIFS_MNT} -o vers=3.0,sec=ntlmssp,credentials=${CIFS_CRED},rsize=65536,wsize=65536,cache=none,echo_interval=1
touch ${CIFS_MNT}/a.txt
ip netns exec root iptables -t nat -D POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE
"
umount ${CIFS_MNT}
[2]:
ref_tracker: net notrefcnt@000000004bbc008d has 1/1 users at
sk_alloc (./include/net/net_namespace.h:339 net/core/sock.c:2227)
inet_create (net/ipv4/af_inet.c:326 net/ipv4/af_inet.c:252)
__sock_create (net/socket.c:1576)
generic_ip_connect (fs/smb/client/connect.c:3075)
cifs_get_tcp_session.part.0 (fs/smb/client/connect.c:3160 fs/smb/client/connect.c:1798)
cifs_mount_get_session (fs/smb/client/trace.h:959 fs/smb/client/connect.c:3366)
dfs_mount_share (fs/smb/client/dfs.c:63 fs/smb/client/dfs.c:285)
cifs_mount (fs/smb/client/connect.c:3622)
cifs_smb3_do_mount (fs/smb/client/cifsfs.c:949)
smb3_get_tree (fs/smb/client/fs_context.c:784 fs/smb/client/fs_context.c:802 fs/smb/client/fs_context.c:794)
vfs_get_tree (fs/super.c:1800)
path_mount (fs/namespace.c:3508 fs/namespace.c:3834)
__x64_sys_mount (fs/namespace.c:3848 fs/namespace.c:4057 fs/namespace.c:4034 fs/namespace.c:4034)
do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
Fixes: 26abe14379 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The usual collection of singletons - please see the changelogs.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZyfGDAAKCRDdBJ7gKXxA
jr19AQD6bfDF/6L2Alq1QG26pgrgccEbKzDSzR6pBajwCbdrNQD/XPhiv3zRJfGf
lgt0Qkqwe/ApBhVYUnL8y1CePv3EDgA=
=W5W0
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-11-03-10-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"17 hotfixes. 9 are cc:stable. 13 are MM and 4 are non-MM.
The usual collection of singletons - please see the changelogs"
* tag 'mm-hotfixes-stable-2024-11-03-10-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify()
mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats
mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes
mm: shrinker: avoid memleak in alloc_shrinker_info
.mailmap: update e-mail address for Eugen Hristev
vmscan,migrate: fix page count imbalance on node stats when demoting pages
mailmap: update Jarkko's email addresses
mm: allow set/clear page_type again
nilfs2: fix potential deadlock with newly created symlinks
Squashfs: fix variable overflow in squashfs_readpage_block
kasan: remove vmalloc_percpu test
tools/mm: -Werror fixes in page-types/slabinfo
mm, swap: avoid over reclaim of full clusters
mm: fix PSWPIN counter for large folios swap-in
mm: avoid VM_BUG_ON when try to map an anon large folio to zero page.
mm/codetag: fix null pointer check logic for ref and tag
mm/gup: stop leaking pinned pages in low memory conditions
* fix a syzbot reported crash on filestreams
* Reduce cpu time spent searching for extents in
a very fragmented FS
* Check for delayed allocations before setting extsize
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZyIMDwAKCRBGdaER5Qtf
pllxAYCkk+mtDTD5xBfOVGZWO5MMFz8HqYcro5wrSCzgL8HDmW29kXTBYFviGn3R
3l/H6BEBgOk0EkI5qGOzijpzbsWyJeLzPzZtxQFPD8zFBdxSERCtbpqFDLLvLQWG
M+TLhUNkPQ==
=kKX4
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
- fix a syzbot reported crash on filestreams
- Reduce cpu time spent searching for extents in a very fragmented FS
- Check for delayed allocations before setting extsize
* tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: streamline xfs_filestream_pick_ag
xfs: fix finding a last resort AG in xfs_filestream_pick_ag
xfs: Reduce unnecessary searches when searching for the best extents
xfs: Check for delayed allocations before setting extsize
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZyTGVAAKCRCRxhvAZXjc
oltEAP9r8cWa3Tdv8DzMNWu/jezTUXoW/mX5Qe+c1L6faqj0WQD/dIVtBtG37Tfq
3Ci9F/GEWjKijtCQ5lwMGUq27jQJ1gk=
=/0iA
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.12-rc6.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull iomap fixes from Christian Brauner:
"Fixes for iomap to prevent data corruption bugs in the fallocate
unshare range implementation of fsdax and a small cleanup to turn
iomap_want_unshare_iter() into an inline function"
* tag 'vfs-6.12-rc6.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
iomap: turn iomap_want_unshare_iter into an inline function
fsdax: dax_unshare_iter needs to copy entire blocks
fsdax: remove zeroing code from dax_unshare_iter
iomap: share iomap_unshare_iter predicate code with fsdax
xfs: don't allocate COW extents when unsharing a hole
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZyTGAQAKCRCRxhvAZXjc
opd6AQCal4omyfS8FYe4VRRZ/0XHouagq99I0U0TAmKkvoKAsgD/XrdE+pSTEkPX
Pv4T9phh1cZRxcyKVu77UoYkuHJEDAg=
=Lu9R
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.12-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull filesystem fixes from Christian Brauner:
"VFS:
- Fix copy_page_from_iter_atomic() if KMAP_LOCAL_FORCE_MAP=y is set
- Add a get_tree_bdev_flags() helper that allows modifying, e.g.,
whether errors are logged into the filesystem context during
superblock creation. This is used by erofs to fix a userspace
regression where an error is currently logged when it's used on a
regular file, which is a newly allowed mode in erofs.
netfs:
- Fix the sysfs debug path in the documentation.
- Fix iov_iter_get_pages*() for folio queues by skipping the page
extraction if we're at the end of a folio.
afs:
- Fix moving subdirectories to a different parent directory.
autofs:
- Fix handling of AUTOFS_DEV_IOCTL_TIMEOUT_CMD ioctl in
validate_dev_ioctl(). The actual ioctl number, not the ioctl
command, needs to be checked for autofs"
* tag 'vfs-6.12-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
iov_iter: fix copy_page_from_iter_atomic() if KMAP_LOCAL_FORCE_MAP
autofs: fix thinko in validate_dev_ioctl()
iov_iter: Fix iov_iter_get_pages*() for folio_queue
afs: Fix missing subdir edit when renamed between parent dirs
doc: correcting the debug path for cachefiles
erofs: use get_tree_bdev_flags() to avoid misleading messages
fs/super.c: introduce get_tree_bdev_flags()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmck8eQACgkQxWXV+ddt
WDu05g/6AwrnvPkivC4iVOv4Wkzrpk4gm76smx91Y9B8tSDLI1pHaS27CvJz9iWl
vBKXPN3PQVQHwo6SPn+NjsFOSMkXlbBOVKpPU+MlZwH9Tuw66qcC+EnUCK2wEuAy
3TN7cUGIA4r/j+SkhgIz+Irlr5pjdb1KkPIMBEVGcVFqDIuvDaTEGBqTn2i/V5aa
dMn+gK+9rfngTOJ68t/pEFaX7SEWCvgMIcBpBB4/vs1gHm3ve2bcc1sBAdMxb1Se
SrxgZfq+Rc5tkMn540JaWGwkb0rLzwXlurK6ygTKDKCpH0IMX+pBvDkexh9Zj0ux
jejlRxiuDzTx3z2a7FjHDyp2sdZWMpq3sPsowpJ1Dsgi5EtSxTy4irmQuSAZY1Uj
/uo6YwV9aTGeiNDwZeKqKc/wOuAttaMZLr14s37pro9KxndFJ/XZBxeyB+euUCOw
B8AvAQVVIJAYQLyWINWruNKppqlgiO2RaN15RvvT2pX01d0TOx1KX1XFQku7YFxb
M/8ZNXzJ96XtkeyHL3euo3zj7N5jWtnCvPINugUG1ADQa+bc8aX336gld1neD6fs
QqIFIgzZG0l4N95viJilACrI6tW9zFnBqMyNFRhucKiX9aP9glOvhSfxfjcpDuQ/
i/LIyxVLwp8M3hPNvv8tC345+1C2ug9AD0OyhWjjIYPuiOxtTWs=
=alpB
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more stability fixes. There's one patch adding export of MIPS
cmpxchg helper, used in the error propagation fix.
- fix error propagation from split bios to the original btrfs bio
- fix merging of adjacent extents (normal operation, defragmentation)
- fix potential use after free after freeing btrfs device structures"
* tag 'for-6.12-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix defrag not merging contiguous extents due to merged extent maps
btrfs: fix extent map merging not happening for adjacent extents
btrfs: fix use-after-free of block device file in __btrfs_free_extra_devids()
btrfs: fix error propagation of split bios
MIPS: export __cmpxchg_small()
Various syzbot fixes, and the more notable ones:
- Fix for pointers in an extent overflowing the max (16) on a filesystem
with many devices: we were creating too many cached copies when moving
data around. Now, we only create at most one cached copy if there's a
promote target set.
Caching will be a bit broken for reflinked data until 6.13: I have
larger series queued up which significantly improves the plumbing for
data options down into the extent (bch_extent_rebalance) to fix this.
- Fix for deadlock on -ENOSPC on tiny filesystems
Allocation from the partial open_bucket list wasn't correctly
accounting partial open_buckets as free: this fixes the main cause of
tests timing out in the automated tests.
-----BEGIN PGP SIGNATURE-----
iQJOBAABCAA4FiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmckUrIaHGtlbnQub3Zl
cnN0cmVldEBsaW51eC5kZXYACgkQE6szbY3KbnaFYA//Qd9SD8v+ypavnaogWhqk
3bufCO8YJDV5DRQVuX/z36fia8zOKzWQGRAYvq0vF0mmgagwBE+AcBh6vvfDCxqZ
m1937IXcv/hHh2FFQau9gWEItTH9dGwyQjeDjB3xaTL5ZTGsAdA9558ygf8GAVOe
wD+W8Z8Qj09hAErnNS7y50t/PGbZDuG7AV2Dy2unp+fp6U0FVrZ3Z0bhFuhxcR7/
e3j49DoW4EZL7Gu1svn7nzehjWK4wx1wX7QhynPgSOVIhdj2Fc3XG76b3mBsuZF6
A/cBRKmSZsYL9MBK0vferqizqeuwlIJsvwpo/6zzukpyf8QOl+0IqPuAXFoz8vg3
vrdp9cdvzWvQNexTD2+7PYosCKoUswOvo0oIy8Iopkg4VGSreZib1sZeCPzw2FBK
AZcQaQSBLKojWpYsn9Dl2AlqEHHTvnopjr5wRXiimqKe/OcA3ugIvebUw2UE2ACp
/Z2ZQu615BtRYQM+dRIJJQ2CAy0F3EZxIXEXwc/yrH7kL2VBay8QCKp/k/9YYy4e
Nlxxw7alb/XGgT8GQgu24tho3yMKT621dLFOaAZ7x2HtLP8T56zL/L/wKWsocW/V
R8Kwqot6F1EVb3Q0LECUJottYQ+5I1Et7ZpVyOPxfqF1y7KOsuxKOmZFLO7i3Spc
fg0gOt/fyKrAF3zuSmWXne8=
=hzm/
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-10-31' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Various syzbot fixes, and the more notable ones:
- Fix for pointers in an extent overflowing the max (16) on a
filesystem with many devices: we were creating too many cached
copies when moving data around. Now, we only create at most one
cached copy if there's a promote target set.
Caching will be a bit broken for reflinked data until 6.13: I have
larger series queued up which significantly improves the plumbing
for data options down into the extent (bch_extent_rebalance) to fix
this.
- Fix for deadlock on -ENOSPC on tiny filesystems
Allocation from the partial open_bucket list wasn't correctly
accounting partial open_buckets as free: this fixes the main cause
of tests timing out in the automated tests"
* tag 'bcachefs-2024-10-31' of git://evilpiepirate.org/bcachefs:
bcachefs: Fix NULL ptr dereference in btree_node_iter_and_journal_peek
bcachefs: fix possible null-ptr-deref in __bch2_ec_stripe_head_get()
bcachefs: Fix deadlock on -ENOSPC w.r.t. partial open buckets
bcachefs: Don't filter partial list buckets in open_buckets_to_text()
bcachefs: Don't keep tons of cached pointers around
bcachefs: init freespace inited bits to 0 in bch2_fs_initialize
bcachefs: Fix unhandled transaction restart in fallocate
bcachefs: Fix UAF in bch2_reconstruct_alloc()
bcachefs: fix null-ptr-deref in have_stripes()
bcachefs: fix shift oob in alloc_lru_idx_fragmentation
bcachefs: Fix invalid shift in validate_sb_layout()
Commit 78ff640819 ("vfs: Convert tracefs to use the new mount API"),
which converted tracefs to use the new mount APIs, caused mount options
(e.g. gid=<gid>) to not take effect.
The tracefs superblock can be updated from multiple paths:
- on fs_initcall() to init_trace_printk_function_export()
- from a work queue to initialize eventfs
tracer_init_tracefs_work_func()
- from the fsconfig() syscall to mount or remount tracefs
The tracefs superblock root inode gets created early on in
init_trace_printk_function_export().
With the new mount API, tracefs effectively uses get_tree_single() instead
of the old API mount_single().
Previously, mount_single() ensured that the options are always applied to
the superblock root inode:
(1) If the root inode didn't exist, call fill_super() to create it
and apply the options.
(2) If the root inode exists, call reconfigure_single() which
effectively calls tracefs_apply_options() to parse and apply
options to the superblock's fs_info and inode and remount
eventfs (if necessary)
On the other hand, get_tree_single() effectively calls vfs_get_super()
which:
(3) If the root inode doesn't exist, calls fill_super() to create it
and apply the options.
(4) If the root inode already exists, updates the fs_context root
with the superblock's root inode.
(4) above is always the case for tracefs mounts, since the super block's
root inode will already be created by init_trace_printk_function_export().
This means that the mount options get ignored:
- Since they aren't applied to the superblock's root inode, they don't
get inherited by the children.
- Since eventfs is initialized from a separate work queue, before the
call to mount with the options, and doesn't get remounted on mount,
it doesn't pick up the options either.
Ensure that the mount options are applied to the super block and eventfs
is remounted to respect the mount options.
To understand this better, if fstab has the following:
tracefs /sys/kernel/tracing tracefs nosuid,nodev,noexec,gid=tracing 0 0
On boot up, permissions look like:
# ls -l /sys/kernel/tracing/trace
-rw-r----- 1 root root 0 Nov 1 08:37 /sys/kernel/tracing/trace
When it should look like:
# ls -l /sys/kernel/tracing/trace
-rw-r----- 1 root tracing 0 Nov 1 08:37 /sys/kernel/tracing/trace
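The difference between the two paths can be modelled in userspace. The
following is only a sketch under stated assumptions: the Superblock class,
the *_model() helpers and the apply_on_reuse switch are hypothetical names
invented for illustration, not kernel code or the actual fix.

```python
from dataclasses import dataclass, field


@dataclass
class Superblock:
    """Toy stand-in for the tracefs superblock (hypothetical model)."""
    root_exists: bool = False
    opts: dict = field(default_factory=lambda: {"gid": "root"})


def fill_super(sb, opts):
    # Create the root inode and apply the options.
    sb.root_exists = True
    sb.opts.update(opts)


def mount_single_model(sb, opts):
    # Old API: options are applied on both paths, cases (1) and (2).
    if not sb.root_exists:
        fill_super(sb, opts)
    else:
        sb.opts.update(opts)  # models reconfigure_single()
    return sb


def get_tree_single_model(sb, opts, apply_on_reuse=False):
    # New API: an existing root is reused as-is, case (4), unless the
    # options are explicitly reapplied (the behaviour the fix restores).
    if not sb.root_exists:
        fill_super(sb, opts)
    elif apply_on_reuse:
        sb.opts.update(opts)
    return sb


def early_boot_sb():
    # The root inode already exists before the first mount, with defaults.
    sb = Superblock()
    fill_super(sb, {})
    return sb


old = mount_single_model(early_boot_sb(), {"gid": "tracing"})
new = get_tree_single_model(early_boot_sb(), {"gid": "tracing"})
fixed = get_tree_single_model(early_boot_sb(), {"gid": "tracing"},
                              apply_on_reuse=True)
print(old.opts["gid"], new.opts["gid"], fixed.opts["gid"])
```

In this model only the middle case keeps the default group, which
corresponds to the "root root" listing above.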
Link: https://lore.kernel.org/r/536e99d3-345c-448b-adee-a21389d7ab4b@redhat.com/
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Ali Zahraee <ahzahraee@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 78ff640819 ("vfs: Convert tracefs to use the new mount API")
Link: https://lore.kernel.org/20241030171928.4168869-2-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When running defrag (manual defrag) against a file that has extents that
are contiguous and we already have the respective extent maps loaded and
merged, we end up not defragging the range covered by those contiguous
extents. This happens when we have an extent map that was the result of
merging multiple extent maps for contiguous extents and the length of the
merged extent map is greater than or equal to the defrag threshold
length.
The script below reproduces this scenario:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount $DEV $MNT
# Create a 256K file with 4 extents of 64K each.
xfs_io -f -c "falloc 0 64K" \
-c "pwrite 0 64K" \
-c "falloc 64K 64K" \
-c "pwrite 64K 64K" \
-c "falloc 128K 64K" \
-c "pwrite 128K 64K" \
-c "falloc 192K 64K" \
-c "pwrite 192K 64K" \
$MNT/foo
umount $MNT
echo -n "Initial number of file extent items: "
btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l
mount $DEV $MNT
# Read the whole file in order to load and merge extent maps.
cat $MNT/foo > /dev/null
btrfs filesystem defragment -t 128K $MNT/foo
umount $MNT
echo -n "Number of file extent items after defrag with 128K threshold: "
btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l
mount $DEV $MNT
# Read the whole file in order to load and merge extent maps.
cat $MNT/foo > /dev/null
btrfs filesystem defragment -t 256K $MNT/foo
umount $MNT
echo -n "Number of file extent items after defrag with 256K threshold: "
btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l
Running it:
$ ./test.sh
Initial number of file extent items: 4
Number of file extent items after defrag with 128K threshold: 4
Number of file extent items after defrag with 256K threshold: 4
The 4 extents don't get merged because we have an extent map with a size
of 256K that is the result of merging the individual extent maps for each
of the four 64K extents and at defrag_lookup_extent() we have a value of
zero for the generation threshold ('newer_than' argument) since this is a
manual defrag. As a consequence we don't call defrag_get_extent() to get
an extent map representing a single file extent item in the inode's
subvolume tree, so we end up using the merged extent map at
defrag_collect_targets() and decide not to defrag.
Fix this by updating defrag_lookup_extent() to always discard extent maps
that were merged and call defrag_get_extent() regardless of the minimum
generation threshold ('newer_than' argument).
A test case for fstests will be sent along soon.
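The decision described above can be approximated with a small userspace
model. This is a sketch, not the kernel code: ExtentMap, use_cached_map()
and defrag_targets() are hypothetical names, and the generation value 7 is
arbitrary.

```python
from dataclasses import dataclass

KIB = 1024


@dataclass
class ExtentMap:
    start: int
    length: int
    generation: int
    merged: bool = False


def use_cached_map(em, newer_than, fixed):
    """Return True if the cached extent map is trusted for defrag."""
    if em.merged:
        if fixed:
            return False  # always fall back to the file extent items
        # Before the fix a merged map was only discarded when a
        # generation threshold was given and the map was new enough.
        return not (newer_than and em.generation >= newer_than)
    return True


def defrag_targets(file_extents, cached_map, threshold, newer_than=0,
                   fixed=False):
    if use_cached_map(cached_map, newer_than, fixed):
        candidates = [cached_map]
    else:
        candidates = file_extents  # models defrag_get_extent()
    # Extents at or above the threshold are not worth defragging.
    return [em for em in candidates if em.length < threshold]


items = [ExtentMap(k * 64 * KIB, 64 * KIB, 7) for k in range(4)]
merged = ExtentMap(0, 256 * KIB, 7, merged=True)

before = defrag_targets(items, merged, 128 * KIB, fixed=False)
after = defrag_targets(items, merged, 128 * KIB, fixed=True)
print(len(before), len(after))
```

With a manual defrag (newer_than of zero) the unfixed model trusts the
256K merged map and selects nothing, while the fixed model sees the four
64K extents as targets.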
CC: stable@vger.kernel.org # 6.1+
Fixes: 199257a78b ("btrfs: defrag: don't use merged extent map for their generation check")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have 3 or more adjacent extents in a file, that is, consecutive file
extent items pointing to adjacent extents, within a contiguous file range
and with compatible flags, we end up not merging all the extents into a
single extent map.
For example:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ xfs_io -f -d -c "pwrite -b 64K 0 64K" \
-c "pwrite -b 64K 64K 64K" \
-c "pwrite -b 64K 128K 64K" \
-c "pwrite -b 64K 192K 64K" \
/mnt/sdc/foo
After all the ordered extents complete we unpin the extent maps and try
to merge them, but instead of getting a single extent map we get two
because:
1) When the first ordered extent completes (file range [0, 64K)) we
unpin its extent map and attempt to merge it with the extent map for
the range [64K, 128K), but we can't because that extent map is still
pinned;
2) When the second ordered extent completes (file range [64K, 128K)), we
unpin its extent map and merge it with the previous extent map, for
file range [0, 64K), but we can't merge with the next extent map, for
the file range [128K, 192K), because this one is still pinned.
The merged extent map for the file range [0, 128K) gets the flag
EXTENT_MAP_MERGED set;
3) When the third ordered extent completes (file range [128K, 192K)), we
unpin its extent map and attempt to merge it with the previous extent
map, for file range [0, 128K), but we can't because that extent map
has the flag EXTENT_MAP_MERGED set (mergeable_maps() returns false
due to different flags) while the extent map for the range [128K, 192K)
doesn't have that flag set.
We also can't merge it with the next extent map, for file range
[192K, 256K), because that one is still pinned.
At this moment we have 3 extent maps:
One for file range [0, 128K), with the flag EXTENT_MAP_MERGED set.
One for file range [128K, 192K).
One for file range [192K, 256K) which is still pinned;
4) When the fourth and final extent completes (file range [192K, 256K)),
we unpin its extent map and attempt to merge it with the previous
extent map, for file range [128K, 192K), which succeeds since none
of these extent maps have the EXTENT_MAP_MERGED flag set.
So we end up with 2 extent maps:
One for file range [0, 128K), with the flag EXTENT_MAP_MERGED set.
One for file range [128K, 256K), with the flag EXTENT_MAP_MERGED set.
Since after merging extent maps we don't attempt to merge again, that
is, merge the resulting extent map with the one that is now preceding
it (and the one following it), we end up with those two extent maps,
when we could have had a single extent map to represent the whole file.
Fix this by making mergeable_maps() ignore the EXTENT_MAP_MERGED flag.
While this doesn't present any functional issue, it prevents the merging
of extent maps, which would save memory, and can also prevent defrag from
merging extents (that will be addressed in the next patch).
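The four completion steps above can be replayed with a small userspace
simulation. This is a simplified sketch: ExtentMap, mergeable() and
unpin_and_merge() are hypothetical models of the behaviour described in
this changelog, not the kernel implementation.

```python
from dataclasses import dataclass

KIB = 1024


@dataclass
class ExtentMap:
    start: int
    end: int
    pinned: bool = True
    merged: bool = False


def mergeable(prev, nxt, ignore_merged_flag):
    if prev.pinned or nxt.pinned:
        return False
    if prev.end != nxt.start:
        return False
    # Before the fix, a differing MERGED flag made the flags differ.
    return ignore_merged_flag or prev.merged == nxt.merged


def unpin_and_merge(maps, start, ignore_merged_flag):
    i = next(k for k, em in enumerate(maps) if em.start <= start < em.end)
    maps[i].pinned = False
    # Try the previous neighbour, then the next one, once each.
    if i > 0 and mergeable(maps[i - 1], maps[i], ignore_merged_flag):
        maps[i - 1] = ExtentMap(maps[i - 1].start, maps[i].end, False, True)
        del maps[i]
        i -= 1
    if i + 1 < len(maps) and mergeable(maps[i], maps[i + 1],
                                       ignore_merged_flag):
        maps[i] = ExtentMap(maps[i].start, maps[i + 1].end, False, True)
        del maps[i + 1]


def simulate(ignore_merged_flag):
    maps = [ExtentMap(k * 64 * KIB, (k + 1) * 64 * KIB) for k in range(4)]
    for k in range(4):  # ordered extents complete in file order
        unpin_and_merge(maps, k * 64 * KIB, ignore_merged_flag)
    return [(em.start // KIB, em.end // KIB) for em in maps]


print(simulate(ignore_merged_flag=False))
print(simulate(ignore_merged_flag=True))
```

The first call ends with two maps, [0, 128K) and [128K, 256K), as in the
walk-through; with the flag ignored the model ends with one map covering
[0, 256K).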
Fixes: 199257a78b ("btrfs: defrag: don't use merged extent map for their generation check")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Syzbot reported that page_symlink(), called by nilfs_symlink(), triggers
memory reclamation involving the filesystem layer, which can result in
circular lock dependencies among the reader/writer semaphore
nilfs->ns_segctor_sem, s_writers percpu_rwsem (intwrite) and the
fs_reclaim pseudo lock.
This is because after commit 21fc61c73c ("don't put symlink bodies in
pagecache into highmem"), the gfp flags of the page cache for symbolic
links are overwritten to GFP_KERNEL via inode_nohighmem().
This is not a problem for symlinks read from the backing device, because
the __GFP_FS flag is dropped after inode_nohighmem() is called. However,
when a new symlink is created with nilfs_symlink(), the gfp flags remain
overwritten to GFP_KERNEL. Then, memory allocation called from
page_symlink() etc. triggers memory reclamation including the FS layer,
which may call nilfs_evict_inode() or nilfs_dirty_inode(). And these can
cause a deadlock if they are called while nilfs->ns_segctor_sem is held.
Fix this issue by dropping the __GFP_FS flag from the page cache GFP flags
of newly created symlinks in the same way that nilfs_new_inode() and
__nilfs_read_inode() do, as a workaround until we adopt nofs allocation
scope consistently or improve the locking constraints.
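The effect of dropping the flag can be shown with plain bit masks. This is
a sketch only: the bit values below are made up for illustration (the real
kernel values differ), and new_symlink_mapping_gfp() is a hypothetical
helper that models the mapping's GFP mask before and after the fix.

```python
# Hypothetical bit values for illustration only.
GFP_IO = 0x1       # models __GFP_IO
GFP_FS = 0x2       # models __GFP_FS
GFP_RECLAIM = 0x4  # models __GFP_RECLAIM
GFP_KERNEL = GFP_RECLAIM | GFP_IO | GFP_FS


def may_recurse_into_fs(mapping_gfp):
    """Reclaim may call back into the filesystem only with the FS bit."""
    return bool(mapping_gfp & GFP_FS)


def new_symlink_mapping_gfp(drop_fs):
    gfp = GFP_KERNEL      # mask left behind for new symlinks, per the text
    if drop_fs:
        gfp &= ~GFP_FS    # models the fix
    return gfp


print(may_recurse_into_fs(new_symlink_mapping_gfp(drop_fs=False)))
print(may_recurse_into_fs(new_symlink_mapping_gfp(drop_fs=True)))
```

With the FS bit cleared, allocations for the symlink's page cache can
still reclaim memory but no longer re-enter filesystem code while
nilfs->ns_segctor_sem may be held.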
Link: https://lkml.kernel.org/r/20241020050003.4308-1-konishi.ryusuke@gmail.com
Fixes: 21fc61c73c ("don't put symlink bodies in pagecache into highmem")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+9ef37ac20608f4836256@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9ef37ac20608f4836256
Tested-by: syzbot+9ef37ac20608f4836256@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Syzbot reports a slab out of bounds access in squashfs_readpage_block().
This is caused by an attempt to read page index 0x2000000000. This value
(start_index) is stored in an integer loop variable which overflows
producing a value of 0. This causes a loop which iterates over pages
start_index -> end_index to iterate over 0 -> end_index, which ultimately
causes an out of bounds page array access.
Fix by changing the variable to a loff_t, and renaming it to index to make
it clearer that it is a page index, and not a loop count.
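The truncation is easy to reproduce in userspace. The sketch below uses
Python's ctypes purely to mimic C integer widths; it assumes a 32-bit int,
as on the common Linux ABIs, and models loff_t as a 64-bit long long.

```python
import ctypes

start_index = 0x2000000000  # page index from the report above

# ctypes integer types do no overflow checking, so the value is
# silently truncated to the width of the C type.
as_int = ctypes.c_int(start_index).value
as_loff_t = ctypes.c_longlong(start_index).value

print(hex(as_int), hex(as_loff_t))
```

Only bit 37 is set in 0x2000000000, so the low 32 bits are all zero and
the int copy becomes 0, while the 64-bit copy keeps the original value.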
Link: https://lkml.kernel.org/r/20241020232200.837231-1-phillip@squashfs.org.uk
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/all/ZwzcnCAosIPqQ9Ie@ly-workstation/
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The error flow in nfsd4_copy() calls cleanup_async_copy(), which
already decrements nn->pending_async_copies.
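A toy model of the accounting, assuming the error flow previously
decremented the counter itself before calling the cleanup helper. That
assumption, the NfsdNet class and the extra_decrement switch are
illustrative only and are not taken from the patch.

```python
class NfsdNet:
    """Hypothetical stand-in for the per-net nfsd state."""

    def __init__(self):
        self.pending_async_copies = 0


def cleanup_async_copy(nn):
    # The cleanup helper already drops the pending count.
    nn.pending_async_copies -= 1


def copy_error_path(nn, extra_decrement):
    nn.pending_async_copies += 1      # accounted when the copy is set up
    if extra_decrement:
        nn.pending_async_copies -= 1  # redundant decrement (assumed bug)
    cleanup_async_copy(nn)
    return nn.pending_async_copies


print(copy_error_path(NfsdNet(), extra_decrement=True))
print(copy_error_path(NfsdNet(), extra_decrement=False))
```

In the model, a second decrement on the error path drives the counter
below zero, while relying on the cleanup helper alone leaves it balanced.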
Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Fixes: aadc3bbea1 ("NFSD: Limit the number of concurrent async COPY operations")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Directly return the error from xfs_bmap_longest_free_extent instead
of breaking from the loop and handling it there, and use a done
label to directly jump to the exit when we found a suitable perag
structure to reduce the indentation level and pag/max_pag check
complexity in the tail of the function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When the main loop in xfs_filestream_pick_ag fails to find a suitable
AG it tries to just pick the online AG. But the loop for that uses
args->pag as loop iterator while the later code expects pag to be
set. Fix this by reusing the max_pag case for this last resort, and
also add a check for the impossible case of no AG just to make sure that
the uninitialized pag doesn't even escape in theory.
Reported-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com
Fixes: f8f1ed1ab3 ("xfs: return a referenced perag from filestreams allocator")
Cc: <stable@vger.kernel.org> # v6.3
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Recently, we found that the CPU spent a lot of time in
xfs_alloc_ag_vextent_size when the filesystem has millions of fragmented
spaces.
The reason is that we conducted much extra searching for extents that
could not yield a better result, and these searches would cost a lot of
time when there were millions of extents to search through. Even if we
get the same result length, we don't switch our choice to the new one,
so we can definitely terminate the search early.
Since the result length cannot exceed the found length, when the found
length equals the best result length we already have, we can conclude
the search.
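The early termination can be illustrated with a simplified scan. This is a
sketch under assumptions: candidates are visited from longest to shortest,
usable_len() is a hypothetical callback standing in for the length left
after alignment, and best_extent() is not the kernel function.

```python
def best_extent(free_extents, usable_len, stop_on_equal):
    """Scan candidates longest first; return (best_len, visited)."""
    best = 0
    visited = 0
    for flen in sorted(free_extents, reverse=True):
        # The result length can never exceed the found length, so a
        # candidate that is not longer than the best result so far
        # cannot improve on it.
        if flen < best or (stop_on_equal and flen == best):
            break
        visited += 1
        best = max(best, usable_len(flen))
    return best, visited


extents = [2] * 100_000  # many small, equally sized free extents

print(best_extent(extents, lambda flen: flen, stop_on_equal=False))
print(best_extent(extents, lambda flen: flen, stop_on_equal=True))
```

Both calls return the same best length, but the second stops after the
first candidate instead of walking every equally sized extent.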
We did a test in that filesystem:
[root@localhost ~]# xfs_db -c freesp /dev/vdb
   from      to extents  blocks    pct
      1       1     215     215   0.01
      2       3  994476 1988952  99.99
Before this patch:
0) | xfs_alloc_ag_vextent_size [xfs]() {
0) * 15597.94 us | }
After this patch:
0) | xfs_alloc_ag_vextent_size [xfs]() {
0) 19.176 us | }
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Extsize should only be allowed to be set on files with no data in them.
For this, we check if the files have extents but fail to check if
delayed extents are present. This patch adds that check.
While we are at it, also refactor this check into a helper since
it's used in some other places as well like xfs_inactive() or
xfs_ioctl_setattr_xflags().
**Without the patch (SUCCEEDS)**
$ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536'
wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec)
**With the patch (FAILS as expected)**
$ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536'
wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec)
xfs_io: FS_IOC_FSSETXATTR testfile: Invalid argument
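The two outcomes above can be mirrored by a small model of the check. The
InodeModel class, its field names and both has_data_*() helpers are
hypothetical; they only illustrate that a freshly written buffer has
delayed blocks but no mapped extents yet.

```python
import errno
from dataclasses import dataclass


@dataclass
class InodeModel:
    nextents: int = 0      # extents already mapped on disk
    delayed_blks: int = 0  # delayed-allocation blocks not yet mapped


def has_data_old(ip):
    return ip.nextents > 0


def has_data_new(ip):
    return ip.nextents > 0 or ip.delayed_blks > 0


def set_extsize(ip, has_data):
    if has_data(ip):
        raise OSError(errno.EINVAL, "Invalid argument")
    return True


# State right after a buffered 1 KiB write: data exists only as delalloc.
ip = InodeModel(nextents=0, delayed_blks=1)

print(set_extsize(ip, has_data_old))
try:
    set_extsize(ip, has_data_new)
except OSError as err:
    print(err.strerror)
```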
Fixes: e94af02a9c ("[XFS] fix old xfs_setattr mis-merge from irix; mostly harmless esp if not using xfs rt")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Mounting btrfs from two images (which share one fsid but have two
different dev_uuids) in a certain order may trigger a UAF of
'device->bdev_file' in __btrfs_free_extra_devids(). The details are as
follows:
1. Attach image_1 to loop0, attach image_2 to loop1, and scan btrfs
   devices by ioctl(BTRFS_IOC_SCAN_DEV):
             / btrfs_device_1 → loop0
   fs_device
             \ btrfs_device_2 → loop1
2. mount /dev/loop0 /mnt
   btrfs_open_devices
    btrfs_device_1->bdev_file = btrfs_get_bdev_and_sb(loop0)
    btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1)
   btrfs_fill_super
    open_ctree
     fail: btrfs_close_devices // -ENOMEM
      btrfs_close_bdev(btrfs_device_1)
       fput(btrfs_device_1->bdev_file)
        // btrfs_device_1->bdev_file is freed
      btrfs_close_bdev(btrfs_device_2)
       fput(btrfs_device_2->bdev_file)
3. mount /dev/loop1 /mnt
   btrfs_open_devices
    btrfs_get_bdev_and_sb(&bdev_file)
     // EIO, btrfs_device_1->bdev_file is not assigned,
     // which points to a freed memory area
    btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1)
   btrfs_fill_super
    open_ctree
     btrfs_free_extra_devids
      if (btrfs_device_1->bdev_file)
       fput(btrfs_device_1->bdev_file) // UAF !
Fix it by setting 'device->bdev_file' to NULL after closing the
btrfs_device in btrfs_close_one_device().
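The stale pointer can be modelled in userspace as follows. This is only a
sketch: BdevFile, BtrfsDevice and the helper functions are hypothetical
stand-ins, and a RuntimeError takes the place of the real use-after-free.

```python
class BdevFile:
    def __init__(self, name):
        self.name = name
        self.freed = False


def fput(f):
    if f.freed:
        raise RuntimeError(f"use-after-free of {f.name}")
    f.freed = True


class BtrfsDevice:
    def __init__(self):
        self.bdev_file = None


def close_one_device(dev, clear_pointer):
    if dev.bdev_file is not None:
        fput(dev.bdev_file)
        if clear_pointer:
            dev.bdev_file = None  # models the fix


def free_extra_devids(dev):
    if dev.bdev_file is not None:
        fput(dev.bdev_file)
        dev.bdev_file = None


def scenario(clear_pointer):
    dev1 = BtrfsDevice()
    dev1.bdev_file = BdevFile("loop0")     # step 2: first mount opens it
    close_one_device(dev1, clear_pointer)  # step 2: open_ctree fails
    # Step 3: reopening fails with EIO, so bdev_file is not reassigned.
    try:
        free_extra_devids(dev1)
    except RuntimeError as err:
        return str(err)
    return "ok"


print(scenario(clear_pointer=False))
print(scenario(clear_pointer=True))
```

Without clearing the pointer, the second mount's cleanup releases the
already freed file again; with the pointer cleared, the NULL check skips
it.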
Fixes: 1423881941 ("btrfs: do not background blkdev_put()")
CC: stable@vger.kernel.org # 4.19+
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219408
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>