Relocation in a zoned filesystem can fail with a transaction abort with
error -22 (EINVAL). This happens because the relocation code assumes that
the extents we relocated the data to have the same size the source extents
had and ensures this by preallocating the extents.
But in a zoned filesystem we currently can't preallocate the extents as
this would break the sequential write required rule. Therefore it can
happen that the writeback process kicks in while we're still adding pages
to a delalloc range and starts writing out dirty pages.
This then creates destination extents that are smaller than the source
extents, triggering the following safety check in get_new_location():
1034 if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) {
1035 ret = -EINVAL;
1036 goto out;
1037 }
Temporarily create a dedicated block group for the relocation process, so
no non-relocation data writes can interfere with the relocation writes.
This is needed that we can switch the relocation process on a zoned
filesystem from the REQ_OP_ZONE_APPEND writing we use for data to a scheme
like in a non-zoned filesystem using REQ_OP_WRITE and preallocation.
Fixes: 32430c6148 ("btrfs: zoned: enable relocation on a zoned filesystem")
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several places in our codebase where we check if a root is the
root of the data reloc tree and subsequent patches will introduce more.
Factor out the check into a small helper function instead of open coding
it multiple times.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function repair_io_failure() is no longer used out of extent_io.c since
commit 8b9b6f2554 ("btrfs: scrub: cleanup the remaining nodatasum
fixup code"), which removes the last external caller.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a regular file in full sync mode, we are currently committing
its delayed inode item. This is to ensure that we never miss copying the
inode item, with its most up to date data, into the log tree.
However that is not necessary since commit e4545de5b0 ("Btrfs: fix fsync
data loss after append write"), because even if we don't find the leaf
with the inode item when looking for leaves that changed in the current
transaction, we end up logging the inode item later using the in-memory
content. In case we find the leaf containing the inode item, we already
end up using the in-memory inode for filling the inode item in the log
tree, and not the inode item that is in the fs/subvolume tree, as it
might be not up to date (copy_items() -> fill_inode_item()).
So don't commit the delayed inode item, which brings a couple of benefits:
1) Avoid writing the inode item to the fs/subvolume btree, saving time and
reducing lock contention on the btree;
2) In case no other item for the inode was changed, added or deleted in
the same leaf where the inode item is located, we ended up copying
all the items in that leaf to the log tree - it's harmless from a
functional point of view, but it wastes time and log tree space.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 10/10 and the following test results compare a branch with
the whole patch set applied versus a branch without any of the patches
applied.
The following script was used to test dbench with 8 and 16 jobs on a
machine with 12 cores, 64G of RAM, a NVME device and using a non-debug
kernel config (Debian's default):
$ cat test.sh
#!/bin/bash
if [ $# -ne 1 ]; then
echo "Use $0 NUM_JOBS"
exit 1
fi
NUM_JOBS=$1
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-m single -d single"
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 120 $NUM_JOBS
umount $MNT
The results were the following:
8 jobs, before patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 4113896 0.009 238.665
Close 3021699 0.001 0.590
Rename 174215 0.082 238.733
Unlink 830977 0.049 238.642
Deltree 96 2.232 8.022
Mkdir 48 0.003 0.005
Qpathinfo 3729013 0.005 2.672
Qfileinfo 653206 0.001 0.152
Qfsinfo 683866 0.002 0.526
Sfileinfo 335055 0.004 1.571
Find 1441800 0.016 4.288
WriteX 2049644 0.010 3.982
ReadX 6449786 0.003 0.969
LockX 13400 0.002 0.043
UnlockX 13400 0.001 0.075
Flush 288349 2.521 245.516
Throughput 1075.73 MB/sec 8 clients 8 procs max_latency=245.520 ms
8 jobs, after patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 4154282 0.009 156.675
Close 3051450 0.001 0.843
Rename 175912 0.072 4.444
Unlink 839067 0.048 66.050
Deltree 96 2.131 5.979
Mkdir 48 0.002 0.004
Qpathinfo 3765575 0.005 3.079
Qfileinfo 659582 0.001 0.099
Qfsinfo 690474 0.002 0.155
Sfileinfo 338366 0.004 1.419
Find 1455816 0.016 3.423
WriteX 2069538 0.010 4.328
ReadX 6512429 0.003 0.840
LockX 13530 0.002 0.078
UnlockX 13530 0.001 0.051
Flush 291158 2.500 163.468
Throughput 1105.45 MB/sec 8 clients 8 procs max_latency=163.474 ms
+2.7% throughput, -40.1% max latency
16 jobs, before patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 5457602 0.033 337.098
Close 4008979 0.002 2.018
Rename 231051 0.323 254.054
Unlink 1102209 0.202 337.243
Deltree 160 6.521 31.720
Mkdir 80 0.003 0.007
Qpathinfo 4946147 0.014 6.988
Qfileinfo 867440 0.001 1.642
Qfsinfo 907081 0.003 1.821
Sfileinfo 444433 0.005 2.053
Find 1912506 0.067 7.854
WriteX 2724852 0.018 7.428
ReadX 8553883 0.003 2.059
LockX 17770 0.003 0.350
UnlockX 17770 0.002 0.627
Flush 382533 2.810 353.691
Throughput 1413.09 MB/sec 16 clients 16 procs max_latency=353.696 ms
16 jobs, after patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 5393156 0.034 303.181
Close 3961986 0.002 1.502
Rename 228359 0.320 253.379
Unlink 1088920 0.206 303.409
Deltree 160 6.419 30.088
Mkdir 80 0.003 0.004
Qpathinfo 4887967 0.015 7.722
Qfileinfo 857408 0.001 1.651
Qfsinfo 896343 0.002 2.147
Sfileinfo 439317 0.005 4.298
Find 1890018 0.073 8.347
WriteX 2693356 0.018 6.373
ReadX 8453485 0.003 3.836
LockX 17562 0.003 0.486
UnlockX 17562 0.002 0.635
Flush 378023 2.802 315.904
Throughput 1454.46 MB/sec 16 clients 16 procs max_latency=315.910 ms
+2.9% throughput, -11.3% max latency
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an extent, in the fast fsync path, we always attempt do drop
or trim any existing extents with a range that match or overlap the range
of the extent we are about to log. We do that through a call to
btrfs_drop_extents().
However this is not needed when we are logging the inode for the first
time in the current transaction, since we have no inode items of the
inode in the log tree. Calling btrfs_drop_extents() does a deletion search
on the log tree, which is expensive when we have concurrent tasks
accessing the log tree because a deletion search always acquires a write
lock on the extent buffers at levels 2, 1 and 0, adding significant lock
contention, specially taking into account the height of a log tree rarely
(if ever) goes beyond 2 or 3, due to its short life.
So skip the call to btrfs_drop_extents() when the inode was not previously
logged in the current transaction.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 9/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we are logging that an inode exists and the inode was not logged
before, we can avoid searching in the log tree for the inode item since we
know it does not exists. That wastes time and adds more lock contention on
the extent buffers of the log tree when there are other tasks that are
logging other inodes.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 8/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Whenever we are logging a file inode in full sync mode we call
btrfs_truncate_inode_items() to delete items of the inode we may have
previously logged.
That results in doing a btree search for deletion, which is expensive
because it always acquires write locks for extent buffers at levels 2, 1
and 0, and it balances any node that is less than half full. Acquiring
the write locks can block the task if the extent buffers are already
locked by another task or block other tasks attempting to lock them,
which is specially bad in case of log trees since they are small due to
their short life, with a root node at a level typically not greater than
level 2.
If we know that we are logging the inode for the first time in the current
transaction, we can skip the call to btrfs_truncate_inode_items(), avoiding
the deletion search. This change does that.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 7/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the call to btrfs_truncate_inode_items(), and the surrounding retry
loop, into a local helper function. This avoids some repetition and avoids
making the next change a bit awkward due to a bit of too much indentation.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 6/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Whenever we are logging a directory inode, logging that an inode exists or
logging an inode that has changes in its references or xattrs, we attempt
to delete items of this inode we may have previously logged (through calls
to drop_objectid_items()).
That attempt does a btree search for deletion, which is expensive because
it always acquires write locks for extent buffers at levels 2, 1 and 0,
and it balances any node that is less than half full. Acquiring the write
locks can block the task if the extent buffers are already locked or block
other tasks attempting to lock them, which is specially bad in case of log
trees since they are small due to their short life, with a root node at a
level typically not greater than level 2.
If we know that we are logging the inode for the first time in the current
transaction, we can skip the search. This change does that.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 5/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we are logging a new name for an inode, due to a link or rename
operation, if the inode has ancestor inodes that are new, created in the
current transaction, we need to log that these inodes exist. To ensure
that a subsequent explicit fsync on one of these ancestor inodes does
sync the log, we don't set the logged_trans field of these inodes.
This was done in commit 75b463d2b4 ("btrfs: do not commit logs and
transactions during link and rename operations"), to avoid syncing a
log after a rename or link operation.
In order to allow for future changes to do some optimizations, change
this behaviour to always update the logged_trans of any logged inode
and don't update the last_log_commit of the inode if we are logging
that it exists. This accomplishes that same objective with simpler
logic, allowing for some optimizations in the next patches.
So just do that simplification.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 4/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a new name for an inode, due to a link or rename operation,
we don't need to log all new dentries of the parent directories and their
subdirectories. We only want to log the names of the inode and that any
new parent directories exist. So in this case don't trigger logging of
the new dentries, that is only need when doing an explicit fsync on a
directory or on a file which requires logging its parent directories.
This avoids unnecessary work and reduces contention on the extent buffers
of a log tree.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 3/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 75b463d2b4 ("btrfs: do not commit logs and transactions
during link and rename operations"), we always pass a non-NULL log context
to btrfs_log_inode_parent() and therefore to all the functions that it
calls. So remove the checks we have all over the place that test for a
NULL log context, making the code shorter and easier to read, as well as
reducing the size of the generated code.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 2/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In case an inode was never logged since it was loaded from disk and was
modified in the current transaction (its ->last_trans matches the ID of
the current transaction), inode_logged() returns true even if there's no
existing log tree. In this case we can simply check if a log tree exists
and return false if it does not. This avoids a caller of inode_logged()
doing some unnecessary, but harmless, work.
For btrfs_log_new_name() it avoids it logging an inode in case it was
never logged since it was loaded from disk and there is currently no log
tree for the inode's root. For the remaining callers of inode_logged(),
btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log(), it has
no effect since they already check if a log tree exists through their
calls to join_running_log_trans().
So just add a check to inode_logged() to verify if a log tree exists, and
return false if it does not.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 1/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There were few lockdep warnings because btrfs_show_devname() was using
device_list_mutex as recorded in the commits:
0ccd05285e ("btrfs: fix a possible umount deadlock")
779bf3fefa ("btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex")
And finally, commit 88c14590cd ("btrfs: use RCU in btrfs_show_devname
for device list traversal") removed the device_list_mutex from
btrfs_show_devname for performance reasons.
This patch removes a stale comment about the function
btrfs_show_devname and device_list_mutex.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The test case btrfs/238 reports the warning below:
WARNING: CPU: 3 PID: 481 at fs/btrfs/super.c:2509 btrfs_show_devname+0x104/0x1e8 [btrfs]
CPU: 2 PID: 1 Comm: systemd Tainted: G W O 5.14.0-rc1-custom #72
Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
Call trace:
btrfs_show_devname+0x108/0x1b4 [btrfs]
show_mountinfo+0x234/0x2c4
m_show+0x28/0x34
seq_read_iter+0x12c/0x3c4
vfs_read+0x29c/0x2c8
ksys_read+0x80/0xec
__arm64_sys_read+0x28/0x34
invoke_syscall+0x50/0xf8
do_el0_svc+0x88/0x138
el0_svc+0x2c/0x8c
el0t_64_sync_handler+0x84/0xe4
el0t_64_sync+0x198/0x19c
Reason:
While btrfs_prepare_sprout() moves the fs_devices::devices into
fs_devices::seed_list, the btrfs_show_devname() searches for the devices
and found none, leading to the warning as in above.
Fix:
latest_dev is updated according to the changes to the device list.
That means we could use the latest_dev->name to show the device name in
/proc/self/mounts, the pointer will be always valid as it's assigned
before the device is deleted from the list in remove or replace.
The RCU protection is sufficient as the device structure is freed after
synchronization.
Reported-by: Su Yue <l@damenly.su>
Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation to fix a bug in btrfs_show_devname().
Convert fs_devices::latest_bdev type from struct block_device to struct
btrfs_device and, rename the member to fs_devices::latest_dev.
So that btrfs_show_devname() can use fs_devices::latest_dev::name.
Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We will no longer write to a relocating block group. So, we can finish it
now.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have written to the zone capacity, the device automatically
deactivates the zone. Sync up block group side (the active BG list and
zone_is_active flag) with it.
We need to do it both on data BGs and metadata BGs. On data side, we add a
hook to btrfs_finish_ordered_io(). On metadata side, we use
end_extent_buffer_writeback().
To reduce excess lookup of a block group, we mark the last extent buffer in
a block group with EXTENT_BUFFER_ZONE_FINISH flag. This cannot be done for
data (ordered_extent), because the address may change due to
REQ_OP_ZONE_APPEND.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The current extent allocator tries to allocate a new block group when the
existing block groups do not have enough space. On a ZNS device, a new
block group means a new active zone. If the number of active zones has
already reached the max_active_zones, activating a new zone needs to finish
an existing zone, leading to wasting the free space there.
So, instead, it should reuse the existing active block groups as much as
possible when we can't activate any other zones without sacrificing an
already activated block group.
While at it, I converted find_free_extent_update_loop() to check the
found_extent() case early and made the other conditions simpler.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are passing too many variables as it is from btrfs_reserve_extent() to
find_free_extent(). The next commit will add min_alloc_size to ffe_ctl, and
that means another pass-through argument. Take this opportunity to move
ffe_ctl one level up and drop the redundant arguments.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Activate new block group at btrfs_make_block_group(). We do not check the
return value. If failed, we can try again later at the actual extent
allocation phase.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Activate a block group when trying to allocate an extent from it. We check
read-only case and no space left case before trying to activate a block
group not to consume the number of active zones uselessly.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Load activeness of underlying zones of a block group. When underlying zones
are active, we add the block group to the fs_info->zone_active_bgs list.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add zone_is_active flag to btrfs_block_group. This flag indicates the
underlying zones are all active. Such zone active block groups are tracked
by fs_info->active_bg_list.
btrfs_dev_{set,clear}_active_zone() take responsibility for the underlying
device part. They set/clear the bitmap to indicate zone activeness and
count the number of zones we can activate left.
btrfs_zone_{activate,finish}() take responsibility for the logical part and
the list management. In addition, btrfs_zone_finish() wait for any writes
on it and send REQ_OP_ZONE_FINISH to the zone.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We will use a block group's physical location to track active zones and
finish fully written zones in the following commits. Since the zone
activation is done in the extent allocation context which already holding
the tree locks, we can't query the chunk tree for the physical locations.
So, copy the location info into a block group and use it for activation.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The ZNS specification defines a limit on the number of zones that can be in
the implicit open, explicit open or closed conditions. Any zone with such
condition is defined as an active zone and correspond to any zone that is
being written or that has been only partially written. If the maximum
number of active zones is reached, we must either reset or finish some
active zones before being able to chose other zones for storing data.
Load queue_max_active_zones() and track the number of active zones left on
the device.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If there is no more space left for a new superblock in a superblock zone,
then it is better to ZONE_FINISH the zone and frees up the active zone
count.
Since btrfs_advance_sb_log() can now issue REQ_OP_ZONE_FINISH, we also need
to convert it to return int for the error case.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
sb_write_pointer() returns the write position of next superblock. For READ,
we need a previous location. When the pointer is at the head, the previous
one is the last one of the other zone. Calculate the last one's position
from zone capacity.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We cannot write beyond zone capacity. So, we should consider a zone as
"full" when the write pointer goes beyond capacity - the size of super
info.
Also, take this opportunity to replace a subtle duplicated code with a loop
and fix a typo in comment.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the introduction of zone capacity, the range [capacity, length] is
always zone unusable. Counting this region as a reclaim target will
cause reclaiming too early. Reclaim block groups based on bytes that can
be usable after resetting.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we introduced capacity in a block group, we need to calculate free
space using the capacity instead of the length. Thus, bytes we account
capacity - alloc_pointer as free, and account bytes [capacity, length] as
zone unusable.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_free_excluded_extents() is not neccessary for
btrfs_calc_zone_unusable() and it makes btrfs_calc_zone_unusable()
difficult to reuse. Move it out and call btrfs_free_excluded_extents()
in proper context.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The ZNS specification introduces the concept of a Zone Capacity. A zone
capacity is an additional per-zone attribute that indicates the number of
usable logical blocks within each zone, starting from the first logical
block of each zone. It is always smaller or equal to the zone size.
With the SINGLE profile, we can set a block group's "capacity" as the same
as the underlying zone's Zone Capacity. We will limit the allocation not
to exceed in a following commit.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the new infrastructure which has taken subpage into consideration,
now we should be safe to allow defrag to work for subpage case.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now the old infrastructure can all be removed, defrag
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function defrag_one_cluster() is able to defrag one range well
enough, we only need to do preparation for it, including:
- Clamp and align the defrag range
- Exclude invalid cases
- Proper inode locking
The old infrastructures will not be removed in this patch, as it would
be too noisy to review.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This new helper, defrag_one_cluster(), will defrag one cluster (at most
256K):
- Collect all initial targets
- Kick in readahead when possible
- Call defrag_one_range() on each initial target
With some extra range clamping.
- Update @sectors_defragged parameter
This involves one behavior change, the defragged sectors accounting is
no longer as accurate as old behavior, as the initial targets are not
consistent.
We can have new holes punched inside the initial target, and we will
skip such holes later.
But the defragged sectors accounting doesn't need to be that accurate
anyway, thus I don't want to pass those extra accounting burden into
defrag_one_range().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A new helper, defrag_one_range(), is introduced to defrag one range.
This function will mostly prepare the needed pages and extent status for
defrag_one_locked_target().
As we can only have a consistent view of extent map with page and extent
bits locked, we need to re-check the range passed in to get a real
target list for defrag_one_locked_target().
Since defrag_collect_targets() will call defrag_lookup_extent() and lock
extent range, we also need to teach those two functions to skip extent
lock. Thus new parameter, @locked, is introduced to skip extent lock if
the caller has already locked the range.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A new helper, defrag_one_locked_target(), introduced to do the real part
of defrag.
The caller needs to ensure both page and extents bits are locked, and no
ordered extent exists for the range, and all writeback is finished.
The core defrag part is pretty straight-forward:
- Reserve space
- Set extent bits to defrag
- Update involved pages to be dirty
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a helper, defrag_collect_targets(), to collect all possible
targets to be defragged.
This function will not consider things like max_sectors_to_defrag, thus
caller should be responsible to ensure we don't exceed the limit.
This function will be the first stage of later defrag rework.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In cluster_pages_for_defrag(), we have complex code block inside one
for() loop.
The code block is to prepare one page for defrag, this will ensure:
- The page is locked and set up properly.
- No ordered extent exists in the page range.
- The page is uptodate.
This behavior is pretty common and will be reused by later defrag
rework.
So factor out the code into its own helper, defrag_prepare_one_page(),
for later usage, and cleanup the code by a little.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When testing subpage defrag support, I always find some strange inode
nbytes error, after a lot of debugging, it turns out that
defrag_lookup_extent() is using PAGE_SIZE as size for
lookup_extent_mapping().
Since lookup_extent_mapping() is calling __lookup_extent_mapping() with
@strict == 1, this means any extent map smaller than one page will be
ignored, prevent subpage defrag to grab a correct extent map.
There are quite some PAGE_SIZE usage in ioctl.c, but most of them are
correct usages, and can be one of the following cases:
- ioctl structure size check
We want ioctl structure to be contained inside one page.
- real page operations
The remaining cases in defrag_lookup_extent() and
check_defrag_in_cache() will be addressed in this patch.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In function cluster_pages_for_defrag() we have a window where we unlock
page, either start the ordered range or read the content from disk.
When we re-lock the page, we need to make sure it still has the correct
page->private for subpage.
Thus add the extra PagePrivate check here to handle subpage cases
properly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_defrag_file() accepts both "struct inode" and "struct
file" as parameter. We can easily grab "struct inode" from "struct
file" using file_inode() helper.
The reason why we need "struct file" is just to re-use its f_ra.
Change this to pass "struct file_ra_state" parameter, so that it's more
clear what we really want. Since we're here, also add some comments on
the function btrfs_defrag_file().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_chunk_readonly() checks if the given chunk is writeable. It
returns 1 for readonly, and 0 for writeable. So the return argument type
bool shall suffice instead of the current type int.
Also, rename btrfs_chunk_readonly() to btrfs_chunk_writeable() as we
check if the bg is writeable, and helps to keep the logic at the parent
function simpler to understand.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix a warning reported by smatch that ret could be returned without
initialized. The dedupe operations are supposed to to return 0 for a 0
length range but the caller does not pass olen == 0. To keep this
behaviour and also fix the warning initialize ret to 0.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we use u16 bitmap to make 4k sectorsize work for 64K page
size.
But this u16 bitmap is not large enough to contain larger page size like
128K, nor is space efficient for 16K page size.
To handle both cases, here we pack all subpage bitmaps into a larger
bitmap, now btrfs_subpage::bitmaps[] will be the ultimate bitmap for
subpage usage.
Each sub-bitmap will has its start bit number recorded in
btrfs_subpage_info::*_start, and its bitmap length will be recorded in
btrfs_subpage_info::bitmap_nr_bits.
All subpage bitmap operations will be converted from using direct u16
operations to bitmap operations, with above *_start calculated.
For 64K page size with 4K sectorsize, this should not cause much
difference.
While for 16K page size, we will only need 1 unsigned long (u32) to
store all the bitmaps, which saves quite some space.
Furthermore, this allows us to support larger page size like 128K and
258K.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we use fixed size u16 bitmap for subpage bitmap. This is fine
for 4K sectorsize with 64K page size.
But for 4K sectorsize and larger page size, the bitmap is too small,
while for smaller page size like 16K, u16 bitmaps waste too much space.
Here we introduce a new helper structure, btrfs_subpage_bitmap_info, to
record the proper bitmap size, and where each bitmap should start at.
By this, we can later compact all subpage bitmaps into one u32 bitmap.
This patch is the first step.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The existing calling convention of btrfs_alloc_subpage() is pretty
awful. Change it to a more common pattern by returning struct
btrfs_subpage directly and let the caller to determine if the call
succeeded.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two call sites of btrfs_alloc_subpage():
- btrfs_attach_subpage()
We have ensured sectorsize is smaller than PAGE_SIZE
- alloc_extent_buffer()
We call btrfs_alloc_subpage() unconditionally.
The alloc_extent_buffer() forces us to check the sectorsize size against
page size inside btrfs_alloc_subpage().
Since the function name, btrfs_alloc_subpage(), already indicates it
should only get called for subpage cases, do the check in
alloc_extent_buffer() and add an ASSERT() in btrfs_alloc_subpage().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update it since commit 944d3f9fac ("btrfs: switch seed device to
list api") did conversion from fs_devices::seed to fs_devices::seed_list.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need for the variable ret after d66105cfa873 ("btrfs:
allocate btrfs_ioctl_quota_rescan_args on stack"), remove it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The out label is being overused, we can simply return if the condition
permits.
No functional changes.
Reviewed-by: Su Yue <l@damenly.su>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The user facing function used to allocate new chunks is
btrfs_chunk_alloc, unfortunately there is yet another similar sounding
function - btrfs_alloc_chunk. This creates confusion, especially since
the latter function can be considered "private" in the sense that it
implements the first stage of chunk creation and as such is called by
btrfs_chunk_alloc.
To avoid the awkwardness that comes with having similarly named but
distinctly different in their purpose function rename btrfs_alloc_chunk
to btrfs_create_chunk, given that the main purpose of this function is
to orchestrate the whole process of allocating a chunk - reserving space
into devices, deciding on characteristics of the stripe size and
creating the in-memory structures.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The second argument was only used by the USB gadget code, yet everyone
pays the overhead of passing a zero to be passed into aio, where it
ends up being part of the aio res2 value.
Now that everybody is passing in zero, kill off the extra argument.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ioprio setup doesn't depend on other fields that are modified in
io_prep_rw() and we can move it down in the function without worrying
about performance. It's useful as it makes iocb->ki_flags
accesses/modifications closer together, so it's more likely the compiler
will cache it in a register and avoid extra reloads.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8ee98779c06f1b59f6039b1e292db4332efd664b.1634987320.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
opcode prep functions are one of the first things that are called, we
can't have ->async_data allocated at this point and it's certainly a
bug. Reflect this assumption in io_timeout_prep() and add a WARN_ONCE
just in case.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/75a28ca7dbcc5af8b6cd9092819e8384c24dedd4.1634987320.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If an opcode doesn't support polling, just let it be executed
synchronously in iowq, otherwise it will do a nonblock attempt just to
fail in io_arm_poll_handler() and return back to blocking execution.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6401256db01b88f448f15fcd241439cb76f5b940.1634987320.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
->pollout or ->pollin are set only for opcodes that need a file, so if
io_arm_poll_handler() tests them first we can be sure that the request
has file set and the ->file check can be removed.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9adfe4f543d984875e516fce6da35348aab48668.1634987320.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we've got IO_WQ_WORK_CANCEL in io_wq_submit_work(), handle the error
on the same lines as the check instead of having a weird code flow. The
main loop doesn't change but goes one indention left.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ff4a09cf41f7a22bbb294b6f1faea721e21fe615.1634987320.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Do a bit of cleaning for the main loop of io_wq_submit_work(). Get rid
of switch, just replace it with a single if as we're retrying in both
other cases. Kill issue_sqe label, Get rid of needs_poll nesting and
disambiguate a bit the comment.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ed12ce0c64e051f9a6b8a37a24f8ea554d299c29.1634987320.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coverity complains of an unused value:
CID 119623 (#1 of 1): Unused value (UNUSED_VALUE)
assigned_value: Assigning value -1 to error here, but that stored value is
overwritten before it can be used.
237 error = -EPERM;
Fix it by removing the assignment.
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Add a might_sleep call into gfs2_glock_put which can sleep in DLM when
the last reference is released. This will show problems earlier, and
not only when the last reference is put.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
So far, glock_hash_walk took a reference on each glock it iterated over, and it
was the examiner's responsibility to drop those references. Dropping the final
reference to a glock can sleep and the examiners are called in a RCU critical
section with spin locks held, so examiners that didn't need the extra reference
had to drop it asynchronously via gfs2_glock_queue_put or similar. This wasn't
done correctly in thaw_glock which did call gfs2_glock_put, and not at all in
dump_glock_func.
Change glock_hash_walk to not take glock references at all. That way, the
examiners that don't need them won't have to bother with slow asynchronous
puts, and the examiners that do need references can take them themselves.
Reported-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In gfs2_inode_lookup and gfs2_create_inode, we're calling
gfs2_cancel_delete_work which currently cancels any remote delete work
(delete_work_func) synchronously. This means that if the work is
currently running, it will wait for it to finish. We're doing this to
pevent a previous instance of an inode from having any influence on the
next instance.
However, delete_work_func uses gfs2_inode_lookup internally, and we can
end up in a deadlock when delete_work_func gets interrupted at the wrong
time. For example,
(1) An inode's iopen glock has delete work queued, but the inode
itself has been evicted from the inode cache.
(2) The delete work is preempted before reaching gfs2_inode_lookup.
(3) Another process recreates the inode (gfs2_create_inode). It tries
to cancel any outstanding delete work, which blocks waiting for
the ongoing delete work to finish.
(4) The delete work calls gfs2_inode_lookup, which blocks waiting for
gfs2_create_inode to instantiate and unlock the new inode =>
deadlock.
It turns out that when the delete work notices that its inode has been
re-instantiated, it will do nothing. This means that it's safe to
cancel the delete work asynchronously. This prevents the kind of
deadlock described above.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, function gfs2_create_inode called glock_set_object to
set the gl_object for inode and iopen glocks before the glock was locked.
That's wrong because other competing processes like evict may be
blocked waiting for the glock and still have gl_object set before the
actual eviction can take place.
This patch moves the call to glock_set_object until after the glock is
acquire in function gfs2_create_inode, so it waits for possibly
competing evicts to finish their processing first.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The new GLF_INSTANTIATE_NEEDED flag obsoletes the old rgrp flag
GFS2_RDF_UPTODATE, so this patch replaces it like we did with inodes.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
With the addition of the new GLF_INSTANTIATE_NEEDED flag, the
GIF_INVALID flag is now redundant. This patch removes it.
Since inode_instantiate is only called when instantiation is needed,
the check in inode_instantiate is removed too.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, when a glock was locked, the very first holder on the
queue would unlock the lockref and call the go_instantiate glops function
(if one existed), unless GL_SKIP was specified. When we introduced the new
node-scope concept, we allowed multiple holders to lock glocks in EX mode
and share the lock.
But node-scope introduced a new problem: if the first holder has GL_SKIP
and the next one does NOT, since it is not the first holder on the queue,
the go_instantiate op was not called. Eventually the GL_SKIP holder may
call the instantiate sub-function (e.g. gfs2_rgrp_bh_get) but there was
still a window of time in which another non-GL_SKIP holder assumes the
instantiate function had been called by the first holder. In the case of
rgrp glocks, this led to a NULL pointer dereference on the buffer_heads.
This patch tries to fix the problem by introducing two new glock flags:
GLF_INSTANTIATE_NEEDED, which keeps track of when the instantiate function
needs to be called to "fill in" or "read in" the object before it is
referenced.
GLF_INSTANTIATE_IN_PROG which is used to determine when a process is
in the process of reading in the object. Whenever a function needs to
reference the object, it checks the GLF_INSTANTIATE_NEEDED flag, and if
set, it sets GLF_INSTANTIATE_IN_PROG and calls the glops "go_instantiate"
function.
As before, the gl_lockref spin_lock is unlocked during the IO operation,
which may take a relatively long amount of time to complete. While
unlocked, if another process determines go_instantiate is still needed,
it sees GLF_INSTANTIATE_IN_PROG is set, and waits for the go_instantiate
glop operation to be completed. Once GLF_INSTANTIATE_IN_PROG is cleared,
it needs to check GLF_INSTANTIATE_NEEDED again because the other process's
go_instantiate operation may not have been successful.
Functions that previously called the instantiate sub-functions now call
directly into gfs2_instantiate so the new bits are managed properly.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function do_promote had a section of code that did
the actual instantiation. This patch splits that off into its own
function, gfs2_instantiate, which prepares us for the next patch that
will use that function.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch further simplifies function do_promote by eliminating some
redundant code in favor of using a lock_released flag. This is just
prep work for a future patch.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch simply re-factors function do_promote to reduce the indents.
The logic should be unchanged. This makes future patches more readable.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Remove the 'first' argument of trace_gfs2_promote: with GL_SKIP, the
'first' holder isn't the one that instantiates the glock
(gl_instantiate), which is what the 'first' flag was apparently supposed
to indicate.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, the go_lock glock operations (glops) did not do
any actual locking. They were used to instantiate objects, like reading
in dinodes and rgrps from the media.
This patch renames the functions to go_instantiate for clarity.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, failed consistency checks printed out the object
that failed, but not the object's glock. This patch makes it also
print out the object glock so we can see the glock's holders and flags
to aid with debugging.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, if function gfs2_inode_lookup encountered an error
after it had locked the iopen glock, it never unlocked it, relying on
the evict code to do the cleanup. The evict code then took the
inode glock while holding the iopen glock, which violates the locking
order. For example,
(1) node A does a gfs2_inode_lookup that fails, leaving the iopen glock
locked.
(2) node B calls delete_work_func -> gfs2_lookup_by_inum ->
gfs2_inode_lookup. It locks the inode glock and blocks trying to
lock the iopen glock, which is held by node A.
(3) node A eventually calls gfs2_evict_inode -> evict_should_delete.
It blocks trying to lock the inode glock, which is now held by
node B.
This patch introduces error handling to function gfs2_inode_lookup
so it properly dequeues held iopen glocks on errors.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, when a glock was locked by function gfs2_glock_nq_init,
it initialized the holder gh_ip (return address) as gfs2_glock_nq_init.
That made it extremely difficult to track down problems because many
functions call gfs2_glock_nq_init. This patch changes the function so
that it saves gh_ip from the caller of gfs2_glock_nq_init, which makes
it easy to backtrack which holder took the lock.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, function do_gfs2_set_flags checked if the append
and immutable flags were being set while already set. If so, error -EPERM
was given. There's no reason why these two flags should be mutually
exclusive, and if you set them separately, you will, in essence, set
one while it is already set. For example:
chattr +a /mnt/gfs2/file1
chattr +i /mnt/gfs2/file1
The first command sets the append-only flag. Since they are additive,
the second command sets the immutable flag AND append-only flag,
since they both coexist in i_diskflags. So the second command should
not return an error. This bug caused xfstests generic/545 to fail.
This patch simply removes the invalid checks.
I also eliminated an unused parm from do_gfs2_set_flags.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In rgrp.c, there are several places where it does BUG_ON. This tells us
the call stack but nothing more, which is not very helpful.
This patch switches them to GLOCK_BUG_ON which also prints the glock,
its holders, and many of the rgrp values, which will help us debug
problems in the future.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, each individual "go_lock" glock operation (glop)
checked the GL_SKIP flag, and if set, would skip further processing.
This patch changes the logic so the go_lock caller, function go_promote,
checks the GL_SKIP flag before calling the go_lock op in the first place.
This avoids having to unnecessarily unlock gl_lockref.lock only to
re-lock it again.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Somehow, the GL_SKIP flag was missed when dumping glock holders.
This patch adds it to function hflags2str. I added it at the end because
I wanted Holder and Skip flags together to read "Hs" rather than "sH"
to avoid confusion with "Shared" ("SH") holder state.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_rgrp_go_lock checked if GL_SKIP and
ar_rgrplvb were both true. However, GL_SKIP is only set for rgrps if
ar_rgrplvb is true (see gfs2_inplace_reserve). This patch simply removes
the redundant check.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Also disable page faults during direct I/O requests and implement a
similar kind of retry logic as in the buffered I/O case.
The retry logic in the direct I/O case differs from the buffered I/O
case in the following way: direct I/O doesn't provide the kinds of
consistency guarantees between concurrent reads and writes that buffered
I/O provides, so once we lose the inode glock while faulting in user
pages, we always resume the operation. We never need to return a
partial read or write.
This locking problem was originally reported by Jan Kara. Linus came up
with the idea of disabling page faults. Many thanks to Al Viro and
Matthew Wilcox for their feedback.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Currently, ->lru is a way to arrange non-LRU pages and has some
in-kernel users. In order to minimize noticable issues of page
reclaim and cache thrashing under high memory presure, limited
temporary pages were all chained with ->lru and can be reused
during the request. However, it seems that ->lru could be removed
when folio is landing.
Let's use page->private to chain temporary pages for now instead
and transform EROFS formally after the topic of the folio / file
page design is finalized.
Link: https://lore.kernel.org/r/20211022090120.14675-1-hsiangkao@linux.alibaba.com
Cc: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Pull autofs fix from Al Viro:
"Fix for a braino of mine (in getting rid of open-coded
dentry_path_raw() in autofs a couple of cycles ago).
Mea culpa... Obvious -stable fodder"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
autofs: fix wait name hash calculation in autofs_wait()
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmFzb/4ACgkQiiy9cAdy
T1HhOQv/bTsGX85V5QFBQyW7F7ymYaSuDpvb/nYFzzs4fMz8/lbCJ6GQHMK7G+uR
ZYFdZw5c5A+KsfjRoRAKOc6HJbGERnKWvIlRkNWnmcpsXcR3PFhhOxXKXxCTBjga
fHxAJBENbsyCtGz25pUtuOKVG9zVaz31lyXgnz53b0ATrb8CxIll7zdKTp7aGdK0
UIWUEYeUwctIhvyBXyms8ni+15keop6/7Y1KhUvodL/VhU4YCkKerFFQojsqLEBk
4N6ZrgEEvzZPSjMt3KkbAapMNvf8Jgy6hKrB10MWB2sJG3bG1A4MuWYzVz7hlLbx
KXztSjbEiExCw7Y8ZXeA0pT/P51TVB9uxSLaDJhGVrXIxkZ4exUz2pS0Tp2E8eMH
4BGylAV9vrZqjmF3HQHJu8c/+f833dwbmMzgDFlFgR01U8NQYQvLf6ZoxmnQxdJC
5CVzO2rGV1bAVEbCJAN/wnKSCpEzjcN0ruz9bcje8MFF9mXXiJxATIgNahuaj17t
AvOnAbwd
=FhM8
-----END PGP SIGNATURE-----
Merge tag '5.15-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull ksmbd fixes from Steve French:
"Ten fixes for the ksmbd kernel server, for improved security and
additional buffer overflow checks:
- a security improvement to session establishment to reduce the
possibility of dictionary attacks
- fix to ensure that maximum i/o size negotiated in the protocol is
not less than 64K and not more than 8MB to better match expected
behavior
- fix for crediting (flow control) important to properly verify that
sufficient credits are available for the requested operation
- seven additional buffer overflow, buffer validation checks"
* tag '5.15-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: add buffer validation in session setup
ksmbd: throttle session setup failures to avoid dictionary attacks
ksmbd: validate OutputBufferLength of QUERY_DIR, QUERY_INFO, IOCTL requests
ksmbd: validate credit charge after validating SMB2 PDU body size
ksmbd: add buffer validation for smb direct
ksmbd: limit read/write/trans buffer size not to exceed 8MB
ksmbd: validate compound response buffer
ksmbd: fix potencial 32bit overflow from data area check in smb2_write
ksmbd: improve credits management
ksmbd: add validation in smb2_ioctl
Add a done_before argument to iomap_dio_rw that indicates how much of
the request has already been transferred. When the request succeeds, we
report that done_before additional bytes were tranferred. This is
useful for finishing a request asynchronously when part of the request
has already been completed synchronously.
We'll use that to allow iomap_dio_rw to be used with page faults
disabled: when a page fault occurs while submitting a request, we
synchronously complete the part of the request that has already been
submitted. The caller can then take care of the page fault and call
iomap_dio_rw again for the rest of the request, passing in the number of
bytes already tranferred.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
In iomap_dio_rw, when iomap_apply returns an -EFAULT error and the
IOMAP_DIO_PARTIAL flag is set, complete the request synchronously and
return a partial result. This allows the caller to deal with the page
fault and retry the remainder of the request.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
When a user copy fails in one of the helpers of iomap_dio_rw, fail with
-EFAULT instead of returning 0. This matches what iomap_dio_bio_actor
returns when it gets an -EFAULT from bio_iov_iter_get_pages. With these
changes, iomap_dio_actor now consistently fails with -EFAULT when a user
page cannot be faulted in.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In the .read_iter and .write_iter file operations, we're accessing
user-space memory while holding the inode glock. There is a possibility
that the memory is mapped to the same file, in which case we'd recurse
on the same glock.
We could detect and work around this simple case of recursive locking,
but more complex scenarios exist that involve multiple glocks,
processes, and cluster nodes, and working around all of those cases
isn't practical or even possible.
Avoid these kinds of problems by disabling page faults while holding the
inode glock. If a page fault would occur, we either end up with a
partial read or write or with -EFAULT if nothing could be read or
written. In either case, we know that we're not done with the
operation, so we indicate that we're willing to give up the inode glock
and then we fault in the missing pages. If that made us lose the inode
glock, we return a partial read or write. Otherwise, we resume the
operation.
This locking problem was originally reported by Jan Kara. Linus came up
with the idea of disabling page faults. Many thanks to Al Viro and
Matthew Wilcox for their feedback.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFzfyQQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpiO/D/9cqYHpjGDwyftzQFJFfEy9ny6nlLm6lJef
hsrZjC0S649FnXc0YHVLDH3/nos0XsQUYvVJnAMW9EHB6x/95JRUyxzouVz1Fewp
w8Z+lOKymIf3X1LQoB6KQXH5ayohNtUo6HA0Ye/v+iEG+bq/lo9tCMSshpJs3afq
UWW8RxGhrMHfqfgn/8Kkz8fEqZjXz7tssZ+1AFftTxKbk97ZWPahwjvO+xLFWl/m
NbMkHf3xeAvDL747ccrVBOerRZUPySXZElgkPzdjQ4y5HHZrpxt/ZR9Xu7XRzgkJ
7SEmsJ80vla19u3eW/oAn3T4EEGS3qWlei8T47kKIoT1W52S3rqjwsV/30re16GW
sGMWdFiH/GW3VnOxs0/a4/q70je3E9DicSTs4SALTwnvjQ+vrunWgG6ojtxLcieT
Br+km8nmDPug1wxoH2gQLN/EhGcH5hQvi4ZMiMH8MWalYpEkIADOOvAwp0GDwVoE
6DxWeYs57rdSQnSLxDah+mAqBokqswJ/ZmuBOO/iSqXCImehLs0VL1Y+TsThVbRy
epnBdqLk5PbDpODcYTl7on3MD3hpoHjbpnAPah0py57sroiY73sNE/ms1AUsqYPs
fAe5tjFwhGhVWRiZMGOAG6kgTtSdxG134c0Lyvy6xACTR8rJfgcnWMwFJDWK2GDn
ReGYJcgEOA==
=ywLV
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.15-2021-10-22' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Two fixes for the max workers limit API that was introduced this
series: one fix for an issue with that code, and one fixing a linked
timeout regression in this series"
* tag 'io_uring-5.15-2021-10-22' of git://git.kernel.dk/linux-block:
io_uring: apply worker limits to previous users
io_uring: fix ltimeout unprep
io_uring: apply max_workers limit to all future users
io-wq: max_worker fixes
The current logic of requests with IOSQE_ASYNC is first queueing it to
io-worker, then execute it in a synchronous way. For unbound works like
pollable requests(e.g. read/write a socketfd), the io-worker may stuck
there waiting for events for a long time. And thus other works wait in
the list for a long time too.
Let's introduce a new way for unbound works (currently pollable
requests), with this a request will first be queued to io-worker, then
executed in a nonblock try rather than a synchronous way. Failure of
that leads it to arm poll stuff and then the worker can begin to handle
other works.
The detail process of this kind of requests is:
step1: original context:
queue it to io-worker
step2: io-worker context:
nonblock try(the old logic is a synchronous try here)
|
|--fail--> arm poll
|
|--(fail/ready)-->synchronous issue
|
|--(succeed)-->worker finish it's job, tw
take over the req
This works much better than the old IOSQE_ASYNC logic in cases where
unbound max_worker is relatively small. In this case, number of
io-worker eazily increments to max_worker, new worker cannot be created
and running workers stuck there handling old works in IOSQE_ASYNC mode.
In my 64-core machine, set unbound max_worker to 20, run echo-server,
turns out:
(arguments: register_file, connetion number is 1000, message size is 12
Byte)
original IOSQE_ASYNC: 76664.151 tps
after this patch: 166934.985 tps
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20211018133445.103438-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If writeback I/O to a COW extent fails, the COW fork blocks are
punched out and the data fork blocks left alone. It is possible for
COW fork blocks to overlap non-shared data fork blocks (due to
cowextsz hint prealloc), however, and writeback unconditionally maps
to the COW fork whenever blocks exist at the corresponding offset of
the page undergoing writeback. This means it's quite possible for a
COW fork extent to overlap delalloc data fork blocks, writeback to
convert and map to the COW fork blocks, writeback to fail, and
finally for ioend completion to cancel the COW fork blocks and leave
stale data fork delalloc blocks around in the inode. The blocks are
effectively stale because writeback failure also discards dirty page
state.
If this occurs, it is likely to trigger assert failures, free space
accounting corruption and failures in unrelated file operations. For
example, a subsequent reflink attempt of the affected file to a new
target file will trip over the stale delalloc in the source file and
fail. Several of these issues are occasionally reproduced by
generic/648, but are reproducible on demand with the right sequence
of operations and timely I/O error injection.
To fix this problem, update the ioend failure path to also punch out
underlying data fork delalloc blocks on I/O error. This is analogous
to the writeback submission failure path in xfs_discard_page() where
we might fail to map data fork delalloc blocks and consistent with
the successful COW writeback completion path, which is responsible
for unmapping from the data fork and remapping in COW fork blocks.
Fixes: 787eb48550 ("xfs: fix and streamline error handling in xfs_end_io")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The owner info parameter is always NULL, so get rid of the parameter.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
We only use EFIs to free metadata blocks -- not regular data/attr fork
extents. Remove all the fields that we never use, for a net reduction
of 16 bytes.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
xfs_bmap_add_free isn't a block mapping function; it schedules deferred
freeing operations for a later point in a compound transaction chain.
While it's primarily used by bunmapi, its use has expanded beyond that.
Move it to xfs_alloc.c and rename the function since it's now general
freeing functionality. Bring the slab cache bits in line with the
way we handle the other intent items.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Create slab caches for the high-level structures that coordinate
deferred intent items, since they're used fairly heavily.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Rearrange these structs to reduce the amount of unused padding bytes.
This saves eight bytes for each of the three structs changed here, which
means they're now all (rmap/bmap are 64 bytes, refc is 32 bytes) even
powers of two.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Now that we've gotten rid of the kmem_zone_t typedef, rename the
variables to _cache since that's what they are.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Remove these typedefs by referencing kmem_cache directly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYXLSYQAKCRDh3BK/laaZ
PEfYAQCZcGVboa5uIrCYmVnEgXXf5NX0UrrM0ytvnVssGcgUOQEA8nAx3hwyvwvS
onA14DgXIz3koEE48PWv3gbJdpL/kAM=
=R0ip
-----END PGP SIGNATURE-----
Merge tag 'fuse-fixes-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"Syzbot discovered a race in case of reusing the fuse sb (introduced in
this cycle).
Fix it by doing the s_fs_info initialization at the proper place"
* tag 'fuse-fixes-5.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: clean up error exits in fuse_fill_super()
fuse: always initialize sb->s_fs_info
fuse: clean up fuse_mount destruction
fuse: get rid of fuse_put_super()
fuse: check s_root when destroying sb
Rename didn't decrement/clear nlink on overwritten target inode.
Create a common helper fuse_entry_unlinked() that handles this for unlink,
rmdir and rename.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Logically it belongs there since attributes are invalidated due to the
updated ctime. This is a cleanup and should not change behavior.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Due to the introduction of kmap_local_*, the storage of slots used for
short-term mapping has changed from per-CPU to per-thread. kmap_atomic()
disable preemption, while kmap_local_*() only disable migration.
There is no need to disable preemption in several kamp_atomic places used
in fuse.
Link: https://lwn.net/Articles/836144/
Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fuse ->release() is otherwise asynchronous for the reason that it can
happen in contexts unrelated to close/munmap.
Inode is already written back from fuse_flush(). Add it to
fuse_vma_close() as well to make sure inode dirtying from mmaps also get
written out before the file is released.
Also add error handling.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In writeback cache mode mtime/ctime updates are cached, and flushed to the
server using the ->write_inode() callback.
Closing the file will result in a dirty inode being immediately written,
but in other cases the inode can remain dirty after all references are
dropped. This result in the inode being written back from reclaim, which
can deadlock on a regular allocation while the request is being served.
The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, because
serving a request involves unrelated userspace process(es).
Instead do the same as for dirty pages: make sure the inode is written
before the last reference is gone.
- fallocate(2)/copy_file_range(2): these call file_update_time() or
file_modified(), so flush the inode before returning from the call
- unlink(2), link(2) and rename(2): these call fuse_update_ctime(), so
flush the ctime directly from this helper
Reported-by: chenguanyou <chenguanyou@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Get rid of the indirections and just provide a sync_bdevs
helper for the generic sync code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019062530.2174626-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use sync_blockdev_nowait instead of opencoding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211019062530.2174626-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use sync_blockdev_nowait instead of opencoding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211019062530.2174626-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use sync_blockdev instead of opencoding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20211019062530.2174626-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead offer a new sync_blockdev_nowait helper for the !wait case.
This new helper is exported as it will grow modular callers in a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019062530.2174626-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no clear benefit in having this helper vs just open coding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211019062530.2174626-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Call the ->get_unique_id method to query the SCSI identifiers. This can
use the cached VPD page in the sd driver instead of sending a command
on every LAYOUTGET. It will also allow to support NVMe based volumes
if the draft for that ever takes off.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20211021060607.264371-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove a redundant call in nfs_updatepage(). nfs_writepage_setup() will
have already called nfs_mark_request_dirty() on success.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Another change to the API io-wq worker limitation API added in 5.15,
apply the limit to all prior users that already registered a tctx. It
may be confusing as it's now, in particular the change covers the
following 2 cases:
TASK1 | TASK2
_________________________________________________
ring = create() |
| limit_iowq_workers()
*not limited* |
TASK1 | TASK2
_________________________________________________
ring = create() |
| issue_requests()
limit_iowq_workers() |
| *not limited*
A note on locking, it's safe to traverse ->tctx_list as we hold
->uring_lock, but do that after dropping sqd->lock to avoid possible
problems. It's also safe to access tctx->io_wq there because tasks
kill it only after removing themselves from tctx_list, see
io_uring_cancel_generic() -> io_uring_clean_tctx()
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d6e09ecc3545e4dc56e43c906ee3d71b7ae21bed.1634818641.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Syzkaller reports a null pointer dereference in fuse_test_super() that is
caused by sb->s_fs_info being NULL.
This is due to the fact that fuse_fill_super() is initializing s_fs_info,
which is too late, it's already on the fs_supers list. The initialization
needs to be done in sget_fc() with the sb_lock held.
Move allocation of fuse_mount and fuse_conn from fuse_fill_super() into
fuse_get_tree().
After this ->kill_sb() will always be called with non-NULL ->s_fs_info,
hence fuse_mount_destroy() can drop the test for non-NULL "fm".
Reported-by: syzbot+74a15f02ccb51f398601@syzkaller.appspotmail.com
Fixes: 5d5b74aa9c ("fuse: allow sharing existing sb")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
1. call fuse_mount_destroy() for open coded variants
2. before deactivate_locked_super() don't need fuse_mount destruction since
that will now be done (if ->s_fs_info is not cleared)
3. rearrange fuse_mount setup in fuse_get_tree_submount() so that the
regular pattern can be used
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The ->put_super callback is called from generic_shutdown_super() in case of
a fully initialized sb. This is called from kill_***_super(), which is
called from ->kill_sb instances.
Fuse uses ->put_super to destroy the fs specific fuse_mount and drop the
reference to the fuse_conn, while it does the same on each error case
during sb setup.
This patch moves the destruction from fuse_put_super() to
fuse_mount_destroy(), called at the end of all ->kill_sb instances. A
follup patch will clean up the error paths.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Checking "fm" works because currently sb->s_fs_info is cleared on error
paths; however, sb->s_root is what generic_shutdown_super() checks to
determine whether the sb was fully initialized or not.
This change will allow cleanup of sb setup error paths.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
There's a mistake in commit 2be7828c9f ("get rid of autofs_getpath()")
that affects kernels from v5.13.0, basically missed because of me not
fully testing the change for Al.
The problem is that the hash calculation for the wait name qstr hasn't
been updated to account for the change to use dentry_path_raw(). This
prevents the correct matching an existing wait resulting in multiple
notifications being sent to the daemon for the same mount which must
not occur.
The problem wasn't discovered earlier because it only occurs when
multiple processes trigger a request for the same mount concurrently
so it only shows up in more aggressive testing.
Fixes: 2be7828c9f ("get rid of autofs_getpath()")
Cc: stable@vger.kernel.org
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.
So, use the struct_size() helper to do the arithmetic instead of the
argument "size + size * count" in the kzalloc() function.
[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments
Signed-off-by: Len Baker <len.baker@gmx.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.
In this case these are not actually dynamic sizes: all the operands
involved in the calculation are constant values. However it is better to
refactor them anyway, just to keep the open-coded math idiom out of
code.
So, use the struct_size() helper to do the arithmetic instead of the
argument "size + count * size" in the kzalloc() functions.
This code was detected with the help of Coccinelle and audited and fixed
manually.
[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments
Signed-off-by: Len Baker <len.baker@gmx.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Use 2-factor argument multiplication form kvcalloc() instead of
kvzalloc().
Link: https://github.com/KSPP/linux/issues/162
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
All the callers are now in client.c so we can remove the
EXPORT_SYMBOL_GPL() and make it static.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This lets us update the server's attributes when the user does a "mount
-o remount" on the filesystem.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up. There are a few places where we want to probe the server, but
don't actually care about the fsinfo result. Change these to use
nfs_probe_server(), which handles the fattr allocation for us.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
And rename it to nfs_probe_server(). I also change it to take the nfs_fh
as an argument so callers can choose what filehandle to probe.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
And call it before doing an FSINFO probe to reset to the baseline
capabilities before probing.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
dprintk call sites that display no other information than the
function name can be replaced with use of the trace "function" or
"function_graph" plug-ins.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
These new events report slightly different information for readpage
and readpages/readahead.
For readpage:
fsx-1387 [006] 380.761896: nfs_aop_readpage: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899355910932437 offset=131072
fsx-1387 [006] 380.761900: nfs_aop_readpage_done: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899355910932437 offset=131072 ret=0
The index of a synchronous single-page read is reported.
For readpages:
fsx-1387 [006] 380.760847: nfs_aop_readahead: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899355909932456 nr_pages=3
fsx-1387 [006] 380.760853: nfs_aop_readahead_done: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899355909932456 nr_pages=3 ret=0
The count of pages requested is reported. nfs_readpages does not
wait for the READ requests to complete.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
For certain special cases, RPC-related tracepoints record a -1 as
the task ID or the client ID. It's ugly for a trace event to display
4 billion in these cases.
To help keep SUNRPC tracepoints consistent, create a macro that
defines the print format specifiers for tk_pid and cl_clid. At some
point in the future we might try tk_pid with a wider range of values
than 0..64K so this makes it easier to make that change.
RPC tracepoints now look like this:
<...>-1276 [009] 149.720358: rpc_clnt_new: client=00000005 peer=[192.168.2.55]:20049 program=nfs server=klimt.ib
<...>-1342 [004] 149.921234: rpc_xdr_recvfrom: task:0000001a@00000005 head=[0xff1242d9ab6dc01c,144] page=0 tail=[(nil),0] len=144
<...>-1342 [004] 149.921235: xprt_release_cong: task:0000001a@00000005 snd_task:ffffffff cong=256 cwnd=16384
<...>-1342 [004] 149.921235: xprt_put_cong: task:0000001a@00000005 snd_task:ffffffff cong=0 cwnd=16384
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Save some space in the nfs_inode by setting up an anonymous union with
the fields that are peculiar to a specific type of filesystem object.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We mustn't call nfs_wb_all() on anything other than a regular file.
Furthermore, we can exit early when we don't hold a delegation.
Reported-by: David Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If O_DIRECT bumps the commit_info rpcs_out field, then that could lead
to fsync() hangs. The fix is to ensure that O_DIRECT calls
nfs_commit_end().
Fixes: 723c921e7d ("sched/wait, fs/nfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
superblocks issue was particularly annoying because for unexperienced
users it essentially exacted a reboot to establish a new functional
mount in that scenario.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmFwWuYTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHziwdHB/wJEvFkMQrlzbgVmijhmneU+TseAMxR
UBGnsyHdiimIDWqzb81cBuDfrocQzyhntghP2lBzcbzI+gZN1KlrYzKAbYk++cfi
E5Zbw3U8+moa5B2CnO19QEgmJY5DoXYXb6AbO3udIIj1Ls9lx0ByUyDoSn6fZyVH
iUQ9OH7zVTsTscoaBiEVcutmhQjIFjoYJqPpfCg6/15xcXX/L1DvxQFBWOxXqHQw
LYfCQIu8orrA2QdZpuTRpklrMg1Ih+RmqYTdQST6tTtTKJUrHPI0r3A8c2vUoBk1
ph4fBNsAMUqFn1fIGT88PJg81RC5RC3E6D5PqErzRFsPbAv9FHfGYvGQ
=FadF
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.15-rc7' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Two important filesystem fixes, marked for stable.
The blocklisted superblocks issue was particularly annoying because
for unexperienced users it essentially exacted a reboot to establish a
new functional mount in that scenario"
* tag 'ceph-for-5.15-rc7' of git://github.com/ceph/ceph-client:
ceph: fix handling of "meta" errors
ceph: skip existing superblocks that are blocklisted or shut down when mounting
Now that gfs2_file_buffered_write is the only remaining user of
ip->i_gh, we can move the glock holder to the stack (or rather, use the
one we already have on the stack); there is no need for keeping the
holder in the inode anymore.
This is slightly complicated by the fact that we're using ip->i_gh for
the statfs inode in gfs2_file_buffered_write as well. Writing to the
statfs inode isn't very common, so allocate the statfs holder
dynamically when needed.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
So far, for buffered writes, we were taking the inode glock in
gfs2_iomap_begin and dropping it in gfs2_iomap_end with the intention of
not holding the inode glock while iomap_write_actor faults in user
pages. It turns out that iomap_write_actor is called inside iomap_begin
... iomap_end, so the user pages were still faulted in while holding the
inode glock and the locking code in iomap_begin / iomap_end was
completely pointless.
Move the locking into gfs2_file_buffered_write instead. We'll take care
of the potential deadlocks due to faulting in user pages while holding a
glock in a subsequent patch.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch introduces a new HIF_MAY_DEMOTE flag and infrastructure that
will allow glocks to be demoted automatically on locking conflicts.
When a locking request comes in that isn't compatible with the locking
state of an active holder and that holder has the HIF_MAY_DEMOTE flag
set, the holder will be demoted before the incoming locking request is
granted.
Note that this mechanism demotes active holders (with the HIF_HOLDER
flag set), while before we were only demoting glocks without any active
holders. This allows processes to keep hold of locks that may form a
cyclic locking dependency; the core glock logic will then break those
dependencies in case a conflicting locking request occurs. We'll use
this to avoid giving up the inode glock proactively before faulting in
pages.
Processes that allow a glock holder to be taken away indicate this by
calling gfs2_holder_allow_demote(), which sets the HIF_MAY_DEMOTE flag.
Later, they call gfs2_holder_disallow_demote() to clear the flag again,
and then they check if their holder is still queued: if it is, they are
still holding the glock; if it isn't, they can re-acquire the glock (or
abort).
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Pass the first current glock holder into function may_grant and
deobfuscate the logic there.
While at it, switch from BUG_ON to GLOCK_BUG_ON in may_grant. To make
that build cleanly, de-constify the may_grant arguments.
We're now using function find_first_holder in do_promote, so move the
function's definition above do_promote.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Add a wrapper around iomap_file_buffered_write. We'll add code for when
the operation needs to be retried here later.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Consolidate the various helpers into a single blk_flush_plug helper that
takes a plk_plug and the from_scheduler bool and switch all callsites to
call it directly. Checks that the plug is non-NULL must be performed by
the caller, something that most already do anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211020144119.142582-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_unprep_linked_timeout() is broken, first it needs to return back
REQ_F_ARM_LTIMEOUT, so the linked timeout is enqueued and disarmed. But
now we refcounted it, and linked timeouts may get not executed at all,
leaking a request.
Just kill the unprep optimisation.
Fixes: 906c6caaf5 ("io_uring: optimise io_prep_linked_timeout()")
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/51b8e2bfc4bea8ee625cf2ba62b2a350cc9be031.1634719585.git.asml.silence@gmail.com
Link: https://github.com/axboe/liburing/issues/460
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, IORING_REGISTER_IOWQ_MAX_WORKERS applies only to the task
that issued it, it's unexpected for users. If one task creates a ring,
limits workers and then passes it to another task the limit won't be
applied to the other task.
Another pitfall is that a task should either create a ring or submit at
least one request for IORING_REGISTER_IOWQ_MAX_WORKERS to work at all,
furher complicating the picture.
Change the API, save the limits and apply to all future users. Note, it
should be done first before giving away the ring or submitting new
requests otherwise the result is not guaranteed.
Fixes: 2e480058dd ("io-wq: provide a way to limit max number of workers")
Link: https://github.com/axboe/liburing/issues/460
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/51d0bae97180e08ab722c0d5c93e7439cfb6f697.1634683237.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use ERR_CAST() instead of ERR_PTR(PTR_ERR()).
This makes it more readable and also fix this warning detected by
err_cast.cocci:
./fs/io_uring.c: WARNING: 3208: 11-18: ERR_CAST can be used with buf
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
Link: https://lore.kernel.org/r/20211020084948.1038420-1-deng.changcheng@zte.com.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Right now security_dentry_init_security() only supports single security
label and is used by SELinux only. There are two users of this hook,
namely ceph and nfs.
NFS does not care about xattr name. Ceph hardcodes the xattr name to
security.selinux (XATTR_NAME_SELINUX).
I am making changes to fuse/virtiofs to send security label to virtiofsd
and I need to send xattr name as well. I also hardcoded the name of
xattr to security.selinux.
Stephen Smalley suggested that it probably is a good idea to modify
security_dentry_init_security() to also return name of xattr so that
we can avoid this hardcoding in the callers.
This patch adds a new parameter "const char **xattr_name" to
security_dentry_init_security() and LSM puts the name of xattr
too if caller asked for it (xattr_name != NULL).
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: James Morris <jamorris@linux.microsoft.com>
[PM: fixed typos in the commit description]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Make sure the security buffer's length/offset are valid with regards to
the packet length.
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr>
Signed-off-by: Steve French <stfrench@microsoft.com>
To avoid dictionary attacks (repeated session setups rapidly sent) to
connect to server, ksmbd make a delay of a 5 seconds on session setup
failure to make it harder to send enough random connection requests
to break into a server if a user insert the wrong password 10 times
in a row.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Validate OutputBufferLength of QUERY_DIR, QUERY_INFO, IOCTL requests and
check the free size of response buffer for these requests.
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Currently force_nonblock stands for three meanings:
- nowait or not
- in an io-worker or not(hold uring_lock or not)
Let's split the logic to two flags, IO_URING_F_NONBLOCK and
IO_URING_F_UNLOCKED for convenience of the next patch.
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20211018133431.103298-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
First, fix nr_workers checks against max_workers, with max_worker
registration, it may pretty easily happen that nr_workers > max_workers.
Also, synchronise writing to acct->max_worker with wqe->lock. It's not
an actual problem, but as we don't care about io_wqe_create_worker(),
it's better than WRITE_ONCE()/READ_ONCE().
Fixes: 2e480058dd ("io-wq: provide a way to limit max number of workers")
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/11f90e6b49410b7d1a88f5d04fb8d95bb86b8cf3.1634671835.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that we have the infrastructure to track the max possible height of
each btree type, we can create a separate slab cache for cursors of each
type of btree. For smaller indices like the free space btrees, this
means that we can pack more cursors into a slab page, improving slab
utilization.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add code for all five btree types so that we can compute the absolute
maximum possible btree height for each btree type. This is a setup for
the next patch, which makes every btree type have its own cursor cache.
The functions are exported so that we can have xfs_db report the
absolute maximum btree heights for each btree type, rather than making
everyone run their own ad-hoc computations.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Nobody uses this symbol anymore, so kill it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big
enough to handle the maximally tall rmap btree when all blocks are in
use and maximally shared, let's compute the maximum height assuming the
rmapbt consumes as many blocks as possible.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
During review of the next patch, Dave remarked that he found these two
btree geometry calculation functions lacking in documentation and that
they performed more work than was really necessary.
These functions take the same parameters and have nearly the same logic;
the only real difference is in the return values. Reword the function
comment to make it clearer what each function does, and move them to be
adjacent to reinforce their relation.
Clean up both of them to stop opencoding the howmany functions, stop
using the uint typedefs, and make them both support computations for
more than 2^32 leaf records, since we're going to need all of the above
for files with large data forks and large rmap btrees.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Compute the actual maximum AG btree height for deciding if a per-AG
block reservation is critically low. This only affects the sanity check
condition, since we /generally/ will trigger on the 10% threshold. This
is a long-winded way of saying that we're removing one more usage of
XFS_BTREE_MAXLEVELS.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Years ago when XFS was thought to be much more simple, we introduced
m_ag_maxlevels to specify the maximum btree height of per-AG btrees for
a given filesystem mount. Then we observed that inode btrees don't
actually have the same height and split that off; and now we have rmap
and refcount btrees with much different geometries and separate
maxlevels variables.
The 'ag' part of the name doesn't make much sense anymore, so rename
this to m_alloc_maxlevels to reinforce that this is the maximum height
of the *free space* btrees. This sets us up for the next patch, which
will add a variable to track the maximum height of all AG btrees.
(Also take the opportunity to improve adjacent comments and fix minor
style problems.)
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
To support future btree code, we need to be able to size btree cursors
dynamically for very large btrees. Switch the maxlevels computation to
use the precomputed values in the superblock, and create cursors that
can handle a certain height. For now, we retain the btree cursor cache
that can handle up to 9-level btrees, though a subsequent patch
introduces separate caches for each btree type, where each cache's
objects will be exactly tall enough to handle the specific btree type.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Encode the maximum btree height in the cursor, since we're soon going to
allow smaller cursors for AG btrees and larger cursors for file btrees.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Refactor btree allocation to a common helper.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reduce the size of the btree cursor structure some more by rearranging
fields to eliminate unused space. While we're at it, fix the ragged
indentation and a spelling error.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Split out the btree level information into a separate struct and put it
at the end of the cursor structure as a VLA. Files with huge data forks
(and in the future, the realtime rmap btree) will require the ability to
support many more levels than a per-AG btree cursor, which means that
we're going to create per-btree type cursor caches to conserve memory
for the more common case.
Note that a subsequent patch actually introduces dynamic cursor heights.
This one merely rearranges the structure to prepare for that.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reorganize struct xchk_btree so that we can dynamically size the context
structure to fit the type of btree cursor that we have. This will
enable us to use memory more efficiently once we start adding very tall
btree types. Right-size the lastkey array to match the number of *node*
levels in the tree so that we stop wasting space.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The btree scrubbing code checks that the records (or keys) that it finds
in a btree block are all in order by calling the btree cursor's
->recs_inorder function. This of course makes no sense for the first
item in the block, so we switch that off with a separate variable in
struct xchk_btree.
Christoph helped me figure out that the variable is unnecessary, since
we just accessed bc_ptrs[level] and can compare that against zero. Use
that, and save ourselves some memory space.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
We're never going to run more than 4 billion btree operations on a
refcount cursor, so shrink the field to an unsigned int to reduce the
structure size. Fix whitespace alignment too.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
During review of subsequent patches, Dave and I noticed that this
function doesn't work quite right -- accessing cur->bc_ino depends on
the ROOT_IN_INODE flag, not LONG_PTRS. Fix that and the parentheses
isssue. While we're at it, remove the piece that accesses cur->bc_ag,
because block 0 of an AG is never part of a btree.
Note: This changes the btree scrubber tracepoints behavior -- if the
cursor has no buffer for a certain level, it will always report
NULLFSBLOCK. It is assumed that anyone tracing the online fsck code
will also be tracing xchk_start/xchk_done or otherwise be aware of what
exactly is being scrubbed.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The for_each_perag*() set of macros are hacky in that some (i.e.
those based on sb_agcount) rely on the assumption that perag
iteration terminates naturally with a NULL perag at the specified
end_agno. Others allow for the final AG to have a valid perag and
require the calling function to clean up any potential leftover
xfs_perag reference on termination of the loop.
Aside from providing a subtly inconsistent interface, the former
variant is racy with growfs because growfs can create discoverable
post-eofs perags before the final superblock update that completes
the grow operation and increases sb_agcount. This leads to the
following assert failure (reproduced by xfs/104) in the perag free
path during unmount:
XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/libxfs/xfs_ag.c, line: 195
This occurs because one of the many for_each_perag() loops in the
code that is expected to terminate with a NULL pag (and thus has no
post-loop xfs_perag_put() check) raced with a growfs and found a
non-NULL post-EOFS perag, but terminated naturally based on the
end_agno check without releasing the post-EOFS perag.
Rework the iteration logic to lift the agno check from the main for
loop conditional to the iteration helper function. The for loop now
purely terminates on a NULL pag and xfs_perag_next() avoids taking a
reference to any perag beyond end_agno in the first place.
Fixes: f250eedcf7 ("xfs: make for_each_perag... a first class citizen")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The for_each_perag_from() iteration macro relies on sb_agcount to
process every perag currently within EOFS from a given starting
point. It's perfectly valid to have perag structures beyond
sb_agcount, however, such as if a growfs is in progress. If a perag
loop happens to race with growfs in this manner, it will actually
attempt to process the post-EOFS perag where ->pag_agno ==
sb_agcount. This is reproduced by xfs/104 and manifests as the
following assert failure in superblock write verifier context:
XFS: Assertion failed: agno < mp->m_sb.sb_agcount, file: fs/xfs/libxfs/xfs_types.c, line: 22
Update the corresponding macro to only process perags that are
within the current sb_agcount.
Fixes: 58d43a7e32 ("xfs: pass perags around in fsmap data dev functions")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Rename the next_agno variable to be consistent across the several
iteration macros and shorten line length.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Fold the loop iteration logic into a helper in preparation for
further fixups. No functional change in this patch.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
coccicheck complains about the use of snprintf() in sysfs show functions.
Fix the coccicheck warning:
WARNING: use scnprintf or sprintf.
Use sysfs_emit instead of scnprintf or sprintf makes more sense.
Signed-off-by: Qing Wang <wangqing@vivo.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
This is only of historical interest, and anyone interested in the
history can dig out an old version of locks.c from from git.
Triggered by the observation that it references the now-removed
Documentation/filesystems/mandatory-locking.rst.
Reported-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
When enabling -Wunused warnings by building with W=1, I get an
instance of the -Wunused-but-set-parameter warning in the io_uring code:
fs/io_uring.c: In function 'io_queue_async_work':
fs/io_uring.c:1445:61: error: parameter 'locked' set but not used [-Werror=unused-but-set-parameter]
1445 | static void io_queue_async_work(struct io_kiocb *req, bool *locked)
| ~~~~~~^~~~~~
There are very few warnings of this type, so it would be nice to enable
this by default and fix all the existing instances. As the assignment
serves no purpose by itself other than to prevent developers from using
the variable, an easy workaround is to remove the assignment and just
rename the argument to "dont_use".
Fixes: f237c30a56 ("io_uring: batch task work locking")
Link: https://lore.kernel.org/lkml/20210920121352.93063-1-arnd@kernel.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20211019153507.348480-1-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add MicroLZMA support in order to maximize compression ratios for
specific scenarios. For example, it's useful for low-end embedded
boards and as a secondary algorithm in a file for specific access
patterns.
MicroLZMA is a new container format for raw LZMA1, which was created
by Lasse Collin aiming to minimize old LZMA headers and get rid of
unnecessary EOPM (end of payload marker) as well as to enable
fixed-sized output compression, especially for 4KiB pclusters.
Similar to LZ4, inplace I/O approach is used to minimize runtime
memory footprint when dealing with I/O. Overlapped decompression is
handled with 1) bounced buffer for data under processing or 2) extra
short-lived pages from the on-stack pagepool which will be shared in
the same read request (128KiB for example).
Link: https://lore.kernel.org/r/20211010213145.17462-8-xiang@kernel.org
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Previously, some LZ4 methods were named with `generic'. However, while
evaluating the effective LZMA approach, it seems they aren't quite
generic at all (e.g. no need preparing dstpages for most LZMA cases.)
Avoid such naming instead.
Link: https://lore.kernel.org/r/20211010213145.17462-7-xiang@kernel.org
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Previously, the readahead window was strictly followed by EROFS
decompression strategy in order to minimize extra memory footprint.
However, it could become inefficient if just reading the partial
requested data for much big LZ4 pclusters and the upcoming LZMA
implementation.
Let's try to request the leading data in a pcluster without
triggering memory reclaiming instead for the LZ4 approach first
to boost up 100% randread of large big pclusters, and it has no real
impact on low memory scenarios.
It also introduces a way to expand read lengths in order to decompress
the whole pcluster, which is useful for LZMA since the algorithm
itself is relatively slow and causes CPU bound, but LZ4 is not.
Link: https://lore.kernel.org/r/20211008200839.24541-4-xiang@kernel.org
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Previously, for each HEAD lcluster, it can be either HEAD or PLAIN
lcluster to indicate whether the whole pcluster is compressed or not.
In this patch, a new HEAD2 head type is introduced to specify another
compression algorithm other than the primary algorithm for each
compressed file, which can be used for upcoming LZMA compression and
LZ4 range dictionary compression for various data patterns.
It has been stayed in the EROFS roadmap for years. Complete it now!
Link: https://lore.kernel.org/r/20211017165721.2442-1-xiang@kernel.org
Reviewed-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
./fs/nfsd/nfssvc.c: 1072: 8-9: :WARNING return of 0/1 in function
'nfssvc_decode_voidarg' with return type bool
Return statements in functions returning bool should use true/false
instead of 1/0.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The block layer can use this knowledge to make smarter decisions on
how to handle the request, if it knows that N more may be coming. Switch
to using blk_start_plug_nr_ios() to pass in that information.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge REQ_F_NOWAIT_READ and REQ_F_NOWAIT_WRITE into one flag, i.e.
REQ_F_SUPPORT_NOWAIT. First it gets rid of dependence on CONFIG_64BIT
but also simplifies the code.
One thing to consider is when we don't have ->{read,write}_iter and go
through loop_rw_iter(). Just fail it with -EAGAIN if we expect nowait
behaviour but not sure whether it supports it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f832a20e5186c2e79c6519280c238f559a1d2bbc.1634425438.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't check if we can do nowait before arming apoll, there are several
reasons for that. First, we don't care much about files that don't
support nowait. Second, it may be useful -- we don't want to be taking
away extra workers from io-wq when it can go in some async. Even if it
will go through io-wq eventually, it make difference in the numbers of
workers actually used. And the last one, it's needed to clean nowait in
future commits.
[kernel test robot: fix unused-var]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9d06f3cb2c8b686d970269a87986f154edb83043.1634425438.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This commit reorders the conditions in a branch in io_write. The
reorder to check 'ret2 == -EAGAIN' first as checking
'(req->ctx->flags & IORING_SETUP_IOPOLL)' will likely be more
expensive due to 2x memory derefences.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Link: https://lore.kernel.org/r/20211017013229.4124279-1-goldstein.w.n@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We pass iovec** into __io_import_iovec(), which should keep it,
initialise and modify accordingly. It's expensive, return it directly
from __io_import_iovec encoding errors with ERR_PTR if needed.
io_import_iovec keeps the old interface, but it's inline and so
everything is optimised nicely.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6230e9769982f03a8f86fa58df24666088c44d3e.1634314022.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Combine force_nonblock branches (which is already optimised by
compiler), flip branches so the most hot/common path is the first, e.g.
as with non on-stack iov setup, and add extra likely/unlikely
attributions for errror paths.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2c2536c5896d70994de76e387ea09a0402173a3f.1634144845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make io_import_iovec taking struct io_rw_state instead of an iter
pointer. First it takes care of initialising iovec pointer, which can be
forgotten. Even more, we can not init it if not needed, e.g. in case of
IORING_OP_READ_FIXED or IORING_OP_READ. Also hide saving iter_state
inside of it by splitting out an inline function of it to avoid extra
ifs.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b1bbc213a95e5272d4da5867bb977d9acb6f2109.1634144845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
First, change IO_URING_F_NONBLOCK to take sign bit of the int, so
checking for it can be turned into test + sign-based-jump, makes the
binary smaller and may be faster.
Then, instead of passing need_lock boolean into io_import_iovec() just
give it issue_flags, which is already stored somewhere. Saves some space
on stack, a couple of test + cmov operations and other conversions.
note: we still leave
force_nonblock = issue_flags & IO_URING_F_NONBLOCK
variable, but it's optimised out by the compiler into testing
issue_flags directly.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ee96547e692f6c975c229cd82fc721679571a734.1634144845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently io_read() and io_write() keep separate pointers to an iter and
to struct iov_iter_state, which is not great for register spilling and
requires more on-stack copies. They are both either on-stack or in
req->async_data at the same time, so use struct io_rw_state and keep a
pointer only to it, so having all the state with just one pointer.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5c5e7ffd7dc25fc35075c70411ba99df72f237fa.1634144845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't override req->result in io_complete_rw_iopoll() when it's already
of the same value, we have an if just above it, so move the assignment
there. Also, add one simle unlikely() in __io_complete_rw_common().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8dfeb4f84026a20172bcf82c05010abe955874ae.1634144845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Apparently, percpu_ref_put/get() are expensive enough if done per
request, get them in a batch and cache on the submission side to avoid
getting it over and over again. Also, if we're completing under
uring_lock, return refs back into the cache instead of
perfcpu_ref_put(). Pretty similar to how we do tctx->cached_refs
accounting, but fall back to normal putting when we already changed a
rsrc node by the time of free.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b40d8c5bc77d3c9550df8a319117a374ac85f8f4.1633817310.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Looking at the assembly, the compiler decided to reload req->opcode in
io_op_defs[opcode].needs_file instead of one it had in a register, so
store it in a temp variable so it can be optimised out. Also move the
personality block later, it's better for spilling/etc. as it only
depends on @sqe, which we're keeping anyway.
By the way, zero req->opcode if it over IORING_OP_LAST, not a problem,
at the moment but is safer.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6ba869f5f8b7b0f991c87fdf089f0abf87cbe06b.1633532552.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Plugging is only needed with requests that also need a file, so hide
plugging under a ->needs_file check. Also, place ->needs_file and ->plug
bits into the same byte of io_op_defs, it may matter for compilers, e.g.
only with the change a tested one decided to optimise two memory testb
into a more with two register testb.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1600d1287bb7d16451d4ef3343252787a5314927.1633532552.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
->async_data is a slow path, so it won't matter much if we do the clean
up inside io_clean_op(). Moreover, in many cases it's allocated together
with setting one or more of IO_REQ_CLEAN_FLAGS flags, so it'd go through
io_clean_op() anyway.
Control ->async_data allocation with a new flag REQ_F_ASYNC_DATA, so we
can do all the maintainence under io_req_needs_clean() fast check.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6892cf5883c459f36bda26f30ceb16742b20b84b.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Delay reading the next node in io_free_batch_list(), allows the compiler
to load the value a bit later improving register spilling in some cases.
With gcc 11.1 it helped to move @task_refs variable from the stack to a
register and optimises out a couple of per request instructions.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cc9fdfb6f72a4e8bc9918a5e9f2d97869a263ae4.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Attribute cold functions so compilers can optimise them for size. It
shrinks the binary by 2.5-3%
text data bss dec hex filename
90670 14002 8 104680 198e8 ./fs/io_uring.o
88053 14002 8 102063 18eaf ./fs/io_uring.o
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b53d385f91dca45170b67d7f11c7abd787e821f6.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currenlty, we allocate one ctx reference per request at submission time
and put them at free. It's batched and not so expensive but it still
bloats the kernel, adds 2 function calls for rcu and adds some overhead
for request counting in io_free_batch_list().
Always keep one reference with a request, even when it's freed and in
io_uring request caches. There is extra work at ring exit / quiesce
paths, which now need to put all cached requests. io_ring_exit_work() is
already looping, so it's not a problem. Add hybrid-busy waiting to
io_ctx_quiesce() as well for now.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/99613fbe396e80777228cde39bbda1aa8938554e.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The invariant of io_wq_work_list is that it's empty IFF ->first is NULL,
so no need to initially set ->last. With now having more users of the
list it may play a role, i.e. used in each tw iteration and on every
completion flushing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c464ab5cab6e46a858c6d39c107e92b3b5291f13.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Even after fully inlining io_alloc_req() my compiler does a NULL check
in the path of successful allocation, no hacks like an empty dereference
help it. Restructure io_alloc_req() by splitting out refilling part, so
the compiler generate a slightly better binary.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/eda17571bdc7248d8e617b23e7132a5416e4680b.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_req_complete_state() is inlined and used in lots of places, so we
want to keep it concise. Move adding a request into a completion batch
list from io_req_complete_state() into the consumer, i.e.
__io_queue_sqe().
before vs after
text data bss dec hex filename
91894 14002 8 105904 19db0 ./fs/io_uring.o
91046 14002 8 105056 19a60 ./fs/io_uring.o
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4afca4e11abfd4cc8e99777fdcaf4d34cf4d022d.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We want ->comp_list in the second cacheline, which is hotter comparing
to the 3rd. Swap the field with ->link, which is not as hot and
controlled by flags and so not accessed unless there is a link.
By the way add a couple of comments for io_kiocb fields.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9d9dde31f8f62279a5f48c575bbc27b8290edc0c.1633373302.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For some reason non-off IORING_OP_TIMEOUT always fails links, it's
pretty inconvenient and unnecessary limits chaining after it to hard
linking, which is far from ideal, e.g. doesn't pair well with timeout
cancellation. Add a flag forcing it to not fail links on -ETIME.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/17c7ec0fb7a6113cc6be8cdaedcada0ba836ac0e.1633199723.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Put an explicit check for number of requests to submit. First,
we can turn while into do-while and it generates better code, and second
that if can be cheaper, e.g. by using CPU flags after sub in
io_sqring_entries().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5926baadd20c28feab7a5e1725fedf32e4553ff7.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a request completed inline the result should only be zero, it's a
grave error otherwise. So, when we see REQ_F_COMPLETE_INLINE it's not
even necessary to check the return code, and the flag check can be moved
earlier.
It's one "if" less for inline completions, and same two checks for it
normally completing (ret == 0). Those are two cases we care about the
most.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ebd4e397a9c26d96c99b24447acc309741041a83.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Extract slow paths from __io_queue_sqe() into a function and inline the
hot path. With that we have everything completely inlined on the
submission path up until io_issue_sqe().
-> io_submit_sqes()
-> io_submit_sqe() (inlined)
-> io_queue_sqe() (inlined)
-> __io_queue_sqe() (inlined)
-> io_issue_sqe()
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f1606864d95d7f26dc28c7eec3dc6ed6ec32618a.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't want the slow path of io_queue_sqe to be inlined, so extract a
function from it.
text data bss dec hex filename
91950 13986 8 105944 19dd8 ./fs/io_uring.o
91758 13986 8 105752 19d18 ./fs/io_uring.o
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fb01253911f8fb374268f65b1ba939b54ca6583f.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->ctx->active_drain is a bit too expensive, partially because of two
dereferences. Do a trick, if we see it set in io_init_req(), set
REQ_F_FORCE_ASYNC and it automatically goes through a slower path where
we can catch it. It's nearly free to do in io_init_req() because there
is already ->restricted check and it's in the same byte of a bitmask.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d7e7ddc63c15e8a300833132abb3eb8fd3918aef.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are two call sites of io_queue_sqe() in io_submit_sqe(), combine
them into one, because io_queue_sqe() is inline and we don't want to
bloat binary, and will become even bigger
text data bss dec hex filename
92126 13986 8 106120 19e88 ./fs/io_uring.o
91966 13986 8 105960 19de8 ./fs/io_uring.o
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/506124b8e767f0a4576f7a459f6aea3d13fb4dda.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
First, convert rest of iopoll bits to single linked lists, and also
replace per-request list_add_tail() with splicing a part of slist.
With that, use io_free_batch_list() to put/free requests. The main
advantage of it is that it's now the only user of struct req_batch and
friends, and so they can be inlined. The main overhead there was
per-request call to not-inlined io_req_free_batch(), which is expensive
enough.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b37fc6d5954b241e025eead7ab92c6f44a42f229.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert explicit barrier around iopoll_completed to smp_load_acquire()
and smp_store_release(). Similar on the callback side, but replaces a
single smp_rmb() with per-request smp_load_acquire(), neither imply any
extra CPU ordering for x86. Use READ_ONCE as usual where it doesn't
matter.
Use it to move filling CQEs by iopoll earlier, that will be necessary
to avoid traversing the list one extra time in the future.
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8bd663cb15efdc72d6247c38ee810964e744a450.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The main loop of io_do_iopoll() iterates and does ->iopoll() until it
meets a first completed request, then it continues from that position
and splices requests to pass them through io_iopoll_complete().
Split the loop in two for clearness, iopolling and reaping completed
requests from the list.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a7f6fd27a94845e5dc925a47a4a9765a92e514fb.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Apart from just using lists (i.e. io_wq_work_list), we also want to have
stacks, which are a bit faster, and have some interoperability between
them. Add a stack implementation based on io_wq_work_node and some
helpers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5d3a412a5ac0d47e0f0499d70d2207d70a68925e.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently we collect requests for completion batching in an array.
Replace them with a singly linked list. It's as fast as arrays but
doesn't take some much space in ctx, and will be used in future patches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a666826f2854d17e9fb9417fb302edfeb750f425.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't really need to pass the number of requests to complete into
io_do_iopoll(), a flag whether to enforce non-spin mode is enough.
Should be straightforward, maybe except io_iopoll_check(). We pass !min
there, because we do never enter with the number of already reaped
requests is larger than the specified @min, apart from the first
iteration, where nr_events is 0 and so the final check should be
identical.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/782b39d1d8ec584eae15bca0a1feb6f0571fe5b8.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hint the compiler that it's not as likely to have creds different from
current attached to a request. The current code generation is far from
ideal, hopefully it can help to some compilers to remove duplicated jump
tables and so.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e7815251ac4bf5a4a23d298c752f029ae19f3837.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IOSQE_IO_DRAIN is quite marginal and we don't care too much about
IOSQE_BUFFER_SELECT. Save to ifs and hide both of them under
SQE_VALID_FLAGS check. Now we first check whether it uses a "safe"
subset, i.e. without DRAIN and BUFFER_SELECT, and only if it's not
true we test the rest of the flags.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dccfb9ab2ab0969a2d8dc59af88fa0ce44eeb1d5.1631703764.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now completions are done from task context, that means that it's either
the task itself, task_work or io-wq worker. In all those cases the ctx
will be staying alive by mutexing, explicit referencing or req references
by iowq. Remove extra ctx pinning from io_req_complete_post().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/60a0e96434c16ab4fe587651448290d61ec9a113.1631703756.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Developers may need some uring info to help themselves debug and address
issues in production. This includes sqring/cqring head/tail and the
detailed sqe/cqe info, which is very useful when an application is hung
on a ring.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210913130854.38542-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't do io_submit_flush_completions() when there is no requests
enqueued, and every single caller checks for it. Hide that check into
the function not forgetting about inlining. That will make it much
easier for changing the empty check condition in the future.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d7ff8cef5da1b38e8ea648f5aad9a315ddfc7b57.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
->ios_left is only used to decide whether to plug or not, kill it to
avoid this extra accounting, just use the initial submission number.
There is no much difference in regards of enabling plugging, where this
one does it in a few more cases, but all major ones should be covered
well.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f13993bcf5b477f9a7d52881fc49f9457ea9870a.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While task_work_add() in io_workqueue_create() is true,
then duplicate code is executed:
-> clear_bit_unlock(0, &worker->create_state);
-> io_worker_release(worker);
-> atomic_dec(&acct->nr_running);
-> io_worker_ref_put(wq);
-> return false;
-> clear_bit_unlock(0, &worker->create_state); // back to io_workqueue_create()
-> io_worker_release(worker);
-> kfree(worker);
The io_worker_release() and clear_bit_unlock() are executed twice.
Fixes: 3146cba99a ("io-wq: make worker creation resilient against signals")
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Link: https://lore.kernel.org/r/20210911085847.34849-1-cuibixuan@huawei.com
Reviwed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
I recently had to look at a production problem where a request ended
up getting the dreaded -EINVAL error on submit. The most used and
hence useless of error codes, as it just tells you that something
was wrong with your request, but not more than that.
Let's dump the full sqe contents if we run into an issue failure,
that'll allow easier diagnosing of a wide variety of issues.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, we check the wb_err too early for directories, before all of
the unsafe child requests have been waited on. In order to fix that we
need to check the mapping->wb_err later nearer to the end of ceph_fsync.
We also have an overly-complex method for tracking errors after
blocklisting. The errors recorded in cleanup_session_requests go to a
completely separate field in the inode, but we end up reporting them the
same way we would for any other error (in fsync).
There's no real benefit to tracking these errors in two different
places, since the only reporting mechanism for them is in fsync, and
we'd need to advance them both every time.
Given that, we can just remove i_meta_err, and convert the places that
used it to instead just use mapping->wb_err instead. That also fixes
the original problem by ensuring that we do a check_and_advance of the
wb_err at the end of the fsync op.
Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/52864
Reported-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently when mounting, we may end up finding an existing superblock
that corresponds to a blocklisted MDS client. This means that the new
mount ends up being unusable.
If we've found an existing superblock with a client that is already
blocklisted, and the client is not configured to recover on its own,
fail the match. Ditto if the superblock has been forcibly unmounted.
While we're in here, also rename "other" to the more conventional "fsc".
Cc: stable@vger.kernel.org
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1901499
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If we open a file without read access and then pass the fd to a syscall
whose implementation calls kernel_read_file_from_fd(), we get a warning
from __kernel_read():
if (WARN_ON_ONCE(!(file->f_mode & FMODE_READ)))
This currently affects both finit_module() and kexec_file_load(), but it
could affect other syscalls in the future.
Link: https://lkml.kernel.org/r/20211007220110.600005-1-willy@infradead.org
Fixes: b844f0ecbc ("vfs: define kernel_copy_file_from_fd()")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Hao Sun <sunhao.th@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Starting with kernel 5.11 built with CONFIG_FORTIFY_SOURCE mouting an
ocfs2 filesystem with either o2cb or pcmk cluster stack fails with the
trace below. Problem seems to be that strings for cluster stack and
cluster name are not guaranteed to be null terminated in the disk
representation, while strlcpy assumes that the source string is always
null terminated. This causes a read outside of the source string
triggering the buffer overflow detection.
detected buffer overflow in strlen
------------[ cut here ]------------
kernel BUG at lib/string.c:1149!
invalid opcode: 0000 [#1] SMP PTI
CPU: 1 PID: 910 Comm: mount.ocfs2 Not tainted 5.14.0-1-amd64 #1
Debian 5.14.6-2
RIP: 0010:fortify_panic+0xf/0x11
...
Call Trace:
ocfs2_initialize_super.isra.0.cold+0xc/0x18 [ocfs2]
ocfs2_fill_super+0x359/0x19b0 [ocfs2]
mount_bdev+0x185/0x1b0
legacy_get_tree+0x27/0x40
vfs_get_tree+0x25/0xb0
path_mount+0x454/0xa20
__x64_sys_mount+0x103/0x140
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Link: https://lkml.kernel.org/r/20210929180654.32460-1-vvidic@valentin-vidic.from.hr
Signed-off-by: Valentin Vidic <vvidic@valentin-vidic.from.hr>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 6dbf7bb555 ("fs: Don't invalidate page buffers in
block_write_full_page()") uncovered a latent bug in ocfs2 conversion
from inline inode format to a normal inode format.
The code in ocfs2_convert_inline_data_to_extents() attempts to zero out
the whole cluster allocated for file data by grabbing, zeroing, and
dirtying all pages covering this cluster. However these pages are
beyond i_size, thus writeback code generally ignores these dirty pages
and no blocks were ever actually zeroed on the disk.
This oversight was fixed by commit 693c241a5f ("ocfs2: No need to zero
pages past i_size.") for standard ocfs2 write path, inline conversion
path was apparently forgotten; the commit log also has a reasoning why
the zeroing actually is not needed.
After commit 6dbf7bb555, things became worse as writeback code stopped
invalidating buffers on pages beyond i_size and thus these pages end up
with clean PageDirty bit but with buffers attached to these pages being
still dirty. So when a file is converted from inline format, then
writeback triggers, and then the file is grown so that these pages
become valid, the invalid dirtiness state is preserved,
mark_buffer_dirty() does nothing on these pages (buffers are already
dirty) but page is never written back because it is clean. So data
written to these pages is lost once pages are reclaimed.
Simple reproducer for the problem is:
xfs_io -f -c "pwrite 0 2000" -c "pwrite 2000 2000" -c "fsync" \
-c "pwrite 4000 2000" ocfs2_file
After unmounting and mounting the fs again, you can observe that end of
'ocfs2_file' has lost its contents.
Fix the problem by not doing the pointless zeroing during conversion
from inline format similarly as in the standard write path.
[akpm@linux-foundation.org: fix whitespace, per Joseph]
Link: https://lkml.kernel.org/r/20210930095405.21433-1-jack@suse.cz
Fixes: 6dbf7bb555 ("fs: Don't invalidate page buffers in block_write_full_page()")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: "Markov, Andrey" <Markov.Andrey@Dell.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A race is possible when a process exits, its VMAs are removed by
exit_mmap() and at the same time userfaultfd_writeprotect() is called.
The race was detected by KASAN on a development kernel, but it appears
to be possible on vanilla kernels as well.
Use mmget_not_zero() to prevent the race as done in other userfaultfd
operations.
Link: https://lkml.kernel.org/r/20210921200247.25749-1-namit@vmware.com
Fixes: 63b2d4174c ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl")
Signed-off-by: Nadav Amit <namit@vmware.com>
Tested-by: Li Wang <liwang@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the sb_bdev_nr_blocks helper instead of open coding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211018101130.1838532-31-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the sb_bdev_nr_blocks helper instead of open coding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211018101130.1838532-30-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the sb_bdev_nr_blocks helper instead of open coding it and clean up
ntfs_fill_super a bit by moving an assignment a little earlier that has
no negative side effects.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Anton Altaparmakov <anton@tuxera.com>
Link: https://lore.kernel.org/r/20211018101130.1838532-29-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the sb_bdev_nr_blocks helper instead of open coding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Link: https://lore.kernel.org/r/20211018101130.1838532-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the sb_bdev_nr_blocks helper instead of open coding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20211018101130.1838532-27-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the proper helper to read the block device size and remove two
cargo culted checks that can't be false.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211018101130.1838532-23-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the proper helper to read the block device size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20211018101130.1838532-22-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the proper helper to read the block device size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20211018101130.1838532-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the proper helper to read the block device size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Link: https://lore.kernel.org/r/20211018101130.1838532-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the proper helper to read the block device size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20211018101130.1838532-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the proper helper to read the block device size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20211018101130.1838532-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the proper helper to read the block device size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20211018101130.1838532-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the proper helper to read the block device size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20211018101130.1838532-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No need to convert from bdev to inode and back.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211018101130.1838532-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the proper helper to read the block device size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211018101130.1838532-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Wire up using an io_comp_batch for f_op->iopoll(). If the lower stack
supports it, we can handle high rates of polled IO more efficiently.
This raises the single core efficiency on my system from ~6.1M IOPS to
~6.6M IOPS running a random read workload at depth 128 on two gen2
Optane drives.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct io_comp_batch contains a list head and a completion handler, which
will allow completions to more effciently completed batches of IO.
For now, no functional changes in this patch, we just define the
io_comp_batch structure and add the argument to the file_operations iopoll
handler.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.
Use memset_startat() so memset() doesn't get confused about writing
beyond the destination member that is intended to be the starting point
of zeroing through the end of the struct.
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Turn iov_iter_fault_in_readable into a function that returns the number
of bytes not faulted in, similar to copy_to_user, instead of returning a
non-zero value when any of the requested pages couldn't be faulted in.
This supports the existing users that require all pages to be faulted in
as well as new users that are happy if any pages can be faulted in.
Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make
sure this change doesn't silently break things.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Turn fault_in_pages_{readable,writeable} into versions that return the
number of bytes not faulted in, similar to copy_to_user, instead of
returning a non-zero value when any of the requested pages couldn't be
faulted in. This supports the existing users that require all pages to
be faulted in as well as new users that are happy if any pages can be
faulted in.
Rename the functions to fault_in_{readable,writeable} to make sure
this change doesn't silently break things.
Neither of these functions is entirely trivial and it doesn't seem
useful to inline them, so move them to mm/gup.c.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.
Polling for the bio itself leads to a few advantages:
- the cookie construction can made entirely private in blk-mq.c
- the caller does not need to remember the request_queue and cookie
separately and thus sidesteps their lifetime issues
- keeping the device and the cookie inside the bio allows to trivially
support polling BIOs remapping by stacking drivers
- a lot of code to propagate the cookie back up the submission path can
be removed entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no point in sleeping for the expected I/O completion timeout
in the io_uring async polling model as we never poll for a specific
I/O.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Switch the boolean spin argument to blk_poll to passing a set of flags
instead. This will allow to control polling behavior in a more fine
grained way.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-10-hch@lst.de
[axboe: adapt to changed io_uring iopoll]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syscall-level code can't just poke into the details of the poll cookie,
which is private information of the block layer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012111226.760968-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If an iocb is split into multiple bios we can't poll for both. So don't
bother to even try to poll in that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The polling support in the legacy direct-io support is a little crufty.
It already doesn't support the asynchronous polling needed for io_uring
polling, and is hard to adopt to upcoming changes in the polling
interfaces. Given that all the major file systems already use the iomap
direct I/O code, just drop the polling support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move inode_to_bdi out of line to avoid having to include blkdev.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no need to pull blk-cgroup.h and thus blkdev.h in here, so
break the include chain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-cgroup.h pulls in blkdev.h and thus pretty much all the block
headers. Break this dependency chain by turning wbc_blkcg_css into a
macro and dropping the blk-cgroup.h include.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reimplement redirty_page_for_writepage() as a wrapper around
folio_redirty_for_writepage(). Account the number of pages in the
folio, add kernel-doc and move the prototype to writeback.h.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFsIe4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgjCD/9szxypMlfPv0ZuUo/mnWLt/qSI/tFkfDt2
4uwDnibObDUii9jw+2NBvIAHhSnotx+/nOvoI4DeSVahMx7+AbE/zZKHqAjLZDAS
Uc81SyDJLA7IYlNN0XL4cxyQo9LSHOESqLyEekAGvECBIwI/M+HIWIPOn7NCChOr
YU6gZImgO1ty+IrRWCS/QV1goOm6mtwHvzS3vtcrGCorApTACdAuuePDrYfsOAiO
btdhdSgSkkFg0L+fCsTpxVKCddUvP196wesNCEW/yAOXSaZKRX11hAb82LtlB6S1
Il4fLuJGlCHSdEFmgUkfIqHWUW8xDuKLu52NYb7aoU76xKEYE59HrMjEpo1NHOKS
iAZxr3BaTaCfPq0y3r/rCwBQhDVKz9vyXSuELoKqpf1CHCiFFmLYxE4qWM3A0GoE
0Y4DcYLroVWiobvsArsaMPNRQsWdl6nWqwlpGyaWrCP/cLlyaiNmmre0CEr3tf3s
u7nMiPMrk9NkRSYl9O14WlEduuR5Bng97ORVGYB+/Bm3x8VKLyO7VHL2OmpFKC23
Z07Pg9UrH/tm5Hh6VGDuxTYfbJ4iqoIrNYeA+zdxpMDYwozmXGfw90Zia+JgtDXS
Zt/oY9LYp6qJsVFcy+70lAl7EEQqfC9zFRbW3UZ6A7lDKNBgXKlBCpdvCliWqi05
UbWe4AEntQ==
=4wXI
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.15-2021-10-17' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Just a single fix for a wrong condition for grabbing a lock, a
regression in this merge window"
* tag 'io_uring-5.15-2021-10-17' of git://git.kernel.dk/linux-block:
io_uring: fix wrong condition to grab uring lock
Here are some small driver core fixes for 5.15-rc6, all of which have
been in linux-next for a while with no reported issues.
They include:
- kernfs negative dentry bugfix
- simple pm bus fixes to resolve reported issues
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYWvzFg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymfLQCfSCP698AAvoCgG0fOfLakFkw80h0AoKIVm3lk
t0GUdnplU18CjnO5M1Zj
=+dh9
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are some small driver core fixes for 5.15-rc6, all of which have
been in linux-next for a while with no reported issues.
They include:
- kernfs negative dentry bugfix
- simple pm bus fixes to resolve reported issues"
* tag 'driver-core-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
drivers: bus: Delete CONFIG_SIMPLE_PM_BUS
drivers: bus: simple-pm-bus: Add support for probing simple bus only devices
driver core: Reject pointless SYNC_STATE_ONLY device links
kernfs: don't create a negative dentry if inactive node exists
Currently, z_erofs_map_blocks_iter() returns whether extents are
compressed or not, and the decompression frontend gets the specific
algorithms then.
It works but not quite well in many aspests, for example:
- The decompression frontend has to deal with whether extents are
compressed or not again and lookup the algorithms if compressed.
It's duplicated and too detailed about the on-disk mapping.
- A new secondary compression head will be introduced later so that
each file can have 2 compression algorithms at most for different
type of data. It could increase the complexity of the decompression
frontend if still handled in this way;
- A new readmore decompression strategy will be introduced to get
better performance for much bigger pcluster and lzma, which needs
the specific algorithm in advance as well.
Let's look up compression algorithms in z_erofs_map_blocks_iter()
directly instead.
Link: https://lore.kernel.org/r/20211008200839.24541-2-xiang@kernel.org
Reviewed-by: Chao Yu <chao@kernel.org>
Reviewed-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
In order to support multi-layer container images, add multiple
device feature to EROFS. Two ways are available to use for now:
- Devices can be mapped into 32-bit global block address space;
- Device ID can be specified with the chunk indexes format.
Note that it assumes no extent would cross device boundary and mkfs
should take care of it seriously.
In the future, a dedicated device manager could be introduced then
thus extra devices can be automatically scanned by UUID as well.
Link: https://lore.kernel.org/r/20211014081010.43485-1-hsiangkao@linux.alibaba.com
Reviewed-by: Chao Yu <chao@kernel.org>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Previously, EROFS mount options are all in the basic types, so
erofs_fs_context can be directly copied with assignment. However,
when the multiple device feature is introduced, it's hard to handle
multiple device information like the other basic mount options.
Let's separate basic mount option usage from fs_context, thus
multiple device information can be handled gracefully then.
No logic changes.
Link: https://lore.kernel.org/r/20211007070224.12833-1-hsiangkao@linux.alibaba.com
Reviewed-by: Chao Yu <chao@kernel.org>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
I don't know if that Solaris behavior matters any more or if it's still
possible to look up that bug ID any more. The XFS behavior's definitely
still relevant, though; any but the most recent XFS filesystems will
lose the top bits.
Reported-by: Frank S. Filz <ffilzlnx@mindspring.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
smb2_validate_credit_charge() accesses fields in the SMB2 PDU body,
but until smb2_calc_size() is called the PDU has not yet been verified
to be large enough to access the PDU dynamic part length field.
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Ralph Boehme <slow@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add buffer validation for smb direct.
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
ksmbd limit read/write/trans buffer size not to exceed maximum 8MB.
And set the minimum value of max response buffer size to 64KB.
Windows client doesn't send session setup request if ksmbd set max
trans/read/write size lower than 64KB in smb2 negotiate.
It means windows allow at least 64 KB or more about this value.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
some memory leaks and panic. Also many minor fixes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEh0DEKNP0I9IjwfWEqbAzH4MkB7YFAmFoMFQACgkQqbAzH4Mk
B7bTtQ/+KiF48deefbEEExjfT8Mm76+JE0XkdCPT0bXkhVNpqhRLSOQBR2hg5A81
7SSFHNbSMzXxiXdh2KfcXbBmwdJtcH1N9tjwffC3zhMkCTcDKnmDczz/lo4rHd0g
zZ3rPBP9yPCZGxo3W804XRYOeqLclrGJPI3kWQen+Rln/cZIzJMaHRUkVI22OYwj
e0dSdtabFDxJbdewz9xcvycHrPpVlrZUsuib/ZHFu2XGtgKalccgfvwBy5cOrTVh
N1WSBGcoy0xQGRGLP0o2hN62N2Md7/+UwWjXY+Wz4i+4gmziGvGuk8Y5uiSLu7lS
EG12xlrUtwouf4QaeleQZLT9Ym5YU3EALtKpZxAQi6Rm4A8Z6EMNUq0WBHJcNP/u
MRJlfK7jC25GnIFQjZtU+eMX8BT8MgMeSriv9FIY86T3ijedfxxEbb/cMvUGm2Hn
3hoQelLCUkLSqTyMeZiAv507AJv5MjfMrSJ9r9f36OxDer3w84VCVcxDtyGH++CR
fbRNjHvz7gYG5L5qwsFgfxSC/z+hyUXi01RalbosojsRyvg/f1p+yMxvQ57DrltY
IfHrMGcd9FlUiijBGFvyWQoMAl/pb6EIym2IMxr9X+aXgPJiG/BhWLbmzU4MYUUP
1PwIOpN2vhtU2Z3bVzbecxfy/TWjBhKBYe9jW1AH8KSSvLZExjk=
=QUnM
-----END PGP SIGNATURE-----
Merge tag 'ntfs3_for_5.15' of git://github.com/Paragon-Software-Group/linux-ntfs3
Pull ntfs3 fixes from Konstantin Komarov:
"Use the new api for mounting as requested by Christoph.
Also fixed:
- some memory leaks and panic
- xfstests (tested on x86_64) generic/016 generic/021 generic/022
generic/041 generic/274 generic/423
- some typos, wrong returned error codes, dead code, etc"
* tag 'ntfs3_for_5.15' of git://github.com/Paragon-Software-Group/linux-ntfs3: (70 commits)
fs/ntfs3: Check for NULL pointers in ni_try_remove_attr_list
fs/ntfs3: Refactor ntfs_read_mft
fs/ntfs3: Refactor ni_parse_reparse
fs/ntfs3: Refactor ntfs_create_inode
fs/ntfs3: Refactor ntfs_readlink_hlp
fs/ntfs3: Rework ntfs_utf16_to_nls
fs/ntfs3: Fix memory leak if fill_super failed
fs/ntfs3: Keep prealloc for all types of files
fs/ntfs3: Remove unnecessary functions
fs/ntfs3: Forbid FALLOC_FL_PUNCH_HOLE for normal files
fs/ntfs3: Refactoring of ntfs_set_ea
fs/ntfs3: Remove locked argument in ntfs_set_ea
fs/ntfs3: Use available posix_acl_release instead of ntfs_posix_acl_release
fs/ntfs3: Check for NULL if ATTR_EA_INFO is incorrect
fs/ntfs3: Refactoring of ntfs_init_from_boot
fs/ntfs3: Reject mount if boot's cluster size < media sector size
fs/ntfs3: Refactoring lock in ntfs_init_acl
fs/ntfs3: Change posix_acl_equiv_mode to posix_acl_update_mode
fs/ntfs3: Pass flags to ntfs_set_ea in ntfs_set_acl_ex
fs/ntfs3: Refactor ntfs_get_acl_ex for better readability
...
The implementations of get_wchan() can be expensive. The only information
imparted here is whether or not a process is currently blocked in the
scheduler (and even this doesn't need to be exact). Avoid doing the
heavy lifting of stack walking and just report that information by using
task_is_running().
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211008111626.211281780@infradead.org
This reverts commit 152c432b12.
When a kernel address couldn't be symbolized for /proc/$pid/wchan, it
would leak the raw value, a potential information exposure. This is a
regression compared to the safer pre-v5.12 behavior.
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Vito Caputo <vcaputo@pengaru.com>
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20211008111626.090829198@infradead.org
Remove the few leftover instances of the xfs_dinode_t typedef.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Remove the few leftover instances of the xfs_dinode_t typedef.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Remove the few leftover instances of the xfs_dinode_t typedef.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Warn if we ever bump nlevels higher than the allowed maximum cursor
height.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When we're scanning for btree roots to rebuild the AG headers, make sure
that the proposed tree does not exceed the maximum height for that btree
type (and not just XFS_BTREE_MAXLEVELS).
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Since each btree type has its own precomputed maxlevels variable now,
use them instead of the generic XFS_BTREE_MAXLEVELS to check the level
of each per-AG btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Convert the on-stack scrub context, btree scrub context, and da btree
scrub context into a heap allocation so that we reduce stack usage and
gain the ability to handle tall btrees without issue.
Specifically, this saves us ~208 bytes for the dabtree scrub, ~464 bytes
for the btree scrub, and ~200 bytes for the main scrub context.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Get rid of this old typedef before we start changing other things.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The btree geometry computation function has an off-by-one error in that
it does not allow maximally tall btrees (nlevels == XFS_BTREE_MAXLEVELS).
This can result in repairs failing unnecessarily on very fragmented
filesystems. Subsequent patches to remove MAXLEVELS usage in favor of
the per-btree type computations will make this a much more likely
occurrence.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When log recovery tries to recover a transaction that had log intent
items attached to it, it has to save certain parts of the transaction
state (reservation, dfops chain, inodes with no automatic unlock) so
that it can finish single-stepping the recovered transactions before
finishing the chains.
This is done with the xfs_defer_ops_capture and xfs_defer_ops_continue
functions. Right now they open-code this functionality, so let's port
this to the formalized resource capture structure that we introduced in
the previous patch. This enables us to hold up to two inodes and two
buffers during log recovery, the same way we do for regular runtime.
With this patch applied, we'll be ready to support atomic extent swap
which holds two inodes; and logged xattrs which holds one inode and one
xattr leaf buffer.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Transaction users are allowed to flag up to two buffers and two inodes
for ownership preservation across a deferred transaction roll. Hoist
the variables and code responsible for this out of xfs_defer_trans_roll
so that we can use it for the defer capture mechanism.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
As Xiang mentioned, such path has no real impact to our current
decompression strategy, remove it directly. Also, update the return
value of z_erofs_lz4_decompress() to 0 if success to keep consistent
with LZMA which will return 0 as well for that case.
Link: https://lore.kernel.org/r/20211014065744.1787-1-zbestahu@gmail.com
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Grab uring lock when we are in io-worker rather than in the original
or system-wq context since we already hold it in these two situation.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Fixes: b66ceaf324 ("io_uring: move iopoll reissue into regular IO path")
Link: https://lore.kernel.org/r/20211014140400.50235-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add the check to validate compound response buffer.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
DataOffset and Length validation can be potencial 32bit overflow.
This patch fix it.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
* Requests except READ, WRITE, IOCTL, INFO, QUERY
DIRECOTRY, CANCEL must consume one credit.
* If client's granted credits are insufficient,
refuse to handle requests.
* Windows server 2016 or later grant up to 8192
credits to clients at once.
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add validation for request/response buffer size check in smb2_ioctl and
fsctl_copychunk() take copychunk_ioctl_req pointer and the other arguments
instead of smb2_ioctl_req structure and remove an unused smb2_ioctl_req
argument of fsctl_validate_negotiate_info.
Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Returning an undecorated integer is an age-old trope, but it's
not clear (even to previous experts in this code) that the only
valid return values are 1 and 0. These functions do not return
a negative errno, rpc_stat value, or a positive length.
Document there are only two valid return values by having
.pc_encode return only true or false.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The passed-in value of the "__be32 *p" parameter is now unused in
every server-side XDR encoder, and can be removed.
Note also that there is a line in each encoder that sets up a local
pointer to a struct xdr_stream. Passing that pointer from the
dispatcher instead saves one line per encoder function.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Refactor: Currently nfs4svc_encode_compoundres() relies on the NFS
dispatcher to pass in the buffer location of the COMPOUND status.
Instead, save that buffer location in struct nfsd4_compoundres.
The compound tag follows immediately after.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Returning an undecorated integer is an age-old trope, but it's
not clear (even to previous experts in this code) that the only
valid return values are 1 and 0. These functions do not return
a negative errno, rpc_stat value, or a positive length.
Document there are only two valid return values by having
.pc_decode return only true or false.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The passed-in value of the "__be32 *p" parameter is now unused in
every server-side XDR decoder, and can be removed.
Note also that there is a line in each decoder that sets up a local
pointer to a struct xdr_stream. Passing that pointer from the
dispatcher instead saves one line per decoder function.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmFkq/oACgkQxWXV+ddt
WDs10g//Qx27foBu0U3ovvsla0t8GgcqgzUyOx3zxed0MbOEQCtK6kqRHQ/I+9ap
1Ec5y4qQqBwfp1NKlYdU/EiKBQIYbJO/nYhVIrFI/EZL/7qJTwyjYjrOjG9zIMvy
2ekxuF/XVnM6p3hyRcuWMCxsossuK4XIkb0bSZrwk/nFA6nt+gbXR1oE94JitM8p
0pwjvSVqpdTmOAIU5+oQldqL/By7un/rv+o6OTD9sJqTdQ1UMlHVDaa9mD8aCsYk
XIiCYfkyo9rlbSAB5wmWuiAhske2xh7IXSr4l9mKxGOA0egbQAgmS1Zw3+Km7vFM
t+ji/4rTFPFd2yv/sLCEnMinuwvBr3mnEh6pDHR76RNrI4CoK/GHmZSf7XyqzV8W
QOftznNA9/nJInTULdhCDvNxbKhKKb+xeSP1L4uytnWc5am+WKOPLNkfczJUh3sq
WUORpaUxByDol6BMsdQJqPVJ7CH5YI8lQzuQFoUTXDCgeQUBE2wE1s3q+5Ma+dNZ
mamkfQim2R42nPk7RSQlFBeIyDBVBXWfSNvXNovrPFJyRmZqRWzh0nb3PS9VNnUy
6oCOCIT7XlM4Jwh4ZR21OT66RNQQ/2sLUOU/4838TOOdn00UVBrFObHQ+ll8rq74
Va9j0atj6iIn9c8lDQkqTek0pMDcmVGzb2MV6JA4BCbCL/lcGk8=
=u3qV
-----END PGP SIGNATURE-----
Merge tag 'for-5.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more error handling fixes, stemming from code inspection, error
injection or fuzzing"
* tag 'for-5.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix abort logic in btrfs_replace_file_extents
btrfs: check for error when looking up inode during dir entry replay
btrfs: unify lookup return value when dir entry is missing
btrfs: deal with errors when adding inode reference during log replay
btrfs: deal with errors when replaying dir entry during log replay
btrfs: deal with errors when checking if a dir entry exists during log replay
btrfs: update refs for any root except tree log roots
btrfs: unlock newly allocated extent buffer after error
In f2fs_balance_fs_bg(), it needs to check both NAT_ENTRIES and INO_ENTRIES
memory usage to decide whether we should skip background checkpoint, otherwise
we may always skip checking INO_ENTRIES memory usage, so that INO_ENTRIES may
potentially cause high memory footprint.
Fixes: 493720a485 ("f2fs: fix to avoid REQ_TIME and CP_TIME collision")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since active_logs can be set to 2 or 4 or NR_CURSEG_PERSIST_TYPE(6),
it cannot be set to NR_CURSEG_TYPE(8).
That is, whint_mode is always off.
Therefore, the condition is changed from NR_CURSEG_TYPE to NR_CURSEG_PERSIST_TYPE.
Cc: Chao Yu <chao@kernel.org>
Fixes: d0b9e42ab6 (f2fs: introduce inmem curseg)
Reported-by: tanghuan <tanghuan@vivo.com>
Signed-off-by: Keoseong Park <keosung.park@samsung.com>
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For kmalloc() allocations SLOB prepends the blocks with a 4-byte header,
and it puts the size of the allocated blocks in that header.
Blocks allocated with kmem_cache_alloc() allocations do not have that
header.
SLOB explodes when you allocate memory with kmem_cache_alloc() and then
try to free it with kfree() instead of kmem_cache_free().
SLOB will assume that there is a header when there is none, read some
garbage to size variable and corrupt the adjacent objects, which
eventually leads to hang or panic.
Let's make XFS work with SLOB by using proper free function.
Fixes: 9749fee83f ("xfs: enable the xfs_defer mechanism to process extents to free")
Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Use 2-factor argument multiplication form kvcalloc() instead of
kvzalloc().
Link: https://github.com/KSPP/linux/issues/162
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The reference counting issue happens in one exception handling
path of orangefs_mount(). When failing to allocate sb info, the
function forgets to decrease the refcount of sb increased by
sget(), causing a refcount leak.
Fix this issue by jumping to the label "free_sb_and_op" instead
of "free_op"
Signed-off-by: Chenyuan Mi <cymi20@fudan.edu.cn>
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
When op_alloc() returns NULL to new_op, no error return code of
orangefs_revalidate_lookup() is assigned.
To fix this bug, ret is assigned with -ENOMEM in this case.
Fixes: 8bb8aefd5a ("OrangeFS: Change almost all instances of the string PVFS2 to OrangeFS.")
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
The variable ret is being initialized with a value that is never read, it
is being updated later on. The assignment is redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Change argument from void* to struct REPARSE_DATA_BUFFER*
We copy data to buffer, so we can read it later in ntfs_read_mft.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Now ntfs_utf16_to_nls takes length as one of arguments.
If length of symlink > 255, then we tried to convert
length of symlink +- some random number.
Now 255 symbols limit was removed.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
In ntfs_init_fs_context we allocate memory in fc->s_fs_info.
In case of failed mount we must free it in ntfs_fill_super.
We can't do it in ntfs_fs_free, because ntfs_fs_free called
with fc->s_fs_info == NULL.
fc->s_fs_info became NULL in sget_fc.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Before we haven't kept prealloc for sparse files because we thought that
it will speed up create / write operations.
It lead to situation, when user reserved some space for sparse file,
filled volume, and wasn't able to write in reserved file.
With this commit we keep prealloc.
Now xfstest generic/274 pass.
Fixes: be71b5cba2 ("fs/ntfs3: Add attrib operations")
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
The block number in the quota tree on disk should be smaller than the
v2_disk_dqinfo.dqi_blocks. If the quota file was corrupted, we may be
allocating an 'allocated' block and that would lead to a loop in a tree,
which will probably trigger oops later. This patch adds a check for the
block number in the quota tree to prevent such potential issue.
Link: https://lore.kernel.org/r/20211008093821.1001186-2-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Partially revert commit 2ce209c42c ("NFS: Wait for requests that are
locked on the commit list"), since it can lead to deadlocks between
commit requests and nfs_join_page_group().
For now we should assume that any locked requests on the commit list are
either about to be removed and committed by another task, or the writes
they describe are about to be retransmitted. In either case, we should
not need to worry.
Fixes: 2ce209c42c ("NFS: Wait for requests that are locked on the commit list")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Generate a trace event whenever the NFS client modifies the size of
a file. These new events aid troubleshooting workloads that trigger
races around size updates.
There are four new trace points, all named nfs_size_something so
they are easy to grep for or enable as a group with a single glob.
Size updated on the server:
kworker/u24:10-194 [010] 369.939174: nfs_size_update: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899344277980615 cursize=250471 newsize=172083
Server-side size update reported via NFSv3 WCC attributes:
fsx-1387 [006] 380.760686: nfs_size_wcc: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899355909932456 cursize=146792 newsize=171216
File has been truncated locally:
fsx-1387 [007] 369.437421: nfs_size_truncate: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899231200117272 cursize=215244 newsize=0
File has been extended locally:
fsx-1387 [007] 369.439213: nfs_size_grow: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899343704248410 cursize=258048 newsize=262144
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up: TRACE_DEFINE_ENUM is unnecessary because the target
symbols are all C macros, not enums.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmFhB78ACgkQiiy9cAdy
T1Hh1Qv/WcouW4kZIgTUl8POfFcZbfNbJ982rjMTczZscVOZMTUfY+ETslKVqE/f
daOPeytTRqZGw9sI5C26Qnlv5BjjtCkNriM/E77grKEe56c689ZLNQx0LRxVPOKp
7mDyaEVgshscTOATLzGzpyZ4hJkhCJ7u/DVF/cwWVD0W6ESQ3Ws8EsLWyN8yGo2b
H1DLvAcMV40Gr+AdPVyEUCc0+IFmxddur4+ZQTgVoAK6ndyv7jEvp/55B7f8tho2
Foq4wPUPO5mkzy/CuU+20b7eV4owj4DTz0l7HM3Bft+lHkfIfBH5ZV+m4C+Jj0Ng
oLwdoZ8S3+zc+ca37ookhpawODIFdbuwW+l9CfcM1H84Orwm/wXkAM8GsiuNPHQ4
kZ6rFPrG/LCEpJPQzF2PKxyyWtNtH85WEZaGRg6j3GkwFkoW1ybLGhwJP6l2Nlat
hUYyF0rlZ+zPDu8gyrH9xjh0VXuQKuLDAWHfPj0oPxmsEyVqvSvwqfM/qE87Ncmw
Ft6VgPn3
=9ERH
-----END PGP SIGNATURE-----
Merge tag '5.15-rc4-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull ksmbd fixes from Steve French:
"Six fixes for the ksmbd kernel server, including two additional
overflow checks, a fix for oops, and some cleanup (e.g. remove dead
code for less secure dialects that has been removed)"
* tag '5.15-rc4-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix oops from fuse driver
ksmbd: fix version mismatch with out of tree
ksmbd: use buf_data_size instead of recalculation in smb3_decrypt_req()
ksmbd: remove the leftover of smb2.0 dialect support
ksmbd: check strictly data area in ksmbd_smb2_check_message()
ksmbd: add the check to vaildate if stream protocol length exceeds maximum value
The tracefs file system is by default mounted such that only root user can
access it. But there are legitimate reasons to create a group and allow
those added to the group to have access to tracing. By changing the
permissions of the tracefs mount point to allow access, it will allow
group access to the tracefs directory.
There should not be any real reason to allow all access to the tracefs
directory as it contains sensitive information. Have the default
permission of directories being created not have any OTH (other) bits set,
such that an admin that wants to give permission to a group has to first
disable all OTH bits in the file system.
Link: https://lkml.kernel.org/r/20210818153038.664127804@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Today when a signal is delivered with a handler of SIG_DFL whose
default behavior is to generate a core dump not only that process but
every process that shares the mm is killed.
In the case of vfork this looks like a real world problem. Consider
the following well defined sequence.
if (vfork() == 0) {
execve(...);
_exit(EXIT_FAILURE);
}
If a signal that generates a core dump is received after vfork but
before the execve changes the mm the process that called vfork will
also be killed (as the mm is shared).
Similarly if the execve fails after the point of no return the kernel
delivers SIGSEGV which will kill both the exec'ing process and because
the mm is shared the process that called vfork as well.
As far as I can tell this behavior is a violation of people's
reasonable expectations, POSIX, and is unnecessarily fragile when the
system is low on memory.
Solve this by making a userspace visible change to only kill a single
process/thread group. This is possible because Jann Horn recently
modified[1] the coredump code so that the mm can safely be modified
while the coredump is happening. With LinuxThreads long gone I don't
expect anyone to have a notice this behavior change in practice.
To accomplish this move the core_state pointer from mm_struct to
signal_struct, which allows different thread groups to coredump
simultatenously.
In zap_threads remove the work to kill anything except for the current
thread group.
v2: Remove core_state from the VM_BUG_ON_MM print to fix
compile failure when CONFIG_DEBUG_VM is enabled.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
[1] a07279c9a8 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3")
History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Link: https://lkml.kernel.org/r/87y27mvnke.fsf@disp2133
Link: https://lkml.kernel.org/r/20211007144701.67592574@canb.auug.org.au
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Error injection testing uncovered a case where we'd end up with a
corrupt file system with a missing extent in the middle of a file. This
occurs because the if statement to decide if we should abort is wrong.
The only way we would abort in this case is if we got a ret !=
-EOPNOTSUPP and we called from the file clone code. However the
prealloc code uses this path too. Instead we need to abort if there is
an error, and the only error we _don't_ abort on is -EOPNOTSUPP and only
if we came from the clone file code.
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At replay_one_name(), we are treating any error from btrfs_lookup_inode()
as if the inode does not exists. Fix this by checking for an error and
returning it to the caller.
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_lookup_dir_index_item() and btrfs_lookup_dir_item() lookup for dir
entries and both are used during log replay or when updating a log tree
during an unlink.
However when the dir item does not exists, btrfs_lookup_dir_item() returns
NULL while btrfs_lookup_dir_index_item() returns PTR_ERR(-ENOENT), and if
the dir item exists but there is no matching entry for a given name or
index, both return NULL. This makes the call sites during log replay to
be more verbose than necessary and it makes it easy to miss this slight
difference. Since we don't need to distinguish between those two cases,
make btrfs_lookup_dir_index_item() always return NULL when there is no
matching directory entry - either because there isn't any dir entry or
because there is one but it does not match the given name and index.
Also rename the argument 'objectid' of btrfs_lookup_dir_index_item() to
'index' since it is supposed to match an index number, and the name
'objectid' is not very good because it can easily be confused with an
inode number (like the inode number a dir entry points to).
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At __inode_add_ref(), we treating any error returned from
btrfs_lookup_dir_item() or from btrfs_lookup_dir_index_item() as meaning
that there is no existing directory entry in the fs/subvolume tree.
This is not correct since we can get errors such as, for example, -EIO
when reading extent buffers while searching the fs/subvolume's btree.
So fix that and return the error to the caller when it is not -ENOENT.
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At replay_one_one(), we are treating any error returned from
btrfs_lookup_dir_item() or from btrfs_lookup_dir_index_item() as meaning
that there is no existing directory entry in the fs/subvolume tree.
This is not correct since we can get errors such as, for example, -EIO
when reading extent buffers while searching the fs/subvolume's btree.
So fix that and return the error to the caller when it is not -ENOENT.
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently inode_in_dir() ignores errors returned from
btrfs_lookup_dir_index_item() and from btrfs_lookup_dir_item(), treating
any errors as if the directory entry does not exists in the fs/subvolume
tree, which is obviously not correct, as we can get errors such as -EIO
when reading extent buffers while searching the fs/subvolume's tree.
Fix that by making inode_in_dir() return the errors and making its only
caller, add_inode_ref(), deal with returned errors as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I hit a stuck relocation on btrfs/061 during my overnight testing. This
turned out to be because we had left over extent entries in our extent
root for a data reloc inode that no longer existed. This happened
because in btrfs_drop_extents() we only update refs if we have SHAREABLE
set or we are the tree_root. This regression was introduced by
aeb935a455 ("btrfs: don't set SHAREABLE flag for data reloc tree")
where we stopped setting SHAREABLE for the data reloc tree.
The problem here is we actually do want to update extent references for
data extents in the data reloc tree, in fact we only don't want to
update extent references if the file extents are in the log tree.
Update this check to only skip updating references in the case of the
log tree.
This is relatively rare, because you have to be running scrub at the
same time, which is what btrfs/061 does. The data reloc inode has its
extents pre-allocated, and then we copy the extent into the
pre-allocated chunks. We theoretically should never be calling
btrfs_drop_extents() on a data reloc inode. The exception of course is
with scrub, if our pre-allocated extent falls inside of the block group
we are scrubbing, then the block group will be marked read only and we
will be forced to cow that extent. This means we will call
btrfs_drop_extents() on that range when we COW that file extent.
This isn't really problematic if we do this, the data reloc inode
requires that our extent lengths match exactly with the extent we are
copying, thankfully we validate the extent is correct with
get_new_location(), so if we happen to COW only part of the extent we
won't link it in when we do the relocation, so we are safe from any
other shenanigans that arise because of this interaction with scrub.
Fixes: aeb935a455 ("btrfs: don't set SHAREABLE flag for data reloc tree")
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmFfE4oACgkQ+7dXa6fL
C2txTBAAnWlEssljz7x09A/I9Js155U2hW9oDSoqkUxqZSe05oBbTPNycURvXAGZ
wZhNZdD5Xc4ITjLmPQQclgkfWc+deq6UKzw8E58XmjiO1Uq6WcqUsC95M1USAmaM
nRyhGrYRxJbv5eRDx3Ox3yoLntlSzvX1ZLhWr6DgAnb9uCdIWSGgy34XTd3aOSZa
OEtPR/tvBZygxMV9wsflD2GNNLe7QDrOMUnvFSlmxBOUolclbHj9uhB/fQXN7frN
Q/nf5QluBqZK13CIbiKSPy0wfl/hEdSFsOs5jAgMGm4IsZjSpsw2lvzxlfEaI7U/
QzNHpqAc0ynPI9fbvs2LTkNFR1oe+njOIVvu0QMjOXEdnyOGEbFjX5eDNiKSAih4
R3cNh2T16yUsx99lVbGkJAwbBQTmdp2yvfugQVX5qDNi+Ln8TFUKUHgruUv/FYJw
hUjcOL6cjGdWORpWkxSoEariA6zDjKCWiyMu5w2yzSufI+DJ0AI6MQVOeqaX6dm6
EldlxDO3w7uvXmwpH1RZsHXCqWfyiHn4P5LsSuVy/wM2O/VemaGQuHsxnLtMMJ+q
HGniSziE6LAvF0RvBrngFGhAY6rqMIGzXK/+S1Z/YwM9+tYnoYhbANDhjmywrcI5
GWaKePV5giTXlaI/XertjzEpQ2yo8r2HkYoVowV3NaRNrc3qgnQ=
=X7mM
-----END PGP SIGNATURE-----
Merge tag 'misc-fixes-20211007' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull netfslib, cachefiles and afs fixes from David Howells:
- Fix another couple of oopses in cachefiles tracing stemming from the
possibility of passing in a NULL object pointer
- Fix netfs_clear_unread() to set READ on the iov_iter so that source
it is passed to doesn't do the wrong thing (some drivers look at the
flag on iov_iter rather than other available information to determine
the direction)
- Fix afs_launder_page() to write back at the correct file position on
the server so as not to corrupt data
* tag 'misc-fixes-20211007' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix afs_launder_page() to set correct start file position
netfs: Fix READ/WRITE confusion when calling iov_iter_xarray()
cachefiles: Fix oops with cachefiles_cull() due to NULL object
Marios reported kernel oops from fuse driver when ksmbd call
mark_inode_dirty(). This patch directly update ->i_ctime after removing
mark_inode_ditry() and notify_change will put inode to dirty list.
Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Reported-by: Marios Makassikis <mmakassikis@freebox.fr>
Tested-by: Marios Makassikis <mmakassikis@freebox.fr>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fix version mismatch with out of tree, This updated version will be
matched with ksmbd-tools.
Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Tom suggested to use buf_data_size that is already calculated, to verify
these offsets.
Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Suggested-by: Tom Talpey <tom@talpey.com>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Although ksmbd doesn't send SMB2.0 support in supported dialect list of smb
negotiate response, There is the leftover of smb2.0 dialect.
This patch remove it not to support SMB2.0 in ksmbd.
Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When invalid data offset and data length in request,
ksmbd_smb2_check_message check strictly and doesn't allow to process such
requests.
Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Reviewed-by: Ralph Boehme <slow@samba.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
If nfsd has existing listening sockets without any processes, then an error
returned from svc_create_xprt() for an additional transport will remove
those existing listeners. We're seeing this in practice when userspace
attempts to create rpcrdma transports without having the rpcrdma modules
present before creating nfsd kernel processes. Fix this by checking for
existing sockets before calling nfsd_destroy().
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Rename coredump_exit_mm to coredump_task_exit and call it from do_exit
before PTRACE_EVENT_EXIT, and before any cleanup work for a task
happens. This ensures that an accurate copy of the process can be
captured in the coredump as no cleanup for the process happens before
the coredump completes. This also ensures that PTRACE_EVENT_EXIT
will not be visited by any thread until the coredump is complete.
Add a new flag PF_POSTCOREDUMP so that tasks that have passed through
coredump_task_exit can be recognized and ignored in zap_process.
Now that all of the coredumping happens before exit_mm remove code to
test for a coredump in progress from mm_release.
Replace "may_ptrace_stop()" with a simple test of "current->ptrace".
The other tests in may_ptrace_stop all concern avoiding stopping
during a coredump. These tests are no longer necessary as it is now
guaranteed that fatal_signal_pending will be set if the code enters
ptrace_stop during a coredump. The code in ptrace_stop is guaranteed
not to stop if fatal_signal_pending returns true.
Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call
ptrace_stop without fatal_signal_pending being true, as signals are
dequeued in get_signal before calling do_exit. This is no longer
an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached
until after the coredump completes.
Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Separate the coredump logic from the ordinary exit_mm logic
by moving the coredump logic out of exit_mm into it's own
function coredump_exit_mm.
Link: https://lkml.kernel.org/r/87a6k2x277.fsf@disp2133
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Prevent exec continuing when a fatal signal is pending by replacing
mmap_read_lock with mmap_read_lock_killable. This is always the right
thing to do as userspace will never observe an exec complete when
there is a fatal signal pending.
With that change it becomes unnecessary to explicitly test for a core
dump in progress. In coredump_wait zap_threads arranges under
mmap_write_lock for all tasks that use a mm to also have SIGKILL
pending, which means mmap_read_lock_killable will always return -EINTR
when old_mm->core_state is present.
Link: https://lkml.kernel.org/r/87fstux27w.fsf@disp2133
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This patch add MAX_STREAM_PROT_LEN macro and check if stream protocol
length exceeds maximum value. opencode pdu size check in
ksmbd_pdu_size_has_room().
Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmFcDYUACgkQ+7dXa6fL
C2uVYA/9H4dNBL20cHC+SyPMy7FGdBzaW/d/ahzHw2GO1JipDmofZLqeshOXR/Va
E3VcHb+RzpZNCjYflFD44mWiIYLhqGtDstbcSn/Eqlij4Mmtq4cj7AbYV753gRnX
OItms9dez6IEkCjE3gMpHE9VPWtFv/48qzuMY2X9lyBFnYVSdHyBwr3Dk/wHfieP
w07DBKptv2mrMV1mz9ghCU0VlgrjJcxJBzHaFHxIg3e/cVXVGpbH5ZWmh7OIJkC2
ae7tKvQSDEwZHSO0cQiNKlyVt5v3iNBY7C3IjWFunbPvaBTiq8DpRXOx6vsPVeOI
K+8+HwKdi1attN7KCjatYJLgiLkzOWr9H1rQHTDsa0QRfOwPR9KOA1arm9mnrcMP
I7SZ7oKZ6SasKe5GrbP3IBdWmDxSUUXgzZmapr+eh0cwF9gVRp+rivlokBdmE01c
aneG3FlGGIk+KRJl371XsAd3VlZguk66/r3GOLpor7+Z0DcXfvvpx5bJmmQCMvyW
UjSFarn+FQwbk//29sXRGyYRjx7Nf7LjE7yu9Mq3s97SQpvZd8Z1WI9LTibUd6nd
eC1eDqtaf6bEG5VFDFwfgCww7zGsWqES7HaaoaRRTQoiMog52GIDxiEYTepoGKto
ufiLNYd8AAOQ4z+5aRwbq2bUbld2KI0TN3X8/7dlEP5xKFf4TN4=
=kSrV
-----END PGP SIGNATURE-----
Merge tag 'warning-fixes-20211005' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull misc fs warning fixes from David Howells:
"The first four patches fix kerneldoc warnings in fscache, afs, 9p and
nfs - they're mostly just comment changes, though there's one place in
9p where a comment got detached from the function it was attached to
(v9fs_fid_add) and has to switch places with a function that got
inserted between (__add_fid).
The patch on the end removes an unused symbol in fscache"
* tag 'warning-fixes-20211005' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
fscache: Remove an unused static variable
fscache: Fix some kerneldoc warnings shown up by W=1
9p: Fix a bunch of kerneldoc warnings shown up by W=1
afs: Fix kerneldoc warning shown up by W=1
nfs: Fix kerneldoc warning shown up by W=1
If one ends up expanding on this line checkpatch will complain that the
combination S_IRWXU|S_IRUGO|S_IXUGO should just be replaced with the
octal 0755. Do that.
This makes no functional changes.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20210927163805.808907-9-mcgrof@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We don't need ntfs_xattr_get_acl and ntfs_xattr_set_acl.
There are ntfs_get_acl_ex and ntfs_set_acl_ex.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
If one ends up extending this line checkpatch will complain about the
use of S_IRWXUGO suggesting it is not preferred and that 0777
should be used instead. Take the tip from checkpatch and do that
change before we do our subsequent changes.
This makes no functional changes.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20210927163805.808907-8-mcgrof@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
FALLOC_FL_PUNCH_HOLE isn't allowed with normal files.
Filesystem must remember info about hole, but for normal file
we can only zero it and forget.
Fixes: 4342306f0f ("fs/ntfs3: Add file operations and implementation")
Now xfstests generic/016 generic/021 generic/022 pass.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
/proc/uptime reports idle time by reading the CPUTIME_IDLE field from
the per-cpu kcpustats. However, on NO_HZ systems, idle time is not
continually updated on idle cpus, leading this value to appear
incorrectly small.
/proc/stat performs an accounting update when reading idle time; we
can use the same approach for uptime.
With this patch, /proc/stat and /proc/uptime now agree on idle time.
Additionally, the following shows idle time tick up consistently on an
idle machine:
(while true; do cat /proc/uptime; sleep 1; done) | awk '{print $2-prev; prev=$2}'
Reported-by: Luigi Rizzo <lrizzo@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lkml.kernel.org/r/20210827165438.3280779-1-joshdon@google.com
Make code more readable.
Don't try to read zero bytes.
Add warning when size of exteneded attribute exceeds limit.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
We always need to lock now, because locks became smaller
(see d562e901f2
"fs/ntfs3: Move ni_lock_dir and ni_unlock into ntfs_create_inode").
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
We don't need to maintain ntfs_posix_acl_release.
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYVrm7wAKCRDh3BK/laaZ
PCTFAQDESDvQTW4tLj8jdb9ELVCekNtgU6bVJCgUWqR6M1iZcgD8DiqinvagWhEN
K8R/ayWMaPSMA3NEDTlLHy8Bx2iqow4=
=PvJ5
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"Fix two bugs, both of them corner cases not affecting most users"
* tag 'ovl-fixes-5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix IOCB_DIRECT if underlying fs doesn't support direct IO
ovl: fix missing negative dentry check in ovl_rename()
Since the openat2(2) syscall uses a struct open_how pointer to communicate
its parameters they are not usefully recorded by the audit SYSCALL record's
four existing arguments.
Add a new audit record type OPENAT2 that reports the parameters in its
third argument, struct open_how with fields oflag, mode and resolve.
The new record in the context of an event would look like:
time->Wed Mar 17 16:28:53 2021
type=PROCTITLE msg=audit(1616012933.531:184): proctitle=
73797363616C6C735F66696C652F6F70656E617432002F746D702F61756469742D
7465737473756974652D737641440066696C652D6F70656E617432
type=PATH msg=audit(1616012933.531:184): item=1 name="file-openat2"
inode=29 dev=00:1f mode=0100600 ouid=0 ogid=0 rdev=00:00
obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE
cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=PATH msg=audit(1616012933.531:184):
item=0 name="/root/rgb/git/audit-testsuite/tests"
inode=25 dev=00:1f mode=040700 ouid=0 ogid=0 rdev=00:00
obj=unconfined_u:object_r:user_tmp_t:s0 nametype=PARENT
cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1616012933.531:184):
cwd="/root/rgb/git/audit-testsuite/tests"
type=OPENAT2 msg=audit(1616012933.531:184):
oflag=0100302 mode=0600 resolve=0xa
type=SYSCALL msg=audit(1616012933.531:184): arch=c000003e syscall=437
success=yes exit=4 a0=3 a1=7ffe315f1c53 a2=7ffe315f1550 a3=18
items=2 ppid=528 pid=540 auid=0 uid=0 gid=0 euid=0 suid=0
fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 ses=1 comm="openat2"
exe="/root/rgb/git/audit-testsuite/tests/syscalls_file/openat2"
subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
key="testsuite-1616012933-bjAUcEPO"
Link: https://lore.kernel.org/r/d23fbb89186754487850367224b060e26f9b7181.1621363275.git.rgb@redhat.com
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
[PM: tweak subject, wrap example, move AUDIT_OPENAT2 to 1337]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Replace uses of mem_encrypt_active() with calls to cc_platform_has() with
the CC_ATTR_MEM_ENCRYPT attribute.
Remove the implementation of mem_encrypt_active() across all arches.
For s390, since the default implementation of the cc_platform_has()
matches the s390 implementation of mem_encrypt_active(), cc_platform_has()
does not need to be implemented in s390 (the config option
ARCH_HAS_CC_PLATFORM is not set).
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-9-bp@alien8.de
It's been reported that doing stress test for module insertion and
removal can result in an ENOENT from libkmod for a valid module.
In kernfs_iop_lookup() a negative dentry is created if there's no kernfs
node associated with the dentry or the node is inactive.
But inactive kernfs nodes are meant to be invisible to the VFS and
creating a negative dentry for these can have unexpected side effects
when the node transitions to an active state.
The point of creating negative dentries is to avoid the expensive
alloc/free cycle that occurs if there are frequent lookups for kernfs
attributes that don't exist. So kernfs nodes that are not yet active
should not result in a negative dentry being created so when they
transition to an active state VFS lookups can create an associated
dentry is a natural way.
It's also been reported that https://github.com/osandov/blktests.git
test block/001 hangs during the test. It was suggested that recent
changes to blktests might have caused it but applying this patch
resolved the problem without change to blktests.
Fixes: c7e7c04274 ("kernfs: use VFS negative dentry caching")
Tested-by: Yi Zhang <yi.zhang@redhat.com>
ACKed-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ian Kent <raven@themaw.net>
Link: https://lore.kernel.org/r/163330943316.19450.15056895533949392922.stgit@mickey.themaw.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
_nfs4_pnfs_v3/v4_ds_connect do
some work
smp_wmb
ds->ds_clp = clp;
And nfs4_ff_layout_prepare_ds currently does
smp_rmb
if(ds->ds_clp)
...
This patch places the smp_rmb after the if. This ensures that following
reads only happen once nfs4_ff_layout_prepare_ds has checked that data
has been properly initialized.
Fixes: d67ae825a5 ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The original premise in commit 83672d392f ("NFS: Fix directory caching
problem - with test case and patch.") was that readdirplus was caching
attribute information and replaying it later. This is no longer the
case.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the directory changed while we were revalidating the dentry, then
don't update the dentry verifier. There is no value in setting the
verifier to an older value, and we could end up overwriting a more up to
date verifier from a parallel revalidation.
Fixes: efeda80da3 ("NFSv4: Fix revalidation of dentries with delegations")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
If a user is doing 'ls -l', we have a heuristic in GETATTR that tells
the readdir code to try to use READDIRPLUS in order to refresh the inode
attributes. In certain cirumstances, we also try to invalidate the
remaining directory entries in order to ensure this refresh.
If there are multiple readers of the directory, we probably should avoid
invalidating the page cache, since the heuristic breaks down in that
situation anyway.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
The check for duplicate readdir cookies should only care if the change
attribute is invalid or the data cache is invalid.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
If we want to revalidate the directory, then just mark the change
attribute as invalid.
Fixes: 13c0b082b6 ("NFS: Replace use of NFS_INO_REVAL_PAGECACHE when checking cache validity")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
NFS_INO_DATA_INVAL_DEFER and NFS_INO_INVALID_DATA should be considered
mutually exclusive.
Fixes: 1c341b7775 ("NFS: Add deferred cache invalidation for close-to-open consistency violations")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Both NFSv3 and NFSv2 generate their change attribute from the ctime
value that was supplied by the server. However the problem is that there
are plenty of servers out there with ctime resolutions of 1ms or worse.
In a modern performance system, this is insufficient when trying to
decide which is the most recent set of attributes when, for instance, a
READ or GETATTR call races with a WRITE or SETATTR.
For this reason, let's revert to labelling the NFSv2/v3 change
attributes as NFS4_CHANGE_TYPE_IS_UNDEFINED. This will ensure we protect
against such races.
Fixes: 7b24dacf08 ("NFS: Another inode revalidation improvement")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
NFS4_CREATE_EXCLUSIVE does not allow the caller to set an access mode,
so for most Linux filesystems, the access call ends up returning no
permissions. However both NFS4_CREATE_EXCLUSIVE4_1 and
NFS4_CREATE_GUARDED allow the client to set the access mode.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the cached credential exists but doesn't have any expiration callback
then exit early.
Fix up atomicity issues when replacing the credential with a new one
since the existing code could lead to refcount leaks.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
After the success of an operation such as rmdir() or unlink(), we expect
to add the dentry back to the dcache as an ordinary negative dentry.
However in NFS, unless it is labelled with the appropriate verifier for
the parent directory state, then nfs_lookup_revalidate will end up
discarding that dentry and forcing a new lookup.
The fix is to ensure that we relabel the dentry appropriately on
success.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
After the success of an operation such as link(), or symlink(), we
expect to add the dentry back to the dcache as an ordinary positive
dentry.
However in NFS, unless it is labelled with the appropriate verifier for
the parent directory state, then nfs_lookup_revalidate will end up
discarding that dentry and forcing a new lookup.
The fix is to ensure that we relabel the dentry appropriately on
success.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In commit b212921b13 ("elf: don't use MAP_FIXED_NOREPLACE for elf
executable mappings") we still leave MAP_FIXED_NOREPLACE in place for
load_elf_interp.
Unfortunately, this will cause kernel to fail to start with:
1 (init): Uhuuh, elf segment at 00003ffff7ffd000 requested but the memory is mapped already
Failed to execute /init (error -17)
The reason is that the elf interpreter (ld.so) has overlapping segments.
readelf -l ld-2.31.so
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x000000000002c94c 0x000000000002c94c R E 0x10000
LOAD 0x000000000002dae0 0x000000000003dae0 0x000000000003dae0
0x00000000000021e8 0x0000000000002320 RW 0x10000
LOAD 0x000000000002fe00 0x000000000003fe00 0x000000000003fe00
0x00000000000011ac 0x0000000000001328 RW 0x10000
The reason for this problem is the same as described in commit
ad55eac74f ("elf: enforce MAP_FIXED on overlaying elf segments").
Not only executable binaries, elf interpreters (e.g. ld.so) can have
overlapping elf segments, so we better drop MAP_FIXED_NOREPLACE and go
back to MAP_FIXED in load_elf_interp.
Fixes: 4ed2863951 ("fs, elf: drop MAP_FIXED usage from elf_map")
Cc: <stable@vger.kernel.org> # v4.19
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chen Jingwen <chenjingwen6@huawei.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
allocation. Also fix error handling code paths in ext4_dx_readdir()
and ext4_fill_super(). Finally, avoid a grabbing a journal head in
the delayed allocation write in the common cases where we are
overwriting an pre-existing block or appending to an inode.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmFZ2SsACgkQ8vlZVpUN
gaN6DAgAkIeisL1EfQT0VwshEs8y7N6IoX8dydLSRLpNf5oWiJOv2CTY9Qpi6X/C
qNfuLsbJ2NXChvhIAM2hD82hvX21rYc6iqPxgho02VF4eYIP7NzLjwTFKnKbHPB5
TiF498nJTnkcmSrJUEXmSAEdLoCwa5THH9+9HVHXZrkLXPULBtOOJ85mDAcIzVhV
Zqb7yfbpWl0gnF0S0YjNATPtbhcC9EiC4MOVYVesRlgT9B3+k5q4fmVU0euTU9OH
F2H6TNG+Mg/19gTnDP5acB9+eXHvYEqMpe+CaDifR9iFE9PTG/Edhxr6z9roXhHr
kBvEVHSFH+YTEJXghnpS9YDd9Lwc9w==
=WKzd
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Fix a number of ext4 bugs in fast_commit, inline data, and delayed
allocation.
Also fix error handling code paths in ext4_dx_readdir() and
ext4_fill_super().
Finally, avoid a grabbing a journal head in the delayed allocation
write in the common cases where we are overwriting a pre-existing
block or appending to an inode"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: recheck buffer uptodate bit under buffer lock
ext4: fix potential infinite loop in ext4_dx_readdir()
ext4: flush s_error_work before journal destroy in ext4_fill_super
ext4: fix loff_t overflow in ext4_max_bitmap_size()
ext4: fix reserved space counter leakage
ext4: limit the number of blocks in one ADD_RANGE TLV
ext4: enforce buffer head state assertion in ext4_da_map_blocks
ext4: remove extent cache entries when truncating inline data
ext4: drop unnecessary journal handle in delalloc write
ext4: factor out write end code of inline file
ext4: correct the error path of ext4_write_inline_data_end()
ext4: check and update i_disksize properly
ext4: add error checking to ext4_ext_replay_set_iblocks()
Here are some driver core and kernfs fixes for reported issues for
5.15-rc4. These fixes include:
- kernfs positive dentry bugfix
- debugfs_create_file_size error path fix
- cpumask sysfs file bugfix to preserve the user/kernel abi (has
been reported multiple times.)
- devlink fixes for mdiobus devices as reported by the subsystem
maintainers.
Also included in here are some devlink debugging changes to make it
easier for people to report problems when asked. They have already
helped with the mdiobus and other subsystems reporting issues.
All of these have been linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYVmC1A8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykfCQCgtQ0kdPoPdUG6mPCn45rrbJQLBY4AnRL5JyhO
zB60l6C2EHHnLnRxMnQq
=gygB
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are some driver core and kernfs fixes for reported issues for
5.15-rc4. These fixes include:
- kernfs positive dentry bugfix
- debugfs_create_file_size error path fix
- cpumask sysfs file bugfix to preserve the user/kernel abi (has been
reported multiple times.)
- devlink fixes for mdiobus devices as reported by the subsystem
maintainers.
Also included in here are some devlink debugging changes to make it
easier for people to report problems when asked. They have already
helped with the mdiobus and other subsystems reporting issues.
All of these have been linux-next for a while with no reported issues"
* tag 'driver-core-5.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
kernfs: also call kernfs_set_rev() for positive dentry
driver core: Add debug logs when fwnode links are added/deleted
driver core: Create __fwnode_link_del() helper function
driver core: Set deferred probe reason when deferred by driver core
net: mdiobus: Set FWNODE_FLAG_NEEDS_CHILD_BOUND_ON_ADD for mdiobus parents
driver core: fw_devlink: Add support for FWNODE_FLAG_NEEDS_CHILD_BOUND_ON_ADD
driver core: fw_devlink: Improve handling of cyclic dependencies
cpumask: Omit terminating null byte in cpumap_print_{list,bitmask}_to_buf
debugfs: debugfs_create_file_size(): use IS_ERR to check for error
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmFXwIYACgkQiiy9cAdy
T1EWfQv/fSSoymQpFZrnzd0ELS7J14IvPbjpL5wWLWUdtrHLIk5Fcg1rxNXDZVsY
o932TGAo/X3qEgMXWbD812Q4MoB+Rupj0NmHReLL+UxwrgCHexFVnzr0SH0YQfWA
59xa+2BVzInqnejika1H4HewJqGKt6npGiAg0Rzx+nJiFlX0CAPupW8oC90UM5Co
3vJNG4orZILlGLRIdMpSashW8Z5dbXY95k/VqF/vYqHgfy37L1m+pDCjRjEbXtFY
fuqFGeAcsnRWnu6ECvuujTyh+hQMSdwb/5F6uovVUrdChdvbfWi+rjtHDx0HpD2t
UKMnRQPk/BWD1/6zFriObCt4QDpufZdvLlVNyir4BdIT2OhzkwkZ4qXz/du+IWKm
4/4nYEaD2lFN4pEAy73NGGt9eJrAjbnaswNDPTZIDpJ7IyZiakDFjOD8iLFncBS7
xL6hUcMvc4njaqxcB9LHFZ8w67cjwR5aw0+wr8DKfCh13lJgSvxoXEP/D+4fxINv
QULcIhF/
=X733
-----END PGP SIGNATURE-----
Merge tag '5.15-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull ksmbd server fixes from Steve French:
"Eleven fixes for the ksmbd kernel server, mostly security related:
- an important fix for disabling weak NTLMv1 authentication
- seven security (improved buffer overflow checks) fixes
- fix for wrong infolevel struct used in some getattr/setattr paths
- two small documentation fixes"
* tag '5.15-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: missing check for NULL in convert_to_nt_pathname()
ksmbd: fix transform header validation
ksmbd: add buffer validation for SMB2_CREATE_CONTEXT
ksmbd: add validation in smb2 negotiate
ksmbd: add request buffer validation in smb2_set_info
ksmbd: use correct basic info level in set_file_basic_info()
ksmbd: remove NTLMv1 authentication
ksmbd: fix documentation for 2 functions
MAINTAINERS: rename cifs_common to smbfs_common in cifs and ksmbd entry
ksmbd: fix invalid request buffer access in compound
ksmbd: remove RFC1002 check in smb2 request
Refactor.
Now that the NFSv2 and NFSv3 XDR decoders have been converted to
use xdr_streams, the WRITE decoder functions can use
xdr_stream_subsegment() to extract the WRITE payload into its own
xdr_buf, just as the NFSv4 WRITE XDR decoder currently does.
That makes it possible to pass the first kvec, pages array + length,
page_base, and total payload length via a single function parameter.
The payload's page_base is not yet assigned or used, but will be in
subsequent patches.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Pointer ni is being initialized with plain integer zero. Fix
this by initializing with NULL.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Most of the fields in 'struct knfsd_fh' are 2 levels deep (a union and a
struct) and are accessed using macros like:
#define fh_FOO fh_base.fh_new.fb_FOO
This patch makes the union and struct anonymous, so that "fh_FOO" can be
a name directly within 'struct knfsd_fh' and the #defines aren't needed.
The file handle as a whole is sometimes accessed as "fh_base" or
"fh_base.fh_pad", neither of which are particularly helpful names.
As the struct holding the filehandle is now anonymous, we
cannot use the name of that, so we union it with 'fh_raw' and use that
where the raw filehandle is needed. fh_raw also ensure the structure is
large enough for the largest possible filehandle.
fh_raw is a 'char' array, removing any need to cast it for memcpy etc.
SVCFH_fmt() is simplified using the "%ph" printk format. This
changes the appearance of filehandles in dprintk() debugging, making
them a little more precise.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Filehandles not in the "new" or "version 1" format have not been handed
out for new mounts since Linux 2.4 which was released 20 years ago.
I think it is safe to say that no such file handles are still in use,
and that we can drop support for them.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
A small part of the declaration concerning filehandle format are
currently in the "uapi" include directory:
include/uapi/linux/nfsd/nfsfh.h
There is a lot more to the filehandle format, including "enum fid_type"
and "enum nfsd_fsid" which are not exported via "uapi".
This small part of the filehandle definition is of minimal use outside
of the kernel, and I can find no evidence that an other code is using
it. Certainly nfs-utils and wireshark (The most likely candidates) do not
use these declarations.
So move it out of "uapi" by copying the content from
include/uapi/linux/nfsd/nfsfh.h
into
fs/nfsd/nfsfh.h
A few unnecessary "#include" directives are not copied, and neither is
the #define of fh_auth, which is annotated as being for userspace only.
The copyright claims in the uapi file are identical to those in the nfsd
file, so there is no need to copy those.
The "__u32" style integer types are only needed in "uapi". In
kernel-only code we can use the more familiar "u32" style.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFXvEwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjALD/4/SZyJD+VcPM5Zx7kp2JMLm5GM6ZFnRTpI
ZZoRsWnKZ39AAXS4GvGWGb+lEqI7LkLUy0SeUWbK2G3TIdKDU1Sza15knie3uzQ0
Np/No3kv/2vl49gY64A/IVhcqDaHCVXZjRLe2cIK6kULmEH5AwNuTMvYS84g0+38
XCDaWqzGbwTvtJ1RBjxlvXwl9EpmLwY+fr57KOxRgjIvfe68vZcSQBKPVJro3uUS
2toSfiZuRIBk8pIwZ+WCnEv0zbvicjaQVFL6kGpANO4zMsBfkgjeB2Bz+sS8vtfB
pQ1RR24cl3vtoGwYOLOBeVMxyaekkBxruXYmOogRNLq0Uq+EHJn2Wu0ifvk5dw9u
yMXFUEC2TPPV4ah8lFRkVmU6bVeOVmaTn/Yp2nQRR830PkZVNST4vcXeJTNpHa2k
b2uKPCJlkQzpq1ea29FbJ6RN/IrTOcTog+RWKXF4A3PMEkdUPP64a/lFtmMbIu9S
mg00P1d3qA6rmrOX+Igw6zbtxyEn1pVEjexFfi49dQMM1wphpQPOHq3EgcQi/UDM
SbKo6f6RsYDhlte8YVz70KSjW1HwTbepm4z6Zuts7Lbbbu1LUJrO9yf/LMEIJxaH
no2fv82BPTSHUqWh0aqiVW+2DthaX3oZtQe6HpK2+SpV+Rre/RktnEFTN2JuSNYI
DzCNRAJa+A==
=qP3G
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.15-2021-10-01' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Two fixes in here:
- The signal issue that was discussed start of this week (me).
- Kill dead fasync support in io_uring. Looks like it was broken
since io_uring was initially merged, and given that nobody has ever
complained about it, let's just kill it (Pavel)"
* tag 'io_uring-5.15-2021-10-01' of git://git.kernel.dk/linux-block:
io_uring: kill fasync
io-wq: exclusively gate signal based exit on get_signal() return
We have never supported fasync properly, it would only fire when there
is something polling io_uring making it useless. The original support came
in through the initial io_uring merge for 5.1. Since it's broken and
nobody has reported it, get rid of the fasync bits.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2f7ca3d344d406d34fa6713824198915c41cea86.1633080236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 8e33fadf94 ("ext4: remove an unnecessary if statement in
__ext4_get_inode_loc()") forget to recheck buffer's uptodate bit again
under buffer lock, which may overwrite the buffer if someone else have
already brought it uptodate and changed it.
Fixes: 8e33fadf94 ("ext4: remove an unnecessary if statement in __ext4_get_inode_loc()")
Cc: stable@kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210910080316.70421-1-yi.zhang@huawei.com
When ext4_htree_fill_tree() fails, ext4_dx_readdir() can run into an
infinite loop since if info->last_pos != ctx->pos this will reset the
directory scan and reread the failing entry. For example:
1. a dx_dir which has 3 block, block 0 as dx_root block, block 1/2 as
leaf block which own the ext4_dir_entry_2
2. block 1 read ok and call_filldir which will fill the dirent and update
the ctx->pos
3. block 2 read fail, but we has already fill some dirent, so we will
return back to userspace will a positive return val(see ksys_getdents64)
4. the second ext4_dx_readdir will reset the world since info->last_pos
!= ctx->pos, and will also init the curr_hash which pos to block 1
5. So we will read block1 too, and once block2 still read fail, we can
only fill one dirent because the hash of the entry in block1(besides
the last one) won't greater than curr_hash
6. this time, we forget update last_pos too since the read for block2
will fail, and since we has got the one entry, ksys_getdents64 can
return success
7. Latter we will trapped in a loop with step 4~6
Cc: stable@kernel.org
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210914111415.3921954-1-yangerkun@huawei.com
The error path in ext4_fill_super forget to flush s_error_work before
journal destroy, and it may trigger the follow bug since
flush_stashed_error_work can run concurrently with journal destroy
without any protection for sbi->s_journal.
[32031.740193] EXT4-fs (loop66): get root inode failed
[32031.740484] EXT4-fs (loop66): mount failed
[32031.759805] ------------[ cut here ]------------
[32031.759807] kernel BUG at fs/jbd2/transaction.c:373!
[32031.760075] invalid opcode: 0000 [#1] SMP PTI
[32031.760336] CPU: 5 PID: 1029268 Comm: kworker/5:1 Kdump: loaded
4.18.0
[32031.765112] Call Trace:
[32031.765375] ? __switch_to_asm+0x35/0x70
[32031.765635] ? __switch_to_asm+0x41/0x70
[32031.765893] ? __switch_to_asm+0x35/0x70
[32031.766148] ? __switch_to_asm+0x41/0x70
[32031.766405] ? _cond_resched+0x15/0x40
[32031.766665] jbd2__journal_start+0xf1/0x1f0 [jbd2]
[32031.766934] jbd2_journal_start+0x19/0x20 [jbd2]
[32031.767218] flush_stashed_error_work+0x30/0x90 [ext4]
[32031.767487] process_one_work+0x195/0x390
[32031.767747] worker_thread+0x30/0x390
[32031.768007] ? process_one_work+0x390/0x390
[32031.768265] kthread+0x10d/0x130
[32031.768521] ? kthread_flush_work_fn+0x10/0x10
[32031.768778] ret_from_fork+0x35/0x40
static int start_this_handle(...)
BUG_ON(journal->j_flags & JBD2_UNMOUNT); <---- Trigger this
Besides, after we enable fast commit, ext4_fc_replay can add work to
s_error_work but return success, so the latter journal destroy in
ext4_load_journal can trigger this problem too.
Fix this problem with two steps:
1. Call ext4_commit_super directly in ext4_handle_error for the case
that called from ext4_fc_replay
2. Since it's hard to pair the init and flush for s_error_work, we'd
better add a extras flush_work before journal destroy in
ext4_fill_super
Besides, this patch will call ext4_commit_super in ext4_handle_error for
any nojournal case too. But it seems safe since the reason we call
schedule_work was that we should save error info to sb through journal
if available. Conversely, for the nojournal case, it seems useless delay
commit superblock to s_error_work.
Fixes: c92dc85684 ("ext4: defer saving error info from atomic context")
Fixes: 2d01ddc866 ("ext4: save error info to sb through journal if available")
Cc: stable@kernel.org
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210924093917.1953239-1-yangerkun@huawei.com
We should use unsigned long long rather than loff_t to avoid
overflow in ext4_max_bitmap_size() for comparison before returning.
w/o this patch sbi->s_bitmap_maxbytes was becoming a negative
value due to overflow of upper_limit (with has_huge_files as true)
Below is a quick test to trigger it on a 64KB pagesize system.
sudo mkfs.ext4 -b 65536 -O ^has_extents,^64bit /dev/loop2
sudo mount /dev/loop2 /mnt
sudo echo "hello" > /mnt/hello -> This will error out with
"echo: write error: File too large"
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/594f409e2c543e90fd836b78188dfa5c575065ba.1622867594.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When ext4_insert_delayed block receives and recovers from an error from
ext4_es_insert_delayed_block(), e.g., ENOMEM, it does not release the
space it has reserved for that block insertion as it should. One effect
of this bug is that s_dirtyclusters_counter is not decremented and
remains incorrectly elevated until the file system has been unmounted.
This can result in premature ENOSPC returns and apparent loss of free
space.
Another effect of this bug is that
/sys/fs/ext4/<dev>/delayed_allocation_blocks can remain non-zero even
after syncfs has been executed on the filesystem.
Besides, add check for s_dirtyclusters_counter when inode is going to be
evicted and freed. s_dirtyclusters_counter can still keep non-zero until
inode is written back in .evict_inode(), and thus the check is delayed
to .destroy_inode().
Fixes: 51865fda28 ("ext4: let ext4 maintain extent status tree")
Cc: stable@kernel.org
Suggested-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210823061358.84473-1-jefflexu@linux.alibaba.com
Now EXT4_FC_TAG_ADD_RANGE uses ext4_extent to track the
newly-added blocks, but the limit on the max value of
ee_len field is ignored, and it can lead to BUG_ON as
shown below when running command "fallocate -l 128M file"
on a fast_commit-enabled fs:
kernel BUG at fs/ext4/ext4_extents.h:199!
invalid opcode: 0000 [#1] SMP PTI
CPU: 3 PID: 624 Comm: fallocate Not tainted 5.14.0-rc6+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:ext4_fc_write_inode_data+0x1f3/0x200
Call Trace:
? ext4_fc_write_inode+0xf2/0x150
ext4_fc_commit+0x93b/0xa00
? ext4_fallocate+0x1ad/0x10d0
ext4_sync_file+0x157/0x340
? ext4_sync_file+0x157/0x340
vfs_fsync_range+0x49/0x80
do_fsync+0x3d/0x70
__x64_sys_fsync+0x14/0x20
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Simply fixing it by limiting the number of blocks
in one EXT4_FC_TAG_ADD_RANGE TLV.
Fixes: aa75f4d3da ("ext4: main fast-commit commit path")
Cc: stable@kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210820044505.474318-1-houtao1@huawei.com
The kmalloc() does not have a NULL check. This code can be re-written
slightly cleaner to just use the kstrdup().
Fixes: 265fd1991c ("ksmbd: use LOOKUP_BENEATH to prevent the out of share access")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
RFC3530 notes that the 'dircount' field may be zero, in which case the
recommendation is to ignore it, and only enforce the 'maxcount' field.
In RFC5661, this recommendation to ignore a zero valued field becomes a
requirement.
Fixes: aee3776441 ("nfsd4: fix rd_dircount enforcement")
Cc: <stable@vger.kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Remove ntfs_sb_info members sector_size and sector_bits.
Print details why mount failed.
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
If we continue to work in this case, then we can corrupt fs.
Fixes: 82cae269cf ("fs/ntfs3: Add initialization of super block").
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
init_nfsd() should not unregister pernet subsys if the register fails
but should instead unwind from the last successful operation which is
register_filesystem().
Unregistering a failed register_pernet_subsys() call can result in
a kernel GPF as revealed by programmatically injecting an error in
register_pernet_subsys().
Verified the fix handled failure gracefully with no lingering nfsd
entry in /proc/filesystems. This change was introduced by the commit
bd5ae9288d ("nfsd: register pernet ops last, unregister first"),
the original error handling logic was correct.
Fixes: bd5ae9288d ("nfsd: register pernet ops last, unregister first")
Cc: stable@vger.kernel.org
Signed-off-by: Patrick Ho <Patrick.Ho@netapp.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Validate that the transform and smb request headers are present
before checking OriginalMessageSize and SessionId fields.
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Tom Talpey <tom@talpey.com>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add buffer validation for SMB2_CREATE_CONTEXT.
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Reviewed-by: Ralph Boehme <slow@samba.org>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
This patch add validation to check request buffer check in smb2
negotiate and fix null pointer deferencing oops in smb3_preauth_hash_rsp()
that found from manual test.
Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Ralph Boehme <slow@samba.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add buffer validation in smb2_set_info, and remove unused variable
in set_file_basic_info. and smb2_set_info infolevel functions take
structure pointer argument.
Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Reviewed-by: Ralph Boehme <slow@samba.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Use correct basic info level in set/get_file_basic_info().
Reviewed-by: Ralph Boehme <slow@samba.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Remove insecure NTLMv1 authentication.
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Reviewed-by: Tom Talpey <tom@talpey.com>
Acked-by: Steve French <smfrench@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
ksmbd_kthread_fn() and create_socket() returns 0 or error code, and not
task_struct/ERR_PTR.
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
A KMSAN warning is reported by Alexander Potapenko:
BUG: KMSAN: uninit-value in kernfs_dop_revalidate+0x61f/0x840
fs/kernfs/dir.c:1053
kernfs_dop_revalidate+0x61f/0x840 fs/kernfs/dir.c:1053
d_revalidate fs/namei.c:854
lookup_dcache fs/namei.c:1522
__lookup_hash+0x3a6/0x590 fs/namei.c:1543
filename_create+0x312/0x7c0 fs/namei.c:3657
do_mkdirat+0x103/0x930 fs/namei.c:3900
__do_sys_mkdir fs/namei.c:3931
__se_sys_mkdir fs/namei.c:3929
__x64_sys_mkdir+0xda/0x120 fs/namei.c:3929
do_syscall_x64 arch/x86/entry/common.c:51
It seems a positive dentry in kernfs becomes a negative dentry directly
through d_delete() in vfs_rmdir(). dentry->d_time is uninitialized
when accessing it in kernfs_dop_revalidate(), because it is only
initialized when created as negative dentry in kernfs_iop_lookup().
The problem can be reproduced by the following command:
cd /sys/fs/cgroup/pids && mkdir hi && stat hi && rmdir hi && stat hi
A simple fixes seems to be initializing d->d_time for positive dentry
in kernfs_iop_lookup() as well. The downside is the negative dentry
will be revalidated again after it becomes negative in d_delete(),
because the revison of its parent must have been increased due to
its removal.
Alternative solution is implement .d_iput for kernfs, and assign d_time
for the newly-generated negative dentry in it. But we may need to
take kernfs_rwsem to protect again the concurrent kernfs_link_sibling()
on the parent directory, it is a little over-killing. Now the simple
fix is chosen.
Link: https://marc.info/?l=linux-fsdevel&m=163249838610499
Fixes: c7e7c04274 ("kernfs: use VFS negative dentry caching")
Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20210928140750.1274441-1-houtao1@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix an integer overflow when computing the Merkle tree layout of
extremely large files, exposed by btrfs adding support for fs-verity.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYVKxQBQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK0q7AQCRVYl9e6gOPduntU6zNfYxYiJAmGRQ
9jekhtPwFnuhLgEAnFxW3B51bG5c+Yv3xBBbDRpflk+39gd39eUOqRtlPQ4=
=tPA9
-----END PGP SIGNATURE-----
Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fsverity fix from Eric Biggers:
"Fix an integer overflow when computing the Merkle tree layout of
extremely large files, exposed by btrfs adding support for fs-verity"
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fs-verity: fix signed integer overflow with i_size near S64_MAX
Normally the check at open time suffices, but e.g loop device does set
IOCB_DIRECT after doing its own checks (which are not sufficent for
overlayfs).
Make sure we don't call the underlying filesystem read/write method with
the IOCB_DIRECT if it's not supported.
Reported-by: Huang Jianan <huangjianan@oppo.com>
Fixes: 16914e6fc7 ("ovl: add ovl_read_iter()")
Cc: <stable@vger.kernel.org> # v4.19
Tested-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Commit 9d682ea6bc ("vboxsf: Fix the check for the old binary
mount-arguments struct") was meant to fix a build error due to sign
mismatch in 'char' and the use of character constants, but it just moved
the error elsewhere, in that on some architectures characters and signed
and on others they are unsigned, and that's just how the C standard
works.
The proper fix is a simple "don't do that then". The code was just
being silly and odd, and it should never have cared about signed vs
unsigned characters in the first place, since what it is testing is not
four "characters", but four bytes.
And the way to compare four bytes is by using "memcmp()".
Which compilers will know to just turn into a single 32-bit compare with
a constant, as long as you don't have crazy debug options enabled.
Link: https://lore.kernel.org/lkml/20210927094123.576521-1-arnd@kernel.org/
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
io-wq threads block all signals, except SIGKILL and SIGSTOP. We should not
need any extra checking of signal_pending or fatal_signal_pending, rely
exclusively on whether or not get_signal() tells us to exit.
The original debugging of this issue led to the false positive that we
were exiting on non-fatal signals, but that is not the case. The issue
was around races with nr_workers accounting.
Fixes: 87c1696655 ("io-wq: ensure we exit if thread group is exiting")
Fixes: 15e20db2e0 ("io-wq: only exit on fatal signals")
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reinforce that page flags are actually in the head page by changing the
type from page to folio. Increases the size of cachefiles by two bytes,
but the kernel core is unchanged in size.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: David Howells <dhowells@redhat.com>
wait_on_page_writeback_killable() only has one caller, so convert it to
call folio_wait_writeback_killable(). For the wait_on_page_writeback()
callers, add a compatibility wrapper around folio_wait_writeback().
Turning PageWriteback() into folio_test_writeback() eliminates a call
to compound_head() which saves 8 bytes and 15 bytes in the two
functions. Unfortunately, that is more than offset by adding the
wait_on_page_writeback compatibility wrapper for a net increase in text
of 7 bytes.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Howells <dhowells@redhat.com>
There aren't any actual callers of lock_page_async(), so remove it.
Convert filemap_update_page() to call __folio_lock_async().
__folio_lock_async() is 21 bytes smaller than __lock_page_async(),
but the real savings come from using a folio in filemap_update_page(),
shrinking it from 515 bytes to 404 bytes, saving 110 bytes. The text
shrinks by 132 bytes in total.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Ronnie reported invalid request buffer access in chained command when
inserting garbage value to NextCommand of compound request.
This patch add validation check to avoid this issue.
Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Tested-by: Steve French <smfrench@gmail.com>
Reviewed-by: Steve French <smfrench@gmail.com>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
In smb_common.c you have this function : ksmbd_smb_request() which
is called from connection.c once you have read the initial 4 bytes for
the next length+smb2 blob.
It checks the first byte of this 4 byte preamble for valid values,
i.e. a NETBIOSoverTCP SESSION_MESSAGE or a SESSION_KEEP_ALIVE.
We don't need to check this for ksmbd since it only implements SMB2
over TCP port 445.
The netbios stuff was only used in very old servers when SMB ran over
TCP port 139.
Now that we run over TCP port 445, this is actually not a NB header anymore
and you can just treat it as a 4 byte length field that must be less
than 16Mbyte. and remove the references to the RFC1002 constants that no
longer applies.
Cc: Tom Talpey <tom@talpey.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmFPfPIACgkQiiy9cAdy
T1FXSAv/Y/iJPj48F/rO2FX5aoUgK2All0dmQ2VOOJ96HuKOVH+9oTv60fbVoAjI
16zOXRhMGQHjLNHWuJp+n4OOM3pgOyg3MaLReIaIVnHe7z8G5WpmMID+QfIJhFWM
PfS+nr9S6sHwfspOIX4AwVzsYozj8vNWsUJKGLcG/d1u67ipemAOmiXMrD+4P+1Q
sO0v0rVcuOy3tYMXYpDhOQBWxK2R8DeWCs+NTradntnWWsfHUUfuhj7G66AphjDh
HzH9TuR+M5gOrLKpHDCWlE1u3MoTGURBXZC6xB4zHMZy+W5hM6PVNrO/89GdLD89
nXa1ZxvNSwrJsc6JArExwKcz+sKM172ui/BY5iR7YEuchVUd8u3L0yeThx/2H7Ri
9C/k9oNnql3hB8vX9yeqrt8Rr6G0IGJ/nlS+aBdbTKqB/t3cNGFImL3X0BeBsxjD
umkeY92wVP0ve7hLSOPSQnNoWo4p+eN215tRRSCu+1DjAqJA1HyD+PqibzeWTAdK
FT+CF/Ko
=wDpw
-----END PGP SIGNATURE-----
Merge tag '5.15-rc2-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull ksmbd fixes from Steve French:
"Five fixes for the ksmbd kernel server, including three security
fixes:
- remove follow symlinks support
- use LOOKUP_BENEATH to prevent out of share access
- SMB3 compounding security fix
- fix for returning the default streams correctly, fixing a bug when
writing ppt or doc files from some clients
- logging more clearly that ksmbd is experimental (at module load
time)"
* tag '5.15-rc2-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: use LOOKUP_BENEATH to prevent the out of share access
ksmbd: remove follow symlinks support
ksmbd: check protocol id in ksmbd_verify_smb_message()
ksmbd: add default data stream name in FILE_STREAM_INFORMATION
ksmbd: log that server is experimental at module load
Merge misc fixes from Andrew Morton:
"16 patches.
Subsystems affected by this patch series: xtensa, sh, ocfs2, scripts,
lib, and mm (memory-failure, kasan, damon, shmem, tools, pagecache,
debug, and pagemap)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: fix uninitialized use in overcommit_policy_handler
mm/memory_failure: fix the missing pte_unmap() call
kasan: always respect CONFIG_KASAN_STACK
sh: pgtable-3level: fix cast to pointer from integer of different size
mm/debug: sync up latest migrate_reason to migrate_reason_names
mm/debug: sync up MR_CONTIG_RANGE and MR_LONGTERM_PIN
mm: fs: invalidate bh_lrus for only cold path
lib/zlib_inflate/inffast: check config in C to avoid unused function warning
tools/vm/page-types: remove dependency on opt_file for idle page tracking
scripts/sorttable: riscv: fix undeclared identifier 'EM_RISCV' error
ocfs2: drop acl cache for directories too
mm/shmem.c: fix judgment error in shmem_is_huge()
xtensa: increase size of gcc stack frame check
mm/damon: don't use strnlen() with known-bogus source length
kasan: fix Kconfig check of CC_HAS_WORKING_NOSANITIZE_ADDRESS
mm, hwpoison: add is_free_buddy_page() in HWPoisonHandlable()
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFPKzYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplzrD/9eAeKdjpehV+o8HuLC9EvaBmp8jjaSEwJq
6S1CrCuIeCY9C0BfyZTaRjdS9Oo6VwCfuRhzEOq99QqTx1kPS/2fGqhymKAzMCGy
zPBLMkcUsKtRVAJEA7qGHJl3CaCn7NrHOb9cAM150/vTS96GYjuaFWGbBKsGbbXu
ORSsaeVLcwFyGwWSf5+v+tkORoSiKEfwgUy+y5xwkiFxStcVHsLNxQ/1HtfEIjdl
j3KBsW9eqM59lAOmv0YPGZ5gCN7VXelqaI/SmZxH4Y5XDDk86vtl0A7l3B5HSURX
6njZakejtvclqv7p8eb+oB2EDqgsjgPZE9uJduLqtRcLqdE7koy096drzfMsACfu
NKCnltY56iDEKWCOg1DL2O4D1/qpcZE9rj14WJtwxeauHQd7uVwTOHdOHw3OLz3X
xpEasZ2IG3h6qzXrAuAeBIbpp0KIwc03p/vBQTOFqXTxqV+x2c8J3oM4QHeU6nU/
+5clqOuCdKMxWMmBti7cTA9HHupWZnB47P7BHRRL4z/19ysH29Fc7xVLnaB2WPeG
1WmvD0DpeDgLnUfzohSh+vTU4XtR4UEjHp8S4HfrzQM7xxmweQNX3Ph/KCPZOe8F
T1vSFFBiae0IHx3aX+vGvoTmznrEJ279jNArcA693VB9utANjVzvwMtlmnzN8DtQ
wxhF4oXqXw==
=jnhx
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.15-2021-09-25' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"This one looks a bit bigger than it is, but that's mainly because 2/3
of it is enabling IORING_OP_CLOSE to close direct file descriptors.
We've had a few folks using them and finding it confusing that the way
to close them is through using -1 for file update, this just brings
API symmetry for direct descriptors. Hence I think we should just do
this now and have a better API for 5.15 release. There's some room for
de-duplicating the close code, but we're leaving that for the next
merge window.
Outside of that, just small fixes:
- Poll race fixes (Hao)
- io-wq core dump exit fix (me)
- Reschedule around potentially intensive tctx and buffer iterators
on teardown (me)
- Fix for always ending up punting files update to io-wq (me)
- Put the provided buffer meta data under memcg accounting (me)
- Tweak for io_write(), removing dead code that was added with the
iterator changes in this release (Pavel)"
* tag 'io_uring-5.15-2021-09-25' of git://git.kernel.dk/linux-block:
io_uring: make OP_CLOSE consistent with direct open
io_uring: kill extra checks in io_write()
io_uring: don't punt files update to io-wq unconditionally
io_uring: put provided buffer meta data under memcg accounting
io_uring: allow conditional reschedule for intensive iterators
io_uring: fix potential req refcount underflow
io_uring: fix missing set of EPOLLONESHOT for CQ ring overflow
io_uring: fix race between poll completion and cancel_hash insertion
io-wq: ensure we exit if thread group is exiting
- fix the dangling pointer use in erofs_lookup tracepoint;
- fix unsupported chunk format check;
- zero out compacted_2b if compacted_4b_initial > totalidx.
-----BEGIN PGP SIGNATURE-----
iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYU9CBxEceGlhbmdAa2Vy
bmVsLm9yZwAKCRA5NzHcH7XmBDgBAQDaj1NWjIleK4Q7hoerl++6MMhzEJrmpxSE
EENs9NPuiQEAp7dN0T05a2J+Szp5xJeLYg67LoYbAnDmbmzGH/jQQg0=
=6e/B
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"Two bugfixes to fix the 4KiB blockmap chunk format availability and a
dangling pointer usage. There is also a trivial cleanup to clarify
compacted_2b if compacted_4b_initial > totalidx.
Summary:
- fix the dangling pointer use in erofs_lookup tracepoint
- fix unsupported chunk format check
- zero out compacted_2b if compacted_4b_initial > totalidx"
* tag 'erofs-for-5.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: clear compacted_2b if compacted_4b_initial > totalidx
erofs: fix misbehavior of unsupported chunk format check
erofs: fix up erofs_lookup tracepoint
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmFOOyoACgkQiiy9cAdy
T1FntAv/YGILEUdiKs364//L15K9rInUKo/FPLAg9THBvG+vGJd6HXD30RZNQBRl
030BPcA3AfchrPPTW2Lf3LyEHYHarRI3RE6E5o0++9XodxmjoukhG0ogUnevLZYl
IyQ0VwDBb0BhzY9CQHPc1cqbCqeabGHjF8+aPs2bN1G4wspW4EdUoc6xxZM1xKzi
XCbCSgRQKMA4qOsmExX5Jfl3nwjIuiBzcV5HAvn9F6WCdILH9/ltjwXunZBWhI5Z
aLRrEpMwFyCrNr3B59XPGmGXe5mw0guQzCZDcgMDHGXIVbY/IRZPTIw0pDJqkwvo
XfHljjba9FlYDNxc4fgY1J+ez7zRmS0eJzP2wZw9UkoF0Sb0nYmlZmGPnGADqIJ6
YBH7rKz+L6cVVa+zFRMEWcNguOADzZNMo3xuHriXy5lzKQpgT6dwxN6JUNfhvnfq
vHzm8tjaFFlQQ47cmTiW6QDiMcyQtJGvP36v8PvxEilH0KUGozZGrsG69aI+sFoB
a9XjbLhT
=SK4S
-----END PGP SIGNATURE-----
Merge tag '5.15-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Six small cifs/smb3 fixes, two for stable:
- important fix for deferred close (found by a git functional test)
related to attribute caching on close.
- four (two cosmetic, two more serious) small fixes for problems
pointed out by smatch via Dan Carpenter
- fix for comment formatting problems pointed out by W=1"
* tag '5.15-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix incorrect check for null pointer in header_assemble
smb3: correct server pointer dereferencing check to be more consistent
smb3: correct smb3 ACL security descriptor
cifs: Clear modified attribute bit from inode flags
cifs: Deal with some warnings from W=1
cifs: fix a sign extension bug
instead of removing '..' in a given path, call
kern_path with LOOKUP_BENEATH flag to prevent
the out of share access.
ran various test on this:
smb2-cat-async smb://127.0.0.1/homes/../out_of_share
smb2-cat-async smb://127.0.0.1/homes/foo/../../out_of_share
smbclient //127.0.0.1/homes -c "mkdir ../foo2"
smbclient //127.0.0.1/homes -c "rename bar ../bar"
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Boehme <slow@samba.org>
Tested-by: Steve French <smfrench@gmail.com>
Tested-by: Namjae Jeon <linkinjeon@kernel.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The kernel test robot reported the regression of fio.write_iops[1] with
commit 8cc621d2f4 ("mm: fs: invalidate BH LRU during page migration").
Since lru_add_drain is called frequently, invalidate bh_lrus there could
increase bh_lrus cache miss ratio, which needs more IO in the end.
This patch moves the bh_lrus invalidation from the hot path( e.g.,
zap_page_range, pagevec_release) to cold path(i.e., lru_add_drain_all,
lru_cache_disable).
Zhengjun Xing confirmed
"I test the patch, the regression reduced to -2.9%"
[1] https://lore.kernel.org/lkml/20210520083144.GD14190@xsang-OptiPlex-9020/
[2] 8cc621d2f4, mm: fs: invalidate BH LRU during page migration
Link: https://lkml.kernel.org/r/20210907212347.1977686-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Tested-by: "Xing, Zhengjun" <zhengjun.xing@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_data_convert_worker() is currently dropping any cached acl info
for FILE before down-converting meta lock. It should also drop for
DIRECTORY. Otherwise the second acl lookup returns the cached one (from
VFS layer) which could be already stale.
The problem we are seeing is that the acl changes on one node doesn't
get refreshed on other nodes in the following case:
Node 1 Node 2
-------------- ----------------
getfacl dir1
getfacl dir1 <-- this is OK
setfacl -m u:user1:rwX dir1
getfacl dir1 <-- see the change for user1
getfacl dir1 <-- can't see change for user1
Link: https://lkml.kernel.org/r/20210903012631.6099-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
From recently open/accept are now able to manipulate fixed file table,
but it's inconsistent that close can't. Close the gap, keep API same as
with open/accept, i.e. via sqe->file_slot.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The following reproducer
mkdir lower upper work merge
touch lower/old
touch lower/new
mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merge
rm merge/new
mv merge/old merge/new & unlink upper/new
may result in this race:
PROCESS A:
rename("merge/old", "merge/new");
overwrite=true,ovl_lower_positive(old)=true,
ovl_dentry_is_whiteout(new)=true -> flags |= RENAME_EXCHANGE
PROCESS B:
unlink("upper/new");
PROCESS A:
lookup newdentry in new_upperdir
call vfs_rename() with negative newdentry and RENAME_EXCHANGE
Fix by adding the missing check for negative newdentry.
Signed-off-by: Zheng Liang <zhengliang6@huawei.com>
Fixes: e9be9d5e76 ("overlay filesystem")
Cc: <stable@vger.kernel.org> # v3.18
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmFNudoTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi93DCACjNadeFDipw3oAkzdAo+wo0MSYgByU
4eu3XN77yTHphc+xoCU/9cCeSkthfNc2XZDktb22lbAX2QxCgvQsWFgck+i0d245
9uQ68IycrUza9PGjLL3okZiLGzqsk97ZDt7vqXT51zN6dgEATVJ5YaXIhVIwAIM0
F6VVcoHruoOPLhPXctQUZusWS+XPzPvU34n2sNodREv9mYeoFmaZJ15wB6UZn2Ps
qL6Usoq15EH5cbW60XbqVWFDAhWmCN/6HG3b6w3WtD+lVb2NWTFJOAZGxMVvNo46
1TacgQUVs7iIh+bol3modOT94yVZ4Cvrb0uUMY/oE4AgjVPGxSjHSeMB
=7D7F
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.15-rc3' of git://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A fix for a potential array out of bounds access from Dan"
* tag 'ceph-for-5.15-rc3' of git://github.com/ceph/ceph-client:
ceph: fix off by one bugs in unsafe_request_wait()
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmFNp2UACgkQnJ2qBz9k
QNlubwf/Zv5XJccDBGxn0pB7ew1fN4HowTbWdaS0ELDuLZ2KHhZgEbtUu0V2oZ7I
pkUMO97llPk0KHWWjcooIaBMGbBQ78Hqq3xFWWboxEu5hMhJyN1cR2uJlrELvxp1
HsKKaREaUl8jHNQyIuREl/SqLaHmW4LWgrVCZKUTBEc4BeRz2E0C4LTymAZEpTXt
UCRwAU8itK8Z9/Da1xJ6b04/ZMamgoc0a8Rzq3YwCIcUyFLJATB1/XHR6j0c6B6x
gH/y7m/y1m3BFsMbMwGNMwiyJWcBXy2RB4I3B4HB3U6ifBL3ebFQlPQIu5PzRbiG
bGaWRx5o8c5mY/eHpFV4kZS30HM25A==
=Wep3
-----END PGP SIGNATURE-----
Merge tag 'fixes_for_v5.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull misc filesystem fixes from Jan Kara:
"A for ext2 sleep in atomic context in case of some fs problems and a
cleanup of an invalidate_lock initialization"
* tag 'fixes_for_v5.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: fix sleeping in atomic bugs on error
mm: Fully initialize invalidate_lock, amend lock class later
We don't retry short writes and so we would never get to async setup in
io_write() in that case. Thus ret2 > 0 is always false and
iov_iter_advance() is never used. Apparently, the same is found by
Coverity, which complains on the code.
Fixes: cd65869512 ("io_uring: use iov_iter state save/restore helpers")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5b33e61034748ef1022766efc0fb8854cfcf749c.1632500058.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's no reason to punt it unconditionally, we just need to ensure that
the submit lock grabbing is conditional.
Fixes: 05f3fb3c53 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For each provided buffer, we allocate a struct io_buffer to hold the
data associated with it. As a large number of buffers can be provided,
account that data with memcg.
Fixes: ddf0322db7 ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For multishot mode, there may be cases like:
iowq original context
io_poll_add
_arm_poll()
mask = vfs_poll() is not 0
if mask
(2) io_poll_complete()
compl_unlock
(interruption happens
tw queued to original
context)
io_poll_task_func()
compl_lock
(3) done = io_poll_complete() is true
compl_unlock
put req ref
(1) if (poll->flags & EPOLLONESHOT)
put req ref
EPOLLONESHOT flag in (1) may be from (2) or (3), so there are multiple
combinations that can cause ref underfow.
Let's address it by:
- check the return value in (2) as done
- change (1) to if (done)
in this way, we only do ref put in (1) if 'oneshot flag' is from
(2)
- do poll.done check in io_poll_task_func(), so that we won't put ref
for the second time.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101238.7177-4-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We should set EPOLLONESHOT if cqring_fill_event() returns false since
io_poll_add() decides to put req or not by it.
Fixes: 5082620fb2 ("io_uring: terminate multishot poll for CQ ring overflow")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101238.7177-3-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If poll arming and poll completion runs in parallel, there maybe races.
For instance, run io_poll_add in iowq and io_poll_task_func in original
context, then:
iowq original context
io_poll_add
vfs_poll
(interruption happens
tw queued to original
context) io_poll_task_func
generate cqe
del from cancel_hash[]
if !poll.done
insert to cancel_hash[]
The entry left in cancel_hash[], similar case for fast poll.
Fix it by set poll.done = true when del from cancel_hash[].
Fixes: 5082620fb2 ("io_uring: terminate multishot poll for CQ ring overflow")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101238.7177-2-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dave reports that a coredumping workload gets stuck in 5.15-rc2, and
identified the culprit in the Fixes line below. The problem is that
relying solely on fatal_signal_pending() to gate whether to exit or not
fails miserably if a process gets eg SIGILL sent. Don't exclusively
rely on fatal signals, also check if the thread group is exiting.
Fixes: 15e20db2e0 ("io-wq: only exit on fatal signals")
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is possible because of moving lock into ntfs_create_inode.
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Right now ntfs3 uses posix_acl_equiv_mode instead of
posix_acl_update_mode like all other fs.
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
In case of removing of xattr there must be XATTR_REPLACE flag and
zero length. We already check XATTR_REPLACE in ntfs_set_ea, so
now we pass XATTR_REPLACE to ntfs_set_ea.
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
We can safely move set_cached_acl because it works with NULL acl too.
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Now ntfs3 locks mutex for smaller time.
Theoretically in successful cases those locks aren't needed at all.
But proving the same for error cases is difficult.
So instead of removing them we just move them.
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
We need to always call indx_delete_entry after indx_insert_entry
if error occurred.
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Although very unlikely that the tlink pointer would be null in this case,
get_next_mid function can in theory return null (but not an error)
so need to check for null (not for IS_ERR, which can not be returned
here).
Address warning:
fs/smbfs_client/connect.c:2392 cifs_match_super()
warn: 'tlink' isn't an ERR_PTR
Pointed out by Dan Carpenter via smatch code analysis tool
CC: stable@vger.kernel.org
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Address warning:
fs/smbfs_client/misc.c:273 header_assemble()
warn: variable dereferenced before check 'treeCon->ses->server'
Pointed out by Dan Carpenter via smatch code analysis tool
Although the check is likely unneeded, adding it makes the code
more consistent and easier to read, as the same check is
done elsewhere in the function.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmFM79wACgkQxWXV+ddt
WDtdZQ/+K7NNEutg4JEH7n2KiXxwj8P23NwVK66a+XwH6/ZBe9xz5TQpnTJQ+D13
+3mhthTJG7Wbcv+FlUVbfTSp5q8m2IH7CKox43o4JCZEEGtFfRPHrBGHLlGKMk3P
ap2TZ3rvo0Sb21rx978HCQY824wdJvhv0SmWSScmvWzlTQEKaJHz1OFJgFxhUsMp
Cy9y7mtIy+Ei4qJglU88iFNXhNL6YvwqXxDFY5LwN9rlCaV+rLk476aPfIBvvyf8
4f34FHJOe1w9Jlk3KfydIwWefRBbq2dm0zNqNrMHNjl8zXbvfn8+ETOvf54HbjIz
GGgKiZlBgNh2Na+p0SLoloEvBUdD5lSUCXis8099oUWZ+MporIwsyy4jAvtAeWR/
QxBkZyxvTNFlXLamSo6oS58K9BNuxFYO7nLGSXQFEoYvb8/fu18rRt/A/rmNS8TU
2vxpYacNKbggoULiGDzB74JY7MHdHRcMcAhmfDeG1bvNESPHfyLnpHfWamBVoUO6
0eQOr78f1UpBlqJAGAGtfBefN1kMDnORyX0npGkGLFrKYiZbMgsxdjkNhiHnsufl
9gsNVJ6baCeB1d5qS2vpZXeOLw0ln5iYZa5Yqz0eh/yc/9Wlj/YCsKRuAbaPMR1i
i2ppHo3/na4K6L0EgSi6SU3xaUT+4LLzEEcBlJJuWZEwUTeYwiM=
=VJFC
-----END PGP SIGNATURE-----
Merge tag 'for-5.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- regression fix for leak of transaction handle after verity rollback
failure
- properly reset device last error between mounts
- improve one error handling case when checksumming bios
- fixup confusing displayed size of space info free space
* tag 'for-5.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: prevent __btrfs_dump_space_info() to underflow its free space
btrfs: fix mount failure due to past and transient device flush error
btrfs: fix transaction handle leak after verity rollback failure
btrfs: replace BUG_ON() in btrfs_csum_one_bio() with proper error handling
Address warning:
fs/smbfs_client/smb2pdu.c:2425 create_sd_buf()
warn: struct type mismatch 'smb3_acl vs cifs_acl'
Pointed out by Dan Carpenter via smatch code analysis tool
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Clear CIFS_INO_MODIFIED_ATTR bit from inode flags after
updating mtime and ctime
Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Cc: stable@vger.kernel.org # 5.13+
Signed-off-by: Steve French <stfrench@microsoft.com>
Deal with some warnings generated from make W=1:
(1) Add/remove/fix kerneldoc parameters descriptions.
(2) Turn cifs' rqst_page_get_length()'s banner comment into a kerneldoc
comment. It should probably be prefixed with "cifs_" though.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Some discussion has been spoken that this deprecated mount options
should be removed before 5.15 lands. This driver is not never seen day
light so it was decided that nls mount option has to be removed. We have
always possibility to add this if needed.
One possible need is example if current ntfs driver will be taken out of
kernel and ntfs3 needs to support mount options what it has.
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
There is already a 'u8 mask' defined at the top of the function.
There is no need to define a new one here.
Remove the useless and shadowing new 'mask' variable.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
'fnd' has been dereferenced several time before, so testing it here is
pointless.
Moreover, all callers of 'indx_find()' already have some error handling
code that makes sure that no NULL 'fnd' is passed.
So, remove the useless test.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Currently, the whole indexes will only be compacted 4B if
compacted_4b_initial > totalidx. So, the calculated compacted_2b
is worthless for that case. It may waste CPU resources.
No need to update compacted_4b_initial as mkfs since it's used to
fulfill the alignment of the 1st compacted_2b pack and would handle
the case above.
We also need to clarify compacted_4b_end here. It's used for the
last lclusters which aren't fitted in the previous compacted_2b
packs.
Some messages are from Xiang.
Link: https://lore.kernel.org/r/20210914035915.1190-1-zbestahu@gmail.com
Signed-off-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
[ Gao Xiang: it's enough to use "compacted_4b_initial < totalidx". ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Unsupported chunk format should be checked with
"if (vi->chunkformat & ~EROFS_CHUNK_FORMAT_ALL)"
Found when checking with 4k-byte blockmap (although currently mkfs
uses inode chunk indexes format by default.)
Link: https://lore.kernel.org/r/20210922095141.233938-1-hsiangkao@linux.alibaba.com
Fixes: c5aa903a59 ("erofs: support reading chunk-based uncompressed files")
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
In jfs_mount, when diMount(ipaimap2) fails, it goes to errout35. However,
the following code does not free ipaimap2 allocated by diReadSpecial.
Fix this by refactoring the error handling code of jfs_mount. To be
specific, modify the lable name and free ipaimap2 when the above error
ocurrs.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Use LOOKUP_NO_SYMLINKS flags for default lookup to prohibit the middle of
symlink component lookup and remove follow symlinks parameter support.
We re-implement it as reparse point later.
Test result:
smbclient -Ulinkinjeon%1234 //172.30.1.42/share -c
"get hacked/passwd passwd"
NT_STATUS_OBJECT_NAME_NOT_FOUND opening remote file \hacked\passwd
Cc: Ralph Böhme <slow@samba.org>
Cc: Steve French <smfrench@gmail.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When second smb2 pdu has invalid protocol id, ksmbd doesn't detect it
and allow to process smb2 request. This patch add the check it in
ksmbd_verify_smb_message() and don't use protocol id of smb2 request as
protocol id of response.
Reviewed-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Reviewed-by: Ralph Böhme <slow@samba.org>
Reported-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
fscrypt currently requires a 512-bit master key when AES-256-XTS is
used, since AES-256-XTS keys are 512-bit and fscrypt requires that the
master key be at least as long any key that will be derived from it.
However, this is overly strict because AES-256-XTS doesn't actually have
a 512-bit security strength, but rather 256-bit. The fact that XTS
takes twice the expected key size is a quirk of the XTS mode. It is
sufficient to use 256 bits of entropy for AES-256-XTS, provided that it
is first properly expanded into a 512-bit key, which HKDF-SHA512 does.
Therefore, relax the check of the master key size to use the security
strength of the derived key rather than the size of the derived key
(except for v1 encryption policies, which don't use HKDF).
Besides making things more flexible for userspace, this is needed in
order for the use of a KDF which only takes a 256-bit key to be
introduced into the fscrypt key hierarchy. This will happen with
hardware-wrapped keys support, as all known hardware which supports that
feature uses an SP800-108 KDF using AES-256-CMAC, so the wrapped keys
are wrapped 256-bit AES keys. Moreover, there is interest in fscrypt
supporting the same type of AES-256-CMAC based KDF in software as an
alternative to HKDF-SHA512. There is no security problem with such
features, so fix the key length check to work properly with them.
Reviewed-by: Paul Crowley <paulcrowley@google.com>
Link: https://lore.kernel.org/r/20210921030303.5598-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
If the file size is almost S64_MAX, the calculated number of Merkle tree
levels exceeds FS_VERITY_MAX_LEVELS, causing FS_IOC_ENABLE_VERITY to
fail. This is unintentional, since as the comment above the definition
of FS_VERITY_MAX_LEVELS states, it is enough for over U64_MAX bytes of
data using SHA-256 and 4K blocks. (Specifically, 4096*128**8 >= 2**64.)
The bug is actually that when the number of blocks in the first level is
calculated from i_size, there is a signed integer overflow due to i_size
being signed. Fix this by treating i_size as unsigned.
This was found by the new test "generic: test fs-verity EFBIG scenarios"
(https://lkml.kernel.org/r/b1d116cd4d0ea74b9cd86f349c672021e005a75c.1631558495.git.boris@bur.io).
This didn't affect ext4 or f2fs since those have a smaller maximum file
size, but it did affect btrfs which allows files up to S64_MAX bytes.
Reported-by: Boris Burkov <boris@bur.io>
Fixes: 3fda4c617e ("fs-verity: implement FS_IOC_ENABLE_VERITY ioctl")
Fixes: fd2d1acfca ("fs-verity: add the hook for file ->open()")
Cc: <stable@vger.kernel.org> # v5.4+
Reviewed-by: Boris Burkov <boris@bur.io>
Link: https://lore.kernel.org/r/20210916203424.113376-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
- Fix crash in NLM TEST procedure
- NFSv4.1+ backchannel not restored after PATH_DOWN
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmFLUKcACgkQM2qzM29m
f5fCAhAAp66o6n49/fxOLWo+MftFlT1EY8NtFjyTh1x/o4R9S74qxTy3RC3GzRvk
oGOnkFvuiToyjcoeyb9yumYxO00Qf75hrTJvsXqnsbrLZOAKVuITn9MkQXBOXjCi
GDxQSRFg8ihz0vG4YbE/brnZR1fIMr7KSzXLwdXOs8mKvro7JmiiB87JOGhw9yon
W9+bFcnN2TynYsqmtHu987LvaIUE79dFfhrfj6bIobNQ25oqJoG5e1/M48/1MJol
DFPiWoErJ/S1c0lA8rbjIvtzgbXs84U88EXmFUVsxSXhepGui3Uh/cA49vu46icH
vze8fwHs6q3qzF7gE6jbslrrdQ/H6AZ6arhe27h4cVxdh0AouDuBat2xLY2I4TP3
DckfLbEsOqTJhfzqYnk+8ckOaBMpkfyDqG6SodIKglPoknNCtCp0/7NuYF0yMLe5
I6pO7JDgz7ySrbpm27ZMOpdwkLqqA1i8V9MPvimUsKTYJqlVBsc2RsdldQhunNbd
50InJarWQ+japkEl3WK3aJ5rTluiIWjcePT7wA76wP3PnZmcjweOiQMc8uuLlzPw
tOLRlHdpdZzeM3hGuI6KKsg8ZRbDB7L8YiaLkSwxl2qwJwDSB0xo7/WWwVzkyfdf
zdQ2cR9z70I2Bgxq/1lAPB8tXq+SEvu1qCYDFSTo3I9c0Y2frs4=
=L0c1
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
"Critical bug fixes:
- Fix crash in NLM TEST procedure
- NFSv4.1+ backchannel not restored after PATH_DOWN"
* tag 'nfsd-5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: back channel stuck in SEQ4_STATUS_CB_PATH_DOWN
NLM: Fix svcxdr_encode_owner()
The ext2_error() function syncs the filesystem so it sleeps. The caller
is holding a spinlock so it's not allowed to sleep.
ext2_statfs() <- disables preempt
-> ext2_count_free_blocks()
-> ext2_get_group_desc()
Fix this by using WARN() to print an error message and a stack trace
instead of using ext2_error().
Link: https://lore.kernel.org/r/20210921203233.GA16529@kili
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The problem is the mismatched types between "ctx->total_len" which is
an unsigned int, "rc" which is an int, and "ctx->rc" which is a
ssize_t. The code does:
ctx->rc = (rc == 0) ? ctx->total_len : rc;
We want "ctx->rc" to store the negative "rc" error code. But what
happens is that "rc" is type promoted to a high unsigned int and
'ctx->rc" will store the high positive value instead of a negative
value.
The fix is to change "rc" from an int to a ssize_t.
Fixes: c610c4b619 ("CIFS: Add asynchronous write support through kernel AIO")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
DRC bucket pruning is done by nfsd_cache_lookup(), which is part of
every NFSv2 and NFSv3 dispatch (ie, it's done while the client is
waiting).
I added a trace_printk() in prune_bucket() to see just how long
it takes to prune. Here are two ends of the spectrum:
prune_bucket: Scanned 1 and freed 0 in 90 ns, 62 entries remaining
prune_bucket: Scanned 2 and freed 1 in 716 ns, 63 entries remaining
...
prune_bucket: Scanned 75 and freed 74 in 34149 ns, 1 entries remaining
Pruning latency is noticeable on fast transports with fast storage.
By noticeable, I mean that the latency measured here in the worst
case is the same order of magnitude as the round trip time for
cached server operations.
We could do something like moving expired entries to an expired list
and then free them later instead of freeing them right in
prune_bucket(). But simply limiting the number of entries that can
be pruned by a lookup is simple and retains more entries in the
cache, making the DRC somewhat more effective.
Comparison with a 70/30 fio 8KB 12 thread direct I/O test:
Before:
write: IOPS=61.6k, BW=481MiB/s (505MB/s)(14.1GiB/30001msec); 0 zone resets
WRITE:
1848726 ops (30%)
avg bytes sent per op: 8340 avg bytes received per op: 136
backlog wait: 0.635158 RTT: 0.128525 total execute time: 0.827242 (milliseconds)
After:
write: IOPS=63.0k, BW=492MiB/s (516MB/s)(14.4GiB/30001msec); 0 zone resets
WRITE:
1891144 ops (30%)
avg bytes sent per op: 8340 avg bytes received per op: 136
backlog wait: 0.616114 RTT: 0.126842 total execute time: 0.805348 (milliseconds)
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Windows client expect to get default stream name(::DATA) in
FILE_STREAM_INFORMATION response even if there is no stream data in file.
This patch fix update failure when writing ppt or doc files.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-By: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
While we are working through detailed security reviews
of ksmbd server code we should remind users that it is an
experimental module by adding a warning when the module
loads. Currently the module shows as experimental
in Kconfig and is disabled by default, but we don't want
to confuse users.
Although ksmbd passes a wide variety of the
important functional tests (since initial focus had
been largely on functional testing such as smbtorture,
xfstests etc.), and ksmbd has added key security
features (e.g. GCM256 encryption, Kerberos support),
there are ongoing detailed reviews of the code base
for path processing and network buffer decoding, and
this patch reminds users that the module should be
considered "experimental."
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The "> max" tests should be ">= max" to prevent an out of bounds access
on the next lines.
Fixes: e1a4541ec0 ("ceph: flush the mdlog before waiting on unsafe reqs")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This allows to wait only when it's requested.
It speeds up creation of hardlinks.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
xfstest generic/041 works with 3003 hardlinks.
Because of this we raise hardlinks limit to 4000.
There are no drawbacks or regressions.
Theoretically we can raise all the way up to ffff,
but there is no practical use for this.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
In commit b7213ffa0e ("qnx4: avoid stringop-overread errors") I tried
to teach gcc about how the directory entry structure can be two
different things depending on a status flag. It made the code clearer,
and it seemed to make gcc happy.
However, Arnd points to a gcc bug, where despite using two different
members of a union, gcc then gets confused, and uses the size of one of
the members to decide if a string overrun happens. And not necessarily
the rigth one.
End result: with some configurations, gcc-11 will still complain about
the source buffer size being overread:
fs/qnx4/dir.c: In function 'qnx4_readdir':
fs/qnx4/dir.c:76:32: error: 'strnlen' specified bound [16, 48] exceeds source size 1 [-Werror=stringop-overread]
76 | size = strnlen(name, size);
| ^~~~~~~~~~~~~~~~~~~
fs/qnx4/dir.c:26:22: note: source object declared here
26 | char de_name;
| ^~~~~~~
because gcc will get confused about which union member entry is actually
getting accessed, even when the source code is very clear about it. Gcc
internally will have combined two "redundant" pointers (pointing to
different union elements that are at the same offset), and takes the
size checking from one or the other - not necessarily the right one.
This is clearly a gcc bug, but we can work around it fairly easily. The
biggest thing here is the big honking comment about why we do what we
do.
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578#c6
Reported-and-tested-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do not try to insert attribute if there is no room in record.
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
The file comment in bio.c is almost completely irrelevant to the actual
contents of the file; it was originally copied from crypto.c. Fix it
up, and also add a kerneldoc comment for fscrypt_decrypt_bio().
Link: https://lore.kernel.org/r/20210909190737.140841-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
The max_namelen field is unnecessary, as it is set to 255 (NAME_MAX) on
all filesystems that support fscrypt (or plan to support fscrypt). For
simplicity, just use NAME_MAX directly instead.
Link: https://lore.kernel.org/r/20210909184513.139281-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Inconsistent node block will cause a file fail to open or read,
which could make the user process crashes or stucks. Let's mark
SBI_NEED_FSCK flag to trigger a fix at next fsck time. After
unlinking the corrupted file, the user process could regenerate
a new one and work correctly.
Signed-off-by: Weichao Guo <guoweichao@oppo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch enables f2fs_balance_fs_bg() to check all metadatas' dirty
threshold rather than just checking node block's, so that checkpoint()
from background can be triggered more frequently to avoid heaping up
too much dirty metadatas.
Threshold value by default:
race with foreground ops single type global
No 16MB 24MB
Yes 24MB 36MB
In addtion, let f2fs_balance_fs_bg() be aware of roll-forward sapce
as well as fsync().
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIyBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmE/CK0ACgkQ+7dXa6fL
C2vR+A/3ZOlda7wl9grj+qPPiJE1jCav7myLJJR73Yog5T8ZfFkaK6a20IOAyOBu
1v9GzTEODCA12uomYfvIZqNHrcBr2oV6jf8twcnioELQELEP4KPQsXpd1eqq/Kho
O3JUaY7BRiKIk5jUL7IEt2hdBgYCBU2FMoQa+M3FiKfoq601rDDsb5YnwWP0og26
MxXpVmn8uY+QTfwCI4uoJaRZmEX5tu7DnPX3VNHbno9uuI2VJo16S/jmw5CAkG5B
K9p9VdWbGkelM3CXl2rYBG4cA56uwEhVDfTze+A/Eg9JYD2WCFrsehGWC1DR/QtZ
LMM5FxiajF2tvg8KQE/Ou+er96qujwfIJKUgI+vqYLh2s6b5ZLqIyzUpTk4fIrf4
MbHBb4ec0AMXrGapO0fu7UZ2x7f+T7CkYrtIMYxddjlv8YQ860TtzEp/esing4IW
2DHe6xe72LiqoZ09DBaFq0DJKxtFYKQ94GcHjVGxOaFf4nx4OVkQP3gPz3jrhIy8
boWJZQ3xv4cuSbX23GBdELzPbkaTRUjI1siYM2zVk31S4YkZVyy5LbgjQL93C+Bp
BzQwhMGiFQOz17J5eBehVIvHoKDi5fVBuX3WK7aMFmPtUxNhh3KnLKjaxERxdUYw
6pHq3P23rX15TVC24djqtDevv+otITqJ7dKDovKnGm6hoPRqnw==
=BLd7
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20210913' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"Fixes for AFS problems that can cause data corruption due to
interaction with another client modifying data cached locally:
- When d_revalidating a dentry, don't look at the inode to which it
points. Only check the directory to which the dentry belongs. This
was confusing things and causing the silly-rename cleanup code to
remove the file now at the dentry of a file that got deleted.
- Fix mmap data coherency. When a callback break is received that
relates to a file that we have cached, the data content may have
been changed (there are other reasons, such as the user's rights
having been changed). However, we're checking it lazily, only on
entry to the kernel, which doesn't happen if we have a writeable
shared mapped page on that file.
We make the kernel keep track of mmapped files and clear all PTEs
mapping to that file as soon as the callback comes in by calling
unmap_mapping_pages() (we don't necessarily want to zap the
pagecache). This causes the kernel to be reentered when userspace
tries to access the mmapped address range again - and at that point
we can query the server and, if we need to, zap the page cache.
Ideally, I would check each file at the point of notification, but
that involves poking the server[*] - which is holding an exclusive
lock on the vnode it is changing, waiting for all the clients it
notified to reply. This could then deadlock against the server.
Further, invalidating the pagecache might call ->launder_page(),
which would try to write to the file, which would definitely
deadlock. (AFS doesn't lease file access).
[*] Checking to see if the file content has changed is a matter of
comparing the current data version number, but we have to ask
the server for that. We also need to get a new callback promise
and we need to poke the server for that too.
- Add some more points at which the inode is validated, since we're
doing it lazily, notably in ->read_iter() and ->page_mkwrite(), but
also when performing some directory operations.
Ideally, checking in ->read_iter() would be done in some derivation
of filemap_read(). If we're going to call the server to read the
file, then we get the file status fetch as part of that.
- The above is now causing us to make a lot more calls to
afs_validate() to check the inode - and afs_validate() takes the
RCU read lock each time to make a quick check (ie.
afs_check_validity()). This is entirely for the purpose of checking
cb_s_break to see if the server we're using reinitialised its list
of callbacks - however this isn't a very common event, so most of
the time we're taking this needlessly.
Add a new cell-wide counter to count the number of
reinitialisations done by any server and check that - and only if
that changes, take the RCU read lock and check the server list (the
server list may change, but the cell a file is part of won't).
- Don't update vnode->cb_s_break and ->cb_v_break inside the validity
checking loop. The cb_lock is done with read_seqretry, so we might
go round the loop a second time after resetting those values - and
that could cause someone else checking validity to miss something
(I think).
Also included are patches for fixes for some bugs encountered whilst
debugging this:
- Fix a leak of afs_read objects and fix a leak of keys hidden by
that.
- Fix a leak of pages that couldn't be added to extend a writeback.
- Fix the maintenance of i_blocks when i_size is changed by a local
write or a local dir edit"
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214217 [1]
Link: https://lore.kernel.org/r/163111665183.283156.17200205573146438918.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163113612442.352844.11162345591911691150.stgit@warthog.procyon.org.uk/ # i_blocks patch
* tag 'afs-fixes-20210913' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix updating of i_blocks on file/dir extension
afs: Fix corruption in reads at fpos 2G-4G from an OpenAFS server
afs: Try to avoid taking RCU read lock when checking vnode validity
afs: Fix mmap coherency vs 3rd-party changes
afs: Fix incorrect triggering of sillyrename on 3rd-party invalidation
afs: Add missing vnode validation checks
afs: Fix page leak
afs: Fix missing put on afs_read objects and missing get on the key therein
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmFHRVAACgkQiiy9cAdy
T1F0zwv/RC2quL/y+DjNOKbTZwaExLSaZlsww23XghVXIlYMMy4pENpYsu+tjW+l
aEEsIciGBBQ40/Q0Eu2ttk8vIUpaI2SxM+KlCufjX61Rlve42eWZBZ1KkrijKIq4
xvMBJLAg9Jhq1JLl58nyIHb4XV0N9sVELd3aNyEM+4b/2kEe59qW1FdFAXOS3GOc
kkHEWIDnoYs/sCpKey2UuJmI9D2BbxZwhrW6r7mmyq7PQmbPuggSAnL8m5tIsv7Y
GHqmhaaJovfbOJ5L+BUblRyqMgDoYaxiyk3ujHdJkWUkeSpCfQhelUyT30xGzhTV
AQhVrAjB6ozcm/F0lLW9J8LUL0ESDQVUbEMEK1W2GyaR24oQ2sxj6VzILBMne+oh
7QyHbd7N302f+yTvYQbeX9TKd3slh+oOUVAANWDpPfiFhF09iKdjaY3iAUSsf339
nBO4/LlcpELb51UhHEsE1SfP2EbtJwvIMsFGnct6qyQoYM7giacP0AEs/RQhXMwW
XMddUoON
=vPyV
-----END PGP SIGNATURE-----
Merge tag '5.15-rc1-ksmbd' of git://git.samba.org/ksmbd
Pull ksmbd server fixes from Steve French:
"Three ksmbd fixes, including an important security fix for path
processing, and a buffer overflow check, and a trivial fix for
incorrect header inclusion"
* tag '5.15-rc1-ksmbd' of git://git.samba.org/ksmbd:
ksmbd: add validation for FILE_FULL_EA_INFORMATION of smb2_get_info
ksmbd: prevent out of share access
ksmbd: transport_rdma: Don't include rwlock.h directly
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmFGPJwACgkQiiy9cAdy
T1H1Fgv+NjYfcS9C4UynXT9b0cm9Nv3t+1IVepS3WWH/V9EGWjR8aVY3HgFgzx7m
MqJRs1ytAB58fsDzu0RH9409QyyAcPiHk88Fw85yB1hMSEHABVfq37iXiPOWAPA0
pYKjm5pbbGzeTBnCBFaqgkJ/AeiZQ7vbtAYQ4AdCW5hi1fwSrJHPj+qA7NefgbnB
S9p4cQKMYFwzHP2+oUJBemktl512HaTEg8a+nqbGWd3QR7zcNSi3k5M+sHIP0DzZ
zqDgvgmgOecIqj9w/G9rTToPhKO9fFnoDxkpm/4JLxj2Zul+QZ6Lsfrm7BTOA8V8
bNQrlgBioOdLo3WpVYIyTPvywxD4zbLlwfk/spFDnuRvyyKDjR64iYfArCKSm9G9
c0wlNW7uFiAB66NNzTISSjA31lrwwvq8Q6bmOyNRC/n/LwsbE+EQCf2P4Ajn0m7l
Gb8441sbs8yjEs+E/FJF4f9xiVaCKQe6nBsGpxHKslD+J1W5f6hBco3Zswix13m+
0ObM5i+5
=d5GP
-----END PGP SIGNATURE-----
Merge tag '5.15-rc1-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs client fixes from Steve French:
- two deferred close fixes (for bugs found with xfstests 478 and 461)
- a deferred close improvement in rename
- two trivial fixes for incorrect Linux comment formatting of multiple
cifs files (pointed out by automated kernel test robot and
checkpatch)
* tag '5.15-rc1-smb3' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Not to defer close on file when lock is set
cifs: Fix soft lockup during fsstress
cifs: Deferred close performance improvements
cifs: fix incorrect kernel doc comments
cifs: remove pathname for file from SPDX header
Currently a failed allocation on sbi->upcase will cause an exit via
the label free_sbi causing a memory leak on object opts. Fix this by
re-ordering the exit paths free_opts and free_sbi so that kfree's occur
in the reverse allocation order.
Addresses-Coverity: ("Resource leak")
Fixes: 27fac77707 ("fs/ntfs3: Init spi more in init_fs_context than fill_super")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Right now sb blocksize first get initiliazed in fill_super but in can be
changed in helper function. It makes more sense to that this happened
only in one place.
Because we move this to helper function it makes more sense that
s_maxbytes will also be there. I rather have every sb releted thing in
fill_super, but because there is already sb releted stuff in this
helper. This will have to do for now.
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Initializing should be as close as possible when we use it so that
we do not need to scroll up to see what is happening.
Also bdev_get_queue() can never return NULL so we do not need to check
for !rq.
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
We can survive without this tmp point upcase. So remove it we don't have
so many tmp pointer in this function.
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Drop tmp pointer bd_inode because this is only used ones in fill_super.
Also we have so many initializing happening at the beginning that it is
already way too much to follow.
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
We only use this in two places so we do not really need it. Also
wrapper sb_rdonly() is pretty self explanatory. This will make little
bit easier to read this super long variable list in the beginning of
ntfs_fill_super().
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Use sb instead of sbi->sb in fill_super. We have sb so why not use
it. Also makes code more readable.
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Remove some unnecessary variable loading. These look like copy paste
work and they are not used to anything.
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
In many places it is not needed to use goto out. We can just return
right away. This will make code little bit more cleaner as we won't
need to check error path.
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Remove root drop when we fault out. This can never happened because
when we allocate root we eather fault when no root or success.
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Change EINVAL to ENOMEM when d_make_root fails because that is right
errno.
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
A full expalantion of io_uring is beyond the scope of this commit
description, but in summary it is an asynchronous I/O mechanism
which allows for I/O requests and the resulting data to be queued
in memory mapped "rings" which are shared between the kernel and
userspace. Optionally, io_uring offers the ability for applications
to spawn kernel threads to dequeue I/O requests from the ring and
submit the requests in the kernel, helping to minimize the syscall
overhead. Rings are accessed in userspace by memory mapping a file
descriptor provided by the io_uring_setup(2), and can be shared
between applications as one might do with any open file descriptor.
Finally, process credentials can be registered with a given ring
and any process with access to that ring can submit I/O requests
using any of the registered credentials.
While the io_uring functionality is widely recognized as offering a
vastly improved, and high performing asynchronous I/O mechanism, its
ability to allow processes to submit I/O requests with credentials
other than its own presents a challenge to LSMs. When a process
creates a new io_uring ring the ring's credentials are inhertied
from the calling process; if this ring is shared with another
process operating with different credentials there is the potential
to bypass the LSMs security policy. Similarly, registering
credentials with a given ring allows any process with access to that
ring to submit I/O requests with those credentials.
In an effort to allow LSMs to apply security policy to io_uring I/O
operations, this patch adds two new LSM hooks. These hooks, in
conjunction with the LSM anonymous inode support previously
submitted, allow an LSM to apply access control policy to the
sharing of io_uring rings as well as any io_uring credential changes
requested by a process.
The new LSM hooks are described below:
* int security_uring_override_creds(cred)
Controls if the current task, executing an io_uring operation,
is allowed to override it's credentials with @cred. In cases
where the current task is a user application, the current
credentials will be those of the user application. In cases
where the current task is a kernel thread servicing io_uring
requests the current credentials will be those of the io_uring
ring (inherited from the process that created the ring).
* int security_uring_sqpoll(void)
Controls if the current task is allowed to create an io_uring
polling thread (IORING_SETUP_SQPOLL). Without a SQPOLL thread
in the kernel processes must submit I/O requests via
io_uring_enter(2) which allows us to compare any requested
credential changes against the application making the request.
With a SQPOLL thread, we can no longer compare requested
credential changes against the application making the request,
the comparison is made against the ring's credentials.
Signed-off-by: Paul Moore <paul@paul-moore.com>
Converting io_uring's anonymous inode to the secure anon inode API
enables LSMs to enforce policy on the io_uring anonymous inodes if
they chose to do so. This is an important first step towards
providing the necessary mechanisms so that LSMs can apply security
policy to io_uring operations.
Signed-off-by: Paul Moore <paul@paul-moore.com>
Extending the secure anonymous inode support to other subsystems
requires that we have a secure anon_inode_getfile() variant in
addition to the existing secure anon_inode_getfd() variant.
Thankfully we can reuse the existing __anon_inode_getfile() function
and just wrap it with the proper arguments.
Acked-by: Mickaël Salaün <mic@linux.microsoft.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
This patch adds basic auditing to io_uring operations, regardless of
their context. This is accomplished by allocating audit_context
structures for the io-wq worker and io_uring SQPOLL kernel threads
as well as explicitly auditing the io_uring operations in
io_issue_sqe(). Individual io_uring operations can bypass auditing
through the "audit_skip" field in the struct io_op_def definition for
the operation; although great care must be taken so that security
relevant io_uring operations do not bypass auditing; please contact
the audit mailing list (see the MAINTAINERS file) with any questions.
The io_uring operations are audited using a new AUDIT_URINGOP record,
an example is shown below:
type=UNKNOWN[1336] msg=audit(1631800225.981:37289):
uring_op=19 success=yes exit=0 items=0 ppid=15454 pid=15681
uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0
subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
key=(null)
Thanks to Richard Guy Briggs for review and feedback.
Signed-off-by: Paul Moore <paul@paul-moore.com>
Add validation to check whether req->InputBufferLength is smaller than
smb2_ea_info_req structure size.
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Böhme <slow@samba.org>
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Because of .., files outside the share directory
could be accessed. To prevent this, normalize
the given path and remove all . and ..
components.
In addition to the usual large set of regression tests (smbtorture
and xfstests), ran various tests on this to specifically check
path name validation including libsmb2 tests to verify path
normalization:
./examples/smb2-ls-async smb://172.30.1.15/homes2/../
./examples/smb2-ls-async smb://172.30.1.15/homes2/foo/../
./examples/smb2-ls-async smb://172.30.1.15/homes2/foo/../../
./examples/smb2-ls-async smb://172.30.1.15/homes2/foo/../
./examples/smb2-ls-async smb://172.30.1.15/homes2/foo/..bar/
./examples/smb2-ls-async smb://172.30.1.15/homes2/foo/bar../
./examples/smb2-ls-async smb://172.30.1.15/homes2/foo/bar..
./examples/smb2-ls-async smb://172.30.1.15/homes2/foo/bar../../../../
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>