all uses are conditional upon ELF_CORE_COPY_XFPREGS, which has not
been defined on any architecture since 2010
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The only architecture where we might end up using both is arm,
and there we definitely don't want fdpic-related fields in
elf_prstatus - coredump layout of ELF binaries should not
depend upon having the kernel built with the support of ELF_FDPIC
ones. Just move the fdpic-modified variant into binfmt_elf_fdpic.c
(and call it elf_prstatus_fdpic there)
[name stolen from nico]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Two new helpers: given a process and regset, dump into a buffer.
regset_get() takes a buffer and size, regset_get_alloc() takes size
and allocates a buffer.
Return value in both cases is the amount of data actually dumped in
case of success or -E... on error.
In both cases the size is capped by regset->n * regset->size, so
->get() is called with offset 0 and size no more than what regset
expects.
binfmt_elf.c callers of ->get() are switched to using those; the other
caller (copy_regset_to_user()) will need some preparations to switch.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The 'inode' argument to handle_event(), sometimes referred to as
'to_tell' is somewhat obsolete.
It is a remnant from the times when a group could only have an inode mark
associated with an event.
We now pass an iter_info array to the callback, with all marks associated
with an event.
Most backends ignore this argument, with two exceptions:
1. dnotify uses it for sanity check that event is on directory
2. fanotify uses it to report fid of directory on directory entry
modification events
Remove the 'inode' argument and add a 'dir' argument.
The callback function signature is deliberately changed, because
the meaning of the argument has changed and the arguments have
been documented.
The 'dir' argument is set to when 'file_name' is specified and it is
referring to the directory that the 'file_name' entry belongs to.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When removing an extent map at try_release_extent_mapping(), called through
the page release callback (btrfs_releasepage()), we always set the full
sync flag on the inode, which forces the next fsync to use a slower code
path.
This hurts performance for workloads that dirty an amount of data that
exceeds or is very close to the system's RAM memory and do frequent fsync
operations (like database servers can for example). In particular if there
are concurrent fsyncs against different files, by falling back to a full
fsync we do a lot more checksum lookups in the checksums btree, as we do
it for all the extents created in the current transaction, instead of only
the new ones since the last fsync. These checksums lookups not only take
some time but, more importantly, they also cause contention on the
checksums btree locks due to the concurrency with checksum insertions in
the btree by ordered extents from other inodes.
We actually don't need to set the full sync flag on the inode, because we
only remove extent maps that are in the list of modified extents if they
were created in a past transaction, in which case an fsync skips them as
it's pointless to log them. So stop setting the full fsync flag on the
inode whenever we remove an extent map.
This patch is part of a patchset that consists of 3 patches, which have
the following subjects:
1/3 btrfs: fix race between page release and a fast fsync
2/3 btrfs: release old extent maps during page release
3/3 btrfs: do not set the full sync flag on the inode during page release
Performance tests were ran against a branch (misc-next) containing the
whole patchset. The test exercises a workload where there are multiple
processes writing to files and fsyncing them (each writing and fsyncing
its own file), and in total the amount of data dirtied ranges from 2x to
4x the system's RAM memory (16GiB), so that the page release callback is
invoked frequently.
The following script, using fio, was used to perform the tests:
$ cat test-fsync.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-d single -m single"
if [ $# -ne 3 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
exit 1
fi
NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=write
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=64k
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The tests were performed for different numbers of jobs, file sizes and
fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
12 cores, with cpu governance set to performance mode on all cores), 16GiB
of ram (the host has 64GiB) and using a NVMe device directly (without an
intermediary filesystem in the host). While running the tests, the host
was not used for anything else, to avoid disturbing the tests.
The obtained results were the following, and the last line printed by
fio is pasted (includes aggregated throughput and test run time).
*****************************************************
**** 1 job, 32GiB file, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=32.0GiB (34.4GB), run=1127557-1127557msec
After patchset:
WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=32.0GiB (34.4GB), run=1119042-1119042msec
(+0.7% throughput, -0.8% run time)
*****************************************************
**** 2 jobs, 16GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=33.5MiB/s (35.1MB/s), 33.5MiB/s-33.5MiB/s (35.1MB/s-35.1MB/s), io=32.0GiB (34.4GB), run=979000-979000msec
After patchset:
WRITE: bw=39.9MiB/s (41.8MB/s), 39.9MiB/s-39.9MiB/s (41.8MB/s-41.8MB/s), io=32.0GiB (34.4GB), run=821283-821283msec
(+19.1% throughput, -16.1% runtime)
*****************************************************
**** 4 jobs, 8GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=52.1MiB/s (54.6MB/s), 52.1MiB/s-52.1MiB/s (54.6MB/s-54.6MB/s), io=32.0GiB (34.4GB), run=629130-629130msec
After patchset:
WRITE: bw=71.8MiB/s (75.3MB/s), 71.8MiB/s-71.8MiB/s (75.3MB/s-75.3MB/s), io=32.0GiB (34.4GB), run=456357-456357msec
(+37.8% throughput, -27.5% runtime)
*****************************************************
**** 8 jobs, 4GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430708-430708msec
After patchset:
WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=245458-245458msec
(+74.7% throughput, -43.0% run time)
*****************************************************
**** 16 jobs, 2GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=74.7MiB/s (78.3MB/s), 74.7MiB/s-74.7MiB/s (78.3MB/s-78.3MB/s), io=32.0GiB (34.4GB), run=438625-438625msec
After patchset:
WRITE: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=32.0GiB (34.4GB), run=177864-177864msec
(+146.3% throughput, -59.5% run time)
*****************************************************
**** 32 jobs, 2GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=72.6MiB/s (76.1MB/s), 72.6MiB/s-72.6MiB/s (76.1MB/s-76.1MB/s), io=64.0GiB (68.7GB), run=902615-902615msec
After patchset:
WRITE: bw=227MiB/s (238MB/s), 227MiB/s-227MiB/s (238MB/s-238MB/s), io=64.0GiB (68.7GB), run=288936-288936msec
(+212.7% throughput, -68.0% run time)
*****************************************************
**** 64 jobs, 1GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=98.8MiB/s (104MB/s), 98.8MiB/s-98.8MiB/s (104MB/s-104MB/s), io=64.0GiB (68.7GB), run=663126-663126msec
After patchset:
WRITE: bw=294MiB/s (308MB/s), 294MiB/s-294MiB/s (308MB/s-308MB/s), io=64.0GiB (68.7GB), run=222940-222940msec
(+197.6% throughput, -66.4% run time)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When removing an extent map at try_release_extent_mapping(), called through
the page release callback (btrfs_releasepage()), we never release an extent
map that is in the list of modified extents. This is to prevent races with
a concurrent fsync using the fast path, which could lead to not logging an
extent created in the current transaction.
However we can safely remove an extent map created in a past transaction
that is still in the list of modified extents (because no one fsynced yet
the inode after that transaction got commited), because such extents are
skipped during an fsync as it is pointless to log them. This change does
that.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When releasing an extent map, done through the page release callback, we
can race with an ongoing fast fsync and cause the fsync to miss a new
extent and not log it. The steps for this to happen are the following:
1) A page is dirtied for some inode I;
2) Writeback for that page is triggered by a path other than fsync, for
example by the system due to memory pressure;
3) When the ordered extent for the extent (a single 4K page) finishes,
we unpin the corresponding extent map and set its generation to N,
the current transaction's generation;
4) The btrfs_releasepage() callback is invoked by the system due to
memory pressure for that no longer dirty page of inode I;
5) At the same time, some task calls fsync on inode I, joins transaction
N, and at btrfs_log_inode() it sees that the inode does not have the
full sync flag set, so we proceed with a fast fsync. But before we get
into btrfs_log_changed_extents() and lock the inode's extent map tree:
6) Through btrfs_releasepage() we end up at try_release_extent_mapping()
and we remove the extent map for the new 4Kb extent, because it is
neither pinned anymore nor locked. By calling remove_extent_mapping(),
we remove the extent map from the list of modified extents, since the
extent map does not have the logging flag set. We unlock the inode's
extent map tree;
7) The task doing the fast fsync now enters btrfs_log_changed_extents(),
locks the inode's extent map tree and iterates its list of modified
extents, which no longer has the 4Kb extent in it, so it does not log
the extent;
8) The fsync finishes;
9) Before transaction N is committed, a power failure happens. After
replaying the log, the 4K extent of inode I will be missing, since
it was not logged due to the race with try_release_extent_mapping().
So fix this by teaching try_release_extent_mapping() to not remove an
extent map if it's still in the list of modified extents.
Fixes: ff44c6e36d ("Btrfs: do not hold the write_lock on the extent tree while logging")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we're (re)mounting a btrfs filesystem we set the
BTRFS_FS_STATE_REMOUNTING state in fs_info to serialize against async
reclaim or defrags.
This flag is set in btrfs_remount_prepare() called by btrfs_remount().
As btrfs_remount_prepare() does nothing but setting this flag and
doesn't have a second caller, we can just open-code the flag setting in
btrfs_remount().
Similarly do for so clearing of the flag by moving it out of
btrfs_remount_cleanup() into btrfs_remount() to be symmetrical.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously we depended on some weird behavior in our chunk allocator to
force the allocation of new stripes, so by the time we got to doing the
reduce we would usually already have a chunk with the proper target.
However that behavior causes other problems and needs to be removed.
First however we need to remove this check to only restripe if we
already have those available profiles, because if we're allocating our
first chunk it obviously will not be available. Simply use the target
as specified, and if that fails it'll be because we're out of space.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs/061 has been failing consistently for me recently with a
transaction abort. We run out of space in the system chunk array, which
means we've allocated way too many system chunks than we need.
Chris added this a long time ago for balance as a poor mans restriping.
If you had a single disk and then added another disk and then did a
balance, update_block_group_flags would then figure out which RAID level
you needed.
Fast forward to today and we have restriping behavior, so we can
explicitly tell the fs that we're trying to change the raid level. This
is accomplished through the normal get_alloc_profile path.
Furthermore this code actually causes btrfs/061 to fail, because we do
things like mkfs -m dup -d single with multiple devices. This trips
this check
alloc_flags = update_block_group_flags(fs_info, cache->flags);
if (alloc_flags != cache->flags) {
ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
in btrfs_inc_block_group_ro. Because we're balancing and scrubbing, but
not actually restriping, we keep forcing chunk allocation of RAID1
chunks. This eventually causes us to run out of system space and the
file system aborts and flips read only.
We don't need this poor mans restriping any more, simply use the normal
get_alloc_profile helper, which will get the correct alloc_flags and
thus make the right decision for chunk allocation. This keeps us from
allocating a billion system chunks and falling over.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are currently getting this lockdep splat in btrfs/161:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc5+ #20 Tainted: G E
------------------------------------------------------
mount/678048 is trying to acquire lock:
ffff9b769f15b6e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: clone_fs_devices+0x4d/0x170 [btrfs]
but task is already holding lock:
ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
__mutex_lock+0x8b/0x8f0
btrfs_init_new_device+0x2d2/0x1240 [btrfs]
btrfs_ioctl+0x1de/0x2d20 [btrfs]
ksys_ioctl+0x87/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__lock_acquire+0x1240/0x2460
lock_acquire+0xab/0x360
__mutex_lock+0x8b/0x8f0
clone_fs_devices+0x4d/0x170 [btrfs]
btrfs_read_chunk_tree+0x330/0x800 [btrfs]
open_ctree+0xb7c/0x18ce [btrfs]
btrfs_mount_root.cold+0x13/0xfa [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x7de/0xb30
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->chunk_mutex);
lock(&fs_devs->device_list_mutex);
lock(&fs_info->chunk_mutex);
lock(&fs_devs->device_list_mutex);
*** DEADLOCK ***
3 locks held by mount/678048:
#0: ffff9b75ff5fb0e0 (&type->s_umount_key#63/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
#1: ffffffffc0c2fbc8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x54/0x800 [btrfs]
#2: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]
stack backtrace:
CPU: 2 PID: 678048 Comm: mount Tainted: G E 5.8.0-rc5+ #20
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
Call Trace:
dump_stack+0x96/0xd0
check_noncircular+0x162/0x180
__lock_acquire+0x1240/0x2460
? asm_sysvec_apic_timer_interrupt+0x12/0x20
lock_acquire+0xab/0x360
? clone_fs_devices+0x4d/0x170 [btrfs]
__mutex_lock+0x8b/0x8f0
? clone_fs_devices+0x4d/0x170 [btrfs]
? rcu_read_lock_sched_held+0x52/0x60
? cpumask_next+0x16/0x20
? module_assert_mutex_or_preempt+0x14/0x40
? __module_address+0x28/0xf0
? clone_fs_devices+0x4d/0x170 [btrfs]
? static_obj+0x4f/0x60
? lockdep_init_map_waits+0x43/0x200
? clone_fs_devices+0x4d/0x170 [btrfs]
clone_fs_devices+0x4d/0x170 [btrfs]
btrfs_read_chunk_tree+0x330/0x800 [btrfs]
open_ctree+0xb7c/0x18ce [btrfs]
? super_setup_bdi_name+0x79/0xd0
btrfs_mount_root.cold+0x13/0xfa [btrfs]
? vfs_parse_fs_string+0x84/0xb0
? rcu_read_lock_sched_held+0x52/0x60
? kfree+0x2b5/0x310
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
? cred_has_capability+0x7c/0x120
? rcu_read_lock_sched_held+0x52/0x60
? legacy_get_tree+0x30/0x50
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x7de/0xb30
? memdup_user+0x4e/0x90
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This is because btrfs_read_chunk_tree() can come upon DEV_EXTENT's and
then read the device, which takes the device_list_mutex. The
device_list_mutex needs to be taken before the chunk_mutex, so this is a
problem. We only really need the chunk mutex around adding the chunk,
so move the mutex around read_one_chunk.
An argument could be made that we don't even need the chunk_mutex here
as it's during mount, and we are protected by various other locks.
However we already have special rules for ->device_list_mutex, and I'd
rather not have another special case for ->chunk_mutex.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's long existed a lockdep splat because we open our bdev's under
the ->device_list_mutex at mount time, which acquires the bd_mutex.
Usually this goes unnoticed, but if you do loopback devices at all
suddenly the bd_mutex comes with a whole host of other dependencies,
which results in the splat when you mount a btrfs file system.
======================================================
WARNING: possible circular locking dependency detected
5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
------------------------------------------------------
systemd-journal/509 is trying to acquire lock:
ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]
but task is already holding lock:
ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sb_pagefaults){.+.+}-{0:0}:
__sb_start_write+0x13e/0x220
btrfs_page_mkwrite+0x59/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
-> #5 (&mm->mmap_lock#2){++++}-{3:3}:
__might_fault+0x60/0x80
_copy_from_user+0x20/0xb0
get_sg_io_hdr+0x9a/0xb0
scsi_cmd_ioctl+0x1ea/0x2f0
cdrom_ioctl+0x3c/0x12b4
sr_block_ioctl+0xa4/0xd0
block_ioctl+0x3f/0x50
ksys_ioctl+0x82/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #4 (&cd->lock){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
sr_block_open+0xa2/0x180
__blkdev_get+0xdd/0x550
blkdev_get+0x38/0x150
do_dentry_open+0x16b/0x3e0
path_openat+0x3c9/0xa00
do_filp_open+0x75/0x100
do_sys_openat2+0x8a/0x140
__x64_sys_openat+0x46/0x70
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
__blkdev_get+0x6a/0x550
blkdev_get+0x85/0x150
blkdev_get_by_path+0x2c/0x70
btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
open_fs_devices+0x88/0x240 [btrfs]
btrfs_open_devices+0x92/0xa0 [btrfs]
btrfs_mount_root+0x250/0x490 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
vfs_kern_mount.part.0+0x71/0xb0
btrfs_mount+0x119/0x380 [btrfs]
legacy_get_tree+0x30/0x50
vfs_get_tree+0x28/0xc0
do_mount+0x8c6/0xca0
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_run_dev_stats+0x36/0x420 [btrfs]
commit_cowonly_roots+0x91/0x2d0 [btrfs]
btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
__mutex_lock+0x7b/0x820
btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
btrfs_sync_file+0x38a/0x480 [btrfs]
__x64_sys_fdatasync+0x47/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
__mutex_lock+0x7b/0x820
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
asm_exc_page_fault+0x1e/0x30
other info that might help us debug this:
Chain exists of:
&fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sb_pagefaults);
lock(&mm->mmap_lock#2);
lock(sb_pagefaults);
lock(&fs_info->reloc_mutex);
*** DEADLOCK ***
3 locks held by systemd-journal/509:
#0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
#1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
#2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]
stack backtrace:
CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x92/0xc8
check_noncircular+0x134/0x150
__lock_acquire+0x1241/0x20c0
lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? lock_acquire+0xb0/0x400
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
__mutex_lock+0x7b/0x820
? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
? sched_clock_cpu+0xc/0xb0
btrfs_record_root_in_trans+0x44/0x70 [btrfs]
start_transaction+0xd2/0x500 [btrfs]
btrfs_dirty_inode+0x44/0xd0 [btrfs]
file_update_time+0xc6/0x120
btrfs_page_mkwrite+0xda/0x560 [btrfs]
? sched_clock+0x5/0x10
do_page_mkwrite+0x4f/0x130
do_wp_page+0x3b0/0x4f0
handle_mm_fault+0xf47/0x1850
do_user_addr_fault+0x1fc/0x4b0
exc_page_fault+0x88/0x300
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7fa3972fdbfe
Code: Bad RIP value.
Fix this by not holding the ->device_list_mutex at this point. The
device_list_mutex exists to protect us from modifying the device list
while the file system is running.
However it can also be modified by doing a scan on a device. But this
action is specifically protected by the uuid_mutex, which we are holding
here. We cannot race with opening at this point because we have the
->s_mount lock held during the mount. Not having the
->device_list_mutex here is perfectly safe as we're not going to change
the devices at this point.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add some comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Eric reported seeing this message while running generic/475
BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
Full stack trace:
BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
BTRFS info (device dm-0): forced readonly
BTRFS warning (device dm-0): Skipping commit of aborted transaction.
------------[ cut here ]------------
BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure
BTRFS: Transaction aborted (error -117)
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10
WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs]
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10
CPU: 3 PID: 23985 Comm: fsstress Tainted: G W L 5.8.0-rc4-default+ #1181
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs]
RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7
RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000
R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70
FS: 00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0
Call Trace:
? finish_wait+0x90/0x90
? __mutex_unlock_slowpath+0x45/0x2a0
? lock_acquire+0xa3/0x440
? lockref_put_or_lock+0x9/0x30
? dput+0x20/0x4a0
? dput+0x20/0x4a0
? do_raw_spin_unlock+0x4b/0xc0
? _raw_spin_unlock+0x1f/0x30
btrfs_sync_file+0x335/0x490 [btrfs]
do_fsync+0x38/0x70
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x50/0xe0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f6a6ef1b6e3
Code: Bad RIP value.
RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3
RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003
RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
softirqs last enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace af146e0e38433456 ]---
BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
This ret came from btrfs_write_marked_extents(). If we get an aborted
transaction via EIO before, we'll see it in btree_write_cache_pages()
and return EUCLEAN, which gets printed as "Filesystem corrupted".
Except we shouldn't be returning EUCLEAN here, we need to be returning
EROFS because EUCLEAN is reserved for actual corruption, not IO errors.
We are inconsistent about our handling of BTRFS_FS_STATE_ERROR
elsewhere, but we want to use EROFS for this particular case. The
original transaction abort has the real error code for why we ended up
with an aborted transaction, all subsequent actions just need to return
EROFS because they may not have a trans handle and have no idea about
the original cause of the abort.
After patch "btrfs: don't WARN if we abort a transaction with EROFS" the
stacktrace will not be dumped either.
Reported-by: Eric Sandeen <esandeen@redhat.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add full test stacktrace ]
Signed-off-by: David Sterba <dsterba@suse.com>
We've had some discussions about what to do in certain scenarios for
error codes, specifically EUCLEAN and EROFS. Document these near the
error handling code so its clear what their intentions are.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we got some sort of corruption via a read and call
btrfs_handle_fs_error() we'll set BTRFS_FS_STATE_ERROR on the fs and
complain. If a subsequent trans handle trips over this it'll get EROFS
and then abort. However at that point we're not aborting for the
original reason, we're aborting because we've been flipped read only.
We do not need to WARN_ON() here.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The possibility of extents being shared (through clone and deduplication
operations) requires special care when logging data checksums, to avoid
having a log tree with different checksum items that cover ranges which
overlap (which resulted in missing checksums after replaying a log tree).
Such problems were fixed in the past by the following commits:
commit 40e046acbd ("Btrfs: fix missing data checksums after replaying a
log tree")
commit e289f03ea7 ("btrfs: fix corrupt log due to concurrent fsync of
inodes with shared extents")
Test case generic/588 exercises the scenario solved by the first commit
(purely sequential and deterministic) while test case generic/457 often
triggered the case fixed by the second commit (not deterministic, requires
specific timings under concurrency).
The problems were addressed by deleting, from the log tree, any existing
checksums before logging the new ones. And also by doing the deletion and
logging of the cheksums while locking the checksum range in an extent io
tree (root->log_csum_range), to deal with the case where we have concurrent
fsyncs against files with shared extents.
That however causes more contention on the leaves of a log tree where we
store checksums (and all the nodes in the paths leading to them), even
when we do not have shared extents, or all the shared extents were created
by past transactions. It also adds a bit of contention on the spin lock of
the log_csums_range extent io tree of the log root.
This change adds a 'last_reflink_trans' field to the inode to keep track
of the last transaction where a new extent was shared between inodes
(through clone and deduplication operations). It is updated for both the
source and destination inodes of reflink operations whenever a new extent
(created in the current transaction) becomes shared by the inodes. This
field is kept in memory only, not persisted in the inode item, similar
to other existing fields (last_unlink_trans, logged_trans).
When logging checksums for an extent, if the value of 'last_reflink_trans'
is smaller then the current transaction's generation/id, we skip locking
the extent range and deletion of checksums from the log tree, since we
know we do not have new shared extents. This reduces contention on the
log tree's leaves where checksums are stored.
The following script, which uses fio, was used to measure the impact of
this change:
$ cat test-fsync.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-d single -m single"
if [ $# -ne 3 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
exit 1
fi
NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=write
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=64k
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
EOF
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The tests were performed for different numbers of jobs, file sizes and
fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
12 cores, with cpu governance set to performance mode on all cores), 16GiB
of ram (the host has 64GiB) and using a NVMe device directly (without an
intermediary filesystem in the host). While running the tests, the host
was not used for anything else, to avoid disturbing the tests.
The obtained results were the following (the last line of fio's output was
pasted). Starting with 16 jobs is where a significant difference is
observable in this particular setup and hardware (differences highlighted
below). The very small differences for tests with less than 16 jobs are
possibly just noise and random.
**** 1 job, file size 1G, fsync frequency 1 ****
before this change:
WRITE: bw=23.8MiB/s (24.9MB/s), 23.8MiB/s-23.8MiB/s (24.9MB/s-24.9MB/s), io=1024MiB (1074MB), run=43075-43075msec
after this change:
WRITE: bw=24.4MiB/s (25.6MB/s), 24.4MiB/s-24.4MiB/s (25.6MB/s-25.6MB/s), io=1024MiB (1074MB), run=41938-41938msec
**** 2 jobs, file size 1G, fsync frequency 1 ****
before this change:
WRITE: bw=37.7MiB/s (39.5MB/s), 37.7MiB/s-37.7MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54351-54351msec
after this change:
WRITE: bw=37.7MiB/s (39.5MB/s), 37.6MiB/s-37.6MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54428-54428msec
**** 4 jobs, file size 1G, fsync frequency 1 ****
before this change:
WRITE: bw=67.5MiB/s (70.8MB/s), 67.5MiB/s-67.5MiB/s (70.8MB/s-70.8MB/s), io=4096MiB (4295MB), run=60669-60669msec
after this change:
WRITE: bw=68.6MiB/s (71.0MB/s), 68.6MiB/s-68.6MiB/s (71.0MB/s-71.0MB/s), io=4096MiB (4295MB), run=59678-59678msec
**** 8 jobs, file size 1G, fsync frequency 1 ****
before this change:
WRITE: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=8192MiB (8590MB), run=64048-64048msec
after this change:
WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=8192MiB (8590MB), run=63405-63405msec
**** 16 jobs, file size 1G, fsync frequency 1 ****
before this change:
WRITE: bw=78.5MiB/s (82.3MB/s), 78.5MiB/s-78.5MiB/s (82.3MB/s-82.3MB/s), io=16.0GiB (17.2GB), run=208676-208676msec
after this change:
WRITE: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=16.0GiB (17.2GB), run=149295-149295msec
(+40.1% throughput, -28.5% runtime)
**** 32 jobs, file size 1G, fsync frequency 1 ****
before this change:
WRITE: bw=58.8MiB/s (61.7MB/s), 58.8MiB/s-58.8MiB/s (61.7MB/s-61.7MB/s), io=32.0GiB (34.4GB), run=557134-557134msec
after this change:
WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430550-430550msec
(+29.4% throughput, -22.7% runtime)
**** 64 jobs, file size 512M, fsync frequency 1 ****
before this change:
WRITE: bw=65.8MiB/s (68.0MB/s), 65.8MiB/s-65.8MiB/s (68.0MB/s-68.0MB/s), io=32.0GiB (34.4GB), run=498055-498055msec
after this change:
WRITE: bw=85.1MiB/s (89.2MB/s), 85.1MiB/s-85.1MiB/s (89.2MB/s-89.2MB/s), io=32.0GiB (34.4GB), run=385116-385116msec
(+29.3% throughput, -22.7% runtime)
**** 128 jobs, file size 256M, fsync frequency 1 ****
before this change:
WRITE: bw=54.7MiB/s (57.3MB/s), 54.7MiB/s-54.7MiB/s (57.3MB/s-57.3MB/s), io=32.0GiB (34.4GB), run=599373-599373msec
after this change:
WRITE: bw=121MiB/s (126MB/s), 121MiB/s-121MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=271907-271907msec
(+121.2% throughput, -54.6% runtime)
**** 256 jobs, file size 256M, fsync frequency 1 ****
before this change:
WRITE: bw=69.2MiB/s (72.5MB/s), 69.2MiB/s-69.2MiB/s (72.5MB/s-72.5MB/s), io=64.0GiB (68.7GB), run=947536-947536msec
after this change:
WRITE: bw=121MiB/s (127MB/s), 121MiB/s-121MiB/s (127MB/s-127MB/s), io=64.0GiB (68.7GB), run=541916-541916msec
(+74.9% throughput, -42.8% runtime)
**** 512 jobs, file size 128M, fsync frequency 1 ****
before this change:
WRITE: bw=85.4MiB/s (89.5MB/s), 85.4MiB/s-85.4MiB/s (89.5MB/s-89.5MB/s), io=64.0GiB (68.7GB), run=767734-767734msec
after this change:
WRITE: bw=141MiB/s (147MB/s), 141MiB/s-141MiB/s (147MB/s-147MB/s), io=64.0GiB (68.7GB), run=466022-466022msec
(+65.1% throughput, -39.3% runtime)
**** 1024 jobs, file size 128M, fsync frequency 1 ****
before this change:
WRITE: bw=115MiB/s (120MB/s), 115MiB/s-115MiB/s (120MB/s-120MB/s), io=128GiB (137GB), run=1143775-1143775msec
after this change:
WRITE: bw=171MiB/s (180MB/s), 171MiB/s-171MiB/s (180MB/s-180MB/s), io=128GiB (137GB), run=764843-764843msec
(+48.7% throughput, -33.1% runtime)
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since there is not common cleanup run after the label it makes it
somewhat redundant.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This enum is the interface exposed to developers.
Although we have a detailed comment explaining the whole idea of space
flushing at the beginning of space-info.c, the exposed enum interface
doesn't have any comment.
Some corner cases, like BTRFS_RESERVE_FLUSH_ALL and
BTRFS_RESERVE_FLUSH_ALL_STEAL can be interrupted by fatal signals, are
not explained at all.
So add some simple comments for these enums as a quick reference.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since most metadata reservation calls can return -EINTR when get
interrupted by fatal signal, we need to review the all the metadata
reservation call sites.
In relocation code, the metadata reservation happens in the following
sites:
- btrfs_block_rsv_refill() in merge_reloc_root()
merge_reloc_root() is a pretty critical section, we don't want to be
interrupted by signal, so change the flush status to
BTRFS_RESERVE_FLUSH_LIMIT, so it won't get interrupted by signal.
Since such change can be ENPSPC-prone, also shrink the amount of
metadata to reserve least amount avoid deadly ENOSPC there.
- btrfs_block_rsv_refill() in reserve_metadata_space()
It calls with BTRFS_RESERVE_FLUSH_LIMIT, which won't get interrupted
by signal.
- btrfs_block_rsv_refill() in prepare_to_relocate()
- btrfs_block_rsv_add() in prepare_to_relocate()
- btrfs_block_rsv_refill() in relocate_block_group()
- btrfs_delalloc_reserve_metadata() in relocate_file_extent_cluster()
- btrfs_start_transaction() in relocate_block_group()
- btrfs_start_transaction() in create_reloc_inode()
Can be interrupted by fatal signal and we can handle it easily.
For these call sites, just catch the -EINTR value in btrfs_balance()
and count them as canceled.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report about bad signal timing could lead to read-only
fs during balance:
BTRFS info (device xvdb): balance: start -d -m -s
BTRFS info (device xvdb): relocating block group 73001861120 flags metadata
BTRFS info (device xvdb): found 12236 extents, stage: move data extents
BTRFS info (device xvdb): relocating block group 71928119296 flags data
BTRFS info (device xvdb): found 3 extents, stage: move data extents
BTRFS info (device xvdb): found 3 extents, stage: update data pointers
BTRFS info (device xvdb): relocating block group 60922265600 flags metadata
BTRFS: error (device xvdb) in btrfs_drop_snapshot:5505: errno=-4 unknown
BTRFS info (device xvdb): forced readonly
BTRFS info (device xvdb): balance: ended with status: -4
[CAUSE]
The direct cause is the -EINTR from the following call chain when a
fatal signal is pending:
relocate_block_group()
|- clean_dirty_subvols()
|- btrfs_drop_snapshot()
|- btrfs_start_transaction()
|- btrfs_delayed_refs_rsv_refill()
|- btrfs_reserve_metadata_bytes()
|- __reserve_metadata_bytes()
|- wait_reserve_ticket()
|- prepare_to_wait_event();
|- ticket->error = -EINTR;
Normally this behavior is fine for most btrfs_start_transaction()
callers, as they need to catch any other error, same for the signal, and
exit ASAP.
However for balance, especially for the clean_dirty_subvols() case, we're
already doing cleanup works, getting -EINTR from btrfs_drop_snapshot()
could cause a lot of unexpected problems.
From the mentioned forced read-only report, to later balance error due
to half dropped reloc trees.
[FIX]
Fix this problem by using btrfs_join_transaction() if
btrfs_drop_snapshot() is called from relocation context.
Since btrfs_join_transaction() won't get interrupted by signal, we can
continue the cleanup.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>3
Signed-off-by: David Sterba <dsterba@suse.com>
Although btrfs balance can be canceled with "btrfs balance cancel"
command, it's still almost muscle memory to press Ctrl-C to cancel a
long running btrfs balance.
So allow btrfs balance to check signal to determine if it should exit.
The cancellation points are in known location and we're only adding one
more reason, so this should be safe.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no cleanup that occurs so we can simply return 0 directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
User Forza reported on IRC that some invalid combinations of file
attributes are accepted by chattr.
The NODATACOW and compression file flags/attributes are mutually
exclusive, but they could be set by 'chattr +c +C' on an empty file. The
nodatacow will be in effect because it's checked first in
btrfs_run_delalloc_range.
Extend the flag validation to catch the following cases:
- input flags are conflicting
- old and new flags are conflicting
- initialize the local variable with inode flags after inode ls locked
Inode attributes take precedence over mount options and are an
independent setting.
Nocompress would be a no-op with nodatacow, but we don't want to mix
any compression-related options with nodatacow.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: David Sterba <dsterba@suse.com>
->show_devname currently shows the lowest devid in the list. As the seed
devices have the lowest devid in the sprouted filesystem, the userland
tool such as findmnt end up seeing seed device instead of the device from
the read-writable sprouted filesystem. As shown below.
mount /dev/sda /btrfs
mount: /btrfs: WARNING: device write-protected, mounted read-only.
findmnt --output SOURCE,TARGET,UUID /btrfs
SOURCE TARGET UUID
/dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
btrfs dev add -f /dev/sdb /btrfs
umount /btrfs
mount /dev/sdb /btrfs
findmnt --output SOURCE,TARGET,UUID /btrfs
SOURCE TARGET UUID
/dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
All sprouts from a single seed will show the same seed device and the
same fsid. That's confusing.
This is causing problems in our prototype as there isn't any reference
to the sprout file-system(s) which is being used for actual read and
write.
This was added in the patch which implemented the show_devname in btrfs
commit 9c5085c147 ("Btrfs: implement ->show_devname").
I tried to look for any particular reason that we need to show the seed
device, there isn't any.
So instead, do not traverse through the seed devices, just show the
lowest devid in the sprouted fsid.
After the patch:
mount /dev/sda /btrfs
mount: /btrfs: WARNING: device write-protected, mounted read-only.
findmnt --output SOURCE,TARGET,UUID /btrfs
SOURCE TARGET UUID
/dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
btrfs dev add -f /dev/sdb /btrfs
mount -o rw,remount /dev/sdb /btrfs
findmnt --output SOURCE,TARGET,UUID /btrfs
SOURCE TARGET UUID
/dev/sdb /btrfs 595ca0e6-b82e-46b5-b9e2-c72a6928be48
mount /dev/sda /btrfs1
mount: /btrfs1: WARNING: device write-protected, mounted read-only.
btrfs dev add -f /dev/sdc /btrfs1
findmnt --output SOURCE,TARGET,UUID /btrfs1
SOURCE TARGET UUID
/dev/sdc /btrfs1 ca1dbb7a-8446-4f95-853c-a20f3f82bdbb
cat /proc/self/mounts | grep btrfs
/dev/sdb /btrfs btrfs rw,relatime,noacl,space_cache,subvolid=5,subvol=/ 0 0
/dev/sdc /btrfs1 btrfs ro,relatime,noacl,space_cache,subvolid=5,subvol=/ 0 0
Reported-by: Martin K. Petersen <martin.petersen@oracle.com>
CC: stable@vger.kernel.org # 4.19+
Tested-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Sometime fsstress could lead to qgroup warning for case like
generic/013:
BTRFS warning (device dm-3): qgroup 0/259 has unreleased space, type 1 rsv 81920
------------[ cut here ]------------
WARNING: CPU: 9 PID: 24535 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs]
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:close_ctree+0x1dc/0x323 [btrfs]
Call Trace:
btrfs_put_super+0x15/0x17 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x17/0x30 [btrfs]
deactivate_locked_super+0x3b/0xa0
deactivate_super+0x40/0x50
cleanup_mnt+0x135/0x190
__cleanup_mnt+0x12/0x20
task_work_run+0x64/0xb0
__prepare_exit_to_usermode+0x1bc/0x1c0
__syscall_return_slowpath+0x47/0x230
do_syscall_64+0x64/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace 6c341cdf9b6cc3c1 ]---
BTRFS error (device dm-3): qgroup reserved space leaked
While that subvolume 259 is no longer in that filesystem.
[CAUSE]
Normally per-trans qgroup reserved space is freed when a transaction is
committed, in commit_fs_roots().
However for completely dropped subvolume, that subvolume is completely
gone, thus is no longer in the fs_roots_radix, and its per-trans
reserved qgroup will never be freed.
Since the subvolume is already gone, leaked per-trans space won't cause
any trouble for end users.
[FIX]
Just call btrfs_qgroup_free_meta_all_pertrans() before a subvolume is
completely dropped.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
clang static analysis flags this error
fs/btrfs/ref-verify.c:290:3: warning: Potential leak of memory pointed to by 're' [unix.Malloc]
kfree(be);
^~~~~
The problem is in this block of code:
if (root_objectid) {
struct root_entry *exist_re;
exist_re = insert_root_entry(&exist->roots, re);
if (exist_re)
kfree(re);
}
There is no 'else' block freeing when root_objectid is 0. Add the
missing kfree to the else branch.
Fixes: fd708b81d9 ("Btrfs: add a extent ref verify tool")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The whole chunk tree is read at mount time so we can utilize readahead
to get the tree blocks to memory before we read the items. The idea is
from Robbie, but instead of updating search slot readahead, this patch
implements the chunk tree readahead manually from nodes on level 1.
We've decided to do specific readahead optimizations and then unify them
under a common API so we don't break everything by changing the search
slot readahead logic.
Higher chunk trees grow on large filesystems (many terabytes), and
prefetching just level 1 seems to be sufficient. Provided example was
from a 200TiB filesystem with chunk tree level 2.
CC: Robbie Ko <robbieko@synology.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add retrieval of the filesystem's metadata UUID to the fsinfo ioctl.
This is driven by setting the BTRFS_FS_INFO_FLAG_METADATA_UUID flag in
btrfs_ioctl_fs_info_args::flags.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add retrieval of the filesystem's generation to the fsinfo ioctl. This is
driven by setting the BTRFS_FS_INFO_FLAG_GENERATION flag in
btrfs_ioctl_fs_info_args::flags.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the recent addition of filesystem checksum types other than CRC32c,
it is not anymore hard-coded which checksum type a btrfs filesystem uses.
Up to now there is no good way to read the filesystem checksum, apart from
reading the filesystem UUID and then query sysfs for the checksum type.
Add a new csum_type and csum_size fields to the BTRFS_IOC_FS_INFO ioctl
command which usually is used to query filesystem features. Also add a
flags member indicating that the kernel responded with a set csum_type and
csum_size field.
For compatibility reasons, only return the csum_type and csum_size if
the BTRFS_FS_INFO_FLAG_CSUM_INFO flag was passed to the kernel. Also
clear any unknown flags so we don't pass false positives to user-space
newer than the kernel.
To simplify further additions to the ioctl, also switch the padding to a
u8 array. Pahole was used to verify the result of this switch:
The csum members are added before flags, which might look odd, but this
is to keep the alignment requirements and not to introduce holes in the
structure.
$ pahole -C btrfs_ioctl_fs_info_args fs/btrfs/btrfs.ko
struct btrfs_ioctl_fs_info_args {
__u64 max_id; /* 0 8 */
__u64 num_devices; /* 8 8 */
__u8 fsid[16]; /* 16 16 */
__u32 nodesize; /* 32 4 */
__u32 sectorsize; /* 36 4 */
__u32 clone_alignment; /* 40 4 */
__u16 csum_type; /* 44 2 */
__u16 csum_size; /* 46 2 */
__u64 flags; /* 48 8 */
__u8 reserved[968]; /* 56 968 */
/* size: 1024, cachelines: 16, members: 10 */
};
Fixes: 3951e7f050 ("btrfs: add xxhash64 to checksumming algorithms")
Fixes: 3831bf0094 ("btrfs: add sha256 to checksumming algorithm")
CC: stable@vger.kernel.org # 5.5+
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
commit a514d63882 ("btrfs: qgroup: Commit transaction in advance to
reduce early EDQUOT") tries to reduce the early EDQUOT problems by
checking the qgroup free against threshold and tries to wake up commit
kthread to free some space.
The problem of that mechanism is, it can only free qgroup per-trans
metadata space, can't do anything to data, nor prealloc qgroup space.
Now since we have the ability to flush qgroup space, and implemented
retry-after-EDQUOT behavior, such mechanism can be completely replaced.
So this patch will cleanup such mechanism in favor of
retry-after-EDQUOT.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
There are known problem related to how btrfs handles qgroup reserved
space. One of the most obvious case is the the test case btrfs/153,
which do fallocate, then write into the preallocated range.
btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
--- tests/btrfs/153.out 2019-10-22 15:18:14.068965341 +0800
+++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01 20:24:40.730000089 +0800
@@ -1,2 +1,5 @@
QA output created by 153
+pwrite: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
Silence is golden
...
(Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)
[CAUSE]
Since commit c6887cd111 ("Btrfs: don't do nocow check unless we have to"),
we always reserve space no matter if it's COW or not.
Such behavior change is mostly for performance, and reverting it is not
a good idea anyway.
For preallcoated extent, we reserve qgroup data space for it already,
and since we also reserve data space for qgroup at buffered write time,
it needs twice the space for us to write into preallocated space.
This leads to the -EDQUOT in buffered write routine.
And we can't follow the same solution, unlike data/meta space check,
qgroup reserved space is shared between data/metadata.
The EDQUOT can happen at the metadata reservation, so doing NODATACOW
check after qgroup reservation failure is not a solution.
[FIX]
To solve the problem, we don't return -EDQUOT directly, but every time
we got a -EDQUOT, we try to flush qgroup space:
- Flush all inodes of the root
NODATACOW writes will free the qgroup reserved at run_dealloc_range().
However we don't have the infrastructure to only flush NODATACOW
inodes, here we flush all inodes anyway.
- Wait for ordered extents
This would convert the preallocated metadata space into per-trans
metadata, which can be freed in later transaction commit.
- Commit transaction
This will free all per-trans metadata space.
Also we don't want to trigger flush multiple times, so here we introduce
a per-root wait list and a new root status, to ensure only one thread
starts the flushing.
Fixes: c6887cd111 ("Btrfs: don't do nocow check unless we have to")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
Before this patch, when btrfs_qgroup_reserve_data() fails, we free all
reserved space of the changeset.
For example:
ret = btrfs_qgroup_reserve_data(inode, changeset, 0, SZ_1M);
ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_1M, SZ_1M);
ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_2M, SZ_1M);
If the last btrfs_qgroup_reserve_data() failed, it will release the
entire [0, 3M) range.
This behavior is kind of OK for now, as when we hit -EDQUOT, we normally
go error handling and need to release all reserved ranges anyway.
But this also means the following call is not possible:
ret = btrfs_qgroup_reserve_data();
if (ret == -EDQUOT) {
/* Do something to free some qgroup space */
ret = btrfs_qgroup_reserve_data();
}
As if the first btrfs_qgroup_reserve_data() fails, it will free all
reserved qgroup space.
[CAUSE]
This is because we release all reserved ranges when
btrfs_qgroup_reserve_data() fails.
[FIX]
This patch will implement a new function, qgroup_unreserve_range(), to
iterate through the ulist nodes, to find any nodes in the failure range,
and remove the EXTENT_QGROUP_RESERVED bits from the io_tree, and
decrease the extent_changeset::bytes_changed, so that we can revert to
previous state.
This allows later patches to retry btrfs_qgroup_reserve_data() if EDQUOT
happens.
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have refcount_t now with the associated library to handle refcounts,
which gives us extra debugging around reference count mistakes that may
be made. For example it'll warn on any transition from 0->1 or 0->-1,
which is handy for noticing cases where we've messed up reference
counting. Convert the block group ref counting from an atomic_t to
refcount_t and use the appropriate helpers.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Multi-statement macros should be enclosed in do/while(0) block to make
their use safe in single statement if conditions. All current uses of
the macros are safe, so this change is for future protection.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While at it use the opportunity to simplify find_logical_bio_stripe by
reducing the scope of 'stripe_start' variable and squash the
sector-to-bytes conversion on one line.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unify the style in the file such that return value of bio_list_pop is
assigned directly in the while loop. This is in line with the rest of
the kernel.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The merging logic is always executed if the current stripe's device
is not missing. So there's no point in duplicating the check. Simply
remove it, while at it reduce the scope of the 'last_end' variable.
If the current stripe's device is missing we fail the stripe early on.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since btrfs_bio always contains the extra space for the tgtdev_map and
raid_maps it's pointless to make the assignment iff specific conditions
are met.
Instead, always assign the pointers to their correct value at allocation
time. To accommodate this change also move code a bit in
__btrfs_map_block so that btrfs_bio::stripes array is always initialized
before the raid_map, subsequently move the call to sort_parity_stripes
in the 'if' building the raid_map, retaining the old behavior.
To better understand the change, there are 2 aspects to this:
1. The original code is harder to grasp because the calculations for
initializing raid_map/tgtdev ponters are apart from the initial
allocation of memory. Having them predicated on 2 separate checks
doesn't help that either... So by moving the initialisation in
alloc_btrfs_bio puts everything together.
2. tgtdev/raid_maps are now always initialized despite sometimes they
might be equal i.e __btrfs_map_block_for_discard calls
alloc_btrfs_bio with tgtdev = 0 but their usage should be predicated
on external checks i.e. just because those pointers are non-null
doesn't mean they are valid per-se. And actually while taking another
look at __btrfs_map_block I saw a discrepancy:
Original code initialised tgtdev_map if the following check is true:
if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL)
However, further down tgtdev_map is only used if the following check
is true:
if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL && need_full_stripe(op))
e.g. the additional need_full_stripe(op) predicate is there.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copy more details from mail discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
Since BTRFS uses a private bdi it makes sense to create a link to this
bdi under /sys/fs/btrfs/<UUID>/bdi. This allows size of read ahead to
be controlled. Without this patch it's not possible to uniquely identify
which bdi pertains to which btrfs filesystem in the case of multiple
btrfs filesystems.
It's fine to simply call sysfs_remove_link without checking if the
link indeed has been created. The call path
sysfs_remove_link
kernfs_remove_by_name
kernfs_remove_by_name_ns
will simply return -ENOENT in case it doesn't exist.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If a compressed read fails due to checksum error only a line is printed
to dmesg, device corrupt counter is not modified.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
compressed_bio::orig_bio is always set in btrfs_submit_compressed_read
before any bio submission is performed. Since that function is always
called with a valid bio it renders the ASSERT unnecessary.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that btrfs_io_bio have access to btrfs_device we can safely
increment the device corruption counter on error. There is one notable
exception - repair bios for raid. Since those don't go through the
normal submit_stripe_bio callpath but through raid56_parity_recover thus
repair bios won't have their device set.
Scrub increments the corruption counter for checksum mismatch as well
but does not call this function.
Link: https://lore.kernel.org/linux-btrfs/4857863.FCrPRfMyHP@liv/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_map_bio ensures that all submitted bios to devices have valid
btrfs_device::bdev so this check can be removed from btrfs_end_bio. This
check was added in june 2012 597a60fade ("Btrfs: don't count I/O
statistic read errors for missing devices") but then in October of the
same year another commit de1ee92ac3 ("Btrfs: recheck bio against
block device when we map the bio") started checking for the presence of
btrfs_device::bdev before actually issuing the bio.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of recording stripe_index and using that to access correct
btrfs_device from btrfs_bio::stripes record the btrfs_device in
btrfs_io_bio. This will enable endio handlers to increment device
error counters on checksum errors.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make the function directly return a pointer to a failure record and
adjust callers to handle it. Also refactor the logic inside so that
the case which allocates the failure record for the first time is not
handled in an 'if' arm, saving us a level of indentation. Finally make
the function static as it's not used outside of extent_io.c .
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Only failure that get_state_failrec can get is if there is no failure
for the given address. There is no reason why the function should return
a status code and use a separate parameter for returning the actual
failure rec (if one is found). Simplify it by making the return type
a pointer and return ERR_PTR value in case of errors.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The option subvolrootid used to be a workaround for mounting subvolumes
and ineffective since 5e2a4b25da ("btrfs: deprecate subvolrootid mount
option"). We have subvol= that works and we don't need to keep the
cruft, let's remove it.
Signed-off-by: David Sterba <dsterba@suse.com>
The mount option alloc_start has no effect since 0d0c71b317 ("btrfs:
obsolete and remove mount option alloc_start") which has details why
it's been deprecated. We can remove it.
Signed-off-by: David Sterba <dsterba@suse.com>
When syncing the log, we used to update the log root tree without holding
neither the log_mutex of the subvolume root nor the log_mutex of log root
tree.
We used to have two critical sections delimited by the log_mutex of the
log root tree, so in the first one we incremented the log_writers of the
log root tree and on the second one we decremented it and waited for the
log_writers counter to go down to zero. This was because the update of
the log root tree happened between the two critical sections.
The use of two critical sections allowed a little bit more of parallelism
and required the use of the log_writers counter, necessary to make sure
we didn't miss any log root tree update when we have multiple tasks trying
to sync the log in parallel.
However after commit 06989c799f ("Btrfs: fix race updating log root
item during fsync") the log root tree update was moved into a critical
section delimited by the subvolume's log_mutex. Later another commit
moved the log tree update from that critical section into the second
critical section delimited by the log_mutex of the log root tree. Both
commits addressed different bugs.
The end result is that the first critical section delimited by the
log_mutex of the log root tree became pointless, since there's nothing
done between it and the second critical section, we just have an unlock
of the log_mutex followed by a lock operation. This means we can merge
both critical sections, as the first one does almost nothing now, and we
can stop using the log_writers counter of the log root tree, which was
incremented in the first critical section and decremented in the second
criticial section, used to make sure no one in the second critical section
started writeback of the log root tree before some other task updated it.
So just remove the mutex_unlock() followed by mutex_lock() of the log root
tree, as well as the use of the log_writers counter for the log root tree.
This patch is part of a series that has the following patches:
1/4 btrfs: only commit the delayed inode when doing a full fsync
2/4 btrfs: only commit delayed items at fsync if we are logging a directory
3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
4/4 btrfs: remove no longer needed use of log_writers for the log root tree
After the entire patchset applied I saw about 12% decrease on max latency
reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
ram, using kvm and using a raw NVMe device directly (no intermediary fs on
the host). The test was invoked like the following:
mkfs.btrfs -f /dev/sdk
mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
dbench -D /mnt/sdk -t 300 8
umount /mnt/dsk
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are incrementing the log_batch atomic counter of the root log tree but
we never use that counter, it's used only for the log trees of subvolume
roots. We started doing it when we moved the log_batch and log_write
counters from the global, per fs, btrfs_fs_info structure, into the
btrfs_root structure in commit 7237f18336 ("Btrfs: fix tree logs
parallel sync").
So just stop doing it for the log root tree and add a comment over the
field declaration so inform it's used only for log trees of subvolume
roots.
This patch is part of a series that has the following patches:
1/4 btrfs: only commit the delayed inode when doing a full fsync
2/4 btrfs: only commit delayed items at fsync if we are logging a directory
3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
4/4 btrfs: remove no longer needed use of log_writers for the log root tree
After the entire patchset applied I saw about 12% decrease on max latency
reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
ram, using kvm and using a raw NVMe device directly (no intermediary fs on
the host). The test was invoked like the following:
mkfs.btrfs -f /dev/sdk
mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
dbench -D /mnt/sdk -t 300 8
umount /mnt/dsk
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode we are committing its delayed items if either the
inode is a directory or if it is a new inode, created in the current
transaction.
We need to do it for directories, since new directory indexes are stored
as delayed items of the inode and when logging a directory we need to be
able to access all indexes from the fs/subvolume tree in order to figure
out which index ranges need to be logged.
However for new inodes that are not directories, we do not need to do it
because the only type of delayed item they can have is the inode item, and
we are guaranteed to always log an up to date version of the inode item:
*) for a full fsync we do it by committing the delayed inode and then
copying the item from the fs/subvolume tree with
copy_inode_items_to_log();
*) for a fast fsync we always log the inode item based on the contents of
the in-memory struct btrfs_inode. We guarantee this is always done since
commit e4545de5b0 ("Btrfs: fix fsync data loss after append write").
So stop running delayed items for a new inodes that are not directories,
since that forces committing the delayed inode into the fs/subvolume tree,
wasting time and adding contention to the tree when a full fsync is not
required. We will only do it in case a fast fsync is needed.
This patch is part of a series that has the following patches:
1/4 btrfs: only commit the delayed inode when doing a full fsync
2/4 btrfs: only commit delayed items at fsync if we are logging a directory
3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
4/4 btrfs: remove no longer needed use of log_writers for the log root tree
After the entire patchset applied I saw about 12% decrease on max latency
reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
ram, using kvm and using a raw NVMe device directly (no intermediary fs on
the host). The test was invoked like the following:
mkfs.btrfs -f /dev/sdk
mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
dbench -D /mnt/sdk -t 300 8
umount /mnt/dsk
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 2c2c452b0c ("Btrfs: fix fsync when extend references are added
to an inode") forced a commit of the delayed inode when logging an inode
in order to ensure we would end up logging the inode item during a full
fsync. By committing the delayed inode, we updated the inode item in the
fs/subvolume tree and then later when copying items from leafs modified in
the current transaction into the log tree (with copy_inode_items_to_log())
we ended up copying the inode item from the fs/subvolume tree into the log
tree. Logging an up to date version of the inode item is required to make
sure at log replay time we get the link count fixup triggered among other
things (replay xattr deletes, etc). The test case generic/040 from fstests
exercises the bug which that commit fixed.
However for a fast fsync we don't need to commit the delayed inode because
we always log an up to date version of the inode item based on the struct
btrfs_inode we have in-memory. We started doing this for fast fsyncs since
commit e4545de5b0 ("Btrfs: fix fsync data loss after append write").
So just stop committing the delayed inode if we are doing a fast fsync,
we are only wasting time and adding contention on fs/subvolume tree.
This patch is part of a series that has the following patches:
1/4 btrfs: only commit the delayed inode when doing a full fsync
2/4 btrfs: only commit delayed items at fsync if we are logging a directory
3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
4/4 btrfs: remove no longer needed use of log_writers for the log root tree
After the entire patchset applied I saw about 12% decrease on max latency
reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
ram, using kvm and using a raw NVMe device directly (no intermediary fs on
the host). The test was invoked like the following:
mkfs.btrfs -f /dev/sdk
mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
dbench -D /mnt/sdk -t 300 8
umount /mnt/dsk
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When the anonymous block device pool is exhausted, subvolume/snapshot
creation fails with EMFILE (Too many files open). This has been reported
by a user. The allocation happens in the second phase during transaction
commit where it's only way out is to abort the transaction
BTRFS: Transaction aborted (error -24)
WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
Call Trace:
create_pending_snapshots+0x82/0xa0 [btrfs]
btrfs_commit_transaction+0x275/0x8c0 [btrfs]
btrfs_mksubvol+0x4b9/0x500 [btrfs]
btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
btrfs_ioctl+0x11a4/0x2da0 [btrfs]
do_vfs_ioctl+0xa9/0x640
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x5a/0x110
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace 33f2f83f3d5250e9 ]---
BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
BTRFS info (device sda1): forced readonly
BTRFS warning (device sda1): Skipping commit of aborted transaction.
BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown
[CAUSE]
When the global anonymous block device pool is exhausted, the following
call chain will fail, and lead to transaction abort:
btrfs_ioctl_snap_create_v2()
|- btrfs_ioctl_snap_create_transid()
|- btrfs_mksubvol()
|- btrfs_commit_transaction()
|- create_pending_snapshot()
|- btrfs_get_fs_root()
|- btrfs_init_fs_root()
|- get_anon_bdev()
[FIX]
Although we can't enlarge the anonymous block device pool, at least we
can preallocate anon_dev for subvolume/snapshot in the first phase,
outside of transaction context and exactly at the moment the user calls
the creation ioctl.
Reported-by: Greed Rong <greedrong@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When a lot of subvolumes are created, there is a user report about
transaction aborted caused by slow anonymous block device reclaim:
BTRFS: Transaction aborted (error -24)
WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
Call Trace:
create_pending_snapshots+0x82/0xa0 [btrfs]
btrfs_commit_transaction+0x275/0x8c0 [btrfs]
btrfs_mksubvol+0x4b9/0x500 [btrfs]
btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
btrfs_ioctl+0x11a4/0x2da0 [btrfs]
do_vfs_ioctl+0xa9/0x640
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x5a/0x110
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace 33f2f83f3d5250e9 ]---
BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
BTRFS info (device sda1): forced readonly
BTRFS warning (device sda1): Skipping commit of aborted transaction.
BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown
[CAUSE]
The anonymous device pool is shared and its size is 1M. It's possible to
hit that limit if the subvolume deletion is not fast enough and the
subvolumes to be cleaned keep the ids allocated.
[WORKAROUND]
We can't avoid the anon device pool exhaustion but we can shorten the
time the id is attached to the subvolume root once the subvolume becomes
invisible to the user.
Reported-by: Greed Rong <greedrong@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When a lot of subvolumes are created, there is a user report about
transaction aborted:
BTRFS: Transaction aborted (error -24)
WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
Call Trace:
create_pending_snapshots+0x82/0xa0 [btrfs]
btrfs_commit_transaction+0x275/0x8c0 [btrfs]
btrfs_mksubvol+0x4b9/0x500 [btrfs]
btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
btrfs_ioctl+0x11a4/0x2da0 [btrfs]
do_vfs_ioctl+0xa9/0x640
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x5a/0x110
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace 33f2f83f3d5250e9 ]---
BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
BTRFS info (device sda1): forced readonly
BTRFS warning (device sda1): Skipping commit of aborted transaction.
BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown
[CAUSE]
The error is EMFILE (Too many files open) and comes from the anonymous
block device allocation. The ids are in a shared pool of size 1<<20.
The ids are assigned to live subvolumes, ie. the root structure exists
in memory (eg. after creation or after the root appears in some path).
The pool could be exhausted if the numbers are not reclaimed fast
enough, after subvolume deletion or if other system component uses the
anon block devices.
[WORKAROUND]
Since it's not possible to completely solve the problem, we can only
minimize the time the id is allocated to a subvolume root.
Firstly, we can reduce the use of anon_dev by trees that are not
subvolume roots, like data reloc tree.
This patch will do extra check on root objectid, to skip roots that
don't need anon_dev. Currently it's only data reloc tree and orphan
roots.
Reported-by: Greed Rong <greedrong@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will add the following sysfs interface:
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/referenced
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/exclusive
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_referenced
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_exclusive
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/limit_flags
Which is also available in output of "btrfs qgroup show".
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_data
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_pertrans
/sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_prealloc
The last 3 rsv related members are not visible to users, but can be very
useful to debug qgroup limit related bugs.
Also, to avoid '/' used in <qgroup_id>, the separator between qgroup
level and qgroup id is changed to '_'.
The interface is not hidden behind 'debug' as we want this interface to
be included into production build and to provide another way to read the
qgroup information besides the ioctls.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The qgroup level is limited to u16, so no need to use u64 for it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
vfs_inode is used only for the inode number everything else requires
btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use btrfs_ino ]
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of making multiple calls to BTRFS_I simply take btrfs_inode as
an input paramter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The vfs inode is only used for a pair of inode_lock/unlock calls all
other uses call for btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All of its children functions use btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All of its children take btrfs_inode so bubble up this requirement to
btrfs_delalloc_reserve_space's interface and stop calling BTRFS_I
internally.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of calling BTRFS_I on the passed vfs_inode take btrfs_inode
directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It needs btrfs_inode so take it as a parameter directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It only uses btrfs_inode internally so take it as a parameter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
No point in taking an inode only to get btrfs_fs_info from it, instead
take btrfs_fs_info directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's only a single use of vfs_inode in a tracepoint so let's take
btrfs_inode directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a single use of the generic vfs_inode so let's take btrfs_inode
as a parameter and remove couple of redundant BTRFS_I() calls.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparation to make btrfs_dirty_pages take btrfs_inode as parameter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Only find_lock_delalloc_range uses vfs_inode so let's take the
btrfs_inode as a parameter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It has only a single use for a generic vfs inode vs 3 for btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function really needs a btrfs_inode and not a generic vfs one. Take
it as a parameter and get rid of superfluous BTRFS_I() calls.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Take btrfs_inode directly and stop using superfulous BTRFS_I calls.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Simply forwards its argument so let's get rid of one extra BTRFS_I by
taking btrfs_inode directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All children now take btrfs_inode so convert it to taking it as a
parameter as well.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gets rid of superfulous BTRFS_I() calls and prepare for converting
btrfs_run_delalloc_range to using btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Simply gets rid of superfluous BTRFS_I() calls.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Gets rid of superfluous BTRFS_I() calls.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparation to converting btrfs_run_delalloc_range to using btrfs_inode
without BTRFS_I() calls.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It really wants btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It doesn't really need vfs_inode but btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It only uses vfs inode for assigning it to the async_chunk function.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It only really uses btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It really wants btrfs_inode and is prepration to converting
run_delalloc_nocow to taking btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It just forwards its argument to __btrfs_qgroup_release_data.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All but 3 uses require vfs_inode so convert the logic to have
btrfs_inode be the main inode struct.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Majority of its uses are for btrfs_inode so take it as an argument
directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It simpy forwards its inode argument to __btrfs_add_ordered_extent which
already takes btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All its children functions take btrfs_inode so convert it to taking
btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparation to converting its callers to taking btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It has only 2 uses for the vfs_inode - insert_inline_extent and
i_size_read. On the flipside it will allow converting its callers to
btrfs_inode, so convert it to taking btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It passes btrfs_inode to its callee so change the interface.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It uses vfs_inode only for a tracepoint so convert its interface to take
btrfs_inode directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It only uses btrfs_inode so can just as easily take it as an argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On a filesystem with exhausted metadata, but still enough to start
balance, it's possible to hit this error:
[324402.053842] BTRFS info (device loop0): 1 enospc errors during balance
[324402.060769] BTRFS info (device loop0): balance: ended with status: -28
[324402.172295] BTRFS: error (device loop0) in reset_balance_state:3321: errno=-28 No space left
It fails inside reset_balance_state and turns the filesystem to
read-only, which is unnecessary and should be fixed too, but the problem
is caused by lack for space when the balance item is deleted. This is a
one-time operation and from the same rank as unlink that is allowed to
use the global block reserve. So do the same for the balance item.
Status of the filesystem (100GiB) just after the balance fails:
$ btrfs fi df mnt
Data, single: total=80.01GiB, used=38.58GiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=19.99GiB, used=19.48GiB
GlobalReserve, single: total=512.00MiB, used=50.11MiB
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_check_can_nocow() now has two completely different
call patterns.
For nowait variant, callers don't need to do any cleanup. While for
wait variant, callers need to release the lock if they can do nocow
write.
This is somehow confusing, and is already a problem for the exported
btrfs_check_can_nocow().
So this patch will separate the different patterns into different
functions.
For nowait variant, the function will be called check_nocow_nolock().
For wait variant, the function pair will be btrfs_check_nocow_lock()
btrfs_check_nocow_unlock().
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These two functions have extra conditions that their callers need to
meet, and some not-that-common parameters used for return value.
So adding some comments may save reviewers some time.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When the data space is exhausted, even if the inode has NOCOW attribute,
we will still refuse to truncate unaligned range due to ENOSPC.
The following script can reproduce it pretty easily:
#!/bin/bash
dev=/dev/test/test
mnt=/mnt/btrfs
umount $dev &> /dev/null
umount $mnt &> /dev/null
mkfs.btrfs -f $dev -b 1G
mount -o nospace_cache $dev $mnt
touch $mnt/foobar
chattr +C $mnt/foobar
xfs_io -f -c "pwrite -b 4k 0 4k" $mnt/foobar > /dev/null
xfs_io -f -c "pwrite -b 4k 0 1G" $mnt/padding &> /dev/null
sync
xfs_io -c "fpunch 0 2k" $mnt/foobar
umount $mnt
Currently this will fail at the fpunch part.
[CAUSE]
Because btrfs_truncate_block() always reserves space without checking
the NOCOW attribute.
Since the writeback path follows NOCOW bit, we only need to bother the
space reservation code in btrfs_truncate_block().
[FIX]
Make btrfs_truncate_block() follow btrfs_buffered_write() to try to
reserve data space first, and fall back to NOCOW check only when we
don't have enough space.
Such always-try-reserve is an optimization introduced in
btrfs_buffered_write(), to avoid expensive btrfs_check_can_nocow() call.
This patch will export check_can_nocow() as btrfs_check_can_nocow(), and
use it in btrfs_truncate_block() to fix the problem.
Reported-by: Martin Doucha <martin.doucha@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Estimated time of removal of the functionality is 5.11, the option will
be still parsed but will have no effect.
Reasons for deprecation and removal:
- very poor naming choice of the mount option, it's supposed to cache
and reuse the inode _numbers_, but it sounds a some generic cache for
inodes
- the only known usecase where this option would make sense is on a
32bit architecture where inode numbers in one subvolume would be
exhausted due to 32bit inode::i_ino
- the cache is stored on disk, consumes space, needs to be loaded and
written back
- new inode number allocation is slower due to lookups into the cache
(compared to a simple increment which is the default)
- uses the free-space-cache code that is going to be deprecated as well
in the future
Known problems:
- since 2011, returning EEXIST when there's not enough space in a page
to store all checksums, see commit 4b9465cb9e ("Btrfs: add mount -o
inode_cache")
Remaining issues:
- if the option was enabled, new inodes created, the option disabled
again, the cache is still stored on the devices and there's currently
no way to remove it
Signed-off-by: David Sterba <dsterba@suse.com>
Last touched in 2013 by commit de78b51a28 ("btrfs: remove cache only
arguments from defrag path") that was the only code that used the value.
Now it's only set but never used for anything, so we can remove it.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fiemap callback is not part of UAPI interface and the prototypes
don't have the __u64 types either.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
num_extents is already checked in the next if condition and can
be safely removed.
Signed-off-by: Denis Efremov <efremov@linux.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_put_root() we're freeing a btrfs_root's 'node' and 'commit_root'
extent buffers manually via kfree(), while we're using
free_root_extent_buffers() in the free_root_pointers() function above.
free_root_extent_buffers() also NULLs the pointers after freeing, which
mitigates potential double frees.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function iterates all extents in the extent cluster, make this
intention obvious by using a for loop. No functional chanes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_alloc_data_chunk_ondemand and btrfs_free_reserved_data_space_noquota
don't really use the guts of the inodes being passed to them. This
implies it's not required to call them under extent lock. Move code
around in prealloc_file_extent_cluster to do the heavy, data alloc/free
operations outside of the lock. This also makes the 'out' label
unnecessary, so remove it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Extents in the extent cluster are guaranteed to be contiguous as such
the hole check inside the loop can never trigger. In fact this check was
never functional since it was added in 18513091af ("btrfs: update
btrfs_space_info's bytes_may_use timely") which came after the commit
introducing clustered/contiguous extents 0257bb82d2 ("Btrfs: relocate
file extents in clusters").
Let's just remove it as it adds noise to the source.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It has only 4 uses of a vfs_inode for inode_sub_bytes but unifies the
interface with the non __ prefixed version. Will also makes converting
its callers to btrfs_inode easier.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Will enable converting btrfs_submit_compressed_write to btrfs_inode more
easily.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It has one VFS and 1 btrfs inode usages but converting it to btrfs_inode
interface will allow seamless conversion of its callers.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It really wants a btrfs_inode and will allow submit_compressed_extents
to be completely converted to btrfs_inode in follow up patches.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It really wants btrfs_inode and not a vfs inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It doesn't use the generic vfs inode for anything use btrfs_inode
directly.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It doesn't use the vfs inode for anything, can just as easily take
btrfs_inode. Follow up patches will convert callers as well.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is internal btrfs function what really needs the vfs_inode only for
igrab and a tracepoint.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The 'trans_list' member of an ordered extent was used to keep track of the
ordered extents for which a transaction commit had to wait. These were
ordered extents that were started and logged by an fsync. However we don't
do that anymore and before we stopped doing it we changed the approach to
wait for the ordered extents in commit 161c3549b4 ("Btrfs: change how
we wait for pending ordered extents"), which stopped using that list and
therefore the 'trans_list' member is not used anymore since that commit.
So just remove it since it's doing nothing and making each ordered extent
structure waste memory (2 pointers).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The 'log_list' member of an ordered extent was used keep track of which
ordered extents we needed to wait after logging metadata, but is not used
anymore since commit 5636cf7d6d ("btrfs: remove the logged extents
infrastructure"), as we now always wait on ordered extent completion
before logging metadata. So just remove it since it's doing nothing and
making each ordered extent structure waste more memory (2 pointers).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The CPU and on-disk keys are mapped to two different structures because
of the endianness. There's an intermediate buffer used to do the
conversion, but this is not necessary when CPU and on-disk endianness
match.
Add optimized versions of helpers that take disk_key and use the buffer
directly for CPU keys or drop the intermediate buffer and conversion.
This saves a lot of stack space accross many functions and removes about
6K of generated binary code:
text data bss dec hex filename
1090439 17468 14912 1122819 112203 pre/btrfs.ko
1084613 17456 14912 1116981 110b35 post/btrfs.ko
Delta: -5826
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before this patch, qgroup completely relies on per-inode extent io tree
to detect reserved data space leak.
However previous bug has already shown how release page before
btrfs_finish_ordered_io() could lead to leak, and since it's
QGROUP_RESERVED bit cleared without triggering qgroup rsv, it can't be
detected by per-inode extent io tree.
So this patch adds another (and hopefully the final) safety net to catch
qgroup data reserved space leak. At least the new safety net catches
all the leaks during development, so it should be pretty useful in the
real world.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following simple workload from fsstress can lead to qgroup reserved
data space leak:
0/0: creat f0 x:0 0 0
0/0: creat add id=0,parent=-1
0/1: write f0[259 1 0 0 0 0] [600030,27288] 0
0/4: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 64 627318] return 25, fallback to stat()
0/4: dwrite f0[259 1 0 0 64 627318] [610304,106496] 0
This would cause btrfs qgroup to leak 20480 bytes for data reserved
space. If btrfs qgroup limit is enabled, such leak can lead to
unexpected early EDQUOT and unusable space.
[CAUSE]
When doing direct IO, kernel will try to writeback existing buffered
page cache, then invalidate them:
generic_file_direct_write()
|- filemap_write_and_wait_range();
|- invalidate_inode_pages2_range();
However for btrfs, the bi_end_io hook doesn't finish all its heavy work
right after bio ends. In fact, it delays its work further:
submit_extent_page(end_io_func=end_bio_extent_writepage);
end_bio_extent_writepage()
|- btrfs_writepage_endio_finish_ordered()
|- btrfs_init_work(finish_ordered_fn);
<<< Work queue execution >>>
finish_ordered_fn()
|- btrfs_finish_ordered_io();
|- Clear qgroup bits
This means, when filemap_write_and_wait_range() returns,
btrfs_finish_ordered_io() is not guaranteed to be executed, thus the
qgroup bits for related range are not cleared.
Now into how the leak happens, this will only focus on the overlapping
part of buffered and direct IO part.
1. After buffered write
The inode had the following range with QGROUP_RESERVED bit:
596 616K
|///////////////|
Qgroup reserved data space: 20K
2. Writeback part for range [596K, 616K)
Write back finished, but btrfs_finish_ordered_io() not get called
yet.
So we still have:
596K 616K
|///////////////|
Qgroup reserved data space: 20K
3. Pages for range [596K, 616K) get released
This will clear all qgroup bits, but don't update the reserved data
space.
So we have:
596K 616K
| |
Qgroup reserved data space: 20K
That number doesn't match the qgroup bit range anymore.
4. Dio prepare space for range [596K, 700K)
Qgroup reserved data space for that range, we got:
596K 616K 700K
|///////////////|///////////////////////|
Qgroup reserved data space: 20K + 104K = 124K
5. btrfs_finish_ordered_range() gets executed for range [596K, 616K)
Qgroup free reserved space for that range, we got:
596K 616K 700K
| |///////////////////////|
We need to free that range of reserved space.
Qgroup reserved data space: 124K - 20K = 104K
6. btrfs_finish_ordered_range() gets executed for range [596K, 700K)
However qgroup bit for range [596K, 616K) is already cleared in
previous step, so we only free 84K for qgroup reserved space.
596K 616K 700K
| | |
We need to free that range of reserved space.
Qgroup reserved data space: 104K - 84K = 20K
Now there is no way to release that 20K unless disabling qgroup or
unmounting the fs.
[FIX]
This patch will change the timing of btrfs_qgroup_release/free_data()
call. Here it uses buffered COW write as an example.
The new timing | The old timing
----------------------------------------+---------------------------------------
btrfs_buffered_write() | btrfs_buffered_write()
|- btrfs_qgroup_reserve_data() | |- btrfs_qgroup_reserve_data()
|
btrfs_run_delalloc_range() | btrfs_run_delalloc_range()
|- btrfs_add_ordered_extent() |
|- btrfs_qgroup_release_data() |
The reserved is passed into |
btrfs_ordered_extent structure |
|
btrfs_finish_ordered_io() | btrfs_finish_ordered_io()
|- The reserved space is passed to | |- btrfs_qgroup_release_data()
btrfs_qgroup_record | The resereved space is passed
| to btrfs_qgroup_recrod
|
btrfs_qgroup_account_extents() | btrfs_qgroup_account_extents()
|- btrfs_qgroup_free_refroot() | |- btrfs_qgroup_free_refroot()
The point of such change is to ensure, when ordered extents are
submitted, the qgroup reserved space is already released, to keep the
timing aligned with file_write_and_wait_range().
So that qgroup data reserved space is all bound to btrfs_ordered_extent
and solve the timing mismatch.
Fixes: f695fdcef8 ("btrfs: qgroup: Introduce functions to release/free qgroup reserve data space")
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The incoming qgroup reserved space timing will move the data reservation
to ordered extent completely.
However in btrfs_punch_hole_lock_range() will call
btrfs_invalidate_page(), which will clear QGROUP_RESERVED bit for the
range.
In current stage it's OK, but if we're making ordered extents handle the
reserved space, then btrfs_punch_hole_lock_range() can clear the
QGROUP_RESERVED bit before we submit ordered extent, leading to qgroup
reserved space leakage.
So here change the timing to make reserve data space after
btrfs_punch_hole_lock_range().
The new timing is fine for either current code or the new code.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is to prepare for the incoming timing change of qgroup reserved
data space and ordered extent.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function insert_reserved_file_extent() takes a long list of parameters,
which are all for btrfs_file_extent_item, even including two reserved
members, encryption and other_encoding.
This makes the parameter list unnecessary long for a function which only
gets called twice.
This patch will refactor the parameter list, by using
btrfs_file_extent_item as parameter directly to hugely reduce the number
of parameters.
Also, since there are only two callers, one in btrfs_finish_ordered_io()
which inserts file extent for ordered extent, and one
__btrfs_prealloc_file_range().
These two call sites have completely different context, where ordered
extent can be compressed, but will always be regular extent, while the
preallocated one is never going to be compressed and always has PREALLOC
type.
So use two small wrapper for these two different call sites to improve
readability.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add proper variable for the scrub page and use it instead of repeatedly
dereferencing the other structures.
Signed-off-by: David Sterba <dsterba@suse.com>
Use a simpler iteration over tree block pages, same what csum_tree_block
does: first page always exists, loop over the rest.
Signed-off-by: David Sterba <dsterba@suse.com>
Add proper variable for the scrub page and use it instead of repeatedly
dereferencing the other structures.
Signed-off-by: David Sterba <dsterba@suse.com>
Add proper variable for the scrub page and use it instead of repeatedly
dereferencing the other structures.
Signed-off-by: David Sterba <dsterba@suse.com>
The page contents with the checksum is available during the entire
function so we don't need to make a copy.
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS_SUPER_INFO_SIZE is 4096, and fits to a page on all supported
architectures, so we can calculate the checksum in one go.
Signed-off-by: David Sterba <dsterba@suse.com>
As the page mapping has been removed, rename the variables to 'kaddr'
that we use everywhere else. The type is changed to 'char *' so pointer
arithmetic works without casts.
Signed-off-by: David Sterba <dsterba@suse.com>
All pages that scrub uses in the scrub_block::pagev array are allocated
with GFP_KERNEL and never part of any mapping, so kmap is not necessary,
we only need to know the page address.
In scrub_write_page_to_dev_replace we don't even need to call
flush_dcache_page because of the same reason as above.
Signed-off-by: David Sterba <dsterba@suse.com>
This patch introduces a new "rescue=" mount option group for all mount
options for data recovery.
Different rescue sub options are seperated by ':'. E.g
"ro,rescue=nologreplay:usebackuproot".
The original plan was to use ';', but ';' needs to be escaped/quoted,
or it will be interpreted by bash, similar to '|'.
And obviously, user can specify rescue options one by one like:
"ro,rescue=nologreplay,rescue=usebackuproot".
The following mount options are converted to "rescue=", old mount
options are deprecated but still available for compatibility purpose:
- usebackuproot
Now it's "rescue=usebackuproot"
- nologreplay
Now it's "rescue=nologreplay"
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently use btrfs_check_data_free_space() when allocating space for
relocating data extents, but that is not necessary because that function
combines btrfs_alloc_data_chunk_ondemand(), which does the actual space
reservation, and btrfs_qgroup_reserve_data().
We can use btrfs_alloc_data_chunk_ondemand() directly because we know we
do not need to reserve qgroup space since we are dealing with a relocation
tree, which can never have qgroups (btrfs_qgroup_reserve_data() does
nothing as is_fstree() returns false for a relocation tree).
Conversely we can use btrfs_free_reserved_data_space_noquota() directly
instead of btrfs_free_reserved_data_space(), since we had no qgroup
reservation when allocating space.
This change is preparatory work for another patch in this series that
makes relocation reserve the exact amount of space it needs to relocate
a data block group. The function btrfs_check_data_free_space() has
the incovenient of requiring a start offset argument and we will want to
be able to allocate space for multiple ranges, which are not consecutive,
at once.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The start argument for btrfs_free_reserved_data_space_noquota() is only
used to make sure the amount of bytes we decrement from the bytes_may_use
counter of the data space_info object is aligned to the filesystem's
sector size. It serves no other purpose.
All its current callers always pass a length argument that is already
aligned to the sector size, so we can make the start argument go away.
In fact its presence makes it impossible to use it in a context where we
just want to free a number of bytes for a range for which either we do
not know its start offset or for freeing multiple ranges at once (which
are not contiguous).
This change is preparatory work for a patch (third patch in this series)
that makes relocation of data block groups that are not full reserve less
data space.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As there is a dump_stack() done on memory allocation failures, these
messages might as well be deleted instead.
Signed-off-by: Liao Pingfang <liao.pingfang@zte.com.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor tweaks ]
Signed-off-by: David Sterba <dsterba@suse.com>
Use the helper function where it is open coded to increment the
block_group reference count As btrfs_get_block_group() is a one-liner we
could have open-coded it, but its partner function
btrfs_put_block_group() isn't one-liner which does the free part in it.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_return_cluster_to_free_space() returns only 0. And all its
parent functions don't need the return value either so make this a void
function.
Further, as none of the callers of btrfs_return_cluster_to_free_space()
is actually using the return from this function, make this function also
return void.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Initially when the 'removed' flag was added to a block group to avoid
races between block group removal and fitrim, by commit 04216820fe
("Btrfs: fix race between fs trimming and block group remove/allocation"),
we had to lock the chunks mutex because we could be moving the block
group from its current list, the pending chunks list, into the pinned
chunks list, or we could just be adding it to the pinned chunks if it was
not in the pending chunks list. Both lists were protected by the chunk
mutex.
However we no longer have those lists since commit 1c11b63eff
("btrfs: replace pending/pinned chunks lists with io tree"), and locking
the chunk mutex is no longer necessary because of that. The same happens
at btrfs_unfreeze_block_group(), we lock the chunk mutex because the block
group's extent map could be part of the pinned chunks list and the call
to remove_extent_mapping() could be deleting it from that list, which
used to be protected by that mutex.
So just remove those lock and unlock calls as they are not needed anymore.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When find_first_block_group() finds a block group item in the extent-tree,
it does a lookup of the object in the extent mapping tree and does further
checks on the item.
Factor out this step from find_first_block_group() so we can further
simplify the code.
While we're at it, we can also just return early in
find_first_block_group(), if the tree slot isn't found.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already have an fs_info in our function parameters, there's no need
to do the maths again and get fs_info from the extent_root just to get
the mapping_tree.
Instead directly grab the mapping_tree from fs_info.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Adresses held in 'logical' array are always guaranteed to fall within
the boundaries of the block group. That is, 'start' can never be
smaller than cache->start. This invariant follows from the way the
address are calculated in btrfs_rmap_block:
stripe_nr = physical - map->stripes[i].physical;
stripe_nr = div64_u64(stripe_nr, map->stripe_len);
bytenr = chunk_start + stripe_nr * io_stripe_size;
I.e it's always some IO stripe within the given chunk.
Exploit this invariant to simplify the body of the loop by removing the
unnecessary 'if' since its 'else' part is the one always executed.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_map::orig_block_len contains the size of a physical stripe when
it's used to describe block groups (calculated in read_one_chunk via
calc_stripe_length or calculated in decide_stripe_size and then assigned
to extent_map::orig_block_len in create_chunk). Exploit this fact to get
the size directly rather than opencoding the calculations. No functional
changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The call to btrfs_btree_balance_dirty has been there since the early
days of BTRFS, when the btree was directly modified from the write path,
hence dirtied btree inode pages. With the implementation of b888db2bd7
("Btrfs: Add delayed allocation to the extent based page tree code")
13 years ago the btree is no longer modified from the write path, hence
there is no point in calling this function. Just remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, we allocate temp pages which is used to pad hole in
cluster during read IO submission, it may take long time before
releasing them in f2fs_decompress_pages(), since they are only
used as temp output buffer in decompression context, so let's
just do the allocation in that context to reduce time of memory
pool resource occupation.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We missed to update isize of compressed file in write_end() with
below case:
cluster size is 16KB
- write 14KB data from offset 0
- overwrite 16KB data from offset 0
Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Just for code style, no logic change
1. delete useless space
2. change spaces into tab
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch refactored target bpf_iter_init_seq_priv_t callback
function to accept additional information. This will be needed
in later patches for map element targets since a particular
map should be passed to traverse elements for that particular
map. In the future, other information may be passed to target
as well, e.g., pid, cgroup id, etc. to customize the iterator.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184110.590156-1-yhs@fb.com
The UDP reuseport conflict was a little bit tricky.
The net-next code, via bpf-next, extracted the reuseport handling
into a helper so that the BPF sk lookup code could invoke it.
At the same time, the logic for reuseport handling of unconnected
sockets changed via commit efc6b6f6c3
which changed the logic to carry on the reuseport result into the
rest of the lookup loop if we do not return immediately.
This requires moving the reuseport_has_conns() logic into the callers.
While we are here, get rid of inline directives as they do not belong
in foo.c files.
The other changes were cases of more straightforward overlapping
modifications.
Signed-off-by: David S. Miller <davem@davemloft.net>
- Fix the layering violation in the use of the EFI runtime services
availability mask in users of the 'efivars' abstraction
- Revert build fix for GCC v4.8 which is no longer supported
- Clean up some x86 EFI stub details, some of which are borderline bugs
that copy around garbage into padding fields - let's fix these
out of caution.
- Fix build issues while working on RISC-V support
- Avoid --whole-archive when linking the stub on arm64
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8cCIkRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hGsxAAsPwL5ghYykViUQId0OjNmI7eRgISKRso
3n3BaFmgYMp9eq1gndzPp7A5Ty0NPsr8FiwuZ7setk9YoLuTUT1MHVtAzd6xkxlR
838CwTDvW5HvB69uxPQDHA/1mcH1smH4Iew/J7QXP+o6Zrg+BWOjKNtTiKFwawNC
m4Tkdvvq52wzykqbeuRhXxLetKFOH//R1V0s4M6nNySuY6gQGJQ2LPyaiMN6eq1V
LE8+wGcNlIRUOeC8RkEA7CE9g92jkGZ07uJDA09OP0J5WBNWLcxMJM2mBZ1Ho6uc
sNNOeTy76sjuXQvUBWCnjBr3/qqXnVGSuIq8NS1hS4L7HagdQ/peqFJRzvWBHT90
b9wUZv09ioFm9lb6/P6NL16sn/WPknCD7umxpfp5HKrlL2p7puvsvDXtuWgyhhPG
M+X9ZX1iaA54cA9bU6cXFzrNMw/DnYjHFsECF915EXjItNZToGs0rd1lf7ArgH2P
+3HgvLods73ufObKH2pUfY7EU2Ly1oJsNpK3RmpoOUzehW+++S80KdytkaAB10kT
dKp9LlTj6gC+lnC45J9NtpHlCodz/Rc0lBHQpxIqlk/p9grUuH4zn714Ii/FfNQg
i29GX9cCdgv+4KzmJCNTqyZvGFZ5m3K8f41x7iD+/ygqPLsP9PfV5RGRGau+FfL8
ezInhKdITqM=
=n3nF
-----END PGP SIGNATURE-----
Merge tag 'efi-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master
Pull EFI fixes from Ingo Molnar:
"Various EFI fixes:
- Fix the layering violation in the use of the EFI runtime services
availability mask in users of the 'efivars' abstraction
- Revert build fix for GCC v4.8 which is no longer supported
- Clean up some x86 EFI stub details, some of which are borderline
bugs that copy around garbage into padding fields - let's fix these
out of caution.
- Fix build issues while working on RISC-V support
- Avoid --whole-archive when linking the stub on arm64"
* tag 'efi-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: Revert "efi/x86: Fix build with gcc 4"
efi/efivars: Expose RT service availability via efivars abstraction
efi/libstub: Move the function prototypes to header file
efi/libstub: Fix gcc error around __umoddi3 for 32 bit builds
efi/libstub/arm64: link stub lib.a conditionally
efi/x86: Only copy upto the end of setup_header
efi/x86: Remove unused variables
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl8bmiQACgkQiiy9cAdy
T1EfFwv+IeD4kNshoBF5NT//+7j+9n+R8z74v5bJUEmoabZMuCaWuYl/uUcBqIBH
l29Xd60aF8V2jd3xIt/SEPNFOAvvBSQIjn8X52Lkquo+QzgiOElSUL8aNZe17T9L
N1S+nhwk2udlveBtWZRxHT50WNdKQCGKEwUDn9ouUbuKxLrcwWc1gRCxkegmJXey
NRaynzdCRYFx7uLdctvesTDB4P9MJn+5vX7L1BdI7MxXzsSEI1yHpM5OBL9e13ki
swZ7P16qyCA6TK8EYuYLCeXKEK0X+wGeu/y8JZ7RjO9ozqrvECvguumNbcUCs+zf
vLWQkaeC2M1L5L4WV71sY5Pi7CUGw9WGaU/FBJUK65+hYqUWETt351BwbIRwVlrx
mX7C7Dh50AM84/54u169Sk1p/S8vjdvkmx8YyRWK+t7kmZx6TP7XKXvYN1vEHS12
Y5ceRKmNHLdRqS3nY6FK7TKdgVWK2hgNODDdVbUZ+iPoJztSmH7j2RkTzxJz6qBi
vzBAPJiG
=Cvv4
-----END PGP SIGNATURE-----
Merge tag '5.8-rc6-cifs-fix' of git://git.samba.org/sfrench/cifs-2.6 into master
Pull cifs fix from Steve French:
"A fix for a recently discovered regression in rename to older servers
caused by a recent patch"
* tag '5.8-rc6-cifs-fix' of git://git.samba.org/sfrench/cifs-2.6:
Revert "cifs: Fix the target file was deleted when rename failed."
Linked requests are hashed, remove a comment stating otherwise. Also
move hash bits to emphasise that we don't carry it through loop
iteration and set it every time.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Whoever called io_prep_linked_timeout() should also do
io_queue_linked_timeout(). __io_queue_sqe() doesn't follow that for the
punting path leaving linked timeouts prepared but never queued.
Fixes: 6df1db6b54 ("io_uring: fix mis-refcounting linked timeouts")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove REQ_F_WORK_INITIALIZED after io_req_clean_work(). That's a cold
path but is safer for those using io_req_clean_work() out of
*dismantle_req()/*io_free(). And for the same reason zero work.fs
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
/proc/fs/nfsd/client/../state at the wrong moment.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl8bWF4VHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+CL4P/1z2k9OPSTxNpcTv/+5+DCg5gEVp
raF3zumeMc8b0gvSRzrMLpYWRZBO/ZcDrH10WFPvySQxQGhMzPNDYsOvRcGabUOs
SDMV7RCQe7fPNlRSZejdTLnWD1ftHgNLXpBKSXPR0GbmML4BRoKwHjTM4rrr3nox
bluQHFyAHMwrNSjjRufi5qi14ThyW5qIanSEpV99qS+aeFwx0U8FL4f4eyFMA1Pq
sV/H2k9N1v/xoPOqQWF20EOZqI+AMxexMY2EQBm7LS3kv5IReSV6WTM4bcTNDl/7
jtFA3A9pIngwfh2d8BOEC12BzPmw1F1vo2kbBjvPzeo1EgtD+EN8RjJ9MP9Lwz3v
kOL09hkJrvtM24+g+VOzdwGEknxm09vmps3vyhth5EvLSLNzJwGfY7HisRxIOrmT
HZBzFNM/wX6RkzsIyrx+AjoqwmOYiz/diElE1oLGLVFrp3IVhk9Grv0IdNlWl8U5
ImtlewJY9vF6+HAsOToXZaPJGS9PBuLWwGj0zrXn5Gr+gNqZfFGLUBfz3sTVxku5
5xa34kECCV7ECGwtHGQ+X0Lp/1jhbAUxdTW0ZI1rJiWtdmx9WJttlmdxmX3dASso
PmBGYCVynMDIaGrnnQ5kig7X9aUy1GF+P9OnZ+tE5FkFdHgG3i4kjOwyOGjtTTen
YmsqNI4KBwV1wBZf
=lzOL
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.8-2' of git://linux-nfs.org/~bfields/linux into master
Pull nfsd fix from Bruce Fields:
"Just one fix for a NULL dereference if someone happens to read
/proc/fs/nfsd/client/../state at the wrong moment"
* tag 'nfsd-5.8-2' of git://linux-nfs.org/~bfields/linux:
nfsd4: fix NULL dereference in nfsd/clients display code
Drop the repeated word "the" in a comment.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Merge misc fixes from Andrew Morton:
"Subsystems affected by this patch series: mm/pagemap, mm/shmem,
mm/hotfixes, mm/memcg, mm/hugetlb, mailmap, squashfs, scripts,
io-mapping, MAINTAINERS, and gdb"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
scripts/gdb: fix lx-symbols 'gdb.error' while loading modules
MAINTAINERS: add KCOV section
io-mapping: indicate mapping failure
scripts/decode_stacktrace: strip basepath from all paths
squashfs: fix length field overlap check in metadata reading
mailmap: add entry for Mike Rapoport
khugepaged: fix null-pointer dereference due to race
mm/hugetlb: avoid hardcoding while checking if cma is enabled
mm: memcg/slab: fix memory leak at non-root kmem_cache destroy
mm/memcg: fix refcount error while moving and swapping
mm/memcontrol: fix OOPS inside mem_cgroup_get_nr_swap_pages()
mm: initialize return of vm_insert_pages
vfs/xattr: mm/shmem: kernfs: release simple xattr entry in a right way
mm/mmap.c: close race between munmap() and expand_upwards()/downwards()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl8auzgACgkQxWXV+ddt
WDv0CRAAooFO+hloV+br40eEfJwZJJk+iIvc3tyq3TRUrmt1D0G4F7nUtiHjb8JU
ch2HK+GNZkIK4747OCgcFREpYZV2m0hrKybzf/j4mYb7OXzHmeHTMfGVut1g80e7
dlpvP7q4VZbBP8BTo/8wqdSAdCUiNhLFy5oYzyUwyflJ5S8FpjY+3dXIRHUnhxPU
lxMANWhX9y/qQEceGvxqwqJBiYT6WI7dwONiULc1klWDIug/2BGZQR0WuC5PVr0G
YNuxcEU6rluWzKWJ5k3104t+N1Nc5+xglIgBLeLKAyTVYq8zAMf+P8bBPnQ3QDkV
zniNIH9ND8tYSjmGkmO0ltExFrE2o9NRnjapOFXfB0WGXee5LfzFfzd5Hk9YV+Ua
bs98VNGR4B12Iw++DvrbhbFAMxBHiBfAX/O44xJ81uAYVUs21OfefxHWrLzTJK+1
xYfiyfCDxZDGpC/weg9GOPcIZAzzoSAvqDqWHyWY5cCZdB60RaelGJprdG5fP/gA
Y+hDIdutVXMHfhaX0ktWsDvhPRXcC7MT0bjasljkN5WUJ/xZZQr6QmgngY+FA8G/
0n/dv0pYdOTK/8YVZAMO+VklzrDhziqzc2sBrH1k3MA9asa/Ls5v+r2PU+qBKZJm
cBJGtxxsx72CHbkIhtd5oGj5LNTXFdXeHph37ErzW3ajeamO4X0=
=51h/
-----END PGP SIGNATURE-----
Merge tag 'for-5.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into master
Pull btrfs fixes from David Sterba:
"A few resouce leak fixes from recent patches, all are stable material.
The problems have been observed during testing or have a reproducer"
* tag 'for-5.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix mount failure caused by race with umount
btrfs: fix page leaks after failure to lock page for delalloc
btrfs: qgroup: fix data leak caused by race between writeback and truncate
btrfs: fix double free on ulist after backref resolution failure
Two fixes, the first one to remove compilation warnings and the second
to avoid potentially inefficient allocation of BIOs for direct writes
into sequential zones.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCXxqPOwAKCRDdoc3SxdoY
dkUzAQCp8p1ijemK+t2pN35dz+J9TG2idX0iUkZA5dAUkwsZmQD+NT/U52WLTqCH
eKc72BZTfdMhTK/Sk6fnUtFzVXvGhgM=
=RERh
-----END PGP SIGNATURE-----
Merge tag 'zonefs-5.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs into master
Pull zonefs fixes from Damien Le Moal:
"Two fixes, the first one to remove compilation warnings and the second
to avoid potentially inefficient allocation of BIOs for direct writes
into sequential zones"
* tag 'zonefs-5.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: count pages after truncating the iterator
zonefs: Fix compilation warning
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8bEJwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgCOEADPQhFAkQvKfaAKBXriQTFluDbET1PjAUNX
oa1ZRjP67O/Oxb4bEW6V6NRlteno1vqx0PD+DQfvnzlSly0WM0LUzQXLz4KDcAGW
55EdoZq3ev50bQOPkDzeqDmER0NEt5S+haVv3dvdEfqQaN9GBVQvEzk3elI5/+Sn
PO9hFhH8u3I16HgCFiQu4E4q+zK2D5j+rr82/wSsQ8wtdjzVHfaoBBTwCFaNq3r1
nGYyRY9rM/iq61l0bNKF0fqlx5sNGqHzqrxEaVsc7xbgcbn5ivDXReQ4fm3BsuUF
uabItKEyZjyr6dB0N5Eq24+S9S9i5V+9nhuZtRgjivANll/goVAVVZxqbvwrdB3q
w0SBDzOMu8LAlEYaNjnheQ+xbzedrI564MDBpyLdf4yuJmnIEmk6TR1fVtgmB+aa
AW1vplw+uBM25rkjltzdVFGpzBVVb478GgJkVbtzNaat7jUamg46s3qq02ReZXVi
W/ga9JZ87zIVE5/ClynUcGQoLSmeJRmQnnvVZjMAgw2lzE2i9xG5RwBU5cOpDKal
RwbnxoRG6FkLMXs0mAIXBsP5EvNuOSyItyCyk/LYqbjYjDTHvnYA/UFvE9MiJX+3
2S4Mt3aHzf+rBX5NDOrRBq0Ri4KF0AgXv/xNSYfrlDV5iB5NQMTv5ZQLaPoBBu/I
mph+EGVssQ==
=TBlP
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.8-2020-07-24' of git://git.kernel.dk/linux-block into master
Pull io_uring fixes from Jens Axboe:
- Fix discrepancy in how sqe->flags are treated for a few requests,
this makes it consistent (Daniele)
- Ensure that poll driven retry works with double waitqueue poll users
- Fix a missing io_req_init_async() (Pavel)
* tag 'io_uring-5.8-2020-07-24' of git://git.kernel.dk/linux-block:
io_uring: missed req_init_async() for IOSQE_ASYNC
io_uring: always allow drain/link/hardlink/async sqe flags
io_uring: ensure double poll additions work with both request types
This is a regression introduced by the "migrate from ll_rw_block usage
to BIO" patch.
Squashfs packs structures on byte boundaries, and due to that the length
field (of the metadata block) may not be fully in the current block.
The new code rewrote and introduced a faulty check for that edge case.
Fixes: 93e72b3c61 ("squashfs: migrate from ll_rw_block usage to BIO")
Reported-by: Bernd Amend <bernd.amend@gmail.com>
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Adrien Schildknecht <adrien+dev@schischi.me>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Daniel Rosenberg <drosen@google.com>
Link: http://lkml.kernel.org/r/20200717195536.16069-1-phillip@squashfs.org.uk
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move io_req_init_async() into io_grab_files(), it's safer this way. Note
that io_queue_async_work() does *init_async(), so it's valid to move out
of __io_queue_sqe() punt path. Also, add a helper around io_grab_files().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Calling into opcode prep handlers may be dangerous, as they re-read
SQE but might not re-initialise requests completely. If io_req_defer()
passed fast checks and is done with preparations, punt it async.
As all other cases are covered with nulling @sqe, this guarantees that
io_[opcode]_prep() are visited only once per request.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In io_sq_thread(), if there are task works to handle, current codes
will skip schedule() and go on polling sq again, but forget to clear
IORING_SQ_NEED_WAKEUP flag, fix this issue. Also add two helpers to
set and clear IORING_SQ_NEED_WAKEUP flag,
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As every iopoll request have a task ref, it becomes expensive to put
them one by one, instead we can put several at once integrating that
into io_req_free_batch().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Locked and pinned memory accounting in io_{,un}account_mem() depends on
having ->sqo_mm, which is NULL after a recent change for non SQPOLL'ed
io_ring. That disables the accounting.
Return ->sqo_mm initialisation back, and do __io_sq_thread_acquire_mm()
based on IORING_SETUP_SQPOLL flag.
Fixes: 8eb06d7e8d ("io_uring: fix missing ->mm on exit")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_sqe_buffer_unregister() uses cxt->sqo_mm for memory accounting, but
io_ring_ctx_free() drops ->sqo_mm before leaving pinned_vm
over-accounted. Postpone mm cleanup for when it's not needed anymore.
Fixes: 309758254e ("io_uring: report pinned memory usage")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't implement fast path of kbuf freeing and management inlined into
io_recv{,msg}(), that's error prone and duplicates handling. Replace it
with a helper io_put_recv_kbuf(), which mimics io_put_rw_kbuf() in the
io_read/write().
This also keeps cflags calculation in one place, removing duplication
between rw and recv/send.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Extract a common helper for cleaning up a selected buffer, this will be
used shortly. By the way, correct cflags types to unsigned and, as kbufs
are anyway tracked by a flag, remove useless zeroing req->rw.addr.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move REQ_F_BUFFER_SELECT flag check out of io_recv_buffer_select(), and
do that in its call sites That saves us from double error checking and
possibly an extra function call.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_clean_op() may be skipped even if there is a selected io_buffer,
that's because *select_buffer() funcions never set REQ_F_NEED_CLEANUP.
Trigger io_clean_op() when REQ_F_BUFFER_SELECTED is set as well, and
and clear the flag if was freed out of it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of returning error from io_recv(), go through generic cleanup
path, because it'll retain cflags for userspace. Do the same for
io_send() for consistency.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the return on a bad socket, kmsg is always non-null by the end
of the function, prune left extra checks and initialisations.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Flip over "if (sock)" condition with return on error, the upper layer
will take care. That change will be handy later, but already removes
an extra jump from hot path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, file refs in struct io_submit_state are tracked with 2 vars:
@has_refs -- how many refs were initially taken
@used_refs -- number of refs used
Replace it with a single variable counting how many refs left at the
current moment.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
RLIMIT_SIZE in needed only for execution from an io-wq context, hence
move all preparations from hot path to io-wq work setup.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Every call to io_req_defer_prep() is prepended with allocating ->io,
just do that in the function. And while we're at it, mark error paths
with unlikey and replace "if (ret < 0)" with "if (ret)".
There is only one change in the observable behaviour, that's instead of
killing the head request right away on error, it postpones it until the
link is assembled, that looks more preferable.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A switch in __io_clean_op() doesn't have default, it's pointless to list
opcodes that doesn't do any cleanup. Remove IORING_OP_OPEN* from there.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The only caller of io_req_work_grab_env() is io_prep_async_work(), and
they are both initialising req->work. Inline grab_env(), it's easier
to keep this way, moreover there already were bugs with misplacing
io_req_init_async().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->cflags is used only for defer-completion path, just use completion
data to store it. With the 4 bytes from the ->sequence patch and
compacting io_kiocb, this frees 8 bytes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->sequence is used only for deferred (i.e. DRAIN) requests, but
initialised for every request. Remove req->sequence from io_kiocb
together with its initialisation in io_init_req().
Replace it with a new field in struct io_defer_entry, that will be
calculated only when needed in io_req_defer(), which is a slow path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The only left user of req->list is DRAIN, hence instead of keeping a
separate per request list for it, do that with old fashion non-intrusive
lists allocated on demand. That's a really slow path, so that's OK.
This removes req->list and so sheds 16 bytes from io_kiocb.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of using shared req->list, hang timeouts up on their own list
entry. struct io_timeout have enough extra space for it, but if that
will be a problem ->inflight_entry can reused for that.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As with the completion path, also use compl.list for overflowed
requests. If cleaned up properly, nobody needs per-op data there
anymore.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->inflight_entry is used to track requests that grabbed files_struct.
Let's share it with iopoll list, because the only iopoll'ed ops are
reads and writes, which don't need a file table.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It supports both polling and I/O polling. Rename ctx->poll to clearly
show that it's only in I/O poll case.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Calling io_req_complete(req) means that the request is done, and there
is nothing left but to clean it up. That also means that per-op data
after that should not be used, so we're free to reuse it in completion
path, e.g. to store overflow_list as done in this patch.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As for import_iovec(), return !=NULL iovec from io_import_iovec() only
when it should be freed. That includes returning NULL when iovec is
already in req->io, because it should be deallocated by other means,
e.g. inside op handler. After io_setup_async_rw() local iovec to ->io,
just mark it NULL, to follow the idea in io_{read,write} as well.
That's easier to follow, and especially useful if we want to reuse
per-op space for completion data.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: only call kfree() on non-NULL pointer]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Preparing reads/writes for async is a bit tricky. Extract a helper to
not repeat it twice.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't deref req->io->rw every time, but put it in a local variable. This
looks prettier, generates less instructions, and doesn't break alias
analysis.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_kiocb::task_work was de-unionised, and is not planned to be shared
back, because it's too useful and commonly used. Hence, instead of
keeping a separate task_work in struct io_async_rw just reuse
req->task_work.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
send/recv msghdr initialisation works with struct io_async_msghdr, but
pulls the whole struct io_async_ctx for no reason. That complicates it
with composite accessing, e.g. io->msg.
Use and pass the most specific type, which is struct io_async_msghdr.
It is the larget field in union io_async_ctx and doesn't save stack
space, but looks clearer.
The most of the changes are replacing "io->msg." with "iomsg->"
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Every second field in send/recv is called msg, make it a bit more
understandable by renaming ->msg, which is a user provided ptr,
to ->umsg.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
rings_size() sets sq_offset to the total size of the rings (the returned
value which is used for memory allocation). This is wrong: sq array should
be located within the rings, not after them. Set sq_offset to where it
should be.
Fixes: 75b28affdd ("io_uring: allocate the two rings together")
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Hristo Venev <hristo@venev.name>
Cc: io-uring@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge in io_uring-5.8 fixes, as changes/cleanups to how we do locked
mem accounting require a fixup, and only one of the spots are noticed
by git as the other merges cleanly. The flags fix from io_uring-5.8
also causes a merge conflict, the leak fix for recvmsg, the double poll
fix, and the link failure locking fix.
* io_uring-5.8:
io_uring: fix lockup in io_fail_links()
io_uring: fix ->work corruption with poll_add
io_uring: missed req_init_async() for IOSQE_ASYNC
io_uring: always allow drain/link/hardlink/async sqe flags
io_uring: ensure double poll additions work with both request types
io_uring: fix recvmsg memory leak with buffer selection
io_uring: fix not initialised work->flags
io_uring: fix missing msg_name assignment
io_uring: account user memory freed when exit has been queued
io_uring: fix memleak in io_sqe_files_register()
io_uring: fix memleak in __io_sqe_files_update()
io_uring: export cq overflow status to userspace
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->work might be already initialised by the time it gets into
__io_arm_poll_handler(), which will corrupt it by using fields that are
in an union with req->work. Luckily, the only side effect is missing
put_creds(). Clean req->work before going there.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
During umount, f2fs_put_super() unregisters procfs entries after
f2fs_destroy_segment_manager(), it may cause use-after-free
issue when umount races with procfs accessing, fix it by relocating
f2fs_unregister_sysfs().
[Chao Yu: change commit title/message a bit]
Signed-off-by: Li Guifu <bluce.liguifu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The return value of f2fs_flush_inline_data() is not used,
so delete it.
Signed-off-by: Jia Yang <jiayang5@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This reverts commit 9ffad9263b.
Upon additional testing with older servers, it was found that
the original commit introduced a regression when using the old SMB1
dialect and rsyncing over an existing file.
The patch will need to be respun to address this, likely including
a larger refactoring of the SMB1 and SMB3 rename code paths to make
it less confusing and also to address some additional rename error
cases that SMB3 may be able to workaround.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reported-by: Patrick Fernie <patrick.fernie@gmail.com>
CC: Stable <stable@vger.kernel.org>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Acked-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
IOSQE_ASYNC branch of io_queue_sqe() is another place where an
unitialised req->work can be accessed (i.e. prior io_req_init_async()).
Nothing really bad though, it just looses IO_WQ_WORK_CONCURRENT flag.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since debugfs include sensitive information it need to be treated
carefully. But it also has many very useful debug functions for userspace.
With this option we can have same configuration for system with
need of debugfs and a way to turn it off. This gives a extra protection
for exposure on systems where user-space services with system
access are attacked.
It is controlled by a configurable default value that can be override
with a kernel command line parameter. (debugfs=)
It can be on or off, but also internally on but not seen from user-space.
This no-mount mode do not register a debugfs as filesystem, but client can
register their parts in the internal structures. This data can be readed
with a debugger or saved with a crashkernel. When it is off clients
get EPERM error when accessing the functions for registering their
components.
Signed-off-by: Peter Enderborg <peter.enderborg@sony.com>
Link: https://lore.kernel.org/r/20200716071511.26864-3-peter.enderborg@sony.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We hold the cl_lock here, and that's enough to keep stateid's from going
away, but it's not enough to prevent the files they point to from going
away. Take fi_lock and a reference and check for NULL, as we do in
other code.
Reported-by: NeilBrown <neilb@suse.de>
Fixes: 78599c42ae ("nfsd4: add file to display list of client's opens")
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Normally smp_store_release() or cmpxchg_release() is paired with
smp_load_acquire(). Sometimes smp_load_acquire() can be replaced with
the more lightweight READ_ONCE(). However, for this to be safe, all the
published memory must only be accessed in a way that involves the
pointer itself. This may not be the case if allocating the object also
involves initializing a static or global variable, for example.
fsverity_info::tree_params.hash_alg->tfm is a crypto_ahash object that's
internal to and is allocated by the crypto subsystem. So by using
READ_ONCE() for ->i_verity_info, we're relying on internal
implementation details of the crypto subsystem.
Remove this fragile assumption by using smp_load_acquire() instead.
Also fix the cmpxchg logic to correctly execute an ACQUIRE barrier when
losing the cmpxchg race, since cmpxchg doesn't guarantee a memory
barrier on failure.
(Note: I haven't seen any real-world problems here. This change is just
fixing the code to be guaranteed correct and less fragile.)
Fixes: fd2d1acfca ("fs-verity: add the hook for file ->open()")
Link: https://lore.kernel.org/r/20200721225920.114347-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Normally smp_store_release() or cmpxchg_release() is paired with
smp_load_acquire(). Sometimes smp_load_acquire() can be replaced with
the more lightweight READ_ONCE(). However, for this to be safe, all the
published memory must only be accessed in a way that involves the
pointer itself. This may not be the case if allocating the object also
involves initializing a static or global variable, for example.
fscrypt_info includes various sub-objects which are internal to and are
allocated by other kernel subsystems such as keyrings and crypto. So by
using READ_ONCE() for ->i_crypt_info, we're relying on internal
implementation details of these other kernel subsystems.
Remove this fragile assumption by using smp_load_acquire() instead.
(Note: I haven't seen any real-world problems here. This change is just
fixing the code to be guaranteed correct and less fragile.)
Fixes: e37a784d8b ("fscrypt: use READ_ONCE() to access ->i_crypt_info")
Link: https://lore.kernel.org/r/20200721225920.114347-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Normally smp_store_release() or cmpxchg_release() is paired with
smp_load_acquire(). Sometimes smp_load_acquire() can be replaced with
the more lightweight READ_ONCE(). However, for this to be safe, all the
published memory must only be accessed in a way that involves the
pointer itself. This may not be the case if allocating the object also
involves initializing a static or global variable, for example.
super_block::s_master_keys is a keyring, which is internal to and is
allocated by the keyrings subsystem. By using READ_ONCE() for it, we're
relying on internal implementation details of the keyrings subsystem.
Remove this fragile assumption by using smp_load_acquire() instead.
(Note: I haven't seen any real-world problems here. This change is just
fixing the code to be guaranteed correct and less fragile.)
Fixes: 22d94f493b ("fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl")
Link: https://lore.kernel.org/r/20200721225920.114347-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Normally smp_store_release() or cmpxchg_release() is paired with
smp_load_acquire(). Sometimes smp_load_acquire() can be replaced with
the more lightweight READ_ONCE(). However, for this to be safe, all the
published memory must only be accessed in a way that involves the
pointer itself. This may not be the case if allocating the object also
involves initializing a static or global variable, for example.
fscrypt_prepared_key includes a pointer to a crypto_skcipher object,
which is internal to and is allocated by the crypto subsystem. By using
READ_ONCE() for it, we're relying on internal implementation details of
the crypto subsystem.
Remove this fragile assumption by using smp_load_acquire() instead.
(Note: I haven't seen any real-world problems here. This change is just
fixing the code to be guaranteed correct and less fragile.)
Fixes: 5fee36095c ("fscrypt: add inline encryption support")
Cc: Satya Tangirala <satyat@google.com>
Link: https://lore.kernel.org/r/20200721225920.114347-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
fscrypt_do_sha256() is only used for hashing encrypted filenames to
create no-key tokens, which isn't performance-critical. Therefore a C
implementation of SHA-256 is sufficient.
Also, the logic to create no-key tokens is always potentially needed.
This differs from fscrypt's other dependencies on crypto API algorithms,
which are conditionally needed depending on what encryption policies
userspace is using. Therefore, for fscrypt there isn't much benefit to
allowing SHA-256 to be a loadable module.
So, make fscrypt_do_sha256() use the SHA-256 library instead of the
crypto_shash API. This is much simpler, since it avoids having to
implement one-time-init (which is hard to do correctly, and in fact was
implemented incorrectly) and handle failures to allocate the
crypto_shash object.
Fixes: edc440e3d2 ("fscrypt: improve format of no-key names")
Cc: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20200721225920.114347-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
It is possible to cause a btrfs mount to fail by racing it with a slow
umount. The crux of the sequence is generic_shutdown_super not yet
calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
If that occurs, btrfs_open_devices will decide the opened counter is
non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
0. From here, mount will call sget which will result in grab_super
trying to take the super block umount semaphore. That semaphore will be
held by the slow umount, so mount will block. Before up-ing the
semaphore, umount will delete the super block, resulting in mount's sget
reliably allocating a new one, which causes the mount path to dutifully
fill it out, and increment total_rw_bytes a second time, which causes
the mount to fail, as we see double the expected bytes.
Here is the sequence laid out in greater detail:
CPU0 CPU1
down_write sb->s_umount
btrfs_kill_super
kill_anon_super(sb)
generic_shutdown_super(sb);
shrink_dcache_for_umount(sb);
sync_filesystem(sb);
evict_inodes(sb); // SLOW
btrfs_mount_root
btrfs_scan_one_device
fs_devices = device->fs_devices
fs_info->fs_devices = fs_devices
// fs_devices-opened makes this a no-op
btrfs_open_devices(fs_devices, mode, fs_type)
s = sget(fs_type, test, set, flags, fs_info);
find sb in s_instances
grab_super(sb);
down_write(&s->s_umount); // blocks
sop->put_super(sb)
// sb->fs_devices->opened == 2; no-op
spin_lock(&sb_lock);
hlist_del_init(&sb->s_instances);
spin_unlock(&sb_lock);
up_write(&sb->s_umount);
return 0;
retry lookup
don't find sb in s_instances (deleted by CPU0)
s = alloc_super
return s;
btrfs_fill_super(s, fs_devices, data)
open_ctree // fs_devices total_rw_bytes improperly set!
btrfs_read_chunk_tree
read_one_dev // increment total_rw_bytes again!!
super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!
To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree
before the calls to read_one_dev, while holding the sb umount semaphore
and the uuid mutex.
To reproduce, it is sufficient to dirty a decent number of inodes, then
quickly umount and mount.
for i in $(seq 0 500)
do
dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
done
umount /mnt/foo&
mount /mnt/foo
does the trick for me.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When locking pages for delalloc, we check if it's dirty and mapping still
matches. If it does not match, we need to return -EAGAIN and release all
pages. Only the current page was put though, iterate over all the
remaining pages too.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running tests like generic/013 on test device with btrfs quota
enabled, it can normally lead to data leak, detected at unmount time:
BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 4096
------------[ cut here ]------------
WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs]
RIP: 0010:close_ctree+0x1dc/0x323 [btrfs]
Call Trace:
btrfs_put_super+0x15/0x17 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x17/0x30 [btrfs]
deactivate_locked_super+0x3b/0xa0
deactivate_super+0x40/0x50
cleanup_mnt+0x135/0x190
__cleanup_mnt+0x12/0x20
task_work_run+0x64/0xb0
__prepare_exit_to_usermode+0x1bc/0x1c0
__syscall_return_slowpath+0x47/0x230
do_syscall_64+0x64/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace caf08beafeca2392 ]---
BTRFS error (device dm-3): qgroup reserved space leaked
[CAUSE]
In the offending case, the offending operations are:
2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0
2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0
The following sequence of events could happen after the writev():
CPU1 (writeback) | CPU2 (truncate)
-----------------------------------------------------------------
btrfs_writepages() |
|- extent_write_cache_pages() |
|- Got page for 1003520 |
| 1003520 is Dirty, no writeback |
| So (!clear_page_dirty_for_io()) |
| gets called for it |
|- Now page 1003520 is Clean. |
| | btrfs_setattr()
| | |- btrfs_setsize()
| | |- truncate_setsize()
| | New i_size is 18388
|- __extent_writepage() |
| |- page_offset() > i_size |
|- btrfs_invalidatepage() |
|- Page is clean, so no qgroup |
callback executed
This means, the qgroup reserved data space is not properly released in
btrfs_invalidatepage() as the page is Clean.
[FIX]
Instead of checking the dirty bit of a page, call
btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage().
As qgroup rsv are completely bound to the QGROUP_RESERVED bit of
io_tree, not bound to page status, thus we won't cause double freeing
anyway.
Fixes: 0b34c261e2 ("btrfs: qgroup: Prevent qgroup->reserved from going subzero")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Added a new ioctl to send discard commands or/and zero out
to selected data area of a regular file for security reason.
The way of handling range.len of F2FS_IOC_SEC_TRIM_FILE:
1. Added -1 value support for range.len to secure trim the whole blocks
starting from range.start regardless of i_size.
2. If the end of the range passes over the end of file, it means until
the end of file (i_size).
3. ignored the case of that range.len is zero to prevent the function
from making end_addr zero and triggering different behaviour of
the function.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
https://bugzilla.kernel.org/show_bug.cgi?id=208565
PID: 257 TASK: ecdd0000 CPU: 0 COMMAND: "init"
#0 [<c0b420ec>] (__schedule) from [<c0b423c8>]
#1 [<c0b423c8>] (schedule) from [<c0b459d4>]
#2 [<c0b459d4>] (rwsem_down_read_failed) from [<c0b44fa0>]
#3 [<c0b44fa0>] (down_read) from [<c044233c>]
#4 [<c044233c>] (f2fs_truncate_blocks) from [<c0442890>]
#5 [<c0442890>] (f2fs_truncate) from [<c044d408>]
#6 [<c044d408>] (f2fs_evict_inode) from [<c030be18>]
#7 [<c030be18>] (evict) from [<c030a558>]
#8 [<c030a558>] (iput) from [<c047c600>]
#9 [<c047c600>] (f2fs_sync_node_pages) from [<c0465414>]
#10 [<c0465414>] (f2fs_write_checkpoint) from [<c04575f4>]
#11 [<c04575f4>] (f2fs_sync_fs) from [<c0441918>]
#12 [<c0441918>] (f2fs_do_sync_file) from [<c0441098>]
#13 [<c0441098>] (f2fs_sync_file) from [<c0323fa0>]
#14 [<c0323fa0>] (vfs_fsync_range) from [<c0324294>]
#15 [<c0324294>] (do_fsync) from [<c0324014>]
#16 [<c0324014>] (sys_fsync) from [<c0108bc0>]
This can be caused by flush_dirty_inode() in f2fs_sync_node_pages() where
iput() requires f2fs_lock_op() again resulting in livelock.
Reported-by: Zhiguo Niu <Zhiguo.Niu@unisoc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
IV_INO_LBLK_* exist only because of hardware limitations, and currently
the only known use case for them involves AES-256-XTS. Therefore, for
now only allow them in combination with AES-256-XTS. This way we don't
have to worry about them being combined with other encryption modes.
(To be clear, combining IV_INO_LBLK_* with other encryption modes
*should* work just fine. It's just not being tested, so we can't be
100% sure it works. So with no known use case, it's best to disallow it
for now, just like we don't allow other weird combinations like
AES-256-XTS contents encryption with Adiantum filenames encryption.)
This can be relaxed later if a use case for other combinations arises.
Fixes: b103fb7653 ("fscrypt: add support for IV_INO_LBLK_64 policies")
Fixes: e3b1078bed ("fscrypt: add support for IV_INO_LBLK_32 policies")
Link: https://lore.kernel.org/r/20200721181012.39308-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
To allow the kernel not to play games with set_fs to call exec
implement kernel_execve. The function kernel_execve takes pointers
into kernel memory and copies the values pointed to onto the new
userspace stack.
The calls with arguments from kernel space of do_execve are replaced
with calls to kernel_execve.
The calls do_execve and do_execveat are made static as there are now
no callers outside of exec.
The comments that mention do_execve are updated to refer to
kernel_execve or execve depending on the circumstances. In addition
to correcting the comments, this makes it easy to grep for do_execve
and verify it is not used.
Inspired-by: https://lkml.kernel.org/r/20200627072704.2447163-1-hch@lst.de
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87wo365ikj.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In preparation for implementiong kernel_execve (which will take kernel
pointers not userspace pointers) factor out bprm_stack_limits out of
prepare_arg_pages. This separates the counting which depends upon the
getting data from userspace from the calculations of the stack limits
which is usable in kernel_execve.
The remove prepare_args_pages and compute bprm->argc and bprm->envc
directly in do_execveat_common, before bprm_stack_limits is called.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/87365u6x60.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently it is necessary for the usermode helper code and the code
that launches init to use set_fs so that pages coming from the kernel
look like they are coming from userspace.
To allow that usage of set_fs to be removed cleanly the argument
copying from userspace needs to happen earlier. Factor bprm_execve
out of do_execve_common to separate out the copying of arguments
to the newe stack, and the rest of exec.
In separating bprm_execve from do_execve_common the copying
of the arguments onto the new stack happens earlier.
As the copying of the arguments does not depend any security hooks,
files, the file table, current->in_execve, current->fs->in_exec,
bprm->unsafe, or creds this is safe.
Likewise the security hook security_creds_for_exec does not depend upon
preventing the argument copying from happening.
In addition to making it possible to implement kernel_execve that
performs the copying differently, this separation of bprm_execve from
do_execve_common makes for a nice separation of responsibilities making
the exec code easier to navigate.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/878sfm6x6x.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently it is necessary for the usermode helper code and the code that
launches init to use set_fs so that pages coming from the kernel look like
they are coming from userspace.
To allow that usage of set_fs to be removed cleanly the argument copying
from userspace needs to happen earlier. Move the allocation and
initialization of bprm->mm into alloc_bprm so that the bprm->mm is
available early to store the new user stack into. This is a prerequisite
for copying argv and envp into the new user stack early before ther rest of
exec.
To keep the things consistent the cleanup of bprm->mm is moved into
free_bprm. So that bprm->mm will be cleaned up whenever bprm->mm is
allocated and free_bprm are called.
Moving bprm_mm_init earlier is safe as it does not depend on any files,
current->in_execve, current->fs->in_exec, bprm->unsafe, or the if the file
table is shared. (AKA bprm_mm_init does not depend on any of the code that
happens between alloc_bprm and where it was previously called.)
This moves bprm->mm cleanup after current->fs->in_exec is set to 0. This
is safe because current->fs->in_exec is only used to preventy taking an
additional reference on the fs_struct.
This moves bprm->mm cleanup after current->in_execve is set to 0. This is
safe because current->in_execve is only used by the lsms (apparmor and
tomoyou) and always for LSM specific functions, never for anything to do
with the mm.
This adds bprm->mm cleanup into the successful return path. This is safe
because being on the successful return path implies that begin_new_exec
succeeded and set brpm->mm to NULL. As bprm->mm is NULL bprm cleanup I am
moving into free_bprm will do nothing.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/87eepe6x7p.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently it is necessary for the usermode helper code and the code
that launches init to use set_fs so that pages coming from the kernel
look like they are coming from userspace.
To allow that usage of set_fs to be removed cleanly the argument
copying from userspace needs to happen earlier. Move the computation
of bprm->filename and possible allocation of a name in the case
of execveat into alloc_bprm to make that possible.
The exectuable name, the arguments, and the environment are
copied into the new usermode stack which is stored in bprm
until exec passes the point of no return.
As the executable name is copied first onto the usermode stack
it needs to be known. As there are no dependencies to computing
the executable name, compute it early in alloc_bprm.
As an implementation detail if the filename needs to be generated
because it embeds a file descriptor store that filename in a new field
bprm->fdpath, and free it in free_bprm. Previously this was done in
an independent variable pathbuf. I have renamed pathbuf fdpath
because fdpath is more suggestive of what kind of path is in the
variable. I moved fdpath into struct linux_binprm because it is
tightly tied to the other variables in struct linux_binprm, and as
such is needed to allow the call alloc_binprm to move.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/87k0z66x8f.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Currently it is necessary for the usermode helper code and the code
that launches init to use set_fs so that pages coming from the kernel
look like they are coming from userspace.
To allow that usage of set_fs to be removed cleanly the argument
copying from userspace needs to happen earlier. Move the allocation
of the bprm into it's own function (alloc_bprm) and move the call of
alloc_bprm before unshare_files so that bprm can ultimately be
allocated, the arguments can be placed on the new stack, and then the
bprm can be passed into the core of exec.
Neither the allocation of struct binprm nor the unsharing depend upon each
other so swapping the order in which they are called is trivially safe.
To keep things consistent the order of cleanup at the end of
do_execve_common swapped to match the order of initialization.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87pn8y6x9a.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
On-disk format for name_hash field is LE, so it must be explicitly
transformed on BE system for proper result.
Fixes: 370e812b3e ("exfat: add nls operations")
Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Chen Minqiang <ptpt52@gmail.com>
Signed-off-by: Ilya Ponetayev <i.ponetaev@ndmsystems.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
The stream.size field is updated to the value of create timestamp
of the file entry. Fix this to use correct stream entry pointer.
Fixes: 29bbb14bfc ("exfat: fix incorrect update of stream entry in __exfat_truncate()")
Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
We found the wrong hint_stat initialization in exfat_find_dir_entry().
It should be initialized when cluster is EXFAT_EOF_CLUSTER.
Fixes: ca06197382 ("exfat: add directory operations")
Cc: stable@vger.kernel.org # v5.7
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
An overflow issue can occur while calculating sector in
exfat_cluster_to_sector(). It needs to cast clus's type to sector_t
before left shifting.
Fixes: 1acf1a564b ("exfat: add in-memory and on-disk structures and headers")
Cc: stable@vger.kernel.org # v5.7
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
The name "FS_KEY_DERIVATION_NONCE_SIZE" is a bit outdated since due to
the addition of FSCRYPT_POLICY_FLAG_DIRECT_KEY, the file nonce may now
be used as a tweak instead of for key derivation. Also, we're now
prefixing the fscrypt constants with "FSCRYPT_" instead of "FS_".
Therefore, rename this constant to FSCRYPT_FILE_NONCE_SIZE.
Link: https://lore.kernel.org/r/20200708215722.147154-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Each HKDF context byte is associated with a specific format of the
remaining part of the application-specific info string. Add comments so
that it's easier to keep track of what these all are.
Link: https://lore.kernel.org/r/20200708215529.146890-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Drop the repeated word "the" in a comment.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: linux-f2fs-devel@lists.sourceforge.net
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Memory allocated for storing compressed pages' poitner should be
released after f2fs_write_compressed_pages(), otherwise it will
cause memory leak issue.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Don't define F2FS_IOC_* aliases to ioctls that already have a generic
FS_IOC_* name. These aliases are unnecessary, and they make it unclear
which ioctls are f2fs-specific and which are generic.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Count pages after possibly truncating the iterator to the maximum zone
append size, not before.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Avoid the compilation warning "Variable 'ret' is reassigned a value
before the old one has been used." in zonefs_create_zgroup() by setting
ret for the error path only if an error happens.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Opening files in /proc/pid/map_files when the current user is
CAP_CHECKPOINT_RESTORE capable in the root namespace is useful for
checkpointing and restoring to recover files that are unreachable via
the file system such as deleted files, or memfd files.
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Link: https://lore.kernel.org/r/20200719100418.2112740-5-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
These codes have been commented out since 2007 and lay in kernel
since then. So, it's better to remove them.
Link: http://lkml.kernel.org/r/20200628074337.45895-1-jianyong.wu@arm.com
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
In the current setattr implementation in 9p, fid is always retrieved
from dentry no matter file instance exists or not. If so, there may be
some info related to opened file instance dropped. So it's better
to retrieve fid from file instance when it is passed to setattr.
for example:
fd=open("tmp", O_RDWR);
ftruncate(fd, 10);
The file context related with the fd will be lost as fid is always
retrieved from dentry, then the backend can't get the info of
file context. It is against the original intention of user and
may lead to bug.
Link: http://lkml.kernel.org/r/20200710101548.10108-1-jianyong.wu@arm.com
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
We currently filter these for timeout_remove/async_cancel/files_update,
but we only should be filtering for fixed file and buffer select. This
also causes a second read of sqe->flags, which isn't needed.
Just check req->flags for the relevant bits. This then allows these
commands to be used in links, for example, like everything else.
Signed-off-by: Daniele Albano <d.albano@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The double poll additions were centered around doing POLL_ADD on file
descriptors that use more than one waitqueue (typically one for read,
one for write) when being polled. However, it can also end up being
triggered for when we use poll triggered retry. For that case, we cannot
safely use req->io, as that could be used by the request type itself.
Add a second io_poll_iocb pointer in the structure we allocate for poll
based retry, and ensure we use the right one from the two paths.
Fixes: 18bceab101 ("io_uring: allow POLL_ADD with double poll_wait() users")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The MS_I_VERSION mount flag is exposed via the VFS, as documented
in the mount manpages etc; see the iversion and noiversion mount
options in mount(8).
As a result, mount -o remount looks for this option in /proc/mounts
and will only send the I_VERSION flag back in during remount it it
is present. Since it's not there, a remount will /remove/ the
I_VERSION flag at the vfs level, and iversion functionality is lost.
xfs v5 superblocks intend to always have i_version enabled; it is
set as a default at mount time, but is lost during remount for the
reasons above.
The generic fix would be to expose this documented option in
/proc/mounts, but since that was rejected, fix it up again in the
xfs remount path instead, so that at least xfs won't suffer from
this misbehavior.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reverting commit d03727b248 "NFSv4 fix CLOSE not waiting for
direct IO compeletion". This patch made it so that fput() by calling
inode_dio_done() in nfs_file_release() would wait uninterruptably
for any outstanding directIO to the file (but that wait on IO should
be killable).
The problem the patch was also trying to address was REMOVE returning
ERR_ACCESS because the file is still opened, is supposed to be resolved
by server returning ERR_FILE_OPEN and not ERR_ACCESS.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8RyOwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpuNmD/sFxMpo0Q4szKSdFY16RxmLbeeCG8eQC+6P
Zqqd4t4tpr1tamSf4pya8zh7ivkfPlm+IFQopEEXbDAZ5P8TwF59KvABRbUYbCFM
ldQzJgvRwoTIhs0ojIY6CPMAxbpDLx8mpwgbzcjuKxbGDHEnndXPDbNO/8olxAaa
Ace5zk7TpY9YDtEXr1qe3y0riw11o/E9S/iX+M/z1KGKQcx01jU4hwesuzssde4J
rEG3TYFiHCkhfB0AtGj3zYInCYIXqqJRqEv9NP0npWB1IWbyLy9XatEDCx8aIblA
HICy09+4v5HR5h4vByRGOvT28rl//7ZB4tdzkunLWYrxYkYOqypsRI8NeDelxtWa
Iv+1Og94lQnjwOF9Iqz/q2z/OfpxlJpOvy8d5xWjhiNr9oc5ugAqVUiFjuQ6XnVG
mNJA21pJwzpesggOErIYjI13JvwW3aFylAB3fBPitHcmCusElnLunSs3/zhr9NY7
BJomUwC/KCmcp/X/WX2W5LoKEnG9WnVrJJmDWjz1wQLziKa7dvHAGUGFArpGJmJ3
TUGefdBi6q2nC7o+K26pwFfjQpA1Myf8Vp6qS957YQ7kZoI1a0bCuxp/rrq1gNFt
8HeKf4jmfqcBZeTPlZDyMWwC5F1MpK9V+KComBqSA8x5/Q0mV6BsOvY5mMHfCgky
HuD7ERgs3g==
=Aucc
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.8-2020-07-17' of git://git.kernel.dk/linux-block into master
Pull io_uring fix from Jens Axboe:
"Fix for a case where, with automatic buffer selection, we can leak the
buffer descriptor for recvmsg"
* tag 'io_uring-5.8-2020-07-17' of git://git.kernel.dk/linux-block:
io_uring: fix recvmsg memory leak with buffer selection
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXxGFQwAKCRDh3BK/laaZ
PDYzAP9bXxHQaRdetnj6lOGNWjmVmiHfntxHqkl6QjZf6e1WlwD+NRXayVTc+Lzw
M1pBK6kqovMQVWkyFfA3dTq/BZMzfAc=
=9GPn
-----END PGP SIGNATURE-----
Merge tag 'fuse-fixes-5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse into master
Pull fuse fixes from Miklos Szeredi:
- two regressions in this cycle caused by the conversion of writepage
list to an rb_tree
- two regressions in v5.4 cause by the conversion to the new mount API
- saner behavior of fsconfig(2) for the reconfigure case
- an ancient issue with FS_IOC_{GET,SET}FLAGS ioctls
* tag 'fuse-fixes-5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: Fix parameter for FS_IOC_{GET,SET}FLAGS
fuse: don't ignore errors from fuse_writepages_fill()
fuse: clean up condition for writepage sending
fuse: reject options on reconfigure via fsconfig(2)
fuse: ignore 'data' argument of mount(..., MS_REMOUNT)
fuse: use ->reconfigure() instead of ->remount_fs()
fuse: fix warning in tree_insert() and clean up writepage insertion
fuse: move rb_erase() before tree_insert()
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXxGF+QAKCRDh3BK/laaZ
PCHnAQCqNxcxncKMebpJ2hNIEPuSvUPRA4+iOOnc+9HTZ4A09wD/d/8ryybORTZN
IHq2PpQUtuGgASv6GrptJSmpDvG6RA0=
=lOD9
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into master
Pull overlayfs fixes from Miklos Szeredi:
- fix a regression introduced in v4.20 in handling a regenerated
squashfs lower layer
- two regression fixes for this cycle, one of which is Oops inducing
- miscellaneous issues
* tag 'ovl-fixes-5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix lookup of indexed hardlinks with metacopy
ovl: fix unneeded call to ovl_change_flags()
ovl: fix mount option checks for nfs_export with no upperdir
ovl: force read-only sb on failure to create index dir
ovl: fix regression with re-formatted lower squashfs
ovl: fix oops in ovl_indexdir_cleanup() with nfs_export=on
ovl: relax WARN_ON() when decoding lower directory file handle
ovl: remove not used argument in ovl_check_origin
ovl: change ovl_copy_up_flags static
ovl: inode reference leak in ovl_is_inuse true case.
It looks like this "else" is just a typo. It turns off nconnect for
NFSv4.0 even though it works for every other version.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
commit 0688e64bc6 ("NFS: Allow signal interruption of NFS4ERR_DELAYed operations")
introduces nfs4_delay_interruptible which also needs an _unsafe version to
avoid the following call trace for the same reason explained in
commit 416ad3c9c0 ("freezer: add unsafe versions of freezable helpers for NFS")
CPU: 4 PID: 3968 Comm: rm Tainted: G W 5.8.0-rc4 #1
Hardware name: Marvell OcteonTX CN96XX board (DT)
Call trace:
dump_backtrace+0x0/0x1dc
show_stack+0x20/0x30
dump_stack+0xdc/0x150
debug_check_no_locks_held+0x98/0xa0
nfs4_delay_interruptible+0xd8/0x120
nfs4_handle_exception+0x130/0x170
nfs4_proc_rmdir+0x8c/0x220
nfs_rmdir+0xa4/0x360
vfs_rmdir.part.0+0x6c/0x1b0
do_rmdir+0x18c/0x210
__arm64_sys_unlinkat+0x64/0x7c
el0_svc_common.constprop.0+0x7c/0x110
do_el0_svc+0x24/0xa0
el0_sync_handler+0x13c/0x1b8
el0_sync+0x158/0x180
Signed-off-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
This is an effort to eliminate the uninitialized_var() macro[1].
The use of this macro is the wrong solution because it forces off ANY
analysis by the compiler for a given variable. It even masks "unused
variable" warnings.
Quoted from Linus[2]:
"It's a horrible thing to use, in that it adds extra cruft to the
source code, and then shuts up a compiler warning (even the _reliable_
warnings from gcc)."
Fix it by remove this variable since it is not needed at all.
[1] https://github.com/KSPP/linux/issues/81
[2] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20200615085132.166470-1-yanaijie@huawei.com
Signed-off-by: Kees Cook <keescook@chromium.org>
bd_start_claiming duplicates a lot of the work done in __blkdev_get.
Integrate the two functions to avoid the duplicate work, and to do the
right thing for the md -ERESTARTSYS corner case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The arcane magic in bd_start_claiming is only needed to be able to claim
a block_device that hasn't been fully set up. Switch the loop driver
that claims from the ioctl path with a fully set up struct block_device
to just use the much simpler bd_prepare_to_claim directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the locking and assignment of bd_claiming from bd_start_claiming to
bd_prepare_to_claim.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Insted of duplicating all the cleanup logic jump to the code that cleans
up anyway, and restart after that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper for struct file based chmode operations. To be used by
the initramfs code soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a helper for struct file based chown operations. To be used by
the initramfs code soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
We recently moved setting inode flag OVL_UPPERDATA to ovl_lookup().
When looking up an overlay dentry, upperdentry may be found by index
and not by name. In that case, we fail to read the metacopy xattr
and falsly set the OVL_UPPERDATA on the overlay inode.
This caused a regression in xfstest overlay/033 when run with
OVERLAY_MOUNT_OPTIONS="-o metacopy=on".
Fixes: 28166ab3c8 ("ovl: initialize OVL_UPPERDATA in ovl_lookup()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The check if user has changed the overlay file was wrong, causing unneeded
call to ovl_change_flags() including taking f_lock on every file access.
Fixes: d989903058 ("ovl: do not generate duplicate fsnotify events for "fake" path")
Cc: <stable@vger.kernel.org> # v4.19+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The afs filesystem driver allows unstarted operations to be cancelled by
signal, but most of these can easily be restarted (mkdir for example). The
primary culprits for reproducing this are those applications that use
SIGALRM to display a progress counter.
File lock-extension operation is marked uninterruptible as we have a
limited time in which to do it, and the release op is marked
uninterruptible also as if we fail to unlock a file, we'll have to wait 20
mins before anyone can lock it again.
The store operation logs a warning if it gets interruption, e.g.:
kAFS: Unexpected error from FS.StoreData -4
because it's run from the background - but it can also be run from
fdatasync()-type things. However, store options aren't marked
interruptible at the moment.
Fix this in the following ways:
(1) Mark store operations as uninterruptible. It might make sense to
relax this for certain situations, but I'm not sure how to make sure
that background store ops aren't affected by signals to foreground
processes that happen to trigger them.
(2) In afs_get_io_locks(), where we're getting the serialisation lock for
talking to the fileserver, return ERESTARTSYS rather than EINTR
because a lot of the operations (e.g. mkdir) are restartable if we
haven't yet started sending the op to the server.
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without upperdir mount option, there is no index dir and the dependency
checks nfs_export => index for mount options parsing are incorrect.
Allow the combination nfs_export=on,index=off with no upperdir and move
the check for dependency redirect_dir=nofollow for non-upper mount case
to mount options parsing.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With index feature enabled, on failure to create index dir, overlay is
being mounted read-only. However, we do not forbid user to remount overlay
read-write. Fix that by setting ofs->workdir to NULL, which prevents
remount read-write.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Commit 9df085f3c9 ("ovl: relax requirement for non null uuid of lower
fs") relaxed the requirement for non null uuid with single lower layer to
allow enabling index and nfs_export features with single lower squashfs.
Fabian reported a regression in a setup when overlay re-uses an existing
upper layer and re-formats the lower squashfs image. Because squashfs
has no uuid, the origin xattr in upper layer are decoded from the new
lower layer where they may resolve to a wrong origin file and user may
get an ESTALE or EIO error on lookup.
To avoid the reported regression while still allowing the new features
with single lower squashfs, do not allow decoding origin with lower null
uuid unless user opted-in to one of the new features that require
following the lower inode of non-dir upper (index, xino, metacopy).
Reported-by: Fabian <godi.beat@gmx.net>
Link: https://lore.kernel.org/linux-unionfs/32532923.JtPX5UtSzP@fgdesktop/
Fixes: 9df085f3c9 ("ovl: relax requirement for non null uuid of lower fs")
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Mounting with nfs_export=on, xfstests overlay/031 triggers a kernel panic
since v5.8-rc1 overlayfs updates.
overlayfs: orphan index entry (index/00fb1..., ftype=4000, nlink=2)
BUG: kernel NULL pointer dereference, address: 0000000000000030
RIP: 0010:ovl_cleanup_and_whiteout+0x28/0x220 [overlay]
Bisect point at commit c21c839b84 ("ovl: whiteout inode sharing")
Minimal reproducer:
--------------------------------------------------
rm -rf l u w m
mkdir -p l u w m
mkdir -p l/testdir
touch l/testdir/testfile
mount -t overlay -o lowerdir=l,upperdir=u,workdir=w,nfs_export=on overlay m
echo 1 > m/testdir/testfile
umount m
rm -rf u/testdir
mount -t overlay -o lowerdir=l,upperdir=u,workdir=w,nfs_export=on overlay m
umount m
--------------------------------------------------
When mount with nfs_export=on, and fail to verify an orphan index, we're
cleaning this index from indexdir by calling ovl_cleanup_and_whiteout().
This dereferences ofs->workdir, that was earlier set to NULL.
The design was that ovl->workdir will point at ovl->indexdir, but we are
assigning ofs->indexdir to ofs->workdir only after ovl_indexdir_cleanup().
There is no reason not to do it sooner, because once we get success from
ofs->indexdir = ovl_workdir_create(... there is no turning back.
Reported-and-tested-by: Murphy Zhou <jencce.kernel@gmail.com>
Fixes: c21c839b84 ("ovl: whiteout inode sharing")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Decoding a lower directory file handle to overlay path with cold
inode/dentry cache may go as follows:
1. Decode real lower file handle to lower dir path
2. Check if lower dir is indexed (was copied up)
3. If indexed, get the upper dir path from index
4. Lookup upper dir path in overlay
5. If overlay path found, verify that overlay lower is the lower dir
from step 1
On failure to verify step 5 above, user will get an ESTALE error and a
WARN_ON will be printed.
A mismatch in step 5 could be a result of lower directory that was renamed
while overlay was offline, after that lower directory has been copied up
and indexed.
This is a scripted reproducer based on xfstest overlay/052:
# Create lower subdir
create_dirs
create_test_files $lower/lowertestdir/subdir
mount_dirs
# Copy up lower dir and encode lower subdir file handle
touch $SCRATCH_MNT/lowertestdir
test_file_handles $SCRATCH_MNT/lowertestdir/subdir -p -o $tmp.fhandle
# Rename lower dir offline
unmount_dirs
mv $lower/lowertestdir $lower/lowertestdir.new/
mount_dirs
# Attempt to decode lower subdir file handle
test_file_handles $SCRATCH_MNT -p -i $tmp.fhandle
Since this WARN_ON() can be triggered by user we need to relax it.
Fixes: 4b91c30a5a ("ovl: lookup connected ancestor of dir in inode cache")
Cc: <stable@vger.kernel.org> # v4.16+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_check_origin outparam 'ctrp' argument not used by caller. So remove
this argument.
Signed-off-by: youngjun <her0gyugyu@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
"ovl_copy_up_flags" is used in copy_up.c.
so, change it static.
Signed-off-by: youngjun <her0gyugyu@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
io_recvmsg() doesn't free memory allocated for struct io_buffer. This can
causes a leak when used with automatic buffer selection.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Break up fanotify_alloc_event() into helpers by event struct type.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The special overflow event is allocated as struct fanotify_path_event,
but with a null path.
Use a special event type to identify the overflow event, so the helper
fanotify_has_event_path() will always indicate a non null path.
Allocating the overflow event doesn't need any of the fancy stuff in
fanotify_alloc_event(), so create a simplified helper for allocating the
overflow event.
There is also no need to store and report the pid with an overflow event.
Link: https://lore.kernel.org/r/20200708111156.24659-7-amir73il@gmail.com
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
inotify's event->wd is the object identifier.
Compare that instead of the common fsnotidy event objectid, so
we can get rid of the objectid field later.
Link: https://lore.kernel.org/r/20200708111156.24659-6-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When creating an FS_MODIFY event on inode itself (not on parent)
the file_name argument should be NULL.
The change to send a non NULL name to inode itself was done on purpuse
as part of another commit, as Tejun writes: "...While at it, supply the
target file name to fsnotify() from kernfs_node->name.".
But this is wrong practice and inconsistent with inotify behavior when
watching a single file. When a child is being watched (as opposed to the
parent directory) the inotify event should contain the watch descriptor,
but not the file name.
Fixes: df6a58c5c5 ("kernfs: don't depend on d_find_any_alias()...")
Link: https://lore.kernel.org/r/20200708111156.24659-5-amir73il@gmail.com
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Return non const inode pointer from fsnotify_data_inode().
None of the fsnotify hooks pass const inode pointer as data and
callers often need to cast to a non const pointer.
Link: https://lore.kernel.org/r/20200708111156.24659-3-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
All (two) callers of fsnotify_parent() also call fsnotify() to notify
the child inode. Move the second fsnotify() call into fsnotify_parent().
This will allow more flexibility in making decisions about which of the
two event falvors should be sent.
Using 'goto notify_child' in the inline helper seems a bit strange, but
it mimics the code in __fsnotify_parent() for clarity and the goto
pattern will become less strage after following patches are applied.
Link: https://lore.kernel.org/r/20200708111156.24659-2-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The fsnotify paths are trivial to hit even when there are no watchers and
they are surprisingly expensive. For example, every successful vfs_write()
hits fsnotify_modify which calls both fsnotify_parent and fsnotify unless
FMODE_NONOTIFY is set which is an internal flag invisible to userspace.
As it stands, fsnotify_parent is a guaranteed functional call even if there
are no watchers and fsnotify() does a substantial amount of unnecessary
work before it checks if there are any watchers. A perf profile showed
that applying mnt->mnt_fsnotify_mask in fnotify() was almost half of the
total samples taken in that function during a test. This patch rearranges
the fast paths to reduce the amount of work done when there are no
watchers.
The test motivating this was "perf bench sched messaging --pipe". Despite
the fact the pipes are anonymous, fsnotify is still called a lot and
the overhead is noticeable even though it's completely pointless. It's
likely the overhead is negligible for real IO so this is an extreme
example. This is a comparison of hackbench using processes and pipes on
a 1-socket machine with 8 CPU threads without fanotify watchers.
5.7.0 5.7.0
vanilla fastfsnotify-v1r1
Amean 1 0.4837 ( 0.00%) 0.4630 * 4.27%*
Amean 3 1.5447 ( 0.00%) 1.4557 ( 5.76%)
Amean 5 2.6037 ( 0.00%) 2.4363 ( 6.43%)
Amean 7 3.5987 ( 0.00%) 3.4757 ( 3.42%)
Amean 12 5.8267 ( 0.00%) 5.6983 ( 2.20%)
Amean 18 8.4400 ( 0.00%) 8.1327 ( 3.64%)
Amean 24 11.0187 ( 0.00%) 10.0290 * 8.98%*
Amean 30 13.1013 ( 0.00%) 12.8510 ( 1.91%)
Amean 32 13.9190 ( 0.00%) 13.2410 ( 4.87%)
5.7.0 5.7.0
vanilla fastfsnotify-v1r1
Duration User 157.05 152.79
Duration System 1279.98 1219.32
Duration Elapsed 182.81 174.52
This is showing that the latencies are improved by roughly 2-9%. The
variability is not shown but some of these results are within the noise
as this workload heavily overloads the machine. That said, the system CPU
usage is reduced by quite a bit so it makes sense to avoid the overhead
even if it is a bit tricky to detect at times. A perf profile of just 1
group of tasks showed that 5.14% of samples taken were in either fsnotify()
or fsnotify_parent(). With the patch, 2.8% of samples were in fsnotify,
mostly function entry and the initial check for watchers. The check for
watchers is complicated enough that inlining it may be controversial.
[Amir] Slightly simplify with mnt_or_sb_mask => marks_mask
Link: https://lore.kernel.org/r/20200708111156.24659-1-amir73il@gmail.com
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When user provides large buffer for events and there are lots of events
available, we can try to copy them all to userspace without scheduling
which can softlockup the kernel (furthermore exacerbated by the
contention on notification_lock). Add a scheduling point after copying
each event.
Note that usually the real underlying problem is the cost of fanotify
event merging and the resulting contention on notification_lock but this
is a cheap way to somewhat reduce the problem until we can properly
address that.
Reported-by: Francesco Ruggeri <fruggeri@arista.com>
Link: https://lore.kernel.org/lkml/20200714025417.A25EB95C0339@us180.sjc.aristanetworks.com
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The ioctl encoding for this parameter is a long but the documentation says
it should be an int and the kernel drivers expect it to be an int. If the
fuse driver treats this as a long it might end up scribbling over the stack
of a userspace process that only allocated enough space for an int.
This was previously discussed in [1] and a patch for fuse was proposed in
[2]. From what I can tell the patch in [2] was nacked in favor of adding
new, "fixed" ioctls and using those from userspace. However there is still
no "fixed" version of these ioctls and the fact is that it's sometimes
infeasible to change all userspace to use the new one.
Handling the ioctls specially in the fuse driver seems like the most
pragmatic way for fuse servers to support them without causing crashes in
userspace applications that call them.
[1]: https://lore.kernel.org/linux-fsdevel/20131126200559.GH20559@hall.aurel32.net/T/
[2]: https://sourceforge.net/p/fuse/mailman/message/31771759/
Signed-off-by: Chirantan Ekbote <chirantan@chromium.org>
Fixes: 59efec7b90 ("fuse: implement ioctl support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
commit 0688e64bc6 ("NFS: Allow signal interruption of
NFS4ERR_DELAYed operations") introduces nfs4_delay_interruptible
which also needs an _unsafe version to avoid the following call
trace for the same reason explained in commit 416ad3c9c0 ("freezer:
add unsafe versions of freezable helpers for NFS")
CPU: 4 PID: 3968 Comm: rm Tainted: G W 5.8.0-rc4 #1
Hardware name: Marvell OcteonTX CN96XX board (DT)
Call trace:
dump_backtrace+0x0/0x1dc
show_stack+0x20/0x30
dump_stack+0xdc/0x150
debug_check_no_locks_held+0x98/0xa0
nfs4_delay_interruptible+0xd8/0x120
nfs4_handle_exception+0x130/0x170
nfs4_proc_rmdir+0x8c/0x220
nfs_rmdir+0xa4/0x360
vfs_rmdir.part.0+0x6c/0x1b0
do_rmdir+0x18c/0x210
__arm64_sys_unlinkat+0x64/0x7c
el0_svc_common.constprop.0+0x7c/0x110
do_el0_svc+0x24/0xa0
el0_sync_handler+0x13c/0x1b8
el0_sync+0x158/0x180
Signed-off-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Remove duplicated include.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
These two definitions are unused now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
In the course of some operations, we look up the perag from
the mount multiple times to get or change perag information.
These are often very short pieces of code, so while the
lookup cost is generally low, the cost of the lookup is far
higher than the cost of the operation we are doing on the
perag.
Since we changed buffers to hold references to the perag
they are cached in, many modification contexts already hold
active references to the perag that are held across these
operations. This is especially true for any operation that
is serialised by an allocation group header buffer.
In these cases, we can just use the buffer's reference to
the perag to avoid needing to do lookups to access the
perag. This means that many operations don't need to do
perag lookups at all to access the perag because they've
already looked up objects that own persistent references
and hence can use that reference instead.
Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
fuse_writepages() ignores some errors taken from fuse_writepages_fill() I
believe it is a bug: if .writepages is called with WB_SYNC_ALL it should
either guarantee that all data was successfully saved or return error.
Fixes: 26d614df1d ("fuse: Implement writepages callback")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_writepages_fill uses following construction:
if (wpa && ap->num_pages &&
(A || B || C)) {
action;
} else if (wpa && D) {
if (E) {
the same action;
}
}
- ap->num_pages check is always true and can be removed
- "if" and "else if" calls the same action and can be merged.
Move checking A, B, C, D, E conditions to a helper, add comments.
Original-patch-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Previous patch changed handling of remount/reconfigure to ignore all
options, including those that are unknown to the fuse kernel fs. This was
done for backward compatibility, but this likely only affects the old
mount(2) API.
The new fsconfig(2) based reconfiguration could possibly be improved. This
would make the new API less of a drop in replacement for the old, OTOH this
is a good chance to get rid of some weirdnesses in the old API.
Several other behaviors might make sense:
1) unknown options are rejected, known options are ignored
2) unknown options are rejected, known options are rejected if the value
is changed, allowed otherwise
3) all options are rejected
Prior to the backward compatibility fix to ignore all options all known
options were accepted (1), even if they change the value of a mount
parameter; fuse_reconfigure() does not look at the config values set by
fuse_parse_param().
To fix that we'd need to verify that the value provided is the same as set
in the initial configuration (2). The major drawback is that this is much
more complex than just rejecting all attempts at changing options (3);
i.e. all options signify initial configuration values and don't make sense
on reconfigure.
This patch opts for (3) with the rationale that no mount options are
reconfigurable in fuse.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The command
mount -o remount -o unknownoption /mnt/fuse
succeeds on kernel versions prior to v5.4 and fails on kernel version at or
after. This is because fuse_parse_param() rejects any unrecognised options
in case of FS_CONTEXT_FOR_RECONFIGURE, just as for FS_CONTEXT_FOR_MOUNT.
This causes a regression in case the fuse filesystem is in fstab, since
remount sends all options found there to the kernel; even ones that are
meant for the initial mount and are consumed by the userspace fuse server.
Fix this by ignoring mount options, just as fuse_remount_fs() did prior to
the conversion to the new API.
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Fixes: c30da2e981 ("fuse: convert to use the new mount API")
Cc: <stable@vger.kernel.org> # v5.4
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
s_op->remount_fs() is only called from legacy_reconfigure(), which is not
used after being converted to the new API.
Convert to using ->reconfigure(). This restores the previous behavior of
syncing the filesystem and rejecting MS_MANDLOCK on remount.
Fixes: c30da2e981 ("fuse: convert to use the new mount API")
Cc: <stable@vger.kernel.org> # v5.4
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
fuse_writepages_fill() calls tree_insert() with ap->num_pages = 0 which
triggers the following warning:
WARNING: CPU: 1 PID: 17211 at fs/fuse/file.c:1728 tree_insert+0xab/0xc0 [fuse]
RIP: 0010:tree_insert+0xab/0xc0 [fuse]
Call Trace:
fuse_writepages_fill+0x5da/0x6a0 [fuse]
write_cache_pages+0x171/0x470
fuse_writepages+0x8a/0x100 [fuse]
do_writepages+0x43/0xe0
Fix up the warning and clean up the code around rb-tree insertion:
- Rename tree_insert() to fuse_insert_writeback() and make it return the
conflicting entry in case of failure
- Re-add tree_insert() as a wrapper around fuse_insert_writeback()
- Rename fuse_writepage_in_flight() to fuse_writepage_add() and reverse
the meaning of the return value to mean
+ "true" in case the writepage entry was successfully added
+ "false" in case it was in-fligt queued on an existing writepage
entry's auxiliary list or the existing writepage entry's temporary
page updated
Switch from fuse_find_writeback() + tree_insert() to
fuse_insert_writeback()
- Move setting orig_pages to before inserting/updating the entry; this may
result in the orig_pages value being discarded later in case of an
in-flight request
- In case of a new writepage entry use fuse_writepage_add()
unconditionally, only set data->wpa if the entry was added.
Fixes: 6b2fb79963 ("fuse: optimize writepages search")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Original-path-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In fuse_writepage_end() the old writepages entry needs to be removed from
the rbtree before inserting the new one, otherwise tree_insert() would
fail. This is a very rare codepath and no reproducer exists.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Link: https://lore.kernel.org/r/20200713200738.37800-1-grandmaster@al2klimov.de
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Implement client side caching for NFSv4.2 extended attributes. The cache
is a per-inode hashtable, with name/value entries. There is one special
entry for the listxattr cache.
NFS inodes have a pointer to a cache structure. The cache structure is
allocated on demand, freed when the cache is invalidated.
Memory shrinkers keep the size in check. Large entries (> PAGE_SIZE)
are collected by a separate shrinker, and freed more aggressively
than others.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Now that all the lower level code is there to make the RPC calls, hook
it in to the xattr handlers and the listxattr entry point, to make them
available.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Implement the extended attribute procedures for NFSv4.2 extended
attribute support (RFC 8276).
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Make the buf_to_pages_noslab function available to the rest of the NFS
code. Rename it to nfs4_buf_to_pages_noslab to be consistent.
This will be used later in the NFSv4.2 xattr code.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Define the NFS_INO_INVALID_XATTR flag, to be used for the NFSv4.2 xattr
cache, and use it where appropriate.
No functional change as yet.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Until now, change attributes in change_info form were only returned by
directory operations. However, they are also used for the RFC 8276
extended attribute operations, which work on both directories
and regular files. Modify update_changeattr to deal:
* Rename it to nfs4_update_changeattr and make it non-static.
* Don't always use INO_INVALID_DATA, this isn't needed for a
directory that only had its extended attributes changed by us.
* Existing callers now always pass in INO_INVALID_DATA.
For the current callers of this function, behavior is unchanged.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
RFC 8276 defines separate ACCESS bits for extended attribute checking.
Query them in nfs_do_access and opendata.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The only consumer of nfs_access_get_cached_rcu and nfs_access_cached
calls these static functions in order to first try RCU access, and
then locked access.
Combine them in to a single function, and call that. Make this function
available to the rest of the NFS code.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Define the argument and response structures that will be used for
RFC 8276 extended attribute RPC calls, and implement the necessary
functions to encode/decode the extended attribute operations.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Query the server for extended attribute support, and record it
as the NFS_CAP_XATTR flag in the server capabilities.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Set limits for extended attributes (attribute value size and listxattr
buffer size), based on the fs-independent limits (XATTR_*_MAX).
Define the maximum XDR sizes for the RFC 8276 XATTR operations.
In the case of operations that carry a larger payload (SETXATTR,
GETXATTR, LISTXATTR), these exclude that payload, which is added
as separate pages, like other operations do.
Define, much like for read and write operations, the maximum overhead
sizes for get/set/listxattr, and use them to limit the maximum payload
size for those operations, in combination with the channel attributes.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the rpc_pipefs is unmounted, then the rpc_pipe->dentry becomes NULL
and dereferencing the dentry->d_sb will trigger an oops. The only
reason we're doing that is to determine the nfsd_net, which could
instead be passed in by the caller. So do that instead.
Fixes: 11a60d1592 ("nfsd: add a "GetVersion" upcall for nfsdcld")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
We recently fixed lease breaking so that a client's actions won't break
its own delegations.
But we still have an unnecessary self-conflict when granting
delegations: a client's own write opens will prevent us from handing out
a read delegation even when no other client has the file open for write.
Fix that by turning off the checks for conflicting opens under
vfs_setlease, and instead performing those checks in the nfsd code.
We don't depend much on locks here: instead we acquire the delegation,
then check for conflicts, and drop the delegation again if we find any.
The check beforehand is an optimization of sorts, just to avoid
acquiring the delegation unnecessarily. There's a race where the first
check could cause us to deny the delegation when we could have granted
it. But, that's OK, delegation grants are optional (and probably not
even a good idea in that case).
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
A single character (line break) should be put into a sequence.
Thus use the corresponding function "seq_putc()".
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Check if user extended attributes are supported for an inode,
and return the answer when being queried for file attributes.
An exported filesystem can now signal its RFC8276 user extended
attributes capability.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Implement the main entry points for the *XATTR operations.
Add functions to calculate the reply size for the user extended attribute
operations, and implement the XDR encode / decode logic for these
operations.
Add the user extended attributes operations to nfsd4_ops.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add the structures used in extended attribute request / response handling.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Since the NFSv4.2 extended attributes extension defines 3 new access
bits for xattr operations, take them in to account when validating
what the client is asking for, and when checking permissions.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This adds the filehandle based functions for the xattr operations
that call in to the vfs layer to do the actual work.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
[ cel: address checkpatch.pl complaint ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add defines for server-side extended attribute support. Most have
already been added as part of client support, but these are
the network order error codes for the noxattr and xattr2big errors,
and the addition of the xattr support to the supported file
attributes (if configured).
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
nfs4_decode_write has code to parse incoming XDR write data in to
a kvec head, and a list of pages.
Put this code in to a separate function, so that it can be used
later by the xattr code, for setxattr. No functional change.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add a function that checks is an extended attribute namespace is
supported for an inode, meaning that a handler must be present
for either the whole namespace, or at least one synthetic
xattr in the namespace.
To be used by the nfs server code when being queried for extended
attributes support.
Cc: linux-fsdevel@vger.kernel.org
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
set/removexattr on an exported filesystem should break NFS delegations.
This is true in general, but also for the upcoming support for
RFC 8726 (NFSv4 extended attribute support). Make sure that they do.
Additionally, they need to grow a _locked variant, since callers might
call this with i_rwsem held (like the NFS server code).
Cc: stable@vger.kernel.org # v4.9+
Cc: linux-fsdevel@vger.kernel.org
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Expand __receive_fd() with support for replace_fd() for the coming seccomp
"addfd" ioctl(). Add new wrapper receive_fd_replace() for the new behavior
and update existing wrappers to retain old behavior.
Thanks to Colin Ian King <colin.king@canonical.com> for pointing out an
uninitialized variable exposure in an earlier version of this patch.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Kadashev <dkadashev@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
For both pidfd and seccomp, the __user pointer is not used. Update
__receive_fd() to make writing to ufd optional via a NULL check. However,
for the receive_fd_user() wrapper, ufd is NULL checked so an -EFAULT
can be returned to avoid changing the SCM_RIGHTS interface behavior. Add
new wrapper receive_fd() for pidfd and seccomp that does not use the ufd
argument. For the new helper, the allocated fd needs to be returned on
success. Update the existing callers to handle it.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
In preparation for users of the "install a received file" logic outside
of net/ (pidfd and seccomp), relocate and rename __scm_install_fd() from
net/core/scm.c to __receive_fd() in fs/file.c, and provide a wrapper
named receive_fd_user(), as future patches will change the interface
to __receive_fd().
Additionally add a comment to fd_install() as a counterpoint to how
__receive_fd() interacts with fput().
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Dmitry Kadashev <dkadashev@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Ido Schimmel <idosch@idosch.org>
Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: netdev@vger.kernel.org
Reviewed-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
We used to do this before 3453d5708b, but this was changed to better
handle the NFS4ERR_SEQ_MISORDERED error code. This commit fixed the slot
re-use case when the server doesn't receive the interrupted operation,
but if the server does receive the operation then it could still end up
replying to the client with mis-matched operations from the reply cache.
We can fix this by sending a SEQUENCE to the server while recovering from
a SEQ_MISORDERED error when we detect that we are in an interrupted slot
situation.
Fixes: 3453d5708b (NFSv4.1: Avoid false retries when RPC calls are interrupted)
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Make sure we specify the layout segment range when calculating the
mirror count. In theory, that number could depend on the range to
which we're writing.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Both nfs_pageio_reset_read_mds() and nfs_pageio_reset_write_mds()
do call pnfs_generic_pg_cleanup() for us.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the application uses the AT_STATX_DONT_SYNC flag after doing readdir(),
then we should still mark the parent inode as seeing a readdirplus hit.
That ensures that we continue to use readdirplus in the 'ls -l' type
of workflow to do fast lookups of the dentries.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl8K3ugACgkQxWXV+ddt
WDsDNBAAn5iaMNwlCBYpwAaWlltMog3SKg+vgpEcFD9qLlmimW/1TlrjjGRzp6Mn
nnNp+YjYDotqU9pP1OwESpY1LTzuVQlQL1yaiPLrehw/WsZgjdDWBk/EyU0n1vz1
Sr5wcyCVyVZZyO2/BEVTDhkvu+sj9Rcwo2QCsC2aIOTVSfQGFSklMp2VNdu2YQBy
zyTOhbwpn3OPPZsvScEujvSY9oUAN3J8WYA9jmgtwjZD7sr6UNyNI9vy8woi0VAQ
Uo7nXc43ZcS1xTwziGOpC6fZi90zrF7ZvfFT0qY92EEDcAQcCzPDl6f4OnAjr6/b
rnZcLvusEcENjFQn3pD7fCuXiIRrN8eHspj5+K/oRBTXWC5AykBwsLWt7M+tTMYa
ljEBRZlQlHMlC3xSEZNDccEvScXrEIu3Q2WrTOTXSgXi4e3q89VUTEIjAhfnTTzJ
VwHhGZIB6o+V7wZ0EhWdt9b1/Ro/AcADddV+AxTsfC1YCHVZOsSSa3DxV243ORsA
/U3t2a4SMp/iSHTtoLIwbr/O1Uj9UaOk2n1DcNbGIgdn14yYt6YWOhvrOPBampEa
zfBzmAOx9r5Mf2wWD0iTm4gJEZsrB+IpboYZ6cuBcOI29+A4k0POBfRLXgf8/jMo
5kBWm+C3KKkZO8u/Z4gtVG1ZFdxsnYAc+q+UXS5ZSJMH+++UoZQ=
=hTok
-----END PGP SIGNATURE-----
Merge tag 'for-5.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two refcounting fixes and one prepartory patch for upcoming splice
cleanup:
- fix double put of block group with nodatacow
- fix missing block group put when remounting with discard=async
- explicitly set splice callback (no functional change), to ease
integrating splice cleanup patches"
* tag 'for-5.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: wire up iter_file_splice_write
btrfs: fix double put of block group with nocow
btrfs: discard: add missing put when grabbing block group from unused list
59960b9deb ("io_uring: fix lazy work init") tried to fix missing
io_req_init_async(), but left out work.flags and hash. Do it earlier.
Fixes: 7cdaf587de ("io_uring: avoid whole io_wq_work copy for requests completed inline")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure to set msg.msg_name for the async portion of send/recvmsg,
as the header copy will copy to/from it.
Cc: stable@vger.kernel.org # v5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl8I/pEACgkQiiy9cAdy
T1HEzAv/frpjB9Ae4EH4oI9bqAPpIvmz0OG8qjiVhA7EMuKCPzj4DdP6HwgJ41Rr
GS6IaltHHYRFmv2rKPFm9BJs0l03XJmkOlx/D+mvh0YJiNAYGan99wCVuhjotRkk
1LLpG8ZKRSGtxc+gkyaAUiKMZPHtDxvfy/VUsoCzOJBUDVByeH2LlwanXGibWEmw
DHap06dClRZ0nLf7BvfL5vtFdX5If9CUOiwgj6PY/Oy+/hhq/XjQJClhVSH1tGhG
wXMOpq5gANG0MOCkJh6GVykawjSE0gz/jc6d+zlUW4stqMLQMz7kfIqLy/OZ2z4/
zggpKmxN969ZyhLxldEJguet5+gHu7rX7dQMJb83UBtV9uC8mtgVR/5vTMCtrGBX
c9YNdzOnk7Djhb9ka1sx9KAORwagphYS4BIAhk2bQbOcW8OfkiRHVLaDou39Vbn3
3rBVTFDg7Kdg1yFhPXZM1r4HBcWupVHq/jh+kCoHcPVrwYBwNDTflTWogDjWkzph
Bc+39FSC
=c1VA
-----END PGP SIGNATURE-----
Merge tag '5.8-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Four cifs/smb3 fixes: the three for stable fix problems found recently
with change notification including a reference count leak"
* tag '5.8-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number
cifs: fix reference leak for tlink
smb3: fix unneeded error message on change notify
cifs: remove the retry in cifs_poxis_lock_set
smb3: fix access denied on change notify request to some servers
A common question asked when debugging seccomp filters is "how many
filters are attached to your process?" Provide a way to easily answer
this question through /proc/$pid/status with a "Seccomp_filters" line.
Signed-off-by: Kees Cook <keescook@chromium.org>
debugfs_create_u32_array() allocates a small structure to wrap
the data and size information about the array. If users ever
try to remove the file this leads to a leak since nothing ever
frees this wrapper.
That said there are no upstream users of debugfs_create_u32_array()
that'd remove a u32 array file (we only have one u32 array user in
CMA), so there is no real bug here.
Make callers pass a wrapper they allocated. This way the lifetime
management of the wrapper is on the caller, and we can avoid the
potential leak in debugfs.
CC: Chucheng Luo <luochucheng@vivo.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8IkFIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgph7tEACdkJ9CupSjVQBJ35/2wBiUCU6lh0iUbru3
oURuw9AdjzT75en68LOyrhfdGCkNX8xrIoHnh9pGukoPid//R3rAsk+5j4Doc7SY
FBZuH7HbOC/gdTAzSd0tLyaddmf06eq0PPmlWQGUYu29Vy49nvLDM2V+455OgkT9
csCNk/Fq/np7IKVZW56KZ1Iz90Dtp8BVwFJeFdzFK5cpCjKXTyvzS/5GEFqz0rc3
AwvxkBCpDjXQqh5OOs2qM18QUeJAQV0PX1x+XjtASMB3dr4W8D5KDEhaQwiXiGOT
j7yHX/JThtOYI41IiKpF/LP9EQAkavIT1n7ZjXwablq6ecQjj24DxCe0h1uwnKyW
G5Hztjcb2YMhhDVj4q6m/uzD13+ccIeEmVoKjRuGj+0YDFETIBYK8U4GElu7VUIr
yUKqy7jFEHKFZj5eQz5JDl08+zq/ocZHJctqZhCB3DWMbE5f+VCzNGEqGGs2IiRS
J7/S8jB1j7Te8jFXBN3uaNpuj+U/JlYCPqO83fxF14ar3wZ1mblML4cw0sXUJazO
1YI3LU91qBeCNK/9xUc43MzzG/SgdIKJjfoaYYPBPMG5SzVYAtRoUUiNf942UE1k
7u8/leql+RqwSLedjU/AQqzs29p7wcK8nzFBApmTfiiBHYcRbJ6Fxp3/UwTOR3tQ
kwmkUtivtQ==
=31iz
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.8-2020-07-10' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- Fix memleak for error path in registered files (Yang)
- Export CQ overflow state in flags, necessary to fix a case where
liburing doesn't know if it needs to enter the kernel (Xiaoguang)
- Fix for a regression in when user memory is accounted freed, causing
issues with back-to-back ring exit + init if the ulimit -l setting is
very tight.
* tag 'io_uring-5.8-2020-07-10' of git://git.kernel.dk/linux-block:
io_uring: account user memory freed when exit has been queued
io_uring: fix memleak in io_sqe_files_register()
io_uring: fix memleak in __io_sqe_files_update()
io_uring: export cq overflow status to userspace
Reshuffle the (__)kernel_read and (__)kernel_write helpers, and ensure
all users of in-kernel file I/O use them if they don't use iov_iter
based methods already.
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl8Ij8gLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYOcpBAAn157ooLqRrqQisEA6j59rTgkHUuqZMUx+8XjiivX
baHQPmgctza1Xzjc4PjJ1owtLpt4ywcTpY8IDj3vZF1PpffeeuWVzxMTk/aIvhNN
zPK2SJpRlDQHErKEhkTTOfOYoFTgc7vPa5Hvm6AEMaJs8oPtGZ2rnQHzPXENl/TY
TgcLd1ou3iuw19UIAfB+EfuC9uhq7pCPu9+tryNyT2IfM7fqdsIhRESpcodg1ve+
1k6leFIBrXa3MWiBGVUGCrSmlpP9xd22Zl8D/w60WeYWeg7szZoUK2bjhbdIEDZI
tTwkdZ73IKpcxOyzUVbfr2hqNa94zrXCKQGfEGVS/7arV7QH4yvhg9NU9lqVXZKV
ruPoyjsmJkHW52FfEEv1Gfrd6v4H6qZ6iyJEm3ZYNGul85O97t1xA/kKxAIwMuPa
nFhhxHIooT/We3Ao77FROhIob4D5AOfOI4gvkTE15YMzsNxT/yjilQjdDFR5An6A
ckzqb+VyDvcTx2gxR/qaol7b4lzmri4S/8Jt7WXjHOtNe9eXC4kl44leitK5j31H
fHZNyMLJ2+/JF5pGB2rNRNnTeQ7lXKob4Y+qAjRThddDxtdsf5COdZAiIiZbRurR
Ogl2k3sMDdHgNfycK2Bg5Fab9OIWePQlpcGU14afUSPviuNkIYKLGrx92ZWef53j
loI=
=eYsI
-----END PGP SIGNATURE-----
Merge tag 'cleanup-kernel_read_write' of git://git.infradead.org/users/hch/misc
Pull in-kernel read and write op cleanups from Christoph Hellwig:
"Cleanup in-kernel read and write operations
Reshuffle the (__)kernel_read and (__)kernel_write helpers, and ensure
all users of in-kernel file I/O use them if they don't use iov_iter
based methods already.
The new WARN_ONs in combination with syzcaller already found a missing
input validation in 9p. The fix should be on your way through the
maintainer ASAP".
[ This is prep-work for the real changes coming 5.9 ]
* tag 'cleanup-kernel_read_write' of git://git.infradead.org/users/hch/misc:
fs: remove __vfs_read
fs: implement kernel_read using __kernel_read
integrity/ima: switch to using __kernel_read
fs: add a __kernel_read helper
fs: remove __vfs_write
fs: implement kernel_write using __kernel_write
fs: check FMODE_WRITE in __kernel_write
fs: unexport __kernel_write
bpfilter: switch to kernel_write
autofs: switch to kernel_write
cachefiles: switch to kernel_write
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl8IgKwUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpm2Q//bj3ZDIjep9a4d7mRVGeX3OeslLzk
NDB2Vu03B0oZKQFYbQNdpblxy2Cfyz4m8xkNCdsD8EQ2d1zaPWhywJ6vxc1VO5Dw
wRODwRMgVe0hd9dLR8b8GzUO0+4ncpjqmyEyrCRjwPRkghcX8uuSTifXtY+yeDEv
X2BHlSGMjqCFBfq+RTa8Fi3wWFy9QhGy74QVoidMM0ulFLJbWSu0EnCXZ+hZQ4vR
sJokd2SDSP60LE964CwMxuMNUNwSMwL3VrlUm74qx1WVCK8lyYtm231E5CAHRbAw
C/f6sIKoyzyfJbv2HqgvMXvh72hO4MaJgIb8Pbht8a9GZdfk6i2JbcNmHXXk5OMN
GkYLLhkDrj4X/MChNuk20Zsylaij1+CCLb6C4UsQeXF0e/QA6iYIGRmpApGN2gNP
IA8rTz4Ibmd5ZpVMJNPOGSbq3fpPEboEoxVn+fWVvhDTopATxYS85tKqU5Bfvdr5
QcBqqeAL9yludQa520C1lIbGDBOJ57LisybMBVufklx8ZtFNNbHyB/b1YnfUBvRF
8WXVpYkh1ckB4VvVj7qnKY2/JJT0VVhQmTogqwqZy9m+Nb8I4l0pemUsJnypS0qs
KmoBvZmhWhE3tnqmCVzSvuHzO/eYGSfN91AavGBaddFzsqLLe8Hkm8kzlS5bZxGn
OVWGWVvuoSu72s8=
=dfnJ
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v5.8-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
"Fix gfs2 readahead deadlocks by adding a IOCB_NOIO flag that allows
gfs2 to use the generic fiel read iterator functions without having to
worry about being called back while holding locks".
* tag 'gfs2-v5.8-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Rework read and page fault locking
fs: Add IOCB_NOIO flag for generic_file_read_iter
We currently account the memory after the exit work has been run, but
that leaves a gap where a process has closed its ring and until the
memory has been accounted as freed. If the memlocked ulimit is
borderline, then that can introduce spurious setup errors returning
-ENOMEM because the free work hasn't been run yet.
Account this as freed when we close the ring, as not to expose a tiny
gap where setting up a new ring can fail.
Fixes: 85faa7b834 ("io_uring: punt final io_ring_ctx wait-and-free to workqueue")
Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't use 'ctx' at all in io_sq_thread_drop_mm(), it just works
on the mm of the current task. Drop the argument.
Move io_file_put_work() to where we have the other forward declarations
of functions.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
btrfs implements the iter_write op and thus can use the more efficient
iov_iter based splice implementation. For now falling back to the less
efficient default is pretty harmless, but I have a pending series that
removes the default, and thus would cause btrfs to not support splice
at all.
Reported-by: Andy Lavr <andy.lavr@gmail.com>
Tested-by: Andy Lavr <andy.lavr@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Depending on the workloads, the following circular locking dependency
warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
lock) may show up:
======================================================
WARNING: possible circular locking dependency detected
5.0.0-rc1+ #60 Tainted: G W
------------------------------------------------------
fsfreeze/4346 is trying to acquire lock:
0000000026f1d784 (fs_reclaim){+.+.}, at:
fs_reclaim_acquire.part.19+0x5/0x30
but task is already holding lock:
0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
which lock already depends on the new lock.
:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sb_internal);
lock(fs_reclaim);
lock(sb_internal);
lock(fs_reclaim);
*** DEADLOCK ***
4 locks held by fsfreeze/4346:
#0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650
#1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290
#2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650
#3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650
stack backtrace:
Call Trace:
dump_stack+0xe0/0x19a
print_circular_bug.isra.10.cold.34+0x2f4/0x435
check_prev_add.constprop.19+0xca1/0x15f0
validate_chain.isra.14+0x11af/0x3b50
__lock_acquire+0x728/0x1200
lock_acquire+0x269/0x5a0
fs_reclaim_acquire.part.19+0x29/0x30
fs_reclaim_acquire+0x19/0x20
kmem_cache_alloc+0x3e/0x3f0
kmem_zone_alloc+0x79/0x150
xfs_trans_alloc+0xfa/0x9d0
xfs_sync_sb+0x86/0x170
xfs_log_sbcount+0x10f/0x140
xfs_quiesce_attr+0x134/0x270
xfs_fs_freeze+0x4a/0x70
freeze_super+0x1af/0x290
do_vfs_ioctl+0xedc/0x16c0
ksys_ioctl+0x41/0x80
__x64_sys_ioctl+0x73/0xa9
do_syscall_64+0x18f/0xd23
entry_SYSCALL_64_after_hwframe+0x49/0xbe
This is a false positive as all the dirty pages are flushed out before
the filesystem can be frozen.
One way to avoid this splat is to add GFP_NOFS to the affected allocation
calls by using the memalloc_nofs_save()/memalloc_nofs_restore() pair.
This shouldn't matter unless the system is really running out of memory.
In that particular case, the filesystem freeze operation may fail while
it was succeeding previously.
Without this patch, the command sequence below will show that the lock
dependency chain sb_internal -> fs_reclaim exists.
# fsfreeze -f /home
# fsfreeze --unfreeze /home
# grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal
After applying the patch, such sb_internal -> fs_reclaim lock dependency
chain can no longer be found. Because of that, the locking dependency
warning will not be shown.
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
While debugging a patch that I wrote I was hitting use-after-free panics
when accessing block groups on unmount. This turned out to be because
in the nocow case if we bail out of doing the nocow for whatever reason
we need to call btrfs_dec_nocow_writers() if we called the inc. This
puts our block group, but a few error cases does
if (nocow) {
btrfs_dec_nocow_writers();
goto error;
}
unfortunately, error is
error:
if (nocow)
btrfs_dec_nocow_writers();
so we get a double put on our block group. Fix this by dropping the
error cases calling of btrfs_dec_nocow_writers(), as it's handled at the
error label now.
Fixes: 762bf09893 ("btrfs: improve error handling in run_delalloc_nocow")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Don't leak a reference to tlink during the NOTIFY ioctl
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org> # v5.6+
Commit
bf67fad19e ("efi: Use more granular check for availability for variable services")
introduced a check into the efivarfs, efi-pstore and other drivers that
aborts loading of the module if not all three variable runtime services
(GetVariable, SetVariable and GetNextVariable) are supported. However, this
results in efivarfs being unavailable entirely if only SetVariable support
is missing, which is only needed if you want to make any modifications.
Also, efi-pstore and the sysfs EFI variable interface could be backed by
another implementation of the 'efivars' abstraction, in which case it is
completely irrelevant which services are supported by the EFI firmware.
So make the generic 'efivars' abstraction dependent on the availibility of
the GetVariable and GetNextVariable EFI runtime services, and add a helper
'efivar_supports_writes()' to find out whether the currently active efivars
abstraction supports writes (and wire it up to the availability of
SetVariable for the generic one).
Then, use the efivar_supports_writes() helper to decide whether to permit
efivarfs to be mounted read-write, and whether to enable efi-pstore or the
sysfs EFI variable interface altogether.
Fixes: bf67fad19e ("efi: Use more granular check for availability for variable services")
Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Tested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Link: https://lore.kernel.org/r/20200708171905.15396-1-grandmaster@al2klimov.de
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: Jan Kara <jack@suse.cz>
In order to correctly account/limit space usage, should initialize
quota info before calling quota related functions.
Link: https://lore.kernel.org/r/20200626054959.114177-1-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
sbi->s_freeinodes_counter is only decreased by the ext2 code, it is never
increased. This patch fixes it.
Note that sbi->s_freeinodes_counter is only used in the algorithm that
tries to find the group for new allocations, so this bug is not easily
visible (the only visibility is that the group finding algorithm selects
inoptinal result).
Link: https://lore.kernel.org/r/alpine.LRH.2.02.2004201538300.19436@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Almost all callers of ext2_find_entry() transform NULL return value to
-ENOENT, so just let ext2_find_entry() retuen -ENOENT instead of NULL
if no valid entry found, and also switch to check the return value of
ext2_inode_by_name() in ext2_lookup() and ext2_get_parent().
Link: https://lore.kernel.org/r/20200608034043.10451-2-yi.zhang@huawei.com
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
The same to commit <36de928641ee4> (ext4: propagate errors up to
ext4_find_entry()'s callers') in ext4, also return error instead of NULL
pointer in case of some error happens in ext2_find_entry() (e.g. -ENOMEM
or -EIO). This could avoid a negative dentry cache entry installed even
it failed to read directory block due to IO error.
Link: https://lore.kernel.org/r/20200608034043.10451-1-yi.zhang@huawei.com
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
In the process of changing value for existing EA,
there is an improper assignment of e_value_offs(setting to 0),
because it will be reset to incorrect value in the following
loop(shifting EA values before target). Delayed assignment
can avoid this issue.
Link: https://lore.kernel.org/r/20200603084429.25344-1-cgxu519@mykernel.net
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jan Kara <jack@suse.cz>
meta inode's pages are used for encrypted, verity and compressed blocks,
so the meta inode's cache invalidation condition in do_checkpoint() should
consider compression as well, not just for verity and encryption, fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For those applications which are not willing to use io_uring_enter()
to reap and handle cqes, they may completely rely on liburing's
io_uring_peek_cqe(), but if cq ring has overflowed, currently because
io_uring_peek_cqe() is not aware of this overflow, it won't enter
kernel to flush cqes, below test program can reveal this bug:
static void test_cq_overflow(struct io_uring *ring)
{
struct io_uring_cqe *cqe;
struct io_uring_sqe *sqe;
int issued = 0;
int ret = 0;
do {
sqe = io_uring_get_sqe(ring);
if (!sqe) {
fprintf(stderr, "get sqe failed\n");
break;;
}
ret = io_uring_submit(ring);
if (ret <= 0) {
if (ret != -EBUSY)
fprintf(stderr, "sqe submit failed: %d\n", ret);
break;
}
issued++;
} while (ret > 0);
assert(ret == -EBUSY);
printf("issued requests: %d\n", issued);
while (issued) {
ret = io_uring_peek_cqe(ring, &cqe);
if (ret) {
if (ret != -EAGAIN) {
fprintf(stderr, "peek completion failed: %s\n",
strerror(ret));
break;
}
printf("left requets: %d\n", issued);
continue;
}
io_uring_cqe_seen(ring, cqe);
issued--;
printf("left requets: %d\n", issued);
}
}
int main(int argc, char *argv[])
{
int ret;
struct io_uring ring;
ret = io_uring_queue_init(16, &ring, 0);
if (ret) {
fprintf(stderr, "ring setup failed: %d\n", ret);
return 1;
}
test_cq_overflow(&ring);
return 0;
}
To fix this issue, export cq overflow status to userspace by adding new
IORING_SQ_CQ_OVERFLOW flag, then helper functions() in liburing, such as
io_uring_peek_cqe, can be aware of this cq overflow and do flush accordingly.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Except for pktdvd, the only places setting congested bits are file
systems that allocate their own backing_dev_info structures. And
pktdvd is a deprecated driver that isn't useful in stack setup
either. So remove the dead congested_fn stacking infrastructure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
[axboe: fixup unused variables in bcache/request.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
check_disk_change isn't for consumers of the block layer, so remove
the comment mentioning it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
flush_disk has only two callers, so open code it there. That also helps
clarifying the error message for the particular case, and allows to remove
setting bd_invalidated in check_disk_size_change, which will be cleared
again instantly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's safe to call kfree() with a NULL pointer, but it's also pointless.
Most of the time we don't have any data to free, and at millions of
requests per second, the redundant function call adds noticeable
overhead (about 1.3% of the runtime).
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The "apoll" variable is freed and then used on the next line. We need
to move the free down a few lines.
Fixes: 0be0b0e33b ("io_uring: simplify io_async_task_func()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Wire up ext4 to support inline encryption via the helper functions which
fs/crypto/ now provides. This includes:
- Adding a mount option 'inlinecrypt' which enables inline encryption
on encrypted files where it can be used.
- Setting the bio_crypt_ctx on bios that will be submitted to an
inline-encrypted file.
Note: submit_bh_wbc() in fs/buffer.c also needed to be patched for
this part, since ext4 sometimes uses ll_rw_block() on file data.
- Not adding logically discontiguous data to bios that will be submitted
to an inline-encrypted file.
- Not doing filesystem-layer crypto on inline-encrypted files.
Co-developed-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20200702015607.1215430-5-satyat@google.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Wire up f2fs to support inline encryption via the helper functions which
fs/crypto/ now provides. This includes:
- Adding a mount option 'inlinecrypt' which enables inline encryption
on encrypted files where it can be used.
- Setting the bio_crypt_ctx on bios that will be submitted to an
inline-encrypted file.
- Not adding logically discontiguous data to bios that will be submitted
to an inline-encrypted file.
- Not doing filesystem-layer crypto on inline-encrypted files.
This patch includes a fix for a race during IPU by
Sahitya Tummala <stummala@codeaurora.org>
Signed-off-by: Satya Tangirala <satyat@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20200702015607.1215430-4-satyat@google.com
Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add support for inline encryption to fs/crypto/. With "inline
encryption", the block layer handles the decryption/encryption as part
of the bio, instead of the filesystem doing the crypto itself via
Linux's crypto API. This model is needed in order to take advantage of
the inline encryption hardware present on most modern mobile SoCs.
To use inline encryption, the filesystem needs to be mounted with
'-o inlinecrypt'. Blk-crypto will then be used instead of the traditional
filesystem-layer crypto whenever possible to encrypt the contents
of any encrypted files in that filesystem. Fscrypt still provides the key
and IV to use, and the actual ciphertext on-disk is still the same;
therefore it's testable using the existing fscrypt ciphertext verification
tests.
Note that since blk-crypto has a fallback to Linux's crypto API, and
also supports all the encryption modes currently supported by fscrypt,
this feature is usable and testable even without actual inline
encryption hardware.
Per-filesystem changes will be needed to set encryption contexts when
submitting bios and to implement the 'inlinecrypt' mount option. This
patch just adds the common code.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20200702015607.1215430-3-satyat@google.com
Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
- don't panic kernel if f2fs_get_node_page() fails in
f2fs_recover_inline_data() or f2fs_recover_inline_xattr();
- return error number of f2fs_truncate_blocks() to
f2fs_recover_inline_data()'s caller;
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should not be logging a warning repeatedly on change notify.
CC: Stable <stable@vger.kernel.org> # v5.6+
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Consolidate the two in-kernel read helpers to make upcoming changes
easier. The only difference are the missing call to rw_verify_area
in kernel_read, and an access_ok check that doesn't make sense for
kernel buffers to start with.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Consolidate the two in-kernel write helpers to make upcoming changes
easier. The only difference are the missing call to rw_verify_area
in kernel_write, and an access_ok check that doesn't make sense for
kernel buffers to start with.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a WARN_ON_ONCE if the file isn't actually open for write. This
matches the check done in vfs_write, but actually warn warns as a
kernel user calling write on a file not opened for writing is a pretty
obvious programming error.
Signed-off-by: Christoph Hellwig <hch@lst.de>
While pipes don't really need sb_writers projection, __kernel_write is an
interface better kept private, and the additional rw_verify_area does not
hurt here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ian Kent <raven@themaw.net>
__kernel_write doesn't take a sb_writers references, which we need here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Added a new gc_urgent mode, GC_URGENT_LOW, in which mode
F2FS will lower the bar of checking idle in order to
process outstanding discard commands and GC a little bit
aggressively.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If two readahead threads having same offset enter in readpages, every read
IOs are split and issued to the disk which giving lower bandwidth.
This patch tries to avoid redundant readahead calls.
Fixes one build error reported by Randy.
Fix build error when F2FS_FS_COMPRESSION is not set/enabled.
This label is needed in either case.
../fs/f2fs/data.c: In function ‘f2fs_mpage_readpages’:
../fs/f2fs/data.c:2327:5: error: label ‘next_page’ used but not defined
goto next_page;
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If f2fs_grab_cache_page() fails, it needs to return -ENOMEM.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The parameter op_flag is not used in f2fs_get_read_data_page(),
but it is used in f2fs_grab_read_bio(). Obviously, op_flag is
not passed to f2fs_grab_read_bio() successfully. We need to add
parameter in f2fs_submit_page_read() to pass it.
The case:
- gc_data_segment
- f2fs_get_read_data_page(.., op_flag = REQ_RAHEAD,..)
- f2fs_submit_page_read
- f2fs_grab_read_bio(.., op_flag = 0, ..)
Signed-off-by: Jia Yang <jiayang5@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
to two independent functions:
- f2fs_allocate_new_segment() for specified type segment allocation
- f2fs_allocate_new_segments() for all data type segments allocation
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
if get_node_path() return -E2BIG and trace of
f2fs_truncate_inode_blocks_enter/exit enabled
then the matching-pair of trace_exit will lost
in log.
Signed-off-by: Yubo Feng <fengyubo3@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the f2fs_unlink we do not add trace end for some
error paths, just add.
Signed-off-by: Lihong Kou <koulihong@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_write_raw_pages(), we need to check page dirty status before
writeback, because there could be a racer (e.g. reclaimer) helps
writebacking the dirty page.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The parameter compr is unused in the f2fs_cluster_blocks function
so we no longer need to pass it as a parameter.
Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
to show f2fs_fiemap()'s result as below:
f2fs_fiemap: dev = (251,0), ino = 7, lblock:0, pblock:1625292800, len:2097152, flags:0, ret:0
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
to show f2fs_bmap()'s result as below:
f2fs_bmap: dev = (251,0), ino = 7, lblock:0, pblock:396800
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If compression is disable, we should return zero rather than -EOPNOTSUPP
to indicate f2fs_bmap() is not supported.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In current version, @state will only be FREE_NID. This parameter
has no real effect so remove it to keep clean.
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
stakable/stackable
Signed-off-by: Liu Song <fishland@aliyun.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Filesystem including f2fs should support stable page for special
device like software raid, however there is one missing path that
page could be updated while it is writeback state as below, fix
this.
- gc_node_segment
- f2fs_move_node_page
- __write_node_page
- set_page_writeback
- do_read_inode
- f2fs_init_extent_tree
- __f2fs_init_extent_tree
i_ext->len = 0;
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When f2fs_ioc_gc_range performs multiple segments gc ops, the return
value of f2fs_ioc_gc_range is determined by the last segment gc ops.
If its ops failed, the f2fs_ioc_gc_range will be considered to be failed
despite some of previous segments gc ops succeeded. Therefore, so we
fix: Redefine the return value of getting victim ops and add exception
handle for f2fs_gc. In particular, 1).if target has no valid block, it
will go on. 2).if target sectoion has valid block(s), but it is current
section, we will reminder the caller.
Signed-off-by: Qilong Zhang <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use validation of @fio to inidcate whether caller want to serialize IOs
in io.io_list or not, then @add_list will be redundant, remove it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
- to avoid race between checkpoint and quota file writeback, it
just needs to hold read lock of node_write in writeback path.
- node_write lock has covered all LFS data write paths, it's not
necessary, we only need to hold node_write lock at write path of
quota file.
This refactors commit ca7f76e680 ("f2fs: fix wrong discard space").
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The caller of cifs_posix_lock_set will do retry(like
fcntl_setlk64->do_lock_file_wait) if we will wait for any file_lock.
So the retry in cifs_poxis_lock_set seems duplicated, remove it to
make a cleanup.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: NeilBrown <neilb@suse.de>
read permission, not just read attributes permission, is required
on the directory.
See MS-SMB2 (protocol specification) section 3.3.5.19.
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org> # v5.6+
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
So far, gfs2 has taken the inode glocks inside the ->readpage and
->readahead address space operations. Since commit d4388340ae ("fs:
convert mpage_readpages to mpage_readahead"), gfs2_readahead is passed
the pages to read ahead locked. With that, the current holder of the
inode glock may be trying to lock one of those pages while
gfs2_readahead is trying to take the inode glock, resulting in a
deadlock.
Fix that by moving the lock taking to the higher-level ->read_iter file
and ->fault vm operations. This also gets rid of an ugly lock inversion
workaround in gfs2_readpage.
The cache consistency model of filesystems like gfs2 is such that if
data is found in the page cache, the data is up to date and can be used
without taking any filesystem locks. If a page is not cached,
filesystem locks must be taken before populating the page cache.
To avoid taking the inode glock when the data is already cached,
gfs2_file_read_iter first tries to read the data with the IOCB_NOIO flag
set. If that fails, the inode glock is taken and the operation is
retried with the IOCB_NOIO flag cleared.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl8EdTkACgkQxWXV+ddt
WDv6xA/9Hguo/k6oj/7Nl9n3UUZ7gp44R/jy37fhMuNcwuEDuqIEfAgGXupdJVaj
pYDorUMRUQfI2yLB1iHAnPgBMKBidSroDsdrRHKuimnhABSO2/KX/KXPianIIRGi
wPvqZR04L565LNpRlDQx7OYkJWey7b6xf47UZqDglivnKY1OwCJlXgfCj/9FApr0
Y+PVlgEU78ExTeAHs/h8ofZ/f5T2eqiluBSFVykzCg1NngaQVOKpN3gnWEatUAvM
ekm6U4E1ZR9oOprdhlf6V96ztGzVTRKB1vFIeCvJLqLNIe+0pxlRfRn2aOj8vzEO
DRjgOlhyAIgypp78SwCspjhvejvVneSFdEGSVvHOw1ombB//OJ1qBb5G/lIcwCj3
PZ3OnQJV7+/Ty7Xt/X26W841zvnu90K0di0CsOPehtbkgkR4txgHCJB9mSlsMugN
awN5Ryy1rw1cAM5GspXG9EEOvJmnSizQf4BcK649IG5eUKThYYLc5mp68jiMljs0
NHFPg5P4yTRjk7Yqgxq5VvTPLLJo5j5xxqtY/1zDWuguRa40wIoy/JUJaJoPg9Vd
221/qRG4R4xGyZXGx6XTiWK+M3qjTlS9My9tGoWygwlExRkr7Uli9Ikef3U0tBoF
bjTcfCNOuCp+JECHNcnMZ9fhhFaMwIL1V4OflB1iicBAtXxo8Lk=
=+4BZ
-----END PGP SIGNATURE-----
Merge tag 'for-5.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- regression fix of a leak in global block reserve accounting
- fix a (hard to hit) race of readahead vs releasepage that could lead
to crash
- convert all remaining uses of comment fall through annotations to the
pseudo keyword
- fix crash when mounting a fuzzed image with -o recovery
* tag 'for-5.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: reset tree root pointer after error in init_tree_roots
btrfs: fix reclaim_size counter leak after stealing from global reserve
btrfs: fix fatal extent_buffer readahead vs releasepage race
btrfs: convert comments to fallthrough annotations
First of all don't spin in io_ring_ctx_wait_and_kill() on iopoll.
Requests won't complete faster because of that, but only lengthen
io_uring_release().
The same goes for offloaded cleanup in io_ring_exit_work() -- it
already has waiting loop, don't do blocking active spinning.
For that, pass min=0 into io_iopoll_[try_]reap_events(), so it won't
actively spin. Leave the function if io_do_iopoll() there can't
complete a request to sleep in io_ring_exit_work().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Nobody checks io_iopoll_check()'s output parameter @nr_events.
Remove the parameter and declare it further down the stack.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_iopoll_reap_events() doesn't care about returned valued of
io_iopoll_getevents() and does the same checks for list emptiness
and need_resched(). Just use io_do_iopoll().
io_sq_thread() doesn't check return value as well. It also passes min=0,
so there never be the second iteration inside io_poll_getevents().
Inline it there too.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make sure the rtbitmap is large enough to store the entire bitmap.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Ensure that the realtime bitmap file is backed entirely by written
extents. No holes, no unwritten blocks, etc.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
This debug code is called on every xfs_iflush() call, which then
checks every inode in the buffer for non-zero unlinked list field.
Hence it checks every inode in the cluster buffer every time a
single inode on that cluster it flushed. This is resulting in:
- 38.91% 5.33% [kernel] [k] xfs_iflush
- 17.70% xfs_iflush
- 9.93% xfs_inobp_check
4.36% xfs_buf_offset
10% of the CPU time spent flushing inodes is repeatedly checking
unlinked fields in the buffer. We don't need to do this.
The other place we call xfs_inobp_check() is
xfs_iunlink_update_dinode(), and this is after we've done this
assert for the agino we are about to write into that inode:
ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino));
which means we've already checked that the agino we are about to
write is not 0 on debug kernels. The inode buffer verifiers do
everything else we need, so let's just remove this debug code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_iflush_done() does 3 distinct operations to the inodes attached
to the buffer. Separate these operations out into functions so that
it is easier to modify these operations independently in future.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that we have all the dirty inodes attached to the cluster
buffer, we don't actually have to do radix tree lookups to find
them. Sure, the radix tree is efficient, but walking a linked list
of just the dirty inodes attached to the buffer is much better.
We are also no longer dependent on having a locked inode passed into
the function to determine where to start the lookup. This means we
can drop it from the function call and treat all inodes the same.
We also make xfs_iflush_cluster skip inodes marked with
XFS_IRECLAIM. This we avoid races with inodes that reclaim is
actively referencing or are being re-initialised by inode lookup. If
they are actually dirty, they'll get written by a future cluster
flush....
We also add a shutdown check after obtaining the flush lock so that
we catch inodes that are dirty in memory and may have inconsistent
state due to the shutdown in progress. We abort these inodes
directly and so they remove themselves directly from the buffer list
and the AIL rather than having to wait for the buffer to be failed
and callbacks run to be processed correctly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
with xfs_iflush() gone, we can rename xfs_iflush_int() back to
xfs_iflush(). Also move it up above xfs_iflush_cluster() so we don't
need the forward definition any more.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now we have a cached buffer on inode log items, we don't need
to do buffer lookups when flushing inodes anymore - all we need
to do is lock the buffer and we are ready to go.
This largely gets rid of the need for xfs_iflush(), which is
essentially just a mechanism to look up the buffer and flush the
inode to it. Instead, we can just call xfs_iflush_cluster() with a
few modifications to ensure it also flushes the inode we already
hold locked.
This allows the AIL inode item pushing to be almost entirely
non-blocking in XFS - we won't block unless memory allocation
for the cluster inode lookup blocks or the block device queues are
full.
Writeback during inode reclaim becomes a little more complex because
we now have to lock the buffer ourselves, but otherwise this change
is largely a functional no-op that removes a whole lot of code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Rather than attach inodes to the cluster buffer just when we are
doing IO, attach the inodes to the cluster buffer when they are
dirtied. The means the buffer always carries a list of dirty inodes
that reference it, and we can use that list to make more fundamental
changes to inode writeback that aren't otherwise possible.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Once we have inodes pinning the cluster buffer and attached whenever
they are dirty, we no longer have a guarantee that the items are
flush locked when we lock the cluster buffer. Hence we cannot just
walk the buffer log item list and modify the attached inodes.
If the inode is not flush locked, we have to ILOCK it first and then
flush lock it to do all the prerequisite checks needed to avoid
races with other code. This is already handled by
xfs_ifree_get_one_inode(), so rework the inode iteration loop and
function to update all inodes in cache whether they are attached to
the buffer or not.
Note: we also remove the copying of the log item lsn to the
ili_flush_lsn as xfs_iflush_done() now uses the XFS_ISTALE flag to
trigger aborts and so flush lsn matching is not needed in IO
completion for processing freed inodes.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Inode reclaim is quite different now to the way described in various
comments, so update all the comments explaining what it does and how
it works.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Clean up xfs_reclaim_inodes() callers. Most callers want blocking
behaviour, so just make the existing SYNC_WAIT behaviour the
default.
For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag()
directly because we just want optimistic clean inode reclaim to be
done in the background.
For xfs_quiesce_attr() we can just remove the inode reclaim calls as
they are a historic relic that was required to flush dirty inodes
that contained unlogged changes. We now log all changes to the
inodes, so the sync AIL push from xfs_log_quiesce() called by
xfs_quiesce_attr() will do all the required inode writeback for
freeze.
Seeing as we now want to loop until all reclaimable inodes have been
reclaimed, make xfs_reclaim_inodes() loop on the XFS_ICI_RECLAIM_TAG
tag rather than having xfs_reclaim_inodes_ag() tell it that inodes
were skipped. This is much more reliable and will always loop until
all reclaimable inodes are reclaimed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
All background reclaim is SYNC_TRYLOCK already, and even blocking
reclaim (SYNC_WAIT) can use trylock mechanisms as
xfs_reclaim_inodes_ag() will keep cycling until there are no more
reclaimable inodes. Hence we can kill SYNC_TRYLOCK from inode
reclaim and make everything unconditionally non-blocking.
We remove all the optimistic "avoid blocking on locks" checks done
in xfs_reclaim_inode_grab() as nothing blocks on locks anymore.
Further, checking XFS_IFLOCK optimistically can result in detecting
inodes in the process of being cleaned (i.e. between being removed
from the AIL and having the flush lock dropped), so for
xfs_reclaim_inodes() to reliably reclaim all inodes we need to drop
these checks anyway.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we attempt to reclaim an inode, the first thing we do is take
the inode lock. This is blocking right now, so if the inode being
accessed by something else (e.g. being flushed to the cluster
buffer) we will block here.
Change this to a trylock so that we do not block inode reclaim
unnecessarily here.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Inode reclaim will still throttle direct reclaim on the per-ag
reclaim locks. This is no longer necessary as reclaim can run
non-blocking now. Hence we can remove these locks so that we don't
arbitrarily block reclaimers just because there are more direct
reclaimers than there are AGs.
This can result in multiple reclaimers working on the same range of
an AG, but this doesn't cause any apparent issues. Optimising the
spread of concurrent reclaimers for best efficiency can be done in a
future patchset.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We no longer need to issue IO from shrinker based inode reclaim to
prevent spurious OOM killer invocation. This leaves only the global
filesystem management operations such as unmount needing to
writeback dirty inodes and reclaim them.
Instead of using the reclaim pass to write dirty inodes before
reclaiming them, use the AIL to push all the dirty inodes before we
try to reclaim them. This allows us to remove all the conditional
SYNC_WAIT locking and the writeback code from xfs_reclaim_inode()
and greatly simplify the checks we need to do to reclaim an inode.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that dirty inode writeback doesn't cause read-modify-write
cycles on the inode cluster buffer under memory pressure, the need
to throttle memory reclaim to the rate at which we can clean dirty
inodes goes away. That is due to the fact that we no longer thrash
inode cluster buffers under memory pressure to clean dirty inodes.
This means inode writeback no longer stalls on memory allocation
or read IO, and hence can be done asynchronously without generating
memory pressure. As a result, blocking inode writeback in reclaim is
no longer necessary to prevent reclaim priority windup as cleaning
dirty inodes is no longer dependent on having memory reserves
available for the filesystem to make progress reclaiming inodes.
Hence we can convert inode reclaim to be non-blocking for shrinker
callouts, both for direct reclaim and kswapd.
On a vanilla kernel, running a 16-way fsmark create workload on a
4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
userspace mlock(). The OOM killer gets invoked at 15GB of
pinned RAM.
Without the inode cluster pinning, this non-blocking reclaim patch
triggers premature OOM killer invocation with the same memory
pinning, sometimes with as much as 45% of RAM being free. It's
trivially easy to trigger the OOM killer when reclaim does not
block.
With pinning inode clusters in RAM and then adding this patch, I can
reliably pin 14.5GB of RAM and still have the fsmark workload run to
completion. The OOM killer gets invoked 14.75GB of pinned RAM, which
is only a small amount of memory less than the vanilla kernel. It is
much more reliable than just with async reclaim alone.
simoops shows that allocation stalls go away when async reclaim is
used. Vanilla kernel:
Run time: 1924 seconds
Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)
With inode cluster pinning and async reclaim:
Run time: 1924 seconds
Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)
Latencies don't really change much, nor does the work rate. However,
allocation almost never stalls with these changes, whilst the
vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample
period. This difference is due to inode reclaim being largely
non-blocking now.
IOWs, once we have pinned inode cluster buffers, we can make inode
reclaim non-blocking without a major risk of premature and/or
spurious OOM killer invocation, and without any changes to memory
reclaim infrastructure.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we dirty an inode, we are going to have to write it disk at
some point in the near future. This requires the inode cluster
backing buffer to be present in memory. Unfortunately, under severe
memory pressure we can reclaim the inode backing buffer while the
inode is dirty in memory, resulting in stalling the AIL pushing
because it has to do a read-modify-write cycle on the cluster
buffer.
When we have no memory available, the read of the cluster buffer
blocks the AIL pushing process, and this causes all sorts of issues
for memory reclaim as it requires inode writeback to make forwards
progress. Allocating a cluster buffer causes more memory pressure,
and results in more cluster buffers to be reclaimed, resulting in
more RMW cycles to be done in the AIL context and everything then
backs up on AIL progress. Only the synchronous inode cluster
writeback in the the inode reclaim code provides some level of
forwards progress guarantees that prevent OOM-killer rampages in
this situation.
Fix this by pinning the inode backing buffer to the inode log item
when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
This may mean the first modification of an inode that has been held
in cache for a long time may block on a cluster buffer read, but
we can do that in transaction context and block safely until the
buffer has been allocated and read.
Once we have the cluster buffer, the inode log item takes a
reference to it, pinning it in memory, and attaches it to the log
item for future reference. This means we can always grab the cluster
buffer from the inode log item when we need it.
When the inode is finally cleaned and removed from the AIL, we can
drop the reference the inode log item holds on the cluster buffer.
Once all inodes on the cluster buffer are clean, the cluster buffer
will be unpinned and it will be available for memory reclaim to
reclaim again.
This avoids the issues with needing to do RMW cycles in the AIL
pushing context, and hence allows complete non-blocking inode
flushing to be performed by the AIL pushing context.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_ail_delete_one() is called directly from dquot and inode IO
completion, as well as from the generic xfs_trans_ail_delete()
function. Inodes are about to have their own failure handling, and
dquots will in future, too. Pull the clearing of the LI_FAILED flag
up into the callers so we can customise the code appropriately.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When an buffer IO error occurs, we want to mark all
the log items attached to the buffer as failed. Open code
the error handling loop so that we can modify the flagging for the
different types of objects directly and independently of each other.
This also allows us to remove the ->iop_error method from the log
item operations.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Currently when a buffer with attached log items has an IO error
it called ->iop_error for each attched log item. These all call
xfs_set_li_failed() to handle the error, but we are about to change
the way log items manage buffers. hence we first need to remove the
per-item dependency on buffer handling done by xfs_set_li_failed().
We already have specific buffer type IO completion routines, so move
the log item error handling out of the generic error handling and
into the log item specific functions so we can implement per-type
error handling easily.
This requires a more complex return value from the error handling
code so that we can take the correct action the failure handling
requires. This results in some repeated boilerplate in the
functions, but that can be cleaned up later once all the changes
cascade through this code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
They are not used anymore, so remove them from the log item and the
buffer iodone attachment interfaces.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that we've sorted inode and dquot buffers, we can apply the same
cleanups to dirty buffers with buffer log items. They only have one
callback, too, so we don't need the log item callback. Collapse the
iodone functions and remove all the now unnecessary infrastructure
around callback processing.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[BUG]
The following small test script can trigger ASSERT() at unmount time:
mkfs.btrfs -f $dev
mount $dev $mnt
mount -o remount,discard=async $mnt
umount $mnt
The call trace:
assertion failed: atomic_read(&block_group->count) == 1, in fs/btrfs/block-group.c:3431
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3204!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 4 PID: 10389 Comm: umount Tainted: G O 5.8.0-rc3-custom+ #68
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
btrfs_free_block_groups.cold+0x22/0x55 [btrfs]
close_ctree+0x2cb/0x323 [btrfs]
btrfs_put_super+0x15/0x17 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x17/0x30 [btrfs]
deactivate_locked_super+0x3b/0xa0
deactivate_super+0x40/0x50
cleanup_mnt+0x135/0x190
__cleanup_mnt+0x12/0x20
task_work_run+0x64/0xb0
__prepare_exit_to_usermode+0x1bc/0x1c0
__syscall_return_slowpath+0x47/0x230
do_syscall_64+0x64/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The code:
ASSERT(atomic_read(&block_group->count) == 1);
btrfs_put_block_group(block_group);
[CAUSE]
Obviously it's some btrfs_get_block_group() call doesn't get its put
call.
The offending btrfs_get_block_group() happens here:
void btrfs_mark_bg_unused(struct btrfs_block_group *bg)
{
if (list_empty(&bg->bg_list)) {
btrfs_get_block_group(bg);
list_add_tail(&bg->bg_list, &fs_info->unused_bgs);
}
}
So every call sites removing the block group from unused_bgs list should
reduce the ref count of that block group.
However for async discard, it didn't follow the call convention:
void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info)
{
list_for_each_entry_safe(block_group, next, &fs_info->unused_bgs,
bg_list) {
list_del_init(&block_group->bg_list);
btrfs_discard_queue_work(&fs_info->discard_ctl, block_group);
}
}
And in btrfs_discard_queue_work(), it doesn't call
btrfs_put_block_group() either.
[FIX]
Fix the problem by reducing the reference count when we grab the block
group from unused_bgs list.
Reported-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Fixes: 6e80d4f8c4 ("btrfs: handle empty block_group removal for async discard")
CC: stable@vger.kernel.org # 5.6+
Tested-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When building a kernel with CONFIG_PSTORE=y and CONFIG_CRYPTO not set,
a build error happens:
ld: fs/pstore/platform.o: in function `pstore_dump':
platform.c:(.text+0x3f9): undefined reference to `crypto_comp_compress'
ld: fs/pstore/platform.o: in function `pstore_get_backend_records':
platform.c:(.text+0x784): undefined reference to `crypto_comp_decompress'
This because some pstore code uses crypto_comp_(de)compress regardless
of the CONFIG_CRYPTO status. Fix it by wrapping the (de)compress usage
by IS_ENABLED(CONFIG_PSTORE_COMPRESS)
Signed-off-by: Matteo Croce <mcroce@linux.microsoft.com>
Link: https://lore.kernel.org/lkml/20200706234045.9516-1-mcroce@linux.microsoft.com
Fixes: cb3bee0369 ("pstore: Use crypto compress API")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Make sure iomap_end is always called when iomap_begin succeeds.
Without this fix, iomap_end won't be called when a filesystem's
iomap_begin operation returns an invalid mapping, bypassing any
unlocking done in iomap_end. With this fix, the unlocking will still
happen.
This bug was found by Bob Peterson during code review. It's unlikely
that such iomap_begin bugs will survive to affect users, so backporting
this fix seems unnecessary.
Fixes: ae259a9c85 ("fs: introduce iomap infrastructure")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Similar to inodes, we can call the dquot IO completion functions
directly from the buffer completion code, removing another user of
log item callbacks for IO completion processing.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Having different io completion callbacks for different inode states
makes things complex. We can detect if the inode is stale via the
XFS_ISTALE flag in IO completion, so we don't need a special
callback just for this.
This means inodes only have a single iodone callback, and inode IO
completion is entirely buffer centric at this point. Hence we no
longer need to use a log item callback at all as we can just call
xfs_iflush_done() directly from the buffer completions and walk the
buffer log item list to complete the all inodes under IO.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When we've emptied the buffer log item list, it does a list_del_init
on itself to reset it's pointers to itself. This is unnecessary as
the list is already empty at this point - it was a left-over
fragment from the list_head conversion of the buffer log item list.
Remove them.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
All unmarked dirty buffers should be in the AIL and have log items
attached to them. Hence when they are written, we will run a
callback to remove the item from the AIL if appropriate. Now that
we've handled inode and dquot buffers, all remaining calls are to
xfs_buf_iodone() and so we can hard code this rather than use an
indirect call.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Log recovery has it's own buffer write completion handler for
buffers that it directly recovers. Convert these to direct calls by
flagging these buffers as being log recovery buffers. The flag will
get cleared by the log recovery IO completion routine, so it will
never leak out of log recovery.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
dquot buffers always have write IO callbacks, so by marking them
directly we can avoid needing to attach ->b_iodone functions to
them. This avoids an indirect call, and makes future modifications
much simpler.
This is largely a rearrangement of the code at this point - no IO
completion functionality changes at this point, just how the
code is run is modified.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Inode buffers always have write IO callbacks, so by marking them
directly we can avoid needing to attach ->b_iodone functions to
them. This avoids an indirect call, and makes future modifications
much simpler.
While this is largely a refactor of existing functionality, we
broaden the scope of the flag to beyond where inodes are explicitly
attached because future changes need to know what type of log items
are attached to the buffer. Adding this buffer flag may invoke the
inode iodone callback in cases where it wouldn't have been
previously, but this is not a functional change because the callback
is identical to the normal buffer write iodone callback when inodes
are not attached.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The inode log item is kind of special in that it can be aggregating
new changes in memory at the same time time existing changes are
being written back to disk. This means there are fields in the log
item that are accessed concurrently from contexts that don't share
any locking at all.
e.g. updating ili_last_fields occurs at flush time under the
ILOCK_EXCL and flush lock at flush time, under the flush lock at IO
completion time, and is read under the ILOCK_EXCL when the inode is
logged. Hence there is no actual serialisation between reading the
field during logging of the inode in transactions vs clearing the
field in IO completion.
We currently get away with this by the fact that we are only
clearing fields in IO completion, and nothing bad happens if we
accidentally log more of the inode than we actually modify. Worst
case is we consume a tiny bit more memory and log bandwidth.
However, if we want to do more complex state manipulations on the
log item that requires updates at all three of these potential
locations, we need to have some mechanism of serialising those
operations. To do this, introduce a spinlock into the log item to
serialise internal state.
This could be done via the xfs_inode i_flags_lock, but this then
leads to potential lock inversion issues where inode flag updates
need to occur inside locks that best nest inside the inode log item
locks (e.g. marking inodes stale during inode cluster freeing).
Using a separate spinlock avoids these sorts of problems and
simplifies future code.
This does not touch the use of ili_fields in the item formatting
code - that is entirely protected by the ILOCK_EXCL at this point in
time, so it remains untouched.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This was used to track if the item had logged fields being flushed
to disk. We log everything in the inode these days, so this logic is
no longer needed. Remove it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In tracking down a problem in this patchset, I discovered we are
reclaiming dirty stale inodes. This wasn't discovered until inodes
were always attached to the cluster buffer and then the rcu callback
that freed inodes was assert failing because the inode still had an
active pointer to the cluster buffer after it had been reclaimed.
Debugging the issue indicated that this was a pre-existing issue
resulting from the way the inodes are handled in xfs_inactive_ifree.
When we free a cluster buffer from xfs_ifree_cluster, all the inodes
in cache are marked XFS_ISTALE. Those that are clean have nothing
else done to them and so eventually get cleaned up by background
reclaim. i.e. it is assumed we'll never dirty/relog an inode marked
XFS_ISTALE.
On journal commit dirty stale inodes as are handled by both
buffer and inode log items to run though xfs_istale_done() and
removed from the AIL (buffer log item commit) or the log item will
simply unpin it because the buffer log item will clean it. What happens
to any specific inode is entirely dependent on which log item wins
the commit race, but the result is the same - stale inodes are
clean, not attached to the cluster buffer, and not in the AIL. Hence
inode reclaim can just free these inodes without further care.
However, if the stale inode is relogged, it gets dirtied again and
relogged into the CIL. Most of the time this isn't an issue, because
relogging simply changes the inode's location in the current
checkpoint. Problems arise, however, when the CIL checkpoints
between two transactions in the xfs_inactive_ifree() deferops
processing. This results in the XFS_ISTALE inode being redirtied
and inserted into the CIL without any of the other stale cluster
buffer infrastructure being in place.
Hence on journal commit, it simply gets unpinned, so it remains
dirty in memory. Everything in inode writeback avoids XFS_ISTALE
inodes so it can't be written back, and it is not tracked in the AIL
so there's not even a trigger to attempt to clean the inode. Hence
the inode just sits dirty in memory until inode reclaim comes along,
sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming
of a dirty inode caused use after free, list corruptions and other
nasty issues later in this patchset.
Hence this patch addresses a violation of the "never log XFS_ISTALE
inodes" caused by the deferops processing rolling a transaction
and relogging a stale inode in xfs_inactive_free. It also adds a
bunch of asserts to catch this problem in debug kernels so that
we don't reintroduce this problem in future.
Reproducer for this issue was generic/558 on a v4 filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Remove current_pid(), current_test_flags() and
current_clear_flags_nested(), because they are useless.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The page faultround path ->map_pages is implemented in XFS via
filemap_map_pages(). This function checks that pages found in page
cache lookups have not raced with truncate based invalidation by
checking page->mapping is correct and page->index is within EOF.
However, we've known for a long time that this is not sufficient to
protect against races with invalidations done by operations that do
not change EOF. e.g. hole punching and other fallocate() based
direct extent manipulations. The way we protect against these
races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED
lock so they serialise against fallocate and truncate before calling
into the filemap function that processes the fault.
Do the same for XFS's ->map_pages implementation to close this
potential data corruption issue.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move the double-inode locking helpers to xfs_inode.c since they're not
specific to reflink.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Refactor the two functions that we use to lock and unlock two inodes to
block userspace from initiating IO against a file, whether via system
calls or mmap activity.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Fix the return value of xfs_reflink_remap_prep so that its return value
conventions match the rest of xfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
If the source and destination map are identical, we can skip the remap
step to save some time.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When logging quota block count updates during a reflink operation, we
only log the /delta/ of the block count changes to the dquot. Since we
now know ahead of time the extent type of both dmap and smap (and that
they have the same length), we know that we only need to reserve quota
blocks for dmap's blockcount if we're mapping it into a hole.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Now that we've reworked xfs_reflink_remap_extent to remap only one
extent per transaction, we actually know if the extent being removed is
an allocated mapping. This means that we now know ahead of time if
we're going to be touching the data fork.
Since we only need blocks for a bmbt split if we're going to update the
data fork, we only need to get quota reservation if we know we're going
to touch the data fork.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The existing reflink remapping loop has some structural problems that
need addressing:
The biggest problem is that we create one transaction for each extent in
the source file without accounting for the number of mappings there are
for the same range in the destination file. In other words, we don't
know the number of remap operations that will be necessary and we
therefore cannot guess the block reservation required. On highly
fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
run out of block reservation, and fail.
The second problem is that we don't actually use the bmap intents to
their full potential -- instead of calling bunmapi directly and having
to deal with its backwards operation, we could call the deferred ops
xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead. This
makes the frontend loop much simpler.
Solve all of these problems by refactoring the remapping loops so that
we only perform one remapping operation per transaction, and each
operation only tries to remap a single extent from source to dest.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reported-by: Edwin Török <edwin@etorok.net>
Tested-by: Edwin Török <edwin@etorok.net>
The name of this predicate is a little misleading -- it decides if the
extent mapping is allocated and written. Change the name to be more
direct, as we're going to add a new predicate in the next patch.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Quota reservations are supposed to account for the blocks that might be
allocated due to a bmap btree split. Reflink doesn't do this, so fix
this to make the quota accounting more accurate before we start
rearranging things.
Fixes: 862bb360ef ("xfs: reflink extents from one file to another")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The data fork scrubber calls filemap_write_and_wait to flush dirty pages
and delalloc reservations out to disk prior to checking the data fork's
extent mappings. Unfortunately, this means that scrub can consume the
EIO/ENOSPC errors that would otherwise have stayed around in the address
space until (we hope) the writer application calls fsync to persist data
and collect errors. The end result is that programs that wrote to a
file might never see the error code and proceed as if nothing were
wrong.
xfs_scrub is not in a position to notify file writers about the
writeback failure, and it's only here to check metadata, not file
contents. Therefore, if writeback fails, we should stuff the error code
back into the address space so that an fsync by the writer application
can pick that up.
Fixes: 99d9d8d05d ("xfs: scrub inode block mappings")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The rmapbt extent swap algorithm remaps individual extents between
the source inode and the target to trigger reverse mapping metadata
updates. If either inode straddles a format or other bmap allocation
boundary, the individual unmap and map cycles can trigger repeated
bmap block allocations and frees as the extent count bounces back
and forth across the boundary. While net block usage is bound across
the swap operation, this behavior can prematurely exhaust the
transaction block reservation because it continuously drains as the
transaction rolls. Each allocation accounts against the reservation
and each free returns to global free space on transaction roll.
The previous workaround to this problem attempted to detect this
boundary condition and provide surplus block reservation to
acommodate it. This is insufficient because more remaps can occur
than implied by the extent counts; if start offset boundaries are
not aligned between the two inodes, for example.
To address this problem more generically and dynamically, add a
transaction accounting mode that returns freed blocks to the
transaction reservation instead of the superblock counters on
transaction roll and use it when the rmapbt based algorithm is
active. This allows the chain of remap transactions to preserve the
block reservation based own its own frees and prevent premature
exhaustion regardless of the remap pattern. Note that this is only
safe for superblocks with lazy sb accounting, but the latter is
required for v5 supers and the rmap feature depends on v5.
Fixes: b3fed43482 ("xfs: account format bouncing into rmapbt swapext tx reservation")
Root-caused-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
It's not nice to hold @uring_lock for too long io_iopoll_reap_events().
For instance, the lock is needed to publish requests to @poll_list, and
that locks out tasks doing that for no good reason. Loose it
occasionally.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Nobody adjusts *nr_events (number of completed requests) before calling
io_iopoll_getevents(), so the passed @min shouldn't be adjusted as well.
Othewise it can return less than initially asked @min without hitting
need_resched().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
->iopoll() may have completed current request, but instead of reaping
it, io_do_iopoll() just continues with the next request in the list.
As a result it can leave just polled and completed request in the list
up until next syscall. Even outer loop in io_iopoll_getevents() doesn't
help the situation.
E.g. poll_list: req0 -> req1
If req0->iopoll() completed both requests, and @min<=1,
then @req0 will be left behind.
Check whether a req was completed after ->iopoll().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't forget to fill cqe->flags properly in io_submit_flush_completions()
Fixes: a1d7c393c4 ("io_uring: enable READ/WRITE to use deferred completions")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A preparation path, extracts error path into a separate block. It looks
saner then calling req_set_fail_links() after io_put_req_find_next(), even
though it have been working well.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_prep_linked_timeout() sets REQ_F_LINK_TIMEOUT altering refcounting of
the following linked request. After that someone should call
io_queue_linked_timeout(), otherwise a submission reference of the linked
timeout won't be ever dropped.
That's what happens in io_steal_work() if io-wq decides to postpone linked
request with io_wqe_enqueue(). io_queue_linked_timeout() can also be
potentially called twice without synchronisation during re-submission,
e.g. io_rw_resubmit().
There are the rules, whoever did io_prep_linked_timeout() must also call
io_queue_linked_timeout(). To not do it twice, io_prep_linked_timeout()
will return non NULL only for the first call. That's controlled by
REQ_F_LINK_TIMEOUT flag.
Also kill REQ_F_QUEUE_TIMEOUT.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since we now have that in the 5.9 branch, convert the existing users of
task_work_add() to use this new helper.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Provide a helper to run task_work instead of checking and running
manually in a bunch of different spots. While doing so, also move the
task run state setting where we run the task work. Then we can move it
out of the callback helpers. This also helps ensure we only do this once
per task_work list run, not per task_work item.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull in task_work changes from the 5.8 series, as we'll need to apply
the same kind of changes to other parts in the 5.9 branch.
* io_uring-5.8:
io_uring: fix regression with always ignoring signals in io_cqring_wait()
io_uring: use signal based task_work running
task_work: teach task_work_add() to do signal_wake_up()
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Link: https://lore.kernel.org/r/20200627103125.71828-1-grandmaster@al2klimov.de
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8BDx4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvhzD/4rxzJsn6ukrsxMXFaKIrjZ/hkcRJIMNozz
YWu4PwcDvszvZu66MeAu0tnCttzxlIgP8oCm6cx9ImMQwkYIVbV0q1XJ3wmzUQpZ
pEDW4j0j8hgcLhfZH9ojUAkTP8TnltakxkrwC6egUvnT0vuKDUy5ISbkl4uxWYpH
p4Dq7ASqy8xjtzac/VLTSzBgzhTMSic5NMJY21md9eAaFB1vYBmDyHB3O1bEk4kw
pvWGFm7a4qssnAB61SMfq3nWQ9UA0+XX4a+CWEzJIMqj4H6UpjOCQU23X1AlaLJX
ILeq26PwoZQF8cS4D83tMnmPWz1LqslBgnUuAGCVLsT7omvhDLM75iFBpMzWglLu
No8TlxLZ+Dga04vpjeEptWqSfUS6K879cNJuFGjadBogq06SImIVDHXXTrPhtCGg
B9+uFHkOUlIkjM5h2zqdkmhnbf0sWodowIrx7+aL294QVlqnY0uBR9eh6+CSKT+h
PhJ+FhN+N6B1dTyryaO5hMjyg0h4ZpvIMT3HBpNXtnRVlUT2+OYN3g5HHt6z//Rp
eeJTh7pnY7uT60c8x96kySwQIydXSKBI+7ysLlntgiyvutbzaC5Fq7/f1YTWyNVk
zqM/+FuJUsstu0y/GBEDpglpL1+S9VjNcJUDpUMUKwCAkh7TnI/ATo1rn9GiM1n1
SQZ4HcaCYw==
=Uawr
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.8-2020-07-05' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Andres reported a regression with the fix that was merged earlier this
week, where his setup of using signals to interrupt io_uring CQ waits
no longer worked correctly.
Fix this, and also limit our use of TWA_SIGNAL to the case where we
need it, and continue using TWA_RESUME for task_work as before.
Since the original is marked for 5.7 stable, let's flush this one out
early"
* tag 'io_uring-5.8-2020-07-05' of git://git.kernel.dk/linux-block:
io_uring: fix regression with always ignoring signals in io_cqring_wait()
When switching to TWA_SIGNAL for task_work notifications, we also made
any signal based condition in io_cqring_wait() return -ERESTARTSYS.
This breaks applications that rely on using signals to abort someone
waiting for events.
Check if we have a signal pending because of queued task_work, and
repeat the signal check once we've run the task_work. This provides a
reliable way of telling the two apart.
Additionally, only use TWA_SIGNAL if we are using an eventfd. If not,
we don't have the dependency situation described in the original commit,
and we can get by with just using TWA_RESUME like we previously did.
Fixes: ce593a6c48 ("io_uring: use signal based task_work running")
Cc: stable@vger.kernel.org # v5.7
Reported-by: Andres Freund <andres@anarazel.de>
Tested-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that the last callser has been removed remove this code from exec.
For anyone thinking of resurrecing do_execve_file please note that
the code was buggy in several fundamental ways.
- It did not ensure the file it was passed was read-only and that
deny_write_access had been called on it. Which subtlely breaks
invaniants in exec.
- The caller of do_execve_file was expected to hold and put a
reference to the file, but an extra reference for use by exec was
not taken so that when exec put it's reference to the file an
underflow occured on the file reference count.
- The point of the interface was so that a pathname did not need to
exist. Which breaks pathname based LSMs.
Tetsuo Handa originally reported these issues[1]. While it was clear
that deny_write_access was missing the fundamental incompatibility
with the passed in O_RDWR filehandle was not immediately recognized.
All of these issues were fixed by modifying the usermode driver code
to have a path, so it did not need this hack.
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
[1] https://lore.kernel.org/linux-fsdevel/2a8775b4-1dd5-9d5c-aa42-9872445e0942@i-love.sakura.ne.jp/
v1: https://lkml.kernel.org/r/871rm2f0hi.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/87lfk54p0m.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-10-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull sysctl fix from Al Viro:
"Another regression fix for sysctl changes this cycle..."
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Call sysctl_head_finish on error
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl7/s1YACgkQiiy9cAdy
T1GhzgwAqARAg1iCUEDjyy7VEZ9HNA3X87GxM7zkid5Fz2WTDlHLBQL6LWZkLODK
PIz8IP4V3DoBddN2DGlqIiCZmCMDn2bBN+6u1O2TkR2lv2w3ASxzwYSMQWqUUw6U
a03BkDZNFE4fJq5pPDdVaVzDss4tuNKW8N5RvptRqlbLp74SRUgMjVyyWwN4UunW
AHH3VqRCWJJj6Yp6MAx3rtoEiAtjTt9Ej3Fb2MXdF5jZObzI3LOY13Z09QIWbE3P
Sh7Py66CSG7UYYkQqoe43zYwxeOgo6FAYWxIULTPJYdFIi5+RHPQ0SYc6+BHfDRo
AHMchJpwZ/j4JOeIJGDItuUQPVnwYAOZ+75s7ofhAbG95kwcfs+AkDoLqkM8IWpu
LS5rHi7sOA4GK8Hio9xp+MgttsmXRcnBQ4ShBoTaDBKa7v/NeRAPolsD5FgZWunO
CKRDsDD5hKO2bQsJk4te35/IQpxRpEiiGmMpyaNUdaCdhXxcPHCYEWYdw9EnTP6i
1xc7au/u
=laR3
-----END PGP SIGNATURE-----
Merge tag '5.8-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Eight cifs/smb3 fixes, most when specifying the multiuser mount flag.
Five of the fixes are for stable"
* tag '5.8-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: prevent truncation from long to int in wait_for_free_credits
cifs: Fix the target file was deleted when rename failed.
SMB3: Honor 'posix' flag for multiuser mounts
SMB3: Honor 'handletimeout' flag for multiuser mounts
SMB3: Honor lease disabling for multiuser mounts
SMB3: Honor persistent/resilient handle flags for multiuser mounts
SMB3: Honor 'seal' flag for multiuser mounts
cifs: Display local UID details for SMB sessions in DebugData
- Fix a use-after-free bug when the fs shuts down.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl77c9oACgkQ+H93GTRK
tOvYow/+O61y32kJzJ6xJ5QoGE5K10CXp4cpBwOgG/6POERIDBirgAMqKAx8rEu2
go36qUVzwyveaB4iyuIkw5K+odZbHpGiuQqiGu8Aw4XBEAEhDqPPsnHhllSi/VOq
4AjvfYefmj0ALQay1pzGpR2h5+03JwOw0ZFcmBl5QSTMLwZQZ1PoU8ujiYPSxsUr
m9dcGZtGU16mmDgORzYTDnSYSKruhJDSD5IxsID+QST0wK4MuPkJr0T+ZQmziWb1
xmHE1aTGDZcrYG4+x2Pzop822mrnuMBnMaX8KOOtiZAhtKb19sAf0OUWBkWOQYvb
Vk3mIDDz910vd/szw3B5KMVNkeYoRtHAQztpLyfJ3Gtxmt5g/4ZX2fEX/KWJbV5D
82GLf4gB4GhTQFUwcTmgtmCCd87aH0ABiCEnURN6R04tOCXgc7bYn0XXvZL6Axd/
25bkhlDdmOfveAZrZ1WKWSEYN/at9R5iqYsbWH1FmoE6h82OvZDxweB7P4r66KwT
pMhYRqKrRrwNtarPmn6bC/8Ci5h6vl8MOP3+mTYx2eXU0aFe8OCYpPNFsSufxx+e
hHTHUKxCSQDfYHbIqXVF5F8G7msh4ue+687UIo7seWyzTU4PaphH0kaHKV/+fiY1
zU9+GNjk0iVZy6cG+25ZcMPQmYv7qdrvm6XuxciywFim7l2wq0s=
=BnJO
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.8-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fix from Darrick Wong:
"Fix a use-after-free bug when the fs shuts down"
* tag 'xfs-5.8-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix use-after-free on CIL context on shutdown
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl7/A8kUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZToR8A//RI5c1W9De9ST6ITDsVLVpKX7ugRl
Zo9ORCRgnnBuJFGQ6VtPuren9Rd6nxIOwf8MeFHjkgapIcXVM/f81Hx7WTUS7KOP
dakeOOWPw1Ue3hcnpxjT4fTgZo43u1VoKbFGMpUPK0zvGrYh8fa4euAhALGIUB0X
jc12mX/FdpQY+9YurY45GJ186tC0aZp1kEK+mA5apPuIA+9pUqbPIt6tmLV1wRry
2a2fTZXI0MN0/u4ZhHSFcJVj4k6xdLLQodJHW+FAPh50vHf7W++DL6mJ5V6Svp8N
vaWUdTbLh8wp/IxD+G4tnPk2BFKvGRwXZV9ZmrivxcwRKc4bgA878yjU6Ox4B0XB
EzXcTPeO0zGHtXBv49Pig0dZDllgvlDL/us2rnt+APZ99yWsX8GzJ1BDP4gT4var
3IuNxyvpeAth9zth/aOzS2pE50yano64CQZq/NsuvAz4MTI8zms/o2ioH/aPbwfk
DFkAQHzlNPZlbt6jF6WJxF4R0Mrs8g1oTF5eH0sIpfMUxtq0Ay9dlHDEY1yjl/Jw
2KBuKLF6P6IZN3aSolawY29dcnMp5vbRDLdbqNzz2/DzAQmqjd4IcY0gK/nHTc8i
2TLyffQ6YTdEDXKIDqfPNDxjcYLKd1XCIyogXt6Rw57C3YwIbDY/drPtFUpUCUFI
Oaf4j8M9W+gamzU=
=GQwd
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v5.8-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
"Various gfs2 fixes"
* tag 'gfs2-v5.8-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: The freeze glock should never be frozen
gfs2: When freezing gfs2, use GL_EXACT and not GL_NOCACHE
gfs2: read-only mounts should grab the sd_freeze_gl glock
gfs2: freeze should work on read-only mounts
gfs2: eliminate GIF_ORDERED in favor of list_empty
gfs2: Don't sleep during glock hash walk
gfs2: fix trans slab error when withdraw occurs inside log_flush
gfs2: Don't return NULL from gfs2_inode_lookup
This error path returned directly instead of calling sysctl_head_finish().
Fixes: ef9d965bc8 ("sysctl: reject gigantic reads/write to sysctl files")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Before this patch, some gfs2 code locked the freeze glock with LM_FLAG_NOEXP
(Do not freeze) flag, and some did not. We never want to freeze the freeze
glock, so this patch makes it consistently use LM_FLAG_NOEXP always.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, the freeze code in gfs2 specified GL_NOCACHE in
several places. That's wrong because we always want to know the state
of whether the file system is frozen.
There was also a problem with freeze/thaw transitioning the glock from
frozen (EX) to thawed (SH) because gfs2 will normally grant glocks in EX
to processes that request it in SH mode, unless GL_EXACT is specified.
Therefore, the freeze/thaw code, which tried to reacquire the glock in
SH mode would get the glock in EX mode, and miss the transition from EX
to SH. That made it think the thaw had completed normally, but since the
glock was still cached in EX, other nodes could not freeze again.
This patch removes the GL_NOCACHE flag to allow the freeze glock to be
cached. It also adds the GL_EXACT flag so the glock is fully transitioned
from EX to SH, thereby allowing future freeze operations.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, only read-write mounts would grab the freeze
glock in read-only mode, as part of gfs2_make_fs_rw. So the freeze
glock was never initialized. That meant requests to freeze, which
request the glock in EX, were granted without any state transition.
That meant you could mount a gfs2 file system, which is currently
frozen on a different cluster node, in read-only mode.
This patch makes read-only mounts lock the freeze glock in SH mode,
which will block for file systems that are frozen on another node.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, function freeze_go_sync, called when promoting
the freeze glock, was testing for the SDF_JOURNAL_LIVE superblock flag.
That's only set for read-write mounts. Read-only mounts don't use a
journal, so the bit is never set, so the freeze never happened.
This patch removes the check for SDF_JOURNAL_LIVE for freeze requests
but still checks it when deciding whether to flush a journal.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
In several places, we used the GIF_ORDERED inode flag to determine
if an inode was on the ordered writes list. However, since we always
held the sd_ordered_lock spin_lock during the manipulation, we can
just as easily check list_empty(&ip->i_ordered) instead.
This allows us to keep more than one ordered writes list to make
journal writing improvements.
This patch eliminates GIF_ORDERED in favor of checking list_empty.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
leak and a module unloading bug in the /proc/fs/nfsd/clients/ code, and
a compile warning.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl79+IoVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+6I8P/24e9/W50SUBsQYseG2BpwjR/RsQ
YjMbqrf1XOxo+8axNpdbe0bhq2jWiyQz0esnF33RlztlDSJmSNJfueWDSezKzKwC
o8afQx0qJaVZUsT/XAXa2Hk2OZd2ZYF6f3DGMiz+knBGdAzSwJjpgqhzocMCQ3Hr
t/PG6DJazLB3VDIe1VziTet2uv52A0A+uBYKguK/QPlpae2uXKFJ8U7v6wCsU395
Sqd2/X2KGbeYoCrWsmpvdCDVeNmAbI0KlhY8pR6BHqGp7TYm4+AueqWzpYHlNHei
PukM8AROoTBEAc6Wiqqmp0UKRR+Qn/9NIuvQtvBnC6WGIPjEG1hTkAwlRfT6VYvn
oPg4oekKjRJLz/TSaqfJRpli5GwxfWAW14LTZZT+Xe0/7FhVe28/R8F1dP5ZJaeq
h9//4rCt/yUYAQq1odOMbNCr0rGVcKzdSN3E36OvJFVQ9bMyXHKetKHywOki13w5
M8UQK5zb21ghT7OSICmeRXHqsXRmTFO8QhUZ8L63Qb2hfiQ5fVQdSiHmM8iRcwWY
bxqrSs8YV7i+I0i1YYTYWmmFgP8Y11sL7ovAEs86cP2Rk58Bk5VA2TPT114W53AD
xaZHpjsH0AfZS87dEVdvS2/dAdtbHZsFwHxGnfvyl/CKTqoz5yY5etcwULQI3+XQ
8kG8FOFpt7T/5zmB
=wF3M
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.8-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Fixes for a umask bug on exported filesystems lacking ACL support, a
leak and a module unloading bug in the /proc/fs/nfsd/clients/ code,
and a compile warning"
* tag 'nfsd-5.8-1' of git://linux-nfs.org/~bfields/linux:
SUNRPC: Add missing definition of ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
nfsd: fix nfsdfs inode reference count leak
nfsd4: fix nfsdfs reference count loop
nfsd: apply umask on fs without ACL support
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl79YU0QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplHKD/9rgv0c1I7dCh6MgQKxT+2z/eZcaPO3PekW
sbn8yC8RiSIL85Av1zEfC1wAp+Mp21QlFKXFiZ6BJj5bdDbbshLk0WdbnxvuM+9I
gyngTI/+em5D/WCcetAkPjnMTDq0m4l0UXd91fyNAeErmYZbvhL5dXihZsBJ3T9c
Bprn4RzWwrUsUwGn8qIEZhx2UovMrzXJHGFxWXh/81YHkh7Y4mjvATKxtECIliW/
+QQJDU7Tf3gZw+ETPIDOEB9Hl9c9W+9fcWWzmrXzViUyy54IMbF4qyJpWcGaRh6c
sO3apymwu7wwAUbQcE8IWr3ZLZDtw68AgUdZ5b/T0c2fEwqsI/UDMhBbELiuqcT0
MAoQdUSNNqZTti0PX5vg5CQlCFzjnl2uIwHF6LVSbrqgyqxiC3Qrus/FYSaf3x9h
bAmNgWC9DeKp/wtEKMuBXaOm7RjrEutD5hjJYfVK/AkvKTZyZDx3vZ9FRH8WtrII
7KhUI3DPSZCeWlcpDtK+0fEqtqTw6OtCQ8U5vKSnJjoRSXLUtuk6IYbp/tqNxwe/
0d+U6R+w513jVlXARUP48mV7tzpESp2MLP6Nd2Is/OD5tePWzQEZinpKzsFP4djH
d2PT5FFGPCw9yBk03sI1Je/CFqVYwCGqav6h8dKKVBanMjoEdL4U1PMhI48Zua+9
M8pqRHoeDA==
=4lvI
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.8-2020-07-01' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"One fix in here, for a regression in 5.7 where a task is waiting in
the kernel for a condition, but that condition won't become true until
task_work is run. And the task_work can't be run exactly because the
task is waiting in the kernel, so we'll never make any progress.
One example of that is registering an eventfd and queueing io_uring
work, and then the task goes and waits in eventfd read with the
expectation that it'll get woken (and read an event) when the io_uring
request completes. The io_uring request is finished through task_work,
which won't get run while the task is looping in eventfd read"
* tag 'io_uring-5.8-2020-07-01' of git://git.kernel.dk/linux-block:
io_uring: use signal based task_work running
task_work: teach task_work_add() to do signal_wake_up()
Eric reported an issue where mounting -o recovery with a fuzzed fs
resulted in a kernel panic. This is because we tried to free the tree
node, except it was an error from the read. Fix this by properly
resetting the tree_root->node == NULL in this case. The panic was the
following
BTRFS warning (device loop0): failed to read tree root
BUG: kernel NULL pointer dereference, address: 000000000000001f
RIP: 0010:free_extent_buffer+0xe/0x90 [btrfs]
Call Trace:
free_root_extent_buffers.part.0+0x11/0x30 [btrfs]
free_root_pointers+0x1a/0xa2 [btrfs]
open_ctree+0x1776/0x18a5 [btrfs]
btrfs_mount_root.cold+0x13/0xfa [btrfs]
? selinux_fs_context_parse_param+0x37/0x80
legacy_get_tree+0x27/0x40
vfs_get_tree+0x25/0xb0
fc_mount+0xe/0x30
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x147/0x3e0 [btrfs]
? cred_has_capability+0x7c/0x120
? legacy_get_tree+0x27/0x40
legacy_get_tree+0x27/0x40
vfs_get_tree+0x25/0xb0
do_mount+0x735/0xa40
__x64_sys_mount+0x8e/0xd0
do_syscall_64+0x4d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nik says: this is problematic only if we fail on the last iteration of
the loop as this results in init_tree_roots returning err value with
tree_root->node = -ERR. Subsequently the caller does: fail_tree_roots
which calls free_root_pointers on the bogus value.
Reported-by: Eric Sandeen <sandeen@redhat.com>
Fixes: b8522a1e5f ("btrfs: Factor out tree roots initialization during mount")
CC: stable@vger.kernel.org # 5.5+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add details how the pointer gets dereferenced ]
Signed-off-by: David Sterba <dsterba@suse.com>
Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.
This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. However, if the system is doing signficant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.
Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.
The following represents an example execution demonstrating the race:
CPU0 CPU1 CPU2
reada_for_search reada_for_search
readahead_tree_block readahead_tree_block
find_create_tree_block find_create_tree_block
alloc_extent_buffer alloc_extent_buffer
find_extent_buffer // not found
allocates eb
lock pages
associate pages to eb
insert eb into radix tree
set TREE_REF, refs == 2
unlock pages
read_extent_buffer_pages // WAIT_NONE
not uptodate (brand new eb)
lock_page
if !trylock_page
goto unlock_exit // not an error
free_extent_buffer
release_extent_buffer
atomic_dec_and_test refs to 1
find_extent_buffer // found
try_release_extent_buffer
take refs_lock
reads refs == 1; no io
atomic_inc_not_zero refs to 2
mark_buffer_accessed
check_buffer_tree_ref
// not STALE, won't take refs_lock
refs == 2; TREE_REF set // no action
read_extent_buffer_pages // WAIT_NONE
clear TREE_REF
release_extent_buffer
atomic_dec_and_test refs to 1
unlock_page
still not uptodate (CPU1 read failed on trylock_page)
locks pages
set io_pages > 0
submit io
return
free_extent_buffer
release_extent_buffer
dec refs to 0
delete from radix tree
btrfs_release_extent_buffer_pages
BUG_ON(io_pages > 0)!!!
We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.
To fix it, we apply check_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.
Stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103] release_extent_buffer+0x39/0x90
[1417839.746913] read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645] btrfs_search_slot+0x260/0x9b0
[1417839.768054] btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427] btrfs_get_extent+0x15f/0x830
[1417839.787665] ? submit_extent_page+0xc4/0x1c0
[1417839.797474] ? __do_readpage+0x299/0x7a0
[1417839.806515] __do_readpage+0x33b/0x7a0
[1417839.815171] ? btrfs_releasepage+0x70/0x70
[1417839.824597] extent_readpages+0x28f/0x400
[1417839.833836] read_pages+0x6a/0x1c0
[1417839.841729] ? startup_64+0x2/0x30
[1417839.849624] __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590] filemap_fault+0x6c7/0x990
[1417839.869252] ? xas_load+0x8/0x80
[1417839.876756] ? xas_find+0x150/0x190
[1417839.884839] ? filemap_map_pages+0x295/0x3b0
[1417839.894652] __do_fault+0x32/0x110
[1417839.902540] __handle_mm_fault+0xacd/0x1000
[1417839.912156] handle_mm_fault+0xaa/0x1c0
[1417839.921004] __do_page_fault+0x242/0x4b0
[1417839.930044] ? page_fault+0x8/0x30
[1417839.937933] page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Convert fall through comments to the pseudo-keyword which is now the
preferred way.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The wait_event_... defines evaluate to long so we should not assign it an int as this may truncate
the value.
Reported-by: Marshall Midden <marshallmidden@gmail.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When xfstest generic/035, we found the target file was deleted
if the rename return -EACESS.
In cifs_rename2, we unlink the positive target dentry if rename
failed with EACESS or EEXIST, even if the target dentry is positived
before rename. Then the existing file was deleted.
We should just delete the target file which created during the
rename.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
The flag from the primary tcon needs to be copied into the volume info
so that cifs_get_tcon will try to enable extensions on the per-user
tcon. At that point, since posix extensions must have already been
enabled on the superblock, don't try to needlessly adjust the mount
flags.
Fixes: ce558b0e17 ("smb3: Add posix create context for smb3.11 posix mounts")
Fixes: b326614ea2 ("smb3: allow "posix" mount option to enable new SMB311 protocol extensions")
Signed-off-by: Paul Aurich <paul@darkrain42.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Fixes: ca567eb2b3 ("SMB3: Allow persistent handle timeout to be configurable on mount")
Signed-off-by: Paul Aurich <paul@darkrain42.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Without this:
- persistent handles will only be enabled for per-user tcons if the
server advertises the 'Continuous Availabity' capability
- resilient handles would never be enabled for per-user tcons
Signed-off-by: Paul Aurich <paul@darkrain42.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Ensure multiuser SMB3 mounts use encryption for all users' tcons if the
mount options are configured to require encryption. Without this, only
the primary tcon and IPC tcons are guaranteed to be encrypted. Per-user
tcons would only be encrypted if the server was configured to require
encryption.
Signed-off-by: Paul Aurich <paul@darkrain42.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>