219b51a6f0
767 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
|
fdaf9a5840 |
Page cache changes for 5.19
- Appoint myself page cache maintainer - Fix how scsicam uses the page cache - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS - Remove the AOP flags entirely - Remove pagecache_write_begin() and pagecache_write_end() - Documentation updates - Convert several address_space operations to use folios: - is_dirty_writeback - readpage becomes read_folio - releasepage becomes release_folio - freepage becomes free_folio - Change filler_t to require a struct file pointer be the first argument like ->read_folio -----BEGIN PGP SIGNATURE----- iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmKNMDUACgkQDpNsjXcp gj4/mwf/bpHhXH4ZoNIvtUpTF6rZbqeffmc0VrbxCZDZ6igRnRPglxZ9H9v6L53O 7B0FBQIfxgNKHZpdqGdOkv8cjg/GMe/HJUbEy5wOakYPo4L9fZpHbDZ9HM2Eankj xBqLIBgBJ7doKr+Y62DAN19TVD8jfRfVtli5mqXJoNKf65J7BkxljoTH1L3EXD9d nhLAgyQjR67JQrT/39KMW+17GqLhGefLQ4YnAMONtB6TVwX/lZmigKpzVaCi4r26 bnk5vaR/3PdjtNxIoYvxdc71y2Eg05n2jEq9Wcy1AaDv/5vbyZUlZ2aBSaIVbtKX WfrhN9O3L0bU5qS7p9PoyfLc9wpq8A== =djLv -----END PGP SIGNATURE----- Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache Pull page cache updates from Matthew Wilcox: - Appoint myself page cache maintainer - Fix how scsicam uses the page cache - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS - Remove the AOP flags entirely - Remove pagecache_write_begin() and pagecache_write_end() - Documentation updates - Convert several address_space operations to use folios: - is_dirty_writeback - readpage becomes read_folio - releasepage becomes release_folio - freepage becomes free_folio - Change filler_t to require a struct file pointer be the first argument like ->read_folio * tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits) nilfs2: Fix some kernel-doc comments Appoint myself page cache maintainer fs: Remove aops->freepage secretmem: Convert to free_folio nfs: Convert to free_folio orangefs: Convert to free_folio fs: Add free_folio address space operation fs: Convert drop_buffers() to use a folio fs: Change try_to_free_buffers() to take a folio jbd2: Convert release_buffer_page() to use a folio jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio reiserfs: Convert release_buffer_page() to use a folio fs: Remove last vestiges of releasepage ubifs: Convert to release_folio reiserfs: Convert to release_folio orangefs: Convert to release_folio ocfs2: Convert to release_folio nilfs2: Remove comment about releasepage nfs: Convert to release_folio jfs: Convert to release_folio ... |
||
|
36e8c62273 |
btrfs: add a btrfs_dio_rw wrapper
Add a wrapper around iomap_dio_rw that keeps the direct I/O internals isolated in inode.c. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d4135134ab |
btrfs: avoid blocking on space revervation when doing nowait dio writes
When doing a NOWAIT direct IO write, if we can NOCOW then it means we can proceed with the non-blocking, NOWAIT path. However reserving the metadata space and qgroup meta space can often result in blocking - flushing delalloc, wait for ordered extents to complete, trigger transaction commits, etc, going against the semantics of a NOWAIT write. So make the NOWAIT write path to try to reserve all the metadata it needs without resulting in a blocking behaviour - if we get -ENOSPC or -EDQUOT then return -EAGAIN to make the caller fallback to a blocking direct IO write. This is part of a patchset comprised of the following patches: btrfs: avoid blocking on page locks with nowait dio on compressed range btrfs: avoid blocking nowait dio when locking file range btrfs: avoid double nocow check when doing nowait dio writes btrfs: stop allocating a path when checking if cross reference exists btrfs: free path at can_nocow_extent() before checking for checksum items btrfs: release path earlier at can_nocow_extent() btrfs: avoid blocking when allocating context for nowait dio read/write btrfs: avoid blocking on space revervation when doing nowait dio writes The following test was run before and after applying this patchset: $ cat io-uring-nodatacow-test.sh #!/bin/bash DEV=/dev/sdc MNT=/mnt/sdc MOUNT_OPTIONS="-o ssd -o nodatacow" MKFS_OPTIONS="-R free-space-tree -O no-holes" NUM_JOBS=4 FILE_SIZE=8G RUN_TIME=300 cat <<EOF > /tmp/fio-job.ini [io_uring_rw] rw=randrw fsync=0 fallocate=posix group_reporting=1 direct=1 ioengine=io_uring iodepth=64 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 filesize=$FILE_SIZE runtime=$RUN_TIME time_based filename=foobar directory=$MNT numjobs=$NUM_JOBS thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The test was run a 12 cores box with 64G of ram, using a non-debug kernel config (Debian's default config) and a spinning disk. Result before the patchset: READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec Result after the patchset: READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec That's about +7.2% throughput for reads and +6.9% for writes. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d7a8ab4e9b |
btrfs: avoid double nocow check when doing nowait dio writes
When doing a NOWAIT direct IO write we are checking twice if we can COW into the target file range using can_nocow_extent() - once at the very beginning of the write path, at btrfs_write_check() via check_nocow_nolock(), and later again at btrfs_get_blocks_direct_write(). The can_nocow_extent() function does a lot of expensive things - searching for the file extent item in the inode's subvolume tree, searching for the extent item in the extent tree, checking delayed references, etc, so it isn't a very cheap call. We can remove the first check at btrfs_write_check(), and add there a quick check to verify if the inode has the NODATACOW or PREALLOC flags, and quickly bail out if it doesn't have neither of those flags, as that means we have to COW and therefore can't comply with the NOWAIT semantics. After this we do only one call to can_nocow_extent(), while we are at btrfs_get_blocks_direct_write(), where we have already locked the file range and we did a try lock on the range before, at btrfs_dio_iomap_begin() (since the previous patch in the series). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
63c34cb4c6 |
btrfs: add and use helper to assert an inode range is clean
We have four different scenarios where we don't expect to find ordered extents after locking a file range: 1) During plain fallocate; 2) During hole punching; 3) During zero range; 4) During reflinks (both cloning and deduplication). This is because in all these cases we follow the pattern: 1) Lock the inode's VFS lock in exclusive mode; 2) Lock the inode's i_mmap_lock in exclusive node, to serialize with mmap writes; 3) Flush delalloc in a file range and wait for all ordered extents to complete - both done through btrfs_wait_ordered_range(); 4) Lock the file range in the inode's io_tree. So add a helper that asserts that we don't have ordered extents for a given range. Make the four scenarios listed above use this helper after locking the respective file range. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
55961c8abf |
btrfs: remove ordered extent check and wait during hole punching and zero range
For hole punching and zero range we have this loop that checks if we have ordered extents after locking the file range, and if so unlock the range, wait for ordered extents, and retry until we don't find more ordered extents. This logic was needed in the past because: 1) Direct IO writes within the i_size boundary did not take the inode's VFS lock. This was because that lock used to be a mutex, then some years ago it was switched to a rw semaphore (commit |
||
|
bd6526d0df |
btrfs: lock the inode first before flushing range when punching hole
When doing hole punching we are flushing delalloc and waiting for ordered extents to complete before locking the inode (VFS lock and the btrfs specific i_mmap_lock). This is fine because even if a write happens after we call btrfs_wait_ordered_range() and before we lock the inode (call btrfs_inode_lock()), we will notice the write at btrfs_punch_hole_lock_range() and flush delalloc and wait for its ordered extent. We can however make this simpler by locking first the inode an then call btrfs_wait_ordered_range(), which will allow us to remove the ordered extent lookup logic from btrfs_punch_hole_lock_range() in the next patch. It also makes the behaviour the same as plain fallocate, hole punching and reflinks. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
ffa8fc603d |
btrfs: remove ordered extent check and wait during fallocate
For fallocate() we have this loop that checks if we have ordered extents after locking the file range, and if so unlock the range, wait for ordered extents, and retry until we don't find more ordered extents. This logic was needed in the past because: 1) Direct IO writes within the i_size boundary did not take the inode's VFS lock. This was because that lock used to be a mutex, then some years ago it was switched to a rw semaphore (commit |
||
|
831e1ee602 |
btrfs: remove useless dio wait call when doing fallocate zero range
When starting a fallocate zero range operation, before getting the first extent map for the range, we make a call to inode_dio_wait(). This logic was needed in the past because direct IO writes within the i_size boundary did not take the inode's VFS lock. This was because that lock used to be a mutex, then some years ago it was switched to a rw semaphore (by commit |
||
|
47e1d1c7bb |
btrfs: only reserve the needed data space amount during fallocate
During a plain fallocate, we always start by reserving an amount of data space that matches the length of the range passed to fallocate. When we already have extents allocated in that range, we may end up trying to reserve a lot more data space then we need, which can result in several undesired behaviours: 1) We fail with -ENOSPC. For example the passed range has a length of 1G, but there's only one hole with a size of 1M in that range; 2) We temporarily reserve excessive data space that could be used by other operations happening concurrently; 3) By reserving much more data space then we need, we can end up doing expensive things like triggering dellaloc for other inodes, waiting for the ordered extents to complete, trigger transaction commits, allocate new block groups, etc. Example: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f -b 1g $DEV mount $DEV $MNT # Create a file with a size of 600M and two holes, one at [200M, 201M[ # and another at [401M, 402M[ xfs_io -f -c "pwrite -S 0xab 0 200M" \ -c "pwrite -S 0xcd 201M 200M" \ -c "pwrite -S 0xef 402M 198M" \ $MNT/foobar # Now call fallocate against the whole file range, see if it fails # with -ENOSPC or not - it shouldn't since we only need to allocate # 2M of data space. xfs_io -c "falloc 0 600M" $MNT/foobar umount $MNT $ ./test.sh (...) wrote 209715200/209715200 bytes at offset 0 200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec) wrote 209715200/209715200 bytes at offset 210763776 200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec) wrote 207618048/207618048 bytes at offset 421527552 198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec) fallocate: No space left on device $ So fix this by not allocating an amount of data space that matches the length of the range passed to fallocate. Instead allocate an amount of data space that corresponds to the sum of the sizes of each hole found in the range. This reservation now happens after we have locked the file range, which is safe since we know at this point there's no delalloc in the range because we've taken the inode's VFS lock in exclusive mode, we have taken the inode's i_mmap_lock in exclusive mode, we have flushed delalloc and waited for all ordered extents in the range to complete. This type of failure actually seems to happen in practice with systemd, and we had at least one report about this in a very long thread which is referenced by the Link tag below. Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
f913cff350 |
btrfs: Convert to release_folio
I've only converted the outer layers of the btrfs release_folio paths to use folios; the use of folios should be pushed further down into btrfs from here. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> |
||
|
7e0a126519 |
mm,fs: Remove aops->readpage
With all implementations of aops->readpage converted to aops->read_folio, we can stop checking whether it's set and remove the member from aops. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
||
|
fb12489b0d |
btrfs: Convert btrfs to read_folio
This is a "weak" conversion which converts straight back to using pages. A full conversion should be performed at some point, hopefully by someone familiar with the filesystem. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
||
|
5efe7448a1 |
fs: Introduce aops->read_folio
Change all the callers of ->readpage to call ->read_folio in preference, if it exists. This is a transitional duplication, and will be removed by the end of the series. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
||
|
05fd9564e9 |
btrfs: fix fallocate to use file_modified to update permissions consistently
Since the initial introduction of (posix) fallocate back at the turn of the century, it has been possible to use this syscall to change the user-visible contents of files. This can happen by extending the file size during a preallocation, or through any of the newer modes (punch, zero range). Because the call can be used to change file contents, we should treat it like we do any other modification to a file -- update the mtime, and drop set[ug]id privileges/capabilities. The VFS function file_modified() does all this for us if pass it a locked inode, so let's make fallocate drop permissions correctly. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
23e3337faf |
btrfs: reset last_reflink_trans after fsyncing inode
When an inode has a last_reflink_trans matching the current transaction, we have to take special care when logging its checksums in order to avoid getting checksum items with overlapping ranges in a log tree, which could result in missing checksums after log replay (more on that in the changelogs of commit |
||
|
7c0c7269f7 |
btrfs: add BTRFS_IOC_ENCODED_WRITE
The implementation resembles direct I/O: we have to flush any ordered extents, invalidate the page cache, and do the io tree/delalloc/extent map/ordered extent dance. From there, we can reuse the compression code with a minor modification to distinguish the write from writeback. This also creates inline extents when possible. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
28c9b1e75a |
btrfs: support different disk extent size for delalloc
Currently, we always reserve the same extent size in the file and extent size on disk for delalloc because the former is the worst case for the latter. For BTRFS_IOC_ENCODED_WRITE writes, we know the exact size of the extent on disk, which may be less than or greater than (for bookends) the size in the file. Add a disk_num_bytes parameter to btrfs_delalloc_reserve_metadata() so that we can reserve the correct amount of csum bytes. No functional change. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
7ecb4c31e7 |
btrfs: remove constraint on number of visited leaves when replacing extents
At btrfs_drop_extents(), we try to replace a range of file extent items
with a new file extent in a single btree search, to avoid the need to do
a search for deletion, followed by a path release and followed by yet
another search for insertion.
When I originally added that optimization, in commit
|
||
|
558732df21 |
btrfs: reduce extent threshold for autodefrag
There is a big gap between inode_should_defrag() and autodefrag extent size threshold. For inode_should_defrag() it has a flexible @small_write value. For compressed extent is 16K, and for non-compressed extent it's 64K. However for autodefrag extent size threshold, it's always fixed to the default value (256K). This means, the following write sequence will trigger autodefrag to defrag ranges which didn't trigger autodefrag: pwrite 0 8k sync pwrite 8k 128K sync The latter 128K write will also be considered as a defrag target (if other conditions are met). While only that 8K write is really triggering autodefrag. Such behavior can cause extra IO for autodefrag. Close the gap, by copying the @small_write value into inode_defrag, so that later autodefrag can use the same @small_write value which triggered autodefrag. With the existing transid value, this allows autodefrag really to scan the ranges which triggered autodefrag. Although this behavior change is mostly reducing the extent_thresh value for autodefrag, I believe in the future we should allow users to specify the autodefrag extent threshold through mount options, but that's an other problem to consider in the future. CC: stable@vger.kernel.org # 5.16+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
26fbac2517 |
btrfs: autodefrag: only scan one inode once
Although we have btrfs_requeue_inode_defrag(), for autodefrag we are still just exhausting all inode_defrag items in the tree. This means, it doesn't make much difference to requeue an inode_defrag, other than scan the inode from the beginning till its end. Change the behaviour to always scan from offset 0 of an inode, and till the end. By this we get the following benefit: - Straight-forward code - No more re-queue related check - Fewer members in inode_defrag We still keep the same btrfs_get_fs_root() and btrfs_iget() check for each loop, and added extra should_auto_defrag() check per-loop. Note: the patch needs to be backported and is intentionally written to minimize the diff size, code will be cleaned up later. CC: stable@vger.kernel.org # 5.16 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
51bd9563b6 |
btrfs: fix deadlock due to page faults during direct IO reads and writes
If we do a direct IO read or write when the buffer given by the user is
memory mapped to the file range we are going to do IO, we end up ending
in a deadlock. This is triggered by the new test case generic/647 from
fstests.
For a direct IO read we get a trace like this:
[967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
[967.874161] Not tainted 5.14.0-rc7-btrfs-next-95 #1
[967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[967.875983] task:mmap-rw-fault state:D stack: 0 pid:12176 ppid: 11884 flags:0x00000000
[967.875992] Call Trace:
[967.875999] __schedule+0x3ca/0xe10
[967.876015] schedule+0x43/0xe0
[967.876020] wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
[967.876109] ? do_wait_intr_irq+0xb0/0xb0
[967.876118] lock_extent_bits+0x37/0x90 [btrfs]
[967.876150] btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
[967.876184] ? extent_readahead+0xa7/0x530 [btrfs]
[967.876214] extent_readahead+0x32d/0x530 [btrfs]
[967.876253] ? lru_cache_add+0x104/0x220
[967.876255] ? kvm_sched_clock_read+0x14/0x40
[967.876258] ? sched_clock_cpu+0xd/0x110
[967.876263] ? lock_release+0x155/0x4a0
[967.876271] read_pages+0x86/0x270
[967.876274] ? lru_cache_add+0x125/0x220
[967.876281] page_cache_ra_unbounded+0x1a3/0x220
[967.876291] filemap_fault+0x626/0xa20
[967.876303] __do_fault+0x36/0xf0
[967.876308] __handle_mm_fault+0x83f/0x15f0
[967.876322] handle_mm_fault+0x9e/0x260
[967.876327] __get_user_pages+0x204/0x620
[967.876332] ? get_user_pages_unlocked+0x69/0x340
[967.876340] get_user_pages_unlocked+0xd3/0x340
[967.876349] internal_get_user_pages_fast+0xbca/0xdc0
[967.876366] iov_iter_get_pages+0x8d/0x3a0
[967.876374] bio_iov_iter_get_pages+0x82/0x4a0
[967.876379] ? lock_release+0x155/0x4a0
[967.876387] iomap_dio_bio_actor+0x232/0x410
[967.876396] iomap_apply+0x12a/0x4a0
[967.876398] ? iomap_dio_rw+0x30/0x30
[967.876414] __iomap_dio_rw+0x29f/0x5e0
[967.876415] ? iomap_dio_rw+0x30/0x30
[967.876420] ? lock_acquired+0xf3/0x420
[967.876429] iomap_dio_rw+0xa/0x30
[967.876431] btrfs_file_read_iter+0x10b/0x140 [btrfs]
[967.876460] new_sync_read+0x118/0x1a0
[967.876472] vfs_read+0x128/0x1b0
[967.876477] __x64_sys_pread64+0x90/0xc0
[967.876483] do_syscall_64+0x3b/0xc0
[967.876487] entry_SYSCALL_64_after_hwframe+0x44/0xae
[967.876490] RIP: 0033:0x7fb6f2c038d6
[967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
[967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
[967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
[967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
[967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
[967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000
This happens because at btrfs_dio_iomap_begin() we lock the extent range
and return with it locked - we only unlock in the endio callback, at
end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after
iomap called the btrfs_dio_iomap_begin() callback, it triggers the page
faults that resulting in reading the pages, through the readahead callback
btrfs_readahead(), and through there we end to attempt to lock again the
same extent range (or a subrange of what we locked before), resulting in
the deadlock.
For a direct IO write, the scenario is a bit different, and it results in
trace like this:
[1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
[1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
[1330.350540] Not tainted 5.14.0-rc7-btrfs-next-95 #1
[1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1330.351900] task:mmap-rw-fault state:D stack: 0 pid:184017 ppid:183725 flags:0x00000000
[1330.351906] Call Trace:
[1330.351913] __schedule+0x3ca/0xe10
[1330.351930] schedule+0x43/0xe0
[1330.351935] btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
[1330.352020] ? do_wait_intr_irq+0xb0/0xb0
[1330.352028] btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
[1330.352064] ? extent_readahead+0xa7/0x530 [btrfs]
[1330.352094] extent_readahead+0x32d/0x530 [btrfs]
[1330.352133] ? lru_cache_add+0x104/0x220
[1330.352135] ? kvm_sched_clock_read+0x14/0x40
[1330.352138] ? sched_clock_cpu+0xd/0x110
[1330.352143] ? lock_release+0x155/0x4a0
[1330.352151] read_pages+0x86/0x270
[1330.352155] ? lru_cache_add+0x125/0x220
[1330.352162] page_cache_ra_unbounded+0x1a3/0x220
[1330.352172] filemap_fault+0x626/0xa20
[1330.352176] ? filemap_map_pages+0x18b/0x660
[1330.352184] __do_fault+0x36/0xf0
[1330.352189] __handle_mm_fault+0x1253/0x15f0
[1330.352203] handle_mm_fault+0x9e/0x260
[1330.352208] __get_user_pages+0x204/0x620
[1330.352212] ? get_user_pages_unlocked+0x69/0x340
[1330.352220] get_user_pages_unlocked+0xd3/0x340
[1330.352229] internal_get_user_pages_fast+0xbca/0xdc0
[1330.352246] iov_iter_get_pages+0x8d/0x3a0
[1330.352254] bio_iov_iter_get_pages+0x82/0x4a0
[1330.352259] ? lock_release+0x155/0x4a0
[1330.352266] iomap_dio_bio_actor+0x232/0x410
[1330.352275] iomap_apply+0x12a/0x4a0
[1330.352278] ? iomap_dio_rw+0x30/0x30
[1330.352292] __iomap_dio_rw+0x29f/0x5e0
[1330.352294] ? iomap_dio_rw+0x30/0x30
[1330.352306] btrfs_file_write_iter+0x238/0x480 [btrfs]
[1330.352339] new_sync_write+0x11f/0x1b0
[1330.352344] ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
[1330.352354] vfs_write+0x292/0x3c0
[1330.352359] __x64_sys_pwrite64+0x90/0xc0
[1330.352365] do_syscall_64+0x3b/0xc0
[1330.352369] entry_SYSCALL_64_after_hwframe+0x44/0xae
[1330.352372] RIP: 0033:0x7f4b0a580986
[1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
[1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
[1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
[1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
[1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
range unlocked, but later when the page faults are triggered and we try
to read the extents, we end up btrfs_lock_and_flush_ordered_range() where
we find the ordered extent for our write, created by the iomap callback
btrfs_dio_iomap_begin(), and we wait for it to complete, which makes us
deadlock since we can't complete the ordered extent without reading the
pages (the iomap code only submits the bio after the pages are faulted
in).
Fix this by setting the nofault attribute of the given iov_iter and retry
the direct IO read/write if we get an -EFAULT error returned from iomap.
For reads, also disable page faults completely, this is because when we
read from a hole or a prealloc extent, we can still trigger page faults
due to the call to iov_iter_zero() done by iomap - at the moment, it is
oblivious to the value of the ->nofault attribute of an iov_iter.
We also need to keep track of the number of bytes written or read, and
pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL.
This depends on the iov_iter and iomap changes introduced in commit
|
||
|
c03098d4b9 |
gfs2: Fix mmap + page fault deadlocks
Functions gfs2_file_read_iter and gfs2_file_write_iter are both accessing the user buffer to write to or read from while holding the inode glock. In the most basic scenario, that buffer will not be resident and it will be mapped to the same file. Accessing the buffer will trigger a page fault, and gfs2 will deadlock trying to take the same inode glock again while trying to handle that fault. Fix that and similar, more complex scenarios by disabling page faults while accessing user buffers. To make this work, introduce a small amount of new infrastructure and fix some bugs that didn't trigger so far, with page faults enabled. -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmGBPisUHGFncnVlbmJh QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpE6A/7BezUnGuNJxJrR8pC+vcLYA7xAgUU 6STQ6IN7w5UHRlSkNzZxZ2XPxW4uVQ4SxSEeaLqBsHZihepjcLNFZ/8MhQ6UPSD0 8noHOi7CoIcp6IuWQtCpxRM/xjjm2SlMt2XbVJZaiJcdzCV9gB6TU9EkBRq7Zm/X 9WFBbv1xZF0skn9ISCJvNtiiI+VyWKgMDUKxJUiTQjmJcklyyqHcVGmQi9BjqPz4 4s3F+WH6CoGbDKlmNk/6Y9wZ/2+sbvGswVscUxPwJVPoZWsR1xBBUdAeAmEMD1P4 BgE/Y1J8JXyVPYtyvZKq70XUhKdQkxB7RfX87YasOk9mY4Kjd5rIIGEykh+o2vC9 kDhCHvf2Mnw5I6Rum3B7UXyB1vemY+fECIHsXhgBnS+ztabRtcAdpCuWoqb43ymw yEX1KwXyU4FpRYbrRvdZT42Fmh6ty8TW+N4swg8S2TrffirvgAi5yrcHZ4mPupYv lyzvsCW7Wv8hPXn/twNObX+okRgJnsxcCdBXARdCnRXfA8tH23xmu88u8RA1Vdxh nzTvv6Dx2EowwojuDWMx29Mw3fA2IqIfbOV+4FaRU7NZ2ZKtknL8yGl27qQUsMoJ vYsHTmagasjQr+NDJ3vQRLCw+JQ6B1hENpdkmixFD9moo7X1ZFW3HBi/UL973Bv6 5CmgeXto8FRUFjI= =WeNd -----END PGP SIGNATURE----- Merge tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 mmap + page fault deadlocks fixes from Andreas Gruenbacher: "Functions gfs2_file_read_iter and gfs2_file_write_iter are both accessing the user buffer to write to or read from while holding the inode glock. In the most basic deadlock scenario, that buffer will not be resident and it will be mapped to the same file. Accessing the buffer will trigger a page fault, and gfs2 will deadlock trying to take the same inode glock again while trying to handle that fault. Fix that and similar, more complex scenarios by disabling page faults while accessing user buffers. To make this work, introduce a small amount of new infrastructure and fix some bugs that didn't trigger so far, with page faults enabled" * tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Fix mmap + page fault deadlocks for direct I/O iov_iter: Introduce nofault flag to disable page faults gup: Introduce FOLL_NOFAULT flag to disable page faults iomap: Add done_before argument to iomap_dio_rw iomap: Support partial direct I/O on user copy failures iomap: Fix iomap_dio_rw return value for user copies gfs2: Fix mmap + page fault deadlocks for buffered I/O gfs2: Eliminate ip->i_gh gfs2: Move the inode glock locking to gfs2_file_buffered_write gfs2: Introduce flag for glock holder auto-demotion gfs2: Clean up function may_grant gfs2: Add wrapper for iomap_file_buffered_write iov_iter: Introduce fault_in_iov_iter_writeable iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} powerpc/kvm: Fix kvm_use_magic_page iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value |
||
|
f42c5da6c1 |
btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_data_ref
In order to make 'real_root' used only in ref-verify it's required to have the necessary context to perform the same checks that this member is used for. So add 'mod_root' which will contain the root on behalf of which a delayed ref was created and a 'skip_group' parameter which will contain callsite-specific override of skip_qgroup. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
8496153945 |
btrfs: add a BTRFS_FS_ERROR helper
We have a few flags that are inconsistently used to describe the fs in different states of failure. As of |
||
|
e4f9434749 |
btrfs: subpage: add bitmap for PageChecked flag
Although in btrfs we have very limited usage of PageChecked flag, it's still some page flag not yet subpage compatible. Fix it by introducing btrfs_subpage::checked_offset to do the convert. For most call sites, especially for free-space cache, COW fixup and btrfs_invalidatepage(), they all work in full page mode anyway. For other call sites, they work as subpage compatible mode. Some call sites need extra modification: - btrfs_drop_pages() Needs extra parameter to get the real range we need to clear checked flag. Also since btrfs_drop_pages() will accept pages beyond the dirtied range, update btrfs_subpage_clamp_range() to handle such case by setting @len to 0 if the page is beyond target range. - btrfs_invalidatepage() We need to call subpage helper before calling __btrfs_releasepage(), or it will trigger ASSERT() as page->private will be cleared. - btrfs_verify_data_csum() In theory we don't need the io_bio->csum check anymore, but it's won't hurt. Just change the comment. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
f064165661 |
btrfs: unexport setup_items_for_insert()
Since setup_items_for_insert() is not used anymore outside of ctree.c, make it static and remove its prototype from ctree.h. This also requires to move the definition of setup_item_for_insert() from ctree.h to ctree.c and move down btrfs_duplicate_item() so that it's defined after setup_items_for_insert(). Further, since setup_item_for_insert() is used outside ctree.c, rename it to btrfs_setup_item_for_insert(). This patch is part of a small patchset that is comprised of the following patches: btrfs: loop only once over data sizes array when inserting an item batch btrfs: unexport setup_items_for_insert() btrfs: use single bulk copy operations when logging directories This is patch 2/3 and performance results, and the specific tests, are included in the changelog of patch 3/3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
b7ef5f3a6f |
btrfs: loop only once over data sizes array when inserting an item batch
When inserting a batch of items into a btree, we end up looping over the data sizes array 3 times: 1) Once in the caller of btrfs_insert_empty_items(), when it populates the array with the data sizes for each item; 2) Once at btrfs_insert_empty_items() to sum the elements of the data sizes array and compute the total data size; 3) And then once again at setup_items_for_insert(), where we do exactly the same as what we do at btrfs_insert_empty_items(), to compute the total data size. That is not bad for small arrays, but when the arrays have hundreds of elements, the time spent on looping is not negligible. For example when doing batch inserts of delayed items for dir index items or when logging a directory, it's common to have 200 to 260 dir index items in a single batch when using a leaf size of 16K and using file names between 8 and 12 characters. For a 64K leaf size, multiply that by 4. Taking into account that during directory logging or when flushing delayed dir index items we can have many of those large batches, the time spent on the looping adds up quickly. It's also more important to avoid it at setup_items_for_insert(), since we are holding a write lock on a leaf and, in some cases, on upper nodes of the btree, which causes us to block other tasks that want to access the leaf and nodes for longer than necessary. So change the code so that setup_items_for_insert() and btrfs_insert_empty_items() no longer compute the total data size, and instead rely on the caller to supply it. This makes us loop over the array only once, where we can both populate the data size array and compute the total data size, taking advantage of spatial and temporal locality. To make this more manageable, use a structure to contain all the relevant details for a batch of items (keys array, data sizes array, total data size, number of items), and use it as an argument for btrfs_insert_empty_items() and setup_items_for_insert(). This patch is part of a small patchset that is comprised of the following patches: btrfs: loop only once over data sizes array when inserting an item batch btrfs: unexport setup_items_for_insert() btrfs: use single bulk copy operations when logging directories This is patch 1/3 and performance results, and the specific tests, are included in the changelog of patch 3/3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
4fdccaa0d1 |
iomap: Add done_before argument to iomap_dio_rw
Add a done_before argument to iomap_dio_rw that indicates how much of the request has already been transferred. When the request succeeds, we report that done_before additional bytes were tranferred. This is useful for finishing a request asynchronously when part of the request has already been completed synchronously. We'll use that to allow iomap_dio_rw to be used with page faults disabled: when a page fault occurs while submitting a request, we synchronously complete the part of the request that has already been submitted. The caller can then take care of the page fault and call iomap_dio_rw again for the rest of the request, passing in the number of bytes already tranferred. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> |
||
|
a6294593e8 |
iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
Turn iov_iter_fault_in_readable into a function that returns the number of bytes not faulted in, similar to copy_to_user, instead of returning a non-zero value when any of the requested pages couldn't be faulted in. This supports the existing users that require all pages to be faulted in as well as new users that are happy if any pages can be faulted in. Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make sure this change doesn't silently break things. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> |
||
|
4afb912f43 |
btrfs: fix abort logic in btrfs_replace_file_extents
Error injection testing uncovered a case where we'd end up with a corrupt file system with a missing extent in the middle of a file. This occurs because the if statement to decide if we should abort is wrong. The only way we would abort in this case is if we got a ret != -EOPNOTSUPP and we called from the file clone code. However the prealloc code uses this path too. Instead we need to abort if there is an error, and the only error we _don't_ abort on is -EOPNOTSUPP and only if we came from the clone file code. CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d175209be0 |
btrfs: update refs for any root except tree log roots
I hit a stuck relocation on btrfs/061 during my overnight testing. This turned out to be because we had left over extent entries in our extent root for a data reloc inode that no longer existed. This happened because in btrfs_drop_extents() we only update refs if we have SHAREABLE set or we are the tree_root. This regression was introduced by |
||
|
146054090b |
btrfs: initial fsverity support
Add support for fsverity in btrfs. To support the generic interface in fs/verity, we add two new item types in the fs tree for inodes with verity enabled. One stores the per-file verity descriptor and btrfs verity item and the other stores the Merkle tree data itself. Verity checking is done in end_page_read just before a page is marked uptodate. This naturally handles a variety of edge cases like holes, preallocated extents, and inline extents. Some care needs to be taken to not try to verity pages past the end of the file, which are accessed by the generic buffered file reading code under some circumstances like reading to the end of the last page and trying to read again. Direct IO on a verity file falls back to buffered reads. Verity relies on PageChecked for the Merkle tree data itself to avoid re-walking up shared paths in the tree. For this reason, we need to cache the Merkle tree data. Since the file is immutable after verity is turned on, we can cache it at an index past EOF. Use the new inode ro_flags to store verity on the inode item, so that we can enable verity on a file, then rollback to an older kernel and still mount the file system and read the file. Since we can't safely write the file anymore without ruining the invariants of the Merkle tree, we mark a ro_compat flag on the file system when a file has verity enabled. Acked-by: Eric Biggers <ebiggers@google.com> Co-developed-by: Chris Mason <clm@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
7c11d0ae43 |
btrfs: subpage: fix a potential use-after-free in writeback helper
[BUG] There is a possible use-after-free bug when running generic/095. BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b Faulting instruction address: 0xc000000000283654 c000000000283078 do_raw_spin_unlock+0x88/0x230 c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90 c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0 c0000000009e0458 end_bio_extent_writepage+0x158/0x270 c000000000b6fd14 bio_endio+0x254/0x270 c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200 c000000000b6fd14 bio_endio+0x254/0x270 c000000000b781fc blk_update_request+0x46c/0x670 c000000000b8b394 blk_mq_end_request+0x34/0x1d0 c000000000d82d1c lo_complete_rq+0x11c/0x140 c000000000b880a4 blk_complete_reqs+0x84/0xb0 c0000000012b2ca4 __do_softirq+0x334/0x680 c0000000001dd878 irq_exit+0x148/0x1d0 c000000000016f4c do_IRQ+0x20c/0x240 c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0 [CAUSE] There is very small race window like the following in generic/095. Thread 1 | Thread 2 --------------------------------+------------------------------------ end_bio_extent_writepage() | btrfs_releasepage() |- spin_lock_irqsave() | | |- end_page_writeback() | | | | |- if (PageWriteback() ||...) | | |- clear_page_extent_mapped() | | |- kfree(subpage); |- spin_unlock_irqrestore(). The race can also happen between writeback and btrfs_invalidatepage(), although that would be much harder as btrfs_invalidatepage() has much more work to do before the clear_page_extent_mapped() call. [FIX] Here we "wait" for the subapge spinlock to be released before we detach subpage structure. So this patch will introduce a new function, wait_subpage_spinlock(), to do the "wait" by acquiring the spinlock and release it. Since the caller has ensured the page is not dirty nor writeback, and page is already locked, the only way to hold the subpage spinlock is from endio function. Thus we only need to acquire the spinlock to wait for any existing holder. Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
e046786619 |
btrfs: subpage: fix race between prepare_pages() and btrfs_releasepage()
[BUG] When running generic/095, there is a high chance to crash with subpage data RW support: assertion failed: PagePrivate(page) && page->private ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3403! Internal error: Oops - BUG: 0 [#1] SMP CPU: 1 PID: 3567 Comm: fio Tainted: 5.12.0-rc7-custom+ #17 Hardware name: Khadas VIM3 (DT) Call trace: assertfail.constprop.0+0x28/0x2c [btrfs] btrfs_subpage_assert+0x80/0xa0 [btrfs] btrfs_subpage_set_uptodate+0x34/0xec [btrfs] btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs] btrfs_dirty_pages+0x160/0x270 [btrfs] btrfs_buffered_write+0x444/0x630 [btrfs] btrfs_direct_write+0x1cc/0x2d0 [btrfs] btrfs_file_write_iter+0xc0/0x160 [btrfs] new_sync_write+0xe8/0x180 vfs_write+0x1b4/0x210 ksys_pwrite64+0x7c/0xc0 __arm64_sys_pwrite64+0x24/0x30 el0_svc_common.constprop.0+0x70/0x140 do_el0_svc+0x28/0x90 el0_svc+0x2c/0x54 el0_sync_handler+0x1a8/0x1ac el0_sync+0x170/0x180 Code: f0000160 913be042 913c4000 955444bc (d4210000) ---[ end trace 3fdd39f4cccedd68 ]--- [CAUSE] Although prepare_pages() calls find_or_create_page(), which returns the page locked, but in later prepare_uptodate_page() calls, we may call btrfs_readpage() which will unlock the page before it returns. This leaves a window where btrfs_releasepage() can sneak in and release the page, clearing page->private and causing above ASSERT(). [FIX] In prepare_uptodate_page(), we should not only check page->mapping, but also PagePrivate() to ensure we are still holding the correct page which has proper fs context setup. Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d3acb15a3a |
Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro: "iov_iter cleanups and fixes. There are followups, but this is what had sat in -next this cycle. IMO the macro forest in there became much thinner and easier to follow..." * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits) csum_and_copy_to_pipe_iter(): leave handling of csum_state to caller clean up copy_mc_pipe_to_iter() pipe_zero(): we don't need no stinkin' kmap_atomic()... iov_iter: clean csum_and_copy_...() primitives up a bit copy_page_from_iter(): don't need kmap_atomic() for kvec/bvec cases copy_page_to_iter(): don't bother with kmap_atomic() for bvec/kvec cases iterate_xarray(): only of the first iteration we might get offset != 0 pull handling of ->iov_offset into iterate_{iovec,bvec,xarray} iov_iter: make iterator callbacks use base and len instead of iovec iov_iter: make the amount already copied available to iterator callbacks iov_iter: get rid of separate bvec and xarray callbacks iov_iter: teach iterate_{bvec,xarray}() about possible short copies iterate_bvec(): expand bvec.h macro forest, massage a bit iov_iter: unify iterate_iovec and iterate_kvec iov_iter: massage iterate_iovec and iterate_kvec to logics similar to iterate_bvec iterate_and_advance(): get rid of magic in case when n is 0 csum_and_copy_to_iter(): massage into form closer to csum_and_copy_from_iter() iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant [xarray] iov_iter_npages(): just use DIV_ROUND_UP() iov_iter_npages(): don't bother with iterate_all_kinds() ... |
||
|
77d255348b |
btrfs: eliminate insert label in add_falloc_range
By way of inverting the list_empty conditional the insert label can be eliminated, making the function's flow entirely linear. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
0528476b6a |
btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()
[BUG] With current subpage RW support, the following script can hang the fs with 64K page size. # mkfs.btrfs -f -s 4k $dev # mount $dev -o nospace_cache $mnt # fsstress -w -n 50 -p 1 -s 1607749395 -d $mnt The kernel will do an infinite loop in btrfs_punch_hole_lock_range(). [CAUSE] In btrfs_punch_hole_lock_range() we: - Truncate page cache range - Lock extent io tree - Wait any ordered extents in the range. We exit the loop until we meet all the following conditions: - No ordered extent in the lock range - No page is in the lock range The latter condition has a pitfall, it only works for sector size == PAGE_SIZE case. While can't handle the following subpage case: 0 32K 64K 96K 128K | |///////||//////| || lockstart=32K lockend=96K - 1 In this case, although the range crosses 2 pages, truncate_pagecache_range() will invalidate no page at all, but only zero the [32K, 96K) range of the two pages. Thus filemap_range_has_page(32K, 96K-1) will always return true, thus we will never meet the loop exit condition. [FIX] Fix the problem by doing page alignment for the lock range. Function filemap_range_has_page() has already handled lend < lstart case, we only need to round up @lockstart, and round_down @lockend for truncate_pagecache_range(). This modification should not change any thing for sector size == PAGE_SIZE case, as in that case our range is already page aligned. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
f02a85d2d5 |
btrfs: make btrfs_dirty_pages() to be subpage compatible
Since the extent io tree operations in btrfs_dirty_pages() are already subpage compatible, we only need to make the page status update to use subpage helpers. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
ec87b42f70 |
btrfs: use list_last_entry in add_falloc_range
Instead of calling list_entry with head->prev simply call list_last_entry which makes it obvious which member of the list is being referred. This allows to remove the extra 'prev' pointer. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
f0b65f39ac |
iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant
Replacement is called copy_page_from_iter_atomic(); unlike the old primitive the callers do *not* need to do iov_iter_advance() after it. In case when they end up consuming less than they'd been given they need to do iov_iter_revert() on everything they had not consumed. That, however, needs to be done only on slow paths. All in-tree callers converted. And that kills the last user of iterate_all_kinds() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
e7b2ec3d3d |
btrfs: return value from btrfs_mark_extent_written() in case of error
We always return 0 even in case of an error in btrfs_mark_extent_written(). Fix it to return proper error value in case of a failure. All callers handle it. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
626e9f41f7 |
btrfs: fix race leading to unpersisted data and metadata on fsync
When doing a fast fsync on a file, there is a race which can result in the fsync returning success to user space without logging the inode and without durably persisting new data. The following example shows one possible scenario for this: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ touch /mnt/bar $ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/baz # Now we have: # file bar == inode 257 # file baz == inode 258 $ mv /mnt/baz /mnt/foo # Now we have: # file bar == inode 257 # file foo == inode 258 $ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo # fsync bar before foo, it is important to trigger the race. $ xfs_io -c "fsync" /mnt/bar $ xfs_io -c "fsync" /mnt/foo # After this: # inode 257, file bar, is empty # inode 258, file foo, has 1M filled with 0xcd <power failure> # Replay the log: $ mount /dev/sdc /mnt # After this point file foo should have 1M filled with 0xcd and not 0xab The following steps explain how the race happens: 1) Before the first fsync of inode 258, when it has the "baz" name, its ->logged_trans is 0, ->last_sub_trans is 0 and ->last_log_commit is -1. The inode also has the full sync flag set; 2) After the first fsync, we set inode 258 ->logged_trans to 6, which is the generation of the current transaction, and set ->last_log_commit to 0, which is the current value of ->last_sub_trans (done at btrfs_log_inode()). The full sync flag is cleared from the inode during the fsync. The log sub transaction that was committed had an ID of 0 and when we synced the log, at btrfs_sync_log(), we incremented root->log_transid from 0 to 1; 3) During the rename: We update inode 258, through btrfs_update_inode(), and that causes its ->last_sub_trans to be set to 1 (the current log transaction ID), and ->last_log_commit remains with a value of 0. After updating inode 258, because we have previously logged the inode in the previous fsync, we log again the inode through the call to btrfs_log_new_name(). This results in updating the inode's ->last_log_commit from 0 to 1 (the current value of its ->last_sub_trans). The ->last_sub_trans of inode 257 is updated to 1, which is the ID of the next log transaction; 4) Then a buffered write against inode 258 is made. This leaves the value of ->last_sub_trans as 1 (the ID of the current log transaction, stored at root->log_transid); 5) Then an fsync against inode 257 (or any other inode other than 258), happens. This results in committing the log transaction with ID 1, which results in updating root->last_log_commit to 1 and bumping root->log_transid from 1 to 2; 6) Then an fsync against inode 258 starts. We flush delalloc and wait only for writeback to complete, since the full sync flag is not set in the inode's runtime flags - we do not wait for ordered extents to complete. Then, at btrfs_sync_file(), we call btrfs_inode_in_log() before the ordered extent completes. The call returns true: static inline bool btrfs_inode_in_log(...) { bool ret = false; spin_lock(&inode->lock); if (inode->logged_trans == generation && inode->last_sub_trans <= inode->last_log_commit && inode->last_sub_trans <= inode->root->last_log_commit) ret = true; spin_unlock(&inode->lock); return ret; } generation has a value of 6 (fs_info->generation), ->logged_trans also has a value of 6 (set when we logged the inode during the first fsync and when logging it during the rename), ->last_sub_trans has a value of 1, set during the rename (step 3), ->last_log_commit also has a value of 1 (set in step 3) and root->last_log_commit has a value of 1, which was set in step 5 when fsyncing inode 257. As a consequence we don't log the inode, any new extents and do not sync the log, resulting in a data loss if a power failure happens after the fsync and before the current transaction commits. Also, because we do not log the inode, after a power failure the mtime and ctime of the inode do not match those we had before. When the ordered extent completes before we call btrfs_inode_in_log(), then the call returns false and we log the inode and sync the log, since at the end of ordered extent completion we update the inode and set ->last_sub_trans to 2 (the value of root->log_transid) and ->last_log_commit to 1. This problem is found after removing the check for the emptiness of the inode's list of modified extents in the recent commit |
||
|
3227788cd3 |
btrfs: fix a potential hole punching failure
In commit |
||
|
e2b84217f3 |
btrfs: update outdated comment at btrfs_replace_file_extents()
There is a comment at btrfs_replace_file_extents() that mentions that we
set the full sync flag on an inode when cloning into a file with a size
greater than or equals to 16MiB, through try_release_extent_mapping() when
we truncate the page cache after replacing file extents during a clone
operation.
That is not true anymore since commit
|
||
|
bc0939fcfa |
btrfs: fix race between marking inode needs to be logged and log syncing
We have a race between marking that an inode needs to be logged, either at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between btrfs_sync_log(). The following steps describe how the race happens. 1) We are at transaction N; 2) Inode I was previously fsynced in the current transaction so it has: inode->logged_trans set to N; 3) The inode's root currently has: root->log_transid set to 1 root->last_log_commit set to 0 Which means only one log transaction was committed to far, log transaction 0. When a log tree is created we set ->log_transid and ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree()); 4) One more range of pages is dirtied in inode I; 5) Some task A starts an fsync against some other inode J (same root), and so it joins log transaction 1. Before task A calls btrfs_sync_log()... 6) Task B starts an fsync against inode I, which currently has the full sync flag set, so it starts delalloc and waits for the ordered extent to complete before calling btrfs_inode_in_log() at btrfs_sync_file(); 7) During ordered extent completion we have btrfs_update_inode() called against inode I, which in turn calls btrfs_set_inode_last_trans(), which does the following: spin_lock(&inode->lock); inode->last_trans = trans->transaction->transid; inode->last_sub_trans = inode->root->log_transid; inode->last_log_commit = inode->root->last_log_commit; spin_unlock(&inode->lock); So ->last_trans is set to N and ->last_sub_trans set to 1. But before setting ->last_log_commit... 8) Task A is at btrfs_sync_log(): - it increments root->log_transid to 2 - starts writeback for all log tree extent buffers - waits for the writeback to complete - writes the super blocks - updates root->last_log_commit to 1 It's a lot of slow steps between updating root->log_transid and root->last_log_commit; 9) The task doing the ordered extent completion, currently at btrfs_set_inode_last_trans(), then finally runs: inode->last_log_commit = inode->root->last_log_commit; spin_unlock(&inode->lock); Which results in inode->last_log_commit being set to 1. The ordered extent completes; 10) Task B is resumed, and it calls btrfs_inode_in_log() which returns true because we have all the following conditions met: inode->logged_trans == N which matches fs_info->generation && inode->last_subtrans (1) <= inode->last_log_commit (1) && inode->last_subtrans (1) <= root->last_log_commit (1) && list inode->extent_tree.modified_extents is empty And as a consequence we return without logging the inode, so the existing logged version of the inode does not point to the extent that was written after the previous fsync. It should be impossible in practice for one task be able to do so much progress in btrfs_sync_log() while another task is at btrfs_set_inode_last_trans() right after it reads root->log_transid and before it reads root->last_log_commit. Even if kernel preemption is enabled we know the task at btrfs_set_inode_last_trans() can not be preempted because it is holding the inode's spinlock. However there is another place where we do the same without holding the spinlock, which is in the memory mapped write path at: vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) { (...) BTRFS_I(inode)->last_trans = fs_info->generation; BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid; BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit; (...) So with preemption happening after setting ->last_sub_trans and before setting ->last_log_commit, it is less of a stretch to have another task do enough progress at btrfs_sync_log() such that the task doing the memory mapped write ends up with ->last_sub_trans and ->last_log_commit set to the same value. It is still a big stretch to get there, as the task doing btrfs_sync_log() has to start writeback, wait for its completion and write the super blocks. So fix this in two different ways: 1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the value of ->last_sub_trans minus 1; 2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just like we do for buffered and direct writes at btrfs_file_write_iter(), which is all we need to make sure multiple writes and fsyncs to an inode in the same transaction never result in an fsync missing that the inode changed and needs to be logged. Turn this into a helper function and use it both at btrfs_page_mkwrite() and at btrfs_file_write_iter() - this also fixes the problem that at btrfs_page_mkwrite() we were setting those fields without the protection of the inode's spinlock. This is an extremely unlikely race to happen in practice. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
885f46d87f |
btrfs: fix race between memory mapped writes and fsync
When doing an fsync we flush all delalloc, lock the inode (VFS lock), flush
any new delalloc that might have been created before taking the lock and
then wait either for the ordered extents to complete or just for the
writeback to complete (depending on whether the full sync flag is set or
not). We then start logging the inode and assume that while we are doing it
no one else is touching the inode's file extent items (or adding new ones).
That is generally true because all operations that modify an inode acquire
the inode's lock first, including buffered and direct IO writes. However
there is one exception: memory mapped writes, which do not and can not
acquire the inode's lock.
This can cause two types of issues: ending up logging file extent items
with overlapping ranges, which is detected by the tree checker and will
result in aborting the transaction when starting writeback for a log
tree's extent buffers, or a silent corruption where we log a version of
the file that never existed.
Scenario 1 - logging overlapping extents
The following steps explain how we can end up with file extents items with
overlapping ranges in a log tree due to a race between a fsync and memory
mapped writes:
1) Task A starts an fsync on inode X, which has the full sync runtime flag
set. First it starts by flushing all delalloc for the inode;
2) Task A then locks the inode and flushes any other delalloc that might
have been created after the previous flush and waits for all ordered
extents to complete;
3) In the inode's root we have the following leaf:
Leaf N, generation == current transaction id:
---------------------------------------------------------
| (...) [ file extent item, offset 640K, length 128K ] |
---------------------------------------------------------
The last file extent item in leaf N covers the file range from 640K to
768K;
4) Task B does a memory mapped write for the page corresponding to the
file range from 764K to 768K;
5) Task A starts logging the inode. At copy_inode_items_to_log() it uses
btrfs_search_forward() to search for leafs modified in the current
transaction that contain items for the inode. It finds leaf N and copies
all the inode items from that leaf into the log tree.
Now the log tree has a copy of the last file extent item from leaf N.
At the end of the while loop at copy_inode_items_to_log(), we have the
minimum key set to:
min_key.objectid = <inode X number>
min_key.type = BTRFS_EXTENT_DATA_KEY
min_key.offset = 640K
Then we increment the key's offset by 1 so that the next call to
btrfs_search_forward() leaves us at the first key greater than the key
we just processed.
But before btrfs_search_forward() is called again...
6) Dellaloc for the page at offset 764K, dirtied by task B, is started.
It can be started for several reasons:
- The async reclaim task is attempting to satisfy metadata or data
reservation requests, and it has reached a point where it decided
to flush delalloc;
- Due to memory pressure the VMM triggers writeback of dirty pages;
- The system call sync_file_range(2) is called from user space.
7) When the respective ordered extent completes, it trims the length of
the existing file extent item for file offset 640K from 128K to 124K,
and a new file extent item is added with a key offset of 764K and a
length of 4K;
8) Task A calls btrfs_search_forward(), which returns us a path pointing
to the leaf (can be leaf N or some other) containing the new file extent
item for file offset 764K.
We end up copying this item to the log tree, which overlaps with the
last copied file extent item, which covers the file range from 640K to
768K.
When writeback is triggered for log tree's extent buffers, the issue
will be detected by the tree checker which will dump a trace and an
error message on dmesg/syslog. If the writeback is triggered when
syncing the log, which typically is, then we also end up aborting the
current transaction.
This is the same type of problem fixed in
|
||
|
8d9b4a162a |
btrfs: exclude mmap from happening during all fallocate operations
There's a small window where a deadlock can happen between fallocate and mmap. This is described in detail by Filipe: """ When doing a fallocate operation we lock the inode, flush delalloc within the target range, wait for any ordered extents to complete and then lock the file range. Before we lock the range and after we flush delalloc, there is a time window where another task can come in and do a memory mapped write for a page within the fallocate range. This means that after fallocate locks the range, there can be a dirty page in the range. More often than not, this does not cause any problem. The exception is when we are low on available metadata space, because an fallocate operation needs to start a transaction while holding the file range locked, either through btrfs_prealloc_file_range() or through the call to btrfs_fallocate_update_isize(). If that's the case, we can end up in a deadlock. The following list of steps explains how that happens: 1) A fallocate operation starts, locks the inode, flushes delalloc in the range and waits for ordered extents in the range to complete; 2) Before the fallocate task locks the file range, another task does a memory mapped write for a page in the fallocate target range. This is possible since memory mapped writes do not (and can not) lock the inode; 3) The fallocate task locks the file range. At this point there is one dirty page in the range (due to the memory mapped write); 4) When the fallocate task attempts to start a transaction, it blocks when attempting to reserve metadata space, since we are low on available metadata space. Before blocking (wait on its reservation ticket), it starts the async reclaim task (if not running already); 5) The async reclaim task is not able to release space through any other means, so it decides to flush delalloc for inodes with dirty pages. It finds that the inode used in the fallocate operation has a dirty page and therefore queues a job (fs_info->flush_workers workqueue) to flush delalloc for that inode and waits on that job to complete; 6) The flush job blocks when attempting to lock the file range because it is currently locked by the fallocate task; 7) The fallocate task keeps waiting for its metadata reservation, waiting for a wakeup on its reservation ticket. The async reclaim task is waiting on the flush job, which in turn is waiting for locking the file range that is currently locked by the fallocate task. So unless some other task is able to release enough metadata space, for example an ordered extent for some other inode completes, we end up in a deadlock between all these tasks. When this happens stack traces like the following show up in dmesg/syslog: INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000 Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] Call Trace: __schedule+0x5d1/0xcf0 schedule+0x45/0xe0 lock_extent_bits+0x1e6/0x2d0 [btrfs] ? finish_wait+0x90/0x90 btrfs_invalidatepage+0x32c/0x390 [btrfs] ? __mod_memcg_state+0x8e/0x160 __extent_writepage+0x2d4/0x400 [btrfs] extent_write_cache_pages+0x2b2/0x500 [btrfs] ? lock_release+0x20e/0x4c0 ? trace_hardirqs_on+0x1b/0xf0 extent_writepages+0x43/0x90 [btrfs] ? lock_acquire+0x1a3/0x490 do_writepages+0x43/0xe0 ? __filemap_fdatawrite_range+0xa4/0x100 __filemap_fdatawrite_range+0xc5/0x100 btrfs_run_delalloc_work+0x17/0x40 [btrfs] btrfs_work_helper+0xf1/0x600 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x50/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] Call Trace: __schedule+0x5d1/0xcf0 ? kvm_clock_read+0x14/0x30 ? wait_for_completion+0x81/0x110 schedule+0x45/0xe0 schedule_timeout+0x30c/0x580 ? _raw_spin_unlock_irqrestore+0x3c/0x60 ? lock_acquire+0x1a3/0x490 ? try_to_wake_up+0x7a/0xa20 ? lock_release+0x20e/0x4c0 ? lock_acquired+0x199/0x490 ? wait_for_completion+0x81/0x110 wait_for_completion+0xab/0x110 start_delalloc_inodes+0x2af/0x390 [btrfs] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs] flush_space+0x24f/0x660 [btrfs] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x20f/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 (...) several tasks waiting for the inode lock held by the fallocate task below (...) RIP: 0033:0x7f61efe73fff Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5. RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003 R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001 R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0 task:fdm-stress state:D stack: 0 pid:2508243 ppid:2508153 flags:0x00000000 Call Trace: __schedule+0x5d1/0xcf0 ? _raw_spin_unlock_irqrestore+0x3c/0x60 schedule+0x45/0xe0 __reserve_bytes+0x4a4/0xb10 [btrfs] ? finish_wait+0x90/0x90 btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs] btrfs_block_rsv_add+0x1f/0x50 [btrfs] start_transaction+0x2d1/0x760 [btrfs] btrfs_replace_file_extents+0x120/0x930 [btrfs] ? btrfs_fallocate+0xdcf/0x1260 [btrfs] btrfs_fallocate+0xdfb/0x1260 [btrfs] ? filename_lookup+0xf1/0x180 vfs_fallocate+0x14f/0x440 ioctl_preallocate+0x92/0xc0 do_vfs_ioctl+0x66b/0x750 ? __do_sys_newfstat+0x53/0x60 __x64_sys_ioctl+0x62/0xb0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 """ Fix this by disallowing mmaps from happening while we're doing any of the fallocate operations on this inode. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
64708539cd |
btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock helpers
A few places we intermix btrfs_inode_lock with a inode_unlock, and some places we just use inode_lock/inode_unlock instead of btrfs_inode_lock. None of these places are using this incorrectly, but as we adjust some of these callers it would be nice to keep everything consistent, so convert everybody to use btrfs_inode_lock/btrfs_inode_unlock. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
cca5de97ae |
btrfs: make find_desired_extent take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |