Commit Graph

13461 Commits

Author SHA1 Message Date
Filipe Manana
68a505bb87 btrfs: more efficient chunk map iteration when device replace finishes
When iterating the chunk maps when a device replace finishes we are doing
a full rbtree search for each chunk map, which is not the most efficient
thing to do, wasting CPU time. As we are holding a write lock on the tree
during the whole iteration, we can simply start from the first node in the
tree and then move to the next chunk map by doing a rb_next() call - the
only exception is when we need to reschedule, in which case we have to do
a full rbtree search since we dropped the write lock and the tree may have
changed (chunk maps may have been removed and the tree got rebalanced).
So just do that.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Filipe Manana
b79f1c2caa btrfs: reschedule when updating chunk maps at the end of a device replace
At the end of a device replace we must go over all the chunk maps and
update their stripes to point to the target device instead of the source
device. We iterate over the chunk maps while holding a write lock and
we never reschedule, which can result in monopolizing a CPU for too long
and blocking readers for too long (it's a rw lock, non-blocking).

So improve on this by rescheduling if necessary. This is safe because at
this point we are holding the chunk mutex, which means no new chunks can
be allocated and therefore we don't risk missing a new chunk map that
covers a range behind the last one we processed before rescheduling.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Josef Bacik
5fe1912449 btrfs: convert extent_range_clear_dirty_for_io() to use a folio
Instead of getting a page and using that to clear dirty for io, use the
folio helper and use the appropriate folio functions.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Josef Bacik
c86d3aac81 btrfs: convert insert_inline_extent() to use a folio
We only use a page to copy in the data for the inline extent.  Use a
folio for this instead.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Josef Bacik
1bbf3a3aea btrfs: convert btrfs_set_range_writeback() to use a folio
We already use a lot of functions here that use folios, update the
function to use __filemap_get_folio instead of find_get_page and then
use the folio directly.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Josef Bacik
dfc9e3017a btrfs: convert wait_subpage_spinlock() to only use a folio
Currently this already uses a folio for most things, update it to take a
folio and update all the page usage with the corresponding folio usage.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
1a48259d9b btrfs: convert find_next_dirty_byte() to take a folio
We already use a folio some in this function, replace all page usage
with the folio and update the function to take the folio as an argument.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
7ed07d1662 btrfs: convert __get_extent_map() to take a folio
Now that btrfs_get_extent takes a folio, update __get_extent_map to
take a folio as well.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
dce9ef9412 btrfs: convert btrfs_get_extent() to take a folio
We only pass this into read_inline_extent, change it to take a folio and
update the callers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
220e77c412 btrfs: convert read_inline_extent() to use a folio
Instead of using a page, use a folio instead, take a folio as an
argument, and update the callers appropriately.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
752965824b btrfs: convert uncompress_inline() to take a folio
Update uncompress_inline to take a folio and update it's usage
accordingly.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
1b5125bbd4 btrfs: convert struct btrfs_writepage_fixup to use a folio
Now the fixup creator and consumer use folios, change this to use a
folio as well.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
d71b53c3cb btrfs: convert btrfs_writepage_cow_fixup() to use folio
Instead of a page, use a folio for btrfs_writepage_cow_fixup.  We
already have a folio at the only caller, and the fixup worker uses
folios.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
7d003cc2b3 btrfs: convert btrfs_writepage_fixup_worker() to use a folio
This function heavily messes with pages, instead update it to use a
folio.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
0d11706810 btrfs: convert submit_uncompressed_range() to take a folio
This mostly uses folios already, update it to take a folio and update
the rest of the function to use the folio instead of the page.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
3ed984b5d0 btrfs: convert struct async_chunk to hold a folio
Instead of passing in the page for ->locked_page, make it hold a
locked_folio and then update the users of async_chunk to act
accordingly.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
2609c9289f btrfs: convert btrfs_run_delalloc_range() to take a folio
Now that every function that btrfs_run_delalloc_range calls takes a
folio, update it to take a folio and update the callers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
d9c750272d btrfs: convert run_delalloc_compressed() to take a folio
This just passes the page into the compressed machinery to keep track of
the locked page.  Update this to take a folio and convert it to a page
where appropriate.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
94cea66d1c btrfs: convert btrfs_cleanup_ordered_extents() to take a folio
Now that btrfs_cleanup_ordered_extents is operating mostly with folios,
update it to use a folio instead of a page, and the update the function
and the callers as appropriate.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
b38ec94ab9 btrfs: convert btrfs_cleanup_ordered_extents() to use folios
We walk through pages in this function and clear ordered, and the
function for this uses folios. Update the function to use a folio for
this whole operation.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
42a5947b1c btrfs: convert run_delalloc_nocow() to take a folio
Now all of the functions that use locked_page in run_delalloc_nocow take
a folio, update it to take a folio and update the caller.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
39bbc56a9c btrfs: convert fallback_to_cow() to take a folio
With this we can pass the folio directly into cow_file_range().

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
4cf7e0562f btrfs: convert cow_file_range() to take a folio
Convert this to take a folio and pass it into all of the various cleanup
functions.  Update the callers to pass in a folio instead.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
9f5db28074 btrfs: convert cow_file_range_inline() to take a folio
Now that we want the folio in this function, convert it to take a folio
directly and use that.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
2cdc1fbb1b btrfs: convert run_delalloc_cow() to take a folio
We pass the folio into extent_write_locked_range, go ahead and take a
folio to pass along, and update the callers to pass in a folio.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
01e11841f0 btrfs: convert extent_write_locked_range() to take a folio
This mostly uses folios, convert it to take a folio instead and update
the callers to pass in the folio.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:15 +02:00
Josef Bacik
a67f540582 btrfs: convert extent_clear_unlock_delalloc() to take a folio
Instead of taking the locked page, take the locked folio so we can pass
that into __process_folios_contig.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:14 +02:00
Josef Bacik
c9ce51d67f btrfs: convert process_one_page() to operate only on folios
Now that this mostly uses folios, update it to take folios, use the
folios that are passed in, and rename from process_one_page =>
process_one_folio.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:14 +02:00
Josef Bacik
a59ff7201a btrfs: convert __process_pages_contig() to take a folio
This operates mostly on folios, update it to take a folio for the locked
folio instead of the page, rename from __process_pages_contig =>
__process_folios_contig.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:14 +02:00
Josef Bacik
79be4a28d8 btrfs: convert __unlock_for_delalloc() to take a folio
All of the callers have a folio at this point, update
__unlock_for_delalloc to take a folio so that it's consistent with its
callers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:14 +02:00
Josef Bacik
e4d80ebe50 btrfs: convert lock_delalloc_pages() to take a folio
Also rename lock_delalloc_pages => lock_delalloc_folios in the process,
now that it exclusively works on folios.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:14 +02:00
Josef Bacik
c987f1e6d4 btrfs: convert find_lock_delalloc_range() to use a folio
Instead of passing in a page for locked_page, pass in the folio instead.
We only use the folio itself to validate some range assumptions, and
then pass it into other functions.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:14 +02:00
Josef Bacik
dc6c745447 btrfs: convert writepage_delalloc() to take a folio
We already use a folio heavily in this function, pass the folio in
directly and use it everywhere, only passing the page down to functions
that do not take a folio yet.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:14 +02:00
Josef Bacik
a79228011c btrfs: convert btrfs_mark_ordered_io_finished() to take a folio
We only need a folio now, make it take a folio as an argument and update
all of the callers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:14 +02:00
Josef Bacik
aef665d69a btrfs: convert btrfs_finish_ordered_extent() to take a folio
The callers and callee's of this now all use folios, update it to take a
folio as well.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:14 +02:00
Josef Bacik
0a577636a9 btrfs: convert can_finish_ordered_extent() to use a folio
Pass in a folio instead, and use a folio instead of a page.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:14 +02:00
Josef Bacik
7e755aa731 btrfs: utilize folio more in btrfs_page_mkwrite()
We already have a folio that we're using in btrfs_page_mkwrite, update
the rest of the function to use folio everywhere else.  This will make
it easier on Willy when he drops page->index.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:14 +02:00
Josef Bacik
c808c1dcb1 btrfs: convert add_ra_bio_pages() to use only folios
Willy is going to get rid of page->index, and add_ra_bio_pages uses
page->index.  Make his life easier by converting add_ra_bio_pages to use
folios so that we are no longer using page->index.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:13 +02:00
Josef Bacik
9b320229c0 btrfs: convert __extent_writepage() to be completely folio based
Now that we've gotten most of the helpers updated to only take a folio,
update __extent_writepage to only deal in folios.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:13 +02:00
Josef Bacik
c1deaa1438 btrfs: convert extent_write_locked_range() to use folios
Instead of using pages for everything, find a folio and use that.  This
makes things a bit cleaner as a lot of the functions calls here all take
folios.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:13 +02:00
Josef Bacik
b8a6263eae btrfs: convert __extent_writepage_io() to take a folio
__extent_writepage_io uses page everywhere, but a lot of these functions
take a folio.  Convert it to use the folio based helpers, and then
change it to take a folio as an argument and update its callers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:13 +02:00
Josef Bacik
9e97e8b277 btrfs: update the writepage tracepoint to take a folio
Willy is wanting to get rid of page->index, convert the writepage
tracepoint to take a folio so we can do folio->index instead of
page->index.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:13 +02:00
Josef Bacik
56a24a30a4 btrfs: convert btrfs_do_readpage() to only use a folio
Now that the callers and helpers mostly use folio, convert
btrfs_do_readpage to take a folio, and rename it to btrfs_do_read_folio.
Update all of the page stuff to use the folio based helpers instead.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:13 +02:00
Josef Bacik
b35397d1d3 btrfs: convert submit_extent_page() to use a folio
The callers of this helper are going to be converted to using a folio,
so adjust submit_extent_page to become submit_extent_folio and update it
to use all the relevant folio helpers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:13 +02:00
Josef Bacik
fcf50d161c btrfs: convert begin_page_folio() to take a folio instead
This already uses a folio internally, change it to take a folio as an
argument instead.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:13 +02:00
Josef Bacik
645006d87c btrfs: convert end_page_read() to take a folio
We have this helper function to set the page range uptodate once we're
done reading it, as well as run fsverity against it.  Half of these
functions already take a folio, just rename this to end_folio_read and
then rework it to take a folio instead, and update everything
accordingly.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:13 +02:00
Josef Bacik
e19317ed9e btrfs: convert btrfs_read_folio() to only use a folio
Currently we're using the page for everything here.  Convert this to use
the folio helpers instead.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:13 +02:00
Josef Bacik
03d6612648 btrfs: convert btrfs_readahead() to only use folio
We're the only user of readahead_page_batch().  Convert
btrfs_readahead() to use the folio based helpers to do readahead.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:12 +02:00
Li Zhang
45714ff75c btrfs: print message on device opening error during mount
[ENHANCEMENT]
When mounting a btrfs filesystem, the filesystem opens the block device,
and if this fails, there is no message about it. Print a message about
it to help debugging.

[TEST]
I have a btrfs filesystem on three block devices, one of which is
write-protected, so regular mounts fail, but there is no message in
dmesg.

  /dev/vdb normal
  /dev/vdc write protected
  /dev/vdd normal

  Before patch:
  $ sudo mount /dev/vdb /mnt/
  mount: mount(2) failed: no such file or directory
  $ sudo dmesg # Show only messages about missing block devices
  ....
  [ 352.947196] BTRFS error (device vdb): devid 2 uuid 4ee2c625-a3b2-4fe0-b411-756b23e08533 missing
  ....

  After patch:
  $ sudo mount /dev/vdb /mnt/
  mount: mount(2) failed: no such file or directory
  $ sudo dmesg # Show bdev_file_open_by_path failed.
  ....
  [ 352.944328] BTRFS error: failed to open device for path /dev/vdc with flags 0x3: -13
  [ 352.947196] BTRFS error (device vdb): missing devid 2 uuid 4ee2c625-a3b2-4fe0-b411-756b23e08533
  ....

Signed-off-by: Li Zhang <zhanglikernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:12 +02:00
Qu Wenruo
c92bf5df8a btrfs: move uuid tree related code to uuid-tree.[ch]
Functions btrfs_uuid_scan_kthread() and btrfs_create_uuid_tree() are for
UUID tree rescan and creation, it's not suitable for volumes.[ch].

Move them to uuid-tree.[ch] instead.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:12 +02:00
Filipe Manana
ab094670fa btrfs: reduce size and overhead of extent_map_block_end()
At extent_map_block_end() we are calling the inline functions
extent_map_block_start() and extent_map_block_len() multiple times, which
results in expanding their code multiple times, increasing the compiled
code size and repeating the computations those functions do.

Improve this by caching their results in local variables.

The size of the module before this change:

   $ size fs/btrfs/btrfs.ko
      text	   data	    bss	    dec	    hex	filename
   1755770	 163800	  16920	1936490	 1d8c6a	fs/btrfs/btrfs.ko

And after this change:

   $ size fs/btrfs/btrfs.ko
      text	   data	    bss	    dec	    hex	filename
   1755656	 163800	  16920	1936376	 1d8bf8	fs/btrfs/btrfs.ko

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:12 +02:00
Johannes Thumshirn
7fa5230b46 btrfs: update stripe_extent delete loop assumptions
btrfs_delete_raid_extent() was written under the assumption, that it's
call-chain always passes a start, length tuple that matches a single
extent. But btrfs_delete_raid_extent() is called by
do_free_extent_accounting() which in turn is called by
__btrfs_free_extent().

But this call-chain passes in a start address and a length that can
possibly match multiple on-disk extents.

To make this possible, we have to adjust the start and length of each
btree node lookup, to not delete beyond the requested range.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:12 +02:00
Johannes Thumshirn
8c4cba2adb btrfs: update stripe extents for existing logical addresses
Update a stripe extent in case of an already existing logical address,
but with different physical addresses and/or device id instead of
bailing out with EEXIST.

This can happen i.e. in case of a device replace operation, where data
extents get rewritten to a new disk.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:12 +02:00
Filipe Manana
cd9253c23a btrfs: fix race between direct IO write and fsync when using same fd
If we have 2 threads that are using the same file descriptor and one of
them is doing direct IO writes while the other is doing fsync, we have a
race where we can end up either:

1) Attempt a fsync without holding the inode's lock, triggering an
   assertion failures when assertions are enabled;

2) Do an invalid memory access from the fsync task because the file private
   points to memory allocated on stack by the direct IO task and it may be
   used by the fsync task after the stack was destroyed.

The race happens like this:

1) A user space program opens a file descriptor with O_DIRECT;

2) The program spawns 2 threads using libpthread for example;

3) One of the threads uses the file descriptor to do direct IO writes,
   while the other calls fsync using the same file descriptor.

4) Call task A the thread doing direct IO writes and task B the thread
   doing fsyncs;

5) Task A does a direct IO write, and at btrfs_direct_write() sets the
   file's private to an on stack allocated private with the member
   'fsync_skip_inode_lock' set to true;

6) Task B enters btrfs_sync_file() and sees that there's a private
   structure associated to the file which has 'fsync_skip_inode_lock' set
   to true, so it skips locking the inode's VFS lock;

7) Task A completes the direct IO write, and resets the file's private to
   NULL since it had no prior private and our private was stack allocated.
   Then it unlocks the inode's VFS lock;

8) Task B enters btrfs_get_ordered_extents_for_logging(), then the
   assertion that checks the inode's VFS lock is held fails, since task B
   never locked it and task A has already unlocked it.

The stack trace produced is the following:

   assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
   ------------[ cut here ]------------
   kernel BUG at fs/btrfs/ordered-data.c:983!
   Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
   CPU: 9 PID: 5072 Comm: worker Tainted: G     U     OE      6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
   Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
   RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
   Code: 50 d6 86 c0 e8 (...)
   RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
   RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
   RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
   RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
   R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
   R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
   FS:  00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
   Call Trace:
    <TASK>
    ? __die_body.cold+0x14/0x24
    ? die+0x2e/0x50
    ? do_trap+0xca/0x110
    ? do_error_trap+0x6a/0x90
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? exc_invalid_op+0x50/0x70
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? asm_exc_invalid_op+0x1a/0x20
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
    ? __seccomp_filter+0x31d/0x4f0
    __x64_sys_fdatasync+0x4f/0x90
    do_syscall_64+0x82/0x160
    ? do_futex+0xcb/0x190
    ? __x64_sys_futex+0x10e/0x1d0
    ? switch_fpu_return+0x4f/0xd0
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    ? syscall_exit_to_user_mode+0x72/0x220
    ? do_syscall_64+0x8e/0x160
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

Another problem here is if task B grabs the private pointer and then uses
it after task A has finished, since the private was allocated in the stack
of task A, it results in some invalid memory access with a hard to predict
result.

This issue, triggering the assertion, was observed with QEMU workloads by
two users in the Link tags below.

Fix this by not relying on a file's private to pass information to fsync
that it should skip locking the inode and instead pass this information
through a special value stored in current->journal_info. This is safe
because in the relevant section of the direct IO write path we are not
holding a transaction handle, so current->journal_info is NULL.

The following C program triggers the issue:

   $ cat repro.c
   /* Get the O_DIRECT definition. */
   #ifndef _GNU_SOURCE
   #define _GNU_SOURCE
   #endif

   #include <stdio.h>
   #include <stdlib.h>
   #include <unistd.h>
   #include <stdint.h>
   #include <fcntl.h>
   #include <errno.h>
   #include <string.h>
   #include <pthread.h>

   static int fd;

   static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
   {
       while (count > 0) {
           ssize_t ret;

           ret = pwrite(fd, buf, count, offset);
           if (ret < 0) {
               if (errno == EINTR)
                   continue;
               return ret;
           }
           count -= ret;
           buf += ret;
       }
       return 0;
   }

   static void *fsync_loop(void *arg)
   {
       while (1) {
           int ret;

           ret = fsync(fd);
           if (ret != 0) {
               perror("Fsync failed");
               exit(6);
           }
       }
   }

   int main(int argc, char *argv[])
   {
       long pagesize;
       void *write_buf;
       pthread_t fsyncer;
       int ret;

       if (argc != 2) {
           fprintf(stderr, "Use: %s <file path>\n", argv[0]);
           return 1;
       }

       fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
       if (fd == -1) {
           perror("Failed to open/create file");
           return 1;
       }

       pagesize = sysconf(_SC_PAGE_SIZE);
       if (pagesize == -1) {
           perror("Failed to get page size");
           return 2;
       }

       ret = posix_memalign(&write_buf, pagesize, pagesize);
       if (ret) {
           perror("Failed to allocate buffer");
           return 3;
       }

       ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
       if (ret != 0) {
           fprintf(stderr, "Failed to create writer thread: %d\n", ret);
           return 4;
       }

       while (1) {
           ret = do_write(fd, write_buf, pagesize, 0);
           if (ret != 0) {
               perror("Write failed");
               exit(5);
           }
       }

       return 0;
   }

   $ mkfs.btrfs -f /dev/sdi
   $ mount /dev/sdi /mnt/sdi
   $ timeout 10 ./repro /mnt/sdi/foo

Usually the race is triggered within less than 1 second. A test case for
fstests will follow soon.

Reported-by: Paulo Dias <paulo.miguel.dias@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187
Reported-by: Andreas Jahn <jahn-andi@web.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199
Reported-by: syzbot+4704b3cc972bd76024f1@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/
Fixes: 939b656bc8 ("btrfs: fix corruption after buffer fault in during direct IO append write")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-03 20:29:55 +02:00
Naohiro Aota
b1934cd606 btrfs: zoned: handle broken write pointer on zones
Btrfs rejects to mount a FS if it finds a block group with a broken write
pointer (e.g, unequal write pointers on two zones of RAID1 block group).
Since such case can happen easily with a power-loss or crash of a system,
we need to handle the case more gently.

Handle such block group by making it unallocatable, so that there will be
no writes into it. That can be done by setting the allocation pointer at
the end of allocating region (= block_group->zone_capacity). Then, existing
code handle zone_unusable properly.

Having proper zone_capacity is necessary for the change. So, set it as fast
as possible.

We cannot handle RAID0 and RAID10 case like this. But, they are anyway
unable to read because of a missing stripe.

Fixes: 265f7237dd ("btrfs: zoned: allow DUP on meta-data block groups")
Fixes: 568220fa96 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
CC: stable@vger.kernel.org # 6.1+
Reported-by: HAN Yuwei <hrx@bupt.moe>
Cc: Xuefer <xuefer@gmail.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02 23:39:34 +02:00
Fedor Pchelkin
c346c62976 btrfs: qgroup: don't use extent changeset when not needed
The local extent changeset is passed to clear_record_extent_bits() where
it may have some additional memory dynamically allocated for ulist. When
qgroup is disabled, the memory is leaked because in this case the
changeset is not released upon __btrfs_qgroup_release_data() return.

Since the recorded contents of the changeset are not used thereafter, just
don't pass it.

Found by Linux Verification Center (linuxtesting.org) with Syzkaller.

Reported-by: syzbot+81670362c283f3dd889c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/000000000000aa8c0c060ade165e@google.com
Fixes: af0e2aab3b ("btrfs: qgroup: flush reservations during quota disable")
CC: stable@vger.kernel.org # 6.10+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02 20:18:08 +02:00
Christoph Hellwig
b35243a447 block: rework bio splitting
The current setup with bio_may_exceed_limit and __bio_split_to_limits
is a bit of a mess.

Change it so that __bio_split_to_limits does all the work and is just
a variant of bio_split_to_limits that returns nr_segs.  This is done
by inlining it and instead have the various bio_split_* helpers directly
submit the potentially split bios.

To support btrfs, the rw version has a lower level helper split out
that just returns the offset to split.  This turns out to nicely clean
up the btrfs flow as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29 04:32:32 -06:00
Filipe Manana
ecb54277cb btrfs: fix uninitialized return value from btrfs_reclaim_sweep()
The return variable 'ret' at btrfs_reclaim_sweep() is never assigned if
none of the space infos is reclaimable (for example if periodic reclaim
is disabled, which is the default), so we return an undefined value.

This can be fixed my making btrfs_reclaim_sweep() not return any value
as well as do_reclaim_sweep() because:

1) do_reclaim_sweep() always returns 0, so we can make it return void;

2) The only caller of btrfs_reclaim_sweep() (btrfs_reclaim_bgs()) doesn't
   care about its return value, and in its context there's nothing to do
   about any errors anyway.

Therefore remove the return value from btrfs_reclaim_sweep() and
do_reclaim_sweep().

Fixes: e4ca3932ae ("btrfs: periodic block_group reclaim")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-27 16:42:09 +02:00
Qu Wenruo
10d9d8c351 btrfs: fix a use-after-free when hitting errors inside btrfs_submit_chunk()
[BUG]
There is an internal report that KASAN is reporting use-after-free, with
the following backtrace:

  BUG: KASAN: slab-use-after-free in btrfs_check_read_bio+0xa68/0xb70 [btrfs]
  Read of size 4 at addr ffff8881117cec28 by task kworker/u16:2/45
  CPU: 1 UID: 0 PID: 45 Comm: kworker/u16:2 Not tainted 6.11.0-rc2-next-20240805-default+ #76
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
  Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
  Call Trace:
   dump_stack_lvl+0x61/0x80
   print_address_description.constprop.0+0x5e/0x2f0
   print_report+0x118/0x216
   kasan_report+0x11d/0x1f0
   btrfs_check_read_bio+0xa68/0xb70 [btrfs]
   process_one_work+0xce0/0x12a0
   worker_thread+0x717/0x1250
   kthread+0x2e3/0x3c0
   ret_from_fork+0x2d/0x70
   ret_from_fork_asm+0x11/0x20

  Allocated by task 20917:
   kasan_save_stack+0x37/0x60
   kasan_save_track+0x10/0x30
   __kasan_slab_alloc+0x7d/0x80
   kmem_cache_alloc_noprof+0x16e/0x3e0
   mempool_alloc_noprof+0x12e/0x310
   bio_alloc_bioset+0x3f0/0x7a0
   btrfs_bio_alloc+0x2e/0x50 [btrfs]
   submit_extent_page+0x4d1/0xdb0 [btrfs]
   btrfs_do_readpage+0x8b4/0x12a0 [btrfs]
   btrfs_readahead+0x29a/0x430 [btrfs]
   read_pages+0x1a7/0xc60
   page_cache_ra_unbounded+0x2ad/0x560
   filemap_get_pages+0x629/0xa20
   filemap_read+0x335/0xbf0
   vfs_read+0x790/0xcb0
   ksys_read+0xfd/0x1d0
   do_syscall_64+0x6d/0x140
   entry_SYSCALL_64_after_hwframe+0x4b/0x53

  Freed by task 20917:
   kasan_save_stack+0x37/0x60
   kasan_save_track+0x10/0x30
   kasan_save_free_info+0x37/0x50
   __kasan_slab_free+0x4b/0x60
   kmem_cache_free+0x214/0x5d0
   bio_free+0xed/0x180
   end_bbio_data_read+0x1cc/0x580 [btrfs]
   btrfs_submit_chunk+0x98d/0x1880 [btrfs]
   btrfs_submit_bio+0x33/0x70 [btrfs]
   submit_one_bio+0xd4/0x130 [btrfs]
   submit_extent_page+0x3ea/0xdb0 [btrfs]
   btrfs_do_readpage+0x8b4/0x12a0 [btrfs]
   btrfs_readahead+0x29a/0x430 [btrfs]
   read_pages+0x1a7/0xc60
   page_cache_ra_unbounded+0x2ad/0x560
   filemap_get_pages+0x629/0xa20
   filemap_read+0x335/0xbf0
   vfs_read+0x790/0xcb0
   ksys_read+0xfd/0x1d0
   do_syscall_64+0x6d/0x140
   entry_SYSCALL_64_after_hwframe+0x4b/0x53

[CAUSE]
Although I cannot reproduce the error, the report itself is good enough
to pin down the cause.

The call trace is the regular endio workqueue context, but the
free-by-task trace is showing that during btrfs_submit_chunk() we
already hit a critical error, and is calling btrfs_bio_end_io() to error
out.  And the original endio function called bio_put() to free the whole
bio.

This means a double freeing thus causing use-after-free, e.g.:

1. Enter btrfs_submit_bio() with a read bio
   The read bio length is 128K, crossing two 64K stripes.

2. The first run of btrfs_submit_chunk()

2.1 Call btrfs_map_block(), which returns 64K
2.2 Call btrfs_split_bio()
    Now there are two bios, one referring to the first 64K, the other
    referring to the second 64K.
2.3 The first half is submitted.

3. The second run of btrfs_submit_chunk()

3.1 Call btrfs_map_block(), which by somehow failed
    Now we call btrfs_bio_end_io() to handle the error

3.2 btrfs_bio_end_io() calls the original endio function
    Which is end_bbio_data_read(), and it calls bio_put() for the
    original bio.

    Now the original bio is freed.

4. The submitted first 64K bio finished
   Now we call into btrfs_check_read_bio() and tries to advance the bio
   iter.
   But since the original bio (thus its iter) is already freed, we
   trigger the above use-after free.

   And even if the memory is not poisoned/corrupted, we will later call
   the original endio function, causing a double freeing.

[FIX]
Instead of calling btrfs_bio_end_io(), call btrfs_orig_bbio_end_io(),
which has the extra check on split bios and do the proper refcounting
for cloned bios.

Furthermore there is already one extra btrfs_cleanup_bio() call, but
that is duplicated to btrfs_orig_bbio_end_io() call, so remove that
label completely.

Reported-by: David Sterba <dsterba@suse.com>
Fixes: 852eee62d3 ("btrfs: allow btrfs_submit_bio to split bios")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-27 01:34:08 +02:00
David Sterba
33f58a0480 btrfs: initialize last_extent_end to fix -Wmaybe-uninitialized warning in extent_fiemap()
There's a warning (probably on some older compiler version):

fs/btrfs/fiemap.c: warning: 'last_extent_end' may be used uninitialized in this function [-Wmaybe-uninitialized]:  => 822:19

Initialize the variable to 0 although it's not necessary as it's either
properly set or not used after an error. The called function is in the
same file so this is a false alert but we want to fix all
-Wmaybe-uninitialized reports.

Link: https://lore.kernel.org/all/20240819070639.2558629-1-geert@linux-m68k.org/
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-26 16:58:13 +02:00
Josef Bacik
2d34472610 btrfs: run delayed iputs when flushing delalloc
We have transient failures with btrfs/301, specifically in the part
where we do

  for i in $(seq 0 10); do
	  write 50m to file
	  rm -f file
  done

Sometimes this will result in a transient quota error, and it's because
sometimes we start writeback on the file which results in a delayed
iput, and thus the rm doesn't actually clean the file up.  When we're
flushing the quota space we need to run the delayed iputs to make sure
all the unlinks that we think have completed have actually completed.
This removes the small window where we could fail to find enough space
in our quota.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-25 19:15:34 +02:00
Qu Wenruo
534f7eff92 btrfs: only enable extent map shrinker for DEBUG builds
Although there are several patches improving the extent map shrinker,
there are still reports of too frequent shrinker behavior, taking too
much CPU for the kswapd process.

So let's only enable extent shrinker for now, until we got more
comprehensive understanding and a better solution.

Link: https://lore.kernel.org/linux-btrfs/3df4acd616a07ef4d2dc6bad668701504b412ffc.camel@intelfx.name/
Link: https://lore.kernel.org/linux-btrfs/c30fd6b3-ca7a-4759-8a53-d42878bf84f7@gmail.com/
Fixes: 956a17d9d0 ("btrfs: add a shrinker for extent maps")
CC: stable@vger.kernel.org # 6.10+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-16 21:22:39 +02:00
Naohiro Aota
e30729d4bd btrfs: zoned: properly take lock to read/update block group's zoned variables
__btrfs_add_free_space_zoned() references and modifies bg's alloc_offset,
ro, and zone_unusable, but without taking the lock. It is mostly safe
because they monotonically increase (at least for now) and this function is
mostly called by a transaction commit, which is serialized by itself.

Still, taking the lock is a safer and correct option and I'm going to add a
change to reset zone_unusable while a block group is still alive. So, add
locking around the operations.

Fixes: 169e0da91a ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-15 20:35:56 +02:00
Qu Wenruo
008e2512dc btrfs: tree-checker: add dev extent item checks
[REPORT]
There is a corruption report that btrfs refused to mount a fs that has
overlapping dev extents:

  BTRFS error (device sdc): dev extent devid 4 physical offset 14263979671552 overlap with previous dev extent end 14263980982272
  BTRFS error (device sdc): failed to verify dev extents against chunks: -117
  BTRFS error (device sdc): open_ctree failed

[CAUSE]
The direct cause is very obvious, there is a bad dev extent item with
incorrect length.

With btrfs check reporting two overlapping extents, the second one shows
some clue on the cause:

  ERROR: dev extent devid 4 offset 14263979671552 len 6488064 overlap with previous dev extent end 14263980982272
  ERROR: dev extent devid 13 offset 2257707008000 len 6488064 overlap with previous dev extent end 2257707270144
  ERROR: errors found in extent allocation tree or chunk allocation

The second one looks like a bitflip happened during new chunk
allocation:
hex(2257707008000) = 0x20da9d30000
hex(2257707270144) = 0x20da9d70000
diff               = 0x00000040000

So it looks like a bitflip happened during new dev extent allocation,
resulting the second overlap.

Currently we only do the dev-extent verification at mount time, but if the
corruption is caused by memory bitflip, we really want to catch it before
writing the corruption to the storage.

Furthermore the dev extent items has the following key definition:

	(<device id> DEV_EXTENT <physical offset>)

Thus we can not just rely on the generic key order check to make sure
there is no overlapping.

[ENHANCEMENT]
Introduce dedicated dev extent checks, including:

- Fixed member checks
  * chunk_tree should always be BTRFS_CHUNK_TREE_OBJECTID (3)
  * chunk_objectid should always be
    BTRFS_FIRST_CHUNK_CHUNK_TREE_OBJECTID (256)

- Alignment checks
  * chunk_offset should be aligned to sectorsize
  * length should be aligned to sectorsize
  * key.offset should be aligned to sectorsize

- Overlap checks
  If the previous key is also a dev-extent item, with the same
  device id, make sure we do not overlap with the previous dev extent.

Reported: Stefan N <stefannnau@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+W5K0rSO3koYTo=nzxxTm1-Pdu1HYgVxEpgJ=aGc7d=E8mGEg@mail.gmail.com/
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-15 20:35:52 +02:00
Jeff Layton
3bc2ac2f8f btrfs: update target inode's ctime on unlink
Unlink changes the link count on the target inode. POSIX mandates that
the ctime must also change when this occurs.

According to https://pubs.opengroup.org/onlinepubs/9699919799/functions/unlink.html:

"Upon successful completion, unlink() shall mark for update the last data
 modification and last file status change timestamps of the parent
 directory. Also, if the file's link count is not 0, the last file status
 change timestamp of the file shall be marked for update."

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add link to the opengroup docs ]
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-15 20:35:44 +02:00
Thorsten Blum
c0247d289e btrfs: send: annotate struct name_cache_entry with __counted_by()
Add the __counted_by compiler attribute to the flexible array member
name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-15 20:35:32 +02:00
Naohiro Aota
6252690f7e btrfs: fix invalid mapping of extent xarray state
In __extent_writepage_io(), we call btrfs_set_range_writeback() ->
folio_start_writeback(), which clears PAGECACHE_TAG_DIRTY mark from the
mapping xarray if the folio is not dirty. This worked fine before commit
97713b1a2c ("btrfs: do not clear page dirty inside
extent_write_locked_range()").

After the commit, however, the folio is still dirty at this point, so the
mapping DIRTY tag is not cleared anymore. Then, __extent_writepage_io()
calls btrfs_folio_clear_dirty() to clear the folio's dirty flag. That
results in the page being unlocked with a "strange" state. The page is not
PageDirty, but the mapping tag is set as PAGECACHE_TAG_DIRTY.

This strange state looks like causing a hang with a call trace below when
running fstests generic/091 on a null_blk device. It is waiting for a folio
lock.

While I don't have an exact relation between this hang and the strange
state, fixing the state also fixes the hang. And, that state is worth
fixing anyway.

This commit reorders btrfs_folio_clear_dirty() and
btrfs_set_range_writeback() in __extent_writepage_io(), so that the
PAGECACHE_TAG_DIRTY tag is properly removed from the xarray.

  [464.274] task:fsx             state:D stack:0     pid:3034  tgid:3034  ppid:2853   flags:0x00004002
  [464.286] Call Trace:
  [464.291]  <TASK>
  [464.295]  __schedule+0x10ed/0x6260
  [464.301]  ? __pfx___blk_flush_plug+0x10/0x10
  [464.308]  ? __submit_bio+0x37c/0x450
  [464.314]  ? __pfx___schedule+0x10/0x10
  [464.321]  ? lock_release+0x567/0x790
  [464.327]  ? __pfx_lock_acquire+0x10/0x10
  [464.334]  ? __pfx_lock_release+0x10/0x10
  [464.340]  ? __pfx_lock_acquire+0x10/0x10
  [464.347]  ? __pfx_lock_release+0x10/0x10
  [464.353]  ? do_raw_spin_lock+0x12e/0x270
  [464.360]  schedule+0xdf/0x3b0
  [464.365]  io_schedule+0x8f/0xf0
  [464.371]  folio_wait_bit_common+0x2ca/0x6d0
  [464.378]  ? folio_wait_bit_common+0x1cc/0x6d0
  [464.385]  ? __pfx_folio_wait_bit_common+0x10/0x10
  [464.392]  ? __pfx_filemap_get_folios_tag+0x10/0x10
  [464.400]  ? __pfx_wake_page_function+0x10/0x10
  [464.407]  ? __pfx___might_resched+0x10/0x10
  [464.414]  ? do_raw_spin_unlock+0x58/0x1f0
  [464.420]  extent_write_cache_pages+0xe49/0x1620 [btrfs]
  [464.428]  ? lock_acquire+0x435/0x500
  [464.435]  ? __pfx_extent_write_cache_pages+0x10/0x10 [btrfs]
  [464.443]  ? btrfs_do_write_iter+0x493/0x640 [btrfs]
  [464.451]  ? orc_find.part.0+0x1d4/0x380
  [464.457]  ? __pfx_lock_release+0x10/0x10
  [464.464]  ? __pfx_lock_release+0x10/0x10
  [464.471]  ? btrfs_do_write_iter+0x493/0x640 [btrfs]
  [464.478]  btrfs_writepages+0x1cc/0x460 [btrfs]
  [464.485]  ? __pfx_btrfs_writepages+0x10/0x10 [btrfs]
  [464.493]  ? is_bpf_text_address+0x6e/0x100
  [464.500]  ? kernel_text_address+0x145/0x160
  [464.507]  ? unwind_get_return_address+0x5e/0xa0
  [464.514]  ? arch_stack_walk+0xac/0x100
  [464.521]  do_writepages+0x176/0x780
  [464.527]  ? lock_release+0x567/0x790
  [464.533]  ? __pfx_do_writepages+0x10/0x10
  [464.540]  ? __pfx_lock_acquire+0x10/0x10
  [464.546]  ? __pfx_stack_trace_save+0x10/0x10
  [464.553]  ? do_raw_spin_lock+0x12e/0x270
  [464.560]  ? do_raw_spin_unlock+0x58/0x1f0
  [464.566]  ? _raw_spin_unlock+0x23/0x40
  [464.573]  ? wbc_attach_and_unlock_inode+0x3da/0x7d0
  [464.580]  filemap_fdatawrite_wbc+0x113/0x180
  [464.587]  ? prepare_pages.constprop.0+0x13c/0x5c0 [btrfs]
  [464.596]  __filemap_fdatawrite_range+0xaf/0xf0
  [464.603]  ? __pfx___filemap_fdatawrite_range+0x10/0x10
  [464.611]  ? trace_irq_enable.constprop.0+0xce/0x110
  [464.618]  ? kasan_quarantine_put+0xd7/0x1e0
  [464.625]  btrfs_start_ordered_extent+0x46f/0x570 [btrfs]
  [464.633]  ? __pfx_btrfs_start_ordered_extent+0x10/0x10 [btrfs]
  [464.642]  ? __clear_extent_bit+0x2c0/0x9d0 [btrfs]
  [464.650]  btrfs_lock_and_flush_ordered_range+0xc6/0x180 [btrfs]
  [464.659]  ? __pfx_btrfs_lock_and_flush_ordered_range+0x10/0x10 [btrfs]
  [464.669]  btrfs_read_folio+0x12a/0x1d0 [btrfs]
  [464.676]  ? __pfx_btrfs_read_folio+0x10/0x10 [btrfs]
  [464.684]  ? __pfx_filemap_add_folio+0x10/0x10
  [464.691]  ? __pfx___might_resched+0x10/0x10
  [464.698]  ? __filemap_get_folio+0x1c5/0x450
  [464.705]  prepare_uptodate_page+0x12e/0x4d0 [btrfs]
  [464.713]  prepare_pages.constprop.0+0x13c/0x5c0 [btrfs]
  [464.721]  ? fault_in_iov_iter_readable+0xd2/0x240
  [464.729]  btrfs_buffered_write+0x5bd/0x12f0 [btrfs]
  [464.737]  ? __pfx_btrfs_buffered_write+0x10/0x10 [btrfs]
  [464.745]  ? __pfx_lock_release+0x10/0x10
  [464.752]  ? generic_write_checks+0x275/0x400
  [464.759]  ? down_write+0x118/0x1f0
  [464.765]  ? up_write+0x19b/0x500
  [464.770]  btrfs_direct_write+0x731/0xba0 [btrfs]
  [464.778]  ? __pfx_btrfs_direct_write+0x10/0x10 [btrfs]
  [464.785]  ? __pfx___might_resched+0x10/0x10
  [464.792]  ? lock_acquire+0x435/0x500
  [464.798]  ? lock_acquire+0x435/0x500
  [464.804]  btrfs_do_write_iter+0x494/0x640 [btrfs]
  [464.811]  ? __pfx_btrfs_do_write_iter+0x10/0x10 [btrfs]
  [464.819]  ? __pfx___might_resched+0x10/0x10
  [464.825]  ? rw_verify_area+0x6d/0x590
  [464.831]  vfs_write+0x5d7/0xf50
  [464.837]  ? __might_fault+0x9d/0x120
  [464.843]  ? __pfx_vfs_write+0x10/0x10
  [464.849]  ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
  [464.856]  ? lock_release+0x567/0x790
  [464.862]  ksys_write+0xfb/0x1d0
  [464.867]  ? __pfx_ksys_write+0x10/0x10
  [464.873]  ? _raw_spin_unlock+0x23/0x40
  [464.879]  ? btrfs_getattr+0x4af/0x670 [btrfs]
  [464.886]  ? vfs_getattr_nosec+0x79/0x340
  [464.892]  do_syscall_64+0x95/0x180
  [464.898]  ? __do_sys_newfstat+0xde/0xf0
  [464.904]  ? __pfx___do_sys_newfstat+0x10/0x10
  [464.911]  ? trace_irq_enable.constprop.0+0xce/0x110
  [464.918]  ? syscall_exit_to_user_mode+0xac/0x2a0
  [464.925]  ? do_syscall_64+0xa1/0x180
  [464.931]  ? trace_irq_enable.constprop.0+0xce/0x110
  [464.939]  ? trace_irq_enable.constprop.0+0xce/0x110
  [464.946]  ? syscall_exit_to_user_mode+0xac/0x2a0
  [464.953]  ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
  [464.960]  ? do_syscall_64+0xa1/0x180
  [464.966]  ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
  [464.973]  ? trace_irq_enable.constprop.0+0xce/0x110
  [464.980]  ? syscall_exit_to_user_mode+0xac/0x2a0
  [464.987]  ? __pfx_btrfs_file_llseek+0x10/0x10 [btrfs]
  [464.995]  ? trace_irq_enable.constprop.0+0xce/0x110
  [465.002]  ? __pfx_btrfs_file_llseek+0x10/0x10 [btrfs]
  [465.010]  ? do_syscall_64+0xa1/0x180
  [465.016]  ? lock_release+0x567/0x790
  [465.022]  ? __pfx_lock_acquire+0x10/0x10
  [465.028]  ? __pfx_lock_release+0x10/0x10
  [465.034]  ? trace_irq_enable.constprop.0+0xce/0x110
  [465.042]  ? syscall_exit_to_user_mode+0xac/0x2a0
  [465.049]  ? do_syscall_64+0xa1/0x180
  [465.055]  ? syscall_exit_to_user_mode+0xac/0x2a0
  [465.062]  ? do_syscall_64+0xa1/0x180
  [465.068]  ? syscall_exit_to_user_mode+0xac/0x2a0
  [465.075]  ? do_syscall_64+0xa1/0x180
  [465.081]  ? clear_bhb_loop+0x25/0x80
  [465.087]  ? clear_bhb_loop+0x25/0x80
  [465.093]  ? clear_bhb_loop+0x25/0x80
  [465.099]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [465.106] RIP: 0033:0x7f093b8ee784
  [465.111] RSP: 002b:00007ffc29d31b28 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
  [465.122] RAX: ffffffffffffffda RBX: 0000000000006000 RCX: 00007f093b8ee784
  [465.131] RDX: 000000000001de00 RSI: 00007f093b6ed200 RDI: 0000000000000003
  [465.141] RBP: 000000000001de00 R08: 0000000000006000 R09: 0000000000000000
  [465.150] R10: 0000000000023e00 R11: 0000000000000202 R12: 0000000000006000
  [465.160] R13: 0000000000023e00 R14: 0000000000023e00 R15: 0000000000000001
  [465.170]  </TASK>
  [465.174] INFO: lockdep is turned off.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 97713b1a2c ("btrfs: do not clear page dirty inside extent_write_locked_range()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-13 15:36:57 +02:00
Filipe Manana
46a6e10a1a btrfs: send: allow cloning non-aligned extent if it ends at i_size
If we a find that an extent is shared but its end offset is not sector
size aligned, then we don't clone it and issue write operations instead.
This is because the reflink (remap_file_range) operation does not allow
to clone unaligned ranges, except if the end offset of the range matches
the i_size of the source and destination files (and the start offset is
sector size aligned).

While this is not incorrect because send can only guarantee that a file
has the same data in the source and destination snapshots, it's not
optimal and generates confusion and surprising behaviour for users.

For example, running this test:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  # Use a file size not aligned to any possible sector size.
  file_size=$((1 * 1024 * 1024 + 5)) # 1MB + 5 bytes
  dd if=/dev/random of=$MNT/foo bs=$file_size count=1
  cp --reflink=always $MNT/foo $MNT/bar

  btrfs subvolume snapshot -r $MNT/ $MNT/snap
  rm -f /tmp/send-test
  btrfs send -f /tmp/send-test $MNT/snap

  umount $MNT
  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  btrfs receive -vv -f /tmp/send-test $MNT

  xfs_io -r -c "fiemap -v" $MNT/snap/bar

  umount $MNT

Gives the following result:

  (...)
  mkfile o258-7-0
  rename o258-7-0 -> bar
  write bar - offset=0 length=49152
  write bar - offset=49152 length=49152
  write bar - offset=98304 length=49152
  write bar - offset=147456 length=49152
  write bar - offset=196608 length=49152
  write bar - offset=245760 length=49152
  write bar - offset=294912 length=49152
  write bar - offset=344064 length=49152
  write bar - offset=393216 length=49152
  write bar - offset=442368 length=49152
  write bar - offset=491520 length=49152
  write bar - offset=540672 length=49152
  write bar - offset=589824 length=49152
  write bar - offset=638976 length=49152
  write bar - offset=688128 length=49152
  write bar - offset=737280 length=49152
  write bar - offset=786432 length=49152
  write bar - offset=835584 length=49152
  write bar - offset=884736 length=49152
  write bar - offset=933888 length=49152
  write bar - offset=983040 length=49152
  write bar - offset=1032192 length=16389
  chown bar - uid=0, gid=0
  chmod bar - mode=0644
  utimes bar
  utimes
  BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=06d640da-9ca1-604c-b87c-3375175a8eb3, stransid=7
  /mnt/sdi/snap/bar:
   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
     0: [0..2055]:       26624..28679      2056   0x1

There's no clone operation to clone extents from the file foo into file
bar and fiemap confirms there's no shared flag (0x2000).

So update send_write_or_clone() so that it proceeds with cloning if the
source and destination ranges end at the i_size of the respective files.

After this changes the result of the test is:

  (...)
  mkfile o258-7-0
  rename o258-7-0 -> bar
  clone bar - source=foo source offset=0 offset=0 length=1048581
  chown bar - uid=0, gid=0
  chmod bar - mode=0644
  utimes bar
  utimes
  BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=582420f3-ea7d-564e-bbe5-ce440d622190, stransid=7
  /mnt/sdi/snap/bar:
   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
     0: [0..2055]:       26624..28679      2056 0x2001

A test case for fstests will also follow up soon.

Link: https://github.com/kdave/btrfs-progs/issues/572#issuecomment-2282841416
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-13 13:45:42 +02:00
Filipe Manana
ae1e766f62 btrfs: only run the extent map shrinker from kswapd tasks
Currently the extent map shrinker can be run by any task when attempting
to allocate memory and there's enough memory pressure to trigger it.

To avoid too much latency we stop iterating over extent maps and removing
them once the task needs to reschedule. This logic was introduced in commit
b3ebb9b7e9 ("btrfs: stop extent map shrinker if reschedule is needed").

While that solved high latency problems for some use cases, it's still
not enough because with a too high number of tasks entering the extent map
shrinker code, either due to memory allocations or because they are a
kswapd task, we end up having a very high level of contention on some
spin locks, namely:

1) The fs_info->fs_roots_radix_lock spin lock, which we need to find
   roots to iterate over their inodes;

2) The spin lock of the xarray used to track open inodes for a root
   (struct btrfs_root::inodes) - on 6.10 kernels and below, it used to
   be a red black tree and the spin lock was root->inode_lock;

3) The fs_info->delayed_iput_lock spin lock since the shrinker adds
   delayed iputs (calls btrfs_add_delayed_iput()).

Instead of allowing the extent map shrinker to be run by any task, make
it run only by kswapd tasks. This still solves the problem of running
into OOM situations due to an unbounded extent map creation, which is
simple to trigger by direct IO writes, as described in the changelog
of commit 956a17d9d0 ("btrfs: add a shrinker for extent maps"), and
by a similar case when doing buffered IO on files with a very large
number of holes (keeping the file open and creating many holes, whose
extent maps are only released when the file is closed).

Reported-by: kzd <kzd@56709.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219121
Reported-by: Octavia Togami <octavia.togami@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAHPNGSSt-a4ZZWrtJdVyYnJFscFjP9S7rMcvEMaNSpR556DdLA@mail.gmail.com/
Fixes: 956a17d9d0 ("btrfs: add a shrinker for extent maps")
CC: stable@vger.kernel.org # 6.10+
Tested-by: kzd <kzd@56709.net>
Tested-by: Octavia Togami <octavia.togami@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-13 13:43:28 +02:00
Qu Wenruo
31723c9542 btrfs: tree-checker: reject BTRFS_FT_UNKNOWN dir type
[REPORT]
There is a bug report that kernel is rejecting a mismatching inode mode
and its dir item:

  [ 1881.553937] BTRFS critical (device dm-0): inode mode mismatch with
  dir: inode mode=040700 btrfs type=2 dir type=0

[CAUSE]
It looks like the inode mode is correct, while the dir item type
0 is BTRFS_FT_UNKNOWN, which should not be generated by btrfs at all.

This may be caused by a memory bit flip.

[ENHANCEMENT]
Although tree-checker is not able to do any cross-leaf verification, for
this particular case we can at least reject any dir type with
BTRFS_FT_UNKNOWN.

So here we enhance the dir type check from [0, BTRFS_FT_MAX), to
(0, BTRFS_FT_MAX).
Although the existing corruption can not be fixed just by such enhanced
checking, it should prevent the same 0x2->0x0 bitflip for dir type to
reach disk in the future.

Reported-by: Kota <nospam@kota.moe>
Link: https://lore.kernel.org/linux-btrfs/CACsxjPYnQF9ZF-0OhH16dAx50=BXXOcP74MxBc3BG+xae4vTTw@mail.gmail.com/
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-13 13:42:26 +02:00
Josef Bacik
42fac187b5 btrfs: check delayed refs when we're checking if a ref exists
In the patch 78c52d9eb6 ("btrfs: check for refs on snapshot delete
resume") I added some code to handle file systems that had been
corrupted by a bug that incorrectly skipped updating the drop progress
key while dropping a snapshot.  This code would check to see if we had
already deleted our reference for a child block, and skip the deletion
if we had already.

Unfortunately there is a bug, as the check would only check the on-disk
references.  I made an incorrect assumption that blocks in an already
deleted snapshot that was having the deletion resume on mount wouldn't
be modified.

If we have 2 pending deleted snapshots that share blocks, we can easily
modify the rules for a block.  Take the following example

subvolume a exists, and subvolume b is a snapshot of subvolume a.  They
share references to block 1.  Block 1 will have 2 full references, one
for subvolume a and one for subvolume b, and it belongs to subvolume a
(btrfs_header_owner(block 1) == subvolume a).

When deleting subvolume a, we will drop our full reference for block 1,
and because we are the owner we will drop our full reference for all of
block 1's children, convert block 1 to FULL BACKREF, and add a shared
reference to all of block 1's children.

Then we will start the snapshot deletion of subvolume b.  We look up the
extent info for block 1, which checks delayed refs and tells us that
FULL BACKREF is set, so sets parent to the bytenr of block 1.  However
because this is a resumed snapshot deletion, we call into
check_ref_exists().  Because check_ref_exists() only looks at the disk,
it doesn't find the shared backref for the child of block 1, and thus
returns 0 and we skip deleting the reference for the child of block 1
and continue.  This orphans the child of block 1.

The fix is to lookup the delayed refs, similar to what we do in
btrfs_lookup_extent_info().  However we only care about whether the
reference exists or not.  If we fail to find our reference on disk, go
look up the bytenr in the delayed refs, and if it exists look for an
existing ref in the delayed ref head.  If that exists then we know we
can delete the reference safely and carry on.  If it doesn't exist we
know we have to skip over this block.

This bug has existed since I introduced this fix, however requires
having multiple deleted snapshots pending when we unmount.  We noticed
this in production because our shutdown path stops the container on the
system, which deletes a bunch of subvolumes, and then reboots the box.
This gives us plenty of opportunities to hit this issue.  Looking at the
history we've seen this occasionally in production, but we had a big
spike recently thanks to faster machines getting jobs with multiple
subvolumes in the job.

Chris Mason wrote a reproducer which does the following

mount /dev/nvme4n1 /btrfs
btrfs subvol create /btrfs/s1
simoop -E -f 4k -n 200000 -z /btrfs/s1
while(true) ; do
	btrfs subvol snap /btrfs/s1 /btrfs/s2
	simoop -f 4k -n 200000 -r 10 -z /btrfs/s2
	btrfs subvol snap /btrfs/s2 /btrfs/s3
	btrfs balance start -dusage=80 /btrfs
	btrfs subvol del /btrfs/s2 /btrfs/s3
	umount /btrfs
	btrfsck /dev/nvme4n1 || exit 1
	mount /dev/nvme4n1 /btrfs
done

On the second loop this would fail consistently, with my patch it has
been running for hours and hasn't failed.

I also used dm-log-writes to capture the state of the failure so I could
debug the problem.  Using the existing failure case to test my patch
validated that it fixes the problem.

Fixes: 78c52d9eb6 ("btrfs: check for refs on snapshot delete resume")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-13 13:42:26 +02:00
Al Viro
1da91ea87a introduce fd_file(), convert all accessors to it.
For any changes of struct fd representation we need to
turn existing accesses to fields into calls of wrappers.
Accesses to struct fd::flags are very few (3 in linux/file.h,
1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
explicit initializers).
	Those can be dealt with in the commit converting to
new layout; accesses to struct fd::file are too many for that.
	This commit converts (almost) all of f.file to
fd_file(f).  It's not entirely mechanical ('file' is used as
a member name more than just in struct fd) and it does not
even attempt to distinguish the uses in pointer context from
those in boolean context; the latter will be eventually turned
into a separate helper (fd_empty()).

	NOTE: mass conversion to fd_empty(), tempting as it
might be, is a bad idea; better do that piecewise in commit
that convert from fdget...() to CLASS(...).

[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-08-12 22:00:43 -04:00
Linus Torvalds
6a0e382640 for-6.11-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmazbGYACgkQxWXV+ddt
 WDsQAw//Z3XjjylTZPuHNk/AiMe5oochxB5T9ZracQOzG0o70gj1w/UQIZBkSzp1
 66g8I4YdbwvEXKDg9Oi/GPDSON3GuhAiLXp+0Y/reeD/totgvrhROuJ3mIk5CZ0H
 B4fIKH3xCKLQan26Opgju4qjum7+AR7ekFveM6GicxXXb3eAYALgoEFt63eZjZVu
 7myak78gmBuK5QdGHH+onEhn+HfC57UTGBqu1bJsSOQC7dANkU+WzmgbH6FeOHqx
 2T/lN/tu2tBBoF4zMvC472Zjmj4PnubNQnwwv0oJ8Z2Y0yIY95joZV0uXIXO72oz
 BMQC6s0cltiTn1Tfe4iIWn+ZNjcfGAZO7aoD5NcJb2F/Mz5mZuMhtq3BND0wJ8/+
 SuYk7PxX7tcOaFrDAn3Ne7XHsD7r5lLkFICXkzcNG4dqkBUxR3dN4Oi8KKMFrgCP
 kTpf0/lAkYKYSrU86Yn1zhwRLaH8jm3fulVUY8i/p0tJnpsW8AmeME1Sk97Bhxbb
 y6zt8+MPscPuQi3jPsBevaYd8Q8BIT34vVtU/jNmGAyqEv1wVYQRRK2Ma4sJJfNk
 EEQV9i4VqXWLc17dcWVZrG6lEsVRIIBE/2Adhth6Myq73bkunpB9ZNNNZApf1T2j
 Wn5gPzkuQd6FWrXe8V7oSN9rT1OPn9uHL6BmSUWhHUCuFeATgoc=
 =03Gf
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix double inode unlock for direct IO sync writes (reported by
   syzbot)

 - fix root tree id/name map definitions, don't use fixed size buffers
   for name (reported by -Werror=unterminated-string-initialization)

 - fix qgroup reserve leaks in bufferd write path

 - update scrub status structure more often so it can be reported in
   user space more accurately and let 'resume' not repeat work

 - in preparation to remove space cache v1 in the future print a warning
   if it's detected

* tag 'for-6.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: avoid using fixed char array size for tree names
  btrfs: fix double inode unlock for direct IO sync writes
  btrfs: emit a warning about space cache v1 being deprecated
  btrfs: fix qgroup reserve leaks in cow_file_range
  btrfs: implement launder_folio for clearing dirty page reserve
  btrfs: scrub: update last_physical after scrubbing one stripe
  btrfs: factor out stripe length calculation into a helper
2024-08-07 09:53:41 -07:00
Qu Wenruo
12653ec361 btrfs: avoid using fixed char array size for tree names
[BUG]
There is a bug report that using the latest trunk GCC 15, btrfs would cause
unterminated-string-initialization warning:

  linux-6.6/fs/btrfs/print-tree.c:29:49: error: initializer-string for array of ‘char’ is too long [-Werror=unterminated-string-initialization]
   29 |         { BTRFS_BLOCK_GROUP_TREE_OBJECTID,      "BLOCK_GROUP_TREE"      },
      |
      ^~~~~~~~~~~~~~~~~~

[CAUSE]
To print tree names we have an array of root_name_map structure, which
uses "char name[16];" to store the name string of a tree.

But the following trees have names exactly at 16 chars length:
- "BLOCK_GROUP_TREE"
- "RAID_STRIPE_TREE"

This means we will have no space for the terminating '\0', and can lead
to unexpected access when printing the name.

[FIX]
Instead of "char name[16];" use "const char *" instead.

Since the name strings are all read-only data, and are all NULL
terminated by default, there is not much need to bother the length at
all.

Reported-by: Sam James <sam@gentoo.org>
Reported-by: Alejandro Colomar <alx@kernel.org>
Fixes: edde81f1ab ("btrfs: add raid stripe tree pretty printer")
Fixes: 9c54e80ddc ("btrfs: add code to support the block group root")
CC: stable@vger.kernel.org # 6.1+
Suggested-by: Alejandro Colomar <alx@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Alejandro Colomar <alx@kernel.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-02 22:44:27 +02:00
Filipe Manana
e0391e92f9 btrfs: fix double inode unlock for direct IO sync writes
If we do a direct IO sync write, at btrfs_sync_file(), and we need to skip
inode logging or we get an error starting a transaction or an error when
flushing delalloc, we end up unlocking the inode when we shouldn't under
the 'out_release_extents' label, and then unlock it again at
btrfs_direct_write().

Fix that by checking if we have to skip inode unlocking under that label.

Reported-by: syzbot+7dbbb74af6291b5a5a8b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000dfd631061eaeb4bc@google.com/
Fixes: 939b656bc8 ("btrfs: fix corruption after buffer fault in during direct IO append write")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-02 22:32:40 +02:00
Josef Bacik
1e7bec1f7d btrfs: emit a warning about space cache v1 being deprecated
We've been wanting to get rid of this for a while, add a message to
indicate that this feature is going away and when so we can finally have
a date when we're going to remove it.  The output looks like this

BTRFS warning (device nvme0n1): space cache v1 is being deprecated and will be removed in a future release, please use -o space_cache=v2

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-01 17:30:50 +02:00
Boris Burkov
30479f31d4 btrfs: fix qgroup reserve leaks in cow_file_range
In the buffered write path, the dirty page owns the qgroup reserve until
it creates an ordered_extent.

Therefore, any errors that occur before the ordered_extent is created
must free that reservation, or else the space is leaked. The fstest
generic/475 exercises various IO error paths, and is able to trigger
errors in cow_file_range where we fail to get to allocating the ordered
extent. Note that because we *do* clear delalloc, we are likely to
remove the inode from the delalloc list, so the inodes/pages to not have
invalidate/launder called on them in the commit abort path.

This results in failures at the unmount stage of the test that look like:

  BTRFS: error (device dm-8 state EA) in cleanup_transaction:2018: errno=-5 IO failure
  BTRFS: error (device dm-8 state EA) in btrfs_replace_file_extents:2416: errno=-5 IO failure
  BTRFS warning (device dm-8 state EA): qgroup 0/5 has unreleased space, type 0 rsv 28672
  ------------[ cut here ]------------
  WARNING: CPU: 3 PID: 22588 at fs/btrfs/disk-io.c:4333 close_ctree+0x222/0x4d0 [btrfs]
  Modules linked in: btrfs blake2b_generic libcrc32c xor zstd_compress raid6_pq
  CPU: 3 PID: 22588 Comm: umount Kdump: loaded Tainted: G W          6.10.0-rc7-gab56fde445b8 #21
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
  RIP: 0010:close_ctree+0x222/0x4d0 [btrfs]
  RSP: 0018:ffffb4465283be00 EFLAGS: 00010202
  RAX: 0000000000000001 RBX: ffffa1a1818e1000 RCX: 0000000000000001
  RDX: 0000000000000000 RSI: ffffb4465283bbe0 RDI: ffffa1a19374fcb8
  RBP: ffffa1a1818e13c0 R08: 0000000100028b16 R09: 0000000000000000
  R10: 0000000000000003 R11: 0000000000000003 R12: ffffa1a18ad7972c
  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  FS:  00007f9168312b80(0000) GS:ffffa1a4afcc0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f91683c9140 CR3: 000000010acaa000 CR4: 00000000000006f0
  Call Trace:
   <TASK>
   ? close_ctree+0x222/0x4d0 [btrfs]
   ? __warn.cold+0x8e/0xea
   ? close_ctree+0x222/0x4d0 [btrfs]
   ? report_bug+0xff/0x140
   ? handle_bug+0x3b/0x70
   ? exc_invalid_op+0x17/0x70
   ? asm_exc_invalid_op+0x1a/0x20
   ? close_ctree+0x222/0x4d0 [btrfs]
   generic_shutdown_super+0x70/0x160
   kill_anon_super+0x11/0x40
   btrfs_kill_super+0x11/0x20 [btrfs]
   deactivate_locked_super+0x2e/0xa0
   cleanup_mnt+0xb5/0x150
   task_work_run+0x57/0x80
   syscall_exit_to_user_mode+0x121/0x130
   do_syscall_64+0xab/0x1a0
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f916847a887
  ---[ end trace 0000000000000000 ]---
  BTRFS error (device dm-8 state EA): qgroup reserved space leaked

Cases 2 and 3 in the out_reserve path both pertain to this type of leak
and must free the reserved qgroup data. Because it is already an error
path, I opted not to handle the possible errors in
btrfs_free_qgroup_data.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-01 17:26:52 +02:00
Boris Burkov
872617a089 btrfs: implement launder_folio for clearing dirty page reserve
In the buffered write path, dirty pages can be said to "own" the qgroup
reservation until they create an ordered_extent. It is possible for
there to be outstanding dirty pages when a transaction is aborted, in
which case there is no cancellation path for freeing this reservation
and it is leaked.

We do already walk the list of outstanding delalloc inodes in
btrfs_destroy_delalloc_inodes() and call invalidate_inode_pages2() on them.

This does *not* call btrfs_invalidate_folio(), as one might guess, but
rather calls launder_folio() and release_folio(). Since this is a
reservation associated with dirty pages only, rather than something
associated with the private bit (ordered_extent is cancelled separately
already in the cleanup transaction path), implementing this release
should be done via launder_folio.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-01 17:26:40 +02:00
Qu Wenruo
63447b7dd4 btrfs: scrub: update last_physical after scrubbing one stripe
Currently sctx->stat.last_physical only got updated in the following
cases:

- When the last stripe of a non-RAID56 chunk is scrubbed
  This implies a pitfall, if the last stripe is at the chunk boundary,
  and we finished the scrub of the whole chunk, we won't update
  last_physical at all until the next chunk.

- When a P/Q stripe of a RAID56 chunk is scrubbed

This leads the following two problems:

- sctx->stat.last_physical is not updated for a almost full chunk
  This is especially bad, affecting scrub resume, as the resume would
  start from last_physical, causing unnecessary re-scrub.

- "btrfs scrub status" will not report any progress for a long time

Fix the problem by properly updating @last_physical after each stripe is
scrubbed.

And since we're here, for the sake of consistency, use spin lock to
protect the update of @last_physical, just like all the remaining
call sites touching sctx->stat.

Reported-by: Michel Palleau <michel.palleau@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAMFk-+igFTv2E8svg=cQ6o3e6CrR5QwgQ3Ok9EyRaEvvthpqCQ@mail.gmail.com/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-01 17:15:07 +02:00
Qu Wenruo
33eb1e5db3 btrfs: factor out stripe length calculation into a helper
Currently there are two locations which need to calculate the real
length of a stripe (which can be at the end of a chunk, and the chunk
size may not always be 64K aligned).

Factor them into a helper as we're going to have a third user soon.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-01 17:15:05 +02:00
Linus Torvalds
e4fc196f5b for-6.11-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmapmOQACgkQxWXV+ddt
 WDsXVhAAi4X+xt3o4jcN3IAu08JCQAAyXnFWC3lvn7sqYjSrcccI6ZT4/gAbHss+
 qrifakRGoYQ7fAjYBmhw48HqPmHtI2OQjIUDaIqHQOS68aXShBo9HiE460HRY4GT
 QV/KT0w37E2/R0EDR9gyjLq3ZA3/raxN1n+LNCFhRWmtsAEZrk4XzsADWb05YkIq
 1QBa92DzEhVpd04X8YHIYBgRidWbcYST6xhoWdyL9VZ1pzZsISq5LH67D4f/J1KU
 gXNf+ZnF9DXsQnptJrMsjhx61seJ2F0/vozFZ+l6SjRr0jeysmrJI0dxqQc/hUga
 gbLmdha6ztKdn03JOIL+lfdZYzICFl/2fekSWI2SNcag+TYszACjlFOyHusOgKsa
 3qQwzVB699FheWO5nrOOvOtgq0ZqGsrIvhIXLhA7/bVpNavPnUB7IQCcs8n89ImQ
 hUIebfX1FZnYXTrB6Hhm92LUb0lyLSlW1we3SSmaAMiy1TiXHG7hO2G/sIbOPAJC
 5VzdHf0DEjzEdjmTrGOV7JBfy5JmMK56oN8viZS95p70DYxNGvEOhLs/8n5twpri
 MWV8GElcOjjC+KnGnUH72spsnEKONpdzyccG9kiZEgkEi4csgHSxrkSmAehYD6i6
 MFYk+i7jvZ1VsbOulmdGOLbHS7whxi9pWb/CT3KKF1Ei5/v07bU=
 =JdOX
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix regression in extent map rework when handling insertion of
   overlapping compressed extent

 - fix unexpected file length when appending to a file using direct io
   and buffer not faulted in

 - in zoned mode, fix accounting of unusable space when flipping
   read-only block group back to read-write

 - fix page locking when COWing an inline range, assertion failure found
   by syzbot

 - fix calculation of space info in debugging print

 - tree-checker, add validation of data reference item

 - fix a few -Wmaybe-uninitialized build warnings

* tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry()
  btrfs: fix corruption after buffer fault in during direct IO append write
  btrfs: zoned: fix zone_unusable accounting on making block group read-write again
  btrfs: do not subtract delalloc from avail bytes
  btrfs: make cow_file_range_inline() honor locked_page on error
  btrfs: fix corrupt read due to bad offset of a compressed extent map
  btrfs: tree-checker: validate dref root and objectid
2024-07-30 19:28:36 -07:00
David Sterba
b8e947e9f6 btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry()
Some arch + compiler combinations report a potentially unused variable
location in btrfs_lookup_dentry(). This is a false alert as the variable
is passed by value and always valid or there's an error. The compilers
cannot probably reason about that although btrfs_inode_by_name() is in
the same file.

   >  + /kisskb/src/fs/btrfs/inode.c: error: 'location.objectid' may be used
   +uninitialized in this function [-Werror=maybe-uninitialized]:  => 5603:9
   >  + /kisskb/src/fs/btrfs/inode.c: error: 'location.type' may be used
   +uninitialized in this function [-Werror=maybe-uninitialized]:  => 5674:5

   m68k-gcc8/m68k-allmodconfig
   mips-gcc8/mips-allmodconfig
   powerpc-gcc5/powerpc-all{mod,yes}config
   powerpc-gcc5/ppc64_defconfig

Initialize it to zero, this should fix the warnings and won't change the
behaviour as btrfs_inode_by_name() accepts only a root or inode item
types, otherwise returns an error.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/linux-btrfs/bd4e9928-17b3-9257-8ba7-6b7f9bbb639a@linux-m68k.org/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-30 15:33:06 +02:00
Filipe Manana
939b656bc8 btrfs: fix corruption after buffer fault in during direct IO append write
During an append (O_APPEND write flag) direct IO write if the input buffer
was not previously faulted in, we can corrupt the file in a way that the
final size is unexpected and it includes an unexpected hole.

The problem happens like this:

1) We have an empty file, with size 0, for example;

2) We do an O_APPEND direct IO with a length of 4096 bytes and the input
   buffer is not currently faulted in;

3) We enter btrfs_direct_write(), lock the inode and call
   generic_write_checks(), which calls generic_write_checks_count(), and
   that function sets the iocb position to 0 with the following code:

	if (iocb->ki_flags & IOCB_APPEND)
		iocb->ki_pos = i_size_read(inode);

4) We call btrfs_dio_write() and enter into iomap, which will end up
   calling btrfs_dio_iomap_begin() and that calls
   btrfs_get_blocks_direct_write(), where we update the i_size of the
   inode to 4096 bytes;

5) After btrfs_dio_iomap_begin() returns, iomap will attempt to access
   the page of the write input buffer (at iomap_dio_bio_iter(), with a
   call to bio_iov_iter_get_pages()) and fail with -EFAULT, which gets
   returned to btrfs at btrfs_direct_write() via btrfs_dio_write();

6) At btrfs_direct_write() we get the -EFAULT error, unlock the inode,
   fault in the write buffer and then goto to the label 'relock';

7) We lock again the inode, do all the necessary checks again and call
   again generic_write_checks(), which calls generic_write_checks_count()
   again, and there we set the iocb's position to 4K, which is the current
   i_size of the inode, with the following code pointed above:

        if (iocb->ki_flags & IOCB_APPEND)
                iocb->ki_pos = i_size_read(inode);

8) Then we go again to btrfs_dio_write() and enter iomap and the write
   succeeds, but it wrote to the file range [4K, 8K), leaving a hole in
   the [0, 4K) range and an i_size of 8K, which goes against the
   expectations of having the data written to the range [0, 4K) and get an
   i_size of 4K.

Fix this by not unlocking the inode before faulting in the input buffer,
in case we get -EFAULT or an incomplete write, and not jumping to the
'relock' label after faulting in the buffer - instead jump to a location
immediately before calling iomap, skipping all the write checks and
relocking. This solves this problem and it's fine even in case the input
buffer is memory mapped to the same file range, since only holding the
range locked in the inode's io tree can cause a deadlock, it's safe to
keep the inode lock (VFS lock), as was fixed and described in commit
51bd9563b6 ("btrfs: fix deadlock due to page faults during direct IO
reads and writes").

A sample reproducer provided by a reporter is the following:

   $ cat test.c
   #ifndef _GNU_SOURCE
   #define _GNU_SOURCE
   #endif

   #include <fcntl.h>
   #include <stdio.h>
   #include <sys/mman.h>
   #include <sys/stat.h>
   #include <unistd.h>

   int main(int argc, char *argv[])
   {
       if (argc < 2) {
           fprintf(stderr, "Usage: %s <test file>\n", argv[0]);
           return 1;
       }

       int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT |
                     O_APPEND, 0644);
       if (fd < 0) {
           perror("creating test file");
           return 1;
       }

       char *buf = mmap(NULL, 4096, PROT_READ,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
       ssize_t ret = write(fd, buf, 4096);
       if (ret < 0) {
           perror("pwritev2");
           return 1;
       }

       struct stat stbuf;
       ret = fstat(fd, &stbuf);
       if (ret < 0) {
           perror("stat");
           return 1;
       }

       printf("size: %llu\n", (unsigned long long)stbuf.st_size);
       return stbuf.st_size == 4096 ? 0 : 1;
   }

A test case for fstests will be sent soon.

Reported-by: Hanna Czenczek <hreitz@redhat.com>
Link: https://lore.kernel.org/linux-btrfs/0b841d46-12fe-4e64-9abb-871d8d0de271@redhat.com/
Fixes: 8184620ae2 ("btrfs: fix lost file sync on direct IO write with nowait and dsync iocb")
CC: stable@vger.kernel.org # 6.1+
Tested-by: Hanna Czenczek <hreitz@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-29 19:21:22 +02:00
Naohiro Aota
8cd44dd1d1 btrfs: zoned: fix zone_unusable accounting on making block group read-write again
When btrfs makes a block group read-only, it adds all free regions in the
block group to space_info->bytes_readonly. That free space excludes
reserved and pinned regions. OTOH, when btrfs makes the block group
read-write again, it moves all the unused regions into the block group's
zone_unusable. That unused region includes reserved and pinned regions.
As a result, it counts too much zone_unusable bytes.

Fortunately (or unfortunately), having erroneous zone_unusable does not
affect the calculation of space_info->bytes_readonly, because free
space (num_bytes in btrfs_dec_block_group_ro) calculation is done based on
the erroneous zone_unusable and it reduces the num_bytes just to cancel the
error.

This behavior can be easily discovered by adding a WARN_ON to check e.g,
"bg->pinned > 0" in btrfs_dec_block_group_ro(), and running fstests test
case like btrfs/282.

Fix it by properly considering pinned and reserved in
btrfs_dec_block_group_ro(). Also, add a WARN_ON and introduce
btrfs_space_info_update_bytes_zone_unusable() to catch a similar mistake.

Fixes: 169e0da91a ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-29 19:21:19 +02:00
Naohiro Aota
d89c285d28 btrfs: do not subtract delalloc from avail bytes
The block group's avail bytes printed when dumping a space info subtract
the delalloc_bytes. However, as shown in btrfs_add_reserved_bytes() and
btrfs_free_reserved_bytes(), it is added or subtracted along with
"reserved" for the delalloc case, which means the "delalloc_bytes" is a
part of the "reserved" bytes. So, excluding it to calculate the avail space
counts delalloc_bytes twice, which can lead to an invalid result.

Fixes: e50b122b83 ("btrfs: print available space for a block group when dumping a space info")
CC: stable@vger.kernel.org # 6.6+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-29 19:21:04 +02:00
Boris Burkov
478574370b btrfs: make cow_file_range_inline() honor locked_page on error
The btrfs buffered write path runs through __extent_writepage() which
has some tricky return value handling for writepage_delalloc().
Specifically, when that returns 1, we exit, but for other return values
we continue and end up calling btrfs_folio_end_all_writers(). If the
folio has been unlocked (note that we check the PageLocked bit at the
start of __extent_writepage()), this results in an assert panic like
this one from syzbot:

  BTRFS: error (device loop0 state EAL) in free_log_tree:3267: errno=-5 IO failure
  BTRFS warning (device loop0 state EAL): Skipping commit of aborted transaction.
  BTRFS: error (device loop0 state EAL) in cleanup_transaction:2018: errno=-5 IO failure
  assertion failed: folio_test_locked(folio), in fs/btrfs/subpage.c:871
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/subpage.c:871!
  Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
  CPU: 1 PID: 5090 Comm: syz-executor225 Not tainted
  6.10.0-syzkaller-05505-gb1bc554e009e #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
  Google 06/27/2024
  RIP: 0010:btrfs_folio_end_all_writers+0x55b/0x610 fs/btrfs/subpage.c:871
  Code: e9 d3 fb ff ff e8 25 22 c2 fd 48 c7 c7 c0 3c 0e 8c 48 c7 c6 80 3d
  0e 8c 48 c7 c2 60 3c 0e 8c b9 67 03 00 00 e8 66 47 ad 07 90 <0f> 0b e8
  6e 45 b0 07 4c 89 ff be 08 00 00 00 e8 21 12 25 fe 4c 89
  RSP: 0018:ffffc900033d72e0 EFLAGS: 00010246
  RAX: 0000000000000045 RBX: 00fff0000000402c RCX: 663b7a08c50a0a00
  RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
  RBP: ffffc900033d73b0 R08: ffffffff8176b98c R09: 1ffff9200067adfc
  R10: dffffc0000000000 R11: fffff5200067adfd R12: 0000000000000001
  R13: dffffc0000000000 R14: 0000000000000000 R15: ffffea0001cbee80
  FS:  0000000000000000(0000) GS:ffff8880b9500000(0000)
  knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f5f076012f8 CR3: 000000000e134000 CR4: 00000000003506f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
  <TASK>
  __extent_writepage fs/btrfs/extent_io.c:1597 [inline]
  extent_write_cache_pages fs/btrfs/extent_io.c:2251 [inline]
  btrfs_writepages+0x14d7/0x2760 fs/btrfs/extent_io.c:2373
  do_writepages+0x359/0x870 mm/page-writeback.c:2656
  filemap_fdatawrite_wbc+0x125/0x180 mm/filemap.c:397
  __filemap_fdatawrite_range mm/filemap.c:430 [inline]
  __filemap_fdatawrite mm/filemap.c:436 [inline]
  filemap_flush+0xdf/0x130 mm/filemap.c:463
  btrfs_release_file+0x117/0x130 fs/btrfs/file.c:1547
  __fput+0x24a/0x8a0 fs/file_table.c:422
  task_work_run+0x24f/0x310 kernel/task_work.c:222
  exit_task_work include/linux/task_work.h:40 [inline]
  do_exit+0xa2f/0x27f0 kernel/exit.c:877
  do_group_exit+0x207/0x2c0 kernel/exit.c:1026
  __do_sys_exit_group kernel/exit.c:1037 [inline]
  __se_sys_exit_group kernel/exit.c:1035 [inline]
  __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1035
  x64_sys_call+0x2634/0x2640
  arch/x86/include/generated/asm/syscalls_64.h:232
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
  entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f5f075b70c9
  Code: Unable to access opcode bytes at
  0x7f5f075b709f.

I was hitting the same issue by doing hundreds of accelerated runs of
generic/475, which also hits IO errors by design.

I instrumented that reproducer with bpftrace and found that the
undesirable folio_unlock was coming from the following callstack:

  folio_unlock+5
  __process_pages_contig+475
  cow_file_range_inline.constprop.0+230
  cow_file_range+803
  btrfs_run_delalloc_range+566
  writepage_delalloc+332
  __extent_writepage # inlined in my stacktrace, but I added it here
  extent_write_cache_pages+622

Looking at the bisected-to patch in the syzbot report, Josef realized
that the logic of the cow_file_range_inline error path subtly changing.
In the past, on error, it jumped to out_unlock in cow_file_range(),
which honors the locked_page, so when we ultimately call
folio_end_all_writers(), the folio of interest is still locked. After
the change, we always unlocked ignoring the locked_page, on both success
and error. On the success path, this all results in returning 1 to
__extent_writepage(), which skips the folio_end_all_writers() call,
which makes it OK to have unlocked.

Fix the bug by wiring the locked_page into cow_file_range_inline() and
only setting locked_page to NULL on success.

Reported-by: syzbot+a14d8ac9af3a2a4fd0c8@syzkaller.appspotmail.com
Fixes: 0586d0a89e ("btrfs: move extent bit and page cleanup into cow_file_range_inline")
CC: stable@vger.kernel.org # 6.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-29 19:20:51 +02:00
Linus Torvalds
cb04e8b1d2 minmax: don't use max() in situations that want a C constant expression
We only had a couple of array[] declarations, and changing them to just
use 'MAX()' instead of 'max()' fixes the issue.

This will allow us to simplify our min/max macros enormously, since they
can now unconditionally use temporary variables to avoid using the
argument values multiple times.

Cc: David Laight <David.Laight@aculab.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-28 20:23:27 -07:00
Filipe Manana
de9f46cb00 btrfs: fix corrupt read due to bad offset of a compressed extent map
If we attempt to insert a compressed extent map that has a range that
overlaps another extent map we have in the inode's extent map tree, we
can end up with an incorrect offset after adjusting the new extent map at
merge_extent_mapping() because we don't update the extent map's offset.

For example consider the following scenario:

1) We have a file extent item for a compressed extent covering the file
   range [108K, 144K) and currently there's no corresponding extent map
   in the inode's extent map tree;

2) The inode's size is 141K;

3) We have an encoded write (compressed) into the file range [120K, 128K),
   which overlaps the existing file extent item. The encoded write creates
   a matching extent map, adds it to the inode's extent map tree and
   creates an ordered extent for it.

   Note that the corresponding file extent item is added to the subvolume
   tree only when the ordered extent completes (when executing
   btrfs_finish_one_ordered());

4) We have a write into the file range [160K, 164K).

   This writes increases the i_size of the file, and there's a hole
   between the current i_size (141K) and the start offset of this write,
   and since the old i_size is in the middle of the block [140K, 144K),
   we have to write zeroes to the range [141K, 144K) (3072 bytes) and
   therefore dirty that page.

   We then call btrfs_set_extent_delalloc() with a start offset of 140K.
   We then end up at btrfs_find_new_delalloc_bytes() which will call
   btrfs_get_extent() for the range [140K, 144K);

5) The btrfs_get_extent() doesn't find any extent map in the inode's
   extent map tree covering the range [140K, 144K), so it searches the
   subvolume tree for any file extent items covering that range.

   There it finds the file extent item for the range [108K, 144K),
   creates a compressed extent map for that range and then calls
   btrfs_add_extent_mapping() with that extent map and passes the
   range [140K, 144K) via the "start" and "len" parameters;

6) The call to add_extent_mapping() done by btrfs_add_extent_mapping()
   fails with -EEXIST because there's an extent map, created at step 2
   for the [120K, 128K) range, that covers that overlaps with the range
   of the given extent map ([108K, 144K)).

   Then it does a lookup for extent map from step 2 add calls
   merge_extent_mapping() to adjust the input extent map ([108K, 144K)).
   That adjust the extent map to a start offset of 128K and a length
   of 16K (starting just after the extent map from step 2), but it does
   not update the offset field of the extent map, leaving it with a value
   of zero instead of updating to a value of 20K (128K - 108K = 20K).

   As a result any read for the range [128K, 144K) can return
   incorrect data since we read from a wrong section of the extent (unless
   both the correct and incorrect ranges happen to have the same data).

So fix this by changing merge_extent_mapping() to update the extent map's
offset even if it's compressed. Also add a test case to the self tests.
This didn't happen before the patchset that does big changes in the extent
map structure (which includes the commit in the Fixes tag below) because
we kept track of the original start offset in the extent map (member
"orig_start") so we could always calculate the correct offset by
subtracting that offset from the start offset.

A test case for fstests that triggered this problem using send/receive
with compressed writes will be added soon.

Fixes: 3d2ac99224 ("btrfs: introduce new members for extent_map")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-25 23:54:06 +02:00
Qu Wenruo
f333a3c7e8 btrfs: tree-checker: validate dref root and objectid
[CORRUPTION]
There is a bug report that btrfs flips RO due to a corruption in the
extent tree, the involved dumps looks like this:

 	item 188 key (402811572224 168 4096) itemoff 14598 itemsize 79
 		extent refs 3 gen 3678544 flags 1
 		ref#0: extent data backref root 13835058055282163977 objectid 281473384125923 offset 81432576 count 1
 		ref#1: shared data backref parent 1947073626112 count 1
 		ref#2: shared data backref parent 1156030103552 count 1
 BTRFS critical (device vdc1: state EA): unable to find ref byte nr 402811572224 parent 0 root 265 owner 28703026 offset 81432576 slot 189
 BTRFS error (device vdc1: state EA): failed to run delayed ref for logical 402811572224 num_bytes 4096 type 178 action 2 ref_mod 1: -2

[CAUSE]
The corrupted entry is ref#0 of item 188.
The root number 13835058055282163977 is beyond the upper limit for root
items (the current limit is 1 << 48), and the objectid also looks
suspicious.

Only the offset and count is correct.

[ENHANCEMENT]
Although it's still unknown why we have such many bytes corrupted
randomly, we can still enhance the tree-checker for data backrefs by:

- Validate the root value
  For now there should only be 3 types of roots can have data backref:
  * subvolume trees
  * data reloc trees
  * root tree
    Only for v1 space cache

- validate the objectid value
  The objectid should be a valid inode number.

Hopefully we can catch such problem in the future with the new checkers.

Reported-by: Kai Krakow <hurikhan77@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAMthOuPjg5RDT-G_LXeBBUUtzt3cq=JywF+D1_h+JYxe=WKp-Q@mail.gmail.com/#t
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-25 23:54:06 +02:00
Linus Torvalds
fbc90c042c - 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN
walkers") is known to cause a performance regression
   (https://lore.kernel.org/all/3acefad9-96e5-4681-8014-827d6be71c7a@linux.ibm.com/T/#mfa809800a7862fb5bdf834c6f71a3a5113eb83ff).
   Yu has a fix which I'll send along later via the hotfixes branch.
 
 - In the series "mm: Avoid possible overflows in dirty throttling" Jan
   Kara addresses a couple of issues in the writeback throttling code.
   These fixes are also targetted at -stable kernels.
 
 - Ryusuke Konishi's series "nilfs2: fix potential issues related to
   reserved inodes" does that.  This should actually be in the
   mm-nonmm-stable tree, along with the many other nilfs2 patches.  My bad.
 
 - More folio conversions from Kefeng Wang in the series "mm: convert to
   folio_alloc_mpol()"
 
 - Kemeng Shi has sent some cleanups to the writeback code in the series
   "Add helper functions to remove repeated code and improve readability of
   cgroup writeback"
 
 - Kairui Song has made the swap code a little smaller and a little
   faster in the series "mm/swap: clean up and optimize swap cache index".
 
 - In the series "mm/memory: cleanly support zeropage in
   vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
   Hildenbrand has reworked the rather sketchy handling of the use of the
   zeropage in MAP_SHARED mappings.  I don't see any runtime effects here -
   more a cleanup/understandability/maintainablity thing.
 
 - Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling of
   higher addresses, for aarch64.  The (poorly named) series is
   "Restructure va_high_addr_switch".
 
 - The core TLB handling code gets some cleanups and possible slight
   optimizations in Bang Li's series "Add update_mmu_tlb_range() to
   simplify code".
 
 - Jane Chu has improved the handling of our
   fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in the
   series "Enhance soft hwpoison handling and injection".
 
 - Jeff Johnson has sent a billion patches everywhere to add
   MODULE_DESCRIPTION() to everything.  Some landed in this pull.
 
 - In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang has
   simplified migration's use of hardware-offload memory copying.
 
 - Yosry Ahmed performs more folio API conversions in his series "mm:
   zswap: trivial folio conversions".
 
 - In the series "large folios swap-in: handle refault cases first",
   Chuanhua Han inches us forward in the handling of large pages in the
   swap code.  This is a cleanup and optimization, working toward the end
   objective of full support of large folio swapin/out.
 
 - In the series "mm,swap: cleanup VMA based swap readahead window
   calculation", Huang Ying has contributed some cleanups and a possible
   fixlet to his VMA based swap readahead code.
 
 - In the series "add mTHP support for anonymous shmem" Baolin Wang has
   taught anonymous shmem mappings to use multisize THP.  By default this
   is a no-op - users must opt in vis sysfs controls.  Dramatic
   improvements in pagefault latency are realized.
 
 - David Hildenbrand has some cleanups to our remaining use of
   page_mapcount() in the series "fs/proc: move page_mapcount() to
   fs/proc/internal.h".
 
 - David also has some highmem accounting cleanups in the series
   "mm/highmem: don't track highmem pages manually".
 
 - Build-time fixes and cleanups from John Hubbard in the series
   "cleanups, fixes, and progress towards avoiding "make headers"".
 
 - Cleanups and consolidation of the core pagemap handling from Barry
   Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
   and utilize them".
 
 - Lance Yang's series "Reclaim lazyfree THP without splitting" has
   reduced the latency of the reclaim of pmd-mapped THPs under fairly
   common circumstances.  A 10x speedup is seen in a microbenchmark.
 
   It does this by punting to aother CPU but I guess that's a win unless
   all CPUs are pegged.
 
 - hugetlb_cgroup cleanups from Xiu Jianfeng in the series
   "mm/hugetlb_cgroup: rework on cftypes".
 
 - Miaohe Lin's series "Some cleanups for memory-failure" does just that
   thing.
 
 - Is anyone reading this stuff?  If so, email me!
 
 - Someone other than SeongJae has developed a DAMON feature in Honggyu
   Kim's series "DAMON based tiered memory management for CXL memory".
   This adds DAMON features which may be used to help determine the
   efficiency of our placement of CXL/PCIe attached DRAM.
 
 - DAMON user API centralization and simplificatio work in SeongJae
   Park's series "mm/damon: introduce DAMON parameters online commit
   function".
 
 - In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
   David Hildenbrand does some maintenance work on zsmalloc - partially
   modernizing its use of pageframe fields.
 
 - Kefeng Wang provides more folio conversions in the series "mm: remove
   page_maybe_dma_pinned() and page_mkclean()".
 
 - More cleanup from David Hildenbrand, this time in the series
   "mm/memory_hotplug: use PageOffline() instead of PageReserved() for
   !ZONE_DEVICE".  It "enlightens memory hotplug more about PageOffline()
   pages" and permits the removal of some virtio-mem hacks.
 
 - Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
   __folio_add_anon_rmap()" is a cleanup to the anon folio handling in
   preparation for mTHP (multisize THP) swapin.
 
 - Kefeng Wang's series "mm: improve clear and copy user folio"
   implements more folio conversions, this time in the area of large folio
   userspace copying.
 
 - The series "Docs/mm/damon/maintaier-profile: document a mailing tool
   and community meetup series" tells people how to get better involved
   with other DAMON developers.  From SeongJae Park.
 
 - A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
   that.
 
 - David Hildenbrand sends along more cleanups, this time against the
   migration code.  The series is "mm/migrate: move NUMA hinting fault
   folio isolation + checks under PTL".
 
 - Jan Kara has found quite a lot of strangenesses and minor errors in
   the readahead code.  He addresses this in the series "mm: Fix various
   readahead quirks".
 
 - SeongJae Park's series "selftests/damon: test DAMOS tried regions and
   {min,max}_nr_regions" adds features and addresses errors in DAMON's self
   testing code.
 
 - Gavin Shan has found a userspace-triggerable WARN in the pagecache
   code.  The series "mm/filemap: Limit page cache size to that supported
   by xarray" addresses this.  The series is marked cc:stable.
 
 - Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
   and cleanup" cleans up and slightly optimizes KSM.
 
 - Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
   code motion.  The series (which also makes the memcg-v1 code
   Kconfigurable) are
 
   "mm: memcg: separate legacy cgroup v1 code and put under config
   option" and
   "mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1"
 
 - Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
   adds an additional feature to this cgroup-v2 control file.
 
 - The series "Userspace controls soft-offline pages" from Jiaqi Yan
   permits userspace to stop the kernel's automatic treatment of excessive
   correctable memory errors.  In order to permit userspace to monitor and
   handle this situation.
 
 - Kefeng Wang's series "mm: migrate: support poison recover from migrate
   folio" teaches the kernel to appropriately handle migration from
   poisoned source folios rather than simply panicing.
 
 - SeongJae Park's series "Docs/damon: minor fixups and improvements"
   does those things.
 
 - In the series "mm/zsmalloc: change back to per-size_class lock"
   Chengming Zhou improves zsmalloc's scalability and memory utilization.
 
 - Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
   pinning memfd folios" makes the GUP code use FOLL_PIN rather than bare
   refcount increments.  So these paes can first be moved aside if they
   reside in the movable zone or a CMA block.
 
 - Andrii Nakryiko has added a binary ioctl()-based API to /proc/pid/maps
   for much faster reading of vma information.  The series is "query VMAs
   from /proc/<pid>/maps".
 
 - In the series "mm: introduce per-order mTHP split counters" Lance Yang
   improves the kernel's presentation of developer information related to
   multisize THP splitting.
 
 - Michael Ellerman has developed the series "Reimplement huge pages
   without hugepd on powerpc (8xx, e500, book3s/64)".  This permits
   userspace to use all available huge page sizes.
 
 - In the series "revert unconditional slab and page allocator fault
   injection calls" Vlastimil Babka removes a performance-affecting and not
   very useful feature from slab fault injection.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZp2C+QAKCRDdBJ7gKXxA
 joTkAQDvjqOoFStqk4GU3OXMYB7WCU/ZQMFG0iuu1EEwTVDZ4QEA8CnG7seek1R3
 xEoo+vw0sWWeLV3qzsxnCA1BJ8cTJA8=
 =z0Lf
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - In the series "mm: Avoid possible overflows in dirty throttling" Jan
   Kara addresses a couple of issues in the writeback throttling code.
   These fixes are also targetted at -stable kernels.

 - Ryusuke Konishi's series "nilfs2: fix potential issues related to
   reserved inodes" does that. This should actually be in the
   mm-nonmm-stable tree, along with the many other nilfs2 patches. My
   bad.

 - More folio conversions from Kefeng Wang in the series "mm: convert to
   folio_alloc_mpol()"

 - Kemeng Shi has sent some cleanups to the writeback code in the series
   "Add helper functions to remove repeated code and improve readability
   of cgroup writeback"

 - Kairui Song has made the swap code a little smaller and a little
   faster in the series "mm/swap: clean up and optimize swap cache
   index".

 - In the series "mm/memory: cleanly support zeropage in
   vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
   Hildenbrand has reworked the rather sketchy handling of the use of
   the zeropage in MAP_SHARED mappings. I don't see any runtime effects
   here - more a cleanup/understandability/maintainablity thing.

 - Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling
   of higher addresses, for aarch64. The (poorly named) series is
   "Restructure va_high_addr_switch".

 - The core TLB handling code gets some cleanups and possible slight
   optimizations in Bang Li's series "Add update_mmu_tlb_range() to
   simplify code".

 - Jane Chu has improved the handling of our
   fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in
   the series "Enhance soft hwpoison handling and injection".

 - Jeff Johnson has sent a billion patches everywhere to add
   MODULE_DESCRIPTION() to everything. Some landed in this pull.

 - In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang
   has simplified migration's use of hardware-offload memory copying.

 - Yosry Ahmed performs more folio API conversions in his series "mm:
   zswap: trivial folio conversions".

 - In the series "large folios swap-in: handle refault cases first",
   Chuanhua Han inches us forward in the handling of large pages in the
   swap code. This is a cleanup and optimization, working toward the end
   objective of full support of large folio swapin/out.

 - In the series "mm,swap: cleanup VMA based swap readahead window
   calculation", Huang Ying has contributed some cleanups and a possible
   fixlet to his VMA based swap readahead code.

 - In the series "add mTHP support for anonymous shmem" Baolin Wang has
   taught anonymous shmem mappings to use multisize THP. By default this
   is a no-op - users must opt in vis sysfs controls. Dramatic
   improvements in pagefault latency are realized.

 - David Hildenbrand has some cleanups to our remaining use of
   page_mapcount() in the series "fs/proc: move page_mapcount() to
   fs/proc/internal.h".

 - David also has some highmem accounting cleanups in the series
   "mm/highmem: don't track highmem pages manually".

 - Build-time fixes and cleanups from John Hubbard in the series
   "cleanups, fixes, and progress towards avoiding "make headers"".

 - Cleanups and consolidation of the core pagemap handling from Barry
   Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
   and utilize them".

 - Lance Yang's series "Reclaim lazyfree THP without splitting" has
   reduced the latency of the reclaim of pmd-mapped THPs under fairly
   common circumstances. A 10x speedup is seen in a microbenchmark.

   It does this by punting to aother CPU but I guess that's a win unless
   all CPUs are pegged.

 - hugetlb_cgroup cleanups from Xiu Jianfeng in the series
   "mm/hugetlb_cgroup: rework on cftypes".

 - Miaohe Lin's series "Some cleanups for memory-failure" does just that
   thing.

 - Someone other than SeongJae has developed a DAMON feature in Honggyu
   Kim's series "DAMON based tiered memory management for CXL memory".
   This adds DAMON features which may be used to help determine the
   efficiency of our placement of CXL/PCIe attached DRAM.

 - DAMON user API centralization and simplificatio work in SeongJae
   Park's series "mm/damon: introduce DAMON parameters online commit
   function".

 - In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
   David Hildenbrand does some maintenance work on zsmalloc - partially
   modernizing its use of pageframe fields.

 - Kefeng Wang provides more folio conversions in the series "mm: remove
   page_maybe_dma_pinned() and page_mkclean()".

 - More cleanup from David Hildenbrand, this time in the series
   "mm/memory_hotplug: use PageOffline() instead of PageReserved() for
   !ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
   pages" and permits the removal of some virtio-mem hacks.

 - Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
   __folio_add_anon_rmap()" is a cleanup to the anon folio handling in
   preparation for mTHP (multisize THP) swapin.

 - Kefeng Wang's series "mm: improve clear and copy user folio"
   implements more folio conversions, this time in the area of large
   folio userspace copying.

 - The series "Docs/mm/damon/maintaier-profile: document a mailing tool
   and community meetup series" tells people how to get better involved
   with other DAMON developers. From SeongJae Park.

 - A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
   that.

 - David Hildenbrand sends along more cleanups, this time against the
   migration code. The series is "mm/migrate: move NUMA hinting fault
   folio isolation + checks under PTL".

 - Jan Kara has found quite a lot of strangenesses and minor errors in
   the readahead code. He addresses this in the series "mm: Fix various
   readahead quirks".

 - SeongJae Park's series "selftests/damon: test DAMOS tried regions and
   {min,max}_nr_regions" adds features and addresses errors in DAMON's
   self testing code.

 - Gavin Shan has found a userspace-triggerable WARN in the pagecache
   code. The series "mm/filemap: Limit page cache size to that supported
   by xarray" addresses this. The series is marked cc:stable.

 - Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
   and cleanup" cleans up and slightly optimizes KSM.

 - Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
   code motion. The series (which also makes the memcg-v1 code
   Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put
   under config option" and "mm: memcg: put cgroup v1-specific memcg
   data under CONFIG_MEMCG_V1"

 - Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
   adds an additional feature to this cgroup-v2 control file.

 - The series "Userspace controls soft-offline pages" from Jiaqi Yan
   permits userspace to stop the kernel's automatic treatment of
   excessive correctable memory errors. In order to permit userspace to
   monitor and handle this situation.

 - Kefeng Wang's series "mm: migrate: support poison recover from
   migrate folio" teaches the kernel to appropriately handle migration
   from poisoned source folios rather than simply panicing.

 - SeongJae Park's series "Docs/damon: minor fixups and improvements"
   does those things.

 - In the series "mm/zsmalloc: change back to per-size_class lock"
   Chengming Zhou improves zsmalloc's scalability and memory
   utilization.

 - Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
   pinning memfd folios" makes the GUP code use FOLL_PIN rather than
   bare refcount increments. So these paes can first be moved aside if
   they reside in the movable zone or a CMA block.

 - Andrii Nakryiko has added a binary ioctl()-based API to
   /proc/pid/maps for much faster reading of vma information. The series
   is "query VMAs from /proc/<pid>/maps".

 - In the series "mm: introduce per-order mTHP split counters" Lance
   Yang improves the kernel's presentation of developer information
   related to multisize THP splitting.

 - Michael Ellerman has developed the series "Reimplement huge pages
   without hugepd on powerpc (8xx, e500, book3s/64)". This permits
   userspace to use all available huge page sizes.

 - In the series "revert unconditional slab and page allocator fault
   injection calls" Vlastimil Babka removes a performance-affecting and
   not very useful feature from slab fault injection.

* tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits)
  mm/mglru: fix ineffective protection calculation
  mm/zswap: fix a white space issue
  mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
  mm/hugetlb: fix possible recursive locking detected warning
  mm/gup: clear the LRU flag of a page before adding to LRU batch
  mm/numa_balancing: teach mpol_to_str about the balancing mode
  mm: memcg1: convert charge move flags to unsigned long long
  alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting
  lib: reuse page_ext_data() to obtain codetag_ref
  lib: add missing newline character in the warning message
  mm/mglru: fix overshooting shrinker memory
  mm/mglru: fix div-by-zero in vmpressure_calc_level()
  mm/kmemleak: replace strncpy() with strscpy()
  mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC
  mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB
  mm: ignore data-race in __swap_writepage
  hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr
  mm: shmem: rename mTHP shmem counters
  mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async()
  mm/migrate: putback split folios when numa hint migration fails
  ...
2024-07-21 17:15:46 -07:00
Linus Torvalds
53a5182c8a for-6.11-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaak1cACgkQxWXV+ddt
 WDvtDw/9GLN/pzAhH+rg0v51b4Ofxqa5QXuz7GIHb+UOqOdiZIjun+UmBst8794t
 oZ0rhO1sjXVNwci5TG2oWUHcoAkjoPrVgJ/X5nzBHOzHJCQlV1kNzgsDyY+HFL9F
 FBkJ6u1pCXUIxry6zp55NLWLMcjtFdMphjLXENh+GzUx5aAXl7VamK41B3+pHGfW
 gDCNJ41vaaHqTGN8+9AADpZ2Oo4CUCtr8eNxGXZ0O5/0s5CdRJrjphtBTcGwbGch
 SYVceySM4ElLmFi+hKFxx68MkWgnuTnBhZhbOM4V2fPFv0hLzXyQ7OrAk+MY8O2m
 AHLx2jHKfCdONFeTUdt1cY5/wM6Afhy+N/iKtv3t2+Wi70C/LEHQRY1Op3rfx1rn
 vOIbR9IXEHVi6ncO/E9c4LacSqLd/KSOaZn2Z/6i5wN+NY86CrCMuPwr6Pv0LL6x
 aSHka47SFFTQLvHUdwmzexJ2YuosFdI0BhpiEu8ylAZTJ17yDJatk8wM+FB6Rfh1
 vdPMRi93nVfrCwkU63Y2gqtJ3ncb3mbk/0uUdtMMflZJgjL0qkxTmcu3pSEdzIYR
 gHFLVlns4cljL5PB9yMH/JjYjYn+Y6bCVvVyuhQZ3FAanUVOSFin/YWfo1bIi1et
 ENNP+lhUKYvKLcz3QcnQpX3a6PkFPrmFi5wniAvxymrmVKJ3g3E=
 =ROO7
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "A fix for build breakage on 32bit platforms"

* tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: change BTRFS_MOUNT_* flags to 64bit type
2024-07-19 14:34:52 -07:00
Qu Wenruo
c3ece6b7ff btrfs: change BTRFS_MOUNT_* flags to 64bit type
Currently the BTRFS_MOUNT_* flags are already beyond 32 bits, this is
going to cause compilation errors for some 32 bit systems, as their
unsigned long is only 32 bits long, thus flag
BTRFS_MOUNT_IGNORESUPERFLAGS overflows and can lead to errors.

Fix the problem by:

- Migrate all existing BTRFS_MOUNT_* flags to unsigned long long
- Migrate all mount option related variables to unsigned long long
  * btrfs_fs_info::mount_opt
  * btrfs_fs_context::mount_opt
  * mount_opt parameter of btrfs_check_options()
  * old_opts parameter of btrfs_remount_begin()
  * old_opts parameter of btrfs_remount_cleanup()
  * mount_opt parameter of btrfs_check_mountopts_zoned()
  * mount_opt and opt parameters of check_ro_option()

Fixes: 32e6216512 ("btrfs: introduce new "rescue=ignoresuperflags" mount option")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-19 17:20:23 +02:00
Linus Torvalds
a1b547f0f2 for-6.11-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaVN3MACgkQxWXV+ddt
 WDtpIRAAl+1NjsEj8e5V/UYn8Jr06ujTOnrkR3PCTICxDHbUaMLkQEw21H0K/ogQ
 3fOiEVpSlZOfKdYXtXaMQbC0jd/Af2eA10Uht96nAEjAtxu1uJ4cFZGu2meNdXZP
 xUioivJ/CElMPH2aluG6FaQvUTqmhrEr8tSoYbxzQmUd434q9kqqyjtw1tfzYDG1
 VDn2f7ykhpB/8P0aoqgWSshWTmaCzG0GkuI28o1o0iZUIF/P9TKdzxlLRW6BVHE7
 T2oGLEQjN1GQbCH75L4IeNJDkCBVfcDcbZkUDJ/ae4Pt/jJQTFY53YIP9wXFZQnd
 mdfHmK7Atpsk75ATftYSq+ENkbQ5fsuut5CD63u54gAqA4M1FncDXTAWS1Y30F76
 P8juSCmsSy0o3gTflDIo/IMdntoh/JmncwwStF6oKzmyUZZzzarsqM8mc1P03ZNt
 3ttlnbY7lC1TDAlD5J2wXE0INCT2pN+4C9IToWdRypeuLu6qrI7cQ0oylyp9OVQM
 t9umTXm0B6s1cyqEDjJf0xJZS/JTHYwu7S4EmAJwicgiLpOjABVTmO8021rVmDJy
 TAUu6yEhSsrTT6Dxm7/2Et1EEOKFF5hhsG1SiGD9oUIZK6B5+0waT+rbkEWl7osR
 4/TAv2zX6tuCc7HIW0fQloM/6/Gyd5wcDVaQNDUzFA075uKstwY=
 =k5d3
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "The highlights are new logic behind background block group reclaim,
  automatic removal of qgroup after removing a subvolume and new
  'rescue=' mount options.

  The rest is optimizations, cleanups and refactoring.

  User visible features:

   - dynamic block group reclaim:
      - tunable framework to avoid situations where eager data
        allocations prevent creating new metadata chunks due to lack of
        unallocated space
      - reuse sysfs knob bg_reclaim_threshold (otherwise used only in
        zoned mode) for a fixed value threshold
      - new on/off sysfs knob "dynamic_reclaim" calculating the value
        based on heuristics, aiming to keep spare working space for
        relocating chunks but not to needlessly relocate partially
        utilized block groups or reclaim newly allocated ones
      - stats are exported in sysfs per block group type, files
        "reclaim_*"
      - this may increase IO load at unexpected times but the corner
        case of no allocatable block groups is known to be worse

   - automatically remove qgroup of deleted subvolumes:
      - adjust qgroup removal conditions, make sure all related
        subvolume data are already removed, or return EBUSY, also take
        into account setting of sysfs drop_subtree_threshold
      - also works in squota mode

   - mount option updates: new modes of 'rescue=' that allow to mount
     images (read-only) that could have been partially converted by user
     space tools
      - ignoremetacsums  - invalid metadata checksums are ignored
      - ignoresuperflags - super block flags that track conversion in
                           progress (like UUID or checksums)

  Core:

   - size of struct btrfs_inode is now below 1024 (on a release config),
     improved memory packing and other secondary effects

   - switch tracking of open inodes from rb-tree to xarray, minor
     performance improvement

   - reduce number of empty transaction commits when there are no dirty
     data/metadata

   - memory allocation optimizations (reduced numbers, reordering out of
     critical sections)

   - extent map structure optimizations and refactoring, more sanity
     checks

   - more subpage in zoned mode preparations or fixes

   - general snapshot code cleanups, improvements and documentation

   - tree-checker updates: more file extent ram_bytes fixes, continued

   - raid-stripe-tree update (not backward compatible):
      - remove extent encoding field from the structure, can be inferred
        from other information
      - requires btrfs-progs 6.9.1 or newer

   - cleanups and refactoring
      - error message updates
      - error handling improvements
      - return type and parameter cleanups and improvements"

* tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (152 commits)
  btrfs: fix extent map use-after-free when adding pages to compressed bio
  btrfs: fix bitmap leak when loading free space cache on duplicate entry
  btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io()
  btrfs: move extent_range_clear_dirty_for_io() into inode.c
  btrfs: enhance compression error messages
  btrfs: fix data race when accessing the last_trans field of a root
  btrfs: rename the extra_gfp parameter of btrfs_alloc_page_array()
  btrfs: remove the extra_gfp parameter from btrfs_alloc_folio_array()
  btrfs: introduce new "rescue=ignoresuperflags" mount option
  btrfs: introduce new "rescue=ignoremetacsums" mount option
  btrfs: output the unrecognized super block flags as hex
  btrfs: remove unused Opt enums
  btrfs: tree-checker: add extra ram_bytes and disk_num_bytes check
  btrfs: fix the ram_bytes assignment for truncated ordered extents
  btrfs: make validate_extent_map() catch ram_bytes mismatch
  btrfs: ignore incorrect btrfs_file_extent_item::ram_bytes
  btrfs: cleanup the bytenr usage inside btrfs_extent_item_to_extent_map()
  btrfs: fix typo in error message in btrfs_validate_super()
  btrfs: move the direct IO code into its own file
  btrfs: pass a btrfs_inode to btrfs_set_prop()
  ...
2024-07-17 12:38:04 -07:00
Linus Torvalds
3e78198862 for-6.11/block-20240710
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaOTd8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppqIEACUr8Vv2FtezvT3OfVSlYWHHLXzkRhwEG5s
 vdk0o7Ow6U54sMjfymbHTgLD0ZOJf3uJ6BI95FQuW41jPzDFVbx4Hy8QzqonMkw9
 1D/YQ4zrVL2mOKBzATbKpoGJzMOzGeoXEueFZ1AYPAX7RrDtP4xPQNfrcfkdE2zF
 LycJN70Vp6lrZZMuI9yb9ts1tf7TFzK0HJANxOAKTgSiPmBmxesjkJlhrdUrgkAU
 qDVyjj7u/ssndBJAb9i6Bl95Do8s9t4DeJq5/6wgKqtf5hClMXzPVB8Wy084gr6E
 rTRsCEhOug3qEZSqfAgAxnd3XFRNc/p2KMUe5YZ4mAqux4hpSmIQQDM/5X5K9vEv
 f4MNqUGlqyqntZx+KPyFpf7kLHFYS1qK4ub0FojWJEY4GrbBPNjjncLJ9+ozR0c8
 kNDaFjMNAjalBee1FxNNH8LdVcd28rrCkPxRLEfO/gvBMUmvJf4ZyKmSED0v5DhY
 vZqKlBqG+wg0EXvdiWEHMDh9Y+q/2XBIkS6NN/Bhh61HNu+XzC838ts1X7lR+4o2
 AM5Vapw+v0q6kFBMRP3IcJI/c0UcIU8EQU7axMyzWtvhog8kx8x01hIj1L4UyYYr
 rUdWrkugBVXJbywFuH/QIJxWxS/z4JdSw5VjASJLIrXy+aANmmG9Wonv95eyhpUv
 5iv+EdRSNA==
 =wVi8
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe updates via Keith:
     - Device initialization memory leak fixes (Keith)
     - More constants defined (Weiwen)
     - Target debugfs support (Hannes)
     - PCIe subsystem reset enhancements (Keith)
     - Queue-depth multipath policy (Redhat and PureStorage)
     - Implement get_unique_id (Christoph)
     - Authentication error fixes (Gaosheng)

 - MD updates via Song
     - sync_action fix and refactoring (Yu Kuai)
     - Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu
       Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li)

 - Fix loop detach/open race (Gulam)

 - Fix lower control limit for blk-throttle (Yu)

 - Add module descriptions to various drivers (Jeff)

 - Add support for atomic writes for block devices, and statx reporting
   for same. Includes SCSI and NVMe (John, Prasad, Alan)

 - Add IO priority information to block trace points (Dongliang)

 - Various zone improvements and tweaks (Damien)

 - mq-deadline tag reservation improvements (Bart)

 - Ignore direct reclaim swap writes in writeback throttling (Baokun)

 - Block integrity improvements and fixes (Anuj)

 - Add basic support for rust based block drivers. Has a dummy null_blk
   variant for now (Andreas)

 - Series converting driver settings to queue limits, and cleanups and
   fixes related to that (Christoph)

 - Cleanup for poking too deeply into the bvec internals, in preparation
   for DMA mapping API changes (Christoph)

 - Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas,
   Ming, Zhu, Damien, Christophe, Chaitanya)

* tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits)
  floppy: add missing MODULE_DESCRIPTION() macro
  loop: add missing MODULE_DESCRIPTION() macro
  ublk_drv: add missing MODULE_DESCRIPTION() macro
  xen/blkback: add missing MODULE_DESCRIPTION() macro
  block/rnbd: Constify struct kobj_type
  block: take offset into account in blk_bvec_map_sg again
  block: fix get_max_segment_size() warning
  loop: Don't bother validating blocksize
  virtio_blk: Don't bother validating blocksize
  null_blk: Don't bother validating blocksize
  block: Validate logical block size in blk_validate_limits()
  virtio_blk: Fix default logical block size fallback
  nvmet-auth: fix nvmet_auth hash error handling
  nvme: implement ->get_unique_id
  block: pass a phys_addr_t to get_max_segment_size
  block: add a bvec_phys helper
  blk-lib: check for kill signal in ioctl BLKZEROOUT
  block: limit the Write Zeroes to manually writing zeroes fallback
  block: refacto blkdev_issue_zeroout
  block: move read-only and supported checks into (__)blkdev_issue_zeroout
  ...
2024-07-15 14:20:22 -07:00
Linus Torvalds
2aae1d67fd vfs-6.11.inode
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEG2wAKCRCRxhvAZXjc
 ooW/AQDzyY+xNGt4OPMvlyFUHd5RcyiLsMhYrkKc3FaIFjesVgD+PFW5PPW12c0V
 Z4VHg9w1HDDuUn4XvELs7OXZpek7RgU=
 =eDC8
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode / dentry updates from Christian Brauner:
 "This contains smaller performance improvements to inodes and dentries:

  inode:

   - Add rcu based inode lookup variants.

     They avoid one inode hash lock acquire in the common case thereby
     significantly reducing contention. We already support RCU-based
     operations but didn't take advantage of them during inode
     insertion.

     Callers of iget_locked() get the improvement without any code
     changes. Callers that need a custom callback can switch to
     iget5_locked_rcu() as e.g., did btrfs.

     With 20 threads each walking a dedicated 1000 dirs * 1000 files
     directory tree to stat(2) on a 32 core + 24GB ram vm:

        before: 3.54s user 892.30s system 1966% cpu 45.549 total
        after:  3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)

     Long-term we should pick up the effort to introduce more
     fine-grained locking and possibly improve on the currently used
     hash implementation.

   - Start zeroing i_state in inode_init_always() instead of doing it in
     individual filesystems.

     This allows us to remove an unneeded lock acquire in new_inode()
     and not burden individual filesystems with this.

  dcache:

   - Move d_lockref out of the area used by RCU lookup to avoid
     cacheline ping poing because the embedded name is sharing a
     cacheline with d_lockref.

   - Fix dentry size on 32bit with CONFIG_SMP=y so it does actually end
     up with 128 bytes in total"

* tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: fix dentry size
  vfs: move d_lockref out of the area used by RCU lookup
  bcachefs: remove now spurious i_state initialization
  xfs: remove now spurious i_state initialization in xfs_inode_alloc
  vfs: partially sanitize i_state zeroing on inode creation
  xfs: preserve i_state around inode_init_always in xfs_reinit_inode
  btrfs: use iget5_locked_rcu
  vfs: add rcu-based find_inode variants for iget ops
2024-07-15 11:39:44 -07:00
Linus Torvalds
975f3b6da1 for-6.10-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaRcQgACgkQxWXV+ddt
 WDvAGxAAknJAiREp/AmzhSwkhr+nSnqex0t+VVgsOaMTu0BEHO0xhoXc3l0QuSwS
 u2AIqmOYyzr/UQVXCuatBqAE+5T4njtYAYIWwE825yquAtHNyuok9+Sjhfvxrwgs
 HmNAN4Vvl2Fwds7xbWE8ug18QlssuRTIX8hk7ZtS6xo49g0tsbRX9KlzIPpsULD3
 BOZa+2NJwC1PGVeNPf3p06rfiUkKfmFYgdDybe2zJ17uwsRz1CFSsaEEB35ys1f0
 xYOS4epfcie03EGyZmYctuNxatUkk/J/1lTH4Z9JHwvPBvLK1U97SyJ11Wz2VQC/
 8ar8gUDRYtjWdf6vn6AWBM4MseaYm9LDMlPhbSfvpDcWiclGTE64IOP4gKKr3mCh
 WzlNSIR9I+tYgrhvcsCEzd7lvrSVHa7clwfooYgkEx0wl5lgbN0llAdtJWG3eeLn
 3stxje2FqqXsFNj5N9SrPy7f7t6xF2i8vwk4qh6EpRuT4yuatb+nWzDm9EuTT/Bc
 P+zM1KFp7Blk7Zw/Tpw0O9qjt1whStY2xrqcMzg539WVo45MmuFEFzmGBRwZsH55
 QPGLIjXPpt728AgMdhBFEG0DtWaiA3AOI/C5nYOtLu92aZVBmbaX7/d/GpJv3Vvd
 Ihvr9s1c49YvTZsIS0T0tkq/7LXZi/SToRJDjhP5HCrRGf7A30Y=
 =gtsF
 -----END PGP SIGNATURE-----

Merge tag 'for-6.10-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Fix a regression in extent map shrinker behaviour.

  In the past weeks we got reports from users that there are huge
  latency spikes or freezes. This was bisected to newly added shrinker
  of extent maps (it was added to fix a build up of the structures in
  memory).

  I'm assuming that the freezes would happen to many users after release
  so I'd like to get it merged now so it's in 6.10. Although the diff
  size is not small the changes are relatively straightforward, the
  reporters verified the fixes and we did testing on our side.

  The fixes:

   - adjust behaviour under memory pressure and check lock or scheduling
     conditions, bail out if needed

   - synchronize tracking of the scanning progress so inode ranges are
     not skipped or work duplicated

   - do a delayed iput when scanning a root so evicting an inode does
     not slow things down in case of lots of dirty data, also fix
     lockdep warning, a deadlock could happen when writing the dirty
     data would need to start a transaction"

* tag 'for-6.10-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: avoid races when tracking progress for extent map shrinking
  btrfs: stop extent map shrinker if reschedule is needed
  btrfs: use delayed iput during extent map shrinking
2024-07-12 12:08:42 -07:00
Filipe Manana
4484940514 btrfs: avoid races when tracking progress for extent map shrinking
We store the progress (root and inode numbers) of the extent map shrinker
in fs_info without any synchronization but we can have multiple tasks
calling into the shrinker during memory allocations when there's enough
memory pressure for example.

This can result in a task A reading fs_info->extent_map_shrinker_last_ino
after another task B updates it, and task A reading
fs_info->extent_map_shrinker_last_root before task B updates it, making
task A see an odd state that isn't necessarily harmful but may make it
skip certain inode ranges or do more work than necessary by going over
the same inodes again. These unprotected accesses would also trigger
warnings from tools like KCSAN.

So add a lock to protect access to these progress fields.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 16:50:54 +02:00
Filipe Manana
b3ebb9b7e9 btrfs: stop extent map shrinker if reschedule is needed
The extent map shrinker can be called in a variety of contexts where we
are under memory pressure, and of them is when a task is trying to
allocate memory. For this reason the shrinker is typically called with a
value of struct shrink_control::nr_to_scan that is much smaller than what
we return in the nr_cached_objects callback of struct super_operations
(fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does
not take a long time and cause high latencies. However we can still take
a lot of time in the shrinker even for a limited amount of nr_to_scan:

1) When traversing the red black tree that tracks open inodes in a root,
   as for example with millions of open inodes we get a deep tree which
   takes time searching for an inode;

2) Iterating over the extent map tree, which is a red black tree, of an
   inode when doing the rb_next() calls and when removing an extent map
   from the tree, since often that requires rebalancing the red black
   tree;

3) When trying to write lock an inode's extent map tree we may wait for a
   significant amount of time, because there's either another task about
   to do IO and searching for an extent map in the tree or inserting an
   extent map in the tree, and we can have thousands or even millions of
   extent maps for an inode. Furthermore, there can be concurrent calls
   to the shrinker so the lock might be busy simply because there is
   already another task shrinking extent maps for the same inode;

4) We often reschedule if we need to, which further increases latency.

So improve on this by stopping the extent map shrinking code whenever we
need to reschedule and make it skip an inode if we can't immediately lock
its extent map tree.

Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 16:45:42 +02:00
Filipe Manana
68a3ebd18b btrfs: use delayed iput during extent map shrinking
When putting an inode during extent map shrinking we're doing a standard
iput() but that may take a long time in case the inode is dirty and we are
doing the final iput that triggers eviction - the VFS will have to wait
for writeback before calling the btrfs evict callback (see
fs/inode.c:evict()).

This slows down the task running the shrinker which may have been
triggered while updating some tree for example, meaning locks are held
as well as an open transaction handle.

Also if the iput() ends up triggering eviction and the inode has no links
anymore, then we trigger item truncation which requires flushing delayed
items, space reservation to start a transaction and that may trigger the
space reclaim task and wait for it, resulting in deadlocks in case the
reclaim task needs for example to commit a transaction and the shrinker
is being triggered from a path holding a transaction handle.

Syzbot reported such a case with the following stack traces:

   ======================================================
   WARNING: possible circular locking dependency detected
   6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Not tainted
   ------------------------------------------------------
   kswapd0/111 is trying to acquire lock:
   ffff88801eae4610 (sb_internal#3){.+.+}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275

   but task is already holding lock:
   ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924

   which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

   -> #3 (fs_reclaim){+.+.}-{0:0}:
          __fs_reclaim_acquire mm/page_alloc.c:3783 [inline]
          fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3797
          might_alloc include/linux/sched/mm.h:334 [inline]
          slab_pre_alloc_hook mm/slub.c:3890 [inline]
          slab_alloc_node mm/slub.c:3980 [inline]
          kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4019
          btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411
          alloc_inode+0x5d/0x230 fs/inode.c:261
          iget5_locked fs/inode.c:1235 [inline]
          iget5_locked+0x1c9/0x2c0 fs/inode.c:1228
          btrfs_iget_locked fs/btrfs/inode.c:5590 [inline]
          btrfs_iget_path fs/btrfs/inode.c:5607 [inline]
          btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636
          create_reloc_inode+0x403/0x820 fs/btrfs/relocation.c:3911
          btrfs_relocate_block_group+0x471/0xe60 fs/btrfs/relocation.c:4114
          btrfs_relocate_chunk+0x143/0x450 fs/btrfs/volumes.c:3373
          __btrfs_balance fs/btrfs/volumes.c:4157 [inline]
          btrfs_balance+0x211a/0x3f00 fs/btrfs/volumes.c:4534
          btrfs_ioctl_balance fs/btrfs/ioctl.c:3675 [inline]
          btrfs_ioctl+0x12ed/0x8290 fs/btrfs/ioctl.c:4742
          __do_compat_sys_ioctl+0x2c3/0x330 fs/ioctl.c:1007
          do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
          __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
          do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
          entry_SYSENTER_compat_after_hwframe+0x84/0x8e

   -> #2 (btrfs_trans_num_extwriters){++++}-{0:0}:
          join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315
          start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700
          btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323
          btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999
          open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554
          btrfs_fill_super fs/btrfs/super.c:946 [inline]
          btrfs_get_tree_super fs/btrfs/super.c:1863 [inline]
          btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089
          vfs_get_tree+0x8f/0x380 fs/super.c:1780
          fc_mount+0x16/0xc0 fs/namespace.c:1125
          btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline]
          btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090
          vfs_get_tree+0x8f/0x380 fs/super.c:1780
          do_new_mount fs/namespace.c:3352 [inline]
          path_mount+0x6e1/0x1f10 fs/namespace.c:3679
          do_mount fs/namespace.c:3692 [inline]
          __do_sys_mount fs/namespace.c:3898 [inline]
          __se_sys_mount fs/namespace.c:3875 [inline]
          __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875
          do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
          __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
          do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
          entry_SYSENTER_compat_after_hwframe+0x84/0x8e

   -> #1 (btrfs_trans_num_writers){++++}-{0:0}:
          join_transaction+0x148/0xf40 fs/btrfs/transaction.c:314
          start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700
          btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323
          btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999
          open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554
          btrfs_fill_super fs/btrfs/super.c:946 [inline]
          btrfs_get_tree_super fs/btrfs/super.c:1863 [inline]
          btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089
          vfs_get_tree+0x8f/0x380 fs/super.c:1780
          fc_mount+0x16/0xc0 fs/namespace.c:1125
          btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline]
          btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090
          vfs_get_tree+0x8f/0x380 fs/super.c:1780
          do_new_mount fs/namespace.c:3352 [inline]
          path_mount+0x6e1/0x1f10 fs/namespace.c:3679
          do_mount fs/namespace.c:3692 [inline]
          __do_sys_mount fs/namespace.c:3898 [inline]
          __se_sys_mount fs/namespace.c:3875 [inline]
          __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875
          do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
          __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
          do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
          entry_SYSENTER_compat_after_hwframe+0x84/0x8e

   -> #0 (sb_internal#3){.+.+}-{0:0}:
          check_prev_add kernel/locking/lockdep.c:3134 [inline]
          check_prevs_add kernel/locking/lockdep.c:3253 [inline]
          validate_chain kernel/locking/lockdep.c:3869 [inline]
          __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
          lock_acquire kernel/locking/lockdep.c:5754 [inline]
          lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
          percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
          __sb_start_write include/linux/fs.h:1655 [inline]
          sb_start_intwrite include/linux/fs.h:1838 [inline]
          start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694
          btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275
          btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291
          evict+0x2ed/0x6c0 fs/inode.c:667
          iput_final fs/inode.c:1741 [inline]
          iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
          iput+0x5c/0x80 fs/inode.c:1757
          btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline]
          btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189
          super_cache_scan+0x409/0x550 fs/super.c:227
          do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435
          shrink_slab+0x18a/0x1310 mm/shrinker.c:662
          shrink_one+0x493/0x7c0 mm/vmscan.c:4790
          shrink_many mm/vmscan.c:4851 [inline]
          lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951
          shrink_node mm/vmscan.c:5910 [inline]
          kswapd_shrink_node mm/vmscan.c:6720 [inline]
          balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911
          kswapd+0x5ea/0xbf0 mm/vmscan.c:7180
          kthread+0x2c1/0x3a0 kernel/kthread.c:389
          ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
          ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

   other info that might help us debug this:

   Chain exists of:
     sb_internal#3 --> btrfs_trans_num_extwriters --> fs_reclaim

    Possible unsafe locking scenario:

          CPU0                    CPU1
          ----                    ----
     lock(fs_reclaim);
                                  lock(btrfs_trans_num_extwriters);
                                  lock(fs_reclaim);
     rlock(sb_internal#3);

    *** DEADLOCK ***

   2 locks held by kswapd0/111:
    #0: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924
    #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_trylock_shared fs/super.c:562 [inline]
    #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_cache_scan+0x96/0x550 fs/super.c:196

   stack backtrace:
   CPU: 0 PID: 111 Comm: kswapd0 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
   Call Trace:
    <TASK>
    __dump_stack lib/dump_stack.c:88 [inline]
    dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114
    check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187
    check_prev_add kernel/locking/lockdep.c:3134 [inline]
    check_prevs_add kernel/locking/lockdep.c:3253 [inline]
    validate_chain kernel/locking/lockdep.c:3869 [inline]
    __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
    lock_acquire kernel/locking/lockdep.c:5754 [inline]
    lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
    percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
    __sb_start_write include/linux/fs.h:1655 [inline]
    sb_start_intwrite include/linux/fs.h:1838 [inline]
    start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694
    btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275
    btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291
    evict+0x2ed/0x6c0 fs/inode.c:667
    iput_final fs/inode.c:1741 [inline]
    iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
    iput+0x5c/0x80 fs/inode.c:1757
    btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline]
    btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189
    super_cache_scan+0x409/0x550 fs/super.c:227
    do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435
    shrink_slab+0x18a/0x1310 mm/shrinker.c:662
    shrink_one+0x493/0x7c0 mm/vmscan.c:4790
    shrink_many mm/vmscan.c:4851 [inline]
    lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951
    shrink_node mm/vmscan.c:5910 [inline]
    kswapd_shrink_node mm/vmscan.c:6720 [inline]
    balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911
    kswapd+0x5ea/0xbf0 mm/vmscan.c:7180
    kthread+0x2c1/0x3a0 kernel/kthread.c:389
    ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
    ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
    </TASK>

So fix this by using btrfs_add_delayed_iput() so that the final iput is
delegated to the cleaner kthread.

Link: https://lore.kernel.org/linux-btrfs/000000000000892280061a344581@google.com/
Reported-by: syzbot+3dad89b3993a4b275e72@syzkaller.appspotmail.com
Fixes: 956a17d9d0 ("btrfs: add a shrinker for extent maps")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 16:45:18 +02:00
Filipe Manana
8e7860543a btrfs: fix extent map use-after-free when adding pages to compressed bio
At add_ra_bio_pages() we are accessing the extent map to calculate
'add_size' after we dropped our reference on the extent map, resulting
in a use-after-free. Fix this by computing 'add_size' before dropping our
extent map reference.

Reported-by: syzbot+853d80cba98ce1157ae6@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000038144061c6d18f2@google.com/
Fixes: 6a40491020 ("btrfs: subpage: make add_ra_bio_pages() compatible")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 16:32:22 +02:00
Filipe Manana
320d8dc612 btrfs: fix bitmap leak when loading free space cache on duplicate entry
If we failed to link a free space entry because there's already a
conflicting entry for the same offset, we free the free space entry but
we don't free the associated bitmap that we had just allocated before.
Fix that by freeing the bitmap before freeing the entry.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:52:25 +02:00
Qu Wenruo
a39484371d btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io()
Previously we had a BUG_ON() inside extent_range_clear_dirty_for_io(), as
we expected all involved folios to be still locked, thus no folio should be
missing.

However for extent_range_clear_dirty_for_io() itself, we can skip the
missing folio and handle the remaining ones, and return an error if
there is anything wrong.

Remove the BUG_ON() and let the caller to handle the error.
In the caller we do not have a quick way to cleanup the error, but all
the compression routines would handle the missing folio as an error and
properly error out, so we only need to do an ASSERT() for developers,
while for non-debug build the compression routine would handle the
error correctly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:52:25 +02:00
Qu Wenruo
af61081fb5 btrfs: move extent_range_clear_dirty_for_io() into inode.c
The function is only used inside inode.c by compress_file_range(),
so move it to inode.c and unexport it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:52:25 +02:00
David Sterba
be9438f077 btrfs: enhance compression error messages
Add more verbose and specific messages to all main error points in
compression code for all algorithms. Currently there's no way to know
which inode is affected or where in the data errors happened.

The messages follow a common format:

- what happened
- error code if relevant
- root and inode
- additional data like offsets or lengths

There's no helper for the messages as they differ in some details and
that would be cumbersome to generalize to a single function. As all the
errors are "almost never happens" there are the unlikely annotations
done as compression is hot path.

Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:52:25 +02:00
Filipe Manana
ca84529a84 btrfs: fix data race when accessing the last_trans field of a root
KCSAN complains about a data race when accessing the last_trans field of a
root:

  [  199.553628] BUG: KCSAN: data-race in btrfs_record_root_in_trans [btrfs] / record_root_in_trans [btrfs]

  [  199.555186] read to 0x000000008801e308 of 8 bytes by task 2812 on cpu 1:
  [  199.555210]  btrfs_record_root_in_trans+0x9a/0x128 [btrfs]
  [  199.555999]  start_transaction+0x154/0xcd8 [btrfs]
  [  199.556780]  btrfs_join_transaction+0x44/0x60 [btrfs]
  [  199.557559]  btrfs_dirty_inode+0x9c/0x140 [btrfs]
  [  199.558339]  btrfs_update_time+0x8c/0xb0 [btrfs]
  [  199.559123]  touch_atime+0x16c/0x1e0
  [  199.559151]  pipe_read+0x6a8/0x7d0
  [  199.559179]  vfs_read+0x466/0x498
  [  199.559204]  ksys_read+0x108/0x150
  [  199.559230]  __s390x_sys_read+0x68/0x88
  [  199.559257]  do_syscall+0x1c6/0x210
  [  199.559286]  __do_syscall+0xc8/0xf0
  [  199.559318]  system_call+0x70/0x98

  [  199.559431] write to 0x000000008801e308 of 8 bytes by task 2808 on cpu 0:
  [  199.559464]  record_root_in_trans+0x196/0x228 [btrfs]
  [  199.560236]  btrfs_record_root_in_trans+0xfe/0x128 [btrfs]
  [  199.561097]  start_transaction+0x154/0xcd8 [btrfs]
  [  199.561927]  btrfs_join_transaction+0x44/0x60 [btrfs]
  [  199.562700]  btrfs_dirty_inode+0x9c/0x140 [btrfs]
  [  199.563493]  btrfs_update_time+0x8c/0xb0 [btrfs]
  [  199.564277]  file_update_time+0xb8/0xf0
  [  199.564301]  pipe_write+0x8ac/0xab8
  [  199.564326]  vfs_write+0x33c/0x588
  [  199.564349]  ksys_write+0x108/0x150
  [  199.564372]  __s390x_sys_write+0x68/0x88
  [  199.564397]  do_syscall+0x1c6/0x210
  [  199.564424]  __do_syscall+0xc8/0xf0
  [  199.564452]  system_call+0x70/0x98

This is because we update and read last_trans concurrently without any
type of synchronization. This should be generally harmless and in the
worst case it can make us do extra locking (btrfs_record_root_in_trans())
trigger some warnings at ctree.c or do extra work during relocation - this
would probably only happen in case of load or store tearing.

So fix this by always reading and updating the field using READ_ONCE()
and WRITE_ONCE(), this silences KCSAN and prevents load and store tearing.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:52:25 +02:00
Qu Wenruo
0fbf6cbd72 btrfs: rename the extra_gfp parameter of btrfs_alloc_page_array()
There is only one caller utilizing the @extra_gfp parameter,
alloc_eb_folio_array().  And in that case the extra_gfp is only assigned
to __GFP_NOFAIL.

Rename the @extra_gfp parameter to @nofail to indicate that.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:30 +02:00
Qu Wenruo
fea91134c2 btrfs: remove the extra_gfp parameter from btrfs_alloc_folio_array()
The function btrfs_alloc_folio_array() is only utilized in
btrfs_submit_compressed_read() and no other location, and the only
caller is not utilizing the @extra_gfp parameter.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:30 +02:00
Qu Wenruo
32e6216512 btrfs: introduce new "rescue=ignoresuperflags" mount option
This new mount option allows the kernel to skip the super flags check,
it's mostly to allow the kernel to do a rescue mount of an interrupted
checksum conversion.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:30 +02:00
Qu Wenruo
169aaaf2e0 btrfs: introduce new "rescue=ignoremetacsums" mount option
Introduce "rescue=ignoremetacsums" to ignore metadata csums, all the
other metadata sanity checks are still kept as is.

This new mount option is mostly to allow the kernel to mount an
interrupted checksum conversion (at the metadata csum overwrite stage).

And since the main part of metadata sanity checks is inside
tree-checker, we shouldn't lose much safety, and the new mount option is
rescue mount option it requires full read-only mount.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:29 +02:00
Qu Wenruo
cf31b271e0 btrfs: output the unrecognized super block flags as hex
Most of the extra super block flags are beyond 32bits (from
CHANGING_FSID_V2 to CHANGING_*_CSUMS), thus using %llu is not only too
long and pretty hard to read.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:29 +02:00
Qu Wenruo
14114c98a8 btrfs: remove unused Opt enums
The following three Opt_* enums haven't been utilized since the port to
new mount API:

- Opt_ignorebadroots
- Opt_ignoredatacsums
- Opt_rescue_all

All those enums are from the old day where we have dedicated mount
options, nowadays they have been moved to "rescue=" mount option
groups, and no more global tokens for them.

So we can safely remove them now.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:29 +02:00
Qu Wenruo
5fc070a924 btrfs: tree-checker: add extra ram_bytes and disk_num_bytes check
This is to ensure non-compressed file extents (both regular and
prealloc) should have matching ram_bytes and disk_num_bytes.

This is only for CONFIG_BTRFS_DEBUG and CONFIG_BTRFS_ASSERT case,
furthermore this will not return error, but just a kernel warning to
inform developers.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:29 +02:00
Qu Wenruo
896c8b92dd btrfs: fix the ram_bytes assignment for truncated ordered extents
[HICCUP]
After adding extra checks on btrfs_file_extent_item::ram_bytes to
tree-checker, running fsstress leads to tree-checker warning at write time,
as we created file extent items with an invalid ram_bytes.

All those offending file extents have offset 0, and ram_bytes matching
num_bytes, and smaller than disk_num_bytes.

This would also trigger the recently enhanced btrfs-check, which catches
such mismatches and report them as minor errors.

[CAUSE]
When a folio/page is invalidated and it is part of a submitted OE, we
mark the OE truncated just to the beginning of the folio/page.

And for truncated OE, we insert the file extent item with incorrect
value for ram_bytes (using num_bytes instead of the usual value).

This is not a big deal for end users, as we do not utilize the ram_bytes
field for regular non-compressed extents.
This mismatch is just a small violation against on-disk format.

[FIX]
Fix it by removing the override on btrfs_file_extent_item::ram_bytes.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:29 +02:00
Qu Wenruo
1b87d26add btrfs: make validate_extent_map() catch ram_bytes mismatch
Previously validate_extent_map() is only to catch bugs related to
extent_map member cleanups.

But with recent btrfs-check enhancement to catch ram_bytes mismatch with
disk_num_bytes, it would be much better to catch such extent maps
earlier.

So this patch adds extra ram_bytes validation for extent maps.

Please note that, older filesystems with such mismatch won't trigger this error:

- extent_map::ram_bytes is already fixed
  Previous patch has already fixed the ram_bytes for affected file
  extents.

So this enhanced sanity check should not affect end users.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:29 +02:00
Qu Wenruo
88e2e6d724 btrfs: ignore incorrect btrfs_file_extent_item::ram_bytes
[HICCUP]
Kernels can create file extent items with incorrect ram_bytes like this:

	item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
		generation 7 type 1 (regular)
		extent data disk byte 13631488 nr 32768
		extent data offset 0 nr 4096 ram 4096
		extent compression 0 (none)

Thankfully kernel can handle them properly, as in that case ram_bytes is
not utilized at all.

[ENHANCEMENT]
Since the hiccup is not going to cause any data-loss and is only a minor
violation of on-disk format, here we only need to ignore the incorrect
ram_bytes value, and use the correct one from
btrfs_file_extent_item::disk_num_bytes.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:29 +02:00
Qu Wenruo
0edeb6ea46 btrfs: cleanup the bytenr usage inside btrfs_extent_item_to_extent_map()
[HICCUP]
Before commit 85de2be7129c ("btrfs: remove extent_map::block_start
member"), we utilized @bytenr variable inside
btrfs_extent_item_to_extent_map() to calculate block_start.

But that commit removed block_start completely, we have no need to
advance @bytenr at all.

[ENHANCEMENT]
- Rename @bytenr as @disk_bytenr
- Only declare @disk_bytenr inside the if branch
- Make @disk_bytenr const and remove the modification on it

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:29 +02:00
Mark Harmstone
0102ab54e4 btrfs: fix typo in error message in btrfs_validate_super()
There's a typo in an error message when checking the block group tree
feature, it mentions fres-space-tree instead of free-space-tree. Fix
that.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:29 +02:00
Filipe Manana
9aa29a20b7 btrfs: move the direct IO code into its own file
The direct IO code is over a thousand lines and it's currently spread
between file.c and inode.c, which makes it not easy to locate some parts
of it sometimes. Also inode.c is about 11 thousand lines and file.c about
4 thousand lines, both too big. So move all the direct IO code into a
dedicated file, so that it's easy to locate all its code and reduce the
sizes of inode.c and file.c.

This is a pure move of code without any other changes except export a
a couple functions from inode.c (get_extent_allocation_hint() and
create_io_em()) because they are used in inode.c and the new direct-io.c
file, and a couple functions from file.c (btrfs_buffered_write() and
btrfs_write_check()) because they are used both in file.c and in the new
direct-io.c file.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:29 +02:00
David Sterba
0d9b7e166a btrfs: pass a btrfs_inode to btrfs_set_prop()
Pass a struct btrfs_inode to btrfs_set_prop() as it's an
internal interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:29 +02:00
David Sterba
e2877c2a03 btrfs: pass a btrfs_inode to btrfs_compress_heuristic()
Pass a struct btrfs_inode to btrfs_compress_heuristic() as it's an
internal interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:28 +02:00
David Sterba
a1f4e3d7bd btrfs: switch btrfs_ordered_extent::inode to struct btrfs_inode
The structure is internal so we should use struct btrfs_inode for that,
allowing to remove some use of BTRFS_I.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:28 +02:00
David Sterba
c154a8446b btrfs: switch btrfs_pending_snapshot::dir to btrfs_inode
The structure is internal so we should use struct btrfs_inode for that.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:28 +02:00
David Sterba
24e7459849 btrfs: pass a btrfs_inode to btrfs_ioctl_send()
Pass a struct btrfs_inode to btrfs_ioctl_send() and _btrfs_ioctl_send()
as it's an internal interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:28 +02:00
David Sterba
e108c86b10 btrfs: switch btrfs_block_group::inode to struct btrfs_inode
The structure is internal so we should use struct btrfs_inode for that.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:28 +02:00
David Sterba
8610ba7eab btrfs: pass a btrfs_inode to is_data_inode()
Pass a struct btrfs_inode to is_data_inode() as it's an
internal interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:28 +02:00
David Sterba
a0d7e98ced btrfs: pass a btrfs_inode to btrfs_readdir_get_delayed_items()
Pass a struct btrfs_inode to btrfs_readdir_get_delayed_items() as it's
an internal interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:28 +02:00
David Sterba
849c01ae90 btrfs: pass a btrfs_inode to btrfs_readdir_put_delayed_items()
Pass a struct btrfs_inode to btrfs_readdir_put_delayed_items() as it's
an internal interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:28 +02:00
Johannes Thumshirn
2422547e99 btrfs: remove raid-stripe-tree encoding field from stripe_extent
Remove the encoding field from 'struct btrfs_stripe_extent'. It was
originally intended to encode the RAID type as well as if we're a data
or a parity stripe.

But the RAID type can be inferred form the block-group and the data vs.
parity differentiation can be done easier with adding a new key type
for parity stripes in the RAID stripe tree.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:28 +02:00
Qu Wenruo
e2c1887329 btrfs: print-tree: add generation and type dump for EXTENT_DATA_KEY
When debugging the recent ram_bytes mismatch bug, I can hit it with
enhanced tree-checker for file extent items at write time.

But the bug is not that easy to trigger (mostly triggered with
btrfs/06*, which uses 20 threads fsstress), and when I hit it, the only
info is the kernel leaf dump, but it doesn't include things like the
file extent type (REGULAR or PREALLOC).

Add the dump for generation and type (although only numeric output) to
make debugging a little easier.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:28 +02:00
Boris Burkov
0e962e755b btrfs: urgent periodic reclaim pass
Periodic reclaim attempts to avoid block_groups seeing active use with a
sweep mark that gets cleared on allocation and set on a sweep. In urgent
conditions where we have very little unallocated space (less than one
chunk used by the threshold calculation for the unallocated target), we
want to be able to override this mechanism.

Introduce a second pass that only happens if we fail to find a reclaim
candidate and reclaim is urgent. In that case, do a second pass where
all block groups are eligible.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:27 +02:00
Boris Burkov
813d4c6422 btrfs: prevent pathological periodic reclaim loops
Periodic reclaim runs the risk of getting stuck in a state where it
keeps reclaiming the same block group over and over. This can happen if

1. reclaiming that block_group fails
2. reclaiming that block_group fails to move any extents into existing
   block_groups and just allocates a fresh chunk and moves everything.

Currently, 1. is a very tight loop inside the reclaim worker. That is
critical for edge triggered reclaim or else we risk forgetting about a
reclaimable group. On the other hand, with level triggered reclaim we
can break out of that loop and get it later.

With that fixed, 2. applies to both failures and "successes" with no
progress. If we have done a periodic reclaim on a space_info and nothing
has changed in that space_info, there is not much point to trying again,
so don't, until enough space gets free, which we capture with a
heuristic of needing to net free 1 chunk.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:27 +02:00
Boris Burkov
e4ca3932ae btrfs: periodic block_group reclaim
We currently employ a edge-triggered block group reclaim strategy which
marks block groups for reclaim as they free down past a threshold.

With a dynamic threshold, this is worse than doing it in a
level-triggered fashion periodically. That is because the reclaim
itself happens periodically, so the threshold at that point in time is
what really matters, not the threshold at freeing time. If we mark the
reclaim in a big pass, then sort by usage and do reclaim, we also
benefit from a negative feedback loop preventing unnecessary reclaims as
we crunch through the "best" candidates.

Since this is quite a different model, it requires some additional
support. The edge triggered reclaim has a good heuristic for not
reclaiming fresh block groups, so we need to replace that with a typical
GC sweep mark which skips block groups that have seen an allocation
since the last sweep.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:27 +02:00
Boris Burkov
f5ff64ccf7 btrfs: dynamic block_group reclaim threshold
We can currently recover allocated block_groups by:

- explicitly starting balance operations
- "auto reclaim" via bg_reclaim_threshold

The latter works by checking against a fixed threshold on frees. If we
pass from above the threshold to below, relocation triggers and the
block group will get reclaimed by the cleaner thread (assuming it is
still eligible)

Picking a threshold is challenging. Too high, and you end up trying to
reclaim very full block_groups which is quite costly, and you don't do
reclaim on block_groups that don't get quite THAT full, but could still
be quite fragmented and stranding a lot of space. Too low, and you
similarly miss out on reclaim even if you badly need it to avoid running
out of unallocated space, if you have heavily fragmented block groups
living above the threshold.

No matter the threshold, it suffers from a workload that happens to
bounce around that threshold, which can introduce arbitrary amounts of
reclaim waste.

To improve this situation, introduce a dynamic threshold. The basic idea
behind this threshold is that it should be very lax when there is plenty
of unallocated space, and increasingly aggressive as we approach zero
unallocated space. To that end, it sets a target for unallocated space
(10 chunks) and then linearly increases the threshold as the amount of
space short of the target we are increases. The formula is:
(target - unalloc) / target

I tested this by running it on three interesting workloads:

  1. bounce allocations around X% full.
  2. fill up all the way and introduce full fragmentation.
  3. write in a fragmented way until the filesystem is just about full.

1. and 2. attack the weaknesses of a fixed threshold; fixed either works
perfectly or fully falls apart, depending on the threshold. Dynamic
always handles these cases well.

3. attacks dynamic by checking whether it is too zealous to reclaim in
conditions with low unallocated and low unused. It tends to claw back
1GiB of unallocated fairly aggressively, but not much more. Early
versions of dynamic threshold struggled on this test.

Additional work could be done to intelligently ratchet up the urgency of
reclaim in very low unallocated conditions. Existing mechanisms are
already useless in that case anyway.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:27 +02:00
Boris Burkov
42f620aec1 btrfs: store fs_info in space_info
This is handy when computing space_info dynamic reclaim thresholds where
we do not have access to a block group. We could add it to the various
functions as a parameter, but it seems reasonable for space_info to have
an fs_info pointer.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:27 +02:00
Boris Burkov
243192b676 btrfs: report reclaim stats in sysfs
When evaluating various reclaim strategies/thresholds against each
other, it is useful to collect data about the amount of reclaim
happening. Expose a count, error count, and byte count via sysfs
per space_info.

Note that this is only for automatic reclaim, not manually invoked
balances or other codepaths that use "relocate_block_group"

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:27 +02:00
David Sterba
a5b3abb18c btrfs: qgroup: warn about inconsistent qgroups when relation update fails
Calling btrfs_handle_fs_error() after btrfs_run_qgroups() fails to
update the qgroup status is probably not necessary, this would turn the
filesystem to read-only. For the same reason aborting the transaction is
also not a good option.

The state is left inconsistent and can be fixed by rescan, printing a
warning should be sufficient. Return code reflects the status of
adding/deleting the relation and if the transaction was ended properly.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:27 +02:00
David Sterba
4addc1ffd6 btrfs: qgroup: preallocate memory before adding a relation
There's a transaction joined in the qgroup relation add/remove ioctl and
any error will lead to abort/error. We could lift the allocation from
btrfs_add_qgroup_relation() and move it outside of the transaction
context. The relation deletion does not need that.

The ownership of the structure is moved to the add relation handler.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:27 +02:00
David Sterba
7733b8dd18 btrfs: abort transaction on errors in btrfs_free_chunk()
The errors during removing a chunk item are fatal, we expect to have a
matching item in the chunk map from which the chunk_offset is taken.
Handle that by transaction abort.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:27 +02:00
David Sterba
b9878a89e9 btrfs: only print error message when checking item size in print_extent_item()
The extent item used to have a v0 that was removed in 6.6. There's a
check for minimum expected size that could lead to
btrfs_handle_fs_error() that would make the filesystem read-only. As we
don't have v0 anymore (and haven't seen any reports in the deprecation
period), handle this in a less intrusive way.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:27 +02:00
David Sterba
f4f8947732 btrfs: abort transaction if we don't find extref in btrfs_del_inode_extref()
When an extended ref is deleted we do a sanity check right before
removing the item, if we can't find it then handle the error. This is
done by btrfs_handle_fs_error() but this is from the time before we had
the transaction abort infrastructure, so switch to that. The end result
is the same, the error is reported and switched to read-only. We newly
return the -ENOENT error code as this better represents what happened.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:27 +02:00
Filipe Manana
eba1469f8f btrfs: avoid allocating and running pointless delayed extent operations
We always allocate a delayed extent op structure when allocating a tree
block (except for log trees), but most of the time we don't need it as
we only need to set the BTRFS_BLOCK_FLAG_FULL_BACKREF if we're dealing
with a relocation tree and we only need to set the key of a tree block
in a btrfs_tree_block_info structure if we are not using skinny metadata
(feature enabled by default since btrfs-progs 3.18 and available as of
kernel 3.10).

In these cases, where we don't need neither to update flags nor to set
the key, we only use the delayed extent op structure to set the tree
block's level. This is a waste of memory and besides that, the memory
allocation can fail and can add additional latency.

Instead of using a delayed extent op structure to store the level of
the tree block, use the delayed ref head to store it. This doesn't
change the size of neither structure and helps us avoid allocating
delayed extent ops structures when using the skinny metadata feature
and there's no relocation going on. This also gets rid of a BUG_ON().

For example, for a fs_mark run, with 5 iterations, 8 threads and 100K
files per iteration, before this patch there were 118109 allocations
of delayed extent op structures and after it there were none.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:26 +02:00
Boris Burkov
33336c1805 btrfs: preallocate ulist memory for qgroup rsv
When qgroups are enabled, during data reservation, we allocate the
ulist_nodes that track the exact reserved extents with GFP_ATOMIC
unconditionally. This is unnecessary, and we can follow the model
already employed by the struct extent_state we preallocate in the non
qgroups case, which should reduce the risk of allocation failures with
GFP_ATOMIC.

Add a prealloc node to struct ulist which ulist_add will grab when it is
present, and try to allocate it before taking the tree lock while we can
still take advantage of a less strict gfp mask. The lifetime of that
node belongs to the new prealloc field, until it is used, at which point
it belongs to the ulist linked list.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:26 +02:00
Filipe Manana
28cb13f29f btrfs: don't BUG_ON() when 0 reference count at btrfs_lookup_extent_info()
Instead of doing a BUG_ON() handle the error by returning -EUCLEAN,
aborting the transaction and logging an error message.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:26 +02:00
Filipe Manana
5c83b3beae btrfs: reduce nesting for extent processing at btrfs_lookup_extent_info()
Instead of using an if-else statement when processing the extent item at
btrfs_lookup_extent_info(), use a single if statement for the error case
since it does a goto at the end and leave the success (expected) case
following the if statement, reducing indentation and making the logic a
bit easier to follow. Also make the if statement's condition as unlikely
since it's not expected to ever happen, as it signals some corruption,
making it clear and hint the compiler to generate more efficient code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:26 +02:00
Filipe Manana
c65967ac4d btrfs: remove superfluous metadata check at btrfs_lookup_extent_info()
If we didn't found an extent item with the initial btrfs_search_slot()
call, it's pointless to test if the "metadata" variable is "true", because
right after we check if the key type is BTRFS_METADATA_ITEM_KEY and that
is the case only when "metadata" is set to "true". So remove the redundant
check.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:26 +02:00
Filipe Manana
b56329a782 btrfs: replace BUG_ON() with error handling at update_ref_for_cow()
Instead of a BUG_ON() just return an error, log an error message and
abort the transaction in case we find an extent buffer belonging to the
relocation tree that doesn't have the full backref flag set. This is
unexpected and should never happen (save for bugs or a potential bad
memory).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:26 +02:00
Filipe Manana
716404e59a btrfs: simplify setting the full backref flag at update_ref_for_cow()
We keep a "new_flags" variable only to keep track if we need to update the
metadata extent's flags, and when we set BTRFS_BLOCK_FLAG_FULL_BACKREF in
the variable, we do it in an inner scope. Then check in an outer scope
if the variable is not 0 and if so call btrfs_set_disk_extent_flags().
The variable isn't used for anything else. This is somewhat confusing, so
to make it more straightforward update the extent's flags where we are
currently updating "new_flags" and remove the variable.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:26 +02:00
Filipe Manana
119474bdba btrfs: remove NULL transaction support for btrfs_lookup_extent_info()
There are no callers of btrfs_lookup_extent_info() that pass a NULL value
for the transaction handle argument, so there's no point in having special
logic to deal with the NULL. The last caller that passed a NULL value was
removed in commit 19b546d7a1 ("btrfs: relocation:
Use btrfs_find_all_leafs to locate data extent parent tree leaves").

So remove the NULL handling from btrfs_lookup_extent_info().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:26 +02:00
Filipe Manana
d12765dc02 btrfs: use label to deduplicate error path at btrfs_force_cow_block()
At btrfs_force_cow_block() we have several error paths that need to
unlock the "cow" extent buffer, drop the reference on it and then return
an error. This is a bit verbose so add a label where we perform these
tasks and make the error paths jump to that label.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:26 +02:00
Filipe Manana
bb3868033a btrfs: do not BUG_ON() when freeing tree block after error
When freeing a tree block, at btrfs_free_tree_block(), if we fail to
create a delayed reference we don't deal with the error and just do a
BUG_ON(). The error most likely to happen is -ENOMEM, and we have a
comment mentioning that only -ENOMEM can happen, but that is not true,
because in case qgroups are enabled any error returned from
btrfs_qgroup_trace_extent_post() (can be -EUCLEAN or anything returned
from btrfs_search_slot() for example) can be propagated back to
btrfs_free_tree_block().

So stop doing a BUG_ON() and return the error to the callers and make
them abort the transaction to prevent leaking space. Syzbot was
triggering this, likely due to memory allocation failure injection.

Reported-by: syzbot+a306f914b4d01b3958fe@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000fcba1e05e998263c@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:26 +02:00