linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-16 00:52:01 +00:00

Author	SHA1	Message	Date
Filipe Manana	53499d5f6b	btrfs: remove unused is_head field from struct btrfs_delayed_ref_node The 'is_head' field of struct btrfs_delayed_ref_node is no longer used after commit `d278850eff` ("btrfs: remove delayed_ref_node from ref_head"), so remove it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:31 +02:00
Filipe Manana	315dd5cc75	btrfs: reorder some members of struct btrfs_delayed_ref_head Currently struct btrfs_delayed_ref_head has its 'bytenr' and 'href_node' members in different cache lines (even on a release, non-debug, kernel). This is not optimal because when iterating the red black tree of delayed ref heads for inserting a new delayed ref head (htree_insert()) we have to pull in 2 cache lines of delayed ref heads we find in a path, one for the tree node (struct rb_node) and another one for the 'bytenr' field. The same applies when searching for an existing delayed ref head (find_ref_head()). On a release (non-debug) kernel, the structure also has two 4 bytes holes, which makes it 8 bytes longer than necessary. Its current layout is the following: struct btrfs_delayed_ref_head { u64 bytenr; /* 0 8 */ u64 num_bytes; /* 8 8 */ refcount_t refs; /* 16 4 */ /* XXX 4 bytes hole, try to pack */ struct mutex mutex; /* 24 32 */ spinlock_t lock; /* 56 4 */ /* XXX 4 bytes hole, try to pack */ /* --- cacheline 1 boundary (64 bytes) --- */ struct rb_root_cached ref_tree; /* 64 16 */ struct list_head ref_add_list; /* 80 16 */ struct rb_node href_node __attribute__((__aligned__(8))); /* 96 24 */ struct btrfs_delayed_extent_op *extent_op; /* 120 8 */ /* --- cacheline 2 boundary (128 bytes) --- */ int total_ref_mod; /* 128 4 */ int ref_mod; /* 132 4 */ unsigned int must_insert_reserved:1; /* 136: 0 4 */ unsigned int is_data:1; /* 136: 1 4 */ unsigned int is_system:1; /* 136: 2 4 */ unsigned int processing:1; /* 136: 3 4 */ /* size: 144, cachelines: 3, members: 15 */ /* sum members: 128, holes: 2, sum holes: 8 */ /* sum bitfield members: 4 bits (0 bytes) */ /* padding: 4 */ /* bit_padding: 28 bits */ /* forced alignments: 1 */ /* last cacheline: 16 bytes */ } __attribute__((__aligned__(8))); This change reorders the 'href_node' and 'refs' members so that we have the 'href_node' in the same cache line as the 'bytenr' field, while also eliminating the two holes and reducing the structure size from 144 bytes down to 136 bytes, so we
can now have 30 ref heads per 4K page (on x86_64) instead of 28. The new structure layout after this change is now: struct btrfs_delayed_ref_head { u64 bytenr; /* 0 8 */ u64 num_bytes; /* 8 8 */ struct rb_node href_node __attribute__((__aligned__(8))); /* 16 24 */ struct mutex mutex; /* 40 32 */ /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */ refcount_t refs; /* 72 4 */ spinlock_t lock; /* 76 4 */ struct rb_root_cached ref_tree; /* 80 16 */ struct list_head ref_add_list; /* 96 16 */ struct btrfs_delayed_extent_op *extent_op; /* 112 8 */ int total_ref_mod; /* 120 4 */ int ref_mod; /* 124 4 */ /* --- cacheline 2 boundary (128 bytes) --- */ unsigned int must_insert_reserved:1; /* 128: 0 4 */ unsigned int is_data:1; /* 128: 1 4 */ unsigned int is_system:1; /* 128: 2 4 */ unsigned int processing:1; /* 128: 3 4 */ /* size: 136, cachelines: 3, members: 15 */ /* padding: 4 */ /* bit_padding: 28 bits */ /* forced alignments: 1 */ /* last cacheline: 8 bytes */ } __attribute__((__aligned__(8))); Running the following fs_mark test shows some significant improvement.
$ cat test.sh
#!/bin/bash
# 15G null block device
DEV=/dev/nullb0
MNT=/mnt/nullb0
FILES=100000
THREADS=$(nproc --all)
FILE_SIZE=0
echo "performance" | \
    tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
    OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before this change:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 112631.3 11928055
16 2400000 0 189943.8 12140777
23 3600000 0 150719.2 13178480
50 4800000 0 99137.3 12504293
53 6000000 0 111733.9 12670836
Total files/sec: 664165.5
After this change:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 148589.5 11565889
16 2400000 0 227743.8 11561596
23 3600000 0 191590.5 12550755
30 4800000 0 179812.3 12629610
53 6000000 0 92471.4 12352383
Total files/sec: 840207.5
Measuring the execution times of htree_insert(), in nanoseconds, during those fs_mark runs:
Before this change:
Range: 0.000 - 940647.000; Mean: 619.733; Median: 548.000; Stddev: 1834.231
Percentiles: 90th: 980.000; 95th: 1208.000; 99th: 2090.000
0.000 - 6.384: 257 |
6.384 - 26.259: 977 |
26.259 - 99.635: 4963 |
99.635 - 370.526: 136800 #############
370.526 - 1370.603: 566110 #####################################################
1370.603 - 5062.704: 24945 ##
5062.704 - 18693.248: 944 |
18693.248 - 69014.670: 211 |
69014.670 - 254791.959: 30 |
254791.959 - 940647.000: 4 |
After this change:
Range: 0.000 - 299200.000; Mean: 587.754; Median: 542.000; Stddev: 1030.422
Percentiles: 90th: 918.000; 95th: 1113.000; 99th: 1987.000
0.000 - 5.585: 163 |
5.585 - 20.678: 452 |
20.678 - 70.369: 1806 |
70.369 - 233.965: 26268 ####
233.965 - 772.564: 333519 #####################################################
772.564 - 2545.771: 91820 ###############
2545.771 - 8383.615: 2238 |
8383.615 - 27603.280: 170 |
27603.280 - 90879.297: 68 |
90879.297 - 299200.000: 12 |
Mean, percentiles, maximum times are
all better, as well as a lower standard deviation. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:31 +02:00
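The effect of member ordering on holes and cache-line placement can be reproduced outside the kernel. The following is a minimal sketch, not the btrfs code: the `layout_before` and `layout_after` structs and the `cache_line_of()` helper are hypothetical stand-ins that use plain integer arrays in place of `struct mutex` and `struct rb_node`, and the sizes in the comments assume a 64-bit ABI where `uint64_t` is 8-byte aligned and cache lines are 64 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64 /* assumed cache line size in bytes */

/* Hypothetical "before" layout: two 4-byte holes, node far from the key. */
struct layout_before {
    uint64_t bytenr;    /* offset  0, size  8 */
    uint64_t num_bytes; /* offset  8, size  8 */
    uint32_t refs;      /* offset 16, size  4, then a 4-byte hole */
    uint64_t mutex[4];  /* offset 24, size 32 (stand-in for a mutex) */
    uint32_t lock;      /* offset 56, size  4, then a 4-byte hole */
    uint64_t node[3];   /* offset 64, size 24 (stand-in for rb_node) */
};

/* Hypothetical "after" layout: node next to the key, 4-byte members paired. */
struct layout_after {
    uint64_t bytenr;    /* offset  0, size  8 */
    uint64_t num_bytes; /* offset  8, size  8 */
    uint64_t node[3];   /* offset 16, size 24 */
    uint64_t mutex[4];  /* offset 40, size 32 */
    uint32_t refs;      /* offset 72, size  4 */
    uint32_t lock;      /* offset 76, size  4, no hole */
};

/* Index of the cache line that contains the byte at the given offset. */
static size_t cache_line_of(size_t offset)
{
    return offset / DEMO_CACHE_LINE;
}

/* Pairing the two 4-byte members removes both holes. */
static_assert(sizeof(struct layout_after) < sizeof(struct layout_before),
              "reordered layout should be smaller");
```

With these stand-ins, `cache_line_of(offsetof(struct layout_before, node))` differs from the line holding `bytenr`, while in `layout_after` both fall in the first line. On a real kernel build, `pahole` reports the authoritative layout.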
Qu Wenruo	31dd8c81dd	btrfs: use the same uptodate variable for end_bio_extent_readpage() In function end_bio_extent_readpage() we call endio_readpage_release_extent() to unlock the extent io tree. However we pass PageUptodate(page) as @uptodate parameter for it, while for previous end_page_read() call, we use a dedicated @uptodate local variable. This is not a big deal, as even for subpage cases, either the bio only covers part of the page, then the @uptodate is always false, and the subpage ranges can still be merged. But for the sake of consistency, always use @uptodate variable when possible. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:31 +02:00
Qu Wenruo	5a96341927	btrfs: subpage: make alloc_extent_buffer() handle previously uptodate range efficiently Currently alloc_extent_buffer() would make the extent buffer uptodate if the corresponding pages are also uptodate. But this check is only checking PageUptodate, which is fine for regular cases, but not for subpage cases, as we can have multiple extent buffers in the same page. So use btrfs_page_test_uptodate() here instead. The old code doesn't cause any problem, but is not efficient, as it would cause an extra metadata read even if the range is already uptodate. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:31 +02:00
David Sterba	b831306b3b	btrfs: print assertion failure report and stack trace from the same line Assertion reports are split into two parts, the exact file and location of the condition and then the stack trace printed from btrfs_assertfail(). This means all the stack traces report the same line and this is what's typically reported by various tools, making it harder to distinguish the reports. [403.2467] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4259 [403.2479] ------------[ cut here ]------------ [403.2484] kernel BUG at fs/btrfs/messages.c:259! [403.2488] invalid opcode: 0000 [#1] PREEMPT SMP KASAN [403.2493] CPU: 2 PID: 23202 Comm: umount Not tainted 6.2.0-rc4-default+ #67 [403.2499] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014 [403.2509] RIP: 0010:btrfs_assertfail+0x19/0x1b [btrfs] ... [403.2595] Call Trace: [403.2598] <TASK> [403.2601] btrfs_free_block_groups.cold+0x52/0xae [btrfs] [403.2608] close_ctree+0x6c2/0x761 [btrfs] [403.2613] ? __wait_for_common+0x2b8/0x360 [403.2618] ? btrfs_cleanup_one_transaction.cold+0x7a/0x7a [btrfs] [403.2626] ? mark_held_locks+0x6b/0x90 [403.2630] ? lockdep_hardirqs_on_prepare+0x13d/0x200 [403.2636] ? __call_rcu_common.constprop.0+0x1ea/0x3d0 [403.2642] ? trace_hardirqs_on+0x2d/0x110 [403.2646] ? __call_rcu_common.constprop.0+0x1ea/0x3d0 [403.2652] generic_shutdown_super+0xb0/0x1c0 [403.2657] kill_anon_super+0x1e/0x40 [403.2662] btrfs_kill_super+0x25/0x30 [btrfs] [403.2668] deactivate_locked_super+0x4c/0xc0 By making btrfs_assertfail a macro we'll get the same line number for the BUG output: [63.5736] assertion failed: 0, in fs/btrfs/super.c:1572 [63.5758] ------------[ cut here ]------------ [63.5782] kernel BUG at fs/btrfs/super.c:1572!
[63.5807] invalid opcode: 0000 [#2] PREEMPT SMP KASAN [63.5831] CPU: 0 PID: 859 Comm: mount Tainted: G D 6.3.0-rc7-default+ #2062 [63.5868] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014 [63.5905] RIP: 0010:btrfs_mount+0x24/0x30 [btrfs] [63.5964] RSP: 0018:ffff88800e69fcd8 EFLAGS: 00010246 [63.5982] RAX: 000000000000002d RBX: ffff888008fc1400 RCX: 0000000000000000 [63.6004] RDX: 0000000000000000 RSI: ffffffffb90fd868 RDI: ffffffffbcc3ff20 [63.6026] RBP: ffffffffc081b200 R08: 0000000000000001 R09: ffff88800e69fa27 [63.6046] R10: ffffed1001cd3f44 R11: 0000000000000001 R12: ffff888005a3c370 [63.6062] R13: ffffffffc058e830 R14: 0000000000000000 R15: 00000000ffffffff [63.6081] FS: 00007f7b3561f800(0000) GS:ffff88806c600000(0000) knlGS:0000000000000000 [63.6105] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [63.6120] CR2: 00007fff83726e10 CR3: 0000000002a9e000 CR4: 00000000000006b0 [63.6137] Call Trace: [63.6143] <TASK> [63.6148] legacy_get_tree+0x80/0xd0 [63.6158] vfs_get_tree+0x43/0x120 [63.6166] do_new_mount+0x1f3/0x3d0 [63.6176] ? do_add_mount+0x140/0x140 [63.6187] ? cap_capable+0xa4/0xe0 [63.6197] path_mount+0x223/0xc10 This comes at a cost of bloating the final btrfs.ko module due to all the inlining, as long as assertions are compiled in. This is a must for debugging builds but this is often enabled on release builds too. Release build: text data bss dec hex filename 1251676 20317 16088 1288081 13a791 pre/btrfs.ko 1260612 29473 16088 1306173 13ee3d post/btrfs.ko DELTA: +8936 CC: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:31 +02:00
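The reason a macro reports the caller's location while a helper function does not can be shown in plain userspace C. This is a hedged illustration of the general preprocessor behaviour only, not the btrfs implementation: `demo_line_in_helper()`, `DEMO_LINE_AT_CALL_SITE()` and `demo_usage()` are hypothetical names, and the kernel's `BUG()` machinery is not modelled here.

```c
#include <assert.h>

/*
 * Helper function: __LINE__ is expanded once, where the helper is defined,
 * so every caller sees the same line number.
 */
static int demo_line_in_helper(void)
{
    return __LINE__;
}

/*
 * Macro: the body is pasted into each call site before compilation,
 * so __LINE__ expands to the line where the macro is used.
 */
#define DEMO_LINE_AT_CALL_SITE() (__LINE__)

static void demo_usage(void)
{
    /* The helper reports its own definition line for every caller. */
    assert(demo_line_in_helper() == demo_line_in_helper());
    /* The macro reports the line it is written on. */
    assert(DEMO_LINE_AT_CALL_SITE() == __LINE__);
}
```

The trade-off described in the commit follows from the same mechanism: because the macro body is duplicated at every use, the reporting code is emitted once per assertion rather than once per module.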
Qu Wenruo	75258f20fb	btrfs: subpage: dump extra subpage bitmaps for debug There is a bug report that assert_eb_page_uptodate() gets triggered for free space tree metadata. Without proper dump for the subpage bitmaps it's much harder to debug. Thus this patch would dump all the subpage bitmaps (split them into their own bitmaps) for easier debugging. The output would look like this: (Dumped after a tree block got read from disk) page:000000006e34bf49 refcount:4 mapcount:0 mapping:0000000067661ac4 index:0x1d1 pfn:0x110e9 memcg:ffff0000d7d62000 aops:btree_aops [btrfs] ino:1 flags: 0x8000000000002002(referenced|private|zone=2) page_type: 0xffffffff() raw: 8000000000002002 0000000000000000 dead000000000122 ffff00000188bed0 raw: 00000000000001d1 ffff0000c7992700 00000004ffffffff ffff0000d7d62000 page dumped because: btrfs subpage dump BTRFS warning (device dm-1): start=30490624 len=16384 page=30474240 bitmaps: uptodate=4-7 error= dirty= writeback= ordered= checked= Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
Tejun Heo	58e814fcac	btrfs: use alloc_ordered_workqueue() to create ordered workqueues BACKGROUND ========== When multiple work items are queued to a workqueue, their execution order doesn't match the queueing order. They may get executed in any order and simultaneously. When fully serialized execution - one by one in the queueing order - is needed, an ordered workqueue should be used which can be created with alloc_ordered_workqueue(). However, alloc_ordered_workqueue() was a later addition. Before it, an ordered workqueue could be obtained by creating an UNBOUND workqueue with @max_active==1. This originally was an implementation side-effect which was broken by `4c16bd327c` ("workqueue: implement NUMA affinity for unbound workqueues"). Because there were users that depended on the ordered execution, `5c0338c687` ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") made workqueue allocation path to implicitly promote UNBOUND workqueues w/ @max_active==1 to ordered workqueues. While this has worked okay, overloading the UNBOUND allocation interface this way creates other issues. It's difficult to tell whether a given workqueue actually needs to be ordered and users that legitimately want a min concurrency level wq unexpectedly get an ordered one instead. With planned UNBOUND workqueue updates to improve execution locality and more prevalence of chiplet designs which can benefit from such improvements, this isn't a state we wanna be in forever. This patch series audits all call sites that create an UNBOUND workqueue w/ @max_active==1 and converts them to alloc_ordered_workqueue() as necessary. BTRFS ===== * fs_info->scrub_workers initialized in scrub_workers_get() was setting @max_active to 1 when @is_dev_replace is set and it seems that the workqueue actually needs to be ordered if @is_dev_replace. Update the code so that alloc_ordered_workqueue() is used if @is_dev_replace.
* fs_info->discard_ctl.discard_workers initialized in btrfs_init_workqueues() was directly using alloc_workqueue() w/ @max_active==1. Converted to alloc_ordered_workqueue(). * fs_info->fixup_workers and fs_info->qgroup_rescan_workers initialized in btrfs_queue_work() use the btrfs's workqueue wrapper, btrfs_workqueue, which are allocated with btrfs_alloc_workqueue(). btrfs_workqueue implements automatic @max_active adjustment which is disabled when the specified max limit is below a certain threshold, so calling btrfs_alloc_workqueue() with @limit_active==1 yields an ordered workqueue whose @max_active won't be changed as the auto-tuning is disabled. This is rather brittle in that nothing clearly indicates that the two workqueues should be ordered or btrfs_alloc_workqueue() must disable auto-tuning when @limit_active==1. This patch factors out the common btrfs_workqueue init code into btrfs_init_workqueue() and adds an explicit btrfs_alloc_ordered_workqueue(). The two workqueues are converted to use the new ordered allocation interface. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
David Sterba	1d12680044	btrfs: drop gfp from parameter extent state helpers Now that all extent state bit helpers effectively take the GFP_NOFS mask (and GFP_NOWAIT is encoded in the bits) we can remove the parameter. This reduces stack consumption in many functions and simplifies a lot of code. Net effect on module on a release build: text data bss dec hex filename 1250432 20985 16088 1287505 13a551 pre/btrfs.ko 1247074 20985 16088 1284147 139833 post/btrfs.ko DELTA: -3358 Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
David Sterba	62bc60473a	btrfs: pass NOWAIT for set/clear extent bits as another bit The only flags we now pass to set_extent_bit/__clear_extent_bit are GFP_NOFS and GFP_NOWAIT (a few functions handling mappings). This requires an extra parameter to be passed everywhere but is almost always the same. Encode the GFP_NOWAIT as an artificial extent bit and extract the real bits and gfp mask in the lowest level helpers. Now the passed gfp mask is not actually used and can be removed. Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
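The idea of carrying the allocation mode as an artificial bit alongside the real extent bits can be sketched in standalone C. Everything below is a hypothetical illustration under assumed names and values: the `DEMO_*` constants, `enum demo_gfp` and `demo_split_bits()` are not the btrfs definitions, and the actual bit positions in the kernel may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "real" state bits that would be stored in the tree. */
#define DEMO_EXTENT_DIRTY  (1U << 0)
#define DEMO_EXTENT_LOCKED (1U << 1)

/* Hypothetical artificial bit: a request modifier, never stored. */
#define DEMO_EXTENT_NOWAIT (1U << 31)

/* Stand-in for the two allocation masks mentioned in the commit. */
enum demo_gfp { DEMO_GFP_NOFS, DEMO_GFP_NOWAIT };

/* The modifier must not collide with any real state bit. */
static_assert((DEMO_EXTENT_NOWAIT &
               (DEMO_EXTENT_DIRTY | DEMO_EXTENT_LOCKED)) == 0,
              "artificial bit overlaps a real bit");

/*
 * Lowest-level helper: derive the allocation mask from the artificial
 * bit and return only the real bits, so callers no longer need to pass
 * a separate mask parameter.
 */
static uint32_t demo_split_bits(uint32_t bits, enum demo_gfp *mask)
{
    *mask = (bits & DEMO_EXTENT_NOWAIT) ? DEMO_GFP_NOWAIT : DEMO_GFP_NOFS;
    return bits & ~DEMO_EXTENT_NOWAIT;
}
```

A caller that previously passed a mask argument would instead OR `DEMO_EXTENT_NOWAIT` into the bits in the few places that need non-blocking allocation, and pass plain bits everywhere else.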
David Sterba	7dde7a8ab3	btrfs: drop NOFAIL from set_extent_bit allocation masks The __GFP_NOFAIL passed to set_extent_bit first appeared in 2010 (commit `f0486c68e4` ("Btrfs: Introduce contexts for metadata reservation")), without any explanation why it would be needed. Meanwhile we've updated the semantics of set_extent_bit to handle failed allocations and do unlock, sleep and retry if needed. The use of the NOFAIL flag is also an outlier, we never want any of the set/clear extent bit helpers to fail, they're used for many critical changes like extent locking, besides the extent state bit changes. Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
David Sterba	0acd32c294	btrfs: open code set_extent_bits This helper calls set_extent_bit with two more parameters set to default values, but otherwise its purpose is not clear. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
David Sterba	e85de967bc	btrfs: open code set_extent_bits_nowait The helper only passes GFP_NOWAIT as gfp flags and is used two times. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
David Sterba	fe1a598c42	btrfs: open code set_extent_dirty The helper is used a few times, that it's setting the DIRTY extent bit is still clear. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
David Sterba	eea8686e68	btrfs: open code set_extent_new The helper is used only once. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
David Sterba	66240ab115	btrfs: open code set_extent_delalloc The helper is used once in fs code and a few times in the self test code. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
David Sterba	dc5646c15c	btrfs: open code set_extent_defrag The helper is used only once. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Christoph Hellwig	25ac047c9d	btrfs: remove a pointless NULL check in btrfs_lookup_fs_root btrfs_grab_root already checks for a NULL root itself. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Christoph Hellwig	e91909aace	btrfs: convert btrfs_get_global_root to use a switch statement Use a switch statement instead of an endless chain of if statements to make the code a little cleaner. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Christoph Hellwig	85724171b3	btrfs: fix the btrfs_get_global_root return value btrfs_grab_root returns either the root or NULL, and the callers of btrfs_get_global_root expect it to return the same. But all the more recently added roots instead return an ERR_PTR, so fix this. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Anand Jain	d85512d54e	btrfs: add and fix comments in btrfs_fs_devices Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Anand Jain	25984a5ae8	btrfs: consolidate uuid comparisons in btrfs_validate_super There are three ways the fsid is validated in btrfs_validate_super(): - verify that super_copy::fsid is the same as fs_devices::fsid - if the metadata_uuid flag is set, verify if super_copy::metadata_uuid and fs_devices::metadata_uuid are the same. - a few lines below, often missed out, verify if dev_item::fsid is the same as fs_devices::metadata_uuid. The function btrfs_validate_super() contains multiple if-statements with memcmp() to check UUIDs. This patch consolidates them into a single location. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Anand Jain	a3c54b0be1	btrfs: simplify how changed fsid and metadata_uuid is checked We often check if the metadata_uuid is not the same as fsid, and then we check if the given fsid matches the metadata_uuid. This patch refactors this logic into function match_fsid_changed and uses it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Anand Jain	1a89834500	btrfs: simplify fsid and metadata_uuid comparisons Refactor the functions find_fsid() and find_fsid_with_metadata_uuid(), as they currently share a common set of code to compare the fsid and metadata_uuid. Create a common helper function, match_fsid_fs_devices(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Anand Jain	413fb1bc1d	btrfs: return bool from check_tree_block_fsid instead of int Simplify the return type of check_tree_block_fsid() from int (1 or 0) to bool. Its only user is interested in knowing the success or failure. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Anand Jain	f62c302e6d	btrfs: add comment about metadata_uuid in btrfs_fs_devices Add comment about metadata_uuid in btrfs_fs_devices. No functional change. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Anand Jain	c6930d7d11	btrfs: merge calls to alloc_fs_devices in device_list_add Simplify has_metadata_uuid checks - by localizing the has_metadata_uuid checked within alloc_fs_devices()'s second argument, it improves the code readability. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Anand Jain	19c4c49ca9	btrfs: streamline fsid checks in alloc_fs_devices We currently have redundant checks for the non-null value of fsid, so simplify them. Also, no one is using alloc_fs_devices() with a NULL metadata_uuid while fsid is not NULL, so add an assert() to verify this condition. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:28 +02:00
Anand Jain	4693893bf8	btrfs: reduce struct btrfs_fs_devices size by moving fsid_change Pack bool fsid_change and bool seeding with other bool declarations in the struct btrfs_fs_devices, approximately 6 bytes is saved, depending on the config. before: 512 bytes after: 496 bytes Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:28 +02:00
Christoph Hellwig	46672a44b0	btrfs: merge write_one_subpage_eb into write_one_eb Most of the code in write_one_subpage_eb and write_one_eb is shared, so merge the two functions into one. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:28 +02:00
Christoph Hellwig	d7172f52e9	btrfs: use per-buffer locking for extent_buffer reading Instead of locking and unlocking every page or the extent, just add a new EXTENT_BUFFER_READING bit that mirrors EXTENT_BUFFER_WRITEBACK for synchronizing threads trying to read an extent_buffer and to wait for I/O completion. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:28 +02:00
Christoph Hellwig	9e2aff90fc	btrfs: stop using lock_extent in btrfs_buffer_uptodate The only other place that locks extents on the btree inode is read_extent_buffer_subpage while reading in the partial page for a buffer. This means locking the extent in btrfs_buffer_uptodate does not synchronize with anything on non-subpage file systems, and on subpage file systems it only waits for a parallel read(-ahead) to finish, which seems to be counter to what the callers actually expect. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:28 +02:00
Christoph Hellwig	f3d315eb93	btrfs: don't check for uptodate pages in read_extent_buffer_pages The only place that reads in pages and thus marks them uptodate for the btree inode is read_extent_buffer_pages. This means that either pages are already uptodate from an old buffer when creating a new one in alloc_extent_buffer, or they will be updated by a call to read_extent_buffer_pages. This means the checks for uptodate pages in read_extent_buffer_pages and read_extent_buffer_subpage are superfluous and can be removed. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:28 +02:00
Christoph Hellwig	011134f444	btrfs: stop using PageError for extent_buffers PageError is only used to limit the uptodate check in assert_eb_page_uptodate. But we have a much more useful flag indicating the exact condition we care about with the EXTENT_BUFFER_WRITE_ERR flag, so use that instead and help the kernel toward eventually removing PageError. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:28 +02:00
Christoph Hellwig	113fa05c2f	btrfs: remove the io_pages field in struct extent_buffer No need to track the number of pages under I/O now that each extent_buffer is read and written using a single bio. For the read side we need to grab an extra reference for the duration of the I/O to prevent eviction, though. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:28 +02:00
Christoph Hellwig	31d89399da	btrfs: remove the extent_buffer lookup in btree block checksumming The checksumming of btree blocks always operates on the entire extent_buffer, and because btree blocks are always allocated contiguously on disk they are never split by btrfs_submit_bio. Simplify the checksumming code by finding the extent_buffer in the btrfs_bio private data instead of trying to search through the bio_vec. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:28 +02:00
Christoph Hellwig	cd88a4fdbf	btrfs: use a separate end_io handler for extent_buffer writing Now that we always use a single bio to write an extent_buffer, the buffer can be passed to the end_io handler as private data. This allows simplifying the metadata write end I/O handler, and merging the subpage end_io handler into the main one. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:28 +02:00
Christoph Hellwig	b51e6b4bda	btrfs: don't use btrfs_bio_ctrl for extent buffer writing The btrfs_bio_ctrl machinery is overkill for writing extent_buffers as we always operate on PAGE_SIZE chunks (or one smaller one for the subpage case) that are contiguous and are guaranteed to fit into a single bio. Replace it with open coded btrfs_bio_alloc, __bio_add_page and btrfs_submit_bio calls. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:28 +02:00
Christoph Hellwig	81a79b6ae4	btrfs: move page locking from lock_extent_buffer_for_io to write_one_eb Locking the pages in lock_extent_buffer_for_io only for the non-subpage case is very confusing. Move it to write_one_eb to mirror the subpage case and simplify the code. Now lock_extent_buffer_for_io does not leave all the pages locked and each is individually locked/unlocked in write_one_eb. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:28 +02:00
Christoph Hellwig	50b21d7a06	btrfs: submit a writeback bio per extent_buffer Stop trying to cluster writes of multiple extent_buffers into a single bio. There is no need for that as the blk_plug mechanism used all the way up in writeback_inodes_wb gives us the same I/O pattern even with multiple bios. Removing the clustering simplifies lock_extent_buffer_for_io a lot and will also allow passing the eb as private data to the end I/O handler. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:27 +02:00
Christoph Hellwig	9fdd160160	btrfs: return bool from lock_extent_buffer_for_io lock_extent_buffer_for_io never returns a negative error value, so switch the return value to a simple bool. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ keep noinline_for_stack ] Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:27 +02:00
Christoph Hellwig	3d66b4b27d	btrfs: do not try to unlock the extent for non-subpage metadata reads Only subpage metadata reads lock the extent. Don't try to unlock it and waste cycles in the extent tree lookup for PAGE_SIZE or larger metadata. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:27 +02:00
Christoph Hellwig	046b562b20	btrfs: use a separate end_io handler for read_extent_buffer Now that we always use a single bio to read an extent_buffer, the buffer can be passed to the end_io handler as private data. This allows implementing a much simplified dedicated end I/O handler for metadata reads. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:27 +02:00
Christoph Hellwig	e194931076	btrfs: remove the mirror_num argument to btrfs_submit_compressed_read Given that read recovery for data I/O is handled in the storage layer, the mirror_num argument to btrfs_submit_compressed_read is always 0, so remove it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:27 +02:00
Christoph Hellwig	b78b98e06f	btrfs: don't use btrfs_bio_ctrl for extent buffer reading The btrfs_bio_ctrl machinery is overkill for reading extent_buffers as we always operate on PAGE_SIZE chunks (or one smaller one for the subpage case) that are contiguous and are guaranteed to fit into a single bio. Replace it with open coded btrfs_bio_alloc, __bio_add_page and btrfs_submit_bio calls in a helper function shared between the subpage and node size >= PAGE_SIZE cases. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:27 +02:00
Christoph Hellwig	e95382834c	btrfs: always read the entire extent_buffer Currently read_extent_buffer_pages skips pages that are already uptodate when reading in an extent_buffer. While this reduces the amount of data read, it increases the number of I/O operations as we now need to do multiple I/Os when reading an extent buffer with one or more uptodate pages in the middle of it. On any modern storage device, be that hard drives or SSDs, this actually decreases I/O performance. Fortunately this case is pretty rare as the pages are always initially read together and then aged the same way. Besides simplifying the code a bit as-is, this will allow for major simplifications to the I/O completion handler later on. Note that the case where all pages are uptodate is still handled by an optimized fast path that does not read any data from disk. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:27 +02:00
Christoph Hellwig	d87e6575e9	btrfs: merge verify_parent_transid and btrfs_buffer_uptodate verify_parent_transid is only called by btrfs_buffer_uptodate, which confusingly inverts the return value. Merge the two functions and reflow the parent_transid so that error handling is in a branch. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:27 +02:00
Christoph Hellwig	aebcc1596b	btrfs: move setting the buffer uptodate out of validate_extent_buffer Setting the buffer uptodate in a function that is named as a validation helper is a bit confusing. Move the call from validate_extent_buffer to the one of its two callers that didn't already have a duplicate call to set_extent_buffer_uptodate. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:27 +02:00
Christoph Hellwig	243984b3b9	btrfs: subpage: fix error handling in end_bio_subpage_eb_writepage Call btrfs_page_clear_uptodate instead of ClearPageUptodate to properly manage the uptodate bit for the subpage case. Reported-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:27 +02:00
Christoph Hellwig	7f26fb1c13	btrfs: mark extent_buffer_under_io static extent_buffer_under_io is only used in extent_io.c, so mark it static. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:27 +02:00
Qu Wenruo	edc728814f	btrfs: trigger orphan inode cleanup during START_SYNC ioctl There is an internal error report that scrub found an error in an orphan inode's data. However there are very limited ways to clean up such orphan inodes: - btrfs_start_pre_rw_mount() This happens at either mount, or RO->RW switch. This is not a viable solution for root fs which may not be unmounted or RO mounted. Furthermore this doesn't cover every subvolume, it only covers the currently cached subvolumes. - btrfs_lookup_dentry() This happens when we first lookup the subvolume dentry. But the dentry can be cached, thus it's not guaranteed to be triggered every time. - create_snapshot() This only happens for the created snapshot, not the source one. This means if we didn't trigger orphan items cleanup, there is really no other way to manually trigger it. Add this step to the START_SYNC ioctl. This is a slight change in the semantics of the ioctl, but it should be acceptable as sync can be potentially slow and is usually paired with the WAIT_SYNC ioctl. The errors are not handled because the main point of the ioctl is the async commit, orphan cleanup is a side effect. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:26 +02:00
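
The START_SYNC / WAIT_SYNC pairing mentioned in commit edc728814f can be sketched from user space as follows. This is only a minimal illustration, assuming the uapi header linux/btrfs.h is installed; the helper name start_and_wait_sync is hypothetical and is not part of the kernel or btrfs-progs. Per the commit message, any orphan inode cleanup done by START_SYNC is a side effect and its errors are not reported to the caller, so the return value below only reflects the open and the two ioctl calls.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

/*
 * Hypothetical helper (an assumption for illustration): start an
 * asynchronous commit on the btrfs filesystem that contains `path`,
 * then wait for that transaction to be committed.
 * Returns 0 on success or a negative errno value on failure.
 */
int start_and_wait_sync(const char *path)
{
	__u64 transid = 0;
	int ret = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -errno;

	/* Kick off the async commit; the kernel fills in the transaction id. */
	if (ioctl(fd, BTRFS_IOC_START_SYNC, &transid) < 0) {
		ret = -errno;
		goto out;
	}

	/* Block until the transaction with that id has been committed. */
	if (ioctl(fd, BTRFS_IOC_WAIT_SYNC, &transid) < 0)
		ret = -errno;
out:
	close(fd);
	return ret;
}
```

A caller would pass any path on a mounted btrfs filesystem, for example start_and_wait_sync("/mnt"), where "/mnt" is a placeholder. On a path that is not on btrfs the first ioctl is expected to fail and the helper returns the corresponding negative errno.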

12135 Commits