linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-16 00:52:01 +00:00

Author	SHA1	Message	Date
Boris Burkov	95c8e349d8	btrfs: warn on invalid slot in tree mod log rewind The way that tree mod log tracks the ultimate length of the eb, the variable 'n', eventually turns up the correct value, but at intermediate steps during the rewind, n can be inaccurate as a representation of the end of the eb. For example, it doesn't get updated on move rewinds, and it does get updated for add/remove in the middle of the eb. To detect cases with invalid moves, introduce a separate variable called max_slot which tries to track the maximum valid slot in the rewind eb. We can then warn if we do a move whose src range goes beyond the max valid slot. There is a commented caveat that it is possible to have this value be an overestimate due to the challenge of properly handling 'add' operations in the middle of the eb, but in practice it doesn't cause enough of a problem to throw out the max idea in favor of tracking every valid slot. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:34 +02:00
David Sterba	8ab546bb30	btrfs: disable allocation warnings for compression workspaces The workspaces for compression are typically much larger than a page and for high zstd levels in the range of megabytes. There's a fallback to vmalloc but this can still fail (see the report). Some of the workspaces are preallocated at module load time so we have a safe fallback, otherwise when a new workspace is needed it's allocated but if this fails then the process waits. Which means the warning is only causing noise and we can use the GFP flag to disable it. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=217466 Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:34 +02:00
Christoph Hellwig	8680e58761	btrfs: open code need_full_stripe conditions need_full_stripe is just a somewhat complicated way to say "op != BTRFS_MAP_READ". Just spell that explicit check out, which makes a lot of the code currently using the helper easier to understand. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:34 +02:00
Christoph Hellwig	723b8bb17e	btrfs: open code btrfs_map_sblock btrfs_map_sblock just hard codes three arguments and calls btrfs_map_sblock. Remove it as it doesn't provide any real value, but makes following the btrfs_map_block call chains harder. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:34 +02:00
Christoph Hellwig	cd4efd210e	btrfs: rename __btrfs_map_block to btrfs_map_block Now that the old btrfs_map_block is gone, drop the leading underscores from __btrfs_map_block. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:34 +02:00
Christoph Hellwig	d69d7ffc26	btrfs: remove unused btrfs_map_block There are no users of btrfs_map_block left, so remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:34 +02:00
Christoph Hellwig	78a213a05d	btrfs: optimize simple reads in btrfsic_map_block Pass a smap into __btrfs_map_block so that the usual case of a read that doesn't require parity raid recovery doesn't need an extra memory allocation for the btrfs_io_context. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:33 +02:00
Christoph Hellwig	3965a4c793	btrfs: remove unused BTRFS_MAP_DISCARD BTRFS_MAP_DISCARD is never set, as REQ_OP_DISCARD is never passed to btrfs_op() only only checked in two ASSERTS. Remove it and let the catchall WARN_ON in btrfs_op() deal with accidental REQ_OP_DISCARDs leaked into btrfs_op(). Last use was in `a4012f06f1` ("btrfs: split discard handling out of btrfs_map_block"). Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:33 +02:00
David Sterba	efcfcbc6a3	btrfs: add xxhash to fast checksum implementations The implementation of XXHASH is now CPU only but still fast enough to be considered for the synchronous checksumming, like non-generic crc32c. A userspace benchmark comparing it to various implementations (patched hash-speedtest from btrfs-progs): Block size: 4096 Iterations: 1000000 Implementation: builtin Units: CPU cycles NULL-NOP: cycles: 73384294, cycles/i 73 NULL-MEMCPY: cycles: 228033868, cycles/i 228, 61664.320 MiB/s CRC32C-ref: cycles: 24758559416, cycles/i 24758, 567.950 MiB/s CRC32C-NI: cycles: 1194350470, cycles/i 1194, 11773.433 MiB/s CRC32C-ADLERSW: cycles: 6150186216, cycles/i 6150, 2286.372 MiB/s CRC32C-ADLERHW: cycles: 626979180, cycles/i 626, 22427.453 MiB/s CRC32C-PCL: cycles: 466746732, cycles/i 466, 30126.699 MiB/s XXHASH: cycles: 860656400, cycles/i 860, 16338.188 MiB/s Comparing purely software implementation (ref), current outdated accelerated using crc32q instruction (NI), optimized implementations by M. Adler (https://stackoverflow.com/questions/17645167/implementing-sse-4-2s-crc32c-in-software/17646775#17646775) and the best one that was taken from kernel using the PCLMULQDQ instruction (PCL). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:33 +02:00
Christoph Hellwig	f000bc6fe4	btrfs: pass the new logical address to split_extent_map split_extent_map splits off the first chunk of an extent map into a new one. One of the two users is the zoned I/O completion code that wants to rewrite the logical block start address right after this split. Pass in the logical address to be set in the split off first extent_map as an argument to avoid an extra extent tree lookup for this case. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:33 +02:00
Christoph Hellwig	71df088c1c	btrfs: defer splitting of ordered extents until I/O completion The btrfs zoned completion code currently needs an ordered_extent and extent_map per bio so that it can account for the non-predictable write location from Zone Append. To archive that it currently splits the ordered_extent and extent_map at I/O submission time, and then records the actual physical address in the ->physical field of the ordered_extent. This patch instead switches to record the "original" physical address that the btrfs allocator assigned in spare space in the btrfs_bio, and then rewrites the logical address in the btrfs_ordered_sum structure at I/O completion time. This allows the ordered extent completion handler to simply walk the list of ordered csums and split the ordered extent as needed. This removes an extra ordered extent and extent_map lookup and manipulation during the I/O submission path, and instead batches it in the I/O completion path where we need to touch these anyway. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:33 +02:00
Christoph Hellwig	52b1fdca23	btrfs: handle completed ordered extents in btrfs_split_ordered_extent To delay splitting ordered_extents to I/O completion time we need to be able to handle fully completed ordered extents in btrfs_split_ordered_extent. Besides a bit of accounting this primarily involved moving over the csums to the split bio for the range that it covers, which is simple enough because we always have one btrfs_ordered_sum per bio. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:33 +02:00
Christoph Hellwig	816f589b8d	btrfs: atomically insert the new extent in btrfs_split_ordered_extent Currently there is a small race window in btrfs_split_ordered_extent, where the reduced old extent can be looked up on the per-inode rbtree or the per-root list while the newly split out one isn't visible yet. Fix this by open coding btrfs_alloc_ordered_extent in btrfs_split_ordered_extent, and holding the tree lock and root->ordered_extent_lock over the entire tree and extent manipulation. Note that this introduces new lock ordering because previously ordered_extent_lock was never held over the tree lock. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:33 +02:00
Christoph Hellwig	53d9981ca2	btrfs: split btrfs_alloc_ordered_extent to allocation and insertion helpers Split two low-level helpers out of btrfs_alloc_ordered_extent to allocate and insert the logic extent. The pure alloc helper will be used to improve btrfs_split_ordered_extent. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:33 +02:00
Christoph Hellwig	b0307e2864	btrfs: return the new ordered_extent from btrfs_split_ordered_extent Return the ordered_extent split from the passed in one. This will be needed to be able to store an ordered_extent in the btrfs_bio. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:33 +02:00
Christoph Hellwig	ebdb44a00e	btrfs: reorder conditions in btrfs_extract_ordered_extent There is no good reason for doing one before the other in terms of failure implications, but doing the extent_map split first will simplify some upcoming refactoring. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:33 +02:00
Christoph Hellwig	a6f3e205e4	btrfs: move split_extent_map to extent_map.c split_extent_map doesn't have anything to do with the other code in inode.c, so move it to extent_map.c. This also allows marking replace_extent_mapping static. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:33 +02:00
Christoph Hellwig	3887653c44	btrfs: record orig_physical only for the original bio btrfs_submit_dev_bio is also called for clone bios that aren't embedded into a btrfs_bio structure, but previous commit "btrfs: optimize the logical to physical mapping for zoned writes" added code to assign btrfs_bio.orig_physical in it. This is harmless right now as only the single data profile can be used on zoned devices, but will blow up when the RAID stripe tree is added. Move it out into the single I/O specific branch in the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:32 +02:00
Christoph Hellwig	cbfce4c7fb	btrfs: optimize the logical to physical mapping for zoned writes The current code to store the final logical to physical mapping for a zone append write in the extent tree is rather inefficient. It first has to split the ordered extent so that there is one ordered extent per bio, so that it can look up the ordered extent on I/O completion in btrfs_record_physical_zoned and store the physical LBA returned by the block driver in the ordered extent. btrfs_rewrite_logical_zoned then has to do a lookup in the chunk tree to see what physical address the logical address for this bio / ordered extent is mapped to, and then rewrite it in the extent tree. To optimize this process, we can store the physical address assigned in the chunk tree to the original logical address and a pointer to btrfs_ordered_sum structure the in the btrfs_bio structure, and then use this information to rewrite the logical address in the btrfs_ordered_sum structure directly at I/O completion time in btrfs_record_physical_zoned. btrfs_rewrite_logical_zoned then simply updates the logical address in the extent tree and the ordered_extent itself. The code in btrfs_rewrite_logical_zoned now runs for all data I/O completions in zoned file systems, which is fine as there is no remapping to do for non-append writes to conventional zones or for relocation, and the overhead for quickly breaking out of the loop is very low. Because zoned file systems now need the ordered_sums structure to record the actual write location returned by zone append, allocate dummy structures without the csum array for them when the I/O doesn't use checksums, and free them when completing the ordered_extent. Note that the btrfs_bio doesn't grow as the new field are places into a union that is so far not used for data writes and has plenty of space left in it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:32 +02:00
Christoph Hellwig	5cfe76f846	btrfs: rename the bytenr field in struct btrfs_ordered_sum to logical btrfs_ordered_sum::bytendr stores a logical address. Make that clear by renaming it to ->logical. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:32 +02:00
Christoph Hellwig	6e4b2479ab	btrfs: mark the len field in struct btrfs_ordered_sum as unsigned len can't ever be negative, so mark it as an u32 instead of int. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:32 +02:00
Christoph Hellwig	e9cb93b9fb	btrfs: don't call btrfs_record_physical_zoned for failed append When a zoned append command fails there is no written address reported, so don't try to record it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:32 +02:00
Christoph Hellwig	dd8b7b0416	btrfs: optimize out btrfs_is_zoned for !CONFIG_BLK_DEV_ZONED Add an IS_ENABLED check for CONFIG_BLK_DEV_ZONED in addition to the run-time check for the zone size. This will allow to make use of compiler dead code elimination for code guarded by btrfs_is_zoned, and for example provide just a dangling prototype for a function instead of adding a stub. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:32 +02:00
Filipe Manana	99f09ce309	btrfs: make btrfs_destroy_delayed_refs() return void btrfs_destroy_delayed_refs() always returns 0 and its single caller does not check its return value, as it also returns void, and so does the callers' caller and so on. This is because we are in the transaction abort path, where we have no way to deal with errors (we are in a critical situation) and all cleanup of resources works in a best effort fashion. So make btrfs_destroy_delayed_refs() return void. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:32 +02:00
Filipe Manana	184533e361	btrfs: remove unnecessary prototype declarations at disk-io.c We have a few static functions at disk-io.c for which we have a forward declaration of their prototype, but it's not needed because all those functions are defined before they are called, so remove them. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:32 +02:00
Filipe Manana	f1ed785a5b	btrfs: use a single switch statement when initializing delayed ref head At init_delayed_ref_head(), we are using two separate if statements to check the delayed ref head action, and initializing 'must_insert_reserved' to false twice, once when the variable is declared and once again in an else branch. Make this simpler and more straightforward by having a single switch statement, also moving the comment about a drop action to the corresponding switch case to make it more clear and eliminating the duplicated initialization of 'must_insert_reserved' to false. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:32 +02:00
Filipe Manana	61c681fef7	btrfs: use bool type for delayed ref head fields that are used as booleans There's no point in have several fields defined as 1 bit unsigned int in struct btrfs_delayed_ref_head, we can instead use a bool type, it makes the code a bit more readable and it doesn't change the structure size. So switch them to proper booleans. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:32 +02:00
Filipe Manana	1e6b71c34b	btrfs: assert correct lock is held at btrfs_select_ref_head() The function btrfs_select_ref_head() iterates over the red black tree of delayed reference heads, which is protected by the spinlock in the delayed refs root. The function doesn't take the lock, it's taken by its single caller, btrfs_obtain_ref_head(), because it needs to call that function and btrfs_delayed_ref_lock() in the same critical section (delimited by that spinlock). So assert at btrfs_select_ref_head() that we are holding the expected lock. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:31 +02:00
Filipe Manana	798f4d95db	btrfs: get rid of label and goto at insert_delayed_ref() At insert_delayed_ref() there's no point of having a label and goto in the case we were able to insert the delayed ref head. We can just add the code under label to the if statement's body and return immediately, and also there is no need to track the return value in a variable, we can just return a literal true or false value directly. So do those changes. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:31 +02:00
Filipe Manana	f38462c447	btrfs: make insert_delayed_ref() return a bool instead of an int insert_delayed_ref() can only return 0 or 1, to indicate if the given delayed reference was added to the head reference or if it was merged into an existing delayed ref, respectively. So just make it return a boolean instead. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:31 +02:00
Filipe Manana	293f8197a4	btrfs: use a bool to track qgroup record insertion when adding ref head We are using an integer as a boolean to track the qgroup record insertion status when adding a delayed reference head. Since all we need is a boolean, switch the type from int to bool to make it more obvious. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:31 +02:00
Filipe Manana	4d34ad34d7	btrfs: remove pointless in_tree field from struct btrfs_delayed_ref_node The 'in_tree' field is really not needed in struct btrfs_delayed_ref_node, as we can check whether a reference is in the tree or not simply by checking its red black tree node member with RB_EMPTY_NODE(), as when we remove it from the tree we always call RB_CLEAR_NODE(). So remove that field and use RB_EMPTY_NODE(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:31 +02:00
Filipe Manana	53499d5f6b	btrfs: remove unused is_head field from struct btrfs_delayed_ref_node The 'is_head' field of struct btrfs_delayed_ref_node is no longer after commit `d278850eff` ("btrfs: remove delayed_ref_node from ref_head"), so remove it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:31 +02:00
Filipe Manana	315dd5cc75	btrfs: reorder some members of struct btrfs_delayed_ref_head Currently struct delayed_ref_head has its 'bytenr' and 'href_node' members in different cache lines (even on a release, non-debug, kernel). This is not optimal because when iterating the red black tree of delayed ref heads for inserting a new delayed ref head (htree_insert()) we have to pull in 2 cache lines of delayed ref heads we find in a patch, one for the tree node (struct rb_node) and another one for the 'bytenr' field. The same applies when searching for an existing delayed ref head (find_ref_head()). On a release (non-debug) kernel, the structure also has two 4 bytes holes, which makes it 8 bytes longer than necessary. Its current layout is the following: struct btrfs_delayed_ref_head { u64 bytenr; /* 0 8 / u64 num_bytes; / 8 8 / refcount_t refs; / 16 4 / / XXX 4 bytes hole, try to pack / struct mutex mutex; / 24 32 / spinlock_t lock; / 56 4 / / XXX 4 bytes hole, try to pack / / --- cacheline 1 boundary (64 bytes) --- / struct rb_root_cached ref_tree; / 64 16 / struct list_head ref_add_list; / 80 16 / struct rb_node href_node __attribute__((__aligned__(8))); / 96 24 / struct btrfs_delayed_extent_op extent_op; /* 120 8 / / --- cacheline 2 boundary (128 bytes) --- / int total_ref_mod; / 128 4 / int ref_mod; / 132 4 / unsigned int must_insert_reserved:1; / 136: 0 4 / unsigned int is_data:1; / 136: 1 4 / unsigned int is_system:1; / 136: 2 4 / unsigned int processing:1; / 136: 3 4 / / size: 144, cachelines: 3, members: 15 / / sum members: 128, holes: 2, sum holes: 8 / / sum bitfield members: 4 bits (0 bytes) / / padding: 4 / / bit_padding: 28 bits / / forced alignments: 1 / / last cacheline: 16 bytes / } __attribute__((__aligned__(8))); This change reorders the 'href_node' and 'refs' members so that we have the 'href_node' in the same cache line as the 'bytenr' field, while also eliminating the two holes and reducing the structure size from 144 bytes down to 136 bytes, so we can now have 30 ref heads per 4K page (on x86_64) instead of 28. The new structure layout after this change is now: struct btrfs_delayed_ref_head { u64 bytenr; / 0 8 / u64 num_bytes; / 8 8 / struct rb_node href_node __attribute__((__aligned__(8))); / 16 24 / struct mutex mutex; / 40 32 / / --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- / refcount_t refs; / 72 4 / spinlock_t lock; / 76 4 / struct rb_root_cached ref_tree; / 80 16 / struct list_head ref_add_list; / 96 16 / struct btrfs_delayed_extent_op extent_op; /* 112 8 / int total_ref_mod; / 120 4 / int ref_mod; / 124 4 / / --- cacheline 2 boundary (128 bytes) --- / unsigned int must_insert_reserved:1; / 128: 0 4 / unsigned int is_data:1; / 128: 1 4 / unsigned int is_system:1; / 128: 2 4 / unsigned int processing:1; / 128: 3 4 / / size: 136, cachelines: 3, members: 15 / / padding: 4 / / bit_padding: 28 bits / / forced alignments: 1 / / last cacheline: 8 bytes / } __attribute__((__aligned__(8))); Running the following fs_mark test shows some significant improvement. $ cat test.sh #!/bin/bash # 15G null block device DEV=/dev/nullb0 MNT=/mnt/nullb0 FILES=100000 THREADS=$(nproc --all) FILE_SIZE=0 echo "performance" \| \ tee /sys/devices/system/cpu/cpu/cpufreq/scaling_governor mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t $THREADS -k" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT Before this change: FSUse% Count Size Files/sec App Overhead 10 1200000 0 112631.3 11928055 16 2400000 0 189943.8 12140777 23 3600000 0 150719.2 13178480 50 4800000 0 99137.3 12504293 53 6000000 0 111733.9 12670836 Total files/sec: 664165.5 After this change: FSUse% Count Size Files/sec App Overhead 10 1200000 0 148589.5 11565889 16 2400000 0 227743.8 11561596 23 3600000 0 191590.5 12550755 30 4800000 0 179812.3 12629610 53 6000000 0 92471.4 12352383 Total files/sec: 840207.5 Measuring the execution times of htree_insert(), in nanoseconds, during those fs_mark runs: Before this change: Range: 0.000 - 940647.000; Mean: 619.733; Median: 548.000; Stddev: 1834.231 Percentiles: 90th: 980.000; 95th: 1208.000; 99th: 2090.000 0.000 - 6.384: 257 \| 6.384 - 26.259: 977 \| 26.259 - 99.635: 4963 \| 99.635 - 370.526: 136800 ############# 370.526 - 1370.603: 566110 ##################################################### 1370.603 - 5062.704: 24945 ## 5062.704 - 18693.248: 944 \| 18693.248 - 69014.670: 211 \| 69014.670 - 254791.959: 30 \| 254791.959 - 940647.000: 4 \| After this change: Range: 0.000 - 299200.000; Mean: 587.754; Median: 542.000; Stddev: 1030.422 Percentiles: 90th: 918.000; 95th: 1113.000; 99th: 1987.000 0.000 - 5.585: 163 \| 5.585 - 20.678: 452 \| 20.678 - 70.369: 1806 \| 70.369 - 233.965: 26268 #### 233.965 - 772.564: 333519 ##################################################### 772.564 - 2545.771: 91820 ############### 2545.771 - 8383.615: 2238 \| 8383.615 - 27603.280: 170 \| 27603.280 - 90879.297: 68 \| 90879.297 - 299200.000: 12 \| Mean, percentiles, maximum times are all better, as well as a lower standard deviation. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:31 +02:00
Qu Wenruo	31dd8c81dd	btrfs: use the same uptodate variable for end_bio_extent_readpage() In function end_bio_extent_readpage() we call endio_readpage_release_extent() to unlock the extent io tree. However we pass PageUptodate(page) as @uptodate parameter for it, while for previous end_page_read() call, we use a dedicated @uptodate local variable. This is not a big deal, as even for subpage cases, either the bio only covers part of the page, then the @uptodate is always false, and the subpage ranges can still be merged. But for the sake of consistency, always use @uptodate variable when possible. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:31 +02:00
Qu Wenruo	5a96341927	btrfs: subpage: make alloc_extent_buffer() handle previously uptodate range efficiently Currently alloc_extent_buffer() would make the extent buffer uptodate if the corresponding pages are also uptodate. But this check is only checking PageUptodate, which is fine for regular cases, but not for subpage cases, as we can have multiple extent buffers in the same page. So here we go btrfs_page_test_uptodate() instead. The old code doesn't cause any problem, but is not efficient, as it would cause extra metadata read even if the range is already uptodate. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:31 +02:00
David Sterba	b831306b3b	btrfs: print assertion failure report and stack trace from the same line Assertions reports are split into two parts, the exact file and location of the condition and then the stack trace printed from btrfs_assertfail(). This means all the stack traces report the same line and this is what's typically reported by various tools, making it harder to distinguish the reports. [403.2467] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4259 [403.2479] ------------[ cut here ]------------ [403.2484] kernel BUG at fs/btrfs/messages.c:259! [403.2488] invalid opcode: 0000 [#1] PREEMPT SMP KASAN [403.2493] CPU: 2 PID: 23202 Comm: umount Not tainted 6.2.0-rc4-default+ #67 [403.2499] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014 [403.2509] RIP: 0010:btrfs_assertfail+0x19/0x1b [btrfs] ... [403.2595] Call Trace: [403.2598] <TASK> [403.2601] btrfs_free_block_groups.cold+0x52/0xae [btrfs] [403.2608] close_ctree+0x6c2/0x761 [btrfs] [403.2613] ? __wait_for_common+0x2b8/0x360 [403.2618] ? btrfs_cleanup_one_transaction.cold+0x7a/0x7a [btrfs] [403.2626] ? mark_held_locks+0x6b/0x90 [403.2630] ? lockdep_hardirqs_on_prepare+0x13d/0x200 [403.2636] ? __call_rcu_common.constprop.0+0x1ea/0x3d0 [403.2642] ? trace_hardirqs_on+0x2d/0x110 [403.2646] ? __call_rcu_common.constprop.0+0x1ea/0x3d0 [403.2652] generic_shutdown_super+0xb0/0x1c0 [403.2657] kill_anon_super+0x1e/0x40 [403.2662] btrfs_kill_super+0x25/0x30 [btrfs] [403.2668] deactivate_locked_super+0x4c/0xc0 By making btrfs_assertfail a macro we'll get the same line number for the BUG output: [63.5736] assertion failed: 0, in fs/btrfs/super.c:1572 [63.5758] ------------[ cut here ]------------ [63.5782] kernel BUG at fs/btrfs/super.c:1572! [63.5807] invalid opcode: 0000 [#2] PREEMPT SMP KASAN [63.5831] CPU: 0 PID: 859 Comm: mount Tainted: G D 6.3.0-rc7-default+ #2062 [63.5868] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014 [63.5905] RIP: 0010:btrfs_mount+0x24/0x30 [btrfs] [63.5964] RSP: 0018:ffff88800e69fcd8 EFLAGS: 00010246 [63.5982] RAX: 000000000000002d RBX: ffff888008fc1400 RCX: 0000000000000000 [63.6004] RDX: 0000000000000000 RSI: ffffffffb90fd868 RDI: ffffffffbcc3ff20 [63.6026] RBP: ffffffffc081b200 R08: 0000000000000001 R09: ffff88800e69fa27 [63.6046] R10: ffffed1001cd3f44 R11: 0000000000000001 R12: ffff888005a3c370 [63.6062] R13: ffffffffc058e830 R14: 0000000000000000 R15: 00000000ffffffff [63.6081] FS: 00007f7b3561f800(0000) GS:ffff88806c600000(0000) knlGS:0000000000000000 [63.6105] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [63.6120] CR2: 00007fff83726e10 CR3: 0000000002a9e000 CR4: 00000000000006b0 [63.6137] Call Trace: [63.6143] <TASK> [63.6148] legacy_get_tree+0x80/0xd0 [63.6158] vfs_get_tree+0x43/0x120 [63.6166] do_new_mount+0x1f3/0x3d0 [63.6176] ? do_add_mount+0x140/0x140 [63.6187] ? cap_capable+0xa4/0xe0 [63.6197] path_mount+0x223/0xc10 This comes at a cost of bloating the final btrfs.ko module due all the inlining, as long as assertions are compiled in. This is a must for debugging builds but this is often enabled on release builds too. Release build: text data bss dec hex filename 1251676 20317 16088 1288081 13a791 pre/btrfs.ko 1260612 29473 16088 1306173 13ee3d post/btrfs.ko DELTA: +8936 CC: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:31 +02:00
Qu Wenruo	75258f20fb	btrfs: subpage: dump extra subpage bitmaps for debug There is a bug report that assert_eb_page_uptodate() gets triggered for free space tree metadata. Without proper dump for the subpage bitmaps it's much harder to debug. Thus this patch would dump all the subpage bitmaps (split them into their own bitmaps) for a easier debugging. The output would look like this: (Dumped after a tree block got read from disk) page:000000006e34bf49 refcount:4 mapcount:0 mapping:0000000067661ac4 index:0x1d1 pfn:0x110e9 memcg:ffff0000d7d62000 aops:btree_aops [btrfs] ino:1 flags: 0x8000000000002002(referenced\|private\|zone=2) page_type: 0xffffffff() raw: 8000000000002002 0000000000000000 dead000000000122 ffff00000188bed0 raw: 00000000000001d1 ffff0000c7992700 00000004ffffffff ffff0000d7d62000 page dumped because: btrfs subpage dump BTRFS warning (device dm-1): start=30490624 len=16384 page=30474240 bitmaps: uptodate=4-7 error= dirty= writeback= ordered= checked= Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
Tejun Heo	58e814fcac	btrfs: use alloc_ordered_workqueue() to create ordered workqueues BACKGROUND ========== When multiple work items are queued to a workqueue, their execution order doesn't match the queueing order. They may get executed in any order and simultaneously. When fully serialized execution - one by one in the queueing order - is needed, an ordered workqueue should be used which can be created with alloc_ordered_workqueue(). However, alloc_ordered_workqueue() was a later addition. Before it, an ordered workqueue could be obtained by creating an UNBOUND workqueue with @max_active==1. This originally was an implementation side-effect which was broken by `4c16bd327c` ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered"). Because there were users that depended on the ordered execution, `5c0338c687` ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") made workqueue allocation path to implicitly promote UNBOUND workqueues w/ @max_active==1 to ordered workqueues. While this has worked okay, overloading the UNBOUND allocation interface this way creates other issues. It's difficult to tell whether a given workqueue actually needs to be ordered and users that legitimately want a min concurrency level wq unexpectedly gets an ordered one instead. With planned UNBOUND workqueue updates to improve execution locality and more prevalence of chiplet designs which can benefit from such improvements, this isn't a state we wanna be in forever. This patch series audits all call sites that create an UNBOUND workqueue w/ @max_active==1 and converts them to alloc_ordered_workqueue() as necessary. BTRFS ===== * fs_info->scrub_workers initialized in scrub_workers_get() was setting @max_active to 1 when @is_dev_replace is set and it seems that the workqueue actually needs to be ordered if @is_dev_replace. Update the code so that alloc_ordered_workqueue() is used if @is_dev_replace. * fs_info->discard_ctl.discard_workers initialized in btrfs_init_workqueues() was directly using alloc_workqueue() w/ @max_active==1. Converted to alloc_ordered_workqueue(). * fs_info->fixup_workers and fs_info->qgroup_rescan_workers initialized in btrfs_queue_work() use the btrfs's workqueue wrapper, btrfs_workqueue, which are allocated with btrfs_alloc_workqueue(). btrfs_workqueue implements automatic @max_active adjustment which is disabled when the specified max limit is below a certain threshold, so calling btrfs_alloc_workqueue() with @limit_active==1 yields an ordered workqueue whose @max_active won't be changed as the auto-tuning is disabled. This is rather brittle in that nothing clearly indicates that the two workqueues should be ordered or btrfs_alloc_workqueue() must disable auto-tuning when @limit_active==1. This patch factors out the common btrfs_workqueue init code into btrfs_init_workqueue() and add explicit btrfs_alloc_ordered_workqueue(). The two workqueues are converted to use the new ordered allocation interface. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
David Sterba	1d12680044	btrfs: drop gfp from parameter extent state helpers Now that all extent state bit helpers effectively take the GFP_NOFS mask (and GFP_NOWAIT is encoded in the bits) we can remove the parameter. This reduces stack consumption in many functions and simplifies a lot of code. Net effect on module on a release build: text data bss dec hex filename 1250432 20985 16088 1287505 13a551 pre/btrfs.ko 1247074 20985 16088 1284147 139833 post/btrfs.ko DELTA: -3358 Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
David Sterba	62bc60473a	btrfs: pass NOWAIT for set/clear extent bits as another bit The only flags we now pass to set_extent_bit/__clear_extent_bit are GFP_NOFS and GFP_NOWAIT (a few functions handling mappings). This requires an extra parameter to be passed everywhere but is almost always the same. Encode the GFP_NOWAIT as an artificial extent bit and extract the real bits and gfp mask in the lowest level helpers. Now the passed gfp mask is not actually used and can be removed. Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
David Sterba	7dde7a8ab3	btrfs: drop NOFAIL from set_extent_bit allocation masks The __GFP_NOFAIL passed to set_extent_bit first appeared in 2010 (commit `f0486c68e4` ("Btrfs: Introduce contexts for metadata reservation")), without any explanation why it would be needed. Meanwhile we've updated the semantics of set_extent_bit to handle failed allocations and do unlock, sleep and retry if needed. The use of the NOFAIL flag is also an outlier, we never want any of the set/clear extent bit helpers to fail, they're used for many critical changes like extent locking, besides the extent state bit changes. Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
David Sterba	0acd32c294	btrfs: open code set_extent_bits This helper calls set_extent_bit with two more parameters set to default values, but otherwise it's purpose is not clear. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
David Sterba	e85de967bc	btrfs: open code set_extent_bits_nowait The helper only passes GFP_NOWAIT as gfp flags and is used two times. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
David Sterba	fe1a598c42	btrfs: open code set_extent_dirty The helper is used a few times, that it's setting the DIRTY extent bit is still clear. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
David Sterba	eea8686e68	btrfs: open code set_extent_new The helper is used only once. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
David Sterba	66240ab115	btrfs: open code set_extent_delalloc The helper is used once in fs code and a few times in the self test code. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
David Sterba	dc5646c15c	btrfs: open code set_extent_defrag The helper is used only once. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Christoph Hellwig	25ac047c9d	btrfs: remove a pointless NULL check in btrfs_lookup_fs_root btrfs_grab_root already checks for a NULL root itself. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Christoph Hellwig	e91909aace	btrfs: convert btrfs_get_global_root to use a switch statement Use a switch statement instead of an endless chain of if statements to make the code a little cleaner. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00

1 2 3 4 5 ...

82054 Commits