linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-10 14:11:52 +00:00

Author	SHA1	Message	Date
Kent Overstreet	4da1713a8d	bcachefs: check for inodes that should have backpointers in fsck Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
Kent Overstreet	45150765d3	bcachefs: bch_member.last_journal_bucket On recovery from clean shutdown we don't typically read the journal, but we still want to avoid overwriting existing entries in the journal for list_journal debugging. Thus, add some fields to the member info section so we can remember where we left off. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
Kent Overstreet	c749541353	bcachefs: uninline set_btree_iter_dontneed() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
Hongbo Li	0af0b963b5	bcachefs: eliminate the uninitialized compilation warning in bch2_reconstruct_snapshots When compiling the bcachefs-tools, the following compilation warning is reported: libbcachefs/snapshot.c: In function ‘bch2_reconstruct_snapshots’: libbcachefs/snapshot.c:915:19: warning: ‘tree_id’ may be used uninitialized in this function [-Wmaybe-uninitialized] 915 \| snapshot->v.tree = cpu_to_le32(tree_id); libbcachefs/snapshot.c:903:6: note: ‘tree_id’ was declared here 903 \| u32 tree_id; \| ^~~~~~~ This is a false positive, because @tree_id is set by bch2_snapshot_tree_create() whenever it returns 0, and if that function returns any other value, @tree_id is not used. The logic is therefore correct. Although the report itself is a false positive, we can still make the code more explicit by setting the initial value of @tree_id to 0 (an invalid tree ID). Fixes: `a292be3b68` ("bcachefs: Reconstruct missing snapshot nodes") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
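The pattern behind this warning can be sketched in a few lines of standalone C. This is only an illustration: `create_tree()` and `reconstruct()` are hypothetical stand-ins, not the bcachefs functions, and the value 7 is arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a function that writes its out-parameter
 * only on success (return value 0).
 */
static int create_tree(int fail, uint32_t *tree_id)
{
	assert(tree_id);
	if (fail)
		return -1;
	*tree_id = 7;	/* arbitrary value for the sketch */
	return 0;
}

/*
 * The caller only reads tree_id after a successful call, so leaving it
 * uninitialized would be harmless, but the compiler cannot always prove
 * that. Initializing it to 0 (assumed here to be an invalid ID) makes
 * the intent explicit and silences -Wmaybe-uninitialized.
 */
static int reconstruct(int fail, uint32_t *out)
{
	uint32_t tree_id = 0;
	int ret = create_tree(fail, &tree_id);

	if (ret)
		return ret;

	*out = tree_id;
	return 0;
}
```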
Kent Overstreet	56522d7276	bcachefs: fix btree_path_clone() ip_allocated Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
Nathan Chancellor	8bb0eddbbc	bcachefs: Fix format specifiers in bch2_btree_key_cache_to_text() When building for a 32-bit target, for which 'size_t' is 'unsigned int', there are two warnings around mismatched format specifiers and argument types: In file included from fs/bcachefs/vstructs.h:5, from fs/bcachefs/bcachefs_format.h:79, from fs/bcachefs/bcachefs.h:207, from fs/bcachefs/btree_key_cache.c:3: fs/bcachefs/btree_key_cache.c: In function 'bch2_btree_key_cache_to_text': fs/bcachefs/btree_key_cache.c:1046:25: error: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t' {aka 'unsigned int'} [-Werror=format=] 1046 \| prt_printf(out, "nonpcpu freelist:\t%lu\r\n", bc->nr_freed_nonpcpu); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~ \| \| \| size_t {aka unsigned int} fs/bcachefs/util.h:192:63: note: in definition of macro 'prt_printf' 192 \| #define prt_printf(_out, ...) bch2_prt_printf(_out, __VA_ARGS__) \| ^~~~~~~~~~~ fs/bcachefs/btree_key_cache.c:1046:47: note: format string is defined here 1046 \| prt_printf(out, "nonpcpu freelist:\t%lu\r\n", bc->nr_freed_nonpcpu); \| ~~^ \| \| \| long unsigned int \| %u fs/bcachefs/btree_key_cache.c:1047:25: error: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t' {aka 'unsigned int'} [-Werror=format=] 1047 \| prt_printf(out, "pcpu freelist:\t%lu\r\n", bc->nr_freed_pcpu); \| ^~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~ \| \| \| size_t {aka unsigned int} fs/bcachefs/util.h:192:63: note: in definition of macro 'prt_printf' 192 \| #define prt_printf(_out, ...) bch2_prt_printf(_out, __VA_ARGS__) \| ^~~~~~~~~~~ fs/bcachefs/btree_key_cache.c:1047:44: note: format string is defined here 1047 \| prt_printf(out, "pcpu freelist:\t%lu\r\n", bc->nr_freed_pcpu); \| ~~^ \| \| \| long unsigned int \| %u cc1: all warnings being treated as error Use the proper 'size_t' specifier, '%zu', to clear up the warnings for these platforms. 
Fixes: f2d47ec26af5 ("bcachefs: Btree key cache instrumentation") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
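The same fix can be shown outside the kernel with standard `snprintf()`. The helper below is a hypothetical userspace sketch, not the kernel's `prt_printf()`; it only demonstrates that `%zu` is the specifier that matches `size_t` regardless of whether `size_t` is `unsigned int` or `unsigned long` on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: formats a size_t counter with %zu.
 * Using %lu here would only be correct on targets where size_t
 * happens to be unsigned long, which is what the 32-bit build
 * in the commit message tripped over.
 */
static int format_freelist_count(char *buf, size_t len, size_t nr_freed)
{
	assert(buf && len > 0);
	return snprintf(buf, len, "pcpu freelist:\t%zu", nr_freed);
}
```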
Nathan Chancellor	2d288745eb	bcachefs: Fix type of flags parameter for some ->trigger() implementations When building with clang's -Wincompatible-function-pointer-types-strict (a warning designed to catch potential kCFI failures at build time), there are several warnings along the lines of: fs/bcachefs/bkey_methods.c:118:2: error: incompatible function pointer types initializing 'int ()(struct btree_trans , enum btree_id, unsigned int, struct bkey_s_c, struct bkey_s, enum btree_iter_update_trigger_flags)' with an expression of type 'int (struct btree_trans *, enum btree_id, unsigned int, struct bkey_s_c, struct bkey_s, unsigned int)' [-Werror,-Wincompatible-function-pointer-types-strict] 118 \| BCH_BKEY_TYPES() \| ^~~~~~~~~~~~~~~~ fs/bcachefs/bcachefs_format.h:394:2: note: expanded from macro 'BCH_BKEY_TYPES' 394 \| x(inode, 8) \ \| ^~~~~~~~~~~~~~~~~~~~~~~~~~ fs/bcachefs/bkey_methods.c:117:41: note: expanded from macro 'x' 117 \| #define x(name, nr) [KEY_TYPE_##name] = bch2_bkey_ops_##name, \| ^~~~~~~~~~~~~~~~~~~~ <scratch space>:277:1: note: expanded from here 277 \| bch2_bkey_ops_inode \| ^~~~~~~~~~~~~~~~~~~ fs/bcachefs/inode.h:26:13: note: expanded from macro 'bch2_bkey_ops_inode' 26 \| .trigger = bch2_trigger_inode, \ \| ^~~~~~~~~~~~~~~~~~ There are several functions that did not have their flags parameter converted to 'enum btree_iter_update_trigger_flags' in the recent unification, which will cause kCFI failures at runtime because the types, while ABI compatible (hence no warning from the non-strict version of this warning), do not match exactly. Fix up these functions (as well as a few other obvious functions that should have it, even if there are no warnings currently) to resolve the warnings and potential kCFI runtime failures. Fixes: 31e4ef3280c8 ("bcachefs: iter/update/trigger/str_hash flag cleanup") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
Kent Overstreet	24b27975a9	bcachefs: Kill gc_init_recurse() This unifies the online and offline btree gc passes; we're not yet running it online. We now iterate over one level of the btree at a time - the same as check_extents_to_backpointers(); this ordering preserves order of keys regardless of btree splits and merges, which will be important when we re-enable online gc. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
Kent Overstreet	c451986bf4	bcachefs: do reflink_p repair from BTREE_TRIGGER_check_repair Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	f40d13f94d	bcachefs: Run bch2_check_fix_ptrs() via triggers Currently, the reflink_p gc trigger does repair as well - turning a reflink_p key into an error key if the reflink_v it points to doesn't exist. This won't work with online check/repair, because the repair path once online will be subject to transaction restarts, but BTREE_TRIGGER_gc is not idempotent - we can't run it multiple times if we get a transaction restart. So we need to split these paths; to do so this patch calls check_fix_ptrs() by a new general path - a new trigger type, BTREE_TRIGGER_check_repair. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	930e1a92d6	bcachefs: kill gc looping for bucket gens looping when we change a bucket gen is not ideal - it means we risk failing if we'd go into an infinite loop, and it's better to make forward progress even if fsck doesn't fix everything. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	70e3e039cf	bcachefs: bch2_bucket_ref_update() If we hit an inconsistency when updating allocation information, we don't want to fail the update if it's for a deletion - only if it's for a new key. Rename check_bucket_ref() -> bucket_ref_update() so we can centralize the logic to do this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	9cc455d1bc	bcachefs: Consolidate mark_stripe_bucket() and trans_mark_stripe_bucket() This eliminates some duplicated logic, and the gc path now handles stripe updates and deletions - we need this since soon we're bringing back runtime gc. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	d930764650	bcachefs: mark_stripe_bucket cleanup Start to work on unifying mark_stripe_bucket() and trans_mark_stripe_bucket(); first, clean up all the unnecessary and gratuitous differences. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	c4e8db2b5d	bcachefs: bucket_data_type_mismatch() We're working on potentially unifying bch2_check_bucket_ref() and bch2_check_fix_ptrs() - or at least eliminating gratuitous differences. Most immediately, there's a bunch of cleanups to be done regarding BCH_DATA_stripe. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	b769590f33	bcachefs: Clean up inode alloc There's no need to be using new_inode(); we can skip all that indirection and make the code easier to follow. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	f04158290d	bcachefs: journal seq blacklist gc no longer has to walk btree Since btree_ptr_v2, we no longer require the journal seq blacklist table for skipping blacklisted bsets (btree node entries); the pointer to a given node indicates how much data is present. Therefore there's no longer any need for journal seq blacklist gc to walk the btree - we can prune entries older than journal last_seq. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	e7f63c67fc	bcachefs: plumb data_type into bch2_bucket_alloc_trans() prep work for making the allocator try to keep btree nodes within the existing member info btree allocated bitmap Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	018b32a63f	bcachefs: Add btree_allocated_bitmap to member_to_text() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	5147b9ae76	bcachefs: Btree key cache instrumentation It turns out the btree key cache shrinker wasn't actually reclaiming anything, prior to the previous patch. This adds instrumentation so that if we have further issues we can see what's going on. Specifically, sysfs internal/btree_key_cache is greatly expanded with new counters, and the SRCU sequence numbers of the first 10 entries on each pending freelist, and we also add trigger_btree_key_cache_shrink for testing without having to prune all the system caches. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Matthew Wilcox (Oracle)	e4f2c4dfee	bcachefs: Remove calls to folio_set_error Common code doesn't test the error flag, so we don't need to set it in bcachefs. We can use folio_end_read() to combine the setting (or not) of the uptodate flag and clearing the lock flag. Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Brian Foster <bfoster@redhat.com> Cc: linux-bcachefs@vger.kernel.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	103304021e	bcachefs: Move gc of bucket.oldest_gen to workqueue This is a nice cleanup - and we've also been having problems with kthread creation in the mount path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	b25fd02ab4	bcachefs: fix flag printing in journal_buf_to_text() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	aef7eecb57	bcachefs: Sync journal when we complete a recovery pass Make things easier when we're debugging long fsck runs - persist the work that successful recovery passes did. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	f7643bc974	bcachefs: make btree read errors silent during scan Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	5a2d15213d	bcachefs: Rip bch2_snapshot_equiv() out of fsck Originally, when deleting snapshots we didn't collapse redundant snapshot nodes; thus, the notion of a class of equivalent snapshot nodes leaked into fsck. Now we do, so snapshot ID equivalence classes are purely local to snapshot deletion. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	9de40d77f0	bcachefs: Check for writing btree_ptr_v2.sectors_written == 0 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	60f2b1bcf5	bcachefs: Add asserts to bch2_dev_btree_bitmap_marked_sectors() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	427e1bb838	bcachefs: fs_alloc_debug_to_text() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	feb255537d	bcachefs: assert that online_reserved == 0 on shutdown Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	fd104e2967	bcachefs: bch2_trans_verify_not_unlocked() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	e590e4e222	bcachefs: bch2_btree_path_can_relock() With the new assertions, we shouldn't be holding locks when trans->locked is false, thus, we shouldn't use relock when we just want to check if we can relock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	650db8a87c	bcachefs: trans->locked Add a field for tracking whether a transaction object holds btree locks, and assertions to verify state. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	e2e568bd97	bcachefs: bch2_btree_root_alloc_fake_trans() We're starting to be more strict about transaction locked state, and multiple transactions in a task. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	ca563dccb2	bcachefs: bch2_trans_unlock() must always be followed by relock() or begin() We're about to add new asserts for btree_trans locking consistency, and part of that requires that we aren't using the btree_trans while it's unlocked. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	4984faff5d	bcachefs: Use bch2_btree_path_upgrade() in key cache traverse Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	5d8c9d9428	bcachefs: bch2_btree_path_upgrade() checks nodes_locked, not uptodate In the key cache fill path, we use path_upgrade() on a path that isn't uptodate yet but should be locked. This change makes bch2_btree_path_upgrade() slightly looser so we can use it in key cache upgrade, instead of the __ version. Also, make the related assert - that path->uptodate implies nodes_locked - slightly clearer. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	f2d9823f46	bcachefs: maintain lock invariants in btree_iter_next_node() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	449ceafb49	bcachefs: bch2_trans_commit_flags_to_text() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	b7f10636d5	bcachefs: prefer drop_locks_do() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	91b5d97fdf	bcachefs: get_unlocked_mut_path -> bch2_path_get_unlocked_mut Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Lukas Bulwahn	d434c2398f	bcachefs: fix typo in reference to BCACHEFS_DEBUG Commit `ec9cc18fc2` ("bcachefs: Add checks for invalid snapshot IDs") intends to check the sanity of a snapshot and panic when BCACHEFS_DEBUG is set, but that conditional has a typo. Fix the typo to refer to the actual existing Kconfig symbol. This was found with ./scripts/checkkconfigsymbols.py. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Ricardo B. Marliere	af3b39b4c6	bcachefs: chardev: make bch_chardev_class constant Since commit `43a7206b09` ("driver core: class: make class_register() take a const *"), the driver core allows for struct class to be in read-only memory, so move the bch_chardev_class structure to be declared at build time placing it into read-only memory, instead of having to be dynamically allocated at boot time. Also, correctly clean up after failing paths in bch2_chardev_init(). Cc: Hongbo Li <lihongbo22@huawei.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	2f724563fc	bcachefs: member helper cleanups Some renaming for better consistency: bch2_member_exists -> bch2_member_alive; bch2_dev_exists -> bch2_member_exists; bch2_dev_exists2 -> bch2_dev_exists; bch_dev_locked -> bch2_dev_locked; bch_dev_bkey_exists -> bch2_dev_bkey_exists. New helper: bch2_dev_safe. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	d155272b6e	bcachefs: bucket_valid() cut out a branch from doing it the obvious way Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	923ed0ae5e	bcachefs: bch2_trans_relock_fail() - factor out slowpath Factor out slowpath into a separate helper Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	0c0cbfdb84	bcachefs: bch2_dir_emit() - drop_locks_do() conversion Add a new helper that calls dir_emit() and updates ctx->pos on success; this lets us convert bch2_readdir() to drop_locks_do(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	65bd442397	bcachefs: bch2_btree_insert_trans() no longer specifies BTREE_ITER_cached Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	5dd8c60e1e	bcachefs: iter/update/trigger/str_hash flag cleanup Combine iter/update/trigger/str_hash flags into a single enum, and x-macroize them for a to_text() function later. These flags are all for a specific iter/key/update context, so it makes sense to group them together - iter/update/trigger flags were already given distinct bits, this cleans up and unifies that handling. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
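The x-macro technique mentioned in this commit can be sketched as follows. All names here (`EXAMPLE_FLAGS`, `example_flags_to_text()`, and the three flag names) are hypothetical and chosen only for illustration; they are not the identifiers used in bcachefs. The idea is that a single list macro is expanded several times, so the bit numbers, the masks and the name table can never drift apart.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag list: the single source of truth. */
#define EXAMPLE_FLAGS()	\
	x(slots)	\
	x(intent)	\
	x(prefetch)

/* First expansion: one bit number per flag. */
enum example_flag_bits {
#define x(n) EXAMPLE_FLAG_BIT_##n,
	EXAMPLE_FLAGS()
#undef x
	EXAMPLE_FLAG_NR
};

/* Second expansion: one mask per flag, derived from the bit number. */
enum example_flags {
#define x(n) EXAMPLE_FLAG_##n = 1U << EXAMPLE_FLAG_BIT_##n,
	EXAMPLE_FLAGS()
#undef x
};

/* Third expansion: a name table indexed by bit number. */
static const char * const example_flag_names[] = {
#define x(n) #n,
	EXAMPLE_FLAGS()
#undef x
};

/* Write a comma-separated list of the set flags into buf. */
static void example_flags_to_text(char *buf, size_t len, unsigned flags)
{
	size_t pos = 0;

	assert(buf && len > 0);
	buf[0] = '\0';

	for (unsigned i = 0; i < EXAMPLE_FLAG_NR; i++) {
		int n;

		if (!(flags & (1U << i)))
			continue;

		n = snprintf(buf + pos, len - pos, "%s%s",
			     pos ? "," : "", example_flag_names[i]);
		if (n < 0 || (size_t)n >= len - pos)
			break;	/* output truncated: stop here */
		pos += (size_t)n;
	}
}
```

Adding a flag then means adding one `x(...)` line; the enum values and the text conversion pick it up automatically.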
Kent Overstreet	bf5f6a689b	bcachefs: __BTREE_ITER_ALL_SNAPSHOTS -> BTREE_ITER_SNAPSHOT_FIELD Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	c281db0fa5	bcachefs: mark_superblock cleanup Consolidate mark_superblock() and trans_mark_superblock(), like we did with the other trigger paths. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	ba665494fb	bcachefs: gc_btree_init_recurse() uses gc_mark_node() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	d1adfe4e7e	bcachefs: move root node topo checks to node_check_topology() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	b982d645a4	bcachefs: move topology repair kick to gc_btrees() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	58dda9c10e	bcachefs: kill metadata only gc Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	d1b213a00d	bcachefs: Finish converting reconstruct_alloc to errors_silent with errors_silent, reconstruct_alloc no longer requires fsck and fix_errors to work Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	68e142405c	bcachefs: bch2_gc() is now private to btree_gc.c Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	665e8b3239	bcachefs: for_each_btree_key_continue() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	a21107eeb1	bcachefs: kill for_each_btree_key_old() Dead code Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kuan-Wei Chiu	0ddb5f0854	bcachefs: Optimize eytzinger0_sort() with bottom-up heapsort This optimization reduces the average number of comparisons required from 2nlog2(n) - 3n + o(n) to nlog2(n) + 0.37n + o(n). When n is sufficiently large, it results in approximately 50% fewer comparisons. Currently, eytzinger0_sort employs the textbook version of heapsort, where during the heapify process, each level requires two comparisons to determine the maximum among three elements. In contrast, the bottom-up heapsort, during heapify, only compares two children at each level until reaching a leaf node. Then, it backtracks from the leaf node to find the correct position. Since heapify typically continues until very close to the leaf node, the standard heapify requires about 2log2(n) comparisons, while the bottom-up variant only needs log2(n) comparisons. The experimental data presented below is based on an array generated by get_random_u32(). \| N \| comparisons(old) \| comparisons(new) \| time(old) \| time(new) \| \|-------\|------------------\|------------------\|-----------\|-----------\| \| 10000 \| 235381 \| 136615 \| 25545 us \| 20366 us \| \| 20000 \| 510694 \| 293425 \| 31336 us \| 18312 us \| \| 30000 \| 800384 \| 457412 \| 35042 us \| 27386 us \| \| 40000 \| 1101617 \| 626831 \| 48779 us \| 38253 us \| \| 50000 \| 1409762 \| 799637 \| 62238 us \| 46950 us \| \| 60000 \| 1721191 \| 974521 \| 75588 us \| 58367 us \| \| 70000 \| 2038536 \| 1152171 \| 90823 us \| 68778 us \| \| 80000 \| 2362958 \| 1333472 \| 104165 us \| 78625 us \| \| 90000 \| 2690900 \| 1516065 \| 116111 us \| 89573 us \| \| 100000\| 3019413 \| 1699879 \| 133638 us \| 100998 us \| Refs: BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average, QUICKSORT (if n is not very small) Ingo Wegener Theoretical Computer Science, 118(1); Pages 81-98, 13 September 1993 https://doi.org/10.1016/0304-3975(93)90364-Y Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
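The bottom-up heapify described in this commit message can be sketched on a plain `int` array. This is a simplified illustration under stated assumptions: it uses an ordinary 0-based binary heap layout and direct integer comparisons, not the eytzinger layout or the comparison and swap callbacks that `eytzinger0_sort()` works with, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static void swap_int(int *x, int *y)
{
	int t = *x;
	*x = *y;
	*y = t;
}

/*
 * Bottom-up sift-down for a max-heap stored in a[0..n).
 * Phase 1 walks from root to a leaf, always following the larger
 * child, using one comparison per level instead of two.
 * Phase 2 backtracks from that leaf to the deepest node on the path
 * that is larger than a[root]; that is where a[root] belongs.
 * Phase 3 shifts the path elements up by one level and drops the
 * old root value into the vacated slot.
 */
static void sift_down_bottom_up(int *a, size_t root, size_t n)
{
	size_t j = root, pos;

	for (;;) {
		size_t l = 2 * j + 1, r = l + 1;

		if (r < n) {
			j = a[l] >= a[r] ? l : r;
		} else {
			if (l < n)	/* node with a single child */
				j = l;
			break;
		}
	}

	while (j != root && a[root] >= a[j])
		j = (j - 1) / 2;

	pos = j;
	while (j != root) {
		j = (j - 1) / 2;
		swap_int(&a[j], &a[pos]);
	}
}

/* Sorts a[0..n) in ascending order. */
static void bottom_up_heapsort(int *a, size_t n)
{
	assert(a || n == 0);
	if (n < 2)
		return;

	/* Build the heap from the last internal node down to the root. */
	for (size_t i = n / 2; i-- > 0;)
		sift_down_bottom_up(a, i, n);

	/* Repeatedly move the maximum to the end and restore the heap. */
	for (size_t end = n - 1; end > 0; end--) {
		swap_int(&a[0], &a[end]);
		sift_down_bottom_up(a, 0, end);
	}
}
```

The saving comes from phase 1: because the displaced element usually ends up near the bottom of the heap, the backtracking in phase 2 tends to be short.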
Kent Overstreet	be31bf439c	bcachefs: When traversing to interior nodes, propagate result to paths to same leaf node Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	4dcd90b6d1	bcachefs: Don't read journal just for fsck reading the journal can take a decent amount of time compared to the rest of fsck, let's only read it when required. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	19391b9294	bcachefs: allow for custom action in fsck error messages Be more explicit to the user about what we're doing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	497c982f05	bcachefs: New assertion for writing to the journal after shutdown Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	00589cadb1	bcachefs: bch2_btree_path_to_text() Long form version of bch2_btree_path_to_text() - useful in error messages and tracepoints. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	5577881455	bcachefs: add btree_node_merging_disabled debug param Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	ac01928b8e	bcachefs: bch2_hash_lookup() now returns bkey_s_c small cleanup Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:17 -04:00
Kent Overstreet	6ab71b4a8e	bcachefs: bch2_journal_keys_dump() debug helper Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:17 -04:00
Kent Overstreet	9089376f70	bcachefs: bch2_btree_node_header_to_text() better btree node read path error messages Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:17 -04:00
Kent Overstreet	7423330e30	bcachefs: prt_printf() now respects \r\n\t Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:17 -04:00
Kent Overstreet	2dcb605e86	bcachefs: printbufs: prt_printf() now handles \t\r\n Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:17 -04:00
Kent Overstreet	acce32a51e	bcachefs: printbuf improvements - fix assorted (harmless) off-by-one errors - we were inconsistent on whether out->pos stays <= out->size on overflow; now it does, and printbuf.overflow exists to indicate if a printbuf has overflowed - factor out printbuf_advance_pos() - printbuf_nul_terminate_reserved(); use this to reduce the number of printbuf_make_room() calls Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:17 -04:00
Kent Overstreet	62606398d5	bcachefs: Run upgrade/downgrade even in -o nochanges mode We need to be able to test these paths in dry run mode. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:17 -04:00
Kent Overstreet	6d82869185	bcachefs: Better write_super() error messages When a superblock write is silently dropped or it's been modified by another process we need to know which device it was. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:17 -04:00
Kent Overstreet	74768337de	bcachefs: Fix xattr_to_text() unsafety Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 14:57:19 -04:00
Kent Overstreet	61692c7812	bcachefs: bch2_bkey_format_field_overflows() Fix another shift-by-64 by factoring out a common helper for bch2_bkey_format_invalid() and bformat_needs_redo() (where it was already fixed). Reported-by: syzbot+9833a1d29d4a44361e2c@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 14:57:19 -04:00
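For context on the shift-by-64 problem: in C, shifting a 64-bit value by 64 or more is undefined behaviour, so a mask computed as `(1ULL << bits) - 1` is wrong when `bits` is 64. The sketch below shows one way a shared helper can avoid that. The function names and the exact overflow rule are assumptions made for illustration and are not taken from the bcachefs source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Mask with the low `bits` bits set, valid for 0 <= bits <= 64.
 * The shift count is 64 - bits, which lies in 0..63 whenever
 * bits is nonzero, so it never reaches the undefined case.
 */
static uint64_t low_bits_mask(unsigned bits)
{
	assert(bits <= 64);
	return bits ? UINT64_MAX >> (64 - bits) : 0;
}

/*
 * Hypothetical check: can a packed field of f_bits bits, biased by
 * offset, produce a value that does not fit in unpacked_bits bits?
 * The subtraction form avoids overflowing offset + field_max.
 */
static bool field_overflows(unsigned f_bits, uint64_t offset,
			    unsigned unpacked_bits)
{
	uint64_t unpacked_max = low_bits_mask(unpacked_bits);
	uint64_t field_max = low_bits_mask(f_bits);

	return offset > unpacked_max ||
	       field_max > unpacked_max - offset;
}
```

Putting the check in one helper means every caller gets the same handling of the 64-bit edge case.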
Kent Overstreet	5dfd3746b6	bcachefs: Fix needs_whiteout BUG_ON() in bkey_sort() Btree nodes are log structured; thus, we need to emit whiteouts when we're deleting a key that's been written out to disk. k->needs_whiteout tracks whether a key will need a whiteout when it's deleted, and this requires some careful handling; e.g. the key we're deleting may not have been written out to disk, but it may have overwritten a key that was - thus we need to carry this flag around on overwrites. Invariants: There may be multiple keys for the same position in a given node (because of overwrites), but only one of them will be a live (non deleted) key, and only one key for a given position will have the needs_whiteout flag set. Additionally, we don't want to carry around whiteouts that need to be written in the main searchable part of a btree node - btree_iter_peek() will have to skip past them, and this can lead to O(n^2) issues when doing sequential deletions (e.g. inode rm/truncate). So there's a separate region in the btree node buffer for unwritten whiteouts; these are merge sorted with the rest of the keys we're writing in the btree node write path. The unwritten whiteouts area was a later optimization that bch2_sort_keys() didn't take into account; the unwritten whiteouts area means that we never have deleted keys with needs_whiteout set in the main searchable part of a btree node. That means we can simplify and optimize some sort paths, and eliminate an assertion that syzbot found: - Unless we're in the btree node write path, it's always ok to drop whiteouts when sorting - When sorting for a btree node write, we drop the whiteout if it's not from the unwritten whiteouts area, or if it's overwritten by a real key at the same position.
This completely eliminates some tricky logic for propagating the needs_whiteout flag: syzbot was able to hit the assertion that checked that there shouldn't be more than one key at the same pos with needs_whiteout set, likely due to a combination of flipping on needs_whiteout on all written keys (they need whiteouts if overwritten), combined with not always dropping unneeded whiteouts, and the tricky logic in the sort path for preserving needs_whiteout that wasn't really needed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 14:56:09 -04:00
Kent Overstreet	5ad1f33c29	bcachefs: Fix sb_clean_validate endianness conversion Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 14:56:09 -04:00
Linus Torvalds	45db3ab700	five ksmbd server fixes, all also for stable -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmY6/AAACgkQiiy9cAdy T1FJIgv/ZUwoOodrFevFTrRFtQLS/ssRxgX69FEIWXpzHqeU8olsC8P2ywM974ba ATsiLmfpdreBilnW5DCHFLtJwPb1py2KzqlwYYbh7sdU3d+mGGLX6r1ucn9tKNl3 fWNCHUe8Dz3qRaKkpNFmzS3sXaekr/ZT3SsoJyYg/d8Z7fqXsTy7auo2pVXRiYFp TacIaGDc2Tw7fyf6Dt9o9YuSCOmGXaj9pUlTHrW17/cYXDMsQD+UcaNu93uuyZjo i6xvN1npZDec3x2j6a0YV159fWfag4hR7GxtwBEg0Ltzm3XSL5v0ljtFpeNGlehg u0TV5Tcfx8pEtcfaFdHbNXC5ih2vDMN2Yts0K8WGAWbcECs+XlvCJnYyvHGFVequ pCZuUGcrXM+0EqYnVTBMdY7lk3We8HbeZsbGjQA23MG9Bd537sBEdGpsA7ya43nJ kFK/ky8PjQ+BFpweGKL27fNULXZTSu+1D+IP+XgqksxKM5LYzWkvLAyVdUy+aNdA 6+MqIZIs =Ee/V -----END PGP SIGNATURE----- Merge tag '6.9-rc7-ksmbd-fixes' of git://git.samba.org/ksmbd Pull smb server fixes from Steve French: "Five ksmbd server fixes, all also for stable - Three fixes related to SMB3 leases (fixes two xfstests, and a locking issue) - Unitialized variable fix - Socket creation fix when bindv6only is set" * tag '6.9-rc7-ksmbd-fixes' of git://git.samba.org/ksmbd: ksmbd: do not grant v2 lease if parent lease key and epoch are not set ksmbd: use rwsem instead of rwlock for lease break ksmbd: avoid to send duplicate lease break notifications ksmbd: off ipv6only for both ipv4/ipv6 binding ksmbd: fix uninitialized symbol 'share' in smb2_tree_connect()	2024-05-08 10:39:53 -07:00
Linus Torvalds	065a057a31	fuse fixes for 6.9 final -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCZjsr0QAKCRDh3BK/laaZ PLrpAP9Y1Kz3gSSH1wqDJ9+XzQZdm4dSInMP2Pe47BvSGG2YlAEAwmccoyIoiM58 qvHPETImNxIRTAVZdiBM3W4S3hnzCwc= =SPoy -----END PGP SIGNATURE----- Merge tag 'fuse-fixes-6.9-final' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "Two one-liner fixes for issues introduced in -rc1" * tag 'fuse-fixes-6.9-final' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: virtiofs: include a newline in sysfs tag fuse: verify zero padding in fuse_backing_map	2024-05-08 10:33:55 -07:00
Linus Torvalds	fe35bf27a1	Description for this pull request: - Fix xfstests generic/013 test failure with dirsync mount option. - Initialize the reserved fields of deleted file and stream extension dentries to zero. -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAmY7WcoWHGxpbmtpbmpl b25Aa2VybmVsLm9yZwAKCRBnC/sDUUQhCMxTD/9+qFI6cEfe06Xt6RswN/RDMWrZ ZDzUjT7VATLSyjoiaeyJeCaK9/PCrJuX9+vNybq6W0TqfHzIYDmFn7Wg6HjQrZAJ 0XhiaqVwlQ2/UY4yiv7glJRKFsdgJdo3XhFfTWzV5Eaaj65QFHPjlQMo3tOrZzp9 HsO4+DwIFah2uvehKF8numJBXSZ7uoOELHnlL05A3xSmLAxY+HeueqbkQubv1r11 mIIfvmcdxnXlzdpgs1c+a0KXVg/4/0F+SZKYP+JL5x1N2xpc4y0cWsQgrfXY+7Id fPx6CoRYkchfUFGf/LlX/LKchMO/EuK3q3Q17+zoKfgJgdPbp8TkDpfur9iUOxgy 16wyq/iIPKWEFsMYLtqYN/dlNJ+fmVUVDF457VLNYYEFdDQbp8/VosGn4ct0CBQe E1uzwJlv/iUlBNFX679dNxDewAiBtIat2wyAChCauLK6a1bzHCIDpGUlS88ggBAd OLFvQgzRKILqd8fibb2VV46V/CY3R8SmVCzDBixPFmCJtNZas9crd3UXp1xNvPGA LHDnASkpUHSMQoQN0yfMGfvRosQD7wlJYw1mhMlDq35Z2IJg2HKKSESf2axOc5Z0 25AxNZ8xfgjBNiFfDQI0mClliXnz9GTRGt4LqBVS+YHjdbPYqCHNsvJDbR0r1ZM7 OzYIaxTVoTKtYsurgw== =zS+L -----END PGP SIGNATURE----- Merge tag 'exfat-for-6.9-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat fixes from Namjae Jeon: - Fix xfstests generic/013 test failure with dirsync mount option - Initialize the reserved fields of deleted file and stream extension dentries to zero * tag 'exfat-for-6.9-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: zero the reserved fields of file and stream extension dentries exfat: fix timing of synchronizing bitmap and inode	2024-05-08 10:30:13 -07:00
Mateusz Guzik	7f016edaa0	fscrypt: try to avoid refing parent dentry in fscrypt_file_open Merely checking if the directory is encrypted happens for every open when using ext4, at the moment refing and unrefing the parent, costing 2 atomics and serializing opens of different files. The most common case of encryption not being used can be checked for with RCU instead. Sample result from open1_processes -t 20 ("Separate file open/close") from will-it-scale on Sapphire Rapids (ops/s): before: 12539898 after: 25575494 (+103%) v2: - add a comment justifying rcu usage, submitted by Eric Biggers - whack spurious IS_ENCRYPTED check from the refed case Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240508081400.422212-1-mjguzik@gmail.com Signed-off-by: Eric Biggers <ebiggers@google.com>	2024-05-08 10:28:58 -07:00
Linus Torvalds	f5fcbc8b43	bcachefs fixes for 6.9 - Various syzbot fixes; mainly small gaps in validation - Fix an integer overflow in fiemap() which was preventing filefrag from returning the full list of extents - Fix a refcounting bug on the device refcount, turned up by new assertions in the development branch - Fix a device removal/readd bug; write_super() was repeatedly dropping and retaking bch_dev->io_ref references Merge tag 'bcachefs-2024-05-07.2' of https://evilpiepirate.org/git/bcachefs Pull bcachefs fixes from Kent Overstreet: - Various syzbot fixes; mainly small gaps in validation - Fix an integer overflow in fiemap() which was preventing filefrag from returning the full list of extents - Fix a refcounting bug on the device refcount, turned up by new assertions in the development branch - Fix a device removal/readd bug; write_super() was repeatedly dropping and retaking bch_dev->io_ref references * tag 'bcachefs-2024-05-07.2' of https://evilpiepirate.org/git/bcachefs: bcachefs: Add missing sched_annotate_sleep() in bch2_journal_flush_seq_async() bcachefs: Fix race in bch2_write_super() bcachefs: BCH_SB_LAYOUT_SIZE_BITS_MAX bcachefs: Add missing skcipher_request_set_callback() call bcachefs: Fix snapshot_t() usage in bch2_fs_quota_read_inode() bcachefs: Fix shift-by-64 in bformat_needs_redo() bcachefs: Guard against unknown k.k->type in __bkey_invalid() bcachefs: Add missing validation for superblock section clean bcachefs: Fix assert in bch2_alloc_v4_invalid() bcachefs: fix overflow in fiemap bcachefs: Add a better limit for maximum number of buckets bcachefs: Fix lifetime issue in device iterator helpers bcachefs: Fix bch2_dev_lookup() refcounting bcachefs: Initialize bch_write_op->failed in inline data path bcachefs: Fix refcount put in sb_field_resize error path bcachefs: Inodes need extra padding for varint_decode_fast() bcachefs: Fix early error path in bch2_fs_btree_key_cache_exit() bcachefs: bucket_pos_to_bp_noerror() bcachefs: don't free error pointers bcachefs: Fix a scheduler splat in __bch2_next_write_buffer_flush_journal_buf()	2024-05-08 10:23:18 -07:00
Allen Pais	4bbf9c3b53	fs/coredump: Enable dynamic configuration of max file note size Introduce the capability to dynamically configure the maximum file note size for ELF core dumps via sysctl. Why is this being done? We have observed that during a crash when there are more than 65k mmaps in memory, the existing fixed limit on the size of the ELF notes section becomes a bottleneck. The notes section quickly reaches its capacity, leading to incomplete memory segment information in the resulting coredump. This truncation compromises the utility of the coredumps, as crucial information about the memory state at the time of the crash might be omitted. This enhancement removes the previous static limit of 4MB, allowing system administrators to adjust the size based on system-specific requirements or constraints. E.g.: $ sysctl -a | grep core_file_note_size_limit kernel.core_file_note_size_limit = 4194304 $ sysctl -n kernel.core_file_note_size_limit 4194304 $ echo 519304 > /proc/sys/kernel/core_file_note_size_limit $ sysctl -n kernel.core_file_note_size_limit 519304 Attempting to write beyond the ceiling value of 16MB: $ echo 17194304 > /proc/sys/kernel/core_file_note_size_limit bash: echo: write error: Invalid argument Signed-off-by: Vijay Nag <nagvijay@microsoft.com> Signed-off-by: Allen Pais <apais@linux.microsoft.com> Link: https://lore.kernel.org/r/20240506193700.7884-1-apais@linux.microsoft.com Signed-off-by: Kees Cook <keescook@chromium.org>	2024-05-08 09:53:00 -07:00
Ryusuke Konishi	91d743a9c8	nilfs2: make superblock data array index computation sparse friendly Upon running sparse, "warning: dubious: x & !y" is output at an array index calculation within nilfs_load_super_block(). The calculation is not wrong, but to eliminate the sparse warning, replace it with an equivalent calculation. Also, add a comment to make it easier to understand what the unintuitive array index calculation is doing and whether it's correct. Link: https://lkml.kernel.org/r/20240430080019.4242-3-konishi.ryusuke@gmail.com Fixes: `e339ad31f5` ("nilfs2: introduce secondary super block") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-08 08:41:28 -07:00
Matthew Wilcox (Oracle)	bbf45b7e68	squashfs: remove calls to set the folio error flag Nobody checks the error flag on squashfs folios, so stop setting it. Link: https://lkml.kernel.org/r/20240420025029.2166544-24-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Phillip Lougher <phillip@squashfs.org.uk> Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-08 08:41:28 -07:00
Matthew Wilcox (Oracle)	675f02e5e6	squashfs: convert squashfs_symlink_read_folio to use folio APIs Remove use of page APIs, return the errno instead of 0, switch from kmap_atomic to kmap_local and use folio_end_read() to unify the two exit paths. Link: https://lkml.kernel.org/r/20240420025029.2166544-23-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Phillip Lougher <phillip@squashfs.org.uk> Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-08 08:41:28 -07:00
Colin Ian King	f492fb3656	ocfs2: remove redundant assignment to variable status Variable status is being assigned an error code that is never read; it is being assigned inside of a do-while loop. The assignment is redundant and can be removed. Cleans up clang scan build warning: fs/ocfs2/dlm/dlmdomain.c:1530:2: warning: Value stored to 'status' is never read [deadcode.DeadStores] Link: https://lkml.kernel.org/r/20240423223018.1573213-1-colin.i.king@gmail.com Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Heming Zhao <heming.zhao@suse.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-08 08:41:27 -07:00
Eric Sandeen	36defdd9d7	nilfs2: convert to use the new mount API Convert nilfs2 to use the new mount API. [sandeen@redhat.com: v2] Link: https://lkml.kernel.org/r/33d078a7-9072-4d8e-a3a9-dec23d4191da@redhat.com Link: https://lkml.kernel.org/r/20240425190526.10905-1-konishi.ryusuke@gmail.com [konishi.ryusuke: fixed missing SB_RDONLY flag repair in nilfs_reconfigure] Link: https://lkml.kernel.org/r/33d078a7-9072-4d8e-a3a9-dec23d4191da@redhat.com Link: https://lkml.kernel.org/r/20240424182716.6024-1-konishi.ryusuke@gmail.com Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-08 08:41:27 -07:00
Gao Xiang	d69189428d	erofs: clean up z_erofs_load_full_lcluster() Only four lcluster types here, remove redundant code. No real logic changes. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240508123357.3266173-1-hsiangkao@linux.alibaba.com	2024-05-08 20:36:43 +08:00
Hongzhen Luo	1872df8dcd	erofs: derive fsid from on-disk UUID for .statfs() if possible Use the superblock's UUID to generate the fsid when it's non-null. Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Link: https://lore.kernel.org/r/20240409113022.74720-1-hongzhen@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-05-08 17:12:51 +08:00
Chunhai Guo	0f6273ab46	erofs: add a reserved buffer pool for lz4 decompression This adds a special global buffer pool (in the end) for reserved pages. Using a reserved pool for LZ4 decompression significantly reduces the time spent on extra temporary page allocation for the extreme cases in low memory scenarios. The table below shows the reduction in time spent on page allocation for LZ4 decompression when using a reserved pool. The results were obtained from multi-app launch benchmarks on ARM64 Android devices running the 5.15 kernel with an 8-core CPU and 8GB of memory. In the benchmark, we launched 16 frequently-used apps, and the camera app was the last one in each round. The data in the table is the average time of camera app for each round. After using the reserved pool, there was an average improvement of 150ms in the overall launch time of our camera app, which was obtained from the systrace log. Average (ms): 3434 w/o page pool, 21 w/ page pool, diff -99.38%. Based on the benchmark logs, 64 pages are sufficient for 95% of scenarios. This value can be adjusted with a module parameter `reserved_pages`. The default value is 0. This pool is currently only used for the LZ4 decompressor, but it can be applied to more decompressors if needed. Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240402131523.2703948-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-05-08 17:12:51 +08:00
Chunhai Guo	d6db47e571	erofs: do not use pagepool in z_erofs_gbuf_growsize() Let's use alloc_pages_bulk_array() for simplicity and get rid of unnecessary pagepool. Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240402092757.2635257-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-05-08 17:12:50 +08:00
Chunhai Guo	f36f3010f6	erofs: rename per-CPU buffers to global buffer pool and make it configurable It will cost more time if compressed buffers are allocated on demand for low-latency algorithms (like lz4), so EROFS uses per-CPU buffers to keep compressed data if in-place decompression is unfulfilled. However, this is somewhat wasteful of memory on a device with hundreds of CPUs, since only a small number of CPUs decompress concurrently most of the time. This patch renames it as 'global buffer pool' and makes it configurable. This allows two or more CPUs to share a common buffer to reduce memory occupation. Suggested-by: Gao Xiang <xiang@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Link: https://lore.kernel.org/r/20240402100036.2673604-1-guochunhai@vivo.com Signed-off-by: Sandeep Dhavale <dhavale@google.com> Link: https://lore.kernel.org/r/20240408215231.3376659-1-dhavale@google.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-05-08 17:12:49 +08:00
Chunhai Guo	cacd5b04e2	erofs: rename utils.c to zutil.c Currently, utils.c is only useful if CONFIG_EROFS_FS_ZIP is on. So let's rename it to zutil.c as well as avoid its inclusion if CONFIG_EROFS_FS_ZIP is explicitly disabled. Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240401135550.2550043-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-05-08 17:12:49 +08:00
Brian Foster	96d88f65ad	virtiofs: include a newline in sysfs tag The internal tag string doesn't contain a newline. Append one when emitting the tag via sysfs. [Stefan] Orthogonal to the newline issue, sysfs_emit(buf, "%s", fs->tag) is needed to prevent format string injection. Signed-off-by: Brian Foster <bfoster@redhat.com> Fixes: `a8f62f50b4` ("virtiofs: export filesystem tags through sysfs") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2024-05-08 09:31:21 +02:00
Baokun Li	dc1c4663bc	ext4: propagate errors from ext4_sb_bread() in ext4_xattr_block_cache_find() In ext4_xattr_block_cache_find(), when ext4_sb_bread() returns an error, we will either continue to find the next ea block or return NULL to try to insert a new ea block. But whether ext4_sb_bread() returns -EIO or -ENOMEM, the next operation is most likely to fail with the same error. So propagate the error returned by ext4_sb_bread() to make ext4_xattr_block_set() fail to reduce pointless operations. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240504075526.2254349-3-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:59:18 -04:00
Baokun Li	0c0b4a49d3	ext4: fix mb_cache_entry's e_refcnt leak in ext4_xattr_block_cache_find() Syzbot reports a warning as follows: ============================================ WARNING: CPU: 0 PID: 5075 at fs/mbcache.c:419 mb_cache_destroy+0x224/0x290 Modules linked in: CPU: 0 PID: 5075 Comm: syz-executor199 Not tainted 6.9.0-rc6-gb947cc5bf6d7 RIP: 0010:mb_cache_destroy+0x224/0x290 fs/mbcache.c:419 Call Trace: <TASK> ext4_put_super+0x6d4/0xcd0 fs/ext4/super.c:1375 generic_shutdown_super+0x136/0x2d0 fs/super.c:641 kill_block_super+0x44/0x90 fs/super.c:1675 ext4_kill_sb+0x68/0xa0 fs/ext4/super.c:7327 [...] ============================================ This is because when finding an entry in ext4_xattr_block_cache_find(), if ext4_sb_bread() returns -ENOMEM, the ce's e_refcnt, which has already grown in the __entry_find(), won't be put away, and eventually trigger the above issue in mb_cache_destroy() due to reference count leakage. So call mb_cache_entry_put() on the -ENOMEM error branch as a quick fix. Reported-by: syzbot+dd43bd0f7474512edc47@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=dd43bd0f7474512edc47 Fixes: `fb265c9cb4` ("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240504075526.2254349-2-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:59:18 -04:00
Colin Ian King	8b57de1c5e	jbd2: remove redundant assignement to variable err The variable err is being assigned a value that is never read, it is being re-assigned inside the following while loop and also after the while loop. The assignment is redundant and can be removed. Cleans up clang scan build warning: fs/jbd2/commit.c:574:2: warning: Value stored to 'err' is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240410112803.232993-1-colin.i.king@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:52:20 -04:00
Zhang Yi	df0b5afc62	ext4: remove the redundant folio_wait_stable() __filemap_get_folio() with the FGP_WRITEBEGIN parameter already waits for a stable folio, so remove the redundant folio_wait_stable() in ext4_da_write_begin(). It was left over from commit `cc883236b7` ("ext4: drop unnecessary journal handle in delalloc write"), which removed the logic that retried getting the page. Fixes: `cc883236b7` ("ext4: drop unnecessary journal handle in delalloc write") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240419023005.2719050-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:48:04 -04:00
Dan Carpenter	3f4830abd2	ext4: fix potential unnitialized variable Smatch complains "err" can be uninitialized in the caller. fs/ext4/indirect.c:349 ext4_alloc_branch() error: uninitialized symbol 'err'. Set the error to zero on the success path. Fixes: `8016e29f43` ("ext4: fast commit recovery path") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/363a4673-0fb8-4adf-b4fb-90a499077276@moroto.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:44:40 -04:00
Matthew Wilcox (Oracle)	c84f1510fb	ext4: convert ac_buddy_page to ac_buddy_folio This just carries around the bd_buddy_folio so should also be a folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240416172900.244637-6-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:38:17 -04:00
Matthew Wilcox (Oracle)	ccedf35b5d	ext4: convert ac_bitmap_page to ac_bitmap_folio This just carries around the bd_bitmap_folio so should also be a folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240416172900.244637-5-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:38:14 -04:00
Matthew Wilcox (Oracle)	e1622a0d55	ext4: convert ext4_mb_init_cache() to take a folio All callers now have a folio, so convert this function from operating on a page to operating on a folio. The folio is assumed to be a single page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240416172900.244637-4-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:38:10 -04:00
Matthew Wilcox (Oracle)	5eea586b47	ext4: convert bd_buddy_page to bd_buddy_folio There is no need to make this a multi-page folio, so leave all the infrastructure around it in pages. But since we're locking it, playing with its refcount and checking whether it's uptodate, it needs to move to the folio API. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240416172900.244637-3-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:38:07 -04:00
Matthew Wilcox (Oracle)	99b150d84e	ext4: convert bd_bitmap_page to bd_bitmap_folio There is no need to make this a multi-page folio, so leave all the infrastructure around it in pages. But since we're locking it, playing with its refcount and checking whether it's uptodate, it needs to move to the folio API. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240416172900.244637-2-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:37:46 -04:00
Dan Carpenter	0e39c9e524	btrfs: qgroup: fix initialization of auto inherit array The "i++" was accidentally left out so it just sets qgids[0] over and over. This can lead to unexpected problems, as the groups[1:] would be all 0, so a later find_qgroup_rb() is unable to find a qgroup, causing snapshot creation to fail. Fixes: `5343cd9364` ("btrfs: qgroup: simple quota auto hierarchy for nested subvolumes") CC: stable@vger.kernel.org # 6.7+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:11 +02:00
Matthew Wilcox (Oracle)	bc00965dbf	btrfs: count super block write errors in device instead of tracking folio error state Currently the error status of super block write is tracked in page/folio status bit Error. For that we need to keep the reference for the whole duration of write and wait. Count the number of superblock writeback errors in the btrfs_device. That means we don't need the folio to stay around until it's waited for, and can avoid the extra call to folio_get/put. Also remove a mention of PageError in a comment as it's the last mention of the page Error state. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:11 +02:00
Matthew Wilcox (Oracle)	617fb10ea8	btrfs: use the folio iterator in btrfs_end_super_write() Iterate over folios instead of bvecs. Switch the order of unlock and put to be the usual order; we know this folio can't be put until it's been waited for, but that's fragile. Remove the calls to ClearPageUptodate / SetPageUptodate -- if PAGE_SIZE is larger than BTRFS_SUPER_INFO_SIZE, we'd be marking the entire folio uptodate without having actually initialised all the bytes in the page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Matthew Wilcox (Oracle)	f93ee0df51	btrfs: convert super block writes to folio in write_dev_supers() This is a direct conversion from pages to folios, assuming single page folio. Also removes some calls to obsolete APIs and some hidden calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Matthew Wilcox (Oracle)	c94b7349b8	btrfs: convert super block writes to folio in wait_dev_supers() This is a direct conversion from pages to folios, assuming single page folio. Also removes a few calls to compound_head() and calls to obsolete APIs. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Thorsten Blum	58a774ca16	btrfs: remove duplicate included header from fs.h Remove duplicate included header file linux/blkdev.h . Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Josef Bacik	6b0a63a4fa	btrfs: add a cached state to extent_clear_unlock_delalloc Now that we have the lock_extent tightly coupled with extent_clear_unlock_delalloc we can add a cached state to extent_clear_unlock_delalloc and benefit from skipping the extra lookup when we're doing cow. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Josef Bacik	8325f41a56	btrfs: push extent lock down in submit_one_async_extent We don't need to include the time we spend in the allocator under our extent lock protection, move it after the allocator and make sure we lock the extent in the error case to ensure we're not clearing these bits without the extent lock held. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Josef Bacik	d456c25dbb	btrfs: push lock_extent down in cow_file_range() Now that we've got the extent lock pushed into cow_file_range() we can push it further down into the allocation loop. This allows us to only hold the extent lock during the dropping of the extent map range and inserting the ordered extent. This makes the error case a little trickier as we'll now have to lock the range before clearing any of the other extent bits for the range, but this is the error path so is less performance critical. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Josef Bacik	cd241a8f55	btrfs: move can_cow_file_range_inline() outside of the extent lock These checks aren't reliant on the extent lock. Move this up into cow_file_range_inline(), and then update encoded writes to call this check before calling __cow_file_range_inline(). This will allow us to skip the extent lock if we're not able to inline the given extent. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Josef Bacik	0ab540995a	btrfs: push lock_extent into cow_file_range_inline Now that we've pushed the lock_extent() into cow_file_range() we can push the extent locking into cow_file_range_inline() and move the lock_extent in cow_file_range() to after we call cow_file_range_inline(). Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Josef Bacik	a0766d8f35	btrfs: push extent lock into cow_file_range Now that cow_file_range is the only function that is called with the range locked, push this call into cow_file_range so we can further narrow the scope. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Josef Bacik	00009d7bcb	btrfs: push extent lock into run_delalloc_cow This is used by zoned but also as the fallback for uncompressed extents when we fail to compress the ranges. Push the extent lock into run_delalloc_cow(), and adjust the compression case to take the extent lock after calling run_delalloc_cow(). Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	0e128d4e41	btrfs: remove unlock_extent from run_delalloc_compressed Since we immediately unlock the extent range when we enter run_delalloc_compressed() simply move the lock_extent() down to cover cow_file_range() and then remove the unlock_extent() from run_delalloc_compressed. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	aa56b0aa91	btrfs: push extent lock down in run_delalloc_nocow run_delalloc_nocow is a little special because we use the file extents to see if we can nocow a range. We don't actually need the protection of the extent lock to look at the file extents at this point however. We are currently holding the page lock for this range, so we are protected from anybody who would simultaneously be modifying the file extent items for this range. * mmap() - we're holding the page lock. * buffered writes - we're holding the page lock. * direct writes - we're holding the page lock and direct IO has to flush page cache before it's able to continue. * fallocate() - all callers flush the range and wait on ordered extents while holding the inode lock and the mmap lock, so we are again saved by the page lock. We want to use the extent lock to protect 1) The mapping tree for the given range. 2) The ordered extents for the given range. 3) The io_tree for the given range. Push the extent lock down to cover these operations. In the fallback_to_cow() case we simply lock before doing anything and rely on the cow_file_range() helper to handle its range properly. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	0ed30c17f6	btrfs: adjust while loop condition in run_delalloc_nocow We have the following pattern while (1) { if (cur_offset > end) break; } Which is just while (cur_offset <= end) { ... } so adjust the code to be more clear. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	7c9acd440f	btrfs: push extent lock into run_delalloc_nocow run_delalloc_nocow is a bit special as it walks through the file extents for the inode and determines what it can nocow and what it can't. This is the more complicated area for extent locking, so start with this function. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	c0707c9e1e	btrfs: push the extent lock into btrfs_run_delalloc_range We want to limit the scope of the extent lock to be around operations that can change in flight. Currently we hold the extent lock through the entire writepage operation, which isn't really necessary. We want to protect to make sure nobody has updated DELALLOC. In find_lock_delalloc_range we must lock the range in order to validate the contents of our io_tree. However once we've done that we're safe to unlock the range and continue, as we have the page lock already held for the range. We are protected from all operations at this point. * mmap() - we're holding the page lock, thus are protected. * buffered writes - again, we're protected because we take the page lock for the first and last page in our range for buffered writes so we won't create new delalloc ranges in this area. * direct IO - we invalidate pagecache before attempting to write a new area, which requires the page lock, so again are protected once we're holding the page lock on this range. Additionally this behavior actually already exists for compressed, we unlock the range as soon as we start to process the async extents, and re-lock it during compression. So this is completely safe, and makes the locking more consistent. Make this simple by just pushing the extent lock into btrfs_run_delalloc_range. From there followup patches will push the lock further down into its users. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	7034674b8a	btrfs: lock extent when doing inline extent in compression We currently don't lock the extent when we're doing a cow_file_range_inline() for a compressed extent. This isn't a problem necessarily, but it's inconsistent with the rest of our usage of cow_file_range_inline(). This also leads to some extra weird logic around whether the extent is locked or not. Fix this to lock the extent before calling cow_file_range_inline() in compression to make it consistent with the rest of the inline users. In future patches this will be pushed down into the cow_file_range_inline() helper, so we're fine with the quick and dirty locking here. This patch exists to make the behavior change obvious. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	0586d0a89e	btrfs: move extent bit and page cleanup into cow_file_range_inline We duplicate the extent cleanup for cow_file_range_inline() in the cow and compressed case. The encoded case doesn't need to do cleanup the same way, so rename cow_file_range_inline to __cow_file_range_inline and then make cow_file_range_inline handle the extent cleanup appropriately, and update the callers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	0332967b4d	btrfs: unlock all the pages with successful inline extent creation Since `4750af3bbe` ("btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig()") we have been unlocking the locked page manually instead of via extent_clear_unlock_delalloc() because of subpage blocksize support. However we actually disable inline extent creation for subpage blocksize support, so this behavior isn't necessary. Remove this code and comment, if at some point the subpage blocksize code grows support for inline extents this can be re-evaluated. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	6eecfa2240	btrfs: push all inline logic into cow_file_range Currently we have a lot of duplicated checks of if (start == 0 && fs_info->sectorsize == PAGE_SIZE) cow_file_range_inline(); Instead of duplicating this check everywhere, consolidate all of the inline extent logic into a helper which documents all of the checks and then use that helper inside of cow_file_range_inline(). With this we can clean up all of the calls to either unconditionally call cow_file_range_inline(), or at least reduce the checks we're doing before we call cow_file_range_inline(); Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	aa5ccf2917	btrfs: handle errors in btrfs_reloc_clone_csums properly In the cow path we will clone the reloc csums for relocated data extents, and if there's an error we already have an ordered extent and rely on the ordered extent finishing to clean everything up. There's a problem however, we don't mark the ordered extent with an error, we pretend like everything was just fine. If we were at the end of our range we won't actually bubble up this error anywhere, and we could end up inserting an extent that doesn't have csums where it should have them. Fix this by adding a helper to mark the ordered extent with an error, and then use this when we fail to lookup the csums in btrfs_reloc_clone_csums. Use this helper in the other place where we use the same pattern while we're here. This will prevent us from erroneously inserting the extent that doesn't have the required checksums. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Qu Wenruo	e98bf64f7a	btrfs: add extra sanity checks for create_io_em() The function create_io_em() is called before we submit an IO, to update the in-memory extent map for the involved range. This patch changes the following aspects: - Does not allow BTRFS_ORDERED_NOCOW type For real NOCOW (excluding NOCOW writes into preallocated ranges) writes, we never call create_io_em(), as we do not need to update the extent map at all. So remove the sanity check allowing BTRFS_ORDERED_NOCOW type. - Add extra sanity checks * PREALLOC - @block_len == len For uncompressed writes. * REGULAR - @block_len == @orig_block_len == @ram_bytes == @len We're creating a new uncompressed extent, and referring to all of it. - @orig_start == @start We have no offset inside the extent. * COMPRESSED - valid @compress_type - @len <= @ram_bytes This is to co-operate with encoded writes, which can cause a new file extent referring to only part of an uncompressed extent. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Qu Wenruo	4bdc558bf9	btrfs: simplify the inline extent map creation With the tree-checker ensuring all inline file extents start at file offset 0 and have a length no larger than sectorsize, we can simplify the calculation to assign those fixed values directly. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Qu Wenruo	319d91ee72	btrfs: add extra comments on extent_map members The extent_map structure is very critical to btrfs, as it is involved in both read and write paths. Unfortunately the structure is not properly explained, making it pretty hard to understand or to improve further. This patch adds extra comments explaining the major members based on my code reading. Hopefully we can find more members to clean up in the future. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Naohiro Aota	30704a0d56	btrfs: drop unused argument of calcu_metadata_size() calcu_metadata_size() has a "reserve" argument, but the only caller always sets it to "1". The other usage (reserve = 0) was dropped by commit `0647bf564f` ("Btrfs: improve forever loop when doing balance relocation"), which was more than 10 years ago. Drop the argument and simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Anand Jain	33a44f3760	btrfs: simplify return variables in btrfs_drop_subtree() There's another return variable wret that is only passed to ret on error, we can simply use ret. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Anand Jain	1618aa3c2e	btrfs: simplify return variables in lookup_extent_data_ref() First, drop err and reuse ret instead, and return the error directly instead of jumping to fail and then returning the same error. Do not initialize ret until it has to be initialized. Slight logic change in handling the btrfs_search_slot() and btrfs_next_leaf() return value. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Anand Jain	6e812a9c65	btrfs: rename return variables in btrfs_qgroup_rescan_worker() Rename ret to ret2 and then err to ret. Also, the new ret2 is found to be localized within the 'if (trans)' statement, so move its declaration there. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Anand Jain	5e8fb9b84b	btrfs: drop variable err in quick_update_accounting() In quick_update_accounting() err is used as 2nd return value, which could be achieved just with ret. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Anand Jain	acde0e8609	btrfs: reuse ret instead of err in relocate_tree_blocks() Coding style fixes in the function relocate_tree_blocks(). After the fix, ret is the return value variable. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Anand Jain	2daca1e419	btrfs: rename err and ret to ret in build_backref_tree() Code style fix in the function build_backref_tree(). Drop the ret initialization 0, as we don't need it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Anand Jain	1e8a42375f	btrfs: rename werr and err to ret in __btrfs_wait_marked_extents() Rename the function's local return variables err and werr to ret. Also, align the variable declarations with the other declarations in the function for better function space alignment. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Anand Jain	ce87531120	btrfs: rename werr and err to ret in btrfs_write_marked_extents() Rename the function's local variable werr and err to ret. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Anand Jain	9a7b68d32a	btrfs: report filemap_fdata<write\|wait>_range() error In the function btrfs_write_marked_extents() and in __btrfs_wait_marked_extents() return the actual error when filemap_fdata<write\|wait>_range() fails. Suggested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
David Sterba	fef998d1a0	btrfs: use btrfs_is_testing() everywhere There are open coded tests of BTRFS_FS_STATE_DUMMY_FS_INFO and we have a wrapper for that that's a compile-time constant when self-tests are not built in. As this is only for development we can save some bytes and conditions on release configs by using the helper in the remaining cases. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	905a95f3dd	btrfs: initialize delayed inodes xarray without GFP_ATOMIC There's no need to initialize the delayed inodes xarray with a GFP_ATOMIC flag because that actually does nothing on the xarray operations. That was needed for radix trees, but for xarrays the allocation flags are passed as the last argument to xa_store() (which we are using correctly). So initialize the delayed inodes xarray with a simple xa_init(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	de6f14e83e	btrfs: make try_release_extent_mapping() return a bool Currently try_release_extent_mapping() has an int return type, but we use it as a boolean. Its only caller, the release folio callback, also returns a boolean which corresponds to try_release_extent_mapping()'s return value. So change its return value type to bool as well as its helper try_release_extent_state(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	2e504418e4	btrfs: be better releasing extent maps at try_release_extent_mapping() At try_release_extent_mapping(), called during the release folio callback (btrfs_release_folio() callchain), we don't release any extent maps in the range if the GFP flags don't allow blocking. This behaviour is exaggerated because: 1) Both searching for extent maps and removing them are not blocking operations. The only thing that can block is the cond_resched() call at the end of the loop that searches for and removes extent maps; 2) We currently only operate on a single page, so for the case where block size matches the page size, we can only have one extent map, and for the case where the block size is smaller than the page size, we can have at most 16 extent maps. So it's very unlikely the cond_resched() call will ever block even in the block size smaller than page size scenario. So instead of not removing any extent maps at all in case the GFP flags don't allow blocking, keep removing extent maps while we don't need to reschedule. This makes it safe for the subpage case and for a future where we can process folios with a size larger than a page. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	433a3e01dd	btrfs: remove i_size restriction at try_release_extent_mapping() Currently we don't attempt to release extent maps if the inode has an i_size that is not greater than 16M. This condition was added way back in 2008 by commit `70dec8079d` ("Btrfs: extent_io and extent_state optimizations"), without any explanation about it. A quick chat with Chris on slack revealed that the goal was probably to release the extent maps for small files only when closing the inode. This however can be harmful in case we have tons of such files being kept open for very long periods of time, since we will consume more and more pages for extent maps. So remove the condition. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	85d288309a	btrfs: use btrfs_get_fs_generation() at try_release_extent_mapping() Nowadays we have the btrfs_get_fs_generation() to get the current generation of the filesystem, so there's no need anymore to lock the transaction spinlock to read it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	078b981aaa	btrfs: rename some variables at try_release_extent_mapping() Rename the following variables: 1) "btrfs_inode" to "inode", because it's shorter to type and clear, and we don't have a VFS inode here as well, so there's no confusion; 2) "tree" to "io_tree", to be clear which tree we are dealing with, since we use 2 different trees in the function; 3) "map" to "extent_tree" since "map" gives the idea we are dealing with an extent map for example, but we are dealing with the inode's extent tree (the tree which stores extent maps). These also make the next patches simpler. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	0d89a15e1a	btrfs: add tracepoints for extent map shrinker events Add some tracepoints for the extent map shrinker to help debug and analyse main events. These have proved useful during development of the shrinker. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	65bb9fb00b	btrfs: update comment for btrfs_set_inode_full_sync() about locking Nowadays we have a lock used to synchronize mmap writes with reflink and fsync operations (struct btrfs_inode::i_mmap_lock), so update the comment for btrfs_set_inode_full_sync() to mention that it can also be called while holding that mmap lock. Besides being a valid alternative to the inode's VFS lock, we already have the extent map shrinker using that mmap lock instead. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	956a17d9d0	btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this is not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equal to 16M (see try_release_extent_mapping()). This 16M i_size constraint was added back in 2008 with commit `70dec8079d` ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, especially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set of tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps.
This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with a very small amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with little memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extent maps, as long as they are not pinned, and that takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync).
This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
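The resumable scan described in the commit above (remember where the last run stopped, start from there next time, skip pinned extent maps) can be illustrated with a small userspace sketch. This is not the kernel implementation: `inode_sketch`, `shrink_extent_maps_sketch`, the flat array of inodes, and the interpretation of `nr_to_scan` as a budget of dropped maps are all assumptions made for illustration. The real shrinker walks roots and inodes and coordinates with fsync.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode: only counts of extent maps. */
struct inode_sketch {
	long nr_maps;	/* extent maps currently held in memory */
	long nr_pinned;	/* extent maps that must not be dropped */
};

/*
 * Drop up to nr_to_scan unpinned extent maps, starting at *cursor and
 * wrapping around. *cursor is advanced past every visited inode so the
 * next call resumes where this one stopped. Returns the number dropped.
 */
static long shrink_extent_maps_sketch(struct inode_sketch *inodes,
				      size_t nr_inodes, size_t *cursor,
				      long nr_to_scan)
{
	long freed = 0;
	size_t visited;

	assert(nr_to_scan >= 0);
	if (nr_inodes == 0)
		return 0;
	*cursor %= nr_inodes;

	/* Visit each inode at most once per call. */
	for (visited = 0; visited < nr_inodes && nr_to_scan > 0; visited++) {
		struct inode_sketch *inode = &inodes[*cursor];
		long droppable = inode->nr_maps - inode->nr_pinned;

		if (droppable > nr_to_scan)
			droppable = nr_to_scan;
		inode->nr_maps -= droppable;
		freed += droppable;
		nr_to_scan -= droppable;
		*cursor = (*cursor + 1) % nr_inodes;
	}
	return freed;
}
```

Keeping the cursor outside the function means repeated small reclaim requests spread the work across all inodes instead of always draining the first ones.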
Filipe Manana	f1d97e7691	btrfs: add a global per cpu counter to track number of used extent maps Add a per cpu counter that tracks the total number of extent maps that are in extent trees of inodes that belong to fs trees. This is going to be used in an upcoming change that adds a shrinker for extent maps. Only extent maps for fs trees are considered, because for special trees such as the data relocation tree we don't want to evict their extent maps which are critical for the relocation to work, and since those are limited, it's not a concern to have them in memory during the relocation of a block group. Another case are extent maps for free space cache inodes, which must always remain in memory, but those are limited (there's only one per free space cache inode, which means one per block group). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Filipe Manana	5fa8a6baff	btrfs: pass the extent map tree's inode to try_merge_map() Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to try_merge_map(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change try_merge_map() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Filipe Manana	e778724a5e	btrfs: pass the extent map tree's inode to setup_extent_mapping() Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to setup_extent_mapping(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change setup_extent_mapping() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Filipe Manana	6a3a9113ae	btrfs: pass the extent map tree's inode to replace_extent_mapping() Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to replace_extent_mapping(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change replace_extent_mapping() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Filipe Manana	c2fbd812d7	btrfs: pass the extent map tree's inode to remove_extent_mapping() Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to remove_extent_mapping(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change remove_extent_mapping() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Filipe Manana	002f3a2ce8	btrfs: pass the extent map tree's inode to clear_em_logging() Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to clear_em_logging(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change clear_em_logging() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Filipe Manana	6c566def95	btrfs: pass the extent map tree's inode to add_extent_mapping() Extent maps are always added to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to add_extent_mapping(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change add_extent_mapping() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Josef Bacik	e094f48040	btrfs: change root->root_key.objectid to btrfs_root_id() A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which makes it easier to read in the code. The changes were made with the following Coccinelle semantic patch: // <smpl> @@ expression E,E1; @@ ( E->root_key.objectid = E1 \| - E->root_key.objectid + btrfs_root_id(E) ) // </smpl> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Josef Bacik	53e2415868	btrfs: set start on clone before calling copy_extent_buffer_full Our subpage testing started hanging on generic/560 and I bisected it down to `1cab1375ba` ("btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations"). This is subtle because we use eb->start to figure out where in the folio we're copying to when we're subpage, as our ->start may refer to an area inside of the folio. For example, assume a 16K page size machine with a 4K node size, and assume that we already have a cloned extent buffer when we cloned the previous search. copy_extent_buffer_full() will do the following when copying the extent buffer path->nodes[0] (src) into cloned (dest): src->start = 8k; /* this is the new leaf we're cloning */ cloned->start = 4k; /* this is left over from the previous clone */ src_addr = folio_address(src->folios[0]); dest_addr = folio_address(dest->folios[0]); memcpy(dest_addr + get_eb_offset_in_folio(dest, 0), src_addr + get_eb_offset_in_folio(src, 0), src->len); Now get_eb_offset_in_folio() is where the problems occur, because for sub-pagesize blocksize we can have multiple eb's per folio, the code for this is as follows size_t get_eb_offset_in_folio(eb, offset) { return ((eb->start + offset) & (folio_size(eb->folios[0]) - 1)); } So in the above example we are copying into offset 4K inside the folio. However once we update cloned->start to 8K to match the src the math for get_eb_offset_in_folio() changes, and any subsequent reads (i.e. btrfs_item_key_to_cpu()) will start reading from the offset 8K instead of 4K where we copied to, giving us garbage. Fix this by setting start before we call copy_extent_buffer_full() to make sure that we're copying into the same offset inside of the folio that we will read from later. All other sites of copy_extent_buffer_full() are correct because we either set ->start beforehand or we simply don't change it in the case of the tree-log usage. With this fix we now pass generic/560 on our subpage tests.
Fixes: `1cab1375ba` ("btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations") Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
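The offset arithmetic behind the bug fixed above can be reproduced in a few lines of userspace C. This is a minimal sketch under stated assumptions: `eb_sketch`, `eb_offset_in_folio`, and `copy_and_read_agree` are hypothetical names, and the struct carries only the two values the masking needs (the buffer's logical start and a power-of-two folio size).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent buffer. */
struct eb_sketch {
	uint64_t start;		/* logical start address of the buffer */
	size_t folio_size;	/* size of the backing folio, power of two */
};

/* Byte offset of (eb->start + offset) inside the backing folio. */
static size_t eb_offset_in_folio(const struct eb_sketch *eb, size_t offset)
{
	assert((eb->folio_size & (eb->folio_size - 1)) == 0);
	return (size_t)((eb->start + offset) & (eb->folio_size - 1));
}

/*
 * Returns 1 when data copied while the clone had start == copy_start is
 * later read back from the same folio offset once start == read_start.
 */
static int copy_and_read_agree(uint64_t copy_start, uint64_t read_start,
			       size_t folio_size)
{
	struct eb_sketch at_copy = { copy_start, folio_size };
	struct eb_sketch at_read = { read_start, folio_size };

	return eb_offset_in_folio(&at_copy, 0) ==
	       eb_offset_in_folio(&at_read, 0);
}
```

With a 16K folio, `copy_and_read_agree(4096, 8192, 16384)` is 0 (the stale 4K start puts the copy at a different offset than later reads), while `copy_and_read_agree(8192, 8192, 16384)` is 1, which is the situation after start is set before the copy.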
Josef Bacik	99f2be1522	btrfs: replace btrfs_delayed_*_ref with btrfs_*_ref Now that these two structs are the same, move the btrfs_data_ref and btrfs_tree_ref up and use these in the btrfs_delayed_ref_node. Then remove the btrfs_delayed_*_ref structs. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	7f6af7c434	btrfs: remove the btrfs_delayed_ref_node container helpers Now that we don't use these helpers anywhere, remove them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	efc7d5dbf8	btrfs: stop referencing btrfs_delayed_tree_ref directly We only ever need to use this to get the level of the tree block ref, so use the btrfs_delayed_ref_owner() helper, which returns the level for the given reference. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	44cc2e38e6	btrfs: stop referencing btrfs_delayed_data_ref directly Now that most of our elements are inside of btrfs_delayed_ref_node directly and we have helpers for the delayed_data_ref bits, go ahead and remove all direct usage of btrfs_delayed_data_ref and use the helpers where needed. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	b4b5934ac1	btrfs: make the insert backref helpers take a btrfs_delayed_ref_node We don't need to pass in all the elements for the backrefs as function arguments, simply pass through the btrfs_delayed_ref_node and then extract the values we need from that. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	85bb9f544e	btrfs: drop unnecessary arguments from __btrfs_free_extent We have all the information we need in our btrfs_delayed_ref_node, which we already pass into __btrfs_free_extent. Drop the extra arguments and just extract the values from btrfs_delayed_ref_node. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	a502f112ad	btrfs: make __btrfs_inc_extent_ref take a btrfs_delayed_ref_node We're just extracting the values from btrfs_delayed_ref_node and passing them through, simply pass the btrfs_delayed_ref_node into __btrfs_inc_extent_ref and shrink the function arguments. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	5366763446	btrfs: rename btrfs_data_ref->ino to ->objectid This is how we refer to it in the rest of the extent reference related code, make it consistent. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	cf4f04325b	btrfs: move ->parent and ->ref_root into btrfs_delayed_ref_node These two members are shared by both the tree refs and data refs, so move them into btrfs_delayed_ref_node proper. This allows us to greatly simplify the comparison code, as the shared refs always only sort on parent, and the non shared refs always sort first on ref_root, and then only data refs sort on their specific fields. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	12390e42b6	btrfs: rename ->len to ->num_bytes in btrfs_ref We consistently use ->num_bytes everywhere through the delayed ref code, except in btrfs_ref. Rename btrfs_ref to match all the other code. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	f75464f7bb	btrfs: unify the btrfs_add_delayed_*_ref helpers into one helper Now that these helpers are identical, create a helper function that handles everything properly and strip the individual helpers down to use just the common helper. This cleans up a significant amount of duplicated code. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	1bff6d4f87	btrfs: simplify delayed ref tracepoints Now that all of the delayed ref information is in the delayed ref node, drastically simplify the delayed ref tracepoints by simply passing in the btrfs_delayed_ref_node and populating the tracepoints with the values from the structure itself. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Josef Bacik	0ea4703cc2	btrfs: move ref specific initialization into init_delayed_ref_common Now that the btrfs_delayed_ref_node contains a union of the data and metadata specific information we can move the initialization into init_delayed_ref_common and just use the btrfs_ref to initialize the correct fields of the reference. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Josef Bacik	0509cc5661	btrfs: initialize btrfs_delayed_ref_head with btrfs_ref We are calling init_delayed_ref_head with all of the elements from btrfs_ref, clean this up to simply pass in the btrfs_ref and initialize the btrfs_delayed_ref_head with the values from the btrfs_ref directly. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Josef Bacik	da3c548541	btrfs: pass btrfs_ref to init_delayed_ref_common We're extracting all of these values from the btrfs_ref we passed in already, just pass the btrfs_ref through to init_delayed_ref_common and get the values directly from the struct. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Josef Bacik	f2e69a77aa	btrfs: move ref_root into btrfs_ref We have this in both btrfs_tree_ref and btrfs_data_ref, which is just wasting space and making the code more complicated. Move this into btrfs_ref proper and update all the call sites to do the assignment in btrfs_ref. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Josef Bacik	4d09b4e942	btrfs: do not use a function to initialize btrfs_ref btrfs_ref currently has ->owning_root, and ->ref_root is shared between the tree ref and data ref, so in order to move that into btrfs_ref proper I would need to add another root parameter to the initialization function. This function has too many arguments, and adding another root will make it easy to make mistakes about which root goes where. Drop the generic ref init function and statically initialize the btrfs_ref in every usage. This makes the code easier to read because we can see what elements we're assigning, and will make the upcoming change moving the ref_root into the btrfs_ref more clear and less error prone than adding a new element to the initialization function. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Josef Bacik	d3fbb00f5e	btrfs: embed data_ref and tree_ref in btrfs_delayed_ref_node We have been embedding btrfs_delayed_ref_node in the btrfs_delayed_data_ref and btrfs_delayed_tree_ref, and then we have two sets of cachep's and a variety of handling that is awkward because of this separation. Instead union these two members inside of btrfs_delayed_ref_node and make that the first class object. This allows us to go down to one cachep for our delayed ref nodes instead of two. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Josef Bacik	0eea355fc0	btrfs: add a helper to get the delayed ref node from the data/tree ref We have several different ways we refer to references throughout the code and it's not consistent and there's a bit of duplication. In order to clean this up I want to have one structure we use to define reference information, and one structure we use for the delayed reference information. Start this process by adding a helper to get from the btrfs_delayed_data_ref/btrfs_delayed_tree_ref to the btrfs_delayed_ref_node so that it'll make moving these structures around simpler. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Filipe Manana	26c0fae3e7	btrfs: use btrfs_find_first_inode() at btrfs_prune_dentries() Currently btrfs_prune_dentries() has open code to find the first inode in a root with a minimum inode number. Remove that code and make it use the helper btrfs_find_first_inode() for that task. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Filipe Manana	5e485ac6f0	btrfs: export find_next_inode() as btrfs_find_first_inode() Export the relocation private helper find_next_inode() to inode.c, as this same logic is also used at btrfs_prune_dentries() and will be used by an upcoming change that adds an extent map shrinker. The next patch will change btrfs_prune_dentries() to use this helper. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Filipe Manana	ed48adf83e	btrfs: simplify add_extent_mapping() by removing pointless label The add_extent_mapping() function is short and trivial, there's no need to have a label for a quick exit in case of an error, even because there's no error handling needed, we just need to return the error. So remove that label and return directly. Also while at it remove the redundant initialization of 'ret', as that may help avoid some warnings with clang tools such as the one reported/fixed by commit `966de47ff0` ("btrfs: remove redundant initialization of variables in log_new_ancestors"). Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Filipe Manana	071533da5f	btrfs: tests: error out on unexpected extent map reference count In the extent map self tests, when freeing all extent maps from a test extent map tree we are not expecting to find any extent map with a reference count different from 1 (the tree reference). If we find any, we just log a message but we don't fail the test, which makes it very easy to miss any bug/regression - no one reads the test messages unless a test fails. So change the behaviour to make a test fail if we find an extent map in the tree with a reference count different from 1. Make the failure happen only after removing all extent maps, so that we don't leak memory. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	0a308f8095	btrfs: pass an inode to btrfs_add_extent_mapping() Instead of passing fs_info and extent map tree arguments to btrfs_add_extent_mapping(), we can pass an inode instead, as extent maps are always inserted in the extent map tree of an inode, and the fs_info can be extracted from the inode (inode->root->fs_info). The only exception is in the self tests where we allocate an extent map tree and then use it to insert/update/remove extent maps. However the tests can be changed to use a test inode and then use the inode's extent map tree. So change btrfs_add_extent_mapping() to have an inode as an argument instead of a fs_info and an extent map tree. This reduces the number of parameters and will also be needed for an upcoming change. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	236e3107fc	btrfs: open code csum_exist_in_range() The csum_exist_in_range() function is now too trivial and is only used in one place, so open code it in its single caller. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	8d2a83a97f	btrfs: make NOCOW checks for existence of checksums in a range more efficient Before deciding if we can do a NOCOW write into a range, one of the things we have to do is check if there are checksum items for that range. We do that through the btrfs_lookup_csums_list() function, which searches for checksums and adds them to a list supplied by the caller. But all we need is to check if there is any checksum, we don't need to look for all of them and collect them into a list, which requires more search time in the checksums tree, allocating memory for checksums items to add to the list, copy checksums from a leaf into those list items, then free that memory, etc. This is all unnecessary overhead, wasting mostly CPU time, and perhaps some occasional IO if we need to read from disk any extent buffers. So change btrfs_lookup_csums_list() to allow to return immediately in case it finds any checksum, without the need to add it to a list and read it from a leaf. This is accomplished by allowing a NULL list parameter and making the function return 1 if it found any checksum, 0 if it didn't found any, and a negative value in case of an error. The following test with fio was used to measure performance: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bssplit=4k/20:8k/20:16k/20:32k/20:64k/20 direct=1 numjobs=16 fallocate=posix time_based runtime=300 [file1] size=8G ioengine=io_uring iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT fio /tmp/fio-job.ini umount $MNT The test was run on a release kernel (Debian's default kernel config). The results before this patch: WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec The results after this patch: WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	fb90e1caf0	btrfs: simplify error path for btrfs_lookup_csums_list() In the error path we have this while loop that keeps iterating over the csums of the list and then delete them from the list and free them, testing for an error (ret < 0) and list emptyness as the conditions of the while loop. Simplify this by using list_for_each_entry_safe() so there's no need to delete elements from the list and need to test the error condition on each iteration. Also rename the 'fail' label to 'out' since the label is not exclusive to a failure path, as we also end up there when the function succeeds, and it's also a more common label name. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	c0dce8b6a3	btrfs: remove use of a temporary list at btrfs_lookup_csums_list() There's no need to use a temporary list to add the checksums, we can just add them to input list and then on error delete and free any checksums that were added. So simplify and remove the temporary list. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	afcb80624f	btrfs: remove search_commit parameter from btrfs_lookup_csums_list() All the callers of btrfs_lookup_csums_list() pass a value of 0 as the "search_commit" parameter. So remove it and make the function behave as to always search from the regular root. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	d800a9065b	btrfs: add function comment to btrfs_lookup_csums_list() Add a function comment to btrfs_lookup_csums_list() to document it. With another upcoming change its parameter list and return value will be less obvious. So add the documentation now so that it can be updated where needed later. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	0ddefc2a7c	btrfs: move btrfs_page_mkwrite() from inode.c into file.c btrfs_page_mkwrite() is a struct vm_operations_struct callback and we define that structure in file.c. Currently the function is in inode.c and has to be exported to be used in file.c, which makes no sense because it's not used anywhere else. So move btrfs_page_mkwrite() from inode.c and into file.c. While at it do a few minor style changes: 1) Capitalize the first word of every comment and end each sentence with punctuation; 2) Avoid splitting some statements into two lines when everything fits in 85 characters or less. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	590e2c4a1e	btrfs: remove no longer used btrfs_clone_chunk_map() There are no more users of btrfs_clone_chunk_map(), the last one (and only one ever) was removed in commit `1ec17ef591` ("btrfs: zoned: fix use-after-free in do_zone_finish()"). So remove btrfs_clone_chunk_map(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	606a1c5de1	btrfs: remove list_empty() check at warn_about_uncommitted_trans() At warn_about_uncommitted_trans(), there's no need to check if the list is empty and return, because list_for_each_entry_safe() is safe to call for an empty list, it simply does nothing. So remove the check. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	47f6944877	btrfs: remove pointless return value assignment at btrfs_finish_one_ordered() At btrfs_finish_one_ordered() it's pointless to assign 0 to the 'ret' variable because if it has a non-zero value (error), we have already jumped to the 'out' label. So remove that redundant assignment. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00
Filipe Manana	2e438442ba	btrfs: remove not needed mod_start and mod_len from struct extent_map The mod_start and mod_len fields of struct extent_map were introduced by commit `4e2f84e63d` ("Btrfs: improve fsync by filtering extents that we want") in order to avoid too low performance when fsyncing a file that keeps getting extent maps merge, because it resulted in each fsync logging again csum ranges that were already merged before. We don't need this anymore as extent maps in the list of modified extents are never merged with other extent maps and once we log an extent map we remove it from the list of modified extent maps, so it's never logged twice. So remove the mod_start and mod_len fields from struct extent_map and use instead the start and len fields when logging checksums in the fast fsync path. This also makes EXTENT_FLAG_FILLING unused so remove it as well. Running the reproducer from the commit mentioned before, with a larger number of extents and against a null block device, so that IO is fast and we can better see any impact from searching checksums items and logging them, gave the following results from dd: Before this change: 409600000 bytes (410 MB, 391 MiB) copied, 22.948 s, 17.8 MB/s After this change: 409600000 bytes (410 MB, 391 MiB) copied, 22.9997 s, 17.8 MB/s So no changes in throughput. The test was done in a release kernel (non-debug, Debian's default kernel config) and its steps are the following: $ mkfs.btrfs -f /dev/nullb0 $ mount /dev/sdb /mnt $ dd if=/dev/zero of=/mnt/foobar bs=4k count=100000 oflag=sync $ umount /mnt This also reduces the size of struct extent_map from 128 bytes down to 112 bytes, so now we can have 36 extents maps per 4K page instead of 32. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00
Boris Burkov	5f2fb819f6	btrfs: free PERTRANS at the end of cleanup_transaction() Some of the operations after the free might convert more PERTRANS metadata. Do the freeing as late as possible to eliminate a source of leaked PERTRANS metadata. This helps with the pass rate of generic/269 and generic/475. Reviewed-by: Qu Wenruo <qwu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00
Qu Wenruo	400b172b8c	btrfs: compression: migrate compression/decompression paths to folios For both compression and decompression paths, we always require a "struct page **pages" and "unsigned long nr_pages", this involves quite some part of the btrfs compression paths: - All the compression entry points - compressed_bio structure This affects both compression and decompression. - async_extent structure Unfortunately with all those involved parts, there is no good way to split the conversion into smaller patches while still passing compiling. So do this in one big conversion in one go. Please note this is direct page->folio conversion, no change on the page sized folio requirement yet. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00
Qu Wenruo	11e03f2f4b	btrfs: introduce btrfs_alloc_folio_array() The new helper will do the same thing as btrfs_alloc_page_array(), but with folios. One extra difference is, there is no extra helper for bulk allocation, thus it may not be as efficient as the page version. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00
Qu Wenruo	ae0d22a7fc	btrfs: migrate insert_inline_extent() to folio interfaces Since insert_inline_extent() now only accepts a single page, it's much easier to convert it to use folio interfaces. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00

91607 Commits