85dc28fa4e
10233 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
|
2f3186d8ee |
btrfs: introduce end_bio_subpage_eb_writepage() function
The new function, end_bio_subpage_eb_writepage(), will handle the metadata writeback endio. The major differences involved are: - How to grab extent buffer Now page::private is a pointer to btrfs_subpage, we can no longer grab extent buffer directly. Thus we need to use the bv_offset to locate the extent buffer manually and iterate through the whole range. - Use btrfs_subpage_end_writeback() caller This helper will handle the subpage writeback for us. Since this function is executed under endio context, when grabbing extent buffers it can't grab eb->refs_lock as that lock is not designed to be grabbed under hardirq context. So here introduce a helper, find_extent_buffer_nolock(), for such situation, and convert find_extent_buffer() to use that helper. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
fb686c6824 |
btrfs: check return value of btrfs_commit_transaction in relocation
There are a few places where we don't check the return value of btrfs_commit_transaction in relocation.c. Thankfully all these places have straightforward error handling, so simply change all of the sites at once. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
24213fa46c |
btrfs: do proper error handling in merge_reloc_roots
We have a BUG_ON() if we get an error back from btrfs_get_fs_root(). This honestly should never fail, as at this point we have a solid coordination of fs root to reloc root, and these roots will all be in memory. But in the name of killing BUG_ON()'s remove these and handle the error condition properly, ASSERT()'ing for developers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
8717cf440d |
btrfs: handle extent corruption with select_one_root properly
In corruption cases we could have paths from a block up to no root at all, and thus we'll BUG_ON(!root) in select_one_root. Handle this by adding an ASSERT() for developers, and returning an error for normal users. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
e0b085b0b0 |
btrfs: cleanup error handling in prepare_to_merge
This probably can't happen even with a corrupt file system, because we would have failed much earlier on than here. However there's no reason we can't just check and bail out as appropriate, so do that and convert the correctness BUG_ON() to an ASSERT(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
57a304cfd4 |
btrfs: do not panic in __add_reloc_root
If we have a duplicate entry for a reloc root then we could have fs corruption that resulted in a double allocation. Since this shouldn't happen unless there is corruption, add an ASSERT(ret != -EEXIST) to all of the callers of __add_reloc_root() to catch any logic mistakes for developers, otherwise normal error handling will happen for normal users. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
3c9258632c |
btrfs: handle __add_reloc_root failures in btrfs_recover_relocation
We can already handle errors appropriately from this function, deal with an error coming from __add_reloc_root appropriately. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
790c1b8cd4 |
btrfs: do proper error handling in create_reloc_inode
We already handle some errors in this function, and the callers do the correct error handling, so clean up the rest of the function to do the appropriate error handling. There's a little extra work that needs to be done here, as we create the inode item before we create the orphan item. We could potentially add the orphan item, but if we failed to create the inode item we would have to abort the transaction. Instead add a helper to delete the inode item we created in the case that we're unable to look up the inode (this would likely be caused by an ENOMEM), which if it succeeds means we can avoid a transaction abort in this particular error case. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
24cd638902 |
btrfs: remove the extent item sanity checks in relocate_block_group
These checks are all taken care of for us by the tree checker code: - the flags don't change or are updated consistently - the v0 extent item format is invalid and caught in many other places too Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
0ebb6bbbd4 |
btrfs: tree-checker: check for BTRFS_BLOCK_FLAG_FULL_BACKREF being set improperly
We need to validate that a data extent item does not have the FULL_BACKREF flag set on its flags. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
eb6b7fb4b5 |
btrfs: handle extent reference errors in do_relocation
We can already deal with errors appropriately from do_relocation, simply handle any errors that come from changing the refs at this point cleanly. We have to abort the transaction if we fail here as we've modified metadata at this point. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
253e258c34 |
btrfs: handle errors in reference count manipulation in replace_path
If any of the reference count manipulation stuff fails in replace_path we need to abort the transaction, as we've modified the blocks already. We can simply break at this point and everything will be cleaned up. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
0e9873e2fe |
btrfs: handle btrfs_search_slot failure in replace_path
The search can fail for various reasons, in case of errors there's no cleanup to be done so we can pass the error to the caller, adjusting for the case where the key is not found and search slot returns 1. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
45b87c5d25 |
btrfs: handle btrfs_cow_block errors in replace_path
If we error out COWing the root node when doing a replace_path then we simply unlock and free the buffer and return the error. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
7a9213a935 |
btrfs: convert logic BUG_ON()'s in replace_path to ASSERT()'s
A few BUG_ON()'s in replace_path are purely to keep us from making logical mistakes, so replace them with ASSERT()'s. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
592fbcd50c |
btrfs: do proper error handling in btrfs_update_reloc_root
We call btrfs_update_root in btrfs_update_reloc_root, which can fail for all sorts of reasons, including IO errors. Instead of panicing the box lets return the error, now that all callers properly handle those errors. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
bbae13f8ab |
btrfs: handle btrfs_update_reloc_root failure in prepare_to_merge
btrfs_update_reloc_root will will return errors in the future, so handle an error properly in prepare_to_merge. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
7934133fae |
btrfs: handle btrfs_update_reloc_root failure in insert_dirty_subvol
btrfs_update_reloc_root will will return errors in the future, so handle the error properly in insert_dirty_subvol. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
ac54da6c37 |
btrfs: change insert_dirty_subvol to return errors
This will be able to return errors in the future, so change it to return an error and handle the errors appropriately. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
2dd8298eb3 |
btrfs: handle btrfs_update_reloc_root failure in commit_fs_roots
btrfs_update_reloc_root will will return errors in the future, so handle the error properly in commit_fs_roots. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
39200e5908 |
btrfs: validate root::reloc_root after recording root in trans
If we fail to setup a root->reloc_root in a different thread that path will error out, however it still leaves root->reloc_root NULL but would still appear set up in the transaction. Subsequent calls to btrfs_record_root_in_transaction would succeed without attempting to create the reloc root, as the transid has already been updated. Handle this case by making sure we have a root->reloc_root set after a btrfs_record_root_in_transaction call so we don't end up dereferencing a NULL pointer. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
84c50ba521 |
btrfs: do proper error handling in create_reloc_root
We do memory allocations here, read blocks from disk, all sorts of operations that could easily fail at any given point. Instead of panicing the box, simply return the error back up the chain, all callers at this point have proper error handling. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
00bb36a0e7 |
btrfs: have proper error handling in btrfs_init_reloc_root
create_reloc_root will return errors in the future, and __add_reloc_root can return ENOMEM or EEXIST, so handle these errors properly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
03a7e111a9 |
btrfs: return an error from btrfs_record_root_in_trans
We can create a reloc root when we record the root in the trans, which can fail for all sorts of different reasons. Propagate this error up the chain of callers. Future patches will fix the callers of btrfs_record_root_in_trans() to handle the error. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
f0118cb6bc |
btrfs: handle record_root_in_trans failure in create_pending_snapshot
record_root_in_trans can currently fail, so handle this failure properly. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
1409e6cc74 |
btrfs: handle record_root_in_trans failure in btrfs_record_root_in_trans
record_root_in_trans can fail currently, handle this failure properly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
1c442d2246 |
btrfs: handle record_root_in_trans failure in qgroup_account_snapshot
record_root_in_trans can fail currently, so handle this failure properly. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
68075ea8d7 |
btrfs: handle btrfs_record_root_in_trans failure in start_transaction
btrfs_record_root_in_trans will return errors in the future, so handle the error properly in start_transaction. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d18c7bd95c |
btrfs: handle btrfs_record_root_in_trans failure in relocate_tree_block
btrfs_record_root_in_trans will return errors in the future, so handle the error properly in relocate_tree_block. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
221581e485 |
btrfs: handle btrfs_record_root_in_trans failure in create_subvol
btrfs_record_root_in_trans will return errors in the future, so handle the error properly in create_subvol. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
2002ae112a |
btrfs: handle btrfs_record_root_in_trans failure in btrfs_recover_log_trees
btrfs_record_root_in_trans will return errors in the future, so handle the error properly in btrfs_recover_log_trees. This appears tricky, however we have a reference count on the destination root, so if this fails we need to continue on in the loop to make sure the proper cleanup is done. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
2731f5186b |
btrfs: handle btrfs_record_root_in_trans failure in btrfs_delete_subvolume
btrfs_record_root_in_trans will return errors in the future, so handle the error properly in btrfs_delete_subvolume. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
b0fec6fd33 |
btrfs: handle btrfs_record_root_in_trans failure in btrfs_rename
btrfs_record_root_in_trans will return errors in the future, so handle the error properly in btrfs_rename. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
00aa8e87c9 |
btrfs: handle btrfs_record_root_in_trans failure in btrfs_rename_exchange
btrfs_record_root_in_trans will return errors in the future, so handle the error properly in btrfs_rename_exchange. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
404bccbcaa |
btrfs: do proper error handling in record_reloc_root_in_trans
Generally speaking this shouldn't ever fail, the corresponding fs root for the reloc root will already be in memory, so we won't get ENOMEM here. However if there is no corresponding root for the reloc root then we could get ENOMEM when we try to allocate it or we could get ENOENT when we look it up and see that it doesn't exist. Convert these BUG_ON()'s into ASSERT()'s and add proper error handling for the case of corruption. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
92de551b83 |
btrfs: check record_root_in_trans related failures in select_reloc_root
We will record the fs root or the reloc root in the trans in select_reloc_root. These will actually return errors in the following patches, so check their return value here and return it up the stack. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
8ee66afe99 |
btrfs: convert BUG_ON()'s in select_reloc_root() to proper errors
We have several BUG_ON()'s in select_reloc_root() that can be tripped if there is an extent tree corruption. Convert these to ASSERT()'s, because if we hit it during testing it really is bad, or could indicate a problem with the backref walking code. However if users hit these problems it generally indicates corruption, I've hit a few machines in the fleet that trip over these with clearly corrupted extent trees, so be nice and print out an error message and return an error instead of bringing the whole box down. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
cbdc2ebc7c |
btrfs: handle errors from select_reloc_root()
Currently select_reloc_root() doesn't return an error, but followup patches will make it possible for it to return an error. We do have proper error recovery in do_relocation however, so handle the possibility of select_reloc_root() having an error properly instead of BUG_ON(!root). I've also adjusted select_reloc_root() to return ERR_PTR(-ENOENT) if we don't find a root, instead of NULL, to make the error case easier to deal with. I've replaced the BUG_ON(!root) with an ASSERT(0) for this case as it indicates we messed up the backref walking code, but it could also indicate corruption. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
1c7bfa159f |
btrfs: convert BUG_ON()'s in relocate_tree_block
We have a couple of BUG_ON()'s in relocate_tree_block() that can be tripped if we have file system corruption. Convert these to ASSERT()'s so developers still get yelled at when they break the backref code, but error out nicely for users so the whole box doesn't go down. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
ffe30dd892 |
btrfs: convert some BUG_ON()'s to ASSERT()'s in do_relocation
A few of these are checking for correctness, and won't be triggered by corrupted file systems, so convert them to ASSERT() instead of BUG_ON() and add a comment explaining their existence. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
32c0a6bcaa |
btrfs: add and use readahead_batch_length
Implement readahead_batch_length() to determine the number of bytes in the current batch of readahead pages and use it in btrfs. Also use the readahead_pos to get the offset. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
183ebab766 |
btrfs: move forward declarations to the beginning of extent_io.h
There are two forward declarations deep in extent_io.h, move them to the beginning and remove the duplicate one. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
894d137818 |
btrfs: subpage: add overview comments
This patch adds an overview how btrfs subpage support works: - limitations - behavior - basic implementation points Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
5a2c60752a |
btrfs: make set_btree_ioerr accept extent buffer and be subpage compatible
Current set_btree_ioerr() only accepts @page parameter and grabs extent buffer from page::private. This works fine for sector size == PAGE_SIZE case, but not for subpage case. Add an extra parameter, @eb, for callers to pass extent buffer to this function, so that subpage code can reuse this function. And also add subpage special handling to update btrfs_subpage::error_bitmap. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
0d27797e92 |
btrfs: make set/clear_extent_buffer_dirty() subpage compatible
For set_extent_buffer_dirty() to support subpage sized metadata, just call btrfs_page_set_dirty() to handle both cases. For clear_extent_buffer_dirty(), it needs to clear the page dirty if and only if all extent buffers in the page range are no longer dirty. Also do the same for page error. This is pretty different from the existing clear_extent_buffer_dirty() routine, so add a new helper function, clear_subpage_extent_buffer_dirty() to do this for subpage metadata. Also since the main part of clearing page dirty code is still the same, extract that into btree_clear_page_dirty() so that it can be utilized for both cases. But there is a special race between set_extent_buffer_dirty() and clear_extent_buffer_dirty(), where we can clear the page dirty. [POSSIBLE RACE WINDOW] For the race window between clear_subpage_extent_buffer_dirty() and set_extent_buffer_dirty(), due to the fact that we can't call clear_page_dirty_for_io() under subpage spin lock, we can race like below: T1 (eb1 in the same page) | T2 (eb2 in the same page) -------------------------------+------------------------------ set_extent_buffer_dirty() | clear_extent_buffer_dirty() |- was_dirty = false; | |- clear_subpagE_extent_buffer_dirty() | | |- btrfs_clear_and_test_dirty() | | | Since eb2 is the last dirty page | | | we got: | | | last == true; | | | |- btrfs_page_set_dirty() | | | We set the page dirty and | | | subpage dirty bitmap | | | | |- if (last) | | | Since we don't have subpage lock | | | held, now @last is no longer | | | correct | | |- btree_clear_page_dirty() | | Now PageDirty == false, even if | | we have dirty_bitmap not zero. |- ASSERT(PageDirty()); | ^^^^ CRASH The solution here is to also lock the eb->pages[0] for subpage case of set_extent_buffer_dirty(), to prevent racing with clear_extent_buffer_dirty(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
b8f957715e |
btrfs: support page uptodate assertions in subpage mode
There are quite some assert checks on page uptodate in extent buffer write accessors. They ensure the destination page is already uptodate. This is fine for regular sector size case, but not for subpage case, as for subpage we only mark the page uptodate if the page contains no hole and all its extent buffers are uptodate. So instead of checking PageUptodate(), for subpage case we check the uptodate bitmap of btrfs_subpage structure. To make the check more elegant, introduce a helper, assert_eb_page_uptodate() to do the check for both subpage and regular sector size cases. The following functions are involved: - write_extent_buffer_chunk_tree_uuid() - write_extent_buffer_fsid() - write_extent_buffer() - memzero_extent_buffer() - copy_extent_buffer() - extent_buffer_test_bit() - extent_buffer_bitmap_set() - extent_buffer_bitmap_clear() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
1e5eb3d6a4 |
btrfs: make alloc_extent_buffer() check subpage dirty bitmap
In alloc_extent_buffer(), we make sure that the newly allocated page is never dirty. This is fine for sector size == PAGE_SIZE case, but for subpage it's possible that one extent buffer in the page is dirty, thus the whole page is marked dirty, and could cause false alert. To support subpage, call btrfs_page_test_dirty() to handle both cases. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
eca0f6f643 |
btrfs: subpage: support metadata checksum calculation at write time
Add a new helper, csum_dirty_subpage_buffers(), to iterate through all dirty extent buffers in one bvec. Also extract the code of calculating csum for one extent buffer into csum_one_extent_buffer(), so that both the existing csum_dirty_buffer() and the new csum_dirty_subpage_buffers() can reuse the same routine. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
139e8cd325 |
btrfs: subpage: do more sanity checks on metadata page dirtying
For btree_set_page_dirty(), we should also check the extent buffer sanity for subpage support. Unlike the regular sector size case, since one page can contain multiple extent buffers, we need to make sure there is at least one dirty extent buffer in the page. So this patch will iterate through the btrfs_subpage::dirty_bitmap to get the extent buffers, and check if any dirty extent buffer in the page range has EXTENT_BUFFER_DIRTY and proper refs. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
3470da3b7d |
btrfs: subpage: introduce helpers for writeback status
Introduces the following functions to handle subpage writeback status: - btrfs_subpage_set_writeback() - btrfs_subpage_clear_writeback() - btrfs_subpage_test_writeback() These helpers can only be called when the range is ensured to be inside the page. - btrfs_page_set_writeback() - btrfs_page_clear_writeback() - btrfs_page_test_writeback() These helpers can handle both regular sector size and subpage without problem. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d8a5713e89 |
btrfs: subpage: introduce helpers for dirty status
Introduce the following functions to handle subpage dirty status: - btrfs_subpage_set_dirty() - btrfs_subpage_clear_dirty() - btrfs_subpage_test_dirty() These helpers can only be called when the range is ensured to be inside the page. - btrfs_page_set_dirty() - btrfs_page_clear_dirty() - btrfs_page_test_dirty() These helpers can handle both regular sector size and subpage without problem. Thus they would be used to replace PageDirty() related calls in later patches. There is one special point to note here, just like set_page_dirty() and clear_page_dirty_for_io(), btrfs_*page_set_dirty() and btrfs_*page_clear_dirty() must be called with page locked. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d239bcb83b |
btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
In btrfs_invalidatepage() we re-declare @tree variable as btrfs_ordered_inode_tree. Since it's only used to do the spinlock, we can grab it from inode directly, and remove the unnecessary declaration completely. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
ac5804eb85 |
btrfs: use min() to replace open-code in btrfs_invalidatepage()
In btrfs_invalidatepage() we introduce a temporary variable, new_len, to update ordered->truncated_len. But we can use min() to replace it completely and no need for the variable. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
fc57ad8d33 |
btrfs: add sysfs interface for supported sectorsize
Export supported sector sizes in /sys/fs/btrfs/features/supported_sectorsizes. Currently all architectures have PAGE_SIZE, There's some disparity between read-only and read-write support but that will be unified in the future so there's only one file exporting the size. The read-only support for systems with 64K pages also works for 4K sector size. This new sysfs interface would help eg. mkfs.btrfs to print more accurate warnings about potentially incompatible option combinations. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
ace75066ce |
btrfs: improve btree readahead for full send operations
Currently a full send operation uses the standard btree readahead when iterating over the subvolume/snapshot btree, which despite bringing good performance benefits, it could be improved in a few aspects for use cases such as full send operations, which are guaranteed to visit every node and leaf of a btree, in ascending and sequential order. The limitations of that standard btree readahead implementation are the following: 1) It only triggers readahead for leaves that are physically close to the leaf being read, within a 64K range; 2) It only triggers readahead for the next or previous leaves if the leaf being read is not currently in memory; 3) It never triggers readahead for nodes. So add a new readahead mode that addresses all these points and use it for full send operations. The following test script was used to measure the improvement on a box using an average, consumer grade, spinning disk and with 16GiB of RAM: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null mount $MOUNT_OPTIONS $DEV $MNT # Create files with inline data to make it easier and faster to create # large btrees. add_files() { local total=$1 local start_offset=$2 local number_jobs=$3 local total_per_job=$(($total / $number_jobs)) echo "Creating $total new files using $number_jobs jobs" for ((n = 0; n < $number_jobs; n++)); do ( local start_num=$(($start_offset + $n * $total_per_job)) for ((i = 1; i <= $total_per_job; i++)); do local file_num=$((start_num + $i)) local file_path="$MNT/file_${file_num}" xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null if [ $? -ne 0 ]; then echo "Failed creating file $file_path" break fi done ) & worker_pids[$n]=$! done wait ${worker_pids[@]} sync echo echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)" } initial_file_count=500000 add_files $initial_file_count 0 4 echo echo "Creating first snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap1 echo echo "Adding more files..." add_files $((initial_file_count / 4)) $initial_file_count 4 echo echo "Updating 1/50th of the initial files..." for ((i = 1; i < $initial_file_count; i += 50)); do xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null done echo echo "Creating second snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap2 umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing full send..." start=$(date +%s) btrfs send $MNT/snap1 > /dev/null end=$(date +%s) echo echo "Full send took $((end - start)) seconds" umount $MNT The durations of the full send operation in seconds were the following: Before this change: 217 seconds After this change: 205 seconds (-5.7%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
eafa4fd0ad |
btrfs: fix exhaustion of the system chunk array due to concurrent allocations
When we are running out of space for updating the chunk tree, that is, when we are low on available space in the system space info, if we have many task concurrently allocating block groups, via fallocate for example, many of them can end up all allocating new system chunks when only one is needed. In extreme cases this can lead to exhaustion of the system chunk array, which has a size limit of 2048 bytes, and results in a transaction abort with errno EFBIG, producing a trace in dmesg like the following, which was triggered on a PowerPC machine with a node/leaf size of 64K: [1359.518899] ------------[ cut here ]------------ [1359.518980] BTRFS: Transaction aborted (error -27) [1359.519135] WARNING: CPU: 3 PID: 16463 at ../fs/btrfs/block-group.c:1968 btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs] [1359.519152] Modules linked in: (...) [1359.519239] Supported: Yes, External [1359.519252] CPU: 3 PID: 16463 Comm: stress-ng Tainted: G X 5.3.18-47-default #1 SLE15-SP3 [1359.519274] NIP: c008000000e36fe8 LR: c008000000e36fe4 CTR: 00000000006de8e8 [1359.519293] REGS: c00000056890b700 TRAP: 0700 Tainted: G X (5.3.18-47-default) [1359.519317] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48008222 XER: 00000007 [1359.519356] CFAR: c00000000013e170 IRQMASK: 0 [1359.519356] GPR00: c008000000e36fe4 c00000056890b990 c008000000e83200 0000000000000026 [1359.519356] GPR04: 0000000000000000 0000000000000000 0000d52a3b027651 0000000000000007 [1359.519356] GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000000 [1359.519356] GPR12: 0000000000008000 c00000063fe44600 000000001015e028 000000001015dfd0 [1359.519356] GPR16: 000000000000404f 0000000000000001 0000000000010000 0000dd1e287affff [1359.519356] GPR20: 0000000000000001 c000000637c9a000 ffffffffffffffe5 0000000000000000 [1359.519356] GPR24: 0000000000000004 0000000000000000 0000000000000100 ffffffffffffffc0 [1359.519356] GPR28: c000000637c9a000 c000000630e09230 c000000630e091d8 c000000562188b08 [1359.519561] NIP [c008000000e36fe8] btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs] [1359.519613] LR [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs] [1359.519626] Call Trace: [1359.519671] [c00000056890b990] [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs] (unreliable) [1359.519729] [c00000056890ba90] [c008000000d68d44] __btrfs_end_transaction+0xbc/0x2f0 [btrfs] [1359.519782] [c00000056890bae0] [c008000000e309ac] btrfs_alloc_data_chunk_ondemand+0x154/0x610 [btrfs] [1359.519844] [c00000056890bba0] [c008000000d8a0fc] btrfs_fallocate+0xe4/0x10e0 [btrfs] [1359.519891] [c00000056890bd00] [c0000000004a23b4] vfs_fallocate+0x174/0x350 [1359.519929] [c00000056890bd50] [c0000000004a3cf8] ksys_fallocate+0x68/0xf0 [1359.519957] [c00000056890bda0] [c0000000004a3da8] sys_fallocate+0x28/0x40 [1359.519988] [c00000056890bdc0] [c000000000038968] system_call_exception+0xe8/0x170 [1359.520021] [c00000056890be20] [c00000000000cb70] system_call_common+0xf0/0x278 [1359.520037] Instruction dump: [1359.520049] 7d0049ad 40c2fff4 7c0004ac 71490004 40820024 2f83fffb 419e0048 3c620000 [1359.520082] e863bcb8 7ec4b378 48010d91 e8410018 <0fe00000> 3c820000 e884bcc8 7ec6b378 [1359.520122] ---[ end trace d6c186e151022e20 ]--- The following steps explain how we can end up in this situation: 1) Task A is at check_system_chunk(), either because it is allocating a new data or metadata block group, at btrfs_chunk_alloc(), or because it is removing a block group or turning a block group RO. It does not matter why; 2) Task A sees that there is not enough free space in the system space_info object, that is 'left' is < 'thresh'. And at this point the system space_info has a value of 0 for its 'bytes_may_use' counter; 3) As a consequence task A calls btrfs_alloc_chunk() in order to allocate a new system block group (chunk) and then reserves 'thresh' bytes in the chunk block reserve with the call to btrfs_block_rsv_add(). This changes the chunk block reserve's 'reserved' and 'size' counters by an amount of 'thresh', and changes the 'bytes_may_use' counter of the system space_info object from 0 to 'thresh'. Also during its call to btrfs_alloc_chunk(), we end up increasing the value of the 'total_bytes' counter of the system space_info object by 8MiB (the size of a system chunk stripe). This happens through the call chain: btrfs_alloc_chunk() create_chunk() btrfs_make_block_group() btrfs_update_space_info() 4) After it finishes the first phase of the block group allocation, at btrfs_chunk_alloc(), task A unlocks the chunk mutex; 5) At this point the new system block group was added to the transaction handle's list of new block groups, but its block group item, device items and chunk item were not yet inserted in the extent, device and chunk trees, respectively. That only happens later when we call btrfs_finish_chunk_alloc() through a call to btrfs_create_pending_block_groups(); Note that only when we update the chunk tree, through the call to btrfs_finish_chunk_alloc(), we decrement the 'reserved' counter of the chunk block reserve as we COW/allocate extent buffers, through: btrfs_alloc_tree_block() btrfs_use_block_rsv() btrfs_block_rsv_use_bytes() And the system space_info's 'bytes_may_use' is decremented everytime we allocate an extent buffer for COW operations on the chunk tree, through: btrfs_alloc_tree_block() btrfs_reserve_extent() find_free_extent() btrfs_add_reserved_bytes() If we end up COWing less chunk btree nodes/leaves than expected, which is the typical case since the amount of space we reserve is always pessimistic to account for the worst possible case, we release the unused space through: btrfs_create_pending_block_groups() btrfs_trans_release_chunk_metadata() btrfs_block_rsv_release() block_rsv_release_bytes() btrfs_space_info_free_bytes_may_use() But before task A gets into btrfs_create_pending_block_groups()... 6) Many other tasks start allocating new block groups through fallocate, each one does the first phase of block group allocation in a serialized way, since btrfs_chunk_alloc() takes the chunk mutex before calling check_system_chunk() and btrfs_alloc_chunk(). However before everyone enters the final phase of the block group allocation, that is, before calling btrfs_create_pending_block_groups(), new tasks keep coming to allocate new block groups and while at check_system_chunk(), the system space_info's 'bytes_may_use' keeps increasing each time a task reserves space in the chunk block reserve. This means that eventually some other task can end up not seeing enough free space in the system space_info and decide to allocate yet another system chunk. This may repeat several times if yet more new tasks keep allocating new block groups before task A, and all the other tasks, finish the creation of the pending block groups, which is when reserved space in excess is released. Eventually this can result in exhaustion of system chunk array in the superblock, with btrfs_add_system_chunk() returning EFBIG, resulting later in a transaction abort. Even when we don't reach the extreme case of exhausting the system array, most, if not all, unnecessarily created system block groups end up being unused since when finishing creation of the first pending system block group, the creation of the following ones end up not needing to COW nodes/leaves of the chunk tree, so we never allocate and deallocate from them, resulting in them never being added to the list of unused block groups - as a consequence they don't get deleted by the cleaner kthread - the only exceptions are if we unmount and mount the filesystem again, which adds any unused block groups to the list of unused block groups, if a scrub is run, which also adds unused block groups to the unused list, and under some circumstances when using a zoned filesystem or async discard, which may also add unused block groups to the unused list. So fix this by: *) Tracking the number of reserved bytes for the chunk tree per transaction, which is the sum of reserved chunk bytes by each transaction handle currently being used; *) When there is not enough free space in the system space_info, if there are other transaction handles which reserved chunk space, wait for some of them to complete in order to have enough excess reserved space released, and then try again. Otherwise proceed with the creation of a new system chunk. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
b7a7a83463 |
btrfs: make reflinks respect O_SYNC O_DSYNC and S_SYNC flags
If we reflink to or from a file opened with O_SYNC/O_DSYNC or to/from a file that has the S_SYNC attribute set, we totally ignore that and do not durably persist the reflink changes. Since a reflink can change the data readable from a file (and mtime/ctime, or a file size), it makes sense to durably persist (fsync) the source and destination files/ranges. This was previously discussed at: https://lore.kernel.org/linux-btrfs/20200903035225.GJ6090@magnolia/ The recently introduced test case generic/628, from fstests, exercises these scenarios and currently fails without this change. So make sure we fsync the source and destination files/ranges when either of them was opened with O_SYNC/O_DSYNC or has the S_SYNC attribute set, just like XFS already does. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
bb05b298af |
btrfs: zoned: bail out in btrfs_alloc_chunk for bad input
gcc complains that the ctl->max_chunk_size member might be used
uninitialized when none of the three conditions for initializing it in
init_alloc_chunk_ctl_policy_zoned() are true:
In function ‘init_alloc_chunk_ctl_policy_zoned’,
inlined from ‘init_alloc_chunk_ctl’ at fs/btrfs/volumes.c:5023:3,
inlined from ‘btrfs_alloc_chunk’ at fs/btrfs/volumes.c:5340:2:
include/linux/compiler-gcc.h:48:45: error: ‘ctl.max_chunk_size’ may be used uninitialized [-Werror=maybe-uninitialized]
4998 | ctl->max_chunk_size = min(limit, ctl->max_chunk_size);
| ^~~
fs/btrfs/volumes.c: In function ‘btrfs_alloc_chunk’:
fs/btrfs/volumes.c:5316:32: note: ‘ctl’ declared here
5316 | struct alloc_chunk_ctl ctl;
| ^~~
If we ever get into this condition, something is seriously
wrong, as validity is checked in the callers
btrfs_alloc_chunk
init_alloc_chunk_ctl
init_alloc_chunk_ctl_policy_zoned
so the same logic as in init_alloc_chunk_ctl_policy_regular()
and a few other places should be applied. This avoids both further
data corruption, and the compile-time warning.
Fixes:
|
||
|
3227788cd3 |
btrfs: fix a potential hole punching failure
In commit |
||
|
e75f9fd194 |
btrfs: zoned: move log tree node allocation out of log_root_tree->log_mutex
Commit |
||
|
2cdb3909c9 |
btrfs: use percpu_read_positive instead of sum_positive for need_preempt
Looking at perf data for a fio workload I noticed that we were spending a pretty large chunk of time (around 5%) doing percpu_counter_sum() in need_preemptive_reclaim. This is silly, as we only want to know if we have more ordered than delalloc to see if we should be counting the delayed items in our threshold calculation. Change this to percpu_read_positive() to avoid the overhead. I ran this through fsperf to validate the changes, obviously the latency numbers in dbench and fio are quite jittery, so take them as you wish, but overall the improvements on throughput, iops, and bw are all positive. Each test was run two times, the given value is the average of both runs for their respective column. btrfs ssd normal test results bufferedrandwrite16g results metric baseline current diff ========================================================== write_io_kbytes 16777216 16777216 0.00% read_clat_ns_p99 0 0 0.00% write_bw_bytes 1.04e+08 1.05e+08 1.12% read_iops 0 0 0.00% write_clat_ns_p50 13888 11840 -14.75% read_io_kbytes 0 0 0.00% read_io_bytes 0 0 0.00% write_clat_ns_p99 35008 29312 -16.27% read_bw_bytes 0 0 0.00% elapsed 170 167 -1.76% write_lat_ns_min 4221.50 3762.50 -10.87% sys_cpu 39.65 35.37 -10.79% write_lat_ns_max 2.67e+10 2.50e+10 -6.63% read_lat_ns_min 0 0 0.00% write_iops 25270.10 25553.43 1.12% read_lat_ns_max 0 0 0.00% read_clat_ns_p50 0 0 0.00% dbench60 results metric baseline current diff ================================================== qpathinfo 11.12 12.73 14.52% throughput 416.09 445.66 7.11% flush 3485.63 1887.55 -45.85% qfileinfo 0.70 1.92 173.86% ntcreatex 992.60 695.76 -29.91% qfsinfo 2.43 3.71 52.48% close 1.67 3.14 88.09% sfileinfo 66.54 105.20 58.10% rename 809.23 619.59 -23.43% find 16.88 15.46 -8.41% unlink 820.54 670.86 -18.24% writex 3375.20 2637.91 -21.84% deltree 386.33 449.98 16.48% readx 3.43 3.41 -0.60% mkdir 0.05 0.03 -38.46% lockx 0.26 0.26 -0.76% unlockx 0.81 0.32 -60.33% dio4kbs16threads results metric baseline current diff ================================================================ write_io_kbytes 5249676 3357150 -36.05% read_clat_ns_p99 0 0 0.00% write_bw_bytes 89583501.50 57291192.50 -36.05% read_iops 0 0 0.00% write_clat_ns_p50 242688 263680 8.65% read_io_kbytes 0 0 0.00% read_io_bytes 0 0 0.00% write_clat_ns_p99 15826944 36732928 132.09% read_bw_bytes 0 0 0.00% elapsed 61 61 0.00% write_lat_ns_min 42704 42095 -1.43% sys_cpu 5.27 3.45 -34.52% write_lat_ns_max 7.43e+08 9.27e+08 24.71% read_lat_ns_min 0 0 0.00% write_iops 21870.97 13987.11 -36.05% read_lat_ns_max 0 0 0.00% read_clat_ns_p50 0 0 0.00% randwrite2xram results metric baseline current diff ================================================================ write_io_kbytes 24831972 28876262 16.29% read_clat_ns_p99 0 0 0.00% write_bw_bytes 83745273.50 92182192.50 10.07% read_iops 0 0 0.00% write_clat_ns_p50 13952 11648 -16.51% read_io_kbytes 0 0 0.00% read_io_bytes 0 0 0.00% write_clat_ns_p99 50176 52992 5.61% read_bw_bytes 0 0 0.00% elapsed 314 332 5.73% write_lat_ns_min 5920.50 5127 -13.40% sys_cpu 7.82 7.35 -6.07% write_lat_ns_max 5.27e+10 3.88e+10 -26.44% read_lat_ns_min 0 0 0.00% write_iops 20445.62 22505.42 10.07% read_lat_ns_max 0 0 0.00% read_clat_ns_p50 0 0 0.00% untarfirefox results metric baseline current diff ============================================== elapsed 47.41 47.40 -0.03% Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
e2b84217f3 |
btrfs: update outdated comment at btrfs_replace_file_extents()
There is a comment at btrfs_replace_file_extents() that mentions that we
set the full sync flag on an inode when cloning into a file with a size
greater than or equals to 16MiB, through try_release_extent_mapping() when
we truncate the page cache after replacing file extents during a clone
operation.
That is not true anymore since commit
|
||
|
0c0218e9a6 |
btrfs: update outdated comment at btrfs_orphan_cleanup()
btrfs_orphan_cleanup() has a comment referring to find_dead_roots, but
function does not exists since commit
|
||
|
ffbc10a144 |
btrfs: update debug message when checking seq number of a delayed ref
We used to encode two different numbers in the tree mod log counter used
for sequence numbers, one in the upper 32 bits and the other one in the
lower 32 bits. However that is no longer the case, we stopped doing that
since commit
|
||
|
4bae788075 |
btrfs: add and use helper to get lowest sequence number for the tree mod log
There are two places outside the tree mod log module that extract the lowest sequence number of the tree mod log. These places end up duplicating code and open coding the logic and internal implementation details of the tree mod log. So add a helper to the tree mod log module and header that returns the lowest sequence number or 0 if there aren't any tree mod log users at the moment. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
ffe1d039d7 |
btrfs: remove unnecessary leaf check at btrfs_tree_mod_log_free_eb()
At btrfs_tree_mod_log_free_eb() we check if we are dealing with a leaf, and if so, return immediately and do nothing. However this check can be removed, because after it we call tree_mod_need_log(), which returns false when given an extent buffer that corresponds to a leaf. So just remove the leaf check and pass the extent buffer to tree_mod_need_log(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
888dd18339 |
btrfs: use the new bit BTRFS_FS_TREE_MOD_LOG_USERS at btrfs_free_tree_block()
Instead of exposing implementation details of the tree mod log to check if there are active tree mod log users at btrfs_free_tree_block(), use the new bit BTRFS_FS_TREE_MOD_LOG_USERS for fs_info->flags instead. This way extent-tree.c does not need to known about any of the internals of the tree mod log and avoids taking a lock unnecessarily as well. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
bc03f39ec3 |
btrfs: use a bit to track the existence of tree mod log users
The tree modification log functions are called very frequently, basically they are called every time a btree is modified (a pointer added or removed to a node, a new root for a btree is set, etc). Because of that, to avoid heavy lock contention on the lock that protects the list of tree mod log users, we have checks that test the emptiness of the list with a full memory barrier before the checks, so that when there are no tree mod log users we avoid taking the lock. Replace the memory barrier and list emptiness check with a test for a new bit set at fs_info->flags. This bit is used to indicate when there are tree mod log users, set whenever a user is added to the list and cleared when the last user is removed from the list. This makes the intention a bit more obvious and possibly more efficient (assuming test_bit() may be cheaper than a full memory barrier on some architectures). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
406808ab2f |
btrfs: use booleans where appropriate for the tree mod log functions
Several functions of the tree modification log use integers as booleans, so change them to use booleans instead, making their use more clear. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
f3a84ccd28 |
btrfs: move the tree mod log code into its own file
The tree modification log, which records modifications done to btrees, is quite large and currently spread all over ctree.c, which is a huge file already. To make things better organized, move all that code into its own separate source and header files. Functions and definitions that are used outside of the module (mostly by ctree.c) are renamed so that they start with a "btrfs_" prefix. Everything else remains unchanged. This makes it easier to go over the tree modification log code every time I need to go read it to fix a bug. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
9a002d531b |
btrfs: integrity-checker: convert block context kmap's to kmap_local_page
btrfsic_read_block() (which calls kmap()) and btrfsic_release_block_ctx() (which calls kunmap()) are always called within a single thread of execution. Therefore the mappings created within these calls can be a thread local mapping. Convert the kmap() of bloc_ctx->pagev to kmap_local_page(). Luckily the unmap loops backwards through the array pointer so no adjustment needs to be made to the unmapping order. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
3e037efdbd |
btrfs: integrity-checker: use kmap_local_page in __btrfsic_submit_bio
Again there is an array of pointers which must be unmapped in the correct order. Convert the kmap()'s to kmap_local_page() and adjust the unmapping to work backwards through the unmapping loop. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
94a0b58d2d |
btrfs: raid56: convert kmaps to kmap_local_page
These kmaps are thread local and don't need to be atomic. So they can use the more efficient kmap_local_page(). However, the mapping of pages in the stripes and the additional parity and qstripe pages are a bit trickier because the unmapping must occur in the opposite order from the mapping. Furthermore, the pointer array in __raid_recover_end_io() may get reordered. Convert these calls to kmap_local_page() taking care to reverse the unmappings of any page arrays as well as being careful with the mappings of any special pages such as the parity and qstripe pages. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
58c1a35cd5 |
btrfs: convert kmap to kmap_local_page, simple cases
Use a simple coccinelle script to help convert the most common kmap()/kunmap() patterns to kmap_local_page()/kunmap_local(). Note that some kmaps which were caught by this script needed to be handled by hand because of the strict unmapping order of kunmap_local() so they are not included in this patch. But this script got us started. There's another temp variable added for the final length write to the first page so it does not interfere with cpage_out that is used for mapping other pages. The development of this patch was aided by the follow script: // <smpl> // SPDX-License-Identifier: GPL-2.0-only // Find kmap and replace with kmap_local_page then mark kunmap // // Confidence: Low // Copyright: (C) 2021 Intel Corporation // URL: http://coccinelle.lip6.fr/ @ catch_all @ expression e, e2; @@ ( -kmap(e) +kmap_local_page(e) ) ... ( -kunmap(...) +kunmap_local() ) // </smpl> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
cea628008f |
btrfs: remove duplicated in_range() macro
The in_range() macro is defined twice in btrfs' source, once in ctree.h and once in misc.h. Remove the definition in ctree.h and include misc.h in the files depending on it. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
209ecbb858 |
btrfs: remove stale comment and logic from btrfs_inode_in_log()
Currently btrfs_inode_in_log() checks the list of modified extents of the inode, and has a comment mentioning why, as it used to be necessary to make sure if we did something like the following: mmap write range A mmap write range B msync range A (ranged fsync) msync range B (ranged fsync) we ended up with both ranges being logged. If we did not check it, then the second fsync would do nothing because btrfs_inode_in_log() would return true. This was added in |
||
|
bc0939fcfa |
btrfs: fix race between marking inode needs to be logged and log syncing
We have a race between marking that an inode needs to be logged, either at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between btrfs_sync_log(). The following steps describe how the race happens. 1) We are at transaction N; 2) Inode I was previously fsynced in the current transaction so it has: inode->logged_trans set to N; 3) The inode's root currently has: root->log_transid set to 1 root->last_log_commit set to 0 Which means only one log transaction was committed to far, log transaction 0. When a log tree is created we set ->log_transid and ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree()); 4) One more range of pages is dirtied in inode I; 5) Some task A starts an fsync against some other inode J (same root), and so it joins log transaction 1. Before task A calls btrfs_sync_log()... 6) Task B starts an fsync against inode I, which currently has the full sync flag set, so it starts delalloc and waits for the ordered extent to complete before calling btrfs_inode_in_log() at btrfs_sync_file(); 7) During ordered extent completion we have btrfs_update_inode() called against inode I, which in turn calls btrfs_set_inode_last_trans(), which does the following: spin_lock(&inode->lock); inode->last_trans = trans->transaction->transid; inode->last_sub_trans = inode->root->log_transid; inode->last_log_commit = inode->root->last_log_commit; spin_unlock(&inode->lock); So ->last_trans is set to N and ->last_sub_trans set to 1. But before setting ->last_log_commit... 8) Task A is at btrfs_sync_log(): - it increments root->log_transid to 2 - starts writeback for all log tree extent buffers - waits for the writeback to complete - writes the super blocks - updates root->last_log_commit to 1 It's a lot of slow steps between updating root->log_transid and root->last_log_commit; 9) The task doing the ordered extent completion, currently at btrfs_set_inode_last_trans(), then finally runs: inode->last_log_commit = inode->root->last_log_commit; spin_unlock(&inode->lock); Which results in inode->last_log_commit being set to 1. The ordered extent completes; 10) Task B is resumed, and it calls btrfs_inode_in_log() which returns true because we have all the following conditions met: inode->logged_trans == N which matches fs_info->generation && inode->last_subtrans (1) <= inode->last_log_commit (1) && inode->last_subtrans (1) <= root->last_log_commit (1) && list inode->extent_tree.modified_extents is empty And as a consequence we return without logging the inode, so the existing logged version of the inode does not point to the extent that was written after the previous fsync. It should be impossible in practice for one task be able to do so much progress in btrfs_sync_log() while another task is at btrfs_set_inode_last_trans() right after it reads root->log_transid and before it reads root->last_log_commit. Even if kernel preemption is enabled we know the task at btrfs_set_inode_last_trans() can not be preempted because it is holding the inode's spinlock. However there is another place where we do the same without holding the spinlock, which is in the memory mapped write path at: vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) { (...) BTRFS_I(inode)->last_trans = fs_info->generation; BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid; BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit; (...) So with preemption happening after setting ->last_sub_trans and before setting ->last_log_commit, it is less of a stretch to have another task do enough progress at btrfs_sync_log() such that the task doing the memory mapped write ends up with ->last_sub_trans and ->last_log_commit set to the same value. It is still a big stretch to get there, as the task doing btrfs_sync_log() has to start writeback, wait for its completion and write the super blocks. So fix this in two different ways: 1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the value of ->last_sub_trans minus 1; 2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just like we do for buffered and direct writes at btrfs_file_write_iter(), which is all we need to make sure multiple writes and fsyncs to an inode in the same transaction never result in an fsync missing that the inode changed and needs to be logged. Turn this into a helper function and use it both at btrfs_page_mkwrite() and at btrfs_file_write_iter() - this also fixes the problem that at btrfs_page_mkwrite() we were setting those fields without the protection of the inode's spinlock. This is an extremely unlikely race to happen in practice. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
885f46d87f |
btrfs: fix race between memory mapped writes and fsync
When doing an fsync we flush all delalloc, lock the inode (VFS lock), flush
any new delalloc that might have been created before taking the lock and
then wait either for the ordered extents to complete or just for the
writeback to complete (depending on whether the full sync flag is set or
not). We then start logging the inode and assume that while we are doing it
no one else is touching the inode's file extent items (or adding new ones).
That is generally true because all operations that modify an inode acquire
the inode's lock first, including buffered and direct IO writes. However
there is one exception: memory mapped writes, which do not and can not
acquire the inode's lock.
This can cause two types of issues: ending up logging file extent items
with overlapping ranges, which is detected by the tree checker and will
result in aborting the transaction when starting writeback for a log
tree's extent buffers, or a silent corruption where we log a version of
the file that never existed.
Scenario 1 - logging overlapping extents
The following steps explain how we can end up with file extents items with
overlapping ranges in a log tree due to a race between a fsync and memory
mapped writes:
1) Task A starts an fsync on inode X, which has the full sync runtime flag
set. First it starts by flushing all delalloc for the inode;
2) Task A then locks the inode and flushes any other delalloc that might
have been created after the previous flush and waits for all ordered
extents to complete;
3) In the inode's root we have the following leaf:
Leaf N, generation == current transaction id:
---------------------------------------------------------
| (...) [ file extent item, offset 640K, length 128K ] |
---------------------------------------------------------
The last file extent item in leaf N covers the file range from 640K to
768K;
4) Task B does a memory mapped write for the page corresponding to the
file range from 764K to 768K;
5) Task A starts logging the inode. At copy_inode_items_to_log() it uses
btrfs_search_forward() to search for leafs modified in the current
transaction that contain items for the inode. It finds leaf N and copies
all the inode items from that leaf into the log tree.
Now the log tree has a copy of the last file extent item from leaf N.
At the end of the while loop at copy_inode_items_to_log(), we have the
minimum key set to:
min_key.objectid = <inode X number>
min_key.type = BTRFS_EXTENT_DATA_KEY
min_key.offset = 640K
Then we increment the key's offset by 1 so that the next call to
btrfs_search_forward() leaves us at the first key greater than the key
we just processed.
But before btrfs_search_forward() is called again...
6) Dellaloc for the page at offset 764K, dirtied by task B, is started.
It can be started for several reasons:
- The async reclaim task is attempting to satisfy metadata or data
reservation requests, and it has reached a point where it decided
to flush delalloc;
- Due to memory pressure the VMM triggers writeback of dirty pages;
- The system call sync_file_range(2) is called from user space.
7) When the respective ordered extent completes, it trims the length of
the existing file extent item for file offset 640K from 128K to 124K,
and a new file extent item is added with a key offset of 764K and a
length of 4K;
8) Task A calls btrfs_search_forward(), which returns us a path pointing
to the leaf (can be leaf N or some other) containing the new file extent
item for file offset 764K.
We end up copying this item to the log tree, which overlaps with the
last copied file extent item, which covers the file range from 640K to
768K.
When writeback is triggered for log tree's extent buffers, the issue
will be detected by the tree checker which will dump a trace and an
error message on dmesg/syslog. If the writeback is triggered when
syncing the log, which typically is, then we also end up aborting the
current transaction.
This is the same type of problem fixed in
|
||
|
8d9b4a162a |
btrfs: exclude mmap from happening during all fallocate operations
There's a small window where a deadlock can happen between fallocate and mmap. This is described in detail by Filipe: """ When doing a fallocate operation we lock the inode, flush delalloc within the target range, wait for any ordered extents to complete and then lock the file range. Before we lock the range and after we flush delalloc, there is a time window where another task can come in and do a memory mapped write for a page within the fallocate range. This means that after fallocate locks the range, there can be a dirty page in the range. More often than not, this does not cause any problem. The exception is when we are low on available metadata space, because an fallocate operation needs to start a transaction while holding the file range locked, either through btrfs_prealloc_file_range() or through the call to btrfs_fallocate_update_isize(). If that's the case, we can end up in a deadlock. The following list of steps explains how that happens: 1) A fallocate operation starts, locks the inode, flushes delalloc in the range and waits for ordered extents in the range to complete; 2) Before the fallocate task locks the file range, another task does a memory mapped write for a page in the fallocate target range. This is possible since memory mapped writes do not (and can not) lock the inode; 3) The fallocate task locks the file range. At this point there is one dirty page in the range (due to the memory mapped write); 4) When the fallocate task attempts to start a transaction, it blocks when attempting to reserve metadata space, since we are low on available metadata space. Before blocking (wait on its reservation ticket), it starts the async reclaim task (if not running already); 5) The async reclaim task is not able to release space through any other means, so it decides to flush delalloc for inodes with dirty pages. It finds that the inode used in the fallocate operation has a dirty page and therefore queues a job (fs_info->flush_workers workqueue) to flush delalloc for that inode and waits on that job to complete; 6) The flush job blocks when attempting to lock the file range because it is currently locked by the fallocate task; 7) The fallocate task keeps waiting for its metadata reservation, waiting for a wakeup on its reservation ticket. The async reclaim task is waiting on the flush job, which in turn is waiting for locking the file range that is currently locked by the fallocate task. So unless some other task is able to release enough metadata space, for example an ordered extent for some other inode completes, we end up in a deadlock between all these tasks. When this happens stack traces like the following show up in dmesg/syslog: INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000 Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] Call Trace: __schedule+0x5d1/0xcf0 schedule+0x45/0xe0 lock_extent_bits+0x1e6/0x2d0 [btrfs] ? finish_wait+0x90/0x90 btrfs_invalidatepage+0x32c/0x390 [btrfs] ? __mod_memcg_state+0x8e/0x160 __extent_writepage+0x2d4/0x400 [btrfs] extent_write_cache_pages+0x2b2/0x500 [btrfs] ? lock_release+0x20e/0x4c0 ? trace_hardirqs_on+0x1b/0xf0 extent_writepages+0x43/0x90 [btrfs] ? lock_acquire+0x1a3/0x490 do_writepages+0x43/0xe0 ? __filemap_fdatawrite_range+0xa4/0x100 __filemap_fdatawrite_range+0xc5/0x100 btrfs_run_delalloc_work+0x17/0x40 [btrfs] btrfs_work_helper+0xf1/0x600 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x50/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] Call Trace: __schedule+0x5d1/0xcf0 ? kvm_clock_read+0x14/0x30 ? wait_for_completion+0x81/0x110 schedule+0x45/0xe0 schedule_timeout+0x30c/0x580 ? _raw_spin_unlock_irqrestore+0x3c/0x60 ? lock_acquire+0x1a3/0x490 ? try_to_wake_up+0x7a/0xa20 ? lock_release+0x20e/0x4c0 ? lock_acquired+0x199/0x490 ? wait_for_completion+0x81/0x110 wait_for_completion+0xab/0x110 start_delalloc_inodes+0x2af/0x390 [btrfs] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs] flush_space+0x24f/0x660 [btrfs] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x20f/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 (...) several tasks waiting for the inode lock held by the fallocate task below (...) RIP: 0033:0x7f61efe73fff Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5. RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003 R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001 R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0 task:fdm-stress state:D stack: 0 pid:2508243 ppid:2508153 flags:0x00000000 Call Trace: __schedule+0x5d1/0xcf0 ? _raw_spin_unlock_irqrestore+0x3c/0x60 schedule+0x45/0xe0 __reserve_bytes+0x4a4/0xb10 [btrfs] ? finish_wait+0x90/0x90 btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs] btrfs_block_rsv_add+0x1f/0x50 [btrfs] start_transaction+0x2d1/0x760 [btrfs] btrfs_replace_file_extents+0x120/0x930 [btrfs] ? btrfs_fallocate+0xdcf/0x1260 [btrfs] btrfs_fallocate+0xdfb/0x1260 [btrfs] ? filename_lookup+0xf1/0x180 vfs_fallocate+0x14f/0x440 ioctl_preallocate+0x92/0xc0 do_vfs_ioctl+0x66b/0x750 ? __do_sys_newfstat+0x53/0x60 __x64_sys_ioctl+0x62/0xb0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 """ Fix this by disallowing mmaps from happening while we're doing any of the fallocate operations on this inode. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
8c99516a8c |
btrfs: exclude mmaps while doing remap
Darrick reported a potential issue to me where we could allow mmap writes after validating a page range matched in the case of dedupe. Generally we rely on lock page -> lock extent with the ordered flush to protect us, but this is done after we check the pages because we use the generic helpers, so we could modify the page in between doing the check and locking the range. There also exists a deadlock, as described by Filipe """ When cloning a file range, we lock the inodes, flush any delalloc within the respective file ranges, wait for any ordered extents and then lock the file ranges in both inodes. This means that right after we flush delalloc and before we lock the file ranges, memory mapped writes can come in and dirty pages in the file ranges of the clone operation. Most of the time this is harmless and causes no problems. However, if we are low on available metadata space, we can later end up in a deadlock when starting a transaction to replace file extent items. This happens if when allocating metadata space for the transaction, we need to wait for the async reclaim thread to release space and the reclaim thread needs to flush delalloc for the inode that got the memory mapped write and has its range locked by the clone task. Basically what happens is the following: 1) A clone operation locks inodes A and B, flushes delalloc for both inodes in the respective file ranges and waits for any ordered extents in those ranges to complete; 2) Before the clone task locks the file ranges, another task does a memory mapped write (which does not lock the inode) for one of the inodes of the clone operation. So now we have a dirty page in one of the ranges used by the clone operation; 3) The clone operation locks the file ranges for inodes A and B; 4) Later, when iterating over the file extents of inode A, the clone task attempts to start a transaction. There's not enough available free metadata space, so the async reclaim task is started (if not running already) and we wait for someone to wake us up on our reservation ticket; 5) The async reclaim task is not able to release space by any other means and decides to flush delalloc for the inode of the clone operation; 6) The workqueue job used to flush the inode blocks when starting delalloc for the inode, since the file range is currently locked by the clone task; 7) But the clone task is waiting on its reservation ticket and the async reclaim task is waiting on the flush job to complete, which can't progress since the clone task has the file range locked. So unless some other task is able to release space, for example an ordered extent for some other inode completes, we have a deadlock between all these tasks; When this happens stack traces like the following show up in dmesg/syslog: INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000 Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] Call Trace: __schedule+0x5d1/0xcf0 schedule+0x45/0xe0 lock_extent_bits+0x1e6/0x2d0 [btrfs] ? finish_wait+0x90/0x90 btrfs_invalidatepage+0x32c/0x390 [btrfs] ? __mod_memcg_state+0x8e/0x160 __extent_writepage+0x2d4/0x400 [btrfs] extent_write_cache_pages+0x2b2/0x500 [btrfs] ? lock_release+0x20e/0x4c0 ? trace_hardirqs_on+0x1b/0xf0 extent_writepages+0x43/0x90 [btrfs] ? lock_acquire+0x1a3/0x490 do_writepages+0x43/0xe0 ? __filemap_fdatawrite_range+0xa4/0x100 __filemap_fdatawrite_range+0xc5/0x100 btrfs_run_delalloc_work+0x17/0x40 [btrfs] btrfs_work_helper+0xf1/0x600 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x50/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] Call Trace: __schedule+0x5d1/0xcf0 ? kvm_clock_read+0x14/0x30 ? wait_for_completion+0x81/0x110 schedule+0x45/0xe0 schedule_timeout+0x30c/0x580 ? _raw_spin_unlock_irqrestore+0x3c/0x60 ? lock_acquire+0x1a3/0x490 ? try_to_wake_up+0x7a/0xa20 ? lock_release+0x20e/0x4c0 ? lock_acquired+0x199/0x490 ? wait_for_completion+0x81/0x110 wait_for_completion+0xab/0x110 start_delalloc_inodes+0x2af/0x390 [btrfs] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs] flush_space+0x24f/0x660 [btrfs] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x20f/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 (...) several other tasks blocked on inode locks held by the clone task below (...) RIP: 0033:0x7f61efe73fff Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5. RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff RDX: 00000000ffffff9c RSI: 0000560fbd604690 RDI: 00000000ffffff9c RBP: 00007ffc3371beb0 R08: 0000000000000002 R09: 0000560fbd5d75f0 R10: 0000560fbd5d81f0 R11: 0000000000000202 R12: 0000000000000002 R13: 000000000000000b R14: 00007ffc3371bea0 R15: 00007ffc3371beb0 task: fdm-stress state:D stack: 0 pid:2508234 ppid:2508153 flags:0x00004000 Call Trace: __schedule+0x5d1/0xcf0 ? _raw_spin_unlock_irqrestore+0x3c/0x60 schedule+0x45/0xe0 __reserve_bytes+0x4a4/0xb10 [btrfs] ? finish_wait+0x90/0x90 btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs] btrfs_block_rsv_add+0x1f/0x50 [btrfs] start_transaction+0x2d1/0x760 [btrfs] btrfs_replace_file_extents+0x120/0x930 [btrfs] ? lock_release+0x20e/0x4c0 btrfs_clone+0x3e4/0x7e0 [btrfs] ? btrfs_lookup_first_ordered_extent+0x8e/0x100 [btrfs] btrfs_clone_files+0xf6/0x150 [btrfs] btrfs_remap_file_range+0x324/0x3d0 [btrfs] do_clone_file_range+0xd4/0x1f0 vfs_clone_file_range+0x4d/0x230 ? lock_release+0x20e/0x4c0 ioctl_file_clone+0x8f/0xc0 do_vfs_ioctl+0x342/0x750 __x64_sys_ioctl+0x62/0xb0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 """ Fix both of these issues by excluding mmaps from happening we are doing any sort of remap, which prevents this race completely. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
64708539cd |
btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock helpers
A few places we intermix btrfs_inode_lock with a inode_unlock, and some places we just use inode_lock/inode_unlock instead of btrfs_inode_lock. None of these places are using this incorrectly, but as we adjust some of these callers it would be nice to keep everything consistent, so convert everybody to use btrfs_inode_lock/btrfs_inode_unlock. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
8318ba79ee |
btrfs: add a i_mmap_lock to our inode
We need to be able to exclude page_mkwrite from happening concurrently with certain operations. To facilitate this, add a i_mmap_lock to our inode, down_read() it in our mkwrite, and add a new ILOCK flag to indicate that we want to take the i_mmap_lock as well. I used pahole to check the size of the btrfs_inode, the sizes are as follows no lockdep: before: 1120 (3 per 4k page) after: 1160 (3 per 4k page) lockdep: before: 2072 (1 per 4k page) after: 2224 (1 per 4k page) We're slightly larger but it doesn't change how many objects we can fit per page. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
5e295768a0 |
btrfs: remove mirror argument from btrfs_csum_verify_data()
The parameter mirror is not used and does not make sense for checksum verification of the given bio. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
6e65ae7629 |
btrfs: remove force argument from run_delalloc_nocow()
force_cow can be calculated from inode and does not need to be passed as an argument. This simplifies run_delalloc_nocow() call from btrfs_run_delalloc_range() A new function, should_nocow() checks if the range should be NOCOWed or not. The function returns true iff either BTRFS_INODE_NODATA or BTRFS_INODE_PREALLOC, but is not a defrag extent. Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d6ade6894e |
btrfs: don't opencode extent_changeset_free
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
7000babdda |
btrfs: assign proper values to a bool variable in dev_extent_hole_check_zoned
Fix the following coccicheck warnings: ./fs/btrfs/volumes.c:1462:10-11: WARNING: return of 0/1 in function 'dev_extent_hole_check_zoned' with return type bool. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
2ce73c6335 |
btrfs: add btree read ahead for incremental send operations
Currently we do not do btree read ahead when doing an incremental send, however we know that we will read and process any node or leaf in the send root that has a generation greater than the generation of the parent root. So triggering read ahead for such nodes and leafs is beneficial for an incremental send. This change does that, triggers read ahead of any node or leaf in the send root that has a generation greater then the generation of the parent root. As for the parent root, no readahead is triggered because knowing in advance which nodes/leaves are going to be read is not so linear and there's often a large time window between visiting nodes or leaves of the parent root. So I opted to leave out the parent root, and triggering read ahead for its nodes/leaves seemed to have not made significant difference. The following test script was used to measure the improvement on a box using an average, consumer grade, spinning disk and with 16GiB of ram: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null mount $MOUNT_OPTIONS $DEV $MNT # Create files with inline data to make it easier and faster to create # large btrees. add_files() { local total=$1 local start_offset=$2 local number_jobs=$3 local total_per_job=$(($total / $number_jobs)) echo "Creating $total new files using $number_jobs jobs" for ((n = 0; n < $number_jobs; n++)); do ( local start_num=$(($start_offset + $n * $total_per_job)) for ((i = 1; i <= $total_per_job; i++)); do local file_num=$((start_num + $i)) local file_path="$MNT/file_${file_num}" xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null if [ $? -ne 0 ]; then echo "Failed creating file $file_path" break fi done ) & worker_pids[$n]=$! done wait ${worker_pids[@]} sync echo echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)" } initial_file_count=500000 add_files $initial_file_count 0 4 echo echo "Creating first snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap1 echo echo "Adding more files..." add_files $((initial_file_count / 4)) $initial_file_count 4 echo echo "Updating 1/50th of the initial files..." for ((i = 1; i < $initial_file_count; i += 50)); do xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null done echo echo "Creating second snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap2 umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing full send..." start=$(date +%s) btrfs send $MNT/snap1 > /dev/null end=$(date +%s) echo echo "Full send took $((end - start)) seconds" umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing incremental send..." start=$(date +%s) btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null end=$(date +%s) echo echo "Incremental send took $((end - start)) seconds" umount $MNT Before this change, incremental send duration: with $initial_file_count == 200000: 51 seconds with $initial_file_count == 500000: 168 seconds After this change, incremental send duration: with $initial_file_count == 200000: 39 seconds (-26.7%) with $initial_file_count == 500000: 125 seconds (-29.4%) For $initial_file_count == 200000 there are 62600 nodes and leaves in the btree of the first snapshot, and 77759 nodes and leaves in the btree of the second snapshot. The root nodes were at level 2. While for $initial_file_count == 500000 there are 152476 nodes and leaves in the btree of the first snapshot, and 190511 nodes and leaves in the btree of the second snapshot. The root nodes were at level 2 as well. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
19358b154f |
btrfs: add btree read ahead for full send operations
When doing a full send we know that we are going to be reading every node and leaf of the send root, so we benefit from enabling read ahead for the btree. This change enables read ahead for full send operations only, incremental sends will have read ahead enabled in a different way by a separate patch. The following test script was used to measure the improvement on a box using an average, consumer grade, spinning disk and with 16GiB of RAM: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null mount $MOUNT_OPTIONS $DEV $MNT # Create files with inline data to make it easier and faster to create # large btrees. add_files() { local total=$1 local start_offset=$2 local number_jobs=$3 local total_per_job=$(($total / $number_jobs)) echo "Creating $total new files using $number_jobs jobs" for ((n = 0; n < $number_jobs; n++)); do ( local start_num=$(($start_offset + $n * $total_per_job)) for ((i = 1; i <= $total_per_job; i++)); do local file_num=$((start_num + $i)) local file_path="$MNT/file_${file_num}" xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null if [ $? -ne 0 ]; then echo "Failed creating file $file_path" break fi done ) & worker_pids[$n]=$! done wait ${worker_pids[@]} sync echo echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)" } initial_file_count=500000 add_files $initial_file_count 0 4 echo echo "Creating first snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap1 echo echo "Adding more files..." add_files $((initial_file_count / 4)) $initial_file_count 4 echo echo "Updating 1/50th of the initial files..." for ((i = 1; i < $initial_file_count; i += 50)); do xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null done echo echo "Creating second snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap2 umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing full send..." start=$(date +%s) btrfs send $MNT/snap1 > /dev/null end=$(date +%s) echo echo "Full send took $((end - start)) seconds" umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing incremental send..." start=$(date +%s) btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null end=$(date +%s) echo echo "Incremental send took $((end - start)) seconds" umount $MNT Before this change, full send duration: with $initial_file_count == 200000: 165 seconds with $initial_file_count == 500000: 407 seconds After this change, full send duration: with $initial_file_count == 200000: 149 seconds (-10.2%) with $initial_file_count == 500000: 353 seconds (-14.2%) For $initial_file_count == 200000 there are 62600 nodes and leaves in the btree of the first snapshot, while for $initial_file_count == 500000 there are 152476 nodes and leaves. The roots were at level 2. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
98686ffc71 |
btrfs: simplify code flow in btrfs_delayed_inode_reserve_metadata
btrfs_block_rsv_add can return only ENOSPC since it's called with NO_FLUSH modifier. This so simplify the logic in btrfs_delayed_inode_reserve_metadata to exploit this invariant. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add assert and comment ] Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
8e3c9d3cf8 |
btrfs: remove btrfs_inode parameter from btrfs_delayed_inode_reserve_metadata
It's only used for tracepoint to obtain the inode number, but we already have the ino from btrfs_delayed_node::inode_id. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
ae396a3b7a |
btrfs: simplify commit logic in try_flush_qgroup
It's no longer expected to call this function with an open transaction so all the workarounds concerning this can be removed. In fact it'll constitute a bug to call this function with a transaction already held so WARN in this case. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
e5ce988690 |
btrfs: scrub: drop a few function declarations
Drop function declarations at the beginning of the file scrub.c. These functions are defined before they are used in the same file and don't need forward declaration. No functional changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
f4639636b6 |
btrfs: change return type to bool in btrfs_extent_readonly
btrfs_extent_readonly() checks if the block group is readonly, the bool return type should be used. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
05947ae186 |
btrfs: unexport btrfs_extent_readonly() and make it static
btrfs_extent_readonly() is used by can_nocow_extent() in inode.c. So move it from extent-tree.c to inode.c and declare it as static. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
b6e9f16c5f |
btrfs: replace open coded while loop with proper construct
btrfs_inc_block_group_ro wants to ensure that the current transaction is not running dirty block groups, if it is it waits and loops again. That logic is currently implemented using a goto label. Actually using a proper do {} while() construct doesn't hurt readability nor does it introduce excessive nesting and makes the relevant code stand out by being encompassed in the loop construct. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
20bbf20e95 |
btrfs: replace offset_in_entry with in_range
No point in duplicating the functionality just use the generic helper that has the same semantics. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
cca5de97ae |
btrfs: make find_desired_extent take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
bfc78479eb |
btrfs: make btrfs_replace_file_extents take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
0b3dcd131d |
btrfs: fix comment for btrfs ordered extent flag bits
There is small error in comment about BTRFS_ORDERED_* flags, added in
commit
|
||
|
97fc297754 |
btrfs: convert to fileattr
Use the fileattr API to let the VFS handle locking, permission checking and conversion. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: David Sterba <dsterba@suse.com> |
||
|
7d90072491 |
for-5.12-rc6-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmBy9DoACgkQxWXV+ddt WDtqdxAAnK4zx79k5ok6nlj8JlOfReimX4wPYYigiiKGY40cfQUZ1YUqbDscvrt+ cbzvqJuMU/V/UVaPW/CLmNi5XpNlSmj0229iwy59BIcpXfgtAMTsa1zsY4teZ/AT 3noNuT15CTeybwii0nT++AkqJbCbwXc5ItccGh9ZMOQwXuA5IUVTAzKrulUJoxXN zt23lX/ivtSfUH+pMMIG6wMVG2eGIP5m9drw+2n0yK08gt+oprLYnaAaE389mXgb TIRBafeBY7UA1YEcA4JDBDMNa0L8yWSV+XiMhxw7Ear7KoROAunKNbsG8USll6zb zBftfO+Gzv86wVvvPXg2KR8Qs9vyJMw2bOROFKzOnd+wQQ76v0XefOhNUUN98E6g tLTmCH+M1B1Qm1j2hVyOect/PMY51xqJA9xwlTtAbqIcz4qyOtfTR9KqqlWxVKJW 9pAEMII063xEKVxgv2khOhewEjOgqa4v9YFQjVXdcHPKvGTAYBeoJA735+WnQ1HZ okPC5k3DoEcVZEkUPvespEsAqm+RoBufNxWmQ7hq5N3IwZAXsIwTlhysgrXQWyc9 aTigWBq6rQ/bMz/57vI626+MAMh3StL+UOxlWiT+GToInpjZwoxZ0lgQdD6vUfUm T90T2930+PTkykQM9sNdQygGiH0J5FzkvneYvpkOYJ/+vphsRiA= =MuRt -----END PGP SIGNATURE----- Merge tag 'for-5.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more patch that we'd like to get to 5.12 before release. It's changing where and how the superblock is stored in the zoned mode. It is an on-disk format change but so far there are no implications for users as the proper mkfs support hasn't been merged and is waiting for the kernel side to settle. Until now, the superblocks were derived from the zone index, but zone size can differ per device. This is changed to be based on fixed offset values, to make it independent of the device zone size. The work on that got a bit delayed, we discussed the exact locations to support potential device sizes and usecases. (Partially delayed also due to my vacation.) Having that in the same release where the zoned mode is declared usable is highly desired, there are userspace projects that need to be updated to recognize the feature. Pushing that to the next release would make things harder to test" * tag 'for-5.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: move superblock logging zone location |
||
|
53b74fa990 |
btrfs: zoned: move superblock logging zone location
Moves the location of the superblock logging zones. The new locations of the logging zones are now determined based on fixed block addresses instead of on fixed zone numbers. The old placement method based on fixed zone numbers causes problems when one needs to inspect a file system image without access to the drive zone information. In such case, the super block locations cannot be reliably determined as the zone size is unknown. By locating the superblock logging zones using fixed addresses, we can scan a dumped file system image without the zone information since a super block copy will always be present at or after the fixed known locations. Introduce the following three pairs of zones containing fixed offset locations, regardless of the device zone size. - primary superblock: offset 0B (and the following zone) - first copy: offset 512G (and the following zone) - Second copy: offset 4T (4096G, and the following zone) If a logging zone is outside of the disk capacity, we do not record the superblock copy. The first copy position is much larger than for a non-zoned filesystem, which is at 64M. This is to avoid overlapping with the log zones for the primary superblock. This higher location is arbitrary but allows supporting devices with very large zone sizes, plus some space around in between. Such large zone size is unrealistic and very unlikely to ever be seen in real devices. Currently, SMR disks have a zone size of 256MB, and we are expecting ZNS drives to be in the 1-4GB range, so this limit gives us room to breathe. For now, we only allow zone sizes up to 8GB. The maximum zone size that would still fit in the space is 256G. The fixed location addresses are somewhat arbitrary, with the intent of maintaining superblock reliability for smaller and larger devices, with the preference for the latter. For this reason, there are two superblocks under the first 1T. This should cover use cases for physical devices and for emulated/device-mapper devices. The superblock logging zones are reserved for superblock logging and never used for data or metadata blocks. Note that we only reserve the two zones per primary/copy actually used for superblock logging. We do not reserve the ranges of zones possibly containing superblocks with the largest supported zone size (0-16GB, 512G-528GB, 4096G-4112G). The zones containing the fixed location offsets used to store superblocks on a non-zoned volume are also reserved to avoid confusion. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
4f0f586bf0 |
treewide: Change list_sort to use const pointers
list_sort() internally casts the comparison function passed to it to a different type with constant struct list_head pointers, and uses this pointer to call the functions, which trips indirect call Control-Flow Integrity (CFI) checking. Instead of removing the consts, this change defines the list_cmp_func_t type and changes the comparison function types of all list_sort() callers to use const pointers, thus avoiding type mismatches. Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com |
||
|
701c09c988 |
for-5.12-rc4-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmBctBgACgkQxWXV+ddt WDu1nA//bzuPwW3nO+enE+ipi4t6UJTJpHLeDgdMshWwhBIHVt+oFxTUIt4Zd0kT 0hJ+mbNrZHzmDmzpb6ifQn0D6k+wq6zbsEgLtwgmPmBszaXIw46FvnYnxd9FtCde 9SQzBKa86i/KMkRtaIvpUcunniIo5Aj0Hvu0oPgTKObqiB4HP2nV6rKody+mP9JW RanWbBi0JvI4UE/J2Ud1sNWFdDtVpXpcktj1dsI8gbsYNR05HpM08SEUgeF/ts3I yB/L18I5CUeFHyo/yogbj7kkikugPGsmOj/A86UZ6x3NxWoC+m7UXoGrO2/qlFem qd3ioXZKlnPqeX29kAy/REa3xjE61istlDVC/vckqmXBfYc6WK/KAJvFAGI+/3VI 9HvIbBokUQzekhFlA02RTqGcasStXX7VSeJyzyAbXjGhZQKfFTHR8ZBtrREiVBC9 58K+g8SSqIb/9iJqYV4h82lSBRSdf9kHx7CSB2gOBuifihY+chVr4Xzhq12IlXbK TNlue0BTwYLJStwx2dnY2beLbLG34/4FNRsuAR/9JsCio7Bfj0qN8htIyvfsiMxr mkrH7+Ykd10FqC8uu6MHiW9k428871Era3B97TgyQ0V17ehh4IN0v9V7kckk9EWw 3omaPwuF2FGfFOoTR7ipKO0nDx0/y2knnDSTsWknNG09Ciwa+Ww= =SuJv -----END PGP SIGNATURE----- Merge tag 'for-5.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Fixes for issues that have some user visibility and are simple enough for this time of development cycle: - a few fixes for rescue= mount option, adding more checks for missing trees - fix sleeping in atomic context on qgroup deletion - fix subvolume deletion on mount - fix build with M= syntax - fix checksum mismatch error message for direct io" * tag 'for-5.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix check_data_csum() error message for direct I/O btrfs: fix sleep while in non-sleep context during qgroup removal btrfs: fix subvolume/snapshot deletion not triggered on mount btrfs: fix build when using M=fs/btrfs btrfs: do not initialize dev replace for bad dev root btrfs: initialize device::fs_info always btrfs: do not initialize dev stats if we have no dev_root btrfs: zoned: remove outdated WARN_ON in direct IO |
||
|
81aa0968b7 |
for-5.12-rc3-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmBTeBsACgkQxWXV+ddt WDtwcBAAoto5Pbc3Lvt0aha3qn9q/Ms9lNU3YIwTjqXV3lIRKksWCS7kQmWlFmLz dILhdRBg1iWVh8qbeqpL5su7yNJduypsY/ImJroukb/BzwQViFRDGy5qIc56qLH2 OVTx4LQ0zdqVdD86Qj0mt9ilSjgXYN+J53IUjsSSyJIpgt3vVcfjCYSkFO8zBiMH eliRtYShzJHkjEwVWLZRzk76oTnFQEC28IdYJ4y95mYl2wCABfTU2ylSeVDTtc6O x+fNMHHRmde2nbsHc+0eMm7rYLXuzvyx/tY17u6A6iwEQLGjE4rXOVZ7kA93WgAd YTXhM/B+YFfirNh029Av/MJP+2t9YBEODAHl1tnOdM0mfvXkpimaW0jvUEhi5f6I ZGu5FytscsgjyUK827WL7bZKO8WMzTLQvB3ryZ9UcrHm3QbZ7xGdoBE2L86p4Euw LiXUALdOWeYjFKSW9WWKrtQBtdjlLQYqJt+hL0ifaGlnfoi2G+DQeKtL9ZAKH5Cu gcjDUewnJtYPLyDOCRjQPFcts/MD5o81qMLeEwshmZT/bNMD9JOGEppCxBWGWSCx dYGq04Wib/dN710i5jB1XbJboBmT2SZDyBeiKTpCXs5mECBU00uWkkO98oId1YS3 wHu9qyGUOi2g88V27jH593/JstUYn6zyxJYIZX84mzcxOqZlKuo= =auMP -----END PGP SIGNATURE----- Merge tag 'for-5.12-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "There are still regressions being found and fixed in the zoned mode and subpage code, the rest are fixes for bugs reported by users. Regressions: - subpage block support: - readahead works on the proper block size - fix last page zeroing - zoned mode: - linked list corruption for tree log Fixes: - qgroup leak after falloc failure - tree mod log and backref resolving: - extent buffer cloning race when resolving backrefs - pin deleted leaves with active tree mod log users - drop debugging flag from slab cache" * tag 'for-5.12-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: always pin deleted leaves when there are active tree mod log users btrfs: fix race when cloning extent buffer during rewind of an old root btrfs: fix slab cache flags for free space tree bitmap btrfs: subpage: make readahead work properly btrfs: subpage: fix wild pointer access during metadata read failure btrfs: zoned: fix linked list corruption after log root tree allocation failure btrfs: fix qgroup data rsv leak caused by falloc failure btrfs: track qgroup released data in own variable in insert_prealloc_file_extent btrfs: fix wrong offset to zero out range beyond i_size |
||
|
c1d6abdac4 |
btrfs: fix check_data_csum() error message for direct I/O
Commit 1dae796aabf6 ("btrfs: inode: sink parameter start and len to
check_data_csum()") replaced the start parameter to check_data_csum()
with page_offset(), but page_offset() is not meaningful for direct I/O
pages. Bring back the start parameter.
Fixes:
|
||
|
0bb7883009 |
btrfs: fix sleep while in non-sleep context during qgroup removal
While removing a qgroup's sysfs entry we end up taking the kernfs_mutex,
through kobject_del(), while holding the fs_info->qgroup_lock spinlock,
producing the following trace:
[821.843637] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:281
[821.843641] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 28214, name: podman
[821.843644] CPU: 3 PID: 28214 Comm: podman Tainted: G W 5.11.6 #15
[821.843646] Hardware name: Dell Inc. PowerEdge R330/084XW4, BIOS 2.11.0 12/08/2020
[821.843647] Call Trace:
[821.843650] dump_stack+0xa1/0xfb
[821.843656] ___might_sleep+0x144/0x160
[821.843659] mutex_lock+0x17/0x40
[821.843662] kernfs_remove_by_name_ns+0x1f/0x80
[821.843666] sysfs_remove_group+0x7d/0xe0
[821.843668] sysfs_remove_groups+0x28/0x40
[821.843670] kobject_del+0x2a/0x80
[821.843672] btrfs_sysfs_del_one_qgroup+0x2b/0x40 [btrfs]
[821.843685] __del_qgroup_rb+0x12/0x150 [btrfs]
[821.843696] btrfs_remove_qgroup+0x288/0x2a0 [btrfs]
[821.843707] btrfs_ioctl+0x3129/0x36a0 [btrfs]
[821.843717] ? __mod_lruvec_page_state+0x5e/0xb0
[821.843719] ? page_add_new_anon_rmap+0xbc/0x150
[821.843723] ? kfree+0x1b4/0x300
[821.843725] ? mntput_no_expire+0x55/0x330
[821.843728] __x64_sys_ioctl+0x5a/0xa0
[821.843731] do_syscall_64+0x33/0x70
[821.843733] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[821.843736] RIP: 0033:0x4cd3fb
[821.843741] RSP: 002b:000000c000906b20 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[821.843744] RAX: ffffffffffffffda RBX: 000000c000050000 RCX: 00000000004cd3fb
[821.843745] RDX: 000000c000906b98 RSI: 000000004010942a RDI: 000000000000000f
[821.843747] RBP: 000000c000907cd0 R08: 000000c000622901 R09: 0000000000000000
[821.843748] R10: 000000c000d992c0 R11: 0000000000000206 R12: 000000000000012d
[821.843749] R13: 000000000000012c R14: 0000000000000200 R15: 0000000000000049
Fix this by removing the qgroup sysfs entry while not holding the spinlock,
since the spinlock is only meant for protection of the qgroup rbtree.
Reported-by: Stuart Shelton <srcshelton@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/7A5485BB-0628-419D-A4D3-27B1AF47E25A@gmail.com/
Fixes:
|
||
|
8d488a8c7b |
btrfs: fix subvolume/snapshot deletion not triggered on mount
During the mount procedure we are calling btrfs_orphan_cleanup() against
the root tree, which will find all orphans items in this tree. When an
orphan item corresponds to a deleted subvolume/snapshot (instead of an
inode space cache), it must not delete the orphan item, because that will
cause btrfs_find_orphan_roots() to not find the orphan item and therefore
not add the corresponding subvolume root to the list of dead roots, which
results in the subvolume's tree never being deleted by the cleanup thread.
The same applies to the remount from RO to RW path.
Fix this by making btrfs_find_orphan_roots() run before calling
btrfs_orphan_cleanup() against the root tree.
A test case for fstests will follow soon.
Reported-by: Robbie Ko <robbieko@synology.com>
Link: https://lore.kernel.org/linux-btrfs/b19f4310-35e0-606e-1eea-2dd84d28c5da@synology.com/
Fixes:
|
||
|
ebd99a6b34 |
btrfs: fix build when using M=fs/btrfs
There are people building the module with M= that's supposed to be used
for external modules. This got broken in
|
||
|
3cb894972f |
btrfs: do not initialize dev replace for bad dev root
While helping Neal fix his broken file system I added a debug patch to catch if we were calling btrfs_search_slot with a NULL root, and this stack trace popped: we tried to search with a NULL root CPU: 0 PID: 1760 Comm: mount Not tainted 5.11.0-155.nealbtrfstest.1.fc34.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020 Call Trace: dump_stack+0x6b/0x83 btrfs_search_slot.cold+0x11/0x1b ? btrfs_init_dev_replace+0x36/0x450 btrfs_init_dev_replace+0x71/0x450 open_ctree+0x1054/0x1610 btrfs_mount_root.cold+0x13/0xfa legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x131/0x3d0 ? legacy_get_tree+0x27/0x40 ? btrfs_show_options+0x640/0x640 legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 path_mount+0x441/0xa80 __x64_sys_mount+0xf4/0x130 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f644730352e Fix this by not starting the device replace stuff if we do not have a NULL dev root. Reported-by: Neal Gompa <ngompa13@gmail.com> CC: stable@vger.kernel.org # 5.11+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
820a49dafc |
btrfs: initialize device::fs_info always
Neal reported a panic trying to use -o rescue=all BUG: kernel NULL pointer dereference, address: 0000000000000030 PGD 0 P4D 0 Oops: 0000 [#1] SMP NOPTI CPU: 0 PID: 696 Comm: mount Tainted: G W 5.12.0-rc2+ #296 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:btrfs_device_init_dev_stats+0x1d/0x200 RSP: 0018:ffffafaec1483bb8 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff9a5715bcb298 RCX: 0000000000000070 RDX: ffff9a5703248000 RSI: ffff9a57052ea150 RDI: ffff9a5715bca400 RBP: ffff9a57052ea150 R08: 0000000000000070 R09: ffff9a57052ea150 R10: 000130faf0741c10 R11: 0000000000000000 R12: ffff9a5703700000 R13: 0000000000000000 R14: ffff9a5715bcb278 R15: ffff9a57052ea150 FS: 00007f600d122c40(0000) GS:ffff9a577bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000030 CR3: 0000000112a46005 CR4: 0000000000370ef0 Call Trace: ? btrfs_init_dev_stats+0x1f/0xf0 ? kmem_cache_alloc+0xef/0x1f0 btrfs_init_dev_stats+0x5f/0xf0 open_ctree+0x10cb/0x1720 btrfs_mount_root.cold+0x12/0xea legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x380 legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 path_mount+0x433/0xa00 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae This happens because when we call btrfs_init_dev_stats we do device->fs_info->dev_root. However device->fs_info isn't initialized because we were only calling btrfs_init_devices_late() if we properly read the device root. However we don't actually need the device root to init the devices, this function simply assigns the devices their ->fs_info pointer properly, so this needs to be done unconditionally always so that we can properly dereference device->fs_info in rescue cases. Reported-by: Neal Gompa <ngompa13@gmail.com> CC: stable@vger.kernel.org # 5.11+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
82d62d06db |
btrfs: do not initialize dev stats if we have no dev_root
Neal reported a panic trying to use -o rescue=all BUG: kernel NULL pointer dereference, address: 0000000000000030 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 0 PID: 4095 Comm: mount Not tainted 5.11.0-0.rc7.149.fc34.x86_64 #1 RIP: 0010:btrfs_device_init_dev_stats+0x4c/0x1f0 RSP: 0018:ffffa60285fbfb68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88b88f806498 RCX: ffff88b82e7a2a10 RDX: ffffa60285fbfb97 RSI: ffff88b82e7a2a10 RDI: 0000000000000000 RBP: ffff88b88f806b3c R08: 0000000000000000 R09: 0000000000000000 R10: ffff88b82e7a2a10 R11: 0000000000000000 R12: ffff88b88f806a00 R13: ffff88b88f806478 R14: ffff88b88f806a00 R15: ffff88b82e7a2a10 FS: 00007f698be1ec40(0000) GS:ffff88b937e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000030 CR3: 0000000092c9c006 CR4: 00000000003706f0 Call Trace: ? btrfs_init_dev_stats+0x1f/0xf0 btrfs_init_dev_stats+0x62/0xf0 open_ctree+0x1019/0x15ff btrfs_mount_root.cold+0x13/0xfa legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x131/0x3d0 ? legacy_get_tree+0x27/0x40 ? btrfs_show_options+0x640/0x640 legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 path_mount+0x441/0xa80 __x64_sys_mount+0xf4/0x130 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f698c04e52e This happens because we unconditionally attempt to initialize device stats on mount, but we may not have been able to read the device root. Fix this by skipping initializing the device stats if we do not have a device root. Reported-by: Neal Gompa <ngompa13@gmail.com> CC: stable@vger.kernel.org # 5.11+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
f3da882eae |
btrfs: zoned: remove outdated WARN_ON in direct IO
In btrfs_submit_direct() there's a WAN_ON_ONCE() that will trigger if we're submitting a DIO write on a zoned filesystem but are not using REQ_OP_ZONE_APPEND to submit the IO to the block device. This is a left over from a previous version where btrfs_dio_iomap_begin() didn't use btrfs_use_zone_append() to check for sequential write only zones. It is an oversight from the development phase. In v11 (I think) I've added |
||
|
485df75554 |
btrfs: always pin deleted leaves when there are active tree mod log users
When freeing a tree block we may end up adding its extent back to the free space cache/tree, as long as there are no more references for it, it was created in the current transaction and writeback for it never happened. This is generally fine, however when we have tree mod log operations it can result in inconsistent versions of a btree after unwinding extent buffers with the recorded tree mod log operations. This is because: * We only log operations for nodes (adding and removing key/pointers), for leaves we don't do anything; * This means that we can log a MOD_LOG_KEY_REMOVE_WHILE_FREEING operation for a node that points to a leaf that was deleted; * Before we apply the logged operation to unwind a node, we can have that leaf's extent allocated again, either as a node or as a leaf, and possibly for another btree. This is possible if the leaf was created in the current transaction and writeback for it never started, in which case btrfs_free_tree_block() returns its extent back to the free space cache/tree; * Then, before applying the tree mod log operation, some task allocates the metadata extent just freed before, and uses it either as a leaf or as a node for some btree (can be the same or another one, it does not matter); * After applying the MOD_LOG_KEY_REMOVE_WHILE_FREEING operation we now get the target node with an item pointing to the metadata extent that now has content different from what it had before the leaf was deleted. It might now belong to a different btree and be a node and not a leaf anymore. As a consequence, the results of searches after the unwinding can be unpredictable and produce unexpected results. So make sure we pin extent buffers corresponding to leaves when there are tree mod log users. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
dbcc7d57bf |
btrfs: fix race when cloning extent buffer during rewind of an old root
While resolving backreferences, as part of a logical ino ioctl call or
fiemap, we can end up hitting a BUG_ON() when replaying tree mod log
operations of a root, triggering a stack trace like the following:
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:1210!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 1 PID: 19054 Comm: crawl_335 Tainted: G W 5.11.0-2d11c0084b02-misc-next+ #89
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:__tree_mod_log_rewind+0x3b1/0x3c0
Code: 05 48 8d 74 10 (...)
RSP: 0018:ffffc90001eb70b8 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff88812344e400 RCX: ffffffffb28933b6
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88812344e42c
RBP: ffffc90001eb7108 R08: 1ffff11020b60a20 R09: ffffed1020b60a20
R10: ffff888105b050f9 R11: ffffed1020b60a1f R12: 00000000000000ee
R13: ffff8880195520c0 R14: ffff8881bc958500 R15: ffff88812344e42c
FS: 00007fd1955e8700(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007efdb7928718 CR3: 000000010103a006 CR4: 0000000000170ee0
Call Trace:
btrfs_search_old_slot+0x265/0x10d0
? lock_acquired+0xbb/0x600
? btrfs_search_slot+0x1090/0x1090
? free_extent_buffer.part.61+0xd7/0x140
? free_extent_buffer+0x13/0x20
resolve_indirect_refs+0x3e9/0xfc0
? lock_downgrade+0x3d0/0x3d0
? __kasan_check_read+0x11/0x20
? add_prelim_ref.part.11+0x150/0x150
? lock_downgrade+0x3d0/0x3d0
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x600
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? rb_insert_color+0x30/0x360
? prelim_ref_insert+0x12d/0x430
find_parent_nodes+0x5c3/0x1830
? resolve_indirect_refs+0xfc0/0xfc0
? lock_release+0xc8/0x620
? fs_reclaim_acquire+0x67/0xf0
? lock_acquire+0xc7/0x510
? lock_downgrade+0x3d0/0x3d0
? lockdep_hardirqs_on_prepare+0x160/0x210
? lock_release+0xc8/0x620
? fs_reclaim_acquire+0x67/0xf0
? lock_acquire+0xc7/0x510
? poison_range+0x38/0x40
? unpoison_range+0x14/0x40
? trace_hardirqs_on+0x55/0x120
btrfs_find_all_roots_safe+0x142/0x1e0
? find_parent_nodes+0x1830/0x1830
? btrfs_inode_flags_to_xflags+0x50/0x50
iterate_extent_inodes+0x20e/0x580
? tree_backref_for_extent+0x230/0x230
? lock_downgrade+0x3d0/0x3d0
? read_extent_buffer+0xdd/0x110
? lock_downgrade+0x3d0/0x3d0
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x600
? __kasan_check_write+0x14/0x20
? _raw_spin_unlock+0x22/0x30
? __kasan_check_write+0x14/0x20
iterate_inodes_from_logical+0x129/0x170
? iterate_inodes_from_logical+0x129/0x170
? btrfs_inode_flags_to_xflags+0x50/0x50
? iterate_extent_inodes+0x580/0x580
? __vmalloc_node+0x92/0xb0
? init_data_container+0x34/0xb0
? init_data_container+0x34/0xb0
? kvmalloc_node+0x60/0x80
btrfs_ioctl_logical_to_ino+0x158/0x230
btrfs_ioctl+0x205e/0x4040
? __might_sleep+0x71/0xe0
? btrfs_ioctl_get_supported_features+0x30/0x30
? getrusage+0x4b6/0x9c0
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x620
? __might_fault+0x64/0xd0
? lock_acquire+0xc7/0x510
? lock_downgrade+0x3d0/0x3d0
? lockdep_hardirqs_on_prepare+0x210/0x210
? lockdep_hardirqs_on_prepare+0x210/0x210
? __kasan_check_read+0x11/0x20
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? lock_downgrade+0x3d0/0x3d0
? lockdep_hardirqs_on_prepare+0x210/0x210
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x620
? __task_pid_nr_ns+0xd3/0x250
? lock_acquire+0xc7/0x510
? __fget_files+0x160/0x230
? __fget_light+0xf2/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fd1976e2427
Code: 00 00 90 48 8b 05 (...)
RSP: 002b:00007fd1955e5cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fd1955e5f40 RCX: 00007fd1976e2427
RDX: 00007fd1955e5f48 RSI: 00000000c038943b RDI: 0000000000000004
RBP: 0000000001000000 R08: 0000000000000000 R09: 00007fd1955e6120
R10: 0000557835366b00 R11: 0000000000000246 R12: 0000000000000004
R13: 00007fd1955e5f48 R14: 00007fd1955e5f40 R15: 00007fd1955e5ef8
Modules linked in:
---[ end trace ec8931a1c36e57be ]---
(gdb) l *(__tree_mod_log_rewind+0x3b1)
0xffffffff81893521 is in __tree_mod_log_rewind (fs/btrfs/ctree.c:1210).
1205 * the modification. as we're going backwards, we do the
1206 * opposite of each operation here.
1207 */
1208 switch (tm->op) {
1209 case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
1210 BUG_ON(tm->slot < n);
1211 fallthrough;
1212 case MOD_LOG_KEY_REMOVE_WHILE_MOVING:
1213 case MOD_LOG_KEY_REMOVE:
1214 btrfs_set_node_key(eb, &tm->key, tm->slot);
Here's what happens to hit that BUG_ON():
1) We have one tree mod log user (through fiemap or the logical ino ioctl),
with a sequence number of 1, so we have fs_info->tree_mod_seq == 1;
2) Another task is at ctree.c:balance_level() and we have eb X currently as
the root of the tree, and we promote its single child, eb Y, as the new
root.
Then, at ctree.c:balance_level(), we call:
tree_mod_log_insert_root(eb X, eb Y, 1);
3) At tree_mod_log_insert_root() we create tree mod log elements for each
slot of eb X, of operation type MOD_LOG_KEY_REMOVE_WHILE_FREEING each
with a ->logical pointing to ebX->start. These are placed in an array
named tm_list.
Lets assume there are N elements (N pointers in eb X);
4) Then, still at tree_mod_log_insert_root(), we create a tree mod log
element of operation type MOD_LOG_ROOT_REPLACE, ->logical set to
ebY->start, ->old_root.logical set to ebX->start, ->old_root.level set
to the level of eb X and ->generation set to the generation of eb X;
5) Then tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
tm_list as argument. After that, tree_mod_log_free_eb() calls
__tree_mod_log_insert() for each member of tm_list in reverse order,
from highest slot in eb X, slot N - 1, to slot 0 of eb X;
6) __tree_mod_log_insert() sets the sequence number of each given tree mod
log operation - it increments fs_info->tree_mod_seq and sets
fs_info->tree_mod_seq as the sequence number of the given tree mod log
operation.
This means that for the tm_list created at tree_mod_log_insert_root(),
the element corresponding to slot 0 of eb X has the highest sequence
number (1 + N), and the element corresponding to the last slot has the
lowest sequence number (2);
7) Then, after inserting tm_list's elements into the tree mod log rbtree,
the MOD_LOG_ROOT_REPLACE element is inserted, which gets the highest
sequence number, which is N + 2;
8) Back to ctree.c:balance_level(), we free eb X by calling
btrfs_free_tree_block() on it. Because eb X was created in the current
transaction, has no other references and writeback did not happen for
it, we add it back to the free space cache/tree;
9) Later some other task T allocates the metadata extent from eb X, since
it is marked as free space in the space cache/tree, and uses it as a
node for some other btree;
10) The tree mod log user task calls btrfs_search_old_slot(), which calls
get_old_root(), and finally that calls __tree_mod_log_oldest_root()
with time_seq == 1 and eb_root == eb Y;
11) First iteration of the while loop finds the tree mod log element with
sequence number N + 2, for the logical address of eb Y and of type
MOD_LOG_ROOT_REPLACE;
12) Because the operation type is MOD_LOG_ROOT_REPLACE, we don't break out
of the loop, and set root_logical to point to tm->old_root.logical
which corresponds to the logical address of eb X;
13) On the next iteration of the while loop, the call to
tree_mod_log_search_oldest() returns the smallest tree mod log element
for the logical address of eb X, which has a sequence number of 2, an
operation type of MOD_LOG_KEY_REMOVE_WHILE_FREEING and corresponds to
the old slot N - 1 of eb X (eb X had N items in it before being freed);
14) We then break out of the while loop and return the tree mod log operation
of type MOD_LOG_ROOT_REPLACE (eb Y), and not the one for slot N - 1 of
eb X, to get_old_root();
15) At get_old_root(), we process the MOD_LOG_ROOT_REPLACE operation
and set "logical" to the logical address of eb X, which was the old
root. We then call tree_mod_log_search() passing it the logical
address of eb X and time_seq == 1;
16) Then before calling tree_mod_log_search(), task T adds a key to eb X,
which results in adding a tree mod log operation of type
MOD_LOG_KEY_ADD to the tree mod log - this is done at
ctree.c:insert_ptr() - but after adding the tree mod log operation
and before updating the number of items in eb X from 0 to 1...
17) The task at get_old_root() calls tree_mod_log_search() and gets the
tree mod log operation of type MOD_LOG_KEY_ADD just added by task T.
Then it enters the following if branch:
if (old_root && tm && tm->op != MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
(...)
} (...)
Calls read_tree_block() for eb X, which gets a reference on eb X but
does not lock it - task T has it locked.
Then it clones eb X while it has nritems set to 0 in its header, before
task T sets nritems to 1 in eb X's header. From hereupon we use the
clone of eb X which no other task has access to;
18) Then we call __tree_mod_log_rewind(), passing it the MOD_LOG_KEY_ADD
mod log operation we just got from tree_mod_log_search() in the
previous step and the cloned version of eb X;
19) At __tree_mod_log_rewind(), we set the local variable "n" to the number
of items set in eb X's clone, which is 0. Then we enter the while loop,
and in its first iteration we process the MOD_LOG_KEY_ADD operation,
which just decrements "n" from 0 to (u32)-1, since "n" is declared with
a type of u32. At the end of this iteration we call rb_next() to find the
next tree mod log operation for eb X, that gives us the mod log operation
of type MOD_LOG_KEY_REMOVE_WHILE_FREEING, for slot 0, with a sequence
number of N + 1 (steps 3 to 6);
20) Then we go back to the top of the while loop and trigger the following
BUG_ON():
(...)
switch (tm->op) {
case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
BUG_ON(tm->slot < n);
fallthrough;
(...)
Because "n" has a value of (u32)-1 (4294967295) and tm->slot is 0.
Fix this by taking a read lock on the extent buffer before cloning it at
ctree.c:get_old_root(). This should be done regardless of the extent
buffer having been freed and reused, as a concurrent task might be
modifying it (while holding a write lock on it).
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/20210227155037.GN28049@hungrycats.org/
Fixes:
|
||
|
34e49994d0 |
btrfs: fix slab cache flags for free space tree bitmap
The free space tree bitmap slab cache is created with SLAB_RED_ZONE but
that's a debugging flag and not always enabled. Also the other slabs are
created with at least SLAB_MEM_SPREAD that we want as well to average
the memory placement cost.
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Fixes:
|
||
|
60484cd9d5 |
btrfs: subpage: make readahead work properly
In readahead infrastructure, we are using a lot of hard coded PAGE_SHIFT while we're not doing anything specific to PAGE_SIZE. One of the most affected part is the radix tree operation of btrfs_fs_info::reada_tree. If using PAGE_SHIFT, subpage metadata readahead is broken and does no help reading metadata ahead. Fix the problem by using btrfs_fs_info::sectorsize_bits so that readahead could work for subpage. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d9bb77d51e |
btrfs: subpage: fix wild pointer access during metadata read failure
[BUG] When running fstests for btrfs subpage read-write test, it has a very high chance to crash at generic/475 with the following stack: BTRFS warning (device dm-8): direct IO failed ino 510 rw 1,34817 sector 0xcdf0 len 94208 err no 10 Unable to handle kernel paging request at virtual address ffff80001157e7c0 CPU: 2 PID: 687125 Comm: kworker/u12:4 Tainted: G WC 5.12.0-rc2-custom+ #5 Hardware name: Khadas VIM3 (DT) Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs] pc : queued_spin_lock_slowpath+0x1a0/0x390 lr : do_raw_spin_lock+0xc4/0x11c Call trace: queued_spin_lock_slowpath+0x1a0/0x390 _raw_spin_lock+0x68/0x84 btree_readahead_hook+0x38/0xc0 [btrfs] end_bio_extent_readpage+0x504/0x5f4 [btrfs] bio_endio+0x170/0x1a4 end_workqueue_fn+0x3c/0x60 [btrfs] btrfs_work_helper+0x1b0/0x1b4 [btrfs] process_one_work+0x22c/0x430 worker_thread+0x70/0x3a0 kthread+0x13c/0x140 ret_from_fork+0x10/0x30 Code: 910020e0 8b0200c2 f861d884 aa0203e1 (f8246827) [CAUSE] In end_bio_extent_readpage(), if we hit an error during read, we will handle the error differently for data and metadata. For data we queue a repair, while for metadata, we record the error and let the caller choose what to do. But the code is still using page->private to grab extent buffer, which no longer points to extent buffer for subpage metadata pages. Thus this wild pointer access leads to above crash. [FIX] Introduce a helper, find_extent_buffer_readpage(), to grab extent buffer. The difference against find_extent_buffer_nospinlock() is: - Also handles regular sectorsize == PAGE_SIZE case - No extent buffer refs increase/decrease As extent buffer under IO must have non-zero refs, so this is safe Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
e3d3b41576 |
btrfs: zoned: fix linked list corruption after log root tree allocation failure
When using a zoned filesystem, while syncing the log, if we fail to
allocate the root node for the log root tree, we are not removing the
log context we allocated on stack from the list of log contexts of the
log root tree. This means after the return from btrfs_sync_log() we get
a corrupted linked list.
Fix this by allocating the node before adding our stack allocated context
to the list of log contexts of the log root tree.
Fixes:
|
||
|
a3ee79bd8f |
btrfs: fix qgroup data rsv leak caused by falloc failure
[BUG]
When running fsstress with only falloc workload, and a very low qgroup
limit set, we can get qgroup data rsv leak at unmount time.
BTRFS warning (device dm-0): qgroup 0/5 has unreleased space, type 0 rsv 20480
BTRFS error (device dm-0): qgroup reserved space leaked
The minimal reproducer looks like:
#!/bin/bash
dev=/dev/test/test
mnt="/mnt/btrfs"
fsstress=~/xfstests-dev/ltp/fsstress
runtime=8
workload()
{
umount $dev &> /dev/null
umount $mnt &> /dev/null
mkfs.btrfs -f $dev > /dev/null
mount $dev $mnt
btrfs quota en $mnt
btrfs quota rescan -w $mnt
btrfs qgroup limit 16m 0/5 $mnt
$fsstress -w -z -f creat=10 -f fallocate=10 -p 2 -n 100 \
-d $mnt -v > /tmp/fsstress
umount $mnt
if dmesg | grep leak ; then
echo "!!! FAILED !!!"
exit 1
fi
}
for (( i=0; i < $runtime; i++)); do
echo "=== $i/$runtime==="
workload
done
Normally it would fail before round 4.
[CAUSE]
In function insert_prealloc_file_extent(), we first call
btrfs_qgroup_release_data() to know how many bytes are reserved for
qgroup data rsv.
Then use that @qgroup_released number to continue our work.
But after we call btrfs_qgroup_release_data(), we should either queue
@qgroup_released to delayed ref or free them manually in error path.
Unfortunately, we lack the error handling to free the released bytes,
leaking qgroup data rsv.
All the error handling function outside won't help at all, as we have
released the range, meaning in inode io tree, the EXTENT_QGROUP_RESERVED
bit is already cleared, thus all btrfs_qgroup_free_data() call won't
free any data rsv.
[FIX]
Add free_qgroup tag to manually free the released qgroup data rsv.
Reported-by: Nikolay Borisov <nborisov@suse.com>
Reported-by: David Sterba <dsterba@suse.cz>
Fixes:
|
||
|
fbf48bb0b1 |
btrfs: track qgroup released data in own variable in insert_prealloc_file_extent
There is a piece of weird code in insert_prealloc_file_extent(), which looks like: ret = btrfs_qgroup_release_data(inode, file_offset, len); if (ret < 0) return ERR_PTR(ret); if (trans) { ret = insert_reserved_file_extent(trans, inode, file_offset, &stack_fi, true, ret); ... } extent_info.is_new_extent = true; extent_info.qgroup_reserved = ret; ... Note how the variable @ret is abused here, and if anyone is adding code just after btrfs_qgroup_release_data() call, it's super easy to overwrite the @ret and cause tons of qgroup related bugs. Fix such abuse by introducing new variable @qgroup_released, so that we won't reuse the existing variable @ret. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d2dcc8ed8e |
btrfs: fix wrong offset to zero out range beyond i_size
[BUG] The test generic/091 fails , with the following output: fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W mapped writes DISABLED Seed set to 1 main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling! main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling! skipping zero size read truncating to largest ever: 0xe400 copying to largest ever: 0x1f400 cloning to largest ever: 0x70000 cloning to largest ever: 0x77000 fallocating to largest ever: 0x7a120 Mapped Read: non-zero data past EOF (0x3a7ff) page offset 0x800 is 0xf2e1 <<< ... [CAUSE] In commit |
||
|
ce307084c9 |
block-5.12-2021-03-12-v2
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmBLzKsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpi0ID/9djN1db0OrAjQgWdOQsKwzcPG4fmVRHJAu Zi8SPRj0ByonWGaPWjiSi297/j00dfYFFIXaB1Pfo4j0wX0IK8bJINl0G8SN6Dag WYBBrT/5rCQgD8fjQ1XhuzuqLwxwcZfYXAnCAlqABG18nPk532D4dX2CMEasl8F7 XWTTj5PqHDN4bCcriH1GEA5S+2nmoz5YXjNZEDcY3/pQMdyb8Jo9mRfZubkrnRxK c9fz2LjUz0IRaSb+9PILY5qDLOSIh+vHOIk/3BKW9DoqU/S3kTTr4twqnOclfVPH VgJM9b+sHveVCztCJ9bnNGkW7HWjUQa8gb/B40NBxKEhw7w/HCjykhhxd+QTUQTM GJVMRGYWhzuUEuU1M1hArPua0GLmPKSvC0CRgbKRmgPNjshTquZPJnBBFwv2wZKQ GkrwktdK9ihE1ya4gu20MupST3PIpT3jtc6NAizr6DCy0wJ0Z1X5KYnFdbtS79No I9qPC8lu3AcZq6NXdBfTO9ngIdiUwi9AfSYj7koS/4dmnVccVJmaj0/NNmVp2Ro3 HtaObanBnTi9v8YHl8WgX6lq5RjuQ204fXmd0No4mHFvgxsl7YaX+JBts7S3A2Nf PoQLqmulcLmzT3EVuEg279aXw2rbnyWHARbF/5/tIr4JcugtLJhwFnBA5YgFreq9 lSbqgoKSHw== =qHyO -----END PGP SIGNATURE----- Merge tag 'block-5.12-2021-03-12-v2' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Mostly just random fixes all over the map. The only odd-one-out change is finally getting the rename of BIO_MAX_PAGES to BIO_MAX_VECS done. This should've been done with the multipage bvec change, but it's been left. Do it now to avoid hassles around changes piling up for the next merge window. Summary: - NVMe pull request: - one more quirk (Dmitry Monakhov) - fix max_zone_append_sectors initialization (Chaitanya Kulkarni) - nvme-fc reset/create race fix (James Smart) - fix status code on aborts/resets (Hannes Reinecke) - fix the CSS check for ZNS namespaces (Chaitanya Kulkarni) - fix a use after free in a debug printk in nvme-rdma (Lv Yunlong) - Follow-up NVMe error fix for NULL 'id' (Christoph) - Fixup for the bd_size_lock being IRQ safe, now that the offending driver has been dropped (Damien). - rsxx probe failure error return (Jia-Ju) - umem probe failure error return (Wei) - s390/dasd unbind fixes (Stefan) - blk-cgroup stats summing fix (Xunlei) - zone reset handling fix (Damien) - Rename BIO_MAX_PAGES to BIO_MAX_VECS (Christoph) - Suppress uevent trigger for hidden devices (Daniel) - Fix handling of discard on busy device (Jan) - Fix stale cache issue with zone reset (Shin'ichiro)" * tag 'block-5.12-2021-03-12-v2' of git://git.kernel.dk/linux-block: nvme: fix the nsid value to print in nvme_validate_or_alloc_ns block: Discard page cache of zone reset target range block: Suppress uevent for hidden device when removed block: rename BIO_MAX_PAGES to BIO_MAX_VECS nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a Samsung PM1725a nvme-rdma: Fix a use after free in nvmet_rdma_write_data_done nvme-core: check ctrl css before setting up zns nvme-fc: fix racing controller reset and create association nvme-fc: return NVME_SC_HOST_ABORTED_CMD when a command has been aborted nvme-fc: set NVME_REQ_CANCELLED in nvme_fc_terminate_exchange() nvme: add NVME_REQ_CANCELLED flag in nvme_cancel_request() nvme: simplify error logic in nvme_validate_ns() nvme: set max_zone_append_sectors nvme_revalidate_zones block: rsxx: fix error return code of rsxx_pci_probe() block: Fix REQ_OP_ZONE_RESET_ALL handling umem: fix error return code in mm_pci_probe() blk-cgroup: Fix the recursive blkg rwstat s390/dasd: fix hanging IO request during DASD driver unbind s390/dasd: fix hanging DASD driver unbind block: Try to handle busy underlying device on discard |
||
|
a8affc03a9 |
block: rename BIO_MAX_PAGES to BIO_MAX_VECS
Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop confusing users of the bio API. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
f09b04cc64 |
for-5.12-rc1-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmBCOi4ACgkQxWXV+ddt WDtXvw//TWx3m05qHJqqG8V90uel8hB2J5vd4CA2r62Je1G8RDho57Bo7fyvL4l+ mdCPt+INajb0mpp0IoHMtyLHefojgNOsrX6FAK1/gjnLkjRLFZ3wQqkA34Ue9pNs 2u+rMY6eB105iaS3VejEmiebr++MZfjfQRV+GXU336AEeOEDZdgol8o6jMyde5TO zRH9Dni5Sy/YAGGAb0vaoG2BMyVigrqkbjkzwjYChbUj/KuyffAgQj0v8BvsC9Y6 DnPD5yrt5kSZzuqQFH7c2jxLN0cvW+tJ0znCpnwn/nmiCALbl6y2a4dmewC32TwJ II+3OPGpYudafLJEP15qafsJb7LmEfnGwUIrfEZbyb4lQG12uyYOdP3IN7+8td14 fd29GE62w5aErsmurcMFj/x43k4DIfcqC8b+Y+S27JZF1szh7ExCfoYC/6c5e5Qf j6/6RtRSVqdxImRd0QYv3mCIeSG0CH2UR/1otvC81jRTHRyB3r6TV8wPLo+5K/Rk ongKZ+BQa5RUk8skdFburhrkDDKgfBcjlexl5Gsqw+D/xTGNAcVnNQrTtW9sTSle hB3b7CunXA1eCyui2SIqN1dR8hwao4b9RzYNs3y2jWjSPZD/Bp0BdQ8oxSPvIWkX a8kauFGhKhY2Tdqau+CQ4UbbQWzEB7FulkPCOLiHDDZjyxIvAA4= =tlU3 -----END PGP SIGNATURE----- Merge tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "More regression fixes and stabilization. Regressions: - zoned mode - count zone sizes in wider int types - fix space accounting for read-only block groups - subpage: fix page tail zeroing Fixes: - fix spurious warning when remounting with free space tree - fix warning when creating a directory with smack enabled - ioctl checks for qgroup inheritance when creating a snapshot - qgroup - fix missing unlock on error path in zero range - fix amount of released reservation on error - fix flushing from unsafe context with open transaction, potentially deadlocking - minor build warning fixes" * tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: do not account freed region of read-only block group as zone_unusable btrfs: zoned: use sector_t for zone sectors btrfs: subpage: fix the false data csum mismatch error btrfs: fix warning when creating a directory with smack enabled btrfs: don't flush from btrfs_delayed_inode_reserve_metadata btrfs: export and rename qgroup_reserve_meta btrfs: free correct amount of space in btrfs_delayed_inode_reserve_metadata btrfs: fix spurious free_space_tree remount warning btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctl btrfs: unlock extents in btrfs_zero_range in case of quota reservation errors btrfs: ref-verify: use 'inline void' keyword ordering |
||
|
badae9c869 |
btrfs: zoned: do not account freed region of read-only block group as zone_unusable
We migrate zone unusable bytes to read-only bytes when a block group is
set to read-only, and account all the free region as bytes_readonly.
Thus, we should not increase block_group->zone_unusable when the block
group is read-only.
Fixes:
|
||
|
d734492a14 |
btrfs: zoned: use sector_t for zone sectors
We need to use sector_t for zone_sectors, or it would set the zone size
to zero when the size >= 4GB (= 2^24 sectors) by shifting the
zone_sectors value by SECTOR_SHIFT. We're assuming zones sizes up to
8GiB.
Fixes:
|
||
|
c28ea613fa |
btrfs: subpage: fix the false data csum mismatch error
[BUG] When running fstresss, we can hit strange data csum mismatch where the on-disk data is in fact correct (passes scrub). With some extra debug info added, we have the following traces: 0482us: btrfs_do_readpage: root=5 ino=284 offset=393216, submit force=0 pgoff=0 iosize=8192 0494us: btrfs_do_readpage: root=5 ino=284 offset=401408, submit force=0 pgoff=8192 iosize=4096 0498us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=393216 len=8192 0591us: btrfs_do_readpage: root=5 ino=284 offset=405504, submit force=0 pgoff=12288 iosize=36864 0594us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=401408 len=4096 0863us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=405504 len=36864 0933us: btrfs_verify_data_csum: root=5 ino=284 offset=393216 len=8192 0967us: btrfs_do_readpage: root=5 ino=284 offset=442368, skip beyond isize pgoff=49152 iosize=16384 1047us: btrfs_verify_data_csum: root=5 ino=284 offset=401408 len=4096 1163us: btrfs_verify_data_csum: root=5 ino=284 offset=405504 len=36864 1290us: check_data_csum: !!! root=5 ino=284 offset=438272 pg_off=45056 !!! 7387us: end_bio_extent_readpage: root=5 ino=284 before pending_read_bios=0 [CAUSE] Normally we expect all submitted bio reads to only touch the range we specified, and under subpage context, it means we should only touch the range specified in each bvec. But in data read path, inside end_bio_extent_readpage(), we have page zeroing which only takes regular page size into consideration. This means for subpage if we have an inode whose content looks like below: 0 16K 32K 48K 64K |///////| |///////| | |//| = data needs to be read from disk | | = hole And i_size is 64K initially. Then the following race can happen: T1 | T2 --------------------------------+-------------------------------- btrfs_do_readpage() | |- isize = 64K; | | At this time, the isize is | | 64K | | | |- submit_extent_page() | | submit previous assembled bio| | assemble bio for [0, 16K) | | | |- submit_extent_page() | submit read bio for [0, 16K) | assemble read bio for | [32K, 48K) | | | btrfs_setsize() | |- i_size_write(, 16K); | Now i_size is only 16K end_io() for [0K, 16K) | |- end_bio_extent_readpage() | |- btrfs_verify_data_csum() | | No csum error | |- i_size = 16K; | |- zero_user_segment(16K, | PAGE_SIZE); | !!! We zeroed range | !!! [32K, 48K) | | end_io for [32K, 48K) | |- end_bio_extent_readpage() | |- btrfs_verify_data_csum() | ! CSUM MISMATCH ! | ! As the range is zeroed now ! [FIX] To fix the problem, make end_bio_extent_readpage() to only zero the range of bvec. The bug only affects subpage read-write support, as for full read-only mount we can't change i_size thus won't hit the race condition. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
fd57a98d6f |
btrfs: fix warning when creating a directory with smack enabled
When we have smack enabled, during the creation of a directory smack may attempt to add a "smack transmute" xattr on the inode, which results in the following warning and trace: WARNING: CPU: 3 PID: 2548 at fs/btrfs/transaction.c:537 start_transaction+0x489/0x4f0 Modules linked in: nft_objref nf_conntrack_netbios_ns (...) CPU: 3 PID: 2548 Comm: mkdir Not tainted 5.9.0-rc2smack+ #81 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:start_transaction+0x489/0x4f0 Code: e9 be fc ff ff (...) RSP: 0018:ffffc90001887d10 EFLAGS: 00010202 RAX: ffff88816f1e0000 RBX: 0000000000000201 RCX: 0000000000000003 RDX: 0000000000000201 RSI: 0000000000000002 RDI: ffff888177849000 RBP: ffff888177849000 R08: 0000000000000001 R09: 0000000000000004 R10: ffffffff825e8f7a R11: 0000000000000003 R12: ffffffffffffffe2 R13: 0000000000000000 R14: ffff88803d884270 R15: ffff8881680d8000 FS: 00007f67317b8440(0000) GS:ffff88817bcc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f67247a22a8 CR3: 000000004bfbc002 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? slab_free_freelist_hook+0xea/0x1b0 ? trace_hardirqs_on+0x1c/0xe0 btrfs_setxattr_trans+0x3c/0xf0 __vfs_setxattr+0x63/0x80 smack_d_instantiate+0x2d3/0x360 security_d_instantiate+0x29/0x40 d_instantiate_new+0x38/0x90 btrfs_mkdir+0x1cf/0x1e0 vfs_mkdir+0x14f/0x200 do_mkdirat+0x6d/0x110 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f673196ae6b Code: 8b 05 11 (...) RSP: 002b:00007ffc3c679b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000053 RAX: ffffffffffffffda RBX: 00000000000001ff RCX: 00007f673196ae6b RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007ffc3c67a30d RBP: 00007ffc3c67a30d R08: 00000000000001ff R09: 0000000000000000 R10: 000055d3e39fe930 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffc3c679cd8 R14: 00007ffc3c67a30d R15: 00007ffc3c679ce0 irq event stamp: 11029 hardirqs last enabled at (11037): [<ffffffff81153fe6>] console_unlock+0x486/0x670 hardirqs last disabled at (11044): [<ffffffff81153c01>] console_unlock+0xa1/0x670 softirqs last enabled at (8864): [<ffffffff81e0102f>] asm_call_on_stack+0xf/0x20 softirqs last disabled at (8851): [<ffffffff81e0102f>] asm_call_on_stack+0xf/0x20 This happens because at btrfs_mkdir() we call d_instantiate_new() while holding a transaction handle, which results in the following call chain: btrfs_mkdir() trans = btrfs_start_transaction(root, 5); d_instantiate_new() smack_d_instantiate() __vfs_setxattr() btrfs_setxattr_trans() btrfs_start_transaction() start_transaction() WARN_ON() --> a tansaction start has TRANS_EXTWRITERS set in its type h->orig_rsv = h->block_rsv h->block_rsv = NULL btrfs_end_transaction(trans) Besides the warning triggered at start_transaction, we set the handle's block_rsv to NULL which may cause some surprises later on. So fix this by making btrfs_setxattr_trans() not start a transaction when we already have a handle on one, stored in current->journal_info, and use that handle. We are good to use the handle because at btrfs_mkdir() we did reserve space for the xattr and the inode item. Reported-by: Casey Schaufler <casey@schaufler-ca.com> CC: stable@vger.kernel.org # 5.4+ Acked-by: Casey Schaufler <casey@schaufler-ca.com> Tested-by: Casey Schaufler <casey@schaufler-ca.com> Link: https://lore.kernel.org/linux-btrfs/434d856f-bd7b-4889-a6ec-e81aaebfa735@schaufler-ca.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
4d14c5cde5 |
btrfs: don't flush from btrfs_delayed_inode_reserve_metadata
Calling btrfs_qgroup_reserve_meta_prealloc from btrfs_delayed_inode_reserve_metadata can result in flushing delalloc while holding a transaction and delayed node locks. This is deadlock prone. In the past multiple commits: * |
||
|
80e9baed72 |
btrfs: export and rename qgroup_reserve_meta
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
0f9c03d824 |
btrfs: free correct amount of space in btrfs_delayed_inode_reserve_metadata
Following commit |
||
|
c55a4319c4 |
btrfs: fix spurious free_space_tree remount warning
The intended logic of the check is to catch cases where the desired free_space_tree setting doesn't match the mounted setting, and the remount is anything but ro->rw. However, it makes the mistake of checking equality on a masked integer (btrfs_test_opt) against a boolean (btrfs_fs_compat_ro). If you run the reproducer: $ mount -o space_cache=v2 dev mnt $ mount -o remount,ro mnt you would expect no warning, because the remount is not attempting to change the free space tree setting, but we do see the warning. To fix this, add explicit bool type casts to the condition. I tested a variety of transitions: sudo mount -o space_cache=v2 /dev/vg0/lv0 mnt/lol (fst enabled) mount -o remount,ro mnt/lol (no warning, no fst change) sudo mount -o remount,rw,space_cache=v1,clear_cache (no warning, ro->rw) sudo mount -o remount,rw,space_cache=v2 mnt (warning, rw->rw with change) sudo mount -o remount,ro mnt (no warning, no fst change) sudo mount -o remount,rw,space_cache=v2 mnt (no warning, no fst change) Reported-by: Chris Murphy <lists@colorremedies.com> CC: stable@vger.kernel.org # 5.11 Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
5011c5a663 |
btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctl
The problem is we're copying "inherit" from user space but we don't
necessarily know that we're copying enough data for a 64 byte
struct. Then the next problem is that 'inherit' has a variable size
array at the end, and we have to verify that array is the size we
expected.
Fixes:
|
||
|
4f6a49de64 |
btrfs: unlock extents in btrfs_zero_range in case of quota reservation errors
If btrfs_qgroup_reserve_data returns an error (i.e quota limit reached)
the handling logic directly goes to the 'out' label without first
unlocking the extent range between lockstart, lockend. This results in
deadlocks as other processes try to lock the same extent.
Fixes:
|
||
|
aedb9d9089 |
btrfs: ref-verify: use 'inline void' keyword ordering
Fix build warnings of function signature when CONFIG_STACKTRACE is not enabled by reordering the 'inline' and 'void' keywords. ../fs/btrfs/ref-verify.c:221:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration] static void inline __save_stack_trace(struct ref_action *ra) ../fs/btrfs/ref-verify.c:225:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration] static void inline __print_stack_trace(struct btrfs_fs_info *fs_info, Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
7a7fd0de4a |
Merge branch 'kmap-conversion-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull kmap conversion updates from David Sterba: "This contains changes regarding kmap API use and eg conversion from kmap_atomic to kmap_local_page. The API belongs to memory management but to save cross-tree dependency headaches we've agreed to take it through the btrfs tree because there are some trivial conversions possible, while the rest will need some time and getting the easy cases out of the way would be convenient. The changes can be grouped: - function exports, new helpers - new VM_BUG_ON for additional verification; it's been discussed if it should be VM_BUG_ON or BUG_ON, the former was chosen due to performance reasons - code replaced by relevant helpers" [ This is an updated version of a request that originally came in during the merge window, but I asked for some updates: https://lore.kernel.org/lkml/cover.1614090658.git.dsterba@suse.com/ which is why this got merge after the merge window closed. - Linus ] * 'kmap-conversion-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: use copy_highpage() instead of 2 kmaps() btrfs: use memcpy_[to|from]_page() and kmap_local_page() mm/highmem: Add VM_BUG_ON() to mem*_page() calls mm/highmem: Introduce memcpy_page(), memmove_page(), and memset_page() mm/highmem: Convert memcpy_[to|from]_page() to kmap_local_page() mm/highmem: Lift memcpy_[to|from]_page to core |
||
|
c608aca57d |
for-5.12-rc1-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmA85UwACgkQxWXV+ddt WDsdeA/8DXM6pMGaLkYcvkGvR53/vWwQlKq+i+3zuc41fYFJ7k+DQ7/K5hDbEMoM E7YsksoRlNVruH/ZvSdtx1exQ/tNrTdqPuds/UR31lIvS2NX9OZZToGWoC8VmrNw eS9yAwz/7JKUBA6MlMxZFv89OJoHUX9brPSeZVA8hOo3jDr5LXVm0IBskYOBUDRx JIvt+lkJLKMXPWxwUt3hbkbFPAUQVxYYavhJhWiXT9gdxF+eRgjMI0EN43vBMN2y kZtoZGeWR64heo9ehFzYMDlAVyph/loGovQ7m6XVzkk5DQGitg0vs3iAG46WjEXt jxt0ZKmJQwJb3/zNPd8VlLMhULGc56jcq8uhaC2pXjhy18p7EAXml+fH51BExLYK 11hiWtWsrbTsZuYgr6fpqVFukkL/yyH/s7iCWT8Wn+AoPg2fUD99F5nkKT2T0Sso t7MyJVlTdq8avWbTB+8kFx8+Hy1TsRz3Ic2Zpm8+F3KeVflrb31jJIp3cxPCdfUp fWX+7VDjKVt00Ti7uP0fAaFO4hn2FjYcWzR3KOjomWox+8LVxB8PbD4H8jD7As2a 5gGGOULmkiZej7hcP6J6zvnmgZIVAGPsSGSVfZtPh4VGiycL3DozcD0x5QerLchR NZDyIBh2KGE0cRr+cjkPxDyeqfGXQ7VUjp13CBriCkER8SOmBdw= =QJEy -----END PGP SIGNATURE----- Merge tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "This is the first batch of fixes that usually arrive during the merge window code freeze. Regressions and stable material. Regressions: - fix deadlock in log sync in zoned mode - fix bugs in subpage mode still wrongly assuming sectorsize == page size Fixes: - fix missing kunmap of the Q stripe in RAID6 - block group fixes: - fix race between extent freeing/allocation when using bitmaps - avoid double put of block group when emptying cluster - swapfile fixes: - fix swapfile writes vs running scrub - fix swapfile activation vs snapshot creation - fix stale data exposure after cloning a hole with NO_HOLES enabled - remove tree-checker check that does not work in case information from other leaves is necessary" * tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: fix deadlock on log sync btrfs: avoid double put of block group when emptying cluster btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled btrfs: tree-checker: do not error out if extent ref hash doesn't match btrfs: fix race between swap file activation and snapshot creation btrfs: fix race between writes to swap files and scrub btrfs: avoid checking for RO block group twice during nocow writeback btrfs: fix race between extent freeing/allocation when using bitmaps btrfs: make check_compressed_csum() to be subpage compatible btrfs: make btrfs_submit_compressed_read() subpage compatible btrfs: fix raid6 qstripe kmap |
||
|
80cc838423 |
btrfs: use copy_highpage() instead of 2 kmaps()
There are many places where kmap/memove/kunmap patterns occur. This pattern exists in the core common function copy_highpage(). Use copy_highpage to avoid open coding the use of kmap and leverages the core functions use of kmap_local_page(). Development of this patch was aided by the following coccinelle script: // <smpl> // SPDX-License-Identifier: GPL-2.0-only // Find kmap/copypage/kunmap pattern and replace with copy_highpage calls // // NOTE: The expressions in the copy page version of this kmap pattern are // overly complex and so these all need individual attention. // // Confidence: Low // Copyright: (C) 2021 Intel Corporation // URL: http://coccinelle.lip6.fr/ // Comments: // Options: // // Then a copy_page where we have 2 pages involved. // @ copy_page_rule @ expression page, page2, To, From, Size; identifier ptr, ptr2; type VP, VP2; @@ /* kmap */ ( -VP ptr = kmap(page); ... -VP2 ptr2 = kmap(page2); | -VP ptr = kmap_atomic(page); ... -VP2 ptr2 = kmap_atomic(page2); | -ptr = kmap(page); ... -ptr2 = kmap(page2); | -ptr = kmap_atomic(page); ... -ptr2 = kmap_atomic(page2); ) // 1 or more copy versions of the entire page <+... ( -copy_page(To, From); +copy_highpage(To, From); | -memmove(To, From, Size); +memmoveExtra(To, From, Size); ) ...+> /* kunmap */ ( -kunmap(page2); ... -kunmap(page); | -kunmap(page); ... -kunmap(page2); | -kmap_atomic(ptr2); ... -kmap_atomic(ptr); ) // Remove any pointers left unused @ depends on copy_page_rule @ identifier copy_page_rule.ptr; identifier copy_page_rule.ptr2; type VP, VP1; type VP2, VP21; @@ -VP ptr; ... when != ptr; ? VP1 ptr; -VP2 ptr2; ... when != ptr2; ? VP21 ptr2; // </smpl> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
3590ec5899 |
btrfs: use memcpy_[to|from]_page() and kmap_local_page()
There are many places where the pattern kmap/memcpy/kunmap occurs. This pattern was lifted to the core common functions memcpy_[to|from]_page(). Use these new functions to reduce the code, eliminate direct uses of kmap, and leverage the new core functions use of kmap_local_page(). Also, there is 1 place where a kmap/memcpy is followed by an optional memset. Here we leave the kmap open coded to avoid remapping the page but use kmap_local_page() directly. Development of this patch was aided by the coccinelle script: // <smpl> // SPDX-License-Identifier: GPL-2.0-only // Find kmap/memcpy/kunmap pattern and replace with memcpy*page calls // // NOTE: Offsets and other expressions may be more complex than what the script // will automatically generate. Therefore a catchall rule is provided to find // the pattern which then must be evaluated by hand. // // Confidence: Low // Copyright: (C) 2021 Intel Corporation // URL: http://coccinelle.lip6.fr/ // Comments: // Options: // // simple memcpy version // @ memcpy_rule1 @ expression page, T, F, B, Off; identifier ptr; type VP; @@ ( -VP ptr = kmap(page); | -ptr = kmap(page); | -VP ptr = kmap_atomic(page); | -ptr = kmap_atomic(page); ) <+... ( -memcpy(ptr + Off, F, B); +memcpy_to_page(page, Off, F, B); | -memcpy(ptr, F, B); +memcpy_to_page(page, 0, F, B); | -memcpy(T, ptr + Off, B); +memcpy_from_page(T, page, Off, B); | -memcpy(T, ptr, B); +memcpy_from_page(T, page, 0, B); ) ...+> ( -kunmap(page); | -kunmap_atomic(ptr); ) // Remove any pointers left unused @ depends on memcpy_rule1 @ identifier memcpy_rule1.ptr; type VP, VP1; @@ -VP ptr; ... when != ptr; ? VP1 ptr; // // Some callers kmap without a temp pointer // @ memcpy_rule2 @ expression page, T, Off, F, B; @@ <+... ( -memcpy(kmap(page) + Off, F, B); +memcpy_to_page(page, Off, F, B); | -memcpy(kmap(page), F, B); +memcpy_to_page(page, 0, F, B); | -memcpy(T, kmap(page) + Off, B); +memcpy_from_page(T, page, Off, B); | -memcpy(T, kmap(page), B); +memcpy_from_page(T, page, 0, B); ) ...+> -kunmap(page); // No need for the ptr variable removal // // Catch all // @ memcpy_rule3 @ expression page; expression GenTo, GenFrom, GenSize; identifier ptr; type VP; @@ ( -VP ptr = kmap(page); | -ptr = kmap(page); | -VP ptr = kmap_atomic(page); | -ptr = kmap_atomic(page); ) <+... ( // // Some call sites have complex expressions within the memcpy // match a catch all to be evaluated by hand. // -memcpy(GenTo, GenFrom, GenSize); +memcpy_to_pageExtra(page, GenTo, GenFrom, GenSize); +memcpy_from_pageExtra(GenTo, page, GenFrom, GenSize); ) ...+> ( -kunmap(page); | -kunmap_atomic(ptr); ) // Remove any pointers left unused @ depends on memcpy_rule3 @ identifier memcpy_rule3.ptr; type VP, VP1; @@ -VP ptr; ... when != ptr; ? VP1 ptr; // <smpl> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
87fa0f3eb2 |
mm/filemap: rename generic_file_buffered_read to filemap_read
Rename generic_file_buffered_read to match the naming of filemap_fault, also update the written parameter to a more descriptive name and improve the kerneldoc comment. Link: https://lkml.kernel.org/r/20210122160140.223228-18-willy@infradead.org Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
7d6beb71da |
idmapped-mounts-v5.12
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
=yPaw
-----END PGP SIGNATURE-----
Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull idmapped mounts from Christian Brauner:
"This introduces idmapped mounts which has been in the making for some
time. Simply put, different mounts can expose the same file or
directory with different ownership. This initial implementation comes
with ports for fat, ext4 and with Christoph's port for xfs with more
filesystems being actively worked on by independent people and
maintainers.
Idmapping mounts handle a wide range of long standing use-cases. Here
are just a few:
- Idmapped mounts make it possible to easily share files between
multiple users or multiple machines especially in complex
scenarios. For example, idmapped mounts will be used in the
implementation of portable home directories in
systemd-homed.service(8) where they allow users to move their home
directory to an external storage device and use it on multiple
computers where they are assigned different uids and gids. This
effectively makes it possible to assign random uids and gids at
login time.
- It is possible to share files from the host with unprivileged
containers without having to change ownership permanently through
chown(2).
- It is possible to idmap a container's rootfs and without having to
mangle every file. For example, Chromebooks use it to share the
user's Download folder with their unprivileged containers in their
Linux subsystem.
- It is possible to share files between containers with
non-overlapping idmappings.
- Filesystem that lack a proper concept of ownership such as fat can
use idmapped mounts to implement discretionary access (DAC)
permission checking.
- They allow users to efficiently changing ownership on a per-mount
basis without having to (recursively) chown(2) all files. In
contrast to chown (2) changing ownership of large sets of files is
instantenous with idmapped mounts. This is especially useful when
ownership of a whole root filesystem of a virtual machine or
container is changed. With idmapped mounts a single syscall
mount_setattr syscall will be sufficient to change the ownership of
all files.
- Idmapped mounts always take the current ownership into account as
idmappings specify what a given uid or gid is supposed to be mapped
to. This contrasts with the chown(2) syscall which cannot by itself
take the current ownership of the files it changes into account. It
simply changes the ownership to the specified uid and gid. This is
especially problematic when recursively chown(2)ing a large set of
files which is commong with the aforementioned portable home
directory and container and vm scenario.
- Idmapped mounts allow to change ownership locally, restricting it
to specific mounts, and temporarily as the ownership changes only
apply as long as the mount exists.
Several userspace projects have either already put up patches and
pull-requests for this feature or will do so should you decide to pull
this:
- systemd: In a wide variety of scenarios but especially right away
in their implementation of portable home directories.
https://systemd.io/HOME_DIRECTORY/
- container runtimes: containerd, runC, LXD:To share data between
host and unprivileged containers, unprivileged and privileged
containers, etc. The pull request for idmapped mounts support in
containerd, the default Kubernetes runtime is already up for quite
a while now: https://github.com/containerd/containerd/pull/4734
- The virtio-fs developers and several users have expressed interest
in using this feature with virtual machines once virtio-fs is
ported.
- ChromeOS: Sharing host-directories with unprivileged containers.
I've tightly synced with all those projects and all of those listed
here have also expressed their need/desire for this feature on the
mailing list. For more info on how people use this there's a bunch of
talks about this too. Here's just two recent ones:
https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
https://fosdem.org/2021/schedule/event/containers_idmap/
This comes with an extensive xfstests suite covering both ext4 and
xfs:
https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts
It covers truncation, creation, opening, xattrs, vfscaps, setid
execution, setgid inheritance and more both with idmapped and
non-idmapped mounts. It already helped to discover an unrelated xfs
setgid inheritance bug which has since been fixed in mainline. It will
be sent for inclusion with the xfstests project should you decide to
merge this.
In order to support per-mount idmappings vfsmounts are marked with
user namespaces. The idmapping of the user namespace will be used to
map the ids of vfs objects when they are accessed through that mount.
By default all vfsmounts are marked with the initial user namespace.
The initial user namespace is used to indicate that a mount is not
idmapped. All operations behave as before and this is verified in the
testsuite.
Based on prior discussions we want to attach the whole user namespace
and not just a dedicated idmapping struct. This allows us to reuse all
the helpers that already exist for dealing with idmappings instead of
introducing a whole new range of helpers. In addition, if we decide in
the future that we are confident enough to enable unprivileged users
to setup idmapped mounts the permission checking can take into account
whether the caller is privileged in the user namespace the mount is
currently marked with.
The user namespace the mount will be marked with can be specified by
passing a file descriptor refering to the user namespace as an
argument to the new mount_setattr() syscall together with the new
MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
of extensibility.
The following conditions must be met in order to create an idmapped
mount:
- The caller must currently have the CAP_SYS_ADMIN capability in the
user namespace the underlying filesystem has been mounted in.
- The underlying filesystem must support idmapped mounts.
- The mount must not already be idmapped. This also implies that the
idmapping of a mount cannot be altered once it has been idmapped.
- The mount must be a detached/anonymous mount, i.e. it must have
been created by calling open_tree() with the OPEN_TREE_CLONE flag
and it must not already have been visible in the filesystem.
The last two points guarantee easier semantics for userspace and the
kernel and make the implementation significantly simpler.
By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes are observed.
The manpage with a detailed description can be found here:
|
||
|
6e37d24599 |
btrfs: zoned: fix deadlock on log sync
Lockdep with fstests test case btrfs/041 detected a unsafe locking
scenario when we allocate the log node on a zoned filesystem.
btrfs/041
============================================
WARNING: possible recursive locking detected
5.11.0-rc7+ #939 Not tainted
--------------------------------------------
xfs_io/698 is trying to acquire lock:
ffff88810cd673a0 (&root->log_mutex){+.+.}-{3:3}, at: btrfs_sync_log+0x3d1/0xee0 [btrfs]
but task is already holding lock:
ffff88810b0fc3a0 (&root->log_mutex){+.+.}-{3:3}, at: btrfs_sync_log+0x313/0xee0 [btrfs]
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&root->log_mutex);
lock(&root->log_mutex);
*** DEADLOCK ***
May be due to missing lock nesting notation
2 locks held by xfs_io/698:
#0: ffff88810cd66620 (sb_internal){.+.+}-{0:0}, at: btrfs_sync_file+0x2c3/0x570 [btrfs]
#1: ffff88810b0fc3a0 (&root->log_mutex){+.+.}-{3:3}, at: btrfs_sync_log+0x313/0xee0 [btrfs]
stack backtrace:
CPU: 0 PID: 698 Comm: xfs_io Not tainted 5.11.0-rc7+ #939
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
Call Trace:
dump_stack+0x77/0x97
__lock_acquire.cold+0xb9/0x32a
lock_acquire+0xb5/0x400
? btrfs_sync_log+0x3d1/0xee0 [btrfs]
__mutex_lock+0x7b/0x8d0
? btrfs_sync_log+0x3d1/0xee0 [btrfs]
? btrfs_sync_log+0x3d1/0xee0 [btrfs]
? find_first_extent_bit+0x9f/0x100 [btrfs]
? __mutex_unlock_slowpath+0x35/0x270
btrfs_sync_log+0x3d1/0xee0 [btrfs]
btrfs_sync_file+0x3a8/0x570 [btrfs]
__x64_sys_fsync+0x34/0x60
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens, because we are taking the ->log_mutex albeit it has already
been locked.
Also while at it, fix the bogus unlock of the tree_log_mutex in the error
handling.
Fixes:
|
||
|
95c85fba1f |
btrfs: avoid double put of block group when emptying cluster
It's wrong calling btrfs_put_block_group in
__btrfs_return_cluster_to_free_space if the block group passed is
different than the block group the cluster represents. As this means the
cluster doesn't have a reference to the passed block group. This results
in double put and a use-after-free bug.
Fix this by simply bailing if the block group we passed in does not
match the block group on the cluster.
Fixes:
|
||
|
3660d0bcdb |
btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled
When using the NO_HOLES feature, if we clone a file range that spans only a hole into a range that is at or beyond the current i_size of the destination file, we end up not setting the full sync runtime flag on the inode. As a result, if we then fsync the destination file and have a power failure, after log replay we can end up exposing stale data instead of having a hole for that range. The conditions for this to happen are the following: 1) We have a file with a size of, for example, 1280K; 2) There is a written (non-prealloc) extent for the file range from 1024K to 1280K with a length of 256K; 3) This particular file extent layout is durably persisted, so that the existing superblock persisted on disk points to a subvolume root where the file has that exact file extent layout and state; 4) The file is truncated to a smaller size, to an offset lower than the start offset of its last extent, for example to 800K. The truncate sets the full sync runtime flag on the inode; 6) Fsync the file to log it and clear the full sync runtime flag; 7) Clone a region that covers only a hole (implicit hole due to NO_HOLES) into the file with a destination offset that starts at or beyond the 256K file extent item we had - for example to offset 1024K; 8) Since the clone operation does not find extents in the source range, we end up in the if branch at the bottom of btrfs_clone() where we punch a hole for the file range starting at offset 1024K by calling btrfs_replace_file_extents(). There we end up not setting the full sync flag on the inode, because we don't know we are being called in a clone context (and not fallocate's punch hole operation), and neither do we create an extent map to represent a hole because the requested range is beyond eof; 9) A further fsync to the file will be a fast fsync, since the clone operation did not set the full sync flag, and therefore it relies on modified extent maps to correctly log the file layout. But since it does not find any extent map marking the range from 1024K (the previous eof) to the new eof, it does not log a file extent item for that range representing the hole; 10) After a power failure no hole for the range starting at 1024K is punched and we end up exposing stale data from the old 256K extent. Turning this into exact steps: $ mkfs.btrfs -f -O no-holes /dev/sdi $ mount /dev/sdi /mnt # Create our test file with 3 extents of 256K and a 256K hole at offset # 256K. The file has a size of 1280K. $ xfs_io -f -s \ -c "pwrite -S 0xab -b 256K 0 256K" \ -c "pwrite -S 0xcd -b 256K 512K 256K" \ -c "pwrite -S 0xef -b 256K 768K 256K" \ -c "pwrite -S 0x73 -b 256K 1024K 256K" \ /mnt/sdi/foobar # Make sure it's durably persisted. We want the last committed super # block to point to this particular file extent layout. sync # Now truncate our file to a smaller size, falling within a position of # the second extent. This sets the full sync runtime flag on the inode. # Then fsync the file to log it and clear the full sync flag from the # inode. The third extent is no longer part of the file and therefore # it is not logged. $ xfs_io -c "truncate 800K" -c "fsync" /mnt/foobar # Now do a clone operation that only clones the hole and sets back the # file size to match the size it had before the truncate operation # (1280K). $ xfs_io \ -c "reflink /mnt/foobar 256K 1024K 256K" \ -c "fsync" \ /mnt/foobar # File data before power failure: $ od -A d -t x1 /mnt/foobar 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab * 0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd * 0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef * 0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 1310720 <power fail> # Mount the fs again to replay the log tree. $ mount /dev/sdi /mnt # File data after power failure: $ od -A d -t x1 /mnt/foobar 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab * 0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd * 0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef * 0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 1048576 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 * 1310720 The range from 1024K to 1280K should correspond to a hole but instead it points to stale data, to the 256K extent that should not exist after the truncate operation. The issue does not exists when not using NO_HOLES, because for that case we use file extent items to represent holes, these are found and copied during the loop that iterates over extents at btrfs_clone(), and that causes btrfs_replace_file_extents() to be called with a non-NULL extent_info argument and therefore set the full sync runtime flag on the inode. So fix this by making the code that deals with a trailing hole during cloning, at btrfs_clone(), to set the full sync flag on the inode, if the range starts at or beyond the current i_size. A test case for fstests will follow soon. Backporting notes: for kernel 5.4 the change goes to ioctl.c into btrfs_clone before the last call to btrfs_punch_hole_range. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
1119a72e22 |
btrfs: tree-checker: do not error out if extent ref hash doesn't match
The tree checker checks the extent ref hash at read and write time to
make sure we do not corrupt the file system. Generally extent
references go inline, but if we have enough of them we need to make an
item, which looks like
key.objectid = <bytenr>
key.type = <BTRFS_EXTENT_DATA_REF_KEY|BTRFS_TREE_BLOCK_REF_KEY>
key.offset = hash(tree, owner, offset)
However if key.offset collide with an unrelated extent reference we'll
simply key.offset++ until we get something that doesn't collide.
Obviously this doesn't match at tree checker time, and thus we error
while writing out the transaction. This is relatively easy to
reproduce, simply do something like the following
xfs_io -f -c "pwrite 0 1M" file
offset=2
for i in {0..10000}
do
xfs_io -c "reflink file 0 ${offset}M 1M" file
offset=$(( offset + 2 ))
done
xfs_io -c "reflink file 0 17999258914816 1M" file
xfs_io -c "reflink file 0 35998517829632 1M" file
xfs_io -c "reflink file 0 53752752058368 1M" file
btrfs filesystem sync
And the sync will error out because we'll abort the transaction. The
magic values above are used because they generate hash collisions with
the first file in the main subvol.
The fix for this is to remove the hash value check from tree checker, as
we have no idea which offset ours should belong to.
Reported-by: Tuomas Lähdekorpi <tuomas.lahdekorpi@gmail.com>
Fixes:
|
||
|
dd0734f2a8 |
btrfs: fix race between swap file activation and snapshot creation
When creating a snapshot we check if the current number of swap files, in
the root, is non-zero, and if it is, we error out and warn that we can not
create the snapshot because there are active swap files.
However this is racy because when a task started activation of a swap
file, another task might have started already snapshot creation and might
have seen the counter for the number of swap files as zero. This means
that after the swap file is activated we may end up with a snapshot of the
same root successfully created, and therefore when the first write to the
swap file happens it has to fall back into COW mode, which should never
happen for active swap files.
Basically what can happen is:
1) Task A starts snapshot creation and enters ioctl.c:create_snapshot().
There it sees that root->nr_swapfiles has a value of 0 so it continues;
2) Task B enters btrfs_swap_activate(). It is not aware that another task
started snapshot creation but it did not finish yet. It increments
root->nr_swapfiles from 0 to 1;
3) Task B checks that the file meets all requirements to be an active
swap file - it has NOCOW set, there are no snapshots for the inode's
root at the moment, no file holes, no reflinked extents, etc;
4) Task B returns success and now the file is an active swap file;
5) Task A commits the transaction to create the snapshot and finishes.
The swap file's extents are now shared between the original root and
the snapshot;
6) A write into an extent of the swap file is attempted - there is a
snapshot of the file's root, so we fall back to COW mode and therefore
the physical location of the extent changes on disk.
So fix this by taking the snapshot lock during swap file activation before
locking the extent range, as that is the order in which we lock these
during buffered writes.
Fixes:
|
||
|
195a49eaf6 |
btrfs: fix race between writes to swap files and scrub
When we active a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents to be changed by operations such as balance and device
replace/resize/remove. We also call there can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we can not use the extent, since a write
into it would result in COWing the extent.
However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents to be
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group into RO before it processes it
and then back again to RW mode after processing it. That means an attempt
to write into a swap file extent while scrub is processing the respective
block group, will result in COWing the extent, changing its physical
location on disk.
Fix this by making sure that block groups that have extents that are used
by active swap files can not be turned into RO mode, therefore making it
not possible for a scrub to turn them into RO mode. When a scrub finds a
block group that can not be turned to RO due to the existence of extents
used by swap files, it proceeds to the next block group and logs a warning
message that mentions the block group was skipped due to active swap
files - this is the same approach we currently use for balance.
Fixes:
|
||
|
20903032cd |
btrfs: avoid checking for RO block group twice during nocow writeback
During the nocow writeback path, we currently iterate the rbtree of block groups twice: once for checking if the target block group is RO with the call to btrfs_extent_readonly()), and once again for getting a nocow reference on the block group with a call to btrfs_inc_nocow_writers(). Since btrfs_inc_nocow_writers() already returns false when the target block group is RO, remove the call to btrfs_extent_readonly(). Not only we avoid searching the blocks group rbtree twice, it also helps reduce contention on the lock that protects it (specially since it is a spin lock and not a read-write lock). That may make a noticeable difference on very large filesystems, with thousands of allocated block groups. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
3c17916510 |
btrfs: fix race between extent freeing/allocation when using bitmaps
During allocation the allocator will try to allocate an extent using cluster policy. Once the current cluster is exhausted it will remove the entry under btrfs_free_cluster::lock and subsequently acquire btrfs_free_space_ctl::tree_lock to dispose of the already-deleted entry and adjust btrfs_free_space_ctl::total_bitmap. This poses a problem because there exists a race condition between removing the entry under one lock and doing the necessary accounting holding a different lock since extent freeing only uses the 2nd lock. This can result in the following situation: T1: T2: btrfs_alloc_from_cluster insert_into_bitmap <holds tree_lock> if (entry->bytes == 0) if (block_group && !list_empty(&block_group->cluster_list)) { rb_erase(entry) spin_unlock(&cluster->lock); (total_bitmaps is still 4) spin_lock(&cluster->lock); <doesn't find entry in cluster->root> spin_lock(&ctl->tree_lock); <goes to new_bitmap label, adds <blocked since T2 holds tree_lock> <a new entry and calls add_new_bitmap> recalculate_thresholds <crashes, due to total_bitmaps becoming 5 and triggering an ASSERT> To fix this ensure that once depleted, the cluster entry is deleted when both cluster lock and tree locks are held in the allocator (T1), this ensures that even if there is a race with a concurrent insert_into_bitmap call it will correctly find the entry in the cluster and add the new space to it. CC: <stable@vger.kernel.org> # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
04d4ba4c90 |
btrfs: make check_compressed_csum() to be subpage compatible
Currently check_compressed_csum() completely relies on sectorsize == PAGE_SIZE to do checksum verification for compressed extents. To make it subpage compatible, this patch will: - Do extra calculation for the csum range Since we have multiple sectors inside a page, we need to only hash the range we want, not the full page anymore. - Do sector-by-sector hash inside the page With this patch and previous conversion on btrfs_submit_compressed_read(), now we can read subpage compressed extents properly, and do proper csum verification. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
be6a13613f |
btrfs: make btrfs_submit_compressed_read() subpage compatible
For compressed read, we always submit page read using page size. This doesn't work well with subpage, as for subpage one page can contain several sectors. Such submission will read range out of what we want, and cause problems. Thankfully to make it subpage compatible, we only need to change how the last page of the compressed extent is read. Instead of always adding a full page to the compressed read bio, if we're at the last page, calculate the size using compressed length, so that we only add part of the range into the compressed read bio. Since we are here, also change the PAGE_SIZE used in lookup_extent_mapping() to sectorsize. This modification won't cause any functional change, as lookup_extent_mapping() can handle the case where the search range is larger than found extent range. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d70cef0d46 |
btrfs: fix raid6 qstripe kmap
When a qstripe is required an extra page is allocated and mapped. There were 3 problems: 1) There is no corresponding call of kunmap() for the qstripe page. 2) There is no reason to map the qstripe page more than once if the number of bits set in rbio->dbitmap is greater than one. 3) There is no reason to map the parity page and unmap it each time through the loop. The page memory can continue to be reused with a single mapping on each iteration by raid6_call.gen_syndrome() without remapping. So map the page for the duration of the loop. Similarly, improve the algorithm by mapping the parity page just 1 time. Fixes: |
||
|
582cd91f69 |
for-5.12/block-2021-02-17
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtmIwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgplzLEAC5O+3rBM8QuiJdo39Yppmuw4hDJ6hOKynP EJQLKQQi0VfXgU+MprGvcbpFYmNbgICvUICQkEzJuk++kPCu/BJtJz0yErQeLgS+ RdXiPV6enbF7iRML5TVRTr1q/z7sJMXcIIJ8Pz/rU/JNfGYExVd0WfnEY9mp1jOt Bl9V+qyTazdP+Ma4+uEPatSayqcdi1rxB5I+7v/sLiOvKZZWkaRZjUZ/mxAjUfvK dBOOPjMygEo3tCLkIyyA6lpLvr1r+SUZhLuebRLEKa3To3TW6RtoG0qwpKmI2iKw ylLeVLB60nM9RUxjflVOfBsHxz1bDg5Ve86y5nCjQd4Jo8x1c4DnecyGE5/Tu8Rg rgbsfD6nFWzhDCvcZT0XrfQ4ZAjIL2IfT+ypQiQ6UlRd3hvIKRmzWMkjuH2svr0u ey9Kq+lYerI4cM0F3W73gzUKdIQOuCzBCYxQuSQQomscBa7FCInyU192dAI9Aj6l Yd06mgKu6qCx6zLv6JfpBqaBHZMwyGE4dmZgPQFuuwO+b4N+Ck3Jm5fzEzw/xIxQ wdo/DlsAl60BXentB6FByGBJaCjVdSymRqN/xNCAbFKCjmr6TLBuXPfg1gYYO7xC VOcVjWe8iN3wWHZab3t2mxMKH9B9B/KKzIhu6TNHSmgtQ5paZPRCBx995pDyRw26 WC22RGC2MA== =os1E -----END PGP SIGNATURE----- Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "Another nice round of removing more code than what is added, mostly due to Christoph's relentless pursuit of tech debt removal/cleanups. This pull request contains: - Two series of BFQ improvements (Paolo, Jan, Jia) - Block iov_iter improvements (Pavel) - bsg error path fix (Pan) - blk-mq scheduler improvements (Jan) - -EBUSY discard fix (Jan) - bvec allocation improvements (Ming, Christoph) - bio allocation and init improvements (Christoph) - Store bdev pointer in bio instead of gendisk + partno (Christoph) - Block trace point cleanups (Christoph) - hard read-only vs read-only split (Christoph) - Block based swap cleanups (Christoph) - Zoned write granularity support (Damien) - Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)" * tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits) mm: simplify swapdev_block sd_zbc: clear zone resources for non-zoned case block: introduce blk_queue_clear_zone_settings() zonefs: use zone write granularity as block size block: introduce zone_write_granularity limit block: use blk_queue_set_zoned in add_partition() nullb: use blk_queue_set_zoned() to setup zoned devices nvme: cleanup zone information initialization block: document zone_append_max_bytes attribute block: use bi_max_vecs to find the bvec pool md/raid10: remove dead code in reshape_request block: mark the bio as cloned in bio_iov_bvec_set block: set BIO_NO_PAGE_REF in bio_iov_bvec_set block: remove a layer of indentation in bio_iov_iter_get_pages block: turn the nr_iovecs argument to bio_alloc* into an unsigned short block: remove the 1 and 4 vec bvec_slabs entries block: streamline bvec_alloc block: factor out a bvec_alloc_gfp helper block: move struct biovec_slab to bio.c block: reuse BIO_INLINE_VECS for integrity bvecs ... |
||
|
4f016a316f |
New code for 5.12:
- Adjust the final parameter of iomap_dio_rw. - Add a new flag to request that iomap directio writes return EAGAIN if the write is not a pure overwrite within EOF; this will be used to reduce lock contention with unaligned direct writes on XFS. - Amend XFS' directio code to eliminate exclusive locking for unaligned direct writes if the circumstances permit -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmAZgQAACgkQ+H93GTRK tOtNqw/+KPff1NjQVK2k361R0+LjlEHfe2nxh7+kS10IiR5nbBz4Fu+GwEosZKq+ H9ficBbZ0wIveV+5CEt2xZLEJFC4LZUpNPVVrUf8XPLKiVexP/U3wtKzmv9Z7D5J 5walMWQycVeR+ycomynV36giqekvARL7KCQG5By2ITfSNxfnb/wvKhn1d61ZDOF6 f4xzq7F6+cEOrSZt2LcFzGSfsTl6oakYMAomPU57sqGmw7MHRqoPTErbdh2HnVJy yQ47eiZgSKWKA+Qm+VvHHePYCYnu0nvA2rbNerjTN70hnO8rK9S0Vle6Sp5CUqAX sXOy8zxOLYKqyM4S/QkIN2TGIyWg+CHiakVLZGF3Q4AUDDYfpD0cHvAe9N3v9euL qt8ypT8dz2C3qiTg5E31xy033wlAP0wg3FZiLAqEjL5o3fzD+qbplTiSmYbMV2Fb xuu7a2T6u1MHaIn1IhaL0cB49Fzn+5EMyp6BlAucAOakyuqJCyJiXokdk0Looy5e jUshvcwWcmHMpI/YYYY6t56KV6tl2exGq5sySY5U6dr8/r5lwc0SI+TrYFG0jTR8 59DGd5CkKgdBFcuys+eaZDXgr7A4ymkVE+pE0QNDz9UwNP20tLb3dQNlhgxchUgu NgPaFgQkoNM3HmQNyU2wX/t1aFlC/doqSkb/96UWQSxq6IrajMU= =AR07 -----END PGP SIGNATURE----- Merge tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull iomap updates from Darrick Wong: "The big change in this cycle is some new code to make it possible for XFS to try unaligned directio overwrites without taking locks. If the block is fully written and within EOF (i.e. doesn't require any further fs intervention) then we can let the unlocked write proceed. If not, we fall back to synchronizing direct writes. Summary: - Adjust the final parameter of iomap_dio_rw. - Add a new flag to request that iomap directio writes return EAGAIN if the write is not a pure overwrite within EOF; this will be used to reduce lock contention with unaligned direct writes on XFS. - Amend XFS' directio code to eliminate exclusive locking for unaligned direct writes if the circumstances permit" * tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: reduce exclusive locking on unaligned dio xfs: split the unaligned DIO write code out xfs: improve the reflink_bounce_dio_write tracepoint xfs: simplify the read/write tracepoints xfs: remove the buffered I/O fallback assert xfs: cleanup the read/write helper naming xfs: make xfs_file_aio_write_checks IOCB_NOWAIT-aware xfs: factor out a xfs_ilock_iocb helper iomap: add a IOMAP_DIO_OVERWRITE_ONLY flag iomap: pass a flags argument to iomap_dio_rw iomap: rename the flags variable in __iomap_dio_rw |
||
|
6f3952cbe0 |
for-5.12-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAqyGEACgkQxWXV+ddt WDuU6BAAhfI5BndMm6a1LooMsBHTR7Mh/aFXZEKX7vCDRnrkr+WiihDFhXu4tH3y arRsdwMnJCnta2/JMI5xCZZRg9Bsb/Sa0qWoR9sDBVoGRMnE1DS5YHQyv0bfJYk0 qYOW/jorBV1n/hL19+WbDFajwajP86uGtlDKV7cJ/C3lIogQma7zQ7ygwxbDcZqm ZQVHg7ooM4P1t7EV0eDlatxn0Sm8KFkxXD7dbu37qDLWr3Aw8N4IwT7I9h4b+/tg hL4dqMPxX6AyRiI0VBsqKnmcRWtT9cN7yw0+J+/JK5KuaFFx3qyZZ+EQu1jAGZDt 2m432YKya8LQfyBuSe8uoCIcczhGoD0EPIhspecDMfWTvxdo+AeTJZzZzj3u1y+v 3pih+gBN1sa8vRVSX08mIBF/k0pPfxRu7gIjvl4wl18bm3Khq5VJ93ImP7DNroNg bKiUG35K+kvXGBNaLY71zZfO6aLMddK73aDudSbYOS8XcbKhor1G8j5o5/EkcVQA wio4Gw5BmfVeRuXOl2h1aEXThk+469s0DR7MiMiAA6917cUjQiFUgFOaogR0XY3S 8ffX+S50AFW834J0eIGHPLmzi70WwSSXCS2q+zl87PPRK5+jCp9ZzWGi9MGG1qdh fp7XVMkzHVSKGK5GXB+ICUfzkShxfTCh+EbxcXIulONxsEdADsc= =0O6r -----END PGP SIGNATURE----- Merge tag 'for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This brings updates of space handling, performance improvements or bug fixes. The subpage block size and zoned mode features have reached state where they're usable but with limitations. Performance or related: - do not block on deleted block group mutex in the cleaner, avoids some long stalls - improved flushing: make it work better with ticket space reservations and avoid excessive transaction commits in some scenarios, slightly improves throughput for random write load - preemptive background flushing: separate the logic from ticket reservations, improve the accounting and decisions when to flush in low space conditions - less lock contention related to running delayed refs, let just one thread do the flushing when there are many inside transaction commit - dbench workload improvements: avoid unnecessary work when logging inodes, fewer fallbacks to transaction commit and thus less waiting for it (+7% throughput, -20% latency) Core: - subpage block size - currently read-only support - refactor and generalize code where sectorsize is assumed to be page size, add the subpage handling everywhere - the read-write support is on the way, page sizes are still limited to 4K or 64K - zoned mode, first working version but with limitations - SMR/ZBC/ZNS friendly allocation mode, utilizing the "no fixed location for structures" and chunked allocation - superblock as the only fixed data structure needs special handling, uses 2 consecutive zones as a ring buffer - tree-log support with a dedicated block group to avoid unordered writes - emulated zones on non-zoned devices - not yet working - all non-single block group profiles, requires more zone write pointer synchronization between the multiple block groups - fitrim due to dependency on space cache, can be implemented Fixes: - ref-verify: proper tree owner and node level tracking - fix pinned byte accounting, causing some early ENOSPC now more likely due to other changes in delayed refs Other: - error handling fixes and improvements - more error injection points - more function documentation - more and updated tracepoints - subset of W=1 checked by default - update comments to allow more automatic kdoc parameter checks" * tag 'for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (144 commits) btrfs: zoned: enable to mount ZONED incompat flag btrfs: zoned: deal with holes writing out tree-log pages btrfs: zoned: reorder log node allocation on zoned filesystem btrfs: zoned: serialize log transaction on zoned filesystems btrfs: zoned: extend zoned allocator to use dedicated tree-log block group btrfs: split alloc_log_tree() btrfs: zoned: relocate block group to repair IO failure in zoned filesystems btrfs: zoned: enable relocation on a zoned filesystem btrfs: zoned: support dev-replace in zoned filesystems btrfs: zoned: implement copying for zoned device-replace btrfs: zoned: implement cloning for zoned device-replace btrfs: zoned: mark block groups to copy for device-replace btrfs: zoned: do not use async metadata checksum on zoned filesystems btrfs: zoned: wait for existing extents before truncating btrfs: zoned: serialize metadata IO btrfs: zoned: introduce dedicated data write path for zoned filesystems btrfs: zoned: enable zone append writing for direct IO btrfs: zoned: use ZONE_APPEND write for zoned mode btrfs: save irq flags when looking up an ordered extent btrfs: zoned: cache if block group is on a sequential zone ... |
||
|
e42ee56fe5 |
for-5.11-rc7-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAmlkAACgkQxWXV+ddt WDuwNxAAiBAhEwPllzyU86p4RMMip5pa24zu11HkTya65yGk6EFuj4zTlx/L5Fn6 JOjxwlPqaTItER1PYJ5HRdIy1Y2E4eWEiDLolvmvDCPZrfKRKhBU1MZbgXwDbp+Z pwaJGIm5ZaXDGyuFge3bKA48BERfqxRBO3qIOZ0tzgsUFLlZ2d9EdDc99093/J6k QzIijXQjFnvnB2MNawN1b/KQ63xqXLo2hemKcKIFCxJHm9eaet/qwGHl5iuR5ScY bOGCWvLSkCXceartDur3msOZXur09YLyfeYmE9dj1FN3aNu97sW8VivWRrs3aglK if51iYrrjKSnDr4SOK28S5UYdgeStb/qWWtosdcMsQVBo0t7iCnGT2psGaQCkdfG FChqbs2uXlbJrojlelV6xbaU3S2D2MtSz5mF+I2G5MpQbj1jkhYE9ZTUQeibcd7o l+edn/VJvVK4X0NAX8pIWJ4nFY1HqUTyfn28IQ7ymBhyyUloIoazvSkBuSWy6iy0 9aPpohOKjCw8Y3MbgcIfIEJhdK+aIKF8ZPh52+zcXQzf1OtSryVarLHsNXWm9vJ8 tHsRHCzrbLFdAXZccT6YlerzPs4+PVf44UknDbFCg7sLcG04NIGGrMXOtTHwgEZL BEywTjAMlMDjrEXouxYAPNPnEg/NlvQGZYRvBnxrtZE4G2fxJ7o= =7w6G -----END PGP SIGNATURE----- Merge tag 'for-5.11-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A regression fix caused by a refactoring in 5.11. A corrupted superblock wouldn't be detected by checksum verification due to wrongly placed initialization of the checksum length, thus making memcmp always work" * tag 'for-5.11-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: initialize fs_info::csum_size earlier in open_ctree |
||
|
83c68bbcb6 |
btrfs: initialize fs_info::csum_size earlier in open_ctree
User reported that btrfs-progs misc-tests/028-superblock-recover fails: [TEST/misc] 028-superblock-recover unexpected success: mounted fs with corrupted superblock test failed for case 028-superblock-recover The test case expects that a broken image with bad superblock will be rejected to be mounted. However, the test image just passed csum check of superblock and was successfully mounted. Commit |
||
|
9d294a685f |
btrfs: zoned: enable to mount ZONED incompat flag
This final patch adds the ZONED incompat flag to the supported flags and enables to mount ZONED flagged file system. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
b528f46713 |
btrfs: zoned: deal with holes writing out tree-log pages
Since the zoned filesystem requires sequential write out of metadata, we cannot proceed with a hole in tree-log pages. When such a hole exists, btree_write_cache_pages() will return -EAGAIN. This happens when someone, e.g., a concurrent transaction commit, writes a dirty extent in this tree-log commit. If we are not going to wait for the extents, we can hope the concurrent writing fills the hole for us. So, we can ignore the error in this case and hope the next write will succeed. If we want to wait for them and got the error, we cannot wait for them because it will cause a deadlock. So, let's bail out to a full commit in this case. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
3ddebf27fc |
btrfs: zoned: reorder log node allocation on zoned filesystem
This is the 3/3 patch to enable tree-log on zoned filesystems. The allocation order of nodes of "fs_info->log_root_tree" and nodes of "root->log_root" is not the same as the writing order of them. So, the writing causes unaligned write errors. Reorder the allocation of them by delaying allocation of the root node of "fs_info->log_root_tree," so that the node buffers can go out sequentially to devices. Cc: Filipe Manana <fdmanana@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
fa1a0f42a0 |
btrfs: zoned: serialize log transaction on zoned filesystems
This is the 2/3 patch to enable tree-log on zoned filesystems. Since we can start more than one log transactions per subvolume simultaneously, nodes from multiple transactions can be allocated interleaved. Such mixed allocation results in non-sequential writes at the time of a log transaction commit. The nodes of the global log root tree (fs_info->log_root_tree), also have the same problem with mixed allocation. Serializes log transactions by waiting for a committing transaction when someone tries to start a new transaction, to avoid the mixed allocation problem. We must also wait for running log transactions from another subvolume, but there is no easy way to detect which subvolume root is running a log transaction. So, this patch forbids starting a new log transaction when other subvolumes already allocated the global log root tree. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
40ab3be102 |
btrfs: zoned: extend zoned allocator to use dedicated tree-log block group
This is the 1/3 patch to enable tree log on zoned filesystems. The tree-log feature does not work on a zoned filesystem as is. Blocks for a tree-log tree are allocated mixed with other metadata blocks and btrfs writes and syncs the tree-log blocks to devices at the time of fsync(), which has a different timing than a global transaction commit. As a result, both writing tree-log blocks and writing other metadata blocks become non-sequential writes that zoned filesystems must avoid. Introduce a dedicated block group for tree-log blocks, so that tree-log blocks and other metadata blocks can be separate write streams. As a result, each write stream can now be written to devices separately. "fs_info->treelog_bg" tracks the dedicated block group and assigns "treelog_bg" on-demand on tree-log block allocation time. This commit extends the zoned block allocator to use the block group. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
6ab6ebb760 |
btrfs: split alloc_log_tree()
This is a preparation patch for the next patch. Split alloc_log_tree() into two parts. The first one allocating the tree structure, remains in alloc_log_tree() and the second part allocating the tree node, which is moved into btrfs_alloc_log_tree_node(). Also export the latter part is to be used in the next patch. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
f7ef5287a6 |
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the damaged data, we read the correct data from the mirror and writes it to damaged blocks. This however, violates the sequential write constraints of a zoned block device. We can consider three methods to repair an IO failure in zoned filesystems: (1) Reset and rewrite the damaged zone (2) Allocate new device extent and replace the damaged device extent to the new extent (3) Relocate the corresponding block group Method (1) is most similar to a behavior done with regular devices. However, it also wipes non-damaged data in the same device extent, and so it unnecessary degrades non-damaged data. Method (2) is much like device replacing but done in the same device. It is safe because it keeps the device extent until the replacing finish. However, extending device replacing is non-trivial. It assumes "src_dev->physical == dst_dev->physical". Also, the extent mapping replacing function should be extended to support replacing device extent position in one device. Method (3) invokes relocation of the damaged block group and is straightforward to implement. It relocates all the mirrored device extents, so it potentially is a more costly operation than method (1) or (2). But it relocates only used extents which reduce the total IO size. Let's apply method (3) for now. In the future, we can extend device-replace and apply method (2). For protecting a block group gets relocated multiple time with multiple IO errors, this commit introduces "relocating_repair" bit to show it's now relocating to repair IO failures. Also it uses a new kthread "btrfs-relocating-repair", not to block IO path with relocating process. This commit also supports repairing in the scrub process. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
32430c6148 |
btrfs: zoned: enable relocation on a zoned filesystem
Currently fallocate() is disabled on a zoned filesystem. Since current relocation process relies on preallocation to move file data extents, it must be handled differently. On a zoned filesystem, we just truncate the inode to the size that we wanted to pre-allocate. Then, we flush dirty pages on the file before finishing the relocation process. run_delalloc_zoned() will handle all the allocations and submit IOs to the underlying layers. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
7db1c5d14d |
btrfs: zoned: support dev-replace in zoned filesystems
This is 4/4 patch to implement device-replace on zoned filesystems. Even after the copying is done, the write pointers of the source device and the destination device may not be synchronized. For example, when the last allocated extent is freed before device-replace process, the extent is not copied, leaving a hole there. Synchronize the write pointers by writing zeroes to the destination device. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
de17addce7 |
btrfs: zoned: implement copying for zoned device-replace
This is 3/4 patch to implement device-replace on zoned filesystems.
This commit implements copying. To do this, it tracks the write pointer
during the device replace process. As device-replace's copy process is
smart enough to only copy used extents on the source device, we have to
fill the gap to honor the sequential write requirement in the target
device.
The device-replace process on zoned filesystems must copy or clone all
the extents in the source device exactly once. So, we need to ensure
allocations started just before the dev-replace process to have their
corresponding extent information in the B-trees.
finish_extent_writes_for_zoned() implements that functionality, which
basically is the removed code in the commit
|
||
|
6143c23ccc |
btrfs: zoned: implement cloning for zoned device-replace
This is 2/4 patch to implement device replace for zoned filesystems. In zoned mode, a block group must be either copied (from the source device to the target device) or cloned (to both devices). Implement the cloning part. If a block group targeted by an IO is marked to copy, we should not clone the IO to the destination device, because the block group is eventually copied by the replace process. This commit also handles cloning of device reset. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
78ce9fc269 |
btrfs: zoned: mark block groups to copy for device-replace
This is the 1/4 patch to support device-replace on zoned filesystems. We have two types of IOs during the device replace process. One is an IO to "copy" (by the scrub functions) all the device extents from the source device to the destination device. The other one is an IO to "clone" (by handle_ops_on_dev_replace()) new incoming write IOs from users to the source device into the target device. Cloning incoming IOs can break the sequential write rule in on target device. When a write is mapped in the middle of a block group, the IO is directed to the middle of a target device zone, which breaks the sequential write requirement. However, the cloning function cannot be disabled since incoming IOs targeting already copied device extents must be cloned so that the IO is executed on the target device. We cannot use dev_replace->cursor_{left,right} to determine whether a bio is going to a not yet copied region. Since we have a time gap between finishing btrfs_scrub_dev() and rewriting the mapping tree in btrfs_dev_replace_finishing(), we can have a newly allocated device extent which is never cloned nor copied. So the point is to copy only already existing device extents. This patch introduces mark_block_group_to_copy() to mark existing block groups as a target of copying. Then, handle_ops_on_dev_replace() and dev-replace can check the flag to do their job. Also, btrfs_finish_block_group_to_copy() will check if the copied stripe is the last stripe in the block group. With the last stripe copied, the to_copy flag is finally disabled. Afterwards we can safely clone incoming IOs on this block group. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
4eef29ef63 |
btrfs: zoned: do not use async metadata checksum on zoned filesystems
On zoned filesystems, btrfs uses per-fs zoned_meta_io_lock to serialize the metadata write IOs. Even with this serialization, write bios sent from btree_write_cache_pages can be reordered by async checksum workers as these workers are per CPU and not per zone. To preserve write bio ordering, we disable async metadata checksum on a zoned filesystem. This does not result in lower performance with HDDs as a single CPU core is fast enough to do checksum for a single zone write stream with the maximum possible bandwidth of the device. If multiple zones are being written simultaneously, HDD seek overhead lowers the achievable maximum bandwidth, resulting again in a per zone checksum serialization not affecting the performance. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
24c0a7227f |
btrfs: zoned: wait for existing extents before truncating
When truncating a file, file buffers which have already been allocated but not yet written may be truncated. Truncating these buffers could cause breakage of a sequential write pattern in a block group if the truncated blocks are for example followed by blocks allocated to another file. To avoid this problem, always wait for write out of all unwritten buffers before proceeding with the truncate execution. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
0bc09ca129 |
btrfs: zoned: serialize metadata IO
We cannot use zone append for writing metadata, because the B-tree nodes have references to each other using logical address. Without knowing the address in advance, we cannot construct the tree in the first place. So we need to serialize write IOs for metadata. We cannot add a mutex around allocation and submission because metadata blocks are allocated in an earlier stage to build up B-trees. Add a zoned_meta_io_lock and hold it during metadata IO submission in btree_write_cache_pages() to serialize IOs. Furthermore, this adds a per-block group metadata IO submission pointer "meta_write_pointer" to ensure sequential writing, which can break when attempting to write back blocks in an unfinished transaction. If the writing out failed because of a hole and the write out is for data integrity (WB_SYNC_ALL), it returns EAGAIN. A caller like fsync() code should handle this properly e.g. by falling back to a full transaction commit. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
42c0110009 |
btrfs: zoned: introduce dedicated data write path for zoned filesystems
If more than one IO is issued for one file extent, these IO can be written to separate regions on a device. Since we cannot map one file extent to such a separate area on a zoned filesystem, we need to follow the "one IO == one ordered extent" rule. The normal buffered, uncompressed and not pre-allocated write path (used by cow_file_range()) sometimes does not follow this rule. It can write a part of an ordered extent when specified a region to write e.g., when its called from fdatasync(). Introduce a dedicated (uncompressed buffered) data write path for zoned filesystems, that will COW the region and write it at once. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
544d24f9de |
btrfs: zoned: enable zone append writing for direct IO
Likewise to buffered IO, enable zone append writing for direct IO when its used on a zoned block device. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d8e3fb106f |
btrfs: zoned: use ZONE_APPEND write for zoned mode
Enable zone append writing for zoned mode. When using zone append, a bio is issued to the start of a target zone and the device decides to place it inside the zone. Upon completion the device reports the actual written position back to the host. Three parts are necessary to enable zone append mode. First, modify the bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the bi_sector to point the beginning of the zone. Second, record the returned physical address (and disk/partno) to the ordered extent in end_bio_extent_writepage() after the bio has been completed. We cannot resolve the physical address to the logical address because we can neither take locks nor allocate a buffer in this end_bio context. So, we need to record the physical address to resolve it later in btrfs_finish_ordered_io(). And finally, rewrite the logical addresses of the extent mapping and checksum data according to the physical address using btrfs_rmap_block. If the returned address matches the originally allocated address, we can skip this rewriting process. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
24533f6a9a |
btrfs: save irq flags when looking up an ordered extent
A following patch will add another caller of btrfs_lookup_ordered_extent(), but from a bio's endio context. btrfs_lookup_ordered_extent() uses spin_lock_irq() which unconditionally disables interrupts. Change this to spin_lock_irqsave() so interrupts aren't disabled and re-enabled unconditionally. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
08f455593f |
btrfs: zoned: cache if block group is on a sequential zone
On a zoned filesystem, cache if a block group is on a sequential write only zone. On sequential write only zones, we can use REQ_OP_ZONE_APPEND for writing data, therefore provide btrfs_use_zone_append() to figure out if IO is targeting a sequential write only zone and we can use REQ_OP_ZONE_APPEND for data writing. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
138082f366 |
btrfs: extend btrfs_rmap_block for specifying a device
btrfs_rmap_block currently reverse-maps the physical addresses on all devices to the corresponding logical addresses. Extend the function to match to a specified device. The old functionality of querying all devices is left intact by specifying NULL as target device. A block_device instead of a btrfs_device is passed into btrfs_rmap_block, as this function is intended to reverse-map the result of a bio, which only has a block_device. Also export the function for later use. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
cacb2cea46 |
btrfs: zoned: check if bio spans across an ordered extent
To ensure that an ordered extent maps to a contiguous region on disk, we need to maintain a "one bio == one ordered extent" rule. Ensure that constructing bio does not span more than an ordered extent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d22002fd37 |
btrfs: zoned: split ordered extent when bio is sent
For a zone append write, the device decides the location the data is being written to. Therefore we cannot ensure that two bios are written consecutively on the device. In order to ensure that an ordered extent maps to a contiguous region on disk, we need to maintain a "one bio == one ordered extent" rule. Implement splitting of an ordered extent and extent map on bio submission to adhere to the rule. extract_ordered_extent() hooks into btrfs_submit_data_bio() and splits the corresponding ordered extent so that the ordered extent's region fits into one bio and the corresponding device limits. Several sanity checks need to be done in extract_ordered_extent() e.g. - We cannot split once end_bio'd ordered extent because we cannot divide ordered->bytes_left for the split ones - We do not expect a compressed ordered extent - We should not have checksum list because we omit the list splitting. Since the function is called before btrfs_wq_submit_bio() or btrfs_csum_one_bio(), this should be always ensured. We also need to split an extent map by creating a new one. If not, unpin_extent_cache() complains about the difference between the start of the extent map and the file's logical offset. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
cfe94440d1 |
btrfs: zoned: handle REQ_OP_ZONE_APPEND as writing
Zoned filesystems use REQ_OP_ZONE_APPEND bios for writing to actual devices. Let btrfs_end_bio() and btrfs_op be aware of it, by mapping REQ_OP_ZONE_APPEND to BTRFS_MAP_WRITE and using btrfs_op() instead of bio_op(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
e1326f0339 |
btrfs: zoned: use bio_add_zone_append_page
A zoned device has its own hardware restrictions e.g. max_zone_append_size when using REQ_OP_ZONE_APPEND. To follow these restrictions, use bio_add_zone_append_page() instead of bio_add_page(). We need target device to use bio_add_zone_append_page(), so this commit reads the chunk information to cache the target device to btrfs_io_bio(bio)->device. Caching only the target device is sufficient here as zoned filesystems only supports the single profile at the moment. Once more profiles will be supported btrfs_io_bio can hold an extent_map to be able to check for the restrictions of all devices the btrfs_bio will be mapped to. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
953651eb30 |
btrfs: factor out helper adding a page to bio
Factor out adding a page to a bio from submit_extent_page(). The page is added only when bio_flags are the same, contiguous and the added page fits in the same stripe as pages in the bio. Condition checks are reordered to allow early return to avoid possibly heavy btrfs_bio_fits_in_stripe() calling. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
dcba6e48b5 |
btrfs: zoned: reset zones of unused block groups
We must reset the zones of a deleted unused block group to rewind the zones' write pointers to the zones' start. To do this, we can use the DISCARD_SYNC code to do the reset when the filesystem is running on zoned devices. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
011b41bffa |
btrfs: zoned: advance allocation pointer after tree log node
Since the allocation info of a tree log node is not recorded in the extent tree, calculate_alloc_pointer() cannot detect this node, so the pointer can be over a tree node. Replaying the log calls btrfs_remove_free_space() for each node in the log tree. So, advance the pointer after the node to not allocate over it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d3575156f6 |
btrfs: zoned: redirty released extent buffers
Tree manipulating operations like merging nodes often release once-allocated tree nodes. Such nodes are cleaned so that pages in the node are not uselessly written out. On zoned volumes, however, such optimization blocks the following IOs as the cancellation of the write out of the freed blocks breaks the sequential write sequence expected by the device. Introduce a list of clean and unwritten extent buffers that have been released in a transaction. Redirty the buffers so that btree_write_cache_pages() can send proper bios to the devices. Besides it clears the entire content of the extent buffer not to confuse raw block scanners e.g. 'btrfs check'. By clearing the content, csum_dirty_buffer() complains about bytenr mismatch, so avoid the checking and checksum using newly introduced buffer flag EXTENT_BUFFER_NO_CHECK. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
2eda57089e |
btrfs: zoned: implement sequential extent allocation
Implement a sequential extent allocator for zoned filesystems. This allocator only needs to check if there is enough space in the block group after the allocation pointer to satisfy the extent allocation request. Therefore the allocator never manages bitmaps or clusters. Also, add assertions to the corresponding functions. As zone append writing is used, it would be unnecessary to track the allocation offset, as the allocator only needs to check available space. But by tracking and returning the offset as an allocated region, we can skip modification of ordered extents and checksum information when there is no IO reordering. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
169e0da91a |
btrfs: zoned: track unusable bytes for zones
In a zoned filesystem a once written then freed region is not usable until the underlying zone has been reset. So we need to distinguish such unusable space from usable free space. Therefore we need to introduce the "zone_unusable" field to the block group structure, and "bytes_zone_unusable" to the space_info structure to track the unusable space. Pinned bytes are always reclaimed to the unusable space. But, when an allocated region is returned before using e.g., the block group becomes read-only between allocation time and reservation time, we can safely return the region to the block group. For the situation, this commit introduces "btrfs_add_free_space_unused". This behaves the same as btrfs_add_free_space() on regular filesystem. On zoned filesystems, it rewinds the allocation offset. Because the read-only bytes tracks free but unusable bytes when the block group is read-only, we need to migrate the zone_unusable bytes to read-only bytes when a block group is marked read-only. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
a94794d50d |
btrfs: zoned: calculate allocation offset for conventional zones
Conventional zones do not have a write pointer, so we cannot use it to determine the allocation offset for sequential allocation if a block group contains a conventional zone. But instead, we can consider the end of the highest addressed extent in the block group for the allocation offset. For new block group, we cannot calculate the allocation offset by consulting the extent tree, because it can cause deadlock by taking extent buffer lock after chunk mutex, which is already taken in btrfs_make_block_group(). Since it is a new block group anyways, we can simply set the allocation offset to 0. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
08e11a3db0 |
btrfs: zoned: load zone's allocation offset
A zoned filesystem must allocate blocks at the zones' write pointer. The device's write pointer position can be mapped to a logical address within a block group. To facilitate this, add an "alloc_offset" to the block-group to track the logical addresses of the write pointer. This logical address is populated in btrfs_load_block_group_zone_info() from the write pointers of corresponding zones. For now, zoned filesystems the single profile. Supporting non-single profile with zone append writing is not trivial. For example, in the DUP profile, we send a zone append writing IO to two zones on a device. The device reply with written LBAs for the IOs. If the offsets of the returned addresses from the beginning of the zone are different, then it results in different logical addresses. We need fine-grained logical to physical mapping to support such separated physical address issue. Since it should require additional metadata type, disable non-single profiles for now. This commit supports the case all the zones in a block group are sequential. The next patch will handle the case having a conventional zone. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
381a696eb5 |
btrfs: zoned: verify device extent is aligned to zone
Add a check in verify_one_dev_extent() to ensure that a device extent on a zoned block device is aligned to the respective zone boundary. If it isn't, mark the filesystem as unclean. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
1cd6121f2a |
btrfs: zoned: implement zoned chunk allocator
Implement a zoned chunk and device extent allocator. One device zone becomes a device extent so that a zone reset affects only this device extent and does not change the state of blocks in the neighbor device extents. To implement the allocator, we need to extend the following functions for a zoned filesystem. - init_alloc_chunk_ctl - dev_extent_search_start - dev_extent_hole_check - decide_stripe_size init_alloc_chunk_ctl_zoned() is mostly the same as regular one. It always set the stripe_size to the zone size and aligns the parameters to the zone size. dev_extent_search_start() only aligns the start offset to zone boundaries. We don't care about the first 1MB like in regular filesystem because we anyway reserve the first two zones for superblock logging. dev_extent_hole_check_zoned() checks if zones in given hole are either conventional or empty sequential zones. Also, it skips zones reserved for superblock logging. With the change to the hole, the new hole may now contain pending extents. So, in this case, loop again to check that. Finally, decide_stripe_size_zoned() should shrink the number of devices instead of stripe size because we need to honor stripe_size == zone_size. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
3c9daa09cc |
btrfs: zoned: allow zoned filesystems on non-zoned block devices
Run a zoned filesystem on non-zoned devices. This is done by "slicing up" the block device into static sized chunks and fake a conventional zone on each of them. The emulated zone size is determined from the size of device extent. This is mainly aimed at testing of zoned filesystems, i.e. the zoned chunk allocator, on regular block devices. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
1cb3dc3f79 |
btrfs: zoned: disallow fitrim on zoned filesystems
The implementation of fitrim depends on space cache, which is not used and disabled for zoned extent allocator. So the current code does not work with zoned filesystem. In the future, we can implement fitrim for zoned filesystems by enabling space cache (but, only for fitrim) or scanning the extent tree at fitrim time. For now, disallow fitrim on zoned filesystems. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
b53429bad3 |
btrfs: zoned: do not load fs_info::zoned from incompat flag
Don't set the zoned flag in fs_info as soon as we're encountering the incompat filesystem flag for a zoned filesystem on mount. The zoned flag in fs_info is in a union together with the zone_size, so setting it too early will result in setting an incorrect zone_size as well. Once the correct zone_size is read from the device, we can rely on the zoned flag in fs_info as well to determine if the filesystem is zoned. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
4afd2fe835 |
btrfs: release path before calling to btrfs_load_block_group_zone_info
Since we have no write pointer in conventional zones, we cannot determine the allocation offset from it. Instead, we set the allocation offset after the highest addressed extent. This is done by reading the extent tree in btrfs_load_block_group_zone_info(). However, this function is called from btrfs_read_block_groups(), so the read lock for the tree node could be recursively taken. To avoid this unsafe locking scenario, release the path before reading the extent tree to get the allocation offset. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d6639b35da |
btrfs: zoned: use regular super block location on zone emulation
A zoned filesystem currently has a superblock at the beginning of the superblock logging zones if the zones are conventional. This difference in superblock position causes a chicken-and-egg problem for filesystems with emulated zones. Since the device is a regular (non-zoned) device, we cannot know if the filesystem is regular or zoned while reading the superblock. But, to load the superblock, we need to see if it is emulated zoned or not. Place the superblocks at the same location as they are on regular filesystem on regular devices to solve the problem. It is possible because it's ensured that all the superblock locations are at an (emulated) conventional zone on regular devices. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
7365104236 |
btrfs: zoned: defer loading zone info after opening trees
This is a preparation patch to implement zone emulation on a regular device. To emulate a zoned filesystem on a regular (non-zoned) device, we need to decide an emulated zone size. Instead of making it a compile-time static value, we'll make it configurable at mkfs time. Since we have one zone == one device extent restriction, we can determine the emulated zone size from the size of a device extent. We can extend btrfs_get_dev_zone_info() to show a regular device filled with conventional zones once the zone size is decided. The current call site of btrfs_get_dev_zone_info() during the mount process is earlier than loading the file system trees so that we don't know the size of a device extent at this point. Thus we can't slice a regular device to conventional zones. This patch introduces btrfs_get_dev_zone_info_all_devices to load the zone info for all the devices. And, it places this function in open_ctree() after loading the trees. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
72c9925f87 |
btrfs: fix extent buffer leak on failure to copy root
At btrfs_copy_root(), if the call to btrfs_inc_ref() fails we end up
returning without unlocking and releasing our reference on the extent
buffer named "cow" we previously allocated with btrfs_alloc_tree_block().
So fix that by unlocking the extent buffer and dropping our reference on
it before returning.
Fixes:
|
||
|
2c4d8cb737 |
btrfs: explain page locking and readahead in read_extent_buffer_pages()
In read_extent_buffer_pages(), if we failed to lock the page atomically, we just exit with return value 0. This is counter-intuitive, as normally if we can't lock what we need, we would return something like EAGAIN. But that return hides under (wait == WAIT_NONE) branch, which only gets triggered for readahead. And for readahead, if we failed to lock the page, it means the extent buffer is either being read by other thread, or has been read and is under modification. Either way the eb will or has been cached, thus readahead has no need to wait for it. Add comment on this counter-intuitive behavior. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
0bb3eb3ee8 |
btrfs: allow read-only mount of 4K sector size fs on 64K page system
This adds the basic RO mount ability for 4K sector size on 64K page system. Currently we only plan to support 4K and 64K page system. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
92082d4097 |
btrfs: integrate page status update for data read path into begin/end_page_read
In btrfs data page read path, the page status update are handled in two different locations: btrfs_do_read_page() { while (cur <= end) { /* No need to read from disk */ if (HOLE/PREALLOC/INLINE){ memset(); set_extent_uptodate(); continue; } /* Read from disk */ ret = submit_extent_page(end_bio_extent_readpage); } end_bio_extent_readpage() { endio_readpage_uptodate_page_status(); } This is fine for sectorsize == PAGE_SIZE case, as for above loop we should only hit one branch and then exit. But for subpage, there is more work to be done in page status update: - Page Unlock condition Unlike regular page size == sectorsize case, we can no longer just unlock a page. Only the last reader of the page can unlock the page. This means, we can unlock the page either in the while() loop, or in the endio function. - Page uptodate condition Since we have multiple sectors to read for a page, we can only mark the full page uptodate if all sectors are uptodate. To handle both subpage and regular cases, introduce a pair of functions to help handling page status update: - begin_page_read() For regular case, it does nothing. For subpage case, it updates the reader counters so that later end_page_read() can know who is the last one to unlock the page. - end_page_read() This is just endio_readpage_uptodate_page_status() renamed. The original name is a little too long and too specific for endio. The new thing added is the condition for page unlock. Now for subpage data, we unlock the page if we're the last reader. This does not only provide the basis for subpage data read, but also hide the special handling of page read from the main read loop. Also, since we're changing how the page lock is handled, there are two existing error paths where we need to manually unlock the page before calling begin_page_read(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
32443de338 |
btrfs: introduce btrfs_subpage for data inodes
To support subpage sector size, data also need extra info to make sure which sectors in a page are uptodate/dirty/... This patch will make pages for data inodes get btrfs_subpage structure attached, and detached when the page is freed. This patch also slightly changes the timing when set_page_extent_mapped() is called to make sure: - We have page->mapping set page->mapping->host is used to grab btrfs_fs_info, thus we can only call this function after page is mapped to an inode. One call site attaches pages to inode manually, thus we have to modify the timing of set_page_extent_mapped() a bit. - As soon as possible, before other operations Since memory allocation can fail, we have to do extra error handling. Calling set_page_extent_mapped() as soon as possible can simply the error handling for several call sites. The idea is pretty much the same as iomap_page, but with more bitmaps for btrfs specific cases. Currently the plan is to switch iomap if iomap can provide sector aligned write back (only write back dirty sectors, but not the full page, data balance require this feature). So we will stick to btrfs specific bitmap for now. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
371cdc0700 |
btrfs: introduce subpage metadata validation check
For subpage metadata validation check, there are some differences: - Read must finish in one bvec Since we're just reading one subpage range in one page, it should never be split into two bios nor two bvecs. - How to grab the existing eb Instead of grabbing eb using page->private, we have to go search radix tree as we don't have any direct pointer at hand. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
4325cb2293 |
btrfs: support subpage in endio_readpage_update_page_status()
To handle subpage status update, add the following: - Use btrfs_page_*() subpage-aware helpers to update page status Now we can handle both cases well. - No page unlock for subpage metadata Since subpage metadata doesn't utilize page locking at all, skip it. For subpage data locking, it's handled in later commits. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
4012daf769 |
btrfs: introduce read_extent_buffer_subpage()
Introduce a helper, read_extent_buffer_subpage(), to do the subpage extent buffer read. The difference between regular and subpage routines are: - No page locking Here we completely rely on extent locking. Page locking can reduce the concurrency greatly, as if we lock one page to read one extent buffer, all the other extent buffers in the same page will have to wait. - Extent uptodate condition Despite the existing PageUptodate() and EXTENT_BUFFER_UPTODATE check, We also need to check btrfs_subpage::uptodate_bitmap. - No page iteration Just one page, no need to loop, this greatly simplified the subpage routine. This patch only implements the bio submit part, no endio support yet. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d1e86e3fc3 |
btrfs: support subpage in try_release_extent_buffer()
Unlike the original try_release_extent_buffer(), try_release_subpage_extent_buffer() will iterate through all the ebs in the page, and try to release each. We can release the full page only after there's no private attached, which means all ebs of that page have been released as well. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
92d83e9436 |
btrfs: support subpage in btrfs_clone_extent_buffer
For btrfs_clone_extent_buffer(), it's mostly the same code of __alloc_dummy_extent_buffer(), except it has extra page copy. So to make it subpage compatible, we only need to: - Call set_extent_buffer_uptodate() instead of SetPageUptodate() This will set correct uptodate bit for subpage and regular sector size cases. Since we're calling set_extent_buffer_uptodate() which will also set EXTENT_BUFFER_UPTODATE bit, we don't need to manually set that bit either. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
251f2acc71 |
btrfs: support subpage in set/clear_extent_buffer_uptodate()
To support subpage in set_extent_buffer_uptodate and clear_extent_buffer_uptodate we only need to use the subpage-aware helpers to update the page bits. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
03a816b32b |
btrfs: introduce helpers for subpage error status
Introduce the following functions to handle subpage error status: - btrfs_subpage_set_error() - btrfs_subpage_clear_error() - btrfs_subpage_test_error() These helpers can only be called when the page has subpage attached and the range is ensured to be inside the page. - btrfs_page_set_error() - btrfs_page_clear_error() - btrfs_page_test_error() These helpers can handle both regular sector size and subpage without problem. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
a1d767c11c |
btrfs: introduce helpers for subpage uptodate status
Introduce the following functions to handle subpage uptodate status: - btrfs_subpage_set_uptodate() - btrfs_subpage_clear_uptodate() - btrfs_subpage_test_uptodate() These helpers can only be called when the page has subpage attached and the range is ensured to be inside the page. - btrfs_page_set_uptodate() - btrfs_page_clear_uptodate() - btrfs_page_test_uptodate() These helpers can handle both regular sector size and subpage. Although caller should still ensure that the range is inside the page. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
09bc1f0fb8 |
btrfs: attach private to dummy extent buffer pages
There are locations where we allocate dummy extent buffers for temporary usage, like in tree_mod_log_rewind() or get_old_root(). These dummy extent buffers will be handled by the same eb accessors, and if they don't have page::private subpage eb accessors could fail. To address such problems, make __alloc_dummy_extent_buffer() attach page private for dummy extent buffers too. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
8ff8466d29 |
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling for subpage. Introduce a helper, detach_extent_buffer_page(), to do different handling for regular and subpage cases. For subpage case, handle detaching page private. For unmapped (dummy or cloned) ebs, we can detach the page private immediately as the page can only be attached to one unmapped eb. For mapped ebs, we have to ensure there are no eb in the page range before we delete it, as page->private is shared between all ebs in the same page. But there is a subpage specific race, where we can race with extent buffer allocation, and clear the page private while new eb is still being utilized, like this: Extent buffer A is the new extent buffer which will be allocated, while extent buffer B is the last existing extent buffer of the page. T1 (eb A) | T2 (eb B) -------------------------------+------------------------------ alloc_extent_buffer() | btrfs_release_extent_buffer_pages() |- p = find_or_create_page() | | |- attach_extent_buffer_page() | | | | |- detach_extent_buffer_page() | | |- if (!page_range_has_eb()) | | | No new eb in the page range yet | | | As new eb A hasn't yet been | | | inserted into radix tree. | | |- btrfs_detach_subpage() | | |- detach_page_private(); |- radix_tree_insert() | Then we have a metadata eb whose page has no private bit. To avoid such race, we introduce a subpage metadata-specific member, btrfs_subpage::eb_refs. In alloc_extent_buffer() we increase eb_refs in the critical section of private_lock. Then page_range_has_eb() will return true for detach_extent_buffer_page(), and will not detach page private. The section is marked by: - btrfs_page_inc_eb_refs() - btrfs_page_dec_eb_refs() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
819822107d |
btrfs: make grab_extent_buffer_from_page() handle subpage case
For subpage case, grab_extent_buffer() can't really get an extent buffer just from btrfs_subpage. We have radix tree lock protecting us from inserting the same eb into the tree. Thus we don't really need to do the extra hassle, just let alloc_extent_buffer() handle the existing eb in radix tree. Now if two ebs are being allocated as the same time, one will fail with -EEIXST when inserting into the radix tree. So for grab_extent_buffer(), just always return NULL for subpage case. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
760f991f14 |
btrfs: make attach_extent_buffer_page() handle subpage case
For subpage case, we need to allocate additional memory for each metadata page. So we need to: - Allow attach_extent_buffer_page() to return int to indicate allocation failure - Allow manually pre-allocate subpage memory for alloc_extent_buffer() As we don't want to use GFP_ATOMIC under spinlock, we introduce btrfs_alloc_subpage() and btrfs_free_subpage() functions for this purpose. (The simple wrap for btrfs_free_subpage() is for later convert to kmem_cache. Already internally tested without problem) - Preallocate btrfs_subpage structure for alloc_extent_buffer() We don't want to call memory allocation with spinlock held, so do preallocation before we acquire mapping->private_lock. - Handle subpage and regular case differently in attach_extent_buffer_page() For regular case, no change, just do the usual thing. For subpage case, allocate new memory or use the preallocated memory. For future subpage metadata, we will make use of radix tree to grab extent buffer. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
cac06d843f |
btrfs: introduce the skeleton of btrfs_subpage structure
For sectorsize < page size support, we need a structure to record extra status info for each sector of a page. Introduce the skeleton structure, all subpage related code would go to subpage.[ch]. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
62c053fbb2 |
btrfs: set UNMAPPED bit early in btrfs_clone_extent_buffer() for subpage support
For the incoming subpage support, UNMAPPED extent buffer will have different behavior in btrfs_release_extent_buffer(). This means we need to set UNMAPPED bit early before calling btrfs_release_extent_buffer(). Currently there is only one caller which relies on btrfs_release_extent_buffer() in its error path while set UNMAPPED bit late: - btrfs_clone_extent_buffer() Make it subpage compatible by setting the UNMAPPED bit early, since we're here, also move the UPTODATE bit early. There is another caller, __alloc_dummy_extent_buffer(), setting UNMAPPED bit late, but that function clean up the allocated page manually, thus no need for any modification. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
6869b0a8be |
btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK
PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK are two defines used in __process_pages_contig(), to let the function know to clear page dirty bit and then set page writeback. However page writeback and dirty bits are conflicting (at least for sector size == PAGE_SIZE case), this means these two have to be always updated together. This means we can merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d0c2f4fa55 |
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
Often an fsync needs to fallback to a transaction commit for several reasons (to ensure consistency after a power failure, a new block group was allocated or a temporary error such as ENOMEM or ENOSPC happened). In that case the log is marked as needing a full commit and any concurrent tasks attempting to log inodes or commit the log will also fallback to the transaction commit. When this happens they all wait for the task that first started the transaction commit to finish the transaction commit - however they wait until the full transaction commit happens, which is not needed, as they only need to wait for the superblocks to be persisted and not for unpinning all the extents pinned during the transaction's lifetime, which even for short lived transactions can be a few thousand and take some significant amount of time to complete - for dbench workloads I have observed up to 4~5 milliseconds of time spent unpinning extents in the worst cases, and the number of pinned extents was between 2 to 3 thousand. So allow fsync tasks to skip waiting for the unpinning of extents when they call btrfs_commit_transaction() and they were not the task that started the transaction commit (that one has to do it, the alternative would be to offload the transaction commit to another task so that it could avoid waiting for the extent unpinning or offload the extent unpinning to another task). This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit After applying the entire patchset, dbench shows improvements in respect to throughput and latency. The script used to measure it is the following: $ cat dbench-test.sh #!/bin/bash DEV=/dev/sdk MNT=/mnt/sdk MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-m single -d single" echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $DEV &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT dbench -D $MNT -t 300 64 umount $MNT The test was run on a physical machine with 12 cores (Intel corei7), 64G of ram, using a NVMe device and a non-debug kernel configuration (Debian's default configuration). Before applying patchset, 32 clients: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX |
||
|
64d6b281ba |
btrfs: remove unnecessary check_parent_dirs_for_sync()
Whenever we fsync an inode, if it is a directory, a regular file that was created in the current transaction or has last_unlink_trans set to the generation of the current transaction, we check if any of its ancestor inodes (and the inode itself if it is a directory) can not be logged and need a fallback to a full transaction commit - if so, we return with a value of 1 in order to fallback to a transaction commit. However we often do not need to fallback to a transaction commit because: 1) The ancestor inode is not an immediate parent, and therefore there is not an explicit request to log it and it is not needed neither to guarantee the consistency of the inode originally asked to be logged (fsynced) nor its immediate parent; 2) The ancestor inode was already logged before, in which case any link, unlink or rename operation updates the log as needed. So for these two cases we can avoid an unnecessary transaction commit. Therefore remove check_parent_dirs_for_sync() and add a check at the top of btrfs_log_inode() to make us fallback immediately to a transaction commit when we are logging a directory inode that can not be logged and needs a full transaction commit. All we need to protect is the case where after renaming a file someone fsyncs only the old directory, which would result is losing the renamed file after a log replay. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
0e44cb3f94 |
btrfs: skip logging inodes already logged when logging new entries
When logging new directory entries of a directory, we log the inodes of new dentries and the inodes of dentries pointing to directories that may have been created in past transactions. For the case of directories we log in full mode, which can be particularly expensive for large directories. We do use btrfs_inode_in_log() to skip already logged inodes, however for that helper to return true, it requires that the log transaction used to log the inode to be already committed. This means that when we have more than one task using the same log transaction we can end up logging an inode multiple times, which is a waste of time and not necessary since the log will be committed by one of the tasks and the others will wait for the log transaction to be committed before returning to user space. So simply replace the use of btrfs_inode_in_log() with the new helper function need_log_inode(), introduced in a previous commit. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
3e6a86a193 |
btrfs: skip logging directories already logged when logging all parents
Some times when we fsync an inode we need to do a full log of all its ancestors (due to unlink, link or rename operations), which can be an expensive operation, specially if the directories are large. However if we find an ancestor directory inode that is already logged in the current transaction, and has no inserted/updated/deleted xattrs since it was last logged, we can skip logging the directory again. We are safe to skip that since we know that for logged directories, any link, unlink or rename operations that implicate the directory will update the log as necessary. So use the helper need_log_dir(), introduced in a previous commit, to detect already logged directories that can be skipped. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
ab12313a9f |
btrfs: avoid logging new ancestor inodes when logging new inode
When we fsync a new file, created in the current transaction, we check all its ancestor inodes and always log them if they were created in the current transaction - even if we have already logged them before, which is a waste of time. So avoid logging new ancestor inodes if they were already logged before and have no xattrs added/updated/removed since they were last logged. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
e593e54ed1 |
btrfs: stop setting nbytes when filling inode item for logging
When we fill an inode item for logging we are setting its nbytes field with the value returned by inode_get_bytes() (a VFS API), however we do not need it because it is not used during log replay. In fact, for fast fsyncs, when we call inode_get_bytes() we may even get an outdated value for nbytes because the nbytes field of the inode is only updated when ordered extents complete, and a fast fsync only waits for writeback to complete, it does not wait for ordered extent completion. So just remove the setup of nbytes and add an explicit comment mentioning why we do not set it. This also avoids adding contention on the inode's i_lock (VFS) with concurrent stat() calls, since that spinlock is used by inode_get_bytes() which is also called by our stat callback (btrfs_getattr()). This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
ddffcf6fb5 |
btrfs: remove unnecessary directory inode item update when deleting dir entry
When we remove a directory entry, as part of an unlink operation, if the
directory was logged before we must remove the directory index items from
the log. We are also updating the inode item of the directory to update
its i_size, but that is not necessary because during log replay we do not
need it and we correctly adjust the i_size in the inode item of the
subvolume as we process directory index items and replay deletes.
This is not needed since commit
|
||
|
4203431319 |
btrfs: let callers of btrfs_get_io_geometry pass the em
Before this change, the btrfs_get_io_geometry() function was calling btrfs_get_chunk_map() to get the extent mapping, necessary for calculating the I/O geometry. It was using that extent mapping only internally and freeing the pointer after its execution. That resulted in calling btrfs_get_chunk_map() de facto twice by the __btrfs_map_block() function. It was calling btrfs_get_io_geometry() first and then calling btrfs_get_chunk_map() directly to get the extent mapping, used by the rest of the function. Change that to passing the extent mapping to the btrfs_get_io_geometry() function as an argument. This could improve performance in some cases. For very large filesystems, i.e. several thousands of allocated chunks, not only this avoids searching two times the rbtree, saving time, it may also help reducing contention on the lock that protects the tree - thinking of writeback starting for multiple inodes, other tasks allocating or removing chunks, and anything else that requires access to the rbtree. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Michal Rostecki <mrostecki@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add Filipe's analysis ] Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
951c80f83d |
btrfs: fix double accounting of ordered extent for subpage case in btrfs_invalidapge
Commit |
||
|
a4559e6f6f |
btrfs: simplify condition in __btrfs_run_delayed_items
Fix the following coccicheck warnings: ./fs/btrfs/delayed-inode.c:1157:39-41: WARNING !A || A && B is equivalent to !A || B. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Suggested-by: Jiapeng Zhong <oswb@linux.alibaba.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Abaci Team <abaci-bugfix@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
2965194b77 |
btrfs: remove wrong comment for can_nocow_extent()
The comment for can_nocow_extent() says that the function will flush
ordered extents, however that never happens and was never true before the
comment was added in commit
|
||
|
e5ad49e215 |
btrfs: add a trace class for dumping the current ENOSPC state
Often when I'm debugging ENOSPC related issues I have to resort to printing the entire ENOSPC state with trace_printk() in different spots. This gets pretty annoying, so add a trace state that does this for us. Then add a trace point at the end of preemptive flushing so you can see the state of the space_info when we decide to exit preemptive flushing. This helped me figure out we weren't kicking in the preemptive flushing soon enough. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
4b02b00fe5 |
btrfs: adjust the flush trace point to include the source
Since we have normal ticketed flushing and preemptive flushing, adjust the tracepoint so that we know the source of the flushing action to make it easier to debug problems. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
88a777a6e5 |
btrfs: implement space clamping for preemptive flushing
Starting preemptive flushing at 50% of available free space is a good start, but some workloads are particularly abusive and can quickly overwhelm the preemptive flushing code and drive us into using tickets. Handle this by clamping down on our threshold for starting and continuing to run preemptive flushing. This is particularly important for our overcommit case, as we can really drive the file system into overages and then it's more difficult to pull it back as we start to actually fill up the file system. The clamping is essentially 2^CLAMP, but we start at 1 so whatever we calculate for overcommit is the baseline. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
2e294c6049 |
btrfs: simplify the logic in need_preemptive_flushing
A lot of this was added all in one go with no explanation, and is a bit unwieldy and confusing. Simplify the logic to start preemptive flushing if we've reserved more than half of our available free space. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
9f42d37748 |
btrfs: rework btrfs_calc_reclaim_metadata_size
Currently btrfs_calc_reclaim_metadata_size does two things, it returns the space currently required for flushing by the tickets, and if there are no tickets it calculates a value for the preemptive flushing. However for the normal ticketed flushing we really only care about the space required for tickets. We will accidentally come in and flush one time, but as soon as we see there are no tickets we bail out of our flushing. Fix this by making btrfs_calc_reclaim_metadata_size really only tell us what is required for flushing if we have people waiting on space. Then move the preemptive flushing logic into need_preemptive_reclaim(). We ignore btrfs_calc_reclaim_metadata_size() in need_preemptive_reclaim() because if we are in this path then we made our reservation and there are not pending tickets currently, so we do not need to check it, simply do the fuzzy logic to check if we're getting low on space. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
f205edf773 |
btrfs: check reclaim_size in need_preemptive_reclaim
If we're flushing space for tickets then we have space_info->reclaim_size set and we do not need to do background reclaim. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
ae7913ba52 |
btrfs: rename need_do_async_reclaim
All of our normal flushing is asynchronous reclaim, so this helper is poorly named. This is more checking if we need to preemptively flush space, so rename it to need_preemptive_reclaim. Also switch it to bool and make it plain static as followup patches will move more code here. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
576fa34830 |
btrfs: improve preemptive background space flushing
Currently if we ever have to flush space because we do not have enough we allocate a ticket and attach it to the space_info, and then systematically flush things in the filesystem that hold space reservations until our space is reclaimed. However this has a latency cost, we must go to sleep and wait for the flushing to make progress before we are woken up and allowed to continue doing our work. In order to address that we used to kick off the async worker to flush space preemptively, so that we could be reclaiming space hopefully before any tasks needed to stop and wait for space to reclaim. When I introduced the ticketed ENOSPC stuff this broke slightly in the fact that we were using tickets to indicate if we were done flushing. No tickets, no more flushing. However this meant that we essentially never preemptively flushed. This caused a write performance regression that Nikolay noticed in an unrelated patch that removed the committing of the transaction during btrfs_end_transaction. The behavior that happened pre that patch was btrfs_end_transaction() would see that we were low on space, and it would commit the transaction. This was bad because in this particular case you could end up with thousands and thousands of transactions being committed during the 5 minute reproducer. With the patch to remove this behavior we got much more sane transaction commits, but we ended up slower because we would write for a while, flush, write for a while, flush again. To address this we need to reinstate a preemptive flushing mechanism. However it is distinctly different from our ticketing flushing in that it doesn't have tickets to base it's decisions on. Instead of bolting this logic into our existing flushing work, add another worker to handle this preemptive flushing. Here we will attempt to be slightly intelligent about the things that we flushing, attempting to balance between whichever pool is taking up the most space. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
f00c42dd4c |
btrfs: introduce a FORCE_COMMIT_TRANS flush operation
Solely for preemptive flushing, we want to be able to force the transaction commit without any of the ambiguity of may_commit_transaction(). This is because may_commit_transaction() checks tickets and such, and in preemptive flushing we already know it'll be helpful, so use this to keep the code nice and clean and straightforward. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
5deb17e18e |
btrfs: track ordered bytes instead of just dio ordered bytes
We track dio_bytes because the shrink delalloc code needs to know if we have more DIO in flight than we have normal buffered IO. The reason for this is because we can't "flush" DIO, we have to just wait on the ordered extents to finish. However this is true of all ordered extents. If we have more ordered space outstanding than dirty pages we should be waiting on ordered extents. We already are ok on this front technically, because we always do a FLUSH_DELALLOC_WAIT loop, but I want to use the ordered counter in the preemptive flushing code as well, so change this to count all ordered bytes instead of just DIO ordered bytes. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
ac1ea10e75 |
btrfs: add a trace point for reserve tickets
While debugging a ENOSPC related performance problem I needed to see the time difference between start and end of a reserve ticket, so add a trace point to report when we handle a reserve ticket. I opted to spit out start_ns itself without calculating the difference because there could be a gap between enabling the tracepoint and setting start_ns. Doing it this way allows us to filter on 0 start_ns so we don't get bogus entries, and we can easily calculate the time difference with bpftrace or something else. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
91e79a83ff |
btrfs: make flush_space take a enum btrfs_flush_state instead of int
I got a automated message from somebody who runs clang against our kernels and it's because I used the wrong enum type for what I passed into flush_space, caught by -Wenum-conversion. Change the argument to be explicitly the enum we're expecting to make everything consistent. Maybe eventually gcc will catch errors like this. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
8898038309 |
btrfs: send: use struct send_ctx *sctx for btrfs_compare_trees and changed_cb
btrfs_compare_trees and changed_cb use a void *ctx parameter instead of
struct send_ctx *sctx but when used in changed_cb it is immediately
cast to `struct send_ctx *sctx = ctx;`.
changed_cb is only ever called from btrfs_compare_trees and full_send_tree:
- full_send_tree already passes a struct send_ctx *sctx
- btrfs_compare_trees is only called by send_subvol with a struct send_ctx *sctx
- void *ctx in btrfs_compare_trees is only used to be passed to changed_cb
So casting to/from void *ctx seems unnecessary and directly using
struct send_ctx *sctx instead provides better type-safety.
The original reason for using void *ctx in the first place seems to have
been dropped with
|
||
|
488bc2a2d2 |
btrfs: run delayed refs less often in commit_cowonly_roots
We love running delayed refs in commit_cowonly_roots, but it is a bit excessive. I was seeing cases of running 3 or 4 refs a few times in a row during this time. Instead simply: - update all of the roots first - then run delayed refs - then handle the empty block groups case - and then if we have any more dirty roots do the whole thing again This allows us to be much more efficient with our delayed ref running, as we can batch a few more operations at once. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
dac348e925 |
btrfs: stop running all delayed refs during snapshot
This was added in commit
|
||
|
b7774425e0 |
btrfs: remove bogus BUG_ON in alloc_reserved_tree_block
The fix |
||
|
2a4d84c11a |
btrfs: move delayed ref flushing for qgroup into qgroup helper
The commit
|
||
|
ad368f3394 |
btrfs: only run delayed refs once before committing
We try to pre-flush the delayed refs when committing, because we want to do as little work as possible in the critical section of the transaction commit. However doing this twice can lead to very long transaction commit delays as other threads are allowed to continue to generate more delayed refs, which potentially delays the commit by multiple minutes in very extreme cases. So simply stick to one pre-flush, and then continue the rest of the transaction commit. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
61a56a992f |
btrfs: delayed refs pre-flushing should only run the heads we have
Previously our delayed ref running used the total number of items as the items to run. However we changed that to number of heads to run with the delayed_refs_rsv, as generally we want to run all of the operations for one bytenr. But with btrfs_run_delayed_refs(trans, 0) we set our count to 2x the number of items that we have. This is generally fine, but if we have some operation generation loads of delayed refs while we're doing this pre-flushing in the transaction commit, we'll just spin forever doing delayed refs. Fix this to simply pick the number of delayed refs we currently have, that way we do not end up doing a lot of extra work that's being generated in other threads. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
e19eb11f4f |
btrfs: only let one thread pre-flush delayed refs in commit
I've been running a stress test that runs 20 workers in their own subvolume, which are running an fsstress instance with 4 threads per worker, which is 80 total fsstress threads. In addition to this I'm running balance in the background as well as creating and deleting snapshots. This test takes around 12 hours to run normally, going slower and slower as the test goes on. The reason for this is because fsstress is running fsync sometimes, and because we're messing with block groups we often fall through to btrfs_commit_transaction, so will often have 20-30 threads all calling btrfs_commit_transaction at the same time. These all get stuck contending on the extent tree while they try to run delayed refs during the initial part of the commit. This is suboptimal, really because the extent tree is a single point of failure we only want one thread acting on that tree at once to reduce lock contention. Fix this by making the flushing mechanism a bit operation, to make it easy to use test_and_set_bit() in order to make sure only one task does this initial flush. Once we're into the transaction commit we only have one thread doing delayed ref running, it's just this initial pre-flush that is problematic. With this patch my stress test takes around 90 minutes to run, instead of 12 hours. The memory barrier is not necessary for the flushing bit as it's ordered, unlike plain int. The transaction state accessed in btrfs_should_end_transaction could be affected by that too as it's not always used under transaction lock. Upon Nikolay's analysis in [1] it's not necessary: In should_end_transaction it's read without holding any locks. (U) It's modified in btrfs_cleanup_transaction without holding the fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set. set in cleanup_transaction under fs_info->trans_lock (L) set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock.(L) set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock.(L) set in btrfs_commit_trans to COMMIT_UNBLOCK under fs_info->trans_lock.(L) set in btrfs_commit_trans to COMMIT_COMPLETED without locks but at this point the transaction is finished and fs_info->running_trans is NULL (U but irrelevant). So by the looks of it we can have a concurrent READ race with a WRITE, due to reads not taking a lock. In this case what we want to ensure is we either see new or old state. I consulted with Will Deacon and he said that in such a case we'd want to annotate the accesses to ->state with (READ|WRITE)_ONCE so as to avoid a theoretical tear, in this case I don't think this could happen but I imagine at some point KCSAN would flag such an access as racy (which it is). [1] https://lore.kernel.org/linux-btrfs/e1fd5cc1-0f28-f670-69f4-e9958b4964e6@suse.com Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add comments regarding memory barrier ] Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
ddfd08cb04 |
btrfs: do not block on deleted bgs mutex in the cleaner
While running some stress tests I started getting hung task messages. This is because the delete unused block groups code has to take the delete_unused_bgs_mutex to do it's work, which is taken by balance to make sure we don't delete block groups while we're balancing. The problem is that balance can take a while, and so we were getting hung task warnings. We don't need to block and run these things, and the cleaner is needed to do other work, so trylock on this mutex and just bail if we can't acquire it right away. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
867ed321f9 |
btrfs: abort the transaction if we fail to inc ref in btrfs_copy_root
While testing my error handling patches, I added a error injection site at btrfs_inc_extent_ref, to validate the error handling I added was doing the correct thing. However I hit a pretty ugly corruption while doing this check, with the following error injection stack trace: btrfs_inc_extent_ref btrfs_copy_root create_reloc_root btrfs_init_reloc_root btrfs_record_root_in_trans btrfs_start_transaction btrfs_update_inode btrfs_update_time touch_atime file_accessed btrfs_file_mmap This is because we do not catch the error from btrfs_inc_extent_ref, which in practice would be ENOMEM, which means we lose the extent references for a root that has already been allocated and inserted, which is the problem. Fix this by aborting the transaction if we fail to do the reference modification. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
eddda68d97 |
btrfs: add asserts for deleting backref cache nodes
A weird KASAN problem that Zygo reported could have been easily caught if we checked for basic things in our backref freeing code. We have two methods of freeing a backref node - btrfs_backref_free_node: this just is kfree() essentially. - btrfs_backref_drop_node: this actually unlinks the node and cleans up everything and then calls btrfs_backref_free_node(). We should mostly be using btrfs_backref_drop_node(), to make sure the node is properly unlinked from the backref cache, and only use btrfs_backref_free_node() when we know the node isn't actually linked to the backref cache. We made a mistake here and thus got the KASAN splat. Make this style of issue easier to find by adding some ASSERT()'s to btrfs_backref_free_node() and adjusting our deletion stuff to properly init the list so we can rely on list_empty() checks working properly. BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420 Read of size 8 at addr ffff888112402950 by task btrfs/28836 CPU: 0 PID: 28836 Comm: btrfs Tainted: G W 5.10.0-e35f27394290-for-next+ #23 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Call Trace: dump_stack+0xbc/0xf9 ? btrfs_backref_cleanup_node+0x18a/0x420 print_address_description.constprop.8+0x21/0x210 ? record_print_text.cold.34+0x11/0x11 ? btrfs_backref_cleanup_node+0x18a/0x420 ? btrfs_backref_cleanup_node+0x18a/0x420 kasan_report.cold.10+0x20/0x37 ? btrfs_backref_cleanup_node+0x18a/0x420 __asan_load8+0x69/0x90 btrfs_backref_cleanup_node+0x18a/0x420 btrfs_backref_release_cache+0x83/0x1b0 relocate_block_group+0x394/0x780 ? merge_reloc_roots+0x4a0/0x4a0 btrfs_relocate_block_group+0x26e/0x4c0 btrfs_relocate_chunk+0x52/0x120 btrfs_balance+0xe2e/0x1900 ? check_flags.part.50+0x6c/0x1e0 ? btrfs_relocate_chunk+0x120/0x120 ? kmem_cache_alloc_trace+0xa06/0xcb0 ? _copy_from_user+0x83/0xc0 btrfs_ioctl_balance+0x3a7/0x460 btrfs_ioctl+0x24c8/0x4360 ? __kasan_check_read+0x11/0x20 ? check_chain_key+0x1f4/0x2f0 ? __asan_loadN+0xf/0x20 ? btrfs_ioctl_get_supported_features+0x30/0x30 ? kvm_sched_clock_read+0x18/0x30 ? check_chain_key+0x1f4/0x2f0 ? lock_downgrade+0x3f0/0x3f0 ? handle_mm_fault+0xad6/0x2150 ? do_vfs_ioctl+0xfc/0x9d0 ? ioctl_file_clone+0xe0/0xe0 ? check_flags.part.50+0x6c/0x1e0 ? check_flags.part.50+0x6c/0x1e0 ? check_flags+0x26/0x30 ? lock_is_held_type+0xc3/0xf0 ? syscall_enter_from_user_mode+0x1b/0x60 ? do_syscall_64+0x13/0x80 ? rcu_read_lock_sched_held+0xa1/0xd0 ? __kasan_check_read+0x11/0x20 ? __fget_light+0xae/0x110 __x64_sys_ioctl+0xc3/0x100 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f4c4bdfe427 RSP: 002b:00007fff33ee6df8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fff33ee6e98 RCX: 00007f4c4bdfe427 RDX: 00007fff33ee6e98 RSI: 00000000c4009420 RDI: 0000000000000003 RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078 R10: fffffffffffff59d R11: 0000000000000202 R12: 0000000000000001 R13: 0000000000000000 R14: 00007fff33ee8a34 R15: 0000000000000001 Allocated by task 28836: kasan_save_stack+0x21/0x50 __kasan_kmalloc.constprop.18+0xbe/0xd0 kasan_kmalloc+0x9/0x10 kmem_cache_alloc_trace+0x410/0xcb0 btrfs_backref_alloc_node+0x46/0xf0 btrfs_backref_add_tree_node+0x60d/0x11d0 build_backref_tree+0xc5/0x700 relocate_tree_blocks+0x2be/0xb90 relocate_block_group+0x2eb/0x780 btrfs_relocate_block_group+0x26e/0x4c0 btrfs_relocate_chunk+0x52/0x120 btrfs_balance+0xe2e/0x1900 btrfs_ioctl_balance+0x3a7/0x460 btrfs_ioctl+0x24c8/0x4360 __x64_sys_ioctl+0xc3/0x100 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 28836: kasan_save_stack+0x21/0x50 kasan_set_track+0x20/0x30 kasan_set_free_info+0x1f/0x30 __kasan_slab_free+0xf3/0x140 kasan_slab_free+0xe/0x10 kfree+0xde/0x200 btrfs_backref_error_cleanup+0x452/0x530 build_backref_tree+0x1a5/0x700 relocate_tree_blocks+0x2be/0xb90 relocate_block_group+0x2eb/0x780 btrfs_relocate_block_group+0x26e/0x4c0 btrfs_relocate_chunk+0x52/0x120 btrfs_balance+0xe2e/0x1900 btrfs_ioctl_balance+0x3a7/0x460 btrfs_ioctl+0x24c8/0x4360 __x64_sys_ioctl+0xc3/0x100 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The buggy address belongs to the object at ffff888112402900 which belongs to the cache kmalloc-128 of size 128 The buggy address is located 80 bytes inside of 128-byte region [ffff888112402900, ffff888112402980) The buggy address belongs to the page: page:0000000028b1cd08 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888131c810c0 pfn:0x112402 flags: 0x17ffe0000000200(slab) raw: 017ffe0000000200 ffffea000424f308 ffffea0007d572c8 ffff888100040440 raw: ffff888131c810c0 ffff888112402000 0000000100000009 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888112402800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888112402880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff888112402900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888112402980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888112402a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Link: https://lore.kernel.org/linux-btrfs/20201208194607.GI31381@hungrycats.org/ CC: stable@vger.kernel.org # 5.10+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
f78743fbda |
btrfs: do not warn if we can't find the reloc root when looking up backref
The backref code is looking for a reloc_root that corresponds to the given fs root. However any number of things could have gone wrong while initializing that reloc_root, like ENOMEM while trying to allocate the root itself, or EIO while trying to write the root item. This would result in no corresponding reloc_root being in the reloc root cache, and thus would return NULL when we do the find_reloc_root() call. Because of this we do not want to WARN_ON(). This presumably was meant to catch developer errors, cases where we messed up adding the reloc root. However we can easily hit this case with error injection, and thus should not do a WARN_ON(). CC: stable@vger.kernel.org # 5.10+ Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
938fcbfb0c |
btrfs: splice remaining dirty_bg's onto the transaction dirty bg list
While doing error injection testing with my relocation patches I hit the following assert: assertion failed: list_empty(&block_group->dirty_list), in fs/btrfs/block-group.c:3356 ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3357! invalid opcode: 0000 [#1] SMP NOPTI CPU: 0 PID: 24351 Comm: umount Tainted: G W 5.10.0-rc3+ #193 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:assertfail.constprop.0+0x18/0x1a RSP: 0018:ffffa09b019c7e00 EFLAGS: 00010282 RAX: 0000000000000056 RBX: ffff8f6492c18000 RCX: 0000000000000000 RDX: ffff8f64fbc27c60 RSI: ffff8f64fbc19050 RDI: ffff8f64fbc19050 RBP: ffff8f6483bbdc00 R08: 0000000000000000 R09: 0000000000000000 R10: ffffa09b019c7c38 R11: ffffffff85d70928 R12: ffff8f6492c18100 R13: ffff8f6492c18148 R14: ffff8f6483bbdd70 R15: dead000000000100 FS: 00007fbfda4cdc40(0000) GS:ffff8f64fbc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fbfda666fd0 CR3: 000000013cf66002 CR4: 0000000000370ef0 Call Trace: btrfs_free_block_groups.cold+0x55/0x55 close_ctree+0x2c5/0x306 ? fsnotify_destroy_marks+0x14/0x100 generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 deactivate_locked_super+0x36/0xa0 cleanup_mnt+0x12d/0x190 task_work_run+0x5c/0xa0 exit_to_user_mode_prepare+0x1b1/0x1d0 syscall_exit_to_user_mode+0x54/0x280 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This happened because I injected an error in btrfs_cow_block() while running the dirty block groups. When we run the dirty block groups, we splice the list onto a local list to process. However if an error occurs, we only cleanup the transactions dirty block group list, not any pending block groups we have on our locally spliced list. In fact if we fail to allocate a path in this function we'll also fail to clean up the splice list. Fix this by splicing the list back onto the transaction dirty block group list so that the block groups are cleaned up. Then add a 'out' label and have the error conditions jump to out so that the errors are handled properly. This also has the side-effect of fixing a problem where we would clear 'ret' on error because we unconditionally ran btrfs_run_delayed_refs(). CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
c78a10aebb |
btrfs: fix reloc root leak with 0 ref reloc roots on recovery
When recovering a relocation, if we run into a reloc root that has 0 refs we simply add it to the reloc_control->reloc_roots list, and then clean it up later. The problem with this is __del_reloc_root() doesn't do anything if the root isn't in the radix tree, which in this case it won't be because we never call __add_reloc_root() on the reloc_root. This exit condition simply isn't correct really. During normal operation we can remove ourselves from the rb tree and then we're meant to clean up later at merge_reloc_roots() time, and this happens correctly. During recovery we're depending on free_reloc_roots() to drop our references, but we're short-circuiting. Fix this by continuing to check if we're on the list and dropping ourselves from the reloc_control root list and dropping our reference appropriately. Change the corresponding BUG_ON() to an ASSERT() that does the correct thing if we aren't in the rb tree. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
2e626e5673 |
btrfs: remove repeated word in struct member comment
Comment for processed extent end of range has an unnecessary "in", remove it. Signed-off-by: Nigel Christian <nigel.l.christian@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
81e75ac74e |
btrfs: account for new extents being deleted in total_bytes_pinned
My recent patch set "A variety of lock contention fixes", found here https://lore.kernel.org/linux-btrfs/cover.1608319304.git.josef@toxicpanda.com/ (Tracked in https://github.com/btrfs/linux/issues/86) that reduce lock contention on the extent root by running delayed refs less often resulted in a regression in generic/371. This test fallocate()'s the fs until it's full, deletes all the files, and then tries to fallocate() until full again. Before these patches we would run all of the delayed refs during flushing, and then would commit the transaction because we had plenty of pinned space to recover in order to allocate. However my patches made it so we weren't running the delayed refs as aggressively, which meant that we appeared to have less pinned space when we were deciding to commit the transaction. We use the space_info->total_bytes_pinned to approximate how much space we have pinned. It's approximate because if we remove a reference to an extent we may free it, but there may be more references to it than we know of at that point, but we account it as pinned at the creation time, and then it's properly accounted when the delayed ref runs. The way we account for pinned space is if the delayed_ref_head->total_ref_mod is < 0, because that is clearly a freeing option. However there is another case, and that is where ->total_ref_mod == 0 && ->must_insert_reserved == 1. When we allocate a new extent, we have ->total_ref_mod == 1 and we have ->must_insert_reserved == 1. This is used to indicate that it is a brand new extent and will need to have its extent entry added before we modify any references on the delayed ref head. But if we subsequently remove that extent reference, our ->total_ref_mod will be 0, and that space will be pinned and freed. Accounting for this case properly allows for generic/371 to pass with my delayed refs patches applied. It's important to note that this problem exists without the referenced patches, it just was uncovered by them. CC: stable@vger.kernel.org # 5.10 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
2187374f35 |
btrfs: handle space_info::total_bytes_pinned inside the delayed ref itself
Currently we pass things around to figure out if we maybe freeing data based on the state of the delayed refs head. This makes the accounting sort of confusing and hard to follow, as it's distinctly separate from the delayed ref heads stuff, but also depends on it entirely. Fix this by explicitly adjusting the space_info->total_bytes_pinned in the delayed refs code. We now have two places where we modify this counter, once where we create the delayed and destroy the delayed refs, and once when we pin and unpin the extents. This means there is a slight overlap between delayed refs and the pin/unpin mechanisms, but this is simply used by the ENOSPC infrastructure to determine if we need to commit the transaction, so there's no adverse affect from this, we might simply commit thinking it will give us enough space when it might not. CC: stable@vger.kernel.org # 5.10 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
e9aa7c285d |
btrfs: enable W=1 checks for btrfs
Now that the btrfs' codebase is clean of almost all W=1 warnings let's enable those checks unconditionally for the entire fs/btrfs/ and its subdirectories to catch potential errors during development. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add some comments ] Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
8c31a3dbaa |
btrfs: zoned: remove unused variable in btrfs_sb_log_location_bdev
This fixes warning:
fs/btrfs/zoned.c:491:6: warning: variable ‘zone_size’ set but not used [-Wunused-but-set-variable]
491 | u64 zone_size;
which got introduced in
|
||
|
3bed2da1b0 |
btrfs: fix parameter description for functions in extent_io.c
This makes the file W=1 clean and fixes the following warnings: fs/btrfs/extent_io.c:414: warning: Function parameter or member 'tree' not described in '__etree_search' fs/btrfs/extent_io.c:414: warning: Function parameter or member 'offset' not described in '__etree_search' fs/btrfs/extent_io.c:414: warning: Function parameter or member 'next_ret' not described in '__etree_search' fs/btrfs/extent_io.c:414: warning: Function parameter or member 'prev_ret' not described in '__etree_search' fs/btrfs/extent_io.c:414: warning: Function parameter or member 'p_ret' not described in '__etree_search' fs/btrfs/extent_io.c:414: warning: Function parameter or member 'parent_ret' not described in '__etree_search' fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'tree' not described in 'find_contiguous_extent_bit' fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'start' not described in 'find_contiguous_extent_bit' fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'start_ret' not described in 'find_contiguous_extent_bit' fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'end_ret' not described in 'find_contiguous_extent_bit' fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'bits' not described in 'find_contiguous_extent_bit' fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'tree' not described in 'find_first_clear_extent_bit' fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'start' not described in 'find_first_clear_extent_bit' fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'start_ret' not described in 'find_first_clear_extent_bit' fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'end_ret' not described in 'find_first_clear_extent_bit' fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'bits' not described in 'find_first_clear_extent_bit' fs/btrfs/extent_io.c:4187: warning: Function parameter or member 'epd' not described in 'extent_write_cache_pages' fs/btrfs/extent_io.c:4187: warning: Excess function parameter 'data' description in 'extent_write_cache_pages' Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d98b188ea4 |
btrfs: fix parameter description in space-info.c
With these fixes space-info.c is clear for W=1 warnings, namely the following ones are fixed: fs/btrfs/space-info.c:575: warning: Function parameter or member 'fs_info' not described in 'may_commit_transaction' fs/btrfs/space-info.c:575: warning: Function parameter or member 'space_info' not described in 'may_commit_transaction' fs/btrfs/space-info.c:1231: warning: Function parameter or member 'fs_info' not described in 'handle_reserve_ticket' fs/btrfs/space-info.c:1231: warning: Function parameter or member 'space_info' not described in 'handle_reserve_ticket' fs/btrfs/space-info.c:1231: warning: Function parameter or member 'ticket' not described in 'handle_reserve_ticket' fs/btrfs/space-info.c:1231: warning: Function parameter or member 'flush' not described in 'handle_reserve_ticket' fs/btrfs/space-info.c:1315: warning: Function parameter or member 'fs_info' not described in '__reserve_bytes' fs/btrfs/space-info.c:1315: warning: Function parameter or member 'space_info' not described in '__reserve_bytes' fs/btrfs/space-info.c:1315: warning: Function parameter or member 'orig_bytes' not described in '__reserve_bytes' fs/btrfs/space-info.c:1315: warning: Function parameter or member 'flush' not described in '__reserve_bytes' fs/btrfs/space-info.c:1427: warning: Function parameter or member 'root' not described in 'btrfs_reserve_metadata_bytes' fs/btrfs/space-info.c:1427: warning: Function parameter or member 'block_rsv' not described in 'btrfs_reserve_metadata_bytes' fs/btrfs/space-info.c:1427: warning: Function parameter or member 'orig_bytes' not described in 'btrfs_reserve_metadata_bytes' fs/btrfs/space-info.c:1427: warning: Function parameter or member 'flush' not described in 'btrfs_reserve_metadata_bytes' fs/btrfs/space-info.c:1462: warning: Function parameter or member 'fs_info' not described in 'btrfs_reserve_data_bytes' fs/btrfs/space-info.c:1462: warning: Function parameter or member 'bytes' not described in 'btrfs_reserve_data_bytes' fs/btrfs/space-info.c:1462: warning: Function parameter or member 'flush' not described in 'btrfs_reserve_data_bytes' Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
b762d1d08d |
btrfs: fix parameter description of btrfs_inode_rsv_release/btrfs_delalloc_release_space
Fixes following warnings: fs/btrfs/delalloc-space.c:205: warning: Function parameter or member 'inode' not described in 'btrfs_inode_rsv_release' fs/btrfs/delalloc-space.c:205: warning: Function parameter or member 'qgroup_free' not described in 'btrfs_inode_rsv_release' fs/btrfs/delalloc-space.c:472: warning: Function parameter or member 'reserved' not described in 'btrfs_delalloc_release_space' fs/btrfs/delalloc-space.c:472: warning: Function parameter or member 'qgroup_free' not described in 'btrfs_delalloc_release_space' fs/btrfs/delalloc-space.c:472: warning: Excess function parameter 'release_bytes' description in 'btrfs_delalloc_release_space' Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
6e353e3b3c |
btrfs: document btrfs_check_shared parameters
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
2639631d34 |
btrfs: fix description format of fs_info of btrfs_wait_on_delayed_iputs
Fixes fs/btrfs/inode.c:3101: warning: Function parameter or member 'fs_info' not described in 'btrfs_wait_on_delayed_iputs' Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
9ee9b97990 |
btrfs: document fs_info in btrfs_rmap_block
Fixes fs/btrfs/block-group.c:1570: warning: Function parameter or member 'fs_info' not described in 'btrfs_rmap_block' Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
9241969547 |
btrfs: document now parameter of peek_discard_list
Fixes fs/btrfs/discard.c:203: warning: Function parameter or member 'now' not described in 'peek_discard_list' Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
f092cf3cfd |
btrfs: improve parameter description for __btrfs_write_out_cache
Fixes following W=1 warnings: fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'root' not described in '__btrfs_write_out_cache' fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'inode' not described in '__btrfs_write_out_cache' fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'ctl' not described in '__btrfs_write_out_cache' fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'block_group' not described in '__btrfs_write_out_cache' fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'io_ctl' not described in '__btrfs_write_out_cache' fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'trans' not described in '__btrfs_write_out_cache' Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
696eb22b67 |
btrfs: fix parameter description in delayed-ref.c functions
This fixes the following warnings: fs/btrfs/delayed-ref.c:80: warning: Function parameter or member 'fs_info' not described in 'btrfs_delayed_refs_rsv_release' fs/btrfs/delayed-ref.c:80: warning: Function parameter or member 'nr' not described in 'btrfs_delayed_refs_rsv_release' fs/btrfs/delayed-ref.c:128: warning: Function parameter or member 'fs_info' not described in 'btrfs_migrate_to_delayed_refs_rsv' fs/btrfs/delayed-ref.c:128: warning: Function parameter or member 'src' not described in 'btrfs_migrate_to_delayed_refs_rsv' fs/btrfs/delayed-ref.c:128: warning: Function parameter or member 'num_bytes' not described in 'btrfs_migrate_to_delayed_refs_rsv' fs/btrfs/delayed-ref.c:174: warning: Function parameter or member 'fs_info' not described in 'btrfs_delayed_refs_rsv_refill' fs/btrfs/delayed-ref.c:174: warning: Function parameter or member 'flush' not described in 'btrfs_delayed_refs_rsv_refill' Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
ca4207ae13 |
btrfs: fix function description formats in file-item.c
This fixes following W=1 warnings: fs/btrfs/file-item.c:27: warning: Cannot understand * @inode: the inode we want to update the disk_i_size for on line 27 - I thought it was a doc line fs/btrfs/file-item.c:65: warning: Cannot understand * @inode - the inode we're modifying on line 65 - I thought it was a doc line fs/btrfs/file-item.c:91: warning: Cannot understand * @inode - the inode we're modifying on line 91 - I thought it was a doc line Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
9ad37bb3ff |
btrfs: fix parameter description of btrfs_add_extent_mapping
This fixes the following compiler warnings: fs/btrfs/extent_map.c:601: warning: Function parameter or member 'fs_info' not described in 'btrfs_add_extent_mapping' fs/btrfs/extent_map.c:601: warning: Function parameter or member 'em_tree' not described in 'btrfs_add_extent_mapping' fs/btrfs/extent_map.c:601: warning: Function parameter or member 'em_in' not described in 'btrfs_add_extent_mapping' fs/btrfs/extent_map.c:601: warning: Function parameter or member 'start' not described in 'btrfs_add_extent_mapping' fs/btrfs/extent_map.c:601: warning: Function parameter or member 'len' not described in 'btrfs_add_extent_mapping' Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
401bd2dd12 |
btrfs: document modified parameter of add_extent_mapping
Fixes fs/btrfs/extent_map.c:399: warning: Function parameter or member 'modified' not described in 'add_extent_mapping' Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
3c198fe064 |
btrfs: rework the order of btrfs_ordered_extent::flags
[BUG]
There is a long existing bug in the last parameter of
btrfs_add_ordered_extent(), in commit
|
||
|
fe3b7bb085 |
btrfs: remove redundant NULL check before kvfree
Fix below warnings reported by coccicheck: ./fs/btrfs/raid56.c:237:2-8: WARNING: NULL check before some freeing functions is not needed. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Yang Li <abaci-bugfix@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
7e2a870a59 |
btrfs: do not cleanup upper nodes in btrfs_backref_cleanup_node
Zygo reported the following panic when testing my error handling patches for relocation: kernel BUG at fs/btrfs/backref.c:2545! invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 3 PID: 8472 Comm: btrfs Tainted: G W 14 Hardware name: QEMU Standard PC (i440FX + PIIX, Call Trace: btrfs_backref_error_cleanup+0x4df/0x530 build_backref_tree+0x1a5/0x700 ? _raw_spin_unlock+0x22/0x30 ? release_extent_buffer+0x225/0x280 ? free_extent_buffer.part.52+0xd7/0x140 relocate_tree_blocks+0x2a6/0xb60 ? kasan_unpoison_shadow+0x35/0x50 ? do_relocation+0xc10/0xc10 ? kasan_kmalloc+0x9/0x10 ? kmem_cache_alloc_trace+0x6a3/0xcb0 ? free_extent_buffer.part.52+0xd7/0x140 ? rb_insert_color+0x342/0x360 ? add_tree_block.isra.36+0x236/0x2b0 relocate_block_group+0x2eb/0x780 ? merge_reloc_roots+0x470/0x470 btrfs_relocate_block_group+0x26e/0x4c0 btrfs_relocate_chunk+0x52/0x120 btrfs_balance+0xe2e/0x18f0 ? pvclock_clocksource_read+0xeb/0x190 ? btrfs_relocate_chunk+0x120/0x120 ? lock_contended+0x620/0x6e0 ? do_raw_spin_lock+0x1e0/0x1e0 ? do_raw_spin_unlock+0xa8/0x140 btrfs_ioctl_balance+0x1f9/0x460 btrfs_ioctl+0x24c8/0x4380 ? __kasan_check_read+0x11/0x20 ? check_chain_key+0x1f4/0x2f0 ? __asan_loadN+0xf/0x20 ? btrfs_ioctl_get_supported_features+0x30/0x30 ? kvm_sched_clock_read+0x18/0x30 ? check_chain_key+0x1f4/0x2f0 ? lock_downgrade+0x3f0/0x3f0 ? handle_mm_fault+0xad6/0x2150 ? do_vfs_ioctl+0xfc/0x9d0 ? ioctl_file_clone+0xe0/0xe0 ? check_flags.part.50+0x6c/0x1e0 ? check_flags.part.50+0x6c/0x1e0 ? check_flags+0x26/0x30 ? lock_is_held_type+0xc3/0xf0 ? syscall_enter_from_user_mode+0x1b/0x60 ? do_syscall_64+0x13/0x80 ? rcu_read_lock_sched_held+0xa1/0xd0 ? __kasan_check_read+0x11/0x20 ? __fget_light+0xae/0x110 __x64_sys_ioctl+0xc3/0x100 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This occurs because of this check if (RB_EMPTY_NODE(&upper->rb_node)) BUG_ON(!list_empty(&node->upper)); As we are dropping the backref node, if we discover that our upper node in the edge we just cleaned up isn't linked into the cache that we are now done with this node, thus the BUG_ON(). However this is an erroneous assumption, as we will look up all the references for a node first, and then process the pending edges. All of the 'upper' nodes in our pending edges won't be in the cache's rb_tree yet, because they haven't been processed. We could very well have many edges still left to cleanup on this node. The fact is we simply do not need this check, we can just process all of the edges only for this node, because below this check we do the following if (list_empty(&upper->lower)) { list_add_tail(&upper->lower, &cache->leaves); upper->lowest = 1; } If the upper node truly isn't used yet, then we add it to the cache->leaves list to be cleaned up later. If it is still used then the last child node that has it linked into its node will add it to the leaves list and then it will be cleaned up. Fix this problem by dropping this logic altogether. With this fix I no longer see the panic when testing with error injection in the backref code. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
f7ba2d3751 |
btrfs: keep track of the root owner for relocation reads
While testing the error paths in relocation, I hit the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.10.0-rc3+ #206 Not tainted ------------------------------------------------------ btrfs-balance/1571 is trying to acquire lock: ffff8cdbcc8f77d0 (&head_ref->mutex){+.+.}-{3:3}, at: btrfs_lookup_extent_info+0x156/0x3b0 but task is already holding lock: ffff8cdbc54adbf8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x100 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-tree-00){++++}-{3:3}: down_write_nested+0x43/0x80 __btrfs_tree_lock+0x27/0x100 btrfs_search_slot+0x248/0x890 relocate_tree_blocks+0x490/0x650 relocate_block_group+0x1ba/0x5d0 kretprobe_trampoline+0x0/0x50 -> #1 (btrfs-csum-01){++++}-{3:3}: down_read_nested+0x43/0x130 __btrfs_tree_read_lock+0x27/0x100 btrfs_read_lock_root_node+0x31/0x40 btrfs_search_slot+0x5ab/0x890 btrfs_del_csums+0x10b/0x3c0 __btrfs_free_extent+0x49d/0x8e0 __btrfs_run_delayed_refs+0x283/0x11f0 btrfs_run_delayed_refs+0x86/0x220 btrfs_start_dirty_block_groups+0x2ba/0x520 kretprobe_trampoline+0x0/0x50 -> #0 (&head_ref->mutex){+.+.}-{3:3}: __lock_acquire+0x1167/0x2150 lock_acquire+0x116/0x3e0 __mutex_lock+0x7e/0x7b0 btrfs_lookup_extent_info+0x156/0x3b0 walk_down_proc+0x1c3/0x280 walk_down_tree+0x64/0xe0 btrfs_drop_subtree+0x182/0x260 do_relocation+0x52e/0x660 relocate_tree_blocks+0x2ae/0x650 relocate_block_group+0x1ba/0x5d0 kretprobe_trampoline+0x0/0x50 other info that might help us debug this: Chain exists of: &head_ref->mutex --> btrfs-csum-01 --> btrfs-tree-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-tree-00); lock(btrfs-csum-01); lock(btrfs-tree-00); lock(&head_ref->mutex); *** DEADLOCK *** 5 locks held by btrfs-balance/1571: #0: ffff8cdb89749ff8 (&fs_info->delete_unused_bgs_mutex){+.+.}-{3:3}, at: btrfs_balance+0x563/0xf40 #1: ffff8cdb89748838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x156/0x300 #2: ffff8cdbc2c16650 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x413/0x5c0 #3: ffff8cdbc135f538 (btrfs-treloc-01){+.+.}-{3:3}, at: __btrfs_tree_lock+0x27/0x100 #4: ffff8cdbc54adbf8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x100 stack backtrace: CPU: 1 PID: 1571 Comm: btrfs-balance Not tainted 5.10.0-rc3+ #206 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x8b/0xb0 check_noncircular+0xcf/0xf0 ? trace_call_bpf+0x139/0x260 __lock_acquire+0x1167/0x2150 lock_acquire+0x116/0x3e0 ? btrfs_lookup_extent_info+0x156/0x3b0 __mutex_lock+0x7e/0x7b0 ? btrfs_lookup_extent_info+0x156/0x3b0 ? btrfs_lookup_extent_info+0x156/0x3b0 ? release_extent_buffer+0x124/0x170 ? _raw_spin_unlock+0x1f/0x30 ? release_extent_buffer+0x124/0x170 btrfs_lookup_extent_info+0x156/0x3b0 walk_down_proc+0x1c3/0x280 walk_down_tree+0x64/0xe0 btrfs_drop_subtree+0x182/0x260 do_relocation+0x52e/0x660 relocate_tree_blocks+0x2ae/0x650 ? add_tree_block+0x149/0x1b0 relocate_block_group+0x1ba/0x5d0 elfcorehdr_read+0x40/0x40 ? elfcorehdr_read+0x40/0x40 ? btrfs_balance+0x796/0xf40 ? __kthread_parkme+0x66/0x90 ? btrfs_balance+0xf40/0xf40 ? balance_kthread+0x37/0x50 ? kthread+0x137/0x150 ? __kthread_bind_mask+0x60/0x60 ? ret_from_fork+0x1f/0x30 As you can see this is bogus, we never take another tree's lock under the csum lock. This happens because sometimes we have to read tree blocks from disk without knowing which root they belong to during relocation. We defaulted to an owner of 0, which translates to an fs tree. This is fine as all fs trees have the same class, but obviously isn't fine if the block belongs to a COW only tree. Thankfully COW only trees only have their owners root as a reference to them, and since we already look up the extent information during relocation, go ahead and check and see if this block might belong to a COW only tree, and if so save the owner in the tree_block struct. This allows us to read_tree_block with the proper owner, which gets rid of this lockdep splat. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
c0f0a9e716 |
btrfs: introduce helper to grab an existing extent buffer from a page
This patch will extract the code to grab an extent buffer from a page into a helper, grab_extent_buffer_from_page(). This reduces one indent level, and provides the work place for later expansion for subapge support. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
c0fab48095 |
btrfs: update comment for btrfs_dirty_pages
The original comment is from the initial merge, which has several problems: - No holes check any more - No inline decision is made Update the out-of-date comment with more correct one. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
6bc5636a67 |
btrfs: refactor __extent_writepage_io() to improve readability
The refactoring involves the following modifications: - iosize alignment In fact we don't really need to manually do alignment at all. All extent maps should already be aligned, thus basic ASSERT() check would be enough. - redundant variables We have extra variable like blocksize/pg_offset/end. They are all unnecessary. @blocksize can be replaced by sectorsize size directly, and it's only used to verify the em start/size is aligned. @pg_offset can be easily calculated using @cur and page_offset(page). @end is just assigned from @page_end and never modified, use "start + PAGE_SIZE - 1" directly and remove @page_end. - remove some BUG_ON()s The BUG_ON()s are for extent map, which we have tree-checker to check on-disk extent data item and runtime check. ASSERT() should be enough. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
0c64c33c60 |
btrfs: rename parameter offset to disk_bytenr in submit_extent_page
The parameter offset is confusing, it's supposed to be the disk bytenr of metadata/data. Rename it to disk_bytenr and update the comment. Also rename each offset passed to submit_extent_page() as @disk_bytenr so they're consistent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
58f74b2203 |
btrfs: refactor btrfs_dec_test_* functions for ordered extents
The refactoring involves the following modifications: - Return bool instead of int - Parameter update for @cached of btrfs_dec_test_first_ordered_pending() For btrfs_dec_test_first_ordered_pending(), @cached is only used to return the finished ordered extent. Rename it to @finished_ret. - Comment updates * Change one stale comment Which still refers to btrfs_dec_test_ordered_pending(), but the context is calling btrfs_dec_test_first_ordered_pending(). * Follow the common comment style for both functions Add more detailed descriptions for parameters and the return value * Move the reason why test_and_set_bit() is used into the call sites - Change how the return value is calculated The most anti-human part of the return value is: if (...) ret = 1; ... return ret == 0; This means, when we set ret to 1, the function returns 0. Change the local variable name to @finished, and directly return the value of it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
523929f1ca |
btrfs: make btrfs_dio_private::bytes u32
btrfs_dio_private::bytes is only assigned from bio::bi_iter::bi_size, which is never larger than U32. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d7830b7155 |
btrfs: remove always true condition in btrfs_start_delalloc_roots
Following the rework in
|
||
|
9db4dc241e |
btrfs: make btrfs_start_delalloc_root's nr argument a long
It's currently u64 which gets instantly translated either to LONG_MAX (if U64_MAX is passed) or cast to an unsigned long (which is in fact, wrong because writeback_control::nr_to_write is a signed, long type). Just convert the function's argument to be long time which obviates the need to manually convert u64 value to a long. Adjust all call sites which pass U64_MAX to pass LONG_MAX. Finally ensure that in shrink_delalloc the u64 is converted to a long without overflowing, resulting in a negative number. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
9c4a062a94 |
btrfs: send: remove stale code when checking for shared extents
After commit
|
||
|
7056bf69e5 |
btrfs: consolidate btrfs_previous_item ret val handling in btrfs_shrink_device
Instead of having three 'if' to handle non-NULL return value consolidate this in one 'if (ret)'. That way the code is more obvious: - Always drop delete_unused_bgs_mutex if ret is not NULL - If ret is negative -> goto done - If it's 1 -> reset ret to 0, release the path and finish the loop. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
1478143ac8 |
btrfs: ref-verify: make sure owner is set for all refs
I noticed that shared ref entries in ref-verify didn't have the proper owner set, which caused me to think there was something seriously wrong. However the problem is if we have a parent we simply weren't filling out the owner part of the reference, even though we have it. Fix this by making sure we set all the proper fields when we modify a reference, this way we'll have the proper owner if a problem happens and we don't waste time thinking we're updating the wrong level. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
0d73a11c62 |
btrfs: ref-verify: pass down tree block level when building refs
I noticed that sometimes I would have the wrong level printed out with ref-verify while testing some error injection related problems. This is because we only get the level from the main extent item, but our references could go off the current leaf into another, and at that point we lose our level. Fix this by keeping track of the last tree block level that we found, the same way we keep track of our bytenr and num_bytes, in case we happen to wander into another leaf while still processing the references for a bytenr. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
1fec12a560 |
btrfs: noinline btrfs_should_cancel_balance
I was attempting to reproduce a problem that Zygo hit, but my error injection wasn't firing for a few of the common calls to btrfs_should_cancel_balance. This is because the compiler decided to inline it at these spots. Keep this from happening by explicitly marking the function as noinline so that error injection will always work. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
f75e2b79b5 |
btrfs: allow error injection for btrfs_search_slot and btrfs_cow_block
The following patches are going to address error handling in relocation, in order to test those patches I need to be able to inject errors in btrfs_search_slot and btrfs_cow_block, as we call both of these pretty often in different cases during relocation. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
69948022c9 |
btrfs: remove new_dirid argument from btrfs_create_subvol_root
It's no longer used. While at it also remove new_dirid in create_subvol as it's used in a single place and open code it. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
23125104d8 |
btrfs: make btrfs_root::free_objectid hold the next available objectid
Adjust the way free_objectid is being initialized, it now stores BTRFS_FIRST_FREE_OBJECTID rather than the, somewhat arbitrary, BTRFS_FIRST_FREE_OBJECTID - 1. This change also has the added benefit that now it becomes unnecessary to explicitly initialize free_objectid for a newly create fs root. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
6b8fad576a |
btrfs: rename btrfs_root::highest_objectid to free_objectid
This reflects the true purpose of the member as it's being used solely in context where a new objectid is being allocated. Future changes will also change the way it's being used to closely follow this semantics. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
543068a217 |
btrfs: rename btrfs_find_free_objectid to btrfs_get_free_objectid
This better reflects the semantics of the function i.e no search is performed whatsoever. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
453e487386 |
btrfs: rename btrfs_find_highest_objectid to btrfs_init_root_free_objectid
This function is used to initialize the in-memory btrfs_root::highest_objectid member, which is used to get an available objectid. Rename it to better reflect its semantics. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
149716570b |
btrfs: cleanup local variables in btrfs_file_write_iter
First replace all inode instances with a pointer to btrfs_inode. This removes multiple invocations of the BTRFS_I macro, subsequently remove 2 local variables as they are called only once and simply refer to them directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
3cc64e7ebf |
btrfs: clarify error returns values in __load_free_space_cache
Return value in __load_free_space_cache is not properly set after
(unlikely) memory allocation failures and 0 is returned instead.
This is not a problem for the caller load_free_space_cache because only
value 1 is considered as 'cache loaded' but for clarity it's better
to set the errors accordingly.
Fixes:
|
||
|
4f4317c13a |
btrfs: fix error handling in commit_fs_roots
While doing error injection I would sometimes get a corrupt file system. This is because I was injecting errors at btrfs_search_slot, but would only do it one time per stack. This uncovered a problem in commit_fs_roots, where if we get an error we would just break. However we're in a nested loop, the first loop being a loop to find all the dirty fs roots, and then subsequent root updates would succeed clearing the error value. This isn't likely to happen in real scenarios, however we could potentially get a random ENOMEM once and then not again, and we'd end up with a corrupted file system. Fix this by moving the error checking around a bit to the main loop, as this is the only place where something will fail, and return the error as soon as it occurs. With this patch my reproducer no longer corrupts the file system. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
c05d51c773 |
for-5.11-rc5-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAUIkAACgkQxWXV+ddt WDsWVg/+IIEk9H1v9q9ShvVmPvmnlT8/0ywj1hdwFMBkFBjIeU8tBz9ZMGPXCzrF XemmWKChVOnR3SIq/bMrwuRC/Gv/pBvwVshXLP51YJHv7lSGX0Ayrb27BFQcVaC/ 3QhpE7veEiqxwLyMj+LWG4hE2X+oqiqzrXCpeC5un4zEluT45RSKooqueQ4jM8aw DrKLQA57a1YEIqrE2KQzy5A6BnSNyxPXEEX34kbugmmen46Fh77hrwme1K9vQn1t v3/V4LcarXADxxokAxU2Igb/vK0+BN33NOYsBwLWWD4kUaTGS4KczsDOowkRRTMH /qiQUdca0X7ElR+VFl8rgB8PxuJcZ87aCdsMkErUA4sjxyp11VDIeEgirPNAcXtR b+1LIkn3k3l8JzkKyXwDuZuNBsh0idTY24IE+QDBMIGq+jE1N6N3t5gEwa2NeaiP 9O5QnS5XAJCo8a9+gp1aF5z94vwQwvf9TA80nGrnpxGmXEEEZ9PgXsc4JON1Blhn NtJDwBPzEjHCEYdE73/lRMsLmYeGhpRugKb+lQ+OTo2iZzxH2SjWn9vXKiN7vAp2 zysjzdPfkY5BLggH5cPg0fuRaf/Is00EeVqn3eA7QsFKDhrpoPFBO+aV5xeshsaz 8fjt7kkXFb+Vyy4SDvmPioJQ7/MFZ5Czn+BL1JwO4l/vYcEMUzM= =/yHv -----END PGP SIGNATURE----- Merge tag 'for-5.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes for a late rc: - fix lockdep complaint on 32bit arches and also remove an unsafe memory use due to device vs filesystem lifetime - two fixes for free space tree: * race during log replay and cache rebuild, now more likely to happen due to changes in this dev cycle * possible free space tree corruption with online conversion during initial tree population" * tag 'for-5.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix log replay failure due to race with space cache rebuild btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch btrfs: fix possible free space tree corruption with online conversion |
||
|
616c6a6884 |
btrfs: use bio_kmalloc in __alloc_device
Use bio_kmalloc instead of open coding it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
9ad6d91f05 |
btrfs: fix log replay failure due to race with space cache rebuild
After a sudden power failure we may end up with a space cache on disk that
is not valid and needs to be rebuilt from scratch.
If that happens, during log replay when we attempt to pin an extent buffer
from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for
the space cache to be rebuilt through the call to:
btrfs_cache_block_group(cache, 1);
That is because that only waits for the task (work queue job) that loads
the space cache to change the cache state from BTRFS_CACHE_FAST to any
other value. That is ok when the space cache on disk exists and is valid,
but when the cache is not valid and needs to be rebuilt, it ends up
returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done
at caching_thread()).
So this means that we can end up trying to unpin a range which is not yet
marked as free in the block group. This results in the call to
btrfs_remove_free_space() to return -EINVAL to
btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail
as well as mounting the filesystem. More specifically the -EINVAL comes
from free_space_cache.c:remove_from_bitmap(), because the requested range
is not marked as free space (ones in the bitmap), we have the following
condition triggered:
static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
(...)
if (ret < 0 || search_start != *offset)
return -EINVAL;
(...)
It's the "search_start != *offset" that results in the condition being
evaluated to true.
When this happens we got the following in dmesg/syslog:
[72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 1 transid 1432 /dev/sdb scanned by mount (3816007)
[72383.417837] BTRFS info (device sdb): disk space caching is enabled
[72383.418536] BTRFS info (device sdb): has skinny extents
[72383.423846] BTRFS info (device sdb): start tree-log replay
[72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong amount of free space
[72383.427686] BTRFS warning (device sdb): failed to load free space cache for block group 30408704, rebuilding it now
[72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: errno=-22 unknown (Failed to pin buffers while recovering log root tree.)
[72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 unknown (Failed to recover log tree)
[72383.460241] BTRFS error (device sdb): open_ctree failed
We also mark the range for the extent buffer in the excluded extents io
tree. That is fine when the space cache is valid on disk and we can load
it, in which case it causes no problems.
However, for the case where we need to rebuild the space cache, because it
is either invalid or it is missing, having the extent buffer range marked
in the excluded extents io tree leads to a -EINVAL failure from the call
to btrfs_remove_free_space(), resulting in the log replay and mount to
fail. This is because by having the range marked in the excluded extents
io tree, the caching thread ends up never adding the range of the extent
buffer as free space in the block group since the calls to
add_new_free_space(), called from load_extent_tree_free(), filter out any
ranges that are marked as excluded extents.
So fix this by making sure that during log replay we wait for the caching
task to finish completely when we need to rebuild a space cache, and also
drop the need to mark the extent buffer range in the excluded extents io
tree, as well as clearing ranges from that tree at
btrfs_finish_extent_commit().
This started to happen with some frequency on large filesystems having
block groups with a lot of fragmentation since the recent commit
|
||
|
c41ec4529d |
btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch
This effectively reverts commit |
||
|
2f96e40212 |
btrfs: fix possible free space tree corruption with online conversion
While running btrfs/011 in a loop I would often ASSERT() while trying to
add a new free space entry that already existed, or get an EEXIST while
adding a new block to the extent tree, which is another indication of
double allocation.
This occurs because when we do the free space tree population, we create
the new root and then populate the tree and commit the transaction.
The problem is when you create a new root, the root node and commit root
node are the same. During this initial transaction commit we will run
all of the delayed refs that were paused during the free space tree
generation, and thus begin to cache block groups. While caching block
groups the caching thread will be reading from the main root for the
free space tree, so as we make allocations we'll be changing the free
space tree, which can cause us to add the same range twice which results
in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a
variety of different errors when running delayed refs because of a
double allocation.
Fix this by marking the fs_info as unsafe to load the free space tree,
and fall back on the old slow method. We could be smarter than this,
for example caching the block group while we're populating the free
space tree, but since this is a serious problem I've opted for the
simplest solution.
CC: stable@vger.kernel.org # 4.9+
Fixes:
|
||
|
309dca309f |
block: store a block_device pointer in struct bio
Replace the gendisk pointer in struct bio with a pointer to the newly improved struct block device. From that the gendisk can be trivially accessed with an extra indirection, but it also allows to directly look up all information related to partition remapping. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
549c729771
|
fs: make helpers idmap mount aware
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
||
|
ba73d98745
|
namei: handle idmapped mounts in may_*() helpers
The may_follow_link(), may_linkat(), may_lookup(), may_open(), may_o_create(), may_create_in_sticky(), may_delete(), and may_create() helpers determine whether the caller is privileged enough to perform the associated operations. Let them handle idmapped mounts by mapping the inode or fsids according to the mount's user namespace. Afterwards the checks are identical to non-idmapped inodes. The patch takes care to retrieve the mount's user namespace right before performing permission checks and passing it down into the fileystem so the user namespace can't change in between by someone idmapping a mount that is currently not idmapped. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
||
|
0d56a4518d
|
stat: handle idmapped mounts
The generic_fillattr() helper fills in the basic attributes associated with an inode. Enable it to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace before we store the uid and gid. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
||
|
e65ce2a50c
|
acl: handle idmapped mounts
The posix acl permission checking helpers determine whether a caller is privileged over an inode according to the acls associated with the inode. Add helpers that make it possible to handle acls on idmapped mounts. The vfs and the filesystems targeted by this first iteration make use of posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to translate basic posix access and default permissions such as the ACL_USER and ACL_GROUP type according to the initial user namespace (or the superblock's user namespace) to and from the caller's current user namespace. Adapt these two helpers to handle idmapped mounts whereby we either map from or into the mount's user namespace depending on in which direction we're translating. Similarly, cap_convert_nscap() is used by the vfs to translate user namespace and non-user namespace aware filesystem capabilities from the superblock's user namespace to the caller's user namespace. Enable it to handle idmapped mounts by accounting for the mount's user namespace. In addition the fileystems targeted in the first iteration of this patch series make use of the posix_acl_chmod() and, posix_acl_update_mode() helpers. Both helpers perform permission checks on the target inode. Let them handle idmapped mounts. These two helpers are called when posix acls are set by the respective filesystems to handle this case we extend the ->set() method to take an additional user namespace argument to pass the mount's user namespace down. Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
||
|
2f221d6f7b
|
attr: handle idmapped mounts
When file attributes are changed most filesystems rely on the setattr_prepare(), setattr_copy(), and notify_change() helpers for initialization and permission checking. Let them handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Helpers that perform checks on the ia_uid and ia_gid fields in struct iattr assume that ia_uid and ia_gid are intended values and have already been mapped correctly at the userspace-kernelspace boundary as we already do today. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
||
|
21cb47be6f
|
inode: make init and permission helpers idmapped mount aware
The inode_owner_or_capable() helper determines whether the caller is the owner of the inode or is capable with respect to that inode. Allow it to handle idmapped mounts. If the inode is accessed through an idmapped mount it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Similarly, allow the inode_init_owner() helper to handle idmapped mounts. It initializes a new inode on idmapped mounts by mapping the fsuid and fsgid of the caller from the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
||
|
47291baa8d
|
namei: make permission helpers idmapped mount aware
The two helpers inode_permission() and generic_permission() are used by the vfs to perform basic permission checking by verifying that the caller is privileged over an inode. In order to handle idmapped mounts we extend the two helpers with an additional user namespace argument. On idmapped mounts the two helpers will make sure to map the inode according to the mount's user namespace and then peform identical permission checks to inode_permission() and generic_permission(). If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
||
|
2f63296578 |
iomap: pass a flags argument to iomap_dio_rw
Pass a set of flags to iomap_dio_rw instead of the boolean wait_for_completion argument. The IOMAP_DIO_FORCE_WAIT flag replaces the wait_for_completion, but only needs to be passed when the iocb isn't synchronous to start with to simplify the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> [djwong: rework xfs_file.c so that we can push iomap changes separately] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> |
||
|
9791581c04 |
for-5.11-rc4-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAIojwACgkQxWXV+ddt WDstnw/+O0KSsK6ChZCNdjqFAgWL41RYj0fPOgM/8xlNaQyYHS0Jczeoud6m/2Wm U41kTb/a6xpmx0Z/2uf/5pDIBPFld/IUuUf/AdJsMzy8Bpky2/sfg6Kmx0tKGLXQ 1WKp9ox0MlAUI0Tz/jGfX5rwsIgWKYKIF2iGUio/H1ktR3l+cXlmLWsSIB43F6VL AjKRRyFCNU//dV7syNMmmj9yU0HpSs53SpWxUIURuTFaE71LyUgzaxDTlZ6c/PET e4wdf8nl0wzEESCgSUPdh2AWNNiTEbbGhhhNi9250PUyQki2f4AGBlxVSLZH/fDn 6PbBDvefW4umCMeMxxmgnYJU6tG78qg/LvxzZXt54rOtB0WMbrIl0u7hFCVhQ3Qk nqrS4tqeaL+OeuR6xamBMaRohgRFa9S+QVjTwtDFo/oVYH4TVvQDfKQS6GsWwDvB ySzz3WewoFqhe47cMsy28Dg49xkDSIJIr5hmSNGSXTreZ2JIa+qLKywoH87+YDIE ql0PN47z4NB+MbWDV7SZM8DCVqiQ7+1LOV9bPmqfvNl3YTfvXyMaoPLmWWVstPr2 iyhXrvESgm1s2RCF1a0tXIkv82L6QYjJ3eeEDsvAmtKBouNL9BnMvwi3zW5yKiry m1qj7C7e6C1TivYitcCfbRCKqeAnUv8VwkSbW9BvNDe7i5AD++U= =gSYr -----END PGP SIGNATURE----- Merge tag 'for-5.11-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more one line fixes for various bugs, stable material. - fix send when emitting clone operation from the same file and root - fix double free on error when cleaning backrefs - lockdep fix during relocation - handle potential error during reloc when starting transaction - skip running delayed refs during commit (leftover from code removal in this dev cycle)" * tag 'for-5.11-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: don't clear ret in btrfs_start_dirty_block_groups btrfs: fix lockdep splat in btrfs_recover_relocation btrfs: do not double free backref nodes on error btrfs: don't get an EINTR during drop_snapshot for reloc btrfs: send: fix invalid clone operations when cloning from the same file and root btrfs: no need to run delayed refs after commit_fs_roots during commit |
||
|
34d1eb0e59 |
btrfs: don't clear ret in btrfs_start_dirty_block_groups
If we fail to update a block group item in the loop we'll break, however we'll do btrfs_run_delayed_refs and lose our error value in ret, and thus not clean up properly. Fix this by only running the delayed refs if there was no failure. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
fb28610097 |
btrfs: fix lockdep splat in btrfs_recover_relocation
While testing the error paths of relocation I hit the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.10.0-rc6+ #217 Not tainted ------------------------------------------------------ mount/779 is trying to acquire lock: ffffa0e676945418 (&fs_info->balance_mutex){+.+.}-{3:3}, at: btrfs_recover_balance+0x2f0/0x340 but task is already holding lock: ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-root-00){++++}-{3:3}: down_read_nested+0x43/0x130 __btrfs_tree_read_lock+0x27/0x100 btrfs_read_lock_root_node+0x31/0x40 btrfs_search_slot+0x462/0x8f0 btrfs_update_root+0x55/0x2b0 btrfs_drop_snapshot+0x398/0x750 clean_dirty_subvols+0xdf/0x120 btrfs_recover_relocation+0x534/0x5a0 btrfs_start_pre_rw_mount+0xcb/0x170 open_ctree+0x151f/0x1726 btrfs_mount_root.cold+0x12/0xea legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x380 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 path_mount+0x433/0xc10 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (sb_internal#2){.+.+}-{0:0}: start_transaction+0x444/0x700 insert_balance_item.isra.0+0x37/0x320 btrfs_balance+0x354/0xf40 btrfs_ioctl_balance+0x2cf/0x380 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&fs_info->balance_mutex){+.+.}-{3:3}: __lock_acquire+0x1120/0x1e10 lock_acquire+0x116/0x370 __mutex_lock+0x7e/0x7b0 btrfs_recover_balance+0x2f0/0x340 open_ctree+0x1095/0x1726 btrfs_mount_root.cold+0x12/0xea legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x380 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 path_mount+0x433/0xc10 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Chain exists of: &fs_info->balance_mutex --> sb_internal#2 --> btrfs-root-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-root-00); lock(sb_internal#2); lock(btrfs-root-00); lock(&fs_info->balance_mutex); *** DEADLOCK *** 2 locks held by mount/779: #0: ffffa0e60dc040e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380 #1: ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100 stack backtrace: CPU: 0 PID: 779 Comm: mount Not tainted 5.10.0-rc6+ #217 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x8b/0xb0 check_noncircular+0xcf/0xf0 ? trace_call_bpf+0x139/0x260 __lock_acquire+0x1120/0x1e10 lock_acquire+0x116/0x370 ? btrfs_recover_balance+0x2f0/0x340 __mutex_lock+0x7e/0x7b0 ? btrfs_recover_balance+0x2f0/0x340 ? btrfs_recover_balance+0x2f0/0x340 ? rcu_read_lock_sched_held+0x3f/0x80 ? kmem_cache_alloc_trace+0x2c4/0x2f0 ? btrfs_get_64+0x5e/0x100 btrfs_recover_balance+0x2f0/0x340 open_ctree+0x1095/0x1726 btrfs_mount_root.cold+0x12/0xea ? rcu_read_lock_sched_held+0x3f/0x80 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x380 ? __kmalloc_track_caller+0x2f2/0x320 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 ? capable+0x3a/0x60 path_mount+0x433/0xc10 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This is straightforward to fix, simply release the path before we setup the balance_ctl. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
49ecc679ab |
btrfs: do not double free backref nodes on error
Zygo reported the following KASAN splat:
BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420
Read of size 8 at addr ffff888112402950 by task btrfs/28836
CPU: 0 PID: 28836 Comm: btrfs Tainted: G W 5.10.0-e35f27394290-for-next+ #23
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
dump_stack+0xbc/0xf9
? btrfs_backref_cleanup_node+0x18a/0x420
print_address_description.constprop.8+0x21/0x210
? record_print_text.cold.34+0x11/0x11
? btrfs_backref_cleanup_node+0x18a/0x420
? btrfs_backref_cleanup_node+0x18a/0x420
kasan_report.cold.10+0x20/0x37
? btrfs_backref_cleanup_node+0x18a/0x420
__asan_load8+0x69/0x90
btrfs_backref_cleanup_node+0x18a/0x420
btrfs_backref_release_cache+0x83/0x1b0
relocate_block_group+0x394/0x780
? merge_reloc_roots+0x4a0/0x4a0
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
? check_flags.part.50+0x6c/0x1e0
? btrfs_relocate_chunk+0x120/0x120
? kmem_cache_alloc_trace+0xa06/0xcb0
? _copy_from_user+0x83/0xc0
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
? __kasan_check_read+0x11/0x20
? check_chain_key+0x1f4/0x2f0
? __asan_loadN+0xf/0x20
? btrfs_ioctl_get_supported_features+0x30/0x30
? kvm_sched_clock_read+0x18/0x30
? check_chain_key+0x1f4/0x2f0
? lock_downgrade+0x3f0/0x3f0
? handle_mm_fault+0xad6/0x2150
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? check_flags.part.50+0x6c/0x1e0
? check_flags.part.50+0x6c/0x1e0
? check_flags+0x26/0x30
? lock_is_held_type+0xc3/0xf0
? syscall_enter_from_user_mode+0x1b/0x60
? do_syscall_64+0x13/0x80
? rcu_read_lock_sched_held+0xa1/0xd0
? __kasan_check_read+0x11/0x20
? __fget_light+0xae/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f4c4bdfe427
Allocated by task 28836:
kasan_save_stack+0x21/0x50
__kasan_kmalloc.constprop.18+0xbe/0xd0
kasan_kmalloc+0x9/0x10
kmem_cache_alloc_trace+0x410/0xcb0
btrfs_backref_alloc_node+0x46/0xf0
btrfs_backref_add_tree_node+0x60d/0x11d0
build_backref_tree+0xc5/0x700
relocate_tree_blocks+0x2be/0xb90
relocate_block_group+0x2eb/0x780
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 28836:
kasan_save_stack+0x21/0x50
kasan_set_track+0x20/0x30
kasan_set_free_info+0x1f/0x30
__kasan_slab_free+0xf3/0x140
kasan_slab_free+0xe/0x10
kfree+0xde/0x200
btrfs_backref_error_cleanup+0x452/0x530
build_backref_tree+0x1a5/0x700
relocate_tree_blocks+0x2be/0xb90
relocate_block_group+0x2eb/0x780
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This occurred because we freed our backref node in
btrfs_backref_error_cleanup(), but then tried to free it again in
btrfs_backref_release_cache(). This is because
btrfs_backref_release_cache() will cycle through all of the
cache->leaves nodes and free them up. However
btrfs_backref_error_cleanup() freed the backref node with
btrfs_backref_free_node(), which simply kfree()d the backref node
without unlinking it from the cache. Change this to a
btrfs_backref_drop_node(), which does the appropriate cleanup and
removes the node from the cache->leaves list, so when we go to free the
remaining cache we don't trip over items we've already dropped.
Fixes:
|
||
|
18d3bff411 |
btrfs: don't get an EINTR during drop_snapshot for reloc
This was partially fixed by |
||
|
518837e650 |
btrfs: send: fix invalid clone operations when cloning from the same file and root
When an incremental send finds an extent that is shared, it checks which file extent items in the range refer to that extent, and for those it emits clone operations, while for others it emits regular write operations to avoid corruption at the destination (as described and fixed by commit |
||
|
14ff8e1970 |
btrfs: no need to run delayed refs after commit_fs_roots during commit
The inode number cache has been removed in this dev cycle, there's one
more leftover. We don't need to run the delayed refs again after
commit_fs_roots as stated in the comment, because btrfs_save_ino_cache
is no more since
|
||
|
6e68b9961f |
for-5.11-rc3-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl/8jD4ACgkQxWXV+ddt WDteWQ//QcpD6STpLwAC+g6zJyJln7Au9lfQvawugvOJssbtdPkJQP3ZiK+Izwi/ /xagu6XMazJM+47acNJKDNntOqVkp+O6CxEbLU+rL/D288L3HEGxayZ2LL90wm6J tbIebOE+BSVZ/5oe0jVdqZXwYvUtTiJ7PoFgrZPXJCnddSitZRD3tC4Wmi/Yo5+0 +7CW6PT3/s7KARwYXpgpMM5vi8qO2nfHfTUdRlSh59g7zC/TH7HiitL6roHzlX1k g/aaKYLVcg62OPpw7ZXwde/qH8n1TR+H5WX6vBInqd/9jYcNkVGqijCgBeL1TJkN Vx/b69ccODK2GNzuuYoo3k3XvSwZWsOTZp+k4y3EZ1cMONMo1snu/xglYsvSZvUL lNCQlA9hIZNskRwEvkEea68/bQdiOl6xezgR9tajMlmz7oCsV/Cz/MJ+RfqaxdH3 bV6eTTex67lQfzAda+gN+zjBrFzQdmK700gKimdzF1XfcYmmCIdZVX8Gm/N6ldQN LNRe8zYRaqrmRk9PQ355RqYDZmft/wLiUV6V0j74oV65WpPe2R4pULWdmPAGm6Oj UWM+ZR3u9m8asg7ghKYgct2pxCS3+gLbDNXNcOSxYxthEEZB2JqkAMjtjCfwJilN PXfuXaBKRmRck+AcYfbBrfJOljQ+zAJdTK/Rid40TwwpFCe/jjY= =G3R4 -----END PGP SIGNATURE----- Merge tag 'for-5.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "More material for stable trees. - tree-checker: check item end overflow - fix false warning during relocation regarding extent type - fix inode flushing logic, caused notable performance regression (since 5.10) - debugging fixups: - print correct offset for reloc tree key - pass reliable fs_info pointer to error reporting helper" * tag 'for-5.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: shrink delalloc pages instead of full inodes btrfs: reloc: fix wrong file extent type check to avoid false ENOENT btrfs: tree-checker: check if chunk item end overflows btrfs: prevent NULL pointer dereference in extent_io_tree_panic btrfs: print the actual offset in btrfs_root_name |
||
|
e076ab2a2c |
btrfs: shrink delalloc pages instead of full inodes
Commit |
||
|
50e31ef486 |
btrfs: reloc: fix wrong file extent type check to avoid false ENOENT
[BUG] There are several bug reports about recent kernel unable to relocate certain data block groups. Sometimes the error just goes away, but there is one reporter who can reproduce it reliably. The dmesg would look like: [438.260483] BTRFS info (device dm-10): balance: start -dvrange=34625344765952..34625344765953 [438.269018] BTRFS info (device dm-10): relocating block group 34625344765952 flags data|raid1 [450.439609] BTRFS info (device dm-10): found 167 extents, stage: move data extents [463.501781] BTRFS info (device dm-10): balance: ended with status: -2 [CAUSE] The ENOENT error is returned from the following call chain: add_data_references() |- delete_v1_space_cache(); |- if (!found) return -ENOENT; The variable @found is set to true if we find a data extent whose disk bytenr matches parameter @data_bytes. With extra debugging, the offending tree block looks like this: leaf bytenr = 42676709441536, data_bytenr = 34626327621632 ctime 1567904822.739884119 (2019-09-08 03:07:02) mtime 0.0 (1970-01-01 01:00:00) otime 0.0 (1970-01-01 01:00:00) item 27 key (51933 EXTENT_DATA 0) itemoff 9854 itemsize 53 generation 1517381 type 2 (prealloc) prealloc data disk byte 34626327621632 nr 262144 <<< prealloc data offset 0 nr 262144 item 28 key (52262 ROOT_ITEM 0) itemoff 9415 itemsize 439 generation 2618893 root_dirid 256 bytenr 42677048360960 level 3 refs 1 lastsnap 2618893 byte_limit 0 bytes_used 5557338112 flags 0x0(none) uuid d0d4361f-d231-6d40-8901-fe506e4b2b53 Although item 27 has disk bytenr 34626327621632, which matches the data_bytenr, its type is prealloc, not reg. This makes the existing code skip that item, and return ENOENT. [FIX] The code is modified in commit |
||
|
347fb0cfc9 |
btrfs: tree-checker: check if chunk item end overflows
While mounting a crafted image provided by user, kernel panics due to the invalid chunk item whose end is less than start. [66.387422] loop: module loaded [66.389773] loop0: detected capacity change from 262144 to 0 [66.427708] BTRFS: device fsid a62e00e8-e94e-4200-8217-12444de93c2e devid 1 transid 12 /dev/loop0 scanned by mount (613) [66.431061] BTRFS info (device loop0): disk space caching is enabled [66.431078] BTRFS info (device loop0): has skinny extents [66.437101] BTRFS error: insert state: end < start 29360127 37748736 [66.437136] ------------[ cut here ]------------ [66.437140] WARNING: CPU: 16 PID: 613 at fs/btrfs/extent_io.c:557 insert_state.cold+0x1a/0x46 [btrfs] [66.437369] CPU: 16 PID: 613 Comm: mount Tainted: G O 5.11.0-rc1-custom #45 [66.437374] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014 [66.437378] RIP: 0010:insert_state.cold+0x1a/0x46 [btrfs] [66.437420] RSP: 0018:ffff93e5414c3908 EFLAGS: 00010286 [66.437427] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000 [66.437431] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff [66.437434] RBP: ffff93e5414c3938 R08: 0000000000000001 R09: 0000000000000001 [66.437438] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d72aa0 [66.437441] R13: ffff8ec78bc71628 R14: 0000000000000000 R15: 0000000002400000 [66.437447] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000 [66.437451] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [66.437455] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0 [66.437460] PKRU: 55555554 [66.437464] Call Trace: [66.437475] set_extent_bit+0x652/0x740 [btrfs] [66.437539] set_extent_bits_nowait+0x1d/0x20 [btrfs] [66.437576] add_extent_mapping+0x1e0/0x2f0 [btrfs] [66.437621] read_one_chunk+0x33c/0x420 [btrfs] [66.437674] btrfs_read_chunk_tree+0x6a4/0x870 [btrfs] [66.437708] ? kvm_sched_clock_read+0x18/0x40 [66.437739] open_ctree+0xb32/0x1734 [btrfs] [66.437781] ? bdi_register_va+0x1b/0x20 [66.437788] ? super_setup_bdi_name+0x79/0xd0 [66.437810] btrfs_mount_root.cold+0x12/0xeb [btrfs] [66.437854] ? __kmalloc_track_caller+0x217/0x3b0 [66.437873] legacy_get_tree+0x34/0x60 [66.437880] vfs_get_tree+0x2d/0xc0 [66.437888] vfs_kern_mount.part.0+0x78/0xc0 [66.437897] vfs_kern_mount+0x13/0x20 [66.437902] btrfs_mount+0x11f/0x3c0 [btrfs] [66.437940] ? kfree+0x5ff/0x670 [66.437944] ? __kmalloc_track_caller+0x217/0x3b0 [66.437962] legacy_get_tree+0x34/0x60 [66.437974] vfs_get_tree+0x2d/0xc0 [66.437983] path_mount+0x48c/0xd30 [66.437998] __x64_sys_mount+0x108/0x140 [66.438011] do_syscall_64+0x38/0x50 [66.438018] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [66.438023] RIP: 0033:0x7f0138827f6e [66.438033] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [66.438040] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e [66.438044] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0 [66.438047] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001 [66.438050] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [66.438054] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440 [66.438078] irq event stamp: 18169 [66.438082] hardirqs last enabled at (18175): [<ffffffffb81154bf>] console_unlock+0x4ff/0x5f0 [66.438088] hardirqs last disabled at (18180): [<ffffffffb8115427>] console_unlock+0x467/0x5f0 [66.438092] softirqs last enabled at (16910): [<ffffffffb8a00fe2>] asm_call_irq_on_stack+0x12/0x20 [66.438097] softirqs last disabled at (16905): [<ffffffffb8a00fe2>] asm_call_irq_on_stack+0x12/0x20 [66.438103] ---[ end trace e114b111db64298b ]--- [66.438107] BTRFS error: found node 12582912 29360127 on insert of 37748736 29360127 [66.438127] BTRFS critical: panic in extent_io_tree_panic:679: locking error: extent tree was modified by another thread while locked (errno=-17 Object already exists) [66.441069] ------------[ cut here ]------------ [66.441072] kernel BUG at fs/btrfs/extent_io.c:679! [66.442064] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [66.443018] CPU: 16 PID: 613 Comm: mount Tainted: G W O 5.11.0-rc1-custom #45 [66.444538] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014 [66.446223] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs] [66.450878] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246 [66.451840] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000 [66.453141] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff [66.454445] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001 [66.455743] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0 [66.457055] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000 [66.458356] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000 [66.459841] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [66.460895] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0 [66.462196] PKRU: 55555554 [66.462692] Call Trace: [66.463139] set_extent_bit.cold+0x30/0x98 [btrfs] [66.464049] set_extent_bits_nowait+0x1d/0x20 [btrfs] [66.490466] add_extent_mapping+0x1e0/0x2f0 [btrfs] [66.514097] read_one_chunk+0x33c/0x420 [btrfs] [66.534976] btrfs_read_chunk_tree+0x6a4/0x870 [btrfs] [66.555718] ? kvm_sched_clock_read+0x18/0x40 [66.575758] open_ctree+0xb32/0x1734 [btrfs] [66.595272] ? bdi_register_va+0x1b/0x20 [66.614638] ? super_setup_bdi_name+0x79/0xd0 [66.633809] btrfs_mount_root.cold+0x12/0xeb [btrfs] [66.652938] ? __kmalloc_track_caller+0x217/0x3b0 [66.671925] legacy_get_tree+0x34/0x60 [66.690300] vfs_get_tree+0x2d/0xc0 [66.708221] vfs_kern_mount.part.0+0x78/0xc0 [66.725808] vfs_kern_mount+0x13/0x20 [66.742730] btrfs_mount+0x11f/0x3c0 [btrfs] [66.759350] ? kfree+0x5ff/0x670 [66.775441] ? __kmalloc_track_caller+0x217/0x3b0 [66.791750] legacy_get_tree+0x34/0x60 [66.807494] vfs_get_tree+0x2d/0xc0 [66.823349] path_mount+0x48c/0xd30 [66.838753] __x64_sys_mount+0x108/0x140 [66.854412] do_syscall_64+0x38/0x50 [66.869673] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [66.885093] RIP: 0033:0x7f0138827f6e [66.945613] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [66.977214] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e [66.994266] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0 [67.011544] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001 [67.028836] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [67.045812] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440 [67.216138] ---[ end trace e114b111db64298c ]--- [67.237089] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs] [67.325317] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246 [67.347946] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000 [67.371343] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff [67.394757] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001 [67.418409] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0 [67.441906] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000 [67.465436] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000 [67.511660] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [67.535047] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0 [67.558449] PKRU: 55555554 [67.581146] note: mount[613] exited with preempt_count 2 The image has a chunk item which has a logical start 37748736 and length 18446744073701163008 (-8M). The calculated end 29360127 overflows. EEXIST was caught by insert_state() because of the duplicate end and extent_io_tree_panic() was called. Add overflow check of chunk item end to tree checker so it can be detected early at mount time. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929 CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
29b665cc51 |
btrfs: prevent NULL pointer dereference in extent_io_tree_panic
Some extent io trees are initialized with NULL private member (e.g.
btrfs_device::alloc_state and btrfs_fs_info::excluded_extents).
Dereference of a NULL tree->private as inode pointer will cause panic.
Pass tree->fs_info as it's known to be valid in all cases.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
Fixes:
|
||
|
71008734d2 |
btrfs: print the actual offset in btrfs_root_name
We're supposed to print the root_key.offset in btrfs_root_name in the
case of a reloc root, not the objectid. Fix this helper to take the key
so we have access to the offset when we need it.
Fixes:
|
||
|
71c061d244 |
for-5.11-rc2-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl/0cI8ACgkQxWXV+ddt WDspQw/8DcC8zhGgunk0m2kcXd6dFOGbsr3hNGCsgUSKESRw6AgTZ0rJf/QLjayF /vaJWzQW9ijfZ92fWZS+mrmskk0N8RFOsEvkCRLesgRaasbrkchLBo5HGQasOBEV LXyU878GrBkNaHzClJz+JdU26i0d17BFdddgtZVQ1St9Wd9ecc7Q6iqG80RWFeE7 uVbhv+QjocM3EieOnwIy5Mz6jZgJLYwqw7/y2njKduBeJtbt1K1j/y7IJk0WFMUM 8eUpDL6vlAHB8FjV2wWOzO46bbEaUpaBADM6yabrq0lnM0kr7Rb+WV/WSLM/AZ3g Hzs4qROOEP+zjfZ5nYjJQDJRMpSipZomsUY5uMZnhRxlZuHPaoBotRRzs5AIZYj2 BnkfucOcjxS/JTBD//ltJXE8RxbMIyMBBBipbBwqmxOkR9gM9BPuJ6iJPfUX//gG 1GHJ+FPns8ua3JW21ih6H31xNEPS36tsywvE8yCEtEWMxCFCBwgGu+4D8KpGBjtY ySFxkxxAbTuFi9fqSE/mBC+6lpbVTO0OvizuoEQh8C2izkXRbDsDVgPN8d7rCW7h Cdox4DUp61sNf+G3ll9Dv9ceAXroZTVRTHGjlav6NAFpydz3yPo5x54Ex7S+k3oN BAcZEl1Tl3hz4WxF8Ywc+yJ8n8l9AVa3KcYRXVbyVjTGg+JjU94= =jlQf -----END PGP SIGNATURE----- Merge tag 'for-5.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes that arrived before the end of the year: - a bunch of fixes related to transaction handle lifetime wrt various operations (umount, remount, qgroup scan, orphan cleanup) - async discard scheduling fixes - fix item size calculation when item keys collide for extend refs (hardlinks) - fix qgroup flushing from running transaction - fix send, wrong file path when there is an inode with a pending rmdir - fix deadlock when cloning inline extent and low on free metadata space" * tag 'for-5.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: run delayed iputs when remounting RO to avoid leaking them btrfs: add assertion for empty list of transactions at late stage of umount btrfs: fix race between RO remount and the cleaner task btrfs: fix transaction leak and crash after cleaning up orphans on RO mount btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan btrfs: merge critical sections of discard lock in workfn btrfs: fix racy access to discard_ctl data btrfs: fix async discard stall btrfs: tests: initialize test inodes location btrfs: send: fix wrong file path when there is an inode with a pending rmdir btrfs: qgroup: don't try to wait flushing if we're already holding a transaction btrfs: correctly calculate item size used when item key collision happens btrfs: fix deadlock when cloning inline extent and low on free metadata space |
||
|
a8cc263eb5 |
btrfs: run delayed iputs when remounting RO to avoid leaking them
When remounting RO, after setting the superblock with the RO flag, the cleaner task will start sleeping and do nothing, since the call to btrfs_need_cleaner_sleep() keeps returning 'true'. However, when the cleaner task goes to sleep, the list of delayed iputs may not be empty. As long as we are in RO mode, the cleaner task will keep sleeping and never run the delayed iputs. This means that if a filesystem unmount is started, we get into close_ctree() with a non-empty list of delayed iputs, and because the filesystem is in RO mode and is not in an error state (or a transaction aborted), btrfs_error_commit_super() and btrfs_commit_super(), which run the delayed iputs, are never called, and later we fail the assertion that checks if the delayed iputs list is empty: assertion failed: list_empty(&fs_info->delayed_iputs), in fs/btrfs/disk-io.c:4049 ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3153! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 1 PID: 3780621 Comm: umount Tainted: G L 5.6.0-rc2-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 RIP: 0010:assertfail.constprop.0+0x18/0x26 [btrfs] Code: 8b 7b 58 48 85 ff 74 (...) RSP: 0018:ffffb748c89bbdf8 EFLAGS: 00010246 RAX: 0000000000000051 RBX: ffff9608f2584000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff91998988 RDI: 00000000ffffffff RBP: ffff9608f25870d8 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffc0cbc500 R13: ffffffff92411750 R14: 0000000000000000 R15: ffff9608f2aab250 FS: 00007fcbfaa66c80(0000) GS:ffff960936c80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffc2c2dd38 CR3: 0000000235e54002 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: close_ctree+0x1a2/0x2e6 [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x93/0xc0 exit_to_usermode_loop+0xf9/0x100 do_syscall_64+0x20d/0x260 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fcbfaca6307 Code: eb 0b 00 f7 d8 64 89 (...) RSP: 002b:00007fffc2c2ed68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 0000558203b559b0 RCX: 00007fcbfaca6307 RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000558203b55bc0 RBP: 0000000000000000 R08: 0000000000000001 R09: 00007fffc2c2dad0 R10: 0000558203b55bf0 R11: 0000000000000246 R12: 0000558203b55bc0 R13: 00007fcbfadcc204 R14: 0000558203b55aa8 R15: 0000000000000000 Modules linked in: btrfs dm_flakey dm_log_writes (...) ---[ end trace d44d303790049ef6 ]--- So fix this by making the remount RO path run any remaining delayed iputs after waiting for the cleaner to become inactive. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
0a31daa4b6 |
btrfs: add assertion for empty list of transactions at late stage of umount
Add an assertion to close_ctree(), after destroying all the work queues, to verify we do not have any transaction still open or committing at that at that point. If we have any, it means something is seriously wrong and that can cause memory leaks and use-after-free problems. This is motivated by the previous patches that fixed bugs where we ended up leaking an open transaction after unmounting the filesystem. Tested-by: Fabian Vogt <fvogt@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
a0a1db70df |
btrfs: fix race between RO remount and the cleaner task
When we are remounting a filesystem in RO mode we can race with the cleaner task and result in leaking a transaction if the filesystem is unmounted shortly after, before the transaction kthread had a chance to commit that transaction. That also results in a crash during unmount, due to a use-after-free, if hardware acceleration is not available for crc32c. The following sequence of steps explains how the race happens. 1) The filesystem is mounted in RW mode and the cleaner task is running. This means that currently BTRFS_FS_CLEANER_RUNNING is set at fs_info->flags; 2) The cleaner task is currently running delayed iputs for example; 3) A filesystem RO remount operation starts; 4) The RO remount task calls btrfs_commit_super(), which commits any currently open transaction, and it finishes; 5) At this point the cleaner task is still running and it creates a new transaction by doing one of the following things: * When running the delayed iput() for an inode with a 0 link count, in which case at btrfs_evict_inode() we start a transaction through the call to evict_refill_and_join(), use it and then release its handle through btrfs_end_transaction(); * When deleting a dead root through btrfs_clean_one_deleted_snapshot(), a transaction is started at btrfs_drop_snapshot() and then its handle is released through a call to btrfs_end_transaction_throttle(); * When the remount task was still running, and before the remount task called btrfs_delete_unused_bgs(), the cleaner task also called btrfs_delete_unused_bgs() and it picked and removed one block group from the list of unused block groups. Before the cleaner task started a transaction, through btrfs_start_trans_remove_block_group() at btrfs_delete_unused_bgs(), the remount task had already called btrfs_commit_super(); 6) So at this point the filesystem is in RO mode and we have an open transaction that was started by the cleaner task; 7) Shortly after a filesystem unmount operation starts. At close_ctree() we stop the transaction kthread before it had a chance to commit the transaction, since less than 30 seconds (the default commit interval) have elapsed since the last transaction was committed; 8) We end up calling iput() against the btree inode at close_ctree() while there is an open transaction, and since that transaction was used to update btrees by the cleaner, we have dirty pages in the btree inode due to COW operations on metadata extents, and therefore writeback is triggered for the btree inode. So btree_write_cache_pages() is invoked to flush those dirty pages during the final iput() on the btree inode. This results in creating a bio and submitting it, which makes us end up at btrfs_submit_metadata_bio(); 9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch that calls btrfs_wq_submit_bio(), because check_async_write() returned a value of 1. This value of 1 is because we did not have hardware acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not set in fs_info->flags; 10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the workqueue at fs_info->workers, which was already freed before by the call to btrfs_stop_all_workers() at close_ctree(). This results in an invalid memory access due to a use-after-free, leading to a crash. When this happens, before the crash there are several warnings triggered, since we have reserved metadata space in a block group, the delayed refs reservation, etc: ------------[ cut here ]------------ WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs] Code: f0 01 00 00 48 39 c2 75 (...) RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206 RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8 RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800 RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110 R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_free_block_groups+0x17f/0x2f0 [btrfs] close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 01 48 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c6 ]--- ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs] Code: 48 83 bb b0 03 00 00 00 (...) RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206 RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110 R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_free_block_groups+0x24c/0x2f0 [btrfs] close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 01 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c7 ]--- ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs] Code: ad de 49 be 22 01 00 (...) RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206 RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246 RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00 R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c8 ]--- BTRFS info (device sdc): space_info 4 has 268238848 free, is not full BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536 BTRFS info (device sdc): global_block_rsv: size 0 reserved 0 BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0 BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0 BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0 BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0 And the crash, which only happens when we do not have crc32c hardware acceleration, produces the following trace immediately after those warnings: stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs] Code: 54 55 53 48 89 f3 (...) RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282 RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0 RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8 R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000 FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_wq_submit_bio+0xb3/0xd0 [btrfs] btrfs_submit_metadata_bio+0x44/0xc0 [btrfs] submit_one_bio+0x61/0x70 [btrfs] btree_write_cache_pages+0x414/0x450 [btrfs] ? kobject_put+0x9a/0x1d0 ? trace_hardirqs_on+0x1b/0xf0 ? _raw_spin_unlock_irqrestore+0x3c/0x60 ? free_debug_processing+0x1e1/0x2b0 do_writepages+0x43/0xe0 ? lock_acquired+0x199/0x490 __writeback_single_inode+0x59/0x650 writeback_single_inode+0xaf/0x120 write_inode_now+0x94/0xd0 iput+0x187/0x2b0 close_ctree+0x2c6/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f3cfebabee7 Code: ff 0b 00 f7 d8 64 89 01 (...) RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000 RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0 R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000 R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60 Modules linked in: btrfs dm_snapshot dm_thin_pool (...) ---[ end trace dd74718fef1ed5cc ]--- Finally when we remove the btrfs module (rmmod btrfs), there are several warnings about objects that were allocated from our slabs but were never freed, consequence of the transaction that was never committed and got leaked: ============================================================================= BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? lock_release+0x20e/0x4c0 kmem_cache_destroy+0x55/0x120 btrfs_delayed_ref_exit+0x11/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x0000000050cbdd61 @offset=12104 INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] btrfs_free_tree_block+0x128/0x360 [btrfs] __btrfs_cow_block+0x489/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] sync_filesystem+0x74/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 INFO: Object 0x0000000086e9b0ff @offset=12776 INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] btrfs_alloc_tree_block+0x2bf/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs] commit_cowonly_roots+0x248/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 btrfs_delayed_ref_exit+0x11/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 0b (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 ============================================================================= BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200 CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? lock_release+0x20e/0x4c0 kmem_cache_destroy+0x55/0x120 btrfs_delayed_ref_exit+0x1d/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x000000001a340018 @offset=4408 INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] btrfs_free_tree_block+0x128/0x360 [btrfs] __btrfs_cow_block+0x489/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] btrfs_commit_transaction+0x60/0xc40 [btrfs] create_subvol+0x56a/0x990 [btrfs] btrfs_mksubvol+0x3fb/0x4a0 [btrfs] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs] btrfs_ioctl_snap_create+0x58/0x80 [btrfs] btrfs_ioctl+0x1a92/0x36f0 [btrfs] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 INFO: Object 0x000000002b46292a @offset=13648 INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] btrfs_alloc_tree_block+0x2bf/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 btrfs_delayed_ref_exit+0x1d/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 ============================================================================= BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? __mutex_unlock_slowpath+0x45/0x2a0 kmem_cache_destroy+0x55/0x120 exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x000000004cf95ea8 @offset=6264 INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1 So fix this by making the remount path to wait for the cleaner task before calling btrfs_commit_super(). The remount path now waits for the bit BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling btrfs_commit_super() and this ensures the cleaner can not start a transaction after that, because it sleeps when the filesystem is in RO mode and we have already flagged the filesystem as RO before waiting for BTRFS_FS_CLEANER_RUNNING to be cleared. This also introduces a new flag BTRFS_FS_STATE_RO to be used for fs_info->fs_state when the filesystem is in RO mode. This is because we were doing the RO check using the flags of the superblock and setting the RO mode simply by ORing into the superblock's flags - those operations are not atomic and could result in the cleaner not seeing the update from the remount task after it clears BTRFS_FS_CLEANER_RUNNING. Tested-by: Fabian Vogt <fvogt@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
638331fa56 |
btrfs: fix transaction leak and crash after cleaning up orphans on RO mount
When we delete a root (subvolume or snapshot), at the very end of the operation, we attempt to remove the root's orphan item from the root tree, at btrfs_drop_snapshot(), by calling btrfs_del_orphan_item(). We ignore any error from btrfs_del_orphan_item() since it is not a serious problem and the next time the filesystem is mounted we remove such stray orphan items at btrfs_find_orphan_roots(). However if the filesystem is mounted RO and we have stray orphan items for any previously deleted root, we can end up leaking a transaction and other data structures when unmounting the filesystem, as well as crashing if we do not have hardware acceleration for crc32c available. The steps that lead to the transaction leak are the following: 1) The filesystem is mounted in RW mode; 2) A subvolume is deleted; 3) When the cleaner kthread runs btrfs_drop_snapshot() to delete the root, it gets a failure at btrfs_del_orphan_item(), which is ignored, due to an ENOMEM when allocating a path for example. So the orphan item for the root remains in the root tree; 4) The filesystem is unmounted; 5) The filesystem is mounted RO (-o ro). During the mount path we call btrfs_find_orphan_roots(), which iterates the root tree searching for orphan items. It finds the orphan item for our deleted root, and since it can not find the root, it starts a transaction to delete the orphan item (by calling btrfs_del_orphan_item()); 6) The RO mount completes; 7) Before the transaction kthread commits the transaction created for deleting the orphan item (i.e. less than 30 seconds elapsed since the mount, the default commit interval), a filesystem unmount operation is started; 8) At close_ctree(), we stop the transaction kthread, but we still have a transaction open with at least one dirty extent buffer, a leaf for the tree root which was COWed when deleting the orphan item; 9) We then proceed to destroy the work queues, free the roots and block groups, etc. After that we drop the last reference on the btree inode by calling iput() on it. Since there are dirty pages for the btree inode, corresponding to the COWed extent buffer, btree_write_cache_pages() is invoked to flush those dirty pages. This results in creating a bio and submitting it, which makes us end up at btrfs_submit_metadata_bio(); 10) At btrfs_submit_metadata_bio() we end up at the if-then-else branch that calls btrfs_wq_submit_bio(), because check_async_write() returned a value of 1. This value of 1 is because we did not have hardware acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not set in fs_info->flags; 11) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the workqueue at fs_info->workers, which was already freed before by the call to btrfs_stop_all_workers() at close_ctree(). This results in an invalid memory access due to a use-after-free, leading to a crash. When this happens, before the crash there are several warnings triggered, since we have reserved metadata space in a block group, the delayed refs reservation, etc: ------------[ cut here ]------------ WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs] Code: f0 01 00 00 48 39 c2 75 (...) RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206 RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8 RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800 RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110 R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_free_block_groups+0x17f/0x2f0 [btrfs] close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 01 48 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c6 ]--- ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs] Code: 48 83 bb b0 03 00 00 00 (...) RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206 RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110 R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_free_block_groups+0x24c/0x2f0 [btrfs] close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 01 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c7 ]--- ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs] Code: ad de 49 be 22 01 00 (...) RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206 RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246 RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00 R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c8 ]--- BTRFS info (device sdc): space_info 4 has 268238848 free, is not full BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536 BTRFS info (device sdc): global_block_rsv: size 0 reserved 0 BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0 BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0 BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0 BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0 And the crash, which only happens when we do not have crc32c hardware acceleration, produces the following trace immediately after those warnings: stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs] Code: 54 55 53 48 89 f3 (...) RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282 RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0 RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8 R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000 FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_wq_submit_bio+0xb3/0xd0 [btrfs] btrfs_submit_metadata_bio+0x44/0xc0 [btrfs] submit_one_bio+0x61/0x70 [btrfs] btree_write_cache_pages+0x414/0x450 [btrfs] ? kobject_put+0x9a/0x1d0 ? trace_hardirqs_on+0x1b/0xf0 ? _raw_spin_unlock_irqrestore+0x3c/0x60 ? free_debug_processing+0x1e1/0x2b0 do_writepages+0x43/0xe0 ? lock_acquired+0x199/0x490 __writeback_single_inode+0x59/0x650 writeback_single_inode+0xaf/0x120 write_inode_now+0x94/0xd0 iput+0x187/0x2b0 close_ctree+0x2c6/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f3cfebabee7 Code: ff 0b 00 f7 d8 64 89 01 (...) RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000 RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0 R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000 R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60 Modules linked in: btrfs dm_snapshot dm_thin_pool (...) ---[ end trace dd74718fef1ed5cc ]--- Finally when we remove the btrfs module (rmmod btrfs), there are several warnings about objects that were allocated from our slabs but were never freed, consequence of the transaction that was never committed and got leaked: ============================================================================= BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? lock_release+0x20e/0x4c0 kmem_cache_destroy+0x55/0x120 btrfs_delayed_ref_exit+0x11/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x0000000050cbdd61 @offset=12104 INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] btrfs_free_tree_block+0x128/0x360 [btrfs] __btrfs_cow_block+0x489/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] sync_filesystem+0x74/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 INFO: Object 0x0000000086e9b0ff @offset=12776 INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] btrfs_alloc_tree_block+0x2bf/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs] commit_cowonly_roots+0x248/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 btrfs_delayed_ref_exit+0x11/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 0b (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 ============================================================================= BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200 CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? lock_release+0x20e/0x4c0 kmem_cache_destroy+0x55/0x120 btrfs_delayed_ref_exit+0x1d/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x000000001a340018 @offset=4408 INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] btrfs_free_tree_block+0x128/0x360 [btrfs] __btrfs_cow_block+0x489/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] btrfs_commit_transaction+0x60/0xc40 [btrfs] create_subvol+0x56a/0x990 [btrfs] btrfs_mksubvol+0x3fb/0x4a0 [btrfs] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs] btrfs_ioctl_snap_create+0x58/0x80 [btrfs] btrfs_ioctl+0x1a92/0x36f0 [btrfs] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 INFO: Object 0x000000002b46292a @offset=13648 INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] btrfs_alloc_tree_block+0x2bf/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 btrfs_delayed_ref_exit+0x1d/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 ============================================================================= BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? __mutex_unlock_slowpath+0x45/0x2a0 kmem_cache_destroy+0x55/0x120 exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x000000004cf95ea8 @offset=6264 INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1 So fix this by calling btrfs_find_orphan_roots() in the mount path only if we are mounting the filesystem in RW mode. It's pointless to have it called for RO mounts anyway, since despite adding any deleted roots to the list of dead roots, we will never have the roots deleted until the filesystem is remounted in RW mode, as the cleaner kthread does nothing when we are mounted in RO - btrfs_need_cleaner_sleep() always returns true and the cleaner spends all time sleeping, never cleaning dead roots. This is accomplished by moving the call to btrfs_find_orphan_roots() from open_ctree() to btrfs_start_pre_rw_mount(), which also guarantees that if later the filesystem is remounted RW, we populate the list of dead roots and have the cleaner task delete the dead roots. Tested-by: Fabian Vogt <fvogt@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
cb13eea3b4 |
btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan
If we remount a filesystem in RO mode while the qgroup rescan worker is running, we can end up having it still running after the remount is done, and at unmount time we may end up with an open transaction that ends up never getting committed. If that happens we end up with several memory leaks and can crash when hardware acceleration is unavailable for crc32c. Possibly it can lead to other nasty surprises too, due to use-after-free issues. The following steps explain how the problem happens. 1) We have a filesystem mounted in RW mode and the qgroup rescan worker is running; 2) We remount the filesystem in RO mode, and never stop/pause the rescan worker, so after the remount the rescan worker is still running. The important detail here is that the rescan task is still running after the remount operation committed any ongoing transaction through its call to btrfs_commit_super(); 3) The rescan is still running, and after the remount completed, the rescan worker started a transaction, after it finished iterating all leaves of the extent tree, to update the qgroup status item in the quotas tree. It does not commit the transaction, it only releases its handle on the transaction; 4) A filesystem unmount operation starts shortly after; 5) The unmount task, at close_ctree(), stops the transaction kthread, which had not had a chance to commit the open transaction since it was sleeping and the commit interval (default of 30 seconds) has not yet elapsed since the last time it committed a transaction; 6) So after stopping the transaction kthread we still have the transaction used to update the qgroup status item open. At close_ctree(), when the filesystem is in RO mode and no transaction abort happened (or the filesystem is in error mode), we do not expect to have any transaction open, so we do not call btrfs_commit_super(); 7) We then proceed to destroy the work queues, free the roots and block groups, etc. After that we drop the last reference on the btree inode by calling iput() on it. Since there are dirty pages for the btree inode, corresponding to the COWed extent buffer for the quotas btree, btree_write_cache_pages() is invoked to flush those dirty pages. This results in creating a bio and submitting it, which makes us end up at btrfs_submit_metadata_bio(); 8) At btrfs_submit_metadata_bio() we end up at the if-then-else branch that calls btrfs_wq_submit_bio(), because check_async_write() returned a value of 1. This value of 1 is because we did not have hardware acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not set in fs_info->flags; 9) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the workqueue at fs_info->workers, which was already freed before by the call to btrfs_stop_all_workers() at close_ctree(). This results in an invalid memory access due to a use-after-free, leading to a crash. When this happens, before the crash there are several warnings triggered, since we have reserved metadata space in a block group, the delayed refs reservation, etc: ------------[ cut here ]------------ WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs] Code: f0 01 00 00 48 39 c2 75 (...) RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206 RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8 RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800 RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110 R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_free_block_groups+0x17f/0x2f0 [btrfs] close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 01 48 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c6 ]--- ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs] Code: 48 83 bb b0 03 00 00 00 (...) RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206 RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110 R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_free_block_groups+0x24c/0x2f0 [btrfs] close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 01 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c7 ]--- ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs] Code: ad de 49 be 22 01 00 (...) RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206 RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246 RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00 R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c8 ]--- BTRFS info (device sdc): space_info 4 has 268238848 free, is not full BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536 BTRFS info (device sdc): global_block_rsv: size 0 reserved 0 BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0 BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0 BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0 BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0 And the crash, which only happens when we do not have crc32c hardware acceleration, produces the following trace immediately after those warnings: stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs] Code: 54 55 53 48 89 f3 (...) RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282 RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0 RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8 R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000 FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_wq_submit_bio+0xb3/0xd0 [btrfs] btrfs_submit_metadata_bio+0x44/0xc0 [btrfs] submit_one_bio+0x61/0x70 [btrfs] btree_write_cache_pages+0x414/0x450 [btrfs] ? kobject_put+0x9a/0x1d0 ? trace_hardirqs_on+0x1b/0xf0 ? _raw_spin_unlock_irqrestore+0x3c/0x60 ? free_debug_processing+0x1e1/0x2b0 do_writepages+0x43/0xe0 ? lock_acquired+0x199/0x490 __writeback_single_inode+0x59/0x650 writeback_single_inode+0xaf/0x120 write_inode_now+0x94/0xd0 iput+0x187/0x2b0 close_ctree+0x2c6/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f3cfebabee7 Code: ff 0b 00 f7 d8 64 89 01 (...) RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000 RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0 R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000 R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60 Modules linked in: btrfs dm_snapshot dm_thin_pool (...) ---[ end trace dd74718fef1ed5cc ]--- Finally when we remove the btrfs module (rmmod btrfs), there are several warnings about objects that were allocated from our slabs but were never freed, consequence of the transaction that was never committed and got leaked: ============================================================================= BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? lock_release+0x20e/0x4c0 kmem_cache_destroy+0x55/0x120 btrfs_delayed_ref_exit+0x11/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x0000000050cbdd61 @offset=12104 INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] btrfs_free_tree_block+0x128/0x360 [btrfs] __btrfs_cow_block+0x489/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] sync_filesystem+0x74/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 INFO: Object 0x0000000086e9b0ff @offset=12776 INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] btrfs_alloc_tree_block+0x2bf/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs] commit_cowonly_roots+0x248/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 btrfs_delayed_ref_exit+0x11/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 0b (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 ============================================================================= BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200 CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? lock_release+0x20e/0x4c0 kmem_cache_destroy+0x55/0x120 btrfs_delayed_ref_exit+0x1d/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x000000001a340018 @offset=4408 INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] btrfs_free_tree_block+0x128/0x360 [btrfs] __btrfs_cow_block+0x489/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] btrfs_commit_transaction+0x60/0xc40 [btrfs] create_subvol+0x56a/0x990 [btrfs] btrfs_mksubvol+0x3fb/0x4a0 [btrfs] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs] btrfs_ioctl_snap_create+0x58/0x80 [btrfs] btrfs_ioctl+0x1a92/0x36f0 [btrfs] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 INFO: Object 0x000000002b46292a @offset=13648 INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] btrfs_alloc_tree_block+0x2bf/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 btrfs_delayed_ref_exit+0x1d/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 ============================================================================= BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? __mutex_unlock_slowpath+0x45/0x2a0 kmem_cache_destroy+0x55/0x120 exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x000000004cf95ea8 @offset=6264 INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1 Fix this issue by having the remount path stop the qgroup rescan worker when we are remounting RO and teach the rescan worker to stop when a remount is in progress. If later a remount in RW mode happens, we are already resuming the qgroup rescan worker through the call to btrfs_qgroup_rescan_resume(), so we do not need to worry about that. Tested-by: Fabian Vogt <fvogt@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
8fc058597a |
btrfs: merge critical sections of discard lock in workfn
btrfs_discard_workfn() drops discard_ctl->lock just to take it again in a moment in btrfs_discard_schedule_work(). Avoid that and also reuse ktime. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
1ea2872fc6 |
btrfs: fix racy access to discard_ctl data
Because only one discard worker may be running at any given point, it could have been safe to modify ->prev_discard, etc. without synchronization, if not for @override flag in btrfs_discard_schedule_work() and delayed_work_pending() returning false while workfn is running. That may lead to torn reads of u64 for some architectures, but that's not a big problem as only slightly affects the discard rate. Suggested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
ea9ed87c73 |
btrfs: fix async discard stall
Might happen that bg->discard_eligible_time was changed without rescheduling, so btrfs_discard_workfn() wakes up earlier than that new time, peek_discard_list() returns NULL, and all work halts and goes to sleep without further rescheduling even there are block groups to discard. It happens pretty often, but not so visible from the userspace because after some time it usually will be kicked off anyway by someone else calling btrfs_discard_reschedule_work(). Fix it by continue rescheduling if block group discard lists are not empty. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
675a4fc8f3 |
btrfs: tests: initialize test inodes location
I noticed that sometimes the module failed to load because the self tests failed like this: BTRFS: selftest: fs/btrfs/tests/inode-tests.c:963 miscount, wanted 1, got 0 This turned out to be because sometimes the btrfs ino would be the btree inode number, and thus we'd skip calling the set extent delalloc bit helper, and thus not adjust ->outstanding_extents. Fix this by making sure we initialize test inodes with a valid inode number so that we don't get random failures during self tests. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
0b3f407e67 |
btrfs: send: fix wrong file path when there is an inode with a pending rmdir
When doing an incremental send, if we have a new inode that happens to have the same number that an old directory inode had in the base snapshot and that old directory has a pending rmdir operation, we end up computing a wrong path for the new inode, causing the receiver to fail. Example reproducer: $ cat test-send-rmdir.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV >/dev/null mount $DEV $MNT mkdir $MNT/dir touch $MNT/dir/file1 touch $MNT/dir/file2 touch $MNT/dir/file3 # Filesystem looks like: # # . (ino 256) # |----- dir/ (ino 257) # |----- file1 (ino 258) # |----- file2 (ino 259) # |----- file3 (ino 260) # btrfs subvolume snapshot -r $MNT $MNT/snap1 btrfs send -f /tmp/snap1.send $MNT/snap1 # Now remove our directory and all its files. rm -fr $MNT/dir # Unmount the filesystem and mount it again. This is to ensure that # the next inode that is created ends up with the same inode number # that our directory "dir" had, 257, which is the first free "objectid" # available after mounting again the filesystem. umount $MNT mount $DEV $MNT # Now create a new file (it could be a directory as well). touch $MNT/newfile # Filesystem now looks like: # # . (ino 256) # |----- newfile (ino 257) # btrfs subvolume snapshot -r $MNT $MNT/snap2 btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2 # Now unmount the filesystem, create a new one, mount it and try to apply # both send streams to recreate both snapshots. umount $DEV mkfs.btrfs -f $DEV >/dev/null mount $DEV $MNT btrfs receive -f /tmp/snap1.send $MNT btrfs receive -f /tmp/snap2.send $MNT umount $MNT When running the test, the receive operation for the incremental stream fails: $ ./test-send-rmdir.sh Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1' At subvol /mnt/sdi/snap1 Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2' At subvol /mnt/sdi/snap2 At subvol snap1 At snapshot snap2 ERROR: chown o257-9-0 failed: No such file or directory So fix this by tracking directories that have a pending rmdir by inode number and generation number, instead of only inode number. A test case for fstests follows soon. Reported-by: Massimo B. <massimo.b@gmx.net> Tested-by: Massimo B. <massimo.b@gmx.net> Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/ CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
ae5e070eac |
btrfs: qgroup: don't try to wait flushing if we're already holding a transaction
There is a chance of racing for qgroup flushing which may lead to deadlock: Thread A | Thread B (not holding trans handle) | (holding a trans handle) --------------------------------+-------------------------------- __btrfs_qgroup_reserve_meta() | __btrfs_qgroup_reserve_meta() |- try_flush_qgroup() | |- try_flush_qgroup() |- QGROUP_FLUSHING bit set | | | | |- test_and_set_bit() | | |- wait_event() |- btrfs_join_transaction() | |- btrfs_commit_transaction()| !!! DEAD LOCK !!! Since thread A wants to commit transaction, but thread B is holding a transaction handle, blocking the commit. At the same time, thread B is waiting for thread A to finish its commit. This is just a hot fix, and would lead to more EDQUOT when we're near the qgroup limit. The proper fix would be to make all metadata/data reservations happen without holding a transaction handle. CC: stable@vger.kernel.org # 5.9+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
9a66497156 |
btrfs: correctly calculate item size used when item key collision happens
Item key collision is allowed for some item types, like dir item and
inode refs, but the overall item size is limited by the nodesize.
item size(ins_len) passed from btrfs_insert_empty_items to
btrfs_search_slot already contains size of btrfs_item.
When btrfs_search_slot reaches leaf, we'll see if we need to split leaf.
The check incorrectly reports that split leaf is required, because
it treats the space required by the newly inserted item as
btrfs_item + item data. But in item key collision case, only item data
is actually needed, the newly inserted item could merge into the existing
one. No new btrfs_item will be inserted.
And split_leaf return EOVERFLOW from following code:
if (extend && data_size + btrfs_item_size_nr(l, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info))
return -EOVERFLOW;
In most cases, when callers receive EOVERFLOW, they either return
this error or handle in different ways. For example, in normal dir item
creation the userspace will get errno EOVERFLOW; in inode ref case
INODE_EXTREF is used instead.
However, this is not the case for rename. To avoid the unrecoverable
situation in rename, btrfs_check_dir_item_collision is called in
early phase of rename. In this function, when item key collision is
detected leaf space is checked:
data_size = sizeof(*di) + name_len;
if (data_size + btrfs_item_size_nr(leaf, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info))
the sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here
refers to existing item size, the condition here correctly calculates
the needed size for collision case rather than the wrong case above.
The consequence of inconsistent condition check between
btrfs_check_dir_item_collision and btrfs_search_slot when item key
collision happens is that we might pass check here but fail
later at btrfs_search_slot. Rename fails and volume is forced readonly
[436149.586170] ------------[ cut here ]------------
[436149.586173] BTRFS: Transaction aborted (error -75)
[436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G D 4.18.0-rc5+ #1
[436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286
[436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006
[436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0
[436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472
[436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08
[436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe
[436149.586260] FS: 00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000
[436149.586261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0
[436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[436149.586296] Call Trace:
[436149.586311] vfs_rename+0x383/0x920
[436149.586313] ? vfs_rename+0x383/0x920
[436149.586315] do_renameat2+0x4ca/0x590
[436149.586317] __x64_sys_rename+0x20/0x30
[436149.586324] do_syscall_64+0x5a/0x120
[436149.586330] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[436149.586332] RIP: 0033:0x7fa3133b1d37
[436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37
[436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0
[436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0
[436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00
Thanks to Hans van Kranenburg for information about crc32 hash collision
tools, I was able to reproduce the dir item collision with following
python script.
https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py Run
it under a btrfs volume will trigger the abort transaction. It simply
creates files and rename them to forged names that leads to
hash collision.
There are two ways to fix this. One is to simply revert the patch
|
||
|
3d45f221ce |
btrfs: fix deadlock when cloning inline extent and low on free metadata space
When cloning an inline extent there are cases where we can not just copy
the inline extent from the source range to the target range (e.g. when the
target range starts at an offset greater than zero). In such cases we copy
the inline extent's data into a page of the destination inode and then
dirty that page. However, after that we will need to start a transaction
for each processed extent and, if we are ever low on available metadata
space, we may need to flush existing delalloc for all dirty inodes in an
attempt to release metadata space - if that happens we may deadlock:
* the async reclaim task queued a delalloc work to flush delalloc for
the destination inode of the clone operation;
* the task executing that delalloc work gets blocked waiting for the
range with the dirty page to be unlocked, which is currently locked
by the task doing the clone operation;
* the async reclaim task blocks waiting for the delalloc work to complete;
* the cloning task is waiting on the waitqueue of its reservation ticket
while holding the range with the dirty page locked in the inode's
io_tree;
* if metadata space is not released by some other task (like delalloc for
some other inode completing for example), the clone task waits forever
and as a consequence the delalloc work and async reclaim tasks will hang
forever as well. Releasing more space on the other hand may require
starting a transaction, which will hang as well when trying to reserve
metadata space, resulting in a deadlock between all these tasks.
When this happens, traces like the following show up in dmesg/syslog:
[87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
[87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
[87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[87452.326136] Call Trace:
[87452.326737] __schedule+0x5d1/0xcf0
[87452.327390] schedule+0x45/0xe0
[87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs]
[87452.328894] ? finish_wait+0x90/0x90
[87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs]
[87452.330133] ? __mod_memcg_state+0x8e/0x160
[87452.330738] __extent_writepage+0x2d4/0x400 [btrfs]
[87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs]
[87452.332007] ? lock_release+0x20e/0x4c0
[87452.332557] ? trace_hardirqs_on+0x1b/0xf0
[87452.333127] extent_writepages+0x43/0x90 [btrfs]
[87452.333653] ? lock_acquire+0x1a3/0x490
[87452.334177] do_writepages+0x43/0xe0
[87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100
[87452.335720] __filemap_fdatawrite_range+0xc5/0x100
[87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
[87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs]
[87452.337838] process_one_work+0x24e/0x5e0
[87452.338437] worker_thread+0x50/0x3b0
[87452.339137] ? process_one_work+0x5e0/0x5e0
[87452.339884] kthread+0x153/0x170
[87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.341153] ret_from_fork+0x22/0x30
[87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
[87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
[87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[87452.345655] Call Trace:
[87452.346305] __schedule+0x5d1/0xcf0
[87452.346947] ? kvm_clock_read+0x14/0x30
[87452.347676] ? wait_for_completion+0x81/0x110
[87452.348389] schedule+0x45/0xe0
[87452.349077] schedule_timeout+0x30c/0x580
[87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[87452.350340] ? lock_acquire+0x1a3/0x490
[87452.351006] ? try_to_wake_up+0x7a/0xa20
[87452.351541] ? lock_release+0x20e/0x4c0
[87452.352040] ? lock_acquired+0x199/0x490
[87452.352517] ? wait_for_completion+0x81/0x110
[87452.353000] wait_for_completion+0xab/0x110
[87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs]
[87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
[87452.354455] flush_space+0x24f/0x660 [btrfs]
[87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
[87452.355565] process_one_work+0x24e/0x5e0
[87452.356024] worker_thread+0x20f/0x3b0
[87452.356487] ? process_one_work+0x5e0/0x5e0
[87452.356973] kthread+0x153/0x170
[87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.357880] ret_from_fork+0x22/0x30
(...)
< stack traces of several tasks waiting for the locks of the inodes of the
clone operation >
(...)
[92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
[92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
[92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
[92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
[92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
[92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000
[92867.447920] Call Trace:
[92867.448435] __schedule+0x5d1/0xcf0
[92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[92867.449423] schedule+0x45/0xe0
[92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs]
[92867.450576] ? finish_wait+0x90/0x90
[92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
[92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs]
[92867.452412] start_transaction+0x2d1/0x760 [btrfs]
[92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs]
[92867.453848] ? lock_release+0x20e/0x4c0
[92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
[92867.455218] btrfs_clone+0x569/0x7e0 [btrfs]
[92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs]
[92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs]
[92867.457213] do_clone_file_range+0xd4/0x1f0
[92867.457828] vfs_clone_file_range+0x4d/0x230
[92867.458355] ? lock_release+0x20e/0x4c0
[92867.458890] ioctl_file_clone+0x8f/0xc0
[92867.459377] do_vfs_ioctl+0x342/0x750
[92867.459913] __x64_sys_ioctl+0x62/0xb0
[92867.460377] do_syscall_64+0x33/0x80
[92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9
(...)
< stack traces of more tasks blocked on metadata reservation like the clone
task above, because the async reclaim task has deadlocked >
(...)
Another thing to notice is that the worker task that is deadlocked when
trying to flush the destination inode of the clone operation is at
btrfs_invalidatepage(). This is simply because the clone operation has a
destination offset greater than the i_size and we only update the i_size
of the destination file after cloning an extent (just like we do in the
buffered write path).
Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
the flushing of delalloc for all inodes that have delalloc, add a runtime
flag to an inode to signal it should not be flushed, and for inodes with
that flag set, start_delalloc_inodes() will simply skip them. When the
cloning code needs to dirty a page to copy an inline extent, set that flag
on the inode and then clear it when the clone operation finishes.
This could be sporadically triggered with test case generic/269 from
fstests, which exercises many fsstress processes running in parallel with
several dd processes filling up the entire filesystem.
CC: stable@vger.kernel.org # 5.9+
Fixes:
|
||
|
ac7ac4618c |
for-5.11/block-2020-12-14
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/Xec8QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpoLbEACzXypgZWwMdfgRckA/Vt333rXHtbhUV+hK 2XP+P81iRvr9Esi31UPbRp82vrgcDO0cpI1QmQojS5U5TIQP88BfXptfRZZu48eb wT5RDDNQ34HItqAh/yEuYsv9yUKcxeIrB99tBVvM+4UmQg9zTdIW3mg6PvCBdbhV N38jI0tCF/PJatjfRuphT/nXonQLPWBlVDmZk06KZQFOwQe9ep1vUi1+nbiRPuo3 geFBpTh1Kp6Vl1B3n4RpECs6Y7I0RRuJdaH2sDizICla1/BW91F9fQwHimNnUxUq e1Q1kMuh6ftcQGkYlHSYcPhuv6CvorldTZCO5arPxWpcwvxriTSMRPWAgUr5pEiF fhiGhqeDu9e6vl9vS31wUD1B30hy+jFz9wyjRrDwJ3cPHH1JVBjTzvdX+cIh/1ku IbIwUMteUtvUrzqAv/DzbGhedp7xWtOFaVo8j0QFYh9zkjd6b8yDOF/yztwX2gjY Xt1cd+KpDSiN449ZRaoMI0sCJAxqzhMa6nsWlb0L7KuNyWKAbvKQBm9Rb47FLV9A Vx70KC+zkFoyw23capvIahmQazerriUJ5PGe0lVm6ROgmIFdCpXTPDjnrvq/6RZ/ GEpD7gTW9atGJ7EuEE8686sAfKD5kneChWLX5EHXf0d0AG5Mr2lKsluiGp5LpPJg Q1Xqs6xwww== =zo4w -----END PGP SIGNATURE----- Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: "Another series of killing more code than what is being added, again thanks to Christoph's relentless cleanups and tech debt tackling. This contains: - blk-iocost improvements (Baolin Wang) - part0 iostat fix (Jeffle Xu) - Disable iopoll for split bios (Jeffle Xu) - block tracepoint cleanups (Christoph Hellwig) - Merging of struct block_device and hd_struct (Christoph Hellwig) - Rework/cleanup of how block device sizes are updated (Christoph Hellwig) - Simplification of gendisk lookup and removal of block device aliasing (Christoph Hellwig) - Block device ioctl cleanups (Christoph Hellwig) - Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig) - Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig) - sbitmap improvements (Pavel Begunkov) - Hybrid polling fix (Pavel Begunkov) - bvec iteration improvements (Pavel Begunkov) - Zone revalidation fixes (Damien Le Moal) - blk-throttle limit fix (Yu Kuai) - Various little fixes" * tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits) blk-mq: fix msec comment from micro to milli seconds blk-mq: update arg in comment of blk_mq_map_queue blk-mq: add helper allocating tagset->tags Revert "block: Fix a lockdep complaint triggered by request queue flushing" nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class blk-mq: add new API of blk_mq_hctx_set_fq_lock_class block: disable iopoll for split bio block: Improve blk_revalidate_disk_zones() checks sbitmap: simplify wrap check sbitmap: replace CAS with atomic and sbitmap: remove swap_lock sbitmap: optimise sbitmap_deferred_clear() blk-mq: skip hybrid polling if iopoll doesn't spin blk-iocost: Factor out the base vrate change into a separate function blk-iocost: Factor out the active iocgs' state check into a separate function blk-iocost: Move the usage ratio calculation to the correct place blk-iocost: Remove unnecessary advance declaration blk-iocost: Fix some typos in comments blktrace: fix up a kerneldoc comment block: remove the request_queue to argument request based tracepoints ... |
||
|
f1ee3b8829 |
for-5.11-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl/XdB4ACgkQxWXV+ddt WDv41g//dOkrwjAVBfDUwRT/yKqojyEsZB1aNyHlPHFw8KEw5oIW7wxR4oqXi2ed /i9KIJe4E9AfqAiexhLvA+Wyt/Sgwz+k4ys82PKhhRNQn7LE4tvhSBUu6JYJDU09 6I1jagya7ILa8akFXZTmVbXdliI4Ab+pcXWAmQYK/xPVDxYTSsBf4o4MilNBA9FS lTwwBh5GTEtIkubr2yVd3pKfF4fT2g1hd+yglpHaOzpcrLMNN4hj4sUFlLbx/FlJ MWo+914cSNKJoebbnqhK9djD9hggaaXnNooqfBOXUhZN0VN9rQoKb5tW+TREQmFm shrmBSqN7CaqKfSOMZs7WOnTuTvmV/825PnLqDqcTUaLw+BgdyacpO9WflgfSs16 Cdvagr1SqbrSQ/3WYCpbqPLDNP3XuZ6+m5OWizf6fhyo8xdFcUHZgRC8qejDlycy V/zP0c5OYOMi5vo6x/zhrD7Uft7xoFUVcSJCe8WPri082d9LbA2BqwCsullD60PQ K/fsmlHs5Uxxy3MFgBPVDdWGgaa9rQ2vXequezbozBIIeeVL+Q9zkeyBFSYuFeE8 HToRE9B9BUEUh+p1JxPjOdFH/m+sKe1WMdmRLQthMzfOiNWW7pp/nL5rl4BUVmjm 58dQS73Cj/YNdBomRJXPPtgKIJPAWRrzU/JBcwAdMoKy57oh9NQ= =5YAS -----END PGP SIGNATURE----- Merge tag 'for-5.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "We have a mix of all kinds of changes, feature updates, core stuff, performance improvements and lots of cleanups and preparatory changes. User visible: - export filesystem generation in sysfs - new features for mount option 'rescue': - what's currently supported is exported in sysfs - 'ignorebadroots'/'ibadroots' - continue even if some essential tree roots are not usable (extent, uuid, data reloc, device, csum, free space) - 'ignoredatacsums'/'idatacsums' - skip checksum verification on data - 'all' - now enables 'ignorebadroots' + 'ignoredatacsums' + 'nologreplay' - export read mirror policy settings to sysfs, new policies will be added in the future - remove inode number cache feature (mount -o inode_cache), obsoleted in 5.9 User visible fixes: - async discard scheduling fixes on high loads - update inode byte counter atomically so stat() does not report wrong value in some cases - free space tree fixes: - correctly report status of v2 after remount - clear v1 cache inodes when v2 is newly enabled after remount Core: - switch own tree lock implementation to standard rw semaphore: - one-level lock nesting is not required anymore, the last use of this was in free space that's now loaded asynchronously - own implementation of adaptive spinning before taking mutex has been part of rwsem - performance seems to be better in general, much better (+tens of percents) for some workloads - lockdep does not complain - finish direct IO conversion to iomap infrastructure, remove temporary workaround for DSYNC after iomap API updates - preparatory work to support data and metadata blocks smaller than page: - generalize code that assumes sectorsize == PAGE_SIZE, lots of refactoring - planned namely for 64K pages (eg. arm64, ppc64) - scrub read-only support - preparatory work for zoned allocation mode (SMR/ZBC/ZNS friendly): - disable incompatible features - round-robin superblock write - free space cache (v1) is loaded asynchronously, remove tree path recursion - slightly improved time tacking for transaction kthread wake ups Performance improvements (note that the numbers depend on load type or other features and weren't run on the same machine): - skip unnecessary work: - do not start readahead for csum tree when scrubbing non-data block groups - do not start and wait for delalloc on snapshot roots on transaction commit - fix race when defragmenting leads to unnecessary IO - dbench speedups (+throughput%/-max latency%): - skip unnecessary searches for xattrs when logging an inode (+10.8/-8.2) - stop incrementing log batch when joining log transaction (1-2) - unlock path before checking if extent is shared during nocow writeback (+5.0/-20.5), on fio load +9.7% throughput/-9.8% runtime - several tree log improvements, eg. removing unnecessary operations, fixing races that lead to additional work (+12.7/-8.2) - tree-checker error branches annotated with unlikely() (+3% throughput) Other: - cleanups - lockdep fixes - more btrfs_inode conversions - error variable cleanups" * tag 'for-5.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (198 commits) btrfs: scrub: allow scrub to work with subpage sectorsize btrfs: scrub: support subpage data scrub btrfs: scrub: support subpage tree block scrub btrfs: scrub: always allocate one full page for one sector for RAID56 btrfs: scrub: reduce width of extent_len/stripe_len from 64 to 32 bits btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs btrfs: remove btrfs_find_ordered_sum call from btrfs_lookup_bio_sums btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors btrfs: update num_extent_pages to support subpage sized extent buffer btrfs: don't allow tree block to cross page boundary for subpage support btrfs: calculate inline extent buffer page size based on page size btrfs: factor out btree page submission code to a helper btrfs: make btrfs_verify_data_csum follow sector size btrfs: pass bio_offset to check_data_csum() directly btrfs: rename bio_offset of extent_submit_bio_start_t to dio_file_offset btrfs: fix lockdep warning when creating free space tree btrfs: skip space_cache v1 setup when not using it btrfs: remove free space items when disabling space cache v1 btrfs: warn when remount will not change the free space tree btrfs: use superblock state to print space_cache mount option ... |
||
|
edd7ab7684 |
The new preemtible kmap_local() implementation:
- Consolidate all kmap_atomic() internals into a generic implementation which builds the base for the kmap_local() API and make the kmap_atomic() interface wrappers which handle the disabling/enabling of preemption and pagefaults. - Switch the storage from per-CPU to per task and provide scheduler support for clearing mapping when scheduling out and restoring them when scheduling back in. - Merge the migrate_disable/enable() code, which is also part of the scheduler pull request. This was required to make the kmap_local() interface available which does not disable preemption when a mapping is established. It has to disable migration instead to guarantee that the virtual address of the mapped slot is the same accross preemption. - Provide better debug facilities: guard pages and enforced utilization of the mapping mechanics on 64bit systems when the architecture allows it. - Provide the new kmap_local() API which can now be used to cleanup the kmap_atomic() usage sites all over the place. Most of the usage sites do not require the implicit disabling of preemption and pagefaults so the penalty on 64bit and 32bit non-highmem systems is removed and quite some of the code can be simplified. A wholesale conversion is not possible because some usage depends on the implicit side effects and some need to be cleaned up because they work around these side effects. The migrate disable side effect is only effective on highmem systems and when enforced debugging is enabled. On 64bit and 32bit non-highmem systems the overhead is completely avoided. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XyQwTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoUolD/9+R+BX96fGir+I8rG9dc3cbLw5meSi 0I/Nq3PToZMs2Iqv50DsoaPYHHz/M6fcAO9LRIgsE9jRbnY93GnsBM0wU9Y8yQaT 4wUzOG5WHaLDfqIkx/CN9coUl458oEiwOEbn79A2FmPXFzr7IpkufnV3ybGDwzwP p73bjMJMPPFrsa9ig87YiYfV/5IAZHi82PN8Cq1v4yNzgXRP3Tg6QoAuCO84ZnWF RYlrfKjcJ2xPdn+RuYyXolPtxr1hJQ0bOUpe4xu/UfeZjxZ7i1wtwLN9kWZe8CKH +x4Lz8HZZ5QMTQ9sCHOLtKzu2MceMcpISzoQH4/aFQCNMgLn1zLbS790XkYiQCuR ne9Cua+IqgYfGMG8cq8+bkU9HCNKaXqIBgPEKE/iHYVmqzCOqhW5Cogu4KFekf6V Wi7pyyUdX2en8BAWpk5NHc8de9cGcc+HXMq2NIcgXjVWvPaqRP6DeITERTZLJOmz XPxq5oPLGl7wdm7z+ICIaNApy8zuxpzb6sPLNcn7l5OeorViORlUu08AN8587wAj FiVjp6ZYomg+gyMkiNkDqFOGDH5TMENpOFoB0hNNEyJwwS0xh6CgWuwZcv+N8aPO HuS/P+tNANbD8ggT4UparXYce7YCtgOf3IG4GA3JJYvYmJ6pU+AZOWRoDScWq4o+ +jlfoJhMbtx5Gg== =n71I -----END PGP SIGNATURE----- Merge tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull kmap updates from Thomas Gleixner: "The new preemtible kmap_local() implementation: - Consolidate all kmap_atomic() internals into a generic implementation which builds the base for the kmap_local() API and make the kmap_atomic() interface wrappers which handle the disabling/enabling of preemption and pagefaults. - Switch the storage from per-CPU to per task and provide scheduler support for clearing mapping when scheduling out and restoring them when scheduling back in. - Merge the migrate_disable/enable() code, which is also part of the scheduler pull request. This was required to make the kmap_local() interface available which does not disable preemption when a mapping is established. It has to disable migration instead to guarantee that the virtual address of the mapped slot is the same across preemption. - Provide better debug facilities: guard pages and enforced utilization of the mapping mechanics on 64bit systems when the architecture allows it. - Provide the new kmap_local() API which can now be used to cleanup the kmap_atomic() usage sites all over the place. Most of the usage sites do not require the implicit disabling of preemption and pagefaults so the penalty on 64bit and 32bit non-highmem systems is removed and quite some of the code can be simplified. A wholesale conversion is not possible because some usage depends on the implicit side effects and some need to be cleaned up because they work around these side effects. The migrate disable side effect is only effective on highmem systems and when enforced debugging is enabled. On 64bit and 32bit non-highmem systems the overhead is completely avoided" * tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) ARM: highmem: Fix cache_is_vivt() reference x86/crashdump/32: Simplify copy_oldmem_page() io-mapping: Provide iomap_local variant mm/highmem: Provide kmap_local* sched: highmem: Store local kmaps in task struct x86: Support kmap_local() forced debugging mm/highmem: Provide CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP mm/highmem: Provide and use CONFIG_DEBUG_KMAP_LOCAL microblaze/mm/highmem: Add dropped #ifdef back xtensa/mm/highmem: Make generic kmap_atomic() work correctly mm/highmem: Take kmap_high_get() properly into account highmem: High implementation details and document API Documentation/io-mapping: Remove outdated blurb io-mapping: Cleanup atomic iomap mm/highmem: Remove the old kmap_atomic cruft highmem: Get rid of kmap_types.h xtensa/mm/highmem: Switch to generic kmap atomic sparc/mm/highmem: Switch to generic kmap atomic powerpc/mm/highmem: Switch to generic kmap atomic nds32/mm/highmem: Switch to generic kmap atomic ... |
||
|
b42fe98c92 |
btrfs: scrub: allow scrub to work with subpage sectorsize
Since btrfs scrub is utilizing its own infrastructure to submit read/write, scrub is independent from all other routines. This brings one very neat feature, allow us to read 4K data into offset 0 of a 64K page. So is the writeback routine. This makes scrub on subpage sector size much easier to implement, and thanks to previous commits which just changed the implementation to always do scrub based on sector size, now scrub can handle subpage filesystem without any problem. This patch will just remove the restriction on (sectorsize != PAGE_SIZE), to make scrub finally work on subpage filesystems. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
b29dca44ab |
btrfs: scrub: support subpage data scrub
Btrfs scrub is more flexible than buffered data write path, as we can read an unaligned subpage data into page offset 0. This ability makes subpage support much easier, we just need to check each scrub_page::page_len and ensure we only calculate hash for [0, page_len) of a page. There is a small thing to notice: for subpage case, we still do sector by sector scrub. This means we will submit a read bio for each sector to scrub, resulting in the same amount of read bios, just like on the 4K page systems. This behavior can be considered as a good thing, if we want everything to be the same as 4K page systems. But this also means, we're wasting the possibility to submit larger bio using 64K page size. This is another problem to consider in the future. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
53f3251d3b |
btrfs: scrub: support subpage tree block scrub
To support subpage tree block scrub, scrub_checksum_tree_block() only needs to learn 2 new tricks: - Follow sector size Now scrub_page only represents one sector, we need to follow it properly. - Run checksum on all sectors Since scrub_page only represents one sector, we need to run checksum on all sectors, not only (nodesize >> PAGE_SIZE). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
d0a7a9c050 |
btrfs: scrub: always allocate one full page for one sector for RAID56
For scrub_pages() and scrub_pages_for_parity(), we currently allocate one scrub_page structure for one page. This is fine if we only read/write one sector one time. But for cases like scrubbing RAID56, we need to read/write the full stripe, which is in 64K size for now. For subpage size, we will submit the read in just one page, which is normally a good thing, but for RAID56 case, it only expects to see one sector, not the full stripe in its endio function. This could lead to wrong parity checksum for RAID56 on subpage. To make the existing code work well for subpage case, here we take a shortcut by always allocating a full page for one sector. This should provide the base to make RAID56 work for subpage case. The cost is pretty obvious now, for one RAID56 stripe now we always need 16 pages. For support subpage situation (64K page size, 4K sector size), this means we need full one megabyte to scrub just one RAID56 stripe. And for data scrub, each 4K sector will also need one 64K page. This is mostly just a workaround, the proper fix for this is a much larger project, using scrub_block to replace scrub_page, and allow scrub_block to handle multi pages, csums, and csum_bitmap to avoid allocating one page for each sector. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
fa485d21a7 |
btrfs: scrub: reduce width of extent_len/stripe_len from 64 to 32 bits
Btrfs on-disk format chose to use u64 for almost everything, but there are a other restrictions that won't let us use more than u32 for things like extent length (the maximum length is 128MiB for non-hole extents), or stripe length (we have device number limit). This means if we don't have extra handling to convert u64 to u32, we will always have some questionable operations like "u32 = u64 >> sectorsize_bits" in the code. This patch will try to address the problem by reducing the width for the following members/parameters: - scrub_parity::stripe_len - @len of scrub_pages() - @extent_len of scrub_remap_extent() - @len of scrub_parity_mark_sectors_error() - @len of scrub_parity_mark_sectors_data() - @len of scrub_extent() - @len of scrub_pages_for_parity() - @len of scrub_extent_for_parity() For members extracted from on-disk structure, like map->stripe_len, they will be kept as is. Since that modification would require on-disk format change. There will be cases like "u32 = u64 - u64" or "u32 = u64", for such call sites, extra ASSERT() is added to be extra safe for debug builds. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
6275193ef1 |
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by: - Remove the @file_offset parameter There are two factors making the @file_offset parameter useless: * For csum lookup in csum tree, file offset makes no sense We only need disk_bytenr, which is unrelated to file_offset * page_offset (file offset) of each bvec is not contiguous. Pages can be added to the same bio as long as their on-disk bytenr is contiguous, meaning we could have pages at different file offsets in the same bio. Thus passing file_offset makes no sense any more. The only user of file_offset is for data reloc inode, we will use a new function, search_file_offset_in_bio(), to handle it. - Extract the csum tree lookup into search_csum_tree() The new function will handle the csum search in csum tree. The return value is the same as btrfs_find_ordered_sum(), returning the number of found sectors which have checksum. - Change how we do the main loop The only needed info from bio is: * the on-disk bytenr * the length After extracting the above info, we can do the search without bio at all, which makes the main loop much simpler: for (cur_disk_bytenr = orig_disk_bytenr; cur_disk_bytenr < orig_disk_bytenr + orig_len; cur_disk_bytenr += count * sectorsize) { /* Lookup csum tree */ count = search_csum_tree(fs_info, path, cur_disk_bytenr, search_len, csum_dst); if (!count) { /* Csum hole handling */ } } - Use single variable as the source to calculate all other offsets Instead of all different type of variables, we use only one main variable, cur_disk_bytenr, which represents the current disk bytenr. All involved values can be calculated from that variable, and all those variable will only be visible in the inner loop. The above refactoring makes btrfs_lookup_bio_sums() way more robust than it used to be, especially related to the file offset lookup. Now file_offset lookup is only related to data reloc inode, otherwise we don't need to bother file_offset at all. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
9e46458a7c |
btrfs: remove btrfs_find_ordered_sum call from btrfs_lookup_bio_sums
The function btrfs_lookup_bio_sums() is only called for read bios. While btrfs_find_ordered_sum() is to search ordered extent sums, which is only for write path. This means to read a page we either: - Submit read bio if it's not uptodate This means we only need to search csum tree for checksums. - The page is already uptodate It can be marked uptodate for previous read, or being marked dirty. As we always mark page uptodate for dirty page. In that case, we don't need to submit read bio at all, thus no need to search any checksums. Remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums(). And since btrfs_lookup_bio_sums() is the only caller for btrfs_find_ordered_sum(), also remove the implementation. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |