Sparse reports a warning at release_extent_buffer()
warning: context imbalance in release_extent_buffer() - unexpected unlock
The root cause is the missing annotation at release_extent_buffer()
Add the missing __releases(&eb->refs_lock) annotation
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In my EIO stress testing I noticed I was getting forced to rescan the
uuid tree pretty often, which was weird. This is because my error
injection stuff would sometimes inject an error after log replay but
before we loaded the UUID tree. If log replay committed the transaction
it wouldn't have updated the uuid tree generation, but the tree was
valid and didn't change, so there's no reason to not update the
generation here.
Fix this by setting the BTRFS_FS_UPDATE_UUID_TREE_GEN bit immediately
after reading all the fs roots if the uuid tree generation matches the
fs generation. Then any transaction commits that happen during mount
won't screw up our uuid tree state, forcing us to do needless uuid
rescans.
Fixes: 70f8017547 ("Btrfs: check UUID tree during mount if required")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In doing my fsstress+EIO stress testing I started running into issues
where umount would get stuck forever because the uuid checker was
chewing through the thousands of subvolumes I had created.
We shouldn't block umount on this, simply bail if we're unmounting the
fs. We need to make sure we don't mark the UUID tree as ok, so we only
set that bit if we made it through the whole rescan operation, but
otherwise this is completely safe.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's used only during filesystem mount as such it can be made private to
disk-io.c file. Also use the occasion to move btrfs_uuid_rescan_kthread
as btrfs_check_uuid_tree is its sole caller.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_uuid_tree_iterate is called from only once place and its 2nd
argument is always btrfs_check_uuid_tree_entry. Simplify
btrfs_uuid_tree_iterate's signature by removing its 2nd argument and
directly calling btrfs_check_uuid_tree_entry. Also move the latter into
uuid-tree.h. No functional changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are temporary variables tracking the index of P and Q stripes, but
none of them is really used as such, merely for determining if the Q
stripe is present. This leads to compiler warnings with
-Wunused-but-set-variable and has been reported several times.
fs/btrfs/raid56.c: In function ‘finish_rmw’:
fs/btrfs/raid56.c:1199:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
1199 | int p_stripe = -1;
| ^~~~~~~~
fs/btrfs/raid56.c: In function ‘finish_parity_scrub’:
fs/btrfs/raid56.c:2356:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
2356 | int p_stripe = -1;
| ^~~~~~~~
Replace the two variables with one that has a clear meaning and also get
rid of the warnings. The logic that verifies that there are only 2
valid cases is unchanged.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the following patches:
- btrfs: backref, only collect file extent items matching backref offset
- btrfs: backref, not adding refs from shared block when resolving normal backref
- btrfs: backref, only search backref entries from leaves of the same root
we only collect the normal data refs we want, so the imprecise upper
bound total_refs of that EXTENT_ITEM could now be changed to the count
of the normal backref entry we want to search.
Background and how the patches fit together:
Btrfs has two types of data backref.
For BTRFS_EXTENT_DATA_REF_KEY type of backref, we don't have the
exact block number. Therefore, we need to call resolve_indirect_refs.
It uses btrfs_search_slot to locate the leaf block. Then
we need to walk through the leaves to search for the EXTENT_DATA items
that have disk bytenr matching the extent item (add_all_parents).
When resolving indirect refs, we could take entries that don't
belong to the backref entry we are searching for right now.
For that reason when searching backref entry, we always use total
refs of that EXTENT_ITEM rather than individual count.
For example:
item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize
extent refs 24 gen 7302 flags DATA
shared data backref parent 394985472 count 10 #1
extent data backref root 257 objectid 260 offset 1048576 count 3 #2
extent data backref root 256 objectid 260 offset 65536 count 6 #3
extent data backref root 257 objectid 260 offset 65536 count 5 #4
For example, when searching backref entry #4, we'll use total_refs
24, a very loose loop ending condition, instead of total_refs = 5.
But using total_refs = 24 is not accurate. Sometimes, we'll never find
all the refs from specific root. As a result, the loop keeps on going
until we reach the end of that inode.
The first 3 patches, handle 3 different types refs we might encounter.
These refs do not belong to the normal backref we are searching, and
hence need to be skipped.
This patch changes the total_refs to correct number so that we could
end loop as soon as we find all the refs we want.
btrfs send uses backref to find possible clone sources, the following
is a simple test to compare the results with and without this patch:
$ btrfs subvolume create /sub1
$ for i in `seq 1 163840`; do
dd if=/dev/zero of=/sub1/file bs=64K count=1 seek=$((i-1)) conv=notrunc oflag=direct
done
$ btrfs subvolume snapshot /sub1 /sub2
$ for i in `seq 1 163840`; do
dd if=/dev/zero of=/sub1/file bs=4K count=1 seek=$(((i-1)*16+10)) conv=notrunc oflag=direct
done
$ btrfs subvolume snapshot -r /sub1 /snap1
$ time btrfs send /snap1 | btrfs receive /volume2
Without this patch:
real 69m48.124s
user 0m50.199s
sys 70m15.600s
With this patch:
real 1m59.683s
user 0m35.421s
sys 2m42.684s
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
[ add patchset cover letter with background and numbers ]
Signed-off-by: David Sterba <dsterba@suse.com>
We could have some nodes/leaves in subvolume whose owner are not the
that subvolume. In this way, when we resolve normal backrefs of that
subvolume, we should avoid collecting those references from these blocks.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All references from the block of SHARED_DATA_REF belong to that shared
block backref.
For example:
item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize 95
extent refs 24 gen 7302 flags DATA
extent data backref root 257 objectid 260 offset 65536 count 5
extent data backref root 258 objectid 265 offset 0 count 9
shared data backref parent 394985472 count 10
Block 394985472 might be leaf from root 257, and the item obejctid and
(file_pos - file_extent_item::offset) in that leaf just happens to be
260 and 65536 which is equal to the first extent data backref entry.
Before this patch, when we resolve backref:
root 257 objectid 260 offset 65536
we will add those refs in block 394985472 and wrongly treat those as the
refs we want.
Fix this by checking if the leaf we are processing is shared data
backref, if so, just skip this leaf.
Shared data refs added into preftrees.direct have all entry value = 0
(root_id = 0, key = NULL, level = 0) except parent entry.
Other refs from indirect tree will have key value and root id != 0, and
these values won't be changed when their parent is resolved and added to
preftrees.direct. Therefore, we could reuse the preftrees.direct and
search ref with all values = 0 except parent is set to avoid getting
those resolved refs block.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When resolving one backref of type EXTENT_DATA_REF, we collect all
references that simply reference the EXTENT_ITEM even though their
(file_pos - file_extent_item::offset) are not the same as the
btrfs_extent_data_ref::offset we are searching for.
This patch adds additional check so that we only collect references whose
(file_pos - file_extent_item::offset) == btrfs_extent_data_ref::offset.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The integrity checking code for the super block mirrors is the last
remaining user of buffer_heads, change it to using plain bios as well.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the last caller of btrfsic_process_written_block() with
buffer_heads is gone, remove the buffer_head processing path from it as
well.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the last use of btrfsic_submit_bh() is gone as the super block
is now written using bios, remove the function as well.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to the superblock read path, change the write path to using bios
and pages instead of buffer_heads. This allows us to skip over the
buffer_head code, for writing the superblock to disk.
This is based on a patch originally authored by Nikolay Borisov.
Co-developed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Super-block reading in BTRFS is done using buffer_heads. Buffer_heads
have some drawbacks, like not being able to propagate errors from the
lower layers.
Directly use the page cache for reading the super blocks from disk or
invalidating an on-disk super block. We have to use the page cache so to
avoid races between mkfs and udev. See also 6f60cbd3ae ("btrfs: access
superblock via pagecache in scan_one_device").
This patch unwraps the buffer head API and does not change the way the
super block is actually read.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_scratch_superblocks() isn't used anywhere outside volumes.c so
remove it from the header file and mark it as static. Also move it
above it's callers so we don't need a forward declaration.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Block device mappings are never in highmem so kmap() / kunmap() calls for
pages from block devices are unneeded. Use page_address() instead of
kmap() to get to the virtual addreses.
While we're at it, read_cache_page_gfp() doesn't return NULL on error,
only an ERR_PTR, so use IS_ERR() to check for errors.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory patch for removal of buffer_head usage in btrfs.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When attempting to set bits on a range of an exent io tree that already
has those bits set we can end up splitting an extent state record, use
the preallocated extent state record, insert it into the red black tree,
do another search on the red black tree, merge the preallocated extent
state record with the previous extent state record, remove that previous
record from the red black tree and then free it. This is all unnecessary
work that consumes time.
This happens specifically at the following case at __set_extent_bit():
$ cat -n fs/btrfs/extent_io.c
957 static int __must_check
958 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
(...)
1044 /*
1045 * | ---- desired range ---- |
1046 * | state |
1047 * or
1048 * | ------------- state -------------- |
1049 *
(...)
1060 if (state->start < start) {
1061 if (state->state & exclusive_bits) {
1062 *failed_start = start;
1063 err = -EEXIST;
1064 goto out;
1065 }
1066
1067 prealloc = alloc_extent_state_atomic(prealloc);
1068 BUG_ON(!prealloc);
1069 err = split_state(tree, state, prealloc, start);
1070 if (err)
1071 extent_io_tree_panic(tree, err);
1072
1073 prealloc = NULL;
So if our extent state represents a range from 0 to 1MiB for example, and
we want to set bits in the range 128KiB to 256KiB for example, and that
extent state record already has all those bits set, we end up splitting
that record, so we end up with extent state records in the tree which
represent the ranges from 0 to 128KiB and from 128KiB to 1MiB. This is
temporary because a subsequent iteration in that function will end up
merging the records.
The splitting requires using the preallocated extent state record, so
a future iteration that needs to do another split will need to allocate
another extent state record in an atomic context, something not ideal
that we try to avoid as much as possible. The splitting also requires
an insertion in the red black tree, and a subsequent merge will require
a deletion from the red black tree and freeing an extent state record.
This change just skips the splitting of an extent state record when it
already has all the bits the we need to set.
Setting a bit that is already set for a range is very common in the
inode's 'file_extent_tree' extent io tree for example, where we keep
setting the EXTENT_DIRTY bit every time we replace an extent.
This change also fixes a bug that happens after the recent patchset from
Josef that avoids having implicit holes after a power failure when not
using the NO_HOLES feature, more specifically the patch with the subject:
"btrfs: introduce the inode->file_extent_tree"
This patch introduced an extent io tree per inode to keep track of
completed ordered extents and figure out at any time what is the safe
value for the inode's disk_i_size. This assumes that for contiguous
ranges in a file we always end up with a single extent state record in
the io tree, but that is not the case, as there is a short time window
where we can have two extent state records representing contiguous
ranges. When this happens we end setting up an incorrect value for the
inode's disk_i_size, resulting in data loss after a clean unmount
of the filesystem. The following example explains how this can happen.
Suppose we have an inode with an i_size and a disk_i_size of 1MiB, so in
the inode's file_extent_tree we have a single extent state record that
represents the range [0, 1MiB) with the EXTENT_DIRTY bit set. Then the
following steps happen:
1) A buffered write against file range [512KiB, 768KiB) is made. At this
point delalloc was not flushed yet;
2) Deduplication from some other inode into this inode's range
[128KiB, 256KiB) is made. This causes btrfs_inode_set_file_extent_range()
to be called, from btrfs_insert_clone_extent(), to mark the range
[128KiB, 256KiB) with EXTENT_DIRTY in the inode's file_extent_tree;
3) When btrfs_inode_set_file_extent_range() calls set_extent_bits(), we
end up at __set_extent_bit(). In the first iteration of that function's
loop we end up in the following branch:
$ cat -n fs/btrfs/extent_io.c
957 static int __must_check
958 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
(...)
1044 /*
1045 * | ---- desired range ---- |
1046 * | state |
1047 * or
1048 * | ------------- state -------------- |
1049 *
(...)
1060 if (state->start < start) {
1061 if (state->state & exclusive_bits) {
1062 *failed_start = start;
1063 err = -EEXIST;
1064 goto out;
1065 }
1066
1067 prealloc = alloc_extent_state_atomic(prealloc);
1068 BUG_ON(!prealloc);
1069 err = split_state(tree, state, prealloc, start);
1070 if (err)
1071 extent_io_tree_panic(tree, err);
1072
1073 prealloc = NULL;
(...)
1089 goto search_again;
This splits the state record into two, one for range [0, 128KiB) and
another for the range [128KiB, 1MiB). Both already have the EXTENT_DIRTY
bit set. Then we jump to the 'search_again' label, where we unlock the
the spinlock protecting the extent io tree before jumping to the
'again' label to perform the next iteration;
4) In the meanwhile, delalloc is flushed, the ordered extent for the range
[512KiB, 768KiB) is created and when it completes, at
btrfs_finish_ordered_io(), it calls btrfs_inode_safe_disk_i_size_write()
with a value of 0 for its 'new_size' argument;
5) Before the deduplication task currently at __set_extent_bit() moves to
the next iteration, the task finishing the ordered extent calls
find_first_extent_bit() through btrfs_inode_safe_disk_i_size_write()
and gets 'start' set to 0 and 'end' set to 128KiB - because at this
moment the io tree has two extent state records, one representing the
range [0, 128KiB) and another representing the range [128KiB, 1MiB),
both with EXTENT_DIRTY set. Then we set 'isize' to:
isize = min(isize, end + 1)
= min(1MiB, 128KiB - 1 + 1)
= 128KiB
Then we set the inode's disk_i_size to 128KiB (isize).
After a clean unmount of the filesystem and mounting it again, we have
the file with a size of 128KiB, and effectively lost all the data it
had before in the range from 128KiB to 1MiB.
This change fixes that issue too, as we never end up splitting extent
state records when they already have all the bits we want set.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're allocating a logged extent we attempt to insert an extent
record for the file extent directly. We increase
space_info->bytes_reserved, because the extent entry addition will call
btrfs_update_block_group(), which will convert the ->bytes_reserved to
->bytes_used. However if we fail at any point while inserting the
extent entry we will bail and leave space on ->bytes_reserved, which
will trigger a WARN_ON() on umount. Fix this by pinning the space if we
fail to insert, which is what happens in every other failure case that
involves adding the extent entry.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is only used in read_fs_root(), which is just a wrapper of
btrfs_get_fs_root().
For all the mentioned essential roots except log root tree,
btrfs_get_fs_root() has its own quick path to grab them from fs_info
directly, thus no need for key.offset modification.
For subvolume trees, btrfs_get_fs_root() with key.offset == -1 is
completely fine.
For log trees and log root tree, it's impossible to hit them, as for
relocation all backrefs are fetched from commit root, which never
records log tree blocks.
Log tree blocks either get freed in regular transaction commit, or
replayed at mount time. At runtime we should never hit an backref for
log tree in extent tree.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This commit flips the switch to start tracking/processing pinned extents
on a per-transaction basis. It mostly replaces all references from
btrfs_fs_info::(pinned_extents|freed_extents[]) to
btrfs_transaction::pinned_extents.
Two notable modifications that warrant explicit mention are changing
clean_pinned_extents to get a reference to the previously running
transaction. The other one is removal of call to
btrfs_destroy_pinned_extent since transactions are going to be cleaned
in btrfs_cleanup_one_transaction.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Next patch is going to refactor how pinned extents are tracked which
will necessitate changing this code. To ease that work and contain the
changes factor the code now in preparation, this will also help review.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation to making pinned extents per-transaction ensure that log
such extents are always excluded from caching. To achieve this in
addition to marking them via btrfs_pin_extent_for_log_replay they also
need to be marked with btrfs_add_excluded_extent to prevent log tree
extent buffer being loaded by the free space caching thread. That's
required since log tree blocks are not recorded in the extent tree, hence
they always look free.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparation for refactoring pinned extents tracking.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers have a reference to a transaction handle so pass it to
pin_down_extent. This is the final step before switching pinned extent
tracking to a per-transaction basis.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparation for refactoring pinned extents tracking.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_pin_reserved_extent is now only called with a valid transaction so
exploit the fact to take a transaction. This is preparation for tracking
pinned extents on a per-transaction basis.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Calling btrfs_pin_reserved_extent makes sense only with a valid
transaction since pinned extents are processed from transaction commit
in btrfs_finish_extent_commit. In case of error it's sufficient to
adjust the reserved counter to account for log tree extents allocated in
the last transaction.
This commit moves btrfs_pin_reserved_extent to be called only with valid
transaction handle and otherwise uses the newly introduced
unaccount_log_buffer to adjust "reserved". If this is not done if a
failure occurs before transaction is committed WARN_ON are going to be
triggered on unmount. This was especially pronounced with generic/475
test.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function correctly adjusts the reserved bytes occupied by a log
tree extent buffer. It will be used instead of calling
btrfs_pin_reserved_extent.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparation for switching pinned extent tracking to a per-transaction
basis.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Having btrfs_destroy_delayed_refs call btrfs_pin_extent is problematic
for making pinned extents tracking per-transaction since
btrfs_trans_handle cannot be passed to btrfs_pin_extent in this context.
Additionally delayed refs heads pinned in btrfs_destroy_delayed_refs
are going to be handled very closely, in btrfs_destroy_pinned_extent.
To enable btrfs_pin_extent to take btrfs_trans_handle simply open code
it in btrfs_destroy_delayed_refs and call btrfs_error_unpin_extent_range
on the range. This enables us to do less work in
btrfs_destroy_pinned_extent and leaves btrfs_pin_extent being called in
contexts which have a valid btrfs_trans_handle.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The devinfo attribute handlers were added in 668e48af7a ("btrfs:
sysfs, add devid/dev_state kobject and device attributes") and the name
should contain _devinfo_, there's one that does not conform, so unify it
with the rest.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 668e48af7a ("btrfs: sysfs, add devid/dev_state kobject and
device attributes"), the functions btrfs_sysfs_add_device_link() and
btrfs_sysfs_rm_device_link() do more than just adding and removing the
device link as its name indicated. Rename them to be more specific
that's about the directory with the attirbutes
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have one simple function btrfs_sysfs_remove_fsid() to undo
btrfs_sysfs_add_fsid(), which also does proper checks before releasing
objects.
One difference, if btrfs_sysfs_remove_fsid is used that now we also call
kobject_del() which was missing before. This was tested (with kobject
debug turned on) and no change in behaviour was found.
This is a cleanup patch.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree pointer can be safely read from the inode, use it and drop the
redundant argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree pointer can be safely read from the inode, use it and drop the
redundant argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree pointer can be safely read from the inode, use it and drop the
redundant argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree pointer can be safely read from the page's inode, use it and
drop the redundant argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree pointer can be safely read from the inode so we can drop the
redundant argument from btrfs_lock_and_flush_ordered_range.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add assertions to all helpers that get tree as argument and verify that
it's the same that can be obtained from the inode or from its pages. In
followup patches the redundant arguments and assertions will be removed
one by one.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're sure the tree from argument is same as the one we can get
from the page's inode io_tree, drop the redundant argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All functions that set up extent_page_data::tree set it to the inode
io_tree. That's passed down the callstack that accesses either the same
inode or its pages. In the end submit_extent_page can pull the tree out
of the page and we don't have to store it in the structure.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The status of aborted transaction can change between calls and it needs
to be accessed by READ_ONCE. Add a helper that also wraps the unlikely
hint.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helpers are related to locking so move them there, update comments.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are now using these for all roots, rename them to btrfs_put_root()
and btrfs_grab_root();
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're going to start relying on getting ref counting right for
roots, add a list to track allocated roots and print out any roots that
aren't freed up at free_fs_info time.
Hide this behind CONFIG_BTRFS_DEBUG because this will just be used for
developers to verify they aren't breaking things.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In adding things like eb leak checking and root leak checking there were
a lot of weird corner cases that come from the fact that
1) We do not init the fs_info until we get to open_ctree time in the
normal case and
2) The test infrastructure half-init's the fs_info for things that it
needs.
This makes it really annoying to make changes because you have to add
init in two different places, have special cases for testing fs_info's
that may not have certain things initialized, and cases for fs_info's
that didn't make it to open_ctree and thus are not fully set up.
Fix this by extracting out the non-allocating init of the fs info into
it's own public function and use that to make sure we're all getting
consistent views of an allocated fs_info.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
open_ctree mixes initialization of fs stuff and fs_info stuff, which
makes it confusing when doing things like adding the root leak
detection. Make a separate function that inits all the static
structures inside of the fs_info needed for the fs to operate, and then
call that before we start setting up the fs_info to be mounted.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Things like the percpu_counters, the mapping_tree, and the csum hash can
all be freed at btrfs_free_fs_info time, since the helpers all check if
the structure has been initialized already. This significantly cleans
up the error cases in open_ctree.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that all callers of btrfs_get_fs_root are subsequently calling
btrfs_grab_fs_root and handling dropping the ref when they are done
appropriately, go ahead and push btrfs_grab_fs_root up into
btrfs_get_fs_root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we are going to track leaked roots we need to free them all the same
way, so don't kfree() roots directly, use btrfs_put_fs_root.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We lookup the fs_root and put it in our fs_info directly, we should hold
a ref on this root for the lifetime of the fs_info.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to start freeing roots and doing other complicated things in
free_fs_info, so we need to move it to disk-io.c and export it in order
to use things lik btrfs_put_fs_root().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We lookup the uuid of arbitrary subvolumes, hold a ref on the root while
we're doing this.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We replay the log into arbitrary fs roots, hold a ref on the root while
we're doing this.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We create the snapshot and then use it for a bunch of things, we need to
hold a ref on it while we're messing with it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We lookup the name of a subvol which means we'll cross into different
roots. Hold a ref while we're doing the look ups in the fs_root we're
searching.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We lookup all the clone roots and the parent root for send, so we need
to hold refs on all of these roots while we're processing them.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We look up the root for the bytenr that is failing, so we need to hold a
ref on the root for that operation.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We lookup roots for every orphan item we have, we need to hold a ref on
the root while we're doing this work.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All of relocation uses read_fs_root to lookup fs roots, so push the
btrfs_grab_fs_root() up into that helper and remove the individual
calls.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We look up the fs root in various places in here when recovering from a
crashed relcoation. Make sure we hold a ref on the root whenever we
look them up.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're creating a reloc inode in the data reloc tree, we need to hold a
ref on the root while we're doing that.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're looking up the data references for the bytenr in a root, we need
to hold a ref on that root while we're doing that.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are recording this root in the transaction, so we need to hold a ref
on it until we do that.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We look up the corresponding root for the reloc root, we need to hold a
ref while we're messing with it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We look up the reloc roots corresponding root, we need to hold a ref on
that root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is trickier than the previous conversions. We have backref_node's
that need to hold onto their root for their lifetime. Do the read of
the root and grab the ref. If at any point we don't use the root we
discard it, however if we use it in our backref node we don't free it
until we free the backref node. Any time we switch the root's for the
backref node we need to drop our ref on the old root and grab the ref on
the new root, and if we dupe a node we need to get a ref on the root
there as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We look up an arbitrary fs root here, we need to hold a ref on the root
for the duration.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We look up whatever root userspace has given us, we need to hold a ref
throughout this operation. Use 'root' only for the on fs root and not as
a temporary variable elsewhere.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can wander into a different root, so grab a ref on the root we look
up. Later on we make root = fs_info->tree_root so we need this separate
out label to make sure we do the right cleanup only in the case we're
looking up a different root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We look up an arbitrary fs root, we need to hold a ref on it while we're
doing our search.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We lookup a arbitrary fs root, we need to hold a ref on that root. If
we're using our own inodes root then grab a ref on that as well to make
the cleanup easier.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're creating the new root here, but we should hold the ref until after
we've initialized the inode for it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Looking up the inode from an arbitrary tree means we need to hold a ref
on that root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are looking up an arbitrary inode, we need to hold a ref on the root
while we're doing this.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Looking up the inode we need to search the root, make sure we hold a
reference on that root while we're doing the lookup.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're looking up a random root, we need to hold a ref on it while we're
using it.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the root is sitting in the radix tree, we should probably have a ref
for the radix tree. Grab a ref on the root when we insert it, and drop
it when it gets deleted.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add another comment to cover how the space reservation system works
generally. This covers the actual reservation flow, as well as how
flushing is handled.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
delalloc space reservation is tricky because it encompasses both data
and metadata. Make it clear what each side does, the general flow of
how space is moved throughout the lifetime of a write, and what goes
into the calculations.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a giant comment at the top of block-rsv.c describing generally
how block reserves work. It is purely about the block reserves
themselves, and nothing to do with how the actual reservation system
works.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We want to use this for dropping all roots, and in some error cases we
may not have a root, so handle this to make the cleanup code easier.
Make btrfs_grab_fs_root the same so we can use it in cases where the
root may not exist (like the quota root).
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the orphan cleanup stuff doesn't use this directly we can just
make them static.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All this does is call btrfs_get_fs_root() with check_ref == true. Just
use btrfs_get_fs_root() so we don't have a bunch of different helpers
that do the same thing.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All helpers should either be using btrfs_get_fs_root() or
btrfs_read_tree_root().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Relocation has it's special roots, we don't want to save these in the
root cache either, so swap it to use btrfs_read_tree_root(). However
the reloc root does need REF_COWS set, so make sure we set it everywhere
we use this helper, as it no longer does the REF_COWS setting.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Tree-log uses btrfs_read_fs_root to load its log, but this just calls
btrfs_read_tree_root. We don't save the log roots in our root cache, so
just export this helper and use it in the logging code.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_find_orphan_roots has this weird thing where it looks up the root
in cache to see if it is there before just reading the root. But the
read it uses just reads the root, it doesn't do any of the init work, we
do that by hand here. But this is unnecessary, all we really want is to
see if the root still exists and add it to the dead roots list to be
cleaned up, otherwise we delete the orphan item.
Fix this by just using btrfs_get_fs_root directly with check_ref set to
false so we get the orphan root items. Then we just handle in cache and
out of cache roots the same, add them to the dead roots list and carry
on.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a helper for reading fs roots that just reads the fs root off
the disk and then sets REF_COWS and init's the inheritable flags. Move
this into btrfs_init_fs_root so we can later get rid of this helper and
consolidate all of the fs root reading into one helper.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no reason to not init the root at alloc time, and with later
patches it actually causes problems if we error out mounting the fs
before the tree_root is init'ed because we expect it to have a valid ref
count. Fix this by pushing __setup_root into btrfs_alloc_root.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have a safe way to update the isize, remove all of this code
as it's no longer needed.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have a safe way to update the i_size, replace all uses of
btrfs_ordered_update_i_size with btrfs_inode_safe_disk_i_size_write.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We want to use this everywhere we modify the file extent items
permanently. These include:
1) Inserting new file extents for writes and prealloc extents.
2) Truncating inode items.
3) btrfs_cont_expand().
4) Insert inline extents.
5) Insert new extents from log replay.
6) Insert a new extent for clone, as it could be past i_size.
7) Hole punching
For hole punching in particular it might seem it's not necessary because
anybody extending would use btrfs_cont_expand, however there is a corner
that still can give us trouble. Start with an empty file and
fallocate KEEP_SIZE 1M-2M
We now have a 0 length file, and a hole file extent from 0-1M, and a
prealloc extent from 1M-2M. Now
punch 1M-1.5M
Because this is past i_size we have
[HOLE EXTENT][ NOTHING ][PREALLOC]
[0 1M][1M 1.5M][1.5M 2M]
with an i_size of 0. Now if we pwrite 0-1.5M we'll increas our i_size
to 1.5M, but our disk_i_size is still 0 until the ordered extent
completes.
However if we now immediately truncate 2M on the file we'll just call
btrfs_cont_expand(inode, 1.5M, 2M), since our old i_size is 1.5M. If we
commit the transaction here and crash we'll expose the gap.
To fix this we need to clear the file extent mapping for the range that
we punched but didn't insert a corresponding file extent for. This will
mean the truncate will only get an disk_i_size set to 1M if we crash
before the finish ordered io happens.
I've written an xfstest to reproduce the problem and validate this fix.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to keep track of where we have file extents on disk, and thus
where it is safe to adjust the i_size to, we need to have a tree in
place to keep track of the contiguous areas we have file extents for.
Add helpers to use this tree, as it's not required for NO_HOLES file
systems. We will use this by setting DIRTY for areas we know we have
file extent item's set, and clearing it when we remove file extent items
for truncation.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were using btrfs_i_size_write(), which unconditionally jacks up
inode->disk_i_size. However since clone can operate on ranges we could
have pending ordered extents for a range prior to the start of our clone
operation and thus increase disk_i_size too far and have a hole with no
file extent.
Fix this by using the btrfs_ordered_update_i_size helper which will do
the right thing in the face of pending ordered extents outside of our
clone range.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfsctl was removed in 2012, now the function btrfs_control_ioctl()
is only used for devices ioctls. So update the comment.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Su Yue <Damenly_Su@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Relocation is one of the most complex part of btrfs, while it's also the
foundation stone for online resizing, profile converting.
For such a complex facility, we should at least have some introduction
to it.
This patch will add an basic introduction at pretty a high level,
explaining:
- What relocation does
- How relocation is done
Only mentioning how data reloc tree and reloc tree are involved in the
operation.
No details like the backref cache, or the data reloc tree contents.
- Which function to refer.
More detailed comments will be added for reloc tree creation, data reloc
tree creation and backref cache.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Each new element added to the mod seq list is always appended to the list,
and each one gets a sequence number coming from a counter which gets
incremented everytime a new element is added to the list (or a new node
is added to the tree mod log rbtree). Therefore the element with the
lowest sequence number is always the first element in the list.
So just remove the list iteration at btrfs_put_tree_mod_seq() that
computes the minimum sequence number in the list and replace it with
a check for the first element's sequence number.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The overview of btrfs dev-replace. It mentions some corner cases caused
by the write duplication and scrub based data copy.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ adjust wording ]
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl53Vh0ACgkQxWXV+ddt
WDtfOQ//bbUyKXcdH0FBZOCEcJmegcK1eUFYqKrwR2bHGe5JRdLM8pAvjCcqmWeO
jtaRiFC4NSCqTIl3mkBUb+XmQtjZwixBUHRxJpuEO8zqawvFZXTqg/KJklNvi2rd
KdflSNia6KrozTT+B/lpwZ5emS+wSdj5XTZ6VGj4riwtphSfWAjOu+4cOASMeFu+
Gfn+N9xu0ZcR/6zO20xAg0Xz+WU2uj4EfeM35dtRP2bPLG0yOGmiYT15Ll9h74Wm
7F+28iNTQfYutAexGvUpiouanGXE+ka3TCsJg5LuVTpdKGraOVGEuX+RhsyoKQrB
E8bk91fbkLlooluhUC306iNA9/+RN/yFGtILX8JsgI2Od26ZuU01l/OHrc19MDIm
gw1w3PMsD/hXLsG5ba4QsIYOzXofSrPdWej29h/o5p0VEQrAoCJEpAi7fVsiJDR1
sx6kCodw5jYhVs1P6DdXO1pgjE7iFUmjUQCFkl40edPMLy/LwB99A4zNnCOwI0KZ
49CMWHDe+tXVJBTzPvtma/PycQHIxJYMf1f8ko9E4stB7HtfH4dnUERDkb1UwQ5n
aJgyhsCCnp/EJoPunUT7g9nLUdyu0Rtwknn3NascWZEieX2QhKEF5RcjAUSL+Hlo
jbGGvoLhG0nOtYkU7BNSQbL8wxPJEEAq8e6F4tWMcOkhX4pNZP8=
=YkB0
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two fixes.
The first is a regression: when dropping some incompat bits the
conditions were reversed. The other is a fix for rename whiteout
potentially leaving stack memory linked to a list"
* tag 'for-5.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix removal of raid[56|1c34} incompat flags after removing block group
btrfs: fix log context list corruption after rename whiteout error
We are incorrectly dropping the raid56 and raid1c34 incompat flags when
there are still raid56 and raid1c34 block groups, not when we do not any
of those anymore. The logic just got unintentionally broken after adding
the support for the raid1c34 modes.
Fix this by clear the flags only if we do not have block groups with the
respective profiles.
Fixes: 9c907446dc ("btrfs: drop incompat bit for raid1c34 after last block group is gone")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During a rename whiteout, if btrfs_whiteout_for_rename() returns an error
we can end up returning from btrfs_rename() with the log context object
still in the root's log context list - this happens if 'sync_log' was
set to true before we called btrfs_whiteout_for_rename() and it is
dangerous because we end up with a corrupt linked list (root->log_ctxs)
as the log context object was allocated on the stack.
After btrfs_rename() returns, any task that is running btrfs_sync_log()
concurrently can end up crashing because that linked list is traversed by
btrfs_sync_log() (through btrfs_remove_all_log_ctxs()). That results in
the same issue that commit e6c617102c ("Btrfs: fix log context list
corruption after rename exchange operation") fixed.
Fixes: d4682ba03e ("Btrfs: sync log after logging new name")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl5iXe8ACgkQxWXV+ddt
WDvWGg/+LFP+Y8Qz6xHTl3vXuGJKjCr7X/MIi69r2N0JFoCUeXyOdxeSlOuNCfhb
HiLZzfA5TYoptsdLJAXQLy7nPKFCQcc+J19Mbt2+aebpdGqfgN+YZEGkltfKL8Ao
xjOGu5HROFFpNTtnwa1dYOQkyVuZ8oafuJxwVJ8T28fxepRvBbi5jmy3lb1ypL3W
NoIPBe+67g5z/W0ATFmBMF7cCbvS5gsEGWKpbbjh7r8ZHJkhUaxVU7YdxPqlXrAO
ejZfiJUwi8rTGm0zd8A5TX/wsxSeBEXolvh91k5tatTljjzROHa028KRg2voUZIW
C5/7X+Z2C3gzuT0o7TGLBOR6CkVhkSutDV8/QIE6hDjZ/aCMNi0mIFco1hG8jjd1
jQfjemjj7PWuVEnZ6EuVSoHSXjZvBvX66of40YhTQEtSaJpcZU4jP26+8cXENN6+
6WbWcQpEQbT0cp0YKWhWvAIwGMf0jmWESISeFMEaF0eQd8BtzrH1qYcs3JTmXvHC
XmC47hoEJLhjQkAgQ4oNa5PZQzR1wEfW/4FPdqlADOR2frE1wDiKdrpN/dkAYHdQ
edNlo9u0+bRWCP40p04i2IUX/aUAc+me9QxiZwxT3Fw0g5QBSE2035Ly4spvT8NZ
gIvwnq1KGxmtrJSo5Lpkv4bjHYbByYMOiGJUMOTCIEdqajFI224=
=06pr
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One fixup for DIO when in use with the new checksums, a missed case
where the checksum size was still assuming u32"
* tag 'for-5.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix RAID direct I/O reads with alternate csums
btrfs_lookup_and_bind_dio_csum() does pointer arithmetic which assumes
32-bit checksums. If using a larger checksum, this leads to spurious
failures when a direct I/O read crosses a stripe. This is easy
to reproduce:
# mkfs.btrfs -f --checksum blake2 -d raid0 /dev/vdc /dev/vdd
...
# mount /dev/vdc /mnt
# cd /mnt
# dd if=/dev/urandom of=foo bs=1M count=1 status=none
# dd if=foo of=/dev/null bs=1M iflag=direct status=none
dd: error reading 'foo': Input/output error
# dmesg | tail -1
[ 135.821568] BTRFS warning (device vdc): csum failed root 5 ino 257 off 421888 ...
Fix it by using the actual checksum size.
Fixes: 1e25a2e3ca ("btrfs: don't assume ordered sums to be 4 bytes")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl5SdTAACgkQxWXV+ddt
WDtOPQ/+PGbN2bfZSf0DBUcOC+IOdiCYWHmaO6NQZzQb4CTjuumsDErSTzbg77OQ
gBqRBfcaWnyZZIabHe0JjctQ//buycP8yGb8VevgVdD4qBnb5sM/pvBqT4kO9Omy
6K3W7RhcP8gejOxGJb4nuiUBdYRW3KTji8QT3TMv/njuqn0p8n1NgzVlz0m6+xtO
IfJOqThvvG9LBytsQq/pvqaoXID/06lXRyM42XbsKFyygv10vp69xz5Skdl1XkGk
pyymzLlLDkLorRuDkjnzIvOqNFDAYiSFJafHupSE4SlfscRYfTwCV0uXE+NX4bdL
piTr5169ALoIRyTHU37YNNCPZXNGHnmn5Mtf4o8Q0ps0MKz1+sIjFRAxrdolLHjw
iYcocXU9L5wtHUBouTBzAsnuWJnizJNExHYb/3MHnHZYDu371l8wOGL1AHzRXEm/
qkxTPS3V2JpaFkFvTmwEQbl8PzDgpxpPDDBQUcEBZn3Vb9AFX38Fo/5OiGOnpnTd
9wVKXK2S6vHz/xpfcT/3SOtDUljSPJMUXUVkVcdz+OfUWvn6icXzR0t4keXVqZv+
INSOHzXb+iCuIb+NX7VZTN9oq0GH58aA+ApIo5beNbwJ6EzFotGE8dQmK3AfQ6pU
Pod0gzW5Hqj9Q2AfWYTZjWP+Og+dF2+bQFYSRK8NjcKjcrvUEVU=
=Asj0
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"These are fixes that were found during testing with help of error
injection, plus some other stable material.
There's a fixup to patch added to rc1 causing locking in wrong context
warnings, tests found one more deadlock scenario. The patches are
tagged for stable, two of them now in the queue but we'd like all
three released at the same time.
I'm not happy about fixes to fixes in such a fast succession during
rcs, but I hope we found all the fallouts of commit 28553fa992
('Btrfs: fix race between shrinking truncate and fiemap')"
* tag 'for-5.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eof
Btrfs: fix btrfs_wait_ordered_range() so that it waits for all ordered extents
btrfs: fix bytes_may_use underflow in prealloc error condtition
btrfs: handle logged extent failure properly
btrfs: do not check delayed items are empty for single transaction cleanup
btrfs: reset fs_root to NULL on error in open_ctree
btrfs: destroy qgroup extent records on transaction abort
While logging the prealloc extents of an inode during a fast fsync we call
btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
holding a read lock on a leaf of the inode's root (not the log root, the
fs/subvol root), and then that function locks the file range in the inode's
iotree. This can lead to a deadlock when:
* the fsync is ranged
* the file has prealloc extents beyond eof
* writeback for a range different from the fsync range starts
during the fsync
* the size of the file is not sector size aligned
Because when finishing an ordered extent we lock first a file range and
then try to COW the fs/subvol tree to insert an extent item.
The following diagram shows how the deadlock can happen.
CPU 1 CPU 2
btrfs_sync_file()
--> for range [0, 1MiB)
--> inode has a size of
1MiB and has 1 prealloc
extent beyond the
i_size, starting at offset
4MiB
flushes all delalloc for the
range [0MiB, 1MiB) and waits
for the respective ordered
extents to complete
--> before task at CPU 1 locks the
inode, a write into file range
[1MiB, 2MiB + 1KiB) is made
--> i_size is updated to 2MiB + 1KiB
--> writeback is started for that
range, [1MiB, 2MiB + 4KiB)
--> end offset rounded up to
be sector size aligned
btrfs_log_dentry_safe()
btrfs_log_inode_parent()
btrfs_log_inode()
btrfs_log_changed_extents()
btrfs_log_prealloc_extents()
--> does a search on the
inode's root
--> holds a read lock on
leaf X
btrfs_finish_ordered_io()
--> locks range [1MiB, 2MiB + 4KiB)
--> end offset rounded up
to be sector size aligned
--> tries to cow leaf X, through
insert_reserved_file_extent()
--> already locked by the
task at CPU 1
btrfs_truncate_inode_items()
--> gets an i_size of
2MiB + 1KiB, which is
not sector size
aligned
--> tries to lock file
range [2MiB, (u64)-1)
--> the start range
is rounded down
from 2MiB + 1K
to 2MiB to be sector
size aligned
--> but the subrange
[2MiB, 2MiB + 4KiB) is
already locked by
task at CPU 2 which
is waiting to get a
write lock on leaf X
for which we are
holding a read lock
*** deadlock ***
This results in a stack trace like the following, triggered by test case
generic/561 from fstests:
[ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds.
[ 2779.979536] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2779.990136] kworker/u8:6 D 0 247 2 0x80004000
[ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[ 2779.990466] Call Trace:
[ 2779.990491] ? __schedule+0x384/0xa30
[ 2779.990521] schedule+0x33/0xe0
[ 2779.990616] btrfs_tree_read_lock+0x19e/0x2e0 [btrfs]
[ 2779.990632] ? remove_wait_queue+0x60/0x60
[ 2779.990730] btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
[ 2779.990782] btrfs_search_slot+0x510/0x1000 [btrfs]
[ 2779.990869] btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
[ 2779.990944] __btrfs_drop_extents+0x161/0x1060 [btrfs]
[ 2779.990987] ? mark_held_locks+0x6d/0xc0
[ 2779.990994] ? __slab_alloc.isra.49+0x99/0x100
[ 2779.991060] ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs]
[ 2779.991145] insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs]
[ 2779.991222] ? start_transaction+0xdd/0x5c0 [btrfs]
[ 2779.991291] btrfs_finish_ordered_io+0x4f4/0x840 [btrfs]
[ 2779.991405] btrfs_work_helper+0xaa/0x720 [btrfs]
[ 2779.991432] process_one_work+0x26d/0x6a0
[ 2779.991460] worker_thread+0x4f/0x3e0
[ 2779.991481] ? process_one_work+0x6a0/0x6a0
[ 2779.991489] kthread+0x103/0x140
[ 2779.991499] ? kthread_create_worker_on_cpu+0x70/0x70
[ 2779.991515] ret_from_fork+0x3a/0x50
(...)
[ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds.
[ 2780.027480] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2780.030035] fsstress D 0 17375 17373 0x00004000
[ 2780.030038] Call Trace:
[ 2780.030044] ? __schedule+0x384/0xa30
[ 2780.030052] schedule+0x33/0xe0
[ 2780.030075] lock_extent_bits+0x20c/0x320 [btrfs]
[ 2780.030094] ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs]
[ 2780.030098] ? rcu_read_lock_sched_held+0x59/0xa0
[ 2780.030102] ? remove_wait_queue+0x60/0x60
[ 2780.030122] btrfs_truncate_inode_items+0x133/0x1150 [btrfs]
[ 2780.030151] ? btrfs_set_path_blocking+0xb2/0x160 [btrfs]
[ 2780.030165] ? btrfs_search_slot+0x379/0x1000 [btrfs]
[ 2780.030195] btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs]
[ 2780.030202] ? do_raw_spin_unlock+0x49/0xc0
[ 2780.030215] ? btrfs_get_num_csums+0x10/0x10 [btrfs]
[ 2780.030239] btrfs_log_inode+0xf83/0x1124 [btrfs]
[ 2780.030251] ? __mutex_unlock_slowpath+0x45/0x2a0
[ 2780.030275] btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
[ 2780.030282] ? dget_parent+0xa1/0x370
[ 2780.030309] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[ 2780.030329] btrfs_sync_file+0x3f3/0x490 [btrfs]
[ 2780.030339] do_fsync+0x38/0x60
[ 2780.030343] __x64_sys_fdatasync+0x13/0x20
[ 2780.030345] do_syscall_64+0x5c/0x280
[ 2780.030348] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2780.030356] RIP: 0033:0x7f2d80f6d5f0
[ 2780.030361] Code: Bad RIP value.
[ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0
[ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003
[ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c
[ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4
[ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90
So fix this by making btrfs_truncate_inode_items() not lock the range in
the inode's iotree when the target root is a log root, since it's not
needed to lock the range for log roots as the protection from the inode's
lock and log_mutex are all that's needed.
Fixes: 28553fa992 ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_wait_ordered_range() once we find an ordered extent that has
finished with an error we exit the loop and don't wait for any other
ordered extents that might be still in progress.
All the users of btrfs_wait_ordered_range() expect that there are no more
ordered extents in progress after that function returns. So past fixes
such like the ones from the two following commits:
ff612ba784 ("btrfs: fix panic during relocation after ENOSPC before
writeback happens")
28aeeac1dd ("Btrfs: fix panic when starting bg cache writeout after
IO error")
don't work when there are multiple ordered extents in the range.
Fix that by making btrfs_wait_ordered_range() wait for all ordered extents
even after it finds one that had an error.
Link: https://github.com/kdave/btrfs-progs/issues/228#issuecomment-569777554
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I hit the following warning while running my error injection stress
testing:
WARNING: CPU: 3 PID: 1453 at fs/btrfs/space-info.h:108 btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
RIP: 0010:btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
Call Trace:
btrfs_free_reserved_data_space+0x4f/0x70 [btrfs]
__btrfs_prealloc_file_range+0x378/0x470 [btrfs]
elfcorehdr_read+0x40/0x40
? elfcorehdr_read+0x40/0x40
? btrfs_commit_transaction+0xca/0xa50 [btrfs]
? dput+0xb4/0x2a0
? btrfs_log_dentry_safe+0x55/0x70 [btrfs]
? btrfs_sync_file+0x30e/0x420 [btrfs]
? do_fsync+0x38/0x70
? __x64_sys_fdatasync+0x13/0x20
? do_syscall_64+0x5b/0x1b0
? entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens if we fail to insert our reserved file extent. At this
point we've already converted our reservation from ->bytes_may_use to
->bytes_reserved. However once we break we will attempt to free
everything from [cur_offset, end] from ->bytes_may_use, but our extent
reservation will overlap part of this.
Fix this problem by adding ins.offset (our extent allocation size) to
cur_offset so we remove the actual remaining part from ->bytes_may_use.
I validated this fix using my inject-error.py script
python inject-error.py -o should_fail_bio -t cache_save_setup -t \
__btrfs_prealloc_file_range \
-t insert_reserved_file_extent.constprop.0 \
-r "-5" ./run-fsstress.sh
where run-fsstress.sh simply mounts and runs fsstress on a disk.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're allocating a logged extent we attempt to insert an extent
record for the file extent directly. We increase
space_info->bytes_reserved, because the extent entry addition will call
btrfs_update_block_group(), which will convert the ->bytes_reserved to
->bytes_used. However if we fail at any point while inserting the
extent entry we will bail and leave space on ->bytes_reserved, which
will trigger a WARN_ON() on umount. Fix this by pinning the space if we
fail to insert, which is what happens in every other failure case that
involves adding the extent entry.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_assert_delayed_root_empty() will check if the delayed root is
completely empty, but this is a filesystem-wide check. On cleanup we
may have allowed other transactions to begin, for whatever reason, and
thus the delayed root is not empty.
So remove this check from cleanup_one_transation(). This however can
stay in btrfs_cleanup_transaction(), because it checks only after all of
the transactions have been properly cleaned up, and thus is valid.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While running my error injection script I hit a panic when we tried to
clean up the fs_root when freeing the fs_root. This is because
fs_info->fs_root == PTR_ERR(-EIO), which isn't great. Fix this by
setting fs_info->fs_root = NULL; if we fail to read the root.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We clean up the delayed references when we abort a transaction but we
leave the pending qgroup extent records behind, leaking memory.
This patch destroys the extent records when we destroy the delayed refs
and makes sure ensure they're gone before releasing the transaction.
Fixes: 3368d001ba ("btrfs: qgroup: Record possible quota-related extent for qgroup.")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
[ Rebased to latest upstream, remove to_qgroup() helper, use
rbtree_postorder_for_each_entry_safe() wrapper ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl5K0xIACgkQxWXV+ddt
WDvihhAAjTdEr8Gbt4Beu+F9I2j+1+x/qdVADsrJErBJ5s1J0prXI27Vx5auHMMD
qJViTGTTjwlIo4jFI+AtlrzcE07eIMN1zzjVb2o8efrpkFDTfgT+dAkSVR8zAX20
nycsQ93QebCrlRGpPuXsniczy+QTRXWBY+vPXe1UbfGrzb3q+1pBJgfdq/7pkt/V
0RgB78JDVWUhw+YBjLuV0Y7tjmWlANGvofT34A34CR23i/JoIjVx7GfeQyILBmHE
iuCwzMgAdIKlRpjG5WJeK60bW0Dkjmbr3HDyJexiXSa4u2UwgUlT8UsITbVCr48t
OrfYxb7aWmwEqa5fbZzIiyTCWDKSxiQ5X0XW8e6ehdkh+s9DWmNp48xt9OogpYcv
8qni+vMgjrOLZwj//DgHpr2CYrISxJ+vhCvaOLMK++8TMIH/EQZEyaYkKn5Osg/8
iSwI9b2IF8LmnnvWhkbs1p4Jn2Ca4fMhlN2D5xflwSK/lbhgJk5E6P+4h6pQ9MvG
U5bWR2mqW1xT+lCO7AD+MBq+VbF/BZik0TjuNssoYJJtx/VbTbpydrHW65SvZvQK
QT7MTHekkbtIR+SGYl66n1Sc4RG7znWNwjfoyVEbSblLIczfzDDsl1O/9ldqbia4
BoPWghEO8Wy543UTH9DyX+oMzB2Pwb/TMa4eQyNhbQMqcwXHv5Q=
=nlEW
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"This is the fix for sleeping in a locked section bug reported by Dave
Jones, caused by a patch dependence in development and pulled
branches.
I picked the existing patch over the fixup that Filipe sent, as it's a
bit more generic fix. I've verified it with a specific test case, some
rsync stress and one round of fstests"
* tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't set path->leave_spinning for truncate
The only time we actually leave the path spinning is if we're truncating
a small amount and don't actually free an extent, which is not a common
occurrence. We have to set the path blocking in order to add the
delayed ref anyway, so the first extent we find we set the path to
blocking and stay blocking for the duration of the operation. With the
upcoming file extent map stuff there will be another case that we have
to have the path blocking, so just swap to blocking always.
Note: this patch also fixes a warning after 28553fa992 ("Btrfs: fix
race between shrinking truncate and fiemap") got merged that inserts
extent locks around truncation so the path must not leave spinning locks
after btrfs_search_slot.
[70.794783] BUG: sleeping function called from invalid context at mm/slab.h:565
[70.794834] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1141, name: rsync
[70.794863] 5 locks held by rsync/1141:
[70.794876] #0: ffff888417b9c408 (sb_writers#17){.+.+}, at: mnt_want_write+0x20/0x50
[70.795030] #1: ffff888428de28e8 (&type->i_mutex_dir_key#13/1){+.+.}, at: lock_rename+0xf1/0x100
[70.795051] #2: ffff888417b9c608 (sb_internal#2){.+.+}, at: start_transaction+0x394/0x560
[70.795124] #3: ffff888403081768 (btrfs-fs-01){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
[70.795203] #4: ffff888403086568 (btrfs-fs-00){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
[70.795222] CPU: 5 PID: 1141 Comm: rsync Not tainted 5.6.0-rc2-backup+ #2
[70.795362] Call Trace:
[70.795374] dump_stack+0x71/0xa0
[70.795445] ___might_sleep.part.96.cold.106+0xa6/0xb6
[70.795459] kmem_cache_alloc+0x1d3/0x290
[70.795471] alloc_extent_state+0x22/0x1c0
[70.795544] __clear_extent_bit+0x3ba/0x580
[70.795557] ? _raw_spin_unlock_irq+0x24/0x30
[70.795569] btrfs_truncate_inode_items+0x339/0xe50
[70.795647] btrfs_evict_inode+0x269/0x540
[70.795659] ? dput.part.38+0x29/0x460
[70.795671] evict+0xcd/0x190
[70.795682] __dentry_kill+0xd6/0x180
[70.795754] dput.part.38+0x2ad/0x460
[70.795765] do_renameat2+0x3cb/0x540
[70.795777] __x64_sys_rename+0x1c/0x20
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Fixes: 28553fa992 ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl5JQQUACgkQxWXV+ddt
WDv+fRAAh2VeACdVgGPCqH3kBphdBzqT2x9EUWZK++AWTr0Upg4Mz1mvD2h1WX6L
Yd/cuG0hIy2AFl1fEwrOp9d64PGFRnVcs0IS0C9GC/kH9ZzrSuz3pcfni3uxCr0N
t+/Pghxb654fLPd8eho9Pnc9bi/EPPRZ+mmwup4A9MyC7Bdfzf821ykrhqwam54V
LWRUanVfeJOuxNbgF2Hjnr9ZI2fALsFfh5+q1VCeNsm76Le67RGVi7V6i9Gxj8eE
3zOH/XM1Jzft3cLinfimIUC0BQKBuFXSKnYun5zP47Y5VHgxYCcfBBAD71uGgs52
mxaz1vLGE6GZo631zBBXvM6BalEHyQROP8Daman9NiJDctkvICXzv/bwi4k5ZHSF
i45MANAtBb10S/AYitscqmGR7sYHZtKBkgJHjH+0YqKgcu3OWInqNVuW77kZEguV
/UjcdjAmxirvyAaxOdfXtBfuZKBU1kL+dZwg4OeqzbDoBsUAtkEG+3oIcjkGnP9e
6QuQx5XYKmI1tugcAz0uj5ryXW6UtIFVy+Fgxht+XMSNRR5CiUpu7Ndm/ZrKx5RS
/hv2KZot+/p7IA9J0YvZbYwwvw28YxR4PgGMOKYhqKnTVHR6eaFTDNKiaQRkJaXr
Nimk6TQrpc9Qu/rcoeyuQajxCRhwSqNAJH97MJPMDz39xolEys8=
=i1Ct
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two races fixed, memory leak fix, sysfs directory fixup and two new
log messages:
- two fixed race conditions: extent map merging and truncate vs
fiemap
- create the right sysfs directory with device information and move
the individual device dirs under it
- print messages when the tree-log is replayed at mount time or
cannot be replayed on remount"
* tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: sysfs, move device id directories to UUID/devinfo
btrfs: sysfs, add UUID/devinfo kobject
Btrfs: fix race between shrinking truncate and fiemap
btrfs: log message when rw remount is attempted with unclean tree-log
btrfs: print message when tree-log replay starts
Btrfs: fix race between using extent maps and merging them
btrfs: ref-verify: fix memory leaks
Originally it was planned to create device id directories under
UUID/devinfo, but it got under UUID/devices by mistake. We really want
it under definfo so the bare device node names are not mixed with device
ids and are easy to enumerate.
Fixes: 668e48af7a ("btrfs: sysfs, add devid/dev_state kobject and device attributes")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Create directory /sys/fs/btrfs/UUID/devinfo to hold devices directories
by the id (unlike /devices).
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When there is a fiemap executing in parallel with a shrinking truncate
we can end up in a situation where we have extent maps for which we no
longer have corresponding file extent items. This is generally harmless
and at the moment the only consequences are missing file extent items
representing holes after we expand the file size again after the
truncate operation removed the prealloc extent items, and stale
information for future fiemap calls (reporting extents that no longer
exist or may have been reallocated to other files for example).
Consider the following example:
1) Our inode has a size of 128KiB, one 128KiB extent at file offset 0
and a 1MiB prealloc extent at file offset 128KiB;
2) Task A starts doing a shrinking truncate of our inode to reduce it to
a size of 64KiB. Before it searches the subvolume tree for file
extent items to delete, it drops all the extent maps in the range
from 64KiB to (u64)-1 by calling btrfs_drop_extent_cache();
3) Task B starts doing a fiemap against our inode. When looking up for
the inode's extent maps in the range from 128KiB to (u64)-1, it
doesn't find any in the inode's extent map tree, since they were
removed by task A. Because it didn't find any in the extent map
tree, it scans the inode's subvolume tree for file extent items, and
it finds the 1MiB prealloc extent at file offset 128KiB, then it
creates an extent map based on that file extent item and adds it to
inode's extent map tree (this ends up being done by
btrfs_get_extent() <- btrfs_get_extent_fiemap() <-
get_extent_skip_holes());
4) Task A then drops the prealloc extent at file offset 128KiB and
shrinks the 128KiB extent file offset 0 to a length of 64KiB. The
truncation operation finishes and we end up with an extent map
representing a 1MiB prealloc extent at file offset 128KiB, despite we
don't have any more that extent;
After this the two types of problems we have are:
1) Future calls to fiemap always report that a 1MiB prealloc extent
exists at file offset 128KiB. This is stale information, no longer
correct;
2) If the size of the file is increased, by a truncate operation that
increases the file size or by a write into a file offset > 64KiB for
example, we end up not inserting file extent items to represent holes
for any range between 128KiB and 128KiB + 1MiB, since the hole
expansion function, btrfs_cont_expand() will skip hole insertion for
any range for which an extent map exists that represents a prealloc
extent. This causes fsck to complain about missing file extent items
when not using the NO_HOLES feature.
The second issue could be often triggered by test case generic/561 from
fstests, which runs fsstress and duperemove in parallel, and duperemove
does frequent fiemap calls.
Essentially the problems happens because fiemap does not acquire the
inode's lock while truncate does, and fiemap locks the file range in the
inode's iotree while truncate does not. So fix the issue by making
btrfs_truncate_inode_items() lock the file range from the new file size
to (u64)-1, so that it serializes with fiemap.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A remount to a read-write filesystem is not safe when there's tree-log
to be replayed. Files that could be opened until now might be affected
by the changes in the tree-log.
A regular mount is needed to replay the log so the filesystem presents
the consistent view with the pending changes included.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no logged information about tree-log replay although this is
something that points to previous unclean unmount. Other filesystems
report that as well.
Suggested-by: Chris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few cases where we allow an extent map that is in an extent map
tree to be merged with other extents in the tree. Such cases include the
unpinning of an extent after the respective ordered extent completed or
after logging an extent during a fast fsync. This can lead to subtle and
dangerous problems because when doing the merge some other task might be
using the same extent map and as consequence see an inconsistent state of
the extent map - for example sees the new length but has seen the old start
offset.
With luck this triggers a BUG_ON(), and not some silent bug, such as the
following one in __do_readpage():
$ cat -n fs/btrfs/extent_io.c
3061 static int __do_readpage(struct extent_io_tree *tree,
3062 struct page *page,
(...)
3127 em = __get_extent_map(inode, page, pg_offset, cur,
3128 end - cur + 1, get_extent, em_cached);
3129 if (IS_ERR_OR_NULL(em)) {
3130 SetPageError(page);
3131 unlock_extent(tree, cur, end);
3132 break;
3133 }
3134 extent_offset = cur - em->start;
3135 BUG_ON(extent_map_end(em) <= cur);
(...)
Consider the following example scenario, where we end up hitting the
BUG_ON() in __do_readpage().
We have an inode with a size of 8KiB and 2 extent maps:
extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by
a previous transaction
extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet
persisted but writeback started for it already. The extent map
is pinned since there's writeback and an ordered extent in
progress, so it can not be merged with extent map A yet
The following sequence of steps leads to the BUG_ON():
1) The ordered extent for extent B completes, the respective page gets its
writeback bit cleared and the extent map is unpinned, at that point it
is not yet merged with extent map A because it's in the list of modified
extents;
2) Due to memory pressure, or some other reason, the MM subsystem releases
the page corresponding to extent B - btrfs_releasepage() is called and
returns 1, meaning the page can be released as it's not dirty, not under
writeback anymore and the extent range is not locked in the inode's
iotree. However the extent map is not released, either because we are
not in a context that allows memory allocations to block or because the
inode's size is smaller than 16MiB - in this case our inode has a size
of 8KiB;
3) Task B needs to read extent B and ends up __do_readpage() through the
btrfs_readpage() callback. At __do_readpage() it gets a reference to
extent map B;
4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B
while holding the write lock on the inode's extent map tree - this
results in try_merge_map() being called and since it's possible to merge
extent map B with extent map A now (the extent map B was removed from
the list of modified extents), the merging begins - it sets extent map
B's start offset to 0 (was 4KiB), but before it increments the map's
length to 8KiB (4kb + 4KiB), task A is at:
BUG_ON(extent_map_end(em) <= cur);
The call to extent_map_end() sees the extent map has a start of 0
and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so
the BUG_ON() is triggered.
So it's dangerous to modify an extent map that is in the tree, because some
other task might have got a reference to it before and still using it, and
needs to see a consistent map while using it. Generally this is very rare
since most paths that lookup and use extent maps also have the file range
locked in the inode's iotree. The fsync path is pretty much the only
exception where we don't do it to avoid serialization with concurrent
reads.
Fix this by not allowing an extent map do be merged if if it's being used
by tasks other then the one attempting to merge the extent map (when the
reference count of the extent map is greater than 2).
Reported-by: ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp>
Reported-by: Koki Mitani <koki.mitani.xg@hco.ntt.co.jp>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_ref_tree_mod(), 'ref' and 'ra' are allocated through kzalloc() and
kmalloc(), respectively. In the following code, if an error occurs, the
execution will be redirected to 'out' or 'out_unlock' and the function will
be exited. However, on some of the paths, 'ref' and 'ra' are not
deallocated, leading to memory leaks. For example, if 'action' is
BTRFS_ADD_DELAYED_EXTENT, add_block_entry() will be invoked. If the return
value indicates an error, the execution will be redirected to 'out'. But,
'ref' is not deallocated on this path, causing a memory leak.
To fix the above issues, deallocate both 'ref' and 'ra' before exiting from
the function when an error is encountered.
CC: stable@vger.kernel.org # 4.15+
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl44NzMACgkQxWXV+ddt
WDvSMA/+MYGBB/+HIET55a70hoLZR1n21joLPYat+RZnY//vBpVQtHewIHNoqBi3
EWY7rB+OWomIMiHRw4gS4nKWpds3Ou8minUnAVmbP86irfu2e1mip9rQWT6EO7ne
KWd52hb41M4ZG2Oq/2XZdMu49IGUIgUBShioJQAN7VimTI8XSX1mQn0N9pvkRk3s
IX/77kmf5jolO3/hZOJDCN6+LsI1inN6TkEH8ODKC+0ounGN+TcQQydJlfjZ+4n/
BH5G9mpm5FmFQWKp14vfyc2jknwoO9ryd7Mez5Vf70xuFCMQw+Z0ZKyLDeMLGhur
dCV3j57/+XUwsSflT/Q/cmIQZyXIdmShOxcHi9QMnax5lf6XrMgkvEjjzfGGzlzU
ey8f8hTkqkH7KM89G8pvla+DN1It4xzs9kLYczS49pWT5qn/15l7FcRMdwOR37mU
QFcDTfXEhQ5wPbpaNYDWGVycFugyyxgxBgpnSbOpgmvVS/i3qJoChUXGJrM3HMyx
Xsej/oLUnYsBEBe20mEaVp/j288NnQdMo58C+BRGtUEMC4QZM/tg3HUmGJCXqGI1
PXJx8qPPs1ZR4U4ciuWwDAim27LlD04NMGlf/r3ABFaPLMAiPKaVR93Ny0nIEkBA
iPHJD5xVKKmhJnRAWVxz3ZyyRGbjG9IB6syawYsSWMu1qH8z150=
=BuZc
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull more btrfs updates from David Sterba:
"Fixes that arrived after the merge window freeze, mostly stable
material.
- fix race in tree-mod-log element tracking
- fix bio flushing inside extent writepages
- fix assertion when in-memory tracking of discarded extents finds an
empty tree (eg. after adding a new device)
- update logic of temporary read-only block groups to take into
account overcommit
- fix some fixup worker corner cases:
- page could not go through proper COW cycle and the dirty status
is lost due to page migration
- deadlock if delayed allocation is performed under page lock
- fix send emitting invalid clones within the same file
- fix statfs reporting 0 free space when global block reserve size is
larger than remaining free space but there is still space for new
chunks"
* tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: do not zero f_bavail if we have available space
Btrfs: send, fix emission of invalid clone operations within the same file
btrfs: do not do delalloc reservation under page lock
btrfs: drop the -EBUSY case in __extent_writepage_io
Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker
btrfs: take overcommit into account in inc_block_group_ro
btrfs: fix force usage in inc_block_group_ro
btrfs: Correctly handle empty trees in find_first_clear_extent_bit
btrfs: flush write bio if we loop in extent_write_cache_pages
Btrfs: fix race between adding and putting tree mod seq elements and nodes
There was some logic added a while ago to clear out f_bavail in statfs()
if we did not have enough free metadata space to satisfy our global
reserve. This was incorrect at the time, however didn't really pose a
problem for normal file systems because we would often allocate chunks
if we got this low on free metadata space, and thus wouldn't really hit
this case unless we were actually full.
Fast forward to today and now we are much better about not allocating
metadata chunks all of the time. Couple this with d792b0f197 ("btrfs:
always reserve our entire size for the global reserve") which now means
we'll easily have a larger global reserve than our free space, we are
now more likely to trip over this while still having plenty of space.
Fix this by skipping this logic if the global rsv's space_info is not
full. space_info->full is 0 unless we've attempted to allocate a chunk
for that space_info and that has failed. If this happens then the space
for the global reserve is definitely sacred and we need to report
b_avail == 0, but before then we can just use our calculated b_avail.
Reported-by: Martin Steigerwald <martin@lichtvoll.de>
Fixes: ca8a51b3a9 ("btrfs: statfs: report zero available if metadata are exhausted")
CC: stable@vger.kernel.org # 4.5+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-By: Martin Steigerwald <martin@lichtvoll.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to benefit from s390 zlib hardware compression support,
increase the btrfs zlib workspace buffer size from 1 to 4 pages (if s390
zlib hardware support is enabled on the machine).
This brings up to 60% better performance in hardware on s390 compared to
the PAGE_SIZE buffer and much more compared to the software zlib
processing in btrfs. In case of memory pressure, fall back to a single
page buffer during workspace allocation.
The data compressed with larger input buffers will still conform to zlib
standard and thus can be decompressed also on a systems that uses only
PAGE_SIZE buffer for btrfs zlib.
Link: http://lkml.kernel.org/r/20200108105103.29028-1-zaslonko@linux.ibm.com
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Eduard Shishkin <edward6@linux.ibm.com>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When doing an incremental send and a file has extents shared with itself
at different file offsets, it's possible for send to emit clone operations
that will fail at the destination because the source range goes beyond the
file's current size. This happens when the file size has increased in the
send snapshot, there is a hole between the shared extents and both shared
extents are at file offsets which are greater the file's size in the
parent snapshot.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt/sdb
$ xfs_io -f -c "pwrite -S 0xf1 0 64K" /mnt/sdb/foobar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
$ btrfs send -f /tmp/1.snap /mnt/sdb/base
# Create a 320K extent at file offset 512K.
$ xfs_io -c "pwrite -S 0xab 512K 64K" /mnt/sdb/foobar
$ xfs_io -c "pwrite -S 0xcd 576K 64K" /mnt/sdb/foobar
$ xfs_io -c "pwrite -S 0xef 640K 64K" /mnt/sdb/foobar
$ xfs_io -c "pwrite -S 0x64 704K 64K" /mnt/sdb/foobar
$ xfs_io -c "pwrite -S 0x73 768K 64K" /mnt/sdb/foobar
# Clone part of that 320K extent into a lower file offset (192K).
# This file offset is greater than the file's size in the parent
# snapshot (64K). Also the clone range is a bit behind the offset of
# the 320K extent so that we leave a hole between the shared extents.
$ xfs_io -c "reflink /mnt/sdb/foobar 448K 192K 192K" /mnt/sdb/foobar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
$ btrfs send -p /mnt/sdb/base -f /tmp/2.snap /mnt/sdb/incr
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ btrfs receive -f /tmp/1.snap /mnt/sdc
$ btrfs receive -f /tmp/2.snap /mnt/sdc
ERROR: failed to clone extents to foobar: Invalid argument
The problem is that after processing the extent at file offset 256K, which
refers to the first 128K of the 320K extent created by the buffered write
operations, we have 'cur_inode_next_write_offset' set to 384K, which
corresponds to the end offset of the partially shared extent (256K + 128K)
and to the current file size in the receiver. Then when we process the
extent at offset 512K, we do extent backreference iteration to figure out
if we can clone the extent from some other inode or from the same inode,
and we consider the extent at offset 256K of the same inode as a valid
source for a clone operation, which is not correct because at that point
the current file size in the receiver is 384K, which corresponds to the
end of last processed extent (at file offset 256K), so using a clone
source range from 256K to 256K + 320K is invalid because that goes past
the current size of the file (384K) - this makes the receiver get an
-EINVAL error when attempting the clone operation.
So fix this by excluding clone sources that have a range that goes beyond
the current file size in the receiver when iterating extent backreferences.
A test case for fstests follows soon.
Fixes: 11f2069c11 ("Btrfs: send, allow clone operations within the same file")
CC: stable@vger.kernel.org # 5.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We ran into a deadlock in production with the fixup worker. The stack
traces were as follows:
Thread responsible for the writeout, waiting on the page lock
[<0>] io_schedule+0x12/0x40
[<0>] __lock_page+0x109/0x1e0
[<0>] extent_write_cache_pages+0x206/0x360
[<0>] extent_writepages+0x40/0x60
[<0>] do_writepages+0x31/0xb0
[<0>] __writeback_single_inode+0x3d/0x350
[<0>] writeback_sb_inodes+0x19d/0x3c0
[<0>] __writeback_inodes_wb+0x5d/0xb0
[<0>] wb_writeback+0x231/0x2c0
[<0>] wb_workfn+0x308/0x3c0
[<0>] process_one_work+0x1e0/0x390
[<0>] worker_thread+0x2b/0x3c0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
Thread of the fixup worker who is holding the page lock
[<0>] start_delalloc_inodes+0x241/0x2d0
[<0>] btrfs_start_delalloc_roots+0x179/0x230
[<0>] btrfs_alloc_data_chunk_ondemand+0x11b/0x2e0
[<0>] btrfs_check_data_free_space+0x53/0xa0
[<0>] btrfs_delalloc_reserve_space+0x20/0x70
[<0>] btrfs_writepage_fixup_worker+0x1fc/0x2a0
[<0>] normal_work_helper+0x11c/0x360
[<0>] process_one_work+0x1e0/0x390
[<0>] worker_thread+0x2b/0x3c0
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
Thankfully the stars have to align just right to hit this. First you
have to end up in the fixup worker, which is tricky by itself (my
reproducer does DIO reads into a MMAP'ed region, so not a common
operation). Then you have to have less than a page size of free data
space and 0 unallocated space so you go down the "commit the transaction
to free up pinned space" path. This was accomplished by a random
balance that was running on the host. Then you get this deadlock.
I'm still in the process of trying to force the deadlock to happen on
demand, but I've hit other issues. I can still trigger the fixup worker
path itself so this patch has been tested in that regard, so the normal
case is fine.
Fixes: 87826df0ec ("btrfs: delalloc for page dirtied out-of-band in fixup worker")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we only return 0 or -EAGAIN from btrfs_writepage_cow_fixup, we
do not need this -EBUSY case.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For COW, btrfs expects pages dirty pages to have been through a few setup
steps. This includes reserving space for the new block allocations and marking
the range in the state tree for delayed allocation.
A few places outside btrfs will dirty pages directly, especially when unmapping
mmap'd pages. In order for these to properly go through COW, we run them
through a fixup worker to wait for stable pages, and do the delalloc prep.
87826df0ec added a window where the dirty pages were cleaned, but pending
more action from the fixup worker. We clear_page_dirty_for_io() before
we call into writepage, so the page is no longer dirty. The commit
changed it so now we leave the page clean between unlocking it here and
the fixup worker starting at some point in the future.
During this window, page migration can jump in and relocate the page. Once our
fixup work actually starts, it finds page->mapping is NULL and we end up
freeing the page without ever writing it.
This leads to crc errors and other exciting problems, since it screws up the
whole statemachine for waiting for ordered extents. The fix here is to keep
the page dirty while we're waiting for the fixup worker to get to work.
This is accomplished by returning -EAGAIN from btrfs_writepage_cow_fixup
if we queued the page up for fixup, which will cause the writepage
function to redirty the page.
Because we now expect the page to be dirty once it gets to the fixup
worker we must adjust the error cases to call clear_page_dirty_for_io()
on the page. That is the bulk of the patch, but it is not the fix, the
fix is the -EAGAIN from btrfs_writepage_cow_fixup. We cannot separate
these two changes out because the error conditions change with the new
expectations.
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
inc_block_group_ro does a calculation to see if we have enough room left
over if we mark this block group as read only in order to see if it's ok
to mark the block group as read only.
The problem is this calculation _only_ works for data, where our used is
always less than our total. For metadata we will overcommit, so this
will almost always fail for metadata.
Fix this by exporting btrfs_can_overcommit, and then see if we have
enough space to remove the remaining free space in the block group we
are trying to mark read only. If we do then we can mark this block
group as read only.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For some reason we've translated the do_chunk_alloc that goes into
btrfs_inc_block_group_ro to force in inc_block_group_ro, but these are
two different things.
force for inc_block_group_ro is used when we are forcing the block group
read only no matter what, for example when the underlying chunk is
marked read only. We need to not do the space check here as this block
group needs to be read only.
btrfs_inc_block_group_ro() has a do_chunk_alloc flag that indicates that
we need to pre-allocate a chunk before marking the block group read
only. This has nothing to do with forcing, and in fact we _always_ want
to do the space check in this case, so unconditionally pass false for
force in this case.
Then fixup inc_block_group_ro to honor force as it's expected and
documented to do.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Raviu reported that running his regular fs_trim segfaulted with the
following backtrace:
[ 237.525947] assertion failed: prev, in ../fs/btrfs/extent_io.c:1595
[ 237.525984] ------------[ cut here ]------------
[ 237.525985] kernel BUG at ../fs/btrfs/ctree.h:3117!
[ 237.525992] invalid opcode: 0000 [#1] SMP PTI
[ 237.525998] CPU: 4 PID: 4423 Comm: fstrim Tainted: G U OE 5.4.14-8-vanilla #1
[ 237.526001] Hardware name: ASUSTeK COMPUTER INC.
[ 237.526044] RIP: 0010:assfail.constprop.58+0x18/0x1a [btrfs]
[ 237.526079] Call Trace:
[ 237.526120] find_first_clear_extent_bit+0x13d/0x150 [btrfs]
[ 237.526148] btrfs_trim_fs+0x211/0x3f0 [btrfs]
[ 237.526184] btrfs_ioctl_fitrim+0x103/0x170 [btrfs]
[ 237.526219] btrfs_ioctl+0x129a/0x2ed0 [btrfs]
[ 237.526227] ? filemap_map_pages+0x190/0x3d0
[ 237.526232] ? do_filp_open+0xaf/0x110
[ 237.526238] ? _copy_to_user+0x22/0x30
[ 237.526242] ? cp_new_stat+0x150/0x180
[ 237.526247] ? do_vfs_ioctl+0xa4/0x640
[ 237.526278] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 237.526283] do_vfs_ioctl+0xa4/0x640
[ 237.526288] ? __do_sys_newfstat+0x3c/0x60
[ 237.526292] ksys_ioctl+0x70/0x80
[ 237.526297] __x64_sys_ioctl+0x16/0x20
[ 237.526303] do_syscall_64+0x5a/0x1c0
[ 237.526310] entry_SYSCALL_64_after_hwframe+0x49/0xbe
That was due to btrfs_fs_device::aloc_tree being empty. Initially I
thought this wasn't possible and as a percaution have put the assert in
find_first_clear_extent_bit. Turns out this is indeed possible and could
happen when a file system with SINGLE data/metadata profile has a 2nd
device added. Until balance is run or a new chunk is allocated on this
device it will be completely empty.
In this case find_first_clear_extent_bit should return the full range
[0, -1ULL] and let the caller handle this i.e for trim the end will be
capped at the size of actual device.
Link: https://lore.kernel.org/linux-btrfs/izW2WNyvy1dEDweBICizKnd2KDwDiDyY2EYQr4YCwk7pkuIpthx-JRn65MPBde00ND6V0_Lh8mW0kZwzDiLDv25pUYWxkskWNJnVP0kgdMA=@protonmail.com/
Fixes: 45bfcfc168 ("btrfs: Implement find_first_clear_extent_bit")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There exists a deadlock with range_cyclic that has existed forever. If
we loop around with a bio already built we could deadlock with a writer
who has the page locked that we're attempting to write but is waiting on
a page in our bio to be written out. The task traces are as follows
PID: 1329874 TASK: ffff889ebcdf3800 CPU: 33 COMMAND: "kworker/u113:5"
#0 [ffffc900297bb658] __schedule at ffffffff81a4c33f
#1 [ffffc900297bb6e0] schedule at ffffffff81a4c6e3
#2 [ffffc900297bb6f8] io_schedule at ffffffff81a4ca42
#3 [ffffc900297bb708] __lock_page at ffffffff811f145b
#4 [ffffc900297bb798] __process_pages_contig at ffffffff814bc502
#5 [ffffc900297bb8c8] lock_delalloc_pages at ffffffff814bc684
#6 [ffffc900297bb900] find_lock_delalloc_range at ffffffff814be9ff
#7 [ffffc900297bb9a0] writepage_delalloc at ffffffff814bebd0
#8 [ffffc900297bba18] __extent_writepage at ffffffff814bfbf2
#9 [ffffc900297bba98] extent_write_cache_pages at ffffffff814bffbd
PID: 2167901 TASK: ffff889dc6a59c00 CPU: 14 COMMAND:
"aio-dio-invalid"
#0 [ffffc9003b50bb18] __schedule at ffffffff81a4c33f
#1 [ffffc9003b50bba0] schedule at ffffffff81a4c6e3
#2 [ffffc9003b50bbb8] io_schedule at ffffffff81a4ca42
#3 [ffffc9003b50bbc8] wait_on_page_bit at ffffffff811f24d6
#4 [ffffc9003b50bc60] prepare_pages at ffffffff814b05a7
#5 [ffffc9003b50bcd8] btrfs_buffered_write at ffffffff814b1359
#6 [ffffc9003b50bdb0] btrfs_file_write_iter at ffffffff814b5933
#7 [ffffc9003b50be38] new_sync_write at ffffffff8128f6a8
#8 [ffffc9003b50bec8] vfs_write at ffffffff81292b9d
#9 [ffffc9003b50bf00] ksys_pwrite64 at ffffffff81293032
I used drgn to find the respective pages we were stuck on
page_entry.page 0xffffea00fbfc7500 index 8148 bit 15 pid 2167901
page_entry.page 0xffffea00f9bb7400 index 7680 bit 0 pid 1329874
As you can see the kworker is waiting for bit 0 (PG_locked) on index
7680, and aio-dio-invalid is waiting for bit 15 (PG_writeback) on index
8148. aio-dio-invalid has 7680, and the kworker epd looks like the
following
crash> struct extent_page_data ffffc900297bbbb0
struct extent_page_data {
bio = 0xffff889f747ed830,
tree = 0xffff889eed6ba448,
extent_locked = 0,
sync_io = 0
}
Probably worth mentioning as well that it waits for writeback of the
page to complete while holding a lock on it (at prepare_pages()).
Using drgn I walked the bio pages looking for page
0xffffea00fbfc7500 which is the one we're waiting for writeback on
bio = Object(prog, 'struct bio', address=0xffff889f747ed830)
for i in range(0, bio.bi_vcnt.value_()):
bv = bio.bi_io_vec[i]
if bv.bv_page.value_() == 0xffffea00fbfc7500:
print("FOUND IT")
which validated what I suspected.
The fix for this is simple, flush the epd before we loop back around to
the beginning of the file during writeout.
Fixes: b293f02e14 ("Btrfs: Add writepages support")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a race between adding and removing elements to the tree mod log
list and rbtree that can lead to use-after-free problems.
Consider the following example that explains how/why the problems happens:
1) Task A has mod log element with sequence number 200. It currently is
the only element in the mod log list;
2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to
access the tree mod log. When it enters the function, it initializes
'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock'
before checking if there are other elements in the mod seq list.
Since the list it empty, 'min_seq' remains set to (u64)-1. Then it
unlocks the lock 'tree_mod_seq_lock';
3) Before task A acquires the lock 'tree_mod_log_lock', task B adds
itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a
sequence number of 201;
4) Some other task, name it task C, modifies a btree and because there
elements in the mod seq list, it adds a tree mod elem to the tree
mod log rbtree. That node added to the mod log rbtree is assigned
a sequence number of 202;
5) Task B, which is doing fiemap and resolving indirect back references,
calls btrfs get_old_root(), with 'time_seq' == 201, which in turn
calls tree_mod_log_search() - the search returns the mod log node
from the rbtree with sequence number 202, created by task C;
6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating
the mod log rbtree and finds the node with sequence number 202. Since
202 is less than the previously computed 'min_seq', (u64)-1, it
removes the node and frees it;
7) Task B still has a pointer to the node with sequence number 202, and
it dereferences the pointer itself and through the call to
__tree_mod_log_rewind(), resulting in a use-after-free problem.
This issue can be triggered sporadically with the test case generic/561
from fstests, and it happens more frequently with a higher number of
duperemove processes. When it happens to me, it either freezes the VM or
it produces a trace like the following before crashing:
[ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
[ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[ 1245.321287] RIP: 0010:rb_next+0x16/0x50
[ 1245.321307] Code: ....
[ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
[ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
[ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
[ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
[ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
[ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
[ 1245.321539] FS: 00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
[ 1245.321591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
[ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1245.321706] Call Trace:
[ 1245.321798] __tree_mod_log_rewind+0xbf/0x280 [btrfs]
[ 1245.321841] btrfs_search_old_slot+0x105/0xd00 [btrfs]
[ 1245.321877] resolve_indirect_refs+0x1eb/0xc60 [btrfs]
[ 1245.321912] find_parent_nodes+0x3dc/0x11b0 [btrfs]
[ 1245.321947] btrfs_check_shared+0x115/0x1c0 [btrfs]
[ 1245.321980] ? extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322029] extent_fiemap+0x59d/0x6d0 [btrfs]
[ 1245.322066] do_vfs_ioctl+0x45a/0x750
[ 1245.322081] ksys_ioctl+0x70/0x80
[ 1245.322092] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1245.322113] __x64_sys_ioctl+0x16/0x20
[ 1245.322126] do_syscall_64+0x5c/0x280
[ 1245.322139] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1245.322155] RIP: 0033:0x7fdee3942dd7
[ 1245.322177] Code: ....
[ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
[ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
[ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
[ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
[ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
[ 1245.322423] Modules linked in: ....
[ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---
Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum
sequence number and iterates the rbtree while holding the lock
'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock'
lock, since it is now redundant.
Fixes: bd989ba359 ("Btrfs: add tree modification log functions")
Fixes: 097b8a7c9e ("Btrfs: join tree mod log code with the code holding back delayed refs")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4vEM0ACgkQxWXV+ddt
WDvWFA//eHQZjWQ4FNby9AZwkDS4NIq2v6XGg7Xx9Ioxf1NenuBdyt1uNda2I3Mw
4/0tEnrd3XL3POldx4TTZXlg0itfXeY2un8pGREPwbqGL67296wmENUT8kwXU8aX
4zMuqO8veO7jzqeibX7T0VntnAC4Q+yWUd8Tb07KXaNH6Jeowo2rbU7n2CdI8T91
IZdc/AxnmomVQ3Y0lmQe5fL5z9KRkwud7toXek8yVXyheX1JgnqkzVjCgLyk3ai7
0VP0jfOPtC1qoMPR+buD2VuxaDTrLGkmsaG9l4XZghmmSnAvTMOxIosFA/SLqHwp
LPDZ5RABbxMj1MHTAr2Bq+9uokqSIIdbSWFiMazeJsL/JWOiOQk0BekNJudeL7Ow
PtGIZkKC+c3qKIjLymnd+u56klCe/O4vhx7wOHjoGxC0j+qRN473LK8dmGmgM9Yv
XH5YfWqDpjt5h7SSei94S7JrqaJ29+gY8bMtMXJhx46jHh6AdXAkszU83kLr8Biz
mvV49ECib12RuI4EVB4I3XedCftzMRH5ocN/Y3B2qGn/bU3g9nuExAc24FRBt6QK
e/ucc4jQyRpbvpDwsmdE+kSxUvENuIRnlyke9XdOsN9Mth0X85gvYOS6aFS+4qjq
QiXborlp+fZY00INYjG6Ck9UyfpdSEUJg452xVkGBhd/emAAIDo=
=FGXQ
-----END PGP SIGNATURE-----
Merge tag 'fs-dedupe-last-block-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull fs deduplication fix from David Sterba:
"This is a fix for deduplication bug: the last block of two files is
allowed to deduplicated. This got broken in 5.1 by lifting some
generic checks to VFS layer. The affected filesystems are btrfs and
xfs.
The patches are marked for stable as the bug decreases deduplication
effectivity"
* tag 'fs-dedupe-last-block-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: make deduplication with range including the last block work
fs: allow deduplication of eof block into the end of the destination file
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4vDYkACgkQxWXV+ddt
WDsNJQ//WJEcYoRpN5Y7oOIk/vo5ulF68P3kUh3hl206A13xpaHorvTvZKAD5s2o
C6xACJk839sGEhMdDRWvdeBDCHTedMk7EXjiZ6kJD+7EPpWmDllI5O6DTolT7SR2
b9zId4KCO+m8LiLZccRsxCJbdkJ7nJnz2c5+063TjsS3uq1BFudctRUjW/XnFCCZ
JIE5iOkdXrA+bFqc+l2zKTwgByQyJg+hVKRTZEJBT0QZsyNQvHKzXAmXxGopW8bO
SeuzFkiFTA0raK8xBz6mUwaZbk40Qlzm9v9AitFZx0x2nvQnMu447N3xyaiuyDWd
Li1aMN0uFZNgSz+AemuLfG0Wj70x1HrQisEj958XKzn4cPpUuMcc3lr1PZ2NIX+C
p6pSgaLOEq8Rc0U78/euZX6oyiLJPAmQO1TdkVMHrcMi36esBI6uG11rds+U+xeK
XoP20qXLFVYLLrl3wH9F4yIzydfMYu66Us1AeRPRB14NSSa7tbCOG//aCafOoLM6
518sJCazSWlv1kDewK8dtLiXc8eM6XJN+KI4NygFZrUj2Rq376q5oovUUKKkn3iN
pdHtF/7gAxIx6bZ+jY/gyt/Xe5AdPi7sKggahvrSOL3X+LLINwC4r+vAnnpd6yh4
NfJj5fobvc/mO9PEVMwgJ8PmHw5uNqeMlORGjk7stQs7Oez3tCw=
=4OkE
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features, highlights:
- async discard
- "mount -o discard=async" to enable it
- freed extents are not discarded immediatelly, but grouped
together and trimmed later, with IO rate limiting
- the "sync" mode submits short extents that could have been
ignored completely by the device, for SATA prior to 3.1 the
requests are unqueued and have a big impact on performance
- the actual discard IO requests have been moved out of
transaction commit to a worker thread, improving commit latency
- IO rate and request size can be tuned by sysfs files, for now
enabled only with CONFIG_BTRFS_DEBUG as we might need to
add/delete the files and don't have a stable-ish ABI for
general use, defaults are conservative
- export device state info in sysfs, eg. missing, writeable
- no discard of extents known to be untouched on disk (eg. after
reservation)
- device stats reset is logged with process name and PID that called
the ioctl
Fixes:
- fix missing hole after hole punching and fsync when using NO_HOLES
- writeback: range cyclic mode could miss some dirty pages and lead
to OOM
- two more corner cases for metadata_uuid change after power loss
during the change
- fix infinite loop during fsync after mix of rename operations
Core changes:
- qgroup assign returns ENOTCONN when quotas not enabled, used to
return EINVAL that was confusing
- device closing does not need to allocate memory anymore
- snapshot aware code got removed, disabled for years due to
performance problems, reimplmentation will allow to select wheter
defrag breaks or does not break COW on shared extents
- tree-checker:
- check leaf chunk item size, cross check against number of
stripes
- verify location keys for DIR_ITEM, DIR_INDEX and XATTR items
- new self test for physical -> logical mapping code, used for super
block range exclusion
- assertion helpers/macros updated to avoid objtool "unreachable
code" reports on older compilers or config option combinations"
* tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (84 commits)
btrfs: free block groups after free'ing fs trees
btrfs: Fix split-brain handling when changing FSID to metadata uuid
btrfs: Handle another split brain scenario with metadata uuid feature
btrfs: Factor out metadata_uuid code from find_fsid.
btrfs: Call find_fsid from find_fsid_inprogress
Btrfs: fix infinite loop during fsync after rename operations
btrfs: set trans->drity in btrfs_commit_transaction
btrfs: drop log root for dropped roots
btrfs: sysfs, add devid/dev_state kobject and device attributes
btrfs: Refactor btrfs_rmap_block to improve readability
btrfs: Add self-tests for btrfs_rmap_block
btrfs: selftests: Add support for dummy devices
btrfs: Move and unexport btrfs_rmap_block
btrfs: separate definition of assertion failure handlers
btrfs: device stats, log when stats are zeroed
btrfs: fix improper setting of scanned for range cyclic write cache pages
btrfs: safely advance counter when looking up bio csums
btrfs: remove unused member btrfs_device::work
btrfs: remove unnecessary wrapper get_alloc_profile
btrfs: add correction to handle -1 edge case in async discard
...
Pull scheduler updates from Ingo Molnar:
"These were the main changes in this cycle:
- More -rt motivated separation of CONFIG_PREEMPT and
CONFIG_PREEMPTION.
- Add more low level scheduling topology sanity checks and warnings
to filter out nonsensical topologies that break scheduling.
- Extend uclamp constraints to influence wakeup CPU placement
- Make the RT scheduler more aware of asymmetric topologies and CPU
capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y
- Make idle CPU selection more consistent
- Various fixes, smaller cleanups, updates and enhancements - please
see the git log for details"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
sched/fair: Define sched_idle_cpu() only for SMP configurations
sched/topology: Assert non-NUMA topology masks don't (partially) overlap
idle: fix spelling mistake "iterrupts" -> "interrupts"
sched/fair: Remove redundant call to cpufreq_update_util()
sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled
sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
sched/fair: calculate delta runnable load only when it's needed
sched/cputime: move rq parameter in irqtime_account_process_tick
stop_machine: Make stop_cpus() static
sched/debug: Reset watchdog on all CPUs while processing sysrq-t
sched/core: Fix size of rq::uclamp initialization
sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
sched/fair: Load balance aggressively for SCHED_IDLE CPUs
sched/fair : Improve update_sd_pick_busiest for spare capacity case
watchdog: Remove soft_lockup_hrtimer_cnt and related code
sched/rt: Make RT capacity-aware
sched/fair: Make EAS wakeup placement consider uclamp restrictions
sched/fair: Make task_fits_capacity() consider uclamp restrictions
sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
sched/uclamp: Make uclamp util helpers use and return UL values
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4sLasACgkQxWXV+ddt
WDsegg/8CBQ1/pGj+8mvf+ws6f71Av8jspY2Ebr+HCjaGhD2MG3HI1kA5gC9Qnbb
fQVd12M5ma2BTrIcszxwm+VMIMlDotRFzfAp8uuFJtW0aAEGMCboX6VRYWa/4I0o
SmgJg0RYh926VL73qSe3S72pfIYjar30RwjVIVTmsHxL/D/lEkrHg6IGKRCe/MaN
eQipth3iuFtcWmGm1+DxEySsOs7AMPg3wL8KVnQcYoDI2kg3BXFH9a4wTE6VmWsU
ZjonJBA/Rl8oA2YOVDum4mL5j2c5RulWEymdVKyo1oH+8kLDOQ8snd7Bxp3qtJ1C
gdVbS8gi7gT5/C+yex+ZWlAdfmCSGWj7dr7jjiELZhTrsBhtS7y+GM52GivSrJ3z
TciNQtF/Y0SrZGprPMgVGAHuIKWWwSmWJPmkRB4zv/5efFFdKg8/UmcRmh6dMo83
IF4VPEBQgJLj3ja9Wns5yvW9asKNcynGeFK7aV+BlGW/wuvBW9o017c4Q04dXSAK
iFpipJaR/6ZGmXlRQLa1uyKWVHNIfSFT47WJqa6Dbo6iWRE/S/MhfkZU42z2A3H9
O2qMWmZikZnPCkha6fWyNJEDxF3imC+/LBsYoEuVPR7kZ/irDnI1cJNsTocOlyj1
kgFtL5MnCBHCop9/tPGiVdin9ilHJs3q2kAkR5BNCSEqhC8mo4g=
=IPUk
-----END PGP SIGNATURE-----
Merge tag 'for-5.5-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"Here's a last minute fix for a regression introduced in this
development cycle.
There's a small chance of a silent corruption when device replace and
NOCOW data writes happen at the same time in one block group. Metadata
or COW data writes are unaffected.
The extra fixup patch is there to silence an unnecessary warning"
* tag 'for-5.5-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: dev-replace: remove warning for unknown return codes when finished
btrfs: scrub: Require mandatory block group RO for dev-replace
The fstests btrfs/011 triggered a warning at the end of device replace,
[ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
[ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
[ 1892.059993] ------------[ cut here ]------------
[ 1892.063032] WARNING: CPU: 2 PID: 2244 at fs/btrfs/dev-replace.c:506 btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
[ 1892.074346] CPU: 2 PID: 2244 Comm: btrfs Not tainted 5.5.0-rc7-default+ #942
[ 1892.079956] RIP: 0010:btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
[ 1892.096576] RSP: 0018:ffffbb58c7b3fd10 EFLAGS: 00010286
[ 1892.098311] RAX: 00000000ffffffe4 RBX: 0000000000000001 RCX: 8888888888888889
[ 1892.100342] RDX: 0000000000000001 RSI: ffff9e889645f5d8 RDI: ffffffff92821080
[ 1892.102291] RBP: ffff9e889645c000 R08: 000001b8878fe1f6 R09: 0000000000000000
[ 1892.104239] R10: ffffbb58c7b3fd08 R11: 0000000000000000 R12: ffff9e88a0017000
[ 1892.106434] R13: ffff9e889645f608 R14: ffff9e88794e1000 R15: ffff9e88a07b5200
[ 1892.108642] FS: 00007fcaed3f18c0(0000) GS:ffff9e88bda00000(0000) knlGS:0000000000000000
[ 1892.111558] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1892.113492] CR2: 00007f52509ff420 CR3: 00000000603dd002 CR4: 0000000000160ee0
[ 1892.115814] Call Trace:
[ 1892.116896] btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs]
[ 1892.118962] btrfs_ioctl+0x1d62/0x2550 [btrfs]
caused by the previous patch ("btrfs: scrub: Require mandatory block
group RO for dev-replace"). Hitting ENOSPC is possible and could happen
when the block group is set read-only, preventing NOCOW writes to the
area that's being accessed by dev-replace.
This has happend with scratch devices of size 12G but not with 5G and
20G, so this is depends on timing and other activity on the filesystem.
The whole replace operation is restartable, the space state should be
examined by the user in any case.
The error code is propagated back to the ioctl caller so the kernel
warning is causing false alerts.
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
looped runs can lead to random failure, where scrub finds csum error.
The possibility is not high, around 1/20 to 1/100, but it's causing data
corruption.
The bug is observable after commit b12de52896 ("btrfs: scrub: Don't
check free space before marking a block group RO")
[CAUSE]
Dev-replace has two source of writes:
- Write duplication
All writes to source device will also be duplicated to target device.
Content: Not yet persisted data/meta
- Scrub copy
Dev-replace reused scrub code to iterate through existing extents, and
copy the verified data to target device.
Content: Previously persisted data and metadata
The difference in contents makes the following race possible:
Regular Writer | Dev-replace
-----------------------------------------------------------------
^ |
| Preallocate one data extent |
| at bytenr X, len 1M |
v |
^ Commit transaction |
| Now extent [X, X+1M) is in |
v commit root |
================== Dev replace starts =========================
| ^
| | Scrub extent [X, X+1M)
| | Read [X, X+1M)
| | (The content are mostly garbage
| | since it's preallocated)
^ | v
| Write back happens for |
| extent [X, X+512K) |
| New data writes to both |
| source and target dev. |
v |
| ^
| | Scrub writes back extent [X, X+1M)
| | to target device.
| | This will over write the new data in
| | [X, X+512K)
| v
This race can only happen for nocow writes. Thus metadata and data cow
writes are safe, as COW will never overwrite extents of previous
transaction (in commit root).
This behavior can be confirmed by disabling all fallocate related calls
in fsstress (*), then all related tests can pass a 2000 run loop.
*: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
-f collapse=0 -f punch=0 -f resvsp=0"
I didn't expect resvsp ioctl will fallback to fallocate in VFS...
[FIX]
Make dev-replace to require mandatory block group RO, and wait for current
nocow writes before calling scrub_chunk().
This patch will mostly revert commit 76a8efa171 ("btrfs: Continue replace
when set_block_ro failed") for dev-replace path.
The side effect is, dev-replace can be more strict on avaialble space, but
definitely worth to avoid data corruption.
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 76a8efa171 ("btrfs: Continue replace when set_block_ro failed")
Fixes: b12de52896 ("btrfs: scrub: Don't check free space before marking a block group RO")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since btrfs was migrated to use the generic VFS helpers for clone and
deduplication, it stopped allowing for the last block of a file to be
deduplicated when the source file size is not sector size aligned (when
eof is somewhere in the middle of the last block). There are two reasons
for that:
1) The generic code always rounds down, to a multiple of the block size,
the range's length for deduplications. This means we end up never
deduplicating the last block when the eof is not block size aligned,
even for the safe case where the destination range's end offset matches
the destination file's size. That rounding down operation is done at
generic_remap_check_len();
2) Because of that, the btrfs specific code does not expect anymore any
non-aligned range length's for deduplication and therefore does not
work if such nona-aligned length is given.
This patch addresses that second part, and it depends on a patch that
fixes generic_remap_check_len(), in the VFS, which was submitted ealier
and has the following subject:
"fs: allow deduplication of eof block into the end of the destination file"
These two patches address reports from users that started seeing lower
deduplication rates due to the last block never being deduplicated when
the file size is not aligned to the filesystem's block size.
Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sometimes when running generic/475 we would trip the
WARN_ON(cache->reserved) check when free'ing the block groups on umount.
This is because sometimes we don't commit the transaction because of IO
errors and thus do not cleanup the tree logs until at umount time.
These blocks are still reserved until they are cleaned up, but they
aren't cleaned up until _after_ we do the free block groups work. Fix
this by moving the free after free'ing the fs roots, that way all of the
tree logs are cleaned up and we have a properly cleaned fs. A bunch of
loops of generic/475 confirmed this fixes the problem.
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current code doesn't correctly handle the situation which arises when
a file system that has METADATA_UUID_INCOMPAT flag set and has its FSID
changed to the one in metadata uuid. This causes the incompat flag to
disappear.
In case of a power failure we could end up in a situation where part of
the disks in a multi-disk filesystem are correctly reverted to
METADATA_UUID_INCOMPAT flag unset state, while others have
METADATA_UUID_INCOMPAT set and CHANGING_FSID_V2_IN_PROGRESS.
This patch corrects the behavior required to handle the case where a
disk of the second type is scanned first, creating the necessary
btrfs_fs_devices. Subsequently, when a disk which has already completed
the transition is scanned it should overwrite the data in
btrfs_fs_devices.
Reported-by: Su Yue <Damenly_Su@gmx.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is one more cases which isn't handled by the original metadata
uuid work. Namely, when a filesystem has METADATA_UUID incompat bit and
the user decides to change the FSID to the original one e.g. have
metadata_uuid and fsid match. In case of power failure while this
operation is in progress we could end up in a situation where some of
the disks have the incompat bit removed and the other half have both
METADATA_UUID_INCOMPAT and FSID_CHANGING_IN_PROGRESS flags.
This patch handles the case where a disk that has successfully changed
its FSID such that it equals METADATA_UUID is scanned first.
Subsequently when a disk with both
METADATA_UUID_INCOMPAT/FSID_CHANGING_IN_PROGRESS flags is scanned
find_fsid_changed won't be able to find an appropriate btrfs_fs_devices.
This is done by extending find_fsid_changed to correctly find
btrfs_fs_devices whose metadata_uuid/fsid are the same and they match
the metadata_uuid of the currently scanned device.
Fixes: cc5de4e702 ("btrfs: Handle final split-brain possibility during fsid change")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reported-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
find_fsid became rather hairy with the introduction of metadata uuid
changing feature. Alleviate this by factoring out the metadata uuid
specific code in a dedicated function which deals with finding
correct fsid for a device with changed uuid.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since find_fsid_inprogress should also handle the case in which an fs
didn't change its FSID make it call find_fsid directly. This makes the
code in device_list_add simpler by eliminating a conditional call of
find_fsid. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Recently fsstress (from fstests) sporadically started to trigger an
infinite loop during fsync operations. This turned out to be because
support for the rename exchange and whiteout operations was added to
fsstress in fstests. These operations, unlike any others in fsstress,
cause file names to be reused, whence triggering this issue. However
it's not necessary to use rename exchange and rename whiteout operations
trigger this issue, simple rename operations and file creations are
enough to trigger the issue.
The issue boils down to when we are logging inodes that conflict (that
had the name of any inode we need to log during the fsync operation), we
keep logging them even if they were already logged before, and after
that we check if there's any other inode that conflicts with them and
then add it again to the list of inodes to log. Skipping already logged
inodes fixes the issue.
Consider the following example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/testdir # inode 257
$ touch /mnt/testdir/zz # inode 258
$ ln /mnt/testdir/zz /mnt/testdir/zz_link
$ touch /mnt/testdir/a # inode 259
$ sync
# The following 3 renames achieve the same result as a rename exchange
# operation (<rename_exchange> /mnt/testdir/zz_link to /mnt/testdir/a).
$ mv /mnt/testdir/a /mnt/testdir/a/tmp
$ mv /mnt/testdir/zz_link /mnt/testdir/a
$ mv /mnt/testdir/a/tmp /mnt/testdir/zz_link
# The following rename and file creation give the same result as a
# rename whiteout operation (<rename_whiteout> zz to a2).
$ mv /mnt/testdir/zz /mnt/testdir/a2
$ touch /mnt/testdir/zz # inode 260
$ xfs_io -c fsync /mnt/testdir/zz
--> results in the infinite loop
The following steps happen:
1) When logging inode 260, we find that its reference named "zz" was
used by inode 258 in the previous transaction (through the commit
root), so inode 258 is added to the list of conflicting indoes that
need to be logged;
2) After logging inode 258, we find that its reference named "a" was
used by inode 259 in the previous transaction, and therefore we add
inode 259 to the list of conflicting inodes to be logged;
3) After logging inode 259, we find that its reference named "zz_link"
was used by inode 258 in the previous transaction - we add inode 258
to the list of conflicting inodes to log, again - we had already
logged it before at step 3. After logging it again, we find again
that inode 259 conflicts with him, and we add again 259 to the list,
etc - we end up repeating all the previous steps.
So fix this by skipping logging of conflicting inodes that were already
logged.
Fixes: 6b5fc433a7 ("Btrfs: fix fsync after succession of renames of different files")
CC: stable@vger.kernel.org # 5.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we abort a transaction we have the following sequence
if (!trans->dirty && list_empty(&trans->new_bgs))
return;
WRITE_ONCE(trans->transaction->aborted, err);
The idea being if we didn't modify anything with our trans handle then
we don't really need to abort the whole transaction, maybe the other
trans handles are fine and we can carry on.
However in the case of create_snapshot we add a pending_snapshot object
to our transaction and then commit the transaction. We don't actually
modify anything. sync() behaves the same way, attach to an existing
transaction and commit it. This means that if we have an IO error in
the right places we could abort the committing transaction with our
trans->dirty being not set and thus not set transaction->aborted.
This is a problem because in the create_snapshot() case we depend on
pending->error being set to something, or btrfs_commit_transaction
returning an error.
If we are not the trans handle that gets to commit the transaction, and
we're waiting on the commit to happen we get our return value from
cur_trans->aborted. If this was not set to anything because sync() hit
an error in the transaction commit before it could modify anything then
cur_trans->aborted would be 0. Thus we'd return 0 from
btrfs_commit_transaction() in create_snapshot.
This is a problem because we then try to do things with
pending_snapshot->snap, which will be NULL because we didn't create the
snapshot, and then we'll get a NULL pointer dereference like the
following
"BUG: kernel NULL pointer dereference, address: 00000000000001f0"
RIP: 0010:btrfs_orphan_cleanup+0x2d/0x330
Call Trace:
? btrfs_mksubvol.isra.31+0x3f2/0x510
btrfs_mksubvol.isra.31+0x4bc/0x510
? __sb_start_write+0xfa/0x200
? mnt_want_write_file+0x24/0x50
btrfs_ioctl_snap_create_transid+0x16c/0x1a0
btrfs_ioctl_snap_create_v2+0x11e/0x1a0
btrfs_ioctl+0x1534/0x2c10
? free_debug_processing+0x262/0x2a3
do_vfs_ioctl+0xa6/0x6b0
? do_sys_open+0x188/0x220
? syscall_trace_enter+0x1f8/0x330
ksys_ioctl+0x60/0x90
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x4a/0x1b0
In order to fix this we need to make sure anybody who calls
commit_transaction has trans->dirty set so that they properly set the
trans->transaction->aborted value properly so any waiters know bad
things happened.
This was found while I was running generic/475 with my modified
fsstress, it reproduced within a few runs. I ran with this patch all
night and didn't see the problem again.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fsync on a subvolume and create a log root for that volume, and
then later delete that subvolume we'll never clean up its log root. Fix
this by making switch_commit_roots free the log for any dropped roots we
encounter. The extra churn is because we need a btrfs_trans_handle, not
the btrfs_transaction.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
New sysfs attributes that track the filesystem status of devices, stored
in the per-filesystem directory in /sys/fs/btrfs/FSID/devinfo . There's
a directory for each device, with name corresponding to the numerical
device id.
in_fs_metadata - device is in the list of fs metadata
missing - device is missing (no device node or block device)
replace_target - device is target of replace
writeable - writes from fs are allowed
These attributes reflect the state of the device::dev_state and created
at mount time.
Sample output:
$ pwd
/sys/fs/btrfs/6e1961f1-5918-4ecc-a22f-948897b409f7/devinfo/1/
$ ls
in_fs_metadata missing replace_target writeable
$ cat missing
0
The output from these attributes are 0 or 1. 0 indicates unset and 1
indicates set. These attributes are readonly.
It is observed that the device delete thread and sysfs read thread will
not race because the delete thread calls sysfs kobject_put() which in
turn waits for existing sysfs read to complete.
Note for device replace devid swap:
During the replace the target device temporarily assumes devid 0 before
assigning the devid of the soruce device.
In btrfs_dev_replace_finishing() we remove source sysfs devid using the
function btrfs_sysfs_remove_devices_attr(), so after that call
kobject_rename() to update the devid in the sysfs. This adds and calls
btrfs_sysfs_update_devid() helper function to update the device id.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Move variables to appropriate scope. Remove last BUG_ON in the function
and rework error handling accordingly. Make the duplicate detection code
more straightforward. Use in_range macro. And give variables more
descriptive name by explicitly distinguishing between IO stripe size
(size recorded in the chunk item) and data stripe size (the size of
an actual stripe, constituting a logical chunk/block group).
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add RAID1 and single testcases to verify that data stripes are excluded
from super block locations and that the address mapping is valid.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Add basic infrastructure to create and link dummy btrfs_devices. This
will be used in the pending btrfs_rmap_block test which deals with
the block groups.
Calling btrfs_alloc_dummy_device will link the newly created device to
the passed fs_info and the test framework will free them once the test
is finished.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's used only during initial block group reading to map physical
address of super block to a list of logical ones. Make it private to
block-group.c, add proper kernel doc and ensure it's exported only for
tests.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a report where objtool detects unreachable instructions, eg.:
fs/btrfs/ctree.o: warning: objtool: btrfs_search_slot()+0x2d4: unreachable instruction
This seems to be a false positive due to compiler version. The cause is
in the ASSERT macro implementation that does the conditional check as
IS_DEFINED(CONFIG_BTRFS_ASSERT) and not an #ifdef.
To avoid that, use the ifdefs directly.
There are still 2 reports that aren't fixed:
fs/btrfs/extent_io.o: warning: objtool: __set_extent_bit()+0x71f: unreachable instruction
fs/btrfs/relocation.o: warning: objtool: find_data_references()+0x4e0: unreachable instruction
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David Sterba <dsterba@suse.com>
We had a report indicating that some read errors aren't reported by the
device stats in the userland. It is important to have the errors
reported in the device stat as user land scripts might depend on it to
take the reasonable corrective actions. But to debug these issue we need
to be really sure that request to reset the device stat did not come
from the userland itself. So log an info message when device error reset
happens.
For example:
BTRFS info (device sdc): device stats zeroed by btrfs(9223)
Reported-by: philip@philip-seeger.de
Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We noticed that we were having regular CG OOM kills in cases where there
was still enough dirty pages to avoid OOM'ing. It turned out there's
this corner case in btrfs's handling of range_cyclic where files that
were being redirtied were not getting fully written out because of how
we do range_cyclic writeback.
We unconditionally were setting scanned = 1; the first time we found any
pages in the inode. This isn't actually what we want, we want it to be
set if we've scanned the entire file. For range_cyclic we could be
starting in the middle or towards the end of the file, so we could write
one page and then not write any of the other dirty pages in the file
because we set scanned = 1.
Fix this by not setting scanned = 1 if we find pages. The rules for
setting scanned should be
1) !range_cyclic. In this case we have a specified range to write out.
2) range_cyclic && index == 0. In this case we've started at the
beginning and there is no need to loop around a second time.
3) range_cyclic && we started at index > 0 and we've reached the end of
the file without satisfying our nr_to_write.
This patch fixes both of our writepages implementations to make sure
these rules hold true. This fixed our over zealous CG OOMs in
production.
Fixes: d1310b2e0c ("Btrfs: Split the extent_map code into two parts")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Dan's smatch tool reports
fs/btrfs/file-item.c:295 btrfs_lookup_bio_sums()
warn: should this be 'count == -1'
which points to the while (count--) loop. With count == 0 the check
itself could decrement it to -1. There's a WARN_ON a few lines below
that has never been seen in practice though.
It turns out that the value of page_bytes_left matches the count (by
sectorsize multiples). The loop never reaches the state where count
would go to -1, because page_bytes_left == 0 is found first and this
breaks out.
For clarity, use only plain check on count (and only for positive
value), decrement safely inside the loop. Any other discrepancy after
the whole bio list processing should be reported by the exising
WARN_ON_ONCE as well.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a leftover from recently removed bio scheduling framework.
Fixes: ba8a9d0795 ("Btrfs: delete the entire async bio submission framework")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_get_alloc_profile() is a simple wrapper over get_alloc_profile().
The only difference is btrfs_get_alloc_profile() is visible to other
functions in btrfs while get_alloc_profile() is static and thus only
visible to functions in block-group.c.
Let's just fold get_alloc_profile() into btrfs_get_alloc_profile() to
get rid of the unnecessary second function.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
From Dave's testing described below, it's possible to drive a file
system to have bogus values of discardable_extents and _bytes. As
btrfs_discard_calc_delay() is the only user of discardable_extents, we
can correct here for any negative discardable_extents/discardable_bytes.
The problem is not reliably reproducible. The workload that created it
was based on linux git tree, switching between release tags, then
everytihng deleted followed by a full rebalance. At this state the
values of discardable_bytes was 16K and discardable_extents was -1,
expected values 0 and 0.
Repeating the workload again did not correct the bogus values so the
offset seems to be stable once it happens.
Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Most callers of free_bitmap() only call it if bitmap_info->bytes is 0.
However, there are certain cases where we may free the free space cache
via __btrfs_remove_free_space_cache(). This exposes a path where
free_bitmap() is called regardless. This may result in a bad accounting
situation for discardable_bytes and discardable_extents. So, remove the
stats and call btrfs_discard_update_discardable().
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
It's less than ideal for small extents to eat into our extent budget, so
force extents <= 32KB into the bitmaps save for the first handful.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, there is no way for the free space cache to recover from
being serviced by purely bitmaps because the extent threshold is set to
0 in recalculate_thresholds() when we surpass the metadata allowance.
This adds a recovery mechanism by keeping large extents out of the
bitmaps and increases the metadata upper bound to 64KB. The recovery
mechanism bypasses this upper bound, thus making it a soft upper bound.
But, with the bypass being 1MB or greater, it shouldn't add unbounded
overhead.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Give a brief overview for how async discard is implemented.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Keep track of how much we are discarding and how often we are reusing
with async discard. The discard_*_bytes values don't need any special
protection because the work item provides the single threaded access.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As mentioned earlier, discarding data can be done either by issuing an
explicit discard or implicitly by reusing the LBA. Metadata block_groups
see much more frequent reuse due to well it being metadata. So instead
of explicitly discarding metadata block_groups, just leave them be and
let the latter implicit discarding be done for them.
For mixed block_groups, block_groups which contain both metadata and
data, we let them be as higher fragmentation is expected.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Non-block group destruction discarding currently only had a single list
with no minimum discard length. This can lead to caravaning more
meaningful discards behind a heavily fragmented block group.
This adds support for multiple lists with minimum discard lengths to
prevent the caravan effect. We promote block groups back up when we
exceed the BTRFS_ASYNC_DISCARD_MAX_FILTER size, currently we support
only 2 lists with filters of 1MB and 32KB respectively.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Expose max_discard_size as a tunable via sysfs and switch the current
fixed maximum to the default value.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Throttle the maximum size of a discard so that we can provide an upper
bound for the rate of async discard. While the block layer is able to
split discards into the appropriate sized discards, we want to be able
to account more accurately the rate at which we are consuming NCQ slots
as well as limit the upper bound of work for a discard.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Provide the ability to rate limit based on kbps in addition to iops as
additional guides for the target discard rate. The delay used ends up
being max(kbps_delay, iops_delay).
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
An earlier patch keeps track of discardable_extents. These are
undiscarded extents managed by the free space cache. Here, we will use
this to dynamically calculate the discard delay interval.
There are 3 rate to consider. The first is the target convergence rate,
the rate to discard all discardable_extents over the
BTRFS_DISCARD_TARGET_MSEC time frame. This is clamped by the lower
limit, the iops limit or BTRFS_DISCARD_MIN_DELAY (1ms), and the upper
limit, BTRFS_DISCARD_MAX_DELAY (1s). We reevaluate this delay every
transaction commit.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Keep track of this metric so that we can understand how ahead or behind
we are in discarding rate. This uses the same accounting method as
discardable_extents, deltas between previous/current values and
propagating them up.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The number of discardable extents will serve as the rate limiting metric
for how often we should discard. This keeps track of discardable extents
in the free space caches by maintaining deltas and propagating them to
the global count.
The deltas are calculated from 2 values stored in PREV and CURR entries,
then propagated up to the global discard ctl. The current counter value
becomes the previous counter value after update.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Setup base sysfs directory for discard stats + tunables.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs only allowed attributes to be exposed in debug/. Let's let other
groups be created by making debug its own kobject.
This also makes the per-fs debug options separate from the global
features mount attributes. This seems to be needed as
sysfs_create_files() requires const struct attribute * while
sysfs_create_group() can take struct attribute *. This seems nicer as
per file system, you'll probably use to_fs_info().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We probably should call sysfs_remove_group() on debug/.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The prior two patches added discarding via a background workqueue. This
just piggybacked off of the fstrim code to trim the whole block at once.
Well inevitably this is worse performance wise and will aggressively
overtrim. But it was nice to plumb the other infrastructure to keep the
patches easier to review.
This adds the real goal of this series which is discarding slowly (ie. a
slow long running fstrim). The discarding is split into two phases,
extents and then bitmaps. The reason for this is two fold. First, the
bitmap regions overlap the extent regions. Second, discarding the
extents first will let the newly trimmed bitmaps have the highest chance
of coalescing when being readded to the free space cache.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.
Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.
The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unusued_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.
The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When discard is enabled, everytime a pinned extent is released back to
the block_group's free space cache, a discard is issued for the extent.
This is an overeager approach when it comes to discarding and helping
the SSD maintain enough free space to prevent severe garbage collection
situations.
This adds the beginning of async discard. Instead of issuing a discard
prior to returning it to the free space, it is just marked as untrimmed.
The block_group is then added to a LRU which then feeds into a workqueue
to issue discards at a much slower rate. Full discarding of unused block
groups is still done and will be addressed in a future patch of the
series.
For now, we don't persist the discard state of extents and bitmaps.
Therefore, our failure recovery mode will be to consider extents
untrimmed. This lets us handle failure and unmounting as one in the
same.
On a number of Facebook webservers, I collected data every minute
accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
is where we discard extents synchronously before returning them to the
free space cache.
discard=sync:
p99 total per minute p99 total per minute
Drive | extent_commit() (ms) | commit_trans() (ms)
---------------------------------------------------------------
Drive A | 434 | 1170
Drive B | 880 | 2330
Drive C | 2943 | 3920
Drive D | 4763 | 5701
discard=async:
p99 total per minute p99 total per minute
Drive | extent_commit() (ms) | commit_trans() (ms)
--------------------------------------------------------------
Drive A | 134 | 956
Drive B | 64 | 1972
Drive C | 59 | 1032
Drive D | 62 | 1200
While it's not great that the stats are cumulative over 1m, all of these
servers are running the same workload and and the delta between the two
are substantial. We are spending significantly less time in
btrfs_finish_extent_commit() which is responsible for discarding.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a cap in btrfs in the amount of free extents that a block group
can have. When it surpasses that threshold, future extents are placed
into bitmaps. Instead of keeping track of if a certain bit is trimmed or
not in a second bitmap, keep track of the relative state of the bitmap.
With async discard, trimming bitmaps becomes a more frequent operation.
As a trade off with simplicity, we keep track of if discarding a bitmap
is in progress. If we fully scan a bitmap and trim as necessary, the
bitmap is marked clean. This has some caveats as the min block size may
skip over regions deemed too small. But this should be a reasonable
trade off rather than keeping a second bitmap and making allocation
paths more complex. The downside is we may overtrim, but ideally the min
block size should prevent us from doing that too often and getting stuck
trimming pathological cases.
BTRFS_TRIM_STATE_TRIMMING is added to indicate a bitmap is in the
process of being trimmed. If additional free space is added to that
bitmap, the bit is cleared. A bitmap will be marked
BTRFS_TRIM_STATE_TRIMMED if the trimming code was able to reach the end
of it and the former is still set.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Async discard will use the free space cache as backing knowledge for
which extents to discard. This patch plumbs knowledge about which
extents need to be discarded into the free space cache from
unpin_extent_range().
An untrimmed extent can merge with everything as this is a new region.
Absorbing trimmed extents is a tradeoff to for greater coalescing which
makes life better for find_free_extent(). Additionally, it seems the
size of a trim isn't as problematic as the trim io itself.
When reading in the free space cache from disk, if sync is set, mark all
extents as trimmed. The current code ensures at transaction commit that
all free space is trimmed when sync is set, so this reflects that.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This series introduces async discard which will use the flag
DISCARD_ASYNC, so rename the original flag to DISCARD_SYNC as it is
synchronously done in transaction commit.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
There is a user report in the mail list, showing the following corrupted
tree blocks:
item 62 key (486836 DIR_ITEM 2543451757) itemoff 6273 itemsize 74
location key (4065004 INODE_ITEM 1073741824) type FILE
transid 21397 data_len 0 name_len 44
name: FILENAME
Note that location key, its offset should be 0 for all INODE_ITEMS.
This caused failed lookup of the inode.
[CAUSE]
That offending value, 1073741824, is 0x40000000. So this looks like a
memory bit flip.
[FIX]
This patch will enhance tree-checker to check location key of
DIR_INDEX/DIR_ITEM/XATTR_ITEM.
There are several different combinations needs to check:
- item_key.type == DIR_INDEX/DIR_ITEM
* location_key.type == BTRFS_INODE_ITEM_KEY
This location_key should follow the check in inode_item check.
* location_key.type == BTRFS_ROOT_ITEM_KEY
Despite the existing check, DIR_INDEX/DIR_ITEM can only points to
subvolume trees.
* All other keys are not allowed.
- item_key.type == XATTR_ITEM
location_key should be all 0.
Reported-by: Mike Gilbert <floppymaster@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ROOT_ITEM key check itself is not as simple as single line check, and
will be reused for both ROOT_ITEM and DIR_ITEM/DIR_INDEX location key
check, so refactor such check into check_root_key().
Also since we are here, fix a comment error about ROOT_ITEM offset,
which is transid of snapshot creation, not some "older kernel behavior".
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inode key check is not as easy as several lines, and it will be called
in more than one location (INODE_ITEM check and
DIR_ITEM/DIR_INDEX/XATTR_ITEM location key check).
So here refactor such check into check_inode_key(). And add extra
checks for XATTR_ITEM.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The @fs_info parameter can be extracted from extent_buffer structure,
and there are already some wrappers getting rid of the @fs_info
parameter.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inspired by btrfs-progs github issue #208, where chunk item in chunk
tree has invalid num_stripes (0).
Although that can already be caught by current btrfs_check_chunk_valid(),
that function doesn't really check item size as it needs to handle chunk
item in super block sys_chunk_array().
This patch will add two extra checks for chunk items in chunk tree:
- Basic chunk item size
If the item is smaller than btrfs_chunk (which already contains one
stripe), exit right now as reading num_stripes may even go beyond
eb boundary.
- Item size check against num_stripes
If item size doesn't match with calculated chunk size, then either the
item size or the num_stripes is corrupted. Error out anyway.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This hasn't been used since it was first introduced in commit
b4bd745d12 ("btrfs: Introduce find_free_extent_ctl structure for later
rework"). Passing that to btrfs_add_reserved_bytes in find_free_extent
is not strictly necessary and using the local ram_bytes instead seems
cleaner.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 7087a9d8db ("btrfs: Remove
extent_io_ops::writepage_end_io_hook") left this logic in a confusing
state. Simplify it.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only pass this as 1 from __extent_writepage_io(). The parameter
basically means "pretend I didn't pass in a page". This is silly since
we can simply not pass in the page. Get rid of the parameter from
btrfs_get_extent(), and since it's used as a get_extent_t callback,
remove it from get_extent_t and btree_get_extent(), neither of which
need it.
While we're here, let's document btrfs_get_extent().
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In __extent_writepage_io(), we check whether
i_size <= page_offset(page).
Note that if i_size < page_offset(page), then
i_size >> PAGE_SHIFT < page->index.
If i_size == page_offset(page), then
i_size >> PAGE_SHIFT == page->index && offset_in_page(i_size) == 0.
__extent_writepage() already has a check for these cases that
returns without calling __extent_writepage_io():
end_index = i_size >> PAGE_SHIFT
pg_offset = offset_in_page(i_size);
if (page->index > end_index ||
(page->index == end_index && !pg_offset)) {
page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE);
unlock_page(page);
return 0;
}
Get rid of the one in __extent_writepage_io(), which was obsoleted in
211c17f51f ("Fix corners in writepage and btrfs_truncate_page").
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since 40f765805f ("Btrfs: split up __extent_writepage to lower stack
usage"), done_unlocked is simply a return 0. Get rid of it.
Mid-statement block returns don seem to make the code less readable here.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're initializing pg_offset to 0, setting it immediately, then
reassigning it to 0 again after. The former became unnecessary in
211c17f51f ("Fix corners in writepage and btrfs_truncate_page"). The
latter is a leftover that should've been removed in 40f765805f
("Btrfs: split up __extent_writepage to lower stack usage"). Remove
both.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ordered->start, ordered->len, and ordered->disk_len correspond to
fi->disk_bytenr, fi->num_bytes, and fi->disk_num_bytes, respectively.
It's confusing to translate between the two naming schemes. Since a
btrfs_ordered_extent is basically a pending btrfs_file_extent_item,
let's make the former use the naming from the latter.
Note that I didn't touch the names in tracepoints just in case there are
scripts depending on the current naming.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Snapshot-aware defrag has been disabled since commit 8101c8dbf6
("Btrfs: disable snapshot aware defrag for now") almost 6 years ago.
Let's remove the dead code. If someone is up to the task of bringing it
back, they can dig it up from git.
This is logically a revert of commit 38c227d87c ("Btrfs:
snapshot-aware defrag") except that now we have to clear the
EXTENT_DEFRAG bit to avoid need_force_cow() returning true forever.
The reasons to disable were caused by runtime problems (like long stalls
or memory consumption) on heavily referenced extents (eg. thousands of
snapshots). There were attempts to fix that but never finished.
Current defrag breaks the extent references and some users prefer that
behaviour over the one implemented by snapshot aware (ie. keeping links
for defragmentation). To enable both usecases we'd need to extend
defrag ioctl but let's do that properly from scratch.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance ]
Signed-off-by: David Sterba <dsterba@suse.com>
We can encode this in the offset parameter: -1 means use the page
offsets, anything else is a valid offset.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, we have two wrappers for __btrfs_lookup_bio_sums():
btrfs_lookup_bio_sums_dio(), which is used for direct I/O, and
btrfs_lookup_bio_sums(), which is used everywhere else. The only
difference is that the _dio variant looks up csums starting at the given
offset instead of using the page index, which isn't actually direct
I/O-specific. Let's clean up the signature and return value of
__btrfs_lookup_bio_sums(), rename it to btrfs_lookup_bio_sums(), and get
rid of the trivial helpers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When closing a device, btrfs_close_one_device() first allocates a new
device, copies the device to close's name, replaces it in the dev_list
with the copy and then finally frees it.
This involves two memory allocation, which can potentially fail. As this
code path is tricky to unwind, the allocation failures where handled by
BUG_ON()s.
But this copying isn't strictly needed, all that is needed is resetting
the device in question to it's state it had after the allocation.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_close_one_device we're decrementing the number of open devices
before we're calling btrfs_close_bdev().
As there is no intermediate exit between these points in this function it
is technically OK to do so, but it makes the code a bit harder to understand.
Move both operations closer together and move the decrement step after
btrfs_close_bdev().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When you snapshot a subvolume containing a subvolume, you get a
placeholder directory where the subvolume would be. These directories
have their own btrfs_dir_ro_inode_operations.
Al pointed out [1] that these directories can use simple_lookup()
instead of btrfs_lookup(), as they are always empty. Furthermore, they
can use the default generic_permission() instead of btrfs_permission();
the additional checks in the latter don't matter because we can't write
to the directory anyways. Finally, they can use the default
generic_update_time() instead of btrfs_update_time(), as the inode
doesn't exist on disk and doesn't need any special handling.
All together, this means that we can get rid of
btrfs_dir_ro_inode_operations and use simple_dir_inode_operations
instead.
1: https://lore.kernel.org/linux-btrfs/20190929052934.GY26530@ZenIV.linux.org.uk/
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
We have a user report, that cppcheck is complaining about a possible
NULL-pointer dereference in btrfs_destroy_dev_replace_tgtdev().
We're first dereferencing the 'tgtdev' variable and the later check for
the validity of the pointer with a WARN_ON(!tgtdev);
But all callers of btrfs_destroy_dev_replace_tgtdev() either explicitly
check if 'tgtdev' is non-NULL or directly allocate 'tgtdev', so the
WARN_ON() is impossible to hit. Just remove it to silence the checker's
complains.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205003
Signed-off-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfsic_process_superblock() BUG_ON()s if 'state' is NULL. But this can
never happen as the only caller from btrfsic_process_superblock() is
btrfsic_mount() which allocates 'state' some lines above calling
btrfsic_process_superblock() and checks for the allocation to succeed.
Let's just remove the impossible to hit BUG_ON().
Signed-off-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A user reports a possible NULL-pointer dereference in
btrfsic_process_superblock(). We are assigning state->fs_info to a local
fs_info variable and afterwards checking for the presence of state.
While we would BUG_ON() a NULL state anyways, we can also just remove
the local fs_info copy, as fs_info is only used once as the first
argument for btrfs_num_copies(). There we can just pass in
state->fs_info as well.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205003
Signed-off-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a relic from a time before we had a proper reservation mechanism
and you could end up with really full chunks at chunk allocation time.
This doesn't make sense anymore, so just kill it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have the space_info, we can just check its flags to see if it's the
system chunk space info.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's a simple wrapper over btrfs_panic and is called only once. Just
open code it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two relocation stages but both print the same message. Add the
description of the stage. This can help debugging or provides
informative message to users.
BTRFS info (device dm-5): balance: start -d -m -s
BTRFS info (device dm-5): relocating block group 30408704 flags metadata|dup
BTRFS info (device dm-5): found 2 extents, stage: move data extents
BTRFS info (device dm-5): relocating block group 22020096 flags system|dup
BTRFS info (device dm-5): found 1 extents, stage: move data extents
BTRFS info (device dm-5): relocating block group 13631488 flags data
BTRFS info (device dm-5): found 1 extents, stage: move data extents
BTRFS info (device dm-5): found 1 extents, stage: update data pointers
BTRFS info (device dm-5): balance: ended with status: 0
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The condition '!ret2' is always true. commit 717beb96d9 ("Btrfs: fix
regression in btrfs_page_mkwrite() from vm_fault_t conversion") left
behind the check after moving this code out of the goto, so remove the
unused condition check.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
level <0 and level >= BTRFS_MAX_LEVEL are already performed upon
extent buffer read by tree checker in btrfs_check_node.
go. As far as 'level <= 0' we are guaranteed that level is '> 0'
because the value of level _before_ reading 'next' is larger than 1
(otherwise we wouldn't have executed that code at all) this in turn
guarantees that 'level' after btrfs_read_buffer is 'level - 1' since
we verify this invariant in:
btrfs_read_buffer
btree_read_extent_buffer_pages
btrfs_verify_level_key
This guarantees that level can never be '<= 0' so the warn on is
never triggered.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The log_root passed to walk_log_tree is guaranteed to have its
root_key.objectid always be BTRFS_TREE_LOG_OBJECTID. This is by
merit that all log roots of an ordinary root are allocated in
alloc_log_tree which hard-codes objectid to be BTRFS_TREE_LOG_OBJECTID.
In case walk_log_tree is called for a log tree found by btrfs_read_fs_root
in btrfs_recover_log_trees, that function already ensures
found_key.objectid is BTRFS_TREE_LOG_OBJECTID.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_free_reserved_extent now performs the actions of
btrfs_free_and_pin_reserved_extent. But this name is a bit of a
misnomer, since the extent is not really freed but just pinned. Reflect
this in the new name. No semantics changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_free_reserved_extent performs 2 entirely different operations
depending on whether its 'pin' argument is true or false. This patch
lifts the 2nd case (pin is false) into it's sole caller
btrfs_free_reserved_extent. No semantics changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers of btrfs_free_reserved_extent (respectively
__btrfs_free_reserved_extent with in set to 0) pass in extents which
have only been reserved but not yet written to. Namely,
* in cow_file_range that function is called only if create_io_em fails
or btrfs_add_ordered_extent fail, both of which happen _before_ any IO
is submitted to the newly reserved range
* in submit_compressed_extents the code flow is similar -
out_free_reserve can be called only before
btrfs_submit_compressed_write which is where any writes to the range
could occur
* btrfs_new_extent_direct also calls btrfs_free_reserved_extent only
if extent_map fails, before any IO is issued
* __btrfs_prealloc_file_range also calls btrfs_free_reserved_extent
in case insertion of the metadata fails
* btrfs_alloc_tree_block again can only be called in case in-memory
operations fail, before any IO is submitted
* btrfs_finish_ordered_io - this is the only caller where discarding
the extent could have a material effect, since it can be called for
an extent which was partially written.
With this change the submission of discards is optimised since discards
are now not being created for extents which are known to not have been
touched on disk.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
qgroup create/remove code is currently returning EINVAL when the user
tries to create a qgroup on a subvolume without quota enabled. EINVAL is
already being used for too many error scenarios so that is hard to
depict what is the problem.
[FIX]
Currently scrub and balance code return -ENOTCONN when the user tries to
cancel/pause and no scrub or balance is currently running for the
desired subvolume. Do the same here by returning -ENOTCONN when a user
tries to create/delete/assing/list a qgroup on a subvolume without quota
enabled.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove some variables that are set only to be checked later, and never
used.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge btrfs_sysfs_add_fsid() and btrfs_sysfs_add_devices_kobj() functions
as these two are small and they are called one after the other.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_sysfs_add_device() creates the directory
/sys/fs/btrfs/UUID/devices but its function name is misleading. Rename
it to btrfs_sysfs_add_devices_kobj() instead. No functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 24bd69cb ("Btrfs: sysfs: add support to add parent for fsid")
added parent argument in preparation to show the seed fsid under the
sprout fsid as in the patch [1] in the mailing list.
[1] Btrfs: sysfs: support seed devices in the sysfs layout
But later this idea was superseded by another idea to rename the fsid as
in the commit f93c39970b ("btrfs: factor out sysfs code for updating
sprout fsid").
So we don't need parent argument anymore.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The struct member btrfs_device::device_dir_kobj holds the kobj of the
sysfs directory /sys/fs/btrfs/UUID/devices, so rename it from
device_dir_kobj to devices_kobj. No functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using the NO_HOLES feature, if we punch a hole into a file and then
fsync it, there are cases where a subsequent fsync will miss the fact that
a hole was punched, resulting in the holes not existing after replaying
the log tree.
Essentially these cases all imply that, tree-log.c:copy_items(), is not
invoked for the leafs that delimit holes, because nothing changed those
leafs in the current transaction. And it's precisely copy_items() where
we currenly detect and log holes, which works as long as the holes are
between file extent items in the input leaf or between the beginning of
input leaf and the previous leaf or between the last item in the leaf
and the next leaf.
First example where we miss a hole:
*) The extent items of the inode span multiple leafs;
*) The punched hole covers a range that affects only the extent items of
the first leaf;
*) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC
is set in the inode's runtime flags).
That results in the hole not existing after replaying the log tree.
For example, if the fs/subvolume tree has the following layout for a
particular inode:
Leaf N, generation 10:
[ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ]
Leaf N + 1, generation 10:
[ EXTENT_ITEM (128K 64K) ... ]
If at transaction 11 we punch a hole coverting the range [0, 128K[, we end
up dropping the two extent items from leaf N, but we don't touch the other
leaf, so we end up in the following state:
Leaf N, generation 11:
[ ... INODE_ITEM INODE_REF ]
Leaf N + 1, generation 10:
[ EXTENT_ITEM (128K 64K) ... ]
A full fsync after punching the hole will only process leaf N because it
was modified in the current transaction, but not leaf N + 1, since it
was not modified in the current transaction (generation 10 and not 11).
As a result the fsync will not log any holes, because it didn't process
any leaf with extent items.
Second example where we will miss a hole:
*) An inode as its items spanning 5 (or more) leafs;
*) A hole is punched and it covers only the extents items of the 3rd
leaf. This resulsts in deleting the entire leaf and not touching any
of the other leafs.
So the only leaf that is modified in the current transaction, when
punching the hole, is the first leaf, which contains the inode item.
During the full fsync, the only leaf that is passed to copy_items()
is that first leaf, and that's not enough for the hole detection
code in copy_items() to determine there's a hole between the last
file extent item in the 2nd leaf and the first file extent item in
the 3rd leaf (which was the 4th leaf before punching the hole).
Fix this by scanning all leafs and punch holes as necessary when doing a
full fsync (less common than a non-full fsync) when the NO_HOLES feature
is enabled. The lack of explicit file extent items to mark holes makes it
necessary to scan existing extents to determine if holes exist.
A test case for fstests follows soon.
Fixes: 16e7549f04 ("Btrfs: incompatible format change to remove hole extents")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4h7wwACgkQxWXV+ddt
WDtFgxAAqanZ3wqq8xNqybfWmmFNrKtdkakErRuQFWe/+kZH59HuBifTbYeVMD4Z
hFgveFJpBIqo7uMUxUItSTtHcr3qx7TP9ejGYOaQO997oNPxPQXuEY8Lq5ebDBVB
89Gn+Eg/Q+uPvCJSctxx4dblSiGZKb3iOEh+lJuWJV4bj8beekcTrqsg01ZchPRO
Ygk1ltW7Vpf0wVkdts4FKiKiwX02M2C9zxh9NQjpNwH1DMow4XtBPsbqHbiHzRym
SoD4+0dbhfdnKkNnBTFEJBbjbZcYwM9EQnfiyVL+/hDMHX4XTetqeFN1G8usfXXX
2kxvwttPUtluJqlQXQnUU4mQEA4p5ORTgGgw1WBF3h+Aezumkql+27Bd6aiDKGZz
SPc9sveft60R23TxorlrYVqfADgyZKEaZ+2wEM99Xoz4OdvP7jkqDentJW9us1Xh
Xmfovq5xcRY17f9tdhiwqH5vgwxrLgmjBvTm/kcGX3ImhU8Yxk8xKw1JoV0P9cjW
7awK4l8pyPbOUhekdT8hYqWXlL/DXhAMHraV1zfBKIbu1omlGByeg23jNM2iS/0B
YtRkEEen0tRHpuKLB08twTKCak94wObBamKNFE6Snt1cDudLwGDpUosazM9l4uPR
2D3SHs7UWNgtvTRCfq2LVoRMRSR2BA1b19EkylUig4ay7khmZ2k=
=2e9q
-----END PGP SIGNATURE-----
Merge tag 'for-5.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes that have been in the works during last twp weeks.
All have a user visible effect and are stable material:
- scrub: properly update progress after calling cancel ioctl, calling
'resume' would start from the beginning otherwise
- fix subvolume reference removal, after moving out of the original
path the reference is not recognized and will lead to transaction
abort
- fix reloc root lifetime checks, could lead to crashes when there's
subvolume cleaning running in parallel
- fix memory leak when quotas get disabled in the middle of extent
accounting
- fix transaction abort in case of balance being started on degraded
mount on eg. RAID1"
* tag 'for-5.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: check rw_devices, not num_devices for balance
Btrfs: always copy scrub arguments back to user space
btrfs: relocation: fix reloc_root lifespan and access
btrfs: fix memory leak in qgroup accounting
btrfs: do not delete mismatched root refs
btrfs: fix invalid removal of root ref
btrfs: rework arguments of btrfs_unlink_subvol
The fstest btrfs/154 reports
[ 8675.381709] BTRFS: Transaction aborted (error -28)
[ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935
[ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286
[ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001
[ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971
[ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000
[ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4
[ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000
[ 8675.413994] FS: 00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000
[ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0
[ 8675.419801] Call Trace:
[ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs]
[ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs]
[ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs]
[ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs]
[ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0
[ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs]
[ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs]
[ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0
[ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400
[ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0
[ 8675.436618] ? _raw_spin_unlock+0x1f/0x30
[ 8675.438093] ? __handle_mm_fault+0x499/0x740
[ 8675.439619] ? do_vfs_ioctl+0x56e/0x770
[ 8675.441034] do_vfs_ioctl+0x56e/0x770
[ 8675.442411] ksys_ioctl+0x3a/0x70
[ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 8675.445333] __x64_sys_ioctl+0x16/0x20
[ 8675.446705] do_syscall_64+0x50/0x210
[ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left
We now use btrfs_can_overcommit() to see if we can flip a block group
read only. Before this would fail because we weren't taking into
account the usable un-allocated space for allocating chunks. With my
patches we were allowed to do the balance, which is technically correct.
The test is trying to start balance on degraded mount. So now we're
trying to allocate a chunk and cannot because we want to allocate a
RAID1 chunk, but there's only 1 device that's available for usage. This
results in an ENOSPC.
But we shouldn't even be making it this far, we don't have enough
devices to restripe. The problem is we're using btrfs_num_devices(),
that also includes missing devices. That's not actually what we want, we
need to use rw_devices.
The chunk_mutex is not needed here, rw_devices changes only in device
add, remove or replace, all are excluded by EXCL_OP mechanism.
Fixes: e4d8ec0f65 ("Btrfs: implement online profile changing")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add stacktrace, update changelog, drop chunk_mutex ]
Signed-off-by: David Sterba <dsterba@suse.com>
If scrub returns an error we are not copying back the scrub arguments
structure to user space. This prevents user space to know how much
progress scrub has done if an error happened - this includes -ECANCELED
which is returned when users ask for scrub to stop. A particular use
case, which is used in btrfs-progs, is to resume scrub after it is
canceled, in that case it relies on checking the progress from the scrub
arguments structure and then use that progress in a call to resume
scrub.
So fix this by always copying the scrub arguments structure to user
space, overwriting the value returned to user space with -EFAULT only if
copying the structure failed to let user space know that either that
copying did not happen, and therefore the structure is stale, or it
happened partially and the structure is probably not valid and corrupt
due to the partial copy.
Reported-by: Graham Cobb <g.btrfs@cobb.uk.net>
Link: https://lore.kernel.org/linux-btrfs/d0a97688-78be-08de-ca7d-bcb4c7fb397e@cobb.uk.net/
Fixes: 06fe39ab15 ("Btrfs: do not overwrite scrub error with fault error in scrub ioctl")
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Graham Cobb <g.btrfs@cobb.uk.net>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There are several different KASAN reports for balance + snapshot
workloads. Involved call paths include:
should_ignore_root+0x54/0xb0 [btrfs]
build_backref_tree+0x11af/0x2280 [btrfs]
relocate_tree_blocks+0x391/0xb80 [btrfs]
relocate_block_group+0x3e5/0xa00 [btrfs]
btrfs_relocate_block_group+0x240/0x4d0 [btrfs]
btrfs_relocate_chunk+0x53/0xf0 [btrfs]
btrfs_balance+0xc91/0x1840 [btrfs]
btrfs_ioctl_balance+0x416/0x4e0 [btrfs]
btrfs_ioctl+0x8af/0x3e60 [btrfs]
do_vfs_ioctl+0x831/0xb10
create_reloc_root+0x9f/0x460 [btrfs]
btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs]
create_pending_snapshot+0xa9b/0x15f0 [btrfs]
create_pending_snapshots+0x111/0x140 [btrfs]
btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
btrfs_mksubvol+0x915/0x960 [btrfs]
btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
btrfs_ioctl+0x241b/0x3e60 [btrfs]
do_vfs_ioctl+0x831/0xb10
btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs]
create_pending_snapshot+0x209/0x15f0 [btrfs]
create_pending_snapshots+0x111/0x140 [btrfs]
btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
btrfs_mksubvol+0x915/0x960 [btrfs]
btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
btrfs_ioctl+0x241b/0x3e60 [btrfs]
do_vfs_ioctl+0x831/0xb10
[CAUSE]
All these call sites are only relying on root->reloc_root, which can
undergo btrfs_drop_snapshot(), and since we don't have real refcount
based protection to reloc roots, we can reach already dropped reloc
root, triggering KASAN.
[FIX]
To avoid such access to unstable root->reloc_root, we should check
BTRFS_ROOT_DEAD_RELOC_TREE bit first.
This patch introduces wrappers that provide the correct way to check the
bit with memory barriers protection.
Most callers don't distinguish merged reloc tree and no reloc tree. The
only exception is should_ignore_root(), as merged reloc tree can be
ignored, while no reloc tree shouldn't.
[CRITICAL SECTION ANALYSIS]
Although test_bit()/set_bit()/clear_bit() doesn't imply a barrier, the
DEAD_RELOC_TREE bit has extra help from transaction as a higher level
barrier, the lifespan of root::reloc_root and DEAD_RELOC_TREE bit are:
NULL: reloc_root is NULL PTR: reloc_root is not NULL
0: DEAD_RELOC_ROOT bit not set DEAD: DEAD_RELOC_ROOT bit set
(NULL, 0) Initial state __
| /\ Section A
btrfs_init_reloc_root() \/
| __
(PTR, 0) reloc_root initialized /\
| |
btrfs_update_reloc_root() | Section B
| |
(PTR, DEAD) reloc_root has been merged \/
| __
=== btrfs_commit_transaction() ====================
| /\
clean_dirty_subvols() |
| | Section C
(NULL, DEAD) reloc_root cleanup starts \/
| __
btrfs_drop_snapshot() /\
| | Section D
(NULL, 0) Back to initial state \/
Every have_reloc_root() or test_bit(DEAD_RELOC_ROOT) caller holds
transaction handle, so none of such caller can cross transaction boundary.
In Section A, every caller just found no DEAD bit, and grab reloc_root.
In the cross section A-B, caller may get no DEAD bit, but since reloc_root
is still completely valid thus accessing reloc_root is completely safe.
No test_bit() caller can cross the boundary of Section B and Section C.
In Section C, every caller found the DEAD bit, so no one will access
reloc_root.
In the cross section C-D, either caller gets the DEAD bit set, avoiding
access reloc_root no matter if it's safe or not. Or caller get the DEAD
bit cleared, then access reloc_root, which is already NULL, nothing will
be wrong.
The memory write barriers are between the reloc_root updates and bit
set/clear, the pairing read side is before test_bit.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ barriers ]
Signed-off-by: David Sterba <dsterba@suse.com>
When running xfstests on the current btrfs I get the following splat from
kmemleak:
unreferenced object 0xffff88821b2404e0 (size 32):
comm "kworker/u4:7", pid 26663, jiffies 4295283698 (age 8.776s)
hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 10 ff fd 26 82 88 ff ff ...........&....
10 ff fd 26 82 88 ff ff 20 ff fd 26 82 88 ff ff ...&.... ..&....
backtrace:
[<00000000f94fd43f>] ulist_alloc+0x25/0x60 [btrfs]
[<00000000fd023d99>] btrfs_find_all_roots_safe+0x41/0x100 [btrfs]
[<000000008f17bd32>] btrfs_find_all_roots+0x52/0x70 [btrfs]
[<00000000b7660afb>] btrfs_qgroup_rescan_worker+0x343/0x680 [btrfs]
[<0000000058e66778>] btrfs_work_helper+0xac/0x1e0 [btrfs]
[<00000000f0188930>] process_one_work+0x1cf/0x350
[<00000000af5f2f8e>] worker_thread+0x28/0x3c0
[<00000000b55a1add>] kthread+0x109/0x120
[<00000000f88cbd17>] ret_from_fork+0x35/0x40
This corresponds to:
(gdb) l *(btrfs_find_all_roots_safe+0x41)
0x8d7e1 is in btrfs_find_all_roots_safe (fs/btrfs/backref.c:1413).
1408
1409 tmp = ulist_alloc(GFP_NOFS);
1410 if (!tmp)
1411 return -ENOMEM;
1412 *roots = ulist_alloc(GFP_NOFS);
1413 if (!*roots) {
1414 ulist_free(tmp);
1415 return -ENOMEM;
1416 }
1417
Following the lifetime of the allocated 'roots' ulist, it gets freed
again in btrfs_qgroup_account_extent().
But this does not happen if the function is called with the
'BTRFS_FS_QUOTA_ENABLED' flag cleared, then btrfs_qgroup_account_extent()
does a short leave and directly returns.
Instead of directly returning we should jump to the 'out_free' in order to
free all resources as expected.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_del_root_ref() will simply WARN_ON() if the ref doesn't match in
any way, and then continue to delete the reference. This shouldn't
happen, we have these values because there's more to the reference than
the original root and the sub root. If any of these checks fail, return
-ENOENT.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have the following sequence of events
btrfs sub create A
btrfs sub create A/B
btrfs sub snap A C
mkdir C/foo
mv A/B C/foo
rm -rf *
We will end up with a transaction abort.
The reason for this is because we create a root ref for B pointing to A.
When we create a snapshot of C we still have B in our tree, but because
the root ref points to A and not C we will make it appear to be empty.
The problem happens when we move B into C. This removes the root ref
for B pointing to A and adds a ref of B pointing to C. When we rmdir C
we'll see that we have a ref to our root and remove the root ref,
despite not actually matching our reference name.
Now btrfs_del_root_ref() allowing this to work is a bug as well, however
we know that this inode does not actually point to a root ref in the
first place, so we shouldn't be calling btrfs_del_root_ref() in the
first place and instead simply look up our dir index for this item and
do the rest of the removal.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_unlink_subvol takes the name of the dentry and the root objectid
based on what kind of inode this is, either a real subvolume link or a
empty one that we inherited as a snapshot. We need to fix how we unlink
in the case for BTRFS_EMPTY_SUBVOL_DIR_OBJECTID in the future, so rework
btrfs_unlink_subvol to just take the dentry and handle getting the right
objectid given the type of inode this is. There is no functional change
here, simply pushing the work into btrfs_unlink_subvol() proper.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4PenMACgkQxWXV+ddt
WDtsyg/6A86Da1zpLQKHiQtfhH629oeL/lFI0Cz+52xcgKSAUX8pHrWzKawSVpTT
OTgksskHREPDjyDQytxAUKJnuplFEF7sCFNhzhsjQ7cvhC0lLDmW0VQr1Wob1/c4
miW18YXN2DrYFQqdCeypvq1FItglxuBmsa9eITYzytRZcATY+3b506FhT/d8NIiH
f3nnChUiNuyfAiez1SC4FavDpVma1O6SpPMk29wUN/+/Dnd4aBrfvhuygy7xyIOK
rzotRR5xagQDpei+99lT2hys4Pv0yEOajoGmgbgpaZNP0vgQcBLaF5UTeMv/Ib2j
i+muu1rWi4R6lS3hh30kRSifKmPF7I9JB1dbd5gPfZiERlDQk+qZWrPFv9l0lf7M
R3jn04EaPVNr0dtftRFF2VvpIlUQYgIvwZHx4Td6Oy9XO0X4n5vlKxYg9aHmybJ2
Vni13ad2oWNLo2Dd1eUYrEJ/QwGc1pq7JU9HiOST7yEJv1FN++AF4qTYTyvc7Yl4
HrN349/2k2J9vU88BIlXIxYG73AL9lIpGF2ROiEw+xn4oOevDP/5HCP8ZvELoNVM
NqhJGoURu+AhZ7Rm3RYAYySShjQPF+qw/Dnz8sowQsEFUQ990E1wUxWZwTOP6+JQ
cdwzYiXYIExTghfQdUDdwNlpCXUbSuJxFOGjKmhBuzA7HHwGf1Q=
=jGvW
-----END PGP SIGNATURE-----
Merge tag 'for-5.5-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few fixes for btrfs:
- blkcg accounting problem with compression that could stall writes
- setting up blkcg bio for compression crashes due to NULL bdev
pointer
- fix possible infinite loop in writeback for nocow files (here
possible means almost impossible, 13 things that need to happen to
trigger it)"
* tag 'for-5.5-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix infinite loop during nocow writeback due to race
btrfs: fix compressed write bio blkcg attribution
btrfs: punt all bios created in btrfs_submit_compressed_write()
When starting writeback for a range that covers part of a preallocated
extent, due to a race with writeback for another range that also covers
another part of the same preallocated extent, we can end up in an infinite
loop.
Consider the following example where for inode 280 we have two dirty
ranges:
range A, from 294912 to 303103, 8192 bytes
range B, from 348160 to 438271, 90112 bytes
and we have the following file extent item layout for our inode:
leaf 38895616 gen 24544 total ptrs 29 free space 13820 owner 5
(...)
item 27 key (280 108 200704) itemoff 14598 itemsize 53
extent data disk bytenr 0 nr 0 type 1 (regular)
extent data offset 0 nr 94208 ram 94208
item 28 key (280 108 294912) itemoff 14545 itemsize 53
extent data disk bytenr 10433052672 nr 81920 type 2 (prealloc)
extent data offset 0 nr 81920 ram 81920
Then the following happens:
1) Writeback starts for range B (from 348160 to 438271), execution of
run_delalloc_nocow() starts;
2) The first iteration of run_delalloc_nocow()'s whil loop leaves us at
the extent item at slot 28, pointing to the prealloc extent item
covering the range from 294912 to 376831. This extent covers part of
our range;
3) An ordered extent is created against that extent, covering the file
range from 348160 to 376831 (28672 bytes);
4) We adjust 'cur_offset' to 376832 and move on to the next iteration of
the while loop;
5) The call to btrfs_lookup_file_extent() leaves us at the same leaf,
pointing to slot 29, 1 slot after the last item (the extent item
we processed in the previous iteration);
6) Because we are a slot beyond the last item, we call btrfs_next_leaf(),
which releases the search path before doing a another search for the
last key of the leaf (280 108 294912);
7) Right after btrfs_next_leaf() released the path, and before it did
another search for the last key of the leaf, writeback for the range
A (from 294912 to 303103) completes (it was previously started at
some point);
8) Upon completion of the ordered extent for range A, the prealloc extent
we previously found got split into two extent items, one covering the
range from 294912 to 303103 (8192 bytes), with a type of regular extent
(and no longer prealloc) and another covering the range from 303104 to
376831 (73728 bytes), with a type of prealloc and an offset of 8192
bytes. So our leaf now has the following layout:
leaf 38895616 gen 24544 total ptrs 31 free space 13664 owner 5
(...)
item 27 key (280 108 200704) itemoff 14598 itemsize 53
extent data disk bytenr 0 nr 0 type 1
extent data offset 0 nr 8192 ram 94208
item 28 key (280 108 208896) itemoff 14545 itemsize 53
extent data disk bytenr 10433142784 nr 86016 type 1
extent data offset 0 nr 86016 ram 86016
item 29 key (280 108 294912) itemoff 14492 itemsize 53
extent data disk bytenr 10433052672 nr 81920 type 1
extent data offset 0 nr 8192 ram 81920
item 30 key (280 108 303104) itemoff 14439 itemsize 53
extent data disk bytenr 10433052672 nr 81920 type 2
extent data offset 8192 nr 73728 ram 81920
9) After btrfs_next_leaf() returns, we have our path pointing to that same
leaf and at slot 30, since it has a key we didn't have before and it's
the first key greater then the key that was previously the last key of
the leaf (key (280 108 294912));
10) The extent item at slot 30 covers the range from 303104 to 376831
which is in our target range, so we process it, despite having already
created an ordered extent against this extent for the file range from
348160 to 376831. This is because we skip to the next extent item only
if its end is less than or equals to the start of our delalloc range,
and not less than or equals to the current offset ('cur_offset');
11) As a result we compute 'num_bytes' as:
num_bytes = min(end + 1, extent_end) - cur_offset;
= min(438271 + 1, 376832) - 376832 = 0
12) We then call create_io_em() for a 0 bytes range starting at offset
376832;
13) Then create_io_em() enters an infinite loop because its calls to
btrfs_drop_extent_cache() do nothing due to the 0 length range
passed to it. So no existing extent maps that cover the offset
376832 get removed, and therefore calls to add_extent_mapping()
return -EEXIST, resulting in an infinite loop. This loop from
create_io_em() is the following:
do {
btrfs_drop_extent_cache(BTRFS_I(inode), em->start,
em->start + em->len - 1, 0);
write_lock(&em_tree->lock);
ret = add_extent_mapping(em_tree, em, 1);
write_unlock(&em_tree->lock);
/*
* The caller has taken lock_extent(), who could race with us
* to add em?
*/
} while (ret == -EEXIST);
Also, each call to btrfs_drop_extent_cache() triggers a warning because
the start offset passed to it (376832) is smaller then the end offset
(376832 - 1) passed to it by -1, due to the 0 length:
[258532.052621] ------------[ cut here ]------------
[258532.052643] WARNING: CPU: 0 PID: 9987 at fs/btrfs/file.c:602 btrfs_drop_extent_cache+0x3f4/0x590 [btrfs]
(...)
[258532.052672] CPU: 0 PID: 9987 Comm: fsx Tainted: G W 5.4.0-rc7-btrfs-next-64 #1
[258532.052673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[258532.052691] RIP: 0010:btrfs_drop_extent_cache+0x3f4/0x590 [btrfs]
(...)
[258532.052695] RSP: 0018:ffffb4be0153f860 EFLAGS: 00010287
[258532.052700] RAX: ffff975b445ee360 RBX: ffff975b44eb3e08 RCX: 0000000000000000
[258532.052700] RDX: 0000000000038fff RSI: 0000000000039000 RDI: ffff975b445ee308
[258532.052700] RBP: 0000000000038fff R08: 0000000000000000 R09: 0000000000000001
[258532.052701] R10: ffff975b513c5c10 R11: 00000000e3c0cfa9 R12: 0000000000039000
[258532.052703] R13: ffff975b445ee360 R14: 00000000ffffffef R15: ffff975b445ee308
[258532.052705] FS: 00007f86a821de80(0000) GS:ffff975b76a00000(0000) knlGS:0000000000000000
[258532.052707] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[258532.052708] CR2: 00007fdacf0f3ab4 CR3: 00000001f9d26002 CR4: 00000000003606f0
[258532.052712] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[258532.052717] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[258532.052717] Call Trace:
[258532.052718] ? preempt_schedule_common+0x32/0x70
[258532.052722] ? ___preempt_schedule+0x16/0x20
[258532.052741] create_io_em+0xff/0x180 [btrfs]
[258532.052767] run_delalloc_nocow+0x942/0xb10 [btrfs]
[258532.052791] btrfs_run_delalloc_range+0x30b/0x520 [btrfs]
[258532.052812] ? find_lock_delalloc_range+0x221/0x250 [btrfs]
[258532.052834] writepage_delalloc+0xe4/0x140 [btrfs]
[258532.052855] __extent_writepage+0x110/0x4e0 [btrfs]
[258532.052876] extent_write_cache_pages+0x21c/0x480 [btrfs]
[258532.052906] extent_writepages+0x52/0xb0 [btrfs]
[258532.052911] do_writepages+0x23/0x80
[258532.052915] __filemap_fdatawrite_range+0xd2/0x110
[258532.052938] btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
[258532.052954] start_ordered_ops+0x57/0xa0 [btrfs]
[258532.052973] ? btrfs_sync_file+0x225/0x490 [btrfs]
[258532.052988] btrfs_sync_file+0x225/0x490 [btrfs]
[258532.052997] __x64_sys_msync+0x199/0x200
[258532.053004] do_syscall_64+0x5c/0x250
[258532.053007] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[258532.053010] RIP: 0033:0x7f86a7dfd760
(...)
[258532.053014] RSP: 002b:00007ffd99af0368 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
[258532.053016] RAX: ffffffffffffffda RBX: 0000000000000ec9 RCX: 00007f86a7dfd760
[258532.053017] RDX: 0000000000000004 RSI: 000000000000836c RDI: 00007f86a8221000
[258532.053019] RBP: 0000000000021ec9 R08: 0000000000000003 R09: 00007f86a812037c
[258532.053020] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000000074a3
[258532.053021] R13: 00007f86a8221000 R14: 000000000000836c R15: 0000000000000001
[258532.053032] irq event stamp: 1653450494
[258532.053035] hardirqs last enabled at (1653450493): [<ffffffff9dec69f9>] _raw_spin_unlock_irq+0x29/0x50
[258532.053037] hardirqs last disabled at (1653450494): [<ffffffff9d4048ea>] trace_hardirqs_off_thunk+0x1a/0x20
[258532.053039] softirqs last enabled at (1653449852): [<ffffffff9e200466>] __do_softirq+0x466/0x6bd
[258532.053042] softirqs last disabled at (1653449845): [<ffffffff9d4c8a0c>] irq_exit+0xec/0x120
[258532.053043] ---[ end trace 8476fce13d9ce20a ]---
Which results in flooding dmesg/syslog since btrfs_drop_extent_cache()
uses WARN_ON() and not WARN_ON_ONCE().
So fix this issue by changing run_delalloc_nocow()'s loop to move to the
next extent item when the current extent item ends at at offset less than
or equals to the current offset instead of the start offset.
Fixes: 80ff385665 ("Btrfs: update nodatacow code v2")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Bio attribution is handled at bio_set_dev() as once we have a device, we
have a corresponding request_queue and then can derive the current css.
In special cases, we want to attribute to bio to someone else. This can
be done by calling bio_associate_blkg_from_css() or
kthread_associate_blkcg() depending on the scenario. Btrfs does this for
compressed writeback as they are handled by kworkers, so the latter can
be done here.
Commit 1a41802701 ("btrfs: drop bio_set_dev where not needed") removes
early bio_set_dev() calls prior to submit_stripe_bio(). This breaks the
above assumption that we'll have a request_queue when we are doing
association. To fix this, switch to using kthread_associate_blkcg().
Without this, we crash in btrfs/024:
[ 3052.093088] BUG: kernel NULL pointer dereference, address: 0000000000000510
[ 3052.107013] #PF: supervisor read access in kernel mode
[ 3052.107014] #PF: error_code(0x0000) - not-present page
[ 3052.107015] PGD 0 P4D 0
[ 3052.107021] Oops: 0000 [#1] SMP
[ 3052.138904] CPU: 42 PID: 201270 Comm: kworker/u161:0 Kdump: loaded Not tainted 5.5.0-rc1-00062-g4852d8ac90a9 #712
[ 3052.138905] Hardware name: Quanta Tioga Pass Single Side 01-0032211004/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
[ 3052.138912] Workqueue: btrfs-delalloc btrfs_work_helper
[ 3052.191375] RIP: 0010:bio_associate_blkg_from_css+0x1e/0x3c0
[ 3052.191379] RSP: 0018:ffffc900210cfc90 EFLAGS: 00010282
[ 3052.191380] RAX: 0000000000000000 RBX: ffff88bfe5573c00 RCX: 0000000000000000
[ 3052.191382] RDX: ffff889db48ec2f0 RSI: ffff88bfe5573c00 RDI: ffff889db48ec2f0
[ 3052.191386] RBP: 0000000000000800 R08: 0000000000203bb0 R09: ffff889db16b2400
[ 3052.293364] R10: 0000000000000000 R11: ffff88a07fffde80 R12: ffff889db48ec2f0
[ 3052.293365] R13: 0000000000001000 R14: ffff889de82bc000 R15: ffff889e2b7bdcc8
[ 3052.293367] FS: 0000000000000000(0000) GS:ffff889ffba00000(0000) knlGS:0000000000000000
[ 3052.293368] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3052.293369] CR2: 0000000000000510 CR3: 0000000002611001 CR4: 00000000007606e0
[ 3052.293370] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3052.293371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3052.293372] PKRU: 55555554
[ 3052.293376] Call Trace:
[ 3052.402552] btrfs_submit_compressed_write+0x137/0x390
[ 3052.402558] submit_compressed_extents+0x40f/0x4c0
[ 3052.422401] btrfs_work_helper+0x246/0x5a0
[ 3052.422408] process_one_work+0x200/0x570
[ 3052.438601] ? process_one_work+0x180/0x570
[ 3052.438605] worker_thread+0x4c/0x3e0
[ 3052.438614] kthread+0x103/0x140
[ 3052.460735] ? process_one_work+0x570/0x570
[ 3052.460737] ? kthread_mod_delayed_work+0xc0/0xc0
[ 3052.460744] ret_from_fork+0x24/0x30
Fixes: 1a41802701 ("btrfs: drop bio_set_dev where not needed")
Reported-by: Chris Murphy <chris@colorremedies.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Compressed writes happen in the background via kworkers. However, this
causes bios to be attributed to root bypassing any cgroup limits from
the actual writer. We tag the first bio with REQ_CGROUP_PUNT, which will
punt the bio to an appropriate cgroup specific workqueue and attribute
the IO properly. However, if btrfs_submit_compressed_write() creates a
new bio, we don't tag it the same way. Add the appropriate tagging for
subsequent bios.
Fixes: ec39f7696c ("Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios")
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl35CAYACgkQxWXV+ddt
WDur/w//S98RvSZYMW5y2u+bPGe8sCpXwu5Sr87hTd14We8cBWj8684npUmSk7Dz
rTRSjcf9EQe5dGoiHOzpKU0HcsLKy9DVTPigvVbmsWZfT9mqS6Y8wAKMw/7UUvyy
n7aZk/yQGRow3gZ/Z/aF23JypRoDJK7DPbSMKUW164BnD5rCCyr+VdA8V+CwHgVh
UN6UG0KMDbDKS4501DsX8418pcJN+a+Jo4oBGwN/guKRjK1oNcrhj34DNhvXlaOV
Rlu7HcVtfHNDS/xD3DZS9mDIiycJ6qHkvC3hUsEmlKRoPEm1leVxTDLDf78oEy9H
TrvH71hbvYjxaOU4YQbJG8ky+VwFfiV0Vrj73GgdEeRRDuMbYwUyFI5gYQOji8fS
DuYdJGyslOqQovpii+jrPiT1TPG+97R4+qKH2DfOW1xUChYsbQHt7FOfzUbLe0JE
dev9zV6MRqZ1qf70+Wt2LuWYFefpg9KVnsn8mcjoBwz9s9uImzLgpI90+DMPKOaU
TizwJK3W5K3YLhqPHwPLvqVxKwVOzu00v01xl/bjTuyp982oPSCj3fj+FprGV1la
OkqOYbKe2ZqEkQpINDu8I58oydTKywZGVsUl4ldJlcSY1hEDFCyeoAFmixaJbRbQ
IdBcQnjD7qgvu9E4cA0kL8Ma1op2+1zw8sUOdXKFIDiNEqL5FPs=
=AnHL
-----END PGP SIGNATURE-----
Merge tag 'for-5.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A mix of regression fixes and regular fixes for stable trees:
- fix swapped error messages for qgroup enable/rescan
- fixes for NO_HOLES feature with clone range
- fix deadlock between iget/srcu lock/synchronize srcu while freeing
an inode
- fix double lock on subvolume cross-rename
- tree log fixes
* fix missing data checksums after replaying a log tree
* also teach tree-checker about this problem
* skip log replay on orphaned roots
- fix maximum devices constraints for RAID1C -3 and -4
- send: don't print warning on read-only mount regarding orphan
cleanup
- error handling fixes"
* tag 'for-5.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: send: remove WARN_ON for readonly mount
btrfs: do not leak reloc root if we fail to read the fs root
btrfs: skip log replay on orphaned roots
btrfs: handle ENOENT in btrfs_uuid_tree_iterate
btrfs: abort transaction after failed inode updates in create_subvol
Btrfs: fix hole extent items with a zero size after range cloning
Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues
Btrfs: make tree checker detect checksum items with overlapping ranges
Btrfs: fix missing data checksums after replaying a log tree
btrfs: return error pointer from alloc_test_extent_buffer
btrfs: fix devs_max constraints for raid1c3 and raid1c4
btrfs: tree-checker: Fix error format string for size_t
btrfs: don't double lock the subvol_sem for rename exchange
btrfs: handle error in btrfs_cache_block_group
btrfs: do not call synchronize_srcu() in inode_tree_del
Btrfs: fix cloning range with a hole when using the NO_HOLES feature
btrfs: Fix error messages in qgroup_rescan_init
We log warning if root::orphan_cleanup_state is not set to
ORPHAN_CLEANUP_DONE in btrfs_ioctl_send(). However if the filesystem is
mounted as readonly we skip the orphan item cleanup during the lookup
and root::orphan_cleanup_state remains at the init state 0 instead of
ORPHAN_CLEANUP_DONE (2). So during send in btrfs_ioctl_send() we hit the
warning as below.
WARN_ON(send_root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE);
WARNING: CPU: 0 PID: 2616 at /Volumes/ws/btrfs-devel/fs/btrfs/send.c:7090 btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
::
RIP: 0010:btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
::
Call Trace:
::
_btrfs_ioctl_send+0x7b/0x110 [btrfs]
btrfs_ioctl+0x150a/0x2b00 [btrfs]
::
do_vfs_ioctl+0xa9/0x620
? __fget+0xac/0xe0
ksys_ioctl+0x60/0x90
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x49/0x130
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reproducer:
mkfs.btrfs -fq /dev/sdb
mount /dev/sdb /btrfs
btrfs subvolume create /btrfs/sv1
btrfs subvolume snapshot -r /btrfs/sv1 /btrfs/ss1
umount /btrfs
mount -o ro /dev/sdb /btrfs
btrfs send /btrfs/ss1 -f /tmp/f
The warning exists because having orphan inodes could confuse send and
cause it to fail or produce incorrect streams. The two cases that would
cause such send failures, which are already fixed are:
1) Inodes that were unlinked - these are orphanized and remain with a
link count of 0. These caused send operations to fail because it
expected to always find at least one path for an inode. However this
is no longer a problem since send is now able to deal with such
inodes since commit 46b2f4590a ("Btrfs: fix send failure when root
has deleted files still open") and treats them as having been
completely removed (the state after an orphan cleanup is performed).
2) Inodes that were in the process of being truncated. These resulted in
send not knowing about the truncation and potentially issue write
operations full of zeroes for the range from the new file size to the
old file size. This is no longer a problem because we no longer
create orphan items for truncation since commit f7e9e8fc79 ("Btrfs:
stop creating orphan items for truncate").
As such before these commits, the WARN_ON here provided a clue in case
something went wrong. Instead of being a warning against the
root::orphan_cleanup_state value, it could have been more accurate by
checking if there were actually any orphan items, and then issue a
warning only if any exists, but that would be more expensive to check.
Since orphanized inodes no longer cause problems for send, just remove
the warning.
Reported-by: Christoph Anton Mitterer <calestyo@scientia.net>
Link: https://lore.kernel.org/linux-btrfs/21cb5e8d059f6e1496a903fa7bfc0a297e2f5370.camel@scientia.net/
CC: stable@vger.kernel.org # 4.19+
Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to read the fs root corresponding with a reloc root we'll
just break out and free the reloc roots. But we remove our current
reloc_root from this list higher up, which means we'll leak this
reloc_root. Fix this by adding ourselves back to the reloc_roots list
so we are properly cleaned up.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
My fsstress modifications coupled with generic/475 uncovered a failure
to mount and replay the log if we hit a orphaned root. We do not want
to replay the log for an orphan root, but it's completely legitimate to
have an orphaned root with a log attached. Fix this by simply skipping
replaying the log. We still need to pin it's root node so that we do
not overwrite it while replaying other logs, as we re-read the log root
at every stage of the replay.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we get an -ENOENT back from btrfs_uuid_iter_rem when iterating the
uuid tree we'll just continue and do btrfs_next_item(). However we've
done a btrfs_release_path() at this point and no longer have a valid
path. So increment the key and go back and do a normal search.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can just abort the transaction here, and in fact do that for every
other failure in this function except these two cases.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Normally when cloning a file range if we find an implicit hole at the end
of the range we assume it is because the NO_HOLES feature is enabled.
However that is not always the case. One well known case [1] is when we
have a power failure after mixing buffered and direct IO writes against
the same file.
In such cases we need to punch a hole in the destination file, and if
the NO_HOLES feature is not enabled, we need to insert explicit file
extent items to represent the hole. After commit 690a5dbfc5
("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning
extents"), we started to insert file extent items representing the hole
with an item size of 0, which is invalid and should be 53 bytes (the size
of a btrfs_file_extent_item structure), resulting in all sorts of
corruptions and invalid memory accesses. This is detected by the tree
checker when we attempt to write a leaf to disk.
The problem can be sporadically triggered by test case generic/561 from
fstests. That test case does not exercise power failure and creates a new
filesystem when it starts, so it does not use a filesystem created by any
previous test that tests power failure. However the test does both
buffered and direct IO writes (through fsstress) and it's precisely that
which is creating the implicit holes in files. That happens even before
the commit mentioned earlier. I need to investigate why we get those
implicit holes to check if there is a real problem or not. For now this
change fixes the regression of introducing file extent items with an item
size of 0 bytes.
Fix the issue by calling btrfs_punch_hole_range() without passing a
btrfs_clone_extent_info structure, which ensures file extent items are
inserted to represent the hole with a correct item size. We were passing
a btrfs_clone_extent_info with a value of 0 for its 'item_size' field,
which was causing the insertion of file extent items with an item size
of 0.
[1] https://www.spinics.net/lists/linux-btrfs/msg75350.html
Reported-by: David Sterba <dsterba@suse.com>
Fixes: 690a5dbfc5 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a tree mod log user no longer needs to use the tree it calls
btrfs_put_tree_mod_seq() to remove itself from the list of users and
delete all no longer used elements of the tree's red black tree, which
should be all elements with a sequence number less then our equals to
the caller's sequence number. However the logic is broken because it
can delete and free elements from the red black tree that have a
sequence number greater then the caller's sequence number:
1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the
tree mod log;
2) The task which got assigned the sequence number 1 calls
btrfs_put_tree_mod_seq();
3) Sequence number 1 is deleted from the list of sequence numbers;
4) The current minimum sequence number is computed to be the sequence
number 2;
5) A task using sequence number 2 is at tree_mod_log_rewind() and gets
a pointer to one of its elements from the red black tree through
a call to tree_mod_log_search();
6) The task with sequence number 1 iterates the red black tree of tree
modification elements and deletes (and frees) all elements with a
sequence number less then or equals to 2 (the computed minimum sequence
number) - it ends up only leaving elements with sequence numbers of 3
and 4;
7) The task with sequence number 2 now uses the pointer to its element,
already freed by the other task, at __tree_mod_log_rewind(), resulting
in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces
a trace like the following:
[16804.546854] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[16804.547451] CPU: 0 PID: 28257 Comm: pool Tainted: G W 5.4.0-rc8-btrfs-next-51 #1
[16804.548059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[16804.548666] RIP: 0010:rb_next+0x16/0x50
(...)
[16804.550581] RSP: 0018:ffffb948418ef9b0 EFLAGS: 00010202
[16804.551227] RAX: 6b6b6b6b6b6b6b6b RBX: ffff90e0247f6600 RCX: 6b6b6b6b6b6b6b6b
[16804.551873] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90e0247f6600
[16804.552504] RBP: ffff90dffe0d4688 R08: 0000000000000001 R09: 0000000000000000
[16804.553136] R10: ffff90dffa4a0040 R11: 0000000000000000 R12: 000000000000002e
[16804.553768] R13: ffff90e0247f6600 R14: 0000000000001663 R15: ffff90dff77862b8
[16804.554399] FS: 00007f4b197ae700(0000) GS:ffff90e036a00000(0000) knlGS:0000000000000000
[16804.555039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16804.555683] CR2: 00007f4b10022000 CR3: 00000002060e2004 CR4: 00000000003606f0
[16804.556336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[16804.556968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[16804.557583] Call Trace:
[16804.558207] __tree_mod_log_rewind+0xbf/0x280 [btrfs]
[16804.558835] btrfs_search_old_slot+0x105/0xd00 [btrfs]
[16804.559468] resolve_indirect_refs+0x1eb/0xc70 [btrfs]
[16804.560087] ? free_extent_buffer.part.19+0x5a/0xc0 [btrfs]
[16804.560700] find_parent_nodes+0x388/0x1120 [btrfs]
[16804.561310] btrfs_check_shared+0x115/0x1c0 [btrfs]
[16804.561916] ? extent_fiemap+0x59d/0x6d0 [btrfs]
[16804.562518] extent_fiemap+0x59d/0x6d0 [btrfs]
[16804.563112] ? __might_fault+0x11/0x90
[16804.563706] do_vfs_ioctl+0x45a/0x700
[16804.564299] ksys_ioctl+0x70/0x80
[16804.564885] ? trace_hardirqs_off_thunk+0x1a/0x20
[16804.565461] __x64_sys_ioctl+0x16/0x20
[16804.566020] do_syscall_64+0x5c/0x250
[16804.566580] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[16804.567153] RIP: 0033:0x7f4b1ba2add7
(...)
[16804.568907] RSP: 002b:00007f4b197adc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[16804.569513] RAX: ffffffffffffffda RBX: 00007f4b100210d8 RCX: 00007f4b1ba2add7
[16804.570133] RDX: 00007f4b100210d8 RSI: 00000000c020660b RDI: 0000000000000003
[16804.570726] RBP: 000055de05a6cfe0 R08: 0000000000000000 R09: 00007f4b197add44
[16804.571314] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4b197add48
[16804.571905] R13: 00007f4b197add40 R14: 00007f4b100210d0 R15: 00007f4b197add50
(...)
[16804.575623] ---[ end trace 87317359aad4ba50 ]---
Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that
have a sequence number equals to the computed minimum sequence number, and
not just elements with a sequence number greater then that minimum.
Fixes: bd989ba359 ("Btrfs: add tree modification log functions")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Having checksum items, either on the checksums tree or in a log tree, that
represent ranges that overlap each other is a sign of a corruption. Such
case confuses the checksum lookup code and can result in not being able to
find checksums or find stale checksums.
So add a check for such case.
This is motivated by a recent fix for a case where a log tree had checksum
items covering ranges that overlap each other due to extent cloning, and
resulted in missing checksums after replaying the log tree. It also helps
detect past issues such as stale and outdated checksums due to overlapping,
commit 27b9a8122f ("Btrfs: fix csum tree corruption, duplicate and
outdated checksums").
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a file that has shared extents (reflinked with other files or
with itself), we can end up logging multiple checksum items that cover
overlapping ranges. This confuses the search for checksums at log replay
time causing some checksums to never be added to the fs/subvolume tree.
Consider the following example of a file that shares the same extent at
offsets 0 and 256Kb:
[ bytenr 13893632, offset 64Kb, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 13893632, offset 0, len 256Kb ]
256Kb 512Kb
When logging the inode, at tree-log.c:copy_items(), when processing the
file extent item at offset 0, we log a checksum item covering the range
13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
64Kb + 64Kb, respectively.
Later when processing the extent item at offset 256K, we log the checksums
for the range from 13893632 to 14155776 (which corresponds to 13893632 +
256Kb). These checksums get merged with the checksum item for the range
from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
So after this we get the two following checksum items in the log tree:
(...)
item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
range start 13631488 end 14155776 length 524288
item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
range start 13959168 end 14024704 length 65536
The first one covers the range from the second one, they overlap.
So far this does not cause a problem after replaying the log, because
when replaying the file extent item for offset 256K, we copy all the
checksums for the extent 13893632 from the log tree to the fs/subvolume
tree, since searching for an checksum item for bytenr 13893632 leaves us
at the first checksum item, which covers the whole range of the extent.
However if we write 64Kb to file offset 256Kb for example, we will
not be able to find and copy the checksums for the last 128Kb of the
extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
After writing 64Kb into file offset 256Kb we get the following extent
layout for our file:
[ bytenr 13893632, offset 64K, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 14155776, offset 0, len 64Kb ]
256Kb 320Kb
[ bytenr 13893632, offset 64Kb, len 192Kb ]
320Kb 512Kb
After fsync'ing the file, if we have a power failure and then mount
the filesystem to replay the log, the following happens:
1) When replaying the file extent item for file offset 320Kb, we
lookup for the checksums for the extent range from 13959168
(13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
to btrfs_lookup_csums_range();
2) btrfs_lookup_csums_range() finds the checksum item that starts
precisely at offset 13959168 (item 7 in the log tree, shown before);
3) However that checksum item only covers 64Kb of data, and not 192Kb
of data;
4) As a result only the checksums for the first 64Kb of data referenced
by the file extent item are found and copied to the fs/subvolume tree.
The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
the corresponding data checksums found and copied to the fs/subvolume
tree.
5) After replaying the log userspace will not be able to read the file
range from 384Kb to 512Kb, because the checksums are missing and
resulting in an -EIO error.
The following steps reproduce this scenario:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
$ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
<power failure>
$ mount /dev/sdc /mnt/sdc
$ md5sum /mnt/sdc/foobar
md5sum: /mnt/sdc/foobar: Input/output error
$ dmesg | tail
[165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
[165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
[165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
[165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
[165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
[165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
[165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
[165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
[165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
[165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
Fix this simply by deleting first any checksums, from the log tree, for the
range of the extent we are logging at copy_items(). This ensures we do not
get checksum items in the log tree that have overlapping ranges.
This is a long time issue that has been present since we have the clone
(and deduplication) ioctl, and can happen both when an extent is shared
between different files and within the same file.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Callers of alloc_test_extent_buffer have not correctly interpreted the
return value as error pointer, as alloc_test_extent_buffer should behave
as alloc_extent_buffer. The self-tests were unaffected but
btrfs_find_create_tree_block could call both functions and that would
cause problems up in the call chain.
Fixes: faa2dbf004 ("Btrfs: add sanity tests for new qgroup accounting code")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value 0 for devs_max means to spread the allocated chunks over all
available devices, eg. stripe for RAID0 or RAID5. This got mistakenly
copied to the RAID1C3/4 profiles. The intention is to have exactly 3 and
4 copies respectively.
Fixes: 47e6f7423b ("btrfs: add support for 3-copy replication (raid1c3)")
Fixes: 8d6fac0087 ("btrfs: add support for 4-copy replication (raid1c4)")
Signed-off-by: David Sterba <dsterba@suse.com>
Argument BTRFS_FILE_EXTENT_INLINE_DATA_START is defined as offsetof(),
which returns type size_t, so we need %zu instead of %lu.
This fixes a build warning on 32-bit ARM:
../fs/btrfs/tree-checker.c: In function 'check_extent_data_item':
../fs/btrfs/tree-checker.c:230:43: warning: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'unsigned int' [-Wformat=]
230 | "invalid item size, have %u expect [%lu, %u)",
| ~~^
| long unsigned int
| %u
Fixes: 153a6d2999 ("btrfs: tree-checker: Check item size before reading file extent type")
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andreas Färber <afaerber@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're rename exchanging two subvols we'll try to lock this lock
twice, which is bad. Just lock once if either of the ino's are subvols.
Fixes: cdd1fedf82 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a BUG_ON(ret < 0) in find_free_extent from
btrfs_cache_block_group. If we fail to allocate our ctl we'll just
panic, which is not good. Instead just go on to another block group.
If we fail to find a block group we don't want to return ENOSPC, because
really we got a ENOMEM and that's the root of the problem. Save our
return from btrfs_cache_block_group(), and then if we still fail to make
our allocation return that ret so we get the right error back.
Tested with inject-error.py from bcc.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Testing with the new fsstress uncovered a pretty nasty deadlock with
lookup and snapshot deletion.
Process A
unlink
-> final iput
-> inode_tree_del
-> synchronize_srcu(subvol_srcu)
Process B
btrfs_lookup <- srcu_read_lock() acquired here
-> btrfs_iget
-> find inode that has I_FREEING set
-> __wait_on_freeing_inode()
We're holding the srcu_read_lock() while doing the iget in order to make
sure our fs root doesn't go away, and then we are waiting for the inode
to finish freeing. However because the free'ing process is doing a
synchronize_srcu() we deadlock.
Fix this by dropping the synchronize_srcu() in inode_tree_del(). We
don't need people to stop accessing the fs root at this point, we're
only adding our empty root to the dead roots list.
A larger much more invasive fix is forthcoming to address how we deal
with fs roots, but this fixes the immediate problem.
Fixes: 76dda93c6a ("Btrfs: add snapshot/subvolume destroy ioctl")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using the NO_HOLES feature if we clone a range that contains a hole
and a temporary ENOSPC happens while dropping extents from the target
inode's range, we can end up failing and aborting the transaction with
-EEXIST or with a corrupt file extent item, that has a length greater
than it should and overlaps with other extents. For example when cloning
the following range from inode A to inode B:
Inode A:
extent A1 extent A2
[ ----------- ] [ hole, implicit, 4MB length ] [ ------------- ]
0 1MB 5MB 6MB
Range to clone: [1MB, 6MB)
Inode B:
extent B1 extent B2 extent B3 extent B4
[ ---------- ] [ --------- ] [ ---------- ] [ ---------- ]
0 1MB 1MB 2MB 2MB 5MB 5MB 6MB
Target range: [1MB, 6MB) (same as source, to make it easier to explain)
The following can happen:
1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
set 'drop_end' to 2MB, meaning it was able to drop only extent B2;
3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 2MB - 1MB =
1MB;
4) We then attempt to insert a file extent item at inode B with a file
offset of 5MB, which is the value of clone_info->file_offset. This
fails with error -EEXIST because there's already an extent at that
offset (extent B4);
5) We abort the current transaction with -EEXIST and return that error
to user space as well.
Another example, for extent corruption:
Inode A:
extent A1 extent A2
[ ----------- ] [ hole, implicit, 10MB length ] [ ------------- ]
0 1MB 11MB 12MB
Inode B:
extent B1 extent B2
[ ----------- ] [ --------- ] [ ----------------------------- ]
0 1MB 1MB 5MB 5MB 12MB
Target range: [1MB, 12MB) (same as source, to make it easier to explain)
1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
set 'drop_end' to 5MB, meaning it was able to drop only extent B2;
3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 5MB - 1MB =
4MB;
4) We then insert a file extent item at inode B with a file offset of 11MB
which is the value of clone_info->file_offset, and a length of 4MB (the
value of 'clone_len'). So we get 2 extents items with ranges that
overlap and an extent length of 4MB, larger then the extent A2 from
inode A (1MB length);
5) After that we end the transaction, balance the btree dirty pages and
then start another or join the previous transaction. It might happen
that the transaction which inserted the incorrect extent was committed
by another task so we end up with extent corruption if a power failure
happens.
So fix this by making sure we attempt to insert the extent to clone at
the destination inode only if we are past dropping the sub-range that
corresponds to a hole.
Fixes: 690a5dbfc5 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The branch of qgroup_rescan_init which is executed from the mount
path prints wrong errors messages. The textual print out in case
BTRFS_QGROUP_STATUS_FLAG_RESCAN/BTRFS_QGROUP_STATUS_FLAG_ON are not
set are transposed. Fix it by exchanging their place.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Because the BLAKE2B code went through a different tree, it was not
available at the time the btrfs part was merged. Now that the Kconfig
symbol exists, add it to the list.
Signed-off-by: David Sterba <dsterba@suse.com>
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.
Switch the btrfs_device_set_…() macro over to use CONFIG_PREEMPTION.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-btrfs@vger.kernel.org
Link: https://lore.kernel.org/r/20191015191821.11479-25-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need support
for time64_t.
In completely unrelated work, I spent time on cleaning up parts of this
file in the past, moving things out into drivers instead.
After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the rest
of it and move it all into drivers.
This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which is
the scsi ioctl handlers. I have patches for those as well, but they need
more testing or possibly a rewrite.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJdsHCdAAoJEJpsee/mABjZtYkP/1JGl3jFv3Iq/5BCdPkaePP1
RtMJRNfURgK3GeuHUui330PvVjI/pLWXU/VXMK2MPTASpJLzYz3uCaZrpVWEMpDZ
+ImzGmgJkITlW1uWU3zOcQhOxTyb1hCZ0Ci+2xn9QAmyOL7prXoXCXDWv3h6iyiF
lwG+nW+HNtyx41YG+9bRfKNoG0ZJ+nkJ70BV6u0acQHXWn7Xuupa9YUmBL87hxAL
6dlJfLTJg6q8QSv/Q6LxslfWk2Ti8OOJZOwtFM5R8Bgl0iUcvshiRCKfv/3t9jXD
dJNvF1uq8z+gracWK49Qsfq5dnZ2ZxHFUo9u0NjbCrxNvWH/sdvhbaUBuJI75seH
VIznCkdxFhrqitJJ8KmxANxG08u+9zSKjSlxG2SmlA4qFx/AoStoHwQXcogJscNb
YIXYKmWBvwPzYu09QFAXdHFPmZvp/3HhMWU6o92lvDhsDwzkSGt3XKhCJea4DCaT
m+oCcoACqSWhMwdbJOEFofSub4bY43s5iaYuKes+c8O261/Dwg6v/pgIVez9mxXm
TBnvCsotq5m8wbwzv99eFqGeJH8zpDHrXxEtRR5KQqMqjLq/OQVaEzmpHZTEuK7n
e/V/PAKo2/V63g4k6GApQXDxnjwT+m0aWToWoeEzPYXS6KmtWC91r4bWtslu3rdl
bN65armTm7bFFR32Avnu
=lgCl
-----END PGP SIGNATURE-----
Merge tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
"As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need
support for time64_t.
In completely unrelated work, I spent time on cleaning up parts of
this file in the past, moving things out into drivers instead.
After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the
rest of it and move it all into drivers.
This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which
is the scsi ioctl handlers. I have patches for those as well, but they
need more testing or possibly a rewrite"
* tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
scsi: sd: enable compat ioctls for sed-opal
pktcdvd: add compat_ioctl handler
compat_ioctl: move SG_GET_REQUEST_TABLE handling
compat_ioctl: ppp: move simple commands into ppp_generic.c
compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
compat_ioctl: unify copy-in of ppp filters
tty: handle compat PPP ioctls
compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
compat_ioctl: handle SIOCOUTQNSD
af_unix: add compat_ioctl support
compat_ioctl: reimplement SG_IO handling
compat_ioctl: move WDIOC handling into wdt drivers
fs: compat_ioctl: move FITRIM emulation into file systems
gfs2: add compat_ioctl support
compat_ioctl: remove unused convert_in_user macro
compat_ioctl: remove last RAID handling code
compat_ioctl: remove /dev/raw ioctl translation
compat_ioctl: remove PCI ioctl translation
compat_ioctl: remove joystick ioctl translation
...
After previous patches removing bdev being passed around to set it to
bio, it has become unused in submit_extent_page. So it now has "only" 13
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
We can now remove the bdev from extent_map. Previous patches made sure
that bio_set_dev is correctly in all places and that we don't need to
grab it from latest_bdev or pass it around inside the extent map.
Signed-off-by: David Sterba <dsterba@suse.com>
bio_set_dev sets a bdev to a bio and is not only setting a pointer bug
also changing some state bits if there was a different bdev set before.
This is one thing that's not needed.
Another thing is that setting a bdev at bio allocation time is too early
and actually does not work with plain redundancy profiles, where each
time we submit a bio to a device, the bdev is set correctly.
In many places the bio bdev is set to latest_bdev that seems to serve as
a stub pointer "just to put something to bio". But we don't have to do
that.
Where do we know which bdev to set:
* for regular IO: submit_stripe_bio that's called by btrfs_map_bio
* repair IO: repair_io_failure, read or write from specific device
* super block write (using buffer_heads but uses raw bdev) and barriers
* scrub: this does not use all regular IO paths as it needs to reach all
copies, verify and fixup eventually, and for that all bdev management
is independent
* raid56: rbio_add_io_page, for the RMW write
* integrity-checker: does it's own low-level block tracking
Signed-off-by: David Sterba <dsterba@suse.com>
This is preparatory patch to remove @bdev parameter from
submit_extent_page. It can't be removed completely, because the cgroups
need it for wbc when initializing the bio
wbc_init_bio
bio_associate_blkg_from_css
dereference bdev->bi_disk->queue
The bdev pointer is the same as latest_bdev, thus no functional change.
We can retrieve it from fs_devices that's reachable through several
dereferences. The local variable shadows the parameter, but that's only
temporary.
Signed-off-by: David Sterba <dsterba@suse.com>
Testing with the new fsstress support for subvolumes uncovered a pretty
bad problem with rename exchange on subvolumes. We're modifying two
different subvolumes, but we only start the transaction on one of them,
so the other one is not added to the dirty root list. This is caught by
btrfs_cow_block() with a warning because the root has not been updated,
however if we do not modify this root again we'll end up pointing at an
invalid root because the root item is never updated.
Fix this by making sure we add the destination root to the trans list,
the same as we do with normal renames. This fixes the corruption.
Fixes: cdd1fedf82 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
set the block group to RO mode and then wait for any ongoing writes into
extents of the block group to complete. While doing that wait we overwrite
the value of the variable 'ret' and can break out of the loop if an error
happens without turning the block group back into RW mode. So what happens
is the following:
1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
to RO mode (its ->ro field set to 1 or incremented to some value > 1);
2) Then btrfs_wait_ordered_roots() returns a value > 0;
3) Then if either joining or committing the transaction fails, we break
out of the loop wihtout calling btrfs_dec_block_group_ro(), leaving
the block group in RO mode forever.
To fix this, just remove the code that waits for ongoing writes to extents
of the block group, since it's not needed because in the initial setup
phase of a device replace operation, before starting to find all chunks
and their extents, we set the target device for replace while holding
fs_info->dev_replace->rwsem, which ensures that after releasing that
semaphore, any writes into the source device are made to the target device
as well (__btrfs_map_block() guarantees that). So while at
scrub_enumerate_chunks() we only need to worry about finding and copying
extents (from the source device to the target device) that were written
before we started the device replace operation.
Fixes: f0e9b7d640 ("Btrfs: fix race setting block group readonly during device replace")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/072 with only one online CPU, it has a pretty high
chance to fail:
btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
- output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
--- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800
+++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800
@@ -1,2 +1,3 @@
QA output created by 072
Silence is golden
+Scrub find errors in "-m dup -d single" test
...
And with the following call trace:
BTRFS info (device dm-5): scrub: started on devid 1
------------[ cut here ]------------
BTRFS: Transaction aborted (error -27)
WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
Call Trace:
__btrfs_end_transaction+0xdb/0x310 [btrfs]
btrfs_end_transaction+0x10/0x20 [btrfs]
btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
scrub_enumerate_chunks+0x264/0x940 [btrfs]
btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
do_vfs_ioctl+0x636/0xaa0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x43/0x50
do_syscall_64+0x79/0xe0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
---[ end trace 166c865cec7688e7 ]---
[CAUSE]
The error number -27 is -EFBIG, returned from the following call chain:
btrfs_end_transaction()
|- __btrfs_end_transaction()
|- btrfs_create_pending_block_groups()
|- btrfs_finish_chunk_alloc()
|- btrfs_add_system_chunk()
This happens because we have used up all space of
btrfs_super_block::sys_chunk_array.
The root cause is, we have the following bad loop of creating tons of
system chunks:
1. The only SYSTEM chunk is being scrubbed
It's very common to have only one SYSTEM chunk.
2. New SYSTEM bg will be allocated
As btrfs_inc_block_group_ro() will check if we have enough space
after marking current bg RO. If not, then allocate a new chunk.
3. New SYSTEM bg is still empty, will be reclaimed
During the reclaim, we will mark it RO again.
4. That newly allocated empty SYSTEM bg get scrubbed
We go back to step 2, as the bg is already mark RO but still not
cleaned up yet.
If the cleaner kthread doesn't get executed fast enough (e.g. only one
CPU), then we will get more and more empty SYSTEM chunks, using up all
the space of btrfs_super_block::sys_chunk_array.
[FIX]
Since scrub/dev-replace doesn't always need to allocate new extent,
especially chunk tree extent, so we don't really need to do chunk
pre-allocation.
To break above spiral, here we introduce a new parameter to
btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
need extra chunk pre-allocation.
For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
@do_chunk_alloc=false.
This should keep unnecessary empty chunks from popping up for scrub.
Also, since there are two parameters for btrfs_inc_block_group_ro(),
add more comment for it.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
struct btrfs_fs_devices::rotating currently is declared as an integer
variable but only used as a boolean.
Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
struct btrfs_fs_devices::seeding currently is declared as an integer
variable but only used as a boolean.
Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type name is misleading, a single entry is named 'cache' while this
normally means a collection of objects. Rename that everywhere. Also the
identifier was quite long, making function prototypes harder to format.
Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For read_one_block_group(), its only caller has already got the item key
to search next block group item.
So we can use that key directly without doing our own convertion on
stack.
Also, since that key used in btrfs_read_block_groups() is vital for
block group item search, add 'const' keyword for that parameter to
prevent read_one_block_group() to modify it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor the work inside the loop of btrfs_read_block_groups() into one
separate function, read_one_block_group().
This allows read_one_block_group to be reused for later BG_TREE feature.
The refactor does the following extra fix:
- Use btrfs_fs_incompat() to replace open-coded feature check
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A nice writeup of the LKMM (Linux Kernel Memory Model) rules for access
once policies can be found here
https://lwn.net/Articles/799218/#Access-Marking%20Policies .
The locked and unlocked access to eb::blocking_writers should be
annotated accordingly, following this:
Writes:
- locked write must use ONCE, may use plain read
- unlocked write must use ONCE
Reads:
- unlocked read must use ONCE
- locked read may use plain read iff not mixed with unlocked read
- unlocked read then locked must use ONCE
There's one difference on the assembly level, where
btrfs_tree_read_lock_atomic and btrfs_try_tree_read_lock used the cached
value and did not reevaluate it after taking the lock. This could have
missed some opportunities to take the lock in case blocking writers
changed between the calls, but the window is just a few instructions
long. As this is in try-lock, the callers handle that.
Signed-off-by: David Sterba <dsterba@suse.com>
The increment and decrement was inherited from previous version that
used atomics, switched in commit 06297d8cef ("btrfs: switch
extent_buffer blocking_writers from atomic to int"). The only possible
values are 0 and 1 so we can set them directly.
The generated assembly (gcc 9.x) did the direct value assignment in
btrfs_set_lock_blocking_write (asm diff after change in 06297d8cef):
5d: test %eax,%eax
5f: je 62 <btrfs_set_lock_blocking_write+0x22>
61: retq
- 62: lock incl 0x44(%rdi)
- 66: add $0x50,%rdi
- 6a: jmpq 6f <btrfs_set_lock_blocking_write+0x2f>
+ 62: movl $0x1,0x44(%rdi)
+ 69: add $0x50,%rdi
+ 6d: jmpq 72 <btrfs_set_lock_blocking_write+0x32>
The part in btrfs_tree_unlock did a decrement because
BUG_ON(blockers > 1) is probably not a strong hint for the compiler, but
otherwise the output looks safe:
- lock decl 0x44(%rdi)
+ sub $0x1,%eax
+ mov %eax,0x44(%rdi)
Signed-off-by: David Sterba <dsterba@suse.com>
There are two ifs that use eb::blocking_writers. As this is a variable
modified inside and outside of locks, we could minimize number of
accesses to avoid problems with getting different results at different
times.
The access here is locked so this can only race with btrfs_tree_unlock
that sets blocking_writers to 0 without lock and unsets the lock owner.
The first branch is taken only if the same thread already holds the
lock, the second if checks for blocking writers. Here we'd either unlock
and wait, or proceed. Both are valid states of the locking protocol.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
When there are no raid1c3 or raid1c4 block groups left after balance
(either convert or with other filters applied), remove the incompat bit.
This is already done for RAID56, do the same for RAID1C34.
Signed-off-by: David Sterba <dsterba@suse.com>
The new raid1c3 and raid1c4 profiles are backward incompatible and the
name shall be 'raid1c34', the status can be found in the global
supported features in /sys/fs/btrfs/features or in the per-filesystem
directory.
Signed-off-by: David Sterba <dsterba@suse.com>
Add new block group profile to store 4 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the 2-
and 3-copy RAID1.
The minimum number of devices is 4, the maximum number of devices/chunks
that can be lost/damaged is 3. There is no comparable traditional RAID
level, the profile is added for future needs to accompany triple-parity
and beyond.
Signed-off-by: David Sterba <dsterba@suse.com>
Add new block group profile to store 3 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the
2-copy RAID1.
The minimum number of devices is 3, the maximum number of devices/chunks
that can be lost/damaged is 2. Like RAID6 but with 33% space
utilization.
Signed-off-by: David Sterba <dsterba@suse.com>
In commit "Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios",
cow_file_range_async gained wbc as a parameter and this makes passing
write flags redundant. Set it inside the function and remove the
parameter.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__extent_writepage reads write flags from wbc and passes both to
__extent_writepage_io. This makes write_flags redundant and we can
remove it.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Backreference walking, which is used by send to figure if it can issue
clone operations instead of write operations, can be very slow and use
too much memory when extents have many references. This change simply
skips backreference walking when an extent has more than 64 references,
in which case we fallback to a write operation instead of a clone
operation. This limit is conservative and in practice I observed no
signicant slowdown with up to 100 references and still low memory usage
up to that limit.
This is a temporary workaround until there are speedups in the backref
walking code, and as such it does not attempt to add extra interfaces or
knobs to tweak the threshold.
Reported-by: Atemu <atemu.main@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For send we currently skip clone operations when the source and
destination files are the same. This is so because clone didn't support
this case in its early days, but support for it was added back in May
2013 by commit a96fbc7288 ("Btrfs: allow file data clone within a
file"). This change adds support for it.
Example:
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt/sdd
$ xfs_io -f -c "pwrite -S 0xab -b 64K 0 64K" /mnt/sdd/foobar
$ xfs_io -c "reflink /mnt/sdd/foobar 0 64K 64K" /mnt/sdd/foobar
$ btrfs subvolume snapshot -r /mnt/sdd /mnt/sdd/snap
$ mkfs.btrfs -f /dev/sde
$ mount /dev/sde /mnt/sde
$ btrfs send /mnt/sdd/snap | btrfs receive /mnt/sde
Without this change file foobar at the destination has a single 128Kb
extent:
$ filefrag -v /mnt/sde/snap/foobar
Filesystem type is: 9123683e
File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 31: 0.. 31: 32: last,unknown_loc,delalloc,eof
/mnt/sde/snap/foobar: 1 extent found
With this we get a single 64Kb extent that is shared at file offsets 0
and 64K, just like in the source filesystem:
$ filefrag -v /mnt/sde/snap/foobar
Filesystem type is: 9123683e
File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 15: 3328.. 3343: 16: shared
1: 16.. 31: 3328.. 3343: 16: 3344: last,shared,eof
/mnt/sde/snap/foobar: 2 extents found
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When deleting large files (which cross block group boundary) with
discard mount option, we find some btrfs_discard_extent() calls only
trimmed part of its space, not the whole range:
btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
type: bbio->map_type, in above case, it's SINGLE DATA.
start: Logical address of this trim
len: Logical length of this trim
trimmed: Physically trimmed bytes
ratio: trimmed / len
Thus leaving some unused space not discarded.
[CAUSE]
When discard mount option is specified, after a transaction is fully
committed (super block written to disk), we begin to cleanup pinned
extents in the following call chain:
btrfs_commit_transaction()
|- btrfs_finish_extent_commit()
|- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
|- btrfs_discard_extent()
However, pinned extents are recorded in an extent_io_tree, which can
merge adjacent extent states.
When a large file gets deleted and it has adjacent file extents across
block group boundary, we will get a large merged range like this:
|<--- BG1 --->|<--- BG2 --->|
|//////|<-- Range to discard --->|/////|
To discard that range, we have the following calls:
btrfs_discard_extent()
|- btrfs_map_block()
| Returned bbio will end at BG1's end. As btrfs_map_block()
| never returns result across block group boundary.
|- btrfs_issuse_discard()
Issue discard for each stripe.
So we will only discard the range in BG1, not the remaining part in BG2.
Furthermore, this bug is not that reliably observed, for above case, if
there is no other extent in BG2, BG2 will be empty and btrfs will trim
all space of BG2, covering up the bug.
[FIX]
- Allow __btrfs_map_block_for_discard() to modify @length parameter
btrfs_map_block() uses its @length paramter to notify the caller how
many bytes are mapped in current call.
With __btrfs_map_block_for_discard() also modifing the @length,
btrfs_discard_extent() now understands when to do extra trim.
- Call btrfs_map_block() in a loop until we hit the range end Since we
now know how many bytes are mapped each time, we can iterate through
each block group boundary and issue correct trim for each range.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old code goes:
offset = logical - em->start;
length = min_t(u64, em->len - offset, length);
Where @length calculation is dependent on offset, it can take reader
several more seconds to find it's just the same code as:
offset = logical - em->start;
length = min_t(u64, em->start + em->len - logical, length);
Use above code to make the length calculate independent from other
variable, thus slightly increase the readability.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In check_extent_data_item(), we read file extent type without verifying
if the item size is valid.
Add such check to ensure the file extent type we read is correct.
The check is not as accurate as we need to cover both inline and regular
extents, so it only checks if the item size is larger or equal to inline
header.
So the existing size checks on inline/regular extents are still needed.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The "&fs_info->dev_replace.rwsem" and "&dev_replace->rwsem" refer to
the same lock but Smatch is not clever enough to figure that out so it
leads to static checker warnings. It's better to use it consistently
anyway.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The backup_root_index member stores the index at which the backup root
should be saved upon next transaction commit. However, there is a
small deviation from this behavior in the form of a check in
backup_super_roots which checks if current root generation equals to the
generation of the previous root. This can trigger in the following
scenario:
slot0: gen-2
slot1: gen-1
slot2: gen
slot3: unused
Now suppose slot3 (which is also the root specified in the super block)
is corrupted hence init_tree_roots chooses to use the backup root at
slot2, meaning read_backup_root will read slot2 and assign the
superblock generation to gen-1. Despite this backup_root_index will
point at slot3 because its init happens in init_backup_root_slot, long
before any parsing of the backup roots occur. Then on next transaction
start, gen-1 will be incremented by 1 making the root's generation
equal gen. Subsequently, on transaction commit the following check
triggers:
if (btrfs_backup_tree_root_gen(root_backup) ==
btrfs_header_generation(info->tree_root->node))
This causes the 'next_backup', which is the index at which the backup is
going to be written to, to set to last_backup, which will be slot2.
All of this is a very confusing way of expressing the following
invariant:
Always write a backup root at the index following the last used backup
root.
This commit streamlines this logic by setting backup_root_index to the
next index after the one used for mount.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old name name was an awful misnomer because it didn't really find
the oldest super backup per-se but rather its slot. For example if we
have:
slot0: gen - 2
slot1: gen - 1
slot2: gen
slot3: empty
init_backup_root_slot will return slot3 and not slot0.
The new name is more appropriate since the function doesn't care whether
there is a valid backup in the returned slot or not.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function has been superseded by previous commits and is no longer
used so just remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the filesystem is not well formed and no trees are loaded it's
pointless holding the objectid_mutex. Just remove its usage.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The code responsible for reading and initializing tree roots is
scattered in open_ctree among 2 labels, emulating a loop. This is rather
confusing to reason about. Instead, factor the code to a new function,
init_tree_roots which implements the same logical flow.
There are a couple of notable differences, namely:
* Instead of using next_backup_root it's using the newly introduced
read_backup_root.
* If read_backup_root returns an error init_tree_roots propagates the
error and there is no special handling of that case e.g. the code jumps
straight to 'fail_tree_roots' label. The old code, however, was
(erroneously) jumping to 'fail_block_groups' label if next_backup_root
did fail, this was unnecessary since the tree roots init logic doesn't
modify the state of block groups.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function will replace next_root_backup with a much saner/cleaner
interface.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's no longer needed following cleanups around find_newest_backup_root
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Backup roots are always written in a circular manner. By definition we
can only ever have 1 backup root whose generation equals to that of the
superblock. Hence, the 'if' in the for loop will trigger at most once.
This is sufficient to return the newest backup root.
Furthermore the newest_gen parameter is always set to the generation of
the superblock. This value can be obtained from the fs_info.
This patch removes the unnecessary code dealing with the wraparound
case and makes 'newest_gen' a local variable.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The inode delalloc mutex was added a long time ago by commit f248679e86
("Btrfs: add a delalloc mutex to inodes for delalloc reservations"), and
the reason for its introduction is not very clear from the change log. It
claims it solves bogus warnings from lockdep, however it lacks an example
report/warning from lockdep, or any explanation.
Since we have enough concurrentcy protection from the locks of the space
info and block reserve objects, and such lockdep warnings don't seem to
exist anymore (at least on a 5.3 kernel I couldn't get them with fstests,
ltp, fs_mark, etc), remove it, simplifying things a bit and decreasing
the size of the btrfs_inode structure. With some quick fio tests doing
direct IO and mmap writes I couldn't observe any significant performance
increase either (direct IO writes that don't increase the file's size
don't hold the inode's lock for their entire duration and mmap writes
don't hold the inode's lock at all), which are the only type of writes
that could see any performance gain due to less serialization.
Review feedback from Josef:
The problem was taking the i_mutex in mmap, which is how I was
protecting delalloc reservations originally. The delalloc mutex didn't
come with all of the other dependencies. That's what the lockdep
messages were about, removing the lock isn't going to make them appear
again.
We _had_ to lock around this because we used to do tricks to keep from
over-reserving, and if we didn't serialize delalloc reservations we'd
end up with ugly accounting problems when we tried to clean things up.
However with my recentish changes this isn't the case anymore. Every
operation is responsible for reserving its space, and then adding it to
the inode. Then cleaning up is straightforward and can't be mucked up
by other users. So we no longer need the delalloc mutex to safe us from
ourselves.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is not used anymore since commit 957780eb27 ("Btrfs: introduce
ticketed enospc infrastructure"), so just remove it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_statfs() we cache fs_info::space_info in a local variable only
to use it once in a list_for_each_rcu() statement.
Not only is the local variable unnecessary it even makes the code harder
to follow as it's not clear which list it is iterating.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The on-disk format of block group item makes use of the key that stores
the offset and length. This is further used in the code, although this
makes thing harder to understand. The key is also packed so the
offset/length is not properly aligned as u64.
Add start (key.objectid) and length (key.offset) members to block group
and remove the embedded key. When the item is searched or written, a
local variable for key is used.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Accessors defined by BTRFS_SETGET_FUNCS take a raw extent buffer and
manipulate the items there, there's no special prefix required. The
block group accessors had _disk_ because previously the names were
occupied by the on-stack accessors. As this has been addressed in the
previous patch, we can now unify the naming.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All accessors defined by BTRFS_SETGET_STACK_FUNCS contain _stack_ in the
name, the block group ones were not following that scheme, so let's
switch them.
Signed-off-by: David Sterba <dsterba@suse.com>
The members ::used and ::flags are now in the block group cache
structure, the last one is chunk_objectid, but that's set to a fixed
value and otherwise unused. The item is constructed from a local
variable before write, so we can remove the embedded one from block
group.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The flags are read from the item that's embedded to block group struct,
but the item will be removed. Use the ::flags after read and before
write.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For unknown reasons, the member 'used' in the block group struct is
stored in the b-tree item and accessed everywhere using the special
accessor helper. Let's unify it and make it a regular member and only
update the item before writing it to the tree.
The item is still being used for flags and chunk_objectid, there's some
duplication until the item is removed in following patches.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The last user of btrfs_bio::flags was removed in commit 326e1dbb57
("block: remove management of bi_remaining when restoring original
bi_end_io"), remove it.
(Tagged for stable as the structure is heavily used and space savings
are desirable.)
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently all the checksum algorithms generate a fixed size digest size
and we use it. The on-disk format can hold up to BTRFS_CSUM_SIZE bytes
and BLAKE2b produces digest of 512 bits by default. We can't do that and
will use the blake2b-256, this needs to be passed to the crypto API.
Separate that from the base algorithm name and add a member to request
specific driver, in this case with the digest size.
The only place that uses the driver name is the crypto API setup.
Signed-off-by: David Sterba <dsterba@suse.com>
Show the used driver for the checksum algorithm for the filesystem in
sysfs file /sys/fs/btrfs/UUID/features/checksum, eg.
crc32c (crc32c-generic)
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Export supported checksum algorithms via sysfs in the list of static
features:
/sys/fs/btrfs/features/supported_checksums
Space spearated list of checksum algorithm names.
Co-developed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Add sha256 to the list of possible checksumming algorithms used by BTRFS.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add xxhash64 to the list of possible checksumming algorithms used by
BTRFS.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To remove use of extent_map::bdev we need to find a replacement, and the
latest_bdev is the only one we can use here, because inode::i_bdev and
superblock::s_bdev are NULL.
The DIO code uses bdev in two places:
* to read blocksize to perform alignment checks in
do_blockdev_direct_IO, but we do them in btrfs code before any call to
DIO
* in the following call chain:
do_direct_IO
get_more_blocks
sdio->get_block() <-- this is btrfs_get_blocks_direct
subsequently the map_bh->b_dev member is used in clean_bdev_aliases
and dio_new_bio to set the bio's bdev to that of the buffer_head.
However, because we have provided a submit function dio_bio_submit
calls our submission function and ignores the bdev.
So it's safe to pass any valid bdev that's used within the filesystem.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a preparatory patch for removing extent_map::bdev. There's some
history behind the code so this is only precaution to catch if things
break before the actual removal happens.
Logically, comparing a raw low-level block device (bdev) does not make
sense for extent maps (high-level objects). This had no effect in
practice but was quite confusing in the code. The lookup_map is set iff
EXTENT_FLAG_FS_MAPPING is set.
The two pointers were stored in the same bytes and used potentially in
two meanings. Now they're split, so the asserts are in place to check
that the condition will not change.
The lookup map pointer misused bdev, this has been changed in commit
95617d6932 ("btrfs: cleanup, stop casting for extent_map->lookup
everywhere") to the explicit type. But the semantics hasn't changed and
bdev was not actually used to decide if maps are mergeable.
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of checking if we've read a BTRFS_CHUNK_ITEM_KEY from disk and
then process it we could just bail out early if the read disk key wasn't
a BTRFS_CHUNK_ITEM_KEY.
This removes a level of indentation and makes the code nicer to read.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_may_alloc_data_chunk() we're checking if the chunk type is of
type BTRFS_BLOCK_GROUP_DATA and if it is we process it.
Instead of checking if the chunk type is a BTRFS_BLOCK_GROUP_DATA chunk
we can negate the check and bail out early if it isn't.
This makes the code a bit more readable.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In lock_stripe_add() we're caching the bucket for the stripe hash table
just for a single call to dereference the stripe hash.
If we just directly call rbio_bucket() we can safe the pointless local
variable.
Also move the dereferencing of the stripe hash outside of the variable
declaration block to not break over the 80 characters limit.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In lock_stripe_add() we're traversing the stripe hash list and check if
the current list element's raid_map equals is equal to the raid bio's
raid_map. If both are equal we continue processing.
If we'd check for inequality instead of equality we can reduce one level
of indentation.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using an input pointer parameter as the return value and have
an int as the return type of find_desired_extent, rework the function to
directly return the found offset. Doing that the 'ret' variable in
btrfs_llseek_file can be removed. Additional (subjective) benefit is
that btrfs' llseek function now resemebles those of the other major
filesystems.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Handle SEEK_END/SEEK_CUR in a single 'default' case by directly
returning from generic_file_llseek. This makes the 'out' label
redundant. Finally return directly the vale from vfs_setpos. No
semantic changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Modifying the file position is done on a per-file basis. This renders
holding the inode lock for writing useless and makes the performance of
concurrent llseek's abysmal.
Fix this by holding the inode for read. This provides protection against
concurrent truncates and find_desired_extent already includes proper
extent locking for the range which ensures proper locking against
concurrent writes. SEEK_CUR and SEEK_END can be done lockessly.
The former is synchronized by file::f_lock spinlock. SEEK_END is not
synchronized but atomic, but that's OK since there is not guarantee that
SEEK_END will always be at the end of the file in the face of tail
modifications.
This change brings ~82% performance improvement when doing a lot of
parallel fseeks. The workload essentially does:
for (d=0; d<num_seek_read; d++)
{
/* offset %= 16777216; */
fseek (f, 256 * d % 16777216, SEEK_SET);
fread (buffer, 64, 1, f);
}
Without patch:
num workprocesses = 16
num fseek/fread = 8000000
step = 256
fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
real 0m41.412s
user 0m28.777s
sys 2m16.510s
With patch:
num workprocesses = 16
num fseek/fread = 8000000
step = 256
fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
real 0m11.479s
user 0m27.629s
sys 0m21.040s
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can infer the ops from the type that is now passed to all functions
that would need it, this makes workspace_manager::ops redundant and can
be removed.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace indirect calls to free_workspace by switch and calls to the
specific callbacks. This is mainly to get rid of the indirection due to
spectre vulnerability mitigations.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can infer the workspace_manager from type and the type will be used
in the following patch to call a common helper for free_workspace.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace indirect calls to alloc_workspace by switch and calls to the
specific callbacks. This is mainly to get rid of the indirection due to
spectre vulnerability mitigations.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can infer the workspace_manager from type and the type will be used
in the following patch to call a common helper for alloc_workspace.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to get_workspace, majority of the callbacks is trivial, we don't
gain anything by the indirection, so replace them by a switch function.
Trivial callback implementations use the helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Majority of the callbacks is trivial, we don't gain anything by the
indirection, so replace them by a switch function.
ZLIB needs to adjust level in the callback and ZSTD workspace management
is complex, the rest is call to the helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The indirect calls will be replaced by a switch in compression.c.
(Switch is faster than indirect calls with when Spectre mitigations are
enabled).
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace loop calling to all algos with a list of direct calls to the
cleanup manager callback. When that becomes trivial it is replaced by
direct call to the helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the access to the workspace structures, we can look it up together
with the compression ops inside the workspace manager cleanup helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace loop calling to all algos with a list of direct calls to the
init manager callback. When that becomes trivial it is replaced by
direct call to the helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the access to the workspace structures, we can look it up together
with the compression ops inside the workspace manager init helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a lot of indirection when the generic code calls into
algo-specific callbacks to reach the private workspace manager structure
and back to the generic code.
To simplify that, export the workspace manager for heuristic, LZO and
ZLIB, while ZSTD is going to use it's own manager.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The indirect calls bring some overhead due to spectre vulnerability
mitigations. The number of cases is small and below the threshold
(10-20) where indirect call would be better.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Export compress_pages, decompress_bio and decompress callbacks for all
compression algos. The indirect calls will be replaced by a switch.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When free'ing extents in a block group we check to see if the block
group is not cached, and then cache it if we need to. However we'll
just carry on as long as we're loading the cache. This is problematic
because we are dirtying the block group here. If we are fast enough we
could do a transaction commit and clear the free space cache while we're
still loading the space cache in another thread. This truncates the
free space inode, which will keep it from loading the space cache.
Fix this by using the btrfs_block_group_cache_done helper so that we try
to load the space cache unconditionally here, which will result in the
caller waiting for the fast caching to complete and keep us from
truncating the free space inode.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While testing 5.2 we ran into the following panic
[52238.017028] BUG: kernel NULL pointer dereference, address: 0000000000000001
[52238.105608] RIP: 0010:drop_buffers+0x3d/0x150
[52238.304051] Call Trace:
[52238.308958] try_to_free_buffers+0x15b/0x1b0
[52238.317503] shrink_page_list+0x1164/0x1780
[52238.325877] shrink_inactive_list+0x18f/0x3b0
[52238.334596] shrink_node_memcg+0x23e/0x7d0
[52238.342790] ? do_shrink_slab+0x4f/0x290
[52238.350648] shrink_node+0xce/0x4a0
[52238.357628] balance_pgdat+0x2c7/0x510
[52238.365135] kswapd+0x216/0x3e0
[52238.371425] ? wait_woken+0x80/0x80
[52238.378412] ? balance_pgdat+0x510/0x510
[52238.386265] kthread+0x111/0x130
[52238.392727] ? kthread_create_on_node+0x60/0x60
[52238.401782] ret_from_fork+0x1f/0x30
The page we were trying to drop had a page->private, but had no
page->mapping and so called drop_buffers, assuming that we had a
buffer_head on the page, and then panic'ed trying to deref 1, which is
our page->private for data pages.
This is happening because we're truncating the free space cache while
we're trying to load the free space cache. This isn't supposed to
happen, and I'll fix that in a followup patch. However we still
shouldn't allow those sort of mistakes to result in messing with pages
that do not belong to us. So add the page->mapping check to verify that
we still own this page after dropping and re-acquiring the page lock.
This page being unlocked as:
btrfs_readpage
extent_read_full_page
__extent_read_full_page
__do_readpage
if (!nr)
unlock_page <-- nr can be 0 only if submit_extent_page
returns an error
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add callchain ]
Signed-off-by: David Sterba <dsterba@suse.com>
In the fixup worker, if we fail to mark the range as delalloc in the io
tree, we must release the previously reserved metadata, as well as update
the outstanding extents counter for the inode, otherwise we leak metadata
space.
In pratice we can't return an error from btrfs_set_extent_delalloc(),
which is just a wrapper around __set_extent_bit(), as for most errors
__set_extent_bit() does a BUG_ON() (or panics which hits a BUG_ON() as
well) and returning an -EEXIST error doesn't happen in this case since
the exclusive bits parameter always has a value of 0 through this code
path. Nevertheless, just fix the error handling in the fixup worker,
in case one day __set_extent_bit() can return an error to this code
path.
Fixes: f3038ee3a3 ("btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a buffered write it's possible to leave the subv_writers
counter of the root, used for synchronization between buffered nocow
writers and snapshotting. This happens in an exceptional case like the
following:
1) We fail to allocate data space for the write, since there's not
enough available data space nor enough unallocated space for allocating
a new data block group;
2) Because of that failure, we try to go to NOCOW mode, which succeeds
and therefore we set the local variable 'only_release_metadata' to true
and set the root's sub_writers counter to 1 through the call to
btrfs_start_write_no_snapshotting() made by check_can_nocow();
3) The call to btrfs_copy_from_user() returns zero, which is very unlikely
to happen but not impossible;
4) No pages are copied because btrfs_copy_from_user() returned zero;
5) We call btrfs_end_write_no_snapshotting() which decrements the root's
subv_writers counter to 0;
6) We don't set 'only_release_metadata' back to 'false' because we do
it only if 'copied', the value returned by btrfs_copy_from_user(), is
greater than zero;
7) On the next iteration of the while loop, which processes the same
page range, we are now able to allocate data space for the write (we
got enough data space released in the meanwhile);
8) After this if we fail at btrfs_delalloc_reserve_metadata(), because
now there isn't enough free metadata space, or in some other place
further below (prepare_pages(), lock_and_cleanup_extent_if_need(),
btrfs_dirty_pages()), we break out of the while loop with
'only_release_metadata' having a value of 'true';
9) Because 'only_release_metadata' is 'true' we end up decrementing the
root's subv_writers counter to -1 (through a call to
btrfs_end_write_no_snapshotting()), and we also end up not releasing the
data space previously reserved through btrfs_check_data_free_space().
As a consequence the mechanism for synchronizing NOCOW buffered writes
with snapshotting gets broken.
Fix this by always setting 'only_release_metadata' to false at the start
of each iteration.
Fixes: 8257b2dc3c ("Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume")
Fixes: 7ee9e4405f ("Btrfs: check if we can nocow if we don't have data space")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some functions are doing some unnecessary indirection to reach the
btrfs_fs_info struct. Change these functions to receive a btrfs_fs_info
struct instead of a *file.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need int argument bool shall do in free_root_pointers(). And
rename the argument as it confused two people.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The compression type upper limit constant is the same as the last value
and this is confusing. In order to keep coding style consistent, use
BTRFS_NR_COMPRESS_TYPES as the total number that follows the idom of
'NR' being one more than the last value.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use enum to replace macro definitions of extent types.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
DEFINE_HASHTABLE itself has already included initialization code,
we don't have to call hash_init() again, so remove it.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is used only during the final phase of freespace cache
writeout. This is necessary since using the plain btrfs_join_transaction
api is deadlock prone. The deadlock looks like:
T1:
btrfs_commit_transaction
commit_cowonly_roots
btrfs_write_dirty_block_groups
btrfs_wait_cache_io
__btrfs_wait_cache_io
btrfs_wait_ordered_range <-- Triggers ordered IO for freespace
inode and blocks transaction commit
until freespace cache writeout
T2: <-- after T1 has triggered the writeout
finish_ordered_fn
btrfs_finish_ordered_io
btrfs_join_transaction <--- this would block waiting for current
transaction to commit, but since trans
commit is waiting for this writeout to
finish
The special purpose functions prevents it by simply skipping the "wait
for writeout" since it's guaranteed the transaction won't proceed until
we are done.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using an ASSERT in btrfs_pin_extent allows to more stringently observe
whether the function is called under a transaction or not.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper is trivial and we can understand what the atomic_inc on
something named refs does.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During a cyclic writeback, extent_write_cache_pages() uses done_index
to update the writeback_index after the current run is over. However,
instead of current index + 1, it gets to to the current index itself.
Unfortunately, this, combined with returning on EOF instead of looping
back, can lead to the following pathlogical behavior.
1. There is a single file which has accumulated enough dirty pages to
trigger balance_dirty_pages() and the writer appending to the file
with a series of short writes.
2. balance_dirty_pages kicks in, wakes up background writeback and sleeps.
3. Writeback kicks in and the cursor is on the last page of the dirty
file. Writeback is started or skipped if already in progress. As
it's EOF, extent_write_cache_pages() returns and the cursor is set
to done_index which is pointing to the last page.
4. Writeback is done. Nothing happens till balance_dirty_pages
finishes, at which point we go back to #1.
This can almost completely stall out writing back of the file and keep
the system over dirty threshold for a long time which can mess up the
whole system. We encountered this issue in production with a package
handling application which can reliably reproduce the issue when
running under tight memory limits.
Reading the comment in the error handling section, this seems to be to
avoid accidentally skipping a page in case the write attempt on the
page doesn't succeed. However, this concern seems bogus.
On each page, the code either:
* Skips and moves onto the next page.
* Fails issue and sets done_index to index + 1.
* Successfully issues and continue to the next page if budget allows
and not EOF.
IOW, as long as it's not EOF and there's budget, the code never
retries writing back the same page. Only when a page happens to be
the last page of a particular run, we end up retrying the page, which
can't possibly guarantee anything data integrity related. Besides,
cyclic writes are only used for non-syncing writebacks meaning that
there's no data integrity implication to begin with.
Fix it by always setting done_index past the current page being
processed.
Note that this problem exists in other writepages too.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 4617ea3a52 (" Btrfs: fix necessary chunk tree space calculation
when allocating a chunk") removed the is_allocation argument from
check_system_chunk, since the formula for reserving the necessary space
for allocation or removing a chunk would be the same.
So, rework the comment by removing the mention of is_allocation
argument.
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unlike read time tree checker errors, write time error can't be
inspected by "btrfs inspect dump-tree", so we need extra information to
determine what's going wrong.
The patch will add the following output for write time tree checker
error:
- The content of the offending tree block
To help determining if it's a false alert.
- Kernel WARN_ON() for debug build
This is helpful for us to detect unexpected write time tree checker
error, especially fstests could catch the dmesg.
Since the WARN_ON() is only triggered for write time tree checker,
test cases utilizing dm-error won't trigger this WARN_ON(), thus no
extra noise.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor the check for prev_key->objectid of the following key types
into one function, check_prev_ino():
- EXTENT_DATA
- INODE_REF
- DIR_INDEX
- DIR_ITEM
- XATTR_ITEM
Also add the check of prev_key for INODE_REF.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_write_locked_range() is used when we're falling back to buffered
IO from inside of compression. It allocates its own wbc and should
associate it with the inode's i_wb to make sure the IO goes down from
the correct cgroup.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Async CRCs and compression submit IO through helper threads, which means
they have IO priority inversions when cgroup IO controllers are in use.
This flags all of the writes submitted by btrfs helper threads as
REQ_CGROUP_PUNT. submit_bio() will punt these to dedicated per-blkcg
work items to avoid the priority inversion.
For the compression code, we take a reference on the wbc's blkg css and
pass it down to the async workers.
For the async CRCs, the bio already has the correct css, we just need to
tell the block layer to use REQ_CGROUP_PUNT.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
Modified-and-reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs writepages function collects a large range of pages flagged
for delayed allocation, and then sends them down through the COW code
for processing. When compression is on, we allocate one async_chunk
structure for every 512K, and then run those pages through the
compression code for IO submission.
writepages starts all of this off with a single page, locked by the
original call to extent_write_cache_pages(), and it's important to keep
track of this page because it has already been through
clear_page_dirty_for_io().
The btrfs async_chunk struct has a pointer to the locked_page, and when
we're redirtying the page because compression had to fallback to
uncompressed IO, we use page->index to decide if a given async_chunk
struct really owns that page.
But, this is racey. If a given delalloc range is broken up into two
async_chunks (chunkA and chunkB), we can end up with something like
this:
compress_file_range(chunkA)
submit_compress_extents(chunkA)
submit compressed bios(chunkA)
put_page(locked_page)
compress_file_range(chunkB)
...
Or:
async_cow_submit
submit_compressed_extents <--- falls back to buffered writeout
cow_file_range
extent_clear_unlock_delalloc
__process_pages_contig
put_page(locked_pages)
async_cow_submit
The end result is that chunkA is completed and cleaned up before chunkB
even starts processing. This means we can free locked_page() and reuse
it elsewhere. If we get really lucky, it'll have the same page->index
in its new home as it did before.
While we're processing chunkB, we might decide we need to fall back to
uncompressed IO, and so compress_file_range() will call
__set_page_dirty_nobufers() on chunkB->locked_page.
Without cgroups in use, this creates as a phantom dirty page, which
isn't great but isn't the end of the world. What can happen, it can go
through the fixup worker and the whole COW machinery again:
in submit_compressed_extents():
while (async extents) {
...
cow_file_range
if (!page_started ...)
extent_write_locked_range
else if (...)
unlock_page
continue;
This hasn't been observed in practice but is still possible.
With cgroups in use, we might crash in the accounting code because
page->mapping->i_wb isn't set.
BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
IP: percpu_counter_add_batch+0x11/0x70
PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 16 PID: 2172 Comm: rm Not tainted
RIP: 0010:percpu_counter_add_batch+0x11/0x70
RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
FS: 00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
account_page_cleaned+0x15b/0x1f0
__cancel_dirty_page+0x146/0x200
truncate_cleanup_page+0x92/0xb0
truncate_inode_pages_range+0x202/0x7d0
btrfs_evict_inode+0x92/0x5a0
evict+0xc1/0x190
do_unlinkat+0x176/0x280
do_syscall_64+0x63/0x1a0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
The fix here is to make asyc_chunk->locked_page NULL everywhere but the
one async_chunk struct that's allowed to do things to the locked page.
Link: https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/
Fixes: 771ed689d2 ("Btrfs: Optimize compressed writeback and reads")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
[ update changelog from mail thread discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're not using btrfs_schedule_bio() anymore, delete all the
code that supported it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_schedule_bio() hands IO off to a helper thread to do the actual
submit_bio() call. This has been used to make sure async crc and
compression helpers don't get stuck on IO submission. To maintain good
performance, over time the IO submission threads duplicated some IO
scheduler characteristics such as high and low priority IOs and they
also made some ugly assumptions about request allocation batch sizes.
All of this cost at least one extra context switch during IO submission,
and doesn't fit well with the modern blkmq IO stack. So, this commit stops
using btrfs_schedule_bio(). We may need to adjust the number of async
helper threads for crcs and compression, but long term it's a better
path.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The attribute is more relaxed than const and the functions could
dereference pointers, as long as the observable state is not changed. We
do have such functions, based on -Wsuggest-attribute=pure .
The visible effects of this patch are negligible, there are differences
in the assembly but hard to summarize.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For some reason the attribute is called __attribute_const__ and not
__const, marks functions that have no observable effects on program
state, IOW not reading pointers, just the arguments and calculating a
value. Allows the compiler to do some optimizations, based on
-Wsuggest-attribute=const . The effects are rather small, though, about
60 bytes decrese of btrfs.ko.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The attribute can mark functions supposed to be called rarely if at all
and the text can be moved to sections far from the other code. The
attribute has been added to several functions already, this patch is
based on hints given by gcc -Wsuggest-attribute=cold.
The net effect of this patch is decrease of btrfs.ko by 1000-1300,
depending on the config options.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter is now always set to NULL and could be dropped. The last
user was get_default_root but that got reworked in 05dbe6837b ("Btrfs:
unify subvol= and subvolid= mounting") and the parameter became unused.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We hit the following warning while running down a different problem
[ 6197.175850] ------------[ cut here ]------------
[ 6197.185082] refcount_t: underflow; use-after-free.
[ 6197.194704] WARNING: CPU: 47 PID: 966 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60
[ 6197.521792] Call Trace:
[ 6197.526687] __btrfs_release_delayed_node+0x76/0x1c0
[ 6197.536615] btrfs_kill_all_delayed_nodes+0xec/0x130
[ 6197.546532] ? __btrfs_btree_balance_dirty+0x60/0x60
[ 6197.556482] btrfs_clean_one_deleted_snapshot+0x71/0xd0
[ 6197.566910] cleaner_kthread+0xfa/0x120
[ 6197.574573] kthread+0x111/0x130
[ 6197.581022] ? kthread_create_on_node+0x60/0x60
[ 6197.590086] ret_from_fork+0x1f/0x30
[ 6197.597228] ---[ end trace 424bb7ae00509f56 ]---
This is because the free side drops the ref without the lock, and then
takes the lock if our refcount is 0. So you can have nodes on the tree
that have a refcount of 0. Fix this by zero'ing out that element in our
temporary array so we don't try to kill it again.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Its very helpful if we had logged the device scanner process name to
debug the race condition between the systemd-udevd scan and the user
initiated device forget command.
This patch adds process name and pid to the scan message.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add pid to the message ]
Signed-off-by: David Sterba <dsterba@suse.com>
That function adds unnecessary indirection between backref_in_log and
the caller. Furthermore it also "downgrades" backref_in_log's return
value to a boolean, when in fact it could very well be an error.
Rectify the situation by simply opencoding name_in_log_ref in
replay_one_name and properly handling possible return codes from
backref_in_log.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
This function can return a negative error value if btrfs_search_slot
errors for whatever reason or if btrfs_alloc_path runs out of memory.
This is currently problemattic because backref_in_log is treated by its
callers as if it returns boolean.
Fix this by adding proper error handling in callers. That also enables
the function to return the direct error code from btrfs_search_slot.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Direct replacement, though note that the inside of the loop in
btrfs_find_name_in_backref is organized in a slightly different way but
is equvalent.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The state was introduced in commit 4a9d8bdee3 ("Btrfs: make the state
of the transaction more readable"), then in commit 302167c50b
("btrfs: don't end the transaction for delayed refs in throttle") the
state is completely removed.
So we can just clean up the state since it's only compared but never
set.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add an overview of the basic btrfs transaction transitions, including
the following states:
- No transaction states
- Transaction N [[TRANS_STATE_RUNNING]]
- Transaction N [[TRANS_STATE_COMMIT_START]]
- Transaction N [[TRANS_STATE_COMMIT_DOING]]
- Transaction N [[TRANS_STATE_UNBLOCKED]]
- Transaction N [[TRANS_STATE_COMPLETED]]
For each state, the comment will include:
- Basic explaination about current state
- How to go next stage
- What will happen if we call various start_transaction() functions
- Relationship to transaction N+1
This doesn't provide tech details, but serves as a cheat sheet for
reader to get into the code a little easier.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace is_power_of_2 with the helper that is self-documenting and
remove the open coded call in alloc_profile_is_valid.
Signed-off-by: David Sterba <dsterba@suse.com>
As is_power_of_two takes unsigned long, it's not safe on 32bit
architectures, but we could pass any u64 value in seveal places. Add a
separate helper and also an alias that better expresses the purpose for
which the helper is used.
Signed-off-by: David Sterba <dsterba@suse.com>
When balance reduces the number of copies of metadata, it reduces the
redundancy, use the term redundancy instead of integrity.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function belongs to the family of locking functions, so move it
there. The 'noinline' keyword is dropped as it's now an exported
function that does not need it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function belongs to the family of locking functions, so move it
there. The 'noinline' keyword is dropped as it's now an exported
function that does not need it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_assert_tree_locked is used outside of the locking
code so it is exported, however we can make it static inine as it's
fairly trivial.
This is the only locking assertion used in release builds, inlining
improves the text size by 174 bytes and reduces stack consumption in the
callers.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I've noticed that none of the btrfs_assert_*lock* debugging helpers is
inlined, despite they're short and mostly a value update. Making them
inline shaves 67 from the text size, reduces stack consumption and
perhaps also slightly improves the performance due to avoiding
unnecessary calls.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit ac0c7cf8be ("btrfs: fix crash when tracepoint arguments are
freed by wq callbacks") added a void pointer, wtag, which is passed into
trace_btrfs_all_work_done() instead of the freed work item. This is
silly for a few reasons:
1. The freed work item still has the same address.
2. work is still in scope after it's freed, so assigning wtag doesn't
stop anyone from using it.
3. The tracepoint has always taken a void * argument, so assigning wtag
doesn't actually make things any more type-safe. (Note that the
original bug in commit bc074524e1 ("btrfs: prefix fsid to all trace
events") was that the void * was implicitly casted when it was passed
to btrfs_work_owner() in the trace point itself).
Instead, let's add some clearer warnings as comments.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 9e0af23764 ("Btrfs: fix task hang under heavy compressed
write") worked around the issue that a recycled work item could get a
false dependency on the original work item due to how the workqueue code
guarantees non-reentrancy. It did so by giving different work functions
to different types of work.
However, the fixes in the previous few patches are more complete, as
they prevent a work item from being recycled at all (except for a tiny
window that the kernel workqueue code handles for us). This obsoletes
the previous fix, so we don't need the unique helpers for correctness.
The only other reason to keep them would be so they show up in stack
traces, but they always seem to be optimized to a tail call, so they
don't show up anyways. So, let's just get rid of the extra indirection.
While we're here, rename normal_work_helper() to the more informative
btrfs_work_helper().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, scrub_missing_raid56_worker() puts and potentially frees
sblock (which embeds the work item) and then submits a bio through
scrub_wr_submit(). This is another potential instance of the bug in
"btrfs: don't prematurely free work in run_ordered_work()". Fix it by
dropping the reference after we submit the bio.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, reada_start_machine_worker() frees the reada_machine_work and
then calls __reada_start_machine() to do readahead. This is another
potential instance of the bug in "btrfs: don't prematurely free work in
run_ordered_work()".
There _might_ already be a deadlock here: reada_start_machine_worker()
can depend on itself through stacked filesystems (__read_start_machine()
-> reada_start_machine_dev() -> reada_tree_block_flagged() ->
read_extent_buffer_pages() -> submit_one_bio() ->
btree_submit_bio_hook() -> btrfs_map_bio() -> submit_stripe_bio() ->
submit_bio() onto a loop device can trigger readahead on the lower
filesystem).
Either way, let's fix it by freeing the work at the end.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, end_workqueue_fn() frees the end_io_wq entry (which embeds
the work item) and then calls bio_endio(). This is another potential
instance of the bug in "btrfs: don't prematurely free work in
run_ordered_work()".
In particular, the endio call may depend on other work items. For
example, btrfs_end_dio_bio() can call btrfs_subio_endio_read() ->
__btrfs_correct_data_nocsum() -> dio_read_error() ->
submit_dio_repair_bio(), which submits a bio that is also completed
through a end_workqueue_fn() work item. However,
__btrfs_correct_data_nocsum() waits for the newly submitted bio to
complete, thus it depends on another work item.
This example currently usually works because we use different workqueue
helper functions for BTRFS_WQ_ENDIO_DATA and BTRFS_WQ_ENDIO_DIO_REPAIR.
However, it may deadlock with stacked filesystems and is fragile
overall. The proper fix is to free the work item at the very end of the
work function, so let's do that.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We hit the following very strange deadlock on a system with Btrfs on a
loop device backed by another Btrfs filesystem:
1. The top (loop device) filesystem queues an async_cow work item from
cow_file_range_async(). We'll call this work X.
2. Worker thread A starts work X (normal_work_helper()).
3. Worker thread A executes the ordered work for the top filesystem
(run_ordered_work()).
4. Worker thread A finishes the ordered work for work X and frees X
(work->ordered_free()).
5. Worker thread A executes another ordered work and gets blocked on I/O
to the bottom filesystem (still in run_ordered_work()).
6. Meanwhile, the bottom filesystem allocates and queues an async_cow
work item which happens to be the recently-freed X.
7. The workqueue code sees that X is already being executed by worker
thread A, so it schedules X to be executed _after_ worker thread A
finishes (see the find_worker_executing_work() call in
process_one_work()).
Now, the top filesystem is waiting for I/O on the bottom filesystem, but
the bottom filesystem is waiting for the top filesystem to finish, so we
deadlock.
This happens because we are breaking the workqueue assumption that a
work item cannot be recycled while it still depends on other work. Fix
it by waiting to free the work item until we are done with all of the
related ordered work.
P.S.:
One might ask why the workqueue code doesn't try to detect a recycled
work item. It actually does try by checking whether the work item has
the same work function (find_worker_executing_work()), but in our case
the function is the same. This is the only key that the workqueue code
has available to compare, short of adding an additional, layer-violating
"custom key". Considering that we're the only ones that have ever hit
this, we should just play by the rules.
Unfortunately, we haven't been able to create a minimal reproducer other
than our full container setup using a compress-force=zstd filesystem on
top of another compress-force=zstd filesystem.
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit fc97fab0ea ("btrfs: Replace fs_info->qgroup_rescan_worker
workqueue with btrfs_workqueue.") converted qgroup_rescan_work to be
initialized with btrfs_init_work(), but it left behind an unnecessary
memset(). Get rid of the memset().
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This needs to be cleaned up in the future, but for now it belongs to the
extent-io-tree stuff since it uses the internal tree search code.
Needed to export get_state_failrec and set_state_failrec as well since
we're not going to move the actual IO part of the failrec stuff out at
this point.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This utilizes internal stuff to the extent_io_tree, so we need to export
it before we move it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_io.c/h are huge, encompassing a bunch of different things. The
extent_io_tree code can live on its own, so separate this out.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are moving extent_io_tree into it's on file, so separate out the
extent_state init stuff from extent_io_tree_init().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We check both extent buffer and extent state leaks in the same function,
separate these two functions out so we can move them around.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following comment shows up in btrfs_search_slot() with out much
sense:
/*
* setup the path here so we can release it under lock
* contention with the cow code
*/
if (cow) {
/* code touching path->lock[] is far away from here */
}
This comment hasn't been cleaned up after the relevant code has been
removed.
The original code is introduced in commit 65b51a009e
("btrfs_search_slot: reduce lock contention by cowing in two stages"):
+
+ /*
+ * setup the path here so we can release it under lock
+ * contention with the cow code
+ */
+ p->nodes[level] = b;
+ if (!p->skip_locking)
+ p->locks[level] = 1;
+
But in current code, we have different timing for modifying path lock,
so just remove the comment.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to btrfs_search_slot() done in previous patch, make a shortcut
for the level 0 case and allow to reduce indentation for the remaining
case.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_search_slot(), we something like:
if (level != 0) {
/* Do search inside tree nodes*/
} else {
/* Do search inside tree leaves */
goto done;
}
This caused extra indent for tree node search code. Change it to
something like:
if (level == 0) {
/* Do search inside tree leaves */
goto done'
}
/* Do search inside tree nodes */
So we have more space to maneuver our code, this is especially useful as
the tree nodes search code is more complex than the leaves search code.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For INODE_REF we will check:
- Objectid (ino) against previous key
To detect missing INODE_ITEM.
- No overflow/padding in the data payload
Much like DIR_ITEM, but with less members to check.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For the following items, key->objectid is inode number:
- DIR_ITEM
- DIR_INDEX
- XATTR_ITEM
- EXTENT_DATA
- INODE_REF
So in the subvolume tree, such items must have its previous item share the
same objectid, e.g.:
(257 INODE_ITEM 0)
(257 DIR_INDEX xxx)
(257 DIR_ITEM xxx)
(258 INODE_ITEM 0)
(258 INODE_REF 0)
(258 XATTR_ITEM 0)
(258 EXTENT_DATA 0)
But if we have the following sequence, then there is definitely
something wrong, normally some INODE_ITEM is missing, like:
(257 INODE_ITEM 0)
(257 DIR_INDEX xxx)
(257 DIR_ITEM xxx)
(258 XATTR_ITEM 0) <<< objecitd suddenly changed to 258
(258 EXTENT_DATA 0)
So just by checking the previous key for above inode based key types, we
can detect a missing inode item.
For INODE_REF key type, the check will be added along with INODE_REF
checker.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not used ouside of transaction.c
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A recent patch to btrfs showed that there was at least 1 case where a
nested transaction was committed. Nested transaction in this case means
a code which has a transaction handle calls some function which in turn
obtains a copy of the same transaction handle. In such cases the correct
thing to do is for the lower callee to call btrfs_end_transaction which
contains appropriate checks so as to not commit the transaction which
will result in stale trans handler for the caller.
To catch such cases add an assert in btrfs_commit_transaction ensuring
btrfs_trans_handle::use_count is always 1.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is similar to 942491c9e6 ("xfs: fix AIM7 regression"). Apparently
our current rwsem code doesn't like doing the trylock, then lock for
real scheme. This causes extra contention on the lock and can be
measured eg. by AIM7 benchmark. So change our read/write methods to
just do the trylock for the RWF_NOWAIT case.
Fixes: edf064e7c6 ("btrfs: nowait aio support")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3MUkoACgkQxWXV+ddt
WDvAEQ//fEHZ51NbIMwJNltqF4mr6Oao0M5u0wejEgiXzmR9E1IuHUgVK+KDQmSu
wZl/y+RTlQC0TiURyStaVFBreXEqiuB79my9u4iDeNv/4UJQB42qpmYB4EuviDgB
mxb9bFpWTLkO6Oc+vMrGF3BOmVsQKlq2nOua25g8VFtApQ6uiEfbwBOslCcC8kQB
ZpNBl6x74xz/VWNWZnRStBfwYjRitKNDVU6dyIyRuLj8cktqfGBxGtx7/w0wDiZT
kPR1bNtdpy3Ndke6H/0G6plRWi9kENqcN43hvrz54IKh2l+Jd2/as51j4Qq2tJU9
KaAnJzRaSePxc2m0SqtgZTvc2BYSOg7dqaCyHxBB0CUBdTdJdz2TVZ2KM9MiLns4
1haHBLo4l8o8zeYZpW05ac6OXKY4f8qsjWPEGshn4FDbq0TrHQzYxAF3c0X3hPag
SnuvilgYUuYal+n87qinePg/ZmVrrBXPRycpQnn7FxqezJbf/2WUEojUVQnreU05
mdp8mulxQxyFhgEvO7K1uDtlP8bqW69IO9M//6IWzGNKTDK2SRI08ULplghqgyna
8SG0+y9w26r8UIWDhuvPbdfUMSG3kEH8yLFK84AFDMVJJxOnfznE3sC8sGOiP5q9
OUkl8l7bhDkyAdWZY57gGUobebdPfnLxRV9A+LZQ2El1kSOEK18=
=xzXs
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A fix for an older bug that has started to show up during testing
(because of an updated test for rename exchange).
It's an in-memory corruption caused by local variable leaking out of
the function scope"
* tag 'for-5.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix log context list corruption after rename exchange operation
During rename exchange we might have successfully log the new name in the
source root's log tree, in which case we leave our log context (allocated
on stack) in the root's list of log contextes. However we might fail to
log the new name in the destination root, in which case we fallback to
a transaction commit later and never sync the log of the source root,
which causes the source root log context to remain in the list of log
contextes. This later causes invalid memory accesses because the context
was allocated on stack and after rename exchange finishes the stack gets
reused and overwritten for other purposes.
The kernel's linked list corruption detector (CONFIG_DEBUG_LIST=y) can
detect this and report something like the following:
[ 691.489929] ------------[ cut here ]------------
[ 691.489947] list_add corruption. prev->next should be next (ffff88819c944530), but was ffff8881c23f7be4. (prev=ffff8881c23f7a38).
[ 691.489967] WARNING: CPU: 2 PID: 28933 at lib/list_debug.c:28 __list_add_valid+0x95/0xe0
(...)
[ 691.489998] CPU: 2 PID: 28933 Comm: fsstress Not tainted 5.4.0-rc6-btrfs-next-62 #1
[ 691.490001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[ 691.490003] RIP: 0010:__list_add_valid+0x95/0xe0
(...)
[ 691.490007] RSP: 0018:ffff8881f0b3faf8 EFLAGS: 00010282
[ 691.490010] RAX: 0000000000000000 RBX: ffff88819c944530 RCX: 0000000000000000
[ 691.490011] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffa2c497e0
[ 691.490013] RBP: ffff8881f0b3fe68 R08: ffffed103eaa4115 R09: ffffed103eaa4114
[ 691.490015] R10: ffff88819c944000 R11: ffffed103eaa4115 R12: 7fffffffffffffff
[ 691.490016] R13: ffff8881b4035610 R14: ffff8881e7b84728 R15: 1ffff1103e167f7b
[ 691.490019] FS: 00007f4b25ea2e80(0000) GS:ffff8881f5500000(0000) knlGS:0000000000000000
[ 691.490021] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 691.490022] CR2: 00007fffbb2d4eec CR3: 00000001f2a4a004 CR4: 00000000003606e0
[ 691.490025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 691.490027] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 691.490029] Call Trace:
[ 691.490058] btrfs_log_inode_parent+0x667/0x2730 [btrfs]
[ 691.490083] ? join_transaction+0x24a/0xce0 [btrfs]
[ 691.490107] ? btrfs_end_log_trans+0x80/0x80 [btrfs]
[ 691.490111] ? dget_parent+0xb8/0x460
[ 691.490116] ? lock_downgrade+0x6b0/0x6b0
[ 691.490121] ? rwlock_bug.part.0+0x90/0x90
[ 691.490127] ? do_raw_spin_unlock+0x142/0x220
[ 691.490151] btrfs_log_dentry_safe+0x65/0x90 [btrfs]
[ 691.490172] btrfs_sync_file+0x9f1/0xc00 [btrfs]
[ 691.490195] ? btrfs_file_write_iter+0x1800/0x1800 [btrfs]
[ 691.490198] ? rcu_read_lock_any_held.part.11+0x20/0x20
[ 691.490204] ? __do_sys_newstat+0x88/0xd0
[ 691.490207] ? cp_new_stat+0x5d0/0x5d0
[ 691.490218] ? do_fsync+0x38/0x60
[ 691.490220] do_fsync+0x38/0x60
[ 691.490224] __x64_sys_fdatasync+0x32/0x40
[ 691.490228] do_syscall_64+0x9f/0x540
[ 691.490233] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 691.490235] RIP: 0033:0x7f4b253ad5f0
(...)
[ 691.490239] RSP: 002b:00007fffbb2d6078 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[ 691.490242] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f4b253ad5f0
[ 691.490244] RDX: 00007fffbb2d5fe0 RSI: 00007fffbb2d5fe0 RDI: 0000000000000003
[ 691.490245] RBP: 000000000000000d R08: 0000000000000001 R09: 00007fffbb2d608c
[ 691.490247] R10: 00000000000002e8 R11: 0000000000000246 R12: 00000000000001f4
[ 691.490248] R13: 0000000051eb851f R14: 00007fffbb2d6120 R15: 00005635a498bda0
This started happening recently when running some test cases from fstests
like btrfs/004 for example, because support for rename exchange was added
last week to fsstress from fstests.
So fix this by deleting the log context for the source root from the list
if we have logged the new name in the source root.
Reported-by: Su Yue <Damenly_Su@gmx.com>
Fixes: d4682ba03e ("Btrfs: sync log after logging new name")
CC: stable@vger.kernel.org # 4.19+
Tested-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3GwvEACgkQxWXV+ddt
WDvsfQ//UQ/TND1roMW3EmN+PLTRTUQS75Hb737r5Q66j0j94gFnBxB03R+tU8lP
bH5XVpateb1wDmmdscqVQ1WM9O82bQdDNiYeLDr86+kzpLgy61rZHswZfNlDtl5M
wDwyaxsrd7HndDeUZnIuaakYI9MXz9WIaNXkt0o8hSHctt0N18y23DSBFTVSh/4q
T3cn4odkoBKtQ4mIn2OSmQvkl69nWRBVpjPJZIvsNszKjOo9aZTuGHrOWUV5RPiE
Ho9UBkr+IjEDo8OH88vXAsHeYFIoYhEeUltjLHyF6Agwk1/Ajwp5sxXSubbfHUMQ
l1YdmrTZf+l1Dxdj0sCdyK1npcgGI5IuZmIICpNUEAny9AbTtpSE3GNwtnIHAEAr
cpki+1Z3lfaVSwNMYUz9Esbsb72C+f08WJHGHBMaOhjrBIwQUeWeYzTx6N7uDwNg
GjaDRxjqFV2o7373isQaz7WOTatUMNtL1xvhkeceONI9NBXQRjW5rq5zECmr1ix7
lSTKDzn7yAd6eGBW0T6iYdRl4Bta/zxPiDEF5KDNPugvv23yx17LAOdS4qAiHzbx
oglra/kz9D4xmKVqH7hpFaaHrzL38G4mz6aMdwBu0M7dkn9AwsXiXWSDb0n7jnK0
o96gkUBZjUSFBseUFD2s34MikFtrDynEJzPq96JHuWLguaByMcc=
=CZxY
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few regressions and fixes for stable.
Regressions:
- fix a race leading to metadata space leak after task received a
signal
- un-deprecate 2 ioctls, marked as deprecated by mistake
Fixes:
- fix limit check for number of devices during chunk allocation
- fix a race due to double evaluation of i_size_read inside max()
macro, can cause a crash
- remove wrong device id check in tree-checker"
* tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC
btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range
Btrfs: fix race leading to metadata space leak after task received signal
btrfs: tree-checker: Fix wrong check on max devid
btrfs: Consider system chunk array size for new SYSTEM chunks
The two ioctls START_SYNC and WAIT_SYNC were mistakenly marked as
deprecated and scheduled for removal but we actualy do use them for
'btrfs subvolume delete -C/-c'. The deprecated thing in ebc87351e5
should have been just the async flag for subvolume creation.
The deprecation has been added in this development cycle, remove it
until it's time.
Fixes: ebc87351e5 ("btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag")
Signed-off-by: David Sterba <dsterba@suse.com>
We hit a regression while rolling out 5.2 internally where we were
hitting the following panic
kernel BUG at mm/page-writeback.c:2659!
RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
Call Trace:
__process_pages_contig+0x25a/0x350
? extent_clear_unlock_delalloc+0x43/0x70
submit_compressed_extents+0x359/0x4d0
normal_work_helper+0x15a/0x330
process_one_work+0x1f5/0x3f0
worker_thread+0x2d/0x3d0
? rescuer_thread+0x340/0x340
kthread+0x111/0x130
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x1f/0x30
This is happening because the page is not locked when doing
clear_page_dirty_for_io. Looking at the core dump it was because our
async_extent had a ram_size of 24576 but our async_chunk range only
spanned 20480, so we had a whole extra page in our ram_size for our
async_extent.
This happened because we try not to compress pages outside of our
i_size, however a cleanup patch changed us to do
actual_end = min_t(u64, i_size_read(inode), end + 1);
which is problematic because i_size_read() can evaluate to different
values in between checking and assigning. So either an expanding
truncate or a fallocate could increase our i_size while we're doing
writeout and actual_end would end up being past the range we have
locked.
I confirmed this was what was happening by installing a debug kernel
that had
actual_end = min_t(u64, i_size_read(inode), end + 1);
if (actual_end > end + 1) {
printk(KERN_ERR "KABOOM\n");
actual_end = end + 1;
}
and installing it onto 500 boxes of the tier that had been seeing the
problem regularly. Last night I got my debug message and no panic,
confirming what I expected.
[ dsterba: the assembly confirms a tiny race window:
mov 0x20(%rsp),%rax
cmp %rax,0x48(%r15) # read
movl $0x0,0x18(%rsp)
mov %rax,%r12
mov %r14,%rax
cmovbe 0x48(%r15),%r12 # eval
Where r15 is inode and 0x48 is offset of i_size.
The original fix was to revert 62b3762271 that would do an
intermediate assignment and this would also avoid the doulble
evaluation but is not future-proof, should the compiler merge the
stores and call i_size_read anyway.
There's a patch adding READ_ONCE to i_size_read but that's not being
applied at the moment and we need to fix the bug. Instead, emulate
READ_ONCE by two barrier()s that's what effectively happens. The
assembly confirms single evaluation:
mov 0x48(%rbp),%rax # read once
mov 0x20(%rsp),%rcx
mov $0x20,%edx
cmp %rax,%rcx
cmovbe %rcx,%rax
mov %rax,(%rsp)
mov %rax,%rcx
mov %r14,%rax
Where 0x48(%rbp) is inode->i_size stored to %eax.
]
Fixes: 62b3762271 ("btrfs: Remove isize local variable in compress_file_range")
CC: stable@vger.kernel.org # v5.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ changelog updated ]
Signed-off-by: David Sterba <dsterba@suse.com>
When a task that is allocating metadata needs to wait for the async
reclaim job to process its ticket and gets a signal (because it was killed
for example) before doing the wait, the task ends up erroring out but
with space reserved for its ticket, which never gets released, resulting
in a metadata space leak (more specifically a leak in the bytes_may_use
counter of the metadata space_info object).
Here's the sequence of steps leading to the space leak:
1) A task tries to create a file for example, so it ends up trying to
start a transaction at btrfs_create();
2) The filesystem is currently in a state where there is not enough
metadata free space to satisfy the transaction's needs. So at
space-info.c:__reserve_metadata_bytes() we create a ticket and
add it to the list of tickets of the space info object. Also,
because the metadata async reclaim job is not running, we queue
a job ro run metadata reclaim;
3) In the meanwhile the task receives a signal (like SIGTERM from
a kill command for example);
4) After queing the async reclaim job, at __reserve_metadata_bytes(),
we unlock the metadata space info and call handle_reserve_ticket();
5) That last function calls wait_reserve_ticket(), which acquires the
lock from the metadata space info. Then in the first iteration of
its while loop, it calls prepare_to_wait_event(), which returns
-ERESTARTSYS because the task has a pending signal. As a result,
we set the error field of the ticket to -EINTR and exit the while
loop without deleting the ticket from the list of tickets (in the
space info object). After exiting the loop we unlock the space info;
6) The async reclaim job is able to release enough metadata, acquires
the metadata space info's lock and then reserves space for the ticket,
since the ticket is still in the list of (non-priority) tickets. The
space reservation happens at btrfs_try_granting_tickets(), called from
maybe_fail_all_tickets(). This increments the bytes_may_use counter
from the metadata space info object, sets the ticket's bytes field to
zero (meaning success, that space was reserved) and removes it from
the list of tickets;
7) wait_reserve_ticket() returns, with the error field of the ticket
set to -EINTR. Then handle_reserve_ticket() just propagates that error
to the caller. Because an error was returned, the caller does not
release the reserved space, since the expectation is that any error
means no space was reserved.
Fix this by removing the ticket from the list, while holding the space
info lock, at wait_reserve_ticket() when prepare_to_wait_event() returns
an error.
Also add some comments and an assertion to guarantee we never end up with
a ticket that has an error set and a bytes counter field set to zero, to
more easily detect regressions in the future.
This issue could be triggered sporadically by some test cases from fstests
such as generic/269 for example, which tries to fill a filesystem and then
kills fsstress processes running in the background.
When this issue happens, we get a warning in syslog/dmesg when unmounting
the filesystem, like the following:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 13240 at fs/btrfs/block-group.c:3186 btrfs_free_block_groups+0x314/0x470 [btrfs]
(...)
CPU: 0 PID: 13240 Comm: umount Tainted: G W L 5.3.0-rc8-btrfs-next-48+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x314/0x470 [btrfs]
(...)
RSP: 0018:ffff9910c14cfdb8 EFLAGS: 00010286
RAX: 0000000000000024 RBX: ffff89cd8a4d55f0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff89cdf6a178a8 RDI: ffff89cdf6a178a8
RBP: ffff9910c14cfde8 R08: 0000000000000000 R09: 0000000000000001
R10: ffff89cd4d618040 R11: 0000000000000000 R12: ffff89cd8a4d5508
R13: ffff89cde7c4a600 R14: dead000000000122 R15: dead000000000100
FS: 00007f42754432c0(0000) GS:ffff89cdf6a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd25a47f730 CR3: 000000021f8d6006 CR4: 00000000003606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x1ad/0x390 [btrfs]
generic_shutdown_super+0x6c/0x110
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0xa0 [btrfs]
deactivate_locked_super+0x3a/0x70
cleanup_mnt+0xb4/0x160
task_work_run+0x7e/0xc0
exit_to_usermode_loop+0xfa/0x100
do_syscall_64+0x1cb/0x220
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f4274d2cb37
(...)
RSP: 002b:00007ffcff701d38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000557ebde2f060 RCX: 00007f4274d2cb37
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557ebde2f240
RBP: 0000557ebde2f240 R08: 0000557ebde2f270 R09: 0000000000000015
R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f427522ee64
R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcff701fc0
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
softirqs last enabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace bcf4b235461b26f6 ]---
BTRFS info (device sdb): space_info 4 has 19116032 free, is full
BTRFS info (device sdb): space_info total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536
BTRFS info (device sdb): global_block_rsv: size 0 reserved 0
BTRFS info (device sdb): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdb): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdb): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdb): delayed_refs_rsv: size 0 reserved 0
Fixes: 374bf9c5cd ("btrfs: unify error handling for ticket flushing")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following script will cause false alert on devid check.
#!/bin/bash
dev1=/dev/test/test
dev2=/dev/test/scratch1
mnt=/mnt/btrfs
umount $dev1 &> /dev/null
umount $dev2 &> /dev/null
umount $mnt &> /dev/null
mkfs.btrfs -f $dev1
mount $dev1 $mnt
_fail()
{
echo "!!! FAILED !!!"
exit 1
}
for ((i = 0; i < 4096; i++)); do
btrfs dev add -f $dev2 $mnt || _fail
btrfs dev del $dev1 $mnt || _fail
dev_tmp=$dev1
dev1=$dev2
dev2=$dev_tmp
done
[CAUSE]
Tree-checker uses BTRFS_MAX_DEVS() and BTRFS_MAX_DEVS_SYS_CHUNK() as
upper limit for devid. But we can have devid holes just like above
script.
So the check for devid is incorrect and could cause false alert.
[FIX]
Just remove the whole devid check. We don't have any hard requirement
for devid assignment.
Furthermore, even devid could get corrupted by a bitflip, we still have
dev extents verification at mount time, so corrupted data won't sneak
in.
This fixes fstests btrfs/194.
Reported-by: Anand Jain <anand.jain@oracle.com>
Fixes: ab4ba2e133 ("btrfs: tree-checker: Verify dev item")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For SYSTEM chunks, despite the regular chunk item size limit, there is
another limit due to system chunk array size.
The extra limit was removed in a refactoring, so add it back.
Fixes: e3ecdb3fde ("btrfs: factor out devs_max setting in __btrfs_alloc_chunk")
CC: stable@vger.kernel.org # 5.3+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The .ioctl and .compat_ioctl file operations have the same prototype so
they can both point to the same function, which works great almost all
the time when all the commands are compatible.
One exception is the s390 architecture, where a compat pointer is only
31 bit wide, and converting it into a 64-bit pointer requires calling
compat_ptr(). Most drivers here will never run in s390, but since we now
have a generic helper for it, it's easy enough to use it consistently.
I double-checked all these drivers to ensure that all ioctl arguments
are used as pointers or are ignored, but are not interpreted as integer
values.
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Darren Hart (VMware) <dvhart@infradead.org>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl2vBPgACgkQxWXV+ddt
WDvaGQ/+JbV05ML7QjufNFKjzojNPg8dOwvLqF58askMEAqzyqOrzu1YsIOjchqv
eZPP+wxh6j6i8LFnbaMYRhSVMlgpK9XOR3tNStp73QlSox6/6VKu7XR4wFXAoip3
l4XI8UqO5XqDG55UTtjKyo2VNq8vgq1gRUWD6hPtfzDP6WYj4JZuXOd+dT6PbP32
VHd91RoB8Z8qmypLF9Sju6IBWNhKl91TjlRqVdfLywaK8azqaxE1UttPf5DVbE6+
MlTuXhuxkNc4ddzj//oJ1s3ZP/nXtFmIZ75+Sd6P5DfqRNeIfjkKqXvffsjqoYVI
1Wv9sDiezxRB1RZjfInhtqvdvsqcsrXrM7x6BRVqI7IZDUaH5em8IoozQT62ze/4
MJBrRIbVodx2I7EVDMNGkx2TaIDAfnW4Z3UC+YSHLoy6jht1+SA5KLQDR1G/4NeR
dWht5wOXwhp8P3DoaczTUpk0DLtAtygj04fH8CG277EILLCpuWwW8iRkPttkrIM2
HrtRKrKFJNyEpq9vHSFVvpQJkzgzqBFs1UnqnEuYh6qNgWCrS4PJDu9geQ8aPGr2
pA5aEqg5b+jjWzIYDP93PF0u7kF4mFVAozn6xMf95FKmM1OupYYQg5BRe7n/DfDu
yh3Ms0Mmd9+snpNJ9lDr40cobHC5CDvfvO2SRERyBeXxARainN0=
=Ft+G
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fixes of error handling cleanup of metadata accounting with qgroups
enabled
- fix swapped values for qgroup tracepoints
- fix race when handling full sync flag
- don't start unused worker thread, functionality removed already
* tag 'for-5.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: check for the full sync flag while holding the inode lock during fsync
Btrfs: fix qgroup double free after failure to reserve metadata for delalloc
btrfs: tracepoints: Fix bad entry members of qgroup events
btrfs: tracepoints: Fix wrong parameter order for qgroup events
btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents()
btrfs: don't needlessly create extent-refs kernel thread
btrfs: block-group: Fix a memory leak due to missing btrfs_put_block_group()
Btrfs: add missing extents release on file extent cluster relocation error
We were checking for the full fsync flag in the inode before locking the
inode, which is racy, since at that that time it might not be set but
after we acquire the inode lock some other task set it. One case where
this can happen is on a system low on memory and some concurrent task
failed to allocate an extent map and therefore set the full sync flag on
the inode, to force the next fsync to work in full mode.
A consequence of missing the full fsync flag set is hitting the problems
fixed by commit 0c713cbab6 ("Btrfs: fix race between ranged fsync and
writeback of adjacent ranges"), BUG_ON() when dropping extents from a log
tree, hitting assertion failures at tree-log.c:copy_items() or all sorts
of weird inconsistencies after replaying a log due to file extents items
representing ranges that overlap.
So just move the check such that it's done after locking the inode and
before starting writeback again.
Fixes: 0c713cbab6 ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to reserve metadata for delalloc operations we end up releasing
the previously reserved qgroup amount twice, once explicitly under the
'out_qgroup' label by calling btrfs_qgroup_free_meta_prealloc() and once
again, under label 'out_fail', by calling btrfs_inode_rsv_release() with a
value of 'true' for its 'qgroup_free' argument, which results in
btrfs_qgroup_free_meta_prealloc() being called again, so we end up having
a double free.
Also if we fail to reserve the necessary qgroup amount, we jump to the
label 'out_fail', which calls btrfs_inode_rsv_release() and that in turns
calls btrfs_qgroup_free_meta_prealloc(), even though we weren't able to
reserve any qgroup amount. So we freed some amount we never reserved.
So fix this by removing the call to btrfs_inode_rsv_release() in the
failure path, since it's not necessary at all as we haven't changed the
inode's block reserve in any way at this point.
Fixes: c8eaeac7b7 ("btrfs: reserve delalloc metadata differently")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
For btrfs:qgroup_meta_reserve event, the trace event can output garbage:
qgroup_meta_reserve: 9c7f6acc-b342-4037-bc47-7f6e4d2232d7: refroot=5(FS_TREE) type=DATA diff=2
The diff should always be alinged to sector size (4k), so there is
definitely something wrong.
[CAUSE]
For the wrong @diff, it's caused by wrong parameter order.
The correct parameters are:
struct btrfs_root, s64 diff, int type.
However the parameters used are:
struct btrfs_root, int type, s64 diff.
Fixes: 4ee0d8832c ("btrfs: qgroup: Update trace events for metadata reservation")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[Background]
Btrfs qgroup uses two types of reserved space for METADATA space,
PERTRANS and PREALLOC.
PERTRANS is metadata space reserved for each transaction started by
btrfs_start_transaction().
While PREALLOC is for delalloc, where we reserve space before joining a
transaction, and finally it will be converted to PERTRANS after the
writeback is done.
[Inconsistency]
However there is inconsistency in how we handle PREALLOC metadata space.
The most obvious one is:
In btrfs_buffered_write():
btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
We always free qgroup PREALLOC meta space.
While in btrfs_truncate_block():
btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
We only free qgroup PREALLOC meta space when something went wrong.
[The Correct Behavior]
The correct behavior should be the one in btrfs_buffered_write(), we
should always free PREALLOC metadata space.
The reason is, the btrfs_delalloc_* mechanism works by:
- Reserve metadata first, even it's not necessary
In btrfs_delalloc_reserve_metadata()
- Free the unused metadata space
Normally in:
btrfs_delalloc_release_extents()
|- btrfs_inode_rsv_release()
Here we do calculation on whether we should release or not.
E.g. for 64K buffered write, the metadata rsv works like:
/* The first page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=0
total: num_bytes=calc_inode_reservations()
/* The first page caused one outstanding extent, thus needs metadata
rsv */
/* The 2nd page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed
/* The 2nd page doesn't cause new outstanding extent, needs no new meta
rsv, so we free what we have reserved */
/* The 3rd~16th pages */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed (still space for one outstanding extent)
This means, if btrfs_delalloc_release_extents() determines to free some
space, then those space should be freed NOW.
So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
than btrfs_qgroup_convert_reserved_meta().
The good news is:
- The callers are not that hot
The hottest caller is in btrfs_buffered_write(), which is already
fixed by commit 336a8bb8e3 ("btrfs: Fix wrong
btrfs_delalloc_release_extents parameter"). Thus it's not that
easy to cause false EDQUOT.
- The trans commit in advance for qgroup would hide the bug
Since commit f5fef45936 ("btrfs: qgroup: Make qgroup async transaction
commit more aggressive"), when btrfs qgroup metadata free space is slow,
it will try to commit transaction and free the wrongly converted
PERTRANS space, so it's not that easy to hit such bug.
[FIX]
So to fix the problem, remove the @qgroup_free parameter for
btrfs_delalloc_release_extents(), and always pass true to
btrfs_inode_rsv_release().
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 43b18595d6 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The patch 32b593bfcb ("Btrfs: remove no longer used function to run
delayed refs asynchronously") removed the async delayed refs but the
thread has been created, without any use. Remove it to avoid resource
consumption.
Fixes: 32b593bfcb ("Btrfs: remove no longer used function to run delayed refs asynchronously")
CC: stable@vger.kernel.org # 5.2+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_read_block_groups(), if we have an invalid block group which
has mixed type (DATA|METADATA) while the fs doesn't have MIXED_GROUPS
feature, we error out without freeing the block group cache.
This patch will add the missing btrfs_put_block_group() to prevent
memory leak.
Note for stable backports: the file to patch in versions <= 5.3 is
fs/btrfs/extent-tree.c
Fixes: 49303381f1 ("Btrfs: bail out if block group has different mixed flag")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we error out when finding a page at relocate_file_extent_cluster(), we
need to release the outstanding extents counter on the relocation inode,
set by the previous call to btrfs_delalloc_reserve_metadata(), otherwise
the inode's block reserve size can never decrease to zero and metadata
space is leaked. Therefore add a call to btrfs_delalloc_release_extents()
in case we can't find the target page.
Fixes: 8b62f87bad ("Btrfs: rework outstanding_extents")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl2fQkoACgkQxWXV+ddt
WDsXgA/+OpORcroqswa5V/AB5NgSMv08QfBtL7en7cUA2cGbrTt0lAoixIesCeGA
7y4umlx7zWDhrG0NFv8E21xxkMzK72/YQnN0C6ZwRCi+ZbPSDlpMCwSNV9b6oVt4
t9mJQ2kNZ80wj9W7jtoyiiWZ2OBccWywKBYr+BXybha5BliSd1XbpWMv90lODQHc
1oH1FOphPAf+nSEtW4g0BV8UHy1n1+7TqjEHDilj0TuHlO0MHsR2FQcsRnMBv5H/
CcEH3+870pUOjvm/l1OAdIrzrnH6UsXKHnOmJpYZIAQdG4xtHq/d6O9sfQMZK/Af
VMXuju558kGOvrYdgffYa0HuW3P6c8q5BeEtGAZwjGsQS5GxmW63et/x+HIEl4RU
4SWMfD592GK1/yQn31NWGav7Olkr4L0Eh9iqfKk2nWayTV+pqs8NgkGumkoNB66n
WJs2m2WwKvZzj1Ys9ilRzo17qt2U/m4eLQtbut3m5eYCpzvo38PDrqOVqKTZG/dc
nG1JBJNk+yjV+RkOO0gnQ8/zzooEGvMhVIx5iq52tWjvxnvOGSVv9QAAkKrbMoOX
kn/b3heL57Z+kilUU3Zd0GvVif+awIgOj+PmAevR+w3MLoJ2gLfWb7qcONdHyFPk
jK7rsuP737fyHpZ1tvJyfTt4Lk/u1UbApKrO4QIku1r9IJmmu20=
=zpDj
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more stabitly fixes, one build warning fix.
- fix inode allocation under NOFS context
- fix leak in fiemap due to concurrent append writes
- fix log-root tree updates
- fix balance convert of single profile on 32bit architectures
- silence false positive warning on old GCCs (code moved in rc1)"
* tag 'for-5.4-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: silence maybe-uninitialized warning in clone_range
btrfs: fix uninitialized ret in ref-verify
btrfs: allocate new inode in NOFS context
btrfs: fix balance convert to single on 32-bit host CPUs
btrfs: fix incorrect updating of log root tree
Btrfs: fix memory leak due to concurrent append writes with fiemap
GCC throws warning message as below:
‘clone_src_i_size’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
^
fs/btrfs/send.c:5088:6: note: ‘clone_src_i_size’ was declared here
u64 clone_src_i_size;
^
The clone_src_i_size is only used as call-by-reference
in a call to get_inode_info().
Silence the warning by initializing clone_src_i_size to 0.
Note that the warning is a false positive and reported by older versions
of GCC (eg. 7.x) but not eg 9.x. As there have been numerous people, the
patch is applied. Setting clone_src_i_size to 0 does not otherwise make
sense and would not do any action in case the code changes in the future.
Signed-off-by: Austin Kim <austindh.kim@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
Coverity caught a case where we could return with a uninitialized value
in ret in process_leaf. This is actually pretty likely because we could
very easily run into a block group item key and have a garbage value in
ret and think there was an errror. Fix this by initializing ret to 0.
Reported-by: Colin Ian King <colin.king@canonical.com>
Fixes: fd708b81d9 ("Btrfs: add a extent ref verify tool")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, the command:
btrfs balance start -dconvert=single,soft .
on a Raspberry Pi produces the following kernel message:
BTRFS error (device mmcblk0p2): balance: invalid convert data profile single
This fails because we use is_power_of_2(unsigned long) to validate
the new data profile, the constant for 'single' profile uses bit 48,
and there are only 32 bits in a long on ARM.
Fix by open-coding the check using u64 variables.
Tested by completing the original balance command on several Raspberry
Pis.
Fixes: 818255feec ("btrfs: use common helper instead of open coding a bit test")
CC: stable@vger.kernel.org # 4.20+
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've historically had reports of being unable to mount file systems
because the tree log root couldn't be read. Usually this is the "parent
transid failure", but could be any of the related errors, including
"fsid mismatch" or "bad tree block", depending on which block got
allocated.
The modification of the individual log root items are serialized on the
per-log root root_mutex. This means that any modification to the
per-subvol log root_item is completely protected.
However we update the root item in the log root tree outside of the log
root tree log_mutex. We do this in order to allow multiple subvolumes
to be updated in each log transaction.
This is problematic however because when we are writing the log root
tree out we update the super block with the _current_ log root node
information. Since these two operations happen independently of each
other, you can end up updating the log root tree in between writing out
the dirty blocks and setting the super block to point at the current
root.
This means we'll point at the new root node that hasn't been written
out, instead of the one we should be pointing at. Thus whatever garbage
or old block we end up pointing at complains when we mount the file
system later and try to replay the log.
Fix this by copying the log's root item into a local root item copy.
Then once we're safely under the log_root_tree->log_mutex we update the
root item in the log_root_tree. This way we do not modify the
log_root_tree while we're committing it, fixing the problem.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we have a buffered write that starts at an offset greater than or
equals to the file's size happening concurrently with a full ranged
fiemap, we can end up leaking an extent state structure.
Suppose we have a file with a size of 1Mb, and before the buffered write
and fiemap are performed, it has a single extent state in its io tree
representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set.
The following sequence diagram shows how the memory leak happens if a
fiemap a buffered write, starting at offset 1Mb and with a length of
4Kb, are performed concurrently.
CPU 1 CPU 2
extent_fiemap()
--> it's a full ranged fiemap
range from 0 to LLONG_MAX - 1
(9223372036854775807)
--> locks range in the inode's
io tree
--> after this we have 2 extent
states in the io tree:
--> 1 for range [0, 1Mb[ with
the bits EXTENT_LOCKED and
EXTENT_DELALLOC_BITS set
--> 1 for the range
[1Mb, LLONG_MAX[ with
the EXTENT_LOCKED bit set
--> start buffered write at offset
1Mb with a length of 4Kb
btrfs_file_write_iter()
btrfs_buffered_write()
--> cached_state is NULL
lock_and_cleanup_extent_if_need()
--> returns 0 and does not lock
range because it starts
at current i_size / eof
--> cached_state remains NULL
btrfs_dirty_pages()
btrfs_set_extent_delalloc()
(...)
__set_extent_bit()
--> splits extent state for range
[1Mb, LLONG_MAX[ and now we
have 2 extent states:
--> one for the range
[1Mb, 1Mb + 4Kb[ with
EXTENT_LOCKED set
--> another one for the range
[1Mb + 4Kb, LLONG_MAX[ with
EXTENT_LOCKED set as well
--> sets EXTENT_DELALLOC on the
extent state for the range
[1Mb, 1Mb + 4Kb[
--> caches extent state
[1Mb, 1Mb + 4Kb[ into
@cached_state because it has
the bit EXTENT_LOCKED set
--> btrfs_buffered_write() ends up
with a non-NULL cached_state and
never calls anything to release its
reference on it, resulting in a
memory leak
Fix this by calling free_extent_state() on cached_state if the range was
not locked by lock_and_cleanup_extent_if_need().
The same issue can happen if anything else other than fiemap locks a range
that covers eof and beyond.
This could be triggered, sporadically, by test case generic/561 from the
fstests suite, which makes duperemove run concurrently with fsstress, and
duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set
the leak is reported in dmesg/syslog when removing the btrfs module with
a message like the following:
[77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1
Otherwise (CONFIG_BTRFS_DEBUG not set) detectable with kmemleak.
CC: stable@vger.kernel.org # 4.16+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl2SDbMACgkQxWXV+ddt
WDsUhw/9HcRsT6SlrwA2R5leHxCR5UMwT2Zmbxpfft37ANF0SC1UINHBfnmquM97
xX6fdRSR9RUjF9DrdLPfLBnJDQ/MnHl1ruIVBFhJm6cJ9TJwf9E0TiJBQt+08JWg
vy5hZBWvsPWWRBJ94XPMe4LtakK/isW4Cz5W9AdrC2Siqw69j6eZzms2AnIjyBjA
BoKg4se2Ay2rMxLZWXIOj9374PU+N1cnRnqgh77ZxLku5WdCzrDfB5safE7UmoTG
/MWJuuIgzOk0iQpQORRtEZDS1dNe5KT9m4xXkUbrZbQROwqnXrT1SVIsuqNAvlPk
uaymR1W8nshepzpMlSxVydLv/mKWZNUGnDxOJ23ooow8Yd7ndppXEtFuGwCYqIFc
xQqxuTLREvJ9+jpSv11bmDpk/ULRqpV+2PjUqGaWlGwFArJ+qFRLVGYx31eXmDPj
t2mrPOcXGzY0pKtIpbkuUGleY/jeI+BNsvD4+QPs+jnp0nmfvH0/Rmp7grGqx2FI
rQM8Gn4a5i3nuEDWLp8nN2wcKC3ePwy96Vp2tqfsl6TVTPx4EFzGLkWogHR2yiqI
0LAj8YWFmWuChSv71wYOjX79CppjcbNwOakSwtDjV30jkwoh2f/0D3OwOpua2xe8
75KQMaSB0kesGZz7ZkL1kMqA5m5w7MGZom6XZoBJ+bq2HPLB2jo=
=2UM7
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A bunch of fixes that accumulated in recent weeks, mostly material for
stable.
Summary:
- fix for regression from 5.3 that prevents to use balance convert
with single profile
- qgroup fixes: rescan race, accounting leak with multiple writers,
potential leak after io failure recovery
- fix for use after free in relocation (reported by KASAN)
- other error handling fixups"
* tag 'for-5.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: qgroup: Fix reserved data space leak if we have multiple reserve calls
btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space
btrfs: Fix a regression which we can't convert to SINGLE profile
btrfs: relocation: fix use-after-free on dead relocation roots
Btrfs: fix race setting up and completing qgroup rescan workers
Btrfs: fix missing error return if writeback for extent buffer never started
btrfs: adjust dirty_metadata_bytes after writeback failure of extent buffer
Btrfs: fix selftests failure due to uninitialized i_mode in test inodes
[BUG]
The following script can cause btrfs qgroup data space leak:
mkfs.btrfs -f $dev
mount $dev -o nospace_cache $mnt
btrfs subv create $mnt/subv
btrfs quota en $mnt
btrfs quota rescan -w $mnt
btrfs qgroup limit 128m $mnt/subv
for (( i = 0; i < 3; i++)); do
# Create 3 64M holes for latter fallocate to fail
truncate -s 192m $mnt/subv/file
xfs_io -c "pwrite 64m 4k" $mnt/subv/file > /dev/null
xfs_io -c "pwrite 128m 4k" $mnt/subv/file > /dev/null
sync
# it's supposed to fail, and each failure will leak at least 64M
# data space
xfs_io -f -c "falloc 0 192m" $mnt/subv/file &> /dev/null
rm $mnt/subv/file
sync
done
# Shouldn't fail after we removed the file
xfs_io -f -c "falloc 0 64m" $mnt/subv/file
[CAUSE]
Btrfs qgroup data reserve code allow multiple reservations to happen on
a single extent_changeset:
E.g:
btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_1M);
btrfs_qgroup_reserve_data(inode, &data_reserved, SZ_1M, SZ_2M);
btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_4M);
Btrfs qgroup code has its internal tracking to make sure we don't
double-reserve in above example.
The only pattern utilizing this feature is in the main while loop of
btrfs_fallocate() function.
However btrfs_qgroup_reserve_data()'s error handling has a bug in that
on error it clears all ranges in the io_tree with EXTENT_QGROUP_RESERVED
flag but doesn't free previously reserved bytes.
This bug has a two fold effect:
- Clearing EXTENT_QGROUP_RESERVED ranges
This is the correct behavior, but it prevents
btrfs_qgroup_check_reserved_leak() to catch the leakage as the
detector is purely EXTENT_QGROUP_RESERVED flag based.
- Leak the previously reserved data bytes.
The bug manifests when N calls to btrfs_qgroup_reserve_data are made and
the last one fails, leaking space reserved in the previous ones.
[FIX]
Also free previously reserved data bytes when btrfs_qgroup_reserve_data
fails.
Fixes: 5247255370 ("btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Under the following case with qgroup enabled, if some error happened
after we have reserved delalloc space, then in error handling path, we
could cause qgroup data space leakage:
From btrfs_truncate_block() in inode.c:
ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
block_start, blocksize);
if (ret)
goto out;
again:
page = find_or_create_page(mapping, index, mask);
if (!page) {
btrfs_delalloc_release_space(inode, data_reserved,
block_start, blocksize, true);
btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, true);
ret = -ENOMEM;
goto out;
}
[CAUSE]
In the above case, btrfs_delalloc_reserve_space() will call
btrfs_qgroup_reserve_data() and mark the io_tree range with
EXTENT_QGROUP_RESERVED flag.
In the error handling path, we have the following call stack:
btrfs_delalloc_release_space()
|- btrfs_free_reserved_data_space()
|- btrsf_qgroup_free_data()
|- __btrfs_qgroup_release_data(reserved=@reserved, free=1)
|- qgroup_free_reserved_data(reserved=@reserved)
|- clear_record_extent_bits();
|- freed += changeset.bytes_changed;
However due to a completion bug, qgroup_free_reserved_data() will clear
EXTENT_QGROUP_RESERVED flag in BTRFS_I(inode)->io_failure_tree, other
than the correct BTRFS_I(inode)->io_tree.
Since io_failure_tree is never marked with that flag,
btrfs_qgroup_free_data() will not free any data reserved space at all,
causing a leakage.
This type of error handling can only be triggered by errors outside of
qgroup code. So EDQUOT error from qgroup can't trigger it.
[FIX]
Fix the wrong target io_tree.
Reported-by: Josef Bacik <josef@toxicpanda.com>
Fixes: bc42bda223 ("btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With v5.3 kernel, we can't convert to SINGLE profile:
# btrfs balance start -f -dconvert=single $mnt
ERROR: error during balancing '/mnt/btrfs': Invalid argument
# dmesg -t | tail
validate_convert_profile: data profile=0x1000000000000 allowed=0x20 is_valid=1 final=0x1000000000000 ret=1
BTRFS error (device dm-3): balance: invalid convert data profile single
[CAUSE]
With the extra debug output added, it shows that the @allowed bit is
lacking the special in-memory only SINGLE profile bit.
Thus we fail at that (profile & ~allowed) check.
This regression is caused by commit 081db89b13 ("btrfs: use raid_attr
to get allowed profiles for balance conversion") and the fact that we
don't use any bit to indicate SINGLE profile on-disk, but uses special
in-memory only bit to help distinguish different profiles.
[FIX]
Add that BTRFS_AVAIL_ALLOC_BIT_SINGLE to @allowed, so the code should be
the same as it was and fix the regression.
Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: 081db89b13 ("btrfs: use raid_attr to get allowed profiles for balance conversion")
CC: stable@vger.kernel.org # 5.3+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
One user reported a reproducible KASAN report about use-after-free:
BTRFS info (device sdi1): balance: start -dvrange=1256811659264..1256811659265
BTRFS info (device sdi1): relocating block group 1256811659264 flags data|raid0
==================================================================
BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
Write of size 8 at addr ffff88856f671710 by task kworker/u24:10/261579
CPU: 2 PID: 261579 Comm: kworker/u24:10 Tainted: P OE 5.2.11-arch1-1-kasan #4
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Extreme4, BIOS P3.80 04/06/2018
Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
Call Trace:
dump_stack+0x7b/0xba
print_address_description+0x6c/0x22e
? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
__kasan_report.cold+0x1b/0x3b
? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
kasan_report+0x12/0x17
__asan_report_store8_noabort+0x17/0x20
btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
record_root_in_trans+0x2a0/0x370 [btrfs]
btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
start_transaction+0x1ab/0xe90 [btrfs]
btrfs_join_transaction+0x1d/0x20 [btrfs]
btrfs_finish_ordered_io+0x7bf/0x18a0 [btrfs]
? lock_repin_lock+0x400/0x400
? __kmem_cache_shutdown.cold+0x140/0x1ad
? btrfs_unlink_subvol+0x9b0/0x9b0 [btrfs]
finish_ordered_fn+0x15/0x20 [btrfs]
normal_work_helper+0x1bd/0xca0 [btrfs]
? process_one_work+0x819/0x1720
? kasan_check_read+0x11/0x20
btrfs_endio_write_helper+0x12/0x20 [btrfs]
process_one_work+0x8c9/0x1720
? pwq_dec_nr_in_flight+0x2f0/0x2f0
? worker_thread+0x1d9/0x1030
worker_thread+0x98/0x1030
kthread+0x2bb/0x3b0
? process_one_work+0x1720/0x1720
? kthread_park+0x120/0x120
ret_from_fork+0x35/0x40
Allocated by task 369692:
__kasan_kmalloc.part.0+0x44/0xc0
__kasan_kmalloc.constprop.0+0xba/0xc0
kasan_kmalloc+0x9/0x10
kmem_cache_alloc_trace+0x138/0x260
btrfs_read_tree_root+0x92/0x360 [btrfs]
btrfs_read_fs_root+0x10/0xb0 [btrfs]
create_reloc_root+0x47d/0xa10 [btrfs]
btrfs_init_reloc_root+0x1e2/0x340 [btrfs]
record_root_in_trans+0x2a0/0x370 [btrfs]
btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
start_transaction+0x1ab/0xe90 [btrfs]
btrfs_start_transaction+0x1e/0x20 [btrfs]
__btrfs_prealloc_file_range+0x1c2/0xa00 [btrfs]
btrfs_prealloc_file_range+0x13/0x20 [btrfs]
prealloc_file_extent_cluster+0x29f/0x570 [btrfs]
relocate_file_extent_cluster+0x193/0xc30 [btrfs]
relocate_data_extent+0x1f8/0x490 [btrfs]
relocate_block_group+0x600/0x1060 [btrfs]
btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
btrfs_relocate_chunk+0x9e/0x180 [btrfs]
btrfs_balance+0x14e4/0x2fc0 [btrfs]
btrfs_ioctl_balance+0x47f/0x640 [btrfs]
btrfs_ioctl+0x119d/0x8380 [btrfs]
do_vfs_ioctl+0x9f5/0x1060
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x73/0xb0
do_syscall_64+0xa5/0x370
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 369692:
__kasan_slab_free+0x14f/0x210
kasan_slab_free+0xe/0x10
kfree+0xd8/0x270
btrfs_drop_snapshot+0x154c/0x1eb0 [btrfs]
clean_dirty_subvols+0x227/0x340 [btrfs]
relocate_block_group+0x972/0x1060 [btrfs]
btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
btrfs_relocate_chunk+0x9e/0x180 [btrfs]
btrfs_balance+0x14e4/0x2fc0 [btrfs]
btrfs_ioctl_balance+0x47f/0x640 [btrfs]
btrfs_ioctl+0x119d/0x8380 [btrfs]
do_vfs_ioctl+0x9f5/0x1060
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x73/0xb0
do_syscall_64+0xa5/0x370
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The buggy address belongs to the object at ffff88856f671100
which belongs to the cache kmalloc-4k of size 4096
The buggy address is located 1552 bytes inside of
4096-byte region [ffff88856f671100, ffff88856f672100)
The buggy address belongs to the page:
page:ffffea0015bd9c00 refcount:1 mapcount:0 mapping:ffff88864400e600 index:0x0 compound_mapcount: 0
flags: 0x2ffff0000010200(slab|head)
raw: 02ffff0000010200 dead000000000100 dead000000000200 ffff88864400e600
raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88856f671600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88856f671680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88856f671700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88856f671780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88856f671800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
BTRFS info (device sdi1): 1 enospc errors during balance
BTRFS info (device sdi1): balance: ended with status: -28
[CAUSE]
The problem happens when finish_ordered_io() get called with balance
still running, while the reloc root of that subvolume is already dead.
(Tree is swap already done, but tree not yet deleted for possible qgroup
usage.)
That means root->reloc_root still exists, but that reloc_root can be
under btrfs_drop_snapshot(), thus we shouldn't access it.
The following race could cause the use-after-free problem:
CPU1 | CPU2
--------------------------------------------------------------------------
| relocate_block_group()
| |- unset_reloc_control(rc)
| |- btrfs_commit_transaction()
btrfs_finish_ordered_io() | |- clean_dirty_subvols()
|- btrfs_join_transaction() | |
|- record_root_in_trans() | |
|- btrfs_init_reloc_root() | |
|- if (root->reloc_root) | |
| | |- root->reloc_root = NULL
| | |- btrfs_drop_snapshot(reloc_root);
|- reloc_root->last_trans|
= trans->transid |
^^^^^^^^^^^^^^^^^^^^^^
Use after free
[FIX]
Fix it by the following modifications:
- Test if the root has dead reloc tree before accessing root->reloc_root
If the root has BTRFS_ROOT_DEAD_RELOC_TREE, then we don't need to
create or update root->reloc_tree
- Clear the BTRFS_ROOT_DEAD_RELOC_TREE flag until we have fully dropped
reloc tree
To co-operate with above modification, so as long as
BTRFS_ROOT_DEAD_RELOC_TREE is still set, we won't try to re-create
reloc tree at record_root_in_trans().
Reported-by: Cebtenzzre <cebtenzzre@gmail.com>
Fixes: d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a race between setting up a qgroup rescan worker and completing
a qgroup rescan worker that can lead to callers of the qgroup rescan wait
ioctl to either not wait for the rescan worker to complete or to hang
forever due to missing wake ups. The following diagram shows a sequence
of steps that illustrates the race.
CPU 1 CPU 2 CPU 3
btrfs_ioctl_quota_rescan()
btrfs_qgroup_rescan()
qgroup_rescan_init()
mutex_lock(&fs_info->qgroup_rescan_lock)
spin_lock(&fs_info->qgroup_lock)
fs_info->qgroup_flags |=
BTRFS_QGROUP_STATUS_FLAG_RESCAN
init_completion(
&fs_info->qgroup_rescan_completion)
fs_info->qgroup_rescan_running = true
mutex_unlock(&fs_info->qgroup_rescan_lock)
spin_unlock(&fs_info->qgroup_lock)
btrfs_init_work()
--> starts the worker
btrfs_qgroup_rescan_worker()
mutex_lock(&fs_info->qgroup_rescan_lock)
fs_info->qgroup_flags &=
~BTRFS_QGROUP_STATUS_FLAG_RESCAN
mutex_unlock(&fs_info->qgroup_rescan_lock)
starts transaction, updates qgroup status
item, etc
btrfs_ioctl_quota_rescan()
btrfs_qgroup_rescan()
qgroup_rescan_init()
mutex_lock(&fs_info->qgroup_rescan_lock)
spin_lock(&fs_info->qgroup_lock)
fs_info->qgroup_flags |=
BTRFS_QGROUP_STATUS_FLAG_RESCAN
init_completion(
&fs_info->qgroup_rescan_completion)
fs_info->qgroup_rescan_running = true
mutex_unlock(&fs_info->qgroup_rescan_lock)
spin_unlock(&fs_info->qgroup_lock)
btrfs_init_work()
--> starts another worker
mutex_lock(&fs_info->qgroup_rescan_lock)
fs_info->qgroup_rescan_running = false
mutex_unlock(&fs_info->qgroup_rescan_lock)
complete_all(&fs_info->qgroup_rescan_completion)
Before the rescan worker started by the task at CPU 3 completes, if
another task calls btrfs_ioctl_quota_rescan(), it will get -EINPROGRESS
because the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN is set at
fs_info->qgroup_flags, which is expected and correct behaviour.
However if other task calls btrfs_ioctl_quota_rescan_wait() before the
rescan worker started by the task at CPU 3 completes, it will return
immediately without waiting for the new rescan worker to complete,
because fs_info->qgroup_rescan_running is set to false by CPU 2.
This race is making test case btrfs/171 (from fstests) to fail often:
btrfs/171 9s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad)
--- tests/btrfs/171.out 2018-09-16 21:30:48.505104287 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad 2019-09-19 02:01:36.938486039 +0100
@@ -1,2 +1,3 @@
QA output created by 171
+ERROR: quota rescan failed: Operation now in progress
Silence is golden
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/171.out /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad' to see the entire diff)
That is because the test calls the btrfs-progs commands "qgroup quota
rescan -w", "qgroup assign" and "qgroup remove" in a sequence that makes
calls to the rescan start ioctl fail with -EINPROGRESS (note the "btrfs"
commands 'qgroup assign' and 'qgroup remove' often call the rescan start
ioctl after calling the qgroup assign ioctl,
btrfs_ioctl_qgroup_assign()), since previous waits didn't actually wait
for a rescan worker to complete.
Another problem the race can cause is missing wake ups for waiters,
since the call to complete_all() happens outside a critical section and
after clearing the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN. In the sequence
diagram above, if we have a waiter for the first rescan task (executed
by CPU 2), then fs_info->qgroup_rescan_completion.wait is not empty, and
if after the rescan worker clears BTRFS_QGROUP_STATUS_FLAG_RESCAN and
before it calls complete_all() against
fs_info->qgroup_rescan_completion, the task at CPU 3 calls
init_completion() against fs_info->qgroup_rescan_completion which
re-initilizes its wait queue to an empty queue, therefore causing the
rescan worker at CPU 2 to call complete_all() against an empty queue,
never waking up the task waiting for that rescan worker.
Fix this by clearing BTRFS_QGROUP_STATUS_FLAG_RESCAN and setting
fs_info->qgroup_rescan_running to false in the same critical section,
delimited by the mutex fs_info->qgroup_rescan_lock, as well as doing the
call to complete_all() in that same critical section. This gives the
protection needed to avoid rescan wait ioctl callers not waiting for a
running rescan worker and the lost wake ups problem, since setting that
rescan flag and boolean as well as initializing the wait queue is done
already in a critical section delimited by that mutex (at
qgroup_rescan_init()).
Fixes: 57254b6ebc ("Btrfs: add ioctl to wait for qgroup rescan completion")
Fixes: d2c609b834 ("btrfs: properly track when rescan worker is running")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If lock_extent_buffer_for_io() fails, it returns a negative value, but its
caller btree_write_cache_pages() ignores such error. This means that a
call to flush_write_bio(), from lock_extent_buffer_for_io(), might have
failed. We should make btree_write_cache_pages() notice such error values
and stop immediatelly, making sure filemap_fdatawrite_range() returns an
error to the transaction commit path. A failure from flush_write_bio()
should also result in the endio callback end_bio_extent_buffer_writepage()
being invoked, which sets the BTRFS_FS_*_ERR bits appropriately, so that
there's no risk a transaction or log commit doesn't catch a writeback
failure.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before, if a eb failed to write out, we would end up triggering a
BUG_ON(). As of f4340622e0 ("btrfs: extent_io: Move the BUG_ON() in
flush_write_bio() one level up"), we no longer BUG_ON(), so we should
make life consistent and add back the unwritten bytes to
dirty_metadata_bytes.
Fixes: f4340622e0 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
CC: stable@vger.kernel.org # 5.2+
Reviewed-by: Filipe Manana <fdmanana@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Some of the self tests create a test inode, setup some extents and then do
calls to btrfs_get_extent() to test that the corresponding extent maps
exist and are correct. However btrfs_get_extent(), since the 5.2 merge
window, now errors out when it finds a regular or prealloc extent for an
inode that does not correspond to a regular file (its ->i_mode is not
S_IFREG). This causes the self tests to fail sometimes, specially when
KASAN, slub_debug and page poisoning are enabled:
$ modprobe btrfs
modprobe: ERROR: could not insert 'btrfs': Invalid argument
$ dmesg
[ 9414.691648] Btrfs loaded, crc32c=crc32c-intel, debug=on, assert=on, integrity-checker=on, ref-verify=on
[ 9414.692655] BTRFS: selftest: sectorsize: 4096 nodesize: 4096
[ 9414.692658] BTRFS: selftest: running btrfs free space cache tests
[ 9414.692918] BTRFS: selftest: running extent only tests
[ 9414.693061] BTRFS: selftest: running bitmap only tests
[ 9414.693366] BTRFS: selftest: running bitmap and extent tests
[ 9414.696455] BTRFS: selftest: running space stealing from bitmap to extent tests
[ 9414.697131] BTRFS: selftest: running extent buffer operation tests
[ 9414.697133] BTRFS: selftest: running btrfs_split_item tests
[ 9414.697564] BTRFS: selftest: running extent I/O tests
[ 9414.697583] BTRFS: selftest: running find delalloc tests
[ 9415.081125] BTRFS: selftest: running find_first_clear_extent_bit test
[ 9415.081278] BTRFS: selftest: running extent buffer bitmap tests
[ 9415.124192] BTRFS: selftest: running inode tests
[ 9415.124195] BTRFS: selftest: running btrfs_get_extent tests
[ 9415.127909] BTRFS: selftest: running hole first btrfs_get_extent test
[ 9415.128343] BTRFS critical (device (efault)): regular/prealloc extent found for non-regular inode 256
[ 9415.131428] BTRFS: selftest: fs/btrfs/tests/inode-tests.c:904 expected a real extent, got 0
This happens because the test inodes are created without ever initializing
the i_mode field of the inode, and neither VFS's new_inode() nor the btrfs
callback btrfs_alloc_inode() initialize the i_mode. Initialization of the
i_mode is done through the various callbacks used by the VFS to create
new inodes (regular files, directories, symlinks, tmpfiles, etc), which
all call btrfs_new_inode() which in turn calls inode_init_owner(), which
sets the inode's i_mode. Since the tests only uses new_inode() to create
the test inodes, the i_mode was never initialized.
This always happens on a VM I used with kasan, slub_debug and many other
debug facilities enabled. It also happened to someone who reported this
on bugzilla (on a 5.3-rc).
Fix this by setting i_mode to S_IFREG at btrfs_new_test_inode().
Fixes: 6bf9e4bd6a ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204397
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl1/hCoACgkQxWXV+ddt
WDs0lQ//flGLX4fvaY2vuWA26t1elITnIatyX8S+xP4pUsT1Tyy1egeGpR8Jku/7
sCOgUlEM2MNXqveOdkQqPJuFPp3B6tInz4S/fowtLlz4enp7uTXw2SFuS3bhOJ+b
rpxK9VTc6QV3aipBCG31m8fnDiMaj2Hcspp0oej3V2mBhLUvzn69+P4eo7WN+46w
r2F605+lfURauHE6WjM09HINx3NGSfPqdSA5rJvHSm0jlxhb9l3DJOX8cYkbf8lo
MAbLDZmtiDiQAqRcsQPi6LZ1LKBkOYaeSnVvnXnH23FI04LBra3duk03qpvWCW2R
c1tFnKF5vACCyBQp1z8WYP9GjjoW5WT33R2iXufgwXP6pkLpS/12qLLeXqO2K4p5
zINKrIkF3P+GHxiDsQZE3G9A4UpKWFHCxKdxyWIV8LQDEBrgE2Mo3NThEyRBbP+8
1dia4j+qFHvPTMNBvBCjCZMqDwbCe9H70WOXKGE36JITW2le91mn4qHl4SuWReUP
IoHYDVcC/eBGRegc9X+bLJNjJYqo+XFo6u32/fUC5YVhngycQEi2vg1vv8fWQ7dB
g/Ruo3Inrk8h5kPmrHvbOzGazgANIt5ELHrYMRMA5WSgaq29jtGt9oTnsrd+I88G
aPJtwAZfLwdSjl/pwJw8atEPrf04DA2w+gO7rZ/AmeLshnGfOTc=
=bY+a
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This continues with work on code refactoring, sanity checks and space
handling. There are some less user visible changes, nothing that would
particularly stand out.
User visible changes:
- tree checker, more sanity checks of:
- ROOT_ITEM (key, size, generation, level, alignment, flags)
- EXTENT_ITEM and METADATA_ITEM checks (key, size, offset,
alignment, refs)
- tree block reference items
- EXTENT_DATA_REF (key, hash, offset)
- deprecate flag BTRFS_SUBVOL_CREATE_ASYNC for subvolume creation
ioctl, scheduled removal in 5.7
- delete stale and unused UAPI definitions
BTRFS_DEV_REPLACE_ITEM_STATE_*
- improved export of debugging information available via existing
sysfs directory structure
- try harder to delete relations between qgroups and allow to delete
orphan entries
- remove unreliable space checks before relocation starts
Core:
- space handling:
- improved ticket reservations and other high level logic in
order to remove special cases
- factor flushing infrastructure and use it for different
contexts, allows to remove some special case handling
- reduce metadata reservation when only updating inodes
- reduce global block reserve minimum size (affects small
filesystems)
- improved overcommit logic wrt global block reserve
- tests:
- fix memory leaks in extent IO tree
- catch all TRIM range
Fixes:
- fix ENOSPC errors, leading to transaction aborts, when cloning
extents
- several fixes for inode number cache (mount option inode_cache)
- fix potential soft lockups during send when traversing large trees
- fix unaligned access to space cache pages with SLUB debug on
(PowerPC)
Other:
- refactoring public/private functions, moving to new or more
appropriate files
- defines converted to enums
- error handling improvements
- more assertions and comments
- old code deletion"
* tag 'for-5.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (138 commits)
btrfs: Relinquish CPUs in btrfs_compare_trees
btrfs: Don't assign retval of btrfs_try_tree_write_lock/btrfs_tree_read_lock_atomic
btrfs: create structure to encode checksum type and length
btrfs: turn checksum type define into an enum
btrfs: add enospc debug messages for ticket failure
btrfs: do not account global reserve in can_overcommit
btrfs: use btrfs_try_granting_tickets in update_global_rsv
btrfs: always reserve our entire size for the global reserve
btrfs: change the minimum global reserve size
btrfs: rename btrfs_space_info_add_old_bytes
btrfs: remove orig_bytes from reserve_ticket
btrfs: fix may_commit_transaction to deal with no partial filling
btrfs: rework wake_all_tickets
btrfs: refactor the ticket wakeup code
btrfs: stop partially refilling tickets when releasing space
btrfs: add space reservation tracepoint for reserved bytes
btrfs: roll tracepoint into btrfs_space_info_update helper
btrfs: do not allow reservations if we have pending tickets
btrfs: stop clearing EXTENT_DIRTY in inode I/O tree
btrfs: treat RWF_{,D}SYNC writes as sync for CRCs
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl16aTEACgkQxWXV+ddt
WDvICg//cSn5w+g6EnxbrAZ6IYQJ4GA7lZSk2i6Dc/lI3BTfs7Wj0SPRKd01pBjT
N+wbqoOgubsS1jkNfJsGCN80XzSa0tvyQdbezj5ncgSPXp4FRlT0K24EUQNPaqbg
SsvvxAOCerVN3Yj2qrHNWIS5qZ5/8/NjLXca1DJ/OYmrkKfhe+Z6/b9EuKffPnco
erMnaeSvQ27hYkkcdM0DGcWDoHHAQrefGNjQzp5vncJNN1F7+EGLbcH31UwApk1K
/hvOQ6Q6SoR/NKbQu3AitrR9u7v9uhWP9jHJZT46q1m89CzI4S5FjK2wKZFjPE6r
0PGRqnpdaGAERaTo3s6jIqv/X2gzJkhhhzGMiPgPJCQbAH39f/fFGEX22TjG33Yq
2CiGSIPnmKQ7HE494YLuSyHD/89SutMMCkbF0sFBoKuTnu2HQMn9r5Pk6bEKtvIY
iTk75/WTXR02qWCVhTyNDa9QnxewQGJC1d1KNQ6MwbzBiYyG9S/DDZnjLJPNx7DF
KAAANCDdyPpraLcmw2sD/qd1o10HfQmn9z1L2v3YvJBfjMe76SQFCP5WwaJRcjOm
c3ScAX9bXeXJgH+E98kWc7T6p49IPdMDGAtArQmtjO4V8pFRuqG+2Ibg6Za/y5XZ
fkaS5UY+XIk3TUpEqkWKMPMigM9a3jgHskyMgdRLQfVnoOc6Z+k=
=KXB8
-----END PGP SIGNATURE-----
Merge tag 'for-5.3-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Here are two fixes, one of them urgent fixing a bug introduced in 5.2
and reported by many users. It took time to identify the root cause,
catching the 5.3 release is higly desired also to push the fix to 5.2
stable tree.
The bug is a mess up of return values after adding proper error
handling and honestly the kind of bug that can cause sleeping
disorders until it's caught. My appologies to everybody who was
affected.
Summary of what could happen:
1) either a hang when committing a transaction, if this happens
there's no risk of corruption, still the hang is very inconvenient
and can't be resolved without a reboot
2) writeback for some btree nodes may never be started and we end up
committing a transaction without noticing that, this is really
serious and that will lead to the "parent transid verify failed"
messages"
* tag 'for-5.3-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix unwritten extent buffers and hangs on future writeback attempts
Btrfs: fix assertion failure during fsync and use of stale transaction
The lock_extent_buffer_io() returns 1 to the caller to tell it everything
went fine and the callers needs to start writeback for the extent buffer
(submit a bio, etc), 0 to tell the caller everything went fine but it does
not need to start writeback for the extent buffer, and a negative value if
some error happened.
When it's about to return 1 it tries to lock all pages, and if a try lock
on a page fails, and we didn't flush any existing bio in our "epd", it
calls flush_write_bio(epd) and overwrites the return value of 1 to 0 or
an error. The page might have been locked elsewhere, not with the goal
of starting writeback of the extent buffer, and even by some code other
than btrfs, like page migration for example, so it does not mean the
writeback of the extent buffer was already started by some other task,
so returning a 0 tells the caller (btree_write_cache_pages()) to not
start writeback for the extent buffer. Note that epd might currently have
either no bio, so flush_write_bio() returns 0 (success) or it might have
a bio for another extent buffer with a lower index (logical address).
Since we return 0 with the EXTENT_BUFFER_WRITEBACK bit set on the
extent buffer and writeback is never started for the extent buffer,
future attempts to writeback the extent buffer will hang forever waiting
on that bit to be cleared, since it can only be cleared after writeback
completes. Such hang is reported with a trace like the following:
[49887.347053] INFO: task btrfs-transacti:1752 blocked for more than 122 seconds.
[49887.347059] Not tainted 5.2.13-gentoo #2
[49887.347060] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[49887.347062] btrfs-transacti D 0 1752 2 0x80004000
[49887.347064] Call Trace:
[49887.347069] ? __schedule+0x265/0x830
[49887.347071] ? bit_wait+0x50/0x50
[49887.347072] ? bit_wait+0x50/0x50
[49887.347074] schedule+0x24/0x90
[49887.347075] io_schedule+0x3c/0x60
[49887.347077] bit_wait_io+0x8/0x50
[49887.347079] __wait_on_bit+0x6c/0x80
[49887.347081] ? __lock_release.isra.29+0x155/0x2d0
[49887.347083] out_of_line_wait_on_bit+0x7b/0x80
[49887.347084] ? var_wake_function+0x20/0x20
[49887.347087] lock_extent_buffer_for_io+0x28c/0x390
[49887.347089] btree_write_cache_pages+0x18e/0x340
[49887.347091] do_writepages+0x29/0xb0
[49887.347093] ? kmem_cache_free+0x132/0x160
[49887.347095] ? convert_extent_bit+0x544/0x680
[49887.347097] filemap_fdatawrite_range+0x70/0x90
[49887.347099] btrfs_write_marked_extents+0x53/0x120
[49887.347100] btrfs_write_and_wait_transaction.isra.4+0x38/0xa0
[49887.347102] btrfs_commit_transaction+0x6bb/0x990
[49887.347103] ? start_transaction+0x33e/0x500
[49887.347105] transaction_kthread+0x139/0x15c
So fix this by not overwriting the return value (ret) with the result
from flush_write_bio(). We also need to clear the EXTENT_BUFFER_WRITEBACK
bit in case flush_write_bio() returns an error, otherwise it will hang
any future attempts to writeback the extent buffer, and undo all work
done before (set back EXTENT_BUFFER_DIRTY, etc).
This is a regression introduced in the 5.2 kernel.
Fixes: 2e3c25136a ("btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()")
Fixes: f4340622e0 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
Reported-by: Zdenek Sojka <zsojka@seznam.cz>
Link: https://lore.kernel.org/linux-btrfs/GpO.2yos.3WGDOLpx6t%7D.1TUDYM@seznam.cz/T/#u
Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Link: https://lore.kernel.org/linux-btrfs/5c4688ac-10a7-fb07-70e8-c5d31a3fbb38@profihost.ag/T/#t
Reported-by: Drazen Kacar <drazen.kacar@oradian.com>
Link: https://lore.kernel.org/linux-btrfs/DB8PR03MB562876ECE2319B3E579590F799C80@DB8PR03MB5628.eurprd03.prod.outlook.com/
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204377
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sometimes when fsync'ing a file we need to log that other inodes exist and
when we need to do that we acquire a reference on the inodes and then drop
that reference using iput() after logging them.
That generally is not a problem except if we end up doing the final iput()
(dropping the last reference) on the inode and that inode has a link count
of 0, which can happen in a very short time window if the logging path
gets a reference on the inode while it's being unlinked.
In that case we end up getting the eviction callback, btrfs_evict_inode(),
invoked through the iput() call chain which needs to drop all of the
inode's items from its subvolume btree, and in order to do that, it needs
to join a transaction at the helper function evict_refill_and_join().
However because the task previously started a transaction at the fsync
handler, btrfs_sync_file(), it has current->journal_info already pointing
to a transaction handle and therefore evict_refill_and_join() will get
that transaction handle from btrfs_join_transaction(). From this point on,
two different problems can happen:
1) evict_refill_and_join() will often change the transaction handle's
block reserve (->block_rsv) and set its ->bytes_reserved field to a
value greater than 0. If evict_refill_and_join() never commits the
transaction, the eviction handler ends up decreasing the reference
count (->use_count) of the transaction handle through the call to
btrfs_end_transaction(), and after that point we have a transaction
handle with a NULL ->block_rsv (which is the value prior to the
transaction join from evict_refill_and_join()) and a ->bytes_reserved
value greater than 0. If after the eviction/iput completes the inode
logging path hits an error or it decides that it must fallback to a
transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
non-zero value from btrfs_log_dentry_safe(), and because of that
non-zero value it tries to commit the transaction using a handle with
a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
the transaction commit hit an assertion failure at
btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
the ->block_rsv is NULL. The produced stack trace for that is like the
following:
[192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
[192922.917553] ------------[ cut here ]------------
[192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
[192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1
[192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
(...)
[192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
[192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
[192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
[192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
[192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
[192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
[192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
[192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
[192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[192922.925105] Call Trace:
[192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
[192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
[192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs]
[192922.926731] do_fsync+0x38/0x60
[192922.927138] __x64_sys_fdatasync+0x13/0x20
[192922.927543] do_syscall_64+0x60/0x1c0
[192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe
(...)
[192922.934077] ---[ end trace f00808b12068168f ]---
2) If evict_refill_and_join() decides to commit the transaction, it will
be able to do it, since the nested transaction join only increments the
transaction handle's ->use_count reference counter and it does not
prevent the transaction from getting committed. This means that after
eviction completes, the fsync logging path will be using a transaction
handle that refers to an already committed transaction. What happens
when using such a stale transaction can be unpredictable, we are at
least having a use-after-free on the transaction handle itself, since
the transaction commit will call kmem_cache_free() against the handle
regardless of its ->use_count value, or we can end up silently losing
all the updates to the log tree after that iput() in the logging path,
or using a transaction handle that in the meanwhile was allocated to
another task for a new transaction, etc, pretty much unpredictable
what can happen.
In order to fix both of them, instead of using iput() during logging, use
btrfs_add_delayed_iput(), so that the logging path of fsync never drops
the last reference on an inode, that step is offloaded to a safe context
(usually the cleaner kthread).
The assertion failure issue was sporadically triggered by the test case
generic/475 from fstests, which loads the dm error target while fsstress
is running, which lead to fsync failing while logging inodes with -EIO
errors and then trying later to commit the transaction, triggering the
assertion failure.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing any form of incremental send the parent and the child trees
need to be compared via btrfs_compare_trees. This can result in long
loop chains without ever relinquishing the CPU. This causes softlockup
detector to trigger when comparing trees with a lot of items. Example
report:
watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [snapperd:16153]
CPU: 0 PID: 16153 Comm: snapperd Not tainted 5.2.9-1-default #1 openSUSE Tumbleweed (unreleased)
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
pstate: 40000005 (nZcv daif -PAN -UAO)
pc : __ll_sc_arch_atomic_sub_return+0x14/0x20
lr : btrfs_release_extent_buffer_pages+0xe0/0x1e8 [btrfs]
sp : ffff00001273b7e0
Call trace:
__ll_sc_arch_atomic_sub_return+0x14/0x20
release_extent_buffer+0xdc/0x120 [btrfs]
free_extent_buffer.part.0+0xb0/0x118 [btrfs]
free_extent_buffer+0x24/0x30 [btrfs]
btrfs_release_path+0x4c/0xa0 [btrfs]
btrfs_free_path.part.0+0x20/0x40 [btrfs]
btrfs_free_path+0x24/0x30 [btrfs]
get_inode_info+0xa8/0xf8 [btrfs]
finish_inode_if_needed+0xe0/0x6d8 [btrfs]
changed_cb+0x9c/0x410 [btrfs]
btrfs_compare_trees+0x284/0x648 [btrfs]
send_subvol+0x33c/0x520 [btrfs]
btrfs_ioctl_send+0x8a0/0xaf0 [btrfs]
btrfs_ioctl+0x199c/0x2288 [btrfs]
do_vfs_ioctl+0x4b0/0x820
ksys_ioctl+0x84/0xb8
__arm64_sys_ioctl+0x28/0x38
el0_svc_common.constprop.0+0x7c/0x188
el0_svc_handler+0x34/0x90
el0_svc+0x8/0xc
Fix this by adding a call to cond_resched at the beginning of the main
loop in btrfs_compare_trees.
Fixes: 7069830a9e ("Btrfs: add btrfs_compare_trees function")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Those function are simple boolean predicates there is no need to assign
their return values to interim variables. Use them directly as
predicates. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Create a structure to encode the type and length for the known on-disk
checksums. This makes it easier to add new checksums later.
The structure and helpers are moved from ctree.h so they don't occupy
space in all headers including ctree.h. This save some space in the
final object.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When debugging weird enospc problems it's handy to be able to dump the
space info when we wake up all tickets, and see what the ticket values
are. This helped me figure out cases where we were enospc'ing when we
shouldn't have been.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We ran into a problem in production where a box with plenty of space was
getting wedged doing ENOSPC flushing. These boxes only had 20% of the
disk allocated, but their metadata space + global reserve was right at
the size of their metadata chunk.
In this case can_overcommit should be allowing allocations without
problem, but there's logic in can_overcommit that doesn't allow us to
overcommit if there's not enough real space to satisfy the global
reserve.
This is for historical reasons. Before there were only certain places
we could allocate chunks. We could go to commit the transaction and not
have enough space for our pending delayed refs and such and be unable to
allocate a new chunk. This would result in a abort because of ENOSPC.
This code was added to solve this problem.
However since then we've gained the ability to always be able to
allocate a chunk. So we can easily overcommit in these cases without
risking a transaction abort because of ENOSPC.
Also prior to now the global reserve really would be used because that's
the space we relied on for delayed refs. With delayed refs being
tracked separately we no longer have to worry about running out of
delayed refs space while committing. We are much less likely to
exhaust our global reserve space during transaction commit.
Fix the can_overcommit code to simply see if our current usage + what we
want is less than our current free space plus whatever slack space we
have in the disk is. This solves the problem we were seeing in
production and keeps us from flushing as aggressively as we approach our
actual metadata size usage.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have some annoying xfstests tests that will create a very small fs,
fill it up, delete it, and repeat to make sure everything works right.
This trips btrfs up sometimes because we may commit a transaction to
free space, but most of the free metadata space was being reserved by
the global reserve. So we commit and update the global reserve, but the
space is simply added to bytes_may_use directly, instead of trying to
add it to existing tickets. This results in ENOSPC when we really did
have space. Fix this by calling btrfs_try_granting_tickets once we add
back our excess space to wake any pending tickets.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While messing with the overcommit logic I noticed that sometimes we'd
ENOSPC out when really we should have run out of space much earlier. It
turns out it's because we'll only reserve up to the free amount left in
the space info for the global reserve, but that doesn't make sense with
overcommit because we could be well above our actual size. This results
in the global reserve not carving out it's entire reservation, and thus
not putting enough pressure on the rest of the infrastructure to do the
right thing and ENOSPC out at a convenient time. Fix this by always
taking our full reservation amount for the global reserve.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It made sense to have the global reserve set at 16M in the past, but
since it is used less nowadays set the minimum size to the number of
items we'll need to update the main trees we update during a transaction
commit, plus some slop area so we can do unlinks if we need to.
In practice this doesn't affect normal file systems, but for xfstests
where we do things like fill up a fs and then rm * it can fall over in
weird ways. This enables us for more sane behavior at extremely small
file system sizes.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This name doesn't really fit with how the space reservation stuff works
now, rename it to btrfs_space_info_free_bytes_may_use so it's clear what
the function is doing.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we do not do partial filling of tickets simply remove
orig_bytes, it is no longer needed.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we aren't partially filling tickets we may have some slack
space left in the space_info. We need to account for this in
may_commit_transaction, otherwise we may choose to not commit the
transaction despite it actually having enough space to satisfy our
ticket.
Calculate the free space we have in the space_info, if any, and subtract
this from the ticket we have and use that amount to determine if we will
need to commit to reclaim enough space.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we no longer partially fill tickets we need to rework
wake_all_tickets to call btrfs_try_to_wakeup_tickets() in order to see
if any subsequent tickets are able to be satisfied. If our tickets_id
changes we know something happened and we can keep flushing.
Also if we find a ticket that is smaller than the first ticket in our
queue then we want to retry the flushing loop again in case
may_commit_transaction() decides we could satisfy the ticket by
committing the transaction.
Rename this to maybe_fail_all_tickets() while we're at it, to better
reflect what the function is actually doing.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that btrfs_space_info_add_old_bytes simply checks if we can make the
reservation and updates bytes_may_use, there's no reason to have both
helpers in place.
Factor out the ticket wakeup logic into it's own helper, make
btrfs_space_info_add_old_bytes() update bytes_may_use and then call the
wakeup helper, and replace all calls to btrfs_space_info_add_new_bytes()
with the wakeup helper.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_space_info_add_old_bytes is used when adding the extra space from
an existing reservation back into the space_info to be used by any
waiting tickets. In order to keep us from overcommitting we check to
make sure that we can still use this space for our reserve ticket, and
if we cannot we'll simply subtract it from space_info->bytes_may_use.
However this is problematic, because it assumes that only changes to
bytes_may_use would affect our ability to make reservations. Any
changes to bytes_reserved would be missed. If we were unable to make a
reservation prior because of reserved space, but that reserved space was
free'd due to unlink or truncate and we were allowed to immediately
reclaim that metadata space we would still ENOSPC.
Consider the example where we create a file with a bunch of extents,
using up 2MiB of actual space for the new tree blocks. Then we try to
make a reservation of 2MiB but we do not have enough space to make this
reservation. The iput() occurs in another thread and we remove this
space, and since we did not write the blocks we simply do
space_info->bytes_reserved -= 2MiB. We would never see this because we
do not check our space info used, we just try to re-use the freed
reservations.
To fix this problem, and to greatly simplify the wakeup code, do away
with this partial refilling nonsense. Use
btrfs_space_info_add_old_bytes to subtract the reservation from
space_info->bytes_may_use, and then check the ticket against the total
used of the space_info the same way we do with the initial reservation
attempt.
This keeps the reservation logic consistent and solves the problem of
early ENOSPC in the case that we free up space in places other than
bytes_may_use and bytes_pinned. Thanks,
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I noticed when folding the trace_btrfs_space_reservation() tracepoint
into the btrfs_space_info_update_* helpers that we didn't emit a
tracepoint when doing btrfs_add_reserved_bytes(). I know this is
because we were swapping bytes_may_use for bytes_reserved, so in my mind
there was no reason to have the tracepoint there. But now there is
because we always emit the unreserve for the bytes_may_use side, and
this would have broken if compression was on anyway. Add a tracepoint
to cover the bytes_reserved counter so the math still comes out right.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We duplicate this tracepoint everywhere we call these helpers, so update
the helper to have the tracepoint as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we already have tickets on the list we don't want to steal their
reservations. This is a preparation patch for upcoming changes,
technically this shouldn't happen today because of the way we add bytes
to tickets before adding them to the space_info in most cases.
This does not change the FIFO nature of reserve tickets, it simply
allows us to enforce it in a different way. Previously it was enforced
because any new space would be added to the first ticket on the list,
which would result in new reservations getting a reserve ticket. This
replaces that mechanism by simply checking to see if we have outstanding
reserve tickets and skipping straight to adding a ticket for our
reservation.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit fee187d9d9 ("Btrfs: do not set EXTENT_DIRTY along with
EXTENT_DELALLOC"), we never set EXTENT_DIRTY in inode->io_tree, so we
can simplify and stop trying to clear it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The VFS indicates a synchronous write to ->write_iter() via
iocb->ki_flags. The IOCB_{,D}SYNC flags may be set based on the file
(see iocb_flags()) or the RWF_* flags passed to a syscall like
pwritev2() (see kiocb_set_rw_flags()).
However, in btrfs_file_write_iter(), we're checking if a write is
synchronous based only on the file; we use this to decide when to bump
the sync_writers counter and thus do CRCs synchronously. Make sure we do
this for all synchronous writes as determined by the VFS.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add const ]
Signed-off-by: David Sterba <dsterba@suse.com>
generic_write_checks() may modify iov_iter_count(), so we must get the
count after the call, not before. Using the wrong one has a couple of
consequences:
1. We check a longer range in check_can_nocow() for nowait than we're
actually writing.
2. We create extra hole extent maps in btrfs_cont_expand(). As far as I
can tell, this is harmless, but I might be missing something.
These issues are pretty minor, but let's fix it before something more
important trips on it.
Fixes: edf064e7c6 ("btrfs: nowait aio support")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Further simplifaction of the get/set helpers is possible when the token
is uniquely tied to an extent buffer. A condition and an assignment can
be avoided.
The initializations are moved closer to the first use when the extent
buffer is valid. There's one exception in __push_leaf_left where the
token is reused.
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we can safely assume that the token is always a valid pointer,
remove the branches that check that.
Signed-off-by: David Sterba <dsterba@suse.com>
There are helpers for all type widths defined via macro and optionally
can use a token which is a cached pointer to avoid repeated mapping of
the extent buffer.
The token value is known at compile time, when it's valid it's always
address of a local variable, otherwise it's NULL passed by the
token-less helpers.
This can be utilized to remove some branching as the helpers are used
frequenlty.
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_find_name_in_ext_backref returns either 0/1 depending on whether it
found a backref for the given name. If it returns true then the actual
inode_ref struct is returned in one of its parameters. That's pointless,
instead refactor the function such that it returns either a pointer
to the btrfs_inode_extref or NULL it it didn't find anything. This
streamlines the function calling convention.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_find_name_in_backref returns either 0/1 depending on whether it
found a backref for the given name. If it returns true then the actual
inode_ref struct is returned in one of its parameters. That's pointless,
instead refactor the function such that it returns either a pointer
to the btrfs_inode_ref or NULL it it didn't find anything. This
streamlines the function calling convention.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The other dev stats functions are already there and the helpers are not
used by anything else.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The io_ctl structure is used for free space management, and used only by
the v1 space cache code, but unfortunatlly the full definition is
required by block-group.h so it can't be moved to free-space-cache.c
without additional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Send is the only user of tree_compare, we can move it there along with
the other helpers and definitions.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory work for code that will be moved out of ctree and uses this
function.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The file ctree.h serves as a header for everything and has become quite
bloated. Split some helpers that are generic and create a new file that
should be the catch-all for code that's not btrfs-specific.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix the fake ENOMEM return error code to the actual error in
clone_fs_devices().
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In a corrupted tree, if search for next devid finds the device with
devid = -1, then report the error -EUCLEAN back to the parent function
to fail gracefully.
The tree checker will not catch this in case the devids are created
using the following script:
umount /btrfs
dev1=/dev/sdb
dev2=/dev/sdc
mkfs.btrfs -fq -dsingle -msingle $dev1
mount $dev1 /btrfs
_fail()
{
echo $1
exit 1
}
while true; do
btrfs dev add -f $dev2 /btrfs || _fail "add failed"
btrfs dev del $dev1 /btrfs || _fail "del failed"
dev_tmp=$dev1
dev1=$dev2
dev2=$dev_tmp
done
With output:
BTRFS critical (device sdb): corrupt leaf: root=3 block=313739198464 slot=1 devid=1 invalid devid: has=507 expect=[0, 506]
BTRFS error (device sdb): block=313739198464 write time tree block corruption detected
BTRFS: error (device sdb) in btrfs_commit_transaction:2268: errno=-5 IO failure (Error while writing out transaction)
BTRFS warning (device sdb): Skipping commit of aborted transaction.
BTRFS: error (device sdb) in cleanup_transaction:1827: errno=-5 IO failure
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ add script and messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
Various notifications of type "BUG kmalloc-4096 () : Redzone
overwritten" have been observed recently in various parts of the kernel.
After some time, it has been made a relation with the use of BTRFS
filesystem and with SLUB_DEBUG turned on.
[ 22.809700] BUG kmalloc-4096 (Tainted: G W ): Redzone overwritten
[ 22.810286] INFO: 0xbe1a5921-0xfbfc06cd. First byte 0x0 instead of 0xcc
[ 22.810866] INFO: Allocated in __load_free_space_cache+0x588/0x780 [btrfs] age=22 cpu=0 pid=224
[ 22.811193] __slab_alloc.constprop.26+0x44/0x70
[ 22.811345] kmem_cache_alloc_trace+0xf0/0x2ec
[ 22.811588] __load_free_space_cache+0x588/0x780 [btrfs]
[ 22.811848] load_free_space_cache+0xf4/0x1b0 [btrfs]
[ 22.812090] cache_block_group+0x1d0/0x3d0 [btrfs]
[ 22.812321] find_free_extent+0x680/0x12a4 [btrfs]
[ 22.812549] btrfs_reserve_extent+0xec/0x220 [btrfs]
[ 22.812785] btrfs_alloc_tree_block+0x178/0x5f4 [btrfs]
[ 22.813032] __btrfs_cow_block+0x150/0x5d4 [btrfs]
[ 22.813262] btrfs_cow_block+0x194/0x298 [btrfs]
[ 22.813484] commit_cowonly_roots+0x44/0x294 [btrfs]
[ 22.813718] btrfs_commit_transaction+0x63c/0xc0c [btrfs]
[ 22.813973] close_ctree+0xf8/0x2a4 [btrfs]
[ 22.814107] generic_shutdown_super+0x80/0x110
[ 22.814250] kill_anon_super+0x18/0x30
[ 22.814437] btrfs_kill_super+0x18/0x90 [btrfs]
[ 22.814590] INFO: Freed in proc_cgroup_show+0xc0/0x248 age=41 cpu=0 pid=83
[ 22.814841] proc_cgroup_show+0xc0/0x248
[ 22.814967] proc_single_show+0x54/0x98
[ 22.815086] seq_read+0x278/0x45c
[ 22.815190] __vfs_read+0x28/0x17c
[ 22.815289] vfs_read+0xa8/0x14c
[ 22.815381] ksys_read+0x50/0x94
[ 22.815475] ret_from_syscall+0x0/0x38
Commit 69d2480456 ("btrfs: use copy_page for copying pages instead of
memcpy") changed the way bitmap blocks are copied. But allthough bitmaps
have the size of a page, they were allocated with kzalloc().
Most of the time, kzalloc() allocates aligned blocks of memory, so
copy_page() can be used. But when some debug options like SLAB_DEBUG are
activated, kzalloc() may return unaligned pointer.
On powerpc, memcpy(), copy_page() and other copying functions use
'dcbz' instruction which provides an entire zeroed cacheline to avoid
memory read when the intention is to overwrite a full line. Functions
like memcpy() are writen to care about partial cachelines at the start
and end of the destination, but copy_page() assumes it gets pages. As
pages are naturally cache aligned, copy_page() doesn't care about
partial lines. This means that when copy_page() is called with a
misaligned pointer, a few leading bytes are zeroed.
To fix it, allocate bitmaps through kmem_cache instead of using kzalloc()
The cache pool is created with PAGE_SIZE alignment constraint.
Reported-by: Erhard F. <erhard_f@mailbox.org>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204371
Fixes: 69d2480456 ("btrfs: use copy_page for copying pages instead of memcpy")
Cc: stable@vger.kernel.org # 4.19+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_free_space_bitmap ]
Signed-off-by: David Sterba <dsterba@suse.com>
Support for asynchronous snapshot creation was originally added in
72fd032e94 ("Btrfs: add SNAP_CREATE_ASYNC ioctl") to cater for
ceph's backend needs. However, since Ceph has deprecated support for
btrfs there is no longer need for that support in btrfs. Additionally,
this was never supported by btrfs-progs, the official userspace tools.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Correctly handle failure cases when adding an ordered extents in case
of REGULAR or PREALLOC extents. Remove the BUG_ON.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a comment explaining why we keep the BUG also use the already read
and cached value of extent ram bytes stored in 'ram_bytes'.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent range check right after the "out_check" label is redundant,
because the only way it can trigger is if we have an inline extent. In
this case it makes more sense to actually move it in the branch
explictly dealing with inlines extents.
What's more, the nested 'if (nocow)' can never be true because for
inline extents we always do COW and there is no chance 'nocow' can be
true, just remove that check.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no point in checking the type of the extent again just to set
the 'type' variable, when this check has already been performed before.
Instead, extend the original if branch with an 'else' clause. This
allows to remove one local variable and make it obvious how the code
flow differs for prealloc/regular extents.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
run_delalloc_nocow contains numerous, somewhat subtle, checks when
figuring out whether a particular extent should be CoW'ed or not. This
patch explicitly states the assumptions those checks verify. As a
result also document 2 of the more subtle checks in check_committed_ref
as well.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Of the 22 (!!!) local variables declared in this function only 9 have
function-wide context. Of the remaining 13, 12 are needed in the main
while loop of the function and 1 is needed in a tiny if branch, only in
case we have prealloc extent. This commit reduces the lifespan of every
variable to its bare minimum. It also renames the 'nolock' boolean to
freespace_inode to clearly indicate its purpose.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Historically we reserved worst case for every btree operation, and
generally speaking we want to do that in cases where it could be the
worst case. However for updating inodes we know the inode items are
already in the tree, so it will only be an update operation and never an
insert operation. This allows us to always reserve only the
metadata_size amount for inode updates rather than the
insert_metadata_size amount.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_calc_trunc_metadata_size differs from trans_metadata_size in that
it doesn't take into account any splitting at the levels, because
truncate will never split nodes. However truncate _and_ changing will
never split nodes, so rename btrfs_calc_trunc_metadata_size to
btrfs_calc_metadata_size. Also btrfs_calc_trans_metadata_size is purely
for inserting items, so rename this to btrfs_calc_insert_metadata_size.
Making these clearer will help when I start using them differently in
upcoming patches.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
EXTENT_DATA_REF is a little like DIR_ITEM which contains hash in its
key->offset.
This patch will check the following contents:
- Key->objectid
Basic alignment check.
- Hash
Hash of each extent_data_ref item must match key->offset.
- Offset
Basic alignment check.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For TREE_BLOCK_REF, SHARED_DATA_REF and SHARED_BLOCK_REF we need to
check:
| TREE_BLOCK_REF | SHARED_BLOCK_REF | SHARED_BLOCK_REF
--------------+----------------+-----------------+------------------
key->objectid | Alignment | Alignment | Alignment
key->offset | Any value | Alignment | Alignment
item_size | 0 | 0 | sizeof(le32) (*)
*: sizeof(struct btrfs_shared_data_ref)
So introduce a check to check all these 3 key types together.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch introduces the ability to check extent items.
This check involves:
- key->objectid check
Basic alignment check.
- key->type check
Against btrfs_extent_item::type and SKINNY_METADATA feature.
- key->offset alignment check for EXTENT_ITEM
- key->offset check for METADATA_ITEM
- item size check
Both against minimal size and stepping check.
- btrfs_extent_item check
Checks its flags and generation.
- btrfs_extent_inline_ref checks
Against 4 types inline ref.
Checks bytenr alignment and tree level.
- btrfs_extent_item::refs check
Check against total refs found in inline refs.
This check would be the most complex single item check due to its nature
of inlined items.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_get_chunk_map() never returns NULL, it returns error pointers.
Fixes: 89b798ad1b ("btrfs: Use btrfs_get_io_geometry appropriately")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the function btrfs_init_dev_stats() goto out is not needed, because the
alloc has failed. So just return -ENOMEM.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
%found_key is not used, drop it since it hasn't been used since the
beginning in 733f4fbbc1 ("Btrfs: read device stats on mount, write
modified ones during commit").
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is used only for the readahead machinery. It makes no
sense to keep it external to reada.c file. Place it above its sole
caller and make it static. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The set_level callbacks do not do anything special and can be replaced
by a helper that uses the levels defined in the tables.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The maximum and default levels do not change and can be defined
directly. The set_level callback was a temporary solution and will be
removed.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The BTRFS_DEV_REPLACE_ITEM_STATE_x defines, as shown in [1], are
unused in both kernel and btrfs-progs (except for one instance of
BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED in kernel).
[1]
btrfs.h:#define BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED 2
btrfs.h:#define BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED 3
btrfs.h:#define BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED 4
Further these define-values are different form its counterpart
BTRFS_IOCTL_DEV_REPLACE_STATE_x series as shown in [2].
[2]
btrfs_tree.h:#define BTRFS_DEV_REPLACE_ITEM_STATE_SUSPENDED 2
btrfs_tree.h:#define BTRFS_DEV_REPLACE_ITEM_STATE_FINISHED 3
btrfs_tree.h:#define BTRFS_DEV_REPLACE_ITEM_STATE_CANCELED 4
So this patch deletes the BTRFS_DEV_REPLACE_ITEM_STATE_x altogether, and
one instance of BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED is replaced
with BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED in the kernel.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this weird space flushing loop inside inode.c for evict where
we'll do the normal LIMIT flush, and then commit the transaction and
hope we get our space. This is super janky, and in fact there's really
nothing stopping us from using FLUSH_ALL except that we run delayed
iputs, which means we could deadlock. So introduce a new flush state
for eviction that does the normal priority flushing with all of the
states that are safe for eviction.
The nice side-effect of this is that we'll try harder for evictions.
Previously if (for example generic/269) you had a bunch of other
operations happening on the fs you could race with those reservations
when committing the transaction, and eventually miss getting a
reservation for the evict. With this code we'll have our ticket in
place through the transaction commit, so any pinned bytes will go to our
pending evictions first.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the eviction flushing stuff we'll want to allow for different
states, but still work basically the same way that
priority_reclaim_metadata_space works currently. Refactor this to take
the flushing states and size as an argument so we can use the same logic
for limit flushing and eviction flushing.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to make this logic a little more complicated for evict, so
factor the ticket flushing/waiting code out of __reserve_metadata_bytes.
This has no functional change.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we handle the cleanup of errored out tickets in both the
priority flush path and the normal flushing path. This is the same code
in both places, so just refactor so we don't duplicate the cleanup work.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Delayed iputs could very well free up enough space without needing to
commit the transaction, so make this step it's own step. This will
allow us to skip the step for evictions in a later patch.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These were renamed and exported to facilitate logical migration of
different code chunks into block-group.c. Now that all the users are in
one file go ahead and rename them back, move the code around, and make
them static.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This can now be easily migrated as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh on top of sysfs cleanups ]
Signed-off-by: David Sterba <dsterba@suse.com>
These feel more at home in block-group.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh, adjust btrfs_get_alloc_profile exports ]
Signed-off-by: David Sterba <dsterba@suse.com>
This feels more at home in block-group.c than in extent-tree.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>i
[ refresh ]
Signed-off-by: David Sterba <dsterba@suse.com>
We can now easily migrate this code as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Want to move these functions into block-group.c, so export them.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This can be easily migrated over now.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>