xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split
of inline extents in __btrfs_drop_extents().
For inline extents, we cannot duplicate another EXTENT_DATA item, because
it breaks the rule of inline extents, that is, 'start offset' needs to be 0.
We have set limitations for the source inode's compressed inline extents,
because it needs to decompress and recompress. Now the destination inode's
inline extents also need similar limitations.
With this, xfstests btrfs/035 doesn't run into panic.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
I added an optimization for large files where we would stop searching for
backrefs once we had looked at the number of references we currently had for
this extent. This works great most of the time, but for snapshots that point to
this extent and has changes in the original root this assumption falls on it
face. So keep track of any delayed ref mods made and add in the actual ref
count as reported by the extent item and use that to limit how far down an inode
we'll search for extents. Thanks,
Reportedy-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reported-by: Hugo Mills <hugo@carfax.org.uk>
Tested-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Chris Mason <clm@fb.com>
For an incremental send, fix the process of determining whether the directory
inode we're currently processing needs to have its move/rename operation delayed.
We were ignoring the fact that if the inode's new immediate ancestor has a higher
inode number than ours but wasn't renamed/moved, we might still need to delay our
move/rename, because some other ancestor directory higher in the hierarchy might
have an inode number higher than ours *and* was renamed/moved too - in this case
we have to wait for rename/move of that ancestor to happen before our current
directory's rename/move operation.
Simple steps to reproduce this issue:
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/x1/x2
$ mkdir /mnt/a/Z
$ mkdir -p /mnt/a/x1/x2/x3/x4/x5
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mv /mnt/a/x1/x2/x3 /mnt/a/Z/X33
$ mv /mnt/a/x1/x2 /mnt/a/Z/X33/x4/x5/X22
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
The incremental send caused the kernel code to enter an infinite loop when
building the path string for directory Z after its references are processed.
A more complex scenario:
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/b/c/d
$ mkdir /mnt/a/b/c/d/e
$ mkdir /mnt/a/b/c/d/f
$ mv /mnt/a/b/c/d/e /mnt/a/b/c/d/f/E2
$ mkdir /mmt/a/b/c/g
$ mv /mnt/a/b/c/d /mnt/a/b/D2
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mkdir /mnt/a/o
$ mv /mnt/a/b/c/g /mnt/a/b/D2/f/G2
$ mv /mnt/a/b/D2 /mnt/a/b/dd
$ mv /mnt/a/b/c /mnt/a/C2
$ mv /mnt/a/b/dd/f /mnt/a/o/FF
$ mv /mnt/a/b /mnt/a/o/FF/E2/BB
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
It's possible to change the parent/child relationship between directories
in such a way that if a child directory has a higher inode number than
its parent, it doesn't necessarily means the child rename/move operation
can be performed immediately. The parent migth have its own rename/move
operation delayed, therefore in this case the child needs to have its
rename/move operation delayed too, and be performed after its new parent's
rename/move.
Steps to reproduce the issue:
$ umount /mnt
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir /mnt/A
$ mkdir /mnt/B
$ mkdir /mnt/C
$ mv /mnt/C /mnt/A
$ mv /mnt/B /mnt/A/C
$ mkdir /mnt/A/C/D
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mv /mnt/A/C/D /mnt/A/D2
$ mv /mnt/A/C/B /mnt/A/D2/B2
$ mv /mnt/A/C /mnt/A/D2/B2/C2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
The incremental send caused the kernel code to enter an infinite loop when
building the path string for directory C after its references are processed.
The necessary conditions here are that C has an inode number higher than both
A and B, and B as an higher inode number higher than A, and D has the highest
inode number, that is:
inode_number(A) < inode_number(B) < inode_number(C) < inode_number(D)
The same issue could happen if after the first snapshot there's any number
of intermediary parent directories between A2 and B2, and between B2 and C2.
A test case for xfstests follows, covering this simple case and more advanced
ones, with files and hard links created inside the directories.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
No need to search in the send tree for the generation number of the inode,
we already have it in the recorded_ref structure passed to us.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
While we update an existing ref head's extent_op, we're not holding
its spinlock, so while we're updating its extent_op contents (key,
flags) we can have a task running __btrfs_run_delayed_refs() that
holds the ref head's lock and sets its extent_op to NULL right after
the task updating the ref head just checked its extent_op was not NULL.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Since most of the btrfs_workqueue is printed as pointer address,
for easier analysis, add trace for btrfs_workqueue alloc/destroy.
So it is possible to determine the workqueue that a given work belongs
to(by comparing the wq pointer address with alloc trace event).
Signed-off-by: Qu Wenruo <quenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When finding new extents during an autodefrag, don't do so many fs tree
lookups to find an extent with a size smaller then the target treshold.
Instead, after each fs tree forward search immediately unlock upper
levels and process the entire leaf while holding a read lock on the leaf,
since our leaf processing is very fast.
This reduces lock contention, allowing for higher concurrency when other
tasks want to write/update items related to other inodes in the fs tree,
as we're not holding read locks on upper tree levels while processing the
leaf and we do less tree searches.
Test:
sysbench --test=fileio --file-num=512 --file-total-size=16G \
--file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
--file-rw-ratio=3 --file-io-mode=sync --max-time=1800 \
--max-requests=10000000000 [prepare|run]
(fileystem mounted with -o autodefrag, averages of 5 runs)
Before this change: 58.852Mb/sec throughtput, read 77.589Gb, written 25.863Gb
After this change: 63.034Mb/sec throughtput, read 83.102Gb, written 27.701Gb
Test machine: quad core intel i5-3570K, 32Gb of RAM, SSD.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
The error message is confusing:
# btrfs sub delete /mnt/mysub/
Delete subvolume '/mnt/mysub'
ERROR: cannot delete '/mnt/mysub' - Directory not empty
The error message does not make sense to me: It's not about deleting a
directory but it's a subvolume, and it doesn't matter if the subvolume is
empty or not.
Maybe EPERM or is more appropriate in this case, combined with an explanatory
kernel log message. (e.g. "subvolume with ID 123 cannot be deleted because
it is configured as default subvolume.")
Reported-by: Koen De Wit <koen.de.wit@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
When locking file ranges in the inode's io_tree, cache the first
extent state that belongs to the target range, so that when unlocking
the range we don't need to search in the io_tree again, reducing cpu
time and making and therefore holding the io_tree's lock for a shorter
period.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zach found this deadlock that would happen like this
btrfs_end_transaction <- reduce trans->use_count to 0
btrfs_run_delayed_refs
btrfs_cow_block
find_free_extent
btrfs_start_transaction <- increase trans->use_count to 1
allocate chunk
btrfs_end_transaction <- decrease trans->use_count to 0
btrfs_run_delayed_refs
lock tree block we are cowing above ^^
We need to only decrease trans->use_count if it is above 1, otherwise leave it
alone. This will make nested trans be the only ones who decrease their added
ref, and will let us get rid of the trans->use_count++ hack if we have to commit
the transaction. Thanks,
cc: stable@vger.kernel.org
Reported-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Tested-by: Zach Brown <zab@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
We didn't have a lock to protect the access to the delalloc inodes list, that is
we might access a empty delalloc inodes list if someone start flushing delalloc
inodes because the delalloc inodes were moved into a other list temporarily.
Fix it by wrapping the access with a lock.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
When we create a snapshot, we just need wait the ordered extents in
the source fs/file root, but because we use the global mutex to protect
this ordered extents list of the source fs/file root to avoid accessing
a empty list, if someone got the mutex to access the ordered extents list
of the other fs/file root, we had to wait.
This patch splits the above global mutex, now every fs/file root has
its own mutex to protect its own list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
We needn't flush all delalloc inodes when we doesn't get s_umount lock,
or we would make the tasks wait for a long time.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
generic/074 in xfstests failed sometimes because of the enospc error,
the reason of this problem is that we just reclaimed the space we need
from the reserved space for delalloc, and then tried to reserve the space,
but if some task did no-flush reservation between the above reclamation
and reservation,
Task1 Task2
shrink_delalloc()
reclaim 1 block
(The space that can
be reserved now is 1
block)
do no-flush reservation
reserve 1 block
(The space that can
be reserved now is 0
block)
reserving 1 block failed
the reservation of Task1 failed, but in fact, there was enough space to
reserve if we could reclaim more space before.
Fix this problem by the aggressive reclamation of the reserved delalloc
metadata space.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The reason is:
- The per-cpu counter has its own lock to protect itself.
- Here we needn't get a exact value.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
As the comment in the btrfs_direct_IO says, only the compressed pages need be
flush again to make sure they are on the disk, but the common pages needn't,
so we add a if statement to check if the inode has compressed pages or not,
if no, skip the flush.
And in order to prevent the write ranges from intersecting, we need wait for
the running ordered extents. But the current code waits for them twice, one
is done before the direct IO starts (in btrfs_wait_ordered_range()), the other
is before we get the blocks, it is unnecessary. because we can do the direct
IO without holding i_mutex, it means that the intersected ordered extents may
happen during the direct IO, the first wait can not avoid this problem. So we
use filemap_fdatawrite_range() instead of btrfs_wait_ordered_range() to remove
the first wait.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The tasks that wait for the IO_DONE flag just care about the io of the dirty
pages, so it is better to wake up them immediately after all the pages are
written, not the whole process of the io completes.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
btrfs_wait_ordered_roots() moves all the list entries to a new list,
and then deals with them one by one. But if the other task invokes this
function at that time, it would get a empty list. It makes the enospc
error happens more early. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
If the snapshot creation happened after the nocow write but before the dirty
data flush, we would fail to flush the dirty data because of no space.
So we must keep track of when those nocow write operations start and when they
end, if there are nocow writers, the snapshot creators must wait. In order
to implement this function, I introduce btrfs_{start, end}_nocow_write(),
which is similar to mnt_{want,drop}_write().
These two functions are only used for nocow file write operations.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Add ftrace for btrfs_workqueue for further workqueue tunning.
This patch needs to applied after the workqueue replace patchset.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The new btrfs_workqueue still use open-coded function defition,
this patch will change them into btrfs_func_t type which is much the
same as kernel workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Btrfs send reads data from disk and then writes to a stream via pipe or
a file via flush.
Currently we're going to read each page a time, so every page results
in a disk read, which is not friendly to disks, esp. HDD. Given that,
the performance can be gained by adding readahead for those pages.
Here is a quick test:
$ btrfs subvolume create send
$ xfs_io -f -c "pwrite 0 1G" send/foobar
$ btrfs subvolume snap -r send ro
$ time "btrfs send ro -f /dev/null"
w/o w
real 1m37.527s 0m9.097s
user 0m0.122s 0m0.086s
sys 0m53.191s 0m12.857s
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
This has no functional change, only picks out the same part of two functions,
and makes it shared.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
When we're finishing processing of an inode, if we're dealing with a
directory inode that has a pending move/rename operation, we don't
need to send a utimes update instruction to the send stream, as we'll
do it later after doing the move/rename operation. Therefore we save
some time here building paths and doing btree lookups.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
When using prealloc extents, a file defragment operation may actually
fragment the file and increase the amount of data space used by the file.
This change fixes that behaviour.
Example:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt
$ cd /mnt
$ xfs_io -f -c 'falloc 0 1048576' foobar && sync
$ xfs_io -c 'pwrite -S 0xff -b 100000 5000 100000' foobar
$ xfs_io -c 'pwrite -S 0xac -b 100000 200000 100000' foobar
$ xfs_io -c 'pwrite -S 0xe1 -b 100000 900000 100000' foobar && sync
Before defragmenting the file:
$ btrfs filesystem df /mnt
Data, single: total=8.00MiB, used=1.25MiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00
$ btrfs-debug-tree /dev/sdb3
(...)
item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
prealloc data disk byte 12845056 nr 1048576
prealloc data offset 0 nr 4096
item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
extent data disk byte 12845056 nr 1048576
extent data offset 4096 nr 102400 ram 1048576
extent compression 0
item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
prealloc data disk byte 12845056 nr 1048576
prealloc data offset 106496 nr 90112
item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
extent data disk byte 12845056 nr 1048576
extent data offset 196608 nr 106496 ram 1048576
extent compression 0
item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
prealloc data disk byte 12845056 nr 1048576
prealloc data offset 303104 nr 593920
item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
extent data disk byte 12845056 nr 1048576
extent data offset 897024 nr 106496 ram 1048576
extent compression 0
item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
prealloc data disk byte 12845056 nr 1048576
prealloc data offset 1003520 nr 45056
(...)
Now defragmenting the file results in more data space used than before:
$ btrfs filesystem defragment -f foobar && sync
$ btrfs filesystem df /mnt
Data, single: total=8.00MiB, used=1.55MiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00
And the corresponding file extent items are now no longer perfectly sequential
as before, and we're now needlessly using more space from data block groups:
$ btrfs-debug-tree /dev/sdb3
(...)
item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
extent data disk byte 12845056 nr 1048576
extent data offset 0 nr 4096 ram 1048576
extent compression 0
item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
extent data disk byte 13893632 nr 102400
extent data offset 0 nr 102400 ram 102400
extent compression 0
item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
extent data disk byte 12845056 nr 1048576
extent data offset 106496 nr 90112 ram 1048576
extent compression 0
item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
extent data disk byte 13996032 nr 106496
extent data offset 0 nr 106496 ram 106496
extent compression 0
item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
prealloc data disk byte 12845056 nr 1048576
prealloc data offset 303104 nr 593920
item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
extent data disk byte 14102528 nr 106496
extent data offset 0 nr 106496 ram 106496
extent compression 0
item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
extent data disk byte 12845056 nr 1048576
extent data offset 1003520 nr 45056 ram 1048576
extent compression 0
(...)
With this change, the above example will no longer cause allocation of new data
space nor change the sequentiality of the file extents, that is, defragment will
be effectless, leaving all extent items pointing to the extent starting at disk
byte 12845056.
In a 20Gb filesystem I had, mounted with the autodefrag option and 20 files of
400Mb each, initially consisting of a single prealloc extent of 400Mb, having
random writes happening at a low rate, lead to a total of over ~17Gb of data
space used, not far from eventually reaching an ENOSPC state.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
When the defrag flag BTRFS_DEFRAG_RANGE_START_IO is set and compression
enabled, we weren't flushing completely, as writing compressed extents
is a 2 steps process, one to compress the data and another one to write
the compressed data to disk.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Since the "_struct" suffix is mainly used for distinguish the differnt
btrfs_work between the original and the newly created one,
there is no need using the suffix since all btrfs_workers are changed
into btrfs_workqueue.
Also this patch fixed some codes whose code style is changed due to the
too long "_struct" suffix.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Since all the btrfs_worker is replaced with the newly created
btrfs_workqueue, the old codes can be easily remove.
Signed-off-by: Quwenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Replace the fs_info->scrub_* with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Replace the fs_info->qgroup_rescan_worker with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Replace the fs_info->delayed_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Replace the fs_info->fixup_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Replace the fs_info->readahead_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Replace the fs_info->cache_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Replace the fs_info->rmw_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Replace the fs_info->endio_* workqueues with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Replace the fs_info->submit_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Much like the fs_info->workers, replace the fs_info->submit_workers
use the same btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Much like the fs_info->workers, replace the fs_info->delalloc_workers
use the same btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Use the newly created btrfs_workqueue_struct to replace the original
fs_info->workers
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The original btrfs_workers has thresholding functions to dynamically
create or destroy kthreads.
Though there is no such function in kernel workqueue because the worker
is not created manually, we can still use the workqueue_set_max_active
to simulated the behavior, mainly to achieve a better HDD performance by
setting a high threshold on submit_workers.
(Sadly, no resource can be saved)
So in this patch, extra workqueue pending counters are introduced to
dynamically change the max active of each btrfs_workqueue_struct, hoping
to restore the behavior of the original thresholding function.
Also, workqueue_set_max_active use a mutex to protect workqueue_struct,
which is not meant to be called too frequently, so a new interval
mechanism is applied, that will only call workqueue_set_max_active after
a count of work is queued. Hoping to balance both the random and
sequence performance on HDD.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Add high priority function to btrfs_workqueue.
This is implemented by embedding a btrfs_workqueue into a
btrfs_workqueue and use some helper functions to differ the normal
priority wq and high priority wq.
So the high priority wq is completely independent from the normal
workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Use kernel workqueue to implement a new btrfs_workqueue_struct, which
has the ordering execution feature like the btrfs_worker.
The func is executed in a concurrency way, and the
ordred_func/ordered_free is executed in the sequence them are queued
after the corresponding func is done.
The new btrfs_workqueue works much like the original one, one workqueue
for normal work and a list for ordered work.
When a work is queued, ordered work will be added to the list and helper
function will be queued into the workqueue.
The helper function will execute a normal work and then check and execute as many
ordered work as possible in the sequence they were queued.
At this patch, high priority work queue or thresholding is not added yet.
The high priority feature and thresholding will be added in the following patches.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The struct async_sched is not used by any codes and can be removed.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fusionio.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
It is really unnecessary to search tree again for @gen, @mode and @rdev
in the case of REG inodes' creation, as we've got btrfs_inode_item in sctx,
and @gen, @mode and @rdev can easily be fetched.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
We can not release the reserved metadata space for the first write if we
find the write position is pre-allocated. Because the kernel might write
the data on the disk before we do the second write but after the can-nocow
check, if we release the space for the first write, we might fail to update
the metadata because of no space.
Fix this problem by end nocow write if there is dirty data in the range whose
space is pre-allocated.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The write range may not be sector-aligned, for example:
|--------|--------| <- write range, sector-unaligned, size: 2blocks
|--------|--------|--------| <- correct lock range, size: 3blocks
But according to the old code, we used the size of write range to calculate
the lock range directly, not considered the offset, we would get a wrong lock
range:
|--------|--------| <- write range, sector-unaligned, size: 2blocks
|--------|--------| <- wrong lock range, size: 2blocks
And besides that, the old code also had the same problem when calculating
the real write size. Correct them.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
In "btrfs: send: lower memory requirements in common case" the code to
save the old_buf_len was incorrectly moved to a wrong place and broke
the original logic.
Reported-by: Filipe David Manana <fdmanana@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Filipe David Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
While droping extent map structures from the extent cache that cover our
target range, we would remove each extent map structure from the red black
tree and then add either 1 or 2 new extent map structures if the former
extent map covered sections outside our target range.
This change simply attempts to replace the existing extent map structure
with a new one that covers the subsection we're not interested in, instead
of doing a red black remove operation followed by an insertion operation.
The number of elements in an inode's extent map tree can get very high for large
files under random writes. For example, while running the following test:
sysbench --test=fileio --file-num=1 --file-total-size=10G \
--file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
--max-requests=500000 --file-rw-ratio=2 [prepare|run]
I captured the following histogram capturing the number of extent_map items
in the red black tree while that test was running:
Count: 122462
Range: 1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981
Percentiles: 90th: 160120.000; 95th: 166335.000; 99th: 171070.000
1.000 - 5.231: 452 |
5.231 - 187.392: 87 |
187.392 - 585.911: 206 |
585.911 - 1827.438: 623 |
1827.438 - 5695.245: 1962 #
5695.245 - 17744.861: 6204 ####
17744.861 - 55283.764: 21115 ############
55283.764 - 172231.000: 91813 #####################################################
Benchmark:
sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \
--num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \
--file-io-mode=sync --file-fsync-freq=0 [prepare|run]
Before this change: 122.1Mb/sec
After this change: 125.07Mb/sec
(averages of 5 test runs)
Test machine: quad core intel i5-3570K, 32Gb of ram, SSD
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
When we split an extent state there's no need to start the rbtree search
from the root node - we can start it from the original extent state node,
since we would end up in its subtree if we do the search starting at the
root node anyway.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
We don't need to have an unsigned int field in the extent_map struct
to tell us whether the extent map is in the inode's extent_map tree or
not. We can use the rb_node struct field and the RB_CLEAR_NODE and
RB_EMPTY_NODE macros to achieve the same task.
This reduces sizeof(struct extent_map) from 152 bytes to 144 bytes (on a
64 bits system).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
We won't change commit root, skip locking dance with commit root
when walking backrefs, this can speed up btrfs send operations.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
check if @scrubs_running=@scrubs_paused condition inside wait_event()
is not an atomic operation which means we may inc/dec @scrub_running/
paused at any time. Let's wake up @scrub_pause_wait as much as we can
to let commit transaction blocked less.
An example below:
Thread1 Thread2
|->scrub_blocked_if_needed() |->scrub_pending_trans_workers_inc
|->increase @scrub_paused
|->increase @scrub_running
|->wake up scrub_pause_wait list
|->scrub blocked
|->increase @scrub_paused
Thread3 is commiting transaction which is blocked at btrfs_scrub_pause().
So after Thread2 increase @scrub_paused, we meet the condition
@scrub_paused=@scrub_running, but transaction will be still blocked until
another calling to wake up @scrub_pause_wait.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
If we fail to commit transaction, we'd better
cancel scrub operations.
Suggested-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
commit cb7ab02156 cause a following deadlock found by
xfstests,btrfs/011:
Thread1 is commiting transaction which is blocked at
btrfs_scrub_pause().
Thread2 is calling btrfs_file_aio_write() which has held
inode's @i_mutex and commit transaction(blocked because
Thread1 is committing transaction).
Thread3 is copy_nocow_page worker which will also try to
hold inode @i_mutex, so thread3 will wait Thread1 finished.
Thread4 is waiting pending workers finished which will wait
Thread3 finished. So the problem is like this:
Thread1--->Thread4--->Thread3--->Thread2---->Thread1
Deadlock happens! we fix it by letting Thread1 go firstly,
which means we won't block transaction commit while we are
waiting pending workers finished.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
btrfs_scrub_continue() will be called when cleaning up transaction.However,
this can only be called if btrfs_scrub_pause() is called before.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
PTR_RET is deprecated. Use PTR_ERR_OR_ZERO instead. While at it
also include missing err.h header.
Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
When doing an incremental send, if we had a directory pending a move/rename
operation and none of its parents, except for the immediate parent, were
pending a move/rename, after processing the directory's references, we would
be issuing utimes, chown and chmod intructions against am outdated path - a
path which matched the one in the parent root.
This change also simplifies a bit the code that deals with building a path
for a directory which has a move/rename operation delayed.
Steps to reproduce:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d/e
$ mkdir /mnt/btrfs/a/b/c/f
$ chmod 0777 /mnt/btrfs/a/b/c/d/e
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
$ mv /mnt/btrfs/a/b/c/f /mnt/btrfs/a/b/f2
$ mv /mnt/btrfs/a/b/c/d/e /mnt/btrfs/a/b/f2/e2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2
$ mv /mnt/btrfs/a/b/c2/d /mnt/btrfs/a/b/c2/d2
$ chmod 0700 /mnt/btrfs/a/b/f2/e2
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
$ umount /mnt/btrfs
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ btrfs receive /mnt/btrfs -f /tmp/base.send
$ btrfs receive /mnt/btrfs -f /tmp/incremental.send
The second btrfs receive command failed with:
ERROR: chmod a/b/c/d/e failed. No such file or directory
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Just comparing the pointers (logical disk addresses) of the btree nodes is
not completely bullet proof, we have to check if their generation numbers
match too.
It is guaranteed that a COW operation will result in a block with a different
logical disk address than the original block's address, but over time we can
reuse that former logical disk address.
For example, creating a 2Gb filesystem on a loop device, and having a script
running in a loop always updating the access timestamp of a file, resulted in
the same logical disk address being reused for the same fs btree block in about
only 4 minutes.
This could make us skip entire subtrees when doing an incremental send (which
is currently the only user of btrfs_compare_trees). However the odds of getting
2 blocks at the same tree level, with the same logical disk address, equal first
slot keys and different generations, should hopefully be very low.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The incremental send algorithm assumed that it was possible to issue
a directory remove (rmdir) if the the inode number it was currently
processing was greater than (or equal) to any inode that referenced
the directory's inode. This wasn't a valid assumption because any such
inode might be a child directory that is pending a move/rename operation,
because it was moved into a directory that has a higher inode number and
was moved/renamed too - in other words, the case the following commit
addressed:
9f03740a95
(Btrfs: fix infinite path build loops in incremental send)
This made an incremental send issue an rmdir operation before the
target directory was actually empty, which made btrfs receive fail.
Therefore it needs to wait for all pending child directory inodes to
be moved/renamed before sending an rmdir operation.
Simple steps to reproduce this issue:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/x
$ mkdir /mnt/btrfs/a/b/y
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
$ mv /mnt/btrfs/a/b/y /mnt/btrfs/a/b/YY
$ mv /mnt/btrfs/a/b/c/x /mnt/btrfs/a/b/YY
$ rmdir /mnt/btrfs/a/b/c
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
$ umount /mnt/btrfs
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ btrfs receive /mnt/btrfs -f /tmp/base.send
$ btrfs receive /mnt/btrfs -f /tmp/incremental.send
The second btrfs receive command failed with:
ERROR: rmdir o259-6-0 failed. Directory not empty
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
When doing an incremental send, if we delete a directory that has N > 1
hardlinks for the same file and that file has the highest inode number
inside the directory contents, an incremental send would send N times an
rmdir operation against the directory. This made the btrfs receive command
fail on the second rmdir instruction, as the target directory didn't exist
anymore.
Steps to reproduce the issue:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c
$ echo 'ola mundo' > /mnt/btrfs/a/b/c/foo.txt
$ ln /mnt/btrfs/a/b/c/foo.txt /mnt/btrfs/a/b/c/bar.txt
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
$ rm -f /mnt/btrfs/a/b/c/foo.txt
$ rm -f /mnt/btrfs/a/b/c/bar.txt
$ rmdir /mnt/btrfs/a/b/c
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
$ umount /mnt/btrfs
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ btrfs receive /mnt/btrfs -f /tmp/base.send
$ btrfs receive /mnt/btrfs -f /tmp/incremental.send
The second btrfs receive command failed with:
ERROR: rmdir o259-6-0 failed. No such file or directory
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
This fixes yet one more case not caught by the commit titled:
Btrfs: fix infinite path build loops in incremental send
In this case, even before the initial full send, we have a directory
which is a child of a directory with a higher inode number. Then we
perform the initial send, and after we rename both the child and the
parent, without moving them around. After doing these 2 renames, an
incremental send sent a rename instruction for the child directory
which contained an invalid "from" path (referenced the parent's old
name, not the new one), which made the btrfs receive command fail.
Steps to reproduce:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b
$ mkdir /mnt/btrfs/d
$ mkdir /mnt/btrfs/a/b/c
$ mv /mnt/btrfs/d /mnt/btrfs/a/b/c
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/x
$ mv /mnt/btrfs/a/b/x/d /mnt/btrfs/a/b/x/y
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
$ umout /mnt/btrfs
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ btrfs receive /mnt/btrfs -f /tmp/base.send
$ btrfs receive /mnt/btrfs -f /tmp/incremental.send
The second btrfs receive command failed with:
"ERROR: rename a/b/c/d -> a/b/x/y failed. No such file or directory"
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
If we punch beyond the size of an inode, we'll correctly remove any prealloc extents,
but we'll also insert file extent items representing holes (disk bytenr == 0) that start
with a key offset that lies beyond the inode's size and are not contiguous with the last
file extent item.
Example:
$XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo
$XFS_IO_PROG -c "fpunch 582007 864596" $SCRATCH_MNT/foo
$XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo
btrfs-debug-tree output:
item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
inode generation 6 transid 6 size 132254 block group 0 mode 100600 links 1
item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13
inode ref index 2 namelen 3 name: foo
item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
extent data disk byte 0 nr 0 gen 6
extent data offset 0 nr 90112 ram 122880
extent compression 0
item 7 key (257 EXTENT_DATA 90112) itemoff 15766 itemsize 53
extent data disk byte 12845056 nr 4096 gen 6
extent data offset 0 nr 45056 ram 45056
extent compression 2
item 8 key (257 EXTENT_DATA 585728) itemoff 15713 itemsize 53
extent data disk byte 0 nr 0 gen 6
extent data offset 0 nr 860160 ram 860160
extent compression 0
The last extent item, which represents a hole, is useless as it lies beyond the inode's
size.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The argument last wasn't used, all callers supplied a NULL value
for it. Also removed unnecessary intermediate storage of the result
of key comparisons.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
When we didn't find the exact ref head we were looking for, if
return_bigger != 0 we set a new search key to match either the
next node after the last one we found or the first one in the
ref heads rb tree, and then did another full tree search. For both
cases this ended up being pointless as we would end up returning
an entry we already had before repeating the search.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Now that we can adjust the commit interval with a remount, we need
to wake up the transaction thread or else he will continue to sleep
until the previous transaction interval has elapsed before waking
up. So, if we go from a large commit interval to something smaller,
the transaction thread will not wake up until the large interval has
expired. This also causes the cleaner thread to stay sleeping, since
it gets woken up by the transaction thread.
Fix it by simply waking up the transaction thread during a remount.
Signed-off-by: Justin Maggard <jmaggard10@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
If the log sync fails, there is something wrong in the log tree, we
should not continue to join the log transaction and log the metadata.
What we should do is to do a full commit.
This patch fixes this problem by setting ->last_trans_log_full_commit
to the current transaction id, it will tell the tasks not to join
the log transaction, and do a full commit.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
We might commit the log sub-transaction which didn't contain the metadata we
logged. It was because we didn't record the log transid and just select
the current log sub-transaction to commit, but the right one might be
committed by the other task already. Actually, we needn't do anything
and it is safe that we go back directly in this case.
This patch improves the log sync by the above idea. We record the transid
of the log sub-transaction in which we log the metadata, and the transid
of the log sub-transaction we have committed. If the committed transid
is >= the transid we record when logging the metadata, we just go back.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
It is possible that many tasks sync the log tree at the same time, but
only one task can do the sync work, the others will wait for it. But those
wait tasks didn't get the result of the log sync, and returned 0 when they
ended the wait. It caused those tasks skipped the error handle, and the
serious problem was they told the users the file sync succeeded but in
fact they failed.
This patch fixes this problem by introducing a log context structure,
we insert it into the a global list. When the sync fails, we will set
the error number of every log context in the list, then the waiting tasks
get the error number of the log context and handle the error if need.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The log trans id is initialized to be 0 every time we create a log tree,
and the log tree need be re-created after a new transaction is started,
it means the log trans id is unlikely to be a huge number, so we can use
signed integer instead of unsigned long integer to save a bit space.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Mutex unlock implies certain memory barriers to make sure all the memory
operation completes before the unlock, and the next mutex lock implies memory
barriers to make sure the all the memory happens after the lock. So it is
a full memory barrier(smp_mb), we needn't add memory barriers. Remove them.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The old code would start the log transaction even the log tree init
failed, it was unnecessary. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
We may abort the wait earlier if ->last_trans_log_full_commit was set to
the current transaction id, at this case, we need commit the current
transaction instead of the log sub-transaction. But the current code
didn't tell the caller to do it (return 0, not -EAGAIN). Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
->last_trans_log_full_commit may be changed by the other tasks without lock,
so we need prevent the compiler from the optimize access just like
tmp = fs_info->last_trans_log_full_commit
if (tmp == ...)
...
<do something>
if (tmp == ...)
...
In fact, we need get the new value of ->last_trans_log_full_commit during
the second access. Fix it by ACCESS_ONCE().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
So after transaction is aborted, we need to cleanup inode resources by
calling btrfs_invalidate_inodes(), and btrfs_invalidate_inodes() hopes
roots' refs to be zero in old times and sets a WARN_ON(), however, this
is not always true within cleaning up transaction, so we get to detect
transaction abortion and not warn at all.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
This is an extension to my previous commit titled:
"Btrfs: faster file extent item replace operations"
(hash 1acae57b16)
Instead of inserting the new file extent item if we deleted existing
file extent items covering our target file range, also allow to insert
the new file extent item if we didn't find any existing items to delete
and replace_extent != 0, since in this case our caller would do another
tree search to insert the new file extent item anyway, therefore just
combine the two tree searches into a single one, saving cpu time, reducing
lock contention and reducing btree node/leaf COW operations.
This covers the case where applications keep doing tail append writes to
files, which for example is the case of Apache CouchDB (its database and
view index files are always open with O_APPEND).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
In case we do not refill, we can overwrite cur pointer from prio_head
by one from not prioritized head, what looks as something that was
not intended.
This change make we always take works from prio_head first until it's
not empty.
Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
Signed-off-by: Josef Bacik <jbacik@fb.com>
This reverts commit 41ce9970a8.
Previously i was thinking we can use readonly root's commit root
safely while it is not true, readonly root may be cowed with the
following cases.
1.snapshot send root will cow source root.
2.balance,device operations will also cow readonly send root
to relocate.
So i have two ideas to make us safe to use commit root.
-->approach 1:
make it protected by transaction and end transaction properly and we research
next item from root node(see btrfs_search_slot_for_read()).
-->approach 2:
add another counter to local root structure to sync snapshot with send.
and add a global counter to sync send with exclusive device operations.
So with approach 2, send can use commit root safely, because we make sure
send root can not be cowed during send. Unfortunately, it make codes *ugly*
and more complex to maintain.
To make snapshot and send exclusively, device operations and send operation
exclusively with each other is a little confusing for common users.
So why not drop into previous way.
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Since we have introduced btrfs_previous_extent_item() to search previous
extent item, just switch into it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
I got an error on v3.13:
BTRFS error (device sdf1) in write_all_supers:3378: errno=-5 IO failure (errors while submitting device barriers.)
how to reproduce:
> mkfs.btrfs -f -d raid1 /dev/sdf1 /dev/sdf2
> wipefs -a /dev/sdf2
> mount -o degraded /dev/sdf1 /mnt
> btrfs balance start -f -sconvert=single -mconvert=single -dconvert=single /mnt
The reason of the error is that barrier_all_devices() failed to submit
barrier to the missing device. However it is clear that we cannot do
anything on missing device, and also it is not necessary to care chunks
on the missing device.
This patch stops sending/waiting barrier if device is missing.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
When I converted the BUG_ON() for the free_space_cache_inode in cow_file_range I
made it so we just return an error instead of unlocking all of our various
stuff. This is a mistake and causes us to hang when we run into this. This
patch fixes this problem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
While trying to reproduce a delayed ref problem I noticed the box kept falling
over using all 80gb of my ram with btrfs_inode's and btrfs_delayed_node's.
Turns out this is because we only throttle delayed inode updates in
btrfs_dirty_inode, which doesn't actually get called that often, especially when
all you are doing is creating a bunch of files. So balance delayed inode
updates everytime we create a new inode. With this patch we no longer use up
all of our ram with delayed inode updates. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Help during debugging to export various interesting infromation and
tunables without the need of extra mount options or ioctls.
Usage:
* declare your variable in sysfs.h, and include where you need it
* define the variable in sysfs.c and make it visible via
debugfs_create_TYPE
Depends on CONFIG_DEBUG_FS.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The fs_path structure uses an inline buffer and falls back to a chain of
allocations, but vmalloc is not necessary because PATH_MAX fits into
PAGE_SIZE.
The size of fs_path has been reduced to 256 bytes from PAGE_SIZE,
usually 4k. Experimental measurements show that most paths on a single
filesystem do not exceed 200 bytes, and these get stored into the inline
buffer directly, which is now 230 bytes. Longer paths are kmalloced when
needed.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
We have this pattern where we do search for a contiguous group of
items in a tree and everytime we find an item, we process it, then
we release our path, increment the offset of the search key, do
another full tree search and repeat these steps until a tree search
can't find more items we're interested in.
Instead of doing these full tree searches after processing each item,
just process the next item/slot in our leaf and don't release the path.
Since all these trees are read only and we always use the commit root
for a search and skip node/leaf locks, we're not affecting concurrency
on the trees.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
This was a leftover from the commit:
74dd17fbe3
(Btrfs: fix btrfs send for inline items and compression)
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
If cleaning the name cache fails, we could try to proceed at the cost of
some memory leak. This is not expected to happen often.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
There are only 2 static callers, the BUG would normally be never
reached, but let's be nice.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
We know that buf_len is at most PATH_MAX, 4k, and can merge it with the
reversed member. This saves 3 bytes in favor of inline_buf.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
We don't need to keep track of that, it's available via is_vmalloc_addr.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The member is used only to return value back from
fs_path_prepare_for_add, we can do it locally and save 8 bytes for the
inline_buf path.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The buffer passed to snprintf can hold the fully expanded format string,
64 = 3x largest ULL + 3x char + trailing null. I don't think that removing the
check entirely is a good idea, hence the ASSERT.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The commit titled "Btrfs: fix infinite path build loops in incremental send"
didn't cover a particular case where the parent-child relationship inversion
of directories doesn't imply a rename of the new parent directory. This was
due to a simple logic mistake, a logical and instead of a logical or.
Steps to reproduce:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/bar1/bar2/bar3/bar4
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/bar1/bar2/bar3/bar4 /mnt/btrfs/a/b/k44
$ mv /mnt/btrfs/a/b/bar1/bar2/bar3 /mnt/btrfs/a/b/k44
$ mv /mnt/btrfs/a/b/bar1/bar2 /mnt/btrfs/a/b/k44/bar3
$ mv /mnt/btrfs/a/b/bar1 /mnt/btrfs/a/b/k44/bar3/bar2/k11
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
A patch to update the test btrfs/030 from xfstests, so that it covers
this case, will be submitted soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
This fixes a case that the commit titled:
Btrfs: fix infinite path build loops in incremental send
didn't cover. If the parent-child relationship between 2 directories
is inverted, both get renamed, and the former parent has a file that
got renamed too (but remains a child of that directory), the incremental
send operation would use the file's old path after sending an unlink
operation for that old path, causing receive to fail on future operations
like changing owner, permissions or utimes of the corresponding inode.
This is not a regression from the commit mentioned before, as without
that commit we would fall into the issues that commit fixed, so it's
just one case that wasn't covered before.
Simple steps to reproduce this issue are:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ touch /mnt/btrfs/a/b/c/d/file
$ mkdir -p /mnt/btrfs/a/b/x
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/x /mnt/btrfs/a/b/c/x2
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c/x2/d2
$ mv /mnt/btrfs/a/b/c/x2/d2/file /mnt/btrfs/a/b/c/x2/d2/file2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
A patch to update the test btrfs/030 from xfstests, so that it covers
this case, will be submitted soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
find_all_leafs() dosen't need add all roots actually, add roots only
if we need, this can avoid unnecessary ulist dance.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
The structure for BTRFS_SET_RECEIVED_IOCTL packs differently on 32-bit
and 64-bit systems. This means that it is impossible to use btrfs
receive on a system with a 64-bit kernel and 32-bit userspace, because
the structure size (and hence the ioctl number) is different.
This patch adds a compatibility structure and ioctl to deal with the
above case.
Signed-off-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Function wait_for_parent_move() returns negative value if an error
happened, 0 if we don't need to wait for the parent's move, and
1 if the wait is needed.
Before this change an error return value was being treated like the
return value 1, which was not correct.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
During device replace test, we hit a null pointer deference (It was very easy
to reproduce it by running xfstests' btrfs/011 on the devices with the virtio
scsi driver). There were two bugs that caused this problem:
- We might allocate new chunks on the replaced device after we updated
the mapping tree. And we forgot to replace the source device in those
mapping of the new chunks.
- We might get the mapping information which including the source device
before the mapping information update. And then submit the bio which was
based on that mapping information after we freed the source device.
For the first bug, we can fix it by doing mapping tree update and source
device remove in the same context of the chunk mutex. The chunk mutex is
used to protect the allocable device list, the above method can avoid
the new chunk allocation, and after we remove the source device, all
the new chunks will be allocated on the new device. So it can fix
the first bug.
For the second bug, we need make sure all flighting bios are finished and
no new bios are produced during we are removing the source device. To fix
this problem, we introduced a global @bio_counter, we not only inc/dec
@bio_counter outsize of map_blocks, but also inc it before submitting bio
and dec @bio_counter when ending bios.
Since Raid56 is a little different and device replace dosen't support raid56
yet, it is not addressed in the patch and I add comments to make sure we will
fix it in the future.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
the alloc list of the filesystem is protected by ->chunk_mutex, we need
get that mutex when we insert the new device into the list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
EXDEV seems an appropriate error if an operation fails bacause it
crosses file system boundaries.
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp>
Signed-off-by: Josef Bacik <jbacik@fb.com>
There was a problem in the old code:
If we failed to log the csum, we would free all the ordered extents in the log list
including those ordered extents that were logged successfully, it would make the
log committer not to wait for the completion of the ordered extents.
This patch doesn't insert the ordered extents that is about to be logged into
a global list, instead, we insert them into a local list. If we log the ordered
extents successfully, we splice them with the global list, or we will throw them
away, then do full sync. It can also reduce the lock contention and the traverse
time of list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Pull btrfs fixes from Chris Mason:
"We have a small collection of fixes in my for-linus branch.
The big thing that stands out is a revert of a new ioctl. Users
haven't shipped yet in btrfs-progs, and Dave Sterba found a better way
to export the information"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: use right clone root offset for compressed extents
btrfs: fix null pointer deference at btrfs_sysfs_add_one+0x105
Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol
Btrfs: fix max_inline mount option
Btrfs: fix a lockdep warning when cleaning up aborted transaction
Revert "btrfs: add ioctl to export size of global metadata reservation"
For non compressed extents, iterate_extent_inodes() gives us offsets
that take into account the data offset from the file extent items, while
for compressed extents it doesn't. Therefore we have to adjust them before
placing them in a send clone instruction. Not doing this adjustment leads to
the receiving end requesting for a wrong a file range to the clone ioctl,
which results in different file content from the one in the original send
root.
Issue reproducible with the following excerpt from the test I made for
xfstests:
_scratch_mkfs
_scratch_mount "-o compress-force=lzo"
$XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo
$XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo
$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1
$XFS_IO_PROG -c "pwrite -S 0x3e -b 80000 200000 80000" $SCRATCH_MNT/foo
$BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT
$XFS_IO_PROG -c "pwrite -S 0xdc -b 10000 250000 10000" $SCRATCH_MNT/foo
$XFS_IO_PROG -c "pwrite -S 0xff -b 10000 300000 10000" $SCRATCH_MNT/foo
# will be used for incremental send to be able to issue clone operations
$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/clones_snap
$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2
$FSSUM_PROG -A -f -w $tmp/1.fssum $SCRATCH_MNT/mysnap1
$FSSUM_PROG -A -f -w $tmp/2.fssum -x $SCRATCH_MNT/mysnap2/mysnap1 \
-x $SCRATCH_MNT/mysnap2/clones_snap $SCRATCH_MNT/mysnap2
$FSSUM_PROG -A -f -w $tmp/clones.fssum $SCRATCH_MNT/clones_snap \
-x $SCRATCH_MNT/clones_snap/mysnap1 -x $SCRATCH_MNT/clones_snap/mysnap2
$BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap
$BTRFS_UTIL_PROG send $SCRATCH_MNT/clones_snap -f $tmp/clones.snap
$BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 \
-c $SCRATCH_MNT/clones_snap $SCRATCH_MNT/mysnap2 -f $tmp/2.snap
_scratch_unmount
_scratch_mkfs
_scratch_mount
$BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap
$FSSUM_PROG -r $tmp/1.fssum $SCRATCH_MNT/mysnap1 2>> $seqres.full
$BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/clones.snap
$FSSUM_PROG -r $tmp/clones.fssum $SCRATCH_MNT/clones_snap 2>> $seqres.full
$BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap
$FSSUM_PROG -r $tmp/2.fssum $SCRATCH_MNT/mysnap2 2>> $seqres.full
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
A user was running into errors from an NFS export of a subvolume that had a
default subvol set. When we mount a default subvol we will use d_obtain_alias()
to find an existing dentry for the subvolume in the case that the root subvol
has already been mounted, or a dummy one is allocated in the case that the root
subvol has not already been mounted. This allows us to connect the dentry later
on if we wander into the path. However if we don't ever wander into the path we
will keep DCACHE_DISCONNECTED set for a long time, which angers NFS. It doesn't
appear to cause any problems but it is annoying nonetheless, so simply unset
DCACHE_DISCONNECTED in the get_default_root case and switch btrfs_lookup() to
use d_materialise_unique() instead which will make everything play nicely
together and reconnect stuff if we wander into the defaul subvol path from a
different way. With this patch I'm no longer getting the NFS errors when
exporting a volume that has been mounted with a default subvol set. Thanks,
cc: bfields@fieldses.org
cc: ebiederm@xmission.com
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently, the only mount option for max_inline that has any effect is
max_inline=0. Any other value that is supplied to max_inline will be
adjusted to a minimum of 4k. Since max_inline has an effective maximum
of ~3900 bytes due to page size limitations, the current behaviour
only has meaning for max_inline=0.
This patch will allow the the max_inline mount option to accept non-zero
values as indicated in the documentation.
Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Chris Mason <clm@fb.com>
Given now we have 2 spinlock for management of delayed refs,
CONFIG_DEBUG_SPINLOCK=y helped me find this,
[ 4723.413809] BUG: spinlock wrong CPU on CPU#1, btrfs-transacti/2258
[ 4723.414882] lock: 0xffff880048377670, .magic: dead4ead, .owner: btrfs-transacti/2258, .owner_cpu: 2
[ 4723.417146] CPU: 1 PID: 2258 Comm: btrfs-transacti Tainted: G W O 3.12.0+ #4
[ 4723.421321] Call Trace:
[ 4723.421872] [<ffffffff81680fe7>] dump_stack+0x54/0x74
[ 4723.422753] [<ffffffff81681093>] spin_dump+0x8c/0x91
[ 4723.424979] [<ffffffff816810b9>] spin_bug+0x21/0x26
[ 4723.425846] [<ffffffff81323956>] do_raw_spin_unlock+0x66/0x90
[ 4723.434424] [<ffffffff81689bf7>] _raw_spin_unlock+0x27/0x40
[ 4723.438747] [<ffffffffa015da9e>] btrfs_cleanup_one_transaction+0x35e/0x710 [btrfs]
[ 4723.443321] [<ffffffffa015df54>] btrfs_cleanup_transaction+0x104/0x570 [btrfs]
[ 4723.444692] [<ffffffff810c1b5d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[ 4723.450336] [<ffffffff810c1c2d>] ? trace_hardirqs_on+0xd/0x10
[ 4723.451332] [<ffffffffa015e5ee>] transaction_kthread+0x22e/0x270 [btrfs]
[ 4723.452543] [<ffffffffa015e3c0>] ? btrfs_cleanup_transaction+0x570/0x570 [btrfs]
[ 4723.457833] [<ffffffff81079efa>] kthread+0xea/0xf0
[ 4723.458990] [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140
[ 4723.460133] [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0
[ 4723.460865] [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140
[ 4723.496521] ------------[ cut here ]------------
----------------------------------------------------------------------
The reason is that we get to call cond_resched_lock(&head_ref->lock) while
still holding @delayed_refs->lock.
So it's different with __btrfs_run_delayed_refs(), where we do drop-acquire
dance before and after actually processing delayed refs.
Here we don't drop the lock, others are not able to add new delayed refs to
head_ref, so cond_resched_lock(&head_ref->lock) is not necessary here.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
This reverts commit 01e219e806.
David Sterba found a different way to provide these features without adding a new
ioctl. We haven't released any progs with this ioctl yet, so I'm taking this out
for now until we finalize things.
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
CC: Jeff Mahoney <jeffm@suse.com>
Pull btrfs fixes from Chris Mason:
"This is a small collection of fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix data corruption when reading/updating compressed extents
Btrfs: don't loop forever if we can't run because of the tree mod log
btrfs: reserve no transaction units in btrfs_ioctl_set_features
btrfs: commit transaction after setting label and features
Btrfs: fix assert screwup for the pending move stuff
When using a mix of compressed file extents and prealloc extents, it
is possible to fill a page of a file with random, garbage data from
some unrelated previous use of the page, instead of a sequence of zeroes.
A simple sequence of steps to get into such case, taken from the test
case I made for xfstests, is:
_scratch_mkfs
_scratch_mount "-o compress-force=lzo"
$XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" $SCRATCH_MNT/foobar
$XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar
$XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
This results in the following file items in the fs tree:
item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
inode generation 6 transid 6 size 542872 block group 0 mode 100600
item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16
inode ref index 2 namelen 6 name: foobar
item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
extent data disk byte 0 nr 0 gen 6
extent data offset 0 nr 24576 ram 266240
extent compression 0
item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53
prealloc data disk byte 12849152 nr 241664 gen 6
prealloc data offset 0 nr 241664
item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53
extent data disk byte 12845056 nr 4096 gen 6
extent data offset 0 nr 20480 ram 20480
extent compression 2
item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53
prealloc data disk byte 13090816 nr 405504 gen 6
prealloc data offset 0 nr 258048
The on disk extent at offset 266240 (which corresponds to 1 single disk block),
contains 5 compressed chunks of file data. Each of the first 4 compress 4096
bytes of file data, while the last one only compresses 3024 bytes of file data.
Therefore a read into the file region [285648 ; 286720[ (length = 4096 - 3024 =
1072 bytes) should always return zeroes (our next extent is a prealloc one).
The solution here is the compression code path to zero the remaining (untouched)
bytes of the last page it uncompressed data into, as the information about how
much space the file data consumes in the last page is not known in the upper layer
fs/btrfs/extent_io.c:__do_readpage(). In __do_readpage we were correctly zeroing
the remainder of the page but only if it corresponds to the last page of the inode
and if the inode's size is not a multiple of the page size.
This would cause not only returning random data on reads, but also permanently
storing random data when updating parts of the region that should be zeroed.
For the example above, it means updating a single byte in the region [285648 ; 286720[
would store that byte correctly but also store random data on disk.
A test case for xfstests follows soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
A user reported a 100% cpu hang with my new delayed ref code. Turns out I
forgot to increase the count check when we can't run a delayed ref because of
the tree mod log. If we can't run any delayed refs during this there is no
point in continuing to look, and we need to break out. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Added in patch "btrfs: add ioctls to query/change feature bits online"
modifications to superblock don't need to reserve metadata blocks when
starting a transaction.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The set_fslabel ioctl uses btrfs_end_transaction, which means it's
possible that the change will be lost if the system crashes, same for
the newly set features. Let's use btrfs_commit_transaction instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Wang noticed that he was failing btrfs/030 even though me and Filipe couldn't
reproduce. Turns out this is because Wang didn't have CONFIG_BTRFS_ASSERT set,
which meant that a key part of Filipe's original patch was not being built in.
This appears to be a mess up with merging Filipe's patch as it does not exist in
his original patch. Fix this by changing how we make sure del_waiting_dir_move
asserts that it did not error and take the function out of the ifdef check.
This makes btrfs/030 pass with the assert on or off. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fixes from Chris Mason:
"Filipe is fixing compile and boot problems with our crc32c rework, and
Josef has disabled snapshot aware defrag for now.
As the number of snapshots increases, we're hitting OOM. For the
short term we're disabling things until a bigger fix is ready"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: use late_initcall instead of module_init
Btrfs: use btrfs_crc32c everywhere instead of libcrc32c
Btrfs: disable snapshot aware defrag for now
It seems that when init_btrfs_fs() is called, crc32c/crc32c-intel might
not always be already initialized, which results in the call to crypto_alloc_shash()
returning -ENOENT, as experienced by Ahmet who reported this.
Therefore make sure init_btrfs_fs() is called after crc32c is initialized (which
is at initialization level 6, module_init), by using late_initcall (which is at
initialization level 7) instead of module_init for btrfs.
Reported-and-Tested-by: Ahmet Inan <ainan@mathematik.uni-freiburg.de>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
After the commit titled "Btrfs: fix btrfs boot when compiled as built-in",
LIBCRC32C requirement was removed from btrfs' Kconfig. This made it not
possible to build a kernel with btrfs enabled (either as module or built-in)
if libcrc32c is not enabled as well. So just replace all uses of libcrc32c
with the equivalent function in btrfs hash.h - btrfs_crc32c.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
It's just broken and it's taking a lot of effort to fix it, so for now just
disable it so people can defrag in peace. Thanks,
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs updates from Chris Mason:
"This is a pretty big pull, and most of these changes have been
floating in btrfs-next for a long time. Filipe's properties work is a
cool building block for inheriting attributes like compression down on
a per inode basis.
Jeff Mahoney kicked in code to export filesystem info into sysfs.
Otherwise, lots of performance improvements, cleanups and bug fixes.
Looks like there are still a few other small pending incrementals, but
I wanted to get the bulk of this in first"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (149 commits)
Btrfs: fix spin_unlock in check_ref_cleanup
Btrfs: setup inode location during btrfs_init_inode_locked
Btrfs: don't use ram_bytes for uncompressed inline items
Btrfs: fix btrfs_search_slot_for_read backwards iteration
Btrfs: do not export ulist functions
Btrfs: rework ulist with list+rb_tree
Btrfs: fix memory leaks on walking backrefs failure
Btrfs: fix send file hole detection leading to data corruption
Btrfs: add a reschedule point in btrfs_find_all_roots()
Btrfs: make send's file extent item search more efficient
Btrfs: fix to catch all errors when resolving indirect ref
Btrfs: fix protection between walking backrefs and root deletion
btrfs: fix warning while merging two adjacent extents
Btrfs: fix infinite path build loops in incremental send
btrfs: undo sysfs when open_ctree() fails
Btrfs: fix snprintf usage by send's gen_unique_name
btrfs: fix defrag 32-bit integer overflow
btrfs: sysfs: list the NO_HOLES feature
btrfs: sysfs: don't show reserved incompat feature
btrfs: call permission checks earlier in ioctls and return EPERM
...
Pull core block IO changes from Jens Axboe:
"The major piece in here is the immutable bio_ve series from Kent, the
rest is fairly minor. It was supposed to go in last round, but
various issues pushed it to this release instead. The pull request
contains:
- Various smaller blk-mq fixes from different folks. Nothing major
here, just minor fixes and cleanups.
- Fix for a memory leak in the error path in the block ioctl code
from Christian Engelmayer.
- Header export fix from CaiZhiyong.
- Finally the immutable biovec changes from Kent Overstreet. This
enables some nice future work on making arbitrarily sized bios
possible, and splitting more efficient. Related fixes to immutable
bio_vecs:
- dm-cache immutable fixup from Mike Snitzer.
- btrfs immutable fixup from Muthu Kumar.
- bio-integrity fix from Nic Bellinger, which is also going to stable"
* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
xtensa: fixup simdisk driver to work with immutable bio_vecs
block/blk-mq-cpu.c: use hotcpu_notifier()
blk-mq: for_each_* macro correctness
block: Fix memory leak in rw_copy_check_uvector() handling
bio-integrity: Fix bio_integrity_verify segment start bug
block: remove unrelated header files and export symbol
blk-mq: uses page->list incorrectly
blk-mq: use __smp_call_function_single directly
btrfs: fix missing increment of bi_remaining
Revert "block: Warn and free bio if bi_end_io is not set"
block: Warn and free bio if bi_end_io is not set
blk-mq: fix initializing request's start time
block: blk-mq: don't export blk_mq_free_queue()
block: blk-mq: make blk_sync_queue support mq
block: blk-mq: support draining mq queue
dm cache: increment bi_remaining when bi_end_io is restored
block: fixup for generic bio chaining
block: Really silence spurious compiler warnings
block: Silence spurious compiler warnings
block: Kill bio_pair_split()
...
We have a race during inode init because the BTRFS_I(inode)->location is setup
after the inode hash table lock is dropped. btrfs_find_actor uses the location
field, so our search might not find an existing inode in the hash table if we
race with the inode init code.
This commit changes things to setup the location field sooner. Also the find actor now
uses only the location objectid to match inodes. For inode hashing, we just
need a unique and stable test, it doesn't have to reflect the inode numbers we
show to userland.
Signed-off-by: Chris Mason <clm@fb.com>
CC: stable@vger.kernel.org
If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect
the new size. The fixe uses the size directly from the item header when
reading uncompressed inlines, and also fixes truncate to update the
size as it goes.
Reported-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
CC: stable@vger.kernel.org
If the current path's leaf slot is 0, we do search for the previous
leaf (via btrfs_prev_leaf) and set the new path's leaf slot to a
value corresponding to the number of items - 1 of the former leaf.
Fix this by using the slot set by btrfs_prev_leaf, decrementing it
by 1 if it's equal to the leaf's number of items.
Use of btrfs_search_slot_for_read() for backward iteration is used in
particular by the send feature, which could miss items when the input
leaf has less items than its previous leaf.
This could be reproduced by running btrfs/007 from xfstests in a loop.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
There are not any users that use ulist except Btrfs,don't
export them.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We are really suffering from now ulist's implementation, some developers
gave their try, and i just gave some of my ideas for things:
1. use list+rb_tree instead of arrary+rb_tree
2. add cur_list to iterator rather than ulist structure.
3. add seqnum into every node when they are added, this is
used to do selfcheck when iterating node.
I noticed Zach Brown's comments before, long term is to kick off
ulist implementation, however, for now, we need at least avoid
arrary from ulist.
Cc: Liu Bo <bo.li.liu@oracle.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Zach Brown <zab@redhat.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When walking backrefs, we may iterate every inode's extent
and add/merge them into ulist, and the caller will free memory
from ulist.
However, if we fail to allocate inode's extents element
memory or ulist_add() fail to allocate memory, we won't
add allocated memory into ulist, and the caller won't
free some allocated memory thus memory leaks happen.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
There was a case where file hole detection was incorrect and it would
cause an incremental send to override a section of a file with zeroes.
This happened in the case where between the last leaf we processed which
contained a file extent item for our current inode and the leaf we're
currently are at (and has a file extent item for our current inode) there
are only leafs containing exclusively file extent items for our current
inode, and none of them was updated since the previous send operation.
The file hole detection code would incorrectly consider the file range
covered by these leafs as a hole.
A test case for xfstests follows soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Instead of looking for a file extent item, process it, release the path
and do a btree search for the next file extent item, just process all
file extent items in a leaf without intermediate btree searches. This way
we save cpu and we're not blocking other tasks or affecting concurrency on
the btree, because send's paths use the commit root and skip btree node/leaf
locking.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We can only tolerate ENOENT here, for other errors, we should
return directly.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
There is a race condition between resolving indirect ref and root deletion,
and we should gurantee that root can not be destroyed to avoid accessing
broken tree here.
Here we fix it by holding @subvol_srcu, and we will release it as soon
as we have held root node lock.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we have two adjacent extents in relink_extent_backref,
we try to merge them. When we use btrfs_search_slot to locate the
slot for the current extent, we shouldn't set "ins_len = 1",
because we will merge it into the previous extent rather than
insert a new item. Otherwise, we may happen to create a new leaf
in btrfs_search_slot and path->slot[0] will be 0. Then we try to
fetch the previous item using "path->slots[0]--", and it will cause
a warning as follows:
[ 145.713385] WARNING: CPU: 3 PID: 1796 at fs/btrfs/extent_io.c:5043 map_private_extent_buffer+0xd4/0xe0
[ 145.713387] btrfs bad mapping eb start 5337088 len 4096, wanted 167772306 8
...
[ 145.713462] [<ffffffffa034b1f4>] map_private_extent_buffer+0xd4/0xe0
[ 145.713476] [<ffffffffa030097a>] ? btrfs_free_path+0x2a/0x40
[ 145.713485] [<ffffffffa0340864>] btrfs_get_token_64+0x64/0xf0
[ 145.713498] [<ffffffffa033472c>] relink_extent_backref+0x41c/0x820
[ 145.713508] [<ffffffffa0334d69>] btrfs_finish_ordered_io+0x239/0xa80
I encounter this warning when running defrag having mkfs.btrfs
with option -M. At the same time there are read/writes & snapshots
running at background.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0()
Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid
CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007
[ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0
[ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0
[ 5381.660457] Call Trace:
[ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68
[ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0
[ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20
[ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0
[ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440
[ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60
[ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0
[ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50
[ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50
[ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100
[ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230
[ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe
[ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0
[ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs]
[ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs]
[ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs]
[ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs]
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d
$ mkdir /mnt/btrfs/a/b/c2
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc
$ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
The structure of the filesystem when the first snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c (ino 259)
| |-- d (ino 260)
|
|-- c2 (ino 261)
And its structure when the second snapshot is taken is:
. (ino 256)
|-- a (ino 257)
|-- b (ino 258)
|-- c2 (ino 261)
|-- d2 (ino 260)
|-- cc (ino 259)
Before the move/rename operation is performed for the inode 259, the
move/rename for inode 260 must be performed, since 259 is now a child
of 260.
A test case for xfstests, with a more complex scenario, will follow soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The buffer size argument passed to snprintf must account for the
trailing null byte added by snprintf, and it returns a value >= then
sizeof(buffer) when the string can't fit in the buffer.
Since our buffer has a size of 64 characters, and the maximum orphan
name we can generate is 63 characters wide, we must pass 64 as the
buffer size to snprintf, and not 63.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When defragging a very large file, the cluster variable can wrap its 32-bit
signed int type and become negative, which eventually gets passed to
btrfs_force_ra() as a very large unsigned long value. On 32-bit platforms,
this eventually results in an Oops from the SLAB allocator.
Change the cluster and max_cluster signed int variables to unsigned long to
match the readahead functions. This also allows the min() comparison in
btrfs_defrag_file() to work as intended.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The COMPRESS_LZOv2 incompat featue is currently not implemented, the bit
is only reserved, no point to list it in sysfs.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The owner and capability checks in IOC_SUBVOL_SETFLAGS and
SET_RECEIVED_SUBVOL should be called before any other checks are done.
Also unify the error code to EPERM.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently, any user can snapshot any subvolume if the path is accessible and
thus indirectly create and keep files he does not own under his direcotries.
This is not possible with traditional directories.
In security context, a user can snapshot root filesystem and pin any
potentially buggy binaries, even if the updates are applied.
All the snapshots are visible to the administrator, so it's possible to
verify if there are suspicious snapshots.
Another more practical problem is that any user can pin the space used
by eg. root and cause ENOSPC.
Original report:
https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/484786
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We allocate the free space from the former block group, not the current
one, so should use the former one to output the trace information.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
used_block_group is just used for the space cluster which doesn't
belong to the current block group, the other place needn't use it.
Or the logic of code seems unclear.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
It is better that the position of the lock is close to the data which is
protected by it, because they may be in the same cache line, we will load
less cache lines when we access them. So we rearrange the members' position
of btrfs_space_info structure to make the lock be closer to the its data.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
To search tree root without transaction protection, we should neither search commit
root nor skip locking here, fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The performance of fsync dropped down suddenly sometimes, the main reason
of this problem was that we might only flush part dirty pages in a ordered
extent, then got that ordered extent, wait for the csum calcucation. But if
no task flushed the left part, we would wait until the flusher flushed them,
sometimes we need wait for several seconds, it made the performance drop
down suddenly. (On my box, it drop down from 56MB/s to 4-10MB/s)
This patch improves the above problem by flushing left dirty pages aggressively.
Test Environment:
CPU: 2CPU * 2Cores
Memory: 4GB
Partition: 20GB(HDD)
Test Command:
# sysbench --num-threads=8 --test=fileio --file-num=1 \
> --file-total-size=8G --file-block-size=32768 \
> --file-io-mode=sync --file-fsync-freq=100 \
> --file-fsync-end=no --max-requests=10000 \
> --file-test-mode=rndwr run
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Steps to reproduce:
# mkfs.btrfs -f /dev/sda8
# mount /dev/sda8 /mnt -o flushoncommit
# dd if=/dev/zero of=/mnt/data bs=4k count=102400 &
# mount /dev/sda8 /mnt -o remount, ro
When remounting RW to RO, the logic is to firstly set flag
to RO and then commit transaction, however with option
flushoncommit enabled,we will do RO check within committing
transaction, so we get a transaction abortion here.
Actually,here check is wrong, we should check if FS_STATE_ERROR
is set, fix it.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Suggested-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we are looking for file extent items that intersect the cloning
range, for each one that falls completely outside the range, don't
release the path and do another full tree search - just move on
to the next slot and copy the file extent item into our buffer only
if the item intersects the cloning range.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When transaction is aborted, we fail to commit transaction, instead we do
cleanup work. After that when we umount btrfs, we get to free fs roots' log
trees respectively, but that happens after we unpin extents, so those extents
pinned by freeing log trees will remain in memory and lead to the leak.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Since remount will pending the new mount options to the original mount
options, which will make btrfs_parse_options check the old options then
new options, causing some stupid output like "enabling XXX" following by
"disable XXX".
This patch will add extra check before every btrfs_info to skip the
output from old options checking.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add noinode_cache mount option for btrfs.
Since inode map cache involves all the btrfs_find_free_ino/return_ino
things and if just trigger the mount_opt,
an inode number get from inode map cache will not returned to inode map
cache.
To keep the find and return inode both in the same behavior,
a new bit in mount_opt, CHANGE_INODE_CACHE, is introduced for this idea.
CHANGE_INODE_CACHE is set/cleared in remounting, and the original
INODE_MAP_CACHE is set/cleared according to CHANGE_INODE_CACHE after a
success transaction.
Since find/return inode is all done between btrfs_start_transaction and
btrfs_commit_transaction, this will keep consistent behavior.
Also noinode_cache mount option will not stop the caching_kthread.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
There is a bug that using btrfs_previous_item() to search metadata extent item.
This is because in btrfs_previous_item(), we need type match, however, since
skinny metada was introduced by josef, we may mix this two types. So just
use btrfs_previous_item() is not working right.
To keep btrfs_previous_item() like normal tree search, i introduce another
function btrfs_previous_extent_item().
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Check if we support skinny metadata firstly and fix to use
right type to search.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
It is possible for the send feature to send clone operations that
request a cloning range (offset + length) that is not aligned with
the block size. This makes the btrfs receive command send issue a
clone ioctl call that will fail, as the ioctl will return an -EINVAL
error because of the unaligned range.
Fix this by not sending clone operations for non block aligned ranges,
and instead send regular write operation for these (less common) cases.
The following xfstest reproduces this issue, which fails on the second
btrfs receive command without this change:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=`mktemp -d`
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $tmp
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
rm -f $seqres.full
_scratch_mkfs >/dev/null 2>&1
_scratch_mount
$XFS_IO_PROG -f -c "truncate 819200" $SCRATCH_MNT/foo | _filter_xfs_io
$BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
$XFS_IO_PROG -c "falloc -k 819200 667648" $SCRATCH_MNT/foo | _filter_xfs_io
$BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
$XFS_IO_PROG -f -c "pwrite 1482752 2978" $SCRATCH_MNT/foo | _filter_xfs_io
$BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
$BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1 | \
_filter_scratch
$XFS_IO_PROG -f -c "truncate 883305" $SCRATCH_MNT/foo | _filter_xfs_io
$BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
$BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2 | \
_filter_scratch
$BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap 2>&1 | _filter_scratch
$BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \
-f $tmp/2.snap 2>&1 | _filter_scratch
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/mysnap1/foo | _filter_scratch
md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch
_scratch_unmount
_check_btrfs_filesystem $SCRATCH_DEV
_scratch_mkfs >/dev/null 2>&1
_scratch_mount
$BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap
md5sum $SCRATCH_MNT/mysnap1/foo | _filter_scratch
$BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap
md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch
_scratch_unmount
_check_btrfs_filesystem $SCRATCH_DEV
status=0
exit
The tests expected output is:
QA output created by 025
FSSync 'SCRATCH_MNT'
FSSync 'SCRATCH_MNT'
wrote 2978/2978 bytes at offset 1482752
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
FSSync 'SCRATCH_MNT'
Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap1'
FSSync 'SCRATCH_MNT'
Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap2'
At subvol SCRATCH_MNT/mysnap1
At subvol SCRATCH_MNT/mysnap2
129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/foo
42b6369eae2a8725c1aacc0440e597aa SCRATCH_MNT/mysnap1/foo
129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/mysnap2/foo
At subvol mysnap1
42b6369eae2a8725c1aacc0440e597aa SCRATCH_MNT/mysnap1/foo
At snapshot mysnap2
129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/mysnap2/foo
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
After the change titled "Btrfs: add support for inode properties", if
btrfs was built-in the kernel (i.e. not as a module), it would cause a
kernel panic, as reported recently by Fengguang:
[ 2.024722] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 2.027814] IP: [<ffffffff81501594>] crc32c+0xc/0x6b
[ 2.028684] PGD 0
[ 2.028684] Oops: 0000 [#1] SMP
[ 2.028684] Modules linked in:
[ 2.028684] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc7-04795-ga7b57c2 #1
[ 2.028684] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 2.028684] task: ffff88000edba100 ti: ffff88000edd6000 task.ti: ffff88000edd6000
[ 2.028684] RIP: 0010:[<ffffffff81501594>] [<ffffffff81501594>] crc32c+0xc/0x6b
[ 2.028684] RSP: 0000:ffff88000edd7e58 EFLAGS: 00010246
[ 2.028684] RAX: 0000000000000000 RBX: ffffffff82295550 RCX: 0000000000000000
[ 2.028684] RDX: 0000000000000011 RSI: ffffffff81efe393 RDI: 00000000fffffffe
[ 2.028684] RBP: ffff88000edd7e60 R08: 0000000000000003 R09: 0000000000015d20
[ 2.028684] R10: ffffffff81ef225e R11: ffffffff811b0222 R12: ffffffffffffffff
[ 2.028684] R13: 0000000000000239 R14: 0000000000000000 R15: 0000000000000000
[ 2.028684] FS: 0000000000000000(0000) GS:ffff88000fa00000(0000) knlGS:0000000000000000
[ 2.028684] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2.028684] CR2: 0000000000000000 CR3: 000000000220c000 CR4: 00000000000006f0
[ 2.028684] Stack:
[ 2.028684] ffffffff82295550 ffff88000edd7e80 ffffffff8238af62 ffffffff8238ac05
[ 2.028684] 0000000000000000 ffff88000edd7e98 ffffffff8238ac0f ffffffff8238ac05
[ 2.028684] ffff88000edd7f08 ffffffff810002ba ffff88000edd7f00 ffffffff810e2404
[ 2.028684] Call Trace:
[ 2.028684] [<ffffffff8238af62>] btrfs_props_init+0x4f/0x96
[ 2.028684] [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
[ 2.028684] [<ffffffff8238ac0f>] init_btrfs_fs+0xa/0xf0
[ 2.028684] [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
[ 2.028684] [<ffffffff810002ba>] do_one_initcall+0xa4/0x13a
[ 2.028684] [<ffffffff810e2404>] ? parse_args+0x25f/0x33d
[ 2.028684] [<ffffffff8234cf75>] kernel_init_freeable+0x1aa/0x230
[ 2.028684] [<ffffffff8234c785>] ? do_early_param+0x88/0x88
[ 2.028684] [<ffffffff819f61b5>] ? rest_init+0x89/0x89
[ 2.028684] [<ffffffff819f61c3>] kernel_init+0xe/0x109
The issue here is that the initialization function of btrfs (super.c:init_btrfs_fs)
started using crc32c (from lib/libcrc32c.c). But when it needs to call crc32c (as
part of the properties initialization routine), the libcrc32c is not yet initialized,
so crc32c derreferenced a NULL pointer (lib/libcrc32c.c:tfm), causing the kernel
panic on boot.
The approach to fix this is to use crypto component directly to use its crc32c (which
is basically what lib/libcrc32c.c is, a wrapper around crypto). This is what ext4 is
doing as well, it uses crypto directly to get crc32c functionality.
Verified this works both when btrfs is built-in and when it's loadable kernel module.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
In the clone ioctl, when the source and target inodes are different,
we can acquire their mutexes in 2 possible different orders. After
we're done cloning, we were releasing the mutexes always in the same
order - the most correct way of doing it is to release them by the
reverse order they were acquired.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Here we are not going to free memory, no need to remove every node
one by one, just init root node here is ok.
Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We don't have to keep subvolume's block_rsv during transaction commit,
and within transaction commit, we may also need the free space reclaimed
from this block_rsv to process delayed refs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we ran the 274th case of xfstests with nodatacow mount option,
We met the following warning message:
WARNING: CPU: 1 PID: 14185 at fs/btrfs/extent-tree.c:3734 btrfs_free_reserved_data_space+0xa6/0xd0
It is caused by the race between the write back and nocow buffered
write:
Task1 Task2
__btrfs_buffered_write()
skip data reservation
reserve the metadata space
copy the data
dirty the pages
unlock the pages
write back the pages
release the data space
becasue there is no
noreserve flag
set the noreserve flag
This patch fixes this problem by unlocking the pages after
the noreserve flag is set.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The backref walking code will search down to the key it is looking for and then
proceed to walk _all_ of the extents on the file until it hits the end. This is
suboptimal with large files, we only need to look for as many extents as we have
references for that inode. I have a testcase that creates a randomly written 4
gig file and before this patch it took 6min 30sec to do the initial send, with
this patch it takes 2min 30sec to do the intial send. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Could have sworn I fixed this before but apparently not. This makes us pass
btrfs/022 with skinny metadata enabled. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
I don't think this is an issue and I've not seen it in practice but
extent_from_logical will fail to find a skinny extent because it uses
btrfs_previous_item and gives it the normal extent item type. This is just not
a place to use btrfs_previous_item since we care about either normal extents or
skinny extents, so open code btrfs_previous_item to properly check. This would
only affect metadata and the only place this is used for metadata is scrub and
I'm pretty sure it's just for printing stuff out, not actually doing any work so
hopefully it was never a problem other than a cosmetic one. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
On one of our gluster clusters we noticed some pretty big lag spikes. This
turned out to be because our transaction commit was taking like 3 minutes to
complete. This is because we have like 30 gigs of metadata, so our global
reserve would end up being the max which is like 512 mb. So our throttling code
would allow a ridiculous amount of delayed refs to build up and then they'd all
get run at transaction commit time, and for a cold mounted file system that
could take up to 3 minutes to run. So fix the throttling to be based on both
the size of the global reserve and how long it takes us to run delayed refs.
This patch tracks the time it takes to run delayed refs and then only allows 1
seconds worth of outstanding delayed refs at a time. This way it will auto-tune
itself from cold cache up to when everything is in memory and it no longer has
to go to disk. This makes our transaction commits take much less time to run.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently we have two rb-trees, one for delayed ref heads and one for all of the
delayed refs, including the delayed ref heads. When we process the delayed refs
we have to hold onto the delayed ref lock for all of the selecting and merging
and such, which results in quite a bit of lock contention. This was solved by
having a waitqueue and only one flusher at a time, however this hurts if we get
a lot of delayed refs queued up.
So instead just have an rb tree for the delayed ref heads, and then attach the
delayed ref updates to an rb tree that is per delayed ref head. Then we only
need to take the delayed ref lock when adding new delayed refs and when
selecting a delayed ref head to process, all the rest of the time we deal with a
per delayed ref head lock which will be much less contentious.
The locking rules for this get a little more complicated since we have to lock
up to 3 things to properly process delayed refs, but I will address that problem
later. For now this passes all of xfstests and my overnight stress tests.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Looking into some performance related issues with large amounts of metadata
revealed that we can have some pretty huge swings in fsync() performance. If we
have a lot of delayed refs backed up (as you will tend to do with lots of
metadata) fsync() will wander off and try to run some of those delayed refs
which can result in reading from disk and such. Since the actual act of fsync()
doesn't create any delayed refs there is no need to make it throttle on delayed
ref stuff, that will be handled by other people. With this patch we get much
smoother fsync performance with large amounts of metadata. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
This change adds infrastructure to allow for generic properties for
inodes. Properties are name/value pairs that can be associated with
inodes for different purposes. They are stored as xattrs with the
prefix "btrfs."
Properties can be inherited - this means when a directory inode has
inheritable properties set, these are added to new inodes created
under that directory. Further, subvolumes can also have properties
associated with them, and they can be inherited from their parent
subvolume. Naturally, directory properties have priority over subvolume
properties (in practice a subvolume property is just a regular
property associated with the root inode, objectid 256, of the
subvolume's fs tree).
This change also adds one specific property implementation, named
"compression", whose values can be "lzo" or "zlib" and it's an
inheritable property.
The corresponding changes to btrfs-progs were also implemented.
A patch with xfstests for this feature will follow once there's
agreement on this change/feature.
Further, the script at the bottom of this commit message was used to
do some benchmarks to measure any performance penalties of this feature.
Basically the tests correspond to:
Test 1 - create a filesystem and mount it with compress-force=lzo,
then sequentially create N files of 64Kb each, measure how long it took
to create the files, unmount the filesystem, mount the filesystem and
perform an 'ls -lha' against the test directory holding the N files, and
report the time the command took.
Test 2 - create a filesystem and don't use any compression option when
mounting it - instead set the compression property of the subvolume's
root to 'lzo'. Then create N files of 64Kb, and report the time it took.
The unmount the filesystem, mount it again and perform an 'ls -lha' like
in the former test. This means every single file ends up with a property
(xattr) associated to it.
Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
compression property, have no real effect other than adding more work
when inheriting properties and taking more btree leaf space.
Test 4 - same as test 3 but with 10 properties per file.
Results (in seconds, and averages of 5 runs each), for different N
numbers of files follow.
* Without properties (test 1)
file creation time ls -lha time
10 000 files 3.49 0.76
100 000 files 47.19 8.37
1 000 000 files 518.51 107.06
* With 1 property (compression property set to lzo - test 2)
file creation time ls -lha time
10 000 files 3.63 0.93
100 000 files 48.56 9.74
1 000 000 files 537.72 125.11
* With 4 properties (test 3)
file creation time ls -lha time
10 000 files 3.94 1.20
100 000 files 52.14 11.48
1 000 000 files 572.70 142.13
* With 10 properties (test 4)
file creation time ls -lha time
10 000 files 4.61 1.35
100 000 files 58.86 13.83
1 000 000 files 656.01 177.61
The increased latencies with properties are essencialy because of:
*) When creating an inode, we now synchronously write 1 more item
(an xattr item) for each property inherited from the parent dir
(or subvolume). This could be done in an asynchronous way such
as we do for dir intex items (delayed-inode.c), which could help
reduce the file creation latency;
*) With properties, we now have larger fs trees. For this particular
test each xattr item uses 75 bytes of leaf space in the fs tree.
This could be less by using a new item for xattr items, instead of
the current btrfs_dir_item, since we could cut the 'location' and
'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
total of 26 bytes per xattr item) from the btrfs_dir_item type.
Also tried batching the xattr insertions (ignoring proper hash
collision handling, since it didn't exist) when creating files that
inherit properties from their parent inode/subvolume, but the end
results were (surprisingly) essentially the same.
Test script:
$ cat test.pl
#!/usr/bin/perl -w
use strict;
use Time::HiRes qw(time);
use constant NUM_FILES => 10_000;
use constant FILE_SIZES => (64 * 1024);
use constant DEV => '/dev/sdb4';
use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
use constant TEST_DIR => (MNT_POINT . '/testdir');
system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";
# following line for testing without properties
#system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";
# following 2 lines for testing with properties
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";
system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
my ($t1, $t2);
$t1 = time();
for (my $i = 1; $i <= NUM_FILES; $i++) {
my $p = TEST_DIR . '/file_' . $i;
open(my $f, '>', $p) or die "Error opening file!";
$f->autoflush(1);
for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
print $f ('A' x 4096) or die "Error writing to file!";
}
close($f);
}
$t2 = time();
print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
$t1 = time();
system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
$t2 = time();
print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When writing to a file we drop existing file extent items that cover the
write range and then add a new file extent item that represents that write
range.
Before this change we were doing a tree lookup to remove the file extent
items, and then after we did another tree lookup to insert the new file
extent item.
Most of the time all the file extent items we need to drop are located
within a single leaf - this is the leaf where our new file extent item ends
up at. Therefore, in this common case just combine these 2 operations into
a single one.
By avoiding the second btree navigation for insertion of the new file extent
item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
COW operations, CPU time on btree node/leaf key binary searches, etc.
Besides for file writes, this is an operation that happens for file fsync's
as well. However log btrees are much less likely to big as big as regular
fs btrees, therefore the impact of this change is smaller.
The following benchmark was performed against an SSD drive and a
HDD drive, both for random and sequential writes:
sysbench --test=fileio --file-num=4096 --file-total-size=8G \
--file-test-mode=[rndwr|seqwr] --num-threads=512 \
--file-block-size=8192 \ --max-requests=1000000 \
--file-fsync-freq=0 --file-io-mode=sync [prepare|run]
All results below are averages of 10 runs of the respective test.
** SSD sequential writes
Before this change: 225.88 Mb/sec
After this change: 277.26 Mb/sec
** SSD random writes
Before this change: 49.91 Mb/sec
After this change: 56.39 Mb/sec
** HDD sequential writes
Before this change: 68.53 Mb/sec
After this change: 69.87 Mb/sec
** HDD random writes
Before this change: 13.04 Mb/sec
After this change: 14.39 Mb/sec
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We may return early in btrfs_drop_snapshot(), we shouldn't
call btrfs_std_err() for this case, fix it.
Cc: stable@vger.kernel.org
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We will finish orphan cleanups during snapshot, so we don't
have to commit transaction here.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We should gurantee that parent and clone roots can not be destroyed
during send, for this we have two ideas.
1.by holding @subvol_sem, this might be a nightmare, because it will
block all subvolumes deletion for a long time.
2.Miao pointed out we can reuse @send_in_progress, that mean we will
skip snapshot deletion if root sending is in progress.
Here we adopt the second approach since it won't block other subvolumes
deletion for a long time.
Besides in btrfs_clean_one_deleted_snapshot(), we only check first root
, if this root is involved in send, we return directly rather than
continue to check.There are several reasons about it:
1.this case happen seldomly.
2.after sending,cleaner thread can continue to drop that root.
3.make code simple
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Steps to reproduce:
# mkfs.btrfs -f /dev/sda8
# mount /dev/sda8 /mnt
# btrfs sub snapshot -r /mnt /mnt/snap1
# btrfs sub snapshot -r /mnt /mnt/snap2
# btrfs send /mnt/snap1 -p /mnt/snap2 -f /mnt/1
# dmesg
The problem is that we will sort clone roots(include @send_root), it
might push @send_root before thus @send_root's @send_in_progress will
be decreased twice.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add treelog mount option to enable tree log with
remount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add datasum mount option to enable checksum with
remount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add datacow mount option to enable copy-on-write with
remount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add acl mount option to enable acl with remount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add noflushoncommit mount option to disable flush on commit with
remount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add noenospc_debug mount option to disable ENOSPC debug with
remount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add nodiscard mount option to disable discard with remount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Btrfs has autodefrag mount option but no pairing noautodefrag option,
which makes it impossible to disable autodefrag without umount.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Btrfs can be remounted without barrier, but there is no "barrier" option
so nobody can remount btrfs back with barrier on. Only umount and
mount again can re-enable barrier.(Quite awkward)
Also the mount options in the document is also changed slightly for the
further pairing options changes.
Reported-by: Daniel Blueman <daniel@quora.org>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Mike Fleetwood <mike.fleetwood@googlemail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We only intent to fua the first superblock in every device from
comments, fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
@full is not protected within global_rsv.lock, so we may think global_rsv
is already full but in fact it's not, so we miss the opportunity to return
free space to global_rsv directly when we release other block_rsvs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
During balance test, we hit an oops:
[ 2013.841551] kernel BUG at fs/btrfs/relocation.c:1174!
The problem is that if we fail to relocate tree blocks, we should
update backref cache, otherwise, some pending nodes are not updated
while snapshot check @cache->last_trans is within one transaction
and won't update it and then oops happen.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The following warning message was outputed when running the 274th case
of xfstests with nodatacow option:
BUG: Bad page state in process kswapd0 pfn:1c66f
page:ffffea0000636848 count:0 mapcount:0 mapping:(null) index:0x78000
page flags: 0x1000000000100a(error|uptodate|private_2)
It is because the check of nocow range was wrong, we should compare the
start and end position of the extent with the write position to verify
if the write position was in the extent, but the current code just used
the start postion to do the check, so we got the wrong extent and told
the caller that it was a nocow write. And then when we write back the
dirty pages, we found we should cow the extent, but at that time, there
was no space in the fs, we had to the error flag for the page. When
someone reclaimed that page, the above warning outputed. Fix it.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Previously, we will free reloc root memory and then force filesystem
to be readonly. The problem is that there may be another thread commiting
transaction which will try to access freed reloc root during merging reloc
roots process.
To keep consistency snapshots shared space, we should allow snapshot
finished if possible, so here we don't free reloc root memory.
signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
@nr is no longer used, remove it from select_reloc_root()
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we do a btree search with the goal of updating an existing item
without changing its size (ins_len == 0 and cow == 1), then we never
need to hold locks on upper level nodes (even when slot == 0) after we
COW their child nodes/leaves, as we won't have node splits or merges
in this scenario (that is, no key additions, removals or shifts on any
nodes or leaves).
Therefore release the locks immediately after COWing the child nodes/leaves
while navigating the btree, even if their parent slot is 0, instead of
returning a path to the caller with those nodes locked, which would get
released only when the caller releases or frees the path (or if it calls
btrfs_unlock_up_safe).
This is a common scenario, for example when updating inode items in fs
trees and block group items in the extent tree.
The following benchmarks were performed on a quad core machine with 32Gb
of ram, using a leaf/node size of 4Kb (to generate deeper fs trees more
quickly).
sysbench --test=fileio --file-num=131072 --file-total-size=8G \
--file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \
--max-requests=100000 --file-io-mode=sync [prepare|run]
Before this change: 49.85Mb/s (average of 5 runs)
After this change: 50.38Mb/s (average of 5 runs)
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The local variable 'new_size' comes from userspace. If a large number
was passed, there would be an integer overflow in the following line:
new_size = old_size + new_size;
Signed-off-by: Wenliang Fan <fanwlexca@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We can starve out the transaction commit with a bunch of caching threads all
running at the same time. This is because we will only drop the
extent_commit_sem if we need_resched(), which isn't likely to happen since we
will be reading a lot from the disk so have already schedule()'ed plenty. Alex
observed that he could starve out a transaction commit for up to a minute with
32 caching threads all running at once. This will allow us to drop the
extent_commit_sem to allow the transaction commit to swap the commit_root out
and then all the cachers will start back up. Here is an explanation provided by
Igno
So, just to fill in what happens in this loop:
mutex_unlock(&caching_ctl->mutex);
cond_resched();
goto again;
where 'again:' takes caching_ctl->mutex and fs_info->extent_commit_sem
again:
again:
mutex_lock(&caching_ctl->mutex);
/* need to make sure the commit_root doesn't disappear */
down_read(&fs_info->extent_commit_sem);
So, if I'm reading the code correct, there can be a fair amount of
concurrency here: there may be multiple 'caching kthreads' per filesystem
active, while there's one fs_info->extent_commit_sem per filesystem
AFAICS.
So, what happens if there are a lot of CPUs all busy holding the
->extent_commit_sem rwsem read-locked and a writer arrives? They'd all
rush to try to release the fs_info->extent_commit_sem, and they'd block in
the down_read() because there's a writer waiting.
So there's a guarantee of forward progress. This should answer akpm's
concern I think.
Thanks,
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
The inode reference item is close to inode item, so we insert it simultaneously
with the inode item insertion when we create a file/directory.. In fact, we also
can handle the inode reference deletion by the same way. So we made this patch to
introduce the delayed inode reference deletion for the single link inode(At most
case, the file doesn't has hard link, so we don't take the hard link into account).
This function is based on the delayed inode mechanism. After applying this patch,
we can reduce the time of the file/directory deletion by ~10%.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Two reasons:
- btrfs_end_transaction_dmeta() is the same as btrfs_end_transaction_throttle()
so it is unnecessary.
- All the delayed items should be dealt in the current transaction, so the
workers should not commit the transaction, instead, deal with the delayed
items as many as possible.
So we can remove btrfs_end_transaction_dmeta()
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
- move the condition check for wait into a function
- use wait_event_interruptible instead of prepare-schedule-finish process
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
If the number of the delayed items is greater than the upper limit, we will
try to flush all the delayed items. After that, it is unnecessary to run
them again because they are being dealt with by the wokers or the number of
them is less than the lower limit.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Before applying the patch
commit de3cb945db
title: Btrfs: improve the delayed inode throttling
We need requeue the async work after the current work was done, it
introduced a deadlock problem. So we wrote the code that this patch
removes to avoid the above problem. But after applying the above
patch, the deadlock problem didn't exist. So we should remove that
fix code.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Convert all applicable cases of printk and pr_* to the btrfs_* macros.
Fix all uses of the BTRFS prefix.
Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
While running the test btrfs/004 from xfstests in a loop, it failed
about 1 time out of 20 runs in my desktop. The failure happened in
the backref walking part of the test, and the test's error message was
like this:
btrfs/004 93s ... [failed, exit status 1] - output mismatch (see /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad)
--- tests/btrfs/004.out 2013-11-26 18:25:29.263333714 +0000
+++ /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad 2013-12-10 15:25:10.327518516 +0000
@@ -1,3 +1,8 @@
QA output created by 004
*** test backref walking
-*** done
+unexpected output from
+ /home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
+expected inum: 405, expected address: 454656, file: /home/fdmanana/btrfs-tests/scratch_1/snap1/p0/d6/d3d/d156/fce, got:
+
...
(Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad' to see the entire diff)
Ran: btrfs/004
Failures: btrfs/004
Failed 1 of 1 tests
But immediately after the test finished, the btrfs inspect-internal command
returned the expected output:
$ btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
inode 405 offset 454656 root 258
inode 405 offset 454656 root 5
It turned out this was because the btrfs_search_old_slot() calls performed
during backref walking (backref.c:__resolve_indirect_ref) were not finding
anything. The reason for this turned out to be that the tree mod logging
code was not logging some node multi-step operations atomically, therefore
btrfs_search_old_slot() callers iterated often over an incomplete tree that
wasn't fully consistent with any tree state from the past. Besides missing
items, this often (but not always) resulted in -EIO errors during old slot
searches, reported in dmesg like this:
[ 4299.933936] ------------[ cut here ]------------
[ 4299.933949] WARNING: CPU: 0 PID: 23190 at fs/btrfs/ctree.c:1343 btrfs_search_old_slot+0x57b/0xab0 [btrfs]()
[ 4299.933950] Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep rfcomm bluetooth parport_pc ppdev binfmt_misc joydev snd_hda_codec_h
[ 4299.933977] CPU: 0 PID: 23190 Comm: btrfs Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70
[ 4299.933978] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 4299.933979] 000000000000053f ffff8806f3fd98f8 ffffffff8176d284 0000000000000007
[ 4299.933982] 0000000000000000 ffff8806f3fd9938 ffffffff8104a81c ffff880659c64b70
[ 4299.933984] ffff880659c643d0 ffff8806599233d8 ffff880701e2e938 0000160000000000
[ 4299.933987] Call Trace:
[ 4299.933991] [<ffffffff8176d284>] dump_stack+0x55/0x76
[ 4299.933994] [<ffffffff8104a81c>] warn_slowpath_common+0x8c/0xc0
[ 4299.933997] [<ffffffff8104a86a>] warn_slowpath_null+0x1a/0x20
[ 4299.934003] [<ffffffffa065d3bb>] btrfs_search_old_slot+0x57b/0xab0 [btrfs]
[ 4299.934005] [<ffffffff81775f3b>] ? _raw_read_unlock+0x2b/0x50
[ 4299.934010] [<ffffffffa0655001>] ? __tree_mod_log_search+0x81/0xc0 [btrfs]
[ 4299.934019] [<ffffffffa06dd9b0>] __resolve_indirect_refs+0x130/0x5f0 [btrfs]
[ 4299.934027] [<ffffffffa06a21f1>] ? free_extent_buffer+0x61/0xc0 [btrfs]
[ 4299.934034] [<ffffffffa06de39c>] find_parent_nodes+0x1fc/0xe40 [btrfs]
[ 4299.934042] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934048] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934056] [<ffffffffa06df980>] iterate_extent_inodes+0xe0/0x250 [btrfs]
[ 4299.934058] [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50
[ 4299.934065] [<ffffffffa06dfb82>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
[ 4299.934071] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934078] [<ffffffffa06b7015>] btrfs_ioctl+0xf65/0x1f60 [btrfs]
[ 4299.934080] [<ffffffff811658b8>] ? handle_mm_fault+0x278/0xb00
[ 4299.934083] [<ffffffff81075563>] ? up_read+0x23/0x40
[ 4299.934085] [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0
[ 4299.934088] [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570
[ 4299.934090] [<ffffffff81776e23>] ? error_sti+0x5/0x6
[ 4299.934093] [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 4299.934096] [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13
[ 4299.934098] [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0
[ 4299.934100] [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934104] ---[ end trace 48f0cfc902491414 ]---
[ 4299.934378] btrfs bad fsid on block 0
These tree mod log operations that must be performed atomically, tree_mod_log_free_eb,
tree_mod_log_eb_copy, tree_mod_log_insert_root and tree_mod_log_insert_move, used to
be performed atomically before the following commit:
c8cc634165
(Btrfs: stop using GFP_ATOMIC for the tree mod log allocations)
That change removed the atomicity of such operations. This patch restores the
atomicity while still not doing the GFP_ATOMIC allocations of tree_mod_elem
structures, so it has to do the allocations using GFP_NOFS before acquiring
the mod log lock.
This issue has been experienced by several users recently, such as for example:
http://www.spinics.net/lists/linux-btrfs/msg28574.html
After running the btrfs/004 test for 679 consecutive iterations with this
patch applied, I didn't ran into the issue anymore.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Warn if the balance goes below zero, which appears to be unlikely
though. Otherwise cleans up the code a bit.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Since daivd did the work that force us to use readonly snapshot,
we can safely remove transaction protection from btrfs send.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We met the following oops when doing space balance:
kobject (ffff88081b590278): tried to init an initialized object, something is seriously wrong.
...
Call Trace:
[<ffffffff81937262>] dump_stack+0x49/0x5f
[<ffffffff8137d259>] kobject_init+0x89/0xa0
[<ffffffff8137d36a>] kobject_init_and_add+0x2a/0x70
[<ffffffffa009bd79>] ? clear_extent_bit+0x199/0x470 [btrfs]
[<ffffffffa005e82c>] __link_block_group+0xfc/0x120 [btrfs]
[<ffffffffa006b9db>] btrfs_make_block_group+0x24b/0x370 [btrfs]
[<ffffffffa00a899b>] __btrfs_alloc_chunk+0x54b/0x7e0 [btrfs]
[<ffffffffa00a8c6f>] btrfs_alloc_chunk+0x3f/0x50 [btrfs]
[<ffffffffa0060123>] do_chunk_alloc+0x363/0x440 [btrfs]
[<ffffffffa00633d4>] btrfs_check_data_free_space+0x104/0x310 [btrfs]
[<ffffffffa0069f4d>] btrfs_write_dirty_block_groups+0x48d/0x600 [btrfs]
[<ffffffffa007aad4>] commit_cowonly_roots+0x184/0x250 [btrfs]
...
Steps to reproduce:
# mkfs.btrfs -f <dev>
# mount -o nospace_cache <dev> <mnt>
# btrfs balance start <mnt>
# dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1
The reason of this problem is that we initialized the raid kobject when we added
a block group into a empty raid list. As we know, when we mounted a btrfs filesystem,
the raid list was empty, we would initialize the raid kobject when we added the first
block group. But if there was not data stored in the block group, the block group
would be freed when doing balance, and the raid list would be empty. And then if we
allocated a new block group and added it into the raid list, we would initialize
the raid kobject again, the oops happened.
Fix this problem by initializing the raid kobject just when mounting the fs.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reported-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
All the subvolues that are involved in send must be read-only during the
whole operation. The ioctl SUBVOL_SETFLAGS could be used to change the
status to read-write and the result of send stream is undefined if the
data change unexpectedly.
Fix that by adding a refcount for all involved roots and verify that
there's no send in progress during SUBVOL_SETFLAGS ioctl call that does
read-only -> read-write transition.
We need refcounts because there are no restrictions on number of send
parallel operations currently run on a single subvolume, be it source,
parent or one of the multiple clone sources.
Kernel is silent when the RO checks fail and returns EPERM. The same set
of checks is done already in userspace before send starts.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Unused since ed2590953b
"Btrfs: stop using vfs_read in send".
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Remove ifdefed code:
- tlv_put for 8, 16 and 32, add a generic tempalte if needed in future
- tlv_put_timespec - the btrfs_timespec fields are used
- fs_path_remove obsoleted long ago
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
While running btrfs/004 from xfstests, after 503 iterations, dmesg reported
a deadlock between tasks iterating inode refs and tasks running delayed inodes
(during a transaction commit).
It turns out that iterating inode refs implies doing one tree search and
release all nodes in the path except the leaf node, and then passing that
leaf node to btrfs_ref_to_path(), which in turn does another tree search
without releasing the lock on the leaf node it received as parameter.
This is a problem when other task wants to write to the btree as well and
ends up updating the leaf that is read locked - the writer task locks the
parent of the leaf and then blocks waiting for the leaf's lock to be
released - at the same time, the task executing btrfs_ref_to_path()
does a second tree search, without releasing the lock on the first leaf,
and wants to access a leaf (the same or another one) that is a child of
the same parent, resulting in a deadlock.
The trace reported by lockdep follows.
[84314.936373] INFO: task fsstress:11930 blocked for more than 120 seconds.
[84314.936381] Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70
[84314.936383] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84314.936386] fsstress D ffff8806e1bf8000 0 11930 11926 0x00000000
[84314.936393] ffff8804d6d89b78 0000000000000046 ffff8804d6d89b18 ffffffff810bd8bd
[84314.936399] ffff8806e1bf8000 ffff8804d6d89fd8 ffff8804d6d89fd8 ffff8804d6d89fd8
[84314.936405] ffff880806308000 ffff8806e1bf8000 ffff8804d6d89c08 ffff8804deb8f190
[84314.936410] Call Trace:
[84314.936421] [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10
[84314.936428] [<ffffffff81774269>] schedule+0x29/0x70
[84314.936451] [<ffffffffa0715bf5>] btrfs_tree_lock+0x75/0x270 [btrfs]
[84314.936457] [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60
[84314.936470] [<ffffffffa06ba231>] btrfs_search_slot+0x7f1/0x930 [btrfs]
[84314.936489] [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs]
[84314.936504] [<ffffffffa06d2e1f>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
[84314.936510] [<ffffffff810bd6ef>] ? trace_hardirqs_on_caller+0x1f/0x1e0
[84314.936528] [<ffffffffa073173c>] __btrfs_update_delayed_inode+0x4c/0x1d0 [btrfs]
[84314.936543] [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs]
[84314.936558] [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs]
[84314.936573] [<ffffffffa0731c82>] __btrfs_run_delayed_items+0x192/0x1e0 [btrfs]
[84314.936589] [<ffffffffa0731d03>] btrfs_run_delayed_items+0x13/0x20 [btrfs]
[84314.936604] [<ffffffffa06dbcd4>] btrfs_flush_all_pending_stuffs+0x24/0x80 [btrfs]
[84314.936620] [<ffffffffa06ddc13>] btrfs_commit_transaction+0x223/0xa20 [btrfs]
[84314.936630] [<ffffffffa06ae5ae>] btrfs_sync_fs+0x6e/0x110 [btrfs]
[84314.936635] [<ffffffff811d0b50>] ? __sync_filesystem+0x60/0x60
[84314.936639] [<ffffffff811d0b50>] ? __sync_filesystem+0x60/0x60
[84314.936643] [<ffffffff811d0b70>] sync_fs_one_sb+0x20/0x30
[84314.936648] [<ffffffff811a3541>] iterate_supers+0xf1/0x100
[84314.936652] [<ffffffff811d0c45>] sys_sync+0x55/0x90
[84314.936658] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[84314.936660] INFO: lockdep is turned off.
[84314.936663] INFO: task btrfs:11955 blocked for more than 120 seconds.
[84314.936666] Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70
[84314.936668] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84314.936670] btrfs D ffff880541729a88 0 11955 11608 0x00000000
[84314.936674] ffff880541729a38 0000000000000046 ffff8805417299d8 ffffffff810bd8bd
[84314.936680] ffff88075430c8a0 ffff880541729fd8 ffff880541729fd8 ffff880541729fd8
[84314.936685] ffffffff81c104e0 ffff88075430c8a0 ffff8804de8b00b8 ffff8804de8b0000
[84314.936690] Call Trace:
[84314.936695] [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10
[84314.936700] [<ffffffff81774269>] schedule+0x29/0x70
[84314.936717] [<ffffffffa0715815>] btrfs_tree_read_lock+0xd5/0x140 [btrfs]
[84314.936721] [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60
[84314.936733] [<ffffffffa06ba201>] btrfs_search_slot+0x7c1/0x930 [btrfs]
[84314.936746] [<ffffffffa06bd505>] btrfs_find_item+0x55/0x160 [btrfs]
[84314.936763] [<ffffffffa06ff689>] ? free_extent_buffer+0x49/0xc0 [btrfs]
[84314.936780] [<ffffffffa073c9ca>] btrfs_ref_to_path+0xba/0x1e0 [btrfs]
[84314.936797] [<ffffffffa06f9719>] ? release_extent_buffer+0xb9/0xe0 [btrfs]
[84314.936813] [<ffffffffa06ff689>] ? free_extent_buffer+0x49/0xc0 [btrfs]
[84314.936830] [<ffffffffa073cb50>] inode_to_path+0x60/0xd0 [btrfs]
[84314.936846] [<ffffffffa073d365>] paths_from_inode+0x115/0x3c0 [btrfs]
[84314.936851] [<ffffffff8118dd44>] ? kmem_cache_alloc_trace+0x114/0x200
[84314.936868] [<ffffffffa0714494>] btrfs_ioctl+0xf14/0x2030 [btrfs]
[84314.936873] [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50
[84314.936877] [<ffffffff8116598f>] ? handle_mm_fault+0x34f/0xb00
[84314.936882] [<ffffffff81075563>] ? up_read+0x23/0x40
[84314.936886] [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0
[84314.936892] [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570
[84314.936896] [<ffffffff81776e23>] ? error_sti+0x5/0x6
[84314.936901] [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0
[84314.936906] [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13
[84314.936910] [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0
[84314.936915] [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[84314.936920] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[84314.936922] INFO: lockdep is turned off.
[84434.866873] INFO: task btrfs-transacti:11921 blocked for more than 120 seconds.
[84434.866881] Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70
[84434.866883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84434.866886] btrfs-transacti D ffff880755b6a478 0 11921 2 0x00000000
[84434.866893] ffff8800735b9ce8 0000000000000046 ffff8800735b9c88 ffffffff810bd8bd
[84434.866899] ffff8805a1b848a0 ffff8800735b9fd8 ffff8800735b9fd8 ffff8800735b9fd8
[84434.866904] ffffffff81c104e0 ffff8805a1b848a0 ffff880755b6a478 ffff8804cece78f0
[84434.866910] Call Trace:
[84434.866920] [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10
[84434.866927] [<ffffffff81774269>] schedule+0x29/0x70
[84434.866948] [<ffffffffa06dd2ef>] wait_current_trans.isra.33+0xbf/0x120 [btrfs]
[84434.866954] [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60
[84434.866970] [<ffffffffa06dec18>] start_transaction+0x388/0x5a0 [btrfs]
[84434.866985] [<ffffffffa06db9b5>] ? transaction_kthread+0xb5/0x280 [btrfs]
[84434.866999] [<ffffffffa06dee97>] btrfs_attach_transaction+0x17/0x20 [btrfs]
[84434.867012] [<ffffffffa06dba9e>] transaction_kthread+0x19e/0x280 [btrfs]
[84434.867026] [<ffffffffa06db900>] ? open_ctree+0x2260/0x2260 [btrfs]
[84434.867030] [<ffffffff81070dad>] kthread+0xed/0x100
[84434.867035] [<ffffffff81070cc0>] ? flush_kthread_worker+0x190/0x190
[84434.867040] [<ffffffff8177ee6c>] ret_from_fork+0x7c/0xb0
[84434.867044] [<ffffffff81070cc0>] ? flush_kthread_worker+0x190/0x190
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Chris introduced hleper function read_csums() and this function
has been removed, but we forgot to remove its corresponding comments.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
It's not used anywhere, so just drop it.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
fs/btrfs/file.c: In function ‘prepare_pages.isra.18’:
fs/btrfs/file.c:1265:6: warning: ‘err’ may be used uninitialized in this function [-Wuninitialized]
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We have commited transaction before, remove redundant filemap writting and
waiting here, it can speed up balance relocation process.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Clean up btrfs_lookup_dentry() to never return NULL, but PTR_ERR(-ENOENT)
instead. This keeps the return value convention consistent.
Callers who use btrfs_lookup_dentry() require a trivial update.
create_snapshot() in particular looks like it can also lose a BUG_ON(!inode)
which is not really needed - there seems less harm in returning ENOENT to
userspace at that point in the stack than there is to crash the machine.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
In ctree.c:tree_mod_log_set_node_key() we were calling
__tree_mod_log_insert_key() even when the modification doesn't need
to be logged. This would allocate a tree_mod_elem structure, fill it
and pass it to __tree_mod_log_insert(), which would just acquire
the tree mod log write lock and then free the tree_mod_elem structure
and return (that is, a no-op).
Therefore call tree_mod_log_insert() instead of __tree_mod_log_insert()
which just returns immediately if the modification doesn't need to be
logged (without allocating the structure, fill it, acquire write lock,
free structure).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
I need to create a fake tree to test qgroups and I don't want to have to setup a
fake btree_inode. The fact is we only use the radix tree for the fs_info, so
everybody else who allocates an extent_io_tree is just wasting the space anyway.
This patch moves the radix tree and its lock into btrfs_fs_info so there is less
stuff I have to fake to do qgroup sanity tests. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
For creating a dummy in-memory btree I need to be able to use the radix tree to
keep track of the buffers like normal extent buffers. With dummy buffers we
skip the radix tree step, and we still want to do that for the tree mod log
dummy buffers but for my test buffers we need to be able to remove them from the
radix tree like normal. This will give me a way to do that. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
I need to add infrastructure to allocate dummy extent buffers for running sanity
tests, and to do this I need to not have to worry about having an
address_mapping for an io_tree, so just fix up the places where we assume that
all io_tree's have a non-NULL ->mapping. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently when finding the leaf to insert a key into a btree, if the
leaf doesn't have enough space to store the item we attempt to move
off some items from our leaf to its right neighbor leaf, and if this
fails to create enough free space in our leaf, we try to move off more
items to the left neighbor leaf as well.
When trying to move off items to the right neighbor leaf, if it has
enough room to store the new key but not not enough room to move off
at least one item from our target leaf, __push_leaf_right returns 1 and
we have to attempt to move items to the left neighbor (push_leaf_left
function) without touching the right neighbor leaf.
For the case where the right leaf has enough room to store at least 1
item from our leaf, we end up modifying (and dirtying) both our leaf
and the right leaf. This is non-optimal for the case where the new key
is greater than any key in our target leaf because it can be inserted at
slot 0 of the right neighbor leaf and we don't need to touch our leaf
at all nor to attempt to move off items to the left neighbor leaf.
Therefore this change just selects the right neighbor leaf as our new
target leaf if it has enough room for the new key without modifying our
initial target leaf - we do this only if the new key is higher than any
key in the initial target leaf.
While running the following test, push_leaf_right was called by split_leaf
4802 times. Out of those 4802 calls, for 2571 calls (53.5%) we hit this
special case (right leaf has enough room and new key is higher than any key
in the initial target leaf).
Test:
sysbench --test=fileio --file-num=512 --file-total-size=5G \
--file-test-mode=[seqwr|rndwr] --num-threads=512 --file-block-size=8192 \
--max-requests=100000 --file-io-mode=sync [prepare|run]
Results:
sequential writes
Throughput before this change: 65.71Mb/sec (average of 10 runs)
Throughput after this change: 66.58Mb/sec (average of 10 runs)
random writes
Throughput before this change: 10.75Mb/sec (average of 10 runs)
Throughput after this change: 11.56Mb/sec (average of 10 runs)
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Just wrap same code into one function scrub_blocked_if_needed().
This make a change that we will move waiting (@workers_pending = 0)
before we can wake up commiting transaction(atomic_inc(@scrub_paused)),
we must take carefully to not deadlock here.
Thread 1 Thread 2
|->btrfs_commit_transaction()
|->set trans type(COMMIT_DOING)
|->btrfs_scrub_paused()(blocked)
|->join_transaction(blocked)
Move btrfs_scrub_paused() before setting trans type which means we can
still join a transaction when commiting_transaction is blocked.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Suggested-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We came a race condition when scrubbing superblocks, the story is:
In commiting transaction, we will update @last_trans_commited after
writting superblocks, if scrubber start after writting superblocks
and before updating @last_trans_commited, generation mismatch happens!
We fix this by checking @scrub_pause_req, and we won't start a srubber
until commiting transaction is finished.(after btrfs_scrub_continue()
finished.)
Reported-by: Sebastian Ochmann <ochmann@informatik.uni-bonn.de>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
fs/btrfs/send.c:2190:9: warning: incorrect type in argument 3 (different base types)
fs/btrfs/send.c:2190:9: expected unsigned long long [unsigned] [usertype] value
fs/btrfs/send.c:2190:9: got restricted __le64 [usertype] ctransid
fs/btrfs/send.c:2195:17: warning: incorrect type in argument 3 (different base types)
fs/btrfs/send.c:2195:17: expected unsigned long long [unsigned] [usertype] value
fs/btrfs/send.c:2195:17: got restricted __le64 [usertype] ctransid
fs/btrfs/send.c:3716:9: warning: incorrect type in argument 3 (different base types)
fs/btrfs/send.c:3716:9: expected unsigned long long [unsigned] [usertype] value
fs/btrfs/send.c:3716:9: got restricted __le64 [usertype] ctransid
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When merging an extent_map with its right neighbor, increment
its block_len with the neighbor's block_len.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
[commit 8185554d: fix incorrect inode acl reset] introduced a dead
code by adding a condition which can never be true to an else
branch. The condition can never be true because it is already
checked by a previous if statement which causes function to return.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-By: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We were accounting for sizeof(struct btrfs_item) twice, once
in the data_size variable and another time in the if statement
below.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently we do 2 traversals of an inode's extent_io_tree
before inserting an extent state structure: 1 to see if a
matching extent state already exists and 1 to do the insertion
if the fist traversal didn't found such extent state.
This change just combines those tree traversals into a single one.
While running sysbench tests (random writes) I captured the number
of elements in extent_io_tree trees for a while (into a procfs file
backed by a seq_list from seq_file module) and got this histogram:
Count: 9310
Range: 51.000 - 21386.000; Mean: 11785.243; Median: 18743.500; Stddev: 8923.688
Percentiles: 90th: 20985.000; 95th: 21155.000; 99th: 21369.000
51.000 - 93.933: 693 ########
93.933 - 172.314: 938 ##########
172.314 - 315.408: 856 #########
315.408 - 576.646: 95 #
576.646 - 6415.830: 888 ##########
6415.830 - 11713.809: 1024 ###########
11713.809 - 21386.000: 4816 #####################################################
So traversing such trees can take some significant time that can
easily be avoided.
Ran the following sysbench tests, 5 times each, for sequential and
random writes, and got the following results:
sysbench --test=fileio --file-num=1 --file-total-size=2G \
--file-test-mode=seqwr --num-threads=16 --file-block-size=65536 \
--max-requests=0 --max-time=60 --file-io-mode=sync
sysbench --test=fileio --file-num=1 --file-total-size=2G \
--file-test-mode=rndwr --num-threads=16 --file-block-size=65536 \
--max-requests=0 --max-time=60 --file-io-mode=sync
Before this change:
sequential writes: 69.28Mb/sec (average of 5 runs)
random writes: 4.14Mb/sec (average of 5 runs)
After this change:
sequential writes: 69.91Mb/sec (average of 5 runs)
random writes: 5.69Mb/sec (average of 5 runs)
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we didn't find a matching extent state, we inserted a new one
but didn't cache it in the **cached_state parameter, which makes a
subsequent call do a tree lookup to get it.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Before this change, adding an extent map to the extent map tree of an
inode required 2 tree nevigations:
1) doing a tree navigation to search for an existing extent map starting
at the same offset or an extent map that overlaps the extent map we
want to insert;
2) Another tree navigation to add the extent map to the tree (if the
former tree search didn't found anything).
This change just merges these 2 steps into a single one.
While running first few btrfs xfstests I had noticed these trees easily
had a few hundred elements, and then with the following sysbench test it
reached over 1100 elements very often.
Test:
sysbench --test=fileio --file-num=32 --file-total-size=10G \
--file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \
--max-requests=1000000 --file-io-mode=sync [prepare|run]
(fs created with mkfs.btrfs -l 4096 -f /dev/sdb3 before each sysbench
prepare phase)
Before this patch:
run 1 - 41.894Mb/sec
run 2 - 40.527Mb/sec
run 3 - 40.922Mb/sec
run 4 - 49.433Mb/sec
run 5 - 40.959Mb/sec
average - 42.75Mb/sec
After this patch:
run 1 - 48.036Mb/sec
run 2 - 50.21Mb/sec
run 3 - 50.929Mb/sec
run 4 - 46.881Mb/sec
run 5 - 53.192Mb/sec
average - 49.85Mb/sec
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When attempting to move items from our target leaf to its neighbor
leaves (right and left), we only need to free data_size - free_space
bytes from our leaf in order to add the new item (which has size of
data_size bytes). Therefore attempt to move items to the right and
left leaves if they have at least data_size - free_space bytes free,
instead of data_size bytes free.
After 5 runs of the following test, I got a smaller number of btree
node splits overall:
sysbench --test=fileio --file-num=512 --file-total-size=5G \
--file-test-mode=seqwr --num-threads=512 \
--file-block-size=8192 --max-requests=100000 --file-io-mode=sync
Before this change:
* 6171 splits (average of 5 test runs)
* 61.508Mb/sec of throughput (average of 5 test runs)
After this change:
* 6036 splits (average of 5 test runs)
* 63.533Mb/sec of throughput (average of 5 test runs)
An ideal test would not just have multiple threads/processes writing
to a file (insertion of file extent items) but also do other operations
that result in insertion of items with varied sizes, like file/directory
creations, creation of links, symlinks, xattrs, etc.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
After an ordered extent completes, don't blindly reset the
inode's ordered tree last accessed ordered extent pointer.
While running the xfstests I noticed that about 29% of the
time the ordered extent to which tree->last pointed was not
the same as our just completed ordered extent. After that I
ran the following sysbench test (after a prepare phase) and
noticed that about 68% of the time tree->last pointed to
a different ordered extent too.
sysbench --test=fileio --file-num=32 --file-total-size=4G \
--file-test-mode=rndwr --num-threads=512 \
--file-block-size=32768 --max-time=60 --max-requests=0 run
Therefore reset tree->last on ordered extent removal only if
it pointed to the ordered extent we're removing from the tree.
Results from 4 runs of the following test before and after
applying this patch:
$ sysbench --test=fileio --file-num=32 --file-total-size=4G \
--file-test-mode=seqwr --num-threads=512 \
--file-block-size=32768 --max-time=60 --file-io-mode=sync prepare
$ sysbench --test=fileio --file-num=32 --file-total-size=4G \
--file-test-mode=seqwr --num-threads=512 \
--file-block-size=32768 --max-time=60 --file-io-mode=sync run
Before this path:
run 1 - 64.049Mb/sec
run 2 - 63.455Mb/sec
run 3 - 64.656Mb/sec
run 4 - 63.833Mb/sec
After this patch:
run 1 - 66.149Mb/sec
run 2 - 68.459Mb/sec
run 3 - 66.338Mb/sec
run 4 - 66.176Mb/sec
With random writes (--file-test-mode=rndwr) I had huge fluctuations
on the results (+- 35% easily).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe noticed that we were leaking the features attribute group
after umount. His fix of just calling sysfs_remove_group() wasn't enough
since that removes just the supported features and not the unsupported
features.
This patch changes the unknown feature handling to add them individually
so we can skip the kmalloc and uses the same iteration to tear them down
later.
We also fix the error handling during mount so that we catch the
failing creation of the per-super kobject, and handle proper teardown
of a half-setup sysfs context.
Tested properly with kmemleak enabled this time.
Reported-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Tested-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch fixes the following warnings:
fs/btrfs/extent-tree.c:6201:12: sparse: symbol 'get_raid_name' was not declared. Should it be static?
fs/btrfs/extent-tree.c:8430:9: error: format not a string literal and no format arguments [-Werror=format-security] get_raid_name(index));
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The inode eviction can be very slow, because during eviction we
tell the VFS to truncate all of the inode's pages. This results
in calls to btrfs_invalidatepage() which in turn does calls to
lock_extent_bits() and clear_extent_bit(). These calls result in
too many merges and splits of extent_state structures, which
consume a lot of time and cpu when the inode has many pages. In
some scenarios I have experienced umount times higher than 15
minutes, even when there's no pending IO (after a btrfs fs sync).
A quick way to reproduce this issue:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ cd /mnt/btrfs
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
--file-test-mode=seqwr --num-threads=128 \
--file-block-size=16384 --max-time=60 --max-requests=0 run
$ time btrfs fi sync .
FSSync '.'
real 0m25.457s
user 0m0.000s
sys 0m0.092s
$ cd ..
$ time umount /mnt/btrfs
real 1m38.234s
user 0m0.000s
sys 1m25.760s
The same test on ext4 runs much faster:
$ mkfs.ext4 /dev/sdb3
$ mount /dev/sdb3 /mnt/ext4
$ cd /mnt/ext4
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
--file-test-mode=seqwr --num-threads=128 \
--file-block-size=16384 --max-time=60 --max-requests=0 run
$ sync
$ cd ..
$ time umount /mnt/ext4
real 0m3.626s
user 0m0.004s
sys 0m3.012s
After this patch, the unmount (inode evictions) is much faster:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ cd /mnt/btrfs
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
--file-test-mode=seqwr --num-threads=128 \
--file-block-size=16384 --max-time=60 --max-requests=0 run
$ time btrfs fi sync .
FSSync '.'
real 0m26.774s
user 0m0.000s
sys 0m0.084s
$ cd ..
$ time umount /mnt/btrfs
real 0m1.811s
user 0m0.000s
sys 0m1.564s
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We hit a forever loop when doing balance relocation,the reason
is that we firstly reserve 4M(node size is 16k).and within transaction
we will try to add extra reservation for snapshot roots,this will
return -EAGAIN if there has been a thread flushing space to reserve
space.We will do this again and again with filesystem becoming nearly
full.
If the above '-EAGAIN' case happens, we try to refill reservation more
outsize of transaction, and this will return eariler in enospc case,however,
this dosen't really hurt because it makes no sense doing balance relocation
with the filesystem nearly full.
Miao Xie helped a lot to track this issue, thanks.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
If the ordered extent's last byte was 1 less than our region's
start byte, we would unnecessarily wait for the completion of
that ordered extent, because it doesn't intersect our target
range.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we ran sysbench on the fs with compression, the following WARN_ONs were
triggered:
fs/btrfs/inode.c:7829 WARN_ON(BTRFS_I(inode)->outstanding_extents);
fs/btrfs/inode.c:7830 WARN_ON(BTRFS_I(inode)->reserved_extents);
fs/btrfs/inode.c:7832 WARN_ON(BTRFS_I(inode)->csum_bytes);
Steps to reproduce:
# mkfs.btrfs -f <dev>
# mount -o compress <dev> <mnt>
# cd <mnt>
# sysbench --test=fileio --num-threads=8 --file-total-size=8G \
> --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
> --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
> --file-test-mode=sync prepare
# cd -
# umount <mnt>
# mount -o compress <dev> <mnt>
# cd <mnt>
# sysbench --test=fileio --num-threads=8 --file-total-size=8G \
> --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
> --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
> --file-test-mode=sync run
# cd -
# umount <mnt>
The reason of this problem is:
Task0 Task1
btrfs_direct_IO
unlock(&inode->i_mutex)
lock(&inode->i_mutex)
reserve_space()
prepare_pages()
lock_extent()
clear_extent()
unlock_extent()
lock_extent()
test_extent(uptodate)
return false
copy_data()
set_delalloc_extent()
extent need compress
go back to buffered write
clear_extent(DELALLOC | DIRTY)
unlock_extent()
Task 0 and 1 wrote the same place, and task0 cleared the delalloc flag which
was set by task1, it made the dirty pages in that extents couldn't be flushed
into the disk, so the reserved space for that extent was not released at
the end.
This patch fixes the above bug by unlocking the extent after the delalloc.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
- the caller has gotten the inode object, needn't pass the file object.
And if so, we needn't define a inode pointer variant.
- the position should be aligned by the page size not sector size, so
we also needn't pass the root object into prepare_pages().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We don't need to crash hard here, it's just reading a sysfs file. The
values considered in switch are from a fixed set, the default case
should not happen at all.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Added in patch "btrfs: add ability to change features via sysfs",
modifications to superblock don't need to reserve metadata blocks when
starting a transaction.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The kernel macro pr_debug is defined as a empty statement when DEBUG is
not defined. Make btrfs_debug match pr_debug to avoid spamming
the kernel log with debug messages
Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Found by uselex.rb:
> btrfs_get_inode_ref_index: [R]: exported from:
fs/btrfs/inode-item.o fs/btrfs/btrfs.o fs/btrfs/built-in.o
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: David Stebra <dsterba@suse.cz>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This is the third step in bootstrapping the btrfs_find_item interface.
The function find_orphan_item(), in orphan.c, is similar to the two
functions already replaced by the new interface. It uses two parameters,
which are already present in the interface, and is nearly identical to
the function brought in in the previous patch.
Replace the two calls to find_orphan_item() with calls to
btrfs_find_item(), with the defined objectid and type that was used
internally by find_orphan_item(), a null path, and a null key. Add a
test for a null path to btrfs_find_item, and if it passes, allocate and
free the path. Finally, remove find_orphan_item().
Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch is the second step in bootstrapping the btrfs_find_item
interface. The btrfs_find_root_ref() is similar to the former
__inode_info(); it accepts four of its parameters, and duplicates the
first half of its functionality.
Replace the one former call to btrfs_find_root_ref() with a call to
btrfs_find_item(), along with the defined key type that was used
internally by btrfs_find_root ref, and a null found key. In
btrfs_find_item(), add a test for the null key at the place where
the functionality of btrfs_find_root_ref() ends; btrfs_find_item()
then returns if the test passes. Finally, remove btrfs_find_root_ref().
Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Suggested-by: Zach Brown <zab@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
There are many btrfs functions that manually search the tree for an
item. They all reimplement the same mechanism and differ in the
conditions that they use to find the item. __inode_info() is one such
example. Zach Brown proposed creating a new interface to take the place
of these functions.
This patch is the first step to creating the interface. A new function,
btrfs_find_item, has been added to ctree.c and prototyped in ctree.h.
It is identical to __inode_info, except that the order of the parameters
has been rearranged to more closely those of similar functions elsewhere
in the code (now, root and path come first, then the objectid, offset
and type, and the key to be filled in last). __inode_info's callers have
been set to call this new function instead, and __inode_info itself has
been removed.
Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Suggested-by: Zach Brown <zab@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Use otherwise unused local variables slot in update_qgroup_limit_item and
in update_qgroup_info_item, and remove unused variable ins from
btrfs_qgroup_account_ref.
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
The variable window_start in setup_cluster_no_bitmap is not used since commit
1bb91902dc
(Btrfs: revamp clustered allocation logic)
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Remove unused variables:
* tree from end_bio_extent_writepage,
* item from extent_fiemap.
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
The variable found_uncached_bg in find_free_extent is not used since commit
285ff5af6c
(Btrfs: remove the ideal caching code)
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Remove unused variables:
* tree from csum_dirty_buffer,
* tree from btree_readpage_end_io_hook,
* tree from btree_writepages,
* bytenr from btrfs_create_tree,
* fs_info from end_workqueue_fn.
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Variable owner in btrfs_new_inode is unused since commit
d82a6f1d7e
(Btrfs: kill BTRFS_I(inode)->block_group)
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This adds a writeable attribute which describes the label.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Now that we have the infrastructure for per-super attributes, we can
publish device membership in /sys/fs/btrfs/<fsid>/devices. The information
is published as symlinks to the block devices.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
While trying to debug ENOSPC issues, it's helpful to understand what the
kernel's view of the available space is. We export this information
via ioctl, but sysfs files are more easily used.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs filesystem df output will show the size of the metadata space
and how much of it is used, and the user assumes that the difference
is all usable space. Since that's not actually the case due to the
global metadata reservation, we should provide the full picture to the
user.
This patch adds an ioctl that exports the size of the global metadata
reservation so that btrfs filesystem df can report it.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Now that we have the feature name strings available in the kernel via
the sysfs attributes, we can use them for printing better failure
messages from the ioctl path.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch adds the ability to change (set/clear) features while the file
system is mounted. A bitmask is added for each feature set for the
support to set and clear the bits. A message indicating which bit
has been set or cleared is issued when it's been changed and also when
permission or support for a particular bit has been denied.
Since the the attributes can now be writable, we need to introduce
another struct attribute to hold the different permissions.
If neither set or clear is supported, the file will have 0444 permissions.
If either set or clear is supported, the file will have 0644 permissions
and the store handler will filter out the write based on the bitmask.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
With the compat and compat-ro bits, it's possible for file systems to
exist that have features that aren't supported by the kernel's file system
implementation yet still be mountable.
This patch publishes read-only info on those features using a prefix:number
format, where the number is the bit number rather than the shifted value.
e.g. "compat:12"
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch publishes information on which features are enabled in the
file system on a per-super basis. At this point, it only publishes
information on features supported by the file system implementation.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch adds per-super attributes to sysfs.
It doesn't publish any attributes yet, but does the proper lifetime
handling as well as the basic infrastructure to add new attributes.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch adds the ability to publish supported features to sysfs under
/sys/fs/btrfs/features.
The files are module-wide and export which features the kernel supports.
The content, for now, is just "0\n".
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
There are some feature bits that require no offline setup and can
be enabled online. I've only reviewed extended irefs, but there will
probably be more.
We introduce three new ioctls:
- BTRFS_IOC_GET_SUPPORTED_FEATURES: query the kernel for supported features.
- BTRFS_IOC_GET_FEATURES: query the kernel for enabled features on a per-fs
basis, as well as querying for which features are changeable with mounted.
- BTRFS_IOC_SET_FEATURES: change features on a per-fs basis.
We introduce two new masks per feature set (_SAFE_SET and _SAFE_CLEAR) that
allow us to define which features are safe to change at runtime.
The failure modes for BTRFS_IOC_SET_FEATURES are as follows:
- Enabling a completely unsupported feature: warns and returns -ENOTSUPP
- Enabling a feature that can only be done offline: warns and returns -EPERM
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we have data deduplication on, we'll hang on the merge part
because it needs to verify every queued delayed data refs related to
this disk offset but we may have millions refs.
And in the case of delayed data refs, we don't usually have too much
data refs to merge.
So it's safe to shut it down for data refs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
The way how we process delayed refs is
1) get a bunch of head refs,
2) pick up one head ref,
3) go one node back for any delayed ref updates.
The head ref is also linked in the same rbtree as the delayed ref is,
so in 1) stage, we have to walk one by one including not only head refs, but
delayed refs.
When we have a great number of delayed refs pending to process,
this'll cost time a lot.
Here we introduce a head ref specific rbtree, it only has head refs, so troubles
go away.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
We were looking at file_extent_num_bytes unconditionally when looking at
referenced data bytes, but this isn't correct for compression. Fix this by
checking the compression of the file extent we are and setting num_bytes to
disk_num_bytes in the case of compression so that we are marking the proper
bytes as referenced. This fixes check_int_data freaking out when running
btrfs/004. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Btrfs has always had these filler extent data items for holes in inodes. This
has made somethings very easy, like logging hole punches and sending hole
punches. However for large holey files these extent data items are pure
overhead. So add an incompatible feature to no longer add hole extents to
reduce the amount of metadata used by these sort of files. This has a few
changes for logging and send obviously since they will need to detect holes and
log/send the holes if there are any. I've tested this thoroughly with xfstests
and it doesn't cause any issues with and without the incompat format set.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull vfs updates from Al Viro:
"Assorted stuff; the biggest pile here is Christoph's ACL series. Plus
assorted cleanups and fixes all over the place...
There will be another pile later this week"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits)
__dentry_path() fixes
vfs: Remove second variable named error in __dentry_path
vfs: Is mounted should be testing mnt_ns for NULL or error.
Fix race when checking i_size on direct i/o read
hfsplus: remove can_set_xattr
nfsd: use get_acl and ->set_acl
fs: remove generic_acl
nfs: use generic posix ACL infrastructure for v3 Posix ACLs
gfs2: use generic posix ACL infrastructure
jfs: use generic posix ACL infrastructure
xfs: use generic posix ACL infrastructure
reiserfs: use generic posix ACL infrastructure
ocfs2: use generic posix ACL infrastructure
jffs2: use generic posix ACL infrastructure
hfsplus: use generic posix ACL infrastructure
f2fs: use generic posix ACL infrastructure
ext2/3/4: use generic posix ACL infrastructure
btrfs: use generic posix ACL infrastructure
fs: make posix_acl_create more useful
fs: make posix_acl_chmod more useful
...
Also don't bother to set up a .get_acl method for symlinks as we do not
support access control (ACLs or even mode bits) for symlinks in Linux.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rename the current posix_acl_created to __posix_acl_create and add
a fully featured helper to set up the ACLs on file creation that
uses get_acl().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rename the current posix_acl_chmod to __posix_acl_chmod and add
a fully featured ACL chmod helper that uses the ->set_acl inode
operation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* don't assume that ->dest_count won't change between copy_from_user()
and memdup_user()
* use fdget instead of fget
* don't bother comparing superblocks when we'd already compared vfsmounts
* get rid of excessive goto
* use file_inode() instead of open-coding the sucker
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For 3.14-rc1 there are fixes in the areas of remote attributes, discard,
growfs, memory leaks in recovery, directory v2, quotas, the MAINTAINERS
file, allocation alignment, extent list locking, and in
xfs_bmapi_allocate. There are cleanups in xfs_setsize_buftarg, removing
unused macros, quotas, setattr, and freeing of inode clusters. The
in-memory and on-disk log format have been decoupled, a common helper to
calculate the number of blocks in an inode cluster has been added, and
handling of i_version has been pulled into the filesystems that use it.
- cleanup in xfs_setsize_buftarg
- removal of remaining unused flags for vop toss/flush/flushinval
- fix for memory corruption in xfs_attrlist_by_handle
- fix for out-of-date comment in xfs_trans_dqlockedjoin
- fix for discard if range length is less than one block
- fix for overrun of agfl buffer using growfs on v4 superblock filesystems
- pull i_version handling out into the filesystems that use it
- don't leak recovery items on error
- fix for memory leak in xfs_dir2_node_removename
- several cleanups for quotas
- fix bad assertion in xfs_qm_vop_create_dqattach
- cleanup for xfs_setattr_mode, and add xfs_setattr_time
- fix quota assert in xfs_setattr_nonsize
- fix an infinite loop when turning off group/project quota before user
quota
- fix for temporary buffer allocation failure in xfs_dir2_block_to_sf
with large directory block sizes
- fix Dave's email address in MAINTAINERS
- cleanup calculation of freed inode cluster blocks
- fix alignment of initial file allocations to match filesystem geometry
- decouple in-memory and on-disk log format
- introduce a common helper to calculate the number of filesystem
blocks in an inode cluster
- fixes for extent list locking
- fix for off-by-one in xfs_attr3_rmt_verify
- fix for missing destroy_work_on_stack in xfs_bmapi_allocate
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJS4C2sAAoJENaLyazVq6ZOF/UP/A/FmscsEbgz+KHtsg1UXP+g
EMkrD8WOOzLnm7/GhkfvmmFLQrWrrNPmiP54MPJ/rxpfDb7rV60HgS0iHm13U+IR
0uPyk2NIQKG/Sj6FHCrgn4oh19B0tmer6lLE32UPrlNM7w/0NRm32Ms/jKz7WNvc
9Tqw3qNdtmP7leHG8EyD1ISzqUz6FNSC+qGyOTuBQUqG24LqsmZ1qD2Nw/HEz0ir
1YCqS+cjj8c3WaWMsdjEFdCvEIKS/sMYM+ZihATIl/5ggwtVobP7PH/4cG8Biby3
5wvcl3/jM+xjgGDC1YMQQeSADieus6ER4aZTjCvGLn9RajQTOBjP0QZTu2OHntTI
kD/52DfYErthGijS2qJcqXy11AaYtWQU2yTZXuMpaACIM1DDSD4NyGFP99BzbBaX
D4BgnC+vmfpbXO37PmophkeZwAvj+9K2BBG7X+g4sLynj/BtmvFCxIyK+SLHE3hN
kn+Pn03yTw0VGgvj0krXSllbYKE0LyLElaA6h0LejQgcOZM0W6Q8qaj+RODLitbw
HkQNKYCOMCzbdNu8rKAfup9UPZloiMukyMdMaw1MeCgCQSQLkZVxTegmVW9gzpIh
QJEARDaOnKB/G/CLCwbPZR09DBpg3yBt/eHjt0VnV+z2x+u9DTInAp6f013g7E/W
gU+vnBBPY1AYI8iiDraU
=t8nB
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-v3.14-rc1' of git://oss.sgi.com/xfs/xfs
Pull xfs update from Ben Myers:
"This is primarily bug fixes, many of which you already have. New
stuff includes a series to decouple the in-memory and on-disk log
format, helpers in the area of inode clusters, and i_version handling.
We decided to try to use more topic branches this release, so there
are some merge commits in there on account of that. I'm afraid I
didn't do a good job of putting meaningful comments in the first
couple of merges. Sorry about that. I think I have the hang of it
now.
For 3.14-rc1 there are fixes in the areas of remote attributes,
discard, growfs, memory leaks in recovery, directory v2, quotas, the
MAINTAINERS file, allocation alignment, extent list locking, and in
xfs_bmapi_allocate. There are cleanups in xfs_setsize_buftarg,
removing unused macros, quotas, setattr, and freeing of inode
clusters. The in-memory and on-disk log format have been decoupled, a
common helper to calculate the number of blocks in an inode cluster
has been added, and handling of i_version has been pulled into the
filesystems that use it.
- cleanup in xfs_setsize_buftarg
- removal of remaining unused flags for vop toss/flush/flushinval
- fix for memory corruption in xfs_attrlist_by_handle
- fix for out-of-date comment in xfs_trans_dqlockedjoin
- fix for discard if range length is less than one block
- fix for overrun of agfl buffer using growfs on v4 superblock
filesystems
- pull i_version handling out into the filesystems that use it
- don't leak recovery items on error
- fix for memory leak in xfs_dir2_node_removename
- several cleanups for quotas
- fix bad assertion in xfs_qm_vop_create_dqattach
- cleanup for xfs_setattr_mode, and add xfs_setattr_time
- fix quota assert in xfs_setattr_nonsize
- fix an infinite loop when turning off group/project quota before
user quota
- fix for temporary buffer allocation failure in xfs_dir2_block_to_sf
with large directory block sizes
- fix Dave's email address in MAINTAINERS
- cleanup calculation of freed inode cluster blocks
- fix alignment of initial file allocations to match filesystem
geometry
- decouple in-memory and on-disk log format
- introduce a common helper to calculate the number of filesystem
blocks in an inode cluster
- fixes for extent list locking
- fix for off-by-one in xfs_attr3_rmt_verify
- fix for missing destroy_work_on_stack in xfs_bmapi_allocate"
* tag 'xfs-for-linus-v3.14-rc1' of git://oss.sgi.com/xfs/xfs: (51 commits)
xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
xfs: fix off-by-one error in xfs_attr3_rmt_verify
xfs: assert that we hold the ilock for extent map access
xfs: use xfs_ilock_attr_map_shared in xfs_attr_list_int
xfs: use xfs_ilock_attr_map_shared in xfs_attr_get
xfs: use xfs_ilock_data_map_shared in xfs_qm_dqiterate
xfs: use xfs_ilock_data_map_shared in xfs_qm_dqtobp
xfs: take the ilock around xfs_bmapi_read in xfs_zero_remaining_bytes
xfs: reinstate the ilock in xfs_readdir
xfs: add xfs_ilock_attr_map_shared
xfs: rename xfs_ilock_map_shared
xfs: remove xfs_iunlock_map_shared
xfs: no need to lock the inode in xfs_find_handle
xfs: use xfs_icluster_size_fsb in xfs_imap
xfs: use xfs_icluster_size_fsb in xfs_ifree_cluster
xfs: use xfs_icluster_size_fsb in xfs_ialloc_inode_init
xfs: use xfs_icluster_size_fsb in xfs_bulkstat
xfs: introduce a common helper xfs_icluster_size_fsb
xfs: get rid of XFS_IALLOC_BLOCKS macros
xfs: get rid of XFS_INODE_CLUSTER_SIZE macros
...
In btrfs_end_bio(), we increment bi_remaining if is_orig_bio. If not,
we restore the orig_bio but failed to increment bi_remaining for
orig_bio, which triggers a BUG_ON later when bio_endio is called. Fix
is to increment bi_remaining when we restore the orig bio as well.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
CC: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Muthukumar Ratty <muthur@gmail.com>
Reviewed-by: Chris Mason <clm@fb.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch fixed several typo in printk from various
part of kernel source.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA
189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m
LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7
k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an
eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ
YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU=
=/4/R
-----END PGP SIGNATURE-----
Merge tag 'v3.13-rc6' into for-3.14/core
Needed to bring blk-mq uptodate, since changes have been going in
since for-3.14/core was established.
Fixup merge issues related to the immutable biovec changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
block/blk-flush.c
fs/btrfs/check-integrity.c
fs/btrfs/extent_io.c
fs/btrfs/scrub.c
fs/logfs/dev_bdev.c
Correct spelling typo in various part of kernel
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Pull btrfs fixes from Chris Mason:
"This is a small collection of fixes. It was rebased this morning, but
I was just fixing signed-off-by tags with the wrong email"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix access_ok() check in btrfs_ioctl_send()
Btrfs: make sure we cleanup all reloc roots if error happens
Btrfs: skip building backref tree for uuid and quota tree when doing balance relocation
Btrfs: fix an oops when doing balance relocation
Btrfs: don't miss skinny extent items on delayed ref head contention
btrfs: call mnt_drop_write after interrupted subvol deletion
Btrfs: don't clear the default compression type
The closing parenthesis is in the wrong place. We want to check
"sizeof(*arg->clone_sources) * arg->clone_sources_count" instead of
"sizeof(*arg->clone_sources * arg->clone_sources_count)".
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org
I hit an oops when merging reloc roots fails, the reason is that
new reloc roots may be added and we should make sure we cleanup
all reloc roots.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Quota tree and UUID Tree is only cowed, they can not be snapshoted.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
I hit an oops when inserting reloc root into @reloc_root_tree(it can be
easily triggered when forcing cow for relocation root)
[ 866.494539] [<ffffffffa0499579>] btrfs_init_reloc_root+0x79/0xb0 [btrfs]
[ 866.495321] [<ffffffffa044c240>] record_root_in_trans+0xb0/0x110 [btrfs]
[ 866.496109] [<ffffffffa044d758>] btrfs_record_root_in_trans+0x48/0x80 [btrfs]
[ 866.496908] [<ffffffffa0494da8>] select_reloc_root+0xa8/0x210 [btrfs]
[ 866.497703] [<ffffffffa0495c8a>] do_relocation+0x16a/0x540 [btrfs]
This is because reloc root inserted into @reloc_root_tree is not within one
transaction,reloc root may be cowed and root block bytenr will be reused then
oops happens.We should update reloc root in @reloc_root_tree when cow reloc
root node, fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently extent-tree.c:btrfs_lookup_extent_info() can miss the lookup
of skinny extent items. This can happen when the execution flow is the
following:
* We do an extent tree lookup and fail to find a skinny extent item;
* As a result, we attempt to see if a non-skinny extent item exists,
either by looking at previous item in the leaf or by doing another
full extent tree search;
* We have a transaction and then we check for a matching delayed ref
head in the transaction's delayed refs rbtree;
* We find such delayed ref head and then we try to lock it with a
call to mutex_trylock();
* The lock was contended so we jump to the label "again", which repeats
the extent tree search but for a non-skinny extent item, because we set
previously metadata variable to 0 and the search key to look for a
non-skinny extent-item;
* After the jump (and after releasing the transaction's delayed refs
lock), a skinny extent item might have been added to the extent tree
but we will miss it because metadata is set to 0 and the search key
is set for a non-skinny extent-item.
The fix here is to not reset metadata to 0 and to jump to the initial search
key setup if the delayed ref head is contended, instead of jumping directly
to the extent tree search label ("again").
This issue was found while investigating the issue reported at Bugzilla 64961.
David Sterba suspected this function was missing extent items, and that
this could be caused by the last change to this function, which was made
in the following patch:
[PATCH] Btrfs: optimize btrfs_lookup_extent_info()
(commit 74be951087)
But in fact this issue already existed before, because after failing to find
a skinny extent item, the code set the search key for a non-skinny extent
item, and on contention of a matching delayed ref head it would not search
the extent tree for a skinny extent item anymore.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
If btrfs_ioctl_snap_destroy blocks on the mutex and the process is
killed, mnt_write count is unbalanced and leads to unmountable
filesystem.
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
We met a oops caused by the wrong compression type:
[ 556.512356] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 556.512370] IP: [<ffffffff811dbaa0>] __list_del_entry+0x1/0x98
[SNIP]
[ 556.512490] [<ffffffff811dbb44>] ? list_del+0xd/0x2b
[ 556.512539] [<ffffffffa05dd5ce>] find_workspace+0x97/0x175 [btrfs]
[ 556.512546] [<ffffffff813c14b5>] ? _raw_spin_lock+0xe/0x10
[ 556.512576] [<ffffffffa05de276>] btrfs_compress_pages+0x2d/0xa2 [btrfs]
[ 556.512601] [<ffffffffa05af060>] compress_file_range.constprop.54+0x1f2/0x4e8 [btrfs]
[ 556.512627] [<ffffffffa05af388>] async_cow_start+0x32/0x4d [btrfs]
[ 556.512655] [<ffffffffa05cc7a1>] worker_loop+0x144/0x4c3 [btrfs]
[ 556.512661] [<ffffffff81059404>] ? finish_task_switch+0x80/0xb8
[ 556.512689] [<ffffffffa05cc65d>] ? btrfs_queue_worker+0x244/0x244 [btrfs]
[ 556.512695] [<ffffffff8104fa4e>] kthread+0x8d/0x95
[ 556.512699] [<ffffffff81050000>] ? bit_waitqueue+0x34/0x7d
[ 556.512704] [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65
[ 556.512709] [<ffffffff813c7eec>] ret_from_fork+0x7c/0xb0
[ 556.512713] [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65
Steps to reproduce:
# mkfs.btrfs -f <dev>
# mount -o nodatacow <dev> <mnt>
# touch <mnt>/<file>
# chattr =c <mnt>/<file>
# dd if=/dev/zero of=<mnt>/<file> bs=1M count=10
It is because we cleared the default compression type when setting the
nodatacow. In fact, we needn't do it because we have used COMPRESS flag to
indicate if we need compressed the file data or not, needn't use the
variant -- compress_type -- in btrfs_info to do the same thing, and just
use it to hold the default compression type. Or we would get a wrong compress
type for a file whose own compress flag is set but the compress flag of its
filesystem is not set.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull block layer fixes from Jens Axboe:
"A small collection of fixes for the current series. It contains:
- A fix for a use-after-free of a request in blk-mq. From Ming Lei
- A fix for a blk-mq bug that could attempt to dereference a NULL rq
if allocation failed
- Two xen-blkfront small fixes
- Cleanup of submit_bio_wait() type uses in the kernel, unifying
that. From Kent
- A fix for 32-bit blkg_rwstat reading. I apologize for this one
looking mangled in the shortlog, it's entirely my fault for missing
an empty line between the description and body of the text"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: fix use-after-free of request
blk-mq: fix dereference of rq->mq_ctx if allocation fails
block: xen-blkfront: Fix possible NULL ptr dereference
xen-blkfront: Silence pfn maybe-uninitialized warning
block: submit_bio_wait() conversions
Update of blkg_stat and blkg_rwstat may happen in bh context
Currently notify_change directly updates i_version for size updates,
which not only is counter to how all other fields are updated through
struct iattr, but also breaks XFS, which need inode updates to happen
under its own lock, and synchronized to the structure that gets written
to the log.
Remove the update in the common code, and it to btrfs and ext4,
XFS already does a proper updaste internally and currently gets a
double update with the existing code.
IMHO this is 3.13 and -stable material and should go in through the XFS
tree.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
btrfs bits got lost in the rebase
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It was being open coded in a few places.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With immutable biovecs we don't want code accessing bi_io_vec directly -
the uses this patch changes weren't incorrect since they all own the
bio, but it makes the code harder to audit for no good reason - also,
this will help with multipage bvecs later.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
It was being open coded in a few places.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
Pull btrfs fixes from Chris Mason:
"Almost all of these are bug fixes. Dave Sterba's documentation update
is the big exception because he removed our promises to set any
machine running Btrfs on fire"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Documentation: filesystems: update btrfs tools section
Documentation: filesystems: add new btrfs mount options
btrfs: update kconfig help text
btrfs: fix bio_size_ok() for max_sectors > 0xffff
btrfs: Use trace condition for get_extent tracepoint
btrfs: fix typo in the log message
Btrfs: fix list delete warning when removing ordered root from the list
Btrfs: print bytenr instead of page pointer in check-int
Btrfs: remove dead codes from ctree.h
Btrfs: don't wait for ordered data outside desired range
Btrfs: fix lockdep error in async commit
Btrfs: avoid heavy operations in btrfs_commit_super
Btrfs: fix __btrfs_start_workers retval
Btrfs: disable online raid-repair on ro mounts
Btrfs: do not inc uncorrectable_errors counter on ro scrubs
Btrfs: only drop modified extents if we logged the whole inode
Btrfs: make sure to copy everything if we rename
Btrfs: don't BUG_ON() if we get an error walking backrefs
Reflect the current status. Portions of the text taken from the
wiki pages.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The data type of max_sectors in queue settings is unsigned int. But
this value is stored to the local variable whose type is unsigned short
in bio_size_ok(). This can cause unexpected result when max_sectors >
0xffff.
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Doing an if statement to test some condition to know if we should
trigger a tracepoint is pointless when tracing is disabled. This just
adds overhead and wastes a branch prediction. This is why the
TRACE_EVENT_CONDITION() was created. It places the check inside the jump
label so that the branch does not happen unless tracing is enabled.
That is, instead of doing:
if (em)
trace_btrfs_get_extent(root, em);
Which is basically this:
if (em)
if (static_key(trace_btrfs_get_extent)) {
Using a TRACE_EVENT_CONDITION() we can just do:
trace_btrfs_get_extent(root, em);
And the condition trace event will do:
if (static_key(trace_btrfs_get_extent)) {
if (em) {
...
The static key is a non conditional jump (or nop) that is faster than
having to check if em is NULL or not.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Commit b02441999e "Btrfs: don't wait for
the completion of all the ordered extents" introduced a bug that broke
the ordered root list:
WARNING: CPU: 1 PID: 7119 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
It is because we forgot to return the roots in the splice list to the
ordered list of the fs. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The page pointer information was useless. The bytenr is what you
want when you search for submitted write bios.
Additionally, a new bit in the print mask is added that allows
to selectively enable the check-int submit_bio verbose mode. Before,
the global verbose mode had to be enabled leading to many million
useless lines in the kernel log.
And a comment is added that explains that LOG_BUF_SHIFT needs to
be set to a really high value.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
These two functions are only stated but undefined.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In btrfs_wait_ordered_range(), if we found an extent to the left
of the start of our desired wait range and the last byte of that
extent is 1 less than the desired range's start, we would would
wait for the IO completion of that extent unnecessarily.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Lockdep complains about btrfs's async commit:
[ 2372.462171] [ BUG: bad unlock balance detected! ]
[ 2372.462191] 3.12.0+ #32 Tainted: G W
[ 2372.462209] -------------------------------------
[ 2372.462228] ceph-osd/14048 is trying to release lock (sb_internal) at:
[ 2372.462275] [<ffffffffa022cb10>] btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
[ 2372.462305] but there are no more locks to release!
[ 2372.462324]
[ 2372.462324] other info that might help us debug this:
[ 2372.462349] no locks held by ceph-osd/14048.
[ 2372.462367]
[ 2372.462367] stack backtrace:
[ 2372.462386] CPU: 2 PID: 14048 Comm: ceph-osd Tainted: G W 3.12.0+ #32
[ 2372.462414] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011
[ 2372.462455] ffffffffa022cb10 ffff88007490fd28 ffffffff816f094a ffff8800378aa320
[ 2372.462491] ffff88007490fd50 ffffffff810adf4c ffff8800378aa320 ffff88009af97650
[ 2372.462526] ffffffffa022cb10 ffff88007490fd88 ffffffff810b01ee ffff8800898c0000
[ 2372.462562] Call Trace:
[ 2372.462584] [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
[ 2372.462619] [<ffffffff816f094a>] dump_stack+0x45/0x56
[ 2372.462642] [<ffffffff810adf4c>] print_unlock_imbalance_bug+0xec/0x100
[ 2372.462677] [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
[ 2372.462710] [<ffffffff810b01ee>] lock_release+0x18e/0x210
[ 2372.462742] [<ffffffffa022cb36>] btrfs_commit_transaction_async+0x1d6/0x2a0 [btrfs]
[ 2372.462783] [<ffffffffa025a7ce>] btrfs_ioctl_start_sync+0x3e/0xc0 [btrfs]
[ 2372.462822] [<ffffffffa025f1d3>] btrfs_ioctl+0x4c3/0x1f70 [btrfs]
[ 2372.462849] [<ffffffff812c0321>] ? avc_has_perm+0x121/0x1b0
[ 2372.462873] [<ffffffff812c0224>] ? avc_has_perm+0x24/0x1b0
[ 2372.462897] [<ffffffff8107ecc8>] ? sched_clock_cpu+0xa8/0x100
[ 2372.462922] [<ffffffff8117b145>] do_vfs_ioctl+0x2e5/0x4e0
[ 2372.462946] [<ffffffff812c19e6>] ? file_has_perm+0x86/0xa0
[ 2372.462969] [<ffffffff8117b3c1>] SyS_ioctl+0x81/0xa0
[ 2372.462991] [<ffffffff817045a4>] tracesys+0xdd/0xe2
====================================================
It's because that we don't do the right thing when checking if it's ok to
tell lockdep that we're trying to release the rwsem.
If the trans handle's type is TRANS_ATTACH, we won't acquire the freeze rwsem, but
as TRANS_ATTACH fits the check (trans < TRANS_JOIN_NOLOCK), we'll release the freeze
rwsem, which makes lockdep complains a lot.
Reported-by: Ma Jianpeng <majianpeng@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The 'git blame' history shows that, the old transaction commit code has to do
twice to ensure roots are updated and we have to flush metadata and super block
manually, however, right now all of these can be handled well inside
the transaction commit code without extra efforts.
And the error handling part remains same with the current code, -- 'return to
caller once we get error'.
This saves us a transaction commit and a flush of super block, which are both
heavy operations according to ftrace output analysis.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
__btrfs_start_workers returns 0 in case it raced with
btrfs_stop_workers and lost the race. This is wrong because worker in
this case is not allowed to start and is in fact destroyed. Return
-EINVAL instead.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This disables the "if needed, write the good copy back before the read
is completed" part of the read sequence for read-only mounts.
Cc: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently if we discover an error when scrubbing in ro mode we a)
blindly increment the uncorrectable_errors counter, and b) spam the
dmesg with the 'unable to fixup (regular) error at ...' message, even
though a) we haven't tried to determine if the error is correctable or
not, and b) we haven't tried to fixup anything. Fix this.
Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we fsync, seek and write, rename and then fsync again we will lose the
modified hole extent because the rename will drop all of the modified extents
since we didn't do the fast search. We need to only drop the modified extents
if we didn't do the fast search and we were logging the entire inode as we don't
need them anymore, otherwise this is being premature. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we rename a file that is already in the log and we fsync again we will lose
the new name. This is because we just log the inode update and not the new ref.
To fix this we just need to check if we are logging the new name of the inode
and copy all the metadata instead of just updating the inode itself. With this
patch my testcase now passes. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We can just return false for this so we stop doing the snapshot aware defrag
stuff. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fixes from Chris Mason:
"This pull fixes the empty_zero_page bug that Heiko reported, and
includes one more cleanup from Al Viro"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: get rid of fdentry()
btrfs: fix empty_zero_page misusage
Pull trivial tree updates from Jiri Kosina:
"Usual earth-shaking, news-breaking, rocket science pile from
trivial.git"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
doc: add missing files to timers/00-INDEX
timekeeping: Fix some trivial typos in comments
mm: Fix some trivial typos in comments
irq: Fix some trivial typos in comments
NUMA: fix typos in Kconfig help text
mm: update 00-INDEX
doc: Documentation/DMA-attributes.txt fix typo
DRM: comment: `halve' -> `half'
Docs: Kconfig: `devlopers' -> `developers'
doc: typo on word accounting in kprobes.c in mutliple architectures
treewide: fix "usefull" typo
treewide: fix "distingush" typo
mm/Kconfig: Grammar s/an/a/
kexec: Typo s/the/then/
Documentation/kvm: Update cpuid documentation for steal time and pv eoi
treewide: Fix common typo in "identify"
__page_to_pfn: Fix typo in comment
Correct some typos for word frequency
clk: fixed-factor: Fix a trivial typo
...
Heiko Carstens noticed that btrfs was using empty_zero_page
incorrectly. He explained:
The definition of empty_zero_page is architecture specific. It
is (currently) either a character array, an unsigned long
containing the address of the empty_zero_page, or even worse
only the address of the struct page belonging to the
empty_zero_page.
This commit changes btrfs to use a for-loop instead. On x86
the resulting .ko is smaller, and we're no longer worrying about
how each arch builds its zeros.
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
rename the function -- btrfs_start_all_delalloc_inodes(), and make its
name be compatible to btrfs_wait_ordered_roots(), since they are always
used at the same place.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
It is very likely that there are lots of ordered extents in the filesytem,
if we wait for the completion of all of them when we want to reclaim some
space for the metadata space reservation, we would be blocked for a long
time. The performance would drop down suddenly for a long time.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
It was very likely that there were lots of async delalloc pages in the
filesystem, if we waited until all the pages were flushed, we would be
blocked for a long time, and the performance would also drop down.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In shrink_delalloc(), what we need reclaim is the metadata space, so
flushing pages by to_reclaim is not reasonable, it is very likely that
the pages we flush are not enough. And then we had to invoke the flush
function for several times, at the worst, we need call flush_space for
several times. It wasted time.
We improve this problem by converting the metadata space size we need
reserve to the delalloc bytes, By this way, we can flush the pages
by a reasonable number.
(Now we use a fixed number to do conversion, it is not flexible, maybe
we can find a good way to improve it in the future.)
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch picked up the code that was used to calculate the number of
the items for which we need reserve space, and we will use it in the next
patch.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We only allocate scrub workers if we pass all the necessary
checks, for example, there are no operation in progress.
Besides, move mutex lock protection outside of scrub_workers_get()
/scrub_workers_put().
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I hit this problem with my no holes patch and it made me realize what the
problem was for bz 60834. If the first item in the leaf is an inline extent and
we try to read anything starting from disk_bytenr onward we will read off the
end of the leaf. So we need to check to see what it's type is, and if it's not
REG we can just break out. This should fix this problem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The function write_ctree_super() in disk-io.c uses variable ret to return
the result of function write_all_supers(). Since, this variable serves
no purpose, hence the patch removes it and returns the call of the
called function.
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Fix spacing issues detected via checkpatch.pl in accordance with the
kernel style guidelines.
Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Replace kmalloc(size * nr, ) with kmalloc_array(nr, size), thus making
it easier to check is that the calculation doesn't wrap or return a smaller allocation
Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Enclose macros with complex values within parenthesis in accordance to
checkpatch.pl.
Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Use WARN_ON()'s return value in place of WARN_ON(1) for cleaner source
code that outputs a more descriptive warnings. Also fix the styling
warning of redundant braces that came up as a result of this fix.
Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Remove redundant local zero structure, replacing it by the kernel's
global ZERO_PAGE.
Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pack the structure btrfs_device in volumes.h to eliminate holes detected
by pahole, thus reducing binary memory footprint.
Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch replaces multiple atomic_inc() with atomic_add() in
delayed-inode.c to reduce source code and have few instructions
for compilation.
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The function free_root_pointers() in disk-io.h contains redundant code.
Therefore, this patch adds a helper function free_root_extent_buffers()
to free_root_pointers() to eliminate redundancy.
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Running balance and defrag concurrently can end up with a crash:
kernel BUG at fs/btrfs/relocation.c:4528!
RIP: 0010:[<ffffffffa01ac33b>] [<ffffffffa01ac33b>] btrfs_reloc_cow_block+ 0x1eb/0x230 [btrfs]
Call Trace:
[<ffffffffa01398c1>] ? update_ref_for_cow+0x241/0x380 [btrfs]
[<ffffffffa0180bad>] ? copy_extent_buffer+0xad/0x110 [btrfs]
[<ffffffffa0139da1>] __btrfs_cow_block+0x3a1/0x520 [btrfs]
[<ffffffffa013a0b6>] btrfs_cow_block+0x116/0x1b0 [btrfs]
[<ffffffffa013ddad>] btrfs_search_slot+0x43d/0x970 [btrfs]
[<ffffffffa0153c57>] btrfs_lookup_file_extent+0x37/0x40 [btrfs]
[<ffffffffa0172a5e>] __btrfs_drop_extents+0x11e/0xae0 [btrfs]
[<ffffffffa013b3fd>] ? generic_bin_search.constprop.39+0x8d/0x1a0 [btrfs]
[<ffffffff8117d14a>] ? kmem_cache_alloc+0x1da/0x200
[<ffffffffa0138e7a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[<ffffffffa0173ef0>] btrfs_drop_extents+0x60/0x90 [btrfs]
[<ffffffffa016b24d>] relink_extent_backref+0x2ed/0x780 [btrfs]
[<ffffffffa0162fe0>] ? btrfs_submit_bio_hook+0x1e0/0x1e0 [btrfs]
[<ffffffffa01b8ed7>] ? iterate_inodes_from_logical+0x87/0xa0 [btrfs]
[<ffffffffa016b909>] btrfs_finish_ordered_io+0x229/0xac0 [btrfs]
[<ffffffffa016c3b5>] finish_ordered_fn+0x15/0x20 [btrfs]
[<ffffffffa018cbe5>] worker_loop+0x125/0x4e0 [btrfs]
[<ffffffffa018cac0>] ? btrfs_queue_worker+0x300/0x300 [btrfs]
[<ffffffff81075ea0>] kthread+0xc0/0xd0
[<ffffffff81075de0>] ? insert_kthread_work+0x40/0x40
[<ffffffff8164796c>] ret_from_fork+0x7c/0xb0
[<ffffffff81075de0>] ? insert_kthread_work+0x40/0x40
----------------------------------------------------------------------
It turns out to be that balance operation will bump root's @last_snapshot,
which enables snapshot-aware defrag path, and backref walking stuff will
find data reloc tree as refs' parent, and hit the BUG_ON() during COW.
As data reloc tree's data is just for relocation purpose, and will be deleted right
after relocation is done, it's unnecessary to walk those refs belonged to data reloc
tree, it'd be better to skip them.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If something wrong happens in write endio, running snapshot-aware defragment
can end up with undefined results, maybe a crash, so we should avoid it.
In order to share similar code, this also adds a helper to free the struct for
snapshot-aware defrag.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we get any error while doing a dir index/item lookup in the
log tree, we were always unlinking the corresponding inode in
the subvolume. It makes sense to unlink only if the lookup failed
to find the dir index/item, which corresponds to NULL or -ENOENT,
and not when other errors happen (like a transient -ENOMEM or -EIO).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We were setting the csums search offset and length to the right values if
the extent is compressed, but later on right before doing the csums lookup
we were overriding these two parameters regardless of compression being
set or not for the extent.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We were ignoring the name component of the dir_item. Both the
name and data must fit within BTRFS_MAX_XATTR_SIZE(root).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Originally, we introduced scrub_super_lock to synchronize
tree log code with scrubbing super.
However we can replace scrub_super_lock with device_list_mutex,
because writing super will hold this mutex, this will reduce an extra
lock holding when writing supers in sync log code.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We define a 'int' to get extent's generation by mistake,fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
After running space balance on a new fs, the fs check program outputed the
following warning message:
free space inode generation (0) did not match free space cache generation (20)
Steps to reproduce:
# mkfs.btrfs -f <dev>
# mount <dev> <mnt>
# btrfs balance start <mnt>
# umount <mnt>
# btrfs check <dev>
It was because there was no data space after the space balance, and the free
space write out task didn't try to allocate a new data chunk for the free space
inode when doing the reservation. So the data space reservation failed, and in
order to tell the free space loader that this free space inode could not be
trusted, the generation of the free space inode wasn't updated. Then the check
program found this problem and outputed the above message.
But in fact, it is safe that we try to allocate a new data chunk when we find
the data space is not enough. The patch fixes the above problem by this way.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I noticed with my horrible snapshot excercisor that we were taking forever to
relocate the larger the file system got. This appeared to be because we were
committing the transaction _constantly_. There were a few places where we do
braindead things with metadata reservation, like start a transaction and then
try to refill the block rsv, which not only keeps us from committing a
transaction during the enospc stuff, but keeps us from doing some of the harder
flushing work which will make us more likely to need to commit the transaction.
We also were checking the block rsv and committing the transaction if the block
rsv was below a certain threshold, but we were doing this in a place where we
don't actually keep anything in the block rsv so this was always ending up false
so we always committed the transaction in this case. I tested this to make sure
it didn't break anything, but it takes about 10 hours to get the box to this
state so I don't know how much of an impact it will really make. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When using delalloc workers in a non-waiting way (like for enospc handling) we
can end up not actually waiting for the dirty pages to be started if we have
compression. We need to add an extra filemap flush to make sure any async
extents that have started are actually moved along before returning. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
A user reported a list corruption warning from btrfs_remove_ordered_extent, it
is because we aren't taking the ordered_root_lock when we remove the inode from
the ordered operations list. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is just the write path, the only reason we start a transaction is so we can
check cross references, we don't make any actual changes, so there is no reason
to abort the transaction if we fail. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We can just return an error and we'll bail out properly. We still want to catch
this case to make sure we don't have a bug somewhere, so just warn if this pops
up. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I noticed that if the free space cache has an error writing out it's data it
won't actually error out, it will just carry on. This is because it doesn't
check the return value of btrfs_wait_ordered_range, which didn't actually return
anything. So fix this in order to keep us from making free space cache look
valid when it really isnt. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Apparently we don't actually close the files until we return to userspace, so
stop using vfs_read in send. This is actually better for us since we can avoid
all the extra logic of holding the file we're sending open and making sure to
clean it up. This will fix people who have been hitting too many files open
errors when trying to send. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In mixed-mode, when a data-block was later reused for metadata, a
warning was printed. This condition is now filtered out and the
warning is eliminated in this case.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Yet another cleanup patch broke code for which no xfstest exists.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Instead of doing another extent tree search if the first search failed
to find a metadata item, check if the previous item in the leaf is an
extent item and use it if it is, otherwise do the second tree search
for an extent item.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When debugging ENOSPC issues, it's nice to be able to see which
reservations failed as well as the ones which succeeded.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
fs/btrfs/compat.h only contained trivial macro wrappers of drop_nlink()
and inc_nlink(). This doesn't belong in mainline.
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
move_pages() has an inefficient backwards byte copy of regions of two
different pages. They're different pages so the regions won't overlap
and it could use memcpy().
At that point, though, move_pages() would be a slightly dimmer
re-implementation of copy_pages() that lacked the test for overlapping
page regions.
So remove move_pages() and just call copy_pages().
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When a directory has a default ACL and a subdirectory is created
under that directory, btrfs_init_acl() is called when the
subdirectory's inode is created to initialize the inode's ACL
(inherited from the parent directory) but it was clearing the ACL
from the inode after setting it if posix_acl_create() returned
success, instead of clearing it only if it returned an error.
To reproduce this issue:
$ mkfs.btrfs -f /dev/loop0
$ mount /dev/loop0 /mnt
$ mkdir /mnt/acl
$ setfacl -d --set u::rwx,g::rwx,o::- /mnt/acl
$ getfacl /mnt/acl
user::rwx
group::rwx
other::r-x
default:user::rwx
default:group::rwx
default:other::---
$ mkdir /mnt/acl/dir1
$ getfacl /mnt/acl/dir1
user::rwx
group::rwx
other::---
After unmounting and mounting again the filesystem, fgetacl returned the
expected ACL:
$ umount /mnt/acl
$ mount /dev/loop0 /mnt
$ getfacl /mnt/acl/dir1
user::rwx
group::rwx
other::---
default:user::rwx
default:group::rwx
default:other::---
Meaning that the underlying xattr was persisted.
Reported-by: Giuseppe Fierro <giuseppe@fierro.org>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Due to an off-by-one error, it is possible to reproduce a bug
when the inode cache is used.
The same inode number is assigned twice, the second time this
leads to an EEXIST in btrfs_insert_empty_items().
The issue can happen when a file is removed right after a subvolume
is created and then a new inode number is created before the
inodes in free_inode_pinned are processed.
unlink() calls btrfs_return_ino() which calls start_caching() in this
case which adds [highest_ino + 1, BTRFS_LAST_FREE_OBJECTID] by
searching for the highest inode (which already cannot find the
unlinked one anymore in btrfs_find_free_objectid()). So if this
unlinked inode's number is equal to the highest_ino + 1 (or >= this value
instead of > this value which was the off-by-one error), we mustn't add
the inode number to free_ino_pinned (caching_thread() does it right).
In this case we need to try directly to add the number to the inode_cache
which will fail in this case.
When this inode number is allocated while it is still in free_ino_pinned,
it is allocated and still added to the free inode cache when the
pinned inodes are processed, thus one of the following inode number
allocations will get an inode that is already in use and fail with EEXIST
in btrfs_insert_empty_items().
One example which was created with the reproducer below:
Create a snapshot, work in the newly created snapshot for the rest.
In unlink(inode 34284) call btrfs_return_ino() which calls start_caching().
start_caching() calls add_free_space [34284, 18446744073709517077].
In btrfs_return_ino(), call start_caching pinned [34284, 1] which is wrong.
mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284.
btrfs_unpin_free_ino calls add_free_space [34284, 1].
mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284.
EEXIST when the new inode is inserted.
One possible reproducer is this one:
#!/bin/sh
# preparation
TEST_DEV=/dev/sdc1
TEST_MNT=/mnt
umount ${TEST_MNT} 2>/dev/null || true
mkfs.btrfs -f ${TEST_DEV}
mount ${TEST_DEV} ${TEST_MNT} -o \
rw,relatime,compress=lzo,space_cache,inode_cache
btrfs subv create ${TEST_MNT}/s1
for i in `seq 34027`; do touch ${TEST_MNT}/s1/${i}; done
btrfs subv snap ${TEST_MNT}/s1 ${TEST_MNT}/s2
FILENAME=`find ${TEST_MNT}/s1/ -inum 4085 | sed 's|^.*/\([^/]*\)$|\1|'`
rm ${TEST_MNT}/s2/$FILENAME
touch ${TEST_MNT}/s2/$FILENAME
# the following steps can be repeated to reproduce the issue again and again
[ -e ${TEST_MNT}/s3 ] && btrfs subv del ${TEST_MNT}/s3
btrfs subv snap ${TEST_MNT}/s2 ${TEST_MNT}/s3
rm ${TEST_MNT}/s3/$FILENAME
touch ${TEST_MNT}/s3/$FILENAME
ls -alFi ${TEST_MNT}/s?/$FILENAME
touch ${TEST_MNT}/s3/_1 || logger FAILED
ls -alFi ${TEST_MNT}/s?/_1
touch ${TEST_MNT}/s3/_2 || logger FAILED
ls -alFi ${TEST_MNT}/s?/_2
touch ${TEST_MNT}/s3/__1 || logger FAILED
ls -alFi ${TEST_MNT}/s?/__1
touch ${TEST_MNT}/s3/__2 || logger FAILED
ls -alFi ${TEST_MNT}/s?/__2
# if the above is not enough, add the following loop:
for i in `seq 3 9`; do touch ${TEST_MNT}/s3/__${i} || logger FAILED; done
#for i in `seq 3 34027`; do touch ${TEST_MNT}/s3/__${i} || logger FAILED; done
# one of the touch(1) calls in s3 fail due to EEXIST because the inode is
# already in use that btrfs_find_ino_for_alloc() returns.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we decrement the key type, we must reset its offset to the largest
possible offset (u64)-1. If we decrement the key's objectid, then we
must reset the key's type and offset to their largest possible values,
(u8)-1 and (u64)-1 respectively. Not doing so can make us miss an
items in the tree.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Avoid repeated tree searches by processing all inode ref items in
a leaf at once instead of processing one at a time, followed by a
path release and a tree search for a key with a decremented offset.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Use memdup_user rather than duplicating its implementation
This is a little bit restricted to reduce false positives
The semantic patch that makes this report is available
in scripts/coccinelle/api/memdup_user.cocci.
More information about semantic patching is available at
http://coccinelle.lip6.fr/
Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Comparison of an inode's last modified transaction with the last committed
transaction is incorrect. Fix it.
Signed-off-by: chandan <chandan@linux.vnet.ibm.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If the path allocation failed, we would return without decrementing
the reference count in the delayed node we got before, resulting
in a leak.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If the user remounts the filesystem read-only while the uuid-tree
scan and rebuild task is still running (this happens once after the
filesystem was mounted with an old kernel, or when forced with the
mount options), the remount should wait on the tasks completion
before setting the filesystem read-only. Otherwise the background
task continues to write to the filesystem which is apparently not
what users expect.
The reproducer:
TEST_DEV=/dev/sdzzzzz1
TEST_MNT=/mnt
mkfs.btrfs -f $TEST_DEV
mount $TEST_DEV $TEST_MNT
for i in `seq 50000`; do btrfs subvolume create ${TEST_MNT}/$i; done
umount $TEST_MNT
mount $TEST_DEV $TEST_MNT -o rescan_uuid_tree
sleep 1
ps -elf | fgrep '[btrfs-uuid]' | grep -v grep
mount $TEST_DEV $TEST_MNT -o ro,remount
ps -elf | fgrep '[btrfs-uuid]' | grep -v grep
sleep 1
umount $TEST_MNT
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Device stats are only initialized (read from tree items) on mount.
Trying to read device stats after adding or replacing new devices will
return errors.
btrfs_init_new_device() and btrfs_init_dev_replace_tgtdev() are the two
functions that allocate and initialize new btrfs_device structures after
a filesystem is mounted. They set the device stats to zero by using
kzalloc() which is correct for new devices. The only missing thing was
to declare these stats as being valid (device->dev_stats_valid = 1) and
this patch adds this missing code.
This is the reproducer:
TEST_DEV1=/dev/sdzzzzz1
TEST_DEV2=/dev/sdzzzzz2
TEST_DEV3=/dev/sdzzzzz3
TEST_MNT=/mnt
mkfs.btrfs $TEST_DEV1
mount $TEST_DEV1 $TEST_MNT
btrfs device add $TEST_DEV2 $TEST_MNT
btrfs device stat $TEST_MNT
btrfs replace start -B $TEST_DEV2 $TEST_DEV3 $TEST_MNT
btrfs device stat $TEST_MNT
umount $TEST_MNT
Reported-by: Ondrej Kunc <kunc88@gmail.com>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When we fail to add a reference after a non-inline insertion by some reasons,
eg. ENOSPC, we'll abort the transaction, but we don't return this error to
the caller who has to walk around again to find something wrong, that's
unnecessary.
Also fixup other error paths to keep it simple.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
For both balance and replace, cancelling involves changing the on-disk
state and committing a transaction, which is not a good thing to do on
read-only filesystems.
Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
struct btrfs_ioctl_dev_replace_args memory is leaked if replace is
requested on a read-only filesystem. Fix it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
On mount failures, __btrfs_close_devices can be called well before
dev-replace state is read and ->is_tgtdev_for_dev_replace is set. This
leads to a bogus decrement of ->rw_devices and sets off a WARN_ON in
__btrfs_close_devices if replace target device happens to be on the
lists and we fail early in the mount sequence. Fix this by checking
the devid instead of ->is_tgtdev_for_dev_replace before the decrement:
for replace targets devid is always equal to BTRFS_DEV_REPLACE_DEVID.
Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In add_inode_ref() function:
Initializes local pointers.
Reduces the logical condition with the __add_inode_ref() return
value by using only one 'goto out'.
Centralizes the exiting, ensuring the freeing of all used memory.
Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
After commit de78b51a28
(btrfs: remove cache only arguments from defrag path), @blockptr is no more
used.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
@is_extent is no more needed since we don't defrag extent root.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The btrfs_insert_empty_item() function doesn't modify its
key argument.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
alloc_extent_buffer() uses radix_tree_lookup() when radix_tree_insert()
fails with EEXIST. That part of the code is very similar to the code in
find_extent_buffer(). This patch replaces radix_tree_lookup() and
surrounding code in alloc_extent_buffer() with find_extent_buffer().
Note that radix_tree_lookup() does not need to be protected by
tree->buffer_lock. It is protected by eb->refs.
While at it, this patch
- changes the other usage of radix_tree_lookup() in alloc_extent_buffer()
with find_extent_buffer() to reduce redundancy.
- removes the unused argument 'len' to find_extent_buffer().
Signed-Off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Whoever wrote this was braindead. Also it doesn't work right if you have
VACANCY's since we assumed you would only have that at the end of the file,
which won't be the case in the near future. I tested this with generic/285 and
generic/286 as well as the btrfs tests that use fssum since it uses
seek_hole/seek_data to verify things are ok. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I was hitting weird issues when trying to remove hole extents and it turned out
it was because I was sending non-aligned offsets down to
btrfs_lookup_csums_range. So add an assert for this in case somebody trips over
this in the future. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I added an assert to make sure we were looking up aligned offsets for csums and
I tripped it when running xfstests. This is because log_one_extent was checking
if block_start == 0 for a hole instead of EXTENT_MAP_HOLE. This worked out fine
in practice it seems, but it adds a lot of extra work that is uneeded. With
this fix I'm no longer tripping my assert. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Btrfs_get_extent was not handling this case properly, add a test to make sure we
don't regress. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
While trying to kill our hole extents I noticed I was seeing problems where we
seek into a file and then start writing and then try to fiemap that file later.
This is because we search for offset 0, don't find anything and so back up one
slot, which puts us at the inode ref or something like that, which means we goto
not_found and create an extent map for our entire search area. This isn't quite
what we want, we want to move forward one slot and see if there is an extent
there so we can limit our hole extent. This patch fixes this problem, I will
add a testcase for this as well. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Stefan was hitting a panic in the async worker stuff because we had outstanding
read bios while we were stopping the worker threads. You could reproduce this
easily if you mount -o nospace_cache and ran generic/273. This is because the
caching thread stuff is still going and we were stopping all the worker threads.
We need to stop the workers after this work is done, and the free block groups
code will wait for all the caching threads to stop first so we don't run into
this problem. With this patch we no longer panic. Thanks,
Cc: stable@vger.kernel.org
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I'm going to be removing hole extents in the near future so I wanted to make a
sanity test for btrfs_get_extent to make sure I don't break anything in the
meantime. This patch just puts btrfs_get_extent through its paces by giving it
a completely unreasonable mapping to look at and make sure it is giving us back
maps that make sense. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
So both Liu and I made huge messes of find_lock_delalloc_range trying to fix
stuff, me first by fixing extent size, then him by fixing something I broke and
then me again telling him to fix it a different way. So this is obviously a
candidate for some testing. This patch adds a pseudo fs so we can allocate fake
inodes for tests that need an inode or pages. Then it addes a bunch of tests to
make sure find_lock_delalloc_range is acting the way it is supposed to. With
this patch and all of our previous patches to find_lock_delalloc_range I am sure
it is working as expected now. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
While trying to track down a reserved space leak I noticed a few places where we
won't properly clean up reserved space if we have an error, this patch fixes
those up. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In trying to track down where we were leaking reserved space I noticed our
reserve extent tracepoints are a little off. First we were saying that the
reserved space had been alloced in btrfs_reserve_extent, which isn't the case,
this needs to be triggered when we actually allocate the space when we run the
delayed ref. We were also missing a few places where we should have been
tracing the btrfs_reserve_extent_free tracepoint. With these in place I was
able to put together where we were leaking reserved space. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we abort a transaction we will do the tree log cleanup at unmount, but this
happens after we free up the block groups. This makes all the leak detection
warnings go off because we think we've leaked space but in reality we just
haven't cleaned it up yet. So instead do the block group cleanup stuff after
free'ing the fs roots so we don't get these warnings. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
On error we will wait and free the tree log at unmount without a transaction.
This means that the actual freeing of the blocks doesn't happen which means we
complain about space leaks on unmount. So to fix this just skip the transaction
specific cleanup part of the tree log free'ing if we don't have a transaction
and that way we can free up our reserved space and our counters stay happy.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The transactions should be cleaning up their reservations on failure, this just
causes us to have warnings on unmount because we go negative by free'ing
reservations that have already been free'ed. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Replace progresses strictly from lower to higher offsets, and the
progress is tracked in chunks, by storing the physical offset of the
dev_extent which is being copied in the cursor_left field of
btrfs_dev_replace_item. When we are done copying the chunk,
left_cursor is updated to point one byte past the dev_extent, so that
on resume we can skip the dev_extents that have already been copied.
There is a major bug (which goes all the way back to the inception of
dev-replace in 3.8) in the way left_cursor is bumped: the bump is done
unconditionally, without any regard to the scrub_chunk return value.
On suspend (and also on any kind of error) scrub_chunk returns early,
i.e. without completing the copy. This leads to us skipping the chunk
that hasn't been fully copied yet when resuming.
Fix this by doing the cursor_left update only if scrub_chunk ret is 0.
(On suspend scrub_chunk returns with -ECANCELED, so this fix covers
both suspend and error cases.)
Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently the hash value used for adding an inode to the VFS's inode
hash table consists of the plain inode number, which is a 64 bits
integer. This results in hash table buckets (hlist_head lists) with
too many elements for at least 2 important scenarios:
1) When we have many subvolumes. Each subvolume has its own btree
where its files and directories are added to, and each has its
own objectid (inode number) namespace. This means that if we have
N subvolumes, and all have inode number X associated to a file or
directory, the corresponding inodes all map to the same hash table
entry, resulting in a bucket (hlist_head list) with N elements;
2) On 32 bits machines. Th VFS hash values are unsigned longs, which
are 32 bits wide on 32 bits machines, and the inode (objectid)
numbers are 64 bits unsigned integers. We simply cast the inode
numbers to hash values, which means that for all inodes with the
same 32 bits lower half, the same hash bucket is used for all of
them. For example, all inodes with a number (objectid) between
0x0000_0000_ffff_ffff and 0xffff_ffff_ffff_ffff will end up in
the same hash table bucket.
This change ensures the inode's hash value depends both on the
objectid (inode number) and its subvolume's (btree root) objectid.
For 32 bits machines, this change gives better entropy by making
the hash value depend on both the upper and lower 32 bits of the
64 bits hash previously computed.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In tree-log.c:btrfs_log_inode(), we keep calling btrfs_search_forward()
until it returns a key whose objectid is higher than our inode or until
the key's type is higher than our maximum allowed type.
At the end of the loop, we increment our mininum search key's objectid
and type regardless of our desired target objectid and maximum desired
type, which causes another loop iteration that will call again
btrfs_search_forward() just to figure out we've gone beyond our maximum
key and exit the loop. Therefore while incrementing our minimum key,
don't do it blindly and exit the loop immiediately if the next search
key's objectid or type is beyond what we seek.
Also after incrementing the type, set the key's offset to 0, which was
missing and could make us loose some of the inode's items.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
It is not used for anything.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
As we're hold a ref on looking up the extent map, we need to drop the ref
before returning to callers.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The performance was slowed down sometimes when we ran sysbench to measure
the performance of the sequential buffered write by 2 or more threads.
It was because the write order of the test threads might be confused
by the task scheduler, and the coming write would be beyond the end of
the file, in this case, we need insert dummy file extents and create
a hole for the area we skip. But in order to avoid the ongoing ordered
extents which are in the area, we need wait for them. Unfortunately,
the current code doesn't check if there are ordered extents in the area
or not, try to find and flush the dirty pages directly, but in fact,
there is no dirty page in that area, this step of the current code is
unnecessary, and just wastes time. Sometimes, it would increase
the contention of some locks, and makes the performance slow down suddenly.
So we remove the ordered extent flush function before the check, and flush
the dirty pages and wait for the ordered extents only when we find them.
According to my test, we got 1-2 times of the performance regression when
we ran the test by 10 times before applying this patch. After applying
this patch, the regression went away.
Test Environment:
CPU: 1CPU * 4Cores
Memory: 6GB
Partition: 20GB
Test Command:
# sysbench --test=fileio --file-total-size=16G --file-test-mode=seqwr \
> --num-threads=512 --file-block-size=16384 --max-time=60 --max-requests=0 run
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When we did space balance and snapshot creation at the same time, we might
meet the following oops:
kernel BUG at fs/btrfs/inode.c:3038!
[SNIP]
Call Trace:
[<ffffffffa0411ec7>] btrfs_orphan_cleanup+0x293/0x407 [btrfs]
[<ffffffffa042dc45>] btrfs_mksubvol.isra.28+0x259/0x373 [btrfs]
[<ffffffffa042de85>] btrfs_ioctl_snap_create_transid+0x126/0x156 [btrfs]
[<ffffffffa042dff1>] btrfs_ioctl_snap_create_v2+0xd0/0x121 [btrfs]
[<ffffffffa0430b2c>] btrfs_ioctl+0x414/0x1854 [btrfs]
[<ffffffff813b60b7>] ? __do_page_fault+0x305/0x379
[<ffffffff811215a9>] vfs_ioctl+0x1d/0x39
[<ffffffff81121d7c>] do_vfs_ioctl+0x32d/0x3e2
[<ffffffff81057fe7>] ? finish_task_switch+0x80/0xb8
[<ffffffff81121e88>] SyS_ioctl+0x57/0x83
[<ffffffff813b39ff>] ? do_device_not_available+0x12/0x14
[<ffffffff813b99c2>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa040da40>] btrfs_orphan_add+0xc3/0x126 [btrfs]
The reason of the problem is that the relocation root creation stole
the reserved space, which was reserved for orphan item deletion.
There are several ways to fix this problem, one is to increasing
the reserved space size of the space balace, and then we can use
that space to create the relocation tree for each fs/file trees.
But it is hard to calculate the suitable size because we doesn't
know how many fs/file trees we need relocate.
We fixed this problem by reserving the space for relocation root creation
actively since the space it need is very small (one tree block, used for
root node copy), then we use that reserved space to create the
relocation tree. If we don't reserve space for relocation tree creation,
we will use the reserved space of the balance.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Remove unused parameter, 'eb'. Unused since introduction in
5f39d397df
Updated to be rebased against current upstream and correct diff supplied this time!
Signed-off-by: Ross Kirk <ross.kirk@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I was noticing the slab redzone stuff going off every once and a while during
transaction aborts. This was caused by two things
1) We would walk the pending snapshots and set their error to -ECANCELED. We
don't need to do this, the snapshot stuff waits for a transaction commit and if
there is a problem we just free our pending snapshot object and exit. Doing
this was causing us to touch the pending snapshot object after the thing had
already been freed.
2) We were freeing the transaction manually with wanton disregard for it's
use_count reference counter. To fix this I cleaned up the transaction freeing
loop to either wait for the transaction commit to finish if it was in the middle
of that (since it will be cleaned and freed up there) or to do the cleanup
oursevles.
I also moved the global "kill all things dirty everywhere" stuff outside of the
transaction cleanup loop since that only needs to be done once. With this patch
I'm no longer seeing slab corruption because of use after frees. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Noticed this when forcing errors to happen during delayed ref running. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
During transaction cleanup after an abort we are just removing roots from the
ordered roots list which is incorrect. We have a BUG_ON() to make sure that the
root is still part of the ordered roots list when we put our ordered extent
which we were tripping in this case. So do like we do everywhere else and just
move it to the tail of the ordered roots list and allow the normal cleanup to
take care of stuff. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we abort not during a transaction commit we won't clean up anything until we
unmount. Unfortunately if we abort in the middle of writing out an ordered
extent we won't clean it up and if somebody is waiting on that ordered extent
they will wait forever. To fix this just make the transaction kthread call the
cleanup transaction stuff if it notices theres an error, and make
btrfs_end_transaction wake up the transaction kthread if there is an error.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I've been testing our error paths and I was tripping the BUG_ON() in
drop_outstanding_extent because our outstanding_extents is 0 for space cache
inodes. This is because we don't reserve metadata space for these inodes since
we depend on the global block reserve for our space. To fix this we need to
make sure the DO_ACCOUNTING stuff doesn't actually call release_metadata for
space cache inodes. With this patch I'm no longer panicing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we abort a transaction in the middle of a commit we weren't undoing the
intwrite locking. This patch fixes that problem.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
A user reported a problem where they were getting csum errors when running a
balance and running systemd's journal. This is because systemd is awesome and
fallocate()'s its log space and writes into it. Unfortunately we assume that
when we read in all the csums for an extent that they are sequential starting at
the bytenr we care about. This obviously isn't the case for prealloc extents,
where we could have written to the middle of the prealloc extent only, which
means the csum would be for the bytenr in the middle of our range and not the
front of our range. Fix this by offsetting the new bytenr we are logging to
based on the original bytenr the csum was for. With this patch I no longer see
the csum errors I was seeing. Thanks,
Cc: stable@vger.kernel.org
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In extent-tree.c:btrfs_write_dirty_block_groups(), if the call to
write_one_cache_group() failed, we would return without putting
the block group first.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently the fs sync function (super.c:btrfs_sync_fs()) doesn't
wait for delayed work to finish before returning success to the
caller. This change fixes this, ensuring that there's no data loss
if a power failure happens right after fs sync returns success to
the caller and before the next commit happens.
Steps to reproduce the data loss issue:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ perl -e '$d = ("\x41" x 6001); open($f,">","/mnt/btrfs/foobar"); print $f $d; close($f);' && btrfs fi sync /mnt/btrfs
Right after the btrfs fi sync command (a second or 2 for example), power
off the machine and reboot it. The file will be empty, as it can be verified
after mounting the filesystem and through btrfs-debug-tree:
$ btrfs-debug-tree /dev/sdb3 | egrep '\(257 INODE_ITEM 0\) itemoff' -B 3 -A 8
item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36
location key (257 INODE_ITEM 0) type FILE
namelen 6 datalen 0 name: foobar
item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160
inode generation 7 transid 7 size 0 block group 0 mode 100644 links 1
item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16
inode ref index 2 namelen 6 name: foobar
checksum tree key (CSUM_TREE ROOT_ITEM 0)
leaf 29429760 items 0 free space 3995 generation 7 owner 7
fs uuid 6192815c-af2a-4b75-b3db-a959ffb6166e
chunk uuid b529c44b-938c-4d3d-910a-013b4700bcae
uuid tree key (UUID_TREE ROOT_ITEM 0)
After this patch, the data loss no longer happens after a power failure and
btrfs-debug-tree shows:
$ btrfs-debug-tree /dev/sdb3 | egrep '\(257 INODE_ITEM 0\) itemoff' -B 3 -A 8
item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36
location key (257 INODE_ITEM 0) type FILE
namelen 6 datalen 0 name: foobar
item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160
inode generation 6 transid 6 size 6001 block group 0 mode 100644 links 1
item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16
inode ref index 2 namelen 6 name: foobar
item 6 key (257 EXTENT_DATA 0) itemoff 3522 itemsize 53
extent data disk byte 12845056 nr 8192
extent data offset 0 nr 8192 ram 8192
extent compression 0
checksum tree key (CSUM_TREE ROOT_ITEM 0)
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In inode.c:btrfs_orphan_add() if we failed to insert the orphan
item, we would return without decrementing the orphan count that
we just incremented before attempting the insertion, leaving the
orphan inode count wrong.
In inode.c:btrfs_orphan_del(), we were decrementing the inode
orphan count if the bit BTRFS_INODE_ORPHAN_META_RESERVED was set,
which is logically wrong because it should be decremented if the
bit BTRFS_INODE_HAS_ORPHAN_ITEM was set - after all we increment
the count when we set the bit BTRFS_INODE_HAS_ORPHAN_ITEM elsewhere.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Similar to ocfs2, btrfs also supports that extents can be shared by
different inodes, and there are some userspace tools requesting
for this kind of 'space shared infomation'.[1]
ocfs2 uses flag FIEMAP_EXTENT_SHARED, so does btrfs.
[1]: http://thr3ads.net/ocfs2-devel/2010/09/489052-PATCH-3-3-shared-du-using-fiemap-to-figure-up-the-shared-extents-per-file-and-the-footprint-in
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Not used for anything, and removing it avoids caller's need to
allocate a path structure.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We're doing a unnecessary extra lookup of the ino cache's
inode when we already have it (and holding a reference)
during the process of saving the ino cache contents to disk.
Therefore remove this extra lookup.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
While running some snashot aware defrag tests I noticed I was panicing every
once and a while in key_search. This is because of the optimization that says
if we find a key at slot 0 it will be at slot 0 all the way down the rest of the
tree. This isn't the case for btrfs_search_old_slot since it will likely replay
changes to a buffer if something has changed since we took our sequence number.
So short circuit this optimization by setting prev_cmp to -1 every time we call
key_search so we will do our normal binary search. With this patch I am no
longer seeing the panics I was seeing before. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
While looking at somebodys corruption I became completely convinced that
btrfs_split_item was broken, so I wrote this test to verify that it was working
as it was supposed to. Thankfully it appears to be working as intended, so just
add this test to make sure nobody breaks it in the future. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Remove unused eb parameter from btrfs_item_nr
Signed-off-by: Ross Kirk <ross.kirk@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
It is not necessary to store the NULL byte in a symlink inline file
extent. There's currently no code that requires the NULL byte to be
present in the extent. This change also doesn't break file format
compatibility nor the send/receive feature.
The VFS also doesn't need the NULL byte to be present in the extent,
as it reads up to inode->i_size bytes (which already excluded the NULL
byte) and sets the NULL byte for us (in fs/namei.c:page_getlink()).
So with this change we save 1 byte per symlink file extent (which is
always inlined in the btree leaf) without losing backward and forward
compatibility.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The fact that btrfs_root_refs() returned 0 for the tree_root caused
bugs in the past, therefore it is set to 1 with this patch and
(hopefully) all affected code is adapted to this change.
I verified this change by temporarily adding WARN_ON() checks
everywhere where btrfs_root_refs() is used, checking whether the
logic of the code is changed by btrfs_root_refs() returning 1
instead of 0 for root->root_key.objectid == BTRFS_ROOT_TREE_OBJECTID.
With these added checks, I ran the xfstests './check -g auto'.
The two roots chunk_root and log_root_tree that are only referenced
by the superblock and the log_roots below the log_root_tree still
have btrfs_root_refs() == 0, only the tree_root is changed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fix from Chris Mason:
"Sage hit a deadlock with ceph on btrfs, and Josef tracked it down to a
regression in our initial rc1 pull. When doing nocow writes we were
sometimes starting a transaction with locks held"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: release path before starting transaction in can_nocow_extent
We can't be holding tree locks while we try to start a transaction, we will
deadlock. Thanks,
Reported-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fixes from Chris Mason:
"We've got more bug fixes in my for-linus branch:
One of these fixes another corner of the compression oops from last
time. Miao nailed down some problems with concurrent snapshot
deletion and drive balancing.
I kept out one of his patches for more testing, but these are all
stable"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix oops caused by the space balance and dead roots
Btrfs: insert orphan roots into fs radix tree
Btrfs: limit delalloc pages outside of find_delalloc_range
Btrfs: use right root when checking for hash collision
When doing space balance and subvolume destroy at the same time, we met
the following oops:
kernel BUG at fs/btrfs/relocation.c:2247!
RIP: 0010: [<ffffffffa04cec16>] prepare_to_merge+0x154/0x1f0 [btrfs]
Call Trace:
[<ffffffffa04b5ab7>] relocate_block_group+0x466/0x4e6 [btrfs]
[<ffffffffa04b5c7a>] btrfs_relocate_block_group+0x143/0x275 [btrfs]
[<ffffffffa0495c56>] btrfs_relocate_chunk.isra.27+0x5c/0x5a2 [btrfs]
[<ffffffffa0459871>] ? btrfs_item_key_to_cpu+0x15/0x31 [btrfs]
[<ffffffffa048b46a>] ? btrfs_get_token_64+0x7e/0xcd [btrfs]
[<ffffffffa04a3467>] ? btrfs_tree_read_unlock_blocking+0xb2/0xb7 [btrfs]
[<ffffffffa049907d>] btrfs_balance+0x9c7/0xb6f [btrfs]
[<ffffffffa049ef84>] btrfs_ioctl_balance+0x234/0x2ac [btrfs]
[<ffffffffa04a1e8e>] btrfs_ioctl+0xd87/0x1ef9 [btrfs]
[<ffffffff81122f53>] ? path_openat+0x234/0x4db
[<ffffffff813c3b78>] ? __do_page_fault+0x31d/0x391
[<ffffffff810f8ab6>] ? vma_link+0x74/0x94
[<ffffffff811250f5>] vfs_ioctl+0x1d/0x39
[<ffffffff811258c8>] do_vfs_ioctl+0x32d/0x3e2
[<ffffffff811259d4>] SyS_ioctl+0x57/0x83
[<ffffffff813c3bfa>] ? do_page_fault+0xe/0x10
[<ffffffff813c73c2>] system_call_fastpath+0x16/0x1b
It is because we returned the error number if the reference of the root was 0
when doing space relocation. It was not right here, because though the root
was dead(refs == 0), but the space it held still need be relocated, or we
could not remove the block group. So in this case, we should return the root
no matter it is dead or not.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Now we don't drop all the deleted snapshots/subvolumes before the space
balance. It means we have to relocate the space which is held by the dead
snapshots/subvolumes. So we must into them into fs radix tree, or we would
forget to commit the change of them when doing transaction commit, and it
would corrupt the metadata.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Liu fixed part of this problem and unfortunately I steered him in slightly the
wrong direction and so didn't completely fix the problem. The problem is we
limit the size of the delalloc range we are looking for to max bytes and then we
try to lock that range. If we fail to lock the pages in that range we will
shrink the max bytes to a single page and re loop. However if our first page is
inside of the delalloc range then we will end up limiting the end of the range
to a period before our first page. This is illustrated below
[0 -------- delalloc range --------- 256mb]
[page]
So find_delalloc_range will return with delalloc_start as 0 and end as 128mb,
and then we will notice that delalloc_start < *start and adjust it up, but not
adjust delalloc_end up, so things go sideways. To fix this we need to not limit
the max bytes in find_delalloc_range, but in find_lock_delalloc_range and that
way we don't end up with this confusion. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
btrfs_rename was using the root of the old dir instead of the root of the new
dir when checking for a hash collision, so if you tried to move a file into a
subvol it would freak out because it would see the file you are trying to move
in its current root. This fixes the bug where this would fail
btrfs subvol create test1
btrfs subvol create test2
mv test1 test2.
Thanks to Chris Murphy for catching this,
Cc: stable@vger.kernel.org
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
free_device rcu callback, scheduled from btrfs_rm_dev_replace_srcdev,
can be processed before btrfs_scratch_superblock is called, which would
result in a use-after-free on btrfs_device contents. Fix this by
zeroing the superblock before the rcu callback is registered.
Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The current implementation of worker threads in Btrfs has races in
worker stopping code, which cause all kinds of panics and lockups when
running btrfs/011 xfstest in a loop. The problem is that
btrfs_stop_workers is unsynchronized with respect to check_idle_worker,
check_busy_worker and __btrfs_start_workers.
E.g., check_idle_worker race flow:
btrfs_stop_workers(): check_idle_worker(aworker):
- grabs the lock
- splices the idle list into the
working list
- removes the first worker from the
working list
- releases the lock to wait for
its kthread's completion
- grabs the lock
- if aworker is on the working list,
moves aworker from the working list
to the idle list
- releases the lock
- grabs the lock
- puts the worker
- removes the second worker from the
working list
......
btrfs_stop_workers returns, aworker is on the idle list
FS is umounted, memory is freed
......
aworker is waken up, fireworks ensue
With this applied, I wasn't able to trigger the problem in 48 hours,
whereas previously I could reliably reproduce at least one of these
races within an hour.
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The crash[1] is found by xfstests/generic/208 with "-o compress",
it's not reproduced everytime, but it does panic.
The bug is quite interesting, it's actually introduced by a recent commit
(573aecafca,
Btrfs: actually limit the size of delalloc range).
Btrfs implements delay allocation, so during writeback, we
(1) get a page A and lock it
(2) search the state tree for delalloc bytes and lock all pages within the range
(3) process the delalloc range, including find disk space and create
ordered extent and so on.
(4) submit the page A.
It runs well in normal cases, but if we're in a racy case, eg.
buffered compressed writes and aio-dio writes,
sometimes we may fail to lock all pages in the 'delalloc' range,
in which case, we need to fall back to search the state tree again with
a smaller range limit(max_bytes = PAGE_CACHE_SIZE - offset).
The mentioned commit has a side effect, that is, in the fallback case,
we can find delalloc bytes before the index of the page we already have locked,
so we're in the case of (delalloc_end <= *start) and return with (found > 0).
This ends with not locking delalloc pages but making ->writepage still
process them, and the crash happens.
This fixes it by just thinking that we find nothing and returning to caller
as the caller knows how to deal with it properly.
[1]:
------------[ cut here ]------------
kernel BUG at mm/page-writeback.c:2170!
[...]
CPU: 2 PID: 11755 Comm: btrfs-delalloc- Tainted: G O 3.11.0+ #8
[...]
RIP: 0010:[<ffffffff810f5093>] [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83
[...]
[ 4934.248731] Stack:
[ 4934.248731] ffff8801477e5dc8 ffffea00049b9f00 ffff8801869f9ce8 ffffffffa02b841a
[ 4934.248731] 0000000000000000 0000000000000000 0000000000000fff 0000000000000620
[ 4934.248731] ffff88018db59c78 ffffea0005da8d40 ffffffffa02ff860 00000001810016c0
[ 4934.248731] Call Trace:
[ 4934.248731] [<ffffffffa02b841a>] extent_range_clear_dirty_for_io+0xcf/0xf5 [btrfs]
[ 4934.248731] [<ffffffffa02a8889>] compress_file_range+0x1dc/0x4cb [btrfs]
[ 4934.248731] [<ffffffff8104f7af>] ? detach_if_pending+0x22/0x4b
[ 4934.248731] [<ffffffffa02a8bad>] async_cow_start+0x35/0x53 [btrfs]
[ 4934.248731] [<ffffffffa02c694b>] worker_loop+0x14b/0x48c [btrfs]
[ 4934.248731] [<ffffffffa02c6800>] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
[ 4934.248731] [<ffffffff810608f5>] kthread+0x8d/0x95
[ 4934.248731] [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43
[ 4934.248731] [<ffffffff814fe09c>] ret_from_fork+0x7c/0xb0
[ 4934.248731] [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43
[ 4934.248731] Code: ff 85 c0 0f 94 c0 0f b6 c0 59 5b 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 2c de 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 52 49 8b 84 24 80 00 00 00 f6 40 20 01 75 44
[ 4934.248731] RIP [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83
[ 4934.248731] RSP <ffff8801869f9c48>
[ 4934.280307] ---[ end trace 36f06d3f8750236a ]---
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we crash with a log, remount and recover that log, and then crash before we
can commit another transaction we will get transid verify errors on the next
mount. This is because we were not zero'ing out the log when we committed the
transaction after recovery. This is ok as long as we commit another transaction
at some point in the future, but if you abort or something else goes wrong you
can end up in this weird state because the recovery stuff says that the tree log
should have a generation+1 of the super generation, which won't be the case of
the transaction that was started for recovery. Fix this by removing the check
and _always_ zero out the log portion of the super when we commit a transaction.
This fixes the transid verify issues I was seeing with my force errors tests.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Pull btrfs fixes from Chris Mason:
"These are mostly bug fixes and a two small performance fixes. The
most important of the bunch are Josef's fix for a snapshotting
regression and Mark's update to fix compile problems on arm"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
Btrfs: create the uuid tree on remount rw
btrfs: change extent-same to copy entire argument struct
Btrfs: dir_inode_operations should use btrfs_update_time also
btrfs: Add btrfs: prefix to kernel log output
btrfs: refuse to remount read-write after abort
Btrfs: btrfs_ioctl_default_subvol: Revert back to toplevel subvolume when arg is 0
Btrfs: don't leak transaction in btrfs_sync_file()
Btrfs: add the missing mutex unlock in write_all_supers()
Btrfs: iput inode on allocation failure
Btrfs: remove space_info->reservation_progress
Btrfs: kill delay_iput arg to the wait_ordered functions
Btrfs: fix worst case calculator for space usage
Revert "Btrfs: rework the overcommit logic to be based on the total size"
Btrfs: improve replacing nocow extents
Btrfs: drop dir i_size when adding new names on replay
Btrfs: replay dir_index items before other items
Btrfs: check roots last log commit when checking if an inode has been logged
Btrfs: actually log directory we are fsync()'ing
Btrfs: actually limit the size of delalloc range
Btrfs: allocate the free space by the existed max extent size when ENOSPC
...
Users have been complaining of the uuid tree stuff warning that there is no uuid
root when trying to do snapshot operations. This is because if you mount -o ro
we will not create the uuid tree. But then if you mount -o rw,remount we will
still not create it and then any subsequent snapshot/subvol operations you try
to do will fail gloriously. Fix this by creating the uuid_root on remount rw if
it was not already there. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
btrfs_ioctl_file_extent_same() uses __put_user_unaligned() to copy some data
back to it's argument struct. Unfortunately, not all architectures provide
__put_user_unaligned(), so compiles break on them if btrfs is selected.
Instead, just copy the whole struct in / out at the start and end of
operations, respectively.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Commit 2bc5565286 (Btrfs: don't update atime on
RO subvolumes) ensures that the access time of an inode is not updated when
the inode lives in a read-only subvolume.
However, if a directory on a read-only subvolume is accessed, the atime is
updated. This results in a write operation to a read-only subvolume. I
believe that access times should never be updated on read-only subvolumes.
To reproduce:
# mkfs.btrfs -f /dev/dm-3
(...)
# mount /dev/dm-3 /mnt
# btrfs subvol create /mnt/sub
Create subvolume '/mnt/sub'
# mkdir /mnt/sub/dir
# echo "abc" > /mnt/sub/dir/file
# btrfs subvol snapshot -r /mnt/sub /mnt/rosnap
Create a readonly snapshot of '/mnt/sub' in '/mnt/rosnap'
# stat /mnt/rosnap/dir
File: `/mnt/rosnap/dir'
Size: 8 Blocks: 0 IO Block: 4096 directory
Device: 16h/22d Inode: 257 Links: 1
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2013-09-11 07:21:49.389157126 -0400
Modify: 2013-09-11 07:22:02.330156079 -0400
Change: 2013-09-11 07:22:02.330156079 -0400
# ls /mnt/rosnap/dir
file
# stat /mnt/rosnap/dir
File: `/mnt/rosnap/dir'
Size: 8 Blocks: 0 IO Block: 4096 directory
Device: 16h/22d Inode: 257 Links: 1
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2013-09-11 07:22:56.797151670 -0400
Modify: 2013-09-11 07:22:02.330156079 -0400
Change: 2013-09-11 07:22:02.330156079 -0400
Reported-by: Koen De Wit <koen.de.wit@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The kernel log entries for device label %s and device fsid %pU
are missing the btrfs: prefix. Add those here.
Signed-off-by: Frank Holton <fholton@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
It's still possible to flip the filesystem into RW mode after it's
remounted RO due to an abort. There are lots of places that check for
the superblock error bit and will not write data, but we should not let
the filesystem appear read-write.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch makes it possible to set BTRFS_FS_TREE_OBJECTID as the default
subvolume by passing a subvolume id of 0.
Signed-off-by: chandan <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In btrfs_sync_file(), if the call to btrfs_log_dentry_safe() returns
a negative error (for e.g. -ENOMEM via btrfs_log_inode()), we would
return without ending/freeing the transaction.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The BUG() was replaced by btrfs_error() and return -EIO with the
patch "get rid of one BUG() in write_all_supers()", but the missing
mutex_unlock() was overlooked.
The 0-DAY kernel build service from Intel reported the missing
unlock which was found by the coccinelle tool:
fs/btrfs/disk-io.c:3422:2-8: preceding lock on line 3374
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We don't do the iput when we fail to allocate our delayed delalloc work in
__start_delalloc_inodes, fix this.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This isn't used for anything anymore, just remove it.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is a left over of how we used to wait for ordered extents, which was to
grab the inode and then run filemap flush on it. However if we have an ordered
extent then we already are holding a ref on the inode, and we just use
btrfs_start_ordered_extent anyway, so there is no reason to have an extra ref on
the inode to start work on the ordered extent. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Forever ago I made the worst case calculator say that we could potentially split
into 3 blocks for every level on the way down, which isn't right. If we split
we're only going to get two new blocks, the one we originally cow'ed and the new
one we're going to split. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This reverts commit 70afa3998c. It is causing
performance issues and wasn't actually correct. There were problems with the
way we flushed delalloc and that was the real cause of the early enospc.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Various people have hit a deadlock when running btrfs/011. This is because when
replacing nocow extents we will take the i_mutex to make sure nobody messes with
the file while we are replacing the extent. The problem is we are already
holding a transaction open, which is a locking inversion, so instead we need to
save these inodes we find and then process them outside of the transaction.
Further we can't just lock the inode and assume we are good to go. We need to
lock the extent range and then read back the extent cache for the inode to make
sure the extent really still points at the physical block we want. If it
doesn't we don't have to copy it. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
So if we have dir_index items in the log that means we also have the inode item
as well, which means that the inode's i_size is correct. However when we
process dir_index'es we call btrfs_add_link() which will increase the
directory's i_size for the new entry. To fix this we need to just set the dir
items i_size to 0, and then as we find dir_index items we adjust the i_size.
btrfs_add_link() will do it for new entries, and if the entry already exists we
can just add the name_len to the i_size ourselves. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
A user reported a bug where his log would not replay because he was getting
-EEXIST back. This was because he had a file moved into a directory that was
logged. What happens is the file had a lower inode number, and so it is
processed first when replaying the log, and so we add the inode ref in for the
directory it was moved to. But then we process the directories DIR_INDEX item
and try to add the inode ref for that inode and it fails because we already
added it when we replayed the inode. To solve this problem we need to just
process any DIR_INDEX items we have in the log first so this all is taken care
of, and then we can replay the rest of the items. With this patch my reproducer
can remount the file system properly instead of erroring out. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Liu introduced a local copy of the last log commit for an inode to make sure we
actually log an inode even if a log commit has already taken place. In order to
make sure we didn't relog the same inode multiple times he set this local copy
to the current trans when we log the inode, because usually we log the inode and
then sync the log. The exception to this is during rename, we will relog an
inode if the name changed and it is already in the log. The problem with this
is then we go to sync the inode, and our check to see if the inode has already
been logged is tripped and we don't sync the log. To fix this we need to _also_
check against the roots last log commit, because it could be less than what is
in our local copy of the log commit. This fixes a bug where we rename a file
into a directory and then fsync the directory and then on remount the directory
is no longer there. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If you just create a directory and then fsync that directory and then pull the
power plug you will come back up and the directory will not be there. That is
because we won't actually create directories if we've logged files inside of
them since they will be created on replay, but in this check we will set our
logged_trans of our current directory if it happens to be a directory, making us
think it doesn't need to be logged. Fix the logic to only do this to parent
directories. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
So forever we have had this thing to limit the amount of delalloc pages we'll
setup to be written out to 128mb. This is because we have to lock all the pages
in this range, so anything above this gets a bit unweildly, and also without a
limit we'll happily allocate gigantic chunks of disk space. Turns out our check
for this wasn't quite right, we wouldn't actually limit the chunk we wanted to
write out, we'd just stop looking for more space after we went over the limit.
So if you do a giant 20gb dd on my box with lots of ram I could get 2gig
extents. This is fine normally, except when you go to relocate these extents
and we can't find enough space to relocate these moster extents, since we have
to be able to allocate exactly the same sized extent to move it around. So fix
this by actually enforcing the limit. With this patch I'm no longer seeing
giant 1.5gb extents. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
By the current code, if the requested size is very large, and all the extents
in the free space cache are small, we will waste lots of the cpu time to cut
the requested size in half and search the cache again and again until it gets
down to the size the allocator can return. In fact, we can know the max extent
size in the cache after the first search, so we needn't cut the size in half
repeatedly, and just use the max extent size directly. This way can save
lots of cpu time and make the performance grow up when there are only fragments
in the free space cache.
According to my test, if there are only 4KB free space extents in the fs,
and the total size of those extents are 256MB, we can reduce the execute
time of the following test from 5.4s to 1.4s.
dd if=/dev/zero of=<testfile> bs=1MB count=1 oflag=sync
Changelog v2 -> v3:
- fix the problem that we skip the block group with the space which is
less than we need.
Changelog v1 -> v2:
- address the problem that we return a wrong start position when searching
the free space in a bitmap.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We want to know if there are debugging features compiled in, this may
affect performance. The message is printed before the sanity checks.
(This commit message is a copy of David Sterba's commit message when
he introduced btrfs_print_info()).
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Instead of removing the current inode from the red black tree
and then add the new one, just use the red black tree replace
operation, which is more efficient.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If replace was suspended by the umount, replace target device is added
to the fs_devices->alloc_list during a later mount. This is obviously
wrong. ->is_tgtdev_for_dev_replace is supposed to guard against that,
but ->is_tgtdev_for_dev_replace is (and can only ever be) initialized
*after* everything is opened and fs_devices lists are populated. Fix
this by checking the devid instead: for replace targets it's always
equal to BTRFS_DEV_REPLACE_DEVID.
Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we failed to actually allocate the correct size of the extent to relocate we
will end up in an infinite loop because we won't return an error, we'll just
move on to the next extent. So fix this up by returning an error, and then fix
all the callers to return an error up the stack rather than BUG_ON()'ing.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Merge more patches from Andrew Morton:
"The rest of MM. Plus one misc cleanup"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
mm/Kconfig: add MMU dependency for MIGRATION.
kernel: replace strict_strto*() with kstrto*()
mm, thp: count thp_fault_fallback anytime thp fault fails
thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
thp: do_huge_pmd_anonymous_page() cleanup
thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
mm: cleanup add_to_page_cache_locked()
thp: account anon transparent huge pages into NR_ANON_PAGES
truncate: drop 'oldsize' truncate_pagecache() parameter
mm: make lru_add_drain_all() selective
memcg: document cgroup dirty/writeback memory statistics
memcg: add per cgroup writeback pages accounting
memcg: check for proper lock held in mem_cgroup_update_page_stat
memcg: remove MEMCG_NR_FILE_MAPPED
memcg: reduce function dereference
memcg: avoid overflow caused by PAGE_ALIGN
memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
memcg: correct RESOURCE_MAX to ULLONG_MAX
mm: memcg: do not trap chargers with full callstack on OOM
mm: memcg: rework and document OOM waiting and wakeup
...
truncate_pagecache() doesn't care about old size since commit
cedabed49b ("vfs: Fix vmtruncate() regression"). Let's drop it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull btrfs updates from Chris Mason:
"This is against 3.11-rc7, but was pulled and tested against your tree
as of yesterday. We do have two small incrementals queued up, but I
wanted to get this bunch out the door before I hop on an airplane.
This is a fairly large batch of fixes, performance improvements, and
cleanups from the usual Btrfs suspects.
We've included Stefan Behren's work to index subvolume UUIDs, which is
targeted at speeding up send/receive with many subvolumes or snapshots
in place. It closes a long standing performance issue that was built
in to the disk format.
Mark Fasheh's offline dedup work is also here. In this case offline
means the FS is mounted and active, but the dedup work is not done
inline during file IO. This is a building block where utilities are
able to ask the FS to dedup a series of extents. The kernel takes
care of verifying the data involved really is the same. Today this
involves reading both extents, but we'll continue to evolve the
patches"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits)
Btrfs: optimize key searches in btrfs_search_slot
Btrfs: don't use an async starter for most of our workers
Btrfs: only update disk_i_size as we remove extents
Btrfs: fix deadlock in uuid scan kthread
Btrfs: stop refusing the relocation of chunk 0
Btrfs: fix memory leak of uuid_root in free_fs_info
btrfs: reuse kbasename helper
btrfs: return btrfs error code for dev excl ops err
Btrfs: allow partial ordered extent completion
Btrfs: convert all bug_ons in free-space-cache.c
Btrfs: add support for asserts
Btrfs: adjust the fs_devices->missing count on unmount
Btrf: cleanup: don't check for root_refs == 0 twice
Btrfs: fix for patch "cleanup: don't check the same thing twice"
Btrfs: get rid of one BUG() in write_all_supers()
Btrfs: allocate prelim_ref with a slab allocater
Btrfs: pass gfp_t to __add_prelim_ref() to avoid always using GFP_ATOMIC
Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl
Btrfs: fix race between removing a dev and writing sbs
Btrfs: remove ourselves from the cluster list under lock
...
Pull trivial tree from Jiri Kosina:
"The usual trivial updates all over the tree -- mostly typo fixes and
documentation updates"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
doc: Documentation/cputopology.txt fix typo
treewide: Convert retrun typos to return
Fix comment typo for init_cma_reserved_pageblock
Documentation/trace: Correcting and extending tracepoint documentation
mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
power: Documentation: Update s2ram link
doc: fix a typo in Documentation/00-INDEX
Documentation/printk-formats.txt: No casts needed for u64/s64
doc: Fix typo "is is" in Documentations
treewide: Fix printks with 0x%#
zram: doc fixes
Documentation/kmemcheck: update kmemcheck documentation
doc: documentation/hwspinlock.txt fix typo
PM / Hibernate: add section for resume options
doc: filesystems : Fix typo in Documentations/filesystems
scsi/megaraid fixed several typos in comments
ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
page_isolation: Fix a comment typo in test_pages_isolated()
doc: fix a typo about irq affinity
...
Pull vfs pile 1 from Al Viro:
"Unfortunately, this merge window it'll have a be a lot of small piles -
my fault, actually, for not keeping #for-next in anything that would
resemble a sane shape ;-/
This pile: assorted fixes (the first 3 are -stable fodder, IMO) and
cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last
components) + several long-standing patches from various folks.
There definitely will be a lot more (starting with Miklos'
check_submount_and_drop() series)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
direct-io: Handle O_(D)SYNC AIO
direct-io: Implement generic deferred AIO completions
add formats for dentry/file pathnames
kvm eventfd: switch to fdget
powerpc kvm: use fdget
switch fchmod() to fdget
switch epoll_ctl() to fdget
switch copy_module_from_fd() to fdget
git simplify nilfs check for busy subtree
ibmasmfs: don't bother passing superblock when not needed
don't pass superblock to hypfs_{mkdir,create*}
don't pass superblock to hypfs_diag_create_files
don't pass superblock to hypfs_vm_create_files()
oprofile: get rid of pointless forward declarations of struct super_block
oprofilefs_create_...() do not need superblock argument
oprofilefs_mkdir() doesn't need superblock argument
don't bother with passing superblock to oprofile_create_stats_files()
oprofile: don't bother with passing superblock to ->create_files()
don't bother passing sb to oprofile_create_files()
coh901318: don't open-code simple_read_from_buffer()
...
up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.
This has been sitting in linux-next for a whole cycle.
Thanks,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJSJo+1AAoJENkgDmzRrbjxIC4QALJK95o8AUXuwUkl+2fmFkUt
hh2/PJ1vDYgk4Xt0J6hyoK7XMa0H1RkbBrROuDdsBnorMFpEsGcgdkUZte9ufoAS
97Bg+7N0KPbTB/S8vOwtW1vbERTJIVPN2uf6h1Wqm9Xc2puCh3HbMMr1AWMGu0WQ
NqY5+Zz8zecy1UOrMhEP6H1CjeQcL1w1DO6YM5ydeqlKNzAz+JMfDXriLPDwiE7+
XFPDF/O3Vtd2ckA7L70Lio7hfHwxV5U4WwFVfiwls98XB4jcZqDKIoh1r8z4SRgR
+0Rae2DN3BaOabGMr//5XdrzQVpwJTh5m2w8BAOHJvCJ9HR7Sq29UIN4u+TowZBy
L2xYo4dvFxkympwu5zEd3c7vHYWKIaqmSq5PIjr4gF/uIo2OeOTrpPIK782ZEYb7
e+qUgOEM05V9AmQZCrSZeP9u474Sj8ow3sCtWxfdRtwNfoEIcUXsNNJd/zDHlVtW
cEtXqc2xXIpcuUJQWlSaGp8fmRQjVZPzrLKYLM2m39ZcOOJbf5rzQAYS7hHPosIa
SK+YVux/+Zzi+Xo/vXq1OlM/SruCr5S7JOgCxLowoQ88vupgXME6uPyC8EO+QQ50
GsrHes5ZNLbk0uVsfcexIyojkUnyvDmmnDpv+1zdC6RgZLJQn8OXp5yNhHhnhrFT
BiHX6YFWtDDqRlVv8Q0F
=LeaW
-----END PGP SIGNATURE-----
Merge tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull PTR_RET() removal patches from Rusty Russell:
"PTR_RET() is a weird name, and led to some confusing usage. We ended
up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.
This has been sitting in linux-next for a whole cycle"
[ There are still some PTR_RET users scattered about, with some of them
possibly being new, but most of them existing in Rusty's tree too. We
have that
#define PTR_RET(p) PTR_ERR_OR_ZERO(p)
thing in <linux/err.h>, so they continue to work for now - Linus ]
* tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
GFS2: Replace PTR_RET with PTR_ERR_OR_ZERO
Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO
drm/cma: Replace PTR_RET with PTR_ERR_OR_ZERO
sh_veu: Replace PTR_RET with PTR_ERR_OR_ZERO
dma-buf: Replace PTR_RET with PTR_ERR_OR_ZERO
drivers/rtc: Replace PTR_RET with PTR_ERR_OR_ZERO
mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR().
staging/zcache: don't use PTR_RET().
remoteproc: don't use PTR_RET().
pinctrl: don't use PTR_RET().
acpi: Replace weird use of PTR_RET.
s390: Replace weird use of PTR_RET.
PTR_RET is now PTR_ERR_OR_ZERO(): Replace most.
PTR_RET is now PTR_ERR_OR_ZERO
Call generic_write_sync() from the deferred I/O completion handler if
O_DSYNC is set for a write request. Also make sure various callers
don't call generic_write_sync if the direct I/O code returns
-EIOCBQUEUED.
Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from
Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When the binary search returns 0 (exact match), the target key
will necessarily be at slot 0 of all nodes below the current one,
so in this case the binary search is not needed because it will
always return 0, and we waste time doing it, holding node locks
for longer than necessary, etc.
Below follow histograms with the times spent on the current approach of
doing a binary search when the previous binary search returned 0, and
times for the new approach, which directly picks the first item/child
node in the leaf/node.
Current approach:
Count: 6682
Range: 35.000 - 8370.000; Mean: 85.837; Median: 75.000; Stddev: 106.429
Percentiles: 90th: 124.000; 95th: 145.000; 99th: 206.000
35.000 - 61.080: 1235 ################
61.080 - 106.053: 4207 #####################################################
106.053 - 183.606: 1122 ##############
183.606 - 317.341: 111 #
317.341 - 547.959: 6 |
547.959 - 8370.000: 1 |
Approach proposed by this patch:
Count: 6682
Range: 6.000 - 135.000; Mean: 16.690; Median: 16.000; Stddev: 7.160
Percentiles: 90th: 23.000; 95th: 27.000; 99th: 40.000
6.000 - 8.418: 58 #
8.418 - 11.670: 1149 #########################
11.670 - 16.046: 2418 #####################################################
16.046 - 21.934: 2098 ##############################################
21.934 - 29.854: 744 ################
29.854 - 40.511: 154 ###
40.511 - 54.848: 41 #
54.848 - 74.136: 5 |
74.136 - 100.087: 9 |
100.087 - 135.000: 6 |
These samples were captured during a run of the btrfs tests 001, 002 and
004 in the xfstests, with a leaf/node size of 4Kb.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We only need an async starter if we can't make a GFP_NOFS allocation in our
current path. This is the case for the endio stuff since it happens in IRQ
context, but things like the caching thread workers and the delalloc flushers we
can easily make this allocation and start threads right away. Also change the
worker count for the caching thread pool. Traditionally we limited this to 2
since we took read locks while caching, but nowadays we do this lockless so
there's no reason to limit the number of caching threads. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This fixes a problem where if we fail a truncate we will leave the i_size set
where we wanted to truncate to instead of where we were able to truncate to.
Fix this by making btrfs_truncate_inode_items do the disk_i_size update as it
removes extents, that way it will always be consistent with where its extents
are. Then if the truncate fails at all we can update the in-ram i_size with
what we have on disk and delete the orphan item. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
AFAICT chunk 0 is no longer special, and so it should be restriped just
like every other chunk. One reason for this change is us refusing the
relocation can lead to filesystems that can only be mounted ro, and
never rw -- see the bugzilla [1] for details. The other reason is that
device removal code is already doing this: it will happily relocate
chunk 0 is part of shrinking the device.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=60594
Reported-by: Xavier Bassery <xavier@bartica.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
To get name of the file from a pathname let's use kbasename() helper. It allows
to simplify code a bit.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
now threads can return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS
as defined in btrfs.h for the dev excl operation error in
the FS, which means with this kernel would stop logging
(almost an user error) into the /var/log/messages
v2: accepts Josef' comment
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We currently have this problem where you can truncate pages that have not yet
been written for an ordered extent. We do this because the truncate will be
coming behind to clean us up anyway so what's the harm right? Well if truncate
fails for whatever reason we leave an orphan item around for the file to be
cleaned up later. But if the user goes and truncates up the file and tries to
read from the area that had been discarded previously they will get a csum error
because we never actually wrote that data out.
This patch fixes this by allowing us to either discard the ordered extent
completely, by which I mean we just free up the space we had allocated and not
add the file extent, or adjust the length of the file extent we write. We do
this by setting the length we truncated down to in the ordered extent, and then
we set the file extent length and ram bytes to this length. The total disk
space stays unchanged since we may be compressed and we can't just chop off the
disk space, but at least this way the file extent only points to the valid data.
Then when the file extent is free'd the extent and csums will be freed normally.
This patch is needed for the next series which will give us more graceful
recovery of failed truncates. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
All of these are logic checks to make sure we're not breaking anything, so
convert them over to ASSERT(). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
One of the complaints we get a lot is how many BUG_ON()'s we have. So to help
with this I'm introducing a kconfig option to enable/disable a new ASSERT()
mechanism much like what XFS does. This will allow us developers to still get
our nice panics but allow users/distros to compile them out. With this we can
go through and convert any BUG_ON()'s that we have to catch actual programming
mistakes to the new ASSERT() and then fix everybody else to return errors. This
will also allow developers to leave sanity checks in their new code to make sure
we don't trip over problems while testing stuff and vetting new features.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I noticed that if I tried to mount a file system with -o degraded after having
done it once already we would fail to mount. This is because the
fs_devices->missing count was getting bumped everytime we mounted, but not
getting reset whenever we unmounted. To fix this we just drop the missing count
as we're closing devices to make sure this doesn't happen. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
btrfs_read_fs_root_no_name() already checks if btrfs_root_refs()
is zero and returns ENOENT in this case. There is no need to do
it again in three more places.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Mitch Harder noticed that the patch 3c64a1a mentioned in the subject
line was causing a kernel BUG() on snapshot deletion.
The patch was wrong. It did not handle cached roots correctly. The
check for root_refs == 0 was removed everywhere where
btrfs_read_fs_root_no_name() had been used to retrieve the root,
because this check was already dealt with in
btrfs_read_fs_root_no_name(). But in the case when the root was
found in the cache, there was no such check.
This patch adds the missing check in the case where the root is
found in the cache.
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The second round uses btrfs_error() and return -EIO, the first round
can handle write errors the same way.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
struct __prelim_ref is allocated and freed frequently when
walking backref tree, using slab allocater can not only
speed up allocating but also detect memory leaks.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently, only add_delayed_refs have to allocate with GFP_ATOMIC,
So just pass arg 'gfp_t' to decide which allocation mode.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The handler for the ioctl BTRFS_IOC_FS_INFO was reading the
number of devices before acquiring the device list mutex.
This could lead to inconsistent results because the update of
the device list and the number of devices counter (amongst other
counters related to the device list) are updated in volumes.c
while holding the device list mutex - except for 2 places, one
was volumes.c:btrfs_prepare_sprout() and the other was
volumes.c:device_list_add().
For example, if we have 2 devices, with IDs 1 and 2 and then add
a new device, with ID 3, and while adding the device is in progress
an BTRFS_IOC_FS_INFO ioctl arrives, it could return a number of
devices of 2 and a max dev id of 3. This would be incorrect.
Also, this ioctl handler was reading the fsid while it can be
updated concurrently. This can happen when while a new device is
being added and the current filesystem is in seeding mode.
Example:
$ mkfs.btrfs -f /dev/sdb1
$ mkfs.btrfs -f /dev/sdb2
$ btrfstune -S 1 /dev/sdb1
$ mount /dev/sdb1 /mnt/test
$ btrfs device add /dev/sdb2 /mnt/test
If during the last step a BTRFS_IOC_FS_INFO ioctl was requested, it
could read an fsid that was never valid (some bits part of the old
fsid and others part of the new fsid). Also, it could read a number
of devices that doesn't match the number of devices in the list and
the max device id, as explained before.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This change fixes an issue when removing a device and writing
all super blocks run simultaneously. Here's the steps necessary
for the issue to happen:
1) disk-io.c:write_all_supers() gets a number of N devices from the
super_copy, so it will not panic if it fails to write super blocks
for N - 1 devices;
2) Then it tries to acquire the device_list_mutex, but blocks because
volumes.c:btrfs_rm_device() got it first;
3) btrfs_rm_device() removes the device from the list, then unlocks the
mutex and after the unlock it updates the number of devices in
super_copy to N - 1.
4) write_all_supers() finally acquires the mutex, iterates over all the
devices in the list and gets N - 1 errors, that is, it failed to write
super blocks to all the devices;
5) Because write_all_supers() thinks there are a total of N devices, it
considers N - 1 errors to be ok, and therefore won't panic.
So this change just makes sure that write_all_supers() reads the number
of devices from super_copy after it acquires the device_list_mutex.
Conversely, it changes btrfs_rm_device() to update the number of devices
in super_copy before it releases the device list mutex.
The code path to add a new device (volumes.c:btrfs_init_new_device),
already has the right behaviour: it updates the number of devices in
super_copy while holding the device_list_mutex.
The only code path that doesn't lock the device list mutex
before updating the number of devices in the super copy is
disk-io.c:next_root_backup(), called by open_ctree() during
mount time where concurrency issues can't happen.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
A user was reporting weird warnings from btrfs_put_delayed_ref() and I noticed
that we were doing this list_del_init() on our head ref outside of
delayed_refs->lock. This is a problem if we have people still on the list, we
could end up modifying old pointers and such. Fix this by removing us from the
list before we do our run_delayed_ref on our head ref. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We were unconditionally clearing our runtime flag on the inode on error when
trying to insert an orphan item. This is wrong in the case of -EEXIST since we
obviously have an orphan item. This was causing us to not do the correct
cleanup of our orphan items which caused issues on cleanup. This happens
because currently when truncate fails we just leave the orphan item on there so
it can be cleaned up, so if we go to remove the file later we will hit this
issue. What we do for truncate isn't right either, but we shouldn't screw this
sort of thing up on error either, so fix this and then I'll fix truncate in a
different patch. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Send was just sending everything it found, even if the extent was a hole. This
is unpleasant for users, so just skip holes when we are sending. This will also
skip sending prealloc extents since the send spec doesn't have a prealloc
command. Eventually we will add a prealloc command and rev the send version so
we can send down the prealloc info. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The name buffer is not terminated by a '\0' character,
therefore it needs to be printed with %.*s and use the
length of the buffer.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
sector_t may be either "u64" (always 64 bit) or "unsigned long" (32 or 64
bit). Casting it to "unsigned long" will truncate it on 32-bit platforms
where CONFIG_LBDAF=y.
Cast to "unsigned long long" and format using "ll" instead.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
PAGE_CACHE_SIZE == PAGE_SIZE is "unsigned long" everywhere, so there's no
need to cast it to "unsigned long".
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Internally, btrfs_header_chunk_tree_uuid() calculates an unsigned long, but
casts it to a pointer, while all callers cast it to unsigned long again.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Internally, btrfs_header_fsid() calculates an unsigned long, but casts
it to a pointer, while all callers cast it to unsigned long again.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Internally, btrfs_dev_extent_chunk_tree_uuid() calculates an unsigned long,
but casts it to a pointer, while all callers cast it to unsigned long
again.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
All callers of btrfs_device_fsid() cast its return type to unsigned long.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
All callers of btrfs_device_uuid() cast its return type to unsigned long.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
mirror_num is always "int", hence don't cast it to "unsigned long long" and
format it as a 64-bit number.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
PAGE_SIZE is "unsigned long" everywhere, so there's no need to cast it to
"unsigned long long" and format it as a 64-bit number.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The internal btrfs device id is a u64, hence make the constant
BTRFS_DEV_REPLACE_DEVID "unsigned long long" as well, so we no longer need
a cast to print it.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
u64 is "unsigned long long" on all architectures now, so there's no need to
cast it when formatting it using the "ll" length modifier.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
It turns out we don't properly rollback in-core btrfs_device state on
umount. We zero out ->bdev, ->in_fs_metadata and that's about it. In
particular, we don't zero out ->generation, and this can lead to us
refusing a mount -- a non-NULL fs_devices->latest_bdev is essential, but
btrfs_close_extra_devices will happily assign NULL to ->latest_bdev if
the first device on the dev_list happens to be missing and consequently
has no bdev attached. This happens because since commit a6b0d5c8
btrfs_close_extra_devices adjusts ->latest_bdev, and in doing that,
relies on the ->generation. Fix this, and possibly other problems, by
zeroing out everything except for what device_list_add sets, so that a
mount right after insmod and 'btrfs dev scan' is no different from any
later mount in this respect.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In the spirit of btrfs_alloc_device, add a helper for allocating and
doing some common initialization of btrfs_fs_devices struct.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently btrfs_device is allocated ad-hoc in a few different places,
and as a result not all fields are initialized properly. In particular,
readahead state is only initialized in device_list_add (at scan time),
and not in btrfs_init_new_device (when the new device is added with
'btrfs dev add'). Fix this by adding an allocation helper and switch
everybody but __btrfs_close_devices to it. (__btrfs_close_devices is
dealt with in a later commit.)
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
find_next_devid() knows which root to search, so it should take an
fs_info instead of an arbitrary root.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If you start the replace procedure on a read only filesystem, at
the end the procedure fails to write the updated dev_items to the
chunk tree. The problem is that this error is not indicated except
for a WARN_ON(). If the user now thinks that everything was done
as expected and destroys the source device (with mkfs or with a
hammer). The next mount fails with "failed to read chunk root" and
the filesystem is gone.
This commit adds code to fail the attempt to start the replace
procedure if the filesystem is mounted read-only.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Cc: <stable@vger.kernel.org> # 3.10+
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
After we set force_compress with a new value (which was not being done
while holding the inode mutex), if an error happens and we jump to
the label out_ra, the force_compress property of the inode is not set
to BTRFS_COMPRESS_NONE (unlike in the case where no errors happen).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This should never be needed, but since all functions are there
to check and rebuild the UUID tree, a mount option is added that
allows to force this check and rebuild procedure.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If the filesystem was mounted with an old kernel that was not
aware of the UUID tree, this is detected by looking at the
uuid_tree_generation field of the superblock (similar to how
the free space cache is doing it). If a mismatch is detected
at mount time, a thread is started that does two things:
1. Iterate through the UUID tree, check each entry, delete those
entries that are not valid anymore (i.e., the subvol does not
exist anymore or the value changed).
2. Iterate through the root tree, for each found subvolume, add
the UUID tree entries for the subvolume (if they are not
already there).
This mechanism is also used to handle and repair errors that
happened during the initial creation and filling of the tree.
The update of the uuid_tree_generation field (which indicates
that the state of the UUID tree is up to date) is blocked until
all create and repair operations are successfully completed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In order to be able to detect the case that a filesystem is mounted
with an old kernel, add a uuid-tree-gen field like the free space
cache is doing it. It is part of the super block and written with
each commit. Old kernels do not know this field and don't update it.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When the UUID tree is initially created, a task is spawned that
walks through the root tree. For each found subvolume root_item,
the uuid and received_uuid entries in the UUID tree are added.
This is such a quick operation so that in case somebody wants
to unmount the filesystem while the task is still running, the
unmount is delayed until the UUID tree building task is finished.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When a new subvolume or snapshot is created, a new UUID item is added
to the UUID tree. Such items are removed when the subvolume is deleted.
The ioctl to set the received subvolume UUID is also touched and will
now also add this received UUID into the UUID tree together with the
ID of the subvolume. The latter is also done when read-only snapshots
are created which inherit all the send/receive information from the
parent subvolume.
User mode programs use the BTRFS_IOC_TREE_SEARCH ioctl to search and
read in the UUID tree.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This tree is not created by mkfs.btrfs. Therefore when a filesystem
is mounted writable and the UUID tree does not exist, this tree is
created if required. The tree is also added to the fs_info structure
and initialized, but this commit does not yet read or write UUID tree
elements.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This commit adds support to print UUID tree elements to print-tree.c.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Mapping UUIDs to subvolume IDs is an operation with a high effort
today. Today, the algorithm even has quadratic effort (based on the
number of existing subvolumes), which means, that it takes minutes
to send/receive a single subvolume if 10,000 subvolumes exist. But
even linear effort would be too much since it is a waste. And these
data structures to allow mapping UUIDs to subvolume IDs are created
every time a btrfs send/receive instance is started.
It is much more efficient to maintain a searchable persistent data
structure in the filesystem, one that is updated whenever a
subvolume/snapshot is created and deleted, and when the received
subvolume UUID is set by the btrfs-receive tool.
Therefore kernel code is added with this commit that is able to
maintain data structures in the filesystem that allow to quickly
search for a given UUID and to retrieve data that is assigned to
this UUID, like which subvolume ID is related to this UUID.
This commit adds a new tree to hold UUID-to-data mapping items. The
key of the items is the full UUID plus the key type BTRFS_UUID_KEY.
Multiple data blocks can be stored for a given UUID, a type/length/
value scheme is used.
Now follows the lengthy justification, why a new tree was added
instead of using the existing root tree:
The first approach was to not create another tree that holds UUID
items. Instead, the items should just go into the top root tree.
Unfortunately this confused the algorithm to assign the objectid
of subvolumes and snapshots. The reason is that
btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for
the first created subvol or snapshot after mounting a filesystem,
and this function simply searches for the largest used objectid in
the root tree keys to pick the next objectid to assign. Of course,
the UUID keys have always been the ones with the highest offset
value, and the next assigned subvol ID was wastefully huge.
To use any other existing tree did not look proper. To apply a
workaround such as setting the objectid to zero in the UUID item
key and to implement collision handling would either add
limitations (in case of a btrfs_extend_item() approach to handle
the collisions) or a lot of complexity and source code (in case a
key would be looked up that is free of collisions). Adding new code
that introduces limitations is not good, and adding code that is
complex and lengthy for no good reason is also not good. That's the
justification why a completely new tree was introduced.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__
I tried to filter out the warnings for which patches have already
been sent to the mailing list, pending for inclusion in btrfs-next.
All these changes should be obviously safe.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If the inode ref key was not found and the current leaf slot
was 0 (first item in the leaf) the code would always return
-ENOENT. This was not correct because the desired inode ref
item might be the last item in the previous leaf.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If the path doesn't fit in the input buffer, return ENAMETOOLONG
instead of returning with a success code (0) and a partially
filled and right justified buffer.
Also removed useless buffer pointer check outside the while loop.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We have checked 'quota_root' with qgroup_ioctl_lock held before,So
here the check is reduplicate, remove it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
btrfs_free_qgroup_config() is not only called by open/close_ctree(),but
also btrfs_disable_quota().And for btrfs_disable_quota(),we have set
'quota_root' to be null before calling btrfs_free_qgroup_config(),so it
is safe to cleanup in-memory structures without lock held.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When disabling quota, we should clear out list 'dirty_qgroups',otherwise,
we will get oops if enabling quota again. Fix this by abstracting similar
code from del_qgroup_rb().
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If you are sending a snapshot and specifying a parent snapshot we will walk the
trees and figure out where they differ and send the differences only. The way
we check for differences are if the leaves aren't the same and if the keys are
not the same within the leaves. So if neither leaf is the same (ie the leaf has
been cow'ed from the parent snapshot) we walk each item in the send root and
check it against the parent root. If the items match exactly then we don't do
anything. This doesn't quite work for inode refs, since they will just have the
name and the parent objectid. If you move the file from a directory and then
remove that directory and re-create a directory with the same inode number as
the old directory and then move that file back into that directory we will
assume that nothing changed and you will get errors when you try to receive.
In order to fix this we need to do extra checking to see if the inode ref really
is the same or not. So do this by passing down BTRFS_COMPARE_TREE_SAME if the
items match. Then if the key type is an inode ref we can do some extra
checking, otherwise we just keep processing. The extra checking is to look up
the generation of the directory in the parent volume and compare it to the
generation of the send volume. If they match then they are the same directory
and we are good to go. If they don't we have to add them to the changed refs
list.
This means we have to track the generation of the ref we're trying to lookup
when we iterate all the refs for a particular inode. So in the case of looking
for new refs we have to get the generation from the parent volume, and in the
case of looking for deleted refs we have to get the generation from the send
volume to compare with.
There was also the issue of using a ulist to keep track of the directories we
needed to check. Because we can get a deleted ref and a new ref for the same
inode number the ulist won't work since it indexes based on the value. So
instead just dup any directory ref we find and add it to a local list, and then
process that list as normal and do away with using a ulist for this altogether.
Before we would fail all of the tests in the far-progs that related to moving
directories (test group 32). With this patch we now pass these tests, and all
of the tests in the far-progs send testing suite. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The plan is to have a bunch of unit tests that run when btrfs is loaded when you
build with the appropriate config option. My ultimate goal is to have a test
for every non-static function we have, but at first I'm going to focus on the
things that cause us the most problems. To start out with this just adds a
tests/ directory and moves the existing free space cache tests into that
directory and sets up all of the infrastructure. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I noticed while looking at a deadlock that we are always starting a transaction
in cow_file_range(). This isn't really needed since we only need a transaction
if we are doing an inline extent, or if the allocator needs to allocate a chunk.
So push down all the transaction start stuff to be closer to where we actually
need a transaction in all of these cases. This will hopefully reduce our write
latency when we are committing often. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I added a patch where we started taking the ordered operations mutex when we
waited on ordered extents. We need this because we splice the list and process
it, so if a flusher came in during this scenario it would think the list was
empty and we'd usually get an early ENOSPC. The problem with this is that this
lock is used in transaction committing. So we end up with something like this
Transaction commit
-> wait on writers
Delalloc flusher
-> run_ordered_operations (holds mutex)
->wait for filemap-flush to do its thing
flush task
-> cow_file_range
->wait on btrfs_join_transaction because we're commiting
some other task
-> commit_transaction because we notice trans->transaction->flush is set
-> run_ordered_operations (hang on mutex)
We need to disentangle the ordered operations flushing from the delalloc
flushing, since they are separate things. This solves the deadlock issue I was
seeing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
There are several places where we BUG_ON() if we fail to remove the orphan items
and such, which is not ok, so remove those and either abort or just carry on.
This also fixes a problem where if we couldn't start a transaction we wouldn't
actually remove the orphan item reserve for the inode. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Eric pointed out that btrfs will happily allow you to delete the default subvol.
This is a problem obviously since the next time you go to mount the file system
it will freak out because it can't find the root. Fix this by adding a check to
see if our default subvol points to the subvol we are trying to delete, and if
it does not allowing it to happen. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We have logic to see if we've already created a parent directory by check to see
if an inode inside of that directory has a lower inode number than the one we
are currently processing. The logic is that if there is a lower inode number
then we would have had to made sure the directory was created at that previous
point. The problem is that subvols inode numbers count from the lowest objectid
in the root tree, which may be less than our current progress. So just skip if
our dir item key is a root item. This fixes the original test and the xfstest
version I made that added an extra subvol create. Thanks,
Reported-by: Emil Karlson <jekarlson@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch adds an ioctl, BTRFS_IOC_FILE_EXTENT_SAME which will try to
de-duplicate a list of extents across a range of files.
Internally, the ioctl re-uses code from the clone ioctl. This avoids
rewriting a large chunk of extent handling code.
Userspace passes in an array of file, offset pairs along with a length
argument. The ioctl will then (for each dedupe) do a byte-by-byte comparison
of the user data before deduping the extent. Status and number of bytes
deduped are returned for each operation.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We want this for btrfs_extent_same. Basically readpage and friends do their
own extent locking but for the purposes of dedupe, we want to have both
files locked down across a set of readpage operations (so that we can
compare data). Introduce this variant and a flag which can be set for
extent_read_full_page() to indicate that we are already locked.
Partial credit for this patch goes to Gabriel de Perthuis <g2p.code@gmail.com>
as I have included a fix from him to the original patch which avoids a
deadlock on compressed extents.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
There's some 250+ lines here that are easily encapsulated into their own
function. I don't change how anything works here, just create and document
the new btrfs_clone() function from btrfs_ioctl_clone() code.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The range locking in btrfs_ioctl_clone is trivially broken out into it's own
function. This reduces the complexity of btrfs_ioctl_clone() by a small bit
and makes that locking code available to future functions in
fs/btrfs/ioctl.c
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In extent-tree.c:do_chunk_alloc(), early on we returned 0 (success)
when the target space was full and when chunk allocation is needed.
However, later on in that same function we return ENOSPC if
btrfs_alloc_chunk() fails (and chunk allocation was needed) and
set the space's full flag.
This was inconsistent, as -ENOSPC should be returned if the space
is full and a chunk allocation needs to performed. If the space is
full but no chunk allocation is needed, just return 0 (success).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
tree-log.c was ignoring the return value from btrfs_run_delayed_items()
in several places.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In tree-log.c:replay_one_name(), if memory allocation for
the name fails, ensure we iput the dir inode we got before
before we return.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The rule originally comes from nocow writing, but snapshot-aware
defrag is a different case, the extent has been writen and we're
not going to change the extent but add a reference on the data.
So we're able to allow such compressed extents to be merged into
one bigger extent if they're pointing to the same data.
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I'ts hardcoded to 30 seconds which is fine for most users. Higher values
defer data being synced to permanent storage with obvious consequences
when the system crashes. The upper bound is not forced, but a warning is
printed if it's more than 300 seconds (5 minutes).
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
There is no reason we can't just set the path to blocking and then do normal
GFP_NOFS allocations for these extent buffers. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We can get ENOMEM trying to allocate dummy bufs for the rewind operation of the
tree mod log. Instead of BUG_ON()'ing in this case pass up ENOMEM. I looked
back through the callers and I'm pretty sure I got everybody who did BUG_ON(ret)
in this path. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When doing a send with a parent subvol we will check to see if the file we are
acting on is being overwritten and move it if we think it may be needed further
down the line during the send. We check this by checking its directory and
making sure it existed in the parent and making sure the file existed in the
parent. The problem with this check is that if we create a directory and a file
in that directory, and then snapshot, and then remove and re-create that same
directory and file with different inode numbers and then try to snapshot and
send with the original parent we will try and save the original file inside of
that directory. This is a problem because during the receive we move the
directory out of the way because it is a completely new inode, which makes us
unable to find the old file inside of the directory when we try to move that out
of the way for the overwrite. We fix this by checking the parent directory of
the inode we think we are overwriting. If the parent directory generation in
the send root != the parent directory generation in the parent root then we know
it is a completely new directory and we need not bother with moving the file out
of the way because it would have been completely destroyed. This fixes bz
60673. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Alex Lyakas reported a bug where wait_block_group_cache_progress() would wait
forever if a drive failed. This is because we just bail out if there is an
error while trying to cache a block group, we don't update anybody who may be
waiting. So this introduces a new enum for the cache state in case of error and
makes everybody bail out if we have an error. Alex tested and verified this
patch fixed his problem. This fixes bz 59431. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we bail out when the stripe alloc fails, we need to undo the
earlier allocation of raid_map.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
After reading all device items from the chunk tree, don't
exit the loop and then navigate down the tree again to find
the chunk items. Instead just read all device items and
chunk items with a single tree search. This is possible
because all device items are found before any chunk item in
the chunks tree.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
There is no reason for this sort of jackassery. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Previously we only added blocks to the list to have their backrefs checked if
the level of the block is right above the one we are searching for. This is
because we want to make sure we don't add the entire path up to the root to the
lists to make sure we process things one at a time. This assumes that if any
blocks in the path to the root are going to be not checked (shared in other
words) then they will be in the level right above the current block on up. This
isn't quite right though since we can have blocks higher up the list that are
shared because they are attached to a reloc root. But we won't add this block
to be checked and then later on we will BUG_ON(!upper->checked). So instead
keep track of wether or not we've queued a block to be checked in this current
search, and if we haven't go ahead and queue it to be checked. This patch fixed
the panic I was seeing where we BUG_ON(!upper->checked). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If our item isn't big enough to have an actual inline item when we have skinny
metadata enabled just return 1 in find_inline_backref so we can move on to the
next item. This probably wasn't causing a problem since we check the values of
ptr and end properly, but just in case this will keep us from doing extra work.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
First of all we no longer set EXTENT_DIRTY when we dirty an extent so this patch
removes the clearing of EXTENT_DIRTY since it confuses me. This patch also adds
clearing EXTENT_DEFRAG and also doing EXTENT_DO_ACCOUNTING when we have errors.
This is because if we are clearing delalloc without adding an ordered extent
then we need to make sure the enospc handling stuff is accounted for. Also if
this range was DEFRAG we need to make sure that bit is cleared so we dont leak
it. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch removes the io_tree argument for extent_clear_unlock_delalloc since
we always use &BTRFS_I(inode)->io_tree, and it separates out the extent tree
operations from the page operations. This way we just pass in the extent bits
we want to clear and then pass in the operations we want done to the pages.
This is because I'm going to fix what extent bits we clear in some cases and
rather than add a bunch of new flags we'll just use the actual extent bits we
want to clear. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When we read several pages at once, we needn't get the extent map object
every time we deal with a page, and we can cache the extent map object.
So, we can reduce the search time of the extent map, and besides that, we
also can reduce the lock contention of the extent map tree.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In the past, we cached the checksum value in the extent state object, so we
had to split the extent state object by the block size, or we had no space
to keep this checksum value. But it increased the lock contention of the
extent state tree.
Now we removed this limit by caching the checksum into the bio object, so
it is unnecessary to do the extent state operations by the block size, we
can do it in batches, in this way, we can reduce the lock contention of
the extent state tree.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Before applying this patch, we set the uptodate flag and unlock the extent
by the page size, it is unnecessary, we can do it in batches, it can reduce
the lock contention of the extent state tree.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Before applying this patch, we cached the csum value into the extent state
tree when reading some data from the disk, this operation increased the lock
contention of the state tree.
Now, we just store the csum value into the bio structure or other unshared
structure, so we can reduce the lock contention.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch add some branch prediction hints into the end IO function
of the read page, it reduced the percentage of the branch misses from
5.5% to 4.9%.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Some options are missing in btrfs_show_options(), this patch
adds them.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Although for most time, int is enough for subvolid, we should
ensure safety in theory.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I just notice the following commands succeed:
mount <dev> <mnt> -o thread_pool=-1
This is ridiculous, only positive thread_pool makes sense,this
patch adds sanity checks for them, and also catches the error of
ENOMEM if allocating memory fails.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The alloc_rbio() frees "raid_map" and "bbio" on error, so there is a
potential double free bug in raid56_parity_write(). The
raid56_parity_write() and raid56_parity_recover() functions should still
free "raid_map" and "bbio" on error if other errors occur though, so I
have added some more calls to kfree().
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We can end up with inodes on the auto defrag list that exist on roots that are
going to be deleted. This is extra work we don't need to do, so just bail if
our root has 0 root refs. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I was hitting the BUG_ON() at the end of merge_reloc_roots() because we were
aborting the transaction at some point previously and then getting an error when
we tried to drop the reloc root. I fixed btrfs_drop_snapshot to re-add us to
the dead roots list if we failed, but this isn't the right thing to do for reloc
roots since it uses root->root_list for it's own stuff in order to know what
needs to be cleaned up. So fix btrfs_drop_snapshot to only do the re-add if we
aren't dropping for reloc, and handle errors from merge_reloc_root() by dropping
the reloc root we are processing since it won't be on the list of roots to
cleanup. With this patch my reproducer no longer panics. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I was getting warnings when running find ./ -type f -exec btrfs fi defrag -f {}
\; from record_one_backref because ret was set. Turns out it was because it was
set to 1 because the search slot didn't come out exact and we never reset it.
So reset it to 0 right after the search so we don't leak this and get
uneccessary warnings. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
btrfs_ioctl_get_fslabel() and btrfs_ioctl_set_fslabel()
used root->fs_info->volume_mutex mutex which caused operations
like balance to block set/get label operation until its
completion and generally balance operation takes a long
time to complete, so it will be annoying to the user when
cli appears hung
also this patch will add a bit of optimization within
the btrfs_ioctl_get_falabel() function.
v1->v2:
use fs_info->super_lock instead of uuid_mutex
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is confusing, sometimes the key type is printed in hex (without
a leading "0x" which makes things even more complicated), sometimes
in decimal...
Change it to be in decimal everywhere.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This shows exactly how btrfs processes the delayed refs onto disks,
which is very helpful on understanding delayed ref mechanism and
debugging related bugs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
btrfs_space_info->block_groups.
The current code uses integer literals to index
btrfs_space_info->block_groups[] array. Instead use corresponding
enums from 'enum btrfs_raid_types'.
Signed-off-by: chandan <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Some codes still use the cpu_to_lexx instead of the
BTRFS_SETGET_STACK_FUNCS declared in ctree.h.
Also added some BTRFS_SETGET_STACK_FUNCS for btrfs_header btrfs_timespec
and other structures.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We call ulist_free(qgroup_ulist) in btrfs_free_qgroup_config(),
and btrfs_free_qgroup_config() may be called in two cases:
(1)umount filesystem
(2)disabling quota
However, if we firstly disable quota and then umount filesystem,
a double free happens. Fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The function relocation.c:add_data_references() was not checking
if all calls to __add_tree_block() and find_data_references() were
succeeding or not.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The log message level 'critical' is verbose enough, 'emergency' beeps on
all terminals.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
So to cache free space, we iterate every extent item to gather free space info.
When we have say 10,000 non-inline extent refs(such as BTRFS_EXTENT_DATA_REF),
it takes quite a long time, and since inline extent refs and non-inline ones have
same objectid in their keys, we can just re-search the tree with the next address
to skip non-inline references.
(This is found by dedup feature because dedup extents can end up with many
non-inline extent refs.)
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I recently did some ENOSPC testing that involved filling the disk
while create and removing snapshots in a loop. During the test cycle,
I ran into an ENOSPC when trying to remove a snapshot, leaving the fs
stuck in ENOSPC even after a umount/mount cycle.
This patch allow subvolume removal to fall back onto the global
block reservation in order to succeed when it would have failed
otherwise.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we're looking for a metadata item in the tree and the
search fails with return value of 1, and the slot doesn't
point to the first item in the leaf, check if the previous
item in the leaf corresponds to an extent item for the same
object id - if it does, then don't do another tree search
to get it.
This optimization is already done by btrfs-progs.
V2: updated commit message.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Device scanning waits on the uuid_mutex, which can result in a very long
wait if dev delete is shrinking the device.
Signed-off-by: Carey Underwood <cwillu@cwillu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We've been seeing spurious complaints out of lockdep because the lock class name
changes. This is happening because when we drop a snapshot we will lock a block
before we've read it in, which sets the lockdep class to whatever the default
is. Then once we read the thing in we reset the lockdep class to what it is
supposed to be, which blows lockdeps' mind. This patch should fix the problem,
it appears to be the only place where we do this sort of thing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
With this fix the lzo code behaves like the zlib code by returning an
error
code when compression does not help reduce the size of the file.
This is currently not a bug since the compressed size is checked again
in
the calling method compress_file_range.
Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Previously we held the tree mod lock when adding stuff because we use it to
check and see if we truly do want to track tree modifications. This is
admirable, but GFP_ATOMIC in a critical area that is going to get hit pretty
hard and often is not nice. So instead do our basic checks to see if we don't
need to track modifications, and if those pass then do our allocation, and then
when we go to insert the new modification check if we still care, and if we
don't just free up our mod and return. Otherwise we're good to go and we can
carry on. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Don't emit OOM warnings when k.alloc calls fail when
there there is a v.alloc immediately afterwards.
Converted a kmalloc/vmalloc with memset to kzalloc/vzalloc.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
When btrfs readdir() hits the last entry it sets the readdir offset to a
huge value to stop buggy apps from breaking when the same name is
returned by readdir() with concurrent rename()s.
But unconditionally setting the offset to INT_MAX causes readdir() to
loop returning any entries with offsets past INT_MAX. It only takes a
few hours of constant file creation and removal to create entries past
INT_MAX.
So let's set the huge offset to LLONG_MAX if the last entry has already
overflowed 32bit loff_t. Without large offsets behaviour is identical.
With large offsets 64bit apps will work and 32bit apps will be no more
broken than they currently are if they see large offsets.
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
A user reported a panic when running with autodefrag and deleting snapshots.
This is because we could end up trying to add the root to the dead roots list
twice. To fix this check to see if we are empty before adding ourselves to the
dead roots list. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The ceph guys tripped over this bug where we were still holding onto the
original path that we used to copy the inode with when logging. This is based
on Chris's fix which was reported to fix the problem. We need to drop the paths
in two cases anyway so just move the drop up so that we don't have duplicate
code. Thanks,
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I noticed while running multi-threaded fsync tests that sometimes fsck would
complain about an improper gap. This happens because we fail to add a hole
extent to the file, which was happening when we'd split a hole EM because
btrfs_drop_extent_cache was just discarding the whole em instead of splitting
it. So this patch fixes this by allowing us to split a hole em properly, which
means that added holes actually get logged properly and we no longer see this
fsck error. Thankfully we're tolerant of these sort of problems so a user would
not see any adverse effects of this bug, other than fsck complaining. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Because we don't mess with the offset into the extent for compressed we will
properly find both extents for this case
[extent a][extent b][rest of extent a]
but because we already added a ref for the front half we won't add the inode
information for the second half. This causes us to leak that memory and not
print out the other offset when we do logical-resolve. So fix this by calling
ulist_add_merge and then add our eie to the existing entry if there is one.
With this patch we get both offsets out of logical-resolve. With this and the
other 2 patches I've sent we now pass btrfs/276 on my vm with compress-force=lzo
set. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If you do btrfs inspect-internal logical-resolve on a compressed extent that has
been partly overwritten it won't find anything. This is because we try and
match the extent offset we've searched for based on the extent offset in the
data extent entry. However this doesn't work for compressed extents because the
offsets are for the uncompressed size, not the compressed size. So instead only
do this check if we are not compressed, that way we can get an actual entry for
the physical offset rather than nothing for compressed. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
xfstest btrfs/276 was freaking out on slower boxes partly because fiemap was
offsetting the physical based on the extent offset. This is perfectly fine with
uncompressed extents, however the extent offset is into the uncompressed area,
not the compressed. So we can return a physical value that isn't at all within
the area we have allocated on disk. Fix this by returning the start of the
extent if it is compressed no matter what the offset. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
commit 47fb091fb787420cd195e66f162737401cce023f(Btrfs: fix unlock after free on rewinded tree blocks)
takes an extra increment on the reference of allocated dummy extent buffer, so now we
cannot free this dummy one, and end up with extent buffer leak.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
For partial extents, snapshot-aware defrag does not work as expected,
since
a) we use the wrong logical offset to search for parents, which should be
disk_bytenr + extent_offset, not just disk_bytenr,
b) 'offset' returned by the backref walking just refers to key.offset, not
the 'offset' stored in btrfs_extent_data_ref which is
(key.offset - extent_offset).
The reproducer:
$ mkfs.btrfs sda
$ mount sda /mnt
$ btrfs sub create /mnt/sub
$ for i in `seq 5 -1 1`; do dd if=/dev/zero of=/mnt/sub/foo bs=5k count=1 seek=$i conv=notrunc oflag=sync; done
$ btrfs sub snap /mnt/sub /mnt/snap1
$ btrfs sub snap /mnt/sub /mnt/snap2
$ sync; btrfs filesystem defrag /mnt/sub/foo;
$ umount /mnt
$ btrfs-debug-tree sda (Here we can check whether the defrag operation is snapshot-awared.
This addresses the above two problems.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Create a small file and fallocate it to a big size with
FALLOC_FL_KEEP_SIZE option, then truncate it back to the
small size again, the disk free space is not changed back
in this case. i.e,
total 4
-rw-r--r-- 1 root root 512 Jun 28 11:35 test
Filesystem Size Used Avail Use% Mounted on
....
/dev/sdb1 8.0G 56K 7.2G 1% /mnt
-rw-r--r-- 1 root root 512 Jun 28 11:35 /mnt/test
Filesystem Size Used Avail Use% Mounted on
....
/dev/sdb1 8.0G 5.1G 2.2G 70% /mnt
Filesystem Size Used Avail Use% Mounted on
....
/dev/sdb1 8.0G 5.1G 2.2G 70% /mnt
With this fix, the truncated up space is back as:
Filesystem Size Used Avail Use% Mounted on
....
/dev/sdb1 8.0G 56K 7.2G 1% /mnt
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Miao Xie reported the following issue:
The filesystem was corrupted after we did a device replace.
Steps to reproduce:
# mkfs.btrfs -f -m single -d raid10 <device0>..<device3>
# mount <device0> <mnt>
# btrfs replace start -rfB 1 <device4> <mnt>
# umount <mnt>
# btrfsck <device4>
The reason for the issue is that we changed the write offset by mistake,
introduced by commit 625f1c8dc.
We read the data from the source device at first, and then write the
data into the corresponding place of the new device. In order to
implement the "-r" option, the source location is remapped using
btrfs_map_block(). The read takes place on the mapped location, and
the write needs to take place on the unmapped location. Currently
the write is using the mapped location, and this commit changes it
back by undoing the change to the write address that the aforementioned
commit added by mistake.
Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: <stable@vger.kernel.org> # 3.10+
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we stop dropping a root for whatever reason we need to add it back to the
dead root list so that we will re-start the dropping next transaction commit.
The other case this happens is if we recover a drop because we will add a root
without adding it to the fs radix tree, so we can leak it's root and commit root
extent buffer, adding this to the dead root list makes this cleanup happen.
Thanks,
Cc: stable@vger.kernel.org
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We aren't setting path->locks[level] when we resume a snapshot deletion which
means we won't unlock the buffer when we free the path. This causes deadlocks
if we happen to re-allocate the block before we've evicted the extent buffer
from cache. Thanks,
Cc: stable@vger.kernel.org
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Alex pointed out a problem and fix that exists in the drop one snapshot at a
time patch. If we decide we need to exit for whatever reason (umount for
example) we will just exit the snapshot dropping without updating the drop
progress. So the next time we go to resume we will BUG_ON() because we can't
find the extent we left off at because we never updated it. This patch fixes
the problem.
Cc: stable@vger.kernel.org
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
PTR_RET is now deprecated. Use PTR_ERR_OR_ZERO instead.
Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Pull btrfs update from Chris Mason:
"These are the usual mixture of bugs, cleanups and performance fixes.
Miao has some really nice tuning of our crc code as well as our
transaction commits.
Josef is peeling off more and more problems related to early enospc,
and has a number of important bug fixes in here too"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (81 commits)
Btrfs: wait ordered range before doing direct io
Btrfs: only do the tree_mod_log_free_eb if this is our last ref
Btrfs: hold the tree mod lock in __tree_mod_log_rewind
Btrfs: make backref walking code handle skinny metadata
Btrfs: fix crash regarding to ulist_add_merge
Btrfs: fix several potential problems in copy_nocow_pages_for_inode
Btrfs: cleanup the code of copy_nocow_pages_for_inode()
Btrfs: fix oops when recovering the file data by scrub function
Btrfs: make the chunk allocator completely tree lockless
Btrfs: cleanup orphaned root orphan item
Btrfs: fix wrong mirror number tuning
Btrfs: cleanup redundant code in btrfs_submit_direct()
Btrfs: remove btrfs_sector_sum structure
Btrfs: check if we can nocow if we don't have data space
Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc
Btrfs: use a percpu to keep track of possibly pinned bytes
Btrfs: check for actual acls rather than just xattrs when caching no acl
Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate
Btrfs: optimize reada_for_balance
Btrfs: optimize read_block_for_search
...
Pull trivial tree updates from Jiri Kosina:
"The usual stuff from trivial tree"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
treewide: relase -> release
Documentation/cgroups/memory.txt: fix stat file documentation
sysctl/net.txt: delete reference to obsolete 2.4.x kernel
spinlock_api_smp.h: fix preprocessor comments
treewide: Fix typo in printk
doc: device tree: clarify stuff in usage-model.txt.
open firmware: "/aliasas" -> "/aliases"
md: bcache: Fixed a typo with the word 'arithmetic'
irq/generic-chip: fix a few kernel-doc entries
frv: Convert use of typedef ctl_table to struct ctl_table
sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
doc: clk: Fix incorrect wording
Documentation/arm/IXP4xx fix a typo
Documentation/networking/ieee802154 fix a typo
Documentation/DocBook/media/v4l fix a typo
Documentation/video4linux/si476x.txt fix a typo
Documentation/virtual/kvm/api.txt fix a typo
Documentation/early-userspace/README fix a typo
Documentation/video4linux/soc-camera.txt fix a typo
lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
...
Pull second set of VFS changes from Al Viro:
"Assorted f_pos race fixes, making do_splice_direct() safe to call with
i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
->d_hash/->d_compare calling conventions changes from Linus, misc
stuff all over the place."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
Document ->tmpfile()
ext4: ->tmpfile() support
vfs: export lseek_execute() to modules
lseek_execute() doesn't need an inode passed to it
block_dev: switch to fixed_size_llseek()
cpqphp_sysfs: switch to fixed_size_llseek()
tile-srom: switch to fixed_size_llseek()
proc_powerpc: switch to fixed_size_llseek()
ubi/cdev: switch to fixed_size_llseek()
pci/proc: switch to fixed_size_llseek()
isapnp: switch to fixed_size_llseek()
lpfc: switch to fixed_size_llseek()
locks: give the blocked_hash its own spinlock
locks: add a new "lm_owner_key" lock operation
locks: turn the blocked_list into a hashtable
locks: convert fl_link to a hlist_node
locks: avoid taking global lock if possible when waking up blocked waiters
locks: protect most of the file_lock handling with i_lock
locks: encapsulate the fl_link list handling
locks: make "added" in __posix_lock_file a bool
...
For those file systems(btrfs/ext4/ocfs2/tmpfs) that support
SEEK_DATA/SEEK_HOLE functions, we end up handling the similar
matter in lseek_execute() to update the current file offset
to the desired offset if it is valid, ceph also does the
simliar things at ceph_llseek().
To reduce the duplications, this patch make lseek_execute()
public accessible so that we can call it directly from the
underlying file systems.
Thanks Dave Chinner for this suggestion.
[AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]
v2->v1:
- Add kernel-doc comments for lseek_execute()
- Call lseek_execute() in ceph->llseek()
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Ted Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
category, of note is a fix for on-line resizing file systems where the
block size is smaller than the page size (i.e., file systems 1k blocks
on x86, or more interestingly file systems with 4k blocks on Power or
ia64 systems.)
In the cleanup category, the ext4's punch hole implementation was
significantly improved by Lukas Czerner, and now supports bigalloc
file systems. In addition, Jan Kara significantly cleaned up the
write submission code path. We also improved error checking and added
a few sanity checks.
In the optimizations category, two major optimizations deserve
mention. The first is that ext4_writepages() is now used for
nodelalloc and ext3 compatibility mode. This allows writes to be
submitted much more efficiently as a single bio request, instead of
being sent as individual 4k writes into the block layer (which then
relied on the elevator code to coalesce the requests in the block
queue). Secondly, the extent cache shrink mechanism, which was
introduce in 3.9, no longer has a scalability bottleneck caused by the
i_es_lru spinlock. Other optimizations include some changes to reduce
CPU usage and to avoid issuing empty commits unnecessarily.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJR0XhgAAoJENNvdpvBGATwMXkQAJwTPk5XYLqtAwLziFLvM6wG
0tWa1QAzTNo80tLyM9iGqI6x74X5nddLw5NMICUmPooOa9agMuA4tlYVSss5jWzV
yyB7vLzsc/2eZJusuVqfTKrdGybE+M766OI6VO9WodOoIF1l51JXKjktKeaWegfv
NkcLKlakD4V+ZASEDB/cOcR/lTwAs9dQ89AZzgPiW+G8Do922QbqkENJB8mhalbg
rFGX+lu9W0f3fqdmT3Xi8KGn3EglETdVd6jU7kOZN4vb5LcF5BKHQnnUmMlpeWMT
ksOVasb3RZgcsyf5ZOV5feXV601EsNtPBrHAmH22pWQy3rdTIvMv/il63XlVUXZ2
AXT3cHEvNQP0/yVaOTCZ9xQVxT8sL4mI6kENP9PtNuntx7E90JBshiP5m24kzTZ/
zkIeDa+FPhsDx1D5EKErinFLqPV8cPWONbIt/qAgo6663zeeIyMVhzxO4resTS9k
U2QEztQH+hDDbjgABtz9M/GjSrohkTYNSkKXzhTjqr/m5huBrVMngjy/F4/7G7RD
vSEx5aXqyagnrUcjsupx+biJ1QvbvZWOVxAE/6hNQNRGDt9gQtHAmKw1eG2mugHX
+TFDxodNE4iWEURenkUxXW3mDx7hFbGZR0poHG3M/LVhKMAAAw0zoKrrUG5c70G7
XrddRLGlk4Hf+2o7/D7B
=SwaI
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 update from Ted Ts'o:
"Lots of bug fixes, cleanups and optimizations. In the bug fixes
category, of note is a fix for on-line resizing file systems where the
block size is smaller than the page size (i.e., file systems 1k blocks
on x86, or more interestingly file systems with 4k blocks on Power or
ia64 systems.)
In the cleanup category, the ext4's punch hole implementation was
significantly improved by Lukas Czerner, and now supports bigalloc
file systems. In addition, Jan Kara significantly cleaned up the
write submission code path. We also improved error checking and added
a few sanity checks.
In the optimizations category, two major optimizations deserve
mention. The first is that ext4_writepages() is now used for
nodelalloc and ext3 compatibility mode. This allows writes to be
submitted much more efficiently as a single bio request, instead of
being sent as individual 4k writes into the block layer (which then
relied on the elevator code to coalesce the requests in the block
queue). Secondly, the extent cache shrink mechanism, which was
introduce in 3.9, no longer has a scalability bottleneck caused by the
i_es_lru spinlock. Other optimizations include some changes to reduce
CPU usage and to avoid issuing empty commits unnecessarily."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
ext4: optimize starting extent in ext4_ext_rm_leaf()
jbd2: invalidate handle if jbd2_journal_restart() fails
ext4: translate flag bits to strings in tracepoints
ext4: fix up error handling for mpage_map_and_submit_extent()
jbd2: fix theoretical race in jbd2__journal_restart
ext4: only zero partial blocks in ext4_zero_partial_blocks()
ext4: check error return from ext4_write_inline_data_end()
ext4: delete unnecessary C statements
ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
jbd2: move superblock checksum calculation to jbd2_write_superblock()
ext4: pass inode pointer instead of file pointer to punch hole
ext4: improve free space calculation for inline_data
ext4: reduce object size when !CONFIG_PRINTK
ext4: improve extent cache shrink mechanism to avoid to burn CPU time
ext4: implement error handling of ext4_mb_new_preallocation()
ext4: fix corruption when online resizing a fs with 1K block size
ext4: delete unused variables
ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
jbd2: remove debug dependency on debug_fs and update Kconfig help text
jbd2: use a single printk for jbd_debug()
...
My recent truncate patch uncovered this bug, but I can reproduce it without the
truncate patch. If you mount with -o compress-force, do a direct write to some
area, do a buffered write to some other area, and then do a direct read you will
get the wrong data for where you did the buffered write. This is because the
generic direct io helpers only call filemap_write_and_wait once, and for
compression we need it twice. So to be safe add the btrfs_wait_ordered_range to
the start of the direct io function to make sure any compressed writes have
truly been written. This patch makes xfstests 130 pass when you mount with -o
compress-force=lzo. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is another bug in the tree mod log stuff in that we're calling
tree_mod_log_free_eb every single time a block is cow'ed. The problem with this
is that if this block is shared by multiple snapshots we will call this multiple
times per block, so if we go to rewind the mod log for this block we'll BUG_ON()
in __tree_mod_log_rewind because we try to rewind a free twice. We only want to
call tree_mod_log_free_eb if we are actually freeing the block. With this patch
I no longer hit the panic in __tree_mod_log_rewind. Thanks,
Cc: stable@vger.kernel.org
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk
forward in the tree mod entries, otherwise we'll end up with random entries and
trip the BUG_ON() at the front of __tree_mod_log_rewind. This fixes the panics
people were seeing when running
find /whatever -type f -exec btrfs fi defrag {} \;
Thansk,
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I missed fixing the backref stuff when I introduced the skinny metadata. If you
try and do things like snapshot aware defrag with skinny metadata you are going
to see tons of warnings related to the backref count being less than 0. This is
because the delayed refs will be found for stuff just fine, but it won't find
the skinny metadata extent refs. With this patch I'm not seeing warnings
anymore. Thanks,
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Several users reported this crash of NULL pointer or general protection,
the story is that we add a rbtree for speedup ulist iteration, and we
use krealloc() to address ulist growth, and krealloc() use memcpy to copy
old data to new memory area, so it's OK for an array as it doesn't use
pointers while it's not OK for a rbtree as it uses pointers.
So krealloc() will mess up our rbtree and it ends up with crash.
Reviewed-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
- It makes no sense that we deal with a inode in the dead tree.
- fix the race between dio and page copy by waiting the dio completion
- avoid the page copy vs truncate/punch hole
- check if the page is in the page cache or not
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
- It make no sense that we continue to do something after the error
happened, just go back with this patch.
- remove some check of copy_nocow_pages_for_inode(), such as page check
after write, inode check in the end of the function, because we are
sure they exist.
- remove the unnecessary goto in the return value check of the write
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When adjusting the enospc rules for relocation I ran into a deadlock because we
were relocating the only system chunk and that forced us to try and allocate a
new system chunk while holding locks in the chunk tree, which caused us to
deadlock. To fix this I've moved all of the dev extent addition and chunk
addition out to the delayed chunk completion stuff. We still keep the in-memory
stuff which makes sure everything is consistent.
One change I had to make was to search the commit root of the device tree to
find a free dev extent, and hold onto any chunk em's that we allocated in that
transaction so we do not allocate the same dev extent twice. This has the side
effect of fixing a bug with balance that has been there ever since balance
existed. Basically you can free a block group and it's dev extent and then
immediately allocate that dev extent for a new block group and write stuff to
that dev extent, all within the same transaction. So if you happen to crash
during a balance you could come back to a completely broken file system. This
patch should keep these sort of things from happening in the future since we
won't be able to allocate free'd dev extents until after the transaction
commits. This has passed all of the xfstests and my super annoying stress test
followed by a balance. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I hit a weird problem were my root item had been deleted but the orphan item had
not. This isn't necessarily a problem, but it keeps the file system from being
mounted. To fix this we just need to axe the orphan item if we can't find the
fs root when we're putting them altogether. With this patch I was able to
successfully mount my file system. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Now reading the data from the target device of the replace operation is allowed,
so the mirror number that is greater than the stripes number of a chunk is valid,
we will tune it when we find there is no target device later. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We always just try and reserve data space when we write, but if we are out of
space but have prealloc'ed extents we should still successfully write. This
patch will try and see if we can write to prealloc'ed space and if we can go
ahead and allow the write to continue. With this patch we now pass xfstests
generic/274. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
try_to_writeback_inodes_sb_nr returns 1 if writeback is already underway, which
is completely fraking useless for us as we need to make sure pages are actually
written before we go and check if there are ordered extents. So replace this
with an open coding of try_to_writeback_inodes_sb_nr minus the writeback
underway check so that we are sure to actually have flushed some dirty pages out
and will have ordered extents to use. With this patch xfstests generic/273 now
passes. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There are all of these checks in the ENOSPC code to see if committing the
transaction would free up enough space to make the allocation. This is because
early on we just committed the transaction and hoped and prayed, which resulted
in cases where it took _forever_ to get an ENOSPC when we really were out of
space. So we check space_info->bytes_pinned, except this isn't completely true
because it doesn't account for space we may free but are stuck in delayed refs.
So tests like xfstests 226 would fail because we wouldn't commit the transaction
to free up the data space. So instead add a percpu counter that will be a
little fuzzier, it will add bytes as soon as we try to free up the space, and
remove any space it doesn't actually free up when we get around to doing the
actual free. We then 0 out this counter every transaction period so we have a
better idea of how much space we will actually free up by committing this
transaction. With this patch we now pass xfstests 226. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We have an optimization that will go ahead and cache no acls on an inode if
there are no xattrs on the inode. This saves us a lookup later to check the
acls for writes or any other access. The problem is I use selinux so I always
have an xattr on inodes, so make this test a little smarter and check for the
actual acl hash on the key and if it isn't there then we still get to cache no
acl which makes everybody who uses selinux a little happier. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This has plagued us forever and I'm so over working around it. When we truncate
down to a non-page aligned offset we will call btrfs_truncate_page to zero out
the end of the page and write it back to disk, this will keep us from exposing
stale data if we truncate back up from that point. The problem with this is it
requires data space to do this, and people don't really expect to get ENOSPC
from truncate() for these sort of things. This also tends to bite the orphan
cleanup stuff too which keeps people from mounting. To get around this we can
just move this into btrfs_cont_expand() to make sure if we are truncating up
from a non-page size aligned i_size we will zero out the rest of this page so
that we don't expose stale data. This will give ENOSPC if you try to truncate()
up or if you try to write past the end of isize, which is much more reasonable.
This fixes xfstests generic/083 failing to mount because of the orphan cleanup
failing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This patch does two things. First we no longer explicitly read in the blocks
we're trying to readahead. For things like balance_level we may never actually
use the blocks so this just adds uneeded latency, and balance_level and
split_node will both read in the blocks they care about explicitly so if the
blocks need to be waited on it will be done there. Secondly we no longer drop
the path if we do readahead, we just set the path blocking before we call
reada_for_balance() and then we're good to go. Hopefully this will cut down on
the number of re-searches. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This patch does two things, first it only does one call to
btrfs_buffer_uptodate() with the gen specified instead of once with 0 and then
again with gen specified. The other thing is to call btrfs_read_buffer() on the
buffer we've found instead of dropping it and then calling read_tree_block().
This will keep us from doing yet another radix tree lookup for a buffer we've
already found. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user reported a deadlock where the async submit thread was blocked on the
lock_extent() lock, and then everybody behind him was locked on the page lock
for the page he was holding. Looking at the code I noticed we do not unlock the
extent range when we get ENOSPC and goto retry. This is bad because we
immediately try to lock that range again to do the cow, which will cause a
deadlock. Fix this by unlocking the range. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The comment is for btrfs_attach_transaction_barrier, not for
btrfs_attach_transaction. Fix the typo.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Acked-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We unconditionally search for the EXTENT_ITEM_KEY for metadata during balance,
and then check the key that we found to see if it is actually a
METADATA_ITEM_KEY, but this doesn't work right because METADATA is a higher key
value, so if what we are looking for happens to be the first item in the leaf
the search will dump us out at the previous leaf, and we won't find our item.
So instead do what we do everywhere else, search for the skinny extent first and
if we don't find it go back and re-search for the extent item. This patch fixes
the panic I was hitting when balancing a large file system with skinny extents.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Looking into this backref problem I noticed we're using a macro to what turns
out to essentially be a NULL check to see if we need to search the commit root.
I'm killing this, let's just do what everybody else does and checks if trans ==
NULL. I've also made it so we pass in the path to __resolve_indirect_refs which
will have the search_commit_root flag set properly already and that way we can
avoid allocating another path when we have a perfectly good one to use. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user reported scrub taking up an unreasonable amount of ram as it ran. This
is because we lookup the csums for the extent we're scrubbing but don't free it
up until after we're done with the scrub, which means we can take up a whole lot
of ram. This patch fixes this by dropping the csums once we're done with the
extent we've scrubbed. The user reported this to fix their problem. Thanks,
Reported-and-tested-by: Remco Hosman <remco@hosman.xs4all.nl>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave has this fs_mark script that can make btrfs abort with sufficient amount of
ram. This is because with more ram we can keep more dirty metadata in cache
which in a round about way makes for many more pending delayed refs. What
happens is we end up not throttling the transaction enough so when we go to
commit the transaction when we've completely filled the file system we'll
abort() because we use all of the space in the global reserve and we still have
delayed refs to run. To fix this we need to make the delayed ref flushing and
the transaction throttling dependant upon the number of delayed refs that we
have instead of how much reserved space is left in the global reserve. With
this patch we not only stop aborting transactions but we also get a smoother run
speed with fs_mark and it makes us about 10% faster. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I hit a hang when run_delayed_refs returned an error in the beginning of
btrfs_commit_transaction. If we decide we need to commit the transaction in
btrfs_end_transaction we'll set BLOCKED and start to commit, but if we get an
error this early on we'll just exit without committing. This is fine, except
that anybody else who tried to start a transaction will sit in
wait_current_trans() since we're set to BLOCKED and we never set it to something
else and woke people up. To fix this we want to check for trans->aborted
everywhere we wait for the transaction state to change, and make
btrfs_abort_transaction() wake up any waiters there may be. All the callers
will notice that the transaction has aborted and exit out properly. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I hit a deadlock because we aborted when flushing delayed refs but didn't wake
any of the other flushers up and so everybody was just sleeping forever. This
should fix the problem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Fix the code comments for lzo compression workspace.
The buf item is used to store the decompressed data
and cbuf is used to store the compressed data.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Balance will create reloc_root for each fs root, and it's going to
record last_snapshot to filter shared blocks. The side effect of
setting last_snapshot is to break nocow attributes of files.
Since the extents are not shared by the relocation tree after the balance,
we can recover the old last_snapshot safely if no one snapshoted the
source tree. We fix the above problem by this way.
Reported-by: Kyle Gates <kylegates@hotmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
With non-mixed block groups we replay the logs before we're allowed to do any
writes, so we get away with not pinning/removing the data extents until right
when we replay them. However with mixed block groups we allocate out of the
same pool, so we could easily allocate a metadata block that was logged in our
tree log. To deal with this we just need to notice that we have mixed block
groups and do the normal excluding/removal dance during the pin stage of the log
replay and that way we don't allocate metadata blocks from areas we have logged
data extents. With this patch we now pass xfstests generic/311 with mixed
block groups turned on. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we cross into a different subvol when doing a lookup we will run the orhpan
cleanup. If this fails however we do not drop the ref to the inode we were
looking up before we return an error, which leads to busy inodes on umount.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When testing a corrupted fs I noticed I was getting sleep while atomic errors
when the transaction aborted. This is because btrfs_pin_extent may need to
allocate memory and we are calling this under the spin lock. Fix this by moving
it out and doing the pin after dropping the spin lock but before dropping the
mutex, the same way it works when delayed refs run normally. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When called during mount, we cannot start the rescan worker thread until
open_ctree is done. This commit restuctures the qgroup rescan internals to
enable a clean deferral of the rescan resume operation.
First of all, the struct qgroup_rescan is removed, saving us a malloc and
some initialization synchronizations problems. Its only element (the worker
struct) now lives within fs_info just as the rest of the rescan code.
Then setting up a rescan worker is split into several reusable stages.
Currently we have three different rescan startup scenarios:
(A) rescan ioctl
(B) rescan resume by mount
(C) rescan by quota enable
Each case needs its own combination of the four following steps:
(1) set the progress [A, C: zero; B: state of umount]
(2) commit the transaction [A]
(3) set the counters [A, C: zero; B: state of umount]
(4) start worker [A, B, C]
qgroup_rescan_init does step (1). There's no extra function added to commit
a transaction, we've got that already. qgroup_rescan_zero_tracking does
step (3). Step (4) is nothing more than a call to the generic
btrfs_queue_worker.
We also get rid of a double check for the rescan progress during
btrfs_qgroup_account_ref, which is no longer required due to having step 2
from the list above.
As a side effect, this commit prepares to move the rescan start code from
btrfs_run_qgroups (which is run during commit) to a less time critical
section.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When btrfs_read_qgroup_config or btrfs_quota_enable return non-zero, we've
already freed the fs_info->qgroup_ulist. The final btrfs_free_qgroup_config
called from quota_disable makes another ulist_free(fs_info->qgroup_ulist)
call.
We set fs_info->qgroup_ulist to NULL on the mentioned error paths, turning
the ulist_free in btrfs_free_qgroup_config into a noop.
Cc: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Commit 5b7c665e introduced fs_info->qgroup_ulist, that is allocated during
btrfs_read_qgroup_config and meant to be used later by the qgroup accounting
code. However, it is always freed before btrfs_read_qgroup_config returns,
becuase the commit mentioned above adds a check for (ret), where a check
for (ret < 0) would have been the right choice. This commit fixes the check.
Cc: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave pointed out a problem where if you filled up a file system as much as
possible you couldn't remove any files. The whole unlink reservation thing is
convoluted because it tries to guess if it's going to add space to unlink
something or not, and has all these odd uncommented cases where it simply does
not try. So to fix this I've added a way to conditionally steal from the global
reserve if we can't make our normal reservation. If we have more than half the
space in the global reserve free we will go ahead and steal from the global
reserve. With this patch Dave's reproducer now works and I can rm all the files
on the file system. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Before applying this patch, we flushed the log tree of the fs/file
tree firstly, and then flushed the log root tree. It is ineffective,
especially on the hard disk. This patch improved this problem by wrapping
the above two flushes by the same blk_plug.
By test, the performance of the sync write went up ~60%(2.9MB/s -> 4.6MB/s)
on my scsi disk whose disk buffer was enabled.
Test step:
# mkfs.btrfs -f -m single <disk>
# mount <disk> <mnt>
# dd if=/dev/zero of=<mnt>/file0 bs=32K count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We did not allow file data clone within the same file because of
deadlock issues.
However, we now use nested lock to avoid deadlock between the
parent directory and the child file.
So it's safe to do file clone within the same file when the two
ranges are not overlapped.
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
EXTREF is treated same as REF, so we can make the code tidy.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
During splitting a leaf, pushing items around to hopefully get some space only
works when we have a parent, ie. we have at least one sibling leaf.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
As for splitting a leaf, root is just the leaf, and tree mod log does not apply
on leaf, so in this case, we don't do log_removal.
As for splitting a node, the old root is kept as a normal node and we have nicely
put records in tree mod log for moving keys and items, so in this case we don't do
that either.
As above, insert_new_root can get rid of log_removal.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Fix to return error code instead always return 0 from function
btrfs_check_trunc_cache_free_space().
Introduced by commit 7b61cd9224
(Btrfs: don't use global block reservation for inode cache truncation)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This fixes bugzilla 57491. If we take a snapshot of a fs with a unlink ongoing
and then try to send that root we will run into problems. When comparing with a
parent root we will search the parents and the send roots commit_root, which if
we've just created the snapshot will include the file that needs to be evicted
by the orphan cleanup. So when we find a changed extent we will try and copy
that info into the send stream, but when we lookup the inode we use the normal
root, which no longer has the inode because the orphan cleanup deleted it. The
best solution I have for this is to check our otransid with the generation of
the commit root and if they match just commit the transaction again, that way we
get the changes from the orphan cleanup. With this patch the reproducer I made
for this bugzilla no longer returns ESTALE when trying to do the send. Thanks,
Cc: stable@vger.kernel.org
Reported-by: Chris Wilson <jakdaw@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
when user runs command btrfs dev del the raid requisite error if any
goes to the /var/log/messages, its not good idea to clutter messages
with these user (knowledge) errors, further user don't have to review
the system messages to know problem with the cli it should be dropped
to the user as part of the cli return.
to bring this feature created a set of the ERROR defined
BTRFS_ERROR_DEV* error codes and created their error string.
I expect this enum to be added with other error which we might
want to communicate to the user land
v3:
moved the code with in the file no logical change
v1->v2:
introduce error codes for the device mgmt usage
v1:
adds a parameter in the ioctl arg struct to carry the error string
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We get lock inversion with umount if we allow iputs from sync_fs, so use the
delay iput flag to keep this from happening. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We used 3 variants to track the state of the transaction, it was complex
and wasted the memory space. Besides that, it was hard to understand that
which types of the transaction handles should be blocked in each transaction
state, so the developers often made mistakes.
This patch improved the above problem. In this patch, we define 6 states
for the transaction,
enum btrfs_trans_state {
TRANS_STATE_RUNNING = 0,
TRANS_STATE_BLOCKED = 1,
TRANS_STATE_COMMIT_START = 2,
TRANS_STATE_COMMIT_DOING = 3,
TRANS_STATE_UNBLOCKED = 4,
TRANS_STATE_COMPLETED = 5,
TRANS_STATE_MAX = 6,
}
and just use 1 variant to track those state.
In order to make the blocked handle types for each state more clear,
we introduce a array:
unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
[TRANS_STATE_RUNNING] = 0U,
[TRANS_STATE_BLOCKED] = (__TRANS_USERSPACE |
__TRANS_START),
[TRANS_STATE_COMMIT_START] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH),
[TRANS_STATE_COMMIT_DOING] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN),
[TRANS_STATE_UNBLOCKED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
[TRANS_STATE_COMPLETED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
}
it is very intuitionistic.
Besides that, because we remove ->in_commit in transaction structure, so
the lock ->commit_lock which was used to protect it is unnecessary, remove
->commit_lock.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We checked the commit time to avoid committing the transaction
frequently, but it is unnecessary because:
- It made the transaction commit spend more time, and delayed the
operation of the external writers(TRANS_START/TRANS_USERSPACE).
- Except the space that we have to commit transaction, such as
snapshot creation, btrfs doesn't commit the transaction on its
own initiative.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We used ->num_joined track if there were some writers which join the current
transaction when the committer was sleeping. If some writers joined the current
transaction, we has to continue the while loop to do some necessary stuff, such
as flush the ordered operations. But it is unnecessary because we will do it
after the while loop.
Besides that, tracking ->num_joined would make the committer drop into the while
loop when there are lots of internal writers(TRANS_JOIN).
So we remove ->num_joined and don't track if there are some writers which join
the current transaction when the committer is sleeping.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
It is unnecessary to flush the delalloc inodes again and again because
we don't care the dirty pages which are introduced after the flush, and
they will be flush in the transaction commit.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_commit_transaction has the following loop before we commit the
transaction.
do {
// attempt to do some useful stuff and/or sleep
} while (atomic_read(&cur_trans->num_writers) > 1 ||
(should_grow && cur_trans->num_joined != joined));
This is used to prevent from the TRANS_START to get in the way of a
committing transaction. But it does not prevent from TRANS_JOIN, that
is we would do this loop for a long time if some writers JOIN the
current transaction endlessly.
Because we need join the current transaction to do some useful stuff,
we can not block TRANS_JOIN here. So we introduce a external writer
counter, which is used to count the TRANS_USERSPACE/TRANS_START writers.
If the external writer counter is zero, we can break the above loop.
In order to make the code more clear, we don't use enum variant
to define the type of the transaction handle, use bitmask instead.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If the transaction is removed from the transaction list, it means the
transaction has been committed successfully. So it is impossible to
call cleanup_transaction(), otherwise there is something wrong with
the code logic. Thus, we use BUG_ON() instead of the original handle.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we umount a fs with serious errors, we will invoke btrfs_cleanup_transactions()
to clean up the residual transaction. At this time, It is impossible to start a new
transaction, so we needn't assign trans_no_join to 1, and also needn't clear running
transaction every time we destroy a residual transaction.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Before applying this patch, we need flush all the delalloc inodes in
the fs when we want to create a snapshot, it wastes time, and make
the transaction commit be blocked for a long time. It means some other
user operation would also be blocked for a long time.
This patch improves this problem, we just flush the delalloc inodes that
in the source trees before snapshot creation, so the transaction commit
will complete quickly.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The reason we introduce per-subvolume ordered extent list is the same
as the per-subvolume delalloc inode list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we create a snapshot, we need flush all delalloc inodes in the
fs, just flushing the inodes in the source tree is OK. So we introduce
per-subvolume delalloc inode list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The grab/put funtions will be used in the next patch, which need grab
the root object and ensure it is not freed. We use reference counter
instead of the srcu lock is to aovid blocking the memory reclaim task,
which invokes synchronize_srcu().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There are several functions whose code is similar, such as
btrfs_find_last_root()
btrfs_read_fs_root_no_radix()
Besides that, some functions are invoked twice, it is unnecessary,
for example, we are sure that all roots which is found in
btrfs_find_orphan_roots()
have their orphan items, so it is unnecessary to check the orphan
item again.
So cleanup it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The snapshot/subvolume deletion might spend lots of time, it would make
the remount task wait for a long time. This patch improve this problem,
we will break the deletion if the fs is remounted to be R/O. It will make
the users happy.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If the fs is remounted to be R/O, it is unnecessary to call
btrfs_clean_one_deleted_snapshot(), so move the R/O check out of
this function. And besides that, it can make the check logic in the
caller more clear.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
In order to avoid the R/O remount, we acquired ->s_umount lock during
we deleted the dead snapshots and subvolumes. But it is unnecessary,
because we have cleaner_mutex.
We use cleaner_mutex to protect the process of the dead snapshots/subvolumes
deletion. And when we remount the fs to be R/O, we also acquire this mutex to
do cleanup after we change the status of the fs. That is this lock can serialize
the above operations, the cleaner can be aware of the status of the fs, and if
the cleaner is deleting the dead snapshots/subvolumes, the remount task will
wait for it. So it is safe to remove ->s_umount in cleaner_kthread().
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_read_fs_root_no_name() already checks if btrfs_root_refs()
is zero and returns ENOENT in this case. There is no need to do
it again in six places.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
No need to check for NULL in send.c and disk-io.c.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We don't need to copy it back to user side as it remains unchanged.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Clean up the format of the definitions of BTRFS_BLOCK_GROUP_RAID5 and
BTRFS_BLOCK_GROUP_RAID6.
Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
sctx is removed from the argument of the function that
doesn't use sctx.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The size parameter to btrfs_extend_item() is the number of bytes
to add to the item, not the size of the item after the operation
(like it is for btrfs_truncate_item(), there the size parameter
is not the number of bytes to take away, but the total size of
the item after truncation).
Fix it in the comment.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_qgroup_wait_for_completion waits until the currently running qgroup
operation completes. It returns immediately when no rescan process is in
progress. This is useful to automate things around the rescan process (e.g.
testing).
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When doing qgroup accounting, we call ulist_alloc()/ulist_free() every time
when we want to walk qgroup tree.
By introducing 'qgroup_ulist', we only need to call ulist_alloc()/ulist_free()
once. This reduce some sys time to allocate memory, see the measurements below
fsstress -p 4 -n 10000 -d $dir
With this patch:
real 0m50.153s
user 0m0.081s
sys 0m6.294s
real 0m51.113s
user 0m0.092s
sys 0m6.220s
real 0m52.610s
user 0m0.096s
sys 0m6.125s avg 6.213
-----------------------------------------------------
Without the patch:
real 0m54.825s
user 0m0.061s
sys 0m10.665s
real 1m6.401s
user 0m0.089s
sys 0m11.218s
real 1m13.768s
user 0m0.087s
sys 0m10.665s avg 10.849
we can see the sys time reduce ~43%.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We want to know if there are debugging features compiled in, this may
affect performance. The message is printed before the sanity checks.
Also kill version.h file that serves no purpose, we don't use any
version tag for kernel module.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The 'end' value must exactly cover the end of the interval, which means
one byte less than the expected block alignment, or in case of a file
smaller than one block, one byte less than the inode size.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Code checked for raid 5 flag in two else-if branches, so code would never be reached. Probably a copy-paste bug.
Signed-off-by: Henrik Nordvik <henrikno@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Pull btrfs fixes from Chris Mason:
"This is an assortment of crash fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: stop all workers before cleaning up roots
Btrfs: fix use-after-free bug during umount
Btrfs: init relocate extent_io_tree with a mapping
btrfs: Drop inode if inode root is NULL
Btrfs: don't delete fs_roots until after we cleanup the transaction
Dave reported a panic because the extent_root->commit_root was NULL in the
caching kthread. That is because we just unset it in free_root_pointers, which
is not the correct thing to do, we have to either wait for the caching kthread
to complete or hold the extent_commit_sem lock so we know the thread has exited.
This patch makes the kthreads all stop first and then we do our cleanup. This
should fix the race. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave reported a NULL pointer deref. This is caused because he thought he'd be
smart and add sanity checks to the extent_io bit operations, but he didn't
expect a tree to have a NULL mapping. To fix this we just need to init the
relocation's processed_blocks with the btree_inode->i_mapping. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
There is a path where btrfs_drop_inode() is called with its inode's root
is NULL: In btrfs_new_inode(), when btrfs_set_inode_index() fails,
iput() is called. We should handle this case before taking look at the
root->root_item.
Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We get a use after free if we had a transaction to cleanup since there could be
delayed inodes which refer to their respective fs_root. Thanks
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The size parameter to btrfs_extend_item() is the number of bytes
to add to the item, not the size of the item after the operation
(like it is for btrfs_truncate_item(), there the size parameter
is not the number of bytes to take away, but the total size of
the item after truncation).
Fix it in the comment.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Currently there is no way to truncate partial page where the end
truncate point is not at the end of the page. This is because it was not
needed and the functionality was enough for file system truncate
operation to work properly. However more file systems now support punch
hole feature and it can benefit from mm supporting truncating page just
up to the certain point.
Specifically, with this functionality truncate_inode_pages_range() can
be changed so it supports truncating partial page at the end of the
range (currently it will BUG_ON() if 'end' is not at the end of the
page).
This commit changes the invalidatepage() address space operation
prototype to accept range to be invalidated and update all the instances
for it.
We also change the block_invalidatepage() in the same way and actually
make a use of the new length argument implementing range invalidation.
Actual file system implementations will follow except the file systems
where the changes are really simple and should not change the behaviour
in any way .Implementation for truncate_page_range() which will be able
to accept page unaligned ranges will follow as well.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Pull btrfs fixes from Chris Mason:
"Miao Xie has been very busy, fixing races and enospc problems and many
other small but important pieces.
Alexandre Oliva discovered some problems with how our error handling
was interacting with the block layer and for now has disabled our
partial handling of sub-page writes. The real sub-page work is in a
series of patches from IBM that we still need to integrate and test.
The code Alexandre has turned off was really incomplete.
Josef has more error handling fixes and an important fix for the new
skinny extent format.
This also has my fix for the tracepoint crash from late in 3.9. It's
the first stage in a larger clean up to get rid of btrfs_bio and make
a proper bioset for all the items we need to tack into the bio. For
now the bioset only holds our mirror_num and stripe_index, but for the
next merge window I'll shuffle more in."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
Btrfs: use a btrfs bioset instead of abusing bio internals
Btrfs: make sure roots are assigned before freeing their nodes
Btrfs: explicitly use global_block_rsv for quota_tree
btrfs: do away with non-whole_page extent I/O
Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context
Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix()
Btrfs: pause the space balance when remounting to R/O
Btrfs: fix unprotected root node of the subvolume's inode rb-tree
Btrfs: fix accessing a freed tree root
Btrfs: return errno if possible when we fail to allocate memory
Btrfs: update the global reserve if it is empty
Btrfs: don't steal the reserved space from the global reserve if their space type is different
Btrfs: optimize the error handle of use_block_rsv()
Btrfs: don't use global block reservation for inode cache truncation
Btrfs: don't abort the current transaction if there is no enough space for inode cache
Correct allowed raid levels on balance.
Btrfs: fix possible memory leak in replace_path()
Btrfs: fix possible memory leak in the find_parent_nodes()
Btrfs: don't allow device replace on RAID5/RAID6
Btrfs: handle running extent ops with skinny metadata
...
Btrfs has been pointer tagging bi_private and using bi_bdev
to store the stripe index and mirror number of failed IOs.
As bios bubble back up through the call chain, we use these
to decide if and how to retry our IOs. They are also used
to count IO failures on a per device basis.
Recently a bio tracepoint was added lead to crashes because
we were abusing bi_bdev.
This commit adds a btrfs bioset, and creates explicit fields
for the mirror number and stripe index. The plan is to
extend this structure for all of the fields currently in
struct btrfs_bio, which will mean one less kmalloc in
our IO path.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Tejun Heo <tj@kernel.org>
If we fail to load the chunk tree we'll call free_root_pointers, except we may
not have assigned the roots for the dev_root/extent_root/csum_root yet, so we
could NULL pointer deref at this point. Just add checks to make sure these
roots are set to keep us from panicing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The quota_tree was set up to use the empty_block_rsv before
which would be problematic when the filesystem is filled up
and ENOSPC happens during internal operations while the quota
tree is updated and COWed (when the btrfs_qgroup_info_item
items) are written. In fact, use_block_rsv() which is used
in btrfs_cow_block() falls back to the global_block_rsv in
this case. But just in order to make it more clear what is
happening, change it to explicitly use the global_block_rsv.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
end_bio_extent_readpage computes whole_page based on bv_offset and
bv_len, without taking into account that blk_update_request may modify
them when some of the blocks to be read into a page produce a read
error. This would cause the read to unlock only part of the file
range associated with the page, which would in turn leave the entire
page locked, which would not only keep the process blocked instead of
returning -EIO to it, but also prevent any further access to the file.
It turns out that btrfs always issues whole-page reads and writes.
The special handling of non-whole_page appears to be a mistake or a
left-over from a time when this wasn't the case. Indeed,
end_bio_extent_writepage distinguished between whole_page and
non-whole_page writes but behaved identically in both cases!
I've replaced the whole_page computations with warnings, just to be
sure that we're not issuing partial page reads or writes. The
warnings should probably just go away some time.
Signed-off-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_invalidate_inodes() may sleep, so we should not invoke it in the
spin lock context. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We have checked if ->node is NULL or not, so it is unnecessary to
use BUG_ON() to check again. Remove it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The root node of the rb-tree may be changed, so we should get it under
the lock. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
inode_tree_del() will move the tree root into the dead root list, and
then the tree will be destroyed by the cleaner. So if we remove the
delayed node which is cached in the inode after inode_tree_del(),
we may access a freed tree root. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need to set return value explicitly, otherwise we'll lose the error
value.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Before applying this patch, we reserved the space for the global reserve
by the minimum unit if we found it is empty, it was unreasonable and
inefficient, because if the global reserve space was depleted, it implied
that the size of the global reserve was too small. In this case, we shoud
update the global reserve and fill it.
Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If the type of the space we need is different with the global reserve, we
can not steal the space from the global reserve, because we can not allocate
the space from the free space cache that the global reserve points to.
Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
It is very likely that there are lots of subvolumes/snapshots in the filesystem,
so if we use global block reservation to do inode cache truncation, we may hog
all the free space that is reserved in global rsv. So it is better that we do
the free space reservation for inode cache truncation by ourselves.
Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The filesystem with inode cache was forced to be read-only when we umounted it.
Steps to reproduce:
# mkfs.btrfs -f ${DEV}
# mount -o inode_cache ${DEV} ${MNT}
# dd if=/dev/zero of=${MNT}/file1 bs=1M count=8192
# btrfs fi syn ${MNT}
# dd if=${MNT}/file1 of=/dev/null bs=1M
# rm -f ${MNT}/file1
# btrfs fi syn ${MNT}
# umount ${MNT}
It is because there was no enough space to do inode cache truncation, and then
we aborted the current transaction.
But no space error is not a serious problem when we write out the inode cache,
and it is safe that we just skip this step if we meet this problem. So we need
not abort the current transaction.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Raid5 with 3 devices is well defined while the old logic allowed
raid5 only with a minimum of 4 devices when converting the block group
profile via btrfs balance. Creating a raid5 with just three devices
using mkfs.btrfs worked always as expected. This is now fixed and the
whole logic is rewritten.
Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
In replace_path(), if read_tree_block() fails, we cannot return
directly, we should free some allocated memory otherwise memory
leak happens.
Similar to Wang's "Btrfs: fix possible memory leak in the
find_parent_nodes()" patch, the current commit fixes an issue that
is related to the "Btrfs: fix all callers of read_tree_block"
commit.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
In the find_parent_nodes(), if read_tree_block() fails, we can
not return directly, we should free some allocated memory otherwise
memory leak happens.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This is not yet supported and causes crashes. One sad user reported
that it destroyed his filesystem.
One failure is in __btrfs_map_block+0xc1f calling kmalloc(0).
0x5f21f is in __btrfs_map_block (fs/btrfs/volumes.c:4923).
4918 num_stripes = map->num_stripes;
4919 max_errors = nr_parity_stripes(map);
4920
4921 raid_map = kmalloc(sizeof(u64) * num_stripes,
4922 GFP_NOFS);
4923 if (!raid_map) {
4924 ret = -ENOMEM;
4925 goto out;
4926 }
4927
There might be more issues. Until this is really tested, don't allow
users to start the procedure on RAID5/RAID6 filesystems.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Chris hit a bug where we weren't finding extent records when running extent ops.
This is because we use the delayed_ref_head when running the extent op, which
means we can't use the ->type checks to see if we are metadata. We also lose
the level of the metadata we are working on. So to fix this we can just check
the ->is_data section of the extent_op, and we can store the level of the buffer
we were modifying in the extent_op. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This catches block groups that are too large to properly cache. We deal with
this case fine, so the warning just confuses users. Remove the warning.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I'm sorry, theres no excuse for this sort of work. We need to use
root->leafsize since eb may be NULL. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The search ioctl skips items that are too large for a result buffer, but
inline items of a certain size occuring before any search result is
found would trigger an overflow and stop the search entirely.
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=57641
Cc: stable@vger.kernel.org
Signed-off-by: Gabriel de Perthuis <g2p.code+btrfs@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
lock_extent/unlock_extent expect an exclusive end.
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Quota tree has been missing from lockdep annotations, though no warning
has been seen in the wild.
There's currently one entry that does not belong there,
BTRFS_ORPHAN_OBJECTID. No such tree exists, it's probably a copy &
paste mistake, the id is defined among tree ids.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Pull btrfs update from Chris Mason:
"These are mostly fixes. The biggest exceptions are Josef's skinny
extents and Jan Schmidt's code to rebuild our quota indexes if they
get out of sync (or you enable quotas on an existing filesystem).
The skinny extents are off by default because they are a new variation
on the extent allocation tree format. btrfstune -x enables them, and
the new format makes the extent allocation tree about 30% smaller.
I rebased this a few days ago to rework Dave Sterba's crc checks on
the super block, but almost all of these go back to rc6, since I
though 3.9 was due any minute.
The biggest missing fix is the tracepoint bug that was hit late in
3.9. I ran into problems with that in overnight testing and I'm still
tracking it down. I'll definitely have that fixed for rc2."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (101 commits)
Btrfs: allow superblock mismatch from older mkfs
btrfs: enhance superblock checks
btrfs: fix misleading variable name for flags
btrfs: use unsigned long type for extent state bits
Btrfs: improve the loop of scrub_stripe
btrfs: read entire device info under lock
btrfs: remove unused gfp mask parameter from release_extent_buffer callchain
btrfs: handle errors returned from get_tree_block_key
btrfs: make static code static & remove dead code
Btrfs: deal with errors in write_dev_supers
Btrfs: remove almost all of the BUG()'s from tree-log.c
Btrfs: deal with free space cache errors while replaying log
Btrfs: automatic rescan after "quota enable" command
Btrfs: rescan for qgroups
Btrfs: split btrfs_qgroup_account_ref into four functions
Btrfs: allocate new chunks if the space is not enough for global rsv
Btrfs: separate sequence numbers for delayed ref tracking and tree mod log
btrfs: move leak debug code to functions
Btrfs: return free space in cow error path
Btrfs: set UUID in root_item for created trees
...
Pull block core updates from Jens Axboe:
- Major bit is Kents prep work for immutable bio vecs.
- Stable candidate fix for a scheduling-while-atomic in the queue
bypass operation.
- Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
discard bios.
- Tejuns changes to convert the writeback thread pool to the generic
workqueue mechanism.
- Runtime PM framework, SCSI patches exists on top of these in James'
tree.
- A few random fixes.
* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
relay: move remove_buf_file inside relay_close_buf
partitions/efi.c: replace useless kzalloc's by kmalloc's
fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
block: fix max discard sectors limit
blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
Documentation: cfq-iosched: update documentation help for cfq tunables
writeback: expose the bdi_wq workqueue
writeback: replace custom worker pool implementation with unbound workqueue
writeback: remove unused bdi_pending_list
aoe: Fix unitialized var usage
bio-integrity: Add explicit field for owner of bip_buf
block: Add an explicit bio flag for bios that own their bvec
block: Add bio_alloc_pages()
block: Convert some code to bio_for_each_segment_all()
block: Add bio_for_each_segment_all()
bounce: Refactor __blk_queue_bounce to not use bi_io_vec
raid1: use bio_copy_data()
pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
pktcdvd: use bio_copy_data()
block: Add bio_copy_data()
...
We've added new checks to make sure the super block crc is correct
during mount. A fresh filesystem from an older mkfs won't have the
crc set. This adds a warning when it finds a newly created filesystem
but doesn't fail the mount.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The superblock checksum is not verified upon mount. <awkward silence>
Add that check and also reorder existing checks to a more logical
order.
Current mkfs.btrfs does not calculate the correct checksum of
super_block and thus a freshly created filesytem will fail to mount when
this patch is applied.
First transaction commit calculates correct superblock checksum and
saves it to disk.
Reproducer:
$ mfks.btrfs /dev/sda
$ mount /dev/sda /mnt
$ btrfs scrub start /mnt
$ sleep 5
$ btrfs scrub status /mnt
... super:2 ...
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The variable was named 'data' in btrfs_reserve_extent and that's the
only function that actually uses it to let btrfs_get_alloc_profile know
what profile we want. Then it's passed down as u64 flags.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
1) Right now scrub_stripe() is looping in some unnecessary cases:
* when the found extent item's objectid has been out of the dev extent's range
but we haven't finish scanning all the range within the dev extent
* when all the items has been processed but we haven't finish scanning all the
range within the dev extent
In both cases, we can just finish the loop to save costs.
2) Besides, when the found extent item's length is larger than the stripe
len(64k), we don't have to release the path and search again as it'll get at the
same key used in the last loop, we can instead increase the logical cursor in
place till all space of the extent is scanned.
3) And we use 0 as the key's offset to search btree, then get to previous item
to find a smaller item, and again have to move to the next one to get the right
item. Setting offset=-1 and previous_item() is the correct way.
4) As we won't find any checksum at offset unless this 'offset' is in a data
extent, we can just find checksum when we're really going to scrub an extent.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There's a theoretical possibility of reading stale (or even more
theoretically, freed) data from DEV_INFO ioctl when the device would
disappear between an early mutex unlock and data being copied from the
device structure.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.
removed functions:
btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()
btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.
ulist.c functions are left, another patch will take care of those.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If you try to mount -o loop a restored file system it will panic if the file
ends up being smaller than the original disk. This is because we go to try and
get a block for a super that may be past the EOF which makes __getblk return
NULL for a buffer head when we aren't expecting it to. Fix this by dealing with
this case and just jacking up the errors count. With this patch we no longer
panic when mounting a restored file system loopback. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There were a whole bunch and I was doing it for other things. I haven't tested
these error paths but at the very least this is better than panicing. I've only
left 2 BUG_ON()'s since they are logic errors and I want to replace them with a
ASSERT framework that we can compile out for production users. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
So everybody who got hit by my fsync bug will still continue to hit this
BUG_ON() in the free space cache, which is pretty heavy handed. So I took a
file system that had this bug and fixed up all the BUG_ON()'s and leaks that
popped up when I tried to mount a broken file system like this. With this patch
we just fail to mount instead of panicing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When qgroup tracking is enabled, we do an automatic cycle of the new rescan
mechanism.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If qgroup tracking is out of sync, a rescan operation can be started. It
iterates the complete extent tree and recalculates all qgroup tracking data.
This is an expensive operation and should not be used unless required.
A filesystem under rescan can still be umounted. The rescan continues on the
next mount. Status information is provided with a separate ioctl while a
rescan operation is in progress.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The function is separated into a preparation part and the three accounting
steps mentioned in the qgroups documentation. The goal is to make steps two
and three usable by the rescan functionality. A side effect is that the
function is restructured into readable subunits.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When running the 208th of xfstests, the fs returned the enospc
error when there was lots of free space in the disk.
By bisect debug, we found it was introduced by commit 96f1bb5777.
This commit makes the space check for the global reservation in
can_overcommit() be inconsistent with should_alloc_chunk().
can_overcommit() requires that the free space is 2 times the size
of the global reservation, or we can't do overcommit. And instead,
we need reclaim some reserved space, and if we still don't have
enough free space, we need allocate a new chunk. But unfortunately,
should_alloc_chunk() just requires that the free space is 1 time
the size of the global reservation, that is we would not try to
allocate a new chunk if the free space size is in the middle of
these two requires, and just return the enospc error. Fix it.
Cc: Jim Schutt <jaschut@sandia.gov>
Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Sequence numbers for delayed refs have been introduced in the first version
of the qgroup patch set. To solve the problem of find_all_roots on a busy
file system, the tree mod log was introduced. The sequence numbers for that
were simply shared between those two users.
However, at one point in qgroup's quota accounting, there's a statement
accessing the previous sequence number, that's still just doing (seq - 1)
just as it would have to in the very first version.
To satisfy that requirement, this patch makes the sequence number counter 64
bit and splits it into a major part (used for qgroup sequence number
counting) and a minor part (incremented for each tree modification in the
log). This enables us to go exactly one major step backwards, as required
for qgroups, while still incrementing the sequence counter for tree mod log
insertions to keep track of their order. Keeping them in a single variable
means there's no need to change all the code dealing with comparisons of two
sequence numbers.
The sequence number is reset to 0 on commit (not new in this patch), which
ensures we won't overflow the two 32 bit counters.
Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs
from the tree mod log code may happen.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Clean up the leak debugging in extent_io.c by moving
the debug code into functions. This also removes the
list_heads used for debugging from the extent_buffer
and extent_state structures when debug is not enabled.
Since we need a global debug config to do that last
part, implement CONFIG_BTRFS_DEBUG to accommodate.
Thanks to Dave Sterba for the Kconfig bit.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Replace some BUG_ONs with proper handling and take allocated space back to
free space cache for later use.
We don't have to worry about extent maps since they'd be freed in releasepage
path.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
It is a rare exception that a new tree is created, like the qgroups
tree. So far these new trees have an all-zero UUID in their root
items. All trees that mkfs.btrfs has created get an UUID during the
first mount when btrfs_read_root_item() rewrites the root_item to
the v2 structure style. These UUID are never used so far, but
anyway, since it is better to have it uniform for all trees, this
commit adds some lines that generate and write an UUID for newly
created trees.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
fget() returns NULL if error. So, we should check NULL or not.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I have a broken file system that when it aborts leaves all sorts of accounting
things wrong and gives you lots of WARN_ON()'s other than the abort. This is
because we're not cleaning up various parts of the file system when we abort.
The first chunks are specific to mount failures, we weren't cleaning up the
block group cached inodes and we weren't cleaning up any transactions that had
been aborted, which leaves a bunch of things laying around.
The second half of this are related to the cleanup parts. First we don't need
to release space for the dirty pages from the trans_block_rsv, that's all
handled by the trans handles so this is just plain wrong. The other thing is we
need to pin down extents that were set ->must_insert_reserved for delayed refs.
This isn't so much for the pinning but more for the cleaning up the
cache->reserved counter since we are no longer going to use those reserved
bytes. With this patch I no longer see a bunch of WARN_ON()'s when I try to
mount this broken file system, just the initial one from the abort. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We can just look up the extent_buffers for the range and free stuff that way.
This makes the cleanup a bit cleaner and we can make sure to evict the
extent_buffers pretty quickly by marking them as stale. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need to check the return value of the commit in case something goes wrong,
otherwise we could end up going down the line and doing more stuff (like orphan
cleanup) before we notice we should have errored out. We need to do this before
we free up the log_tree_root since the caller will handle all of that. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We can run the tree logging recovery or the orphan cleanup on mount, so we'll
end up looking up a random fs tree in the meantime. So we need to clean this up
so we don't leave extent buffers hanging around on the cache. With this patch
we no longer leak extent buffers on failure to mount. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This is the same as the fix from commit
Btrfs: fix bad extent logging
but for O_DIRECT. I missed this when I fixed the problem originally, we were
still using the em for the orig_start and orig_block_len, which would be the
merged extent. We need to use the actual extent from the on disk file extent
item, which we have to lookup to make sure it's ok to nocow anyway so just pass
in some pointers to hold this info. Thanks,
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We kept leaking extent buffers when mounting a broken file system and it turns
out it's because not everybody uses read_tree_block properly. You need to check
and make sure the extent_buffer is uptodate before you use it. This patch fixes
everybody who calls read_tree_block directly to make sure they check that it is
uptodate and free it and return an error if it is not. With this we no longer
leak EB's when things go horribly wrong. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we fail to load block groups halfway through we can leave extent_state's on
the excluded tree. This is because we just lookup the supers and add them to
the excluded tree regardless of which block group we are looking at currently.
This is a problem because we remove the excluded extents for the range of the
block group only, so if we don't ever load a block group for one of the excluded
extents we won't ever free it. This fixes the problem by only adding excluded
extents if it falls in the block group range we care about. With this patch
we're no longer leaking space when we fail to read all of the block groups.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
With a users corrupted fs I was getting weird behavior and panics and it turns
out it was because one of his tree blocks had a bogus header level. So add this
to the sanity checks in the endio handler for tree blocks. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user sent me a btrfs-image that was panicing because of some corruption. This
is because we pass in a bogus value to btrfs_num_copies, and it panics. Instead
just return 1. We only call btrfs_num_copies to see if there are other copies
to try and read for things, so if we just return 1 it will make the callers exit
out with an appropriate error value. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Martin Steigerwald reported a BUG_ON() where we were given a bogus bytenr to
map. Turns out he is using > PAGESIZE leafsizes. The readahead stuff is called
every time we do a completion, but we may not have finished reading in all the
pages, so the bytenr we read off the node could be completely bogus. Fix this
by only calling the readahead hook once all pages have been read in. Thanks,
Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Martin Steigerwald reported a BUG_ON() in btrfs_map_block where we didn't find
a chunk for a particular block we were trying to map. This happened because the
block was bogus. We shouldn't be BUG_ON()'ing in this case, just print a
message and return an error. This came from reada_add_block and it appears to
deal with an error fine so we should be good there. Thanks,
Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need to tag metadata io with REQ_META to avoid priority inversion when using
io throttling cqroups. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
So I noticed there is an infinite loop in the slow caching code. If we return 1
when we hit the end of the tree, so we could end up caching the last block group
the slow way and suddenly we're looping forever because we just keep
re-searching and trying again. Fix this by only doing btrfs_next_leaf() if we
don't need_resched(). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The locking order for stuff is
__sb_start_write
ordered_mutex
but with sync() we don't do __sb_start_write for some strange reason, which
means that our iput in wait_ordered_extents could start a transaction which does
the __sb_start_write while we're holding the ordered_mutex. Fix this by using
delayed iput in sync. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Since all the quota configurations are loaded in memory, and we can
have ioctl checks before operating in the disk. It is safe to do such
things because qgroup_ioctl_lock is held outside.
Without these extra checks firstly, it should be ok to do user change
for quota operations. For example:
if we want to add an existed qgroup, we will do:
->add_qgroup_item()
->add_qgroup_rb()
add_qgroup_item() will return -EEXIST to us, however, qgroups are all
in memory, why not check them in memory firstly.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
ulist_add() may return -ENOMEM, fix missing check about
return value.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
For created snapshots, the full root_item is copied from the source
root and afterwards selectively modified. The current code forgets
to clear the field received_uuid. The only problem is that it is
confusing when you look at it with 'btrfs subv list', since for
writable snapshots, the contents of the snapshot can be completely
unrelated to the previously received snapshot.
The receiver ignores such snapshots anyway because he also checks
the field stransid in the root_item and that value used to be reset
to zero for all created snapshots.
This commit changes two things:
- clear the received_uuid field for new writable snapshots.
- don't clear the send/receive related information like the stransid
for read-only snapshots (which makes them useable as a parent for
the automatic selection of parents in the receive code).
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave reported a BUG_ON() that happened in end_page_writeback() after an abort.
This happened because we unconditionally call end_page_writeback() in the endio
case, which is right. However when we abort the transaction we will call
end_page_writeback() on any writeback pages we find, which is wrong. We need to
lock the page and wait on page writeback to complete if it is. There is nothing
unsafe about this since we are discarding the transaction anyway. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need such a sanity check for wrong start when we defrag a file, otherwise,
even with a wrong start that's larger than file size, we can end up changing
not only inode's force compress flag but also FS's incompat flags.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This fixes the following errors:
fs/btrfs/reada.c: In function ‘btrfs_reada_wait’:
fs/btrfs/reada.c:958:42: error: invalid operands to binary < (have ‘atomic_t’ and ‘int’)
fs/btrfs/reada.c:961:41: error: invalid operands to binary < (have ‘atomic_t’ and ‘int’)
Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: linux-btrfs@vger.kernel.org
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Argument 'trans' became unnecessary from setup_inline_extent_backref()
that called btrfs_extend_item().
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Argument 'trans' is not used in btrfs_extend_item().
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If argument 'trans' is unnecessary in the function where
fixup_low_keys() is called, 'trans' is deleted.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Argument 'trans' is not used in fixup_low_keys(). So, remove it.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Step to reproduce:
mkfs.btrfs <disk>
mount <disk> <mnt>
dd if=/dev/zero of=/<mnt>/data bs=1M count=10
sync
btrfs quota enable <mnt>
btrfs qgroup create 0/5 <mnt>
btrfs qgroup limit 5M 0/5 <mnt>
rm -f /<mnt>/data
sync
btrfs qgroup show <mnt>
dd if=/dev/zero of=data bs=1M count=1
>From the perspective of users, qgroup's referenced or exclusive
is negative,but user can not continue to write data! a workaround
way is to cast u64 to s64 when doing qgroup reservation.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If out of memory happens, we should return -ENOMEM directly to the caller
rather than continue the work.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
In the comment describing the sync_writers field of the btrfs_inode
struct, "fsyncing" was misspelled "fsycing."
Signed-off-by: Nathaniel Yazdani <n1ght.4nd.d4y@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When tree_mod_log_rewind decides to make a copy of the current tree buffer
for its modifications, it subsequently freed the buffer before unlocking it.
Obviously, those operations are required in reverse order.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The tree mod log functions were accessing root->node->... directly, without
use of btrfs_root_node() or explicit rcu locking. This could lead to an
extent buffer reference being leaked and another reference being freed too
early when preemtion was enabled.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Commit d9abbf1c changed tree mod log locking around ROOT_REPLACE operations.
When a tree root is split, however, we were logging removal of all elements
from the root node before logging removal of half of the elements for the
split operation. This leads to a BUG_ON when rewinding.
This commit removes the erroneous logging of removal of all elements.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The following case will make the incompat/compat flag of the super block
be recovered.
Task1 |Task2
flags = btrfs_super_incompat_flags(); |
|flags = btrfs_super_incompat_flags();
flags |= new_flag1; |
|flags |= new_flag2;
btrfs_set_super_incompat_flags(flags); |
|btrfs_set_super_incompat_flags(flags);
the new_flag1 is recovered.
In order to avoid this problem, we introduce a lock named super_lock into
the btrfs_fs_info structure. If we want to update incompat/compat flags
of the super block, we must hold it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The new mount option is set after parsing the remount arguments,
so it is wrong that checking the autodefrag is close or not at
btrfs_remount_prepare(). Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Walking backref tree and btrfs quota rely on ulist very much.
This patch tries to use rb_tree to speed up search time.
The original code always checks whether an element
exists before adding a new element, however it costs O(n).
I try to add a rb_tree in the ulist,this is only used to speed up
search. I also do some measurements with quota enabled.
fsstress -p 4 -n 10000
Without this path:
real 0m51.058s 2m4.745s 1m28.222s 1m5.137s
user 0m0.035s 0m0.041s 0m0.105s 0m0.100s
sys 0m12.009s 0m11.246s 0m10.901s 0m10.999s 0m11.287s
With this path:
real 0m55.295s 0m50.960s 1m2.214s 0m48.273s
user 0m0.053s 0m0.095s 0m0.135s 0m0.107s
sys 0m7.766s 0m6.013s 0m6.319s 0m6.030s 0m6.532s
After applying the patch,the execute time is down by ~42%.(11.287s->6.532s)
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Two new flags are added to allow omitting the stream header and the
end command for btrfs send streams. This is used in cases where you
send multiple snapshots back-to-back in one stream.
This used to be encoded like this (with 2 snapshots in this example):
<stream header> + <sequence of commands> + <end cmd> +
<stream header> + <sequence of commands> + <end cmd> + EOF
The new format (if the two new flags are used) is this one:
<stream header> + <sequence of commands> +
<sequence of commands> + <end cmd>
Note that the currently existing receivers treat <end cmd> only as
an indication that a new <stream header> is following. This means,
you can just skip the sequence <end cmd> <stream header> without
loosing compatibility. As long as an EOF is following, the currently
existing receivers handle the new format (if the two new flags are
used) exactly as the old one.
So what is the benefit of this change? The goal is to be able to use
a single stream (one TCP connection) to multiplex a request/response
handshake plus Btrfs send streams, all in the same stream. In this
case you cannot evaluate an EOF condition as an end of the Btrfs send
stream. You need something else, and the <end cmd> is just perfect
for this purpose.
The summary is:
The format change is driven by the need to send several Btrfs send
streams over a single TCP connections, with the ability for a repeated
request/response handshake in the middle. And this format change does
not break any existing tool, it is completely compatible.
You could compare the old behaviour of the Btrfs send stream to the
one of ftp where you need a seperate request/response channel and
newly opened data transfer channels for each file, while the new
behaviour is more like http using a single stream for everything.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
__merge_refs() always return 0, it is unnecessary
for the caller to check the return value.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The only error return value of __add_prelim_ref() is -ENOMEM,
just return errors rather than trigger BUG_ON().
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Steps to reproduce:
mkfs.btrfs <disk>
mount <disk> <mnt>
btrfs quota enable <mnt>
btrfs sub create <mnt>/subv
btrfs qgroup limit 10K <mnt>/subv
btrfs quota disable <mnt>/subv
It is wrong for qgroup to reserve when disabling quota,
so just use tree_root to avoid edquot when disabling quota.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Step to reproduce:
mkfs.btrfs <disk>
mount <disk> <mnt>
btrfs quota enable <mnt>
btrfs qgroup limit 0/1 <mnt>
dmesg
If the relative qgroup dosen't exist, flag 'BTRFS_QGROUP_STATUS_
FLAG_INCONSISTENT' will be set, and print the noise message.
This is wrong, we can just move find_qgroup_rb() before
update_qgroup_limit_item().this dosen't change the logic of the
function. But it can avoid unnecessary noise message and wrong set of flag.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The original code forgot to check 'inherit', we should
gurantee that all the qgroups in the struct 'inherit' exist.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Step to reproduce:
mkfs.btrfs <disk>
mount <disk> <mnt>
btrfs quota enable <mnt>
btrfs qgroup assign 0/1 1/1 <mnt>
umount <mnt>
btrfs-debug-tree <disk> | grep QGROUP
If we want to add a qgroup relation, we should gurantee that
'src' and 'dst' exist, otherwise, such qgroup relation should
not be allowed to create.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We use mutex lock to protect all the user change operations.
So when we are calling find_qgroup_rb() to check whether qgroup
exists, we don't have to hold spin_lock.
Besides, when enabling/disabling quota, it must be single thread
when operations come here. spin lock must be firstly used to
clear quota_root when disabling quota, while enabling quota, spin
lock must be used to complete the last assign work.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The original code has one spin_lock 'qgroup_lock' to protect quota
configurations in memory. If we want to add a BTRFS_QGROUP_INFO_KEY,
it will be added to Btree firstly, and then update configurations in
memory,however, a race condition may happen between these operations.
For example:
->add_qgroup_info_item()
->add_qgroup_rb()
For the above case, del_qgroup_info_item() may happen just before
add_qgroup_rb().
What's worse, when we want to add a qgroup relation:
->add_qgroup_relation_item()
->add_qgroup_relations()
We don't have any checks whether 'src' and 'dst' exist before
add_qgroup_relation_item(), a race condition can also happen for
the above case.
To avoid race condition and have all the necessary checks, we introduce
a mutex lock 'qgroup_ioctl_lock', and we make all the user change operations
protected by the mutex lock.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
__btrfs_unlink_inode() aborts its transaction when it sees errors after
it removes the directory item. But it missed the case where
btrfs_del_dir_entries_in_log() returns an error. If this happens then
the unlink appears to fail but the items have been removed without
updating the directory size. The directory then has leaked bytes in
i_size and can never be removed.
Adding the missing transaction abort at least makes this failure
consistent with the other failure cases.
I noticed this while reading the code after someone on irc reported
having a directory with i_size but no entries. I tested it by forcing
btrfs_del_dir_entries_in_log() to return -ENOMEM.
Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This:
# mkfs.btrfs /dev/sdb{1,2} ; wipefs -a /dev/sdb1; mount /dev/sdb2 /mnt/test
would lead to a blkdev open/close mismatch when the mount fails, and
a permanently busy (opened O_EXCL) sdb2:
# wipefs -a /dev/sdb2
wipefs: error: /dev/sdb2: probing initialization failed: Device or resource busy
It's because btrfs_open_devices() may open some devices, fail on
the last one, and return that failure stored in "ret." The mount
then fails, but the caller then does not clean up the open devices.
Chris assures me that:
"btrfs_open_devices just means: go off and open every bdev you can from
this uuid. It should return success if we opened any of them at all."
So change the logic to ignore any open failures; just skip processing
of that device. Later on it's decided whether we have enough devices
to continue.
Reported-by: Jan Safranek <jsafrane@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
It is very likely that there are several blocks in bio, it is very
inefficient if we get their csums one by one. This patch improves
this problem by getting the csums in batch.
According to the result of the following test, the execute time of
__btrfs_lookup_bio_sums() is down by ~28%(300us -> 217us).
# dd if=<mnt>/file of=/dev/null bs=1M count=1024
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user sent me a btrfs-image of a file system that was panicing on mount during
the log recovery. I had originally thought these problems were from a bug in
the free space cache code, but that was just a symptom of the problem. The
problem is if your application does something like this
[prealloc][prealloc][prealloc]
the internal extent maps will merge those all together into one extent map, even
though on disk they are 3 separate extents. So if you go to write into one of
these ranges the extent map will be right since we use the physical extent when
doing the write, but when we log the extents they will use the wrong sizes for
the remainder prealloc space. If this doesn't happen to trip up the free space
cache (which it won't in a lot of cases) then you will get bogus entries in your
extent tree which will screw stuff up later. The data and such will still work,
but everything else is broken. This patch fixes this by not allowing extents
that are on the modified list to be merged. This has the side effect that we
are no longer adding everything to the modified list all the time, which means
we now have to call btrfs_drop_extents every time we log an extent into the
tree. So this allows me to drop all this speciality code I was using to get
around calling btrfs_drop_extents. With this patch the testcase I've created no
longer creates a bogus file system after replaying the log. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When logging changed extents I was logging ram_bytes as the current length,
which isn't correct, it's supposed to be the ram bytes of the original extent.
This is for compression where even if we split the extent we need to know the
ram bytes so when we uncompress the extent we know how big it will be. This was
still working out right with compression for some reason but I think we were
getting lucky. It was definitely off for prealloc which is why I noticed it,
btrfsck was complaining about it. With this patch btrfsck no longer complains
after a log replay. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave was hitting a lockdep warning because we're now properly taking the ordered
operations mutex in the ordered wait stuff. This is because some cases we will
have a trans handle when we are flushing delalloc space, but we can't wait on
ordered extents because we could potentially deadlock, so fix this by not doing
the wait if we have a trans handle. Thanks
Reported-and-tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I noticed that we will add a block group to the space info before we add it to
the block group cache rb tree, so we could potentially allocate from the block
group before it's able to be searched for. I don't think this is too much of
a problem, the race window is microscopic, but just in case move the tree
insertion to above the space info linking. This makes it easier to adjust the
error handling as well, so we can remove a couple of BUG_ON(ret)'s and have real
error handling setup for these scenarios. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If btrfs_find_all_roots() fails, 'roots' has been freed or 'roots'
fails to allocate. We don't need to free it outside btrfs_find_all_roots()
again.Fix it.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The reason that BUG_ON() happens in these places is just
because of ENOMEM.
We try ro return ENOMEM rather than trigger BUG_ON(), the
caller will abort the transaction thus avoiding the kernel panic.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user reported a panic while running a balance. What was happening was he was
relocating a block, which added the reference to the relocation tree. Then
relocation would walk through the relocation tree and drop that reference and
free that block, and then it would walk down a snapshot which referenced the
same block and add another ref to the block. The problem is this was all
happening in the same transaction, so the parent block was free'ed up when we
drop our reference which was immediately available for allocation, and then it
was used _again_ to add a reference for the same block from a different
snapshot. This resulted in something like this in the delayed ref tree
add ref to 90234880, parent=2067398656, ref_root 1766, level 1
del ref to 90234880, parent=2067398656, ref_root 18446744073709551608, level 1
add ref to 90234880, parent=2067398656, ref_root 1767, level 1
as you can see the ref_root's don't match, because when we inc the ref we use
the header owner, which is the original tree the block belonged to, instead of
the data reloc tree. Then when we remove the extent we use the reloc tree
objectid. But none of this matters, since it is a shared reference which means
only the parent matters. When the delayed ref stuff runs it adds all the
increments first, and then does all the drops, to make sure that we don't delete
the ref if we net a positive ref count. But tree blocks aren't allowed to have
multiple refs from the same block, so this panics when it tries to add the
second ref. We need the add and the drop to cancel each other out in memory so
we only do the final add.
So to fix this we need to adjust how the delayed refs are added to the tree.
Only the ref_root matters when it is a normal backref, and only the parent
matters when it is a shared backref. So make our decision based on what ref
type we have. This allows us to keep the ref_root in memory in case anybody
wants to use it for something else, and it allows the delayed refs to be merged
properly so we don't end up with this panic.
With this patch the users image no longer panics on mount, and it has a clean
fsck after a normal mount/umount cycle. Thanks,
Cc: stable@vger.kernel.org
Reported-by: Roman Mamedov <rm@romanrm.ru>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Testing my enospc log code I managed to abort a transaction during mount, which
put me into an infinite loop. This is because of two things, first we don't
reset trans_no_join if we abort during transaction commit, which will force
anybody trying to start a transaction to just loop endlessly waiting for it to
be set to 0. But this is still just a symptom, the second issue is we don't set
the fs state to error during errors on mount. This is because we don't want to
do the flip read only thing during mount, but we still really want to set the fs
state to an error to keep us from even getting to the trans_no_join check. So
fix both of these things, make sure to reset trans_no_join if we abort during a
commit, and make sure we set the fs state to error no matter if we're mounting
or not. This should keep us from getting into this infinite loop again.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Steps to reproduce:
mkfs.btrfs <disk>
mount <disk> <mnt>
btrfs quota enable <mnt>
btrfs sub create <mnt>/subv
i=1
while [ $i -le 10000 ]
do
dd if=/dev/zero of=<mnt>/subv/data_$i bs=1K count=1
i=$(($i+1))
if [ $i -eq 500 ]
then
btrfs quota disable $mnt
fi
done
dmesg
Obviously, this warn_on() is unnecessary, and it will be easily triggered.
Just remove it.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
set_extent_bit()'s (u64 *failed_start) expects NULL not 0.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The subvolume ioctls block on the parent directory mutex that can be
held by other concurrent snapshot activity for a long time. Give the
user at least some chance to get out of this situation by allowing
to send a kill signal.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The messages
btrfs: unlinked 123 orphans
btrfs: truncated 456 orphans
are not useful to regular users and raise questions whether there are
problems with the filesystem.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This mount option was a workaround when subvol= assumed path relative
to the default subvolume, not the toplevel one. This was fixed long time
ago and subvolrootid has no effect.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
With more than one btrfs volume mounted, it can be very difficult to find
out which volume is hitting an error. btrfs_error() will print this, but
it is currently rigged as more of a fatal error handler, while many of
the printk()s are currently for debugging and yet-unhandled cases.
This patch just changes the functions where the device information is
already available. Some cases remain where the root or fs_info is not
passed to the function emitting the error.
This may introduce some confusion with volumes backed by multiple devices
emitting errors referring to the primary device in the set instead of the
one on which the error occurred.
Use btrfs_printk(fs_info, format, ...) rather than writing the device
string every time, and introduce macro wrappers ala XFS for brevity.
Since the function already cannot be used for continuations, print a
newline as part of the btrfs_printk() message rather than at each caller.
Signed-off-by: Simon Kirby <sim@hostway.ca>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The Kconfig title does not make much sense after the cleanup of
CONFIG_EXPERIMENTAL option, align the wording with other filesystems.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Each time pick one dead root from the list and let the caller know if
it's needed to continue. This should improve responsiveness during
umount and balance which at some point waits for cleaning all currently
queued dead roots.
A new dead root is added to the end of the list, so the snapshots
disappear in the order of deletion.
The snapshot cleaning work is now done only from the cleaner thread and the
others wake it if needed.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We currently store the first key of the tree block inside the reference for the
tree block in the extent tree. This takes up quite a bit of space. Make a new
key type for metadata which holds the level as the offset and completely removes
storing the btrfs_tree_block_info inside the extent ref. This reduces the size
from 51 bytes to 33 bytes per extent reference for each tree block. In practice
this results in a 30-35% decrease in the size of our extent tree, which means we
COW less and can keep more of the extent tree in memory which makes our heavy
metadata operations go much faster. This is not an automatic format change, you
must enable it at mkfs time or with btrfstune. This patch deals with having
metadata stored as either the old format or the new format so it is easy to
convert. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
free_root_pointers() has been introduced to cleanup all of tree roots,
so just use it instead.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The transaction abort stacktrace is printed only once per module
lifetime, but we'd like to see it each time it happens per mounted
filesystem. Introduce a fs_state flag that records it.
Tweak the messages around abort:
* add error number to the first abort
* print the exact negative errno from btrfs_decode_error
* clean up btrfs_decode_error and callers
* no dots at the end of the messages
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We keep hitting bugs in the tree log replay because btrfs_remove_free_space
doesn't account for some corner case. So add a bunch of tests to try and fully
test btrfs_remove_free_space since the only time it is called is during tree log
replay. These tests all finish successfully, so as we find more of these bugs
we need to add to these tests to make sure we don't regress in fixing things.
I've hidden the tests behind a Kconfig option, but they take no time to run so
all btrfs developers should have this turned on all the time. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Pull VFS updates from Al Viro,
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).
7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...
Pull one more btrfs fix from Chris Mason:
"This has a recent fix from Josef for our tree log replay code. It
fixes problems where the inode counter for the number of bytes in the
file wasn't getting updated properly during fsync replay.
The commit did get rebased this morning, but it was only to clean up
the subject line. The code hasn't changed."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: make sure nbytes are right after log replay
While trying to track down a tree log replay bug I noticed that fsck was always
complaining about nbytes not being right for our fsynced file. That is because
the new fsync stuff doesn't wait for ordered extents to complete, so the inodes
nbytes are not necessarily updated properly when we log it. So to fix this we
need to set nbytes to whatever it is on the inode that is on disk, so when we
replay the extents we can just add the bytes that are being added as we replay
the extent. This makes it work for the case that we have the wrong nbytes or
the case that we logged everything and nbytes is actually correct. With this
I'm no longer getting nbytes errors out of btrfsck.
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Tejun writes:
-----
This is the pull request for the earlier patchset[1] with the same
name. It's only three patches (the first one was committed to
workqueue tree) but the merge strategy is a bit involved due to the
dependencies.
* Because the conversion needs features from wq/for-3.10,
block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts
with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those
workqueue conflicts from flaring up in block tree.
* Resolving the issue that Jan and Dave raised about debugging
requires arch-wide changes. The patchset is being worked on[2] but
it'll have to go through -mm after these changes show up in -next,
and not included in this pull request.
The three commits are located in the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue
Pulling it into block/for-3.10/core produces a conflict in
drivers/md/raid5.c between the following two commits.
e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
2f6db2a707 ("raid5: use bio_reset()")
The conflict is trivial - one removes an "if ()" conditional while the
other removes "rbi->bi_next = NULL" right above it. We just need to
remove both. The merged branch is available at
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge
so that you can use it for verification. The test merge commit has
proper merge description.
While these changes are a bit of pain to route, they make code simpler
and even have, while minute, measureable performance gain[3] even on a
workload which isn't particularly favorable to showing the benefits of
this conversion.
----
Fixed up the conflict.
Conflicts:
drivers/md/raid5.c
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull btrfs fixes from Chris Mason:
"We've had a busy two weeks of bug fixing. The biggest patches in here
are some long standing early-enospc problems (Josef) and a very old
race where compression and mmap combine forces to lose writes (me).
I'm fairly sure the mmap bug goes all the way back to the introduction
of the compression code, which is proof that fsx doesn't trigger every
possible mmap corner after all.
I'm sure you'll notice one of these is from this morning, it's a small
and isolated use-after-free fix in our scrub error reporting. I
double checked it here."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: don't drop path when printing out tree errors in scrub
Btrfs: fix wrong return value of btrfs_lookup_csum()
Btrfs: fix wrong reservation of csums
Btrfs: fix double free in the btrfs_qgroup_account_ref()
Btrfs: limit the global reserve to 512mb
Btrfs: hold the ordered operations mutex when waiting on ordered extents
Btrfs: fix space accounting for unlink and rename
Btrfs: fix space leak when we fail to reserve metadata space
Btrfs: fix EIO from btrfs send in is_extent_unchanged for punched holes
Btrfs: fix race between mmap writes and compression
Btrfs: fix memory leak in btrfs_create_tree()
Btrfs: fix locking on ROOT_REPLACE operations in tree mod log
Btrfs: fix missing qgroup reservation before fallocating
Btrfs: handle a bogus chunk tree nicely
Btrfs: update to use fs_state bit
A user reported a panic where we were panicing somewhere in
tree_backref_for_extent from scrub_print_warning. He only captured the trace
but looking at scrub_print_warning we drop the path right before we mess with
the extent buffer to print out a bunch of stuff, which isn't right. So fix this
by dropping the path after we use the eb if we need to. Thanks,
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we don't find the expected csum item, but find a csum item which is
adjacent to the specified extent, we should return -EFBIG, or we should
return -ENOENT. But btrfs_lookup_csum() return -EFBIG even the csum item
is not adjacent to the specified extent. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We reserve the space for csums only when we write data into a file, in
the other cases, such as tree log, log replay, we don't do reservation,
so we can use the reservation of the transaction handle just for the former.
And for the latter, we should use the tree's own reservation. But the
function - btrfs_csum_file_blocks() didn't differentiate between these
two types of the cases, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The function btrfs_find_all_roots is responsible to allocate
memory for 'roots' and free it if errors happen,so the caller should not
free it again since the work has been done.
Besides,'tmp' is allocated after the function btrfs_find_all_roots,
so we can return directly if btrfs_find_all_roots() fails.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user reported a problem where he was getting early ENOSPC with hundreds of
gigs of free data space and 6 gigs of free metadata space. This is because the
global block reserve was taking up the entire free metadata space. This is
ridiculous, we have infrastructure in place to throttle if we start using too
much of the global reserve, so instead of letting it get this huge just limit it
to 512mb so that users can still get work done. This allowed the user to
complete his rsync without issues. Thanks
Cc: stable@vger.kernel.org
Reported-and-tested-by: Stefan Priebe <s.priebe@profihost.ag>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need to hold the ordered_operations mutex while waiting on ordered extents
since we splice and run the ordered extents list. We need to make sure anybody
else who wants to wait on ordered extents does actually wait for them to be
completed. This will keep us from bailing out of flushing in case somebody is
already waiting on ordered extents to complete. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We are way over-reserving for unlink and rename. Rename is just some random
huge number and unlink accounts for tree log operations that don't actually
happen during unlink, not to mention the tree log doesn't take from the trans
block rsv anyway so it's completely useless. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave reported a warning when running xfstest 275. We have been leaking delalloc
metadata space when our reservations fail. This is because we were improperly
calculating how much space to free for our checksum reservations. The problem
is we would sometimes free up space that had already been freed in another
thread and we would end up with negative usage for the delalloc space. This
patch fixes the problem by calculating how much space the other threads would
have already freed, and then calculate how much space we need to free had we not
done the reservation at all, and then freeing any excess space. This makes
xfstests 275 no longer have leaked space. Thanks
Cc: stable@vger.kernel.org
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When you take a snapshot, punch a hole where there has been data, then take
another snapshot and try to send an incremental stream, btrfs send would
give you EIO. That is because is_extent_unchanged had no support for holes
being punched. With this patch, instead of returning EIO we just return
0 (== the extent is not unchanged) and we're good.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Cc: Alexander Block <ablock84@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Btrfs uses page_mkwrite to ensure stable pages during
crc calculations and mmap workloads. We call clear_page_dirty_for_io
before we do any crcs, and this forces any application with the file
mapped to wait for the crc to finish before it is allowed to change
the file.
With compression on, the clear_page_dirty_for_io step is happening after
we've compressed the pages. This means the applications might be
changing the pages while we are compressing them, and some of those
modifications might not hit the disk.
This commit adds the clear_page_dirty_for_io before compression starts
and makes sure to redirty the page if we have to fallback to
uncompressed IO as well.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Alexandre Oliva <oliva@gnu.org>
cc: stable@vger.kernel.org
Bunch of places in the code weren't using it where they could be -
this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx
into a struct bvec_iter.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: "Ed L. Cashin" <ecashin@coraid.com>
CC: Nick Piggin <npiggin@kernel.dk>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Jim Paris <jim@jtan.com>
CC: Geoff Levand <geoff@infradead.org>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ed Cashin <ecashin@coraid.com>
Just a little convenience macro - main reason to add it now is preparing
for immutable bio vecs, it'll reduce the size of the patch that puts
bi_sector/bi_size/bi_idx into a struct bvec_iter.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: linux-s390@vger.kernel.org
CC: Chris Mason <chris.mason@fusionio.com>
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
We should free leaf and root before returning from the error
handling code.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
To resolve backrefs, ROOT_REPLACE operations in the tree mod log are
required to be tied to at least one KEY_REMOVE_WHILE_FREEING operation.
Therefore, those operations must be enclosed by tree_mod_log_write_lock()
and tree_mod_log_write_unlock() calls.
Those calls are private to the tree_mod_log_* functions, which means that
removal of the elements of an old root node must be logged from
tree_mod_log_insert_root. This partly reverts and corrects commit ba1bfbd5
(Btrfs: fix a tree mod logging issue for root replacement operations).
This fixes the brand-new version of xfstest 276 as of commit cfe73f71.
Cc: stable@vger.kernel.org
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Steps to reproduce:
mkfs.btrfs <disk>
mount <disk> <mnt>
btrfs quota enable <mnt>
btrfs sub create <mnt>/subv
btrfs qgroup limit 10M <mnt>/subv
fallocate --length 20M <mnt>/subv/data
For the above example, fallocating will return successfully which
is not expected, we try to fix it by doing qgroup reservation before
fallocating.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If you restore a btrfs-image file system and try to mount that file system we'll
panic. That's because btrfs-image restores and just makes one big chunk to
envelope the whole disk, since they are really only meant to be messed with by
our btrfs-progs. So fix up btrfs_rmap_block and the callers of it for mount so
that we no longer panic but instead just return an error and fail to mount.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Now that we use bit operation to check fs_state, update
btrfs_free_fs_root()'s checker, otherwise we get back to
memory leak case.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fixes from Chris Mason:
"Eric's rcu barrier patch fixes a long standing problem with our
unmount code hanging on to devices in workqueue helpers. Liu Bo
nailed down a difficult assertion for in-memory extent mappings."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix warning of free_extent_map
Btrfs: fix warning when creating snapshots
Btrfs: return as soon as possible when edquot happens
Btrfs: return EIO if we have extent tree corruption
btrfs: use rcu_barrier() to wait for bdev puts at unmount
Btrfs: remove btrfs_try_spin_lock
Btrfs: get better concurrency for snapshot-aware defrag work
Users report that an extent map's list is still linked when it's actually
going to be freed from cache.
The story is that
a) when we're going to drop an extent map and may split this large one into
smaller ems, and if this large one is flagged as EXTENT_FLAG_LOGGING which means
that it's on the list to be logged, then the smaller ems split from it will also
be flagged as EXTENT_FLAG_LOGGING, and this is _not_ expected.
b) we'll keep ems from unlinking the list and freeing when they are flagged with
EXTENT_FLAG_LOGGING, because the log code holds one reference.
The end result is the warning, but the truth is that we set the flag
EXTENT_FLAG_LOGGING only during fsync.
So clear flag EXTENT_FLAG_LOGGING for extent maps split from a large one.
Reported-by: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Creating snapshot passes extent_root to commit its transaction,
but it can lead to the warning of checking root for quota in
the __btrfs_end_transaction() when someone else is committing
the current transaction. Since we've recorded the needed root
in trans_handle, just use it to get rid of the warning.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If one of qgroup fails to reserve firstly, we should return immediately,
it is unnecessary to continue check.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The callers of lookup_inline_extent_info all handle getting an error back
properly, so return an error if we have corruption instead of being a jerk and
panicing. Still WARN_ON() since this is kind of crucial and I've been seeing it
a bit too much recently for my taste, I think we're doing something wrong
somewhere. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Doing this would reliably fail with -EBUSY for me:
# mount /dev/sdb2 /mnt/scratch; umount /mnt/scratch; mkfs.btrfs -f /dev/sdb2
...
unable to open /dev/sdb2: Device or resource busy
because mkfs.btrfs tries to open the device O_EXCL, and somebody still has it.
Using systemtap to track bdev gets & puts shows a kworker thread doing a
blkdev put after mkfs attempts a get; this is left over from the unmount
path:
btrfs_close_devices
__btrfs_close_devices
call_rcu(&device->rcu, free_device);
free_device
INIT_WORK(&device->rcu_work, __free_device);
schedule_work(&device->rcu_work);
so unmount might complete before __free_device fires & does its blkdev_put.
Adding an rcu_barrier() to btrfs_close_devices() causes unmount to wait
until all blkdev_put()s are done, and the device is truly free once
unmount completes.
Cc: stable@vger.kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Remove a useless function declaration
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Using spinning case instead of blocking will result in better concurrency
overall.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull namespace bugfixes from Eric Biederman:
"This is three simple fixes against 3.9-rc1. I have tested each of
these fixes and verified they work correctly.
The userns oops in key_change_session_keyring and the BUG_ON triggered
by proc_ns_follow_link were found by Dave Jones.
I am including the enhancement for mount to only trigger requests of
filesystem modules here instead of delaying this for the 3.10 merge
window because it is both trivial and the kind of change that tends to
bit-rot if left untouched for two months."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Use nd_jump_link in proc_ns_follow_link
fs: Limit sys_mount to only request filesystem modules (Part 2).
fs: Limit sys_mount to only request filesystem modules.
userns: Stop oopsing in key_change_session_keyring
Pull btrfs fixes from Chris Mason:
"These are scattered fixes and one performance improvement. The
biggest functional change is in how we throttle metadata changes. The
new code bumps our average file creation rate up by ~13% in fs_mark,
and lowers CPU usage.
Stefan bisected out a regression in our allocation code that made
balance loop on extents larger than 256MB."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: improve the delayed inode throttling
Btrfs: fix a mismerge in btrfs_balance()
Btrfs: enforce min_bytes parameter during extent allocation
Btrfs: allow running defrag in parallel to administrative tasks
Btrfs: avoid deadlock on transaction waiting list
Btrfs: do not BUG_ON on aborted situation
Btrfs: do not BUG_ON in prepare_to_reloc
Btrfs: free all recorded tree blocks on error
Btrfs: build up error handling for merge_reloc_roots
Btrfs: check for NULL pointer in updating reloc roots
Btrfs: fix unclosed transaction handler when the async transaction commitment fails
Btrfs: fix wrong handle at error path of create_snapshot() when the commit fails
Btrfs: use set_nlink if our i_nlink is 0
The delayed inode code batches up changes to the btree in hopes of doing
them in bulk. As the changes build up, processes kick off worker
threads and wait for them to make progress.
The current code kicks off an async work queue item for each delayed
node, which creates a lot of churn. It also uses a fixed 1 HZ waiting
period for the throttle, which allows us to build a lot of pending
work and can slow down the commit.
This changes us to watch a sequence counter as it is bumped during the
operations. We kick off fewer work items and have each work item do
more work.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Raid56 merge (merge commit e942f88) had mistakenly removed a call to
__cancel_balance(), which resulted in balance not cleaning up after itself
after a successful finish. (Cleanup includes switching the state, removing
the balance item and releasing mut_ex_op testnset lock.) Bring it back.
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Commit 24542bf7ea changed preallocation of
extents to cap the max size we try to allocate. It's a valid change,
but the extent reservation code is also used by balance, and that
can't tolerate a smaller extent being allocated.
__btrfs_prealloc_file_range already has a min_size parameter, which is
used by relocation to request a specific extent size. This commit
adds an extra check to enforce that minimum extent size.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
Commit 5ac00add added a testnset mutex and code that disallows
running administrative tasks in parallel. It is prevented that
the device add/delete/balance/replace/resize operations are
started in parallel. By mistake, the defragmentation operation
was included in the check for mutually exclusiveness as well.
This is fixed with this commit.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Only let one trans handle to wait for other handles, otherwise we
will get ABBA issues.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Btrfs balance can easily hit BUG_ON in these places, but we want
to it bail out gracefully after we force the whole filesystem to
readonly. So we use btrfs_std_error hook in place of BUG_ON.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We can bail out from here gracefully instead of a cold BUG_ON.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We first use btrfs_std_error hook to replace with BUG_ON, and we
also need to cleanup what is left, including reloc roots rbtree
and reloc roots list.
Here we use a helper function to cleanup both rbtree and list, and
since this function can also be used in the balance recover path,
we also make the change as well to keep code simple.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If the async transaction commitment failed, we need close the
current transaction handler, or the current transaction will be
blocked to commit because of this orphan handler.
We fix the problem by doing sync transaction commitment, that is
to invoke btrfs_commit_transaction().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There are several bugs at error path of create_snapshot() when the
transaction commitment failed.
- access the freed transaction handler. At the end of the
transaction commitment, the transaction handler was freed, so we
should not access it after the transaction commitment.
- we were not aware of the error which happened during the snapshot
creation if we submitted a async transaction commitment.
- pending snapshot access vs pending snapshot free. when something
wrong happened after we submitted a async transaction commitment,
the transaction committer would cleanup the pending snapshots and
free them. But the snapshot creators were not aware of it, they
would access the freed pending snapshots.
This patch fixes the above problems by:
- remove the dangerous code that accessed the freed handler
- assign ->error if the error happens during the snapshot creation
- the transaction committer doesn't free the pending snapshots,
just assigns the error number and evicts them before we unblock
the transaction.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need to inc the nlink of deleted entries when running replay so we can do the
unlink on the fs_root and get everything cleaned up and then have the orphan
cleanup do the right thing. The problem is inc_nlink complains about this, even
thought it still does the right thing. So use set_nlink() if our i_nlink is 0
to keep users from seeing the warnings during log replay. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.
A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.
Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.
Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives. Allowing simple, safe,
well understood work-arounds to known problematic software.
This also addresses a rare but unfortunate problem where the filesystem
name is not the same as it's module name and module auto-loading
would not work. While writing this patch I saw a handful of such
cases. The most significant being autofs that lives in the module
autofs4.
This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.
After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module. The common pattern in the kernel is to call request_module()
without regards to the users permissions. In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesytem module unless the filesystem is mounted. In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull btrfs fixup from Chris Mason:
"Geert and James both sent this one in, sorry guys"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs/raid56: Add missing #include <linux/vmalloc.h>
tilegx_defconfig:
fs/btrfs/raid56.c: In function 'btrfs_alloc_stripe_hash_table':
fs/btrfs/raid56.c:206:3: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
fs/btrfs/raid56.c:206:9: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/raid56.c:226:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs update from Chris Mason:
"The biggest feature in the pull is the new (and still experimental)
raid56 code that David Woodhouse started long ago. I'm still working
on the parity logging setup that will avoid inconsistent parity after
a crash, so this is only for testing right now. But, I'd really like
to get it out to a broader audience to hammer out any performance
issues or other problems.
scrub does not yet correct errors on raid5/6 either.
Josef has another pass at fsync performance. The big change here is
to combine waiting for metadata with waiting for data, which is a big
latency win. It is also step one toward using atomics from the
hardware during a commit.
Mark Fasheh has a new way to use btrfs send/receive to send only the
metadata changes. SUSE is using this to make snapper more efficient
at finding changes between snapshosts.
Snapshot-aware defrag is also included.
Otherwise we have a large number of fixes and cleanups. Eric Sandeen
wins the award for removing the most lines, and I'm hoping we steal
this idea from XFS over and over again."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits)
btrfs: fixup/remove module.h usage as required
Btrfs: delete inline extents when we find them during logging
btrfs: try harder to allocate raid56 stripe cache
Btrfs: cleanup to make the function btrfs_delalloc_reserve_metadata more logic
Btrfs: don't call btrfs_qgroup_free if just btrfs_qgroup_reserve fails
Btrfs: remove reduplicate check about root in the function btrfs_clean_quota_tree
Btrfs: return ENOMEM rather than use BUG_ON when btrfs_alloc_path fails
Btrfs: fix missing deleted items in btrfs_clean_quota_tree
btrfs: use only inline_pages from extent buffer
Btrfs: fix wrong reserved space when deleting a snapshot/subvolume
Btrfs: fix wrong reserved space in qgroup during snap/subv creation
Btrfs: remove unnecessary dget_parent/dput when creating the pending snapshot
btrfs: remove a printk from scan_one_device
Btrfs: fix NULL pointer after aborting a transaction
Btrfs: fix memory leak of log roots
Btrfs: copy everything if we've created an inline extent
btrfs: cleanup for open-coded alignment
Btrfs: do not change inode flags in rename
Btrfs: use reserved space for creating a snapshot
clear chunk_alloc flag on retryable failure
...
We want to avoid module.h where posible, since it in turn includes
nearly all of header space. This means removing it where it is not
required, and using export.h where we are only exporting symbols via
EXPORT_SYMBOL and friends.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Apparently when we do inline extents we allow the data to overlap the last chunk
of the btrfs_file_extent_item, which means that we can possibly have a
btrfs_file_extent_item that isn't actually as large as a btrfs_file_extent_item.
This messes with us when we try to overwrite the extent when logging new extents
since we expect for it to be the right size. To fix this just delete the item
and try to do the insert again which will give us the proper sized
btrfs_file_extent_item. This fixes a panic where map_private_extent_buffer
would blow up because we're trying to write past the end of the leaf. Thanks,
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The stripe hash table is large, starting with allocation order 4 and can go as
high as order 7 in case lock debugging is turned on and structure padding
happens.
Observed mount failure:
mount: page allocation failure: order:7, mode:0x200050
Pid: 8234, comm: mount Tainted: G W 3.8.0-default+ #267
Call Trace:
[<ffffffff81114353>] warn_alloc_failed+0xf3/0x140
[<ffffffff811171d2>] ? __alloc_pages_direct_compact+0x92/0x250
[<ffffffff81117ac3>] __alloc_pages_nodemask+0x733/0x9d0
[<ffffffff81152878>] ? cache_alloc_refill+0x3f8/0x840
[<ffffffff811528bc>] cache_alloc_refill+0x43c/0x840
[<ffffffff811302eb>] ? is_kernel_percpu_address+0x4b/0x90
[<ffffffffa00a00ac>] ? btrfs_alloc_stripe_hash_table+0x5c/0x130 [btrfs]
[<ffffffff811531d7>] kmem_cache_alloc_trace+0x247/0x270
[<ffffffffa00a00ac>] btrfs_alloc_stripe_hash_table+0x5c/0x130 [btrfs]
[<ffffffffa003133f>] open_ctree+0xb2f/0x1f90 [btrfs]
[<ffffffff81397289>] ? string+0x49/0xe0
[<ffffffff813987b3>] ? vsnprintf+0x443/0x5d0
[<ffffffffa0007cb6>] btrfs_mount+0x526/0x600 [btrfs]
[<ffffffff8115127c>] ? cache_alloc_debugcheck_after+0x4c/0x200
[<ffffffff81162b90>] mount_fs+0x20/0xe0
[<ffffffff8117db26>] vfs_kern_mount+0x76/0x120
[<ffffffff811801b6>] do_mount+0x386/0x980
[<ffffffff8112a5cb>] ? strndup_user+0x5b/0x80
[<ffffffff81180840>] sys_mount+0x90/0xe0
[<ffffffff81962e99>] system_call_fastpath+0x16/0x1b
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The original code is a little confusing and not clear, The right
way to deal with the kernel code like this:
[...]
if (ret)
goto out;
[...]
So i move the common clean_up code to the place labeled with
out_fail, this will be easier to maintain.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
commit eb6b88d92c leads into another bug.
If it is just because qgroup_reserve fails, the function btrfs_qgroup_free
should not be called, otherwise, it will cause the wrong quota accounting.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The check work has been done just before the function btrfs_clean_quota_tree
is called, it is not necessary to check it again, remove it.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Return ENOMEM rather trigger BUG_ON, fix it.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Steps to reproduce:
i=0
ncases=100
mkfs.btrfs <disk>
mount <disk> <mnt>
btrfs quota enable <mnt>
btrfs qgroup create 2/1 <mnt>
while [ $i -le $ncases ]
do
btrfs qgroup create 1/$i <mnt>
btrfs qgroup assign 1/$i 2/1 <mnt>
i=$(($i+1))
done
btrfs quota disable <mnt>
umount <mnt>
btrfsck <mnt>
You can also use the commands:
btrfs-debug-tree <disk> | grep QGROUP
You will find there are still items existed.The reasons why this happens
is because the original code just checks slots[0]==0 and returns.
We try to fix it by deleting the leaf one by one.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The nodesize is capped at 64k and there are enough pages preallocated in
extent_buffer::inline_pages. The fallback to kmalloc never happened
because even on the smallest page size considered (4k) inline_pages
covered the needs.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When deleting a snapshot/subvolume, we need remove root ref/backref,
dir entries and update the dir inode, so we must reserve free space
for those operations.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There are two problems in the space reservation of the snapshot/
subvolume creation.
- don't reserve the space for the root item insertion
- the space which is reserved in the qgroup is different with
the free space reservation. we need reserve free space for
7 items, but in qgroup reservation, we need reserve space only
for 3 items.
So we implement new metadata reservation functions for the
snapshot/subvolume creation.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Since we have grabbed the parent inode at the beginning of the
snapshot creation, and both sync and async snapshot creation
release it after the pending snapshots are actually created,
it is safe to access the parent inode directly during the snapshot
creation, we needn't use dget_parent/dput to fix the parent dentry
and get the dir inode.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave pointed out that he saw messages from btrfs although there was no
such filesystem on his computers. The automatic device scan is called on
every new blockdevice if the usual distro udev rule set is used. The
printk introduced in 6f60cbd3ae was a remainder from copying
portions of code from btrfs_get_bdev_and_sb which is used under
different conditions and the warning makes sense there.
Reported-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
While doing cleanup work on an aborted transaction, we've set
the global running transaction pointer to NULL _before_ waiting all
other transaction handles to finish, so others'd hit NULL pointer
crash when referencing the global running transaction pointer.
This first sets a hint to avoid new transaction handle joining, then
waits other existing handles to abort or finish so that we can safely
set the above global pointer to NULL.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we abort a transaction while fsyncing, we'll skip freeing log roots
part of committing a transaction, which leads to memory leak.
This adds a 'free log roots' in putting super when no more users hold
references on log roots, so it's safe and clean.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I noticed while looking into a tree logging bug that we aren't logging inline
extents properly. Since this requires copying and it shouldn't happen too often
just force us to copy everything for the inode into the tree log when we have an
inline extent. With this patch we have valid data after a crash when we write
an inline extent. Thanks,
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
Though most of the btrfs codes are using ALIGN macro for page alignment,
there are still some codes using open-coded alignment like the
following:
------
u64 mask = ((u64)root->stripesize - 1);
u64 ret = (val + mask) & ~mask;
------
Or even hidden one:
------
num_bytes = (end - start + blocksize) & ~(blocksize - 1);
------
Sometimes these open-coded alignment is not so easy to understand for
newbie like me.
This commit changes the open-coded alignment to the ALIGN macro for a
better readability.
Also there is a previous patch from David Sterba with similar changes,
but the patch is for 3.2 kernel and seems not merged.
http://www.spinics.net/lists/linux-btrfs/msg12747.html
Cc: David Sterba <dave@jikos.cz>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Before we forced to change a file's NOCOW and COMPRESS flag due to
the parent directory's, but this ends up a bad idea, because it
confuses end users a lot about file's NOCOW status, eg. if someone
change a file to NOCOW via 'chattr' and then rename it in the current
directory which is without NOCOW attribute, the file will lose the
NOCOW flag silently.
This diables 'change flags in rename', so from now on we'll only
inherit flags from the parent directory on creation stage while in
other places we can use 'chattr' to set NOCOW or COMPRESS flags.
Reported-by: Marios Titas <redneb8888@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
While inserting dir index and updating inode for a snapshot, we'd
add delayed items which consume trans->block_rsv, if we don't have
any space reserved in this trans handle, we either just return or
reserve space again.
But before creating pending snapshots during committing transaction,
we've done a release on this trans handle, so we don't have space reserved
in it at this stage.
What we're using is block_rsv of pending snapshots which has already
reserved well enough space for both inserting dir index and updating
inode, so we need to set trans handle to indicate that we have space
now.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I've experienced filesystem freezes with permanent spikes in the active
process count for quite a while, particularly on filesystems whose
available raw space has already been fully allocated to chunks.
While looking into this, I found a pretty obvious error in
do_chunk_alloc: it sets space_info->chunk_alloc, but if
btrfs_alloc_chunk returns an error other than ENOSPC, it returns leaving
that flag set, which causes any other threads waiting for
space_info->chunk_alloc to become zero to spin indefinitely.
I haven't double-checked that this patch fixes the failure I've observed
fully (it's not exactly trivial to trigger), but it surely is a bug and
the fix is trivial, so... Please put it in :-)
What I saw in that function also happens to explain why in some cases I
see filesystems allocate a huge number of chunks that remain unused
(leading to the scenario above, of not having more chunks to allocate).
It happens for data and metadata, but not necessarily both. I'm
guessing some thread sets the force_alloc flag on the corresponding
space_info, and then several threads trying to get disk space end up
attempting to allocate a new chunk concurrently. All of them will see
the force_alloc flag and bump their local copy of force up to the level
they see first, and they won't clear it even if another thread succeeds
in allocating a chunk, thus clearing the force flag. Then each thread
that observed the force flag will, on its turn, force the allocation of
a new chunk. And any threads that come in while it does that will see
the force flag still set and pick it up, and so on. This sounds like a
problem to me, but... what should the correct behavior be? Clear
force_flag once we copy it to a local force? Reset force to the
incoming value on every loop? Set the flag to our incoming force if we
have it at first, clear our local flag, and move it from the space_info
when we determined that we are the thread that's going to perform the
allocation?
btrfs: clear chunk_alloc flag on retryable failure
From: Alexandre Oliva <oliva@gnu.org>
If btrfs_alloc_chunk fails with e.g. ENOMEM, we exit do_chunk_alloc
without clearing chunk_alloc in space_info. As a result, any further
calls to do_chunk_alloc on that filesystem will start busy-waiting for
chunk_alloc to be cleared, but it never will be. This patch adjusts
do_chunk_alloc so that it clears this flag in case of an error.
Signed-off-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When a subvolume is removed, we remove the root item from the root tree,
while the tree blocks and backrefs remain for a while. When backref walking
comes across one of those orphan tree blocks, it can find a backref for a
no longer existing root. This is all good, we only must tolerate
__resolve_indirect_ref returning an error and continue with the good refs
found.
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user reported hitting the BUG_ON() in btrfs_finished_ordered_io() where we had
csums on a NOCOW extent. This can happen if we have NODATACOW set but not
NODATASUM set, which can happen in two cases, either we mount with -o nodatacow
and then write into preallocated space, or chattr +C a directory and move a file
into that directory. Liu has fixed the move case in a different place, but this
fixes the mount -o nodatacow case. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This patch is a follow up on below patch:
[PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
commit: 216b6cbdcb
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Vivek Trivedi <t.vivek@samsung.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull trivial tree from Jiri Kosina:
"Assorted tiny fixes queued in trivial tree"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (22 commits)
DocBook: update EXPORT_SYMBOL entry to point at export.h
Documentation: update top level 00-INDEX file with new additions
ARM: at91/ide: remove unsused at91-ide Kconfig entry
percpu_counter.h: comment code for better readability
x86, efi: fix comment typo in head_32.S
IB: cxgb3: delay freeing mem untill entirely done with it
net: mvneta: remove unneeded version.h include
time: x86: report_lost_ticks doesn't exist any more
pcmcia: avoid static analysis complaint about use-after-free
fs/jfs: Fix typo in comment : 'how may' -> 'how many'
of: add missing documentation for of_platform_populate()
btrfs: remove unnecessary cur_trans set before goto loop in join_transaction
sound: soc: Fix typo in sound/codecs
treewide: Fix typo in various drivers
btrfs: fix comment typos
Update ibmvscsi module name in Kconfig.
powerpc: fix typo (utilties -> utilities)
of: fix spelling mistake in comment
h8300: Fix home page URL in h8300/README
xtensa: Fix home page URL in Kconfig
...
Here is the big driver core merge for 3.9-rc1
There are two major series here, both of which touch lots of drivers all
over the kernel, and will cause you some merge conflicts:
- add a new function called devm_ioremap_resource() to properly be
able to check return values.
- remove CONFIG_EXPERIMENTAL
If you need me to provide a merged tree to handle these resolutions,
please let me know.
Other than those patches, there's not much here, some minor fixes and
updates.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlEmV0cACgkQMUfUDdst+yncCQCfbmnQZju7kzWXk6PjdFuKspT9
weAAoMCzcAtEzzc4LXuUxxG/sXBVBCjW
=yWAQ
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core patches from Greg Kroah-Hartman:
"Here is the big driver core merge for 3.9-rc1
There are two major series here, both of which touch lots of drivers
all over the kernel, and will cause you some merge conflicts:
- add a new function called devm_ioremap_resource() to properly be
able to check return values.
- remove CONFIG_EXPERIMENTAL
Other than those patches, there's not much here, some minor fixes and
updates"
Fix up trivial conflicts
* tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits)
base: memory: fix soft/hard_offline_page permissions
drivercore: Fix ordering between deferred_probe and exiting initcalls
backlight: fix class_find_device() arguments
TTY: mark tty_get_device call with the proper const values
driver-core: constify data for class_find_device()
firmware: Ignore abort check when no user-helper is used
firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER
firmware: Make user-mode helper optional
firmware: Refactoring for splitting user-mode helper code
Driver core: treat unregistered bus_types as having no devices
watchdog: Convert to devm_ioremap_resource()
thermal: Convert to devm_ioremap_resource()
spi: Convert to devm_ioremap_resource()
power: Convert to devm_ioremap_resource()
mtd: Convert to devm_ioremap_resource()
mmc: Convert to devm_ioremap_resource()
mfd: Convert to devm_ioremap_resource()
media: Convert to devm_ioremap_resource()
iommu: Convert to devm_ioremap_resource()
drm: Convert to devm_ioremap_resource()
...
If we remount the fs to close the auto defragment or make the fs R/O,
we should stop the auto defragment.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When running the 083th case of xfstests on the filesystem with
"compress-force=lzo", the following WARNINGs were triggered.
WARNING: at fs/btrfs/inode.c:7908
WARNING: at fs/btrfs/inode.c:7909
WARNING: at fs/btrfs/inode.c:7911
WARNING: at fs/btrfs/extent-tree.c:4510
WARNING: at fs/btrfs/extent-tree.c:4511
This problem was introduced by the patch "Btrfs: fix deadlock due
to unsubmitted". In this patch, there are two bugs which caused
the above problem.
The 1st one is a off-by-one bug, if the DIO write return 0, it is
also a short write, we need release the reserved space for it. But
we didn't do it in that patch. Fix it by change "ret > 0" to
"ret >= 0".
The 2nd one is ->outstanding_extents was increased twice when
a short write happened. As we know, ->outstanding_extents is
a counter to keep track of the number of extent items we may
use duo to delalloc, when we reserve the free space for a
delalloc write, we assume that the write will introduce just
one extent item, so we increase ->outstanding_extents by 1 at
that time. And then we will increase it every time we split the
write, it is done at the beginning of btrfs_get_blocks_direct().
So when a short write happens, we needn't increase
->outstanding_extents again. But this patch done.
In order to fix the 2nd problem, I re-write the logic for
->outstanding_extents operation. We don't increase it at the
beginning of btrfs_get_blocks_direct(), instead, we just
increase it when the split actually happens.
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This comes from one of btrfs's project ideas,
As we defragment files, we break any sharing from other snapshots.
The balancing code will preserve the sharing, and defrag needs to grow this
as well.
Now we're able to fill the blank with this patch, in which we make full use of
backref walking stuff.
Here is the basic idea,
o set the writeback ranges started by defragment with flag EXTENT_DEFRAG
o at endio, after we finish updating fs tree, we use backref walking to find
all parents of the ranges and re-link them with the new COWed file layout by
adding corresponding backrefs.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We try to limit the size of a chunk to 10GB, which keeps the unit of
work reasonable during balance and resize operations. The limit checks
were taking into account the number of copies of the data we had but
what they really should be doing is comparing against the logical
size of the chunk we're creating.
This moves the code around a little to use the count of data stripes
from raid5/6.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Very large fallocate requests are cpu bound and result in extents with a
repeating pattern of ever decreasing size:
$ time fallocate -l 1T file
real 0m13.039s
( an excerpt of the extents from btrfs-debug-tree: )
prealloc data disk byte 1536292564992 nr 397312
prealloc data disk byte 1536292962304 nr 196608
prealloc data disk byte 1536293158912 nr 98304
prealloc data disk byte 1536293257216 nr 49152
prealloc data disk byte 1536293306368 nr 24576
prealloc data disk byte 1536293330944 nr 12288
prealloc data disk byte 1536293343232 nr 8192
prealloc data disk byte 1536293351424 nr 4096
prealloc data disk byte 1536293355520 nr 4096
prealloc data disk byte 1536293359616 nr 4096
The excessive cpu use comes from __btrfs_prealloc_file_range() trying to
allocate the entire remaining size after each extent is allocated.
btrfs_reserve_extent() repeatedly cuts this requested size in half until
it gets down to the size that the allocators can return. We limit the
problem for now by capping each reservation at 256 meg.
The small extents come from a masking bug when decreasing the requested
reservation size. The high 32bits are cleared and the remaining low
bits might happen to reserve a small size. Fix this by using
round_down() which properly casts the mask.
After these fixes huge fallocate requests are fast and result in nice
large extents:
$ time fallocate -l 1T file
real 0m0.082s
prealloc data disk byte 1112425889792 nr 268435456
prealloc data disk byte 1112694325248 nr 268435456
prealloc data disk byte 1112962760704 nr 268435456
Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
__btrfs_close_devices() clones btrfs device structs with
memcpy(). Some of the fields in the clone are reinitialized, but it's
missing to init io_lock. In mainline this goes unnoticed, but on RT it
leaves the plist pointing to the original about to be freed lock
struct.
Initialize io_lock after cloning, so no references to the original
struct are left.
Reported-and-tested-by: Mike Galbraith <efault@gmx.de>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We forget to free qgroup reservation in commit_transaction(),fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Cc: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The original code forget to check whether quota has been disabled firstly,
and it will return 'EINVAL' and return error to users if quota has been
disabled,it will be unfriendly and confusing for users to see that.
So just return directly if quota has been disabled.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Cc: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Right now inode cache inode is treated as the same as space cache
inode, ie. keep inode in memory till putting super.
But this leads to an awkward situation.
If we're going to delete a snapshot/subvolume, btrfs will not
actually delete it and return free space, but will add it to dead
roots list until the last inode on this snap/subvol being destroyed.
Then we'll fetch deleted roots and cleanup them via cleaner thread.
So here is the problem, if we enable inode cache option, each
snap/subvol has a cached inode which is used to store inode allcation
information. And this cache inode will be kept in memory, as the above
said. So with inode cache, snap/subvol can only be added into
dead roots list during freeing roots stage in umount, so that we can
ONLY get space back after another remount(we cleanup dead roots on mount).
But the real thing is we'll no more use the snap/subvol if we mark it
deleted, so we can safely iput its cache inode when we delete snap/subvol.
Another thing is that we need to change the rules of droping inode, we
don't keep snap/subvol's cache inode in memory till end so that we can
add snap/subvol into dead roots list in time.
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
In some cases, we need commit the current transaction, but don't want
to start a new one if there is no running transaction, so we introduce
the function - btrfs_attach_transaction(), which can catch the current
transaction, and return -ENOENT if there is no running transaction.
But no running transaction doesn't mean the current transction completely,
because we removed the running transaction before it completes. In some
cases, it doesn't matter. But in some special cases, such as freeze fs, we
hope the transaction is fully on disk, it will introduce some bugs, for
example, we may feeze the fs and dump the data in the disk, if the transction
doesn't complete, we would dump inconsistent data. So we need fix the above
problem for those cases.
We fixes this problem by introducing a function:
btrfs_attach_transaction_barrier()
if we hope all the transaction is fully on the disk, even they are not
running, we can use this function.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Now btrfs_commit_transaction() does this
ret = btrfs_run_ordered_operations(root, 0)
which async flushes all inodes on the ordered operations list, it introduced
a deadlock that transaction-start task, transaction-commit task and the flush
workers waited for each other.
(See the following URL to get the detail
http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
As we know, if ->in_commit is set, it means someone is committing the
current transaction, we should not try to join it if we are not JOIN
or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
the above problem. In this way, there is another benefit: there is no new
transaction handle to block the transaction which is on the way of commit,
once we set ->in_commit.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
In start_transactio(), we will try to join the transaction again after
the current transaction is committed, so we should not release the
reserved space of the qgroup. Fix it.
Cc: Arne Jansen <sensille@gmx.net>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
super.magic is an le64 but it's treated as an unterminated string when
compared against BTRFS_MAGIC which is defined as a string. Instead
define BTRFS_MAGIC as a normal hex value and use endian helpers to
compare it to the super's magic.
I tested this by mounting an fs made before the change and made sure
that it didn't introduce sparse errors. This matches a similar cleanup
that is pending in btrfs-progs. David Sterba pointed out that we should
fix the kernel side as well :).
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
With this new ioctl(2) BTRFS_IOC_SET_FSLABEL, we can set/change the label of a mounted file system.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Goffredo Baroncelli <kreijack@inwind.it>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Add a new ioctl(2) BTRFS_IOC_GET_FSLABLE, so that we can get the label upon a mounted filesystem.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Goffredo Baroncelli <kreijack@inwind.it>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao made the ordered operations stuff run async, which introduced a
deadlock where we could get somebody (sync) racing in and committing the
transaction while a commit was already happening. The new committer would
try and flush ordered operations which would hang waiting for the commit to
finish because it is done asynchronously and no longer inherits the callers
trans handle. To fix this we need to make the ordered operations list a per
transaction list. We can get new inodes added to the ordered operation list
by truncating them and then having another process writing to them, so this
makes it so that anybody trying to add an ordered operation _must_ start a
transaction in order to add itself to the list, which will keep new inodes
from getting added to the ordered operations list after we start committing.
This should fix the deadlock and also keeps us from doing a lot more work
than we need to during commit. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave pointed out that xfstests 273 will tell you that it failed to load the
space cache for a block group when it remounts. This is because we run out
of space writing out the block group cache. This is ok and is working as it
should, but let's try to be a bit nicer. This happens because the block
group was 100mb, but bitmap entries cover 128mb, so we were only getting
extent entries for this block group, which ended up being too many to fit in
the free space cache. So relax the bitmap size requirements to block groups
that are at least half the size a bitmap will cover or larger, that way we
can still keep the amount of space used in the free space cache low enough
to be able to write it out. With this patch I no longer fail to write out
the free space cache. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Enhance balance usage filter by making it possible to balance out only
completely empty chunks. Today, usage filter properly acts on values
from 1 to 99 inclusive, usage=100 selects all chunks, and usage=0
selects no chunks. This commit changes the usage=0 case: the new
meaning is to restripe only completely empty chunks and nothing else.
Suggested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Commit 5af3e8cc introduced a use-after-free at volumes.c:3139: bctl is freed
above in __cancel_balance() in all cases except for balance pause. Fix this
by moving the offending check a couple statements above, the meaning of the
check is preserved.
Reported-by: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The defrag operation can take very long, we want to have a way how to
cancel it. The code checks for a pending signal at safe points in the
defrag loops and returns EAGAIN. This means a user can press ^C after
running 'btrfs fi defrag', woks for both defrag modes, files and root.
Returning from the command was instant in my light tests, but may take
longer depending on the aging factor of the filesystem.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The warning in use_block_rsv is not useful for users and may fill
the logs unnecessarily.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This idea is from ext4. By this patch, we can make the dio write parallel,
and improve the performance. But because we can not update isize without
i_mutex, the unlocked dio write just can be done in front of the EOF.
We needn't worry about the race between dio write and truncate, because the
truncate need wait untill all the dio write end.
And we also needn't worry about the race between dio write and punch hole,
because we have extent lock to protect our operation.
I ran fio to test the performance of this feature.
== Hardware ==
CPU: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz
Mem: 2GB
SSD: Intel X25-M 120GB (Test Partition: 60GB)
== config file ==
[global]
ioengine=psync
direct=1
bs=4k
size=32G
runtime=60
directory=/mnt/btrfs/
filename=testfile
group_reporting
thread
[file1]
numjobs=1 # 2 4
rw=randwrite
== result (KBps) ==
write 1 2 4
lock 24936 24738 24726
nolock 24962 30866 32101
== result (iops) ==
write 1 2 4
lock 6234 6184 6181
nolock 6240 7716 8025
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Currently, we can do unlocked dio reads, but the following race
is possible:
dio_read_task truncate_task
->btrfs_setattr()
->btrfs_direct_IO
->__blockdev_direct_IO
->btrfs_get_block
->btrfs_truncate()
#alloc truncated blocks
#to other inode
->submit_io()
#INFORMATION LEAK
In order to avoid this problem, we must serialize unlocked dio reads with
truncate. There are two approaches:
- use extent lock to protect the extent that we truncate
- use inode_dio_wait() to make sure the truncating task will wait for
the read DIO.
If we use the 1st one, we will meet the endless truncation problem due to
the nonlocked read DIO after we implement the nonlocked write DIO. It is
because we still need invoke inode_dio_wait() avoid the race between write
DIO and truncation. By that time, we have to introduce
btrfs_inode_{block, resume}_nolock_dio()
again. That is we have to implement this patch again, so I choose the 2nd
way to fix the problem.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The deadlock problem happened when running fsstress(a test program in LTP).
Steps to reproduce:
# mkfs.btrfs -b 100M <partition>
# mount <partition> <mnt>
# <Path>/fsstress -p 3 -n 10000000 -d <mnt>
The reason is:
btrfs_direct_IO()
|->do_direct_IO()
|->get_page()
|->get_blocks()
| |->btrfs_delalloc_resereve_space()
| |->btrfs_add_ordered_extent() ------- Add a new ordered extent
|->dio_send_cur_page(page0) -------------- We didn't submit bio here
|->get_page()
|->get_blocks()
|->btrfs_delalloc_resereve_space()
|->flush_space()
|->btrfs_start_ordered_extent()
|->wait_event() ---------- Wait the completion of
the ordered extent that is
mentioned above
But because we didn't submit the bio that is mentioned above, the ordered
extent can not complete, we would wait for its completion forever.
There are two methods which can fix this deadlock problem:
1. submit the bio before we invoke get_blocks()
2. reserve the space before we do dio
Though the 1st is the simplest way, we need modify the code of VFS, and it
is likely to break contiguous requests, and introduce performance regression
for the other filesystems.
So we have to choose the 2nd way.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I noticed we were getting lots of warnings with xfstest 83 because we have
reservations outstanding. This is because we moved the orphan add outside
of the truncate, but we don't actually cleanup our reservation if something
fails. This fixes the problem and I no longer see warnings. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Sometimes xfstest 83 will fail to remount the scratch device because we've
gotten ourselves so full that we cannot cleanup the orphan items. In this
case check to see if we're doing the orphan cleanup and if we are allow us
to steal our reservation from the global block rsv. With this patch I've
not been able to reproduce the failed mount problem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The argument "inherit" of btrfs_ioctl_snap_create_transid() was assigned
to NULL during we created the snapshots, so we didn't free it though we
called kfree() in the caller.
But since we are sure the snapshot creation is done after the function -
btrfs_ioctl_snap_create_transid() - completes, it is safe that we don't
assign the pointer "inherit" to NULL, and just free it in the caller of
btrfs_ioctl_snap_create_transid(). In this way, the code can become more
readable.
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Cc: Arne Jansen <sensille@gmx.net>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
open_ctree() need read the metadata to initialize the global information
of btrfs. But it may fail after it submit some bio, and then it will jump
to the error path. Unfortunately, it doesn't check if there are some bios
in flight, and just stop all the worker threads. As a result, when the
submitted bios end, they can not find any worker thread which can deal with
subsequent work, then oops happen.
kernel BUG at fs/btrfs/async-thread.c:605!
Fix this problem by invoking invalidate_inode_pages2() before we stop the
worker threads. This function will wait until the bio end because it need
lock the pages which are going to be invalidated, and if a page is under
disk read IO, it must be locked. invalidate_inode_pages2() need wait until
end bio handler to unlocked it.
Reported-and-Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This patch adds the flag, BTRFS_SEND_FLAG_NO_FILE_DATA to the btrfs send
ioctl code. When this flag is set, the btrfs send code will never write file
data into the stream (thus also avoiding expensive reads of that data in the
first place). BTRFS_SEND_C_UPDATE_EXTENT commands will be sent (instead of
BTRFS_SEND_C_WRITE) with an offset, length pair indicating the extent in
question.
This patch does not affect the operation of BTRFS_SEND_C_CLONE commands -
they will continue to be sent when a search finds an appropriate extent to
clone from.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
For write, we also reserve some space for COW blocks during updating
the checksum tree, and we calculate the number of blocks by checking
if the number of bytes outstanding that are going to need csums needs
one more block for csum.
When we add these checksum into the checksum tree, we use ordered sums
list.
Every ordered sum contains csums for each sector, and we'll first try
to look up an existing csum item,
a) if we don't yet have a proper csum item, then we need to insert one,
b) or if we find one but the csum item is not big enough, then we need
to extend it.
The point is we'll unlock the whole path and then insert or extend.
So others can hack in and update the tree.
Each insert or extend needs update the tree with COW on, and we may need
to insert/extend for many times.
That means what we've reserved for updating checksum tree is NOT enough
indeed.
The case is even more serious with having several write threads at the
same time, it can end up eating our reserved space quickly and starting
eating globle reserve pool instead.
I don't yet come up with a way to calculate the worse case for updating
csum, but extending the checksum item as much as possible can be helpful
in my test.
The idea behind is that it can reduce the times we insert/extend so that
it saves us precious reserved space.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The entry point at the defrag ioctl always sets "cache only" to 0;
the codepaths haven't run for a long time as far as I can
tell. Chris says they're dead code, so remove them.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I hit a deadlock where transaction commit was waiting on num_writers to be
0. This happened because somebody came into btrfs_commit_transaction and
noticed we had aborted and it went to cleanup_transaction. This shouldn't
happen because cleanup_transaction is really to fixup a bad commit, it
doesn't do the normal trans handle cleanup things. So if we have an error
just do the normal btrfs_end_transaction dance and return. Once we are in
the actual commit path we can use cleanup_transaction and be good to go.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I noticed we would deadlock if we aborted a transaction while doing
compressed io. This is because we don't unlock our pages if something goes
horribly wrong. To fix this we need to make sure that we call
extent_clear_unlock_delalloc in order to unlock all the pages. If we have
to cow in the async submission thread we need to make sure to unlock our
locked_page as the cow error path will not unlock the locked page as it
depends on the caller to unlock that page. With this patch we no longer
deadlock on the page lock when we have an aborted transaction. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
People have been complaining about random ENOSPC errors that will clear up
after a umount or just a given amount of time. Chris was able to reproduce
this with stress.sh and lots of processes and so was I. Basically the
overcommit stuff would really let us get out of hand, in my tests I saw up
to 30 gigs of outstanding reservations with only 2 gigs total of metadata
space. This usually worked out fine but with so much outstanding
reservation the flushing stuff short circuits to make sure we don't hang
forever flushing when we really need ENOSPC. Plus we allocate chunks in
order to alleviate the pressure, but this doesn't actually help us since we
only use the non-allocated area in our over commit logic.
So instead of basing overcommit on the amount of non-allocated space,
instead just do it based on how much total space we have, and then limit it
to the non-allocated space in case we are short on space to spill over into.
This allows us to have the same performance as well as no longer giving
random ENOSPC. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave sent me a panic where we were doing the orphan cleanup and panic'ed
trying to release our reservation from the orphan block rsv. The reason for
this is because our orphan block rsv had been free'd out from underneath us
because the transaction commit found that there were no orphan inodes
according to its count and decided to free it. This is incorrect so make
sure we inc the orphan inodes count so the accounting is all done properly.
This would also cause the warning in the orphan commit code normally if you
had any orphans to cleanup as they would only decrement the orphan count so
you'd get a negative orphan count which could cause problems during runtime.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When a transaction aborts or there's an EIO on an ordered extent or any
error really we will not free up the space we reserved for this ordered
extent. This results in warnings from the block group cache cleanup in the
case of a transaction abort, or leaking space in the case of EIO on an
ordered extent. Fix this up by free'ing the reserved space if we have an
error at all trying to complete an ordered extent. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we abort we've been just free'ing up all the ordered extents and
hoping for the best. This results in lots of warnings from various places,
warnings from btrfs_destroy_inode() because it's ENOSPC accounting isn't
fixed. It will also screw up lots of pages who have been set private but
never get cleared because the ordered extents are never allowed to be
submitted. This patch fixes those warnings. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I hit this error when reproducing a bug that would end in a transaction
abort. We take the delayed ref head's mutex to keep anybody from processing
it while we're destroying it, but we fail to drop the mutex before we carry
on and free the damned thing. Fix this by doing the remove logic for the
head ourselves and unlock the mutex, that way we can avoid use after free's
or hung tasks waiting on that mutex to come back so they know the delayed
ref completed. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
WARN_ON isn't enough, we need to stop the loop if for any reason
we would overrun the devices_info array.
I tried to track down the connection between the length of
the alloc_devices list and the rw_devices counter but
it wasn't immediately obvious, so be defensive about it.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
__btrfs_std_error didn't always properly call va_end,
and might call va_start even if fmt was NULL.
Move all the varargs handling into the block where we
have fmt.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I don't think that BTRFS_DEV_EXTENT_KEY is supposed
to fall through to BTRFS_DEV_STATS_KEY ...
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
At least backref_tree_panic() can apparently pass
in a null fs_info, so handle that in __btrfs_panic
to get the message out on the console.
The btrfs_panic macro also uses fs_info, but that's
largely pointless; it's testing to see if
BTRFS_MOUNT_PANIC_ON_FATAL_ERROR is not set.
But if it *were* set, __btrfs_panic() would have,
well, paniced and we wouldn't be here, testing it!
So just BUG() at this point.
And since we only use fs_info once now, just use it
directly.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
No need to test the result, we can't get a
null pointer from list_entry()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Because of how little we allocate chunks now we can get really tight on
metadata space before we will allocate a new chunk. This resulted in being
unable to add device extents when allocating a new metadata chunk as we did
not have enough space. This is because we were allowed to overcommit too
much metadata without actually making sure we had enough space to make
allocations. The idea behind overcommit is that we are allowed to say "sure
you can have that reservation" when most of the free space is occupied by
reservations, not actual allocations. But in this case where a majority of
the total space is in use by actual allocations we can screw ourselves by
not being able to make real allocations when it matters. So make sure we
have enough real space for our global reserve, and if not then don't allow
overcommitting. Thanks,
Reported-and-tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I got a double free error when unmounting a file system that failed to add a
chunk during its operation. This is because we will kfree the mapping that
we created but leave the extent_map in the em_tree for chunks. So to fix
this just remove the extent_map when we error out so we don't run into this
problem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we error out allocating a dev extent we will have already created the
block group and such which will cause problems since the allocator may have
tried to allocate out of the block group that no longer exists. This will
cause BUG_ON()'s in the bio submission path. This also makes a failure to
allocate a dev extent a non-abort error, we will just clean up the dev
extents we did allocate and exit. Now if we fail to delete the dev extents
we will abort since we can't have half of the dev extents hanging around,
but this will make us much less likely to abort. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is no lock to protect fs_info->fs_state, it will introduce
some problems, such as the value may be covered by the other task
when several tasks modify it. For example:
Task0 - CPU0 Task1 - CPU1
mov %fs_state rax
or $0x1 rax
mov %fs_state rax
or $0x2 rax
mov rax %fs_state
mov rax %fs_state
The expected value is 3, but in fact, it is 2.
Though this problem doesn't happen now (because there is only one
flag currently), the code is error prone, if we add other flags,
the above problem will happen to a certainty.
Now we use bit operation for it to fix the above problem.
In this way, we can make the code more robust and be easy to
add new flags.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is no lock to protect
fs_info->avail_{data, metadata, system}_alloc_bits,
it may introduce some problem, such as the wrong profile
information, so we add a seqlock to protect them.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need not use a global lock to protect the delalloc_bytes of the
inode, just use its own lock. In this way, we can reduce the lock
contention and ->delalloc_lock will just protect delalloc inode
list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
fs_info->delalloc_bytes is accessed very frequently, so use percpu
counter instead of the u64 variant for it to reduce the lock
contention.
This patch also fixed the problem that we access the variant
without the lock protection.At worst, we would not flush the
delalloc inodes, and just return ENOSPC error when we still have
some free space in the fs.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
->dirty_metadata_bytes is accessed very frequently, so use percpu
counter instead of the u64 variant to reduce the contention of
the lock.
This patch also fixed the problem that we access it without
lock protection in __btrfs_btree_balance_dirty(), which may
cause we skip the dirty pages flush.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
fs_info->alloc_start is a 64bits variant, can be accessed by
multi-task, but it is not protected strictly, it can be changed
while we are accessing it. On 32bit machine, we will get wrong
value because we access it by two instructions.(In fact, it is
also possible that the same problem happens on the 64bit machine,
because the compiler may split the 64bit operation into two 32bit
operation.)
For example:
Assuming -> alloc_start is 0x0000 0000 0001 0000 at the beginning,
then we remount and set ->alloc_start to 0x0000 0100 0000 0000.
Task0 Task1
load high 32 bits
set high 32 bits
set low 32 bits
load low 32 bits
Task1 will get 0.
This patch fixes this problem by using two locks to protect it
fs_info->chunk_mutex
sb->s_umount
On the read side, we just need get one of these two locks, and on
the write side, we must lock all of them.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Though ->max_inline is a 64bit variant, and may be accessed by
multi-task, but it is just suggestive number, so we needn't add
anything to protect fs_info->max_inline, just add a comment to
explain wny we don't use a lock to protect it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The header file will then be installed under /usr/include/linux so that
userspace applications can refer to Btrfs ioctls by name and use the same
structs used internally in the kernel.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
CAP_DAC_READ_SEARCH overrides read and search permission check on
file and directory. It seems fit for BTRFS_IOC_INO_PATHS.
Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This reverts commit 2794ed013b.
Wasn't supposed to get used in btrfs_mknod, it was supposed to be in
btrfs_create, which was done in commit
9185aa587b.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_run_ordered_operations() needn't traverse the ordered operation list
repeatedly, it is because the transaction commiter will invoke it again when
there is no other writer in this transaction, it can ensure that no one can
add new objects into the ordered operation list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_start_delalloc_inodes() needn't traverse and flush the delalloc inodes
repeatedly. It is because we can regard the data that the users write after
we start delalloc inodes flush as the one which is after the delalloc inodes
flush is done, and we can flush it next time.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We forget to check the return value of btrfs_run_ordered_operations() when
flushing all the pending stuffs, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We forget to check the return value of btrfs_start_delalloc_inodes(), fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The current code of raid attr arry is hard to understand and it is easy to
introduce some problem if we modify the array. So I changed it and made it
more readable.
Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This'd save us a rbtree search which may become expensive in large filesystem.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This does not change the logic of code, but can save us a read_lock.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The API in tree log code has done sort of changes, and it proves that
we can benifit from using token, so do the same thing here.
function_graph tracer's timer shows that it costs nearly half time
of before(39.788us -> 22.391us).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
commit d53ba47484
(Btrfs: use commit root when loading free space cache) has remove
the deadlock check, and the related comments can be removed as well.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we start running low on metadata space we will try to allocate a chunk,
which could then try to allocate a chunk to add the device entry. The thing
is we allocate a chunk before we try really hard to make the allocation, so
we should be able to find space for the device entry. Add a flag to the
trans handle so we know we're currently allocating a chunk so we can just
bail out if we try to allocate another chunk. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Since we don't actually copy the extent information from the source tree in
the fast case we don't need to wait for ordered io to be completed in order
to fsync, we just need to wait for the io to be completed. So when we're
logging our file just attach all of the ordered extents to the log, and then
when the log syncs just wait for IO_DONE on the ordered extents and then
write the super. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This patch fixes the following problem:
- improper return value
- unnecessary read-only check
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Use wrapper page_offset to get byte-offset into filesystem object for page.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We may try to flush some dirty pages when there is no enough space to reserve.
But it is possible that this operation fails, in order to get enough space to
reserve successfully, we will sync all the delalloc file. This operation is
safe, we needn't worry about the case that the filesystem goes from r/w to r/o.
because the filesystem should guarantee all the dirty pages have been written
into the disk after it becomes readonly, so the sync operation will do nothing
if the filesystem is already readonly. Though it may waste lots of time,
as a corner case, we needn't care.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Locking and unlocking delayed ref mutex are in the different functions,
and the name of lock functions is not uniform, so the readability is not
so good, this patch optimizes the lock logic and makes it more readable.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We're running into having 50-100 orphans left over with xfstests 83
because of ENOSPC when trying to start the transaction for the inode update.
But in fact, it makes no sense in updating the inode for the new size while
we're deleting the stupid thing. This patch fixes this problem.
Reported-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The delayed item commit code in several functions is similar, so
cleanup it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Since we do not want to delay the async transaction commit, we should
use common work, not delayed work.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
We clear the transaction object and the trans handle when they are about to be
freed, it is unnecessary, cleanup it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
The delayed reference allocation is in the fast path of the IO, so use slabs
to improve the speed of the allocation.
And besides that, it can do check for leaked objects when the module is removed.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
btrfs_scan_one_device is calling set_blocksize() which can race
with a concurrent process making dirty page cache pages. It can end up
dropping dirty page cache pages on the floor, which isn't very nice when
someone is just running btrfs dev scan to find filesystems on the
box.
Now that udev is registering btrfs devices as it discovers them, we can
actually end up racing with our own mkfs program too. When this
happens, we drop some of the important blocks written by mkfs.
This commit changes scan_one_device to read the super out of the page
cache instead of trying to use bread. This way we don't have to care
about the blocksize of the device.
This also drops the invalidate_bdev() call. It wasn't very polite to
invalidate during the scan either. mkfs is putting the super into the
page cache, there's no reason to invalidate at this point.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When replaying a log tree with qgroups enabled, tree_mod_log_rewind does a
sanity-check of the number of items against the maximum possible number.
It calculates that number with the nodesize of fs_root. Unfortunately
fs_root is not yet set at this stage. So instead use the nodesize from
tree_root, which is already initialized.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fixes from Chris Mason:
"We've got corner cases for updating i_size that ceph was hitting,
error handling for quotas when we run out of space, a very subtle
snapshot deletion race, a crash while removing devices, and one
deadlock between subvolume creation and the sb_internal code (thanks
lockdep)."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: move d_instantiate outside the transaction during mksubvol
Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata
Btrfs: fix possible stale data exposure
Btrfs: fix missing i_size update
Btrfs: fix race between snapshot deletion and getting inode
Btrfs: fix missing release of the space/qgroup reservation in start_transaction()
Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write()
Btrfs: do not merge logged extents if we've removed them from the tree
btrfs: don't try to notify udev about missing devices
Dave Sterba triggered a lockdep complaint about lock ordering
between the sb_internal lock and the cleaner semaphore.
btrfs_lookup_dentry() checks for orphans if we're looking up
the inode for a subvolume, and subvolume creation is triggering
the lookup with a transaction running.
This commit moves the d_instantiate after the transaction closes.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When btrfs_qgroup_reserve returned a failure, we were missing a counter
operation for BTRFS_I(inode)->outstanding_extents++, leading to warning
messages about outstanding extents and space_info->bytes_may_use != 0.
Additionally, the error handling code didn't take into account that we
dropped the inode lock which might require more cleanup.
Luckily, all the cleanup code we need is already there and can be shared
with reserve_metadata_bytes, which is exactly what this patch does.
Reported-by: Lev Vainblat <lev@zadarastorage.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We specifically do not update the disk i_size if there are ordered extents
outstanding for any area between the current disk_i_size and our ordered
extent so that we do not expose stale data. The problem is the check we
have only checks if the ordered extent starts at or after the current
disk_i_size, which doesn't take into account an ordered extent that starts
before the current disk_i_size and ends past the disk_i_size. Fix this by
checking if the extent ends past the disk_i_size. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we have an ordered extent before the ordered extent we are currently
completing that is after the current disk_i_size we will put our i_size
update into that ordered extent so that we do not expose stale data. The
problem is that if our disk i_size is updated past the previous ordered
extent we won't update the i_size with the pending i_size update. So check
the pending i_size update and if its above the current disk i_size we need
to go ahead and try to update. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
While running snapshot testscript created by Mitch and David,
the race between autodefrag and snapshot deletion can lead to
corruption of dead_root list so that we can get crash on
btrfs_clean_old_snapshots().
And besides autodefrag, scrub also does the same thing, ie. read
root first and get inode.
Here is the story(take autodefrag as an example):
(1) when we delete a snapshot or subvolume, it will set its root's
refs to zero and do a iput() on its own inode, and if this inode happens
to be the only active in-meory one in root's inode rbtree, it will add
itself to the global dead_roots list for later cleanup.
(2) after (1), the autodefrag thread may read another inode for defrag
and the inode is just in the deleted snapshot/subvolume, but all of these
are without checking if the root is still valid(refs > 0). So the end up
result is adding the deleted snapshot/subvolume's root to the global
dead_roots list AGAIN.
Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu.
So all we need to do is to take the lock to protect 'read root and get inode',
since we synchronize to wait for the rcu grace period before adding something
to the global dead_roots list.
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>