We allocate the free space from the former block group, not the current
one, so should use the former one to output the trace information.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
used_block_group is just used for the space cluster which doesn't
belong to the current block group, the other place needn't use it.
Or the logic of code seems unclear.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
It is better that the position of the lock is close to the data which is
protected by it, because they may be in the same cache line, we will load
less cache lines when we access them. So we rearrange the members' position
of btrfs_space_info structure to make the lock be closer to the its data.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
To search tree root without transaction protection, we should neither search commit
root nor skip locking here, fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The performance of fsync dropped down suddenly sometimes, the main reason
of this problem was that we might only flush part dirty pages in a ordered
extent, then got that ordered extent, wait for the csum calcucation. But if
no task flushed the left part, we would wait until the flusher flushed them,
sometimes we need wait for several seconds, it made the performance drop
down suddenly. (On my box, it drop down from 56MB/s to 4-10MB/s)
This patch improves the above problem by flushing left dirty pages aggressively.
Test Environment:
CPU: 2CPU * 2Cores
Memory: 4GB
Partition: 20GB(HDD)
Test Command:
# sysbench --num-threads=8 --test=fileio --file-num=1 \
> --file-total-size=8G --file-block-size=32768 \
> --file-io-mode=sync --file-fsync-freq=100 \
> --file-fsync-end=no --max-requests=10000 \
> --file-test-mode=rndwr run
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Steps to reproduce:
# mkfs.btrfs -f /dev/sda8
# mount /dev/sda8 /mnt -o flushoncommit
# dd if=/dev/zero of=/mnt/data bs=4k count=102400 &
# mount /dev/sda8 /mnt -o remount, ro
When remounting RW to RO, the logic is to firstly set flag
to RO and then commit transaction, however with option
flushoncommit enabled,we will do RO check within committing
transaction, so we get a transaction abortion here.
Actually,here check is wrong, we should check if FS_STATE_ERROR
is set, fix it.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Suggested-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we are looking for file extent items that intersect the cloning
range, for each one that falls completely outside the range, don't
release the path and do another full tree search - just move on
to the next slot and copy the file extent item into our buffer only
if the item intersects the cloning range.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When transaction is aborted, we fail to commit transaction, instead we do
cleanup work. After that when we umount btrfs, we get to free fs roots' log
trees respectively, but that happens after we unpin extents, so those extents
pinned by freeing log trees will remain in memory and lead to the leak.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Since remount will pending the new mount options to the original mount
options, which will make btrfs_parse_options check the old options then
new options, causing some stupid output like "enabling XXX" following by
"disable XXX".
This patch will add extra check before every btrfs_info to skip the
output from old options checking.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add noinode_cache mount option for btrfs.
Since inode map cache involves all the btrfs_find_free_ino/return_ino
things and if just trigger the mount_opt,
an inode number get from inode map cache will not returned to inode map
cache.
To keep the find and return inode both in the same behavior,
a new bit in mount_opt, CHANGE_INODE_CACHE, is introduced for this idea.
CHANGE_INODE_CACHE is set/cleared in remounting, and the original
INODE_MAP_CACHE is set/cleared according to CHANGE_INODE_CACHE after a
success transaction.
Since find/return inode is all done between btrfs_start_transaction and
btrfs_commit_transaction, this will keep consistent behavior.
Also noinode_cache mount option will not stop the caching_kthread.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
There is a bug that using btrfs_previous_item() to search metadata extent item.
This is because in btrfs_previous_item(), we need type match, however, since
skinny metada was introduced by josef, we may mix this two types. So just
use btrfs_previous_item() is not working right.
To keep btrfs_previous_item() like normal tree search, i introduce another
function btrfs_previous_extent_item().
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Check if we support skinny metadata firstly and fix to use
right type to search.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
It is possible for the send feature to send clone operations that
request a cloning range (offset + length) that is not aligned with
the block size. This makes the btrfs receive command send issue a
clone ioctl call that will fail, as the ioctl will return an -EINVAL
error because of the unaligned range.
Fix this by not sending clone operations for non block aligned ranges,
and instead send regular write operation for these (less common) cases.
The following xfstest reproduces this issue, which fails on the second
btrfs receive command without this change:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=`mktemp -d`
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $tmp
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
rm -f $seqres.full
_scratch_mkfs >/dev/null 2>&1
_scratch_mount
$XFS_IO_PROG -f -c "truncate 819200" $SCRATCH_MNT/foo | _filter_xfs_io
$BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
$XFS_IO_PROG -c "falloc -k 819200 667648" $SCRATCH_MNT/foo | _filter_xfs_io
$BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
$XFS_IO_PROG -f -c "pwrite 1482752 2978" $SCRATCH_MNT/foo | _filter_xfs_io
$BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
$BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1 | \
_filter_scratch
$XFS_IO_PROG -f -c "truncate 883305" $SCRATCH_MNT/foo | _filter_xfs_io
$BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
$BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2 | \
_filter_scratch
$BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap 2>&1 | _filter_scratch
$BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \
-f $tmp/2.snap 2>&1 | _filter_scratch
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/mysnap1/foo | _filter_scratch
md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch
_scratch_unmount
_check_btrfs_filesystem $SCRATCH_DEV
_scratch_mkfs >/dev/null 2>&1
_scratch_mount
$BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap
md5sum $SCRATCH_MNT/mysnap1/foo | _filter_scratch
$BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap
md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch
_scratch_unmount
_check_btrfs_filesystem $SCRATCH_DEV
status=0
exit
The tests expected output is:
QA output created by 025
FSSync 'SCRATCH_MNT'
FSSync 'SCRATCH_MNT'
wrote 2978/2978 bytes at offset 1482752
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
FSSync 'SCRATCH_MNT'
Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap1'
FSSync 'SCRATCH_MNT'
Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap2'
At subvol SCRATCH_MNT/mysnap1
At subvol SCRATCH_MNT/mysnap2
129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/foo
42b6369eae2a8725c1aacc0440e597aa SCRATCH_MNT/mysnap1/foo
129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/mysnap2/foo
At subvol mysnap1
42b6369eae2a8725c1aacc0440e597aa SCRATCH_MNT/mysnap1/foo
At snapshot mysnap2
129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/mysnap2/foo
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
After the change titled "Btrfs: add support for inode properties", if
btrfs was built-in the kernel (i.e. not as a module), it would cause a
kernel panic, as reported recently by Fengguang:
[ 2.024722] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 2.027814] IP: [<ffffffff81501594>] crc32c+0xc/0x6b
[ 2.028684] PGD 0
[ 2.028684] Oops: 0000 [#1] SMP
[ 2.028684] Modules linked in:
[ 2.028684] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc7-04795-ga7b57c2 #1
[ 2.028684] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 2.028684] task: ffff88000edba100 ti: ffff88000edd6000 task.ti: ffff88000edd6000
[ 2.028684] RIP: 0010:[<ffffffff81501594>] [<ffffffff81501594>] crc32c+0xc/0x6b
[ 2.028684] RSP: 0000:ffff88000edd7e58 EFLAGS: 00010246
[ 2.028684] RAX: 0000000000000000 RBX: ffffffff82295550 RCX: 0000000000000000
[ 2.028684] RDX: 0000000000000011 RSI: ffffffff81efe393 RDI: 00000000fffffffe
[ 2.028684] RBP: ffff88000edd7e60 R08: 0000000000000003 R09: 0000000000015d20
[ 2.028684] R10: ffffffff81ef225e R11: ffffffff811b0222 R12: ffffffffffffffff
[ 2.028684] R13: 0000000000000239 R14: 0000000000000000 R15: 0000000000000000
[ 2.028684] FS: 0000000000000000(0000) GS:ffff88000fa00000(0000) knlGS:0000000000000000
[ 2.028684] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2.028684] CR2: 0000000000000000 CR3: 000000000220c000 CR4: 00000000000006f0
[ 2.028684] Stack:
[ 2.028684] ffffffff82295550 ffff88000edd7e80 ffffffff8238af62 ffffffff8238ac05
[ 2.028684] 0000000000000000 ffff88000edd7e98 ffffffff8238ac0f ffffffff8238ac05
[ 2.028684] ffff88000edd7f08 ffffffff810002ba ffff88000edd7f00 ffffffff810e2404
[ 2.028684] Call Trace:
[ 2.028684] [<ffffffff8238af62>] btrfs_props_init+0x4f/0x96
[ 2.028684] [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
[ 2.028684] [<ffffffff8238ac0f>] init_btrfs_fs+0xa/0xf0
[ 2.028684] [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
[ 2.028684] [<ffffffff810002ba>] do_one_initcall+0xa4/0x13a
[ 2.028684] [<ffffffff810e2404>] ? parse_args+0x25f/0x33d
[ 2.028684] [<ffffffff8234cf75>] kernel_init_freeable+0x1aa/0x230
[ 2.028684] [<ffffffff8234c785>] ? do_early_param+0x88/0x88
[ 2.028684] [<ffffffff819f61b5>] ? rest_init+0x89/0x89
[ 2.028684] [<ffffffff819f61c3>] kernel_init+0xe/0x109
The issue here is that the initialization function of btrfs (super.c:init_btrfs_fs)
started using crc32c (from lib/libcrc32c.c). But when it needs to call crc32c (as
part of the properties initialization routine), the libcrc32c is not yet initialized,
so crc32c derreferenced a NULL pointer (lib/libcrc32c.c:tfm), causing the kernel
panic on boot.
The approach to fix this is to use crypto component directly to use its crc32c (which
is basically what lib/libcrc32c.c is, a wrapper around crypto). This is what ext4 is
doing as well, it uses crypto directly to get crc32c functionality.
Verified this works both when btrfs is built-in and when it's loadable kernel module.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
In the clone ioctl, when the source and target inodes are different,
we can acquire their mutexes in 2 possible different orders. After
we're done cloning, we were releasing the mutexes always in the same
order - the most correct way of doing it is to release them by the
reverse order they were acquired.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Here we are not going to free memory, no need to remove every node
one by one, just init root node here is ok.
Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We don't have to keep subvolume's block_rsv during transaction commit,
and within transaction commit, we may also need the free space reclaimed
from this block_rsv to process delayed refs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we ran the 274th case of xfstests with nodatacow mount option,
We met the following warning message:
WARNING: CPU: 1 PID: 14185 at fs/btrfs/extent-tree.c:3734 btrfs_free_reserved_data_space+0xa6/0xd0
It is caused by the race between the write back and nocow buffered
write:
Task1 Task2
__btrfs_buffered_write()
skip data reservation
reserve the metadata space
copy the data
dirty the pages
unlock the pages
write back the pages
release the data space
becasue there is no
noreserve flag
set the noreserve flag
This patch fixes this problem by unlocking the pages after
the noreserve flag is set.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The backref walking code will search down to the key it is looking for and then
proceed to walk _all_ of the extents on the file until it hits the end. This is
suboptimal with large files, we only need to look for as many extents as we have
references for that inode. I have a testcase that creates a randomly written 4
gig file and before this patch it took 6min 30sec to do the initial send, with
this patch it takes 2min 30sec to do the intial send. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Could have sworn I fixed this before but apparently not. This makes us pass
btrfs/022 with skinny metadata enabled. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
I don't think this is an issue and I've not seen it in practice but
extent_from_logical will fail to find a skinny extent because it uses
btrfs_previous_item and gives it the normal extent item type. This is just not
a place to use btrfs_previous_item since we care about either normal extents or
skinny extents, so open code btrfs_previous_item to properly check. This would
only affect metadata and the only place this is used for metadata is scrub and
I'm pretty sure it's just for printing stuff out, not actually doing any work so
hopefully it was never a problem other than a cosmetic one. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
On one of our gluster clusters we noticed some pretty big lag spikes. This
turned out to be because our transaction commit was taking like 3 minutes to
complete. This is because we have like 30 gigs of metadata, so our global
reserve would end up being the max which is like 512 mb. So our throttling code
would allow a ridiculous amount of delayed refs to build up and then they'd all
get run at transaction commit time, and for a cold mounted file system that
could take up to 3 minutes to run. So fix the throttling to be based on both
the size of the global reserve and how long it takes us to run delayed refs.
This patch tracks the time it takes to run delayed refs and then only allows 1
seconds worth of outstanding delayed refs at a time. This way it will auto-tune
itself from cold cache up to when everything is in memory and it no longer has
to go to disk. This makes our transaction commits take much less time to run.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently we have two rb-trees, one for delayed ref heads and one for all of the
delayed refs, including the delayed ref heads. When we process the delayed refs
we have to hold onto the delayed ref lock for all of the selecting and merging
and such, which results in quite a bit of lock contention. This was solved by
having a waitqueue and only one flusher at a time, however this hurts if we get
a lot of delayed refs queued up.
So instead just have an rb tree for the delayed ref heads, and then attach the
delayed ref updates to an rb tree that is per delayed ref head. Then we only
need to take the delayed ref lock when adding new delayed refs and when
selecting a delayed ref head to process, all the rest of the time we deal with a
per delayed ref head lock which will be much less contentious.
The locking rules for this get a little more complicated since we have to lock
up to 3 things to properly process delayed refs, but I will address that problem
later. For now this passes all of xfstests and my overnight stress tests.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Looking into some performance related issues with large amounts of metadata
revealed that we can have some pretty huge swings in fsync() performance. If we
have a lot of delayed refs backed up (as you will tend to do with lots of
metadata) fsync() will wander off and try to run some of those delayed refs
which can result in reading from disk and such. Since the actual act of fsync()
doesn't create any delayed refs there is no need to make it throttle on delayed
ref stuff, that will be handled by other people. With this patch we get much
smoother fsync performance with large amounts of metadata. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
This change adds infrastructure to allow for generic properties for
inodes. Properties are name/value pairs that can be associated with
inodes for different purposes. They are stored as xattrs with the
prefix "btrfs."
Properties can be inherited - this means when a directory inode has
inheritable properties set, these are added to new inodes created
under that directory. Further, subvolumes can also have properties
associated with them, and they can be inherited from their parent
subvolume. Naturally, directory properties have priority over subvolume
properties (in practice a subvolume property is just a regular
property associated with the root inode, objectid 256, of the
subvolume's fs tree).
This change also adds one specific property implementation, named
"compression", whose values can be "lzo" or "zlib" and it's an
inheritable property.
The corresponding changes to btrfs-progs were also implemented.
A patch with xfstests for this feature will follow once there's
agreement on this change/feature.
Further, the script at the bottom of this commit message was used to
do some benchmarks to measure any performance penalties of this feature.
Basically the tests correspond to:
Test 1 - create a filesystem and mount it with compress-force=lzo,
then sequentially create N files of 64Kb each, measure how long it took
to create the files, unmount the filesystem, mount the filesystem and
perform an 'ls -lha' against the test directory holding the N files, and
report the time the command took.
Test 2 - create a filesystem and don't use any compression option when
mounting it - instead set the compression property of the subvolume's
root to 'lzo'. Then create N files of 64Kb, and report the time it took.
The unmount the filesystem, mount it again and perform an 'ls -lha' like
in the former test. This means every single file ends up with a property
(xattr) associated to it.
Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
compression property, have no real effect other than adding more work
when inheriting properties and taking more btree leaf space.
Test 4 - same as test 3 but with 10 properties per file.
Results (in seconds, and averages of 5 runs each), for different N
numbers of files follow.
* Without properties (test 1)
file creation time ls -lha time
10 000 files 3.49 0.76
100 000 files 47.19 8.37
1 000 000 files 518.51 107.06
* With 1 property (compression property set to lzo - test 2)
file creation time ls -lha time
10 000 files 3.63 0.93
100 000 files 48.56 9.74
1 000 000 files 537.72 125.11
* With 4 properties (test 3)
file creation time ls -lha time
10 000 files 3.94 1.20
100 000 files 52.14 11.48
1 000 000 files 572.70 142.13
* With 10 properties (test 4)
file creation time ls -lha time
10 000 files 4.61 1.35
100 000 files 58.86 13.83
1 000 000 files 656.01 177.61
The increased latencies with properties are essencialy because of:
*) When creating an inode, we now synchronously write 1 more item
(an xattr item) for each property inherited from the parent dir
(or subvolume). This could be done in an asynchronous way such
as we do for dir intex items (delayed-inode.c), which could help
reduce the file creation latency;
*) With properties, we now have larger fs trees. For this particular
test each xattr item uses 75 bytes of leaf space in the fs tree.
This could be less by using a new item for xattr items, instead of
the current btrfs_dir_item, since we could cut the 'location' and
'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
total of 26 bytes per xattr item) from the btrfs_dir_item type.
Also tried batching the xattr insertions (ignoring proper hash
collision handling, since it didn't exist) when creating files that
inherit properties from their parent inode/subvolume, but the end
results were (surprisingly) essentially the same.
Test script:
$ cat test.pl
#!/usr/bin/perl -w
use strict;
use Time::HiRes qw(time);
use constant NUM_FILES => 10_000;
use constant FILE_SIZES => (64 * 1024);
use constant DEV => '/dev/sdb4';
use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
use constant TEST_DIR => (MNT_POINT . '/testdir');
system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";
# following line for testing without properties
#system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";
# following 2 lines for testing with properties
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";
system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
my ($t1, $t2);
$t1 = time();
for (my $i = 1; $i <= NUM_FILES; $i++) {
my $p = TEST_DIR . '/file_' . $i;
open(my $f, '>', $p) or die "Error opening file!";
$f->autoflush(1);
for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
print $f ('A' x 4096) or die "Error writing to file!";
}
close($f);
}
$t2 = time();
print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
$t1 = time();
system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
$t2 = time();
print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When writing to a file we drop existing file extent items that cover the
write range and then add a new file extent item that represents that write
range.
Before this change we were doing a tree lookup to remove the file extent
items, and then after we did another tree lookup to insert the new file
extent item.
Most of the time all the file extent items we need to drop are located
within a single leaf - this is the leaf where our new file extent item ends
up at. Therefore, in this common case just combine these 2 operations into
a single one.
By avoiding the second btree navigation for insertion of the new file extent
item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
COW operations, CPU time on btree node/leaf key binary searches, etc.
Besides for file writes, this is an operation that happens for file fsync's
as well. However log btrees are much less likely to big as big as regular
fs btrees, therefore the impact of this change is smaller.
The following benchmark was performed against an SSD drive and a
HDD drive, both for random and sequential writes:
sysbench --test=fileio --file-num=4096 --file-total-size=8G \
--file-test-mode=[rndwr|seqwr] --num-threads=512 \
--file-block-size=8192 \ --max-requests=1000000 \
--file-fsync-freq=0 --file-io-mode=sync [prepare|run]
All results below are averages of 10 runs of the respective test.
** SSD sequential writes
Before this change: 225.88 Mb/sec
After this change: 277.26 Mb/sec
** SSD random writes
Before this change: 49.91 Mb/sec
After this change: 56.39 Mb/sec
** HDD sequential writes
Before this change: 68.53 Mb/sec
After this change: 69.87 Mb/sec
** HDD random writes
Before this change: 13.04 Mb/sec
After this change: 14.39 Mb/sec
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We may return early in btrfs_drop_snapshot(), we shouldn't
call btrfs_std_err() for this case, fix it.
Cc: stable@vger.kernel.org
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We will finish orphan cleanups during snapshot, so we don't
have to commit transaction here.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We should gurantee that parent and clone roots can not be destroyed
during send, for this we have two ideas.
1.by holding @subvol_sem, this might be a nightmare, because it will
block all subvolumes deletion for a long time.
2.Miao pointed out we can reuse @send_in_progress, that mean we will
skip snapshot deletion if root sending is in progress.
Here we adopt the second approach since it won't block other subvolumes
deletion for a long time.
Besides in btrfs_clean_one_deleted_snapshot(), we only check first root
, if this root is involved in send, we return directly rather than
continue to check.There are several reasons about it:
1.this case happen seldomly.
2.after sending,cleaner thread can continue to drop that root.
3.make code simple
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Steps to reproduce:
# mkfs.btrfs -f /dev/sda8
# mount /dev/sda8 /mnt
# btrfs sub snapshot -r /mnt /mnt/snap1
# btrfs sub snapshot -r /mnt /mnt/snap2
# btrfs send /mnt/snap1 -p /mnt/snap2 -f /mnt/1
# dmesg
The problem is that we will sort clone roots(include @send_root), it
might push @send_root before thus @send_root's @send_in_progress will
be decreased twice.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add treelog mount option to enable tree log with
remount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add datasum mount option to enable checksum with
remount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add datacow mount option to enable copy-on-write with
remount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add acl mount option to enable acl with remount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add noflushoncommit mount option to disable flush on commit with
remount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add noenospc_debug mount option to disable ENOSPC debug with
remount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add nodiscard mount option to disable discard with remount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Btrfs has autodefrag mount option but no pairing noautodefrag option,
which makes it impossible to disable autodefrag without umount.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Btrfs can be remounted without barrier, but there is no "barrier" option
so nobody can remount btrfs back with barrier on. Only umount and
mount again can re-enable barrier.(Quite awkward)
Also the mount options in the document is also changed slightly for the
further pairing options changes.
Reported-by: Daniel Blueman <daniel@quora.org>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Mike Fleetwood <mike.fleetwood@googlemail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We only intent to fua the first superblock in every device from
comments, fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
@full is not protected within global_rsv.lock, so we may think global_rsv
is already full but in fact it's not, so we miss the opportunity to return
free space to global_rsv directly when we release other block_rsvs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
During balance test, we hit an oops:
[ 2013.841551] kernel BUG at fs/btrfs/relocation.c:1174!
The problem is that if we fail to relocate tree blocks, we should
update backref cache, otherwise, some pending nodes are not updated
while snapshot check @cache->last_trans is within one transaction
and won't update it and then oops happen.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The following warning message was outputed when running the 274th case
of xfstests with nodatacow option:
BUG: Bad page state in process kswapd0 pfn:1c66f
page:ffffea0000636848 count:0 mapcount:0 mapping:(null) index:0x78000
page flags: 0x1000000000100a(error|uptodate|private_2)
It is because the check of nocow range was wrong, we should compare the
start and end position of the extent with the write position to verify
if the write position was in the extent, but the current code just used
the start postion to do the check, so we got the wrong extent and told
the caller that it was a nocow write. And then when we write back the
dirty pages, we found we should cow the extent, but at that time, there
was no space in the fs, we had to the error flag for the page. When
someone reclaimed that page, the above warning outputed. Fix it.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Previously, we will free reloc root memory and then force filesystem
to be readonly. The problem is that there may be another thread commiting
transaction which will try to access freed reloc root during merging reloc
roots process.
To keep consistency snapshots shared space, we should allow snapshot
finished if possible, so here we don't free reloc root memory.
signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
@nr is no longer used, remove it from select_reloc_root()
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we do a btree search with the goal of updating an existing item
without changing its size (ins_len == 0 and cow == 1), then we never
need to hold locks on upper level nodes (even when slot == 0) after we
COW their child nodes/leaves, as we won't have node splits or merges
in this scenario (that is, no key additions, removals or shifts on any
nodes or leaves).
Therefore release the locks immediately after COWing the child nodes/leaves
while navigating the btree, even if their parent slot is 0, instead of
returning a path to the caller with those nodes locked, which would get
released only when the caller releases or frees the path (or if it calls
btrfs_unlock_up_safe).
This is a common scenario, for example when updating inode items in fs
trees and block group items in the extent tree.
The following benchmarks were performed on a quad core machine with 32Gb
of ram, using a leaf/node size of 4Kb (to generate deeper fs trees more
quickly).
sysbench --test=fileio --file-num=131072 --file-total-size=8G \
--file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \
--max-requests=100000 --file-io-mode=sync [prepare|run]
Before this change: 49.85Mb/s (average of 5 runs)
After this change: 50.38Mb/s (average of 5 runs)
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The local variable 'new_size' comes from userspace. If a large number
was passed, there would be an integer overflow in the following line:
new_size = old_size + new_size;
Signed-off-by: Wenliang Fan <fanwlexca@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We can starve out the transaction commit with a bunch of caching threads all
running at the same time. This is because we will only drop the
extent_commit_sem if we need_resched(), which isn't likely to happen since we
will be reading a lot from the disk so have already schedule()'ed plenty. Alex
observed that he could starve out a transaction commit for up to a minute with
32 caching threads all running at once. This will allow us to drop the
extent_commit_sem to allow the transaction commit to swap the commit_root out
and then all the cachers will start back up. Here is an explanation provided by
Igno
So, just to fill in what happens in this loop:
mutex_unlock(&caching_ctl->mutex);
cond_resched();
goto again;
where 'again:' takes caching_ctl->mutex and fs_info->extent_commit_sem
again:
again:
mutex_lock(&caching_ctl->mutex);
/* need to make sure the commit_root doesn't disappear */
down_read(&fs_info->extent_commit_sem);
So, if I'm reading the code correct, there can be a fair amount of
concurrency here: there may be multiple 'caching kthreads' per filesystem
active, while there's one fs_info->extent_commit_sem per filesystem
AFAICS.
So, what happens if there are a lot of CPUs all busy holding the
->extent_commit_sem rwsem read-locked and a writer arrives? They'd all
rush to try to release the fs_info->extent_commit_sem, and they'd block in
the down_read() because there's a writer waiting.
So there's a guarantee of forward progress. This should answer akpm's
concern I think.
Thanks,
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
The inode reference item is close to inode item, so we insert it simultaneously
with the inode item insertion when we create a file/directory.. In fact, we also
can handle the inode reference deletion by the same way. So we made this patch to
introduce the delayed inode reference deletion for the single link inode(At most
case, the file doesn't has hard link, so we don't take the hard link into account).
This function is based on the delayed inode mechanism. After applying this patch,
we can reduce the time of the file/directory deletion by ~10%.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Two reasons:
- btrfs_end_transaction_dmeta() is the same as btrfs_end_transaction_throttle()
so it is unnecessary.
- All the delayed items should be dealt in the current transaction, so the
workers should not commit the transaction, instead, deal with the delayed
items as many as possible.
So we can remove btrfs_end_transaction_dmeta()
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
- move the condition check for wait into a function
- use wait_event_interruptible instead of prepare-schedule-finish process
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
If the number of the delayed items is greater than the upper limit, we will
try to flush all the delayed items. After that, it is unnecessary to run
them again because they are being dealt with by the wokers or the number of
them is less than the lower limit.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Before applying the patch
commit de3cb945db
title: Btrfs: improve the delayed inode throttling
We need requeue the async work after the current work was done, it
introduced a deadlock problem. So we wrote the code that this patch
removes to avoid the above problem. But after applying the above
patch, the deadlock problem didn't exist. So we should remove that
fix code.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Convert all applicable cases of printk and pr_* to the btrfs_* macros.
Fix all uses of the BTRFS prefix.
Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
While running the test btrfs/004 from xfstests in a loop, it failed
about 1 time out of 20 runs in my desktop. The failure happened in
the backref walking part of the test, and the test's error message was
like this:
btrfs/004 93s ... [failed, exit status 1] - output mismatch (see /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad)
--- tests/btrfs/004.out 2013-11-26 18:25:29.263333714 +0000
+++ /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad 2013-12-10 15:25:10.327518516 +0000
@@ -1,3 +1,8 @@
QA output created by 004
*** test backref walking
-*** done
+unexpected output from
+ /home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
+expected inum: 405, expected address: 454656, file: /home/fdmanana/btrfs-tests/scratch_1/snap1/p0/d6/d3d/d156/fce, got:
+
...
(Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad' to see the entire diff)
Ran: btrfs/004
Failures: btrfs/004
Failed 1 of 1 tests
But immediately after the test finished, the btrfs inspect-internal command
returned the expected output:
$ btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1
inode 405 offset 454656 root 258
inode 405 offset 454656 root 5
It turned out this was because the btrfs_search_old_slot() calls performed
during backref walking (backref.c:__resolve_indirect_ref) were not finding
anything. The reason for this turned out to be that the tree mod logging
code was not logging some node multi-step operations atomically, therefore
btrfs_search_old_slot() callers iterated often over an incomplete tree that
wasn't fully consistent with any tree state from the past. Besides missing
items, this often (but not always) resulted in -EIO errors during old slot
searches, reported in dmesg like this:
[ 4299.933936] ------------[ cut here ]------------
[ 4299.933949] WARNING: CPU: 0 PID: 23190 at fs/btrfs/ctree.c:1343 btrfs_search_old_slot+0x57b/0xab0 [btrfs]()
[ 4299.933950] Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep rfcomm bluetooth parport_pc ppdev binfmt_misc joydev snd_hda_codec_h
[ 4299.933977] CPU: 0 PID: 23190 Comm: btrfs Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70
[ 4299.933978] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012
[ 4299.933979] 000000000000053f ffff8806f3fd98f8 ffffffff8176d284 0000000000000007
[ 4299.933982] 0000000000000000 ffff8806f3fd9938 ffffffff8104a81c ffff880659c64b70
[ 4299.933984] ffff880659c643d0 ffff8806599233d8 ffff880701e2e938 0000160000000000
[ 4299.933987] Call Trace:
[ 4299.933991] [<ffffffff8176d284>] dump_stack+0x55/0x76
[ 4299.933994] [<ffffffff8104a81c>] warn_slowpath_common+0x8c/0xc0
[ 4299.933997] [<ffffffff8104a86a>] warn_slowpath_null+0x1a/0x20
[ 4299.934003] [<ffffffffa065d3bb>] btrfs_search_old_slot+0x57b/0xab0 [btrfs]
[ 4299.934005] [<ffffffff81775f3b>] ? _raw_read_unlock+0x2b/0x50
[ 4299.934010] [<ffffffffa0655001>] ? __tree_mod_log_search+0x81/0xc0 [btrfs]
[ 4299.934019] [<ffffffffa06dd9b0>] __resolve_indirect_refs+0x130/0x5f0 [btrfs]
[ 4299.934027] [<ffffffffa06a21f1>] ? free_extent_buffer+0x61/0xc0 [btrfs]
[ 4299.934034] [<ffffffffa06de39c>] find_parent_nodes+0x1fc/0xe40 [btrfs]
[ 4299.934042] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934048] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934056] [<ffffffffa06df980>] iterate_extent_inodes+0xe0/0x250 [btrfs]
[ 4299.934058] [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50
[ 4299.934065] [<ffffffffa06dfb82>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
[ 4299.934071] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs]
[ 4299.934078] [<ffffffffa06b7015>] btrfs_ioctl+0xf65/0x1f60 [btrfs]
[ 4299.934080] [<ffffffff811658b8>] ? handle_mm_fault+0x278/0xb00
[ 4299.934083] [<ffffffff81075563>] ? up_read+0x23/0x40
[ 4299.934085] [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0
[ 4299.934088] [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570
[ 4299.934090] [<ffffffff81776e23>] ? error_sti+0x5/0x6
[ 4299.934093] [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0
[ 4299.934096] [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13
[ 4299.934098] [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0
[ 4299.934100] [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[ 4299.934104] ---[ end trace 48f0cfc902491414 ]---
[ 4299.934378] btrfs bad fsid on block 0
These tree mod log operations that must be performed atomically, tree_mod_log_free_eb,
tree_mod_log_eb_copy, tree_mod_log_insert_root and tree_mod_log_insert_move, used to
be performed atomically before the following commit:
c8cc634165
(Btrfs: stop using GFP_ATOMIC for the tree mod log allocations)
That change removed the atomicity of such operations. This patch restores the
atomicity while still not doing the GFP_ATOMIC allocations of tree_mod_elem
structures, so it has to do the allocations using GFP_NOFS before acquiring
the mod log lock.
This issue has been experienced by several users recently, such as for example:
http://www.spinics.net/lists/linux-btrfs/msg28574.html
After running the btrfs/004 test for 679 consecutive iterations with this
patch applied, I didn't ran into the issue anymore.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Warn if the balance goes below zero, which appears to be unlikely
though. Otherwise cleans up the code a bit.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Since daivd did the work that force us to use readonly snapshot,
we can safely remove transaction protection from btrfs send.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We met the following oops when doing space balance:
kobject (ffff88081b590278): tried to init an initialized object, something is seriously wrong.
...
Call Trace:
[<ffffffff81937262>] dump_stack+0x49/0x5f
[<ffffffff8137d259>] kobject_init+0x89/0xa0
[<ffffffff8137d36a>] kobject_init_and_add+0x2a/0x70
[<ffffffffa009bd79>] ? clear_extent_bit+0x199/0x470 [btrfs]
[<ffffffffa005e82c>] __link_block_group+0xfc/0x120 [btrfs]
[<ffffffffa006b9db>] btrfs_make_block_group+0x24b/0x370 [btrfs]
[<ffffffffa00a899b>] __btrfs_alloc_chunk+0x54b/0x7e0 [btrfs]
[<ffffffffa00a8c6f>] btrfs_alloc_chunk+0x3f/0x50 [btrfs]
[<ffffffffa0060123>] do_chunk_alloc+0x363/0x440 [btrfs]
[<ffffffffa00633d4>] btrfs_check_data_free_space+0x104/0x310 [btrfs]
[<ffffffffa0069f4d>] btrfs_write_dirty_block_groups+0x48d/0x600 [btrfs]
[<ffffffffa007aad4>] commit_cowonly_roots+0x184/0x250 [btrfs]
...
Steps to reproduce:
# mkfs.btrfs -f <dev>
# mount -o nospace_cache <dev> <mnt>
# btrfs balance start <mnt>
# dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1
The reason of this problem is that we initialized the raid kobject when we added
a block group into a empty raid list. As we know, when we mounted a btrfs filesystem,
the raid list was empty, we would initialize the raid kobject when we added the first
block group. But if there was not data stored in the block group, the block group
would be freed when doing balance, and the raid list would be empty. And then if we
allocated a new block group and added it into the raid list, we would initialize
the raid kobject again, the oops happened.
Fix this problem by initializing the raid kobject just when mounting the fs.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reported-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
All the subvolues that are involved in send must be read-only during the
whole operation. The ioctl SUBVOL_SETFLAGS could be used to change the
status to read-write and the result of send stream is undefined if the
data change unexpectedly.
Fix that by adding a refcount for all involved roots and verify that
there's no send in progress during SUBVOL_SETFLAGS ioctl call that does
read-only -> read-write transition.
We need refcounts because there are no restrictions on number of send
parallel operations currently run on a single subvolume, be it source,
parent or one of the multiple clone sources.
Kernel is silent when the RO checks fail and returns EPERM. The same set
of checks is done already in userspace before send starts.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Unused since ed2590953b
"Btrfs: stop using vfs_read in send".
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Remove ifdefed code:
- tlv_put for 8, 16 and 32, add a generic tempalte if needed in future
- tlv_put_timespec - the btrfs_timespec fields are used
- fs_path_remove obsoleted long ago
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
While running btrfs/004 from xfstests, after 503 iterations, dmesg reported
a deadlock between tasks iterating inode refs and tasks running delayed inodes
(during a transaction commit).
It turns out that iterating inode refs implies doing one tree search and
release all nodes in the path except the leaf node, and then passing that
leaf node to btrfs_ref_to_path(), which in turn does another tree search
without releasing the lock on the leaf node it received as parameter.
This is a problem when other task wants to write to the btree as well and
ends up updating the leaf that is read locked - the writer task locks the
parent of the leaf and then blocks waiting for the leaf's lock to be
released - at the same time, the task executing btrfs_ref_to_path()
does a second tree search, without releasing the lock on the first leaf,
and wants to access a leaf (the same or another one) that is a child of
the same parent, resulting in a deadlock.
The trace reported by lockdep follows.
[84314.936373] INFO: task fsstress:11930 blocked for more than 120 seconds.
[84314.936381] Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70
[84314.936383] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84314.936386] fsstress D ffff8806e1bf8000 0 11930 11926 0x00000000
[84314.936393] ffff8804d6d89b78 0000000000000046 ffff8804d6d89b18 ffffffff810bd8bd
[84314.936399] ffff8806e1bf8000 ffff8804d6d89fd8 ffff8804d6d89fd8 ffff8804d6d89fd8
[84314.936405] ffff880806308000 ffff8806e1bf8000 ffff8804d6d89c08 ffff8804deb8f190
[84314.936410] Call Trace:
[84314.936421] [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10
[84314.936428] [<ffffffff81774269>] schedule+0x29/0x70
[84314.936451] [<ffffffffa0715bf5>] btrfs_tree_lock+0x75/0x270 [btrfs]
[84314.936457] [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60
[84314.936470] [<ffffffffa06ba231>] btrfs_search_slot+0x7f1/0x930 [btrfs]
[84314.936489] [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs]
[84314.936504] [<ffffffffa06d2e1f>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
[84314.936510] [<ffffffff810bd6ef>] ? trace_hardirqs_on_caller+0x1f/0x1e0
[84314.936528] [<ffffffffa073173c>] __btrfs_update_delayed_inode+0x4c/0x1d0 [btrfs]
[84314.936543] [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs]
[84314.936558] [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs]
[84314.936573] [<ffffffffa0731c82>] __btrfs_run_delayed_items+0x192/0x1e0 [btrfs]
[84314.936589] [<ffffffffa0731d03>] btrfs_run_delayed_items+0x13/0x20 [btrfs]
[84314.936604] [<ffffffffa06dbcd4>] btrfs_flush_all_pending_stuffs+0x24/0x80 [btrfs]
[84314.936620] [<ffffffffa06ddc13>] btrfs_commit_transaction+0x223/0xa20 [btrfs]
[84314.936630] [<ffffffffa06ae5ae>] btrfs_sync_fs+0x6e/0x110 [btrfs]
[84314.936635] [<ffffffff811d0b50>] ? __sync_filesystem+0x60/0x60
[84314.936639] [<ffffffff811d0b50>] ? __sync_filesystem+0x60/0x60
[84314.936643] [<ffffffff811d0b70>] sync_fs_one_sb+0x20/0x30
[84314.936648] [<ffffffff811a3541>] iterate_supers+0xf1/0x100
[84314.936652] [<ffffffff811d0c45>] sys_sync+0x55/0x90
[84314.936658] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[84314.936660] INFO: lockdep is turned off.
[84314.936663] INFO: task btrfs:11955 blocked for more than 120 seconds.
[84314.936666] Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70
[84314.936668] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84314.936670] btrfs D ffff880541729a88 0 11955 11608 0x00000000
[84314.936674] ffff880541729a38 0000000000000046 ffff8805417299d8 ffffffff810bd8bd
[84314.936680] ffff88075430c8a0 ffff880541729fd8 ffff880541729fd8 ffff880541729fd8
[84314.936685] ffffffff81c104e0 ffff88075430c8a0 ffff8804de8b00b8 ffff8804de8b0000
[84314.936690] Call Trace:
[84314.936695] [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10
[84314.936700] [<ffffffff81774269>] schedule+0x29/0x70
[84314.936717] [<ffffffffa0715815>] btrfs_tree_read_lock+0xd5/0x140 [btrfs]
[84314.936721] [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60
[84314.936733] [<ffffffffa06ba201>] btrfs_search_slot+0x7c1/0x930 [btrfs]
[84314.936746] [<ffffffffa06bd505>] btrfs_find_item+0x55/0x160 [btrfs]
[84314.936763] [<ffffffffa06ff689>] ? free_extent_buffer+0x49/0xc0 [btrfs]
[84314.936780] [<ffffffffa073c9ca>] btrfs_ref_to_path+0xba/0x1e0 [btrfs]
[84314.936797] [<ffffffffa06f9719>] ? release_extent_buffer+0xb9/0xe0 [btrfs]
[84314.936813] [<ffffffffa06ff689>] ? free_extent_buffer+0x49/0xc0 [btrfs]
[84314.936830] [<ffffffffa073cb50>] inode_to_path+0x60/0xd0 [btrfs]
[84314.936846] [<ffffffffa073d365>] paths_from_inode+0x115/0x3c0 [btrfs]
[84314.936851] [<ffffffff8118dd44>] ? kmem_cache_alloc_trace+0x114/0x200
[84314.936868] [<ffffffffa0714494>] btrfs_ioctl+0xf14/0x2030 [btrfs]
[84314.936873] [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50
[84314.936877] [<ffffffff8116598f>] ? handle_mm_fault+0x34f/0xb00
[84314.936882] [<ffffffff81075563>] ? up_read+0x23/0x40
[84314.936886] [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0
[84314.936892] [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570
[84314.936896] [<ffffffff81776e23>] ? error_sti+0x5/0x6
[84314.936901] [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0
[84314.936906] [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13
[84314.936910] [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0
[84314.936915] [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[84314.936920] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b
[84314.936922] INFO: lockdep is turned off.
[84434.866873] INFO: task btrfs-transacti:11921 blocked for more than 120 seconds.
[84434.866881] Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70
[84434.866883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[84434.866886] btrfs-transacti D ffff880755b6a478 0 11921 2 0x00000000
[84434.866893] ffff8800735b9ce8 0000000000000046 ffff8800735b9c88 ffffffff810bd8bd
[84434.866899] ffff8805a1b848a0 ffff8800735b9fd8 ffff8800735b9fd8 ffff8800735b9fd8
[84434.866904] ffffffff81c104e0 ffff8805a1b848a0 ffff880755b6a478 ffff8804cece78f0
[84434.866910] Call Trace:
[84434.866920] [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10
[84434.866927] [<ffffffff81774269>] schedule+0x29/0x70
[84434.866948] [<ffffffffa06dd2ef>] wait_current_trans.isra.33+0xbf/0x120 [btrfs]
[84434.866954] [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60
[84434.866970] [<ffffffffa06dec18>] start_transaction+0x388/0x5a0 [btrfs]
[84434.866985] [<ffffffffa06db9b5>] ? transaction_kthread+0xb5/0x280 [btrfs]
[84434.866999] [<ffffffffa06dee97>] btrfs_attach_transaction+0x17/0x20 [btrfs]
[84434.867012] [<ffffffffa06dba9e>] transaction_kthread+0x19e/0x280 [btrfs]
[84434.867026] [<ffffffffa06db900>] ? open_ctree+0x2260/0x2260 [btrfs]
[84434.867030] [<ffffffff81070dad>] kthread+0xed/0x100
[84434.867035] [<ffffffff81070cc0>] ? flush_kthread_worker+0x190/0x190
[84434.867040] [<ffffffff8177ee6c>] ret_from_fork+0x7c/0xb0
[84434.867044] [<ffffffff81070cc0>] ? flush_kthread_worker+0x190/0x190
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Chris introduced hleper function read_csums() and this function
has been removed, but we forgot to remove its corresponding comments.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
It's not used anywhere, so just drop it.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
fs/btrfs/file.c: In function ‘prepare_pages.isra.18’:
fs/btrfs/file.c:1265:6: warning: ‘err’ may be used uninitialized in this function [-Wuninitialized]
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We have commited transaction before, remove redundant filemap writting and
waiting here, it can speed up balance relocation process.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Clean up btrfs_lookup_dentry() to never return NULL, but PTR_ERR(-ENOENT)
instead. This keeps the return value convention consistent.
Callers who use btrfs_lookup_dentry() require a trivial update.
create_snapshot() in particular looks like it can also lose a BUG_ON(!inode)
which is not really needed - there seems less harm in returning ENOENT to
userspace at that point in the stack than there is to crash the machine.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
In ctree.c:tree_mod_log_set_node_key() we were calling
__tree_mod_log_insert_key() even when the modification doesn't need
to be logged. This would allocate a tree_mod_elem structure, fill it
and pass it to __tree_mod_log_insert(), which would just acquire
the tree mod log write lock and then free the tree_mod_elem structure
and return (that is, a no-op).
Therefore call tree_mod_log_insert() instead of __tree_mod_log_insert()
which just returns immediately if the modification doesn't need to be
logged (without allocating the structure, fill it, acquire write lock,
free structure).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
I need to create a fake tree to test qgroups and I don't want to have to setup a
fake btree_inode. The fact is we only use the radix tree for the fs_info, so
everybody else who allocates an extent_io_tree is just wasting the space anyway.
This patch moves the radix tree and its lock into btrfs_fs_info so there is less
stuff I have to fake to do qgroup sanity tests. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
For creating a dummy in-memory btree I need to be able to use the radix tree to
keep track of the buffers like normal extent buffers. With dummy buffers we
skip the radix tree step, and we still want to do that for the tree mod log
dummy buffers but for my test buffers we need to be able to remove them from the
radix tree like normal. This will give me a way to do that. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
I need to add infrastructure to allocate dummy extent buffers for running sanity
tests, and to do this I need to not have to worry about having an
address_mapping for an io_tree, so just fix up the places where we assume that
all io_tree's have a non-NULL ->mapping. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently when finding the leaf to insert a key into a btree, if the
leaf doesn't have enough space to store the item we attempt to move
off some items from our leaf to its right neighbor leaf, and if this
fails to create enough free space in our leaf, we try to move off more
items to the left neighbor leaf as well.
When trying to move off items to the right neighbor leaf, if it has
enough room to store the new key but not not enough room to move off
at least one item from our target leaf, __push_leaf_right returns 1 and
we have to attempt to move items to the left neighbor (push_leaf_left
function) without touching the right neighbor leaf.
For the case where the right leaf has enough room to store at least 1
item from our leaf, we end up modifying (and dirtying) both our leaf
and the right leaf. This is non-optimal for the case where the new key
is greater than any key in our target leaf because it can be inserted at
slot 0 of the right neighbor leaf and we don't need to touch our leaf
at all nor to attempt to move off items to the left neighbor leaf.
Therefore this change just selects the right neighbor leaf as our new
target leaf if it has enough room for the new key without modifying our
initial target leaf - we do this only if the new key is higher than any
key in the initial target leaf.
While running the following test, push_leaf_right was called by split_leaf
4802 times. Out of those 4802 calls, for 2571 calls (53.5%) we hit this
special case (right leaf has enough room and new key is higher than any key
in the initial target leaf).
Test:
sysbench --test=fileio --file-num=512 --file-total-size=5G \
--file-test-mode=[seqwr|rndwr] --num-threads=512 --file-block-size=8192 \
--max-requests=100000 --file-io-mode=sync [prepare|run]
Results:
sequential writes
Throughput before this change: 65.71Mb/sec (average of 10 runs)
Throughput after this change: 66.58Mb/sec (average of 10 runs)
random writes
Throughput before this change: 10.75Mb/sec (average of 10 runs)
Throughput after this change: 11.56Mb/sec (average of 10 runs)
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Just wrap same code into one function scrub_blocked_if_needed().
This make a change that we will move waiting (@workers_pending = 0)
before we can wake up commiting transaction(atomic_inc(@scrub_paused)),
we must take carefully to not deadlock here.
Thread 1 Thread 2
|->btrfs_commit_transaction()
|->set trans type(COMMIT_DOING)
|->btrfs_scrub_paused()(blocked)
|->join_transaction(blocked)
Move btrfs_scrub_paused() before setting trans type which means we can
still join a transaction when commiting_transaction is blocked.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Suggested-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We came a race condition when scrubbing superblocks, the story is:
In commiting transaction, we will update @last_trans_commited after
writting superblocks, if scrubber start after writting superblocks
and before updating @last_trans_commited, generation mismatch happens!
We fix this by checking @scrub_pause_req, and we won't start a srubber
until commiting transaction is finished.(after btrfs_scrub_continue()
finished.)
Reported-by: Sebastian Ochmann <ochmann@informatik.uni-bonn.de>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
fs/btrfs/send.c:2190:9: warning: incorrect type in argument 3 (different base types)
fs/btrfs/send.c:2190:9: expected unsigned long long [unsigned] [usertype] value
fs/btrfs/send.c:2190:9: got restricted __le64 [usertype] ctransid
fs/btrfs/send.c:2195:17: warning: incorrect type in argument 3 (different base types)
fs/btrfs/send.c:2195:17: expected unsigned long long [unsigned] [usertype] value
fs/btrfs/send.c:2195:17: got restricted __le64 [usertype] ctransid
fs/btrfs/send.c:3716:9: warning: incorrect type in argument 3 (different base types)
fs/btrfs/send.c:3716:9: expected unsigned long long [unsigned] [usertype] value
fs/btrfs/send.c:3716:9: got restricted __le64 [usertype] ctransid
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When merging an extent_map with its right neighbor, increment
its block_len with the neighbor's block_len.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
[commit 8185554d: fix incorrect inode acl reset] introduced a dead
code by adding a condition which can never be true to an else
branch. The condition can never be true because it is already
checked by a previous if statement which causes function to return.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-By: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We were accounting for sizeof(struct btrfs_item) twice, once
in the data_size variable and another time in the if statement
below.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently we do 2 traversals of an inode's extent_io_tree
before inserting an extent state structure: 1 to see if a
matching extent state already exists and 1 to do the insertion
if the fist traversal didn't found such extent state.
This change just combines those tree traversals into a single one.
While running sysbench tests (random writes) I captured the number
of elements in extent_io_tree trees for a while (into a procfs file
backed by a seq_list from seq_file module) and got this histogram:
Count: 9310
Range: 51.000 - 21386.000; Mean: 11785.243; Median: 18743.500; Stddev: 8923.688
Percentiles: 90th: 20985.000; 95th: 21155.000; 99th: 21369.000
51.000 - 93.933: 693 ########
93.933 - 172.314: 938 ##########
172.314 - 315.408: 856 #########
315.408 - 576.646: 95 #
576.646 - 6415.830: 888 ##########
6415.830 - 11713.809: 1024 ###########
11713.809 - 21386.000: 4816 #####################################################
So traversing such trees can take some significant time that can
easily be avoided.
Ran the following sysbench tests, 5 times each, for sequential and
random writes, and got the following results:
sysbench --test=fileio --file-num=1 --file-total-size=2G \
--file-test-mode=seqwr --num-threads=16 --file-block-size=65536 \
--max-requests=0 --max-time=60 --file-io-mode=sync
sysbench --test=fileio --file-num=1 --file-total-size=2G \
--file-test-mode=rndwr --num-threads=16 --file-block-size=65536 \
--max-requests=0 --max-time=60 --file-io-mode=sync
Before this change:
sequential writes: 69.28Mb/sec (average of 5 runs)
random writes: 4.14Mb/sec (average of 5 runs)
After this change:
sequential writes: 69.91Mb/sec (average of 5 runs)
random writes: 5.69Mb/sec (average of 5 runs)
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we didn't find a matching extent state, we inserted a new one
but didn't cache it in the **cached_state parameter, which makes a
subsequent call do a tree lookup to get it.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Before this change, adding an extent map to the extent map tree of an
inode required 2 tree nevigations:
1) doing a tree navigation to search for an existing extent map starting
at the same offset or an extent map that overlaps the extent map we
want to insert;
2) Another tree navigation to add the extent map to the tree (if the
former tree search didn't found anything).
This change just merges these 2 steps into a single one.
While running first few btrfs xfstests I had noticed these trees easily
had a few hundred elements, and then with the following sysbench test it
reached over 1100 elements very often.
Test:
sysbench --test=fileio --file-num=32 --file-total-size=10G \
--file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \
--max-requests=1000000 --file-io-mode=sync [prepare|run]
(fs created with mkfs.btrfs -l 4096 -f /dev/sdb3 before each sysbench
prepare phase)
Before this patch:
run 1 - 41.894Mb/sec
run 2 - 40.527Mb/sec
run 3 - 40.922Mb/sec
run 4 - 49.433Mb/sec
run 5 - 40.959Mb/sec
average - 42.75Mb/sec
After this patch:
run 1 - 48.036Mb/sec
run 2 - 50.21Mb/sec
run 3 - 50.929Mb/sec
run 4 - 46.881Mb/sec
run 5 - 53.192Mb/sec
average - 49.85Mb/sec
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When attempting to move items from our target leaf to its neighbor
leaves (right and left), we only need to free data_size - free_space
bytes from our leaf in order to add the new item (which has size of
data_size bytes). Therefore attempt to move items to the right and
left leaves if they have at least data_size - free_space bytes free,
instead of data_size bytes free.
After 5 runs of the following test, I got a smaller number of btree
node splits overall:
sysbench --test=fileio --file-num=512 --file-total-size=5G \
--file-test-mode=seqwr --num-threads=512 \
--file-block-size=8192 --max-requests=100000 --file-io-mode=sync
Before this change:
* 6171 splits (average of 5 test runs)
* 61.508Mb/sec of throughput (average of 5 test runs)
After this change:
* 6036 splits (average of 5 test runs)
* 63.533Mb/sec of throughput (average of 5 test runs)
An ideal test would not just have multiple threads/processes writing
to a file (insertion of file extent items) but also do other operations
that result in insertion of items with varied sizes, like file/directory
creations, creation of links, symlinks, xattrs, etc.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
After an ordered extent completes, don't blindly reset the
inode's ordered tree last accessed ordered extent pointer.
While running the xfstests I noticed that about 29% of the
time the ordered extent to which tree->last pointed was not
the same as our just completed ordered extent. After that I
ran the following sysbench test (after a prepare phase) and
noticed that about 68% of the time tree->last pointed to
a different ordered extent too.
sysbench --test=fileio --file-num=32 --file-total-size=4G \
--file-test-mode=rndwr --num-threads=512 \
--file-block-size=32768 --max-time=60 --max-requests=0 run
Therefore reset tree->last on ordered extent removal only if
it pointed to the ordered extent we're removing from the tree.
Results from 4 runs of the following test before and after
applying this patch:
$ sysbench --test=fileio --file-num=32 --file-total-size=4G \
--file-test-mode=seqwr --num-threads=512 \
--file-block-size=32768 --max-time=60 --file-io-mode=sync prepare
$ sysbench --test=fileio --file-num=32 --file-total-size=4G \
--file-test-mode=seqwr --num-threads=512 \
--file-block-size=32768 --max-time=60 --file-io-mode=sync run
Before this path:
run 1 - 64.049Mb/sec
run 2 - 63.455Mb/sec
run 3 - 64.656Mb/sec
run 4 - 63.833Mb/sec
After this patch:
run 1 - 66.149Mb/sec
run 2 - 68.459Mb/sec
run 3 - 66.338Mb/sec
run 4 - 66.176Mb/sec
With random writes (--file-test-mode=rndwr) I had huge fluctuations
on the results (+- 35% easily).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe noticed that we were leaking the features attribute group
after umount. His fix of just calling sysfs_remove_group() wasn't enough
since that removes just the supported features and not the unsupported
features.
This patch changes the unknown feature handling to add them individually
so we can skip the kmalloc and uses the same iteration to tear them down
later.
We also fix the error handling during mount so that we catch the
failing creation of the per-super kobject, and handle proper teardown
of a half-setup sysfs context.
Tested properly with kmemleak enabled this time.
Reported-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Tested-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch fixes the following warnings:
fs/btrfs/extent-tree.c:6201:12: sparse: symbol 'get_raid_name' was not declared. Should it be static?
fs/btrfs/extent-tree.c:8430:9: error: format not a string literal and no format arguments [-Werror=format-security] get_raid_name(index));
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The inode eviction can be very slow, because during eviction we
tell the VFS to truncate all of the inode's pages. This results
in calls to btrfs_invalidatepage() which in turn does calls to
lock_extent_bits() and clear_extent_bit(). These calls result in
too many merges and splits of extent_state structures, which
consume a lot of time and cpu when the inode has many pages. In
some scenarios I have experienced umount times higher than 15
minutes, even when there's no pending IO (after a btrfs fs sync).
A quick way to reproduce this issue:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ cd /mnt/btrfs
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
--file-test-mode=seqwr --num-threads=128 \
--file-block-size=16384 --max-time=60 --max-requests=0 run
$ time btrfs fi sync .
FSSync '.'
real 0m25.457s
user 0m0.000s
sys 0m0.092s
$ cd ..
$ time umount /mnt/btrfs
real 1m38.234s
user 0m0.000s
sys 1m25.760s
The same test on ext4 runs much faster:
$ mkfs.ext4 /dev/sdb3
$ mount /dev/sdb3 /mnt/ext4
$ cd /mnt/ext4
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
--file-test-mode=seqwr --num-threads=128 \
--file-block-size=16384 --max-time=60 --max-requests=0 run
$ sync
$ cd ..
$ time umount /mnt/ext4
real 0m3.626s
user 0m0.004s
sys 0m3.012s
After this patch, the unmount (inode evictions) is much faster:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ cd /mnt/btrfs
$ sysbench --test=fileio --file-num=128 --file-total-size=16G \
--file-test-mode=seqwr --num-threads=128 \
--file-block-size=16384 --max-time=60 --max-requests=0 run
$ time btrfs fi sync .
FSSync '.'
real 0m26.774s
user 0m0.000s
sys 0m0.084s
$ cd ..
$ time umount /mnt/btrfs
real 0m1.811s
user 0m0.000s
sys 0m1.564s
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We hit a forever loop when doing balance relocation,the reason
is that we firstly reserve 4M(node size is 16k).and within transaction
we will try to add extra reservation for snapshot roots,this will
return -EAGAIN if there has been a thread flushing space to reserve
space.We will do this again and again with filesystem becoming nearly
full.
If the above '-EAGAIN' case happens, we try to refill reservation more
outsize of transaction, and this will return eariler in enospc case,however,
this dosen't really hurt because it makes no sense doing balance relocation
with the filesystem nearly full.
Miao Xie helped a lot to track this issue, thanks.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
If the ordered extent's last byte was 1 less than our region's
start byte, we would unnecessarily wait for the completion of
that ordered extent, because it doesn't intersect our target
range.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we ran sysbench on the fs with compression, the following WARN_ONs were
triggered:
fs/btrfs/inode.c:7829 WARN_ON(BTRFS_I(inode)->outstanding_extents);
fs/btrfs/inode.c:7830 WARN_ON(BTRFS_I(inode)->reserved_extents);
fs/btrfs/inode.c:7832 WARN_ON(BTRFS_I(inode)->csum_bytes);
Steps to reproduce:
# mkfs.btrfs -f <dev>
# mount -o compress <dev> <mnt>
# cd <mnt>
# sysbench --test=fileio --num-threads=8 --file-total-size=8G \
> --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
> --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
> --file-test-mode=sync prepare
# cd -
# umount <mnt>
# mount -o compress <dev> <mnt>
# cd <mnt>
# sysbench --test=fileio --num-threads=8 --file-total-size=8G \
> --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
> --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
> --file-test-mode=sync run
# cd -
# umount <mnt>
The reason of this problem is:
Task0 Task1
btrfs_direct_IO
unlock(&inode->i_mutex)
lock(&inode->i_mutex)
reserve_space()
prepare_pages()
lock_extent()
clear_extent()
unlock_extent()
lock_extent()
test_extent(uptodate)
return false
copy_data()
set_delalloc_extent()
extent need compress
go back to buffered write
clear_extent(DELALLOC | DIRTY)
unlock_extent()
Task 0 and 1 wrote the same place, and task0 cleared the delalloc flag which
was set by task1, it made the dirty pages in that extents couldn't be flushed
into the disk, so the reserved space for that extent was not released at
the end.
This patch fixes the above bug by unlocking the extent after the delalloc.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
- the caller has gotten the inode object, needn't pass the file object.
And if so, we needn't define a inode pointer variant.
- the position should be aligned by the page size not sector size, so
we also needn't pass the root object into prepare_pages().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We don't need to crash hard here, it's just reading a sysfs file. The
values considered in switch are from a fixed set, the default case
should not happen at all.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Added in patch "btrfs: add ability to change features via sysfs",
modifications to superblock don't need to reserve metadata blocks when
starting a transaction.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The kernel macro pr_debug is defined as a empty statement when DEBUG is
not defined. Make btrfs_debug match pr_debug to avoid spamming
the kernel log with debug messages
Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Found by uselex.rb:
> btrfs_get_inode_ref_index: [R]: exported from:
fs/btrfs/inode-item.o fs/btrfs/btrfs.o fs/btrfs/built-in.o
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: David Stebra <dsterba@suse.cz>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This is the third step in bootstrapping the btrfs_find_item interface.
The function find_orphan_item(), in orphan.c, is similar to the two
functions already replaced by the new interface. It uses two parameters,
which are already present in the interface, and is nearly identical to
the function brought in in the previous patch.
Replace the two calls to find_orphan_item() with calls to
btrfs_find_item(), with the defined objectid and type that was used
internally by find_orphan_item(), a null path, and a null key. Add a
test for a null path to btrfs_find_item, and if it passes, allocate and
free the path. Finally, remove find_orphan_item().
Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch is the second step in bootstrapping the btrfs_find_item
interface. The btrfs_find_root_ref() is similar to the former
__inode_info(); it accepts four of its parameters, and duplicates the
first half of its functionality.
Replace the one former call to btrfs_find_root_ref() with a call to
btrfs_find_item(), along with the defined key type that was used
internally by btrfs_find_root ref, and a null found key. In
btrfs_find_item(), add a test for the null key at the place where
the functionality of btrfs_find_root_ref() ends; btrfs_find_item()
then returns if the test passes. Finally, remove btrfs_find_root_ref().
Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Suggested-by: Zach Brown <zab@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
There are many btrfs functions that manually search the tree for an
item. They all reimplement the same mechanism and differ in the
conditions that they use to find the item. __inode_info() is one such
example. Zach Brown proposed creating a new interface to take the place
of these functions.
This patch is the first step to creating the interface. A new function,
btrfs_find_item, has been added to ctree.c and prototyped in ctree.h.
It is identical to __inode_info, except that the order of the parameters
has been rearranged to more closely those of similar functions elsewhere
in the code (now, root and path come first, then the objectid, offset
and type, and the key to be filled in last). __inode_info's callers have
been set to call this new function instead, and __inode_info itself has
been removed.
Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Suggested-by: Zach Brown <zab@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Use otherwise unused local variables slot in update_qgroup_limit_item and
in update_qgroup_info_item, and remove unused variable ins from
btrfs_qgroup_account_ref.
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
The variable window_start in setup_cluster_no_bitmap is not used since commit
1bb91902dc
(Btrfs: revamp clustered allocation logic)
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Remove unused variables:
* tree from end_bio_extent_writepage,
* item from extent_fiemap.
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
The variable found_uncached_bg in find_free_extent is not used since commit
285ff5af6c
(Btrfs: remove the ideal caching code)
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Remove unused variables:
* tree from csum_dirty_buffer,
* tree from btree_readpage_end_io_hook,
* tree from btree_writepages,
* bytenr from btrfs_create_tree,
* fs_info from end_workqueue_fn.
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Variable owner in btrfs_new_inode is unused since commit
d82a6f1d7e
(Btrfs: kill BTRFS_I(inode)->block_group)
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This adds a writeable attribute which describes the label.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Now that we have the infrastructure for per-super attributes, we can
publish device membership in /sys/fs/btrfs/<fsid>/devices. The information
is published as symlinks to the block devices.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
While trying to debug ENOSPC issues, it's helpful to understand what the
kernel's view of the available space is. We export this information
via ioctl, but sysfs files are more easily used.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs filesystem df output will show the size of the metadata space
and how much of it is used, and the user assumes that the difference
is all usable space. Since that's not actually the case due to the
global metadata reservation, we should provide the full picture to the
user.
This patch adds an ioctl that exports the size of the global metadata
reservation so that btrfs filesystem df can report it.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Now that we have the feature name strings available in the kernel via
the sysfs attributes, we can use them for printing better failure
messages from the ioctl path.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch adds the ability to change (set/clear) features while the file
system is mounted. A bitmask is added for each feature set for the
support to set and clear the bits. A message indicating which bit
has been set or cleared is issued when it's been changed and also when
permission or support for a particular bit has been denied.
Since the the attributes can now be writable, we need to introduce
another struct attribute to hold the different permissions.
If neither set or clear is supported, the file will have 0444 permissions.
If either set or clear is supported, the file will have 0644 permissions
and the store handler will filter out the write based on the bitmask.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
With the compat and compat-ro bits, it's possible for file systems to
exist that have features that aren't supported by the kernel's file system
implementation yet still be mountable.
This patch publishes read-only info on those features using a prefix:number
format, where the number is the bit number rather than the shifted value.
e.g. "compat:12"
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch publishes information on which features are enabled in the
file system on a per-super basis. At this point, it only publishes
information on features supported by the file system implementation.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch adds per-super attributes to sysfs.
It doesn't publish any attributes yet, but does the proper lifetime
handling as well as the basic infrastructure to add new attributes.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch adds the ability to publish supported features to sysfs under
/sys/fs/btrfs/features.
The files are module-wide and export which features the kernel supports.
The content, for now, is just "0\n".
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
There are some feature bits that require no offline setup and can
be enabled online. I've only reviewed extended irefs, but there will
probably be more.
We introduce three new ioctls:
- BTRFS_IOC_GET_SUPPORTED_FEATURES: query the kernel for supported features.
- BTRFS_IOC_GET_FEATURES: query the kernel for enabled features on a per-fs
basis, as well as querying for which features are changeable with mounted.
- BTRFS_IOC_SET_FEATURES: change features on a per-fs basis.
We introduce two new masks per feature set (_SAFE_SET and _SAFE_CLEAR) that
allow us to define which features are safe to change at runtime.
The failure modes for BTRFS_IOC_SET_FEATURES are as follows:
- Enabling a completely unsupported feature: warns and returns -ENOTSUPP
- Enabling a feature that can only be done offline: warns and returns -EPERM
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we have data deduplication on, we'll hang on the merge part
because it needs to verify every queued delayed data refs related to
this disk offset but we may have millions refs.
And in the case of delayed data refs, we don't usually have too much
data refs to merge.
So it's safe to shut it down for data refs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
The way how we process delayed refs is
1) get a bunch of head refs,
2) pick up one head ref,
3) go one node back for any delayed ref updates.
The head ref is also linked in the same rbtree as the delayed ref is,
so in 1) stage, we have to walk one by one including not only head refs, but
delayed refs.
When we have a great number of delayed refs pending to process,
this'll cost time a lot.
Here we introduce a head ref specific rbtree, it only has head refs, so troubles
go away.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
We were looking at file_extent_num_bytes unconditionally when looking at
referenced data bytes, but this isn't correct for compression. Fix this by
checking the compression of the file extent we are and setting num_bytes to
disk_num_bytes in the case of compression so that we are marking the proper
bytes as referenced. This fixes check_int_data freaking out when running
btrfs/004. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Btrfs has always had these filler extent data items for holes in inodes. This
has made somethings very easy, like logging hole punches and sending hole
punches. However for large holey files these extent data items are pure
overhead. So add an incompatible feature to no longer add hole extents to
reduce the amount of metadata used by these sort of files. This has a few
changes for logging and send obviously since they will need to detect holes and
log/send the holes if there are any. I've tested this thoroughly with xfstests
and it doesn't cause any issues with and without the incompat format set.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull vfs updates from Al Viro:
"Assorted stuff; the biggest pile here is Christoph's ACL series. Plus
assorted cleanups and fixes all over the place...
There will be another pile later this week"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits)
__dentry_path() fixes
vfs: Remove second variable named error in __dentry_path
vfs: Is mounted should be testing mnt_ns for NULL or error.
Fix race when checking i_size on direct i/o read
hfsplus: remove can_set_xattr
nfsd: use get_acl and ->set_acl
fs: remove generic_acl
nfs: use generic posix ACL infrastructure for v3 Posix ACLs
gfs2: use generic posix ACL infrastructure
jfs: use generic posix ACL infrastructure
xfs: use generic posix ACL infrastructure
reiserfs: use generic posix ACL infrastructure
ocfs2: use generic posix ACL infrastructure
jffs2: use generic posix ACL infrastructure
hfsplus: use generic posix ACL infrastructure
f2fs: use generic posix ACL infrastructure
ext2/3/4: use generic posix ACL infrastructure
btrfs: use generic posix ACL infrastructure
fs: make posix_acl_create more useful
fs: make posix_acl_chmod more useful
...
Also don't bother to set up a .get_acl method for symlinks as we do not
support access control (ACLs or even mode bits) for symlinks in Linux.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rename the current posix_acl_created to __posix_acl_create and add
a fully featured helper to set up the ACLs on file creation that
uses get_acl().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rename the current posix_acl_chmod to __posix_acl_chmod and add
a fully featured ACL chmod helper that uses the ->set_acl inode
operation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* don't assume that ->dest_count won't change between copy_from_user()
and memdup_user()
* use fdget instead of fget
* don't bother comparing superblocks when we'd already compared vfsmounts
* get rid of excessive goto
* use file_inode() instead of open-coding the sucker
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For 3.14-rc1 there are fixes in the areas of remote attributes, discard,
growfs, memory leaks in recovery, directory v2, quotas, the MAINTAINERS
file, allocation alignment, extent list locking, and in
xfs_bmapi_allocate. There are cleanups in xfs_setsize_buftarg, removing
unused macros, quotas, setattr, and freeing of inode clusters. The
in-memory and on-disk log format have been decoupled, a common helper to
calculate the number of blocks in an inode cluster has been added, and
handling of i_version has been pulled into the filesystems that use it.
- cleanup in xfs_setsize_buftarg
- removal of remaining unused flags for vop toss/flush/flushinval
- fix for memory corruption in xfs_attrlist_by_handle
- fix for out-of-date comment in xfs_trans_dqlockedjoin
- fix for discard if range length is less than one block
- fix for overrun of agfl buffer using growfs on v4 superblock filesystems
- pull i_version handling out into the filesystems that use it
- don't leak recovery items on error
- fix for memory leak in xfs_dir2_node_removename
- several cleanups for quotas
- fix bad assertion in xfs_qm_vop_create_dqattach
- cleanup for xfs_setattr_mode, and add xfs_setattr_time
- fix quota assert in xfs_setattr_nonsize
- fix an infinite loop when turning off group/project quota before user
quota
- fix for temporary buffer allocation failure in xfs_dir2_block_to_sf
with large directory block sizes
- fix Dave's email address in MAINTAINERS
- cleanup calculation of freed inode cluster blocks
- fix alignment of initial file allocations to match filesystem geometry
- decouple in-memory and on-disk log format
- introduce a common helper to calculate the number of filesystem
blocks in an inode cluster
- fixes for extent list locking
- fix for off-by-one in xfs_attr3_rmt_verify
- fix for missing destroy_work_on_stack in xfs_bmapi_allocate
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJS4C2sAAoJENaLyazVq6ZOF/UP/A/FmscsEbgz+KHtsg1UXP+g
EMkrD8WOOzLnm7/GhkfvmmFLQrWrrNPmiP54MPJ/rxpfDb7rV60HgS0iHm13U+IR
0uPyk2NIQKG/Sj6FHCrgn4oh19B0tmer6lLE32UPrlNM7w/0NRm32Ms/jKz7WNvc
9Tqw3qNdtmP7leHG8EyD1ISzqUz6FNSC+qGyOTuBQUqG24LqsmZ1qD2Nw/HEz0ir
1YCqS+cjj8c3WaWMsdjEFdCvEIKS/sMYM+ZihATIl/5ggwtVobP7PH/4cG8Biby3
5wvcl3/jM+xjgGDC1YMQQeSADieus6ER4aZTjCvGLn9RajQTOBjP0QZTu2OHntTI
kD/52DfYErthGijS2qJcqXy11AaYtWQU2yTZXuMpaACIM1DDSD4NyGFP99BzbBaX
D4BgnC+vmfpbXO37PmophkeZwAvj+9K2BBG7X+g4sLynj/BtmvFCxIyK+SLHE3hN
kn+Pn03yTw0VGgvj0krXSllbYKE0LyLElaA6h0LejQgcOZM0W6Q8qaj+RODLitbw
HkQNKYCOMCzbdNu8rKAfup9UPZloiMukyMdMaw1MeCgCQSQLkZVxTegmVW9gzpIh
QJEARDaOnKB/G/CLCwbPZR09DBpg3yBt/eHjt0VnV+z2x+u9DTInAp6f013g7E/W
gU+vnBBPY1AYI8iiDraU
=t8nB
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-v3.14-rc1' of git://oss.sgi.com/xfs/xfs
Pull xfs update from Ben Myers:
"This is primarily bug fixes, many of which you already have. New
stuff includes a series to decouple the in-memory and on-disk log
format, helpers in the area of inode clusters, and i_version handling.
We decided to try to use more topic branches this release, so there
are some merge commits in there on account of that. I'm afraid I
didn't do a good job of putting meaningful comments in the first
couple of merges. Sorry about that. I think I have the hang of it
now.
For 3.14-rc1 there are fixes in the areas of remote attributes,
discard, growfs, memory leaks in recovery, directory v2, quotas, the
MAINTAINERS file, allocation alignment, extent list locking, and in
xfs_bmapi_allocate. There are cleanups in xfs_setsize_buftarg,
removing unused macros, quotas, setattr, and freeing of inode
clusters. The in-memory and on-disk log format have been decoupled, a
common helper to calculate the number of blocks in an inode cluster
has been added, and handling of i_version has been pulled into the
filesystems that use it.
- cleanup in xfs_setsize_buftarg
- removal of remaining unused flags for vop toss/flush/flushinval
- fix for memory corruption in xfs_attrlist_by_handle
- fix for out-of-date comment in xfs_trans_dqlockedjoin
- fix for discard if range length is less than one block
- fix for overrun of agfl buffer using growfs on v4 superblock
filesystems
- pull i_version handling out into the filesystems that use it
- don't leak recovery items on error
- fix for memory leak in xfs_dir2_node_removename
- several cleanups for quotas
- fix bad assertion in xfs_qm_vop_create_dqattach
- cleanup for xfs_setattr_mode, and add xfs_setattr_time
- fix quota assert in xfs_setattr_nonsize
- fix an infinite loop when turning off group/project quota before
user quota
- fix for temporary buffer allocation failure in xfs_dir2_block_to_sf
with large directory block sizes
- fix Dave's email address in MAINTAINERS
- cleanup calculation of freed inode cluster blocks
- fix alignment of initial file allocations to match filesystem
geometry
- decouple in-memory and on-disk log format
- introduce a common helper to calculate the number of filesystem
blocks in an inode cluster
- fixes for extent list locking
- fix for off-by-one in xfs_attr3_rmt_verify
- fix for missing destroy_work_on_stack in xfs_bmapi_allocate"
* tag 'xfs-for-linus-v3.14-rc1' of git://oss.sgi.com/xfs/xfs: (51 commits)
xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
xfs: fix off-by-one error in xfs_attr3_rmt_verify
xfs: assert that we hold the ilock for extent map access
xfs: use xfs_ilock_attr_map_shared in xfs_attr_list_int
xfs: use xfs_ilock_attr_map_shared in xfs_attr_get
xfs: use xfs_ilock_data_map_shared in xfs_qm_dqiterate
xfs: use xfs_ilock_data_map_shared in xfs_qm_dqtobp
xfs: take the ilock around xfs_bmapi_read in xfs_zero_remaining_bytes
xfs: reinstate the ilock in xfs_readdir
xfs: add xfs_ilock_attr_map_shared
xfs: rename xfs_ilock_map_shared
xfs: remove xfs_iunlock_map_shared
xfs: no need to lock the inode in xfs_find_handle
xfs: use xfs_icluster_size_fsb in xfs_imap
xfs: use xfs_icluster_size_fsb in xfs_ifree_cluster
xfs: use xfs_icluster_size_fsb in xfs_ialloc_inode_init
xfs: use xfs_icluster_size_fsb in xfs_bulkstat
xfs: introduce a common helper xfs_icluster_size_fsb
xfs: get rid of XFS_IALLOC_BLOCKS macros
xfs: get rid of XFS_INODE_CLUSTER_SIZE macros
...
In btrfs_end_bio(), we increment bi_remaining if is_orig_bio. If not,
we restore the orig_bio but failed to increment bi_remaining for
orig_bio, which triggers a BUG_ON later when bio_endio is called. Fix
is to increment bi_remaining when we restore the orig bio as well.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
CC: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Muthukumar Ratty <muthur@gmail.com>
Reviewed-by: Chris Mason <clm@fb.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch fixed several typo in printk from various
part of kernel source.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA
189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m
LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7
k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an
eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ
YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU=
=/4/R
-----END PGP SIGNATURE-----
Merge tag 'v3.13-rc6' into for-3.14/core
Needed to bring blk-mq uptodate, since changes have been going in
since for-3.14/core was established.
Fixup merge issues related to the immutable biovec changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
block/blk-flush.c
fs/btrfs/check-integrity.c
fs/btrfs/extent_io.c
fs/btrfs/scrub.c
fs/logfs/dev_bdev.c
Correct spelling typo in various part of kernel
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Pull btrfs fixes from Chris Mason:
"This is a small collection of fixes. It was rebased this morning, but
I was just fixing signed-off-by tags with the wrong email"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix access_ok() check in btrfs_ioctl_send()
Btrfs: make sure we cleanup all reloc roots if error happens
Btrfs: skip building backref tree for uuid and quota tree when doing balance relocation
Btrfs: fix an oops when doing balance relocation
Btrfs: don't miss skinny extent items on delayed ref head contention
btrfs: call mnt_drop_write after interrupted subvol deletion
Btrfs: don't clear the default compression type
The closing parenthesis is in the wrong place. We want to check
"sizeof(*arg->clone_sources) * arg->clone_sources_count" instead of
"sizeof(*arg->clone_sources * arg->clone_sources_count)".
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org
I hit an oops when merging reloc roots fails, the reason is that
new reloc roots may be added and we should make sure we cleanup
all reloc roots.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Quota tree and UUID Tree is only cowed, they can not be snapshoted.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
I hit an oops when inserting reloc root into @reloc_root_tree(it can be
easily triggered when forcing cow for relocation root)
[ 866.494539] [<ffffffffa0499579>] btrfs_init_reloc_root+0x79/0xb0 [btrfs]
[ 866.495321] [<ffffffffa044c240>] record_root_in_trans+0xb0/0x110 [btrfs]
[ 866.496109] [<ffffffffa044d758>] btrfs_record_root_in_trans+0x48/0x80 [btrfs]
[ 866.496908] [<ffffffffa0494da8>] select_reloc_root+0xa8/0x210 [btrfs]
[ 866.497703] [<ffffffffa0495c8a>] do_relocation+0x16a/0x540 [btrfs]
This is because reloc root inserted into @reloc_root_tree is not within one
transaction,reloc root may be cowed and root block bytenr will be reused then
oops happens.We should update reloc root in @reloc_root_tree when cow reloc
root node, fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently extent-tree.c:btrfs_lookup_extent_info() can miss the lookup
of skinny extent items. This can happen when the execution flow is the
following:
* We do an extent tree lookup and fail to find a skinny extent item;
* As a result, we attempt to see if a non-skinny extent item exists,
either by looking at previous item in the leaf or by doing another
full extent tree search;
* We have a transaction and then we check for a matching delayed ref
head in the transaction's delayed refs rbtree;
* We find such delayed ref head and then we try to lock it with a
call to mutex_trylock();
* The lock was contended so we jump to the label "again", which repeats
the extent tree search but for a non-skinny extent item, because we set
previously metadata variable to 0 and the search key to look for a
non-skinny extent-item;
* After the jump (and after releasing the transaction's delayed refs
lock), a skinny extent item might have been added to the extent tree
but we will miss it because metadata is set to 0 and the search key
is set for a non-skinny extent-item.
The fix here is to not reset metadata to 0 and to jump to the initial search
key setup if the delayed ref head is contended, instead of jumping directly
to the extent tree search label ("again").
This issue was found while investigating the issue reported at Bugzilla 64961.
David Sterba suspected this function was missing extent items, and that
this could be caused by the last change to this function, which was made
in the following patch:
[PATCH] Btrfs: optimize btrfs_lookup_extent_info()
(commit 74be951087)
But in fact this issue already existed before, because after failing to find
a skinny extent item, the code set the search key for a non-skinny extent
item, and on contention of a matching delayed ref head it would not search
the extent tree for a skinny extent item anymore.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
If btrfs_ioctl_snap_destroy blocks on the mutex and the process is
killed, mnt_write count is unbalanced and leads to unmountable
filesystem.
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
We met a oops caused by the wrong compression type:
[ 556.512356] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 556.512370] IP: [<ffffffff811dbaa0>] __list_del_entry+0x1/0x98
[SNIP]
[ 556.512490] [<ffffffff811dbb44>] ? list_del+0xd/0x2b
[ 556.512539] [<ffffffffa05dd5ce>] find_workspace+0x97/0x175 [btrfs]
[ 556.512546] [<ffffffff813c14b5>] ? _raw_spin_lock+0xe/0x10
[ 556.512576] [<ffffffffa05de276>] btrfs_compress_pages+0x2d/0xa2 [btrfs]
[ 556.512601] [<ffffffffa05af060>] compress_file_range.constprop.54+0x1f2/0x4e8 [btrfs]
[ 556.512627] [<ffffffffa05af388>] async_cow_start+0x32/0x4d [btrfs]
[ 556.512655] [<ffffffffa05cc7a1>] worker_loop+0x144/0x4c3 [btrfs]
[ 556.512661] [<ffffffff81059404>] ? finish_task_switch+0x80/0xb8
[ 556.512689] [<ffffffffa05cc65d>] ? btrfs_queue_worker+0x244/0x244 [btrfs]
[ 556.512695] [<ffffffff8104fa4e>] kthread+0x8d/0x95
[ 556.512699] [<ffffffff81050000>] ? bit_waitqueue+0x34/0x7d
[ 556.512704] [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65
[ 556.512709] [<ffffffff813c7eec>] ret_from_fork+0x7c/0xb0
[ 556.512713] [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65
Steps to reproduce:
# mkfs.btrfs -f <dev>
# mount -o nodatacow <dev> <mnt>
# touch <mnt>/<file>
# chattr =c <mnt>/<file>
# dd if=/dev/zero of=<mnt>/<file> bs=1M count=10
It is because we cleared the default compression type when setting the
nodatacow. In fact, we needn't do it because we have used COMPRESS flag to
indicate if we need compressed the file data or not, needn't use the
variant -- compress_type -- in btrfs_info to do the same thing, and just
use it to hold the default compression type. Or we would get a wrong compress
type for a file whose own compress flag is set but the compress flag of its
filesystem is not set.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull block layer fixes from Jens Axboe:
"A small collection of fixes for the current series. It contains:
- A fix for a use-after-free of a request in blk-mq. From Ming Lei
- A fix for a blk-mq bug that could attempt to dereference a NULL rq
if allocation failed
- Two xen-blkfront small fixes
- Cleanup of submit_bio_wait() type uses in the kernel, unifying
that. From Kent
- A fix for 32-bit blkg_rwstat reading. I apologize for this one
looking mangled in the shortlog, it's entirely my fault for missing
an empty line between the description and body of the text"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: fix use-after-free of request
blk-mq: fix dereference of rq->mq_ctx if allocation fails
block: xen-blkfront: Fix possible NULL ptr dereference
xen-blkfront: Silence pfn maybe-uninitialized warning
block: submit_bio_wait() conversions
Update of blkg_stat and blkg_rwstat may happen in bh context
Currently notify_change directly updates i_version for size updates,
which not only is counter to how all other fields are updated through
struct iattr, but also breaks XFS, which need inode updates to happen
under its own lock, and synchronized to the structure that gets written
to the log.
Remove the update in the common code, and it to btrfs and ext4,
XFS already does a proper updaste internally and currently gets a
double update with the existing code.
IMHO this is 3.13 and -stable material and should go in through the XFS
tree.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
btrfs bits got lost in the rebase
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It was being open coded in a few places.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With immutable biovecs we don't want code accessing bi_io_vec directly -
the uses this patch changes weren't incorrect since they all own the
bio, but it makes the code harder to audit for no good reason - also,
this will help with multipage bvecs later.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
It was being open coded in a few places.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
Pull btrfs fixes from Chris Mason:
"Almost all of these are bug fixes. Dave Sterba's documentation update
is the big exception because he removed our promises to set any
machine running Btrfs on fire"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Documentation: filesystems: update btrfs tools section
Documentation: filesystems: add new btrfs mount options
btrfs: update kconfig help text
btrfs: fix bio_size_ok() for max_sectors > 0xffff
btrfs: Use trace condition for get_extent tracepoint
btrfs: fix typo in the log message
Btrfs: fix list delete warning when removing ordered root from the list
Btrfs: print bytenr instead of page pointer in check-int
Btrfs: remove dead codes from ctree.h
Btrfs: don't wait for ordered data outside desired range
Btrfs: fix lockdep error in async commit
Btrfs: avoid heavy operations in btrfs_commit_super
Btrfs: fix __btrfs_start_workers retval
Btrfs: disable online raid-repair on ro mounts
Btrfs: do not inc uncorrectable_errors counter on ro scrubs
Btrfs: only drop modified extents if we logged the whole inode
Btrfs: make sure to copy everything if we rename
Btrfs: don't BUG_ON() if we get an error walking backrefs
Reflect the current status. Portions of the text taken from the
wiki pages.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The data type of max_sectors in queue settings is unsigned int. But
this value is stored to the local variable whose type is unsigned short
in bio_size_ok(). This can cause unexpected result when max_sectors >
0xffff.
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Doing an if statement to test some condition to know if we should
trigger a tracepoint is pointless when tracing is disabled. This just
adds overhead and wastes a branch prediction. This is why the
TRACE_EVENT_CONDITION() was created. It places the check inside the jump
label so that the branch does not happen unless tracing is enabled.
That is, instead of doing:
if (em)
trace_btrfs_get_extent(root, em);
Which is basically this:
if (em)
if (static_key(trace_btrfs_get_extent)) {
Using a TRACE_EVENT_CONDITION() we can just do:
trace_btrfs_get_extent(root, em);
And the condition trace event will do:
if (static_key(trace_btrfs_get_extent)) {
if (em) {
...
The static key is a non conditional jump (or nop) that is faster than
having to check if em is NULL or not.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Commit b02441999e "Btrfs: don't wait for
the completion of all the ordered extents" introduced a bug that broke
the ordered root list:
WARNING: CPU: 1 PID: 7119 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
It is because we forgot to return the roots in the splice list to the
ordered list of the fs. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The page pointer information was useless. The bytenr is what you
want when you search for submitted write bios.
Additionally, a new bit in the print mask is added that allows
to selectively enable the check-int submit_bio verbose mode. Before,
the global verbose mode had to be enabled leading to many million
useless lines in the kernel log.
And a comment is added that explains that LOG_BUF_SHIFT needs to
be set to a really high value.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
These two functions are only stated but undefined.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In btrfs_wait_ordered_range(), if we found an extent to the left
of the start of our desired wait range and the last byte of that
extent is 1 less than the desired range's start, we would would
wait for the IO completion of that extent unnecessarily.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Lockdep complains about btrfs's async commit:
[ 2372.462171] [ BUG: bad unlock balance detected! ]
[ 2372.462191] 3.12.0+ #32 Tainted: G W
[ 2372.462209] -------------------------------------
[ 2372.462228] ceph-osd/14048 is trying to release lock (sb_internal) at:
[ 2372.462275] [<ffffffffa022cb10>] btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
[ 2372.462305] but there are no more locks to release!
[ 2372.462324]
[ 2372.462324] other info that might help us debug this:
[ 2372.462349] no locks held by ceph-osd/14048.
[ 2372.462367]
[ 2372.462367] stack backtrace:
[ 2372.462386] CPU: 2 PID: 14048 Comm: ceph-osd Tainted: G W 3.12.0+ #32
[ 2372.462414] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011
[ 2372.462455] ffffffffa022cb10 ffff88007490fd28 ffffffff816f094a ffff8800378aa320
[ 2372.462491] ffff88007490fd50 ffffffff810adf4c ffff8800378aa320 ffff88009af97650
[ 2372.462526] ffffffffa022cb10 ffff88007490fd88 ffffffff810b01ee ffff8800898c0000
[ 2372.462562] Call Trace:
[ 2372.462584] [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
[ 2372.462619] [<ffffffff816f094a>] dump_stack+0x45/0x56
[ 2372.462642] [<ffffffff810adf4c>] print_unlock_imbalance_bug+0xec/0x100
[ 2372.462677] [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
[ 2372.462710] [<ffffffff810b01ee>] lock_release+0x18e/0x210
[ 2372.462742] [<ffffffffa022cb36>] btrfs_commit_transaction_async+0x1d6/0x2a0 [btrfs]
[ 2372.462783] [<ffffffffa025a7ce>] btrfs_ioctl_start_sync+0x3e/0xc0 [btrfs]
[ 2372.462822] [<ffffffffa025f1d3>] btrfs_ioctl+0x4c3/0x1f70 [btrfs]
[ 2372.462849] [<ffffffff812c0321>] ? avc_has_perm+0x121/0x1b0
[ 2372.462873] [<ffffffff812c0224>] ? avc_has_perm+0x24/0x1b0
[ 2372.462897] [<ffffffff8107ecc8>] ? sched_clock_cpu+0xa8/0x100
[ 2372.462922] [<ffffffff8117b145>] do_vfs_ioctl+0x2e5/0x4e0
[ 2372.462946] [<ffffffff812c19e6>] ? file_has_perm+0x86/0xa0
[ 2372.462969] [<ffffffff8117b3c1>] SyS_ioctl+0x81/0xa0
[ 2372.462991] [<ffffffff817045a4>] tracesys+0xdd/0xe2
====================================================
It's because that we don't do the right thing when checking if it's ok to
tell lockdep that we're trying to release the rwsem.
If the trans handle's type is TRANS_ATTACH, we won't acquire the freeze rwsem, but
as TRANS_ATTACH fits the check (trans < TRANS_JOIN_NOLOCK), we'll release the freeze
rwsem, which makes lockdep complains a lot.
Reported-by: Ma Jianpeng <majianpeng@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The 'git blame' history shows that, the old transaction commit code has to do
twice to ensure roots are updated and we have to flush metadata and super block
manually, however, right now all of these can be handled well inside
the transaction commit code without extra efforts.
And the error handling part remains same with the current code, -- 'return to
caller once we get error'.
This saves us a transaction commit and a flush of super block, which are both
heavy operations according to ftrace output analysis.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
__btrfs_start_workers returns 0 in case it raced with
btrfs_stop_workers and lost the race. This is wrong because worker in
this case is not allowed to start and is in fact destroyed. Return
-EINVAL instead.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This disables the "if needed, write the good copy back before the read
is completed" part of the read sequence for read-only mounts.
Cc: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently if we discover an error when scrubbing in ro mode we a)
blindly increment the uncorrectable_errors counter, and b) spam the
dmesg with the 'unable to fixup (regular) error at ...' message, even
though a) we haven't tried to determine if the error is correctable or
not, and b) we haven't tried to fixup anything. Fix this.
Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we fsync, seek and write, rename and then fsync again we will lose the
modified hole extent because the rename will drop all of the modified extents
since we didn't do the fast search. We need to only drop the modified extents
if we didn't do the fast search and we were logging the entire inode as we don't
need them anymore, otherwise this is being premature. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we rename a file that is already in the log and we fsync again we will lose
the new name. This is because we just log the inode update and not the new ref.
To fix this we just need to check if we are logging the new name of the inode
and copy all the metadata instead of just updating the inode itself. With this
patch my testcase now passes. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We can just return false for this so we stop doing the snapshot aware defrag
stuff. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fixes from Chris Mason:
"This pull fixes the empty_zero_page bug that Heiko reported, and
includes one more cleanup from Al Viro"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: get rid of fdentry()
btrfs: fix empty_zero_page misusage
Pull trivial tree updates from Jiri Kosina:
"Usual earth-shaking, news-breaking, rocket science pile from
trivial.git"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
doc: add missing files to timers/00-INDEX
timekeeping: Fix some trivial typos in comments
mm: Fix some trivial typos in comments
irq: Fix some trivial typos in comments
NUMA: fix typos in Kconfig help text
mm: update 00-INDEX
doc: Documentation/DMA-attributes.txt fix typo
DRM: comment: `halve' -> `half'
Docs: Kconfig: `devlopers' -> `developers'
doc: typo on word accounting in kprobes.c in mutliple architectures
treewide: fix "usefull" typo
treewide: fix "distingush" typo
mm/Kconfig: Grammar s/an/a/
kexec: Typo s/the/then/
Documentation/kvm: Update cpuid documentation for steal time and pv eoi
treewide: Fix common typo in "identify"
__page_to_pfn: Fix typo in comment
Correct some typos for word frequency
clk: fixed-factor: Fix a trivial typo
...
Heiko Carstens noticed that btrfs was using empty_zero_page
incorrectly. He explained:
The definition of empty_zero_page is architecture specific. It
is (currently) either a character array, an unsigned long
containing the address of the empty_zero_page, or even worse
only the address of the struct page belonging to the
empty_zero_page.
This commit changes btrfs to use a for-loop instead. On x86
the resulting .ko is smaller, and we're no longer worrying about
how each arch builds its zeros.
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
rename the function -- btrfs_start_all_delalloc_inodes(), and make its
name be compatible to btrfs_wait_ordered_roots(), since they are always
used at the same place.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
It is very likely that there are lots of ordered extents in the filesytem,
if we wait for the completion of all of them when we want to reclaim some
space for the metadata space reservation, we would be blocked for a long
time. The performance would drop down suddenly for a long time.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
It was very likely that there were lots of async delalloc pages in the
filesystem, if we waited until all the pages were flushed, we would be
blocked for a long time, and the performance would also drop down.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In shrink_delalloc(), what we need reclaim is the metadata space, so
flushing pages by to_reclaim is not reasonable, it is very likely that
the pages we flush are not enough. And then we had to invoke the flush
function for several times, at the worst, we need call flush_space for
several times. It wasted time.
We improve this problem by converting the metadata space size we need
reserve to the delalloc bytes, By this way, we can flush the pages
by a reasonable number.
(Now we use a fixed number to do conversion, it is not flexible, maybe
we can find a good way to improve it in the future.)
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch picked up the code that was used to calculate the number of
the items for which we need reserve space, and we will use it in the next
patch.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We only allocate scrub workers if we pass all the necessary
checks, for example, there are no operation in progress.
Besides, move mutex lock protection outside of scrub_workers_get()
/scrub_workers_put().
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I hit this problem with my no holes patch and it made me realize what the
problem was for bz 60834. If the first item in the leaf is an inline extent and
we try to read anything starting from disk_bytenr onward we will read off the
end of the leaf. So we need to check to see what it's type is, and if it's not
REG we can just break out. This should fix this problem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The function write_ctree_super() in disk-io.c uses variable ret to return
the result of function write_all_supers(). Since, this variable serves
no purpose, hence the patch removes it and returns the call of the
called function.
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Fix spacing issues detected via checkpatch.pl in accordance with the
kernel style guidelines.
Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Replace kmalloc(size * nr, ) with kmalloc_array(nr, size), thus making
it easier to check is that the calculation doesn't wrap or return a smaller allocation
Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Enclose macros with complex values within parenthesis in accordance to
checkpatch.pl.
Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Use WARN_ON()'s return value in place of WARN_ON(1) for cleaner source
code that outputs a more descriptive warnings. Also fix the styling
warning of redundant braces that came up as a result of this fix.
Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Remove redundant local zero structure, replacing it by the kernel's
global ZERO_PAGE.
Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pack the structure btrfs_device in volumes.h to eliminate holes detected
by pahole, thus reducing binary memory footprint.
Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch replaces multiple atomic_inc() with atomic_add() in
delayed-inode.c to reduce source code and have few instructions
for compilation.
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The function free_root_pointers() in disk-io.h contains redundant code.
Therefore, this patch adds a helper function free_root_extent_buffers()
to free_root_pointers() to eliminate redundancy.
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Running balance and defrag concurrently can end up with a crash:
kernel BUG at fs/btrfs/relocation.c:4528!
RIP: 0010:[<ffffffffa01ac33b>] [<ffffffffa01ac33b>] btrfs_reloc_cow_block+ 0x1eb/0x230 [btrfs]
Call Trace:
[<ffffffffa01398c1>] ? update_ref_for_cow+0x241/0x380 [btrfs]
[<ffffffffa0180bad>] ? copy_extent_buffer+0xad/0x110 [btrfs]
[<ffffffffa0139da1>] __btrfs_cow_block+0x3a1/0x520 [btrfs]
[<ffffffffa013a0b6>] btrfs_cow_block+0x116/0x1b0 [btrfs]
[<ffffffffa013ddad>] btrfs_search_slot+0x43d/0x970 [btrfs]
[<ffffffffa0153c57>] btrfs_lookup_file_extent+0x37/0x40 [btrfs]
[<ffffffffa0172a5e>] __btrfs_drop_extents+0x11e/0xae0 [btrfs]
[<ffffffffa013b3fd>] ? generic_bin_search.constprop.39+0x8d/0x1a0 [btrfs]
[<ffffffff8117d14a>] ? kmem_cache_alloc+0x1da/0x200
[<ffffffffa0138e7a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[<ffffffffa0173ef0>] btrfs_drop_extents+0x60/0x90 [btrfs]
[<ffffffffa016b24d>] relink_extent_backref+0x2ed/0x780 [btrfs]
[<ffffffffa0162fe0>] ? btrfs_submit_bio_hook+0x1e0/0x1e0 [btrfs]
[<ffffffffa01b8ed7>] ? iterate_inodes_from_logical+0x87/0xa0 [btrfs]
[<ffffffffa016b909>] btrfs_finish_ordered_io+0x229/0xac0 [btrfs]
[<ffffffffa016c3b5>] finish_ordered_fn+0x15/0x20 [btrfs]
[<ffffffffa018cbe5>] worker_loop+0x125/0x4e0 [btrfs]
[<ffffffffa018cac0>] ? btrfs_queue_worker+0x300/0x300 [btrfs]
[<ffffffff81075ea0>] kthread+0xc0/0xd0
[<ffffffff81075de0>] ? insert_kthread_work+0x40/0x40
[<ffffffff8164796c>] ret_from_fork+0x7c/0xb0
[<ffffffff81075de0>] ? insert_kthread_work+0x40/0x40
----------------------------------------------------------------------
It turns out to be that balance operation will bump root's @last_snapshot,
which enables snapshot-aware defrag path, and backref walking stuff will
find data reloc tree as refs' parent, and hit the BUG_ON() during COW.
As data reloc tree's data is just for relocation purpose, and will be deleted right
after relocation is done, it's unnecessary to walk those refs belonged to data reloc
tree, it'd be better to skip them.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If something wrong happens in write endio, running snapshot-aware defragment
can end up with undefined results, maybe a crash, so we should avoid it.
In order to share similar code, this also adds a helper to free the struct for
snapshot-aware defrag.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we get any error while doing a dir index/item lookup in the
log tree, we were always unlinking the corresponding inode in
the subvolume. It makes sense to unlink only if the lookup failed
to find the dir index/item, which corresponds to NULL or -ENOENT,
and not when other errors happen (like a transient -ENOMEM or -EIO).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We were setting the csums search offset and length to the right values if
the extent is compressed, but later on right before doing the csums lookup
we were overriding these two parameters regardless of compression being
set or not for the extent.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We were ignoring the name component of the dir_item. Both the
name and data must fit within BTRFS_MAX_XATTR_SIZE(root).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Originally, we introduced scrub_super_lock to synchronize
tree log code with scrubbing super.
However we can replace scrub_super_lock with device_list_mutex,
because writing super will hold this mutex, this will reduce an extra
lock holding when writing supers in sync log code.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We define a 'int' to get extent's generation by mistake,fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
After running space balance on a new fs, the fs check program outputed the
following warning message:
free space inode generation (0) did not match free space cache generation (20)
Steps to reproduce:
# mkfs.btrfs -f <dev>
# mount <dev> <mnt>
# btrfs balance start <mnt>
# umount <mnt>
# btrfs check <dev>
It was because there was no data space after the space balance, and the free
space write out task didn't try to allocate a new data chunk for the free space
inode when doing the reservation. So the data space reservation failed, and in
order to tell the free space loader that this free space inode could not be
trusted, the generation of the free space inode wasn't updated. Then the check
program found this problem and outputed the above message.
But in fact, it is safe that we try to allocate a new data chunk when we find
the data space is not enough. The patch fixes the above problem by this way.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I noticed with my horrible snapshot excercisor that we were taking forever to
relocate the larger the file system got. This appeared to be because we were
committing the transaction _constantly_. There were a few places where we do
braindead things with metadata reservation, like start a transaction and then
try to refill the block rsv, which not only keeps us from committing a
transaction during the enospc stuff, but keeps us from doing some of the harder
flushing work which will make us more likely to need to commit the transaction.
We also were checking the block rsv and committing the transaction if the block
rsv was below a certain threshold, but we were doing this in a place where we
don't actually keep anything in the block rsv so this was always ending up false
so we always committed the transaction in this case. I tested this to make sure
it didn't break anything, but it takes about 10 hours to get the box to this
state so I don't know how much of an impact it will really make. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When using delalloc workers in a non-waiting way (like for enospc handling) we
can end up not actually waiting for the dirty pages to be started if we have
compression. We need to add an extra filemap flush to make sure any async
extents that have started are actually moved along before returning. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
A user reported a list corruption warning from btrfs_remove_ordered_extent, it
is because we aren't taking the ordered_root_lock when we remove the inode from
the ordered operations list. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is just the write path, the only reason we start a transaction is so we can
check cross references, we don't make any actual changes, so there is no reason
to abort the transaction if we fail. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We can just return an error and we'll bail out properly. We still want to catch
this case to make sure we don't have a bug somewhere, so just warn if this pops
up. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I noticed that if the free space cache has an error writing out it's data it
won't actually error out, it will just carry on. This is because it doesn't
check the return value of btrfs_wait_ordered_range, which didn't actually return
anything. So fix this in order to keep us from making free space cache look
valid when it really isnt. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Apparently we don't actually close the files until we return to userspace, so
stop using vfs_read in send. This is actually better for us since we can avoid
all the extra logic of holding the file we're sending open and making sure to
clean it up. This will fix people who have been hitting too many files open
errors when trying to send. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In mixed-mode, when a data-block was later reused for metadata, a
warning was printed. This condition is now filtered out and the
warning is eliminated in this case.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Yet another cleanup patch broke code for which no xfstest exists.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Instead of doing another extent tree search if the first search failed
to find a metadata item, check if the previous item in the leaf is an
extent item and use it if it is, otherwise do the second tree search
for an extent item.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When debugging ENOSPC issues, it's nice to be able to see which
reservations failed as well as the ones which succeeded.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
fs/btrfs/compat.h only contained trivial macro wrappers of drop_nlink()
and inc_nlink(). This doesn't belong in mainline.
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
move_pages() has an inefficient backwards byte copy of regions of two
different pages. They're different pages so the regions won't overlap
and it could use memcpy().
At that point, though, move_pages() would be a slightly dimmer
re-implementation of copy_pages() that lacked the test for overlapping
page regions.
So remove move_pages() and just call copy_pages().
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When a directory has a default ACL and a subdirectory is created
under that directory, btrfs_init_acl() is called when the
subdirectory's inode is created to initialize the inode's ACL
(inherited from the parent directory) but it was clearing the ACL
from the inode after setting it if posix_acl_create() returned
success, instead of clearing it only if it returned an error.
To reproduce this issue:
$ mkfs.btrfs -f /dev/loop0
$ mount /dev/loop0 /mnt
$ mkdir /mnt/acl
$ setfacl -d --set u::rwx,g::rwx,o::- /mnt/acl
$ getfacl /mnt/acl
user::rwx
group::rwx
other::r-x
default:user::rwx
default:group::rwx
default:other::---
$ mkdir /mnt/acl/dir1
$ getfacl /mnt/acl/dir1
user::rwx
group::rwx
other::---
After unmounting and mounting again the filesystem, fgetacl returned the
expected ACL:
$ umount /mnt/acl
$ mount /dev/loop0 /mnt
$ getfacl /mnt/acl/dir1
user::rwx
group::rwx
other::---
default:user::rwx
default:group::rwx
default:other::---
Meaning that the underlying xattr was persisted.
Reported-by: Giuseppe Fierro <giuseppe@fierro.org>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Due to an off-by-one error, it is possible to reproduce a bug
when the inode cache is used.
The same inode number is assigned twice, the second time this
leads to an EEXIST in btrfs_insert_empty_items().
The issue can happen when a file is removed right after a subvolume
is created and then a new inode number is created before the
inodes in free_inode_pinned are processed.
unlink() calls btrfs_return_ino() which calls start_caching() in this
case which adds [highest_ino + 1, BTRFS_LAST_FREE_OBJECTID] by
searching for the highest inode (which already cannot find the
unlinked one anymore in btrfs_find_free_objectid()). So if this
unlinked inode's number is equal to the highest_ino + 1 (or >= this value
instead of > this value which was the off-by-one error), we mustn't add
the inode number to free_ino_pinned (caching_thread() does it right).
In this case we need to try directly to add the number to the inode_cache
which will fail in this case.
When this inode number is allocated while it is still in free_ino_pinned,
it is allocated and still added to the free inode cache when the
pinned inodes are processed, thus one of the following inode number
allocations will get an inode that is already in use and fail with EEXIST
in btrfs_insert_empty_items().
One example which was created with the reproducer below:
Create a snapshot, work in the newly created snapshot for the rest.
In unlink(inode 34284) call btrfs_return_ino() which calls start_caching().
start_caching() calls add_free_space [34284, 18446744073709517077].
In btrfs_return_ino(), call start_caching pinned [34284, 1] which is wrong.
mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284.
btrfs_unpin_free_ino calls add_free_space [34284, 1].
mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284.
EEXIST when the new inode is inserted.
One possible reproducer is this one:
#!/bin/sh
# preparation
TEST_DEV=/dev/sdc1
TEST_MNT=/mnt
umount ${TEST_MNT} 2>/dev/null || true
mkfs.btrfs -f ${TEST_DEV}
mount ${TEST_DEV} ${TEST_MNT} -o \
rw,relatime,compress=lzo,space_cache,inode_cache
btrfs subv create ${TEST_MNT}/s1
for i in `seq 34027`; do touch ${TEST_MNT}/s1/${i}; done
btrfs subv snap ${TEST_MNT}/s1 ${TEST_MNT}/s2
FILENAME=`find ${TEST_MNT}/s1/ -inum 4085 | sed 's|^.*/\([^/]*\)$|\1|'`
rm ${TEST_MNT}/s2/$FILENAME
touch ${TEST_MNT}/s2/$FILENAME
# the following steps can be repeated to reproduce the issue again and again
[ -e ${TEST_MNT}/s3 ] && btrfs subv del ${TEST_MNT}/s3
btrfs subv snap ${TEST_MNT}/s2 ${TEST_MNT}/s3
rm ${TEST_MNT}/s3/$FILENAME
touch ${TEST_MNT}/s3/$FILENAME
ls -alFi ${TEST_MNT}/s?/$FILENAME
touch ${TEST_MNT}/s3/_1 || logger FAILED
ls -alFi ${TEST_MNT}/s?/_1
touch ${TEST_MNT}/s3/_2 || logger FAILED
ls -alFi ${TEST_MNT}/s?/_2
touch ${TEST_MNT}/s3/__1 || logger FAILED
ls -alFi ${TEST_MNT}/s?/__1
touch ${TEST_MNT}/s3/__2 || logger FAILED
ls -alFi ${TEST_MNT}/s?/__2
# if the above is not enough, add the following loop:
for i in `seq 3 9`; do touch ${TEST_MNT}/s3/__${i} || logger FAILED; done
#for i in `seq 3 34027`; do touch ${TEST_MNT}/s3/__${i} || logger FAILED; done
# one of the touch(1) calls in s3 fail due to EEXIST because the inode is
# already in use that btrfs_find_ino_for_alloc() returns.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we decrement the key type, we must reset its offset to the largest
possible offset (u64)-1. If we decrement the key's objectid, then we
must reset the key's type and offset to their largest possible values,
(u8)-1 and (u64)-1 respectively. Not doing so can make us miss an
items in the tree.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Avoid repeated tree searches by processing all inode ref items in
a leaf at once instead of processing one at a time, followed by a
path release and a tree search for a key with a decremented offset.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Use memdup_user rather than duplicating its implementation
This is a little bit restricted to reduce false positives
The semantic patch that makes this report is available
in scripts/coccinelle/api/memdup_user.cocci.
More information about semantic patching is available at
http://coccinelle.lip6.fr/
Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Comparison of an inode's last modified transaction with the last committed
transaction is incorrect. Fix it.
Signed-off-by: chandan <chandan@linux.vnet.ibm.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If the path allocation failed, we would return without decrementing
the reference count in the delayed node we got before, resulting
in a leak.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If the user remounts the filesystem read-only while the uuid-tree
scan and rebuild task is still running (this happens once after the
filesystem was mounted with an old kernel, or when forced with the
mount options), the remount should wait on the tasks completion
before setting the filesystem read-only. Otherwise the background
task continues to write to the filesystem which is apparently not
what users expect.
The reproducer:
TEST_DEV=/dev/sdzzzzz1
TEST_MNT=/mnt
mkfs.btrfs -f $TEST_DEV
mount $TEST_DEV $TEST_MNT
for i in `seq 50000`; do btrfs subvolume create ${TEST_MNT}/$i; done
umount $TEST_MNT
mount $TEST_DEV $TEST_MNT -o rescan_uuid_tree
sleep 1
ps -elf | fgrep '[btrfs-uuid]' | grep -v grep
mount $TEST_DEV $TEST_MNT -o ro,remount
ps -elf | fgrep '[btrfs-uuid]' | grep -v grep
sleep 1
umount $TEST_MNT
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Device stats are only initialized (read from tree items) on mount.
Trying to read device stats after adding or replacing new devices will
return errors.
btrfs_init_new_device() and btrfs_init_dev_replace_tgtdev() are the two
functions that allocate and initialize new btrfs_device structures after
a filesystem is mounted. They set the device stats to zero by using
kzalloc() which is correct for new devices. The only missing thing was
to declare these stats as being valid (device->dev_stats_valid = 1) and
this patch adds this missing code.
This is the reproducer:
TEST_DEV1=/dev/sdzzzzz1
TEST_DEV2=/dev/sdzzzzz2
TEST_DEV3=/dev/sdzzzzz3
TEST_MNT=/mnt
mkfs.btrfs $TEST_DEV1
mount $TEST_DEV1 $TEST_MNT
btrfs device add $TEST_DEV2 $TEST_MNT
btrfs device stat $TEST_MNT
btrfs replace start -B $TEST_DEV2 $TEST_DEV3 $TEST_MNT
btrfs device stat $TEST_MNT
umount $TEST_MNT
Reported-by: Ondrej Kunc <kunc88@gmail.com>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When we fail to add a reference after a non-inline insertion by some reasons,
eg. ENOSPC, we'll abort the transaction, but we don't return this error to
the caller who has to walk around again to find something wrong, that's
unnecessary.
Also fixup other error paths to keep it simple.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
For both balance and replace, cancelling involves changing the on-disk
state and committing a transaction, which is not a good thing to do on
read-only filesystems.
Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
struct btrfs_ioctl_dev_replace_args memory is leaked if replace is
requested on a read-only filesystem. Fix it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
On mount failures, __btrfs_close_devices can be called well before
dev-replace state is read and ->is_tgtdev_for_dev_replace is set. This
leads to a bogus decrement of ->rw_devices and sets off a WARN_ON in
__btrfs_close_devices if replace target device happens to be on the
lists and we fail early in the mount sequence. Fix this by checking
the devid instead of ->is_tgtdev_for_dev_replace before the decrement:
for replace targets devid is always equal to BTRFS_DEV_REPLACE_DEVID.
Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In add_inode_ref() function:
Initializes local pointers.
Reduces the logical condition with the __add_inode_ref() return
value by using only one 'goto out'.
Centralizes the exiting, ensuring the freeing of all used memory.
Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
After commit de78b51a28
(btrfs: remove cache only arguments from defrag path), @blockptr is no more
used.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
@is_extent is no more needed since we don't defrag extent root.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The btrfs_insert_empty_item() function doesn't modify its
key argument.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
alloc_extent_buffer() uses radix_tree_lookup() when radix_tree_insert()
fails with EEXIST. That part of the code is very similar to the code in
find_extent_buffer(). This patch replaces radix_tree_lookup() and
surrounding code in alloc_extent_buffer() with find_extent_buffer().
Note that radix_tree_lookup() does not need to be protected by
tree->buffer_lock. It is protected by eb->refs.
While at it, this patch
- changes the other usage of radix_tree_lookup() in alloc_extent_buffer()
with find_extent_buffer() to reduce redundancy.
- removes the unused argument 'len' to find_extent_buffer().
Signed-Off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Whoever wrote this was braindead. Also it doesn't work right if you have
VACANCY's since we assumed you would only have that at the end of the file,
which won't be the case in the near future. I tested this with generic/285 and
generic/286 as well as the btrfs tests that use fssum since it uses
seek_hole/seek_data to verify things are ok. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I was hitting weird issues when trying to remove hole extents and it turned out
it was because I was sending non-aligned offsets down to
btrfs_lookup_csums_range. So add an assert for this in case somebody trips over
this in the future. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I added an assert to make sure we were looking up aligned offsets for csums and
I tripped it when running xfstests. This is because log_one_extent was checking
if block_start == 0 for a hole instead of EXTENT_MAP_HOLE. This worked out fine
in practice it seems, but it adds a lot of extra work that is uneeded. With
this fix I'm no longer tripping my assert. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Btrfs_get_extent was not handling this case properly, add a test to make sure we
don't regress. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
While trying to kill our hole extents I noticed I was seeing problems where we
seek into a file and then start writing and then try to fiemap that file later.
This is because we search for offset 0, don't find anything and so back up one
slot, which puts us at the inode ref or something like that, which means we goto
not_found and create an extent map for our entire search area. This isn't quite
what we want, we want to move forward one slot and see if there is an extent
there so we can limit our hole extent. This patch fixes this problem, I will
add a testcase for this as well. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Stefan was hitting a panic in the async worker stuff because we had outstanding
read bios while we were stopping the worker threads. You could reproduce this
easily if you mount -o nospace_cache and ran generic/273. This is because the
caching thread stuff is still going and we were stopping all the worker threads.
We need to stop the workers after this work is done, and the free block groups
code will wait for all the caching threads to stop first so we don't run into
this problem. With this patch we no longer panic. Thanks,
Cc: stable@vger.kernel.org
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I'm going to be removing hole extents in the near future so I wanted to make a
sanity test for btrfs_get_extent to make sure I don't break anything in the
meantime. This patch just puts btrfs_get_extent through its paces by giving it
a completely unreasonable mapping to look at and make sure it is giving us back
maps that make sense. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
So both Liu and I made huge messes of find_lock_delalloc_range trying to fix
stuff, me first by fixing extent size, then him by fixing something I broke and
then me again telling him to fix it a different way. So this is obviously a
candidate for some testing. This patch adds a pseudo fs so we can allocate fake
inodes for tests that need an inode or pages. Then it addes a bunch of tests to
make sure find_lock_delalloc_range is acting the way it is supposed to. With
this patch and all of our previous patches to find_lock_delalloc_range I am sure
it is working as expected now. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
While trying to track down a reserved space leak I noticed a few places where we
won't properly clean up reserved space if we have an error, this patch fixes
those up. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In trying to track down where we were leaking reserved space I noticed our
reserve extent tracepoints are a little off. First we were saying that the
reserved space had been alloced in btrfs_reserve_extent, which isn't the case,
this needs to be triggered when we actually allocate the space when we run the
delayed ref. We were also missing a few places where we should have been
tracing the btrfs_reserve_extent_free tracepoint. With these in place I was
able to put together where we were leaking reserved space. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we abort a transaction we will do the tree log cleanup at unmount, but this
happens after we free up the block groups. This makes all the leak detection
warnings go off because we think we've leaked space but in reality we just
haven't cleaned it up yet. So instead do the block group cleanup stuff after
free'ing the fs roots so we don't get these warnings. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
On error we will wait and free the tree log at unmount without a transaction.
This means that the actual freeing of the blocks doesn't happen which means we
complain about space leaks on unmount. So to fix this just skip the transaction
specific cleanup part of the tree log free'ing if we don't have a transaction
and that way we can free up our reserved space and our counters stay happy.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The transactions should be cleaning up their reservations on failure, this just
causes us to have warnings on unmount because we go negative by free'ing
reservations that have already been free'ed. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Replace progresses strictly from lower to higher offsets, and the
progress is tracked in chunks, by storing the physical offset of the
dev_extent which is being copied in the cursor_left field of
btrfs_dev_replace_item. When we are done copying the chunk,
left_cursor is updated to point one byte past the dev_extent, so that
on resume we can skip the dev_extents that have already been copied.
There is a major bug (which goes all the way back to the inception of
dev-replace in 3.8) in the way left_cursor is bumped: the bump is done
unconditionally, without any regard to the scrub_chunk return value.
On suspend (and also on any kind of error) scrub_chunk returns early,
i.e. without completing the copy. This leads to us skipping the chunk
that hasn't been fully copied yet when resuming.
Fix this by doing the cursor_left update only if scrub_chunk ret is 0.
(On suspend scrub_chunk returns with -ECANCELED, so this fix covers
both suspend and error cases.)
Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently the hash value used for adding an inode to the VFS's inode
hash table consists of the plain inode number, which is a 64 bits
integer. This results in hash table buckets (hlist_head lists) with
too many elements for at least 2 important scenarios:
1) When we have many subvolumes. Each subvolume has its own btree
where its files and directories are added to, and each has its
own objectid (inode number) namespace. This means that if we have
N subvolumes, and all have inode number X associated to a file or
directory, the corresponding inodes all map to the same hash table
entry, resulting in a bucket (hlist_head list) with N elements;
2) On 32 bits machines. Th VFS hash values are unsigned longs, which
are 32 bits wide on 32 bits machines, and the inode (objectid)
numbers are 64 bits unsigned integers. We simply cast the inode
numbers to hash values, which means that for all inodes with the
same 32 bits lower half, the same hash bucket is used for all of
them. For example, all inodes with a number (objectid) between
0x0000_0000_ffff_ffff and 0xffff_ffff_ffff_ffff will end up in
the same hash table bucket.
This change ensures the inode's hash value depends both on the
objectid (inode number) and its subvolume's (btree root) objectid.
For 32 bits machines, this change gives better entropy by making
the hash value depend on both the upper and lower 32 bits of the
64 bits hash previously computed.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In tree-log.c:btrfs_log_inode(), we keep calling btrfs_search_forward()
until it returns a key whose objectid is higher than our inode or until
the key's type is higher than our maximum allowed type.
At the end of the loop, we increment our mininum search key's objectid
and type regardless of our desired target objectid and maximum desired
type, which causes another loop iteration that will call again
btrfs_search_forward() just to figure out we've gone beyond our maximum
key and exit the loop. Therefore while incrementing our minimum key,
don't do it blindly and exit the loop immiediately if the next search
key's objectid or type is beyond what we seek.
Also after incrementing the type, set the key's offset to 0, which was
missing and could make us loose some of the inode's items.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
It is not used for anything.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
As we're hold a ref on looking up the extent map, we need to drop the ref
before returning to callers.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The performance was slowed down sometimes when we ran sysbench to measure
the performance of the sequential buffered write by 2 or more threads.
It was because the write order of the test threads might be confused
by the task scheduler, and the coming write would be beyond the end of
the file, in this case, we need insert dummy file extents and create
a hole for the area we skip. But in order to avoid the ongoing ordered
extents which are in the area, we need wait for them. Unfortunately,
the current code doesn't check if there are ordered extents in the area
or not, try to find and flush the dirty pages directly, but in fact,
there is no dirty page in that area, this step of the current code is
unnecessary, and just wastes time. Sometimes, it would increase
the contention of some locks, and makes the performance slow down suddenly.
So we remove the ordered extent flush function before the check, and flush
the dirty pages and wait for the ordered extents only when we find them.
According to my test, we got 1-2 times of the performance regression when
we ran the test by 10 times before applying this patch. After applying
this patch, the regression went away.
Test Environment:
CPU: 1CPU * 4Cores
Memory: 6GB
Partition: 20GB
Test Command:
# sysbench --test=fileio --file-total-size=16G --file-test-mode=seqwr \
> --num-threads=512 --file-block-size=16384 --max-time=60 --max-requests=0 run
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When we did space balance and snapshot creation at the same time, we might
meet the following oops:
kernel BUG at fs/btrfs/inode.c:3038!
[SNIP]
Call Trace:
[<ffffffffa0411ec7>] btrfs_orphan_cleanup+0x293/0x407 [btrfs]
[<ffffffffa042dc45>] btrfs_mksubvol.isra.28+0x259/0x373 [btrfs]
[<ffffffffa042de85>] btrfs_ioctl_snap_create_transid+0x126/0x156 [btrfs]
[<ffffffffa042dff1>] btrfs_ioctl_snap_create_v2+0xd0/0x121 [btrfs]
[<ffffffffa0430b2c>] btrfs_ioctl+0x414/0x1854 [btrfs]
[<ffffffff813b60b7>] ? __do_page_fault+0x305/0x379
[<ffffffff811215a9>] vfs_ioctl+0x1d/0x39
[<ffffffff81121d7c>] do_vfs_ioctl+0x32d/0x3e2
[<ffffffff81057fe7>] ? finish_task_switch+0x80/0xb8
[<ffffffff81121e88>] SyS_ioctl+0x57/0x83
[<ffffffff813b39ff>] ? do_device_not_available+0x12/0x14
[<ffffffff813b99c2>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa040da40>] btrfs_orphan_add+0xc3/0x126 [btrfs]
The reason of the problem is that the relocation root creation stole
the reserved space, which was reserved for orphan item deletion.
There are several ways to fix this problem, one is to increasing
the reserved space size of the space balace, and then we can use
that space to create the relocation tree for each fs/file trees.
But it is hard to calculate the suitable size because we doesn't
know how many fs/file trees we need relocate.
We fixed this problem by reserving the space for relocation root creation
actively since the space it need is very small (one tree block, used for
root node copy), then we use that reserved space to create the
relocation tree. If we don't reserve space for relocation tree creation,
we will use the reserved space of the balance.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Remove unused parameter, 'eb'. Unused since introduction in
5f39d397df
Updated to be rebased against current upstream and correct diff supplied this time!
Signed-off-by: Ross Kirk <ross.kirk@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I was noticing the slab redzone stuff going off every once and a while during
transaction aborts. This was caused by two things
1) We would walk the pending snapshots and set their error to -ECANCELED. We
don't need to do this, the snapshot stuff waits for a transaction commit and if
there is a problem we just free our pending snapshot object and exit. Doing
this was causing us to touch the pending snapshot object after the thing had
already been freed.
2) We were freeing the transaction manually with wanton disregard for it's
use_count reference counter. To fix this I cleaned up the transaction freeing
loop to either wait for the transaction commit to finish if it was in the middle
of that (since it will be cleaned and freed up there) or to do the cleanup
oursevles.
I also moved the global "kill all things dirty everywhere" stuff outside of the
transaction cleanup loop since that only needs to be done once. With this patch
I'm no longer seeing slab corruption because of use after frees. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Noticed this when forcing errors to happen during delayed ref running. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
During transaction cleanup after an abort we are just removing roots from the
ordered roots list which is incorrect. We have a BUG_ON() to make sure that the
root is still part of the ordered roots list when we put our ordered extent
which we were tripping in this case. So do like we do everywhere else and just
move it to the tail of the ordered roots list and allow the normal cleanup to
take care of stuff. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we abort not during a transaction commit we won't clean up anything until we
unmount. Unfortunately if we abort in the middle of writing out an ordered
extent we won't clean it up and if somebody is waiting on that ordered extent
they will wait forever. To fix this just make the transaction kthread call the
cleanup transaction stuff if it notices theres an error, and make
btrfs_end_transaction wake up the transaction kthread if there is an error.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I've been testing our error paths and I was tripping the BUG_ON() in
drop_outstanding_extent because our outstanding_extents is 0 for space cache
inodes. This is because we don't reserve metadata space for these inodes since
we depend on the global block reserve for our space. To fix this we need to
make sure the DO_ACCOUNTING stuff doesn't actually call release_metadata for
space cache inodes. With this patch I'm no longer panicing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we abort a transaction in the middle of a commit we weren't undoing the
intwrite locking. This patch fixes that problem.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
A user reported a problem where they were getting csum errors when running a
balance and running systemd's journal. This is because systemd is awesome and
fallocate()'s its log space and writes into it. Unfortunately we assume that
when we read in all the csums for an extent that they are sequential starting at
the bytenr we care about. This obviously isn't the case for prealloc extents,
where we could have written to the middle of the prealloc extent only, which
means the csum would be for the bytenr in the middle of our range and not the
front of our range. Fix this by offsetting the new bytenr we are logging to
based on the original bytenr the csum was for. With this patch I no longer see
the csum errors I was seeing. Thanks,
Cc: stable@vger.kernel.org
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In extent-tree.c:btrfs_write_dirty_block_groups(), if the call to
write_one_cache_group() failed, we would return without putting
the block group first.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently the fs sync function (super.c:btrfs_sync_fs()) doesn't
wait for delayed work to finish before returning success to the
caller. This change fixes this, ensuring that there's no data loss
if a power failure happens right after fs sync returns success to
the caller and before the next commit happens.
Steps to reproduce the data loss issue:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ perl -e '$d = ("\x41" x 6001); open($f,">","/mnt/btrfs/foobar"); print $f $d; close($f);' && btrfs fi sync /mnt/btrfs
Right after the btrfs fi sync command (a second or 2 for example), power
off the machine and reboot it. The file will be empty, as it can be verified
after mounting the filesystem and through btrfs-debug-tree:
$ btrfs-debug-tree /dev/sdb3 | egrep '\(257 INODE_ITEM 0\) itemoff' -B 3 -A 8
item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36
location key (257 INODE_ITEM 0) type FILE
namelen 6 datalen 0 name: foobar
item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160
inode generation 7 transid 7 size 0 block group 0 mode 100644 links 1
item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16
inode ref index 2 namelen 6 name: foobar
checksum tree key (CSUM_TREE ROOT_ITEM 0)
leaf 29429760 items 0 free space 3995 generation 7 owner 7
fs uuid 6192815c-af2a-4b75-b3db-a959ffb6166e
chunk uuid b529c44b-938c-4d3d-910a-013b4700bcae
uuid tree key (UUID_TREE ROOT_ITEM 0)
After this patch, the data loss no longer happens after a power failure and
btrfs-debug-tree shows:
$ btrfs-debug-tree /dev/sdb3 | egrep '\(257 INODE_ITEM 0\) itemoff' -B 3 -A 8
item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36
location key (257 INODE_ITEM 0) type FILE
namelen 6 datalen 0 name: foobar
item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160
inode generation 6 transid 6 size 6001 block group 0 mode 100644 links 1
item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16
inode ref index 2 namelen 6 name: foobar
item 6 key (257 EXTENT_DATA 0) itemoff 3522 itemsize 53
extent data disk byte 12845056 nr 8192
extent data offset 0 nr 8192 ram 8192
extent compression 0
checksum tree key (CSUM_TREE ROOT_ITEM 0)
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In inode.c:btrfs_orphan_add() if we failed to insert the orphan
item, we would return without decrementing the orphan count that
we just incremented before attempting the insertion, leaving the
orphan inode count wrong.
In inode.c:btrfs_orphan_del(), we were decrementing the inode
orphan count if the bit BTRFS_INODE_ORPHAN_META_RESERVED was set,
which is logically wrong because it should be decremented if the
bit BTRFS_INODE_HAS_ORPHAN_ITEM was set - after all we increment
the count when we set the bit BTRFS_INODE_HAS_ORPHAN_ITEM elsewhere.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Similar to ocfs2, btrfs also supports that extents can be shared by
different inodes, and there are some userspace tools requesting
for this kind of 'space shared infomation'.[1]
ocfs2 uses flag FIEMAP_EXTENT_SHARED, so does btrfs.
[1]: http://thr3ads.net/ocfs2-devel/2010/09/489052-PATCH-3-3-shared-du-using-fiemap-to-figure-up-the-shared-extents-per-file-and-the-footprint-in
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Not used for anything, and removing it avoids caller's need to
allocate a path structure.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We're doing a unnecessary extra lookup of the ino cache's
inode when we already have it (and holding a reference)
during the process of saving the ino cache contents to disk.
Therefore remove this extra lookup.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
While running some snashot aware defrag tests I noticed I was panicing every
once and a while in key_search. This is because of the optimization that says
if we find a key at slot 0 it will be at slot 0 all the way down the rest of the
tree. This isn't the case for btrfs_search_old_slot since it will likely replay
changes to a buffer if something has changed since we took our sequence number.
So short circuit this optimization by setting prev_cmp to -1 every time we call
key_search so we will do our normal binary search. With this patch I am no
longer seeing the panics I was seeing before. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
While looking at somebodys corruption I became completely convinced that
btrfs_split_item was broken, so I wrote this test to verify that it was working
as it was supposed to. Thankfully it appears to be working as intended, so just
add this test to make sure nobody breaks it in the future. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Remove unused eb parameter from btrfs_item_nr
Signed-off-by: Ross Kirk <ross.kirk@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
It is not necessary to store the NULL byte in a symlink inline file
extent. There's currently no code that requires the NULL byte to be
present in the extent. This change also doesn't break file format
compatibility nor the send/receive feature.
The VFS also doesn't need the NULL byte to be present in the extent,
as it reads up to inode->i_size bytes (which already excluded the NULL
byte) and sets the NULL byte for us (in fs/namei.c:page_getlink()).
So with this change we save 1 byte per symlink file extent (which is
always inlined in the btree leaf) without losing backward and forward
compatibility.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The fact that btrfs_root_refs() returned 0 for the tree_root caused
bugs in the past, therefore it is set to 1 with this patch and
(hopefully) all affected code is adapted to this change.
I verified this change by temporarily adding WARN_ON() checks
everywhere where btrfs_root_refs() is used, checking whether the
logic of the code is changed by btrfs_root_refs() returning 1
instead of 0 for root->root_key.objectid == BTRFS_ROOT_TREE_OBJECTID.
With these added checks, I ran the xfstests './check -g auto'.
The two roots chunk_root and log_root_tree that are only referenced
by the superblock and the log_roots below the log_root_tree still
have btrfs_root_refs() == 0, only the tree_root is changed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fix from Chris Mason:
"Sage hit a deadlock with ceph on btrfs, and Josef tracked it down to a
regression in our initial rc1 pull. When doing nocow writes we were
sometimes starting a transaction with locks held"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: release path before starting transaction in can_nocow_extent
We can't be holding tree locks while we try to start a transaction, we will
deadlock. Thanks,
Reported-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fixes from Chris Mason:
"We've got more bug fixes in my for-linus branch:
One of these fixes another corner of the compression oops from last
time. Miao nailed down some problems with concurrent snapshot
deletion and drive balancing.
I kept out one of his patches for more testing, but these are all
stable"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix oops caused by the space balance and dead roots
Btrfs: insert orphan roots into fs radix tree
Btrfs: limit delalloc pages outside of find_delalloc_range
Btrfs: use right root when checking for hash collision
When doing space balance and subvolume destroy at the same time, we met
the following oops:
kernel BUG at fs/btrfs/relocation.c:2247!
RIP: 0010: [<ffffffffa04cec16>] prepare_to_merge+0x154/0x1f0 [btrfs]
Call Trace:
[<ffffffffa04b5ab7>] relocate_block_group+0x466/0x4e6 [btrfs]
[<ffffffffa04b5c7a>] btrfs_relocate_block_group+0x143/0x275 [btrfs]
[<ffffffffa0495c56>] btrfs_relocate_chunk.isra.27+0x5c/0x5a2 [btrfs]
[<ffffffffa0459871>] ? btrfs_item_key_to_cpu+0x15/0x31 [btrfs]
[<ffffffffa048b46a>] ? btrfs_get_token_64+0x7e/0xcd [btrfs]
[<ffffffffa04a3467>] ? btrfs_tree_read_unlock_blocking+0xb2/0xb7 [btrfs]
[<ffffffffa049907d>] btrfs_balance+0x9c7/0xb6f [btrfs]
[<ffffffffa049ef84>] btrfs_ioctl_balance+0x234/0x2ac [btrfs]
[<ffffffffa04a1e8e>] btrfs_ioctl+0xd87/0x1ef9 [btrfs]
[<ffffffff81122f53>] ? path_openat+0x234/0x4db
[<ffffffff813c3b78>] ? __do_page_fault+0x31d/0x391
[<ffffffff810f8ab6>] ? vma_link+0x74/0x94
[<ffffffff811250f5>] vfs_ioctl+0x1d/0x39
[<ffffffff811258c8>] do_vfs_ioctl+0x32d/0x3e2
[<ffffffff811259d4>] SyS_ioctl+0x57/0x83
[<ffffffff813c3bfa>] ? do_page_fault+0xe/0x10
[<ffffffff813c73c2>] system_call_fastpath+0x16/0x1b
It is because we returned the error number if the reference of the root was 0
when doing space relocation. It was not right here, because though the root
was dead(refs == 0), but the space it held still need be relocated, or we
could not remove the block group. So in this case, we should return the root
no matter it is dead or not.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Now we don't drop all the deleted snapshots/subvolumes before the space
balance. It means we have to relocate the space which is held by the dead
snapshots/subvolumes. So we must into them into fs radix tree, or we would
forget to commit the change of them when doing transaction commit, and it
would corrupt the metadata.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Liu fixed part of this problem and unfortunately I steered him in slightly the
wrong direction and so didn't completely fix the problem. The problem is we
limit the size of the delalloc range we are looking for to max bytes and then we
try to lock that range. If we fail to lock the pages in that range we will
shrink the max bytes to a single page and re loop. However if our first page is
inside of the delalloc range then we will end up limiting the end of the range
to a period before our first page. This is illustrated below
[0 -------- delalloc range --------- 256mb]
[page]
So find_delalloc_range will return with delalloc_start as 0 and end as 128mb,
and then we will notice that delalloc_start < *start and adjust it up, but not
adjust delalloc_end up, so things go sideways. To fix this we need to not limit
the max bytes in find_delalloc_range, but in find_lock_delalloc_range and that
way we don't end up with this confusion. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
btrfs_rename was using the root of the old dir instead of the root of the new
dir when checking for a hash collision, so if you tried to move a file into a
subvol it would freak out because it would see the file you are trying to move
in its current root. This fixes the bug where this would fail
btrfs subvol create test1
btrfs subvol create test2
mv test1 test2.
Thanks to Chris Murphy for catching this,
Cc: stable@vger.kernel.org
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
free_device rcu callback, scheduled from btrfs_rm_dev_replace_srcdev,
can be processed before btrfs_scratch_superblock is called, which would
result in a use-after-free on btrfs_device contents. Fix this by
zeroing the superblock before the rcu callback is registered.
Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The current implementation of worker threads in Btrfs has races in
worker stopping code, which cause all kinds of panics and lockups when
running btrfs/011 xfstest in a loop. The problem is that
btrfs_stop_workers is unsynchronized with respect to check_idle_worker,
check_busy_worker and __btrfs_start_workers.
E.g., check_idle_worker race flow:
btrfs_stop_workers(): check_idle_worker(aworker):
- grabs the lock
- splices the idle list into the
working list
- removes the first worker from the
working list
- releases the lock to wait for
its kthread's completion
- grabs the lock
- if aworker is on the working list,
moves aworker from the working list
to the idle list
- releases the lock
- grabs the lock
- puts the worker
- removes the second worker from the
working list
......
btrfs_stop_workers returns, aworker is on the idle list
FS is umounted, memory is freed
......
aworker is waken up, fireworks ensue
With this applied, I wasn't able to trigger the problem in 48 hours,
whereas previously I could reliably reproduce at least one of these
races within an hour.
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The crash[1] is found by xfstests/generic/208 with "-o compress",
it's not reproduced everytime, but it does panic.
The bug is quite interesting, it's actually introduced by a recent commit
(573aecafca,
Btrfs: actually limit the size of delalloc range).
Btrfs implements delay allocation, so during writeback, we
(1) get a page A and lock it
(2) search the state tree for delalloc bytes and lock all pages within the range
(3) process the delalloc range, including find disk space and create
ordered extent and so on.
(4) submit the page A.
It runs well in normal cases, but if we're in a racy case, eg.
buffered compressed writes and aio-dio writes,
sometimes we may fail to lock all pages in the 'delalloc' range,
in which case, we need to fall back to search the state tree again with
a smaller range limit(max_bytes = PAGE_CACHE_SIZE - offset).
The mentioned commit has a side effect, that is, in the fallback case,
we can find delalloc bytes before the index of the page we already have locked,
so we're in the case of (delalloc_end <= *start) and return with (found > 0).
This ends with not locking delalloc pages but making ->writepage still
process them, and the crash happens.
This fixes it by just thinking that we find nothing and returning to caller
as the caller knows how to deal with it properly.
[1]:
------------[ cut here ]------------
kernel BUG at mm/page-writeback.c:2170!
[...]
CPU: 2 PID: 11755 Comm: btrfs-delalloc- Tainted: G O 3.11.0+ #8
[...]
RIP: 0010:[<ffffffff810f5093>] [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83
[...]
[ 4934.248731] Stack:
[ 4934.248731] ffff8801477e5dc8 ffffea00049b9f00 ffff8801869f9ce8 ffffffffa02b841a
[ 4934.248731] 0000000000000000 0000000000000000 0000000000000fff 0000000000000620
[ 4934.248731] ffff88018db59c78 ffffea0005da8d40 ffffffffa02ff860 00000001810016c0
[ 4934.248731] Call Trace:
[ 4934.248731] [<ffffffffa02b841a>] extent_range_clear_dirty_for_io+0xcf/0xf5 [btrfs]
[ 4934.248731] [<ffffffffa02a8889>] compress_file_range+0x1dc/0x4cb [btrfs]
[ 4934.248731] [<ffffffff8104f7af>] ? detach_if_pending+0x22/0x4b
[ 4934.248731] [<ffffffffa02a8bad>] async_cow_start+0x35/0x53 [btrfs]
[ 4934.248731] [<ffffffffa02c694b>] worker_loop+0x14b/0x48c [btrfs]
[ 4934.248731] [<ffffffffa02c6800>] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
[ 4934.248731] [<ffffffff810608f5>] kthread+0x8d/0x95
[ 4934.248731] [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43
[ 4934.248731] [<ffffffff814fe09c>] ret_from_fork+0x7c/0xb0
[ 4934.248731] [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43
[ 4934.248731] Code: ff 85 c0 0f 94 c0 0f b6 c0 59 5b 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 2c de 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 52 49 8b 84 24 80 00 00 00 f6 40 20 01 75 44
[ 4934.248731] RIP [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83
[ 4934.248731] RSP <ffff8801869f9c48>
[ 4934.280307] ---[ end trace 36f06d3f8750236a ]---
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we crash with a log, remount and recover that log, and then crash before we
can commit another transaction we will get transid verify errors on the next
mount. This is because we were not zero'ing out the log when we committed the
transaction after recovery. This is ok as long as we commit another transaction
at some point in the future, but if you abort or something else goes wrong you
can end up in this weird state because the recovery stuff says that the tree log
should have a generation+1 of the super generation, which won't be the case of
the transaction that was started for recovery. Fix this by removing the check
and _always_ zero out the log portion of the super when we commit a transaction.
This fixes the transid verify issues I was seeing with my force errors tests.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Pull btrfs fixes from Chris Mason:
"These are mostly bug fixes and a two small performance fixes. The
most important of the bunch are Josef's fix for a snapshotting
regression and Mark's update to fix compile problems on arm"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
Btrfs: create the uuid tree on remount rw
btrfs: change extent-same to copy entire argument struct
Btrfs: dir_inode_operations should use btrfs_update_time also
btrfs: Add btrfs: prefix to kernel log output
btrfs: refuse to remount read-write after abort
Btrfs: btrfs_ioctl_default_subvol: Revert back to toplevel subvolume when arg is 0
Btrfs: don't leak transaction in btrfs_sync_file()
Btrfs: add the missing mutex unlock in write_all_supers()
Btrfs: iput inode on allocation failure
Btrfs: remove space_info->reservation_progress
Btrfs: kill delay_iput arg to the wait_ordered functions
Btrfs: fix worst case calculator for space usage
Revert "Btrfs: rework the overcommit logic to be based on the total size"
Btrfs: improve replacing nocow extents
Btrfs: drop dir i_size when adding new names on replay
Btrfs: replay dir_index items before other items
Btrfs: check roots last log commit when checking if an inode has been logged
Btrfs: actually log directory we are fsync()'ing
Btrfs: actually limit the size of delalloc range
Btrfs: allocate the free space by the existed max extent size when ENOSPC
...
Users have been complaining of the uuid tree stuff warning that there is no uuid
root when trying to do snapshot operations. This is because if you mount -o ro
we will not create the uuid tree. But then if you mount -o rw,remount we will
still not create it and then any subsequent snapshot/subvol operations you try
to do will fail gloriously. Fix this by creating the uuid_root on remount rw if
it was not already there. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
btrfs_ioctl_file_extent_same() uses __put_user_unaligned() to copy some data
back to it's argument struct. Unfortunately, not all architectures provide
__put_user_unaligned(), so compiles break on them if btrfs is selected.
Instead, just copy the whole struct in / out at the start and end of
operations, respectively.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Commit 2bc5565286 (Btrfs: don't update atime on
RO subvolumes) ensures that the access time of an inode is not updated when
the inode lives in a read-only subvolume.
However, if a directory on a read-only subvolume is accessed, the atime is
updated. This results in a write operation to a read-only subvolume. I
believe that access times should never be updated on read-only subvolumes.
To reproduce:
# mkfs.btrfs -f /dev/dm-3
(...)
# mount /dev/dm-3 /mnt
# btrfs subvol create /mnt/sub
Create subvolume '/mnt/sub'
# mkdir /mnt/sub/dir
# echo "abc" > /mnt/sub/dir/file
# btrfs subvol snapshot -r /mnt/sub /mnt/rosnap
Create a readonly snapshot of '/mnt/sub' in '/mnt/rosnap'
# stat /mnt/rosnap/dir
File: `/mnt/rosnap/dir'
Size: 8 Blocks: 0 IO Block: 4096 directory
Device: 16h/22d Inode: 257 Links: 1
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2013-09-11 07:21:49.389157126 -0400
Modify: 2013-09-11 07:22:02.330156079 -0400
Change: 2013-09-11 07:22:02.330156079 -0400
# ls /mnt/rosnap/dir
file
# stat /mnt/rosnap/dir
File: `/mnt/rosnap/dir'
Size: 8 Blocks: 0 IO Block: 4096 directory
Device: 16h/22d Inode: 257 Links: 1
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2013-09-11 07:22:56.797151670 -0400
Modify: 2013-09-11 07:22:02.330156079 -0400
Change: 2013-09-11 07:22:02.330156079 -0400
Reported-by: Koen De Wit <koen.de.wit@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The kernel log entries for device label %s and device fsid %pU
are missing the btrfs: prefix. Add those here.
Signed-off-by: Frank Holton <fholton@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
It's still possible to flip the filesystem into RW mode after it's
remounted RO due to an abort. There are lots of places that check for
the superblock error bit and will not write data, but we should not let
the filesystem appear read-write.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch makes it possible to set BTRFS_FS_TREE_OBJECTID as the default
subvolume by passing a subvolume id of 0.
Signed-off-by: chandan <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In btrfs_sync_file(), if the call to btrfs_log_dentry_safe() returns
a negative error (for e.g. -ENOMEM via btrfs_log_inode()), we would
return without ending/freeing the transaction.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The BUG() was replaced by btrfs_error() and return -EIO with the
patch "get rid of one BUG() in write_all_supers()", but the missing
mutex_unlock() was overlooked.
The 0-DAY kernel build service from Intel reported the missing
unlock which was found by the coccinelle tool:
fs/btrfs/disk-io.c:3422:2-8: preceding lock on line 3374
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We don't do the iput when we fail to allocate our delayed delalloc work in
__start_delalloc_inodes, fix this.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This isn't used for anything anymore, just remove it.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is a left over of how we used to wait for ordered extents, which was to
grab the inode and then run filemap flush on it. However if we have an ordered
extent then we already are holding a ref on the inode, and we just use
btrfs_start_ordered_extent anyway, so there is no reason to have an extra ref on
the inode to start work on the ordered extent. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Forever ago I made the worst case calculator say that we could potentially split
into 3 blocks for every level on the way down, which isn't right. If we split
we're only going to get two new blocks, the one we originally cow'ed and the new
one we're going to split. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This reverts commit 70afa3998c. It is causing
performance issues and wasn't actually correct. There were problems with the
way we flushed delalloc and that was the real cause of the early enospc.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Various people have hit a deadlock when running btrfs/011. This is because when
replacing nocow extents we will take the i_mutex to make sure nobody messes with
the file while we are replacing the extent. The problem is we are already
holding a transaction open, which is a locking inversion, so instead we need to
save these inodes we find and then process them outside of the transaction.
Further we can't just lock the inode and assume we are good to go. We need to
lock the extent range and then read back the extent cache for the inode to make
sure the extent really still points at the physical block we want. If it
doesn't we don't have to copy it. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
So if we have dir_index items in the log that means we also have the inode item
as well, which means that the inode's i_size is correct. However when we
process dir_index'es we call btrfs_add_link() which will increase the
directory's i_size for the new entry. To fix this we need to just set the dir
items i_size to 0, and then as we find dir_index items we adjust the i_size.
btrfs_add_link() will do it for new entries, and if the entry already exists we
can just add the name_len to the i_size ourselves. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
A user reported a bug where his log would not replay because he was getting
-EEXIST back. This was because he had a file moved into a directory that was
logged. What happens is the file had a lower inode number, and so it is
processed first when replaying the log, and so we add the inode ref in for the
directory it was moved to. But then we process the directories DIR_INDEX item
and try to add the inode ref for that inode and it fails because we already
added it when we replayed the inode. To solve this problem we need to just
process any DIR_INDEX items we have in the log first so this all is taken care
of, and then we can replay the rest of the items. With this patch my reproducer
can remount the file system properly instead of erroring out. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Liu introduced a local copy of the last log commit for an inode to make sure we
actually log an inode even if a log commit has already taken place. In order to
make sure we didn't relog the same inode multiple times he set this local copy
to the current trans when we log the inode, because usually we log the inode and
then sync the log. The exception to this is during rename, we will relog an
inode if the name changed and it is already in the log. The problem with this
is then we go to sync the inode, and our check to see if the inode has already
been logged is tripped and we don't sync the log. To fix this we need to _also_
check against the roots last log commit, because it could be less than what is
in our local copy of the log commit. This fixes a bug where we rename a file
into a directory and then fsync the directory and then on remount the directory
is no longer there. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If you just create a directory and then fsync that directory and then pull the
power plug you will come back up and the directory will not be there. That is
because we won't actually create directories if we've logged files inside of
them since they will be created on replay, but in this check we will set our
logged_trans of our current directory if it happens to be a directory, making us
think it doesn't need to be logged. Fix the logic to only do this to parent
directories. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
So forever we have had this thing to limit the amount of delalloc pages we'll
setup to be written out to 128mb. This is because we have to lock all the pages
in this range, so anything above this gets a bit unweildly, and also without a
limit we'll happily allocate gigantic chunks of disk space. Turns out our check
for this wasn't quite right, we wouldn't actually limit the chunk we wanted to
write out, we'd just stop looking for more space after we went over the limit.
So if you do a giant 20gb dd on my box with lots of ram I could get 2gig
extents. This is fine normally, except when you go to relocate these extents
and we can't find enough space to relocate these moster extents, since we have
to be able to allocate exactly the same sized extent to move it around. So fix
this by actually enforcing the limit. With this patch I'm no longer seeing
giant 1.5gb extents. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
By the current code, if the requested size is very large, and all the extents
in the free space cache are small, we will waste lots of the cpu time to cut
the requested size in half and search the cache again and again until it gets
down to the size the allocator can return. In fact, we can know the max extent
size in the cache after the first search, so we needn't cut the size in half
repeatedly, and just use the max extent size directly. This way can save
lots of cpu time and make the performance grow up when there are only fragments
in the free space cache.
According to my test, if there are only 4KB free space extents in the fs,
and the total size of those extents are 256MB, we can reduce the execute
time of the following test from 5.4s to 1.4s.
dd if=/dev/zero of=<testfile> bs=1MB count=1 oflag=sync
Changelog v2 -> v3:
- fix the problem that we skip the block group with the space which is
less than we need.
Changelog v1 -> v2:
- address the problem that we return a wrong start position when searching
the free space in a bitmap.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We want to know if there are debugging features compiled in, this may
affect performance. The message is printed before the sanity checks.
(This commit message is a copy of David Sterba's commit message when
he introduced btrfs_print_info()).
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Instead of removing the current inode from the red black tree
and then add the new one, just use the red black tree replace
operation, which is more efficient.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If replace was suspended by the umount, replace target device is added
to the fs_devices->alloc_list during a later mount. This is obviously
wrong. ->is_tgtdev_for_dev_replace is supposed to guard against that,
but ->is_tgtdev_for_dev_replace is (and can only ever be) initialized
*after* everything is opened and fs_devices lists are populated. Fix
this by checking the devid instead: for replace targets it's always
equal to BTRFS_DEV_REPLACE_DEVID.
Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we failed to actually allocate the correct size of the extent to relocate we
will end up in an infinite loop because we won't return an error, we'll just
move on to the next extent. So fix this up by returning an error, and then fix
all the callers to return an error up the stack rather than BUG_ON()'ing.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Merge more patches from Andrew Morton:
"The rest of MM. Plus one misc cleanup"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
mm/Kconfig: add MMU dependency for MIGRATION.
kernel: replace strict_strto*() with kstrto*()
mm, thp: count thp_fault_fallback anytime thp fault fails
thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
thp: do_huge_pmd_anonymous_page() cleanup
thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
mm: cleanup add_to_page_cache_locked()
thp: account anon transparent huge pages into NR_ANON_PAGES
truncate: drop 'oldsize' truncate_pagecache() parameter
mm: make lru_add_drain_all() selective
memcg: document cgroup dirty/writeback memory statistics
memcg: add per cgroup writeback pages accounting
memcg: check for proper lock held in mem_cgroup_update_page_stat
memcg: remove MEMCG_NR_FILE_MAPPED
memcg: reduce function dereference
memcg: avoid overflow caused by PAGE_ALIGN
memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
memcg: correct RESOURCE_MAX to ULLONG_MAX
mm: memcg: do not trap chargers with full callstack on OOM
mm: memcg: rework and document OOM waiting and wakeup
...
truncate_pagecache() doesn't care about old size since commit
cedabed49b ("vfs: Fix vmtruncate() regression"). Let's drop it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull btrfs updates from Chris Mason:
"This is against 3.11-rc7, but was pulled and tested against your tree
as of yesterday. We do have two small incrementals queued up, but I
wanted to get this bunch out the door before I hop on an airplane.
This is a fairly large batch of fixes, performance improvements, and
cleanups from the usual Btrfs suspects.
We've included Stefan Behren's work to index subvolume UUIDs, which is
targeted at speeding up send/receive with many subvolumes or snapshots
in place. It closes a long standing performance issue that was built
in to the disk format.
Mark Fasheh's offline dedup work is also here. In this case offline
means the FS is mounted and active, but the dedup work is not done
inline during file IO. This is a building block where utilities are
able to ask the FS to dedup a series of extents. The kernel takes
care of verifying the data involved really is the same. Today this
involves reading both extents, but we'll continue to evolve the
patches"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits)
Btrfs: optimize key searches in btrfs_search_slot
Btrfs: don't use an async starter for most of our workers
Btrfs: only update disk_i_size as we remove extents
Btrfs: fix deadlock in uuid scan kthread
Btrfs: stop refusing the relocation of chunk 0
Btrfs: fix memory leak of uuid_root in free_fs_info
btrfs: reuse kbasename helper
btrfs: return btrfs error code for dev excl ops err
Btrfs: allow partial ordered extent completion
Btrfs: convert all bug_ons in free-space-cache.c
Btrfs: add support for asserts
Btrfs: adjust the fs_devices->missing count on unmount
Btrf: cleanup: don't check for root_refs == 0 twice
Btrfs: fix for patch "cleanup: don't check the same thing twice"
Btrfs: get rid of one BUG() in write_all_supers()
Btrfs: allocate prelim_ref with a slab allocater
Btrfs: pass gfp_t to __add_prelim_ref() to avoid always using GFP_ATOMIC
Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl
Btrfs: fix race between removing a dev and writing sbs
Btrfs: remove ourselves from the cluster list under lock
...
Pull trivial tree from Jiri Kosina:
"The usual trivial updates all over the tree -- mostly typo fixes and
documentation updates"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
doc: Documentation/cputopology.txt fix typo
treewide: Convert retrun typos to return
Fix comment typo for init_cma_reserved_pageblock
Documentation/trace: Correcting and extending tracepoint documentation
mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
power: Documentation: Update s2ram link
doc: fix a typo in Documentation/00-INDEX
Documentation/printk-formats.txt: No casts needed for u64/s64
doc: Fix typo "is is" in Documentations
treewide: Fix printks with 0x%#
zram: doc fixes
Documentation/kmemcheck: update kmemcheck documentation
doc: documentation/hwspinlock.txt fix typo
PM / Hibernate: add section for resume options
doc: filesystems : Fix typo in Documentations/filesystems
scsi/megaraid fixed several typos in comments
ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
page_isolation: Fix a comment typo in test_pages_isolated()
doc: fix a typo about irq affinity
...
Pull vfs pile 1 from Al Viro:
"Unfortunately, this merge window it'll have a be a lot of small piles -
my fault, actually, for not keeping #for-next in anything that would
resemble a sane shape ;-/
This pile: assorted fixes (the first 3 are -stable fodder, IMO) and
cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last
components) + several long-standing patches from various folks.
There definitely will be a lot more (starting with Miklos'
check_submount_and_drop() series)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
direct-io: Handle O_(D)SYNC AIO
direct-io: Implement generic deferred AIO completions
add formats for dentry/file pathnames
kvm eventfd: switch to fdget
powerpc kvm: use fdget
switch fchmod() to fdget
switch epoll_ctl() to fdget
switch copy_module_from_fd() to fdget
git simplify nilfs check for busy subtree
ibmasmfs: don't bother passing superblock when not needed
don't pass superblock to hypfs_{mkdir,create*}
don't pass superblock to hypfs_diag_create_files
don't pass superblock to hypfs_vm_create_files()
oprofile: get rid of pointless forward declarations of struct super_block
oprofilefs_create_...() do not need superblock argument
oprofilefs_mkdir() doesn't need superblock argument
don't bother with passing superblock to oprofile_create_stats_files()
oprofile: don't bother with passing superblock to ->create_files()
don't bother passing sb to oprofile_create_files()
coh901318: don't open-code simple_read_from_buffer()
...
up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.
This has been sitting in linux-next for a whole cycle.
Thanks,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJSJo+1AAoJENkgDmzRrbjxIC4QALJK95o8AUXuwUkl+2fmFkUt
hh2/PJ1vDYgk4Xt0J6hyoK7XMa0H1RkbBrROuDdsBnorMFpEsGcgdkUZte9ufoAS
97Bg+7N0KPbTB/S8vOwtW1vbERTJIVPN2uf6h1Wqm9Xc2puCh3HbMMr1AWMGu0WQ
NqY5+Zz8zecy1UOrMhEP6H1CjeQcL1w1DO6YM5ydeqlKNzAz+JMfDXriLPDwiE7+
XFPDF/O3Vtd2ckA7L70Lio7hfHwxV5U4WwFVfiwls98XB4jcZqDKIoh1r8z4SRgR
+0Rae2DN3BaOabGMr//5XdrzQVpwJTh5m2w8BAOHJvCJ9HR7Sq29UIN4u+TowZBy
L2xYo4dvFxkympwu5zEd3c7vHYWKIaqmSq5PIjr4gF/uIo2OeOTrpPIK782ZEYb7
e+qUgOEM05V9AmQZCrSZeP9u474Sj8ow3sCtWxfdRtwNfoEIcUXsNNJd/zDHlVtW
cEtXqc2xXIpcuUJQWlSaGp8fmRQjVZPzrLKYLM2m39ZcOOJbf5rzQAYS7hHPosIa
SK+YVux/+Zzi+Xo/vXq1OlM/SruCr5S7JOgCxLowoQ88vupgXME6uPyC8EO+QQ50
GsrHes5ZNLbk0uVsfcexIyojkUnyvDmmnDpv+1zdC6RgZLJQn8OXp5yNhHhnhrFT
BiHX6YFWtDDqRlVv8Q0F
=LeaW
-----END PGP SIGNATURE-----
Merge tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull PTR_RET() removal patches from Rusty Russell:
"PTR_RET() is a weird name, and led to some confusing usage. We ended
up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages.
This has been sitting in linux-next for a whole cycle"
[ There are still some PTR_RET users scattered about, with some of them
possibly being new, but most of them existing in Rusty's tree too. We
have that
#define PTR_RET(p) PTR_ERR_OR_ZERO(p)
thing in <linux/err.h>, so they continue to work for now - Linus ]
* tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
GFS2: Replace PTR_RET with PTR_ERR_OR_ZERO
Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO
drm/cma: Replace PTR_RET with PTR_ERR_OR_ZERO
sh_veu: Replace PTR_RET with PTR_ERR_OR_ZERO
dma-buf: Replace PTR_RET with PTR_ERR_OR_ZERO
drivers/rtc: Replace PTR_RET with PTR_ERR_OR_ZERO
mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR().
staging/zcache: don't use PTR_RET().
remoteproc: don't use PTR_RET().
pinctrl: don't use PTR_RET().
acpi: Replace weird use of PTR_RET.
s390: Replace weird use of PTR_RET.
PTR_RET is now PTR_ERR_OR_ZERO(): Replace most.
PTR_RET is now PTR_ERR_OR_ZERO
Call generic_write_sync() from the deferred I/O completion handler if
O_DSYNC is set for a write request. Also make sure various callers
don't call generic_write_sync if the direct I/O code returns
-EIOCBQUEUED.
Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from
Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When the binary search returns 0 (exact match), the target key
will necessarily be at slot 0 of all nodes below the current one,
so in this case the binary search is not needed because it will
always return 0, and we waste time doing it, holding node locks
for longer than necessary, etc.
Below follow histograms with the times spent on the current approach of
doing a binary search when the previous binary search returned 0, and
times for the new approach, which directly picks the first item/child
node in the leaf/node.
Current approach:
Count: 6682
Range: 35.000 - 8370.000; Mean: 85.837; Median: 75.000; Stddev: 106.429
Percentiles: 90th: 124.000; 95th: 145.000; 99th: 206.000
35.000 - 61.080: 1235 ################
61.080 - 106.053: 4207 #####################################################
106.053 - 183.606: 1122 ##############
183.606 - 317.341: 111 #
317.341 - 547.959: 6 |
547.959 - 8370.000: 1 |
Approach proposed by this patch:
Count: 6682
Range: 6.000 - 135.000; Mean: 16.690; Median: 16.000; Stddev: 7.160
Percentiles: 90th: 23.000; 95th: 27.000; 99th: 40.000
6.000 - 8.418: 58 #
8.418 - 11.670: 1149 #########################
11.670 - 16.046: 2418 #####################################################
16.046 - 21.934: 2098 ##############################################
21.934 - 29.854: 744 ################
29.854 - 40.511: 154 ###
40.511 - 54.848: 41 #
54.848 - 74.136: 5 |
74.136 - 100.087: 9 |
100.087 - 135.000: 6 |
These samples were captured during a run of the btrfs tests 001, 002 and
004 in the xfstests, with a leaf/node size of 4Kb.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We only need an async starter if we can't make a GFP_NOFS allocation in our
current path. This is the case for the endio stuff since it happens in IRQ
context, but things like the caching thread workers and the delalloc flushers we
can easily make this allocation and start threads right away. Also change the
worker count for the caching thread pool. Traditionally we limited this to 2
since we took read locks while caching, but nowadays we do this lockless so
there's no reason to limit the number of caching threads. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This fixes a problem where if we fail a truncate we will leave the i_size set
where we wanted to truncate to instead of where we were able to truncate to.
Fix this by making btrfs_truncate_inode_items do the disk_i_size update as it
removes extents, that way it will always be consistent with where its extents
are. Then if the truncate fails at all we can update the in-ram i_size with
what we have on disk and delete the orphan item. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
AFAICT chunk 0 is no longer special, and so it should be restriped just
like every other chunk. One reason for this change is us refusing the
relocation can lead to filesystems that can only be mounted ro, and
never rw -- see the bugzilla [1] for details. The other reason is that
device removal code is already doing this: it will happily relocate
chunk 0 is part of shrinking the device.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=60594
Reported-by: Xavier Bassery <xavier@bartica.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
To get name of the file from a pathname let's use kbasename() helper. It allows
to simplify code a bit.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
now threads can return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS
as defined in btrfs.h for the dev excl operation error in
the FS, which means with this kernel would stop logging
(almost an user error) into the /var/log/messages
v2: accepts Josef' comment
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We currently have this problem where you can truncate pages that have not yet
been written for an ordered extent. We do this because the truncate will be
coming behind to clean us up anyway so what's the harm right? Well if truncate
fails for whatever reason we leave an orphan item around for the file to be
cleaned up later. But if the user goes and truncates up the file and tries to
read from the area that had been discarded previously they will get a csum error
because we never actually wrote that data out.
This patch fixes this by allowing us to either discard the ordered extent
completely, by which I mean we just free up the space we had allocated and not
add the file extent, or adjust the length of the file extent we write. We do
this by setting the length we truncated down to in the ordered extent, and then
we set the file extent length and ram bytes to this length. The total disk
space stays unchanged since we may be compressed and we can't just chop off the
disk space, but at least this way the file extent only points to the valid data.
Then when the file extent is free'd the extent and csums will be freed normally.
This patch is needed for the next series which will give us more graceful
recovery of failed truncates. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
All of these are logic checks to make sure we're not breaking anything, so
convert them over to ASSERT(). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
One of the complaints we get a lot is how many BUG_ON()'s we have. So to help
with this I'm introducing a kconfig option to enable/disable a new ASSERT()
mechanism much like what XFS does. This will allow us developers to still get
our nice panics but allow users/distros to compile them out. With this we can
go through and convert any BUG_ON()'s that we have to catch actual programming
mistakes to the new ASSERT() and then fix everybody else to return errors. This
will also allow developers to leave sanity checks in their new code to make sure
we don't trip over problems while testing stuff and vetting new features.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I noticed that if I tried to mount a file system with -o degraded after having
done it once already we would fail to mount. This is because the
fs_devices->missing count was getting bumped everytime we mounted, but not
getting reset whenever we unmounted. To fix this we just drop the missing count
as we're closing devices to make sure this doesn't happen. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
btrfs_read_fs_root_no_name() already checks if btrfs_root_refs()
is zero and returns ENOENT in this case. There is no need to do
it again in three more places.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Mitch Harder noticed that the patch 3c64a1a mentioned in the subject
line was causing a kernel BUG() on snapshot deletion.
The patch was wrong. It did not handle cached roots correctly. The
check for root_refs == 0 was removed everywhere where
btrfs_read_fs_root_no_name() had been used to retrieve the root,
because this check was already dealt with in
btrfs_read_fs_root_no_name(). But in the case when the root was
found in the cache, there was no such check.
This patch adds the missing check in the case where the root is
found in the cache.
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The second round uses btrfs_error() and return -EIO, the first round
can handle write errors the same way.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
struct __prelim_ref is allocated and freed frequently when
walking backref tree, using slab allocater can not only
speed up allocating but also detect memory leaks.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently, only add_delayed_refs have to allocate with GFP_ATOMIC,
So just pass arg 'gfp_t' to decide which allocation mode.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The handler for the ioctl BTRFS_IOC_FS_INFO was reading the
number of devices before acquiring the device list mutex.
This could lead to inconsistent results because the update of
the device list and the number of devices counter (amongst other
counters related to the device list) are updated in volumes.c
while holding the device list mutex - except for 2 places, one
was volumes.c:btrfs_prepare_sprout() and the other was
volumes.c:device_list_add().
For example, if we have 2 devices, with IDs 1 and 2 and then add
a new device, with ID 3, and while adding the device is in progress
an BTRFS_IOC_FS_INFO ioctl arrives, it could return a number of
devices of 2 and a max dev id of 3. This would be incorrect.
Also, this ioctl handler was reading the fsid while it can be
updated concurrently. This can happen when while a new device is
being added and the current filesystem is in seeding mode.
Example:
$ mkfs.btrfs -f /dev/sdb1
$ mkfs.btrfs -f /dev/sdb2
$ btrfstune -S 1 /dev/sdb1
$ mount /dev/sdb1 /mnt/test
$ btrfs device add /dev/sdb2 /mnt/test
If during the last step a BTRFS_IOC_FS_INFO ioctl was requested, it
could read an fsid that was never valid (some bits part of the old
fsid and others part of the new fsid). Also, it could read a number
of devices that doesn't match the number of devices in the list and
the max device id, as explained before.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This change fixes an issue when removing a device and writing
all super blocks run simultaneously. Here's the steps necessary
for the issue to happen:
1) disk-io.c:write_all_supers() gets a number of N devices from the
super_copy, so it will not panic if it fails to write super blocks
for N - 1 devices;
2) Then it tries to acquire the device_list_mutex, but blocks because
volumes.c:btrfs_rm_device() got it first;
3) btrfs_rm_device() removes the device from the list, then unlocks the
mutex and after the unlock it updates the number of devices in
super_copy to N - 1.
4) write_all_supers() finally acquires the mutex, iterates over all the
devices in the list and gets N - 1 errors, that is, it failed to write
super blocks to all the devices;
5) Because write_all_supers() thinks there are a total of N devices, it
considers N - 1 errors to be ok, and therefore won't panic.
So this change just makes sure that write_all_supers() reads the number
of devices from super_copy after it acquires the device_list_mutex.
Conversely, it changes btrfs_rm_device() to update the number of devices
in super_copy before it releases the device list mutex.
The code path to add a new device (volumes.c:btrfs_init_new_device),
already has the right behaviour: it updates the number of devices in
super_copy while holding the device_list_mutex.
The only code path that doesn't lock the device list mutex
before updating the number of devices in the super copy is
disk-io.c:next_root_backup(), called by open_ctree() during
mount time where concurrency issues can't happen.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
A user was reporting weird warnings from btrfs_put_delayed_ref() and I noticed
that we were doing this list_del_init() on our head ref outside of
delayed_refs->lock. This is a problem if we have people still on the list, we
could end up modifying old pointers and such. Fix this by removing us from the
list before we do our run_delayed_ref on our head ref. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We were unconditionally clearing our runtime flag on the inode on error when
trying to insert an orphan item. This is wrong in the case of -EEXIST since we
obviously have an orphan item. This was causing us to not do the correct
cleanup of our orphan items which caused issues on cleanup. This happens
because currently when truncate fails we just leave the orphan item on there so
it can be cleaned up, so if we go to remove the file later we will hit this
issue. What we do for truncate isn't right either, but we shouldn't screw this
sort of thing up on error either, so fix this and then I'll fix truncate in a
different patch. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Send was just sending everything it found, even if the extent was a hole. This
is unpleasant for users, so just skip holes when we are sending. This will also
skip sending prealloc extents since the send spec doesn't have a prealloc
command. Eventually we will add a prealloc command and rev the send version so
we can send down the prealloc info. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The name buffer is not terminated by a '\0' character,
therefore it needs to be printed with %.*s and use the
length of the buffer.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
sector_t may be either "u64" (always 64 bit) or "unsigned long" (32 or 64
bit). Casting it to "unsigned long" will truncate it on 32-bit platforms
where CONFIG_LBDAF=y.
Cast to "unsigned long long" and format using "ll" instead.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
PAGE_CACHE_SIZE == PAGE_SIZE is "unsigned long" everywhere, so there's no
need to cast it to "unsigned long".
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Internally, btrfs_header_chunk_tree_uuid() calculates an unsigned long, but
casts it to a pointer, while all callers cast it to unsigned long again.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Internally, btrfs_header_fsid() calculates an unsigned long, but casts
it to a pointer, while all callers cast it to unsigned long again.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Internally, btrfs_dev_extent_chunk_tree_uuid() calculates an unsigned long,
but casts it to a pointer, while all callers cast it to unsigned long
again.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
All callers of btrfs_device_fsid() cast its return type to unsigned long.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
All callers of btrfs_device_uuid() cast its return type to unsigned long.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
mirror_num is always "int", hence don't cast it to "unsigned long long" and
format it as a 64-bit number.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
PAGE_SIZE is "unsigned long" everywhere, so there's no need to cast it to
"unsigned long long" and format it as a 64-bit number.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The internal btrfs device id is a u64, hence make the constant
BTRFS_DEV_REPLACE_DEVID "unsigned long long" as well, so we no longer need
a cast to print it.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
u64 is "unsigned long long" on all architectures now, so there's no need to
cast it when formatting it using the "ll" length modifier.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
It turns out we don't properly rollback in-core btrfs_device state on
umount. We zero out ->bdev, ->in_fs_metadata and that's about it. In
particular, we don't zero out ->generation, and this can lead to us
refusing a mount -- a non-NULL fs_devices->latest_bdev is essential, but
btrfs_close_extra_devices will happily assign NULL to ->latest_bdev if
the first device on the dev_list happens to be missing and consequently
has no bdev attached. This happens because since commit a6b0d5c8
btrfs_close_extra_devices adjusts ->latest_bdev, and in doing that,
relies on the ->generation. Fix this, and possibly other problems, by
zeroing out everything except for what device_list_add sets, so that a
mount right after insmod and 'btrfs dev scan' is no different from any
later mount in this respect.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In the spirit of btrfs_alloc_device, add a helper for allocating and
doing some common initialization of btrfs_fs_devices struct.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently btrfs_device is allocated ad-hoc in a few different places,
and as a result not all fields are initialized properly. In particular,
readahead state is only initialized in device_list_add (at scan time),
and not in btrfs_init_new_device (when the new device is added with
'btrfs dev add'). Fix this by adding an allocation helper and switch
everybody but __btrfs_close_devices to it. (__btrfs_close_devices is
dealt with in a later commit.)
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
find_next_devid() knows which root to search, so it should take an
fs_info instead of an arbitrary root.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If you start the replace procedure on a read only filesystem, at
the end the procedure fails to write the updated dev_items to the
chunk tree. The problem is that this error is not indicated except
for a WARN_ON(). If the user now thinks that everything was done
as expected and destroys the source device (with mkfs or with a
hammer). The next mount fails with "failed to read chunk root" and
the filesystem is gone.
This commit adds code to fail the attempt to start the replace
procedure if the filesystem is mounted read-only.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Cc: <stable@vger.kernel.org> # 3.10+
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
After we set force_compress with a new value (which was not being done
while holding the inode mutex), if an error happens and we jump to
the label out_ra, the force_compress property of the inode is not set
to BTRFS_COMPRESS_NONE (unlike in the case where no errors happen).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This should never be needed, but since all functions are there
to check and rebuild the UUID tree, a mount option is added that
allows to force this check and rebuild procedure.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If the filesystem was mounted with an old kernel that was not
aware of the UUID tree, this is detected by looking at the
uuid_tree_generation field of the superblock (similar to how
the free space cache is doing it). If a mismatch is detected
at mount time, a thread is started that does two things:
1. Iterate through the UUID tree, check each entry, delete those
entries that are not valid anymore (i.e., the subvol does not
exist anymore or the value changed).
2. Iterate through the root tree, for each found subvolume, add
the UUID tree entries for the subvolume (if they are not
already there).
This mechanism is also used to handle and repair errors that
happened during the initial creation and filling of the tree.
The update of the uuid_tree_generation field (which indicates
that the state of the UUID tree is up to date) is blocked until
all create and repair operations are successfully completed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In order to be able to detect the case that a filesystem is mounted
with an old kernel, add a uuid-tree-gen field like the free space
cache is doing it. It is part of the super block and written with
each commit. Old kernels do not know this field and don't update it.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When the UUID tree is initially created, a task is spawned that
walks through the root tree. For each found subvolume root_item,
the uuid and received_uuid entries in the UUID tree are added.
This is such a quick operation so that in case somebody wants
to unmount the filesystem while the task is still running, the
unmount is delayed until the UUID tree building task is finished.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When a new subvolume or snapshot is created, a new UUID item is added
to the UUID tree. Such items are removed when the subvolume is deleted.
The ioctl to set the received subvolume UUID is also touched and will
now also add this received UUID into the UUID tree together with the
ID of the subvolume. The latter is also done when read-only snapshots
are created which inherit all the send/receive information from the
parent subvolume.
User mode programs use the BTRFS_IOC_TREE_SEARCH ioctl to search and
read in the UUID tree.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This tree is not created by mkfs.btrfs. Therefore when a filesystem
is mounted writable and the UUID tree does not exist, this tree is
created if required. The tree is also added to the fs_info structure
and initialized, but this commit does not yet read or write UUID tree
elements.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This commit adds support to print UUID tree elements to print-tree.c.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Mapping UUIDs to subvolume IDs is an operation with a high effort
today. Today, the algorithm even has quadratic effort (based on the
number of existing subvolumes), which means, that it takes minutes
to send/receive a single subvolume if 10,000 subvolumes exist. But
even linear effort would be too much since it is a waste. And these
data structures to allow mapping UUIDs to subvolume IDs are created
every time a btrfs send/receive instance is started.
It is much more efficient to maintain a searchable persistent data
structure in the filesystem, one that is updated whenever a
subvolume/snapshot is created and deleted, and when the received
subvolume UUID is set by the btrfs-receive tool.
Therefore kernel code is added with this commit that is able to
maintain data structures in the filesystem that allow to quickly
search for a given UUID and to retrieve data that is assigned to
this UUID, like which subvolume ID is related to this UUID.
This commit adds a new tree to hold UUID-to-data mapping items. The
key of the items is the full UUID plus the key type BTRFS_UUID_KEY.
Multiple data blocks can be stored for a given UUID, a type/length/
value scheme is used.
Now follows the lengthy justification, why a new tree was added
instead of using the existing root tree:
The first approach was to not create another tree that holds UUID
items. Instead, the items should just go into the top root tree.
Unfortunately this confused the algorithm to assign the objectid
of subvolumes and snapshots. The reason is that
btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for
the first created subvol or snapshot after mounting a filesystem,
and this function simply searches for the largest used objectid in
the root tree keys to pick the next objectid to assign. Of course,
the UUID keys have always been the ones with the highest offset
value, and the next assigned subvol ID was wastefully huge.
To use any other existing tree did not look proper. To apply a
workaround such as setting the objectid to zero in the UUID item
key and to implement collision handling would either add
limitations (in case of a btrfs_extend_item() approach to handle
the collisions) or a lot of complexity and source code (in case a
key would be looked up that is free of collisions). Adding new code
that introduces limitations is not good, and adding code that is
complex and lengthy for no good reason is also not good. That's the
justification why a completely new tree was introduced.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__
I tried to filter out the warnings for which patches have already
been sent to the mailing list, pending for inclusion in btrfs-next.
All these changes should be obviously safe.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If the inode ref key was not found and the current leaf slot
was 0 (first item in the leaf) the code would always return
-ENOENT. This was not correct because the desired inode ref
item might be the last item in the previous leaf.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If the path doesn't fit in the input buffer, return ENAMETOOLONG
instead of returning with a success code (0) and a partially
filled and right justified buffer.
Also removed useless buffer pointer check outside the while loop.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We have checked 'quota_root' with qgroup_ioctl_lock held before,So
here the check is reduplicate, remove it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
btrfs_free_qgroup_config() is not only called by open/close_ctree(),but
also btrfs_disable_quota().And for btrfs_disable_quota(),we have set
'quota_root' to be null before calling btrfs_free_qgroup_config(),so it
is safe to cleanup in-memory structures without lock held.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When disabling quota, we should clear out list 'dirty_qgroups',otherwise,
we will get oops if enabling quota again. Fix this by abstracting similar
code from del_qgroup_rb().
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If you are sending a snapshot and specifying a parent snapshot we will walk the
trees and figure out where they differ and send the differences only. The way
we check for differences are if the leaves aren't the same and if the keys are
not the same within the leaves. So if neither leaf is the same (ie the leaf has
been cow'ed from the parent snapshot) we walk each item in the send root and
check it against the parent root. If the items match exactly then we don't do
anything. This doesn't quite work for inode refs, since they will just have the
name and the parent objectid. If you move the file from a directory and then
remove that directory and re-create a directory with the same inode number as
the old directory and then move that file back into that directory we will
assume that nothing changed and you will get errors when you try to receive.
In order to fix this we need to do extra checking to see if the inode ref really
is the same or not. So do this by passing down BTRFS_COMPARE_TREE_SAME if the
items match. Then if the key type is an inode ref we can do some extra
checking, otherwise we just keep processing. The extra checking is to look up
the generation of the directory in the parent volume and compare it to the
generation of the send volume. If they match then they are the same directory
and we are good to go. If they don't we have to add them to the changed refs
list.
This means we have to track the generation of the ref we're trying to lookup
when we iterate all the refs for a particular inode. So in the case of looking
for new refs we have to get the generation from the parent volume, and in the
case of looking for deleted refs we have to get the generation from the send
volume to compare with.
There was also the issue of using a ulist to keep track of the directories we
needed to check. Because we can get a deleted ref and a new ref for the same
inode number the ulist won't work since it indexes based on the value. So
instead just dup any directory ref we find and add it to a local list, and then
process that list as normal and do away with using a ulist for this altogether.
Before we would fail all of the tests in the far-progs that related to moving
directories (test group 32). With this patch we now pass these tests, and all
of the tests in the far-progs send testing suite. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The plan is to have a bunch of unit tests that run when btrfs is loaded when you
build with the appropriate config option. My ultimate goal is to have a test
for every non-static function we have, but at first I'm going to focus on the
things that cause us the most problems. To start out with this just adds a
tests/ directory and moves the existing free space cache tests into that
directory and sets up all of the infrastructure. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I noticed while looking at a deadlock that we are always starting a transaction
in cow_file_range(). This isn't really needed since we only need a transaction
if we are doing an inline extent, or if the allocator needs to allocate a chunk.
So push down all the transaction start stuff to be closer to where we actually
need a transaction in all of these cases. This will hopefully reduce our write
latency when we are committing often. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I added a patch where we started taking the ordered operations mutex when we
waited on ordered extents. We need this because we splice the list and process
it, so if a flusher came in during this scenario it would think the list was
empty and we'd usually get an early ENOSPC. The problem with this is that this
lock is used in transaction committing. So we end up with something like this
Transaction commit
-> wait on writers
Delalloc flusher
-> run_ordered_operations (holds mutex)
->wait for filemap-flush to do its thing
flush task
-> cow_file_range
->wait on btrfs_join_transaction because we're commiting
some other task
-> commit_transaction because we notice trans->transaction->flush is set
-> run_ordered_operations (hang on mutex)
We need to disentangle the ordered operations flushing from the delalloc
flushing, since they are separate things. This solves the deadlock issue I was
seeing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
There are several places where we BUG_ON() if we fail to remove the orphan items
and such, which is not ok, so remove those and either abort or just carry on.
This also fixes a problem where if we couldn't start a transaction we wouldn't
actually remove the orphan item reserve for the inode. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Eric pointed out that btrfs will happily allow you to delete the default subvol.
This is a problem obviously since the next time you go to mount the file system
it will freak out because it can't find the root. Fix this by adding a check to
see if our default subvol points to the subvol we are trying to delete, and if
it does not allowing it to happen. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We have logic to see if we've already created a parent directory by check to see
if an inode inside of that directory has a lower inode number than the one we
are currently processing. The logic is that if there is a lower inode number
then we would have had to made sure the directory was created at that previous
point. The problem is that subvols inode numbers count from the lowest objectid
in the root tree, which may be less than our current progress. So just skip if
our dir item key is a root item. This fixes the original test and the xfstest
version I made that added an extra subvol create. Thanks,
Reported-by: Emil Karlson <jekarlson@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch adds an ioctl, BTRFS_IOC_FILE_EXTENT_SAME which will try to
de-duplicate a list of extents across a range of files.
Internally, the ioctl re-uses code from the clone ioctl. This avoids
rewriting a large chunk of extent handling code.
Userspace passes in an array of file, offset pairs along with a length
argument. The ioctl will then (for each dedupe) do a byte-by-byte comparison
of the user data before deduping the extent. Status and number of bytes
deduped are returned for each operation.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We want this for btrfs_extent_same. Basically readpage and friends do their
own extent locking but for the purposes of dedupe, we want to have both
files locked down across a set of readpage operations (so that we can
compare data). Introduce this variant and a flag which can be set for
extent_read_full_page() to indicate that we are already locked.
Partial credit for this patch goes to Gabriel de Perthuis <g2p.code@gmail.com>
as I have included a fix from him to the original patch which avoids a
deadlock on compressed extents.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
There's some 250+ lines here that are easily encapsulated into their own
function. I don't change how anything works here, just create and document
the new btrfs_clone() function from btrfs_ioctl_clone() code.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The range locking in btrfs_ioctl_clone is trivially broken out into it's own
function. This reduces the complexity of btrfs_ioctl_clone() by a small bit
and makes that locking code available to future functions in
fs/btrfs/ioctl.c
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In extent-tree.c:do_chunk_alloc(), early on we returned 0 (success)
when the target space was full and when chunk allocation is needed.
However, later on in that same function we return ENOSPC if
btrfs_alloc_chunk() fails (and chunk allocation was needed) and
set the space's full flag.
This was inconsistent, as -ENOSPC should be returned if the space
is full and a chunk allocation needs to performed. If the space is
full but no chunk allocation is needed, just return 0 (success).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
tree-log.c was ignoring the return value from btrfs_run_delayed_items()
in several places.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In tree-log.c:replay_one_name(), if memory allocation for
the name fails, ensure we iput the dir inode we got before
before we return.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The rule originally comes from nocow writing, but snapshot-aware
defrag is a different case, the extent has been writen and we're
not going to change the extent but add a reference on the data.
So we're able to allow such compressed extents to be merged into
one bigger extent if they're pointing to the same data.
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I'ts hardcoded to 30 seconds which is fine for most users. Higher values
defer data being synced to permanent storage with obvious consequences
when the system crashes. The upper bound is not forced, but a warning is
printed if it's more than 300 seconds (5 minutes).
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
There is no reason we can't just set the path to blocking and then do normal
GFP_NOFS allocations for these extent buffers. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We can get ENOMEM trying to allocate dummy bufs for the rewind operation of the
tree mod log. Instead of BUG_ON()'ing in this case pass up ENOMEM. I looked
back through the callers and I'm pretty sure I got everybody who did BUG_ON(ret)
in this path. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When doing a send with a parent subvol we will check to see if the file we are
acting on is being overwritten and move it if we think it may be needed further
down the line during the send. We check this by checking its directory and
making sure it existed in the parent and making sure the file existed in the
parent. The problem with this check is that if we create a directory and a file
in that directory, and then snapshot, and then remove and re-create that same
directory and file with different inode numbers and then try to snapshot and
send with the original parent we will try and save the original file inside of
that directory. This is a problem because during the receive we move the
directory out of the way because it is a completely new inode, which makes us
unable to find the old file inside of the directory when we try to move that out
of the way for the overwrite. We fix this by checking the parent directory of
the inode we think we are overwriting. If the parent directory generation in
the send root != the parent directory generation in the parent root then we know
it is a completely new directory and we need not bother with moving the file out
of the way because it would have been completely destroyed. This fixes bz
60673. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Alex Lyakas reported a bug where wait_block_group_cache_progress() would wait
forever if a drive failed. This is because we just bail out if there is an
error while trying to cache a block group, we don't update anybody who may be
waiting. So this introduces a new enum for the cache state in case of error and
makes everybody bail out if we have an error. Alex tested and verified this
patch fixed his problem. This fixes bz 59431. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we bail out when the stripe alloc fails, we need to undo the
earlier allocation of raid_map.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
After reading all device items from the chunk tree, don't
exit the loop and then navigate down the tree again to find
the chunk items. Instead just read all device items and
chunk items with a single tree search. This is possible
because all device items are found before any chunk item in
the chunks tree.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
There is no reason for this sort of jackassery. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Previously we only added blocks to the list to have their backrefs checked if
the level of the block is right above the one we are searching for. This is
because we want to make sure we don't add the entire path up to the root to the
lists to make sure we process things one at a time. This assumes that if any
blocks in the path to the root are going to be not checked (shared in other
words) then they will be in the level right above the current block on up. This
isn't quite right though since we can have blocks higher up the list that are
shared because they are attached to a reloc root. But we won't add this block
to be checked and then later on we will BUG_ON(!upper->checked). So instead
keep track of wether or not we've queued a block to be checked in this current
search, and if we haven't go ahead and queue it to be checked. This patch fixed
the panic I was seeing where we BUG_ON(!upper->checked). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If our item isn't big enough to have an actual inline item when we have skinny
metadata enabled just return 1 in find_inline_backref so we can move on to the
next item. This probably wasn't causing a problem since we check the values of
ptr and end properly, but just in case this will keep us from doing extra work.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
First of all we no longer set EXTENT_DIRTY when we dirty an extent so this patch
removes the clearing of EXTENT_DIRTY since it confuses me. This patch also adds
clearing EXTENT_DEFRAG and also doing EXTENT_DO_ACCOUNTING when we have errors.
This is because if we are clearing delalloc without adding an ordered extent
then we need to make sure the enospc handling stuff is accounted for. Also if
this range was DEFRAG we need to make sure that bit is cleared so we dont leak
it. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch removes the io_tree argument for extent_clear_unlock_delalloc since
we always use &BTRFS_I(inode)->io_tree, and it separates out the extent tree
operations from the page operations. This way we just pass in the extent bits
we want to clear and then pass in the operations we want done to the pages.
This is because I'm going to fix what extent bits we clear in some cases and
rather than add a bunch of new flags we'll just use the actual extent bits we
want to clear. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When we read several pages at once, we needn't get the extent map object
every time we deal with a page, and we can cache the extent map object.
So, we can reduce the search time of the extent map, and besides that, we
also can reduce the lock contention of the extent map tree.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In the past, we cached the checksum value in the extent state object, so we
had to split the extent state object by the block size, or we had no space
to keep this checksum value. But it increased the lock contention of the
extent state tree.
Now we removed this limit by caching the checksum into the bio object, so
it is unnecessary to do the extent state operations by the block size, we
can do it in batches, in this way, we can reduce the lock contention of
the extent state tree.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Before applying this patch, we set the uptodate flag and unlock the extent
by the page size, it is unnecessary, we can do it in batches, it can reduce
the lock contention of the extent state tree.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Before applying this patch, we cached the csum value into the extent state
tree when reading some data from the disk, this operation increased the lock
contention of the state tree.
Now, we just store the csum value into the bio structure or other unshared
structure, so we can reduce the lock contention.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch add some branch prediction hints into the end IO function
of the read page, it reduced the percentage of the branch misses from
5.5% to 4.9%.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Some options are missing in btrfs_show_options(), this patch
adds them.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Although for most time, int is enough for subvolid, we should
ensure safety in theory.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I just notice the following commands succeed:
mount <dev> <mnt> -o thread_pool=-1
This is ridiculous, only positive thread_pool makes sense,this
patch adds sanity checks for them, and also catches the error of
ENOMEM if allocating memory fails.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The alloc_rbio() frees "raid_map" and "bbio" on error, so there is a
potential double free bug in raid56_parity_write(). The
raid56_parity_write() and raid56_parity_recover() functions should still
free "raid_map" and "bbio" on error if other errors occur though, so I
have added some more calls to kfree().
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We can end up with inodes on the auto defrag list that exist on roots that are
going to be deleted. This is extra work we don't need to do, so just bail if
our root has 0 root refs. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I was hitting the BUG_ON() at the end of merge_reloc_roots() because we were
aborting the transaction at some point previously and then getting an error when
we tried to drop the reloc root. I fixed btrfs_drop_snapshot to re-add us to
the dead roots list if we failed, but this isn't the right thing to do for reloc
roots since it uses root->root_list for it's own stuff in order to know what
needs to be cleaned up. So fix btrfs_drop_snapshot to only do the re-add if we
aren't dropping for reloc, and handle errors from merge_reloc_root() by dropping
the reloc root we are processing since it won't be on the list of roots to
cleanup. With this patch my reproducer no longer panics. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I was getting warnings when running find ./ -type f -exec btrfs fi defrag -f {}
\; from record_one_backref because ret was set. Turns out it was because it was
set to 1 because the search slot didn't come out exact and we never reset it.
So reset it to 0 right after the search so we don't leak this and get
uneccessary warnings. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
btrfs_ioctl_get_fslabel() and btrfs_ioctl_set_fslabel()
used root->fs_info->volume_mutex mutex which caused operations
like balance to block set/get label operation until its
completion and generally balance operation takes a long
time to complete, so it will be annoying to the user when
cli appears hung
also this patch will add a bit of optimization within
the btrfs_ioctl_get_falabel() function.
v1->v2:
use fs_info->super_lock instead of uuid_mutex
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is confusing, sometimes the key type is printed in hex (without
a leading "0x" which makes things even more complicated), sometimes
in decimal...
Change it to be in decimal everywhere.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This shows exactly how btrfs processes the delayed refs onto disks,
which is very helpful on understanding delayed ref mechanism and
debugging related bugs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
btrfs_space_info->block_groups.
The current code uses integer literals to index
btrfs_space_info->block_groups[] array. Instead use corresponding
enums from 'enum btrfs_raid_types'.
Signed-off-by: chandan <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Some codes still use the cpu_to_lexx instead of the
BTRFS_SETGET_STACK_FUNCS declared in ctree.h.
Also added some BTRFS_SETGET_STACK_FUNCS for btrfs_header btrfs_timespec
and other structures.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We call ulist_free(qgroup_ulist) in btrfs_free_qgroup_config(),
and btrfs_free_qgroup_config() may be called in two cases:
(1)umount filesystem
(2)disabling quota
However, if we firstly disable quota and then umount filesystem,
a double free happens. Fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The function relocation.c:add_data_references() was not checking
if all calls to __add_tree_block() and find_data_references() were
succeeding or not.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The log message level 'critical' is verbose enough, 'emergency' beeps on
all terminals.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
So to cache free space, we iterate every extent item to gather free space info.
When we have say 10,000 non-inline extent refs(such as BTRFS_EXTENT_DATA_REF),
it takes quite a long time, and since inline extent refs and non-inline ones have
same objectid in their keys, we can just re-search the tree with the next address
to skip non-inline references.
(This is found by dedup feature because dedup extents can end up with many
non-inline extent refs.)
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I recently did some ENOSPC testing that involved filling the disk
while create and removing snapshots in a loop. During the test cycle,
I ran into an ENOSPC when trying to remove a snapshot, leaving the fs
stuck in ENOSPC even after a umount/mount cycle.
This patch allow subvolume removal to fall back onto the global
block reservation in order to succeed when it would have failed
otherwise.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we're looking for a metadata item in the tree and the
search fails with return value of 1, and the slot doesn't
point to the first item in the leaf, check if the previous
item in the leaf corresponds to an extent item for the same
object id - if it does, then don't do another tree search
to get it.
This optimization is already done by btrfs-progs.
V2: updated commit message.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Device scanning waits on the uuid_mutex, which can result in a very long
wait if dev delete is shrinking the device.
Signed-off-by: Carey Underwood <cwillu@cwillu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We've been seeing spurious complaints out of lockdep because the lock class name
changes. This is happening because when we drop a snapshot we will lock a block
before we've read it in, which sets the lockdep class to whatever the default
is. Then once we read the thing in we reset the lockdep class to what it is
supposed to be, which blows lockdeps' mind. This patch should fix the problem,
it appears to be the only place where we do this sort of thing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
With this fix the lzo code behaves like the zlib code by returning an
error
code when compression does not help reduce the size of the file.
This is currently not a bug since the compressed size is checked again
in
the calling method compress_file_range.
Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Previously we held the tree mod lock when adding stuff because we use it to
check and see if we truly do want to track tree modifications. This is
admirable, but GFP_ATOMIC in a critical area that is going to get hit pretty
hard and often is not nice. So instead do our basic checks to see if we don't
need to track modifications, and if those pass then do our allocation, and then
when we go to insert the new modification check if we still care, and if we
don't just free up our mod and return. Otherwise we're good to go and we can
carry on. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Don't emit OOM warnings when k.alloc calls fail when
there there is a v.alloc immediately afterwards.
Converted a kmalloc/vmalloc with memset to kzalloc/vzalloc.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
When btrfs readdir() hits the last entry it sets the readdir offset to a
huge value to stop buggy apps from breaking when the same name is
returned by readdir() with concurrent rename()s.
But unconditionally setting the offset to INT_MAX causes readdir() to
loop returning any entries with offsets past INT_MAX. It only takes a
few hours of constant file creation and removal to create entries past
INT_MAX.
So let's set the huge offset to LLONG_MAX if the last entry has already
overflowed 32bit loff_t. Without large offsets behaviour is identical.
With large offsets 64bit apps will work and 32bit apps will be no more
broken than they currently are if they see large offsets.
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
A user reported a panic when running with autodefrag and deleting snapshots.
This is because we could end up trying to add the root to the dead roots list
twice. To fix this check to see if we are empty before adding ourselves to the
dead roots list. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The ceph guys tripped over this bug where we were still holding onto the
original path that we used to copy the inode with when logging. This is based
on Chris's fix which was reported to fix the problem. We need to drop the paths
in two cases anyway so just move the drop up so that we don't have duplicate
code. Thanks,
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I noticed while running multi-threaded fsync tests that sometimes fsck would
complain about an improper gap. This happens because we fail to add a hole
extent to the file, which was happening when we'd split a hole EM because
btrfs_drop_extent_cache was just discarding the whole em instead of splitting
it. So this patch fixes this by allowing us to split a hole em properly, which
means that added holes actually get logged properly and we no longer see this
fsck error. Thankfully we're tolerant of these sort of problems so a user would
not see any adverse effects of this bug, other than fsck complaining. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Because we don't mess with the offset into the extent for compressed we will
properly find both extents for this case
[extent a][extent b][rest of extent a]
but because we already added a ref for the front half we won't add the inode
information for the second half. This causes us to leak that memory and not
print out the other offset when we do logical-resolve. So fix this by calling
ulist_add_merge and then add our eie to the existing entry if there is one.
With this patch we get both offsets out of logical-resolve. With this and the
other 2 patches I've sent we now pass btrfs/276 on my vm with compress-force=lzo
set. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If you do btrfs inspect-internal logical-resolve on a compressed extent that has
been partly overwritten it won't find anything. This is because we try and
match the extent offset we've searched for based on the extent offset in the
data extent entry. However this doesn't work for compressed extents because the
offsets are for the uncompressed size, not the compressed size. So instead only
do this check if we are not compressed, that way we can get an actual entry for
the physical offset rather than nothing for compressed. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
xfstest btrfs/276 was freaking out on slower boxes partly because fiemap was
offsetting the physical based on the extent offset. This is perfectly fine with
uncompressed extents, however the extent offset is into the uncompressed area,
not the compressed. So we can return a physical value that isn't at all within
the area we have allocated on disk. Fix this by returning the start of the
extent if it is compressed no matter what the offset. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
commit 47fb091fb787420cd195e66f162737401cce023f(Btrfs: fix unlock after free on rewinded tree blocks)
takes an extra increment on the reference of allocated dummy extent buffer, so now we
cannot free this dummy one, and end up with extent buffer leak.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
For partial extents, snapshot-aware defrag does not work as expected,
since
a) we use the wrong logical offset to search for parents, which should be
disk_bytenr + extent_offset, not just disk_bytenr,
b) 'offset' returned by the backref walking just refers to key.offset, not
the 'offset' stored in btrfs_extent_data_ref which is
(key.offset - extent_offset).
The reproducer:
$ mkfs.btrfs sda
$ mount sda /mnt
$ btrfs sub create /mnt/sub
$ for i in `seq 5 -1 1`; do dd if=/dev/zero of=/mnt/sub/foo bs=5k count=1 seek=$i conv=notrunc oflag=sync; done
$ btrfs sub snap /mnt/sub /mnt/snap1
$ btrfs sub snap /mnt/sub /mnt/snap2
$ sync; btrfs filesystem defrag /mnt/sub/foo;
$ umount /mnt
$ btrfs-debug-tree sda (Here we can check whether the defrag operation is snapshot-awared.
This addresses the above two problems.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Create a small file and fallocate it to a big size with
FALLOC_FL_KEEP_SIZE option, then truncate it back to the
small size again, the disk free space is not changed back
in this case. i.e,
total 4
-rw-r--r-- 1 root root 512 Jun 28 11:35 test
Filesystem Size Used Avail Use% Mounted on
....
/dev/sdb1 8.0G 56K 7.2G 1% /mnt
-rw-r--r-- 1 root root 512 Jun 28 11:35 /mnt/test
Filesystem Size Used Avail Use% Mounted on
....
/dev/sdb1 8.0G 5.1G 2.2G 70% /mnt
Filesystem Size Used Avail Use% Mounted on
....
/dev/sdb1 8.0G 5.1G 2.2G 70% /mnt
With this fix, the truncated up space is back as:
Filesystem Size Used Avail Use% Mounted on
....
/dev/sdb1 8.0G 56K 7.2G 1% /mnt
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Miao Xie reported the following issue:
The filesystem was corrupted after we did a device replace.
Steps to reproduce:
# mkfs.btrfs -f -m single -d raid10 <device0>..<device3>
# mount <device0> <mnt>
# btrfs replace start -rfB 1 <device4> <mnt>
# umount <mnt>
# btrfsck <device4>
The reason for the issue is that we changed the write offset by mistake,
introduced by commit 625f1c8dc.
We read the data from the source device at first, and then write the
data into the corresponding place of the new device. In order to
implement the "-r" option, the source location is remapped using
btrfs_map_block(). The read takes place on the mapped location, and
the write needs to take place on the unmapped location. Currently
the write is using the mapped location, and this commit changes it
back by undoing the change to the write address that the aforementioned
commit added by mistake.
Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: <stable@vger.kernel.org> # 3.10+
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we stop dropping a root for whatever reason we need to add it back to the
dead root list so that we will re-start the dropping next transaction commit.
The other case this happens is if we recover a drop because we will add a root
without adding it to the fs radix tree, so we can leak it's root and commit root
extent buffer, adding this to the dead root list makes this cleanup happen.
Thanks,
Cc: stable@vger.kernel.org
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We aren't setting path->locks[level] when we resume a snapshot deletion which
means we won't unlock the buffer when we free the path. This causes deadlocks
if we happen to re-allocate the block before we've evicted the extent buffer
from cache. Thanks,
Cc: stable@vger.kernel.org
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Alex pointed out a problem and fix that exists in the drop one snapshot at a
time patch. If we decide we need to exit for whatever reason (umount for
example) we will just exit the snapshot dropping without updating the drop
progress. So the next time we go to resume we will BUG_ON() because we can't
find the extent we left off at because we never updated it. This patch fixes
the problem.
Cc: stable@vger.kernel.org
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
PTR_RET is now deprecated. Use PTR_ERR_OR_ZERO instead.
Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Pull btrfs update from Chris Mason:
"These are the usual mixture of bugs, cleanups and performance fixes.
Miao has some really nice tuning of our crc code as well as our
transaction commits.
Josef is peeling off more and more problems related to early enospc,
and has a number of important bug fixes in here too"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (81 commits)
Btrfs: wait ordered range before doing direct io
Btrfs: only do the tree_mod_log_free_eb if this is our last ref
Btrfs: hold the tree mod lock in __tree_mod_log_rewind
Btrfs: make backref walking code handle skinny metadata
Btrfs: fix crash regarding to ulist_add_merge
Btrfs: fix several potential problems in copy_nocow_pages_for_inode
Btrfs: cleanup the code of copy_nocow_pages_for_inode()
Btrfs: fix oops when recovering the file data by scrub function
Btrfs: make the chunk allocator completely tree lockless
Btrfs: cleanup orphaned root orphan item
Btrfs: fix wrong mirror number tuning
Btrfs: cleanup redundant code in btrfs_submit_direct()
Btrfs: remove btrfs_sector_sum structure
Btrfs: check if we can nocow if we don't have data space
Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc
Btrfs: use a percpu to keep track of possibly pinned bytes
Btrfs: check for actual acls rather than just xattrs when caching no acl
Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate
Btrfs: optimize reada_for_balance
Btrfs: optimize read_block_for_search
...
Pull trivial tree updates from Jiri Kosina:
"The usual stuff from trivial tree"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
treewide: relase -> release
Documentation/cgroups/memory.txt: fix stat file documentation
sysctl/net.txt: delete reference to obsolete 2.4.x kernel
spinlock_api_smp.h: fix preprocessor comments
treewide: Fix typo in printk
doc: device tree: clarify stuff in usage-model.txt.
open firmware: "/aliasas" -> "/aliases"
md: bcache: Fixed a typo with the word 'arithmetic'
irq/generic-chip: fix a few kernel-doc entries
frv: Convert use of typedef ctl_table to struct ctl_table
sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
doc: clk: Fix incorrect wording
Documentation/arm/IXP4xx fix a typo
Documentation/networking/ieee802154 fix a typo
Documentation/DocBook/media/v4l fix a typo
Documentation/video4linux/si476x.txt fix a typo
Documentation/virtual/kvm/api.txt fix a typo
Documentation/early-userspace/README fix a typo
Documentation/video4linux/soc-camera.txt fix a typo
lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
...
Pull second set of VFS changes from Al Viro:
"Assorted f_pos race fixes, making do_splice_direct() safe to call with
i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
->d_hash/->d_compare calling conventions changes from Linus, misc
stuff all over the place."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
Document ->tmpfile()
ext4: ->tmpfile() support
vfs: export lseek_execute() to modules
lseek_execute() doesn't need an inode passed to it
block_dev: switch to fixed_size_llseek()
cpqphp_sysfs: switch to fixed_size_llseek()
tile-srom: switch to fixed_size_llseek()
proc_powerpc: switch to fixed_size_llseek()
ubi/cdev: switch to fixed_size_llseek()
pci/proc: switch to fixed_size_llseek()
isapnp: switch to fixed_size_llseek()
lpfc: switch to fixed_size_llseek()
locks: give the blocked_hash its own spinlock
locks: add a new "lm_owner_key" lock operation
locks: turn the blocked_list into a hashtable
locks: convert fl_link to a hlist_node
locks: avoid taking global lock if possible when waking up blocked waiters
locks: protect most of the file_lock handling with i_lock
locks: encapsulate the fl_link list handling
locks: make "added" in __posix_lock_file a bool
...
For those file systems(btrfs/ext4/ocfs2/tmpfs) that support
SEEK_DATA/SEEK_HOLE functions, we end up handling the similar
matter in lseek_execute() to update the current file offset
to the desired offset if it is valid, ceph also does the
simliar things at ceph_llseek().
To reduce the duplications, this patch make lseek_execute()
public accessible so that we can call it directly from the
underlying file systems.
Thanks Dave Chinner for this suggestion.
[AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]
v2->v1:
- Add kernel-doc comments for lseek_execute()
- Call lseek_execute() in ceph->llseek()
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Ted Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
category, of note is a fix for on-line resizing file systems where the
block size is smaller than the page size (i.e., file systems 1k blocks
on x86, or more interestingly file systems with 4k blocks on Power or
ia64 systems.)
In the cleanup category, the ext4's punch hole implementation was
significantly improved by Lukas Czerner, and now supports bigalloc
file systems. In addition, Jan Kara significantly cleaned up the
write submission code path. We also improved error checking and added
a few sanity checks.
In the optimizations category, two major optimizations deserve
mention. The first is that ext4_writepages() is now used for
nodelalloc and ext3 compatibility mode. This allows writes to be
submitted much more efficiently as a single bio request, instead of
being sent as individual 4k writes into the block layer (which then
relied on the elevator code to coalesce the requests in the block
queue). Secondly, the extent cache shrink mechanism, which was
introduce in 3.9, no longer has a scalability bottleneck caused by the
i_es_lru spinlock. Other optimizations include some changes to reduce
CPU usage and to avoid issuing empty commits unnecessarily.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJR0XhgAAoJENNvdpvBGATwMXkQAJwTPk5XYLqtAwLziFLvM6wG
0tWa1QAzTNo80tLyM9iGqI6x74X5nddLw5NMICUmPooOa9agMuA4tlYVSss5jWzV
yyB7vLzsc/2eZJusuVqfTKrdGybE+M766OI6VO9WodOoIF1l51JXKjktKeaWegfv
NkcLKlakD4V+ZASEDB/cOcR/lTwAs9dQ89AZzgPiW+G8Do922QbqkENJB8mhalbg
rFGX+lu9W0f3fqdmT3Xi8KGn3EglETdVd6jU7kOZN4vb5LcF5BKHQnnUmMlpeWMT
ksOVasb3RZgcsyf5ZOV5feXV601EsNtPBrHAmH22pWQy3rdTIvMv/il63XlVUXZ2
AXT3cHEvNQP0/yVaOTCZ9xQVxT8sL4mI6kENP9PtNuntx7E90JBshiP5m24kzTZ/
zkIeDa+FPhsDx1D5EKErinFLqPV8cPWONbIt/qAgo6663zeeIyMVhzxO4resTS9k
U2QEztQH+hDDbjgABtz9M/GjSrohkTYNSkKXzhTjqr/m5huBrVMngjy/F4/7G7RD
vSEx5aXqyagnrUcjsupx+biJ1QvbvZWOVxAE/6hNQNRGDt9gQtHAmKw1eG2mugHX
+TFDxodNE4iWEURenkUxXW3mDx7hFbGZR0poHG3M/LVhKMAAAw0zoKrrUG5c70G7
XrddRLGlk4Hf+2o7/D7B
=SwaI
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 update from Ted Ts'o:
"Lots of bug fixes, cleanups and optimizations. In the bug fixes
category, of note is a fix for on-line resizing file systems where the
block size is smaller than the page size (i.e., file systems 1k blocks
on x86, or more interestingly file systems with 4k blocks on Power or
ia64 systems.)
In the cleanup category, the ext4's punch hole implementation was
significantly improved by Lukas Czerner, and now supports bigalloc
file systems. In addition, Jan Kara significantly cleaned up the
write submission code path. We also improved error checking and added
a few sanity checks.
In the optimizations category, two major optimizations deserve
mention. The first is that ext4_writepages() is now used for
nodelalloc and ext3 compatibility mode. This allows writes to be
submitted much more efficiently as a single bio request, instead of
being sent as individual 4k writes into the block layer (which then
relied on the elevator code to coalesce the requests in the block
queue). Secondly, the extent cache shrink mechanism, which was
introduce in 3.9, no longer has a scalability bottleneck caused by the
i_es_lru spinlock. Other optimizations include some changes to reduce
CPU usage and to avoid issuing empty commits unnecessarily."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
ext4: optimize starting extent in ext4_ext_rm_leaf()
jbd2: invalidate handle if jbd2_journal_restart() fails
ext4: translate flag bits to strings in tracepoints
ext4: fix up error handling for mpage_map_and_submit_extent()
jbd2: fix theoretical race in jbd2__journal_restart
ext4: only zero partial blocks in ext4_zero_partial_blocks()
ext4: check error return from ext4_write_inline_data_end()
ext4: delete unnecessary C statements
ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
jbd2: move superblock checksum calculation to jbd2_write_superblock()
ext4: pass inode pointer instead of file pointer to punch hole
ext4: improve free space calculation for inline_data
ext4: reduce object size when !CONFIG_PRINTK
ext4: improve extent cache shrink mechanism to avoid to burn CPU time
ext4: implement error handling of ext4_mb_new_preallocation()
ext4: fix corruption when online resizing a fs with 1K block size
ext4: delete unused variables
ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
jbd2: remove debug dependency on debug_fs and update Kconfig help text
jbd2: use a single printk for jbd_debug()
...
My recent truncate patch uncovered this bug, but I can reproduce it without the
truncate patch. If you mount with -o compress-force, do a direct write to some
area, do a buffered write to some other area, and then do a direct read you will
get the wrong data for where you did the buffered write. This is because the
generic direct io helpers only call filemap_write_and_wait once, and for
compression we need it twice. So to be safe add the btrfs_wait_ordered_range to
the start of the direct io function to make sure any compressed writes have
truly been written. This patch makes xfstests 130 pass when you mount with -o
compress-force=lzo. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is another bug in the tree mod log stuff in that we're calling
tree_mod_log_free_eb every single time a block is cow'ed. The problem with this
is that if this block is shared by multiple snapshots we will call this multiple
times per block, so if we go to rewind the mod log for this block we'll BUG_ON()
in __tree_mod_log_rewind because we try to rewind a free twice. We only want to
call tree_mod_log_free_eb if we are actually freeing the block. With this patch
I no longer hit the panic in __tree_mod_log_rewind. Thanks,
Cc: stable@vger.kernel.org
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk
forward in the tree mod entries, otherwise we'll end up with random entries and
trip the BUG_ON() at the front of __tree_mod_log_rewind. This fixes the panics
people were seeing when running
find /whatever -type f -exec btrfs fi defrag {} \;
Thansk,
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I missed fixing the backref stuff when I introduced the skinny metadata. If you
try and do things like snapshot aware defrag with skinny metadata you are going
to see tons of warnings related to the backref count being less than 0. This is
because the delayed refs will be found for stuff just fine, but it won't find
the skinny metadata extent refs. With this patch I'm not seeing warnings
anymore. Thanks,
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Several users reported this crash of NULL pointer or general protection,
the story is that we add a rbtree for speedup ulist iteration, and we
use krealloc() to address ulist growth, and krealloc() use memcpy to copy
old data to new memory area, so it's OK for an array as it doesn't use
pointers while it's not OK for a rbtree as it uses pointers.
So krealloc() will mess up our rbtree and it ends up with crash.
Reviewed-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
- It makes no sense that we deal with a inode in the dead tree.
- fix the race between dio and page copy by waiting the dio completion
- avoid the page copy vs truncate/punch hole
- check if the page is in the page cache or not
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
- It make no sense that we continue to do something after the error
happened, just go back with this patch.
- remove some check of copy_nocow_pages_for_inode(), such as page check
after write, inode check in the end of the function, because we are
sure they exist.
- remove the unnecessary goto in the return value check of the write
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When adjusting the enospc rules for relocation I ran into a deadlock because we
were relocating the only system chunk and that forced us to try and allocate a
new system chunk while holding locks in the chunk tree, which caused us to
deadlock. To fix this I've moved all of the dev extent addition and chunk
addition out to the delayed chunk completion stuff. We still keep the in-memory
stuff which makes sure everything is consistent.
One change I had to make was to search the commit root of the device tree to
find a free dev extent, and hold onto any chunk em's that we allocated in that
transaction so we do not allocate the same dev extent twice. This has the side
effect of fixing a bug with balance that has been there ever since balance
existed. Basically you can free a block group and it's dev extent and then
immediately allocate that dev extent for a new block group and write stuff to
that dev extent, all within the same transaction. So if you happen to crash
during a balance you could come back to a completely broken file system. This
patch should keep these sort of things from happening in the future since we
won't be able to allocate free'd dev extents until after the transaction
commits. This has passed all of the xfstests and my super annoying stress test
followed by a balance. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I hit a weird problem were my root item had been deleted but the orphan item had
not. This isn't necessarily a problem, but it keeps the file system from being
mounted. To fix this we just need to axe the orphan item if we can't find the
fs root when we're putting them altogether. With this patch I was able to
successfully mount my file system. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Now reading the data from the target device of the replace operation is allowed,
so the mirror number that is greater than the stripes number of a chunk is valid,
we will tune it when we find there is no target device later. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We always just try and reserve data space when we write, but if we are out of
space but have prealloc'ed extents we should still successfully write. This
patch will try and see if we can write to prealloc'ed space and if we can go
ahead and allow the write to continue. With this patch we now pass xfstests
generic/274. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
try_to_writeback_inodes_sb_nr returns 1 if writeback is already underway, which
is completely fraking useless for us as we need to make sure pages are actually
written before we go and check if there are ordered extents. So replace this
with an open coding of try_to_writeback_inodes_sb_nr minus the writeback
underway check so that we are sure to actually have flushed some dirty pages out
and will have ordered extents to use. With this patch xfstests generic/273 now
passes. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There are all of these checks in the ENOSPC code to see if committing the
transaction would free up enough space to make the allocation. This is because
early on we just committed the transaction and hoped and prayed, which resulted
in cases where it took _forever_ to get an ENOSPC when we really were out of
space. So we check space_info->bytes_pinned, except this isn't completely true
because it doesn't account for space we may free but are stuck in delayed refs.
So tests like xfstests 226 would fail because we wouldn't commit the transaction
to free up the data space. So instead add a percpu counter that will be a
little fuzzier, it will add bytes as soon as we try to free up the space, and
remove any space it doesn't actually free up when we get around to doing the
actual free. We then 0 out this counter every transaction period so we have a
better idea of how much space we will actually free up by committing this
transaction. With this patch we now pass xfstests 226. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We have an optimization that will go ahead and cache no acls on an inode if
there are no xattrs on the inode. This saves us a lookup later to check the
acls for writes or any other access. The problem is I use selinux so I always
have an xattr on inodes, so make this test a little smarter and check for the
actual acl hash on the key and if it isn't there then we still get to cache no
acl which makes everybody who uses selinux a little happier. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This has plagued us forever and I'm so over working around it. When we truncate
down to a non-page aligned offset we will call btrfs_truncate_page to zero out
the end of the page and write it back to disk, this will keep us from exposing
stale data if we truncate back up from that point. The problem with this is it
requires data space to do this, and people don't really expect to get ENOSPC
from truncate() for these sort of things. This also tends to bite the orphan
cleanup stuff too which keeps people from mounting. To get around this we can
just move this into btrfs_cont_expand() to make sure if we are truncating up
from a non-page size aligned i_size we will zero out the rest of this page so
that we don't expose stale data. This will give ENOSPC if you try to truncate()
up or if you try to write past the end of isize, which is much more reasonable.
This fixes xfstests generic/083 failing to mount because of the orphan cleanup
failing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This patch does two things. First we no longer explicitly read in the blocks
we're trying to readahead. For things like balance_level we may never actually
use the blocks so this just adds uneeded latency, and balance_level and
split_node will both read in the blocks they care about explicitly so if the
blocks need to be waited on it will be done there. Secondly we no longer drop
the path if we do readahead, we just set the path blocking before we call
reada_for_balance() and then we're good to go. Hopefully this will cut down on
the number of re-searches. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This patch does two things, first it only does one call to
btrfs_buffer_uptodate() with the gen specified instead of once with 0 and then
again with gen specified. The other thing is to call btrfs_read_buffer() on the
buffer we've found instead of dropping it and then calling read_tree_block().
This will keep us from doing yet another radix tree lookup for a buffer we've
already found. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user reported a deadlock where the async submit thread was blocked on the
lock_extent() lock, and then everybody behind him was locked on the page lock
for the page he was holding. Looking at the code I noticed we do not unlock the
extent range when we get ENOSPC and goto retry. This is bad because we
immediately try to lock that range again to do the cow, which will cause a
deadlock. Fix this by unlocking the range. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The comment is for btrfs_attach_transaction_barrier, not for
btrfs_attach_transaction. Fix the typo.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Acked-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We unconditionally search for the EXTENT_ITEM_KEY for metadata during balance,
and then check the key that we found to see if it is actually a
METADATA_ITEM_KEY, but this doesn't work right because METADATA is a higher key
value, so if what we are looking for happens to be the first item in the leaf
the search will dump us out at the previous leaf, and we won't find our item.
So instead do what we do everywhere else, search for the skinny extent first and
if we don't find it go back and re-search for the extent item. This patch fixes
the panic I was hitting when balancing a large file system with skinny extents.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Looking into this backref problem I noticed we're using a macro to what turns
out to essentially be a NULL check to see if we need to search the commit root.
I'm killing this, let's just do what everybody else does and checks if trans ==
NULL. I've also made it so we pass in the path to __resolve_indirect_refs which
will have the search_commit_root flag set properly already and that way we can
avoid allocating another path when we have a perfectly good one to use. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user reported scrub taking up an unreasonable amount of ram as it ran. This
is because we lookup the csums for the extent we're scrubbing but don't free it
up until after we're done with the scrub, which means we can take up a whole lot
of ram. This patch fixes this by dropping the csums once we're done with the
extent we've scrubbed. The user reported this to fix their problem. Thanks,
Reported-and-tested-by: Remco Hosman <remco@hosman.xs4all.nl>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave has this fs_mark script that can make btrfs abort with sufficient amount of
ram. This is because with more ram we can keep more dirty metadata in cache
which in a round about way makes for many more pending delayed refs. What
happens is we end up not throttling the transaction enough so when we go to
commit the transaction when we've completely filled the file system we'll
abort() because we use all of the space in the global reserve and we still have
delayed refs to run. To fix this we need to make the delayed ref flushing and
the transaction throttling dependant upon the number of delayed refs that we
have instead of how much reserved space is left in the global reserve. With
this patch we not only stop aborting transactions but we also get a smoother run
speed with fs_mark and it makes us about 10% faster. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I hit a hang when run_delayed_refs returned an error in the beginning of
btrfs_commit_transaction. If we decide we need to commit the transaction in
btrfs_end_transaction we'll set BLOCKED and start to commit, but if we get an
error this early on we'll just exit without committing. This is fine, except
that anybody else who tried to start a transaction will sit in
wait_current_trans() since we're set to BLOCKED and we never set it to something
else and woke people up. To fix this we want to check for trans->aborted
everywhere we wait for the transaction state to change, and make
btrfs_abort_transaction() wake up any waiters there may be. All the callers
will notice that the transaction has aborted and exit out properly. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I hit a deadlock because we aborted when flushing delayed refs but didn't wake
any of the other flushers up and so everybody was just sleeping forever. This
should fix the problem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Fix the code comments for lzo compression workspace.
The buf item is used to store the decompressed data
and cbuf is used to store the compressed data.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Balance will create reloc_root for each fs root, and it's going to
record last_snapshot to filter shared blocks. The side effect of
setting last_snapshot is to break nocow attributes of files.
Since the extents are not shared by the relocation tree after the balance,
we can recover the old last_snapshot safely if no one snapshoted the
source tree. We fix the above problem by this way.
Reported-by: Kyle Gates <kylegates@hotmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
With non-mixed block groups we replay the logs before we're allowed to do any
writes, so we get away with not pinning/removing the data extents until right
when we replay them. However with mixed block groups we allocate out of the
same pool, so we could easily allocate a metadata block that was logged in our
tree log. To deal with this we just need to notice that we have mixed block
groups and do the normal excluding/removal dance during the pin stage of the log
replay and that way we don't allocate metadata blocks from areas we have logged
data extents. With this patch we now pass xfstests generic/311 with mixed
block groups turned on. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we cross into a different subvol when doing a lookup we will run the orhpan
cleanup. If this fails however we do not drop the ref to the inode we were
looking up before we return an error, which leads to busy inodes on umount.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When testing a corrupted fs I noticed I was getting sleep while atomic errors
when the transaction aborted. This is because btrfs_pin_extent may need to
allocate memory and we are calling this under the spin lock. Fix this by moving
it out and doing the pin after dropping the spin lock but before dropping the
mutex, the same way it works when delayed refs run normally. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When called during mount, we cannot start the rescan worker thread until
open_ctree is done. This commit restuctures the qgroup rescan internals to
enable a clean deferral of the rescan resume operation.
First of all, the struct qgroup_rescan is removed, saving us a malloc and
some initialization synchronizations problems. Its only element (the worker
struct) now lives within fs_info just as the rest of the rescan code.
Then setting up a rescan worker is split into several reusable stages.
Currently we have three different rescan startup scenarios:
(A) rescan ioctl
(B) rescan resume by mount
(C) rescan by quota enable
Each case needs its own combination of the four following steps:
(1) set the progress [A, C: zero; B: state of umount]
(2) commit the transaction [A]
(3) set the counters [A, C: zero; B: state of umount]
(4) start worker [A, B, C]
qgroup_rescan_init does step (1). There's no extra function added to commit
a transaction, we've got that already. qgroup_rescan_zero_tracking does
step (3). Step (4) is nothing more than a call to the generic
btrfs_queue_worker.
We also get rid of a double check for the rescan progress during
btrfs_qgroup_account_ref, which is no longer required due to having step 2
from the list above.
As a side effect, this commit prepares to move the rescan start code from
btrfs_run_qgroups (which is run during commit) to a less time critical
section.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When btrfs_read_qgroup_config or btrfs_quota_enable return non-zero, we've
already freed the fs_info->qgroup_ulist. The final btrfs_free_qgroup_config
called from quota_disable makes another ulist_free(fs_info->qgroup_ulist)
call.
We set fs_info->qgroup_ulist to NULL on the mentioned error paths, turning
the ulist_free in btrfs_free_qgroup_config into a noop.
Cc: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Commit 5b7c665e introduced fs_info->qgroup_ulist, that is allocated during
btrfs_read_qgroup_config and meant to be used later by the qgroup accounting
code. However, it is always freed before btrfs_read_qgroup_config returns,
becuase the commit mentioned above adds a check for (ret), where a check
for (ret < 0) would have been the right choice. This commit fixes the check.
Cc: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave pointed out a problem where if you filled up a file system as much as
possible you couldn't remove any files. The whole unlink reservation thing is
convoluted because it tries to guess if it's going to add space to unlink
something or not, and has all these odd uncommented cases where it simply does
not try. So to fix this I've added a way to conditionally steal from the global
reserve if we can't make our normal reservation. If we have more than half the
space in the global reserve free we will go ahead and steal from the global
reserve. With this patch Dave's reproducer now works and I can rm all the files
on the file system. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Before applying this patch, we flushed the log tree of the fs/file
tree firstly, and then flushed the log root tree. It is ineffective,
especially on the hard disk. This patch improved this problem by wrapping
the above two flushes by the same blk_plug.
By test, the performance of the sync write went up ~60%(2.9MB/s -> 4.6MB/s)
on my scsi disk whose disk buffer was enabled.
Test step:
# mkfs.btrfs -f -m single <disk>
# mount <disk> <mnt>
# dd if=/dev/zero of=<mnt>/file0 bs=32K count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We did not allow file data clone within the same file because of
deadlock issues.
However, we now use nested lock to avoid deadlock between the
parent directory and the child file.
So it's safe to do file clone within the same file when the two
ranges are not overlapped.
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
EXTREF is treated same as REF, so we can make the code tidy.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
During splitting a leaf, pushing items around to hopefully get some space only
works when we have a parent, ie. we have at least one sibling leaf.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
As for splitting a leaf, root is just the leaf, and tree mod log does not apply
on leaf, so in this case, we don't do log_removal.
As for splitting a node, the old root is kept as a normal node and we have nicely
put records in tree mod log for moving keys and items, so in this case we don't do
that either.
As above, insert_new_root can get rid of log_removal.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Fix to return error code instead always return 0 from function
btrfs_check_trunc_cache_free_space().
Introduced by commit 7b61cd9224
(Btrfs: don't use global block reservation for inode cache truncation)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This fixes bugzilla 57491. If we take a snapshot of a fs with a unlink ongoing
and then try to send that root we will run into problems. When comparing with a
parent root we will search the parents and the send roots commit_root, which if
we've just created the snapshot will include the file that needs to be evicted
by the orphan cleanup. So when we find a changed extent we will try and copy
that info into the send stream, but when we lookup the inode we use the normal
root, which no longer has the inode because the orphan cleanup deleted it. The
best solution I have for this is to check our otransid with the generation of
the commit root and if they match just commit the transaction again, that way we
get the changes from the orphan cleanup. With this patch the reproducer I made
for this bugzilla no longer returns ESTALE when trying to do the send. Thanks,
Cc: stable@vger.kernel.org
Reported-by: Chris Wilson <jakdaw@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
when user runs command btrfs dev del the raid requisite error if any
goes to the /var/log/messages, its not good idea to clutter messages
with these user (knowledge) errors, further user don't have to review
the system messages to know problem with the cli it should be dropped
to the user as part of the cli return.
to bring this feature created a set of the ERROR defined
BTRFS_ERROR_DEV* error codes and created their error string.
I expect this enum to be added with other error which we might
want to communicate to the user land
v3:
moved the code with in the file no logical change
v1->v2:
introduce error codes for the device mgmt usage
v1:
adds a parameter in the ioctl arg struct to carry the error string
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We get lock inversion with umount if we allow iputs from sync_fs, so use the
delay iput flag to keep this from happening. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We used 3 variants to track the state of the transaction, it was complex
and wasted the memory space. Besides that, it was hard to understand that
which types of the transaction handles should be blocked in each transaction
state, so the developers often made mistakes.
This patch improved the above problem. In this patch, we define 6 states
for the transaction,
enum btrfs_trans_state {
TRANS_STATE_RUNNING = 0,
TRANS_STATE_BLOCKED = 1,
TRANS_STATE_COMMIT_START = 2,
TRANS_STATE_COMMIT_DOING = 3,
TRANS_STATE_UNBLOCKED = 4,
TRANS_STATE_COMPLETED = 5,
TRANS_STATE_MAX = 6,
}
and just use 1 variant to track those state.
In order to make the blocked handle types for each state more clear,
we introduce a array:
unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
[TRANS_STATE_RUNNING] = 0U,
[TRANS_STATE_BLOCKED] = (__TRANS_USERSPACE |
__TRANS_START),
[TRANS_STATE_COMMIT_START] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH),
[TRANS_STATE_COMMIT_DOING] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN),
[TRANS_STATE_UNBLOCKED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
[TRANS_STATE_COMPLETED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
}
it is very intuitionistic.
Besides that, because we remove ->in_commit in transaction structure, so
the lock ->commit_lock which was used to protect it is unnecessary, remove
->commit_lock.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We checked the commit time to avoid committing the transaction
frequently, but it is unnecessary because:
- It made the transaction commit spend more time, and delayed the
operation of the external writers(TRANS_START/TRANS_USERSPACE).
- Except the space that we have to commit transaction, such as
snapshot creation, btrfs doesn't commit the transaction on its
own initiative.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We used ->num_joined track if there were some writers which join the current
transaction when the committer was sleeping. If some writers joined the current
transaction, we has to continue the while loop to do some necessary stuff, such
as flush the ordered operations. But it is unnecessary because we will do it
after the while loop.
Besides that, tracking ->num_joined would make the committer drop into the while
loop when there are lots of internal writers(TRANS_JOIN).
So we remove ->num_joined and don't track if there are some writers which join
the current transaction when the committer is sleeping.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
It is unnecessary to flush the delalloc inodes again and again because
we don't care the dirty pages which are introduced after the flush, and
they will be flush in the transaction commit.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_commit_transaction has the following loop before we commit the
transaction.
do {
// attempt to do some useful stuff and/or sleep
} while (atomic_read(&cur_trans->num_writers) > 1 ||
(should_grow && cur_trans->num_joined != joined));
This is used to prevent from the TRANS_START to get in the way of a
committing transaction. But it does not prevent from TRANS_JOIN, that
is we would do this loop for a long time if some writers JOIN the
current transaction endlessly.
Because we need join the current transaction to do some useful stuff,
we can not block TRANS_JOIN here. So we introduce a external writer
counter, which is used to count the TRANS_USERSPACE/TRANS_START writers.
If the external writer counter is zero, we can break the above loop.
In order to make the code more clear, we don't use enum variant
to define the type of the transaction handle, use bitmask instead.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If the transaction is removed from the transaction list, it means the
transaction has been committed successfully. So it is impossible to
call cleanup_transaction(), otherwise there is something wrong with
the code logic. Thus, we use BUG_ON() instead of the original handle.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we umount a fs with serious errors, we will invoke btrfs_cleanup_transactions()
to clean up the residual transaction. At this time, It is impossible to start a new
transaction, so we needn't assign trans_no_join to 1, and also needn't clear running
transaction every time we destroy a residual transaction.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Before applying this patch, we need flush all the delalloc inodes in
the fs when we want to create a snapshot, it wastes time, and make
the transaction commit be blocked for a long time. It means some other
user operation would also be blocked for a long time.
This patch improves this problem, we just flush the delalloc inodes that
in the source trees before snapshot creation, so the transaction commit
will complete quickly.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The reason we introduce per-subvolume ordered extent list is the same
as the per-subvolume delalloc inode list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we create a snapshot, we need flush all delalloc inodes in the
fs, just flushing the inodes in the source tree is OK. So we introduce
per-subvolume delalloc inode list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The grab/put funtions will be used in the next patch, which need grab
the root object and ensure it is not freed. We use reference counter
instead of the srcu lock is to aovid blocking the memory reclaim task,
which invokes synchronize_srcu().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There are several functions whose code is similar, such as
btrfs_find_last_root()
btrfs_read_fs_root_no_radix()
Besides that, some functions are invoked twice, it is unnecessary,
for example, we are sure that all roots which is found in
btrfs_find_orphan_roots()
have their orphan items, so it is unnecessary to check the orphan
item again.
So cleanup it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The snapshot/subvolume deletion might spend lots of time, it would make
the remount task wait for a long time. This patch improve this problem,
we will break the deletion if the fs is remounted to be R/O. It will make
the users happy.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If the fs is remounted to be R/O, it is unnecessary to call
btrfs_clean_one_deleted_snapshot(), so move the R/O check out of
this function. And besides that, it can make the check logic in the
caller more clear.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
In order to avoid the R/O remount, we acquired ->s_umount lock during
we deleted the dead snapshots and subvolumes. But it is unnecessary,
because we have cleaner_mutex.
We use cleaner_mutex to protect the process of the dead snapshots/subvolumes
deletion. And when we remount the fs to be R/O, we also acquire this mutex to
do cleanup after we change the status of the fs. That is this lock can serialize
the above operations, the cleaner can be aware of the status of the fs, and if
the cleaner is deleting the dead snapshots/subvolumes, the remount task will
wait for it. So it is safe to remove ->s_umount in cleaner_kthread().
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_read_fs_root_no_name() already checks if btrfs_root_refs()
is zero and returns ENOENT in this case. There is no need to do
it again in six places.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
No need to check for NULL in send.c and disk-io.c.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We don't need to copy it back to user side as it remains unchanged.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Clean up the format of the definitions of BTRFS_BLOCK_GROUP_RAID5 and
BTRFS_BLOCK_GROUP_RAID6.
Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
sctx is removed from the argument of the function that
doesn't use sctx.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The size parameter to btrfs_extend_item() is the number of bytes
to add to the item, not the size of the item after the operation
(like it is for btrfs_truncate_item(), there the size parameter
is not the number of bytes to take away, but the total size of
the item after truncation).
Fix it in the comment.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_qgroup_wait_for_completion waits until the currently running qgroup
operation completes. It returns immediately when no rescan process is in
progress. This is useful to automate things around the rescan process (e.g.
testing).
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When doing qgroup accounting, we call ulist_alloc()/ulist_free() every time
when we want to walk qgroup tree.
By introducing 'qgroup_ulist', we only need to call ulist_alloc()/ulist_free()
once. This reduce some sys time to allocate memory, see the measurements below
fsstress -p 4 -n 10000 -d $dir
With this patch:
real 0m50.153s
user 0m0.081s
sys 0m6.294s
real 0m51.113s
user 0m0.092s
sys 0m6.220s
real 0m52.610s
user 0m0.096s
sys 0m6.125s avg 6.213
-----------------------------------------------------
Without the patch:
real 0m54.825s
user 0m0.061s
sys 0m10.665s
real 1m6.401s
user 0m0.089s
sys 0m11.218s
real 1m13.768s
user 0m0.087s
sys 0m10.665s avg 10.849
we can see the sys time reduce ~43%.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We want to know if there are debugging features compiled in, this may
affect performance. The message is printed before the sanity checks.
Also kill version.h file that serves no purpose, we don't use any
version tag for kernel module.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The 'end' value must exactly cover the end of the interval, which means
one byte less than the expected block alignment, or in case of a file
smaller than one block, one byte less than the inode size.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Code checked for raid 5 flag in two else-if branches, so code would never be reached. Probably a copy-paste bug.
Signed-off-by: Henrik Nordvik <henrikno@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Pull btrfs fixes from Chris Mason:
"This is an assortment of crash fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: stop all workers before cleaning up roots
Btrfs: fix use-after-free bug during umount
Btrfs: init relocate extent_io_tree with a mapping
btrfs: Drop inode if inode root is NULL
Btrfs: don't delete fs_roots until after we cleanup the transaction
Dave reported a panic because the extent_root->commit_root was NULL in the
caching kthread. That is because we just unset it in free_root_pointers, which
is not the correct thing to do, we have to either wait for the caching kthread
to complete or hold the extent_commit_sem lock so we know the thread has exited.
This patch makes the kthreads all stop first and then we do our cleanup. This
should fix the race. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave reported a NULL pointer deref. This is caused because he thought he'd be
smart and add sanity checks to the extent_io bit operations, but he didn't
expect a tree to have a NULL mapping. To fix this we just need to init the
relocation's processed_blocks with the btree_inode->i_mapping. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
There is a path where btrfs_drop_inode() is called with its inode's root
is NULL: In btrfs_new_inode(), when btrfs_set_inode_index() fails,
iput() is called. We should handle this case before taking look at the
root->root_item.
Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We get a use after free if we had a transaction to cleanup since there could be
delayed inodes which refer to their respective fs_root. Thanks
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The size parameter to btrfs_extend_item() is the number of bytes
to add to the item, not the size of the item after the operation
(like it is for btrfs_truncate_item(), there the size parameter
is not the number of bytes to take away, but the total size of
the item after truncation).
Fix it in the comment.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Currently there is no way to truncate partial page where the end
truncate point is not at the end of the page. This is because it was not
needed and the functionality was enough for file system truncate
operation to work properly. However more file systems now support punch
hole feature and it can benefit from mm supporting truncating page just
up to the certain point.
Specifically, with this functionality truncate_inode_pages_range() can
be changed so it supports truncating partial page at the end of the
range (currently it will BUG_ON() if 'end' is not at the end of the
page).
This commit changes the invalidatepage() address space operation
prototype to accept range to be invalidated and update all the instances
for it.
We also change the block_invalidatepage() in the same way and actually
make a use of the new length argument implementing range invalidation.
Actual file system implementations will follow except the file systems
where the changes are really simple and should not change the behaviour
in any way .Implementation for truncate_page_range() which will be able
to accept page unaligned ranges will follow as well.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Pull btrfs fixes from Chris Mason:
"Miao Xie has been very busy, fixing races and enospc problems and many
other small but important pieces.
Alexandre Oliva discovered some problems with how our error handling
was interacting with the block layer and for now has disabled our
partial handling of sub-page writes. The real sub-page work is in a
series of patches from IBM that we still need to integrate and test.
The code Alexandre has turned off was really incomplete.
Josef has more error handling fixes and an important fix for the new
skinny extent format.
This also has my fix for the tracepoint crash from late in 3.9. It's
the first stage in a larger clean up to get rid of btrfs_bio and make
a proper bioset for all the items we need to tack into the bio. For
now the bioset only holds our mirror_num and stripe_index, but for the
next merge window I'll shuffle more in."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
Btrfs: use a btrfs bioset instead of abusing bio internals
Btrfs: make sure roots are assigned before freeing their nodes
Btrfs: explicitly use global_block_rsv for quota_tree
btrfs: do away with non-whole_page extent I/O
Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context
Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix()
Btrfs: pause the space balance when remounting to R/O
Btrfs: fix unprotected root node of the subvolume's inode rb-tree
Btrfs: fix accessing a freed tree root
Btrfs: return errno if possible when we fail to allocate memory
Btrfs: update the global reserve if it is empty
Btrfs: don't steal the reserved space from the global reserve if their space type is different
Btrfs: optimize the error handle of use_block_rsv()
Btrfs: don't use global block reservation for inode cache truncation
Btrfs: don't abort the current transaction if there is no enough space for inode cache
Correct allowed raid levels on balance.
Btrfs: fix possible memory leak in replace_path()
Btrfs: fix possible memory leak in the find_parent_nodes()
Btrfs: don't allow device replace on RAID5/RAID6
Btrfs: handle running extent ops with skinny metadata
...
Btrfs has been pointer tagging bi_private and using bi_bdev
to store the stripe index and mirror number of failed IOs.
As bios bubble back up through the call chain, we use these
to decide if and how to retry our IOs. They are also used
to count IO failures on a per device basis.
Recently a bio tracepoint was added lead to crashes because
we were abusing bi_bdev.
This commit adds a btrfs bioset, and creates explicit fields
for the mirror number and stripe index. The plan is to
extend this structure for all of the fields currently in
struct btrfs_bio, which will mean one less kmalloc in
our IO path.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Tejun Heo <tj@kernel.org>
If we fail to load the chunk tree we'll call free_root_pointers, except we may
not have assigned the roots for the dev_root/extent_root/csum_root yet, so we
could NULL pointer deref at this point. Just add checks to make sure these
roots are set to keep us from panicing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The quota_tree was set up to use the empty_block_rsv before
which would be problematic when the filesystem is filled up
and ENOSPC happens during internal operations while the quota
tree is updated and COWed (when the btrfs_qgroup_info_item
items) are written. In fact, use_block_rsv() which is used
in btrfs_cow_block() falls back to the global_block_rsv in
this case. But just in order to make it more clear what is
happening, change it to explicitly use the global_block_rsv.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
end_bio_extent_readpage computes whole_page based on bv_offset and
bv_len, without taking into account that blk_update_request may modify
them when some of the blocks to be read into a page produce a read
error. This would cause the read to unlock only part of the file
range associated with the page, which would in turn leave the entire
page locked, which would not only keep the process blocked instead of
returning -EIO to it, but also prevent any further access to the file.
It turns out that btrfs always issues whole-page reads and writes.
The special handling of non-whole_page appears to be a mistake or a
left-over from a time when this wasn't the case. Indeed,
end_bio_extent_writepage distinguished between whole_page and
non-whole_page writes but behaved identically in both cases!
I've replaced the whole_page computations with warnings, just to be
sure that we're not issuing partial page reads or writes. The
warnings should probably just go away some time.
Signed-off-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_invalidate_inodes() may sleep, so we should not invoke it in the
spin lock context. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We have checked if ->node is NULL or not, so it is unnecessary to
use BUG_ON() to check again. Remove it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The root node of the rb-tree may be changed, so we should get it under
the lock. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
inode_tree_del() will move the tree root into the dead root list, and
then the tree will be destroyed by the cleaner. So if we remove the
delayed node which is cached in the inode after inode_tree_del(),
we may access a freed tree root. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need to set return value explicitly, otherwise we'll lose the error
value.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Before applying this patch, we reserved the space for the global reserve
by the minimum unit if we found it is empty, it was unreasonable and
inefficient, because if the global reserve space was depleted, it implied
that the size of the global reserve was too small. In this case, we shoud
update the global reserve and fill it.
Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If the type of the space we need is different with the global reserve, we
can not steal the space from the global reserve, because we can not allocate
the space from the free space cache that the global reserve points to.
Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
It is very likely that there are lots of subvolumes/snapshots in the filesystem,
so if we use global block reservation to do inode cache truncation, we may hog
all the free space that is reserved in global rsv. So it is better that we do
the free space reservation for inode cache truncation by ourselves.
Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The filesystem with inode cache was forced to be read-only when we umounted it.
Steps to reproduce:
# mkfs.btrfs -f ${DEV}
# mount -o inode_cache ${DEV} ${MNT}
# dd if=/dev/zero of=${MNT}/file1 bs=1M count=8192
# btrfs fi syn ${MNT}
# dd if=${MNT}/file1 of=/dev/null bs=1M
# rm -f ${MNT}/file1
# btrfs fi syn ${MNT}
# umount ${MNT}
It is because there was no enough space to do inode cache truncation, and then
we aborted the current transaction.
But no space error is not a serious problem when we write out the inode cache,
and it is safe that we just skip this step if we meet this problem. So we need
not abort the current transaction.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Raid5 with 3 devices is well defined while the old logic allowed
raid5 only with a minimum of 4 devices when converting the block group
profile via btrfs balance. Creating a raid5 with just three devices
using mkfs.btrfs worked always as expected. This is now fixed and the
whole logic is rewritten.
Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
In replace_path(), if read_tree_block() fails, we cannot return
directly, we should free some allocated memory otherwise memory
leak happens.
Similar to Wang's "Btrfs: fix possible memory leak in the
find_parent_nodes()" patch, the current commit fixes an issue that
is related to the "Btrfs: fix all callers of read_tree_block"
commit.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
In the find_parent_nodes(), if read_tree_block() fails, we can
not return directly, we should free some allocated memory otherwise
memory leak happens.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This is not yet supported and causes crashes. One sad user reported
that it destroyed his filesystem.
One failure is in __btrfs_map_block+0xc1f calling kmalloc(0).
0x5f21f is in __btrfs_map_block (fs/btrfs/volumes.c:4923).
4918 num_stripes = map->num_stripes;
4919 max_errors = nr_parity_stripes(map);
4920
4921 raid_map = kmalloc(sizeof(u64) * num_stripes,
4922 GFP_NOFS);
4923 if (!raid_map) {
4924 ret = -ENOMEM;
4925 goto out;
4926 }
4927
There might be more issues. Until this is really tested, don't allow
users to start the procedure on RAID5/RAID6 filesystems.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Chris hit a bug where we weren't finding extent records when running extent ops.
This is because we use the delayed_ref_head when running the extent op, which
means we can't use the ->type checks to see if we are metadata. We also lose
the level of the metadata we are working on. So to fix this we can just check
the ->is_data section of the extent_op, and we can store the level of the buffer
we were modifying in the extent_op. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This catches block groups that are too large to properly cache. We deal with
this case fine, so the warning just confuses users. Remove the warning.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I'm sorry, theres no excuse for this sort of work. We need to use
root->leafsize since eb may be NULL. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The search ioctl skips items that are too large for a result buffer, but
inline items of a certain size occuring before any search result is
found would trigger an overflow and stop the search entirely.
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=57641
Cc: stable@vger.kernel.org
Signed-off-by: Gabriel de Perthuis <g2p.code+btrfs@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
lock_extent/unlock_extent expect an exclusive end.
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Quota tree has been missing from lockdep annotations, though no warning
has been seen in the wild.
There's currently one entry that does not belong there,
BTRFS_ORPHAN_OBJECTID. No such tree exists, it's probably a copy &
paste mistake, the id is defined among tree ids.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Pull btrfs update from Chris Mason:
"These are mostly fixes. The biggest exceptions are Josef's skinny
extents and Jan Schmidt's code to rebuild our quota indexes if they
get out of sync (or you enable quotas on an existing filesystem).
The skinny extents are off by default because they are a new variation
on the extent allocation tree format. btrfstune -x enables them, and
the new format makes the extent allocation tree about 30% smaller.
I rebased this a few days ago to rework Dave Sterba's crc checks on
the super block, but almost all of these go back to rc6, since I
though 3.9 was due any minute.
The biggest missing fix is the tracepoint bug that was hit late in
3.9. I ran into problems with that in overnight testing and I'm still
tracking it down. I'll definitely have that fixed for rc2."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (101 commits)
Btrfs: allow superblock mismatch from older mkfs
btrfs: enhance superblock checks
btrfs: fix misleading variable name for flags
btrfs: use unsigned long type for extent state bits
Btrfs: improve the loop of scrub_stripe
btrfs: read entire device info under lock
btrfs: remove unused gfp mask parameter from release_extent_buffer callchain
btrfs: handle errors returned from get_tree_block_key
btrfs: make static code static & remove dead code
Btrfs: deal with errors in write_dev_supers
Btrfs: remove almost all of the BUG()'s from tree-log.c
Btrfs: deal with free space cache errors while replaying log
Btrfs: automatic rescan after "quota enable" command
Btrfs: rescan for qgroups
Btrfs: split btrfs_qgroup_account_ref into four functions
Btrfs: allocate new chunks if the space is not enough for global rsv
Btrfs: separate sequence numbers for delayed ref tracking and tree mod log
btrfs: move leak debug code to functions
Btrfs: return free space in cow error path
Btrfs: set UUID in root_item for created trees
...
Pull block core updates from Jens Axboe:
- Major bit is Kents prep work for immutable bio vecs.
- Stable candidate fix for a scheduling-while-atomic in the queue
bypass operation.
- Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
discard bios.
- Tejuns changes to convert the writeback thread pool to the generic
workqueue mechanism.
- Runtime PM framework, SCSI patches exists on top of these in James'
tree.
- A few random fixes.
* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
relay: move remove_buf_file inside relay_close_buf
partitions/efi.c: replace useless kzalloc's by kmalloc's
fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
block: fix max discard sectors limit
blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
Documentation: cfq-iosched: update documentation help for cfq tunables
writeback: expose the bdi_wq workqueue
writeback: replace custom worker pool implementation with unbound workqueue
writeback: remove unused bdi_pending_list
aoe: Fix unitialized var usage
bio-integrity: Add explicit field for owner of bip_buf
block: Add an explicit bio flag for bios that own their bvec
block: Add bio_alloc_pages()
block: Convert some code to bio_for_each_segment_all()
block: Add bio_for_each_segment_all()
bounce: Refactor __blk_queue_bounce to not use bi_io_vec
raid1: use bio_copy_data()
pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
pktcdvd: use bio_copy_data()
block: Add bio_copy_data()
...
We've added new checks to make sure the super block crc is correct
during mount. A fresh filesystem from an older mkfs won't have the
crc set. This adds a warning when it finds a newly created filesystem
but doesn't fail the mount.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The superblock checksum is not verified upon mount. <awkward silence>
Add that check and also reorder existing checks to a more logical
order.
Current mkfs.btrfs does not calculate the correct checksum of
super_block and thus a freshly created filesytem will fail to mount when
this patch is applied.
First transaction commit calculates correct superblock checksum and
saves it to disk.
Reproducer:
$ mfks.btrfs /dev/sda
$ mount /dev/sda /mnt
$ btrfs scrub start /mnt
$ sleep 5
$ btrfs scrub status /mnt
... super:2 ...
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The variable was named 'data' in btrfs_reserve_extent and that's the
only function that actually uses it to let btrfs_get_alloc_profile know
what profile we want. Then it's passed down as u64 flags.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
1) Right now scrub_stripe() is looping in some unnecessary cases:
* when the found extent item's objectid has been out of the dev extent's range
but we haven't finish scanning all the range within the dev extent
* when all the items has been processed but we haven't finish scanning all the
range within the dev extent
In both cases, we can just finish the loop to save costs.
2) Besides, when the found extent item's length is larger than the stripe
len(64k), we don't have to release the path and search again as it'll get at the
same key used in the last loop, we can instead increase the logical cursor in
place till all space of the extent is scanned.
3) And we use 0 as the key's offset to search btree, then get to previous item
to find a smaller item, and again have to move to the next one to get the right
item. Setting offset=-1 and previous_item() is the correct way.
4) As we won't find any checksum at offset unless this 'offset' is in a data
extent, we can just find checksum when we're really going to scrub an extent.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There's a theoretical possibility of reading stale (or even more
theoretically, freed) data from DEV_INFO ioctl when the device would
disappear between an early mutex unlock and data being copied from the
device structure.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.
removed functions:
btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()
btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.
ulist.c functions are left, another patch will take care of those.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If you try to mount -o loop a restored file system it will panic if the file
ends up being smaller than the original disk. This is because we go to try and
get a block for a super that may be past the EOF which makes __getblk return
NULL for a buffer head when we aren't expecting it to. Fix this by dealing with
this case and just jacking up the errors count. With this patch we no longer
panic when mounting a restored file system loopback. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There were a whole bunch and I was doing it for other things. I haven't tested
these error paths but at the very least this is better than panicing. I've only
left 2 BUG_ON()'s since they are logic errors and I want to replace them with a
ASSERT framework that we can compile out for production users. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
So everybody who got hit by my fsync bug will still continue to hit this
BUG_ON() in the free space cache, which is pretty heavy handed. So I took a
file system that had this bug and fixed up all the BUG_ON()'s and leaks that
popped up when I tried to mount a broken file system like this. With this patch
we just fail to mount instead of panicing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When qgroup tracking is enabled, we do an automatic cycle of the new rescan
mechanism.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If qgroup tracking is out of sync, a rescan operation can be started. It
iterates the complete extent tree and recalculates all qgroup tracking data.
This is an expensive operation and should not be used unless required.
A filesystem under rescan can still be umounted. The rescan continues on the
next mount. Status information is provided with a separate ioctl while a
rescan operation is in progress.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The function is separated into a preparation part and the three accounting
steps mentioned in the qgroups documentation. The goal is to make steps two
and three usable by the rescan functionality. A side effect is that the
function is restructured into readable subunits.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When running the 208th of xfstests, the fs returned the enospc
error when there was lots of free space in the disk.
By bisect debug, we found it was introduced by commit 96f1bb5777.
This commit makes the space check for the global reservation in
can_overcommit() be inconsistent with should_alloc_chunk().
can_overcommit() requires that the free space is 2 times the size
of the global reservation, or we can't do overcommit. And instead,
we need reclaim some reserved space, and if we still don't have
enough free space, we need allocate a new chunk. But unfortunately,
should_alloc_chunk() just requires that the free space is 1 time
the size of the global reservation, that is we would not try to
allocate a new chunk if the free space size is in the middle of
these two requires, and just return the enospc error. Fix it.
Cc: Jim Schutt <jaschut@sandia.gov>
Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Sequence numbers for delayed refs have been introduced in the first version
of the qgroup patch set. To solve the problem of find_all_roots on a busy
file system, the tree mod log was introduced. The sequence numbers for that
were simply shared between those two users.
However, at one point in qgroup's quota accounting, there's a statement
accessing the previous sequence number, that's still just doing (seq - 1)
just as it would have to in the very first version.
To satisfy that requirement, this patch makes the sequence number counter 64
bit and splits it into a major part (used for qgroup sequence number
counting) and a minor part (incremented for each tree modification in the
log). This enables us to go exactly one major step backwards, as required
for qgroups, while still incrementing the sequence counter for tree mod log
insertions to keep track of their order. Keeping them in a single variable
means there's no need to change all the code dealing with comparisons of two
sequence numbers.
The sequence number is reset to 0 on commit (not new in this patch), which
ensures we won't overflow the two 32 bit counters.
Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs
from the tree mod log code may happen.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Clean up the leak debugging in extent_io.c by moving
the debug code into functions. This also removes the
list_heads used for debugging from the extent_buffer
and extent_state structures when debug is not enabled.
Since we need a global debug config to do that last
part, implement CONFIG_BTRFS_DEBUG to accommodate.
Thanks to Dave Sterba for the Kconfig bit.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Replace some BUG_ONs with proper handling and take allocated space back to
free space cache for later use.
We don't have to worry about extent maps since they'd be freed in releasepage
path.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
It is a rare exception that a new tree is created, like the qgroups
tree. So far these new trees have an all-zero UUID in their root
items. All trees that mkfs.btrfs has created get an UUID during the
first mount when btrfs_read_root_item() rewrites the root_item to
the v2 structure style. These UUID are never used so far, but
anyway, since it is better to have it uniform for all trees, this
commit adds some lines that generate and write an UUID for newly
created trees.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
fget() returns NULL if error. So, we should check NULL or not.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I have a broken file system that when it aborts leaves all sorts of accounting
things wrong and gives you lots of WARN_ON()'s other than the abort. This is
because we're not cleaning up various parts of the file system when we abort.
The first chunks are specific to mount failures, we weren't cleaning up the
block group cached inodes and we weren't cleaning up any transactions that had
been aborted, which leaves a bunch of things laying around.
The second half of this are related to the cleanup parts. First we don't need
to release space for the dirty pages from the trans_block_rsv, that's all
handled by the trans handles so this is just plain wrong. The other thing is we
need to pin down extents that were set ->must_insert_reserved for delayed refs.
This isn't so much for the pinning but more for the cleaning up the
cache->reserved counter since we are no longer going to use those reserved
bytes. With this patch I no longer see a bunch of WARN_ON()'s when I try to
mount this broken file system, just the initial one from the abort. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We can just look up the extent_buffers for the range and free stuff that way.
This makes the cleanup a bit cleaner and we can make sure to evict the
extent_buffers pretty quickly by marking them as stale. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need to check the return value of the commit in case something goes wrong,
otherwise we could end up going down the line and doing more stuff (like orphan
cleanup) before we notice we should have errored out. We need to do this before
we free up the log_tree_root since the caller will handle all of that. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We can run the tree logging recovery or the orphan cleanup on mount, so we'll
end up looking up a random fs tree in the meantime. So we need to clean this up
so we don't leave extent buffers hanging around on the cache. With this patch
we no longer leak extent buffers on failure to mount. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This is the same as the fix from commit
Btrfs: fix bad extent logging
but for O_DIRECT. I missed this when I fixed the problem originally, we were
still using the em for the orig_start and orig_block_len, which would be the
merged extent. We need to use the actual extent from the on disk file extent
item, which we have to lookup to make sure it's ok to nocow anyway so just pass
in some pointers to hold this info. Thanks,
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We kept leaking extent buffers when mounting a broken file system and it turns
out it's because not everybody uses read_tree_block properly. You need to check
and make sure the extent_buffer is uptodate before you use it. This patch fixes
everybody who calls read_tree_block directly to make sure they check that it is
uptodate and free it and return an error if it is not. With this we no longer
leak EB's when things go horribly wrong. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we fail to load block groups halfway through we can leave extent_state's on
the excluded tree. This is because we just lookup the supers and add them to
the excluded tree regardless of which block group we are looking at currently.
This is a problem because we remove the excluded extents for the range of the
block group only, so if we don't ever load a block group for one of the excluded
extents we won't ever free it. This fixes the problem by only adding excluded
extents if it falls in the block group range we care about. With this patch
we're no longer leaking space when we fail to read all of the block groups.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
With a users corrupted fs I was getting weird behavior and panics and it turns
out it was because one of his tree blocks had a bogus header level. So add this
to the sanity checks in the endio handler for tree blocks. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user sent me a btrfs-image that was panicing because of some corruption. This
is because we pass in a bogus value to btrfs_num_copies, and it panics. Instead
just return 1. We only call btrfs_num_copies to see if there are other copies
to try and read for things, so if we just return 1 it will make the callers exit
out with an appropriate error value. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Martin Steigerwald reported a BUG_ON() where we were given a bogus bytenr to
map. Turns out he is using > PAGESIZE leafsizes. The readahead stuff is called
every time we do a completion, but we may not have finished reading in all the
pages, so the bytenr we read off the node could be completely bogus. Fix this
by only calling the readahead hook once all pages have been read in. Thanks,
Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Martin Steigerwald reported a BUG_ON() in btrfs_map_block where we didn't find
a chunk for a particular block we were trying to map. This happened because the
block was bogus. We shouldn't be BUG_ON()'ing in this case, just print a
message and return an error. This came from reada_add_block and it appears to
deal with an error fine so we should be good there. Thanks,
Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need to tag metadata io with REQ_META to avoid priority inversion when using
io throttling cqroups. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
So I noticed there is an infinite loop in the slow caching code. If we return 1
when we hit the end of the tree, so we could end up caching the last block group
the slow way and suddenly we're looping forever because we just keep
re-searching and trying again. Fix this by only doing btrfs_next_leaf() if we
don't need_resched(). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The locking order for stuff is
__sb_start_write
ordered_mutex
but with sync() we don't do __sb_start_write for some strange reason, which
means that our iput in wait_ordered_extents could start a transaction which does
the __sb_start_write while we're holding the ordered_mutex. Fix this by using
delayed iput in sync. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Since all the quota configurations are loaded in memory, and we can
have ioctl checks before operating in the disk. It is safe to do such
things because qgroup_ioctl_lock is held outside.
Without these extra checks firstly, it should be ok to do user change
for quota operations. For example:
if we want to add an existed qgroup, we will do:
->add_qgroup_item()
->add_qgroup_rb()
add_qgroup_item() will return -EEXIST to us, however, qgroups are all
in memory, why not check them in memory firstly.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
ulist_add() may return -ENOMEM, fix missing check about
return value.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
For created snapshots, the full root_item is copied from the source
root and afterwards selectively modified. The current code forgets
to clear the field received_uuid. The only problem is that it is
confusing when you look at it with 'btrfs subv list', since for
writable snapshots, the contents of the snapshot can be completely
unrelated to the previously received snapshot.
The receiver ignores such snapshots anyway because he also checks
the field stransid in the root_item and that value used to be reset
to zero for all created snapshots.
This commit changes two things:
- clear the received_uuid field for new writable snapshots.
- don't clear the send/receive related information like the stransid
for read-only snapshots (which makes them useable as a parent for
the automatic selection of parents in the receive code).
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave reported a BUG_ON() that happened in end_page_writeback() after an abort.
This happened because we unconditionally call end_page_writeback() in the endio
case, which is right. However when we abort the transaction we will call
end_page_writeback() on any writeback pages we find, which is wrong. We need to
lock the page and wait on page writeback to complete if it is. There is nothing
unsafe about this since we are discarding the transaction anyway. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need such a sanity check for wrong start when we defrag a file, otherwise,
even with a wrong start that's larger than file size, we can end up changing
not only inode's force compress flag but also FS's incompat flags.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This fixes the following errors:
fs/btrfs/reada.c: In function ‘btrfs_reada_wait’:
fs/btrfs/reada.c:958:42: error: invalid operands to binary < (have ‘atomic_t’ and ‘int’)
fs/btrfs/reada.c:961:41: error: invalid operands to binary < (have ‘atomic_t’ and ‘int’)
Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: linux-btrfs@vger.kernel.org
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Argument 'trans' became unnecessary from setup_inline_extent_backref()
that called btrfs_extend_item().
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Argument 'trans' is not used in btrfs_extend_item().
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If argument 'trans' is unnecessary in the function where
fixup_low_keys() is called, 'trans' is deleted.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Argument 'trans' is not used in fixup_low_keys(). So, remove it.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Step to reproduce:
mkfs.btrfs <disk>
mount <disk> <mnt>
dd if=/dev/zero of=/<mnt>/data bs=1M count=10
sync
btrfs quota enable <mnt>
btrfs qgroup create 0/5 <mnt>
btrfs qgroup limit 5M 0/5 <mnt>
rm -f /<mnt>/data
sync
btrfs qgroup show <mnt>
dd if=/dev/zero of=data bs=1M count=1
>From the perspective of users, qgroup's referenced or exclusive
is negative,but user can not continue to write data! a workaround
way is to cast u64 to s64 when doing qgroup reservation.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If out of memory happens, we should return -ENOMEM directly to the caller
rather than continue the work.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
In the comment describing the sync_writers field of the btrfs_inode
struct, "fsyncing" was misspelled "fsycing."
Signed-off-by: Nathaniel Yazdani <n1ght.4nd.d4y@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When tree_mod_log_rewind decides to make a copy of the current tree buffer
for its modifications, it subsequently freed the buffer before unlocking it.
Obviously, those operations are required in reverse order.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The tree mod log functions were accessing root->node->... directly, without
use of btrfs_root_node() or explicit rcu locking. This could lead to an
extent buffer reference being leaked and another reference being freed too
early when preemtion was enabled.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Commit d9abbf1c changed tree mod log locking around ROOT_REPLACE operations.
When a tree root is split, however, we were logging removal of all elements
from the root node before logging removal of half of the elements for the
split operation. This leads to a BUG_ON when rewinding.
This commit removes the erroneous logging of removal of all elements.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The following case will make the incompat/compat flag of the super block
be recovered.
Task1 |Task2
flags = btrfs_super_incompat_flags(); |
|flags = btrfs_super_incompat_flags();
flags |= new_flag1; |
|flags |= new_flag2;
btrfs_set_super_incompat_flags(flags); |
|btrfs_set_super_incompat_flags(flags);
the new_flag1 is recovered.
In order to avoid this problem, we introduce a lock named super_lock into
the btrfs_fs_info structure. If we want to update incompat/compat flags
of the super block, we must hold it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The new mount option is set after parsing the remount arguments,
so it is wrong that checking the autodefrag is close or not at
btrfs_remount_prepare(). Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Walking backref tree and btrfs quota rely on ulist very much.
This patch tries to use rb_tree to speed up search time.
The original code always checks whether an element
exists before adding a new element, however it costs O(n).
I try to add a rb_tree in the ulist,this is only used to speed up
search. I also do some measurements with quota enabled.
fsstress -p 4 -n 10000
Without this path:
real 0m51.058s 2m4.745s 1m28.222s 1m5.137s
user 0m0.035s 0m0.041s 0m0.105s 0m0.100s
sys 0m12.009s 0m11.246s 0m10.901s 0m10.999s 0m11.287s
With this path:
real 0m55.295s 0m50.960s 1m2.214s 0m48.273s
user 0m0.053s 0m0.095s 0m0.135s 0m0.107s
sys 0m7.766s 0m6.013s 0m6.319s 0m6.030s 0m6.532s
After applying the patch,the execute time is down by ~42%.(11.287s->6.532s)
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Two new flags are added to allow omitting the stream header and the
end command for btrfs send streams. This is used in cases where you
send multiple snapshots back-to-back in one stream.
This used to be encoded like this (with 2 snapshots in this example):
<stream header> + <sequence of commands> + <end cmd> +
<stream header> + <sequence of commands> + <end cmd> + EOF
The new format (if the two new flags are used) is this one:
<stream header> + <sequence of commands> +
<sequence of commands> + <end cmd>
Note that the currently existing receivers treat <end cmd> only as
an indication that a new <stream header> is following. This means,
you can just skip the sequence <end cmd> <stream header> without
loosing compatibility. As long as an EOF is following, the currently
existing receivers handle the new format (if the two new flags are
used) exactly as the old one.
So what is the benefit of this change? The goal is to be able to use
a single stream (one TCP connection) to multiplex a request/response
handshake plus Btrfs send streams, all in the same stream. In this
case you cannot evaluate an EOF condition as an end of the Btrfs send
stream. You need something else, and the <end cmd> is just perfect
for this purpose.
The summary is:
The format change is driven by the need to send several Btrfs send
streams over a single TCP connections, with the ability for a repeated
request/response handshake in the middle. And this format change does
not break any existing tool, it is completely compatible.
You could compare the old behaviour of the Btrfs send stream to the
one of ftp where you need a seperate request/response channel and
newly opened data transfer channels for each file, while the new
behaviour is more like http using a single stream for everything.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
__merge_refs() always return 0, it is unnecessary
for the caller to check the return value.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The only error return value of __add_prelim_ref() is -ENOMEM,
just return errors rather than trigger BUG_ON().
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Steps to reproduce:
mkfs.btrfs <disk>
mount <disk> <mnt>
btrfs quota enable <mnt>
btrfs sub create <mnt>/subv
btrfs qgroup limit 10K <mnt>/subv
btrfs quota disable <mnt>/subv
It is wrong for qgroup to reserve when disabling quota,
so just use tree_root to avoid edquot when disabling quota.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Step to reproduce:
mkfs.btrfs <disk>
mount <disk> <mnt>
btrfs quota enable <mnt>
btrfs qgroup limit 0/1 <mnt>
dmesg
If the relative qgroup dosen't exist, flag 'BTRFS_QGROUP_STATUS_
FLAG_INCONSISTENT' will be set, and print the noise message.
This is wrong, we can just move find_qgroup_rb() before
update_qgroup_limit_item().this dosen't change the logic of the
function. But it can avoid unnecessary noise message and wrong set of flag.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The original code forgot to check 'inherit', we should
gurantee that all the qgroups in the struct 'inherit' exist.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Step to reproduce:
mkfs.btrfs <disk>
mount <disk> <mnt>
btrfs quota enable <mnt>
btrfs qgroup assign 0/1 1/1 <mnt>
umount <mnt>
btrfs-debug-tree <disk> | grep QGROUP
If we want to add a qgroup relation, we should gurantee that
'src' and 'dst' exist, otherwise, such qgroup relation should
not be allowed to create.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We use mutex lock to protect all the user change operations.
So when we are calling find_qgroup_rb() to check whether qgroup
exists, we don't have to hold spin_lock.
Besides, when enabling/disabling quota, it must be single thread
when operations come here. spin lock must be firstly used to
clear quota_root when disabling quota, while enabling quota, spin
lock must be used to complete the last assign work.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The original code has one spin_lock 'qgroup_lock' to protect quota
configurations in memory. If we want to add a BTRFS_QGROUP_INFO_KEY,
it will be added to Btree firstly, and then update configurations in
memory,however, a race condition may happen between these operations.
For example:
->add_qgroup_info_item()
->add_qgroup_rb()
For the above case, del_qgroup_info_item() may happen just before
add_qgroup_rb().
What's worse, when we want to add a qgroup relation:
->add_qgroup_relation_item()
->add_qgroup_relations()
We don't have any checks whether 'src' and 'dst' exist before
add_qgroup_relation_item(), a race condition can also happen for
the above case.
To avoid race condition and have all the necessary checks, we introduce
a mutex lock 'qgroup_ioctl_lock', and we make all the user change operations
protected by the mutex lock.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
__btrfs_unlink_inode() aborts its transaction when it sees errors after
it removes the directory item. But it missed the case where
btrfs_del_dir_entries_in_log() returns an error. If this happens then
the unlink appears to fail but the items have been removed without
updating the directory size. The directory then has leaked bytes in
i_size and can never be removed.
Adding the missing transaction abort at least makes this failure
consistent with the other failure cases.
I noticed this while reading the code after someone on irc reported
having a directory with i_size but no entries. I tested it by forcing
btrfs_del_dir_entries_in_log() to return -ENOMEM.
Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This:
# mkfs.btrfs /dev/sdb{1,2} ; wipefs -a /dev/sdb1; mount /dev/sdb2 /mnt/test
would lead to a blkdev open/close mismatch when the mount fails, and
a permanently busy (opened O_EXCL) sdb2:
# wipefs -a /dev/sdb2
wipefs: error: /dev/sdb2: probing initialization failed: Device or resource busy
It's because btrfs_open_devices() may open some devices, fail on
the last one, and return that failure stored in "ret." The mount
then fails, but the caller then does not clean up the open devices.
Chris assures me that:
"btrfs_open_devices just means: go off and open every bdev you can from
this uuid. It should return success if we opened any of them at all."
So change the logic to ignore any open failures; just skip processing
of that device. Later on it's decided whether we have enough devices
to continue.
Reported-by: Jan Safranek <jsafrane@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
It is very likely that there are several blocks in bio, it is very
inefficient if we get their csums one by one. This patch improves
this problem by getting the csums in batch.
According to the result of the following test, the execute time of
__btrfs_lookup_bio_sums() is down by ~28%(300us -> 217us).
# dd if=<mnt>/file of=/dev/null bs=1M count=1024
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user sent me a btrfs-image of a file system that was panicing on mount during
the log recovery. I had originally thought these problems were from a bug in
the free space cache code, but that was just a symptom of the problem. The
problem is if your application does something like this
[prealloc][prealloc][prealloc]
the internal extent maps will merge those all together into one extent map, even
though on disk they are 3 separate extents. So if you go to write into one of
these ranges the extent map will be right since we use the physical extent when
doing the write, but when we log the extents they will use the wrong sizes for
the remainder prealloc space. If this doesn't happen to trip up the free space
cache (which it won't in a lot of cases) then you will get bogus entries in your
extent tree which will screw stuff up later. The data and such will still work,
but everything else is broken. This patch fixes this by not allowing extents
that are on the modified list to be merged. This has the side effect that we
are no longer adding everything to the modified list all the time, which means
we now have to call btrfs_drop_extents every time we log an extent into the
tree. So this allows me to drop all this speciality code I was using to get
around calling btrfs_drop_extents. With this patch the testcase I've created no
longer creates a bogus file system after replaying the log. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When logging changed extents I was logging ram_bytes as the current length,
which isn't correct, it's supposed to be the ram bytes of the original extent.
This is for compression where even if we split the extent we need to know the
ram bytes so when we uncompress the extent we know how big it will be. This was
still working out right with compression for some reason but I think we were
getting lucky. It was definitely off for prealloc which is why I noticed it,
btrfsck was complaining about it. With this patch btrfsck no longer complains
after a log replay. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave was hitting a lockdep warning because we're now properly taking the ordered
operations mutex in the ordered wait stuff. This is because some cases we will
have a trans handle when we are flushing delalloc space, but we can't wait on
ordered extents because we could potentially deadlock, so fix this by not doing
the wait if we have a trans handle. Thanks
Reported-and-tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I noticed that we will add a block group to the space info before we add it to
the block group cache rb tree, so we could potentially allocate from the block
group before it's able to be searched for. I don't think this is too much of
a problem, the race window is microscopic, but just in case move the tree
insertion to above the space info linking. This makes it easier to adjust the
error handling as well, so we can remove a couple of BUG_ON(ret)'s and have real
error handling setup for these scenarios. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If btrfs_find_all_roots() fails, 'roots' has been freed or 'roots'
fails to allocate. We don't need to free it outside btrfs_find_all_roots()
again.Fix it.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The reason that BUG_ON() happens in these places is just
because of ENOMEM.
We try ro return ENOMEM rather than trigger BUG_ON(), the
caller will abort the transaction thus avoiding the kernel panic.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user reported a panic while running a balance. What was happening was he was
relocating a block, which added the reference to the relocation tree. Then
relocation would walk through the relocation tree and drop that reference and
free that block, and then it would walk down a snapshot which referenced the
same block and add another ref to the block. The problem is this was all
happening in the same transaction, so the parent block was free'ed up when we
drop our reference which was immediately available for allocation, and then it
was used _again_ to add a reference for the same block from a different
snapshot. This resulted in something like this in the delayed ref tree
add ref to 90234880, parent=2067398656, ref_root 1766, level 1
del ref to 90234880, parent=2067398656, ref_root 18446744073709551608, level 1
add ref to 90234880, parent=2067398656, ref_root 1767, level 1
as you can see the ref_root's don't match, because when we inc the ref we use
the header owner, which is the original tree the block belonged to, instead of
the data reloc tree. Then when we remove the extent we use the reloc tree
objectid. But none of this matters, since it is a shared reference which means
only the parent matters. When the delayed ref stuff runs it adds all the
increments first, and then does all the drops, to make sure that we don't delete
the ref if we net a positive ref count. But tree blocks aren't allowed to have
multiple refs from the same block, so this panics when it tries to add the
second ref. We need the add and the drop to cancel each other out in memory so
we only do the final add.
So to fix this we need to adjust how the delayed refs are added to the tree.
Only the ref_root matters when it is a normal backref, and only the parent
matters when it is a shared backref. So make our decision based on what ref
type we have. This allows us to keep the ref_root in memory in case anybody
wants to use it for something else, and it allows the delayed refs to be merged
properly so we don't end up with this panic.
With this patch the users image no longer panics on mount, and it has a clean
fsck after a normal mount/umount cycle. Thanks,
Cc: stable@vger.kernel.org
Reported-by: Roman Mamedov <rm@romanrm.ru>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Testing my enospc log code I managed to abort a transaction during mount, which
put me into an infinite loop. This is because of two things, first we don't
reset trans_no_join if we abort during transaction commit, which will force
anybody trying to start a transaction to just loop endlessly waiting for it to
be set to 0. But this is still just a symptom, the second issue is we don't set
the fs state to error during errors on mount. This is because we don't want to
do the flip read only thing during mount, but we still really want to set the fs
state to an error to keep us from even getting to the trans_no_join check. So
fix both of these things, make sure to reset trans_no_join if we abort during a
commit, and make sure we set the fs state to error no matter if we're mounting
or not. This should keep us from getting into this infinite loop again.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Steps to reproduce:
mkfs.btrfs <disk>
mount <disk> <mnt>
btrfs quota enable <mnt>
btrfs sub create <mnt>/subv
i=1
while [ $i -le 10000 ]
do
dd if=/dev/zero of=<mnt>/subv/data_$i bs=1K count=1
i=$(($i+1))
if [ $i -eq 500 ]
then
btrfs quota disable $mnt
fi
done
dmesg
Obviously, this warn_on() is unnecessary, and it will be easily triggered.
Just remove it.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
set_extent_bit()'s (u64 *failed_start) expects NULL not 0.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The subvolume ioctls block on the parent directory mutex that can be
held by other concurrent snapshot activity for a long time. Give the
user at least some chance to get out of this situation by allowing
to send a kill signal.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The messages
btrfs: unlinked 123 orphans
btrfs: truncated 456 orphans
are not useful to regular users and raise questions whether there are
problems with the filesystem.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This mount option was a workaround when subvol= assumed path relative
to the default subvolume, not the toplevel one. This was fixed long time
ago and subvolrootid has no effect.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
With more than one btrfs volume mounted, it can be very difficult to find
out which volume is hitting an error. btrfs_error() will print this, but
it is currently rigged as more of a fatal error handler, while many of
the printk()s are currently for debugging and yet-unhandled cases.
This patch just changes the functions where the device information is
already available. Some cases remain where the root or fs_info is not
passed to the function emitting the error.
This may introduce some confusion with volumes backed by multiple devices
emitting errors referring to the primary device in the set instead of the
one on which the error occurred.
Use btrfs_printk(fs_info, format, ...) rather than writing the device
string every time, and introduce macro wrappers ala XFS for brevity.
Since the function already cannot be used for continuations, print a
newline as part of the btrfs_printk() message rather than at each caller.
Signed-off-by: Simon Kirby <sim@hostway.ca>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The Kconfig title does not make much sense after the cleanup of
CONFIG_EXPERIMENTAL option, align the wording with other filesystems.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Each time pick one dead root from the list and let the caller know if
it's needed to continue. This should improve responsiveness during
umount and balance which at some point waits for cleaning all currently
queued dead roots.
A new dead root is added to the end of the list, so the snapshots
disappear in the order of deletion.
The snapshot cleaning work is now done only from the cleaner thread and the
others wake it if needed.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We currently store the first key of the tree block inside the reference for the
tree block in the extent tree. This takes up quite a bit of space. Make a new
key type for metadata which holds the level as the offset and completely removes
storing the btrfs_tree_block_info inside the extent ref. This reduces the size
from 51 bytes to 33 bytes per extent reference for each tree block. In practice
this results in a 30-35% decrease in the size of our extent tree, which means we
COW less and can keep more of the extent tree in memory which makes our heavy
metadata operations go much faster. This is not an automatic format change, you
must enable it at mkfs time or with btrfstune. This patch deals with having
metadata stored as either the old format or the new format so it is easy to
convert. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
free_root_pointers() has been introduced to cleanup all of tree roots,
so just use it instead.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The transaction abort stacktrace is printed only once per module
lifetime, but we'd like to see it each time it happens per mounted
filesystem. Introduce a fs_state flag that records it.
Tweak the messages around abort:
* add error number to the first abort
* print the exact negative errno from btrfs_decode_error
* clean up btrfs_decode_error and callers
* no dots at the end of the messages
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We keep hitting bugs in the tree log replay because btrfs_remove_free_space
doesn't account for some corner case. So add a bunch of tests to try and fully
test btrfs_remove_free_space since the only time it is called is during tree log
replay. These tests all finish successfully, so as we find more of these bugs
we need to add to these tests to make sure we don't regress in fixing things.
I've hidden the tests behind a Kconfig option, but they take no time to run so
all btrfs developers should have this turned on all the time. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Pull VFS updates from Al Viro,
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).
7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...
Pull one more btrfs fix from Chris Mason:
"This has a recent fix from Josef for our tree log replay code. It
fixes problems where the inode counter for the number of bytes in the
file wasn't getting updated properly during fsync replay.
The commit did get rebased this morning, but it was only to clean up
the subject line. The code hasn't changed."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: make sure nbytes are right after log replay
While trying to track down a tree log replay bug I noticed that fsck was always
complaining about nbytes not being right for our fsynced file. That is because
the new fsync stuff doesn't wait for ordered extents to complete, so the inodes
nbytes are not necessarily updated properly when we log it. So to fix this we
need to set nbytes to whatever it is on the inode that is on disk, so when we
replay the extents we can just add the bytes that are being added as we replay
the extent. This makes it work for the case that we have the wrong nbytes or
the case that we logged everything and nbytes is actually correct. With this
I'm no longer getting nbytes errors out of btrfsck.
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Tejun writes:
-----
This is the pull request for the earlier patchset[1] with the same
name. It's only three patches (the first one was committed to
workqueue tree) but the merge strategy is a bit involved due to the
dependencies.
* Because the conversion needs features from wq/for-3.10,
block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts
with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those
workqueue conflicts from flaring up in block tree.
* Resolving the issue that Jan and Dave raised about debugging
requires arch-wide changes. The patchset is being worked on[2] but
it'll have to go through -mm after these changes show up in -next,
and not included in this pull request.
The three commits are located in the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue
Pulling it into block/for-3.10/core produces a conflict in
drivers/md/raid5.c between the following two commits.
e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
2f6db2a707 ("raid5: use bio_reset()")
The conflict is trivial - one removes an "if ()" conditional while the
other removes "rbi->bi_next = NULL" right above it. We just need to
remove both. The merged branch is available at
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge
so that you can use it for verification. The test merge commit has
proper merge description.
While these changes are a bit of pain to route, they make code simpler
and even have, while minute, measureable performance gain[3] even on a
workload which isn't particularly favorable to showing the benefits of
this conversion.
----
Fixed up the conflict.
Conflicts:
drivers/md/raid5.c
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull btrfs fixes from Chris Mason:
"We've had a busy two weeks of bug fixing. The biggest patches in here
are some long standing early-enospc problems (Josef) and a very old
race where compression and mmap combine forces to lose writes (me).
I'm fairly sure the mmap bug goes all the way back to the introduction
of the compression code, which is proof that fsx doesn't trigger every
possible mmap corner after all.
I'm sure you'll notice one of these is from this morning, it's a small
and isolated use-after-free fix in our scrub error reporting. I
double checked it here."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: don't drop path when printing out tree errors in scrub
Btrfs: fix wrong return value of btrfs_lookup_csum()
Btrfs: fix wrong reservation of csums
Btrfs: fix double free in the btrfs_qgroup_account_ref()
Btrfs: limit the global reserve to 512mb
Btrfs: hold the ordered operations mutex when waiting on ordered extents
Btrfs: fix space accounting for unlink and rename
Btrfs: fix space leak when we fail to reserve metadata space
Btrfs: fix EIO from btrfs send in is_extent_unchanged for punched holes
Btrfs: fix race between mmap writes and compression
Btrfs: fix memory leak in btrfs_create_tree()
Btrfs: fix locking on ROOT_REPLACE operations in tree mod log
Btrfs: fix missing qgroup reservation before fallocating
Btrfs: handle a bogus chunk tree nicely
Btrfs: update to use fs_state bit
A user reported a panic where we were panicing somewhere in
tree_backref_for_extent from scrub_print_warning. He only captured the trace
but looking at scrub_print_warning we drop the path right before we mess with
the extent buffer to print out a bunch of stuff, which isn't right. So fix this
by dropping the path after we use the eb if we need to. Thanks,
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we don't find the expected csum item, but find a csum item which is
adjacent to the specified extent, we should return -EFBIG, or we should
return -ENOENT. But btrfs_lookup_csum() return -EFBIG even the csum item
is not adjacent to the specified extent. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We reserve the space for csums only when we write data into a file, in
the other cases, such as tree log, log replay, we don't do reservation,
so we can use the reservation of the transaction handle just for the former.
And for the latter, we should use the tree's own reservation. But the
function - btrfs_csum_file_blocks() didn't differentiate between these
two types of the cases, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The function btrfs_find_all_roots is responsible to allocate
memory for 'roots' and free it if errors happen,so the caller should not
free it again since the work has been done.
Besides,'tmp' is allocated after the function btrfs_find_all_roots,
so we can return directly if btrfs_find_all_roots() fails.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user reported a problem where he was getting early ENOSPC with hundreds of
gigs of free data space and 6 gigs of free metadata space. This is because the
global block reserve was taking up the entire free metadata space. This is
ridiculous, we have infrastructure in place to throttle if we start using too
much of the global reserve, so instead of letting it get this huge just limit it
to 512mb so that users can still get work done. This allowed the user to
complete his rsync without issues. Thanks
Cc: stable@vger.kernel.org
Reported-and-tested-by: Stefan Priebe <s.priebe@profihost.ag>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need to hold the ordered_operations mutex while waiting on ordered extents
since we splice and run the ordered extents list. We need to make sure anybody
else who wants to wait on ordered extents does actually wait for them to be
completed. This will keep us from bailing out of flushing in case somebody is
already waiting on ordered extents to complete. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We are way over-reserving for unlink and rename. Rename is just some random
huge number and unlink accounts for tree log operations that don't actually
happen during unlink, not to mention the tree log doesn't take from the trans
block rsv anyway so it's completely useless. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave reported a warning when running xfstest 275. We have been leaking delalloc
metadata space when our reservations fail. This is because we were improperly
calculating how much space to free for our checksum reservations. The problem
is we would sometimes free up space that had already been freed in another
thread and we would end up with negative usage for the delalloc space. This
patch fixes the problem by calculating how much space the other threads would
have already freed, and then calculate how much space we need to free had we not
done the reservation at all, and then freeing any excess space. This makes
xfstests 275 no longer have leaked space. Thanks
Cc: stable@vger.kernel.org
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When you take a snapshot, punch a hole where there has been data, then take
another snapshot and try to send an incremental stream, btrfs send would
give you EIO. That is because is_extent_unchanged had no support for holes
being punched. With this patch, instead of returning EIO we just return
0 (== the extent is not unchanged) and we're good.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Cc: Alexander Block <ablock84@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Btrfs uses page_mkwrite to ensure stable pages during
crc calculations and mmap workloads. We call clear_page_dirty_for_io
before we do any crcs, and this forces any application with the file
mapped to wait for the crc to finish before it is allowed to change
the file.
With compression on, the clear_page_dirty_for_io step is happening after
we've compressed the pages. This means the applications might be
changing the pages while we are compressing them, and some of those
modifications might not hit the disk.
This commit adds the clear_page_dirty_for_io before compression starts
and makes sure to redirty the page if we have to fallback to
uncompressed IO as well.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Alexandre Oliva <oliva@gnu.org>
cc: stable@vger.kernel.org
Bunch of places in the code weren't using it where they could be -
this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx
into a struct bvec_iter.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: "Ed L. Cashin" <ecashin@coraid.com>
CC: Nick Piggin <npiggin@kernel.dk>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Jim Paris <jim@jtan.com>
CC: Geoff Levand <geoff@infradead.org>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ed Cashin <ecashin@coraid.com>
Just a little convenience macro - main reason to add it now is preparing
for immutable bio vecs, it'll reduce the size of the patch that puts
bi_sector/bi_size/bi_idx into a struct bvec_iter.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: linux-s390@vger.kernel.org
CC: Chris Mason <chris.mason@fusionio.com>
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
We should free leaf and root before returning from the error
handling code.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
To resolve backrefs, ROOT_REPLACE operations in the tree mod log are
required to be tied to at least one KEY_REMOVE_WHILE_FREEING operation.
Therefore, those operations must be enclosed by tree_mod_log_write_lock()
and tree_mod_log_write_unlock() calls.
Those calls are private to the tree_mod_log_* functions, which means that
removal of the elements of an old root node must be logged from
tree_mod_log_insert_root. This partly reverts and corrects commit ba1bfbd5
(Btrfs: fix a tree mod logging issue for root replacement operations).
This fixes the brand-new version of xfstest 276 as of commit cfe73f71.
Cc: stable@vger.kernel.org
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Steps to reproduce:
mkfs.btrfs <disk>
mount <disk> <mnt>
btrfs quota enable <mnt>
btrfs sub create <mnt>/subv
btrfs qgroup limit 10M <mnt>/subv
fallocate --length 20M <mnt>/subv/data
For the above example, fallocating will return successfully which
is not expected, we try to fix it by doing qgroup reservation before
fallocating.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If you restore a btrfs-image file system and try to mount that file system we'll
panic. That's because btrfs-image restores and just makes one big chunk to
envelope the whole disk, since they are really only meant to be messed with by
our btrfs-progs. So fix up btrfs_rmap_block and the callers of it for mount so
that we no longer panic but instead just return an error and fail to mount.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Now that we use bit operation to check fs_state, update
btrfs_free_fs_root()'s checker, otherwise we get back to
memory leak case.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fixes from Chris Mason:
"Eric's rcu barrier patch fixes a long standing problem with our
unmount code hanging on to devices in workqueue helpers. Liu Bo
nailed down a difficult assertion for in-memory extent mappings."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix warning of free_extent_map
Btrfs: fix warning when creating snapshots
Btrfs: return as soon as possible when edquot happens
Btrfs: return EIO if we have extent tree corruption
btrfs: use rcu_barrier() to wait for bdev puts at unmount
Btrfs: remove btrfs_try_spin_lock
Btrfs: get better concurrency for snapshot-aware defrag work
Users report that an extent map's list is still linked when it's actually
going to be freed from cache.
The story is that
a) when we're going to drop an extent map and may split this large one into
smaller ems, and if this large one is flagged as EXTENT_FLAG_LOGGING which means
that it's on the list to be logged, then the smaller ems split from it will also
be flagged as EXTENT_FLAG_LOGGING, and this is _not_ expected.
b) we'll keep ems from unlinking the list and freeing when they are flagged with
EXTENT_FLAG_LOGGING, because the log code holds one reference.
The end result is the warning, but the truth is that we set the flag
EXTENT_FLAG_LOGGING only during fsync.
So clear flag EXTENT_FLAG_LOGGING for extent maps split from a large one.
Reported-by: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Creating snapshot passes extent_root to commit its transaction,
but it can lead to the warning of checking root for quota in
the __btrfs_end_transaction() when someone else is committing
the current transaction. Since we've recorded the needed root
in trans_handle, just use it to get rid of the warning.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If one of qgroup fails to reserve firstly, we should return immediately,
it is unnecessary to continue check.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The callers of lookup_inline_extent_info all handle getting an error back
properly, so return an error if we have corruption instead of being a jerk and
panicing. Still WARN_ON() since this is kind of crucial and I've been seeing it
a bit too much recently for my taste, I think we're doing something wrong
somewhere. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Doing this would reliably fail with -EBUSY for me:
# mount /dev/sdb2 /mnt/scratch; umount /mnt/scratch; mkfs.btrfs -f /dev/sdb2
...
unable to open /dev/sdb2: Device or resource busy
because mkfs.btrfs tries to open the device O_EXCL, and somebody still has it.
Using systemtap to track bdev gets & puts shows a kworker thread doing a
blkdev put after mkfs attempts a get; this is left over from the unmount
path:
btrfs_close_devices
__btrfs_close_devices
call_rcu(&device->rcu, free_device);
free_device
INIT_WORK(&device->rcu_work, __free_device);
schedule_work(&device->rcu_work);
so unmount might complete before __free_device fires & does its blkdev_put.
Adding an rcu_barrier() to btrfs_close_devices() causes unmount to wait
until all blkdev_put()s are done, and the device is truly free once
unmount completes.
Cc: stable@vger.kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Remove a useless function declaration
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Using spinning case instead of blocking will result in better concurrency
overall.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull namespace bugfixes from Eric Biederman:
"This is three simple fixes against 3.9-rc1. I have tested each of
these fixes and verified they work correctly.
The userns oops in key_change_session_keyring and the BUG_ON triggered
by proc_ns_follow_link were found by Dave Jones.
I am including the enhancement for mount to only trigger requests of
filesystem modules here instead of delaying this for the 3.10 merge
window because it is both trivial and the kind of change that tends to
bit-rot if left untouched for two months."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Use nd_jump_link in proc_ns_follow_link
fs: Limit sys_mount to only request filesystem modules (Part 2).
fs: Limit sys_mount to only request filesystem modules.
userns: Stop oopsing in key_change_session_keyring
Pull btrfs fixes from Chris Mason:
"These are scattered fixes and one performance improvement. The
biggest functional change is in how we throttle metadata changes. The
new code bumps our average file creation rate up by ~13% in fs_mark,
and lowers CPU usage.
Stefan bisected out a regression in our allocation code that made
balance loop on extents larger than 256MB."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: improve the delayed inode throttling
Btrfs: fix a mismerge in btrfs_balance()
Btrfs: enforce min_bytes parameter during extent allocation
Btrfs: allow running defrag in parallel to administrative tasks
Btrfs: avoid deadlock on transaction waiting list
Btrfs: do not BUG_ON on aborted situation
Btrfs: do not BUG_ON in prepare_to_reloc
Btrfs: free all recorded tree blocks on error
Btrfs: build up error handling for merge_reloc_roots
Btrfs: check for NULL pointer in updating reloc roots
Btrfs: fix unclosed transaction handler when the async transaction commitment fails
Btrfs: fix wrong handle at error path of create_snapshot() when the commit fails
Btrfs: use set_nlink if our i_nlink is 0
The delayed inode code batches up changes to the btree in hopes of doing
them in bulk. As the changes build up, processes kick off worker
threads and wait for them to make progress.
The current code kicks off an async work queue item for each delayed
node, which creates a lot of churn. It also uses a fixed 1 HZ waiting
period for the throttle, which allows us to build a lot of pending
work and can slow down the commit.
This changes us to watch a sequence counter as it is bumped during the
operations. We kick off fewer work items and have each work item do
more work.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Raid56 merge (merge commit e942f88) had mistakenly removed a call to
__cancel_balance(), which resulted in balance not cleaning up after itself
after a successful finish. (Cleanup includes switching the state, removing
the balance item and releasing mut_ex_op testnset lock.) Bring it back.
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Commit 24542bf7ea changed preallocation of
extents to cap the max size we try to allocate. It's a valid change,
but the extent reservation code is also used by balance, and that
can't tolerate a smaller extent being allocated.
__btrfs_prealloc_file_range already has a min_size parameter, which is
used by relocation to request a specific extent size. This commit
adds an extra check to enforce that minimum extent size.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
Commit 5ac00add added a testnset mutex and code that disallows
running administrative tasks in parallel. It is prevented that
the device add/delete/balance/replace/resize operations are
started in parallel. By mistake, the defragmentation operation
was included in the check for mutually exclusiveness as well.
This is fixed with this commit.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Only let one trans handle to wait for other handles, otherwise we
will get ABBA issues.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Btrfs balance can easily hit BUG_ON in these places, but we want
to it bail out gracefully after we force the whole filesystem to
readonly. So we use btrfs_std_error hook in place of BUG_ON.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We can bail out from here gracefully instead of a cold BUG_ON.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We first use btrfs_std_error hook to replace with BUG_ON, and we
also need to cleanup what is left, including reloc roots rbtree
and reloc roots list.
Here we use a helper function to cleanup both rbtree and list, and
since this function can also be used in the balance recover path,
we also make the change as well to keep code simple.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If the async transaction commitment failed, we need close the
current transaction handler, or the current transaction will be
blocked to commit because of this orphan handler.
We fix the problem by doing sync transaction commitment, that is
to invoke btrfs_commit_transaction().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There are several bugs at error path of create_snapshot() when the
transaction commitment failed.
- access the freed transaction handler. At the end of the
transaction commitment, the transaction handler was freed, so we
should not access it after the transaction commitment.
- we were not aware of the error which happened during the snapshot
creation if we submitted a async transaction commitment.
- pending snapshot access vs pending snapshot free. when something
wrong happened after we submitted a async transaction commitment,
the transaction committer would cleanup the pending snapshots and
free them. But the snapshot creators were not aware of it, they
would access the freed pending snapshots.
This patch fixes the above problems by:
- remove the dangerous code that accessed the freed handler
- assign ->error if the error happens during the snapshot creation
- the transaction committer doesn't free the pending snapshots,
just assigns the error number and evicts them before we unblock
the transaction.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need to inc the nlink of deleted entries when running replay so we can do the
unlink on the fs_root and get everything cleaned up and then have the orphan
cleanup do the right thing. The problem is inc_nlink complains about this, even
thought it still does the right thing. So use set_nlink() if our i_nlink is 0
to keep users from seeing the warnings during log replay. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.
A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.
Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.
Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives. Allowing simple, safe,
well understood work-arounds to known problematic software.
This also addresses a rare but unfortunate problem where the filesystem
name is not the same as it's module name and module auto-loading
would not work. While writing this patch I saw a handful of such
cases. The most significant being autofs that lives in the module
autofs4.
This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.
After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module. The common pattern in the kernel is to call request_module()
without regards to the users permissions. In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesytem module unless the filesystem is mounted. In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull btrfs fixup from Chris Mason:
"Geert and James both sent this one in, sorry guys"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs/raid56: Add missing #include <linux/vmalloc.h>
tilegx_defconfig:
fs/btrfs/raid56.c: In function 'btrfs_alloc_stripe_hash_table':
fs/btrfs/raid56.c:206:3: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
fs/btrfs/raid56.c:206:9: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/raid56.c:226:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs update from Chris Mason:
"The biggest feature in the pull is the new (and still experimental)
raid56 code that David Woodhouse started long ago. I'm still working
on the parity logging setup that will avoid inconsistent parity after
a crash, so this is only for testing right now. But, I'd really like
to get it out to a broader audience to hammer out any performance
issues or other problems.
scrub does not yet correct errors on raid5/6 either.
Josef has another pass at fsync performance. The big change here is
to combine waiting for metadata with waiting for data, which is a big
latency win. It is also step one toward using atomics from the
hardware during a commit.
Mark Fasheh has a new way to use btrfs send/receive to send only the
metadata changes. SUSE is using this to make snapper more efficient
at finding changes between snapshosts.
Snapshot-aware defrag is also included.
Otherwise we have a large number of fixes and cleanups. Eric Sandeen
wins the award for removing the most lines, and I'm hoping we steal
this idea from XFS over and over again."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits)
btrfs: fixup/remove module.h usage as required
Btrfs: delete inline extents when we find them during logging
btrfs: try harder to allocate raid56 stripe cache
Btrfs: cleanup to make the function btrfs_delalloc_reserve_metadata more logic
Btrfs: don't call btrfs_qgroup_free if just btrfs_qgroup_reserve fails
Btrfs: remove reduplicate check about root in the function btrfs_clean_quota_tree
Btrfs: return ENOMEM rather than use BUG_ON when btrfs_alloc_path fails
Btrfs: fix missing deleted items in btrfs_clean_quota_tree
btrfs: use only inline_pages from extent buffer
Btrfs: fix wrong reserved space when deleting a snapshot/subvolume
Btrfs: fix wrong reserved space in qgroup during snap/subv creation
Btrfs: remove unnecessary dget_parent/dput when creating the pending snapshot
btrfs: remove a printk from scan_one_device
Btrfs: fix NULL pointer after aborting a transaction
Btrfs: fix memory leak of log roots
Btrfs: copy everything if we've created an inline extent
btrfs: cleanup for open-coded alignment
Btrfs: do not change inode flags in rename
Btrfs: use reserved space for creating a snapshot
clear chunk_alloc flag on retryable failure
...
We want to avoid module.h where posible, since it in turn includes
nearly all of header space. This means removing it where it is not
required, and using export.h where we are only exporting symbols via
EXPORT_SYMBOL and friends.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Apparently when we do inline extents we allow the data to overlap the last chunk
of the btrfs_file_extent_item, which means that we can possibly have a
btrfs_file_extent_item that isn't actually as large as a btrfs_file_extent_item.
This messes with us when we try to overwrite the extent when logging new extents
since we expect for it to be the right size. To fix this just delete the item
and try to do the insert again which will give us the proper sized
btrfs_file_extent_item. This fixes a panic where map_private_extent_buffer
would blow up because we're trying to write past the end of the leaf. Thanks,
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The stripe hash table is large, starting with allocation order 4 and can go as
high as order 7 in case lock debugging is turned on and structure padding
happens.
Observed mount failure:
mount: page allocation failure: order:7, mode:0x200050
Pid: 8234, comm: mount Tainted: G W 3.8.0-default+ #267
Call Trace:
[<ffffffff81114353>] warn_alloc_failed+0xf3/0x140
[<ffffffff811171d2>] ? __alloc_pages_direct_compact+0x92/0x250
[<ffffffff81117ac3>] __alloc_pages_nodemask+0x733/0x9d0
[<ffffffff81152878>] ? cache_alloc_refill+0x3f8/0x840
[<ffffffff811528bc>] cache_alloc_refill+0x43c/0x840
[<ffffffff811302eb>] ? is_kernel_percpu_address+0x4b/0x90
[<ffffffffa00a00ac>] ? btrfs_alloc_stripe_hash_table+0x5c/0x130 [btrfs]
[<ffffffff811531d7>] kmem_cache_alloc_trace+0x247/0x270
[<ffffffffa00a00ac>] btrfs_alloc_stripe_hash_table+0x5c/0x130 [btrfs]
[<ffffffffa003133f>] open_ctree+0xb2f/0x1f90 [btrfs]
[<ffffffff81397289>] ? string+0x49/0xe0
[<ffffffff813987b3>] ? vsnprintf+0x443/0x5d0
[<ffffffffa0007cb6>] btrfs_mount+0x526/0x600 [btrfs]
[<ffffffff8115127c>] ? cache_alloc_debugcheck_after+0x4c/0x200
[<ffffffff81162b90>] mount_fs+0x20/0xe0
[<ffffffff8117db26>] vfs_kern_mount+0x76/0x120
[<ffffffff811801b6>] do_mount+0x386/0x980
[<ffffffff8112a5cb>] ? strndup_user+0x5b/0x80
[<ffffffff81180840>] sys_mount+0x90/0xe0
[<ffffffff81962e99>] system_call_fastpath+0x16/0x1b
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The original code is a little confusing and not clear, The right
way to deal with the kernel code like this:
[...]
if (ret)
goto out;
[...]
So i move the common clean_up code to the place labeled with
out_fail, this will be easier to maintain.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
commit eb6b88d92c leads into another bug.
If it is just because qgroup_reserve fails, the function btrfs_qgroup_free
should not be called, otherwise, it will cause the wrong quota accounting.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The check work has been done just before the function btrfs_clean_quota_tree
is called, it is not necessary to check it again, remove it.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Return ENOMEM rather trigger BUG_ON, fix it.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Steps to reproduce:
i=0
ncases=100
mkfs.btrfs <disk>
mount <disk> <mnt>
btrfs quota enable <mnt>
btrfs qgroup create 2/1 <mnt>
while [ $i -le $ncases ]
do
btrfs qgroup create 1/$i <mnt>
btrfs qgroup assign 1/$i 2/1 <mnt>
i=$(($i+1))
done
btrfs quota disable <mnt>
umount <mnt>
btrfsck <mnt>
You can also use the commands:
btrfs-debug-tree <disk> | grep QGROUP
You will find there are still items existed.The reasons why this happens
is because the original code just checks slots[0]==0 and returns.
We try to fix it by deleting the leaf one by one.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The nodesize is capped at 64k and there are enough pages preallocated in
extent_buffer::inline_pages. The fallback to kmalloc never happened
because even on the smallest page size considered (4k) inline_pages
covered the needs.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When deleting a snapshot/subvolume, we need remove root ref/backref,
dir entries and update the dir inode, so we must reserve free space
for those operations.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There are two problems in the space reservation of the snapshot/
subvolume creation.
- don't reserve the space for the root item insertion
- the space which is reserved in the qgroup is different with
the free space reservation. we need reserve free space for
7 items, but in qgroup reservation, we need reserve space only
for 3 items.
So we implement new metadata reservation functions for the
snapshot/subvolume creation.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Since we have grabbed the parent inode at the beginning of the
snapshot creation, and both sync and async snapshot creation
release it after the pending snapshots are actually created,
it is safe to access the parent inode directly during the snapshot
creation, we needn't use dget_parent/dput to fix the parent dentry
and get the dir inode.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave pointed out that he saw messages from btrfs although there was no
such filesystem on his computers. The automatic device scan is called on
every new blockdevice if the usual distro udev rule set is used. The
printk introduced in 6f60cbd3ae was a remainder from copying
portions of code from btrfs_get_bdev_and_sb which is used under
different conditions and the warning makes sense there.
Reported-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
While doing cleanup work on an aborted transaction, we've set
the global running transaction pointer to NULL _before_ waiting all
other transaction handles to finish, so others'd hit NULL pointer
crash when referencing the global running transaction pointer.
This first sets a hint to avoid new transaction handle joining, then
waits other existing handles to abort or finish so that we can safely
set the above global pointer to NULL.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we abort a transaction while fsyncing, we'll skip freeing log roots
part of committing a transaction, which leads to memory leak.
This adds a 'free log roots' in putting super when no more users hold
references on log roots, so it's safe and clean.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I noticed while looking into a tree logging bug that we aren't logging inline
extents properly. Since this requires copying and it shouldn't happen too often
just force us to copy everything for the inode into the tree log when we have an
inline extent. With this patch we have valid data after a crash when we write
an inline extent. Thanks,
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
Though most of the btrfs codes are using ALIGN macro for page alignment,
there are still some codes using open-coded alignment like the
following:
------
u64 mask = ((u64)root->stripesize - 1);
u64 ret = (val + mask) & ~mask;
------
Or even hidden one:
------
num_bytes = (end - start + blocksize) & ~(blocksize - 1);
------
Sometimes these open-coded alignment is not so easy to understand for
newbie like me.
This commit changes the open-coded alignment to the ALIGN macro for a
better readability.
Also there is a previous patch from David Sterba with similar changes,
but the patch is for 3.2 kernel and seems not merged.
http://www.spinics.net/lists/linux-btrfs/msg12747.html
Cc: David Sterba <dave@jikos.cz>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Before we forced to change a file's NOCOW and COMPRESS flag due to
the parent directory's, but this ends up a bad idea, because it
confuses end users a lot about file's NOCOW status, eg. if someone
change a file to NOCOW via 'chattr' and then rename it in the current
directory which is without NOCOW attribute, the file will lose the
NOCOW flag silently.
This diables 'change flags in rename', so from now on we'll only
inherit flags from the parent directory on creation stage while in
other places we can use 'chattr' to set NOCOW or COMPRESS flags.
Reported-by: Marios Titas <redneb8888@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
While inserting dir index and updating inode for a snapshot, we'd
add delayed items which consume trans->block_rsv, if we don't have
any space reserved in this trans handle, we either just return or
reserve space again.
But before creating pending snapshots during committing transaction,
we've done a release on this trans handle, so we don't have space reserved
in it at this stage.
What we're using is block_rsv of pending snapshots which has already
reserved well enough space for both inserting dir index and updating
inode, so we need to set trans handle to indicate that we have space
now.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I've experienced filesystem freezes with permanent spikes in the active
process count for quite a while, particularly on filesystems whose
available raw space has already been fully allocated to chunks.
While looking into this, I found a pretty obvious error in
do_chunk_alloc: it sets space_info->chunk_alloc, but if
btrfs_alloc_chunk returns an error other than ENOSPC, it returns leaving
that flag set, which causes any other threads waiting for
space_info->chunk_alloc to become zero to spin indefinitely.
I haven't double-checked that this patch fixes the failure I've observed
fully (it's not exactly trivial to trigger), but it surely is a bug and
the fix is trivial, so... Please put it in :-)
What I saw in that function also happens to explain why in some cases I
see filesystems allocate a huge number of chunks that remain unused
(leading to the scenario above, of not having more chunks to allocate).
It happens for data and metadata, but not necessarily both. I'm
guessing some thread sets the force_alloc flag on the corresponding
space_info, and then several threads trying to get disk space end up
attempting to allocate a new chunk concurrently. All of them will see
the force_alloc flag and bump their local copy of force up to the level
they see first, and they won't clear it even if another thread succeeds
in allocating a chunk, thus clearing the force flag. Then each thread
that observed the force flag will, on its turn, force the allocation of
a new chunk. And any threads that come in while it does that will see
the force flag still set and pick it up, and so on. This sounds like a
problem to me, but... what should the correct behavior be? Clear
force_flag once we copy it to a local force? Reset force to the
incoming value on every loop? Set the flag to our incoming force if we
have it at first, clear our local flag, and move it from the space_info
when we determined that we are the thread that's going to perform the
allocation?
btrfs: clear chunk_alloc flag on retryable failure
From: Alexandre Oliva <oliva@gnu.org>
If btrfs_alloc_chunk fails with e.g. ENOMEM, we exit do_chunk_alloc
without clearing chunk_alloc in space_info. As a result, any further
calls to do_chunk_alloc on that filesystem will start busy-waiting for
chunk_alloc to be cleared, but it never will be. This patch adjusts
do_chunk_alloc so that it clears this flag in case of an error.
Signed-off-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When a subvolume is removed, we remove the root item from the root tree,
while the tree blocks and backrefs remain for a while. When backref walking
comes across one of those orphan tree blocks, it can find a backref for a
no longer existing root. This is all good, we only must tolerate
__resolve_indirect_ref returning an error and continue with the good refs
found.
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user reported hitting the BUG_ON() in btrfs_finished_ordered_io() where we had
csums on a NOCOW extent. This can happen if we have NODATACOW set but not
NODATASUM set, which can happen in two cases, either we mount with -o nodatacow
and then write into preallocated space, or chattr +C a directory and move a file
into that directory. Liu has fixed the move case in a different place, but this
fixes the mount -o nodatacow case. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This patch is a follow up on below patch:
[PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
commit: 216b6cbdcb
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Vivek Trivedi <t.vivek@samsung.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull trivial tree from Jiri Kosina:
"Assorted tiny fixes queued in trivial tree"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (22 commits)
DocBook: update EXPORT_SYMBOL entry to point at export.h
Documentation: update top level 00-INDEX file with new additions
ARM: at91/ide: remove unsused at91-ide Kconfig entry
percpu_counter.h: comment code for better readability
x86, efi: fix comment typo in head_32.S
IB: cxgb3: delay freeing mem untill entirely done with it
net: mvneta: remove unneeded version.h include
time: x86: report_lost_ticks doesn't exist any more
pcmcia: avoid static analysis complaint about use-after-free
fs/jfs: Fix typo in comment : 'how may' -> 'how many'
of: add missing documentation for of_platform_populate()
btrfs: remove unnecessary cur_trans set before goto loop in join_transaction
sound: soc: Fix typo in sound/codecs
treewide: Fix typo in various drivers
btrfs: fix comment typos
Update ibmvscsi module name in Kconfig.
powerpc: fix typo (utilties -> utilities)
of: fix spelling mistake in comment
h8300: Fix home page URL in h8300/README
xtensa: Fix home page URL in Kconfig
...
Here is the big driver core merge for 3.9-rc1
There are two major series here, both of which touch lots of drivers all
over the kernel, and will cause you some merge conflicts:
- add a new function called devm_ioremap_resource() to properly be
able to check return values.
- remove CONFIG_EXPERIMENTAL
If you need me to provide a merged tree to handle these resolutions,
please let me know.
Other than those patches, there's not much here, some minor fixes and
updates.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlEmV0cACgkQMUfUDdst+yncCQCfbmnQZju7kzWXk6PjdFuKspT9
weAAoMCzcAtEzzc4LXuUxxG/sXBVBCjW
=yWAQ
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core patches from Greg Kroah-Hartman:
"Here is the big driver core merge for 3.9-rc1
There are two major series here, both of which touch lots of drivers
all over the kernel, and will cause you some merge conflicts:
- add a new function called devm_ioremap_resource() to properly be
able to check return values.
- remove CONFIG_EXPERIMENTAL
Other than those patches, there's not much here, some minor fixes and
updates"
Fix up trivial conflicts
* tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits)
base: memory: fix soft/hard_offline_page permissions
drivercore: Fix ordering between deferred_probe and exiting initcalls
backlight: fix class_find_device() arguments
TTY: mark tty_get_device call with the proper const values
driver-core: constify data for class_find_device()
firmware: Ignore abort check when no user-helper is used
firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER
firmware: Make user-mode helper optional
firmware: Refactoring for splitting user-mode helper code
Driver core: treat unregistered bus_types as having no devices
watchdog: Convert to devm_ioremap_resource()
thermal: Convert to devm_ioremap_resource()
spi: Convert to devm_ioremap_resource()
power: Convert to devm_ioremap_resource()
mtd: Convert to devm_ioremap_resource()
mmc: Convert to devm_ioremap_resource()
mfd: Convert to devm_ioremap_resource()
media: Convert to devm_ioremap_resource()
iommu: Convert to devm_ioremap_resource()
drm: Convert to devm_ioremap_resource()
...
If we remount the fs to close the auto defragment or make the fs R/O,
we should stop the auto defragment.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When running the 083th case of xfstests on the filesystem with
"compress-force=lzo", the following WARNINGs were triggered.
WARNING: at fs/btrfs/inode.c:7908
WARNING: at fs/btrfs/inode.c:7909
WARNING: at fs/btrfs/inode.c:7911
WARNING: at fs/btrfs/extent-tree.c:4510
WARNING: at fs/btrfs/extent-tree.c:4511
This problem was introduced by the patch "Btrfs: fix deadlock due
to unsubmitted". In this patch, there are two bugs which caused
the above problem.
The 1st one is a off-by-one bug, if the DIO write return 0, it is
also a short write, we need release the reserved space for it. But
we didn't do it in that patch. Fix it by change "ret > 0" to
"ret >= 0".
The 2nd one is ->outstanding_extents was increased twice when
a short write happened. As we know, ->outstanding_extents is
a counter to keep track of the number of extent items we may
use duo to delalloc, when we reserve the free space for a
delalloc write, we assume that the write will introduce just
one extent item, so we increase ->outstanding_extents by 1 at
that time. And then we will increase it every time we split the
write, it is done at the beginning of btrfs_get_blocks_direct().
So when a short write happens, we needn't increase
->outstanding_extents again. But this patch done.
In order to fix the 2nd problem, I re-write the logic for
->outstanding_extents operation. We don't increase it at the
beginning of btrfs_get_blocks_direct(), instead, we just
increase it when the split actually happens.
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This comes from one of btrfs's project ideas,
As we defragment files, we break any sharing from other snapshots.
The balancing code will preserve the sharing, and defrag needs to grow this
as well.
Now we're able to fill the blank with this patch, in which we make full use of
backref walking stuff.
Here is the basic idea,
o set the writeback ranges started by defragment with flag EXTENT_DEFRAG
o at endio, after we finish updating fs tree, we use backref walking to find
all parents of the ranges and re-link them with the new COWed file layout by
adding corresponding backrefs.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We try to limit the size of a chunk to 10GB, which keeps the unit of
work reasonable during balance and resize operations. The limit checks
were taking into account the number of copies of the data we had but
what they really should be doing is comparing against the logical
size of the chunk we're creating.
This moves the code around a little to use the count of data stripes
from raid5/6.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Very large fallocate requests are cpu bound and result in extents with a
repeating pattern of ever decreasing size:
$ time fallocate -l 1T file
real 0m13.039s
( an excerpt of the extents from btrfs-debug-tree: )
prealloc data disk byte 1536292564992 nr 397312
prealloc data disk byte 1536292962304 nr 196608
prealloc data disk byte 1536293158912 nr 98304
prealloc data disk byte 1536293257216 nr 49152
prealloc data disk byte 1536293306368 nr 24576
prealloc data disk byte 1536293330944 nr 12288
prealloc data disk byte 1536293343232 nr 8192
prealloc data disk byte 1536293351424 nr 4096
prealloc data disk byte 1536293355520 nr 4096
prealloc data disk byte 1536293359616 nr 4096
The excessive cpu use comes from __btrfs_prealloc_file_range() trying to
allocate the entire remaining size after each extent is allocated.
btrfs_reserve_extent() repeatedly cuts this requested size in half until
it gets down to the size that the allocators can return. We limit the
problem for now by capping each reservation at 256 meg.
The small extents come from a masking bug when decreasing the requested
reservation size. The high 32bits are cleared and the remaining low
bits might happen to reserve a small size. Fix this by using
round_down() which properly casts the mask.
After these fixes huge fallocate requests are fast and result in nice
large extents:
$ time fallocate -l 1T file
real 0m0.082s
prealloc data disk byte 1112425889792 nr 268435456
prealloc data disk byte 1112694325248 nr 268435456
prealloc data disk byte 1112962760704 nr 268435456
Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
__btrfs_close_devices() clones btrfs device structs with
memcpy(). Some of the fields in the clone are reinitialized, but it's
missing to init io_lock. In mainline this goes unnoticed, but on RT it
leaves the plist pointing to the original about to be freed lock
struct.
Initialize io_lock after cloning, so no references to the original
struct are left.
Reported-and-tested-by: Mike Galbraith <efault@gmx.de>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We forget to free qgroup reservation in commit_transaction(),fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Cc: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The original code forget to check whether quota has been disabled firstly,
and it will return 'EINVAL' and return error to users if quota has been
disabled,it will be unfriendly and confusing for users to see that.
So just return directly if quota has been disabled.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Cc: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Right now inode cache inode is treated as the same as space cache
inode, ie. keep inode in memory till putting super.
But this leads to an awkward situation.
If we're going to delete a snapshot/subvolume, btrfs will not
actually delete it and return free space, but will add it to dead
roots list until the last inode on this snap/subvol being destroyed.
Then we'll fetch deleted roots and cleanup them via cleaner thread.
So here is the problem, if we enable inode cache option, each
snap/subvol has a cached inode which is used to store inode allcation
information. And this cache inode will be kept in memory, as the above
said. So with inode cache, snap/subvol can only be added into
dead roots list during freeing roots stage in umount, so that we can
ONLY get space back after another remount(we cleanup dead roots on mount).
But the real thing is we'll no more use the snap/subvol if we mark it
deleted, so we can safely iput its cache inode when we delete snap/subvol.
Another thing is that we need to change the rules of droping inode, we
don't keep snap/subvol's cache inode in memory till end so that we can
add snap/subvol into dead roots list in time.
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
In some cases, we need commit the current transaction, but don't want
to start a new one if there is no running transaction, so we introduce
the function - btrfs_attach_transaction(), which can catch the current
transaction, and return -ENOENT if there is no running transaction.
But no running transaction doesn't mean the current transction completely,
because we removed the running transaction before it completes. In some
cases, it doesn't matter. But in some special cases, such as freeze fs, we
hope the transaction is fully on disk, it will introduce some bugs, for
example, we may feeze the fs and dump the data in the disk, if the transction
doesn't complete, we would dump inconsistent data. So we need fix the above
problem for those cases.
We fixes this problem by introducing a function:
btrfs_attach_transaction_barrier()
if we hope all the transaction is fully on the disk, even they are not
running, we can use this function.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Now btrfs_commit_transaction() does this
ret = btrfs_run_ordered_operations(root, 0)
which async flushes all inodes on the ordered operations list, it introduced
a deadlock that transaction-start task, transaction-commit task and the flush
workers waited for each other.
(See the following URL to get the detail
http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
As we know, if ->in_commit is set, it means someone is committing the
current transaction, we should not try to join it if we are not JOIN
or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
the above problem. In this way, there is another benefit: there is no new
transaction handle to block the transaction which is on the way of commit,
once we set ->in_commit.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
In start_transactio(), we will try to join the transaction again after
the current transaction is committed, so we should not release the
reserved space of the qgroup. Fix it.
Cc: Arne Jansen <sensille@gmx.net>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
super.magic is an le64 but it's treated as an unterminated string when
compared against BTRFS_MAGIC which is defined as a string. Instead
define BTRFS_MAGIC as a normal hex value and use endian helpers to
compare it to the super's magic.
I tested this by mounting an fs made before the change and made sure
that it didn't introduce sparse errors. This matches a similar cleanup
that is pending in btrfs-progs. David Sterba pointed out that we should
fix the kernel side as well :).
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
With this new ioctl(2) BTRFS_IOC_SET_FSLABEL, we can set/change the label of a mounted file system.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Goffredo Baroncelli <kreijack@inwind.it>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Add a new ioctl(2) BTRFS_IOC_GET_FSLABLE, so that we can get the label upon a mounted filesystem.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Goffredo Baroncelli <kreijack@inwind.it>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao made the ordered operations stuff run async, which introduced a
deadlock where we could get somebody (sync) racing in and committing the
transaction while a commit was already happening. The new committer would
try and flush ordered operations which would hang waiting for the commit to
finish because it is done asynchronously and no longer inherits the callers
trans handle. To fix this we need to make the ordered operations list a per
transaction list. We can get new inodes added to the ordered operation list
by truncating them and then having another process writing to them, so this
makes it so that anybody trying to add an ordered operation _must_ start a
transaction in order to add itself to the list, which will keep new inodes
from getting added to the ordered operations list after we start committing.
This should fix the deadlock and also keeps us from doing a lot more work
than we need to during commit. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave pointed out that xfstests 273 will tell you that it failed to load the
space cache for a block group when it remounts. This is because we run out
of space writing out the block group cache. This is ok and is working as it
should, but let's try to be a bit nicer. This happens because the block
group was 100mb, but bitmap entries cover 128mb, so we were only getting
extent entries for this block group, which ended up being too many to fit in
the free space cache. So relax the bitmap size requirements to block groups
that are at least half the size a bitmap will cover or larger, that way we
can still keep the amount of space used in the free space cache low enough
to be able to write it out. With this patch I no longer fail to write out
the free space cache. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Enhance balance usage filter by making it possible to balance out only
completely empty chunks. Today, usage filter properly acts on values
from 1 to 99 inclusive, usage=100 selects all chunks, and usage=0
selects no chunks. This commit changes the usage=0 case: the new
meaning is to restripe only completely empty chunks and nothing else.
Suggested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Commit 5af3e8cc introduced a use-after-free at volumes.c:3139: bctl is freed
above in __cancel_balance() in all cases except for balance pause. Fix this
by moving the offending check a couple statements above, the meaning of the
check is preserved.
Reported-by: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The defrag operation can take very long, we want to have a way how to
cancel it. The code checks for a pending signal at safe points in the
defrag loops and returns EAGAIN. This means a user can press ^C after
running 'btrfs fi defrag', woks for both defrag modes, files and root.
Returning from the command was instant in my light tests, but may take
longer depending on the aging factor of the filesystem.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The warning in use_block_rsv is not useful for users and may fill
the logs unnecessarily.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This idea is from ext4. By this patch, we can make the dio write parallel,
and improve the performance. But because we can not update isize without
i_mutex, the unlocked dio write just can be done in front of the EOF.
We needn't worry about the race between dio write and truncate, because the
truncate need wait untill all the dio write end.
And we also needn't worry about the race between dio write and punch hole,
because we have extent lock to protect our operation.
I ran fio to test the performance of this feature.
== Hardware ==
CPU: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz
Mem: 2GB
SSD: Intel X25-M 120GB (Test Partition: 60GB)
== config file ==
[global]
ioengine=psync
direct=1
bs=4k
size=32G
runtime=60
directory=/mnt/btrfs/
filename=testfile
group_reporting
thread
[file1]
numjobs=1 # 2 4
rw=randwrite
== result (KBps) ==
write 1 2 4
lock 24936 24738 24726
nolock 24962 30866 32101
== result (iops) ==
write 1 2 4
lock 6234 6184 6181
nolock 6240 7716 8025
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Currently, we can do unlocked dio reads, but the following race
is possible:
dio_read_task truncate_task
->btrfs_setattr()
->btrfs_direct_IO
->__blockdev_direct_IO
->btrfs_get_block
->btrfs_truncate()
#alloc truncated blocks
#to other inode
->submit_io()
#INFORMATION LEAK
In order to avoid this problem, we must serialize unlocked dio reads with
truncate. There are two approaches:
- use extent lock to protect the extent that we truncate
- use inode_dio_wait() to make sure the truncating task will wait for
the read DIO.
If we use the 1st one, we will meet the endless truncation problem due to
the nonlocked read DIO after we implement the nonlocked write DIO. It is
because we still need invoke inode_dio_wait() avoid the race between write
DIO and truncation. By that time, we have to introduce
btrfs_inode_{block, resume}_nolock_dio()
again. That is we have to implement this patch again, so I choose the 2nd
way to fix the problem.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The deadlock problem happened when running fsstress(a test program in LTP).
Steps to reproduce:
# mkfs.btrfs -b 100M <partition>
# mount <partition> <mnt>
# <Path>/fsstress -p 3 -n 10000000 -d <mnt>
The reason is:
btrfs_direct_IO()
|->do_direct_IO()
|->get_page()
|->get_blocks()
| |->btrfs_delalloc_resereve_space()
| |->btrfs_add_ordered_extent() ------- Add a new ordered extent
|->dio_send_cur_page(page0) -------------- We didn't submit bio here
|->get_page()
|->get_blocks()
|->btrfs_delalloc_resereve_space()
|->flush_space()
|->btrfs_start_ordered_extent()
|->wait_event() ---------- Wait the completion of
the ordered extent that is
mentioned above
But because we didn't submit the bio that is mentioned above, the ordered
extent can not complete, we would wait for its completion forever.
There are two methods which can fix this deadlock problem:
1. submit the bio before we invoke get_blocks()
2. reserve the space before we do dio
Though the 1st is the simplest way, we need modify the code of VFS, and it
is likely to break contiguous requests, and introduce performance regression
for the other filesystems.
So we have to choose the 2nd way.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I noticed we were getting lots of warnings with xfstest 83 because we have
reservations outstanding. This is because we moved the orphan add outside
of the truncate, but we don't actually cleanup our reservation if something
fails. This fixes the problem and I no longer see warnings. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Sometimes xfstest 83 will fail to remount the scratch device because we've
gotten ourselves so full that we cannot cleanup the orphan items. In this
case check to see if we're doing the orphan cleanup and if we are allow us
to steal our reservation from the global block rsv. With this patch I've
not been able to reproduce the failed mount problem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The argument "inherit" of btrfs_ioctl_snap_create_transid() was assigned
to NULL during we created the snapshots, so we didn't free it though we
called kfree() in the caller.
But since we are sure the snapshot creation is done after the function -
btrfs_ioctl_snap_create_transid() - completes, it is safe that we don't
assign the pointer "inherit" to NULL, and just free it in the caller of
btrfs_ioctl_snap_create_transid(). In this way, the code can become more
readable.
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Cc: Arne Jansen <sensille@gmx.net>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
open_ctree() need read the metadata to initialize the global information
of btrfs. But it may fail after it submit some bio, and then it will jump
to the error path. Unfortunately, it doesn't check if there are some bios
in flight, and just stop all the worker threads. As a result, when the
submitted bios end, they can not find any worker thread which can deal with
subsequent work, then oops happen.
kernel BUG at fs/btrfs/async-thread.c:605!
Fix this problem by invoking invalidate_inode_pages2() before we stop the
worker threads. This function will wait until the bio end because it need
lock the pages which are going to be invalidated, and if a page is under
disk read IO, it must be locked. invalidate_inode_pages2() need wait until
end bio handler to unlocked it.
Reported-and-Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This patch adds the flag, BTRFS_SEND_FLAG_NO_FILE_DATA to the btrfs send
ioctl code. When this flag is set, the btrfs send code will never write file
data into the stream (thus also avoiding expensive reads of that data in the
first place). BTRFS_SEND_C_UPDATE_EXTENT commands will be sent (instead of
BTRFS_SEND_C_WRITE) with an offset, length pair indicating the extent in
question.
This patch does not affect the operation of BTRFS_SEND_C_CLONE commands -
they will continue to be sent when a search finds an appropriate extent to
clone from.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
For write, we also reserve some space for COW blocks during updating
the checksum tree, and we calculate the number of blocks by checking
if the number of bytes outstanding that are going to need csums needs
one more block for csum.
When we add these checksum into the checksum tree, we use ordered sums
list.
Every ordered sum contains csums for each sector, and we'll first try
to look up an existing csum item,
a) if we don't yet have a proper csum item, then we need to insert one,
b) or if we find one but the csum item is not big enough, then we need
to extend it.
The point is we'll unlock the whole path and then insert or extend.
So others can hack in and update the tree.
Each insert or extend needs update the tree with COW on, and we may need
to insert/extend for many times.
That means what we've reserved for updating checksum tree is NOT enough
indeed.
The case is even more serious with having several write threads at the
same time, it can end up eating our reserved space quickly and starting
eating globle reserve pool instead.
I don't yet come up with a way to calculate the worse case for updating
csum, but extending the checksum item as much as possible can be helpful
in my test.
The idea behind is that it can reduce the times we insert/extend so that
it saves us precious reserved space.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The entry point at the defrag ioctl always sets "cache only" to 0;
the codepaths haven't run for a long time as far as I can
tell. Chris says they're dead code, so remove them.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I hit a deadlock where transaction commit was waiting on num_writers to be
0. This happened because somebody came into btrfs_commit_transaction and
noticed we had aborted and it went to cleanup_transaction. This shouldn't
happen because cleanup_transaction is really to fixup a bad commit, it
doesn't do the normal trans handle cleanup things. So if we have an error
just do the normal btrfs_end_transaction dance and return. Once we are in
the actual commit path we can use cleanup_transaction and be good to go.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I noticed we would deadlock if we aborted a transaction while doing
compressed io. This is because we don't unlock our pages if something goes
horribly wrong. To fix this we need to make sure that we call
extent_clear_unlock_delalloc in order to unlock all the pages. If we have
to cow in the async submission thread we need to make sure to unlock our
locked_page as the cow error path will not unlock the locked page as it
depends on the caller to unlock that page. With this patch we no longer
deadlock on the page lock when we have an aborted transaction. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
People have been complaining about random ENOSPC errors that will clear up
after a umount or just a given amount of time. Chris was able to reproduce
this with stress.sh and lots of processes and so was I. Basically the
overcommit stuff would really let us get out of hand, in my tests I saw up
to 30 gigs of outstanding reservations with only 2 gigs total of metadata
space. This usually worked out fine but with so much outstanding
reservation the flushing stuff short circuits to make sure we don't hang
forever flushing when we really need ENOSPC. Plus we allocate chunks in
order to alleviate the pressure, but this doesn't actually help us since we
only use the non-allocated area in our over commit logic.
So instead of basing overcommit on the amount of non-allocated space,
instead just do it based on how much total space we have, and then limit it
to the non-allocated space in case we are short on space to spill over into.
This allows us to have the same performance as well as no longer giving
random ENOSPC. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave sent me a panic where we were doing the orphan cleanup and panic'ed
trying to release our reservation from the orphan block rsv. The reason for
this is because our orphan block rsv had been free'd out from underneath us
because the transaction commit found that there were no orphan inodes
according to its count and decided to free it. This is incorrect so make
sure we inc the orphan inodes count so the accounting is all done properly.
This would also cause the warning in the orphan commit code normally if you
had any orphans to cleanup as they would only decrement the orphan count so
you'd get a negative orphan count which could cause problems during runtime.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When a transaction aborts or there's an EIO on an ordered extent or any
error really we will not free up the space we reserved for this ordered
extent. This results in warnings from the block group cache cleanup in the
case of a transaction abort, or leaking space in the case of EIO on an
ordered extent. Fix this up by free'ing the reserved space if we have an
error at all trying to complete an ordered extent. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we abort we've been just free'ing up all the ordered extents and
hoping for the best. This results in lots of warnings from various places,
warnings from btrfs_destroy_inode() because it's ENOSPC accounting isn't
fixed. It will also screw up lots of pages who have been set private but
never get cleared because the ordered extents are never allowed to be
submitted. This patch fixes those warnings. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I hit this error when reproducing a bug that would end in a transaction
abort. We take the delayed ref head's mutex to keep anybody from processing
it while we're destroying it, but we fail to drop the mutex before we carry
on and free the damned thing. Fix this by doing the remove logic for the
head ourselves and unlock the mutex, that way we can avoid use after free's
or hung tasks waiting on that mutex to come back so they know the delayed
ref completed. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
WARN_ON isn't enough, we need to stop the loop if for any reason
we would overrun the devices_info array.
I tried to track down the connection between the length of
the alloc_devices list and the rw_devices counter but
it wasn't immediately obvious, so be defensive about it.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
__btrfs_std_error didn't always properly call va_end,
and might call va_start even if fmt was NULL.
Move all the varargs handling into the block where we
have fmt.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I don't think that BTRFS_DEV_EXTENT_KEY is supposed
to fall through to BTRFS_DEV_STATS_KEY ...
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
At least backref_tree_panic() can apparently pass
in a null fs_info, so handle that in __btrfs_panic
to get the message out on the console.
The btrfs_panic macro also uses fs_info, but that's
largely pointless; it's testing to see if
BTRFS_MOUNT_PANIC_ON_FATAL_ERROR is not set.
But if it *were* set, __btrfs_panic() would have,
well, paniced and we wouldn't be here, testing it!
So just BUG() at this point.
And since we only use fs_info once now, just use it
directly.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
No need to test the result, we can't get a
null pointer from list_entry()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Because of how little we allocate chunks now we can get really tight on
metadata space before we will allocate a new chunk. This resulted in being
unable to add device extents when allocating a new metadata chunk as we did
not have enough space. This is because we were allowed to overcommit too
much metadata without actually making sure we had enough space to make
allocations. The idea behind overcommit is that we are allowed to say "sure
you can have that reservation" when most of the free space is occupied by
reservations, not actual allocations. But in this case where a majority of
the total space is in use by actual allocations we can screw ourselves by
not being able to make real allocations when it matters. So make sure we
have enough real space for our global reserve, and if not then don't allow
overcommitting. Thanks,
Reported-and-tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I got a double free error when unmounting a file system that failed to add a
chunk during its operation. This is because we will kfree the mapping that
we created but leave the extent_map in the em_tree for chunks. So to fix
this just remove the extent_map when we error out so we don't run into this
problem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we error out allocating a dev extent we will have already created the
block group and such which will cause problems since the allocator may have
tried to allocate out of the block group that no longer exists. This will
cause BUG_ON()'s in the bio submission path. This also makes a failure to
allocate a dev extent a non-abort error, we will just clean up the dev
extents we did allocate and exit. Now if we fail to delete the dev extents
we will abort since we can't have half of the dev extents hanging around,
but this will make us much less likely to abort. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is no lock to protect fs_info->fs_state, it will introduce
some problems, such as the value may be covered by the other task
when several tasks modify it. For example:
Task0 - CPU0 Task1 - CPU1
mov %fs_state rax
or $0x1 rax
mov %fs_state rax
or $0x2 rax
mov rax %fs_state
mov rax %fs_state
The expected value is 3, but in fact, it is 2.
Though this problem doesn't happen now (because there is only one
flag currently), the code is error prone, if we add other flags,
the above problem will happen to a certainty.
Now we use bit operation for it to fix the above problem.
In this way, we can make the code more robust and be easy to
add new flags.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is no lock to protect
fs_info->avail_{data, metadata, system}_alloc_bits,
it may introduce some problem, such as the wrong profile
information, so we add a seqlock to protect them.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need not use a global lock to protect the delalloc_bytes of the
inode, just use its own lock. In this way, we can reduce the lock
contention and ->delalloc_lock will just protect delalloc inode
list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
fs_info->delalloc_bytes is accessed very frequently, so use percpu
counter instead of the u64 variant for it to reduce the lock
contention.
This patch also fixed the problem that we access the variant
without the lock protection.At worst, we would not flush the
delalloc inodes, and just return ENOSPC error when we still have
some free space in the fs.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
->dirty_metadata_bytes is accessed very frequently, so use percpu
counter instead of the u64 variant to reduce the contention of
the lock.
This patch also fixed the problem that we access it without
lock protection in __btrfs_btree_balance_dirty(), which may
cause we skip the dirty pages flush.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
fs_info->alloc_start is a 64bits variant, can be accessed by
multi-task, but it is not protected strictly, it can be changed
while we are accessing it. On 32bit machine, we will get wrong
value because we access it by two instructions.(In fact, it is
also possible that the same problem happens on the 64bit machine,
because the compiler may split the 64bit operation into two 32bit
operation.)
For example:
Assuming -> alloc_start is 0x0000 0000 0001 0000 at the beginning,
then we remount and set ->alloc_start to 0x0000 0100 0000 0000.
Task0 Task1
load high 32 bits
set high 32 bits
set low 32 bits
load low 32 bits
Task1 will get 0.
This patch fixes this problem by using two locks to protect it
fs_info->chunk_mutex
sb->s_umount
On the read side, we just need get one of these two locks, and on
the write side, we must lock all of them.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Though ->max_inline is a 64bit variant, and may be accessed by
multi-task, but it is just suggestive number, so we needn't add
anything to protect fs_info->max_inline, just add a comment to
explain wny we don't use a lock to protect it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The header file will then be installed under /usr/include/linux so that
userspace applications can refer to Btrfs ioctls by name and use the same
structs used internally in the kernel.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
CAP_DAC_READ_SEARCH overrides read and search permission check on
file and directory. It seems fit for BTRFS_IOC_INO_PATHS.
Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This reverts commit 2794ed013b.
Wasn't supposed to get used in btrfs_mknod, it was supposed to be in
btrfs_create, which was done in commit
9185aa587b.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_run_ordered_operations() needn't traverse the ordered operation list
repeatedly, it is because the transaction commiter will invoke it again when
there is no other writer in this transaction, it can ensure that no one can
add new objects into the ordered operation list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_start_delalloc_inodes() needn't traverse and flush the delalloc inodes
repeatedly. It is because we can regard the data that the users write after
we start delalloc inodes flush as the one which is after the delalloc inodes
flush is done, and we can flush it next time.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We forget to check the return value of btrfs_run_ordered_operations() when
flushing all the pending stuffs, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We forget to check the return value of btrfs_start_delalloc_inodes(), fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The current code of raid attr arry is hard to understand and it is easy to
introduce some problem if we modify the array. So I changed it and made it
more readable.
Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This'd save us a rbtree search which may become expensive in large filesystem.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This does not change the logic of code, but can save us a read_lock.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The API in tree log code has done sort of changes, and it proves that
we can benifit from using token, so do the same thing here.
function_graph tracer's timer shows that it costs nearly half time
of before(39.788us -> 22.391us).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
commit d53ba47484
(Btrfs: use commit root when loading free space cache) has remove
the deadlock check, and the related comments can be removed as well.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we start running low on metadata space we will try to allocate a chunk,
which could then try to allocate a chunk to add the device entry. The thing
is we allocate a chunk before we try really hard to make the allocation, so
we should be able to find space for the device entry. Add a flag to the
trans handle so we know we're currently allocating a chunk so we can just
bail out if we try to allocate another chunk. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Since we don't actually copy the extent information from the source tree in
the fast case we don't need to wait for ordered io to be completed in order
to fsync, we just need to wait for the io to be completed. So when we're
logging our file just attach all of the ordered extents to the log, and then
when the log syncs just wait for IO_DONE on the ordered extents and then
write the super. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This patch fixes the following problem:
- improper return value
- unnecessary read-only check
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Use wrapper page_offset to get byte-offset into filesystem object for page.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We may try to flush some dirty pages when there is no enough space to reserve.
But it is possible that this operation fails, in order to get enough space to
reserve successfully, we will sync all the delalloc file. This operation is
safe, we needn't worry about the case that the filesystem goes from r/w to r/o.
because the filesystem should guarantee all the dirty pages have been written
into the disk after it becomes readonly, so the sync operation will do nothing
if the filesystem is already readonly. Though it may waste lots of time,
as a corner case, we needn't care.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Locking and unlocking delayed ref mutex are in the different functions,
and the name of lock functions is not uniform, so the readability is not
so good, this patch optimizes the lock logic and makes it more readable.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We're running into having 50-100 orphans left over with xfstests 83
because of ENOSPC when trying to start the transaction for the inode update.
But in fact, it makes no sense in updating the inode for the new size while
we're deleting the stupid thing. This patch fixes this problem.
Reported-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The delayed item commit code in several functions is similar, so
cleanup it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Since we do not want to delay the async transaction commit, we should
use common work, not delayed work.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
We clear the transaction object and the trans handle when they are about to be
freed, it is unnecessary, cleanup it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
The delayed reference allocation is in the fast path of the IO, so use slabs
to improve the speed of the allocation.
And besides that, it can do check for leaked objects when the module is removed.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
btrfs_scan_one_device is calling set_blocksize() which can race
with a concurrent process making dirty page cache pages. It can end up
dropping dirty page cache pages on the floor, which isn't very nice when
someone is just running btrfs dev scan to find filesystems on the
box.
Now that udev is registering btrfs devices as it discovers them, we can
actually end up racing with our own mkfs program too. When this
happens, we drop some of the important blocks written by mkfs.
This commit changes scan_one_device to read the super out of the page
cache instead of trying to use bread. This way we don't have to care
about the blocksize of the device.
This also drops the invalidate_bdev() call. It wasn't very polite to
invalidate during the scan either. mkfs is putting the super into the
page cache, there's no reason to invalidate at this point.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When replaying a log tree with qgroups enabled, tree_mod_log_rewind does a
sanity-check of the number of items against the maximum possible number.
It calculates that number with the nodesize of fs_root. Unfortunately
fs_root is not yet set at this stage. So instead use the nodesize from
tree_root, which is already initialized.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fixes from Chris Mason:
"We've got corner cases for updating i_size that ceph was hitting,
error handling for quotas when we run out of space, a very subtle
snapshot deletion race, a crash while removing devices, and one
deadlock between subvolume creation and the sb_internal code (thanks
lockdep)."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: move d_instantiate outside the transaction during mksubvol
Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata
Btrfs: fix possible stale data exposure
Btrfs: fix missing i_size update
Btrfs: fix race between snapshot deletion and getting inode
Btrfs: fix missing release of the space/qgroup reservation in start_transaction()
Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write()
Btrfs: do not merge logged extents if we've removed them from the tree
btrfs: don't try to notify udev about missing devices
Dave Sterba triggered a lockdep complaint about lock ordering
between the sb_internal lock and the cleaner semaphore.
btrfs_lookup_dentry() checks for orphans if we're looking up
the inode for a subvolume, and subvolume creation is triggering
the lookup with a transaction running.
This commit moves the d_instantiate after the transaction closes.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When btrfs_qgroup_reserve returned a failure, we were missing a counter
operation for BTRFS_I(inode)->outstanding_extents++, leading to warning
messages about outstanding extents and space_info->bytes_may_use != 0.
Additionally, the error handling code didn't take into account that we
dropped the inode lock which might require more cleanup.
Luckily, all the cleanup code we need is already there and can be shared
with reserve_metadata_bytes, which is exactly what this patch does.
Reported-by: Lev Vainblat <lev@zadarastorage.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We specifically do not update the disk i_size if there are ordered extents
outstanding for any area between the current disk_i_size and our ordered
extent so that we do not expose stale data. The problem is the check we
have only checks if the ordered extent starts at or after the current
disk_i_size, which doesn't take into account an ordered extent that starts
before the current disk_i_size and ends past the disk_i_size. Fix this by
checking if the extent ends past the disk_i_size. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we have an ordered extent before the ordered extent we are currently
completing that is after the current disk_i_size we will put our i_size
update into that ordered extent so that we do not expose stale data. The
problem is that if our disk i_size is updated past the previous ordered
extent we won't update the i_size with the pending i_size update. So check
the pending i_size update and if its above the current disk i_size we need
to go ahead and try to update. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
While running snapshot testscript created by Mitch and David,
the race between autodefrag and snapshot deletion can lead to
corruption of dead_root list so that we can get crash on
btrfs_clean_old_snapshots().
And besides autodefrag, scrub also does the same thing, ie. read
root first and get inode.
Here is the story(take autodefrag as an example):
(1) when we delete a snapshot or subvolume, it will set its root's
refs to zero and do a iput() on its own inode, and if this inode happens
to be the only active in-meory one in root's inode rbtree, it will add
itself to the global dead_roots list for later cleanup.
(2) after (1), the autodefrag thread may read another inode for defrag
and the inode is just in the deleted snapshot/subvolume, but all of these
are without checking if the root is still valid(refs > 0). So the end up
result is adding the deleted snapshot/subvolume's root to the global
dead_roots list AGAIN.
Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu.
So all we need to do is to take the lock to protect 'read root and get inode',
since we synchronize to wait for the rcu grace period before adding something
to the global dead_roots list.
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we fail to start a transaction, we need to release the reserved free space
and qgroup space, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If the checks at the beginning of btrfs_file_aio_write() fail, we needn't
decrease ->sync_writers, because we have not increased it. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
You can run into this problem where if somebody is fsyncing and writing out
the existing extents you will have removed the extent map from the em tree,
but it's still valid for the current fsync so we go ahead and write it. The
problem is we unconditionally try to merge it back into the em tree, but if
we've removed it from the em tree that will cause use after free problems.
Fix this to only merge if we are still a part of the tree. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The device removal code was incorrectly checking against two different limits for
raid5 and raid6.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We batch up operations to the extent allocation tree, which allows
us to deal with the recursive nature of using the extent allocation
tree to allocate extents to the extent allocation tree.
It also provides a mechanism to sort and collect extent
operations, which makes it much more efficient to record extents
that are close together.
The delayed extent operations must all be finished before the
running transaction commits, so we have code to make sure and run a few
of the batched operations when closing our transaction handles.
This creates a great deal of contention for the locks in the
delayed extent operation tree, and also contention for the lock on the
extent allocation tree itself. All the extra contention just slows
down the operations and doesn't get things done any faster.
This commit changes things to use a wait queue instead. As procs
want to run the delayed operations, one of them races in and gets
permission to hit the tree, and the others step back and wait for
progress to be made.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The extent buffers have a refs_lock which we use to make coordinate freeing
the extent buffer with operations on the radix tree. On tree roots and
other extent buffers that very cache hot, this can be highly contended.
These are also the extent buffers that are basically pinned in memory.
This commit adds code to cmpxchg our way through the ref modifications,
and as long as the result of the reference change is still pinned in
ram, we skip the expensive spinlock.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
With the new raid56 code, we want to make sure we're
properly aligning our allocation clusters with -o ssd
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Buffered writes and DIRECT_IO writes will often break up
big contiguous changes to the file into sub-stripe writes.
This adds a plugging callback to gather those smaller writes full stripe
writes.
Example on flash:
fio job to do 64K writes in batches of 3 (which makes a full stripe):
With plugging: 450MB/s
Without plugging: 220MB/s
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The stripe cache allows us to avoid extra read/modify/write cycles
by caching the pages we read off the disk. Pages are cached when:
* They are read in during a read/modify/write cycle
* They are written during a read/modify/write cycle
* They are involved in a parity rebuild
Pages are not cached if we're doing a full stripe write. We're
assuming that a full stripe write won't be followed by another
partial stripe write any time soon.
This provides a substantial boost in performance for workloads that
synchronously modify adjacent offsets in the file, and for the parity
rebuild use case in general.
The size of the stripe cache isn't tunable (yet) and is set at 1024
entries.
Example on flash: dd if=/dev/zero of=/mnt/xxx bs=4K oflag=direct
Without the stripe cache -- 2.1MB/s
With the stripe cache 21MB/s
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This builds on David Woodhouse's original Btrfs raid5/6 implementation.
The code has changed quite a bit, blame Chris Mason for any bugs.
Read/modify/write is done after the higher levels of the filesystem have
prepared a given bio. This means the higher layers are not responsible
for building full stripes, and they don't need to query for the topology
of the extents that may get allocated during delayed allocation runs.
It also means different files can easily share the same stripe.
But, it does expose us to incorrect parity if we crash or lose power
while doing a read/modify/write cycle. This will be addressed in a
later commit.
Scrub is unable to repair crc errors on raid5/6 chunks.
Discard does not work on raid5/6 (yet)
The stripe size is fixed at 64KiB per disk. This will be tunable
in a later commit.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We'll want to merge writes so they can fill a full RAID[56] stripe, but
not necessarily reads.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we remove a missing device, bdev is null, and if we
send that off to btrfs_kobject_uevent we'll panic.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fixes from Chris Mason:
"It turns out that we had two crc bugs when running fsx-linux in a
loop. Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it
all down. Miao also has a new OOM fix in this v2 pull as well.
Ilya fixed a regression Liu Bo found in the balance ioctls for pausing
and resuming a running balance across drives.
Josef's orphan truncate patch fixes an obscure corruption we'd see
during xfstests.
Arne's patches address problems with subvolume quotas. If the user
destroys quota groups incorrectly the FS will refuse to mount.
The rest are smaller fixes and plugs for memory leaks."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits)
Btrfs: fix repeated delalloc work allocation
Btrfs: fix wrong max device number for single profile
Btrfs: fix missed transaction->aborted check
Btrfs: Add ACCESS_ONCE() to transaction->abort accesses
Btrfs: put csums on the right ordered extent
Btrfs: use right range to find checksum for compressed extents
Btrfs: fix panic when recovering tree log
Btrfs: do not allow logged extents to be merged or removed
Btrfs: fix a regression in balance usage filter
Btrfs: prevent qgroup destroy when there are still relations
Btrfs: ignore orphan qgroup relations
Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag
Btrfs: fix unlock order in btrfs_ioctl_rm_dev
Btrfs: fix unlock order in btrfs_ioctl_resize
Btrfs: fix "mutually exclusive op is running" error code
Btrfs: bring back balance pause/resume logic
btrfs: update timestamps on truncate()
btrfs: fix btrfs_cont_expand() freeing IS_ERR em
Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents
Btrfs: fix off-by-one in lseek
...
btrfs_start_delalloc_inodes() locks the delalloc_inodes list, fetches the
first inode, unlocks the list, triggers btrfs_alloc_delalloc_work/
btrfs_queue_worker for this inode, and then it locks the list, checks the
head of the list again. But because we don't delete the first inode that it
deals with before, it will fetch the same inode. As a result, this function
allocates a huge amount of btrfs_delalloc_work structures, and OOM happens.
Fix this problem by splice this delalloc list.
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The max device number of single profile is 1, not 0 (0 means 'as many as
possible'). Fix it.
Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
First, though the current transaction->aborted check can stop the commit early
and avoid unnecessary operations, it is too early, and some transaction handles
don't end, those handles may set transaction->aborted after the check.
Second, when we commit the transaction, we will wake up some worker threads to
flush the space cache and inode cache. Those threads also allocate some transaction
handles and may set transaction->aborted if some serious error happens.
So we need more check for ->aborted when committing the transaction. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We may access and update transaction->aborted on the different CPUs without
lock, so we need ACCESS_ONCE() wrapper to prevent the compiler from creating
unsolicited accesses and make sure we can get the right value.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I noticed a WARN_ON going off when adding csums because we were going over
the amount of csum bytes that should have been allowed for an ordered
extent. This is a leftover from when we used to hold the csums privately
for direct io, but now we use the normal ordered sum stuff so we need to
make sure and check if we've moved on to another extent so that the csums
are added to the right extent. Without this we could end up with csums for
bytenrs that don't have extents to cover them yet. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
For compressed extents, the range of checksum is covered by disk length,
and the disk length is different with ram length, so we need to use disk
length instead to get us the right checksum.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user reported a BUG_ON(ret) that occured during tree log replay. Ret was
-EAGAIN, so what I think happened is that we removed an extent that covered
a bitmap entry and an extent entry. We remove the part from the bitmap and
return -EAGAIN and then search for the next piece we want to remove, which
happens to be an entire extent entry, so we just free the sucker and return.
The problem is ret is still set to -EAGAIN so we trip the BUG_ON(). The
user used btrfs-zero-log so I'm not 100% sure this is what happened so I've
added a WARN_ON() to catch the other possibility. Thanks,
Reported-by: Jan Steffens <jan.steffens@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We drop the extent map tree lock while we're logging extents, so somebody
could come in and merge another extent into this one and screw up our
logging, or they could even remove us from the list which would keep us from
logging the extent or freeing our ref on it, so we need to make sure to not
clear LOGGING until after the extent is logged, and then we can merge it to
adjacent extents. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Commit 3fed40cc ("Btrfs: cleanup duplicated division functions"), which
was merged into 3.8-rc1, has introduced a regression by removing logic
that was guarding us against bad user input. Bring it back.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently you can just destroy a qgroup even though it is in use by other qgroups
or has qgroups assigned to it. This patch prevents destruction of qgroups unless
they are completely unused. Otherwise destroy will return EBUSY.
Reported-by: Eric Hopper <hopper@omnifarious.org>
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If a qgroup that has still assignments is deleted by the user, the corresponding
relations are left in the tree. This leads to an unmountable filesystem.
With this patch, those relations are simple ignored.
Reported-by: Eric Hopper <hopper@omnifarious.org>
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Operation-specific check (whether subvol is readonly or not) should go
after the mutual exclusiveness check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The error code that is returned in response to starting a mutually
exclusive operation when there is one already running got silently
changed from EINVAL to EINPROGRESS by 5ac00add. Returning EINPROGRESS
to, say, add_dev, when rm_dev is running is misleading. Furthermore,
the operation itself may want to use EINPROGRESS for other purposes.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Balance pause/resume logic got broken by 5ac00add (went in into 3.8-rc1
as part of dev-replace merge). Offending commit took a stab at making
mutually exclusive volume operations (add_dev, rm_dev, resize, balance,
replace_dev) not block behind volume_mutex if another such operation is
in progress and instead return an error right away. Balancing front-end
relied on the blocking behaviour, so the fix is ugly, but short of a
complete rework, it's the best we can do.
Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
truncate() vs. ftruncate() differ in the VFS; truncate()
doesn't set (ATTR_CTIME | ATTR_MTIME), and it's up to the
fs to do the timestamp updates if the size changes.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
btrfs_cont_expand() tries to free an IS_ERR em as it gets an error from
btrfs_get_extent() and breaks out of its loop.
An instance of -EEXIST was reported in the wild:
https://bugzilla.redhat.com/show_bug.cgi?id=874407
I have no idea if that -EEXIST is surprising, or not. Regardless, this
error handling should be cleaned up to handle other reasonable errors
(ENOMEM, EIO; whatever).
This seemed to be the only buggy freeing of the relatively rare IS_ERR
em so I opted to fix the caller rather than teach free_extent_map() to
use IS_ERR_OR_NULL().
Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
xfstests case 285 complains.
It it because btrfs did not try to find unwritten delalloc
bytes(only dirty pages, not yet writeback) behind prealloc
extents, it ends up finding nothing while we're with SEEK_DATA.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We forgot to reset the path lock state to zero after we unlock the path block,
and this can lead to the ASSERT checker in tree unlock API.
Reported-by: Slava Barinov <rayslava@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This'd avoid us empty looping.
Say we have only one disk and the metadata raid type will be defaultly DUP,
and we do not need to start from index=0(RAID10) and get over two empty
loops to index=2(DUP).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Running xfstests 83 in a loop would sometimes fail the fsck. This happens
because if we invalidate a page that already has an ordered extent setup for
it we will complete the ordered extent ourselves, assuming that the truncate
will clean everything up. The problem with this is there is plenty of time
for the truncate to fail after we've done this work. So to fix this we need
to add the orphan item first to make sure the cleanup gets done properly,
and then we can truncate the pagecache and all that stuff and be safe. This
fixes the btrfsck failures I was seeing while running 83 in a loop. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We still need to say we're flushing if we're limit flushing to keep somebody
from coming in and stealing our reservation. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We forget to give up the write access after we find some device operation
is going on. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Step to reproduce:
# mkfs.btrfs <disk>
# mount <disk> <mnt>
# btrfs sub create <mnt>/subv0
# btrfs sub snap <mnt> <mnt>/subv0/snap0
# change <mnt>/subv0 from R/W to R/O
# btrfs sub del <mnt>/subv0/snap0
We deleted the snapshot successfully. I think we should not be able to delete
the snapshot since the parent subvolume is R/O.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
When we're deleting the device we should get it in write mode since
we're going to re-write the super block magic on that device. And it
should fail if the device is read-only.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
writeback_inodes_sb(_nr)_if_idle() is re-implemented by replacing down_read()
with down_read_trylock() because
- If ->s_umount is write locked, then the sb is not idle. That is
writeback_inodes_sb(_nr)_if_idle() needn't wait for the lock.
- writeback_inodes_sb(_nr)_if_idle() grabs s_umount lock when it want to start
writeback, it may bring us deadlock problem when doing umount. In order to
fix the problem, ext4 and btrfs implemented their own writeback functions
instead of writeback_inodes_sb(_nr)_if_idle(), but it introduced the redundant
code, it is better to implement a new writeback_inodes_sb(_nr)_if_idle().
The name of these two functions is cumbersome, so rename them to
try_to_writeback_inodes_sb(_nr).
This idea came from Christoph Hellwig.
Some code is from the patch of Kamal Mostafa.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
In the big loop, cur_trans will be set fs_info->running_transaction
before it's used. And after kmem_cache_free it and goto loop, it will
be setup again. No need to setup it immediately after freed.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Pull VFS update from Al Viro:
"fscache fixes, ESTALE patchset, vmtruncate removal series, assorted
misc stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (79 commits)
vfs: make lremovexattr retry once on ESTALE error
vfs: make removexattr retry once on ESTALE
vfs: make llistxattr retry once on ESTALE error
vfs: make listxattr retry once on ESTALE error
vfs: make lgetxattr retry once on ESTALE
vfs: make getxattr retry once on an ESTALE error
vfs: allow lsetxattr() to retry once on ESTALE errors
vfs: allow setxattr to retry once on ESTALE errors
vfs: allow utimensat() calls to retry once on an ESTALE error
vfs: fix user_statfs to retry once on ESTALE errors
vfs: make fchownat retry once on ESTALE errors
vfs: make fchmodat retry once on ESTALE errors
vfs: have chroot retry once on ESTALE error
vfs: have chdir retry lookup and call once on ESTALE error
vfs: have faccessat retry once on an ESTALE error
vfs: have do_sys_truncate retry once on an ESTALE error
vfs: fix renameat to retry on ESTALE errors
vfs: make do_unlinkat retry once on ESTALE errors
vfs: make do_rmdir retry once on ESTALE errors
vfs: add a flags argument to user_path_parent
...
Pull two btrfs reverts from Chris Mason:
"I had missed that for two of the patches in my last pull, we had
included different fixes during 3.7."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Revert "Btrfs: reorder tree mod log operations in deleting a pointer"
Revert "Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritems"
The code that relied on that flag was ripped out of btrfs quite some
time ago, and never added back. Josef indicated that he was going to
take a different approach to the problem in btrfs, and that we
could just eliminate this flag.
Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This reverts commit 6a7a665d78.
This was bug was fixed differently in 3.6, so this commit
isn't needed.
Conflicts:
fs/btrfs/ctree.c
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This reverts commit 95c80bb1f6.
The bug addressed by this commit was fixed differently back in 3.6
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs update from Chris Mason:
"A big set of fixes and features.
In terms of line count, most of the code comes from Stefan, who added
the ability to replace a single drive in place. This is different
from how btrfs normally replaces drives, and is much much much faster.
Josef is plowing through our synchronous write performance. This pull
request does not include the DIO_OWN_WAITING patch that was discussed
on the list, but it has a number of other improvements to cut down our
latencies and CPU time during fsync/O_DIRECT writes.
Miao Xie has a big series of fixes and is spreading out ordered
operations over more CPUs. This improves performance and reduces
contention.
I've put in fixes for error handling around hash collisions. These
are going back to individual stable kernels as I test against them.
Otherwise we have a lot of fixes and cleanups, thanks everyone!
raid5/6 is being rebased against the device replacement code. I'll
have it posted this Friday along with a nice series of benchmarks."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits)
Btrfs: fix a bug of per-file nocow
Btrfs: fix hash overflow handling
Btrfs: don't take inode delalloc mutex if we're a free space inode
Btrfs: fix autodefrag and umount lockup
Btrfs: fix permissions of empty files not affected by umask
Btrfs: put raid properties into global table
Btrfs: fix BUG() in scrub when first superblock reading gives EIO
Btrfs: do not call file_update_time in aio_write
Btrfs: only unlock and relock if we have to
Btrfs: use tokens where we can in the tree log
Btrfs: optimize leaf_space_used
Btrfs: don't memset new tokens
Btrfs: only clear dirty on the buffer if it is marked as dirty
Btrfs: move checks in set_page_dirty under DEBUG
Btrfs: log changed inodes based on the extent map tree
Btrfs: add path->really_keep_locks
Btrfs: do not mark ems as prealloc if we are writing to them
Btrfs: keep track of the extents original block length
Btrfs: inline csums if we're fsyncing
Btrfs: don't bother copying if we're only logging the inode
...
But the kernel decided to call it "origin" instead. Fix most of the
sites.
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Users report a bug, the reproducer is:
$ mkfs.btrfs /dev/loop0
$ mount /dev/loop0 /mnt/btrfs/
$ mkdir /mnt/btrfs/dir
$ chattr +C /mnt/btrfs/dir/
$ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=10;
$ lsattr /mnt/btrfs/dir/foo
---------------C- /mnt/btrfs/dir/foo
$ filefrag /mnt/btrfs/dir/foo
/mnt/btrfs/dir/foo: 1 extent found ---> an extent
$ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=1 seek=5 conv=notrunc,nocreat; sync
$ filefrag /mnt/btrfs/dir/foo
/mnt/btrfs/dir/foo: 3 extents found ---> with nocow, btrfs breaks the extent into three parts
The new created file should not only inherit the NODATACOW flag, but also
honor NODATASUM flag, because we must do COW on a file extent with checksum.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The handling for directory crc hash overflows was fairly obscure,
split_leaf returns EOVERFLOW when we try to extend the item and that is
supposed to bubble up to userland. For a while it did so, but along the
way we added better handling of errors and forced the FS readonly if we
hit IO errors during the directory insertion.
Along the way, we started testing only for EEXIST and the EOVERFLOW case
was dropped. The end result is that we may force the FS readonly if we
catch a directory hash bucket overflow.
This fixes a few problem spots. First I add tests for EOVERFLOW in the
places where we can safely just return the error up the chain.
btrfs_rename is harder though, because it tries to insert the new
directory item only after it has already unlinked anything the rename
was going to overwrite. Rather than adding very complex logic, I added
a helper to test for the hash overflow case early while it is still safe
to bail out.
Snapshot and subvolume creation had a similar problem, so they are using
the new helper now too.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Pascal Junod <pascal@junod.info>
This confuses and angers lockdep even though it's ok. We don't really need
the lock for free space inodes since only the transaction committer will be
reserving space. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This happens because writeback_inodes_sb_nr_if_idle does down_read. This
doesn't work for us and it has not been fixed upstream yet, so do it
ourselves and use that instead so we can stop having this stupid long
standing lockup. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When a new file is created with btrfs_create(), the inode will initially be
created with permissions 0666 and later on in btrfs_init_acl() it will be
adapted to mask out the umask bits. The problem is that this change won't make
it into the btrfs_inode unless there's another change to the inode (e.g. writing
content changing the size or touching the file changing the mtime.)
This fix adds a call to btrfs_update_inode() to btrfs_create() to make sure that
the change will not get lost if the in-memory inode is flushed before other
changes are made to the file.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Raid properties can be shared among raid calculation code, we can put
them into a global table to keep it simple.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This fixes a very special case that can be reproduced by just
disconnecting a disk at runtime, and without unmounting the
filesystem first, start scrub on the filesystem with the
disconnected disk. All read and write EIOs are handled
correctly, only the first superblock is an exception and gives
a BUG() in a subfunction. The BUG() is correct, it would crash
later otherwise. The subfunction must not be called for
superblocks and this is what the fix changes.
Reported-by: Joeri Vanthienen <mail@joerivanthienen.be>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This starts a transaction and dirties the inode everytime we call it, which
is super expensive if you have a write heavy workload. We will be updating
the inode when the IO completes and we reserve the space for the inode
update when we reserve space for the write, so there is no chance of loss of
information or enospc issues. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I noticed while doing fsync tests that we were always dropping the path and
re-searching when we first cow the log root even though we've already gotten
the write lock on the root. That's because we don't take into account that
there might not be a parent node, so fix the check to make sure there is
actually a parent node before we undo all of this work for nothing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we are syncing over and over the overhead of doing all those maps in
fill_inode_item and log_changed_extents really starts to hurt, so use map
tokens so we can avoid all the extra mapping. Since the token maps from our
offset to the end of the page make sure to set the first thing in the item
first so we really only do one map. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This gets called at least 4 times for every level while adding an object,
and it involves 3 kmapping calls, which on my box take about 5us a piece.
So instead use a token, which brings us down to 1 kmap call and makes this
function take 1/3 of the time per call. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Our token logic depends on token->kaddr being set, and if it is not it sets
everything properly as needed. So instead of memsetting just set
token->kaddr to NULL. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
No reason to set the path blocking or loop through all of the pages if the
extent buffer isn't actually marked dirty. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is a high traffic function, let's try and do as little as possible
during normal operations shall we?
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We don't really need to copy extents from the source tree since we have all
of the information already available to us in the extent_map tree. So
instead just write the extents straight to the log tree and don't bother to
copy the extent items from the source tree.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
You'd think path->keep_locks would keep all the locks wouldn't you? You'd
be wrong. It only keeps them if the slot is pointing to the last item in
the node. This is for use with btrfs_next_leaf, which needs this sort of
thing. But the horrible horrible things I'm going to do to the tree log
means I really need everything held from root to leaf so I can add and
delete items in the same search. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We are going to use EM's to log extents in the future, so we need to not
mark them as prealloc if they aren't actually prealloc extents. Instead
mark them with FILLING so we know to ammend mod_start/mod_len and that way
we don't confuse the extent logging code. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we've written to a prealloc extent we need to know the original block len
for the extent. We can't figure this out currently since ->block_len is
just set to the extent length. So introduce ->orig_block_len so that we
know how many bytes were in the original extent for proper extent logging
that future patches will need. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The tree logging stuff needs the csums to be on the ordered extents in order
to log them properly, so mark that we're sync and inline the csum creation
so we don't have to wait on the csumming to be done when logging extents
that are still in flight. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We don't copy inode items anwyay, we just copy them straight into the log
from the in memory inode. So if we know we're only logging the inode, don't
bother dropping anything, just try to insert it and either if it succeeds or
we get EEXIST we can update the inode item in the log and carry on. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently we copy all the file information into the log, inode item, the
refs, xattrs etc. Except most of this doesn't change from fsync to fsync,
just the inode item changes. So set a flag if an xattr changes or a link is
added, and otherwise only log the inode item. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Originally root_times_lock was introduced as part of send/receive
code however newly developed patch to label the subvol reused
the same lock, so renaming it for a meaningful name.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently udev does not know about the device being removed from the
file system. This may result in the situation where we're unable to
mount the file system by UUID or by LABEL because the by-uuid and
by-label links may still point to the device which is no longer part of
the btrfs file system and hence does not have any btrfs super block.
It can be easily reproduced by the following:
mkfs.btrfs -L bugfs /dev/loop[0-6]
mount /dev/loop0 /mnt/test
btrfs device delete /dev/loop0 /mnt/test
umount /mnt/test
mount LABEL=bugfs /mnt/test <---- this fails
then see:
ls -l /dev/disk/by-label/bugfs
which will still point to the /dev/loop0
We did not noticed this before because libblkid would send the udev
event for us when it notice that the link does not fit the reality,
however it does not do that anymore and completely relies on udev
information.
Fix this by sending the KOBJ_CHANGE event to the bdev kobject after
successful device removal.
Note that this does not affect device addition, because we will open the
device prior the addition from userspace and udev will notice that and
reread the device afterwards.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
ret variant may be set to 0 if we read page successfully, but it might be
released before we lock it again. On this case, if we fail to allocate a
new page, we will return 0, it is wrong, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Since we can pre-allocate the space past EOF, we should be able to reclaim
that space if we need. This patch implements it by removing the EOF check.
Though the manual of fallocate command says we can use truncate command to
reclaim the pre-allocated space which past EOF, but because truncate command
changes the file size, we must run several commands to reclaim the space if we
don't want to change the file size, so it is not a good choice.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Steps to reproduce:
# mkfs.btrfs <disk>
# mount <disk> <mnt>
# dd if=/dev/zero of=<mnt>/<file> bs=512 seek=5 count=8
# fallocate -p -o 2048 -l 16384 <mnt>/<file>
# dd if=/dev/zero of=<mnt>/<file> bs=4096 seek=3 count=8 conv=notrunc,nocreat
# umount <mnt>
# dmesg
WARNING: at fs/btrfs/inode.c:7140 btrfs_destroy_inode+0x2eb/0x330
The reason is that we inputed a range which is beyond the end of the file. And
because the end of this range was not page-aligned, we had to truncate the last
page in this range, this operation is similar to a buffered file write. In other
words, we reserved enough space and clear the data which was in the hole range
on that page. But when we expanded that test file, write the data into the same
page, we forgot that we have reserved enough space for the buffered write of
that page because in most cases there is no page that is beyond the end of
the file. As a result, we reserved the space twice.
In fact, we needn't truncate the page if it is beyond the end of the file, just
release the allocated space in that range. Fix the above problem by this way.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
(start + len) is the start of the adjacent extent, not the end of the current
extent, so we should not use it to check the hole is on the same page or not.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We forget to release the reserved space in the error path of delalloc
reservatiom, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we runt the direct IO, we should not run auto defrag, because it may
introduce buffered IO vs direcIO problem, and make direct IO slow down.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We should use ctl->unit for free space calculation instead of block_group->sectorsize
even though for free space use_bitmap or free space cluster we only have sectorsize assigned to ctl->unit currently. Also, we can keep it consisten in code style.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Refactor it by checking whether the inode has been created and needs to be
dropped (drop_inode_on_err) and also if the err variable is set. That way the
variable doesn't need to be set on each and every error handling block.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When a new file is created with btrfs_create(), the inode will initially be
created with permissions 0666 and later on in btrfs_init_acl() it will be
adapted to mask out the umask bits. The problem is that this change won't make
it into the btrfs_inode unless there's another change to the inode (e.g. writing
content changing the size or touching the file changing the mtime.)
This fix adds a call to btrfs_update_inode() to btrfs_create() to make sure that
the change will not get lost if the in-memory inode is flushed before other
changes are made to the file.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When the flag not supported is specified, it is necessary to return the error
to the caller.
So, we add the validity check of the fiemap's flag.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Passing a null extended attribute value means to remove the attribute,
but we don't have to add a new NULL extended attribute.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If the acl can be exactly represented in the traditional file
mode permission bits, we don't set another acl attribute.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
alloc_end is not the real end of the current extent, it is the start of the
next adjoining extent. So we needn't +1 when calculating the size the space
that is about to be reserved.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The kernel developers have implemented some often-used align macros, we should
use them instead of the complex code.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This regression was introduced by the device-replace patches.
Scrub immediately stops checking those disks that have write errors.
This is nothing that happens in the real world, but it is wrong
since scrub is the tool to detect and repair defects. Fix it.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This issue was detected by the "0-DAY kernel build testing".
fs/btrfs/volumes.c: In function 'btrfs_rm_device':
fs/btrfs/volumes.c:1505:1: warning: label 'error_close' defined but not used [-Wunused-label]
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The structure member mirror_num is modified concurrently to the
structure member is_iodone. This doesn't require any locking by
design, unless everything is stored in the same 32 bits of a
bit field. This was the case and xfstest 284 was able to
trigger false warnings from the checker code. This patch
seperates the bits and fixes the race.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we freeze the fs, the auto defragment should not run. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch restructure btrfs_run_defrag_inodes() and make the code of the auto
defragment more readable.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We forget to get the defrag lock when we re-add the defragable inode,
Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The auto defrag allocation is in the fast path of the IO, so use slabs
to improve the speed of the allocation.
And besides that, it can do check for leaked objects when the module is removed.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We need get write access for qgroup operations, or we will modify the R/O fs.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We need get write access for scrub, or we will modify the R/O fs.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Steps to reproduce:
# mkfs.btrfs -d single -m single <disk0> <disk1>
# mount -o ro <disk0> <mnt0>
# mount -o ro <disk0> <mnt1>
# mount -o remount,rw <mnt0>
# umount <mnt0>
# btrfs device delete <disk1> <mnt1>
We can remove a device from a R/O filesystem. The reason is that we just check
the R/O flag of the super block object. It is not enough, because the kernel
may set the R/O flag only for the mount point. We need invoke
mnt_want_write_file()
to do a full check.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Steps to reproduce:
# mkfs.btrfs <partition>
# mount -o ro <partition> <mnt0>
# mount -o ro <partition> <mnt1>
# mount -o remount,rw <mnt0>
# umount <mnt0>
# btrfs fi resize 10g <mnt1>
We re-sized a R/O filesystem. The reason is that we just check the R/O flag
of the super block object. It is not enough, because the kernel may set the
R/O flag only for the mount point. We need invoke mnt_want_write_file() to
do a full check.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When wen want to set the default subvolume, we must get write access, or
we will change the R/O file system.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If the id of the existed transaction is more than the one we specified, it
means the specified transaction was commited, so we should return 0, not
EINVAL.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If there is no running transaction in the fs, we needn't start a new one when
we want to start sync.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Since we have gotten the root in the caller, just pass it into
btrfs_ioctl_{start, wait}_sync() directly.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we found an invalid xattr dir item, we'd better try the next one instead.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
io_ctl_map_page is called by many functions in free-space-cache.
In most scenarios, the ->cur is not null, e.g. io_ctl_add_entry.
I think we'd better remove the warn_on here.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is the commit that allows to start the device replace
procedure.
An ioctl() interface is added that supports starting and
canceling the device replace procedure, and to retrieve
the status and progress.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull trivial branch from Jiri Kosina:
"Usual stuff -- comment/printk typo fixes, documentation updates, dead
code elimination."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
HOWTO: fix double words typo
x86 mtrr: fix comment typo in mtrr_bp_init
propagate name change to comments in kernel source
doc: Update the name of profiling based on sysfs
treewide: Fix typos in various drivers
treewide: Fix typos in various Kconfig
wireless: mwifiex: Fix typo in wireless/mwifiex driver
messages: i2o: Fix typo in messages/i2o
scripts/kernel-doc: check that non-void fcts describe their return value
Kernel-doc: Convention: Use a "Return" section to describe return values
radeon: Fix typo and copy/paste error in comments
doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
various: Fix spelling of "asynchronous" in comments.
Fix misspellings of "whether" in comments.
eisa: Fix spelling of "asynchronous".
various: Fix spelling of "registered" in comments.
doc: fix quite a few typos within Documentation
target: iscsi: fix comment typos in target/iscsi drivers
treewide: fix typo of "suport" in various comments and Kconfig
treewide: fix typo of "suppport" in various comments
...
Make the target disk of a running device replace operation
available for reading. This is only used as a last ressort for
the defect repair procedure. And it is dependent on the location
of the data block to read, because during an ongoing device
replace operation, the target drive is only partially filled
with the filesystem data.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This change of the define is effective in all modes, it
is required and used only in the case when a device replace
procedure is running. The reason is that during an active
device replace procedure, the target device of the copy
operation is a mirror for the filesystem data as well that
can be used to read data in order to repair read errors on
other disks.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
It is desirable to be able to configure the device replace
procedure to avoid reading the source drive (the one to be
copied) whenever possible. This is useful when the number of
read errors on this disk is high, because it would delay the
copy procedure alot. Therefore there is an option to avoid
reading from the source disk unless the repair procedure
really needs to access it. The regular read req asks for
mapping the block with mirror_num == 0, in this case the
source disk is avoided whenever possible. The repair code
selects the mirror_num explicitly (mirror_num != 0), this
case is not changed by this commit.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
During a running dev replace operation, all write requests to
the live filesystem are duplicated to also write to the target
drive. Therefore btrfs_map_block() is changed to duplicate
stripes that are written to the source disk of a device replace
procedure to be written to the target disk as well.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Before this commit, btrfs_map_block() was called with REQ_WRITE
in order to retrieve the list of mirrors for a disk block.
This needs to be changed for the device replace procedure since
it makes a difference whether you are asking for read mirrors
or for locations to write to.
GET_READ_MIRRORS is introduced as a new interface to call
btrfs_map_block().
In the current commit, the functionality is not yet changed,
only the interface for GET_READ_MIRRORS is introduced and all
the places that should use this new interface are adapted.
The reason that REQ_WRITE cannot be abused anymore to retrieve
a list of read mirrors is that during a running dev replace
operation all write requests to the live filesystem are
duplicated to also write to the target drive.
Keep in mind that the target disk is only partially a valid
copy of the source disk while the operation is ongoing. All
writes go to the target disk, but not all reads would return
valid data on the target disk. Therefore it is not possible
anymore to abuse a REQ_WRITE interface to find valid mirrors
for a REQ_READ.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This commit contains all the essential changes to the core code
of Btrfs for support of the device replace procedure.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This adds a new file to the sources together with the header file
and the changes to ioctl.h and ctree.h that are required by the
new C source file. Additionally, 4 new functions are added to
volume.c that deal with device creation and destruction.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit adds code to scrub to allow the scrub code to copy read
data to another disk.
One goal is to be able to perform as fast as possible. Therefore the
write requests are collected until huge bios are built, and the
write process is decoupled from the read process with some kind of
flow control, of course, in order to limit the allocated memory.
The best performance on spinning disks could by reached when the
head movements are avoided as much as possible. Therefore a single
worker is used to interface the read process with the write process.
The regular scrub operation works as fast as before, it is not
negatively influenced and actually it is more or less unchanged.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
With the addition of the device replace procedure, it is possible
for btrfs_map_bio(READ) to report an error. This happens when the
specific mirror is requested which is located on the target disk,
and the copy operation has not yet copied this block. Hence the
block cannot be read and this error state is indicated by
returning EIO.
Some background information follows now. A new mirror is added
while the device replace procedure is running.
btrfs_get_num_copies() returns one more, and
btrfs_map_bio(GET_READ_MIRROR) adds one more mirror if a disk
location is involved that was already handled by the device
replace copy operation. The assigned mirror num is the highest
mirror number, e.g. the value 3 in case of RAID1.
If btrfs_map_bio() is invoked with mirror_num == 0 (i.e., select
any mirror), the copy on the target drive is never selected
because that disk shall be able to perform the write requests as
quickly as possible. The parallel execution of read requests would
only slow down the disk copy procedure. Second case is that
btrfs_map_bio() is called with mirror_num > 0. This is done from
the repair code only. In this case, the highest mirror num is
assigned to the target disk, since it is used last. And when this
mirror is not available because the copy procedure has not yet
handled this area, an error is returned. Everywhere in the code
the handling of such errors is added now.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch adds some code to disallow operations on the device that
is used as the target for the device replace operation.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Btrfs admin operations that are manually started from user mode
and that cannot be executed at the same time return -EINPROGRESS.
A common way to enter and leave this locked section is introduced
since it used to be specific to the balance operation.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Remove the attempt to cancel a running scrub or device replace
operation in btrfs_handle_error() because it adds the risk of
a deadlock. The only penalty of not canceling the operation is
that some I/O remains active until the procedure completes.
This is basically the same thing that happens to other tasks
that are running in user mode context, they are not affected
or stopped in btrfs_handle_error(), these tasks just need to
handle write errors correctly.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
A small number of functions that are used in a device replace
procedure when the operation is resumed at mount time are unable
to pass the same root pointer that would be used in the regular
(ioctl) context. And since the root pointer is not required, only
the fs_info is, the root pointer argument is replaced with the
fs_info pointer argument.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This new function is used by the device replace procedure in
a later patch.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is required for the device replace procedure in a later step.
Two calling functions also had to be changed to have the fs_info
pointer: repair_io_failure() and scrub_setup_recheck_block().
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is required for the device replace procedure in a later step.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The new function btrfs_find_device_missing_or_by_path() will be
used for the device replace procedure. This function itself calls
the second new function btrfs_find_device_by_path().
Unfortunately, it is not possible to currently make the rest of the
code use these functions as well, since all functions that look
similar at first view are all a little bit different in what they
are doing. But in the future, new code could benefit from these
two new functions, and currently, device replace uses them.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Some code to open block devices, to read the superblock and to
handle errors was repeated multiple times in 3 places, and the
following patch makes use of it as well. This code is now moved
into a subfunction.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Just move some code into functions to make everything more readable.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In the scrub repair code, the code is changed to handle memory
allocation errors a little bit smarter. The change is to handle
it just like a read error. This simplifies the code and removes
a couple of lines of code, since the code to handle read errors
is there anyway.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In case that disk blocks need to be repaired (rewritten), the
current code at first (for simplicity reasons) reads all alternate
mirrors in the first step, afterwards selects the best one in a
second step. This is now changed to read one alternate mirror
after the other and to leave the loop early when a perfect mirror
is found.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
With the modified design (in order to support the devive replace
procedure) it is necessary to alloc the page array dynamically.
The reason is that pages are reused. At first a page is used for
the bio to read the data from the filesystem, then the same page
is reused for the bio that writes the data to the target disk.
Since the read process and the write process are completely
decoupled, this requires a new concept of refcounts and get/put
functions for pages, and it requires to use newly created pages
for each read bio which are freed after the write operation
is finished.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The block device is removed from the scrub context state structure.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in the scrub
context struct and moved into the lower level scope of scrub_bio,
fixup and page structures where the block device context is known.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Since we've kill the bigger one volume_mutex, we need to add devices
list mutex back.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
- 'nr' is no more used.
- btrfs_btree_balance_dirty() and __btrfs_btree_balance_dirty() can share
a bunch of code.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When __merge_refs merges two refs, it is also needed to merge the
inode_list of both refs. Otherwise we have missed backrefs and memory
leaks. This happens for example if two inodes share an extent and
both lie in the same leaf and thus also have the same parent.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Even if the hole punching is executed, the modification time of the
file is not updated.
So, current time is set to inode.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Someone who is root or capable(CAP_SYS_ADMIN) could corrupt the
superblock and make Btrfs printk("%s") crash while holding the
uuid_mutex since nobody forces a limit on the string. Since the
uuid_mutex is significant, the system would be unusable
afterwards.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When creating a snapshot, failing to commit a transaction can end up
with aborting the transaction, following by doing a cleanup for it, where
we'll free all snapshots pending to disk.
So we check it and avoid double free on pending snapshots.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When committing a transaction, we may bail out of running delayed refs
due to ENOSPC, and then abort the current transaction to flip into readonly.
But we'll hit a deadlock on ref head's lock since we forget to release
its lock and other cleanup stuff.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Just use WARN_ON rather than an if containing only WARN_ON(1).
A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression e;
@@
- if (e) WARN_ON(1);
+ WARN_ON(e);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Use WARN rather than printk followed by WARN_ON(1), for conciseness.
A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression list es;
@@
-printk(
+WARN(1,
es);
-WARN_ON(1);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we set BTRFS_INODE_NEEDS_FULL_SYNC, we should log all the extent,
but now we forget to take it into account, and set a wrong max key,
if so, we will skip the file extent metadata when doing logging. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We forget to protect the modified_extents list, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
There are two types of the file extent - inline extent and regular extent,
When we log file extents, we didn't take inline extent into account, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Consider the following case:
Task1 Task2
start_transaction
commit_transaction
check pending snapshots list and the
list is empty.
add pending snapshot into list
skip the delalloc flush
end_transaction
...
And then the problem that the snapshot is different with the source subvolume
happen.
This patch fixes the above problem by flush all pending stuffs when all the
other tasks end the transaction.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we flush inodes with pending delalloc in a transaction, we may join
the same transaction handler more than 2 times.
The reason is:
Task use_count of trans handle
commit_transaction 1
|-> btrfs_start_delalloc_inodes 1
|-> run_delalloc_nocow 1
|-> join_transaction 2
|-> cow_file_range 2
|-> join_transaction 3
In fact, cow_file_range needn't join the transaction again because the caller
have joined the transaction, so we fix this problem by this way.
Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
btrfs_wait_ordered_range expects for 'len' instead of 'end'.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When we log new names, we need to log just enough to recreate the inode
during log replay, and there is no need to log extents along with it.
This actually fixes a bug revealed by xfstests 241, where it shows
that we're logging some extents that have not updated metadata,
so we don't get proper EXTENT_DATA items to be copied to log tree.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The current behavior is to allow mounting or remounting a filesystem
writeable in degraded mode if at least one writeable device is
present.
The next failed write access to a missing device which is above
the tolerance of the configured level of redundancy results in an
read-only enforcement. Even without this, the next time
barrier_all_devices() is called and more devices are missing than
tolerable, the switch to read-only mode takes place.
In order to behave predictably and to provide proper feedback to
the user at mount time, this patch compares the number of missing
devices with the number of devices that are tolerated to be missing
according to the configured RAID level. If more devices are missing
than tolerated, e.g. if two devices are missing in case of RAID1,
only a read-only mount and remount is allowed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Remove an invalid size check up from btrfs_shrink_dev().
The new size should not larger than the device->total_bytes as it was
already verified before coming to here(i.e. new_size < old_size).
Remove invalid check up for btrfs_shrink_dev().
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
There is no reason to pass the nr_pages_dirtied argument, because
nr_pages_dirtied value from the caller is unused in
balance_dirty_pages_ratelimited_nr().
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: Vivek Trivedi <vtrivedi018@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Though the process of the ordered extents is a bit different with the delalloc inode
flush, but we can see it as a subset of the delalloc inode flush, so we also handle
them by flush workers.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The process of the ordered operations is similar to the delalloc inode flush, so
we handle them by flush workers.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch introduce a new worker pool named "flush_workers", and if we
want to force all the inode with pending delalloc to the disks, we can
queue those inodes into the work queue of the worker pool, in this way,
those inodes will be flushed by multi-task.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Dave gave me an image of a very full file system that would abort the
transaction because it ran out of space while committing the transaction.
This is because we would think there was plenty of room to create a snapshot
even though the global reserve was not full. This happens because we
calculate the global reserve size before we unpin any space, so after we
unpin the space we allow reservations to occur even though we haven't
reserved all of the space for our global reserve. Fix this by adding to the
global reserve while unpinning in order to make sure we always have enough
space to do our work. With this patch we no longer end up with an aborted
transaction, we return ENOSPC properly to the person trying to create the
snapshot. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The argument 'tree_mod_log' is not necessary since all of callers enable it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Since we don't use MOD_LOG_KEY_REMOVE_WHILE_MOVING to add nritems
during rewinding, we should insert a MOD_LOG_KEY_REMOVE operation first.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Key MOD_LOG_KEY_REMOVE_WHILE_MOVING means that we're doing memmove inside
an extent buffer node, and the node's number of items remains unchanged
(unless we are inserting a single pointer, but we have MOD_LOG_KEY_ADD for that).
So we don't need to increase node's number of items during rewinding,
otherwise we may get an node larger than leafsize and cause general protection
errors later.
Here is the details,
- If we do memory move for inserting a single pointer, we need to
add node's nritems by one, and we honor MOD_LOG_KEY_ADD for adding.
- If we do memory move for deleting a single pointer, we need to
decrease node's nritems by one, and we honor MOD_LOG_KEY_REMOVE for
deleting.
- If we do memory move for balance left/right, we need to decrease
node's nritems, and we honor MOD_LOG_KEY_REMOVE for balaning.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When we find a bitmap free space entry, we may check the previous extent
entry covers the offset or not. But if we find this entry is also a bitmap
entry, we will continue to check the previous entry of the current one by
a while loop. It is unnecessary because it is impossible that the extent
entry which is in front of a bitmap entry can cover the offset of the entry
after that bitmap entry.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Alex reported a problem where we were writing between chunks on a rbd
device. The thing is we do bio_add_page using logical offsets, but the
physical offset may be different. So when we map the bio now check to see
if the bio is still ok with the physical offset, and if it is not split the
bio up and redo the bio_add_page with the physical sector. This fixes the
problem for Alex and doesn't affect performance in the normal case. Thanks,
Reported-and-tested-by: Alex Elder <elder@inktank.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The comment is not coincident with the code. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
div_factor{_fine} has been implemented for two times, cleanup it.
And I move them into a independent file named math.h because they are
common math functions.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
"Whether" is misspelled in various comments across the tree; this
fixes them. No code changes.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Pull btrfs fixes from Chris Mason:
"This has our series of fixes for the next rc. The biggest batch is
from Jan Schmidt, fixing up some problems in our subvolume quota code
and fixing btrfs send/receive to work with the new extended inode
refs."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: do not bug when we fail to commit the transaction
Btrfs: fix memory leak when cloning root's node
Btrfs: Use btrfs_update_inode_fallback when creating a snapshot
Btrfs: Send: preserve ownership (uid and gid) also for symlinks.
Btrfs: fix deadlock caused by the nested chunk allocation
btrfs: Return EINVAL when length to trim is less than FSB
Btrfs: fix memory leak in btrfs_quota_enable()
Btrfs: send correct rdev and mode in btrfs-send
Btrfs: extended inode refs support for send mechanism
Btrfs: Fix wrong error handling code
Fix a sign bug causing invalid memory access in the ino_paths ioctl.
Btrfs: comment for loop in tree_mod_log_insert_move
Btrfs: fix extent buffer reference for tree mod log roots
Btrfs: determine level of old roots
Btrfs: tree mod log's old roots could still be part of the tree
Btrfs: fix a tree mod logging issue for root replacement operations
Btrfs: don't put removals from push_node_left into tree mod log twice
We BUG if we fail to commit the transaction when creating a snapshot, which
is just obnoxious. Remove the BUG_ON(). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
After cloning root's node, we forgot to dec the src's ref
which can lead to a memory leak.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
On a really full file system I was getting ENOSPC back from
btrfs_update_inode when trying to update the parent inode when creating a
snapshot. Just use the fallback method so we can update the inode and not
have to worry about having a delayed ref. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This patch also requires a change in the user-space part of "receive".
We need to use "lchown" instead of "chown". We will do this in the
following patch.
Signed-off-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
if (S_ISREG(sctx->cur_inode_mode)) {
Steps to reproduce:
# mkfs.btrfs -m raid1 <disk1> <disk2>
# btrfstune -S 1 <disk1>
# mount <disk1> <mnt>
# btrfs device add <disk3> <disk4> <mnt>
# mount -o remount,rw <mnt>
# dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1
Deadlock happened.
It is because of the nested chunk allocation. When we wrote the data
into the filesystem, we would allocate the data chunk because there was
no data chunk in the filesystem. At the end of the data chunk allocation,
we should insert the metadata of the data chunk into the extent tree, but
there was no raid1 chunk, so we tried to lock the chunk allocation mutex to
allocate the new chunk, but we had held the mutex, the deadlock happened.
By rights, we would allocate the raid1 chunk when we added the second device
because the profile of the seed filesystem is raid1 and we had two devices.
But we didn't do that in fact. It is because the last step of the first device
insertion didn't commit the transaction. So when we added the second device,
we didn't cow the tree, and just inserted the relative metadata into the leaves
which were generated by the first device insertion, and its profile was dup.
So, I fix this problem by commiting the transaction at the end of the first
device insertion.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Currently if len argument in btrfs_ioctl_fitrim() is smaller than
one FSB we will continue and finally return 0 bytes discarded.
However if the length to discard is smaller then file system block
we should really return EINVAL.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
gcc says "warning: comparison of unsigned expression >= 0 is always
true" because i is an unsigned long. And gcc is right this time.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
To see the problem, create many hardlinks to the same file (120 should do it),
then look up paths by inode with:
ls -i
btrfs inspect inode-resolve -v $ino /mnt/btrfs
I noticed the memory layout of the fspath->val data had some irregularities
(some unnecessary gaps that stop appearing about halfway),
so I'm not sure there aren't any bugs left in it.
Emphasis the way tree_mod_log_insert_move avoids adding
MOD_LOG_KEY_REMOVE_WHILE_MOVING operations, depending on the direction of
the move operation.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
In get_old_root we grab a lock on the extent buffer before we obtain a
reference on that buffer. That order is changed now.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
In btrfs_find_all_roots' termination condition, we compare the level of the
old buffer we got from btrfs_search_old_slot to the level of the current
root node. We'd better compare it to the level of the rewinded root node.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Tree mod log treated old root buffers as always empty buffers when starting
the rewind operations. However, the old root may still be part of the
current tree at a lower level, with still some valid entries.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Avoid the implicit free by tree_mod_log_set_root_pointer, which is wrong in
two places. Where needed, we call tree_mod_log_free_eb explicitly now.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Independant of the check (push_items < src_items) tree_mod_log_eb_copy did
log the removal of the old data entries from the source buffer. Therefore,
we must not call tree_mod_log_eb_move if the check evaluates to true, as
that would log the removal twice, finally resulting in (rewinded) buffers
with wrong values for header_nritems.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Pull user namespace compile fixes from Eric W Biederman:
"This tree contains three trivial fixes. One compiler warning, one
thinko fix, and one build fix"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
btrfs: Fix compilation with user namespace support enabled
userns: Fix posix_acl_file_xattr_userns gid conversion
userns: Properly print bluetooth socket uids
When compiling with user namespace support btrfs fails like:
fs/btrfs/tree-log.c: In function ‘fill_inode_item’:
fs/btrfs/tree-log.c:2955:2: error: incompatible type for argument 3 of ‘btrfs_set_inode_uid’
fs/btrfs/ctree.h:2026:1: note: expected ‘u32’ but argument is of type ‘kuid_t’
fs/btrfs/tree-log.c:2956:2: error: incompatible type for argument 3 of ‘btrfs_set_inode_gid’
fs/btrfs/ctree.h:2027:1: note: expected ‘u32’ but argument is of type ‘kgid_t’
Fix this by using i_uid_read and i_gid_read in
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In order to accomodate retrying path-based syscalls, we need to add a
new "type" argument to audit_inode_child. This will tell us whether
we're looking for a child entry that represents a create or a delete.
If we find a parent, don't automatically assume that we need to create a
new entry. Instead, use the information we have to try to find an
existing entry first. Update it if one is found and create a new one if
not.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most of the callers get called with an inode and dentry in the reverse
order. The compiler then has to reshuffle the arg registers and/or
stack in order to pass them on to audit_inode_child.
Reverse those arguments for a micro-optimization.
Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull btrfs update from Chris Mason:
"This is a large pull, with the bulk of the updates coming from:
- Hole punching
- send/receive fixes
- fsync performance
- Disk format extension allowing more hardlinks inside a single
directory (btrfs-progs patch required to enable the compat bit for
this one)
I'm cooking more unrelated RAID code, but I wanted to make sure this
original batch makes it in. The largest updates here are relatively
old and have been in testing for some time."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (121 commits)
btrfs: init ref_index to zero in add_inode_ref
Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer
Btrfs: fix page leakage
Btrfs: do not warn_on when we cannot alloc a page for an extent buffer
Btrfs: don't bug on enomem in readpage
Btrfs: cleanup pages properly when ENOMEM in compression
Btrfs: make filesystem read-only when submitting barrier fails
Btrfs: detect corrupted filesystem after write I/O errors
Btrfs: make compress and nodatacow mount options mutually exclusive
btrfs: fix message printing
Btrfs: don't bother committing delayed inode updates when fsyncing
btrfs: move inline function code to header file
Btrfs: remove unnecessary IS_ERR in bio_readpage_error()
btrfs: remove unused function btrfs_insert_some_items()
Btrfs: don't commit instead of overcommitting
Btrfs: confirmation of value is added before trace_btrfs_get_extent() is called
Btrfs: be smarter about dropping things from the tree log
Btrfs: don't lookup csums for prealloc extents
Btrfs: cache extent state when writing out dirty metadata pages
Btrfs: do not hold the file extent leaf locked when adding extent item
...
In csum_dirty_buffer, we first get eb from page->private.
Then we check if the page is the first page of eb. Later
we check it again. Remove the repeated check here.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Alloc_dummy_extent_buffer will not free the first page in the eb array if we
fail to allocate a page, fix this. Thanks,
Reported-by: David Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
It's just annoying and the user will have gotten a nice OOM killer message
so they are already fully aware they are screwed :). Thanks,
Reported-by: Jérôme Poulin <jeromepoulin@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Get rid of the BUG_ON(ret == -ENOMEM) in __extent_read_full_page. Thanks,
Reported-by: Jérôme Poulin <jeromepoulin@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We were freeing non-existent pages which was causing a panic for a user who
was suffering from ENOMEM. This patch fixes the problem. Thanks,
Reported-by: Jérôme Poulin <jeromepoulin@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
So far the return code of barrier_all_devices() is ignored, which
means that errors are ignored. The result can be a corrupt
filesystem which is not consistent.
This commit adds code to evaluate the return code of
barrier_all_devices(). The normal btrfs_error() mechanism is used to
switch the filesystem into read-only mode when errors are detected.
In order to decide whether barrier_all_devices() should return
error or success, the number of disks that are allowed to fail the
barrier submission is calculated. This calculation accounts for the
worst RAID level of metadata, system and data. If single, dup or
RAID0 is in use, a single disk error is already considered to be
fatal. Otherwise a single disk error is tolerated.
The calculation of the number of disks that are tolerated to fail
the barrier operation is performed when the filesystem gets mounted,
when a balance operation is started and finished, and when devices
are added or removed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
In check-integrity, detect when a superblock is written that points
to blocks that have not been written to disk due to I/O write errors.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
If a filesystem is mounted with compression and then remounted by adding nodatacow,
the compression is disabled but the compress flag is still visible.
Also, if a filesystem is mounted with nodatacow and then remounted with compression,
nodatacow flag is still present but it's not active.
This patch:
- removes compress flags and notifies that the compression has been disabled if the
filesystem is mounted with nodatacow
- removes nodatacow and nodatasum flags if mounted with compress.
Signed-off-by: Andrei Popa <andrei.popa@i-neo.ro>
We can just copy the in memory inode into the tree log directly, no sense in
updating the fs tree so we can copy it into the tree log tree. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When building btrfs from kernel code, it will report:
fs/btrfs/extent_io.h:281: warning: 'extent_buffer_page' declared inline after being called
fs/btrfs/extent_io.h:281: warning: previous declaration of 'extent_buffer_page' was here
fs/btrfs/extent_io.h:280: warning: 'num_extent_pages' declared inline after being called
fs/btrfs/extent_io.h:280: warning: previous declaration of 'num_extent_pages' was here
because of the wrong declaration of inline functions.
Signed-off-by: Robin Dong <sanbai@taobao.com>
I don't think we have the same problem that this was supposed to fix
originally since we can allocate chunks in the enospc path now. This code
is causing us to constantly commit the transaction as we get close to using
all of our available space in our currently allocated chunks, instead of
allocating another chunk and carrying on with life, which is not nice for
performance. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We should confirm the value of extent_map before calling
trace_btrfs_get_extent() because the value of extent_map has the
possibility of NULL.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
When we truncate existing items in the tree log we've been searching for
each individual item and removing them. This is unnecessary churn and
searching, just keep track of the slot we are on and how many items we need
to delete and delete them all at once. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The tree logging stuff was looking up csums to copy over for prealloc
extents which is just work we don't need to be doing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Everytime we write out dirty pages we search for an offset in the tree,
convert the bits in the state, and then when we wait we search for the
offset again and clear the bits. So for every dirty range in the io tree we
are doing 4 rb searches, which is suboptimal. With this patch we are only
doing 2 searches for every cycle (modulo weird things happening). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
For some reason we unlock everything except the leaf we are on, set the path
blocking and then add the extent item for the extent we just finished
writing. I can't for the life of me figure out why we would want to do
this, and the history doesn't really indicate that there was a real reason
for it, so just remove it. This will reduce our tree lock contention on
heavy writes. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There are a coule scenarios where farming metadata csumming off to an async
thread doesn't help. The first is if our processor supports crc32c, in
which case the csumming will be fast and so the overhead of the async model
is not worth the cost. The other case is for our tree log. We will be
making that stuff dirty and writing it out and waiting for it immediately.
Even with software crc32c this gives me a ~15% increase in speed with O_SYNC
workloads. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
commit 7ca4be45a0 limited csum items to
PAGE_CACHE_SIZE. It used min() with incompatible types in 32bit which
generates warnings:
fs/btrfs/file-item.c: In function ‘btrfs_csum_file_blocks’:
fs/btrfs/file-item.c:717: warning: comparison of distinct pointer types lacks a cast
This uses min_t(u32,) to fix the warnings. u32 seemed reasonable
because btrfs_root->leafsize is u32 and PAGE_CACHE_SIZE is unsigned
long.
Signed-off-by: Zach Brown <zab@zabbo.net>
Running delayed refs is faster than running delalloc, so lets do that first
to try and reclaim space. This makes my fs_mark test about 20% faster.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
With the following debug patch:
static int btrfs_freeze(struct super_block *sb)
{
+ struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+ struct btrfs_transaction *trans;
+
+ spin_lock(&fs_info->trans_lock);
+ trans = fs_info->running_transaction;
+ if (trans) {
+ printk("Transid %llu, use_count %d, num_writer %d\n",
+ trans->transid, atomic_read(&trans->use_count),
+ atomic_read(&trans->num_writers));
+ }
+ spin_unlock(&fs_info->trans_lock);
return 0;
}
I found there was a orphan transaction after the freeze operation was done.
It is because the transaction may not be committed when the transaction handle
end even though it is the last handle of the current transaction. This design
avoid committing the transaction frequently, but also introduce the above
problem.
So I add btrfs_attach_transaction() which can catch the current transaction
and commit it. If there is no transaction, it will return ENOENT, and do not
anything.
This function also can be used to instead of btrfs_join_transaction_freeze()
because it don't increase the writer counter and don't start a new transaction,
so it also can fix the deadlock between sync and freeze.
Besides that, it is used to instead of btrfs_join_transaction() in
transaction_kthread(), because if there is no transaction, the transaction
kthread needn't anything.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
This patch add a type field into the transaction handle structure,
in this way, we needn't implement various end-transaction functions
and can make the code more simple and readable.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
This patch fixes memory leak of the transaction handle which happened
when starting transaction failed on a freezed fs.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
The iterate_irefs in backref.c is used to build path components from inode
refs. This patch adds code to iterate extended refs as well.
I had modify the callback function signature to abstract out some of the
differences between ref structures. iref_to_path() also needed similar
changes.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
This patch adds basic support for extended inode refs. This includes support
for link and unlink of the refs, which basically gets us support for rename
as well.
Inode creation does not need changing - extended refs are only added after
the ref array is full.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Move actual pte filling for non-linear file mappings into the new special
vma operation: ->remap_pages().
Filesystems must implement this method to get non-linear mapping support,
if it uses filemap_fault() then generic_file_remap_pages() can be used.
Now device drivers can implement this method and obtain nonlinear vma support.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com> #arch/tile
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Moved part of the code into a sub function and replaced most of the gotos
by ifs, hoping that it will be easier to read now.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
I started hitting warnings when running xfstest 68 in a loop because there
were EM's that were not lined up properly with the physical extents. This
is ok, if we do something like punch a hole or write to a preallocated space
or something like that we can have an EM that doesn't cover the entire
physical extent. So fix the tree logging stuff to cope with this case so we
don't just commit the transaction. With this patch I no longer see the
warnings from the tree logging code. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Call btrfs_abort_transaction as early as possible when an error
condition is detected, that way the line number reported is useful
and we're not clueless anymore which error path led to the abort.
Signed-off-by: David Sterba <dsterba@suse.cz>
The macro btrfs_abort_transaction() can get the line number of the code
where the problem happens, so we should invoke it in the place that the
error occurs, or we will lose the line number.
Reported-by: David Sterba <dave@jikos.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Btrfs uses inclusive range end for lock_extent(), unlock_extent() and
related functions, so we made off-by-one errors in file clone.
This fixes it and also fixes some style problems.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Hi,
the patch si simple, but it has user visible impact and I'm not quite sure how
to resolve it.
In short, $subj says it, chattr -C supports it and we want to use it.
The conditions that acutally allow to change the NOCOW flag are clear. What if
I try to set the flag on a file that is not empty? Options:
1) whole ioctl will fail, EINVAL
2.1) ioctl will succeed, the NOCOW flag will be silently removed, but the file
will stay COW-ed and checksummed
2.2) ioctl will succeed, flag will not be removed and a syslog message will
warn that the COW flag has not been changed
2.2.1) dtto, no syslog message
Man page of chattr states that
"If it is set on a file which already has data blocks, it is undefined when
the blocks assigned to the file will be fully stable."
Yes, it's undefined and with current implementation it'll never happen. So from
this end, the user cannot expect anything. I'm trying to find a reasonable
behaviour, so that a command like 'chattr -R -aijS +C' to tweak a broad set of
flags in a deep directory does not fail unnecessarily and does not pollute the
log.
My personal preference is 2.2.1, but my dev's oppinion is skewed, not counting
the fact that I know the code and otherwise would look there before consulting
the documentation.
The patch implements 2.2.1.
david
-------------8<-------------------
From: David Sterba <dsterba@suse.cz>
It's safe to turn off checksums for a zero sized file.
http://thread.gmane.org/gmane.comp.file-systems.btrfs/18030
"We cannot switch on NODATASUM for a file that already has extents that
are checksummed. The invariant here is that either all the extents or
none are checksummed.
Theoretically it's possible to add/remove all checksums from a given
file, but it's a potentially longtime operation, the file has to be in
some intermediate state where the checksums partially exist but have to
be ignored (for the csum->nocsum) until the file is fully converted,
this brings more special cases to extent handling, it has to survive
power failure and remain consistent, and probably needs to be restarted
after next mount."
Signed-off-by: David Sterba <dsterba@suse.cz>
I saw the warning in btrfs_drop_extent_cache where our end is less than our
start while running xfstests 68 in a loop. This is because we
unconditionally do drop_end = min(end, extent_end) in
__btrfs_drop_extents(), even though we may not have found an extent in the
range we were looking to drop. So keep track of wether or not we found
something, and if we didn't just use our end. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We do not need to do anything special to freeze or unfreeze, it's all taken
care of by the generic work, and what we currently have is wrong anyway
since we shouldn't be returnning to userspace with mutexes held anyway.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The btree inode has it's own write cache pages so we can remove this write
cache pages hook as it's not used. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We can race when checking wether PagePrivate is set on a page and we
actually have an eb saved in the pages private pointer. We could have
easily written out this page and released it in the time that we did the
pagevec lookup and actually got around to looking at this page. So use
mapping->private_lock to ensure we get a consistent view of the
page->private pointer. This is inline with the alloc and releasepage paths
which use private_lock when manipulating page->private. Thanks,
Reported-by: David Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave Sterba pointed out a sleeping while atomic bug while doing fsync. This
is because I'm an idiot and didn't realize that rwlock's were spin locks, so
we've been holding this thing while doing allocations and such which is not
good. This patch fixes this by dropping the write lock before we do
anything heavy and re-acquire it when it is done. We also need to take a
ref on the em's in case their corresponding pages are evicted and mark them
as being logged so that releasepage does not remove them and doesn't remove
them from our local list. Thanks,
Reported-by: Dave Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
So we start our freeze, somebody comes in and does an fsync() on a file
where we have to commit a transaction for whatever reason, and we will
deadlock because the freeze is waiting on FS_FREEZE people to stop writing
to the file system, but the transaction is waiting for its free space inodes
to be written out, which are in turn waiting on sb_start_intwrite while
trying to write the file extents. To fix this we'll just skip the
sb_start_intwrite() if we TRANS_JOIN_NOLOCK since we're being waited on by a
transaction commit so we're safe wrt to freeze and this will keep us from
deadlocking. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I screwed this up, there is a race between checking if there is a running
transaction and actually starting a transaction in sync where we could race
with a freezer and get ourselves into trouble. To fix this we need to make
a new join type to only do the try lock on the freeze stuff. If it fails
we'll return EPERM and just return from sync. This fixes a hang Liu Bo
reported when running xfstest 68 in a loop. Thanks,
Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A subvolume cannot be deleted via rmdir, but the error code ENOTEMPTY
is confusing. Return EPERM instead, as this is not permitted.
Signed-off-by: David Sterba <dsterba@suse.cz>
Using for_each_set_bit_from() to simplify the code.
spatch with a semantic match is used to found this.
(http://coccinelle.lip6.fr/)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Unnecessary lookup_extent_mapping() is removed because an error is
returned to the caller.
This patch was made based on the advice from Stefan Behrens, thanks.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Pull vfs update from Al Viro:
- big one - consolidation of descriptor-related logics; almost all of
that is moved to fs/file.c
(BTW, I'm seriously tempted to rename the result to fd.c. As it is,
we have a situation when file_table.c is about handling of struct
file and file.c is about handling of descriptor tables; the reasons
are historical - file_table.c used to be about a static array of
struct file we used to have way back).
A lot of stray ends got cleaned up and converted to saner primitives,
disgusting mess in android/binder.c is still disgusting, but at least
doesn't poke so much in descriptor table guts anymore. A bunch of
relatively minor races got fixed in process, plus an ext4 struct file
leak.
- related thing - fget_light() partially unuglified; see fdget() in
there (and yes, it generates the code as good as we used to have).
- also related - bits of Cyrill's procfs stuff that got entangled into
that work; _not_ all of it, just the initial move to fs/proc/fd.c and
switch of fdinfo to seq_file.
- Alex's fs/coredump.c spiltoff - the same story, had been easier to
take that commit than mess with conflicts. The rest is a separate
pile, this was just a mechanical code movement.
- a few misc patches all over the place. Not all for this cycle,
there'll be more (and quite a few currently sit in akpm's tree)."
Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
MAX_LFS_FILESIZE should be a loff_t
compat: fs: Generic compat_sys_sendfile implementation
fs: push rcu_barrier() from deactivate_locked_super() to filesystems
btrfs: reada_extent doesn't need kref for refcount
coredump: move core dump functionality into its own file
coredump: prevent double-free on an error path in core dumper
usb/gadget: fix misannotations
fcntl: fix misannotations
ceph: don't abuse d_delete() on failure exits
hypfs: ->d_parent is never NULL or negative
vfs: delete surplus inode NULL check
switch simple cases of fget_light to fdget
new helpers: fdget()/fdput()
switch o2hb_region_dev_write() to fget_light()
proc_map_files_readdir(): don't bother with grabbing files
make get_file() return its argument
vhost_set_vring(): turn pollstart/pollstop into bool
switch prctl_set_mm_exe_file() to fget_light()
switch xfs_find_handle() to fget_light()
switch xfs_swapext() to fget_light()
...
There's no reason to call rcu_barrier() on every
deactivate_locked_super(). We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.
Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths. E.g. on my machine exit_group() of a last process in IPC
namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All increments and decrements are under the same spinlock - have to be,
since they need to protect the radix_tree it's found in. Just use
int, no need to wank with kref...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull user namespace changes from Eric Biederman:
"This is a mostly modest set of changes to enable basic user namespace
support. This allows the code to code to compile with user namespaces
enabled and removes the assumption there is only the initial user
namespace. Everything is converted except for the most complex of the
filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
nfs, ocfs2 and xfs as those patches need a bit more review.
The strategy is to push kuid_t and kgid_t values are far down into
subsystems and filesystems as reasonable. Leaving the make_kuid and
from_kuid operations to happen at the edge of userspace, as the values
come off the disk, and as the values come in from the network.
Letting compile type incompatible compile errors (present when user
namespaces are enabled) guide me to find the issues.
The most tricky areas have been the places where we had an implicit
union of uid and gid values and were storing them in an unsigned int.
Those places were converted into explicit unions. I made certain to
handle those places with simple trivial patches.
Out of that work I discovered we have generic interfaces for storing
quota by projid. I had never heard of the project identifiers before.
Adding full user namespace support for project identifiers accounts
for most of the code size growth in my git tree.
Ultimately there will be work to relax privlige checks from
"capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
root in a user names to do those things that today we only forbid to
non-root users because it will confuse suid root applications.
While I was pushing kuid_t and kgid_t changes deep into the audit code
I made a few other cleanups. I capitalized on the fact we process
netlink messages in the context of the message sender. I removed
usage of NETLINK_CRED, and started directly using current->tty.
Some of these patches have also made it into maintainer trees, with no
problems from identical code from different trees showing up in
linux-next.
After reading through all of this code I feel like I might be able to
win a game of kernel trivial pursuit."
Fix up some fairly trivial conflicts in netfilter uid/git logging code.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
userns: Convert the ufs filesystem to use kuid/kgid where appropriate
userns: Convert the udf filesystem to use kuid/kgid where appropriate
userns: Convert ubifs to use kuid/kgid
userns: Convert squashfs to use kuid/kgid where appropriate
userns: Convert reiserfs to use kuid and kgid where appropriate
userns: Convert jfs to use kuid/kgid where appropriate
userns: Convert jffs2 to use kuid and kgid where appropriate
userns: Convert hpfs to use kuid and kgid where appropriate
userns: Convert btrfs to use kuid/kgid where appropriate
userns: Convert bfs to use kuid/kgid where appropriate
userns: Convert affs to use kuid/kgid wherwe appropriate
userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
userns: On ia64 deal with current_uid and current_gid being kuid and kgid
userns: On ppc convert current_uid from a kuid before printing.
userns: Convert s390 getting uid and gid system calls to use kuid and kgid
userns: Convert s390 hypfs to use kuid and kgid where appropriate
userns: Convert binder ipc to use kuids
userns: Teach security_path_chown to take kuids and kgids
userns: Add user namespace support to IMA
userns: Convert EVM to deal with kuids and kgids in it's hmac computation
...
This reverts commit 0885ef5b56
After applying the above patch, the performance slowed down because the dirty
page flush can only be done by one task, so revert it.
The following is the test result of sysbench:
Before After
24MB/s 39MB/s
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Everybody is just making stuff up, and it's just used to see if we really do
need to alloc a chunk, and since we do this when we already know we really
do it's just a waste of space. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
So we have lots of places where we try to preallocate chunks in order to
make sure we have enough space as we make our allocations. This has
historically meant that we're constantly tweaking when we should allocate a
new chunk, and historically we have gotten this horribly wrong so we way
over allocate either metadata or data. To try and keep this from happening
we are going to make it so that the block group item insertion is done out
of band at the end of a transaction. This will allow us to create chunks
even if we are trying to make an allocation for the extent tree. With this
patch my enospc tests run faster (didn't expect this) and more efficiently
use the disk space (this is what I wanted). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
For immutable bio vecs, I've been auditing and removing bi_idx
references. These were harmless, but removing them will make auditing
easier.
scrub_bio_end_io_worker() was open coding a bio_reset() - but this
doesn't appear to have been needed for anything as right after it does a
bio_put(), and perusing the code it doesn't appear anything else was
holding a reference to the bio.
The other use end_bio_extent_readpage() was just for a pr_debug() -
changed it to something that might be a bit more useful.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Stefan Behrens <sbehrens@giantdisaster.de>
When we wrote some data by compress mode into a btrfs filesystem which was full
of the fragments, the kernel will report:
BTRFS warning (device xxx): Aborting unused transaction.
The reason is:
We can not find a long enough free space to store the compressed data because
of the fragmentary free space, and the compressed data can not be splited,
so the kernel outputed the above message.
In fact, btrfs can deal with this problem very well: it fall back to
uncompressed IO, split the uncompressed data into small ones, and then
store them into to the fragmentary free space. So we shouldn't output the
above warning message.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Wade Cline reported a problem where he was getting garbage and warnings when
writing to a preallocated range via O_DIRECT. This is because we weren't
creating our normal pinned extent_map for the range we were writing to,
which was causing all sorts of issues. This patch fixes the problem and
makes his testcase much happier. Thanks,
Reported-by: Wade Cline <clinew@linux.vnet.ibm.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Sage reported the following lockdep backtrace
=====================================
[ BUG: bad unlock balance detected! ]
3.6.0-rc2-ceph-00171-gc7ed62d #1 Not tainted
-------------------------------------
btrfs-cleaner/7607 is trying to release lock (sb_internal) at:
[<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
but there are no more locks to release!
other info that might help us debug this:
1 lock held by btrfs-cleaner/7607:
#0: (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b405>] cleaner_kthread+0x95/0x120 [btrfs]
stack backtrace:
Pid: 7607, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00171-gc7ed62d #1
Call Trace:
[<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
[<ffffffff810afa9e>] print_unlock_inbalance_bug+0xfe/0x110
[<ffffffff810b289e>] lock_release_non_nested+0x1ee/0x310
[<ffffffff81172f9b>] ? kmem_cache_free+0x7b/0x160
[<ffffffffa004106c>] ? put_transaction+0x8c/0x130 [btrfs]
[<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
[<ffffffff810b2a95>] lock_release+0xd5/0x220
[<ffffffff81173071>] ? kmem_cache_free+0x151/0x160
[<ffffffff8117d9ed>] __sb_end_write+0x7d/0x90
[<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs]
[<ffffffff81079850>] ? __init_waitqueue_head+0x60/0x60
[<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40
[<ffffffffa0042758>] __btrfs_end_transaction+0x368/0x3c0 [btrfs]
[<ffffffffa0042808>] btrfs_end_transaction_throttle+0x18/0x20 [btrfs]
[<ffffffffa00318f0>] btrfs_drop_snapshot+0x410/0x600 [btrfs]
[<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0
[<ffffffffa00430ef>] btrfs_clean_old_snapshots+0xaf/0x150 [btrfs]
[<ffffffffa003b405>] ? cleaner_kthread+0x95/0x120 [btrfs]
[<ffffffffa003b419>] cleaner_kthread+0xa9/0x120 [btrfs]
[<ffffffffa003b370>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs]
[<ffffffff810791ee>] kthread+0xae/0xc0
[<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff8163e744>] kernel_thread_helper+0x4/0x10
[<ffffffff81635430>] ? retint_restore_args+0x13/0x13
[<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0
[<ffffffff8163e740>] ? gs_change+0x13/0x13
This is because the throttle stuff can commit the transaction, which expects to
be the one stopping the intwrite stuff, but we've already done it in the
__btrfs_end_transaction. Moving the sb_end_intewrite after this logic makes the
lockdep go away. Thanks,
Tested-by: Sage Weil <sage@inktank.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This is the change of the kernel side.
Translation of logical to inode used to have an upper limit 4k on
inode container's size, but the limit is not large enough for a data
with a great many of refs, so when resolving logical address,
we can end up with
"ioctl ret=0, bytes_left=0, bytes_missing=19944, cnt=510, missed=2493"
This changes to regard 64k as the upper limit and use vmalloc instead of
kmalloc to get memory more easily.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
In logical resolve, we parse extent_from_logical()'s 'ret' as a kind of flag.
It is possible to lose our errors because
(-EXXXX & BTRFS_EXTENT_FLAG_TREE_BLOCK) is true.
I'm not sure if it is on purpose, it just looks too hacky if it is.
I'd rather use a real flag and a 'ret' to catch errors.
Acked-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Liu Bo <liub.liubo@gmail.com>
As ref cache has been removed from btrfs, there is no user on
its lock and its check.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
When we delete a inode, we will remove all the delayed items including delayed
inode update, and then truncate all the relative metadata. If there is lots of
metadata, we will end the current transaction, and start a new transaction to
truncate the left metadata. In this way, we will leave a inode item that its
link counter is > 0, and also may leave some directory index items in fs/file tree
after the current transaction ends. In other words, the metadata in this fs/file tree
is inconsistent. If we create a snapshot for this tree now, we will find a inode with
corrupted metadata in the new snapshot, and we won't continue to drop the left metadata,
because its link counter is not 0.
We fix this problem by updating the inode item before the current transaction ends.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
I noticed I was seeing large lags when running my torrent test in a vm on my
laptop. While trying to make it lag less I noticed that our overcommit math
was taking into account the number of bytes we wanted to reclaim, not the
number of bytes we actually wanted to allocate, which means we wouldn't
overcommit as often. This patch fixes the overcommit math and makes
shrink_delalloc() use that logic so that it will stop looping faster. We
still have pretty high spikes of latency, but the test now takes 3 minutes
less time (about 5% faster). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Mitch reported a problem where you could get an ENOSPC error when untarring
a kernel git tree onto a 16gb file system with compress-force=zlib. This is
because compression is a huge pain, it will return from ->writepages()
without having actually created any ordered extents. To get around this we
check to see if the async submit counter is up, and if it is wait until it
drops to 0 before doing our normal ordered wait dance. With this patch I
can now untar a kernel git tree onto a 16gb file system without getting
ENOSPC errors. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We're going to use this flag EXTENT_DEFRAG to indicate which range
belongs to defragment so that we can implement snapshow-aware defrag:
We set the EXTENT_DEFRAG flag when dirtying the extents that need
defragmented, so later on writeback thread can differentiate between
normal writeback and writeback started by defragmentation.
Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
ulist_alloc() has the possibility of returning NULL.
So, it is necessary to check the return value.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
When we ran fsstress(a program in xfstests), the filesystem hung up when it
is full. It was because the space reserved in btrfs_fallocate() was wrong,
btrfs_fallocate() just used the size of the pre-allocation to reserve the
space, didn't took the block size aligning into account, so the size of
the reserved space was less than the allocated space, it caused the over
reserve problem and made the filesystem hung up when invoking cow_file_range().
Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Though we dump the stack information when aborting a unused transaction
handle, we don't know the correct place where we decide to abort the
transaction handle if one function has several place where the transaction
abort function is invoked and jumps to the same place after this call.
And beside that we also don't know the reason why we jump to abort
the current handle. So I modify the transaction abort function and make
it output the function name, line and error information.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
We forget to protect ->log_batch when syncing a file, this patch fix
this problem by atomic operation. And ->log_batch is used to check
if there are parallel sync operations or not, so it is unnecessary to
reset it to 0 after the sync operation of the current log tree complete.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
We should insert/update 6 items(root ref, root backref, dir item, dir index,
root item and parent inode) when creating a snapshot, not 5 items, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
The snapshot should be the image of the fs tree before it was created,
so the metadata of the snapshot should not exist in the its tree. But now, we
found the directory item and directory name index is in both the snapshot tree
and the fs tree. It introduces some problems and makes the users feel strange:
# mkfs.btrfs /dev/sda1
# mount /dev/sda1 /mnt
# mkdir /mnt/1
# cd /mnt/1
# btrfs subvolume snapshot /mnt snap0
# ls -a /mnt/1/snap0/1
. .. [no other file/dir]
# ll /mnt/1/snap0/
total 0
drwxr-xr-x 1 root root 10 Ju1 24 12:11 1
^^^
There is no file/dir in it, but it's size is 10
# cd /mnt/1/snap0/1/snap0
[Enter a unexisted directory successfully...]
There is nothing in the directory 1 in snap0, but btrfs told the length of
this directory is 10. Beside that, we can enter an unexisted directory, it is
very strange to the users.
# btrfs subvolume snapshot /mnt/1/snap0 /mnt/snap1
# ll /mnt/1/snap0/1/
total 0
[None]
# ll /mnt/snap1/1/
total 0
drwxr-xr-x 1 root root 0 Ju1 24 12:14 snap0
And the source of snap1 did have any directory in Directory 1, but snap1 have
a snap0, it is different between the source and the snapshot.
So I think we should insert directory item and directory name index and update
the parent inode as the last step of snapshot creation, and do not leave the
useless metadata in the file tree.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Sometimes we need choose the method of the reservation according to the type
of the block reservation, such as the reservation for the delayed inode update.
Now we identify the type just by comparing the address of the reservation
variants, it is very ugly if it is a temporary one because we need compare it
with all the common reservation variants. So we add a new "type" field to keep
the type the reservation variants.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
The ordered extent allocation is in the fast path of the IO, so use a slab
to improve the speed of the allocation.
"Size of the struct is 280, so this will fall into the size-512 bucket,
giving 8 objects per page, while own slab will pack 14 objects into a page.
Another benefit I see is to check for leaked objects when the module is
removed (and the cache destroy takes place)."
-- David Sterba
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
If a snapshot is created while we are writing some data into the file,
the i_size of the corresponding file in the snapshot will be wrong, it will
be beyond the end of the last file extent. And btrfsck will report:
root 256 inode 257 errors 100
Steps to reproduce:
# mkfs.btrfs <partition>
# mount <partition> <mnt>
# cd <mnt>
# dd if=/dev/zero of=tmpfile bs=4M count=1024 &
# for ((i=0; i<4; i++))
> do
> btrfs sub snap . $i
> done
This because the algorithm of disk_i_size update is wrong. Though there are
some ordered extents behind the current one which we use to update disk_i_size,
it doesn't mean those extents will be dealt with in the same transaction. So
We shouldn't use the offset of those extents to update disk_i_size. Or we will
get the wrong i_size in the snapshot.
We fix this problem by recording the max real i_size. If we find there is a
ordered extent which is in front of the current one and doesn't complete, we
will record the end of the current one into that ordered extent. Surely, if
the current extent holds the end of other extent(it must be greater than
the current one because it is behind the current one), we will record the
number that the current extent holds. In this way, we can exclude the ordered
extents that may not be dealth with in the same transaction, and be easy to
know the real disk_i_size.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
If we create several snapshots at the same time, the following BUG_ON() will be
triggered.
kernel BUG at fs/btrfs/extent-tree.c:6047!
Steps to reproduce:
# mkfs.btrfs <partition>
# mount <partition> <mnt>
# cd <mnt>
# for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done
# for ((i=0; i<4; i++))
> do
> mkdir $i
> for ((j=0; j<200; j++))
> do
> btrfs sub snap . $i/$j
> done &
> done
The reason is:
Before transaction commit, some operations changed the fs tree and new tree
blocks were allocated because of COW. We used the implicit non-shared back
reference for those newly allocated tree blocks because they were not shared by
two or more trees.
And then we created the first snapshot for the fs tree, according to the back
reference rules, we also used implicit back refs for the child tree blocks of
the root node of the fs tree, now those child nodes/leaves were shared by two
trees.
Then We didn't deal with the delayed references, and continued to change the fs
tree(created the second snapshot and inserted the dir item of the new snapshot
into the fs tree). According to the rules of the back reference, we added full
back refs for those tree blocks whose parents have be shared by two trees.
Now some newly allocated tree blocks had two types of the references.
As we know, the delayed reference system handles these delayed references from
back to front, and the full delayed reference is inserted after the implicit
ones. So when we dealt with the back references of those newly allocated tree
blocks, the full references was dealt with at first. And if the first reference
is a shared back reference and the tree block that the reference points to is
newly allocated, It would be considered as a tree block which is shared by two
or more trees when it is allocated and should be a full back reference not a
implicit one, the flag of its reference also should be set to FULL_BACKREF.
But in fact, it was a non-shared tree block with a implicit reference at
beginning, so it was not compulsory to set the flags to FULL_BACKREF. So BUG_ON
was triggered.
We have several methods to fix this bug:
1. deal with delayed references after the snapshot is created and before we
change the source tree of the snapshot. This is the easiest and safest way.
2. modify the sort method of the delayed reference tree, make the full delayed
references be inserted before the implicit ones. It is also very easy, but
I don't know if it will introduce some problems or not.
3. modify select_delayed_ref() and make it select the implicit delayed reference
at first. This way is not so good because it may wastes CPU time if we have
lots of delayed references.
4. set the flags to FULL_BACKREF, this method is a little complex comparing with
the 1st way.
I chose the 1st way to fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
This patch fixes the following problem:
- If we failed to deal with the delayed dir items, we should abort transaction,
just as its comment said. Fix it.
- If root reference or root back reference insertion failed, we should
abort transaction. Fix it.
- Fix the double free problem of pending->inherit.
- Do not restore the trans->rsv if we doesn't change it.
- make the error path more clearly.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
bbio has been malloced in btrfs_map_block() and should be
freed before leaving from the error handling cases.
spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
I noticed this when I was doing the fsync stuff, we allocate split extents if we
drop an extent range that is in the middle of an existing extent. This BUG()'s
if we fail to allocate memory, but the fact is this is just a cache, we will
just regenerate the cache if we need it, the important part is that we free the
range we are given. This can be done without allocations, so if we fail to
allocate splits just skip the splitting stage and free our em and look for more
extents to drop. This also makes btrfs_drop_extent_cache a void since nobody
was checking the return value anyway. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The freeze rwsem is taken by sb_start_intwrite() and dropped during the
commit_ or end_transaction(). In the async case, that happens in a worker
thread. Tell lockdep the calling thread is releasing ownership of the
rwsem and the async thread is picking it up.
XFS plays the same trick in fs/xfs/xfs_aops.c.
Signed-off-by: Sage Weil <sage@inktank.com>
I audited all users of btrfs_drop_extents and found that nobody actually uses
the hint_byte argument. I'm sure it was used for something at some point but
it's not used now, and the way the pinning works the disk bytenr would never be
immediately useful anyway so lets just remove it. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This is based on Josef's "Btrfs: turbo charge fsync".
If an inode is a BTRFS_INODE_NODATASUM one, we don't need to look for csum
items any more.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
This is based on Josef's "Btrfs: turbo charge fsync".
The current btrfs checks if an inode is in log by comparing
root's last_log_commit to inode's last_sub_trans[2].
But the problem is that this root->last_log_commit is shared among
inodes.
Say we have N inodes to be logged, after the first inode,
root's last_log_commit is updated and the N-1 remained files will
be skipped.
This fixes the bug by keeping a local copy of root's last_log_commit
inside each inode and this local copy will be maintained itself.
[1]: we regard each log transaction as a subset of btrfs's transaction,
i.e. sub_trans
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
This is based on Josef's "Btrfs: turbo charge fsync".
The above Josef's patch performs very good in random sync write test,
because we won't have too much extents to merge.
However, it does not performs good on the test:
dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync
The reason is when we do sequencial sync write, we need to merge the
current extent just with the previous one, so that we can get accumulated
extents to log:
A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ...
So we'll have to flush more and more checksum into log tree, which is the
bottleneck according to my tests.
But we can avoid this by telling fsync the real extents that are needed
to be logged.
With this, I did the above dd sync write test (size=50m),
w/o (orig) w/ (josef's) w/ (this)
SATA 104KB/s 109KB/s 121KB/s
ramdisk 1.5MB/s 1.5MB/s 10.7MB/s (613%)
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
We will stop and restart a transaction every time we move to a different leaf
when truncating a file. This is for enospc reasons, but really we could
probably get away with doing this a little better by actually working until we
hit an ENOSPC. So add a ->failfast flag to the block_rsv and set it when we do
truncates which will fail as soon as the block rsv runs out of space, and then
at that point we can stop and restart the transaction and refill the block rsv
and carry on. This will make rm'ing of a file with lots of extents a bit
faster. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This is based on Josef's "Btrfs: turbo charge fsync".
We should cleanup those extents after we've finished logging inode,
otherwise we may do redundant work on them.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
I hit this a couple times while working on my fsync patch (all my bugs, not
normal operation), but with my new stuff we could have new errors from cases
I have not encountered, so instead of BUG()'ing we should be WARN()'ing so
that we are notified there is a problem but the user doesn't lose their
data. We can easily commit the transaction in the case that the tree
logging fails and still be fine, so let's try and be as nice to the user as
possible. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
At least for the vm workload. Currently on fsync we will
1) Truncate all items in the log tree for the given inode if they exist
and
2) Copy all items for a given inode into the log
The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worst yet you may have
only modified a few extents, not the entire thing. This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them. We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already. Here are some
numbers of a 50 meg fio job that does random writes and fsync()s after every
write
Original Patched
SATA drive 82KB/s 140KB/s
Fusion drive 431KB/s 2532KB/s
So around 2-6 times faster depending on your hardware. There are a few
corner cases, for example if you truncate at all we have to do it the old
way since there is no way to be sure what is in the log is ok. This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get. All this work is in RAM of course so if your
inode gets evicted from cache and you read it in and fsync it we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.
The biggest cool part of this is that it requires no changes to the recovery
code, so if you fsync with this patch and crash and load an old kernel, it
will run the recovery and be a-ok. I have tested this pretty thoroughly
with an fsync tester and everything comes back fine, as well as xfstests.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>