3810 Commits

Author SHA1 Message Date
Wang Shilong
0647bf564f Btrfs: improve forever loop when doing balance relocation
We hit a forever loop when doing balance relocation. The reason
is that we first reserve 4M (node size is 16k), and within the transaction
we try to add an extra reservation for snapshot roots. This will
return -EAGAIN if another thread is already flushing space to reserve
space. We will do this again and again as the filesystem becomes nearly
full.

If the above -EAGAIN case happens, we try to refill the reservation
outside of the transaction. This will return earlier in the ENOSPC case; however,
this doesn't really hurt because it makes no sense to do balance relocation
with the filesystem nearly full.

Miao Xie helped a lot to track this issue, thanks.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:43 -08:00
Filipe David Borba Manana
6126e3caf7 Btrfs: fix ordered extent check in btrfs_punch_hole
If the ordered extent's last byte was 1 less than our region's
start byte, we would unnecessarily wait for the completion of
that ordered extent, even though it doesn't intersect our target
range.
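
A minimal sketch of the intersection test described above, not the code
from the patch. The helper name ordered_extent_intersects() and its
parameters are hypothetical. It assumes the ordered extent covers the
half-open byte range [file_offset, file_offset + len) and the target
region covers the inclusive range [lockstart, lockend].

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true only if the ordered extent
 * [file_offset, file_offset + len) overlaps the inclusive byte range
 * [lockstart, lockend].
 */
static bool ordered_extent_intersects(uint64_t file_offset, uint64_t len,
				      uint64_t lockstart, uint64_t lockend)
{
	/* The last byte of the ordered extent is file_offset + len - 1. */
	return file_offset + len > lockstart && file_offset <= lockend;
}
```

With this form, an ordered extent whose last byte is lockstart - 1 is
reported as not intersecting, so there would be no reason to wait for it.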

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:42 -08:00
Miao Xie
376cc685cb Btrfs: fix the reserved space leak caused by the race between nonlock dio and buffered io
When we ran sysbench on the fs with compression, the following WARN_ONs were
triggered:
 fs/btrfs/inode.c:7829	WARN_ON(BTRFS_I(inode)->outstanding_extents);
 fs/btrfs/inode.c:7830	WARN_ON(BTRFS_I(inode)->reserved_extents);
 fs/btrfs/inode.c:7832	WARN_ON(BTRFS_I(inode)->csum_bytes);

Steps to reproduce:
 # mkfs.btrfs -f <dev>
 # mount -o compress <dev> <mnt>
 # cd <mnt>
 # sysbench --test=fileio --num-threads=8 --file-total-size=8G \
 > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
 > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
 > --file-test-mode=sync prepare
 # cd -
 # umount <mnt>
 # mount -o compress <dev> <mnt>
 # cd <mnt>
 # sysbench --test=fileio --num-threads=8 --file-total-size=8G \
 > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
 > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
 > --file-test-mode=sync run
 # cd -
 # umount <mnt>

The cause of this problem is:
Task0				Task1
btrfs_direct_IO
  unlock(&inode->i_mutex)
				lock(&inode->i_mutex)
				reserve_space()
				prepare_pages()
				  lock_extent()
				  clear_extent()
				  unlock_extent()
  lock_extent()
  test_extent(uptodate)
    return false
				copy_data()
				set_delalloc_extent()
  extent need compress
    go back to buffered write
  clear_extent(DELALLOC | DIRTY)
  unlock_extent()

Task0 and Task1 wrote to the same place, and Task0 cleared the delalloc flag
that Task1 had set. As a result, the dirty pages in that extent could not be
flushed to disk, so the reserved space for that extent was never released.

This patch fixes the above bug by unlocking the extent after the delalloc.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:42 -08:00
Miao Xie
b37392ea86 Btrfs: cleanup unnecessary parameter and variant of prepare_pages()
- the caller already has the inode object, so we needn't pass the file object.
  This also means we needn't define an inode pointer variable.
- the position should be aligned to the page size, not the sector size, so
  we also needn't pass the root object into prepare_pages().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:41 -08:00
David Sterba
cc37bb0420 btrfs: replace BUG in can_modify_feature
We don't need to crash hard here, it's just reading a sysfs file. The
values considered in switch are from a fixed set, the default case
should not happen at all.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:41 -08:00
David Sterba
43d87fa231 btrfs: reserve no transaction units in btrfs_feature_attr_store
Added in patch "btrfs: add ability to change features via sysfs",
modifications to superblock don't need to reserve metadata blocks when
starting a transaction.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:40 -08:00
Frank Holton
27a0dd61a5 Btrfs: make btrfs_debug match pr_debug handling related to DEBUG
The kernel macro pr_debug is defined as an empty statement when DEBUG is
not defined. Make btrfs_debug match pr_debug to avoid spamming
the kernel log with debug messages.

Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:39 -08:00
Sergei Trofimovich
33b98f2271 btrfs: cleanup: removed unused 'btrfs_get_inode_ref_index'
Found by uselex.rb:
> btrfs_get_inode_ref_index: [R]: exported from:
fs/btrfs/inode-item.o fs/btrfs/btrfs.o fs/btrfs/built-in.o

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:39 -08:00
Kelley Nielsen
3f870c2899 btrfs: expand btrfs_find_item() to include find_orphan_item functionality
This is the third step in bootstrapping the btrfs_find_item interface.
The function find_orphan_item(), in orphan.c, is similar to the two
functions already replaced by the new interface. It uses two parameters,
which are already present in the interface, and is nearly identical to
the function brought in in the previous patch.

Replace the two calls to find_orphan_item() with calls to
btrfs_find_item(), with the defined objectid and type that was used
internally by find_orphan_item(), a null path, and a null key. Add a
test for a null path to btrfs_find_item, and if it passes, allocate and
free the path. Finally, remove find_orphan_item().

Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:37 -08:00
Kelley Nielsen
75ac2dd907 btrfs: expand btrfs_find_item() to include find_root_ref functionality
This patch is the second step in bootstrapping the btrfs_find_item
interface. The btrfs_find_root_ref() is similar to the former
__inode_info(); it accepts four of its parameters, and duplicates the
first half of its functionality.

Replace the one former call to btrfs_find_root_ref() with a call to
btrfs_find_item(), along with the defined key type that was used
internally by btrfs_find_root_ref(), and a null found key. In
btrfs_find_item(), add a test for the null key at the place where
the functionality of btrfs_find_root_ref() ends; btrfs_find_item()
then returns if the test passes. Finally, remove btrfs_find_root_ref().

Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Suggested-by: Zach Brown <zab@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:36 -08:00
Kelley Nielsen
e33d5c3d6d btrfs: bootstrap generic btrfs_find_item interface
There are many btrfs functions that manually search the tree for an
item. They all reimplement the same mechanism and differ in the
conditions that they use to find the item. __inode_info() is one such
example. Zach Brown proposed creating a new interface to take the place
of these functions.

This patch is the first step to creating the interface. A new function,
btrfs_find_item, has been added to ctree.c and prototyped in ctree.h.
It is identical to __inode_info, except that the order of the parameters
has been rearranged to more closely match those of similar functions elsewhere
in the code (now, root and path come first, then the objectid, offset
and type, and the key to be filled in last). __inode_info's callers have
been set to call this new function instead, and __inode_info itself has
been removed.

Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Suggested-by: Zach Brown <zab@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:36 -08:00
Valentina Giusti
a3df41ee37 btrfs: fix unused variables in qgroup.c
Use otherwise unused local variables slot in update_qgroup_limit_item and
in update_qgroup_info_item, and remove unused variable ins from
btrfs_qgroup_account_ref.

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:35 -08:00
Valentina Giusti
e94acd86d4 btrfs: replace path->slots[0] with otherwise unused variable 'slot'
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:35 -08:00
Valentina Giusti
ce3e7f1073 btrfs: remove unused variable from scrub_fixup_nodatasum
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:34 -08:00
Valentina Giusti
f0265bb409 btrfs: remove unused variable from setup_cluster_no_bitmap
The variable window_start in setup_cluster_no_bitmap is not used since commit
1bb91902dc
(Btrfs: revamp clustered allocation logic)

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:33 -08:00
Valentina Giusti
50892bac3b btrfs: remove unused variables from extent_io.c
Remove unused variables:
* tree from end_bio_extent_writepage,
* item from extent_fiemap.

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:33 -08:00
Valentina Giusti
4b447bfac6 btrfs: remove unused variable from find_free_extent
The variable found_uncached_bg in find_free_extent is not used since commit
285ff5af6c
(Btrfs: remove the ideal caching code)

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:32 -08:00
Valentina Giusti
71db2a7751 btrfs: remove unused variables from disk-io.c
Remove unused variables:
* tree from csum_dirty_buffer,
* tree from btree_readpage_end_io_hook,
* tree from btree_writepages,
* bytenr from btrfs_create_tree,
* fs_info from end_workqueue_fn.

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:31 -08:00
Valentina Giusti
99e22f783b btrfs: remove unused variable from btrfs_new_inode
Variable owner in btrfs_new_inode is unused since commit
d82a6f1d7e
(Btrfs: kill BTRFS_I(inode)->block_group)

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:31 -08:00
Jeff Mahoney
f8ba9c11f8 btrfs: publish fs label in sysfs
This adds a writeable attribute which describes the label.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:30 -08:00
Jeff Mahoney
29e5be240a btrfs: publish device membership in sysfs
Now that we have the infrastructure for per-super attributes, we can
publish device membership in /sys/fs/btrfs/<fsid>/devices. The information
is published as symlinks to the block devices.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:29 -08:00
Jeff Mahoney
6ab0a2029c btrfs: publish allocation data in sysfs
While trying to debug ENOSPC issues, it's helpful to understand what the
kernel's view of the available space is. We export this information
via ioctl, but sysfs files are more easily used.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:29 -08:00
Jeff Mahoney
01e219e806 btrfs: add ioctl to export size of global metadata reservation
btrfs filesystem df output will show the size of the metadata space
and how much of it is used, and the user assumes that the difference
is all usable space. Since that's not actually the case due to the
global metadata reservation, we should provide the full picture to the
user.

This patch adds an ioctl that exports the size of the global metadata
reservation so that btrfs filesystem df can report it.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:28 -08:00
Jeff Mahoney
3b02a68a63 btrfs: use feature attribute names to print better error messages
Now that we have the feature name strings available in the kernel via
the sysfs attributes, we can use them for printing better failure
messages from the ioctl path.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:28 -08:00
Jeff Mahoney
ba631941ef btrfs: add ability to change features via sysfs
This patch adds the ability to change (set/clear) features while the file
system is mounted. A bitmask is added for each feature set for the
support to set and clear the bits. A message indicating which bit
has been set or cleared is issued when it's been changed and also when
permission or support for a particular bit has been denied.

Since the attributes can now be writable, we need to introduce
another struct attribute to hold the different permissions.

If neither set nor clear is supported, the file will have 0444 permissions.
If either set or clear is supported, the file will have 0644 permissions
and the store handler will filter out the write based on the bitmask.
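
The permission rule above can be sketched as follows. This is an
illustration only, not the patch's code: feature_attr_mode() is a
hypothetical name, and safe_set and safe_clear stand in for the
per-feature-set bitmasks.

```c
#include <stdint.h>

/*
 * Hypothetical helper: pick the sysfs file mode for one feature bit.
 * A bit that may be set or cleared at runtime gets a writable file;
 * every other bit gets a read-only file.
 */
static unsigned int feature_attr_mode(uint64_t feature_bit,
				      uint64_t safe_set, uint64_t safe_clear)
{
	if ((safe_set | safe_clear) & feature_bit)
		return 0644;	/* writable; the store handler still filters */
	return 0444;		/* read-only */
}
```

A 0644 file does not mean that both directions are allowed. The store
handler is still expected to reject a write that the bitmask does not
permit.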

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:27 -08:00
Jeff Mahoney
79da4fa4d9 btrfs: publish unknown feature bits in sysfs
With the compat and compat-ro bits, it's possible for file systems to
exist that have features that aren't supported by the kernel's file system
implementation yet still be mountable.

This patch publishes read-only info on those features using a prefix:number
format, where the number is the bit number rather than the shifted value.
e.g. "compat:12"
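
A small sketch of the naming scheme, assuming nothing about the real
implementation: unknown_feature_name() is a hypothetical helper that
formats one unknown bit as prefix:number. A feature whose flag value is
1 << 12 would be named with the bit number 12, not the shifted value
4096.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format the name of one unknown feature bit as
 * "prefix:number", where number is the bit number. Returns the
 * snprintf() result.
 */
static int unknown_feature_name(char *buf, size_t size,
				const char *prefix, unsigned int bit_number)
{
	return snprintf(buf, size, "%s:%u", prefix, bit_number);
}
```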

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:26 -08:00
Jeff Mahoney
510d73600a btrfs: publish per-super features in sysfs
This patch publishes information on which features are enabled in the
file system on a per-super basis. At this point, it only publishes
information on features supported by the file system implementation.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:26 -08:00
Jeff Mahoney
5ac1d209f1 btrfs: publish per-super attributes in sysfs
This patch adds per-super attributes to sysfs.

It doesn't publish any attributes yet, but does the proper lifetime
handling as well as the basic infrastructure to add new attributes.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:25 -08:00
Jeff Mahoney
079b72bca3 btrfs: publish supported featured in sysfs
This patch adds the ability to publish supported features to sysfs under
/sys/fs/btrfs/features.

The files are module-wide and export which features the kernel supports.

The content, for now, is just "0\n".

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:25 -08:00
Jeff Mahoney
2eaa055fab btrfs: add ioctls to query/change feature bits online
There are some feature bits that require no offline setup and can
be enabled online. I've only reviewed extended irefs, but there will
probably be more.

We introduce three new ioctls:
- BTRFS_IOC_GET_SUPPORTED_FEATURES: query the kernel for supported features.
- BTRFS_IOC_GET_FEATURES: query the kernel for enabled features on a per-fs
  basis, as well as querying for which features are changeable while mounted.
- BTRFS_IOC_SET_FEATURES: change features on a per-fs basis.

We introduce two new masks per feature set (_SAFE_SET and _SAFE_CLEAR) that
allow us to define which features are safe to change at runtime.

The failure modes for BTRFS_IOC_SET_FEATURES are as follows:
- Enabling a completely unsupported feature: warns and returns -ENOTSUPP
- Enabling a feature that can only be done offline: warns and returns -EPERM
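
The two failure modes for enabling a feature can be sketched like this.
It is a simplified illustration, not the ioctl code: check_feature_set()
and its mask parameters are hypothetical, the warnings are omitted, and
ENOTSUPP is defined locally because it is a kernel-internal errno that
userspace headers do not provide.

```c
#include <errno.h>
#include <stdint.h>

#ifndef ENOTSUPP
#define ENOTSUPP 524	/* kernel-internal value, defined here for the sketch */
#endif

/*
 * Hypothetical check for a request to enable feature bits.
 * supported: bits this kernel knows about.
 * safe_set:  bits that may be enabled while the fs is mounted.
 */
static int check_feature_set(uint64_t requested, uint64_t supported,
			     uint64_t safe_set)
{
	if (requested & ~supported)
		return -ENOTSUPP;	/* completely unsupported feature */
	if (requested & ~safe_set)
		return -EPERM;		/* supported, but only offline */
	return 0;
}
```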

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:23 -08:00
Liu Bo
9e5ac13acb Btrfs: skip merge part for delayed data refs
When we have data deduplication on, we'll hang on the merge part
because it needs to verify every queued delayed data ref related to
this disk offset, and we may have millions of refs.

And in the case of delayed data refs, we don't usually have many
data refs to merge.

So it's safe to skip the merge for data refs.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:23 -08:00
Liu Bo
c46effa601 Btrfs: introduce a head ref rbtree
The way we process delayed refs is:
1) get a bunch of head refs,
2) pick up one head ref,
3) go one node back for any delayed ref updates.

The head ref is linked in the same rbtree as the delayed refs,
so in stage 1) we have to walk the nodes one by one, including not only
head refs but also delayed refs.

When we have a great number of delayed refs pending,
this costs a lot of time.

Here we introduce a head-ref-specific rbtree that contains only head refs,
so this problem goes away.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:22 -08:00
Josef Bacik
e20d6c5ba3 Btrfs: fix check-integrity to look at the referenced data properly
We were looking at file_extent_num_bytes unconditionally when looking at
referenced data bytes, but this isn't correct for compression.  Fix this by
checking the compression of the file extent we are looking at and setting num_bytes to
disk_num_bytes in the case of compression so that we are marking the proper
bytes as referenced.  This fixes check_int_data freaking out when running
btrfs/004.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:21 -08:00
Josef Bacik
16e7549f04 Btrfs: incompatible format change to remove hole extents
Btrfs has always had these filler extent data items for holes in inodes.  This
has made some things very easy, like logging hole punches and sending hole
punches.  However for large holey files these extent data items are pure
overhead.  So add an incompatible feature to no longer add hole extents to
reduce the amount of metadata used by these sorts of files.  This has a few
changes for logging and send obviously since they will need to detect holes and
log/send the holes if there are any.  I've tested this thoroughly with xfstests
and it doesn't cause any issues with and without the incompat format set.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-01-28 13:19:21 -08:00
Linus Torvalds
e09f67f147 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This is a small collection of fixes.  It was rebased this morning, but
  I was just fixing signed-off-by tags with the wrong email"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix access_ok() check in btrfs_ioctl_send()
  Btrfs: make sure we cleanup all reloc roots if error happens
  Btrfs: skip building backref tree for uuid and quota tree when doing balance relocation
  Btrfs: fix an oops when doing balance relocation
  Btrfs: don't miss skinny extent items on delayed ref head contention
  btrfs: call mnt_drop_write after interrupted subvol deletion
  Btrfs: don't clear the default compression type
2013-12-12 15:25:10 -08:00
Dan Carpenter
700ff4f095 Btrfs: fix access_ok() check in btrfs_ioctl_send()
The closing parenthesis is in the wrong place.  We want to check
"sizeof(*arg->clone_sources) * arg->clone_sources_count" instead of
"sizeof(*arg->clone_sources * arg->clone_sources_count)".

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org
2013-12-12 07:13:02 -08:00
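
The parenthesis problem fixed in 700ff4f095 above can be reproduced in
isolation. This is a sketch under assumptions: uint64_t stands in for
the element type of clone_sources, and wrong_len() and right_len() are
hypothetical names. sizeof does not evaluate its operand here, so the
pointer is never dereferenced.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Misplaced parenthesis: sizeof is applied to the product expression,
 * so the result is the size of the product's type, whatever count is.
 */
static size_t wrong_len(const uint64_t *clone_sources, uint64_t count)
{
	return sizeof(*clone_sources * count);
}

/* Intended form: element size multiplied by the element count. */
static size_t right_len(const uint64_t *clone_sources, uint64_t count)
{
	return sizeof(*clone_sources) * count;
}
```

With the wrong form, a length check covers only one element no matter
how many clone sources are passed in.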
Wang Shilong
467bb1d27c Btrfs: make sure we cleanup all reloc roots if error happens
I hit an oops when merging reloc roots failed. The reason is that
new reloc roots may be added, so we should make sure we clean up
all reloc roots.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12 07:12:51 -08:00
Wang Shilong
6646374863 Btrfs: skip building backref tree for uuid and quota tree when doing balance relocation
The quota tree and the UUID tree are only COWed; they cannot be snapshotted.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12 07:12:36 -08:00
Wang Shilong
c974c4642f Btrfs: fix an oops when doing balance relocation
I hit an oops when inserting a reloc root into @reloc_root_tree (it can be
easily triggered when forcing COW for the relocation root):

[  866.494539]  [<ffffffffa0499579>] btrfs_init_reloc_root+0x79/0xb0 [btrfs]
[  866.495321]  [<ffffffffa044c240>] record_root_in_trans+0xb0/0x110 [btrfs]
[  866.496109]  [<ffffffffa044d758>] btrfs_record_root_in_trans+0x48/0x80 [btrfs]
[  866.496908]  [<ffffffffa0494da8>] select_reloc_root+0xa8/0x210 [btrfs]
[  866.497703]  [<ffffffffa0495c8a>] do_relocation+0x16a/0x540 [btrfs]

This is because inserting the reloc root into @reloc_root_tree does not happen
within one transaction; the reloc root may be COWed and its root block bytenr
reused, and then the oops happens. We should update the reloc root in
@reloc_root_tree when we COW the reloc root node. Fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12 07:12:20 -08:00
Filipe David Borba Manana
639eefc8af Btrfs: don't miss skinny extent items on delayed ref head contention
Currently extent-tree.c:btrfs_lookup_extent_info() can miss the lookup
of skinny extent items. This can happen when the execution flow is the
following:

* We do an extent tree lookup and fail to find a skinny extent item;

* As a result, we attempt to see if a non-skinny extent item exists,
  either by looking at previous item in the leaf or by doing another
  full extent tree search;

* We have a transaction and then we check for a matching delayed ref
  head in the transaction's delayed refs rbtree;

* We find such delayed ref head and then we try to lock it with a
  call to mutex_trylock();

* The lock was contended so we jump to the label "again", which repeats
  the extent tree search but for a non-skinny extent item, because we
  previously set the metadata variable to 0 and the search key to look for a
  non-skinny extent-item;

* After the jump (and after releasing the transaction's delayed refs
  lock), a skinny extent item might have been added to the extent tree
  but we will miss it because metadata is set to 0 and the search key
  is set for a non-skinny extent-item.

The fix here is to not reset metadata to 0 and to jump to the initial search
key setup if the delayed ref head is contended, instead of jumping directly
to the extent tree search label ("again").

This issue was found while investigating the issue reported at Bugzilla 64961.

David Sterba suspected this function was missing extent items, and that
this could be caused by the last change to this function, which was made
in the following patch:

    [PATCH] Btrfs: optimize btrfs_lookup_extent_info()
    (commit 74be951087)

But in fact this issue already existed before, because after failing to find
a skinny extent item, the code set the search key for a non-skinny extent
item, and on contention of a matching delayed ref head it would not search
the extent tree for a skinny extent item anymore.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12 07:11:58 -08:00
David Sterba
e43f998e47 btrfs: call mnt_drop_write after interrupted subvol deletion
If btrfs_ioctl_snap_destroy blocks on the mutex and the process is
killed, the mnt_write count is left unbalanced, which leads to an
unmountable filesystem.

CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12 07:11:38 -08:00
Miao Xie
a7e252af5a Btrfs: don't clear the default compression type
We hit an oops caused by the wrong compression type:
[  556.512356] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  556.512370] IP: [<ffffffff811dbaa0>] __list_del_entry+0x1/0x98
[SNIP]
[  556.512490]  [<ffffffff811dbb44>] ? list_del+0xd/0x2b
[  556.512539]  [<ffffffffa05dd5ce>] find_workspace+0x97/0x175 [btrfs]
[  556.512546]  [<ffffffff813c14b5>] ? _raw_spin_lock+0xe/0x10
[  556.512576]  [<ffffffffa05de276>] btrfs_compress_pages+0x2d/0xa2 [btrfs]
[  556.512601]  [<ffffffffa05af060>] compress_file_range.constprop.54+0x1f2/0x4e8 [btrfs]
[  556.512627]  [<ffffffffa05af388>] async_cow_start+0x32/0x4d [btrfs]
[  556.512655]  [<ffffffffa05cc7a1>] worker_loop+0x144/0x4c3 [btrfs]
[  556.512661]  [<ffffffff81059404>] ? finish_task_switch+0x80/0xb8
[  556.512689]  [<ffffffffa05cc65d>] ? btrfs_queue_worker+0x244/0x244 [btrfs]
[  556.512695]  [<ffffffff8104fa4e>] kthread+0x8d/0x95
[  556.512699]  [<ffffffff81050000>] ? bit_waitqueue+0x34/0x7d
[  556.512704]  [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65
[  556.512709]  [<ffffffff813c7eec>] ret_from_fork+0x7c/0xb0
[  556.512713]  [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65

Steps to reproduce:
 # mkfs.btrfs -f <dev>
 # mount -o nodatacow <dev> <mnt>
 # touch <mnt>/<file>
 # chattr =c <mnt>/<file>
 # dd if=/dev/zero of=<mnt>/<file> bs=1M count=10

This is because we cleared the default compression type when setting
nodatacow. In fact, we needn't do that, because the COMPRESS flag already
indicates whether we need to compress the file data. We needn't use the
variable -- compress_type -- in btrfs_info for the same purpose, and can just
use it to hold the default compression type. Otherwise we would get a wrong
compression type for a file whose own compress flag is set but whose
filesystem's compress flag is not set.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12 07:11:19 -08:00
Linus Torvalds
5ee540613d Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "A small collection of fixes for the current series. It contains:

   - A fix for a use-after-free of a request in blk-mq.  From Ming Lei

   - A fix for a blk-mq bug that could attempt to dereference a NULL rq
     if allocation failed

   - Two xen-blkfront small fixes

   - Cleanup of submit_bio_wait() type uses in the kernel, unifying
     that.  From Kent

   - A fix for 32-bit blkg_rwstat reading.  I apologize for this one
     looking mangled in the shortlog, it's entirely my fault for missing
     an empty line between the description and body of the text"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: fix use-after-free of request
  blk-mq: fix dereference of rq->mq_ctx if allocation fails
  block: xen-blkfront: Fix possible NULL ptr dereference
  xen-blkfront: Silence pfn maybe-uninitialized warning
  block: submit_bio_wait() conversions
  Update of blkg_stat and blkg_rwstat may happen in bh context
2013-12-05 15:33:27 -08:00
Kent Overstreet
c170bbb45f block: submit_bio_wait() conversions
It was being open coded in a few places.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-24 16:33:41 -07:00
Linus Torvalds
fb0d1eb892 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Almost all of these are bug fixes.  Dave Sterba's documentation update
  is the big exception because he removed our promises to set any
  machine running Btrfs on fire"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Documentation: filesystems: update btrfs tools section
  Documentation: filesystems: add new btrfs mount options
  btrfs: update kconfig help text
  btrfs: fix bio_size_ok() for max_sectors > 0xffff
  btrfs: Use trace condition for get_extent tracepoint
  btrfs: fix typo in the log message
  Btrfs: fix list delete warning when removing ordered root from the list
  Btrfs: print bytenr instead of page pointer in check-int
  Btrfs: remove dead codes from ctree.h
  Btrfs: don't wait for ordered data outside desired range
  Btrfs: fix lockdep error in async commit
  Btrfs: avoid heavy operations in btrfs_commit_super
  Btrfs: fix __btrfs_start_workers retval
  Btrfs: disable online raid-repair on ro mounts
  Btrfs: do not inc uncorrectable_errors counter on ro scrubs
  Btrfs: only drop modified extents if we logged the whole inode
  Btrfs: make sure to copy everything if we rename
  Btrfs: don't BUG_ON() if we get an error walking backrefs
2013-11-22 08:38:55 -08:00
David Sterba
4204617d14 btrfs: update kconfig help text
Reflect the current status. Portions of the text taken from the
wiki pages.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:49:09 -05:00
Akinobu Mita
475bf36ffb btrfs: fix bio_size_ok() for max_sectors > 0xffff
The data type of max_sectors in queue settings is unsigned int.  But
this value is stored to the local variable whose type is unsigned short
in bio_size_ok().  This can cause an unexpected result when max_sectors >
0xffff.
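
The truncation can be shown with a few lines of C. This is only an
illustration: truncated_max_sectors() is a hypothetical name, and the
results in the comments assume a 16-bit unsigned short, which is the
case on the platforms Linux supports.

```c
/*
 * Hypothetical reproduction: storing an unsigned int limit in an
 * unsigned short keeps only the low bits of the value.
 */
static unsigned short truncated_max_sectors(unsigned int max_sectors)
{
	unsigned short stored = (unsigned short)max_sectors;

	/* With a 16-bit short, 0x10000 becomes 0 and 0x10400 becomes 0x400. */
	return stored;
}
```

Values up to 0xffff survive unchanged, which is why the problem only
shows up on queues with a larger max_sectors.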

Cc: Chris Mason <chris.mason@fusionio.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:48:44 -05:00
Steven Rostedt
4cd8587ce8 btrfs: Use trace condition for get_extent tracepoint
Doing an if statement to test some condition to know if we should
trigger a tracepoint is pointless when tracing is disabled. This just
adds overhead and wastes a branch prediction. This is why the
TRACE_EVENT_CONDITION() was created. It places the check inside the jump
label so that the branch does not happen unless tracing is enabled.

That is, instead of doing:

	if (em)
		trace_btrfs_get_extent(root, em);

Which is basically this:

	if (em)
		if (static_key(trace_btrfs_get_extent)) {

Using a TRACE_EVENT_CONDITION() we can just do:

	trace_btrfs_get_extent(root, em);

And the condition trace event will do:

	if (static_key(trace_btrfs_get_extent)) {
		if (em) {
			...

The static key is an unconditional jump (or nop), which is faster than
having to check whether em is NULL.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:44:47 -05:00
Anand Jain
52a1575921 btrfs: fix typo in the log message
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:44:47 -05:00
Miao Xie
931aa87791 Btrfs: fix list delete warning when removing ordered root from the list
Commit b02441999e "Btrfs: don't wait for
the completion of all the ordered extents" introduced a bug that broke
the ordered root list:
 WARNING: CPU: 1 PID: 7119 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()

It is because we forgot to return the roots in the splice list to the
ordered list of the fs. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:44:46 -05:00
Stefan Behrens
56d140f5f6 Btrfs: print bytenr instead of page pointer in check-int
The page pointer information was useless. The bytenr is what you
want when you search for submitted write bios.

Additionally, a new bit in the print mask is added that allows
to selectively enable the check-int submit_bio verbose mode. Before,
the global verbose mode had to be enabled leading to many million
useless lines in the kernel log.

And a comment is added that explains that LOG_BUF_SHIFT needs to
be set to a really high value.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:44:46 -05:00
Wang Shilong
9650e05c07 Btrfs: remove dead codes from ctree.h
These two functions are only declared but never defined.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:44:45 -05:00
Filipe David Borba Manana
b52abf1e3b Btrfs: don't wait for ordered data outside desired range
In btrfs_wait_ordered_range(), if we found an extent to the left
of the start of our desired wait range and the last byte of that
extent was 1 less than the desired range's start, we would
wait for the IO completion of that extent unnecessarily.
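
The boundary case can be illustrated with an inclusive-range overlap test. This is a sketch with hypothetical names, not the btrfs implementation; it only shows that an extent ending one byte before the range's start does not intersect it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Both ranges are inclusive: [start, end]. */
static bool ranges_intersect(uint64_t a_start, uint64_t a_end,
			     uint64_t b_start, uint64_t b_end)
{
	assert(a_start <= a_end && b_start <= b_end);
	return a_start <= b_end && b_start <= a_end;
}
```

For example, an extent covering bytes 0..4095 does not intersect a wait range starting at 4096, so there is nothing to wait for.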

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:44:45 -05:00
Liu Bo
b1a06a4b57 Btrfs: fix lockdep error in async commit
Lockdep complains about btrfs's async commit:

[ 2372.462171] [ BUG: bad unlock balance detected! ]
[ 2372.462191] 3.12.0+ #32 Tainted: G        W
[ 2372.462209] -------------------------------------
[ 2372.462228] ceph-osd/14048 is trying to release lock (sb_internal) at:
[ 2372.462275] [<ffffffffa022cb10>] btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
[ 2372.462305] but there are no more locks to release!
[ 2372.462324]
[ 2372.462324] other info that might help us debug this:
[ 2372.462349] no locks held by ceph-osd/14048.
[ 2372.462367]
[ 2372.462367] stack backtrace:
[ 2372.462386] CPU: 2 PID: 14048 Comm: ceph-osd Tainted: G        W    3.12.0+ #32
[ 2372.462414] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
[ 2372.462455]  ffffffffa022cb10 ffff88007490fd28 ffffffff816f094a ffff8800378aa320
[ 2372.462491]  ffff88007490fd50 ffffffff810adf4c ffff8800378aa320 ffff88009af97650
[ 2372.462526]  ffffffffa022cb10 ffff88007490fd88 ffffffff810b01ee ffff8800898c0000
[ 2372.462562] Call Trace:
[ 2372.462584]  [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
[ 2372.462619]  [<ffffffff816f094a>] dump_stack+0x45/0x56
[ 2372.462642]  [<ffffffff810adf4c>] print_unlock_imbalance_bug+0xec/0x100
[ 2372.462677]  [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
[ 2372.462710]  [<ffffffff810b01ee>] lock_release+0x18e/0x210
[ 2372.462742]  [<ffffffffa022cb36>] btrfs_commit_transaction_async+0x1d6/0x2a0 [btrfs]
[ 2372.462783]  [<ffffffffa025a7ce>] btrfs_ioctl_start_sync+0x3e/0xc0 [btrfs]
[ 2372.462822]  [<ffffffffa025f1d3>] btrfs_ioctl+0x4c3/0x1f70 [btrfs]
[ 2372.462849]  [<ffffffff812c0321>] ? avc_has_perm+0x121/0x1b0
[ 2372.462873]  [<ffffffff812c0224>] ? avc_has_perm+0x24/0x1b0
[ 2372.462897]  [<ffffffff8107ecc8>] ? sched_clock_cpu+0xa8/0x100
[ 2372.462922]  [<ffffffff8117b145>] do_vfs_ioctl+0x2e5/0x4e0
[ 2372.462946]  [<ffffffff812c19e6>] ? file_has_perm+0x86/0xa0
[ 2372.462969]  [<ffffffff8117b3c1>] SyS_ioctl+0x81/0xa0
[ 2372.462991]  [<ffffffff817045a4>] tracesys+0xdd/0xe2

====================================================

This happens because we don't do the right thing when checking whether it's ok to
tell lockdep that we're trying to release the rwsem.

If the trans handle's type is TRANS_ATTACH, we won't acquire the freeze rwsem, but
as TRANS_ATTACH fits the check (trans < TRANS_JOIN_NOLOCK), we'll release the freeze
rwsem, which makes lockdep complain.

Reported-by: Ma Jianpeng <majianpeng@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:44:44 -05:00
Liu Bo
d52c1bcc64 Btrfs: avoid heavy operations in btrfs_commit_super
The 'git blame' history shows that the old transaction commit code had to commit
twice to ensure roots were updated, and we had to flush metadata and the super block
manually. Now, however, all of this is handled inside
the transaction commit code without extra effort.

The error handling remains the same as in the current code: return to the
caller as soon as we get an error.

This saves us a transaction commit and a flush of super block, which are both
heavy operations according to ftrace output analysis.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:42:16 -05:00
Ilya Dryomov
ba69994a40 Btrfs: fix __btrfs_start_workers retval
__btrfs_start_workers returns 0 in case it raced with
btrfs_stop_workers and lost the race.  This is wrong because the worker in
this case is not allowed to start and is in fact destroyed.  Return
-EINVAL instead.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:42:11 -05:00
Ilya Dryomov
908960c6c0 Btrfs: disable online raid-repair on ro mounts
This disables the "if needed, write the good copy back before the read
is completed" part of the read sequence for read-only mounts.

Cc: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:42:05 -05:00
Ilya Dryomov
33ef30add1 Btrfs: do not inc uncorrectable_errors counter on ro scrubs
Currently if we discover an error when scrubbing in ro mode we a)
blindly increment the uncorrectable_errors counter, and b) spam the
dmesg with the 'unable to fixup (regular) error at ...' message, even
though a) we haven't tried to determine if the error is correctable or
not, and b) we haven't tried to fixup anything.  Fix this.

Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:41:38 -05:00
Josef Bacik
d006a04816 Btrfs: only drop modified extents if we logged the whole inode
If we fsync, seek and write, rename and then fsync again we will lose the
modified hole extent because the rename will drop all of the modified extents
since we didn't do the fast search.  We need to only drop the modified extents
if we didn't do the fast search and we were logging the entire inode as we don't
need them anymore, otherwise this is being premature.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:41:32 -05:00
Josef Bacik
6cfab851f4 Btrfs: make sure to copy everything if we rename
If we rename a file that is already in the log and we fsync again we will lose
the new name.  This is because we just log the inode update and not the new ref.
To fix this we just need to check if we are logging the new name of the inode
and copy all the metadata instead of just updating the inode itself.  With this
patch my testcase now passes.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:41:24 -05:00
Josef Bacik
4724b106b9 Btrfs: don't BUG_ON() if we get an error walking backrefs
We can just return false for this so we stop doing the snapshot aware defrag
stuff.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:41:16 -05:00
Linus Torvalds
ffd3c0260a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This pull fixes the empty_zero_page bug that Heiko reported, and
  includes one more cleanup from Al Viro"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: get rid of fdentry()
  btrfs: fix empty_zero_page misusage
2013-11-16 11:57:05 -08:00
Linus Torvalds
9073e1a804 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "Usual earth-shaking, news-breaking, rocket science pile from
  trivial.git"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
  doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
  doc: add missing files to timers/00-INDEX
  timekeeping: Fix some trivial typos in comments
  mm: Fix some trivial typos in comments
  irq: Fix some trivial typos in comments
  NUMA: fix typos in Kconfig help text
  mm: update 00-INDEX
  doc: Documentation/DMA-attributes.txt fix typo
  DRM: comment: `halve' -> `half'
  Docs: Kconfig: `devlopers' -> `developers'
  doc: typo on word accounting in kprobes.c in mutliple architectures
  treewide: fix "usefull" typo
  treewide: fix "distingush" typo
  mm/Kconfig: Grammar s/an/a/
  kexec: Typo s/the/then/
  Documentation/kvm: Update cpuid documentation for steal time and pv eoi
  treewide: Fix common typo in "identify"
  __page_to_pfn: Fix typo in comment
  Correct some typos for word frequency
  clk: fixed-factor: Fix a trivial typo
  ...
2013-11-15 16:47:22 -08:00
Al Viro
54563d41a5 btrfs: get rid of fdentry()
3 of 4 callers actually want file_inode()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-15 09:18:14 -05:00
Chris Mason
46e0f66a0c btrfs: fix empty_zero_page misusage
Heiko Carstens noticed that btrfs was using empty_zero_page
incorrectly.  He explained:

	The definition of empty_zero_page is architecture specific.  It
	is (currently) either a character array, an unsigned long
	containing the address of the empty_zero_page, or even worse
	only the address of the struct page belonging to the
	empty_zero_page.

This commit changes btrfs to use a for-loop instead.  On x86
the resulting .ko is smaller, and we're no longer worrying about
how each arch builds its zeros.

Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-15 09:17:47 -05:00
Miao Xie
91aef86f3b Btrfs: rename btrfs_start_all_delalloc_inodes
Rename btrfs_start_all_delalloc_inodes() so that its name is
consistent with btrfs_wait_ordered_roots(), since they are always
used in the same place.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:13:58 -05:00
Miao Xie
b02441999e Btrfs: don't wait for the completion of all the ordered extents
It is very likely that there are lots of ordered extents in the filesystem.
If we wait for the completion of all of them when we want to reclaim some
space for the metadata space reservation, we will be blocked for a long
time, and performance will drop suddenly for a long time.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:13:44 -05:00
Miao Xie
9f3a074d10 Btrfs: don't wait for all the async delalloc when shrinking delalloc
It was very likely that there were lots of async delalloc pages in the
filesystem. If we waited until all the pages were flushed, we would be
blocked for a long time, and performance would also drop.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:13:37 -05:00
Miao Xie
c61a16a701 Btrfs: fix the confusion between delalloc bytes and metadata bytes
In shrink_delalloc(), what we need to reclaim is metadata space, so
flushing pages by to_reclaim is not reasonable: it is very likely that
the pages we flush are not enough. We then had to invoke the flush
function several times and, at worst, call flush_space() several
times, which wasted time.

We improve this by converting the metadata space size we need to
reserve into delalloc bytes. This way, we can flush a reasonable
number of pages.

(For now we use a fixed number for the conversion, which is not flexible; maybe
 we can find a good way to improve it in the future.)

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:13:30 -05:00
Miao Xie
18cd8ea6df Btrfs: pick up the code for the item number calculation in flush_space()
This patch factors out the code that calculates the number of
items for which we need to reserve space; it will be used in the next
patch.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:13:23 -05:00
Miao Xie
38c135af8e Btrfs: wait for the ordered extent only when we want
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:13:15 -05:00
Miao Xie
d3ee29e396 Btrfs: remove unnecessary initialization and memory barrior in shrink_delalloc()
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:13:07 -05:00
Wang Shilong
3b7a016f44 Btrfs: avoid unnecessary scrub workers allocation
We only allocate scrub workers if we pass all the necessary
checks, for example, that there is no operation in progress.

Besides, move the mutex lock protection outside of scrub_workers_get()
and scrub_workers_put().

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:12:58 -05:00
Josef Bacik
007d31f755 Btrfs: check file extent type before anything else
I hit this problem with my no holes patch and it made me realize what the
problem was for bz 60834.  If the first item in the leaf is an inline extent and
we try to read anything starting from disk_bytenr onward we will read off the
end of the leaf.  So we need to check to see what its type is, and if it's not
REG we can just break out.  This should fix this problem.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:12:49 -05:00
Rashika
f570e757b5 btrfs: Remove useless variable in write_ctree_super()
The function write_ctree_super() in disk-io.c uses the variable ret only to return
the result of write_all_supers(). Since this variable serves
no other purpose, the patch removes it and returns the result of the
called function directly.

Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:12:40 -05:00
Dulshani Gunawardhana
678712545b btrfs: Fix checkpatch.pl warning of spacing issues
Fix spacing issues detected via checkpatch.pl in accordance with the
kernel style guidelines.

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:12:31 -05:00
Dulshani Gunawardhana
d9b0d9ba04 btrfs: Replace kmalloc with kmalloc_array
Replace kmalloc(size * nr, ) with kmalloc_array(nr, size), thus making
it easier to check that the calculation doesn't wrap or return a smaller allocation.
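
The wrap-around concern can be sketched in userspace. The helper below is a hypothetical analogue built on malloc(), not the kernel's kmalloc_array(); it only shows the kind of check that an array-allocation helper can perform before multiplying.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace analogue of an overflow-checked array
 * allocation: refuse the request if nr * size would wrap. */
static void *alloc_array(size_t nr, size_t size)
{
	if (size != 0 && nr > SIZE_MAX / size)
		return NULL;
	return malloc(nr * size);
}
```

An open-coded malloc(size * nr) would silently allocate a too-small buffer when the product wraps; with the check, the caller sees a failed allocation instead.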

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:12:22 -05:00
Dulshani Gunawardhana
e248e04e77 btrfs: Enclose macros with complex values within parenthesis
Enclose macros with complex values in parentheses, in accordance with
checkpatch.pl.

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:12:06 -05:00
Dulshani Gunawardhana
fae7f21cec btrfs: Use WARN_ON()'s return value in place of WARN_ON(1)
Use WARN_ON()'s return value in place of WARN_ON(1) for cleaner source
code that outputs more descriptive warnings. Also fix the styling
warning of redundant braces that came up as a result of this fix.
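
The before and after shapes of this cleanup can be sketched in userspace. WARN_ON_SKETCH() and the check_*() functions below are hypothetical stand-ins, not the kernel's WARN_ON(); the sketch assumes only that the macro reports the tested expression and returns its truth value.

```c
#include <stdio.h>

/* Hypothetical helper: report the failing expression, return the condition. */
static int warn_on(int cond, const char *expr, const char *file, int line)
{
	if (cond)
		fprintf(stderr, "WARNING: %s at %s:%d\n", expr, file, line);
	return cond;
}

#define WARN_ON_SKETCH(cond) warn_on(!!(cond), #cond, __FILE__, __LINE__)

/* Old shape: separate test, warning text only says "1". */
static int check_old(int err)
{
	if (err) {
		WARN_ON_SKETCH(1);
		return -1;
	}
	return 0;
}

/* New shape: the warning names the tested expression, no braces needed. */
static int check_new(int err)
{
	if (WARN_ON_SKETCH(err))
		return -1;
	return 0;
}
```

Both functions return the same values; the difference is that the second one prints the expression that was actually tested.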

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:53 -05:00
Dulshani Gunawardhana
b19e684393 btrfs: Remove redundant local zero structure
Remove redundant local zero structure, replacing it by the kernel's
global ZERO_PAGE.

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:39 -05:00
Dulshani Gunawardhana
3c45bfc152 btrfs: Pack struct btrfs_device
Pack the structure btrfs_device in volumes.h to eliminate holes detected
by pahole, thus reducing binary memory footprint.
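
The effect of removing holes can be illustrated with a small userspace sketch. The structures below are hypothetical and unrelated to the real btrfs_device layout; they only show how reordering members can shrink a structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with padding holes around the 64-bit member. */
struct with_holes {
	uint8_t  flag_a;	/* typically followed by 7 bytes of padding */
	uint64_t bytes;
	uint8_t  flag_b;	/* typically followed by 7 bytes of tail padding */
};

/* Same members, widest first, so the small ones share one slot. */
struct reordered {
	uint64_t bytes;
	uint8_t  flag_a;
	uint8_t  flag_b;
};

/* Bytes saved per instance; the exact value depends on the ABI. */
static size_t bytes_saved(void)
{
	return sizeof(struct with_holes) - sizeof(struct reordered);
}
```

The saving applies to every allocated instance, which is why such changes can matter for frequently allocated structures.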

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:26 -05:00
Rashika
95e94d14b4 btrfs: Replace multiple atomic_inc() with atomic_add()
This patch replaces multiple atomic_inc() calls with a single atomic_add() in
delayed-inode.c to reduce source code and to compile to fewer
instructions.
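
The same idea can be sketched with C11 atomics in userspace. This is an illustration with hypothetical function names, using atomic_fetch_add() from <stdatomic.h> rather than the kernel's atomic_inc() and atomic_add().

```c
#include <stdatomic.h>

/* Two separate increments: two atomic read-modify-write operations. */
static void bump_twice(atomic_int *counter)
{
	atomic_fetch_add(counter, 1);
	atomic_fetch_add(counter, 1);
}

/* One addition that reaches the same final value. */
static void bump_by_two(atomic_int *counter)
{
	atomic_fetch_add(counter, 2);
}
```

The final value is identical; the single addition simply never exposes the intermediate value to other threads.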

Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:19 -05:00
Rashika
2e9f595497 btrfs: Add helper function for free_root_pointers()
The function free_root_pointers() in disk-io.h contains redundant code.
Therefore, this patch adds a helper function free_root_extent_buffers()
to free_root_pointers() to eliminate redundancy.

Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:13 -05:00
Liu Bo
48ec47364b Btrfs: fix a crash when running balance and defrag concurrently
Running balance and defrag concurrently can end up with a crash:

kernel BUG at fs/btrfs/relocation.c:4528!
RIP: 0010:[<ffffffffa01ac33b>]  [<ffffffffa01ac33b>] btrfs_reloc_cow_block+ 0x1eb/0x230 [btrfs]
Call Trace:
  [<ffffffffa01398c1>] ? update_ref_for_cow+0x241/0x380 [btrfs]
  [<ffffffffa0180bad>] ? copy_extent_buffer+0xad/0x110 [btrfs]
  [<ffffffffa0139da1>] __btrfs_cow_block+0x3a1/0x520 [btrfs]
  [<ffffffffa013a0b6>] btrfs_cow_block+0x116/0x1b0 [btrfs]
  [<ffffffffa013ddad>] btrfs_search_slot+0x43d/0x970 [btrfs]
  [<ffffffffa0153c57>] btrfs_lookup_file_extent+0x37/0x40 [btrfs]
  [<ffffffffa0172a5e>] __btrfs_drop_extents+0x11e/0xae0 [btrfs]
  [<ffffffffa013b3fd>] ? generic_bin_search.constprop.39+0x8d/0x1a0 [btrfs]
  [<ffffffff8117d14a>] ? kmem_cache_alloc+0x1da/0x200
  [<ffffffffa0138e7a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
  [<ffffffffa0173ef0>] btrfs_drop_extents+0x60/0x90 [btrfs]
  [<ffffffffa016b24d>] relink_extent_backref+0x2ed/0x780 [btrfs]
  [<ffffffffa0162fe0>] ? btrfs_submit_bio_hook+0x1e0/0x1e0 [btrfs]
  [<ffffffffa01b8ed7>] ? iterate_inodes_from_logical+0x87/0xa0 [btrfs]
  [<ffffffffa016b909>] btrfs_finish_ordered_io+0x229/0xac0 [btrfs]
  [<ffffffffa016c3b5>] finish_ordered_fn+0x15/0x20 [btrfs]
  [<ffffffffa018cbe5>] worker_loop+0x125/0x4e0 [btrfs]
  [<ffffffffa018cac0>] ? btrfs_queue_worker+0x300/0x300 [btrfs]
  [<ffffffff81075ea0>] kthread+0xc0/0xd0
  [<ffffffff81075de0>] ? insert_kthread_work+0x40/0x40
  [<ffffffff8164796c>] ret_from_fork+0x7c/0xb0
  [<ffffffff81075de0>] ? insert_kthread_work+0x40/0x40
----------------------------------------------------------------------

It turns out that the balance operation bumps the root's @last_snapshot,
which enables the snapshot-aware defrag path; backref walking then
finds the data reloc tree as the refs' parent and hits the BUG_ON() during COW.

As the data reloc tree's data is only for relocation and will be deleted right
after relocation is done, it's unnecessary to walk the refs that belong to the data reloc
tree; it's better to skip them.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:07 -05:00
Liu Bo
6f519564d7 Btrfs: do not run snapshot-aware defragment on error
If something goes wrong in write endio, running snapshot-aware defragment
can end up with undefined results, maybe a crash, so we should avoid it.

In order to share similar code, this also adds a helper to free the struct for
snapshot-aware defrag.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:11:00 -05:00
Filipe David Borba Manana
269d040ff2 Btrfs: log recovery, don't unlink inode always on error
If we get any error while doing a dir index/item lookup in the
log tree, we were always unlinking the corresponding inode in
the subvolume. It makes sense to unlink only if the lookup failed
to find the dir index/item, which corresponds to NULL or -ENOENT,
and not when other errors happen (like a transient -ENOMEM or -EIO).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:10:48 -05:00
Filipe David Borba Manana
488111aa0e Btrfs: fix csum search offset/length calculation in log tree
We were setting the csums search offset and length to the right values if
the extent is compressed, but later on right before doing the csums lookup
we were overriding these two parameters regardless of compression being
set or not for the extent.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:10:42 -05:00
Filipe David Borba Manana
e46f5388cd Btrfs: fix verification of dir_item
We were ignoring the name component of the dir_item. Both the
name and data must fit within BTRFS_MAX_XATTR_SIZE(root).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:10:36 -05:00
Wang Shilong
9b011adfe1 Btrfs: remove scrub_super_lock holding in btrfs_sync_log()
Originally, we introduced scrub_super_lock to synchronize
tree log code with scrubbing super.

However we can replace scrub_super_lock with device_list_mutex,
because writing super will hold this mutex, this will reduce an extra
lock holding when writing supers in sync log code.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:10:13 -05:00
Wang Shilong
7fdf4b608d Btrfs: use 'u64' rather than 'int' to get extent's generation
We defined an 'int' to get the extent's generation by mistake; fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:10:06 -05:00
Miao Xie
9dced186f9 Btrfs: fix the free space write out failure when there is no data space
After running space balance on a new fs, the fs check program output the
following warning message:
 free space inode generation (0) did not match free space cache generation (20)

Steps to reproduce:
 # mkfs.btrfs -f <dev>
 # mount <dev> <mnt>
 # btrfs balance start <mnt>
 # umount <mnt>
 # btrfs check <dev>

It was because there was no data space after the space balance, and the free
space write out task didn't try to allocate a new data chunk for the free space
inode when doing the reservation. So the data space reservation failed, and in
order to tell the free space loader that this free space inode could not be
trusted, the generation of the free space inode wasn't updated. Then the check
program found this problem and output the above message.

But in fact, it is safe to try to allocate a new data chunk when we find
that the data space is not enough. The patch fixes the above problem this way.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:08:49 -05:00
Josef Bacik
9e6a0c52b7 Btrfs: stop committing the transaction so much during relocate
I noticed with my horrible snapshot exerciser that we were taking forever to
relocate the larger the file system got.  This appeared to be because we were
committing the transaction _constantly_.  There were a few places where we do
braindead things with metadata reservation, like start a transaction and then
try to refill the block rsv, which not only keeps us from committing a
transaction during the enospc stuff, but keeps us from doing some of the harder
flushing work which will make us more likely to need to commit the transaction.
We also were checking the block rsv and committing the transaction if the block
rsv was below a certain threshold, but we were doing this in a place where we
don't actually keep anything in the block rsv so this was always ending up false
so we always committed the transaction in this case.  I tested this to make sure
it didn't break anything, but it takes about 10 hours to get the box to this
state so I don't know how much of an impact it will really make.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:08:31 -05:00
Josef Bacik
9f23e289ed Btrfs: make sure the delalloc workers actually flush compressed writes
When using delalloc workers in a non-waiting way (like for enospc handling) we
can end up not actually waiting for the dirty pages to be started if we have
compression.  We need to add an extra filemap flush to make sure any async
extents that have started are actually moved along before returning.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:08:22 -05:00
Josef Bacik
9385876917 Btrfs: take ordered root lock when removing ordered operations inode
A user reported a list corruption warning from btrfs_remove_ordered_extent.  It
is because we aren't taking the ordered_root_lock when we remove the inode from
the ordered operations list.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:08:10 -05:00
Josef Bacik
d788a34929 Btrfs: don't abort transaction in run_delalloc_nocow
This is just the write path, the only reason we start a transaction is so we can
check cross references, we don't make any actual changes, so there is no reason
to abort the transaction if we fail.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:07:58 -05:00
Josef Bacik
02ecd2c278 Btrfs: do not bug_on if we try to cow a free space cache inode
We can just return an error and we'll bail out properly.  We still want to catch
this case to make sure we don't have a bug somewhere, so just warn if this pops
up.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:07:49 -05:00
Josef Bacik
0ef8b72607 Btrfs: return an error from btrfs_wait_ordered_range
I noticed that if the free space cache has an error writing out its data it
won't actually error out, it will just carry on.  This is because it doesn't
check the return value of btrfs_wait_ordered_range, which didn't actually return
anything.  So fix this in order to keep us from making free space cache look
valid when it really isn't.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:07:35 -05:00
Josef Bacik
ed2590953b Btrfs: stop using vfs_read in send
Apparently we don't actually close the files until we return to userspace, so
stop using vfs_read in send.  This is actually better for us since we can avoid
all the extra logic of holding the file we're sending open and making sure to
clean it up.  This will fix people who have been hitting too many files open
errors when trying to send.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:07:11 -05:00
Stefan Behrens
301993a4a1 Btrfs: check_int, remove warning for mixed-mode
In mixed-mode, when a data-block was later reused for metadata, a
warning was printed. This condition is now filtered out and the
warning is eliminated in this case.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:04:25 -05:00
Stefan Behrens
a5f519c91d Btrfs: fix check_int 'leaf item out of bounce' regression
Yet another cleanup patch broke code for which no xfstest exists.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:04:06 -05:00
Filipe David Borba Manana
5599488708 Btrfs: optimize extent item search in run_delayed_extent_op
Instead of doing another extent tree search if the first search failed
to find a metadata item, check if the previous item in the leaf is an
extent item and use it if it is, otherwise do the second tree search
for an extent item.
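
The lookup order described above can be modeled in user space. This is only a
sketch: the leaf is a sorted Python list of (objectid, type, offset) keys, the
function name find_extent_item is hypothetical, and the two key type values
are assumptions recalled from the btrfs on-disk format rather than taken from
this commit.

```python
from bisect import bisect_left

# Key type values as recalled from the btrfs on-disk format; treat them as
# assumptions of this sketch.
EXTENT_ITEM_KEY = 168
METADATA_ITEM_KEY = 169


def find_extent_item(leaf, bytenr, level, num_bytes):
    """Look up the extent record for a tree block inside one sorted leaf.

    leaf is a sorted list of (objectid, type, offset) keys. Returns the slot
    of the matching item, or None when a second tree search would be needed.
    """
    key = (bytenr, METADATA_ITEM_KEY, level)
    slot = bisect_left(leaf, key)
    if slot < len(leaf) and leaf[slot] == key:
        return slot  # found the metadata item directly
    # The search stopped at the insertion point. An extent item for the same
    # block sorts just before that point, so check the previous slot first.
    if slot > 0 and leaf[slot - 1] == (bytenr, EXTENT_ITEM_KEY, num_bytes):
        return slot - 1
    return None  # fall back to a second search for the extent item
```

The point of the change is the middle branch: when the previous slot already
holds the extent item, the second tree search is skipped entirely.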

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:03:53 -05:00
Jeff Mahoney
cab45e22da btrfs: add tracing for failed reservations
When debugging ENOSPC issues, it's nice to be able to see which
reservations failed as well as the ones which succeeded.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:03:37 -05:00
Zach Brown
8b558c5f09 btrfs: remove fs/btrfs/compat.h
fs/btrfs/compat.h only contained trivial macro wrappers of drop_nlink()
and inc_nlink().  This doesn't belong in mainline.

Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:03:19 -05:00
Zach Brown
1877e1a747 btrfs: remove move_pages()
move_pages() has an inefficient backwards byte copy of regions of two
different pages.  They're different pages so the regions won't overlap
and it could use memcpy().

At that point, though, move_pages() would be a slightly dimmer
re-implementation of copy_pages() that lacked the test for overlapping
page regions.

So remove move_pages() and just call copy_pages().

Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:03:09 -05:00
Zach Brown
4546bcaeba btrfs: use get_seconds() instead of btrfs wrapper
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:03:00 -05:00
Filipe David Borba Manana
8185554d3e Btrfs: fix incorrect inode acl reset
When a directory has a default ACL and a subdirectory is created
under that directory, btrfs_init_acl() is called when the
subdirectory's inode is created to initialize the inode's ACL
(inherited from the parent directory) but it was clearing the ACL
from the inode after setting it if posix_acl_create() returned
success, instead of clearing it only if it returned an error.

To reproduce this issue:

$ mkfs.btrfs -f /dev/loop0
$ mount /dev/loop0 /mnt
$ mkdir /mnt/acl
$ setfacl -d --set u::rwx,g::rwx,o::- /mnt/acl
$ getfacl /mnt/acl
user::rwx
group::rwx
other::r-x
default:user::rwx
default:group::rwx
default:other::---

$ mkdir /mnt/acl/dir1
$ getfacl /mnt/acl/dir1
user::rwx
group::rwx
other::---

After unmounting and mounting the filesystem again, getfacl returned the
expected ACL:

$ umount /mnt
$ mount /dev/loop0 /mnt
$ getfacl /mnt/acl/dir1
user::rwx
group::rwx
other::---
default:user::rwx
default:group::rwx
default:other::---

Meaning that the underlying xattr was persisted.

Reported-by: Giuseppe Fierro <giuseppe@fierro.org>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:02:51 -05:00
Stefan Behrens
ff76b05655 Btrfs: Don't allocate inode that is already in use
Due to an off-by-one error, it is possible to reproduce a bug
when the inode cache is used.

The same inode number is assigned twice, the second time this
leads to an EEXIST in btrfs_insert_empty_items().

The issue can happen when a file is removed right after a subvolume
is created and then a new inode number is created before the
inodes in free_inode_pinned are processed.
unlink() calls btrfs_return_ino() which calls start_caching() in this
case which adds [highest_ino + 1, BTRFS_LAST_FREE_OBJECTID] by
searching for the highest inode (which already cannot find the
unlinked one anymore in btrfs_find_free_objectid()). So if this
unlinked inode's number is equal to the highest_ino + 1 (or >= this value
instead of > this value which was the off-by-one error), we mustn't add
the inode number to free_ino_pinned (caching_thread() does it right).
In this case we need to try directly to add the number to the inode_cache
which will fail in this case.

When this inode number is allocated while it is still in free_ino_pinned,
it is allocated and still added to the free inode cache when the
pinned inodes are processed, thus one of the following inode number
allocations will get an inode that is already in use and fail with EEXIST
in btrfs_insert_empty_items().

One example which was created with the reproducer below:
Create a snapshot, work in the newly created snapshot for the rest.
In unlink(inode 34284) call btrfs_return_ino() which calls start_caching().
start_caching() calls add_free_space [34284, 18446744073709517077].
In btrfs_return_ino(), call start_caching pinned [34284, 1] which is wrong.
mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284.
btrfs_unpin_free_ino calls add_free_space [34284, 1].
mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284.
EEXIST when the new inode is inserted.
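
The off-by-one can be reduced to a single comparison. The sketch below is a
user-space model, not the kernel code: goes_to_pinned and cached_start are
hypothetical names, and cached_start stands for highest_ino + 1, the first
number of the range that start_caching() has already added as free.

```python
def goes_to_pinned(ino, cached_start, fixed=True):
    """Decide whether a just-unlinked inode number is parked in the pinned set.

    cached_start is highest_ino + 1, the first number of the free range that
    start_caching() already added to the free inode cache.
    """
    if fixed:
        already_cached = ino >= cached_start
    else:
        already_cached = ino > cached_start  # the off-by-one
    return not already_cached
```

With the numbers from the example above, goes_to_pinned(34284, 34284,
fixed=False) reports that the number is pinned even though it is already in
the free range, which is how it ends up being handed out twice.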

One possible reproducer is this one:
 #!/bin/sh
 # preparation
TEST_DEV=/dev/sdc1
TEST_MNT=/mnt
umount ${TEST_MNT} 2>/dev/null || true
mkfs.btrfs -f ${TEST_DEV}
mount ${TEST_DEV} ${TEST_MNT} -o \
 rw,relatime,compress=lzo,space_cache,inode_cache
btrfs subv create ${TEST_MNT}/s1
for i in `seq 34027`; do touch ${TEST_MNT}/s1/${i}; done
btrfs subv snap ${TEST_MNT}/s1 ${TEST_MNT}/s2
FILENAME=`find ${TEST_MNT}/s1/ -inum 4085 | sed 's|^.*/\([^/]*\)$|\1|'`
rm ${TEST_MNT}/s2/$FILENAME
touch ${TEST_MNT}/s2/$FILENAME
 # the following steps can be repeated to reproduce the issue again and again
[ -e ${TEST_MNT}/s3 ] && btrfs subv del ${TEST_MNT}/s3
btrfs subv snap ${TEST_MNT}/s2 ${TEST_MNT}/s3
rm ${TEST_MNT}/s3/$FILENAME
touch ${TEST_MNT}/s3/$FILENAME
ls -alFi ${TEST_MNT}/s?/$FILENAME
touch ${TEST_MNT}/s3/_1 || logger FAILED
ls -alFi ${TEST_MNT}/s?/_1
touch ${TEST_MNT}/s3/_2 || logger FAILED
ls -alFi ${TEST_MNT}/s?/_2
touch ${TEST_MNT}/s3/__1 || logger FAILED
ls -alFi ${TEST_MNT}/s?/__1
touch ${TEST_MNT}/s3/__2 || logger FAILED
ls -alFi ${TEST_MNT}/s?/__2
 # if the above is not enough, add the following loop:
for i in `seq 3 9`; do touch ${TEST_MNT}/s3/__${i} || logger FAILED; done
 #for i in `seq 3 34027`; do touch ${TEST_MNT}/s3/__${i} || logger FAILED; done
 # one of the touch(1) calls in s3 fail due to EEXIST because the inode is
 # already in use that btrfs_find_ino_for_alloc() returns.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:02:36 -05:00
Filipe David Borba Manana
e8b0d724d5 Btrfs: fix btrfs_prev_leaf() previous key computation
If we decrement the key type, we must reset its offset to the largest
possible offset (u64)-1. If we decrement the key's objectid, then we
must reset the key's type and offset to their largest possible values,
(u8)-1 and (u64)-1 respectively. Not doing so can make us miss
items in the tree.
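
As an illustration of the rule above, here is a small user-space sketch of the
previous-key computation. Keys are modeled as (objectid, type, offset) tuples
and prev_key is a hypothetical helper name, not the kernel function.

```python
U64_MAX = (1 << 64) - 1  # (u64)-1
U8_MAX = (1 << 8) - 1    # (u8)-1


def prev_key(objectid, key_type, offset):
    """Return the largest key strictly smaller than the given one, or None."""
    if offset > 0:
        return (objectid, key_type, offset - 1)
    if key_type > 0:
        # Stepping the type down must reset the offset to its maximum.
        return (objectid, key_type - 1, U64_MAX)
    if objectid > 0:
        # Stepping the objectid down resets both type and offset.
        return (objectid - 1, U8_MAX, U64_MAX)
    return None  # (0, 0, 0) has no previous key
```

Leaving the offset at 0 after decrementing the type would skip every key of
the lower type with a non-zero offset, which is the miss described above.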

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:02:26 -05:00
Filipe David Borba Manana
e93ae26fe1 Btrfs: optimize tree-log.c:count_inode_refs()
Avoid repeated tree searches by processing all inode ref items in
a leaf at once instead of processing one at a time, followed by a
path release and a tree search for a key with a decremented offset.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:02:19 -05:00
Geyslan G. Bem
229eed4348 btrfs: simplify kmalloc+copy_from_user to memdup_user
Use memdup_user rather than duplicating its implementation
This is a little bit restricted to reduce false positives

The semantic patch that makes this report is available
in scripts/coccinelle/api/memdup_user.cocci.

More information about semantic patching is available at
http://coccinelle.lip6.fr/

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:01:51 -05:00
chandan
5ede859b00 Btrfs: btrfs_add_ordered_operation: Fix last modified transaction comparison.
Comparison of an inode's last modified transaction with the last committed
transaction is incorrect. Fix it.

Signed-off-by: chandan <chandan@linux.vnet.ibm.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:01:37 -05:00
Filipe David Borba Manana
3c77bd94ec Btrfs: don't leak delayed node on path allocation failure
If the path allocation failed, we would return without decrementing
the reference count in the delayed node we got before, resulting
in a leak.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:01:27 -05:00
Stefan Behrens
361c093d7f Btrfs: Wait for uuid-tree rebuild task on remount read-only
If the user remounts the filesystem read-only while the uuid-tree
scan and rebuild task is still running (this happens once after the
filesystem was mounted with an old kernel, or when forced with the
mount options), the remount should wait on the task's completion
before setting the filesystem read-only. Otherwise the background
task continues to write to the filesystem which is apparently not
what users expect.

The reproducer:

TEST_DEV=/dev/sdzzzzz1
TEST_MNT=/mnt
mkfs.btrfs -f $TEST_DEV
mount $TEST_DEV $TEST_MNT
for i in `seq 50000`; do btrfs subvolume create ${TEST_MNT}/$i; done
umount $TEST_MNT
mount $TEST_DEV $TEST_MNT -o rescan_uuid_tree
sleep 1
ps -elf | fgrep '[btrfs-uuid]' | grep -v grep
mount $TEST_DEV $TEST_MNT -o ro,remount
ps -elf | fgrep '[btrfs-uuid]' | grep -v grep
sleep 1
umount $TEST_MNT

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:01:18 -05:00
Stefan Behrens
27087f3701 Btrfs: init device stats for new devices
Device stats are only initialized (read from tree items) on mount.
Trying to read device stats after adding or replacing new devices will
return errors.

btrfs_init_new_device() and btrfs_init_dev_replace_tgtdev() are the two
functions that allocate and initialize new btrfs_device structures after
a filesystem is mounted. They set the device stats to zero by using
kzalloc() which is correct for new devices. The only missing thing was
to declare these stats as being valid (device->dev_stats_valid = 1) and
this patch adds this missing code.

This is the reproducer:

TEST_DEV1=/dev/sdzzzzz1
TEST_DEV2=/dev/sdzzzzz2
TEST_DEV3=/dev/sdzzzzz3
TEST_MNT=/mnt
mkfs.btrfs $TEST_DEV1
mount $TEST_DEV1 $TEST_MNT
btrfs device add $TEST_DEV2 $TEST_MNT
btrfs device stat $TEST_MNT
btrfs replace start -B $TEST_DEV2 $TEST_DEV3 $TEST_MNT
btrfs device stat $TEST_MNT
umount $TEST_MNT

Reported-by: Ondrej Kunc <kunc88@gmail.com>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:01:09 -05:00
Liu Bo
30d133fc22 Btrfs: fixup error path in __btrfs_inc_extent_ref
When we fail to add a reference after a non-inline insertion for some reason,
e.g. ENOSPC, we abort the transaction but don't return the error to the
caller, who then has to walk around again to find out that something went
wrong. That's unnecessary.

Also fixup other error paths to keep it simple.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:01:00 -05:00
Ilya Dryomov
e649e587cb Btrfs: disallow 'btrfs {balance,replace} cancel' on ro mounts
For both balance and replace, cancelling involves changing the on-disk
state and committing a transaction, which is not a good thing to do on
read-only filesystems.

Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:00:50 -05:00
Ilya Dryomov
adfa97cbdf Btrfs: don't leak ioctl args in btrfs_ioctl_dev_replace
struct btrfs_ioctl_dev_replace_args memory is leaked if replace is
requested on a read-only filesystem.  Fix it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:00:37 -05:00
Ilya Dryomov
f747cab7b7 Btrfs: nuke a bogus rw_devices decrement in __btrfs_close_devices
On mount failures, __btrfs_close_devices can be called well before
dev-replace state is read and ->is_tgtdev_for_dev_replace is set.  This
leads to a bogus decrement of ->rw_devices and sets off a WARN_ON in
__btrfs_close_devices if replace target device happens to be on the
lists and we fail early in the mount sequence.  Fix this by checking
the devid instead of ->is_tgtdev_for_dev_replace before the decrement:
for replace targets devid is always equal to BTRFS_DEV_REPLACE_DEVID.

Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:00:24 -05:00
Geyslan G. Bem
03b2f08b5f btrfs: Fix memory leakage in the tree-log.c
In add_inode_ref() function:

Initializes local pointers.

Reduces the logical condition with the __add_inode_ref() return
value by using only one 'goto out'.

Centralizes the exiting, ensuring the freeing of all used memory.

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 22:00:11 -05:00
Liu Bo
498456d33e Btrfs: kill unused code in btrfs_search_forward
After commit de78b51a28
(btrfs: remove cache only arguments from defrag path), @blockptr is no longer
used.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:59:56 -05:00
Liu Bo
8319bfe136 Btrfs: cleanup dead code of defragment
@is_extent is no longer needed since we don't defrag extent root.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:59:45 -05:00
Filipe David Borba Manana
efd0c4055a Btrfs: remove unnecessary key copy when logging inode
The btrfs_insert_empty_item() function doesn't modify its
key argument.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:59:30 -05:00
Chandra Seetharaman
452c75c3d2 Btrfs: Simplify the logic in alloc_extent_buffer() for existing extent buffer case
alloc_extent_buffer() uses radix_tree_lookup() when radix_tree_insert()
fails with EEXIST. That part of the code is very similar to the code in
find_extent_buffer(). This patch replaces radix_tree_lookup() and
surrounding code in alloc_extent_buffer() with find_extent_buffer().

Note that radix_tree_lookup() does not need to be protected by
tree->buffer_lock. It is protected by eb->refs.

While at it, this patch
  - changes the other usage of radix_tree_lookup() in alloc_extent_buffer()
    with find_extent_buffer() to reduce redundancy.
  - removes the unused argument 'len' to find_extent_buffer().

Signed-Off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Zach Brown <zab@redhat.com>

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:59:11 -05:00
Josef Bacik
7f4ca37c48 Btrfs: fix up seek_hole/seek_data handling
Whoever wrote this was braindead.  Also it doesn't work right if you have
VACANCY's since we assumed you would only have that at the end of the file,
which won't be the case in the near future.  I tested this with generic/285 and
generic/286 as well as the btrfs tests that use fssum since it uses
seek_hole/seek_data to verify things are ok.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:58:56 -05:00
Josef Bacik
4277a9c3b3 Btrfs: add an assert to btrfs_lookup_csums_range for alignment
I was hitting weird issues when trying to remove hole extents and it turned out
it was because I was sending non-aligned offsets down to
btrfs_lookup_csums_range.  So add an assert for this in case somebody trips over
this in the future.  Thanks,
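
A minimal sketch of the kind of check such an assert performs. The helper name
csum_range_is_aligned, the inclusive end offset and the 4096 byte sector size
are assumptions made for illustration only; the commit message just says that
the offsets must be aligned.

```python
def csum_range_is_aligned(start, end, sectorsize=4096):
    """Return True when [start, end] (end inclusive) covers whole sectors."""
    return start % sectorsize == 0 and (end + 1) % sectorsize == 0
```

A caller would assert on the result before doing the lookup, so that a
misaligned range is caught at the call site instead of showing up later as a
confusing csum lookup failure.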

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:58:45 -05:00
Josef Bacik
ed9e8af88e Btrfs: fix hole check in log_one_extent
I added an assert to make sure we were looking up aligned offsets for csums and
I tripped it when running xfstests.  This is because log_one_extent was checking
if block_start == 0 for a hole instead of EXTENT_MAP_HOLE.  This worked out fine
in practice it seems, but it adds a lot of extra work that is unneeded.  With
this fix I'm no longer tripping my assert.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:58:32 -05:00
Josef Bacik
0e30db86a4 Btrfs: add a sanity test for a vacant extent at the front of a file
Btrfs_get_extent was not handling this case properly, add a test to make sure we
don't regress.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:58:19 -05:00
Josef Bacik
25a50341b6 Btrfs: handle a missing extent for the first file extent
While trying to kill our hole extents I noticed I was seeing problems where we
seek into a file and then start writing and then try to fiemap that file later.
This is because we search for offset 0, don't find anything and so back up one
slot, which puts us at the inode ref or something like that, which means we goto
not_found and create an extent map for our entire search area.  This isn't quite
what we want, we want to move forward one slot and see if there is an extent
there so we can limit our hole extent.  This patch fixes this problem, I will
add a testcase for this as well.  Thanks,
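
The intended behaviour can be sketched in user space as follows. This is a
simplified model, not btrfs_get_extent itself: extents are reduced to a sorted
list of their start offsets, hole_for_search is a hypothetical name, and the
sketch assumes no extent covers the search start.

```python
from bisect import bisect_right


def hole_for_search(extent_starts, start, length):
    """Return (hole_start, hole_len) for a search that found no extent.

    extent_starts is a sorted list of file offsets where extents begin.
    """
    end = start + length
    # Move forward one slot: the first extent that begins after start.
    idx = bisect_right(extent_starts, start)
    if idx < len(extent_starts) and extent_starts[idx] < end:
        end = extent_starts[idx]  # stop the hole where the next extent begins
    return (start, end - start)
```

For a file written only at offset 8192, hole_for_search([8192], 0, 16384)
yields a hole covering the first 8192 bytes instead of the whole search area.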

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:58:05 -05:00
Josef Bacik
96192499c2 Btrfs: stop all workers after we free block groups
Stefan was hitting a panic in the async worker stuff because we had outstanding
read bios while we were stopping the worker threads.  You could reproduce this
easily if you mount -o nospace_cache and ran generic/273.  This is because the
caching thread stuff is still going and we were stopping all the worker threads.
We need to stop the workers after this work is done, and the free block groups
code will wait for all the caching threads to stop first so we don't run into
this problem.  With this patch we no longer panic.  Thanks,

Cc: stable@vger.kernel.org
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:57:49 -05:00
Josef Bacik
aaedb55bc0 Btrfs: add tests for btrfs_get_extent
I'm going to be removing hole extents in the near future so I wanted to make a
sanity test for btrfs_get_extent to make sure I don't break anything in the
meantime.  This patch just puts btrfs_get_extent through its paces by giving it
a completely unreasonable mapping to look at and make sure it is giving us back
maps that make sense.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:57:30 -05:00
Josef Bacik
294e30fee3 Btrfs: add tests for find_lock_delalloc_range
So both Liu and I made huge messes of find_lock_delalloc_range trying to fix
stuff, me first by fixing extent size, then him by fixing something I broke and
then me again telling him to fix it a different way.  So this is obviously a
candidate for some testing.  This patch adds a pseudo fs so we can allocate fake
inodes for tests that need an inode or pages.  Then it adds a bunch of tests to
make sure find_lock_delalloc_range is acting the way it is supposed to.  With
this patch and all of our previous patches to find_lock_delalloc_range I am sure
it is working as expected now.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:56:51 -05:00
Josef Bacik
857cc2fc29 Btrfs: free reserved space on error in a few places
While trying to track down a reserved space leak I noticed a few places where we
won't properly clean up reserved space if we have an error, this patch fixes
those up.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:56:41 -05:00
Josef Bacik
0be5dc67c4 Btrfs: fixup reserved trace points
In trying to track down where we were leaking reserved space I noticed our
reserve extent tracepoints are a little off.  First we were saying that the
reserved space had been alloced in btrfs_reserve_extent, which isn't the case,
this needs to be triggered when we actually allocate the space when we run the
delayed ref.  We were also missing a few places where we should have been
tracing the btrfs_reserve_extent_free tracepoint.  With these in place I was
able to put together where we were leaking reserved space.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:56:31 -05:00
Josef Bacik
2b1360da35 Btrfs: free up block groups after everything
If we abort a transaction we will do the tree log cleanup at unmount, but this
happens after we free up the block groups.  This makes all the leak detection
warnings go off because we think we've leaked space but in reality we just
haven't cleaned it up yet.  So instead do the block group cleanup stuff after
free'ing the fs roots so we don't get these warnings.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:56:17 -05:00
Josef Bacik
681ae50917 Btrfs: cleanup reserved space when freeing tree log on error
On error we will wait and free the tree log at unmount without a transaction.
This means that the actual freeing of the blocks doesn't happen which means we
complain about space leaks on unmount.  So to fix this just skip the transaction
specific cleanup part of the tree log free'ing if we don't have a transaction
and that way we can free up our reserved space and our counters stay happy.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:56:11 -05:00
Josef Bacik
eb58bb371a Btrfs: do not free the dirty bytes from the trans block rsv on cleanup
The transactions should be cleaning up their reservations on failure, this just
causes us to have warnings on unmount because we go negative by free'ing
reservations that have already been free'ed.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:55:58 -05:00
Filipe David Borba Manana
80d94fb3df Btrfs: fix memory leaks on transaction commit failure
Structures of the types tree_mod_elem and qgroup_update are allocated
during transaction commit but were not being released if the call to
btrfs_run_delayed_items() returned an error.

Stack trace reported by kmemleak:

unreferenced object 0xffff880679f0b398 (size 128):
  comm "umount", pid 21508, jiffies 4295967793 (age 36718.112s)
  hex dump (first 32 bytes):
    60 b5 f0 79 06 88 ff ff 00 00 00 00 00 00 00 00  `..y............
    00 00 00 00 00 00 00 00 50 1c 00 00 00 00 00 00  ........P.......
  backtrace:
    [<ffffffff81742d26>] kmemleak_alloc+0x26/0x50
    [<ffffffff811889c2>] kmem_cache_alloc_trace+0x112/0x200
    [<ffffffffa046f2d3>] tree_mod_log_insert_key.constprop.45+0x93/0x150 [btrfs]
    [<ffffffffa04720f9>] __btrfs_cow_block+0x299/0x4f0 [btrfs]
    [<ffffffffa0472510>] btrfs_cow_block+0x120/0x1f0 [btrfs]
    [<ffffffffa0476679>] btrfs_search_slot+0x449/0x930 [btrfs]
    [<ffffffffa048eecf>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
    [<ffffffffa04eb49c>] __btrfs_update_delayed_inode+0x1c/0x1d0 [btrfs]
    [<ffffffffa04eb9e2>] __btrfs_run_delayed_items+0x162/0x1e0 [btrfs]
    [<ffffffffa04eba63>] btrfs_delayed_inode_exit+0x3/0x20 [btrfs]
    [<ffffffffa0499c63>] btrfs_commit_transaction+0x203/0xa50 [btrfs]
    [<ffffffffa046b519>] btrfs_sync_fs+0x69/0x110 [btrfs]
    [<ffffffff811cb210>] __sync_filesystem+0x30/0x60
    [<ffffffff811cb2bb>] sync_filesystem+0x4b/0x70
    [<ffffffff8119ce7b>] generic_shutdown_super+0x3b/0xf0
    [<ffffffff8119cfc6>] kill_anon_super+0x16/0x30
unreferenced object 0xffff880677e0dd88 (size 32):
  comm "umount", pid 21508, jiffies 4295967793 (age 36718.112s)
  hex dump (first 32 bytes):
    78 75 11 a9 06 88 ff ff 00 c0 e0 77 06 88 ff ff  xu.........w....
    40 c3 a2 70 06 88 ff ff 00 00 00 00 00 00 00 00  @..p............
  backtrace:
    [<ffffffff81742d26>] kmemleak_alloc+0x26/0x50
    [<ffffffff811889c2>] kmem_cache_alloc_trace+0x112/0x200
    [<ffffffffa04fa54f>] btrfs_qgroup_record_ref+0xf/0x90 [btrfs]
    [<ffffffffa04e1914>] btrfs_add_delayed_tree_ref+0xf4/0x170 [btrfs]
    [<ffffffffa048518a>] btrfs_free_tree_block+0x9a/0x220 [btrfs]
    [<ffffffffa0472163>] __btrfs_cow_block+0x303/0x4f0 [btrfs]
    [<ffffffffa0472510>] btrfs_cow_block+0x120/0x1f0 [btrfs]
    [<ffffffffa0476679>] btrfs_search_slot+0x449/0x930 [btrfs]
    [<ffffffffa048eecf>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
    [<ffffffffa04eb49c>] __btrfs_update_delayed_inode+0x1c/0x1d0 [btrfs]
    [<ffffffffa04eb9e2>] __btrfs_run_delayed_items+0x162/0x1e0 [btrfs]
    [<ffffffffa04eba63>] btrfs_delayed_inode_exit+0x3/0x20 [btrfs]
    [<ffffffffa0499c63>] btrfs_commit_transaction+0x203/0xa50 [btrfs]
    [<ffffffffa046b519>] btrfs_sync_fs+0x69/0x110 [btrfs]
    [<ffffffff811cb210>] __sync_filesystem+0x30/0x60
    [<ffffffff811cb2bb>] sync_filesystem+0x4b/0x70

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:55:46 -05:00
Ilya Dryomov
539f358a30 Btrfs: fix the dev-replace suspend sequence
Replace progresses strictly from lower to higher offsets, and the
progress is tracked in chunks, by storing the physical offset of the
dev_extent which is being copied in the cursor_left field of
btrfs_dev_replace_item.  When we are done copying the chunk,
cursor_left is updated to point one byte past the dev_extent, so that
on resume we can skip the dev_extents that have already been copied.

There is a major bug (which goes all the way back to the inception of
dev-replace in 3.8) in the way cursor_left is bumped: the bump is done
unconditionally, without any regard to the scrub_chunk return value.
On suspend (and also on any kind of error) scrub_chunk returns early,
i.e. without completing the copy.  This leads to us skipping the chunk
that hasn't been fully copied yet when resuming.

Fix this by doing the cursor_left update only if scrub_chunk ret is 0.
(On suspend scrub_chunk returns with -ECANCELED, so this fix covers
both suspend and error cases.)
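
The fixed sequence can be modeled roughly like this. It is a user-space
sketch under stated assumptions: replace_pass and copy_chunk are hypothetical
names, dev extents are (physical_start, length) pairs, and copy_chunk stands
in for scrub_chunk by returning 0 or a negative errno.

```python
import errno


def replace_pass(dev_extents, cursor_left, copy_chunk):
    """Copy dev extents in ascending order, resuming from cursor_left.

    Returns (ret, cursor_left). The cursor only advances past a dev extent
    when copy_chunk reported success for it.
    """
    for start, length in dev_extents:
        if start + length <= cursor_left:
            continue  # already copied before the suspend
        ret = copy_chunk(start, length)
        if ret:
            return ret, cursor_left  # suspend or error: leave the cursor alone
        cursor_left = start + length  # one byte past the copied dev extent
    return 0, cursor_left
```

If the cursor were bumped unconditionally, a pass interrupted with
-ECANCELED in the middle of a chunk would return a cursor past that chunk,
and the resumed pass would skip it.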

Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:55:36 -05:00
Filipe David Borba Manana
778ba82b17 Btrfs: improve inode hash function/inode lookup
Currently the hash value used for adding an inode to the VFS's inode
hash table consists of the plain inode number, which is a 64 bits
integer. This results in hash table buckets (hlist_head lists) with
too many elements for at least 2 important scenarios:

1) When we have many subvolumes. Each subvolume has its own btree
   where its files and directories are added to, and each has its
   own objectid (inode number) namespace. This means that if we have
   N subvolumes, and all have inode number X associated to a file or
   directory, the corresponding inodes all map to the same hash table
   entry, resulting in a bucket (hlist_head list) with N elements;

2) On 32 bits machines. The VFS hash values are unsigned longs, which
   are 32 bits wide on 32 bits machines, and the inode (objectid)
   numbers are 64 bits unsigned integers. We simply cast the inode
   numbers to hash values, which means that for all inodes with the
   same 32 bits lower half, the same hash bucket is used for all of
   them. For example, all inodes whose number (objectid) has 0xffff_ffff
   as its lower 32 bits, from 0x0000_0000_ffff_ffff up to
   0xffff_ffff_ffff_ffff, will end up in the same hash table bucket.

This change ensures the inode's hash value depends both on the
objectid (inode number) and its subvolume's (btree root) objectid.
For 32 bits machines, this change gives better entropy by making
the hash value depend on both the upper and lower 32 bits of the
64 bits hash previously computed.
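
A rough user-space sketch of a hash with these two properties. The function
names and the multiplier MIX are illustrative assumptions, not the constant or
code used by the patch; only the overall shape (mix in the root objectid, then
fold the upper half into the lower half on 32 bits) follows the description
above.

```python
MASK64 = (1 << 64) - 1
# Illustrative odd 64-bit multiplier; an assumption of this sketch, not
# necessarily the constant the kernel uses.
MIX = 0x9E3779B97F4A7C15


def old_hash(ino, bits_per_long=64):
    """Previous behaviour: the inode number simply cast to unsigned long."""
    return ino & ((1 << bits_per_long) - 1)


def inode_hash(ino, root_objectid, bits_per_long=64):
    """Hash that depends on both the inode number and the subvolume root."""
    h = (ino ^ (root_objectid * MIX)) & MASK64
    if bits_per_long == 32:
        # Fold the upper 32 bits into the lower 32 bits.
        h = (h >> 32) ^ (h & 0xFFFFFFFF)
    return h
```

In this model the same inode number in two subvolumes no longer collides, and
on 32 bits two inode numbers that share their lower half can still land in
different buckets.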

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:55:19 -05:00
Filipe David Borba Manana
3d41d70252 Btrfs: remove unnecessary tree search when logging inode
In tree-log.c:btrfs_log_inode(), we keep calling btrfs_search_forward()
until it returns a key whose objectid is higher than our inode or until
the key's type is higher than our maximum allowed type.

At the end of the loop, we increment our minimum search key's objectid
and type regardless of our desired target objectid and maximum desired
type, which causes another loop iteration that will call again
btrfs_search_forward() just to figure out we've gone beyond our maximum
key and exit the loop. Therefore while incrementing our minimum key,
don't do it blindly and exit the loop immediately if the next search
key's objectid or type is beyond what we seek.

Also after incrementing the type, set the key's offset to 0, which was
missing and could make us lose some of the inode's items.
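
The key advance described above can be sketched as a small helper. This is a
user-space model with hypothetical names (next_min_key, max_type), with keys
as (objectid, type, offset) tuples; it is not the code from tree-log.c.

```python
U64_MAX = (1 << 64) - 1  # (u64)-1


def next_min_key(key, max_type):
    """Return the next minimum search key, or None when the loop should stop."""
    objectid, key_type, offset = key
    if offset < U64_MAX:
        return (objectid, key_type, offset + 1)
    if key_type < max_type:
        # Moving to the next type restarts at offset 0 so no item is skipped.
        return (objectid, key_type + 1, 0)
    # Going further would leave our inode or exceed the maximum wanted type,
    # so stop here instead of issuing one more search just to find that out.
    return None
```

Returning None corresponds to breaking out of the loop right away, which
saves the extra btrfs_search_forward() call.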

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:55:11 -05:00
Filipe David Borba Manana
6174d3cb43 Btrfs: remove unused max_key arg from btrfs_search_forward
It is not used for anything.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:54:57 -05:00
Liu Bo
7d3d1744f8 Btrfs: fix memory leak of chunks' extent map
Since we take a ref when looking up the extent map, we need to drop the ref
before returning to callers.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:54:48 -05:00
Miao Xie
fa7c14947a Btrfs: improve jitter performance of the sequential buffered write
The performance was slowed down sometimes when we ran sysbench to measure
the performance of the sequential buffered write by 2 or more threads.

It was because the write order of the test threads might be shuffled
by the task scheduler, so an incoming write could land beyond the end of
the file. In this case we need to insert dummy file extents and create
a hole for the area we skip. To avoid racing with ordered extents that
are still in flight in that area, we need to wait for them. Unfortunately,
the current code doesn't check whether there are ordered extents in the
area; it tries to find and flush the dirty pages directly. In fact there
is no dirty page in that area, so this step is unnecessary and just wastes
time. Sometimes it also increases the contention on some locks and makes
the performance drop suddenly.

So we remove the ordered extent flush function before the check, and flush
the dirty pages and wait for the ordered extents only when we find them.
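
The "check first, flush only on a hit" ordering can be sketched like this.
It is a Python model under stated assumptions: ordered extents are
represented as inclusive (start, end) byte ranges, and prepare_range() and
flush_and_wait are hypothetical names, not functions from the patch.

```python
def overlapping_ordered_extents(ordered, start, end):
    """Return the ordered extents that intersect [start, end] (inclusive)."""
    return [oe for oe in ordered if oe[0] <= end and start <= oe[1]]


def prepare_range(ordered, start, end, flush_and_wait):
    """Flush and wait only when an ordered extent overlaps the range.

    Returns the number of overlapping ordered extents that were found.
    """
    hits = overlapping_ordered_extents(ordered, start, end)
    if hits:
        # Only pay for the flush when there is something to wait for.
        flush_and_wait(start, end)
    return len(hits)
```

With no overlapping ordered extents the callback is never invoked, which is
the common case the patch description is trying to make cheap.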

In my tests, the performance regression showed up once or twice in 10 runs
of the test before applying this patch. After applying this patch, the
regression went away.

Test Environment:
 CPU:		1CPU * 4Cores
 Memory:	6GB
 Partition:	20GB

Test Command:
 # sysbench --test=fileio --file-total-size=16G --file-test-mode=seqwr \
 > --num-threads=512 --file-block-size=16384 --max-time=60 --max-requests=0 run

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:54:38 -05:00
Miao Xie
20dd2cbf01 Btrfs: fix BUG_ON() caused by the reserved space migration
When we did space balance and snapshot creation at the same time, we might
meet the following oops:
 kernel BUG at fs/btrfs/inode.c:3038!
 [SNIP]
 Call Trace:
 [<ffffffffa0411ec7>] btrfs_orphan_cleanup+0x293/0x407 [btrfs]
 [<ffffffffa042dc45>] btrfs_mksubvol.isra.28+0x259/0x373 [btrfs]
 [<ffffffffa042de85>] btrfs_ioctl_snap_create_transid+0x126/0x156 [btrfs]
 [<ffffffffa042dff1>] btrfs_ioctl_snap_create_v2+0xd0/0x121 [btrfs]
 [<ffffffffa0430b2c>] btrfs_ioctl+0x414/0x1854 [btrfs]
 [<ffffffff813b60b7>] ? __do_page_fault+0x305/0x379
 [<ffffffff811215a9>] vfs_ioctl+0x1d/0x39
 [<ffffffff81121d7c>] do_vfs_ioctl+0x32d/0x3e2
 [<ffffffff81057fe7>] ? finish_task_switch+0x80/0xb8
 [<ffffffff81121e88>] SyS_ioctl+0x57/0x83
 [<ffffffff813b39ff>] ? do_device_not_available+0x12/0x14
 [<ffffffff813b99c2>] system_call_fastpath+0x16/0x1b
 [SNIP]
 RIP  [<ffffffffa040da40>] btrfs_orphan_add+0xc3/0x126 [btrfs]

The cause of the problem is that the relocation root creation stole
the reserved space, which was reserved for orphan item deletion.

There are several ways to fix this problem. One is to increase
the reserved space size of the space balance, and then use
that space to create the relocation tree for each fs/file tree.
But it is hard to calculate a suitable size because we don't
know how many fs/file trees we need to relocate.

We fixed this problem by actively reserving the space for relocation root
creation, since the space it needs is very small (one tree block, used for
the root node copy), and then using that reserved space to create the
relocation tree. If we don't reserve space for relocation tree creation,
we will use the reserved space of the balance.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:54:28 -05:00
Ross Kirk
0a4e558609 btrfs: remove unused parameter from btrfs_header_fsid
Remove unused parameter, 'eb'. Unused since introduction in
5f39d397df

Updated to be rebased against current upstream and correct diff supplied this time!

Signed-off-by: Ross Kirk <ross.kirk@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:54:16 -05:00
Josef Bacik
724e2315db Btrfs: fix two use-after-free bugs with transaction cleanup
I was noticing the slab redzone stuff going off every once in a while during
transaction aborts.  This was caused by two things:

1) We would walk the pending snapshots and set their error to -ECANCELED.  We
don't need to do this, the snapshot stuff waits for a transaction commit and if
there is a problem we just free our pending snapshot object and exit.  Doing
this was causing us to touch the pending snapshot object after the thing had
already been freed.

2) We were freeing the transaction manually with wanton disregard for its
use_count reference counter.  To fix this I cleaned up the transaction freeing
loop to either wait for the transaction commit to finish if it was in the middle
of that (since it will be cleaned and freed up there) or to do the cleanup
ourselves.
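
The second point comes down to respecting a reference count. Here is a
minimal Python model of that rule; the Transaction class and the get/put
helpers are hypothetical illustrations, not the btrfs structures.

```python
class Transaction:
    def __init__(self):
        self.use_count = 1  # reference held by the transaction list
        self.freed = False


def get_transaction(trans):
    """Take an extra reference, e.g. for a task waiting on the commit."""
    trans.use_count += 1
    return trans


def put_transaction(trans):
    """Drop one reference; free only when the last holder lets go."""
    if trans.use_count <= 0:
        raise RuntimeError("put on a transaction with no references")
    trans.use_count -= 1
    if trans.use_count == 0:
        trans.freed = True  # stands in for the actual free
```

In this model the cleanup path calls put_transaction() instead of freeing
directly, so a waiter that still holds a reference never touches freed
memory.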

I also moved the global "kill all things dirty everywhere" stuff outside of the
transaction cleanup loop since that only needs to be done once.  With this patch
I'm no longer seeing slab corruption because of use after frees.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:54:03 -05:00
Josef Bacik
c16ce19014 Btrfs: remove all BUG_ON()'s from commit_cowonly_roots
Noticed this when forcing errors to happen during delayed ref running.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:53:57 -05:00
Josef Bacik
1de2cfde93 Btrfs: don't delete ordered roots from list during cleanup
During transaction cleanup after an abort we are just removing roots from the
ordered roots list which is incorrect.  We have a BUG_ON() to make sure that the
root is still part of the ordered roots list when we put our ordered extent
which we were tripping in this case.  So do like we do everywhere else and just
move it to the tail of the ordered roots list and allow the normal cleanup to
take care of stuff.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:53:49 -05:00
Josef Bacik
4e121c06ad Btrfs: cleanup transaction on abort
If we abort outside of a transaction commit we won't clean up anything until
we unmount.  Unfortunately if we abort in the middle of writing out an ordered
extent we won't clean it up and if somebody is waiting on that ordered extent
they will wait forever.  To fix this just make the transaction kthread call the
cleanup transaction stuff if it notices there's an error, and make
btrfs_end_transaction wake up the transaction kthread if there is an error.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:53:42 -05:00
Josef Bacik
b6d08f0630 Btrfs: do not release metadata for space cache inodes
I've been testing our error paths and I was tripping the BUG_ON() in
drop_outstanding_extent because our outstanding_extents is 0 for space cache
inodes.  This is because we don't reserve metadata space for these inodes since
we depend on the global block reserve for our space.  To fix this we need to
make sure the DO_ACCOUNTING stuff doesn't actually call release_metadata for
space cache inodes.  With this patch I'm no longer panicking.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-11 21:53:36 -05:00