Commit Graph

1244 Commits

Author SHA1 Message Date
Chris Mason
6374e57ad8 btrfs: fix integer overflow in calc_reclaim_items_nr
Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
v4.12-rc6.  This was because commit 70e7af244 made it possible for
calc_reclaim_items_nr() to return a negative number.  It's not really a
bug in that commit, it just didn't go far enough down the stack to find
all the possible 64->32 bit overflows.

This switches calc_reclaim_items_nr() to return a u64 and changes everyone
that uses the results of that math to u64 as well.

Reported-by: Dave Jones <davej@codemonkey.org.uk>
Fixes: 70e7af2 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Qu Wenruo
bc42bda223 btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
[BUG]
For the following case, btrfs can underflow qgroup reserved space
at an error path:
(Page size 4K, function name without "btrfs_" prefix)

         Task A                  |             Task B
----------------------------------------------------------------------
Buffered_write [0, 2K)           |
|- check_data_free_space()       |
|  |- qgroup_reserve_data()      |
|     Range aligned to page      |
|     range [0, 4K)          <<< |
|     4K bytes reserved      <<< |
|- copy pages to page cache      |
                                 | Buffered_write [2K, 4K)
                                 | |- check_data_free_space()
                                 | |  |- qgroup_reserved_data()
                                 | |     Range alinged to page
                                 | |     range [0, 4K)
                                 | |     Already reserved by A <<<
                                 | |     0 bytes reserved      <<<
                                 | |- delalloc_reserve_metadata()
                                 | |  And it *FAILED* (Maybe EQUOTA)
                                 | |- free_reserved_data_space()
                                      |- qgroup_free_data()
                                         Range aligned to page range
                                         [0, 4K)
                                         Freeing 4K
(Special thanks to Chandan for the detailed report and analyse)

[CAUSE]
Above Task B is freeing reserved data range [0, 4K) which is actually
reserved by Task A.

And at writeback time, page dirty by Task A will go through writeback
routine, which will free 4K reserved data space at file extent insert
time, causing the qgroup underflow.

[FIX]
For btrfs_qgroup_free_data(), add @reserved parameter to only free
data ranges reserved by previous btrfs_qgroup_reserve_data().
So in above case, Task B will try to free 0 byte, so no underflow.

Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Qu Wenruo
364ecf3651 btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
Introduce a new parameter, struct extent_changeset for
btrfs_qgroup_reserved_data() and its callers.

Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
which range it reserved in current reserve, so it can free it in error
paths.

The reason we need to export it to callers is, at buffered write error
path, without knowing what exactly which range we reserved in current
allocation, we can free space which is not reserved by us.

This will lead to qgroup reserved space underflow.

Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Qu Wenruo
7bc329c183 btrfs: qgroup: Return actually freed bytes for qgroup release or free data
btrfs_qgroup_release/free_data() only returns 0 or a negative error
number (ENOMEM is the only possible error).

This is normally good enough, but sometimes we need the exact byte
count it freed/released.

Change it to return actually released/freed bytenr number instead of 0
for success.
And slightly modify related extent_changeset structure, since in btrfs
one no-hole data extent won't be larger than 128M, so "unsigned int"
is large enough for the use case.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
Omar Sandoval
d7eae3403f Btrfs: rework delayed ref total_bytes_pinned accounting
The total_bytes_pinned counter is completely broken when accounting
delayed refs:

- If two drops for the same extent are merged, we will decrement
  total_bytes_pinned twice but only increment it once.
- If an add is merged into a drop or vice versa, we will decrement the
  total_bytes_pinned counter but never increment it.
- If multiple references to an extent are dropped, we will account it
  multiple times, potentially vastly over-estimating the number of bytes
  that will be freed by a commit and doing unnecessary work when we're
  close to ENOSPC.

The last issue is relatively minor, but the first two make the
total_bytes_pinned counter leak or underflow very often. These
accounting issues were introduced in b150a4f10d ("Btrfs: use a percpu
to keep track of possibly pinned bytes"), but they were papered over by
zeroing out the counter on every commit until d288db5dc0 ("Btrfs: fix
race of using total_bytes_pinned").

We need to make sure that an extent is accounted as pinned exactly once
if and only if we will drop references to it when when the transaction
is committed. Ideally we would only add to total_bytes_pinned when the
*last* reference is dropped, but this information isn't readily
available for data extents. Again, this over-estimation can lead to
extra commits when we're close to ENOSPC, but it's not as bad as before.

The fix implemented here is to increment total_bytes_pinned when the
total refmod count for an extent goes negative and decrement it if the
refmod count goes back to non-negative or after we've run all of the
delayed refs for that extent.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:01 +02:00
Omar Sandoval
7be07912b3 Btrfs: return old and new total ref mods when adding delayed refs
We need this to decide when to account pinned bytes.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:01 +02:00
Omar Sandoval
0a16c7d7ae Btrfs: always account pinned bytes when dropping a tree block ref
Currently, we only increment total_bytes_pinned in
btrfs_free_tree_block() when dropping the last reference on the block.
However, when the delayed ref is run later, we will decrement
total_bytes_pinned regardless of whether it was the last reference or
not. This causes the counter to underflow when the reference we dropped
was not the last reference. Fix it by incrementing the counter
unconditionally, which is what btrfs_free_extent() does. This makes
total_bytes_pinned an overestimate when references to shared extents are
dropped, but in the worst case this will just make us try to commit the
transaction to try to free up space and find we didn't free enough.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:01 +02:00
Omar Sandoval
4da8b76d34 Btrfs: update total_bytes_pinned when pinning down extents
The extents marked in pin_down_extent() will be unpinned later in
unpin_extent_range(), which decrements total_bytes_pinned.
pin_down_extent() must increment the counter to avoid underflowing it.
Also adjust btrfs_free_tree_block() to avoid accounting for the same
extent twice.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:01 +02:00
Omar Sandoval
55e8196a57 Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()
The value of flags is one of DATA/METADATA/SYSTEM, they must exist at
when add_pinned_bytes is called.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ added changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:01 +02:00
Omar Sandoval
0d9f824df3 Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
There are a few places where we pass in a negative num_bytes, so make it
signed for clarity. Also move it up in the file since later patches will
need it there.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:01 +02:00
Nikolay Borisov
0eee8a494e btrfs: Use btrfs_space_info_used instead of opencoding it
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
Nikolay Borisov
d2006e6d28 btrfs: Refactor update_space_info
Following the factoring out of the creation code udpate_space_info can
only be called for already-existing space_info structs. As such it
cannot fail.  Remove superfluous error handling and make the function
return void.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Nikolay Borisov
2be12ef79f btrfs: Separate space_info create/update
Currently the struct space_info creation code is intermixed in the
udpate_space_info function. There are well-defined points at which the
we actually want to create brand-new space_info structs (e.g. during
mount of the filesystem as well as sometimes when adding/initialising
new chunks). In such cases update_space_info is called with 0 as the
bytes parameter. All of this makes for spaghetti code.

Fix it by factoring out the creation code in a separate
create_space_info structure. This also allows to simplify the internals.
Also remove BUG_ON from do_alloc_chunk since the callers handle errors.
Furthermore it will make the update_space_info function not fail,
allowing us to remove error handling in callers. This will come in a
follow up patch.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Liu Bo
28785f70ef Btrfs: skip commit transaction if we don't have enough pinned bytes
We commit transaction in order to reclaim space from pinned bytes because
it could process delayed refs, and in may_commit_transaction(), we check
first if pinned bytes are enough for the required space, we then check if
that plus bytes reserved for delayed insert are enough for the required
space.

This changes the code to the above logic.

Fixes: b150a4f10d ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reported-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Jeff Mahoney
c1c4919b11 btrfs: remove root usage from can_overcommit
can_overcommit using the root to determine the allocation profile
is the only use of a root in the call graph below reserve_metadata_bytes.

It turns out that we only need to know whether the allocation is for
the chunk root or not -- and we can pass that around as a bool instead.

This allows us to pull root usage out of the reservation path all the
way up to reserve_metadata_bytes itself, which uses it only to compare
against fs_info->chunk_root to set the bool.  In turn, this eliminates
a bunch of races where we use a particular root too early in the mount
process.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Jeff Mahoney
1b86826d12 btrfs: cleanup root usage by btrfs_get_alloc_profile
There are two places where we don't already know what kind of alloc
profile we need before calling btrfs_get_alloc_profile, but we need
access to a root everywhere we call it.

This patch adds helpers for btrfs_{data,metadata,system}_alloc_profile()
and relegates btrfs_system_alloc_profile to a static for use in those
two cases.  The next patch will eliminate one of those.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Nikolay Borisov
a5ed45f822 btrfs: Convert fs_info->free_chunk_space to atomic64_t
The ->free_chunk_space variable is used to track the unallocated space
and access to it is protected by a spinlock, which is not used for
anything else.  Make the code a bit self-explanatory by switching the
variable to an atomic64_t type and kill the spinlock.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ not a performance critical code, use of atomic type is ok ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Jeff Mahoney
a9b3311ef3 btrfs: fix race with relocation recovery and fs_root setup
If we have to recover relocation during mount, we'll ultimately have to
evict the orphan inode.  That goes through the reservation dance, where
priority_reclaim_metadata_space and flush_space expect fs_info->fs_root
to be valid.  That's the next thing to be set up during mount, so we
crash, almost always in flush_space trying to join the transaction
but priority_reclaim_metadata_space is possible as well.  This call
path has been problematic in the past WRT whether ->fs_root is valid
yet.  Commit 957780eb27 (Btrfs: introduce ticketed enospc
infrastructure) added new users that are called in the direct path
instead of the async path that had already been worked around.

The thing is that we don't actually need the fs_root, specifically, for
anything.  We either use it to determine whether the root is the
chunk_root for use in choosing an allocation profile or as a root to pass
btrfs_join_transaction before immediately committing it.  Anything that
isn't the chunk root works in the former case and any root works in
the latter.

A simple fix is to use a root we know will always be there: the
extent_root.

Cc: <stable@vger.kernel.org> # v4.8+
Fixes: 957780eb27 (Btrfs: introduce ticketed enospc infrastructure)
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-01 16:56:55 +02:00
Jeff Mahoney
896533a7da btrfs: fix memory leak in update_space_info failure path
If we fail to add the space_info kobject, we'll leak the memory
for the percpu counter.

Fixes: 6ab0a2029c (btrfs: publish allocation data in sysfs)
Cc: <stable@vger.kernel.org> # v3.14+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-01 16:56:31 +02:00
Qu Wenruo
0966a7b130 btrfs: scrub: Introduce full stripe lock for RAID56
Unlike mirror based profiles, RAID5/6 recovery needs to read out the
whole full stripe.

And if we don't do proper protection, it can easily cause race condition.

Introduce 2 new functions: lock_full_stripe() and unlock_full_stripe()
for RAID5/6.
Which store a rb_tree of mutexes for full stripes, so scrub callers can
use them to lock a full stripe to avoid race.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:27 +02:00
David Sterba
f486135eba btrfs: remove unused qgroup members from btrfs_trans_handle
The members have been effectively unused since "Btrfs: rework qgroup
accounting" (fcebe4562d), there's no substitute for
assert_qgroups_uptodate so it's removed as well.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Liu Bo
1a79c1f246 Btrfs: update comments in cache_save_setup
We also don't bother to flush free space cache while with free space
tree.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Elena Reshetova
6df8cdf5bd btrfs: convert btrfs_delayed_ref_node.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
1e4f4714d5 btrfs: convert btrfs_caching_control.count from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
9b64f57ddf btrfs: convert btrfs_transaction.use_count from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Linus Torvalds
1827adb11a Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull sched.h split-up from Ingo Molnar:
 "The point of these changes is to significantly reduce the
  <linux/sched.h> header footprint, to speed up the kernel build and to
  have a cleaner header structure.

  After these changes the new <linux/sched.h>'s typical preprocessed
  size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
  lines), which is around 40% faster to build on typical configs.

  Not much changed from the last version (-v2) posted three weeks ago: I
  eliminated quirks, backmerged fixes plus I rebased it to an upstream
  SHA1 from yesterday that includes most changes queued up in -next plus
  all sched.h changes that were pending from Andrew.

  I've re-tested the series both on x86 and on cross-arch defconfigs,
  and did a bisectability test at a number of random points.

  I tried to test as many build configurations as possible, but some
  build breakage is probably still left - but it should be mostly
  limited to architectures that have no cross-compiler binaries
  available on kernel.org, and non-default configurations"

* 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
  sched/headers: Clean up <linux/sched.h>
  sched/headers: Remove #ifdefs from <linux/sched.h>
  sched/headers: Remove the <linux/topology.h> include from <linux/sched.h>
  sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h>
  sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h>
  sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h>
  sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/init.h>
  sched/core: Remove unused prefetch_stack()
  sched/headers: Remove <linux/rculist.h> from <linux/sched.h>
  sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h>
  sched/headers: Remove <linux/signal.h> from <linux/sched.h>
  sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>
  sched/headers: Remove the runqueue_is_locked() prototype
  sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h>
  sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h>
  sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h>
  sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h>
  ...
2017-03-03 10:16:38 -08:00
Ingo Molnar
f361bf4a66 sched/headers: Prepare for the reduction of <linux/sched.h>'s signal API dependency
Instead of including the full <linux/signal.h>, we are going to include the
types-only <linux/signal_types.h> header in <linux/sched.h>, to further
decouple the scheduler header from the signal headers.

This means that various files which relied on the full <linux/signal.h> need
to be updated to gain an explicit dependency on it.

Update the code that relies on sched.h's inclusion of the <linux/signal.h> header.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:37 +01:00
Chris Mason
e9f467d028 Merge branch 'for-chris-4.11-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.11 2017-02-28 14:35:09 -08:00
Nikolay Borisov
73f2e545b6 btrfs: Make btrfs_orphan_add take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:10 +01:00
Nikolay Borisov
691fa05967 btrfs: all btrfs_delalloc_release_metadata take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov
9f3db423f9 btrfs: Make btrfs_delalloc_reserve_metadata take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov
703b391a03 btrfs: Make btrfs_orphan_release_metadata take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov
8ed7a2a0e0 btrfs: Make btrfs_orphan_reserve_metadata take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov
0e6bf9b13c btrfs: Make calc_csum_metadata_size take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov
baa3ba39b9 btrfs: Make drop_outstanding_extent take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:07 +01:00
Nikolay Borisov
04f4f91653 btrfs: make btrfs_alloc_data_chunk_ondemand take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:06 +01:00
Nikolay Borisov
70ddc553b5 btrfs: make btrfs_is_free_space_inode take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:06 +01:00
Filipe Manana
5cdd7db6c5 Btrfs: fix assertion failure when freeing block groups at close_ctree()
At close_ctree() we free the block groups and then only after we wait for
any running worker kthreads to finish and shutdown the workqueues. This
behaviour is racy and it triggers an assertion failure when freeing block
groups because while we are doing it we can have for example a block group
caching kthread running, and in that case the block group's reference
count can still be greater than 1 by the time we assert its reference count
is 1, leading to an assertion failure:

[19041.198004] assertion failed: atomic_read(&block_group->count) == 1, file: fs/btrfs/extent-tree.c, line: 9799
[19041.200584] ------------[ cut here ]------------
[19041.201692] kernel BUG at fs/btrfs/ctree.h:3418!
[19041.202830] invalid opcode: 0000 [#1] PREEMPT SMP
[19041.203929] Modules linked in: btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic ppdev sg psmouse acpi_cpufreq pcspkr parport_pc evdev tpm_tis parport tpm_tis_core i2c_piix4 i2c_core tpm serio_raw processor button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[19041.208082] CPU: 6 PID: 29051 Comm: umount Not tainted 4.9.0-rc7-btrfs-next-36+ #1
[19041.208082] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[19041.208082] task: ffff88015f028980 task.stack: ffffc9000ad34000
[19041.208082] RIP: 0010:[<ffffffffa03e319e>]  [<ffffffffa03e319e>] assfail.constprop.41+0x1c/0x1e [btrfs]
[19041.208082] RSP: 0018:ffffc9000ad37d60  EFLAGS: 00010286
[19041.208082] RAX: 0000000000000061 RBX: ffff88015ecb4000 RCX: 0000000000000001
[19041.208082] RDX: ffff88023f392fb8 RSI: ffffffff817ef7ba RDI: 00000000ffffffff
[19041.208082] RBP: ffffc9000ad37d60 R08: 0000000000000001 R09: 0000000000000000
[19041.208082] R10: ffffc9000ad37cb0 R11: ffffffff82f2b66d R12: ffff88023431d170
[19041.208082] R13: ffff88015ecb40c0 R14: ffff88023431d000 R15: ffff88015ecb4100
[19041.208082] FS:  00007f44f3d42840(0000) GS:ffff88023f380000(0000) knlGS:0000000000000000
[19041.208082] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19041.208082] CR2: 00007f65d623b000 CR3: 00000002166f2000 CR4: 00000000000006e0
[19041.208082] Stack:
[19041.208082]  ffffc9000ad37d98 ffffffffa035989f ffff88015ecb4000 ffff88015ecb5630
[19041.208082]  ffff88014f6be000 0000000000000000 00007ffcf0ba6a10 ffffc9000ad37df8
[19041.208082]  ffffffffa0368cd4 ffff88014e9658e0 ffffc9000ad37e08 ffffffff811a634d
[19041.208082] Call Trace:
[19041.208082]  [<ffffffffa035989f>] btrfs_free_block_groups+0x17f/0x392 [btrfs]
[19041.208082]  [<ffffffffa0368cd4>] close_ctree+0x1c5/0x2e1 [btrfs]
[19041.208082]  [<ffffffff811a634d>] ? evict_inodes+0x132/0x141
[19041.208082]  [<ffffffffa034356d>] btrfs_put_super+0x15/0x17 [btrfs]
[19041.208082]  [<ffffffff8118fc32>] generic_shutdown_super+0x6a/0xeb
[19041.208082]  [<ffffffff8119004f>] kill_anon_super+0x12/0x1c
[19041.208082]  [<ffffffffa0343370>] btrfs_kill_super+0x16/0x21 [btrfs]
[19041.208082]  [<ffffffff8118fad1>] deactivate_locked_super+0x3b/0x68
[19041.208082]  [<ffffffff8118fb34>] deactivate_super+0x36/0x39
[19041.208082]  [<ffffffff811a9946>] cleanup_mnt+0x58/0x76
[19041.208082]  [<ffffffff811a99a2>] __cleanup_mnt+0x12/0x14
[19041.208082]  [<ffffffff81071573>] task_work_run+0x6f/0x95
[19041.208082]  [<ffffffff81001897>] prepare_exit_to_usermode+0xa3/0xc1
[19041.208082]  [<ffffffff81001a23>] syscall_return_slowpath+0x16e/0x1d2
[19041.208082]  [<ffffffff814c607d>] entry_SYSCALL_64_fastpath+0xab/0xad
[19041.208082] Code: c7 ae a0 3e a0 48 89 e5 e8 4e 74 d4 e0 0f 0b 55 89 f1 48 c7 c2 0b a4 3e a0 48 89 fe 48 c7 c7 a4 a6 3e a0 48 89 e5 e8 30 74 d4 e0 <0f> 0b 55 31 d2 48 89 e5 e8 d5 b9 f7 ff 5d c3 48 63 f6 55 31 c9
[19041.208082] RIP  [<ffffffffa03e319e>] assfail.constprop.41+0x1c/0x1e [btrfs]
[19041.208082]  RSP <ffffc9000ad37d60>
[19041.279264] ---[ end trace 23330586f16f064d ]---

This started happening as of kernel 4.8, since commit f3bca8028b
("Btrfs: add ASSERT for block group's memory leak") introduced these
assertions.

So fix this by freeing the block groups only after waiting for all
worker kthreads to complete and shutdown the workqueues.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-02-24 00:38:27 +00:00
Jeff Mahoney
77ab86bf1c btrfs: free-space-cache, clean up unnecessary root arguments
The free space cache APIs accept a root but always use the tree root.

Also, btrfs_truncate_free_space_cache accepts a root AND an inode but
the inode always points to the root anyway, so let's just pass the inode.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:56 +01:00
Jeff Mahoney
5e00f1939f btrfs: convert btrfs_inc_block_group_ro to accept fs_info
btrfs_inc_block_group_ro is either passed the extent root or the dev
root, but it doesn't do anything with the dev tree.  Let's convert
to passing an fs_info and using the extent root.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:56 +01:00
Jeff Mahoney
0c9ab349c2 btrfs: flush_space always takes fs_info->fs_root
We don't need to pass a root to flush_space since it always uses
the fs_root.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:55 +01:00
Jeff Mahoney
87bde3cdfc btrfs: pass fs_info to (more) routines that are only called with extent_root
Outside of interactions with qgroups, the roots passed in extent-tree.c
are usually passed to ensure that we don't do refcounts on log trees or
to get the allocation profile for an allocation request.  Otherwise, it
operates on the extent root.  This patch converts some more routines in
extent-tree.c that are always called with the extent root to accept
an fs_info instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:55 +01:00
David Sterba
8b74c03e3c btrfs: remove unused parameter from btrfs_prepare_extent_commit
Added but never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:52 +01:00
David Sterba
7775c8184e btrfs: remove unused parameter from btrfs_subvolume_release_metadata
Unused since qgroup refactoring that split data and metadata accounting,
the btrfs_qgroup_free helper.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:52 +01:00
David Sterba
7c302b49dd btrfs: remove unused parameter from clean_tree_block
Added but never needed.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:51 +01:00
Liu Bo
4136135b08 Btrfs: use helper to get used bytes of space_info
This uses a helper instead of open code around used byte of space_info
everywhere.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:48 +01:00
Liu Bo
0c9b36e0d7 Btrfs: try to avoid acquiring free space ctl's lock
We don't need to take the lock if the block group has not been cached.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:48 +01:00
Liu Bo
e4c3b2dcd1 Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist
run_delalloc_nocow has used trans in two places where they don't
actually need @trans.

For btrfs_lookup_file_extent, we search for file extents without COWing
anything, and for btrfs_cross_ref_exist, the only place where we need
@trans is deferencing it in order to get running_transaction which we
could easily get from the global fs_info.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:51:00 +01:00
Liu Bo
f72ad18e99 Btrfs: pass delayed_refs directly to btrfs_find_delayed_ref_head
All we need is @delayed_refs, all callers have get it ahead of calling
btrfs_find_delayed_ref_head since lock needs to be acquired firstly,
there is no reason to deference it again inside the function.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:59 +01:00
Jeff Mahoney
003d7c59e8 btrfs: allow unlink to exceed subvolume quota
Once a qgroup limit is exceeded, it's impossible to restore normal
operation to the subvolume without modifying the limit or removing
the subvolume.  This is a surprising situation for many users used
to the typical workflow with quotas on other file systems where it's
possible to remove files until the used space is back under the limit.

When we go to unlink a file and start the transaction, we'll hit
the qgroup limit while trying to reserve space for the items we'll
modify while removing the file.  We discussed last month how best
to handle this situation and agreed that there is no perfect solution.
The best principle-of-least-surprise solution is to handle it similarly
to how we already handle ENOSPC when unlinking, which is to allow
the operation to succeed with the expectation that it will ultimately
release space under most circumstances.

This patch modifies the transaction start path to select whether to
honor the qgroups limits.  btrfs_start_transaction_fallback_global_rsv
is the only caller that skips enforcement.  The reservation and tracking
still happens normally -- it just skips the enforcement step.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:59 +01:00
Omar Sandoval
310712b2f7 Btrfs: constify struct btrfs_{,disk_}key wherever possible
In a lot of places, it's unclear when it's safe to reuse a struct
btrfs_key after it has been passed to a helper function. Constify these
arguments wherever possible to make it obvious.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:58 +01:00
David Sterba
f85b7379cd btrfs: fix over-80 lines introduced by previous cleanups
This goes as a separate patch because fixing that inside the patches
caused too many many conflicts.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:57 +01:00
Nikolay Borisov
4a0cc7ca6c btrfs: Make btrfs_ino take a struct btrfs_inode
Currently btrfs_ino takes a struct inode and this causes a lot of
internal btrfs functions which consume this ino to take a VFS inode,
rather than btrfs' own struct btrfs_inode. In order to fix this "leak"
of VFS structs into the internals of btrfs first it's necessary to
eliminate all uses of struct inode for the purpose of inode. This patch
does that by using BTRFS_I to convert an inode to btrfs_inode. With
this problem eliminated subsequent patches will start eliminating the
passing of struct inode altogether, eventually resulting in a lot cleaner
code.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
[ fix btrfs_get_extent tracepoint prototype ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:51 +01:00
David Sterba
823bb20ab4 btrfs: add wrapper for counting BTRFS_MAX_EXTENT_SIZE
The expression is open-coded in several places, this asks for a wrapper.
As we know the MAX_EXTENT fits to u32, we can use the appropirate
division helper. This cascades to the result type updates.

Compiler is clever enough to use shift instead of integer division, so
there's no change in the generated assembly.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:51 +01:00
Jeff Mahoney
fef394f75b btrfs: drop unused extent_op arg from btrfs_add_delayed_data_ref
btrfs_add_delayed_data_ref is always called with a NULL extent_op,
so let's drop the argument.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:50 +01:00
Liu Bo
e321f8a801 Btrfs: use down_read_nested to make lockdep silent
If @block_group is not @used_bg, it'll try to get @used_bg's lock without
droping @block_group 's lock and lockdep has throwed a scary deadlock warning
about it.
Fix it by using down_read_nested.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-01-03 15:19:17 +01:00
Jeff Mahoney
d028099643 btrfs: fix locking when we put back a delayed ref that's too new
In __btrfs_run_delayed_refs, when we put back a delayed ref that's too
new, we have already dropped the lock on locked_ref when we set
->processing = 0.

This patch keeps the lock to cover that assignment.

Fixes: d7df2c796d (Btrfs: attach delayed ref updates to delayed ref heads)
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-01-03 15:14:21 +01:00
Jeff Mahoney
aa7c8da35d btrfs: fix error handling when run_delayed_extent_op fails
In __btrfs_run_delayed_refs, the error path when run_delayed_extent_op
fails sets locked_ref->processing = 0 but doesn't re-increment
delayed_refs->num_heads_ready.  As a result, we end up triggering
the WARN_ON in btrfs_select_ref_head.

Fixes: d7df2c796d (Btrfs: attach delayed ref updates to delayed ref heads)
Reported-by: Jon Nelson <jnelson-suse@jamponi.net>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-01-03 15:14:08 +01:00
David Sterba
34441361c4 btrfs: opencode chunk locking, remove helpers
The helpers are trivial and we don't use them consistently.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:07:00 +01:00
Jeff Mahoney
3a45bb207e btrfs: remove root parameter from transaction commit/end routines
Now we only use the root parameter to print the root objectid in
a tracepoint.  We can use the root parameter from the transaction
handle for that.  It's also used to join the transaction with
async commits, so we remove the comment that it's just for checking.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:07:00 +01:00
Jeff Mahoney
2ff7e61e0d btrfs: take an fs_info directly when the root is not used otherwise
There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer.  Let's convert those to
just accept an fs_info pointer directly.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:59 +01:00
Jeff Mahoney
afdb571890 btrfs: simplify btrfs_wait_cache_io prototype
With the exception of the one case where btrfs_wait_cache_io is called
without a block group, it's called with the same arguments.  The root
argument is only used in the special case, so let's factor out the core
and simplify the call in the normal case to require a trans, block group,
and path.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:59 +01:00
Jeff Mahoney
71ff6437c2 btrfs: convert extent-tree tracepoints to use fs_info
The extent-tree tracepoints all operate on the extent root, regardless of
which root is passed in.  Let's just use the extent root objectid instead.
If it turns out that nobody is depending on the format of this tracepoint,
we can drop the root printing entirely.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:59 +01:00
Jeff Mahoney
0b246afa62 btrfs: root->fs_info cleanup, add fs_info convenience variables
In routines where someptr->fs_info is referenced multiple times, we
introduce a convenience variable.  This makes the code considerably
more readable.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:59 +01:00
Jeff Mahoney
6202df6921 btrfs: root->fs_info cleanup, update_block_group{,flags}
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:58 +01:00
Jeff Mahoney
3796d33535 btrfs: root->fs_info cleanup, lock/unlock_chunks
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:58 +01:00
Jeff Mahoney
27965b6c2c btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:58 +01:00
Jeff Mahoney
da17066c40 btrfs: pull node/sector/stripe sizes out of root and into fs_info
We track the node sizes per-root, but they never vary from the values
in the superblock.  This patch messes with the 80-column style a bit,
but subsequent patches to factor out root->fs_info into a convenience
variable fix it up again.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:58 +01:00
Jeff Mahoney
fb456252d3 btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:58 +01:00
Jeff Mahoney
2b2e27eb92 btrfs: alloc_reserved_file_extent trace point should use extent_root
Even though a separate root is passed in, we're still operating on the
extent root.  Let's use that for the trace point.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:57 +01:00
Jeff Mahoney
6bccf3ab1e btrfs: call functions that always use the same root with fs_info instead
There are many functions that are always called with the same root
argument.  Rather than passing the same root every time, we can
pass an fs_info pointer instead and have the function get the root
pointer itself.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:57 +01:00
Jeff Mahoney
5b4aacefb8 btrfs: call functions that overwrite their root parameter with fs_info
There are 11 functions that accept a root parameter and immediately
overwrite it.  We can pass those an fs_info pointer instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:57 +01:00
Wang Xiaoguang
1d57ee9416 btrfs: improve delayed refs iterations
This issue was found when I tried to delete a heavily reflinked file,
when deleting such files, other transaction operation will not have a
chance to make progress, for example, start_transaction() will blocked
in wait_current_trans(root) for long time, sometimes it even triggers
soft lockups, and the time taken to delete such heavily reflinked file
is also very large, often hundreds of seconds. Using perf top, it reports
that:

PerfTop:    7416 irqs/sec  kernel:99.8%  exact:  0.0% [4000Hz cpu-clock],  (all, 4 CPUs)
---------------------------------------------------------------------------------------
    84.37%  [btrfs]             [k] __btrfs_run_delayed_refs.constprop.80
    11.02%  [kernel]            [k] delay_tsc
     0.79%  [kernel]            [k] _raw_spin_unlock_irq
     0.78%  [kernel]            [k] _raw_spin_unlock_irqrestore
     0.45%  [kernel]            [k] do_raw_spin_lock
     0.18%  [kernel]            [k] __slab_alloc
It seems __btrfs_run_delayed_refs() took most cpu time, after some debug
work, I found it's select_delayed_ref() causing this issue, for a delayed
head, in our case, it'll be full of BTRFS_DROP_DELAYED_REF nodes, but
select_delayed_ref() will firstly try to iterate node list to find
BTRFS_ADD_DELAYED_REF nodes, obviously it's a disaster in this case, and
waste much time.

To fix this issue, we introduce a new ref_add_list in struct btrfs_delayed_ref_head,
then in select_delayed_ref(), if this list is not empty, we can directly use
nodes in this list. With this patch, it just took about 10~15 seconds to
delte the same file. Now using perf top, it reports that:

PerfTop:    2734 irqs/sec  kernel:99.5%  exact:  0.0% [4000Hz cpu-clock],  (all, 4 CPUs)
----------------------------------------------------------------------------------------

    20.74%  [kernel]          [k] _raw_spin_unlock_irqrestore
    16.33%  [kernel]          [k] __slab_alloc
     5.41%  [kernel]          [k] lock_acquired
     4.42%  [kernel]          [k] lock_acquire
     4.05%  [kernel]          [k] lock_release
     3.37%  [kernel]          [k] _raw_spin_unlock_irq

For normal files, this patch also gives help, at least we do not need to
iterate whole list to found BTRFS_ADD_DELAYED_REF nodes.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:21 +01:00
Qu Wenruo
33d1f05ccb btrfs: Export and move leaf/subtree qgroup helpers to qgroup.c
Move account_shared_subtree() to qgroup.c and rename it to
btrfs_qgroup_trace_subtree().

Do the same thing for account_leaf_items() and rename it to
btrfs_qgroup_trace_leaf_items().

Since all these functions are only for qgroup, move them to qgroup.c and
export them is more appropriate.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:21 +01:00
Qu Wenruo
50b3e040b7 btrfs: qgroup: Rename functions to make it follow reserve,trace,account steps
Rename btrfs_qgroup_insert_dirty_extent(_nolock) to
btrfs_qgroup_trace_extent(_nolock), according to the new
reserve/trace/account naming schema.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:21 +01:00
Jeff Mahoney
0c476a5d7f btrfs: Ensure proper sector alignment for btrfs_free_reserved_data_space
This fixes the WARN_ON on BTRFS_I(inode)->reserved_extents in
btrfs_destroy_inode and the WARN_ON on nonzero delalloc bytes on umount
with qgroups enabled.

I was able to reproduce this by setting up a small (~500kb) quota limit
and writing a file one byte at a time until I hit the limit.  The warnings
would all hit on umount.

The root cause is that we would reserve a block-sized range in both
the reservation and the quota in btrfs_check_data_free_space, but if we
encountered a problem (like e.g. EDQUOT), we would only release the single
byte in the qgroup reservation.  That caused an iotree state split, which
increased the number of outstanding extents, in turn disallowing releasing
the metadata reservation.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:19 +01:00
David Sterba
b159fa2808 btrfs: remove constant parameter to memset_extent_buffer and rename it
The only memset we do is to 0, so sink the parameter to the function and
simplify all calls. Rename the function to reflect the behaviour.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:17 +01:00
David Sterba
62d1f9fe97 btrfs: remove trivial helper btrfs_find_tree_block
During the time, the function has been shrunk to the point that it just
calls find_extent_buffer, just passing the parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:16 +01:00
Xiaoguang Wang
745699ef62 btrfs: remove useless comments
Fixes: ("btrfs: update btrfs_space_info's bytes_may_use timely")

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-30 13:45:14 +01:00
Wang Xiaoguang
dc1a90c6aa btrfs: cleanup: use already calculated value in btrfs_should_throttle_delayed_refs()
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-29 14:10:38 +01:00
Christoph Hellwig
cf8cddd38b btrfs: don't abuse REQ_OP_* flags for btrfs_map_block
btrfs_map_block supports different types of mappings, which to a large
extent resemble block layer operations.  But they don't always do, and
currently btrfs dangerously overlays it's own flag over the block layer
flags.  This is just asking for a conflict, so introduce a different
map flags enum inside of btrfs instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-29 14:10:38 +01:00
Wang Xiaoguang
9d1032cc49 btrfs: fix WARNING in btrfs_select_ref_head()
This issue was found when testing in-band dedupe enospc behaviour,
sometimes run_one_delayed_ref() may fail for enospc reason, then
__btrfs_run_delayed_refs()will return, but forget to add num_heads_read
back, which will trigger "WARN_ON(delayed_refs->num_heads_ready == 0)" in
btrfs_select_ref_head().

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-10-24 18:20:29 +02:00
Chris Mason
19c4d2f994 Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"
This reverts commit 5d8eb6fe51.

When we remove devices, we free the device structures.  Delaying
btfs_remove_chunk() ends up hitting a use-after-free on them.

Signed-off-by: Chris Mason <clm@fb.com>
2016-10-10 13:43:31 -07:00
Josef Bacik
4867268c57 Btrfs: don't BUG() during drop snapshot
Really there's lots of things that can go wrong here, kill all the
BUG_ON()'s and replace the logic ones with ASSERT()'s and return EIO
instead.

Signed-off-by: Josef Bacik <jbacik@fb.com>
[ switched to btrfs_err, errors go to common label ]
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Goldwyn Rodrigues
6cea66e544 btrfs: Remove already completed TODO comment
Fixes: 7cf5b97650 ("btrfs: qgroup: Cleanup old inaccurate facilities")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Goldwyn Rodrigues
dd12d5b804 btrfs: Do not reassign count in btrfs_run_delayed_refs
Code cleanup. count is already (unsgined long)-1. That is the reason
run_all was set. Do not reassign it (unsigned long)-1.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Liu Bo
a958eab0ed Btrfs: fix memory leak in do_walk_down
The extent buffer 'next' needs to be free'd conditionally.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:06 +02:00
Jeff Mahoney
ab8d0fc48d btrfs: convert pr_* to btrfs_* where possible
For many printks, we want to know which file system issued the message.

This patch converts most pr_* calls to use the btrfs_* versions instead.
In some cases, this means adding plumbing to allow call sites access to
an fs_info pointer.

fs/btrfs/check-integrity.c is left alone for another day.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 19:37:04 +02:00
Jeff Mahoney
62e855771d btrfs: convert printk(KERN_* to use pr_* calls
This patch converts printk(KERN_* style messages to use the pr_* versions.

One side effect is that anything that was KERN_DEBUG is now automatically
a dynamic debug message.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
Jeff Mahoney
5d163e0e68 btrfs: unsplit printed strings
CodingStyle chapter 2:
"[...] never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."

This patch unsplits user-visible strings.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
Liu Bo
02794222c4 Btrfs: kill BUG_ON in run_delayed_tree_ref
In a corrupted btrfs image, we can come across this BUG_ON and
get an unreponsive system, but if we return errors instead,
its caller can handle everything gracefully by aborting the current
transaction.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
Masahiro Yamada
e2c8990734 btrfs: squash lines for simple wrapper functions
Remove unneeded variables and assignments.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:38 +02:00
Josef Bacik
afcdd129e0 Btrfs: add a flags field to btrfs_fs_info
We have a lot of random ints in btrfs_fs_info that can be put into flags.  This
is mostly equivalent with the exception of how we deal with quota going on or
off, now instead we set a flag when we are turning it on or off and deal with
that appropriately, rather than just having a pending state that the current
quota_enabled gets set to.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Luis Henriques
1f079fa2f8 btrfs: Fix warning "variable ‘blocksize’ set but not used"
Variable 'blocksize' in reada_walk_down() is not used since commit
d3e46fea1b ("btrfs: sink blocksize parameter to readahead_tree_block").
This patch simply removes this variable.

Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Naohiro Aota
5d8eb6fe51 btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs
Currently, btrfs_relocate_chunk() is removing relocated BG by itself. But
the work can be done by btrfs_delete_unused_bgs() (and it's better since it
trim the BG). Let's dedupe the code.

While btrfs_delete_unused_bgs() is already hitting the relocated BG, it
skip the BG since the BG has "ro" flag set (to keep balancing BG intact).
On the other hand, btrfs cannot drop "ro" flag here to prevent additional
writes. So this patch make use of "removed" flag.
btrfs_delete_unused_bgs() now detect the flag to distinguish whether a
read-only BG is relocating or not.

Signed-off-by: Naohiro Aota <naohiro.aota@hgst.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Liu Bo
49303381f1 Btrfs: bail out if block group has different mixed flag
Currently we allow inconsistence about mixed flag
 (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA).

We'd get ENOSPC if block group has mixed flag and btrfs doesn't.
If that happens, we have one space_info with mixed flag and another
space_info only with BTRFS_BLOCK_GROUP_METADATA, and
global_block_rsv.space_info points to the latter one, but all bytes
from block_group contributes to the mixed space_info, thus all the
allocation will fail with ENOSPC.

This adds a check for the above case.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
[ updated message ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Liu Bo
c79a175175 Btrfs: fix memory leak of block group cache
While processing delayed refs, we may update block group's statistics
and attach it to cur_trans->dirty_bgs, and later writing dirty block
groups will process the list, which happens during
btrfs_commit_transaction().

For whatever reason, the transaction is aborted and dirty_bgs
is not processed in cleanup_transaction(), we end up with memory leak
of these dirty block group cache.

Since btrfs_start_dirty_block_groups() doesn't make it go to the commit
critical section, this also adds the cleanup work inside it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Linus Torvalds
b22734a550 Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Josef fixed a problem when quotas are enabled with his latest ENOSPC
  rework, and Jeff added more checks into the subvol ioctls to avoid
  tripping up lookup_one_len"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: ensure that file descriptor used with subvol ioctls is a dir
  Btrfs: handle quota reserve failure properly
2016-09-23 13:39:37 -07:00
Josef Bacik
1e5ec2e709 Btrfs: handle quota reserve failure properly
btrfs/022 was spitting a warning for the case that we exceed the quota.  If we
fail to make our quota reservation we need to clean up our data space
reservation.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Tested-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-09-21 17:22:16 -07:00
Linus Torvalds
f4a9c169c2 Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I'm not proud of how long it took me to track down that one liner in
  btrfs_sync_log(), but the good news is the patches I was trying to
  blame for these problems were actually fine (sorry Filipe)"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress
  btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns
  btrfs: do not decrease bytes_may_use when replaying extents
2016-09-09 12:52:31 -07:00