btrfsic_process_superblock() BUG_ON()s if 'state' is NULL. But this can
never happen as the only caller from btrfsic_process_superblock() is
btrfsic_mount() which allocates 'state' some lines above calling
btrfsic_process_superblock() and checks for the allocation to succeed.
Let's just remove the impossible to hit BUG_ON().
Signed-off-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A user reports a possible NULL-pointer dereference in
btrfsic_process_superblock(). We are assigning state->fs_info to a local
fs_info variable and afterwards checking for the presence of state.
While we would BUG_ON() a NULL state anyways, we can also just remove
the local fs_info copy, as fs_info is only used once as the first
argument for btrfs_num_copies(). There we can just pass in
state->fs_info as well.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205003
Signed-off-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a relic from a time before we had a proper reservation mechanism
and you could end up with really full chunks at chunk allocation time.
This doesn't make sense anymore, so just kill it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have the space_info, we can just check its flags to see if it's the
system chunk space info.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's a simple wrapper over btrfs_panic and is called only once. Just
open code it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two relocation stages but both print the same message. Add the
description of the stage. This can help debugging or provides
informative message to users.
BTRFS info (device dm-5): balance: start -d -m -s
BTRFS info (device dm-5): relocating block group 30408704 flags metadata|dup
BTRFS info (device dm-5): found 2 extents, stage: move data extents
BTRFS info (device dm-5): relocating block group 22020096 flags system|dup
BTRFS info (device dm-5): found 1 extents, stage: move data extents
BTRFS info (device dm-5): relocating block group 13631488 flags data
BTRFS info (device dm-5): found 1 extents, stage: move data extents
BTRFS info (device dm-5): found 1 extents, stage: update data pointers
BTRFS info (device dm-5): balance: ended with status: 0
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The condition '!ret2' is always true. commit 717beb96d9 ("Btrfs: fix
regression in btrfs_page_mkwrite() from vm_fault_t conversion") left
behind the check after moving this code out of the goto, so remove the
unused condition check.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
level <0 and level >= BTRFS_MAX_LEVEL are already performed upon
extent buffer read by tree checker in btrfs_check_node.
go. As far as 'level <= 0' we are guaranteed that level is '> 0'
because the value of level _before_ reading 'next' is larger than 1
(otherwise we wouldn't have executed that code at all) this in turn
guarantees that 'level' after btrfs_read_buffer is 'level - 1' since
we verify this invariant in:
btrfs_read_buffer
btree_read_extent_buffer_pages
btrfs_verify_level_key
This guarantees that level can never be '<= 0' so the warn on is
never triggered.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The log_root passed to walk_log_tree is guaranteed to have its
root_key.objectid always be BTRFS_TREE_LOG_OBJECTID. This is by
merit that all log roots of an ordinary root are allocated in
alloc_log_tree which hard-codes objectid to be BTRFS_TREE_LOG_OBJECTID.
In case walk_log_tree is called for a log tree found by btrfs_read_fs_root
in btrfs_recover_log_trees, that function already ensures
found_key.objectid is BTRFS_TREE_LOG_OBJECTID.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_free_reserved_extent now performs the actions of
btrfs_free_and_pin_reserved_extent. But this name is a bit of a
misnomer, since the extent is not really freed but just pinned. Reflect
this in the new name. No semantics changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_free_reserved_extent performs 2 entirely different operations
depending on whether its 'pin' argument is true or false. This patch
lifts the 2nd case (pin is false) into it's sole caller
btrfs_free_reserved_extent. No semantics changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers of btrfs_free_reserved_extent (respectively
__btrfs_free_reserved_extent with in set to 0) pass in extents which
have only been reserved but not yet written to. Namely,
* in cow_file_range that function is called only if create_io_em fails
or btrfs_add_ordered_extent fail, both of which happen _before_ any IO
is submitted to the newly reserved range
* in submit_compressed_extents the code flow is similar -
out_free_reserve can be called only before
btrfs_submit_compressed_write which is where any writes to the range
could occur
* btrfs_new_extent_direct also calls btrfs_free_reserved_extent only
if extent_map fails, before any IO is issued
* __btrfs_prealloc_file_range also calls btrfs_free_reserved_extent
in case insertion of the metadata fails
* btrfs_alloc_tree_block again can only be called in case in-memory
operations fail, before any IO is submitted
* btrfs_finish_ordered_io - this is the only caller where discarding
the extent could have a material effect, since it can be called for
an extent which was partially written.
With this change the submission of discards is optimised since discards
are now not being created for extents which are known to not have been
touched on disk.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
qgroup create/remove code is currently returning EINVAL when the user
tries to create a qgroup on a subvolume without quota enabled. EINVAL is
already being used for too many error scenarios so that is hard to
depict what is the problem.
[FIX]
Currently scrub and balance code return -ENOTCONN when the user tries to
cancel/pause and no scrub or balance is currently running for the
desired subvolume. Do the same here by returning -ENOTCONN when a user
tries to create/delete/assing/list a qgroup on a subvolume without quota
enabled.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove some variables that are set only to be checked later, and never
used.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge btrfs_sysfs_add_fsid() and btrfs_sysfs_add_devices_kobj() functions
as these two are small and they are called one after the other.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_sysfs_add_device() creates the directory
/sys/fs/btrfs/UUID/devices but its function name is misleading. Rename
it to btrfs_sysfs_add_devices_kobj() instead. No functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 24bd69cb ("Btrfs: sysfs: add support to add parent for fsid")
added parent argument in preparation to show the seed fsid under the
sprout fsid as in the patch [1] in the mailing list.
[1] Btrfs: sysfs: support seed devices in the sysfs layout
But later this idea was superseded by another idea to rename the fsid as
in the commit f93c39970b ("btrfs: factor out sysfs code for updating
sprout fsid").
So we don't need parent argument anymore.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The struct member btrfs_device::device_dir_kobj holds the kobj of the
sysfs directory /sys/fs/btrfs/UUID/devices, so rename it from
device_dir_kobj to devices_kobj. No functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using the NO_HOLES feature, if we punch a hole into a file and then
fsync it, there are cases where a subsequent fsync will miss the fact that
a hole was punched, resulting in the holes not existing after replaying
the log tree.
Essentially these cases all imply that, tree-log.c:copy_items(), is not
invoked for the leafs that delimit holes, because nothing changed those
leafs in the current transaction. And it's precisely copy_items() where
we currenly detect and log holes, which works as long as the holes are
between file extent items in the input leaf or between the beginning of
input leaf and the previous leaf or between the last item in the leaf
and the next leaf.
First example where we miss a hole:
*) The extent items of the inode span multiple leafs;
*) The punched hole covers a range that affects only the extent items of
the first leaf;
*) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC
is set in the inode's runtime flags).
That results in the hole not existing after replaying the log tree.
For example, if the fs/subvolume tree has the following layout for a
particular inode:
Leaf N, generation 10:
[ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ]
Leaf N + 1, generation 10:
[ EXTENT_ITEM (128K 64K) ... ]
If at transaction 11 we punch a hole coverting the range [0, 128K[, we end
up dropping the two extent items from leaf N, but we don't touch the other
leaf, so we end up in the following state:
Leaf N, generation 11:
[ ... INODE_ITEM INODE_REF ]
Leaf N + 1, generation 10:
[ EXTENT_ITEM (128K 64K) ... ]
A full fsync after punching the hole will only process leaf N because it
was modified in the current transaction, but not leaf N + 1, since it
was not modified in the current transaction (generation 10 and not 11).
As a result the fsync will not log any holes, because it didn't process
any leaf with extent items.
Second example where we will miss a hole:
*) An inode as its items spanning 5 (or more) leafs;
*) A hole is punched and it covers only the extents items of the 3rd
leaf. This resulsts in deleting the entire leaf and not touching any
of the other leafs.
So the only leaf that is modified in the current transaction, when
punching the hole, is the first leaf, which contains the inode item.
During the full fsync, the only leaf that is passed to copy_items()
is that first leaf, and that's not enough for the hole detection
code in copy_items() to determine there's a hole between the last
file extent item in the 2nd leaf and the first file extent item in
the 3rd leaf (which was the 4th leaf before punching the hole).
Fix this by scanning all leafs and punch holes as necessary when doing a
full fsync (less common than a non-full fsync) when the NO_HOLES feature
is enabled. The lack of explicit file extent items to mark holes makes it
necessary to scan existing extents to determine if holes exist.
A test case for fstests follows soon.
Fixes: 16e7549f04 ("Btrfs: incompatible format change to remove hole extents")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4h7wwACgkQxWXV+ddt
WDtFgxAAqanZ3wqq8xNqybfWmmFNrKtdkakErRuQFWe/+kZH59HuBifTbYeVMD4Z
hFgveFJpBIqo7uMUxUItSTtHcr3qx7TP9ejGYOaQO997oNPxPQXuEY8Lq5ebDBVB
89Gn+Eg/Q+uPvCJSctxx4dblSiGZKb3iOEh+lJuWJV4bj8beekcTrqsg01ZchPRO
Ygk1ltW7Vpf0wVkdts4FKiKiwX02M2C9zxh9NQjpNwH1DMow4XtBPsbqHbiHzRym
SoD4+0dbhfdnKkNnBTFEJBbjbZcYwM9EQnfiyVL+/hDMHX4XTetqeFN1G8usfXXX
2kxvwttPUtluJqlQXQnUU4mQEA4p5ORTgGgw1WBF3h+Aezumkql+27Bd6aiDKGZz
SPc9sveft60R23TxorlrYVqfADgyZKEaZ+2wEM99Xoz4OdvP7jkqDentJW9us1Xh
Xmfovq5xcRY17f9tdhiwqH5vgwxrLgmjBvTm/kcGX3ImhU8Yxk8xKw1JoV0P9cjW
7awK4l8pyPbOUhekdT8hYqWXlL/DXhAMHraV1zfBKIbu1omlGByeg23jNM2iS/0B
YtRkEEen0tRHpuKLB08twTKCak94wObBamKNFE6Snt1cDudLwGDpUosazM9l4uPR
2D3SHs7UWNgtvTRCfq2LVoRMRSR2BA1b19EkylUig4ay7khmZ2k=
=2e9q
-----END PGP SIGNATURE-----
Merge tag 'for-5.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes that have been in the works during last twp weeks.
All have a user visible effect and are stable material:
- scrub: properly update progress after calling cancel ioctl, calling
'resume' would start from the beginning otherwise
- fix subvolume reference removal, after moving out of the original
path the reference is not recognized and will lead to transaction
abort
- fix reloc root lifetime checks, could lead to crashes when there's
subvolume cleaning running in parallel
- fix memory leak when quotas get disabled in the middle of extent
accounting
- fix transaction abort in case of balance being started on degraded
mount on eg. RAID1"
* tag 'for-5.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: check rw_devices, not num_devices for balance
Btrfs: always copy scrub arguments back to user space
btrfs: relocation: fix reloc_root lifespan and access
btrfs: fix memory leak in qgroup accounting
btrfs: do not delete mismatched root refs
btrfs: fix invalid removal of root ref
btrfs: rework arguments of btrfs_unlink_subvol
The fstest btrfs/154 reports
[ 8675.381709] BTRFS: Transaction aborted (error -28)
[ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935
[ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286
[ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001
[ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971
[ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000
[ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4
[ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000
[ 8675.413994] FS: 00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000
[ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0
[ 8675.419801] Call Trace:
[ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs]
[ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs]
[ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs]
[ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs]
[ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0
[ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs]
[ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs]
[ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0
[ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400
[ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0
[ 8675.436618] ? _raw_spin_unlock+0x1f/0x30
[ 8675.438093] ? __handle_mm_fault+0x499/0x740
[ 8675.439619] ? do_vfs_ioctl+0x56e/0x770
[ 8675.441034] do_vfs_ioctl+0x56e/0x770
[ 8675.442411] ksys_ioctl+0x3a/0x70
[ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 8675.445333] __x64_sys_ioctl+0x16/0x20
[ 8675.446705] do_syscall_64+0x50/0x210
[ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left
We now use btrfs_can_overcommit() to see if we can flip a block group
read only. Before this would fail because we weren't taking into
account the usable un-allocated space for allocating chunks. With my
patches we were allowed to do the balance, which is technically correct.
The test is trying to start balance on degraded mount. So now we're
trying to allocate a chunk and cannot because we want to allocate a
RAID1 chunk, but there's only 1 device that's available for usage. This
results in an ENOSPC.
But we shouldn't even be making it this far, we don't have enough
devices to restripe. The problem is we're using btrfs_num_devices(),
that also includes missing devices. That's not actually what we want, we
need to use rw_devices.
The chunk_mutex is not needed here, rw_devices changes only in device
add, remove or replace, all are excluded by EXCL_OP mechanism.
Fixes: e4d8ec0f65 ("Btrfs: implement online profile changing")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add stacktrace, update changelog, drop chunk_mutex ]
Signed-off-by: David Sterba <dsterba@suse.com>
If scrub returns an error we are not copying back the scrub arguments
structure to user space. This prevents user space to know how much
progress scrub has done if an error happened - this includes -ECANCELED
which is returned when users ask for scrub to stop. A particular use
case, which is used in btrfs-progs, is to resume scrub after it is
canceled, in that case it relies on checking the progress from the scrub
arguments structure and then use that progress in a call to resume
scrub.
So fix this by always copying the scrub arguments structure to user
space, overwriting the value returned to user space with -EFAULT only if
copying the structure failed to let user space know that either that
copying did not happen, and therefore the structure is stale, or it
happened partially and the structure is probably not valid and corrupt
due to the partial copy.
Reported-by: Graham Cobb <g.btrfs@cobb.uk.net>
Link: https://lore.kernel.org/linux-btrfs/d0a97688-78be-08de-ca7d-bcb4c7fb397e@cobb.uk.net/
Fixes: 06fe39ab15 ("Btrfs: do not overwrite scrub error with fault error in scrub ioctl")
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Graham Cobb <g.btrfs@cobb.uk.net>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There are several different KASAN reports for balance + snapshot
workloads. Involved call paths include:
should_ignore_root+0x54/0xb0 [btrfs]
build_backref_tree+0x11af/0x2280 [btrfs]
relocate_tree_blocks+0x391/0xb80 [btrfs]
relocate_block_group+0x3e5/0xa00 [btrfs]
btrfs_relocate_block_group+0x240/0x4d0 [btrfs]
btrfs_relocate_chunk+0x53/0xf0 [btrfs]
btrfs_balance+0xc91/0x1840 [btrfs]
btrfs_ioctl_balance+0x416/0x4e0 [btrfs]
btrfs_ioctl+0x8af/0x3e60 [btrfs]
do_vfs_ioctl+0x831/0xb10
create_reloc_root+0x9f/0x460 [btrfs]
btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs]
create_pending_snapshot+0xa9b/0x15f0 [btrfs]
create_pending_snapshots+0x111/0x140 [btrfs]
btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
btrfs_mksubvol+0x915/0x960 [btrfs]
btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
btrfs_ioctl+0x241b/0x3e60 [btrfs]
do_vfs_ioctl+0x831/0xb10
btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs]
create_pending_snapshot+0x209/0x15f0 [btrfs]
create_pending_snapshots+0x111/0x140 [btrfs]
btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
btrfs_mksubvol+0x915/0x960 [btrfs]
btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
btrfs_ioctl+0x241b/0x3e60 [btrfs]
do_vfs_ioctl+0x831/0xb10
[CAUSE]
All these call sites are only relying on root->reloc_root, which can
undergo btrfs_drop_snapshot(), and since we don't have real refcount
based protection to reloc roots, we can reach already dropped reloc
root, triggering KASAN.
[FIX]
To avoid such access to unstable root->reloc_root, we should check
BTRFS_ROOT_DEAD_RELOC_TREE bit first.
This patch introduces wrappers that provide the correct way to check the
bit with memory barriers protection.
Most callers don't distinguish merged reloc tree and no reloc tree. The
only exception is should_ignore_root(), as merged reloc tree can be
ignored, while no reloc tree shouldn't.
[CRITICAL SECTION ANALYSIS]
Although test_bit()/set_bit()/clear_bit() doesn't imply a barrier, the
DEAD_RELOC_TREE bit has extra help from transaction as a higher level
barrier, the lifespan of root::reloc_root and DEAD_RELOC_TREE bit are:
NULL: reloc_root is NULL PTR: reloc_root is not NULL
0: DEAD_RELOC_ROOT bit not set DEAD: DEAD_RELOC_ROOT bit set
(NULL, 0) Initial state __
| /\ Section A
btrfs_init_reloc_root() \/
| __
(PTR, 0) reloc_root initialized /\
| |
btrfs_update_reloc_root() | Section B
| |
(PTR, DEAD) reloc_root has been merged \/
| __
=== btrfs_commit_transaction() ====================
| /\
clean_dirty_subvols() |
| | Section C
(NULL, DEAD) reloc_root cleanup starts \/
| __
btrfs_drop_snapshot() /\
| | Section D
(NULL, 0) Back to initial state \/
Every have_reloc_root() or test_bit(DEAD_RELOC_ROOT) caller holds
transaction handle, so none of such caller can cross transaction boundary.
In Section A, every caller just found no DEAD bit, and grab reloc_root.
In the cross section A-B, caller may get no DEAD bit, but since reloc_root
is still completely valid thus accessing reloc_root is completely safe.
No test_bit() caller can cross the boundary of Section B and Section C.
In Section C, every caller found the DEAD bit, so no one will access
reloc_root.
In the cross section C-D, either caller gets the DEAD bit set, avoiding
access reloc_root no matter if it's safe or not. Or caller get the DEAD
bit cleared, then access reloc_root, which is already NULL, nothing will
be wrong.
The memory write barriers are between the reloc_root updates and bit
set/clear, the pairing read side is before test_bit.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ barriers ]
Signed-off-by: David Sterba <dsterba@suse.com>
When running xfstests on the current btrfs I get the following splat from
kmemleak:
unreferenced object 0xffff88821b2404e0 (size 32):
comm "kworker/u4:7", pid 26663, jiffies 4295283698 (age 8.776s)
hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 10 ff fd 26 82 88 ff ff ...........&....
10 ff fd 26 82 88 ff ff 20 ff fd 26 82 88 ff ff ...&.... ..&....
backtrace:
[<00000000f94fd43f>] ulist_alloc+0x25/0x60 [btrfs]
[<00000000fd023d99>] btrfs_find_all_roots_safe+0x41/0x100 [btrfs]
[<000000008f17bd32>] btrfs_find_all_roots+0x52/0x70 [btrfs]
[<00000000b7660afb>] btrfs_qgroup_rescan_worker+0x343/0x680 [btrfs]
[<0000000058e66778>] btrfs_work_helper+0xac/0x1e0 [btrfs]
[<00000000f0188930>] process_one_work+0x1cf/0x350
[<00000000af5f2f8e>] worker_thread+0x28/0x3c0
[<00000000b55a1add>] kthread+0x109/0x120
[<00000000f88cbd17>] ret_from_fork+0x35/0x40
This corresponds to:
(gdb) l *(btrfs_find_all_roots_safe+0x41)
0x8d7e1 is in btrfs_find_all_roots_safe (fs/btrfs/backref.c:1413).
1408
1409 tmp = ulist_alloc(GFP_NOFS);
1410 if (!tmp)
1411 return -ENOMEM;
1412 *roots = ulist_alloc(GFP_NOFS);
1413 if (!*roots) {
1414 ulist_free(tmp);
1415 return -ENOMEM;
1416 }
1417
Following the lifetime of the allocated 'roots' ulist, it gets freed
again in btrfs_qgroup_account_extent().
But this does not happen if the function is called with the
'BTRFS_FS_QUOTA_ENABLED' flag cleared, then btrfs_qgroup_account_extent()
does a short leave and directly returns.
Instead of directly returning we should jump to the 'out_free' in order to
free all resources as expected.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_del_root_ref() will simply WARN_ON() if the ref doesn't match in
any way, and then continue to delete the reference. This shouldn't
happen, we have these values because there's more to the reference than
the original root and the sub root. If any of these checks fail, return
-ENOENT.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have the following sequence of events
btrfs sub create A
btrfs sub create A/B
btrfs sub snap A C
mkdir C/foo
mv A/B C/foo
rm -rf *
We will end up with a transaction abort.
The reason for this is because we create a root ref for B pointing to A.
When we create a snapshot of C we still have B in our tree, but because
the root ref points to A and not C we will make it appear to be empty.
The problem happens when we move B into C. This removes the root ref
for B pointing to A and adds a ref of B pointing to C. When we rmdir C
we'll see that we have a ref to our root and remove the root ref,
despite not actually matching our reference name.
Now btrfs_del_root_ref() allowing this to work is a bug as well, however
we know that this inode does not actually point to a root ref in the
first place, so we shouldn't be calling btrfs_del_root_ref() in the
first place and instead simply look up our dir index for this item and
do the rest of the removal.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_unlink_subvol takes the name of the dentry and the root objectid
based on what kind of inode this is, either a real subvolume link or a
empty one that we inherited as a snapshot. We need to fix how we unlink
in the case for BTRFS_EMPTY_SUBVOL_DIR_OBJECTID in the future, so rework
btrfs_unlink_subvol to just take the dentry and handle getting the right
objectid given the type of inode this is. There is no functional change
here, simply pushing the work into btrfs_unlink_subvol() proper.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl4PenMACgkQxWXV+ddt
WDtsyg/6A86Da1zpLQKHiQtfhH629oeL/lFI0Cz+52xcgKSAUX8pHrWzKawSVpTT
OTgksskHREPDjyDQytxAUKJnuplFEF7sCFNhzhsjQ7cvhC0lLDmW0VQr1Wob1/c4
miW18YXN2DrYFQqdCeypvq1FItglxuBmsa9eITYzytRZcATY+3b506FhT/d8NIiH
f3nnChUiNuyfAiez1SC4FavDpVma1O6SpPMk29wUN/+/Dnd4aBrfvhuygy7xyIOK
rzotRR5xagQDpei+99lT2hys4Pv0yEOajoGmgbgpaZNP0vgQcBLaF5UTeMv/Ib2j
i+muu1rWi4R6lS3hh30kRSifKmPF7I9JB1dbd5gPfZiERlDQk+qZWrPFv9l0lf7M
R3jn04EaPVNr0dtftRFF2VvpIlUQYgIvwZHx4Td6Oy9XO0X4n5vlKxYg9aHmybJ2
Vni13ad2oWNLo2Dd1eUYrEJ/QwGc1pq7JU9HiOST7yEJv1FN++AF4qTYTyvc7Yl4
HrN349/2k2J9vU88BIlXIxYG73AL9lIpGF2ROiEw+xn4oOevDP/5HCP8ZvELoNVM
NqhJGoURu+AhZ7Rm3RYAYySShjQPF+qw/Dnz8sowQsEFUQ990E1wUxWZwTOP6+JQ
cdwzYiXYIExTghfQdUDdwNlpCXUbSuJxFOGjKmhBuzA7HHwGf1Q=
=jGvW
-----END PGP SIGNATURE-----
Merge tag 'for-5.5-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few fixes for btrfs:
- blkcg accounting problem with compression that could stall writes
- setting up blkcg bio for compression crashes due to NULL bdev
pointer
- fix possible infinite loop in writeback for nocow files (here
possible means almost impossible, 13 things that need to happen to
trigger it)"
* tag 'for-5.5-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix infinite loop during nocow writeback due to race
btrfs: fix compressed write bio blkcg attribution
btrfs: punt all bios created in btrfs_submit_compressed_write()
When starting writeback for a range that covers part of a preallocated
extent, due to a race with writeback for another range that also covers
another part of the same preallocated extent, we can end up in an infinite
loop.
Consider the following example where for inode 280 we have two dirty
ranges:
range A, from 294912 to 303103, 8192 bytes
range B, from 348160 to 438271, 90112 bytes
and we have the following file extent item layout for our inode:
leaf 38895616 gen 24544 total ptrs 29 free space 13820 owner 5
(...)
item 27 key (280 108 200704) itemoff 14598 itemsize 53
extent data disk bytenr 0 nr 0 type 1 (regular)
extent data offset 0 nr 94208 ram 94208
item 28 key (280 108 294912) itemoff 14545 itemsize 53
extent data disk bytenr 10433052672 nr 81920 type 2 (prealloc)
extent data offset 0 nr 81920 ram 81920
Then the following happens:
1) Writeback starts for range B (from 348160 to 438271), execution of
run_delalloc_nocow() starts;
2) The first iteration of run_delalloc_nocow()'s whil loop leaves us at
the extent item at slot 28, pointing to the prealloc extent item
covering the range from 294912 to 376831. This extent covers part of
our range;
3) An ordered extent is created against that extent, covering the file
range from 348160 to 376831 (28672 bytes);
4) We adjust 'cur_offset' to 376832 and move on to the next iteration of
the while loop;
5) The call to btrfs_lookup_file_extent() leaves us at the same leaf,
pointing to slot 29, 1 slot after the last item (the extent item
we processed in the previous iteration);
6) Because we are a slot beyond the last item, we call btrfs_next_leaf(),
which releases the search path before doing a another search for the
last key of the leaf (280 108 294912);
7) Right after btrfs_next_leaf() released the path, and before it did
another search for the last key of the leaf, writeback for the range
A (from 294912 to 303103) completes (it was previously started at
some point);
8) Upon completion of the ordered extent for range A, the prealloc extent
we previously found got split into two extent items, one covering the
range from 294912 to 303103 (8192 bytes), with a type of regular extent
(and no longer prealloc) and another covering the range from 303104 to
376831 (73728 bytes), with a type of prealloc and an offset of 8192
bytes. So our leaf now has the following layout:
leaf 38895616 gen 24544 total ptrs 31 free space 13664 owner 5
(...)
item 27 key (280 108 200704) itemoff 14598 itemsize 53
extent data disk bytenr 0 nr 0 type 1
extent data offset 0 nr 8192 ram 94208
item 28 key (280 108 208896) itemoff 14545 itemsize 53
extent data disk bytenr 10433142784 nr 86016 type 1
extent data offset 0 nr 86016 ram 86016
item 29 key (280 108 294912) itemoff 14492 itemsize 53
extent data disk bytenr 10433052672 nr 81920 type 1
extent data offset 0 nr 8192 ram 81920
item 30 key (280 108 303104) itemoff 14439 itemsize 53
extent data disk bytenr 10433052672 nr 81920 type 2
extent data offset 8192 nr 73728 ram 81920
9) After btrfs_next_leaf() returns, we have our path pointing to that same
leaf and at slot 30, since it has a key we didn't have before and it's
the first key greater then the key that was previously the last key of
the leaf (key (280 108 294912));
10) The extent item at slot 30 covers the range from 303104 to 376831
which is in our target range, so we process it, despite having already
created an ordered extent against this extent for the file range from
348160 to 376831. This is because we skip to the next extent item only
if its end is less than or equals to the start of our delalloc range,
and not less than or equals to the current offset ('cur_offset');
11) As a result we compute 'num_bytes' as:
num_bytes = min(end + 1, extent_end) - cur_offset;
= min(438271 + 1, 376832) - 376832 = 0
12) We then call create_io_em() for a 0 bytes range starting at offset
376832;
13) Then create_io_em() enters an infinite loop because its calls to
btrfs_drop_extent_cache() do nothing due to the 0 length range
passed to it. So no existing extent maps that cover the offset
376832 get removed, and therefore calls to add_extent_mapping()
return -EEXIST, resulting in an infinite loop. This loop from
create_io_em() is the following:
do {
btrfs_drop_extent_cache(BTRFS_I(inode), em->start,
em->start + em->len - 1, 0);
write_lock(&em_tree->lock);
ret = add_extent_mapping(em_tree, em, 1);
write_unlock(&em_tree->lock);
/*
* The caller has taken lock_extent(), who could race with us
* to add em?
*/
} while (ret == -EEXIST);
Also, each call to btrfs_drop_extent_cache() triggers a warning because
the start offset passed to it (376832) is smaller then the end offset
(376832 - 1) passed to it by -1, due to the 0 length:
[258532.052621] ------------[ cut here ]------------
[258532.052643] WARNING: CPU: 0 PID: 9987 at fs/btrfs/file.c:602 btrfs_drop_extent_cache+0x3f4/0x590 [btrfs]
(...)
[258532.052672] CPU: 0 PID: 9987 Comm: fsx Tainted: G W 5.4.0-rc7-btrfs-next-64 #1
[258532.052673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[258532.052691] RIP: 0010:btrfs_drop_extent_cache+0x3f4/0x590 [btrfs]
(...)
[258532.052695] RSP: 0018:ffffb4be0153f860 EFLAGS: 00010287
[258532.052700] RAX: ffff975b445ee360 RBX: ffff975b44eb3e08 RCX: 0000000000000000
[258532.052700] RDX: 0000000000038fff RSI: 0000000000039000 RDI: ffff975b445ee308
[258532.052700] RBP: 0000000000038fff R08: 0000000000000000 R09: 0000000000000001
[258532.052701] R10: ffff975b513c5c10 R11: 00000000e3c0cfa9 R12: 0000000000039000
[258532.052703] R13: ffff975b445ee360 R14: 00000000ffffffef R15: ffff975b445ee308
[258532.052705] FS: 00007f86a821de80(0000) GS:ffff975b76a00000(0000) knlGS:0000000000000000
[258532.052707] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[258532.052708] CR2: 00007fdacf0f3ab4 CR3: 00000001f9d26002 CR4: 00000000003606f0
[258532.052712] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[258532.052717] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[258532.052717] Call Trace:
[258532.052718] ? preempt_schedule_common+0x32/0x70
[258532.052722] ? ___preempt_schedule+0x16/0x20
[258532.052741] create_io_em+0xff/0x180 [btrfs]
[258532.052767] run_delalloc_nocow+0x942/0xb10 [btrfs]
[258532.052791] btrfs_run_delalloc_range+0x30b/0x520 [btrfs]
[258532.052812] ? find_lock_delalloc_range+0x221/0x250 [btrfs]
[258532.052834] writepage_delalloc+0xe4/0x140 [btrfs]
[258532.052855] __extent_writepage+0x110/0x4e0 [btrfs]
[258532.052876] extent_write_cache_pages+0x21c/0x480 [btrfs]
[258532.052906] extent_writepages+0x52/0xb0 [btrfs]
[258532.052911] do_writepages+0x23/0x80
[258532.052915] __filemap_fdatawrite_range+0xd2/0x110
[258532.052938] btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
[258532.052954] start_ordered_ops+0x57/0xa0 [btrfs]
[258532.052973] ? btrfs_sync_file+0x225/0x490 [btrfs]
[258532.052988] btrfs_sync_file+0x225/0x490 [btrfs]
[258532.052997] __x64_sys_msync+0x199/0x200
[258532.053004] do_syscall_64+0x5c/0x250
[258532.053007] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[258532.053010] RIP: 0033:0x7f86a7dfd760
(...)
[258532.053014] RSP: 002b:00007ffd99af0368 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
[258532.053016] RAX: ffffffffffffffda RBX: 0000000000000ec9 RCX: 00007f86a7dfd760
[258532.053017] RDX: 0000000000000004 RSI: 000000000000836c RDI: 00007f86a8221000
[258532.053019] RBP: 0000000000021ec9 R08: 0000000000000003 R09: 00007f86a812037c
[258532.053020] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000000074a3
[258532.053021] R13: 00007f86a8221000 R14: 000000000000836c R15: 0000000000000001
[258532.053032] irq event stamp: 1653450494
[258532.053035] hardirqs last enabled at (1653450493): [<ffffffff9dec69f9>] _raw_spin_unlock_irq+0x29/0x50
[258532.053037] hardirqs last disabled at (1653450494): [<ffffffff9d4048ea>] trace_hardirqs_off_thunk+0x1a/0x20
[258532.053039] softirqs last enabled at (1653449852): [<ffffffff9e200466>] __do_softirq+0x466/0x6bd
[258532.053042] softirqs last disabled at (1653449845): [<ffffffff9d4c8a0c>] irq_exit+0xec/0x120
[258532.053043] ---[ end trace 8476fce13d9ce20a ]---
Which results in flooding dmesg/syslog since btrfs_drop_extent_cache()
uses WARN_ON() and not WARN_ON_ONCE().
So fix this issue by changing run_delalloc_nocow()'s loop to move to the
next extent item when the current extent item ends at at offset less than
or equals to the current offset instead of the start offset.
Fixes: 80ff385665 ("Btrfs: update nodatacow code v2")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Bio attribution is handled at bio_set_dev() as once we have a device, we
have a corresponding request_queue and then can derive the current css.
In special cases, we want to attribute to bio to someone else. This can
be done by calling bio_associate_blkg_from_css() or
kthread_associate_blkcg() depending on the scenario. Btrfs does this for
compressed writeback as they are handled by kworkers, so the latter can
be done here.
Commit 1a41802701 ("btrfs: drop bio_set_dev where not needed") removes
early bio_set_dev() calls prior to submit_stripe_bio(). This breaks the
above assumption that we'll have a request_queue when we are doing
association. To fix this, switch to using kthread_associate_blkcg().
Without this, we crash in btrfs/024:
[ 3052.093088] BUG: kernel NULL pointer dereference, address: 0000000000000510
[ 3052.107013] #PF: supervisor read access in kernel mode
[ 3052.107014] #PF: error_code(0x0000) - not-present page
[ 3052.107015] PGD 0 P4D 0
[ 3052.107021] Oops: 0000 [#1] SMP
[ 3052.138904] CPU: 42 PID: 201270 Comm: kworker/u161:0 Kdump: loaded Not tainted 5.5.0-rc1-00062-g4852d8ac90a9 #712
[ 3052.138905] Hardware name: Quanta Tioga Pass Single Side 01-0032211004/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
[ 3052.138912] Workqueue: btrfs-delalloc btrfs_work_helper
[ 3052.191375] RIP: 0010:bio_associate_blkg_from_css+0x1e/0x3c0
[ 3052.191379] RSP: 0018:ffffc900210cfc90 EFLAGS: 00010282
[ 3052.191380] RAX: 0000000000000000 RBX: ffff88bfe5573c00 RCX: 0000000000000000
[ 3052.191382] RDX: ffff889db48ec2f0 RSI: ffff88bfe5573c00 RDI: ffff889db48ec2f0
[ 3052.191386] RBP: 0000000000000800 R08: 0000000000203bb0 R09: ffff889db16b2400
[ 3052.293364] R10: 0000000000000000 R11: ffff88a07fffde80 R12: ffff889db48ec2f0
[ 3052.293365] R13: 0000000000001000 R14: ffff889de82bc000 R15: ffff889e2b7bdcc8
[ 3052.293367] FS: 0000000000000000(0000) GS:ffff889ffba00000(0000) knlGS:0000000000000000
[ 3052.293368] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3052.293369] CR2: 0000000000000510 CR3: 0000000002611001 CR4: 00000000007606e0
[ 3052.293370] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3052.293371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3052.293372] PKRU: 55555554
[ 3052.293376] Call Trace:
[ 3052.402552] btrfs_submit_compressed_write+0x137/0x390
[ 3052.402558] submit_compressed_extents+0x40f/0x4c0
[ 3052.422401] btrfs_work_helper+0x246/0x5a0
[ 3052.422408] process_one_work+0x200/0x570
[ 3052.438601] ? process_one_work+0x180/0x570
[ 3052.438605] worker_thread+0x4c/0x3e0
[ 3052.438614] kthread+0x103/0x140
[ 3052.460735] ? process_one_work+0x570/0x570
[ 3052.460737] ? kthread_mod_delayed_work+0xc0/0xc0
[ 3052.460744] ret_from_fork+0x24/0x30
Fixes: 1a41802701 ("btrfs: drop bio_set_dev where not needed")
Reported-by: Chris Murphy <chris@colorremedies.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Compressed writes happen in the background via kworkers. However, this
causes bios to be attributed to root bypassing any cgroup limits from
the actual writer. We tag the first bio with REQ_CGROUP_PUNT, which will
punt the bio to an appropriate cgroup specific workqueue and attribute
the IO properly. However, if btrfs_submit_compressed_write() creates a
new bio, we don't tag it the same way. Add the appropriate tagging for
subsequent bios.
Fixes: ec39f7696c ("Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios")
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl35CAYACgkQxWXV+ddt
WDur/w//S98RvSZYMW5y2u+bPGe8sCpXwu5Sr87hTd14We8cBWj8684npUmSk7Dz
rTRSjcf9EQe5dGoiHOzpKU0HcsLKy9DVTPigvVbmsWZfT9mqS6Y8wAKMw/7UUvyy
n7aZk/yQGRow3gZ/Z/aF23JypRoDJK7DPbSMKUW164BnD5rCCyr+VdA8V+CwHgVh
UN6UG0KMDbDKS4501DsX8418pcJN+a+Jo4oBGwN/guKRjK1oNcrhj34DNhvXlaOV
Rlu7HcVtfHNDS/xD3DZS9mDIiycJ6qHkvC3hUsEmlKRoPEm1leVxTDLDf78oEy9H
TrvH71hbvYjxaOU4YQbJG8ky+VwFfiV0Vrj73GgdEeRRDuMbYwUyFI5gYQOji8fS
DuYdJGyslOqQovpii+jrPiT1TPG+97R4+qKH2DfOW1xUChYsbQHt7FOfzUbLe0JE
dev9zV6MRqZ1qf70+Wt2LuWYFefpg9KVnsn8mcjoBwz9s9uImzLgpI90+DMPKOaU
TizwJK3W5K3YLhqPHwPLvqVxKwVOzu00v01xl/bjTuyp982oPSCj3fj+FprGV1la
OkqOYbKe2ZqEkQpINDu8I58oydTKywZGVsUl4ldJlcSY1hEDFCyeoAFmixaJbRbQ
IdBcQnjD7qgvu9E4cA0kL8Ma1op2+1zw8sUOdXKFIDiNEqL5FPs=
=AnHL
-----END PGP SIGNATURE-----
Merge tag 'for-5.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A mix of regression fixes and regular fixes for stable trees:
- fix swapped error messages for qgroup enable/rescan
- fixes for NO_HOLES feature with clone range
- fix deadlock between iget/srcu lock/synchronize srcu while freeing
an inode
- fix double lock on subvolume cross-rename
- tree log fixes
* fix missing data checksums after replaying a log tree
* also teach tree-checker about this problem
* skip log replay on orphaned roots
- fix maximum devices constraints for RAID1C -3 and -4
- send: don't print warning on read-only mount regarding orphan
cleanup
- error handling fixes"
* tag 'for-5.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: send: remove WARN_ON for readonly mount
btrfs: do not leak reloc root if we fail to read the fs root
btrfs: skip log replay on orphaned roots
btrfs: handle ENOENT in btrfs_uuid_tree_iterate
btrfs: abort transaction after failed inode updates in create_subvol
Btrfs: fix hole extent items with a zero size after range cloning
Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues
Btrfs: make tree checker detect checksum items with overlapping ranges
Btrfs: fix missing data checksums after replaying a log tree
btrfs: return error pointer from alloc_test_extent_buffer
btrfs: fix devs_max constraints for raid1c3 and raid1c4
btrfs: tree-checker: Fix error format string for size_t
btrfs: don't double lock the subvol_sem for rename exchange
btrfs: handle error in btrfs_cache_block_group
btrfs: do not call synchronize_srcu() in inode_tree_del
Btrfs: fix cloning range with a hole when using the NO_HOLES feature
btrfs: Fix error messages in qgroup_rescan_init
We log warning if root::orphan_cleanup_state is not set to
ORPHAN_CLEANUP_DONE in btrfs_ioctl_send(). However if the filesystem is
mounted as readonly we skip the orphan item cleanup during the lookup
and root::orphan_cleanup_state remains at the init state 0 instead of
ORPHAN_CLEANUP_DONE (2). So during send in btrfs_ioctl_send() we hit the
warning as below.
WARN_ON(send_root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE);
WARNING: CPU: 0 PID: 2616 at /Volumes/ws/btrfs-devel/fs/btrfs/send.c:7090 btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
::
RIP: 0010:btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
::
Call Trace:
::
_btrfs_ioctl_send+0x7b/0x110 [btrfs]
btrfs_ioctl+0x150a/0x2b00 [btrfs]
::
do_vfs_ioctl+0xa9/0x620
? __fget+0xac/0xe0
ksys_ioctl+0x60/0x90
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x49/0x130
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reproducer:
mkfs.btrfs -fq /dev/sdb
mount /dev/sdb /btrfs
btrfs subvolume create /btrfs/sv1
btrfs subvolume snapshot -r /btrfs/sv1 /btrfs/ss1
umount /btrfs
mount -o ro /dev/sdb /btrfs
btrfs send /btrfs/ss1 -f /tmp/f
The warning exists because having orphan inodes could confuse send and
cause it to fail or produce incorrect streams. The two cases that would
cause such send failures, which are already fixed are:
1) Inodes that were unlinked - these are orphanized and remain with a
link count of 0. These caused send operations to fail because it
expected to always find at least one path for an inode. However this
is no longer a problem since send is now able to deal with such
inodes since commit 46b2f4590a ("Btrfs: fix send failure when root
has deleted files still open") and treats them as having been
completely removed (the state after an orphan cleanup is performed).
2) Inodes that were in the process of being truncated. These resulted in
send not knowing about the truncation and potentially issue write
operations full of zeroes for the range from the new file size to the
old file size. This is no longer a problem because we no longer
create orphan items for truncation since commit f7e9e8fc79 ("Btrfs:
stop creating orphan items for truncate").
As such before these commits, the WARN_ON here provided a clue in case
something went wrong. Instead of being a warning against the
root::orphan_cleanup_state value, it could have been more accurate by
checking if there were actually any orphan items, and then issue a
warning only if any exists, but that would be more expensive to check.
Since orphanized inodes no longer cause problems for send, just remove
the warning.
Reported-by: Christoph Anton Mitterer <calestyo@scientia.net>
Link: https://lore.kernel.org/linux-btrfs/21cb5e8d059f6e1496a903fa7bfc0a297e2f5370.camel@scientia.net/
CC: stable@vger.kernel.org # 4.19+
Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to read the fs root corresponding with a reloc root we'll
just break out and free the reloc roots. But we remove our current
reloc_root from this list higher up, which means we'll leak this
reloc_root. Fix this by adding ourselves back to the reloc_roots list
so we are properly cleaned up.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
My fsstress modifications coupled with generic/475 uncovered a failure
to mount and replay the log if we hit a orphaned root. We do not want
to replay the log for an orphan root, but it's completely legitimate to
have an orphaned root with a log attached. Fix this by simply skipping
replaying the log. We still need to pin it's root node so that we do
not overwrite it while replaying other logs, as we re-read the log root
at every stage of the replay.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we get an -ENOENT back from btrfs_uuid_iter_rem when iterating the
uuid tree we'll just continue and do btrfs_next_item(). However we've
done a btrfs_release_path() at this point and no longer have a valid
path. So increment the key and go back and do a normal search.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can just abort the transaction here, and in fact do that for every
other failure in this function except these two cases.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Normally when cloning a file range if we find an implicit hole at the end
of the range we assume it is because the NO_HOLES feature is enabled.
However that is not always the case. One well known case [1] is when we
have a power failure after mixing buffered and direct IO writes against
the same file.
In such cases we need to punch a hole in the destination file, and if
the NO_HOLES feature is not enabled, we need to insert explicit file
extent items to represent the hole. After commit 690a5dbfc5
("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning
extents"), we started to insert file extent items representing the hole
with an item size of 0, which is invalid and should be 53 bytes (the size
of a btrfs_file_extent_item structure), resulting in all sorts of
corruptions and invalid memory accesses. This is detected by the tree
checker when we attempt to write a leaf to disk.
The problem can be sporadically triggered by test case generic/561 from
fstests. That test case does not exercise power failure and creates a new
filesystem when it starts, so it does not use a filesystem created by any
previous test that tests power failure. However the test does both
buffered and direct IO writes (through fsstress) and it's precisely that
which is creating the implicit holes in files. That happens even before
the commit mentioned earlier. I need to investigate why we get those
implicit holes to check if there is a real problem or not. For now this
change fixes the regression of introducing file extent items with an item
size of 0 bytes.
Fix the issue by calling btrfs_punch_hole_range() without passing a
btrfs_clone_extent_info structure, which ensures file extent items are
inserted to represent the hole with a correct item size. We were passing
a btrfs_clone_extent_info with a value of 0 for its 'item_size' field,
which was causing the insertion of file extent items with an item size
of 0.
[1] https://www.spinics.net/lists/linux-btrfs/msg75350.html
Reported-by: David Sterba <dsterba@suse.com>
Fixes: 690a5dbfc5 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a tree mod log user no longer needs to use the tree it calls
btrfs_put_tree_mod_seq() to remove itself from the list of users and
delete all no longer used elements of the tree's red black tree, which
should be all elements with a sequence number less then our equals to
the caller's sequence number. However the logic is broken because it
can delete and free elements from the red black tree that have a
sequence number greater then the caller's sequence number:
1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the
tree mod log;
2) The task which got assigned the sequence number 1 calls
btrfs_put_tree_mod_seq();
3) Sequence number 1 is deleted from the list of sequence numbers;
4) The current minimum sequence number is computed to be the sequence
number 2;
5) A task using sequence number 2 is at tree_mod_log_rewind() and gets
a pointer to one of its elements from the red black tree through
a call to tree_mod_log_search();
6) The task with sequence number 1 iterates the red black tree of tree
modification elements and deletes (and frees) all elements with a
sequence number less then or equals to 2 (the computed minimum sequence
number) - it ends up only leaving elements with sequence numbers of 3
and 4;
7) The task with sequence number 2 now uses the pointer to its element,
already freed by the other task, at __tree_mod_log_rewind(), resulting
in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces
a trace like the following:
[16804.546854] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[16804.547451] CPU: 0 PID: 28257 Comm: pool Tainted: G W 5.4.0-rc8-btrfs-next-51 #1
[16804.548059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[16804.548666] RIP: 0010:rb_next+0x16/0x50
(...)
[16804.550581] RSP: 0018:ffffb948418ef9b0 EFLAGS: 00010202
[16804.551227] RAX: 6b6b6b6b6b6b6b6b RBX: ffff90e0247f6600 RCX: 6b6b6b6b6b6b6b6b
[16804.551873] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90e0247f6600
[16804.552504] RBP: ffff90dffe0d4688 R08: 0000000000000001 R09: 0000000000000000
[16804.553136] R10: ffff90dffa4a0040 R11: 0000000000000000 R12: 000000000000002e
[16804.553768] R13: ffff90e0247f6600 R14: 0000000000001663 R15: ffff90dff77862b8
[16804.554399] FS: 00007f4b197ae700(0000) GS:ffff90e036a00000(0000) knlGS:0000000000000000
[16804.555039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16804.555683] CR2: 00007f4b10022000 CR3: 00000002060e2004 CR4: 00000000003606f0
[16804.556336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[16804.556968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[16804.557583] Call Trace:
[16804.558207] __tree_mod_log_rewind+0xbf/0x280 [btrfs]
[16804.558835] btrfs_search_old_slot+0x105/0xd00 [btrfs]
[16804.559468] resolve_indirect_refs+0x1eb/0xc70 [btrfs]
[16804.560087] ? free_extent_buffer.part.19+0x5a/0xc0 [btrfs]
[16804.560700] find_parent_nodes+0x388/0x1120 [btrfs]
[16804.561310] btrfs_check_shared+0x115/0x1c0 [btrfs]
[16804.561916] ? extent_fiemap+0x59d/0x6d0 [btrfs]
[16804.562518] extent_fiemap+0x59d/0x6d0 [btrfs]
[16804.563112] ? __might_fault+0x11/0x90
[16804.563706] do_vfs_ioctl+0x45a/0x700
[16804.564299] ksys_ioctl+0x70/0x80
[16804.564885] ? trace_hardirqs_off_thunk+0x1a/0x20
[16804.565461] __x64_sys_ioctl+0x16/0x20
[16804.566020] do_syscall_64+0x5c/0x250
[16804.566580] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[16804.567153] RIP: 0033:0x7f4b1ba2add7
(...)
[16804.568907] RSP: 002b:00007f4b197adc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[16804.569513] RAX: ffffffffffffffda RBX: 00007f4b100210d8 RCX: 00007f4b1ba2add7
[16804.570133] RDX: 00007f4b100210d8 RSI: 00000000c020660b RDI: 0000000000000003
[16804.570726] RBP: 000055de05a6cfe0 R08: 0000000000000000 R09: 00007f4b197add44
[16804.571314] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4b197add48
[16804.571905] R13: 00007f4b197add40 R14: 00007f4b100210d0 R15: 00007f4b197add50
(...)
[16804.575623] ---[ end trace 87317359aad4ba50 ]---
Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that
have a sequence number equals to the computed minimum sequence number, and
not just elements with a sequence number greater then that minimum.
Fixes: bd989ba359 ("Btrfs: add tree modification log functions")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Having checksum items, either on the checksums tree or in a log tree, that
represent ranges that overlap each other is a sign of a corruption. Such
case confuses the checksum lookup code and can result in not being able to
find checksums or find stale checksums.
So add a check for such case.
This is motivated by a recent fix for a case where a log tree had checksum
items covering ranges that overlap each other due to extent cloning, and
resulted in missing checksums after replaying the log tree. It also helps
detect past issues such as stale and outdated checksums due to overlapping,
commit 27b9a8122f ("Btrfs: fix csum tree corruption, duplicate and
outdated checksums").
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a file that has shared extents (reflinked with other files or
with itself), we can end up logging multiple checksum items that cover
overlapping ranges. This confuses the search for checksums at log replay
time causing some checksums to never be added to the fs/subvolume tree.
Consider the following example of a file that shares the same extent at
offsets 0 and 256Kb:
[ bytenr 13893632, offset 64Kb, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 13893632, offset 0, len 256Kb ]
256Kb 512Kb
When logging the inode, at tree-log.c:copy_items(), when processing the
file extent item at offset 0, we log a checksum item covering the range
13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
64Kb + 64Kb, respectively.
Later when processing the extent item at offset 256K, we log the checksums
for the range from 13893632 to 14155776 (which corresponds to 13893632 +
256Kb). These checksums get merged with the checksum item for the range
from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
So after this we get the two following checksum items in the log tree:
(...)
item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
range start 13631488 end 14155776 length 524288
item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
range start 13959168 end 14024704 length 65536
The first one covers the range from the second one, they overlap.
So far this does not cause a problem after replaying the log, because
when replaying the file extent item for offset 256K, we copy all the
checksums for the extent 13893632 from the log tree to the fs/subvolume
tree, since searching for an checksum item for bytenr 13893632 leaves us
at the first checksum item, which covers the whole range of the extent.
However if we write 64Kb to file offset 256Kb for example, we will
not be able to find and copy the checksums for the last 128Kb of the
extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
After writing 64Kb into file offset 256Kb we get the following extent
layout for our file:
[ bytenr 13893632, offset 64K, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 14155776, offset 0, len 64Kb ]
256Kb 320Kb
[ bytenr 13893632, offset 64Kb, len 192Kb ]
320Kb 512Kb
After fsync'ing the file, if we have a power failure and then mount
the filesystem to replay the log, the following happens:
1) When replaying the file extent item for file offset 320Kb, we
lookup for the checksums for the extent range from 13959168
(13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
to btrfs_lookup_csums_range();
2) btrfs_lookup_csums_range() finds the checksum item that starts
precisely at offset 13959168 (item 7 in the log tree, shown before);
3) However that checksum item only covers 64Kb of data, and not 192Kb
of data;
4) As a result only the checksums for the first 64Kb of data referenced
by the file extent item are found and copied to the fs/subvolume tree.
The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
the corresponding data checksums found and copied to the fs/subvolume
tree.
5) After replaying the log userspace will not be able to read the file
range from 384Kb to 512Kb, because the checksums are missing and
resulting in an -EIO error.
The following steps reproduce this scenario:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
$ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
<power failure>
$ mount /dev/sdc /mnt/sdc
$ md5sum /mnt/sdc/foobar
md5sum: /mnt/sdc/foobar: Input/output error
$ dmesg | tail
[165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
[165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
[165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
[165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
[165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
[165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
[165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
[165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
[165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
[165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
Fix this simply by deleting first any checksums, from the log tree, for the
range of the extent we are logging at copy_items(). This ensures we do not
get checksum items in the log tree that have overlapping ranges.
This is a long time issue that has been present since we have the clone
(and deduplication) ioctl, and can happen both when an extent is shared
between different files and within the same file.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Callers of alloc_test_extent_buffer have not correctly interpreted the
return value as error pointer, as alloc_test_extent_buffer should behave
as alloc_extent_buffer. The self-tests were unaffected but
btrfs_find_create_tree_block could call both functions and that would
cause problems up in the call chain.
Fixes: faa2dbf004 ("Btrfs: add sanity tests for new qgroup accounting code")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value 0 for devs_max means to spread the allocated chunks over all
available devices, eg. stripe for RAID0 or RAID5. This got mistakenly
copied to the RAID1C3/4 profiles. The intention is to have exactly 3 and
4 copies respectively.
Fixes: 47e6f7423b ("btrfs: add support for 3-copy replication (raid1c3)")
Fixes: 8d6fac0087 ("btrfs: add support for 4-copy replication (raid1c4)")
Signed-off-by: David Sterba <dsterba@suse.com>
Argument BTRFS_FILE_EXTENT_INLINE_DATA_START is defined as offsetof(),
which returns type size_t, so we need %zu instead of %lu.
This fixes a build warning on 32-bit ARM:
../fs/btrfs/tree-checker.c: In function 'check_extent_data_item':
../fs/btrfs/tree-checker.c:230:43: warning: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'unsigned int' [-Wformat=]
230 | "invalid item size, have %u expect [%lu, %u)",
| ~~^
| long unsigned int
| %u
Fixes: 153a6d2999 ("btrfs: tree-checker: Check item size before reading file extent type")
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andreas Färber <afaerber@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're rename exchanging two subvols we'll try to lock this lock
twice, which is bad. Just lock once if either of the ino's are subvols.
Fixes: cdd1fedf82 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a BUG_ON(ret < 0) in find_free_extent from
btrfs_cache_block_group. If we fail to allocate our ctl we'll just
panic, which is not good. Instead just go on to another block group.
If we fail to find a block group we don't want to return ENOSPC, because
really we got a ENOMEM and that's the root of the problem. Save our
return from btrfs_cache_block_group(), and then if we still fail to make
our allocation return that ret so we get the right error back.
Tested with inject-error.py from bcc.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Testing with the new fsstress uncovered a pretty nasty deadlock with
lookup and snapshot deletion.
Process A
unlink
-> final iput
-> inode_tree_del
-> synchronize_srcu(subvol_srcu)
Process B
btrfs_lookup <- srcu_read_lock() acquired here
-> btrfs_iget
-> find inode that has I_FREEING set
-> __wait_on_freeing_inode()
We're holding the srcu_read_lock() while doing the iget in order to make
sure our fs root doesn't go away, and then we are waiting for the inode
to finish freeing. However because the free'ing process is doing a
synchronize_srcu() we deadlock.
Fix this by dropping the synchronize_srcu() in inode_tree_del(). We
don't need people to stop accessing the fs root at this point, we're
only adding our empty root to the dead roots list.
A larger much more invasive fix is forthcoming to address how we deal
with fs roots, but this fixes the immediate problem.
Fixes: 76dda93c6a ("Btrfs: add snapshot/subvolume destroy ioctl")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using the NO_HOLES feature if we clone a range that contains a hole
and a temporary ENOSPC happens while dropping extents from the target
inode's range, we can end up failing and aborting the transaction with
-EEXIST or with a corrupt file extent item, that has a length greater
than it should and overlaps with other extents. For example when cloning
the following range from inode A to inode B:
Inode A:
extent A1 extent A2
[ ----------- ] [ hole, implicit, 4MB length ] [ ------------- ]
0 1MB 5MB 6MB
Range to clone: [1MB, 6MB)
Inode B:
extent B1 extent B2 extent B3 extent B4
[ ---------- ] [ --------- ] [ ---------- ] [ ---------- ]
0 1MB 1MB 2MB 2MB 5MB 5MB 6MB
Target range: [1MB, 6MB) (same as source, to make it easier to explain)
The following can happen:
1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
set 'drop_end' to 2MB, meaning it was able to drop only extent B2;
3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 2MB - 1MB =
1MB;
4) We then attempt to insert a file extent item at inode B with a file
offset of 5MB, which is the value of clone_info->file_offset. This
fails with error -EEXIST because there's already an extent at that
offset (extent B4);
5) We abort the current transaction with -EEXIST and return that error
to user space as well.
Another example, for extent corruption:
Inode A:
extent A1 extent A2
[ ----------- ] [ hole, implicit, 10MB length ] [ ------------- ]
0 1MB 11MB 12MB
Inode B:
extent B1 extent B2
[ ----------- ] [ --------- ] [ ----------------------------- ]
0 1MB 1MB 5MB 5MB 12MB
Target range: [1MB, 12MB) (same as source, to make it easier to explain)
1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
set 'drop_end' to 5MB, meaning it was able to drop only extent B2;
3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 5MB - 1MB =
4MB;
4) We then insert a file extent item at inode B with a file offset of 11MB
which is the value of clone_info->file_offset, and a length of 4MB (the
value of 'clone_len'). So we get 2 extents items with ranges that
overlap and an extent length of 4MB, larger then the extent A2 from
inode A (1MB length);
5) After that we end the transaction, balance the btree dirty pages and
then start another or join the previous transaction. It might happen
that the transaction which inserted the incorrect extent was committed
by another task so we end up with extent corruption if a power failure
happens.
So fix this by making sure we attempt to insert the extent to clone at
the destination inode only if we are past dropping the sub-range that
corresponds to a hole.
Fixes: 690a5dbfc5 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The branch of qgroup_rescan_init which is executed from the mount
path prints wrong errors messages. The textual print out in case
BTRFS_QGROUP_STATUS_FLAG_RESCAN/BTRFS_QGROUP_STATUS_FLAG_ON are not
set are transposed. Fix it by exchanging their place.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Because the BLAKE2B code went through a different tree, it was not
available at the time the btrfs part was merged. Now that the Kconfig
symbol exists, add it to the list.
Signed-off-by: David Sterba <dsterba@suse.com>
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.
Switch the btrfs_device_set_…() macro over to use CONFIG_PREEMPTION.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-btrfs@vger.kernel.org
Link: https://lore.kernel.org/r/20191015191821.11479-25-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need support
for time64_t.
In completely unrelated work, I spent time on cleaning up parts of this
file in the past, moving things out into drivers instead.
After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the rest
of it and move it all into drivers.
This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which is
the scsi ioctl handlers. I have patches for those as well, but they need
more testing or possibly a rewrite.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJdsHCdAAoJEJpsee/mABjZtYkP/1JGl3jFv3Iq/5BCdPkaePP1
RtMJRNfURgK3GeuHUui330PvVjI/pLWXU/VXMK2MPTASpJLzYz3uCaZrpVWEMpDZ
+ImzGmgJkITlW1uWU3zOcQhOxTyb1hCZ0Ci+2xn9QAmyOL7prXoXCXDWv3h6iyiF
lwG+nW+HNtyx41YG+9bRfKNoG0ZJ+nkJ70BV6u0acQHXWn7Xuupa9YUmBL87hxAL
6dlJfLTJg6q8QSv/Q6LxslfWk2Ti8OOJZOwtFM5R8Bgl0iUcvshiRCKfv/3t9jXD
dJNvF1uq8z+gracWK49Qsfq5dnZ2ZxHFUo9u0NjbCrxNvWH/sdvhbaUBuJI75seH
VIznCkdxFhrqitJJ8KmxANxG08u+9zSKjSlxG2SmlA4qFx/AoStoHwQXcogJscNb
YIXYKmWBvwPzYu09QFAXdHFPmZvp/3HhMWU6o92lvDhsDwzkSGt3XKhCJea4DCaT
m+oCcoACqSWhMwdbJOEFofSub4bY43s5iaYuKes+c8O261/Dwg6v/pgIVez9mxXm
TBnvCsotq5m8wbwzv99eFqGeJH8zpDHrXxEtRR5KQqMqjLq/OQVaEzmpHZTEuK7n
e/V/PAKo2/V63g4k6GApQXDxnjwT+m0aWToWoeEzPYXS6KmtWC91r4bWtslu3rdl
bN65armTm7bFFR32Avnu
=lgCl
-----END PGP SIGNATURE-----
Merge tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
"As part of the cleanup of some remaining y2038 issues, I came to
fs/compat_ioctl.c, which still has a couple of commands that need
support for time64_t.
In completely unrelated work, I spent time on cleaning up parts of
this file in the past, moving things out into drivers instead.
After Al Viro reviewed an earlier version of this series and did a lot
more of that cleanup, I decided to try to completely eliminate the
rest of it and move it all into drivers.
This series incorporates some of Al's work and many patches of my own,
but in the end stops short of actually removing the last part, which
is the scsi ioctl handlers. I have patches for those as well, but they
need more testing or possibly a rewrite"
* tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
scsi: sd: enable compat ioctls for sed-opal
pktcdvd: add compat_ioctl handler
compat_ioctl: move SG_GET_REQUEST_TABLE handling
compat_ioctl: ppp: move simple commands into ppp_generic.c
compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
compat_ioctl: unify copy-in of ppp filters
tty: handle compat PPP ioctls
compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
compat_ioctl: handle SIOCOUTQNSD
af_unix: add compat_ioctl support
compat_ioctl: reimplement SG_IO handling
compat_ioctl: move WDIOC handling into wdt drivers
fs: compat_ioctl: move FITRIM emulation into file systems
gfs2: add compat_ioctl support
compat_ioctl: remove unused convert_in_user macro
compat_ioctl: remove last RAID handling code
compat_ioctl: remove /dev/raw ioctl translation
compat_ioctl: remove PCI ioctl translation
compat_ioctl: remove joystick ioctl translation
...
After previous patches removing bdev being passed around to set it to
bio, it has become unused in submit_extent_page. So it now has "only" 13
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
We can now remove the bdev from extent_map. Previous patches made sure
that bio_set_dev is correctly in all places and that we don't need to
grab it from latest_bdev or pass it around inside the extent map.
Signed-off-by: David Sterba <dsterba@suse.com>
bio_set_dev sets a bdev to a bio and is not only setting a pointer bug
also changing some state bits if there was a different bdev set before.
This is one thing that's not needed.
Another thing is that setting a bdev at bio allocation time is too early
and actually does not work with plain redundancy profiles, where each
time we submit a bio to a device, the bdev is set correctly.
In many places the bio bdev is set to latest_bdev that seems to serve as
a stub pointer "just to put something to bio". But we don't have to do
that.
Where do we know which bdev to set:
* for regular IO: submit_stripe_bio that's called by btrfs_map_bio
* repair IO: repair_io_failure, read or write from specific device
* super block write (using buffer_heads but uses raw bdev) and barriers
* scrub: this does not use all regular IO paths as it needs to reach all
copies, verify and fixup eventually, and for that all bdev management
is independent
* raid56: rbio_add_io_page, for the RMW write
* integrity-checker: does it's own low-level block tracking
Signed-off-by: David Sterba <dsterba@suse.com>
This is preparatory patch to remove @bdev parameter from
submit_extent_page. It can't be removed completely, because the cgroups
need it for wbc when initializing the bio
wbc_init_bio
bio_associate_blkg_from_css
dereference bdev->bi_disk->queue
The bdev pointer is the same as latest_bdev, thus no functional change.
We can retrieve it from fs_devices that's reachable through several
dereferences. The local variable shadows the parameter, but that's only
temporary.
Signed-off-by: David Sterba <dsterba@suse.com>
Testing with the new fsstress support for subvolumes uncovered a pretty
bad problem with rename exchange on subvolumes. We're modifying two
different subvolumes, but we only start the transaction on one of them,
so the other one is not added to the dirty root list. This is caught by
btrfs_cow_block() with a warning because the root has not been updated,
however if we do not modify this root again we'll end up pointing at an
invalid root because the root item is never updated.
Fix this by making sure we add the destination root to the trans list,
the same as we do with normal renames. This fixes the corruption.
Fixes: cdd1fedf82 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
set the block group to RO mode and then wait for any ongoing writes into
extents of the block group to complete. While doing that wait we overwrite
the value of the variable 'ret' and can break out of the loop if an error
happens without turning the block group back into RW mode. So what happens
is the following:
1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
to RO mode (its ->ro field set to 1 or incremented to some value > 1);
2) Then btrfs_wait_ordered_roots() returns a value > 0;
3) Then if either joining or committing the transaction fails, we break
out of the loop wihtout calling btrfs_dec_block_group_ro(), leaving
the block group in RO mode forever.
To fix this, just remove the code that waits for ongoing writes to extents
of the block group, since it's not needed because in the initial setup
phase of a device replace operation, before starting to find all chunks
and their extents, we set the target device for replace while holding
fs_info->dev_replace->rwsem, which ensures that after releasing that
semaphore, any writes into the source device are made to the target device
as well (__btrfs_map_block() guarantees that). So while at
scrub_enumerate_chunks() we only need to worry about finding and copying
extents (from the source device to the target device) that were written
before we started the device replace operation.
Fixes: f0e9b7d640 ("Btrfs: fix race setting block group readonly during device replace")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/072 with only one online CPU, it has a pretty high
chance to fail:
btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
- output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
--- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800
+++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800
@@ -1,2 +1,3 @@
QA output created by 072
Silence is golden
+Scrub find errors in "-m dup -d single" test
...
And with the following call trace:
BTRFS info (device dm-5): scrub: started on devid 1
------------[ cut here ]------------
BTRFS: Transaction aborted (error -27)
WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
Call Trace:
__btrfs_end_transaction+0xdb/0x310 [btrfs]
btrfs_end_transaction+0x10/0x20 [btrfs]
btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
scrub_enumerate_chunks+0x264/0x940 [btrfs]
btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
do_vfs_ioctl+0x636/0xaa0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x43/0x50
do_syscall_64+0x79/0xe0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
---[ end trace 166c865cec7688e7 ]---
[CAUSE]
The error number -27 is -EFBIG, returned from the following call chain:
btrfs_end_transaction()
|- __btrfs_end_transaction()
|- btrfs_create_pending_block_groups()
|- btrfs_finish_chunk_alloc()
|- btrfs_add_system_chunk()
This happens because we have used up all space of
btrfs_super_block::sys_chunk_array.
The root cause is, we have the following bad loop of creating tons of
system chunks:
1. The only SYSTEM chunk is being scrubbed
It's very common to have only one SYSTEM chunk.
2. New SYSTEM bg will be allocated
As btrfs_inc_block_group_ro() will check if we have enough space
after marking current bg RO. If not, then allocate a new chunk.
3. New SYSTEM bg is still empty, will be reclaimed
During the reclaim, we will mark it RO again.
4. That newly allocated empty SYSTEM bg get scrubbed
We go back to step 2, as the bg is already mark RO but still not
cleaned up yet.
If the cleaner kthread doesn't get executed fast enough (e.g. only one
CPU), then we will get more and more empty SYSTEM chunks, using up all
the space of btrfs_super_block::sys_chunk_array.
[FIX]
Since scrub/dev-replace doesn't always need to allocate new extent,
especially chunk tree extent, so we don't really need to do chunk
pre-allocation.
To break above spiral, here we introduce a new parameter to
btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
need extra chunk pre-allocation.
For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
@do_chunk_alloc=false.
This should keep unnecessary empty chunks from popping up for scrub.
Also, since there are two parameters for btrfs_inc_block_group_ro(),
add more comment for it.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
struct btrfs_fs_devices::rotating currently is declared as an integer
variable but only used as a boolean.
Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
struct btrfs_fs_devices::seeding currently is declared as an integer
variable but only used as a boolean.
Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type name is misleading, a single entry is named 'cache' while this
normally means a collection of objects. Rename that everywhere. Also the
identifier was quite long, making function prototypes harder to format.
Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For read_one_block_group(), its only caller has already got the item key
to search next block group item.
So we can use that key directly without doing our own convertion on
stack.
Also, since that key used in btrfs_read_block_groups() is vital for
block group item search, add 'const' keyword for that parameter to
prevent read_one_block_group() to modify it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor the work inside the loop of btrfs_read_block_groups() into one
separate function, read_one_block_group().
This allows read_one_block_group to be reused for later BG_TREE feature.
The refactor does the following extra fix:
- Use btrfs_fs_incompat() to replace open-coded feature check
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A nice writeup of the LKMM (Linux Kernel Memory Model) rules for access
once policies can be found here
https://lwn.net/Articles/799218/#Access-Marking%20Policies .
The locked and unlocked access to eb::blocking_writers should be
annotated accordingly, following this:
Writes:
- locked write must use ONCE, may use plain read
- unlocked write must use ONCE
Reads:
- unlocked read must use ONCE
- locked read may use plain read iff not mixed with unlocked read
- unlocked read then locked must use ONCE
There's one difference on the assembly level, where
btrfs_tree_read_lock_atomic and btrfs_try_tree_read_lock used the cached
value and did not reevaluate it after taking the lock. This could have
missed some opportunities to take the lock in case blocking writers
changed between the calls, but the window is just a few instructions
long. As this is in try-lock, the callers handle that.
Signed-off-by: David Sterba <dsterba@suse.com>
The increment and decrement was inherited from previous version that
used atomics, switched in commit 06297d8cef ("btrfs: switch
extent_buffer blocking_writers from atomic to int"). The only possible
values are 0 and 1 so we can set them directly.
The generated assembly (gcc 9.x) did the direct value assignment in
btrfs_set_lock_blocking_write (asm diff after change in 06297d8cef):
5d: test %eax,%eax
5f: je 62 <btrfs_set_lock_blocking_write+0x22>
61: retq
- 62: lock incl 0x44(%rdi)
- 66: add $0x50,%rdi
- 6a: jmpq 6f <btrfs_set_lock_blocking_write+0x2f>
+ 62: movl $0x1,0x44(%rdi)
+ 69: add $0x50,%rdi
+ 6d: jmpq 72 <btrfs_set_lock_blocking_write+0x32>
The part in btrfs_tree_unlock did a decrement because
BUG_ON(blockers > 1) is probably not a strong hint for the compiler, but
otherwise the output looks safe:
- lock decl 0x44(%rdi)
+ sub $0x1,%eax
+ mov %eax,0x44(%rdi)
Signed-off-by: David Sterba <dsterba@suse.com>
There are two ifs that use eb::blocking_writers. As this is a variable
modified inside and outside of locks, we could minimize number of
accesses to avoid problems with getting different results at different
times.
The access here is locked so this can only race with btrfs_tree_unlock
that sets blocking_writers to 0 without lock and unsets the lock owner.
The first branch is taken only if the same thread already holds the
lock, the second if checks for blocking writers. Here we'd either unlock
and wait, or proceed. Both are valid states of the locking protocol.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
When there are no raid1c3 or raid1c4 block groups left after balance
(either convert or with other filters applied), remove the incompat bit.
This is already done for RAID56, do the same for RAID1C34.
Signed-off-by: David Sterba <dsterba@suse.com>
The new raid1c3 and raid1c4 profiles are backward incompatible and the
name shall be 'raid1c34', the status can be found in the global
supported features in /sys/fs/btrfs/features or in the per-filesystem
directory.
Signed-off-by: David Sterba <dsterba@suse.com>
Add new block group profile to store 4 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the 2-
and 3-copy RAID1.
The minimum number of devices is 4, the maximum number of devices/chunks
that can be lost/damaged is 3. There is no comparable traditional RAID
level, the profile is added for future needs to accompany triple-parity
and beyond.
Signed-off-by: David Sterba <dsterba@suse.com>
Add new block group profile to store 3 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the
2-copy RAID1.
The minimum number of devices is 3, the maximum number of devices/chunks
that can be lost/damaged is 2. Like RAID6 but with 33% space
utilization.
Signed-off-by: David Sterba <dsterba@suse.com>
In commit "Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios",
cow_file_range_async gained wbc as a parameter and this makes passing
write flags redundant. Set it inside the function and remove the
parameter.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__extent_writepage reads write flags from wbc and passes both to
__extent_writepage_io. This makes write_flags redundant and we can
remove it.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Backreference walking, which is used by send to figure if it can issue
clone operations instead of write operations, can be very slow and use
too much memory when extents have many references. This change simply
skips backreference walking when an extent has more than 64 references,
in which case we fallback to a write operation instead of a clone
operation. This limit is conservative and in practice I observed no
signicant slowdown with up to 100 references and still low memory usage
up to that limit.
This is a temporary workaround until there are speedups in the backref
walking code, and as such it does not attempt to add extra interfaces or
knobs to tweak the threshold.
Reported-by: Atemu <atemu.main@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For send we currently skip clone operations when the source and
destination files are the same. This is so because clone didn't support
this case in its early days, but support for it was added back in May
2013 by commit a96fbc7288 ("Btrfs: allow file data clone within a
file"). This change adds support for it.
Example:
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt/sdd
$ xfs_io -f -c "pwrite -S 0xab -b 64K 0 64K" /mnt/sdd/foobar
$ xfs_io -c "reflink /mnt/sdd/foobar 0 64K 64K" /mnt/sdd/foobar
$ btrfs subvolume snapshot -r /mnt/sdd /mnt/sdd/snap
$ mkfs.btrfs -f /dev/sde
$ mount /dev/sde /mnt/sde
$ btrfs send /mnt/sdd/snap | btrfs receive /mnt/sde
Without this change file foobar at the destination has a single 128Kb
extent:
$ filefrag -v /mnt/sde/snap/foobar
Filesystem type is: 9123683e
File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 31: 0.. 31: 32: last,unknown_loc,delalloc,eof
/mnt/sde/snap/foobar: 1 extent found
With this we get a single 64Kb extent that is shared at file offsets 0
and 64K, just like in the source filesystem:
$ filefrag -v /mnt/sde/snap/foobar
Filesystem type is: 9123683e
File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 15: 3328.. 3343: 16: shared
1: 16.. 31: 3328.. 3343: 16: 3344: last,shared,eof
/mnt/sde/snap/foobar: 2 extents found
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When deleting large files (which cross block group boundary) with
discard mount option, we find some btrfs_discard_extent() calls only
trimmed part of its space, not the whole range:
btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
type: bbio->map_type, in above case, it's SINGLE DATA.
start: Logical address of this trim
len: Logical length of this trim
trimmed: Physically trimmed bytes
ratio: trimmed / len
Thus leaving some unused space not discarded.
[CAUSE]
When discard mount option is specified, after a transaction is fully
committed (super block written to disk), we begin to cleanup pinned
extents in the following call chain:
btrfs_commit_transaction()
|- btrfs_finish_extent_commit()
|- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
|- btrfs_discard_extent()
However, pinned extents are recorded in an extent_io_tree, which can
merge adjacent extent states.
When a large file gets deleted and it has adjacent file extents across
block group boundary, we will get a large merged range like this:
|<--- BG1 --->|<--- BG2 --->|
|//////|<-- Range to discard --->|/////|
To discard that range, we have the following calls:
btrfs_discard_extent()
|- btrfs_map_block()
| Returned bbio will end at BG1's end. As btrfs_map_block()
| never returns result across block group boundary.
|- btrfs_issuse_discard()
Issue discard for each stripe.
So we will only discard the range in BG1, not the remaining part in BG2.
Furthermore, this bug is not that reliably observed, for above case, if
there is no other extent in BG2, BG2 will be empty and btrfs will trim
all space of BG2, covering up the bug.
[FIX]
- Allow __btrfs_map_block_for_discard() to modify @length parameter
btrfs_map_block() uses its @length paramter to notify the caller how
many bytes are mapped in current call.
With __btrfs_map_block_for_discard() also modifing the @length,
btrfs_discard_extent() now understands when to do extra trim.
- Call btrfs_map_block() in a loop until we hit the range end Since we
now know how many bytes are mapped each time, we can iterate through
each block group boundary and issue correct trim for each range.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old code goes:
offset = logical - em->start;
length = min_t(u64, em->len - offset, length);
Where @length calculation is dependent on offset, it can take reader
several more seconds to find it's just the same code as:
offset = logical - em->start;
length = min_t(u64, em->start + em->len - logical, length);
Use above code to make the length calculate independent from other
variable, thus slightly increase the readability.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In check_extent_data_item(), we read file extent type without verifying
if the item size is valid.
Add such check to ensure the file extent type we read is correct.
The check is not as accurate as we need to cover both inline and regular
extents, so it only checks if the item size is larger or equal to inline
header.
So the existing size checks on inline/regular extents are still needed.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The "&fs_info->dev_replace.rwsem" and "&dev_replace->rwsem" refer to
the same lock but Smatch is not clever enough to figure that out so it
leads to static checker warnings. It's better to use it consistently
anyway.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The backup_root_index member stores the index at which the backup root
should be saved upon next transaction commit. However, there is a
small deviation from this behavior in the form of a check in
backup_super_roots which checks if current root generation equals to the
generation of the previous root. This can trigger in the following
scenario:
slot0: gen-2
slot1: gen-1
slot2: gen
slot3: unused
Now suppose slot3 (which is also the root specified in the super block)
is corrupted hence init_tree_roots chooses to use the backup root at
slot2, meaning read_backup_root will read slot2 and assign the
superblock generation to gen-1. Despite this backup_root_index will
point at slot3 because its init happens in init_backup_root_slot, long
before any parsing of the backup roots occur. Then on next transaction
start, gen-1 will be incremented by 1 making the root's generation
equal gen. Subsequently, on transaction commit the following check
triggers:
if (btrfs_backup_tree_root_gen(root_backup) ==
btrfs_header_generation(info->tree_root->node))
This causes the 'next_backup', which is the index at which the backup is
going to be written to, to set to last_backup, which will be slot2.
All of this is a very confusing way of expressing the following
invariant:
Always write a backup root at the index following the last used backup
root.
This commit streamlines this logic by setting backup_root_index to the
next index after the one used for mount.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old name name was an awful misnomer because it didn't really find
the oldest super backup per-se but rather its slot. For example if we
have:
slot0: gen - 2
slot1: gen - 1
slot2: gen
slot3: empty
init_backup_root_slot will return slot3 and not slot0.
The new name is more appropriate since the function doesn't care whether
there is a valid backup in the returned slot or not.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function has been superseded by previous commits and is no longer
used so just remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the filesystem is not well formed and no trees are loaded it's
pointless holding the objectid_mutex. Just remove its usage.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The code responsible for reading and initializing tree roots is
scattered in open_ctree among 2 labels, emulating a loop. This is rather
confusing to reason about. Instead, factor the code to a new function,
init_tree_roots which implements the same logical flow.
There are a couple of notable differences, namely:
* Instead of using next_backup_root it's using the newly introduced
read_backup_root.
* If read_backup_root returns an error init_tree_roots propagates the
error and there is no special handling of that case e.g. the code jumps
straight to 'fail_tree_roots' label. The old code, however, was
(erroneously) jumping to 'fail_block_groups' label if next_backup_root
did fail, this was unnecessary since the tree roots init logic doesn't
modify the state of block groups.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function will replace next_root_backup with a much saner/cleaner
interface.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's no longer needed following cleanups around find_newest_backup_root
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Backup roots are always written in a circular manner. By definition we
can only ever have 1 backup root whose generation equals to that of the
superblock. Hence, the 'if' in the for loop will trigger at most once.
This is sufficient to return the newest backup root.
Furthermore the newest_gen parameter is always set to the generation of
the superblock. This value can be obtained from the fs_info.
This patch removes the unnecessary code dealing with the wraparound
case and makes 'newest_gen' a local variable.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The inode delalloc mutex was added a long time ago by commit f248679e86
("Btrfs: add a delalloc mutex to inodes for delalloc reservations"), and
the reason for its introduction is not very clear from the change log. It
claims it solves bogus warnings from lockdep, however it lacks an example
report/warning from lockdep, or any explanation.
Since we have enough concurrentcy protection from the locks of the space
info and block reserve objects, and such lockdep warnings don't seem to
exist anymore (at least on a 5.3 kernel I couldn't get them with fstests,
ltp, fs_mark, etc), remove it, simplifying things a bit and decreasing
the size of the btrfs_inode structure. With some quick fio tests doing
direct IO and mmap writes I couldn't observe any significant performance
increase either (direct IO writes that don't increase the file's size
don't hold the inode's lock for their entire duration and mmap writes
don't hold the inode's lock at all), which are the only type of writes
that could see any performance gain due to less serialization.
Review feedback from Josef:
The problem was taking the i_mutex in mmap, which is how I was
protecting delalloc reservations originally. The delalloc mutex didn't
come with all of the other dependencies. That's what the lockdep
messages were about, removing the lock isn't going to make them appear
again.
We _had_ to lock around this because we used to do tricks to keep from
over-reserving, and if we didn't serialize delalloc reservations we'd
end up with ugly accounting problems when we tried to clean things up.
However with my recentish changes this isn't the case anymore. Every
operation is responsible for reserving its space, and then adding it to
the inode. Then cleaning up is straightforward and can't be mucked up
by other users. So we no longer need the delalloc mutex to safe us from
ourselves.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is not used anymore since commit 957780eb27 ("Btrfs: introduce
ticketed enospc infrastructure"), so just remove it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_statfs() we cache fs_info::space_info in a local variable only
to use it once in a list_for_each_rcu() statement.
Not only is the local variable unnecessary it even makes the code harder
to follow as it's not clear which list it is iterating.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The on-disk format of block group item makes use of the key that stores
the offset and length. This is further used in the code, although this
makes thing harder to understand. The key is also packed so the
offset/length is not properly aligned as u64.
Add start (key.objectid) and length (key.offset) members to block group
and remove the embedded key. When the item is searched or written, a
local variable for key is used.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Accessors defined by BTRFS_SETGET_FUNCS take a raw extent buffer and
manipulate the items there, there's no special prefix required. The
block group accessors had _disk_ because previously the names were
occupied by the on-stack accessors. As this has been addressed in the
previous patch, we can now unify the naming.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All accessors defined by BTRFS_SETGET_STACK_FUNCS contain _stack_ in the
name, the block group ones were not following that scheme, so let's
switch them.
Signed-off-by: David Sterba <dsterba@suse.com>
The members ::used and ::flags are now in the block group cache
structure, the last one is chunk_objectid, but that's set to a fixed
value and otherwise unused. The item is constructed from a local
variable before write, so we can remove the embedded one from block
group.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The flags are read from the item that's embedded to block group struct,
but the item will be removed. Use the ::flags after read and before
write.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For unknown reasons, the member 'used' in the block group struct is
stored in the b-tree item and accessed everywhere using the special
accessor helper. Let's unify it and make it a regular member and only
update the item before writing it to the tree.
The item is still being used for flags and chunk_objectid, there's some
duplication until the item is removed in following patches.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The last user of btrfs_bio::flags was removed in commit 326e1dbb57
("block: remove management of bi_remaining when restoring original
bi_end_io"), remove it.
(Tagged for stable as the structure is heavily used and space savings
are desirable.)
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently all the checksum algorithms generate a fixed size digest size
and we use it. The on-disk format can hold up to BTRFS_CSUM_SIZE bytes
and BLAKE2b produces digest of 512 bits by default. We can't do that and
will use the blake2b-256, this needs to be passed to the crypto API.
Separate that from the base algorithm name and add a member to request
specific driver, in this case with the digest size.
The only place that uses the driver name is the crypto API setup.
Signed-off-by: David Sterba <dsterba@suse.com>
Show the used driver for the checksum algorithm for the filesystem in
sysfs file /sys/fs/btrfs/UUID/features/checksum, eg.
crc32c (crc32c-generic)
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Export supported checksum algorithms via sysfs in the list of static
features:
/sys/fs/btrfs/features/supported_checksums
Space spearated list of checksum algorithm names.
Co-developed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Add sha256 to the list of possible checksumming algorithms used by BTRFS.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add xxhash64 to the list of possible checksumming algorithms used by
BTRFS.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To remove use of extent_map::bdev we need to find a replacement, and the
latest_bdev is the only one we can use here, because inode::i_bdev and
superblock::s_bdev are NULL.
The DIO code uses bdev in two places:
* to read blocksize to perform alignment checks in
do_blockdev_direct_IO, but we do them in btrfs code before any call to
DIO
* in the following call chain:
do_direct_IO
get_more_blocks
sdio->get_block() <-- this is btrfs_get_blocks_direct
subsequently the map_bh->b_dev member is used in clean_bdev_aliases
and dio_new_bio to set the bio's bdev to that of the buffer_head.
However, because we have provided a submit function dio_bio_submit
calls our submission function and ignores the bdev.
So it's safe to pass any valid bdev that's used within the filesystem.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a preparatory patch for removing extent_map::bdev. There's some
history behind the code so this is only precaution to catch if things
break before the actual removal happens.
Logically, comparing a raw low-level block device (bdev) does not make
sense for extent maps (high-level objects). This had no effect in
practice but was quite confusing in the code. The lookup_map is set iff
EXTENT_FLAG_FS_MAPPING is set.
The two pointers were stored in the same bytes and used potentially in
two meanings. Now they're split, so the asserts are in place to check
that the condition will not change.
The lookup map pointer misused bdev, this has been changed in commit
95617d6932 ("btrfs: cleanup, stop casting for extent_map->lookup
everywhere") to the explicit type. But the semantics hasn't changed and
bdev was not actually used to decide if maps are mergeable.
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of checking if we've read a BTRFS_CHUNK_ITEM_KEY from disk and
then process it we could just bail out early if the read disk key wasn't
a BTRFS_CHUNK_ITEM_KEY.
This removes a level of indentation and makes the code nicer to read.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_may_alloc_data_chunk() we're checking if the chunk type is of
type BTRFS_BLOCK_GROUP_DATA and if it is we process it.
Instead of checking if the chunk type is a BTRFS_BLOCK_GROUP_DATA chunk
we can negate the check and bail out early if it isn't.
This makes the code a bit more readable.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In lock_stripe_add() we're caching the bucket for the stripe hash table
just for a single call to dereference the stripe hash.
If we just directly call rbio_bucket() we can safe the pointless local
variable.
Also move the dereferencing of the stripe hash outside of the variable
declaration block to not break over the 80 characters limit.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In lock_stripe_add() we're traversing the stripe hash list and check if
the current list element's raid_map equals is equal to the raid bio's
raid_map. If both are equal we continue processing.
If we'd check for inequality instead of equality we can reduce one level
of indentation.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using an input pointer parameter as the return value and have
an int as the return type of find_desired_extent, rework the function to
directly return the found offset. Doing that the 'ret' variable in
btrfs_llseek_file can be removed. Additional (subjective) benefit is
that btrfs' llseek function now resemebles those of the other major
filesystems.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Handle SEEK_END/SEEK_CUR in a single 'default' case by directly
returning from generic_file_llseek. This makes the 'out' label
redundant. Finally return directly the vale from vfs_setpos. No
semantic changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Modifying the file position is done on a per-file basis. This renders
holding the inode lock for writing useless and makes the performance of
concurrent llseek's abysmal.
Fix this by holding the inode for read. This provides protection against
concurrent truncates and find_desired_extent already includes proper
extent locking for the range which ensures proper locking against
concurrent writes. SEEK_CUR and SEEK_END can be done lockessly.
The former is synchronized by file::f_lock spinlock. SEEK_END is not
synchronized but atomic, but that's OK since there is not guarantee that
SEEK_END will always be at the end of the file in the face of tail
modifications.
This change brings ~82% performance improvement when doing a lot of
parallel fseeks. The workload essentially does:
for (d=0; d<num_seek_read; d++)
{
/* offset %= 16777216; */
fseek (f, 256 * d % 16777216, SEEK_SET);
fread (buffer, 64, 1, f);
}
Without patch:
num workprocesses = 16
num fseek/fread = 8000000
step = 256
fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
real 0m41.412s
user 0m28.777s
sys 2m16.510s
With patch:
num workprocesses = 16
num fseek/fread = 8000000
step = 256
fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
real 0m11.479s
user 0m27.629s
sys 0m21.040s
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can infer the ops from the type that is now passed to all functions
that would need it, this makes workspace_manager::ops redundant and can
be removed.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace indirect calls to free_workspace by switch and calls to the
specific callbacks. This is mainly to get rid of the indirection due to
spectre vulnerability mitigations.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can infer the workspace_manager from type and the type will be used
in the following patch to call a common helper for free_workspace.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace indirect calls to alloc_workspace by switch and calls to the
specific callbacks. This is mainly to get rid of the indirection due to
spectre vulnerability mitigations.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can infer the workspace_manager from type and the type will be used
in the following patch to call a common helper for alloc_workspace.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to get_workspace, majority of the callbacks is trivial, we don't
gain anything by the indirection, so replace them by a switch function.
Trivial callback implementations use the helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Majority of the callbacks is trivial, we don't gain anything by the
indirection, so replace them by a switch function.
ZLIB needs to adjust level in the callback and ZSTD workspace management
is complex, the rest is call to the helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The indirect calls will be replaced by a switch in compression.c.
(Switch is faster than indirect calls with when Spectre mitigations are
enabled).
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace loop calling to all algos with a list of direct calls to the
cleanup manager callback. When that becomes trivial it is replaced by
direct call to the helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the access to the workspace structures, we can look it up together
with the compression ops inside the workspace manager cleanup helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace loop calling to all algos with a list of direct calls to the
init manager callback. When that becomes trivial it is replaced by
direct call to the helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the access to the workspace structures, we can look it up together
with the compression ops inside the workspace manager init helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a lot of indirection when the generic code calls into
algo-specific callbacks to reach the private workspace manager structure
and back to the generic code.
To simplify that, export the workspace manager for heuristic, LZO and
ZLIB, while ZSTD is going to use it's own manager.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The indirect calls bring some overhead due to spectre vulnerability
mitigations. The number of cases is small and below the threshold
(10-20) where indirect call would be better.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Export compress_pages, decompress_bio and decompress callbacks for all
compression algos. The indirect calls will be replaced by a switch.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When free'ing extents in a block group we check to see if the block
group is not cached, and then cache it if we need to. However we'll
just carry on as long as we're loading the cache. This is problematic
because we are dirtying the block group here. If we are fast enough we
could do a transaction commit and clear the free space cache while we're
still loading the space cache in another thread. This truncates the
free space inode, which will keep it from loading the space cache.
Fix this by using the btrfs_block_group_cache_done helper so that we try
to load the space cache unconditionally here, which will result in the
caller waiting for the fast caching to complete and keep us from
truncating the free space inode.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While testing 5.2 we ran into the following panic
[52238.017028] BUG: kernel NULL pointer dereference, address: 0000000000000001
[52238.105608] RIP: 0010:drop_buffers+0x3d/0x150
[52238.304051] Call Trace:
[52238.308958] try_to_free_buffers+0x15b/0x1b0
[52238.317503] shrink_page_list+0x1164/0x1780
[52238.325877] shrink_inactive_list+0x18f/0x3b0
[52238.334596] shrink_node_memcg+0x23e/0x7d0
[52238.342790] ? do_shrink_slab+0x4f/0x290
[52238.350648] shrink_node+0xce/0x4a0
[52238.357628] balance_pgdat+0x2c7/0x510
[52238.365135] kswapd+0x216/0x3e0
[52238.371425] ? wait_woken+0x80/0x80
[52238.378412] ? balance_pgdat+0x510/0x510
[52238.386265] kthread+0x111/0x130
[52238.392727] ? kthread_create_on_node+0x60/0x60
[52238.401782] ret_from_fork+0x1f/0x30
The page we were trying to drop had a page->private, but had no
page->mapping and so called drop_buffers, assuming that we had a
buffer_head on the page, and then panic'ed trying to deref 1, which is
our page->private for data pages.
This is happening because we're truncating the free space cache while
we're trying to load the free space cache. This isn't supposed to
happen, and I'll fix that in a followup patch. However we still
shouldn't allow those sort of mistakes to result in messing with pages
that do not belong to us. So add the page->mapping check to verify that
we still own this page after dropping and re-acquiring the page lock.
This page being unlocked as:
btrfs_readpage
extent_read_full_page
__extent_read_full_page
__do_readpage
if (!nr)
unlock_page <-- nr can be 0 only if submit_extent_page
returns an error
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add callchain ]
Signed-off-by: David Sterba <dsterba@suse.com>
In the fixup worker, if we fail to mark the range as delalloc in the io
tree, we must release the previously reserved metadata, as well as update
the outstanding extents counter for the inode, otherwise we leak metadata
space.
In pratice we can't return an error from btrfs_set_extent_delalloc(),
which is just a wrapper around __set_extent_bit(), as for most errors
__set_extent_bit() does a BUG_ON() (or panics which hits a BUG_ON() as
well) and returning an -EEXIST error doesn't happen in this case since
the exclusive bits parameter always has a value of 0 through this code
path. Nevertheless, just fix the error handling in the fixup worker,
in case one day __set_extent_bit() can return an error to this code
path.
Fixes: f3038ee3a3 ("btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a buffered write it's possible to leave the subv_writers
counter of the root, used for synchronization between buffered nocow
writers and snapshotting. This happens in an exceptional case like the
following:
1) We fail to allocate data space for the write, since there's not
enough available data space nor enough unallocated space for allocating
a new data block group;
2) Because of that failure, we try to go to NOCOW mode, which succeeds
and therefore we set the local variable 'only_release_metadata' to true
and set the root's sub_writers counter to 1 through the call to
btrfs_start_write_no_snapshotting() made by check_can_nocow();
3) The call to btrfs_copy_from_user() returns zero, which is very unlikely
to happen but not impossible;
4) No pages are copied because btrfs_copy_from_user() returned zero;
5) We call btrfs_end_write_no_snapshotting() which decrements the root's
subv_writers counter to 0;
6) We don't set 'only_release_metadata' back to 'false' because we do
it only if 'copied', the value returned by btrfs_copy_from_user(), is
greater than zero;
7) On the next iteration of the while loop, which processes the same
page range, we are now able to allocate data space for the write (we
got enough data space released in the meanwhile);
8) After this if we fail at btrfs_delalloc_reserve_metadata(), because
now there isn't enough free metadata space, or in some other place
further below (prepare_pages(), lock_and_cleanup_extent_if_need(),
btrfs_dirty_pages()), we break out of the while loop with
'only_release_metadata' having a value of 'true';
9) Because 'only_release_metadata' is 'true' we end up decrementing the
root's subv_writers counter to -1 (through a call to
btrfs_end_write_no_snapshotting()), and we also end up not releasing the
data space previously reserved through btrfs_check_data_free_space().
As a consequence the mechanism for synchronizing NOCOW buffered writes
with snapshotting gets broken.
Fix this by always setting 'only_release_metadata' to false at the start
of each iteration.
Fixes: 8257b2dc3c ("Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume")
Fixes: 7ee9e4405f ("Btrfs: check if we can nocow if we don't have data space")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some functions are doing some unnecessary indirection to reach the
btrfs_fs_info struct. Change these functions to receive a btrfs_fs_info
struct instead of a *file.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need int argument bool shall do in free_root_pointers(). And
rename the argument as it confused two people.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The compression type upper limit constant is the same as the last value
and this is confusing. In order to keep coding style consistent, use
BTRFS_NR_COMPRESS_TYPES as the total number that follows the idom of
'NR' being one more than the last value.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use enum to replace macro definitions of extent types.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
DEFINE_HASHTABLE itself has already included initialization code,
we don't have to call hash_init() again, so remove it.
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is used only during the final phase of freespace cache
writeout. This is necessary since using the plain btrfs_join_transaction
api is deadlock prone. The deadlock looks like:
T1:
btrfs_commit_transaction
commit_cowonly_roots
btrfs_write_dirty_block_groups
btrfs_wait_cache_io
__btrfs_wait_cache_io
btrfs_wait_ordered_range <-- Triggers ordered IO for freespace
inode and blocks transaction commit
until freespace cache writeout
T2: <-- after T1 has triggered the writeout
finish_ordered_fn
btrfs_finish_ordered_io
btrfs_join_transaction <--- this would block waiting for current
transaction to commit, but since trans
commit is waiting for this writeout to
finish
The special purpose functions prevents it by simply skipping the "wait
for writeout" since it's guaranteed the transaction won't proceed until
we are done.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using an ASSERT in btrfs_pin_extent allows to more stringently observe
whether the function is called under a transaction or not.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper is trivial and we can understand what the atomic_inc on
something named refs does.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During a cyclic writeback, extent_write_cache_pages() uses done_index
to update the writeback_index after the current run is over. However,
instead of current index + 1, it gets to to the current index itself.
Unfortunately, this, combined with returning on EOF instead of looping
back, can lead to the following pathlogical behavior.
1. There is a single file which has accumulated enough dirty pages to
trigger balance_dirty_pages() and the writer appending to the file
with a series of short writes.
2. balance_dirty_pages kicks in, wakes up background writeback and sleeps.
3. Writeback kicks in and the cursor is on the last page of the dirty
file. Writeback is started or skipped if already in progress. As
it's EOF, extent_write_cache_pages() returns and the cursor is set
to done_index which is pointing to the last page.
4. Writeback is done. Nothing happens till balance_dirty_pages
finishes, at which point we go back to #1.
This can almost completely stall out writing back of the file and keep
the system over dirty threshold for a long time which can mess up the
whole system. We encountered this issue in production with a package
handling application which can reliably reproduce the issue when
running under tight memory limits.
Reading the comment in the error handling section, this seems to be to
avoid accidentally skipping a page in case the write attempt on the
page doesn't succeed. However, this concern seems bogus.
On each page, the code either:
* Skips and moves onto the next page.
* Fails issue and sets done_index to index + 1.
* Successfully issues and continue to the next page if budget allows
and not EOF.
IOW, as long as it's not EOF and there's budget, the code never
retries writing back the same page. Only when a page happens to be
the last page of a particular run, we end up retrying the page, which
can't possibly guarantee anything data integrity related. Besides,
cyclic writes are only used for non-syncing writebacks meaning that
there's no data integrity implication to begin with.
Fix it by always setting done_index past the current page being
processed.
Note that this problem exists in other writepages too.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 4617ea3a52 (" Btrfs: fix necessary chunk tree space calculation
when allocating a chunk") removed the is_allocation argument from
check_system_chunk, since the formula for reserving the necessary space
for allocation or removing a chunk would be the same.
So, rework the comment by removing the mention of is_allocation
argument.
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unlike read time tree checker errors, write time error can't be
inspected by "btrfs inspect dump-tree", so we need extra information to
determine what's going wrong.
The patch will add the following output for write time tree checker
error:
- The content of the offending tree block
To help determining if it's a false alert.
- Kernel WARN_ON() for debug build
This is helpful for us to detect unexpected write time tree checker
error, especially fstests could catch the dmesg.
Since the WARN_ON() is only triggered for write time tree checker,
test cases utilizing dm-error won't trigger this WARN_ON(), thus no
extra noise.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor the check for prev_key->objectid of the following key types
into one function, check_prev_ino():
- EXTENT_DATA
- INODE_REF
- DIR_INDEX
- DIR_ITEM
- XATTR_ITEM
Also add the check of prev_key for INODE_REF.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_write_locked_range() is used when we're falling back to buffered
IO from inside of compression. It allocates its own wbc and should
associate it with the inode's i_wb to make sure the IO goes down from
the correct cgroup.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Async CRCs and compression submit IO through helper threads, which means
they have IO priority inversions when cgroup IO controllers are in use.
This flags all of the writes submitted by btrfs helper threads as
REQ_CGROUP_PUNT. submit_bio() will punt these to dedicated per-blkcg
work items to avoid the priority inversion.
For the compression code, we take a reference on the wbc's blkg css and
pass it down to the async workers.
For the async CRCs, the bio already has the correct css, we just need to
tell the block layer to use REQ_CGROUP_PUNT.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
Modified-and-reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs writepages function collects a large range of pages flagged
for delayed allocation, and then sends them down through the COW code
for processing. When compression is on, we allocate one async_chunk
structure for every 512K, and then run those pages through the
compression code for IO submission.
writepages starts all of this off with a single page, locked by the
original call to extent_write_cache_pages(), and it's important to keep
track of this page because it has already been through
clear_page_dirty_for_io().
The btrfs async_chunk struct has a pointer to the locked_page, and when
we're redirtying the page because compression had to fallback to
uncompressed IO, we use page->index to decide if a given async_chunk
struct really owns that page.
But, this is racey. If a given delalloc range is broken up into two
async_chunks (chunkA and chunkB), we can end up with something like
this:
compress_file_range(chunkA)
submit_compress_extents(chunkA)
submit compressed bios(chunkA)
put_page(locked_page)
compress_file_range(chunkB)
...
Or:
async_cow_submit
submit_compressed_extents <--- falls back to buffered writeout
cow_file_range
extent_clear_unlock_delalloc
__process_pages_contig
put_page(locked_pages)
async_cow_submit
The end result is that chunkA is completed and cleaned up before chunkB
even starts processing. This means we can free locked_page() and reuse
it elsewhere. If we get really lucky, it'll have the same page->index
in its new home as it did before.
While we're processing chunkB, we might decide we need to fall back to
uncompressed IO, and so compress_file_range() will call
__set_page_dirty_nobufers() on chunkB->locked_page.
Without cgroups in use, this creates as a phantom dirty page, which
isn't great but isn't the end of the world. What can happen, it can go
through the fixup worker and the whole COW machinery again:
in submit_compressed_extents():
while (async extents) {
...
cow_file_range
if (!page_started ...)
extent_write_locked_range
else if (...)
unlock_page
continue;
This hasn't been observed in practice but is still possible.
With cgroups in use, we might crash in the accounting code because
page->mapping->i_wb isn't set.
BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
IP: percpu_counter_add_batch+0x11/0x70
PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 16 PID: 2172 Comm: rm Not tainted
RIP: 0010:percpu_counter_add_batch+0x11/0x70
RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
FS: 00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
account_page_cleaned+0x15b/0x1f0
__cancel_dirty_page+0x146/0x200
truncate_cleanup_page+0x92/0xb0
truncate_inode_pages_range+0x202/0x7d0
btrfs_evict_inode+0x92/0x5a0
evict+0xc1/0x190
do_unlinkat+0x176/0x280
do_syscall_64+0x63/0x1a0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
The fix here is to make asyc_chunk->locked_page NULL everywhere but the
one async_chunk struct that's allowed to do things to the locked page.
Link: https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/
Fixes: 771ed689d2 ("Btrfs: Optimize compressed writeback and reads")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
[ update changelog from mail thread discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're not using btrfs_schedule_bio() anymore, delete all the
code that supported it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_schedule_bio() hands IO off to a helper thread to do the actual
submit_bio() call. This has been used to make sure async crc and
compression helpers don't get stuck on IO submission. To maintain good
performance, over time the IO submission threads duplicated some IO
scheduler characteristics such as high and low priority IOs and they
also made some ugly assumptions about request allocation batch sizes.
All of this cost at least one extra context switch during IO submission,
and doesn't fit well with the modern blkmq IO stack. So, this commit stops
using btrfs_schedule_bio(). We may need to adjust the number of async
helper threads for crcs and compression, but long term it's a better
path.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The attribute is more relaxed than const and the functions could
dereference pointers, as long as the observable state is not changed. We
do have such functions, based on -Wsuggest-attribute=pure .
The visible effects of this patch are negligible, there are differences
in the assembly but hard to summarize.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For some reason the attribute is called __attribute_const__ and not
__const, marks functions that have no observable effects on program
state, IOW not reading pointers, just the arguments and calculating a
value. Allows the compiler to do some optimizations, based on
-Wsuggest-attribute=const . The effects are rather small, though, about
60 bytes decrese of btrfs.ko.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The attribute can mark functions supposed to be called rarely if at all
and the text can be moved to sections far from the other code. The
attribute has been added to several functions already, this patch is
based on hints given by gcc -Wsuggest-attribute=cold.
The net effect of this patch is decrease of btrfs.ko by 1000-1300,
depending on the config options.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter is now always set to NULL and could be dropped. The last
user was get_default_root but that got reworked in 05dbe6837b ("Btrfs:
unify subvol= and subvolid= mounting") and the parameter became unused.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We hit the following warning while running down a different problem
[ 6197.175850] ------------[ cut here ]------------
[ 6197.185082] refcount_t: underflow; use-after-free.
[ 6197.194704] WARNING: CPU: 47 PID: 966 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60
[ 6197.521792] Call Trace:
[ 6197.526687] __btrfs_release_delayed_node+0x76/0x1c0
[ 6197.536615] btrfs_kill_all_delayed_nodes+0xec/0x130
[ 6197.546532] ? __btrfs_btree_balance_dirty+0x60/0x60
[ 6197.556482] btrfs_clean_one_deleted_snapshot+0x71/0xd0
[ 6197.566910] cleaner_kthread+0xfa/0x120
[ 6197.574573] kthread+0x111/0x130
[ 6197.581022] ? kthread_create_on_node+0x60/0x60
[ 6197.590086] ret_from_fork+0x1f/0x30
[ 6197.597228] ---[ end trace 424bb7ae00509f56 ]---
This is because the free side drops the ref without the lock, and then
takes the lock if our refcount is 0. So you can have nodes on the tree
that have a refcount of 0. Fix this by zero'ing out that element in our
temporary array so we don't try to kill it again.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Its very helpful if we had logged the device scanner process name to
debug the race condition between the systemd-udevd scan and the user
initiated device forget command.
This patch adds process name and pid to the scan message.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add pid to the message ]
Signed-off-by: David Sterba <dsterba@suse.com>
That function adds unnecessary indirection between backref_in_log and
the caller. Furthermore it also "downgrades" backref_in_log's return
value to a boolean, when in fact it could very well be an error.
Rectify the situation by simply opencoding name_in_log_ref in
replay_one_name and properly handling possible return codes from
backref_in_log.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
This function can return a negative error value if btrfs_search_slot
errors for whatever reason or if btrfs_alloc_path runs out of memory.
This is currently problemattic because backref_in_log is treated by its
callers as if it returns boolean.
Fix this by adding proper error handling in callers. That also enables
the function to return the direct error code from btrfs_search_slot.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Direct replacement, though note that the inside of the loop in
btrfs_find_name_in_backref is organized in a slightly different way but
is equvalent.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The state was introduced in commit 4a9d8bdee3 ("Btrfs: make the state
of the transaction more readable"), then in commit 302167c50b
("btrfs: don't end the transaction for delayed refs in throttle") the
state is completely removed.
So we can just clean up the state since it's only compared but never
set.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add an overview of the basic btrfs transaction transitions, including
the following states:
- No transaction states
- Transaction N [[TRANS_STATE_RUNNING]]
- Transaction N [[TRANS_STATE_COMMIT_START]]
- Transaction N [[TRANS_STATE_COMMIT_DOING]]
- Transaction N [[TRANS_STATE_UNBLOCKED]]
- Transaction N [[TRANS_STATE_COMPLETED]]
For each state, the comment will include:
- Basic explaination about current state
- How to go next stage
- What will happen if we call various start_transaction() functions
- Relationship to transaction N+1
This doesn't provide tech details, but serves as a cheat sheet for
reader to get into the code a little easier.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace is_power_of_2 with the helper that is self-documenting and
remove the open coded call in alloc_profile_is_valid.
Signed-off-by: David Sterba <dsterba@suse.com>
As is_power_of_two takes unsigned long, it's not safe on 32bit
architectures, but we could pass any u64 value in seveal places. Add a
separate helper and also an alias that better expresses the purpose for
which the helper is used.
Signed-off-by: David Sterba <dsterba@suse.com>
When balance reduces the number of copies of metadata, it reduces the
redundancy, use the term redundancy instead of integrity.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function belongs to the family of locking functions, so move it
there. The 'noinline' keyword is dropped as it's now an exported
function that does not need it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function belongs to the family of locking functions, so move it
there. The 'noinline' keyword is dropped as it's now an exported
function that does not need it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_assert_tree_locked is used outside of the locking
code so it is exported, however we can make it static inine as it's
fairly trivial.
This is the only locking assertion used in release builds, inlining
improves the text size by 174 bytes and reduces stack consumption in the
callers.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I've noticed that none of the btrfs_assert_*lock* debugging helpers is
inlined, despite they're short and mostly a value update. Making them
inline shaves 67 from the text size, reduces stack consumption and
perhaps also slightly improves the performance due to avoiding
unnecessary calls.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit ac0c7cf8be ("btrfs: fix crash when tracepoint arguments are
freed by wq callbacks") added a void pointer, wtag, which is passed into
trace_btrfs_all_work_done() instead of the freed work item. This is
silly for a few reasons:
1. The freed work item still has the same address.
2. work is still in scope after it's freed, so assigning wtag doesn't
stop anyone from using it.
3. The tracepoint has always taken a void * argument, so assigning wtag
doesn't actually make things any more type-safe. (Note that the
original bug in commit bc074524e1 ("btrfs: prefix fsid to all trace
events") was that the void * was implicitly casted when it was passed
to btrfs_work_owner() in the trace point itself).
Instead, let's add some clearer warnings as comments.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 9e0af23764 ("Btrfs: fix task hang under heavy compressed
write") worked around the issue that a recycled work item could get a
false dependency on the original work item due to how the workqueue code
guarantees non-reentrancy. It did so by giving different work functions
to different types of work.
However, the fixes in the previous few patches are more complete, as
they prevent a work item from being recycled at all (except for a tiny
window that the kernel workqueue code handles for us). This obsoletes
the previous fix, so we don't need the unique helpers for correctness.
The only other reason to keep them would be so they show up in stack
traces, but they always seem to be optimized to a tail call, so they
don't show up anyways. So, let's just get rid of the extra indirection.
While we're here, rename normal_work_helper() to the more informative
btrfs_work_helper().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, scrub_missing_raid56_worker() puts and potentially frees
sblock (which embeds the work item) and then submits a bio through
scrub_wr_submit(). This is another potential instance of the bug in
"btrfs: don't prematurely free work in run_ordered_work()". Fix it by
dropping the reference after we submit the bio.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, reada_start_machine_worker() frees the reada_machine_work and
then calls __reada_start_machine() to do readahead. This is another
potential instance of the bug in "btrfs: don't prematurely free work in
run_ordered_work()".
There _might_ already be a deadlock here: reada_start_machine_worker()
can depend on itself through stacked filesystems (__read_start_machine()
-> reada_start_machine_dev() -> reada_tree_block_flagged() ->
read_extent_buffer_pages() -> submit_one_bio() ->
btree_submit_bio_hook() -> btrfs_map_bio() -> submit_stripe_bio() ->
submit_bio() onto a loop device can trigger readahead on the lower
filesystem).
Either way, let's fix it by freeing the work at the end.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, end_workqueue_fn() frees the end_io_wq entry (which embeds
the work item) and then calls bio_endio(). This is another potential
instance of the bug in "btrfs: don't prematurely free work in
run_ordered_work()".
In particular, the endio call may depend on other work items. For
example, btrfs_end_dio_bio() can call btrfs_subio_endio_read() ->
__btrfs_correct_data_nocsum() -> dio_read_error() ->
submit_dio_repair_bio(), which submits a bio that is also completed
through a end_workqueue_fn() work item. However,
__btrfs_correct_data_nocsum() waits for the newly submitted bio to
complete, thus it depends on another work item.
This example currently usually works because we use different workqueue
helper functions for BTRFS_WQ_ENDIO_DATA and BTRFS_WQ_ENDIO_DIO_REPAIR.
However, it may deadlock with stacked filesystems and is fragile
overall. The proper fix is to free the work item at the very end of the
work function, so let's do that.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We hit the following very strange deadlock on a system with Btrfs on a
loop device backed by another Btrfs filesystem:
1. The top (loop device) filesystem queues an async_cow work item from
cow_file_range_async(). We'll call this work X.
2. Worker thread A starts work X (normal_work_helper()).
3. Worker thread A executes the ordered work for the top filesystem
(run_ordered_work()).
4. Worker thread A finishes the ordered work for work X and frees X
(work->ordered_free()).
5. Worker thread A executes another ordered work and gets blocked on I/O
to the bottom filesystem (still in run_ordered_work()).
6. Meanwhile, the bottom filesystem allocates and queues an async_cow
work item which happens to be the recently-freed X.
7. The workqueue code sees that X is already being executed by worker
thread A, so it schedules X to be executed _after_ worker thread A
finishes (see the find_worker_executing_work() call in
process_one_work()).
Now, the top filesystem is waiting for I/O on the bottom filesystem, but
the bottom filesystem is waiting for the top filesystem to finish, so we
deadlock.
This happens because we are breaking the workqueue assumption that a
work item cannot be recycled while it still depends on other work. Fix
it by waiting to free the work item until we are done with all of the
related ordered work.
P.S.:
One might ask why the workqueue code doesn't try to detect a recycled
work item. It actually does try by checking whether the work item has
the same work function (find_worker_executing_work()), but in our case
the function is the same. This is the only key that the workqueue code
has available to compare, short of adding an additional, layer-violating
"custom key". Considering that we're the only ones that have ever hit
this, we should just play by the rules.
Unfortunately, we haven't been able to create a minimal reproducer other
than our full container setup using a compress-force=zstd filesystem on
top of another compress-force=zstd filesystem.
Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit fc97fab0ea ("btrfs: Replace fs_info->qgroup_rescan_worker
workqueue with btrfs_workqueue.") converted qgroup_rescan_work to be
initialized with btrfs_init_work(), but it left behind an unnecessary
memset(). Get rid of the memset().
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This needs to be cleaned up in the future, but for now it belongs to the
extent-io-tree stuff since it uses the internal tree search code.
Needed to export get_state_failrec and set_state_failrec as well since
we're not going to move the actual IO part of the failrec stuff out at
this point.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This utilizes internal stuff to the extent_io_tree, so we need to export
it before we move it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_io.c/h are huge, encompassing a bunch of different things. The
extent_io_tree code can live on its own, so separate this out.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are moving extent_io_tree into it's on file, so separate out the
extent_state init stuff from extent_io_tree_init().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We check both extent buffer and extent state leaks in the same function,
separate these two functions out so we can move them around.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following comment shows up in btrfs_search_slot() with out much
sense:
/*
* setup the path here so we can release it under lock
* contention with the cow code
*/
if (cow) {
/* code touching path->lock[] is far away from here */
}
This comment hasn't been cleaned up after the relevant code has been
removed.
The original code is introduced in commit 65b51a009e
("btrfs_search_slot: reduce lock contention by cowing in two stages"):
+
+ /*
+ * setup the path here so we can release it under lock
+ * contention with the cow code
+ */
+ p->nodes[level] = b;
+ if (!p->skip_locking)
+ p->locks[level] = 1;
+
But in current code, we have different timing for modifying path lock,
so just remove the comment.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to btrfs_search_slot() done in previous patch, make a shortcut
for the level 0 case and allow to reduce indentation for the remaining
case.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_search_slot(), we something like:
if (level != 0) {
/* Do search inside tree nodes*/
} else {
/* Do search inside tree leaves */
goto done;
}
This caused extra indent for tree node search code. Change it to
something like:
if (level == 0) {
/* Do search inside tree leaves */
goto done'
}
/* Do search inside tree nodes */
So we have more space to maneuver our code, this is especially useful as
the tree nodes search code is more complex than the leaves search code.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For INODE_REF we will check:
- Objectid (ino) against previous key
To detect missing INODE_ITEM.
- No overflow/padding in the data payload
Much like DIR_ITEM, but with less members to check.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For the following items, key->objectid is inode number:
- DIR_ITEM
- DIR_INDEX
- XATTR_ITEM
- EXTENT_DATA
- INODE_REF
So in the subvolume tree, such items must have its previous item share the
same objectid, e.g.:
(257 INODE_ITEM 0)
(257 DIR_INDEX xxx)
(257 DIR_ITEM xxx)
(258 INODE_ITEM 0)
(258 INODE_REF 0)
(258 XATTR_ITEM 0)
(258 EXTENT_DATA 0)
But if we have the following sequence, then there is definitely
something wrong, normally some INODE_ITEM is missing, like:
(257 INODE_ITEM 0)
(257 DIR_INDEX xxx)
(257 DIR_ITEM xxx)
(258 XATTR_ITEM 0) <<< objecitd suddenly changed to 258
(258 EXTENT_DATA 0)
So just by checking the previous key for above inode based key types, we
can detect a missing inode item.
For INODE_REF key type, the check will be added along with INODE_REF
checker.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not used ouside of transaction.c
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A recent patch to btrfs showed that there was at least 1 case where a
nested transaction was committed. Nested transaction in this case means
a code which has a transaction handle calls some function which in turn
obtains a copy of the same transaction handle. In such cases the correct
thing to do is for the lower callee to call btrfs_end_transaction which
contains appropriate checks so as to not commit the transaction which
will result in stale trans handler for the caller.
To catch such cases add an assert in btrfs_commit_transaction ensuring
btrfs_trans_handle::use_count is always 1.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is similar to 942491c9e6 ("xfs: fix AIM7 regression"). Apparently
our current rwsem code doesn't like doing the trylock, then lock for
real scheme. This causes extra contention on the lock and can be
measured eg. by AIM7 benchmark. So change our read/write methods to
just do the trylock for the RWF_NOWAIT case.
Fixes: edf064e7c6 ("btrfs: nowait aio support")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3MUkoACgkQxWXV+ddt
WDvAEQ//fEHZ51NbIMwJNltqF4mr6Oao0M5u0wejEgiXzmR9E1IuHUgVK+KDQmSu
wZl/y+RTlQC0TiURyStaVFBreXEqiuB79my9u4iDeNv/4UJQB42qpmYB4EuviDgB
mxb9bFpWTLkO6Oc+vMrGF3BOmVsQKlq2nOua25g8VFtApQ6uiEfbwBOslCcC8kQB
ZpNBl6x74xz/VWNWZnRStBfwYjRitKNDVU6dyIyRuLj8cktqfGBxGtx7/w0wDiZT
kPR1bNtdpy3Ndke6H/0G6plRWi9kENqcN43hvrz54IKh2l+Jd2/as51j4Qq2tJU9
KaAnJzRaSePxc2m0SqtgZTvc2BYSOg7dqaCyHxBB0CUBdTdJdz2TVZ2KM9MiLns4
1haHBLo4l8o8zeYZpW05ac6OXKY4f8qsjWPEGshn4FDbq0TrHQzYxAF3c0X3hPag
SnuvilgYUuYal+n87qinePg/ZmVrrBXPRycpQnn7FxqezJbf/2WUEojUVQnreU05
mdp8mulxQxyFhgEvO7K1uDtlP8bqW69IO9M//6IWzGNKTDK2SRI08ULplghqgyna
8SG0+y9w26r8UIWDhuvPbdfUMSG3kEH8yLFK84AFDMVJJxOnfznE3sC8sGOiP5q9
OUkl8l7bhDkyAdWZY57gGUobebdPfnLxRV9A+LZQ2El1kSOEK18=
=xzXs
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A fix for an older bug that has started to show up during testing
(because of an updated test for rename exchange).
It's an in-memory corruption caused by local variable leaking out of
the function scope"
* tag 'for-5.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix log context list corruption after rename exchange operation
During rename exchange we might have successfully log the new name in the
source root's log tree, in which case we leave our log context (allocated
on stack) in the root's list of log contextes. However we might fail to
log the new name in the destination root, in which case we fallback to
a transaction commit later and never sync the log of the source root,
which causes the source root log context to remain in the list of log
contextes. This later causes invalid memory accesses because the context
was allocated on stack and after rename exchange finishes the stack gets
reused and overwritten for other purposes.
The kernel's linked list corruption detector (CONFIG_DEBUG_LIST=y) can
detect this and report something like the following:
[ 691.489929] ------------[ cut here ]------------
[ 691.489947] list_add corruption. prev->next should be next (ffff88819c944530), but was ffff8881c23f7be4. (prev=ffff8881c23f7a38).
[ 691.489967] WARNING: CPU: 2 PID: 28933 at lib/list_debug.c:28 __list_add_valid+0x95/0xe0
(...)
[ 691.489998] CPU: 2 PID: 28933 Comm: fsstress Not tainted 5.4.0-rc6-btrfs-next-62 #1
[ 691.490001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[ 691.490003] RIP: 0010:__list_add_valid+0x95/0xe0
(...)
[ 691.490007] RSP: 0018:ffff8881f0b3faf8 EFLAGS: 00010282
[ 691.490010] RAX: 0000000000000000 RBX: ffff88819c944530 RCX: 0000000000000000
[ 691.490011] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffa2c497e0
[ 691.490013] RBP: ffff8881f0b3fe68 R08: ffffed103eaa4115 R09: ffffed103eaa4114
[ 691.490015] R10: ffff88819c944000 R11: ffffed103eaa4115 R12: 7fffffffffffffff
[ 691.490016] R13: ffff8881b4035610 R14: ffff8881e7b84728 R15: 1ffff1103e167f7b
[ 691.490019] FS: 00007f4b25ea2e80(0000) GS:ffff8881f5500000(0000) knlGS:0000000000000000
[ 691.490021] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 691.490022] CR2: 00007fffbb2d4eec CR3: 00000001f2a4a004 CR4: 00000000003606e0
[ 691.490025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 691.490027] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 691.490029] Call Trace:
[ 691.490058] btrfs_log_inode_parent+0x667/0x2730 [btrfs]
[ 691.490083] ? join_transaction+0x24a/0xce0 [btrfs]
[ 691.490107] ? btrfs_end_log_trans+0x80/0x80 [btrfs]
[ 691.490111] ? dget_parent+0xb8/0x460
[ 691.490116] ? lock_downgrade+0x6b0/0x6b0
[ 691.490121] ? rwlock_bug.part.0+0x90/0x90
[ 691.490127] ? do_raw_spin_unlock+0x142/0x220
[ 691.490151] btrfs_log_dentry_safe+0x65/0x90 [btrfs]
[ 691.490172] btrfs_sync_file+0x9f1/0xc00 [btrfs]
[ 691.490195] ? btrfs_file_write_iter+0x1800/0x1800 [btrfs]
[ 691.490198] ? rcu_read_lock_any_held.part.11+0x20/0x20
[ 691.490204] ? __do_sys_newstat+0x88/0xd0
[ 691.490207] ? cp_new_stat+0x5d0/0x5d0
[ 691.490218] ? do_fsync+0x38/0x60
[ 691.490220] do_fsync+0x38/0x60
[ 691.490224] __x64_sys_fdatasync+0x32/0x40
[ 691.490228] do_syscall_64+0x9f/0x540
[ 691.490233] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 691.490235] RIP: 0033:0x7f4b253ad5f0
(...)
[ 691.490239] RSP: 002b:00007fffbb2d6078 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[ 691.490242] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f4b253ad5f0
[ 691.490244] RDX: 00007fffbb2d5fe0 RSI: 00007fffbb2d5fe0 RDI: 0000000000000003
[ 691.490245] RBP: 000000000000000d R08: 0000000000000001 R09: 00007fffbb2d608c
[ 691.490247] R10: 00000000000002e8 R11: 0000000000000246 R12: 00000000000001f4
[ 691.490248] R13: 0000000051eb851f R14: 00007fffbb2d6120 R15: 00005635a498bda0
This started happening recently when running some test cases from fstests
like btrfs/004 for example, because support for rename exchange was added
last week to fsstress from fstests.
So fix this by deleting the log context for the source root from the list
if we have logged the new name in the source root.
Reported-by: Su Yue <Damenly_Su@gmx.com>
Fixes: d4682ba03e ("Btrfs: sync log after logging new name")
CC: stable@vger.kernel.org # 4.19+
Tested-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3GwvEACgkQxWXV+ddt
WDvsfQ//UQ/TND1roMW3EmN+PLTRTUQS75Hb737r5Q66j0j94gFnBxB03R+tU8lP
bH5XVpateb1wDmmdscqVQ1WM9O82bQdDNiYeLDr86+kzpLgy61rZHswZfNlDtl5M
wDwyaxsrd7HndDeUZnIuaakYI9MXz9WIaNXkt0o8hSHctt0N18y23DSBFTVSh/4q
T3cn4odkoBKtQ4mIn2OSmQvkl69nWRBVpjPJZIvsNszKjOo9aZTuGHrOWUV5RPiE
Ho9UBkr+IjEDo8OH88vXAsHeYFIoYhEeUltjLHyF6Agwk1/Ajwp5sxXSubbfHUMQ
l1YdmrTZf+l1Dxdj0sCdyK1npcgGI5IuZmIICpNUEAny9AbTtpSE3GNwtnIHAEAr
cpki+1Z3lfaVSwNMYUz9Esbsb72C+f08WJHGHBMaOhjrBIwQUeWeYzTx6N7uDwNg
GjaDRxjqFV2o7373isQaz7WOTatUMNtL1xvhkeceONI9NBXQRjW5rq5zECmr1ix7
lSTKDzn7yAd6eGBW0T6iYdRl4Bta/zxPiDEF5KDNPugvv23yx17LAOdS4qAiHzbx
oglra/kz9D4xmKVqH7hpFaaHrzL38G4mz6aMdwBu0M7dkn9AwsXiXWSDb0n7jnK0
o96gkUBZjUSFBseUFD2s34MikFtrDynEJzPq96JHuWLguaByMcc=
=CZxY
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few regressions and fixes for stable.
Regressions:
- fix a race leading to metadata space leak after task received a
signal
- un-deprecate 2 ioctls, marked as deprecated by mistake
Fixes:
- fix limit check for number of devices during chunk allocation
- fix a race due to double evaluation of i_size_read inside max()
macro, can cause a crash
- remove wrong device id check in tree-checker"
* tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC
btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range
Btrfs: fix race leading to metadata space leak after task received signal
btrfs: tree-checker: Fix wrong check on max devid
btrfs: Consider system chunk array size for new SYSTEM chunks
The two ioctls START_SYNC and WAIT_SYNC were mistakenly marked as
deprecated and scheduled for removal but we actualy do use them for
'btrfs subvolume delete -C/-c'. The deprecated thing in ebc87351e5
should have been just the async flag for subvolume creation.
The deprecation has been added in this development cycle, remove it
until it's time.
Fixes: ebc87351e5 ("btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag")
Signed-off-by: David Sterba <dsterba@suse.com>
We hit a regression while rolling out 5.2 internally where we were
hitting the following panic
kernel BUG at mm/page-writeback.c:2659!
RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
Call Trace:
__process_pages_contig+0x25a/0x350
? extent_clear_unlock_delalloc+0x43/0x70
submit_compressed_extents+0x359/0x4d0
normal_work_helper+0x15a/0x330
process_one_work+0x1f5/0x3f0
worker_thread+0x2d/0x3d0
? rescuer_thread+0x340/0x340
kthread+0x111/0x130
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x1f/0x30
This is happening because the page is not locked when doing
clear_page_dirty_for_io. Looking at the core dump it was because our
async_extent had a ram_size of 24576 but our async_chunk range only
spanned 20480, so we had a whole extra page in our ram_size for our
async_extent.
This happened because we try not to compress pages outside of our
i_size, however a cleanup patch changed us to do
actual_end = min_t(u64, i_size_read(inode), end + 1);
which is problematic because i_size_read() can evaluate to different
values in between checking and assigning. So either an expanding
truncate or a fallocate could increase our i_size while we're doing
writeout and actual_end would end up being past the range we have
locked.
I confirmed this was what was happening by installing a debug kernel
that had
actual_end = min_t(u64, i_size_read(inode), end + 1);
if (actual_end > end + 1) {
printk(KERN_ERR "KABOOM\n");
actual_end = end + 1;
}
and installing it onto 500 boxes of the tier that had been seeing the
problem regularly. Last night I got my debug message and no panic,
confirming what I expected.
[ dsterba: the assembly confirms a tiny race window:
mov 0x20(%rsp),%rax
cmp %rax,0x48(%r15) # read
movl $0x0,0x18(%rsp)
mov %rax,%r12
mov %r14,%rax
cmovbe 0x48(%r15),%r12 # eval
Where r15 is inode and 0x48 is offset of i_size.
The original fix was to revert 62b3762271 that would do an
intermediate assignment and this would also avoid the doulble
evaluation but is not future-proof, should the compiler merge the
stores and call i_size_read anyway.
There's a patch adding READ_ONCE to i_size_read but that's not being
applied at the moment and we need to fix the bug. Instead, emulate
READ_ONCE by two barrier()s that's what effectively happens. The
assembly confirms single evaluation:
mov 0x48(%rbp),%rax # read once
mov 0x20(%rsp),%rcx
mov $0x20,%edx
cmp %rax,%rcx
cmovbe %rcx,%rax
mov %rax,(%rsp)
mov %rax,%rcx
mov %r14,%rax
Where 0x48(%rbp) is inode->i_size stored to %eax.
]
Fixes: 62b3762271 ("btrfs: Remove isize local variable in compress_file_range")
CC: stable@vger.kernel.org # v5.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ changelog updated ]
Signed-off-by: David Sterba <dsterba@suse.com>
When a task that is allocating metadata needs to wait for the async
reclaim job to process its ticket and gets a signal (because it was killed
for example) before doing the wait, the task ends up erroring out but
with space reserved for its ticket, which never gets released, resulting
in a metadata space leak (more specifically a leak in the bytes_may_use
counter of the metadata space_info object).
Here's the sequence of steps leading to the space leak:
1) A task tries to create a file for example, so it ends up trying to
start a transaction at btrfs_create();
2) The filesystem is currently in a state where there is not enough
metadata free space to satisfy the transaction's needs. So at
space-info.c:__reserve_metadata_bytes() we create a ticket and
add it to the list of tickets of the space info object. Also,
because the metadata async reclaim job is not running, we queue
a job ro run metadata reclaim;
3) In the meanwhile the task receives a signal (like SIGTERM from
a kill command for example);
4) After queing the async reclaim job, at __reserve_metadata_bytes(),
we unlock the metadata space info and call handle_reserve_ticket();
5) That last function calls wait_reserve_ticket(), which acquires the
lock from the metadata space info. Then in the first iteration of
its while loop, it calls prepare_to_wait_event(), which returns
-ERESTARTSYS because the task has a pending signal. As a result,
we set the error field of the ticket to -EINTR and exit the while
loop without deleting the ticket from the list of tickets (in the
space info object). After exiting the loop we unlock the space info;
6) The async reclaim job is able to release enough metadata, acquires
the metadata space info's lock and then reserves space for the ticket,
since the ticket is still in the list of (non-priority) tickets. The
space reservation happens at btrfs_try_granting_tickets(), called from
maybe_fail_all_tickets(). This increments the bytes_may_use counter
from the metadata space info object, sets the ticket's bytes field to
zero (meaning success, that space was reserved) and removes it from
the list of tickets;
7) wait_reserve_ticket() returns, with the error field of the ticket
set to -EINTR. Then handle_reserve_ticket() just propagates that error
to the caller. Because an error was returned, the caller does not
release the reserved space, since the expectation is that any error
means no space was reserved.
Fix this by removing the ticket from the list, while holding the space
info lock, at wait_reserve_ticket() when prepare_to_wait_event() returns
an error.
Also add some comments and an assertion to guarantee we never end up with
a ticket that has an error set and a bytes counter field set to zero, to
more easily detect regressions in the future.
This issue could be triggered sporadically by some test cases from fstests
such as generic/269 for example, which tries to fill a filesystem and then
kills fsstress processes running in the background.
When this issue happens, we get a warning in syslog/dmesg when unmounting
the filesystem, like the following:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 13240 at fs/btrfs/block-group.c:3186 btrfs_free_block_groups+0x314/0x470 [btrfs]
(...)
CPU: 0 PID: 13240 Comm: umount Tainted: G W L 5.3.0-rc8-btrfs-next-48+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x314/0x470 [btrfs]
(...)
RSP: 0018:ffff9910c14cfdb8 EFLAGS: 00010286
RAX: 0000000000000024 RBX: ffff89cd8a4d55f0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff89cdf6a178a8 RDI: ffff89cdf6a178a8
RBP: ffff9910c14cfde8 R08: 0000000000000000 R09: 0000000000000001
R10: ffff89cd4d618040 R11: 0000000000000000 R12: ffff89cd8a4d5508
R13: ffff89cde7c4a600 R14: dead000000000122 R15: dead000000000100
FS: 00007f42754432c0(0000) GS:ffff89cdf6a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd25a47f730 CR3: 000000021f8d6006 CR4: 00000000003606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x1ad/0x390 [btrfs]
generic_shutdown_super+0x6c/0x110
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0xa0 [btrfs]
deactivate_locked_super+0x3a/0x70
cleanup_mnt+0xb4/0x160
task_work_run+0x7e/0xc0
exit_to_usermode_loop+0xfa/0x100
do_syscall_64+0x1cb/0x220
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f4274d2cb37
(...)
RSP: 002b:00007ffcff701d38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000557ebde2f060 RCX: 00007f4274d2cb37
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557ebde2f240
RBP: 0000557ebde2f240 R08: 0000557ebde2f270 R09: 0000000000000015
R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f427522ee64
R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcff701fc0
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
softirqs last enabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace bcf4b235461b26f6 ]---
BTRFS info (device sdb): space_info 4 has 19116032 free, is full
BTRFS info (device sdb): space_info total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536
BTRFS info (device sdb): global_block_rsv: size 0 reserved 0
BTRFS info (device sdb): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdb): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdb): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdb): delayed_refs_rsv: size 0 reserved 0
Fixes: 374bf9c5cd ("btrfs: unify error handling for ticket flushing")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following script will cause false alert on devid check.
#!/bin/bash
dev1=/dev/test/test
dev2=/dev/test/scratch1
mnt=/mnt/btrfs
umount $dev1 &> /dev/null
umount $dev2 &> /dev/null
umount $mnt &> /dev/null
mkfs.btrfs -f $dev1
mount $dev1 $mnt
_fail()
{
echo "!!! FAILED !!!"
exit 1
}
for ((i = 0; i < 4096; i++)); do
btrfs dev add -f $dev2 $mnt || _fail
btrfs dev del $dev1 $mnt || _fail
dev_tmp=$dev1
dev1=$dev2
dev2=$dev_tmp
done
[CAUSE]
Tree-checker uses BTRFS_MAX_DEVS() and BTRFS_MAX_DEVS_SYS_CHUNK() as
upper limit for devid. But we can have devid holes just like above
script.
So the check for devid is incorrect and could cause false alert.
[FIX]
Just remove the whole devid check. We don't have any hard requirement
for devid assignment.
Furthermore, even devid could get corrupted by a bitflip, we still have
dev extents verification at mount time, so corrupted data won't sneak
in.
This fixes fstests btrfs/194.
Reported-by: Anand Jain <anand.jain@oracle.com>
Fixes: ab4ba2e133 ("btrfs: tree-checker: Verify dev item")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For SYSTEM chunks, despite the regular chunk item size limit, there is
another limit due to system chunk array size.
The extra limit was removed in a refactoring, so add it back.
Fixes: e3ecdb3fde ("btrfs: factor out devs_max setting in __btrfs_alloc_chunk")
CC: stable@vger.kernel.org # 5.3+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The .ioctl and .compat_ioctl file operations have the same prototype so
they can both point to the same function, which works great almost all
the time when all the commands are compatible.
One exception is the s390 architecture, where a compat pointer is only
31 bit wide, and converting it into a 64-bit pointer requires calling
compat_ptr(). Most drivers here will never run in s390, but since we now
have a generic helper for it, it's easy enough to use it consistently.
I double-checked all these drivers to ensure that all ioctl arguments
are used as pointers or are ignored, but are not interpreted as integer
values.
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Darren Hart (VMware) <dvhart@infradead.org>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl2vBPgACgkQxWXV+ddt
WDvaGQ/+JbV05ML7QjufNFKjzojNPg8dOwvLqF58askMEAqzyqOrzu1YsIOjchqv
eZPP+wxh6j6i8LFnbaMYRhSVMlgpK9XOR3tNStp73QlSox6/6VKu7XR4wFXAoip3
l4XI8UqO5XqDG55UTtjKyo2VNq8vgq1gRUWD6hPtfzDP6WYj4JZuXOd+dT6PbP32
VHd91RoB8Z8qmypLF9Sju6IBWNhKl91TjlRqVdfLywaK8azqaxE1UttPf5DVbE6+
MlTuXhuxkNc4ddzj//oJ1s3ZP/nXtFmIZ75+Sd6P5DfqRNeIfjkKqXvffsjqoYVI
1Wv9sDiezxRB1RZjfInhtqvdvsqcsrXrM7x6BRVqI7IZDUaH5em8IoozQT62ze/4
MJBrRIbVodx2I7EVDMNGkx2TaIDAfnW4Z3UC+YSHLoy6jht1+SA5KLQDR1G/4NeR
dWht5wOXwhp8P3DoaczTUpk0DLtAtygj04fH8CG277EILLCpuWwW8iRkPttkrIM2
HrtRKrKFJNyEpq9vHSFVvpQJkzgzqBFs1UnqnEuYh6qNgWCrS4PJDu9geQ8aPGr2
pA5aEqg5b+jjWzIYDP93PF0u7kF4mFVAozn6xMf95FKmM1OupYYQg5BRe7n/DfDu
yh3Ms0Mmd9+snpNJ9lDr40cobHC5CDvfvO2SRERyBeXxARainN0=
=Ft+G
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fixes of error handling cleanup of metadata accounting with qgroups
enabled
- fix swapped values for qgroup tracepoints
- fix race when handling full sync flag
- don't start unused worker thread, functionality removed already
* tag 'for-5.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: check for the full sync flag while holding the inode lock during fsync
Btrfs: fix qgroup double free after failure to reserve metadata for delalloc
btrfs: tracepoints: Fix bad entry members of qgroup events
btrfs: tracepoints: Fix wrong parameter order for qgroup events
btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents()
btrfs: don't needlessly create extent-refs kernel thread
btrfs: block-group: Fix a memory leak due to missing btrfs_put_block_group()
Btrfs: add missing extents release on file extent cluster relocation error
We were checking for the full fsync flag in the inode before locking the
inode, which is racy, since at that that time it might not be set but
after we acquire the inode lock some other task set it. One case where
this can happen is on a system low on memory and some concurrent task
failed to allocate an extent map and therefore set the full sync flag on
the inode, to force the next fsync to work in full mode.
A consequence of missing the full fsync flag set is hitting the problems
fixed by commit 0c713cbab6 ("Btrfs: fix race between ranged fsync and
writeback of adjacent ranges"), BUG_ON() when dropping extents from a log
tree, hitting assertion failures at tree-log.c:copy_items() or all sorts
of weird inconsistencies after replaying a log due to file extents items
representing ranges that overlap.
So just move the check such that it's done after locking the inode and
before starting writeback again.
Fixes: 0c713cbab6 ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to reserve metadata for delalloc operations we end up releasing
the previously reserved qgroup amount twice, once explicitly under the
'out_qgroup' label by calling btrfs_qgroup_free_meta_prealloc() and once
again, under label 'out_fail', by calling btrfs_inode_rsv_release() with a
value of 'true' for its 'qgroup_free' argument, which results in
btrfs_qgroup_free_meta_prealloc() being called again, so we end up having
a double free.
Also if we fail to reserve the necessary qgroup amount, we jump to the
label 'out_fail', which calls btrfs_inode_rsv_release() and that in turns
calls btrfs_qgroup_free_meta_prealloc(), even though we weren't able to
reserve any qgroup amount. So we freed some amount we never reserved.
So fix this by removing the call to btrfs_inode_rsv_release() in the
failure path, since it's not necessary at all as we haven't changed the
inode's block reserve in any way at this point.
Fixes: c8eaeac7b7 ("btrfs: reserve delalloc metadata differently")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
For btrfs:qgroup_meta_reserve event, the trace event can output garbage:
qgroup_meta_reserve: 9c7f6acc-b342-4037-bc47-7f6e4d2232d7: refroot=5(FS_TREE) type=DATA diff=2
The diff should always be alinged to sector size (4k), so there is
definitely something wrong.
[CAUSE]
For the wrong @diff, it's caused by wrong parameter order.
The correct parameters are:
struct btrfs_root, s64 diff, int type.
However the parameters used are:
struct btrfs_root, int type, s64 diff.
Fixes: 4ee0d8832c ("btrfs: qgroup: Update trace events for metadata reservation")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[Background]
Btrfs qgroup uses two types of reserved space for METADATA space,
PERTRANS and PREALLOC.
PERTRANS is metadata space reserved for each transaction started by
btrfs_start_transaction().
While PREALLOC is for delalloc, where we reserve space before joining a
transaction, and finally it will be converted to PERTRANS after the
writeback is done.
[Inconsistency]
However there is inconsistency in how we handle PREALLOC metadata space.
The most obvious one is:
In btrfs_buffered_write():
btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
We always free qgroup PREALLOC meta space.
While in btrfs_truncate_block():
btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
We only free qgroup PREALLOC meta space when something went wrong.
[The Correct Behavior]
The correct behavior should be the one in btrfs_buffered_write(), we
should always free PREALLOC metadata space.
The reason is, the btrfs_delalloc_* mechanism works by:
- Reserve metadata first, even it's not necessary
In btrfs_delalloc_reserve_metadata()
- Free the unused metadata space
Normally in:
btrfs_delalloc_release_extents()
|- btrfs_inode_rsv_release()
Here we do calculation on whether we should release or not.
E.g. for 64K buffered write, the metadata rsv works like:
/* The first page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=0
total: num_bytes=calc_inode_reservations()
/* The first page caused one outstanding extent, thus needs metadata
rsv */
/* The 2nd page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed
/* The 2nd page doesn't cause new outstanding extent, needs no new meta
rsv, so we free what we have reserved */
/* The 3rd~16th pages */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed (still space for one outstanding extent)
This means, if btrfs_delalloc_release_extents() determines to free some
space, then those space should be freed NOW.
So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
than btrfs_qgroup_convert_reserved_meta().
The good news is:
- The callers are not that hot
The hottest caller is in btrfs_buffered_write(), which is already
fixed by commit 336a8bb8e3 ("btrfs: Fix wrong
btrfs_delalloc_release_extents parameter"). Thus it's not that
easy to cause false EDQUOT.
- The trans commit in advance for qgroup would hide the bug
Since commit f5fef45936 ("btrfs: qgroup: Make qgroup async transaction
commit more aggressive"), when btrfs qgroup metadata free space is slow,
it will try to commit transaction and free the wrongly converted
PERTRANS space, so it's not that easy to hit such bug.
[FIX]
So to fix the problem, remove the @qgroup_free parameter for
btrfs_delalloc_release_extents(), and always pass true to
btrfs_inode_rsv_release().
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 43b18595d6 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The patch 32b593bfcb ("Btrfs: remove no longer used function to run
delayed refs asynchronously") removed the async delayed refs but the
thread has been created, without any use. Remove it to avoid resource
consumption.
Fixes: 32b593bfcb ("Btrfs: remove no longer used function to run delayed refs asynchronously")
CC: stable@vger.kernel.org # 5.2+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_read_block_groups(), if we have an invalid block group which
has mixed type (DATA|METADATA) while the fs doesn't have MIXED_GROUPS
feature, we error out without freeing the block group cache.
This patch will add the missing btrfs_put_block_group() to prevent
memory leak.
Note for stable backports: the file to patch in versions <= 5.3 is
fs/btrfs/extent-tree.c
Fixes: 49303381f1 ("Btrfs: bail out if block group has different mixed flag")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we error out when finding a page at relocate_file_extent_cluster(), we
need to release the outstanding extents counter on the relocation inode,
set by the previous call to btrfs_delalloc_reserve_metadata(), otherwise
the inode's block reserve size can never decrease to zero and metadata
space is leaked. Therefore add a call to btrfs_delalloc_release_extents()
in case we can't find the target page.
Fixes: 8b62f87bad ("Btrfs: rework outstanding_extents")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl2fQkoACgkQxWXV+ddt
WDsXgA/+OpORcroqswa5V/AB5NgSMv08QfBtL7en7cUA2cGbrTt0lAoixIesCeGA
7y4umlx7zWDhrG0NFv8E21xxkMzK72/YQnN0C6ZwRCi+ZbPSDlpMCwSNV9b6oVt4
t9mJQ2kNZ80wj9W7jtoyiiWZ2OBccWywKBYr+BXybha5BliSd1XbpWMv90lODQHc
1oH1FOphPAf+nSEtW4g0BV8UHy1n1+7TqjEHDilj0TuHlO0MHsR2FQcsRnMBv5H/
CcEH3+870pUOjvm/l1OAdIrzrnH6UsXKHnOmJpYZIAQdG4xtHq/d6O9sfQMZK/Af
VMXuju558kGOvrYdgffYa0HuW3P6c8q5BeEtGAZwjGsQS5GxmW63et/x+HIEl4RU
4SWMfD592GK1/yQn31NWGav7Olkr4L0Eh9iqfKk2nWayTV+pqs8NgkGumkoNB66n
WJs2m2WwKvZzj1Ys9ilRzo17qt2U/m4eLQtbut3m5eYCpzvo38PDrqOVqKTZG/dc
nG1JBJNk+yjV+RkOO0gnQ8/zzooEGvMhVIx5iq52tWjvxnvOGSVv9QAAkKrbMoOX
kn/b3heL57Z+kilUU3Zd0GvVif+awIgOj+PmAevR+w3MLoJ2gLfWb7qcONdHyFPk
jK7rsuP737fyHpZ1tvJyfTt4Lk/u1UbApKrO4QIku1r9IJmmu20=
=zpDj
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more stabitly fixes, one build warning fix.
- fix inode allocation under NOFS context
- fix leak in fiemap due to concurrent append writes
- fix log-root tree updates
- fix balance convert of single profile on 32bit architectures
- silence false positive warning on old GCCs (code moved in rc1)"
* tag 'for-5.4-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: silence maybe-uninitialized warning in clone_range
btrfs: fix uninitialized ret in ref-verify
btrfs: allocate new inode in NOFS context
btrfs: fix balance convert to single on 32-bit host CPUs
btrfs: fix incorrect updating of log root tree
Btrfs: fix memory leak due to concurrent append writes with fiemap
GCC throws warning message as below:
‘clone_src_i_size’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
^
fs/btrfs/send.c:5088:6: note: ‘clone_src_i_size’ was declared here
u64 clone_src_i_size;
^
The clone_src_i_size is only used as call-by-reference
in a call to get_inode_info().
Silence the warning by initializing clone_src_i_size to 0.
Note that the warning is a false positive and reported by older versions
of GCC (eg. 7.x) but not eg 9.x. As there have been numerous people, the
patch is applied. Setting clone_src_i_size to 0 does not otherwise make
sense and would not do any action in case the code changes in the future.
Signed-off-by: Austin Kim <austindh.kim@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
Coverity caught a case where we could return with a uninitialized value
in ret in process_leaf. This is actually pretty likely because we could
very easily run into a block group item key and have a garbage value in
ret and think there was an errror. Fix this by initializing ret to 0.
Reported-by: Colin Ian King <colin.king@canonical.com>
Fixes: fd708b81d9 ("Btrfs: add a extent ref verify tool")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, the command:
btrfs balance start -dconvert=single,soft .
on a Raspberry Pi produces the following kernel message:
BTRFS error (device mmcblk0p2): balance: invalid convert data profile single
This fails because we use is_power_of_2(unsigned long) to validate
the new data profile, the constant for 'single' profile uses bit 48,
and there are only 32 bits in a long on ARM.
Fix by open-coding the check using u64 variables.
Tested by completing the original balance command on several Raspberry
Pis.
Fixes: 818255feec ("btrfs: use common helper instead of open coding a bit test")
CC: stable@vger.kernel.org # 4.20+
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've historically had reports of being unable to mount file systems
because the tree log root couldn't be read. Usually this is the "parent
transid failure", but could be any of the related errors, including
"fsid mismatch" or "bad tree block", depending on which block got
allocated.
The modification of the individual log root items are serialized on the
per-log root root_mutex. This means that any modification to the
per-subvol log root_item is completely protected.
However we update the root item in the log root tree outside of the log
root tree log_mutex. We do this in order to allow multiple subvolumes
to be updated in each log transaction.
This is problematic however because when we are writing the log root
tree out we update the super block with the _current_ log root node
information. Since these two operations happen independently of each
other, you can end up updating the log root tree in between writing out
the dirty blocks and setting the super block to point at the current
root.
This means we'll point at the new root node that hasn't been written
out, instead of the one we should be pointing at. Thus whatever garbage
or old block we end up pointing at complains when we mount the file
system later and try to replay the log.
Fix this by copying the log's root item into a local root item copy.
Then once we're safely under the log_root_tree->log_mutex we update the
root item in the log_root_tree. This way we do not modify the
log_root_tree while we're committing it, fixing the problem.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we have a buffered write that starts at an offset greater than or
equals to the file's size happening concurrently with a full ranged
fiemap, we can end up leaking an extent state structure.
Suppose we have a file with a size of 1Mb, and before the buffered write
and fiemap are performed, it has a single extent state in its io tree
representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set.
The following sequence diagram shows how the memory leak happens if a
fiemap a buffered write, starting at offset 1Mb and with a length of
4Kb, are performed concurrently.
CPU 1 CPU 2
extent_fiemap()
--> it's a full ranged fiemap
range from 0 to LLONG_MAX - 1
(9223372036854775807)
--> locks range in the inode's
io tree
--> after this we have 2 extent
states in the io tree:
--> 1 for range [0, 1Mb[ with
the bits EXTENT_LOCKED and
EXTENT_DELALLOC_BITS set
--> 1 for the range
[1Mb, LLONG_MAX[ with
the EXTENT_LOCKED bit set
--> start buffered write at offset
1Mb with a length of 4Kb
btrfs_file_write_iter()
btrfs_buffered_write()
--> cached_state is NULL
lock_and_cleanup_extent_if_need()
--> returns 0 and does not lock
range because it starts
at current i_size / eof
--> cached_state remains NULL
btrfs_dirty_pages()
btrfs_set_extent_delalloc()
(...)
__set_extent_bit()
--> splits extent state for range
[1Mb, LLONG_MAX[ and now we
have 2 extent states:
--> one for the range
[1Mb, 1Mb + 4Kb[ with
EXTENT_LOCKED set
--> another one for the range
[1Mb + 4Kb, LLONG_MAX[ with
EXTENT_LOCKED set as well
--> sets EXTENT_DELALLOC on the
extent state for the range
[1Mb, 1Mb + 4Kb[
--> caches extent state
[1Mb, 1Mb + 4Kb[ into
@cached_state because it has
the bit EXTENT_LOCKED set
--> btrfs_buffered_write() ends up
with a non-NULL cached_state and
never calls anything to release its
reference on it, resulting in a
memory leak
Fix this by calling free_extent_state() on cached_state if the range was
not locked by lock_and_cleanup_extent_if_need().
The same issue can happen if anything else other than fiemap locks a range
that covers eof and beyond.
This could be triggered, sporadically, by test case generic/561 from the
fstests suite, which makes duperemove run concurrently with fsstress, and
duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set
the leak is reported in dmesg/syslog when removing the btrfs module with
a message like the following:
[77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1
Otherwise (CONFIG_BTRFS_DEBUG not set) detectable with kmemleak.
CC: stable@vger.kernel.org # 4.16+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl2SDbMACgkQxWXV+ddt
WDsUhw/9HcRsT6SlrwA2R5leHxCR5UMwT2Zmbxpfft37ANF0SC1UINHBfnmquM97
xX6fdRSR9RUjF9DrdLPfLBnJDQ/MnHl1ruIVBFhJm6cJ9TJwf9E0TiJBQt+08JWg
vy5hZBWvsPWWRBJ94XPMe4LtakK/isW4Cz5W9AdrC2Siqw69j6eZzms2AnIjyBjA
BoKg4se2Ay2rMxLZWXIOj9374PU+N1cnRnqgh77ZxLku5WdCzrDfB5safE7UmoTG
/MWJuuIgzOk0iQpQORRtEZDS1dNe5KT9m4xXkUbrZbQROwqnXrT1SVIsuqNAvlPk
uaymR1W8nshepzpMlSxVydLv/mKWZNUGnDxOJ23ooow8Yd7ndppXEtFuGwCYqIFc
xQqxuTLREvJ9+jpSv11bmDpk/ULRqpV+2PjUqGaWlGwFArJ+qFRLVGYx31eXmDPj
t2mrPOcXGzY0pKtIpbkuUGleY/jeI+BNsvD4+QPs+jnp0nmfvH0/Rmp7grGqx2FI
rQM8Gn4a5i3nuEDWLp8nN2wcKC3ePwy96Vp2tqfsl6TVTPx4EFzGLkWogHR2yiqI
0LAj8YWFmWuChSv71wYOjX79CppjcbNwOakSwtDjV30jkwoh2f/0D3OwOpua2xe8
75KQMaSB0kesGZz7ZkL1kMqA5m5w7MGZom6XZoBJ+bq2HPLB2jo=
=2UM7
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A bunch of fixes that accumulated in recent weeks, mostly material for
stable.
Summary:
- fix for regression from 5.3 that prevents to use balance convert
with single profile
- qgroup fixes: rescan race, accounting leak with multiple writers,
potential leak after io failure recovery
- fix for use after free in relocation (reported by KASAN)
- other error handling fixups"
* tag 'for-5.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: qgroup: Fix reserved data space leak if we have multiple reserve calls
btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space
btrfs: Fix a regression which we can't convert to SINGLE profile
btrfs: relocation: fix use-after-free on dead relocation roots
Btrfs: fix race setting up and completing qgroup rescan workers
Btrfs: fix missing error return if writeback for extent buffer never started
btrfs: adjust dirty_metadata_bytes after writeback failure of extent buffer
Btrfs: fix selftests failure due to uninitialized i_mode in test inodes
[BUG]
The following script can cause btrfs qgroup data space leak:
mkfs.btrfs -f $dev
mount $dev -o nospace_cache $mnt
btrfs subv create $mnt/subv
btrfs quota en $mnt
btrfs quota rescan -w $mnt
btrfs qgroup limit 128m $mnt/subv
for (( i = 0; i < 3; i++)); do
# Create 3 64M holes for latter fallocate to fail
truncate -s 192m $mnt/subv/file
xfs_io -c "pwrite 64m 4k" $mnt/subv/file > /dev/null
xfs_io -c "pwrite 128m 4k" $mnt/subv/file > /dev/null
sync
# it's supposed to fail, and each failure will leak at least 64M
# data space
xfs_io -f -c "falloc 0 192m" $mnt/subv/file &> /dev/null
rm $mnt/subv/file
sync
done
# Shouldn't fail after we removed the file
xfs_io -f -c "falloc 0 64m" $mnt/subv/file
[CAUSE]
Btrfs qgroup data reserve code allow multiple reservations to happen on
a single extent_changeset:
E.g:
btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_1M);
btrfs_qgroup_reserve_data(inode, &data_reserved, SZ_1M, SZ_2M);
btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_4M);
Btrfs qgroup code has its internal tracking to make sure we don't
double-reserve in above example.
The only pattern utilizing this feature is in the main while loop of
btrfs_fallocate() function.
However btrfs_qgroup_reserve_data()'s error handling has a bug in that
on error it clears all ranges in the io_tree with EXTENT_QGROUP_RESERVED
flag but doesn't free previously reserved bytes.
This bug has a two fold effect:
- Clearing EXTENT_QGROUP_RESERVED ranges
This is the correct behavior, but it prevents
btrfs_qgroup_check_reserved_leak() to catch the leakage as the
detector is purely EXTENT_QGROUP_RESERVED flag based.
- Leak the previously reserved data bytes.
The bug manifests when N calls to btrfs_qgroup_reserve_data are made and
the last one fails, leaking space reserved in the previous ones.
[FIX]
Also free previously reserved data bytes when btrfs_qgroup_reserve_data
fails.
Fixes: 5247255370 ("btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Under the following case with qgroup enabled, if some error happened
after we have reserved delalloc space, then in error handling path, we
could cause qgroup data space leakage:
From btrfs_truncate_block() in inode.c:
ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
block_start, blocksize);
if (ret)
goto out;
again:
page = find_or_create_page(mapping, index, mask);
if (!page) {
btrfs_delalloc_release_space(inode, data_reserved,
block_start, blocksize, true);
btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, true);
ret = -ENOMEM;
goto out;
}
[CAUSE]
In the above case, btrfs_delalloc_reserve_space() will call
btrfs_qgroup_reserve_data() and mark the io_tree range with
EXTENT_QGROUP_RESERVED flag.
In the error handling path, we have the following call stack:
btrfs_delalloc_release_space()
|- btrfs_free_reserved_data_space()
|- btrsf_qgroup_free_data()
|- __btrfs_qgroup_release_data(reserved=@reserved, free=1)
|- qgroup_free_reserved_data(reserved=@reserved)
|- clear_record_extent_bits();
|- freed += changeset.bytes_changed;
However due to a completion bug, qgroup_free_reserved_data() will clear
EXTENT_QGROUP_RESERVED flag in BTRFS_I(inode)->io_failure_tree, other
than the correct BTRFS_I(inode)->io_tree.
Since io_failure_tree is never marked with that flag,
btrfs_qgroup_free_data() will not free any data reserved space at all,
causing a leakage.
This type of error handling can only be triggered by errors outside of
qgroup code. So EDQUOT error from qgroup can't trigger it.
[FIX]
Fix the wrong target io_tree.
Reported-by: Josef Bacik <josef@toxicpanda.com>
Fixes: bc42bda223 ("btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With v5.3 kernel, we can't convert to SINGLE profile:
# btrfs balance start -f -dconvert=single $mnt
ERROR: error during balancing '/mnt/btrfs': Invalid argument
# dmesg -t | tail
validate_convert_profile: data profile=0x1000000000000 allowed=0x20 is_valid=1 final=0x1000000000000 ret=1
BTRFS error (device dm-3): balance: invalid convert data profile single
[CAUSE]
With the extra debug output added, it shows that the @allowed bit is
lacking the special in-memory only SINGLE profile bit.
Thus we fail at that (profile & ~allowed) check.
This regression is caused by commit 081db89b13 ("btrfs: use raid_attr
to get allowed profiles for balance conversion") and the fact that we
don't use any bit to indicate SINGLE profile on-disk, but uses special
in-memory only bit to help distinguish different profiles.
[FIX]
Add that BTRFS_AVAIL_ALLOC_BIT_SINGLE to @allowed, so the code should be
the same as it was and fix the regression.
Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: 081db89b13 ("btrfs: use raid_attr to get allowed profiles for balance conversion")
CC: stable@vger.kernel.org # 5.3+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
One user reported a reproducible KASAN report about use-after-free:
BTRFS info (device sdi1): balance: start -dvrange=1256811659264..1256811659265
BTRFS info (device sdi1): relocating block group 1256811659264 flags data|raid0
==================================================================
BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
Write of size 8 at addr ffff88856f671710 by task kworker/u24:10/261579
CPU: 2 PID: 261579 Comm: kworker/u24:10 Tainted: P OE 5.2.11-arch1-1-kasan #4
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Extreme4, BIOS P3.80 04/06/2018
Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
Call Trace:
dump_stack+0x7b/0xba
print_address_description+0x6c/0x22e
? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
__kasan_report.cold+0x1b/0x3b
? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
kasan_report+0x12/0x17
__asan_report_store8_noabort+0x17/0x20
btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
record_root_in_trans+0x2a0/0x370 [btrfs]
btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
start_transaction+0x1ab/0xe90 [btrfs]
btrfs_join_transaction+0x1d/0x20 [btrfs]
btrfs_finish_ordered_io+0x7bf/0x18a0 [btrfs]
? lock_repin_lock+0x400/0x400
? __kmem_cache_shutdown.cold+0x140/0x1ad
? btrfs_unlink_subvol+0x9b0/0x9b0 [btrfs]
finish_ordered_fn+0x15/0x20 [btrfs]
normal_work_helper+0x1bd/0xca0 [btrfs]
? process_one_work+0x819/0x1720
? kasan_check_read+0x11/0x20
btrfs_endio_write_helper+0x12/0x20 [btrfs]
process_one_work+0x8c9/0x1720
? pwq_dec_nr_in_flight+0x2f0/0x2f0
? worker_thread+0x1d9/0x1030
worker_thread+0x98/0x1030
kthread+0x2bb/0x3b0
? process_one_work+0x1720/0x1720
? kthread_park+0x120/0x120
ret_from_fork+0x35/0x40
Allocated by task 369692:
__kasan_kmalloc.part.0+0x44/0xc0
__kasan_kmalloc.constprop.0+0xba/0xc0
kasan_kmalloc+0x9/0x10
kmem_cache_alloc_trace+0x138/0x260
btrfs_read_tree_root+0x92/0x360 [btrfs]
btrfs_read_fs_root+0x10/0xb0 [btrfs]
create_reloc_root+0x47d/0xa10 [btrfs]
btrfs_init_reloc_root+0x1e2/0x340 [btrfs]
record_root_in_trans+0x2a0/0x370 [btrfs]
btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
start_transaction+0x1ab/0xe90 [btrfs]
btrfs_start_transaction+0x1e/0x20 [btrfs]
__btrfs_prealloc_file_range+0x1c2/0xa00 [btrfs]
btrfs_prealloc_file_range+0x13/0x20 [btrfs]
prealloc_file_extent_cluster+0x29f/0x570 [btrfs]
relocate_file_extent_cluster+0x193/0xc30 [btrfs]
relocate_data_extent+0x1f8/0x490 [btrfs]
relocate_block_group+0x600/0x1060 [btrfs]
btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
btrfs_relocate_chunk+0x9e/0x180 [btrfs]
btrfs_balance+0x14e4/0x2fc0 [btrfs]
btrfs_ioctl_balance+0x47f/0x640 [btrfs]
btrfs_ioctl+0x119d/0x8380 [btrfs]
do_vfs_ioctl+0x9f5/0x1060
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x73/0xb0
do_syscall_64+0xa5/0x370
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 369692:
__kasan_slab_free+0x14f/0x210
kasan_slab_free+0xe/0x10
kfree+0xd8/0x270
btrfs_drop_snapshot+0x154c/0x1eb0 [btrfs]
clean_dirty_subvols+0x227/0x340 [btrfs]
relocate_block_group+0x972/0x1060 [btrfs]
btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
btrfs_relocate_chunk+0x9e/0x180 [btrfs]
btrfs_balance+0x14e4/0x2fc0 [btrfs]
btrfs_ioctl_balance+0x47f/0x640 [btrfs]
btrfs_ioctl+0x119d/0x8380 [btrfs]
do_vfs_ioctl+0x9f5/0x1060
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x73/0xb0
do_syscall_64+0xa5/0x370
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The buggy address belongs to the object at ffff88856f671100
which belongs to the cache kmalloc-4k of size 4096
The buggy address is located 1552 bytes inside of
4096-byte region [ffff88856f671100, ffff88856f672100)
The buggy address belongs to the page:
page:ffffea0015bd9c00 refcount:1 mapcount:0 mapping:ffff88864400e600 index:0x0 compound_mapcount: 0
flags: 0x2ffff0000010200(slab|head)
raw: 02ffff0000010200 dead000000000100 dead000000000200 ffff88864400e600
raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88856f671600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88856f671680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88856f671700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88856f671780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88856f671800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
BTRFS info (device sdi1): 1 enospc errors during balance
BTRFS info (device sdi1): balance: ended with status: -28
[CAUSE]
The problem happens when finish_ordered_io() get called with balance
still running, while the reloc root of that subvolume is already dead.
(Tree is swap already done, but tree not yet deleted for possible qgroup
usage.)
That means root->reloc_root still exists, but that reloc_root can be
under btrfs_drop_snapshot(), thus we shouldn't access it.
The following race could cause the use-after-free problem:
CPU1 | CPU2
--------------------------------------------------------------------------
| relocate_block_group()
| |- unset_reloc_control(rc)
| |- btrfs_commit_transaction()
btrfs_finish_ordered_io() | |- clean_dirty_subvols()
|- btrfs_join_transaction() | |
|- record_root_in_trans() | |
|- btrfs_init_reloc_root() | |
|- if (root->reloc_root) | |
| | |- root->reloc_root = NULL
| | |- btrfs_drop_snapshot(reloc_root);
|- reloc_root->last_trans|
= trans->transid |
^^^^^^^^^^^^^^^^^^^^^^
Use after free
[FIX]
Fix it by the following modifications:
- Test if the root has dead reloc tree before accessing root->reloc_root
If the root has BTRFS_ROOT_DEAD_RELOC_TREE, then we don't need to
create or update root->reloc_tree
- Clear the BTRFS_ROOT_DEAD_RELOC_TREE flag until we have fully dropped
reloc tree
To co-operate with above modification, so as long as
BTRFS_ROOT_DEAD_RELOC_TREE is still set, we won't try to re-create
reloc tree at record_root_in_trans().
Reported-by: Cebtenzzre <cebtenzzre@gmail.com>
Fixes: d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a race between setting up a qgroup rescan worker and completing
a qgroup rescan worker that can lead to callers of the qgroup rescan wait
ioctl to either not wait for the rescan worker to complete or to hang
forever due to missing wake ups. The following diagram shows a sequence
of steps that illustrates the race.
CPU 1 CPU 2 CPU 3
btrfs_ioctl_quota_rescan()
btrfs_qgroup_rescan()
qgroup_rescan_init()
mutex_lock(&fs_info->qgroup_rescan_lock)
spin_lock(&fs_info->qgroup_lock)
fs_info->qgroup_flags |=
BTRFS_QGROUP_STATUS_FLAG_RESCAN
init_completion(
&fs_info->qgroup_rescan_completion)
fs_info->qgroup_rescan_running = true
mutex_unlock(&fs_info->qgroup_rescan_lock)
spin_unlock(&fs_info->qgroup_lock)
btrfs_init_work()
--> starts the worker
btrfs_qgroup_rescan_worker()
mutex_lock(&fs_info->qgroup_rescan_lock)
fs_info->qgroup_flags &=
~BTRFS_QGROUP_STATUS_FLAG_RESCAN
mutex_unlock(&fs_info->qgroup_rescan_lock)
starts transaction, updates qgroup status
item, etc
btrfs_ioctl_quota_rescan()
btrfs_qgroup_rescan()
qgroup_rescan_init()
mutex_lock(&fs_info->qgroup_rescan_lock)
spin_lock(&fs_info->qgroup_lock)
fs_info->qgroup_flags |=
BTRFS_QGROUP_STATUS_FLAG_RESCAN
init_completion(
&fs_info->qgroup_rescan_completion)
fs_info->qgroup_rescan_running = true
mutex_unlock(&fs_info->qgroup_rescan_lock)
spin_unlock(&fs_info->qgroup_lock)
btrfs_init_work()
--> starts another worker
mutex_lock(&fs_info->qgroup_rescan_lock)
fs_info->qgroup_rescan_running = false
mutex_unlock(&fs_info->qgroup_rescan_lock)
complete_all(&fs_info->qgroup_rescan_completion)
Before the rescan worker started by the task at CPU 3 completes, if
another task calls btrfs_ioctl_quota_rescan(), it will get -EINPROGRESS
because the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN is set at
fs_info->qgroup_flags, which is expected and correct behaviour.
However if other task calls btrfs_ioctl_quota_rescan_wait() before the
rescan worker started by the task at CPU 3 completes, it will return
immediately without waiting for the new rescan worker to complete,
because fs_info->qgroup_rescan_running is set to false by CPU 2.
This race is making test case btrfs/171 (from fstests) to fail often:
btrfs/171 9s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad)
--- tests/btrfs/171.out 2018-09-16 21:30:48.505104287 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad 2019-09-19 02:01:36.938486039 +0100
@@ -1,2 +1,3 @@
QA output created by 171
+ERROR: quota rescan failed: Operation now in progress
Silence is golden
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/171.out /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad' to see the entire diff)
That is because the test calls the btrfs-progs commands "qgroup quota
rescan -w", "qgroup assign" and "qgroup remove" in a sequence that makes
calls to the rescan start ioctl fail with -EINPROGRESS (note the "btrfs"
commands 'qgroup assign' and 'qgroup remove' often call the rescan start
ioctl after calling the qgroup assign ioctl,
btrfs_ioctl_qgroup_assign()), since previous waits didn't actually wait
for a rescan worker to complete.
Another problem the race can cause is missing wake ups for waiters,
since the call to complete_all() happens outside a critical section and
after clearing the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN. In the sequence
diagram above, if we have a waiter for the first rescan task (executed
by CPU 2), then fs_info->qgroup_rescan_completion.wait is not empty, and
if after the rescan worker clears BTRFS_QGROUP_STATUS_FLAG_RESCAN and
before it calls complete_all() against
fs_info->qgroup_rescan_completion, the task at CPU 3 calls
init_completion() against fs_info->qgroup_rescan_completion which
re-initilizes its wait queue to an empty queue, therefore causing the
rescan worker at CPU 2 to call complete_all() against an empty queue,
never waking up the task waiting for that rescan worker.
Fix this by clearing BTRFS_QGROUP_STATUS_FLAG_RESCAN and setting
fs_info->qgroup_rescan_running to false in the same critical section,
delimited by the mutex fs_info->qgroup_rescan_lock, as well as doing the
call to complete_all() in that same critical section. This gives the
protection needed to avoid rescan wait ioctl callers not waiting for a
running rescan worker and the lost wake ups problem, since setting that
rescan flag and boolean as well as initializing the wait queue is done
already in a critical section delimited by that mutex (at
qgroup_rescan_init()).
Fixes: 57254b6ebc ("Btrfs: add ioctl to wait for qgroup rescan completion")
Fixes: d2c609b834 ("btrfs: properly track when rescan worker is running")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If lock_extent_buffer_for_io() fails, it returns a negative value, but its
caller btree_write_cache_pages() ignores such error. This means that a
call to flush_write_bio(), from lock_extent_buffer_for_io(), might have
failed. We should make btree_write_cache_pages() notice such error values
and stop immediatelly, making sure filemap_fdatawrite_range() returns an
error to the transaction commit path. A failure from flush_write_bio()
should also result in the endio callback end_bio_extent_buffer_writepage()
being invoked, which sets the BTRFS_FS_*_ERR bits appropriately, so that
there's no risk a transaction or log commit doesn't catch a writeback
failure.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before, if a eb failed to write out, we would end up triggering a
BUG_ON(). As of f4340622e0 ("btrfs: extent_io: Move the BUG_ON() in
flush_write_bio() one level up"), we no longer BUG_ON(), so we should
make life consistent and add back the unwritten bytes to
dirty_metadata_bytes.
Fixes: f4340622e0 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
CC: stable@vger.kernel.org # 5.2+
Reviewed-by: Filipe Manana <fdmanana@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Some of the self tests create a test inode, setup some extents and then do
calls to btrfs_get_extent() to test that the corresponding extent maps
exist and are correct. However btrfs_get_extent(), since the 5.2 merge
window, now errors out when it finds a regular or prealloc extent for an
inode that does not correspond to a regular file (its ->i_mode is not
S_IFREG). This causes the self tests to fail sometimes, specially when
KASAN, slub_debug and page poisoning are enabled:
$ modprobe btrfs
modprobe: ERROR: could not insert 'btrfs': Invalid argument
$ dmesg
[ 9414.691648] Btrfs loaded, crc32c=crc32c-intel, debug=on, assert=on, integrity-checker=on, ref-verify=on
[ 9414.692655] BTRFS: selftest: sectorsize: 4096 nodesize: 4096
[ 9414.692658] BTRFS: selftest: running btrfs free space cache tests
[ 9414.692918] BTRFS: selftest: running extent only tests
[ 9414.693061] BTRFS: selftest: running bitmap only tests
[ 9414.693366] BTRFS: selftest: running bitmap and extent tests
[ 9414.696455] BTRFS: selftest: running space stealing from bitmap to extent tests
[ 9414.697131] BTRFS: selftest: running extent buffer operation tests
[ 9414.697133] BTRFS: selftest: running btrfs_split_item tests
[ 9414.697564] BTRFS: selftest: running extent I/O tests
[ 9414.697583] BTRFS: selftest: running find delalloc tests
[ 9415.081125] BTRFS: selftest: running find_first_clear_extent_bit test
[ 9415.081278] BTRFS: selftest: running extent buffer bitmap tests
[ 9415.124192] BTRFS: selftest: running inode tests
[ 9415.124195] BTRFS: selftest: running btrfs_get_extent tests
[ 9415.127909] BTRFS: selftest: running hole first btrfs_get_extent test
[ 9415.128343] BTRFS critical (device (efault)): regular/prealloc extent found for non-regular inode 256
[ 9415.131428] BTRFS: selftest: fs/btrfs/tests/inode-tests.c:904 expected a real extent, got 0
This happens because the test inodes are created without ever initializing
the i_mode field of the inode, and neither VFS's new_inode() nor the btrfs
callback btrfs_alloc_inode() initialize the i_mode. Initialization of the
i_mode is done through the various callbacks used by the VFS to create
new inodes (regular files, directories, symlinks, tmpfiles, etc), which
all call btrfs_new_inode() which in turn calls inode_init_owner(), which
sets the inode's i_mode. Since the tests only uses new_inode() to create
the test inodes, the i_mode was never initialized.
This always happens on a VM I used with kasan, slub_debug and many other
debug facilities enabled. It also happened to someone who reported this
on bugzilla (on a 5.3-rc).
Fix this by setting i_mode to S_IFREG at btrfs_new_test_inode().
Fixes: 6bf9e4bd6a ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204397
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl1/hCoACgkQxWXV+ddt
WDs0lQ//flGLX4fvaY2vuWA26t1elITnIatyX8S+xP4pUsT1Tyy1egeGpR8Jku/7
sCOgUlEM2MNXqveOdkQqPJuFPp3B6tInz4S/fowtLlz4enp7uTXw2SFuS3bhOJ+b
rpxK9VTc6QV3aipBCG31m8fnDiMaj2Hcspp0oej3V2mBhLUvzn69+P4eo7WN+46w
r2F605+lfURauHE6WjM09HINx3NGSfPqdSA5rJvHSm0jlxhb9l3DJOX8cYkbf8lo
MAbLDZmtiDiQAqRcsQPi6LZ1LKBkOYaeSnVvnXnH23FI04LBra3duk03qpvWCW2R
c1tFnKF5vACCyBQp1z8WYP9GjjoW5WT33R2iXufgwXP6pkLpS/12qLLeXqO2K4p5
zINKrIkF3P+GHxiDsQZE3G9A4UpKWFHCxKdxyWIV8LQDEBrgE2Mo3NThEyRBbP+8
1dia4j+qFHvPTMNBvBCjCZMqDwbCe9H70WOXKGE36JITW2le91mn4qHl4SuWReUP
IoHYDVcC/eBGRegc9X+bLJNjJYqo+XFo6u32/fUC5YVhngycQEi2vg1vv8fWQ7dB
g/Ruo3Inrk8h5kPmrHvbOzGazgANIt5ELHrYMRMA5WSgaq29jtGt9oTnsrd+I88G
aPJtwAZfLwdSjl/pwJw8atEPrf04DA2w+gO7rZ/AmeLshnGfOTc=
=bY+a
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This continues with work on code refactoring, sanity checks and space
handling. There are some less user visible changes, nothing that would
particularly stand out.
User visible changes:
- tree checker, more sanity checks of:
- ROOT_ITEM (key, size, generation, level, alignment, flags)
- EXTENT_ITEM and METADATA_ITEM checks (key, size, offset,
alignment, refs)
- tree block reference items
- EXTENT_DATA_REF (key, hash, offset)
- deprecate flag BTRFS_SUBVOL_CREATE_ASYNC for subvolume creation
ioctl, scheduled removal in 5.7
- delete stale and unused UAPI definitions
BTRFS_DEV_REPLACE_ITEM_STATE_*
- improved export of debugging information available via existing
sysfs directory structure
- try harder to delete relations between qgroups and allow to delete
orphan entries
- remove unreliable space checks before relocation starts
Core:
- space handling:
- improved ticket reservations and other high level logic in
order to remove special cases
- factor flushing infrastructure and use it for different
contexts, allows to remove some special case handling
- reduce metadata reservation when only updating inodes
- reduce global block reserve minimum size (affects small
filesystems)
- improved overcommit logic wrt global block reserve
- tests:
- fix memory leaks in extent IO tree
- catch all TRIM range
Fixes:
- fix ENOSPC errors, leading to transaction aborts, when cloning
extents
- several fixes for inode number cache (mount option inode_cache)
- fix potential soft lockups during send when traversing large trees
- fix unaligned access to space cache pages with SLUB debug on
(PowerPC)
Other:
- refactoring public/private functions, moving to new or more
appropriate files
- defines converted to enums
- error handling improvements
- more assertions and comments
- old code deletion"
* tag 'for-5.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (138 commits)
btrfs: Relinquish CPUs in btrfs_compare_trees
btrfs: Don't assign retval of btrfs_try_tree_write_lock/btrfs_tree_read_lock_atomic
btrfs: create structure to encode checksum type and length
btrfs: turn checksum type define into an enum
btrfs: add enospc debug messages for ticket failure
btrfs: do not account global reserve in can_overcommit
btrfs: use btrfs_try_granting_tickets in update_global_rsv
btrfs: always reserve our entire size for the global reserve
btrfs: change the minimum global reserve size
btrfs: rename btrfs_space_info_add_old_bytes
btrfs: remove orig_bytes from reserve_ticket
btrfs: fix may_commit_transaction to deal with no partial filling
btrfs: rework wake_all_tickets
btrfs: refactor the ticket wakeup code
btrfs: stop partially refilling tickets when releasing space
btrfs: add space reservation tracepoint for reserved bytes
btrfs: roll tracepoint into btrfs_space_info_update helper
btrfs: do not allow reservations if we have pending tickets
btrfs: stop clearing EXTENT_DIRTY in inode I/O tree
btrfs: treat RWF_{,D}SYNC writes as sync for CRCs
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl16aTEACgkQxWXV+ddt
WDvICg//cSn5w+g6EnxbrAZ6IYQJ4GA7lZSk2i6Dc/lI3BTfs7Wj0SPRKd01pBjT
N+wbqoOgubsS1jkNfJsGCN80XzSa0tvyQdbezj5ncgSPXp4FRlT0K24EUQNPaqbg
SsvvxAOCerVN3Yj2qrHNWIS5qZ5/8/NjLXca1DJ/OYmrkKfhe+Z6/b9EuKffPnco
erMnaeSvQ27hYkkcdM0DGcWDoHHAQrefGNjQzp5vncJNN1F7+EGLbcH31UwApk1K
/hvOQ6Q6SoR/NKbQu3AitrR9u7v9uhWP9jHJZT46q1m89CzI4S5FjK2wKZFjPE6r
0PGRqnpdaGAERaTo3s6jIqv/X2gzJkhhhzGMiPgPJCQbAH39f/fFGEX22TjG33Yq
2CiGSIPnmKQ7HE494YLuSyHD/89SutMMCkbF0sFBoKuTnu2HQMn9r5Pk6bEKtvIY
iTk75/WTXR02qWCVhTyNDa9QnxewQGJC1d1KNQ6MwbzBiYyG9S/DDZnjLJPNx7DF
KAAANCDdyPpraLcmw2sD/qd1o10HfQmn9z1L2v3YvJBfjMe76SQFCP5WwaJRcjOm
c3ScAX9bXeXJgH+E98kWc7T6p49IPdMDGAtArQmtjO4V8pFRuqG+2Ibg6Za/y5XZ
fkaS5UY+XIk3TUpEqkWKMPMigM9a3jgHskyMgdRLQfVnoOc6Z+k=
=KXB8
-----END PGP SIGNATURE-----
Merge tag 'for-5.3-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Here are two fixes, one of them urgent fixing a bug introduced in 5.2
and reported by many users. It took time to identify the root cause,
catching the 5.3 release is higly desired also to push the fix to 5.2
stable tree.
The bug is a mess up of return values after adding proper error
handling and honestly the kind of bug that can cause sleeping
disorders until it's caught. My appologies to everybody who was
affected.
Summary of what could happen:
1) either a hang when committing a transaction, if this happens
there's no risk of corruption, still the hang is very inconvenient
and can't be resolved without a reboot
2) writeback for some btree nodes may never be started and we end up
committing a transaction without noticing that, this is really
serious and that will lead to the "parent transid verify failed"
messages"
* tag 'for-5.3-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix unwritten extent buffers and hangs on future writeback attempts
Btrfs: fix assertion failure during fsync and use of stale transaction
The lock_extent_buffer_io() returns 1 to the caller to tell it everything
went fine and the callers needs to start writeback for the extent buffer
(submit a bio, etc), 0 to tell the caller everything went fine but it does
not need to start writeback for the extent buffer, and a negative value if
some error happened.
When it's about to return 1 it tries to lock all pages, and if a try lock
on a page fails, and we didn't flush any existing bio in our "epd", it
calls flush_write_bio(epd) and overwrites the return value of 1 to 0 or
an error. The page might have been locked elsewhere, not with the goal
of starting writeback of the extent buffer, and even by some code other
than btrfs, like page migration for example, so it does not mean the
writeback of the extent buffer was already started by some other task,
so returning a 0 tells the caller (btree_write_cache_pages()) to not
start writeback for the extent buffer. Note that epd might currently have
either no bio, so flush_write_bio() returns 0 (success) or it might have
a bio for another extent buffer with a lower index (logical address).
Since we return 0 with the EXTENT_BUFFER_WRITEBACK bit set on the
extent buffer and writeback is never started for the extent buffer,
future attempts to writeback the extent buffer will hang forever waiting
on that bit to be cleared, since it can only be cleared after writeback
completes. Such hang is reported with a trace like the following:
[49887.347053] INFO: task btrfs-transacti:1752 blocked for more than 122 seconds.
[49887.347059] Not tainted 5.2.13-gentoo #2
[49887.347060] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[49887.347062] btrfs-transacti D 0 1752 2 0x80004000
[49887.347064] Call Trace:
[49887.347069] ? __schedule+0x265/0x830
[49887.347071] ? bit_wait+0x50/0x50
[49887.347072] ? bit_wait+0x50/0x50
[49887.347074] schedule+0x24/0x90
[49887.347075] io_schedule+0x3c/0x60
[49887.347077] bit_wait_io+0x8/0x50
[49887.347079] __wait_on_bit+0x6c/0x80
[49887.347081] ? __lock_release.isra.29+0x155/0x2d0
[49887.347083] out_of_line_wait_on_bit+0x7b/0x80
[49887.347084] ? var_wake_function+0x20/0x20
[49887.347087] lock_extent_buffer_for_io+0x28c/0x390
[49887.347089] btree_write_cache_pages+0x18e/0x340
[49887.347091] do_writepages+0x29/0xb0
[49887.347093] ? kmem_cache_free+0x132/0x160
[49887.347095] ? convert_extent_bit+0x544/0x680
[49887.347097] filemap_fdatawrite_range+0x70/0x90
[49887.347099] btrfs_write_marked_extents+0x53/0x120
[49887.347100] btrfs_write_and_wait_transaction.isra.4+0x38/0xa0
[49887.347102] btrfs_commit_transaction+0x6bb/0x990
[49887.347103] ? start_transaction+0x33e/0x500
[49887.347105] transaction_kthread+0x139/0x15c
So fix this by not overwriting the return value (ret) with the result
from flush_write_bio(). We also need to clear the EXTENT_BUFFER_WRITEBACK
bit in case flush_write_bio() returns an error, otherwise it will hang
any future attempts to writeback the extent buffer, and undo all work
done before (set back EXTENT_BUFFER_DIRTY, etc).
This is a regression introduced in the 5.2 kernel.
Fixes: 2e3c25136a ("btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()")
Fixes: f4340622e0 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
Reported-by: Zdenek Sojka <zsojka@seznam.cz>
Link: https://lore.kernel.org/linux-btrfs/GpO.2yos.3WGDOLpx6t%7D.1TUDYM@seznam.cz/T/#u
Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Link: https://lore.kernel.org/linux-btrfs/5c4688ac-10a7-fb07-70e8-c5d31a3fbb38@profihost.ag/T/#t
Reported-by: Drazen Kacar <drazen.kacar@oradian.com>
Link: https://lore.kernel.org/linux-btrfs/DB8PR03MB562876ECE2319B3E579590F799C80@DB8PR03MB5628.eurprd03.prod.outlook.com/
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204377
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sometimes when fsync'ing a file we need to log that other inodes exist and
when we need to do that we acquire a reference on the inodes and then drop
that reference using iput() after logging them.
That generally is not a problem except if we end up doing the final iput()
(dropping the last reference) on the inode and that inode has a link count
of 0, which can happen in a very short time window if the logging path
gets a reference on the inode while it's being unlinked.
In that case we end up getting the eviction callback, btrfs_evict_inode(),
invoked through the iput() call chain which needs to drop all of the
inode's items from its subvolume btree, and in order to do that, it needs
to join a transaction at the helper function evict_refill_and_join().
However because the task previously started a transaction at the fsync
handler, btrfs_sync_file(), it has current->journal_info already pointing
to a transaction handle and therefore evict_refill_and_join() will get
that transaction handle from btrfs_join_transaction(). From this point on,
two different problems can happen:
1) evict_refill_and_join() will often change the transaction handle's
block reserve (->block_rsv) and set its ->bytes_reserved field to a
value greater than 0. If evict_refill_and_join() never commits the
transaction, the eviction handler ends up decreasing the reference
count (->use_count) of the transaction handle through the call to
btrfs_end_transaction(), and after that point we have a transaction
handle with a NULL ->block_rsv (which is the value prior to the
transaction join from evict_refill_and_join()) and a ->bytes_reserved
value greater than 0. If after the eviction/iput completes the inode
logging path hits an error or it decides that it must fallback to a
transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
non-zero value from btrfs_log_dentry_safe(), and because of that
non-zero value it tries to commit the transaction using a handle with
a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
the transaction commit hit an assertion failure at
btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
the ->block_rsv is NULL. The produced stack trace for that is like the
following:
[192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
[192922.917553] ------------[ cut here ]------------
[192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
[192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1
[192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
(...)
[192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
[192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
[192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
[192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
[192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
[192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
[192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
[192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
[192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[192922.925105] Call Trace:
[192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
[192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
[192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs]
[192922.926731] do_fsync+0x38/0x60
[192922.927138] __x64_sys_fdatasync+0x13/0x20
[192922.927543] do_syscall_64+0x60/0x1c0
[192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe
(...)
[192922.934077] ---[ end trace f00808b12068168f ]---
2) If evict_refill_and_join() decides to commit the transaction, it will
be able to do it, since the nested transaction join only increments the
transaction handle's ->use_count reference counter and it does not
prevent the transaction from getting committed. This means that after
eviction completes, the fsync logging path will be using a transaction
handle that refers to an already committed transaction. What happens
when using such a stale transaction can be unpredictable, we are at
least having a use-after-free on the transaction handle itself, since
the transaction commit will call kmem_cache_free() against the handle
regardless of its ->use_count value, or we can end up silently losing
all the updates to the log tree after that iput() in the logging path,
or using a transaction handle that in the meanwhile was allocated to
another task for a new transaction, etc, pretty much unpredictable
what can happen.
In order to fix both of them, instead of using iput() during logging, use
btrfs_add_delayed_iput(), so that the logging path of fsync never drops
the last reference on an inode, that step is offloaded to a safe context
(usually the cleaner kthread).
The assertion failure issue was sporadically triggered by the test case
generic/475 from fstests, which loads the dm error target while fsstress
is running, which lead to fsync failing while logging inodes with -EIO
errors and then trying later to commit the transaction, triggering the
assertion failure.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing any form of incremental send the parent and the child trees
need to be compared via btrfs_compare_trees. This can result in long
loop chains without ever relinquishing the CPU. This causes softlockup
detector to trigger when comparing trees with a lot of items. Example
report:
watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [snapperd:16153]
CPU: 0 PID: 16153 Comm: snapperd Not tainted 5.2.9-1-default #1 openSUSE Tumbleweed (unreleased)
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
pstate: 40000005 (nZcv daif -PAN -UAO)
pc : __ll_sc_arch_atomic_sub_return+0x14/0x20
lr : btrfs_release_extent_buffer_pages+0xe0/0x1e8 [btrfs]
sp : ffff00001273b7e0
Call trace:
__ll_sc_arch_atomic_sub_return+0x14/0x20
release_extent_buffer+0xdc/0x120 [btrfs]
free_extent_buffer.part.0+0xb0/0x118 [btrfs]
free_extent_buffer+0x24/0x30 [btrfs]
btrfs_release_path+0x4c/0xa0 [btrfs]
btrfs_free_path.part.0+0x20/0x40 [btrfs]
btrfs_free_path+0x24/0x30 [btrfs]
get_inode_info+0xa8/0xf8 [btrfs]
finish_inode_if_needed+0xe0/0x6d8 [btrfs]
changed_cb+0x9c/0x410 [btrfs]
btrfs_compare_trees+0x284/0x648 [btrfs]
send_subvol+0x33c/0x520 [btrfs]
btrfs_ioctl_send+0x8a0/0xaf0 [btrfs]
btrfs_ioctl+0x199c/0x2288 [btrfs]
do_vfs_ioctl+0x4b0/0x820
ksys_ioctl+0x84/0xb8
__arm64_sys_ioctl+0x28/0x38
el0_svc_common.constprop.0+0x7c/0x188
el0_svc_handler+0x34/0x90
el0_svc+0x8/0xc
Fix this by adding a call to cond_resched at the beginning of the main
loop in btrfs_compare_trees.
Fixes: 7069830a9e ("Btrfs: add btrfs_compare_trees function")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Those function are simple boolean predicates there is no need to assign
their return values to interim variables. Use them directly as
predicates. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Create a structure to encode the type and length for the known on-disk
checksums. This makes it easier to add new checksums later.
The structure and helpers are moved from ctree.h so they don't occupy
space in all headers including ctree.h. This save some space in the
final object.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When debugging weird enospc problems it's handy to be able to dump the
space info when we wake up all tickets, and see what the ticket values
are. This helped me figure out cases where we were enospc'ing when we
shouldn't have been.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We ran into a problem in production where a box with plenty of space was
getting wedged doing ENOSPC flushing. These boxes only had 20% of the
disk allocated, but their metadata space + global reserve was right at
the size of their metadata chunk.
In this case can_overcommit should be allowing allocations without
problem, but there's logic in can_overcommit that doesn't allow us to
overcommit if there's not enough real space to satisfy the global
reserve.
This is for historical reasons. Before there were only certain places
we could allocate chunks. We could go to commit the transaction and not
have enough space for our pending delayed refs and such and be unable to
allocate a new chunk. This would result in a abort because of ENOSPC.
This code was added to solve this problem.
However since then we've gained the ability to always be able to
allocate a chunk. So we can easily overcommit in these cases without
risking a transaction abort because of ENOSPC.
Also prior to now the global reserve really would be used because that's
the space we relied on for delayed refs. With delayed refs being
tracked separately we no longer have to worry about running out of
delayed refs space while committing. We are much less likely to
exhaust our global reserve space during transaction commit.
Fix the can_overcommit code to simply see if our current usage + what we
want is less than our current free space plus whatever slack space we
have in the disk is. This solves the problem we were seeing in
production and keeps us from flushing as aggressively as we approach our
actual metadata size usage.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have some annoying xfstests tests that will create a very small fs,
fill it up, delete it, and repeat to make sure everything works right.
This trips btrfs up sometimes because we may commit a transaction to
free space, but most of the free metadata space was being reserved by
the global reserve. So we commit and update the global reserve, but the
space is simply added to bytes_may_use directly, instead of trying to
add it to existing tickets. This results in ENOSPC when we really did
have space. Fix this by calling btrfs_try_granting_tickets once we add
back our excess space to wake any pending tickets.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While messing with the overcommit logic I noticed that sometimes we'd
ENOSPC out when really we should have run out of space much earlier. It
turns out it's because we'll only reserve up to the free amount left in
the space info for the global reserve, but that doesn't make sense with
overcommit because we could be well above our actual size. This results
in the global reserve not carving out it's entire reservation, and thus
not putting enough pressure on the rest of the infrastructure to do the
right thing and ENOSPC out at a convenient time. Fix this by always
taking our full reservation amount for the global reserve.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It made sense to have the global reserve set at 16M in the past, but
since it is used less nowadays set the minimum size to the number of
items we'll need to update the main trees we update during a transaction
commit, plus some slop area so we can do unlinks if we need to.
In practice this doesn't affect normal file systems, but for xfstests
where we do things like fill up a fs and then rm * it can fall over in
weird ways. This enables us for more sane behavior at extremely small
file system sizes.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This name doesn't really fit with how the space reservation stuff works
now, rename it to btrfs_space_info_free_bytes_may_use so it's clear what
the function is doing.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we do not do partial filling of tickets simply remove
orig_bytes, it is no longer needed.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we aren't partially filling tickets we may have some slack
space left in the space_info. We need to account for this in
may_commit_transaction, otherwise we may choose to not commit the
transaction despite it actually having enough space to satisfy our
ticket.
Calculate the free space we have in the space_info, if any, and subtract
this from the ticket we have and use that amount to determine if we will
need to commit to reclaim enough space.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we no longer partially fill tickets we need to rework
wake_all_tickets to call btrfs_try_to_wakeup_tickets() in order to see
if any subsequent tickets are able to be satisfied. If our tickets_id
changes we know something happened and we can keep flushing.
Also if we find a ticket that is smaller than the first ticket in our
queue then we want to retry the flushing loop again in case
may_commit_transaction() decides we could satisfy the ticket by
committing the transaction.
Rename this to maybe_fail_all_tickets() while we're at it, to better
reflect what the function is actually doing.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that btrfs_space_info_add_old_bytes simply checks if we can make the
reservation and updates bytes_may_use, there's no reason to have both
helpers in place.
Factor out the ticket wakeup logic into it's own helper, make
btrfs_space_info_add_old_bytes() update bytes_may_use and then call the
wakeup helper, and replace all calls to btrfs_space_info_add_new_bytes()
with the wakeup helper.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_space_info_add_old_bytes is used when adding the extra space from
an existing reservation back into the space_info to be used by any
waiting tickets. In order to keep us from overcommitting we check to
make sure that we can still use this space for our reserve ticket, and
if we cannot we'll simply subtract it from space_info->bytes_may_use.
However this is problematic, because it assumes that only changes to
bytes_may_use would affect our ability to make reservations. Any
changes to bytes_reserved would be missed. If we were unable to make a
reservation prior because of reserved space, but that reserved space was
free'd due to unlink or truncate and we were allowed to immediately
reclaim that metadata space we would still ENOSPC.
Consider the example where we create a file with a bunch of extents,
using up 2MiB of actual space for the new tree blocks. Then we try to
make a reservation of 2MiB but we do not have enough space to make this
reservation. The iput() occurs in another thread and we remove this
space, and since we did not write the blocks we simply do
space_info->bytes_reserved -= 2MiB. We would never see this because we
do not check our space info used, we just try to re-use the freed
reservations.
To fix this problem, and to greatly simplify the wakeup code, do away
with this partial refilling nonsense. Use
btrfs_space_info_add_old_bytes to subtract the reservation from
space_info->bytes_may_use, and then check the ticket against the total
used of the space_info the same way we do with the initial reservation
attempt.
This keeps the reservation logic consistent and solves the problem of
early ENOSPC in the case that we free up space in places other than
bytes_may_use and bytes_pinned. Thanks,
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I noticed when folding the trace_btrfs_space_reservation() tracepoint
into the btrfs_space_info_update_* helpers that we didn't emit a
tracepoint when doing btrfs_add_reserved_bytes(). I know this is
because we were swapping bytes_may_use for bytes_reserved, so in my mind
there was no reason to have the tracepoint there. But now there is
because we always emit the unreserve for the bytes_may_use side, and
this would have broken if compression was on anyway. Add a tracepoint
to cover the bytes_reserved counter so the math still comes out right.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We duplicate this tracepoint everywhere we call these helpers, so update
the helper to have the tracepoint as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we already have tickets on the list we don't want to steal their
reservations. This is a preparation patch for upcoming changes,
technically this shouldn't happen today because of the way we add bytes
to tickets before adding them to the space_info in most cases.
This does not change the FIFO nature of reserve tickets, it simply
allows us to enforce it in a different way. Previously it was enforced
because any new space would be added to the first ticket on the list,
which would result in new reservations getting a reserve ticket. This
replaces that mechanism by simply checking to see if we have outstanding
reserve tickets and skipping straight to adding a ticket for our
reservation.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit fee187d9d9 ("Btrfs: do not set EXTENT_DIRTY along with
EXTENT_DELALLOC"), we never set EXTENT_DIRTY in inode->io_tree, so we
can simplify and stop trying to clear it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The VFS indicates a synchronous write to ->write_iter() via
iocb->ki_flags. The IOCB_{,D}SYNC flags may be set based on the file
(see iocb_flags()) or the RWF_* flags passed to a syscall like
pwritev2() (see kiocb_set_rw_flags()).
However, in btrfs_file_write_iter(), we're checking if a write is
synchronous based only on the file; we use this to decide when to bump
the sync_writers counter and thus do CRCs synchronously. Make sure we do
this for all synchronous writes as determined by the VFS.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add const ]
Signed-off-by: David Sterba <dsterba@suse.com>
generic_write_checks() may modify iov_iter_count(), so we must get the
count after the call, not before. Using the wrong one has a couple of
consequences:
1. We check a longer range in check_can_nocow() for nowait than we're
actually writing.
2. We create extra hole extent maps in btrfs_cont_expand(). As far as I
can tell, this is harmless, but I might be missing something.
These issues are pretty minor, but let's fix it before something more
important trips on it.
Fixes: edf064e7c6 ("btrfs: nowait aio support")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Further simplifaction of the get/set helpers is possible when the token
is uniquely tied to an extent buffer. A condition and an assignment can
be avoided.
The initializations are moved closer to the first use when the extent
buffer is valid. There's one exception in __push_leaf_left where the
token is reused.
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we can safely assume that the token is always a valid pointer,
remove the branches that check that.
Signed-off-by: David Sterba <dsterba@suse.com>
There are helpers for all type widths defined via macro and optionally
can use a token which is a cached pointer to avoid repeated mapping of
the extent buffer.
The token value is known at compile time, when it's valid it's always
address of a local variable, otherwise it's NULL passed by the
token-less helpers.
This can be utilized to remove some branching as the helpers are used
frequenlty.
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_find_name_in_ext_backref returns either 0/1 depending on whether it
found a backref for the given name. If it returns true then the actual
inode_ref struct is returned in one of its parameters. That's pointless,
instead refactor the function such that it returns either a pointer
to the btrfs_inode_extref or NULL it it didn't find anything. This
streamlines the function calling convention.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_find_name_in_backref returns either 0/1 depending on whether it
found a backref for the given name. If it returns true then the actual
inode_ref struct is returned in one of its parameters. That's pointless,
instead refactor the function such that it returns either a pointer
to the btrfs_inode_ref or NULL it it didn't find anything. This
streamlines the function calling convention.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The other dev stats functions are already there and the helpers are not
used by anything else.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The io_ctl structure is used for free space management, and used only by
the v1 space cache code, but unfortunatlly the full definition is
required by block-group.h so it can't be moved to free-space-cache.c
without additional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Send is the only user of tree_compare, we can move it there along with
the other helpers and definitions.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory work for code that will be moved out of ctree and uses this
function.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The file ctree.h serves as a header for everything and has become quite
bloated. Split some helpers that are generic and create a new file that
should be the catch-all for code that's not btrfs-specific.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix the fake ENOMEM return error code to the actual error in
clone_fs_devices().
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In a corrupted tree, if search for next devid finds the device with
devid = -1, then report the error -EUCLEAN back to the parent function
to fail gracefully.
The tree checker will not catch this in case the devids are created
using the following script:
umount /btrfs
dev1=/dev/sdb
dev2=/dev/sdc
mkfs.btrfs -fq -dsingle -msingle $dev1
mount $dev1 /btrfs
_fail()
{
echo $1
exit 1
}
while true; do
btrfs dev add -f $dev2 /btrfs || _fail "add failed"
btrfs dev del $dev1 /btrfs || _fail "del failed"
dev_tmp=$dev1
dev1=$dev2
dev2=$dev_tmp
done
With output:
BTRFS critical (device sdb): corrupt leaf: root=3 block=313739198464 slot=1 devid=1 invalid devid: has=507 expect=[0, 506]
BTRFS error (device sdb): block=313739198464 write time tree block corruption detected
BTRFS: error (device sdb) in btrfs_commit_transaction:2268: errno=-5 IO failure (Error while writing out transaction)
BTRFS warning (device sdb): Skipping commit of aborted transaction.
BTRFS: error (device sdb) in cleanup_transaction:1827: errno=-5 IO failure
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ add script and messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
Various notifications of type "BUG kmalloc-4096 () : Redzone
overwritten" have been observed recently in various parts of the kernel.
After some time, it has been made a relation with the use of BTRFS
filesystem and with SLUB_DEBUG turned on.
[ 22.809700] BUG kmalloc-4096 (Tainted: G W ): Redzone overwritten
[ 22.810286] INFO: 0xbe1a5921-0xfbfc06cd. First byte 0x0 instead of 0xcc
[ 22.810866] INFO: Allocated in __load_free_space_cache+0x588/0x780 [btrfs] age=22 cpu=0 pid=224
[ 22.811193] __slab_alloc.constprop.26+0x44/0x70
[ 22.811345] kmem_cache_alloc_trace+0xf0/0x2ec
[ 22.811588] __load_free_space_cache+0x588/0x780 [btrfs]
[ 22.811848] load_free_space_cache+0xf4/0x1b0 [btrfs]
[ 22.812090] cache_block_group+0x1d0/0x3d0 [btrfs]
[ 22.812321] find_free_extent+0x680/0x12a4 [btrfs]
[ 22.812549] btrfs_reserve_extent+0xec/0x220 [btrfs]
[ 22.812785] btrfs_alloc_tree_block+0x178/0x5f4 [btrfs]
[ 22.813032] __btrfs_cow_block+0x150/0x5d4 [btrfs]
[ 22.813262] btrfs_cow_block+0x194/0x298 [btrfs]
[ 22.813484] commit_cowonly_roots+0x44/0x294 [btrfs]
[ 22.813718] btrfs_commit_transaction+0x63c/0xc0c [btrfs]
[ 22.813973] close_ctree+0xf8/0x2a4 [btrfs]
[ 22.814107] generic_shutdown_super+0x80/0x110
[ 22.814250] kill_anon_super+0x18/0x30
[ 22.814437] btrfs_kill_super+0x18/0x90 [btrfs]
[ 22.814590] INFO: Freed in proc_cgroup_show+0xc0/0x248 age=41 cpu=0 pid=83
[ 22.814841] proc_cgroup_show+0xc0/0x248
[ 22.814967] proc_single_show+0x54/0x98
[ 22.815086] seq_read+0x278/0x45c
[ 22.815190] __vfs_read+0x28/0x17c
[ 22.815289] vfs_read+0xa8/0x14c
[ 22.815381] ksys_read+0x50/0x94
[ 22.815475] ret_from_syscall+0x0/0x38
Commit 69d2480456 ("btrfs: use copy_page for copying pages instead of
memcpy") changed the way bitmap blocks are copied. But allthough bitmaps
have the size of a page, they were allocated with kzalloc().
Most of the time, kzalloc() allocates aligned blocks of memory, so
copy_page() can be used. But when some debug options like SLAB_DEBUG are
activated, kzalloc() may return unaligned pointer.
On powerpc, memcpy(), copy_page() and other copying functions use
'dcbz' instruction which provides an entire zeroed cacheline to avoid
memory read when the intention is to overwrite a full line. Functions
like memcpy() are writen to care about partial cachelines at the start
and end of the destination, but copy_page() assumes it gets pages. As
pages are naturally cache aligned, copy_page() doesn't care about
partial lines. This means that when copy_page() is called with a
misaligned pointer, a few leading bytes are zeroed.
To fix it, allocate bitmaps through kmem_cache instead of using kzalloc()
The cache pool is created with PAGE_SIZE alignment constraint.
Reported-by: Erhard F. <erhard_f@mailbox.org>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204371
Fixes: 69d2480456 ("btrfs: use copy_page for copying pages instead of memcpy")
Cc: stable@vger.kernel.org # 4.19+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_free_space_bitmap ]
Signed-off-by: David Sterba <dsterba@suse.com>
Support for asynchronous snapshot creation was originally added in
72fd032e94 ("Btrfs: add SNAP_CREATE_ASYNC ioctl") to cater for
ceph's backend needs. However, since Ceph has deprecated support for
btrfs there is no longer need for that support in btrfs. Additionally,
this was never supported by btrfs-progs, the official userspace tools.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Correctly handle failure cases when adding an ordered extents in case
of REGULAR or PREALLOC extents. Remove the BUG_ON.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a comment explaining why we keep the BUG also use the already read
and cached value of extent ram bytes stored in 'ram_bytes'.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent range check right after the "out_check" label is redundant,
because the only way it can trigger is if we have an inline extent. In
this case it makes more sense to actually move it in the branch
explictly dealing with inlines extents.
What's more, the nested 'if (nocow)' can never be true because for
inline extents we always do COW and there is no chance 'nocow' can be
true, just remove that check.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no point in checking the type of the extent again just to set
the 'type' variable, when this check has already been performed before.
Instead, extend the original if branch with an 'else' clause. This
allows to remove one local variable and make it obvious how the code
flow differs for prealloc/regular extents.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
run_delalloc_nocow contains numerous, somewhat subtle, checks when
figuring out whether a particular extent should be CoW'ed or not. This
patch explicitly states the assumptions those checks verify. As a
result also document 2 of the more subtle checks in check_committed_ref
as well.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Of the 22 (!!!) local variables declared in this function only 9 have
function-wide context. Of the remaining 13, 12 are needed in the main
while loop of the function and 1 is needed in a tiny if branch, only in
case we have prealloc extent. This commit reduces the lifespan of every
variable to its bare minimum. It also renames the 'nolock' boolean to
freespace_inode to clearly indicate its purpose.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Historically we reserved worst case for every btree operation, and
generally speaking we want to do that in cases where it could be the
worst case. However for updating inodes we know the inode items are
already in the tree, so it will only be an update operation and never an
insert operation. This allows us to always reserve only the
metadata_size amount for inode updates rather than the
insert_metadata_size amount.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_calc_trunc_metadata_size differs from trans_metadata_size in that
it doesn't take into account any splitting at the levels, because
truncate will never split nodes. However truncate _and_ changing will
never split nodes, so rename btrfs_calc_trunc_metadata_size to
btrfs_calc_metadata_size. Also btrfs_calc_trans_metadata_size is purely
for inserting items, so rename this to btrfs_calc_insert_metadata_size.
Making these clearer will help when I start using them differently in
upcoming patches.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
EXTENT_DATA_REF is a little like DIR_ITEM which contains hash in its
key->offset.
This patch will check the following contents:
- Key->objectid
Basic alignment check.
- Hash
Hash of each extent_data_ref item must match key->offset.
- Offset
Basic alignment check.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For TREE_BLOCK_REF, SHARED_DATA_REF and SHARED_BLOCK_REF we need to
check:
| TREE_BLOCK_REF | SHARED_BLOCK_REF | SHARED_BLOCK_REF
--------------+----------------+-----------------+------------------
key->objectid | Alignment | Alignment | Alignment
key->offset | Any value | Alignment | Alignment
item_size | 0 | 0 | sizeof(le32) (*)
*: sizeof(struct btrfs_shared_data_ref)
So introduce a check to check all these 3 key types together.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch introduces the ability to check extent items.
This check involves:
- key->objectid check
Basic alignment check.
- key->type check
Against btrfs_extent_item::type and SKINNY_METADATA feature.
- key->offset alignment check for EXTENT_ITEM
- key->offset check for METADATA_ITEM
- item size check
Both against minimal size and stepping check.
- btrfs_extent_item check
Checks its flags and generation.
- btrfs_extent_inline_ref checks
Against 4 types inline ref.
Checks bytenr alignment and tree level.
- btrfs_extent_item::refs check
Check against total refs found in inline refs.
This check would be the most complex single item check due to its nature
of inlined items.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_get_chunk_map() never returns NULL, it returns error pointers.
Fixes: 89b798ad1b ("btrfs: Use btrfs_get_io_geometry appropriately")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the function btrfs_init_dev_stats() goto out is not needed, because the
alloc has failed. So just return -ENOMEM.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
%found_key is not used, drop it since it hasn't been used since the
beginning in 733f4fbbc1 ("Btrfs: read device stats on mount, write
modified ones during commit").
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is used only for the readahead machinery. It makes no
sense to keep it external to reada.c file. Place it above its sole
caller and make it static. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The set_level callbacks do not do anything special and can be replaced
by a helper that uses the levels defined in the tables.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The maximum and default levels do not change and can be defined
directly. The set_level callback was a temporary solution and will be
removed.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The BTRFS_DEV_REPLACE_ITEM_STATE_x defines, as shown in [1], are
unused in both kernel and btrfs-progs (except for one instance of
BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED in kernel).
[1]
btrfs.h:#define BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED 2
btrfs.h:#define BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED 3
btrfs.h:#define BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED 4
Further these define-values are different form its counterpart
BTRFS_IOCTL_DEV_REPLACE_STATE_x series as shown in [2].
[2]
btrfs_tree.h:#define BTRFS_DEV_REPLACE_ITEM_STATE_SUSPENDED 2
btrfs_tree.h:#define BTRFS_DEV_REPLACE_ITEM_STATE_FINISHED 3
btrfs_tree.h:#define BTRFS_DEV_REPLACE_ITEM_STATE_CANCELED 4
So this patch deletes the BTRFS_DEV_REPLACE_ITEM_STATE_x altogether, and
one instance of BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED is replaced
with BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED in the kernel.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this weird space flushing loop inside inode.c for evict where
we'll do the normal LIMIT flush, and then commit the transaction and
hope we get our space. This is super janky, and in fact there's really
nothing stopping us from using FLUSH_ALL except that we run delayed
iputs, which means we could deadlock. So introduce a new flush state
for eviction that does the normal priority flushing with all of the
states that are safe for eviction.
The nice side-effect of this is that we'll try harder for evictions.
Previously if (for example generic/269) you had a bunch of other
operations happening on the fs you could race with those reservations
when committing the transaction, and eventually miss getting a
reservation for the evict. With this code we'll have our ticket in
place through the transaction commit, so any pinned bytes will go to our
pending evictions first.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the eviction flushing stuff we'll want to allow for different
states, but still work basically the same way that
priority_reclaim_metadata_space works currently. Refactor this to take
the flushing states and size as an argument so we can use the same logic
for limit flushing and eviction flushing.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to make this logic a little more complicated for evict, so
factor the ticket flushing/waiting code out of __reserve_metadata_bytes.
This has no functional change.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we handle the cleanup of errored out tickets in both the
priority flush path and the normal flushing path. This is the same code
in both places, so just refactor so we don't duplicate the cleanup work.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Delayed iputs could very well free up enough space without needing to
commit the transaction, so make this step it's own step. This will
allow us to skip the step for evictions in a later patch.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These were renamed and exported to facilitate logical migration of
different code chunks into block-group.c. Now that all the users are in
one file go ahead and rename them back, move the code around, and make
them static.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This can now be easily migrated as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh on top of sysfs cleanups ]
Signed-off-by: David Sterba <dsterba@suse.com>
These feel more at home in block-group.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh, adjust btrfs_get_alloc_profile exports ]
Signed-off-by: David Sterba <dsterba@suse.com>
This feels more at home in block-group.c than in extent-tree.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>i
[ refresh ]
Signed-off-by: David Sterba <dsterba@suse.com>
We can now easily migrate this code as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Want to move these functions into block-group.c, so export them.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This can be easily migrated over now.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
This can easily be moved now.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh ]
Signed-off-by: David Sterba <dsterba@suse.com>
This gets used by a few different logical chunks of the block group
code, export it while we move things around.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All of the prep work has been done so we can now cleanly move this chunk
over.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh, add btrfs_get_alloc_profile export, comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
This is the removal code and the unused bgs code.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh, move clear_incompat_bg_bits ]
Signed-off-by: David Sterba <dsterba@suse.com>
This is used in a few logical parts of the block group code, temporarily
export it so we can move things in pieces.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can now just copy it over to block-group.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The kobject should be pulled in via sysfs.h and that needs to include it
because it needs various definitions like kobj_attribute or kobject.
Signed-off-by: David Sterba <dsterba@suse.com>
The helpers to create block group and space info directories already
live in sysfs.c, move the deletion part there too.
Signed-off-by: David Sterba <dsterba@suse.com>
The last non-sysfs usage of space_info_ktype has been moved to a private
helper in previous patch so the variable can be made static.
Signed-off-by: David Sterba <dsterba@suse.com>
The last non-sysfs usage of btrfs_raid_ktype has been moved to a private
helper in previous patch so the variable can be made static.
Signed-off-by: David Sterba <dsterba@suse.com>
The part of link_block_group that just creates the sysfs object is
independent and can be factored out to a helper.
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_reset_dev_stats() is a small helper function to reset devices stat
values, and is used only once, instead just open code it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_dev_stat_reset() is an overdo in terms of wrapping. So this patch
open codes btrfs_dev_stat_reset().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we try to delete qgroups, we're pretty cautious, we make sure both
qgroups exist and there is a relationship between them, then try to
delete the relation.
This behavior is OK, but the problem is we need to two relation items,
and if we failed the first item deletion, we error out, leaving the
other relation item in qgroup tree.
Sometimes the error from del_qgroup_relation_item() could just be
-ENOENT, thus we can ignore that error and continue without any problem.
Further more, such cautious behavior makes qgroup relation deletion
impossible for orphan relation items.
This patch will enhance __del_qgroup_relation():
- If both qgroups and their relation items exist
Go the regular deletion routine and update their accounting if needed.
- If any qgroup or relation item doesn't exist
Then we still try to delete the orphan items anyway, but don't trigger
the accounting update.
By this, we try our best to remove relation items, and can handle orphan
relation items properly, while still keep the existing behavior for good
qgroup tree.
Reported-by: Andrei Borzenkov <arvidjaar@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If any call to find_first_clear_extent_bit() returns an unexpected result,
the test should fail and not just print an error message, otherwise it
makes detection of regressions much harder to notice.
Fixes: 1eaebb341d ("btrfs: Don't trim returned range based on input value in find_first_clear_extent_bit")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The test creates an extent io tree and sets several ranges with the
CHUNK_ALLOCATED and CHUNK_TRIMMED bits, resulting in the allocation of
several extent state structures. However the test never clears those
ranges, resulting in memory leaks of the extent state structures.
This is detected when CONFIG_BTRFS_DEBUG is set once we remove the
btrfs module (rmmod btrfs):
[57399.787918] BTRFS: state leak: start 67108864 end 75497471 state 1 in tree 1 refs 1
[57399.790155] BTRFS: state leak: start 33554432 end 67108863 state 33 in tree 1 refs 1
[57399.791941] BTRFS: state leak: start 1048576 end 4194303 state 33 in tree 1 refs 1
[57399.793753] BTRFS: state leak: start 67108864 end 75497471 state 1 in tree 1 refs 1
[57399.795188] BTRFS: state leak: start 33554432 end 67108863 state 33 in tree 1 refs 1
[57399.796453] BTRFS: state leak: start 1048576 end 4194303 state 33 in tree 1 refs 1
[57399.797765] BTRFS: state leak: start 67108864 end 75497471 state 1 in tree 1 refs 1
[57399.799049] BTRFS: state leak: start 33554432 end 67108863 state 33 in tree 1 refs 1
[57399.800142] BTRFS: state leak: start 1048576 end 4194303 state 33 in tree 1 refs 1
[57399.801126] BTRFS: state leak: start 67108864 end 75497471 state 1 in tree 1 refs 1
[57399.802106] BTRFS: state leak: start 33554432 end 67108863 state 33 in tree 1 refs 1
[57399.803119] BTRFS: state leak: start 1048576 end 4194303 state 33 in tree 1 refs 1
[57399.804153] BTRFS: state leak: start 67108864 end 75497471 state 1 in tree 1 refs 1
[57399.805196] BTRFS: state leak: start 33554432 end 67108863 state 33 in tree 1 refs 1
[57399.806191] BTRFS: state leak: start 1048576 end 4194303 state 33 in tree 1 refs 1
The start and end offsets reported correspond exactly to the ranges
used by the test.
So fix that by clearing all the ranges when the test finishes.
Fixes: 1eaebb341d ("btrfs: Don't trim returned range based on input value in find_first_clear_extent_bit")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add 'debug' directories to global sysfs and per-filesystem. This will
replace the debugfs directory. The sysfs location is simpler and builds
on top of the existing file hierarchy so there will hopefully be no more
questions about the sample debugfs file.
The directory is called 'debug' and only present under
CONFIG_BTRFS_DEBUG so this will not affect productions builds.
Signed-off-by: David Sterba <dsterba@suse.com>
extent-tree.c has a find_next_key that just walks up the path to find
the next key, but it is used for both the caching stuff and the snapshot
delete stuff. The snapshot deletion stuff is special so it can't really
use btrfs_find_next_key, but the caching thread stuff can. We just need
to fix btrfs_find_next_key to deal with ->skip_locking and then it works
exactly the same as the private find_next_key helper.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is used in caching and reading block groups, so export it while we
move these chunks independently.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Man a lot of people use this stuff.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We'll need this to move the caching stuff around.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This will make it so we can move them easily.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ coding style updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
These are relatively straightforward as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Another easy set to move over to block-group.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these bits first as they are the easiest to move. Export two of
the helpers so they can be moved all at once.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
This is prep work for moving all of the block group cache code into its
own file.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
This is prep work for moving block_group_cache around. Having this in
the header file makes the header file include need to be in a certain
order, which is awkward, so just move it into free-space-cache.c and
then we can re-arrange later.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Used only for in-memory state tracking.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The switch to open coded set/get has happend long time ago in
962a298f35 ("btrfs: kill the key type accessor helpers"), remove the
stray helpers.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The status of flush bio is tracked as a status bit, changed in commit
1c3063b6db ("btrfs: cleanup device states define
BTRFS_DEV_STATE_FLUSH_SENT"), the flush_bio_sent was forgotten.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bulk of the work done when cloning extents, at ioctl.c:btrfs_clone(),
is done inside an if statement that checks if the found key has the type
BTRFS_EXTENT_DATA_KEY. That if statement is redundant however, because
btrfs_search_slot() always leaves us in a leaf slot that points to a key
that is always greater then or equals to the search key, and our search
key here has that type, and because we bail out before that if statement
if the key at the given leaf slot is greater then BTRFS_EXTENT_DATA_KEY.
Therefore just remove that if statement, not only because it is useless
but mostly because it increases the nesting/indentation level in this
function which is quite big and makes things a bit awkward whenever I need
to fix something that requires changing btrfs_clone() (and it has been
like that for many years already).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Simplify the code by removing variables that don't bring any real value
as well as simplifying the checks when buidling the candidate list of
devices. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
join_running_log_trans checks btrfs_root::log_root outside of
btrfs_root::log_mutex to avoid contention on the mutex. Turns out this
check is not necessary because the two callers of join_running_log_trans
(both of which deal with removing entries from the tree-log during
unlink) explicitly check whether the respective inode has been logged in
the current transaction.
If it hasn't then it won't have any items in the tree-log and call path
will return before calling join_running_log_trans. If the check passes,
however, then it's guaranteed that btrfs_root::log_root is set because
the inode is logged.
Those guarantees allows us to remove the speculative as well as the
implicity and tricky memory barrier.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we need to start an inode caching thread, because none currently exists
on disk, we can wake up all waiters as soon as we mark the range starting
at root's highest objectid + 1 and ending at BTRFS_LAST_FREE_OBJECTID as
free, so that they don't need to wait for the caching thread to start and
do some progress. We follow the same approach within the caching thread,
since as soon as it finds a free range and marks it as free space in the
cache, it wakes up all waiters. So improve this by adding such a wakeup
call after marking that initial range as free space.
Fixes: a47d6b70e2 ("Btrfs: setup free ino caching in a more asynchronous way")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the caching thread fails to allocate a path, it returns without waking
up any cache waiters, leaving them hang forever. Fix this by following the
same approach as when we fail to start the caching thread: print an error
message, disable inode caching and make the wakers fallback to non-caching
mode behaviour (calling btrfs_find_free_objectid()).
Fixes: 581bb05094 ("Btrfs: Cache free inode numbers in memory")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to start the inode caching thread, we print an error message
and disable the inode cache, however we never wake up any waiters, so they
hang forever waiting for the caching to finish. Fix this by waking them
up and have them fallback to a call to btrfs_find_free_objectid().
Fixes: e60efa8425 ("Btrfs: avoid triggering bug_on() when we fail to start inode caching task")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we are able to load an existing inode cache off disk, we set the state
of the cache to BTRFS_CACHE_FINISHED, but we don't wake up any one waiting
for the cache to be available. This means that anyone waiting for the
cache to be available, waiting on the condition that either its state is
BTRFS_CACHE_FINISHED or its available free space is greather than zero,
can hang forever.
This could be observed running fstests with MOUNT_OPTIONS="-o inode_cache",
in particular test case generic/161 triggered it very frequently for me,
producing a trace like the following:
[63795.739712] BTRFS info (device sdc): enabling inode map caching
[63795.739714] BTRFS info (device sdc): disk space caching is enabled
[63795.739716] BTRFS info (device sdc): has skinny extents
[64036.653886] INFO: task btrfs-transacti:3917 blocked for more than 120 seconds.
[64036.654079] Not tainted 5.2.0-rc4-btrfs-next-50 #1
[64036.654143] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64036.654232] btrfs-transacti D 0 3917 2 0x80004000
[64036.654239] Call Trace:
[64036.654258] ? __schedule+0x3ae/0x7b0
[64036.654271] schedule+0x3a/0xb0
[64036.654325] btrfs_commit_transaction+0x978/0xae0 [btrfs]
[64036.654339] ? remove_wait_queue+0x60/0x60
[64036.654395] transaction_kthread+0x146/0x180 [btrfs]
[64036.654450] ? btrfs_cleanup_transaction+0x620/0x620 [btrfs]
[64036.654456] kthread+0x103/0x140
[64036.654464] ? kthread_create_worker_on_cpu+0x70/0x70
[64036.654476] ret_from_fork+0x3a/0x50
[64036.654504] INFO: task xfs_io:3919 blocked for more than 120 seconds.
[64036.654568] Not tainted 5.2.0-rc4-btrfs-next-50 #1
[64036.654617] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64036.654685] xfs_io D 0 3919 3633 0x00000000
[64036.654691] Call Trace:
[64036.654703] ? __schedule+0x3ae/0x7b0
[64036.654716] schedule+0x3a/0xb0
[64036.654756] btrfs_find_free_ino+0xa9/0x120 [btrfs]
[64036.654764] ? remove_wait_queue+0x60/0x60
[64036.654809] btrfs_create+0x72/0x1f0 [btrfs]
[64036.654822] lookup_open+0x6bc/0x790
[64036.654849] path_openat+0x3bc/0xc00
[64036.654854] ? __lock_acquire+0x331/0x1cb0
[64036.654869] do_filp_open+0x99/0x110
[64036.654884] ? __alloc_fd+0xee/0x200
[64036.654895] ? do_raw_spin_unlock+0x49/0xc0
[64036.654909] ? do_sys_open+0x132/0x220
[64036.654913] do_sys_open+0x132/0x220
[64036.654926] do_syscall_64+0x60/0x1d0
[64036.654933] entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fix this by adding a wake_up() call right after setting the cache state to
BTRFS_CACHE_FINISHED, at start_caching(), when we are able to load the
cache from disk.
Fixes: 82d5902d9c ("Btrfs: Support reading/writing on disk free ino cache")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will introduce ROOT_ITEM check, which includes:
- Key->objectid and key->offset check
Currently only some easy check, e.g. 0 as rootid is invalid.
- Item size check
Root item size is fixed.
- Generation checks
Generation, generation_v2 and last_snapshot should not be greater than
super generation + 1
- Level and alignment check
Level should be in [0, 7], and bytenr must be aligned to sector size.
- Flags check
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203261
Reported-by: Jungyeon Yoon <jungyeon.yoon@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With fuzzed image and MIXED_GROUPS super flag, we can hit the following
BUG_ON():
kernel BUG at fs/btrfs/delayed-ref.c:491!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 1849 Comm: sync Tainted: G O 5.2.0-custom #27
RIP: 0010:update_existing_head_ref.cold+0x44/0x46 [btrfs]
Call Trace:
add_delayed_ref_head+0x20c/0x2d0 [btrfs]
btrfs_add_delayed_tree_ref+0x1fc/0x490 [btrfs]
btrfs_free_tree_block+0x123/0x380 [btrfs]
__btrfs_cow_block+0x435/0x500 [btrfs]
btrfs_cow_block+0x110/0x240 [btrfs]
btrfs_search_slot+0x230/0xa00 [btrfs]
? __lock_acquire+0x105e/0x1e20
btrfs_insert_empty_items+0x67/0xc0 [btrfs]
alloc_reserved_file_extent+0x9e/0x340 [btrfs]
__btrfs_run_delayed_refs+0x78e/0x1240 [btrfs]
? kvm_clock_read+0x18/0x30
? __sched_clock_gtod_offset+0x21/0x50
btrfs_run_delayed_refs.part.0+0x4e/0x180 [btrfs]
btrfs_run_delayed_refs+0x23/0x30 [btrfs]
btrfs_commit_transaction+0x53/0x9f0 [btrfs]
btrfs_sync_fs+0x7c/0x1c0 [btrfs]
? __ia32_sys_fdatasync+0x20/0x20
sync_fs_one_sb+0x23/0x30
iterate_supers+0x95/0x100
ksys_sync+0x62/0xb0
__ia32_sys_sync+0xe/0x20
do_syscall_64+0x65/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe
[CAUSE]
This situation is caused by several factors:
- Fuzzed image
The extent tree of this fs missed one backref for extent tree root.
So we can allocated space from that slot.
- MIXED_BG feature
Super block has MIXED_BG flag.
- No mixed block groups exists
All block groups are just regular ones.
This makes data space_info->block_groups[] contains metadata block
groups. And when we reserve space for data, we can use space in
metadata block group.
Then we hit the following file operations:
- fallocate
We need to allocate data extents.
find_free_extent() choose to use the metadata block to allocate space
from, and choose the space of extent tree root, since its backref is
missing.
This generate one delayed ref head with is_data = 1.
- extent tree update
We need to update extent tree at run_delayed_ref time.
This generate one delayed ref head with is_data = 0, for the same
bytenr of old extent tree root.
Then we trigger the BUG_ON().
[FIX]
The quick fix here is to check block_group->flags before using it.
The problem can only happen for MIXED_GROUPS fs. Regular filesystems
won't have space_info with DATA|METADATA flag, and no way to hit the
bug.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203255
Reported-by: Jungyeon Yoon <jungyeon.yoon@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is one report of fuzzed image which leads to BUG_ON() in
btrfs_delete_delayed_dir_index().
Although that fuzzed image can already be addressed by enhanced
extent-tree error handler, it's still better to hunt down more BUG_ON().
This patch will hunt down two BUG_ON()s in
btrfs_delete_delayed_dir_index():
- One for error from btrfs_delayed_item_reserve_metadata()
Instead of BUG_ON(), we output an error message and free the item.
And return the error.
All callers of this function handles the error by aborting current
trasaction.
- One for possible EEXIST from __btrfs_add_delayed_deletion_item()
That function can return -EEXIST.
We already have a good enough error message for that, only need to
clean up the reserved metadata space and allocated item.
To help above cleanup, also modifiy __btrfs_remove_delayed_item() called
in btrfs_release_delayed_item(), to skip unassociated item.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203253
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Test case btrfs/156 fails since commit 302167c50b ("btrfs: don't end
the transaction for delayed refs in throttle") with ENOSPC.
[CAUSE]
The ENOSPC is reported from btrfs_can_relocate().
This function will check:
- If this block group is empty, we can relocate
- If we can enough free space, we can relocate
Above checks are valid but the following check is vague due to its
implementation:
- If and only if we can allocated a new block group to contain all the
used space, we can relocate
This design itself is OK, but the way to determine if we can allocate a
new block group is problematic.
btrfs_can_relocate() uses find_free_dev_extent() to find free space on a
device.
However find_free_dev_extent() only searches commit root and excludes
dev extents allocated in current trans, this makes it unable to use dev
extent just freed in current transaction.
So for the following example, btrfs_can_relocate() will report ENOSPC:
The example block group layout:
1M 129M 257M 385M 513M 550M
|///////|///////////|//////////| | |
// = Used bg, consider all bg is 100% used for easy calculation.
And all block groups are SINGLE, on-disk bytenr is the same as the
logical bytenr.
1) Bg in [129M, 257M) get relocated to [385M, 513M), transid=100
1M 129M 257M 385M 513M 550M
|///////| |//////////|/////////|
In transid 100, bg in [129M, 257M) get relocated to [385M, 513M)
However transid 100 is not committed yet, so in dev commit tree, we
still have the old dev extents layout:
1M 129M 257M 385M 513M 550M
|///////|///////////|//////////| | |
2) Try to relocate bg [257M, 385M)
We goes into btrfs_can_relocate(), no free space in current bgs, so we
check if we can find large enough free dev extents.
The first slot is [385M, 513M), but that is already used by new bg at
[385M, 513M), so we continue search.
The remaining slot is [512M, 550M), smaller than the bg's length 128M.
So btrfs_can_relocate report ENOSPC.
However this is over killed, in fact if we just skip btrfs_can_relocate()
check, and go into regular relocation routine, at extent reservation time,
if we can't find free extent, then we fallback to commit transaction,
which will free up the dev extents and allow new block group to be created.
[FIX]
The fix here is to remove btrfs_can_relocate() completely.
If we hit the false ENOSPC case just like btrfs/156, extent allocator
will push harder by committing transaction and we will have space for
new block group, avoiding the false ENOSPC.
If we really ran out of space, we will hit ENOSPC at
relocate_block_group(), and btrfs will just reports the ENOSPC error as
usual.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
inc_block_group_ro() is only designed to mark one block group read-only,
it doesn't really care if other block groups have enough free space to
contain the used space in the block group.
However due to the close connection between this function and
relocation, sometimes we can be confused and think this function is
responsible for balance space reservation, which is not true.
Add some comment to make the functionality clear.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 6df9a95e63 ("Btrfs: make the chunk allocator completely
tree lockless") we search commit root of device tree to avoid deadlock.
This introduced a safety feature, find_free_dev_extent_start() won't
use dev extents which just get freed in current transaction.
This safety feature makes sure we won't allocate new block group using
just freed dev extents to break CoW.
However, this feature also makes find_free_dev_extent_start() not
reliable reporting free device space. Just add such comment to make
later viewer careful about this behavior.
This behavior makes one caller, btrfs_can_relocate() unreliable
determining the device free space.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is only used locally in find_free_dev_extent(), no
external callers.
So unexport it.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree is going to be modified so it must be the exclusive lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As add_extent_mapping is called from several functions, let's add the
lock annotation. The tree is going to be modified so it must be the
exclusive lock.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In insert_inline_extent(), the case that checks compressed_size > 0
and compressed_pages = NULL cannot occur, otherwise a null-pointer
dereference may occur on line 215:
cpage = compressed_pages[i];
To catch this incorrect case, an assertion is added.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's unlikely in-band dedupe is going to land so just remove any
leftovers - dedupe.h header as well as the 'dedupe' parameter to
btrfs_set_extent_delalloc.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It was added in ba8b04c1d4 ("btrfs: extend btrfs_set_extent_delalloc
and its friends to support in-band dedupe and subpage size patchset") as
a preparatory patch for in-band and subapge block size patchsets.
However neither of those are likely to be merged anytime soon and the
code has diverged significantly from the last public post of either
of those patchsets.
It's unlikely either of the patchests are going to use those preparatory
steps so just remove the variables. Since cow_file_range also took
delalloc_end to pass it to extent_clear_unlock_delalloc remove the
parameter from that function as well.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This label is only executed if compress_file_range fails to create an
inline extent. So move its code in the semantically related inline
extent handling branch. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
compress_file_range returns a void, yet uses a function parameter as a
return value. Make that more idiomatic by simply returning the number
of compressed extents directly. Also track such extents in more aptly
named variables. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I lifted the btrfs label get/set ioctls to the vfs some time ago, but
never followed up to use those common definitions directly in btrfs.
This patch does that.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Those were split out of btrfs_clear_lock_blocking_rw by
aa12c02778 ("btrfs: split btrfs_clear_lock_blocking_rw to read and write helpers")
however at that time this function was unused due to commit
5239834016 ("Btrfs: kill btrfs_clear_path_blocking"). Put the final
nail in the coffin of those 2 functions.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfsic_process_written_block() cals btrfsic_process_metablock(),
which has a fairly large stack usage due to the btrfsic_stack_frame
variable. It also calls btrfsic_test_for_metadata(), which now
needs several hundreds of bytes for its SHASH_DESC_ON_STACK().
In some configurations, we end up with both functions on the
same stack, and gcc warns about the excessive stack usage that
might cause the available stack space to run out:
fs/btrfs/check-integrity.c:1743:13: error: stack frame size of 1152 bytes in function 'btrfsic_process_written_block' [-Werror,-Wframe-larger-than=]
Marking both child functions as noinline_for_stack helps because
this guarantees that the large variables are not on the same
stack frame.
Fixes: d5178578bc ("btrfs: directly call into crypto framework for checksumming")
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/btrfs/volumes.c: In function __btrfs_map_block:
fs/btrfs/volumes.c:6023:6: warning:
variable offset set but not used [-Wunused-but-set-variable]
It is not used any more since commit 343abd1c0ca9 ("btrfs: Use
btrfs_get_io_geometry appropriately")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When cloning extents (or deduplicating) we create a transaction with a
space reservation that considers we will drop or update a single file
extent item of the destination inode (that we modify a single leaf). That
is fine for the vast majority of scenarios, however it might happen that
we need to drop many file extent items, and adjust at most two file extent
items, in the destination root, which can span multiple leafs. This will
lead to either the call to btrfs_drop_extents() to fail with ENOSPC or
the subsequent calls to btrfs_insert_empty_item() or btrfs_update_inode()
(called through clone_finish_inode_update()) to fail with ENOSPC. Such
failure results in a transaction abort, leaving the filesystem in a
read-only mode.
In order to fix this we need to follow the same approach as the hole
punching code, where we create a local reservation with 1 unit and keep
ending and starting transactions, after balancing the btree inode,
when __btrfs_drop_extents() returns ENOSPC. So fix this by making the
extent cloning call calls the recently added btrfs_punch_hole_range()
helper, which is what does the mentioned work for hole punching, and
make sure whenever we drop extent items in a transaction, we also add a
replacing file extent item, to avoid corruption (a hole) if after ending
a transaction and before starting a new one, the old transaction gets
committed and a power failure happens before we finish cloning.
A test case for fstests follows soon.
Reported-by: David Goodwin <david@codepoets.co.uk>
Link: https://lore.kernel.org/linux-btrfs/a4a4cf31-9cf4-e52c-1f86-c62d336c9cd1@codepoets.co.uk/
Reported-by: Sam Tygier <sam@tygier.co.uk>
Link: https://lore.kernel.org/linux-btrfs/82aace9f-a1e3-1f0b-055f-3ea75f7a41a0@tygier.co.uk/
Fixes: b6f3409b21 ("Btrfs: reserve sufficient space for ioctl clone")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the code that is responsible for dropping extents in a range out of
btrfs_punch_hole() into a new helper function, btrfs_punch_hole_range(),
so that later it can be used by the reflinking (extent cloning and dedup)
code to fix a ENOSPC bug.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl1ZOygACgkQxWXV+ddt
WDvokw/8CRjbN5Bjlk/JzmikH+mU28Cd7qQgahWw7Afkyh5Gzb4IJJbNXzapy988
dMMKYF2H6lxY46EZG8cF4MFjCv8L4L1eZ9CqwfZyf7MfzPL2pnLSN77QJYYebWp3
y9rc8xv+qUdIcumQP25yxXmtN0YxT5bIJiuCmpHJFNfyijHVRnXV2CxQ8nwe63/1
G25a03x6BNKtSrU3ZP0fW2VjARKzIF7i+bEy4Ew3n3dNKqAiROYecswcBedhCBkF
D9hL62uHcxQSeHi/6lAZYKpsp4g4pKEO4c92MxsI8PJtd4zgHQG7MttXsHC1F1FQ
lHcP3vxKICOOMa13W6QdytSX6uSpjeLyMTDfmvaahQmwG6I5dBN56pkGlWdDMKNn
NUCNBmV063D7Shed4W2uIafLfo3BLEwKr2pd6pOfOZHwOKPWblJ0v4KzVDLoKC7v
CA5HB9BBz4KfSyp3fkm9B5VGJW938vRHVfx55IoakL+tfs67cKDYo8QEFHDuz0P5
DrYNwKvp4a5llGQ6vdRKfeiN7vBea10795MI5vFJiGUHfY3pN99R4UUed7kaul58
n6gqYukcPtIBqm8auxK037nngA+V0N2y0ceM1/aKUGaZVlCtSmEKXseYjbaiH6fP
MEZSgZn4W5qnu2oQuKohfQsNHtY5WP9489GpByy+DS5QsbDaueg=
=ovM7
-----END PGP SIGNATURE-----
Merge tag 'for-5.3-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two fixes that popped up during testing:
- fix for sysfs-related code that adds/removes block groups, warnings
appear during several fstests in connection with sysfs updates in
5.3, the fix essentially replaces a workaround with scope NOFS and
applies to 5.2-based branch too
- add sanity check of trim range"
* tag 'for-5.3-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: trim: Check the range passed into to prevent overflow
Btrfs: fix sysfs warning and missing raid sysfs directories
Normally the range->len is set to default value (U64_MAX), but when it's
not default value, we should check if the range overflows.
And if it overflows, return -EINVAL before doing anything.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the 5.3 merge window, commit 7c7e301406 ("btrfs: sysfs: Replace
default_attrs in ktypes with groups"), we started using the member
"defaults_groups" for the kobject type "btrfs_raid_ktype". That leads
to a series of warnings when running some test cases of fstests, such
as btrfs/027, btrfs/124 and btrfs/176. The traces produced by those
warnings are like the following:
[116648.059212] kernfs: can not remove 'total_bytes', no directory
[116648.060112] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.066482] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1
(...)
[116648.069376] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.072385] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
[116648.073437] RAX: 0000000000000000 RBX: ffffffffc0c11998 RCX: 0000000000000000
[116648.074201] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
[116648.074956] RBP: ffffffffc0b9ca2f R08: 0000000000000000 R09: 0000000000000001
[116648.075708] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
[116648.076434] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
[116648.077143] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
[116648.077852] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[116648.078546] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
[116648.079235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[116648.079907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[116648.080585] Call Trace:
[116648.081262] remove_files+0x31/0x70
[116648.081929] sysfs_remove_group+0x38/0x80
[116648.082596] sysfs_remove_groups+0x34/0x70
[116648.083258] kobject_del+0x20/0x60
[116648.083933] btrfs_free_block_groups+0x405/0x430 [btrfs]
[116648.084608] close_ctree+0x19a/0x380 [btrfs]
[116648.085278] generic_shutdown_super+0x6c/0x110
[116648.085951] kill_anon_super+0xe/0x30
[116648.086621] btrfs_kill_super+0x12/0xa0 [btrfs]
[116648.087289] deactivate_locked_super+0x3a/0x70
[116648.087956] cleanup_mnt+0xb4/0x160
[116648.088620] task_work_run+0x7e/0xc0
[116648.089285] exit_to_usermode_loop+0xfa/0x100
[116648.089933] do_syscall_64+0x1cb/0x220
[116648.090567] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[116648.091197] RIP: 0033:0x7f9cdc073b37
(...)
[116648.100046] ---[ end trace 22e24db328ccadf8 ]---
[116648.100618] ------------[ cut here ]------------
[116648.101175] kernfs: can not remove 'used_bytes', no directory
[116648.101731] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.105649] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1
(...)
[116648.107461] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
(...)
[116648.109336] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
[116648.109979] RAX: 0000000000000000 RBX: ffffffffc0c119a0 RCX: 0000000000000000
[116648.110625] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
[116648.111283] RBP: ffffffffc0b9ca41 R08: 0000000000000000 R09: 0000000000000001
[116648.111940] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
[116648.112603] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
[116648.113268] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
[116648.113939] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[116648.114607] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
[116648.115286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[116648.115966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[116648.116649] Call Trace:
[116648.117326] remove_files+0x31/0x70
[116648.117997] sysfs_remove_group+0x38/0x80
[116648.118671] sysfs_remove_groups+0x34/0x70
[116648.119342] kobject_del+0x20/0x60
[116648.120022] btrfs_free_block_groups+0x405/0x430 [btrfs]
[116648.120707] close_ctree+0x19a/0x380 [btrfs]
[116648.121396] generic_shutdown_super+0x6c/0x110
[116648.122057] kill_anon_super+0xe/0x30
[116648.122702] btrfs_kill_super+0x12/0xa0 [btrfs]
[116648.123335] deactivate_locked_super+0x3a/0x70
[116648.123961] cleanup_mnt+0xb4/0x160
[116648.124586] task_work_run+0x7e/0xc0
[116648.125210] exit_to_usermode_loop+0xfa/0x100
[116648.125830] do_syscall_64+0x1cb/0x220
[116648.126463] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[116648.127080] RIP: 0033:0x7f9cdc073b37
(...)
[116648.135923] ---[ end trace 22e24db328ccadf9 ]---
These happen because, during the unmount path, we call kobject_del() for
raid kobjects that are not fully initialized, meaning that we set their
ktype (as btrfs_raid_ktype) through link_block_group() but we didn't set
their parent kobject, which is done through btrfs_add_raid_kobjects().
We have this split raid kobject setup since commit 75cb379d26
("btrfs: defer adding raid type kobject until after chunk relocation") in
order to avoid triggering reclaim during contextes where we can not
(either we are holding a transaction handle or some lock required by
the transaction commit path), so that we do the calls to kobject_add(),
which triggers GFP_KERNEL allocations, through btrfs_add_raid_kobjects()
in contextes where it is safe to trigger reclaim. That change expected
that a new raid kobject can only be created either when mounting the
filesystem or after raid profile conversion through the relocation path.
However, we can have new raid kobject created in other two cases at least:
1) During device replace (or scrub) after adding a device a to the
filesystem. The replace procedure (and scrub) do calls to
btrfs_inc_block_group_ro() which can allocate a new block group
with a new raid profile (because we now have more devices). This
can be triggered by test cases btrfs/027 and btrfs/176.
2) During a degraded mount trough any write path. This can be triggered
by test case btrfs/124.
Fixing this by adding extra calls to btrfs_add_raid_kobjects(), not only
makes things more complex and fragile, can also introduce deadlocks with
reclaim the following way:
1) Calling btrfs_add_raid_kobjects() at btrfs_inc_block_group_ro() or
anywhere in the replace/scrub path will cause a deadlock with reclaim
because if reclaim happens and a transaction commit is triggered,
the transaction commit path will block at btrfs_scrub_pause().
2) During degraded mounts it is essentially impossible to figure out where
to add extra calls to btrfs_add_raid_kobjects(), because allocation of
a block group with a new raid profile can happen anywhere, which means
we can't safely figure out which contextes are safe for reclaim, as
we can either hold a transaction handle or some lock needed by the
transaction commit path.
So it is too complex and error prone to have this split setup of raid
kobjects. So fix the issue by consolidating the setup of the kobjects in a
single place, at link_block_group(), and setup a nofs context there in
order to prevent reclaim being triggered by the memory allocations done
through the call chain of kobject_add().
Besides fixing the sysfs warnings during kobject_del(), this also ensures
the sysfs directories for the new raid profiles end up created and visible
to users (a bug that existed before the 5.3 commit 7c7e301406
("btrfs: sysfs: Replace default_attrs in ktypes with groups")).
Fixes: 75cb379d26 ("btrfs: defer adding raid type kobject until after chunk relocation")
Fixes: 7c7e301406 ("btrfs: sysfs: Replace default_attrs in ktypes with groups")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl1ETYkACgkQxWXV+ddt
WDupLA/+LXPJvxRG/Ob655sxOAqpG4+XUW7aaI5jSnlr958TtDlNFkpzrQ28NFzk
XlEe/HjgI6CH7DhNwe1IlHLqFt874fiCeIZ2jvuSVVbPAO+9T8a5BIbeUY6V4h5v
QUmbaUpVshsh2IDArU8Vc/UjycCuwrccfGkS5ydTVI6Eni4BWsOY2PMiB5ojdOTE
bpclws3ca/CMbNpGKnn3cH5aXJpENTXedXxBbp99DproZVY3YL1ZyDxc37R6+5Ua
kbZznSMZFoLjHt3vktHmiozF7W1Zdi6ExOp9NHApshlw/n08ICkY1oNfd9KSYgQs
8gTl9mZ8jbJnSfZTuHNL6LwfSZjDyT5LCTbXSj0UnaVOYwJVwopsDrrdO6AESd8S
1eFeEFQ62lEV+31xa0nsYBz5j6ejEfcvxiq14W8LawWpkxtAIsEyXockWtmXPods
29BYkstQADlCOu2WYdQ8ugseBtfEpEAvD+Rf+qqyZ6XlHMIjL1J1S+BIXQ868z1f
KZCVsepJ0dVm0siVNMFBUaM2z1PcHusxNgYd37YcJRX2t4Gql9Ku/00PJ+tRhsBa
80a5ue9p6nuZfR6XoM6Cpe3uG8qyFyPccC3ohwKEOAnHnmk2LEJYASudV4pk/P8G
Vy3o9Uv/NMMznop24r7UB3mklBQCIchKpaij3RT0Jx+3RGdR4Yo=
=jr4T
-----END PGP SIGNATURE-----
Merge tag 'for-5.3-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- tiny race window during 2 transactions aborting at the same time can
accidentally lead to a commit
- regression fix, possible deadlock during fiemap
- fix for an old bug when incremental send can fail on a file that has
been deduplicated in a special way
* tag 'for-5.3-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix deadlock between fiemap and transaction commits
Btrfs: fix race leading to fs corruption after transaction abort
Btrfs: fix incremental send failure after deduplication
The fiemap handler locks a file range that can have unflushed delalloc,
and after locking the range, it tries to attach to a running transaction.
If the running transaction started its commit, that is, it is in state
TRANS_STATE_COMMIT_START, and either the filesystem was mounted with the
flushoncommit option or the transaction is creating a snapshot for the
subvolume that contains the file that fiemap is operating on, we end up
deadlocking. This happens because fiemap is blocked on the transaction,
waiting for it to complete, and the transaction is waiting for the flushed
dealloc to complete, which requires locking the file range that the fiemap
task already locked. The following stack traces serve as an example of
when this deadlock happens:
(...)
[404571.515510] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[404571.515956] Call Trace:
[404571.516360] ? __schedule+0x3ae/0x7b0
[404571.516730] schedule+0x3a/0xb0
[404571.517104] lock_extent_bits+0x1ec/0x2a0 [btrfs]
[404571.517465] ? remove_wait_queue+0x60/0x60
[404571.517832] btrfs_finish_ordered_io+0x292/0x800 [btrfs]
[404571.518202] normal_work_helper+0xea/0x530 [btrfs]
[404571.518566] process_one_work+0x21e/0x5c0
[404571.518990] worker_thread+0x4f/0x3b0
[404571.519413] ? process_one_work+0x5c0/0x5c0
[404571.519829] kthread+0x103/0x140
[404571.520191] ? kthread_create_worker_on_cpu+0x70/0x70
[404571.520565] ret_from_fork+0x3a/0x50
[404571.520915] kworker/u8:6 D 0 31651 2 0x80004000
[404571.521290] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
(...)
[404571.537000] fsstress D 0 13117 13115 0x00004000
[404571.537263] Call Trace:
[404571.537524] ? __schedule+0x3ae/0x7b0
[404571.537788] schedule+0x3a/0xb0
[404571.538066] wait_current_trans+0xc8/0x100 [btrfs]
[404571.538349] ? remove_wait_queue+0x60/0x60
[404571.538680] start_transaction+0x33c/0x500 [btrfs]
[404571.539076] btrfs_check_shared+0xa3/0x1f0 [btrfs]
[404571.539513] ? extent_fiemap+0x2ce/0x650 [btrfs]
[404571.539866] extent_fiemap+0x2ce/0x650 [btrfs]
[404571.540170] do_vfs_ioctl+0x526/0x6f0
[404571.540436] ksys_ioctl+0x70/0x80
[404571.540734] __x64_sys_ioctl+0x16/0x20
[404571.540997] do_syscall_64+0x60/0x1d0
[404571.541279] entry_SYSCALL_64_after_hwframe+0x49/0xbe
(...)
[404571.543729] btrfs D 0 14210 14208 0x00004000
[404571.544023] Call Trace:
[404571.544275] ? __schedule+0x3ae/0x7b0
[404571.544526] ? wait_for_completion+0x112/0x1a0
[404571.544795] schedule+0x3a/0xb0
[404571.545064] schedule_timeout+0x1ff/0x390
[404571.545351] ? lock_acquire+0xa6/0x190
[404571.545638] ? wait_for_completion+0x49/0x1a0
[404571.545890] ? wait_for_completion+0x112/0x1a0
[404571.546228] wait_for_completion+0x131/0x1a0
[404571.546503] ? wake_up_q+0x70/0x70
[404571.546775] btrfs_wait_ordered_extents+0x27c/0x400 [btrfs]
[404571.547159] btrfs_commit_transaction+0x3b0/0xae0 [btrfs]
[404571.547449] ? btrfs_mksubvol+0x4a4/0x640 [btrfs]
[404571.547703] ? remove_wait_queue+0x60/0x60
[404571.547969] btrfs_mksubvol+0x605/0x640 [btrfs]
[404571.548226] ? __sb_start_write+0xd4/0x1c0
[404571.548512] ? mnt_want_write_file+0x24/0x50
[404571.548789] btrfs_ioctl_snap_create_transid+0x169/0x1a0 [btrfs]
[404571.549048] btrfs_ioctl_snap_create_v2+0x11d/0x170 [btrfs]
[404571.549307] btrfs_ioctl+0x133f/0x3150 [btrfs]
[404571.549549] ? mem_cgroup_charge_statistics+0x4c/0xd0
[404571.549792] ? mem_cgroup_commit_charge+0x84/0x4b0
[404571.550064] ? __handle_mm_fault+0xe3e/0x11f0
[404571.550306] ? do_raw_spin_unlock+0x49/0xc0
[404571.550608] ? _raw_spin_unlock+0x24/0x30
[404571.550976] ? __handle_mm_fault+0xedf/0x11f0
[404571.551319] ? do_vfs_ioctl+0xa2/0x6f0
[404571.551659] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[404571.552087] do_vfs_ioctl+0xa2/0x6f0
[404571.552355] ksys_ioctl+0x70/0x80
[404571.552621] __x64_sys_ioctl+0x16/0x20
[404571.552864] do_syscall_64+0x60/0x1d0
[404571.553104] entry_SYSCALL_64_after_hwframe+0x49/0xbe
(...)
If we were joining the transaction instead of attaching to it, we would
not risk a deadlock because a join only blocks if the transaction is in a
state greater then or equals to TRANS_STATE_COMMIT_DOING, and the delalloc
flush performed by a transaction is done before it reaches that state,
when it is in the state TRANS_STATE_COMMIT_START. However a transaction
join is intended for use cases where we do modify the filesystem, and
fiemap only needs to peek at delayed references from the current
transaction in order to determine if extents are shared, and, besides
that, when there is no current transaction or when it blocks to wait for
a current committing transaction to complete, it creates a new transaction
without reserving any space. Such unnecessary transactions, besides doing
unnecessary IO, can cause transaction aborts (-ENOSPC) and unnecessary
rotation of the precious backup roots.
So fix this by adding a new transaction join variant, named join_nostart,
which behaves like the regular join, but it does not create a transaction
when none currently exists or after waiting for a committing transaction
to complete.
Fixes: 03628cdbc6 ("Btrfs: do not start a transaction during fiemap")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When one transaction is finishing its commit, it is possible for another
transaction to start and enter its initial commit phase as well. If the
first ends up getting aborted, we have a small time window where the second
transaction commit does not notice that the previous transaction aborted
and ends up committing, writing a superblock that points to btrees that
reference extent buffers (nodes and leafs) that were not persisted to disk.
The consequence is that after mounting the filesystem again, we will be
unable to load some btree nodes/leafs, either because the content on disk
is either garbage (or just zeroes) or corresponds to the old content of a
previouly COWed or deleted node/leaf, resulting in the well known error
messages "parent transid verify failed on ...".
The following sequence diagram illustrates how this can happen.
CPU 1 CPU 2
<at transaction N>
btrfs_commit_transaction()
(...)
--> sets transaction state to
TRANS_STATE_UNBLOCKED
--> sets fs_info->running_transaction
to NULL
(...)
btrfs_start_transaction()
start_transaction()
wait_current_trans()
--> returns immediately
because
fs_info->running_transaction
is NULL
join_transaction()
--> creates transaction N + 1
--> sets
fs_info->running_transaction
to transaction N + 1
--> adds transaction N + 1 to
the fs_info->trans_list list
--> returns transaction handle
pointing to the new
transaction N + 1
(...)
btrfs_sync_file()
btrfs_start_transaction()
--> returns handle to
transaction N + 1
(...)
btrfs_write_and_wait_transaction()
--> writeback of some extent
buffer fails, returns an
error
btrfs_handle_fs_error()
--> sets BTRFS_FS_STATE_ERROR in
fs_info->fs_state
--> jumps to label "scrub_continue"
cleanup_transaction()
btrfs_abort_transaction(N)
--> sets BTRFS_FS_STATE_TRANS_ABORTED
flag in fs_info->fs_state
--> sets aborted field in the
transaction and transaction
handle structures, for
transaction N only
--> removes transaction from the
list fs_info->trans_list
btrfs_commit_transaction(N + 1)
--> transaction N + 1 was not
aborted, so it proceeds
(...)
--> sets the transaction's state
to TRANS_STATE_COMMIT_START
--> does not find the previous
transaction (N) in the
fs_info->trans_list, so it
doesn't know that transaction
was aborted, and the commit
of transaction N + 1 proceeds
(...)
--> sets transaction N + 1 state
to TRANS_STATE_UNBLOCKED
btrfs_write_and_wait_transaction()
--> succeeds writing all extent
buffers created in the
transaction N + 1
write_all_supers()
--> succeeds
--> we now have a superblock on
disk that points to trees
that refer to at least one
extent buffer that was
never persisted
So fix this by updating the transaction commit path to check if the flag
BTRFS_FS_STATE_TRANS_ABORTED is set on fs_info->fs_state if after setting
the transaction to the TRANS_STATE_COMMIT_START we do not find any previous
transaction in the fs_info->trans_list. If the flag is set, just fail the
transaction commit with -EROFS, as we do in other places. The exact error
code for the previous transaction abort was already logged and reported.
Fixes: 49b25e0540 ("btrfs: enhance transaction abort infrastructure")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing an incremental send operation we can fail if we previously did
deduplication operations against a file that exists in both snapshots. In
that case we will fail the send operation with -EIO and print a message
to dmesg/syslog like the following:
BTRFS error (device sdc): Send: inconsistent snapshot, found updated \
extent for inode 257 without updated inode item, send root is 258, \
parent root is 257
This requires that we deduplicate to the same file in both snapshots for
the same amount of times on each snapshot. The issue happens because a
deduplication only updates the iversion of an inode and does not update
any other field of the inode, therefore if we deduplicate the file on
each snapshot for the same amount of time, the inode will have the same
iversion value (stored as the "sequence" field on the inode item) on both
snapshots, therefore it will be seen as unchanged between in the send
snapshot while there are new/updated/deleted extent items when comparing
to the parent snapshot. This makes the send operation return -EIO and
print an error message.
Example reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
# Create our first file. The first half of the file has several 64Kb
# extents while the second half as a single 512Kb extent.
$ xfs_io -f -s -c "pwrite -S 0xb8 -b 64K 0 512K" /mnt/foo
$ xfs_io -c "pwrite -S 0xb8 512K 512K" /mnt/foo
# Create the base snapshot and the parent send stream from it.
$ btrfs subvolume snapshot -r /mnt /mnt/mysnap1
$ btrfs send -f /tmp/1.snap /mnt/mysnap1
# Create our second file, that has exactly the same data as the first
# file.
$ xfs_io -f -c "pwrite -S 0xb8 0 1M" /mnt/bar
# Create the second snapshot, used for the incremental send, before
# doing the file deduplication.
$ btrfs subvolume snapshot -r /mnt /mnt/mysnap2
# Now before creating the incremental send stream:
#
# 1) Deduplicate into a subrange of file foo in snapshot mysnap1. This
# will drop several extent items and add a new one, also updating
# the inode's iversion (sequence field in inode item) by 1, but not
# any other field of the inode;
#
# 2) Deduplicate into a different subrange of file foo in snapshot
# mysnap2. This will replace an extent item with a new one, also
# updating the inode's iversion by 1 but not any other field of the
# inode.
#
# After these two deduplication operations, the inode items, for file
# foo, are identical in both snapshots, but we have different extent
# items for this inode in both snapshots. We want to check this doesn't
# cause send to fail with an error or produce an incorrect stream.
$ xfs_io -r -c "dedupe /mnt/bar 0 0 512K" /mnt/mysnap1/foo
$ xfs_io -r -c "dedupe /mnt/bar 512K 512K 512K" /mnt/mysnap2/foo
# Create the incremental send stream.
$ btrfs send -p /mnt/mysnap1 -f /tmp/2.snap /mnt/mysnap2
ERROR: send ioctl failed with -5: Input/output error
This issue started happening back in 2015 when deduplication was updated
to not update the inode's ctime and mtime and update only the iversion.
Back then we would hit a BUG_ON() in send, but later in 2016 send was
updated to return -EIO and print the error message instead of doing the
BUG_ON().
A test case for fstests follows soon.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203933
Fixes: 1c919a5e13 ("btrfs: don't update mtime/ctime on deduped inodes")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl07KzUACgkQxWXV+ddt
WDs6Ow/+LQ4He5jW+xpgZ+5fSnSjervfSsYWxKARc4hYZog5l8e2ONASzrniKR0H
aU2SWUyLF0Os1w++vp1FYYwa1xaPJJRTvlqTI6DSCFgEpu9DefXkCAFuP/+fXbhC
Fx+ZDAbqgY2gUCEWcs4jQusSgz1sO0m97SJR7K4Vz+ZaBa+eneE5Hkm4kZd8AXg6
ufKkjOD0+Vy+CUVBZdPg8RlU5KRJ+yAE7B2CRyuSeq6cF0LxTxoNeAtWz5KWd6Pn
aNWDqa07na4kQz/p5s7Bw6qPC6i4hTdnNKVcc8N79iyz2TAetbjN8VXfbFFVCLZ9
3+9O/kFCh+ZolgV1f9U8iLwSbg7V/8wu1uUdR3645cq2PQljluMJEHPSWOmm+LVw
nOhm8VQiGsILbEYLRJbM3wI8dT4B0PtgP9IqYuk7TLV3qho6Pg8J2nyfpo5jehs5
IP0gDanCBHYgsoCUwZwnJaKH3hIP2+5IJw6AqX0Lt6ge9aYA35+kWR4KE5459Pxw
n0Eoiu+5y9QcfpgUuzb5EQIxHdRNlCjik70+n1yXBwwPtV9b6wDN9TlP3eI7p/Vx
j+FUYcgrjwjfAP7Eh5a/ne/XEcq1R5UghM+2TWEWpqGh11NPnMDJS6d4OYyeCvA6
yPLAhVvvqdbYM+PpALaNjvLxspxd6MSuKNLNk3pkJmkd69IXnXw=
=RT/D
-----END PGP SIGNATURE-----
Merge tag 'for-5.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two regression fixes:
- hangs caused by a missing barrier in the locking code
- memory leaks of extent_state due to bad handling of a cached
pointer"
* tag 'for-5.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix extent_state leak in btrfs_lock_and_flush_ordered_range
btrfs: Fix deadlock caused by missing memory barrier
btrfs_lock_and_flush_ordered_range() loads given "*cached_state" into
cachedp, which, in general, is NULL. Then, lock_extent_bits() updates
"cachedp", but it never goes backs to the caller. Thus the caller still
see its "cached_state" to be NULL and never free the state allocated
under btrfs_lock_and_flush_ordered_range(). As a result, we will
see massive state leak with e.g. fstests btrfs/005. Fix this bug by
properly handling the pointers.
Fixes: bd80d94efb ("btrfs: Always use a cached extent_state in btrfs_lock_and_flush_ordered_range")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 06297d8cef ("btrfs: switch extent_buffer blocking_writers from
atomic to int") changed the type of blocking_writers but forgot to
adjust relevant code in btrfs_tree_unlock by converting the
smp_mb__after_atomic to smp_mb. This opened up the possibility of a
deadlock due to re-ordering of setting blocking_writers and
checking/waking up the waiter. This particular lockup is explained in a
comment above waitqueue_active() function.
Fix it by converting the memory barrier to a full smp_mb, accounting
for the fact that blocking_writers is a simple integer.
Fixes: 06297d8cef ("btrfs: switch extent_buffer blocking_writers from atomic to int")
Tested-by: Johannes Thumshirn <jthumshirn@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl01plkACgkQxWXV+ddt
WDtdmQ/7B4dLmod95cUIY6vhkmlZ3joVEPCCacSj1i1VXFP+QxDgHmmGXQQ0gTuP
fJ0oe1gae5wFGnSZTVJ1WADkVeHqg3rZenSZaBNgLnNORwRVJUpgUchuvd+3L9z6
G0W7CU4bTX99eAERKHAy2uZ3t8bihmKySnZ79gkkfNTN0sV+yq68+5nQLMghGjwM
svutFEEKOenkXaV24a9cffcxcQliRMVjjHojHNqA5l6QJJpjM4ZKy7KPxvOkmeFx
VTko8E33kMO4TvipwrdZ7wD1E/mdMoUDe3dh1oJDL6h1DFlhvr/DhdYOx2cXq/0O
3zlAL8AEWPZp2gPeRgpmydvuVLdUkgd54yvbROYYpo0GcQgINSxW6VJ+wpHOMdhA
dmfE0NzNhLMklfJFhCL5l6GFrsdXXl1zzaVcYanZly4HY60RNu4N3rG0YE1gHKvO
biioerrMR9/+r0ZZh9RLcw0pnvVlx2VlD9lbE7QtoeJvQX6l3yQ2r/MFvNlluGiU
9z21wqtQf9aDnSETvvec0JM3eOgDe6PnjINxj/zUzUY1qoc/ESuSqf9i6AJ3/ix7
L9cevq/+hMARPAqILbtKWZWI/gqtKX22DckUIHcvqYdkG4zJW3e+PworBbsTI6Lb
4QpEHW++dzctc/sdgoEBnwFgewXGtZnSHSCL88y3Zxt0suqwixY=
=6yeW
-----END PGP SIGNATURE-----
Merge tag 'for-5.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fixes for leaks caused by recently merged patches
- one build fix
- a fix to prevent mixing of incompatible features
* tag 'for-5.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't leak extent_map in btrfs_get_io_geometry()
btrfs: free checksum hash on in close_ctree
btrfs: Fix build error while LIBCRC32C is module
btrfs: inode: Don't compress if NODATASUM or NODATACOW set
Pull vfs mount updates from Al Viro:
"The first part of mount updates.
Convert filesystems to use the new mount API"
* 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
mnt_init(): call shmem_init() unconditionally
constify ksys_mount() string arguments
don't bother with registering rootfs
init_rootfs(): don't bother with init_ramfs_fs()
vfs: Convert smackfs to use the new mount API
vfs: Convert selinuxfs to use the new mount API
vfs: Convert securityfs to use the new mount API
vfs: Convert apparmorfs to use the new mount API
vfs: Convert openpromfs to use the new mount API
vfs: Convert xenfs to use the new mount API
vfs: Convert gadgetfs to use the new mount API
vfs: Convert oprofilefs to use the new mount API
vfs: Convert ibmasmfs to use the new mount API
vfs: Convert qib_fs/ipathfs to use the new mount API
vfs: Convert efivarfs to use the new mount API
vfs: Convert configfs to use the new mount API
vfs: Convert binfmt_misc to use the new mount API
convenience helper: get_tree_single()
convenience helper get_tree_nodev()
vfs: Kill sget_userns()
...
btrfs_get_io_geometry() calls btrfs_get_chunk_map() to acquire a reference
on a extent_map, but on normal operation it does not drop this reference
anymore.
This leads to excessive kmemleak reports.
Always call free_extent_map(), not just in the error case.
Fixes: 5f1411265e ("btrfs: Introduce btrfs_io_geometry infrastructure")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If CONFIG_BTRFS_FS is y and CONFIG_LIBCRC32C is m,
building fails:
fs/btrfs/super.o: In function `btrfs_mount_root':
super.c:(.text+0xb7f9): undefined reference to `crc32c_impl'
fs/btrfs/super.o: In function `init_btrfs_fs':
super.c:(.init.text+0x3465): undefined reference to `crc32c_impl'
fs/btrfs/extent-tree.o: In function `hash_extent_data_ref':
extent-tree.c:(.text+0xe60): undefined reference to `crc32c'
extent-tree.c:(.text+0xe78): undefined reference to `crc32c'
extent-tree.c:(.text+0xe8b): undefined reference to `crc32c'
fs/btrfs/dir-item.o: In function `btrfs_insert_xattr_item':
dir-item.c:(.text+0x291): undefined reference to `crc32c'
fs/btrfs/dir-item.o: In function `btrfs_insert_dir_item':
dir-item.c:(.text+0x429): undefined reference to `crc32c'
Select LIBCRC32C to fix it.
Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: d5178578bc ("btrfs: directly call into crypto framework for checksumming")
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As btrfs(5) specified:
Note
If nodatacow or nodatasum are enabled, compression is disabled.
If NODATASUM or NODATACOW set, we should not compress the extent.
Normally NODATACOW is detected properly in run_delalloc_range() so
compression won't happen for NODATACOW.
However for NODATASUM we don't have any check, and it can cause
compressed extent without csum pretty easily, just by:
mkfs.btrfs -f $dev
mount $dev $mnt -o nodatasum
touch $mnt/foobar
mount -o remount,datasum,compress $mnt
xfs_io -f -c "pwrite 0 128K" $mnt/foobar
And in fact, we have a bug report about corrupted compressed extent
without proper data checksum so even RAID1 can't recover the corruption.
(https://bugzilla.kernel.org/show_bug.cgi?id=199707)
Running compression without proper checksum could cause more damage when
corruption happens, as compressed data could make the whole extent
unreadable, so there is no need to allow compression for
NODATACSUM.
The fix will refactor the inode compression check into two parts:
- inode_can_compress()
As the hard requirement, checked at btrfs_run_delalloc_range(), so no
compression will happen for NODATASUM inode at all.
- inode_need_compress()
As the soft requirement, checked at btrfs_run_delalloc_range() and
compress_file_range().
Reported-by: James Harvey <jamespharvey20@gmail.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl0sNWYACgkQxWXV+ddt
WDsyQA/8CGnF68g6hwVuYz4K7f39gOiFlBnRxeN/3RT6vkNSyLZxvRDaDrSTzVIo
cz2G/9qZLXsIll+3EfZlyzZZiA+4f4hEDAfAd4yVPavRom+uu7dbqzAIpgvFlYdH
vhAYKOeWSqWElWJ06hzWO3FCwjY9GKFMk4PS0XHHp+STCT0hq1MkaHr44kiHsqdh
T5nVGDwXz8nGDZ51RO6+mgiSrd5eHbs6kXCd8rW7hmjTx8ClKHa1tdkxN/us+pJm
hTFT669m5ckHhY2AUKmkREoOwpnt2HcXQJNkz6gO+o03IDvYz73SScbhSYdNTlwi
j74GLf89FA52qVM+JDg9MaWYqgf1pQI8AHK/rXw2FNbuP/eL9kuZ85ZIbO6CiO0c
5jAixReSwzSP/V0+MKW3F7k4KtIqbHAV6mkI8zLwrAee4Xj81BOtgL7gYPFQTwSZ
ma0hEoen7IV5+/z9upUuLA5wr4BT+h1T+EllCWe1+9+9mRYOvowtkRNBL8HZWTDI
b65oTITfot54xX9ecKtiuG2qoqJEjjkR+YKdRM4nph6wflSNZxEoezBp3iRFpYOL
Lx+g97RcJ2EEoBVjVMkTqfj93GeiKRifa8yXdRY+A0I2ZXZEcS8DjSJM6rj3AOPy
4idIl+ABscayZowfqu0FSIULf1La0qiRXmbGNeG4ylhN4L6S/og=
=eshk
-----END PGP SIGNATURE-----
Merge tag 'for-5.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Highlights:
- chunks that have been trimmed and unchanged since last mount are
tracked and skipped on repeated trims
- use hw assissed crc32c on more arches, speedups if native
instructions or optimized implementation is available
- the RAID56 incompat bit is automatically removed when the last
block group of that type is removed
Fixes:
- fsync fix for reflink on NODATACOW files that could lead to ENOSPC
- fix data loss after inode eviction, renaming it, and fsync it
- fix fsync not persisting dentry deletions due to inode evictions
- update ctime/mtime/iversion after hole punching
- fix compression type validation (reported by KASAN)
- send won't be allowed to start when relocation is in progress, this
can cause spurious errors or produce incorrect send stream
Core:
- new tracepoints for space update
- tree-checker: better check for end of extents for some tree items
- preparatory work for more checksum algorithms
- run delayed iput at unlink time and don't push the work to cleaner
thread where it's not properly throttled
- wrap block mapping to structures and helpers, base for further
refactoring
- split large files, part 1:
- space info handling
- block group reservations
- delayed refs
- delayed allocation
- other cleanups and refactoring"
* tag 'for-5.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (103 commits)
btrfs: fix memory leak of path on error return path
btrfs: move the subvolume reservation stuff out of extent-tree.c
btrfs: migrate the delalloc space stuff to it's own home
btrfs: migrate btrfs_trans_release_chunk_metadata
btrfs: migrate the delayed refs rsv code
btrfs: Evaluate io_tree in find_lock_delalloc_range()
btrfs: migrate the global_block_rsv helpers to block-rsv.c
btrfs: migrate the block-rsv code to block-rsv.c
btrfs: stop using block_rsv_release_bytes everywhere
btrfs: cleanup the target logic in __btrfs_block_rsv_release
btrfs: export __btrfs_block_rsv_release
btrfs: export btrfs_block_rsv_add_bytes
btrfs: move btrfs_block_rsv definitions into it's own header
btrfs: Simplify update of space_info in __reserve_metadata_bytes()
btrfs: unexport can_overcommit
btrfs: move reserve_metadata_bytes and supporting code to space-info.c
btrfs: move dump_space_info to space-info.c
btrfs: export block_rsv_use_bytes
btrfs: move btrfs_space_info_add_*_bytes to space-info.c
btrfs: move the space info update macro to space-info.h
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl0s1ZEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpiCEEACE9H/pXoegTTWIVPVajMlsa19UHIeilk4N
GI7oKSiirQEMZnAOmrEzgB4/0zyYQsVypys0gZlYUD3GJVsXDT3zzjNXL5NpVg/O
nqwSGWMHBSjWkLbaM40Pb2QLXsYgveptNL+9PtxrgtoYPoT5/+TyrJMFrRfi72EK
WFeNDKOu6aJxpJ26JSsckJ0gluKeeEpRoEqsgHGIwaMIGHQf+b+ikk7tel5FAIgA
uDwwD+Oxsdgh/ChsXL0d90GkcbcSp6GQ7GybxVmw/tPijx6mpeIY72xY3Zx+t8zF
b71UNk6NmCKjOPO/6fiuYKKTYw+KhzlyEKO0j675HKfx2AhchEwKw0irp4yUlydA
zxWYmz4U7iRgktJtymv3J4FEQQ3S6d1EnuQkQNX1LwiOsEsfzhkWi+7jy7KFhZoJ
AqtYzqnOXvLx92q0vloj06HtK6zo+I/MINldy0+qn9lq0N0VF+dctyztAHLsF7P6
pUtS6i7l1JSFKAmMhC31sIj5TImaehM2e/TWMUPEDZaO96oKCmQwOF1oiloc6vlW
h4xWsxP/9zOFcWNyPzy6Vo3JUXWRvFA7K+jV3Hsukw6rVHiNCGVYGSlTv8Roi5b7
I4ggu9R2JOGyku7UIlL50IRxEyjAp11LaO8yHhcCnRB65rmyBuNMQNcfOsfxpZ5Y
1mtSNhm5TQ==
=g8xI
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190715' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
"A later pull request with some followup items. I had some vacation
coming up to the merge window, so certain things items were delayed a
bit. This pull request also contains fixes that came in within the
last few days of the merge window, which I didn't want to push right
before sending you a pull request.
This contains:
- NVMe pull request, mostly fixes, but also a few minor items on the
feature side that were timing constrained (Christoph et al)
- Report zones fixes (Damien)
- Removal of dead code (Damien)
- Turn on cgroup psi memstall (Josef)
- block cgroup MAINTAINERS entry (Konstantin)
- Flush init fix (Josef)
- blk-throttle low iops timing fix (Konstantin)
- nbd resize fixes (Mike)
- nbd 0 blocksize crash fix (Xiubo)
- block integrity error leak fix (Wenwen)
- blk-cgroup writeback and priority inheritance fixes (Tejun)"
* tag 'for-linus-20190715' of git://git.kernel.dk/linux-block: (42 commits)
MAINTAINERS: add entry for block io cgroup
null_blk: fixup ->report_zones() for !CONFIG_BLK_DEV_ZONED
block: Limit zone array allocation size
sd_zbc: Fix report zones buffer allocation
block: Kill gfp_t argument of blkdev_report_zones()
block: Allow mapping of vmalloc-ed buffers
block/bio-integrity: fix a memory leak bug
nvme: fix NULL deref for fabrics options
nbd: add netlink reconfigure resize support
nbd: fix crash when the blksize is zero
block: Disable write plugging for zoned block devices
block: Fix elevator name declaration
block: Remove unused definitions
nvme: fix regression upon hot device removal and insertion
blk-throttle: fix zero wait time for iops throttled group
block: Fix potential overflow in blk_report_zones()
blkcg: implement REQ_CGROUP_PUNT
blkcg, writeback: Implement wbc_blkcg_css()
blkcg, writeback: Add wbc->no_cgroup_owner
blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()
...
- Standardize parameter checking for the SETFLAGS and FSSETXATTR ioctls
(which were the file attribute setters for ext4 and xfs and have now
been hoisted to the vfs)
- Only allow the DAX flag to be set on files and directories.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl0aJgMACgkQ+H93GTRK
tOuKkg//SJaxcB63uVPZk9hDraYTmyo9OXRRX6X9WwDKPTWwa88CUwS1ny1QF7Mt
zMkgzG2/y2Rs9PQ0ARoPbh1hNb2CXnvA+xnzUEev1MW6UN/nTFMZEOPn2ZQ+DxQE
gg/0U56kKgtjtXzBZVpTgHzSETivdXwHxFW3hiTtyRXg+4ulgDIZLOjN2wRB+Pdb
X8ZmM6MqKOTbhQEXlw13TlCKBzoMjC1w4UU4rkZPjoSjAaUWiPfrk/XU7qgguf9p
v1dbSN2dADQ19jzZ1dmggXnlJsRMZjk/ls5rxJlB5DHDbh6YgnA2TE+tYrtH28eB
uyKfD+RQnMzRVdmH8PsMQRQQFXR2UYyprVP7a6wi6TkB+gytn7sR5uT4sbAhmhcF
TiTYfYNRXzemHCewyOwOsUE/7oCeiJcdbqiPAHHD/jYLZfRjSXDcGzz3+7ZYZ3GO
hRxUhpxHPbkmK4T2OxhzReCbRsLN/0BeEcDdLkNWmi2FTh3V1gYzMGkgI9wsVbsd
pHjoGIHbMPWqktF/obuGq96WVfYBBaWJ6WNzQqKT4dQYAJBW2omxitXQHLpi6cjt
hG5ncxa3cPpWx4t3Lx2hb0TPS7RyYvuoQIcS/Me2RWioxrwWrgnOqdHFfLEwWpfN
jRowdWiGgOIsq8hMt7qycmGCXzbgsbaA/7oRqh8TiwM9taPOM4c=
=uH2E
-----END PGP SIGNATURE-----
Merge tag 'vfs-fix-ioctl-checking-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull common SETFLAGS/FSSETXATTR parameter checking from Darrick Wong:
"Here's a patch series that sets up common parameter checking functions
for the FS_IOC_SETFLAGS and FS_IOC_FSSETXATTR ioctl implementations.
The goal here is to reduce the amount of behaviorial variance between
the filesystems where those ioctls originated (ext2 and XFS,
respectively) and everybody else.
- Standardize parameter checking for the SETFLAGS and FSSETXATTR
ioctls (which were the file attribute setters for ext4 and xfs and
have now been hoisted to the vfs)
- Only allow the DAX flag to be set on files and directories"
* tag 'vfs-fix-ioctl-checking-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
vfs: only allow FSSETXATTR to set DAX flag on files and dirs
vfs: teach vfs_ioc_fssetxattr_check to check extent size hints
vfs: teach vfs_ioc_fssetxattr_check to check project id info
vfs: create a generic checking function for FS_IOC_FSSETXATTR
vfs: create a generic checking and prep function for FS_IOC_SETFLAGS
Here is the "big" driver core and debugfs changes for 5.3-rc1
It's a lot of different patches, all across the tree due to some api
changes and lots of debugfs cleanups. Because of this, there is going
to be some merge issues with your tree at the moment, I'll follow up
with the expected resolutions to make it easier for you.
Other than the debugfs cleanups, in this set of changes we have:
- bus iteration function cleanups (will cause build warnings
with s390 and coresight drivers in your tree)
- scripts/get_abi.pl tool to display and parse Documentation/ABI
entries in a simple way
- cleanups to Documenatation/ABI/ entries to make them parse
easier due to typos and other minor things
- default_attrs use for some ktype users
- driver model documentation file conversions to .rst
- compressed firmware file loading
- deferred probe fixes
All of these have been in linux-next for a while, with a bunch of merge
issues that Stephen has been patient with me for. Other than the merge
issues, functionality is working properly in linux-next :)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXSgpnQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykcwgCfS30OR4JmwZydWGJ7zK/cHqk+KjsAnjOxjC1K
LpRyb3zX29oChFaZkc5a
=XrEZ
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core and debugfs updates from Greg KH:
"Here is the "big" driver core and debugfs changes for 5.3-rc1
It's a lot of different patches, all across the tree due to some api
changes and lots of debugfs cleanups.
Other than the debugfs cleanups, in this set of changes we have:
- bus iteration function cleanups
- scripts/get_abi.pl tool to display and parse Documentation/ABI
entries in a simple way
- cleanups to Documenatation/ABI/ entries to make them parse easier
due to typos and other minor things
- default_attrs use for some ktype users
- driver model documentation file conversions to .rst
- compressed firmware file loading
- deferred probe fixes
All of these have been in linux-next for a while, with a bunch of
merge issues that Stephen has been patient with me for"
* tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (102 commits)
debugfs: make error message a bit more verbose
orangefs: fix build warning from debugfs cleanup patch
ubifs: fix build warning after debugfs cleanup patch
driver: core: Allow subsystems to continue deferring probe
drivers: base: cacheinfo: Ensure cpu hotplug work is done before Intel RDT
arch_topology: Remove error messages on out-of-memory conditions
lib: notifier-error-inject: no need to check return value of debugfs_create functions
swiotlb: no need to check return value of debugfs_create functions
ceph: no need to check return value of debugfs_create functions
sunrpc: no need to check return value of debugfs_create functions
ubifs: no need to check return value of debugfs_create functions
orangefs: no need to check return value of debugfs_create functions
nfsd: no need to check return value of debugfs_create functions
lib: 842: no need to check return value of debugfs_create functions
debugfs: provide pr_fmt() macro
debugfs: log errors when something goes wrong
drivers: s390/cio: Fix compilation warning about const qualifiers
drivers: Add generic helper to match by of_node
driver_find_device: Unify the match function with class_find_device()
bus_find_device: Unify the match callback with class_find_device
...
wbc_account_io() does a very specific job - try to see which cgroup is
actually dirtying an inode and transfer its ownership to the majority
dirtier if needed. The name is too generic and confusing. Let's
rename it to something more specific.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently if the allocation of roots or tmp_ulist fails the error handling
does not free up the allocation of path causing a memory leak. Fix this and
other similar leaks by moving the call of btrfs_free_path from label out
to label out_free_ulist.
Kudos to David Sterba for spotting the issue in my original fix and suggesting
the correct way to fix the leak and Anand Jain for spotting a double free
issue.
Addresses-Coverity: ("Resource leak")
Fixes: 5911c8fe05 ("btrfs: fiemap: preallocate ulists for btrfs_check_shared")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is just two functions, put it in root-tree.c since it involves root
items.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have code for data and metadata reservations for delalloc. There's
quite a bit of code here, and it's used in a lot of places so I've
separated it out to it's own file. inode.c and file.c are already
pretty large, and this code is complicated enough to live in its own
space.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move this into transaction.c with the rest of the transaction related
code.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These belong with the delayed refs related code, not in extent-tree.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Simplification. No point passing the tree variable when it can be
evaluated from inode. The tests now use the io_tree from btrfs_inode as
opposed to creating one.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This moves everything out of extent-tree.c to block-rsv.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
block_rsv_release_bytes() is the internal to the block_rsv code, and
shouldn't be called directly by anything else. Switch all users to the
exported helpers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This works for all callers already, but if we wanted to use the helper
for the global_block_rsv it would end up trying to refill itself. Fix
the logic to be able to be used no matter which block rsv is passed in
to this helper.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The delalloc reserve stuff calls this directly because it cares about
the qgroup accounting stuff, so export it so we can move it around. Fix
btrfs_block_rsv_release() to just be a static inline since it just calls
__btrfs_block_rsv_release() with NULL for the qgroup stuff.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is used in a few places, we need to make sure it's exported so we
can move it around.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Prep work for separating out all of the block_rsv related code into its
own file.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need an if-else-if chain where we can use a simple OR since
both conditions are performing the same action. The short-circuit for OR
will ensure that if the first condition is true, can_overcommit() is not
called.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we've moved all of the users to space-info.c, unexport it and
name it back to can_overcommit.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This moves all of the metadata reservation code into space-info.c.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We'll need this exported so we can use it in all the various was we need
to use it. This is prep work to move reserve_metadata_bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to need this to move the metadata reservation stuff to
space_info.c.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we've moved all the pre-requisite stuff, move these two
functions.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Also rename it to btrfs_space_info_update_* so it's clear what we're
updating.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the first piece of moving the space reservation code to
space-info.c
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These are the basic init and lookup functions and some helper functions,
fairly straightforward before the bad stuff starts.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Prep work for consolidating all of the space_info code into one file.
We need to export these so multiple files can use them.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Really we just need the enum, but as we break more things up it'll help
to have this external to extent-tree.c.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Migrate the struct definition and the one helper that's in ctree.h into
space-info.h
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The block device is passed around for the only purpose to set it in new
bios. Move the assignment one level up. This is a preparatory patch for
further bdev cleanups.
Signed-off-by: David Sterba <dsterba@suse.com>
Minimum stripe count matches the minimum devices required for a given
profile. The open coded assignments match the raid_attr table.
What's changed here is the meaning for RAID5/6. Previously their
min_stripes would be 1, while newly it's devs_min. This however shold be
the same as before because it's not possible to create filesystem on
fewer devices than the raid_attr table allows.
There's no adjustment regarding the parity stripes (like
calc_data_stripes does), because we're interested in overall space that
would fit on the devices.
Missing devices make no difference for the whole calculation, we have
the size stored in the structures.
Signed-off-by: David Sterba <dsterba@suse.com>
Special case for DUP can be replaced by lookup to the attribute table,
where the dev_stripes is the right coefficient.
Signed-off-by: David Sterba <dsterba@suse.com>
A few more instances whre we don't need to specify the values as long as
they are the same that enum assigns automatically. All of the enums are
in-memory only and nothing relies on the exact values.
Signed-off-by: David Sterba <dsterba@suse.com>
We have been seeing issues in production where a cleaner script will end
up unlinking a bunch of files that have pending iputs. This means they
will get their final iput's run at btrfs-cleaner time and thus are not
throttled, which impacts the workload.
Since we are unlinking these files we can just drop the delayed iput at
unlink time. We are already holding a reference to the inode so this
will not be the final iput and thus is completely safe to do at this
point. Doing this means we are more likely to be doing the final iput
at unlink time, and thus will get the IO charged to the caller and get
throttled appropriately without affecting the main workload.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the range for which we are punching a hole covers only part of a page,
we end up updating the inode item but we skip the update of the inode's
iversion, mtime and ctime. Fix that by ensuring we update those properties
of the inode.
A patch for fstests test case generic/059 that tests this as been sent
along with this fix.
Fixes: 2aaa665581 ("Btrfs: add hole punching")
Fixes: e8c1c76e80 ("Btrfs: add missing inode update when punching hole")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to avoid searches on a log tree when unlinking an inode, we check
if the inode being unlinked was logged in the current transaction, as well
as the inode of its parent directory. When any of the inodes are logged,
we proceed to delete directory items and inode reference items from the
log, to ensure that if a subsequent fsync of only the inode being unlinked
or only of the parent directory when the other is not fsync'ed as well,
does not result in the entry still existing after a power failure.
That check however is not reliable when one of the inodes involved (the
one being unlinked or its parent directory's inode) is evicted, since the
logged_trans field is transient, that is, it is not stored on disk, so it
is lost when the inode is evicted and loaded into memory again (which is
set to zero on load). As a consequence the checks currently being done by
btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log() always
return true if the inode was evicted before, regardless of the inode
having been logged or not before (and in the current transaction), this
results in the dentry being unlinked still existing after a log replay
if after the unlink operation only one of the inodes involved is fsync'ed.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir
$ touch /mnt/dir/foo
$ xfs_io -c fsync /mnt/dir/foo
# Keep an open file descriptor on our directory while we evict inodes.
# We just want to evict the file's inode, the directory's inode must not
# be evicted.
$ ( cd /mnt/dir; while true; do :; done ) &
$ pid=$!
# Wait a bit to give time to background process to chdir to our test
# directory.
$ sleep 0.5
# Trigger eviction of the file's inode.
$ echo 2 > /proc/sys/vm/drop_caches
# Unlink our file and fsync the parent directory. After a power failure
# we don't expect to see the file anymore, since we fsync'ed the parent
# directory.
$ rm -f $SCRATCH_MNT/dir/foo
$ xfs_io -c fsync /mnt/dir
<power failure>
$ mount /dev/sdb /mnt
$ ls /mnt/dir
foo
$
--> file still there, unlink not persisted despite explicit fsync on dir
Fix this by checking if the inode has the full_sync bit set in its runtime
flags as well, since that bit is set everytime an inode is loaded from
disk, or for other less common cases such as after a shrinking truncate
or failure to allocate extent maps for holes, and gets cleared after the
first fsync. Also consider the inode as possibly logged only if it was
last modified in the current transaction (besides having the full_fsync
flag set).
Fixes: 3a5f1d458a ("Btrfs: Optimize btree walking while logging inodes")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Presently btrfs_map_block is used not only to do everything necessary to
map a bio to the underlying allocation profile but it's also used to
identify how much data could be written based on btrfs' stripe logic
without actually submitting anything. This is achieved by passing NULL
for 'bbio_ret' parameter.
This patch refactors all callers that require just the mapping length
by switching them to using btrfs_io_geometry instead of calling
btrfs_map_block with a special NULL value for 'bbio_ret'. No functional
change.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a structure that holds various parameters for IO calculations and a
helper that fills the values. This will help further refactoring and
reduction of functions that in some way open-coded the calculations.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the messages printed after setting an incompat feature are
cryptis, we can easily make it better as the textual description is
passed to the helpers. Old:
setting 128 feature flag
updated:
setting incompat feature flag for RAID56 (0x80)
Signed-off-by: David Sterba <dsterba@suse.com>
gcc sometimes can't determine whether a variable has been initialized
when both the initialization and the use are conditional:
fs/btrfs/props.c: In function 'inherit_props':
fs/btrfs/props.c:389:4: error: 'num_bytes' may be used uninitialized in this function [-Werror=maybe-uninitialized]
btrfs_block_rsv_release(fs_info, trans->block_rsv,
This code is fine. Unfortunately, I cannot think of a good way to
rephrase it in a way that makes gcc understand this, so I add a bogus
initialization the way one should not.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ gcc 8 and 9 don't emit the warning ]
Signed-off-by: David Sterba <dsterba@suse.com>
Send always operates on read-only trees and always expected that while it
is in progress, nothing changes in those trees. Due to that expectation
and the fact that send is a read-only operation, it operates on commit
roots and does not hold transaction handles. However relocation can COW
nodes and leafs from read-only trees, which can cause unexpected failures
and crashes (hitting BUG_ONs). while send using a node/leaf, it gets
COWed, the transaction used to COW it is committed, a new transaction
starts, the extent previously used for that node/leaf gets allocated,
possibly for another tree, and the respective extent buffer' content
changes while send is still using it. When this happens send normally
fails with EIO being returned to user space and messages like the
following are found in dmesg/syslog:
[ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253
[ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984
Other times, less often, we hit a BUG_ON() because an extent buffer that
send is using used to be a node, and while send is still using it, it
got COWed and got reused as a leaf while send is still using, producing
the following trace:
[ 3478.466280] ------------[ cut here ]------------
[ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806!
[ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1
[ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246
[ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002
[ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000
[ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e
[ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708
[ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3478.475840] FS: 00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000
[ 3478.476489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0
[ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3478.479003] Call Trace:
[ 3478.479600] ? do_raw_spin_unlock+0x49/0xc0
[ 3478.480202] tree_advance+0x173/0x1d0 [btrfs]
[ 3478.480810] btrfs_compare_trees+0x30c/0x690 [btrfs]
[ 3478.481388] ? process_extent+0x1280/0x1280 [btrfs]
[ 3478.481954] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[ 3478.482510] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[ 3478.483062] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[ 3478.483581] ? rq_clock_task+0x2e/0x60
[ 3478.484086] ? wake_up_new_task+0x1f3/0x370
[ 3478.484582] ? do_vfs_ioctl+0xa2/0x6f0
[ 3478.485075] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 3478.485552] do_vfs_ioctl+0xa2/0x6f0
[ 3478.486016] ? __fget+0x113/0x200
[ 3478.486467] ksys_ioctl+0x70/0x80
[ 3478.486911] __x64_sys_ioctl+0x16/0x20
[ 3478.487337] do_syscall_64+0x60/0x1b0
[ 3478.487751] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7
(...)
[ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7
[ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005
[ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700
[ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005
[ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001
(...)
[ 3478.493352] ---[ end trace d5f537302be4f8c8 ]---
Another possibility, much less likely to happen, is that send will not
fail but the contents of the stream it produces may not be correct.
To avoid this, do not allow send and relocation (balance) to run in
parallel. In the long term the goal is to allow for both to be able to
run concurrently without any problems, but that will take a significant
effort in development and testing.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory patch for additional RAID1 profiles with more copies. The
mask will contain 3-copy and 4-copy, most of the checks for plain RAID1
work the same for the other profiles.
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Lockdep will report the following circular locking dependency:
WARNING: possible circular locking dependency detected
5.2.0-rc2-custom #24 Tainted: G O
------------------------------------------------------
btrfs/8631 is trying to acquire lock:
000000002536438c (&fs_info->qgroup_ioctl_lock#2){+.+.}, at: btrfs_qgroup_inherit+0x40/0x620 [btrfs]
but task is already holding lock:
000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&fs_info->tree_log_mutex){+.+.}:
__mutex_lock+0x76/0x940
mutex_lock_nested+0x1b/0x20
btrfs_commit_transaction+0x475/0xa00 [btrfs]
btrfs_commit_super+0x71/0x80 [btrfs]
close_ctree+0x2bd/0x320 [btrfs]
btrfs_put_super+0x15/0x20 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x16/0xa0 [btrfs]
deactivate_locked_super+0x3a/0x80
deactivate_super+0x51/0x60
cleanup_mnt+0x3f/0x80
__cleanup_mnt+0x12/0x20
task_work_run+0x94/0xb0
exit_to_usermode_loop+0xd8/0xe0
do_syscall_64+0x210/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #1 (&fs_info->reloc_mutex){+.+.}:
__mutex_lock+0x76/0x940
mutex_lock_nested+0x1b/0x20
btrfs_commit_transaction+0x40d/0xa00 [btrfs]
btrfs_quota_enable+0x2da/0x730 [btrfs]
btrfs_ioctl+0x2691/0x2b40 [btrfs]
do_vfs_ioctl+0xa9/0x6d0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x65/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (&fs_info->qgroup_ioctl_lock#2){+.+.}:
lock_acquire+0xa7/0x190
__mutex_lock+0x76/0x940
mutex_lock_nested+0x1b/0x20
btrfs_qgroup_inherit+0x40/0x620 [btrfs]
create_pending_snapshot+0x9d7/0xe60 [btrfs]
create_pending_snapshots+0x94/0xb0 [btrfs]
btrfs_commit_transaction+0x415/0xa00 [btrfs]
btrfs_mksubvol+0x496/0x4e0 [btrfs]
btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
btrfs_ioctl+0xa90/0x2b40 [btrfs]
do_vfs_ioctl+0xa9/0x6d0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x65/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Chain exists of:
&fs_info->qgroup_ioctl_lock#2 --> &fs_info->reloc_mutex --> &fs_info->tree_log_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->tree_log_mutex);
lock(&fs_info->reloc_mutex);
lock(&fs_info->tree_log_mutex);
lock(&fs_info->qgroup_ioctl_lock#2);
*** DEADLOCK ***
6 locks held by btrfs/8631:
#0: 00000000ed8f23f6 (sb_writers#12){.+.+}, at: mnt_want_write_file+0x28/0x60
#1: 000000009fb1597a (&type->i_mutex_dir_key#10/1){+.+.}, at: btrfs_mksubvol+0x70/0x4e0 [btrfs]
#2: 0000000088c5ad88 (&fs_info->subvol_sem){++++}, at: btrfs_mksubvol+0x128/0x4e0 [btrfs]
#3: 000000009606fc3e (sb_internal#2){.+.+}, at: start_transaction+0x37a/0x520 [btrfs]
#4: 00000000f82bbdf5 (&fs_info->reloc_mutex){+.+.}, at: btrfs_commit_transaction+0x40d/0xa00 [btrfs]
#5: 000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]
[CAUSE]
Due to the delayed subvolume creation, we need to call
btrfs_qgroup_inherit() inside commit transaction code, with a lot of
other mutex hold.
This hell of lock chain can lead to above problem.
[FIX]
On the other hand, we don't really need to hold qgroup_ioctl_lock if
we're in the context of create_pending_snapshot().
As in that context, we're the only one being able to modify qgroup.
All other qgroup functions which needs qgroup_ioctl_lock are either
holding a transaction handle, or will start a new transaction:
Functions will start a new transaction():
* btrfs_quota_enable()
* btrfs_quota_disable()
Functions hold a transaction handler:
* btrfs_add_qgroup_relation()
* btrfs_del_qgroup_relation()
* btrfs_create_qgroup()
* btrfs_remove_qgroup()
* btrfs_limit_qgroup()
* btrfs_qgroup_inherit() call inside create_subvol()
So we have a higher level protection provided by transaction, thus we
don't need to always hold qgroup_ioctl_lock in btrfs_qgroup_inherit().
Only the btrfs_qgroup_inherit() call in create_subvol() needs to hold
qgroup_ioctl_lock, while the btrfs_qgroup_inherit() call in
create_pending_snapshot() is already protected by transaction.
So the fix is to detect the context by checking
trans->transaction->state.
If we're at TRANS_STATE_COMMIT_DOING, then we're in commit transaction
context and no need to get the mutex.
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay reported the following KASAN splat when running btrfs/048:
[ 1843.470920] ==================================================================
[ 1843.471971] BUG: KASAN: slab-out-of-bounds in strncmp+0x66/0xb0
[ 1843.472775] Read of size 1 at addr ffff888111e369e2 by task btrfs/3979
[ 1843.473904] CPU: 3 PID: 3979 Comm: btrfs Not tainted 5.2.0-rc3-default #536
[ 1843.475009] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 1843.476322] Call Trace:
[ 1843.476674] dump_stack+0x7c/0xbb
[ 1843.477132] ? strncmp+0x66/0xb0
[ 1843.477587] print_address_description+0x114/0x320
[ 1843.478256] ? strncmp+0x66/0xb0
[ 1843.478740] ? strncmp+0x66/0xb0
[ 1843.479185] __kasan_report+0x14e/0x192
[ 1843.479759] ? strncmp+0x66/0xb0
[ 1843.480209] kasan_report+0xe/0x20
[ 1843.480679] strncmp+0x66/0xb0
[ 1843.481105] prop_compression_validate+0x24/0x70
[ 1843.481798] btrfs_xattr_handler_set_prop+0x65/0x160
[ 1843.482509] __vfs_setxattr+0x71/0x90
[ 1843.483012] __vfs_setxattr_noperm+0x84/0x130
[ 1843.483606] vfs_setxattr+0xac/0xb0
[ 1843.484085] setxattr+0x18c/0x230
[ 1843.484546] ? vfs_setxattr+0xb0/0xb0
[ 1843.485048] ? __mod_node_page_state+0x1f/0xa0
[ 1843.485672] ? _raw_spin_unlock+0x24/0x40
[ 1843.486233] ? __handle_mm_fault+0x988/0x1290
[ 1843.486823] ? lock_acquire+0xb4/0x1e0
[ 1843.487330] ? lock_acquire+0xb4/0x1e0
[ 1843.487842] ? mnt_want_write_file+0x3c/0x80
[ 1843.488442] ? debug_lockdep_rcu_enabled+0x22/0x40
[ 1843.489089] ? rcu_sync_lockdep_assert+0xe/0x70
[ 1843.489707] ? __sb_start_write+0x158/0x200
[ 1843.490278] ? mnt_want_write_file+0x3c/0x80
[ 1843.490855] ? __mnt_want_write+0x98/0xe0
[ 1843.491397] __x64_sys_fsetxattr+0xba/0xe0
[ 1843.492201] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1843.493201] do_syscall_64+0x6c/0x230
[ 1843.493988] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1843.495041] RIP: 0033:0x7fa7a8a7707a
[ 1843.495819] Code: 48 8b 0d 21 de 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 be 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ee dd 2b 00 f7 d8 64 89 01 48
[ 1843.499203] RSP: 002b:00007ffcb73bca38 EFLAGS: 00000202 ORIG_RAX: 00000000000000be
[ 1843.500210] RAX: ffffffffffffffda RBX: 00007ffcb73bda9d RCX: 00007fa7a8a7707a
[ 1843.501170] RDX: 00007ffcb73bda9d RSI: 00000000006dc050 RDI: 0000000000000003
[ 1843.502152] RBP: 00000000006dc050 R08: 0000000000000000 R09: 0000000000000000
[ 1843.503109] R10: 0000000000000002 R11: 0000000000000202 R12: 00007ffcb73bda91
[ 1843.504055] R13: 0000000000000003 R14: 00007ffcb73bda82 R15: ffffffffffffffff
[ 1843.505268] Allocated by task 3979:
[ 1843.505771] save_stack+0x19/0x80
[ 1843.506211] __kasan_kmalloc.constprop.5+0xa0/0xd0
[ 1843.506836] setxattr+0xeb/0x230
[ 1843.507264] __x64_sys_fsetxattr+0xba/0xe0
[ 1843.507886] do_syscall_64+0x6c/0x230
[ 1843.508429] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1843.509558] Freed by task 0:
[ 1843.510188] (stack is not available)
[ 1843.511309] The buggy address belongs to the object at ffff888111e369e0
which belongs to the cache kmalloc-8 of size 8
[ 1843.514095] The buggy address is located 2 bytes inside of
8-byte region [ffff888111e369e0, ffff888111e369e8)
[ 1843.516524] The buggy address belongs to the page:
[ 1843.517561] page:ffff88813f478d80 refcount:1 mapcount:0 mapping:ffff88811940c300 index:0xffff888111e373b8 compound_mapcount: 0
[ 1843.519993] flags: 0x4404000010200(slab|head)
[ 1843.520951] raw: 0004404000010200 ffff88813f48b008 ffff888119403d50 ffff88811940c300
[ 1843.522616] raw: ffff888111e373b8 000000000016000f 00000001ffffffff 0000000000000000
[ 1843.524281] page dumped because: kasan: bad access detected
[ 1843.525936] Memory state around the buggy address:
[ 1843.526975] ffff888111e36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1843.528479] ffff888111e36900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1843.530138] >ffff888111e36980: fc fc fc fc fc fc fc fc fc fc fc fc 02 fc fc fc
[ 1843.531877] ^
[ 1843.533287] ffff888111e36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1843.534874] ffff888111e36a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1843.536468] ==================================================================
This is caused by supplying a too short compression value ('lz') in the
test-case and comparing it to 'lzo' with strncmp() and a length of 3.
strncmp() read past the 'lz' when looking for the 'o' and thus caused an
out-of-bounds read.
Introduce a new check 'btrfs_compress_is_valid_type()' which not only
checks the user-supplied value against known compression types, but also
employs checks for too short values.
Reported-by: Nikolay Borisov <nborisov@suse.com>
Fixes: 272e5326c7 ("btrfs: prop: fix vanished compression property after failed set")
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we log an inode, regardless of logging it completely or only that it
exists, we always update it as logged (logged_trans and last_log_commit
fields of the inode are updated). This is generally fine and avoids future
attempts to log it from having to do repeated work that brings no value.
However, if we write data to a file, then evict its inode after all the
dealloc was flushed (and ordered extents completed), rename the file and
fsync it, we end up not logging the new extents, since the rename may
result in logging that the inode exists in case the parent directory was
logged before. The following reproducer shows and explains how this can
happen:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir
$ touch /mnt/dir/foo
$ touch /mnt/dir/bar
# Do a direct IO write instead of a buffered write because with a
# buffered write we would need to make sure dealloc gets flushed and
# complete before we do the inode eviction later, and we can not do that
# from user space with call to things such as sync(2) since that results
# in a transaction commit as well.
$ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar
# Keep the directory dir in use while we evict inodes. We want our file
# bar's inode to be evicted but we don't want our directory's inode to
# be evicted (if it were evicted too, we would not be able to reproduce
# the issue since the first fsync below, of file foo, would result in a
# transaction commit.
$ ( cd /mnt/dir; while true; do :; done ) &
$ pid=$!
# Wait a bit to give time for the background process to chdir.
$ sleep 0.1
# Evict all inodes, except the inode for the directory dir because it is
# currently in use by our background process.
$ echo 2 > /proc/sys/vm/drop_caches
# fsync file foo, which ends up persisting information about the parent
# directory because it is a new inode.
$ xfs_io -c fsync /mnt/dir/foo
# Rename bar, this results in logging that this inode exists (inode item,
# names, xattrs) because the parent directory is in the log.
$ mv /mnt/dir/bar /mnt/dir/baz
# Now fsync baz, which ends up doing absolutely nothing because of the
# rename operation which logged that the inode exists only.
$ xfs_io -c fsync /mnt/dir/baz
<power failure>
$ mount /dev/sdb /mnt
$ od -t x1 -A d /mnt/dir/baz
0000000
--> Empty file, data we wrote is missing.
Fix this by not updating last_sub_trans of an inode when we are logging
only that it exists and the inode was not yet logged since it was loaded
from disk (full_sync bit set), this is enough to make btrfs_inode_in_log()
return false for this scenario and make us log the inode. The logged_trans
of the inode is still always setsince that alone is used to track if names
need to be deleted as part of unlink operations.
Fixes: 257c62e1bc ("Btrfs: avoid tree log commit when there are no changes")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The incompat bit for RAID56 is set either at mount time or automatically
when the profile is used by balance. The part where the bit is removed
is missing and can be unexpected or undesired when an older kernel is
needed.
This patch will drop the incompat bit after this command, assuming
that RAID5 profile is not used by system or metadata:
$ btrfs balance start -dconvert=raid5 /mnt
$ btrfs balance start -dconvert=raid1 /mnt
This will print "clearing 128 feature flag" to the system log.
The patch is safe for backporting to older kernels.
Reported-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: David Sterba <dsterba@suse.com>
The write_locks is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The spinning_writers is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.
Signed-off-by: David Sterba <dsterba@suse.com>
The blocking_writers is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Turn the comment about required lock into an assertion.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Create a generic checking function for the incoming FS_IOC_FSSETXATTR
fsxattr values so that we can standardize some of the implementation
behaviors.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Create a generic function to check incoming FS_IOC_SETFLAGS flag values
and later prepare the inode for updates so that we can standardize the
implementations that follow ext4's flag values.
Note that the efivarfs implementation no longer fails a no-op SETFLAGS
without CAP_LINUX_IMMUTABLE since that's the behavior in ext*.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
There are no concerns about locking during the selftests so the locks
are not necessary, but following patches will add lockdep assertions to
add_extent_mapping so this is needed in tests too.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function has a lot of return values and specific conventions making
it cumbersome to understand what's returned. Have a go at documenting
its parameters and return values.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently find_first_clear_extent_bit always returns a range whose
starting value is >= passed 'start'. This implicit trimming behavior is
somewhat subtle and an implementation detail.
Instead, this patch modifies the function such that now it always
returns the range which contains passed 'start' and has the given bits
unset. This range could either be due to presence of existing records
which contains 'start' but have the bits unset or because there are no
records that contain the given starting offset.
This patch also adds test cases which cover find_first_clear_extent_bit
since they were missing up until now.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the first megabyte on a device housing a btrfs filesystem is
exempt from allocation and trimming. Currently this is not a problem
since 'start' is set to 1M at the beginning of btrfs_trim_free_extents
and find_first_clear_extent_bit always returns a range that is >= start.
However, in a follow up patch find_first_clear_extent_bit will be
changed such that it will return a range containing 'start' and this
range may very well be 0...>=1M so 'start'.
Future proof the sole user of find_first_clear_extent_bit by setting
'start' after the function is called. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The filename:line format is commonly understood by editors and can be
copy&pasted more easily than the current format.
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_print_data_csum_error() still assumed checksums to be 32 bit in
size. Make it size agnostic.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_csum_data() relied on the crc32c() wrapper around the
crypto framework for calculating the CRCs.
As we have our own crypto_shash structure in the fs_info now, we can
directly call into the crypto framework without going trough the wrapper.
This way we can even remove the btrfs_csum_data() and btrfs_csum_final()
wrappers.
The module dependency on crc32c is preserved via MODULE_SOFTDEP("pre:
crc32c"), which was previously provided by LIBCRC32C config option doing
the same.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add boilerplate code for directly including the crypto framework. This
helps us flipping the switch for new algorithms.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have already checked for a valid checksum type before
calling btrfs_check_super_csum(), it can be simplified even further.
While at it get rid of the implicit size assumption of the resulting
checksum as well.
This is a preparation for changing all checksum functionality to use the
crypto layer later.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have factorerd out the superblock checksum type validation,
we can check for supported superblock checksum types before doing the
actual validation of the superblock read from disk.
This leads the path to further simplifications of
btrfs_check_super_csum() later on.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs is only supporting CRC32C as checksumming algorithm. As
this is about to change provide a function to validate the checksum type
in the superblock against all possible algorithms.
This makes adding new algorithms easier as there are fewer places to
adjust when adding new algorithms.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a small helper for btrfs_print_data_csum_error() which formats the
checksum according to it's type for pretty printing.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ shorten macro name ]
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS has the implicit assumption that a checksum in compressed_bio is 4
bytes. While this is true for CRC32C, it is not for any other checksum.
Change the data type to be a byte array and adjust loop index calculation
accordingly.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS has the implicit assumption that a checksum in btrfs_orderd_sums
is 4 bytes. While this is true for CRC32C, it is not for any other
checksum.
Change the data type to be a byte array and adjust loop index
calculation accordingly.
This includes moving the adjustment of 'index' by 'ins_size' in
btrfs_csum_file_blocks() before dividing 'ins_size' by the checksum
size, because before this patch the 'sums' member of 'struct
btrfs_ordered_sum' was 4 Bytes in size and afterwards it is only one
byte.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The CRC checksum in the free space cache is not dependant on the super
block's csum_type field but always a CRC32C.
So use btrfs_crc32c() and btrfs_crc32c_final() instead of
btrfs_csum_data() and btrfs_csum_final() for computing these checksums.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 9678c54388 ("btrfs: Remove custom crc32c init code") removed
the btrfs_crc32c() function, because it was a duplicate of the crc32c()
library function we already have in the kernel.
Resurrect it as a shim wrapper over crc32c() to make following
transformations of the checksumming code in btrfs easier.
Also provide a btrfs_crc32_final() to ease following transformations.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfsic_test_for_metadata() directly calls the crc32c() library function
for calculating the CRC32C checksum, but then uses btrfs_csum_final() to
invert the result.
To ease further refactoring and development around checksumming in BTRFS
convert to calling btrfs_csum_data(), which is a wrapper around
crc32c().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following script can cause unexpected fsync failure:
#!/bin/bash
dev=/dev/test/test
mnt=/mnt/btrfs
mkfs.btrfs -f $dev -b 512M > /dev/null
mount $dev $mnt -o nospace_cache
# Prealloc one extent
xfs_io -f -c "falloc 8k 64m" $mnt/file1
# Fill the remaining data space
xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding
sync
# Write into the prealloc extent
xfs_io -c "pwrite 1m 16m" $mnt/file1
# Reflink then fsync, fsync would fail due to ENOSPC
xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
umount $dev
The fsync fails with ENOSPC, and the last page of the buffered write is
lost.
[CAUSE]
This is caused by:
- Btrfs' back reference only has extent level granularity
So write into shared extent must be COWed even only part of the extent
is shared.
So for above script we have:
- fallocate
Create a preallocated extent where we can do NOCOW write.
- fill all the remaining data and unallocated space
- buffered write into preallocated space
As we have not enough space available for data and the extent is not
shared (yet) we fall into NOCOW mode.
- reflink
Now part of the large preallocated extent is shared, later write
into that extent must be COWed.
- fsync triggers writeback
But now the extent is shared and therefore we must fallback into COW
mode, which fails with ENOSPC since there's not enough space to
allocate data extents.
[WORKAROUND]
The workaround is to ensure any buffered write in the related extents
(not just the reflink source range) get flushed before reflink/dedupe,
so that NOCOW writes succeed that happened before reflinking succeed.
The workaround is expensive, we could do it better by only flushing
NOCOW range, but that needs extra accounting for NOCOW range.
For now, fix the possible data loss first.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The first thing code does in check_can_nocow is trying to block
concurrent snapshots. If this fails (due to snpashot already being in
progress) the function returns ENOSPC which makes no sense. Instead
return EAGAIN. Despite this return value not being propagated to callers
it's good practice to return the closest in terms of semantics error
code. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In case no cached_state argument is passed to
btrfs_lock_and_flush_ordered_range use one locally in the function. This
optimises the case when an ordered extent is found since the unlock
function will be able to unlock that state directly without searching
for it again.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There several functions which open code
btrfs_lock_and_flush_ordered_range, just replace them with a call to the
function. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a certain idiom used in multiple places in btrfs' codebase,
dealing with flushing an ordered range. Factor this in a separate
function that can be reused. Future patches will replace the existing
code with that function.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At the context of btrfs_run_delalloc_range(), we haven't started/joined
a transaction, thus even something went wrong, we can't and won't abort
transaction, thus no way to make the fs RO.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just add a safe net for btrfs_space_info member updating.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper lacks the btrfs_ prefix and the parameter is the raw
blockgroup type, so none of the callers has to do the flags -> index
conversion.
Signed-off-by: David Sterba <dsterba@suse.com>
The raid_attr table is now 7 * 56 = 392 bytes long, consisting of just
small numbers so we don't have to use ints. New size is 7 * 32 = 224,
saving 3 cachelines.
Signed-off-by: David Sterba <dsterba@suse.com>
Factor the sequence of ifs to a helper, the 'data stripes' here means
the number of stripes without redundancy and parity.
Signed-off-by: David Sterba <dsterba@suse.com>
Replace open coded list of the profiles by selecting them from the
raid_attr table. The criteria are now more explicit, we need profiles
that have more than 1 copy of the data or can reconstruct the data with
a missing device.
Signed-off-by: David Sterba <dsterba@suse.com>
Iterate over the table and gather all allowed profiles for a given
number of devices, instead of open coding.
Signed-off-by: David Sterba <dsterba@suse.com>
fs_info::mapping_tree is the physical<->logical mapping tree and uses
the same underlying structure as extents, but is embedded to another
structure. There are no other members and this indirection is useless.
No functional change.
Signed-off-by: David Sterba <dsterba@suse.com>
The minimum number of devices for RAID5 is 2, though this is only a bit
expensive RAID1, and for RAID6 it's 3, which is a triple copy that works
only 3 devices.
mkfs.btrfs allows that and mounting such filesystem also works, so the
conversion via balance filters is inconsistent with the others and we
should not prevent it.
Signed-off-by: David Sterba <dsterba@suse.com>
The list of profiles in btrfs_chunk_max_errors lists DUP as a profile
DUP able to tolerate 1 device missing. Though this profile is special
with 2 copies, it still needs the device, unlike the others.
Looking at the history of changes, thre's no clear reason why DUP is
there, functions were refactored and blocks of code merged to one
helper.
d20983b40e Btrfs: fix writing data into the seed filesystem
- factor code to a helper
de11cc12df Btrfs: don't pre-allocate btrfs bio
- unrelated change, DUP still in the list with max errors 1
a236aed14c Btrfs: Deal with failed writes in mirrored configurations
- introduced the max errors, leaves DUP and RAID1 in the same group
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This code was first introduced in 5f39d397df ("Btrfs: Create
extent_buffer interface for large blocksizes") and the function was
named btrfs_unlink_trans. It later got renamed to __btrfs_unlink_inode
and finally commit 16cdcec736 ("btrfs: implement delayed inode items
operation") changed the way inodes are deleted and obviated the need for
those two members.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ replace changelog by Nikolay's version ]
Signed-off-by: David Sterba <dsterba@suse.com>
This is a leftover from 312c89fbca ("btrfs: cleanup btrfs_mount()
using btrfs_mount_root()"), the mode was used for opening devices that's
not done here anymore.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In function do_trimming(), block_group->lock should be unlocked first,
as the locks should be released in the reverse order. This does not
cause problems but should follow the best practices.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Under certain conditions, we could have strange file extent item in log
tree like:
item 18 key (69599 108 397312) itemoff 15208 itemsize 53
extent data disk bytenr 0 nr 0
extent data offset 0 nr 18446744073709547520 ram 18446744073709547520
The num_bytes + ram_bytes overflow 64 bit type.
For num_bytes part, we can detect such overflow along with file offset
(key->offset), as file_offset + num_bytes should never go beyond u64.
For ram_bytes part, it's about the decompressed size of the extent, not
directly related to the size.
In theory it is OK to have a large value, and put extra limitation
on RAM bytes may cause unexpected false alerts.
So in tree-checker, we only check if the file offset and num bytes
overflow.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is already done in btrfs_init_dev_replace_tgtdev which is the first
phase of device replace, called before doing scrub. During that time
exclusive lock is held. Additionally btrfs_fs_device::commit_total_bytes
is always set based on the size of the underlying block device which
shouldn't change once set. This makes the 2nd assignment of the variable
in the finishing phase redundant.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Part of device replace involves writing an item to the device root
containing information about pending replace operations. Currently space
for this item is not being explicitly reserved so this works thanks to
presence of global reserve. While not fatal it's not a good practice.
Let's be explicit about space requirement of device replace and reserve
space when starting the transaction.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are only 2 branches which goto leave label with need_unlock set
to true. Essentially need_unlock is used as a substitute for directly
calling up_write. Since the branches needing this are only 2 and their
context is not that big it's more clear to just call up_write where
required. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_init_dev_replace_tgtdev reads certain values from the source
device (such as commit_total_bytes) which are updated during transaction
commit. Currently this function is called before committing any pending
transaction, leading to possibly reading outdated values.
Fix this by moving the function below the transaction commit, at this
point the EXCL_OP bit it set hence once transaction is complete the
total size of the device cannot be changed (it's usually changed by
resize/remove ops which are blocked).
Fixes: 9e271ae27e ("Btrfs: kernel operation should come after user input has been verified")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This WARN_ON can never trigger because src_device cannot be null.
btrfs_find_device_by_devspec always returns either an error or a valid
pointer to the device. Just remove it.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no point in holding btrfs_fs_devices::device_list_mutex
while initialising fields of the not-yet-published device. Instead,
hold the mutex only when the newly initialised device is being
published. I think holding device_list_mutex here is redundant
altogether, because at this point BTRFS_FS_EXCL_OP is set which
prevents device removal/addition/balance/resize to occur.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using sync_blockdev makes it plain obvious what's happening. No
functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_check_shared looks up parents of a given extent and uses ulists
for that. These are allocated and freed repeatedly. Preallocation in the
caller will avoid the overhead and also allow us to use the GFP_KERNEL
as it is happens before the extent locks are taken.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, there's only check for fast crc32c implementation on X86,
based on the CPU flags. This is used to decide if checksumming should be
offloaded to worker threads or can be calculated by the caller.
As there are more architectures that implement a faster version of
crc32c (ARM, SPARC, s390, MIPS, PowerPC), also there are specialized hw
cards.
The detection is based on driver name, all generic C implementations
contain 'generic', while the specialized versions do not. Alternatively
the priority could be used, but this is not currently provided by the
crypto API.
The flag is set per-filesystem at mount time and used for the offloading
decisions.
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using @sign to determine whether we're adding or subtracting.
Even it only has 3 callers, it's still (and in fact already caused
problem in the past) confusing to use.
Refactor add_pinned_bytes() to add_pinned_bytes() and sub_pinned_bytes()
to explicitly show what we're doing.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The kobj_type default_attrs field is being replaced by the
default_groups field. Replace the default_attrs fields in
btrfs_raid_ktype and space_info_ktype with default_groups.
Change "raid_attributes" to "raid_attrs", and use the ATTRIBUTE_GROUPS
macro to create raid_groups and space_info_groups.
Signed-off-by: Kimberly Brown <kimbrownkd@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This will allow generating fsnotify delete events after the
fsnotify_nameremove() hook is removed from d_delete().
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl0JE18ACgkQxWXV+ddt
WDskyw//fi4r3/D2lz7ZKHRz00o8h9gDyBw50y4FHXN44GrAWgM0dbi1l0VXRquy
A9Xy4rBoquDdUKkSIvMr3ff33eK+Y2O9oUBeXMcb4tfg8TicAMdyTHduDwka1Ljg
TXcupG8cWd7Boi9FfeSDDPQpV+NXxzL9VSy7uTMjMmmxWYWViMJo38ultsxUhoOY
ZVNY8LNT/F6i/4Us9D5ymzqn6uQWzfu2GXZT2I3Pq5Ps3PBSc7OJVhrPYE9FQ/jv
ptirddkoeOo6xQJ1Pb/UMPjkTZ5ct2Wy/lAvQPXiWf9FjjwR7zuSL1Xe3wzpUg/y
llENXp3Ps+oMzTF1XKid43yHlt0Swqzr+EIiNvLbSj5E5o5msKZ+jXQYHV8LHZfW
I12uizChQc3N11SrwC+gVou5oAMhTnJmHjlq96vWXaw0lcvX+yHMWP3w16OGM3Pc
9aN0ap6SgRBM6XZkXR65Rf4sIAnCm12hDrVdHNPJhz95W6PQZkPhMS0FWjrOAUYy
yqhrMqtY4ELNRBBBXJ4dq+k+l/I6lHkPbhPXVt0VekXdyKPr4o5WZLI1k3C2S14L
wXrqw6wTU4pgaD6LCqQ7NFtCZF3Zz+8L+MHbbK1LLlMFUcMFs/+cRrg7EKipOxxn
mm/mjOKD5lGUJPGKuFb87rCcgAOXcOWnF+FoR/FNf6HVEJJJOuY=
=uzhr
-----END PGP SIGNATURE-----
Merge tag 'for-5.2-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- regression where properties stored as xattrs are not properly
persisted
- a small readahead fix (the fstests testcase for that fix hangs on
unpatched kernel, so we'd like get it merged to ease future testing)
- fix a race during block group creation and deletion
* tag 'for-5.2-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix failure to persist compression property xattr deletion on fsync
btrfs: start readahead also in seed devices
Btrfs: fix race between block group removal and block group allocation
After the recent series of cleanups in the properties and xattrs modules
that landed in the 5.2 merge window, we ended up with a regression where
after deleting the compression xattr property through the setflags ioctl,
we don't set the BTRFS_INODE_COPY_EVERYTHING flag in the inode anymore.
As a consequence, if the inode was fsync'ed when it had the compression
property set, after deleting the compression property through the setflags
ioctl and fsync'ing again the inode, the log will still contain the
compression xattr, because the inode did not had that bit set, which
made the fsync not delete all xattrs from the log and copy all xattrs
from the subvolume tree to the log tree.
This regression happens due to the fact that that series of cleanups
made btrfs_set_prop() call the old function do_setxattr() (which is now
named btrfs_setxattr()), and not the old version of btrfs_setxattr(),
which is now called btrfs_setxattr_trans().
Fix this by setting the BTRFS_INODE_COPY_EVERYTHING bit in the current
btrfs_setxattr() function and remove it from everywhere else, including
its setup at btrfs_ioctl_setflags(). This is cleaner, avoids similar
regressions in the future, and centralizes the setup of the bit. After
all, the need to setup this bit should only be in the xattrs module,
since it is an implementation of xattrs.
Fixes: 04e6863b19 ("btrfs: split btrfs_setxattr calls regarding transaction")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, btrfs does not consult seed devices to start readahead. As a
result, if readahead zone is added to the seed devices, btrfs_reada_wait()
indefinitely wait for the reada_ctl to finish.
You can reproduce the hung by modifying btrfs/163 to have larger initial
file size (e.g. xfs_io pwrite 4M instead of current 256K).
Fixes: 7414a03fbf ("btrfs: initial readahead code and prototypes")
Cc: stable@vger.kernel.org # 3.2+: ce7791ffee: Btrfs: fix race between readahead and device replace/removal
Cc: stable@vger.kernel.org # 3.2+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If a task is removing the block group that currently has the highest start
offset amongst all existing block groups, there is a short time window
where it races with a concurrent block group allocation, resulting in a
transaction abort with an error code of EEXIST.
The following diagram explains the race in detail:
Task A Task B
btrfs_remove_block_group(bg offset X)
remove_extent_mapping(em offset X)
-> removes extent map X from the
tree of extent maps
(fs_info->mapping_tree), so the
next call to find_next_chunk()
will return offset X
btrfs_alloc_chunk()
find_next_chunk()
--> returns offset X
__btrfs_alloc_chunk(offset X)
btrfs_make_block_group()
btrfs_create_block_group_cache()
--> creates btrfs_block_group_cache
object with a key corresponding
to the block group item in the
extent, the key is:
(offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
--> adds the btrfs_block_group_cache object
to the list new_bgs of the transaction
handle
btrfs_end_transaction(trans handle)
__btrfs_end_transaction()
btrfs_create_pending_block_groups()
--> sees the new btrfs_block_group_cache
in the new_bgs list of the transaction
handle
--> its call to btrfs_insert_item() fails
with -EEXIST when attempting to insert
the block group item key
(offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
because task A has not removed that key yet
--> aborts the running transaction with
error -EEXIST
btrfs_del_item()
-> removes the block group's key from
the extent tree, key is
(offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
A sample transaction abort trace:
[78912.403537] ------------[ cut here ]------------
[78912.403811] BTRFS: Transaction aborted (error -17)
[78912.404082] WARNING: CPU: 2 PID: 20465 at fs/btrfs/extent-tree.c:10551 btrfs_create_pending_block_groups+0x196/0x250 [btrfs]
(...)
[78912.405642] CPU: 2 PID: 20465 Comm: btrfs Tainted: G W 5.0.0-btrfs-next-46 #1
[78912.405941] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[78912.406586] RIP: 0010:btrfs_create_pending_block_groups+0x196/0x250 [btrfs]
(...)
[78912.407636] RSP: 0018:ffff9d3d4b7e3b08 EFLAGS: 00010282
[78912.407997] RAX: 0000000000000000 RBX: ffff90959a3796f0 RCX: 0000000000000006
[78912.408369] RDX: 0000000000000007 RSI: 0000000000000001 RDI: ffff909636b16860
[78912.408746] RBP: ffff909626758a58 R08: 0000000000000000 R09: 0000000000000000
[78912.409144] R10: ffff9095ff462400 R11: 0000000000000000 R12: ffff90959a379588
[78912.409521] R13: ffff909626758ab0 R14: ffff9095036c0000 R15: ffff9095299e1158
[78912.409899] FS: 00007f387f16f700(0000) GS:ffff909636b00000(0000) knlGS:0000000000000000
[78912.410285] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[78912.410673] CR2: 00007f429fc87cbc CR3: 000000014440a004 CR4: 00000000003606e0
[78912.411095] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[78912.411496] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[78912.411898] Call Trace:
[78912.412318] __btrfs_end_transaction+0x5b/0x1c0 [btrfs]
[78912.412746] btrfs_inc_block_group_ro+0xcf/0x160 [btrfs]
[78912.413179] scrub_enumerate_chunks+0x188/0x5b0 [btrfs]
[78912.413622] ? __mutex_unlock_slowpath+0x100/0x2a0
[78912.414078] btrfs_scrub_dev+0x2ef/0x720 [btrfs]
[78912.414535] ? __sb_start_write+0xd4/0x1c0
[78912.414963] ? mnt_want_write_file+0x24/0x50
[78912.415403] btrfs_ioctl+0x17fb/0x3120 [btrfs]
[78912.415832] ? lock_acquire+0xa6/0x190
[78912.416256] ? do_vfs_ioctl+0xa2/0x6f0
[78912.416685] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[78912.417116] do_vfs_ioctl+0xa2/0x6f0
[78912.417534] ? __fget+0x113/0x200
[78912.417954] ksys_ioctl+0x70/0x80
[78912.418369] __x64_sys_ioctl+0x16/0x20
[78912.418812] do_syscall_64+0x60/0x1b0
[78912.419231] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[78912.419644] RIP: 0033:0x7f3880252dd7
(...)
[78912.420957] RSP: 002b:00007f387f16ed68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[78912.421426] RAX: ffffffffffffffda RBX: 000055f5becc1df0 RCX: 00007f3880252dd7
[78912.421889] RDX: 000055f5becc1df0 RSI: 00000000c400941b RDI: 0000000000000003
[78912.422354] RBP: 0000000000000000 R08: 00007f387f16f700 R09: 0000000000000000
[78912.422790] R10: 00007f387f16f700 R11: 0000000000000246 R12: 0000000000000000
[78912.423202] R13: 00007ffda49c266f R14: 0000000000000000 R15: 00007f388145e040
[78912.425505] ---[ end trace eb9bfe7c426fc4d3 ]---
Fix this by calling remove_extent_mapping(), at btrfs_remove_block_group(),
only at the very end, after removing the block group item key from the
extent tree (and removing the free space tree entry if we are using the
free space tree feature).
Fixes: 04216820fe ("Btrfs: fix race between fs trimming and block group remove/allocation")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlz/z0wACgkQxWXV+ddt
WDt/4g//eS+USsVpHV65AoOC+LWHQFvKHVNX97l53yy7GvINxgCO5+X7MctvCP1P
cEhhyGhapTFWinOj2zoMSz+mk1//wkqPsC9KzN+ha1fx0ktj+IKms+xbh3kFsygq
dbqLMBHobGpCVV1z7kg+YN0OnG3W90qv/5TqeYhPADBEzEDffXuj+5qEul88J47h
n4GP309mJcJwO1SmYOMFTchZDKJJnFMejr+KS+hOvzh3i5C6ZVOLeEs2yksdFvUi
X9zeM9sbzshPonzRQVR9xazzW3JKP69rgkz+fo5TLnKqYwXiGN9ObCCjtKm/rck6
pkJbvSmqV6QpX/pBUwdI/8DjgQyrPlfVyVcv5lU960mwya0eCkJYTa95bMC7N4UJ
NNfROq9aJHKmct/rHECLsOwRTy2KuAqpX7+Ktjy6Sarw9bvlPwpgBep7PvYR9DXf
mC9QHcRTYDCoKR5SF5xzB5lQHVOfWe6dduCugW8BhvO8t3ty+IW8p9WfcNXkTVu8
SQWuxF2Y9uOEuTJMmyw6gbl1KRppg+U95r7vpKKbPFXMrsJDcGy2eUhMHfChwnkO
brI2scsAaJg+UMvTjjO/8DVw4qpR9UDZaDgPqsGcFCNIQv65bPlL/UnQD112M2Ba
w9FsvwbyI1AgUC0JsslLoHQVcOqHl2DnxFyvtbFTy0zcJK8fPxY=
=UByh
-----END PGP SIGNATURE-----
Merge tag 'for-5.2-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One regression fix to TRIM ioctl.
The range cannot be used as its meaning can be confusing regarding
physical and logical addresses. This confusion in code led to
potential corruptions when the range overlapped data.
The original patch made it to several stable kernels and was promptly
reverted, the version for master branch is different due to additional
changes but the change is effectively the same"
* tag 'for-5.2-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: Always trim all unallocated space in btrfs_trim_free_extents
This patch removes support for range parameters of FITRIM ioctl when
trimming unallocated space on devices. This is necessary since ranges
passed from user space are generally interpreted as logical addresses,
whereas btrfs_trim_free_extents used to interpret them as device
physical extents. This could result in counter-intuitive behavior for
users so it's best to remove that support altogether.
Additionally, the existing range support had a bug where if an offset
was passed to FITRIM which overflows u64 e.g. -1 (parsed as u64
18446744073709551615) then wrong data was fed into btrfs_issue_discard,
which in turn leads to wrap-around when aligning the passed range and
results in wrong regions being discarded which leads to data corruption.
Fixes: c2d1b3aae3 ("btrfs: Honour FITRIM range constraints during free space trim")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlzvsOAACgkQxWXV+ddt
WDuLQg/+OHwlNW/8KT+1/gQvAxVnI2bglRJ3lYOQRenR8jA4y3rIKgXWXyd7A/uK
acrjeZYMaho5HY5VaKqAqDST7KikR+gPQh1IArYlBcL7tI5c/YsEgqf2G8PXo1U1
9B13og3kWpdIRNIF9OyKUPcGGfnG5UdBDGNFAEuQZpRXbFKJ+8+ijYU0dXIIFdJb
scl9vWQWFDoLlZ2szRDbl5gAG0lYwk5q0rTRDt+xyla83gD5UNP5oG8XNp1o/T5+
yDwM81IhQ636n51/NkX5RgFbs0ljjRqVzXJg5pa3XH1w9vwZuWoKRNcUhuDH6j9W
wL4Gw33Q8607uk01D5wDdtNI8JTOaXDDYnKsgzNb+7A7ICWlQ/8OR6VZintMioun
ccpNY7HMuVdGdRZxE7ZW63LxLyXulZW51r5G2IvBwRfT6aGl+oKwU4AwB6slEId3
S1ftxcCKYHqtCkRAutirjUknuYdzr0LB1sePoiFwQmIN6782fzuLF8O4hxl5Hcd9
UoEgz/240HiTDqsluUmVkurLVUwBk7CoIdec3tPELrCagI7rqG4H2nkj7XXMJiVD
XyCJZB0dF3E6G8TzlL5lKQWDniqDrLizYwnxYr6OSYZvp9kzfHgxpTPGdxwbIAjr
JT+v6332N09ODooODtzci0Pt0YdfcK1tIhcWXP+oLpE4v/PZj8g=
=lyvo
-----END PGP SIGNATURE-----
Merge tag 'for-5.2-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes for bugs reported by users, fuzzing tools and
regressions:
- fix crashes in relocation:
+ resuming interrupted balance operation does not properly clean
up orphan trees
+ with enabled qgroups, resuming needs to be more careful about
block groups due to limited context when updating qgroups
- fsync and logging fixes found by fuzzing
- incremental send fixes for no-holes and clone
- fix spin lock type used in timer function for zstd"
* tag 'for-5.2-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix race updating log root item during fsync
Btrfs: fix wrong ctime and mtime of a directory after log replay
Btrfs: fix fsync not persisting changed attributes of a directory
btrfs: qgroup: Check bg while resuming relocation to avoid NULL pointer dereference
btrfs: reloc: Also queue orphan reloc tree for cleanup to avoid BUG_ON()
Btrfs: incremental send, fix emission of invalid clone operations
Btrfs: incremental send, fix file corruption when no-holes feature is enabled
btrfs: correct zstd workspace manager lock to use spin_lock_bh()
btrfs: Ensure replaced device doesn't have pending chunk allocation
When syncing the log, the final phase of a fsync operation, we need to
either create a log root's item or update the existing item in the log
tree of log roots, and that depends on the current value of the log
root's log_transid - if it's 1 we need to create the log root item,
otherwise it must exist already and we update it. Since there is no
synchronization between updating the log_transid and checking it for
deciding whether the log root's item needs to be created or updated, we
end up with a tiny race window that results in attempts to update the
item to fail because the item was not yet created:
CPU 1 CPU 2
btrfs_sync_log()
lock root->log_mutex
set log root's log_transid to 1
unlock root->log_mutex
btrfs_sync_log()
lock root->log_mutex
sets log root's
log_transid to 2
unlock root->log_mutex
update_log_root()
sees log root's log_transid
with a value of 2
calls btrfs_update_root(),
which fails with -EUCLEAN
and causes transaction abort
Until recently the race lead to a BUG_ON at btrfs_update_root(), but after
the recent commit 7ac1e464c4 ("btrfs: Don't panic when we can't find a
root key") we just abort the current transaction.
A sample trace of the BUG_ON() on a SLE12 kernel:
------------[ cut here ]------------
kernel BUG at ../fs/btrfs/root-tree.c:157!
Oops: Exception in kernel mode, sig: 5 [#1]
SMP NR_CPUS=2048 NUMA pSeries
(...)
Supported: Yes, External
CPU: 78 PID: 76303 Comm: rtas_errd Tainted: G X 4.4.156-94.57-default #1
task: c00000ffa906d010 ti: c00000ff42b08000 task.ti: c00000ff42b08000
NIP: d000000036ae5cdc LR: d000000036ae5cd8 CTR: 0000000000000000
REGS: c00000ff42b0b860 TRAP: 0700 Tainted: G X (4.4.156-94.57-default)
MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 22444484 XER: 20000000
CFAR: d000000036aba66c SOFTE: 1
GPR00: d000000036ae5cd8 c00000ff42b0bae0 d000000036bda220 0000000000000054
GPR04: 0000000000000001 0000000000000000 c00007ffff8d37c8 0000000000000000
GPR08: c000000000e19c00 0000000000000000 0000000000000000 3736343438312079
GPR12: 3930373337303434 c000000007a3a800 00000000007fffff 0000000000000023
GPR16: c00000ffa9d26028 c00000ffa9d261f8 0000000000000010 c00000ffa9d2ab28
GPR20: c00000ff42b0bc48 0000000000000001 c00000ff9f0d9888 0000000000000001
GPR24: c00000ffa9d26000 c00000ffa9d261e8 c00000ffa9d2a800 c00000ff9f0d9888
GPR28: c00000ffa9d26028 c00000ffa9d2aa98 0000000000000001 c00000ffa98f5b20
NIP [d000000036ae5cdc] btrfs_update_root+0x25c/0x4e0 [btrfs]
LR [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs]
Call Trace:
[c00000ff42b0bae0] [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs] (unreliable)
[c00000ff42b0bba0] [d000000036b53610] btrfs_sync_log+0x2d0/0xc60 [btrfs]
[c00000ff42b0bce0] [d000000036b1785c] btrfs_sync_file+0x44c/0x4e0 [btrfs]
[c00000ff42b0bd80] [c00000000032e300] vfs_fsync_range+0x70/0x120
[c00000ff42b0bdd0] [c00000000032e44c] do_fsync+0x5c/0xb0
[c00000ff42b0be10] [c00000000032e8dc] SyS_fdatasync+0x2c/0x40
[c00000ff42b0be30] [c000000000009488] system_call+0x3c/0x100
Instruction dump:
7f43d378 4bffebb9 60000000 88d90008 3d220000 e8b90000 3b390009 e87a01f0
e8898e08 e8f90000 4bfd48e5 60000000 <0fe00000> e95b0060 39200004 394a0ea0
---[ end trace 8f2dc8f919cabab8 ]---
So fix this by doing the check of log_transid and updating or creating the
log root's item while holding the root's log_mutex.
Fixes: 7237f18336 ("Btrfs: fix tree logs parallel sync")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When replaying a log that contains a new file or directory name that needs
to be added to its parent directory, we end up updating the mtime and the
ctime of the parent directory to the current time after we have set their
values to the correct ones (set at fsync time), efectivelly losing them.
Sample reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir
$ touch /mnt/dir/file
# fsync of the directory is optional, not needed
$ xfs_io -c fsync /mnt/dir
$ xfs_io -c fsync /mnt/dir/file
$ stat -c %Y /mnt/dir
1557856079
<power failure>
$ sleep 3
$ mount /dev/sdb /mnt
$ stat -c %Y /mnt/dir
1557856082
--> should have been 1557856079, the mtime is updated to the current
time when replaying the log
Fix this by not updating the mtime and ctime to the current time at
btrfs_add_link() when we are replaying a log tree.
This could be triggered by my recent fsync fuzz tester for fstests, for
which an fstests patch exists titled "fstests: generic, fsync fuzz tester
with fsstress".
Fixes: e02119d5a7 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While logging an inode we follow its ancestors and for each one we mark
it as logged in the current transaction, even if we have not logged it.
As a consequence if we change an attribute of an ancestor, such as the
UID or GID for example, and then explicitly fsync it, we end up not
logging the inode at all despite returning success to user space, which
results in the attribute being lost if a power failure happens after
the fsync.
Sample reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir
$ chown 6007:6007 /mnt/dir
$ sync
$ chown 9003:9003 /mnt/dir
$ touch /mnt/dir/file
$ xfs_io -c fsync /mnt/dir/file
# fsync our directory after fsync'ing the new file, should persist the
# new values for the uid and gid.
$ xfs_io -c fsync /mnt/dir
<power failure>
$ mount /dev/sdb /mnt
$ stat -c %u:%g /mnt/dir
6007:6007
--> should be 9003:9003, the uid and gid were not persisted, despite
the explicit fsync on the directory prior to the power failure
Fix this by not updating the logged_trans field of ancestor inodes when
logging an inode, since we have not logged them. Let only future calls to
btrfs_log_inode() to mark inodes as logged.
This could be triggered by my recent fsync fuzz tester for fstests, for
which an fstests patch exists titled "fstests: generic, fsync fuzz tester
with fsstress".
Fixes: 12fcfd22fe ("Btrfs: tree logging unlink/rename fixes")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When mounting a fs with reloc tree and has qgroup enabled, it can cause
NULL pointer dereference at mount time:
BUG: kernel NULL pointer dereference, address: 00000000000000a8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:btrfs_qgroup_add_swapped_blocks+0x186/0x300 [btrfs]
Call Trace:
replace_path.isra.23+0x685/0x900 [btrfs]
merge_reloc_root+0x26e/0x5f0 [btrfs]
merge_reloc_roots+0x10a/0x1a0 [btrfs]
btrfs_recover_relocation+0x3cd/0x420 [btrfs]
open_ctree+0x1bc8/0x1ed0 [btrfs]
btrfs_mount_root+0x544/0x680 [btrfs]
legacy_get_tree+0x34/0x60
vfs_get_tree+0x2d/0xf0
fc_mount+0x12/0x40
vfs_kern_mount.part.12+0x61/0xa0
vfs_kern_mount+0x13/0x20
btrfs_mount+0x16f/0x860 [btrfs]
legacy_get_tree+0x34/0x60
vfs_get_tree+0x2d/0xf0
do_mount+0x81f/0xac0
ksys_mount+0xbf/0xe0
__x64_sys_mount+0x25/0x30
do_syscall_64+0x65/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe
[CAUSE]
In btrfs_recover_relocation(), we don't have enough info to determine
which block group we're relocating, but only to merge existing reloc
trees.
Thus in btrfs_recover_relocation(), rc->block_group is NULL.
btrfs_qgroup_add_swapped_blocks() hasn't taken this into consideration,
and causes a NULL pointer dereference.
The bug is introduced by commit 3d0174f78e ("btrfs: qgroup: Only trace
data extents in leaves if we're relocating data block group"), and
later qgroup refactoring still keeps this optimization.
[FIX]
Thankfully in the context of btrfs_recover_relocation(), there is no
other progress can modify tree blocks, thus those swapped tree blocks
pair will never affect qgroup numbers, no matter whatever we set for
block->trace_leaf.
So we only need to check if @bg is NULL before accessing @bg->flags.
Reported-by: Juan Erbes <jerbes@gmail.com>
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1134806
Fixes: 3d0174f78e ("btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group")
CC: stable@vger.kernel.org # 4.20+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When a fs has orphan reloc tree along with unfinished balance:
...
item 16 key (TREE_RELOC ROOT_ITEM FS_TREE) itemoff 12090 itemsize 439
generation 12 root_dirid 256 bytenr 300400640 level 1 refs 0 <<<
lastsnap 8 byte_limit 0 bytes_used 1359872 flags 0x0(none)
uuid 7c48d938-33a3-4aae-ab19-6e5c9d406e46
item 17 key (BALANCE TEMPORARY_ITEM 0) itemoff 11642 itemsize 448
temporary item objectid BALANCE offset 0
balance status flags 14
Then at mount time, we can hit the following kernel BUG_ON():
BTRFS info (device dm-3): relocating block group 298844160 flags metadata|dup
------------[ cut here ]------------
kernel BUG at fs/btrfs/relocation.c:1413!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 897 Comm: btrfs-balance Tainted: G O 5.2.0-rc1-custom #15
RIP: 0010:create_reloc_root+0x1eb/0x200 [btrfs]
Call Trace:
btrfs_init_reloc_root+0x96/0xb0 [btrfs]
record_root_in_trans+0xb2/0xe0 [btrfs]
btrfs_record_root_in_trans+0x55/0x70 [btrfs]
select_reloc_root+0x7e/0x230 [btrfs]
do_relocation+0xc4/0x620 [btrfs]
relocate_tree_blocks+0x592/0x6a0 [btrfs]
relocate_block_group+0x47b/0x5d0 [btrfs]
btrfs_relocate_block_group+0x183/0x2f0 [btrfs]
btrfs_relocate_chunk+0x4e/0xe0 [btrfs]
btrfs_balance+0x864/0xfa0 [btrfs]
balance_kthread+0x3b/0x50 [btrfs]
kthread+0x123/0x140
ret_from_fork+0x27/0x50
[CAUSE]
In btrfs, reloc trees are used to record swapped tree blocks during
balance.
Reloc tree either get merged (replace old tree blocks of its parent
subvolume) in next transaction if its ref is 1 (fresh).
Or is already merged and will be cleaned up if its ref is 0 (orphan).
After commit d2311e6985 ("btrfs: relocation: Delay reloc tree deletion
after merge_reloc_roots"), reloc tree cleanup is delayed until one block
group is balanced.
Since fresh reloc roots are recorded during merge, as long as there
is no power loss, those orphan reloc roots converted from fresh ones are
handled without problem.
However when power loss happens, orphan reloc roots can be recorded
on-disk, thus at next mount time, we will have orphan reloc roots from
on-disk data directly, and ignored by clean_dirty_subvols() routine.
Then when background balance starts to balance another block group, and
needs to create new reloc root for the same root, btrfs_insert_item()
returns -EEXIST, and trigger that BUG_ON().
[FIX]
For orphan reloc roots, also queue them to rc->dirty_subvol_roots, so
all reloc roots no matter orphan or not, can be cleaned up properly and
avoid above BUG_ON().
And to cooperate with above change, clean_dirty_subvols() will check if
the queued root is a reloc root or a subvol root.
For a subvol root, do the old work, and for a orphan reloc root, clean it
up.
Fixes: d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
CC: stable@vger.kernel.org # 5.1
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing an incremental send we can now issue clone operations with a
source range that ends at the source's file eof and with a destination
range that ends at an offset smaller then the destination's file eof.
If the eof of the source file is not aligned to the sector size of the
filesystem, the receiver will get a -EINVAL error when trying to do the
operation or, on older kernels, silently corrupt the destination file.
The corruption happens on kernels without commit ac765f83f1
("Btrfs: fix data corruption due to cloning of eof block"), while the
failure to clone happens on kernels with that commit.
Example reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt/sdb
$ xfs_io -f -c "pwrite -S 0xb1 0 2M" /mnt/sdb/foo
$ xfs_io -f -c "pwrite -S 0xc7 0 2M" /mnt/sdb/bar
$ xfs_io -f -c "pwrite -S 0x4d 0 2M" /mnt/sdb/baz
$ xfs_io -f -c "pwrite -S 0xe2 0 2M" /mnt/sdb/zoo
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
$ btrfs send -f /tmp/base.send /mnt/sdb/base
$ xfs_io -c "reflink /mnt/sdb/bar 1560K 500K 100K" /mnt/sdb/bar
$ xfs_io -c "reflink /mnt/sdb/bar 1560K 0 100K" /mnt/sdb/zoo
$ xfs_io -c "truncate 550K" /mnt/sdb/bar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
$ btrfs send -f /tmp/incr.send -p /mnt/sdb/base /mnt/sdb/incr
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ btrfs receive -f /tmp/base.send /mnt/sdc
$ btrfs receive -vv -f /tmp/incr.send /mnt/sdc
(...)
truncate bar size=563200
utimes bar
clone zoo - source=bar source offset=512000 offset=0 length=51200
ERROR: failed to clone extents to zoo
Invalid argument
The failure happens because the clone source range ends at the eof of file
bar, 563200, which is not aligned to the filesystems sector size (4Kb in
this case), and the destination range ends at offset 0 + 51200, which is
less then the size of the file zoo (2Mb).
So fix this by detecting such case and instead of issuing a clone
operation for the whole range, do a clone operation for smaller range
that is sector size aligned followed by a write operation for the block
containing the eof. Here we will always be pessimistic and assume the
destination filesystem of the send stream has the largest possible sector
size (64Kb), since we have no way of determining it.
This fixes a recent regression introduced in kernel 5.2-rc1.
Fixes: 040ee6120c ("Btrfs: send, improve clone range")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using the no-holes feature, if we have a file with prealloc extents
with a start offset beyond the file's eof, doing an incremental send can
cause corruption of the file due to incorrect hole detection. Such case
requires that the prealloc extent(s) exist in both the parent and send
snapshots, and that a hole is punched into the file that covers all its
extents that do not cross the eof boundary.
Example reproducer:
$ mkfs.btrfs -f -O no-holes /dev/sdb
$ mount /dev/sdb /mnt/sdb
$ xfs_io -f -c "pwrite -S 0xab 0 500K" /mnt/sdb/foobar
$ xfs_io -c "falloc -k 1200K 800K" /mnt/sdb/foobar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
$ btrfs send -f /tmp/base.snap /mnt/sdb/base
$ xfs_io -c "fpunch 0 500K" /mnt/sdb/foobar
$ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
$ btrfs send -p /mnt/sdb/base -f /tmp/incr.snap /mnt/sdb/incr
$ md5sum /mnt/sdb/incr/foobar
816df6f64deba63b029ca19d880ee10a /mnt/sdb/incr/foobar
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ btrfs receive -f /tmp/base.snap /mnt/sdc
$ btrfs receive -f /tmp/incr.snap /mnt/sdc
$ md5sum /mnt/sdc/incr/foobar
cf2ef71f4a9e90c2f6013ba3b2257ed2 /mnt/sdc/incr/foobar
--> Different checksum, because the prealloc extent beyond the
file's eof confused the hole detection code and it assumed
a hole starting at offset 0 and ending at the offset of the
prealloc extent (1200Kb) instead of ending at the offset
500Kb (the file's size).
Fix this by ensuring we never cross the file's size when issuing the
write operations for a hole.
Fixes: 16e7549f04 ("Btrfs: incompatible format change to remove hole extents")
CC: stable@vger.kernel.org # 3.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Recent FITRIM work, namely bbbf7243d6 ("btrfs: combine device update
operations during transaction commit") combined the way certain
operations are recoded in a transaction. As a result an ASSERT was added
in dev_replace_finish to ensure the new code works correctly.
Unfortunately I got reports that it's possible to trigger the assert,
meaning that during a device replace it's possible to have an unfinished
chunk allocation on the source device.
This is supposed to be prevented by the fact that a transaction is
committed before finishing the replace oepration and alter acquiring the
chunk mutex. This is not sufficient since by the time the transaction is
committed and the chunk mutex acquired it's possible to allocate a chunk
depending on the workload being executed on the replaced device. This
bug has been present ever since device replace was introduced but there
was never code which checks for it.
The correct way to fix is to ensure that there is no pending device
modification operation when the chunk mutex is acquire and if there is
repeat transaction commit. Unfortunately it's not possible to just
exclude the source device from btrfs_fs_devices::dev_alloc_list since
this causes ENOSPC to be hit in transaction commit.
Fixing that in another way would need to add special cases to handle the
last writes and forbid new ones. The looped transaction fix is more
obvious, and can be easily backported. The runtime of dev-replace is
long so there's no noticeable delay caused by that.
Reported-by: David Sterba <dsterba@suse.com>
Fixes: 391cd9df81 ("Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Convert the btrfs_test filesystem to the new internal mount API as the old
one will be obsoleted and removed. This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.
See Documentation/filesystems/mount_api.txt for more information.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David Sterba <dsterba@suse.com>
cc: Chris Mason <clm@fb.com>
cc: Josef Bacik <josef@toxicpanda.com>
cc: linux-btrfs@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Once upon a time we used to set ->d_name of e.g. pipefs root
so that d_path() on pipes would work. These days it's
completely pointless - dentries of pipes are not even connected
to pipefs root. However, mount_pseudo() had set the root
dentry name (passed as the second argument) and callers
kept inventing names to pass to it. Including those that
didn't *have* any non-root dentries to start with...
All of that had been pointless for about 8 years now; it's
time to get rid of that cargo-culting...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlzi2E8ACgkQxWXV+ddt
WDuwdg/9Gil8uC28r7HLk1DkMdUZp6qHPXC2D79iN63XOIyxtTv2Y/ZDOneHheTa
NaW9DOe6PUWoVyrYRCM/BhRxouZp0cFlpMG1m8ABdaO3uSCzwlc9wHs7YPNOwiGJ
DM3qikX4V8w0ECoY3Z9NzbHLGTi9INzgkuazWGQnplK1ZA7CHe4RLH1r442daTAO
iFr+bhjODmwHyebXlK66dcOGw7HXp4ac+iyZnlivNcTipGtOTdA7kryZLaNmfepz
JfMESxGMrLhdrd/YxeaDEVYRAh1ZSD57/WGrQDeRQ54qD2ELXmoPX0rAtquwoziS
F1PSitiW0DzYGjS+KCKP9553tlEtJ5Md45k0AibK4h/aqCPy6s6khK/PfsHQT5K+
lD0CqwB4zr9zOhS0n1uFRlNomzK4UZ2SPDtB4KMpCCEQLlvwJIkUqb3Bx6JZgAEH
FPFEZGVX/Xyqv6w/VASHHhhoAGRJ/mIx+mU/RGVU+jFVBzwd0EmlCymFDMF2z44K
8HZz7ib4fMvArR5S2uEz/h85JM7EzDG7YkPluzERiQy86Abi79QQl8qWfC7yBGYd
K3g6VQM/H6NUprXqTNQ/NU7Zvrq5HPXC+NhrLvC+Ul0DlwLAwxRj8NeYImUuDDpi
Du49hJcV0U2kWocvwdP+600y48UroioJHlqKtqlng3NKxdjUGxw=
=qN6T
-----END PGP SIGNATURE-----
Merge tag 'for-5.2-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Notable highlights:
- fixes for some long-standing bugs in fsync that were quite hard to
catch but now finaly fixed
- some fixups to error handling paths that did not properly clean up
(locking, memory)
- fix to space reservation for inheriting properties"
* tag 'for-5.2-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: tree-checker: detect file extent items with overlapping ranges
Btrfs: fix race between ranged fsync and writeback of adjacent ranges
Btrfs: avoid fallback to transaction commit during fsync of files with holes
btrfs: extent-tree: Fix a bug that btrfs is unable to add pinned bytes
btrfs: sysfs: don't leak memory when failing add fsid
btrfs: sysfs: Fix error path kobject memory leak
Btrfs: do not abort transaction at btrfs_update_root() after failure to COW path
btrfs: use the existing reserved items for our first prop for inheritance
btrfs: don't double unlock on error in btrfs_punch_hole
btrfs: Check the compression level before getting a workspace
Having file extent items with ranges that overlap each other is a
serious issue that leads to all sorts of corruptions and crashes (like a
BUG_ON() during the course of __btrfs_drop_extents() when it traims file
extent items). Therefore teach the tree checker to detect such cases.
This is motivated by a recently fixed bug (race between ranged full
fsync and writeback or adjacent ranges).
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
inode) that happens to be ranged, which happens during a msync() or writes
for files opened with O_SYNC for example, we can end up with a corrupt log,
due to different file extent items representing ranges that overlap with
each other, or hit some assertion failures.
When doing a ranged fsync we only flush delalloc and wait for ordered
exents within that range. If while we are logging items from our inode
ordered extents for adjacent ranges complete, we end up in a race that can
make us insert the file extent items that overlap with others we logged
previously and the assertion failures.
For example, if tree-log.c:copy_items() receives a leaf that has the
following file extents items, all with a length of 4K and therefore there
is an implicit hole in the range 68K to 72K - 1:
(257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...
It copies them to the log tree. However due to the need to detect implicit
holes, it may release the path, in order to look at the previous leaf to
detect an implicit hole, and then later it will search again in the tree
for the first file extent item key, with the goal of locking again the
leaf (which might have changed due to concurrent changes to other inodes).
However when it locks again the leaf containing the first key, the key
corresponding to the extent at offset 72K may not be there anymore since
there is an ordered extent for that range that is finishing (that is,
somewhere in the middle of btrfs_finish_ordered_io()), and it just
removed the file extent item but has not yet replaced it with a new file
extent item, so the part of copy_items() that does hole detection will
decide that there is a hole in the range starting from 68K to 76K - 1,
and therefore insert a file extent item to represent that hole, having
a key offset of 68K. After that we now have a log tree with 2 different
extent items that have overlapping ranges:
1) The file extent item copied before copy_items() released the path,
which has a key offset of 72K and a length of 4K, representing the
file range 72K to 76K - 1.
2) And a file extent item representing a hole that has a key offset of
68K and a length of 8K, representing the range 68K to 76K - 1. This
item was inserted after releasing the path, and overlaps with the
extent item inserted before.
The overlapping extent items can cause all sorts of unpredictable and
incorrect behaviour, either when replayed or if a fast (non full) fsync
happens later, which can trigger a BUG_ON() when calling
btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a
trace like the following:
[61666.783269] ------------[ cut here ]------------
[61666.783943] kernel BUG at fs/btrfs/ctree.c:3182!
[61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP
(...)
[61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000
[61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs]
[61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246
[61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000
[61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937
[61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000
[61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418
[61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000
[61666.786253] FS: 00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000
[61666.786253] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0
[61666.786253] Call Trace:
[61666.786253] __btrfs_drop_extents+0x5e3/0xaad [btrfs]
[61666.786253] ? time_hardirqs_on+0x9/0x14
[61666.786253] btrfs_log_changed_extents+0x294/0x4e0 [btrfs]
[61666.786253] ? release_extent_buffer+0x38/0xb4 [btrfs]
[61666.786253] btrfs_log_inode+0xb6e/0xcdc [btrfs]
[61666.786253] ? lock_acquire+0x131/0x1c5
[61666.786253] ? btrfs_log_inode_parent+0xee/0x659 [btrfs]
[61666.786253] ? arch_local_irq_save+0x9/0xc
[61666.786253] ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs]
[61666.786253] btrfs_log_inode_parent+0x223/0x659 [btrfs]
[61666.786253] ? arch_local_irq_save+0x9/0xc
[61666.786253] ? lockref_get_not_zero+0x2c/0x34
[61666.786253] ? rcu_read_unlock+0x3e/0x5d
[61666.786253] btrfs_log_dentry_safe+0x60/0x7b [btrfs]
[61666.786253] btrfs_sync_file+0x317/0x42c [btrfs]
[61666.786253] vfs_fsync_range+0x8c/0x9e
[61666.786253] SyS_msync+0x13c/0x1c9
[61666.786253] entry_SYSCALL_64_fastpath+0x18/0xad
A sample of a corrupt log tree leaf with overlapping extents I got from
running btrfs/072:
item 14 key (295 108 200704) itemoff 2599 itemsize 53
extent data disk bytenr 0 nr 0
extent data offset 0 nr 458752 ram 458752
item 15 key (295 108 659456) itemoff 2546 itemsize 53
extent data disk bytenr 4343541760 nr 770048
extent data offset 606208 nr 163840 ram 770048
item 16 key (295 108 663552) itemoff 2493 itemsize 53
extent data disk bytenr 4343541760 nr 770048
extent data offset 610304 nr 155648 ram 770048
item 17 key (295 108 819200) itemoff 2440 itemsize 53
extent data disk bytenr 4334788608 nr 4096
extent data offset 0 nr 4096 ram 4096
The file extent item at offset 659456 (item 15) ends at offset 823296
(659456 + 163840) while the next file extent item (item 16) starts at
offset 663552.
Another different problem that the race can trigger is a failure in the
assertions at tree-log.c:copy_items(), which expect that the first file
extent item key we found before releasing the path exists after we have
released path and that the last key we found before releasing the path
also exists after releasing the path:
$ cat -n fs/btrfs/tree-log.c
4080 if (need_find_last_extent) {
4081 /* btrfs_prev_leaf could return 1 without releasing the path */
4082 btrfs_release_path(src_path);
4083 ret = btrfs_search_slot(NULL, inode->root, &first_key,
4084 src_path, 0, 0);
4085 if (ret < 0)
4086 return ret;
4087 ASSERT(ret == 0);
(...)
4103 if (i >= btrfs_header_nritems(src_path->nodes[0])) {
4104 ret = btrfs_next_leaf(inode->root, src_path);
4105 if (ret < 0)
4106 return ret;
4107 ASSERT(ret == 0);
4108 src = src_path->nodes[0];
4109 i = 0;
4110 need_find_last_extent = true;
4111 }
(...)
The second assertion implicitly expects that the last key before the path
release still exists, because the surrounding while loop only stops after
we have found that key. When this assertion fails it produces a stack like
this:
[139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107
[139590.037406] ------------[ cut here ]------------
[139590.037707] kernel BUG at fs/btrfs/ctree.h:3546!
[139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G W 5.0.0-btrfs-next-46 #1
(...)
[139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs]
(...)
[139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282
[139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000
[139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868
[139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000
[139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013
[139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001
[139590.042501] FS: 00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000
[139590.042847] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0
[139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[139590.044250] Call Trace:
[139590.044631] copy_items+0xa3f/0x1000 [btrfs]
[139590.045009] ? generic_bin_search.constprop.32+0x61/0x200 [btrfs]
[139590.045396] btrfs_log_inode+0x7b3/0xd70 [btrfs]
[139590.045773] btrfs_log_inode_parent+0x2b3/0xce0 [btrfs]
[139590.046143] ? do_raw_spin_unlock+0x49/0xc0
[139590.046510] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[139590.046872] btrfs_sync_file+0x3b6/0x440 [btrfs]
[139590.047243] btrfs_file_write_iter+0x45b/0x5c0 [btrfs]
[139590.047592] __vfs_write+0x129/0x1c0
[139590.047932] vfs_write+0xc2/0x1b0
[139590.048270] ksys_write+0x55/0xc0
[139590.048608] do_syscall_64+0x60/0x1b0
[139590.048946] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[139590.049287] RIP: 0033:0x7f2efc4be190
(...)
[139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190
[139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003
[139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60
[139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003
[139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000
(...)
[139590.055128] ---[ end trace 193f35d0215cdeeb ]---
So fix this race between a full ranged fsync and writeback of adjacent
ranges by flushing all delalloc and waiting for all ordered extents to
complete before logging the inode. This is the simplest way to solve the
problem because currently the full fsync path does not deal with ranges
at all (it assumes a full range from 0 to LLONG_MAX) and it always needs
to look at adjacent ranges for hole detection. For use cases of ranged
fsyncs this can make a few fsyncs slower but on the other hand it can
make some following fsyncs to other ranges do less work or no need to do
anything at all. A full fsync is rare anyway and happens only once after
loading/creating an inode and once after less common operations such as a
shrinking truncate.
This is an issue that exists for a long time, and was often triggered by
generic/127, because it does mmap'ed writes and msync (which triggers a
ranged fsync). Adding support for the tree checker to detect overlapping
extents (next patch in the series) and trigger a WARN() when such cases
are found, and then calling btrfs_check_leaf_full() at the end of
btrfs_insert_file_extent() made the issue much easier to detect. Running
btrfs/072 with that change to the tree checker and making fsstress open
files always with O_SYNC made it much easier to trigger the issue (as
triggering it with generic/127 is very rare).
CC: stable@vger.kernel.org # 3.16+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a
file that has holes and has file extent items spanning two or more leafs,
we can end up falling to back to a full transaction commit due to a logic
bug that leads to failure to insert a duplicate file extent item that is
meant to represent a hole between the last file extent item of a leaf and
the first file extent item in the next leaf. The failure (EEXIST error)
leads to a transaction commit (as most errors when logging an inode do).
For example, we have the two following leafs:
Leaf N:
-----------------------------------------------
| ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) |
-----------------------------------------------
The file extent item at the end of leaf N has a length of 4Kb,
representing the file range from 64K to 68K - 1.
Leaf N + 1:
-----------------------------------------------
| (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... |
-----------------------------------------------
The file extent item at the first slot of leaf N + 1 has a length of
4Kb too, representing the file range from 72K to 76K - 1.
During the full fsync path, when we are at tree-log.c:copy_items() with
leaf N as a parameter, after processing the last file extent item, that
represents the extent at offset 64K, we take a look at the first file
extent item at the next leaf (leaf N + 1), and notice there's a 4K hole
between the two extents, and therefore we insert a file extent item
representing that hole, starting at file offset 68K and ending at offset
72K - 1. However we don't update the value of *last_extent, which is used
to represent the end offset (plus 1, non-inclusive end) of the last file
extent item inserted in the log, so it stays with a value of 68K and not
with a value of 72K.
Then, when copy_items() is called for leaf N + 1, because the value of
*last_extent is smaller then the offset of the first extent item in the
leaf (68K < 72K), we look at the last file extent item in the previous
leaf (leaf N) and see it there's a 4K gap between it and our first file
extent item (again, 68K < 72K), so we decide to insert a file extent item
representing the hole, starting at file offset 68K and ending at offset
72K - 1, this insertion will fail with -EEXIST being returned from
btrfs_insert_file_extent() because we already inserted a file extent item
representing a hole for this offset (68K) in the previous call to
copy_items(), when processing leaf N.
The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(),
which falls back to a full transaction commit.
Fix this by adjusting *last_extent after inserting a hole when we had to
look at the next leaf.
Fixes: 4ee3fad34a ("Btrfs: fix fsync after hole punching when using no-holes feature")
Cc: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit ddf30cf03f ("btrfs: extent-tree: Use btrfs_ref to refactor
add_pinned_bytes()") refactored add_pinned_bytes(), but during that
refactor, there are two callers which add the pinned bytes instead
of subtracting.
That refactor misses those two caller, causing incorrect pinned bytes
calculation and resulting unexpected ENOSPC error.
Fix it by adding a new parameter @sign to restore the original behavior.
Reported-by: kernel test robot <rong.a.chen@intel.com>
Fixes: ddf30cf03f ("btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A failed call to kobject_init_and_add() must be followed by a call to
kobject_put(). Currently in the error path when adding fs_devices we
are missing this call. This could be fixed by calling
btrfs_sysfs_remove_fsid() if btrfs_sysfs_add_fsid() returns an error or
by adding a call to kobject_put() directly in btrfs_sysfs_add_fsid().
Here we choose the second option because it prevents the slightly
unusual error path handling requirements of kobject from leaking out
into btrfs functions.
Add a call to kobject_put() in the error path of kobject_add_and_init().
This causes the release method to be called if kobject_init_and_add()
fails. open_tree() is the function that calls btrfs_sysfs_add_fsid()
and the error code in this function is already written with the
assumption that the release method is called during the error path of
open_tree() (as seen by the call to btrfs_sysfs_remove_fsid() under the
fail_fsdev_sysfs label).
Cc: stable@vger.kernel.org # v4.4+
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If a call to kobject_init_and_add() fails we must call kobject_put()
otherwise we leak memory.
Calling kobject_put() when kobject_init_and_add() fails drops the
refcount back to 0 and calls the ktype release method (which in turn
calls the percpu destroy and kfree).
Add call to kobject_put() in the error path of call to
kobject_init_and_add().
Cc: stable@vger.kernel.org # v4.4+
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently when we fail to COW a path at btrfs_update_root() we end up
always aborting the transaction. However all the current callers of
btrfs_update_root() are able to deal with errors returned from it, many do
end up aborting the transaction themselves (directly or not, such as the
transaction commit path), other BUG_ON() or just gracefully cancel whatever
they were doing.
When syncing the fsync log, we call btrfs_update_root() through
tree-log.c:update_log_root(), and if it returns an -ENOSPC error, the log
sync code does not abort the transaction, instead it gracefully handles
the error and returns -EAGAIN to the fsync handler, so that it falls back
to a transaction commit. Any other error different from -ENOSPC, makes the
log sync code abort the transaction.
So remove the transaction abort from btrfs_update_log() when we fail to
COW a path to update the root item, so that if an -ENOSPC failure happens
we avoid aborting the current transaction and have a chance of the fsync
succeeding after falling back to a transaction commit.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203413
Fixes: 79787eaab4 ("btrfs: replace many BUG_ONs with proper error handling")
Cc: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're now reserving an extra items worth of space for property
inheritance. We only have one property at the moment so this covers us,
but if we add more in the future this will allow us to not get bitten by
the extra space reservation. If we do add more properties in the future
we should re-visit how we calculate the space reservation needs by the
callers.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ refreshed on top of prop/xattr cleanups ]
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlzR0AAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpo0MD/47D1kBK9rGzkAwIz1Jkh1Qy/ITVaDJzmHJ
UP5uncQsgKFLKMR1LbRcrWtmk2MwFDNULGbteHFeCYE1ypCrTgpWSp5+SJluKd1Q
hma9krLSAXO9QiSaZ4jafshXFIZxz6IjakOW8c9LrT80Ze47yh7AxiLwDafcp/Jj
x6NW790qB7ENDtfarDkZk14NCS8HGLRHO5B21LB+hT0Kfbh0XZaLzJdj7Mck1wPA
VT8hL9mPuA++AjF7Ra4kUjwSakgmajTa3nS2fpkwTYdztQfas7x5Jiv7FWxrrelb
qbabkNkWKepcHAPEiZR7o53TyfCucGeSK/jG+dsJ9KhNp26kl1ci3frl5T6PfVMP
SPPDjsKIHs+dqFrU9y5rSGhLJqewTs96hHthnLGxyF67+5sRb5+YIy+dcqgiyc/b
TUVyjCD6r0cO2q4v9VhwnhOyeBUA9Rwbu8nl7JV5Q45uG7qI4BC39l1jfubMNDPO
GLNGUUzb6ER7z6lYINjRSF2Jhejsx8SR9P7jhpb1Q7k/VvDDxO1T4FpwvqWFz9+s
Gn+s6//+cA6LL+42eZkQjvwF2CUNE7TaVT8zdb+s5HP1RQkZToqUnsQCGeRTrFni
RqWXfW9o9+awYRp431417oMdX/LvLGq9+ZtifRk9DqDcowXevTaf0W2RpplWSuiX
RcCuPeLAVg==
=Ot0g
-----END PGP SIGNATURE-----
Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"Nothing major in this series, just fixes and improvements all over the
map. This contains:
- Series of fixes for sed-opal (David, Jonas)
- Fixes and performance tweaks for BFQ (via Paolo)
- Set of fixes for bcache (via Coly)
- Set of fixes for md (via Song)
- Enabling multi-page for passthrough requests (Ming)
- Queue release fix series (Ming)
- Device notification improvements (Martin)
- Propagate underlying device rotational status in loop (Holger)
- Removal of mtip32xx trim support, which has been disabled for years
(Christoph)
- Improvement and cleanup of nvme command handling (Christoph)
- Add block SPDX tags (Christoph)
- Cleanup/hardening of bio/bvec iteration (Christoph)
- A few NVMe pull requests (Christoph)
- Removal of CONFIG_LBDAF (Christoph)
- Various little fixes here and there"
* tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
block: fix mismerge in bvec_advance
block: don't drain in-progress dispatch in blk_cleanup_queue()
blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
blk-mq: always free hctx after request queue is freed
blk-mq: split blk_mq_alloc_and_init_hctx into two parts
blk-mq: free hw queue's resource in hctx's release handler
blk-mq: move cancel of requeue_work into blk_mq_release
blk-mq: grab .q_usage_counter when queuing request from plug code path
block: fix function name in comment
nvmet: protect discovery change log event list iteration
nvme: mark nvme_core_init and nvme_core_exit static
nvme: move command size checks to the core
nvme-fabrics: check more command sizes
nvme-pci: check more command sizes
nvme-pci: remove an unneeded variable initialization
nvme-pci: unquiesce admin queue on shutdown
nvme-pci: shutdown on timeout during deletion
nvme-pci: fix psdt field for single segment sgls
nvme-multipath: don't print ANA group state by default
nvme-multipath: split bios with the ns_head bio_set before submitting
...
Hi Linus,
This is my very first pull-request. I've been working full-time as
a kernel developer for more than two years now. During this time I've
been fixing bugs reported by Coverity all over the tree and, as part
of my work, I'm also contributing to the KSPP. My work in the kernel
community has been supervised by Greg KH and Kees Cook.
OK. So, after the quick introduction above, please, pull the following
patches that mark switch cases where we are expecting to fall through.
These patches are part of the ongoing efforts to enable -Wimplicit-fallthrough.
They have been ignored for a long time (most of them more than 3 months,
even after pinging multiple times), which is the reason why I've created
this tree. Most of them have been baking in linux-next for a whole development
cycle. And with Stephen Rothwell's help, we've had linux-next nag-emails
going out for newly introduced code that triggers -Wimplicit-fallthrough
to avoid gaining more of these cases while we work to remove the ones
that are already present.
I'm happy to let you know that we are getting close to completing this
work. Currently, there are only 32 of 2311 of these cases left to be
addressed in linux-next. I'm auditing every case; I take a look into
the code and analyze it in order to determine if I'm dealing with an
actual bug or a false positive, as explained here:
https://lore.kernel.org/lkml/c2fad584-1705-a5f2-d63c-824e9b96cf50@embeddedor.com/
While working on this, I've found and fixed the following missing
break/return bugs, some of them introduced more than 5 years ago:
84242b82d87850b51b6c5e420fe63509186e5034b5be8531817264235ee7cc5034a5d2479826cc865340f23df8df997abeeb2f10d82373307b00c5e65d25ff7a54a7ed5b3e7dc24bfa8f21ad0eaee6199ba8376ce1dc586a60a1a8e9b186f14e57562b4860747828eac5b974bee9cc44ba91162c930e3d0a
Once this work is finish, we'll be able to universally enable
"-Wimplicit-fallthrough" to avoid any of these kinds of bugs from
entering the kernel again.
Thanks
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAlzQR2IACgkQRwW0y0cG
2zEJbQ//X930OcBtT/9DRW4XL1Jeq0Mjssz/GLX2Vpup5CwwcTROG65no80Zezf/
yQRWnUjGX0OBv/fmUK32/nTxI/7k7NkmIXJHe0HiEF069GEENB7FT6tfDzIPjU8M
qQkB8NsSUWJs3IH6BVynb/9MGE1VpGBDbYk7CBZRtRJT1RMM+3kQPucgiZMgUBPo
Yd9zKwn4i/8tcOCli++EUdQ29ukMoY2R3qpK4LftdX9sXLKZBWNwQbiCwSkjnvJK
I6FDiA7RaWH2wWGlL7BpN5RrvAXp3z8QN/JZnivIGt4ijtAyxFUL/9KOEgQpBQN2
6TBRhfTQFM73NCyzLgGLNzvd8awem1rKGSBBUvevaPbgesgM+Of65wmmTQRhFNCt
A7+e286X1GiK3aNcjUKrByKWm7x590EWmDzmpmICxNPdt5DHQ6EkmvBdNjnxCMrO
aGA24l78tBN09qN45LR7wtHYuuyR0Jt9bCmeQZmz7+x3ICDHi/+Gw7XPN/eM9+T6
lZbbINiYUyZVxOqwzkYDCsdv9+kUvu3e4rPs20NERWRpV8FEvBIyMjXAg6NAMTue
K+ikkyMBxCvyw+NMimHJwtD7ho4FkLPcoeXb2ZGJTRHixiZAEtF1RaQ7dA05Q/SL
gbSc0DgLZeHlLBT+BSWC2Z8SDnoIhQFXW49OmuACwCUC68NHKps=
=k30z
-----END PGP SIGNATURE-----
Merge tag 'Wimplicit-fallthrough-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull Wimplicit-fallthrough updates from Gustavo A. R. Silva:
"Mark switch cases where we are expecting to fall through.
This is part of the ongoing efforts to enable -Wimplicit-fallthrough.
Most of them have been baking in linux-next for a whole development
cycle. And with Stephen Rothwell's help, we've had linux-next
nag-emails going out for newly introduced code that triggers
-Wimplicit-fallthrough to avoid gaining more of these cases while we
work to remove the ones that are already present.
We are getting close to completing this work. Currently, there are
only 32 of 2311 of these cases left to be addressed in linux-next. I'm
auditing every case; I take a look into the code and analyze it in
order to determine if I'm dealing with an actual bug or a false
positive, as explained here:
https://lore.kernel.org/lkml/c2fad584-1705-a5f2-d63c-824e9b96cf50@embeddedor.com/
While working on this, I've found and fixed the several missing
break/return bugs, some of them introduced more than 5 years ago.
Once this work is finished, we'll be able to universally enable
"-Wimplicit-fallthrough" to avoid any of these kinds of bugs from
entering the kernel again"
* tag 'Wimplicit-fallthrough-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (27 commits)
memstick: mark expected switch fall-throughs
drm/nouveau/nvkm: mark expected switch fall-throughs
NFC: st21nfca: Fix fall-through warnings
NFC: pn533: mark expected switch fall-throughs
block: Mark expected switch fall-throughs
ASN.1: mark expected switch fall-through
lib/cmdline.c: mark expected switch fall-throughs
lib: zstd: Mark expected switch fall-throughs
scsi: sym53c8xx_2: sym_nvram: Mark expected switch fall-through
scsi: sym53c8xx_2: sym_hipd: mark expected switch fall-throughs
scsi: ppa: mark expected switch fall-through
scsi: osst: mark expected switch fall-throughs
scsi: lpfc: lpfc_scsi: Mark expected switch fall-throughs
scsi: lpfc: lpfc_nvme: Mark expected switch fall-through
scsi: lpfc: lpfc_nportdisc: Mark expected switch fall-through
scsi: lpfc: lpfc_hbadisc: Mark expected switch fall-throughs
scsi: lpfc: lpfc_els: Mark expected switch fall-throughs
scsi: lpfc: lpfc_ct: Mark expected switch fall-throughs
scsi: imm: mark expected switch fall-throughs
scsi: csiostor: csio_wr: mark expected switch fall-through
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlzQM7MACgkQxWXV+ddt
WDvrVw/+K0AElSuEfDFWd9HBqRAPlGaEP71xCGGle1tkzuY0DJVIBRZ72q8UR0YP
7yke7DU0oqXekGype83eTJUjDSLoOXrlVoQ+VqBdFteDk0W4BCG6Nw+N+wYBF7An
gXRXlGFaYzb2CqqjG92FbtkfxBzISR0XBCQBUN9CBqHNDu1EUQSbnTBkmTMN8MYh
PCoo37S6e5fR36uB/rOKbGNBJjsZEEg/2G6DprP52+eiQWV2h0avEUJrvv6xC4so
97QNgUNuuiUmyurqcYHdlaflZwIhuf5nQeNeu/UvMZmmRnBHPhSP7YPM7f7FftwA
y0d0p+AiEAO0he8nGFb5C6Avs4vuv1u65o1NbF5fqnmAyt+KXWem3LeG6etsXgU8
+eITgprJD3sNBMDLbLoA+wlhTps+w9tukVF5Zp2a8KgQLMMEyAYqUDWmSHvnO2Me
RCNPZLzeGXETgKun0WuMtl/CX2iBDnc0Kq5O6ks2ORl2TH6bg5lgEIwr6HP/Ewoy
w8twsmCOltrxiIptqyQHYD+kvNwqMVV9LSOQ8+EjbYd6BHsfjHjKObOBkhmJ7iqz
4MAIcZU++F9DLRv92H1kUYVNhAMCdXkEIWyxhZPwN1lUi5k9AhknY3FbheNc7ldl
LNPIgRxamWCq9oBmzfOcJ3eFOBtNN02fgA1GTXGd1/AgAilEep8=
=fEkD
-----END PGP SIGNATURE-----
Merge tag 'for-5.2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This time the majority of changes are cleanups, though there's still a
number of changes of user interest.
User visible changes:
- better read time and write checks to catch errors early and before
writing data to disk (to catch potential memory corruption on data
that get checksummed)
- qgroups + metadata relocation: last speed up patch int the series
to address the slowness, there should be no overhead comparing
balance with and without qgroups
- FIEMAP ioctl does not start a transaction unnecessarily, this can
result in a speed up and less blocking due to IO
- LOGICAL_INO (v1, v2) does not start transaction unnecessarily, this
can speed up the mentioned ioctl and scrub as well
- fsync on files with many (but not too many) hardlinks is faster,
finer decision if the links should be fsynced individually or
completely
- send tries harder to find ranges to clone
- trim/discard will skip unallocated chunks that haven't been touched
since the last mount
Fixes:
- send flushes delayed allocation before start, otherwise it could
miss some changes in case of a very recent rw->ro switch of a
subvolume
- fix fallocate with qgroups that could lead to space accounting
underflow, reported as a warning
- trim/discard ioctl honours the requested range
- starting send and dedupe on a subvolume at the same time will let
only one of them succeed, this is to prevent changes that send
could miss due to dedupe; both operations are restartable
Core changes:
- more tree-checker validations, errors reported by fuzzing tools:
- device item
- inode item
- block group profiles
- tracepoints for extent buffer locking
- async cow preallocates memory to avoid errors happening too deep in
the call chain
- metadata reservations for delalloc reworked to better adapt in
many-writers/low-space scenarios
- improved space flushing logic for intense DIO vs buffered workloads
- lots of cleanups
- removed unused struct members
- redundant argument removal
- properties and xattrs
- extent buffer locking
- selftests
- use common file type conversions
- many-argument functions reduction"
* tag 'for-5.2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (227 commits)
btrfs: Use kvmalloc for allocating compressed path context
btrfs: Factor out common extent locking code in submit_compressed_extents
btrfs: Set io_tree only once in submit_compressed_extents
btrfs: Replace clear_extent_bit with unlock_extent
btrfs: Make compress_file_range take only struct async_chunk
btrfs: Remove fs_info from struct async_chunk
btrfs: Rename async_cow to async_chunk
btrfs: Preallocate chunks in cow_file_range_async
btrfs: reserve delalloc metadata differently
btrfs: track DIO bytes in flight
btrfs: merge calls of btrfs_setxattr and btrfs_setxattr_trans in btrfs_set_prop
btrfs: delete unused function btrfs_set_prop_trans
btrfs: start transaction in xattr_handler_set_prop
btrfs: drop local copy of inode i_mode
btrfs: drop old_fsflags in btrfs_ioctl_setflags
btrfs: modify local copy of btrfs_inode flags
btrfs: drop useless inode i_flags copy and restore
btrfs: start transaction in btrfs_ioctl_setflags()
btrfs: export btrfs_set_prop
btrfs: refactor btrfs_set_props to validate externally
...
Pull vfs inode freeing updates from Al Viro:
"Introduction of separate method for RCU-delayed part of
->destroy_inode() (if any).
Pretty much as posted, except that destroy_inode() stashes
->free_inode into the victim (anon-unioned with ->i_fops) before
scheduling i_callback() and the last two patches (sockfs conversion
and folding struct socket_wq into struct socket) are excluded - that
pair should go through netdev once davem reopens his tree"
* 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (58 commits)
orangefs: make use of ->free_inode()
shmem: make use of ->free_inode()
hugetlb: make use of ->free_inode()
overlayfs: make use of ->free_inode()
jfs: switch to ->free_inode()
fuse: switch to ->free_inode()
ext4: make use of ->free_inode()
ecryptfs: make use of ->free_inode()
ceph: use ->free_inode()
btrfs: use ->free_inode()
afs: switch to use of ->free_inode()
dax: make use of ->free_inode()
ntfs: switch to ->free_inode()
securityfs: switch to ->free_inode()
apparmor: switch to ->free_inode()
rpcpipe: switch to ->free_inode()
bpf: switch to ->free_inode()
mqueue: switch to ->free_inode()
ufs: switch to ->free_inode()
coda: switch to ->free_inode()
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAlzP8nQACgkQUqAMR0iA
lPK79A/+NkRouqA9ihAZhUbgW0DHzOAFvUJSBgX11HQAZbGjngakuoyYFvwUx0T0
m80SUTCysxQrWl+xLdccPZ9ZrhP2KFQrEBEdeYHZ6ymcYcl83+3bOIBS7VwdZAbO
EzB8u/58uU/sI6ABL4lF7ZF/+R+U4CXveEUoVUF04bxdPOxZkRX4PT8u3DzCc+RK
r4yhwQUXGcKrHa2GrRL3GXKsDxcnRdFef/nzq4RFSZsi0bpskzEj34WrvctV6j+k
FH/R3kEcZrtKIMPOCoDMMWq07yNqK/QKj0MJlGoAlwfK4INgcrSXLOx+pAmr6BNq
uMKpkxCFhnkZVKgA/GbKEGzFf+ZGz9+2trSFka9LD2Ig6DIstwXqpAgiUK8JFQYj
lq1mTaJZD3DfF2vnGHGeAfBFG3XETv+mIT/ow6BcZi3NyNSVIaqa5GAR+lMc6xkR
waNkcMDkzLFuP1r0p7ZizXOksk9dFkMP3M6KqJomRtApwbSNmtt+O2jvyLPvB3+w
wRyN9WT7IJZYo4v0rrD5Bl6BjV15ZeCPRSFZRYofX+vhcqJQsFX1M9DeoNqokh55
Cri8f6MxGzBVjE1G70y2/cAFFvKEKJud0NUIMEuIbcy+xNrEAWPF8JhiwpKKnU10
c0u674iqHJ2HeVsYWZF0zqzqQ6E1Idhg/PrXfuVuhAaL5jIOnYY=
=WZfC
-----END PGP SIGNATURE-----
Merge tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk updates from Petr Mladek:
- Allow state reset of printk_once() calls.
- Prevent crashes when dereferencing invalid pointers in vsprintf().
Only the first byte is checked for simplicity.
- Make vsprintf warnings consistent and inlined.
- Treewide conversion of obsolete %pf, %pF to %ps, %pF printf
modifiers.
- Some clean up of vsprintf and test_printf code.
* tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
lib/vsprintf: Make function pointer_string static
vsprintf: Limit the length of inlined error messages
vsprintf: Avoid confusion between invalid address and value
vsprintf: Prevent crash when dereferencing invalid pointers
vsprintf: Consolidate handling of unknown pointer specifiers
vsprintf: Factor out %pO handler as kobject_string()
vsprintf: Factor out %pV handler as va_format()
vsprintf: Factor out %p[iI] handler as ip_addr_string()
vsprintf: Do not check address of well-known strings
vsprintf: Consistent %pK handling for kptr_restrict == 0
vsprintf: Shuffle restricted_pointer()
printk: Tie printk_once / printk_deferred_once into .data.once for reset
treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
lib/test_printf: Switch to bitmap_zalloc()
Pull stack trace updates from Ingo Molnar:
"So Thomas looked at the stacktrace code recently and noticed a few
weirdnesses, and we all know how such stories of crummy kernel code
meeting German engineering perfection end: a 45-patch series to clean
it all up! :-)
Here's the changes in Thomas's words:
'Struct stack_trace is a sinkhole for input and output parameters
which is largely pointless for most usage sites. In fact if embedded
into other data structures it creates indirections and extra storage
overhead for no benefit.
Looking at all usage sites makes it clear that they just require an
interface which is based on a storage array. That array is either on
stack, global or embedded into some other data structure.
Some of the stack depot usage sites are outright wrong, but
fortunately the wrongness just causes more stack being used for
nothing and does not have functional impact.
Another oddity is the inconsistent termination of the stack trace
with ULONG_MAX. It's pointless as the number of entries is what
determines the length of the stored trace. In fact quite some call
sites remove the ULONG_MAX marker afterwards with or without nasty
comments about it. Not all architectures do that and those which do,
do it inconsistenly either conditional on nr_entries == 0 or
unconditionally.
The following series cleans that up by:
1) Removing the ULONG_MAX termination in the architecture code
2) Removing the ULONG_MAX fixups at the call sites
3) Providing plain storage array based interfaces for stacktrace
and stackdepot.
4) Cleaning up the mess at the callsites including some related
cleanups.
5) Removing the struct stack_trace based interfaces
This is not changing the struct stack_trace interfaces at the
architecture level, but it removes the exposure to the generic
code'"
* 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
x86/stacktrace: Use common infrastructure
stacktrace: Provide common infrastructure
lib/stackdepot: Remove obsolete functions
stacktrace: Remove obsolete functions
livepatch: Simplify stack trace retrieval
tracing: Remove the last struct stack_trace usage
tracing: Simplify stack trace retrieval
tracing: Make ftrace_trace_userstack() static and conditional
tracing: Use percpu stack trace buffer more intelligently
tracing: Simplify stacktrace retrieval in histograms
lockdep: Simplify stack trace handling
lockdep: Remove save argument from check_prev_add()
lockdep: Remove unused trace argument from print_circular_bug()
drm: Simplify stacktrace handling
dm persistent data: Simplify stack trace handling
dm bufio: Simplify stack trace retrieval
btrfs: ref-verify: Simplify stack trace retrieval
dma/debug: Simplify stracktrace retrieval
fault-inject: Simplify stacktrace retrieval
mm/page_owner: Simplify stack trace handling
...
If we have an error writing out a delalloc range in
btrfs_punch_hole_lock_range we'll unlock the inode and then goto
out_only_mutex, where we will again unlock the inode. This is bad,
don't do this.
Fixes: f27451f229 ("Btrfs: add support for fallocate's zero range operation")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a file's compression property is set as zlib or zstd but leave
the compression mount option not be set, that means btrfs will try
to compress the file with default compression level. But in
btrfs_compress_pages(), it calls get_workspace() with level = 0.
This will return a workspace with a wrong compression level.
For zlib, the compression level in the workspace will be 0
(that means "store only"). And for zstd, the compression in the
workspace will be 1, not the default level 3.
How to reproduce:
mkfs -t btrfs /dev/sdb
mount /dev/sdb /mnt/
mkdir /mnt/zlib
btrfs property set /mnt/zlib/ compression zlib
dd if=/dev/zero of=/mnt/zlib/compression-friendly-file-10M bs=1M count=10
sync
btrfs-debugfs -f /mnt/zlib/compression-friendly-file-10M
btrfs-debugfs output:
* before:
...
(258 9961472): ram 524288 disk 1106247680 disk_size 524288
file: ... extents 20 disk size 10485760 logical size 10485760 ratio 1.00
* after:
...
(258 10354688): ram 131072 disk 14217216 disk_size 4096
file: ... extents 80 disk size 327680 logical size 10485760 ratio 32.00
The steps for zstd are similar, but need to put a debugging message to
show the level of the return workspace in zstd_get_workspace().
This commit adds a check of the compression level before getting a
workspace by set_level().
CC: stable@vger.kernel.org # 5.1+
Signed-off-by: Johnny Chang <johnnyc@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Recent refactoring of cow_file_range_async means it's now possible to
request a rather large physically contiguous memory via kmalloc. The
size is dependent on the number of 512k chunks that the compressed range
consists of. David reported multiple OOM messages on such large
allocations. Fix it by switching to using kvmalloc.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Irrespective of whether the compress code fell back to uncompressed or
a compressed extent has to be submitted, the extent range is always
locked. So factor out the common lock_extent call at the beginning of
the loop. No functional changes just removes one duplicate lock_extent
call.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The inode never changes so it's sufficient to dereference it and get
the iotree only once, before the execution of the main loop. No
functional changes, only the size of the function is decreased:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-44 (-44)
Function old new delta
submit_compressed_extents 1240 1196 -44
Total: Before=88476, After=88432, chg -0.05%
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All context this function needs is held within struct async_chunk.
Currently we not only pass the struct but also every individual member.
This is redundant, simplify it by only passing struct async_chunk and
leaving it to compress_file_range to extract the values it requires.
No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The associated btrfs_work already contains a reference to the fs_info so
use that instead of passing it via async_chunk. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have an explicit async_chunk struct rename references to
variables of this type to async_chunk. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This commit changes the implementation of cow_file_range_async in order
to get rid of the BUG_ON in the middle of the loop. Additionally it
reworks the inner loop in the hopes of making it more understandable.
The idea is to make async_cow be a top-level structured, shared amongst
all chunks being sent for compression. This allows to perform one memory
allocation at the beginning and gracefully fail the IO if there isn't
enough memory. Now, each chunk is going to be described by an
async_chunk struct. It's the responsibility of the final chunk
to actually free the memory.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the per-inode block reserves we started refilling the reserve based
on the calculated size of the outstanding csum bytes and extents for the
inode, including the amount we were adding with the new operation.
However, generic/224 exposed a problem with this approach. With 1000
files all writing at the same time we ended up with a bunch of bytes
being reserved but unusable.
When you write to a file we reserve space for the csum leaves for those
bytes, the number of extent items required to cover those bytes, and a
single transaction item for updating the inode at ordered extent finish
for that range of bytes. This is held until the ordered extent finishes
and we release all of the reserved space.
If a second write comes in at this point we would add a single
reservation for the new outstanding extent and however many reservations
for the csum leaves. At this point we find the delta of how much we
have reserved and how much outstanding size this is and attempt to
reserve this delta. If the first write finishes it will not release any
space, because the space it had reserved for the initial write is still
needed for the second write. However some space would have been used,
as we have added csums, extent items, and dirtied the inode. Our
reserved space would be > 0 but less than the total needed reserved
space.
This is just for a single inode, now consider generic/224. This has
1000 inodes writing in parallel to a very small file system, 1GiB. In
my testing this usually means we get about a 120MiB metadata area to
work with, more than enough to allow the writes to continue, but not
enough if all of the inodes are stuck trying to reserve the slack space
while continuing to hold their leftovers from their initial writes.
Fix this by pre-reserved _only_ for the space we are currently trying to
add. Then once that is successful modify our inodes csum count and
outstanding extents, and then add the newly reserved space to the inodes
block_rsv. This allows us to actually pass generic/224 without running
out of metadata space.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The 'extent_type' variable does seem to be reliably initialized, but
it's _very_ non-obvious, since there's a "goto next" case that jumps
over the normal initialization. That will then always trigger the
"start >= extent_end" test, which will end up never falling through to
the use of that variable.
But the code is certainly not obvious, and the compiler warning looks
reasonable. Make 'extent_type' an int, and initialize it to an invalid
negative value, which seems to be the common pattern in other places.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We only have two callers that need the integer loop iterator, and they
can easily maintain it themselves.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When diagnosing a slowdown of generic/224 I noticed we were not doing
anything when calling into shrink_delalloc(). This is because all
writes in 224 are O_DIRECT, not delalloc, and thus our delalloc_bytes
counter is 0, which short circuits most of the work inside of
shrink_delalloc(). However O_DIRECT writes still consume metadata
resources and generate ordered extents, which we can still wait on.
Fix this by tracking outstanding DIO write bytes, and use this as well
as the delalloc bytes counter to decide if we need to lookup and wait on
any ordered extents. If we have more DIO writes than delalloc bytes
we'll go ahead and wait on any ordered extents regardless of our flush
state as flushing delalloc is likely to not gain us anything.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ use dio instead of odirect in identifiers ]
Signed-off-by: David Sterba <dsterba@suse.com>
Since now the trans argument is never NULL in btrfs_set_prop we don't
have to check. So delete it and use btrfs_setxattr that makes use of
that.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The last consumer of btrfs_set_prop_trans() was taken away by the patch
("btrfs: start transaction in xattr_handler_set_prop") so now this
function can be deleted.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs specific extended attributes on the inode are set using
btrfs_xattr_handler_set_prop(), and the required transaction for this
update is started by btrfs_setxattr(). For better visibility of the
transaction start and end, do this in btrfs_xattr_handler_set_prop().
For which this patch copied code of btrfs_setxattr() as it is in the
original, which needs proper error handling.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There isn't real use of making struct inode::i_mode a local copy, it
saves a dereference one time, not much. Just use it directly.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_inode_flags_to_fsflags() is copied into @old_fsflags and used only
once. Instead used it directly.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of updating the binode::flags directly, update a local copy, and
then at the point of no error, store copy it to the binode::flags.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The patch ("btrfs: start transaction in btrfs_ioctl_setflags()") used
btrfs_set_prop() instead of btrfs_set_prop_trans() by which now the
inode::i_flags update functions such as
btrfs_sync_inode_flags_to_i_flags() and btrfs_update_inode() is called
in btrfs_ioctl_setflags() instead of
btrfs_set_prop_trans()->btrfs_setxattr() as earlier. So the
inode::i_flags remains unmodified until the thread has checked all the
conditions. So drop the saved inode::i_flags in out_i_flags.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inode attribute can be set through the FS_IOC_SETFLAGS ioctl. This
flags also includes compression attribute for which we would set/reset
the compression extended attribute. While doing this there is a bit of
duplicate code, the following things happens twice:
- start/end_transaction
- inode_inc_iversion()
- current_time update to inode->i_ctime
- and btrfs_update_inode()
These are updated both at btrfs_ioctl_setflags() and btrfs_set_props()
as well. This patch merges these two duplicate codes at
btrfs_ioctl_setflags().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make btrfs_set_prop() a non-static function, so that it can be called
from btrfs_ioctl_setflags(). We need btrfs_set_prop() instead of
btrfs_set_prop_trans() so that we can use the transaction which is
already started in the current thread.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation to merge multiple transactions when setting the
compression flags, split btrfs_set_props() validation part outside of
it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Allowing error injection for btrfs_check_leaf_full() and
btrfs_check_node() is useful to test the failure path of btrfs write
time tree check.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 41bd606769 ("Btrfs: fix fsync of files with multiple hard links
in new directories") introduced a path that makes fsync fallback to a full
transaction commit in order to avoid losing hard links and new ancestors
of the fsynced inode. That path is triggered only when the inode has more
than one hard link and either has a new hard link created in the current
transaction or the inode was evicted and reloaded in the current
transaction.
That path ends up getting triggered very often (hundreds of times) during
the course of pgbench benchmarks, resulting in performance drops of about
20%.
This change restores the performance by not triggering the full transaction
commit in those cases, and instead iterate the fs/subvolume tree in search
of all possible new ancestors, for all hard links, to log them.
Reported-by: Zhao Yuhu <zyuhu@suse.com>
Tested-by: James Wang <jnwang@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Send operates on read only trees and expects them to never change while it
is using them. This is part of its initial design, and this expection is
due to two different reasons:
1) When it was introduced, no operations were allowed to modifiy read-only
subvolumes/snapshots (including defrag for example).
2) It keeps send from having an impact on other filesystem operations.
Namely send does not need to keep locks on the trees nor needs to hold on
to transaction handles and delay transaction commits. This ends up being
a consequence of the former reason.
However the deduplication feature was introduced later (on September 2013,
while send was introduced in July 2012) and it allowed for deduplication
with destination files that belong to read-only trees (subvolumes and
snapshots).
That means that having a send operation (either full or incremental) running
in parallel with a deduplication that has the destination inode in one of
the trees used by the send operation, can result in tree nodes and leaves
getting freed and reused while send is using them. This problem is similar
to the problem solved for the root nodes getting freed and reused when a
snapshot is made against one tree that is currenly being used by a send
operation, fixed in commits [1] and [2]. These commits explain in detail
how the problem happens and the explanation is valid for any node or leaf
that is not the root of a tree as well. This problem was also discussed
and explained recently in a thread [3].
The problem is very easy to reproduce when using send with large trees
(snapshots) and just a few concurrent deduplication operations that target
files in the trees used by send. A stress test case is being sent for
fstests that triggers the issue easily. The most common error to hit is
the send ioctl return -EIO with the following messages in dmesg/syslog:
[1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
[1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
The first one is very easy to hit while the second one happens much less
frequently, except for very large trees (in that test case, snapshots
with 100000 files having large xattrs to get deep and wide trees).
Less frequently, at least one BUG_ON can be hit:
[1631742.130080] ------------[ cut here ]------------
[1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
[1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
[1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1
[1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
(...)
[1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
[1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
[1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
[1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
[1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
[1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
[1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
[1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1631742.141272] Call Trace:
[1631742.141826] ? do_raw_spin_unlock+0x49/0xc0
[1631742.142390] tree_advance+0x173/0x1d0 [btrfs]
[1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs]
[1631742.143533] ? process_extent+0x1070/0x1070 [btrfs]
[1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
[1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0
[1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs]
[1631742.146179] ? account_entity_enqueue+0xd3/0x100
[1631742.146662] ? reweight_entity+0x154/0x1a0
[1631742.147135] ? update_curr+0x20/0x2a0
[1631742.147593] ? check_preempt_wakeup+0x103/0x250
[1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0
[1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[1631742.148942] do_vfs_ioctl+0xa2/0x6f0
[1631742.149361] ? __fget+0x113/0x200
[1631742.149767] ksys_ioctl+0x70/0x80
[1631742.150159] __x64_sys_ioctl+0x16/0x20
[1631742.150543] do_syscall_64+0x60/0x1b0
[1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[1631742.151326] RIP: 0033:0x7f4cd9f5add7
(...)
[1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
[1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
[1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
[1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
[1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
(...)
[1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
That BUG_ON happens because while send is using a node, that node is COWed
by a concurrent deduplication, gets freed and gets reused as a leaf (because
a transaction commit happened in between), so when it attempts to read a
slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
contents were wiped out and it now matches a leaf (which can even belong to
some other tree now), hitting the BUG_ON(level == 0).
Fix this concurrency issue by not allowing send and deduplication to run
in parallel if both operate on the same readonly trees, returning EAGAIN
to user space and logging an exlicit warning in dmesg/syslog.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
[3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we set a subvolume to read-only mode we do not flush dellaloc for any
of its inodes (except if the filesystem is mounted with -o flushoncommit),
since it does not affect correctness for any subsequent operations - except
for a future send operation. The send operation will not be able to see the
delalloc data since the respective file extent items, inode item updates,
backreferences, etc, have not hit yet the subvolume and extent trees.
Effectively this means data loss, since the send stream will not contain
any data from existing delalloc. Another problem from this is that if the
writeback starts and finishes while the send operation is in progress, we
have the subvolume tree being being modified concurrently which can result
in send failing unexpectedly with EIO or hitting runtime errors, assertion
failures or hitting BUG_ONs, etc.
Simple reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ btrfs subvolume create /mnt/sv
$ xfs_io -f -c "pwrite -S 0xea 0 108K" /mnt/sv/foo
$ btrfs property set /mnt/sv ro true
$ btrfs send -f /tmp/send.stream /mnt/sv
$ od -t x1 -A d /mnt/sv/foo
0000000 ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea
*
0110592
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive -f /tmp/send.stream /mnt
$ echo $?
0
$ od -t x1 -A d /mnt/sv/foo
0000000
# ---> empty file
Since this a problem that affects send only, fix it in send by flushing
dellaloc for all the roots used by the send operation before send starts
to process the commit roots.
This is a problem that affects send since it was introduced (commit
31db9f7c23 ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive"))
but backporting it to older kernels has some dependencies:
- For kernels between 3.19 and 4.20, it depends on commit 3cd24c6980
("btrfs: use tagged writepage to mitigate livelock of snapshot") because
the function btrfs_start_delalloc_snapshot() does not exist before that
commit. So one has to either pick that commit or replace the calls to
btrfs_start_delalloc_snapshot() in this patch with calls to
btrfs_start_delalloc_inodes().
- For kernels older than 3.19 it also requires commit e5fa8f865b
("Btrfs: ensure send always works on roots without orphans") because
it depends on the function ensure_commit_roots_uptodate() which that
commits introduced.
- No dependencies for 5.0+ kernels.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 3.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During fiemap, for regular extents (non inline) we need to check if they
are shared and if they are, set the shared bit. Checking if an extent is
shared requires checking the delayed references of the currently running
transaction, since some reference might have not yet hit the extent tree
and be only in the in-memory delayed references.
However we were using a transaction join for this, which creates a new
transaction when there is no transaction currently running. That means
that two more potential failures can happen: creating the transaction and
committing it. Further, if no write activity is currently happening in the
system, and fiemap calls keep being done, we end up creating and
committing transactions that do nothing.
In some extreme cases this can result in the commit of the transaction
created by fiemap to fail with ENOSPC when updating the root item of a
subvolume tree because a join does not reserve any space, leading to a
trace like the following:
heisenberg kernel: ------------[ cut here ]------------
heisenberg kernel: BTRFS: Transaction aborted (error -28)
heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs]
(...)
heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2
heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018
heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs]
(...)
heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286
heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006
heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0
heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007
heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078
heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068
heisenberg kernel: FS: 0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000
heisenberg kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0
heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
heisenberg kernel: Call Trace:
heisenberg kernel: commit_fs_roots+0x166/0x1d0 [btrfs]
heisenberg kernel: ? _cond_resched+0x15/0x30
heisenberg kernel: ? btrfs_run_delayed_refs+0xac/0x180 [btrfs]
heisenberg kernel: btrfs_commit_transaction+0x2bd/0x870 [btrfs]
heisenberg kernel: ? start_transaction+0x9d/0x3f0 [btrfs]
heisenberg kernel: transaction_kthread+0x147/0x180 [btrfs]
heisenberg kernel: ? btrfs_cleanup_transaction+0x530/0x530 [btrfs]
heisenberg kernel: kthread+0x112/0x130
heisenberg kernel: ? kthread_bind+0x30/0x30
heisenberg kernel: ret_from_fork+0x35/0x40
heisenberg kernel: ---[ end trace 05de912e30e012d9 ]---
Since fiemap (and btrfs_check_shared()) is a read-only operation, do not do
a transaction join to avoid the overhead of creating a new transaction (if
there is currently no running transaction) and introducing a potential
point of failure when the new transaction gets committed, instead use a
transaction attach to grab a handle for the currently running transaction
if any.
Reported-by: Christoph Anton Mitterer <calestyo@scientia.net>
Link: https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/
Fixes: afce772e87 ("btrfs: fix check_shared for fiemap ioctl")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since reloc tree doesn't contribute to qgroup numbers, just skip them.
This should catch the final cause of unnecessary data ref processing
when running balance of metadata with qgroups on.
The 4G data 16 snapshots test (*) should explain it pretty well:
| delayed subtree | refactor delayed ref | this patch
---------------------------------------------------------------------
relocated | 22653 | 22673 | 22744
qgroup dirty | 122792 | 48360 | 70
time | 24.494 | 11.606 | 3.944
Finally, we're at the stage where qgroup + metadata balance cost no
obvious overhead.
Test environment:
Test VM:
- vRAM 8G
- vCPU 8
- block dev vitrio-blk, 'unsafe' cache mode
- host block 850evo
Test workload:
- Copy 4G data from /usr/ to one subvolume
- Create 16 snapshots of that subvolume, and modify 3 files in each
snapshot
- Enable quota, rescan
- Time "btrfs balance start -m"
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to btrfs_inc_extent_ref(), use btrfs_ref to replace the long
parameter list and the confusing @owner parameter.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the new btrfs_ref structure and replace parameter list to clean up
the usage of owner and level to distinguish the extent types.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since add_pinned_bytes() only needs to know if the extent is metadata
and if it's a chunk tree extent, btrfs_ref is a perfect match for it, as
we don't need various owner/level trick to determine extent type.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's a perfect match for btrfs_ref_tree_mod() to use btrfs_ref, as
btrfs_ref describes a metadata/data reference update comprehensively.
Now we have one less function use confusing owner/level trick.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just like btrfs_add_delayed_tree_ref(), use btrfs_ref to refactor
btrfs_add_delayed_data_ref().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_add_delayed_tree_ref() has a longer and longer parameter list, and
some callers like btrfs_inc_extent_ref() are using @owner as level for
delayed tree ref.
Instead of making the parameter list longer, use btrfs_ref to refactor
it, so each parameter assignment should be self-explaining without dirty
level/owner trick, and provides the basis for later refactoring.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The process_func function pointer is local to __btrfs_mod_ref() and
points to either btrfs_inc_extent_ref() or btrfs_free_extent().
Open code it to make later delayed ref refactor easier, so we can
refactor btrfs_inc_extent_ref() and btrfs_free_extent() in different
patches.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current delayed ref interface has several problems:
- Longer and longer parameter lists
bytenr
num_bytes
parent
---------- so far so good
ref_root
owner
offset
---------- I don't feel good now
- Different interpretation of the same parameter
Above @owner for data ref is inode number (u64),
while for tree ref, it's level (int).
They are even in different size range.
For level we only need 0 ~ 8, while for ino it's
BTRFS_FIRST_FREE_OBJECTID ~ BTRFS_LAST_FREE_OBJECTID.
And @offset doesn't even make sense for tree ref.
Such parameter reuse may look clever as an hidden union, but it
destroys code readability.
To solve both problems, we introduce a new structure, btrfs_ref to solve
them:
- Structure instead of long parameter list
This makes later expansion easier, and is better documented.
- Use btrfs_ref::type to distinguish data and tree ref
- Use proper union to store data/tree ref specific structures.
- Use separate functions to fill data/tree ref data, with a common generic
function to fill common bytenr/num_bytes members.
All parameters will find its place in btrfs_ref, and an extra member,
@real_root, inspired by ref-verify code, is newly introduced for later
qgroup code, to record which tree is triggered by this extent modification.
This patch doesn't touch any code, but provides the basis for further
refactoring.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When finding out which inodes have references on a particular extent, done
by backref.c:iterate_extent_inodes(), from the BTRFS_IOC_LOGICAL_INO (both
v1 and v2) ioctl and from scrub we use the transaction join API to grab a
reference on the currently running transaction, since in order to give
accurate results we need to inspect the delayed references of the currently
running transaction.
However, if there is currently no running transaction, the join operation
will create a new transaction. This is inefficient as the transaction will
eventually be committed, doing unnecessary IO and introducing a potential
point of failure that will lead to a transaction abort due to -ENOSPC, as
recently reported [1].
That's because the join, creates the transaction but does not reserve any
space, so when attempting to update the root item of the root passed to
btrfs_join_transaction(), during the transaction commit, we can end up
failling with -ENOSPC. Users of a join operation are supposed to actually
do some filesystem changes and reserve space by some means, which is not
the case of iterate_extent_inodes(), it is a read-only operation for all
contextes from which it is called.
The reported [1] -ENOSPC failure stack trace is the following:
heisenberg kernel: ------------[ cut here ]------------
heisenberg kernel: BTRFS: Transaction aborted (error -28)
heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs]
(...)
heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2
heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018
heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs]
(...)
heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286
heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006
heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0
heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007
heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078
heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068
heisenberg kernel: FS: 0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000
heisenberg kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0
heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
heisenberg kernel: Call Trace:
heisenberg kernel: commit_fs_roots+0x166/0x1d0 [btrfs]
heisenberg kernel: ? _cond_resched+0x15/0x30
heisenberg kernel: ? btrfs_run_delayed_refs+0xac/0x180 [btrfs]
heisenberg kernel: btrfs_commit_transaction+0x2bd/0x870 [btrfs]
heisenberg kernel: ? start_transaction+0x9d/0x3f0 [btrfs]
heisenberg kernel: transaction_kthread+0x147/0x180 [btrfs]
heisenberg kernel: ? btrfs_cleanup_transaction+0x530/0x530 [btrfs]
heisenberg kernel: kthread+0x112/0x130
heisenberg kernel: ? kthread_bind+0x30/0x30
heisenberg kernel: ret_from_fork+0x35/0x40
heisenberg kernel: ---[ end trace 05de912e30e012d9 ]---
So fix that by using the attach API, which does not create a transaction
when there is currently no running transaction.
[1] https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
None of the implementers of the submit_bio_hook use the bio_offset
parameter, simply remove it. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btree submit hook queues the async csum and forwards the bio_offset
parameter passed to btree_submit_bio_hook. This is redundant since
btree_submit_bio_start calls btree_csum_one_bio which doesn't use the
offset at all. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Buffered writeback always calls btrfs_csum_one_bio with the last 2
arguments being 0 irrespective of what the bio_offset has been passed to
btrfs_submit_bio_start. Make this apparent by explicitly passing 0 for
bio_offset when calling btrfs_wq_submit_bio from btrfs_submit_bio_hook.
This will allow for further simplifications down the line. No functional
changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always uses the btree inode's io_tree. Stop taking the
tree as a function argument and instead access it internally from
read_extent_buffer_pages. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only possible 'private_data' that is passed to this function is
actually an inode. Make that explicit by changing the signature of the
call back. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to use a typedef to define the type of the function and
then use that to define the respective member in extent_io_ops. Define
struct's member directly. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can read fs_info from the block group cache structure and can drop it
from the parameters. Though the transaction is also availabe, it's not
guaranteed to be non-NULL.
Signed-off-by: David Sterba <dsterba@suse.com>
It used to be called from only two places (truncate path and releasing a
transaction handle), but commits 28bad21257 ("btrfs: fix truncate
throttling") and db2462a6ad ("btrfs: don't run delayed refs in the end
transaction logic") removed their calls to this function, so it's not used
anymore. Just remove it and all its helpers.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previous patch made sure that btrfs_setxattr_trans() is called only when
transaction NULL. Clean up btrfs_setxattr_trans() and drop the
parameter.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the caller has already created the transaction handle,
btrfs_setxattr() will use it. Also adds assert in btrfs_setxattr().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_setxattr_trans() is called by 5 functions as below and all of them
do updates. None of them would be roun on a read-only root.
So its ok to remove the readonly root check here as it's a high-level
conditon.
1.
__btrfs_set_acl()
btrfs_init_acl()
btrfs_init_inode_security()
2.
__btrfs_set_acl()
btrfs_set_acl()
3.
btrfs_set_prop()
btrfs_set_prop_trans()
/ \
btrfs_ioctl_setflags() btrfs_xattr_handler_set_prop()
4.
btrfs_xattr_handler_set()
5.
btrfs_initxattrs()
btrfs_xattr_security_init()
btrfs_init_inode_security()
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory patch, as we are going split the calls with and without
transaction to use the respective btrfs_setxattr() and
btrfs_setxattr_trans() functions. Export btrfs_setxattr() for calls
outside of xattr.c.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When trans is not NULL btrfs_setxattr() calls do_setxattr() directly
with a check for readonly root. Rename do_setxattr() btrfs_setxattr() in
preparation to call do_setxattr() directly instead. Preparatory patch,
no functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename btrfs_setxattr() to btrfs_setxattr_trans(), so that do_setxattr()
can be renamed to btrfs_setxattr().
Preparatory patch, no functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unlike btrfs_tree_lock() and btrfs_tree_read_lock(), the remaining
functions in locking.c will not sleep, thus doesn't make much sense to
record their execution time.
Those events are introduced mainly for user space tool to audit and
detect lock leakage or dead lock.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two tree lock events which can sleep:
- btrfs_tree_read_lock()
- btrfs_tree_lock()
Sometimes we may need to look into the concurrency picture of the fs.
For that case, we need the execution time of above two functions and the
owner of @eb.
Here we introduce a trace events for user space tools like bcc, to get
the execution time of above two functions, and get detailed owner info
where eBPF code can't.
All the overhead is hidden behind the trace events, so if events are not
enabled, there is no overhead.
These trace events also output bytenr and generation, allow them to be
pared with unlock events to pin down deadlock.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The member num_dirty_bgs of struct btrfs_transaction is not used anymore,
it is set and incremented but nothing reads its value anymore. Its last
read use was removed by commit 64403612b7 ("btrfs: rework
btrfs_check_space_for_delayed_refs"). So just remove that member.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Ordered csums are keyed off of a btrfs_ordered_extent, which already has
a reference to the inode. This implies that an explicit inode argument
is redundant. So remove it.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are at least 2 reports about a memory bit flip sneaking into
on-disk data.
Currently we only have a relaxed check triggered at
btrfs_mark_buffer_dirty() time, as it's not mandatory and only for
CONFIG_BTRFS_FS_CHECK_INTEGRITY enabled build, it doesn't help users to
detect such problem.
This patch will address the hole by triggering comprehensive check on
tree blocks before writing it back to disk.
The design points are:
- Timing of the check: Tree block write hook
This timing is chosen to reduce the overhead.
The comprehensive check should be as expensive as a checksum
calculation.
Doing full check at btrfs_mark_buffer_dirty() is too expensive for end
user.
- Loose empty leaf check
Originally for an empty leaf, tree-checker will report error if it's
not a tree root.
The problem for such check at write time is:
* False alert for tree root created in current transaction
In that case, the commit root still needs to be written to disk.
And since current root can differ from commit root, then it will
cause false alert.
This happens for log tree.
* False alert for relocated tree block
Relocated tree block can be written to disk due to memory pressure,
in that case an empty csum tree root can be written to disk and
cause false alert, since csum root node hasn't been updated.
Previous patch of removing comprehensive empty leaf owner check has
paved the way for this patch.
The example error output will be something like:
BTRFS critical (device dm-3): corrupt leaf: root=2 block=1350630375424 slot=68, bad key order, prev (10510212874240 169 0) current (1714119868416 169 0)
BTRFS error (device dm-3): block=1350630375424 write time tree block corruption detected
BTRFS: error (device dm-3) in btrfs_commit_transaction:2220: errno=-5 IO failure (Error while writing out transaction)
BTRFS info (device dm-3): forced readonly
BTRFS warning (device dm-3): Skipping commit of aborted transaction.
BTRFS: error (device dm-3) in cleanup_transaction:1839: errno=-5 IO failure
BTRFS info (device dm-3): delayed_refs has NO entry
Reported-by: Leonard Lausen <leonard@lausen.nl>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 1ba98d086f ("Btrfs: detect corruption when non-root leaf has
zero item") introduced comprehensive root owner checker.
However it's pretty expensive tree search to locate the owner root,
especially when it get reused by mandatory read and write time
tree-checker.
This patch will remove that check, and completely rely on owner based
empty leaf check, which is much faster and still works fine for most
case.
And since we skip the old root owner check, now write time tree check
can be merged with btrfs_check_leaf_full().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing fallocate, we first add the range to the reserve_list and
then reserve the quota. If quota reservation fails, we'll release all
reserved parts of reserve_list.
However, cur_offset is not updated to indicate that this range is
already been inserted into the list. Therefore, the same range is freed
twice. Once at list_for_each_entry loop, and once at the end of the
function. This will result in WARN_ON on bytes_may_use when we free the
remaining space.
At the end, under the 'out' label we have a call to:
btrfs_free_reserved_data_space(inode, data_reserved, alloc_start, alloc_end - cur_offset);
The start offset, third argument, should be cur_offset.
Everything from alloc_start to cur_offset was freed by the
list_for_each_entry_safe_loop.
Fixes: 18513091af ("btrfs: update btrfs_space_info's bytes_may_use timely")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of always calling the allocator to search for a free extent,
that satisfies the input criteria, switch btrfs_trim_free_extents to
using find_first_clear_extent_bit. With this change it's no longer
necessary to read the device tree in order to figure out holes in
the devices.
Now the code always searches in-memory data structure to figure out the
space range which contains the requested which should result in speed
improvements.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is very similar to find_first_extent_bit except that it
locates the first contiguous span of space which does not have bits set.
It's intended use is in the freespace trimming code.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently unallocated chunks are always trimmed. For example
2 consecutive trims on large storage would trim freespace twice
irrespective of whether the space was actually allocated or not between
those trims.
Optimise this behavior by exploiting the newly introduced alloc_state
tree of btrfs_device. A new CHUNK_TRIMMED bit is used to mark
those unallocated chunks which have been trimmed and have not been
allocated afterwards. On chunk allocation the respective underlying devices'
physical space will have its CHUNK_TRIMMED flag cleared. This avoids
submitting discards for space which hasn't been changed since the last
time discard was issued.
This applies to the single mount period of the filesystem as the
information is not stored permanently.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is used in more than one places so let's factor it out in ctree.h.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that these functions no longer require a handle to transaction to
inspect pending/pinned chunks the argument can be removed. At the same
time also remove any surrounding code which acquired the handle.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The pending chunks list contains chunks that are allocated in the
current transaction but haven't been created yet. The pinned chunks
list contains chunks that are being released in the current transaction.
Both describe chunks that are not reflected on disk as in use but are
unavailable just the same.
The pending chunks list is anchored by the transaction handle, which
means that we need to hold a reference to a transaction when working
with the list.
The way we use them is by iterating over both lists to perform
comparisons on the stripes they describe for each device. This is
backwards and requires that we keep a transaction handle open while
we're trimming.
This patchset adds an extent_io_tree to btrfs_device that maintains
the allocation state of the device. Extents are set dirty when
chunks are first allocated -- when the extent maps are added to the
mapping tree. They're cleared when last removed -- when the extent
maps are removed from the mapping tree. This matches the lifespan
of the pending and pinned chunks list and allows us to do trims
on unallocated space safely without pinning the transaction for what
may be a lengthy operation. We can also use this io tree to mark
which chunks have already been trimmed so we don't repeat the operation.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Following the introduction of the alloc_state tree, some of the callees
of btrfs_mapping_tree_free will have to interact with the btrfs_device
of the constituent devices. Enable this by moving the code responsible
for freeing devices after the last user (btrfs_mapping_tree_free).
Otherwise the kernel could crash due to use-after-free.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_device structs are freed from RCU context since device iteration
is protected by RCU. Currently this is achieved by using call_rcu since
no blocking functions are called within btrfs_free_device. Future
refactoring of pending/pinned chunks will require calling sleeping
functions.
This patch is in preparation for these changes by simply switching from
RCU callbacks to explicit calls of synchronize_rcu and calling
btrfs_free_device directly. This is functionally equivalent, making sure
that there are no readers at that time.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It will be used in a future patch that will require modifying an
extent_io_tree struct under a spinlock.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rather than hijacking the existing defines let's just define new bits,
with more descriptive names. Instead of using yet more (currently at 18)
bits for the new flags, use the fact those flags will be specific to
the device allocation tree so define them using existing EXTENT_* flags.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Chunks read from disk currently don't get their ->orig_block_len member
set, in contrast when a new chunk is allocated, the respective
extent_map's ->orig_block_len is assigned the size of the stripe of this
chunk.
Let's apply the same strategy for chunks which are read from
disk, not only does this codify the invariant that ->orig_block_len
always contains the size of the stripe for a chunk (when the em belongs
to the mapping tree). But it's also a preparatory patch for further work
around tracking chunk allocation in an extent tree rather than
pinned/pending lists.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is going to be used to clear out the device extent
allocation information. Give it a more generic name and export it. This
is in preparation to replacing the pending/pinned chunk lists with an
extent tree. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During device shrink pinned/pending chunks (i.e. those which have been
deleted/created respectively, in the current transaction and haven't
touched disk) need to be accounted when doing device shrink. Presently
this happens after the main relocation loop in btrfs_shrink_device,
which could lead to making another go in the body of the function.
Since there is no hard requirement to perform pinned/pending chunks
handling after the relocation loop, move the code before it. This leads
to simplifying the code flow around - i.e. no need to use 'goto again'.
A notable side effect of this change is that modification of the
device's size requires a transaction to be started and committed before
the relocation loop starts. This is necessary to ensure that relocation
process sees the shrunk device size.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently overload the pending_chunks list to handle updating
btrfs_device->commit_bytes used. We don't actually care about the
extent mapping or even the device mapping for the chunk - we just need
the device, and we can end up processing it multiple times. The
fs_devices->resized_list does more or less the same thing, but with the
disk size. They are called consecutively during commit and have more or
less the same purpose.
We can combine the two lists into a single list that attaches to the
transaction and contains a list of devices that need updating. Since we
always add the device to a list when we change bytes_used or
disk_total_size, there's no harm in copying both values at once.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Up until now trimming the freespace was done irrespective of what the
arguments of the FITRIM ioctl were. For example fstrim's -o/-l arguments
will be entirely ignored. Fix it by correctly handling those paramter.
This requires breaking if the found freespace extent is after the end of
the passed range as well as completing trim after trimming
fstrim_range::len bytes.
Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Improve clone_range in two scenarios.
1. Remove the limit of inode size when find clone inodes We can do
partial clone, so there is no need to limit the size of the candidate
inode. When clone a range, we clone the legal range only by bytenr,
offset, len, inode size.
2. In the scenarios of rewrite or clone_range, data_offset rarely
matches exactly, so the chance of a clone is missed.
e.g.
1. Write a 1M file
dd if=/dev/zero of=1M bs=1M count=1
2. Clone 1M file
cp --reflink 1M clone
3. Rewrite 4k on the clone file
dd if=/dev/zero of=clone bs=4k count=1 conv=notrunc
The disk layout is as follows:
item 16 key (257 EXTENT_DATA 0) itemoff 15353 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 0 nr 1048576 ram 1048576
extent compression(none)
...
item 22 key (258 EXTENT_DATA 0) itemoff 14959 itemsize 53
extent data disk byte 1104150528 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression(none)
item 23 key (258 EXTENT_DATA 4096) itemoff 14906 itemsize 53
extent data disk byte 1103101952 nr 1048576
extent data offset 4096 nr 1044480 ram 1048576
extent compression(none)
When send, inode 258 file offset 4096~1048576 (item 23) has a chance to
clone_range, but because data_offset does not match inode 257 (item 16),
it causes missed clone and can only transfer actual data.
Improve the problem by judging whether the current data_offset has
overlap with the file extent item, and if so, adjusting offset and
extent_len so that we can clone correctly.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When an inode inherits property from its parent, we call btrfs_set_prop().
btrfs_set_prop() does an elaborate checks, which is not required in the
context of inheriting a property. Instead just open-code only the required
items from btrfs_set_prop() and then call btrfs_setxattr() directly. So
now the only user of btrfs_set_prop() is gone, (except for the wraper
function btrfs_set_prop_trans()).
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
@device_name in mount_subvol() is not used, drop it. Also see:
5bedc48a8f ("btrfs: drop unused parameters from mount_subvol").
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When accessing a file on a crafted image, btrfs can crash in block layer:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 136501067 P4D 136501067 PUD 124519067 PMD 0
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.0.0-rc8-default #252
RIP: 0010:end_bio_extent_readpage+0x144/0x700
Call Trace:
<IRQ>
blk_update_request+0x8f/0x350
blk_mq_end_request+0x1a/0x120
blk_done_softirq+0x99/0xc0
__do_softirq+0xc7/0x467
irq_exit+0xd1/0xe0
call_function_single_interrupt+0xf/0x20
</IRQ>
RIP: 0010:default_idle+0x1e/0x170
[CAUSE]
The crafted image has a tricky corruption, the INODE_ITEM has a
different type against its parent dir:
item 20 key (268 INODE_ITEM 0) itemoff 2808 itemsize 160
generation 13 transid 13 size 1048576 nbytes 1048576
block group 0 mode 121644 links 1 uid 0 gid 0 rdev 0
sequence 9 flags 0x0(none)
This mode number 0120000 means it's a symlink.
But the dir item think it's still a regular file:
item 8 key (264 DIR_INDEX 5) itemoff 3707 itemsize 32
location key (268 INODE_ITEM 0) type FILE
transid 13 data_len 0 name_len 2
name: f4
item 40 key (264 DIR_ITEM 51821248) itemoff 1573 itemsize 32
location key (268 INODE_ITEM 0) type FILE
transid 13 data_len 0 name_len 2
name: f4
For symlink, we don't set BTRFS_I(inode)->io_tree.ops and leave it
empty, as symlink is only designed to have inlined extent, all handled
by tree block read. Thus no need to trigger btrfs_submit_bio_hook() for
inline file extent.
However end_bio_extent_readpage() expects tree->ops populated, as it's
reading regular data extent. This causes NULL pointer dereference.
[FIX]
This patch fixes the problem in two ways:
- Verify inode mode against its dir item when looking up inode
So in btrfs_lookup_dentry() if we find inode mode mismatch with dir
item, we error out so that corrupted inode will not be accessed.
- Verify inode mode when getting extent mapping
Only regular file should have regular or preallocated extent.
If we found regular/preallocated file extent for symlink or
the rest, we error out before submitting the read bio.
With this fix that crafted image can be rejected gracefully:
BTRFS critical (device loop0): inode mode mismatch with dir: inode mode=0121644 btrfs type=7 dir type=1
Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202763
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a report in kernel bugzilla about mismatch file type in dir
item and inode item.
This inspires us to check inode mode in inode item.
This patch will check the following members:
- inode key objectid
Should be ROOT_DIR_DIR or [256, (u64)-256] or FREE_INO.
- inode key offset
Should be 0
- inode item generation
- inode item transid
No newer than sb generation + 1.
The +1 is for log tree.
- inode item mode
No unknown bits.
No invalid S_IF* bit.
NOTE: S_IFMT check is not enough, need to check every know type.
- inode item nlink
Dir should have no more link than 1.
- inode item flags
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs-progs already have a comprehensive type checker, to ensure there
is only 0 (SINGLE profile) or 1 (DUP/RAID0/1/5/6/10) bit set for chunk
profile bits.
Do the same work for kernel.
Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202765
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
For fuzzed image whose DEV_ITEM has invalid total_bytes as 0, then
kernel will just panic:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
#PF error: [normal kernel read fault]
PGD 800000022b2bd067 P4D 800000022b2bd067 PUD 22b2bc067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 1106 Comm: mount Not tainted 5.0.0-rc8+ #9
RIP: 0010:btrfs_verify_dev_extents+0x2a5/0x5a0
Call Trace:
open_ctree+0x160d/0x2149
btrfs_mount_root+0x5b2/0x680
[CAUSE]
If device extent verification finds a deivce with 0 total_bytes, then it
assumes it's a seed dummy, then search for seed devices.
But in this case, there is no seed device at all, causing NULL pointer.
[FIX]
Since this is caused by fuzzed image, let's go the tree-check way, just
add a new verification for device item.
Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202691
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we have btrfs_check_chunk_valid() in tree-checker, let's do
chunk item verification in tree-checker too.
Since the tree-checker is run at endio time, if one chunk leaf fails
chunk verification, we can still retry the other copy, making btrfs more
robust to fuzzed image as we may still get a good chunk item.
Also since we have done chunk verification in tree block read time, skip
the btrfs_check_chunk_valid() call in read_one_chunk() if we're reading
chunk items from leaf.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To follow the standard behavior of tree-checker.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Old error message would be something like:
BTRFS error (device dm-3): invalid chunk num_stipres: 0
New error message would be:
Btrfs critical (device dm-3): corrupt superblock syschunk array: chunk_start=2097152, invalid chunk num_stripes: 0
Or
Btrfs critical (device dm-3): corrupt leaf: root=3 block=8388608 slot=3 chunk_start=2097152, invalid chunk num_stripes: 0
And for certain error message, also output expected value.
The error message levels are changed from error to critical.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
By function, chunk item verification is more suitable to be done inside
tree-checker.
So move btrfs_check_chunk_valid() to tree-checker.c and export it.
And since it's now moved to tree-checker, also add a better comment for
what this function is doing.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The commit fcebe4562d ("Btrfs: rework qgroup accounting") reworked
qgroups and added some new structures. Another rework of qgroup
mechanics e69bcee376 ("btrfs: qgroup: Cleanup the old
ref_node-oriented mechanism.") stopped using them and left uncleaned.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can read fs_info from extent buffer and can drop it from the
parameters. As all callsites are updated, add the btrfs_ prefix as the
function is exported.
Signed-off-by: David Sterba <dsterba@suse.com>
Deduplicate the btrfs file type conversion implementation - file systems
that use the same file types as defined by POSIX do not need to define
their own versions and can use the common helper functions decared in
fs_types.h and implemented in fs_types.c
Common implementation can be found via commit:
bbe7449e25 "fs: common implementation of file type"
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move code to make it more readable, so as locking and unlocking is
done in the same function. The generic checks that are now performed in
the locked section are unaffected.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BUG_ON(1) leads to bogus warnings from clang when
CONFIG_PROFILE_ANNOTATED_BRANCHES is set:
fs/btrfs/volumes.c:5041:3: error: variable 'max_chunk_size' is used uninitialized whenever 'if' condition is false
[-Werror,-Wsometimes-uninitialized]
BUG_ON(1);
^~~~~~~~~
include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON'
#define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
^~~~~~~~~~~~~~~~~~~
include/linux/compiler.h:48:23: note: expanded from macro 'unlikely'
# define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/volumes.c:5046:9: note: uninitialized use occurs here
max_chunk_size);
^~~~~~~~~~~~~~
include/linux/kernel.h:860:36: note: expanded from macro 'min'
#define min(x, y) __careful_cmp(x, y, <)
^
include/linux/kernel.h:853:17: note: expanded from macro '__careful_cmp'
__cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op))
^
include/linux/kernel.h:847:25: note: expanded from macro '__cmp_once'
typeof(y) unique_y = (y); \
^
fs/btrfs/volumes.c:5041:3: note: remove the 'if' if its condition is always true
BUG_ON(1);
^
include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON'
#define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
^
fs/btrfs/volumes.c:4993:20: note: initialize the variable 'max_chunk_size' to silence this warning
u64 max_chunk_size;
^
= 0
Change it to BUG() so clang can see that this code path can never
continue.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The wrapper names better describe what's happening so they're not
deleted though they're trivial, but at least moved closer to their place
of use.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Long time ago (2008), the extent buffers were organized in a LRU list
and switched to rb-tree in 6af118ce51 ("Btrfs: Index extent
buffers in an rbtree"). There was one stale macro definition left.
Signed-off-by: David Sterba <dsterba@suse.com>
The messages like 'extent I/O tests finished' are redundant, if the test
fails it's quite obvious in the log and hang is also noticeable. No
other then extent_io and free space tree tests print that so make it
consistent.
Signed-off-by: David Sterba <dsterba@suse.com>
Comments about ranges did not match the code, the correct calculation is
to use start and start+len as the interval boundaries.
Signed-off-by: David Sterba <dsterba@suse.com>
The way the extent map tests handle errors does not conform to the rest
of the suite, where the first failure is reported and then it stops.
Do the same now that we have the errors returned from all the functions.
Signed-off-by: David Sterba <dsterba@suse.com>
The individual testcases for extent maps do not return an error on
allocation failures. This is not a big problem as the allocation don't
fail in general but there are functional tests handled with ASSERTS.
This makes tests dependent on them and it's not reliable.
This patch adds the allocation failure handling and allows for the
conversion of the asserts to proper error handling and reporting.
Signed-off-by: David Sterba <dsterba@suse.com>
Allocation of main objects like fs_info or extent buffers is in each
test so let's simplify and unify the error messages to a table and add a
convenience helper.
Signed-off-by: David Sterba <dsterba@suse.com>
For better diagnostics print the file name and line to locate the
errors. Sample output:
[ 9.052924] BTRFS: selftest: fs/btrfs/tests/extent-io-tests.c:283 offset bits do not match
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info is not freed at the end of the function and leaks. The
function is called twice so there can be up to 2x sizeof(struct
btrfs_fs_info) of leaked memory. Fortunatelly this affects only testing
builds, the size could be 16k with several debugging features enabled.
Signed-off-by: David Sterba <dsterba@suse.com>
Just add one extra line to show when the corruption is detected.
Currently only read time detection is possible.
The planned distinguish line would be:
read time:
<detailed report>
block=XXXXX read time tree block corruption detected
write time:
<detailed report>
block=XXXXX write time tree block corruption detected
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've been seeing the following sporadically throughout our fleet
panic: kernel BUG at fs/btrfs/relocation.c:4584!
netversion: 5.0-0
Backtrace:
#0 [ffffc90003adb880] machine_kexec at ffffffff81041da8
#1 [ffffc90003adb8c8] __crash_kexec at ffffffff8110396c
#2 [ffffc90003adb988] crash_kexec at ffffffff811048ad
#3 [ffffc90003adb9a0] oops_end at ffffffff8101c19a
#4 [ffffc90003adb9c0] do_trap at ffffffff81019114
#5 [ffffc90003adba00] do_error_trap at ffffffff810195d0
#6 [ffffc90003adbab0] invalid_op at ffffffff81a00a9b
[exception RIP: btrfs_reloc_cow_block+692]
RIP: ffffffff8143b614 RSP: ffffc90003adbb68 RFLAGS: 00010246
RAX: fffffffffffffff7 RBX: ffff8806b9c32000 RCX: ffff8806aad00690
RDX: ffff880850b295e0 RSI: ffff8806b9c32000 RDI: ffff88084f205bd0
RBP: ffff880849415000 R8: ffffc90003adbbe0 R9: ffff88085ac90000
R10: ffff8805f7369140 R11: 0000000000000000 R12: ffff880850b295e0
R13: ffff88084f205bd0 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffffc90003adbbb0] __btrfs_cow_block at ffffffff813bf1cd
#8 [ffffc90003adbc28] btrfs_cow_block at ffffffff813bf4b3
#9 [ffffc90003adbc78] btrfs_search_slot at ffffffff813c2e6c
The way relocation moves data extents is by creating a reloc inode and
preallocating extents in this inode and then copying the data into these
preallocated extents. Once we've done this for all of our extents,
we'll write out these dirty pages, which marks the extent written, and
goes into btrfs_reloc_cow_block(). From here we get our current
reloc_control, which _should_ match the reloc_control for the current
block group we're relocating.
However if we get an ENOSPC in this path at some point we'll bail out,
never initiating writeback on this inode. Not a huge deal, unless we
happen to be doing relocation on a different block group, and this block
group is now rc->stage == UPDATE_DATA_PTRS. This trips the BUG_ON() in
btrfs_reloc_cow_block(), because we expect to be done modifying the data
inode. We are in fact done modifying the metadata for the data inode
we're currently using, but not the one from the failed block group, and
thus we BUG_ON().
(This happens when writeback finishes for extents from the previous
group, when we are at btrfs_finish_ordered_io() which updates the data
reloc tree (inode item, drops/adds extent items, etc).)
Fix this by writing out the reloc data inode always, and then breaking
out of the loop after that point to keep from tripping this BUG_ON()
later.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ add note from Filipe ]
Signed-off-by: David Sterba <dsterba@suse.com>
The uptodate parameter of btrfs_writepage_endio_finish_ordered is used
to signal whether an error has occured while writing the given page.
0 signals an error, which is propagated to callees and 1 signifies
success. In end_compressed_bio_write the ->bi_status is checked and
based on it either BLK_STS_OK (0) or BLK_STS_NOTSUPP (1) are used. While
from functional point of view this is ok it's a for the poor reader of
the code, since the block layer values are conflated with the semantics
of the parameter.
Just use plain 0 or 1. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can only get <=0 from extent_write_cache_pages, add an ASSERT() for
it just in case.
Then instead of submitting the write bio even if we got some error,
check the return value first.
If we have already hit some error, just clean up the corrupted or
half-baked bio, and return error.
If there is no error so far, then call flush_write_bio() and return the
result.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function needs some extra checks on locked pages and eb. For error
handling we need to unlock locked pages and the eb.
There is a rare >0 return value branch, where all pages get locked
while write bio is not flushed.
Thankfully it's handled by the only caller, btree_write_cache_pages(),
as later write_one_eb() call will trigger submit_one_bio(). So there
shouldn't be any problem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can only get @ret <= 0. Add an ASSERT() for it just in case.
Then, instead of submitting the write bio even we got some error, check
the return value first.
If we have already hit some error, just clean up the corrupted or
half-baked bio, and return error.
If there is no error so far, then call flush_write_bio() and return the
result.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since __extent_writepage() will no longer return >0 value,
(ret == AOP_WRITEPAGE_ACTIVATE) will never be true.
Kill that dead branch.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btree_write_cache_pages(), we can only get @ret <= 0.
Add an ASSERT() for it just in case.
Then instead of submitting the write bio even we got some error, check
the return value first.
If we have already hit some error, just clean up the corrupted or
half-baked bio, and return error.
If there is no error so far, then call flush_write_bio() and return the
result.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since now flush_write_bio() could return error, kill the BUG_ON() first.
Then don't call flush_write_bio() unconditionally, instead we check the
return value from __extent_writepage() first.
If __extent_writepage() fails, we do cleanup, and return error without
submitting the possible corrupted or half-baked bio.
If __extent_writepage() successes, then we call flush_write_bio() and
return the result.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a BUG_ON() in flush_write_bio() to handle the return value of
submit_one_bio().
Move the BUG_ON() one level up to all its callers.
This patch will introduce temporary variable, @flush_ret to keep code
change minimal in this patch. That variable will be cleaned up when
enhancing the error handling later.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have internal report of strange transaction abort due to EUCLEAN
without any error message.
Since error message inside verify_level_key() is only enabled for
CONFIG_BTRFS_DEBUG, the error message won't be printed on most builds.
This patch will make the error message mandatory, so when problem
happens we know what's causing the problem.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When reading a file from a fuzzed image, kernel can panic like:
BTRFS warning (device loop0): csum failed root 5 ino 270 off 0 csum 0x98f94189 expected csum 0x00000000 mirror 1
assertion failed: !memcmp_extent_buffer(b, &disk_key, offsetof(struct btrfs_leaf, items[0].key), sizeof(disk_key)), file: fs/btrfs/ctree.c, line: 2544
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3500!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:btrfs_search_slot.cold.24+0x61/0x63 [btrfs]
Call Trace:
btrfs_lookup_csum+0x52/0x150 [btrfs]
__btrfs_lookup_bio_sums+0x209/0x640 [btrfs]
btrfs_submit_bio_hook+0x103/0x170 [btrfs]
submit_one_bio+0x59/0x80 [btrfs]
extent_read_full_page+0x58/0x80 [btrfs]
generic_file_read_iter+0x2f6/0x9d0
__vfs_read+0x14d/0x1a0
vfs_read+0x8d/0x140
ksys_read+0x52/0xc0
do_syscall_64+0x60/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe
[CAUSE]
The fuzzed image has a corrupted leaf whose first key doesn't match its
parent:
checksum tree key (CSUM_TREE ROOT_ITEM 0)
node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6
chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c
...
key (EXTENT_CSUM EXTENT_CSUM 79691776) block 29761536 gen 19
leaf 29761536 items 1 free space 1726 generation 19 owner CSUM_TREE
leaf 29761536 flags 0x1(WRITTEN) backref revision 1
fs uuid 3381d111-94a3-4ac7-8f39-611bbbdab7e6
chunk uuid 9af1c3c7-2af5-488b-8553-530bd515f14c
item 0 key (EXTENT_CSUM EXTENT_CSUM 8798638964736) itemoff 1751 itemsize 2244
range start 8798638964736 end 8798641262592 length 2297856
When reading the above tree block, we have extent_buffer->refs = 2 in
the context:
- initial one from __alloc_extent_buffer()
alloc_extent_buffer()
|- __alloc_extent_buffer()
|- atomic_set(&eb->refs, 1)
- one being added to fs_info->buffer_radix
alloc_extent_buffer()
|- check_buffer_tree_ref()
|- atomic_inc(&eb->refs)
So if even we call free_extent_buffer() in read_tree_block or other
similar situation, we only decrease the refs by 1, it doesn't reach 0
and won't be freed right now.
The staled eb and its corrupted content will still be kept cached.
Furthermore, we have several extra cases where we either don't do first
key check or the check is not proper for all callers:
- scrub
We just don't have first key in this context.
- shared tree block
One tree block can be shared by several snapshot/subvolume trees.
In that case, the first key check for one subvolume doesn't apply to
another.
So for the above reasons, a corrupted extent buffer can sneak into the
buffer cache.
[FIX]
Call verify_level_key in read_block_for_search to do another
verification. For that purpose the function is exported.
Due to above reasons, although we can free corrupted extent buffer from
cache, we still need the check in read_block_for_search(), for scrub and
shared tree blocks.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202755
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202757
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202759
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202761
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202767
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202769
Reported-by: Yoon Jungyeon <jungyeon@gatech.edu>
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If a an eb fails to be read for whatever reason - it's corrupted on disk
and parent transid/key validations fail or IO for eb pages fail then
this buffer must be removed from the buffer cache. Currently the code
calls free_extent_buffer if an error occurs. Unfortunately this doesn't
achieve the desired behavior since btrfs_find_create_tree_block returns
with eb->refs == 2.
On the other hand free_extent_buffer will only decrement the refs once
leaving it added to the buffer cache radix tree. This enables later
code to look up the buffer from the cache and utilize it potentially
leading to a crash.
The correct way to free the buffer is call free_extent_buffer_stale.
This function will correctly call atomic_dec explicitly for the buffer
and subsequently call release_extent_buffer which will decrement the
final reference thus correctly remove the invalid buffer from buffer
cache. This change affects only newly allocated buffers since they have
eb->refs == 2.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=202755
Reported-by: Jungyeon <jungyeon@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
From the introduction of btrfs_(set|clear)_header_flag, there is no
usage of its return value. So just make it return void.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after
merge_reloc_roots()") expands the life span of root->reloc_root.
This breaks certain checs of fs_info->reloc_ctl. Before that commit, if
we have a root with valid reloc_root, then it's ensured to have
fs_info->reloc_ctl.
But now since reloc_root doesn't always mean a valid fs_info->reloc_ctl,
such check is unreliable and can cause the following NULL pointer
dereference:
BUG: unable to handle kernel NULL pointer dereference at 00000000000005c1
IP: btrfs_reloc_pre_snapshot+0x20/0x50 [btrfs]
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 10379 Comm: snapperd Not tainted
Call Trace:
create_pending_snapshot+0xd7/0xfc0 [btrfs]
create_pending_snapshots+0x8e/0xb0 [btrfs]
btrfs_commit_transaction+0x2ac/0x8f0 [btrfs]
btrfs_mksubvol+0x561/0x570 [btrfs]
btrfs_ioctl_snap_create_transid+0x189/0x190 [btrfs]
btrfs_ioctl_snap_create_v2+0x102/0x150 [btrfs]
btrfs_ioctl+0x5c9/0x1e60 [btrfs]
do_vfs_ioctl+0x90/0x5f0
SyS_ioctl+0x74/0x80
do_syscall_64+0x7b/0x150
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7fd7cdab8467
Fix it by explicitly checking fs_info->reloc_ctl other than using the
implied root->reloc_root.
Fixes: d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In case we hit the error case for a metadata buffer in
end_bio_extent_readpage then 'ret' won't really be checked before it's
written again to. This means the -EIO in this case will never be
checked, just remove it.
Fixes-coverity-id: 1442513
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently extent_readpages (called from btrfs_readpages) will always
call __extent_readpages which tries to create contiguous range of pages
and call __do_contiguous_readpages when such contiguous range is
created.
It turns out this is unnecessary due to the fact that generic MM code
always calls filesystem's ->readpages callback (btrfs_readpages in
this case) with already contiguous pages. Armed with this knowledge it's
possible to simplify extent_readpages by eliminating the call to
__extent_readpages and directly calling contiguous_readpages.
The only edge case that needs to be handled is when
add_to_page_cache_lru fails. This is easy as all that is needed is to
submit whatever is the number of pages successfully added to the lru.
This can happen when the page is already in the range, so it does not
need to be read again, and we can't do anything else in case of other
errors.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The member is tracking simple status of the lock, we can use bool for
that and make some room for further space reduction in the structure.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::write_locks become unused and can be
moved to the appropriate section, saving a few bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The write_locks are a simple counter to track locking balance and used
to assert tree locks. Add helpers to make it conditionally work only in
DEBUG builds. Will be used in followup patches.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::read_locks become unused and can be
moved to the appropriate section, saving a few bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The read_locks are a simple counter to track locking balance and used to
assert tree locks. Add helpers to make it conditionally work only in
DEBUG builds. Will be used in followup patches.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::spining_readers become unused and can be
moved to the appropriate section, saving a few bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Add helpers for conditional DEBUG build to assert that the extent buffer
spinning_readers constraints are met. Will be used in followup patches.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::spining_writers become unused and can be
moved to the appropriate section, saving a few bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Add helpers for conditional DEBUG build to assert that the extent buffer
spinning_writers constraints are met. Will be used in followup patches.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
This flag just became synonymous to EXTENT_LOCKED, so just remove it and
used EXTENT_LOCKED directly. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This flag was introduced in a52d9a8033 ("Btrfs: Extent based page
cache code.") and subsequently it's usage effectively was removed by
1edbb734b4 ("Btrfs: reduce CPU usage in the extent_state tree") and
f2a97a9dbd ("btrfs: remove all unused functions"). Just remove it,
no functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When building with -Wsometimes-uninitialized, Clang warns:
fs/btrfs/uuid-tree.c:129:13: warning: variable 'eb' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
fs/btrfs/uuid-tree.c:129:13: warning: variable 'offset' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
Clang can't tell that all cases are covered with this final else if.
Just turn it into an else so that it is clear.
Link: https://github.com/ClangBuiltLinux/linux/issues/385
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_set_prop() takes transaction pointer as the first argument,
however in ioctl.c for the purpose of setting the compression property,
we call btrfs_set_prop() with NULL transaction pointer. Down in
the call chain btrfs_setxattr() starts transaction to update the
attribute and also to update the inode.
So for clarity, create btrfs_set_prop_trans() with no transaction
pointer as argument, in preparation to start transaction here instead of
doing it down the call chain at btrfs_setxattr().
Also now the btrfs_set_prop() is a static function.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fs_info is commonly used to represent struct fs_info *, rename
to fs_private to avoid confusion.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Drop forward declaration of the functions:
- prop_compression_validate
- prop_compression_apply
- prop_compression_extract
No functional changes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_set_prop() is a redirect to __btrfs_set_prop() with the
transaction handle equal to NULL. __btrfs_set_prop() in turn passes
this to do_setxattr() which then transaction is actually created.
Instead merge __btrfs_set_prop() to btrfs_set_prop(), and update the
caller with NULL argument.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit c40a3d38af ("Btrfs: Compute and look up csums based on
sectorsized blocks") we do a kmap_atomic() on the contents of a bvec.
The code before c40a3d38af had the kmap region just around the
checksumming too.
kmap_atomic() in turn does a preempt_disable() and pagefault_disable(),
so we shouldn't map the data for too long. Reduce the time the bvec's
page is mapped to when we actually need it.
Performance wise it doesn't seem to make a huge difference with a 2 vcpu VM
on a /dev/zram device:
vanilla patched delta
write 17.4MiB/s 17.8MiB/s +0.4MiB/s (+2%)
read 40.6MiB/s 41.5MiB/s +0.9MiB/s (+2%)
The following fio job profile was used in the comparision:
[global]
ioengine=libaio
direct=1
sync=1
norandommap
time_based
runtime=10m
size=100m
group_reporting
numjobs=2
[test]
filename=/mnt/test/fio
rw=randrw
rwmixread=70
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although btrfs heavily relies on extent_io_tree, we don't really have
any good trace events for them.
This patch will add the folowing trace events:
- trace_btrfs_set_extent_bit()
- trace_btrfs_clear_extent_bit()
- trace_btrfs_convert_extent_bit()
Since selftests could create temporary extent_io_tree without fs_info,
modify TP_fast_assign_fsid() to accept NULL as fs_info. NULL fs_info
will lead to all zero fsid.
The output would be:
btrfs_set_extent_bit: <FDID>: io_tree=INODE_IO ino=1 root=1 start=22036480 len=4096 set_bits=LOCKED
btrfs_set_extent_bit: <FSID>: io_tree=INODE_IO ino=1 root=1 start=22040576 len=4096 set_bits=LOCKED
btrfs_set_extent_bit: <FSID>: io_tree=INODE_IO ino=1 root=1 start=22044672 len=4096 set_bits=LOCKED
btrfs_set_extent_bit: <FSID>: io_tree=INODE_IO ino=1 root=1 start=22048768 len=4096 set_bits=LOCKED
btrfs_clear_extent_bit: <FSID>: io_tree=INODE_IO ino=1 root=1 start=22036480 len=16384 clear_bits=LOCKED
^^^ Extent buffer 22036480 read from disk, the locking progress
btrfs_set_extent_bit: <FSID>: io_tree=TRANS_DIRTY_PAGES ino=1 root=1 start=30425088 len=16384 set_bits=DIRTY
btrfs_set_extent_bit: <FSID>: io_tree=TRANS_DIRTY_PAGES ino=1 root=1 start=30441472 len=16384 set_bits=DIRTY
^^^ 2 new tree blocks allocated in one transaction
btrfs_set_extent_bit: <FSID>: io_tree=FREED_EXTENTS0 ino=0 root=0 start=30523392 len=16384 set_bits=DIRTY
btrfs_set_extent_bit: <FSID>: io_tree=FREED_EXTENTS0 ino=0 root=0 start=30556160 len=16384 set_bits=DIRTY
^^^ 2 old tree blocks get pinned down
There is one point which need attention:
1) Those trace events can be pretty heavy:
The following workload would generate over 400 trace events.
mkfs.btrfs -f $dev
start_trace
mount $dev $mnt -o enospc_debug
sync
touch $mnt/file1
touch $mnt/file2
touch $mnt/file3
xfs_io -f -c "pwrite 0 16k" $mnt/file4
umount $mnt
end_trace
It's not recommended to use them in real world environment.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename enums ]
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs has the following different extent_io_trees used:
- fs_info::free_extents[2]
- btrfs_inode::io_tree - for both normal inodes and the btree inode
- btrfs_inode::io_failure_tree
- btrfs_transaction::dirty_pages
- btrfs_root::dirty_log_pages
If we want to trace changes in those trees, it will be pretty hard to
distinguish them.
Instead of using hard-to-read pointer address, this patch will introduce
a new member extent_io_tree::owner to track the owner.
This modification needs all the callers of extent_io_tree_init() to
accept a new parameter @owner.
This patch provides the basis for later trace events.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch is split from the following one "btrfs: Introduce
extent_io_tree::owner to distinguish different io_trees" from Qu, so the
different changes are not mixed together.
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will add a new member fs_info to extent_io_tree.
This provides the basis for later trace events to distinguish the output
between different btrfs filesystems. While this increases the size of
the structure, we want to know the source of the trace events and
passing the fs_info as an argument to all contexts is not possible.
The selftests are now allowed to set it to NULL as they don't use the
tracepoints.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit db2462a6ad ("btrfs: don't run delayed refs in the end transaction
logic") removed its last use, so now it does absolutely nothing, therefore
remove it.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While calling functions inside zstd, we don't need to use the
indirection provided by the workspace_manager. Forward declarations are
added to maintain the function order of btrfs_compress_op.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The error code used here is wrong as it's not invalid to try to start
scrub when umount has begun. Returning EAGAIN is more user friendly as
it's recoverable.
Signed-off-by: David Sterba <dsterba@suse.com>
inode->i_op is initialized multiple times. Perform it once. This was
left by 4779cc0424 ("Btrfs: get rid of btrfs_symlink_aops").
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we failed to find a root key in btrfs_update_root(), we just panic.
That's definitely not cool, fix it by outputting an unique error
message, aborting current transaction and return -EUCLEAN. This should
not normally happen as the root has been used by the callers in some
way.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit d2e174d5d3 ("btrfs: document extent mapping assumptions in
checksum") we have a comment in place why map_private_extent_buffer()
can't return 1 in the csum_tree_block() case.
Make this a bit more explicit and WARN_ON() in case this this assumption
breaks.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently csum_tree_block() does two things, first it as it's name
suggests it calculates the checksum for a tree-block. But it also writes
this checksum to disk or reads an extent_buffer from disk and compares the
checksum with the calculated checksum, depending on the verify argument.
Furthermore one of the two callers passes in '1' for the verify argument,
the other one passes in '0'.
For clarity and less layering violations, factor out the second stage in
csum_tree_block()'s callers.
Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlzC5YIACgkQxWXV+ddt
WDvYuxAAprsX3QjNDurMnm32TKTzY1a4yjYNQZXxNUKYvDAt4sglUMIfY3kP4uVy
te9GtbHf0XDFNX58LBdBo8tnA8H/TbUr2qOlmh1hRqF89VVUCHpWBkhtOpWtYNVP
jyL+tModOQjzhjV3RD2m4qHL0Q3oJpoiC7o+kuLGP46NIzfHt/tD1iDIW0j3QJRL
f1rviXFiheNsXoeuDv/Shj6jMy6tGFa9Ys6tVmWcOxHBHVTBu+GsJaG86p4X39Sj
ffOUPF3Btug8Q5ALppX+tbWocScITJs//mJJq4FjdAt8Qn5gnJM3h87GEPj+RZJf
EwDMyd9uOwk7/HKMTtasJ6LDMsycriZ4r4cPOh/bqKz+dAsMA2V+FsV+MDSzhJlw
w0MLZc5ELzKY1jV2jm4KR1PClj+tQ4n8Jl2P/b87TP1rsJcjpDpOgUwXDBgJCXQg
LqZ/quQgfg3Zpcp+lPlsVN7dgwAwNYGlcKmMDXOnzZYRT2nNS6c4yx3EpSOXf6BI
t557BdfP/Kz23hLNuawdO33XOTxhutVd4gyghmz2VOwz8XbYKw9MoJgmODWzenM7
QqbmQvoKx82hHFLc1WQDyZxk9mhmTetQTdE4rFb291oxFBn6cGglx2omHylfJ8LU
P27vH68QgihkWs/WXrkktPzhwZVlOTlf+cCZ+h5PEw0z8k9kqWY=
=efvr
-----END PGP SIGNATURE-----
Merge tag 'for-5.1-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One patch to fix a crash in io submission path, due to memory
allocation errors.
In short, the multipage bio work that landed in 5.1 caused larger bios
that in turn require larger temporary memory for checksums. The patch
is a workaround, we're going to rework the allocation so it does not
require the vmalloc fallback.
It took a while to identify that it's caused by patches in 5.1 and not
a patchset that did some changes in error handling in the code. I've
tested it on various memory/cpu combinations, it could hit OOM but
does not crash.
The timestamp of the patch is less than a day due to updates in the
changelog, tests were running meanwhile"
* tag 'for-5.1-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: Switch memory allocations in async csum calculation path to kvmalloc
Recent multi-page biovec rework allowed creation of bios that can span
large regions - up to 128 megabytes in the case of btrfs. OTOH btrfs'
submission path currently allocates a contiguous array to store the
checksums for every bio submitted. This means we can request up to
(128mb / BTRFS_SECTOR_SIZE) * 4 bytes + 32bytes of memory from kmalloc.
On busy systems with possibly fragmented memory said kmalloc can fail
which will trigger BUG_ON due to improper error handling IO submission
context in btrfs.
Until error handling is improved or bios in btrfs limited to a more
manageable size (e.g. 1m) let's use kvmalloc to fallback to vmalloc for
such large allocations. There is no hard requirement that the memory
allocated for checksums during IO submission has to be contiguous, but
this is a simple fix that does not require several non-contiguous
allocations.
For small writes this is unlikely to have any visible effect since
kmalloc will still satisfy allocation requests as usual. For larger
requests the code will just fallback to vmalloc.
We've performed evaluation on several workload types and there was no
significant difference kmalloc vs kvmalloc.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlyveOkACgkQxWXV+ddt
WDvb3g//RTIy+8o1PwPihGT9z0wWaFTxJF9Ea3SDzgMVsOaWzZIM29Q0XCwGyR5s
xdwJlNrum4eJwcLRuvvZtZ6e9h6vdgmNxi7ULem0r2ik58rvgZf3TQp/t78qMMLo
Z8Qp/jQPHOOhwGURFsUd2TwCYQ+3yyEhzoObDDQ5OAyeCgneLe2hLNvyMMF7YNkl
joaWY5iAwDY61UaRxggwx8OI7TkhCA/ZA27zyjc6oomCQglIM/KmdvmjHpiBs0j/
1Ij4SDmSo9nqGES/SfubW/l3fpg42hoQWBMuI3/WLr3CBKN08wuRe+BKoDmVyoex
eVTy3+AnBp6KsjyOmN9h5Am+r1lyToJ1ZpsjkKQzo5SRlYC/SA0oFlxhHKygoft6
tEEJf+hbySiod/ZX0KItS6Myo1xsHsX8LnidAO+7pK0S5e5D59QPXtw6oJIp0JX/
kAKrng6bX3+7bSrF8h62nKqpFq/NUTjF8zxB6gwMwnrtroU36r8AFE4+bcyX5Z5g
+JoJcZ9VFsFcA4GCTzYWTYQ2RfCU6Vnbvh4wTLHiR+IDvbkNFxMUE4z5O+fwqhkl
6nv8G2EgK3ORZS/4mNAjmanB71gPwbovhTQhju8LQGEmlmwBrF1mQ+YhrBlMrMv9
XspCzNktqzbGj70tMPbPSB2A5H9oi+5Oabzq2MPFUVBkA9ztpSA=
=fYo9
-----END PGP SIGNATURE-----
Merge tag 'for-5.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix parsing of compression algorithm when set as a inode property,
this could end up with eg. 'zst' or 'zli' in the value
- don't allow trim on a filesystem with unreplayed log, this could
cause data loss if there are pending updates to the block groups that
would not be subject to trim after replay
* tag 'for-5.1-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: prop: fix vanished compression property after failed set
btrfs: prop: fix zstd compression parameter validation
Btrfs: do not allow trimming when a fs is mounted with the nologreplay option
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
This patch fixes the following warnings:
fs/affs/affs.h:124:38: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/configfs/dir.c:1692:11: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/configfs/dir.c:1694:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ceph/file.c:249:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/hash.c:233:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/hash.c:246:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext2/inode.c:1237:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext2/inode.c:1244:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1182:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1188:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1432:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1440:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/f2fs/node.c:618:8: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/f2fs/node.c:620:8: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/btrfs/ref-verify.c:522:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/gfs2/bmap.c:711:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/gfs2/bmap.c:722:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/jffs2/fs.c:339:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/nfsd/nfs4proc.c:429:12: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ufs/util.h:62:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ufs/util.h:43:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/fcntl.c:770:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/seq_file.c:319:10: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/libfs.c:148:11: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/libfs.c:150:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/signalfd.c:178:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/locks.c:1473:16: warning: this statement may fall through [-Wimplicit-fallthrough=]
Warning level 3 was used: -Wimplicit-fallthrough=3
This patch is part of the ongoing efforts to enabling
-Wimplicit-fallthrough.
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
The compression property resets to NULL, instead of the old value if we
fail to set the new compression parameter.
$ btrfs prop get /btrfs compression
compression=lzo
$ btrfs prop set /btrfs compression zli
ERROR: failed to set compression for /btrfs: Invalid argument
$ btrfs prop get /btrfs compression
This is because the compression property ->validate() is successful for
'zli' as the strncmp() used the length passed from the userspace.
Fix it by using the expected string length in strncmp().
Fixes: 63541927c8 ("Btrfs: add support for inode properties")
Fixes: 5c1aab1dd5 ("btrfs: Add zstd support")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We let pass zstd compression parameter even if it is not fully valid.
For example:
$ btrfs prop set /btrfs compression zst
$ btrfs prop get /btrfs compression
compression=zst
zlib and lzo are fine.
Fix it by checking the correct prefix length.
Fixes: 5c1aab1dd5 ("btrfs: Add zstd support")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Whan a filesystem is mounted with the nologreplay mount option, which
requires it to be mounted in RO mode as well, we can not allow discard on
free space inside block groups, because log trees refer to extents that
are not pinned in a block group's free space cache (pinning the extents is
precisely the first phase of replaying a log tree).
So do not allow the fitrim ioctl to do anything when the filesystem is
mounted with the nologreplay option, because later it can be mounted RW
without that option, which causes log replay to happen and result in
either a failure to replay the log trees (leading to a mount failure), a
crash or some silent corruption.
Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Fixes: 96da09192c ("btrfs: Introduce new mount option to disable tree log replay")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Back in commit a89ca6f24f ("Btrfs: fix fsync after truncate when
no_holes feature is enabled") I added an assertion that is triggered when
an inline extent is found to assert that the length of the (uncompressed)
data the extent represents is the same as the i_size of the inode, since
that is true most of the time I couldn't find or didn't remembered about
any exception at that time. Later on the assertion was expanded twice to
deal with a case of a compressed inline extent representing a range that
matches the sector size followed by an expanding truncate, and another
case where fallocate can update the i_size of the inode without adding
or updating existing extents (if the fallocate range falls entirely within
the first block of the file). These two expansion/fixes of the assertion
were done by commit 7ed586d0a8 ("Btrfs: fix assertion on fsync of
regular file when using no-holes feature") and commit 6399fb5a0b
("Btrfs: fix assertion failure during fsync in no-holes mode").
These however missed the case where an falloc expands the i_size of an
inode to exactly the sector size and inline extent exists, for example:
$ mkfs.btrfs -f -O no-holes /dev/sdc
$ mount /dev/sdc /mnt
$ xfs_io -f -c "pwrite -S 0xab 0 1096" /mnt/foobar
wrote 1096/1096 bytes at offset 0
1 KiB, 1 ops; 0.0002 sec (4.448 MiB/sec and 4255.3191 ops/sec)
$ xfs_io -c "falloc 1096 3000" /mnt/foobar
$ xfs_io -c "fsync" /mnt/foobar
Segmentation fault
$ dmesg
[701253.602385] assertion failed: len == i_size || (len == fs_info->sectorsize && btrfs_file_extent_compression(leaf, extent) != BTRFS_COMPRESS_NONE) || (len < i_size && i_size < fs_info->sectorsize), file: fs/btrfs/tree-log.c, line: 4727
[701253.602962] ------------[ cut here ]------------
[701253.603224] kernel BUG at fs/btrfs/ctree.h:3533!
[701253.603503] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[701253.603774] CPU: 2 PID: 7192 Comm: xfs_io Tainted: G W 5.0.0-rc8-btrfs-next-45 #1
[701253.604054] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[701253.604650] RIP: 0010:assfail.constprop.23+0x18/0x1a [btrfs]
(...)
[701253.605591] RSP: 0018:ffffbb48c186bc48 EFLAGS: 00010286
[701253.605914] RAX: 00000000000000de RBX: ffff921d0a7afc08 RCX: 0000000000000000
[701253.606244] RDX: 0000000000000000 RSI: ffff921d36b16868 RDI: ffff921d36b16868
[701253.606580] RBP: ffffbb48c186bcf0 R08: 0000000000000000 R09: 0000000000000000
[701253.606913] R10: 0000000000000003 R11: 0000000000000000 R12: ffff921d05d2de18
[701253.607247] R13: ffff921d03b54000 R14: 0000000000000448 R15: ffff921d059ecf80
[701253.607769] FS: 00007f14da906700(0000) GS:ffff921d36b00000(0000) knlGS:0000000000000000
[701253.608163] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[701253.608516] CR2: 000056087ea9f278 CR3: 00000002268e8001 CR4: 00000000003606e0
[701253.608880] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[701253.609250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[701253.609608] Call Trace:
[701253.609994] btrfs_log_inode+0xdfb/0xe40 [btrfs]
[701253.610383] btrfs_log_inode_parent+0x2be/0xa60 [btrfs]
[701253.610770] ? do_raw_spin_unlock+0x49/0xc0
[701253.611150] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[701253.611537] btrfs_sync_file+0x3b2/0x440 [btrfs]
[701253.612010] ? do_sysinfo+0xb0/0xf0
[701253.612552] do_fsync+0x38/0x60
[701253.612988] __x64_sys_fsync+0x10/0x20
[701253.613360] do_syscall_64+0x60/0x1b0
[701253.613733] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[701253.614103] RIP: 0033:0x7f14da4e66d0
(...)
[701253.615250] RSP: 002b:00007fffa670fdb8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[701253.615647] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f14da4e66d0
[701253.616047] RDX: 000056087ea9c260 RSI: 000056087ea9c260 RDI: 0000000000000003
[701253.616450] RBP: 0000000000000001 R08: 0000000000000020 R09: 0000000000000010
[701253.616854] R10: 000000000000009b R11: 0000000000000246 R12: 000056087ea9c260
[701253.617257] R13: 000056087ea9c240 R14: 0000000000000000 R15: 000056087ea9dd10
(...)
[701253.619941] ---[ end trace e088d74f132b6da5 ]---
Updating the assertion again to allow for this particular case would result
in a meaningless assertion, plus there is currently no risk of logging
content that would result in any corruption after a log replay if the size
of the data encoded in an inline extent is greater than the inode's i_size
(which is not currently possibe either with or without compression),
therefore just remove the assertion.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
qgroup_rsv_size is calculated as the product of
outstanding_extent * fs_info->nodesize. The product is calculated with
32 bit precision since both variables are defined as u32. Yet
qgroup_rsv_size expects a 64 bit result.
Avoid possible multiplication overflow by casting outstanding_extent to
u64. Such overflow would in the worst case (64K nodesize) require more
than 65536 extents, which is quite large and i'ts not likely that it
would happen in practice.
Fixes-coverity-id: 1435101
Fixes: ff6bc37eb7 ("btrfs: qgroup: Use independent and accurate per inode qgroup rsv")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If 'cur_level' is 7 then the bound checking at the top of the function
will actually pass. Later on, it's possible to dereference
ds_path->nodes[cur_level+1] which will be an out of bounds.
The correct check will be cur_level >= BTRFS_MAX_LEVEL - 1 .
Fixes-coverty-id: 1440918
Fixes-coverty-id: 1440911
Fixes: ea49f3e73c ("btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree")
CC: stable@vger.kernel.org # 4.20+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Parity page is incorrectly unmapped in finish_parity_scrub(), triggering
a reference counter bug on i386, i.e.:
[ 157.662401] kernel BUG at mm/highmem.c:349!
[ 157.666725] invalid opcode: 0000 [#1] SMP PTI
The reason is that kunmap(p_page) was completely left out, so we never
did an unmap for the p_page and the loop unmapping the rbio page was
iterating over the wrong number of stripes: unmapping should be done
with nr_data instead of rbio->real_stripes.
Test case to reproduce the bug:
- create a raid5 btrfs filesystem:
# mkfs.btrfs -m raid5 -d raid5 /dev/sdb /dev/sdc /dev/sdd /dev/sde
- mount it:
# mount /dev/sdb /mnt
- run btrfs scrub in a loop:
# while :; do btrfs scrub start -BR /mnt; done
BugLink: https://bugs.launchpad.net/bugs/1812845
Fixes: 5a6ac9eacb ("Btrfs, raid56: support parity scrub on raid56")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As readahead is an optimization, all errors are usually filtered out,
but still properly handled when the real read call is done. The commit
5e9d398240 ("btrfs: readpages() should submit IO as read-ahead") added
REQ_RAHEAD to readpages() because that's only used for readahead
(despite what one would expect from the callback name).
This causes a flood of messages and inflated read error stats, so skip
reporting in case it's readahead.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202403
Reported-by: LimeTech <tomm@lime-technology.com>
Fixes: 5e9d398240 ("btrfs: readpages() should submit IO as read-ahead")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: David Sterba <dsterba@suse.com>
When we are mixing buffered writes with direct IO writes against the same
file and snapshotting is happening concurrently, we can end up with a
corrupt file content in the snapshot. Example:
1) Inode/file is empty.
2) Snapshotting starts.
2) Buffered write at offset 0 length 256Kb. This updates the i_size of the
inode to 256Kb, disk_i_size remains zero. This happens after the task
doing the snapshot flushes all existing delalloc.
3) DIO write at offset 256Kb length 768Kb. Once the ordered extent
completes it sets the inode's disk_i_size to 1Mb (256Kb + 768Kb) and
updates the inode item in the fs tree with a size of 1Mb (which is
the value of disk_i_size).
4) The dealloc for the range [0, 256Kb[ did not start yet.
5) The transaction used in the DIO ordered extent completion, which updated
the inode item, is committed by the snapshotting task.
6) Snapshot creation completes.
7) Dealloc for the range [0, 256Kb[ is flushed.
After that when reading the file from the snapshot we always get zeroes for
the range [0, 256Kb[, the file has a size of 1Mb and the data written by
the direct IO write is found. From an application's point of view this is
a corruption, since in the source subvolume it could never read a version
of the file that included the data from the direct IO write without the
data from the buffered write included as well. In the snapshot's tree,
file extent items are missing for the range [0, 256Kb[.
The issue, obviously, does not happen when using the -o flushoncommit
mount option.
Fix this by flushing delalloc for all the roots that are about to be
snapshotted when committing a transaction. This guarantees total ordering
when updating the disk_i_size of an inode since the flush for dealloc is
done when a transaction is in the TRANS_STATE_COMMIT_START state and wait
is done once no more external writers exist. This is similar to what we
do when using the flushoncommit mount option, but we do it only if the
transaction has snapshots to create and only for the roots of the
subvolumes to be snapshotted. The bulk of the dealloc is flushed in the
snapshot creation ioctl, so the flush work we do inside the transaction
is minimized.
This issue, involving buffered and direct IO writes with snapshotting, is
often triggered by fstest btrfs/078, and got reported by fsck when not
using the NO_HOLES features, for example:
$ cat results/btrfs/078.full
(...)
_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
*** fsck.btrfs output ***
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
root 258 inode 264 errors 100, file extent discount
Found file extent holes:
start: 524288, len: 65536
ERROR: errors found in fs roots
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When Filipe added the recursive directory logging stuff in
2f2ff0ee5e ("Btrfs: fix metadata inconsistencies after directory
fsync") he specifically didn't take the directory i_mutex for the
children directories that we need to log because of lockdep. This is
generally fine, but can lead to this WARN_ON() tripping if we happen to
run delayed deletion's in between our first search and our second search
of dir_item/dir_indexes for this directory. We expect this to happen,
so the WARN_ON() isn't necessary. Drop the WARN_ON() and add a comment
so we know why this case can happen.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we do a shrinking truncate against an inode which is already present
in the respective log tree and then rename it, as part of logging the new
name we end up logging an inode item that reflects the old size of the
file (the one which we previously logged) and not the new smaller size.
The decision to preserve the size previously logged was added by commit
1a4bcf470c ("Btrfs: fix fsync data loss after adding hard link to
inode") in order to avoid data loss after replaying the log. However that
decision is only needed for the case the logged inode size is smaller then
the current size of the inode, as explained in that commit's change log.
If the current size of the inode is smaller then the previously logged
size, we know a shrinking truncate happened and therefore need to use
that smaller size.
Example to trigger the problem:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo
$ xfs_io -c "fsync" /mnt/foo
$ xfs_io -c "truncate 3000" /mnt/foo
$ mv /mnt/foo /mnt/bar
$ xfs_io -c "fsync" /mnt/bar
<power failure>
$ mount /dev/sdb /mnt
$ od -t x1 -A d /mnt/bar
0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
*
0008000
Once we rename the file, we log its name (and inode item), and because
the inode was already logged before in the current transaction, we log it
with a size of 8000 bytes because that is the size we previously logged
(with the first fsync). As part of the rename, besides logging the inode,
we do also sync the log, which is done since commit d4682ba03e
("Btrfs: sync log after logging new name"), so the next fsync against our
inode is effectively a no-op, since no new changes happened since the
rename operation. Even if did not sync the log during the rename
operation, the same problem (fize size of 8000 bytes instead of 3000
bytes) would be visible after replaying the log if the log ended up
getting synced to disk through some other means, such as for example by
fsyncing some other modified file. In the example above the fsync after
the rename operation is there just because not every filesystem may
guarantee logging/journalling the inode (and syncing the log/journal)
during the rename operation, for example it is needed for f2fs, but not
for ext4 and xfs.
Fix this scenario by, when logging a new name (which is triggered by
rename and link operations), using the current size of the inode instead
of the previously logged inode size.
A test case for fstests follows soon.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695
CC: stable@vger.kernel.org # 4.4+
Reported-by: Seulbae Kim <seulbae@gatech.edu>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlyHwgcACgkQxWXV+ddt
WDvfFQ//VBO8Rnz0+V4XbxaIaz25EbCjR4cBuXwwXyl8HPJMZBvBPqW0LVXtV0eP
SwK0A5qiWqaXgWNpByD43AvgizZuWF/9SvxebaCKTjSK5t9TuXpR27vJNnHJf0L0
o4DeMXlgd8yE8yZstQo7UnLWfNU69v6Pi3Zbar/7IIJ0sVVCPMMoGoARZDlQ+w0M
wwppi04+a6bnAUbqpnWiL0a8++WX6gqP7MovqLRgf/up4cmzmDFoV7b/7pbvZxNv
LrKQBmJZQq44bW4TXMzhpkrIGyzrrUQuBhpbYJus9yZYqS6Owkzl5AQpdzo9reg2
V35xOkOZbXxqdOTGY0he9Z6wxJL+ocfryfRyA2hE4gXbCAnfFqIRyFicpTXuXxwg
RBan8VLB+1iC7j9djX/sP/uCH3tsPgN4WnjdZgnkUOkUhTuvpPw/A9bp6Uqfjr6g
JU0o/TlCC8npaveUQsuNbqyVYgPk58d9by12HsSW7UaA8ENyHz62+zoiv9jX/uZY
Tl4t2L+MKxEcsd0KEKEQpV+0hV56GtYcIZIqJTe9WFmPBHmEH3PCHDx4A5LrYveO
hC+hGAnX9xWK4XIr8T3ck1tsnxNApD25pmKSivadUiVJqOrPpJFyZb3aztcKcx4Y
sDbZdOV7XHq6ACrIhLoxpYWQc27v1FqrWVqsF51wo07I3meUVaA=
=u4Kf
-----END PGP SIGNATURE-----
Merge tag 'for-5.1-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Correctness and a deadlock fixes"
* tag 'for-5.1-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zstd: ensure reclaim timer is properly cleaned up
btrfs: move ulist allocation out of transaction in quota enable
btrfs: save drop_progress if we drop refs at all
btrfs: check for refs on snapshot delete resume
Btrfs: fix deadlock between clone/dedupe and rename
Btrfs: fix corruption reading shared and compressed extents after hole punching
All users of VM_MAX_READAHEAD actually convert it to kbytes and then to
pages. Define the macro explicitly as (SZ_128K / PAGE_SIZE). This
simplifies the expression in every filesystem. Also rename the macro to
VM_READAHEAD_PAGES to properly convey its meaning. Finally remove unused
VM_MIN_READAHEAD
[akpm@linux-foundation.org: fix fs/io_uring.c, per Stephen]
Link: http://lkml.kernel.org/r/20181221144053.24318-1-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlx63XIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpp2vEACfrrQsap7R+Av28mmXpmXi2FPa3g5Tev1t
yYjK2qHvhlMZjPTYw3hCmbYdDDczlF7PEgSE2x2DjdcsYapb8Fy1lZ2X16c7ztBR
HD/t9b5AVSQsczZzKgv3RqsNtTnjzS5V0A8XH8FAP2QRgiwDMwSN6G0FP0JBLbE/
ZgxQrH1Iy1F33Wz4hI3Z7dEghKPZrH1IlegkZCEu47q9SlWS76qUetSy2GEtchOl
3Lgu54mQZyVdI5/QZf9DyMDLF6dIz3tYU2qhuo01AHjGRCC72v86p8sIiXcUr94Q
8pbegJhJ/g8KBol9Qhv3+pWG/QUAZwi/ZwasTkK+MJ4klRXfOrznxPubW1z6t9Vn
QRo39Po5SqqP0QWAscDxCFjESIQlWlKa+LZurJL7DJDCUGrSgzTpnVwFqKwc5zTP
HJa5MT2tEeL2TfUYRYCfh0ZV0elINdHA1y1klDBh38drh4EWr2gW8xdseGYXqRjh
fLgEpoF7VQ8kTvxKN+E4jZXkcZmoLmefp0ZyAbblS6IawpPVC7kXM9Fdn2OU8f2c
fjVjvSiqxfeN6dnpfeLDRbbN9894HwgP/LPropJOQ7KmjCorQq5zMDkAvoh3tElq
qwluRqdBJpWT/F05KweY+XVW8OawIycmUWqt6JrVNoIDAK31auHQv47kR0VA4OvE
DRVVhYpocw==
=VBaU
-----END PGP SIGNATURE-----
Merge tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
"Not a huge amount of changes in this round, the biggest one is that we
finally have Mings multi-page bvec support merged. Apart from that,
this pull request contains:
- Small series that avoids quiescing the queue for sysfs changes that
match what we currently have (Aleksei)
- Series of bcache fixes (via Coly)
- Series of lightnvm fixes (via Mathias)
- NVMe pull request from Christoph. Nothing major, just SPDX/license
cleanups, RR mp policy (Hannes), and little fixes (Bart,
Chaitanya).
- BFQ series (Paolo)
- Save blk-mq cpu -> hw queue mapping, removing a pointer indirection
for the fast path (Jianchao)
- fops->iopoll() added for async IO polling, this is a feature that
the upcoming io_uring interface will use (Christoph, me)
- Partition scan loop fixes (Dongli)
- mtip32xx conversion from managed resource API (Christoph)
- cdrom registration race fix (Guenter)
- MD pull from Song, two minor fixes.
- Various documentation fixes (Marcos)
- Multi-page bvec feature. This brings a lot of nice improvements
with it, like more efficient splitting, larger IOs can be supported
without growing the bvec table size, and so on. (Ming)
- Various little fixes to core and drivers"
* tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits)
block: fix updating bio's front segment size
block: Replace function name in string with __func__
nbd: propagate genlmsg_reply return code
floppy: remove set but not used variable 'q'
null_blk: fix checking for REQ_FUA
block: fix NULL pointer dereference in register_disk
fs: fix guard_bio_eod to check for real EOD errors
blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map
block: optimize bvec iteration in bvec_iter_advance
block: introduce mp_bvec_for_each_page() for iterating over page
block: optimize blk_bio_segment_split for single-page bvec
block: optimize __blk_segment_map_sg() for single-page bvec
block: introduce bvec_nth_page()
iomap: wire up the iopoll method
block: add bio_set_polled() helper
block: wire up block device iopoll method
fs: add an iopoll method to struct file_operations
loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
loop: do not print warn message if partition scan is successful
block: bounce: make sure that bvec table is updated
...
Merge more updates from Andrew Morton:
- some of the rest of MM
- various misc things
- dynamic-debug updates
- checkpatch
- some epoll speedups
- autofs
- rapidio
- lib/, lib/lzo/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits)
samples/mic/mpssd/mpssd.h: remove duplicate header
kernel/fork.c: remove duplicated include
include/linux/relay.h: fix percpu annotation in struct rchan
arch/nios2/mm/fault.c: remove duplicate include
unicore32: stop printing the virtual memory layout
MAINTAINERS: fix GTA02 entry and mark as orphan
mm: create the new vm_fault_t type
arm, s390, unicore32: remove oneliner wrappers for memblock_alloc()
arch: simplify several early memory allocations
openrisc: simplify pte_alloc_one_kernel()
sh: prefer memblock APIs returning virtual address
microblaze: prefer memblock API returning virtual address
powerpc: prefer memblock APIs returning virtual address
lib/lzo: separate lzo-rle from lzo
lib/lzo: implement run-length encoding
lib/lzo: fast 8-byte copy on arm64
lib/lzo: 64-bit CTZ on arm64
lib/lzo: tidy-up ifdefs
ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_size
ipc: annotate implicit fall through
...
First, the btrfs_debug macros open-code (one possible definition of)
DYNAMIC_DEBUG_BRANCH, so they don't benefit from the CONFIG_JUMP_LABEL
optimization.
Second, a planned change of struct _ddebug (to reduce its size on 64 bit
machines) requires that all descriptors in a translation unit use
distinct identifiers.
Using the new _dynamic_func_call_no_desc helper macro from
dynamic_debug.h takes care of both of these. No functional change.
Link: http://lkml.kernel.org/r/20190212214150.4807-12-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Jason Baron <jbaron@akamai.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: "Rafael J . Wysocki" <rafael.j.wysocki@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The timer function, zstd_reclaim_timer_fn(), reschedules itself under
certain conditions. When cleaning up, take the lock and remove all
workspaces. This prevents the timer from rearming itself. Lastly, switch
to del_timer_sync() to ensure that the timer function can't trigger as
we're unloading.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The allocation happens with GFP_KERNEL after a transaction has been
started, this can potentially cause deadlock if reclaim tries to get the
memory by flushing filesystem data.
The fs_info::qgroup_ulist is not used during transaction start when
quotas are not enabled. The status bit BTRFS_FS_QUOTA_ENABLED is set
later in btrfs_quota_enable so it's safe to move it before the
transaction start.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously we only updated the drop_progress key if we were in the
DROP_REFERENCE stage of snapshot deletion. This is because the
UPDATE_BACKREF stage checks the flags of the blocks it's converting to
FULL_BACKREF, so if we go over a block we processed before it doesn't
matter, we just don't do anything.
The problem is in do_walk_down() we will go ahead and drop the roots
reference to any blocks that we know we won't need to walk into.
Given subvolume A and snapshot B. The root of B points to all of the
nodes that belong to A, so all of those nodes have a refcnt > 1. If B
did not modify those blocks it'll hit this condition in do_walk_down
if (!wc->update_ref ||
generation <= root->root_key.offset)
goto skip;
and in "goto skip" we simply do a btrfs_free_extent() for that bytenr
that we point at.
Now assume we modified some data in B, and then took a snapshot of B and
call it C. C points to all the nodes in B, making every node the root
of B points to have a refcnt > 1. This assumes the root level is 2 or
higher.
We delete snapshot B, which does the above work in do_walk_down,
free'ing our ref for nodes we share with A that we didn't modify. Now
we hit a node we _did_ modify, thus we own. We need to walk down into
this node and we set wc->stage == UPDATE_BACKREF. We walk down to level
0 which we also own because we modified data. We can't walk any further
down and thus now need to walk up and start the next part of the
deletion. Now walk_up_proc is supposed to put us back into
DROP_REFERENCE, but there's an exception to this
if (level < wc->shared_level)
goto out;
we are at level == 0, and our shared_level == 1. We skip out of this
one and go up to level 1. Since path->slots[1] < nritems we
path->slots[1]++ and break out of walk_up_tree to stop our transaction
and loop back around. Now in btrfs_drop_snapshot we have this snippet
if (wc->stage == DROP_REFERENCE) {
level = wc->level;
btrfs_node_key(path->nodes[level],
&root_item->drop_progress,
path->slots[level]);
root_item->drop_level = level;
}
our stage == UPDATE_BACKREF still, so we don't update the drop_progress
key. This is a problem because we would have done btrfs_free_extent()
for the nodes leading up to our current position. If we crash or
unmount here and go to remount we'll start over where we were before and
try to free our ref for blocks we've already freed, and thus abort()
out.
Fix this by keeping track of the last place we dropped a reference for
our block in do_walk_down. Then if wc->stage == UPDATE_BACKREF we know
we'll start over from a place we meant to, and otherwise things continue
to work as they did before.
I have a complicated reproducer for this problem, without this patch
we'll fail to fsck the fs when replaying the log writes log. With this
patch we can replay the whole log without any fsck or mount failures.
The steps to reproduce this easily are sort of tricky, I had to add a
couple of debug patches to the kernel in order to make it easy,
basically I just needed to make sure we did actually commit the
transaction every time we finished a walk_down_tree/walk_up_tree combo.
The reproducer:
1) Creates a base subvolume.
2) Creates 100k files in the subvolume.
3) Snapshots the base subvolume (snap1).
4) Touches files 5000-6000 in snap1.
5) Snapshots snap1 (snap2).
6) Deletes snap1.
I do this with dm-log-writes, and then replay to every FUA in the log
and fsck the fs.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ copy reproducer steps ]
Signed-off-by: David Sterba <dsterba@suse.com>
There's a bug in snapshot deletion where we won't update the
drop_progress key if we're in the UPDATE_BACKREF stage. This is a
problem because we could drop refs for blocks we know don't belong to
ours. If we crash or umount at the right time we could experience
messages such as the following when snapshot deletion resumes
BTRFS error (device dm-3): unable to find ref byte nr 66797568 parent 0 root 258 owner 1 offset 0
------------[ cut here ]------------
WARNING: CPU: 3 PID: 16052 at fs/btrfs/extent-tree.c:7108 __btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs]
CPU: 3 PID: 16052 Comm: umount Tainted: G W OE 5.0.0-rc4+ #147
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
RIP: 0010:__btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs]
RSP: 0018:ffffc90005cd7b18 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: ffff88842fade680 RSI: ffff88842fad6b18 RDI: ffff88842fad6b18
RBP: ffffc90005cd7bc8 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000001 R11: ffffffff822696b8 R12: 0000000003fb4000
R13: 0000000000000001 R14: 0000000000000102 R15: ffff88819c9d67e0
FS: 00007f08bb138fc0(0000) GS:ffff88842fac0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8f5d861ea0 CR3: 00000003e99fe000 CR4: 00000000000006e0
Call Trace:
? _raw_spin_unlock+0x27/0x40
? btrfs_merge_delayed_refs+0x356/0x3e0 [btrfs]
__btrfs_run_delayed_refs+0x75a/0x13c0 [btrfs]
? join_transaction+0x2b/0x460 [btrfs]
btrfs_run_delayed_refs+0xf3/0x1c0 [btrfs]
btrfs_commit_transaction+0x52/0xa50 [btrfs]
? start_transaction+0xa6/0x510 [btrfs]
btrfs_sync_fs+0x79/0x1c0 [btrfs]
sync_filesystem+0x70/0x90
generic_shutdown_super+0x27/0x120
kill_anon_super+0x12/0x30
btrfs_kill_super+0x16/0xa0 [btrfs]
deactivate_locked_super+0x43/0x70
deactivate_super+0x40/0x60
cleanup_mnt+0x3f/0x80
__cleanup_mnt+0x12/0x20
task_work_run+0x8b/0xc0
exit_to_usermode_loop+0xce/0xd0
do_syscall_64+0x20b/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe
To fix this simply mark dead roots we read from disk as DEAD and then
set the walk_control->restarted flag so we know we have a restarted
deletion. From here whenever we try to drop refs for blocks we check to
verify our ref is set on them, and if it is not we skip it. Once we
find a ref that is set we unset walk_control->restarted since the tree
should be in a normal state from then on, and any problems we run into
from there are different issues. I tested this with an existing broken
fs and my reproducer that creates a broken fs and it fixed both file
systems.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reflinking (clone/dedupe) and rename are operations that operate on two
inodes and therefore need to lock them in the same order to avoid ABBA
deadlocks. It happens that Btrfs' reflink implementation always locked
them in a different order from VFS's lock_two_nondirectories() helper,
which is used by the rename code in VFS, resulting in ABBA type deadlocks.
Btrfs' locking order:
static void btrfs_double_inode_lock(struct inode *inode1, struct inode *inode2)
{
if (inode1 < inode2)
swap(inode1, inode2);
inode_lock_nested(inode1, I_MUTEX_PARENT);
inode_lock_nested(inode2, I_MUTEX_CHILD);
}
VFS's locking order:
void lock_two_nondirectories(struct inode *inode1, struct inode *inode2)
{
if (inode1 > inode2)
swap(inode1, inode2);
if (inode1 && !S_ISDIR(inode1->i_mode))
inode_lock(inode1);
if (inode2 && !S_ISDIR(inode2->i_mode) && inode2 != inode1)
inode_lock_nested(inode2, I_MUTEX_NONDIR2);
}
Fix this by killing the btrfs helper function that does the double inode
locking and replace it with VFS's helper lock_two_nondirectories().
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 416161db9b ("btrfs: offline dedupe")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the past we had data corruption when reading compressed extents that
are shared within the same file and they are consecutive, this got fixed
by commit 005efedf2c ("Btrfs: fix read corruption of compressed and
shared extents") and by commit 808f80b467 ("Btrfs: update fix for read
corruption of compressed and shared extents"). However there was a case
that was missing in those fixes, which is when the shared and compressed
extents are referenced with a non-zero offset. The following shell script
creates a reproducer for this issue:
#!/bin/bash
mkfs.btrfs -f /dev/sdc &> /dev/null
mount -o compress /dev/sdc /mnt/sdc
# Create a file with 3 consecutive compressed extents, each has an
# uncompressed size of 128Kb and a compressed size of 4Kb.
for ((i = 1; i <= 3; i++)); do
head -c 4096 /dev/zero
for ((j = 1; j <= 31; j++)); do
head -c 4096 /dev/zero | tr '\0' "\377"
done
done > /mnt/sdc/foobar
sync
echo "Digest after file creation: $(md5sum /mnt/sdc/foobar)"
# Clone the first extent into offsets 128K and 256K.
xfs_io -c "reflink /mnt/sdc/foobar 0 128K 128K" /mnt/sdc/foobar
xfs_io -c "reflink /mnt/sdc/foobar 0 256K 128K" /mnt/sdc/foobar
sync
echo "Digest after cloning: $(md5sum /mnt/sdc/foobar)"
# Punch holes into the regions that are already full of zeroes.
xfs_io -c "fpunch 0 4K" /mnt/sdc/foobar
xfs_io -c "fpunch 128K 4K" /mnt/sdc/foobar
xfs_io -c "fpunch 256K 4K" /mnt/sdc/foobar
sync
echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)"
echo "Dropping page cache..."
sysctl -q vm.drop_caches=1
echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)"
umount /dev/sdc
When running the script we get the following output:
Digest after file creation: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
linked 131072/131072 bytes at offset 131072
128 KiB, 1 ops; 0.0033 sec (36.960 MiB/sec and 295.6830 ops/sec)
linked 131072/131072 bytes at offset 262144
128 KiB, 1 ops; 0.0015 sec (78.567 MiB/sec and 628.5355 ops/sec)
Digest after cloning: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
Digest after hole punching: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
Dropping page cache...
Digest after hole punching: fba694ae8664ed0c2e9ff8937e7f1484 /mnt/sdc/foobar
This happens because after reading all the pages of the extent in the
range from 128K to 256K for example, we read the hole at offset 256K
and then when reading the page at offset 260K we don't submit the
existing bio, which is responsible for filling all the page in the
range 128K to 256K only, therefore adding the pages from range 260K
to 384K to the existing bio and submitting it after iterating over the
entire range. Once the bio completes, the uncompressed data fills only
the pages in the range 128K to 256K because there's no more data read
from disk, leaving the pages in the range 260K to 384K unfilled. It is
just a slightly different variant of what was solved by commit
005efedf2c ("Btrfs: fix read corruption of compressed and shared
extents").
Fix this by forcing a bio submit, during readpages(), whenever we find a
compressed extent map for a page that is different from the extent map
for the previous page or has a different starting offset (in case it's
the same compressed extent), instead of the extent map's original start
offset.
A test case for fstests follows soon.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 808f80b467 ("Btrfs: update fix for read corruption of compressed and shared extents")
Fixes: 005efedf2c ("Btrfs: fix read corruption of compressed and shared extents")
Cc: stable@vger.kernel.org # 4.3+
Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a messy cast here:
min_t(int, len, (int)sizeof(*item)));
min_t() should normally cast to unsigned. It's not possible for "len"
to be negative, but if it were then we definitely wouldn't want to pass
negatives to read_extent_buffer(). Also there is an extra cast.
This patch shouldn't affect runtime, it's just a clean up.
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At ctree.c:key_search(), the assertion that verifies the first key on a
child extent buffer corresponds to the key at a specific slot in the
parent has a disadvantage: we effectively hit a BUG_ON() which requires
rebooting the machine later. It also does not tell any information about
which extent buffer is affected, from which root, the expected and found
keys, etc.
However as of commit 581c176041 ("btrfs: Validate child tree block's
level and first key"), that assertion is not needed since at the time we
read an extent buffer from disk we validate that its first key matches the
key, at the respective slot, in the parent extent buffer. Therefore just
remove the assertion at key_search().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function map_private_extent_buffer() can return an -EINVAL error, and
it is called by generic_bin_search() which will return back the error. The
btrfs_bin_search() function in turn calls generic_bin_search() and the
key_search() function calls btrfs_bin_search(), so both can return the
-EINVAL error coming from the map_private_extent_buffer() function. Some
callers of these functions were ignoring that these functions can return
an error, so fix them to deal with error return values.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We should drop the lock on this error path. This has been found by a
static tool.
The lock needs to be released, it's there to protect access to the
dev_replace members and is not supposed to be left locked. The value of
state that's being switched would need to be artifically changed to an
invalid value so the default: branch is taken.
Fixes: d189dd70e2 ("btrfs: fix use-after-free due to race between replace start and cancel")
CC: stable@vger.kernel.org # 5.0+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We recently had a customer issue with a corrupted filesystem. When
trying to mount this image btrfs panicked with a division by zero in
calc_stripe_length().
The corrupt chunk had a 'num_stripes' value of 1. calc_stripe_length()
takes this value and divides it by the number of copies the RAID profile
is expected to have to calculate the amount of data stripes. As a DUP
profile is expected to have 2 copies this division resulted in 1/2 = 0.
Later then the 'data_stripes' variable is used as a divisor in the
stripe length calculation which results in a division by 0 and thus a
kernel panic.
When encountering a filesystem with a DUP block group and a
'num_stripes' value unequal to 2, refuse mounting as the image is
corrupted and will lead to unexpected behaviour.
Code inspection showed a RAID1 block group has the same issues.
Fixes: e06cd3dd7c ("Btrfs: add validadtion checks for chunk loading")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The scrub_ctx csum_list member must be initialized before scrub_free_ctx
is called. If the csum_list is not initialized beforehand, the
list_empty call in scrub_free_csums will result in a null deref if the
allocation fails in the for loop.
Fixes: a2de733c78 ("btrfs: scrub")
CC: stable@vger.kernel.org # 3.0+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Dan Robertson <dan@dlrobertson.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Comparing the content of the pages in the range to deduplicate is now
done in generic_remap_checks called by the generic helper
generic_remap_file_range_prep(), which takes care of ensuring we do not
compare/deduplicate undefined data beyond a file's EOF (range from EOF
to the next block boundary). So remove these checks which are now
redundant.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After a succession of renames operations of different files and unlinking
one of them, if we fsync one of the renamed files we can end up with a
log that will either fail to replay at mount time or result in a filesystem
that is in an inconsistent state. One example scenario:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/testdir
$ touch /mnt/testdir/fname1
$ touch /mnt/testdir/fname2
$ sync
$ mv /mnt/testdir/fname1 /mnt/testdir/fname3
$ rm -f /mnt/testdir/fname2
$ ln /mnt/testdir/fname3 /mnt/testdir/fname2
$ touch /mnt/testdir/fname1
$ xfs_io -c "fsync" /mnt/testdir/fname1
<power failure>
$ mount /dev/sdb /mnt
$ umount /mnt
$ btrfs check /dev/sdb
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
root 5 inode 259 errors 2, no orphan item
ERROR: errors found in fs roots
Opening filesystem to check...
Checking filesystem on /dev/sdc
UUID: 20e4abb8-5a19-4492-8bb4-6084125c2d0d
found 393216 bytes used, error(s) found
total csum bytes: 0
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 122986
file data blocks allocated: 262144
referenced 262144
On a kernel without the first patch in this series, titled
"[PATCH] Btrfs: fix fsync after succession of renames of different files",
we get instead an error when mounting the filesystem due to failure of
replaying the log:
$ mount /dev/sdb /mnt
mount: mount /dev/sdb on /mnt failed: File exists
Fix this by logging the parent directory of an inode whenever we find an
inode that no longer exists (was unlinked in the current transaction),
during the procedure which finds inodes that have old names that collide
with new names of other inodes.
A test case for fstests follows soon.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After a succession of rename operations of different files and fsyncing
one of them, such that each file gets a new name that corresponds to an
old name of another file, we can end up with a log that will cause a
failure when attempted to replay at mount time (an EEXIST error).
We currently have correct behaviour when such succession of renames
involves only two files, but if there are more files involved, we end up
not logging all the inodes that are needed, therefore resulting in a
failure when attempting to replay the log.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/testdir
$ touch /mnt/testdir/fname1
$ touch /mnt/testdir/fname2
$ sync
$ mv /mnt/testdir/fname1 /mnt/testdir/fname3
$ mv /mnt/testdir/fname2 /mnt/testdir/fname4
$ ln /mnt/testdir/fname3 /mnt/testdir/fname2
$ touch /mnt/testdir/fname1
$ xfs_io -c "fsync" /mnt/testdir/fname1
<power failure>
$ mount /dev/sdb /mnt
mount: mount /dev/sdb on /mnt failed: File exists
So fix this by checking all inode dependencies when logging an inode. That
is, if one logged inode A has a new name that matches the old name of some
other inode B, check if inode B has a new name that matches the old name
of some other inode C, and so on. This fix is implemented not by doing any
recursive function calls but by using an iterative method using a linked
list that is used in a first-in-first-out fashion.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qgroups will do the old roots lookup at delayed ref time, which could be
while walking down the extent root while running a delayed ref. This
should be fine, except we specifically lock eb's in the backref walking
code irrespective of path->skip_locking, which deadlocks the system.
Fix up the backref code to honor path->skip_locking, nobody will be
modifying the commit_root when we're searching so it's completely safe
to do.
This happens since fb235dc06f ("btrfs: qgroup: Move half of the qgroup
accounting time out of commit trans"), kernel may lockup with quota
enabled.
There is one backref trace triggered by snapshot dropping along with
write operation in the source subvolume. The example can be reliably
reproduced:
btrfs-cleaner D 0 4062 2 0x80000000
Call Trace:
schedule+0x32/0x90
btrfs_tree_read_lock+0x93/0x130 [btrfs]
find_parent_nodes+0x29b/0x1170 [btrfs]
btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
btrfs_find_all_roots+0x57/0x70 [btrfs]
btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
do_walk_down+0x541/0x5e3 [btrfs]
walk_down_tree+0xab/0xe7 [btrfs]
btrfs_drop_snapshot+0x356/0x71a [btrfs]
btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
cleaner_kthread+0x12b/0x160 [btrfs]
kthread+0x112/0x130
ret_from_fork+0x27/0x50
When dropping snapshots with qgroup enabled, we will trigger backref
walk.
However such backref walk at that timing is pretty dangerous, as if one
of the parent nodes get WRITE locked by other thread, we could cause a
dead lock.
For example:
FS 260 FS 261 (Dropped)
node A node B
/ \ / \
node C node D node E
/ \ / \ / \
leaf F|leaf G|leaf H|leaf I|leaf J|leaf K
The lock sequence would be:
Thread A (cleaner) | Thread B (other writer)
-----------------------------------------------------------------------
write_lock(B) |
write_lock(D) |
^^^ called by walk_down_tree() |
| write_lock(A)
| write_lock(D) << Stall
read_lock(H) << for backref walk |
read_lock(D) << lock owner is |
the same thread A |
so read lock is OK |
read_lock(A) << Stall |
So thread A hold write lock D, and needs read lock A to unlock.
While thread B holds write lock A, while needs lock D to unlock.
This will cause a deadlock.
This is not only limited to snapshot dropping case. As the backref
walk, even only happens on commit trees, is breaking the normal top-down
locking order, makes it deadlock prone.
Fixes: fb235dc06f ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
CC: stable@vger.kernel.org # 4.14+
Reported-and-tested-by: David Sterba <dsterba@suse.com>
Reported-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ rebase to latest branch and fix lock assert bug in btrfs/007 ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ copy logs and deadlock analysis from Qu's patch ]
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Btrfs qgroup will still hit EDQUOT under the following case:
$ dev=/dev/test/test
$ mnt=/mnt/btrfs
$ umount $mnt &> /dev/null
$ umount $dev &> /dev/null
$ mkfs.btrfs -f $dev
$ mount $dev $mnt -o nospace_cache
$ btrfs subv create $mnt/subv
$ btrfs quota enable $mnt
$ btrfs quota rescan -w $mnt
$ btrfs qgroup limit -e 1G $mnt/subv
$ fallocate -l 900M $mnt/subv/padding
$ sync
$ rm $mnt/subv/padding
# Hit EDQUOT
$ xfs_io -f -c "pwrite 0 512M" $mnt/subv/real_file
[CAUSE]
Since commit a514d63882 ("btrfs: qgroup: Commit transaction in advance
to reduce early EDQUOT"), btrfs is not forced to commit transaction to
reclaim more quota space.
Instead, we just check pertrans metadata reservation against some
threshold and try to do asynchronously transaction commit.
However in above case, the pertrans metadata reservation is pretty small
thus it will never trigger asynchronous transaction commit.
[FIX]
Instead of only accounting pertrans metadata reservation, we calculate
how much free space we have, and if there isn't much free space left,
commit transaction asynchronously to try to free some space.
This may slow down the fs when we have less than 32M free qgroup space,
but should reduce a lot of false EDQUOT, so the cost should be
acceptable.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Btrfs/139 will fail with a high probability if the testing machine (VM)
has only 2G RAM.
Resulting the final write success while it should fail due to EDQUOT,
and the fs will have quota exceeding the limit by 16K.
The simplified reproducer will be: (needs a 2G ram VM)
$ mkfs.btrfs -f $dev
$ mount $dev $mnt
$ btrfs subv create $mnt/subv
$ btrfs quota enable $mnt
$ btrfs quota rescan -w $mnt
$ btrfs qgroup limit -e 1G $mnt/subv
$ for i in $(seq -w 1 8); do
xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null
echo "file $i written" > /dev/kmsg
done
$ sync
$ btrfs qgroup show -pcre --raw $mnt
The last pwrite will not trigger EDQUOT and final 'qgroup show' will
show something like:
qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 16384 16384 none none --- ---
0/256 1073758208 1073758208 none 1073741824 --- ---
And 1073758208 is larger than
> 1073741824.
[CAUSE]
It's a bug in btrfs qgroup data reserved space management.
For quota limit, we must ensure that:
reserved (data + metadata) + rfer/excl <= limit
Since rfer/excl is only updated at transaction commmit time, reserved
space needs to be taken special care.
One important part of reserved space is data, and for a new data extent
written to disk, we still need to take the reserved space until
rfer/excl numbers get updated.
Originally when an ordered extent finishes, we migrate the reserved
qgroup data space from extent_io tree to delayed ref head of the data
extent, expecting delayed ref will only be cleaned up at commit
transaction time.
However for small RAM machine, due to memory pressure dirty pages can be
flushed back to disk without committing a transaction.
The related events will be something like:
file 1 written
btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
cleanup_ref_head: num_bytes=54947840
cleanup_ref_head: num_bytes=5636096
cleanup_ref_head: num_bytes=569344
cleanup_ref_head: num_bytes=57344
cleanup_ref_head: num_bytes=8192
^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
file 2 written
...
file 8 written
cleanup_ref_head: num_bytes=8192
...
btrfs_commit_transaction <<< the only transaction committed during
the test
When file 2 is written, we have already freed 128M reserved qgroup data
space for ino 258. Thus later write won't trigger EDQUOT.
This allows us to write more data beyond qgroup limit.
In my 2G ram VM, it could reach about 1.2G before hitting EDQUOT.
[FIX]
By moving reserved qgroup data space from btrfs_delayed_ref_head to
btrfs_qgroup_extent_record, we can ensure that reserved qgroup data
space won't be freed half way before commit transaction, thus fix the
problem.
Fixes: f64d5ca868 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The member btrfs_fs_info::scrub_nocow_workers is unused since the nocow
optimization was removed from scrub in 9bebe665c3 ("btrfs: scrub:
Remove unused copy_nocow_pages and its callchain").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The scrub worker pointers are not NULL iff the scrub is running, so
reset them back once the last reference is dropped. Add assertions to
the initial phase of scrub to verify that.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the refcount_t for fs_info::scrub_workers_refcnt instead of int so
we get the extra checks. All reference changes are still done under
scrub_lock.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
scrub_workers_refcnt is protected by scrub_lock, add lockdep_assert_held()
in scrub_workers_get().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have killed volume mutex (commit: dccdb07bc9
btrfs: kill btrfs_fs_info::volume_mutex). This a trival one seems to have
escaped.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to forward declare flush_write_bio(), as it only
depends on submit_one_bio(). Both of them are pretty small, just move
them to kill the forward declaration.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variables and function parameters of __etree_search which pertain to
prev/next are grossly misnamed. Namely, prev_ret holds the next state
and not the previous. Similarly, next_ret actually holds the previous
extent state relating to the offset we are interested in. Fix this by
renaming the variables as well as switching the arguments order. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the refactoring introduced in 8b62f87bad ("Btrfs: reworki
outstanding_extents") this flag became unused. Remove it and renumber
the following flags accordingly. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no point in using a construct like 'if (!condition)
WARN_ON(1)'. Use WARN_ON(!condition) directly. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We could generate a lot of delayed refs in evict but never have any left
over space from our block rsv to make up for that fact. So reserve some
extra space and give it to the transaction so it can be used to refill
the delayed refs rsv every loop through the truncate path.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For FLUSH_LIMIT flushers we really can only allocate chunks and flush
delayed inode items, everything else is problematic. I added a bunch of
new states and it lead to weirdness in the FLUSH_LIMIT case because I
forgot about how it worked. So instead explicitly declare the states
that are ok for flushing with FLUSH_LIMIT and use that for our state
machine. Then as we add new things that are safe we can just add them
to this list.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With severe fragmentation we can end up with our inode rsv size being
huge during writeout, which would cause us to need to make very large
metadata reservations.
However we may not actually need that much once writeout is complete,
because of the over-reservation for the worst case.
So instead try to make our reservation, and if we couldn't make it
re-calculate our new reservation size and try again. If our reservation
size doesn't change between tries then we know we are actually out of
space and can error. Flushing that could have been running in parallel
did not make any space.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ rename to calc_refill_bytes, update comment and changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
With the introduction of the per-inode block_rsv it became possible to
have really really large reservation requests made because of data
fragmentation. Since the ticket stuff assumed that we'd always have
relatively small reservation requests it just killed all tickets if we
were unable to satisfy the current request.
However, this is generally not the case anymore. So fix this logic to
instead see if we had a ticket that we were able to give some
reservation to, and if we were continue the flushing loop again.
Likewise we make the tickets use the space_info_add_old_bytes() method
of returning what reservation they did receive in hopes that it could
satisfy reservations down the line.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've done this forever because of the voodoo around knowing how much
space we have. However, we have better ways of doing this now, and on
normal file systems we'll easily have a global reserve of 512MiB, and
since metadata chunks are usually 1GiB that means we'll allocate
metadata chunks more readily. Instead use the actual used amount when
determining if we need to allocate a chunk or not.
This has a side effect for mixed block group fs'es where we are no
longer allocating enough chunks for the data/metadata requirements. To
deal with this add a ALLOC_CHUNK_FORCE step to the flushing state
machine. This will only get used if we've already made a full loop
through the flushing machinery and tried committing the transaction.
If we have then we can try and force a chunk allocation since we likely
need it to make progress. This resolves issues I was seeing with
the mixed bg tests in xfstests without the new flushing state.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ merged with patch "add ALLOC_CHUNK_FORCE to the flushing code" ]
Signed-off-by: David Sterba <dsterba@suse.com>
For enospc_debug having the block rsvs is super helpful to see if we've
done something wrong.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
may_commit_transaction will skip committing the transaction if we don't
have enough pinned space or if we're trying to find space for a SYSTEM
chunk. However, if we have pending free block groups in this transaction
we still want to commit as we may be able to allocate a chunk to make
our reservation. So instead of just returning ENOSPC, check if we have
free block groups pending, and if so commit the transaction to allow us
to use that free space.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Zstd compression requires different amounts of memory for each level of
compression. The prior patches implemented indirection to allow for each
compression type to manage their workspaces independently. This patch
uses this indirection to implement compression level support for zstd.
To manage the additional memory require, each compression level has its
own queue of workspaces. A global LRU is used to help with reclaim.
Reclaim is done via a timer which provides a mechanism to decrease
memory utilization by keeping only workspaces around that are sized
appropriately. Forward progress is guaranteed by a preallocated max
workspace hidden from the LRU.
When getting a workspace, it uses a bitmap to identify the levels that
are populated and scans up. If it finds a workspace that is greater than
it, it uses it, but does not update the last_used time and the
corresponding place in the LRU. If we hit memory pressure, we sleep on
the max level workspace. We continue to rescan in case we can use a
smaller workspace, but eventually should be able to obtain the max level
workspace or allocate one again should memory pressure subside.
The memory requirement for decompression is the same as level 1, and
therefore can use any of available workspace.
The number of workspaces is bound by an upper limit of the workqueue's
limit which currently is 2 (percpu limit). The reclaim timer is used to
free inactive/improperly sized workspaces and is set to 307s to avoid
colliding with transaction commit (every 30s).
Repeating the experiment from v2 [1], the Silesia corpus was copied to a
btrfs filesystem 10 times and then read back after dropping the caches.
The btrfs filesystem was on an SSD.
Level Ratio Compression (MB/s) Decompression (MB/s) Memory (KB)
1 2.658 438.47 910.51 780
2 2.744 364.86 886.55 1004
3 2.801 336.33 828.41 1260
4 2.858 286.71 886.55 1260
5 2.916 212.77 556.84 1388
6 2.363 119.82 990.85 1516
7 3.000 154.06 849.30 1516
8 3.011 159.54 875.03 1772
9 3.025 100.51 940.15 1772
10 3.033 118.97 616.26 1772
11 3.036 94.19 802.11 1772
12 3.037 73.45 931.49 1772
13 3.041 55.17 835.26 2284
14 3.087 44.70 716.78 2547
15 3.126 37.30 878.84 2547
[1] https://lore.kernel.org/linux-btrfs/20181031181108.289340-1-terrelln@fb.com/
Cc: Nick Terrell <terrelln@fb.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is possible based on the level configurations that a higher level
workspace uses less memory than a lower level workspace. In order to
reuse workspaces, this must be made a monotonic relationship. This
precomputes the required memory for each level and enforces the
monotonicity between level and memory required. This is also done
in upstream zstd in [1].
[1] a68b76afef
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Zstd currently only supports the default level of compression. This
patch switches to using the level passed in for btrfs zstd
configuration.
Zstd workspaces now keep track of the requested level as this can differ
from the size of the workspace.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, the only user of set_level() is zlib which sets an internal
workspace parameter. As level is now plumbed into get_workspace(), this
can be handled there rather than separately.
This repurposes set_level() to bound the level passed in so it can be
used when setting the mounts compression level and as well as verifying
the level before getting a workspace. The other benefit is this divides
the meaning of compress(0) and get_workspace(0). The former means we
want to use the default compression level of the compression type. The
latter means we can use any workspace available.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Zlib compression supports multiple levels, but doesn't require changing
in how a workspace itself is created and managed. Zstd introduces a
different memory requirement such that higher levels of compression
require more memory.
This requires changes in how the alloc()/get() methods work for zstd.
This pach plumbs compression level through the interface as a parameter
in preparation for zstd compression levels. This gives the compression
types opportunity to create/manage based on the compression level.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The previous patch added generic helpers for get_workspace() and
put_workspace(). Now, we can migrate ownership of the workspace_manager
to be in the compression type code as the compression code itself
doesn't care beyond being able to get a workspace. The init/cleanup and
get/put methods are abstracted so each compression algorithm can decide
how they want to manage their workspaces.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two levels of workspace management. First, alloc()/free()
which are responsible for actually creating and destroy workspaces.
Second, at a higher level, get()/put() which is the compression code
asking for a workspace from a workspace_manager.
The compression code shouldn't really care how it gets a workspace, but
that it got a workspace. This adds get_workspace() and put_workspace()
to be the higher level interface which is responsible for indexing into
the appropriate compression type. It also introduces
btrfs_put_workspace() and btrfs_get_workspace() to be the generic
implementations of the higher interface.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Workspace manager init and cleanup code is open coded inside a for loop
over the compression types. This forces each compression type to rely on
the same workspace manager implementation. This patch creates helper
methods that will be the generic implementation for btrfs workspace
management.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make the workspace_manager own the interface operations rather than
managing index-paired arrays for the workspace_manager and compression
operations.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While the heuristic workspaces aren't really compression workspaces,
they use the same interface for managing them. So rather than branching,
let's just handle them once again as the index 0 compression type.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is in preparation for zstd compression levels. As each level will
require different size of workspace, workspaces_list is no longer a
really fitting name.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is very easy to miss places that rely on a certain bitshifting for
decoding the type_level overloading. Add helpers to do this instead.
Cc: Omar Sandoval <osandov@osandov.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Support for a new command that can be used eg. as a command
$ btrfs device scan --forget [dev]'
(the final name may change though)
to undo the effects of 'btrfs device scan [dev]'. For this purpose
this patch proposes to use ioctl #5 as it was empty and is next to the
SCAN ioctl.
The new ioctl BTRFS_IOC_FORGET_DEV works only on the control device
(/dev/btrfs-control) to unregister one or all devices, devices that are
not mounted.
The argument is struct btrfs_ioctl_vol_args, ::name specifies the device
path. To unregister all device, the path is an empty string.
Again, the devices are removed only if they aren't part of a mounte
filesystem.
This new ioctl provides:
- release of unwanted btrfs_fs_devices and btrfs_devices structures
from memory if the device is not going to be mounted
- ability to mount filesystem in degraded mode, when one devices is
corrupted like in split brain raid1
- running test cases which would require reloading the kernel module
but this is not possible eg. due to mounted filesystem or built-in
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The throttle path doesn't take cleaner_delayed_iput_mutex, which means
we could think we're done flushing iputs in the data space reservation
path when we could have a throttler doing an iput. There's no real
reason to serialize the delayed iput flushing, so instead of taking the
cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
replace it with an atomic counter and a waitqueue. This removes the
short (or long depending on how big the inode is) window where we think
there are no more pending iputs when there really are some.
The waiting is killable as it could be indirectly called from user
operations like fallocate or zero-range. Such call sites should handle
the error but otherwise it's not necessary. Eg. flush_space just needs
to attempt to make space by waiting on iputs.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add killable comment and changelog parts ]
Signed-off-by: David Sterba <dsterba@suse.com>
Since inc_block_group_ro() would return -ENOSPC, outputting debug info
for enospc_debug mount option would be helpful to debug some balance
false ENOSPC report.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside qgroup_rsv_add/release(), we have trace events
trace_qgroup_update_reserve() to catch reserved space update.
However we still have two manual trace_qgroup_update_reserve() calls
just outside these functions. Remove these duplicated calls.
Fixes: 64ee4e751a ("btrfs: qgroup: Update trace events to use new separate rsv types")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A compiler warning (in a patch in development) pointed to a variable
that was used only inside and ASSERT:
u64 root_objectid = root->root_key.objectid;
ASSERT(root_objectid == ...);
fs/btrfs/relocation.c: In function ‘insert_dirty_subv’:
fs/btrfs/relocation.c:2138:6: warning: unused variable ‘root_objectid’ [-Wunused-variable]
u64 root_objectid = root->root_key.objectid;
^~~~~~~~~~~~~
When CONFIG_BRTFS_ASSERT isn't enabled, variable root_objectid isn't used.
Rework the assertion helper by adding a runtime check instead of the
'#ifdef CONFIG_BTRFS_ASSERT #else ...", so the compiler sees the
condition being passed into an inline function after preprocessing.
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The last caller that does not have a fixed value of lock is
btrfs_set_path_blocking, that actually does the same conditional swtich
by the lock type so we can merge the branches together and remove the
helper.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, the number of readers and writers is checked and in case
there are any, wait and redo the locks. There's some duplication
before the branches go back to again label, eg. calling wait_event on
blocking_readers twice.
The sequence is transformed
loop:
* wait for readers
* wait for writers
* write_lock
* check readers, unlock and wait for readers, loop
* check writers, unlock and wait for writers, loop
The new sequence is not exactly the same due to the simplification, for
readers it's slightly faster. For the writers, original code does
* wait for writers
* (loop) wait for readers
* wait for writers -- again
while the new goes directly to the reader check. This should behave the
same on a contended lock with multiple writers and readers, but can
reduce number of times we're waiting on something.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_set_lock_blocking is now only a simple wrapper around
btrfs_set_lock_blocking_write. The name does not bring any semantic
value that could not be inferred from the new function so there's no
point keeping it.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use the right helper where the lock type is a fixed parameter.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
There are many callers that hardcode the desired lock type so we can
avoid the switch and call them directly. Split the current function to
two. There are no remaining users of btrfs_clear_lock_blocking_rw so
it's removed. The call sites will be converted in followup patches.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
There are many callers that hardcode the desired lock type so we can
avoid the switch and call them directly. Split the current function to
two but leave a helper that still takes the variable lock type to make
current code compile. The call sites will be converted in followup
patches.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Since it's replaced by new delayed subtree swap code, remove the
original code.
The cleanup is small since most of its core function is still used by
delayed subtree swap trace.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before this patch, qgroup code traces the whole subtree of subvolume and
reloc trees unconditionally.
This makes qgroup numbers consistent, but it could cause tons of
unnecessary extent tracing, which causes a lot of overhead.
However for subtree swap of balance, just swap both subtrees because
they contain the same contents and tree structure, so qgroup numbers
won't change.
It's the race window between subtree swap and transaction commit could
cause qgroup number change.
This patch will delay the qgroup subtree scan until COW happens for the
subtree root.
So if there is no other operations for the fs, balance won't cause extra
qgroup overhead. (best case scenario)
Depending on the workload, most of the subtree scan can still be
avoided.
Only for worst case scenario, it will fall back to old subtree swap
overhead. (scan all swapped subtrees)
[[Benchmark]]
Hardware:
VM 4G vRAM, 8 vCPUs,
disk is using 'unsafe' cache mode,
backing device is SAMSUNG 850 evo SSD.
Host has 16G ram.
Mkfs parameter:
--nodesize 4K (To bump up tree size)
Initial subvolume contents:
4G data copied from /usr and /lib.
(With enough regular small files)
Snapshots:
16 snapshots of the original subvolume.
each snapshot has 3 random files modified.
balance parameter:
-m
So the content should be pretty similar to a real world root fs layout.
And after file system population, there is no other activity, so it
should be the best case scenario.
| v4.20-rc1 | w/ patchset | diff
-----------------------------------------------------------------------
relocated extents | 22615 | 22457 | -0.1%
qgroup dirty extents | 163457 | 121606 | -25.6%
time (sys) | 22.884s | 18.842s | -17.6%
time (real) | 27.724s | 22.884s | -17.5%
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To allow delayed subtree swap rescan, btrfs needs to record per-root
information about which tree blocks get swapped. This patch introduces
the required infrastructure.
The designed workflow will be:
1) Record the subtree root block that gets swapped.
During subtree swap:
O = Old tree blocks
N = New tree blocks
reloc tree subvolume tree X
Root Root
/ \ / \
NA OB OA OB
/ | | \ / | | \
NC ND OE OF OC OD OE OF
In this case, NA and OA are going to be swapped, record (NA, OA) into
subvolume tree X.
2) After subtree swap.
reloc tree subvolume tree X
Root Root
/ \ / \
OA OB NA OB
/ | | \ / | | \
OC OD OE OF NC ND OE OF
3a) COW happens for OB
If we are going to COW tree block OB, we check OB's bytenr against
tree X's swapped_blocks structure.
If it doesn't fit any, nothing will happen.
3b) COW happens for NA
Check NA's bytenr against tree X's swapped_blocks, and get a hit.
Then we do subtree scan on both subtrees OA and NA.
Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
Then no matter what we do to subvolume tree X, qgroup numbers will
still be correct.
Then NA's record gets removed from X's swapped_blocks.
4) Transaction commit
Any record in X's swapped_blocks gets removed, since there is no
modification to swapped subtrees, no need to trigger heavy qgroup
subtree rescan for them.
This will introduce 128 bytes overhead for each btrfs_root even qgroup
is not enabled. This is to reduce memory allocations and potential
failures.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor btrfs_qgroup_trace_subtree_swap() into
qgroup_trace_subtree_swap(), which only needs two extent buffer and some
other bool to control the behavior.
This provides the basis for later delayed subtree scan work.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Relocation code will drop btrfs_root::reloc_root as soon as
merge_reloc_root() finishes.
However later qgroup code will need to access btrfs_root::reloc_root
after merge_reloc_root() for delayed subtree rescan.
So alter the timming of resetting btrfs_root:::reloc_root, make it
happens after transaction commit.
With this patch, we will introduce a new btrfs_root::state,
BTRFS_ROOT_DEAD_RELOC_TREE, to info part of btrfs_root::reloc_tree user
that although btrfs_root::reloc_tree is still non-NULL, but still it's
not used any more.
The lifespan of btrfs_root::reloc tree will become:
Old behavior | New
------------------------------------------------------------------------
btrfs_init_reloc_root() --- | btrfs_init_reloc_root() ---
set reloc_root | | set reloc_root |
| | |
| | |
merge_reloc_root() | | merge_reloc_root() |
|- btrfs_update_reloc_root() --- | |- btrfs_update_reloc_root() -+-
clear btrfs_root::reloc_root | set ROOT_DEAD_RELOC_TREE |
| record root into dirty |
| roots rbtree |
| |
| reloc_block_group() Or |
| btrfs_recover_relocation() |
| | After transaction commit |
| |- clean_dirty_subvols() ---
| clear btrfs_root::reloc_root
During ROOT_DEAD_RELOC_TREE set lifespan, the only user of
btrfs_root::reloc_tree should be qgroup.
Since reloc root needs a longer life-span, this patch will also delay
btrfs_drop_snapshot() call.
Now btrfs_drop_snapshot() is called in clean_dirty_subvols().
This patch will increase the size of btrfs_root by 16 bytes.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The first thing we do is loop through the list, this
if (!list_empty())
btrfs_create_pending_block_groups();
thing is just wasted space.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of open coding this stuff use the helper instead.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this open coded in btrfs_destroy_delayed_refs, use the helper
instead.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The kernel log messages help debugging and audit, add them for scrub
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The workqueue name is constructed from a format string but the prefix
does not need to be set by %s.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Both btrfs_find_device() and find_device() does the same thing except
that the latter does not take the seed device onto account in the device
scanning context. We can merge them.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory patch to add ioctl that allows to forget a device (ie.
reverse of scan).
Refactors btrfs_free_stale_devices() to obtain return status. As this
function can fail if it can't find the given path (returns -ENOENT) or
trying to delete a mounted device (returns -EBUSY).
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_find_device() accepts fs_info as an argument and retrieves
fs_devices from fs_info.
Instead use fs_devices, so that this function can be used in non-mount
(during device scanning) context as well.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_find_device_by_devspec() finds the device by @devid or by
@device_path. This patch makes code flow easy to read by open coding the
else part and renames devpath to device_path.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_find_device_missing_or_by_path() is relatively small function, and
its only parent btrfs_find_device_by_devspec() is small as well. Besides
there are a number of find_device functions. Merge
btrfs_find_device_missing_or_by_path() into its parent.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to avoid duplicating init code for em there is an additional
label, not_found_em, which is used to only set ->block_start. The only
case when it will be used is if the extent we are adding overlaps with
an existing extent. Make that case more obvious by:
1. Adding a comment hinting at what's going on
2. Assigning EXTENT_MAP_HOLE and directly going to insert.
No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Core btree functions in btrfs generally return 0 when an item is found,
1 in case the sought item cannot be found and <0 when an error happens.
Consolidate the checks for those conditions in one 'if () {} else if ()
{}' construct rather than 2 separate 'if () {}' statements. This
emphasizes that the handling code pertains to a single function. No
functional changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
found_type really holds the type of extent and is guaranteed to to have
a value between [0, 2]. The only time it can contain anything different
is if btrfs_lookup_file_extent returned a positive value and the
previous item is different than an extent. Avoid this situation by
simply checking found_key.type rather than assigning the item type to
found_type intermittently. Also make the variable an u8 to reduce stack
usage. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the check that verifies if both inodes have checksums disabled or
both have them enabled, from the clone and deduplication functions into
the new common helper btrfs_remap_file_range_prep().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can never have extents marked as EXTENT_MAP_DELALLOC since this
value is only ever used by btrfs_get_extent_fiemap. In this case the
extent map is created by btrfs_get_extent_fiemap and is never really
published, this flag is used to return the corresponding userspace one.
Considering this, it's pointless having a check for EXTENT_MAP_DELALLOC
in mergable_maps. Just remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the call to btrfs_balance() failed we would overwrite the error
returned to user space with -EFAULT if the call to copy_to_user() failed
as well. Fix that by calling copy_to_user() only if btrfs_balance()
returned success or was canceled.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the call to btrfs_dev_replace_by_ioctl() failed we would overwrite the
error returned to user space with -EFAULT if the call to copy_to_user()
failed as well. Fix that by calling copy_to_user() only if no error
happened before or a device replace operation was canceled.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Checking if either of the inodes corresponds to a swapfile is already
performed by generic_remap_file_range_prep(), so we do not need to do
it in the btrfs clone and deduplication functions.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a couple of comments regarding the logic flow in shrink_delalloc.
Then, cease using max_reclaim as a temporary variable when calculating
nr_pages. Finally give max_reclaim a more becoming name, which
uneqivocally shows at what this variable really holds. No functional
changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a comment explaining when ->inode could be NULL and why we always
perform the ->async_delalloc_pages modification.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can never trigger since before calling alloc_delalloc_work we have
called igrab in start_delalloc_inodes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ihold is supposed to be used when the caller already has a reference to
the inode. In the case of cow_file_range_async this invariants holds,
since the 3 call chains leading to this function all take a reference:
btrfs_writepage <--- does igrab
extent_write_full_page
__extent_writepage
writepage_delalloc
btrfs_run_delalloc_range
cow_file_range_async
extent_write_cache_pages <--- does igrab
__extent_writepage (same callchain as above)
and
submit_compressed_extents <-- already called from async CoW submit path,
which would have done ihold.
extent_write_locked_range
__extent_writepage
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
It's used only once so just inline the call to i_size_read. The
semantics regarding the inode size are not changed, the pages in the
range are locked and i_size cannot change between the time it was set
and used.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already pass the async_cow struct that holds a reference to the
inode. Exploit this fact and remove the extra inode argument. No
functional changes.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/btrfs/ioctl.c: In function 'btrfs_extent_same':
fs/btrfs/ioctl.c:3260:6: warning:
variable 'num_pages' set but not used [-Wunused-but-set-variable]
It not used any more since commit 9ee8234e6220 ("Btrfs: use
generic_remap_file_range_prep() for cloning and deduplication")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David Sterba <dsterba@suse.com>
hole_len is only used if the hole falls within the requested range. Make
that explicitly clear by only assigning in the corresponding branch.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make btrfs_get_extent_fiemap a bit more friendly. First step is to
rename the closely related, yet arbitrary named
range_start/found_end/found variables. They define the delalloc range
that is found in case a real extent wasn't found. Subsequently remove
an unnecessary check for hole_em since it's guaranteed to be set i.e the
check is always true. Top it off by giving all comments a refresh.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformatted a few more comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
This function is a simple wrapper over btrfs_get_extent that returns
either:
a) A real extent in the passed range or
b) Adjusted extent based on whether delalloc bytes are found backing up
a hole.
To support these semantics it doesn't need the page/pg_offset/create
arguments which are passed to btrfs_get_extent in case an extent is to
be created. So simplify the function by removing the unused arguments.
No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are holding a transaction handle when setting an acl, therefore we can
not allocate the xattr value buffer using GFP_KERNEL, as we could deadlock
if reclaim is triggered by the allocation, therefore setup a nofs context.
Fixes: 39a27ec100 ("btrfs: use GFP_KERNEL for xattr and acl allocations")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are holding a transaction handle when creating a tree, therefore we can
not allocate the root using GFP_KERNEL, as we could deadlock if reclaim is
triggered by the allocation, therefore setup a nofs context.
Fixes: 74e4d82757 ("btrfs: let callers of btrfs_alloc_root pass gfp flags")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the call to btrfs_get_dev_stats() failed we would overwrite the error
returned to user space with -EFAULT if the call to copy_to_user() failed
as well. Fix that by calling copy_to_user() only if btrfs_get_dev_stats()
returned success.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the call to btrfs_scrub_progress() failed we would overwrite the error
returned to user space with -EFAULT if the call to copy_to_user() failed
as well. Fix that by calling copy_to_user() only if btrfs_scrub_progress()
returned success.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If scrub returned an error and then the copy_to_user() call did not
succeed, we would overwrite the error returned by scrub with -EFAULT.
Fix that by calling copy_to_user() only if btrfs_scrub_dev() returned
success.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since this function is no longer a callback there is no need to have
its first argument obfuscated with a void *. Change it directly to a
pointer to an inode. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Drop LIST_HEAD where the variable it declares is never used.
The uses were removed in 3fd0a5585e ("Btrfs: Metadata ENOSPC
handling for balance"), but not the declaration.
The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
identifier x;
@@
- LIST_HEAD(x);
... when != x
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxgqNUeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwsoH+OVXu0NQofwTvVru
8lgF3BSDG2mhf7mxbBBlBizGVy9jnjRNGCFMC+Jq8IwiFLwprja/G27kaDTkpuF1
PHC3yfjKvjTeUP5aNdHlmxv6j1sSJfZl0y46DQal4UeTG/Giq8TFTi+Tbz7Wb/WV
yCx4Lr8okAwTuNhnL8ojUCVIpd3c8QsyR9v6nEQ14Mj+MvEbokyTkMJV0bzOrM38
JOB+/X1XY4JPZ6o3MoXrBca3bxbAJzMneq+9CWw1U5eiIG3msg4a+Ua3++RQMDNr
8BP0yCZ6wo32S8uu0PI6HrZaBnLYi5g9Wh7Q7yc0mn1Uh1zWFykA6TtqK90agJeR
A6Ktjw==
=scY4
-----END PGP SIGNATURE-----
Merge tag 'v5.0-rc6' into for-5.1/block
Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c.
This is needed since io_uring is now based on the block branch,
to avoid a conflict between the multi-page bvecs and the bits
of io_uring that touch the core block parts.
* tag 'v5.0-rc6': (525 commits)
Linux 5.0-rc6
x86/mm: Make set_pmd_at() paravirt aware
MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
blk-mq: remove duplicated definition of blk_mq_freeze_queue
Blk-iolatency: warn on negative inflight IO counter
blk-iolatency: fix IO hang due to negative inflight counter
MAINTAINERS: unify reference to xen-devel list
x86/mm/cpa: Fix set_mce_nospec()
futex: Handle early deadlock return correctly
futex: Fix barrier comment
net: dsa: b53: Fix for failure when irq is not defined in dt
blktrace: Show requests without sector
mips: cm: reprime error cause
mips: loongson64: remove unreachable(), fix loongson_poweroff().
sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
geneve: should not call rt6_lookup() when ipv6 was disabled
KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
signal: Better detection of synchronous signals
...
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.
Given it is just one mechannical & simple change on all bio_for_each_segment_all()
users, this patch does tree-wide change in one single patch, so that we can
avoid to use a temporary helper for this conversion.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_readpage_error currently uses bi_vcnt to decide if it is worth
retrying an I/O. But the vector count is mostly an implementation
artifact - it really should figure out if there is more than a
single sector worth retrying. Use bi_size for that and shift by
PAGE_SHIFT. This really should be blocks/sectors, but given that
btrfs doesn't support a sector size different from the PAGE_SIZE
using the page size keeps the changes to a minimum.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlxWtJgACgkQxWXV+ddt
WDsDow//ZpnyDwQWvSIfF2UUQOPlcBjbHKuBA7rU0wdybW635QYGR0mqnI+1VnMj
7ssUkeN6N0a2gQzrUG4Y+zpdzOWv2xQ4jKZ9GMOp9gwyzEFyPkcFXOnmM8UfYtVu
e3fK65e8BZHmTeu0kGKah4Dt1g0t4fUmhsKR4Pfp5YNJC+zuuGTwUW1K/ZQHXJ+3
kTHc7WP1lsF7wgaZ+Gl+Kvp8fVrHVdygMVTdRBW8QaBgPLa/KExvK62jW+NmCYhj
7OIkWdew7e8IXc3Ie5IbOomHAv7IgqqgiO9VO9+n0EpyV4UobUgxrgBKJ+0yc1Ya
eidbKhMslwUE50y00JVm+vw0gwQHkR+hZDn/mRB6xiIeI8tu/yQIJZ6AhYJXoByR
cu8+SNO5Z5dOZ1f146ZH8lnkr24tuSnkDUhbRDR5pdb4tAHHej2ALzhbfbwbPEpF
IverYLw5fOMGeRU/mBsjkVadpSZ4S0HVNU85ERdhLtVLK1PSaY2UkUaA+Ii5y7au
EYDjaGMflmJ8cAVqgtgedEff2n8OKDnzRZlz4IPLI73MVSITGZkM7PmYmYsLm2Bs
mDPnmyqR8kzcd1RRtSeZTvqOpAIZG+QUOmD2jiKrchmp54Sz0V/HJWRs3aQybD6Q
ph0yAcbkvgp/ewe5IFgaI0kcyH7zYdL6GtiI2WUE3/8DObgrgsA=
=E2PP
-----END PGP SIGNATURE-----
Merge tag 'for-5.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- regression fix: transaction commit can run away due to delayed ref
waiting heuristic, this is not necessary now because of the proper
reservation mechanism introduced in 5.0
- regression fix: potential crash due to use-before-check of an ERR_PTR
return value
- fix for transaction abort during transaction commit that needs to
properly clean up pending block groups
- fix deadlock during b-tree node/leaf splitting, when this happens on
some of the fundamental trees, we must prevent new tree block
allocation to re-enter indirectly via the block group flushing path
- potential memory leak after errors during mount
* tag 'for-5.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: On error always free subvol_name in btrfs_mount
btrfs: clean up pending block groups when transaction commit aborts
btrfs: fix potential oops in device_list_add
btrfs: don't end the transaction for delayed refs in throttle
Btrfs: fix deadlock when allocating tree block during leaf/node split
The subvol_name is allocated in btrfs_parse_subvol_options and is
consumed and freed in mount_subvol. Add a free to the error paths that
don't call mount_subvol so that it is guaranteed that subvol_name is
freed when an error happens.
Fixes: 312c89fbca ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()")
Cc: stable@vger.kernel.org # v4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
alloc_fs_devices() can return ERR_PTR(-ENOMEM), so dereferencing its
result before the check for IS_ERR() is a bad idea.
Fixes: d1a6300282 ("btrfs: add members to fs_devices to track fsid changes")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously callers to btrfs_end_transaction_throttle() would commit the
transaction if there wasn't enough delayed refs space. This happens in
relocation, and if the fs is relatively empty we'll run out of delayed
refs space basically immediately, so we'll just be stuck in this loop of
committing the transaction over and over again.
This code existed because we didn't have a good feedback mechanism for
running delayed refs, but with the delayed refs rsv we do now. Delete
this throttling code and let the btrfs_start_transaction() in relocation
deal with putting pressure on the delayed refs infrastructure. With
this patch we no longer take 5 minutes to balance a metadata only fs.
Qu has submitted a fstest to catch slow balance or excessive transaction
commits. Steps to reproduce:
* create subvolume
* create many (eg. 16000) inlined files, of size 2KiB
* iteratively snapshot and touch several files to trigger metadata
updates
* start balance -m
Reported-by: Qu Wenruo <wqu@suse.com>
Fixes: 64403612b7 ("btrfs: rework btrfs_check_space_for_delayed_refs")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add tags and steps to reproduce ]
Signed-off-by: David Sterba <dsterba@suse.com>
When splitting a leaf or node from one of the trees that are modified when
flushing pending block groups (extent, chunk, device and free space trees),
we need to allocate a new tree block, which in turn can result in the need
to allocate a new block group. After allocating the new block group we may
need to flush new block groups that were previously allocated during the
course of the current transaction, which is what may cause a deadlock due
to attempts to write lock twice the same leaf or node, as when splitting
a leaf or node we are holding a write lock on it and its parent node.
The same type of deadlock can also happen when increasing the tree's
height, since we are holding a lock on the existing root while allocating
the tree block to use as the new root node.
An example trace when the deadlock happens during the leaf split path is:
[27175.293054] CPU: 0 PID: 3005 Comm: kworker/u17:6 Tainted: G W 4.19.16 #1
[27175.293942] Hardware name: Penguin Computing Relion 1900/MD90-FS0-ZB-XX, BIOS R15 06/25/2018
[27175.294846] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
(...)
[27175.298384] RSP: 0018:ffffab2087107758 EFLAGS: 00010246
[27175.299269] RAX: 0000000000000bbd RBX: ffff9fadc7141c48 RCX: 0000000000000001
[27175.300155] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff9fadc7141c48
[27175.301023] RBP: 0000000000000001 R08: ffff9faeb6ac1040 R09: ffff9fa9c0000000
[27175.301887] R10: 0000000000000000 R11: 0000000000000040 R12: ffff9fb21aac8000
[27175.302743] R13: ffff9fb1a64d6a20 R14: 0000000000000001 R15: ffff9fb1a64d6a18
[27175.303601] FS: 0000000000000000(0000) GS:ffff9fb21fa00000(0000) knlGS:0000000000000000
[27175.304468] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[27175.305339] CR2: 00007fdc8743ead8 CR3: 0000000763e0a006 CR4: 00000000003606f0
[27175.306220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[27175.307087] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[27175.307940] Call Trace:
[27175.308802] btrfs_search_slot+0x779/0x9a0 [btrfs]
[27175.309669] ? update_space_info+0xba/0xe0 [btrfs]
[27175.310534] btrfs_insert_empty_items+0x67/0xc0 [btrfs]
[27175.311397] btrfs_insert_item+0x60/0xd0 [btrfs]
[27175.312253] btrfs_create_pending_block_groups+0xee/0x210 [btrfs]
[27175.313116] do_chunk_alloc+0x25f/0x300 [btrfs]
[27175.313984] find_free_extent+0x706/0x10d0 [btrfs]
[27175.314855] btrfs_reserve_extent+0x9b/0x1d0 [btrfs]
[27175.315707] btrfs_alloc_tree_block+0x100/0x5b0 [btrfs]
[27175.316548] split_leaf+0x130/0x610 [btrfs]
[27175.317390] btrfs_search_slot+0x94d/0x9a0 [btrfs]
[27175.318235] btrfs_insert_empty_items+0x67/0xc0 [btrfs]
[27175.319087] alloc_reserved_file_extent+0x84/0x2c0 [btrfs]
[27175.319938] __btrfs_run_delayed_refs+0x596/0x1150 [btrfs]
[27175.320792] btrfs_run_delayed_refs+0xed/0x1b0 [btrfs]
[27175.321643] delayed_ref_async_start+0x81/0x90 [btrfs]
[27175.322491] normal_work_helper+0xd0/0x320 [btrfs]
[27175.323328] ? move_linked_works+0x6e/0xa0
[27175.324160] process_one_work+0x191/0x370
[27175.324976] worker_thread+0x4f/0x3b0
[27175.325763] kthread+0xf8/0x130
[27175.326531] ? rescuer_thread+0x320/0x320
[27175.327284] ? kthread_create_worker_on_cpu+0x50/0x50
[27175.328027] ret_from_fork+0x35/0x40
[27175.328741] ---[ end trace 300a1b9f0ac30e26 ]---
Fix this by preventing the flushing of new blocks groups when splitting a
leaf/node and when inserting a new root node for one of the trees modified
by the flushing operation, similar to what is done when COWing a node/leaf
from on of these trees.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202383
Reported-by: Eli V <eliventer@gmail.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlxElHsACgkQxWXV+ddt
WDsF8Q/6A+l/Ku8D+xSmw+JGXGNroX1V62sxVYWqIgrkYNg6iSMjicqF3aN1bCbR
LqvOQ5skerrKteNYIPbTTOD5Xp37ccinSeEWEF0ktFkeU1G6yJo7aRsnQlO1sduk
2PKUOA1/PdeTBiOj1bej/1ybhtIW+d0MaoPtnUCMC8DD/ihfmU332+KC8VmRUYGZ
4kT1DvKfBOSVz1UTl6OJdWo76crvjz0eGVnH1YG7DoFbpVzAbphHE7+aC/WkHQme
X5Ux2NtYVMe/0IGAC7kJnj4jCZ1weAdlmvzzagjzGtWdhWgTxntiJ/FVJs9nO8Dm
G/pVtD8RVjgMkPOWcfw5fdderrBJqjVGgl4VDrDLqjO9OTGNFJs+HgcPRi6Oli28
sA+HG+U34YzSfKY0L9eAmpkNxMjWywBuXTQIAlMhHNZCL0vZ2K5EYa3dTp9OswAW
IIcOh/LfZxiomvMvUqQWcRCy5y/b+cYjOjbHwkrw+ewd3IWXVLG8YLMyZI3vnHKu
/f1xn6KCap9a1cS4LwyK6gzstEugn0MYmnmD/Jx8I1BJFBt55Q31ES6tPHgTmh/d
QjveRjMkxNCql4h5Hq0+LiXoSoocBmsO0wrs2QrWSx4PBnJsvjySnWr/8GfAOj79
BhnuQFxbr/BkNyBzvKrjoI+zZnrVm0cBU59lP6PzN75+kQTaIfs=
=hGlM
-----END PGP SIGNATURE-----
Merge tag 'for-5.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A handful of fixes (some of them in testing for a long time):
- fix some test failures regarding cleanup after transaction abort
- revert of a patch that could cause a deadlock
- delayed iput fixes, that can help in ENOSPC situation when there's
low space and a lot data to write"
* tag 'for-5.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: wakeup cleaner thread when adding delayed iput
btrfs: run delayed iputs before committing
btrfs: wait on ordered extents on abort cleanup
btrfs: handle delayed ref head accounting cleanup in abort
Revert "btrfs: balance dirty metadata pages in btrfs_finish_ordered_io"
The cleaner thread usually takes care of delayed iputs, with the
exception of the btrfs_end_transaction_throttle path. Delaying iputs
means we are potentially delaying the eviction of an inode and it's
respective space. The cleaner thread only gets woken up every 30
seconds, or when we require space. If there are a lot of inodes that
need to be deleted we could induce a serious amount of latency while we
wait for these inodes to be evicted. So instead wakeup the cleaner if
it's not already awake to process any new delayed iputs we add to the
list. If we suddenly need space we will less likely be backed up
behind a bunch of inodes that are waiting to be deleted, and we could
possibly free space before we need to get into the flushing logic which
will save us some latency.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Delayed iputs means we can have final iputs of deleted inodes in the
queue, which could potentially generate a lot of pinned space that could
be free'd. So before we decide to commit the transaction for ENOPSC
reasons, run the delayed iputs so that any potential space is free'd up.
If there is and we freed enough we can then commit the transaction and
potentially be able to make our reservation.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit e73e81b6d0.
This patch causes a few problems:
- adds latency to btrfs_finish_ordered_io
- as btrfs_finish_ordered_io is used for free space cache, generating
more work from btrfs_btree_balance_dirty_nodelay could end up in the
same workque, effectively deadlocking
12260 kworker/u96:16+btrfs-freespace-write D
[<0>] balance_dirty_pages+0x6e6/0x7ad
[<0>] balance_dirty_pages_ratelimited+0x6bb/0xa90
[<0>] btrfs_finish_ordered_io+0x3da/0x770
[<0>] normal_work_helper+0x1c5/0x5a0
[<0>] process_one_work+0x1ee/0x5a0
[<0>] worker_thread+0x46/0x3d0
[<0>] kthread+0xf5/0x130
[<0>] ret_from_fork+0x24/0x30
[<0>] 0xffffffffffffffff
Transaction commit will wait on the freespace cache:
838 btrfs-transacti D
[<0>] btrfs_start_ordered_extent+0x154/0x1e0
[<0>] btrfs_wait_ordered_range+0xbd/0x110
[<0>] __btrfs_wait_cache_io+0x49/0x1a0
[<0>] btrfs_write_dirty_block_groups+0x10b/0x3b0
[<0>] commit_cowonly_roots+0x215/0x2b0
[<0>] btrfs_commit_transaction+0x37e/0x910
[<0>] transaction_kthread+0x14d/0x180
[<0>] kthread+0xf5/0x130
[<0>] ret_from_fork+0x24/0x30
[<0>] 0xffffffffffffffff
And then writepages ends up waiting on transaction commit:
9520 kworker/u96:13+flush-btrfs-1 D
[<0>] wait_current_trans+0xac/0xe0
[<0>] start_transaction+0x21b/0x4b0
[<0>] cow_file_range_inline+0x10b/0x6b0
[<0>] cow_file_range.isra.69+0x329/0x4a0
[<0>] run_delalloc_range+0x105/0x3c0
[<0>] writepage_delalloc+0x119/0x180
[<0>] __extent_writepage+0x10c/0x390
[<0>] extent_write_cache_pages+0x26f/0x3d0
[<0>] extent_writepages+0x4f/0x80
[<0>] do_writepages+0x17/0x60
[<0>] __writeback_single_inode+0x59/0x690
[<0>] writeback_sb_inodes+0x291/0x4e0
[<0>] __writeback_inodes_wb+0x87/0xb0
[<0>] wb_writeback+0x3bb/0x500
[<0>] wb_workfn+0x40d/0x610
[<0>] process_one_work+0x1ee/0x5a0
[<0>] worker_thread+0x1e0/0x3d0
[<0>] kthread+0xf5/0x130
[<0>] ret_from_fork+0x24/0x30
[<0>] 0xffffffffffffffff
Eventually, we have every process in the system waiting on
balance_dirty_pages(), and nobody is able to make progress on page
writeback.
The original patch tried to fix an OOM condition, that happened on 4.4 but no
success reproducing that on later kernels (4.19 and 4.20). This is more likely
a problem in OOM itself.
Link: https://lore.kernel.org/linux-btrfs/20180528054821.9092-1-ethanlien@synology.com/
Reported-by: Chris Mason <clm@fb.com>
CC: stable@vger.kernel.org # 4.18+
CC: ethanlien <ethanlien@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlw7Y68ACgkQxWXV+ddt
WDs+Jw/9Hn71GLb1YNpNUlzJSQHUzbvFtXbtvEVyYeuuLkmjTG4FT2bkGqYhAeC9
2jvaqPaxI0VJ7vNulAaMVZayFwNEQrF9p+z+vzV/Ty2lu0Ep/7Uuwp9KL8X49req
5hEtb5p9Dbj9hXqa8XNOfF2GPDRTQ6D4eknYiNM7MY+aTkvMEtpuZg9kfwXrZ2au
JeFBzZo9SA5nxqrXZg1XlRYttDnOf44h6YJmEFOZOJuAouKcd5I7C8BshCRKDuWo
iJBjYTBTFqjgUgCrn10UM92T19hKufHiDE3WWQ/7zykqtThpKgBFR1opAUcNBJBf
HvOOsYnZcGbxIKOjFpucSpButTmjFcnI83f/7dmZYXUsyIzP/xH51uLiz/CLJqWj
JsRgtgtJ7l5s1M8GkFG6B9Tp89KxHnVKqNC5HyX+4AuFWiIJ+CWWSddGqNSfAIHe
o2ceQRMumvwbVKgfd0AfPcZ9v4sRM/DfwxWXgEQmCSNJikSuUjP9b3D51ttnXQ8c
q6lz7L6nRKK4mgBnfJCmpus3IArdvNwJ7CF8C1RejfRwL824RzH+yHjWdg36nVXB
oaBYKJf8GRRn+3In1W6npr65NxCQERIZV4M89EgST/RK7tW5u0Fotd1KJCmo+ayO
cSRbp16On9G6gBV+qrBs8X65KIWALZpdTycwyFplCU581DXdtnU=
=aCB2
-----END PGP SIGNATURE-----
Merge tag 'for-5.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- two regression fixes in clone/dedupe ioctls, the generic check
callback needs to lock extents properly and wait for io to avoid
problems with writeback and relocation
- fix deadlock when using free space tree due to block group creation
- a recently added check refuses a valid fileystem with seeding device,
make that work again with a quickfix, proper solution needs more
intrusive changes
* tag 'for-5.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: Use real device structure to verify dev extent
Btrfs: fix deadlock when using free space tree due to block group creation
Btrfs: fix race between reflink/dedupe and relocation
Btrfs: fix race between cloning range ending at eof and writeback
[BUG]
Linux v5.0-rc1 will fail fstests/btrfs/163 with the following kernel
message:
BTRFS error (device dm-6): dev extent devid 1 physical offset 13631488 len 8388608 is beyond device boundary 0
BTRFS error (device dm-6): failed to verify dev extents against chunks: -117
BTRFS error (device dm-6): open_ctree failed
[CAUSE]
Commit cf90d884b3 ("btrfs: Introduce mount time chunk <-> dev extent
mapping check") introduced strict check on dev extents.
We use btrfs_find_device() with dev uuid and fs uuid set to NULL, and
only dependent on @devid to find the real device.
For seed devices, we call clone_fs_devices() in open_seed_devices() to
allow us search seed devices directly.
However clone_fs_devices() just populates devices with devid and dev
uuid, without populating other essential members, like disk_total_bytes.
This makes any device returned by btrfs_find_device(fs_info, devid,
NULL, NULL) is just a dummy, with 0 disk_total_bytes, and any dev
extents on the seed device will not pass the device boundary check.
[FIX]
This patch will try to verify the device returned by btrfs_find_device()
and if it's a dummy then re-search in seed devices.
Fixes: cf90d884b3 ("btrfs: Introduce mount time chunk <-> dev extent mapping check")
CC: stable@vger.kernel.org # 4.19+
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When modifying the free space tree we can end up COWing one of its extent
buffers which in turn might result in allocating a new chunk, which in
turn can result in flushing (finish creation) of pending block groups. If
that happens we can deadlock because creating a pending block group needs
to update the free space tree, and if any of the updates tries to modify
the same extent buffer that we are COWing, we end up in a deadlock since
we try to write lock twice the same extent buffer.
So fix this by skipping pending block group creation if we are COWing an
extent buffer from the free space tree. This is a case missed by commit
5ce555578e ("Btrfs: fix deadlock when writing out free space caches").
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202173
Fixes: 5ce555578e ("Btrfs: fix deadlock when writing out free space caches")
CC: stable@vger.kernel.org # 4.18+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The recent rework that makes btrfs' remap_file_range operation use the
generic helper generic_remap_file_range_prep() introduced a race between
relocation and reflinking (for both cloning and deduplication) the file
extents between the source and destination inodes.
This happens because we no longer lock the source range anymore, and we do
not lock it anymore because we wait for direct IO writes and writeback to
complete early on the code path right after locking the inodes, which
guarantees no other file operations interfere with the reflinking. However
there is one exception which is relocation, since it replaces the byte
number of file extents items in the fs tree after locking the range the
file extent items represent. This is a problem because after finding each
file extent to clone in the fs tree, the reflink process copies the file
extent item into a local buffer, releases the search path, inserts new
file extent items in the destination range and then increments the
reference count for the extent mentioned in the file extent item that it
previously copied to the buffer. If right after copying the file extent
item into the buffer and releasing the path the relocation process
updates the file extent item to point to the new extent, the reflink
process ends up creating a delayed reference to increment the reference
count of the old extent, for which the relocation process already created
a delayed reference to drop it. This results in failure to run delayed
references because we will attempt to increment the count of a reference
that was already dropped. This is illustrated by the following diagram:
CPU 1 CPU 2
relocation is running
btrfs_clone_files()
btrfs_clone()
--> finds extent item
in source range
point to extent
at bytenr X
--> copies it into a
local buffer
--> releases path
replace_file_extents()
--> successfully locks the
range represented by
the file extent item
--> replaces disk_bytenr
field in the file
extent item with some
other value Y
--> creates delayed reference
to increment reference
count for extent at
bytenr Y
--> creates delayed reference
to drop the extent at
bytenr X
--> starts transaction
--> creates delayed
reference to
increment extent
at bytenr X
<delayed references are run, due to a transaction
commit for example, and the transaction is aborted
with -EIO because we attempt to increment reference
count for the extent at bytenr X after we freed it>
When this race is hit the running transaction ends up getting aborted with
an -EIO error and a trace like the following is produced:
[ 4382.553858] WARNING: CPU: 2 PID: 3648 at fs/btrfs/extent-tree.c:1552 lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
(...)
[ 4382.556293] CPU: 2 PID: 3648 Comm: btrfs Tainted: G W 4.20.0-rc6-btrfs-next-41 #1
[ 4382.556294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[ 4382.556308] RIP: 0010:lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
(...)
[ 4382.556310] RSP: 0018:ffffac784408f738 EFLAGS: 00010202
[ 4382.556311] RAX: 0000000000000001 RBX: ffff8980673c3a48 RCX: 0000000000000001
[ 4382.556312] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
[ 4382.556312] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[ 4382.556313] R10: 0000000000000001 R11: ffff897f40000000 R12: 0000000000001000
[ 4382.556313] R13: 00000000c224f000 R14: ffff89805de9bd40 R15: ffff8980453f4548
[ 4382.556315] FS: 00007f5e759178c0(0000) GS:ffff89807b300000(0000) knlGS:0000000000000000
[ 4382.563130] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4382.563562] CR2: 00007f2e9789fcbc CR3: 0000000120512001 CR4: 00000000003606e0
[ 4382.564005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4382.564451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4382.564887] Call Trace:
[ 4382.565343] insert_inline_extent_backref+0x55/0xe0 [btrfs]
[ 4382.565796] __btrfs_inc_extent_ref.isra.60+0x88/0x260 [btrfs]
[ 4382.566249] ? __btrfs_run_delayed_refs+0x93/0x1650 [btrfs]
[ 4382.566702] __btrfs_run_delayed_refs+0xa22/0x1650 [btrfs]
[ 4382.567162] btrfs_run_delayed_refs+0x7e/0x1d0 [btrfs]
[ 4382.567623] btrfs_commit_transaction+0x50/0x9c0 [btrfs]
[ 4382.568112] ? _raw_spin_unlock+0x24/0x30
[ 4382.568557] ? block_rsv_release_bytes+0x14e/0x410 [btrfs]
[ 4382.569006] create_subvol+0x3c8/0x830 [btrfs]
[ 4382.569461] ? btrfs_mksubvol+0x317/0x600 [btrfs]
[ 4382.569906] btrfs_mksubvol+0x317/0x600 [btrfs]
[ 4382.570383] ? rcu_sync_lockdep_assert+0xe/0x60
[ 4382.570822] ? __sb_start_write+0xd4/0x1c0
[ 4382.571262] ? mnt_want_write_file+0x24/0x50
[ 4382.571712] btrfs_ioctl_snap_create_transid+0x117/0x1a0 [btrfs]
[ 4382.572155] ? _copy_from_user+0x66/0x90
[ 4382.572602] btrfs_ioctl_snap_create+0x66/0x80 [btrfs]
[ 4382.573052] btrfs_ioctl+0x7c1/0x30e0 [btrfs]
[ 4382.573502] ? mem_cgroup_commit_charge+0x8b/0x570
[ 4382.573946] ? do_raw_spin_unlock+0x49/0xc0
[ 4382.574379] ? _raw_spin_unlock+0x24/0x30
[ 4382.574803] ? __handle_mm_fault+0xf29/0x12d0
[ 4382.575215] ? do_vfs_ioctl+0xa2/0x6f0
[ 4382.575622] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 4382.576020] do_vfs_ioctl+0xa2/0x6f0
[ 4382.576405] ksys_ioctl+0x70/0x80
[ 4382.576776] __x64_sys_ioctl+0x16/0x20
[ 4382.577137] do_syscall_64+0x60/0x1b0
[ 4382.577488] entry_SYSCALL_64_after_hwframe+0x49/0xbe
(...)
[ 4382.578837] RSP: 002b:00007ffe04bf64c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 4382.579174] RAX: ffffffffffffffda RBX: 00005564136f3050 RCX: 00007f5e74724dd7
[ 4382.579505] RDX: 00007ffe04bf64d0 RSI: 000000005000940e RDI: 0000000000000003
[ 4382.579848] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000044
[ 4382.580164] R10: 0000000000000541 R11: 0000000000000202 R12: 00005564136f3010
[ 4382.580477] R13: 0000000000000003 R14: 00005564136f3035 R15: 00005564136f3050
[ 4382.580792] irq event stamp: 0
[ 4382.581106] hardirqs last enabled at (0): [<0000000000000000>] (null)
[ 4382.581441] hardirqs last disabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
[ 4382.581772] softirqs last enabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
[ 4382.582095] softirqs last disabled at (0): [<0000000000000000>] (null)
[ 4382.582413] ---[ end trace d3c188e3e9367382 ]---
[ 4382.623855] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2981: errno=-5 IO failure
[ 4382.624295] BTRFS info (device sdc): forced readonly
Fix this by locking the source range before searching for the file extent
items in the fs tree, since the relocation process will try to lock the
range a file extent item represents before updating it with the new extent
location.
Fixes: 34a28e3d77 ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The recent rework that makes btrfs' remap_file_range operation use the
generic helper generic_remap_file_range_prep() introduced a race between
writeback and cloning a range that covers the eof extent of the source
file into a destination offset that is greater then the same file's size.
This happens because we now wait for writeback to complete before doing
the truncation of the eof block, while previously we did the truncation
and then waited for writeback to complete. This leads to a race between
writeback of the truncated block and cloning the file extents in the
source range, because we copy each file extent item we find in the fs
root into a buffer, then release the path and then increment the reference
count for the extent referred in that file extent item we copied, which
can no longer exist if writeback of the truncated eof block completes
after we copied the file extent item into the buffer and before we
incremented the reference count. This is illustrated by the following
diagram:
CPU 1 CPU 2
btrfs_clone_files()
btrfs_cont_expand()
btrfs_truncate_block()
--> zeroes part of the
page containg eof,
marking it for
delalloc
btrfs_clone()
--> finds extent item
covering eof,
points to extent
at bytenr X
--> copies it into a
local buffer
--> releases path
writeback starts
btrfs_finish_ordered_io()
insert_reserved_file_extent()
__btrfs_drop_extents()
--> creates delayed
reference to drop
the extent at
bytenr X
--> starts transaction
--> creates delayed
reference to
increment extent
at bytenr X
<delayed references are run, due to a transaction
commit for example, and the transaction is aborted
with -EIO because we attempt to increment reference
count for the extent at bytenr X after we freed it>
When this race is hit the running transaction ends up getting aborted with
an -EIO error and a trace like the following is produced:
[ 4382.553858] WARNING: CPU: 2 PID: 3648 at fs/btrfs/extent-tree.c:1552 lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
(...)
[ 4382.556293] CPU: 2 PID: 3648 Comm: btrfs Tainted: G W 4.20.0-rc6-btrfs-next-41 #1
[ 4382.556294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[ 4382.556308] RIP: 0010:lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
(...)
[ 4382.556310] RSP: 0018:ffffac784408f738 EFLAGS: 00010202
[ 4382.556311] RAX: 0000000000000001 RBX: ffff8980673c3a48 RCX: 0000000000000001
[ 4382.556312] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
[ 4382.556312] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[ 4382.556313] R10: 0000000000000001 R11: ffff897f40000000 R12: 0000000000001000
[ 4382.556313] R13: 00000000c224f000 R14: ffff89805de9bd40 R15: ffff8980453f4548
[ 4382.556315] FS: 00007f5e759178c0(0000) GS:ffff89807b300000(0000) knlGS:0000000000000000
[ 4382.563130] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4382.563562] CR2: 00007f2e9789fcbc CR3: 0000000120512001 CR4: 00000000003606e0
[ 4382.564005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4382.564451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4382.564887] Call Trace:
[ 4382.565343] insert_inline_extent_backref+0x55/0xe0 [btrfs]
[ 4382.565796] __btrfs_inc_extent_ref.isra.60+0x88/0x260 [btrfs]
[ 4382.566249] ? __btrfs_run_delayed_refs+0x93/0x1650 [btrfs]
[ 4382.566702] __btrfs_run_delayed_refs+0xa22/0x1650 [btrfs]
[ 4382.567162] btrfs_run_delayed_refs+0x7e/0x1d0 [btrfs]
[ 4382.567623] btrfs_commit_transaction+0x50/0x9c0 [btrfs]
[ 4382.568112] ? _raw_spin_unlock+0x24/0x30
[ 4382.568557] ? block_rsv_release_bytes+0x14e/0x410 [btrfs]
[ 4382.569006] create_subvol+0x3c8/0x830 [btrfs]
[ 4382.569461] ? btrfs_mksubvol+0x317/0x600 [btrfs]
[ 4382.569906] btrfs_mksubvol+0x317/0x600 [btrfs]
[ 4382.570383] ? rcu_sync_lockdep_assert+0xe/0x60
[ 4382.570822] ? __sb_start_write+0xd4/0x1c0
[ 4382.571262] ? mnt_want_write_file+0x24/0x50
[ 4382.571712] btrfs_ioctl_snap_create_transid+0x117/0x1a0 [btrfs]
[ 4382.572155] ? _copy_from_user+0x66/0x90
[ 4382.572602] btrfs_ioctl_snap_create+0x66/0x80 [btrfs]
[ 4382.573052] btrfs_ioctl+0x7c1/0x30e0 [btrfs]
[ 4382.573502] ? mem_cgroup_commit_charge+0x8b/0x570
[ 4382.573946] ? do_raw_spin_unlock+0x49/0xc0
[ 4382.574379] ? _raw_spin_unlock+0x24/0x30
[ 4382.574803] ? __handle_mm_fault+0xf29/0x12d0
[ 4382.575215] ? do_vfs_ioctl+0xa2/0x6f0
[ 4382.575622] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 4382.576020] do_vfs_ioctl+0xa2/0x6f0
[ 4382.576405] ksys_ioctl+0x70/0x80
[ 4382.576776] __x64_sys_ioctl+0x16/0x20
[ 4382.577137] do_syscall_64+0x60/0x1b0
[ 4382.577488] entry_SYSCALL_64_after_hwframe+0x49/0xbe
(...)
[ 4382.578837] RSP: 002b:00007ffe04bf64c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 4382.579174] RAX: ffffffffffffffda RBX: 00005564136f3050 RCX: 00007f5e74724dd7
[ 4382.579505] RDX: 00007ffe04bf64d0 RSI: 000000005000940e RDI: 0000000000000003
[ 4382.579848] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000044
[ 4382.580164] R10: 0000000000000541 R11: 0000000000000202 R12: 00005564136f3010
[ 4382.580477] R13: 0000000000000003 R14: 00005564136f3035 R15: 00005564136f3050
[ 4382.580792] irq event stamp: 0
[ 4382.581106] hardirqs last enabled at (0): [<0000000000000000>] (null)
[ 4382.581441] hardirqs last disabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
[ 4382.581772] softirqs last enabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
[ 4382.582095] softirqs last disabled at (0): [<0000000000000000>] (null)
[ 4382.582413] ---[ end trace d3c188e3e9367382 ]---
[ 4382.623855] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2981: errno=-5 IO failure
[ 4382.624295] BTRFS info (device sdc): forced readonly
Fix this by waiting for writeback to complete after truncating the eof
block.
Fixes: 34a28e3d77 ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull vfs mount API prep from Al Viro:
"Mount API prereqs.
Mostly that's LSM mount options cleanups. There are several minor
fixes in there, but nothing earth-shattering (leaks on failure exits,
mostly)"
* 'mount.part1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (27 commits)
mount_fs: suppress MAC on MS_SUBMOUNT as well as MS_KERNMOUNT
smack: rewrite smack_sb_eat_lsm_opts()
smack: get rid of match_token()
smack: take the guts of smack_parse_opts_str() into a new helper
LSM: new method: ->sb_add_mnt_opt()
selinux: rewrite selinux_sb_eat_lsm_opts()
selinux: regularize Opt_... names a bit
selinux: switch away from match_token()
selinux: new helper - selinux_add_opt()
LSM: bury struct security_mnt_opts
smack: switch to private smack_mnt_opts
selinux: switch to private struct selinux_mnt_opts
LSM: hide struct security_mnt_opts from any generic code
selinux: kill selinux_sb_get_mnt_opts()
LSM: turn sb_eat_lsm_opts() into a method
nfs_remount(): don't leak, don't ignore LSM options quietly
btrfs: sanitize security_mnt_opts use
selinux; don't open-code a loop in sb_finish_set_opts()
LSM: split ->sb_set_mnt_opts() out of ->sb_kern_mount()
new helper: security_sb_eat_lsm_opts()
...
Merge more updates from Andrew Morton:
- procfs updates
- various misc bits
- lib/ updates
- epoll updates
- autofs
- fatfs
- a few more MM bits
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (58 commits)
mm/page_io.c: fix polled swap page in
checkpatch: add Co-developed-by to signature tags
docs: fix Co-Developed-by docs
drivers/base/platform.c: kmemleak ignore a known leak
fs: don't open code lru_to_page()
fs/: remove caller signal_pending branch predictions
mm/: remove caller signal_pending branch predictions
arch/arc/mm/fault.c: remove caller signal_pending_branch predictions
kernel/sched/: remove caller signal_pending branch predictions
kernel/locking/mutex.c: remove caller signal_pending branch predictions
mm: select HAVE_MOVE_PMD on x86 for faster mremap
mm: speed up mremap by 20x on large regions
mm: treewide: remove unused address argument from pte_alloc functions
initramfs: cleanup incomplete rootfs
scripts/gdb: fix lx-version string output
kernel/kcov.c: mark write_comp_data() as notrace
kernel/sysctl: add panic_print into sysctl
panic: add options to print system info when panic happens
bfs: extra sanity checking and static inode bitmap
exec: separate MM_ANONPAGES and RLIMIT_STACK accounting
...
Multiple filesystems open code lru_to_page(). Rectify this by moving
the macro from mm_inline (which is specific to lru stuff) to the more
generic mm.h header and start using the macro where appropriate.
No functional changes.
Link: http://lkml.kernel.org/r/20181129104810.23361-1-nborisov@suse.com
Link: https://lkml.kernel.org/r/20181129075301.29087-1-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Pankaj gupta <pagupta@redhat.com>
Acked-by: "Yan, Zheng" <zyan@redhat.com> [ceph]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.
It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.
A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.
This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
There were a couple of notable cases:
- csky still had the old "verify_area()" name as an alias.
- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)
- microblaze used the type argument for a debug printout
but other than those oddities this should be a total no-op patch.
I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Keep void * instead, allocate on demand (in parse_str_opts, at the
moment). Eventually both selinux and smack will be better off
with private structures with several strings in those, rather than
this "counter and two pointers to dynamically allocated arrays"
ugliness. This commit allows to do that at leisure, without
disrupting anything outside of given module.
Changes:
* instead of struct security_mnt_opt use an opaque pointer
initialized to NULL.
* security_sb_eat_lsm_opts(), security_sb_parse_opts_str() and
security_free_mnt_opts() take it as var argument (i.e. as void **);
call sites are unchanged.
* security_sb_set_mnt_opts() and security_sb_remount() take
it by value (i.e. as void *).
* new method: ->sb_free_mnt_opts(). Takes void *, does
whatever freeing that needs to be done.
* ->sb_set_mnt_opts() and ->sb_remount() might get NULL as
mnt_opts argument, meaning "empty".
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
1) keeping a copy in btrfs_fs_info is completely pointless - we never
use it for anything. Getting rid of that allows for simpler calling
conventions for setup_security_options() (caller is responsible for
freeing mnt_opts in all cases).
2) on remount we want to use ->sb_remount(), not ->sb_set_mnt_opts(),
same as we would if not for FS_BINARY_MOUNTDATA. Behaviours *are*
close (in fact, selinux sb_set_mnt_opts() ought to punt to
sb_remount() in "already initialized" case), but let's handle
that uniformly. And the only reason why the original btrfs changes
didn't go for security_sb_remount() in btrfs_remount() case is that
it hadn't been exported. Let's export it for a while - it'll be
going away soon anyway.
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
combination of alloc_secdata(), security_sb_copy_data(),
security_sb_parse_opt_str() and free_secdata().
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The typos accumulate over time so once in a while time they get fixed in
a large patch.
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the error handling block, err holds the return value of either
btrfs_del_root_ref() or btrfs_del_inode_ref() but it hasn't been checked
since it's introduction with commit fe66a05a06 (Btrfs: improve error
handling for btrfs_insert_dir_item callers) in 2012.
If the error handling in the error handling fails, there's not much left
to do and the abort either happened earlier in the callees or is
necessary here.
So if one of btrfs_del_root_ref() or btrfs_del_inode_ref() failed, abort
the transaction, but still return the original code of the failure
stored in 'ret' as this will be reported to the user.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since cloning and deduplication are no longer Btrfs specific operations, we
now have generic code to handle parameter validation, compare file ranges
used for deduplication, clear capabilities when cloning, etc. This change
makes Btrfs use it, eliminating a lot of code in Btrfs and also fixing a
few bugs, such as:
1) When cloning, the destination file's capabilities were not dropped
(the fstest generic/513 tests this);
2) We were not checking if the destination file is immutable;
3) Not checking if either the source or destination files are swap
files (swap file support is coming soon for Btrfs);
4) System limits were not checked (resource limits and O_LARGEFILE).
Note that the generic helper generic_remap_file_range_prep() does start
and waits for writeback by calling filemap_write_and_wait_range(), however
that is not enough for Btrfs for two reasons:
1) With compression, we need to start writeback twice in order to get the
pages marked for writeback and ordered extents created;
2) filemap_write_and_wait_range() (and all its other variants) only waits
for the IO to complete, but we need to wait for the ordered extents to
finish, so that when we do the actual reflinking operations the file
extent items are in the fs tree. This is also important due to the fact
that the generic helper, for the deduplication case, compares the
contents of the pages in the requested range, which might require
reading extents from disk in the very unlikely case that pages get
invalidated after writeback finishes (so the file extent items must be
up to date in the fs tree).
Since these reasons are specific to Btrfs we have to do it in the Btrfs
code before calling generic_remap_file_range_prep(). This also results
in a simpler way of dealing with existing delalloc in the source/target
ranges, specially for the deduplication case where we used to lock all
the pages first and then if we found any dealloc for the range, or
ordered extent, we would unlock the pages trigger writeback and wait for
ordered extents to complete, then lock all the pages again and check if
deduplication can be done. So now we get a simpler approach: lock the
inodes, then trigger writeback and then wait for ordered extents to
complete.
So make btrfs use generic_remap_file_range_prep() (XFS and OCFS2 use it)
to eliminate duplicated code, fix a few bugs and benefit from future bug
fixes done there - for example the recent clone and dedupe bugs involving
reflinking a partial EOF block got a counterpart fix in the generic
helper, since it affected all filesystems supporting these operations,
so we no longer need special checks in Btrfs for them.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_readpages processes all pages in the readlist in batches of 16,
this is implemented by a single for loop but thanks to an if condition
the loop does 2 things based on whether we've filled the batch or not.
Additionally due to the structure of the code there is an additional
check which deals with partial batches.
Streamline all of this by explicitly using two loops. The outter one is
used to process all pages while the inner one just fills in the batch
of 16 (currently). Due to this new structure the code guarantees that
all pages are processed in the loop hence the code to deal with any
leftovers is eliminated.
This also enable the compiler to inline __extent_readpages:
./scripts/bloat-o-meter fs/btrfs/extent_io.o extent_io.for
add/remove: 0/1 grow/shrink: 1/0 up/down: 660/-820 (-160)
Function old new delta
extent_readpages 476 1136 +660
__extent_readpages 820 - -820
Total: Before=44315, After=44155, chg -0.36%
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The first step of the rebalance process ensures there is 1MiB free on
each device. This number seems rather small. And in fact when talking to
the original authors their opinions were:
"man that's a little bonkers"
"i don't think we even need that code anymore"
"I think it was there to make sure we had room for the blank 1M at the
beginning. I bet it goes all the way back to v0"
"we just don't need any of that tho, i say we just delete it"
Clearly, this piece of code has lost its original intent throughout the
years. It doesn't really bring any real practical benefits to the
relocation process.
Additionally, this patch makes the balance process more lightweight by
removing a pair of shrink/grow operations which are rather expensive for
heavily populated filesystems. This is mainly due to shrink requiring
relocating block groups, involving heavy use of the btree.
The intermediate shrink/grow can fail and leave the filesystem in a
middle state that would need to be changed back by the user.
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
If we create a snapshot of a snapshot currently being used by a send
operation, we can end up with send failing unexpectedly (returning
-ENOENT error to user space for example). The following diagram shows
how this happens.
CPU 1 CPU2 CPU3
btrfs_ioctl_send()
(...)
create_snapshot()
-> creates snapshot of a
root used by the send
task
btrfs_commit_transaction()
create_pending_snapshot()
__get_inode_info()
btrfs_search_slot()
btrfs_search_slot_get_root()
down_read commit_root_sem
get reference on eb of the
commit root
-> eb with bytenr == X
up_read commit_root_sem
btrfs_cow_block(root node)
btrfs_free_tree_block()
-> creates delayed ref to
free the extent
btrfs_run_delayed_refs()
-> runs the delayed ref,
adds extent to
fs_info->pinned_extents
btrfs_finish_extent_commit()
unpin_extent_range()
-> marks extent as free
in the free space cache
transaction commit finishes
btrfs_start_transaction()
(...)
btrfs_cow_block()
btrfs_alloc_tree_block()
btrfs_reserve_extent()
-> allocates extent at
bytenr == X
btrfs_init_new_buffer(bytenr X)
btrfs_find_create_tree_block()
alloc_extent_buffer(bytenr X)
find_extent_buffer(bytenr X)
-> returns existing eb,
which the send task got
(...)
-> modifies content of the
eb with bytenr == X
-> uses an eb that now
belongs to some other
tree and no more matches
the commit root of the
snapshot, resuts will be
unpredictable
The consequences of this race can be various, and can lead to searches in
the commit root performed by the send task failing unexpectedly (unable to
find inode items, returning -ENOENT to user space, for example) or not
failing because an inode item with the same number was added to the tree
that reused the metadata extent, in which case send can behave incorrectly
in the worst case or just fail later for some reason.
Fix this by performing a copy of the commit root's extent buffer when doing
a search in the context of a send operation.
CC: stable@vger.kernel.org # 4.4.x: 1fc28d8e2e: Btrfs: move get root out of btrfs_search_slot to a helper
CC: stable@vger.kernel.org # 4.4.x: f9ddfd0592: Btrfs: remove unused check of skip_locking
CC: stable@vger.kernel.org # 4.4.x
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When initializing the security xattrs, we are holding a transaction handle
therefore we need to use a GFP_NOFS context in order to avoid a deadlock
with reclaim in case it's triggered.
Fixes: 39a27ec100 ("btrfs: use GFP_KERNEL for xattr and acl allocations")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With my delayed refs patches in place we started seeing a large amount
of aborts in __btrfs_free_extent:
BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964 owner 1 offset 0
Call Trace:
? btrfs_merge_delayed_refs+0xaf/0x340
__btrfs_run_delayed_refs+0x6ea/0xfc0
? btrfs_set_path_blocking+0x31/0x60
btrfs_run_delayed_refs+0xeb/0x180
btrfs_commit_transaction+0x179/0x7f0
? btrfs_check_space_for_delayed_refs+0x30/0x50
? should_end_transaction.isra.19+0xe/0x40
btrfs_drop_snapshot+0x41c/0x7c0
btrfs_clean_one_deleted_snapshot+0xb5/0xd0
cleaner_kthread+0xf6/0x120
kthread+0xf8/0x130
? btree_invalidatepage+0x90/0x90
? kthread_bind+0x10/0x10
ret_from_fork+0x35/0x40
This was because btrfs_drop_snapshot depends on the root not being
modified while it's dropping the snapshot. It will unlock the root node
(and really every node) as it walks down the tree, only to re-lock it
when it needs to do something. This is a problem because if we modify
the tree we could cow a block in our path, which frees our reference to
that block. Then once we get back to that shared block we'll free our
reference to it again, and get ENOENT when trying to lookup our extent
reference to that block in __btrfs_free_extent.
This is ultimately happening because we have delayed items left to be
processed for our deleted snapshot _after_ all of the inodes are closed
for the snapshot. We only run the delayed inode item if we're deleting
the inode, and even then we do not run the delayed insertions or delayed
removals. These can be run at any point after our final inode does its
last iput, which is what triggers the snapshot deletion. We can end up
with the snapshot deletion happening and then have the delayed items run
on that file system, resulting in the above problem.
This problem has existed forever, however my patches made it much easier
to hit as I wake up the cleaner much more often to deal with delayed
iputs, which made us more likely to start the snapshot dropping work
before the transaction commits, which is when the delayed items would
generally be run. Before, generally speaking, we would run the delayed
items, commit the transaction, and wakeup the cleaner thread to start
deleting snapshots, which means we were less likely to hit this problem.
You could still hit it if you had multiple snapshots to be deleted and
ended up with lots of delayed items, but it was definitely harder.
Fix for now by simply running all the delayed items before starting to
drop the snapshot. We could make this smarter in the future by making
the delayed items per-root, and then simply drop any delayed items for
roots that we are going to delete. But for now just a quick and easy
solution is the safest.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When debugging some weird extent reference bug I suspected that we were
changing a snapshot while we were deleting it, which could explain my
bug. This was indeed what was happening, and this patch helped me
verify my theory. It is never correct to modify the snapshot once it's
being deleted, so mark the root when we are deleting it and make sure we
complain about it when it happens.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
@blocksize variable in do_walk_down() is only used once, really no need
to declare it.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since scrub workers only do memory allocation with GFP_KERNEL when they
need to perform repair, we can move the recent setup of the nofs context
up to scrub_handle_errored_block() instead of setting it up down the call
chain at insert_full_stripe_lock() and scrub_add_page_to_wr_bio(),
removing some duplicate code and comment. So the only paths for which a
scrub worker can do memory allocations using GFP_KERNEL are the following:
scrub_bio_end_io_worker()
scrub_block_complete()
scrub_handle_errored_block()
lock_full_stripe()
insert_full_stripe_lock()
-> kmalloc with GFP_KERNEL
scrub_bio_end_io_worker()
scrub_block_complete()
scrub_handle_errored_block()
scrub_write_page_to_dev_replace()
scrub_add_page_to_wr_bio()
-> kzalloc with GFP_KERNEL
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The scrub context is allocated with GFP_KERNEL and called from
btrfs_scrub_dev under the fs_info::device_list_mutex. This is not safe
regarding reclaim that could try to flush filesystem data in order to
get the memory. And the device_list_mutex is held during superblock
commit, so this would cause a lockup.
Move the alocation and initialization before any changes that require
the mutex.
Signed-off-by: David Sterba <dsterba@suse.com>
We can pass fs_info directly as this is the only member of btrfs_device
that's bing used inside scrub_setup_ctx.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a bunch of magic to make sure we're throttling delayed refs when
truncating a file. Now that we have a delayed refs rsv and a mechanism
for refilling that reserve simply use that instead of all of this magic.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Over the years we have built up a lot of infrastructure to keep delayed
refs in check, mostly by running them at btrfs_end_transaction() time.
We have a lot of different maths we do to figure out how much, if we
should do it inline or async, etc. This existed because we had no
feedback mechanism to force the flushing of delayed refs when they
became a problem. However with the enospc flushing infrastructure in
place for flushing delayed refs when they put too much pressure on the
enospc system we have this problem solved. Rip out all of this code as
it is no longer needed.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now with the delayed_refs_rsv we can now know exactly how much pending
delayed refs space we need. This means we can drastically simplify
btrfs_check_space_for_delayed_refs by simply checking how much space we
have reserved for the global rsv (which acts as a spill over buffer) and
the delayed refs rsv. If our total size is beyond that amount then we
know it's time to commit the transaction and stop any more delayed refs
from being generated.
With the introduction of dealyed_refs_rsv infrastructure, namely
btrfs_update_delayed_refs_rsv we now know exactly how much pending
delayed refs space is required.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A nice thing we gain with the delayed refs rsv is the ability to flush
the delayed refs on demand to deal with enospc pressure. Add states to
flush delayed refs on demand, and this will allow us to remove a lot of
ad-hoc work around checking to see if we should commit the transaction
to run our delayed refs.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Any space used in the delayed_refs_rsv will be freed up by a transaction
commit, so instead of just counting the pinned space we also need to
account for any space in the delayed_refs_rsv when deciding if it will
make a different to commit the transaction to satisfy our space
reservation. If we have enough bytes to satisfy our reservation ticket
then we are good to go, otherwise subtract out what space we would gain
back by committing the transaction and compare that against the pinned
space to make our decision.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Traditionally we've had voodoo in btrfs to account for the space that
delayed refs may take up by having a global_block_rsv. This works most
of the time, except when it doesn't. We've had issues reported and seen
in production where sometimes the global reserve is exhausted during
transaction commit before we can run all of our delayed refs, resulting
in an aborted transaction. Because of this voodoo we have equally
dubious flushing semantics around throttling delayed refs which we often
get wrong.
So instead give them their own block_rsv. This way we can always know
exactly how much outstanding space we need for delayed refs. This
allows us to make sure we are constantly filling that reservation up
with space, and allows us to put more precise pressure on the enospc
system. Instead of doing math to see if its a good time to throttle,
the normal enospc code will be invoked if we have a lot of delayed refs
pending, and they will be run via the normal flushing mechanism.
For now the delayed_refs_rsv will hold the reservations for the delayed
refs, the block group updates, and deleting csums. We could have a
separate rsv for the block group updates, but the csum deletion stuff is
still handled via the delayed_refs so that will stay there.
Historical background:
The global reserve has grown to cover everything we don't reserve space
explicitly for, and we've grown a lot of weird ad-hoc heuristics to know
if we're running short on space and when it's time to force a commit. A
failure rate of 20-40 file systems when we run hundreds of thousands of
them isn't super high, but cleaning up this code will make things less
ugly and more predictible.
Thus the delayed refs rsv. We always know how many delayed refs we have
outstanding, and although running them generates more we can use the
global reserve for that spill over, which fits better into it's desired
use than a full blown reservation. This first approach is to simply
take how many times we're reserving space for and multiply that by 2 in
order to save enough space for the delayed refs that could be generated.
This is a niave approach and will probably evolve, but for now it works.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com> # high-level review
[ added background notes from the cover letter ]
Signed-off-by: David Sterba <dsterba@suse.com>
We use this number to figure out how many delayed refs to run, but
__btrfs_run_delayed_refs really only checks every time we need a new
delayed ref head, so we always run at least one ref head completely no
matter what the number of items on it. Fix the accounting to only be
adjusted when we add/remove a ref head.
In addition to using this number to limit the number of delayed refs
run, a future patch is also going to use it to calculate the amount of
space required for delayed refs space reservation.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The cleanup_extent_op function actually would run the extent_op if it
needed running, which made the name sort of a misnomer. Change it to
run_and_cleanup_extent_op, and move the actual cleanup work to
cleanup_extent_op so it can be used by check_ref_cleanup() in order to
unify the extent op handling.
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were missing some quota cleanups in check_ref_cleanup, so break the
ref head accounting cleanup into a helper and call that from both
check_ref_cleanup and cleanup_ref_head. This will hopefully ensure that
we don't screw up accounting in the future for other things that we add.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
into a helper and cleanup the calling functions.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using a 'var & (PAGE_SIZE - 1)' construct one is checking for a page
alignment and thus should use the PAGE_ALIGNED() macro instead of
open-coding it.
Convert all open-coded occurrences of PAGE_ALIGNED().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
offset into a page.
So replace them by the offset_in_page() macro instead of open-coding it if
they're not used as an alignment check.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The dev-replace locking functions are now trivial wrappers around rw
semaphore that can be used directly everywhere. No functional change.
Signed-off-by: David Sterba <dsterba@suse.com>
After the rw semaphore has been added, the custom blocking using
::blocking_readers and ::read_lock_wq is redundant.
The blocking logic in __btrfs_map_block is replaced by extending the
time the semaphore is held, that has the same blocking effect on writes
as the previous custom scheme that waited until ::blocking_readers was
zero.
Signed-off-by: David Sterba <dsterba@suse.com>
This is the first part of removing the custom locking and waiting scheme
used for device replace. It was probably copied from extent buffer
locking, but there's nothing that would require more than is provided by
the common locking primitives.
The rw spinlock protects waiting tasks counter in case of incompatible
locks and the waitqueue. Same as rw semaphore.
This patch only switches the locking primitive, for better
bisectability. There should be no functional change other than the
overhead of the locking and potential sleeping instead of spinning when
the lock is contended.
Signed-off-by: David Sterba <dsterba@suse.com>
The device-replace read lock is going to use rw semaphore in followup
commits. The semaphore might sleep which is not possible in the radix
tree preload section. The lock nesting is now:
* device replace
* radix tree preload
* readahead spinlock
Signed-off-by: David Sterba <dsterba@suse.com>
Running btrfs/124 in a loop hung up on me sporadically with the
following call trace:
btrfs D 0 5760 5324 0x00000000
Call Trace:
? __schedule+0x243/0x800
schedule+0x33/0x90
btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
? wait_woken+0xa0/0xa0
btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
btrfs_relocate_chunk+0x49/0x100 [btrfs]
btrfs_balance+0xbeb/0x1740 [btrfs]
btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
btrfs_ioctl+0x1691/0x3110 [btrfs]
? lockdep_hardirqs_on+0xed/0x180
? __handle_mm_fault+0x8e7/0xfb0
? _raw_spin_unlock+0x24/0x30
? __handle_mm_fault+0x8e7/0xfb0
? do_vfs_ioctl+0xa5/0x6e0
? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
do_vfs_ioctl+0xa5/0x6e0
? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
ksys_ioctl+0x3a/0x70
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x60/0x1b0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
This happens because during page writeback it's valid for
writepage_delalloc to instantiate a delalloc range which doesn't belong
to the page currently being written back.
The reason this case is valid is due to find_lock_delalloc_range
returning any available range after the passed delalloc_start and
ignoring whether the page under writeback is within that range.
In turn ordered extents (OE) are always created for the returned range
from find_lock_delalloc_range. If, however, a failure occurs while OE
are being created then the clean up code in btrfs_cleanup_ordered_extents
will be called.
Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider
the case of such 'foreign' range being processed and instead it always
assumes that the range OE are created for belongs to the page. This
leads to the first page of such foregin range to not be cleaned up since
it's deliberately missed and skipped by the current cleaning up code.
Fix this by correctly checking whether the current page belongs to the
range being instantiated and if so adjsut the range parameters passed
for cleaning up. If it doesn't, then just clean the whole OE range
directly.
Fixes: 524272607e ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The @found is always false when it comes to the if branch. Besides, the
bool type is more suitable for @found. Change the return value of the
function and its caller to bool as well.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The test case btrfs/001 with inode_cache mount option will encounter the
following warning:
WARNING: CPU: 1 PID: 23700 at fs/btrfs/inode.c:956 cow_file_range.isra.19+0x32b/0x430 [btrfs]
CPU: 1 PID: 23700 Comm: btrfs Kdump: loaded Tainted: G W O 4.20.0-rc4-custom+ #30
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:cow_file_range.isra.19+0x32b/0x430 [btrfs]
Call Trace:
? free_extent_buffer+0x46/0x90 [btrfs]
run_delalloc_nocow+0x455/0x900 [btrfs]
btrfs_run_delalloc_range+0x1a7/0x360 [btrfs]
writepage_delalloc+0xf9/0x150 [btrfs]
__extent_writepage+0x125/0x3e0 [btrfs]
extent_write_cache_pages+0x1b6/0x3e0 [btrfs]
? __wake_up_common_lock+0x63/0xc0
extent_writepages+0x50/0x80 [btrfs]
do_writepages+0x41/0xd0
? __filemap_fdatawrite_range+0x9e/0xf0
__filemap_fdatawrite_range+0xbe/0xf0
btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
__btrfs_write_out_cache+0x42c/0x480 [btrfs]
btrfs_write_out_ino_cache+0x84/0xd0 [btrfs]
btrfs_save_ino_cache+0x551/0x660 [btrfs]
commit_fs_roots+0xc5/0x190 [btrfs]
btrfs_commit_transaction+0x2bf/0x8d0 [btrfs]
btrfs_mksubvol+0x48d/0x4d0 [btrfs]
btrfs_ioctl_snap_create_transid+0x170/0x180 [btrfs]
btrfs_ioctl_snap_create_v2+0x124/0x180 [btrfs]
btrfs_ioctl+0x123f/0x3030 [btrfs]
The file extent generation of the free space inode is equal to the last
snapshot of the file root, so the inode will be passed to cow_file_rage.
But the inode was created and its extents were preallocated in
btrfs_save_ino_cache, there are no cow copies on disk.
The preallocated extent is not yet in the extent tree, and
btrfs_cross_ref_exist will ignore the -ENOENT returned by
check_committed_ref, so we can directly write the inode to the disk.
Fixes: 78d4295b1e ("btrfs: lift some btrfs_cross_ref_exist checks in nocow path")
CC: stable@vger.kernel.org # 4.18+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The log tree has a long standing problem that when a file is fsync'ed we
only check for new ancestors, created in the current transaction, by
following only the hard link for which the fsync was issued. We follow the
ancestors using the VFS' dget_parent() API. This means that if we create a
new link for a file in a directory that is new (or in an any other new
ancestor directory) and then fsync the file using an old hard link, we end
up not logging the new ancestor, and on log replay that new hard link and
ancestor do not exist. In some cases, involving renames, the file will not
exist at all.
Example:
mkfs.btrfs -f /dev/sdb
mount /dev/sdb /mnt
mkdir /mnt/A
touch /mnt/foo
ln /mnt/foo /mnt/A/bar
xfs_io -c fsync /mnt/foo
<power failure>
In this example after log replay only the hard link named 'foo' exists
and directory A does not exist, which is unexpected. In other major linux
filesystems, such as ext4, xfs and f2fs for example, both hard links exist
and so does directory A after mounting again the filesystem.
Checking if any new ancestors are new and need to be logged was added in
2009 by commit 12fcfd22fe ("Btrfs: tree logging unlink/rename fixes"),
however only for the ancestors of the hard link (dentry) for which the
fsync was issued, instead of checking for all ancestors for all of the
inode's hard links.
So fix this by tracking the id of the last transaction where a hard link
was created for an inode and then on fsync fallback to a full transaction
commit when an inode has more than one hard link and at least one new hard
link was created in the current transaction. This is the simplest solution
since this is not a common use case (adding frequently hard links for
which there's an ancestor created in the current transaction and then
fsync the file). In case it ever becomes a common use case, a solution
that consists of iterating the fs/subvol btree for each hard link and
check if any ancestor is new, could be implemented.
This solves many unexpected scenarios reported by Jayashree Mohan and
Vijay Chidambaram, and for which there is a new test case for fstests
under review.
Fixes: 12fcfd22fe ("Btrfs: tree logging unlink/rename fixes")
CC: stable@vger.kernel.org # 4.4+
Reported-by: Vijay Chidambaram <vvijay03@gmail.com>
Reported-by: Jayashree Mohan <jayashree2912@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The first auto-assigned value to enum is 0, we can use that and not
initialize all members where the auto-increment does the same. This is
used for values that are not part of on-disk format.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
ordered extent flags.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
extent map flags.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
extent buffer flags;
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
root tree flags.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
internal filesystem states.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
block reserve types.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
We can use simple enum for values that are not part of on-disk format:
global filesystem states.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
This function really checks whether adding more data to the bio will
straddle a stripe/chunk. So first let's give it a more appropraite name
- btrfs_bio_fits_in_stripe. Secondly, the offset parameter was never
used to just remove it. Thirdly, pages are submitted to either btree or
data inodes so it's guaranteed that tree->ops is set so replace the
check with an ASSERT. Finally, document the parameters of the function.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When it was introduced in commit f094ac32ab ("Btrfs: fix NULL pointer
after aborting a transaction"), it was not used.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Document why map_private_extent_buffer() cannot return '1' (i.e. the map
spans two pages) for the csum_tree_block() case.
The current algorithm for detecting a page boundary crossing in
map_private_extent_buffer() will return a '1' *IFF* the extent buffer's
offset in the page + the offset passed in by csum_tree_block() and the
minimal length passed in by csum_tree_block() - 1 are bigger than
PAGE_SIZE.
We always pass BTRFS_CSUM_SIZE (32) as offset and a minimal length of 32
and the current extent buffer allocator always guarantees page aligned
extends, so the above condition can't be true.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
In map_private_extent_buffer() the 'offset' variable is initialized to a
page aligned version of the 'start' parameter.
But later on it is overwritten with either the offset from the extent
buffer's start or 0.
So get rid of the initial initialization.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
When a transaction commit starts, it attempts to pause scrub and it blocks
until the scrub is paused. So while the transaction is blocked waiting for
scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
otherwise we risk getting into a deadlock with reclaim.
Checking for scrub pause requests is done early at the beginning of the
while loop of scrub_stripe() and later in the loop, scrub_extent() and
scrub_raid56_parity() are called, which in turn call scrub_pages() and
scrub_pages_for_parity() respectively. These last two functions do memory
allocations using GFP_KERNEL. Same problem could happen while scrubbing
the super blocks, since it calls scrub_pages().
We also can not have any of the worker tasks, created by the scrub task,
doing GFP_KERNEL allocations, because before pausing, the scrub task waits
for all the worker tasks to complete (also done at scrub_stripe()).
So make sure GFP_NOFS is used for the memory allocations because at any
time a scrub pause request can happen from another task that started to
commit a transaction.
Fixes: 58c4e17384 ("btrfs: scrub: use GFP_KERNEL on the submission path")
CC: stable@vger.kernel.org # 4.6+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For data inodes this hook does nothing but to return -EAGAIN which is
used to signal to the endio routines that this bio belongs to a data
inode. If this is the case the actual retrying is handled by
bio_readpage_error. Alternatively, if this bio belongs to the btree
inode then btree_io_failed_hook just does some cleanup and doesn't retry
anything.
This patch simplifies the code flow by eliminating
readpage_io_failed_hook and instead open-coding btree_io_failed_hook in
end_bio_extent_readpage. Also eliminate some needless checks since IO is
always performed on either data inode or btree inode, both of which are
guaranteed to have their extent_io_tree::ops set.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_bio_end_io_t typedef was introduced with commit
a1d3c4786a ("btrfs: btrfs_multi_bio replaced with btrfs_bio")
but never used anywhere. This commit also introduced a forward declaration
of 'struct btrfs_bio' which is only needed for btrfs_bio_end_io_t.
Remove both as they're not needed anywhere.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The end_io callback implemented as btrfs_io_bio_endio_readpage only
calls kfree. Also the callback is set only in case the csum buffer is
allocated and not pointing to the inline buffer. We can use that
information to drop the indirection and call a helper that will free the
csums only in the right case.
This shrinks struct btrfs_io_bio by 8 bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The io_bio tracks checksums and has an inline buffer or an allocated
one. And there's a third member that points to the right one, but we
don't need to use an extra pointer for that. Let btrfs_io_bio::csum
point to the right buffer and check that the inline buffer is not
accidentally freed.
This shrinks struct btrfs_io_bio by 8 bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The async_cow::root is used to propagate fs_info to async_cow_submit.
We can't use inode to reach it because it could become NULL after
write without compression in async_cow_start.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
There's one caller and its code is simple, we can open code it in
run_one_async_done. The errors are passed through bio.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Print a kernel log message when the balance ends, either for cancel or
completed or if it is paused.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The information about balance arguments is important for system audit,
this patch prints the textual representation when balance starts or is
resumed.
Example command:
$ btrfs balance start -f -mprofiles=raid1,convert=single,soft -dlimit=10..20,usage=50 /btrfs
Example kernel log output:
BTRFS info (device sdb): balance: start -f -dusage=50,limit=10..20 -mconvert=single,soft,profiles=raid1 -sconvert=single,soft,profiles=raid1
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, simplify code ]
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out helper that describes block group flags from
describe_relocation. The result will not be longer than the given size.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
If the quota enable and snapshot creation ioctls are called concurrently
we can get into a deadlock where the task enabling quotas will deadlock
on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
twice, or the task creating a snapshot tries to commit the transaction
while the task enabling quota waits for the former task to commit the
transaction while holding the mutex. The following time diagrams show how
both cases happen.
First scenario:
CPU 0 CPU 1
btrfs_ioctl()
btrfs_ioctl_quota_ctl()
btrfs_quota_enable()
mutex_lock(fs_info->qgroup_ioctl_lock)
btrfs_start_transaction()
btrfs_ioctl()
btrfs_ioctl_snap_create_v2
create_snapshot()
--> adds snapshot to the
list pending_snapshots
of the current
transaction
btrfs_commit_transaction()
create_pending_snapshots()
create_pending_snapshot()
qgroup_account_snapshot()
btrfs_qgroup_inherit()
mutex_lock(fs_info->qgroup_ioctl_lock)
--> deadlock, mutex already locked
by this task at
btrfs_quota_enable()
Second scenario:
CPU 0 CPU 1
btrfs_ioctl()
btrfs_ioctl_quota_ctl()
btrfs_quota_enable()
mutex_lock(fs_info->qgroup_ioctl_lock)
btrfs_start_transaction()
btrfs_ioctl()
btrfs_ioctl_snap_create_v2
create_snapshot()
--> adds snapshot to the
list pending_snapshots
of the current
transaction
btrfs_commit_transaction()
--> waits for task at
CPU 0 to release
its transaction
handle
btrfs_commit_transaction()
--> sees another task started
the transaction commit first
--> releases its transaction
handle
--> waits for the transaction
commit to be completed by
the task at CPU 1
create_pending_snapshot()
qgroup_account_snapshot()
btrfs_qgroup_inherit()
mutex_lock(fs_info->qgroup_ioctl_lock)
--> deadlock, task at CPU 0
has the mutex locked but
it is waiting for us to
finish the transaction
commit
So fix this by setting the quota enabled flag in fs_info after committing
the transaction at btrfs_quota_enable(). This ends up serializing quota
enable and snapshot creation as if the snapshot creation happened just
before the quota enable request. The quota rescan task, scheduled after
committing the transaction in btrfs_quote_enable(), will do the accounting.
Fixes: 6426c7ad69 ("btrfs: qgroup: Fix qgroup accounting when creating snapshot")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The available allocation bits members from struct btrfs_fs_info are
protected by a sequence lock, and when starting balance we access them
incorrectly in two different ways:
1) In the read sequence lock loop at btrfs_balance() we use the values we
read from fs_info->avail_*_alloc_bits and we can immediately do actions
that have side effects and can not be undone (printing a message and
jumping to a label). This is wrong because a retry might be needed, so
our actions must not have side effects and must be repeatable as long
as read_seqretry() returns a non-zero value. In other words, we were
essentially ignoring the sequence lock;
2) Right below the read sequence lock loop, we were reading the values
from avail_metadata_alloc_bits and avail_data_alloc_bits without any
protection from concurrent writers, that is, reading them outside of
the read sequence lock critical section.
So fix this by making sure we only read the available allocation bits
while in a read sequence lock critical section and that what we do in the
critical section is repeatable (has nothing that can not be undone) so
that any eventual retry that is needed is handled properly.
Fixes: de98ced9e7 ("Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits")
Fixes: 1450612797 ("btrfs: fix a bogus warning when converting only data or metadata")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can have a lot freed extents during the life span of transaction, so
the red black tree that keeps track of the ranges of each freed extent
(fs_info->freed_extents[]) can get quite big. When finishing a
transaction commit we find each range, process it (discard the extents,
unpin them) and then remove it from the red black tree.
We can use an extent state record as a cache when searching for a range,
so that when we clean the range we can use the cached extent state we
passed to the search function instead of iterating the red black tree
again. Doing things as fast as possible when finishing a transaction (in
state TRANS_STATE_UNBLOCKED) is convenient as it reduces the time we
block another task that wants to commit the next transaction.
So change clear_extent_dirty() to allow an optional extent state record to
be passed as an argument, which will be passed down to __clear_extent_bit.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch lands the last case which needs to be handled by the fsid
change code. Namely, this is the case where a multidisk filesystem has
already undergone at least one successful fsid change i.e all disks
have the METADATA_UUID incompat bit and power failure occurs as another
fsid change is in progress. When such an event occurs, disks could be
split in 2 groups. One of the groups will have both METADATA_UUID and
CHANGING_FSID_V2 flags set coupled with old fsid/metadata_uuid pairs.
The other group of disks will have only METADATA_UUID bit set and their
fsid will be different than the one in disks in the first group. Here
we look at the following cases:
a) A disk from the first group is scanned first, so fs_devices is
created with stale fsid/metdata_uuid. Then when a disk from the
second group is scanned it needs to first check whether there exists
such an fs_devices that has fsid_change set to true (because it was
created with a disk having the CHANGING_FSID_V2 flag), the
metadata_uuid and fsid of the fs_devices will be different (since it was
created by a disk which already has had at least 1 successful fsid change)
and finally the metadata_uuid of the fs_devices will equal that of the
currently scanned disk (because metadata_uuid never really changes).
When the correct fs_devices is found the information from the scanned
disk will replace the current one in fs_devices since the scanned disk
will have higher generation number.
b) A disk from the second group is scanned so fs_devices is created
as usual with differing fsid/metdata_uid. Then when a disk from the
first group is scanned the code detects that it has both
CHANGING_FSID_V2 and METADATA_UUID flags set and will search for
fs_devices that has differing metadata_uuid/fsid and whose
metadata_uuid is the same as that of the scanned device.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This commit continues hardening the scanning code to handle cases where
power loss could have caused disks in a multi-disk filesystem to be
in inconsistent state. Namely handle the situation that can occur when
some of the disks in multi-disk fs have completed their fsid change i.e
they have METADATA_UUID incompat flag set, have cleared the
CHANGING_FSID_V2 flag and their fsid/metadata_uuid are different. At
the same time the other half of the disks will have their
fsid/metadata_uuid unchanged and will only have CHANGING_FSID_V2 flag.
This is handled by introducing code in the scan path which:
a) Handles the case when a device with CHANGING_FSID_V2 flag is
scanned and as a result btrfs_fs_devices is created with matching
fsid/metdata_uuid. Subsequently, when a device with completed fsid
change is scanned it will detect this via the new code in find_fsid
i.e that such an fs_devices exist that fsid_change flag is set to true,
it's metadata_uuid/fsid match and the metadata_uuid of the scanned
device matches that of the fs_devices. In this case, it's important to
note that the devices which has its fsid change completed will have a
higher generation number than the device with FSID_CHANGING_V2 flag
set, so its superblock block will be used during mount. To prevent an
assertion triggering because the sb used for mounting will have
differing fsid/metadata_uuid than the ones in the fs_devices struct
also add code in device_list_add which overwrites the values in
fs_devices.
b) Alternatively we can end up with a device that completed its
fsid change be scanned first which will create the respective
btrfs_fs_devices struct with differing fsid/metadata_uuid. In this
case when a device with FSID_CHANGING_V2 flag set is scanned it will
call the newly added find_fsid_inprogress function which will return
the correct fs_devices.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to gracefully handle split-brain scenario during fsid change
(which are very unlikely, yet possible), two more pieces of information
will be necessary:
1. The highest generation number among all devices registered to a
particular btrfs_fs_devices
2. A boolean flag whether a given btrfs_fs_devices was created by a
device which had the FSID_CHANGING_V2 flag set.
This is a preparatory patch and just introduces the variables as well
as code which sets them, their actual use is going to happen in a later
patch.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Even though fsid change without rewrite is a very quick operation it's
still possible to experience a split-brain scenario if power loss occurs
at the most inconvenient time. This patch handles the case where power
failure occurs while the first transaction (the one setting
CHANGING_FSID_V2) flag is being persisted on disk. This can cause the
btrfs_fs_devices of this filesystem to be created by a device which:
a) has the CHANGING_FSID_V2 flag set but its fsid value is intact
b) or a device which doesn't have CHANGING_FSID_V2 flag set and its
fsid value is intact
This situation is trivially handled by the current find_fsid code since
in both cases the devices are going to be treated like ordinary devices.
Since btrfs is always mounted using the superblock of the latest
device (the one with highest generation number), meaning it will have
the CHANGING_FSID_V2 flag set, ensure it's being cleared on mount. On
the first transaction commit following mount all disks will have it
cleared.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_fs_info structure contains a copy of the
fsid/metadata_uuid fields. Same values are also contained in the
btrfs_fs_devices structure which fs_info has a reference to. Let's
reduce duplication by removing the fields from fs_info and always refer
to the ones in fs_devices. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the metadata_uuid is a new incompat feature it requires the
respective sysfs hooks. This patch adds the 'metdata_uuid' feature to
be shown if it supported by the kernel. Additionally it adds
/sys/fs/btrfs/UUID/metadata_uuid attribute which allows one to read
the current metadata_uuid.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This field is going to be used when the user wants to change the UUID
of the filesystem without having to rewrite all metadata blocks. This
field adds another level of indirection such that when the FSID is
changed what really happens is the current UUID (the one with which the
fs was created) is copied to the 'metadata_uuid' field in the superblock
as well as a new incompat flag is set METADATA_UUID. When the kernel
detects this flag is set it knows that the superblock in fact has 2
UUIDs:
1. Is the UUID which is user-visible, currently known as FSID.
2. Metadata UUID - this is the UUID which is stamped into all on-disk
datastructures belonging to this file system.
When the new incompat flag is present device scanning checks whether
both fsid/metadata_uuid of the scanned device match any of the
registered filesystems. When the flag is not set then both UUIDs are
equal and only the FSID is retained on disk, metadata_uuid is set only
in-memory during mount.
Additionally a new metadata_uuid field is also added to the fs_info
struct. It's initialised either with the FSID in case METADATA_UUID
incompat flag is not set or with the metdata_uuid of the superblock
otherwise.
This commit introduces the new fields as well as the new incompat flag
and switches all users of the fsid to the new logic.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor updates in comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Several functions in BTRFS are only used inside the source file they are
declared if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not defined. However if
CONFIG_BTRFS_FS_RUN_SANITY_TESTS is defined these functions are shared
with the unit tests code.
Before the introduction of the EXPORT_FOR_TESTS macro, these functions
could not be declared as static and the compiler had a harder task when
optimizing and inlining them.
As we have EXPORT_FOR_TESTS now, use it where appropriate to support the
compiler.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Depending on whether CONFIG_BTRFS_FS_RUN_SANITY_TESTS is set, some BTRFS
functions are either local to the file they are implemented in and thus
should be declared static or are called from within the test
implementation defined in a different file.
Introduce an EXPORT_FOR_TESTS macro which depending on
CONFIG_BTRFS_FS_RUN_SANITY_TESTS either adds the 'static' keyword to a
function or not.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Up to commit 32955c5422 ("btrfs: switch to discard_new_inode()") the
drop_on_err variable in btrfs_mkdir() was used to check whether the
inode had to be dropped via iput().
After commit 32955c5422 ("btrfs: switch to discard_new_inode()")
discard_new_inode() is called when err is set and inode is non NULL.
Therefore drop_on_err is not used anymore and thus causes a warning when
building with -Wunused-but-set-variable.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
lock_delalloc_pages should only return 2 values - 0 in case of success
and -EAGAIN if the range of pages to be locked should be shrunk due to
some of gone. Manual inspections confirms that this is indeed the case
since __process_pages_contig is where lock_delalloc_pages gets its
return value. The latter always returns 0 or -EAGAIN so the invariant
holds. No functional changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers of this function pass BTRFS_MAX_EXTENT_SIZE (128M) so let's
reduce the argument count and make that a local variable. No functional
changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's unnecessary to check map->stripes[i].dev for NULL given its value
is already set and dereferenced above the the check. No functional
changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As of now only user requested replace cancel can cancel the
replace-scrub so no need to log the error.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we successfully cancel the device replace, its scrub worker returns
-ECANCELED, which is then passed to btrfs_dev_replace_finishing.
It cleans up based on the returned status and propagates the same
-ECANCELED back the parent function. As of now only user can cancel the
replace-scrub, so its ok to silence the warning here.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We recast the replace return status
BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS to 0, to indicate no
error.
And since BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR should also return 0,
which is also declared as 0, so we just return. Instead add it to the if
statement so that there is enough clarity while reading the code.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the replace state is in the suspended state, btrfs_scrub_cancel()
should fail with -ENOTCONN as there is no scrub running. As a safety
catch check if btrfs_scrub_cancel() returns -ENOTCONN and assert if it
doesn't.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The device-replace needs to check the result code of the scrub workers
in btrfs_dev_replace_cancel and distinguish if successful cancel
operation and when the there was no operation running.
If btrfs_scrub_cancel() fails, return
BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED so that user can try
to cancel the replace again.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The device replace cancel thread can race with the replace start thread
and if fs_info::scrubs_running is not yet set, btrfs_scrub_cancel() will
fail to stop the scrub thread.
The scrub thread continues with the scrub for replace which then will
try to write to the target device and which is already freed by the
cancel thread.
scrub_setup_ctx() warns as tgtdev is NULL.
struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
{
...
if (is_dev_replace) {
WARN_ON(!fs_info->dev_replace.tgtdev); <===
sctx->pages_per_wr_bio = SCRUB_PAGES_PER_WR_BIO;
sctx->wr_tgtdev = fs_info->dev_replace.tgtdev;
sctx->flush_all_writes = false;
}
[ 6724.497655] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc started
[ 6753.945017] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc canceled
[ 6852.426700] WARNING: CPU: 0 PID: 4494 at fs/btrfs/scrub.c:622 scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
...
[ 6852.428928] RIP: 0010:scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
...
[ 6852.432970] Call Trace:
[ 6852.433202] btrfs_scrub_dev+0x19b/0x5c0 [btrfs]
[ 6852.433471] btrfs_dev_replace_start+0x48c/0x6a0 [btrfs]
[ 6852.433800] btrfs_dev_replace_by_ioctl+0x3a/0x60 [btrfs]
[ 6852.434097] btrfs_ioctl+0x2476/0x2d20 [btrfs]
[ 6852.434365] ? do_sigaction+0x7d/0x1e0
[ 6852.434623] do_vfs_ioctl+0xa9/0x6c0
[ 6852.434865] ? syscall_trace_enter+0x1c8/0x310
[ 6852.435124] ? syscall_trace_enter+0x1c8/0x310
[ 6852.435387] ksys_ioctl+0x60/0x90
[ 6852.435663] __x64_sys_ioctl+0x16/0x20
[ 6852.435907] do_syscall_64+0x50/0x180
[ 6852.436150] entry_SYSCALL_64_after_hwframe+0x49/0xbe
Further, as the replace thread enters scrub_write_page_to_dev_replace()
without the target device it panics:
static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
struct scrub_page *spage)
{
...
bio_set_dev(bio, sbio->dev->bdev); <======
[ 6929.715145] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
..
[ 6929.717106] Workqueue: btrfs-scrub btrfs_scrub_helper [btrfs]
[ 6929.717420] RIP: 0010:scrub_write_page_to_dev_replace+0xb4/0x260
[btrfs]
..
[ 6929.721430] Call Trace:
[ 6929.721663] scrub_write_block_to_dev_replace+0x3f/0x60 [btrfs]
[ 6929.721975] scrub_bio_end_io_worker+0x1af/0x490 [btrfs]
[ 6929.722277] normal_work_helper+0xf0/0x4c0 [btrfs]
[ 6929.722552] process_one_work+0x1f4/0x520
[ 6929.722805] ? process_one_work+0x16e/0x520
[ 6929.723063] worker_thread+0x46/0x3d0
[ 6929.723313] kthread+0xf8/0x130
[ 6929.723544] ? process_one_work+0x520/0x520
[ 6929.723800] ? kthread_delayed_work_timer_fn+0x80/0x80
[ 6929.724081] ret_from_fork+0x3a/0x50
Fix this by letting the btrfs_dev_replace_finishing() to do the job of
cleaning after the cancel, including freeing of the target device.
btrfs_dev_replace_finishing() is called when btrfs_scub_dev() returns
along with the scrub return status.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In a secnario where balance and replace co-exists as below,
- start balance
- pause balance
- start replace
- reboot
and when system restarts, balance resumes first. Then the replace is
attempted to restart but will fail as the EXCL_OP lock is already held
by the balance. If so place the replace state back to
BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state.
Fixes: 010a47bde9 ("btrfs: add proper safety check before resuming dev-replace")
CC: stable@vger.kernel.org # 4.18+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At the time of forced unmount we place the running replace to
BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state, so when the system comes
back and expect the target device is missing.
Then let the replace state continue to be in
BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state instead of
BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED as there isn't any matching scrub
running as part of replace.
Fixes: e93c89c1aa ("Btrfs: add new sources for device replace code")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There isn't any other consumer other than in its own file dev-replace.c.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not that impossible to imagine that a device OR a btrfs image is
copied just by using the dd or the cp command. Which in case both the
copies of the btrfs will have the same fsid. If on the system with
automount enabled, the copied FS gets scanned.
We have a known bug in btrfs, that we let the device path be changed
after the device has been mounted. So using this loop hole the new
copied device would appears as if its mounted immediately after it's
been copied.
For example:
Initially.. /dev/mmcblk0p4 is mounted as /
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
mmcblk0 179:0 0 29.2G 0 disk
|-mmcblk0p4 179:4 0 4G 0 part /
|-mmcblk0p2 179:2 0 500M 0 part /boot
|-mmcblk0p3 179:3 0 256M 0 part [SWAP]
`-mmcblk0p1 179:1 0 256M 0 part /boot/efi
$ btrfs fi show
Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba
Total devices 1 FS bytes used 1.40GiB
devid 1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4
Copy mmcblk0 to sda
$ dd if=/dev/mmcblk0 of=/dev/sda
And immediately after the copy completes the change in the device
superblock is notified which the automount scans using btrfs device scan
and the new device sda becomes the mounted root device.
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 1 14.9G 0 disk
|-sda4 8:4 1 4G 0 part /
|-sda2 8:2 1 500M 0 part
|-sda3 8:3 1 256M 0 part
`-sda1 8:1 1 256M 0 part
mmcblk0 179:0 0 29.2G 0 disk
|-mmcblk0p4 179:4 0 4G 0 part
|-mmcblk0p2 179:2 0 500M 0 part /boot
|-mmcblk0p3 179:3 0 256M 0 part [SWAP]
`-mmcblk0p1 179:1 0 256M 0 part /boot/efi
$ btrfs fi show /
Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba
Total devices 1 FS bytes used 1.40GiB
devid 1 size 4.00GiB used 3.00GiB path /dev/sda4
The bug is quite nasty that you can't either unmount /dev/sda4 or
/dev/mmcblk0p4. And the problem does not get solved until you take sda
out of the system on to another system to change its fsid using the
'btrfstune -u' command.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an
nparity field in raid_attr.
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
RAID5 and RAID6 profile store one copy of the data, not 2 or 3. These
values are not yet used anywhere so there's no change.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 92e222df7b "btrfs: alloc_chunk: fix DUP stripe size handling"
fixed calculating the stripe_size for a new DUP chunk.
However, the same calculation reappears a bit later, and that one was
not changed yet. The resulting bug that is exposed is that the newly
allocated device extents ('stripes') can have a few MiB overlap with the
next thing stored after them, which is another device extent or the end
of the disk.
The scenario in which this can happen is:
* The block device for the filesystem is less than 10GiB in size.
* The amount of contiguous free unallocated disk space chosen to use for
chunk allocation is 20% of the total device size, or a few MiB more or
less.
An example:
- The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB)
- There's 1578MiB unallocated raw disk space left in one contiguous
piece.
In this case stripe_size is first calculated as 789MiB, (half of
1578MiB).
Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we
enter the if block. Now stripe_size value is immediately overwritten
while calculating an adjusted value based on max_chunk_size, which ends
up as 788MiB.
Next, the value is rounded up to a 16MiB boundary, 800MiB, which is
actually more than the value we had before. However, the last comparison
fails to detect this, because it's comparing the value with the total
amount of free space, which is about twice the size of stripe_size.
In the example above, this means that the resulting raw disk space being
allocated is 1600MiB, while only a gap of 1578MiB has been found. The
second device extent object for this DUP chunk will overlap for 22MiB
with whatever comes next.
The underlying problem here is that the stripe_size is reused all the
time for different things. So, when entering the code in the if block,
stripe_size is immediately overwritten with something else. If later we
decide we want to have the previous value back, then the logic to
compute it was copy pasted in again.
With this change, the value in stripe_size is not unnecessarily
destroyed, so the duplicated calculation is not needed any more.
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable num_bytes is really a way too generic name for a variable
in this function. There are a dozen other variables that hold a number
of bytes as value.
Give it a name that actually describes what it does, which is holding
the size of the chunk that we're allocating.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable num_bytes is used to store the chunk length of the chunk
that we're allocating. Do not reuse it for something really different in
the same function.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Snapshot is expected to be fast. But if there are writers steadily
creating dirty pages in our subvolume, the snapshot may take a very long
time to complete. To fix the problem, we use tagged writepage for
snapshot flusher as we do in the generic write_cache_pages(), so we can
omit pages dirtied after the snapshot command.
This does not change the semantics regarding which data get to the
snapshot, if there are pages being dirtied during the snapshotting
operation. There's a sync called before snapshot is taken in old/new
case, any IO in flight just after that may be in the snapshot but this
depends on other system effects that might still sync the IO.
We do a simple snapshot speed test on a Intel D-1531 box:
fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
--direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
original: 1m58sec
patched: 6.54sec
This is the best case for this patch since for a sequential write case,
we omit nearly all pages dirtied after the snapshot command.
For a multi writers, random write test:
fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
--direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
original: 15.83sec
patched: 10.35sec
The improvement is smaller compared to the sequential write case,
since we omit only half of the pages dirtied after snapshot command.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This parameter was never used, yet was part of the interface of the
function ever since its introduction as extent_io_ops::writepage_end_io_hook
in e6dcd2dc9c ("Btrfs: New data=ordered implementation"). Now that
NULL is passed everywhere as a value for this parameter let's remove it
for good. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only remaining use of the 'epd' argument in writepage_delalloc is
to reference the extent_io_tree which was set in extent_writepages. Since
it is guaranteed that page->mapping of any page passed to
writepage_delalloc (and __extent_writepage as the sole caller) to be
equal to that passed in extent_writepages we can directly get the
io_tree via the already passed inode (which is also taken from
page->mapping->host). No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If epd::extent_locked is set then writepage_delalloc terminates. Make
this a bit more apparent in the caller by simply bubbling the check up.
This enables to remove epd as an argument to writepage_delalloc in a
future patch. No functional change.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before btrfs_map_bio submits all stripe bios it does a number of checks
to ensure the device for every stripe is present. However, it doesn't do
a DEV_STATE_MISSING check, instead this is relegated to the lower level
btrfs_schedule_bio (in the async submission case, sync submission
doesn't check DEV_STATE_MISSING at all). Additionally
btrfs_schedule_bios does the duplicate device->bdev check which has
already been performed in btrfs_map_bio.
This patch moves the DEV_STATE_MISSING check in btrfs_map_bio and
removes the duplicate device->bdev check. Doing so ensures that no bio
cloning/submission happens for both async/sync requests in the face of
missing device. This makes the async io submission path slightly shorter
in terms of instruction count. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
dev_replace::replace_state has been set to
BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED (0) in the same function,
So delete the line which sets replace_state = 0;
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The io_err field of struct btrfs_log_ctx is no longer used after the
recent simplification of the fast fsync path, where we now wait for
ordered extents to complete before logging the inode. We did this in
commit b5e6c3e170 ("btrfs: always wait on ordered extents at fsync
time") and commit a2120a473a ("btrfs: clean up the left over
logged_list usage") removed its last use.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently are in a loop finding each range (corresponding to a btree
node/leaf) in a log root's extent io tree and then clean it up. This is a
waste of time since we are traversing the extent io tree's rb_tree more
times then needed (one for a range lookup and another for cleaning it up)
without any good reason.
We free the log trees when we are in the critical section of a transaction
commit (the transaction state is set to TRANS_STATE_COMMIT_DOING), so it's
of great convenience to do everything as fast as possible in order to
reduce the time we block other tasks from starting a new transaction.
So fix this by traversing the extent io tree once and cleaning up all its
records in one go while traversing it.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The loop construct in free_extent_buffer was added in
242e18c7c1 ("Btrfs: reduce lock contention on extent buffer locks")
as means of reducing the times the eb lock is taken, the non-last ref
count is decremented and lock is released. As the special handling
of UNMAPPED extent buffers was removed now there is only one decrement
op which is happening for EXTENT_BUFFER_UNMAPPED case.
This commit modifies the loop condition so that in case of UNMAPPED
buffers the eb's lock is taken only if we are 100% sure the eb is going
to be freed by the current executor of the code. Additionally, remove
superfluous ref count ops in btrfs test.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the whole of btrfs code has been audited for eb reference count
management it's time to remove the hunk in free_extent_buffer that
essentially considered the condition
"eb->ref == 2 && EXTENT_BUFFER_DUMMY"
to equal "eb->ref = 1". Also remove the last location
which takes an extra reference count in alloc_test_extent_buffer.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In qgroup_rescan_leaf a copy is made of the target leaf by calling
btrfs_clone_extent_buffer. The latter allocates a new buffer and
attaches a new set of pages and copies the content of the source buffer.
The new scratch buffer is only used to iterate it's items, it's not
published anywhere and cannot be accessed by a third party.
Hence, it's not necessary to perform any locking on it whatsoever.
Furthermore, remove the extra extent_buffer_get call since the new
buffer is always allocated with a reference count of 1 which is
sufficient here. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the 2 comparison trees roots are initialised they are private to
the function and already have reference counts of 1 each. There is no
need to further increment the reference count since the cloned buffers
are already accessed via struct btrfs_path. Eventually the 2 paths used
for comparison are going to be released, effectively disposing of the
cloned buffers.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a rewound buffer is created it already has a ref count of 1 and the
dummy flag set. Then another ref is taken bumping the count to 2.
Finally when this buffer is released from btrfs_release_path the extra
reference is decremented by the special handling code in
free_extent_buffer.
However, this special code is in fact redundant sinca ref count of 1 is
still correct since the buffer is only accessed via btrfs_path struct.
This paves the way forward of removing the special handling in
free_extent_buffer.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
get_old_root used used only by btrfs_search_old_slot to initialise the
path structure. The old root is always a cloned buffer (either via alloc
dummy or via btrfs_clone_extent_buffer) and its reference count is 2: 1
from allocation, 1 from extent_buffer_get call in get_old_root.
This latter explicit ref count acquire operation is in fact unnecessary
since the semantic is such that the newly allocated buffer is handed
over to the btrfs_path for lifetime management. Considering this just
remove the extra extent_buffer_get in get_old_root.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In iterate_inode_exrefs the eb is cloned via btrfs_clone_extent_buffer
which creates a private extent buffer with the dummy flag set and ref
count of 1. Then this buffer is locked for reading and its ref count is
incremented by 1. Finally it's fed to the passed iterate_irefs_t
function. The actual iterate call back is inode_to_path (coming from
paths_from_inode) which feeds the eb to btrfs_ref_to_path. In this final
function the passed eb is only read by first assigning it to the local
eb variable. This variable is only modified in the case another eb was
referenced from the passed path that is eb != eb_in check triggers.
Considering this there is no point in locking the cloned eb in
iterate_inode_refs since it's never being modified and is not published
anywhere. Furthermore the cloned eb is completely fine having its ref
count be 1.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In iterate_inode_refs the eb is cloned via btrfs_clone_extent_buffer
which creates a private extent buffer with the dummy flag set and ref
count of 1. Then this buffer is locked for reading and its ref count is
incremented by 1. Finally it's fed to the passed iterate_irefs_t
function. The actual iterate call back is inode_to_path (coming from
paths_from_inode) which feeds the eb to btrfs_ref_to_path. In this final
function the passed eb is only read by first assigning it to the local
eb variable. This variable is only modified in the case another eb was
referenced from the passed path that is eb != eb_in check triggers.
Considering this there is no point in locking the cloned eb in
iterate_inode_refs since it's never being modified and is not published
anywhere. Furthermore the cloned eb is completely fine having its ref
count be 1.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In extent-io self test, we need 2 ordered extents at its maximum size to
do the test.
Instead of using the intermediate numbers, use BTRFS_MAX_EXTENT_SIZE for
@max_bytes, and twice @max_bytes for @total_dirty. This should explain
why we need all these magic numbers and prevent people to modify them by
accident.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs has not allowed swap files since commit 35054394c4 ("Btrfs: stop
providing a bmap operation to avoid swapfile corruptions"). However, now
that the proper restrictions are in place, Btrfs can support swap files
through the swap file a_ops, similar to iomap in commit 67482129cd
("iomap: add a swapfile activation function").
For Btrfs, activation needs to make sure that the file can be used as a
swap file, which currently means that it must be fully allocated as
NOCOW with no compression on one device. It must also do the proper
tracking so that ioctls will not interfere with the swap file.
Deactivation clears this tracking.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The Btrfs swap code is going to need it, so give it a btrfs_ prefix and
make it non-static.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.
When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:
- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag
Don't allow those to happen on an active swap file.
Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.
Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the counterpart to merge_extent_hook, similarly, it's used only
for data/freespace inodes so let's remove it, rename it and call it
directly where necessary. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is used only for data and free space inodes. Such inodes
are guaranteed to have their extent_io_tree::private_data set to the
inode struct. Exploit this fact to directly call the function. Also give
it a more descriptive name. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the counterpart to ex-set_bit_hook (now btrfs_set_delalloc_extent),
similar to what was done before remove clear_bit_hook and rename the
function. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is used to properly account delalloc extents for data
inodes (ordinary file inodes and freespace v1 inodes). Those can be
easily identified since they have their extent_io trees ->private_data
member point to the inode. Let's exploit this fact to remove the
needless indirection through extent_io_hooks and directly call the
function. Also give the function a name which reflects its purpose -
btrfs_set_delalloc_extent.
This patch also modified test_find_delalloc so that the extent_io_tree
used for testing doesn't have its ->private_data set which would have
caused a crash in btrfs_set_delalloc_extent due to the btrfs_inode->root
member not being initialised. The old version of the code also didn't
call set_bit_hook since the extent_io ops weren't set for the inode. No
functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback was only used in debug builds by btrfs_leak_debug_check.
A better approach is to move its implementation in
btrfs_leak_debug_check and ensure the latter is only executed for extent
tree which have ->private_data set i.e. relate to a data node and not
the btree one. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is ony ever called for data page writeout so there is no
need to actually abstract it via extent_io_ops. Lets just export it,
remove the definition of the callback and call it directly in the
functions that invoke the callback. Also rename the function to
btrfs_writepage_endio_finish_ordered since what it really does is
account finished io in the ordered extent data structures. No
functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This hook is called only from __extent_writepage_io which is already
called only from the data page writeout path. So there is no need to
make an indirect call via extent_io_ops. This patch just removes the
callback definition, exports the callback function and calls it directly
at the only call site. Also give the function a more descriptive name.
No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is called only from writepage_delalloc which in turn is
guaranteed to be called from the data page writeout path. In the end
there is no reason to have the call to this function to be indrected via
the extent_io_ops structure. This patch removes the callback definition,
exports the function and calls it directly. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_run_delalloc_range ]
Signed-off-by: David Sterba <dsterba@suse.com>
This will be used in future patches that remove the optional
extent_io_ops callbacks.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add extra dev extent end check against device boundary.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Enhance btrfs_verify_dev_extents() to remember previous checked dev
extents, so it can verify no dev extents can overlap.
Analysis from Hans:
"Imagine allocating a DATA|DUP chunk.
In the chunk allocator, we first set...
max_stripe_size = SZ_1G;
max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE
... which is 10GiB.
Then...
/* we don't want a chunk larger than 10% of writeable space */
max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
max_chunk_size);
Imagine we only have one 7880MiB block device in this filesystem. Now
max_chunk_size is down to 788MiB.
The next step in the code is to search for max_stripe_size * dev_stripes
amount of free space on the device, which is in our example 1GiB * 2 =
2GiB. Imagine the device has exactly 1578MiB free in one contiguous
piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail
Next we recalculate the stripe_size (which is actually the device extent
length), based on the actual maximum amount of available raw disk space:
stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);
stripe_size is now 789MiB
Next we do...
data_stripes = num_stripes / ncopies
...where data_stripes ends up as 1, because num_stripes is 2 (the amount
of device extents we're going to have), and DUP has ncopies 2.
Next there's a check...
if (stripe_size * data_stripes > max_chunk_size)
...which matches because 789MiB * 1 > 788MiB.
We go into the if code, and next is...
stripe_size = div_u64(max_chunk_size, data_stripes);
...which resets stripe_size to max_chunk_size: 788MiB
Next is a fun one...
/* bump the answer up to a 16MB boundary */
stripe_size = round_up(stripe_size, SZ_16M);
...which changes stripe_size from 788MiB to 800MiB.
We're not done changing stripe_size yet...
/* But don't go higher than the limits we found while searching
* for free extents
*/
stripe_size = min(devices_info[ndevs - 1].max_avail,
stripe_size);
This is bad. max_avail is twice the stripe_size (we need to fit 2 device
extents on the same device for DUP).
The result here is that 800MiB < 1578MiB, so it's unchanged. However,
the resulting DUP chunk will need 1600MiB disk space, which isn't there,
and the second dev_extent might extend into the next thing (next
dev_extent? end of device?) for 22MiB.
The last shown line of code relies on a situation where there's twice
the value of stripe_size present as value for the variable stripe_size
when it's DUP. This was actually the case before commit 92e222df7b
"btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote:
"[...] in the meantime there's a check to see if the stripe_size does
not exceed max_chunk_size. Since during this check stripe_size is twice
the amount as intended, the check will reduce the stripe_size to
max_chunk_size if the actual correct to be used stripe_size is more than
half the amount of max_chunk_size."
In the previous version of the code, the 16MiB alignment (why is this
done, by the way?) would result in a 50% chance that it would actually
do an 8MiB alignment for the individual dev_extents, since it was
operating on double the size. Does this matter?
Does it matter that stripe_size can be set to anything which is not
16MiB aligned because of the amount of remaining available disk space
which is just taken?
What is the main purpose of this round_up?
The most straightforward thing to do seems something like...
stripe_size = min(
div_u64(devices_info[ndevs - 1].max_avail, dev_stripes),
stripe_size
)
..just putting half of the max_avail into stripe_size."
Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/
Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ add analysis from report ]
Signed-off-by: David Sterba <dsterba@suse.com>
We have a complex loop design for find_free_extent(), that has different
behavior for each loop, some even includes new chunk allocation.
Instead of putting such a long code into find_free_extent() and makes it
harder to read, just extract them into find_free_extent_update_loop().
With all the cleanups, the main find_free_extent() should be pretty
barebone:
find_free_extent()
|- Iterate through all block groups
| |- Get a valid block group
| |- Try to do clustered allocation in that block group
| |- Try to do unclustered allocation in that block group
| |- Check if the result is valid
| | |- If valid, then exit
| |- Jump to next block group
|
|- Push harder to find free extents
|- If not found, re-iterate all block groups
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
[ copy callchain from changelog to function comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will extract unclsutered extent allocation code into
find_free_extent_unclustered().
And this helper function will use return value to indicate what to do
next.
This should make find_free_extent() a little easier to read.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
[Update merge conflict with fb5c39d7a8 ("btrfs: don't use ctl->free_space for max_extent_size")]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have two main methods to find free extents inside a block group:
1) clustered allocation
2) unclustered allocation
This patch will extract the clustered allocation into
find_free_extent_clustered() to make it a little easier to read.
Instead of jumping between different labels in find_free_extent(), the
helper function will use return value to indicate different behavior.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of tons of different local variables in find_free_extent(),
extract them into find_free_extent_ctl structure, and add better
explanation for them.
Some modification may looks redundant, but will later greatly simplify
function parameter list during find_free_extent() refactor.
Also add two comments to co-operate with fb5c39d7a8 ("btrfs: don't use
ctl->free_space for max_extent_size"), to make ffe_ctl->max_extent_size
update more reader-friendly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new wrapper update_bytes_pinned to replace open coded
bytes_pinned modifiers. Now the underflows of space_info::bytes_pinned
get detected and reported.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although we have space_info::bytes_may_use underflow detection in
btrfs_free_reserved_data_space_noquota(), we have more callers who are
subtracting number from space_info::bytes_may_use.
So instead of doing underflow detection for every caller, introduce a
new wrapper update_bytes_may_use() to replace open coded bytes_may_use
modifiers.
This also introduce a macro to declare more wrappers, but currently
space_info::bytes_may_use is the mostly interesting one.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Tracking pending ordered extents per transaction was introduced in commit
50d9aa99bd ("Btrfs: make sure logged extents complete in the current
transaction V3") and later updated in commit 161c3549b4 ("Btrfs: change
how we wait for pending ordered extents").
However now that on fsync we always wait for ordered extents to complete
before logging, done in commit 5636cf7d6d ("btrfs: remove the logged
extents infrastructure"), we no longer need the stuff to track for pending
ordered extents, which was not completely removed in the mentioned commit.
So remove the remaining of the pending ordered extents infrastructure.
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The logged_start and logged_end variables, at btrfs_log_changed_extents,
were added in commit 8c6c592831 ("btrfs: log csums for all modified
extents"). However since the recent simplification for fsync, which makes
us wait for all ordered extents to complete before logging extents, we
no longer need those variables. Commit a2120a473a ("btrfs: clean up the
left over logged_list usage") forgot to remove them.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlwH0k8ACgkQxWXV+ddt
WDtmVg/+Kgvk7laQI9bLEr1/30eG1JfBUMHcVE1F8+g99l28m1Yihjd21j9norVd
YexBz53jgKou+zV+37CKWBYT1uDPq7CIoxctkdE2j9U0s+RmsqDrhech0dsBsfMR
jo9VnHJFuJSxGMhjfGnFV+wMtAr4q5aQptNGBl+hR1MvMneroktFv+0WiLmp0Vhj
+6Iq9WAClJYpgk//cI7nhKkscdzWwRyN3V9RUtdNeYklk1D7l1WprlaPzw22WA9u
VjQVMICjEaJeIixIwT/D8lz05QgjKlqy1z6faYG5JuJxoYQikuNv/xe2dhZVm35A
aNsBR0byf3zzuXKQZAlvXJ6/gYPvep+KI7epPyBOdycaqoZza7rQ+/MkSAgQ77Vk
yBnQuhqiw9Srjh6LDWFkNclVln2wymRKd1SqpZmFPRZre/8L+DU+I8RRaeS2/WcE
M2k+awRD0oVofbB+hxkFIoR+I1Ggkp2rxQlTT/41tGx0geWC3AGX+TlKSW6ZM5HD
lRmRXIsVocfighKEnI3Zy7ecZuwCI4/4D6+PQtyhCJb3tDigZ/a4UEYdSVucG8CG
SuQ5YMn+MyyKT0wH8xkGKDGT15YZ+u9Q/BmPHZRL6sSouFpiCQHA5miD1YA+t1d9
qMjH6Ycz46Y3j2M0BDfDcm714zoD5/bgeSy5SPC3Zh5lQCGpeIk=
=VW/F
-----END PGP SIGNATURE-----
Merge tag 'for-4.20-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A patch in 4.19 introduced a sanity check that was too strict and a
filesystem cannot be mounted.
This happens for filesystems with more than 10 devices and has been
reported by a few users so we need the fix to propagate to stable"
* tag 'for-4.20-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable
[BUG]
A completely valid btrfs will refuse to mount, with error message like:
BTRFS critical (device sdb2): corrupt leaf: root=2 block=239681536 slot=172 \
bg_start=12018974720 bg_len=10888413184, invalid block group size, \
have 10888413184 expect (0, 10737418240]
This has been reported several times as the 4.19 kernel is now being
used. The filesystem refuses to mount, but is otherwise ok and booting
4.18 is a workaround.
Btrfs check returns no error, and all kernels used on this fs is later
than 2011, which should all have the 10G size limit commit.
[CAUSE]
For a 12 devices btrfs, we could allocate a chunk larger than 10G due to
stripe stripe bump up.
__btrfs_alloc_chunk()
|- max_stripe_size = 1G
|- max_chunk_size = 10G
|- data_stripe = 11
|- if (1G * 11 > 10G) {
stripe_size = 976128930;
stripe_size = round_up(976128930, SZ_16M) = 989855744
However the final stripe_size (989855744) * 11 = 10888413184, which is
still larger than 10G.
[FIX]
For the comprehensive check, we need to do the full check at chunk read
time, and rely on bg <-> chunk mapping to do the check.
We could just skip the length check for now.
Fixes: fce466eab7 ("btrfs: tree-checker: Verify block_group_item")
Cc: stable@vger.kernel.org # v4.19+
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlv9qYgACgkQxWXV+ddt
WDupPQ/8DdeLZYQG1tlx2Q+X4/tqPVyAzUjguYzbIY7wvSs1zbEEedENsD8E97yC
So8ooGnP5B6/dqVidLFQBPwTXN59GybYbrDci8qh0DOJTl3+1r8byD9JC+iofrOF
tltJkZ+eCOQyyqHHzlzw15uNOg48Qzj1oXvTAcE0P6iN5UcvcfwRW/S39pjsn63C
63zc09XJ1hmJMJTWZo5h3GoD2UvzrwGXPKXNdv/NWkw9sqQbWdjvZFdqKbvY1VeM
Oa6FPAPErJqEEEePhpDYbyRcnzjJRMs0deLGpGGChGldQxgMO8ILzBwh/KalfzK7
h7LIuv1EclUqlyv0mXPqg2E/C3n2UMPqQYFsK9Lt+4Y/PkrWA2jx0lSg0fBl3k8c
7PyiTqPNPNF8LU48tPEnOzJuNPkquOycgdyQOUpHnS43OF5OLIb6tVyjK4eJHRWw
xtP65M72qM8T65+gsxYcdm0lvIDLidIwFS+2g4ibKU7EwlYkTC9AHFIAyFKTgxeP
MpkIH90mKhSxOpbq8RICgr2jWcJZYoFQ4soi1oE+bgyjv75PyhJ0eXOprCh/4KZp
nkXlPy2skkO9gGecyvr51x/opDEjEkObyOjQm2LhhWYvgcnHgW8Zp1jhQKxabHvz
iZdVIs/agOerpk1d9ZBHhIXOeS2UcE5klqVRAdf961Wobh+HNis=
=cCvI
-----END PGP SIGNATURE-----
Merge tag 'for-4.20-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Some of these bugs are being hit during testing so we'd like to get
them merged, otherwise there are usual stability fixes for stable
trees"
* tag 'for-4.20-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: relocation: set trans to be NULL after ending transaction
Btrfs: fix race between enabling quotas and subvolume creation
Btrfs: send, fix infinite loop due to directory rename dependencies
Btrfs: ensure path name is null terminated at btrfs_control_ioctl
Btrfs: fix rare chances for data loss when doing a fast fsync
btrfs: Always try all copies when reading extent buffers
The function relocate_block_group calls btrfs_end_transaction to release
trans when update_backref_cache returns 1, and then continues the loop
body. If btrfs_block_rsv_refill fails this time, it will jump out the
loop and the freed trans will be accessed. This may result in a
use-after-free bug. The patch assigns NULL to trans after trans is
released so that it will not be accessed.
Fixes: 0647bf564f ("Btrfs: improve forever loop when doing balance relocation")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a race between enabling quotas end subvolume creation that cause
subvolume creation to fail with -EINVAL, and the following diagram shows
how it happens:
CPU 0 CPU 1
btrfs_ioctl()
btrfs_ioctl_quota_ctl()
btrfs_quota_enable()
mutex_lock(fs_info->qgroup_ioctl_lock)
btrfs_ioctl()
create_subvol()
btrfs_qgroup_inherit()
-> save fs_info->quota_root
into quota_root
-> stores a NULL value
-> tries to lock the mutex
qgroup_ioctl_lock
-> blocks waiting for
the task at CPU0
-> sets BTRFS_FS_QUOTA_ENABLED in fs_info
-> sets quota_root in fs_info->quota_root
(non-NULL value)
mutex_unlock(fs_info->qgroup_ioctl_lock)
-> checks quota enabled
flag is set
-> returns -EINVAL because
fs_info->quota_root was
NULL before it acquired
the mutex
qgroup_ioctl_lock
-> ioctl returns -EINVAL
Returning -EINVAL to user space will be confusing if all the arguments
passed to the subvolume creation ioctl were valid.
Fix it by grabbing the value from fs_info->quota_root after acquiring
the mutex.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing an incremental send, due to the need of delaying directory move
(rename) operations we can end up in infinite loop at
apply_children_dir_moves().
An example scenario that triggers this problem is described below, where
directory names correspond to the numbers of their respective inodes.
Parent snapshot:
.
|--- 261/
|--- 271/
|--- 266/
|--- 259/
|--- 260/
| |--- 267
|
|--- 264/
| |--- 258/
| |--- 257/
|
|--- 265/
|--- 268/
|--- 269/
| |--- 262/
|
|--- 270/
|--- 272/
| |--- 263/
| |--- 275/
|
|--- 274/
|--- 273/
Send snapshot:
.
|-- 275/
|-- 274/
|-- 273/
|-- 262/
|-- 269/
|-- 258/
|-- 271/
|-- 268/
|-- 267/
|-- 270/
|-- 259/
| |-- 265/
|
|-- 272/
|-- 257/
|-- 260/
|-- 264/
|-- 263/
|-- 261/
|-- 266/
When processing inode 257 we delay its move (rename) operation because its
new parent in the send snapshot, inode 272, was not yet processed. Then
when processing inode 272, we delay the move operation for that inode
because inode 274 is its ancestor in the send snapshot. Finally we delay
the move operation for inode 274 when processing it because inode 275 is
its new parent in the send snapshot and was not yet moved.
When finishing processing inode 275, we start to do the move operations
that were previously delayed (at apply_children_dir_moves()), resulting in
the following iterations:
1) We issue the move operation for inode 274;
2) Because inode 262 depended on the move operation of inode 274 (it was
delayed because 274 is its ancestor in the send snapshot), we issue the
move operation for inode 262;
3) We issue the move operation for inode 272, because it was delayed by
inode 274 too (ancestor of 272 in the send snapshot);
4) We issue the move operation for inode 269 (it was delayed by 262);
5) We issue the move operation for inode 257 (it was delayed by 272);
6) We issue the move operation for inode 260 (it was delayed by 272);
7) We issue the move operation for inode 258 (it was delayed by 269);
8) We issue the move operation for inode 264 (it was delayed by 257);
9) We issue the move operation for inode 271 (it was delayed by 258);
10) We issue the move operation for inode 263 (it was delayed by 264);
11) We issue the move operation for inode 268 (it was delayed by 271);
12) We verify if we can issue the move operation for inode 270 (it was
delayed by 271). We detect a path loop in the current state, because
inode 267 needs to be moved first before we can issue the move
operation for inode 270. So we delay again the move operation for
inode 270, this time we will attempt to do it after inode 267 is
moved;
13) We issue the move operation for inode 261 (it was delayed by 263);
14) We verify if we can issue the move operation for inode 266 (it was
delayed by 263). We detect a path loop in the current state, because
inode 270 needs to be moved first before we can issue the move
operation for inode 266. So we delay again the move operation for
inode 266, this time we will attempt to do it after inode 270 is
moved (its move operation was delayed in step 12);
15) We issue the move operation for inode 267 (it was delayed by 268);
16) We verify if we can issue the move operation for inode 266 (it was
delayed by 270). We detect a path loop in the current state, because
inode 270 needs to be moved first before we can issue the move
operation for inode 266. So we delay again the move operation for
inode 266, this time we will attempt to do it after inode 270 is
moved (its move operation was delayed in step 12). So here we added
again the same delayed move operation that we added in step 14;
17) We attempt again to see if we can issue the move operation for inode
266, and as in step 16, we realize we can not due to a path loop in
the current state due to a dependency on inode 270. Again we delay
inode's 266 rename to happen after inode's 270 move operation, adding
the same dependency to the empty stack that we did in steps 14 and 16.
The next iteration will pick the same move dependency on the stack
(the only entry) and realize again there is still a path loop and then
again the same dependency to the stack, over and over, resulting in
an infinite loop.
So fix this by preventing adding the same move dependency entries to the
stack by removing each pending move record from the red black tree of
pending moves. This way the next call to get_pending_dir_moves() will
not return anything for the current parent inode.
A test case for fstests, with this reproducer, follows soon.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Wrote changelog with example and more clear explanation]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were using the path name received from user space without checking that
it is null terminated. While btrfs-progs is well behaved and does proper
validation and null termination, someone could call the ioctl and pass
a non-null terminated patch, leading to buffer overrun problems in the
kernel. The ioctl is protected by CAP_SYS_ADMIN.
So just set the last byte of the path to a null character, similar to what
we do in other ioctls (add/remove/resize device, snapshot creation, etc).
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After the simplification of the fast fsync patch done recently by commit
b5e6c3e170 ("btrfs: always wait on ordered extents at fsync time") and
commit e7175a6927 ("btrfs: remove the wait ordered logic in the
log_one_extent path"), we got a very short time window where we can get
extents logged without writeback completing first or extents logged
without logging the respective data checksums. Both issues can only happen
when doing a non-full (fast) fsync.
As soon as we enter btrfs_sync_file() we trigger writeback, then lock the
inode and then wait for the writeback to complete before starting to log
the inode. However before we acquire the inode's lock and after we started
writeback, it's possible that more writes happened and dirtied more pages.
If that happened and those pages get writeback triggered while we are
logging the inode (for example, the VM subsystem triggering it due to
memory pressure, or another concurrent fsync), we end up seeing the
respective extent maps in the inode's list of modified extents and will
log matching file extent items without waiting for the respective
ordered extents to complete, meaning that either of the following will
happen:
1) We log an extent after its writeback finishes but before its checksums
are added to the csum tree, leading to -EIO errors when attempting to
read the extent after a log replay.
2) We log an extent before its writeback finishes.
Therefore after the log replay we will have a file extent item pointing
to an unwritten extent (and without the respective data checksums as
well).
This could not happen before the fast fsync patch simplification, because
for any extent we found in the list of modified extents, we would wait for
its respective ordered extent to finish writeback or collect its checksums
for logging if it did not complete yet.
Fix this by triggering writeback again after acquiring the inode's lock
and before waiting for ordered extents to complete.
Fixes: e7175a6927 ("btrfs: remove the wait ordered logic in the log_one_extent path")
Fixes: b5e6c3e170 ("btrfs: always wait on ordered extents at fsync time")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a metadata read is served the endio routine btree_readpage_end_io_hook
is called which eventually runs the tree-checker. If tree-checker fails
to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
leads to btree_read_extent_buffer_pages wrongly assuming that all
available copies of this extent buffer are wrong and failing prematurely.
Fix this modify btree_read_extent_buffer_pages to read all copies of
the data.
This failure was exhibitted in xfstests btrfs/124 which would
spuriously fail its balance operations. The reason was that when balance
was run following re-introduction of the missing raid1 disk
__btrfs_map_block would map the read request to stripe 0, which
corresponded to devid 2 (the disk which is being removed in the test):
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
io_align 65536 io_width 65536 sector_size 4096
num_stripes 2 sub_stripes 1
stripe 0 devid 2 offset 2156920832
dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
stripe 1 devid 1 offset 3553624064
dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
This caused read requests for a checksum item that to be routed to the
stale disk which triggered the aforementioned logic involving
EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
the balance operation.
Fixes: a826d6dcb3 ("Btrfs: check items for correctness as we search")
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlvoGIUACgkQxWXV+ddt
WDta6g//UJSLnVskCUwh8VyMdd47QArQnaLJowOH7wQn4Nqj+2hf04mCq/kv05ed
OneTezzONZc/qW9fiJGS+Dp77ln4JIDA1hWHtb/A4t9pYlksSQllJ3oiDUVsCp3q
2EbzrjuNz3iQO6TjKlaHX473CLCMQMXS2OXOUnCkF2maMJSdr86oi+j1UiSnud1/
C7uMYM3hG8nkfEfjjb1COpkS2MmzYcPruF5RDcbT/WOUfylTsjjX1E7rK/ZEqS9P
SUcp4uoZe9BNoyWMASLaM7oHE82day4X9MwQoCQFRcm0kq4CnRAZ8X4lBl+M70iW
7Olii/wNZ2SRiJf3jac/rpxoBHvEskXTHyiHTEmdHp4n1L1pL9GzGYIePQcX7uV1
Tb6ImdUUKCC//fPqyeB7cEk5yxqahmlFD3qZVs6GnQkzKrPE+ChLx+7PgcJC/XVh
C5ogNmJm+NvFOuTrYk9zSXg85B8gWHescDJrvNKVizIjw3nKmqiC+dXZljhzw+p8
HscK9EXsiS8jW9ClfJljXzIa4SeA/i7fQGe4tCKfIrCQ+OqUxWpFCEoxygchinfF
Rw90fJ0jX083oXsnfFcVdQpQ+SLSKka/aIRMvi58WRgLU3trci5NNN4TFg8TYRKP
xBDF/iF3sqXajc+xsjoqLhLioZL3Pa5VDNuhsFdois9M5JSRekU=
=K14u
-----END PGP SIGNATURE-----
Merge tag 'for-4.20-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Several fixes to recent release (4.19, fixes tagged for stable) and
other fixes"
* tag 'for-4.20-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix missing delayed iputs on unmount
Btrfs: fix data corruption due to cloning of eof block
Btrfs: fix infinite loop on inode eviction after deduplication of eof block
Btrfs: fix deadlock on tree root leaf when finding free extent
btrfs: avoid link error with CONFIG_NO_AUTO_INLINE
btrfs: tree-checker: Fix misleading group system information
Btrfs: fix missing data checksums after a ranged fsync (msync)
btrfs: fix pinned underflow after transaction aborted
Btrfs: fix cur_offset in the error case for nocow
There's a race between close_ctree() and cleaner_kthread().
close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
sees it set, but this is racy; the cleaner might have already checked
the bit and could be cleaning stuff. In particular, if it deletes unused
block groups, it will create delayed iputs for the free space cache
inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
longer running delayed iputs after a commit. Therefore, if the cleaner
creates more delayed iputs after delayed iputs are run in
btrfs_commit_super(), we will leak inodes on unmount and get a busy
inode crash from the VFS.
Fix it by parking the cleaner before we actually close anything. Then,
any remaining delayed iputs will always be handled in
btrfs_commit_super(). This also ensures that the commit in close_ctree()
is really the last commit, so we can get rid of the commit in
cleaner_kthread().
The fstest/generic/475 followed by 476 can trigger a crash that
manifests as a slab corruption caused by accessing the freed kthread
structure by a wake up function. Sample trace:
[ 5657.077612] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
[ 5657.079432] PGD 1c57a067 P4D 1c57a067 PUD da10067 PMD 0
[ 5657.080661] Oops: 0000 [#1] PREEMPT SMP
[ 5657.081592] CPU: 1 PID: 5157 Comm: fsstress Tainted: G W 4.19.0-rc8-default+ #323
[ 5657.083703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
[ 5657.086577] RIP: 0010:shrink_page_list+0x2f9/0xe90
[ 5657.091937] RSP: 0018:ffffb5c745c8f728 EFLAGS: 00010287
[ 5657.092953] RAX: 0000000000000074 RBX: ffffb5c745c8f830 RCX: 0000000000000000
[ 5657.094590] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a8747fdf3d0
[ 5657.095987] RBP: ffffb5c745c8f9e0 R08: 0000000000000000 R09: 0000000000000000
[ 5657.097159] R10: ffff9a8747fdf5e8 R11: 0000000000000000 R12: ffffb5c745c8f788
[ 5657.098513] R13: ffff9a877f6ff2c0 R14: ffff9a877f6ff2c8 R15: dead000000000200
[ 5657.099689] FS: 00007f948d853b80(0000) GS:ffff9a877d600000(0000) knlGS:0000000000000000
[ 5657.101032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5657.101953] CR2: 00000000000000cc CR3: 00000000684bd000 CR4: 00000000000006e0
[ 5657.103159] Call Trace:
[ 5657.103776] shrink_inactive_list+0x194/0x410
[ 5657.104671] shrink_node_memcg.constprop.84+0x39a/0x6a0
[ 5657.105750] shrink_node+0x62/0x1c0
[ 5657.106529] try_to_free_pages+0x1a4/0x500
[ 5657.107408] __alloc_pages_slowpath+0x2c9/0xb20
[ 5657.108418] __alloc_pages_nodemask+0x268/0x2b0
[ 5657.109348] kmalloc_large_node+0x37/0x90
[ 5657.110205] __kmalloc_node+0x236/0x310
[ 5657.111014] kvmalloc_node+0x3e/0x70
Fixes: 30928e9baa ("btrfs: don't run delayed_iputs in commit")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add trace ]
Signed-off-by: David Sterba <dsterba@suse.com>
We currently allow cloning a range from a file which includes the last
block of the file even if the file's size is not aligned to the block
size. This is fine and useful when the destination file has the same size,
but when it does not and the range ends somewhere in the middle of the
destination file, it leads to corruption because the bytes between the EOF
and the end of the block have undefined data (when there is support for
discard/trimming they have a value of 0x00).
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ export foo_size=$((256 * 1024 + 100))
$ xfs_io -f -c "pwrite -S 0x3c 0 $foo_size" /mnt/foo
$ xfs_io -f -c "pwrite -S 0xb5 0 1M" /mnt/bar
$ xfs_io -c "reflink /mnt/foo 0 512K $foo_size" /mnt/bar
$ od -A d -t x1 /mnt/bar
0000000 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
*
0524288 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c
*
0786528 3c 3c 3c 3c 00 00 00 00 00 00 00 00 00 00 00 00
0786544 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0790528 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
*
1048576
The bytes in the range from 786532 (512Kb + 256Kb + 100 bytes) to 790527
(512Kb + 256Kb + 4Kb - 1) got corrupted, having now a value of 0x00 instead
of 0xb5.
This is similar to the problem we had for deduplication that got recently
fixed by commit de02b9f6bb ("Btrfs: fix data corruption when
deduplicating between different files").
Fix this by not allowing such operations to be performed and return the
errno -EINVAL to user space. This is what XFS is doing as well at the VFS
level. This change however now makes us return -EINVAL instead of
-EOPNOTSUPP for cases where the source range maps to an inline extent and
the destination range's end is smaller then the destination file's size,
since the detection of inline extents is done during the actual process of
dropping file extent items (at __btrfs_drop_extents()). Returning the
-EINVAL error is done early on and solely based on the input parameters
(offsets and length) and destination file's size. This makes us consistent
with XFS and anyone else supporting cloning since this case is now checked
at a higher level in the VFS and is where the -EINVAL will be returned
from starting with kernel 4.20 (the VFS changed was introduced in 4.20-rc1
by commit 07d19dc9fb ("vfs: avoid problematic remapping requests into
partial EOF block"). So this change is more geared towards stable kernels,
as it's unlikely the new VFS checks get removed intentionally.
A test case for fstests follows soon, as well as an update to filter
existing tests that expect -EOPNOTSUPP to accept -EINVAL as well.
CC: <stable@vger.kernel.org> # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we are writing out a free space cache, during the transaction commit
phase, we can end up in a deadlock which results in a stack trace like the
following:
schedule+0x28/0x80
btrfs_tree_read_lock+0x8e/0x120 [btrfs]
? finish_wait+0x80/0x80
btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
btrfs_search_slot+0xf6/0x9f0 [btrfs]
? evict_refill_and_join+0xd0/0xd0 [btrfs]
? inode_insert5+0x119/0x190
btrfs_lookup_inode+0x3a/0xc0 [btrfs]
? kmem_cache_alloc+0x166/0x1d0
btrfs_iget+0x113/0x690 [btrfs]
__lookup_free_space_inode+0xd8/0x150 [btrfs]
lookup_free_space_inode+0x5b/0xb0 [btrfs]
load_free_space_cache+0x7c/0x170 [btrfs]
? cache_block_group+0x72/0x3b0 [btrfs]
cache_block_group+0x1b3/0x3b0 [btrfs]
? finish_wait+0x80/0x80
find_free_extent+0x799/0x1010 [btrfs]
btrfs_reserve_extent+0x9b/0x180 [btrfs]
btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
__btrfs_cow_block+0x11d/0x500 [btrfs]
btrfs_cow_block+0xdc/0x180 [btrfs]
btrfs_search_slot+0x3bd/0x9f0 [btrfs]
btrfs_lookup_inode+0x3a/0xc0 [btrfs]
? kmem_cache_alloc+0x166/0x1d0
btrfs_update_inode_item+0x46/0x100 [btrfs]
cache_save_setup+0xe4/0x3a0 [btrfs]
btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
At cache_save_setup() we need to update the inode item of a block group's
cache which is located in the tree root (fs_info->tree_root), which means
that it may result in COWing a leaf from that tree. If that happens we
need to find a free metadata extent and while looking for one, if we find
a block group which was not cached yet we attempt to load its cache by
calling cache_block_group(). However this function will try to load the
inode of the free space cache, which requires finding the matching inode
item in the tree root - if that inode item is located in the same leaf as
the inode item of the space cache we are updating at cache_save_setup(),
we end up in a deadlock, since we try to obtain a read lock on the same
extent buffer that we previously write locked.
So fix this by using the tree root's commit root when searching for a
block group's free space cache inode item when we are attempting to load
a free space cache. This is safe since block groups once loaded stay in
memory forever, as well as their caches, so after they are first loaded
we will never need to read their inode items again. For new block groups,
once they are created they get their ->cached field set to
BTRFS_CACHE_FINISHED meaning we will not need to read their inode item.
Reported-by: Andrew Nelson <andrew.s.nelson@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAPTELenq9x5KOWuQ+fa7h1r3nsJG8vyiTH8+ifjURc_duHh2Wg@mail.gmail.com/
Fixes: 9d66e233c7 ("Btrfs: load free space cache if it exists")
Tested-by: Andrew Nelson <andrew.s.nelson@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Note: this patch fixes a problem in a feature outside of btrfs ("kernel
hacking: add a config option to disable compiler auto-inlining") and is
applied ahead of time due to cross-subsystem dependencies.
On 32-bit ARM with gcc-8, I see a link error with the addition of the
CONFIG_NO_AUTO_INLINE option:
fs/btrfs/super.o: In function `btrfs_statfs':
super.c:(.text+0x67b8): undefined reference to `__aeabi_uldivmod'
super.c:(.text+0x67fc): undefined reference to `__aeabi_uldivmod'
super.c:(.text+0x6858): undefined reference to `__aeabi_uldivmod'
super.c:(.text+0x6920): undefined reference to `__aeabi_uldivmod'
super.c:(.text+0x693c): undefined reference to `__aeabi_uldivmod'
fs/btrfs/super.o:super.c:(.text+0x6958): more undefined references to `__aeabi_uldivmod' follow
So far this is the only file that shows the behavior, so I'd propose
to just work around it by marking the functions as 'static inline'
that normally get inlined here.
The reference to __aeabi_uldivmod comes from a div_u64() which has an
optimization for a constant division that uses a straight '/' operator
when the result should be known to the compiler. My interpretation is
that as we turn off inlining, gcc still expects the result to be constant
but fails to use that constant value.
Link: https://lkml.kernel.org/r/20181103153941.1881966-1-arnd@arndb.de
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
[ add the note ]
Signed-off-by: David Sterba <dsterba@suse.com>
block_group_err shows the group system as a decimal value with a '0x'
prefix, which is somewhat misleading.
Fix it to print hexadecimal, as was intended.
Fixes: fce466eab7 ("btrfs: tree-checker: Verify block_group_item")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Recently we got a massive simplification for fsync, where for the fast
path we no longer log new extents while their respective ordered extents
are still running.
However that simplification introduced a subtle regression for the case
where we use a ranged fsync (msync). Consider the following example:
CPU 0 CPU 1
mmap write to range [2Mb, 4Mb[
mmap write to range [512Kb, 1Mb[
msync range [512K, 1Mb[
--> triggers fast fsync
(BTRFS_INODE_NEEDS_FULL_SYNC
not set)
--> creates extent map A for this
range and adds it to list of
modified extents
--> starts ordered extent A for
this range
--> waits for it to complete
writeback triggered for range
[2Mb, 4Mb[
--> create extent map B and
adds it to the list of
modified extents
--> creates ordered extent B
--> start looking for and logging
modified extents
--> logs extent maps A and B
--> finds checksums for extent A
in the csum tree, but not for
extent B
fsync (msync) finishes
--> ordered extent B
finishes and its
checksums are added
to the csum tree
<power cut>
After replaying the log, we have the extent covering the range [2Mb, 4Mb[
but do not have the data checksum items covering that file range.
This happens because at the very beginning of an fsync (btrfs_sync_file())
we start and wait for IO in the given range [512Kb, 1Mb[ and therefore
wait for any ordered extents in that range to complete before we start
logging the extents. However if right before we start logging the extent
in our range [512Kb, 1Mb[, writeback is started for any other dirty range,
such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync
or msync (btrfs_sync_file() starts writeback before acquiring the inode's
lock), an ordered extent is created for that other range and a new extent
map is created to represent that range and added to the inode's list of
modified extents.
That means that we will see that other extent in that list when collecting
extents for logging (done at btrfs_log_changed_extents()) and log the
extent before the respective ordered extent finishes - namely before the
checksum items are added to the checksums tree, which is where
log_extent_csums() looks for the checksums, therefore making us log an
extent without logging its checksums. Before that massive simplification
of fsync, this wasn't a problem because besides looking for checkums in
the checksums tree, we also looked for them in any ordered extent still
running.
The consequence of data checksums missing for a file range is that users
attempting to read the affected file range will get -EIO errors and dmesg
reports the following:
[10188.358136] BTRFS info (device sdc): no csum found for inode 297 start 57344
[10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off 57344 csum 0x98f94189 expected csum 0x00000000 mirror 1
So fix this by skipping extents outside of our logging range at
btrfs_log_changed_extents() and leaving them on the list of modified
extents so that any subsequent ranged fsync may collect them if needed.
Also, if we find a hole extent outside of the range still log it, just
to prevent having gaps between extent items after replaying the log,
otherwise fsck will complain when we are not using the NO_HOLES feature
(fstest btrfs/056 triggers such case).
Fixes: e7175a6927 ("btrfs: remove the wait ordered logic in the log_one_extent path")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the cow_file_range fails, the related resources are unlocked
according to the range [start..end), so the unlock cannot be repeated in
run_delalloc_nocow.
In some cases (e.g. cur_offset <= end && cow_start != -1), cur_offset is
not updated correctly, so move the cur_offset update before
cow_file_range.
kernel BUG at mm/page-writeback.c:2663!
Internal error: Oops - BUG: 0 [#1] SMP
CPU: 3 PID: 31525 Comm: kworker/u8:7 Tainted: P O
Hardware name: Realtek_RTD1296 (DT)
Workqueue: writeback wb_workfn (flush-btrfs-1)
task: ffffffc076db3380 ti: ffffffc02e9ac000 task.ti: ffffffc02e9ac000
PC is at clear_page_dirty_for_io+0x1bc/0x1e8
LR is at clear_page_dirty_for_io+0x14/0x1e8
pc : [<ffffffc00033c91c>] lr : [<ffffffc00033c774>] pstate: 40000145
sp : ffffffc02e9af4f0
Process kworker/u8:7 (pid: 31525, stack limit = 0xffffffc02e9ac020)
Call trace:
[<ffffffc00033c91c>] clear_page_dirty_for_io+0x1bc/0x1e8
[<ffffffbffc514674>] extent_clear_unlock_delalloc+0x1e4/0x210 [btrfs]
[<ffffffbffc4fb168>] run_delalloc_nocow+0x3b8/0x948 [btrfs]
[<ffffffbffc4fb948>] run_delalloc_range+0x250/0x3a8 [btrfs]
[<ffffffbffc514c0c>] writepage_delalloc.isra.21+0xbc/0x1d8 [btrfs]
[<ffffffbffc516048>] __extent_writepage+0xe8/0x248 [btrfs]
[<ffffffbffc51630c>] extent_write_cache_pages.isra.17+0x164/0x378 [btrfs]
[<ffffffbffc5185a8>] extent_writepages+0x48/0x68 [btrfs]
[<ffffffbffc4f5828>] btrfs_writepages+0x20/0x30 [btrfs]
[<ffffffc00033d758>] do_writepages+0x30/0x88
[<ffffffc0003ba0f4>] __writeback_single_inode+0x34/0x198
[<ffffffc0003ba6c4>] writeback_sb_inodes+0x184/0x3c0
[<ffffffc0003ba96c>] __writeback_inodes_wb+0x6c/0xc0
[<ffffffc0003bac20>] wb_writeback+0x1b8/0x1c0
[<ffffffc0003bb0f0>] wb_workfn+0x150/0x250
[<ffffffc0002b0014>] process_one_work+0x1dc/0x388
[<ffffffc0002b02f0>] worker_thread+0x130/0x500
[<ffffffc0002b6344>] kthread+0x10c/0x110
[<ffffffc000284590>] ret_from_fork+0x10/0x40
Code: d503201f a9025bb5 a90363b7 f90023b9 (d4210000)
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rework the vfs_clone_file_range and vfs_dedupe_file_range infrastructure to use
a common .remap_file_range method and supply generic bounds and sanity checking
functions that are shared with the data write path. The current VFS
infrastructure has problems with rlimit, LFS file sizes, file time stamps,
maximum filesystem file sizes, stripping setuid bits, etc and so they are
addressed in these commits.
We also introduce the ability for the ->remap_file_range methods to return short
clones so that clones for vfs_copy_file_range() don't get rejected if the entire
range can't be cloned. It also allows filesystems to sliently skip deduplication
of partial EOF blocks if they are not capable of doing so without requiring
errors to be thrown to userspace.
All existing filesystems are converted to user the new .remap_file_range method,
and both XFS and ocfs2 are modified to make use of the new generic checking
infrastructure.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJb29gEAAoJEK3oKUf0dfodpOAQAL2VbHjvKXEwNMDTKscSRMmZ
Z0xXo3gamFKQ+VGOqy2g2lmAYQs9SAnTuCGTJ7zIAp7u+q8gzUy5FzKAwLS4Id6L
8siaY6nzlicfO04d0MdXnWz0f3xykChgzfdQfVUlUi7WrDioBUECLPmx4a+USsp1
DQGjLOZfoOAmn2rijdnH9RTEaHqg+8mcTaLN9TRav4gGqrWxldFKXw2y6ouFC7uo
/hxTRNXR9VI+EdbDelwBNXl9nU9gQA0WLOvRKwgUrtv6bSJohTPsmXt7EbBtNcVR
cl3zDNc1sLD1bLaRLEUAszI/33wXaaQgom1iB51obIcHHef+JxRNG/j6rUMfzxZI
VaauGv5EIvtaKN0LTAqVVLQ8t2MQFYfOr8TykmO+1UFog204aKRANdVMHDSjxD/0
dTGKJGcq+HnKQ+JHDbTdvuXEL8sUUl1FiLjOQbZPw63XmuddLKFUA2TOjXn6htbU
1h1MG5d9KjGLpabp2BQheczD08NuSmcrOBNt7IoeI3+nxr3HpMwprfB9TyaERy9X
iEgyVXmjjc9bLLRW7A2wm77aW64NvPs51wKMnvuNgNwnCewrGS6cB8WVj2zbQjH1
h3f3nku44s9ctNPSBzb/sJLnpqmZQ5t0oSmrMSN+5+En6rNTacoJCzxHRJBA7z/h
Z+C6y1GTZw0euY6Zjiwu
=CE/A
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull vfs dedup fixes from Dave Chinner:
"This reworks the vfs data cloning infrastructure.
We discovered many issues with these interfaces late in the 4.19 cycle
- the worst of them (data corruption, setuid stripping) were fixed for
XFS in 4.19-rc8, but a larger rework of the infrastructure fixing all
the problems was needed. That rework is the contents of this pull
request.
Rework the vfs_clone_file_range and vfs_dedupe_file_range
infrastructure to use a common .remap_file_range method and supply
generic bounds and sanity checking functions that are shared with the
data write path. The current VFS infrastructure has problems with
rlimit, LFS file sizes, file time stamps, maximum filesystem file
sizes, stripping setuid bits, etc and so they are addressed in these
commits.
We also introduce the ability for the ->remap_file_range methods to
return short clones so that clones for vfs_copy_file_range() don't get
rejected if the entire range can't be cloned. It also allows
filesystems to sliently skip deduplication of partial EOF blocks if
they are not capable of doing so without requiring errors to be thrown
to userspace.
Existing filesystems are converted to user the new remap_file_range
method, and both XFS and ocfs2 are modified to make use of the new
generic checking infrastructure"
* tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits)
xfs: remove [cm]time update from reflink calls
xfs: remove xfs_reflink_remap_range
xfs: remove redundant remap partial EOF block checks
xfs: support returning partial reflink results
xfs: clean up xfs_reflink_remap_blocks call site
xfs: fix pagecache truncation prior to reflink
ocfs2: remove ocfs2_reflink_remap_range
ocfs2: support partial clone range and dedupe range
ocfs2: fix pagecache truncation prior to reflink
ocfs2: truncate page cache for clone destination file before remapping
vfs: clean up generic_remap_file_range_prep return value
vfs: hide file range comparison function
vfs: enable remap callers that can handle short operations
vfs: plumb remap flags through the vfs dedupe functions
vfs: plumb remap flags through the vfs clone functions
vfs: make remap_file_range functions take and return bytes completed
vfs: remap helper should update destination inode metadata
vfs: pass remap flags to generic_remap_checks
vfs: pass remap flags to generic_remap_file_range_prep
vfs: combine the clone and dedupe into a single remap_file_range
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlvYVlMACgkQxWXV+ddt
WDv9xxAAmN+R9y+wOKjPkDoM7jr8hRR12YnTC8R4X8oD8QTnSXWOrmfO2prYpe7d
RyUxpuhqY+q+qvCxkp+BREa86a0zswhn/Z6HfLbHn4CaEhtchkMKR/gFOiYeL2B1
ZIJtqgnqOGP3N1oxfn3Zr586W3ECUJq+4EUD/1OWCxZHvn1DWWd7L3VL0884hAhE
kDVWhMdBm0nX1SOet/8haI0N98NLdyltsGdz80ooi65qR52YE4u2IoqXEg2z0AEM
EApA6vQeOIIuZaRznIl2xFiIMbQCoMRb2sQgwIPmWoXrfboJUHyHfFrKRv5gGUHg
DXjOXTvVdu9EEqm+1HughwZL/KRkr+OcXHHWwP+v51zsiyfbic+fegpM6a+Z0NjD
LCo5D1NSLulhpZHr14F3qM27+LYHEC4xxXrrzRoVq4DCoSq7xgj3ip49uXe1F4Rw
AyLeJGGOp8aqvPiD0BfgMVi4+YhWJUd/ob9Ldn9z+2y0XGQ2FDM58iCt+49+YIQi
e2ywGaHt3aXghPAo/mvnckfZMLNZ7DJPwA7K6ayJ3N23dqGW2CORkKrGy7xVGoZn
2AjIN1pSRLlknQJZsa6Yp1mPxnrBQfutTVxxUfKOtmEzydxMVS0g92+Lu/JRb4pu
F/tpq/lC7dpTvP08EWw0sLjIhLeqMKzbXk38pSfUm39yDgQ10e8=
=CiDs
-----END PGP SIGNATURE-----
Merge tag 'for-4.20-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull more btrfs updates from David Sterba:
"This contains a few minor updates and fixes that were under testing or
arrived shortly after the merge window freeze, mostly stable material"
* tag 'for-4.20-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix use-after-free when dumping free space
Btrfs: fix use-after-free during inode eviction
btrfs: move the dio_sem higher up the callchain
btrfs: don't run delayed_iputs in commit
btrfs: fix insert_reserved error handling
btrfs: only free reserved extent if we didn't insert it
btrfs: don't use ctl->free_space for max_extent_size
btrfs: set max_extent_size properly
btrfs: reset max_extent_size properly
MAINTAINERS: update my email address for btrfs
btrfs: delayed-ref: extract find_first_ref_head from find_ref_head
Btrfs: fix deadlock when writing out free space caches
Btrfs: fix assertion on fsync of regular file when using no-holes feature
Btrfs: fix null pointer dereference on compressed write path error
Change the remap_file_range functions to take a number of bytes to
operate upon and return the number of bytes they operated on. This is a
requirement for allowing fs implementations to return short clone/dedupe
results to the user, which will enable us to obey resource limits in a
graceful manner.
A subsequent patch will enable copy_file_range to signal to the
->clone_file_range implementation that it can handle a short length,
which will be returned in the function's return value. For now the
short return is not implemented anywhere so the behavior won't change --
either copy_file_range manages to clone the entire range or it tries an
alternative.
Neither clone ioctl can take advantage of this, alas.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Combine the clone_file_range and dedupe_file_range operations into a
single remap_file_range file operation dispatch since they're
fundamentally the same operation. The differences between the two can
be made in the prep functions.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Pull XArray conversion from Matthew Wilcox:
"The XArray provides an improved interface to the radix tree data
structure, providing locking as part of the API, specifying GFP flags
at allocation time, eliminating preloading, less re-walking the tree,
more efficient iterations and not exposing RCU-protected pointers to
its users.
This patch set
1. Introduces the XArray implementation
2. Converts the pagecache to use it
3. Converts memremap to use it
The page cache is the most complex and important user of the radix
tree, so converting it was most important. Converting the memremap
code removes the only other user of the multiorder code, which allows
us to remove the radix tree code that supported it.
I have 40+ followup patches to convert many other users of the radix
tree over to the XArray, but I'd like to get this part in first. The
other conversions haven't been in linux-next and aren't suitable for
applying yet, but you can see them in the xarray-conv branch if you're
interested"
* 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
radix tree: Remove multiorder support
radix tree test: Convert multiorder tests to XArray
radix tree tests: Convert item_delete_rcu to XArray
radix tree tests: Convert item_kill_tree to XArray
radix tree tests: Move item_insert_order
radix tree test suite: Remove multiorder benchmarking
radix tree test suite: Remove __item_insert
memremap: Convert to XArray
xarray: Add range store functionality
xarray: Move multiorder_check to in-kernel tests
xarray: Move multiorder_shrink to kernel tests
xarray: Move multiorder account test in-kernel
radix tree test suite: Convert iteration test to XArray
radix tree test suite: Convert tag_tagged_items to XArray
radix tree: Remove radix_tree_clear_tags
radix tree: Remove radix_tree_maybe_preload_order
radix tree: Remove split/join code
radix tree: Remove radix_tree_update_node_t
page cache: Finish XArray conversion
dax: Convert page fault handlers to XArray
...
Pull more ->lookup() cleanups from Al Viro:
"Some ->lookup() instances are still overcomplicating the life
for themselves, open-coding the stuff that would be handled by
d_splice_alias() just fine.
Simplify a couple of such cases caught this cycle and document
d_splice_alias() intended use"
* 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Document d_splice_alias() calling conventions for ->lookup() users.
simplify btrfs_lookup()
clean erofs_lookup()
At inode.c:evict_inode_truncate_pages(), when we iterate over the
inode's extent states, we access an extent state record's "state" field
after we unlocked the inode's io tree lock. This can lead to a
use-after-free issue because after we unlock the io tree that extent
state record might have been freed due to being merged into another
adjacent extent state record (a previous inflight bio for a read
operation finished in the meanwhile which unlocked a range in the io
tree and cause a merge of extent state records, as explained in the
comment before the while loop added in commit 6ca0709756 ("Btrfs: fix
hang during inode eviction due to concurrent readahead")).
Fix this by keeping a copy of the extent state's flags in a local
variable and using it after unlocking the io tree.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201189
Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This could result in a really bad case where we do something like
evict
evict_refill_and_join
btrfs_commit_transaction
btrfs_run_delayed_iputs
evict
evict_refill_and_join
btrfs_commit_transaction
... forever
We have plenty of other places where we run delayed iputs that are much
safer, let those do the work.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were not handling the reserved byte accounting properly for data
references. Metadata was fine, if it errored out the error paths would
free the bytes_reserved count and pin the extent, but it even missed one
of the error cases. So instead move this handling up into
run_one_delayed_ref so we are sure that both cases are properly cleaned
up in case of a transaction abort.
CC: stable@vger.kernel.org # 4.18+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we insert the file extent once the ordered extent completes we free
the reserved extent reservation as it'll have been migrated to the
bytes_used counter. However if we error out after this step we'll still
clear the reserved extent reservation, resulting in a negative
accounting of the reserved bytes for the block group and space info.
Fix this by only doing the free if we didn't successfully insert a file
extent for this extent.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
max_extent_size is supposed to be the largest contiguous range for the
space info, and ctl->free_space is the total free space in the block
group. We need to keep track of these separately and _only_ use the
max_free_space if we don't have a max_extent_size, as that means our
original request was too large to search any of the block groups for and
therefore wouldn't have a max_extent_size set.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can't use entry->bytes if our entry is a bitmap entry, we need to use
entry->max_extent_size in that case. Fix up all the logic to make this
consistent.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we use up our block group before allocating a new one we'll easily
get a max_extent_size that's set really really low, which will result in
a lot of fragmentation. We need to make sure we're resetting the
max_extent_size when we add a new chunk or add new space.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The find_ref_head shouldn't return the first entry even if no exact match
is found. So move the hidden behavior to higher level.
Besides, remove the useless local variables in the btrfs_select_ref_head.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
When writing out a block group free space cache we can end deadlocking
with ourselves on an extent buffer lock resulting in a warning like the
following:
[245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs]
[245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G
W I 4.16.8 #1
[245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs]
[245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246
[245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001
[245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20
[245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000
[245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630
[245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000
[245043.404780] FS: 0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
[245043.406050] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0
[245043.408670] Call Trace:
[245043.409977] btrfs_search_slot+0x761/0xa60 [btrfs]
[245043.411278] btrfs_insert_empty_items+0x62/0xb0 [btrfs]
[245043.412572] btrfs_insert_item+0x5b/0xc0 [btrfs]
[245043.413922] btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs]
[245043.415216] do_chunk_alloc+0x1e5/0x2a0 [btrfs]
[245043.416487] find_free_extent+0xcd0/0xf60 [btrfs]
[245043.417813] btrfs_reserve_extent+0x96/0x1e0 [btrfs]
[245043.419105] btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs]
[245043.420378] __btrfs_cow_block+0x127/0x550 [btrfs]
[245043.421652] btrfs_cow_block+0xee/0x190 [btrfs]
[245043.422979] btrfs_search_slot+0x227/0xa60 [btrfs]
[245043.424279] ? btrfs_update_inode_item+0x59/0x100 [btrfs]
[245043.425538] ? iput+0x72/0x1e0
[245043.426798] write_one_cache_group.isra.49+0x20/0x90 [btrfs]
[245043.428131] btrfs_start_dirty_block_groups+0x102/0x420 [btrfs]
[245043.429419] btrfs_commit_transaction+0x11b/0x880 [btrfs]
[245043.430712] ? start_transaction+0x8e/0x410 [btrfs]
[245043.432006] transaction_kthread+0x184/0x1a0 [btrfs]
[245043.433341] kthread+0xf0/0x130
[245043.434628] ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs]
[245043.435928] ? kthread_create_worker_on_cpu+0x40/0x40
[245043.437236] ret_from_fork+0x1f/0x30
[245043.441054] ---[ end trace 15abaa2aaf36827f ]---
This is because at write_one_cache_group() when we are COWing a leaf from
the extent tree we end up allocating a new block group (chunk) and,
because we have hit a threshold on the number of bytes reserved for system
chunks, we attempt to finalize the creation of new block groups from the
current transaction, by calling btrfs_create_pending_block_groups().
However here we also need to modify the extent tree in order to insert
a block group item, and if the location for this new block group item
happens to be in the same leaf that we were COWing earlier, we deadlock
since btrfs_search_slot() tries to write lock the extent buffer that we
locked before at write_one_cache_group().
We have already hit similar cases in the past and commit d9a0540a79
("Btrfs: fix deadlock when finalizing block group creation") fixed some
of those cases by delaying the creation of pending block groups at the
known specific spots that could lead to a deadlock. This change reworks
that commit to be more generic so that we don't have to add similar logic
to every possible path that can lead to a deadlock. This is done by
making __btrfs_cow_block() disallowing the creation of new block groups
(setting the transaction's can_flush_pending_bgs to false) before it
attempts to allocate a new extent buffer for either the extent, chunk or
device trees, since those are the trees that pending block creation
modifies. Once the new extent buffer is allocated, it allows creation of
pending block groups to happen again.
This change depends on a recent patch from Josef which is not yet in
Linus' tree, named "btrfs: make sure we create all new block groups" in
order to avoid occasional warnings at btrfs_trans_release_chunk_metadata().
Fixes: d9a0540a79 ("Btrfs: fix deadlock when finalizing block group creation")
CC: stable@vger.kernel.org # 4.4+
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753
Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/
Reported-by: E V <eliventer@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using the NO_HOLES feature and logging a regular file, we were
expecting that if we find an inline extent, that either its size in RAM
(uncompressed and unenconded) matches the size of the file or if it does
not, that it matches the sector size and it represents compressed data.
This assertion does not cover a case where the length of the inline extent
is smaller than the sector size and also smaller the file's size, such
case is possible through fallocate. Example:
$ mkfs.btrfs -f -O no-holes /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xb60 0 21" /mnt/foobar
$ xfs_io -c "falloc 40 40" /mnt/foobar
$ xfs_io -c "fsync" /mnt/foobar
In the above example we trigger the assertion because the inline extent's
length is 21 bytes while the file size is 80 bytes. The fallocate() call
merely updated the file's size and did not touch the existing inline
extent, as expected.
So fix this by adjusting the assertion so that an inline extent length
smaller than the file size is valid if the file size is smaller than the
filesystem's sector size.
A test case for fstests follows soon.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Fixes: a89ca6f24f ("Btrfs: fix fsync after truncate when no_holes feature is enabled")
CC: stable@vger.kernel.org # 4.14+
Link: https://lore.kernel.org/linux-btrfs/CAE5jQCfRSBC7n4pUTFJcmHh109=gwyT9mFkCOL+NKfzswmR=_Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At inode.c:compress_file_range(), under the "free_pages_out" label, we can
end up dereferencing the "pages" pointer when it has a NULL value. This
case happens when "start" has a value of 0 and we fail to allocate memory
for the "pages" pointer. When that happens we jump to the "cont" label and
then enter the "if (start == 0)" branch where we immediately call the
cow_file_range_inline() function. If that function returns 0 (success
creating an inline extent) or an error (like -ENOMEM for example) we jump
to the "free_pages_out" label and then access "pages[i]" leading to a NULL
pointer dereference, since "nr_pages" has a value greater than zero at
that point.
Fix this by setting "nr_pages" to 0 when we fail to allocate memory for
the "pages" pointer.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119
Fixes: 771ed689d2 ("Btrfs: Optimize compressed writeback and reads")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using bool is more suitable than int here, and add the comment about the
return_bigger.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The avg_delayed_ref_runtime can be referenced from the transaction
handle.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the transaction handle.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since trans is only used for referring to delayed_refs, there is no need
to pass it instead of delayed_refs to btrfs_delayed_ref_lock().
No functional change.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since trans is only used for referring to delayed_refs, there is no need
to pass it instead of delayed_refs to btrfs_select_ref_head(). No
functional change.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no reason to put this check in (!qgroup)'s else branch because
if qgroup is null, it will goto out directly. So move it out to reduce
indentation level. No functional change.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 581c176041 ("btrfs: Validate child tree block's level and first
key") has made tree block level check mandatory.
So if tree block level doesn't match, we won't get a valid extent
buffer. The extra WARN_ON() check can be removed completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
And add one line comment explaining what we're doing for each loop.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some qgroup trace events like btrfs_qgroup_release_data() and
btrfs_qgroup_free_delayed_ref() can still be triggered even if qgroup is
not enabled.
This is caused by the lack of qgroup status check before calling some
qgroup functions. Thankfully the functions can handle quota disabled
case well and just do nothing for qgroup disabled case.
This patch will do earlier check before triggering related trace events.
And for enabled <-> disabled race case:
1) For enabled->disabled case
Disable will wipe out all qgroups data including reservation and
excl/rfer. Even if we leak some reservation or numbers, it will
still be cleared, so nothing will go wrong.
2) For disabled -> enabled case
Current btrfs_qgroup_release_data() will use extent_io tree to ensure
we won't underflow reservation. And for delayed_ref we use
head->qgroup_reserved to record the reserved space, so in that case
head->qgroup_reserved should be 0 and we won't underflow.
CC: stable@vger.kernel.org # 4.14+
Reported-by: Chris Murphy <lists@colorremedies.com>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtQau7DtuUUeycCkZ36qjbKuxNzsgqJ7+sJ6W0dK_NLE3w@mail.gmail.com/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In a scenario like the following:
mkdir /mnt/A # inode 258
mkdir /mnt/B # inode 259
touch /mnt/B/bar # inode 260
sync
mv /mnt/B/bar /mnt/A/bar
mv -T /mnt/A /mnt/B
fsync /mnt/B/bar
<power fail>
After replaying the log we end up with file bar having 2 hard links, both
with the name 'bar' and one in the directory with inode number 258 and the
other in the directory with inode number 259. Also, we end up with the
directory inode 259 still existing and with the directory inode 258 still
named as 'A', instead of 'B'. In this scenario, file 'bar' should only
have one hard link, located at directory inode 258, the directory inode
259 should not exist anymore and the name for directory inode 258 should
be 'B'.
This incorrect behaviour happens because when attempting to log the old
parents of an inode, we skip any parents that no longer exist. Fix this
by forcing a full commit if an old parent no longer exists.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When replaying a log which contains a tmpfile (which necessarily has a
link count of 0) we end up calling inc_nlink(), at
fs/btrfs/tree-log.c:replay_one_buffer(), which produces a warning like
the following:
[195191.943673] WARNING: CPU: 0 PID: 6924 at fs/inode.c:342 inc_nlink+0x33/0x40
[195191.943723] CPU: 0 PID: 6924 Comm: mount Not tainted 4.19.0-rc6-btrfs-next-38 #1
[195191.943724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[195191.943726] RIP: 0010:inc_nlink+0x33/0x40
[195191.943728] RSP: 0018:ffffb96e425e3870 EFLAGS: 00010246
[195191.943730] RAX: 0000000000000000 RBX: ffff8c0d1e6af4f0 RCX: 0000000000000006
[195191.943731] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8c0d1e6af4f0
[195191.943731] RBP: 0000000000000097 R08: 0000000000000001 R09: 0000000000000000
[195191.943732] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb96e425e3a60
[195191.943733] R13: ffff8c0d10cff0c8 R14: ffff8c0d0d515348 R15: ffff8c0d78a1b3f8
[195191.943735] FS: 00007f570ee24480(0000) GS:ffff8c0dfb200000(0000) knlGS:0000000000000000
[195191.943736] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[195191.943737] CR2: 00005593286277c8 CR3: 00000000bb8f2006 CR4: 00000000003606f0
[195191.943739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[195191.943740] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[195191.943741] Call Trace:
[195191.943778] replay_one_buffer+0x797/0x7d0 [btrfs]
[195191.943802] walk_up_log_tree+0x1c1/0x250 [btrfs]
[195191.943809] ? rcu_read_lock_sched_held+0x3f/0x70
[195191.943825] walk_log_tree+0xae/0x1d0 [btrfs]
[195191.943840] btrfs_recover_log_trees+0x1d7/0x4d0 [btrfs]
[195191.943856] ? replay_dir_deletes+0x280/0x280 [btrfs]
[195191.943870] open_ctree+0x1c3b/0x22a0 [btrfs]
[195191.943887] btrfs_mount_root+0x6b4/0x800 [btrfs]
[195191.943894] ? rcu_read_lock_sched_held+0x3f/0x70
[195191.943899] ? pcpu_alloc+0x55b/0x7c0
[195191.943906] ? mount_fs+0x3b/0x140
[195191.943908] mount_fs+0x3b/0x140
[195191.943912] ? __init_waitqueue_head+0x36/0x50
[195191.943916] vfs_kern_mount+0x62/0x160
[195191.943927] btrfs_mount+0x134/0x890 [btrfs]
[195191.943936] ? rcu_read_lock_sched_held+0x3f/0x70
[195191.943938] ? pcpu_alloc+0x55b/0x7c0
[195191.943943] ? mount_fs+0x3b/0x140
[195191.943952] ? btrfs_remount+0x570/0x570 [btrfs]
[195191.943954] mount_fs+0x3b/0x140
[195191.943956] ? __init_waitqueue_head+0x36/0x50
[195191.943960] vfs_kern_mount+0x62/0x160
[195191.943963] do_mount+0x1f9/0xd40
[195191.943967] ? memdup_user+0x4b/0x70
[195191.943971] ksys_mount+0x7e/0xd0
[195191.943974] __x64_sys_mount+0x21/0x30
[195191.943977] do_syscall_64+0x60/0x1b0
[195191.943980] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[195191.943983] RIP: 0033:0x7f570e4e524a
[195191.943986] RSP: 002b:00007ffd83589478 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[195191.943989] RAX: ffffffffffffffda RBX: 0000563f335b2060 RCX: 00007f570e4e524a
[195191.943990] RDX: 0000563f335b2240 RSI: 0000563f335b2280 RDI: 0000563f335b2260
[195191.943992] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020
[195191.943993] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000563f335b2260
[195191.943994] R13: 0000563f335b2240 R14: 0000000000000000 R15: 00000000ffffffff
[195191.944002] irq event stamp: 8688
[195191.944010] hardirqs last enabled at (8687): [<ffffffff9cb004c3>] console_unlock+0x503/0x640
[195191.944012] hardirqs last disabled at (8688): [<ffffffff9ca037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
[195191.944018] softirqs last enabled at (8638): [<ffffffff9cc0a5d1>] __set_page_dirty_nobuffers+0x101/0x150
[195191.944020] softirqs last disabled at (8634): [<ffffffff9cc26bbe>] wb_wakeup_delayed+0x2e/0x60
[195191.944022] ---[ end trace 5d6e873a9a0b811a ]---
This happens because the inode does not have the flag I_LINKABLE set,
which is a runtime only flag, not meant to be persisted, set when the
inode is created through open(2) if the flag O_EXCL is not passed to it.
Except for the warning, there are no other consequences (like corruptions
or metadata inconsistencies).
Since it's pointless to replay a tmpfile as it would be deleted in a
later phase of the log replay procedure (it has a link count of 0), fix
this by not logging tmpfiles and if a tmpfile is found in a log (created
by a kernel without this change), skip the replay of the inode.
A test case for fstests follows soon.
Fixes: 471d557afe ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
CC: stable@vger.kernel.org # 4.18+
Reported-by: Martin Steigerwald <martin@lichtvoll.de>
Link: https://lore.kernel.org/linux-btrfs/3666619.NTnn27ZJZE@merkaba/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need it, rsv->size is set once and never changes throughout
its lifetime, so just use that for the reserve size.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I ran into an issue where there was some reference being held on an
inode that I couldn't track. This assert wasn't triggered, but it at
least rules out we're doing something stupid.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Allocating new chunks modifies both the extent and chunk tree, which can
trigger new chunk allocations. So instead of doing list_for_each_safe,
just do while (!list_empty()) so we make sure we don't exit with other
pending bg's still on our list.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We need to clear the max_extent_size when we clear bits from a bitmap
since it could have been from the range that contains the
max_extent_size.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're allocating a new space cache inode it's likely going to be
under a transaction handle, so we need to use memalloc_nofs_save() in
order to avoid deadlocks, and more importantly lockdep messages that
make xfstests fail.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We want to release the unused reservation we have since it refills the
delayed refs reserve, which will make everything go smoother when
running the delayed refs if we're short on our reservation.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs's btree locking has two modes, spinning mode and blocking mode,
while searching btree, locking is always acquired in spinning mode and
then converted to blocking mode if necessary, and in some hot paths we may
switch the locking back to spinning mode by btrfs_clear_path_blocking().
When acquiring locks, both of reader and writer need to wait for blocking
readers and writers to complete before doing read_lock()/write_lock().
The problem is that btrfs_clear_path_blocking() needs to switch nodes
in the path to blocking mode at first (by btrfs_set_path_blocking) to
make lockdep happy before doing its actual clearing blocking job.
When switching to blocking mode from spinning mode, it consists of
step 1) bumping up blocking readers counter and
step 2) read_unlock()/write_unlock(),
this has caused serious ping-pong effect if there're a great amount of
concurrent readers/writers, as waiters will be woken up and go to
sleep immediately.
1) Killing this kind of ping-pong results in a big improvement in my 1600k
files creation script,
MNT=/mnt/btrfs
mkfs.btrfs -f /dev/sdf
mount /dev/def $MNT
time fsmark -D 10000 -S0 -n 100000 -s 0 -L 1 -l /tmp/fs_log.txt \
-d $MNT/0 -d $MNT/1 \
-d $MNT/2 -d $MNT/3 \
-d $MNT/4 -d $MNT/5 \
-d $MNT/6 -d $MNT/7 \
-d $MNT/8 -d $MNT/9 \
-d $MNT/10 -d $MNT/11 \
-d $MNT/12 -d $MNT/13 \
-d $MNT/14 -d $MNT/15
w/o patch:
real 2m27.307s
user 0m12.839s
sys 13m42.831s
w/ patch:
real 1m2.273s
user 0m15.802s
sys 8m16.495s
1.1) latency histogram from funclatency[1]
Overall with the patch, there're ~50% less write lock acquisition and
the 95% max latency that write lock takes also reduces to ~100ms from
>500ms.
--------------------------------------------
w/o patch:
--------------------------------------------
Function = btrfs_tree_lock
msecs : count distribution
0 -> 1 : 2385222 |****************************************|
2 -> 3 : 37147 | |
4 -> 7 : 20452 | |
8 -> 15 : 13131 | |
16 -> 31 : 3877 | |
32 -> 63 : 3900 | |
64 -> 127 : 2612 | |
128 -> 255 : 974 | |
256 -> 511 : 165 | |
512 -> 1023 : 13 | |
Function = btrfs_tree_read_lock
msecs : count distribution
0 -> 1 : 6743860 |****************************************|
2 -> 3 : 2146 | |
4 -> 7 : 190 | |
8 -> 15 : 38 | |
16 -> 31 : 4 | |
--------------------------------------------
w/ patch:
--------------------------------------------
Function = btrfs_tree_lock
msecs : count distribution
0 -> 1 : 1318454 |****************************************|
2 -> 3 : 6800 | |
4 -> 7 : 3664 | |
8 -> 15 : 2145 | |
16 -> 31 : 809 | |
32 -> 63 : 219 | |
64 -> 127 : 10 | |
Function = btrfs_tree_read_lock
msecs : count distribution
0 -> 1 : 6854317 |****************************************|
2 -> 3 : 2383 | |
4 -> 7 : 601 | |
8 -> 15 : 92 | |
2) dbench also proves the improvement,
dbench -t 120 -D /mnt/btrfs 16
w/o patch:
Throughput 158.363 MB/sec
w/ patch:
Throughput 449.52 MB/sec
3) xfstests didn't show any additional failures.
One thing to note is that callers may set path->leave_spinning to have
all nodes in the path stay in spinning mode, which means callers are
ready to not sleep before releasing the path, but it won't cause
problems if they don't want to sleep in blocking mode.
[1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value of blocking_readers is increased only when the lock is taken
for read, no way we can fail the condition with the write lock.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The replace_wait and bio_counter were mistakenly added to fs_info in
commit c404e0dc2c ("Btrfs: fix use-after-free in the finishing
procedure of the device replace"), but they logically belong to
fs_info::dev_replace. Besides, bio_counter is a very generic name and is
confusing in bare fs_info context.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The exit sequence in btrfs_dev_replace_start does not allow to simply
add a label to the right place so the error handling after starting
transaction failure jumps there. Currently there's a lock that pairs
with the unlock in the section, which is unnecessary and only raises
questions. Add a variable to track the locking status and avoid the
extra locking.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Too trivial, the purpose can be simply documented in a comment.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The wrapper is too trivial, open coding does not make it less readable.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a single caller and the function name does not say it's actually
taking the lock, so open coding makes it more explicit.
For now, btrfs_dev_replace_read_lock is used instead of read_lock so
it's paired with the unlocking wrapper in the same block.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This member seems to be copied from the extent_buffer locking scheme and
is at least used to assert that the read lock/unlock is properly nested.
In some way. While the _inc/_dec are called inside the read lock
section, the asserts are both inside and outside, so the ordering is not
guaranteed and we can see read/inc/dec ordered in any way
(theoretically).
A missing call of btrfs_dev_replace_clear_lock_blocking could cause
unexpected read_locks count, so this at least looks like a valid
assertion, but this will become unnecessary with later updates.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although we have tree level check at tree read runtime, it's completely
based on its parent level.
We still need to do accurate level check to avoid invalid tree blocks
sneak into kernel space.
The check itself is simple, for leaf its level should always be 0.
For nodes its level should be in range [1, BTRFS_MAX_LEVEL - 1].
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For qgroup_trace_extent_swap(), if we find one leaf that needs to be
traced, we will also iterate all file extents and trace them.
This is OK if we're relocating data block groups, but if we're
relocating metadata block groups, balance code itself has ensured that
both subtree of file tree and reloc tree contain the same contents.
That's to say, if we're relocating metadata block groups, all file
extents in reloc and file tree should match, thus no need to trace them.
This should reduce the total number of dirty extents processed in metadata
block group balance.
[[Benchmark]] (with all previous enhancement)
Hardware:
VM 4G vRAM, 8 vCPUs,
disk is using 'unsafe' cache mode,
backing device is SAMSUNG 850 evo SSD.
Host has 16G ram.
Mkfs parameter:
--nodesize 4K (To bump up tree size)
Initial subvolume contents:
4G data copied from /usr and /lib.
(With enough regular small files)
Snapshots:
16 snapshots of the original subvolume.
each snapshot has 3 random files modified.
balance parameter:
-m
So the content should be pretty similar to a real world root fs layout.
| v4.19-rc1 | w/ patchset | diff (*)
---------------------------------------------------------------
relocated extents | 22929 | 22851 | -0.3%
qgroup dirty extents | 227757 | 140886 | -38.1%
time (sys) | 65.253s | 37.464s | -42.6%
time (real) | 74.032s | 44.722s | -39.6%
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reloc tree doesn't contribute to qgroup numbers, as we have accounted
them at balance time (see replace_path()).
Skipping the unneeded subtree tracing should reduce the overhead.
[[Benchmark]]
Hardware:
VM 4G vRAM, 8 vCPUs,
disk is using 'unsafe' cache mode,
backing device is SAMSUNG 850 evo SSD.
Host has 16G ram.
Mkfs parameter:
--nodesize 4K (To bump up tree size)
Initial subvolume contents:
4G data copied from /usr and /lib.
(With enough regular small files)
Snapshots:
16 snapshots of the original subvolume.
each snapshot has 3 random files modified.
balance parameter:
-m
So the content should be pretty similar to a real world root fs layout.
| v4.19-rc1 | w/ patchset | diff (*)
---------------------------------------------------------------
relocated extents | 22929 | 22900 | -0.1%
qgroup dirty extents | 227757 | 167139 | -26.6%
time (sys) | 65.253s | 50.123s | -23.2%
time (real) | 74.032s | 52.551s | -29.0%
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before this patch, with quota enabled during balance, we need to mark
the whole subtree dirty for quota.
E.g.
OO = Old tree blocks (from file tree)
NN = New tree blocks (from reloc tree)
File tree (src) Reloc tree (dst)
OO (a) NN (a)
/ \ / \
(b) OO OO (c) (b) NN NN (c)
/ \ / \ / \ / \
OO OO OO OO (d) OO OO OO NN (d)
For old balance + quota case, quota will mark the whole src and dst tree
dirty, including all the 3 old tree blocks in reloc tree.
It's doable for small file tree or new tree blocks are all located at
lower level.
But for large file tree or new tree blocks are all located at higher
level, this will lead to mark the whole tree dirty, and be unbelievably
slow.
This patch will change how we handle such balance with quota enabled
case.
Now we will search from (b) and (c) for any new tree blocks whose
generation is equal to @last_snapshot, and only mark them dirty.
In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d).
(NN(a) will be traced when COW happens for nodeptr modification). And
also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when COW
happens for nodeptr modification.)
For above case, we could skip 3 tree blocks, but for larger tree, we can
skip tons of unmodified tree blocks, and hugely speed up balance.
This patch will introduce a new function,
btrfs_qgroup_trace_subtree_swap(), which will do the following main
work:
1) Read out real root eb
And setup basic dst_path for later calls
2) Call qgroup_trace_new_subtree_blocks()
To trace all new tree blocks in reloc tree and their counter
parts in the file tree.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce new function, qgroup_trace_new_subtree_blocks(), to iterate
all new tree blocks in a reloc tree.
So that qgroup could skip unrelated tree blocks during balance, which
should hugely speedup balance speed when quota is enabled.
The function qgroup_trace_new_subtree_blocks() itself only cares about
new tree blocks in reloc tree.
All its main works are:
1) Read out tree blocks according to parent pointers
2) Do recursive depth-first search
Will call the same function on all its children tree blocks, with
search level set to current level -1.
And will also skip all children whose generation is smaller than
@last_snapshot.
3) Call qgroup_trace_extent_swap() to trace tree blocks
So although we have parameter list related to source file tree, it's not
used at all, but only passed to qgroup_trace_extent_swap().
Thus despite the tree read code, the core should be pretty short and all
about recursive depth-first search.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new function, qgroup_trace_extent_swap(), which will be used
later for balance qgroup speedup.
The basis idea of balance is swapping tree blocks between reloc tree and
the real file tree.
The swap will happen in highest tree block, but there may be a lot of
tree blocks involved.
For example:
OO = Old tree blocks
NN = New tree blocks allocated during balance
File tree (257) Reloc tree for 257
L2 OO NN
/ \ / \
L1 OO OO (a) OO NN (a)
/ \ / \ / \ / \
L0 OO OO OO OO OO OO NN NN
(b) (c) (b) (c)
When calling qgroup_trace_extent_swap(), we will pass:
@src_eb = OO(a)
@dst_path = [ nodes[1] = NN(a), nodes[0] = NN(c) ]
@dst_level = 0
@root_level = 1
In that case, qgroup_trace_extent_swap() will search from OO(a) to
reach OO(c), then mark both OO(c) and NN(c) as qgroup dirty.
The main work of qgroup_trace_extent_swap() can be split into 3 parts:
1) Tree search from @src_eb
It should acts as a simplified btrfs_search_slot().
The key for search can be extracted from @dst_path->nodes[dst_level]
(first key).
2) Mark the final tree blocks in @src_path and @dst_path qgroup dirty
NOTE: In above case, OO(a) and NN(a) won't be marked qgroup dirty.
They should be marked during preivous (@dst_level = 1) iteration.
3) Mark file extents in leaves dirty
We don't have good way to pick out new file extents only.
So we still follow the old method by scanning all file extents in
the leave.
This function can free us from keeping two pathes, thus later we only need
to care about how to iterate all new tree blocks in reloc tree.
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ copy changelog to function comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Number of qgroup dirty extents is directly linked to the performance
overhead, so add a new trace event, trace_qgroup_num_dirty_extents(), to
record how many dirty extents is processed in
btrfs_qgroup_account_extents().
This will be pretty handy to analyze later balance performance
improvement.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fs/btrfs/relocation.c:build_backref_tree() is some code from 2009 era,
although it works pretty fine, it's not that easy to understand.
Especially combined with the complex btrfs backref format.
This patch adds some basic comment for the backref build part of the
code, making it less hard to read, at least for backref searching part.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only aops we define for symlinks are identical to the aops for
regular files. This has been the case since symlink support was added in
commit 2b8d99a723 ("Btrfs: symlinks and hard links"). As far as I can
tell, there wasn't a good reason to have separate aops then, and there
isn't now, so let's just do what most other filesystems do and reuse the
same structure.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During buffered writes, we follow this basic series of steps:
again:
lock all the pages
wait for writeback on all the pages
Take the extent range lock
wait for ordered extents on the whole range
clean all the pages
if (copy_from_user_in_atomic() hits a fault) {
drop our locks
goto again;
}
dirty all the pages
release all the locks
The extra waiting, cleaning and locking are there to make sure we don't
modify pages in flight to the drive, after they've been crc'd.
If some of the pages in the range were already dirty when the write
began, and we need to goto again, we create a window where a dirty page
has been cleaned and unlocked. It may be reclaimed before we're able to
lock it again, which means we'll read the old contents off the drive and
lose any modifications that had been pending writeback.
We don't actually need to clean the pages. All of the other locking in
place makes sure we don't start IO on the pages, so we can just leave
them dirty for the duration of the write.
Fixes: 73d59314e6 (the original btrfs merge)
CC: stable@vger.kernel.org # v4.4+
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper does the same math and we take care about the special case
when flags is 0 too.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor the delayed refs loop by using the newly introduced
btrfs_run_delayed_refs_for_head function. This greatly simplifies
__btrfs_run_delayed_refs and makes it more obvious what is happening.
We now have 1 loop which iterates the existing delayed_heads and then
each selected ref head is processed by the new helper. All existing
semantics of the code are preserved so no functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch introduces a new helper encompassing the implicit inner loop
in __btrfs_run_delayed_refs which processes all the refs for a given
head. The code is mostly copy/paste, the only difference is that if we
detect a newer reference then -EAGAIN is returned so that callers can
react correctly.
Also, at the end of the loop the head is relocked and
btrfs_merge_delayed_refs is run again to retain the pre-refactoring
semantics.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is in preparation to refactor the giant loop in
__btrfs_run_delayed_refs. As a first step define a new function
which implements acquiring a reference to a btrfs_delayed_refs_head and
use it. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Avoid the inline ifdefs and use two sections for self-tests enabled and
disabled.
Though there could be no ifdef and unconditional test_bit of
BTRFS_FS_STATE_DUMMY_FS_INFO, the static inline can help to optimize out
any code that would depend on conditions using btrfs_is_testing.
As this is only for the testing code, drop unlikely().
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The data used only for tests are better placed at the end of the
structure so that they don't change the structure layout. All new
members of btrfs_root should be placed before.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper find_lock_delalloc_range is now conditionally built static,
dpending on whether the self-tests are enabled or not. There's a macro
that is supposed to hide the export, used only once. To discourage
further use, drop it an add a public wrapper for the helper needed by
tests.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).
While resolving indirect refs and missing refs, it always looks for the
first rb entry in a while loop, it's helpful to use rb_first_cached
instead.
For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
rb_first_cached() trades an extra pointer "leftmost" for doing the
same job as rb_first() but in O(1).
As evict_inode_truncate_pages() removes all extent mapping by always
looking for the first rb entry, it's helpful to use rb_first_cached
instead.
For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
rb_first_cached() trades an extra pointer "leftmost" for doing the same job as
rb_first() but in O(1).
Functions manipulating delayed_item need to get the first entry, this converts
it to use rb_first_cached().
For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).
Functions manipulating href->ref_tree need to get the first entry, this
converts href->ref_tree to use rb_first_cached().
For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).
Functions manipulating href_root need to get the first entry, this
converts href_root to use rb_first_cached().
This patch is first in the sequenct of similar updates to other rbtrees
and this is analysis of the expected behaviour and improvements.
There's a common pattern:
while (node = rb_first) {
entry = rb_entry(node)
next = rb_next(node)
rb_erase(node)
cleanup(entry)
}
rb_first needs to traverse the tree up to logN depth, rb_erase can
completely reshuffle the tree. With the caching we'll skip the traversal
in rb_first. That's a cached memory access vs looped pointer
dereference trade-off that IMHO has a clear winner.
Measurements show there's not much difference in a sample tree with
10000 nodes: 4.5s / rb_first and 4.8s / rb_first_cached. Real effects of
caching and pointer chasing are unpredictable though.
Further optimzations can be done to avoid the expensive rb_erase step.
In some cases it's ok to process the nodes in any order, so the tree can
be traversed in post-order, not rebalancing the children nodes and just
calling free. Care must be taken regarding the next node.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog from mail discussions ]
Signed-off-by: David Sterba <dsterba@suse.com>
While testing my backport I noticed there was a panic if I ran
generic/416 generic/417 generic/418 all in a row. This just happened to
uncover a race where we had outstanding IO after we destroy all of our
workqueues, and then we'd go to queue the endio work on those free'd
workqueues.
This is because we aren't waiting for the caching threads to be done
before freeing everything up, so to fix this make sure we wait on any
outstanding caching that's being done before we free up the block group,
so we're sure to be done with all IO by the time we get to
btrfs_stop_all_workers(). This fixes the panic I was seeing
consistently in testing.
------------[ cut here ]------------
kernel BUG at fs/btrfs/volumes.c:6112!
SMP PTI
Modules linked in:
CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 4.16.0-02155-g3553e54a578d-dirty #875
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
Workqueue: btrfs-cache btrfs_cache_helper
RIP: 0010:btrfs_map_bio+0x346/0x370
RSP: 0000:ffffc900061e79d0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff880071542e00 RCX: 0000000000533000
RDX: ffff88006bb74380 RSI: 0000000000000008 RDI: ffff880078160000
RBP: 0000000000000001 R08: ffff8800781cd200 R09: 0000000000503000
R10: ffff88006cd21200 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8800781cd200 R15: ffff880071542e00
FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000817ffc4 CR3: 0000000078314000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btree_submit_bio_hook+0x8a/0xd0
submit_one_bio+0x5d/0x80
read_extent_buffer_pages+0x18a/0x320
btree_read_extent_buffer_pages+0xbc/0x200
? alloc_extent_buffer+0x359/0x3e0
read_tree_block+0x3d/0x60
read_block_for_search.isra.30+0x1a5/0x360
btrfs_search_slot+0x41b/0xa10
btrfs_next_old_leaf+0x212/0x470
caching_thread+0x323/0x490
normal_work_helper+0xc5/0x310
process_one_work+0x141/0x340
worker_thread+0x44/0x3c0
kthread+0xf8/0x130
? process_one_work+0x340/0x340
? kthread_bind+0x10/0x10
ret_from_fork+0x35/0x40
RIP: btrfs_map_bio+0x346/0x370 RSP: ffffc900061e79d0
---[ end trace 827eb13e50846033 ]---
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 499f377f49 (btrfs: iterate over unused chunk space in FITRIM)
fixed free space trimming, but introduced latency when it was running.
This is due to it pinning the transaction using both a incremented
refcount and holding the commit root sem for the duration of a single
trim operation.
This was to ensure safety but it's unnecessary. We already hold the the
chunk mutex so we know that the chunk we're using can't be allocated
while we're trimming it.
In order to check against chunks allocated already in this transaction,
we need to check the pending chunks list. To to that safely without
joining the transaction (or attaching than then having to commit it) we
need to ensure that the dev root's commit root doesn't change underneath
us and the pending chunk lists stays around until we're done with it.
We can ensure the former by holding the commit root sem and the latter
by pinning the transaction. We do this now, but the critical section
covers the trim operation itself and we don't need to do that.
This patch moves the pinning and unpinning logic into helpers and unpins
the transaction after performing the search and check for pending
chunks.
Limiting the critical section of the transaction pinning improves the
latency substantially on slower storage (e.g. image files over NFS).
Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We check whether any device the file system is using supports discard in
the ioctl call, but then we attempt to trim free extents on every device
regardless of whether discard is supported. Due to the way we mask off
EOPNOTSUPP, we can end up issuing the trim operations on each free range
on devices that don't support it, just wasting time.
Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_trim_fs iterates over the fs_devices->alloc_list while holding the
device_list_mutex. The problem is that ->alloc_list is protected by the
chunk mutex. We don't want to hold the chunk mutex over the trim of the
entire file system. Fortunately, the ->dev_list list is protected by
the dev_list mutex and while it will give us all devices, including
read-only devices, we already just skip the read-only devices. Then we
can continue to take and release the chunk mutex while scanning each
device.
Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
fstrim on some btrfs only trims the unallocated space, not trimming any
space in existing block groups.
[CAUSE]
Before fstrim_range passed to btrfs_trim_fs(), it gets truncated to
range [0, super->total_bytes). So later btrfs_trim_fs() will only be
able to trim block groups in range [0, super->total_bytes).
While for btrfs, any bytenr aligned to sectorsize is valid, since btrfs
uses its logical address space, there is nothing limiting the location
where we put block groups.
For filesystem with frequent balance, it's quite easy to relocate all
block groups and bytenr of block groups will start beyond
super->total_bytes.
In that case, btrfs will not trim existing block groups.
[FIX]
Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
can get the unmodified range, which is normally set to [0, U64_MAX].
Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: f4c697e640 ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl")
CC: <stable@vger.kernel.org> # v4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function btrfs_trim_fs() doesn't handle errors in a consistent way. If
error happens when trimming existing block groups, it will skip the
remaining blocks and continue to trim unallocated space for each device.
The return value will only reflect the final error from device trimming.
This patch will fix such behavior by:
1) Recording the last error from block group or device trimming
The return value will also reflect the last error during trimming.
Make developer more aware of the problem.
2) Continuing trimming if possible
If we failed to trim one block group or device, we could still try
the next block group or device.
3) Report number of failures during block group and device trimming
It would be less noisy, but still gives user a brief summary of
what's going wrong.
Such behavior can avoid confusion for cases like failure to trim the
first block group and then only unallocated space is trimmed.
Reported-by: Chris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add bg_ret and dev_ret to the messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
When we fail to start a transaction in btrfs_dev_replace_start, we leave
dev_replace->replace_start set to STARTED but clear ->srcdev and
->tgtdev. Later, that can result in an Oops in
btrfs_dev_replace_progress when having state set to STARTED or SUSPENDED
implies that ->srcdev is valid.
Also fix error handling when the state is already STARTED or SUSPENDED
while starting. That, too, will clear ->srcdev and ->tgtdev even though
it doesn't own them. This should be an impossible case to hit since we
should be protected by the BTRFS_FS_EXCL_OP bit being set. Let's add an
ASSERT there while we're at it.
Fixes: e93c89c1aa (Btrfs: add new sources for device replace code)
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
remove_extent_mapping uses the variable "ret" for return value, but it
is not modified after initialzation. Further, I find that any of the
callers do not handle the return value and the callees are only simple
functions so the return values does not need to be passed.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_search_old_slot get_old_root is always used with the assumption
it cannot fail. However, this is not true in rare circumstance it can
fail and return null. This will lead to null point dereference when the
header is read. Fix this by checking the return value and properly
handling NULL by setting ret to -EIO and returning gracefully.
Coverity-id: 1087503
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_orphan_cleanup the final 'if (ret) goto out' cannot ever be
executed. This is due to the last assignment to 'ret' depending on the
return value of btrfs_iget. If an error other than -ENOENT is returned
then the loop is prematurely terminated by 'goto out'. On the other
hand, if the error value is ENOENT then a subsequent if branch is
executed that always re-assigns 'ret' and in case it's an error just
terminates the loop. No functional changes.
Coverity-id: 1437392
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we delete an inode,
btrfs_evict_inode() {
truncate_inode_pages_final()
truncate_inode_pages_range()
lock_page()
truncate_cleanup_page()
btrfs_invalidatepage()
wait_on_page_writeback
btrfs_lookup_ordered_range()
cancel_dirty_page()
unlock_page()
...
btrfs_wait_ordered_range()
...
As VFS has called ->invalidatepage() to get all ordered extents done (if
there are any) and truncated all page cache pages (no dirty pages to
writeback after this step), wait_ordered_range() is just a noop.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As long as @eb is marked with EXTENT_BUFFER_DIRTY, all of its pages
are dirty, so no need to set pages dirty again.
Ftrace showed that the loop took 10us on my dev box, so removing this
can save us at least 10us if eb is already dirty and otherwise avoid a
potentially expensive calls to set_page_dirty.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just in case that someone breaks the rule that pages are dirty as long
as eb is dirty. The next patch will dirty the pages conditionally.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the callchain:
btrfs_search_slot()
if (level != 0)
setup_nodes_for_search()
balance_level()
It is just impossible to have level=0 in balance_level, we can drop the
check.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unify the error handling of directory item lookups using IS_ERR_OR_NULL.
No functional changes.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
kcalloc is defined as:
kcalloc(size_t n, size_t size, gfp_t flags)
Although this won't cause problems in practice, btrfsic_read_block()
uses kcalloc with n and size in the opposite order.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of returning an error value and using one of the parameters for
returning the actual object we are interested in just refactor the
function to directly return btrfs_device *. Also bubble up the error
handling for the special BTRFS_ERROR_DEV_MISSING_NOT_FOUND value into
btrfs_rm_device. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function returns a numeric error value and additionally the
device found in one of its input parameters. Simplify this by making
the function directly return a pointer to btrfs_device. Additionally
adjust the caller to handle the case when we want to remove the
'missing' device and ENOENT is returned to return the expected
positive error value, parsed by progs. Finally, unexport the function
since it's not called outside of volume.c. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently this function returns an error code as well as uses one of
its arguments as a return value for struct btrfs_device. Change the
function so that it returns btrfs_device directly and use the usual
"encode error in pointer" mechanics if something goes wrong. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit d7df2c796d ("Btrfs attach delayed ref updates to delayed
ref heads"), check_delayed_ref() won't return -ENOENT.
In btrfs_cross_ref_exist(), two variables 'ret' and 'ret2' are
originally used to handle -ENOENT error case. Since the code is not
needed anymore, let's just remove 'ret2'.
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unless it's going to read inline extents from btree leaf to page,
btrfs_get_extent won't sleep during the period of holding path lock.
This sets leave_spinning at first and sets path to blocking mode right
before reading inline extent if that's the case. The benefit is that a
path in spinning mode typically has lower impact (faster) on waiters
rather than that in the blocking mode.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This fixes btrfs_get_extent to be consistent with our existing
declaration style.
Note: For the record, indentation styles that are accepted are both,
aligning under the opening ( and tab or double tab indentation on the
next line. Preferrably not spliting the type or long expressions in the
argument lists.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
Pointer 'tree' is being assigned but is never used hence it is redundant
and can be removed. This is a leftover from cleanup patch
00032d38ea ("btrfs: drop extent_io_ops::merge_bio_hook
callback").
Cleans up clang warning:
warning: variable 'tree' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pointer inode is being assigned but is never used hence it is redundant
and can be removed. It's been unused since the introduction in
38c227d87c ("Btrfs: snapshot-aware defrag").
Cleans up clang warning:
variable ‘inode’ set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 8b62f87bad ("Btrfs: rework outstanding_extents"),
manual operations of outstanding_extent in btrfs_inode are replaced by
btrfs_mod_outstanding_extents().
The one in cluster_pages_for_defrag seems to be lost, so replace it
of btrfs_mod_outstanding_extents().
Fixes: 8b62f87bad ("Btrfs: rework outstanding_extents")
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Here we're not releasing any space, but transferring bytes from
->bytes_may_use to ->bytes_reserved. The last change to the code in
commit 18513091af ("btrfs: update btrfs_space_info's
bytes_may_use timely") removed a conditional tracepoint and the logic
changed too but the tracepiont remained.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
trace_btrfs_get_extent() has nothing to do with path, place
btrfs_free_path ahead so that we can unlock path on error.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As next_state() is already defined to get the next state, use it in
find_first_extent_bit. No functional changes.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
For certain crafted image, whose csum root leaf has missing backref, if
we try to trigger write with data csum, it could cause deadlock with the
following kernel WARN_ON():
WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
Workqueue: btrfs-endio-write btrfs_endio_write_helper
RIP: 0010:btrfs_tree_lock+0x3e2/0x400
Call Trace:
btrfs_alloc_tree_block+0x39f/0x770
__btrfs_cow_block+0x285/0x9e0
btrfs_cow_block+0x191/0x2e0
btrfs_search_slot+0x492/0x1160
btrfs_lookup_csum+0xec/0x280
btrfs_csum_file_blocks+0x2be/0xa60
add_pending_csums+0xaf/0xf0
btrfs_finish_ordered_io+0x74b/0xc90
finish_ordered_fn+0x15/0x20
normal_work_helper+0xf6/0x500
btrfs_endio_write_helper+0x12/0x20
process_one_work+0x302/0x770
worker_thread+0x81/0x6d0
kthread+0x180/0x1d0
ret_from_fork+0x35/0x40
[CAUSE]
That crafted image has missing backref for csum tree root leaf. And
when we try to allocate new tree block, since there is no
EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot
and use it.
The extent tree of the image looks like:
Normal image | This fuzzed image
----------------------------------+--------------------------------
BG 29360128 | BG 29360128
One empty slot | One empty slot
29364224: backref to UUID tree | 29364224: backref to UUID tree
Two empty slots | Two empty slots
29376512: backref to CSUM tree | One empty slot (bad type) <<<
29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree
... | ...
Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs try to
alloc tree block, it's an valid slot for btrfs.
And for finish_ordered_write, when we need to insert csum, we try to CoW
csum tree root.
By accident, empty slots at bytenr BG_OFFSET, BG_OFFSET + 8K,
BG_OFFSET + 12K is already used by tree block COW for other trees, the
next empty slot is BG_OFFSET + 16K, which should be the backref for CSUM
tree.
But due to the bad type, btrfs can recognize it and still consider it as
an empty slot, and will try to use it for csum tree CoW.
Then in the following call trace, we will try to lock the new tree
block, which turns out to be the old csum tree root which is already
locked:
btrfs_search_slot() called on csum tree root, which is at 29376512
|- btrfs_cow_block()
|- btrfs_set_lock_block()
| |- Now locks tree block 29376512 (old csum tree root)
|- __btrfs_cow_block()
|- btrfs_alloc_tree_block()
|- btrfs_reserve_extent()
| Now it returns tree block 29376512, which extent tree
| shows its empty slot, but it's already hold by csum tree
|- btrfs_init_new_buffer()
|- btrfs_tree_lock()
| Triggers WARN_ON(eb->lock_owner == current->pid)
|- wait_event()
Wait lock owner to release the lock, but it's
locked by ourself, so it will deadlock
[FIX]
This patch will do the lock_owner and current->pid check at
btrfs_init_new_buffer().
So above deadlock can be avoided.
Since such problem can only happen in crafted image, we will still
trigger kernel warning for later aborted transaction, but with a little
more meaningful warning message.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
Reported-by: Xu Wen <wen.xu@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
when trying to recover balance:
kernel BUG at fs/btrfs/extent-tree.c:8956!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
Call Trace:
walk_up_tree+0x172/0x1f0 [btrfs]
btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
merge_reloc_roots+0xe1/0x1d0 [btrfs]
btrfs_recover_relocation+0x3ea/0x420 [btrfs]
open_ctree+0x1af3/0x1dd0 [btrfs]
btrfs_mount_root+0x66b/0x740 [btrfs]
mount_fs+0x3b/0x16a
vfs_kern_mount.part.9+0x54/0x140
btrfs_mount+0x16d/0x890 [btrfs]
mount_fs+0x3b/0x16a
vfs_kern_mount.part.9+0x54/0x140
do_mount+0x1fd/0xda0
ksys_mount+0xba/0xd0
__x64_sys_mount+0x21/0x30
do_syscall_64+0x60/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe
[CAUSE]
Extent tree corruption. In this particular case, reloc tree root's
owner is DATA_RELOC_TREE (should be TREE_RELOC), thus its backref is
corrupted and we failed the owner check in walk_up_tree().
[FIX]
It's pretty hard to take care of every extent tree corruption, but at
least we can remove such BUG_ON() and exit more gracefully.
And since in this particular image, DATA_RELOC_TREE and TREE_RELOC share
the same root (which is obviously invalid), we needs to make
__del_reloc_root() more robust to detect such invalid sharing to avoid
possible NULL dereference as root->node can be NULL in this case.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
Reported-by: Xu Wen <wen.xu@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_pin_log_trans defines the variable "ret" for return value, but it
is not modified after initialization. Further, I find that none of the
callers do handles the return value, so it is safe to drop the unneeded
"ret" and make it return void.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_free_reserved_bytes uses the variable "ret" for return value,
but it is not modified after initialzation. Further, I find that any of
the callers do not handle the return value, so it is safe to drop the
unneeded "ret" and return void. There are no callees that would need the
function to handle or pass the value either.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
@path is always NULL when it comes to the if branch.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
In the following case, rescan won't zero out the number of qgroup 1/0:
$ mkfs.btrfs -fq $DEV
$ mount $DEV /mnt
$ btrfs quota enable /mnt
$ btrfs qgroup create 1/0 /mnt
$ btrfs sub create /mnt/sub
$ btrfs qgroup assign 0/257 1/0 /mnt
$ dd if=/dev/urandom of=/mnt/sub/file bs=1k count=1000
$ btrfs sub snap /mnt/sub /mnt/snap
$ btrfs quota rescan -w /mnt
$ btrfs qgroup show -pcre /mnt
qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 16.00KiB 16.00KiB none none --- ---
0/257 1016.00KiB 16.00KiB none none 1/0 ---
0/258 1016.00KiB 16.00KiB none none --- ---
1/0 1016.00KiB 16.00KiB none none --- 0/257
So far so good, but:
$ btrfs qgroup remove 0/257 1/0 /mnt
WARNING: quotas may be inconsistent, rescan needed
$ btrfs quota rescan -w /mnt
$ btrfs qgroup show -pcre /mnt
qgoupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 16.00KiB 16.00KiB none none --- ---
0/257 1016.00KiB 16.00KiB none none --- ---
0/258 1016.00KiB 16.00KiB none none --- ---
1/0 1016.00KiB 16.00KiB none none --- ---
^^^^^^^^^^ ^^^^^^^^ not cleared
[CAUSE]
Before rescan we call qgroup_rescan_zero_tracking() to zero out all
qgroups' accounting numbers.
However we don't mark all qgroups dirty, but rely on rescan to do so.
If we have any high level qgroup without children, it won't be marked
dirty during rescan, since we cannot reach that qgroup.
This will cause QGROUP_INFO items of childless qgroups never get updated
in the quota tree, thus their numbers will stay the same in "btrfs
qgroup show" output.
[FIX]
Just mark all qgroups dirty in qgroup_rescan_zero_tracking(), so even if
we have childless qgroups, their QGROUP_INFO items will still get
updated during rescan.
Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Tested-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
struct scrub_ctx has an ->is_dev_replace member, so there's no point in
passing around is_dev_replace where sctx is available.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the replace is running the fs_devices::num_devices also includes
the replaced device, however in some operations like device delete and
balance it needs the actual num_devices without the repalced devices.
The function btrfs_num_devices() just provides that.
And here is a scenario how balance and repalce items could co-exist:
Consider balance is started and paused, now start the replace followed
by a unmount or system power-cycle. During following mount, the
open_ctree() first restarts the balance so it must check for the device
replace otherwise our num_devices calculation will be wrong.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation to add helper function to deduce the num_devices with
replace running, use assert instead of BUG_ON or WARN_ON. The number of
devices would not normally drop to 0 due to other checks so the assert
is sufficient.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, adjust the assert condition ]
Signed-off-by: David Sterba <dsterba@suse.com>
Kfree has taken the NULL pointer into account. So remove the check
before kfree.
The issue is detected with the help of Coccinelle.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As we're going to return right after the call, it's not necessary to get
update the new write_lock_level from unlock_up.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two members in struct btrfs_root which indicate root's
objectid: objectid and root_key.objectid.
They are both set to the same value in __setup_root():
static void __setup_root(struct btrfs_root *root,
struct btrfs_fs_info *fs_info,
u64 objectid)
{
...
root->objectid = objectid;
...
root->root_key.objectid = objecitd;
...
}
and not changed to other value after initialization.
grep in btrfs directory shows both are used in many places:
$ grep -rI "root->root_key.objectid" | wc -l
133
$ grep -rI "root->objectid" | wc -l
55
(4.17, inc. some noise)
It is confusing to have two similar variable names and it seems
that there is no rule about which should be used in a certain case.
Since ->root_key itself is needed for tree reloc tree, let's remove
'objecitd' member and unify code to use ->root_key.objectid in all places.
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since ret must be 0 here, don't have to return. No functional change
and code readability is not hurt.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass the root tree of dir, we can push that down to the
function itself.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using true and false here is closer to the expected semantic than using
0 and 1. No functional change.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Only when send_in_progress, we have to do something different such as
btrfs_warn() and return -EPERM. Therefore, we could check
send_in_progress first and process error handling, after the
root_item_lock has been got.
Just for better readability. No functional change.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce xarray value entries and tagged pointers to replace radix
tree exceptional entries. This is a slight change in encoding to allow
the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
value entry). It is also a change in emphasis; exceptional entries are
intimidating and different. As the comment explains, you can choose
to store values or pointers in the xarray and they are both first-class
citizens.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAluRLa8ACgkQxWXV+ddt
WDvc+BAAqxTMVngZ60WfktXzsS56OB6fu/R3DORgYcSZ0BCD4zTwoDlCjLhrCK6E
cmC+BMj+AspDQYiYESwGyFcN10sK0X7w7fa3wypTc4GNWxpkRm0Z6zT/kCvLUhdI
NlkMqAfsZ9N6iIXcR0qOxI7G55e3mpXPZGdFTk5rmDTv/9TqU0TMp9s8Zw5scn6R
ctdE+iE0lpRfNjF8ZDH1BtYIV4g2X81sZF/fkGz621HQfMTCjjPHFdlz+jlirBaf
BrYR4w4zjVuMKd3ZC5FHffVchbkvt29h6fAr4sEpJTwFJwd8pjI7GuPYWDQ918NB
TGX6EUP6usQqDK2zD405jCS6MbMshJm3uh5kmEpeNgK/tKJTln8Sbef/Xs93yIn2
+k9BMKOIcUHHBiv6PgCaZomcWCpii2S2u6vncqCnNuI4wK1RN3gHJc5YPhJArlrB
NUFJiTCQE6LWYOP2Hw+rggcrtBxli0bX7Mqp5FYFVdh5KBvolJE1o3B/JS8qpqRF
u0dPwbLHtTpTpXM5EfmM8a45S+DxuxTDBh3vdoAOM9LN/ivpeqqnFbHrIGmrTMjo
pQJ8aTrCwYMEMNu6oCV1cniFrOYRZ439hYjg524MjVXYCRyxhzAdVmVTEBaLjWCW
9GlGqEC7YZY2wLi5lPEGqxsIaVVELpettJB9KbBKmYB47VFWEf0=
=fu93
-----END PGP SIGNATURE-----
Merge tag 'for-4.19-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix for improper fsync after hardlink
- fix for a corruption during file deduplication
- use after free fixes
- RCU warning fix
- fix for buffered write to nodatacow file
* tag 'for-4.19-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: Fix suspicious RCU usage warning in btrfs_debug_in_rcu
btrfs: use after free in btrfs_quota_enable
btrfs: btrfs_shrink_device should call commit transaction at the end
btrfs: fix qgroup_free wrong num_bytes in btrfs_subvolume_reserve_metadata
Btrfs: fix data corruption when deduplicating between different files
Btrfs: sync log after logging new name
Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space
Commit 672d599041 ("btrfs: Use wrapper macro for rcu string to remove
duplicate code") replaces some open coded RCU string handling with macro.
It turns out that btrfs_debug_in_rcu() is used for the first time and
the macro lacks lock/unlock of RCU string for non-debug case (i.e. when
the message is not printed), leading to suspicious RCU usage warning
when CONFIG_PROVE_RCU is on.
Fix this by adding a wrapper to call lock/unlock for the non-debug case
too.
Fixes: 672d599041 ("btrfs: Use wrapper macro for rcu string to remove duplicate code")
Reported-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The issue here is that btrfs_commit_transaction() frees "trans" on both
the error and the success path. So the problem would be if
btrfs_commit_transaction() succeeds, and then qgroup_rescan_init()
fails. That means that "ret" is non-zero and "trans" is non-NULL and it
leads to a use after free inside the btrfs_end_transaction() macro.
Fixes: 340f1aa27f ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Test case btrfs/164 reports use-after-free:
[ 6712.084324] general protection fault: 0000 [#1] PREEMPT SMP
..
[ 6712.195423] btrfs_update_commit_device_size+0x75/0xf0 [btrfs]
[ 6712.201424] btrfs_commit_transaction+0x57d/0xa90 [btrfs]
[ 6712.206999] btrfs_rm_device+0x627/0x850 [btrfs]
[ 6712.211800] btrfs_ioctl+0x2b03/0x3120 [btrfs]
Reason for this is that btrfs_shrink_device adds the resized device to
the fs_devices::resized_devices after it has called the last commit
transaction.
So the list fs_devices::resized_devices is not empty when
btrfs_shrink_device returns. Now the parent function
btrfs_rm_device calls:
btrfs_close_bdev(device);
call_rcu(&device->rcu, free_device_rcu);
and then does the transactio ncommit. It goes through the
fs_devices::resized_devices in btrfs_update_commit_device_size and
leads to use-after-free.
Fix this by making sure btrfs_shrink_device calls the last needed
btrfs_commit_transaction before the return. This is consistent with what
the grow counterpart does and this makes sure the on-disk state is
persistent when the function returns.
Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Tested-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
After btrfs_qgroup_reserve_meta_prealloc(), num_bytes will be assigned
again by btrfs_calc_trans_metadata_size(). Once block_rsv fails, we
can't properly free the num_bytes of the previous qgroup_reserve. Use a
separate variable to store the num_bytes of the qgroup_reserve.
Delete the comment for the qgroup_reserved that does not exist and add a
comment about use_global_rsv.
Fixes: c4c129db5d ("btrfs: drop unused parameter qgroup_reserved")
CC: stable@vger.kernel.org # 4.18+
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we deduplicate extents between two different files we can end up
corrupting data if the source range ends at the size of the source file,
the source file's size is not aligned to the filesystem's block size
and the destination range does not go past the size of the destination
file size.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0x6b 0 2518890" /mnt/foo
# The first byte with a value of 0xae starts at an offset (2518890)
# which is not a multiple of the sector size.
$ xfs_io -c "pwrite -S 0xae 2518890 102398" /mnt/foo
# Confirm the file content is full of bytes with values 0x6b and 0xae.
$ od -t x1 /mnt/foo
0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
*
11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
*
11777540 ae ae ae ae ae ae ae ae
11777550
# Create a second file with a length not aligned to the sector size,
# whose bytes all have the value 0x6b, so that its extent(s) can be
# deduplicated with the first file.
$ xfs_io -f -c "pwrite -S 0x6b 0 557771" /mnt/bar
# Now deduplicate the entire second file into a range of the first file
# that also has all bytes with the value 0x6b. The destination range's
# end offset must not be aligned to the sector size and must be less
# then the offset of the first byte with the value 0xae (byte at offset
# 2518890).
$ xfs_io -c "dedupe /mnt/bar 0 1957888 557771" /mnt/foo
# The bytes in the range starting at offset 2515659 (end of the
# deduplication range) and ending at offset 2519040 (start offset
# rounded up to the block size) must all have the value 0xae (and not
# replaced with 0x00 values). In other words, we should have exactly
# the same data we had before we asked for deduplication.
$ od -t x1 /mnt/foo
0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
*
11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
*
11777540 ae ae ae ae ae ae ae ae
11777550
# Unmount the filesystem and mount it again. This guarantees any file
# data in the page cache is dropped.
$ umount /dev/sdb
$ mount /dev/sdb /mnt
$ od -t x1 /mnt/foo
0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
*
11461300 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00
11461320 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
11470000 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
*
11777540 ae ae ae ae ae ae ae ae
11777550
# The bytes in range 2515659 to 2519040 have a value of 0x00 and not a
# value of 0xae, data corruption happened due to the deduplication
# operation.
So fix this by rounding down, to the sector size, the length used for the
deduplication when the following conditions are met:
1) Source file's range ends at its i_size;
2) Source file's i_size is not aligned to the sector size;
3) Destination range does not cross the i_size of the destination file.
Fixes: e1d227a42e ("btrfs: Handle unaligned length in extent_same")
CC: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we add a new name for an inode which was logged in the current
transaction, we update the inode in the log so that its new name and
ancestors are added to the log. However when we do this we do not persist
the log, so the changes remain in memory only, and as a consequence, any
ancestors that were created in the current transaction are updated such
that future calls to btrfs_inode_in_log() return true. This leads to a
subsequent fsync against such new ancestor directories returning
immediately, without persisting the log, therefore after a power failure
the new ancestor directories do not exist, despite fsync being called
against them explicitly.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/A
$ mkdir /mnt/B
$ mkdir /mnt/A/C
$ touch /mnt/B/foo
$ xfs_io -c "fsync" /mnt/B/foo
$ ln /mnt/B/foo /mnt/A/C/foo
$ xfs_io -c "fsync" /mnt/A
<power failure>
After the power failure, directory "A" does not exist, despite the explicit
fsync on it.
Instead of fixing this by changing the behaviour of the explicit fsync on
directory "A" to persist the log instead of doing nothing, make the logging
of the new file name (which happens when creating a hard link or renaming)
persist the log. This approach not only is simpler, not requiring addition
of new fields to the inode in memory structure, but also gives us the same
behaviour as ext4, xfs and f2fs (possibly other filesystems too).
A test case for fstests follows soon.
Fixes: 12fcfd22fe ("Btrfs: tree logging unlink/rename fixes")
Reported-by: Vijay Chidambaram <vvijay03@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This contains two new features:
1) Stack file operations: this allows removal of several hacks from the
VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
2) Metadata only copy-up: when file is on lower layer and only metadata is
modified (except size) then only copy up the metadata and continue to
use the data from the lower file.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3srhAAKCRDh3BK/laaZ
PC6tAQCP+KklcN+TvNp502f+O/kATahSpgnun4NY1/p4I8JV+AEAzdlkTN3+MiAO
fn9brN6mBK7h59DO3hqedPLJy2vrgwg=
=QDXH
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
"This contains two new features:
- Stack file operations: this allows removal of several hacks from
the VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
- Metadata only copy-up: when file is on lower layer and only
metadata is modified (except size) then only copy up the metadata
and continue to use the data from the lower file"
* tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
ovl: Enable metadata only feature
ovl: Do not do metacopy only for ioctl modifying file attr
ovl: Do not do metadata only copy-up for truncate operation
ovl: add helper to force data copy-up
ovl: Check redirect on index as well
ovl: Set redirect on upper inode when it is linked
ovl: Set redirect on metacopy files upon rename
ovl: Do not set dentry type ORIGIN for broken hardlinks
ovl: Add an inode flag OVL_CONST_INO
ovl: Treat metacopy dentries as type OVL_PATH_MERGE
ovl: Check redirects for metacopy files
ovl: Move some dir related ovl_lookup_single() code in else block
ovl: Do not expose metacopy only dentry from d_real()
ovl: Open file with data except for the case of fsync
ovl: Add helper ovl_inode_realdata()
ovl: Store lower data inode in ovl_inode
ovl: Fix ovl_getattr() to get number of blocks from lower
ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
ovl: Copy up meta inode data from lowest data inode
ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
...
a_ops->readpages() is only ever used for read-ahead. Ensure that we
pass this information down to the block layer.
Link: http://lkml.kernel.org/r/20180621010725.17813-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit e9894fd3e3 ("Btrfs: fix snapshot vs nocow writting") forced
nocow writes to fallback to COW, during writeback, when a snapshot is
created. This resulted in writes made before creating the snapshot to
unexpectedly fail with ENOSPC during writeback when success (0) was
returned to user space through the write system call.
The steps leading to this problem are:
1. When it's not possible to allocate data space for a write, the
buffered write path checks if a NOCOW write is possible. If it is,
it will not reserve space and success (0) is returned to user space.
2. Then when a snapshot is created, the root's will_be_snapshotted
atomic is incremented and writeback is triggered for all inode's that
belong to the root being snapshotted. Incrementing that atomic forces
all previous writes to fallback to COW during writeback (running
delalloc).
3. This results in the writeback for the inodes to fail and therefore
setting the ENOSPC error in their mappings, so that a subsequent
fsync on them will report the error to user space. So it's not a
completely silent data loss (since fsync will report ENOSPC) but it's
a very unexpected and undesirable behaviour, because if a clean
shutdown/unmount of the filesystem happens without previous calls to
fsync, it is expected to have the data present in the files after
mounting the filesystem again.
So fix this by adding a new atomic named snapshot_force_cow to the
root structure which prevents this behaviour and works the following way:
1. It is incremented when we start to create a snapshot after triggering
writeback and before waiting for writeback to finish.
2. This new atomic is now what is used by writeback (running delalloc)
to decide whether we need to fallback to COW or not. Because we
incremented this new atomic after triggering writeback in the
snapshot creation ioctl, we ensure that all buffered writes that
happened before snapshot creation will succeed and not fallback to
COW (which would make them fail with ENOSPC).
3. The existing atomic, will_be_snapshotted, is kept because it is used
to force new buffered writes, that start after we started
snapshotting, to reserve data space even when NOCOW is possible.
This makes these writes fail early with ENOSPC when there's no
available space to allocate, preventing the unexpected behaviour of
writeback later failing with ENOSPC due to a fallback to COW mode.
Fixes: e9894fd3e3 ("Btrfs: fix snapshot vs nocow writting")
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAltxe7QACgkQxWXV+ddt
WDswMA//QlRO+Ln5CH+RlT4fyf1RQUQZblWss2zxrmlo1GRI3Ljf2DNsBE3rD7P4
NSiXfHmgkdjcQP6poPLJwHxwkNd4NFXglYg64wWO10RjHGhKglmH6ztU88wsPfr2
2RZv271/NvYIEkEi6kdyy8ilKeWMshOfyj3+PaeapQn67uJfimyiUvDgUgbvwH3c
yj0nVRLP1C7snNj4Atti/rjXMhG+m1UWfjRkZsmqlBp52k2UAcrtiwQK+DS5b9mL
aWLSaGmIcJtSMkNJPQBST9GTWbJfKTpceoCzkT0o3irvQpN2e2flAJ4ireL8q4mN
MvqJ7giPBFHNDcHEzN6VERvsaA1Rx9Vq20ieQl8JAMd4p/bi5ehN3ww+9vau5zCw
Pc8WeKEILKrLYEAgHOnUO1wxHw994Iv5CA26roTQ0HNXQJjyEZ4m40Ch6LzmfKPm
WKcHX14Uw22GKaFEXHTOpRZ0U0d1cMTcn5zaAajGsB9LwcaiLM+OiFSPtDkwUOB9
QGJHklZVXAD1IH9HFPuq85uUtXTLXbxsw1g8phEJGbmaVxxCOAUAXwEk3qxuZNbz
CHL3G5+l3JEXxfoJSbDW60kr8xic7teqQDszqqP2qlqtP15ty2xc9d5Q8MZajSTZ
H1z9+0gfjYYHrGuAp69MtCbdQhhDSqLyivjJJm0HBaKfVNGW2Xg=
=jBaz
-----END PGP SIGNATURE-----
Merge tag 'for-4.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Mostly fixes and cleanups, nothing big, though the notable thing is
the inserted/deleted lines delta -1124.
User visible changes:
- allow defrag on opened read-only files that have rw permissions;
similar to what dedupe will allow on such files
Core changes:
- tree checker improvements, reported by fuzzing:
* more checks for: block group items, essential trees
* chunk type validation
* mount time cross-checks that physical and logical chunks match
* switch more error codes to EUCLEAN aka EFSCORRUPTED
Fixes:
- fsync corner case fixes
- fix send failure when root has deleted files still open
- send, fix incorrect file layout after hole punching beyond eof
- fix races between mount and deice scan ioctl, found by fuzzing
- fix deadlock when delayed iput is called from writeback on the same
inode; rare but has been observed in practice, also removes code
- fix pinned byte accounting, using the right percpu helpers; this
should avoid some write IO inefficiency during low space conditions
- don't remove block group that still has pinned bytes
- reset on-disk device stats value after replace, otherwise this
would report stale values for the new device
Cleanups:
- time64_t/timespec64 cleanups
- remove remaining dead code in scrub handling NOCOW extents after
disabling it in previous cycle
- simplify fsync regarding ordered extents logic and remove all the
related code
- remove redundant arguments in order to reduce stack space
consumption
- remove support for V0 type of extents, not in use since 2.6.30
- remove several unused structure members
- fewer indirect function calls by inlining some callbacks
- qgroup rescan timing fixes
- vfs: iget cleanups"
* tag 'for-4.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (182 commits)
btrfs: revert fs_devices state on error of btrfs_init_new_device
btrfs: Exit gracefully when chunk map cannot be inserted to the tree
btrfs: Introduce mount time chunk <-> dev extent mapping check
btrfs: Verify that every chunk has corresponding block group at mount time
btrfs: Check that each block group has corresponding chunk at mount time
Btrfs: send, fix incorrect file layout after hole punching beyond eof
btrfs: Use wrapper macro for rcu string to remove duplicate code
btrfs: simplify btrfs_iget
btrfs: lift make_bad_inode into btrfs_iget
btrfs: simplify IS_ERR/PTR_ERR checks
btrfs: btrfs_iget never returns an is_bad_inode inode
btrfs: replace: Reset on-disk dev stats value after replace
btrfs: extent-tree: Remove unused __btrfs_free_block_rsv
btrfs: backref: Use ERR_CAST to return error code
btrfs: Remove redundant btrfs_release_path from btrfs_unlink_subvol
btrfs: Remove root parameter from btrfs_unlink_subvol
btrfs: Remove fs_info from btrfs_add_root_ref
btrfs: Remove fs_info from btrfs_del_root_ref
btrfs: Remove fs_info from btrfs_del_root
btrfs: Remove fs_info from btrfs_delete_delayed_dir_index
...
Pull vfs icache updates from Al Viro:
- NFS mkdir/open_by_handle race fix
- analogous solution for FUSE, replacing the one currently in mainline
- new primitive to be used when discarding halfway set up inodes on
failed object creation; gives sane warranties re icache lookups not
returning such doomed by still not freed inodes. A bunch of
filesystems switched to that animal.
- Miklos' fix for last cycle regression in iget5_locked(); -stable will
need a slightly different variant, unfortunately.
- misc bits and pieces around things icache-related (in adfs and jfs).
* 'work.mkdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
jfs: don't bother with make_bad_inode() in ialloc()
adfs: don't put inodes into icache
new helper: inode_fake_hash()
vfs: don't evict uninitialized inode
jfs: switch to discard_new_inode()
ext2: make sure that partially set up inodes won't be returned by ext2_iget()
udf: switch to discard_new_inode()
ufs: switch to discard_new_inode()
btrfs: switch to discard_new_inode()
new primitive: discard_new_inode()
kill d_instantiate_no_diralias()
nfs_instantiate(): prevent multiple aliases for directory inode
When btrfs hits error after modifying fs_devices in
btrfs_init_new_device() (such as btrfs_add_dev_item() returns error), it
leaves everything as is, but frees allocated btrfs_device. As a result,
fs_devices->devices and fs_devices->alloc_list contain already freed
btrfs_device, leading to later use-after-free bug.
Error path also messes the things like ->num_devices. While they go back
to the original value by unscanning btrfs devices, it is safe to revert
them here.
Fixes: 79787eaab4 ("btrfs: replace many BUG_ONs with proper error handling")
Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's entirely possible that a crafted btrfs image contains overlapping
chunks.
Although we can't detect such problem by tree-checker, it's not a
catastrophic problem, current extent map can already detect such problem
and return -EEXIST.
We just only need to exit gracefully and fail the mount.
Reported-by: Xu Wen <wen.xu@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200409
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will introduce chunk <-> dev extent mapping check, to protect
us against invalid dev extents or chunks.
Since chunk mapping is the fundamental infrastructure of btrfs, extra
check at mount time could prevent a lot of unexpected behavior (BUG_ON).
Reported-by: Xu Wen <wen.xu@gatech.edu>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200403
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200407
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If a crafted image has missing block group items, it could cause
unexpected behavior and breaks the assumption of 1:1 chunk<->block group
mapping.
Although we have the block group -> chunk mapping check, we still need
chunk -> block group mapping check.
This patch will do extra check to ensure each chunk has its
corresponding block group.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199847
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A crafted btrfs image with incorrect chunk<->block group mapping will
trigger a lot of unexpected things as the mapping is essential.
Although the problem can be caught by block group item checker
added in "btrfs: tree-checker: Verify block_group_item", it's still not
sufficient. A sufficiently valid block group item can pass the check
added by the mentioned patch but could fail to match the existing chunk.
This patch will add extra block group -> chunk mapping check, to ensure
we have a completely matching (start, len, flags) chunk for each block
group at mount time.
Here we reuse the original helper find_first_block_group(), which is
already doing the basic bg -> chunk checks, adding further checks of the
start/len and type flags.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199837
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing an incremental send, if we have a file in the parent snapshot
that has prealloc extents beyond EOF and in the send snapshot it got a
hole punch that partially covers the prealloc extents, the send stream,
when replayed by a receiver, can result in a file that has a size bigger
than it should and filled with zeroes past the correct EOF.
For example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "falloc -k 0 4M" /mnt/foobar
$ xfs_io -c "pwrite -S 0xea 0 1M" /mnt/foobar
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send -f /tmp/1.send /mnt/snap1
$ xfs_io -c "fpunch 1M 2M" /mnt/foobar
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -f /tmp/2.send -p /mnt/snap1 /mnt/snap2
$ stat --format %s /mnt/snap2/foobar
1048576
$ md5sum /mnt/snap2/foobar
d31659e82e87798acd4669a1e0a19d4f /mnt/snap2/foobar
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive -f /mnt/1.snap /mnt
$ btrfs receive -f /mnt/2.snap /mnt
$ stat --format %s /mnt/snap2/foobar
3145728
# --> should be 1Mb and not 3Mb (which was the end offset of hole
# punch operation)
$ md5sum /mnt/snap2/foobar
117baf295297c2a995f92da725b0b651 /mnt/snap2/foobar
# --> should be d31659e82e87798acd4669a1e0a19d4f as in the original fs
This issue actually happens only since commit ffa7c4296e ("Btrfs: send,
do not issue unnecessary truncate operations"), but before that commit we
were issuing a write operation full of zeroes (to "punch" a hole) which
was extending the file size beyond the correct value and then immediately
issue a truncate operation to the correct size and undoing the previous
write operation. Since the send protocol does not support fallocate, for
extent preallocation and hole punching, fix this by not even attempting
to send a "hole" (regular write full of zeroes) if it starts at an offset
greater then or equals to the file's size. This approach, besides being
much more simple then making send issue the truncate operation, adds the
benefit of avoiding the useless pair of write of zeroes and truncate
operations, saving time and IO at the receiver and reducing the size of
the send stream.
A test case for fstests follows soon.
Fixes: ffa7c4296e ("Btrfs: send, do not issue unnecessary truncate operations")
CC: stable@vger.kernel.org # 4.17+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Don't open-code iget_failed(), don't bother with btrfs_free_path(NULL),
move handling of positive return values of btrfs_lookup_inode() from
btrfs_read_locked_inode() to btrfs_iget() and kill now obviously
pointless ASSERT() in there.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need to check is_bad_inode() after the call of
btrfs_read_locked_inode() - it's exactly the same as checking return
value for being non-zero.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
IS_ERR(p) && PTR_ERR(p) == n is a weird way to spell p == ERR_PTR(n).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Just get rid of pointless checks.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
on-disk devs stats value is updated in btrfs_run_dev_stats(),
which is called during commit transaction, if device->dev_stats_ccnt
is not zero.
Since current replace operation does not touch dev_stats_ccnt,
on-disk dev stats value is not updated. Therefore "btrfs device stats"
may return old device's value after umount/mount
(Example: See "btrfs ins dump-t -t DEV $DEV" after btrfs/100 finish).
Fix this by just incrementing dev_stats_ccnt in
btrfs_dev_replace_finishing() when replace is succeeded and this will
update the values.
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no user of this function anymore.
This was forgotten to be removed in commit a575ceeb13
("Btrfs: get rid of unused orphan infrastructure").
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use ERR_CAST() instead of void * to make meaning clear.
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although it is safe to call this on already released paths with no locks
held or extent buffers, removing the redundant btrfs_release_path is
reasonable.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass the root tree of dir, we can push that down to the
function itself.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Leftover after fix e339a6b097 ("Btrfs: __btrfs_mod_ref should always
use no_quota"), that removed it from the function calls but not the
structure.
Signed-off-by: David Sterba <dsterba@suse.com>
The more common use case of send involves creating a RO snapshot and then
use it for a send operation. In this case it's not possible to have inodes
in the snapshot that have a link count of zero (inode with an orphan item)
since during snapshot creation we do the orphan cleanup. However, other
less common use cases for send can end up seeing inodes with a link count
of zero and in this case the send operation fails with a ENOENT error
because any attempt to generate a path for the inode, with the purpose
of creating it or updating it at the receiver, fails since there are no
inode reference items. One use case it to use a regular subvolume for
a send operation after turning it to RO mode or turning a RW snapshot
into RO mode and then using it for a send operation. In both cases, if a
file gets all its hard links deleted while there is an open file
descriptor before turning the subvolume/snapshot into RO mode, the send
operation will encounter an inode with a link count of zero and then
fail with errno ENOENT.
Example using a full send with a subvolume:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ btrfs subvolume create /mnt/sv1
$ touch /mnt/sv1/foo
$ touch /mnt/sv1/bar
# keep an open file descriptor on file bar
$ exec 73</mnt/sv1/bar
$ unlink /mnt/sv1/bar
# Turn the subvolume to RO mode and use it for a full send, while
# holding the open file descriptor.
$ btrfs property set /mnt/sv1 ro true
$ btrfs send -f /tmp/full.send /mnt/sv1
At subvol /mnt/sv1
ERROR: send ioctl failed with -2: No such file or directory
Example using an incremental send with snapshots:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ btrfs subvolume create /mnt/sv1
$ touch /mnt/sv1/foo
$ touch /mnt/sv1/bar
$ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
$ echo "hello world" >> /mnt/sv1/bar
$ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap2
# Turn the second snapshot to RW mode and delete file foo while
# holding an open file descriptor on it.
$ btrfs property set /mnt/snap2 ro false
$ exec 73</mnt/snap2/foo
$ unlink /mnt/snap2/foo
# Set the second snapshot back to RO mode and do an incremental send.
$ btrfs property set /mnt/snap2 ro true
$ btrfs send -f /tmp/inc.send -p /mnt/snap1 /mnt/snap2
At subvol /mnt/snap2
ERROR: send ioctl failed with -2: No such file or directory
So fix this by ignoring inodes with a link count of zero if we are either
doing a full send or if they do not exist in the parent snapshot (they
are new in the send snapshot), and unlink all paths found in the parent
snapshot when doing an incremental send (and ignoring all other inode
items, such as xattrs and extents).
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Reported-by: Martin Wilck <martin.wilck@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we end up with logging an inode reference item which has the same name
but different index from the one we have persisted, we end up failing when
replaying the log with an errno value of -EEXIST. The error comes from
btrfs_add_link(), which is called from add_inode_ref(), when we are
replaying an inode reference item.
Example scenario where this happens:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/foo
$ ln /mnt/foo /mnt/bar
$ sync
# Rename the first hard link (foo) to a new name and rename the second
# hard link (bar) to the old name of the first hard link (foo).
$ mv /mnt/foo /mnt/qwerty
$ mv /mnt/bar /mnt/foo
# Create a new file, in the same parent directory, with the old name of
# the second hard link (bar) and fsync this new file.
# We do this instead of calling fsync on foo/qwerty because if we did
# that the fsync resulted in a full transaction commit, not triggering
# the problem.
$ touch /mnt/bar
$ xfs_io -c "fsync" /mnt/bar
<power fail>
$ mount /dev/sdb /mnt
mount: mount /dev/sdb on /mnt failed: File exists
So fix this by checking if a conflicting inode reference exists (same
name, same parent but different index), removing it (and the associated
dir index entries from the parent inode) if it exists, before attempting
to add the new reference.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're trying to make a data reservation and we have to allocate a
data chunk we could leak ret == 1, as do_chunk_alloc() will return 1 if
it allocated a chunk. Since the end of the function is the success path
just return 0.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The exported helper just calls the static one. There's no obvious reason
to have them separate eg. for performance reasons where the static one
could be better optimized in the same unit. There's a slight decrease in
code size and stack consumption.
Signed-off-by: David Sterba <dsterba@suse.com>
Lock owner and nesting level have been unused since day 1, probably
copy&pasted from the extent_buffer locking scheme without much thinking.
The locking of device replace is simpler and does not need any lock
nesting.
Signed-off-by: David Sterba <dsterba@suse.com>
Added in 58176a9604 ("Btrfs: Add per-root block accounting and sysfs
entries") in 2007, the roots had names exported in sysfs. The code
was commented out in 4df27c4d5c ("Btrfs: change how subvolumes
are organized") and cleaned by 182608c829 ("btrfs: remove old
unused commented out code").
Signed-off-by: David Sterba <dsterba@suse.com>
Requiring a read-write descriptor conflicts both ways with exec,
returning ETXTBSY whenever you try to defrag a program that's currently
being run, or causing intermittent exec failures on a live system being
defragged.
As defrag doesn't change the file's contents in any way, there's no
reason to consider it a rw operation. Thus, let's check only whether
the file could have been opened rw. Such access control is still needed
as currently defrag can use extra disk space, and might trigger bugs.
We return EINVAL when the request is invalid; here it's ok but merely
the user has insufficient privileges. Thus, the EPERM return value
reflects the error better -- as discussed in the identical case for
dedupe.
According to codesearch.debian.net, no userspace program distinguishes
these values beyond strerror().
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: David Sterba <dsterba@suse.com>
[ fold the EPERM patch from Adam ]
Signed-off-by: David Sterba <dsterba@suse.com>
We recently ran into the following deadlock involving
btrfs_write_inode():
[ +0.005066] __schedule+0x38e/0x8c0
[ +0.007144] schedule+0x36/0x80
[ +0.006447] bit_wait+0x11/0x60
[ +0.006446] __wait_on_bit+0xbe/0x110
[ +0.007487] ? bit_wait_io+0x60/0x60
[ +0.007319] __inode_wait_for_writeback+0x96/0xc0
[ +0.009568] ? autoremove_wake_function+0x40/0x40
[ +0.009565] inode_wait_for_writeback+0x21/0x30
[ +0.009224] evict+0xb0/0x190
[ +0.006099] iput+0x1a8/0x210
[ +0.006103] btrfs_run_delayed_iputs+0x73/0xc0
[ +0.009047] btrfs_commit_transaction+0x799/0x8c0
[ +0.009567] btrfs_write_inode+0x81/0xb0
[ +0.008008] __writeback_single_inode+0x267/0x320
[ +0.009569] writeback_sb_inodes+0x25b/0x4e0
[ +0.008702] wb_writeback+0x102/0x2d0
[ +0.007487] wb_workfn+0xa4/0x310
[ +0.006794] ? wb_workfn+0xa4/0x310
[ +0.007143] process_one_work+0x150/0x410
[ +0.008179] worker_thread+0x6d/0x520
[ +0.007490] kthread+0x12c/0x160
[ +0.006620] ? put_pwq_unlocked+0x80/0x80
[ +0.008185] ? kthread_park+0xa0/0xa0
[ +0.007484] ? do_syscall_64+0x53/0x150
[ +0.007837] ret_from_fork+0x29/0x40
Writeback calls:
btrfs_write_inode
btrfs_commit_transaction
btrfs_run_delayed_iputs
If iput() is called on that same inode, evict() will wait for writeback
forever.
btrfs_write_inode() was originally added way back in 4730a4bc5b
("btrfs_dirty_inode") to support O_SYNC writes. However, ->write_inode()
hasn't been used for O_SYNC since 148f948ba8 ("vfs: Introduce new
helpers for syncing after writing to O_SYNC file or IS_SYNC inode"), so
btrfs_write_inode() is actually unnecessary (and leads to a bunch of
unnecessary commits). Get rid of it, which also gets rid of the
deadlock.
CC: stable@vger.kernel.org # 3.2+
Signed-off-by: Josef Bacik <jbacik@fb.com>
[Omar: new commit message]
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always passed a well-formed tgtdevice so the fs_info
can be referenced from there.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed 'device' argument which is always
a well-formed device.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed srcdev argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced form the passed transaction handle.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In find_free_extent() under checks: label, we have the following code:
search_start = ALIGN(offset, fs_info->stripesize);
/* move on to the next group */
if (search_start + num_bytes >
block_group->key.objectid + block_group->key.offset) {
btrfs_add_free_space(block_group, offset, num_bytes);
goto loop;
}
if (offset < search_start)
btrfs_add_free_space(block_group, offset,
search_start - offset);
BUG_ON(offset > search_start);
However ALIGN() is rounding up, thus @search_start >= @offset and that
BUG_ON() will never be triggered.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At send.c:full_send_tree() we were setting the 'key' variable in the loop
while never using it later. We were also using two btrfs_key variables
to store the initial key for search and the key found in every iteration
of the loop. So remove this useless key assignment and use the same
btrfs_key variable to store the initial search key and the key found in
each iteration. This was introduced in the initial send commit but was
never used (commit 31db9f7c23 ("Btrfs: introduce BTRFS_IOC_SEND for
btrfs send/receive").
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The data and metadata callback implementation both use the same
function. We can remove the call indirection and intermediate helper
completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The data and metadata callback implementation both use the same
function. We can remove the call indirection completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All implementations of the callback are trivial and do the same and
there's only one user. Merge everything together.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The end_io callbacks passed to btrfs_wq_submit_bio
(btrfs_submit_bio_done and btree_submit_bio_done) are effectively the
same code, there's no point to do the indirection. Export
btrfs_submit_bio_done and call it directly.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After splitting the start and end hooks in a758781d4b ("btrfs:
separate types for submit_bio_start and submit_bio_done"), some of
the function arguments were dropped but not removed from the structure.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduced by c6100a4b4e ("Btrfs: replace tree->mapping with
tree->private_data") to be used in run_one_async_done where it got
unused after 736cd52e0c ("Btrfs: remove nr_async_submits and
async_submit_draining").
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reported in https://bugzilla.kernel.org/show_bug.cgi?id=199839, with an
image that has an invalid chunk type but does not return an error.
Add chunk type check in btrfs_check_chunk_valid, to detect the wrong
type combinations.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199839
Reported-by: Xu Wen <wen.xu@gatech.edu>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
EXTENT_BUFFER_DUMMY is an awful name for this flag. Buffers which have
this flag set are not in any way dummy. Rather, they are private in the
sense that are not mapped and linked to the global buffer tree. This
flag has subtle implications to the way free_extent_buffer works for
example, as well as controls whether page->mapping->private_lock is held
during extent_buffer release. Pages for an unmapped buffer cannot be
under io, nor can they be written by a 3rd party so taking the lock is
unnecessary.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ EXTENT_BUFFER_UNMAPPED, update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Remove stale comment since there is no longer an eb->eb_lock and
document the locking expectation with a lockdep_assert_held statement.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function used to release one page (and always the first one), but
not anymore since a50924e3a4 ("btrfs: drop constant param
from btrfs_release_extent_buffer_page"). Update the name and comment.
Signed-off-by: David Sterba <dsterba@suse.com>
The purpose of the function is to free all the pages comprising an
extent buffer. This can be achieved with a simple for loop rather than
the slightly more involved 'do {} while' construct. So rewrite the
loop using a 'for' construct. Additionally we can never have an
extent_buffer that has 0 pages so remove the check for index == 0. No
functional changes.
The reversed order used to have a meaning in the past where the first
page served as a blocking point for several callers. See eg
4f2de97ace ("Btrfs: set page->private to the eb").
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit eb14ab8ed2 ("Btrfs: fix page->private races") fixed a genuine
race between extent buffer initialisation and btree_releasepage.
Unfortunately as the code has evolved the comments weren't changed which
made them slightly wrong and they weren't very clear in the fist place.
Fix this by (hopefully) rewording them in a more approachable manner.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current version of the page unlocking code was added in
727011e07c ("Btrfs: allow metadata blocks larger than the page size")
but even in this commit that particular flag was never used per-se. In
fact, btrfs only uses PageChecked for data pages to identify pages
which have been dirtied but don't have ORDERED bit set. For more
information see 247e743cbe ("Btrfs: Use async helpers to deal with
pages that have been improperly dirtied").
However, this doesn't apply to extent buffer pages. The important bit
here is that the pages are unlocked AFTER the extent buffer has been
properly recorded in the radix tree to avoid races with
btree_releasepage. Let's exploit this fact and simplify the page
unlocking sequence by unlocking the pages in-order and removing the
redundant PageChecked flag setting/clearing.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove the remaining code that misused the page cache pages during
device replace and could cause data corruption for compressed nodatasum
extents. Such files do not normally exist but there's a bug that allows
this combination and the corruption was exposed by device replace fixup
code.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are many places that open code the duplicity factor of the block
group profiles, create a common helper. This can be easily extended for
more copies.
Signed-off-by: David Sterba <dsterba@suse.com>
We have assigned the %fs_info->fs_devices in %fs_devices as its not
modified just use it for the mutex_lock().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info can be fetched from the transaction handle directly.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be fetched from the transaction handle. In addition, remove the
WARN_ON(trans == NULL) because it's not possible to hit this condition.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename btrfs_parse_early_options() to btrfs_parse_device_options(). As
btrfs_parse_early_options() parses the -o device options and scan the
device provided. So this rename specifies its action. Also the function
name is in line with btrfs_parse_subvol_options().
No functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 0b246afa62 ("btrfs: root->fs_info cleanup, add fs_info
convenience variables"), the srcroot is no longer used to get
fs_info::nodesize. In fact, it can be dropped after commit 707e8a0715
("btrfs: use nodesize everywhere, kill leafsize").
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a small helper, btrfs_mark_bg_unused(), to acquire locks and
add a block group to unused_bgs list.
No functional modification, and only 3 callers are involved.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
do_chunk_alloc implements logic to detect whether there is currently
pending chunk allocation (by means of space_info->chunk_alloc being
set) and if so it loops around to the 'again' label. Additionally,
based on the state of the space_info (e.g. whether it's full or not)
and the return value of should_alloc_chunk() it decides whether this
is a "hard" error (ENOSPC) or we can just return 0.
This patch refactors all of this:
1. Put order to the scattered ifs handling the various cases in an
easy-to-read if {} else if{} branches. This makes clear the various
cases we are interested in handling.
2. Call should_alloc_chunk only once and use the result in the
if/else if constructs. All of this is done under space_info->lock, so
even before multiple calls of should_alloc_chunk were unnecessary.
3. Rewrite the "do {} while()" loop currently implemented via label
into an explicit loop construct.
4. Move the mutex locking for the case where the caller is the one doing
the allocation. For the case where the caller needs to wait a concurrent
allocation, introduce a pair of mutex_lock/mutex_unlock to act as a
barrier and reword the comment.
5. Switch local vars to bool type where pertinent.
All in all this shouldn't introduce any functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In commit b150a4f10d ("Btrfs: use a percpu to keep track of possibly
pinned bytes") we use total_bytes_pinned to track how many bytes we are
going to free in this transaction. When we are close to ENOSPC, we check it
and know if we can make the allocation by commit the current transaction.
For every data/metadata extent we are going to free, we add
total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and
release it in unpin_extent_range() when we finish the transaction. So this
is a variable we frequently update but rarely read - just the suitable
use of percpu_counter. But in previous commit we update total_bytes_pinned
by default 32 batch size, making every update essentially a spin lock
protected update. Since every spin lock/unlock operation involves syncing
a globally used variable and some kind of barrier in a SMP system, this is
more expensive than using total_bytes_pinned as a simple atomic64_t.
So fix this by using a customized batch size. Since we only read
total_bytes_pinned when we are close to ENOSPC and fail to allocate new
chunk, we can use a really large batch size and have nearly no penalty
in most cases.
[Test]
We tested the patch on a 4-cores x86 machine:
1. fallocate a 16GiB size test file
2. take snapshot (so all following writes will be COW)
3. run a 180 sec, 4 jobs, 4K random write fio on test file
We also added a temporary lockdep class on percpu_counter's spin lock
used by total_bytes_pinned to track it by lock_stat.
[Results]
unpatched:
lock_stat version 0.4
-----------------------------------------------------------------------
class name con-bounces contentions
waittime-min waittime-max waittime-total waittime-avg acq-bounces
acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
total_bytes_pinned_percpu: 82 82
0.21 0.61 29.46 0.36 298340
635973 0.09 11.01 173476.25 0.27
patched:
lock_stat version 0.4
-----------------------------------------------------------------------
class name con-bounces contentions
waittime-min waittime-max waittime-total waittime-avg acq-bounces
acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg
total_bytes_pinned_percpu: 1 1
0.62 0.62 0.62 0.62 13601
31542 0.14 9.61 11016.90 0.35
[Analysis]
Since the spin lock only protects a single in-memory variable, the
contentions (number of lock acquisitions that had to wait) in both
unpatched and patched version are low. But when we see acquisitions and
acq-bounces, we get much lower counts in patched version. Here the most
important metric is acq-bounces. It means how many times the lock gets
transferred between different cpus, so the patch can really reduce
cacheline bouncing of spin lock (also the global counter of percpu_counter)
in a SMP system.
Fixes: b150a4f10d ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use customized, nodesize batch value to update dirty_metadata_bytes.
We should also use batch version of compare function or we will easily
goto fast path and get false result from percpu_counter_compare().
Fixes: e2d845211e ("Btrfs: use percpu counter for dirty metadata count")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Return device pointer (with the IS_ERR semantics) from
btrfs_scan_one_device so we don't have to return in through pointer.
And since btrfs_fs_devices can be obtained from btrfs_device, return that.
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ fixed conflics after recent changes to btrfs_scan_one_device ]
Signed-off-by: David Sterba <dsterba@suse.com>
fs_devices is always passed to btrfs_scan_one_device which overrides it.
In the call stack below fs_devices is passed to btrfs_scan_one_device
from btrfs_mount_root. In btrfs_mount_root the output fs_devices of
this call stack is not used.
btrfs_mount_root
btrfs_parse_early_options
btrfs_scan_one_device
So, it is not necessary to pass fs_devices from btrfs_mount_root, using
a local variable in btrfs_parse_early_options is enough.
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Technically this extends the critical section covered by uuid_mutex to:
- parse early mount options -- here we can call device scan on paths
that can be passed as 'device=/dev/...'
- scan the device passed to mount
- open the devices related to the fs_devices -- this increases
fs_devices::opened
The race can happen when mount calls one of the scans and there's
another one called eg. by mkfs or 'btrfs dev scan':
Mount Scan
----- ----
scan_one_device (dev1, fsid1)
scan_one_device (dev2, fsid1)
add the device
free stale devices
fsid1 fs_devices::opened == 0
find fsid1:dev1
free fsid1:dev1
if it's the last one,
free fs_devices of fsid1
too
open_devices (dev1, fsid1)
dev1 not found
When fixed, the uuid mutex will make sure that mount will increase
fs_devices::opened and this will not be touched by the racing scan
ioctl.
Reported-and-tested-by: syzbot+909a5177749d7990ffa4@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+ceb2606025ec1cc3479c@syzkaller.appspotmail.com
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation to take a big lock, move resource initialization before
the critical section. It's not obvious from the diff, the desired order
is:
- initialize mount security options
- allocate temporary fs_info
- allocate superblock buffers
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Prepartory work to fix race between mount and device scan.
btrfs_parse_early_options calls the device scan from mount and we'll
need to let mount completely manage the critical section.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Prepartory work to fix race between mount and device scan.
The callers will have to manage the critical section, eg. mount wants to
scan and then call btrfs_open_devices without the ioctl scan walking in
and modifying the fs devices in the meantime.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Prepartory work to fix race between mount and device scan.
The callers will have to manage the critical section, eg. mount wants to
scan and then call btrfs_open_devices without the ioctl scan walking in
and modifying the fs devices in the meantime.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_free_stale_devices() finds a stale (not opened) device matching
path in the fs_uuid list. We are already under uuid_mutex so when we
check for each fs_devices, hold the device_list_mutex too.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Over the years we named %fs_devices and %devices to represent the
struct btrfs_fs_devices and the struct btrfs_device. So follow the same
scheme here too. No functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make sure the device_list_lock is held the whole time:
* when the device is being looked up
* new device is initialized and put to the list
* the list counters are updated (fs_devices::opened, fs_devices::total_devices)
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_free_stale_devices() looks for device path reused for another
filesystem, and deletes the older fs_devices::device entry.
In preparation to handle locking in device_list_add, move
btrfs_free_stale_devices outside as these two functions serve a
different purpose.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 88c14590cd ("btrfs: use RCU in btrfs_show_devname for
device list traversal") btrfs_show_devname no longer takes
device_list_mutex. As such the deadlock that 0ccd05285e ("btrfs: fix a
possible umount deadlock") aimed to fix no longer exists, we can free
the devices immediatelly and remove the code that does the pending work.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is not used since the alloc_start parameter has been
obsoleted in commit 0d0c71b317 ("btrfs: obsolete and remove
mount option alloc_start").
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since parameter flags is no more used since commit d740760656 ("btrfs:
split parse_early_options() in two"), remove it.
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In case of deleting the seed device the %cur_devices (seed) and the
%fs_devices (parent) are different. Now, as the parent
fs_devices::total_devices also maintains the total number of devices
including the seed device, so decrement its in-memory value for the
successful seed delete. We are already updating its corresponding
on-disk btrfs_super_block::number_devices value.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 5d23515be6 ("btrfs: Move qgroup rescan on quota enable to
btrfs_quota_enable") not only resulted in an easier to follow code but
it also introduced a subtle bug. It changed the timing when the initial
transaction rescan was happening:
- before the commit: it would happen after transaction commit had occured
- after the commit: it might happen before the transaction was committed
This results in failure to correctly rescan the quota since there could
be data which is still not committed on disk.
This patch aims to fix this by moving the transaction creation/commit
inside btrfs_quota_enable, which allows to schedule the quota commit
after the transaction has been committed.
Fixes: 5d23515be6 ("btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable")
Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Link: https://marc.info/?l=linux-btrfs&m=152999289017582
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add fall-back code to catch failure of full_stripe_write. Proper error
handling from inside run_plug would need more code restructuring as it's
called at arbitrary points by io scheduler.
Signed-off-by: David Sterba <dsterba@suse.com>
Add helper that schedules a given function to run on the rmw workqueue.
This will replace several standalone helpers.
Signed-off-by: David Sterba <dsterba@suse.com>
The loops iterating eb pages use unsigned long, that's an overkill as
we know that there are at most 16 pages (64k / 4k), and 4 by default
(with nodesize 16k).
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Almost all callers pass the start and len as 2 arguments but this is not
necessary, all the information is provided by the eb. By reordering the
calls to num_extent_pages, we don't need the local variables with
start/len.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Functions that get btrfs inode can simply reach the fs_info by
dereferencing the root and this looks a bit more straightforward
compared to the btrfs_sb(...) indirection.
If the transaction handle is available and not NULL it's used instead.
Signed-off-by: David Sterba <dsterba@suse.com>
There are several places when the btrfs inode is converted to the
generic inode, back to btrfs and then passed to btrfs_ino. We can remove
the extra back and forth conversions.
Signed-off-by: David Sterba <dsterba@suse.com>
io_ctl_set_generation() assumes that the generation number shares
the same page with inline CRCs. Let's make sure this is always true.
Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is only usage of the declared devices variable, instead use its
value directly.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are many instances of the %fs_info->fs_devices pointer
dereferences, use a temporary variable instead.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Invalid reloc tree can cause kernel NULL pointer dereference when btrfs
does some cleanup of the reloc roots.
It turns out that fs_info::reloc_ctl can be NULL in
btrfs_recover_relocation() as we allocate relocation control after all
reloc roots have been verified.
So when we hit: note, we haven't called set_reloc_control() thus
fs_info::reloc_ctl is still NULL.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199833
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A crafted image has empty root tree block, which will later cause NULL
pointer dereference.
The following trees should never be empty:
1) Tree root
Must contain at least root items for extent tree, device tree and fs
tree
2) Chunk tree
Or we can't even bootstrap as it contains the mapping.
3) Fs tree
At least inode item for top level inode (.).
4) Device tree
Dev extents for chunks
5) Extent tree
Must have corresponding extent for each chunk.
If any of them is empty, we are sure the fs is corrupted and no need to
mount it.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199847
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A crafted image with invalid block group items could make free space cache
code to cause panic.
We could detect such invalid block group item by checking:
1) Item size
Known fixed value.
2) Block group size (key.offset)
We have an upper limit on block group item (10G)
3) Chunk objectid
Known fixed value.
4) Type
Only 4 valid type values, DATA, METADATA, SYSTEM and DATA|METADATA.
No more than 1 bit set for profile type.
5) Used space
No more than the block group size.
This should allow btrfs to detect and refuse to mount the crafted image.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199849
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The v0 extent type checks are the right case for the unlikely
annotations as we don't expect to ever see them, so let's give the
compiler some hint.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Following the removal of the v0 handling code let's be courteous and
print an error message when such extents are handled. In the cases
where we have a transaction just abort it, otherwise just call
btrfs_handle_fs_error. Both cases result in the FS being re-mounted RO.
In case the error handling would be too intrusive, leave the BUG_ON in
place, like extent_data_ref_count, other proper handling would catch
that earlier.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The v0 compat code was introduced in commit 5d4f98a28c
("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)") 9
years ago, which was merged in 2.6.31. This means that the code is
there to support filesystems which are _VERY_ old and if you are using
btrfs on such an old kernel, you have much bigger problems. This coupled
with the fact that no one is likely testing/maintining this code likely
means it has bugs lurking. All things considered I think 43 kernel
releases later it's high time this remnant of the past got removed.
This patch removes all code wrapped in #ifdefs but leaves the BUG_ONs in case
we have a v0 with no support intact as a sort of safety-net.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's only coding style fix not functinal change. When if/else has only
one statement then the braces are not needed.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not good to override the error code when failing from
btrfs_getxattr() in btrfs_get_acl() because it hides the real reason of
the failure.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no chance to get into -ERANGE error condition because we first
call btrfs_getxattr to get the length of the attribute, then we do a
subsequent call with the size from the first call. Between the 2 calls
the size shouldn't change. So remove the unnecessary -ERANGE error
check.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_get_acl() the first call of btr_getxattr() is for getting the
length of attribute, the value buffer is never used in this case. So
it's better to replace empty string with NULL.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The caller of btrfs_get_acl() checks error condition so there is no
impact from this change. In practice there is no chance to get into
default case of switch statement because VFS has already checked the
type.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If type of extent_inline_ref found is not expected, filesystem may have
been corrupted, should return EUCLEAN instead of EINVAL.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
struct kiocb carries the ki_pos, so there is no need to pass it as
a separate function parameter.
generic_file_direct_write() increments ki_pos, so we now assign pos
after the function.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
[ rename to btrfs_buffered_write ]
Signed-off-by: David Sterba <dsterba@suse.com>
For easier debugging, print eb->start if level is invalid. Also make
clear if bytenr found is not expected.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the function uses 2 goto labels to properly handle allocation
failures. This could be simplified by simply re-arranging the code so
that allocations are the in the beginning of the function. This allows
to use simple return statements. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Under certain KVM load and LTP tests, it is possible to hit the
following calltrace if quota is enabled:
BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
RIP: 0010:blk_status_to_errno+0x1a/0x30
Call Trace:
submit_extent_page+0x191/0x270 [btrfs]
? btrfs_create_repair_bio+0x130/0x130 [btrfs]
__do_readpage+0x2d2/0x810 [btrfs]
? btrfs_create_repair_bio+0x130/0x130 [btrfs]
? run_one_async_done+0xc0/0xc0 [btrfs]
__extent_read_full_page+0xe7/0x100 [btrfs]
? run_one_async_done+0xc0/0xc0 [btrfs]
read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
? run_one_async_done+0xc0/0xc0 [btrfs]
btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
read_tree_block+0x31/0x60 [btrfs]
read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
btrfs_search_slot+0x46b/0xa00 [btrfs]
? kmem_cache_alloc+0x1a8/0x510
? btrfs_get_token_32+0x5b/0x120 [btrfs]
find_parent_nodes+0x11d/0xeb0 [btrfs]
? leaf_space_used+0xb8/0xd0 [btrfs]
? btrfs_leaf_free_space+0x49/0x90 [btrfs]
? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
btrfs_find_all_roots+0x45/0x60 [btrfs]
btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
? pick_next_task_fair+0x2cd/0x530
? __switch_to+0x92/0x4b0
btrfs_worker_helper+0x81/0x300 [btrfs]
process_one_work+0x1da/0x3f0
worker_thread+0x2b/0x3f0
? process_one_work+0x3f0/0x3f0
kthread+0x11a/0x130
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x35/0x40
BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
BTRFS info (device vda2): forced readonly
BTRFS error (device vda2): pending csums is 2887680
[CAUSE]
It's caused by race with block group auto removal:
- There is a meta block group X, which has only one tree block
The tree block belongs to fs tree 257.
- In current transaction, some operation modified fs tree 257
The tree block gets COWed, so the block group X is empty, and marked
as unused, queued to be deleted.
- Some workload (like fsync) wakes up cleaner_kthread()
Which will call btrfs_delete_unused_bgs() to remove unused block
groups.
So block group X along its chunk map get removed.
- Some delalloc work finished for fs tree 257
Quota needs to get the original reference of the extent, which will
read tree blocks of commit root of 257.
Then since the chunk map gets removed, the above warning gets
triggered.
[FIX]
Just let btrfs_delete_unused_bgs() skip block group which still has
pinned bytes.
However there is a minor side effect: currently we only queue empty
blocks at update_block_group(), and such empty block group with pinned
bytes won't go through update_block_group() again, such block group
won't be removed, until it gets new extent allocated and removed.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With gcc 4.1.2:
fs/btrfs/inode-map.c: In function ‘btrfs_unpin_free_ino’:
fs/btrfs/inode-map.c:241: warning: ‘count’ may be used uninitialized in this function
While this warning is a false-positive, it can easily be killed by
refactoring the code.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While the regular inode timestamps all use timespec64 now, the i_otime
field is btrfs specific and still needs to be converted to correctly
represent times beyond 2038.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The transaction times were changed to ktime_get_real_seconds to avoid
the y2038 overflow, but they still have a minor problem when they go
backwards or jump due to settimeofday() or leap seconds.
This changes the transaction handling to instead use ktime_get_seconds(),
which returns a CLOCK_MONOTONIC timestamp that has neither of those
problems.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We used to call btrfs_file_extent_inline_len() to get the uncompressed
data size of an inlined extent.
However this function is hiding evil, for compressed extent, it has no
choice but to directly read out ram_bytes from btrfs_file_extent_item.
While for uncompressed extent, it uses item size to calculate the real
data size, and ignoring ram_bytes completely.
In fact, for corrupted ram_bytes, due to above behavior kernel
btrfs_print_leaf() can't even print correct ram_bytes to expose the bug.
Since we have the tree-checker to verify all EXTENT_DATA, such mismatch
can be detected pretty easily, thus we can trust ram_bytes without the
evil btrfs_file_extent_inline_len().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a new extent buffer is allocated there are a few mandatory fields
which need to be set in order for the buffer to be sane: level,
generation, bytenr, backref_rev, owner and FSID/UUID. Currently this
is open coded in the callers of btrfs_alloc_tree_block, meaning it's
fairly high in the abstraction hierarchy of operations. This patch
solves this by simply moving this init code in btrfs_init_new_buffer,
since this is the function which initializes a newly allocated
extent buffer. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit f8f84b2dfd ("btrfs: index check-integrity state hash by a dev_t")
changed how btrfsic indexes device state.
Now we need to access device->bdev->bd_dev, while for degraded mount
it's completely possible to have device->bdev as NULL, thus it will
trigger a NULL pointer dereference at mount time.
Fix it by checking if the device is degraded before accessing
device->bdev->bd_dev.
There are a lot of other places accessing device->bdev->bd_dev, however
the other call sites have either checked device->bdev, or the
device->bdev is passed from btrfsic_map_block(), so it won't cause harm.
Fixes: f8f84b2dfd ("btrfs: index check-integrity state hash by a dev_t")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed bg cache.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from trans since the function is always called
within a valid transaction.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced directly from the transaction handle since it's
always valid.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle, since it's
always valid.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction handle, since it's
always valid.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed block group.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed block group.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from trans since the function is always called
within a transaction.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from trans since the function is always called
within a transaction.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can always be referneced from the passed transaction handle since
it's always valid. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fs_info can be refenreced from the transaction handle, since it's always
valid. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The argument is no longer used so remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle so
fs_info can be referenced from there. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction from where
fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction which holds a reference to
the fs_info struct. Use that reference and remove the extra arg. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fs_info can be referenced from the transaction handle, which is always
valid. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle so we
can reference the fs_info from there. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where we can reference fs_info. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where we can reference the fs_info. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This argument is unused. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where the fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always uses the leaf's extent_buffer which already
contains a reference to the fs_info. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where the fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This argument is unused. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction from where the
fs_info can be referenced. No functional change.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction handle from
where fs_info can be referenced. So remove the redundant argument.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is always called with a valid transaction so there is no
need to duplicate the fs_info, we can reference it directly from the
trans handle. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The C programming language does not allow to use preprocessor statements
inside macro arguments (pr_info() is defined as a macro). Hence rework
the pr_info() statement in btrfs_print_mod_info() such that it becomes
compliant. This patch allows tools like sparse to analyze the BTRFS
source code.
Fixes: 62e855771d ("btrfs: convert printk(KERN_* to use pr_* calls")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch avoids that the compiler complains that a fall-through
annotation is missing when building with W=1.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch avoids that building the BTRFS source code with smatch
triggers complaints about inconsistent indenting.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently this function takes the root as an argument only to get the
log_root from it. Simplify this by directly passing the log root from
the caller. Also eliminate the fs_info local variable, since it's used
only once, so directly reference it from the transaction handle.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The logic to check if the inode is already in the log can now be
simplified since we always wait for the ordered extents to complete
before deciding whether the inode needs to be logged. The big comment
about it can go away too.
CC: Filipe Manana <fdmanana@suse.com>
Suggested-by: Filipe Manana <fdmanana@suse.com>
[ code and changelog copied from mail discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
This is no longer used anywhere, remove all of it.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We no longer use this list we've passed around so remove it everywhere.
Also remove the extra checks for ordered/filemap errors as this is
handled higher up now that we're waiting on ordered_extents before
getting to the tree log code.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we are waiting on all ordered extents at the start of the fsync()
path we don't need to wait on any logged ordered extents, and we don't
need to look up the checksums on the ordered extents as they will
already be on disk prior to getting here. Rework this so we're only
looking up and copying the on-disk checksums for the extent range we
care about.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a priority inversion that exists currently with btrfs fsync. In
some cases we will collect outstanding ordered extents onto a list and
only wait on them at the very last second. However this "very last
second" falls inside of a transaction handle, so if we are in a lower
priority cgroup we can end up holding the transaction open for longer
than needed, so if a high priority cgroup is also trying to fsync()
it'll see latency.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment wrongfully states that the owner parameter is the level of
the parent block. In fact owner is the level of the current block and
by adding 1 to it we can eventually get to the parent/root.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Here is a doc-only patch which tires to deobfuscate the terra-incognita
that arguments for delayed refs are.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit ac0b4145d6 ("btrfs: scrub: Don't use inode pages
for device replace") the function is not used and we can remove all
functions down the call chain.
There was an optimization that reused inode pages to speed up device
replace, but broke when there was nodatasum and compressed page. The
potential performance gain is small so we don't loose much by removing
it and using scrub_pages same as the other pages.
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The get_seconds() function is deprecated as it truncates the timestamp
to 32 bits. Change it to or ktime_get_real_seconds().
Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAltSwkIACgkQxWXV+ddt
WDtMBQ//UXMjHaXvFmC0SM6NczuYQR51hLYtJFIKig93XK5goVpUBTNbxO7LX/Tn
4zKoKyVhkW1884V6mRiC+G23QLbo0BQZA7DExfyJ3jylQdjZMBm+K+r19OtGQf5v
CII7oUwni03KIiXIqiFAL5dLWebVQpG5EKJbh8GLZsmg6xNcyVaUqZ/fHXajbZiv
ldEBtHBKIv7WWTJmylMBKMWnRz+jqU91fXPahoU6R5qivODrLt1o/PMuSjVNhaxe
iDldHfdOaiQmLHB/1kOGyv492oW5mSSVNDE8LjEDZ61tDNlAcUyuKUWIRBxDEDtD
6D7rlVQXJ/N7sJ6+UYmJKsRpHL+NOkyzSZ0QEU/sm1Xpm8gkhHuuofRPrVCtd3l1
ZSbwvlrdyjigVEBfM3IbToQ/K6Rc1ZGId20OAs9PCQbb+mj9IxPIncZ7pI1c4hlh
pPEjcYsp14JbCTjctFalcqTiFY5tHRQsx+GUFnDyOcdL7Mi+CoH+0Jy61Vgz9GQE
7s934cfEC0ot/f66kAL/PZzxUfC7TePqaa+sDfS5BIkJ4M6lPMxS5De5R4Z0+Nzr
DXgQAlgXmxfRjpOYMTH9D0EDdSeJaNmVHgk7hFbiYk/KX3oyd4NmgI9Cfao8rQJv
2yd8wF2httfSJKD4b/Hv9r6Ho/Bw9PK59BvWOKYhSj6IGl32utw=
=f7eB
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A fix of a corruption regarding fsync and clone, under some very
specific conditions explained in the patch.
The fix is marked for stable 3.16+ so I'd like to get it merged now
given the impact"
* tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix file data corruption after cloning a range and fsync
When we clone a range into a file we can end up dropping existing
extent maps (or trimming them) and replacing them with new ones if the
range to be cloned overlaps with a range in the destination inode.
When that happens we add the new extent maps to the list of modified
extents in the inode's extent map tree, so that a "fast" fsync (the flag
BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
and log corresponding extent items. However, at the end of range cloning
operation we do truncate all the pages in the affected range (in order to
ensure future reads will not get stale data). Sometimes this truncation
will release the corresponding extent maps besides the pages from the page
cache. If this happens, then a "fast" fsync operation will miss logging
some extent items, because it relies exclusively on the extent maps being
present in the inode's extent tree, leading to data loss/corruption if
the fsync ends up using the same transaction used by the clone operation
(that transaction was not committed in the meanwhile). An extent map is
released through the callback btrfs_invalidatepage(), which gets called by
truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
later ends up calling try_release_extent_mapping() which will release the
extent map if some conditions are met, like the file size being greater
than 16Mb, gfp flags allow blocking and the range not being locked (which
is the case during the clone operation) nor being the extent map flagged
as pinned (also the case for cloning).
The following example, turned into a test for fstests, reproduces the
issue:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
$ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar
$ xfs_io -c "fsync" /mnt/bar
# reflink destination offset corresponds to the size of file bar,
# 2728Kb minus 4Kb.
$ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar
$ xfs_io -c "fsync" /mnt/bar
$ md5sum /mnt/bar
95a95813a8c2abc9aa75a6c2914a077e /mnt/bar
<power fail>
$ mount /dev/sdb /mnt
$ md5sum /mnt/bar
207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar
# digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
# power failure
In the above example, the destination offset of the clone operation
corresponds to the size of the "bar" file minus 4Kb. So during the clone
operation, the extent map covering the range from 2572Kb to 2728Kb gets
trimmed so that it ends at offset 2724Kb, and a new extent map covering
the range from 2724Kb to 11724Kb is created. So at the end of the clone
operation when we ask to truncate the pages in the range from 2724Kb to
2724Kb + 15908Kb, the page invalidation callback ends up removing the new
extent map (through try_release_extent_mapping()) when the page at offset
2724Kb is passed to that callback.
Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
map is removed at try_release_extent_mapping(), forcing the next fsync to
search for modified extents in the fs/subvolume tree instead of relying on
the presence of extent maps in memory. This way we can continue doing a
"fast" fsync if the destination range of a clone operation does not
overlap with an existing range or if any of the criteria necessary to
remove an extent map at try_release_extent_mapping() is not met (file
size not bigger then 16Mb or gfp flags do not allow blocking).
CC: stable@vger.kernel.org # 3.16+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In commit ac0b4145d6 ("btrfs: scrub: Don't use inode pages for device
replace") we removed the branch of copy_nocow_pages() to avoid
corruption for compressed nodatasum extents.
However above commit only solves the problem in scrub_extent(), if
during scrub_pages() we failed to read some pages,
sctx->no_io_error_seen will be non-zero and we go to fixup function
scrub_handle_errored_block().
In scrub_handle_errored_block(), for sctx without csum (no matter if
we're doing replace or scrub) we go to scrub_fixup_nodatasum() routine,
which does the similar thing with copy_nocow_pages(), but does it
without the extra check in copy_nocow_pages() routine.
So for test cases like btrfs/100, where we emulate read errors during
replace/scrub, we could corrupt compressed extent data again.
This patch will fix it just by avoiding any "optimization" for
nodatasum, just falls back to the normal fixup routine by try read from
any good copy.
This also solves WARN_ON() or dead lock caused by lame backref iteration
in scrub_fixup_nodatasum() routine.
The deadlock or WARN_ON() won't be triggered before commit ac0b4145d6
("btrfs: scrub: Don't use inode pages for device replace") since
copy_nocow_pages() have better locking and extra check for data extent,
and it's already doing the fixup work by try to read data from any good
copy, so it won't go scrub_fixup_nodatasum() anyway.
This patch disables the faulty code and will be removed completely in a
followup patch.
Fixes: ac0b4145d6 ("btrfs: scrub: Don't use inode pages for device replace")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_cmp_data_free() puts cmp's src_pages and dst_pages, but leaves
their page address intact. Now, if you hit "goto again" in
btrfs_extent_same_range() and hit some error in
btrfs_cmp_data_prepare(), you'll try to unlock/put already put pages.
This is simple fix to reset the address to avoid use-after-free.
Fixes: 67b07bd4be ("Btrfs: reuse cmp workspace in EXTENT_SAME ioctl")
Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 542c5908ab ("btrfs: replace uuid_mutex by
device_list_mutex in btrfs_open_devices") switched to device_list_mutex
as we need that for the device list traversal, but we also need
uuid_mutex to protect access to fs_devices::opened to be consistent with
other users of that.
Fixes: 542c5908ab ("btrfs: replace uuid_mutex by device_list_mutex in btrfs_open_devices")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Clean up f_op->dedupe_file_range() interface.
1) Use loff_t for offsets and length instead of u64
2) Order the arguments the same way as {copy|clone}_file_range().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAls4zz0ACgkQxWXV+ddt
WDs0ZhAAplEAcN1BP986BS7GpjrG20vQtP9AnHlnSEJJJmsnykpspBylOcLRkKjF
LKBfBPCKqIo7kn5ebKT1Kk7zJPkOOEfmxGW7hffVN/oa/oMtmgJbbHDUgl2TDgdu
rky1O+Bj+S37s5rhiXAJ4oU9ekdpWIlN30GczfynjiPqGigKh/cINsEEhQIIAiJG
PRDQfSIJeh67x1AP0KE8sJAYSsaeFxT+kHrT/NPs1NFDSzrQSa/QWPFVjGVVuI/Y
w84Mo0EqdRV7tap7D3QyWyYea6zdP00PG8TyLl0Kba+LckFbzpNN5hP3SUxleBzL
0ZBJi7/tOqnrMV3YaGm40dLfgD4B+CFt8zDyg2JvWUxxEzfQfYif7KIT2IV8fSqS
QrVw2NrzQC7EZ4Zu98wCN7dyyOE8yhqbq805YdG3Nj+zT6DqRu01TBo4Yr/Ek8ux
+ITAtQVbaOZmTIt/qh/Oxc5jRsurAno1FP3XRH+1hfSlS7xc3LfI1CUbX3jAKzXN
edxdM4/h+d4nekvROnKBH4EheS6+ZVfgzYlYUW9c2rjcJ1RHhDElbh14+IoM6LKJ
nJ+Cp+744F6W5jaG3oWElJrdhlY31mWUjiZaj2CHl16EcH3MToytrxKMX+OWo95W
gChnKicrtpO6+9nbED3Tdhp7SkbDysun6jvEpSgdlm8+2H5Kwrw=
=KWL7
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"We have a few regression fixes for qgroup rescan status tracking and
the vm_fault_t conversion that mixed up the error values"
* tag 'for-4.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix mount failure when qgroup rescan is in progress
Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion
btrfs: quota: Set rescan progress to (u64)-1 if we hit last leaf
If a power failure happens while the qgroup rescan kthread is running,
the next mount operation will always fail. This is because of a recent
regression that makes qgroup_rescan_init() incorrectly return -EINVAL
when we are mounting the filesystem (through btrfs_read_qgroup_config()).
This causes the -EINVAL error to be returned regardless of any qgroup
flags being set instead of returning the error only when neither of
the flags BTRFS_QGROUP_STATUS_FLAG_RESCAN nor BTRFS_QGROUP_STATUS_FLAG_ON
are set.
A test case for fstests follows up soon.
Fixes: 9593bf4967 ("btrfs: qgroup: show more meaningful qgroup_rescan_init error message")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The vm_fault_t conversion commit introduced a ret2 variable for tracking
the integer return values from internal btrfs functions. It was
sometimes returning VM_FAULT_LOCKED for pages that were actually invalid
and had been removed from the radix. Something like this:
ret2 = btrfs_delalloc_reserve_space() // returns zero on success
lock_page(page)
if (page->mapping != inode->i_mapping)
goto out_unlock;
...
out_unlock:
if (!ret2) {
...
return VM_FAULT_LOCKED;
}
This ends up triggering this WARNING in btrfs_destroy_inode()
WARN_ON(BTRFS_I(inode)->block_rsv.size);
xfstests generic/095 was able to reliably reproduce the errors.
Since out_unlock: is only used for errors, this fix moves it below the
if (!ret2) check we use to return VM_FAULT_LOCKED for success.
Fixes: a528a24150 (btrfs: change return type of btrfs_page_mkwrite to vm_fault_t)
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit ff3d27a048 ("btrfs: qgroup: Finish rescan when hit the last leaf
of extent tree") added a new exit for rescan finish.
However after finishing quota rescan, we set
fs_info->qgroup_rescan_progress to (u64)-1 before we exit through the
original exit path.
While we missed that assignment of (u64)-1 in the new exit path.
The end result is, the quota status item doesn't have the same value.
(-1 vs the last bytenr + 1)
Although it doesn't affect quota accounting, it's still better to keep
the original behavior.
Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Fixes: ff3d27a048 ("btrfs: qgroup: Finish rescan when hit the last leaf of extent tree")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsvtAIACgkQxWXV+ddt
WDvPJQ//eEr+ACNxG5n42e63TaNpLOeoGAsUChwekP6DAIc2n/r78SHX+T/woDqd
7/dc2eqlYF5Fmn6MPQ1ufL2xlfw0t3OnemK8T5+4sxZDdzeAH9V+kHaqoaLUPExL
4r6lK5Ywwa2cWghC7WvQg3+bWLX18OEExG63SlEhLo3YJM5uUqVhGVPi6ARrbxNM
GJvXcQsxjXLqukm4gYvHC6Zra9Awv65uiAU+VCm2y96j1fEJ0yjK/pC1RtoFGCqS
48Jjuzfq/V3nxy0Wjr3DvpQVEQcKyGha6i/eazZISdRhGSjrYdvIwpUn7gZeoo2Q
hT8VVergLbVYgIeaOwgwubQzNaG2C/ZTsEjPQrNrA7a/AGsh5C/ommYE9MSyS6L0
PG0NLUNDXFmEj8WdI97So+1Sm2OCb04DPbPhHbbkhw5L/MPE1TaLN5aUWguj8laB
NnyPRdVP9jCJAI0OhJY7nPDmsPKe2jogVVsRheTcob+V5G+vIgzDXlGfsW/88Seg
dHubMaC0nz6u8Cj4dIviitiLXuustyz0JkVdljTLawWWJ/Hs7NlsSf3Q3nj2Kvia
e8QMID0vLphQyL1hqC0n7M0g2ohq9NUGT1nhLTPSpFl0l8bIA9PQehOx9Q5vx8yp
tJF+d0qiNfgvadA+KhyvQ8puQb2+zZQ8+Pqfjwd4ySD7SI5dcA4=
=fCdm
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two regression fixes and an incorrect error value propagation fix from
'rename exchange'"
* tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix return value on rename exchange failure
btrfs: fix invalid-free in btrfs_extent_same
Btrfs: fix physical offset reported by fiemap for inline extents
If we failed during a rename exchange operation after starting/joining a
transaction, we would end up replacing the return value, stored in the
local 'ret' variable, with the return value from btrfs_end_transaction().
So this could end up returning 0 (success) to user space despite the
operation having failed and aborted the transaction, because if there are
multiple tasks having a reference on the transaction at the time
btrfs_end_transaction() is called by the rename exchange, that function
returns 0 (otherwise it returns -EIO and not the original error value).
So fix this by not overwriting the return value on error after getting
a transaction handle.
Fixes: cdd1fedf82 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If this condition ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
(BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM))
is hit, we will go to free the uninitialized cmp.src_pages and
cmp.dst_pages.
Fixes: 67b07bd4be ("Btrfs: reuse cmp workspace in EXTENT_SAME ioctl")
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 9d311e11fc ("Btrfs: fiemap: pass correct bytenr when
fm_extent_count is zero") introduced a regression where we no longer
report 0 as the physical offset for inline extents (and other extents
with a special block_start value). This is because it always sets the
variable used to report the physical offset ("disko") as em->block_start
plus some offset, and em->block_start has the value 18446744073709551614
((u64) -2) for inline extents.
This made the btrfs test 004 (from fstests) often fail, for example, for
a file with an inline extent we have the following items in the subvolume
tree:
item 101 key (418 INODE_ITEM 0) itemoff 11029 itemsize 160
generation 25 transid 38 size 1525 nbytes 1525
block group 0 mode 100666 links 1 uid 0 gid 0 rdev 0
sequence 0 flags 0x2(none)
atime 1529342058.461891730 (2018-06-18 18:14:18)
ctime 1529342058.461891730 (2018-06-18 18:14:18)
mtime 1529342058.461891730 (2018-06-18 18:14:18)
otime 1529342055.869892885 (2018-06-18 18:14:15)
item 102 key (418 INODE_REF 264) itemoff 11016 itemsize 13
index 25 namelen 3 name: fc7
item 103 key (418 EXTENT_DATA 0) itemoff 9470 itemsize 1546
generation 38 type 0 (inline)
inline extent data size 1525 ram_bytes 1525 compression 0 (none)
Then when test 004 invoked fiemap against the file it got a non-zero
physical offset:
$ filefrag -v /mnt/p0/d4/d7/fc7
Filesystem type is: 9123683e
File size of /mnt/p0/d4/d7/fc7 is 1525 (1 block of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 4095: 18446744073709551614.. 4093: 4096: last,not_aligned,inline,eof
/mnt/p0/d4/d7/fc7: 1 extent found
This resulted in the test failing like this:
btrfs/004 49s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad)
--- tests/btrfs/004.out 2016-08-23 10:17:35.027012095 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad 2018-06-18 18:15:02.385872155 +0100
@@ -1,3 +1,10 @@
QA output created by 004
*** test backref walking
-*** done
+./tests/btrfs/004: line 227: [: 7.55578637259143e+22: integer expression expected
+ERROR: 7.55578637259143e+22 is not a valid numeric value.
+unexpected output from
+ /home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -s 65536 -P 7.55578637259143e+22 /home/fdmanana/btrfs-tests/scratch_1
...
(Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad' to see the entire diff)
Ran: btrfs/004
The large number in scientific notation reported as an invalid numeric
value is the result from the filter passed to perl which multiplies the
physical offset by the block size reported by fiemap.
So fix this by ensuring the physical offset is always set to 0 when we
are processing an extent with a special block_start value.
Fixes: 9d311e11fc ("Btrfs: fiemap: pass correct bytenr when fm_extent_count is zero")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
There were no conflicts between this and the contents of linux-next
until just before the merge window, when we saw multiple problems:
- A minor conflict with my own y2038 fixes, which I could address
by adding another patch on top here.
- One semantic conflict with late changes to the NFS tree. I addressed
this by merging Deepa's original branch on top of the changes that
now got merged into mainline and making sure the merge commit includes
the necessary changes as produced by coccinelle.
- A trivial conflict against the removal of staging/lustre.
- Multiple conflicts against the VFS changes in the overlayfs tree.
These are still part of linux-next, but apparently this is no longer
intended for 4.18 [1], so I am ignoring that part.
As Deepa writes:
The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64 timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual
replacement becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions.
Thomas Gleixner adds:
I think there is no point to drag that out for the next merge window.
The whole thing needs to be done in one go for the core changes which
means that you're going to play that catchup game forever. Let's get
over with it towards the end of the merge window.
[1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJbInZAAAoJEGCrR//JCVInReoQAIlVIIMt5ZX6wmaKbrjy9Itf
MfgbFihQ/djLnuSPVQ3nztcxF0d66BKHZ9puVjz6+mIHqfDvJTRwZs9nU+sOF/T1
g78fRkM1cxq6ZCkGYAbzyjyo5aC4PnSMP/NQLmwqvi0MXqqrbDoq5ZdP9DHJw39h
L9lD8FM/P7T29Fgp9tq/pT5l9X8VU8+s5KQG1uhB5hii4VL6pD6JyLElDita7rg+
Z7/V7jkxIGEUWF7vGaiR1QTFzEtpUA/exDf9cnsf51OGtK/LJfQ0oiZPPuq3oA/E
LSbt8YQQObc+dvfnGxwgxEg1k5WP5ekj/Wdibv/+rQKgGyLOTz6Q4xK6r8F2ahxs
nyZQBdXqHhJYyKr1H1reUH3mrSgQbE5U5R1i3My0xV2dSn+vtK5vgF21v2Ku3A1G
wJratdtF/kVBzSEQUhsYTw14Un+xhBLRWzcq0cELonqxaKvRQK9r92KHLIWNE7/v
c0TmhFbkZA+zR8HdsaL3iYf1+0W/eYy8PcvepyldKNeW2pVk3CyvdTfY2Z87G2XK
tIkK+BUWbG3drEGG3hxZ3757Ln3a9qWyC5ruD3mBVkuug/wekbI8PykYJS7Mx4s/
WNXl0dAL0Eeu1M8uEJejRAe1Q3eXoMWZbvCYZc+wAm92pATfHVcKwPOh8P7NHlfy
A3HkjIBrKW5AgQDxfgvm
=CZX2
-----END PGP SIGNATURE-----
Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground
Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
"This is a late set of changes from Deepa Dinamani doing an automated
treewide conversion of the inode and iattr structures from 'timespec'
to 'timespec64', to push the conversion from the VFS layer into the
individual file systems.
As Deepa writes:
'The series aims to switch vfs timestamps to use struct timespec64.
Currently vfs uses struct timespec, which is not y2038 safe.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64
timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual replacement
becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions'
Thomas Gleixner adds:
'I think there is no point to drag that out for the next merge
window. The whole thing needs to be done in one go for the core
changes which means that you're going to play that catchup game
forever. Let's get over with it towards the end of the merge window'"
* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
pstore: Remove bogus format string definition
vfs: change inode times to use struct timespec64
pstore: Convert internal records to timespec64
udf: Simplify calls to udf_disk_stamp_to_time
fs: nfs: get rid of memcpys for inode times
ceph: make inode time prints to be long long
lustre: Use long long type to print inode time
fs: add timespec64_truncate()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsg598ACgkQxWXV+ddt
WDtG1w//R9/nvAk5A5cbForTXNyxwXAQnz0t2/4Lh5igbfJloqoTZtr47Gvqsvy+
DITU+3BPcyupBuUFLoeivPC+ruKOUmle27Vm62mdZRtyt96kiUiV/m1gbPhFy8lW
DKHK8tvtIoZObo5oNGvRxiuJPjefiChYZMDHokB+MY5ZALSRaIW9opj2WsM+ZAt7
g9CEjeitQcBY68CCpSEVBlQSz+BTZYEDFJGNCmGDxGhaBGZr/ganrkDJ75cG6U10
LnOZ6LDHNxGMqUm4wnhfmpHtVcIJiBF+gOyTumBPtFSoLnBverl684xizglpoq6d
fnUP8Y6XR9JA4OCZvo310yvX9nyqgb0H2h+APO0f7jRRcJo0QSKZ/qZR+XZCk3PU
91HtBopcGs8gGQUkRdAE7TMCiIEzL1eNOXHvsiILCObq1i7iNCe7Dzx6M6Gfgep8
a3IcoVmSw1DFpln2ZxTQw9viAib41iU46XHXz7W7rGPulF5QdXGo5ScORRG6HBLE
nZsXdTkrCPMJyN2bIhU6YJOK9rb9TjD4lUtnvzT8t1CfUsxQT4AsJykaYr9BwF2D
Z4rBruUAQ3OmvXJDfGG4T5YCAdPBN+xBcxCeyrDZSD0r6YPQGoF0dlvHmvP0J78D
oGkD1bb/gjcvsPJxtTQ4QWEh0oqZiDfRt4qdgO46vhba0onzQFw=
=X2zA
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- error handling fixup for one of the new ioctls from 1st pull
- fix for device-replace that incorrectly uses inode pages and can mess
up compressed extents in some cases
- fiemap fix for reporting incorrect number of extents
- vm_fault_t type conversion
* tag 'for-4.18-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: scrub: Don't use inode pages for device replace
btrfs: change return type of btrfs_page_mkwrite to vm_fault_t
Btrfs: fiemap: pass correct bytenr when fm_extent_count is zero
btrfs: Check error of btrfs_iget in btrfs_search_path_in_tree_user
Pull the timespec64 conversion from Deepa Dinamani:
"The series aims to switch vfs timestamps to use
struct timespec64. Currently vfs uses struct timespec,
which is not y2038 safe.
The flag patch applies cleanly. I've not seen the timestamps
update logic change often. The series applies cleanly on 4.17-rc6
and linux-next tip (top commit: next-20180517).
I'm not sure how to merge this kind of a series with a flag patch.
We are targeting 4.18 for this.
Let me know if you have other suggestions.
The series involves the following:
1. Add vfs helper functions for supporting struct timepec64 timestamps.
2. Cast prints of vfs timestamps to avoid warnings after the switch.
3. Simplify code using vfs timestamps so that the actual
replacement becomes easy.
4. Convert vfs timestamps to use struct timespec64 using a script.
This is a flag day patch.
I've tried to keep the conversions with the script simple, to
aid in the reviews. I've kept all the internal filesystem data
structures and function signatures the same.
Next steps:
1. Convert APIs that can handle timespec64, instead of converting
timestamps at the boundaries.
2. Update internal data structures to avoid timestamp conversions."
I've pulled it into a branch based on top of the NFS changes that
are now in mainline, so I could resolve the non-obvious conflict
between the two while merging.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
[BUG]
Btrfs can create compressed extent without checksum (even though it
shouldn't), and if we then try to replace device containing such extent,
the result device will contain all the uncompressed data instead of the
compressed one.
Test case already submitted to fstests:
https://patchwork.kernel.org/patch/10442353/
[CAUSE]
When handling compressed extent without checksum, device replace will
goe into copy_nocow_pages() function.
In that function, btrfs will get all inodes referring to this data
extents and then use find_or_create_page() to get pages direct from that
inode.
The problem here is, pages directly from inode are always uncompressed.
And for compressed data extent, they mismatch with on-disk data.
Thus this leads to corrupted compressed data extent written to replace
device.
[FIX]
In this attempt, we could just remove the "optimization" branch, and let
unified scrub_pages() to handle it.
Although scrub_pages() won't bother reusing page cache, it will be a
little slower, but it does the correct csum checking and won't cause
such data corruption caused by "optimization".
Note about the fix: this is the minimal fix that can be backported to
older stable trees without conflicts. The whole callchain from
copy_nocow_pages() can be deleted, and will be in followup patches.
Fixes: ff023aac31 ("Btrfs: add code to scrub to copy read data to another disk")
CC: stable@vger.kernel.org # 4.4+
Reported-by: James Harvey <jamespharvey20@gmail.com>
Reviewed-by: James Harvey <jamespharvey20@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ remove code removal, add note why ]
Signed-off-by: David Sterba <dsterba@suse.com>
Use the new return type vm_fault_t for fault handler. For now, this is
just documenting that the function returns a VM_FAULT value rather than
an errno. Once all instances are converted, vm_fault_t will become a
distinct type.
Reference commit 1c8f422059 ("mm: change return type to vm_fault_t")
vmf_error() is the newly introduced inline function in 4.17-rc6.
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
fm_mapped_extents is not correct when fm_extent_count is 0
Like:
# mount /dev/vdb5 /mnt/btrfs
# dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1
When user space wants to get the number of file extents,
set fm_extent_count to 0 to run fiemap and then read fm_mapped_extents.
In the above example, fiemap will return with fm_mapped_extents set to 4,
but it should be 1 since there's only one entry in the output.
[REASON]
The problem seems to be that disko is only set if
fieinfo->fi_extents_max is set. And this member is initialized, in the
generic ioctl_fiemap function, to the value of used-passed
fm_extent_count. So when the user passes 0 then fi_extent_max is also
set to zero and this causes btrfs to not initialize disko at all.
Eventually this leads emit_fiemap_extent being called with a bogus
'phys' argument preventing proper fiemap entries merging.
[FIX]
Move the disko initialization earlier in extent_fiemap making it
independent of user-passed arguments, allowing emit_fiemap_extent to
properly handle consecutive extent entries.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The patch introducing the ioctl was not the latest version at the time
of merging to the mainline and needs a fixup from this patch.
Fixes: ba637a252d30 ("btrfs: Check error of btrfs_iget() in btrfs_search_path_in_tree_user")
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsVV6wACgkQxWXV+ddt
WDsaOhAAnrG3o/Bkn4loifYjiWQoArswwVJ2m5MvG14oPcn/c6wCUlE6mHdEsT8l
WgltSRSd/3hJ1BcuWjhcR8RSLBRSaPcw4FAHikAdOWMrLXSKfA+quT/upXD00iDS
+bnHoknc5vvk+Ow0KyGODAeMYfJm8gGB43BeVDr/mgrVRrwYY9r14s+6LX897NOT
qObdxBjtedua0uIYZX5673siFaSbvJNQyGn6iKO9Ww+7bkWjgKO8qY8/3/gEaJn1
q1303Oo65VOXQQKZed+NjPNakpqRmABaN79qTwWMnUI4qYBKi651a7sr5HljDpmc
szY3uF/E7AOfJSDeTBdglFR1eQgv20ceeIx7FN0n6iUbuIVH9P85mBF6f76Hivpq
fA39uPYxWwWq8RxmC09yqoTkuNS0zAQhov4n8XO0q/81Gfek0RpR+LMK15W02Ur8
+5Pazkgz+xRYRPhpExe75pbM5N31KFJdf+4wpXeZoXKGTeiiCl/SZV3NzyUtemGg
eN7ltHD/Y9iNpx7sgHn+2veXwn6psDad8ORxIlRyKDeot2pNZ4lnekPbqMC/IwAz
hF3ZpcEo8v2Q2kqwCqw5evq4NSOvfG0cEuIqAKir1AgTn2nPH99qRSR1He9kluap
GBPTPXNgBrhhdsh5kltI6CJcGE46bHsLG4LI5WvRs+wOtyaG5Ww=
=jIvf
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"User visible features:
- added support for the ioctl FS_IOC_FSGETXATTR, per-inode flags,
successor of GET/SETFLAGS; now supports only existing flags:
append, immutable, noatime, nodump, sync
- 3 new unprivileged ioctls to allow users to enumerate subvolumes
- dedupe syscall implementation does not restrict the range to 16MiB,
though it still splits the whole range to 16MiB chunks
- on user demand, rmdir() is able to delete an empty subvolume,
export the capability in sysfs
- fix inode number types in tracepoints, other cleanups
- send: improved speed when dealing with a large removed directory,
measurements show decrease from 2000 minutes to 2 minutes on a
directory with 2 million entries
- pre-commit check of superblock to detect a mysterious in-memory
corruption
- log message updates
Other changes:
- orphan inode cleanup improved, does no keep long-standing
reservations that could lead up to early ENOSPC in some cases
- slight improvement of handling snapshotted NOCOW files by avoiding
some unnecessary tree searches
- avoid OOM when dealing with many unmergeable small extents at flush
time
- speedup conversion of free space tree representations from/to
bitmap/tree
- code refactoring, deletion, cleanups:
+ delayed refs
+ delayed iput
+ redundant argument removals
+ memory barrier cleanups
+ remove a redundant mutex supposedly excluding several ioctls to
run in parallel
- new tracepoints for blockgroup manipulation
- more sanity checks of compressed headers"
* tag 'for-4.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (183 commits)
btrfs: Add unprivileged version of ino_lookup ioctl
btrfs: Add unprivileged ioctl which returns subvolume's ROOT_REF
btrfs: Add unprivileged ioctl which returns subvolume information
Btrfs: clean up error handling in btrfs_truncate()
btrfs: Factor out write portion of btrfs_get_blocks_direct
btrfs: Factor out read portion of btrfs_get_blocks_direct
btrfs: return ENOMEM if path allocation fails in btrfs_cross_ref_exist
btrfs: raid56: Remove VLA usage
btrfs: return error value if create_io_em failed in cow_file_range
btrfs: drop useless member qgroup_reserved of btrfs_pending_snapshot
btrfs: drop unused parameter qgroup_reserved
btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
btrfs: lift some btrfs_cross_ref_exist checks in nocow path
btrfs: Remove fs_info argument from btrfs_uuid_tree_rem
btrfs: Remove fs_info argument from btrfs_uuid_tree_add
Btrfs: remove unused check of skip_locking
Btrfs: remove always true check in unlock_up
Btrfs: grab write lock directly if write_lock_level is the max level
Btrfs: move get root out of btrfs_search_slot to a helper
Btrfs: use more straightforward extent_buffer_uptodate check
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJbFIrHAAoJEPfTWPspceCm2+kQAKo7o7HL30aRxJYu+gYafkuW
PV47zr3e4vhMDEzDaMsh1+V7I7bm3uS+NZu6cFbcV+N9KXFpeb4V4Hvvm5cs+OC3
WCOBi4eC1h4qnDQ3ZyySrCMN+KHYJ16pZqddEjqw+fhVudx8i+F+jz3Y4ZMDDc3q
pArKZvjKh2wEuYXUMFTjaXY46IgPt+er94OwvrhyHk+4AcA+Q/oqSfSdDahUC8jb
BVR3FV4I3NOHUaru0RbrUko13sVZSboWPCIFrlTDz8xXcJOnVHzdVS1WLFDXLHnB
O8q9cADCfa4K08kz68RxykcJiNxNvz5ChDaG0KloCFO+q1tzYRoXLsfaxyuUDg57
Zd93OFZC6hAzXdhclDFIuPET9OQIjDzwphodfKKmDsm3wtyOtydpA0o7JUEongp0
O1gQsEfYOXmQsXlo8Ot+Z7Ne/HvtGZ91JahUa/59edxQbcKaMrktoyQsQ/d1nOEL
4kXID18wPcFHWRQHYXyVuw6kbpRtQnh/U2m1eenSZ7tVQHwoe6mF3cfSf5MMseak
k8nAnmsfEvOL4Ar9ftg61GOrImaQlidxOC2A8fmY5r0Sq/ZldvIFIZizsdTTCcni
8SOTxcQowyqPf5NvMNQ8cKqqCJap3ppj4m7anZNhbypDIF2TmOWsEcXcMDn4y9on
fax14DPLo59gBRiPCn5f
=nga/
-----END PGP SIGNATURE-----
Merge tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
- clean up how we pass around gfp_t and
blk_mq_req_flags_t (Christoph)
- prepare us to defer scheduler attach (Christoph)
- clean up drivers handling of bounce buffers (Christoph)
- fix timeout handling corner cases (Christoph/Bart/Keith)
- bcache fixes (Coly)
- prep work for bcachefs and some block layer optimizations (Kent).
- convert users of bio_sets to using embedded structs (Kent).
- fixes for the BFQ io scheduler (Paolo/Davide/Filippo)
- lightnvm fixes and improvements (Matias, with contributions from Hans
and Javier)
- adding discard throttling to blk-wbt (me)
- sbitmap blk-mq-tag handling (me/Omar/Ming).
- remove the sparc jsflash block driver, acked by DaveM.
- Kyber scheduler improvement from Jianchao, making it more friendly
wrt merging.
- conversion of symbolic proc permissions to octal, from Joe Perches.
Previously the block parts were a mix of both.
- nbd fixes (Josef and Kevin Vigor)
- unify how we handle the various kinds of timestamps that the block
core and utility code uses (Omar)
- three NVMe pull requests from Keith and Christoph, bringing AEN to
feature completeness, file backed namespaces, cq/sq lock split, and
various fixes
- various little fixes and improvements all over the map
* tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-block: (196 commits)
blk-mq: update nr_requests when switching to 'none' scheduler
block: don't use blocking queue entered for recursive bio submits
dm-crypt: fix warning in shutdown path
lightnvm: pblk: take bitmap alloc. out of critical section
lightnvm: pblk: kick writer on new flush points
lightnvm: pblk: only try to recover lines with written smeta
lightnvm: pblk: remove unnecessary bio_get/put
lightnvm: pblk: add possibility to set write buffer size manually
lightnvm: fix partial read error path
lightnvm: proper error handling for pblk_bio_add_pages
lightnvm: pblk: fix smeta write error path
lightnvm: pblk: garbage collect lines with failed writes
lightnvm: pblk: rework write error recovery path
lightnvm: pblk: remove dead function
lightnvm: pass flag on graceful teardown to targets
lightnvm: pblk: check for chunk size before allocating it
lightnvm: pblk: remove unnecessary argument
lightnvm: pblk: remove unnecessary indirection
lightnvm: pblk: return NVM_ error on failed submission
lightnvm: pblk: warn in case of corrupted write buffer
...
Add unprivileged version of ino_lookup ioctl BTRFS_IOC_INO_LOOKUP_USER
to allow normal users to call "btrfs subvolume list/show" etc. in
combination with BTRFS_IOC_GET_SUBVOL_INFO/BTRFS_IOC_GET_SUBVOL_ROOTREF.
This can be used like BTRFS_IOC_INO_LOOKUP but the argument is
different. This is because it always searches the fs/file tree
correspoinding to the fd with which this ioctl is called and also
returns the name of bottom subvolume.
The main differences from original ino_lookup ioctl are:
1. Read + Exec permission will be checked using inode_permission()
during path construction. -EACCES will be returned in case
of failure.
2. Path construction will be stopped at the inode number which
corresponds to the fd with which this ioctl is called. If
constructed path does not exist under fd's inode, -EACCES
will be returned.
3. The name of bottom subvolume is also searched and filled.
Note that the maximum length of path is shorter 256 (BTRFS_VOL_NAME_MAX+1)
bytes than ino_lookup ioctl because of space of subvolume's name.
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
[ style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
Add unprivileged ioctl BTRFS_IOC_GET_SUBVOL_ROOTREF which returns
ROOT_REF information of the subvolume containing this inode except the
subvolume name (this is because to prevent potential name leak). The
subvolume name will be gained by user version of ino_lookup ioctl
(BTRFS_IOC_INO_LOOKUP_USER) which also performs permission check.
The min id of root ref's subvolume to be searched is specified by
@min_id in struct btrfs_ioctl_get_subvol_rootref_args. After the search
ends, @min_id is set to the last searched root ref's subvolid + 1. Also,
if there are more root refs than BTRFS_MAX_ROOTREF_BUFFER_NUM,
-EOVERFLOW is returned. Therefore the caller can just call this ioctl
again without changing the argument to continue search.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
[ style fixes and struct item renames ]
Signed-off-by: David Sterba <dsterba@suse.com>
Add new unprivileged ioctl BTRFS_IOC_GET_SUBVOL_INFO which returns
the information of subvolume containing this inode.
(i.e. returns the information in ROOT_ITEM and ROOT_BACKREF.)
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
[ minor style fixes, update struct comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Convert btrfs to embedded bio sets.
Acked-by: Chris Mason <clm@fb.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
btrfs_truncate() uses two variables for error handling, ret and err (if
this sounds familiar, it's because btrfs_truncate_inode_items() did
something similar). This is error prone, as was made evident by "Btrfs:
fix error handling in btrfs_truncate()". We only have err because we
don't want to mask an error if we call btrfs_update_inode() and
btrfs_end_transaction(), so let's make that its own scoped return
variable and use ret everywhere else.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the read side is extracted into its own function, do the same
to the write side. This leaves btrfs_get_blocks_direct_write with the
sole purpose of handling common locking required. Also flip the
condition in btrfs_get_blocks_direct_write so that the write case
comes first and we check for if (Create) rather than if (!create). This
is purely subjective but I believe makes reading a bit more "linear".
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently this function handles both the READ and WRITE dio cases. This
is facilitated by a bunch of 'if' statements, a goto short-circuit
statement and a very perverse aliasing of "!created"(READ) case
by setting lockstart = lockend and checking for lockstart < lockend for
detecting the write. Let's simplify this mess by extracting the
READ-only code into a separate __btrfs_get_block_direct_read function.
This is only the first step, the next one will be to factor out the
write side as well. The end goal will be to have the common locking/
unlocking code in btrfs_get_blocks_direct and then it will call either
the read|write subvariants. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The error code does not match the reason of failure and may confuse the
callers.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the quest to remove all stack VLA usage from the kernel[1], this
allocates the working buffers during regular init, instead of using stack
space. This refactors the allocation code a bit to make it easier
to review.
[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In cow_file_range(), create_io_em() may fail, but its return value is
not recorded. Then return value may be 0 even it failed which is a
wrong behavior.
Let cow_file_range() return PTR_ERR(em) if create_io_em() failed.
Fixes: 6f9994dbab ("Btrfs: create a helper to create em for IO")
CC: stable@vger.kernel.org # 4.11+
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since there is no more use of qgroup_reserved member in struct
btrfs_pending_snapshot, remove it.
Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 7775c8184e ("btrfs: remove unused parameter from
btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used
by caller of function btrfs_subvolume_reserve_metadata. So remove it.
Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[Problem description and how we fix it]
We should balance dirty metadata pages at the end of
btrfs_finish_ordered_io, since a small, unmergeable random write can
potentially produce dirty metadata which is multiple times larger than
the data itself. For example, a small, unmergeable 4KiB write may
produce:
16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
Although we do call balance dirty pages in write side, but in the
buffered write path, most metadata are dirtied only after we reach the
dirty background limit (which by far only counts dirty data pages) and
wakeup the flusher thread. If there are many small, unmergeable random
writes spread in a large btree, we'll find a burst of dirty pages
exceeds the dirty_bytes limit after we wakeup the flusher thread - which
is not what we expect. In our machine, it caused out-of-memory problem
since a page cannot be dropped if it is marked dirty.
Someone may worry about we may sleep in btrfs_btree_balance_dirty_nodelay,
but since we do btrfs_finish_ordered_io in a separate worker, it will not
stop the flusher consuming dirty pages. Also, we use different worker for
metadata writeback endio, sleep in btrfs_finish_ordered_io help us throttle
the size of dirty metadata pages.
[Reproduce steps]
To reproduce the problem, we need to do 4KiB write randomly spread in a
large btree. In our 2GiB RAM machine:
1) Create 4 subvolumes.
2) Run fio on each subvolume:
[global]
direct=0
rw=randwrite
ioengine=libaio
bs=4k
iodepth=16
numjobs=1
group_reporting
size=128G
runtime=1800
norandommap
time_based
randrepeat=0
3) Take snapshot on each subvolume and repeat fio on existing files.
4) Repeat step (3) until we get large btrees.
In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
metadata in each subvolume tree and 12GiB of metadata in extent tree.
5) Stop all fio, take snapshot again, and wait until all delayed work is
completed.
6) Start all fio. Few seconds later we hit OOM when the flusher starts
to work.
It can be reproduced even when using nocow write.
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
In nocow path, we check if the extent is snapshotted in
btrfs_cross_ref_exist(). We can do the similar check earlier and avoid
unnecessary search into extent tree.
A fio test on a Intel D-1531, 16GB RAM, SSD RAID-5 machine as follows:
[global]
group_reporting
time_based
thread=1
ioengine=libaio
bs=4k
iodepth=32
size=64G
runtime=180
numjobs=8
rw=randwrite
[file1]
filename=/mnt/nocow/testfile
IOPS result: unpatched patched
1 fio round: 46670 46958
snapshot
2 fio round: 51826 54498
3 fio round: 59767 61289
After snapshot, the first fio get about 5% performance gain. As we
continually write to the same file, all writes will resume to nocow mode
and eventually we have no performance gain.
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ rename the function ]
Signed-off-by: David Sterba <dsterba@suse.com>
This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The check is superfluous since all callers who set search_for_commit
also have skip_locking set.
ASSERT() is put in place to ensure skip_locking is set by new callers.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As unlock_up() is written as
for () {
if (!path->locks[i])
break;
...
if (... && path->locks[i]) {
}
}
Apparently, @path->locks[i] is always true at this 'if'.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Typically, when acquiring root node's lock, btrfs tries its best to get
read lock and trade for write lock if @write_lock_level implies to do so.
In case of (cow && (p->keep_locks || p->lowest_level)), write_lock_level
is set to BTRFS_MAX_LEVEL, which means we need to acquire root node's
write lock directly.
In this particular case, the dance of acquiring read lock and then trading
for write lock can be saved.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's good to have a helper instead of having all get-root details
open-coded. The new helper locks (if necessary) and sets root node of
the path.
Also invert the checks to make the code flow easier to read. There is
no functional change in this commit.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If parent_transid "0" is passed to btrfs_buffer_uptodate(),
btrfs_buffer_uptodate() is equivalent to extent_buffer_uptodate(), but
extent_buffer_uptodate() is preferred since we don't have to look into
verify_parent_transid().
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
read_block_for_search() can be simplified as:
tmp = find_extent_buffer();
if (tmp)
return;
...
free_extent_buffer();
read_tree_block();
Apparently, @tmp must be NULL at this point, free_extent_buffer() is not
needed.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit dc2d3005d2 ("btrfs: remove dead create_space_info
calls"), there is only one caller btrfs_init_space_info. However, it
doesn't need create_space_info to return space_info at all.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As verify_level_key() is checked after verify_parent_transid(), i.e.
if (verify_parent_transid())
ret = -EIO;
else if (verify_level_key())
ret = -EUCLEAN;
if parent_transid is 0, verify_parent_transid() skips verifying
parent_transid and considers eb as valid, and if verify_level_key()
reports something wrong, we're not going to know if it's caused by
corrupted metadata or non-checkecd eb (e.g. stale eb).
The stale eb can be from an outdated raid1 mirror after a degraded
mount, see eg "btrfs: fix reading stale metadata blocks after degraded
raid1 mounts" (02a3307aa9) for more details.
@parent_transid is able to tell whether the eb's generation has been
verified by the caller.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Error message from qgroup_rescan_init() mostly looks like:
BTRFS info (device nvme0n1p1): qgroup_rescan_init failed with -115
Which is far from meaningful, and sometimes confusing as for above
-EINPROGRESS it's mostly (despite the init race) harmless, but sometimes
it can also indicate problem if the return value is -EINVAL.
Change it to some more meaningful messages like:
BTRFS info (device nvme0n1p1): qgroup rescan is already in progress
And
BTRFS err(device nvme0n1p1): qgroup rescan init failed, qgroup is not enabled
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update the messages and level ]
Signed-off-by: David Sterba <dsterba@suse.com>
If we have invalid flags set, when we error out we must drop our writer
counter and free the buffer we allocated for the arguments. This bug is
trivially reproduced with the following program on 4.7+:
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <linux/btrfs.h>
#include <linux/btrfs_tree.h>
int main(int argc, char **argv)
{
struct btrfs_ioctl_vol_args_v2 vol_args = {
.flags = UINT64_MAX,
};
int ret;
int fd;
if (argc != 2) {
fprintf(stderr, "usage: %s PATH\n", argv[0]);
return EXIT_FAILURE;
}
fd = open(argv[1], O_WRONLY);
if (fd == -1) {
perror("open");
return EXIT_FAILURE;
}
ret = ioctl(fd, BTRFS_IOC_RM_DEV_V2, &vol_args);
if (ret == -1)
perror("ioctl");
close(fd);
return EXIT_SUCCESS;
}
When unmounting the filesystem, we'll hit the
WARN_ON(mnt_get_writers(mnt)) in cleanup_mnt() and also may prevent the
filesystem to be remounted read-only as the writer count will stay
lifted.
Fixes: 6b526ed70c ("btrfs: introduce device delete by devid")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For inlined extent, we only have one segment, thus less things to check.
And further more, inlined extent always has the csum in its leaf header,
it's less probable to have corrupted data.
Anyway, still check header and segment header.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
James Harvey reported that some corrupted compressed extent data can
lead to various kernel memory corruption.
Such corrupted extent data belongs to inode with NODATASUM flags, thus
data csum won't help us detecting such bug.
If lucky enough, KASAN could catch it like:
BUG: KASAN: slab-out-of-bounds in lzo_decompress_bio+0x384/0x7a0 [btrfs]
Write of size 4096 at addr ffff8800606cb0f8 by task kworker/u16:0/2338
CPU: 3 PID: 2338 Comm: kworker/u16:0 Tainted: G O 4.17.0-rc5-custom+ #50
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
Call Trace:
dump_stack+0xc2/0x16b
print_address_description+0x6a/0x270
kasan_report+0x260/0x380
memcpy+0x34/0x50
lzo_decompress_bio+0x384/0x7a0 [btrfs]
end_compressed_bio_read+0x99f/0x10b0 [btrfs]
bio_endio+0x32e/0x640
normal_work_helper+0x15a/0xea0 [btrfs]
process_one_work+0x7e3/0x1470
worker_thread+0x1b0/0x1170
kthread+0x2db/0x390
ret_from_fork+0x22/0x40
...
The offending compressed data has the following info:
Header: length 32768 (looks completely valid)
Segment 0 Header: length 3472882419 (obviously out of bounds)
Then when handling segment 0, since it's over the current page, we need
the copy the compressed data to temporary buffer in workspace, then such
large size would trigger out-of-bounds memory access, screwing up the
whole kernel.
Fix it by adding extra checks on header and segment headers to ensure we
won't access out-of-bounds, and even checks the decompressed data won't
be out-of-bounds.
Reported-by: James Harvey <jamespharvey20@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Although it's not that complex, but such comment could still save
several minutes for newer reader/reviewer instead of inferring that from
the code.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor wording updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
Since compression.h is using the SZ_* macros, and if some file includes
only compression.h without linux/sizes.h, it will cause compile error.
One example is lzo.c, if it uses BTRFS_MAX_COMPRESSED. Fix it by adding
linux/sizes.h in compression.h
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_clone_files(), we must check the NODATASUM flag while the
inodes are locked. Otherwise, it's possible that btrfs_ioctl_setflags()
will change the flags after we check and we can end up with a party
checksummed file.
The race window is only a few instructions in size, between the if and
the locks which is:
3834 if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
3835 return -EISDIR;
where the setflags must be run and toggle the NODATASUM flag (provided
the file size is 0). The clone will block on the inode lock, segflags
takes the inode lock, changes flags, releases log and clone continues.
Not impossible but still needs a lot of bad luck to hit unintentionally.
Fixes: 0e7b824c4e ("Btrfs: don't make a file partly checksummed through file clone")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Function btrfs_exclude_logged_extents may call __exclude_logged_extent
which may fail.
Propagate the failures of __exclude_logged_extent to upper caller.
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of setting "parent" to ref->parent only when dealing with
a shared ref and subsequently performing another check to see
if (parent > 0), check the "node->type" directly and act accordingly.
This makes the code more streamline. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of taking only specific member of this structure, which results
in 2 extra arguments, just take the delayed_extent_op struct and
reference the arguments inside the functions. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function currently takes 7 parameters, most of which are proxies
for values from btrfs_delayed_ref_node struct which is not passed. This
patch simplifies the interface of the function by simply passing said
delayed ref node struct to the function. This enables us to:
1. Move locals variables and init code related to them from
run_delayed_tree_ref which should only be used inside
alloc_reserved_tree_block, such as skinny_metadata and the btrfs_key,
representing the extent being inserted. This removes the need for the
"ins" argument. Instead, it's replaced by a local var with a more
verbose name - extent_key.
2. Now that we have a reference to the node in alloc_reserved_tree_block
the delayed_tree_ref struct can be referenced inside the function and
this enable removing the "ref->level", "parent" and "ref_root"
arguments.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction handle which contains a
reference to the fs_info. So use this and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The test failures are not clearly visible in the system log as they're
printed at INFO level. Add a new helper that is level ERROR. As this
touches almost all strings, I took the opportunity to unify them:
- decapitalize the first letter as there's a prefix and the text
continues after ":"
- glue strings split to more lines and un-indent so they fit to 80
columns
- use %llu instead of %Lu
- drop \n from the modified messages (test_msg is left untouched)
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_read_fs_root_no_name() may return ERR_PTR(-ENOENT) or
ERR_PTR(-ENOMEM) and therefore search_ioctl() and
btrfs_search_path_in_tree() should use PTR_ERR() instead of -ENOENT,
which all other callers of btrfs_read_fs_root_no_name() do.
Drop the error message as it would be confusing, the caller of ioctl
will likely interpret the error code and not look into the syslog.
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I got a report that after upgrading to 4.16, someone's filesystems
weren't mounting:
[ 23.845852] BTRFS info (device loop0): unrecognized mount option 'subvol='
Before 4.16, this mounted the default subvolume. It turns out that this
empty "subvol=" is actually an application bug, but it was causing the
application to fail, so it's an ABI break if you squint.
The generic parsing code we use for mount options (match_token())
doesn't match an empty string as "%s". Previously, setup_root_args()
removed the "subvol=" string, but the mount path was cleaned up to not
need that. Add a dummy Opt_subvol_empty to fix this.
The simple workaround is to use / or . for the value of 'subvol=' .
Fixes: 312c89fbca ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()")
CC: stable@vger.kernel.org # 4.16+
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Looks like the original idea was to print the hex of the flags which is
not coded with their flag name. So use the current buf pointer bp
instead of buf.
Reaching the uknown flags should never happen, it's there just in case.
Fixes: ebce0e01b9 ("btrfs: make block group flags in balance printks human-readable")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The dedupe range is 16 MiB, with 4 KiB pages and 8 byte pointers, the
arrays can be 32KiB large. To avoid allocation failures due to
fragmented memory, use the allocation with fallback to vmalloc.
The arrays are allocated and freed only inside btrfs_extent_same and
reused for all the ranges.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We support big dedup requests by splitting range to smaller parts, and
call dedupe logic on each of them.
Instead of repeated allocation and deallocation, allocate once at the
beginning and reuse in the iteration.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_dedupe_file_range silently restricts the dedupe range to
to 16MiB to limit locking and working memory size and is documented in
manual page as implementation specific.
Let's remove that restriction by iterating over the dedup range in 16MiB
steps. This is backward compatible and will not change anything for
requests smaller then 16MiB.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Split btrfs_extent_same() to two parts where one is the main EXTENT_SAME
entry and a helper that can be repeatedly called on a range. This will
be used in following patches.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_link() calls btrfs_orphan_del() if it's linking an O_TMPFILE but
it doesn't reserve space to do so. Even before the removal of the
orphan_block_rsv it wasn't using it.
Fixes: ef3b9af50b ("Btrfs: implement inode_operations callback tmpfile")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We got rid of BTRFS_INODE_HAS_ORPHAN_ITEM and
BTRFS_INODE_ORPHAN_META_RESERVED, so we can renumber the flags to make
them consecutive again.
Signed-off-by: Omar Sandoval <osandov@fb.com>
[ switch them enums so we don't have to do that again ]
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we don't keep long-standing reservations for orphan items,
root->orphan_block_rsv isn't used. We can git rid of it, along with:
- root->orphan_lock, which was used to protect root->orphan_block_rsv
- root->orphan_inodes, which was used as a refcount for root->orphan_block_rsv
- BTRFS_INODE_ORPHAN_META_RESERVED, which was used to track reservations
in root->orphan_block_rsv
- btrfs_orphan_commit_root(), which was the last user of any of these
and does nothing else
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, we keep space reserved for all inode orphan items until the
inode is evicted (i.e., all references to it are dropped). We hit an
issue where an application would keep a bunch of deleted files open (by
design) and thus keep a large amount of space reserved, causing ENOSPC
errors when other operations tried to reserve space. This long-standing
reservation isn't absolutely necessary for a couple of reasons:
- We can almost always make the reservation we need or steal from the
global reserve for the orphan item
- If we can't, it's not the end of the world if we drop the orphan item
on the floor and let the next mount clean it up
So, get rid of persistent reservation and just reserve space in
btrfs_evict_inode().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The truncate loop in btrfs_evict_inode() does two things at once:
- It refills the temporary block reserve, potentially stealing from the
global reserve or committing
- It calls btrfs_truncate_inode_items()
The tangle of continues hides the fact that these two steps are actually
separate. Split the first step out into a separate function both for
clarity and so that we can reuse it in a later patch.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_evict_inode(), if btrfs_truncate_inode_items() fails, the inode
item will still be in the tree but we still return the ino to the ino
cache. That will blow up later when someone tries to allocate that ino,
so don't return it to the cache.
Fixes: 581bb05094 ("Btrfs: Cache free inode numbers in memory")
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_orphan_commit_root() tries to delete an orphan item for a
subvolume in the tree root, but we don't actually insert that item in
the first place. See commit 0a0d4415e3 ("Btrfs: delete dead code in
btrfs_orphan_add()"). We can get rid of it.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we don't add orphan items for truncate, there can't be races on
adding or deleting an orphan item, so this bit is unnecessary.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, we insert an orphan item during a truncate so that if there's
a crash, we don't leak extents past the on-disk i_size. However, since
commit 7f4f6e0a3f ("Btrfs: only update disk_i_size as we remove
extents"), we keep disk_i_size in sync with the extent items as we
truncate, so orphan cleanup will never have any extents to remove. Don't
bother with the superfluous orphan item.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_free_extent() can fail because of ENOMEM. There's no reason to
panic here, we can just abort the transaction.
Fixes: f4b9aa8d3b ("btrfs_truncate")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_truncate_inode_items() uses two variables for error handling, ret
and err. These are not handled consistently, leading to a couple of
bugs.
- Errors from btrfs_del_items() are handled but not propagated to the
caller
- If btrfs_run_delayed_refs() fails and aborts the transaction, we
continue running
Just use ret everywhere and simplify things a bit, fixing both of these
issues.
Fixes: 79787eaab4 ("btrfs: replace many BUG_ONs with proper error handling")
Fixes: 1262133b8d ("Btrfs: account for crcs in delayed ref processing")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit a41ad394a0 ("Btrfs: convert to the new truncate sequence")
changed btrfs_setsize() to call truncate_setsize() instead of
vmtruncate() but didn't update the comment above it. truncate_setsize()
never fails (the IS_SWAPFILE() check happens elsewhere), so remove the
comment.
Additionally, the comment above btrfs_page_mkwrite() references
vmtruncate(), but truncate_setsize() does the size write and page
locking now.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
select_delayed_ref really just gets the next delayed ref which has to
be processed - either an add ref or drop ref. We never go back for
anything. So the comment is actually bogus, just remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Deletion of a subvolume by rmdir(2) has become allowed by the
'commit cd2decf640b1 ("btrfs: Allow rmdir(2) to delete an empty
subvolume")'.
It is a kind of new feature and this commits add a sysfs entry
/sys/fs/btrfs/features/rmdir_subvol
to indicate the availability of the feature so that a user program
(e.g. fstests) can detect it.
Prior to this commit, all entries in /sys/fs/btrfs/features are feature
which depend on feature bits of superblock (i.e. each feature affects
on-disk format) and managed by attribute_group "btrfs_feature_attr_group".
For each fs, entries in /sys/fs/btrfs/UUID/features indicate which
features are enabled (or can be changed online) for the fs.
However, rmdir_subvol feature only depends on kernel module. Therefore
new attribute_group "btrfs_static_feature_attr_group" is introduced and
sysfs_merge_group() is used to share /sys/fs/btrfs/features directory.
Features in "btrfs_static_feature_attr_group" won't be listed in each
/sys/fs/btrfs/UUID/features.
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use existing named values instead of the raw numbers.
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Kernel logs are very important for the forensic investigations of the
issues in general make it easy to use it. This patch adds 'balance:'
prefix so that it can be easily searched.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
* The simple 'flags' refer to the btrfs inode
* ... that's in 'binode
* the FS_*_FL variables are 'fsflags'
* the old copies of the variable are prefixed by 'old_'
* Struct inode flags contain 'i_flags'.
Signed-off-by: David Sterba <dsterba@suse.com>
The new ioctl is an extension to the FS_IOC_SETFLAGS and adds new
flags and is extensible. Don't get fooled by the XATTR in the name, it
does not have anything in common with the extended attributes,
incidentally also abbreviated as XATTRs.
This patch allows to set the xflags portion of the fsxattr structure,
other items have no meaning and non-zero values will result in
EOPNOTSUPP.
Currently supported xflags:
- APPEND
- IMMUTABLE
- NOATIME
- NODUMP
- SYNC
The structure of btrfs_ioctl_fssetxattr copies btrfs_ioctl_setflags but
is simpler on the flag setting side.
The original patch was written by Chandan Jay Sharma but was incomplete
and no further revision has been sent.
Based-on-patches-by: Chandan Jay Sharma <chandansbg@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new ioctl is an extension to the FS_IOC_GETFLAGS and adds new
flags and is extensible. This patch allows to return the xflags portion
of the fsxattr structure, other items have no meaning for btrfs or can
be added later.
The original patch was written by Chandan Jay Sharma but was incomplete
and no further revision has been sent. Several cleanups were necessary
to avoid confusion with other ioctls, as we have another flavor of
flags.
Based-on-patches-by: Chandan Jay Sharma <chandansbg@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The FS_*_FL flags cannot be easily identified by a prefix but we still
need to recognize them so the 'fsflags' should be closer to the naming
scheme but again the 'fs' part sounds like it's a filesystem flag. I
don't have a better idea for now.
Signed-off-by: David Sterba <dsterba@suse.com>
The FS_*_FL flags cannot be easily identified by a variable name prefix
but we still need to recognize them so the 'fsflags' should be closer to
the naming scheme but again the 'fs' part sounds like it's a filesystem
flag. I don't have a better idea for now.
Signed-off-by: David Sterba <dsterba@suse.com>
Use a local btrfs_fs_devices variable to access the structure.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Delete the uuid_mutex lock here as this thread accesses the
btrfs_fs_devices::devices only (counters or called functions do a list
traversal). And the device_list_mutex lock is already taken.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_dev_replace_finishing updates devices (soruce and target) which
are within the btrfs_fs_devices::devices or withint the cloned seed
devices (btrfs_fs_devices::seed::devices), so we don't need the global
uuid_mutex.
The device replace context is also locked by its own locks.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_open_devices() is using the uuid_mutex, but as btrfs_open_devices
is just limited to openning all the devices under for given fsid, so we
don't need uuid_mutex.
Instead it should hold the device_list_mutex as it updates the members
of the btrfs_fs_devices and btrfs_device and not the whole fs_devs list.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
read_chunk_tree() calls read_one_dev(), but for seed device we have
to search the fs_uuids list, so we need the uuid_mutex. Add a comment
comment, so that we can improve this part.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of de-referencing the device->fs_devices use cur_devices
which points to the same fs_devices and does not change.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The generic block device lookup or cleanup does not need the uuid mutex,
that's only for the device_list_add.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
This function is no longer used outside of inode.c so just make it
static. At the same time give a more becoming name, since it's not
really invalidating the inodes but just calling d_prune_alias. Last,
but not least - move the function above the sole caller to avoid
introducing yet-another-pointless forward declaration.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the wrappers and reduce the amount of low-level details about the
waitqueue management.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the code assumes that there's an implied barrier by the
sequence of code preceding the wakeup, namely the mutex unlock.
As Nikolay pointed out:
I think this is wrong (not your code) but the original assumption that
the RELEASE semantics provided by mutex_unlock is sufficient.
According to memory-barriers.txt:
Section 'LOCK ACQUISITION FUNCTIONS' states:
(2) RELEASE operation implication:
Memory operations issued before the RELEASE will be completed before the
RELEASE operation has completed.
Memory operations issued after the RELEASE *may* be completed before the
RELEASE operation has completed.
(I've bolded the may portion)
The example given there:
As an example, consider the following:
*A = a;
*B = b;
ACQUIRE
*C = c;
*D = d;
RELEASE
*E = e;
*F = f;
The following sequence of events is acceptable:
ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE
So if we assume that *C is modifying the flag which the waitqueue is checking,
and *E is the actual wakeup, then those accesses can be re-ordered...
IMHO this code should be considered broken...
---
To be on the safe side, add the barriers. The synchronization logic
around log using the mutexes and several other threads does not make it
easy to reason for/against the barrier.
CC: Nikolay Borisov <nborisov@suse.com>
Link: https://lkml.kernel.org/r/6ee068d8-1a69-3728-00d1-d86293d43c9f@suse.com
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add convenience wrappers for the waitqueue management that involves
memory barriers to prevent deadlocks. The helpers will let us remove
barriers and the necessary comments in several places.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Under the following case, qgroup rescan can double account cowed tree
blocks:
In this case, extent tree only has one tree block.
-
| transid=5 last committed=4
| btrfs_qgroup_rescan_worker()
| |- btrfs_start_transaction()
| | transid = 5
| |- qgroup_rescan_leaf()
| |- btrfs_search_slot_for_read() on extent tree
| Get the only extent tree block from commit root (transid = 4).
| Scan it, set qgroup_rescan_progress to the last
| EXTENT/META_ITEM + 1
| now qgroup_rescan_progress = A + 1.
|
| fs tree get CoWed, new tree block is at A + 16K
| transid 5 get committed
-
| transid=6 last committed=5
| btrfs_qgroup_rescan_worker()
| btrfs_qgroup_rescan_worker()
| |- btrfs_start_transaction()
| | transid = 5
| |- qgroup_rescan_leaf()
| |- btrfs_search_slot_for_read() on extent tree
| Get the only extent tree block from commit root (transid = 5).
| scan it using qgroup_rescan_progress (A + 1).
| found new tree block beyong A, and it's fs tree block,
| account it to increase qgroup numbers.
-
In above case, tree block A, and tree block A + 16K get accounted twice,
while qgroup rescan should stop when it already reach the last leaf,
other than continue using its qgroup_rescan_progress.
Such case could happen by just looping btrfs/017 and with some
possibility it can hit such double qgroup accounting problem.
Fix it by checking the path to determine if we should finish qgroup
rescan, other than relying on next loop to exit.
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing qgroup rescan using the following script (modified from
btrfs/017 test case), we can sometimes hit qgroup corruption.
------
umount $dev &> /dev/null
umount $mnt &> /dev/null
mkfs.btrfs -f -n 64k $dev
mount $dev $mnt
extent_size=8192
xfs_io -f -d -c "pwrite 0 $extent_size" $mnt/foo > /dev/null
btrfs subvolume snapshot $mnt $mnt/snap
xfs_io -f -c "reflink $mnt/foo" $mnt/foo-reflink > /dev/null
xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink > /dev/null
xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink2 > /dev/unll
btrfs quota enable $mnt
# -W is the new option to only wait rescan while not starting new one
btrfs quota rescan -W $mnt
btrfs qgroup show -prce $mnt
umount $mnt
# Need to patch btrfs-progs to report qgroup mismatch as error
btrfs check $dev || _fail
------
For fast machine, we can hit some corruption which missed accounting
tree blocks:
------
qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 8.00KiB 0.00B none none --- ---
0/257 8.00KiB 0.00B none none --- ---
------
This is due to the fact that we're always searching commit root for
btrfs_find_all_roots() at qgroup_rescan_leaf(), but the leaf we get is
from current transaction, not commit root.
And if our tree blocks get modified in current transaction, we won't
find any owner in commit root, thus causing the corruption.
Fix it by searching commit root for extent tree for
qgroup_rescan_leaf().
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[spotted while going through ->d_fsdata handling around d_splice_alias();
don't really care which tree that goes through]
The only thing even looking at ->d_fsdata in there (since 2012)
had been kfree(dentry->d_fsdata) in btrfs_dentry_delete(). Which,
incidentally, is all btrfs_dentry_delete() does.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are already 2 reports about strangely corrupted super blocks,
where csum still matches but extra garbage gets slipped into super block.
The corruption would looks like:
------
superblock: bytenr=65536, device=/dev/sdc1
---------------------------------------------------------
csum_type 41700 (INVALID)
csum 0x3b252d3a [match]
bytenr 65536
flags 0x1
( WRITTEN )
magic _BHRfS_M [match]
...
incompat_flags 0x5b22400000000169
( MIXED_BACKREF |
COMPRESS_LZO |
BIG_METADATA |
EXTENDED_IREF |
SKINNY_METADATA |
unknown flag: 0x5b22400000000000 )
...
------
Or
------
superblock: bytenr=65536, device=/dev/mapper/x
---------------------------------------------------------
csum_type 35355 (INVALID)
csum_size 32
csum 0xf0dbeddd [match]
bytenr 65536
flags 0x1
( WRITTEN )
magic _BHRfS_M [match]
...
incompat_flags 0x176d200000000169
( MIXED_BACKREF |
COMPRESS_LZO |
BIG_METADATA |
EXTENDED_IREF |
SKINNY_METADATA |
unknown flag: 0x176d200000000000 )
------
Obviously, csum_type and incompat_flags get some garbage, but its csum
still matches, which means kernel calculates the csum based on corrupted
super block memory.
And after manually fixing these values, the filesystem is completely
healthy without any problem exposed by btrfs check.
Although the cause is still unknown, at least detect it and prevent further
corruption.
Both reports have same symptoms, there's an overwrite on offset 192 of
the superblock, by 4 bytes. The superblock structure is not allocated or
freed and stays in the memory for the whole filesystem lifetime, so it's
not a use-after-free kind of error on someone else's leaked page.
As a vague point for the problable cause is mentioning of other system
freezing related to graphic card drivers.
Reported-by: Ken Swenson <flat@imo.uto.moe>
Reported-by: Ben Parsons <9parsonsb@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add brief analysis of the reports ]
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor btrfs_check_super_valid:
1) Rename it to btrfs_validate_mount_super()
Now it's more obvious when the function should be called.
2) Extract core check routine into validate_super()
Later write time check can reuse it, and if needed, we could also
use validate_super() to check each super block.
3) Add more comments about btrfs_validate_mount_super()
Mostly about what it doesn't check and when it should be called.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to validate_super ]
Signed-off-by: David Sterba <dsterba@suse.com>
Move btrfs_check_super_valid() before its single caller to avoid forward
declaration.
Though such code motion is not recommended as it pollutes git history,
in this case the following patches would need to add new forward
declarations for static functions that we want to avoid.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function takes a transaction handle which already contains a
reference to the fs_info. So use it and remove the extra function
argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function alreay takes a transaction handle which holds a reference
to the fs_info. Use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function takes a transaction handle which holds a reference to
fs_info. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function takes a transaction handle which already has a reference
to the fs_info. Use it and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always takes a transaction handle which references the
fs_info structure. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction which has a reference to the
fs_info. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction handle which has a reference
to the fs_info. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always takes a transaction handle which contains a
reference to fs_info. So use that and kill the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction handle which contains a
reference to fs_info. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always takes a trans handle which contains a reference to
the fs_info. Use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function also takes a btrfs_block_group_cache which contains a
referene to the fs_info. So use that and remove the extra argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes trans handle from where fs_info can be
referenced. Remove the redundant parameter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction handle which contains a
reference to fs_info.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already takes a transaction handle which has a reference
to the fs_info.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We also pass in a transaction handle which has a reference to the
fs_info. Just remove the extraneous argument.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This will be necessary for future cleanups which remove the fs_info
argument from some freespace tree functions.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The invariant is that when nr_delalloc_inodes is 0 then the root
mustn't have any inodes on its delalloc inodes list.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently when checking if a directory can be deleted, we always check
if all its children have been processed.
Example: A directory with 2,000,000 files was deleted
original: 1994m57.071s
patch: 1m38.554s
[FIX]
Instead of checking all children on all calls to can_rmdir(), we keep
track of the directory index offset of the child last checked in the
last call to can_rmdir(), and then use it as the starting point for
future calls to can_rmdir().
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Move the allocation after the search when it's clear that the new entry
will be added.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
add_delayed_ref_head really performed 2 independent operations -
initialisting the ref head and adding it to a list. Now that the init
part is in a separate function let's complete the separation between
both operations. This results in a lot simpler interface for
add_delayed_ref_head since the function now deals solely with either
adding the newly initialised delayed ref head or merging it into an
existing delayed ref head. This results in vastly simplified function
signature since 5 arguments are dropped. The only other thing worth
mentioning is that due to this split the WARN_ON catching reinit of
existing. In this patch the condition is extended such that:
qrecord && head_ref->qgroup_ref_root && head_ref->qgroup_reserved
is added. This is done because the two qgroup_* prefixed member are
set only if both ref_root and reserved are passed. So functionally
it's equivalent to the old WARN_ON and allows to remove the two args
from add_delayed_ref_head.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the newly introduced function when initialising the head_ref in
add_delayed_ref_head. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
add_delayed_ref_head implements the logic to both initialize a head_ref
structure as well as perform the necessary operations to add it to the
delayed ref machinery. This has resulted in a very cumebrsome interface
with loads of parameters and code, which at first glance, looks very
unwieldy. Begin untangling it by first extracting the initialization
only code in its own function. It's more or less verbatim copy of the
first part of add_delayed_ref_head.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the initialization part and the critical section code have been
split it's a lot easier to open code add_delayed_data_ref. Do so in the
following manner:
1. The common init function is put immediately after memory-to-be-initialized
is allocated, followed by the specific data ref initialization.
2. The only piece of code that remains in the critical section is
insert_delayed_ref call.
3. Tracing and memory freeing code is moved outside of the critical
section.
No functional changes, just an overall shorter critical section.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the initialization part and the critical section code have been
split it's a lot easier to open code add_delayed_tree_ref. Do so in the
following manner:
1. The comming init code is put immediately after memory-to-be-initialized
is allocated, followed by the ref-specific member initialization.
2. The only piece of code that remains in the critical section is
insert_delayed_ref call.
3. Tracing and memory freeing code is put outside of the critical
section as well.
The only real change here is an overall shorter critical section when
dealing with delayed tree refs. From functional point of view - the code
is unchanged.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the newly introduced helper and remove the duplicate code. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the newly introduced common helper. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
THe majority of the init code for struct btrfs_delayed_ref_node is
duplicated in add_delayed_data_ref and add_delayed_tree_ref. Factor out
the common bits in init_delayed_ref_common. This function is going to be
used in future patches to clean that up. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not good to overwrite -ENOMEM using -EINVAL when failing from mount
option parsing, so just return original error code.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info is always available from the context so we don't need to
store it in the structure.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When debugging quota rescan race, some times btrfs rescan could account
some old (committed) leaf and then re-account newly committed leaf
in next generation.
This race needs extra transid to locate, so add @transid for
trace_btrfs_qgroup_account_extent() for such debug.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Trivial fix to spelling mistake of function name in btrfs_err message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is used in only one place and devid argument is always
passed 0. So just remove it, similarly to how it was removed in the
userspace code.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Origin trace_qgroup_update_counters() only records qgroup id and its
reference count change.
It's good enough to debug qgroup accounting change, but when rescan race
is involved, it's pretty hard to distinguish which modification belongs
to which rescan.
So add old_rfer and old_excl trace output to help distinguishing
different rescan instance.
(Different rescan instance should reset its qgroup->rfer to 0)
For trace event parameter, it just changes from u64 qgroup_id to struct
btrfs_qgroup *qgroup, so number of parameters is not changed at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's used only in inode.c so makes no sense to have it exported. Also
move the definition of btrfs_delalloc_work to inode.c since it's used
only this file.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When allocating a delalloc work we are always setting the delayed_iput
to 0. So remove the delay_iput member of btrfs_delalloc_work, as a
result also remove it as a parameter from btrfs_alloc_delalloc_work
since it's not used anymore.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's always set to 0 so remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ rename to start_delalloc_inodes ]
Signed-off-by: David Sterba <dsterba@suse.com>
It's always set to 0, so just remove it and collapse the constant value
to the only function we are passing it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This parameter was introduced alongside the function in
eb73c1b7ce ("Btrfs: introduce per-subvolume delalloc inode list") to
avoid deadlocks since this function was used in the transaction commit
path. However, commit 8d875f95da ("btrfs: disable strict file flushes
for renames and truncates") removed that usage, rendering the parameter
obsolete.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_shrink_device, before btrfs_search_slot, path->reada is set to
READA_FORWARD. But I think READA_BACK is correct.
Since:
1. key.offset is set to (u64)-1
2. after btrfs_search_slot, btrfs_previous_item is called
So, for readahead previous items, READA_BACK is the correct one.
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will add the following trace events:
1) btrfs_remove_block_group
For btrfs_remove_block_group() function.
Triggered when a block group is really removed.
2) btrfs_add_unused_block_group
Triggered which block group is added to unused_bgs list.
3) btrfs_skip_unused_block_group
Triggered which unused block group is not deleted.
These trace events is pretty handy to debug case related to block group
auto remove.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fs_info can be extracted from btrfs_block_group_cache, and all
btrfs_block_group_cache is created by btrfs_create_block_group_cache()
with fs_info initialized, no need to worry about NULL pointer
dereference.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the commit c6100a4b4e ("Btrfs: replace tree->mapping with
tree->private_data"), parameter fs_info in alloc_reloc_control is
not used. So remove it.
Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a new member struct btrfs_raid_attr::mindev_error so that
btrfs_raid_array can maintain the error code to return if the minimum
number of devices condition is not met while trying to delete a device
in the given raid. And so we can drop btrfs_raid_mindev_error.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a new member struct btrfs_raid_attr::bg_flag so that
btrfs_raid_array can maintain the bit map flag of the raid type, and
so we can drop btrfs_raid_group.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a new member struct btrfs_raid_attr::raid_name so that
btrfs_raid_array can maintain the name of the raid type, and so we can
drop btrfs_raid_type_names.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's pretty handy if we can get the debug output for locking status of
an extent buffer, specially for race condition related debugging.
So add the following output for btrfs_print_tree() and
btrfs_print_leaf():
- refs
- write_locks (as w:%d)
- read_locks (as r:%d)
- blocking_writers (as bw:%d)
- blocking_readers (as br:%d)
- spinning_writers (as sw:%d)
- spinning_readers (as sr:%d)
- lock_owner
- current->pid
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
The helper is quite simple and I'd like to see the locking in the
caller.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While the spinlock does not cause problems, using the mutex is more
correct and consistent with others. The global status of balance is eg.
checked from btrfs_pause_balance or btrfs_cancel_balance with mutex.
Resuming balance happens during mount or ro->rw remount. In the former
case, no other user of the balance_ctl exists, in the latter, balance
cannot run until the ro/rw transition is finished.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter controls locking of the stats part but we can lock it
unconditionally, as this only happens once when balance starts. This is
not performance critical.
Add the prefix for an exported function.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Balance cannot be started on a read-only filesystem and will have to
finish/exit before eg. going to read-only via remount.
In case the filesystem is forcibly set to read-only after an error,
balance will finish anyway and if the cancel call is too fast it will
just wait for that to happen.
The last case is when the balance is paused after mount but it's
read-only and cancelling would want to delete the item. The test is
moved after the check if balance is running at all, as it looks more
logical to report "no balance running" instead of "read-only
filesystem".
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently fs_info::balance_running is 0 or 1 and does not use the
semantics of atomics. The pause and cancel check for 0, that can happen
only after __btrfs_balance exits for whatever reason.
Parallel calls to balance ioctl may enter btrfs_ioctl_balance multiple
times but will block on the balance_mutex that protects the
fs_info::flags bit.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Mutual exclusion of device add/rm and balance was done by the volume
mutex up to version 3.7. The commit 5ac00addc7 ("Btrfs: disallow
mutually exclusive admin operations from user mode") added a bit that
essentially tracked the same information.
The status bit has an advantage over a mutex that it can be set without
restrictions of function context, so it started to be used in the
mount-time resuming of balance or device replace.
But we don't really need to track the same information in two ways.
1) After the previous cleanups, the main ioctl handlers for
add/del/resize copy the EXCL_OP bit next to the volume mutex, here
it's clearly safe.
2) Resuming balance during mount or after rw remount will set only the
EXCL_OP bit and the volume_mutex is held in the kernel thread that
calls btrfs_balance.
3) Resuming device replace during mount or after rw remount is done
after balance and is excluded by the EXCL_OP bit. It does not take
the volume_mutex at all and completely relies on the EXCL_OP bit.
4) The resuming of balance and dev-replace cannot hapen at the same time
as the ioctls cannot be started in parallel. Nevertheless, a crafted
image could trigger that and a warning is printed.
5) Balance is normally excluded by EXCL_OP and also uses own mutex to
protect against concurrent access to its status data. There's some
trickery to maintain the right lock nesting in case we need to
reexamine the status in btrfs_ioctl_balance. The volume_mutex is
removed and the unlock/lock sequence is left in place as we might
expect other waiters to proceed.
6) Similar to 5, the unlock/lock sequence is kept in
btrfs_cancel_balance to allow waiters to continue.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The volume mutex does not protect against anything in this case, the
comment about scrub is right but not related to locking and looks
confusing. The comment in btrfs_find_device_missing_or_by_path is wrong
and confusing too.
The device_list_mutex is not held here to protect device lookup, but in
this case device replace cannot run in parallel with device removal (due
to exclusive op protection), so we don't need further locking here.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function __cancel_balance name is confusing with the cancel
operation of balance and it really resets the state of balance back to
zero. The unset_balance_control helper is called only from one place and
simple enough to be inlined.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace a WARN_ON with a proper check and message in case something goes
really wrong and resumed balance cannot set up its exclusive status.
The check is a user friendly assertion, I don't expect to ever happen
under normal circumstances.
Also document that the paused balance starts here and owns the exclusive
op status.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The device replace is paused by unmount or read only remount, and
resumed on next mount or write remount.
The exclusive status should be checked properly as it's a global
invariant and we must not allow 2 operations run. In this case, the
balance can be also paused and resumed under same conditions. It's
always checked first so dev-replace could see the EXCL_OP already taken,
BUT, the ioctl would never let start both at the same time.
Replace the WARN_ON with message and return 0, indicating no error as
this is purely theoretical and the user will be informed. Resolving that
manually should be possible by waiting for the other operation to finish
or cancel the paused state.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move locking and unlocking next to the BTRFS_FS_EXCL_OP bit manipulation
so it's obvious that the two happen at the same time.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function logically belongs there and there's only a single caller,
no need to export it. No code changes.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function will be used outside of volumes.c, the allocation
btrfs_alloc_device is also exported.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a preparatory cleanup that will make clear that the only
successful way out of btrfs_init_dev_replace_tgtdev will also set the
device_out to a valid pointer. With this guarantee, the callers can be
simplified.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is called once and is fairly small, we can merge it with
the caller.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function uses fs_info::fs_devices number of time, however we
declare and use it only at the end, instead do it in the beginning of
the function and use it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
find_device() declares struct list_head *head pointer and used only once,
instead just use it directly.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_open_devices() is un-exported drop __ prefix and rename it to
open_fs_devices().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_close_devices() is un-exported, drop the __ prefix and rename it
to close_fs_devices().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_open_devices() declares struct list_head *head, however head is
used only once, instead use btrfs_fs_devices::devices directly.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_fs_devices::list is the list of BTRFS fsid in the kernel, a generic
name 'list' makes it's search very difficult, rename it to fs_list.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is called from only 1 place and is effectively a wrapper
over wait_completion/kfree. It doesn't really bring any value having
those two calls in a separate function. Just open code it and remove it.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be directly referenced from the passed address_space so do that.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
list_empty_careful usually is a signal of something tricky going on. Its
usage in btrfs is actually not needed since both lists it's used on are
local to a function and cannot be modified concurrently. So switch to
plain list_empty. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is called only from btrfs_readpage and is already passed
the mapping. Simplify its signature by moving the code obtaining
reference to the extent tree in the function. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not used in the function so just remove it. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already gets the page from which the two extent trees
are referenced. Simplify its signature by moving the code getting the
trees inside the function. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change the behavior of rmdir(2) and allow it to delete an empty
subvolume by using btrfs_delete_subvolume() which is used by
btrfs_ioctl_snap_destroy().
This is a change in behaviour and has been requested by users. Deleting
the subvolume by ioctl requires root permissions while the rmdir way
does works with standard tools and syscalls for all users that can
access the subvolume.
The main usecase is to allow 'rm -rf /path/with/subvols' to simply work.
We were not able to find any nasty usability surprises, the intention is
to do the destructive rm. Without allowing rmdir, this would have to be
followed by the ioctl subvolume deletion, which is more of an annoyance.
Implementation details:
The required lock for @dir and inode of @dentry is already acquired in
vfs layer.
We need some check before deleting a subvolume. Permission check is done
in vfs layer, emptiness check is in btrfs_rmdir() and additional check
(i.e. neither the subvolume is a default subvolume nor send is in progress)
is in btrfs_delete_subvolume().
Note that in btrfs_ioctl_snap_destroy(), d_delete() is called after
btrfs_delete_subvolume(). For rmdir(2), d_delete() is called in vfs
layer later.
Tested-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out the second half of btrfs_ioctl_snap_destroy() as
btrfs_delete_subvolume(), which performs some subvolume specific checks
before deletion:
1. send is not in progress
2. the subvolume is not the default subvolume
3. the subvolume does not contain other subvolumes
and actual deletion process. btrfs_delete_subvolume() requires
inode_lock for both @dir and inode of @dentry. The remaining part of
btrfs_ioctl_snap_destroy() is mainly permission checks.
Note that call of d_delete() is not included in btrfs_delete_subvolume()
as this function will also be used by btrfs_rmdir() to delete an empty
subvolume and in that case d_delete() is called in VFS layer.
As a result, btrfs_unlink_subvol() and may_destroy_subvol()
become static functions. No functional changes.
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
This is a preparation work to refactor btrfs_ioctl_snap_destroy()
and to allow rmdir(2) to delete an empty subvolume.
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor update of the function comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
With commit b18253ec57c0 ("btrfs: optimize free space tree bitmap
conversion"), there are no more callers to le_test_bit(). This patch
removes le_test_bit().
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Presently, convert_free_space_to_extents() does a linear scan of the
bitmap. We can speed this up with find_next_{bit,zero_bit}_le().
This patch replaces the linear scan with find_next_{bit,zero_bit}_le().
Testing shows a 20-33% decrease in execution time for
convert_free_space_to_extents().
Since we change bitmap to be unsigned long, we have to do some casting
for the bitmap cursor. In le_bitmap_set() it makes sense to use u8, as
we are doing bit operations. Everywhere else, we're just using it for
pointer arithmetic and not directly accessing it, so char seems more
appropriate.
Suggested-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
le_bitmap_set() is only used by free-space-tree, so move it there and
make it static. le_bitmap_clear() is not used, so remove it.
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We really want to know to which filesystem the extent map events belong,
but as it cannot be reached from the extent_map pointers, we need to
pass it down the callchain.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory work to pass fs_info to btrfs_add_extent_mapping so we can
get a better tracepoint message. Extent maps do not need fs_info for
anything so we only add a dummy one without any other initialization.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The second if is really a subcase of ret being less than 0. So
introduce a generic if (ret < 0) check, and inside have another if
which explicitly handles the -ENOSPC and any other errors. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Locks should generally be released in the oppposite order they are
acquired. Generally lock acquisiton ordering is used to ensure
deadlocks don't happen. However, as becomes more complicated it's
best to also maintain proper unlock order so as to avoid possible dead
locks. This was found by code inspection and doesn't necessarily lead
to a deadlock scenario.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently __endio_write_update_ordered uses labels to implement
what is essentially a simple while loop. This makes the code more
cumbersome to follow than it actually has to be. No functional
changes. No xfstest regressions were found during testing.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Adds comments about BTRFS_FS_EXCL_OP to existing comments
about the device locks.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
It's used to print its pointer in a debug statement but doesn't really
bring any useful information to the error message.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_get_block_group_info() was introduced by the
commit 5af3e8cce8 ("Btrfs: make filesystem read-only when submitting
barrier fails") which used it in disk-io.c.
However, the function is only called in ioctl.c now.
Its parameter type btrfs_ioctl_space_info* is only for ioctl.
So, make it static and rename it to be original name
get_block_group_info.
No functional change.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
add_pinned_bytes really cares whether the bytes being pinned are either
data or metadata. To that effect it checks whether the 'owner' argument
is less than BTRFS_FIRST_FREE_OBJECTID (256). This works because
owner can really have 2 types of values:
a) For metadata extents it holds the level at which the parent is in
the btree. This amounts to owner having the values 0-7
b) In case of modifying data extents, owner is the inode number
to which those extents belongs.
Let's make this more explicit byt converting the owner parameter to a
boolean value and either pass it directly when we know the type of
extents we are working with (i.e. in btrfs_free_tree_block). In cases
when the parent function can be called on both metadata/data extents
perform the check in the caller. This hopefully makes the interface
of add_pinned_bytes more intuitive.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsG/ecACgkQxWXV+ddt
WDvr3w/8D12pwR9sPcEwxD4pvoLv7LP1VRQy2u+ivSifdBD7MueKh3y0igUMyARR
LERsK0zUsTQGkkC6c7ZYd4cT9PikPpXtO1P9iATFAKqR/YMDIV/haSqT8DwbI/qb
7F+ZMeTy1LzL01YlYBrGVDxP8AWVO2Dml6JolYxzplILSLvdPH6G8xOSjei/p9sm
RK5ERHJENEI0l/cThpiLoAEWjzciPtR39T5Hq45onHyCs3bjJCcx51/QE8sBsl8x
+BKvCmL40UKd30YKudJZYDM6NgMgWENhfTtIZQIInv99sMNCxIgTEUdX8ExdyjRZ
24rst/BuQz4d8r/8zqE/hdFsHRGWwnEiYmGWylanPY5KdQ41ULfXC06xuoNOLoW8
KQwD8SWv+W5vEJW0UQz5cb3vUgv5RnUzPvcmMfSztLeo2K4zj6zCK5L6XJwIJNbM
1AJR7R4TRkQdf5QEeziFl738Yv1AgsPQuKSiiFa9YwXMLU8dYXlx14ioUzBL8MLe
1wZPJ03x/N7eKJ0g6OIAAVfUTFFejv4Z2B2IDoObuLLsPwTdK6tS+9tJ5mos7ngG
Vf1ZVmhmeJdw1qwK8ROzAJHkK807KgGO7LWmA7tIVLwWuZX14F7xLQIg3Ux3MhIh
NhoBTFy2AGmdE0hFYv/4FA5dnUOU4VTVYVw3QUV4DMc0XIodZrE=
=iYyx
-----END PGP SIGNATURE-----
Merge tag 'for-4.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A one-liner that prevents leaking an internal error value 1 out of the
ftruncate syscall.
This has been observed in practice. The steps to reproduce make a
common pattern (open/write/fync/ftruncate) but also need the
application to not check only for negative values and happens only for
compressed inlined files.
The conditions are narrow but as this could break userspace I think
it's better to merge it now and not wait for the merge window"
* tag 'for-4.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix error handling in btrfs_truncate()
Jun Wu at Facebook reported that an internal service was seeing a return
value of 1 from ftruncate() on Btrfs in some cases. This is coming from
the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items().
btrfs_truncate() uses two variables for error handling, ret and err.
When btrfs_truncate_inode_items() returns non-zero, we set err to the
return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we
only set err if ret is an error (i.e., negative).
To reproduce the issue: mount a filesystem with -o compress-force=zstd
and the following program will encounter return value of 1 from
ftruncate:
int main(void) {
char buf[256] = { 0 };
int ret;
int fd;
fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
if (fd == -1) {
perror("open");
return EXIT_FAILURE;
}
if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
perror("write");
close(fd);
return EXIT_FAILURE;
}
if (fsync(fd) == -1) {
perror("fsync");
close(fd);
return EXIT_FAILURE;
}
ret = ftruncate(fd, 128);
if (ret) {
printf("ftruncate() returned %d\n", ret);
close(fd);
return EXIT_FAILURE;
}
close(fd);
return EXIT_SUCCESS;
}
Fixes: ddfae63cc8 ("btrfs: move btrfs_truncate_block out of trans handle")
CC: stable@vger.kernel.org # 4.15+
Reported-by: Jun Wu <quark@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull vfs fixes from Al Viro:
"Assorted fixes all over the place"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
aio: fix io_destroy(2) vs. lookup_ioctx() race
ext2: fix a block leak
nfsd: vfs_mkdir() might succeed leaving dentry negative unhashed
cachefiles: vfs_mkdir() might succeed leaving dentry negative unhashed
unfuck sysfs_mount()
kernfs: deal with kernfs_fill_super() failures
cramfs: Fix IS_ENABLED typo
befs_lookup(): use d_splice_alias()
affs_lookup: switch to d_splice_alias()
affs_lookup(): close a race with affs_remove_link()
fix breakage caused by d_find_alias() semantics change
fs: don't scan the inode cache before SB_BORN is set
do d_instantiate/unlock_new_inode combinations safely
iov_iter: fix memory leak in pipe_get_pages_alloc()
iov_iter: fix return type of __pipe_get_pages()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsBhWUACgkQxWXV+ddt
WDsl5g/7BwC4g1BICBG5SXKG9e/s0gjf/3xh7XI8g9kYYu3NktH4fWqDNNncgKtQ
LL4WTcFhYJ+Cx/wkgPoYHfR9CKN2dR038S1OneKz+nhP/dTXw1MnSmfNP4kECqSQ
vwdeDKwlO0Qsy2PdSLjLk8/Yn43wNleBI0swEF+5q7AQgv7XW9hk1oCpwJ7gjDw2
Ymb3WVlj/V0QhnZRnQEgnRwK4xLiOBszb6C+fxQDjtismWDz12dY3udl6Co18YTW
DnmH3x9qBXpL7D2S/6AtZcafbrfgSeL6PXTlcb1fLHK1HwZbdUAerUrVVlV2aitC
rHbg+pD0X1uzDCt2iZ7MROnZv/gLbU9OSz1foE9pw8xU9J5zbsvLlBSK4P0mdEzI
MaZzqB3H31cUSZJq/BUdGnFAOIykcOEvscn000p/cy7szv+GpWb08rTqvVgZvSM2
ai1qADU7ACaWdFjJUqbOi3zWyT6AcGONwjfSIaa/y3DyGzVX3UyJxeIuvznPS2Yt
17B4GRIbF1xPbNRRBw7N60E8o4p8t+BMftStMCBSl8zxnjd6RPOCluOH/az6tL+H
hmY/nGvJZCj3Y6SeLGiKXdNH9MFkhcvEIvePFkUt3AEEHcCtdG5RebrAvVSpO+N4
1SUAE1y8Cbco/KYMjlERpIzZIKOBkD/EnSBIXTI9mIVC0op6mII=
=bhlK
-----END PGP SIGNATURE-----
Merge tag 'for-4.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"We've accumulated some fixes during the last week, some of them were
in the works for a longer time but there are some newer ones too.
Most of the fixes have a reproducer and fix user visible problems,
also candidates for stable kernels. They IMHO qualify for a late rc,
though I did not expect that many"
* tag 'for-4.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix crash when trying to resume balance without the resume flag
btrfs: Fix delalloc inodes invalidation during transaction abort
btrfs: Split btrfs_del_delalloc_inode into 2 functions
btrfs: fix reading stale metadata blocks after degraded raid1 mounts
btrfs: property: Set incompat flag if lzo/zstd compression is set
Btrfs: fix duplicate extents after fsync of file with prealloc extents
Btrfs: fix xattr loss after power failure
Btrfs: send, fix invalid access to commit roots due to concurrent snapshotting
We set the BTRFS_BALANCE_RESUME flag in the btrfs_recover_balance()
only, which isn't called during the remount. So when resuming from
the paused balance we hit the bug:
kernel: kernel BUG at fs/btrfs/volumes.c:3890!
::
kernel: balance_kthread+0x51/0x60 [btrfs]
kernel: kthread+0x111/0x130
::
kernel: RIP: btrfs_balance+0x12e1/0x1570 [btrfs] RSP: ffffba7d0090bde8
Reproducer:
On a mounted filesystem:
btrfs balance start --full-balance /btrfs
btrfs balance pause /btrfs
mount -o remount,ro /dev/sdb /btrfs
mount -o remount,rw /dev/sdb /btrfs
To fix this set the BTRFS_BALANCE_RESUME flag in
btrfs_resume_balance_async().
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a transaction is aborted btrfs_cleanup_transaction is called to
cleanup all the various in-flight bits and pieces which migth be
active. One of those is delalloc inodes - inodes which have dirty
pages which haven't been persisted yet. Currently the process of
freeing such delalloc inodes in exceptional circumstances such as
transaction abort boiled down to calling btrfs_invalidate_inodes whose
sole job is to invalidate the dentries for all inodes related to a
root. This is in fact wrong and insufficient since such delalloc inodes
will likely have pending pages or ordered-extents and will be linked to
the sb->s_inode_list. This means that unmounting a btrfs instance with
an aborted transaction could potentially lead inodes/their pages
visible to the system long after their superblock has been freed. This
in turn leads to a "use-after-free" situation once page shrink is
triggered. This situation could be simulated by running generic/019
which would cause such inodes to be left hanging, followed by
generic/176 which causes memory pressure and page eviction which lead
to touching the freed super block instance. This situation is
additionally detected by the unmount code of VFS with the following
message:
"VFS: Busy inodes after unmount of Self-destruct in 5 seconds. Have a nice day..."
Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
in free_fs_root for the same reason.
This patch aims to rectify the sitaution by doing the following:
1. Change btrfs_destroy_delalloc_inodes so that it calls
invalidate_inode_pages2 for every inode on the delalloc list, this
ensures that all the pages of the inode are released. This function
boils down to calling btrfs_releasepage. During test I observed cases
where inodes on the delalloc list were having an i_count of 0, so this
necessitates using igrab to be sure we are working on a non-freed inode.
2. Since calling btrfs_releasepage might queue delayed iputs move the
call out to btrfs_cleanup_transaction in btrfs_error_commit_super before
calling run_delayed_iputs for the last time. This is necessary to ensure
that delayed iputs are run.
Note: this patch is tagged for 4.14 stable but the fix applies to older
versions too but needs to be backported manually due to conflicts.
CC: stable@vger.kernel.org # 4.14.x: 2b87733134: btrfs: Split btrfs_del_delalloc_inode into 2 functions
CC: stable@vger.kernel.org # 4.14.x
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment to igrab ]
Signed-off-by: David Sterba <dsterba@suse.com>
This is in preparation of fixing delalloc inodes leakage on transaction
abort. Also export the new function.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If a btree block, aka. extent buffer, is not available in the extent
buffer cache, it'll be read out from the disk instead, i.e.
btrfs_search_slot()
read_block_for_search() # hold parent and its lock, go to read child
btrfs_release_path()
read_tree_block() # read child
Unfortunately, the parent lock got released before reading child, so
commit 5bdd3536cb ("Btrfs: Fix block generation verification race") had
used 0 as parent transid to read the child block. It forces
read_tree_block() not to check if parent transid is different with the
generation id of the child that it reads out from disk.
A simple PoC is included in btrfs/124,
0. A two-disk raid1 btrfs,
1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
2. Mount this filesystem and put it in use, after a while, device tree's
root got COW but block A hasn't been allocated/overwritten yet.
3. Umount it and reload the btrfs module to remove both disks from the
global @fs_devices list.
4. mount -odegraded dev1 and write some data, so now block A is allocated
to be a leaf in checksum tree. Note that only dev1 has the latest
metadata of this filesystem.
5. Umount it and mount it again normally (with both disks), since raid1
can pick up one disk by the writer task's pid, if btrfs_search_slot()
needs to read block A, dev2 which does NOT have the latest metadata
might be read for block A, then we got a stale block A.
6. As parent transid is not checked, block A is marked as uptodate and
put into the extent buffer cache, so the future search won't bother
to read disk again, which means it'll make changes on this stale
one and make it dirty and flush it onto disk.
To avoid the problem, parent transid needs to be passed to
read_tree_block().
In order to get a valid parent transid, we need to hold the parent's
lock until finishing reading child.
This patch needs to be slightly adapted for stable kernels, the
&first_key parameter added to read_tree_block() is from 4.16+
(581c176041). The fix is to replace 0 by 'gen'.
Fixes: 5bdd3536cb ("Btrfs: Fix block generation verification race")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Incompat flag of LZO/ZSTD compression should be set at:
1. mount time (-o compress/compress-force)
2. when defrag is done
3. when property is set
Currently 3. is missing and this commit adds this.
This could lead to a filesystem that uses ZSTD but is not marked as
such. If a kernel without a ZSTD support encounteres a ZSTD compressed
extent, it will handle that but this could be confusing to the user.
Typically the filesystem is mounted with the ZSTD option, but the
discrepancy can arise when a filesystem is never mounted with ZSTD and
then the property on some file is set (and some new extents are
written). A simple mount with -o compress=zstd will fix that up on an
unpatched kernel.
Same goes for LZO, but this has been around for a very long time
(2.6.37) so it's unlikely that a pre-LZO kernel would be used.
Fixes: 5c1aab1dd5 ("btrfs: Add zstd support")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add user visible impact ]
Signed-off-by: David Sterba <dsterba@suse.com>
In commit 471d557afe ("Btrfs: fix loss of prealloc extents past i_size
after fsync log replay"), on fsync, we started to always log all prealloc
extents beyond an inode's i_size in order to avoid losing them after a
power failure. However under some cases this can lead to the log replay
code to create duplicate extent items, with different lengths, in the
extent tree. That happens because, as of that commit, we can now log
extent items based on extent maps that are not on the "modified" list
of extent maps of the inode's extent map tree. Logging extent items based
on extent maps is used during the fast fsync path to save time and for
this to work reliably it requires that the extent maps are not merged
with other adjacent extent maps - having the extent maps in the list
of modified extents gives such guarantee.
Consider the following example, captured during a long run of fsstress,
which illustrates this problem.
We have inode 271, in the filesystem tree (root 5), for which all of the
following operations and discussion apply to.
A buffered write starts at offset 312391 with a length of 933471 bytes
(end offset at 1245862). At this point we have, for this inode, the
following extent maps with the their field values:
em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
block_len 0, orig_block_len 0
em B, start 40960, orig_start 40960, len 376832, block_start 1106399232,
block_len 376832, orig_block_len 376832
em C, start 417792, orig_start 417792, len 782336, block_start
18446744073709551613, block_len 0, orig_block_len 0
em D, start 1200128, orig_start 1200128, len 835584, block_start
1106776064, block_len 835584, orig_block_len 835584
em E, start 2035712, orig_start 2035712, len 245760, block_start
1107611648, block_len 245760, orig_block_len 245760
Extent map A corresponds to a hole and extent maps D and E correspond to
preallocated extents.
Extent map D ends where extent map E begins (1106776064 + 835584 =
1107611648), but these extent maps were not merged because they are in
the inode's list of modified extent maps.
An fsync against this inode is made, which triggers the fast path
(BTRFS_INODE_NEEDS_FULL_SYNC is not set). This fsync triggers writeback
of the data previously written using buffered IO, and when the respective
ordered extent finishes, btrfs_drop_extents() is called against the
(aligned) range 311296..1249279. This causes a split of extent map D at
btrfs_drop_extent_cache(), replacing extent map D with a new extent map
D', also added to the list of modified extents, with the following
values:
em D', start 1249280, orig_start of 1200128,
block_start 1106825216 (= 1106776064 + 1249280 - 1200128),
orig_block_len 835584,
block_len 786432 (835584 - (1249280 - 1200128))
Then, during the fast fsync, btrfs_log_changed_extents() is called and
extent maps D' and E are removed from the list of modified extents. The
flag EXTENT_FLAG_LOGGING is also set on them. After the extents are logged
clear_em_logging() is called on each of them, and that makes extent map E
to be merged with extent map D' (try_merge_map()), resulting in D' being
deleted and E adjusted to:
em E, start 1249280, orig_start 1200128, len 1032192,
block_start 1106825216, block_len 1032192,
orig_block_len 245760
A direct IO write at offset 1847296 and length of 360448 bytes (end offset
at 2207744) starts, and at that moment the following extent maps exist for
our inode:
em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
block_len 0, orig_block_len 0
em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
block_len 270336, orig_block_len 376832
em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
block_len 937984, orig_block_len 937984
em E (prealloc), start 1249280, orig_start 1200128, len 1032192,
block_start 1106825216, block_len 1032192, orig_block_len 245760
The dio write results in drop_extent_cache() being called twice. The first
time for a range that starts at offset 1847296 and ends at offset 2035711
(length of 188416), which results in a double split of extent map E,
replacing it with two new extent maps:
em F, start 1249280, orig_start 1200128, block_start 1106825216,
block_len 598016, orig_block_len 598016
em G, start 2035712, orig_start 1200128, block_start 1107611648,
block_len 245760, orig_block_len 1032192
It also creates a new extent map that represents a part of the requested
IO (through create_io_em()):
em H, start 1847296, len 188416, block_start 1107423232, block_len 188416
The second call to drop_extent_cache() has a range with a start offset of
2035712 and end offset of 2207743 (length of 172032). This leads to
replacing extent map G with a new extent map I with the following values:
em I, start 2207744, orig_start 1200128, block_start 1107783680,
block_len 73728, orig_block_len 1032192
It also creates a new extent map that represents the second part of the
requested IO (through create_io_em()):
em J, start 2035712, len 172032, block_start 1107611648, block_len 172032
The dio write set the inode's i_size to 2207744 bytes.
After the dio write the inode has the following extent maps:
em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
block_len 0, orig_block_len 0
em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
block_len 270336, orig_block_len 376832
em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
block_len 937984, orig_block_len 937984
em F, start 1249280, orig_start 1200128, len 598016,
block_start 1106825216, block_len 598016, orig_block_len 598016
em H, start 1847296, orig_start 1200128, len 188416,
block_start 1107423232, block_len 188416, orig_block_len 835584
em J, start 2035712, orig_start 2035712, len 172032,
block_start 1107611648, block_len 172032, orig_block_len 245760
em I, start 2207744, orig_start 1200128, len 73728,
block_start 1107783680, block_len 73728, orig_block_len 1032192
Now do some change to the file, like adding a xattr for example and then
fsync it again. This triggers a fast fsync path, and as of commit
471d557afe ("Btrfs: fix loss of prealloc extents past i_size after fsync
log replay"), we use the extent map I to log a file extent item because
it's a prealloc extent and it starts at an offset matching the inode's
i_size. However when we log it, we create a file extent item with a value
for the disk byte location that is wrong, as can be seen from the
following output of "btrfs inspect-internal dump-tree":
item 1 key (271 EXTENT_DATA 2207744) itemoff 3782 itemsize 53
generation 22 type 2 (prealloc)
prealloc data disk byte 1106776064 nr 1032192
prealloc data offset 1007616 nr 73728
Here the disk byte value corresponds to calculation based on some fields
from the extent map I:
1106776064 = block_start (1107783680) - 1007616 (extent_offset)
extent_offset = 2207744 (start) - 1200128 (orig_start) = 1007616
The disk byte value of 1106776064 clashes with disk byte values of the
file extent items at offsets 1249280 and 1847296 in the fs tree:
item 6 key (271 EXTENT_DATA 1249280) itemoff 3568 itemsize 53
generation 20 type 2 (prealloc)
prealloc data disk byte 1106776064 nr 835584
prealloc data offset 49152 nr 598016
item 7 key (271 EXTENT_DATA 1847296) itemoff 3515 itemsize 53
generation 20 type 1 (regular)
extent data disk byte 1106776064 nr 835584
extent data offset 647168 nr 188416 ram 835584
extent compression 0 (none)
item 8 key (271 EXTENT_DATA 2035712) itemoff 3462 itemsize 53
generation 20 type 1 (regular)
extent data disk byte 1107611648 nr 245760
extent data offset 0 nr 172032 ram 245760
extent compression 0 (none)
item 9 key (271 EXTENT_DATA 2207744) itemoff 3409 itemsize 53
generation 20 type 2 (prealloc)
prealloc data disk byte 1107611648 nr 245760
prealloc data offset 172032 nr 73728
Instead of the disk byte value of 1106776064, the value of 1107611648
should have been logged. Also the data offset value should have been
172032 and not 1007616.
After a log replay we end up getting two extent items in the extent tree
with different lengths, one of 835584, which is correct and existed
before the log replay, and another one of 1032192 which is wrong and is
based on the logged file extent item:
item 12 key (1106776064 EXTENT_ITEM 835584) itemoff 3406 itemsize 53
refs 2 gen 15 flags DATA
extent data backref root 5 objectid 271 offset 1200128 count 2
item 13 key (1106776064 EXTENT_ITEM 1032192) itemoff 3353 itemsize 53
refs 1 gen 22 flags DATA
extent data backref root 5 objectid 271 offset 1200128 count 1
Obviously this leads to many problems and a filesystem check reports many
errors:
(...)
checking extents
Extent back ref already exists for 1106776064 parent 0 root 5 owner 271 offset 1200128 num_refs 1
extent item 1106776064 has multiple extent items
ref mismatch on [1106776064 835584] extent item 2, found 3
Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 2 wanted 1 back 0x55b1d0ad7680
Backref 1106776064 root 5 owner 271 offset 1200128 num_refs 0 not found in extent tree
Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 1 wanted 0 back 0x55b1d0ad4e70
Backref bytes do not match extent backref, bytenr=1106776064, ref bytes=835584, backref bytes=1032192
backpointer mismatch on [1106776064 835584]
checking free space cache
block group 1103101952 has wrong amount of free space
failed to load free space cache for block group 1103101952
checking fs roots
(...)
So fix this by logging the prealloc extents beyond the inode's i_size
based on searches in the subvolume tree instead of the extent maps.
Fixes: 471d557afe ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If a file has xattrs, we fsync it, to ensure we clear the flags
BTRFS_INODE_NEEDS_FULL_SYNC and BTRFS_INODE_COPY_EVERYTHING from its
inode, the current transaction commits and then we fsync it (without
either of those bits being set in its inode), we end up not logging
all its xattrs. This results in deleting all xattrs when replying the
log after a power failure.
Trivial reproducer
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/foobar
$ setfattr -n user.xa -v qwerty /mnt/foobar
$ xfs_io -c "fsync" /mnt/foobar
$ sync
$ xfs_io -c "pwrite -S 0xab 0 64K" /mnt/foobar
$ xfs_io -c "fsync" /mnt/foobar
<power failure>
$ mount /dev/sdb /mnt
$ getfattr --absolute-names --dump /mnt/foobar
<empty output>
$
So fix this by making sure all xattrs are logged if we log a file's inode
item and neither the flags BTRFS_INODE_NEEDS_FULL_SYNC nor
BTRFS_INODE_COPY_EVERYTHING were set in the inode.
Fixes: 36283bf777 ("Btrfs: fix fsync xattr loss in the fast fsync path")
Cc: <stable@vger.kernel.org> # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
btrfs incremental send BUG happens when creating a snapshot of snapshot
that is being used by send.
[REASON]
The problem can happen if while we are doing a send one of the snapshots
used (parent or send) is snapshotted, because snapshoting implies COWing
the root of the source subvolume/snapshot.
1. When doing an incremental send, the send process will get the commit
roots from the parent and send snapshots, and add references to them
through extent_buffer_get().
2. When a snapshot/subvolume is snapshotted, its root node is COWed
(transaction.c:create_pending_snapshot()).
3. COWing releases the space used by the node immediately, through:
__btrfs_cow_block()
--btrfs_free_tree_block()
----btrfs_add_free_space(bytenr of node)
4. Because send doesn't hold a transaction open, it's possible that
the transaction used to create the snapshot commits, switches the
commit root and the old space used by the previous root node gets
assigned to some other node allocation. Allocation of a new node will
use the existing extent buffer found in memory, which we previously
got a reference through extent_buffer_get(), and allow the extent
buffer's content (pages) to be modified:
btrfs_alloc_tree_block
--btrfs_reserve_extent
----find_free_extent (get bytenr of old node)
--btrfs_init_new_buffer (use bytenr of old node)
----btrfs_find_create_tree_block
------alloc_extent_buffer
--------find_extent_buffer (get old node)
5. So send can access invalid memory content and have unpredictable
behaviour.
[FIX]
So we fix the problem by copying the commit roots of the send and
parent snapshots and use those copies.
CallTrace looks like this:
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:1861!
invalid opcode: 0000 [#1] SMP
CPU: 6 PID: 24235 Comm: btrfs Tainted: P O 3.10.105 #23721
ffff88046652d680 ti: ffff88041b720000 task.ti: ffff88041b720000
RIP: 0010:[<ffffffffa08dd0e8>] read_node_slot+0x108/0x110 [btrfs]
RSP: 0018:ffff88041b723b68 EFLAGS: 00010246
RAX: ffff88043ca6b000 RBX: ffff88041b723c50 RCX: ffff880000000000
RDX: 000000000000004c RSI: ffff880314b133f8 RDI: ffff880458b24000
RBP: 0000000000000000 R08: 0000000000000001 R09: ffff88041b723c66
R10: 0000000000000001 R11: 0000000000001000 R12: ffff8803f3e48890
R13: ffff8803f3e48880 R14: ffff880466351800 R15: 0000000000000001
FS: 00007f8c321dc8c0(0000) GS:ffff88047fcc0000(0000)
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
R2: 00007efd1006d000 CR3: 0000000213a24000 CR4: 00000000003407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
ffff88041b723c50 ffff8803f3e48880 ffff8803f3e48890 ffff8803f3e48880
ffff880466351800 0000000000000001 ffffffffa08dd9d7 ffff88041b723c50
ffff8803f3e48880 ffff88041b723c66 ffffffffa08dde85 a9ff88042d2c4400
Call Trace:
[<ffffffffa08dd9d7>] ? tree_move_down.isra.33+0x27/0x50 [btrfs]
[<ffffffffa08dde85>] ? tree_advance+0xb5/0xc0 [btrfs]
[<ffffffffa08e83d4>] ? btrfs_compare_trees+0x2d4/0x760 [btrfs]
[<ffffffffa0982050>] ? finish_inode_if_needed+0x870/0x870 [btrfs]
[<ffffffffa09841ea>] ? btrfs_ioctl_send+0xeda/0x1050 [btrfs]
[<ffffffffa094bd3d>] ? btrfs_ioctl+0x1e3d/0x33f0 [btrfs]
[<ffffffff81111133>] ? handle_pte_fault+0x373/0x990
[<ffffffff8153a096>] ? atomic_notifier_call_chain+0x16/0x20
[<ffffffff81063256>] ? set_task_cpu+0xb6/0x1d0
[<ffffffff811122c3>] ? handle_mm_fault+0x143/0x2a0
[<ffffffff81539cc0>] ? __do_page_fault+0x1d0/0x500
[<ffffffff81062f07>] ? check_preempt_curr+0x57/0x90
[<ffffffff8115075a>] ? do_vfs_ioctl+0x4aa/0x990
[<ffffffff81034f83>] ? do_fork+0x113/0x3b0
[<ffffffff812dd7d7>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[<ffffffff81150cc8>] ? SyS_ioctl+0x88/0xa0
[<ffffffff8153e422>] ? system_call_fastpath+0x16/0x1b
---[ end trace 29576629ee80b2e1 ]---
Fixes: 7069830a9e ("Btrfs: add btrfs_compare_trees function")
CC: stable@vger.kernel.org # 3.6+
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For anything NFS-exported we do _not_ want to unlock new inode
before it has grown an alias; original set of fixes got the
ordering right, but missed the nasty complication in case of
lockdep being enabled - unlock_new_inode() does
lockdep_annotate_inode_mutex_key(inode)
which can only be done before anyone gets a chance to touch
->i_mutex. Unfortunately, flipping the order and doing
unlock_new_inode() before d_instantiate() opens a window when
mkdir can race with open-by-fhandle on a guessed fhandle, leading
to multiple aliases for a directory inode and all the breakage
that follows from that.
Correct solution: a new primitive (d_instantiate_new())
combining these two in the right order - lockdep annotate, then
d_instantiate(), then the rest of unlock_new_inode(). All
combinations of d_instantiate() with unlock_new_inode() should
be converted to that.
Cc: stable@kernel.org # 2.6.29 and later
Tested-by: Mike Marshall <hubcap@omnibond.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrsbkEACgkQxWXV+ddt
WDvm6Q/+KdFKJ7T8hBOc6o5EeULXCDF3FmMA7HvDC696WXKsckXJFKk52awvrSb6
3wnIzfWmI3K+rwX3cKqLRKe6tMXtBrTjVWXyezfvx1SMcCO4hSQ+nWLqK08htaNf
h7m3OC3y0xO8QHcFSkvHUov6KRWG3rH+4p46JsJjN7GTBtWmR6tsiyQQ9JMC3gNR
8Jnl1YaQq/JDLFm8GmFfPqIK+MLnNJ+GOJC1pm2vJQFtjnDw9dic+dI2hGX2oh9M
SSRAoJu7jUvTWSmQN9aJfbBUr4atzoKKGYsyAgx5qgXbzOnbUGTIhtyAZirRWWBy
0pT2b/8XuqsIabwR5dR4UbL4Ke1h5DS4c6GFydwO4DeddTovHtDzbN0cPuPQABL1
rwFzlnHhcM/qRu9SKXx1jRy7w4Vju8fVX9D4lyjLcyk24flkEAn1NlJCWEqSzPYR
ikTTm71r/1/62XqE6AcOyugS8E6EYtYHo3PjrfFXr3fQCbctTLEaUKoegFkTezbX
EZLRPy9KNGfuyUyh3eiSypNHZZ9WiT+W42BaLIcpbHJLnqB+A14Z0oZx6aHVhY0T
VFLL91O//OmUvjpsZ7I99LyswvrzsmU0jKS41GXOQlLsLTxtGhcZbvRt4uX4UUie
UOQrNCgO0Y2slGfv+uoDCLhH0tTCRziuEvmgu8bjPcfZE7tph+s=
=Jdmp
-----END PGP SIGNATURE-----
Merge tag 'for-4.17-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two regression fixes and one fix for stable"
* tag 'for-4.17-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: send, fix missing truncate for inode with prealloc extent past eof
btrfs: Take trans lock before access running trans in check_delayed_ref
btrfs: Fix wrong first_key parameter in replace_path
An incremental send operation can miss a truncate operation when an inode
has an increased size in the send snapshot and a prealloc extent beyond
its size.
Consider the following scenario where a necessary truncate operation is
missing in the incremental send stream:
1) In the parent snapshot an inode has a size of 1282957 bytes and it has
no prealloc extents beyond its size;
2) In the the send snapshot it has a size of 5738496 bytes and has a new
extent at offsets 1884160 (length of 106496 bytes) and a prealloc
extent beyond eof at offset 6729728 (and a length of 339968 bytes);
3) When processing the prealloc extent, at offset 6729728, we end up at
send.c:send_write_or_clone() and set the @len variable to a value of
18446744073708560384 because @offset plus the original @len value is
larger then the inode's size (6729728 + 339968 > 5738496). We then
call send_extent_data(), with that @offset and @len, which in turn
calls send_write(), and then the later calls fill_read_buf(). Because
the offset passed to fill_read_buf() is greater then inode's i_size,
this function returns 0 immediately, which makes send_write() and
send_extent_data() do nothing and return immediately as well. When
we get back to send.c:send_write_or_clone() we adjust the value
of sctx->cur_inode_next_write_offset to @offset plus @len, which
corresponds to 6729728 + 18446744073708560384 = 5738496, which is
precisely the the size of the inode in the send snapshot;
4) Later when at send.c:finish_inode_if_needed() we determine that
we don't need to issue a truncate operation because the value of
sctx->cur_inode_next_write_offset corresponds to the inode's new
size, 5738496 bytes. This is wrong because the last write operation
that was issued started at offset 1884160 with a length of 106496
bytes, so the correct value for sctx->cur_inode_next_write_offset
should be 1990656 (1884160 + 106496), so that a truncate operation
with a value of 5738496 bytes would have been sent to insert a
trailing hole at the destination.
So fix the issue by making send.c:send_write_or_clone() not attempt
to send write or clone operations for extents that start beyond the
inode's size, since such attempts do nothing but waste time by
calling helper functions and allocating path structures, and send
currently has no fallocate command in order to create prealloc extents
at the destination (either beyond a file's eof or not).
The issue was found running the test btrfs/007 from fstests using a seed
value of 1524346151 for fsstress.
Reported-by: Gu, Jinxiang <gujx@cn.fujitsu.com>
Fixes: ffa7c4296e ("Btrfs: send, do not issue unnecessary truncate operations")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preivous patch:
Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist
We avoid starting btrfs transaction and get this information from
fs_info->running_transaction directly.
When accessing running_transaction in check_delayed_ref, there's a
chance that current transaction will be freed by commit transaction
after the NULL pointer check of running_transaction is passed.
After looking all the other places using fs_info->running_transaction,
they are either protected by trans_lock or holding the transactions.
Fix this by using trans_lock and increasing the use_count.
Fixes: e4c3b2dcd1 ("Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 581c176041 ("btrfs: Validate child tree block's level and first
key") introduced new @first_key parameter for read_tree_block(), however
caller in replace_path() is parasing wrong key to read_tree_block().
It should use parameter @first_key other than @key.
Normally it won't expose problem as @key is normally initialzied to the
same value of @first_key we expect.
However in relocation recovery case, @key can be set to (0, 0, 0), and
since no valid key in relocation tree can be (0, 0, 0), it will cause
read_tree_block() to return -EUCLEAN and interrupt relocation recovery.
Fix it by setting @first_key correctly.
Fixes: 581c176041 ("btrfs: Validate child tree block's level and first key")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrcop8ACgkQxWXV+ddt
WDuT2Q/9HZo9b1N4f+dM38P/cyBwEt9VYT0MLzpIpOid6ogpdF+1E7F6rDnUOyEQ
0Me+znhC7kgXLBnpUDS16/2uGbp9VgCZo9NSq8juLrowGYpe9asadWbCqfFBmAh6
Pc6r03Z1Z1EhXXUSm0cqJwvNHpNN6WnfBP1ithcp4OAltjVZ7IU45xHCcymfTxi+
R6/kvJlG56YXoT97on/CHF85pAidELJ7CysLBGpCHP7Yl/lkVkyPVxSZO9up6ZtS
078l0nCG9PpzONbdUcB9QGICPDcNLbBDhrS8yzQ+oajiRBz0+Im4c2UUvmZEUiMb
6EAo6QV58OFUskgsY1cHWgZGS3YIFwzpmdsRQrCyuI4Z+1KXVu+RWMv1hF1G1Dqw
8fHDm6+V2oqcoK3lSnb40UUySGw2d7PwjGxUk5iGKWISdmrxQsLcgkY/ERbjZxar
emelwAZbsPwUNYNJj8qbm1QZlmys8FACKzJqyCR6+7/n353uWE30Tnv1MgfJg0nQ
7ejk3U96R+oynN/9WOBFCLpuvPCbYaZk4eA9+3EjcSLzN81r/WtbgQDntEWi9auO
/OUADaHOXoIaUnTBNtFV9DTseDW21hQ/4NseTVVsjKGWjh2qWRoEwuqI7TEZD/4M
kGgdJL+xvD0MzR6Tq7xbhnQTcCOwH6NQ7aI4rI3abDbU9ZPuDrY=
=lgaO
-----END PGP SIGNATURE-----
Merge tag 'for-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"This contains a few fixups to the qgroup patches that were merged this
dev cycle, unaligned access fix, blockgroup removal corner case fix
and a small debugging output tweak"
* tag 'for-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: print-tree: debugging output enhancement
btrfs: Fix race condition between delayed refs and blockgroup removal
btrfs: fix unaligned access in readdir
btrfs: Fix wrong btrfs_delalloc_release_extents parameter
btrfs: delayed-inode: Remove wrong qgroup meta reservation calls
btrfs: qgroup: Use independent and accurate per inode qgroup rsv
btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT
This patch enhances the following things:
- tree block header
* add generation and owner output for node and leaf
- node pointer generation output
- allow btrfs_print_tree() to not follow nodes
* just like btrfs-progs
Please note that, although function btrfs_print_tree() is not called by
anyone right now, it's still a pretty useful function to debug kernel.
So that function is still kept for later use.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the delayed refs for a head are all run, eventually
cleanup_ref_head is called which (in case of deletion) obtains a
reference for the relevant btrfs_space_info struct by querying the bg
for the range. This is problematic because when the last extent of a
bg is deleted a race window emerges between removal of that bg and the
subsequent invocation of cleanup_ref_head. This can result in cache being null
and either a null pointer dereference or assertion failure.
task: ffff8d04d31ed080 task.stack: ffff9e5dc10cc000
RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs]
RSP: 0018:ffff9e5dc10cfbe8 EFLAGS: 00010292
RAX: 0000000000000044 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff8d04ffc1f868 RSI: ffff8d04ffc178c8 RDI: ffff8d04ffc178c8
RBP: ffff8d04d29e5ea0 R08: 00000000000001f0 R09: 0000000000000001
R10: ffff9e5dc0507d58 R11: 0000000000000001 R12: ffff8d04d29e5ea0
R13: ffff8d04d29e5f08 R14: ffff8d04efe29b40 R15: ffff8d04efe203e0
FS: 00007fbf58ead500(0000) GS:ffff8d04ffc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe6c6975648 CR3: 0000000013b2a000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs]
btrfs_run_delayed_refs+0x68/0x250 [btrfs]
btrfs_should_end_transaction+0x42/0x60 [btrfs]
btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs]
btrfs_evict_inode+0x4c6/0x5c0 [btrfs]
evict+0xc6/0x190
do_unlinkat+0x19c/0x300
do_syscall_64+0x74/0x140
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7fbf589c57a7
To fix this, introduce a new flag "is_system" to head_ref structs,
which is populated at insertion time. This allows to decouple the
querying for the spaceinfo from querying the possibly deleted bg.
Fixes: d7eae3403f ("Btrfs: rework delayed ref total_bytes_pinned accounting")
CC: stable@vger.kernel.org # 4.14+
Suggested-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The last update to readdir introduced a temporary buffer to store the
emitted readdir data, but as there are file names of variable length,
there's a lot of unaligned access.
This was observed on a sparc64 machine:
Kernel unaligned access at TPC[102f3080] btrfs_real_readdir+0x51c/0x718 [btrfs]
Fixes: 23b5ec7494 ("btrfs: fix readdir deadlock with pagefault")
CC: stable@vger.kernel.org # 4.14+
Reported-and-tested-by: René Rebe <rene@exactcode.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 43b18595d6 ("btrfs: qgroup: Use separate meta reservation type
for delalloc") merged into mainline is not the latest version submitted
to mail list in Dec 2017.
It has a fatal wrong @qgroup_free parameter, which results increasing
qgroup metadata pertrans reserved space, and causing a lot of early EDQUOT.
Fix it by applying the correct diff on top of current branch.
Fixes: 43b18595d6 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 4f5427ccce ("btrfs: delayed-inode: Use new qgroup meta rsv for
delayed inode and item") merged into mainline was not latest version
submitted to the mail list in Dec 2017.
Which lacks the following fixes:
1) Remove btrfs_qgroup_convert_reserved_meta() call in
btrfs_delayed_item_release_metadata()
2) Remove btrfs_qgroup_reserve_meta_prealloc() call in
btrfs_delayed_inode_reserve_metadata()
Those fixes will resolve unexpected EDQUOT problems.
Fixes: 4f5427ccce ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unlike reservation calculation used in inode rsv for metadata, qgroup
doesn't really need to care about things like csum size or extent usage
for the whole tree COW.
Qgroups care more about net change of the extent usage.
That's to say, if we're going to insert one file extent, it will mostly
find its place in COWed tree block, leaving no change in extent usage.
Or causing a leaf split, resulting in one new net extent and increasing
qgroup number by nodesize.
Or in an even more rare case, increase the tree level, increasing qgroup
number by 2 * nodesize.
So here instead of using the complicated calculation for extent
allocator, which cares more about accuracy and no error, qgroup doesn't
need that over-estimated reservation.
This patch will maintain 2 new members in btrfs_block_rsv structure for
qgroup, using much smaller calculation for qgroup rsv, reducing false
EDQUOT.
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Unlike previous method that tries to commit transaction inside
qgroup_reserve(), this time we will try to commit transaction using
fs_info->transaction_kthread to avoid nested transaction and no need to
worry about locking context.
Since it's an asynchronous function call and we won't wait for
transaction commit, unlike previous method, we must call it before we
hit the qgroup limit.
So this patch will use the ratio and size of qgroup meta_pertrans
reservation as indicator to check if we should trigger a transaction
commit. (meta_prealloc won't be cleaned in transaction committ, it's
useless anyway)
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrTQ4sACgkQxWXV+ddt
WDti3Q/+MAeqsLTjvre2RQ3ka5hNyCuVftUIBmcP3YfJbt+xZYQyaewW4Xkfi3cm
cbJE+zehzf5ag+RJhxk3OwFTvNfLGIO9asWs3b08NGUi6VzwL0/8B/iOdZPuHSAV
TrecQIBE2Tp+xax9cQEnxav34D4dUtXNaDweGjp1MIIUkDneQP/I0vlTu7vafBgX
UVxP6riL/MCs7sjTHGIPs0lv8L/fgdmo+dk5SnNuIPTOcFTQXgVrtHjw9IvbKWd4
aq+sbNWoSrhXUfllbFg/wZqDe9tWn9E2f6m/H0ThSoNdxusSVgacOjFRYh20NKLW
WGB8Amd/ItGtJwJ1CIypa7VX2U11UAi0XT7BeiK82rUNEJ6moRqFOXG861gRLoTZ
SpH8uWO+e+CogfXob1KCndn5lot4AM2ZTkCqfrjpM35Nul72PZdne0CxNlmiRupY
Fdt5GB+sg8plcMaRiYr++BbbHP5tggX1MrhLGEbx2XBs2eRdn+2Lv2I1Ig/U4NUb
Vf+xk/tFLKGOTSZlbv7SV7ekXxG/3+7gAuL7A1XMETZCwBF4L3hwyW7CgkEBb3PC
TqX8BwMaRpyp/FgW/QL6edjXZ3a64VaHIfqRPNks3lWWcCHVbzyiVPfEjx+AEJ/0
abx6jXTJhLUBPuPxEgb5rsSv62RoxPoYHqCErrG95XwnZ8rSCts=
=I0y0
-----END PGP SIGNATURE-----
Merge tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull more btrfs updates from David Sterba:
"We have queued a few more fixes (error handling, log replay,
softlockup) and the rest is SPDX updates that touche almost all files
so the diffstat is long"
* tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: Only check first key for committed tree blocks
btrfs: add SPDX header to Kconfig
btrfs: replace GPL boilerplate by SPDX -- sources
btrfs: replace GPL boilerplate by SPDX -- headers
Btrfs: fix loss of prealloc extents past i_size after fsync log replay
Btrfs: clean up resources during umount after trans is aborted
btrfs: Fix possible softlock on single core machines
Btrfs: bail out on error during replay_dir_deletes
Btrfs: fix NULL pointer dereference in log_dir_items
When looping btrfs/074 with many cpus (>= 8), it's possible to trigger
kernel warning due to first key verification:
[ 4239.523446] WARNING: CPU: 5 PID: 2381 at fs/btrfs/disk-io.c:460 btree_read_extent_buffer_pages+0x1ad/0x210
[ 4239.523830] Modules linked in:
[ 4239.524630] RIP: 0010:btree_read_extent_buffer_pages+0x1ad/0x210
[ 4239.527101] Call Trace:
[ 4239.527251] read_tree_block+0x42/0x70
[ 4239.527434] read_node_slot+0xd2/0x110
[ 4239.527632] push_leaf_right+0xad/0x1b0
[ 4239.527809] split_leaf+0x4ea/0x700
[ 4239.527988] ? leaf_space_used+0xbc/0xe0
[ 4239.528192] ? btrfs_set_lock_blocking_rw+0x99/0xb0
[ 4239.528416] btrfs_search_slot+0x8cc/0xa40
[ 4239.528605] btrfs_insert_empty_items+0x71/0xc0
[ 4239.528798] __btrfs_run_delayed_refs+0xa98/0x1680
[ 4239.529013] btrfs_run_delayed_refs+0x10b/0x1b0
[ 4239.529205] btrfs_commit_transaction+0x33/0xaf0
[ 4239.529445] ? start_transaction+0xa8/0x4f0
[ 4239.529630] btrfs_alloc_data_chunk_ondemand+0x1b0/0x4e0
[ 4239.529833] btrfs_check_data_free_space+0x54/0xa0
[ 4239.530045] btrfs_delalloc_reserve_space+0x25/0x70
[ 4239.531907] btrfs_direct_IO+0x233/0x3d0
[ 4239.532098] generic_file_direct_write+0xcb/0x170
[ 4239.532296] btrfs_file_write_iter+0x2bb/0x5f4
[ 4239.532491] aio_write+0xe2/0x180
[ 4239.532669] ? lock_acquire+0xac/0x1e0
[ 4239.532839] ? __might_fault+0x3e/0x90
[ 4239.533032] do_io_submit+0x594/0x860
[ 4239.533223] ? do_io_submit+0x594/0x860
[ 4239.533398] SyS_io_submit+0x10/0x20
[ 4239.533560] ? SyS_io_submit+0x10/0x20
[ 4239.533729] do_syscall_64+0x75/0x1d0
[ 4239.533979] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 4239.534182] RIP: 0033:0x7f8519741697
The problem here is, at btree_read_extent_buffer_pages() we don't have
acquired read/write lock on that extent buffer, only basic info like
level/bytenr is reliable.
So race condition leads to such false alert.
However in current call site, it's impossible to acquire proper lock
without race window.
To fix the problem, we only verify first key for committed tree blocks
(whose generation is no larger than fs_info->last_trans_committed), so
the content of such tree blocks will not change and there is no need to
get read/write lock.
Reported-by: Nikolay Borisov <nborisov@suse.com>
Fixes: 581c176041 ("btrfs: Validate child tree block's level and first key")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.
Signed-off-by: David Sterba <dsterba@suse.com>
Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.
Unify the include protection macros to match the file names.
Signed-off-by: David Sterba <dsterba@suse.com>
Currently if we allocate extents beyond an inode's i_size (through the
fallocate system call) and then fsync the file, we log the extents but
after a power failure we replay them and then immediately drop them.
This behaviour happens since about 2009, commit c71bf099ab ("Btrfs:
Avoid orphan inodes cleanup while replaying log"), because it marks
the inode as an orphan instead of dropping any extents beyond i_size
before replaying logged extents, so after the log replay, and while
the mount operation is still ongoing, we find the inode marked as an
orphan and then perform a truncation (drop extents beyond the inode's
i_size). Because the processing of orphan inodes is still done
right after replaying the log and before the mount operation finishes,
the intention of that commit does not make any sense (at least as
of today). However reverting that behaviour is not enough, because
we can not simply discard all extents beyond i_size and then replay
logged extents, because we risk dropping extents beyond i_size created
in past transactions, for example:
add prealloc extent beyond i_size
fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
transaction commit
add another prealloc extent beyond i_size
fsync - triggers the fast fsync path
power failure
In that scenario, we would drop the first extent and then replay the
second one. To fix this just make sure that all prealloc extents
beyond i_size are logged, and if we find too many (which is far from
a common case), fallback to a full transaction commit (like we do when
logging regular extents in the fast fsync path).
Trivial reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
$ sync
$ xfs_io -c "falloc -k 256K 1M" /mnt/foo
$ xfs_io -c "fsync" /mnt/foo
<power failure>
# mount to replay log
$ mount /dev/sdb /mnt
# at this point the file only has one extent, at offset 0, size 256K
A test case for fstests follows soon, covering multiple scenarios that
involve adding prealloc extents with previous shrinking truncates and
without such truncates.
Fixes: c71bf099ab ("Btrfs: Avoid orphan inodes cleanup while replaying log")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently if some fatal errors occur, like all IO get -EIO, resources
would be cleaned up when
a) transaction is being committed or
b) BTRFS_FS_STATE_ERROR is set
However, in some rare cases, resources may be left alone after transaction
gets aborted and umount may run into some ASSERT(), e.g.
ASSERT(list_empty(&block_group->dirty_list));
For case a), in btrfs_commit_transaciton(), there're several places at the
beginning where we just call btrfs_end_transaction() without cleaning up
resources. For case b), it is possible that the trans handle doesn't have
any dirty stuff, then only trans hanlde is marked as aborted while
BTRFS_FS_STATE_ERROR is not set, so resources remain in memory.
This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that
all resources won't stay in memory after umount.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.
[willy@infradead.org: fix nds32, fs/dax.c]
Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_chunk_alloc implements a loop checking whether there is a pending
chunk allocation and if so causes the caller do loop. Generally this
loop is executed only once, however testing with btrfs/072 on a single
core vm machines uncovered an extreme case where the system could loop
indefinitely. This is due to a missing cond_resched when loop which
doesn't give a chance to the previous chunk allocator finish its job.
The fix is to simply add the missing cond_resched.
Fixes: 6d74119f1a ("Btrfs: avoid taking the chunk_mutex in do_chunk_alloc")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If errors were returned by btrfs_next_leaf(), replay_dir_deletes needs
to bail out, otherwise @ret would be forced to be 0 after 'break;' and
the caller won't be aware of it.
Fixes: e02119d5a7 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
0, 1 and <0 can be returned by btrfs_next_leaf(), and when <0 is
returned, path->nodes[0] could be NULL, log_dir_items lacks such a
check for <0 and we may run into a null pointer dereference panic.
Fixes: e02119d5a7 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrDg0UACgkQxWXV+ddt
WDvtOQ//QCk0zH2EcPQnqOW6HqAfkkDc7D9P51sK1izNM3vBEYtbuPlY6wp3xnJr
0hbPjGNU7vMC/4SIwSgEdXyAvlpOr1gm9n+w1GcxpkjLa8l4P+2wt9OX0BSzRUMu
X7LQxqg2zmQibFy4b1MSmDGsO2dxB2eqvVUT/Ir4b56uqkdtValYRWY75APJIZ5l
6w0Ja3HVvgOX3pVwSmadCpfMEonN4JE+mfHaP8RajAlTGQcUPq8If9w4BtEoWQRl
QC7kUCCTmp+isnzH7u4EqEQC6XUqEqeuQH+Bli1pNYTvipHY+9EO1dSxHZoCelgk
M9PpQz8x+N6ZMcMNtJQVifkfN6tAp/acdWBTZtlpqB8nZR4v5bBndS/5TNBMu3/v
JfhMEsIUz5o2mWz9qUneyK80RoTRkL5SfdgOclx2Yd7K1fKuJsUjmdXaT08BzotS
5cxTFZu7EcFJyg4eHemjdyRYr3cUkS19P2uIJle9nj3DAMpCvIyL1c1vn5eB/7MN
3JeRME6AOQcD7sFgVNuYhGVdIBuwHU6kj4mf1WN27YKGdbsaMZsFz7/HH3SsBLOF
E0p7Q25HcHSrAimcmLTifN+gol8Y5m6dT0Pjuf9y9QbWvHwh7oRq7DVQ3L8hJkGz
j64b0jlv43P0QorWvsX6/VJBevWedhe1hZtrBhPFSE0CtJ3Ml4Q=
=wkSk
-----END PGP SIGNATURE-----
Merge tag 'for-4.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are a several user visible changes, the rest is mostly invisible
and continues to clean up the whole code base.
User visible changes:
- new mount option nossd_spread (pair for ssd_spread)
- mount option subvolid will detect junk after the number and fail
the mount
- add message after cancelled device replace
- direct module dependency on libcrc32, removed own crc wrappers
- removed user space transaction ioctls
- use lighter locking when reading /proc/self/mounts, RCU instead of
mutex to avoid unnecessary contention
Enhancements:
- skip writeback of last page when truncating file to same size
- send: do not issue unnecessary truncate operations
- mount option token specifiers: use %u for unsigned values, more
validation
- selftests: more tree block validations
qgroups:
- preparatory work for splitting reservation types for data and
metadata, this should allow for more accurate tracking and fix some
issues with underflows or do further enhancements
- split metadata reservations for started and joined transaction so
they do not get mixed up and are accounted correctly at commit time
- with the above, it's possible to revert patch that potentially
deadlocks when trying to make more space by explicitly committing
when the quota limit is hit
- fix root item corruption when multiple same source snapshots are
created with quota enabled
RAID56:
- make sure target is identical to source when raid56 rebuild fails
after dev-replace
- faster rebuild during scrub, batch by stripes and not
block-by-block
- make more use of cached data when rebuilding from a missing device
Fixes:
- null pointer deref when device replace target is missing
- fix fsync after hole punching when using no-holes feature
- fix lockdep splat when allocating percpu data with wrong GFP flags
Cleanups, refactoring, core changes:
- drop redunant parameters from various functions
- kill and opencode trivial helpers
- __cold/__exit function annotations
- dead code removal
- continued audit and documentation of memory barriers
- error handling: handle removal from uuid tree
- error handling: remove handling of impossible condtitons
- more debugging or error messages
- updated tracepoints
- one VLA use removal (and one still left)"
* tag 'for-4.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (164 commits)
btrfs: lift errors from add_extent_changeset to the callers
Btrfs: print error messages when failing to read trees
btrfs: user proper type for btrfs_mask_flags flags
btrfs: split dev-replace locking helpers for read and write
btrfs: remove stale comments about fs_mutex
btrfs: use RCU in btrfs_show_devname for device list traversal
btrfs: update barrier in should_cow_block
btrfs: use lockdep_assert_held for mutexes
btrfs: use lockdep_assert_held for spinlocks
btrfs: Validate child tree block's level and first key
btrfs: tests/qgroup: Fix wrong tree backref level
Btrfs: fix copy_items() return value when logging an inode
Btrfs: fix fsync after hole punching when using no-holes feature
btrfs: use helper to set ulist aux from a qgroup
Revert "btrfs: qgroups: Retry after commit on getting EDQUOT"
btrfs: qgroup: Update trace events for metadata reservation
btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space
btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item
btrfs: qgroup: Use separate meta reservation type for delalloc
btrfs: qgroup: Introduce function to convert META_PREALLOC into META_PERTRANS
...
Pull wait_var_event updates from Ingo Molnar:
"This introduces the new wait_var_event() API, which is a more flexible
waiting primitive than wait_on_atomic_t().
All wait_on_atomic_t() users are migrated over to the new API and
wait_on_atomic_t() is removed. The migration fixes one bug and should
result in no functional changes for the other usecases"
* 'sched-wait-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/wait: Improve __var_waitqueue() code generation
sched/wait: Remove the wait_on_atomic_t() API
sched/wait, arch/mips: Fix and convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, fs/ocfs2: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, fs/nfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, fs/fscache: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, fs/btrfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, fs/afs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, drivers/media: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, drivers/drm: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait: Introduce wait_var_event()
The missing error handling in add_extent_changeset was hidden, so make
it at least visible in the callers.
Signed-off-by: David Sterba <dsterba@suse.com>
When mount fails to read trees like fs tree, checksum tree, extent
tree, etc, there is not enough information about where went wrong.
With this, messages like
"BTRFS warning (device sdf): failed to read root (objectid=7): -5"
would help us a bit.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All users pass a local unsigned int and not the __uXX types that are
supposed to be used for userspace interfaces.
Signed-off-by: David Sterba <dsterba@suse.com>
The current calls are unclear in what way btrfs_dev_replace_lock takes
the locks, so drop the argument, split the helpers and use similar
naming as for read and write locks.
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_mutex has been killed in 2008, a213501153 ("Btrfs: Replace
the big fs_mutex with a collection of other locks"), still remembered in
some comments.
We don't have any extra needs for locking in the ACL handlers.
Signed-off-by: David Sterba <dsterba@suse.com>
The show_devname callback is used to print device name in
/proc/self/mounts, we need to traverse the device list consistently and
read the name that's copied to a seq buffer so we don't need further
locking.
If the first device is being deleted at the same time, the RCU will
allow us to read the device name, though it will become stale right
after the RCU protection ends. This is unavoidable and the user can
expect that the device will disappear from the filesystem's list at some
point.
The device_list_mutex was pretty heavy as it is used eg. for writing
superblock and a few other IO related contexts. This can stall any
application that reads the proc file for no reason.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Once there was a simple int force_cow that was used with the plain
barriers, and then converted to a bit, so we should use the appropriate
barrier helper.
Other variables in the complex if condition do not depend on a barrier,
so we should be fine in case the atomic barrier becomes a no-op.
Signed-off-by: David Sterba <dsterba@suse.com>
We have several reports about node pointer points to incorrect child
tree blocks, which could have even wrong owner and level but still with
valid generation and checksum.
Although btrfs check could handle it and print error message like:
leaf parent key incorrect 60670574592
Kernel doesn't have enough check on this type of corruption correctly.
At least add such check to read_tree_block() and btrfs_read_buffer(),
where we need two new parameters @level and @first_key to verify the
child tree block.
The new @level check is mandatory and all call sites are already
modified to extract expected level from its call chain.
While @first_key is optional, the following call sites are skipping such
check:
1) Root node/leaf
As ROOT_ITEM doesn't contain the first key, skip @first_key check.
2) Direct backref
Only parent bytenr and level is known and we need to resolve the key
all by ourselves, skip @first_key check.
Another note of this verification is, it needs extra info from nodeptr
or ROOT_ITEM, so it can't fit into current tree-checker framework, which
is limited to node/leaf boundary.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent tree of the test fs is like the following:
BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919
item 0 key (4096 168 4096) itemoff 3944 itemsize 51
extent refs 1 gen 1 flags 2
tree block key (68719476736 0 0) level 1
^^^^^^^
ref#0: tree block backref root 5
And it's using an empty tree for fs tree, so there is no way that its
level can be 1.
For REAL (created by mkfs) fs tree backref with no skinny metadata, the
result should look like:
item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51
refs 1 gen 4 flags TREE_BLOCK
tree block key (256 INODE_ITEM 0) level 0
^^^^^^^
tree block backref root 5
Fix the level to 0, so it won't break later tree level checker.
Fixes: faa2dbf004 ("Btrfs: add sanity tests for new qgroup accounting code")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode, at tree-log.c:copy_items(), if we call
btrfs_next_leaf() at the loop which checks for the need to log holes, we
need to make sure copy_items() returns the value 1 to its caller and
not 0 (on success). This is because the path the caller passed was
released and is now different from what is was before, and the caller
expects a return value of 0 to mean both success and that the path
has not changed, while a return value of 1 means both success and
signals the caller that it can not reuse the path, it has to perform
another tree search.
Even though this is a case that should not be triggered on normal
circumstances or very rare at least, its consequences can be very
unpredictable (especially when replaying a log tree).
Fixes: 16e7549f04 ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we have the no-holes mode enabled and fsync a file after punching a
hole in it, we can end up not logging the whole hole range in the log tree.
This happens if the file has extent items that span more than one leaf and
we punch a hole that covers a range that starts in a leaf but does not go
beyond the offset of the first extent in the next leaf.
Example:
$ mkfs.btrfs -f -O no-holes -n 65536 /dev/sdb
$ mount /dev/sdb /mnt
$ for ((i = 0; i <= 831; i++)); do
offset=$((i * 2 * 256 * 1024))
xfs_io -f -c "pwrite -S 0xab -b 256K $offset 256K" \
/mnt/foobar >/dev/null
done
$ sync
# We now have 2 leafs in our filesystem fs tree, the first leaf has an
# item corresponding the extent at file offset 216530944 and the second
# leaf has a first item corresponding to the extent at offset 217055232.
# Now we punch a hole that partially covers the range of the extent at
# offset 216530944 but does go beyond the offset 217055232.
$ xfs_io -c "fpunch $((216530944 + 128 * 1024 - 4000)) 256K" /mnt/foobar
$ xfs_io -c "fsync" /mnt/foobar
<power fail>
# mount to replay the log
$ mount /dev/sdb /mnt
# Before this patch, only the subrange [216658016, 216662016[ (length of
# 4000 bytes) was logged, leaving an incorrect file layout after log
# replay.
Fix this by checking if there is a hole between the last extent item that
we processed and the first extent item in the next leaf, and if there is
one, log an explicit hole extent item.
Fixes: 16e7549f04 ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a nice helper to do proper casting of a qgroup to a ulist aux
value. And several places that could make use of it.
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit 48a89bc4f2.
The idea to commit transaction and free some space after hitting qgroup
limit is good, although the problem is it can easily cause deadlocks.
One deadlock example is caused by trying to flush data while still
holding it:
Call Trace:
__schedule+0x49d/0x10f0
schedule+0xc6/0x290
schedule_timeout+0x187/0x1c0
wait_for_completion+0x204/0x3a0
btrfs_wait_ordered_extents+0xa40/0xaf0 [btrfs]
qgroup_reserve+0x913/0xa10 [btrfs]
btrfs_qgroup_reserve_data+0x3ef/0x580 [btrfs]
btrfs_check_data_free_space+0x96/0xd0 [btrfs]
__btrfs_buffered_write+0x3ac/0xd40 [btrfs]
btrfs_file_write_iter+0x62a/0xba0 [btrfs]
__vfs_write+0x320/0x430
vfs_write+0x107/0x270
SyS_write+0xbf/0x150
do_syscall_64+0x1b0/0x3d0
entry_SYSCALL64_slow_path+0x25/0x25
Another can be caused by trying to commit one transaction while nesting
with trans handle held by ourselves:
btrfs_start_transaction()
|- btrfs_qgroup_reserve_meta_pertrans()
|- qgroup_reserve()
|- btrfs_join_transaction()
|- btrfs_commit_transaction()
The retry is causing more problems than exppected when limit is enabled.
At least a graceful EDQUOT is way better than deadlock.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now trace_qgroup_meta_reserve() will have extra type parameter.
And introduce two new trace events:
1) trace_qgroup_meta_free_all_pertrans()
For btrfs_qgroup_free_meta_all_pertrans()
2) trace_qgroup_meta_convert()
For btrfs_qgroup_convert_reserved_meta()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For quota disabled->enable case, it's possible that at reservation time
quota was not enabled so no bytes were really reserved, while at release
time, quota was enabled so we will try to release some bytes we didn't
really own.
Such situation can cause metadata reserveation underflow, for both types,
also less possible for per-trans type since quota enable will commit
transaction.
To address this, record qgroup meta reserved bytes into
root::qgroup_meta_rsv_pertrans and ::prealloc.
So at releasing time we won't free any bytes we didn't reserve.
For DATA, it's already handled by io_tree, so nothing needs to be done
there.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Quite similar for delalloc, some modification to delayed-inode and
delayed-item reservation. Also needs extra parameter for release case
to distinguish normal release and error release.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before this patch, btrfs qgroup is mixing per-transcation meta rsv with
preallocated meta rsv, making it quite easy to underflow qgroup meta
reservation.
Since we have the new qgroup meta rsv types, apply it to delalloc
reservation.
Now for delalloc, most of its reserved space will use META_PREALLOC qgroup
rsv type.
And for callers reducing outstanding extent like btrfs_finish_ordered_io(),
they will convert corresponding META_PREALLOC reservation to
META_PERTRANS.
This is mainly due to the fact that current qgroup numbers will only be
updated in btrfs_commit_transaction(), that's to say if we don't keep
such placeholder reservation, we can exceed qgroup limitation.
And for callers freeing outstanding extent in error handler, we will
just free META_PREALLOC bytes.
This behavior makes callers of btrfs_qgroup_release_meta() or
btrfs_qgroup_convert_meta() to be aware of which type they are.
So in this patch, btrfs_delalloc_release_metadata() and its callers get
an extra parameter to info qgroup to do correct meta convert/release.
The good news is, even we use the wrong type (convert or free), it won't
cause obvious bug, as prealloc type is always in good shape, and the
type only affects how per-trans meta is increased or not.
So the worst case will be at most metadata limitation can be sometimes
exceeded (no convert at all) or metadata limitation is reached too soon
(no free at all).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For meta_prealloc reservation users, after btrfs_join_transaction()
caller will modify tree so part (or even all) meta_prealloc reservation
should be converted to meta_pertrans until transaction commit time.
This patch introduces a new function,
btrfs_qgroup_convert_reserved_meta() to do this for META_PREALLOC
reservation user.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since qgroup has seperate metadata reservation types now, we can
completely get rid of the old root->qgroup_meta_rsv, which mostly acts
as current META_PERTRANS reservation type.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs uses 2 different methods to reseve metadata qgroup space.
1) Reserve at btrfs_start_transaction() time
This is quite straightforward, caller will use the trans handler
allocated to modify b-trees.
In this case, reserved metadata should be kept until qgroup numbers
are updated.
2) Reserve by using block_rsv first, and later btrfs_join_transaction()
This is more complicated, caller will reserve space using block_rsv
first, and then later call btrfs_join_transaction() to get a trans
handle.
In this case, before we modify trees, the reserved space can be
modified on demand, and after btrfs_join_transaction(), such reserved
space should also be kept until qgroup numbers are updated.
Since these two types behave differently, split the original "META"
reservation type into 2 sub-types:
META_PERTRANS:
For above case 1)
META_PREALLOC:
For reservations that happened before btrfs_join_transaction() of
case 2)
NOTE: This patch will only convert existing qgroup meta reservation
callers according to its situation, not ensuring all callers are at
correct timing.
Such fix will be added in later patches.
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
When modifying qgroup relationship, for qgroup which only owns exclusive
extents, we will go through quick update path.
In this path, we will add/subtract exclusive and reference number for
parent qgroup, since the source (child) qgroup only has exclusive
extents, destination (parent) qgroup will also own or lose those extents
exclusively.
The same should be the same for reservation, since later reservation
adding/releasing will also affect parent qgroup, without the reservation
carried from child, parent will underflow reservation or have dead
reservation which will never be freed.
However original code doesn't do the same thing for reservation.
It handles qgroup reservation quite differently:
It removes qgroup reservation, as it's allocating space from the
reserved qgroup for relationship adding.
But does nothing for qgroup reservation if we're removing a qgroup
relationship.
According to the original code, it looks just like because we're adding
qgroup->rfer, the code assumes we're writing new data, so it's follows
the normal write routine, by reducing qgroup->reserved and adding
qgroup->rfer/excl.
This old behavior is wrong, and should be fixed to follow the same
excl/rfer behavior.
Just fix it by using the correct behavior described above.
Fixes: 31193213f1 ("Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since most callers of qgroup_reserve() are already defined by type,
converting qgroup_reserve() is quite an easy work.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce helpers to:
1) Get total reserved space
For limit calculation
2) Add/release reserved space for given type
With underflow detection and warning
3) Add/release reserved space according to child qgroup
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of single qgroup->reserved, use a new structure btrfs_qgroup_rsv
to store different types of reservation.
This patch only updates the header needed to compile.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_orphan_add() has had this case commented out since it was first
introduced in commit d68fc57b7e ("Btrfs: Metadata reservation for
orphan inodes"). Most of the orphan cleanup code has been rewritten
since then, so it's safe to say that this code isn't needed.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ switch to bool ]
Signed-off-by: David Sterba <dsterba@suse.com>
Any time the first block group of a new type is created, we add a new
kobject to sysfs to hold the attributes for that type. Kobject-internal
allocations always use GFP_KERNEL, making them prone to fs-reclaim races.
While it appears as if this can occur any time a block group is created,
the only times the first block group of a new type can be created in
memory is at mount and when we create the first new block group during
raid conversion.
This patch adds a new list to track pending kobject additions and then
handles them after we do chunk relocation. Between relocating the
target chunk (or forcing allocation of a new chunk in the case of data)
and removing the old chunk, we're in a safe place for fs-reclaim to
occur. We're holding the volume mutex, which is already held across
page faults, and the delete_unused_bgs_mutex, which will only stall
the cleaner thread.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 2be12ef79 (btrfs: Separate space_info create/update), we've
separated out the creation and updating of the space info structures.
That commit was a straightforward refactoring of the two parts of
update_space_info, but we can go a step further. Since commits
c59021f84 (Btrfs: fix OOPS of empty filesystem after balance) and
b742bb82f (Btrfs: Link block groups of different raid types), we know
that the space_info structures will be created at mount and there will
only ever be, at most, three of them.
This patch cleans out the create_space_info calls after __find_space_info
returns NULL since __find_space_info *can't* return NULL.
The initial cause for reviewing this was the kobject_add calls from
create_space_info occuring in sites where fs-reclaim wasn't allowed. Now
we are certain they occur only early in the mount process and are safe.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rebuild on missing device is as same as recover, after it's done, rbio
has data which is consistent with on-disk data, so it can be cached to
avoid further reads.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Drop optimal argument from the function find_live_mirror() as we can
deduce it in the function itself. Also rename optimal to
preferred_mirror.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Obtain the stripes info from the map directly and so no need
to pass it as an argument.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Added by 08e007d2e5 ("Btrfs: improve the noflush reservation") and
made redundant by 17024ad0a0 ("Btrfs: fix early ENOSPC due to
delalloc").
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Added by b4570aa994 ("btrfs: fix compiling with CONFIG_BTRFS_DEBUG
enabled.") and obsoleted by 2ff7e61e0d ("btrfs: take an fs_info
directly when the root is not used otherwise").
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduced by 5cdc7ad337 ("btrfs: Replace fs_info->workers with
btrfs_workqueue.") but obsoleted by 2a4581983f ("btrfs: factor
btrfs_init_workqueues() out of open_ctree()").
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Added in 38c227d87c ("Btrfs: snapshot-aware defrag") but subsequently
made redundant by 0b246afa62 ("btrfs: root->fs_info cleanup, add
fs_info convenience variables").
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Added in b5d67f64f9 ("Btrfs: change scrub to support big blocks") but
rendered redundant by be50a8ddaa ("Btrfs: Simplify
scrub_setup_recheck_block()'s argument").
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Added as part of 86d5f99442 ("btrfs: convert prelimary reference
tracking to use rbtrees") but never used. tmp_op_key essentially
subsumed that variable.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When checking the minimal nr_devs, there is one dead and meaningless
condition:
if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) {
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This condition is meaningless, @devs_increment has nothing to do with
@sub_stripes.
In fact, in btrfs_raid_array[], profile with sub_stripes larger than 1
(RAID10) already has the @devs_increment set to 2.
So no need to multiple it by @sub_stripes.
And above condition is also dead.
For RAID10, @devs_increment * @sub_stripes equals 4, which is also the
@devs_min of RAID10.
For other profiles, @sub_stripes is always 1, and since @ndevs is
rounded down to @devs_increment, the condition will always be true.
Remove the meaningless condition to make later reader wander less.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is the entry to the extent allocator and as such has
quite a number of parameters. Some of those have subtle effects on the
allocation algorithm. Document the parameters.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As with every function which deals with modifying the btree
btrfs_uuid_tree_rem can fail for any number of reasons (ie. EIO/ENOMEM).
Handle return error value from this function gracefully by aborting the
transaction.
Fixes: dd5f9615fc ("Btrfs: maintain subvolume items in the UUID tree")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The kernel would like to have all stack VLA usage removed[1].
Unfortunately using an integer constant variable as the size of an
array is still considered a VLA. Instead let's use directly sizeof(var)
which removes the VLA usage. Use the occasion to remove csum_size
altogether and use sizeof() also for the size passed to memcmp
[1]: https://lkml.org/lkml/2018/3/7/621
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The callbacks make use of different parameters that are passed to the
other type unnecessarily. This patch adds separate types for each and
the unused parameters will be removed.
The type extent_submit_bio_hook_t keeps all parameters and can be used
where the start/done types are not appropriate.
Signed-off-by: David Sterba <dsterba@suse.com>
A useless wrapper around tree_mod_log_insert_root that hides missing
error handling. Move it to the callers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A trivial wrapper that can be simply opencoded and makes the GFP
allocation request more visible. The error handling is now moved to the
callers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The wrapper is effectively an alias for tree_mod_log_insert_move but
also hides the missing error handling. To make that more visible, lift
the BUG_ON to the callers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The wrappers are trivial and do not bring any extra value on top of the
plain locking primitives.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The merge call was factored out to a separate helper but it's a trivial
one and arguably we can opencode it and cache the value.
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass a valid pointer so we can drop the redundant checks.
The call to submit_one_bio never happend and can be removed.
Signed-off-by: David Sterba <dsterba@suse.com>
In case of raid56, writes and rebuilds always take BTRFS_STRIPE_LEN(64K)
as unit, however, scrub_extent() sets blocksize as unit, so rebuild
process may be triggered on every block on a same stripe.
A typical example would be that when we're replacing a disappeared disk,
all reads on the disks get -EIO, every block (size is 4K if blocksize is
4K) would go thru these,
scrub_handle_errored_block
scrub_recheck_block # re-read pages one by one
scrub_recheck_block # rebuild by calling raid56_parity_recover()
page by page
Although with raid56 stripe cache most of reads during rebuild can be
avoided, the parity recover calculation(xor or raid6 algorithms) needs to
be done $(BTRFS_STRIPE_LEN / blocksize) times.
This makes it smarter by doing raid56 scrub/replace on stripe length.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sort mount options by the primary name, followed by the 'no-'
counterpart if it exists. Group the deprecated and debugging options.
Enum and token defintions are synced.
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs has two mount options for SSD optimizations: ssd and ssd_spread.
Presently there is an option to disable all SSD optimizations, but there
isn't an option to disable just ssd_spread.
This patch adds a mount option nossd_spread that disables ssd_spread
only.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since userspace transaction have been removed we no longer have use
for this field so delete it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the userspace transaction ioctls have been removed,
TRANS_USERSPACE is no longer used hence we can remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the userspace transaction IOCTL have been removed, this member
is no longer used so just remove it
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 3558d4f88e ("btrfs: Deprecate userspace transaction ioctls")
marked the beginning of the end of userspace transaction. This commit
finishes the job! There are no known users and ceph does not use the
ioctl anymore.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Sage Weil <sage@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When multiple pending snapshots referring to the same source subvolume
are executed, enabled quota will cause root item corruption, where root
items are using old bytenr (no backref in extent tree).
This can be triggered by fstests btrfs/152.
The cause is when source subvolume is still dirty, extra commit
(simplied transaction commit) of qgroup_account_snapshot() can skip
dirty roots not recorded in current transaction, making root item of
source subvolume not updated.
Fix it by forcing recording source subvolume in current transaction
before qgroup sub-transaction commit.
Reported-by: Justin Maggard <jmaggard@netgear.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When performing an unlock on an extent buffer we'd like to order the
decrement of extent_buffer::blocking_writers with waking up any
waiters. In such situations it's sufficient to use smp_mb__after_atomic
rather than the heavy smp_mb. On architectures where atomic operations
are fully ordered (such as x86 or s390) unconditionally executing
a heavyweight smp_mb instruction causes a severe hit to performance
while bringin no improvements in terms of correctness.
The better thing is to use the appropriate smp_mb__after_atomic routine
which will do the correct thing (invoke a full smp_mb or in the case
of ordered atomics insert a compiler barrier). Put another way,
an RMW atomic op + smp_load__after_atomic equals, in terms of
semantics, to a full smp_mb. This ensures that none of the problems
described in the accompanying comment of waitqueue_active occur.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some functions can filter metadata by the generation. Add a define that
will annotate such arguments.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The current implementation of btrfs_page_exists_in_range() gives the
wrong answer if the workingset code has stored a shadow entry in the
page cache. The filemap_range_has_page() function does not have this
problem, and it's shared code, so use it instead.
eigned-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the last step of scrub_handle_error_block, we try to combine good
copies on all possible mirrors, this works fine for raid1 and raid10,
but not for raid56 as it's doing parity rebuild.
If parity rebuild doesn't get back with correct data which matches its
checksum, in case of replace we'd rather write what is stored in the
source device than the data calculuated from parity.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
async_missing_raid56() is identical to async_read_rebuild().
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously, btrfs_inode_by_name() returned 0 which left caller to check
objectid of location even location if the type was invalid.
Let btrfs_inode_by_name() return -EUCLEAN if a corrupted location of a
dir entry is found. Removal of label out_err also simplifies the
function.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ drop unlikely ]
Signed-off-by: David Sterba <dsterba@suse.com>
This function btrfs_close_extra_devices() is about freeing
extra devids which once it may have belonged to this filesystem.
So rename it and add the comment. The _devid suffix is
appropriate as this function won't handle devices which are
outside of the filesytem being mounted.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This argument is always set to the root of the inode, which is also
passed. So let's get a reference inside the function and simplify
the arg list.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
According to tlv_put()'s prototype, data and attrlen needs to be
exchanged in the macro, but seems all callers are already aware of
this misorder and are therefore not affected.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that nothing uses the root arg of btrfs_log_dentry_safe it can be
safely removed. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_log_inode_parent is called from 2 places (btrfs_log_dentry_safe
and btrfs_log_new_name) both of which pass inode->root as the root
argument and the inode itself. Remove the redundant root argument and
get a reference to the root directly from the inode, also remove
redundant root != inode->root check from the same function. No
functional change.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always sets keep_locks to 1 and saves the old value of
keep_locks which is restored at the end. So there is no way it can be
called without keep_locks being set. Remove comment imposing redundant
requirement on callers.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The xattr_handler::get prototype returns int, use it. The only ssize_t
exception is the per-inode listxattr handler.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Extern for functions does not make any difference, there are only a few
so let's remove them before it's too late.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When send finishes processing an inode representing a regular file, it
always issues a truncate operation for that file, even if its size did
not change or the last write sets the file size correctly. In the most
common cases, the issued write operations set the file to correct size
(either full or incremental sends) or the file size did not change (for
incremental sends), so the only case where a truncate operation is needed
is when a file size becomes smaller in the send snapshot when compared
to the parent snapshot.
By not issuing unnecessary truncate operations we reduce the stream size
and save time in the receiver. Currently truncating a file to the same
size triggers writeback of its last page (if it's dirty) and waits for it
to complete (only if the file size is not aligned with the filesystem's
sector size). This is being fixed by another patch and is independent of
this change (that patch's title is "Btrfs: skip writeback of last page
when truncating file to same size").
The following script was used to measure time spent by a receiver without
this change applied, with this change applied, and without this change and
with the truncate fix applied (the fix to not make it start and wait for
writeback to complete).
$ cat test_send.sh
#!/bin/bash
SRC_DEV=/dev/sdc
DST_DEV=/dev/sdd
SRC_MNT=/mnt/sdc
DST_MNT=/mnt/sdd
mkfs.btrfs -f $SRC_DEV >/dev/null
mkfs.btrfs -f $DST_DEV >/dev/null
mount $SRC_DEV $SRC_MNT
mount $DST_DEV $DST_MNT
echo "Creating source filesystem"
for ((t = 0; t < 10; t++)); do
(
for ((i = 1; i <= 20000; i++)); do
xfs_io -f -c "pwrite -S 0xab 0 5000" \
$SRC_MNT/file_$i > /dev/null
done
) &
worker_pids[$t]=$!
done
wait ${worker_pids[@]}
echo "Creating and sending snapshot"
btrfs subvolume snapshot -r $SRC_MNT $SRC_MNT/snap1 >/dev/null
/usr/bin/time -f "send took %e seconds" \
btrfs send -f $SRC_MNT/send_file $SRC_MNT/snap1
/usr/bin/time -f "receive took %e seconds" \
btrfs receive -f $SRC_MNT/send_file $DST_MNT
umount $SRC_MNT
umount $DST_MNT
The results, which are averages for 5 runs for each case, were the
following:
* Without this change
average receive time was 26.49 seconds
standard deviation of 2.53 seconds
* Without this change and with the truncate fix
average receive time was 12.51 seconds
standard deviation of 0.32 seconds
* With this change and without the truncate fix
average receive time was 10.02 seconds
standard deviation of 1.11 seconds
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we truncate a file to the same size and that size is not aligned
with the sector size, we end up triggering writeback (and wait for it to
complete) of the last page. This is unncessary as we can not have delayed
allocation beyond the inode's i_size and the goal of truncating a file
to its own size is to discard prealloc extents (allocated via the
fallocate(2) system call). Besides the unnecessary IO start and wait, it
also breaks the oppurtunity for larger contiguous extents on disk, as
before the last dirty page there might be other dirty pages.
This scenario is probably not very common in general, however it is
common for btrfs receive implementations because currently the send
stream always issues a truncate operation for each processed inode as
the last operation for that inode (this truncate operation is not
always needed and the send implementation will be addressed to avoid
them).
So improve this by not starting and waiting for writeback of the inode's
last page when we are truncating to exactly the same size.
The following script was used to quickly measure the time a receive
operation takes:
$ cat test_send.sh
#!/bin/bash
SRC_DEV=/dev/sdc
DST_DEV=/dev/sdd
SRC_MNT=/mnt/sdc
DST_MNT=/mnt/sdd
mkfs.btrfs -f $SRC_DEV >/dev/null
mkfs.btrfs -f $DST_DEV >/dev/null
mount $SRC_DEV $SRC_MNT
mount $DST_DEV $DST_MNT
echo "Creating source filesystem"
for ((t = 0; t < 10; t++)); do
(
for ((i = 1; i <= 20000; i++)); do
xfs_io -f -c "pwrite -S 0xab 0 5000" \
$SRC_MNT/file_$i > /dev/null
done
) &
worker_pids[$t]=$!
done
wait ${worker_pids[@]}
echo "Creating and sending snapshot"
btrfs subvolume snapshot -r $SRC_MNT $SRC_MNT/snap1 >/dev/null
/usr/bin/time -f "send took %e seconds" \
btrfs send -f $SRC_MNT/send_file $SRC_MNT/snap1
/usr/bin/time -f "receive took %e seconds" \
btrfs receive -f $SRC_MNT/send_file $DST_MNT
umount $SRC_MNT
umount $DST_MNT
The results for 5 runs were the following:
* Without this change
average receive time was 26.49 seconds
standard deviation of 2.53 seconds
* With this change
average receive time was 12.51 seconds
standard deviation of 0.32 seconds
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It doens't make sense to process prealloc extents as pages will be
filled with zero when reading prealloc extents.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have btrfs_fs_info::data_chunk_allocations and
btrfs_fs_info::metadata_ratio declared as unsigned which would be
unsinged int and kernel style prefers unsigned int over bare unsigned.
So this patch changes them to u32.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using any kind of memory barriers around atomic operations which have
a return value is redundant, since those operations themselves are
fully ordered. atomic_t.txt states:
- RMW operations that have a return value are fully ordered;
Fully ordered primitives are ordered against everything prior and
everything subsequent. Therefore a fully ordered primitive is like
having an smp_mb() before and an smp_mb() after the primitive.
Given this let's replace the extra memory barriers with comments.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the same function we just ran btrfs_alloc_device() which means the
btrfs_device::resized_list is sure to be empty and we are protected
with the btrfs_fs_info::volume_mutex.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The __cold functions are placed to a special section, as they're
expected to be called rarely. This could help i-cache prefetches or help
compiler to decide which branches are more/less likely to be taken
without any other annotations needed.
Though we can't add more __exit annotations, it's still possible to add
__cold (that's also added with __exit). That way the following function
categories are tagged:
- printf wrappers, error messages
- exit helpers
Signed-off-by: David Sterba <dsterba@suse.com>
Recently, the __init annotations have been added. There's unfortunatelly
only one case where we can add __exit, because most of the cleanup
helpers are also called from the __init phase.
As the __exit annotated functions get discarded completely for a
built-in code, we'd miss them from the init phase.
Signed-off-by: David Sterba <dsterba@suse.com>
We aren't verifying the parameter passed to the subvolid mount option,
so we won't report and fail the mount if a junk value is specified for
example, -o subvolid=abc.
This patch verifies the subvolid option with match_u64.
Up to now the memparse function accepts the K/M/G/ suffixes, that are
usually meant for size values and do not make sense for a subvolume it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Fstests generic/475 provides a way to fail metadata reads while
checking if checksum exists for the inode inside run_delalloc_nocow(),
and csum_exist_in_range() interprets error (-EIO) as inode having
checksum and makes its caller enter the cow path.
In case of free space inode, this ends up with a warning in
cow_file_range().
The same problem applies to btrfs_cross_ref_exist() since it may also
read metadata in between.
With this, run_delalloc_nocow() bails out when errors occur at the two
places.
cc: <stable@vger.kernel.org> v2.6.28+
Fixes: 17d217fe97 ("Btrfs: fix nodatasum handling in balancing code")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The custom crc32 init code was introduced in
14a958e678 ("Btrfs: fix btrfs boot when compiled as built-in") to
enable using btrfs as a built-in. However, later as pointed out by
60efa5eb2e ("Btrfs: use late_initcall instead of module_init") this
wasn't enough and finally btrfs was switched to late_initcall which
comes after the generic crc32c implementation is initiliased. The
latter commit superseeded the former. Now that we don't have to
maintain our own code let's just remove it and switch to using the
generic implementation.
Despite touching a lot of files the patch is really simple. Here is the gist of
the changes:
1. Select LIBCRC32C rather than the low-level modules.
2. s/btrfs_crc32c/crc32c/g
3. replace hash.h with linux/crc32c.h
4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly.
I've tested this with btrfs being both a module and a built-in and xfstest
doesn't complain.
Does seem to fix the longstanding problem of not automatically selectiong
the crc32c module when btrfs is used. Possibly there is a workaround in
dracut.
The modinfo confirms that now all the module dependencies are there:
before:
depends: zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
after:
depends: libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add more info to changelog from mails ]
Signed-off-by: David Sterba <dsterba@suse.com>
Function __get_raid_index() is used to convert block group flags into
raid index, which can be used to get various info directly from
btrfs_raid_array[].
Refactor this function a little:
1) Rename to btrfs_bg_flags_to_raid_index()
Double underline prefix is normally for internal functions, while the
function is used by both extent-tree and volumes.
Although the name is a little longer, but it should explain its usage
quite well.
2) Move it to volumes.h and make it static inline
Just several if-else branches, really no need to define it as a normal
function.
This also makes later code re-use between kernel and btrfs-progs
easier.
3) Remove function get_block_group_index()
Really no need to do such a simple thing as an exported function.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inspecting the error message with real corruption, the "root=%llu"
always shows "1" (root tree), instead of the correct owner.
The problem is that we are getting @root from page->mapping->host, which
points the same btree inode, so we will always get the same root.
This makes the root owner output meaningless, and harder to port
tree-checker to btrfs-progs.
So get rid of the false and meaningless @root parameter and replace it
with @fs_info.
To get the owner, we can only rely on btrfs_header_owner() now.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is adding a tracepoint 'btrfs_handle_em_exist' to help debug the
subtle bugs around merge_extent_mapping.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_run_qgroups is doing a bit too much. Not only is it
responsible for synchronizing in-memory state of qgroups to disk but
it also contains code to trigger the initial qgroup rescan when
quota is enabled initially. This condition is detected by checking that
BTRFS_FS_QUOTA_ENABLED is not set and BTRFS_FS_QUOTA_ENABLING is set.
Nothing really requires from the code to be structured (and scattered)
the way it is so let's streamline things. First move the quota rescan
code into btrfs_quota_enable, where its invocation is closer to the
use. This also makes the FS_QUOTA_ENABLING flag redundant so let's
remove it as well.
This has been tested with a full xfstest run with qgroups enabled on
the scratch device of every xfstest and no regressions were observed.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
load_free_space_tree calls either function load_free_space_bitmaps or
load_free_space_extents. And either of those two will lead to call
btrfs_next_item. So in function load_free_space_tree, use READA_FORWARD
to read forward ahead.
This also changes the value from READA_BACK to READA_FORWARD, since
according to the logic, it should reada_for_search forward, not
backward.
Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
populate_free_space_tree calls function btrfs_search_slot_for_read with
parameter int find_higher = 1, it means that, if no exact match is
found, then use the next higher item. So in function
populate_free_space_tree, use READA_FORWARD to read forward ahead.
This also changes the value from READA_BACK to READA_FORWARD, since
according to the logic, it should reada_for_search forward, not
backward.
Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
delayed_iput_count wa supposed to be used to implement, well, delayed
iput. The idea is that we keep accumulating the number of iputs we do
until eventually the inode is deleted. Turns out we never really
switched the delayed_iput_count from 0 to 1, hence all conditional
code relying on the value of that member being different than 0 was
never executed. This, as it turns out, didn't cause any problem due
to the simple fact that the generic inode's i_count member was always
used to count the number of iputs. So let's just remove the unused
member and all unused code. This patch essentially provides no
functional changes. While at it, also add proper documentation for
btrfs_add_delayed_iput
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
The behavior of btrfs_delalloc_reserve_metadata depends on whether
the inode we are allocating for is the freespace inode or not. As it
stands if we are the free node we set 'flush' and 'delalloc_lock'
variable to certain values. Subsequently we check the values of those
vars and act accordingly. Instead, simplify things by having 1 if
which checks whether we are the freespace inode or not and do any
specific operation in either branches of that if. This makes the code
a bit easier to understand, as an added bonus it also shrinks the
compiled size:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-17 (-17)
Function old new delta
btrfs_delalloc_reserve_metadata 1876 1859 -17
Total: Before=85966, After=85949, chg -0.02%
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add opened device to the tail of dev_alloc_list instead of head, so that
it maintains the same order as dev_list.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
By maintaining the device list sorted lets us reproduce the problems
related to missing chunk in the degraded mode much more consistent. So
fix this by sorting the devices by devid within the kernel. So that we
know which device is assigned to the struct fs_info::latest_bdev when
all the devices are having and same SB generation.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
It's not necessary to hold ->orphan_lock when checking inode's runtime
flags.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of manually fiddling with the state of the task
(RUNNING->INTERRUPTIBLE->RUNNING) again just use schedule_timeout_interruptible
which adjusts the task state as needed. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Even though btrfs_start_dirty_block_groups is fairly in the beginning of
btrfs_commit_transaction outside of the critical section defined by the
transaction states it can only be run by a single comitter. In other
words it defines its own critical section thanks to the
BTRFS_TRANS_DIRTY_BG run flag and ro_block_group_mutex. However, its
error handling is outside of this critical section which is a bit
counter-intuitive. So move the error handling righ after the function is
executed and let the sole runner of dirty block groups handle the return
value. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
bio_add_page() can fail for logical reasons as from the bio_add_page()
comments:
/*
* This will only fail if either bio->bi_vcnt == bio->bi_max_vecs or
* it's a cloned bio.
*/
Here we have just allocated the bio, so both of those failures can't
occur. So drop the check. We can also drop the error stats for write
error.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Enospc_debug makes extent allocator print more debug messages,
however for chunk allocation, there is no debug message for enospc_debug
at all.
This patch will add message for the following parts of chunk allocator:
1) No rw device at all
Quite rare, but at least output one message for this case.
2) Not enough space for some device
This debug message is quite handy for unbalanced disks with stripe
based profiles (RAID0/10/5/6).
3) Not enough free devices
This debug message should tell us if current chunk allocator is
working correctly under minimal device requirements.
Although in most cases, we will hit other ENOSPC before we even hit a
chunk allocator ENOSPC, but such debug info won't help.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use ASSERT to report logical error in cow_file_range(), also move it a
bit closer to when the num_bytes is derived.
The extent start could be (u64)-1 in some cases, the assert should catch
that we do not accidentally pass it to cow_file_range.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch deletes local variable disk_num_bytes as its value
is same as num_bytes in the function cow_file_range().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit [1] removed the need to use btrfs_async_submit_limit(), so
delete it.
[1]
commit 736cd52e0c
Btrfs: remove nr_async_submits and async_submit_draining
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by btrfs at all.
So, remove the unused hardirq.h.
Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently when enospc_debug mount option is turned on we do not print
any debug info in case metadata reservation failures happen. Fix this
by adding the necessary hook in reserve_metadata_bytes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The reason why io_bgs can be modified without holding any lock is
non-obvious. Document it and reference that documentation from the
respective call sites.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
list_first_entry is essentially a wrapper over cotnainer_of. The latter
can never return null even if it's working on inconsistent list since it
will either crash or return some offset in the wrong struct.
Additionally, for the dirty_bgs list the iteration is done under
dirty_bgs_lock which ensures consistency of the list.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For debugging or administration purposes, we would want to know if and
when the user cancels the replace, to complement the existing messages
when dev-replace starts or finishes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, fold fix for RCU warning from Nikolay ]
Signed-off-by: David Sterba <dsterba@suse.com>
The replace target device can be missing when mounted with -o degraded,
but we wont allocate a missing btrfs_device to it. So check the device
before accessing.
BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
IP: btrfs_destroy_dev_replace_tgtdev+0x43/0xf0 [btrfs]
Call Trace:
btrfs_dev_replace_cancel+0x15f/0x180 [btrfs]
btrfs_ioctl+0x2216/0x2590 [btrfs]
do_vfs_ioctl+0x625/0x650
SyS_ioctl+0x4e/0x80
do_syscall_64+0x5d/0x160
entry_SYSCALL64_slow_path+0x25/0x25
This patch has been moved in front of patch "btrfs: log, when replace,
is canceled by the user" that could reproduce the crash if the system
reboots inside btrfs_dev_replace_start before the
btrfs_dev_replace_finishing call.
$ mkfs /dev/sda
$ mount /dev/sda mnt
$ btrfs replace start /dev/sda /dev/sdb
<insert reboot>
$ mount po degraded /dev/sdb mnt
<crash>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ added reproducer description from mail ]
Signed-off-by: David Sterba <dsterba@suse.com>
The options alloc_start and subvolrootid are deprecated, comment them in
the tokens list. And leave them as it is. No functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As the commit mount option is unsigned so manage it as %u for token
verifications, instead of %d.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As check_int_print_mask mount option is unsigned so manage it as %u for
token verifications, instead of %d.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As metadata_ratio mount option is unsinged so manage it as %u for token
verifications, instead of %d.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The mount option thread_pool is always unsigned. Manage it that way all
around.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_buffer_uptodate() is a trivial wrapper around test_bit() and
nothing else. So make it static and inline, save on code space and call
indirection.
Before:
text data bss dec hex filename
1131257 82898 18992 1233147 12d0fb fs/btrfs/btrfs.ko
After:
text data bss dec hex filename
1131090 82898 18992 1232980 12d054 fs/btrfs/btrfs.ko
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already pass btrfs_trans_handle which contains a reference to the
fs_info so use that. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already pass the btrfs_transaction which references fs_info so no
need to pass the later as an argument. Also use the opportunity to
shorten transaction->trans. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already pass the trans handle which has a reference to fs_info to
create_pending_snapshot so we can refer to it directly. Doing this
obviates the need to pass the fs_info to create_pending_snapshots as
well. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already have the fs_info from the passed transaction so use it
directly. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only thing the passed root is used for is:
1. get a reference to the fs_info and to
2. call trace_btrfs_transaction_commit.
We can achieve 1) by simply referring to the fs_info from passed trans
object. As far as 2) is concerned cleanup_transaction is called from
only one place and the 'root' argument passed is the one from the trans
handle. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already pass a transaction handle which refrences the fs_info so
we can grab it from there. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already pass the transaction handle which has a reference to the
fs_info. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already pass the transaction which has a reference to the fs_info,
so use that. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already pass the transaction handle, which contains a refrence to
the fs_info so grab it from there. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaction so no point in passing
it as a function argument. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It can be referenced from the passed transaciton so no point in
passing it as function argument. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All current callers of this function just get a reference to the
trans->fs_info member and pass it as the second argument. Collapse this
into the function itself. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_write_and_wait_transaction is essentially a wrapper of
btrfs_write_and_wait_marked_extents with the addition of calling
clear_btree_io_tree. Having the code split doesn't really bring any
benefit. Open code the later into the former and add proper
documentation header.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
This function is only ever used in __btrfs_end_transaction and
btrfs_commit_transaction so there is no need to export it via header.
Let's move it closer to where it's used, make it static and remove it
from the header. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_init_dev_replace_tgtdev_for_resume() initializes replace
target device in a few simple steps, so do it at the parent function.
Moreover, there isn't any other caller so just open code it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current u64 return from btrfs_dev_replace_cancel() was probably done
to match the btrfs_ioctl_dev_replace_args::result. However as our
actual return value fits in int, and it further gets typecast to u64,
so just return int.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove __ which is for the special functions.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_dev_replace_cancel() calls __btrfs_dev_replace_cancel() for the
actual cancel so just open code it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the function uses a hardcoded value for the checksum size of
a sector. This is fine, given that we currently support only a single
algorithm, whose checksum is 4 bytes == sizeof(u32). Despite not
having other algorithms, btrfs' design supports using a different
algorithm whith different space requirements. To future-proof the code
query the size of the currently used algorithm from the in-memory copy
of the super block. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a missing void parameter to function btrfs_test_extent_map, fixes
sparse warning:
warning: non-ANSI function declaration of function 'btrfs_test_extent_map'
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The check for a non-zero ret is redundant as the goto will jump to
the very next statement anyway. Remove this extraneous code.
Detected by CoverityScan, CID#1463784 ("Identical code for different
branches")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 0e8c36a9fd ("Btrfs: fix lots of orphan inodes when the space
is not enough") changed the way transaction reservation is made in
btrfs_evict_node and as a result this function became unused. This has
been the status quo for 5 years in which time no one noticed, so I'd
say it's safe to assume it's unlikely it will ever be used again.
Historical note: there were more attempts to remove the function, the
reasoning was missing and only based on some static analysis tool
reports. Other reason for rejection was that there seemed to be
connection to BTRFS_RESERVE_FLUSH_LIMIT and that would need to be
removeed to. This was not correct so removing the function is all we can
do.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ add the note ]
Signed-off-by: David Sterba <dsterba@suse.com>
Presently, failing a primary super block write but succeeding in at
least one super block write in general will appear to users as if
nothing important went wrong. However, upon unmounting and re-mounting,
the file system will be in a rolled back state. This was discovered
with a BCC program that uses bpf_override_return() to fail super block
writes.
This patch outputs an error clarifying that the primary super block
write has failed, so users can expect potentially erroneous behaviour.
It also forces wait_dev_supers() to return an error to its caller if
the primary super block write fails.
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This prints out eb->bflags since it contains some useful information,
e.g. whether eb is dirty.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old wait_on_atomic_t() is going to get removed, use the more
flexible wait_var_event() API instead.
No change in functionality.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlqrzY4ACgkQxWXV+ddt
WDtV4BAAiiv5XECNB3lIvcoFjst8E7xIJNeKKzFZh8HNm+L94zP00uqrpHkwQSUv
tk9TGSYgbJJZMhw6I1rwYSKsoTkvP6NVMQZjpkJktWto9NXObKPvJzjKMdB+GYQR
TWGLBaHsMhruMTOewhdVsxA5od5+LravygHaLmTDQmho8jZSSEe0qqpqlakRI9sQ
+VZv1PDEI68IDMKMjOT7de7Ssq9c99BpGNXxvi5fy5kijyfgrs/BUfHBNV5r66MS
y0Yw1Ab+nlLC7cE5Ql24/snDojcLoSl7f5eYt7ib+DAmsJucYQzTfUetFcrYf8IC
EAmFs5Br6pbopi5M1IOxGZABNbr0qMS+zLvoF52cyka32nrmK/IMWl1s8bW0xfva
gxkjouzJDOTMq28asSdyu73i9yFyLB+tEHpojECljo+K5ASzvDad34xqVRlxY6vN
UJm4d76OqTfAyTRo7sQRWfLz/Cb1W+/v0mPotPWHWVyF0bUNsBgX5YKJhma3pU/z
rHwhI/fqg0NTXBcNUMCosEwS2u5vnjIfiOHkDTl1Dv747S/jM0AaNEvpjDo1cOuk
lsl4xKoDkQ6RpBCeFxRCW0wk1Nl/Su6RQ217xTtND+aIa4c6G7A9X8QAjj2XNAuK
CktJJD7wHYSBTu1sQ+u2R/c2QEgkAxABr4NpmWTPCBNlIKEjd4A=
=bXFe
-----END PGP SIGNATURE-----
Merge tag 'for-4.16-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"There's an important revert in this pull request that needs to go to
stable as it causes a corruption on big endian machines.
The other fix is for FIEMAP incorrectly reporting shared extents
before a sync and one fix for a crash in raid56.
So far we got only one report about the BE corruption, the stable
kernels were out for like a week, so hopefully the scope of the damage
is low"
* tag 'for-4.16-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Revert "btrfs: use proper endianness accessors for super_copy"
btrfs: add missing initialization in btrfs_check_shared
btrfs: Fix NULL pointer exception in find_bio_stripe
This reverts commit 3c181c12c4.
The offending patch was merged in 4.16-rc4 and was promptly applied to
stable kernels 4.14.25 and 4.15.8.
The patch causes a corruption in several superblock items on big-endian
machines because of messed up endianity conversions. The damage is
manually repairable. A filesystem cannot be mounted again after it has
been unmounted once.
We do a full revert and not a fixup so stable can pick that patch ASAP.
Fixes: 3c181c12c4 ("btrfs: use proper endianness accessors for super_copy")
Link: https://lkml.kernel.org/r/1521139304@msgid.manchmal.in-ulm.de
CC: stable@vger.kernel.org # 4.14+
Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch addresses an issue that causes fiemap to falsely
report a shared extent. The test case is as follows:
xfs_io -f -d -c "pwrite -b 16k 0 64k" -c "fiemap -v" /media/scratch/file5
sync
xfs_io -c "fiemap -v" /media/scratch/file5
which gives the resulting output:
wrote 65536/65536 bytes at offset 0
64 KiB, 4 ops; 0.0000 sec (121.359 MiB/sec and 7766.9903 ops/sec)
/media/scratch/file5:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 24576..24703 128 0x2001
/media/scratch/file5:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 24576..24703 128 0x1
This is because btrfs_check_shared calls find_parent_nodes
repeatedly in a loop, passing a share_check struct to report
the count of shared extent. But btrfs_check_shared does not
re-initialize the count value to zero for subsequent calls
from the loop, resulting in a false share count value. This
is a regressive behavior from 4.13.
With proper re-initialization the test result is as follows:
wrote 65536/65536 bytes at offset 0
64 KiB, 4 ops; 0.0000 sec (110.035 MiB/sec and 7042.2535 ops/sec)
/media/scratch/file5:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 24576..24703 128 0x1
/media/scratch/file5:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 24576..24703 128 0x1
which corrects the regression.
Fixes: 3ec4d3238a ("btrfs: allow backref search checks for shared extents")
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
[ add text from cover letter to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlqbuK8ACgkQxWXV+ddt
WDsiVQ//eE8Axfw4qWOyHHjozoAIu9kifvOFQIhvKviLiXHJNrD/vBI6YwD1hyD1
rbbLilMsEl1OD1Sq3AeOUMNSSl5qUFEB+CA8vg9GznFTNRobTkL+p96Zt5xlRDu3
lsFFV93tED+dK4D/eSGP+xYbknA8hIk/2gWPkd6hpYKyh2QdsPBYDqCnaEXvd79P
DIP/cAjIfzqQn0FTiZ9wbaES+LPO+NoZgQRC2w9McYQ5CEMc+oAChEmPJRwpPoKy
YdhuwoULniRNHVnVOIVfi4w9jkSPSz7YIgLeRFli/WGBYGcKeHTMFkMa12KdpuUC
JUclOogJ5ZMbFV3C0W8XEJ7Vb9ltIevrH8MgfKP/3BScuZzbJZ+n5KkH2Gf7vcpe
w5cZGOsKDz+35fDCdTmsFpDK9kpGzHq48JlRifOjARbdyqNwVq4emxOeQlO1ygzq
Y9H5UeMpp/FDAm6g/bV8ezCXzwuwUDV9CwAJBD+WCiZhD2nX85FfIp1kfF2zLcUg
Y8irqV6A/J/0BFkF7Iu9AuzTxOdo6zMLkcHEKb+sL7yGxZ2248v5gZxDoyV5iNWR
WY4Y2GaMemZGu6+NyyLFAJzlFyACD5fSy8vT7KD6upCzwmlAtaJ3VULfTrLyi/uM
SNZAKf3WzKBVe5h52GNGRCi5OLTteH6Yqz5m5/h8r74mBO00IrA=
=bF96
-----END PGP SIGNATURE-----
Merge tag 'for-4.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- when NR_CPUS is large, a SRCU structure can significantly inflate
size of the main filesystem structure that would not be possible to
allocate by kmalloc, so the kvalloc fallback is used
- improved error handling
- fix endiannes when printing some filesystem attributes via sysfs,
this is could happen when a filesystem is moved between different
endianity hosts
- send fixes: the NO_HOLE mode should not send a write operation for a
file hole
- fix log replay for for special files followed by file hardlinks
- fix log replay failure after unlink and link combination
- fix max chunk size calculation for DUP allocation
* tag 'for-4.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix log replay failure after unlink and link combination
Btrfs: fix log replay failure after linking special file and fsync
Btrfs: send, fix issuing write op when processing hole in no data mode
btrfs: use proper endianness accessors for super_copy
btrfs: alloc_chunk: fix DUP stripe size handling
btrfs: Handle btrfs_set_extent_delalloc failure in relocate_file_extent_cluster
btrfs: handle failure of add_pending_csums
btrfs: use kvzalloc to allocate btrfs_fs_info
If we have a file with 2 (or more) hard links in the same directory,
remove one of the hard links, create a new file (or link an existing file)
in the same directory with the name of the removed hard link, and then
finally fsync the new file, we end up with a log that fails to replay,
causing a mount failure.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/testdir
$ touch /mnt/testdir/foo
$ ln /mnt/testdir/foo /mnt/testdir/bar
$ sync
$ unlink /mnt/testdir/bar
$ touch /mnt/testdir/bar
$ xfs_io -c "fsync" /mnt/testdir/bar
<power failure>
$ mount /dev/sdb /mnt
mount: mount(2) failed: /mnt: No such file or directory
When replaying the log, for that example, we also see the following in
dmesg/syslog:
[71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
[71813.674204] ------------[ cut here ]------------
[71813.675694] BTRFS: Transaction aborted (error -2)
[71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
[71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
[71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G W 4.15.0-rc9-btrfs-next-56+ #1
[71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
[71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286
[71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001
[71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff
[71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001
[71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8
[71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101
[71813.679669] FS: 00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
[71813.679669] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0
[71813.679669] Call Trace:
[71813.679669] btrfs_unlink_inode+0x17/0x41 [btrfs]
[71813.679669] drop_one_dir_item+0xfa/0x131 [btrfs]
[71813.679669] add_inode_ref+0x71e/0x851 [btrfs]
[71813.679669] ? __lock_is_held+0x39/0x71
[71813.679669] ? replay_one_buffer+0x53/0x53a [btrfs]
[71813.679669] replay_one_buffer+0x4a4/0x53a [btrfs]
[71813.679669] ? rcu_read_unlock+0x3a/0x57
[71813.679669] ? __lock_is_held+0x39/0x71
[71813.679669] walk_up_log_tree+0x101/0x1d2 [btrfs]
[71813.679669] walk_log_tree+0xad/0x188 [btrfs]
[71813.679669] btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
[71813.679669] ? replay_one_extent+0x544/0x544 [btrfs]
[71813.679669] open_ctree+0x1cf6/0x2209 [btrfs]
[71813.679669] btrfs_mount_root+0x368/0x482 [btrfs]
[71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6
[71813.679669] ? __lockdep_init_map+0x176/0x1c2
[71813.679669] ? mount_fs+0x64/0x10b
[71813.679669] mount_fs+0x64/0x10b
[71813.679669] vfs_kern_mount+0x68/0xce
[71813.679669] btrfs_mount+0x13e/0x772 [btrfs]
[71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6
[71813.679669] ? __lockdep_init_map+0x176/0x1c2
[71813.679669] ? mount_fs+0x64/0x10b
[71813.679669] mount_fs+0x64/0x10b
[71813.679669] vfs_kern_mount+0x68/0xce
[71813.679669] do_mount+0x6e5/0x973
[71813.679669] ? memdup_user+0x3e/0x5c
[71813.679669] SyS_mount+0x72/0x98
[71813.679669] entry_SYSCALL_64_fastpath+0x1e/0x8b
[71813.679669] RIP: 0033:0x7f7cedf150ba
[71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206
[71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
[71813.679669] ---[ end trace 83bd473fc5b4663b ]---
[71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
[71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
[71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
[71814.128078] BTRFS error (device dm-0): open_ctree failed
This happens because the log has inode reference items for both inode 258
(the first file we created) and inode 259 (the second file created), and
when processing the reference item for inode 258, we replace the
corresponding item in the subvolume tree (which has two names, "foo" and
"bar") witht he one in the log (which only has one name, "foo") without
removing the corresponding dir index keys from the parent directory.
Later, when processing the inode reference item for inode 259, which has
a name of "bar" associated to it, we notice that dir index entries exist
for that name and for a different inode, so we attempt to unlink that
name, which fails because the inode reference item for inode 258 no longer
has the name "bar" associated to it, making a call to btrfs_unlink_inode()
fail with a -ENOENT error.
Fix this by unlinking all the names in an inode reference item from a
subvolume tree that are not present in the inode reference item found in
the log tree, before overwriting it with the item from the log tree.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If in the same transaction we rename a special file (fifo, character/block
device or symbolic link), create a hard link for it having its old name
then sync the log, we will end up with a log that can not be replayed and
at when attempting to replay it, an EEXIST error is returned and mounting
the filesystem fails. Example scenario:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/testdir
$ mkfifo /mnt/testdir/foo
# Make sure everything done so far is durably persisted.
$ sync
# Create some unrelated file and fsync it, this is just to create a log
# tree. The file must be in the same directory as our special file.
$ touch /mnt/testdir/f1
$ xfs_io -c "fsync" /mnt/testdir/f1
# Rename our special file and then create a hard link with its old name.
$ mv /mnt/testdir/foo /mnt/testdir/bar
$ ln /mnt/testdir/bar /mnt/testdir/foo
# Create some other unrelated file and fsync it, this is just to persist
# the log tree which was modified by the previous rename and link
# operations. Alternatively we could have modified file f1 and fsync it.
$ touch /mnt/f2
$ xfs_io -c "fsync" /mnt/f2
<power failure>
$ mount /dev/sdc /mnt
mount: mount /dev/sdc on /mnt failed: File exists
This happens because when both the log tree and the subvolume's tree have
an entry in the directory "testdir" with the same name, that is, there
is one key (258 INODE_REF 257) in the subvolume tree and another one in
the log tree (where 258 is the inode number of our special file and 257
is the inode for directory "testdir"). Only the data of those two keys
differs, in the subvolume tree the index field for inode reference has
a value of 3 while the log tree it has a value of 5. Because the same key
exists in both trees, but have different index, the log replay fails with
an -EEXIST error when attempting to replay the inode reference from the
log tree.
Fix this by setting the last_unlink_trans field of the inode (our special
file) to the current transaction id when a hard link is created, as this
forces logging the parent directory inode, solving the conflict at log
replay time.
A new generic test case for fstests was also submitted.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing an incremental send of a filesystem with the no-holes feature
enabled, we end up issuing a write operation when using the no data mode
send flag, instead of issuing an update extent operation. Fix this by
issuing the update extent operation instead.
Trivial reproducer:
$ mkfs.btrfs -f -O no-holes /dev/sdc
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdc /mnt/sdc
$ mount /dev/sdd /mnt/sdd
$ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar
$ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1
$ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar
$ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2
$ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd
$ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \
| btrfs receive -vv /mnt/sdd
Before this change the output of the second receive command is:
receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447...
utimes
write foobar, offset 8192, len 8192
utimes foobar
BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-...
After this change it is:
receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
utimes
update_extent foobar: offset=8192, len=8192
utimes foobar
BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info::super_copy is a byte copy of the on-disk structure and all
members must use the accessor macros/functions to obtain the right
value. This was missing in update_super_roots and in sysfs readers.
Moving between opposite endianness hosts will report bogus numbers in
sysfs, and mount may fail as the root will not be restored correctly. If
the filesystem is always used on a same endian host, this will not be a
problem.
Fix this by using the btrfs_set_super...() functions to set
fs_info::super_copy values, and for the sysfs, use the cached
fs_info::nodesize/sectorsize values.
CC: stable@vger.kernel.org
Fixes: df93589a17 ("btrfs: export more from FS_INFO to sysfs")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
In case of using DUP, we search for enough unallocated disk space on a
device to hold two stripes.
The devices_info[ndevs-1].max_avail that holds the amount of unallocated
space found is directly assigned to stripe_size, while it's actually
twice the stripe size.
Later on in the code, an unconditional division of stripe_size by
dev_stripes corrects the value, but in the meantime there's a check to
see if the stripe_size does not exceed max_chunk_size. Since during this
check stripe_size is twice the amount as intended, the check will reduce
the stripe_size to max_chunk_size if the actual correct to be used
stripe_size is more than half the amount of max_chunk_size.
The unconditional division later tries to correct stripe_size, but will
actually make sure we can't allocate more than half the max_chunk_size.
Fix this by moving the division by dev_stripes before the max chunk size
check, so it always contains the right value, instead of putting a duct
tape division in further on to get it fixed again.
Since in all other cases than DUP, dev_stripes is 1, this change only
affects DUP.
Other attempts in the past were made to fix this:
* 37db63a400 "Btrfs: fix max chunk size check in chunk allocator" tried
to fix the same problem, but still resulted in part of the code acting
on a wrongly doubled stripe_size value.
* 86db25785a "Btrfs: fix max chunk size on raid5/6" unintentionally
broke this fix again.
The real problem was already introduced with the rest of the code in
73c5de0051.
The user visible result however will be that the max chunk size for DUP
will suddenly double, while it's actually acting according to the limits
in the code again like it was 5 years ago.
Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html
Fixes: 73c5de0051 ("btrfs: quasi-round-robin for chunk allocation")
Fixes: 86db25785a ("Btrfs: fix max chunk size on raid5/6")
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Essentially duplicate the error handling from the above block which
handles the !PageUptodate(page) case and additionally clear
EXTENT_BOUNDARY.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
add_pending_csums was added as part of the new data=ordered
implementation in e6dcd2dc9c ("Btrfs: New data=ordered
implementation"). Even back then it called the btrfs_csum_file_blocks
which can fail but it never bothered handling the failure. In ENOMEM
situation this could lead to the filesystem failing to write the
checksums for a particular extent and not detect this. On read this
could lead to the filesystem erroring out due to crc mismatch. Fix it by
propagating failure from add_pending_csums and handling them.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The srcu_struct in btrfs_fs_info scales in size with NR_CPUS. On
kernels built with NR_CPUS=8192, this can result in kmalloc failures
that prevent mounting.
There is work in progress to try to resolve this for every user of
srcu_struct but using kvzalloc will work around the failures until
that is complete.
As an example with NR_CPUS=512 on x86_64: the overall size of
subvol_srcu is 3460 bytes, fs_info is 6496.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlqG8poACgkQxWXV+ddt
WDuHSA//eC+69XpHwohI6pcPQ7Jbr9UCj1L/Gt0U96YSzijGW4Hv3OQEWLIRBu4c
nZbzQYtUunpguLYwfXgUUgXRHBTo2Y5bXZNmF2MtL7JcPOLhLh4h/IcGY7eRd2Vq
qvv2bqr3yAcQo7s6z5U/D8ulohzHQTxG7Jaq/BkVxQqhvu+vdu/9T8ikAWnmSTjw
lONu8soR5QO7tewxz23Cguw/t1bWe1aMXG9Ykd4avyhQHtgzNE+l82i4DYUhK2CM
x8M5/CxnDLPe73IJuA2INCUtpPvR4Qufi5Nz6EN3BrJNCGBkmg18sPIvWlH6LsVh
bsm4Lwz/piq+hkDq2GG+Z79uiGAfCVUWAsnm7yYHwpVyMvwHKlfrcVSAuRZixw5E
/NZ0JEkEOtvzpv4inZFYbAgD+oKfvYvwj9BW5BXfu2aH6hJBImfAeMSd1aHB3uZI
kGgy52k2v2P3WKQOFUbmW417P05DvvGmRvRmU+tSFpB+lXAZqRzoiVIuFm0xwhf1
1SmnYgnSYzPmzIRXAMsSYQeK/8NXDdMZMutaw/AYwX+QBEdIAErf6MWcjI6XZRyG
g8Gr8JcpwSa+H5/LKN5uswfXxfSAsqVHnZhbOVrjyGX0wyR4KJg3ag3KsHd9SCxb
LDEjPSYEDn9yfmw6pK2Q6J26FGYiKpuUXaNiYVNymGe6162IiBM=
=VeA/
-----END PGP SIGNATURE-----
Merge tag 'for-4.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"We have a few assorted fixes, some of them show up during fstests so I
gave them more testing"
* tag 'for-4.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device
Btrfs: fix null pointer dereference when replacing missing device
btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodes
btrfs: Ignore errors from btrfs_qgroup_trace_extent_post
Btrfs: fix unexpected -EEXIST when creating new inode
Btrfs: fix use-after-free on root->orphan_block_rsv
Btrfs: fix btrfs_evict_inode to handle abnormal inodes correctly
Btrfs: fix extent state leak from tree log
Btrfs: fix crash due to not cleaning up tree log block's dirty bits
Btrfs: fix deadlock in run_delalloc_nocow
Commit 4fde46f0cc ("Btrfs: free the stale device") introduced
btrfs_free_stale_device which iterates the device lists for all
registered btrfs filesystems and deletes those devices which aren't
mounted. In a btrfs_devices structure has only 1 device attached to it
and it is unused then btrfs_free_stale_devices will proceed to also free
the btrfs_fs_devices struct itself. Currently this leads to a use after
free since list_for_each_entry will try to perform a check on the
already freed memory to see if it has to terminate the loop.
The fix is to use 'break' when we know we are freeing the current
fs_devs.
Fixes: 4fde46f0cc ("Btrfs: free the stale device")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Until v4.14, this warning was very infrequent:
WARNING: CPU: 3 PID: 18172 at fs/btrfs/backref.c:1391 find_parent_nodes+0xc41/0x14e0
Modules linked in: [...]
CPU: 3 PID: 18172 Comm: bees Tainted: G D W L 4.11.9-zb64+ #1
Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 2101 12/02/2014
Call Trace:
dump_stack+0x85/0xc2
__warn+0xd1/0xf0
warn_slowpath_null+0x1d/0x20
find_parent_nodes+0xc41/0x14e0
__btrfs_find_all_roots+0xad/0x120
? extent_same_check_offsets+0x70/0x70
iterate_extent_inodes+0x168/0x300
iterate_inodes_from_logical+0x87/0xb0
? iterate_inodes_from_logical+0x87/0xb0
? extent_same_check_offsets+0x70/0x70
btrfs_ioctl+0x8ac/0x2820
? lock_acquire+0xc2/0x200
do_vfs_ioctl+0x91/0x700
? __fget+0x112/0x200
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x23/0xc6
? trace_hardirqs_off_caller+0x1f/0x140
Starting with v4.14 (specifically 86d5f99442 ("btrfs: convert prelimary
reference tracking to use rbtrees")) the WARN_ON occurs three orders of
magnitude more frequently--almost once per second while running workloads
like bees.
Replace the WARN_ON() with a comment rationale for its removal.
The rationale is paraphrased from an explanation by Edmund Nadolski
<enadolski@suse.de> on the linux-btrfs mailing list.
Fixes: 8da6d5815c ("Btrfs: added btrfs_find_all_roots()")
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Running generic/019 with qgroups on the scratch device enabled is almost
guaranteed to trigger the BUG_ON in btrfs_free_tree_block. It's supposed
to trigger only on -ENOMEM, in reality, however, it's possible to get
-EIO from btrfs_qgroup_trace_extent_post. This function just finds the
roots of the extent being tracked and sets the qrecord->old_roots list.
If this operation fails nothing critical happens except the quota
accounting can be considered wrong. In such case just set the
INCONSISTENT flag for the quota and print a warning, rather than killing
off the system. Additionally, it's possible to trigger a BUG_ON in
btrfs_truncate_inode_items as well.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ error message adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
The highest objectid, which is assigned to new inode, is decided at
the time of initializing fs roots. However, in cases where log replay
gets processed, the btree which fs root owns might be changed, so we
have to search it again for the highest objectid, otherwise creating
new inode would end up with -EEXIST.
cc: <stable@vger.kernel.org> v4.4-rc6+
Fixes: f32e48e925 ("Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This regression is introduced in
commit 3d48d9810d ("btrfs: Handle uninitialised inode eviction").
There are two problems,
a) it is ->destroy_inode() that does the final free on inode, not
->evict_inode(),
b) clear_inode() must be called before ->evict_inode() returns.
This could end up hitting BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
in evict() because I_CLEAR is set in clear_inode().
Fixes: commit 3d48d9810d ("btrfs: Handle uninitialised inode eviction")
Cc: <stable@vger.kernel.org> # v4.7-rc6+
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's possible that btrfs_sync_log() bails out after one of the two
btrfs_write_marked_extents() which convert extent state's state bit into
EXTENT_NEED_WAIT from EXTENT_DIRTY/EXTENT_NEW, however only EXTENT_DIRTY
and EXTENT_NEW are searched by free_log_tree() so that those extent states
with EXTENT_NEED_WAIT lead to memory leak.
cc: <stable@vger.kernel.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In cases that the whole fs flips into readonly status due to failures in
critical sections, then log tree's blocks are still dirty, and this leads
to a crash during umount time, the crash is about use-after-free,
umount
-> close_ctree
-> stop workers
-> iput(btree_inode)
-> iput_final
-> write_inode_now
-> ...
-> queue job on stop'd workers
cc: <stable@vger.kernel.org> v3.12+
Fixes: 681ae50917 ("Btrfs: cleanup reserved space when freeing tree log on error")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
@cur_offset is not set back to what it should be (@cow_start) if
btrfs_next_leaf() returns something wrong, and the range [cow_start,
cur_offset) remains locked forever.
cc: <stable@vger.kernel.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
documentation, errseq documentation, kernel-doc support for nested
structure definitions, the removal of lots of crufty kernel-doc support for
unused formats, SPDX tag documentation, the beginnings of a manual for
subsystem maintainers, and lots of fixes and updates.
As usual, some of the changesets reach outside of Documentation/ to effect
kerneldoc comment fixes. It also adds the new LICENSES directory, of which
Thomas promises I do not need to be the maintainer.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJab11TAAoJEI3ONVYwIuV6i1UP/1LgGPHW9Ygq5qaLFbReZd/u
Mx/orrhHX0PdkbCCE+CbL8Vm1m4UKFDTBdlpk3s542zxeeG0ZBXuTnvq4Kyk+cTN
p4/vsIEzk/Ih13/glGE5MlV+EjiEK+8hK69TIUj7bAyuHmpzofjRz9/1M6RLDGDC
HY6UI58AXG0yOQWMWCGRMYpQAFUGij2equ7Doe1ugXRq14dx7V4RsOhI140iRk7t
bquAq1rS2fXniiuPFmLBUe4dWW28isVa/Vl/aXcaWQDKMyT0OLhjOMW36wWKqtPi
WdVCpHv1NLZNyZZr9S3kvfOwW+BUqpEzfVwssyBLW4h0tsnIx0U0HVhSTY8/TvFZ
QD9yCSana4LB/e5CHXIX5lBHbjHxf+rETXqVV4MgwDaMvM3mCo4X6WUTJDmZADo6
vQISEKeb4su5uWAbc9T9xwRSLhZnFVdJ/QuYdNQ5+EpFJYLhzQ9eBvEz6JstSIXL
p9ASBiPNY3ulpVZ8q0JOHJRBhq5mHJH6Dy8achzbILy2l/ZI4b8lJ53mw9II04cp
puF96E6HpvuZ8Tgjjrg9U3ZdxXNrUgc/tjk2ZDkyTglk1XF2jKSq2tiNSZ3oLrJm
XqJPnpCeyJM5UDvwkIBzgC41WEHwe8uvoNbUnc4X7UJSZegFzcSLQXf5qaprHS5k
XeQ7sbd+S+jzVVjFi0W5
=Z15Z
-----END PGP SIGNATURE-----
Merge tag 'docs-4.16' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"Documentation updates for 4.16.
New stuff includes refcount_t documentation, errseq documentation,
kernel-doc support for nested structure definitions, the removal of
lots of crufty kernel-doc support for unused formats, SPDX tag
documentation, the beginnings of a manual for subsystem maintainers,
and lots of fixes and updates.
As usual, some of the changesets reach outside of Documentation/ to
effect kerneldoc comment fixes. It also adds the new LICENSES
directory, of which Thomas promises I do not need to be the
maintainer"
* tag 'docs-4.16' of git://git.lwn.net/linux: (65 commits)
linux-next: docs-rst: Fix typos in kfigure.py
linux-next: DOC: HWPOISON: Fix path to debugfs in hwpoison.txt
Documentation: Fix misconversion of #if
docs: add index entry for networking/msg_zerocopy
Documentation: security/credentials.rst: explain need to sort group_list
LICENSES: Add MPL-1.1 license
LICENSES: Add the GPL 1.0 license
LICENSES: Add Linux syscall note exception
LICENSES: Add the MIT license
LICENSES: Add the BSD-3-clause "Clear" license
LICENSES: Add the BSD 3-clause "New" or "Revised" License
LICENSES: Add the BSD 2-clause "Simplified" license
LICENSES: Add the LGPL-2.1 license
LICENSES: Add the LGPL 2.0 license
LICENSES: Add the GPL 2.0 license
Documentation: Add license-rules.rst to describe how to properly identify file licenses
scripts: kernel_doc: better handle show warnings logic
fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at
doc: md: Fix a file name to md-fault.c in fault-injection.txt
errseq: Add to documentation tree
...
Pull networking updates from David Miller:
1) Significantly shrink the core networking routing structures. Result
of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf
2) Add netdevsim driver for testing various offloads, from Jakub
Kicinski.
3) Support cross-chip FDB operations in DSA, from Vivien Didelot.
4) Add a 2nd listener hash table for TCP, similar to what was done for
UDP. From Martin KaFai Lau.
5) Add eBPF based queue selection to tun, from Jason Wang.
6) Lockless qdisc support, from John Fastabend.
7) SCTP stream interleave support, from Xin Long.
8) Smoother TCP receive autotuning, from Eric Dumazet.
9) Lots of erspan tunneling enhancements, from William Tu.
10) Add true function call support to BPF, from Alexei Starovoitov.
11) Add explicit support for GRO HW offloading, from Michael Chan.
12) Support extack generation in more netlink subsystems. From Alexander
Aring, Quentin Monnet, and Jakub Kicinski.
13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From
Russell King.
14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso.
15) Many improvements and simplifications to the NFP driver bpf JIT,
from Jakub Kicinski.
16) Support for ipv6 non-equal cost multipath routing, from Ido
Schimmel.
17) Add resource abstration to devlink, from Arkadi Sharshevsky.
18) Packet scheduler classifier shared filter block support, from Jiri
Pirko.
19) Avoid locking in act_csum, from Davide Caratti.
20) devinet_ioctl() simplifications from Al viro.
21) More TCP bpf improvements from Lawrence Brakmo.
22) Add support for onlink ipv6 route flag, similar to ipv4, from David
Ahern.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits)
tls: Add support for encryption using async offload accelerator
ip6mr: fix stale iterator
net/sched: kconfig: Remove blank help texts
openvswitch: meter: Use 64-bit arithmetic instead of 32-bit
tcp_nv: fix potential integer overflow in tcpnv_acked
r8169: fix RTL8168EP take too long to complete driver initialization.
qmi_wwan: Add support for Quectel EP06
rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK
ipmr: Fix ptrdiff_t print formatting
ibmvnic: Wait for device response when changing MAC
qlcnic: fix deadlock bug
tcp: release sk_frag.page in tcp_disconnect
ipv4: Get the address of interface correctly.
net_sched: gen_estimator: fix lockdep splat
net: macb: Handle HRESP error
net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring
ipv6: addrconf: break critical section in addrconf_verify_rtnl()
ipv6: change route cache aging logic
i40e/i40evf: Update DESC_NEEDED value to reflect larger value
bnxt_en: cleanup DIM work on device shutdown
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlpvikQACgkQxWXV+ddt
WDs6qA//ZE7eEH0sKpD4Z+3gUevk/MMXwE9prRijEdjXz/K/UXtvpq0sI7HMQskZ
Ls9Wmzof+3WEQoa08RQZFzwuclW1Udm09SqE2oHP2gXQB5rC0BtWdrlMaKUJy03y
NUwxHetbE6TsFLU5HIVmi05NexNx9SVV6oJTWt00RlXTePw9Aoc88ikoXXUE2vqH
wbH9/ccmM9EkDFxdG+YG5QX054kQV8/5RXdqBJnIiGVRX5ZsAY84AN9x9YoRCVUw
wq9TfPu6XmeA6Uq6knpeLlXDms5w+FE3n5CduROk7Q7YNgpoZekF20c8uK8HzT4T
KF8hc0QpQgRCVBJ8I4MbPSMRIDf3IWfZmWSDEDda/6/ep6Bl99b8PFvdDKDBMUct
8wsgGrwGbHuz2l2QUIXjpBL9Cv9Tbu8vjmg0h2hFrpiH1c8JaXjKtJXAMtigWsZ1
DdX+5Y0zqvV/YLpzKF4aMDWXIteN4qaznvjdmj3B7BxgcnITOV/cmPCyMplNrtUa
Cs2fzGV5tpxhBzxE490v+frMULmLq2W1e6WPfFCqPKBCqulcR75TozDQS9M2/h4k
uAZzVKoguHrUPP1ONVas9aVC05K473nbYPd28eecYwkZ4z32hK4SAdnQY1aPreTe
axoV7p7a+i1bkzT5LK6gIfpddVWth8w45nz4P0lwxp0Z6XlkbJE=
=Irul
-----END PGP SIGNATURE-----
Merge tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features or user visible changes:
- fallocate: implement zero range mode
- avoid losing data raid profile when deleting a device
- tree item checker: more checks for directory items and xattrs
Notable fixes:
- raid56 recovery: don't use cached stripes, that could be
potentially changed and a later RMW or recovery would lead to
corruptions or failures
- let raid56 try harder to rebuild damaged data, reading from all
stripes if necessary
- fix scrub to repair raid56 in a similar way as in the case above
Other:
- cleanups: device freeing, removed some call indirections, redundant
bio_put/_get, unused parameters, refactorings and renames
- RCU list traversal fixups
- simplify mount callchain, remove recursing back when mounting a
subvolume
- plug for fsync, may improve bio merging on multiple devices
- compression heurisic: replace heap sort with radix sort, gains some
performance
- add extent map selftests, buffered write vs dio"
* tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (155 commits)
btrfs: drop devid as device_list_add() arg
btrfs: get device pointer from device_list_add()
btrfs: set the total_devices in device_list_add()
btrfs: move pr_info into device_list_add
btrfs: make btrfs_free_stale_devices() to match the path
btrfs: rename btrfs_free_stale_devices() arg to skip_dev
btrfs: make btrfs_free_stale_devices() argument optional
btrfs: make btrfs_free_stale_device() to iterate all stales
btrfs: no need to check for btrfs_fs_devices::seeding
btrfs: Use IS_ALIGNED in btrfs_truncate_block instead of opencoding it
Btrfs: noinline merge_extent_mapping
Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping
Btrfs: extent map selftest: dio write vs dio read
Btrfs: extent map selftest: buffered write vs dio read
Btrfs: add extent map selftests
Btrfs: move extent map specific code to extent_map.c
Btrfs: add helper for em merge logic
Btrfs: fix unexpected EEXIST from btrfs_get_extent
Btrfs: fix incorrect block_len in merge_extent_mapping
btrfs: Remove unused readahead spinlock
...
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJabwjlAAoJEAAOaEEZVoIVeEEP/R84kZJjlZV/vNmFFvY46jM+
0hpMHXRNym+nW1Du1CKNkesEUAY8ACAQIyzJh63Q72341QTDdz3+asHwPYRNOqdC
PgryidPieojkNKQg+h7dmoKYlYh1xiCicvn66Q5PFb9B0lH36twekOK4X1qqJj8Z
breRmRoFLka9looMSuYgwbErts023fmASalvGum6T0ZM/7F9hUj4O3OsQtKTLUNM
VQ+gLJTQrUqrgzvWUwq3WTMa9YAaKP4oad8nsglNSpiVLG7WtURr5HokW9hAziqL
k99Y+K2ni1wZJlNGJAyV7PyEG2ieI5Xn+LzM2RM+SndD1QHF2QXACmSTDYfL51k5
G2RsKeTZvQPtX4qx9+vnCp/4oV6JduvCaq2Mt8SQb9nYZxKjs85TNLrARJv+85eQ
zP0OTxlH1Gfu3j36n3cny4XemyMYYF4hCFYfRPqTGst37fgLBtfIfUSQ6jedoCK2
Xcyb6ukGXMh6If/A7DSy91hvSSPrWSH7TPPsbfLy6o+wUOtpAGR4eXVlEuAiXrzc
gnoAz85oIMUQae66LrdrPk1NyE59qOb24g/yU5gyRBSpi2+/aoboNCKaD73tgs/C
XIMwGXLYmqkcud7IBQF0tHHiM+jsEkbSM4LUqRXSnqMdwNnS18Z4Q+JKqpdP0cii
eRdenDvUfu8Gu1Y9vWBv
=iihN
-----END PGP SIGNATURE-----
Merge tag 'iversion-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull inode->i_version rework from Jeff Layton:
"This pile of patches is a rework of the inode->i_version field. We
have traditionally incremented that field on every inode data or
metadata change. Typically this increment needs to be logged on disk
even when nothing else has changed, which is rather expensive.
It turns out though that none of the consumers of that field actually
require this behavior. The only real requirement for all of them is
that it be different iff the inode has changed since the last time the
field was checked.
Given that, we can optimize away most of the i_version increments and
avoid dirtying inode metadata when the only change is to the i_version
and no one is querying it. Queries of the i_version field are rather
rare, so we can help write performance under many common workloads.
This patch series converts existing accesses of the i_version field to
a new API, and then converts all of the in-kernel filesystems to use
it. The last patch in the series then converts the backend
implementation to a scheme that optimizes away a large portion of the
metadata updates when no one is looking at it.
In my own testing this series significantly helps performance with
small I/O sizes. I also got this email for Christmas this year from
the kernel test robot (a 244% r/w bandwidth improvement with XFS over
DAX, with 4k writes):
https://lkml.org/lkml/2017/12/25/8
A few of the earlier patches in this pile are also flowing to you via
other trees (mm, integrity, and nfsd trees in particular)".
* tag 'iversion-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: (22 commits)
fs: handle inode->i_version more efficiently
btrfs: only dirty the inode in btrfs_update_time if something was changed
xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
fs: only set S_VERSION when updating times if necessary
IMA: switch IMA over to new i_version API
xfs: convert to new i_version API
ufs: use new i_version API
ocfs2: convert to new i_version API
nfsd: convert to new i_version API
nfs: convert to new i_version API
ext4: convert to new i_version API
ext2: convert to new i_version API
exofs: switch to new i_version API
btrfs: convert to new i_version API
afs: convert to new i_version API
affs: convert to new i_version API
fat: convert to new i_version API
fs: don't take the i_lock in inode_inc_iversion
fs: new API for handling inode->i_version
ntfs: remove i_version handling
...
Pull block updates from Jens Axboe:
"This is the main pull request for block IO related changes for the
4.16 kernel. Nothing major in this pull request, but a good amount of
improvements and fixes all over the map. This contains:
- BFQ improvements, fixes, and cleanups from Angelo, Chiara, and
Paolo.
- Support for SMR zones for deadline and mq-deadline from Damien and
Christoph.
- Set of fixes for bcache by way of Michael Lyle, including fixes
from himself, Kent, Rui, Tang, and Coly.
- Series from Matias for lightnvm with fixes from Hans Holmberg,
Javier, and Matias. Mostly centered around pblk, and the removing
rrpc 1.2 in preparation for supporting 2.0.
- A couple of NVMe pull requests from Christoph. Nothing major in
here, just fixes and cleanups, and support for command tracing from
Johannes.
- Support for blk-throttle for tracking reads and writes separately.
From Joseph Qi. A few cleanups/fixes also for blk-throttle from
Weiping.
- Series from Mike Snitzer that enables dm to register its queue more
logically, something that's alwways been problematic on dm since
it's a stacked device.
- Series from Ming cleaning up some of the bio accessor use, in
preparation for supporting multipage bvecs.
- Various fixes from Ming closing up holes around queue mapping and
quiescing.
- BSD partition fix from Richard Narron, fixing a problem where we
can't mount newer (10/11) FreeBSD partitions.
- Series from Tejun reworking blk-mq timeout handling. The previous
scheme relied on atomic bits, but it had races where we would think
a request had timed out if it to reused at the wrong time.
- null_blk now supports faking timeouts, to enable us to better
exercise and test that functionality separately. From me.
- Kill the separate atomic poll bit in the request struct. After
this, we don't use the atomic bits on blk-mq anymore at all. From
me.
- sgl_alloc/free helpers from Bart.
- Heavily contended tag case scalability improvement from me.
- Various little fixes and cleanups from Arnd, Bart, Corentin,
Douglas, Eryu, Goldwyn, and myself"
* 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
block: remove smart1,2.h
nvme: add tracepoint for nvme_complete_rq
nvme: add tracepoint for nvme_setup_cmd
nvme-pci: introduce RECONNECTING state to mark initializing procedure
nvme-rdma: remove redundant boolean for inline_data
nvme: don't free uuid pointer before printing it
nvme-pci: Suspend queues after deleting them
bsg: use pr_debug instead of hand crafted macros
blk-mq-debugfs: don't allow write on attributes with seq_operations set
nvme-pci: Fix queue double allocations
block: Set BIO_TRACE_COMPLETION on new bio during split
blk-throttle: use queue_is_rq_based
block: Remove kblockd_schedule_delayed_work{,_on}()
blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
lib/scatterlist: Fix chaining support in sgl_alloc_order()
blk-throttle: track read and write request individually
block: add bdev_read_only() checks to common helpers
block: fail op_is_write() requests to read-only partitions
blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
...
As struct btrfs_disk_super is being passed, so it can get devid
the same way its parent does.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of pointer to btrfs_fs_devices as an arg in device_list_add()
better to get pointer to btrfs_device as return value, then we have
both, pointer to btrfs_device and btrfs_fs_devices. btrfs_device is
needed to handle reappearing missing device.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At this point, we know that "now" and the file times may differ, and we
suspect that the i_version has been flagged to be bumped. Attempt to
bump the i_version, and only mark the inode dirty if that actually
occurred or if one of the times was updated.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Add a documentation blob that explains what the i_version field is, how
it is expected to work, and how it is currently implemented by various
filesystems.
We already have inode_inc_iversion. Add several other functions for
manipulating and accessing the i_version counter. For now, the
implementation is trivial and basically works the way that all of the
open-coded i_version accesses work today.
Future patches will convert existing users of i_version to use the new
API, and then convert the backend implementation to do things more
efficiently.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlppIo8ACgkQxWXV+ddt
WDuu9A//e29aPiicPrtIovNfVCzev774UddFBhIiEqyfbO0QvHk7Ld0d9+htR38W
eAwxJG3gH4VS9yOFAyk/LrsjpQZYzP32wjGoIXQlz8J1SzxALlILLmjpYkZWHwZX
UqAomYDnRjFu4jsHZ/AAyUNngYER8aeH+aO1PudppBiSCIVpk1TAVRnkt04gCJ7e
CigFgMdRc8pT5P0T6Rsv2+W/yJC7sU0BgDIBjmUcnduqEAxRYl0zsJzpP0IYPPRa
cAnpyu/ApK/m9mlLWi0SfUyNvePFWNxfA2QDBO5G6FwM6C2f5x/zBGfu29wAupVJ
czdOSR5uqCd6WGZHfvaQ4cgQ69AE4lk68zijHRNESPa7tVZ9mQt713h2BqAX6Fus
z+y45ti9CrFoOhaoXCSjUbpJZQ7YKWrat44Qi8pJaKGxqiqXpT8oVcBfjofu+Drn
vnDrU+7WvST5TxcWw/JBWaFfft1dkbdKX+EYZW8sVJAmA/e0+aIfElsTuvdUQxuA
aLnsyvsYQHVcA/mArD12sScQ13DDejR4ija8bP5tpmodxquUouwEAtKPoFnXVT6k
8hbb6qPLsiBerRMSSuaNaFF/ESi36bsizH3O99bYyyxlR8bXi7irHZY9VPWc1OK3
4twoDO0RhCxDHPY5mrIcKdlp9HiiUBPGos3BCaOAg8rKHd/QP7Q=
=EOAu
-----END PGP SIGNATURE-----
Merge tag 'for-4.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"It's been reported recently that readdir can list stale entries under
some conditions. Fix it."
* tag 'for-4.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix stale entries in readdir
In fixing the readdir+pagefault deadlock I accidentally introduced a
stale entry regression in readdir. If we get close to full for the
temporary buffer, and then skip a few delayed deletions, and then try to
add another entry that won't fit, we will emit the entries we found and
retry. Unfortunately we delete entries from our del_list as we find
them, assuming we won't need them. However our pos will be with
whatever our last entry was, which could be before the delayed deletions
we skipped, so the next search will add the deleted entries back into
our readdir buffer. So instead don't delete entries we find in our
del_list so we can make sure we always find our delayed deletions. This
is a slight perf hit for readdir with lots of pending deletions, but
hopefully this isn't a common occurrence. If it is we can revist this
and optimize it.
cc: stable@vger.kernel.org
Fixes: 23b5ec7494 ("btrfs: fix readdir deadlock with pagefault")
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no other parent for device_list_add() except for
btrfs_scan_one_device(), which would set btrfs_fs_devices::total_devices
if device_list_add is successful and this can be done with in
device_list_add() itself.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 60999ca4b4 ("btrfs: make device scan less noisy")
adds return value 1 to device_list_add(), so that parent function can
call pr_info only when new device is added. Move the pr_info() part
into device_list_add() so that this function can be kept simple.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_free_stale_devices() is updated to match for the given device
path and delete it. (It searches for only unmounted list of devices.)
Also drop the comment about different path being used for the same
device, since now we will have cli to clean any device that's not a
concern any more.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
No functional changes.
Rename btrfs_free_stale_devices() arg to skip_dev, so that it
reflects what that arg for.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This updates btrfs_free_stale_devices() helper function to delete all
unmouted devices, when arg is NULL.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Let the list iterator iterate further and find other stale
devices and delete it. This is in preparation to add support
for user land request-able stale devices cleanup. Also rename
btrfs_free_stale_device() to btrfs_free_stale_devices().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to check for btrfs_fs_devices::seeding when we
have checked for btrfs_fs_devices::opened, because we can't sprout
without its seed FS being opened.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
No functional changes, just makes the code more readable
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to debug subtle bugs around merge_extent_mapping(), perf probe
can be used to check the arguments, but sometimes merge_extent_mapping()
got inlined by compiler and couldn't be probed.
This is adding noinline attribute to merge_extent_mapping().
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a subtle case, so in order to understand the problem, it'd be good
to know the content of existing and em when any error occurs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This test case simulates the racy situation of dio write vs dio read,
and see if btrfs_get_extent() would return -EEXIST.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This test case simulates the racy situation of buffered write vs dio
read, and see if btrfs_get_extent() would return -EEXIST.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've observed that btrfs_get_extent() and merge_extent_mapping() could
return -EEXIST in several cases, and they are caused by some racy
condition, e.g dio read vs dio write, which makes the problem very tricky
to reproduce.
This adds extent map selftests in order to simulate those racy situations.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
[ minor string adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
These helpers are extent map specific, move them to extent_map.c.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a prepare work for the following extent map selftest, which
runs tests against em merge logic.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This fixes a corner case that is caused by a race of dio write vs dio
read/write.
Here is how the race could happen.
Suppose that no extent map has been loaded into memory yet.
There is a file extent [0, 32K), two jobs are running concurrently
against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio
read from [0, 4K) or [4K, 8K).
t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K 32K).
------------------------------------------------------
t1 t2
btrfs_get_blocks_direct() btrfs_get_blocks_direct()
-> btrfs_get_extent() -> btrfs_get_extent()
-> lookup_extent_mapping()
-> add_extent_mapping() -> lookup_extent_mapping()
# load [0, 32K)
-> btrfs_new_extent_direct()
-> btrfs_drop_extent_cache()
# split [0, 32K) and
# drop [8K, 32K)
-> add_extent_mapping()
# add [8K, 32K)
-> add_extent_mapping()
# handle -EEXIST when adding
# [0, 32K)
------------------------------------------------------
About how t2(dio read/write) runs into -EEXIST:
a) add_extent_mapping() gets -EEXIST for adding em [0, 32k),
b) search_extent_mapping() then returns [0, 8k) as the existing em,
even though start == existing->start, em is [0, 32k) so that
extent_map_end(em) > extent_map_end(existing), i.e. 32k > 8k,
c) then it goes thru merge_extent_mapping() which tries to add a [8k, 8k)
(with a length 0) and returns -EEXIST as [8k, 32k) is already in tree,
d) so btrfs_get_extent() ends up returning -EEXIST to dio read/write,
which is confusing applications.
Here I conclude all the possible situations,
1) start < existing->start
+-----------+em+-----------+
+--prev---+ | +-------------+ |
| | | | | |
+---------+ + +---+existing++ ++
+
|
+
start
2) start == existing->start
+------------em------------+
| +-------------+ |
| | | |
+ +----existing-+ +
|
|
+
start
3) start > existing->start && start < (existing->start + existing->len)
+------------em------------+
| +-------------+ |
| | | |
+ +----existing-+ +
|
|
+
start
4) start >= (existing->start + existing->len)
+-----------+em+-----------+
| +-------------+ | +--next---+
| | | | | |
+ +---+existing++ + +---------+
+
|
+
start
As we can see, it turns out that if start is within existing em (front
inclusive), then the existing em should be returned as is, otherwise,
we try our best to merge candidate em with sibling ems to form a
larger em (in order to reduce the total number of em).
Reported-by: David Vallender <david.vallender@landmark.co.uk>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
%block_len could be checked on deciding if two em are mergeable.
merge_extent_mapping() has only added the front pad if the front part
of em gets truncated, but it's possible that the end part gets
truncated.
For both compressed extent and inline extent, em->block_len is not
adjusted accordingly, and for regular extent, em->block_len always
equals to em->len, hence this sets em->block_len with em->len.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The reada_lock in struct btrfs_device was only initialised, and not
actually used. That's good because there's another lock also called
reada_lock in the btrfs_fs_info that was quite heavily used. Remove
this one.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before rbio_orig_end_io() goes to free rbio, rbio may get merged with
more bios from other rbios and rbio->bio_list becomes non-empty,
in that case, these newly merged bios don't end properly.
Once unlock_stripe() is done, rbio->bio_list will not be updated any
more and we can call bio_endio() on all queued bios.
It should only happen in error-out cases, the normal path of recover
and full stripe write have already set RBIO_RMW_LOCKED_BIT to disable
merge before doing IO, so rbio_orig_end_io() called by them doesn't
have the above issue.
Reported-by: Jérôme Carretero <cJ-ko@zougloub.eu>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since raid6 recover tries all possible combinations of failed stripes,
- when raid6 rebuild algorithm is used, i.e. raid6_datap_recov() and
raid6_2data_recov(), it may change the in-memory content of failed
stripes, if such a raid bio is cached, a later raid write rmw or recover
can steal @stripe_pages from it instead of reading from disks, such that
it carries the wrong content to do write rmw or recovery and ends up
with corruption or recovery failures.
- when raid5 rebuild algorithm is used, i.e. xor, raid bio can be cached
because the only failed stripe which contains @rbio->bio_pages gets
modified, others remain the same so that their in-memory content is
consistent with their on-disk content.
This adds a check to skip caching rbio if using raid6 recover.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Bio iterated by set_bio_pages_uptodate() is raid56 internal one, so it
will never be a BIO_CLONED bio, and since this is called by end_io
functions, bio->bi_iter.bi_size is zero, we mustn't use
bio_for_each_segment() as that is a no-op if bi_size is zero.
Fixes: 6592e58c6b ("Btrfs: fix write corruption due to bio cloning on raid5/6")
Cc: <stable@vger.kernel.org> # v4.12-rc6+
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no function named btrfs_get_inode_index_count.
Explanation for magic number index_cnt=2 in btrfs_new_inode() is
actually located in btrfs_set_inode_index_count().
So replace 'btrfs_get_inode_index_count' in the comment by
'btrfs_set_inode_index_count'.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not used outside of extent-tree so there is no reason to not be
static.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We call btrfs_free_stale_device() only when we alloc a new struct
btrfs_device (ret=1), so move it closer to where we alloc the new
device. Also drop the comments.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I've noticed that the updated item checker stack consumption increased
dramatically in 542f5385e20cf97447 ("btrfs: tree-checker: Add checker
for dir item")
tree-checker.c:check_leaf +552 (176 -> 728)
The array is 255 bytes long, dynamic allocation would slow down the
sanity checks so it's more reasonable to keep it on-stack. Moving the
variable to the scope of use reduces the stack usage again
tree-checker.c:check_leaf -264 (728 -> 464)
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
gcc-8 reports:
fs/btrfs/ioctl.c: In function 'btrfs_ioctl':
./include/linux/string.h:245:9: warning: '__builtin_strncpy' specified
bound 1024 equals destination size [-Wstringop-truncation]
We need one less byte or call strlcpy() to make it a nul-terminated
string. This is done on the next line anyway, but we want to avoid the
warning.
Signed-off-by: Xiongfeng Wang <xiongfeng.wang@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
It appears from the original commit [1] that there isn't any design
specific reason not to fail the mount instead of just warning. This
patch will change it to fail.
[1]
commit 319e4d0661
btrfs: Enhance super validation check
Fixes: 319e4d0661 ("btrfs: Enhance super validation check")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs-progs uses super flag bit BTRFS_SUPER_FLAG_METADUMP_V2 (1ULL << 34).
So just define that in kernel so that we know its been used.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've avoided data losing raid profile when doing balance, but it
turns out that deleting a device could also result in the same
problem.
Say we have 3 disks, and they're created with '-d raid1' profile.
- We have chunk P (the only data chunk on the empty btrfs).
- Suppose that chunk P's two raid1 copies reside in disk A and disk B.
- Now, 'btrfs device remove disk B'
btrfs_rm_device()
-> btrfs_shrink_device()
-> btrfs_relocate_chunk() #relocate any chunk on disk B
to other places.
- Chunk P will be removed and a new chunk will be created to hold
those data, but as chunk P is the only one holding raid1 profile,
after it goes away, the new chunk will be created as single profile
which is our default profile.
This fixes the problem by creating an empty data chunk before
relocating the data chunk.
Metadata/System chunk are supposed to have non-zero bytes all the time
so their raid profile is preserved.
Reported-by: James Alandt <James.Alandt@wdc.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For a fallocate's zero range operation that targets a range with an end
that is not aligned to the sector size, we can end up not updating the
inode's i_size. This happens when the last page of the range maps to an
unwritten (prealloc) extent and before that last page we have either a
hole or a written extent. This is because in this scenario we relied
on a call to btrfs_prealloc_file_range() to update the inode's i_size,
however it can only update the i_size to the "down aligned" end of the
range.
Example:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ xfs_io -f -c "pwrite -S 0xff 0 428K" /mnt/foobar
$ xfs_io -c "falloc -k 428K 4K" /mnt/foobar
$ xfs_io -c "fzero 0 430K" /mnt/foobar
$ du --bytes /mnt/foobar
438272 /mnt/foobar
The inode's i_size was left as 428Kb (438272 bytes) when it should have
been updated to 430Kb (440320 bytes).
Fix this by always updating the inode's i_size explicitly after zeroing
the range.
Fixes: ba6d5887946ff86d93dc ("Btrfs: add support for fallocate's zero range operation")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During a buffered IO write, we can have an extent state that we got when
we locked the range (if the range starts at an offset lower than eof), so
always pass it to btrfs_dirty_pages() so that setting the delalloc bit
in the range does not need to do a full search in the inode's io tree,
saving time and reducing the amount of time we hold the io tree's lock.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This implements support the zero range operation of fallocate. For now
at least it's as simple as possible while reusing most of the existing
fallocate and hole punching infrastructure.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since fail stripe index in rbio would be used to decide which
algorithm reconstruction would be run, we cannot merge rbios if
their's fail striped indexes are different, otherwise, one of the two
reconstructions would fail.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Given the above
'
if (last->operation != cur->operation)
return 0;
',
it's guaranteed that two operations are same.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Assign ret = -EINVAL where it is actually required.
Remove { } around single line if else code.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_device::scrub_device is not a device which is being scrubbed,
but it holds the scrub context, so rename to reflect the same. No
functional changes here.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no other consumer for btrfs_handle_error() other than
__btrfs_handle_fs_error(), further this function quite small.
Merge it into its parent.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
__btrfs_handle_fs_error() sets BTRFS_FS_STATE_ERROR, and calls
btrfs_handle_error() so no need to check if the BTRFS_FS_STATE_ERROR
is set in btrfs_handle_error(). And there is no other user of
btrfs_handle_error() as well.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a scenario that can end up with rebuild process failing to
return good content, i.e.
suppose that all disks can be read without problems and if the content
that was read out doesn't match its checksum, currently for raid6
btrfs at most retries twice,
- the 1st retry is to rebuild with all other stripes, it'll eventually
be a raid5 xor rebuild,
- if the 1st fails, the 2nd retry will deliberately fail parity p so
that it will do raid6 style rebuild,
however, the chances are that another non-parity stripe content also
has something corrupted, so that the above retries are not able to
return correct content, and users will think of this as data loss.
More seriouly, if the loss happens on some important internal btree
roots, it could refuse to mount.
This extends btrfs to do more retries and each retry fails only one
stripe. Since raid6 can tolerate 2 disk failures, if there is one
more failure besides the failure on which we're recovering, this can
always work.
The worst case is to retry as many times as the number of raid6 disks,
but given the fact that such a scenario is really rare in practice,
it's still acceptable.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The raid6 corruption is that,
suppose that all disks can be read without problems and if the content
that was read out doesn't match its checksum, currently for raid6
btrfs at most retries twice,
- the 1st retry is to rebuild with all other stripes, it'll eventually
be a raid5 xor rebuild,
- if the 1st fails, the 2nd retry will deliberately fail parity p so
that it will do raid6 style rebuild,
however, the chances are that another non-parity stripe content also
has something corrupted, so that the above retries are not able to
return correct content.
We've fixed normal reads to rebuild raid6 correctly with more retries
in Patch "Btrfs: make raid6 rebuild retry more"[1], this is to fix
scrub to do the exactly same rebuild process.
[1]: https://patchwork.kernel.org/patch/10091755/
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update btrfs_check_rw_degradable() to check against the given device if
its lost.
We can use this function to know if the volume is going to be in
degraded mode OR failed state, when the given device fails. Which is
needed when we are handling the device failed state.
A preparatory patch does not affect the flow as such.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ enhance comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the
parameter to the function, though we lose some of the slightly better
semantics of GFP_KERNEL in some places, it's worth cleaning up the
callchains.
Signed-off-by: David Sterba <dsterba@suse.com>
There's only one instance where we pass different gfp mask to
unlock_extent_cached. Add a separate helper for that and then we can
drop the gfp parameter from unlock_extent_cached.
Signed-off-by: David Sterba <dsterba@suse.com>
Recent patches reworking the mount path left some unused parameters. We
pass a vfsmount to mount_subvol, the flags and data (ie. mount options)
have been already applied and we will not need them.
Signed-off-by: David Sterba <dsterba@suse.com>
Long ago, commit edf24abe51 ("btrfs: sanity mount option parsing and
early mount code") split the btrfs_parse_options() into two parts
(btrfs_parse_early_options() and btrfs_parse_options()). As a result,
btrfs_parse_optins no longer gets called twice and is the last one to
parse mount option string. Therefore there is no need to dup it.
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In fact nobody is waiting on @wait's waitqueue, it can be safely
removed.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bio is not referenced after it has been submitted and the endio is
going to consume the sole reference on successful submission. On error,
the callers of __btrfs_submit_dio_bio do invoke bio_put so we don't
leak it either.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bio is never referenced after it has been submitted so there is no
point in getting an extra reference.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bio that is passsed is the newly created repair bio which already
has a reference count of 1, which is going to be consumed by the
endio routine on successful submission. On error the handler also
calls bio_put.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
bio_get/set is necessary only if the bio is going to be referenced
following submissions. In the code paths where such calls are made
we don't really need them since the bio is referenced only if
btrfs_map_bio returns an error. And this function can return an error
prior to submission only. So referencing the bio is safe. Furthermore
we do call bio_endio which will consume the last reference. So let's
remove the redundant calls.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As it's a single instance and local to the file, we don't need to pass
it as an argument.
Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The callback is trivial and we don't need the abstraction for our
purposes. Let's open code it.
Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The callback is trivial and we don't need the abstraction for our
purposes. Let's open code it and also make the array types explicit.
Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove unused arg 'holder' from parse_subvol_options(), which has been
forgotten to be cleaned in the commit b99beb110e2d ("btrfs: split
parse_early_options() in two").
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since setup_root_args() is not used anymore, just remove it.
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now parse_early_options() is used by both btrfs_mount() and
btrfs_mount_root(). However, the former only needs subvol related part
and the latter needs the others.
Therefore extract the subvol related parts from parse_early_options() and
move it to new parse function (parse_subvol_options()).
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Cleanup btrfs_mount() by using btrfs_mount_root(). This avoids getting
btrfs_mount() called twice in mount path.
Old btrfs_mount() will do:
0. VFS layer calls vfs_kern_mount() with registered file_system_type
(for btrfs, btrfs_fs_type). btrfs_mount() is called on the way.
1. btrfs_parse_early_options() parses "subvolid=" mount option and set the
value to subvol_objectid. Otherwise, subvol_objectid has the initial
value of 0
2. check subvol_objectid is 5 or not. Assume this time id is not 5, then
btrfs_mount() returns by calling mount_subvol()
3. In mount_subvol(), original mount options are modified to contain
"subvolid=0" in setup_root_args(). Then, vfs_kern_mount() is called with
btrfs_fs_type and new options
4. btrfs_mount() is called again
5. btrfs_parse_early_options() parses "subvolid=0" and set 5 (instead of 0)
to subvol_objectid
6. check subvol_objectid is 5 or not. This time id is 5 and mount_subvol()
is not called. btrfs_mount() finishes mounting a root
7. (in mount_subvol()) with using a return vale of vfs_kern_mount(), it
calls mount_subtree()
8. return subvolume's dentry
Reusing the same file_system_type (and btrfs_mount()) for vfs_kern_mount()
is the cause of complication.
Instead, new btrfs_mount() will do:
1. parse subvol id related options for later use in mount_subvol()
2. mount device's root by calling vfs_kern_mount() with
btrfs_root_fs_type, which is not registered to VFS by
register_filesystem(). As a result, btrfs_mount_root() is called
3. return by calling mount_subvol()
The code of 2. is moved from the first part of mount_subvol().
The semantics of device holder changes from btrfs_fs_type to
btrfs_root_fs_type and has to be used in all contexts. Otherwise we'd
get wrong results when mount and dev scan would not check the same
thing. (this has been found indendently and the fix is folded into this
patch)
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ fold the btrfs_control_ioctl fixup, extend the comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Add btrfs_mount_root() and new file_system_type for preparation of cleanup
of btrfs_mount(). Code path is not changed yet.
btrfs_mount_root() is almost the same as current btrfs_mount(), but doesn't
have subvolume related part.
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Functions called from extent_write_cache_pages used void* as generic
callback data, but all of them convert it to extent_page_data, or use it
directly.
Signed-off-by: David Sterba <dsterba@suse.com>
The function extent_write_cache_pages is modelled after
write_cache_pages which is a generic interface and the writepage
parameter makes sense there. In btrfs we know exactly which callback
we're going to use, so we can pass it directly.
Signed-off-by: David Sterba <dsterba@suse.com>
flush_epd_write_bio is same as flush_write_bio, no point having two such
functions. Merge them to flush_write_bio. The 'noinline' attribute is
removed as it does not have any meaning.
Signed-off-by: David Sterba <dsterba@suse.com>
Currently there are 2 function doing binary search on btrfs nodes:
bin_search and btrfs_bin_search. The latter being a simple wrapper for
the former. So eliminate the wrapper and just rename bin_search to
btrfs_bin_search. No functional changes
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree argument passed to extent_write_full_page is referenced from
the page being passed to the same function. Since we already have
enough information to get the reference, remove the function parameter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is called only from submit_compressed_extents and the
io tree being passed is always that of the inode. But we are also
passing the inode, so just move getting the io tree pointer in
extent_write_locked_range to simplify the signature.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This code was added in 492bb6deee ("Btrfs: Hold a reference on bios
during submit_bio, add some extra bio checks"). However, holding a
reference on a bio is necessary only if it's going to be referenced
after the submit_bio returns and the bio is completed. In this
particular instance this is not the case so there is no need to hold
an extra reference since we directly return.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When modifying a tree where the root is at BTRFS_MAX_LEVEL - 1 then
the level variable is going to be 7 (this is the max height of the
tree). On the other hand btrfs_cow_block is always called with
"level + 1" as an index into the nodes and slots arrays. This leads to
an out of bounds access. Admittdely this will be benign since an OOB
access of the nodes array will likely read the 0th element from the
slots array, which in this case is going to be 0 (since we start CoW at
the top of the tree). The OOB access into the slots array in turn will
read the 0th and 1st values of the locks array, which would both be 0
at the time. However, this benign behavior relies on the fact that the
path being passed hasn't been initialised, if it has already been used to
query a btree then it could potentially have populated the nodes/slots arrays.
Fix it by explicitly checking if we are at level 7 (the maximum allowed
index in nodes/slots arrays) and explicitly call the CoW routine with
NULL for parent's node/slot.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Fixes-coverity-id: 711515
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These duplicate includes have been found with scripts/checkincludes.pl but
they have been removed manually to avoid removing false positives.
Signed-off-by: Pravin Shedge <pravin.shedge4linux@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function was introduced by 247e743cbe ("Btrfs: Use async helpers
to deal with pages that have been improperly dirtied") and it didn't do
any error handling then. This function might very well fail in ENOMEM
situation, yet it's not handled, this could lead to inconsistent state.
So let's handle the failure by setting the mapping error bit.
Cc: stable@vger.kernel.org
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several places opencoding this conversion, add a helper now
that we have 3 compression algorithms.
Signed-off-by: David Sterba <dsterba@suse.com>
Before returning hole_em in btrfs_get_fiemap_extent we check if it's different
than null. However, by the time this null check is triggered we already know
hole_em is not null because it means it points to the em we found and it
has already been dereferenced.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
trans was statically assigned to NULL and this never changed over the
course of btrfs_get_extent. So remove any code which checks whether
trans != NULL and just hardcode the fact trans is always NULL.
Resolves-coverity-id: 112806
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The return value of sizeof() is of type size_t, so we must print it
using the %z format modifier rather than %l to avoid this warning
on some architectures:
fs/btrfs/tree-checker.c: In function 'check_dir_item':
fs/btrfs/tree-checker.c:273:50: error: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'u32' {aka 'unsigned int'} [-Werror=format=]
Fixes: 005887f2e3e0 ("btrfs: tree-checker: Add checker for dir item")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This changes to use struct completion directly and removes 'struct
scrub_bio_ret' along with the code using it.
This struct is used to get the return value from bio, but the caller can
access bio to get the return value directly and is holding a reference
on it so it won't go away underneath us and can be removed safely.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The defined wait is not used anywhere.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We need to call extent_range_clear_dirty_for_io()
on compression range to prevent application from changing
page content, while pages compressing.
extent_range_clear_dirty_for_io() runs on each loop iteration,
"(end - start)" can be much (up to 1024 times) bigger
then compression range (BTRFS_MAX_UNCOMPRESSED).
The start pointer is advanced each time we manage to compress part of
the range. The end pointer does not change so we could redirty the
remaining parts repeatedly.
Fix that behaviour by call extent_range_clear_dirty_for_io()
only once, the first time it happens.
This is the safest but probably not the best behaviour. Previous
iterations of the patch tried to redirty only the range that we were not
able to compress. This has been refused by David for safety reasons, the
writeout callchain is complex and there could be some path that relies
on redirtying the entire unwritten range.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance changelog, the history and safety concerns, add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Slowest part of heuristic for now is kernel heap sort()
It's can take up to 55% of runtime on sorting bucket items.
As sorting will always call on most data sets to get correctly
byte_core_set_size, the only way to speed up heuristic, is to
speed up sort on bucket.
Add a general radix_sort function.
Radix sort require 2 buffers, one full size of input array
and one for store counters (jump addresses).
That increase usage per heuristic workspace +1KiB
8KiB + 1KiB -> 8KiB + 2KiB
That is LSD Radix, i use 4 bit as a base for calculating,
to make counters array acceptable small (16 elements * 8 byte).
That Radix sort implementation have several points to adjust,
I added him to make radix sort general usable in kernel,
like heap sort, if needed.
Performance tested in userspace copy of heuristic code,
throughput:
- average <-> random data: ~3500 MiB/s - heap sort
- average <-> random data: ~6000 MiB/s - radix sort
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
[ coding style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
Currently device state is being managed by each individual int
variable such as struct btrfs_device::is_tgtdev_for_dev_replace.
Instead of that declare btrfs_device::dev_state
BTRFS_DEV_STATE_FLUSH_SENT and use the bit operations.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently device state is being managed by each individual int
variable such as struct btrfs_device::is_tgtdev_for_dev_replace.
Instead of that declare btrfs_device::dev_state
BTRFS_DEV_STATE_MISSING and use the bit operations.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ whitespace adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Currently device state is being managed by each individual int
variable such as struct btrfs_device::missing. Instead of that
declare btrfs_device::dev_state BTRFS_DEV_STATE_MISSING and use
the bit operations.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by : Nikolay Borisov <nborisov@suse.com>
[ whitespace adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Currently device state is being managed by each individual int
variable such as struct btrfs_device::in_fs_metadata. Instead of
that declare device state BTRFS_DEV_STATE_IN_FS_METADATA and use
the bit operations.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ whitespace adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Currently device state is being managed by each individual int
variable such as struct btrfs_device::writeable. Instead of that
declare device state BTRFS_DEV_STATE_WRITEABLE and use the
bit operations.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ whitespace adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
This patch creates a helper function to get either the rcu device path
or missing.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ rename to btrfs_dev_name, switch to if/else ]
Signed-off-by: David Sterba <dsterba@suse.com>
We can query the bdev directly when needed at btrfs_discard_extent()
so drop btrfs_device::can_discard.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function update_share_count is local to the source and does
not need to be in global scope, so make it static.
Cleans up sparse warning:
fs/btrfs/backref.c:219:6: warning: symbol 'update_share_count' was not
declared. Should it be static?
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 9036c10208 ("Btrfs: update hole handling v2") added the
FLAG_VACANCY to denote holes, however there was already a consistent way
of flagging extents which represent hole - ->block_start =
EXTENT_MAP_HOLE. And also the only place where this flag is checked is
in the fiemap code, but the block_start value is also checked and every
other place in the filesystem detects holes by using block_start
value's. So remove the extra flag. This survived a full xfstest run.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is no longer used outside of extent-tree.c.
Make it static.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
* ZSTD_inBuffer in_buf
* ZSTD_outBuffer out_buf
are used in all functions to pass the compression parameters and the
local variables consume some space. We can move them to the workspace
and reduce the stack consumption:
zstd.c:zstd_decompress -24 (136 -> 112)
zstd.c:zstd_decompress_bio -24 (144 -> 120)
zstd.c:zstd_compress_pages -24 (264 -> 240)
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Nick Terrell <terrelln@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are now 20 bytes of holes, we can reduce that to 4 by minor
changes. Moving 'aborted' to the status and flags is also more logical,
similar for num_dirty_bgs. The size goes from 432 to 416.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Recent updates to the structure left some holes, reorder the types so
the packing is tight. The size goes from 112 to 104 on 64bit.
Signed-off-by: David Sterba <dsterba@suse.com>
The use_count is a reference counter, we can use the refcount_t type,
though we don't use the atomicity. This is not a performance critical
code and we could catch the underflows. The type is changed from long,
but the number of references will fit an int.
Signed-off-by: David Sterba <dsterba@suse.com>
Last user was removed in a monster commit a22285a6a3
("Btrfs: Integrate metadata reservation with start_transaction") in
2010.
Signed-off-by: David Sterba <dsterba@suse.com>
The semantics of adding_csums matches bool, 'short' was most likely used
to save space in a698d0755a ("Btrfs: add a type field for the
transaction handle").
Signed-off-by: David Sterba <dsterba@suse.com>
Due to new_inline logic, the create == 0 is always true at this
point in the code, so the create != 0 branch can be removed.
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace hardcoded numeric argument values for inode_only with the
constants defined for that use.
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The maximum size of a checksum buffer is known, BTRFS_CSUM_SIZE, and we
don't have to allocate it dynamically. This code path is not used at all
as we have only the crc32c and use an on-stack buffer already.
Signed-off-by: David Sterba <dsterba@suse.com>
Setting plug can merge adjacent IOs before dispatching IOs to the disk
driver.
Without plug, it'd not be a problem for single disk usecases, but for
multiple disks using raid profile, a large IO can be split to several
IOs of stripe length, and plug can be helpful to bring them together
for each disk so that we can save several disk access.
Moreover, fsync issues synchronous writes, so plug can really take
effect.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
No functional changes, create btrfs_open_one_device() from
__btrfs_open_devices(). This is a preparatory work to add dynamic
device scan.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ minor whitespace fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
No functional changes. This helps to move the entire section into
a new function.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is in preparation to move a section of code in __btrfs_open_devices()
into a new function so that it can be reused. As we set seeding if any of
the device is having SB flag BTRFS_SUPER_FLAG_SEEDING, so do it in the
device list loop itself. No functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With gcc-4.1.2:
fs/btrfs/ref-verify.c: In function ‘btrfs_build_ref_tree’:
fs/btrfs/ref-verify.c:1017: warning: ‘root’ is used uninitialized in this function
The variable is indeed passed uninitialized, but it is never used by the
callee. However, not all versions of gcc are smart enough to notice.
Hence remove the unused parameter from walk_up_tree() to silence the
compiler warning.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass btrfs_get_extent_fiemap and get_extent_skip_holes
itself is used only as a fiemap helper.
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass btrfs_get_extent_fiemap and we don't expect anything
else in the context of extent_fiemap.
Signed-off-by: David Sterba <dsterba@suse.com>
Previous patches cleaned up all places where
extent_page_data::get_extent was set and it was btrfs_get_extent all the
time, so we can simply call that instead.
This also reduces size of extent_page_data by 8 bytes which has positive
effect on stack consumption on various functions on the write out path.
Signed-off-by: David Sterba <dsterba@suse.com>
Since tree-checker has verified leaf when reading from disk, we don't
need the existing verify_dir_item() or btrfs_is_name_len_valid() checks.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add checker for dir item, for key types DIR_ITEM, DIR_INDEX and
XATTR_ITEM.
This checker does comprehensive checks for:
1) dir_item header and its data size
Against item boundary and maximum name/xattr length.
This part is mostly the same as old verify_dir_item().
2) dir_type
Against maximum file types, and against key type.
Since XATTR key should only have FT_XATTR dir item, and normal dir
item type should not have XATTR key.
The check between key->type and dir_type is newly introduced by this
patch.
3) name hash
For XATTR and DIR_ITEM key, key->offset is name hash (crc32c).
Check the hash of the name against the key to ensure it's correct.
The name hash check is only found in btrfs-progs before this patch.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers use GFP_NOFS, we don't have to pass it as an argument. The
built-in tests pass GFP_KERNEL, but they run only at module load time
and NOFS works there as well.
Signed-off-by: David Sterba <dsterba@suse.com>
Use __clear_extent_bit directly in case we want to pass unknown
gfp flags. Otherwise all clear_extent_bit callers use GFP_NOFS, so we
can sink them to the function and reduce argument count, at the cost
that __clear_extent_bit has to be exported.
Signed-off-by: David Sterba <dsterba@suse.com>
We take the fs_devices::device_list_mutex mutex in write_all_supers
which will prevent any add/del changes to the device list. Therefore we
don't need to use the RCU variant list_for_each_entry_rcu in any of the
called functions.
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need to use the mutex as we do not modify the devices nor the
list itself and just read information about device counts.
Move copying fsid out of the protected section, not applicable to RCU
same as the rest of the retrieved information.
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need to use the mutex as we do not modify the devices nor the
list itself and just read some information:
does not change during device lifetime:
- devid
- uuid
- name (ie. the path)
may change in parallel to the ioctl call, but can lead only to reporting
inacurracy:
- bytes_used
- total_bytes
Signed-off-by: David Sterba <dsterba@suse.com>
A helper to free a device and all it's dynamically allocated members,
like the rcu_string name or flush_bio. This is going to replace all
open coded places.
Signed-off-by: David Sterba <dsterba@suse.com>
Make it clear that it is an RCU helper, we want to use the name
free_device for a wrapper freeing all device members.
Signed-off-by: David Sterba <dsterba@suse.com>
These rules have been hidden in several if-else and are not
straightforward to follow, for example, dio submit hook's nocsum case
has a bug , i.e. doing async submit instead of sync submit, which has
been fixed recently.
This is documenting the rules for reference.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After commit 996478ca9c ("btrfs: change how we decide to commit
transactions during flushing") there is no need to hold the delayed_rsv
during the percpu_counter_compare call since we get the byte's snapshot
earlier. So hold the lock only while reading delayed_rsv.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Adding __init macro gives kernel a hint that this function is only used
during the initialization phase and its memory resources can be freed up
after.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function btrfs_add_device() is adding the device item so rename to
reflect that in the function. Similarly we have btrfs_rm_dev_item().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_create_tree() will unconditionally generate UUID for any root.
So for quota tree and data reloc tree created by kernel, they will have
unique UUIDs.
However UUID in root item is only referred by UUID tree, which only
records UUID for fs trees. This makes unique UUIDs for quota/data reloc
tree meaningless.
Leave the UUID as zero for non-fs tree, making btrfs-debug-tree output
less confusing.
Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A cleanup patch no functional change, we hold volume_mutex before
calling btrfs_rm_device, so move it into the function itself.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Right before we go into this loop locked_end is set to alloc_end - 1 and
is being used in nearby functions, no need to have exceptions. This just
makes the code consistent, no functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fallocating a file in btrfs goes through several stages. The one before
actually inserting the fallocated extents is to create a qgroup
reservation, covering the desired range. To this end there is a loop in
btrfs_fallocate which checks to see if there are holes in the fallocated
range or !PREALLOC extents past EOF and if so create qgroup reservations
for them. Unfortunately, the main condition of the loop is burried right
at the end of its body rather than in the actual while statement which
makes it non-obvious. Fix this by moving the condition in the while
statement where it belongs. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It was introduced because btrfs used to do blkdev_put in a deferred
work, now that btrfs has blkdev_put in place, this rcu_barrier can be
removed.
modprobe -r btrfs will do btrfs_cleanup_fs_uuids(), where it cleanup
every %fs_devices on the list, but when we do btrfs_close_devices(), we
have replaced the devices on the list with dummy ones which only have
the same name and uuid, so modprobe -r btrfs will free those instead of
what we were using, this change won't cause a problem for it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied 2nd paragraph from mailinglist discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_balance_delayed_items is the sole caller of
btrfs_wq_run_delayed_node and already includes one of the checks whether
the delayed inodes should be run. On the other hand
btrfs_wq_run_delayed_node duplicates that check and performs an
additional one for wq congestion.
Let's remove the duplicate check and move the congestion one in
btrfs_balance_delayed_items, leaving btrfs_wq_run_delayed_node to only
care about setting up the wq run. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_async_run_delayed_root's implementation uses 3 goto
labels to mimic the functionality of a simple do {} while loop. Refactor
the function to use a do {} while construct, making intention clear and
code easier to follow. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following callpath is always invoked with mirror_num set to 0, so
let's remove it as an argument and directly pass 0 to __do_redpage. No
functional change.
extent_readpages
__extent_readpages
__do_contiguous_readpages
__do_readpage
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's sole callsite was removed in a previous patch so just nuke it for good.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As per atomic_t.txt documentation :
- RMW operations that have a return value are fully ordered;
atomic_xchg is one such operation so it already includes everything it
needs w.r.t memory ordering and add a comment to be more explicit about
that.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit addc3fa74e ("Btrfs: Fix the problem that the dirty flag of dev
stats is cleared") reworked the way device stats changes are tracked. A
new atomic dev_stats_ccnt counter was introduced which is incremented
every time any of the device stats counters are changed. This serves as
a flag whether there are any pending stats changes. However, this patch
only partially implemented the correct memory barriers necessary:
- It only ordered the stores to the counters but not the reads e.g.
btrfs_run_dev_stats
- It completely omitted any comments documenting the intended design and
how the memory barriers pair with each-other
This patch provides the necessary comments as well as adds a missing
smp_rmb in btrfs_run_dev_stats. Furthermore since dev_stats_cnt is only
a snapshot at best there was no point in reading the counter twice -
once in btrfs_dev_stats_dirty and then again when assigning stats_cnt.
Just collapse both reads into 1.
Fixes: addc3fa74e ("Btrfs: Fix the problem that the dirty flag of dev stats is cleared")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_end_bio() is using btrfs_dev_stat_inc() and then
btrfs_dev_stat_print_on_error() separately instead use
btrfs_dev_stat_inc_and_print() directly.
As of now there isn't any bio in btrfs which is - a non-empty write and
also the REQ_PREFLUSH flag is set. So in actual the condition
if (bio->bi_opf & REQ_PREFLUSH)
is never true in btrfs_end_bio(), and so there won't be any redundant
error log by using btrfs_dev_stat_inc_and_print() separately one for
write and another for flush.
This consolidation will help to add the device critical error handles in
the function btrfs_dev_stat_inc_and_print() and which can be renamed as
needed.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's pointless to defer it to a kthread helper as we're not under a
special context.
For reference, commit 1f78160ce1 ("Btrfs: using rcu lock in the reader
side of devices list") introduced RCU freeing for device structures.
Originally the blkdev_put was called from free_device and rcu_barrier had
to be called. This is no longer required, bdev and our device structures
are now freed separately.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
In functions like btrfs_create(), we run both
btrfs_balance_delayed_items() and btrfs_btree_balance_dirty() after
the operation, but btrfs_btree_balance_dirty() is surely going to run
btrfs_balance_delayed_items().
This keeps only btrfs_btree_balance_dirty().
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add injectable error types for each error-injectable function.
One motivation of error injection test is to find software flaws,
mistakes or mis-handlings of expectable errors. If we find such
flaws by the test, that is a program bug, so we need to fix it.
But if the tester miss input the error (e.g. just return success
code without processing anything), it causes unexpected behavior
even if the caller is correctly programmed to handle any errors.
That is not what we want to test by error injection.
To clarify what type of errors the caller must expect for each
injectable function, this introduces injectable error types:
- EI_ETYPE_NULL : means the function will return NULL if it
fails. No ERR_PTR, just a NULL.
- EI_ETYPE_ERRNO : means the function will return -ERRNO
if it fails.
- EI_ETYPE_ERRNO_NULL : means the function will return -ERRNO
(ERR_PTR) or NULL.
ALLOW_ERROR_INJECTION() macro is expanded to get one of
NULL, ERRNO, ERRNO_NULL to record the error type for
each function. e.g.
ALLOW_ERROR_INJECTION(open_ctree, ERRNO)
This error types are shown in debugfs as below.
====
/ # cat /sys/kernel/debug/error_injection/list
open_ctree [btrfs] ERRNO
io_ctl_init [btrfs] ERRNO
====
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Since error-injection framework is not limited to be used
by kprobes, nor bpf. Other kernel subsystems can use it
freely for checking safeness of error-injection, e.g.
livepatch, ftrace etc.
So this separate error-injection framework from kprobes.
Some differences has been made:
- "kprobe" word is removed from any APIs/structures.
- BPF_ALLOW_ERROR_INJECTION() is renamed to
ALLOW_ERROR_INJECTION() since it is not limited for BPF too.
- CONFIG_FUNCTION_ERROR_INJECTION is the config item of this
feature. It is automatically enabled if the arch supports
error injection feature for kprobe or ftrace etc.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit 17347cec15f919901c90(Btrfs: change how we iterate bios in endio)
mentioned that for dio the submitted bio may be fast cloned, we
can't access the bvec table directly for a cloned bio, so use
bio_get_first_bvec() to retrieve the 1st bvec.
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Cc: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Acked: David Sterba <dsterba@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BTRFS uses bio->bi_vcnt to figure out page numbers, this approach is no
longer valid once we start enabling multipage bvecs.
correct once we start to enable multipage bvec.
Use bio_nr_pages() to do that instead.
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch converts 3 users to bio_last_bvec_all(), so that we can go
ahead and convert to multipage bvec.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch converts to bio_first_bvec_all() & bio_first_page_all() for
retrieving the 1st bvec/page, and prepares for supporting multipage bvec.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlpPux0ACgkQxWXV+ddt
WDs/ORAAgRtjm+OWBb80eV1xJIHGRPRaL6E4OZc6SA7DEA+oCpkkVzOHQz3PV2a2
cAsIUvp9azZd41gzBMw8mIe4AQKLZpud+vEM7QYRlbZFtp3EWmZ1Jht4bJRxC+w7
NjBIEx4MX2KiUeRizmo3iWBVW+RoaRVW1xvFo/k5QchhO8U74SNYzxTGVxd8S/C0
ZanuTowdm71uCJJHkoNWArAsou40QCJOYK19WilRkrf6SGsUqc1zKArRKe2KF4GH
Wyf4Qyp2fm8RRKLOlc9NcsVbVqVg4kBmUXbJPCvltCs+JiyfhX9hahweoHHH8kmH
u/jR3CItVqX+Ft1WAtSpgRzxO0uGu6aVkIql0VHV6wIbGnFoJd9XQ6RPnT/awlOw
1jx8RLOZtVehF6pjyoSngLppqCw/sYpV8QhF32dEFGentO3Wd7CVKTcMOH498dbN
paNzcNEfnTFLbUmViOTXl8AS8VX+3PU2Mgn8W8UxcFYksoIpV9P/LBDS3iIGYMtL
pFFC9fYeipBDOPg2NV4QfCE9ZSqm35c2kAV/hb1nmPtPz4W+Ya5v2y9RSjAU80f4
Y8ZyePg6pjwWOp1dW+TZF0NE8ExzSvgnXAQOdZkiy4Ztc6OwTVhlwRfW1xFy2Py+
riR87A7/mDbiR9IXHgzFZi6WjjVMHDifBKeEpu91cF9JrwJqMBc=
=WIOv
-----END PGP SIGNATURE-----
Merge tag 'for-4.15-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"We have two more fixes for 4.15, both aimed for stable.
The leak fix is obvious, the second patch fixes a bug revealed by the
refcount API, when it behaves differently than previous atomic_t and
reports refs going from 0 to 1 in one case"
* tag 'for-4.15-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
btrfs: Fix flush bio leak
refcounts have a generic implementation and an asm optimized one. The
generic version has extra debugging to make sure that once a refcount
goes to zero, refcount_inc won't increase it.
The btrfs delayed inode code wasn't expecting this, and we're tripping
over the warnings when the generic refcounts are used. We ended up with
this race:
Process A Process B
btrfs_get_delayed_node()
spin_lock(root->inode_lock)
radix_tree_lookup()
__btrfs_release_delayed_node()
refcount_dec_and_test(&delayed_node->refs)
our refcount is now zero
refcount_add(2) <---
warning here, refcount
unchanged
spin_lock(root->inode_lock)
radix_tree_delete()
With the generic refcounts, we actually warn again when process B above
tries to release his refcount because refcount_add() turned into a
no-op.
We saw this in production on older kernels without the asm optimized
refcounts.
The fix used here is to use refcount_inc_not_zero() to detect when the
object is in the middle of being freed and return NULL. This is almost
always the right answer anyway, since we usually end up pitching the
delayed_node if it didn't have fresh data in it.
This also changes __btrfs_release_delayed_node() to remove the extra
check for zero refcounts before radix tree deletion.
btrfs_get_delayed_node() was the only path that was allowing refcounts
to go from zero to one.
Fixes: 6de5f18e7b ("btrfs: fix refcount_t usage when deleting btrfs_delayed_node")
CC: <stable@vger.kernel.org> # 4.12+
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit e0ae999414 ("btrfs: preallocate device flush bio") reworked
the way the flush bio is allocated and used. Concretely it allocates
the bio in __alloc_device and then re-uses it multiple times with a
very simple endio routine that just calls complete() without consuming
a reference. Allocated bios by default come with a ref count of 1,
which is then consumed by the endio routine (or not, in which case they
should be bio_put by the caller). The way the impleementation works now
is that the flush bio has a refcount of 2 and we only ever bio_put it
once, leaving it to hang indefinitely. Fix this by removing the extra
bio_get in __alloc_device.
Fixes: e0ae999414 ("btrfs: preallocate device flush bio")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This link is replicated in most filesystems' config stanzas. Referring
to an archived version of that site is pointless as it mostly deals with
patches; user documentation is available elsewhere.
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Steve French <smfrench@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2017-12-18
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Allow arbitrary function calls from one BPF function to another BPF function.
As of today when writing BPF programs, __always_inline had to be used in
the BPF C programs for all functions, unnecessarily causing LLVM to inflate
code size. Handle this more naturally with support for BPF to BPF calls
such that this __always_inline restriction can be overcome. As a result,
it allows for better optimized code and finally enables to introduce core
BPF libraries in the future that can be reused out of different projects.
x86 and arm64 JIT support was added as well, from Alexei.
2) Add infrastructure for tagging functions as error injectable and allow for
BPF to return arbitrary error values when BPF is attached via kprobes on
those. This way of injecting errors generically eases testing and debugging
without having to recompile or restart the kernel. Tags for opting-in for
this facility are added with BPF_ALLOW_ERROR_INJECTION(), from Josef.
3) For BPF offload via nfp JIT, add support for bpf_xdp_adjust_head() helper
call for XDP programs. First part of this work adds handling of BPF
capabilities included in the firmware, and the later patches add support
to the nfp verifier part and JIT as well as some small optimizations,
from Jakub.
4) The bpftool now also gets support for basic cgroup BPF operations such
as attaching, detaching and listing current BPF programs. As a requirement
for the attach part, bpftool can now also load object files through
'bpftool prog load'. This reuses libbpf which we have in the kernel tree
as well. bpftool-cgroup man page is added along with it, from Roman.
5) Back then commit e87c6bc385 ("bpf: permit multiple bpf attachments for
a single perf event") added support for attaching multiple BPF programs
to a single perf event. Given they are configured through perf's ioctl()
interface, the interface has been extended with a PERF_EVENT_IOC_QUERY_BPF
command in this work in order to return an array of one or multiple BPF
prog ids that are currently attached, from Yonghong.
6) Various minor fixes and cleanups to the bpftool's Makefile as well
as a new 'uninstall' and 'doc-uninstall' target for removing bpftool
itself or prior installed documentation related to it, from Quentin.
7) Add CONFIG_CGROUP_BPF=y to the BPF kernel selftest config file which is
required for the test_dev_cgroup test case to run, from Naresh.
8) Fix reporting of XDP prog_flags for nfp driver, from Jakub.
9) Fix libbpf's exit code from the Makefile when libelf was not found in
the system, also from Jakub.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This was instrumental in reproducing a space cache bug.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This allows us to do error injection with BPF for open_ctree.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlosIf4ACgkQxWXV+ddt
WDspsw//YPhztOkAM7L37Lcv6PuMIBm7AsZax+iUctx9GlE9Yb9dYX+yIGjk3N44
M6oHANP/Af70lGn3jaNlH+BeQre+RFD2KnT+Yyvp/0DV5+v+Bb6wqzrVqeYf9NIr
lf6yc925gX10+DM6UXpYopTmdB8zXXO8xnqmFuT1jC/PrW/g+Hpxi7UtFFcoXwnE
uucdih1LnNC/2pwp4ygQAxMkLnU2foWRsEP9lqsv83ecKDBfVxHUidzEZLTO7L+c
ePc74AcyuPZ7DobuSDyDF4e0Ru5YtY5Zf+KR7RZHag5BNF2YLJE/XtN+hd3YhOQA
7VniaPzUEG74ukvkL3L2oqxrMEavE0IFJtmzT4CM8DlRsGsDnn5n45sGHfo5clr8
33XOq8aiGtbG1vwVbBJOuNQI2SWJxwe1OyAZoV/o1UVrltSCRf+dYL8Yf3IO2K0M
DRnRNqEcZQGfqrVO5Iblw7VzVqY9LKiRESScS0Btvrys+DTVZAgC9CJDwN446E5v
i56PrmT8OcC9MzP9wFIZtg27jiC0ndNwkqUhFrt1LBvC+BtvZvshAnFLhLfSRyZo
0gqp2GoP6CFaUd5Ok+osALWF2VG8cpMJ7urdX0O5zXEYKioLwiXUS9Z7sldfHsJr
Uiy1uh70UIOM96ZcsXyjLr0LO5vmgkV2kyDNbR5DtrJhfFai4Gs=
=YaZE
-----END PGP SIGNATURE-----
Merge tag 'for-4.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"This contains a few fixes (error handling, quota leak, FUA vs
nobarrier mount option).
There's one one worth mentioning separately - an off-by-one fix that
leads to overwriting first byte of an adjacent page with 0, out of
bounds of the memory allocated by an ioctl. This is under a privileged
part of the ioctl, can be triggerd in some subvolume layouts"
* tag 'for-4.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: Fix possible off-by-one in btrfs_search_path_in_tree
Btrfs: disable FUA if mounted with nobarrier
btrfs: fix missing error return in btrfs_drop_snapshot
btrfs: handle errors while updating refcounts in update_ref_for_cow
btrfs: Fix quota reservation leak on preallocated files
The name char array passed to btrfs_search_path_in_tree is of size
BTRFS_INO_LOOKUP_PATH_MAX (4080). So the actual accessible char indexes
are in the range of [0, 4079]. Currently the code uses the define but this
represents an off-by-one.
Implications:
Size of btrfs_ioctl_ino_lookup_args is 4096, so the new byte will be
written to extra space, not some padding that could be provided by the
allocator.
btrfs-progs store the arguments on stack, but kernel does own copy of
the ioctl buffer and the off-by-one overwrite does not affect userspace,
but the ending 0 might be lost.
Kernel ioctl buffer is allocated dynamically so we're overwriting
somebody else's memory, and the ioctl is privileged if args.objectid is
not 256. Which is in most cases, but resolving a subvolume stored in
another directory will trigger that path.
Before this patch the buffer was one byte larger, but then the -1 was
not added.
Fixes: ac8e9819d7 ("Btrfs: add search and inode lookup ioctls")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ added implications ]
Signed-off-by: David Sterba <dsterba@suse.com>
I was seeing disk flushes still happening when I mounted a Btrfs
filesystem with nobarrier for testing. This is because we use FUA to
write out the first super block, and on devices without FUA support, the
block layer translates FUA to a flush. Even on devices supporting true
FUA, using FUA when we asked for no barriers is surprising.
Fixes: 387125fc72 ("Btrfs: fix barrier flushes")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If btrfs_del_root fails in btrfs_drop_snapshot, we'll pick up the
error but then return 0 anyway due to mixing err and ret.
Fixes: 79787eaab4 ("btrfs: replace many BUG_ONs with proper error handling")
Cc: <stable@vger.kernel.org> # v3.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit fb235dc06f (btrfs: qgroup: Move half of the qgroup
accounting time out of commit trans) the assumption that
btrfs_add_delayed_{data,tree}_ref can only return 0 or -ENOMEM has
been false. The qgroup operations call into btrfs_search_slot
and friends and can now return the full spectrum of error codes.
Fortunately, the fix here is easy since update_ref_for_cow failing
is already handled so we just need to bail early with the error
code.
Fixes: fb235dc06f (btrfs: qgroup: Move half of the qgroup accounting ...)
Cc: <stable@vger.kernel.org> # v4.11+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Edmund Nadolski <enadolski@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit c6887cd111 ("Btrfs: don't do nocow check unless we have to")
changed the behavior of __btrfs_buffered_write() so that it first tries
to get a data space reservation, and then skips the relatively expensive
nocow check if the reservation succeeded.
If we have quotas enabled, the data space reservation also includes a
quota reservation. But in the rewrite case, the space has already been
accounted for in qgroups. So btrfs_check_data_free_space() increases
the quota reservation, but it never gets decreased when the data
actually gets written and overwrites the pre-existing data. So we're
left with both the qgroup and qgroup reservation accounting for the same
space.
This commit adds the missing btrfs_qgroup_free_data() call in the case
of BTRFS_ORDERED_PREALLOC extents.
Fixes: c6887cd111 ("Btrfs: don't do nocow check unless we have to")
Signed-off-by: Justin Maggard <jmaggard@netgear.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlofBpkACgkQxWXV+ddt
WDvtTQ//emI1QsD4N0e4BxMcZ1bcigiEk3jc4gj+biRapnMHHAHOqJbVtpK1v8gS
PCTw+4uD5UOGvhBtS4kXJn8e2qxWCESWJDXwVlW0RHmWLfwd9z7ly0sBMi3oiIqH
qief8CIkk3oTexiuuJ3mZGxqnDQjRGtWx2LM+bRJBWMk+jN32v2ObSlv9V505a5M
1daDBsjWojFWa8d4r3YZNJq1df2om/dwVQZ0Wk59bacIo9Xbvok0X459cOlWuv0p
mjx8m8uA/z+HVdkTYlzyKpq08O8Z4shj3GrBbSnZ511gKzV+c9jJPxij5pKm3Z2z
KW4Mp17+/7GSNcSsJiqnOYi+wtOrak2lD0COlZTijnY2jrv18h8ianoIM6CpzUdy
+b09yuFXbPLoUfyl6vFaO/JHuvAkQdaR2tJbds6lvW+liC1ReoL4W1WcUjY6nv9f
6wTaIv0vwgrHaxeIzxKNpnsTlpHAgorFFk0/w8nLb40WX8AoJ/95lo2zws8oaFDN
0Fylu3NYhoDrJZK+D8dbsWx2eTsFVCqep4w0+iEVZl3lfuy3FZl1pu8CL7ru9vJl
DNieh+lUvK1Fk+SYIoilGoriW96RbU8+jPo2W4A1ENzeMJfrNCSWtUSZZp4XT4tO
8m1PGud07XBLSxd62bAEDV3KZO2DnY1WxgXbKuIHSi9D5CI1LMo=
=7UW+
-----END PGP SIGNATURE-----
Merge tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"We've collected some fixes in since the pre-merge window freeze.
There's technically only one regression fix for 4.15, but the rest
seems important and candidates for stable.
- fix missing flush bio puts in error cases (is serious, but rarely
happens)
- fix reporting stat::st_blocks for buffered append writes
- fix space cache invalidation
- fix out of bound memory access when setting zlib level
- fix potential memory corruption when fsync fails in the middle
- fix crash in integrity checker
- incremetnal send fix, path mixup for certain unlink/rename
combination
- pass flags to writeback so compressed writes can be throttled
properly
- error handling fixes"
* tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: incremental send, fix wrong unlink path after renaming file
btrfs: tree-checker: Fix false panic for sanity test
Btrfs: fix list_add corruption and soft lockups in fsync
btrfs: Fix wild memory access in compression level parser
btrfs: fix deadlock when writing out space cache
btrfs: clear space cache inode generation always
Btrfs: fix reported number of inode blocks after buffered append writes
Btrfs: move definition of the function btrfs_find_new_delalloc_bytes
Btrfs: bail out gracefully rather than BUG_ON
btrfs: dev_alloc_list is not protected by RCU, use normal list_del
btrfs: add missing device::flush_bio puts
btrfs: Fix transaction abort during failure in btrfs_rm_dev_item
Btrfs: add write_flags for compression bio
Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:
Parent snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- b/ (ino 259)
| | |---- c/ (ino 260)
| | |---- f2 (ino 261)
| |
| |---- f2l1 (ino 261)
|
|---- d/ (ino 262)
|---- f1l1_2 (ino 258)
|---- f2l2 (ino 261)
|---- f1_2 (ino 258)
Send snapshot
. (ino 256)
|---- a/ (ino 257)
| |---- f2l1/ (ino 263)
| |---- b2/ (ino 259)
| |---- c/ (ino 260)
| | |---- d3 (ino 262)
| | |---- f1l1_2 (ino 258)
| | |---- f2l2_2 (ino 261)
| | |---- f1_2 (ino 258)
| |
| |---- f2 (ino 261)
| |---- f1l2 (ino 258)
|
|---- d (ino 261)
When computing the incremental send stream the following steps happen:
1) When processing inode 261, a rename operation is issued that renames
inode 262, which currently as a path of "d", to an orphan name of
"o262-7-0". This is done because in the send snapshot, inode 261 has
of its hard links with a path of "d" as well.
2) Two link operations are issued that create the new hard links for
inode 261, whose names are "d" and "f2l2_2", at paths "/" and
"o262-7-0/" respectively.
3) Still while processing inode 261, unlink operations are issued to
remove the old hard links of inode 261, with names "f2l1" and "f2l2",
at paths "a/" and "d/". However path "d/" does not correspond anymore
to the directory inode 262 but corresponds instead to a hard link of
inode 261 (link command issued in the previous step). This makes the
receiver fail with a ENOTDIR error when attempting the unlink
operation.
The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of ancestors for inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.
A test case for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If we run btrfs with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y, it will
instantly cause kernel panic like:
------
...
assertion failed: 0, file: fs/btrfs/disk-io.c, line: 3853
...
Call Trace:
btrfs_mark_buffer_dirty+0x187/0x1f0 [btrfs]
setup_items_for_insert+0x385/0x650 [btrfs]
__btrfs_drop_extents+0x129a/0x1870 [btrfs]
...
-----
[Cause]
Btrfs will call btrfs_check_leaf() in btrfs_mark_buffer_dirty() to check
if the leaf is valid with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y.
However quite some btrfs_mark_buffer_dirty() callers(*) don't really
initialize its item data but only initialize its item pointers, leaving
item data uninitialized.
This makes tree-checker catch uninitialized data as error, causing
such panic.
*: These callers include but not limited to
setup_items_for_insert()
btrfs_split_item()
btrfs_expand_item()
[Fix]
Add a new parameter @check_item_data to btrfs_check_leaf().
With @check_item_data set to false, item data check will be skipped and
fallback to old btrfs_check_leaf() behavior.
So we can still get early warning if we screw up item pointers, and
avoid false panic.
Cc: Filipe Manana <fdmanana@gmail.com>
Reported-by: Lakshmipathi.G <lakshmipathi.g@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a pure automated search-and-replace of the internal kernel
superblock flags.
The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.
Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.
The script to do this was:
# places to look in; re security/*: it generally should *not* be
# touched (that stuff parses mount(2) arguments directly), but
# there are two places where we really deal with superblock flags.
FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
include/linux/fs.h include/uapi/linux/bfs_fs.h \
security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
# the list of MS_... constants
SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
ACTIVE NOUSER"
SED_PROG=
for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
# we want files that contain at least one of MS_...,
# with fs/namespace.c and fs/pnode.c excluded.
L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
for f in $L; do sed -i $f $SED_PROG; done
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xfstests btrfs/146 revealed this corruption,
[ 58.138831] Buffer I/O error on dev dm-0, logical block 2621424, async page read
[ 58.151233] BTRFS error (device sdf): bdev /dev/mapper/error-test errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[ 58.152403] list_add corruption. prev->next should be next (ffff88005e6775d8), but was ffffc9000189be88. (prev=ffffc9000189be88).
[ 58.153518] ------------[ cut here ]------------
[ 58.153892] WARNING: CPU: 1 PID: 1287 at lib/list_debug.c:31 __list_add_valid+0x169/0x1f0
...
[ 58.157379] RIP: 0010:__list_add_valid+0x169/0x1f0
...
[ 58.161956] Call Trace:
[ 58.162264] btrfs_log_inode_parent+0x5bd/0xfb0 [btrfs]
[ 58.163583] btrfs_log_dentry_safe+0x60/0x80 [btrfs]
[ 58.164003] btrfs_sync_file+0x4c2/0x6f0 [btrfs]
[ 58.164393] vfs_fsync_range+0x5f/0xd0
[ 58.164898] do_fsync+0x5a/0x90
[ 58.165170] SyS_fsync+0x10/0x20
[ 58.165395] entry_SYSCALL_64_fastpath+0x1f/0xbe
...
It turns out that we could record btrfs_log_ctx:io_err in
log_one_extents when IO fails, but make log_one_extents() return '0'
instead of -EIO, so the IO error is not acknowledged by the callers,
i.e. btrfs_log_inode_parent(), which would remove btrfs_log_ctx:list
from list head 'root->log_ctxs'. Since btrfs_log_ctx is allocated
from stack memory, it'd get freed with a object alive on the
list. then a future list_add will throw the above warning.
This returns the correct error in the above case.
Jeff also reported this while testing against his fsync error
patch set[1].
[1]: https://www.spinics.net/lists/linux-btrfs/msg65308.html
"btrfs list corruption and soft lockups while testing writeback error handling"
Fixes: 8407f55326 ("Btrfs: fix data corruption after fast fsync and writeback error")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Kernel panic when mounting with "-o compress" mount option.
KASAN will report like:
------
==================================================================
BUG: KASAN: wild-memory-access in strncmp+0x31/0xc0
Read of size 1 at addr d86735fce994f800 by task mount/662
...
Call Trace:
dump_stack+0xe3/0x175
kasan_report+0x163/0x370
__asan_load1+0x47/0x50
strncmp+0x31/0xc0
btrfs_compress_str2level+0x20/0x70 [btrfs]
btrfs_parse_options+0xff4/0x1870 [btrfs]
open_ctree+0x2679/0x49f0 [btrfs]
btrfs_mount+0x1b7f/0x1d30 [btrfs]
mount_fs+0x49/0x190
vfs_kern_mount.part.29+0xba/0x280
vfs_kern_mount+0x13/0x20
btrfs_mount+0x31e/0x1d30 [btrfs]
mount_fs+0x49/0x190
vfs_kern_mount.part.29+0xba/0x280
do_mount+0xaad/0x1a00
SyS_mount+0x98/0xe0
entry_SYSCALL_64_fastpath+0x1f/0xbe
------
[Cause]
For 'compress' and 'compress_force' options, its token doesn't expect
any parameter so its args[0] contains uninitialized data.
Accessing args[0] will cause above wild memory access.
[Fix]
For Opt_compress and Opt_compress_force, set compression level to
the default.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ set the default in advance ]
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to prepare our pages for whatever reason (out of memory in
our case) we need to make sure to drop the block_group->data_rwsem,
otherwise hilarity ensues.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add label and use existing unlocking code ]
Signed-off-by: David Sterba <dsterba@suse.com>
We discovered a box that had double allocations, and suspected the space
cache may be to blame. While auditing the write out path I noticed that
if we've already setup the space cache we will just carry on. This
means that any error we hit after cache_save_setup before we go to
actually write the cache out we won't reset the inode generation, so
whatever was already written will be considered correct, except it'll be
stale. Fix this by _always_ resetting the generation on the block group
inode, this way we only ever have valid or invalid cache.
With this patch I was no longer able to reproduce cache corruption with
dm-log-writes and my bpf error injection tool.
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIVAwUAWgm9V/Sw1s6N8H32AQK5mQ//QGUDZLXsUPCtq0XJq0V+r4MUjNp9tCZR
htiuNrEkHSyPpYgCcQ2Aqdl9kndwVXcE7lWT99mp/a0zwNAsp9GOGVhCXUd5R86G
XlrBuUYVvBJk18tDsUNWdjRQ0gMHgQSlEnEbsaGiU1bVrpXatI9hL8qoeO78Iy7+
eaJUQLCuCVJq7qMQGhC0hg338vmHVeYhnViXIxq+HFjsMmR9IVanuK+sQr6NSJxS
F6RkPxBUPWkRVMHmxTLWj/XSHZwtwu+Mnc/UFYsAPLKEbY0cIohsI8EgfE8U7geU
yRVnu3MIOXUXUrZizj9SwVYWdJfneRlINqMbHIO8QXMKR38tnQ0C2/7bgBsXiNPv
YdiAyeqL4nM+JthV/rgA3hWgupwBlSb4ubclTphDNxMs5MBIUIK3XUt9GOXDDUZz
2FT/FdrphM2UORaI2AEOi4Q0/nHdin+3rld8fjV0Ree/TPNXwcrOmvy8yGnxFCEp
5b7YLwKrffZGnnS965dhZlnFR6hjndmzFgHdyRrJwc80hXi1Q/+W4F19MoYkkoVK
G/gLvD3FbmygmFnjCik9TjUrro6vQxo56H/TuWgHTvYriNGH+D/D7EGUwg4GiXZZ
+7vrNw660uXmZiu9i0YacCRyD8lvm7QpmWLb+uHwzfsBE1+C8UetyQ+egSWVdWJO
KwPspygWXD4=
=3vy0
-----END PGP SIGNATURE-----
Merge tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS updates from David Howells:
"kAFS filesystem driver overhaul.
The major points of the overhaul are:
(1) Preliminary groundwork is laid for supporting network-namespacing
of kAFS. The remainder of the namespacing work requires some way
to pass namespace information to submounts triggered by an
automount. This requires something like the mount overhaul that's
in progress.
(2) sockaddr_rxrpc is used in preference to in_addr for holding
addresses internally and add support for talking to the YFS VL
server. With this, kAFS can do everything over IPv6 as well as
IPv4 if it's talking to servers that support it.
(3) Callback handling is overhauled to be generally passive rather
than active. 'Callbacks' are promises by the server to tell us
about data and metadata changes. Callbacks are now checked when
we next touch an inode rather than actively going and looking for
it where possible.
(4) File access permit caching is overhauled to store the caching
information per-inode rather than per-directory, shared over
subordinate files. Whilst older AFS servers only allow ACLs on
directories (shared to the files in that directory), newer AFS
servers break that restriction.
To improve memory usage and to make it easier to do mass-key
removal, permit combinations are cached and shared.
(5) Cell database management is overhauled to allow lighter locks to
be used and to make cell records autonomous state machines that
look after getting their own DNS records and cleaning themselves
up, in particular preventing races in acquiring and relinquishing
the fscache token for the cell.
(6) Volume caching is overhauled. The afs_vlocation record is got rid
of to simplify things and the superblock is now keyed on the cell
and the numeric volume ID only. The volume record is tied to a
superblock and normal superblock management is used to mediate
the lifetime of the volume fscache token.
(7) File server record caching is overhauled to make server records
independent of cells and volumes. A server can be in multiple
cells (in such a case, the administrator must make sure that the
VL services for all cells correctly reflect the volumes shared
between those cells).
Server records are now indexed using the UUID of the server
rather than the address since a server can have multiple
addresses.
(8) File server rotation is overhauled to handle VMOVED, VBUSY (and
similar), VOFFLINE and VNOVOL indications and to handle rotation
both of servers and addresses of those servers. The rotation will
also wait and retry if the server says it is busy.
(9) Data writeback is overhauled. Each inode no longer stores a list
of modified sections tagged with the key that authorised it in
favour of noting the modified region of a page in page->private
and storing a list of keys that made modifications in the inode.
This simplifies things and allows other keys to be used to
actually write to the server if a key that made a modification
becomes useless.
(10) Writable mmap() is implemented. This allows a kernel to be build
entirely on AFS.
Note that Pre AFS-3.4 servers are no longer supported, though this can
be added back if necessary (AFS-3.4 was released in 1998)"
* tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (35 commits)
afs: Protect call->state changes against signals
afs: Trace page dirty/clean
afs: Implement shared-writeable mmap
afs: Get rid of the afs_writeback record
afs: Introduce a file-private data record
afs: Use a dynamic port if 7001 is in use
afs: Fix directory read/modify race
afs: Trace the sending of pages
afs: Trace the initiation and completion of client calls
afs: Fix documentation on # vs % prefix in mount source specification
afs: Fix total-length calculation for multiple-page send
afs: Only progress call state at end of Tx phase from rxrpc callback
afs: Make use of the YFS service upgrade to fully support IPv6
afs: Overhaul volume and server record caching and fileserver rotation
afs: Move server rotation code into its own file
afs: Add an address list concept
afs: Overhaul cell database management
afs: Overhaul permit caching
afs: Overhaul the callback handling
afs: Rename struct afs_call server member to cm_server
...
Every pagevec_init user claims the pages being released are hot even in
cases where it is unlikely the pages are hot. As no one cares about the
hotness of pages being released to the allocator, just ditch the
parameter.
No performance impact is expected as the overhead is marginal. The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.
Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages. Just drop the argument.
Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want only pages from given range in btree_write_cache_pages() and
extent_write_cache_pages(). Use pagevec_lookup_range_tag() instead of
pagevec_lookup_tag() and remove unnecessary code.
Link: http://lkml.kernel.org/r/20171009151359.31984-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patch from commit a7e3b975a0 ("Btrfs: fix reported number of inode
blocks") introduced a regression where if we do a buffered write starting
at position equal to or greater than the file's size and then stat(2) the
file before writeback is triggered, the number of used blocks does not
change (unless there's a prealloc/unwritten extent). Example:
$ xfs_io -f -c "pwrite -S 0xab 0 64K" foobar
$ du -h foobar
0 foobar
$ sync
$ du -h foobar
64K foobar
The first version of that patch didn't had this regression and the second
version, which was the one committed, was made only to address some
performance regression detected by the intel test robots using fs_mark.
This fixes the regression by setting the new delaloc bit in the range, and
doing it at btrfs_dirty_pages() while setting the regular dealloc bit as
well, so that this way we set both bits at once avoiding navigation of the
inode's io tree twice. Doing it at btrfs_dirty_pages() is also the most
meaninful place, as we should set the new dellaloc bit when if we set the
delalloc bit, which happens only if we copied bytes into the pages at
__btrfs_buffered_write().
This was making some of LTP's du tests fail, which can be quickly run
using a command line like the following:
$ ./runltp -q -p -l /ltp.log -f commands -s du -d /mnt
Fixes: a7e3b975a0 ("Btrfs: fix reported number of inode blocks")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the definition of the function btrfs_find_new_delalloc_bytes() closer
to the function btrfs_dirty_pages(), because in a future commit it will be
used exclusively by btrfs_dirty_pages(). This just moves the function's
definition, with no functional changes at all.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If a file's DIR_ITEM key is invalid (due to memory errors) and gets
written to disk, a future lookup_path can end up with kernel panic due
to BUG_ON().
This gets rid of the BUG_ON(), meanwhile output the corrupted key and
return ENOENT if it's invalid.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Guillaume Bouchard <bouchard@mercs-eng.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The dev_alloc_list list could be protected by various mutexes,
depending on the context. The list tracks devices that can take part of
allocating new chunks, so the closest mutex is chunk_mutex. Adding a new
device from inside the ADD_DEV ioctl will need device_list_mutex and
registering a new device from the ioctl needs uuid_mutex.
All mutexes naturally guarantee exclusivity against the same context.
The device ownership can move between the contexts and the exclusivity
is guaranteed by other means, eg. during the mount with the uuid_mutex.
There's no RCU involved for dev_alloc_list.
Signed-off-by: David Sterba <dsterba@suse.com>
This fixes potential bio leaks, in several error paths. Unfortunatelly
the device structure freeing is opencoded in many places and I missed
them when introducing the flush_bio.
Most of the time, devices get freed through call_rcu(..., free_device),
so it at least it's not that easy to hit the leak, but it's still
possible through the path that frees stale devices.
Fixes: e0ae999414 ("btrfs: preallocate device flush bio")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_rm_dev_item calls several function under an active transaction,
however it fails to abort it if an error happens. Fix this by adding
explicit btrfs_abort_transaction/btrfs_end_transaction calls.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Compression code path has only flaged bios with REQ_OP_WRITE no matter
where the bios come from, but it could be a sync write if fsync starts
this writeback or a normal writeback write if wb kthread starts a
periodic writeback.
It breaks the rule that sync writes and writeback writes need to be
differentiated from each other, because from the POV of block layer,
all bios need to be recognized by these flags in order to do some
management, e.g. throttlling.
This passes writeback_control to compression write path so that it can
send bios with proper flags to block layer.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs updates from David Sterba:
"There are some new user features and the usual load of invisible
enhancements or cleanups.
New features:
- extend mount options to specify zlib compression level, -o
compress=zlib:9
- v2 of ioctl "extent to inode mapping", addressing a usecase where
we want to retrieve more but inaccurate results and do the
postprocessing in userspace, aiding defragmentation or
deduplication tools
- populate compression heuristics logic, do data sampling and try to
guess compressibility by: looking for repeated patterns, counting
unique byte values and distribution, calculating Shannon entropy;
this will need more benchmarking and possibly fine tuning, but the
base should be good enough
- enable indexing for btrfs as lower filesystem in overlayfs
- speedup page cache readahead during send on large files
Internal enhancements:
- more sanity checks of b-tree items when reading them from disk
- more EINVAL/EUCLEAN fixups, missing BLK_STS_* conversion, other
errno or error handling fixes
- remove some homegrown IO-related logic, that's been obsoleted by
core block layer changes (batching, plug/unplug, own counters)
- add ref-verify, optional debugging feature to verify extent
reference accounting
- simplify code handling outstanding extents, make it more clear
where and how the accounting is done
- make delalloc reservations per-inode, simplify the code and make
the logic more straightforward
- extensive cleanup of delayed refs code
Notable fixes:
- fix send ioctl on 32bit with 64bit kernel"
* 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (102 commits)
btrfs: Fix bug for misused dev_t when lookup in dev state hash table.
Btrfs: heuristic: add Shannon entropy calculation
Btrfs: heuristic: add byte core set calculation
Btrfs: heuristic: add byte set calculation
Btrfs: heuristic: add detection of repeated data patterns
Btrfs: heuristic: implement sampling logic
Btrfs: heuristic: add bucket and sample counters and other defines
Btrfs: compression: separate heuristic/compression workspaces
btrfs: move btrfs_truncate_block out of trans handle
btrfs: don't call btrfs_start_delalloc_roots in flushoncommit
btrfs: track refs in a rb_tree instead of a list
btrfs: add a comp_refs() helper
btrfs: switch args for comp_*_refs
btrfs: make the delalloc block rsv per inode
btrfs: add tracepoints for outstanding extents mods
Btrfs: rework outstanding_extents
btrfs: increase output size for LOGICAL_INO_V2 ioctl
btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2
btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents
btrfs: send: remove unused code
...
Make wait_on_atomic_t() pass the TASK_* mode onto its action function as an
extra argument and make it 'unsigned int throughout.
Also, consolidate a bunch of identical action functions into a default
function that can do the appropriate thing for the mode.
Also, change the argument name in the bit_wait*() function declarations to
reflect the fact that it's the mode and not the bit number.
[Peter Z gives this a grudging ACK, but thinks that the whole atomic_t wait
should be done differently, though he's not immediately sure as to how]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
cc: Ingo Molnar <mingo@kernel.org>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix bug of commit 74d46992e0 ("block: replace bi_bdev with a gendisk
pointer and partitions index").
bio_dev(bio) is used to find the dev state in function
__btrfsic_submit_bio. But when dev_state is added to the hashtable, it
is using dev_t of block_device.
bio_dev(bio) returns a dev_t of part0 which is different from dev_t in
block_device(bd_dev). bd_dev in block_device represents the exact
partition.
block_device.bd_dev =
bio->bi_partno (same as block_device.bd_partno) + bio_dev(bio).
When adding a dev_state into hashtable, we use the exact partition dev_t.
So when looking it up, it should also use the exact partition dev_t.
Reproducer of this bug:
Use MOUNT_OPTIONS="-o check_int" and run btrfs/001 in fstests.
Then there will be WARNING like below.
WARNING:
btrfs: attempt to write superblock which references block M @29523968 (sda7 /1111654400/2) which is never written!
Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Byte distribution check in heuristic will filter edge data cases and
some time fail to classify input data.
Let's fix that by adding Shannon entropy calculation, that will cover
classification of most other data types.
As Shannon entropy needs log2 with some precision to work, let's use
ilog2(N) and for increased precision, by do ilog2(pow(N, 4)).
Shannon entropy has been slightly changed to avoid signed numbers and
division.
The calculation is direct by the formula, successor of precalculated
table or chains of if-else.
The accuracy errors of ilog2 are compensated by
@ENTROPY_LVL_ACEPTABLE 70 -> 65
@ENTROPY_LVL_HIGH 85 -> 80
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Calculate byte core set for data sample:
- sort buckets' numbers in decreasing order
- count how many values cover 90% of the sample
If the core set size is low (<=25%), data are easily compressible.
If the core set size is high (>=80%), data are not compressible.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Calculate byte set size for data sample:
- calculate how many unique bytes have been in the sample
- for all bytes count > 0, check if we're still in the low count range
(~25%), such data are easily compressible, otherwise furhter analysis
is needed
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Walk over data sample and use memcmp to detect repeated patterns, like
zeros, but a bit more general.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor coding style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
Copy sample data from the input data range to sample buffer then
calculate byte value count for that sample into bucket.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
Add basic defines and structures for data sampling.
Added macros:
- For future sampling algo
- For bucket size
Heuristic workspace:
- Add bucket for storing byte type counters
- Add sample array for storing partial copy of input data range
- Add counter for store current sample size to workspace
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor coding style fixes, comments updated ]
Signed-off-by: David Sterba <dsterba@suse.com>
Compression heuristic itself is not a compression type, as current
infrastructure provides workspaces for several compression types, it's
difficult to just add heuristic workspace.
Just refactor the code to support compression/heuristic workspaces with
maximum code sharing and minimum changes in it.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ coding style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
Since we do a delalloc reserve in btrfs_truncate_block we can deadlock
with freeze. If somebody else is trying to allocate metadata for this
inode and it gets stuck in start_delalloc_inodes because of freeze we
will deadlock. Be safe and move this outside of a trans handle. This
also has a side-effect of making sure that we're not leaving stale data
behind in the other_encoding or encryption case. Not an issue now since
nobody uses it, but it would be a problem in the future.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're holding the sb_start_intwrite lock at this point, and doing async
filemap_flush of the inodes will result in a deadlock if we freeze the
fs during this operation. This is because we could do a
btrfs_join_transaction() in the thread we are waiting on which would
block at sb_start_intwrite, and thus deadlock. Using
writeback_inodes_sb() side steps the problem by not introducing all of
these extra locking dependencies.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we get a significant amount of delayed refs for a single block (think
modifying multiple snapshots) we can end up spending an ungodly amount
of time looping through all of the entries trying to see if they can be
merged. This is because we only add them to a list, so we have O(2n)
for every ref head. This doesn't make any sense as we likely have refs
for different roots, and so they cannot be merged. Tracking in a tree
will allow us to break as soon as we hit an entry that doesn't match,
making our worst case O(n).
With this we can also merge entries more easily. Before we had to hope
that matching refs were on the ends of our list, but with the tree we
can search down to exact matches and merge them at insert time.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of open-coding the delayed ref comparisons, add a helper to do
the comparisons generically and use that everywhere. We compare
sequence numbers last for following patches.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make it more consistent, we want the inserted ref to be compared against
what's already in there. This will make the order go from lowest seq ->
highest seq, which will make us more likely to make forward progress if
there's a seqlock currently held.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The way we handle delalloc metadata reservations has gotten
progressively more complicated over the years. There is so much cruft
and weirdness around keeping the reserved count and outstanding counters
consistent and handling the error cases that it's impossible to
understand.
Fix this by making the delalloc block rsv per-inode. This way we can
calculate the actual size of the outstanding metadata reservations every
time we make a change, and then reserve the delta based on that amount.
This greatly simplifies the code everywhere, and makes the error
handling in btrfs_delalloc_reserve_metadata far less terrifying.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is handy for tracing problems with modifying the outstanding
extents counters.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Right now we do a lot of weird hoops around outstanding_extents in order
to keep the extent count consistent. This is because we logically
transfer the outstanding_extent count from the initial reservation
through the set_delalloc_bits. This makes it pretty difficult to get a
handle on how and when we need to mess with outstanding_extents.
Fix this by revamping the rules of how we deal with outstanding_extents.
Now instead everybody that is holding on to a delalloc extent is
required to increase the outstanding extents count for itself. This
means we'll have something like this
btrfs_delalloc_reserve_metadata - outstanding_extents = 1
btrfs_set_extent_delalloc - outstanding_extents = 2
btrfs_release_delalloc_extents - outstanding_extents = 1
for an initial file write. Now take the append write where we extend an
existing delalloc range but still under the maximum extent size
btrfs_delalloc_reserve_metadata - outstanding_extents = 2
btrfs_set_extent_delalloc
btrfs_set_bit_hook - outstanding_extents = 3
btrfs_merge_extent_hook - outstanding_extents = 2
btrfs_delalloc_release_extents - outstanding_extnets = 1
In order to make the ordered extent transition we of course must now
make ordered extents carry their own outstanding_extent reservation, so
for cow_file_range we end up with
btrfs_add_ordered_extent - outstanding_extents = 2
clear_extent_bit - outstanding_extents = 1
btrfs_remove_ordered_extent - outstanding_extents = 0
This makes all manipulations of outstanding_extents much more explicit.
Every successful call to btrfs_delalloc_reserve_metadata _must_ now be
combined with btrfs_release_delalloc_extents, even in the error case, as
that is the only function that actually modifies the
outstanding_extents counter.
The drawback to this is now we are much more likely to have transient
cases where outstanding_extents is much larger than it actually should
be. This could happen before as we manipulated the delalloc bits, but
now it happens basically at every write. This may put more pressure on
the ENOSPC flushing code, but I think making this code simpler is worth
the cost. I have another change coming to mitigate this side-effect
somewhat.
I also added trace points for the counter manipulation. These were used
by a bpf script I wrote to help track down leak issues.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Build-server workloads have hundreds of references per file after dedup.
Multiply by a few snapshots and we quickly exhaust the limit of 2730
references per extent that can fit into a 64K buffer.
Raise the limit to 16M to be consistent with other btrfs ioctls
(e.g. TREE_SEARCH_V2, FILE_EXTENT_SAME).
To minimize surprising userspace behavior, apply this change only to
the LOGICAL_INO_V2 ioctl.
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Tested-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that check_extent_in_eb()'s extent offset filter can be turned off,
we need a way to do it from userspace.
Add a 'flags' field to the btrfs_logical_ino_args structure to disable
extent offset filtering, taking the place of one of the existing
reserved[] fields.
Previous versions of LOGICAL_INO neglected to check whether any of the
reserved fields have non-zero values. Assigning meaning to those fields
now may change the behavior of existing programs that left these fields
uninitialized. The lack of a zero check also means that new programs
have no way to know whether the kernel is honoring the flags field.
To avoid these problems, define a new ioctl LOGICAL_INO_V2. We can
use the same argument layout as LOGICAL_INO, but shorten the reserved[]
array by one element and turn it into the 'flags' field. The V2 ioctl
explicitly checks that reserved fields and unsupported flag bits are zero
so that userspace can negotiate future feature bits as they are defined.
Since the memory layouts of the two ioctls' arguments are compatible,
there is no need for a separate function for logical_to_ino_v2 (contrast
with tree_search_v2 vs tree_search where the layout and code are quite
different). A version parameter and an 'if' statement will suffice.
Now that we have a flags field in logical_ino_args, add a flag
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want,
and pass it down the stack to iterate_inodes_from_logical.
Motivation and background, copied from the patchset cover letter:
Suppose we have a file with one extent:
root@tester:~# zcat /usr/share/doc/cpio/changelog.gz > /test/a
root@tester:~# sync
Split the extent by overwriting it in the middle:
root@tester:~# cat /dev/urandom | dd bs=4k seek=2 skip=2 count=1 conv=notrunc of=/test/a
We should now have 3 extent refs to 2 extents, with one block unreachable.
The extent tree looks like:
root@tester:~# btrfs-debug-tree /dev/vdc -t 2
[...]
item 9 key (1103101952 EXTENT_ITEM 73728) itemoff 15942 itemsize 53
extent refs 2 gen 29 flags DATA
extent data backref root 5 objectid 261 offset 0 count 2
[...]
item 11 key (1103175680 EXTENT_ITEM 4096) itemoff 15865 itemsize 53
extent refs 1 gen 30 flags DATA
extent data backref root 5 objectid 261 offset 8192 count 1
[...]
and the ref tree looks like:
root@tester:~# btrfs-debug-tree /dev/vdc -t 5
[...]
item 6 key (261 EXTENT_DATA 0) itemoff 15825 itemsize 53
extent data disk byte 1103101952 nr 73728
extent data offset 0 nr 8192 ram 73728
extent compression(none)
item 7 key (261 EXTENT_DATA 8192) itemoff 15772 itemsize 53
extent data disk byte 1103175680 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression(none)
item 8 key (261 EXTENT_DATA 12288) itemoff 15719 itemsize 53
extent data disk byte 1103101952 nr 73728
extent data offset 12288 nr 61440 ram 73728
extent compression(none)
[...]
There are two references to the same extent with different, non-overlapping
byte offsets:
[------------------72K extent at 1103101952----------------------]
[--8K----------------|--4K unreachable----|--60K-----------------]
^ ^
| |
[--8K ref offset 0--][--4K ref offset 0--][--60K ref offset 12K--]
|
v
[-----4K extent-----] at 1103175680
We want to find all of the references to extent bytenr 1103101952.
Without the patch (and without running btrfs-debug-tree), we have to
do it with 18 LOGICAL_INO calls:
root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO
inode 261 offset 0 root 5
root@tester:~# for x in $(seq 0 17); do btrfs ins log $((1103101952 + x * 4096)) -P /test/; done 2>&1 | grep inode
inode 261 offset 0 root 5
inode 261 offset 4096 root 5 <- same extent ref as offset 0
(offset 8192 returns empty set, not reachable)
inode 261 offset 12288 root 5
inode 261 offset 16384 root 5 \
inode 261 offset 20480 root 5 |
inode 261 offset 24576 root 5 |
inode 261 offset 28672 root 5 |
inode 261 offset 32768 root 5 |
inode 261 offset 36864 root 5 \
inode 261 offset 40960 root 5 > all the same extent ref as offset 12288.
inode 261 offset 45056 root 5 / More processing required in userspace
inode 261 offset 49152 root 5 | to figure out these are all duplicates.
inode 261 offset 53248 root 5 |
inode 261 offset 57344 root 5 |
inode 261 offset 61440 root 5 |
inode 261 offset 65536 root 5 |
inode 261 offset 69632 root 5 /
In the worst case the extents are 128MB long, and we have to do 32768
iterations of the loop to find one 4K extent ref.
With the patch, we just use one call to map all refs to the extent at once:
root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO_V2
inode 261 offset 0 root 5
inode 261 offset 12288 root 5
The TREE_SEARCH ioctl allows userspace to retrieve the offset and
extent bytenr fields easily once the root, inode and offset are known.
This is sufficient information to build a complete map of the extent
and all of its references. Userspace can use this information to make
better choices to dedup or defrag.
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Tested-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
[ copy background and motivation from cover letter ]
Signed-off-by: David Sterba <dsterba@suse.com>
The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and
offset (encoded as a single logical address) to a list of extent refs.
LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping
(extent ref -> extent bytenr and offset, or logical address). These are
useful capabilities for programs that manipulate extents and extent
references from userspace (e.g. dedup and defrag utilities).
When the extents are uncompressed (and not encrypted and not other),
check_extent_in_eb performs filtering of the extent refs to remove any
extent refs which do not contain the same extent offset as the 'logical'
parameter's extent offset. This prevents LOGICAL_INO from returning
references to more than a single block.
To find the set of extent references to an uncompressed extent from [a, b),
userspace has to run a loop like this pseudocode:
for (i = a; i < b; ++i)
extent_ref_set += LOGICAL_INO(i);
At each iteration of the loop (up to 32768 iterations for a 128M extent),
data we are interested in is collected in the kernel, then deleted by
the filter in check_extent_in_eb.
When the extents are compressed (or encrypted or other), the 'logical'
parameter must be an extent bytenr (the 'a' parameter in the loop).
No filtering by extent offset is done (or possible?) so the result is
the complete set of extent refs for the entire extent. This removes
the need for the loop, since we get all the extent refs in one call.
Add an 'ignore_offset' argument to iterate_inodes_from_logical,
[...several levels of function call graph...], and check_extent_in_eb, so
that we can disable the extent offset filtering for uncompressed extents.
This flag can be set by an improved version of the LOGICAL_INO ioctl to
get either behavior as desired.
There is no functional change in this patch. The new flag is always
false.
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor coding style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
This code was first introduced in 31db9f7c23 ("Btrfs: introduce
BTRFS_IOC_SEND for btrfs send/receive") and it was not functional, then
it got slightly refactored in e938c8ad54 ("Btrfs: code cleanups for
send/receive"), alas it was still dead. So let's remove it for good!
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
That was only an extra check to tackle a few bugs around this area, now
its safe to remove it. Replace it by an ASSERT.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is bikeshedding, but it seems people are drastically more likely to
understand "zlib:9" as compression level rather than an algorithm
version compared to "zlib9".
Based on feedback on the mailinglist, the ":9" will be the only accepted
syntax. The level must be a single digit. Unrecognized format will
result to the default, for forward compatibility in a similar way the
compression algorithm specifier was relaxed in commit
a7164fa4e0 ("btrfs: prepare for extensions in compression
options").
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: David Sterba <dsterba@suse.com>
[ tighten the accepted format ]
Signed-off-by: David Sterba <dsterba@suse.com>
Preliminary support for setting compression level for zlib, the
following works:
$ mount -o compess=zlib # default
$ mount -o compess=zlib0 # same
$ mount -o compess=zlib9 # level 9, slower sync, less data
$ mount -o compess=zlib1 # level 1, faster sync, more data
$ mount -o remount,compress=zlib3 # level set by remount
The compress-force works the same as compress'. The level is visible in
the same format in /proc/mounts. Level set via file property does not
work yet.
Required patch: "btrfs: prepare for extensions in compression options"
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs' code uses a mix of opencoded sizes and defines from sizes.h.
Let's unifiy the code base to always use the symbolic constants. No functional
changes
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix missing change from commit f8f84b2dfd
("btrfs: index check-integrity state hash by a dev_t").
Function btrfsic_dev_state_hashtable_lookup uses dev_t to generate hashval
when look in up a btrfsic_dev_state in hash table. So when we add a
btrfsic_dev_state into the hash table, it should also use dev_t.
Reproducer of this bug:
Use MOUNT_OPTIONS="-o check_int" when running xfstest, device can not be
mounted successfully. So xfstest can not run.
Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When one of the device is missing, bbio_error() takes care of setting
the error status. And if its only IO that is pending in that stripe, it
fails to check the status of the other IO at %bbio_error before setting
the error %bi_status for the %orig_bio. Fix this by checking if
%bbio->error has exceeded the %bbio->max_errors.
Reproducer as below fdatasync error is seen intermittently.
mount -o degraded /dev/sdc /btrfs
dd status=none if=/dev/zero of=$(mktemp /btrfs/XXX) bs=4096 count=1 conv=fdatasync
dd: fdatasync failed for ‘/btrfs/LSe’: Input/output error
The reason for the intermittences of the problem is because
the following conditions have to be met, which depends on timing:
In btrfs_map_bio()
- the RAID1 the missing device has to be at %dev_nr = 1
In bbio_error()
. before bbio_error() is called the bio of the not-missing
device at %dev_nr = 0 must be completed so that the below
condition is true
if (atomic_dec_and_test(&bbio->stripes_pending)) {
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A cleanup patch, use need_full_stripe() to replace the open code.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Code cleanup for better understanding:
Variable needs_unlock to be called extent_locked to show state as
opposed to action. Changed the type to int, to reduce code in the
critical path.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At few places we could use BLK_STS_OK and BLK_STS_NOSUPP.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Satoru Taekeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ dropped first hunk btrfs_endio_direct_read ]
Signed-off-by: David Sterba <dsterba@suse.com>
These are useful for debugging problems where we mess with
trans->block_rsv to make sure we're not screwing something up.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can get this from the ref we've passed in.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is just excessive information in the ref_head, and makes the code
complicated. It is a relic from when we had the heads and the refs in
the same tree, which is no longer the case. With this removal I've
cleaned up a bunch of the cruft around this old assumption as well.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We do a couple different cleanup operations on the ref head. We adjust
counters, we'll free any reserved space if we didn't end up using the
ref, and we clear the pending csum bytes. Move all these disparate
things into cleanup_ref_head and clean up the logic in
__btrfs_run_delayed_refs so that it handles the !ref case a lot cleaner,
as well as making run_one_delayed_ref() only deal with real refs and not
the ref head.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only use this logic if our ref isn't a ref_head, so move it up into
the if (ref) case since we know that this is a normal ref and not a
delayed ref head.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move this code out to a helper function to further simplivy
__btrfs_run_delayed_refs.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the extent_op cleanup for an empty head ref to a helper function to
help simplify __btrfs_run_delayed_refs.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Simplify the error handling in __btrfs_run_delayed_refs by breaking out
the code used to return a head back to the delayed_refs tree for
processing into a helper function.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were only doing btrfs_check_space_for_delayed_refs() if the metadata
space was full, ie we couldn't allocate chunks. This assumes we'll be
able to allocate chunks during transaction commit, but since nothing
does a LIMIT flush during the transaction commit this won't actually
happen unless we happen to run shy of actual space. We already take
into account a full fs in btrfs_check_space_for_delayed_refs() so just
kill this extra check to make sure we're ending the transaction when we
need to.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were having corruption issues that were tied back to problems with
the extent tree. In order to track them down I built this tool to try
and find the culprit, which was pretty successful. If you compile with
this tool on it will live verify every ref update that the fs makes and
make sure it is consistent and valid. I've run this through with
xfstests and haven't gotten any false positives. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update error messages, add fixup from Dan Carpenter to handle errors
of read_tree_block ]
Signed-off-by: David Sterba <dsterba@suse.com>
We need the actual root for the ref verifier tool to work, so change
these functions to pass the root around instead. This will be used in
a subsequent patch.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This adds the infrastructure for turning ref verify on and off for a
mount, to be used by a later patch.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhnance btrfs_print_mod_info to print if ref-verify is compiled in ]
Signed-off-by: David Sterba <dsterba@suse.com>
The use of sector_t in the callchain of submit_extent_page is not
necessary. Switch to u64 and rename the variable and use byte units
instead of 512b, ie. dropping the >> 9 shifts and avoiding the
con(tro)versions of sector_t.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to remove sector_t and will use 'offset', so this patch
frees the name.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The use of sector_t is not necessry, it's just for a warning. Switch to
u64 and rename the variable and use byte units instead of 512b, ie.
dropping the >> 9 shifts. The messages are adjusted as well.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We pass in a pointer in our send arg struct, this means the struct size
doesn't match with 32bit user space and 64bit kernel space. Fix this by
adding a compat mode and doing the appropriate conversion.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ move structure to the beginning, next to receive 32bit compat ]
Signed-off-by: David Sterba <dsterba@suse.com>
When device is missing without the -o degraded option then its an error
so report it as an error instead of a warning. And when -o degraded
option is provided, log the missing device as warning.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switch error to bool ]
Signed-off-by: David Sterba <dsterba@suse.com>
EIO is only for the IO failure to the device, avoid it. Use ENOENT as
that's the closest error code describing what happened.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
add_missing_dev() can return device pointer so that IS_ERR/PTR_ERR can
be used to check for the actual error that occurred in the function.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
[ minor error message adjustment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Remove variables 'start' and 'end', which are set but never used.
Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We now get a harmless compile-time on 32-bit architectures:
fs/btrfs/tree-checker.c: In function 'check_extent_data_item':
fs/btrfs/tree-checker.c:189:70: error: format '%lu' expects argument of type 'long unsigned int', but argument 6 has type 'unsigned int' [-Werror=format=]
This changes the format string to use %zu instead of %lu for size_t.
Fixes: c1f6520bf360 ("btrfs: tree-checker: Enhance output for check_extent_data_item")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have the combo of flushing twice, which can make sure IO
have started since the second flush will wait for page lock which
won't be unlocked unless setting page writeback and queuing ordered
extents, we don't need %async_submit_draining, %async_delalloc_pages
and %nr_async_submits to tell whether the IO has actually started.
Moreover, all the flushers in use are followed by functions that wait
for ordered extents to complete, so %nr_async_submits, which tracks
whether bio's async submit has made progress, doesn't really make
sense.
However, %async_delalloc_pages is still required by shrink_delalloc()
as that function doesn't flush twice in the normal case (just issues a
writeback with WB_REASON_FS_FREE_SPACE).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
By setting compression for a defrag task, the task will start IO at
the end of defrag.
After the combo of filemap_flush(), we've already made sure that
dirty pages have made progress via async compress thread because the
second filemap_flush() will wait for page lock, which won't be
unlocked until those pages have been marked as writeback and ordered
extents have been queued.
And this is for per-inode defrag, it's not helpful to wait on a global
%async_delalloc_pages and %nr_async_submits from fs_info.
Although waiting on %nr_async_submits means that all bios are
submitted down to per-device schedule IO lists, it doesn't wait for
their completions, thus users still need to do fsync/sync to make sure
the data is on disk. While with this change, it makes sure that pages
are marked with writeback bits and will be submitted asynchronously
shortly, therefore, the behavior of defrag option '-c' remains unchanged.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This was intended to congest higher layers to not send bios, but as
1) the congested bit has been taken by writeback
Async bios come from buffered writes and DIO writes.
For DIO writes, we want to submit them ASAP, while for buffered writes,
writeback uses balance_dirty_pages() to throttle how much dirty pages we
can have.
2) and no one is waiting for %nr_async_bios down to zero,
Historically, it was introduced along with changes which let
checksumming workload spread accross different cpus. And at that time,
pdflush was used instead of per-bdi flushing, perhaps pdflush did not
have the necessary information for writeback to do throttling.
We can safely remove them now.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
[ additional explanation from mails, removed unused variable 'limit' ]
Signed-off-by: David Sterba <dsterba@suse.com>
Output the invalid member name and its bad value, along with its
expected value range or alignment.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Output the bad value and expected good value (or its alignment).
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ unindent long strings ]
Signed-off-by: David Sterba <dsterba@suse.com>
Enhance the output to print:
1) the eason
2) the ad value, if reason is not sufficient
3) good value (range)
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording, unidented long strings ]
Signed-off-by: David Sterba <dsterba@suse.com>
Use inline function to replace macro since we don't need
stringification.
(Macro still exists until all callers get updated)
And add more info about the error, and replace EIO with EUCLEAN.
For nr_items error, report if it's too large or too small, and output
the valid value range.
For node block pointer, added a new alignment checker.
For key order, also output the next key to make the problem more
obvious.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording adjustments, unindented long strings ]
Signed-off-by: David Sterba <dsterba@suse.com>
It's no doubt the comprehensive tree block checker will become larger,
so moving them into their own files is quite reasonable.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Remove dead assigment of num_bytes.
Also as num_bytes only used in the will_compress block as copy of
total_in just replace that with total_in and drop num_bytes entirely.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit a53f4f8e9c ("btrfs: Don't call btrfs_start_transaction() on
frozen fs to avoid deadlock.") started using internal calls and we
replace them with more suitable ones.
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently struct names for sysfs are generated only based on the
attribute names. This means that attribute names cannot be reused in
multiple places throughout the complete btrfs sysfs hierarchy.
E.g. allocation/data/total_bytes and allocation/data/single/total_bytes
result in the same struct name btrfs_attr_total_bytes. A workaround for
this case was made in the past by ad hoc creating an extra macro
wrapper, BTRFS_RAID_ATTR, that inserts some extra text in the struct
name.
Instead of polluting sysfs.h with such kind of extra macro definitions,
and only doing so when there are collisions, use a prefix which gets
inserted in the struct name, so we keep everything nicely grouped
together by default.
Current collections of attributes are:
* (the toplevel, empty prefix)
* allocation
* space_info
* raid
* features
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Bool initializations should use true and false. Bool tests don't need
comparisons.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If btrfs_transaction_commit fails it will proceed to call
cleanup_transaction, which in turn already does btrfs_abort_transaction.
So let's remove the unnecessary code duplication. Also let's be explicit
about handling failure of btrfs_uuid_tree_add by calling
btrfs_end_transaction.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_udpate_root can fail and it aborts the transaction, the correct
way to handle an aborted transaction is to explicitly end with
btrfs_end_transaction. Even now the code is correct since
btrfs_commit_transaction would handle an aborted transaction but this is
more of an implementation detail. So let's be explicit in handling
failure in btrfs_update_root.
Furthermore btrfs_commit_transaction can also fail and by ignoring it's
return value we could have left the in-memory copy of the root item in
an inconsistent state. So capture the error value which allows us to
correctly revert the RO/RW flags in case of commit failure.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_init_new_device() calls btrfs_attach_transaction() to
commit sys chunks, and it should error out if it fails.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of BUG_ON return error to the caller. And handle the fail
condition by calling the abort transaction and going through the
error path.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When new device is being added to seed FS, seed FS is marked writable,
but when we fail to bring in the new device, we missed to undo the
writable part. This patch fixes it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
EXTENT_CSUM checker is a relatively easy one, only needs to check:
1) Objectid
Fixed to BTRFS_EXTENT_CSUM_OBJECTID
2) Key offset alignment
Must be aligned to sectorsize
3) Item size alignedment
Must be aligned to csum size
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add extra checks for item with EXTENT_DATA type. This checks the
following thing:
0) Key offset
All key offsets must be aligned to sectorsize.
Inline extent must have 0 for key offset.
1) Item size
Uncompressed inline file extent size must match item size.
(Compressed inline file extent has no information about its on-disk size.)
Regular/preallocated file extent size must be a fixed value.
2) Every member of regular file extent item
Including alignment for bytenr and offset, possible value for
compression/encryption/type.
3) Type/compression/encode must be one of the valid values.
This should be the most comprehensive and strict check in the context
of btrfs_item for EXTENT_DATA.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switch to BTRFS_FILE_EXTENT_TYPES, similar to what
BTRFS_COMPRESS_TYPES does ]
Signed-off-by: David Sterba <dsterba@suse.com>
Function check_leaf() checks if any item pointer points outside of the
leaf, but it doesn't check if the pointer overlaps with the item itself.
Normally only the last item may be the victim, but adding such check is
never a bad idea anyway.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current check_leaf() function does a good job checking key order and
item offset/size.
However it only checks from slot 0 to the last but one slot, this is
good but makes later expansion hard.
So this refactoring iterates from slot 0 to the last slot.
For key comparison, it uses a key with all 0 as initial key, so all
valid keys should be larger than that.
And for item size/offset checks, it compares current item end with
previous item offset.
For slot 0, use leaf end as a special case.
This makes later item/key offset checks and item size checks easier to
be implemented.
Also, makes check_leaf() to return -EUCLEAN other than -EIO to indicate
error.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Was added in:
c8b978188c
"Btrfs: Add zlib compression support"
Survive to near time (from 08.10.2008).
Because 'start' checked for zero before branch, so it's safe to remove
that subtraction.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay reported that generic/273 was failing currently with ENOSPC.
Turns out this is because we get to the point where the outstanding
reservations are greater than the pinned space on the fs. This is a
mistake, previously we used the current reservation amount in
may_commit_transaction, not the entire outstanding reservation amount.
Fix this to find the minimum byte size needed to make progress in
flushing, and pass that into may_commit_transaction. From there we can
make a smarter decision on whether to commit the transaction or not.
This fixes the failure in generic/273.
From Nikolai, IOW: when we go to the final stage of deciding whether to
do trans commit, instead of passing all the reservations from all
tickets we just pass the reservation for the current ticket. Otherwise,
in case all reservations exceed pinned space, then we don't commit
transaction and fail prematurely. Before we passed num_bytes from
flush_space, where num_bytes was the sum of all pending reserverations,
but now all we do is take the first ticket and commit the trans if we
can satisfy that.
Fixes: 957780eb27 ("Btrfs: introduce ticketed enospc infrastructure")
Cc: stable@vger.kernel.org # 4.8
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
[ added Nikolai's comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
By analyzing the perf on btrfs send, we found it take large amount of
cpu time on page_cache_sync_readahead. This effort can be reduced after
switching to asynchronous one. Overall performance gain on HDD and SSD
were 9 and 15 percent if simply send a large file.
Signed-off-by: Kuanling Huang <peterh@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The local bio_list may have pending bios when doing cleanup, it can
end up with memory leak if they don't get freed.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Don't populate the read-only array types on the stack, instead make
it static const. Makes the object code smaller by nearly 60 bytes:
Before:
text data bss dec hex filename
90536 6552 64 97152 17b80 fs/btrfs/ioctl.o
After:
text data bss dec hex filename
90414 6616 64 97094 17b46 fs/btrfs/ioctl.o
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Forward the correct return value -ENOMEM from btrfsic_dev_state_alloc()
too.
Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ adjust changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
We've seen the following backtrace stack in ftrace or dmesg log,
kworker/u16:10-4244 [000] 241942.480955: function: btrfs_put_ordered_extent
kworker/u16:10-4244 [000] 241942.480956: kernel_stack: <stack trace>
=> finish_ordered_fn (ffffffffa0384475)
=> btrfs_scrubparity_helper (ffffffffa03ca577) <-----"incorrect"
=> btrfs_freespace_write_helper (ffffffffa03ca98e) <-----"correct"
=> process_one_work (ffffffff81117b2f)
=> worker_thread (ffffffff81118c2a)
=> kthread (ffffffff81121de0)
=> ret_from_fork (ffffffff81d7087a)
btrfs_freespace_write_helper is actually calling normal_worker_helper
instead of btrfs_scrubparity_helper, so somehow kernel has parsed the
incorrect function address while unwinding the stack,
btrfs_scrubparity_helper really shouldn't be shown up.
It's caused by compiler doing inline for our helper function, adding a
noinline tag can fix that.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use noinline_for_stack ]
Signed-off-by: David Sterba <dsterba@suse.com>
Since both committing transaction and writing log-tree are doing
plugging on metadata IO, we can unify to use %sync_writers to benefit
both cases, instead of checking bio_flags while writing meta blocks of
log-tree.
We can remove this bio_flags because in order to write dirty blocks,
log tree also uses btrfs_write_marked_extents(), inside which we
have enabled %sync_writers, therefore, every write goes in a
synchronous way, so does checksuming.
Please also note that, bio_flags is applied per-context while
%sync_writers is applied per-inode, so this might incur some overhead, ie.
1) while log tree is flushing its dirty blocks via
btrfs_write_marked_extents(), in which %sync_writers is increased
by one.
2) in the meantime, some writeback operations may happen upon btrfs's
metadata inode, so these writes go synchronously, too.
However, AFAICS, the overhead is not a big one while the win is that
we unify the two places that needs synchronous way and remove a
special hack/flag.
This removes the bio_flags related stuff for writing log-tree.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have started plug in btrfs_write_and_wait_marked_extents() but the
generated IOs actually go to device's schedule IO list where the work
is doing in another task, thus the started plug doesn't make any
sense.
And since we wait for IOs immediately after writing meta blocks, it's
the same case as writing log tree, doing sync submit can merge more
IOs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are checks on fs_info in __btrfs_panic to avoid dereferencing a
null fs_info, however, there is a call to btrfs_crit that may also
dereference a null fs_info. Fix this by adding a check to see if fs_info
is null and only print the s_id if fs_info is non-null.
Detected by CoverityScan CID#401973 ("Dereference after null check")
Fixes: efe120a067 ("Btrfs: convert printk to btrfs_ and fix BTRFS prefix")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value of variable 'can_recover' is never used after being set, thus
it should be removed, as it was never used since the first commit
68a7342c51 ("Btrfs: cleanup orphaned root orphan item").
Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If 'btrfs_alloc_path()' fails, we must free the resources already
allocated, as done in the other error handling paths in this function.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__link_block_group is called from only 2 places and at each call site the
space_info being passed is the same as the space info assigned to the passed
cache struct. Let's remove the redundant argument and make the function
reference the space_info from the passed block_group_cache. No functional
changes
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ renamed to link_block_group ]
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the code executes add_extent_mapping and if it is successful
it links the new mapping, it then proceeds to unlock the extent mapping
tree and check for failure and handle them. Instead, rework the code to
only perform a single check if add_extent_mapping has failed and handle
it, otherwise the code continues in a linear fashion. No functional
changes
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduced by 5a5f79b570 ("Btrfs: allow unaligned DIO") and never
used. The buffered fallback from unaligned DIO works as expected.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_changed_cb_t represents the signature of the callback being passed
to btrfs_compare_trees. Currently there is only one such callback,
namely changed_cb in send.c. This function doesn't really uses the first
2 parameters, i.e. the roots. Since there are not other functions
implementing the btrfs_changed_cb_t let's remove the unused parameters
from the prototype and implementation.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
iterate_dir_item:found_key - introduced in 31db9f7c23 ("Btrfs:
introduce BTRFS_IOC_SEND for btrfs send/receive"), yet never used.
record_ref:num - ditto
This is a first pass with the low-hanging fruit. There are still quite a
few unsued parameters in some function which have to abide by a callback
interface.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Src was initially part of 31ff1cd25d ("Btrfs: Copy into the log tree in
big batches"), however 16e7549f04 ("Btrfs: incompatible format change
to remove hole extents") changed parameters passed to copy_items which
made the src variable redundant.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While we submit direct writes, if the inode is flagged with nodatasum,
there's no benefit to submit asynchronously, because
a) we don't have to calculate checksum across processors,
b) and direct IO has started a plug, but async submit makes us queue
IO on each device's scheduled IO list instead of DIO's plug list, so
that IOs get much less merges in general.
Lets use sync submit for nodatasum inodes.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After mapping block with BTRFS_MAP_WRITE, parities have been sorted to
the end position, so this search can start from the first parity
stripe.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied changelog as a comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
We didn't copy fsid to struct super_block.s_uuid so Overlay disables
index feature with btrfs as the lower FS.
kernel: overlayfs: fs on '/lower' does not support file handles, falling back to index=off.
Fix this by publishing the fsid through struct super_block.s_uuid.
[ dsterba: I think that setting s_uuid is the last missing bit. Overlay
needs the file handle encoding support from the lower filesystem, which
is supported. Filling the whole filesystem id is correct, the subvolume
id is encoded in the file handle buffer from inside btrfs_encode_fh. ]
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These aren't used outside of volumes.c.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some static functions are needlessly forward declared. Let's remove those
declarations since they add no value.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Both wait_for_commit() and wait_for_writer() are checking the
condition out of the mutex lock.
This refactors code a bit to be lock safe.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since TASK_UNINTERRUPTIBLE has been used here, wait_event() can do the
same job.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're still going to wait after schedule(), we don't have to do
finish_wait() to remove our %wait_queue_entry since prepare_to_wait()
won't add the same %wait_queue_entry twice.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>