Introduce chunk allocation policy for btrfs. This policy controls how
chunks and device extents are allocated from devices.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Do not BUG_ON() when an invalid profile is passed to __btrfs_alloc_chunk().
Instead return -EINVAL with ASSERT() to catch a bug in the development
stage.
Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While the "full_search" variable defined in find_free_extent() is bool,
but the full_search argument of find_free_extent_update_loop() is
defined as int. Let's trivially fix the argument type.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After the previous patch, qgroup_rescan_running is protected by
btrfs_fs_info::qgroup_rescan_lock, thus no need for the extra spinlock.
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There are some reports about btrfs wait forever to unmount itself, with
the following call trace:
INFO: task umount:4631 blocked for more than 491 seconds.
Tainted: G X 5.3.8-2-default #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount D 0 4631 3337 0x00000000
Call Trace:
([<00000000174adf7a>] __schedule+0x342/0x748)
[<00000000174ae3ca>] schedule+0x4a/0xd8
[<00000000174b1f08>] schedule_timeout+0x218/0x420
[<00000000174af10c>] wait_for_common+0x104/0x1d8
[<000003ff804d6994>] btrfs_qgroup_wait_for_completion+0x84/0xb0 [btrfs]
[<000003ff8044a616>] close_ctree+0x4e/0x380 [btrfs]
[<0000000016fa3136>] generic_shutdown_super+0x8e/0x158
[<0000000016fa34d6>] kill_anon_super+0x26/0x40
[<000003ff8041ba88>] btrfs_kill_super+0x28/0xc8 [btrfs]
[<0000000016fa39f8>] deactivate_locked_super+0x68/0x98
[<0000000016fcb198>] cleanup_mnt+0xc0/0x140
[<0000000016d6a846>] task_work_run+0xc6/0x110
[<0000000016d04f76>] do_notify_resume+0xae/0xb8
[<00000000174b30ae>] system_call+0xe2/0x2c8
[CAUSE]
The problem happens when we have called qgroup_rescan_init(), but
not queued the worker. It can be caused mostly by error handling.
Qgroup ioctl thread | Unmount thread
----------------------------------------+-----------------------------------
|
btrfs_qgroup_rescan() |
|- qgroup_rescan_init() |
| |- qgroup_rescan_running = true; |
| |
|- trans = btrfs_join_transaction() |
| Some error happened |
| |
|- btrfs_qgroup_rescan() returns error |
But qgroup_rescan_running == true; |
| close_ctree()
| |- btrfs_qgroup_wait_for_completion()
| |- running == true;
| |- wait_for_completion();
btrfs_qgroup_rescan_worker is never queued, thus no one is going to wake
up close_ctree() and we get a deadlock.
All involved qgroup_rescan_init() callers are:
- btrfs_qgroup_rescan()
The example above. It's possible to trigger the deadlock when error
happened.
- btrfs_quota_enable()
Not possible. Just after qgroup_rescan_init() we queue the work.
- btrfs_read_qgroup_config()
It's possible to trigger the deadlock. It only init the work, the
work queueing happens in btrfs_qgroup_rescan_resume().
Thus if error happened in between, deadlock is possible.
We shouldn't set fs_info->qgroup_rescan_running just in
qgroup_rescan_init(), as at that stage we haven't yet queued qgroup
rescan worker to run.
[FIX]
Set qgroup_rescan_running before queueing the work, so that we ensure
the rescan work is queued when we wait for it.
Fixes: 8d9eddad19 ("Btrfs: fix qgroup rescan worker initialization")
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
[ Change subject and cause analyse, use a smaller fix ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are new types and helpers that are supposed to be used in new code.
As a preparation to get rid of legacy types and API functions do
the conversion here.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a fuzzed image which could cause KASAN report at unmount time.
BUG: KASAN: use-after-free in btrfs_queue_work+0x2c1/0x390
Read of size 8 at addr ffff888067cf6848 by task umount/1922
CPU: 0 PID: 1922 Comm: umount Tainted: G W 5.0.21 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x5b/0x8b
print_address_description+0x70/0x280
kasan_report+0x13a/0x19b
btrfs_queue_work+0x2c1/0x390
btrfs_wq_submit_bio+0x1cd/0x240
btree_submit_bio_hook+0x18c/0x2a0
submit_one_bio+0x1be/0x320
flush_write_bio.isra.41+0x2c/0x70
btree_write_cache_pages+0x3bb/0x7f0
do_writepages+0x5c/0x130
__writeback_single_inode+0xa3/0x9a0
writeback_single_inode+0x23d/0x390
write_inode_now+0x1b5/0x280
iput+0x2ef/0x600
close_ctree+0x341/0x750
generic_shutdown_super+0x126/0x370
kill_anon_super+0x31/0x50
btrfs_kill_super+0x36/0x2b0
deactivate_locked_super+0x80/0xc0
deactivate_super+0x13c/0x150
cleanup_mnt+0x9a/0x130
task_work_run+0x11a/0x1b0
exit_to_usermode_loop+0x107/0x130
do_syscall_64+0x1e5/0x280
entry_SYSCALL_64_after_hwframe+0x44/0xa9
[CAUSE]
The fuzzed image has a completely screwd up extent tree:
leaf 29421568 gen 8 total ptrs 6 free space 3587 owner EXTENT_TREE
refs 2 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 5938
item 0 key (12587008 168 4096) itemoff 3942 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 259 offset 0 count 1
item 1 key (12591104 168 8192) itemoff 3889 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 271 offset 0 count 1
item 2 key (12599296 168 4096) itemoff 3836 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 259 offset 4096 count 1
item 3 key (29360128 169 0) itemoff 3803 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
item 4 key (29368320 169 1) itemoff 3770 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
item 5 key (29372416 169 0) itemoff 3737 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
Note that leaf 29421568 doesn't have its backref in the extent tree.
Thus extent allocator can re-allocate leaf 29421568 for other trees.
In short, the bug is caused by:
- Existing tree block gets allocated to log tree
This got its generation bumped.
- Log tree balance cleaned dirty bit of offending tree block
It will not be written back to disk, thus no WRITTEN flag.
- Original owner of the tree block gets COWed
Since the tree block has higher transid, no WRITTEN flag, it's reused,
and not traced by transaction::dirty_pages.
- Transaction aborted
Tree blocks get cleaned according to transaction::dirty_pages. But the
offending tree block is not recorded at all.
- Filesystem unmount
All pages are assumed to be are clean, destroying all workqueue, then
call iput(btree_inode).
But offending tree block is still dirty, which triggers writeback, and
causes use-after-free bug.
The detailed sequence looks like this:
- Initial status
eb: 29421568, header=WRITTEN bflags_dirty=0, page_dirty=0, gen=8,
not traced by any dirty extent_iot_tree.
- New tree block is allocated
Since there is no backref for 29421568, it's re-allocated as new tree
block.
Keep in mind that tree block 29421568 is still referred by extent
tree.
- Tree block 29421568 is filled for log tree
eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 << (gen bumped)
traced by btrfs_root::dirty_log_pages
- Some log tree operations
Since the fs is using node size 4096, the log tree can easily go a
level higher.
- Log tree needs balance
Tree block 29421568 gets all its content pushed to right, thus now
it is empty, and we don't need it.
btrfs_clean_tree_block() from __push_leaf_right() get called.
eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
traced by btrfs_root::dirty_log_pages
- Log tree write back
btree_write_cache_pages() goes through dirty pages ranges, but since
page of tree block 29421568 gets cleaned already, it's not written
back to disk. Thus it doesn't have WRITTEN bit set.
But ranges in dirty_log_pages are cleared.
eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
not traced by any dirty extent_iot_tree.
- Extent tree update when committing transaction
Since tree block 29421568 has transid equal to running trans, and has
no WRITTEN bit, should_cow_block() will use it directly without adding
it to btrfs_transaction::dirty_pages.
eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
not traced by any dirty extent_iot_tree.
At this stage, we're doomed. We have a dirty eb not tracked by any
extent io tree.
- Transaction gets aborted due to corrupted extent tree
Btrfs cleans up dirty pages according to transaction::dirty_pages and
btrfs_root::dirty_log_pages.
But since tree block 29421568 is not tracked by neither of them, it's
still dirty.
eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
not traced by any dirty extent_iot_tree.
- Filesystem unmount
Since all cleanup is assumed to be done, all workqueus are destroyed.
Then iput(btree_inode) is called, expecting no dirty pages.
But tree 29421568 is still dirty, thus triggering writeback.
Since all workqueues are already freed, we cause use-after-free.
This shows us that, log tree blocks + bad extent tree can cause wild
dirty pages.
[FIX]
To fix the problem, don't submit any btree write bio if the filesytem
has any error. This is the last safe net, just in case other cleanup
haven't caught catch it.
Link: https://github.com/bobfuzzer/CVE/tree/master/CVE-2019-19377
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no point to inform the user about size change if there's none.
Update the message to conform to a commonly used format where the path
and devid are printed and also print old and new sizes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Marcos Paulo de Souza <marcos@mpdesouza.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance message ]
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_update_global_block_rsv the lines:
num_bytes = block_rsv->size - block_rsv->reserved;
block_rsv->reserved += num_bytes;
imply:
block_rsv->reserved = block_rsv->size;
Assign block_rsv->size to block_rsv->reserved directly and reorder lines
so they match the other branch.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree_log_mutex and reloc_mutex locks are properly nested so we can
simplify error handling and add labels for them. This reduces line count
of the function.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All we need to read is checksum size from fs_info superblock, and
fs_info is provided by extent buffer so we can get rid of the wild
pointer indirections from page/inode/root.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The message seems to be for debugging and has little value for users.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't use the u_XX types anywhere, though they're defined.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove trivial comprator and open coded swap of two values.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
An unrecognized option is a failure that should get user/administrator
attention, the info level is often below what gets logged, so make it
error.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass extent buffer start and length so the extent buffer
itself should work fine.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper btrfs_header_chunk_tree_uuid follows naming convention of
other struct accessors but does something compeletly different. As the
offsetof calculation is clear in the context of extent buffer operations
we can remove it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper btrfs_header_fsid follows naming convention of other struct
accessors but does something compeletly different. As the offsetof
calculation is clear in the context of extent buffer operations we can
remove it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a simple forwarded call based on the operation that would better
fit the caller btrfs_map_block that's until now a trivial wrapper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The struct_size macro does the same calculation and is safe regarding
overflows. Though we're not expecting them to happen, use the helper for
clarity.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch removes all haphazard code implementing nocow writers
exclusion from pending snapshot creation and switches to using the drew
lock to ensure this invariant still holds.
'Readers' are snapshot creators from create_snapshot and 'writers' are
nocow writers from buffered write path or btrfs_setsize. This locking
scheme allows for multiple snapshots to happen while any nocow writers
are blocked, since writes to page cache in the nocow path will make
snapshots inconsistent.
So for performance reasons we'd like to have the ability to run multiple
concurrent snapshots and also favors readers in this case. And in case
there aren't pending snapshots (which will be the majority of the cases)
we rely on the percpu's writers counter to avoid cacheline contention.
The main gain from using the drew lock is it's now a lot easier to
reason about the guarantees of the locking scheme and whether there is
some silent breakage lurking.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A (D)ouble (R)eader (W)riter (E)xclustion lock is a locking primitive
that allows to have multiple readers or multiple writers but not
multiple readers and writers holding it concurrently.
The code is factored out from the existing open-coded locking scheme
used to exclude pending snapshots from nocow writers and vice-versa.
Current implementation actually favors Readers (that is snapshot
creaters) to writers (nocow writers of the filesystem).
The API provides lock/unlock/trylock for reads and writes.
Formal specification for TLA+ provided by Valentin Schneider is at
https://lore.kernel.org/linux-btrfs/2dcaf81c-f0d3-409e-cb29-733d8b3b4cc9@arm.com/
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The error cleanup gotos in __btrfs_write_out_cache() needlessly jump
back making the code less readable then needed. Flatten them out so no
back-jump is necessary and the read flow is uninterrupted.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
free-space-cache.c has it's own set of DEBUG ifdefs which need to be
turned on instead of the global CONFIG_BTRFS_DEBUG to print debug
messages about failed block-group writes.
Switch this over to CONFIG_BTRFS_DEBUG so we always see these messages
when running a debug kernel.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make the uptodate argument of io_ctl_add_pages() boolean.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
io_ctl_prepare_pages() gets a 'struct btrfs_io_ctl' as well as a 'struct
inode', but btrfs_io_ctl::inode points to the same struct inode as this is
assgined in io_ctl_init().
Use the inode form io_ctl to reduce the arguments of io_ctl_prepare_pages.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This ioctl will be responsible for deleting a subvolume using its id.
This can be used when a system has a file system mounted from a
subvolume, rather than the root file system, like below:
/
@subvol1/
@subvol2/
@subvol_default/
If only @subvol_default is mounted, we have no path to reach @subvol1
and @subvol2, thus no way to delete them. Current subvolume delete ioctl
takes a file handle point as argument, and if @subvol_default is
mounted, we can't reach @subvol1 and @subvol2 from the same mount point.
This patch introduces a new ioctl BTRFS_IOC_SNAP_DESTROY_V2 that takes
the extended structure with flags to allow to delete subvolume using
subvolid.
Now, we can use this new ioctl specifying the subvolume id and refer to
the same mount point. It doesn't matter which subvolume was mounted,
since we can reach to the desired one using the subvolume id, and then
delete it.
The full path to the subvolume id is resolved internally and access is
verified as if the subvolume was accessed by path.
The volume args v2 structure is extended to use the existing union for
subvolume id specification, that's valid in case the
BTRFS_SUBVOL_SPEC_BY_ID is set.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The functions will be used outside of export.c and super.c to allow
resolving subvolume name from a given id, eg. for subvolume deletion by
id ioctl.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ split from the next patch ]
Signed-off-by: David Sterba <dsterba@suse.com>
When the device remove v2 ioctl was added, the full support mask was
added to sanity check the flags. However this would allow to let the
subvolume related flags to be accepted. This is not supposed to happen.
Use the correct support mask, which means that now any of
BTRFS_SUBVOL_CREATE_ASYNC, BTRFS_SUBVOL_RDONLY or
BTRFS_SUBVOL_QGROUP_INHERIT will be rejected as ENOTSUPP. Though this is
a user-visible change, specifying subvolume flags for device deletion
does not make sense and there are hopefully no applications doing that.
Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using the defined mask instead of flag enumeration in the ioctl handler
is preferred. No functional changes.
Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sparse reports a warning at release_extent_buffer()
warning: context imbalance in release_extent_buffer() - unexpected unlock
The root cause is the missing annotation at release_extent_buffer()
Add the missing __releases(&eb->refs_lock) annotation
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In my EIO stress testing I noticed I was getting forced to rescan the
uuid tree pretty often, which was weird. This is because my error
injection stuff would sometimes inject an error after log replay but
before we loaded the UUID tree. If log replay committed the transaction
it wouldn't have updated the uuid tree generation, but the tree was
valid and didn't change, so there's no reason to not update the
generation here.
Fix this by setting the BTRFS_FS_UPDATE_UUID_TREE_GEN bit immediately
after reading all the fs roots if the uuid tree generation matches the
fs generation. Then any transaction commits that happen during mount
won't screw up our uuid tree state, forcing us to do needless uuid
rescans.
Fixes: 70f8017547 ("Btrfs: check UUID tree during mount if required")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In doing my fsstress+EIO stress testing I started running into issues
where umount would get stuck forever because the uuid checker was
chewing through the thousands of subvolumes I had created.
We shouldn't block umount on this, simply bail if we're unmounting the
fs. We need to make sure we don't mark the UUID tree as ok, so we only
set that bit if we made it through the whole rescan operation, but
otherwise this is completely safe.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's used only during filesystem mount as such it can be made private to
disk-io.c file. Also use the occasion to move btrfs_uuid_rescan_kthread
as btrfs_check_uuid_tree is its sole caller.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_uuid_tree_iterate is called from only once place and its 2nd
argument is always btrfs_check_uuid_tree_entry. Simplify
btrfs_uuid_tree_iterate's signature by removing its 2nd argument and
directly calling btrfs_check_uuid_tree_entry. Also move the latter into
uuid-tree.h. No functional changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are temporary variables tracking the index of P and Q stripes, but
none of them is really used as such, merely for determining if the Q
stripe is present. This leads to compiler warnings with
-Wunused-but-set-variable and has been reported several times.
fs/btrfs/raid56.c: In function ‘finish_rmw’:
fs/btrfs/raid56.c:1199:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
1199 | int p_stripe = -1;
| ^~~~~~~~
fs/btrfs/raid56.c: In function ‘finish_parity_scrub’:
fs/btrfs/raid56.c:2356:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
2356 | int p_stripe = -1;
| ^~~~~~~~
Replace the two variables with one that has a clear meaning and also get
rid of the warnings. The logic that verifies that there are only 2
valid cases is unchanged.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the following patches:
- btrfs: backref, only collect file extent items matching backref offset
- btrfs: backref, not adding refs from shared block when resolving normal backref
- btrfs: backref, only search backref entries from leaves of the same root
we only collect the normal data refs we want, so the imprecise upper
bound total_refs of that EXTENT_ITEM could now be changed to the count
of the normal backref entry we want to search.
Background and how the patches fit together:
Btrfs has two types of data backref.
For BTRFS_EXTENT_DATA_REF_KEY type of backref, we don't have the
exact block number. Therefore, we need to call resolve_indirect_refs.
It uses btrfs_search_slot to locate the leaf block. Then
we need to walk through the leaves to search for the EXTENT_DATA items
that have disk bytenr matching the extent item (add_all_parents).
When resolving indirect refs, we could take entries that don't
belong to the backref entry we are searching for right now.
For that reason when searching backref entry, we always use total
refs of that EXTENT_ITEM rather than individual count.
For example:
item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize
extent refs 24 gen 7302 flags DATA
shared data backref parent 394985472 count 10 #1
extent data backref root 257 objectid 260 offset 1048576 count 3 #2
extent data backref root 256 objectid 260 offset 65536 count 6 #3
extent data backref root 257 objectid 260 offset 65536 count 5 #4
For example, when searching backref entry #4, we'll use total_refs
24, a very loose loop ending condition, instead of total_refs = 5.
But using total_refs = 24 is not accurate. Sometimes, we'll never find
all the refs from specific root. As a result, the loop keeps on going
until we reach the end of that inode.
The first 3 patches, handle 3 different types refs we might encounter.
These refs do not belong to the normal backref we are searching, and
hence need to be skipped.
This patch changes the total_refs to correct number so that we could
end loop as soon as we find all the refs we want.
btrfs send uses backref to find possible clone sources, the following
is a simple test to compare the results with and without this patch:
$ btrfs subvolume create /sub1
$ for i in `seq 1 163840`; do
dd if=/dev/zero of=/sub1/file bs=64K count=1 seek=$((i-1)) conv=notrunc oflag=direct
done
$ btrfs subvolume snapshot /sub1 /sub2
$ for i in `seq 1 163840`; do
dd if=/dev/zero of=/sub1/file bs=4K count=1 seek=$(((i-1)*16+10)) conv=notrunc oflag=direct
done
$ btrfs subvolume snapshot -r /sub1 /snap1
$ time btrfs send /snap1 | btrfs receive /volume2
Without this patch:
real 69m48.124s
user 0m50.199s
sys 70m15.600s
With this patch:
real 1m59.683s
user 0m35.421s
sys 2m42.684s
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
[ add patchset cover letter with background and numbers ]
Signed-off-by: David Sterba <dsterba@suse.com>
We could have some nodes/leaves in subvolume whose owner are not the
that subvolume. In this way, when we resolve normal backrefs of that
subvolume, we should avoid collecting those references from these blocks.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All references from the block of SHARED_DATA_REF belong to that shared
block backref.
For example:
item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize 95
extent refs 24 gen 7302 flags DATA
extent data backref root 257 objectid 260 offset 65536 count 5
extent data backref root 258 objectid 265 offset 0 count 9
shared data backref parent 394985472 count 10
Block 394985472 might be leaf from root 257, and the item obejctid and
(file_pos - file_extent_item::offset) in that leaf just happens to be
260 and 65536 which is equal to the first extent data backref entry.
Before this patch, when we resolve backref:
root 257 objectid 260 offset 65536
we will add those refs in block 394985472 and wrongly treat those as the
refs we want.
Fix this by checking if the leaf we are processing is shared data
backref, if so, just skip this leaf.
Shared data refs added into preftrees.direct have all entry value = 0
(root_id = 0, key = NULL, level = 0) except parent entry.
Other refs from indirect tree will have key value and root id != 0, and
these values won't be changed when their parent is resolved and added to
preftrees.direct. Therefore, we could reuse the preftrees.direct and
search ref with all values = 0 except parent is set to avoid getting
those resolved refs block.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When resolving one backref of type EXTENT_DATA_REF, we collect all
references that simply reference the EXTENT_ITEM even though their
(file_pos - file_extent_item::offset) are not the same as the
btrfs_extent_data_ref::offset we are searching for.
This patch adds additional check so that we only collect references whose
(file_pos - file_extent_item::offset) == btrfs_extent_data_ref::offset.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The integrity checking code for the super block mirrors is the last
remaining user of buffer_heads, change it to using plain bios as well.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the last caller of btrfsic_process_written_block() with
buffer_heads is gone, remove the buffer_head processing path from it as
well.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the last use of btrfsic_submit_bh() is gone as the super block
is now written using bios, remove the function as well.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to the superblock read path, change the write path to using bios
and pages instead of buffer_heads. This allows us to skip over the
buffer_head code, for writing the superblock to disk.
This is based on a patch originally authored by Nikolay Borisov.
Co-developed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Super-block reading in BTRFS is done using buffer_heads. Buffer_heads
have some drawbacks, like not being able to propagate errors from the
lower layers.
Directly use the page cache for reading the super blocks from disk or
invalidating an on-disk super block. We have to use the page cache so to
avoid races between mkfs and udev. See also 6f60cbd3ae ("btrfs: access
superblock via pagecache in scan_one_device").
This patch unwraps the buffer head API and does not change the way the
super block is actually read.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_scratch_superblocks() isn't used anywhere outside volumes.c so
remove it from the header file and mark it as static. Also move it
above it's callers so we don't need a forward declaration.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Block device mappings are never in highmem so kmap() / kunmap() calls for
pages from block devices are unneeded. Use page_address() instead of
kmap() to get to the virtual addreses.
While we're at it, read_cache_page_gfp() doesn't return NULL on error,
only an ERR_PTR, so use IS_ERR() to check for errors.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparatory patch for removal of buffer_head usage in btrfs.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When attempting to set bits on a range of an exent io tree that already
has those bits set we can end up splitting an extent state record, use
the preallocated extent state record, insert it into the red black tree,
do another search on the red black tree, merge the preallocated extent
state record with the previous extent state record, remove that previous
record from the red black tree and then free it. This is all unnecessary
work that consumes time.
This happens specifically at the following case at __set_extent_bit():
$ cat -n fs/btrfs/extent_io.c
957 static int __must_check
958 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
(...)
1044 /*
1045 * | ---- desired range ---- |
1046 * | state |
1047 * or
1048 * | ------------- state -------------- |
1049 *
(...)
1060 if (state->start < start) {
1061 if (state->state & exclusive_bits) {
1062 *failed_start = start;
1063 err = -EEXIST;
1064 goto out;
1065 }
1066
1067 prealloc = alloc_extent_state_atomic(prealloc);
1068 BUG_ON(!prealloc);
1069 err = split_state(tree, state, prealloc, start);
1070 if (err)
1071 extent_io_tree_panic(tree, err);
1072
1073 prealloc = NULL;
So if our extent state represents a range from 0 to 1MiB for example, and
we want to set bits in the range 128KiB to 256KiB for example, and that
extent state record already has all those bits set, we end up splitting
that record, so we end up with extent state records in the tree which
represent the ranges from 0 to 128KiB and from 128KiB to 1MiB. This is
temporary because a subsequent iteration in that function will end up
merging the records.
The splitting requires using the preallocated extent state record, so
a future iteration that needs to do another split will need to allocate
another extent state record in an atomic context, something not ideal
that we try to avoid as much as possible. The splitting also requires
an insertion in the red black tree, and a subsequent merge will require
a deletion from the red black tree and freeing an extent state record.
This change just skips the splitting of an extent state record when it
already has all the bits the we need to set.
Setting a bit that is already set for a range is very common in the
inode's 'file_extent_tree' extent io tree for example, where we keep
setting the EXTENT_DIRTY bit every time we replace an extent.
This change also fixes a bug that happens after the recent patchset from
Josef that avoids having implicit holes after a power failure when not
using the NO_HOLES feature, more specifically the patch with the subject:
"btrfs: introduce the inode->file_extent_tree"
This patch introduced an extent io tree per inode to keep track of
completed ordered extents and figure out at any time what is the safe
value for the inode's disk_i_size. This assumes that for contiguous
ranges in a file we always end up with a single extent state record in
the io tree, but that is not the case, as there is a short time window
where we can have two extent state records representing contiguous
ranges. When this happens we end setting up an incorrect value for the
inode's disk_i_size, resulting in data loss after a clean unmount
of the filesystem. The following example explains how this can happen.
Suppose we have an inode with an i_size and a disk_i_size of 1MiB, so in
the inode's file_extent_tree we have a single extent state record that
represents the range [0, 1MiB) with the EXTENT_DIRTY bit set. Then the
following steps happen:
1) A buffered write against file range [512KiB, 768KiB) is made. At this
point delalloc was not flushed yet;
2) Deduplication from some other inode into this inode's range
[128KiB, 256KiB) is made. This causes btrfs_inode_set_file_extent_range()
to be called, from btrfs_insert_clone_extent(), to mark the range
[128KiB, 256KiB) with EXTENT_DIRTY in the inode's file_extent_tree;
3) When btrfs_inode_set_file_extent_range() calls set_extent_bits(), we
end up at __set_extent_bit(). In the first iteration of that function's
loop we end up in the following branch:
$ cat -n fs/btrfs/extent_io.c
957 static int __must_check
958 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
(...)
1044 /*
1045 * | ---- desired range ---- |
1046 * | state |
1047 * or
1048 * | ------------- state -------------- |
1049 *
(...)
1060 if (state->start < start) {
1061 if (state->state & exclusive_bits) {
1062 *failed_start = start;
1063 err = -EEXIST;
1064 goto out;
1065 }
1066
1067 prealloc = alloc_extent_state_atomic(prealloc);
1068 BUG_ON(!prealloc);
1069 err = split_state(tree, state, prealloc, start);
1070 if (err)
1071 extent_io_tree_panic(tree, err);
1072
1073 prealloc = NULL;
(...)
1089 goto search_again;
This splits the state record into two, one for range [0, 128KiB) and
another for the range [128KiB, 1MiB). Both already have the EXTENT_DIRTY
bit set. Then we jump to the 'search_again' label, where we unlock the
the spinlock protecting the extent io tree before jumping to the
'again' label to perform the next iteration;
4) In the meanwhile, delalloc is flushed, the ordered extent for the range
[512KiB, 768KiB) is created and when it completes, at
btrfs_finish_ordered_io(), it calls btrfs_inode_safe_disk_i_size_write()
with a value of 0 for its 'new_size' argument;
5) Before the deduplication task currently at __set_extent_bit() moves to
the next iteration, the task finishing the ordered extent calls
find_first_extent_bit() through btrfs_inode_safe_disk_i_size_write()
and gets 'start' set to 0 and 'end' set to 128KiB - because at this
moment the io tree has two extent state records, one representing the
range [0, 128KiB) and another representing the range [128KiB, 1MiB),
both with EXTENT_DIRTY set. Then we set 'isize' to:
isize = min(isize, end + 1)
= min(1MiB, 128KiB - 1 + 1)
= 128KiB
Then we set the inode's disk_i_size to 128KiB (isize).
After a clean unmount of the filesystem and mounting it again, we have
the file with a size of 128KiB, and effectively lost all the data it
had before in the range from 128KiB to 1MiB.
This change fixes that issue too, as we never end up splitting extent
state records when they already have all the bits we want set.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're allocating a logged extent we attempt to insert an extent
record for the file extent directly. We increase
space_info->bytes_reserved, because the extent entry addition will call
btrfs_update_block_group(), which will convert the ->bytes_reserved to
->bytes_used. However if we fail at any point while inserting the
extent entry we will bail and leave space on ->bytes_reserved, which
will trigger a WARN_ON() on umount. Fix this by pinning the space if we
fail to insert, which is what happens in every other failure case that
involves adding the extent entry.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is only used in read_fs_root(), which is just a wrapper of
btrfs_get_fs_root().
For all the mentioned essential roots except log root tree,
btrfs_get_fs_root() has its own quick path to grab them from fs_info
directly, thus no need for key.offset modification.
For subvolume trees, btrfs_get_fs_root() with key.offset == -1 is
completely fine.
For log trees and log root tree, it's impossible to hit them, as for
relocation all backrefs are fetched from commit root, which never
records log tree blocks.
Log tree blocks either get freed in regular transaction commit, or
replayed at mount time. At runtime we should never hit an backref for
log tree in extent tree.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This commit flips the switch to start tracking/processing pinned extents
on a per-transaction basis. It mostly replaces all references from
btrfs_fs_info::(pinned_extents|freed_extents[]) to
btrfs_transaction::pinned_extents.
Two notable modifications that warrant explicit mention are changing
clean_pinned_extents to get a reference to the previously running
transaction. The other one is removal of call to
btrfs_destroy_pinned_extent since transactions are going to be cleaned
in btrfs_cleanup_one_transaction.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Next patch is going to refactor how pinned extents are tracked which
will necessitate changing this code. To ease that work and contain the
changes factor the code now in preparation, this will also help review.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation to making pinned extents per-transaction ensure that log
such extents are always excluded from caching. To achieve this in
addition to marking them via btrfs_pin_extent_for_log_replay they also
need to be marked with btrfs_add_excluded_extent to prevent log tree
extent buffer being loaded by the free space caching thread. That's
required since log tree blocks are not recorded in the extent tree, hence
they always look free.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparation for refactoring pinned extents tracking.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers have a reference to a transaction handle so pass it to
pin_down_extent. This is the final step before switching pinned extent
tracking to a per-transaction basis.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparation for refactoring pinned extents tracking.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_pin_reserved_extent is now only called with a valid transaction so
exploit the fact to take a transaction. This is preparation for tracking
pinned extents on a per-transaction basis.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Calling btrfs_pin_reserved_extent makes sense only with a valid
transaction since pinned extents are processed from transaction commit
in btrfs_finish_extent_commit. In case of error it's sufficient to
adjust the reserved counter to account for log tree extents allocated in
the last transaction.
This commit moves btrfs_pin_reserved_extent to be called only with valid
transaction handle and otherwise uses the newly introduced
unaccount_log_buffer to adjust "reserved". If this is not done if a
failure occurs before transaction is committed WARN_ON are going to be
triggered on unmount. This was especially pronounced with generic/475
test.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function correctly adjusts the reserved bytes occupied by a log
tree extent buffer. It will be used instead of calling
btrfs_pin_reserved_extent.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preparation for switching pinned extent tracking to a per-transaction
basis.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Having btrfs_destroy_delayed_refs call btrfs_pin_extent is problematic
for making pinned extents tracking per-transaction since
btrfs_trans_handle cannot be passed to btrfs_pin_extent in this context.
Additionally delayed refs heads pinned in btrfs_destroy_delayed_refs
are going to be handled very closely, in btrfs_destroy_pinned_extent.
To enable btrfs_pin_extent to take btrfs_trans_handle simply open code
it in btrfs_destroy_delayed_refs and call btrfs_error_unpin_extent_range
on the range. This enables us to do less work in
btrfs_destroy_pinned_extent and leaves btrfs_pin_extent being called in
contexts which have a valid btrfs_trans_handle.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The devinfo attribute handlers were added in 668e48af7a ("btrfs:
sysfs, add devid/dev_state kobject and device attributes") and the name
should contain _devinfo_, there's one that does not conform, so unify it
with the rest.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 668e48af7a ("btrfs: sysfs, add devid/dev_state kobject and
device attributes"), the functions btrfs_sysfs_add_device_link() and
btrfs_sysfs_rm_device_link() do more than just adding and removing the
device link as its name indicated. Rename them to be more specific
that's about the directory with the attirbutes
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have one simple function btrfs_sysfs_remove_fsid() to undo
btrfs_sysfs_add_fsid(), which also does proper checks before releasing
objects.
One difference, if btrfs_sysfs_remove_fsid is used that now we also call
kobject_del() which was missing before. This was tested (with kobject
debug turned on) and no change in behaviour was found.
This is a cleanup patch.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree pointer can be safely read from the inode, use it and drop the
redundant argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree pointer can be safely read from the inode, use it and drop the
redundant argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree pointer can be safely read from the inode, use it and drop the
redundant argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree pointer can be safely read from the page's inode, use it and
drop the redundant argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree pointer can be safely read from the inode so we can drop the
redundant argument from btrfs_lock_and_flush_ordered_range.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add assertions to all helpers that get tree as argument and verify that
it's the same that can be obtained from the inode or from its pages. In
followup patches the redundant arguments and assertions will be removed
one by one.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're sure the tree from argument is same as the one we can get
from the page's inode io_tree, drop the redundant argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All functions that set up extent_page_data::tree set it to the inode
io_tree. That's passed down the callstack that accesses either the same
inode or its pages. In the end submit_extent_page can pull the tree out
of the page and we don't have to store it in the structure.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The status of aborted transaction can change between calls and it needs
to be accessed by READ_ONCE. Add a helper that also wraps the unlikely
hint.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helpers are related to locking so move them there, update comments.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are now using these for all roots, rename them to btrfs_put_root()
and btrfs_grab_root();
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're going to start relying on getting ref counting right for
roots, add a list to track allocated roots and print out any roots that
aren't freed up at free_fs_info time.
Hide this behind CONFIG_BTRFS_DEBUG because this will just be used for
developers to verify they aren't breaking things.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In adding things like eb leak checking and root leak checking there were
a lot of weird corner cases that come from the fact that
1) We do not init the fs_info until we get to open_ctree time in the
normal case and
2) The test infrastructure half-init's the fs_info for things that it
needs.
This makes it really annoying to make changes because you have to add
init in two different places, have special cases for testing fs_info's
that may not have certain things initialized, and cases for fs_info's
that didn't make it to open_ctree and thus are not fully set up.
Fix this by extracting out the non-allocating init of the fs info into
it's own public function and use that to make sure we're all getting
consistent views of an allocated fs_info.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
open_ctree mixes initialization of fs stuff and fs_info stuff, which
makes it confusing when doing things like adding the root leak
detection. Make a separate function that inits all the static
structures inside of the fs_info needed for the fs to operate, and then
call that before we start setting up the fs_info to be mounted.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Things like the percpu_counters, the mapping_tree, and the csum hash can
all be freed at btrfs_free_fs_info time, since the helpers all check if
the structure has been initialized already. This significantly cleans
up the error cases in open_ctree.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that all callers of btrfs_get_fs_root are subsequently calling
btrfs_grab_fs_root and handling dropping the ref when they are done
appropriately, go ahead and push btrfs_grab_fs_root up into
btrfs_get_fs_root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we are going to track leaked roots we need to free them all the same
way, so don't kfree() roots directly, use btrfs_put_fs_root.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We lookup the fs_root and put it in our fs_info directly, we should hold
a ref on this root for the lifetime of the fs_info.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to start freeing roots and doing other complicated things in
free_fs_info, so we need to move it to disk-io.c and export it in order
to use things lik btrfs_put_fs_root().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We lookup the uuid of arbitrary subvolumes, hold a ref on the root while
we're doing this.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We replay the log into arbitrary fs roots, hold a ref on the root while
we're doing this.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We create the snapshot and then use it for a bunch of things, we need to
hold a ref on it while we're messing with it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We lookup the name of a subvol which means we'll cross into different
roots. Hold a ref while we're doing the look ups in the fs_root we're
searching.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We lookup all the clone roots and the parent root for send, so we need
to hold refs on all of these roots while we're processing them.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We look up the root for the bytenr that is failing, so we need to hold a
ref on the root for that operation.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We lookup roots for every orphan item we have, we need to hold a ref on
the root while we're doing this work.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All of relocation uses read_fs_root to lookup fs roots, so push the
btrfs_grab_fs_root() up into that helper and remove the individual
calls.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We look up the fs root in various places in here when recovering from a
crashed relcoation. Make sure we hold a ref on the root whenever we
look them up.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're creating a reloc inode in the data reloc tree, we need to hold a
ref on the root while we're doing that.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're looking up the data references for the bytenr in a root, we need
to hold a ref on that root while we're doing that.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are recording this root in the transaction, so we need to hold a ref
on it until we do that.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We look up the corresponding root for the reloc root, we need to hold a
ref while we're messing with it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We look up the reloc roots corresponding root, we need to hold a ref on
that root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is trickier than the previous conversions. We have backref_node's
that need to hold onto their root for their lifetime. Do the read of
the root and grab the ref. If at any point we don't use the root we
discard it, however if we use it in our backref node we don't free it
until we free the backref node. Any time we switch the root's for the
backref node we need to drop our ref on the old root and grab the ref on
the new root, and if we dupe a node we need to get a ref on the root
there as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We look up an arbitrary fs root here, we need to hold a ref on the root
for the duration.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We look up whatever root userspace has given us, we need to hold a ref
throughout this operation. Use 'root' only for the on fs root and not as
a temporary variable elsewhere.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can wander into a different root, so grab a ref on the root we look
up. Later on we make root = fs_info->tree_root so we need this separate
out label to make sure we do the right cleanup only in the case we're
looking up a different root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We look up an arbitrary fs root, we need to hold a ref on it while we're
doing our search.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We lookup a arbitrary fs root, we need to hold a ref on that root. If
we're using our own inodes root then grab a ref on that as well to make
the cleanup easier.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're creating the new root here, but we should hold the ref until after
we've initialized the inode for it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Looking up the inode from an arbitrary tree means we need to hold a ref
on that root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are looking up an arbitrary inode, we need to hold a ref on the root
while we're doing this.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Looking up the inode we need to search the root, make sure we hold a
reference on that root while we're doing the lookup.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're looking up a random root, we need to hold a ref on it while we're
using it.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the root is sitting in the radix tree, we should probably have a ref
for the radix tree. Grab a ref on the root when we insert it, and drop
it when it gets deleted.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add another comment to cover how the space reservation system works
generally. This covers the actual reservation flow, as well as how
flushing is handled.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
delalloc space reservation is tricky because it encompasses both data
and metadata. Make it clear what each side does, the general flow of
how space is moved throughout the lifetime of a write, and what goes
into the calculations.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a giant comment at the top of block-rsv.c describing generally
how block reserves work. It is purely about the block reserves
themselves, and nothing to do with how the actual reservation system
works.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We want to use this for dropping all roots, and in some error cases we
may not have a root, so handle this to make the cleanup code easier.
Make btrfs_grab_fs_root the same so we can use it in cases where the
root may not exist (like the quota root).
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the orphan cleanup stuff doesn't use this directly we can just
make them static.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All this does is call btrfs_get_fs_root() with check_ref == true. Just
use btrfs_get_fs_root() so we don't have a bunch of different helpers
that do the same thing.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All helpers should either be using btrfs_get_fs_root() or
btrfs_read_tree_root().
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Relocation has it's special roots, we don't want to save these in the
root cache either, so swap it to use btrfs_read_tree_root(). However
the reloc root does need REF_COWS set, so make sure we set it everywhere
we use this helper, as it no longer does the REF_COWS setting.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Tree-log uses btrfs_read_fs_root to load its log, but this just calls
btrfs_read_tree_root. We don't save the log roots in our root cache, so
just export this helper and use it in the logging code.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_find_orphan_roots has this weird thing where it looks up the root
in cache to see if it is there before just reading the root. But the
read it uses just reads the root, it doesn't do any of the init work, we
do that by hand here. But this is unnecessary, all we really want is to
see if the root still exists and add it to the dead roots list to be
cleaned up, otherwise we delete the orphan item.
Fix this by just using btrfs_get_fs_root directly with check_ref set to
false so we get the orphan root items. Then we just handle in cache and
out of cache roots the same, add them to the dead roots list and carry
on.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a helper for reading fs roots that just reads the fs root off
the disk and then sets REF_COWS and init's the inheritable flags. Move
this into btrfs_init_fs_root so we can later get rid of this helper and
consolidate all of the fs root reading into one helper.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no reason to not init the root at alloc time, and with later
patches it actually causes problems if we error out mounting the fs
before the tree_root is init'ed because we expect it to have a valid ref
count. Fix this by pushing __setup_root into btrfs_alloc_root.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have a safe way to update the isize, remove all of this code
as it's no longer needed.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have a safe way to update the i_size, replace all uses of
btrfs_ordered_update_i_size with btrfs_inode_safe_disk_i_size_write.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We want to use this everywhere we modify the file extent items
permanently. These include:
1) Inserting new file extents for writes and prealloc extents.
2) Truncating inode items.
3) btrfs_cont_expand().
4) Insert inline extents.
5) Insert new extents from log replay.
6) Insert a new extent for clone, as it could be past i_size.
7) Hole punching
For hole punching in particular it might seem it's not necessary because
anybody extending would use btrfs_cont_expand, however there is a corner
that still can give us trouble. Start with an empty file and
fallocate KEEP_SIZE 1M-2M
We now have a 0 length file, and a hole file extent from 0-1M, and a
prealloc extent from 1M-2M. Now
punch 1M-1.5M
Because this is past i_size we have
[HOLE EXTENT][ NOTHING ][PREALLOC]
[0 1M][1M 1.5M][1.5M 2M]
with an i_size of 0. Now if we pwrite 0-1.5M we'll increas our i_size
to 1.5M, but our disk_i_size is still 0 until the ordered extent
completes.
However if we now immediately truncate 2M on the file we'll just call
btrfs_cont_expand(inode, 1.5M, 2M), since our old i_size is 1.5M. If we
commit the transaction here and crash we'll expose the gap.
To fix this we need to clear the file extent mapping for the range that
we punched but didn't insert a corresponding file extent for. This will
mean the truncate will only get an disk_i_size set to 1M if we crash
before the finish ordered io happens.
I've written an xfstest to reproduce the problem and validate this fix.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to keep track of where we have file extents on disk, and thus
where it is safe to adjust the i_size to, we need to have a tree in
place to keep track of the contiguous areas we have file extents for.
Add helpers to use this tree, as it's not required for NO_HOLES file
systems. We will use this by setting DIRTY for areas we know we have
file extent item's set, and clearing it when we remove file extent items
for truncation.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were using btrfs_i_size_write(), which unconditionally jacks up
inode->disk_i_size. However since clone can operate on ranges we could
have pending ordered extents for a range prior to the start of our clone
operation and thus increase disk_i_size too far and have a hole with no
file extent.
Fix this by using the btrfs_ordered_update_i_size helper which will do
the right thing in the face of pending ordered extents outside of our
clone range.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfsctl was removed in 2012, now the function btrfs_control_ioctl()
is only used for devices ioctls. So update the comment.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Su Yue <Damenly_Su@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Relocation is one of the most complex part of btrfs, while it's also the
foundation stone for online resizing, profile converting.
For such a complex facility, we should at least have some introduction
to it.
This patch will add an basic introduction at pretty a high level,
explaining:
- What relocation does
- How relocation is done
Only mentioning how data reloc tree and reloc tree are involved in the
operation.
No details like the backref cache, or the data reloc tree contents.
- Which function to refer.
More detailed comments will be added for reloc tree creation, data reloc
tree creation and backref cache.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Each new element added to the mod seq list is always appended to the list,
and each one gets a sequence number coming from a counter which gets
incremented everytime a new element is added to the list (or a new node
is added to the tree mod log rbtree). Therefore the element with the
lowest sequence number is always the first element in the list.
So just remove the list iteration at btrfs_put_tree_mod_seq() that
computes the minimum sequence number in the list and replace it with
a check for the first element's sequence number.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The overview of btrfs dev-replace. It mentions some corner cases caused
by the write duplication and scrub based data copy.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ adjust wording ]
Signed-off-by: David Sterba <dsterba@suse.com>
Open code the xlog_state_want_sync logic in its two callers given that
this function is a trivial wrapper around xlog_state_switch_iclogs.
Move the lockdep assert into xlog_state_switch_iclogs to not lose this
debugging aid, and improve the comment that documents
xlog_state_switch_iclogs as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use the shutdown flag in the log to bypass xlog_state_clean_iclog
entirely in case of a shut down log.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Factor out a few self-contained helpers from xlog_state_clean_iclog, and
update the documentation so it primarily documents why things happens
instead of how.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We can just check for a shut down log all the way down in
xlog_cil_committed instead of passing the parameter. This means a
slight behavior change in that we now also abort log items if the
shutdown came in halfway into the I/O completion processing, which
actually is the right thing to do.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is no need to check for the ioerror state before the lock, as
the shutdown case is not a fast path. Also remove the call to force
shutdown the file system, as it must have been shut down already
for an iclog to be in the ioerror state. Also clean up the flow of
the function a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The only caller of xfs_log_release_iclog doesn't care about the return
value, so remove it. Also don't bother passing the mount pointer,
given that we can trivially derive it from the iclog.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Factor out the shared code to wait for a log force into a new helper.
This helper uses the XLOG_FORCED_SHUTDOWN check previous only used
by the unmount code over the equivalent iclog ioerror state used by
the other two functions.
There is a slight behavior change in that the force of the unmount
record is now accounted in the log force statistics.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xlog_cil_push is only called by xlog_cil_push_work, so merge the two
functions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
It turns out that there is one use case for programs being able to
write to swap devices, and that is the userspace hibernation code.
Quick fix: disable the S_SWAPFILE check if hibernation is configured.
Fixes: dc617f29db ("vfs: don't allow writes to swap files")
Reported-by: Domenico Andreoli <domenico.andreoli@linux.com>
Reported-by: Marian Klein <mkleinsoft@gmail.com>
Signed-off-by: Domenico Andreoli <domenico.andreoli@linux.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Sync removal of file is only used in case of a GFP_KERNEL kmalloc
failure at the cost of io_file_put::done and work flush, while a
glich like it can be handled at the call site without too much pain.
That said, what is proposed is to drop sync removing of file, and
the kink in neck as well.
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A case of task hung was reported by syzbot,
INFO: task syz-executor975:9880 blocked for more than 143 seconds.
Not tainted 5.6.0-rc6-syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor975 D27576 9880 9878 0x80004000
Call Trace:
schedule+0xd0/0x2a0 kernel/sched/core.c:4154
schedule_timeout+0x6db/0xba0 kernel/time/timer.c:1871
do_wait_for_common kernel/sched/completion.c:83 [inline]
__wait_for_common kernel/sched/completion.c:104 [inline]
wait_for_common kernel/sched/completion.c:115 [inline]
wait_for_completion+0x26a/0x3c0 kernel/sched/completion.c:136
io_queue_file_removal+0x1af/0x1e0 fs/io_uring.c:5826
__io_sqe_files_update.isra.0+0x3a1/0xb00 fs/io_uring.c:5867
io_sqe_files_update fs/io_uring.c:5918 [inline]
__io_uring_register+0x377/0x2c00 fs/io_uring.c:7131
__do_sys_io_uring_register fs/io_uring.c:7202 [inline]
__se_sys_io_uring_register fs/io_uring.c:7184 [inline]
__x64_sys_io_uring_register+0x192/0x560 fs/io_uring.c:7184
do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x49/0xbe
and bisect pointed to 05f3fb3c53 ("io_uring: avoid ring quiesce for
fixed file set unregister and update").
It is down to the order that we wait for work done before flushing it
while nobody is likely going to wake us up.
We can drop that completion on stack as flushing work itself is a sync
operation we need and no more is left behind it.
To that end, io_file_put::done is re-used for indicating if it can be
freed in the workqueue worker context.
Reported-and-Inspired-by: syzbot <syzbot+538d1957ce178382a394@syzkaller.appspotmail.com>
Signed-off-by: Hillf Danton <hdanton@sina.com>
Rename ->done to ->free_pfile
Signed-off-by: Jens Axboe <axboe@kernel.dk>
CEPH_OSDMAP_FULL/NEARFULL aren't set since mimic, so we need to consult
per-pool flags as well. Unfortunately the backwards compatibility here
is lacking:
- the change that deprecated OSDMAP_FULL/NEARFULL went into mimic, but
was guarded by require_osd_release >= RELEASE_LUMINOUS
- it was subsequently backported to luminous in v12.2.2, but that makes
no difference to clients that only check OSDMAP_FULL/NEARFULL because
require_osd_release is not client-facing -- it is for OSDs
Since all kernels are affected, the best we can do here is just start
checking both map flags and pool flags and send that to stable.
These checks are best effort, so take osdc->lock and look up pool flags
just once. Remove the FIXME, since filesystem quotas are checked above
and RADOS quotas are reflected in POOL_FLAG_FULL: when the pool reaches
its quota, both POOL_FLAG_FULL and POOL_FLAG_FULL_QUOTA are set.
Cc: stable@vger.kernel.org
Reported-by: Yanhu Cao <gmayyyha@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Sage Weil <sage@redhat.com>
This patch is used to fix the bug in collect_uncached_read_data()
that rc is automatically converted from a signed number to an
unsigned number when the CIFS asynchronous read fails.
It will cause ctx->rc is error.
Example:
Share a directory and create a file on the Windows OS.
Mount the directory to the Linux OS using CIFS.
On the CIFS client of the Linux OS, invoke the pread interface to
deliver the read request.
The size of the read length plus offset of the read request is greater
than the maximum file size.
In this case, the CIFS server on the Windows OS returns a failure
message (for example, the return value of
smb2.nt_status is STATUS_INVALID_PARAMETER).
After receiving the response message, the CIFS client parses
smb2.nt_status to STATUS_INVALID_PARAMETER
and converts it to the Linux error code (rdata->result=-22).
Then the CIFS client invokes the collect_uncached_read_data function to
assign the value of rdata->result to rc, that is, rc=rdata->result=-22.
The type of the ctx->total_len variable is unsigned integer,
the type of the rc variable is integer, and the type of
the ctx->rc variable is ssize_t.
Therefore, during the ternary operation, the value of rc is
automatically converted to an unsigned number. The final result is
ctx->rc=4294967274. However, the expected result is ctx->rc=-22.
Signed-off-by: Yilu Lin <linyilu@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
xfstests generic/228 checks if fallocate respect RLIMIT_FSIZE.
After fallocate mode 0 extending enabled, we can hit this failure.
Fix this by check the new file size with vfs helper, return
error if file size is larger then RLIMIT_FSIZE(ulimit -f).
This patch has been tested by LTP/xfstests aginst samba and
Windows server.
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
New transform header structures. See recent updates
to MS-SMB2 adding section 2.2.42.1 and 2.2.42.2
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Additional compression capabilities can now be negotiated and a
new compression algorithm. Add the flags for these.
See newly updated MS-SMB2 sections 3.1.4.4.1 and 2.2.3.1.3
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Leaving PF_MEMALLOC set when exiting a kthread causes it to remain set
during do_exit(). That can confuse things. For example, if BSD process
accounting is enabled and the accounting file has FS_SYNC_FL set and is
located on an ext4 filesystem without a journal, then do_exit() can end
up calling ext4_write_inode(). That triggers the
WARN_ON_ONCE(current->flags & PF_MEMALLOC) there, as it assumes
(appropriately) that inodes aren't written when allocating memory.
This was originally reported for another kernel thread, xfsaild() [1].
cifs_demultiplex_thread() also exits with PF_MEMALLOC set, so it's
potentially subject to this same class of issue -- though I haven't been
able to reproduce the WARN_ON_ONCE() via CIFS, since unlike xfsaild(),
cifs_demultiplex_thread() is sent SIGKILL before exiting, and that
interrupts the write to the BSD process accounting file.
Either way, leaving PF_MEMALLOC set is potentially problematic. Let's
clean this up by properly saving and restoring PF_MEMALLOC.
[1] https://lore.kernel.org/r/0000000000000e7156059f751d7b@google.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The warning we print on mount about how to use less secure dialects
(when the user does not specify a version on mount) is useful
but is noisy to print on every default mount, and can be changed
to a warn_once. Slightly updated the warning text as well to note
SMB3.1.1 which has been the default which is typically negotiated
(for a few years now) by most servers.
"No dialect specified on mount. Default has changed to a more
secure dialect, SMB2.1 or later (e.g. SMB3.1.1), from CIFS
(SMB1). To use the less secure SMB1 dialect to access old
servers which do not support SMB3.1.1 (or even SMB3 or SMB2.1)
specify vers=1.0 on mount."
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
fix warning [-Wunused-but-set-variable] at variable 'rc',
keeping the code readable.
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Since commit d0677992d2 ("cifs: add support for flock") added
support for flock, LTP/flock03[1] testcase started to fail.
This testcase is testing flock lock and unlock across fork.
The parent locks file and starts the child process, in which
it unlock the same fd and lock the same file with another fd
again. All the lock and unlock operation should succeed.
Now the child process does not actually unlock the file, so
the following lock fails. Fix this by allowing flock and OFD
lock go through the unlock routine, not skipping if the unlock
request comes from another process.
Patch has been tested by LTP/xfstests on samba and Windows
server, v3.11, with or without cache=none mount option.
[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/flock/flock03.c
Signed-off-by: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
See commit 349457ccf2
"Allow file systems to manually d_move() inside of ->rename()"
Lessens possibility of race conditions in rename
Signed-off-by: Steve French <stfrench@microsoft.com>
allows SMB2_open() callers to pass down a POSIX data buffer that will
trigger requesting POSIX create context and parsing the response into
the provided buffer.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
* add code to request POSIX info level
* parse dir entries and fill cifs_fattr to get correct inode data
since the POSIX payload is variable size the number of entries in a
FIND response needs to be computed differently.
Dirs and regular files are properly reported along with mode bits,
hardlink number, c/m/atime. No special files yet (see below).
Current experimental version of Samba with the extension unfortunately
has issues with wildcards and needs the following patch:
> --- i/source3/smbd/smb2_query_directory.c
> +++ w/source3/smbd/smb2_query_directory.c
> @@ -397,9 +397,7 @@ smbd_smb2_query_directory_send(TALLOC_CTX
> *mem_ctx,
> }
> }
>
> - if (!state->smbreq->posix_pathnames) {
> wcard_has_wild = ms_has_wild(state->in_file_name);
> - }
>
> /* Ensure we've canonicalized any search path if not a wildcard. */
> if (!wcard_has_wild) {
>
Also for special files despite reporting them as reparse point samba
doesn't set the reparse tag field. This patch will mark them as needing
re-evaluation but the re-evaluate code doesn't deal with it yet.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
* add new info level and structs for SMB2 posix extension
* add functions to parse and validate it
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
little progress on the posix create response.
* rename struct to create_posix_rsp to match with the request
create_posix context
* make struct packed
* pass smb info struct for parse_posix_ctxt to fill
* use smb info struct as param
* update TODO
What needs to be done:
SMB2_open() has an optional smb info out argument that it will fill.
Callers making use of this are:
- smb3_query_mf_symlink (need to investigate)
- smb2_open_file
Callers of smb2_open_file (via server->ops->open) are passing an
smbinfo struct but that struct cannot hold POSIX information. All the
call stack needs to be changed for a different info type. Maybe pass
SMB generic struct like cifs_fattr instead.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We really, really don't want people using insecure dialects
unless they realize what they are doing ...
Add mount warning if mounting with vers=1.0 (older SMB1/CIFS
dialect) instead of the default (SMB2.1 or later, typically
SMB3.1.1).
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
There are cases when we don't want to send the SMB2 flush operation
(e.g. when user specifies mount parm "nostrictsync") and it can be
a very expensive operation on the server. In most cases in order
to set mtime, we simply need to flush (write) the dirtry pages from
the client and send the writes to the server not also send a flush
protocol operation to the server.
Fixes: aa081859b1 ("cifs: flush before set-info if we have writeable handles")
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
cap_unix(ses) defaults to false for SMB2.
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
mod_delayed_work() is safer than queue_delayed_work() if there's a
chance that the work is already in the queue.
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This means it's consistently called and the callers don't need to
care about it.
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
For the case where we have a DFS path like below and we're currently
connected to targetA:
//dfsroot/link -> //targetA/share/foo, //targetB/share/bar
after failover, we should make sure to update cifs_sb->prepath so the
next operations will use the new prefix path "/bar".
Besides, in order to simplify the use of different prefix paths,
enforce CIFS_MOUNT_USE_PREFIX_PATH for DFS mounts so we don't have to
revalidate the root dentry every time we set a new prefix path.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Check the AT_STATX_FORCE_SYNC flag and force an attribute
revalidation if requested by the caller, and if the caller
specificies AT_STATX_DONT_SYNC only revalidate cached attributes
if required. In addition do not flush writes in getattr (which
can be expensive) if size or timestamps not requested by the
caller.
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
work->data and work->list are shared in union. io_wq_assign_next() sets
->data if a req having a linked_timeout, but then io-wq may want to use
work->list, e.g. to do re-enqueue of a request, so corrupting ->data.
->data is not necessary, just remove it and extract linked_timeout
through @link_list.
Fixes: 60cf46ae60 ("io-wq: hash dependent work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl53Vh0ACgkQxWXV+ddt
WDtfOQ//bbUyKXcdH0FBZOCEcJmegcK1eUFYqKrwR2bHGe5JRdLM8pAvjCcqmWeO
jtaRiFC4NSCqTIl3mkBUb+XmQtjZwixBUHRxJpuEO8zqawvFZXTqg/KJklNvi2rd
KdflSNia6KrozTT+B/lpwZ5emS+wSdj5XTZ6VGj4riwtphSfWAjOu+4cOASMeFu+
Gfn+N9xu0ZcR/6zO20xAg0Xz+WU2uj4EfeM35dtRP2bPLG0yOGmiYT15Ll9h74Wm
7F+28iNTQfYutAexGvUpiouanGXE+ka3TCsJg5LuVTpdKGraOVGEuX+RhsyoKQrB
E8bk91fbkLlooluhUC306iNA9/+RN/yFGtILX8JsgI2Od26ZuU01l/OHrc19MDIm
gw1w3PMsD/hXLsG5ba4QsIYOzXofSrPdWej29h/o5p0VEQrAoCJEpAi7fVsiJDR1
sx6kCodw5jYhVs1P6DdXO1pgjE7iFUmjUQCFkl40edPMLy/LwB99A4zNnCOwI0KZ
49CMWHDe+tXVJBTzPvtma/PycQHIxJYMf1f8ko9E4stB7HtfH4dnUERDkb1UwQ5n
aJgyhsCCnp/EJoPunUT7g9nLUdyu0Rtwknn3NascWZEieX2QhKEF5RcjAUSL+Hlo
jbGGvoLhG0nOtYkU7BNSQbL8wxPJEEAq8e6F4tWMcOkhX4pNZP8=
=YkB0
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two fixes.
The first is a regression: when dropping some incompat bits the
conditions were reversed. The other is a fix for rename whiteout
potentially leaving stack memory linked to a list"
* tag 'for-5.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix removal of raid[56|1c34} incompat flags after removing block group
btrfs: fix log context list corruption after rename whiteout error
Merge misc fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
x86/mm: split vmalloc_sync_all()
mm, slub: prevent kmalloc_node crashes and memory leaks
mm/mmu_notifier: silence PROVE_RCU_LIST warnings
epoll: fix possible lost wakeup on epoll_ctl() path
mm: do not allow MADV_PAGEOUT for CoW pages
mm, memcg: throttle allocators based on ancestral memory.high
mm, memcg: fix corruption on 64-bit divisor in memory.high throttling
page-flags: fix a crash at SetPageError(THP_SWAP)
mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case
memcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_event
After io_assign_current_work() of a linked work, it can be decided to
offloaded to another thread so doing io_wqe_enqueue(). However, until
next io_assign_current_work() it can be cancelled, that isn't handled.
Don't assign it, if it's not going to be executed.
Fixes: 60cf46ae60 ("io-wq: hash dependent work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This fixes possible lost wakeup introduced by commit a218cc4914.
Originally modifications to ep->wq were serialized by ep->wq.lock, but
in commit a218cc4914 ("epoll: use rwlock in order to reduce
ep_poll_callback() contention") a new rw lock was introduced in order to
relax fd event path, i.e. callers of ep_poll_callback() function.
After the change ep_modify and ep_insert (both are called on epoll_ctl()
path) were switched to ep->lock, but ep_poll (epoll_wait) was using
ep->wq.lock on wqueue list modification.
The bug doesn't lead to any wqueue list corruptions, because wake up
path and list modifications were serialized by ep->wq.lock internally,
but actual waitqueue_active() check prior wake_up() call can be
reordered with modifications of ep ready list, thus wake up can be lost.
And yes, can be healed by explicit smp_mb():
list_add_tail(&epi->rdlink, &ep->rdllist);
smp_mb();
if (waitqueue_active(&ep->wq))
wake_up(&ep->wp);
But let's make it simple, thus current patch replaces ep->wq.lock with
the ep->lock for wqueue modifications, thus wake up path always observes
activeness of the wqueue correcty.
Fixes: a218cc4914 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
Reported-by: Max Neunhoeffer <max@arangodb.com>
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Max Neunhoeffer <max@arangodb.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Christopher Kohlhoff <chris.kohlhoff@clearpool.io>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jes Sorensen <jes.sorensen@gmail.com>
Cc: <stable@vger.kernel.org> [5.1+]
Link: http://lkml.kernel.org/r/20200214170211.561524-1-rpenyaev@suse.de
References: https://bugzilla.kernel.org/show_bug.cgi?id=205933
Bisected-by: Max Neunhoeffer <max@arangodb.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl51dbQQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpiV3EADJHB2r2hTTEym5u1PbrEEVkjvdL6InU8lD
lFM7m2g6yZUncwm+aSZynHqAFY6Rd5Jk+gmYMuioi3ZxC2rs7jG1AOTpaeJYmhle
lzkjqSLtl+gdPMA9ydivk1UwILFjtZKG1JNc++tnCn3q7+eCkgnWAlq5b7idG2eF
BS0AEZP6Yz1zStTHLbHSB0StY8ovMIw0VaVQvguHLL9EBpbHmrs0cq3tipWkAyPR
2YwnXbxsJySukkwmBKxEWrGUYDze56jqJIqdFsOE0+WtGV+nk7OScPseXAaP4/+G
Vl23VNfryuZcsBUwI9tY1SzCFEXIwdXVGpCAYwQ/kU5WfvFpYaei+fXVNnL4kjR0
PfpA6XnMsZ3DzqgepmUd92sAA56ZtBxuGjqcSYlg/JwjvUHdpaZDkE2WLqkAMeUN
8A7cUw+R6XWQ2/y6ob7QvKiT/ZDR8GrYUl3EdGE3LhB1ZsvLXJDZpWipwQBzuk9R
vJJOkGst38rjsWnb+nfeLh3AsgjF14wo+2vQL4mKs24xKTIvadHsFAZjKLXZ93Wf
Vn58FaPOYIkjBidYLWb3dlO1ZR8S0803gohLkLV6adH8bCNCWxGTOR51DZLomAsb
nAUCEAJaZrOqaQAuJAFNNpS8+/da3AIF4HVd2EdZ1yFXU15y0+zIxtROjKzg+OxO
M3jC/Aet1Q==
=IMcu
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.6-20200320' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Two different fixes in here:
- Fix for a potential NULL pointer deref for links with async or
drain marked (Pavel)
- Fix for not properly checking RLIMIT_NOFILE for async punted
operations.
This affects openat/openat2, which were added this cycle, and
accept4. I did a full audit of other cases where we might check
current->signal->rlim[] and found only RLIMIT_FSIZE for buffered
writes and fallocate. That one is fixed and queued for 5.7 and
marked stable"
* tag 'io_uring-5.6-20200320' of git://git.kernel.dk/linux-block:
io_uring: make sure accept honor rlimit nofile
io_uring: make sure openat/openat2 honor rlimit nofile
io_uring: NULL-deref for IOSQE_{ASYNC,DRAIN}
We are incorrectly dropping the raid56 and raid1c34 incompat flags when
there are still raid56 and raid1c34 block groups, not when we do not any
of those anymore. The logic just got unintentionally broken after adding
the support for the raid1c34 modes.
Fix this by clear the flags only if we do not have block groups with the
respective profiles.
Fixes: 9c907446dc ("btrfs: drop incompat bit for raid1c34 after last block group is gone")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the previous fixes for number of files open checking, I added some
debug code to see if we had other spots where we're checking rlimit()
against the async io-wq workers. The only one I found was file size
checking, which we should also honor.
During write and fallocate prep, store the max file size and override
that for the current ask if we're in io-wq worker context.
Cc: stable@vger.kernel.org # 5.1+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just like commit 4022e7af86, this fixes the fact that
IORING_OP_ACCEPT ends up using get_unused_fd_flags(), which checks
current->signal->rlim[] for limits.
Add an extra argument to __sys_accept4_file() that allows us to pass
in the proper nofile limit, and grab it at request prep time.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dmitry reports that a test case shows that io_uring isn't honoring a
modified rlimit nofile setting. get_unused_fd_flags() checks the task
signal->rlimi[] for the limits. As this isn't easily inheritable,
provide a __get_unused_fd_flags() that takes the value instead. Then we
can grab it when the request is prepared (from the original task), and
pass that in when we do the async part part of the open.
Reported-by: Dmitry Kadashev <dkadashev@gmail.com>
Tested-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This new ioctl retrieves a file's encryption nonce, which is useful for
testing. See the corresponding fs/crypto/ patch for more details.
Link: https://lore.kernel.org/r/20200314205052.93294-3-ebiggers@kernel.org
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Add an ioctl FS_IOC_GET_ENCRYPTION_NONCE which retrieves the nonce from
an encrypted file or directory. The nonce is the 16-byte random value
stored in the inode's encryption xattr. It is normally used together
with the master key to derive the inode's actual encryption key.
The nonces are needed by automated tests that verify the correctness of
the ciphertext on-disk. Except for the IV_INO_LBLK_64 case, there's no
way to replicate a file's ciphertext without knowing that file's nonce.
The nonces aren't secret, and the existing ciphertext verification tests
in xfstests retrieve them from disk using debugfs or dump.f2fs. But in
environments that lack these debugging tools, getting the nonces by
manually parsing the filesystem structure would be very hard.
To make this important type of testing much easier, let's just add an
ioctl that retrieves the nonce.
Link: https://lore.kernel.org/r/20200314205052.93294-2-ebiggers@kernel.org
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl5zWDsACgkQ+7dXa6fL
C2upyg/+KFOmCLFEAgwRnBn4zDDcdDT9du25Duv2d/XfAo2Zx+Nbwm7jjKR/mrRZ
mRbcvb8qj92O4dzMCwcqDGpKT3xJmCZhxJQORBm55Bjme7tJDqXuQVYp1fZVy3Ka
XJS0jr4n5HTorW8iGSIPJmE76XpIPq0ANhPnLbq8wZELyw87K7+J5ZdHcnUh+myd
uKs8sIQ8PQZg6JBBj5wPRgrAkOFUTTINiUqy37ADIY1oZyzW1rUlAeAxVXV7Dnx7
G1HvlVaDw72G1XG4pn0pNBCdGJuNF0dG2zRbdjS+kGCmf6MB6x8e22JjWW9r+r9m
iJd4B2R/3V/kUn4i3B+jfOWD5DKzCW4lDixh9D2LzM16GUinYQTkrH9e8jMBBJGW
7p7X9Vl3o0Nt6NDVLmTKuyomvvtT/jMYiDtKjPuvxlPCGduXB8HvNRFxsKIEVRHi
4RcdTqUSOsyUnOvTfDTfyBu1srKFqTC3HzAunntV88UfGtWdhXRCWMejHdNK3uI9
BC4Ym6jkmFnbQzytW/6noprvVlDfgAuyplcyhnnJ5fVNm4YQ7lZLZPgf5TS+gchI
fMwDfRz3hOLDZ5WjCx6QLT1NHaowQLTrzTq0X3uj2ZrcnRORURvk8GfamzZoS9a5
omyQgBfm+1YpF2VwCyU42DytdmFDUCDofKondOXh8QciwhXqaRs=
=bvsn
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-fixes-20200319' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc, afs: Interruptibility fixes
Here are a number of fixes for AF_RXRPC and AFS that make AFS system calls
less interruptible and so less likely to leave the filesystem in an
uncertain state. There's also a miscellaneous patch to make tracing
consistent.
(1) Firstly, abstract out the Tx space calculation in sendmsg. Much the
same code is replicated in a number of places that subsequent patches
are going to alter, including adding another copy.
(2) Fix Tx interruptibility by allowing a kernel service, such as AFS, to
request that a call be interruptible only when waiting for a call slot
to become available (ie. the call has not taken place yet) or that a
call be not interruptible at all (e.g. when we want to do writeback
and don't want a signal interrupting a VM-induced writeback).
(3) Increase the minimum delay on MSG_WAITALL for userspace sendmsg() when
waiting for Tx buffer space as a 2*RTT delay is really small over 10G
ethernet and a 1 jiffy timeout might be essentially 0 if at the end of
the jiffy period.
(4) Fix some tracing output in AFS to make it consistent with rxrpc.
(5) Make sure aborted asynchronous AFS operations are tidied up properly
so we don't end up with stuck rxrpc calls.
(6) Make AFS client calls uninterruptible in the Rx phase. If we don't
wait for the reply to be fully gathered, we can't update the local VFS
state and we end up in an indeterminate state with respect to the
server.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl5xH1sACgkQiiy9cAdy
T1HzMgv/d27qMlDe1jrLgPY40FT6kjTfG6zKA8ikTg5LHt/esgqRrKsPTQVSVq/m
f6ZVGNlcTDfwAq+90Rw38hreUKRYCkkVWoCEE9SUkCqlg/3MVMorA72p9eDnp0/u
htADzvyBCNoMPJj1WGi5uyhGw58LBy5zWT4vibovGzEdlZ2Lv1qvVzyiGnju8ypy
2+0cgGhucQ8jfEAjqEP28T7nCT96+G0KJGqXX122+Mrx/agjGQ2xCCZRIH5ndVnp
VmaN7WxGQmN9AdLtsVgkrRa9VYtndspMzo7xUArrferlF/yLijvO2Lcu7o3QtH8N
RvLSc0qOD7eH3ETcAwvYd/luGH5OvvZDu4jHphK9KBz9GtGGRCKc7nxElv13S4LJ
27DG71x2XqTGmNoLmY57EZOtKVCsu6VBDlhq7u17RsYWDEurrvda0Nhe/Wo8P2yT
dESnNEX5YGi+nWIjvxwRGMJ7Gb1ZXLdjkJC5QNzDID4AZVHE678AxDR+ZjkHCYLE
Rsbsbmaw
=x6+U
-----END PGP SIGNATURE-----
Merge tag '5.6-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three small smb3 fixes, two for stable"
* tag '5.6-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: fiemap: do not return EINVAL if get nothing
CIFS: Increment num_remote_opens stats counter even in case of smb2_query_dir_first
cifs: potential unintitliazed error code in cifs_getattr()
We know the version is 3 if on a v5 file system. For earlier file
systems formats we always upgrade the remaining v1 inodes to v2 and
thus only use v2 inodes. Use the xfs_sb_version_has_large_dinode
helper to check if we deal with small or large dinodes, and thus
remove the need for the di_version field in struct icdinode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Only v5 file systems can have the reflink feature, and those will
always use the large dinode format. Remove the extra check for the
inode version.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
di_flags2 is initialized to zero for v4 and earlier file systems. This
means di_flags2 can only be non-zero for a v5 file systems, in which
case both the parent and child inodes can store the field. Remove the
extra di_version check, and also remove the rather pointless local
di_flags2 variable while at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The size of the dinode structure is only dependent on the file system
version, so instead of checking the individual inode version just use
the newly added xfs_sb_version_has_large_dinode helper, and simplify
various calling conventions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Add a new wrapper to check if a file system supports the v3 inode format
with a larger dinode core. Previously we used xfs_sb_version_hascrc for
that, which is technically correct but a little confusing to read.
Also move xfs_dinode_good_version next to xfs_sb_version_has_v3inode
so that we have one place that documents the superblock version to
inode version relationship.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Userspace should be able to monitor nfsd/clients/ to see when clients
come and go, but we're failing to send fsnotify events.
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
It's normal for a client to test a stateid from a previous instance,
e.g. after a network partition.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
There is measurable performance impact in some synthetic tests due to
commit 6d390e4b5d (locks: fix a potential use-after-free problem when
wakeup a waiter). Fix the race condition instead by clearing the
fl_blocker pointer after the wake_up, using explicit acquire/release
semantics.
This does mean that we can no longer use the clearing of fl_blocker as
the wait condition, so switch the waiters over to checking whether the
fl_blocked_member list_head is empty.
Reviewed-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Fixes: 6d390e4b5d (locks: fix a potential use-after-free problem when wakeup a waiter)
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
AIL removal of the quotaoff start intent and free of both quotaoff
intents is currently limited to the ->iop_committed() handler of the
end intent. This executes when the end intent is committed to the
on-disk log and marks the completion of the operation. The problem
with this is it assumes the success of the operation. If a shutdown
or other error occurs during the quotaoff, it's possible for the
quotaoff task to exit without removing the start intent from the
AIL. This results in an unmount hang as the AIL cannot be emptied.
Further, no other codepath frees the intents and so this is also a
memory leak vector.
First, update the high level quotaoff error path to directly remove
and free the quotaoff start intent if it still exists in the AIL at
the time of the error. Next, update both of the start and end
quotaoff intents with an ->iop_release() callback to properly handle
transaction abort.
This means that If the quotaoff start transaction aborts, it frees
the start intent in the transaction commit path. If the filesystem
shuts down before the end transaction allocates, the quotaoff
sequence removes and frees the start intent. If the end transaction
aborts, it removes the start intent and frees both. This ensures
that a shutdown does not result in a hung unmount and that memory is
not leaked regardless of when a quotaoff error occurs.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
AIL removal of the quotaoff start intent and free of both intents is
hardcoded to the ->iop_committed() handler of the end intent. Factor
out the start intent handling code so it can be used in a future
patch to properly handle quotaoff errors. Use xfs_trans_ail_remove()
instead of the _delete() variant to acquire the AIL lock and also
handle cases where an intent might not reside in the AIL at the
time of a failure.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Add support for btree staging cursors for the rmap btrees. This is
needed both for online repair and also to convert xfs_repair to use
btree bulk loading.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add support for btree staging cursors for the refcount btrees. This
is needed both for online repair and also to convert xfs_repair to use
btree bulk loading.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add support for btree staging cursors for the inode btrees. This
is needed both for online repair and also to convert xfs_repair to use
btree bulk loading.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add support for btree staging cursors for the free space btrees. This
is needed both for online repair and also to convert xfs_repair to use
btree bulk loading.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add a new btree function that enables us to bulk load a btree cursor.
This will be used by the upcoming online repair patches to generate new
btrees. This avoids the programmatic inefficiency of calling
xfs_btree_insert in a loop (which generates a lot of log traffic) in
favor of stamping out new btree blocks with ordered buffers, and then
committing both the new root and scheduling the removal of the old btree
blocks in a single transaction commit.
The design of this new generic code is based off the btree rebuilding
code in xfs_repair's phase 5 code, with the explicit goal of enabling us
to share that code between scrub and repair. It has the additional
feature of being able to control btree block loading factors.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Create an in-core fake root for inode-rooted btree types so that callers
can generate a whole new btree using the upcoming btree bulk load
function without making the new tree accessible from the rest of the
filesystem. It is up to the individual btree type to provide a function
to create a staged cursor (presumably with the appropriate callouts to
update the fakeroot) and then commit the staged root back into the
filesystem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Create an in-core fake root for AG-rooted btree types so that callers
can generate a whole new btree using the upcoming btree bulk load
function without making the new tree accessible from the rest of the
filesystem. It is up to the individual btree type to provide a function
to create a staged cursor (presumably with the appropriate callouts to
update the fakeroot) and then commit the staged root back into the
filesystem.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add a xbitmap_hweight helper function so that we can get rid of the
open-coded loop.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Shorten the name of xfs_bitmap to xbitmap since the scrub bitmap has
nothing to do with the libxfs bitmap.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Remove the xfs_bitmap_destroy call from the end of xrep_reap_extents
because this sort of violates our rule that the function initializing a
structure should destroy it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Double 'three' exists in the comments of iomap_dio_rw.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Historically we only set the capacity to zero for devices that support
partitions (independ of actually having partitions created). Doing that
is rather inconsistent, but changing it broke legacy udisks polling for
legacy ide-cdrom devices. Use the crude a crude check for devices that
either are non-removable or partitionable to get the sane behavior for
most device while not breaking userspace for this particular setup.
Fixes: a1548b6744 ("block: move rescan_partitions to fs/block_dev.c")
Reported-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No one checks the return value of debugfs_create_file_size, as it's not
needed, so make the return value void, so that no one tries to do so in
the future.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20200309163640.237984-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If we call fiemap on a truncated file with none blocks allocated,
it makes sense we get nothing from this call. No output means
no blocks have been counted, but the call succeeded. It's a valid
response.
Simple example reproducer:
xfs_io -f 'truncate 2M' -c 'fiemap -v' /cifssch/testfile
xfs_io: ioctl(FS_IOC_FIEMAP) ["/cifssch/testfile"]: Invalid argument
Signed-off-by: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
The num_remote_opens counter keeps track of the number of open files which must be
maintained by the server at any point. This is a per-tree-connect counter, and the value
of this counter gets displayed in the /proc/fs/cifs/Stats output as a following...
Open files: 0 total (local), 1 open on server
^^^^^^^^^^^^^^^^
As a thumb-rule, we want to increment this counter for each open/create that we
successfully execute on the server. Similarly, we should decrement the counter when
we successfully execute a close.
In this case, an increment was being missed in case of smb2_query_dir_first,
in case of successful open. As a result, we would underflow the counter and we
could even see the counter go to negative after sufficient smb2_query_dir_first calls.
I tested the stats counter for a bunch of filesystem operations with the fix.
And it looks like the counter looks correct to me.
I also check if we missed the increments and decrements elsewhere. It does not
seem so. Few other cases where an open is done and we don't increment the counter are
the compound calls where the corresponding close is also sent in the request.
Signed-off-by: Shyam Prasad N <nspmangalore@gmail.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Smatch complains that "rc" could be uninitialized.
fs/cifs/inode.c:2206 cifs_getattr() error: uninitialized symbol 'rc'.
Changing it to "return 0;" improves readability as well.
Fixes: cc1baf98c8f6 ("cifs: do not ignore the SYNC flags in getattr")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
When I lifted the code in xfs_alloc_ag_vextent_lastblock out of a loop,
I forgot to convert all the accesses to len to be pointer dereferences.
Coverity-id: 1457918
Fixes: 5113f8ec37 ("xfs: clean up weird while loop in xfs_alloc_ag_vextent_near")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
User extended attributes are useful as metadata storage for kernfs
consumers like cgroups. Especially in the case of cgroups, it is useful
to have a central metadata store that multiple processes/services can
use to coordinate actions.
A concrete example is for userspace out of memory killers. We want to
let delegated cgroup subtree owners (running as non-root) to be able to
say "please avoid killing this cgroup". This is especially important for
desktop linux as delegated subtrees owners are less likely to run as
root.
This patch introduces a new flag, KERNFS_ROOT_SUPPORT_USER_XATTR, that
lets kernfs consumers enable user xattr support. An initial limit of 128
entries or 128KB -- whichever is hit first -- is placed per cgroup
because xattrs come from kernel memory and we don't want to let
unprivileged users accidentally eat up too much kernel memory.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
This helps set up size accounting in the next commit. Without this out
param, it's difficult to find out the removed xattr size without taking
a lock for longer and walking the xattr linked list twice.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
xattr values have a 64k maximum size. This can result in an order 4
kmalloc request which can be difficult to fulfill. Since xattrs do not
need physically contiguous memory, we can switch to kvmalloc and not
have to worry about higher order allocations failing.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
It's meant to be write-only.
Fixes: 89c905becc ("nfsd: allow forced expiration of NFSv4 clients")
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
yuehaibing@huawei.com reports the following build errors arise when
CONFIG_NFSD_V4_2_INTER_SSC is set and the NFS client is not built
into the kernel:
fs/nfsd/nfs4proc.o: In function `nfsd4_do_copy':
nfs4proc.c:(.text+0x23b7): undefined reference to `nfs42_ssc_close'
fs/nfsd/nfs4proc.o: In function `nfsd4_copy':
nfs4proc.c:(.text+0x5d2a): undefined reference to `nfs_sb_deactive'
fs/nfsd/nfs4proc.o: In function `nfsd4_do_async_copy':
nfs4proc.c:(.text+0x61d5): undefined reference to `nfs42_ssc_open'
nfs4proc.c:(.text+0x6389): undefined reference to `nfs_sb_deactive'
The new inter-server copy code invokes client functions. Until the
NFS server has infrastructure to load the appropriate NFS client
modules to handle inter-server copy requests, let's constrain the
way this feature is built.
Reported-by: YueHaibing <yuehaibing@huawei.com>
Fixes: ce0887ac96 ("NFSD add nfs4 inter ssc to nfsd4_copy")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: YueHaibing <yuehaibing@huawei.com> # build-tested
If the rpc.mountd daemon goes down, then that should not cause all
exports to start failing with ESTALE errors. Let's explicitly
distinguish between the cache upcall cases that need to time out,
and those that do not.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add tracing to allow us to figure out where any stale filehandle issues
may be originating from.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
In NFSv4, the lock stateids are tied to the lockowner, and the open stateid,
so that the action of closing the file also results in either an automatic
loss of the locks, or an error of the form NFS4ERR_LOCKS_HELD.
In practice this means we must not add new locks to the open stateid
after the close process has been invoked. In fact doing so, can result
in the following panic:
kernel BUG at lib/list_debug.c:51!
invalid opcode: 0000 [#1] SMP NOPTI
CPU: 2 PID: 1085 Comm: nfsd Not tainted 5.6.0-rc3+ #2
Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.14410784.B64.1908150010 08/15/2019
RIP: 0010:__list_del_entry_valid.cold+0x31/0x55
Code: 1a 3d 9b e8 74 10 c2 ff 0f 0b 48 c7 c7 f0 1a 3d 9b e8 66 10 c2 ff 0f 0b 48 89 f2 48 89 fe 48 c7 c7 b0 1a 3d 9b e8 52 10 c2 ff <0f> 0b 48 89 fe 4c 89 c2 48 c7 c7 78 1a 3d 9b e8 3e 10 c2 ff 0f 0b
RSP: 0018:ffffb296c1d47d90 EFLAGS: 00010246
RAX: 0000000000000054 RBX: ffff8ba032456ec8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff8ba039e99cc8 RDI: ffff8ba039e99cc8
RBP: ffff8ba032456e60 R08: 0000000000000781 R09: 0000000000000003
R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ba009a4abe0
R13: ffff8ba032456e8c R14: 0000000000000000 R15: ffff8ba00adb01d8
FS: 0000000000000000(0000) GS:ffff8ba039e80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb213f0b008 CR3: 00000001347de006 CR4: 00000000003606e0
Call Trace:
release_lock_stateid+0x2b/0x80 [nfsd]
nfsd4_free_stateid+0x1e9/0x210 [nfsd]
nfsd4_proc_compound+0x414/0x700 [nfsd]
? nfs4svc_decode_compoundargs+0x407/0x4c0 [nfsd]
nfsd_dispatch+0xc1/0x200 [nfsd]
svc_process_common+0x476/0x6f0 [sunrpc]
? svc_sock_secure_port+0x12/0x30 [sunrpc]
? svc_recv+0x313/0x9c0 [sunrpc]
? nfsd_svc+0x2d0/0x2d0 [nfsd]
svc_process+0xd4/0x110 [sunrpc]
nfsd+0xe3/0x140 [nfsd]
kthread+0xf9/0x130
? nfsd_destroy+0x50/0x50 [nfsd]
? kthread_park+0x90/0x90
ret_from_fork+0x1f/0x40
The fix is to ensure that lock creation tests for whether or not the
open stateid is unhashed, and to fail if that is the case.
Fixes: 659aefb68e ("nfsd: Ensure we don't recognise lock stateids after freeing them")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svcrdma expects that the payload falls precisely into the xdr_buf
page vector. This does not seem to be the case for
nfsd4_encode_readv().
This code is called only when fops->splice_read is missing or when
RQ_SPLICE_OK is clear, so it's not a noticeable problem in many
common cases.
Add new transport method: ->xpo_read_payload so that when a READ
payload does not fit exactly in rq_res's page vector, the XDR
encoder can inform the RPC transport exactly where that payload is,
without the payload's XDR pad.
That way, when a Write chunk is present, the transport knows what
byte range in the Reply message is supposed to be matched with the
chunk.
Note that the Linux NFS server implementation of NFS/RDMA can
currently handle only one Write chunk per RPC-over-RDMA message.
This simplifies the implementation of this fix.
Fixes: b042098063 ("nfsd4: allow exotic read compounds")
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=198053
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
list_for_each_entry_rcu() has built-in RCU and lock checking.
Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
list_for_each_entry_rcu() has built-in RCU and lock checking.
Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Currently, nfsd4_encode_exchange_id() encodes the utsname nodename
string in the server_scope field. In a multi-host container
environemnt, if an nfsd container is restarted on a different host than
it was originally running on, clients will see a server_scope mismatch
and will not attempt to reclaim opens.
Instead, set the server_scope while we're in a process context during
service startup, so we get the utsname nodename of the current process
and store that in nfsd_net.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
[bfields: fix up major_id too]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
prevent inodes from vanishing, but ihold() does not guarantee inode
persistence. Replace the inode pointer with a per boot, machine wide,
unique inode identifier. The second commit fixes the breakage of the hash
mechanism whihc causes a 100% performance regression.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5uQJsTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYp4EAC5fr/AyRaIn/AEIZUmoyK6ELUaknfH
Z788avxDB/t5GkzC9A2dMpybYi78tzLSAEfB8jYgwbrqExapVtiqvjGZ1RIi3HoN
f/DWLnOb2s+yYQ3BQlHu4RdKONEzCqBwKFpElGRv3JzCY8Qeh5cQBzdqzvOEFmYw
P7DJVtJRZ2dud7AzJ+xk6KuNIKCT2F7Djmtop6nq1EVw0J/2oYOVgQu76APBj7cj
32srLmpP4xcQiJmWLC5ksXKiZrMPnyNfwXhHFufNvJ2Re6+Wf8mmglqG/5DmA+Ns
Sq3L7D7yXwtWQZ8Po1qnWhPDZVXQbWzHyTn4YAMJAK7yoO7mut8jgECt+A8vf4L+
hsc41c6THfdCQQ9gmxLL+c08nZGlmvIC4/1RsihNZ3kd2o4k6Ah9xFp8lBFcpjWd
7tuhakNqJvUOvB34t2AYqzMFbZ/FJG+QNGyIW0bTUn4YIgRPxI/zsdMxqGVBZ4oN
0iuy1kPLGbGAnLU9thkiVMmAyaPesuiB6f+mmzobEUgGI35GrCJi6a4YaTG1sqFn
Gl8oPzcU2n+DWbVBfJrVFHJye7oi78kCw6wpNLBCJQp8NP8doAH0Sgspglg52E/p
G4GGLz0vGauHBC5wQ3WYiGLImWbzC1dwKdcNE7dhuTgXbhz8ChVlOSU9Fu4+pGpq
6URL6DVTLwDZPg==
=e2iB
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull futex fix from Thomas Gleixner:
"Fix for yet another subtle futex issue.
The futex code used ihold() to prevent inodes from vanishing, but
ihold() does not guarantee inode persistence. Replace the inode
pointer with a per boot, machine wide, unique inode identifier.
The second commit fixes the breakage of the hash mechanism which
causes a 100% performance regression"
* tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Unbreak futex hashing
futex: Fix inode life-time issue
If the xfs_buf_map array allocation in xfs_dabuf_map fails for whatever
reason, we bail out with error code zero. This will confuse callers, so
make sure that we return ENOMEM. Allocation failure should never happen
with the small size of the array, but code defensively anyway.
Fixes: 45feef8f50 ("xfs: refactor xfs_dabuf_map")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Enable io-wq hashing stuff for dependent works simply by re-enqueueing
such requests.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's a preparation patch removing io_wq_enqueue_hashed(), which
now should be done by io_wq_hash_work() + io_wq_enqueue().
Also, set hash value for dependant works, and do it as late as possible,
because req->file can be unavailable before. This hash will be ignored
by io-wq.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This little tweak restores the behaviour that was before the recent
io_worker_handle_work() optimisation patches. It makes the function do
cond_resched() and flush_signals() only if there is an actual work to
execute.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Processing links, io_submit_sqe() prepares requests, drops sqes, and
passes them with sqe=NULL to io_queue_sqe(). There IOSQE_DRAIN and/or
IOSQE_ASYNC requests will go through the same prep, which doesn't expect
sqe=NULL and fail with NULL pointer deference.
Always do full prepare including io_alloc_async_ctx() for linked
requests, and then it can skip the second preparation.
Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can use the variable allocated_clusters rather than map_from_clusters
to control reserved block/cluster accounting in ext4_ext_map_blocks.
This eliminates a variable and associated code and improves readability
a little.
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200311205125.25061-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now that the eofblocks code has been removed, we don't need to assign
0 to err before calling ext4_ext_insert_extent() since it will assign
a return value to ret anyway. The variable free_on_err can be
eliminated and replaced by a reference to allocated_clusters which
clearly conveys the idea that newly allocated blocks should be freed
when recovering from an extent insertion failure. The error handling
code itself should be restructured so that it errors out immediately on
an insertion failure in the case where no new blocks have been allocated
(bigalloc) rather than proceeding further into the mapping code. The
initializer for fb_flags can also be rearranged for improved
readability. Finally, insert a missing space in nearby code.
No known bugs are addressed by this patch - it's simply a cleanup.
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200311205033.25013-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: https://lore.kernel.org/r/20200309180813.GA3347@embeddedor
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: https://lore.kernel.org/r/20200309154838.GA31559@embeddedor
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch moves ext4_fiemap to use iomap framework.
For xattr a new 'ext4_iomap_xattr_ops' is added.
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/b9f45c885814fcdd0631747ff0fe08886270828c.1582880246.git.riteshh@linux.ibm.com
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
For indirect block mapping if the i_block > max supported block in inode
then ext4_ind_map_blocks() returns a -EIO error. But in case of fiemap
this could be a valid query to ->iomap_begin call.
So check if the offset >= s_bitmap_maxbytes in ext4_iomap_begin_report(),
then simply skip calling ext4_map_blocks().
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/87fa0ddc5967fa707656212a3b66a7233425325c.1582880246.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_iomap_begin is already implemented which provides ext4_map_blocks,
so just move the API from generic_block_bmap to iomap_bmap for iomap
conversion.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8bbd53bd719d5ccfecafcce93f2bf1d7955a44af.1582880246.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch avoids the memory alloc & free path when depth is 0,
since anyway there is no extra caching done in that case.
So on checking depth 0, simply return early.
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/93da0d0f073c73358e85bb9849d8a5378d1da539.1582880246.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
IOMAP_F_MERGED needs to be set in case of non-extent based mapping.
This is needed in later patches for conversion of ext4_fiemap to use iomap.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/a4764c91c08c16d4d4a4b36defb2a08625b0e9b3.1582880246.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-03-13
The following pull-request contains BPF updates for your *net-next* tree.
We've added 86 non-merge commits during the last 12 day(s) which contain
a total of 107 files changed, 5771 insertions(+), 1700 deletions(-).
The main changes are:
1) Add modify_return attach type which allows to attach to a function via
BPF trampoline and is run after the fentry and before the fexit programs
and can pass a return code to the original caller, from KP Singh.
2) Generalize BPF's kallsyms handling and add BPF trampoline and dispatcher
objects to be visible in /proc/kallsyms so they can be annotated in
stack traces, from Jiri Olsa.
3) Extend BPF sockmap to allow for UDP next to existing TCP support in order
in order to enable this for BPF based socket dispatch, from Lorenz Bauer.
4) Introduce a new bpftool 'prog profile' command which attaches to existing
BPF programs via fentry and fexit hooks and reads out hardware counters
during that period, from Song Liu. Example usage:
bpftool prog profile id 337 duration 3 cycles instructions llc_misses
4228 run_cnt
3403698 cycles (84.08%)
3525294 instructions # 1.04 insn per cycle (84.05%)
13 llc_misses # 3.69 LLC misses per million isns (83.50%)
5) Batch of improvements to libbpf, bpftool and BPF selftests. Also addition
of a new bpf_link abstraction to keep in particular BPF tracing programs
attached even when the applicaion owning them exits, from Andrii Nakryiko.
6) New bpf_get_current_pid_tgid() helper for tracing to perform PID filtering
and which returns the PID as seen by the init namespace, from Carlos Neira.
7) Refactor of RISC-V JIT code to move out common pieces and addition of a
new RV32G BPF JIT compiler, from Luke Nelson.
8) Add gso_size context member to __sk_buff in order to be able to know whether
a given skb is GSO or not, from Willem de Bruijn.
9) Add a new bpf_xdp_output() helper which reuses XDP's existing perf RB output
implementation but can be called from tracepoint programs, from Eelco Chaudron.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now the tail ends of follow_dotdot{,_rcu}() are pretty
much the open-coded analogues of step_into(). The differences:
* the lack of proper LOOKUP_NO_XDEV handling in non-RCU case
(arguably a bug)
* the lack of ->d_manage() handling (again, arguably a bug)
Adjust the calling conventions so that on the next step with could
just switch those functions to returning step_into().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Behaviour change: LOOKUP_BENEATH lookup of .. in absolute root
yields an error even if it's not the process' root. That's
possible only if you'd managed to escape chroot jail by way of
procfs symlinks, but IMO the resulting behaviour is not worse -
more consistent and easier to describe:
".." in root is "stay where you are", uness LOOKUP_BENEATH
has been given, in which case it's "fail with EXDEV".
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of returning 0, return new dentry; instead of returning
-ENOENT, return NULL. Adjust the callers accordingly.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>