Commit Graph

63582 Commits

Author SHA1 Message Date
Naohiro Aota
c4a816c67c btrfs: introduce chunk allocation policy
Introduce chunk allocation policy for btrfs. This policy controls how
chunks and device extents are allocated from devices.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:48 +01:00
Naohiro Aota
b25c19f49e btrfs: handle invalid profile in chunk allocation
Do not BUG_ON() when an invalid profile is passed to __btrfs_alloc_chunk().
Instead return -EINVAL with ASSERT() to catch a bug in the development
stage.

Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:48 +01:00
Naohiro Aota
52d40aba68 btrfs: change full_search to bool in find_free_extent_update_loop
While the "full_search" variable defined in find_free_extent() is bool,
but the full_search argument of find_free_extent_update_loop() is
defined as int. Let's trivially fix the argument type.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:47 +01:00
Qu Wenruo
daf475c915 btrfs: qgroup: Remove the unnecesaary spin lock for qgroup_rescan_running
After the previous patch, qgroup_rescan_running is protected by
btrfs_fs_info::qgroup_rescan_lock, thus no need for the extra spinlock.

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:47 +01:00
Qu Wenruo
d61acbbf54 btrfs: qgroup: ensure qgroup_rescan_running is only set when the worker is at least queued
[BUG]
There are some reports about btrfs wait forever to unmount itself, with
the following call trace:

  INFO: task umount:4631 blocked for more than 491 seconds.
        Tainted: G               X  5.3.8-2-default #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  umount          D    0  4631   3337 0x00000000
  Call Trace:
  ([<00000000174adf7a>] __schedule+0x342/0x748)
   [<00000000174ae3ca>] schedule+0x4a/0xd8
   [<00000000174b1f08>] schedule_timeout+0x218/0x420
   [<00000000174af10c>] wait_for_common+0x104/0x1d8
   [<000003ff804d6994>] btrfs_qgroup_wait_for_completion+0x84/0xb0 [btrfs]
   [<000003ff8044a616>] close_ctree+0x4e/0x380 [btrfs]
   [<0000000016fa3136>] generic_shutdown_super+0x8e/0x158
   [<0000000016fa34d6>] kill_anon_super+0x26/0x40
   [<000003ff8041ba88>] btrfs_kill_super+0x28/0xc8 [btrfs]
   [<0000000016fa39f8>] deactivate_locked_super+0x68/0x98
   [<0000000016fcb198>] cleanup_mnt+0xc0/0x140
   [<0000000016d6a846>] task_work_run+0xc6/0x110
   [<0000000016d04f76>] do_notify_resume+0xae/0xb8
   [<00000000174b30ae>] system_call+0xe2/0x2c8

[CAUSE]
The problem happens when we have called qgroup_rescan_init(), but
not queued the worker. It can be caused mostly by error handling.

	Qgroup ioctl thread		|	Unmount thread
----------------------------------------+-----------------------------------
					|
btrfs_qgroup_rescan()			|
|- qgroup_rescan_init()			|
|  |- qgroup_rescan_running = true;	|
|					|
|- trans = btrfs_join_transaction()	|
|  Some error happened			|
|					|
|- btrfs_qgroup_rescan() returns error	|
   But qgroup_rescan_running == true;	|
					| close_ctree()
					| |- btrfs_qgroup_wait_for_completion()
					|    |- running == true;
					|    |- wait_for_completion();

btrfs_qgroup_rescan_worker is never queued, thus no one is going to wake
up close_ctree() and we get a deadlock.

All involved qgroup_rescan_init() callers are:

- btrfs_qgroup_rescan()
  The example above. It's possible to trigger the deadlock when error
  happened.

- btrfs_quota_enable()
  Not possible. Just after qgroup_rescan_init() we queue the work.

- btrfs_read_qgroup_config()
  It's possible to trigger the deadlock. It only init the work, the
  work queueing happens in btrfs_qgroup_rescan_resume().
  Thus if error happened in between, deadlock is possible.

We shouldn't set fs_info->qgroup_rescan_running just in
qgroup_rescan_init(), as at that stage we haven't yet queued qgroup
rescan worker to run.

[FIX]
Set qgroup_rescan_running before queueing the work, so that we ensure
the rescan work is queued when we wait for it.

Fixes: 8d9eddad19 ("Btrfs: fix qgroup rescan worker initialization")
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
[ Change subject and cause analyse, use a smaller fix ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:47 +01:00
Andy Shevchenko
807fc790aa btrfs: switch to use new generic UUID API
There are new types and helpers that are supposed to be used in new code.

As a preparation to get rid of legacy types and API functions do
the conversion here.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:47 +01:00
Qu Wenruo
b3ff8f1d38 btrfs: Don't submit any btree write bio if the fs has errors
[BUG]
There is a fuzzed image which could cause KASAN report at unmount time.

  BUG: KASAN: use-after-free in btrfs_queue_work+0x2c1/0x390
  Read of size 8 at addr ffff888067cf6848 by task umount/1922

  CPU: 0 PID: 1922 Comm: umount Tainted: G        W         5.0.21 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
  Call Trace:
   dump_stack+0x5b/0x8b
   print_address_description+0x70/0x280
   kasan_report+0x13a/0x19b
   btrfs_queue_work+0x2c1/0x390
   btrfs_wq_submit_bio+0x1cd/0x240
   btree_submit_bio_hook+0x18c/0x2a0
   submit_one_bio+0x1be/0x320
   flush_write_bio.isra.41+0x2c/0x70
   btree_write_cache_pages+0x3bb/0x7f0
   do_writepages+0x5c/0x130
   __writeback_single_inode+0xa3/0x9a0
   writeback_single_inode+0x23d/0x390
   write_inode_now+0x1b5/0x280
   iput+0x2ef/0x600
   close_ctree+0x341/0x750
   generic_shutdown_super+0x126/0x370
   kill_anon_super+0x31/0x50
   btrfs_kill_super+0x36/0x2b0
   deactivate_locked_super+0x80/0xc0
   deactivate_super+0x13c/0x150
   cleanup_mnt+0x9a/0x130
   task_work_run+0x11a/0x1b0
   exit_to_usermode_loop+0x107/0x130
   do_syscall_64+0x1e5/0x280
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

[CAUSE]
The fuzzed image has a completely screwd up extent tree:

  leaf 29421568 gen 8 total ptrs 6 free space 3587 owner EXTENT_TREE
  refs 2 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 5938
          item 0 key (12587008 168 4096) itemoff 3942 itemsize 53
                  extent refs 1 gen 9 flags 1
                  ref#0: extent data backref root 5 objectid 259 offset 0 count 1
          item 1 key (12591104 168 8192) itemoff 3889 itemsize 53
                  extent refs 1 gen 9 flags 1
                  ref#0: extent data backref root 5 objectid 271 offset 0 count 1
          item 2 key (12599296 168 4096) itemoff 3836 itemsize 53
                  extent refs 1 gen 9 flags 1
                  ref#0: extent data backref root 5 objectid 259 offset 4096 count 1
          item 3 key (29360128 169 0) itemoff 3803 itemsize 33
                  extent refs 1 gen 9 flags 2
                  ref#0: tree block backref root 5
          item 4 key (29368320 169 1) itemoff 3770 itemsize 33
                  extent refs 1 gen 9 flags 2
                  ref#0: tree block backref root 5
          item 5 key (29372416 169 0) itemoff 3737 itemsize 33
                  extent refs 1 gen 9 flags 2
                  ref#0: tree block backref root 5

Note that leaf 29421568 doesn't have its backref in the extent tree.
Thus extent allocator can re-allocate leaf 29421568 for other trees.

In short, the bug is caused by:

- Existing tree block gets allocated to log tree
  This got its generation bumped.

- Log tree balance cleaned dirty bit of offending tree block
  It will not be written back to disk, thus no WRITTEN flag.

- Original owner of the tree block gets COWed
  Since the tree block has higher transid, no WRITTEN flag, it's reused,
  and not traced by transaction::dirty_pages.

- Transaction aborted
  Tree blocks get cleaned according to transaction::dirty_pages. But the
  offending tree block is not recorded at all.

- Filesystem unmount
  All pages are assumed to be are clean, destroying all workqueue, then
  call iput(btree_inode).
  But offending tree block is still dirty, which triggers writeback, and
  causes use-after-free bug.

The detailed sequence looks like this:

- Initial status
  eb: 29421568, header=WRITTEN bflags_dirty=0, page_dirty=0, gen=8,
      not traced by any dirty extent_iot_tree.

- New tree block is allocated
  Since there is no backref for 29421568, it's re-allocated as new tree
  block.
  Keep in mind that tree block 29421568 is still referred by extent
  tree.

- Tree block 29421568 is filled for log tree
  eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 << (gen bumped)
      traced by btrfs_root::dirty_log_pages

- Some log tree operations
  Since the fs is using node size 4096, the log tree can easily go a
  level higher.

- Log tree needs balance
  Tree block 29421568 gets all its content pushed to right, thus now
  it is empty, and we don't need it.
  btrfs_clean_tree_block() from __push_leaf_right() get called.

  eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
      traced by btrfs_root::dirty_log_pages

- Log tree write back
  btree_write_cache_pages() goes through dirty pages ranges, but since
  page of tree block 29421568 gets cleaned already, it's not written
  back to disk. Thus it doesn't have WRITTEN bit set.
  But ranges in dirty_log_pages are cleared.

  eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
      not traced by any dirty extent_iot_tree.

- Extent tree update when committing transaction
  Since tree block 29421568 has transid equal to running trans, and has
  no WRITTEN bit, should_cow_block() will use it directly without adding
  it to btrfs_transaction::dirty_pages.

  eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
      not traced by any dirty extent_iot_tree.

  At this stage, we're doomed. We have a dirty eb not tracked by any
  extent io tree.

- Transaction gets aborted due to corrupted extent tree
  Btrfs cleans up dirty pages according to transaction::dirty_pages and
  btrfs_root::dirty_log_pages.
  But since tree block 29421568 is not tracked by neither of them, it's
  still dirty.

  eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
      not traced by any dirty extent_iot_tree.

- Filesystem unmount
  Since all cleanup is assumed to be done, all workqueus are destroyed.
  Then iput(btree_inode) is called, expecting no dirty pages.
  But tree 29421568 is still dirty, thus triggering writeback.
  Since all workqueues are already freed, we cause use-after-free.

This shows us that, log tree blocks + bad extent tree can cause wild
dirty pages.

[FIX]
To fix the problem, don't submit any btree write bio if the filesytem
has any error.  This is the last safe net, just in case other cleanup
haven't caught catch it.

Link: https://github.com/bobfuzzer/CVE/tree/master/CVE-2019-19377
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:46 +01:00
Marcos Paulo de Souza
faf8f7b957 btrfs: ioctl: resize: only show message if size is changed
There is no point to inform the user about size change if there's none.
Update the message to conform to a commonly used format where the path
and devid are printed and also print old and new sizes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Marcos Paulo de Souza <marcos@mpdesouza.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance message ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:46 +01:00
Anand Jain
b82582d668 btrfs: slightly simplify global block reserve calculations
In btrfs_update_global_block_rsv the lines:

  num_bytes = block_rsv->size - block_rsv->reserved;
  block_rsv->reserved += num_bytes;

imply:

  block_rsv->reserved = block_rsv->size;

Assign block_rsv->size to block_rsv->reserved directly and reorder lines
so they match the other branch.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:46 +01:00
David Sterba
56e9f6ea32 btrfs: merge unlocking to common exit block in btrfs_commit_transaction
The tree_log_mutex and reloc_mutex locks are properly nested so we can
simplify error handling and add labels for them. This reduces line count
of the function.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:46 +01:00
David Sterba
15b6e8a83e btrfs: reduce pointer intdirections in btree_readpage_end_io_hook
All we need to read is checksum size from fs_info superblock, and
fs_info is provided by extent buffer so we can get rid of the wild
pointer indirections from page/inode/root.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:45 +01:00
David Sterba
b79ce3dddd btrfs: adjust delayed refs message level
The message seems to be for debugging and has little value for users.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:45 +01:00
David Sterba
1db45a35f0 btrfs: replace u_long type cast with unsigned long
We don't use the u_XX types anywhere, though they're defined.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:45 +01:00
David Sterba
eeb6f17200 btrfs: raid56: simplify sort_parity_stripes
Remove trivial comprator and open coded swap of two values.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:45 +01:00
David Sterba
7e8f19e50e btrfs: adjust message level for unrecognized mount option
An unrecognized option is a failure that should get user/administrator
attention, the info level is often below what gets logged, so make it
error.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:45 +01:00
David Sterba
42c9d0b524 btrfs: simplify parameters of btrfs_set_disk_extent_flags
All callers pass extent buffer start and length so the extent buffer
itself should work fine.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:45 +01:00
David Sterba
c4ac754198 btrfs: open code trivial helper btrfs_header_chunk_tree_uuid
The helper btrfs_header_chunk_tree_uuid follows naming convention of
other struct accessors but does something compeletly different. As the
offsetof calculation is clear in the context of extent buffer operations
we can remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:44 +01:00
David Sterba
9a8658e33d btrfs: open code trivial helper btrfs_header_fsid
The helper btrfs_header_fsid follows naming convention of other struct
accessors but does something compeletly different. As the offsetof
calculation is clear in the context of extent buffer operations we can
remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:44 +01:00
David Sterba
75fb2e9e49 btrfs: move mapping of block for discard to its caller
There's a simple forwarded call based on the operation that would better
fit the caller btrfs_map_block that's until now a trivial wrapper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:44 +01:00
David Sterba
ee787f9550 btrfs: use struct_size to calculate size of raid hash table
The struct_size macro does the same calculation and is safe regarding
overflows. Though we're not expecting them to happen, use the helper for
clarity.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:44 +01:00
Nikolay Borisov
dcc3eb9638 btrfs: convert snapshot/nocow exlcusion to drew lock
This patch removes all haphazard code implementing nocow writers
exclusion from pending snapshot creation and switches to using the drew
lock to ensure this invariant still holds.

'Readers' are snapshot creators from create_snapshot and 'writers' are
nocow writers from buffered write path or btrfs_setsize. This locking
scheme allows for multiple snapshots to happen while any nocow writers
are blocked, since writes to page cache in the nocow path will make
snapshots inconsistent.

So for performance reasons we'd like to have the ability to run multiple
concurrent snapshots and also favors readers in this case. And in case
there aren't pending snapshots (which will be the majority of the cases)
we rely on the percpu's writers counter to avoid cacheline contention.

The main gain from using the drew lock is it's now a lot easier to
reason about the guarantees of the locking scheme and whether there is
some silent breakage lurking.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:44 +01:00
Nikolay Borisov
2992df7326 btrfs: Implement DREW lock
A (D)ouble (R)eader (W)riter (E)xclustion lock is a locking primitive
that allows to have multiple readers or multiple writers but not
multiple readers and writers holding it concurrently.

The code is factored out from the existing open-coded locking scheme
used to exclude pending snapshots from nocow writers and vice-versa.
Current implementation actually favors Readers (that is snapshot
creaters) to writers (nocow writers of the filesystem).

The API provides lock/unlock/trylock for reads and writes.

Formal specification for TLA+ provided by Valentin Schneider is at
https://lore.kernel.org/linux-btrfs/2dcaf81c-f0d3-409e-cb29-733d8b3b4cc9@arm.com/

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:43 +01:00
Johannes Thumshirn
fd8efa818c btrfs: simplify error handling in __btrfs_write_out_cache()
The error cleanup gotos in __btrfs_write_out_cache() needlessly jump
back making the code less readable then needed.  Flatten them out so no
back-jump is necessary and the read flow is uninterrupted.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:43 +01:00
Johannes Thumshirn
1afb648e94 btrfs: use standard debug config option to enable free-space-cache debug prints
free-space-cache.c has it's own set of DEBUG ifdefs which need to be
turned on instead of the global CONFIG_BTRFS_DEBUG to print debug
messages about failed block-group writes.

Switch this over to CONFIG_BTRFS_DEBUG so we always see these messages
when running a debug kernel.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:43 +01:00
Johannes Thumshirn
7a195f6db9 btrfs: make the uptodate argument of io_ctl_add_pages() boolean
Make the uptodate argument of io_ctl_add_pages() boolean.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:43 +01:00
Johannes Thumshirn
831fa14f1e btrfs: use inode from io_ctl in io_ctl_prepare_pages
io_ctl_prepare_pages() gets a 'struct btrfs_io_ctl' as well as a 'struct
inode', but btrfs_io_ctl::inode points to the same struct inode as this is
assgined in io_ctl_init().

Use the inode form io_ctl to reduce the arguments of io_ctl_prepare_pages.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:43 +01:00
Marcos Paulo de Souza
949964c928 btrfs: add new BTRFS_IOC_SNAP_DESTROY_V2 ioctl
This ioctl will be responsible for deleting a subvolume using its id.
This can be used when a system has a file system mounted from a
subvolume, rather than the root file system, like below:

/
@subvol1/
@subvol2/
@subvol_default/

If only @subvol_default is mounted, we have no path to reach @subvol1
and @subvol2, thus no way to delete them. Current subvolume delete ioctl
takes a file handle point as argument, and if @subvol_default is
mounted, we can't reach @subvol1 and @subvol2 from the same mount point.

This patch introduces a new ioctl BTRFS_IOC_SNAP_DESTROY_V2 that takes
the extended structure with flags to allow to delete subvolume using
subvolid.

Now, we can use this new ioctl specifying the subvolume id and refer to
the same mount point. It doesn't matter which subvolume was mounted,
since we can reach to the desired one using the subvolume id, and then
delete it.

The full path to the subvolume id is resolved internally and access is
verified as if the subvolume was accessed by path.

The volume args v2 structure is extended to use the existing union for
subvolume id specification, that's valid in case the
BTRFS_SUBVOL_SPEC_BY_ID is set.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:42 +01:00
Marcos Paulo de Souza
c0c907a47d btrfs: export helpers for subvolume name/id resolution
The functions will be used outside of export.c and super.c to allow
resolving subvolume name from a given id, eg. for subvolume deletion by
id ioctl.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ split from the next patch ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:42 +01:00
David Sterba
748449cdbe btrfs: use ioctl args support mask for device delete
When the device remove v2 ioctl was added, the full support mask was
added to sanity check the flags. However this would allow to let the
subvolume related flags to be accepted. This is not supposed to happen.

Use the correct support mask, which means that now any of
BTRFS_SUBVOL_CREATE_ASYNC, BTRFS_SUBVOL_RDONLY or
BTRFS_SUBVOL_QGROUP_INHERIT will be rejected as ENOTSUPP. Though this is
a user-visible change, specifying subvolume flags for device deletion
does not make sense and there are hopefully no applications doing that.

Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:42 +01:00
David Sterba
673990dba3 btrfs: use ioctl args support mask for subvolume create/delete
Using the defined mask instead of flag enumeration in the ioctl handler
is preferred. No functional changes.

Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:42 +01:00
Jules Irenge
5ce48d0f0e btrfs: Add missing lock annotation for release_extent_buffer()
Sparse reports a warning at release_extent_buffer()
warning: context imbalance in release_extent_buffer() - unexpected unlock

The root cause is the missing annotation at release_extent_buffer()
Add the missing __releases(&eb->refs_lock) annotation

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:42 +01:00
Josef Bacik
75ec1db871 btrfs: set update the uuid generation as soon as possible
In my EIO stress testing I noticed I was getting forced to rescan the
uuid tree pretty often, which was weird.  This is because my error
injection stuff would sometimes inject an error after log replay but
before we loaded the UUID tree.  If log replay committed the transaction
it wouldn't have updated the uuid tree generation, but the tree was
valid and didn't change, so there's no reason to not update the
generation here.

Fix this by setting the BTRFS_FS_UPDATE_UUID_TREE_GEN bit immediately
after reading all the fs roots if the uuid tree generation matches the
fs generation.  Then any transaction commits that happen during mount
won't screw up our uuid tree state, forcing us to do needless uuid
rescans.

Fixes: 70f8017547 ("Btrfs: check UUID tree during mount if required")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:41 +01:00
Josef Bacik
c94bec2c61 btrfs: bail out of uuid tree scanning if we're closing
In doing my fsstress+EIO stress testing I started running into issues
where umount would get stuck forever because the uuid checker was
chewing through the thousands of subvolumes I had created.

We shouldn't block umount on this, simply bail if we're unmounting the
fs.  We need to make sure we don't mark the UUID tree as ok, so we only
set that bit if we made it through the whole rescan operation, but
otherwise this is completely safe.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:41 +01:00
Nikolay Borisov
97f4dd09da btrfs: make btrfs_check_uuid_tree private to disk-io.c
It's used only during filesystem mount as such it can be made private to
disk-io.c file. Also use the occasion to move btrfs_uuid_rescan_kthread
as btrfs_check_uuid_tree is its sole caller.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:41 +01:00
Nikolay Borisov
560b7a4aa2 btrfs: call btrfs_check_uuid_tree_entry directly in btrfs_uuid_tree_iterate
btrfs_uuid_tree_iterate is called from only once place and its 2nd
argument is always btrfs_check_uuid_tree_entry. Simplify
btrfs_uuid_tree_iterate's signature by removing its 2nd argument and
directly calling btrfs_check_uuid_tree_entry. Also move the latter into
uuid-tree.h. No functional changes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:41 +01:00
David Sterba
c17af96554 btrfs: raid56: simplify tracking of Q stripe presence
There are temporary variables tracking the index of P and Q stripes, but
none of them is really used as such, merely for determining if the Q
stripe is present. This leads to compiler warnings with
-Wunused-but-set-variable and has been reported several times.

fs/btrfs/raid56.c: In function ‘finish_rmw’:
fs/btrfs/raid56.c:1199:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
 1199 |  int p_stripe = -1;
      |      ^~~~~~~~
fs/btrfs/raid56.c: In function ‘finish_parity_scrub’:
fs/btrfs/raid56.c:2356:6: warning: variable ‘p_stripe’ set but not used [-Wunused-but-set-variable]
 2356 |  int p_stripe = -1;
      |      ^~~~~~~~

Replace the two variables with one that has a clear meaning and also get
rid of the warnings. The logic that verifies that there are only 2
valid cases is unchanged.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:41 +01:00
ethanwu
b25b0b871f btrfs: backref, use correct count to resolve normal data refs
With the following patches:

- btrfs: backref, only collect file extent items matching backref offset
- btrfs: backref, not adding refs from shared block when resolving normal backref
- btrfs: backref, only search backref entries from leaves of the same root

we only collect the normal data refs we want, so the imprecise upper
bound total_refs of that EXTENT_ITEM could now be changed to the count
of the normal backref entry we want to search.

Background and how the patches fit together:

Btrfs has two types of data backref.
For BTRFS_EXTENT_DATA_REF_KEY type of backref, we don't have the
exact block number. Therefore, we need to call resolve_indirect_refs.
It uses btrfs_search_slot to locate the leaf block. Then
we need to walk through the leaves to search for the EXTENT_DATA items
that have disk bytenr matching the extent item (add_all_parents).

When resolving indirect refs, we could take entries that don't
belong to the backref entry we are searching for right now.
For that reason when searching backref entry, we always use total
refs of that EXTENT_ITEM rather than individual count.

For example:
item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize
  extent refs 24 gen 7302 flags DATA
  shared data backref parent 394985472 count 10 #1
  extent data backref root 257 objectid 260 offset 1048576 count 3 #2
  extent data backref root 256 objectid 260 offset 65536 count 6 #3
  extent data backref root 257 objectid 260 offset 65536 count 5 #4

For example, when searching backref entry #4, we'll use total_refs
24, a very loose loop ending condition, instead of total_refs = 5.

But using total_refs = 24 is not accurate. Sometimes, we'll never find
all the refs from specific root.  As a result, the loop keeps on going
until we reach the end of that inode.

The first 3 patches, handle 3 different types refs we might encounter.
These refs do not belong to the normal backref we are searching, and
hence need to be skipped.

This patch changes the total_refs to correct number so that we could
end loop as soon as we find all the refs we want.

btrfs send uses backref to find possible clone sources, the following
is a simple test to compare the results with and without this patch:

 $ btrfs subvolume create /sub1
 $ for i in `seq 1 163840`; do
     dd if=/dev/zero of=/sub1/file bs=64K count=1 seek=$((i-1)) conv=notrunc oflag=direct
   done
 $ btrfs subvolume snapshot /sub1 /sub2
 $ for i in `seq 1 163840`; do
     dd if=/dev/zero of=/sub1/file bs=4K count=1 seek=$(((i-1)*16+10)) conv=notrunc oflag=direct
   done
 $ btrfs subvolume snapshot -r /sub1 /snap1
 $ time btrfs send /snap1 | btrfs receive /volume2

Without this patch:

real 69m48.124s
user 0m50.199s
sys  70m15.600s

With this patch:

real    1m59.683s
user    0m35.421s
sys     2m42.684s

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
[ add patchset cover letter with background and numbers ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:40 +01:00
ethanwu
cfc0eed0ec btrfs: backref, only search backref entries from leaves of the same root
We could have some nodes/leaves in subvolume whose owner are not the
that subvolume. In this way, when we resolve normal backrefs of that
subvolume, we should avoid collecting those references from these blocks.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:40 +01:00
ethanwu
ed58f2e66e btrfs: backref, don't add refs from shared block when resolving normal backref
All references from the block of SHARED_DATA_REF belong to that shared
block backref.

For example:

  item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize 95
      extent refs 24 gen 7302 flags DATA
      extent data backref root 257 objectid 260 offset 65536 count 5
      extent data backref root 258 objectid 265 offset 0 count 9
      shared data backref parent 394985472 count 10

Block 394985472 might be leaf from root 257, and the item obejctid and
(file_pos - file_extent_item::offset) in that leaf just happens to be
260 and 65536 which is equal to the first extent data backref entry.

Before this patch, when we resolve backref:

  root 257 objectid 260 offset 65536

we will add those refs in block 394985472 and wrongly treat those as the
refs we want.

Fix this by checking if the leaf we are processing is shared data
backref, if so, just skip this leaf.

Shared data refs added into preftrees.direct have all entry value = 0
(root_id = 0, key = NULL, level = 0) except parent entry.

Other refs from indirect tree will have key value and root id != 0, and
these values won't be changed when their parent is resolved and added to
preftrees.direct. Therefore, we could reuse the preftrees.direct and
search ref with all values = 0 except parent is set to avoid getting
those resolved refs block.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:40 +01:00
ethanwu
7ac8b88ee6 btrfs: backref, only collect file extent items matching backref offset
When resolving one backref of type EXTENT_DATA_REF, we collect all
references that simply reference the EXTENT_ITEM even though their
(file_pos - file_extent_item::offset) are not the same as the
btrfs_extent_data_ref::offset we are searching for.

This patch adds additional check so that we only collect references whose
(file_pos - file_extent_item::offset) == btrfs_extent_data_ref::offset.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:40 +01:00
Johannes Thumshirn
9da2b242e2 btrfs: remove buffer_heads form super block mirror integrity checking
The integrity checking code for the super block mirrors is the last
remaining user of buffer_heads, change it to using plain bios as well.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:40 +01:00
Johannes Thumshirn
59aaad503f btrfs: remove buffer_heads from btrfsic_process_written_block()
Now that the last caller of btrfsic_process_written_block() with
buffer_heads is gone, remove the buffer_head processing path from it as
well.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:40 +01:00
Johannes Thumshirn
61ecc5fc18 btrfs: remove btrfsic_submit_bh()
Now that the last use of btrfsic_submit_bh() is gone as the super block
is now written using bios, remove the function as well.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:39 +01:00
Johannes Thumshirn
314b6dd0ee btrfs: use bios instead of buffer_heads from super block writeout
Similar to the superblock read path, change the write path to using bios
and pages instead of buffer_heads. This allows us to skip over the
buffer_head code, for writing the superblock to disk.

This is based on a patch originally authored by Nikolay Borisov.

Co-developed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:39 +01:00
Johannes Thumshirn
8f32380d3f btrfs: use the page cache for super block reading
Super-block reading in BTRFS is done using buffer_heads. Buffer_heads
have some drawbacks, like not being able to propagate errors from the
lower layers.

Directly use the page cache for reading the super blocks from disk or
invalidating an on-disk super block. We have to use the page cache so to
avoid races between mkfs and udev. See also 6f60cbd3ae ("btrfs: access
superblock via pagecache in scan_one_device").

This patch unwraps the buffer head API and does not change the way the
super block is actually read.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:39 +01:00
Johannes Thumshirn
6fbceb9fa4 btrfs: reduce scope of btrfs_scratch_superblocks()
btrfs_scratch_superblocks() isn't used anywhere outside volumes.c so
remove it from the header file and mark it as static.  Also move it
above it's callers so we don't need a forward declaration.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:39 +01:00
Johannes Thumshirn
c514c9b10b btrfs: don't kmap() pages from block devices
Block device mappings are never in highmem so kmap() / kunmap() calls for
pages from block devices are unneeded. Use page_address() instead of
kmap() to get to the virtual addreses.

While we're at it, read_cache_page_gfp() doesn't return NULL on error,
only an ERR_PTR, so use IS_ERR() to check for errors.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:39 +01:00
Nikolay Borisov
f6d9abbc1f btrfs: Export btrfs_release_disk_super
Preparatory patch for removal of buffer_head usage in btrfs.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:38 +01:00
Filipe Manana
55ffaabe23 Btrfs: avoid unnecessary splits when setting bits on an extent io tree
When attempting to set bits on a range of an exent io tree that already
has those bits set we can end up splitting an extent state record, use
the preallocated extent state record, insert it into the red black tree,
do another search on the red black tree, merge the preallocated extent
state record with the previous extent state record, remove that previous
record from the red black tree and then free it. This is all unnecessary
work that consumes time.

This happens specifically at the following case at __set_extent_bit():

  $ cat -n fs/btrfs/extent_io.c
   957  static int __must_check
   958  __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
  (...)
  1044          /*
  1045           *     | ---- desired range ---- |
  1046           * | state |
  1047           *   or
  1048           * | ------------- state -------------- |
  1049           *
  (...)
  1060          if (state->start < start) {
  1061                  if (state->state & exclusive_bits) {
  1062                          *failed_start = start;
  1063                          err = -EEXIST;
  1064                          goto out;
  1065                  }
  1066
  1067                  prealloc = alloc_extent_state_atomic(prealloc);
  1068                  BUG_ON(!prealloc);
  1069                  err = split_state(tree, state, prealloc, start);
  1070                  if (err)
  1071                          extent_io_tree_panic(tree, err);
  1072
  1073                  prealloc = NULL;

So if our extent state represents a range from 0 to 1MiB for example, and
we want to set bits in the range 128KiB to 256KiB for example, and that
extent state record already has all those bits set, we end up splitting
that record, so we end up with extent state records in the tree which
represent the ranges from 0 to 128KiB and from 128KiB to 1MiB. This is
temporary because a subsequent iteration in that function will end up
merging the records.

The splitting requires using the preallocated extent state record, so
a future iteration that needs to do another split will need to allocate
another extent state record in an atomic context, something not ideal
that we try to avoid as much as possible. The splitting also requires
an insertion in the red black tree, and a subsequent merge will require
a deletion from the red black tree and freeing an extent state record.

This change just skips the splitting of an extent state record when it
already has all the bits the we need to set.

Setting a bit that is already set for a range is very common in the
inode's 'file_extent_tree' extent io tree for example, where we keep
setting the EXTENT_DIRTY bit every time we replace an extent.

This change also fixes a bug that happens after the recent patchset from
Josef that avoids having implicit holes after a power failure when not
using the NO_HOLES feature, more specifically the patch with the subject:

  "btrfs: introduce the inode->file_extent_tree"

This patch introduced an extent io tree per inode to keep track of
completed ordered extents and figure out at any time what is the safe
value for the inode's disk_i_size. This assumes that for contiguous
ranges in a file we always end up with a single extent state record in
the io tree, but that is not the case, as there is a short time window
where we can have two extent state records representing contiguous
ranges. When this happens we end setting up an incorrect value for the
inode's disk_i_size, resulting in data loss after a clean unmount
of the filesystem. The following example explains how this can happen.

Suppose we have an inode with an i_size and a disk_i_size of 1MiB, so in
the inode's file_extent_tree we have a single extent state record that
represents the range [0, 1MiB) with the EXTENT_DIRTY bit set. Then the
following steps happen:

1) A buffered write against file range [512KiB, 768KiB) is made. At this
   point delalloc was not flushed yet;

2) Deduplication from some other inode into this inode's range
   [128KiB, 256KiB) is made. This causes btrfs_inode_set_file_extent_range()
   to be called, from btrfs_insert_clone_extent(), to mark the range
   [128KiB, 256KiB) with EXTENT_DIRTY in the inode's file_extent_tree;

3) When btrfs_inode_set_file_extent_range() calls set_extent_bits(), we
   end up at __set_extent_bit(). In the first iteration of that function's
   loop we end up in the following branch:

   $ cat -n fs/btrfs/extent_io.c
    957  static int __must_check
    958  __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
   (...)
   1044          /*
   1045           *     | ---- desired range ---- |
   1046           * | state |
   1047           *   or
   1048           * | ------------- state -------------- |
   1049           *
   (...)
   1060          if (state->start < start) {
   1061                  if (state->state & exclusive_bits) {
   1062                          *failed_start = start;
   1063                          err = -EEXIST;
   1064                          goto out;
   1065                  }
   1066
   1067                  prealloc = alloc_extent_state_atomic(prealloc);
   1068                  BUG_ON(!prealloc);
   1069                  err = split_state(tree, state, prealloc, start);
   1070                  if (err)
   1071                          extent_io_tree_panic(tree, err);
   1072
   1073                  prealloc = NULL;
   (...)
   1089                  goto search_again;

   This splits the state record into two, one for range [0, 128KiB) and
   another for the range [128KiB, 1MiB). Both already have the EXTENT_DIRTY
   bit set. Then we jump to the 'search_again' label, where we unlock the
   the spinlock protecting the extent io tree before jumping to the
   'again' label to perform the next iteration;

4) In the meanwhile, delalloc is flushed, the ordered extent for the range
   [512KiB, 768KiB) is created and when it completes, at
   btrfs_finish_ordered_io(), it calls btrfs_inode_safe_disk_i_size_write()
   with a value of 0 for its 'new_size' argument;

5) Before the deduplication task currently at __set_extent_bit() moves to
   the next iteration, the task finishing the ordered extent calls
   find_first_extent_bit() through btrfs_inode_safe_disk_i_size_write()
   and gets 'start' set to 0 and 'end' set to 128KiB - because at this
   moment the io tree has two extent state records, one representing the
   range [0, 128KiB) and another representing the range [128KiB, 1MiB),
   both with EXTENT_DIRTY set. Then we set 'isize' to:

   isize = min(isize, end + 1)
         = min(1MiB, 128KiB - 1 + 1)
         = 128KiB

   Then we set the inode's disk_i_size to 128KiB (isize).

   After a clean unmount of the filesystem and mounting it again, we have
   the file with a size of 128KiB, and effectively lost all the data it
   had before in the range from 128KiB to 1MiB.

This change fixes that issue too, as we never end up splitting extent
state records when they already have all the bits we want set.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:38 +01:00
Josef Bacik
ab9b2c7b32 btrfs: handle logged extent failure properly
If we're allocating a logged extent we attempt to insert an extent
record for the file extent directly.  We increase
space_info->bytes_reserved, because the extent entry addition will call
btrfs_update_block_group(), which will convert the ->bytes_reserved to
->bytes_used.  However if we fail at any point while inserting the
extent entry we will bail and leave space on ->bytes_reserved, which
will trigger a WARN_ON() on umount.  Fix this by pinning the space if we
fail to insert, which is what happens in every other failure case that
involves adding the extent entry.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:38 +01:00
Qu Wenruo
e19221180d btrfs: relocation: Remove is_cowonly_root()
This function is only used in read_fs_root(), which is just a wrapper of
btrfs_get_fs_root().

For all the mentioned essential roots except log root tree,
btrfs_get_fs_root() has its own quick path to grab them from fs_info
directly, thus no need for key.offset modification.

For subvolume trees, btrfs_get_fs_root() with key.offset == -1 is
completely fine.

For log trees and log root tree, it's impossible to hit them, as for
relocation all backrefs are fetched from commit root, which never
records log tree blocks.

Log tree blocks either get freed in regular transaction commit, or
replayed at mount time. At runtime we should never hit an backref for
log tree in extent tree.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:38 +01:00
Nikolay Borisov
fe119a6eeb btrfs: switch to per-transaction pinned extents
This commit flips the switch to start tracking/processing pinned extents
on a per-transaction basis. It mostly replaces all references from
btrfs_fs_info::(pinned_extents|freed_extents[]) to
btrfs_transaction::pinned_extents.

Two notable modifications that warrant explicit mention are changing
clean_pinned_extents to get a reference to the previously running
transaction. The other one is removal of call to
btrfs_destroy_pinned_extent since transactions are going to be cleaned
in btrfs_cleanup_one_transaction.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:38 +01:00
Nikolay Borisov
45bb5d6ae9 btrfs: Factor out pinned extent clean up in btrfs_delete_unused_bgs
Next patch is going to refactor how pinned extents are tracked which
will necessitate changing this code. To ease that work and contain the
changes factor the code now in preparation, this will also help review.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:37 +01:00
Nikolay Borisov
f2fb72983b btrfs: Mark pinned log extents as excluded
In preparation to making pinned extents per-transaction ensure that log
such extents are always excluded from caching. To achieve this in
addition to marking them via btrfs_pin_extent_for_log_replay they also
need to be marked with btrfs_add_excluded_extent to prevent log tree
extent buffer being loaded by the free space caching thread. That's
required since log tree blocks are not recorded in the extent tree, hence
they always look free.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:37 +01:00
Nikolay Borisov
6b45f64172 btrfs: Pass transaction handle to write_pinned_extent_entries
Preparation for refactoring pinned extents tracking.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:37 +01:00
Nikolay Borisov
6690d07126 btrfs: Make pin_down_extent take transaction handle
All callers have a reference to a transaction handle so pass it to
pin_down_extent. This is the final step before switching pinned extent
tracking to a per-transaction basis.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:37 +01:00
Nikolay Borisov
9fce570454 btrfs: Make btrfs_pin_extent_for_log_replay take transaction handle
Preparation for refactoring pinned extents tracking.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:37 +01:00
Nikolay Borisov
7bfc100705 btrfs: Make btrfs_pin_reserved_extent take transaction handle
btrfs_pin_reserved_extent is now only called with a valid transaction so
exploit the fact to take a transaction. This is preparation for tracking
pinned extents on a per-transaction basis.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:37 +01:00
Nikolay Borisov
10e958d523 btrfs: Call btrfs_pin_reserved_extent only during active transaction
Calling btrfs_pin_reserved_extent makes sense only with a valid
transaction since pinned extents are processed from transaction commit
in btrfs_finish_extent_commit. In case of error it's sufficient to
adjust the reserved counter to account for log tree extents allocated in
the last transaction.

This commit moves btrfs_pin_reserved_extent to be called only with valid
transaction handle and otherwise uses the newly introduced
unaccount_log_buffer to adjust "reserved". If this is not done if a
failure occurs before transaction is committed WARN_ON are going to be
triggered on unmount. This was especially pronounced with generic/475
test.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:36 +01:00
Nikolay Borisov
6787bb9f35 btrfs: Introduce unaccount_log_buffer
This function correctly adjusts the reserved bytes occupied by a log
tree extent buffer. It will be used instead of calling
btrfs_pin_reserved_extent.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:36 +01:00
Nikolay Borisov
b25c36f84b btrfs: Make btrfs_pin_extent take trans handle
Preparation for switching pinned extent tracking to a per-transaction
basis.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:36 +01:00
Nikolay Borisov
f603bb94ab btrfs: Perform pinned cleanup directly in btrfs_destroy_delayed_refs
Having btrfs_destroy_delayed_refs call btrfs_pin_extent is problematic
for making pinned extents tracking per-transaction since
btrfs_trans_handle cannot be passed to btrfs_pin_extent in this context.
Additionally delayed refs heads pinned in btrfs_destroy_delayed_refs
are going to be handled very closely, in btrfs_destroy_pinned_extent.

To enable btrfs_pin_extent to take btrfs_trans_handle simply open code
it in btrfs_destroy_delayed_refs and call btrfs_error_unpin_extent_range
on the range. This enables us to do less work in
btrfs_destroy_pinned_extent and leaves btrfs_pin_extent being called in
contexts which have a valid btrfs_trans_handle.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:36 +01:00
Anand Jain
25864778bc btrfs: sysfs, unify handler name of devinfo/missing
The devinfo attribute handlers were added in 668e48af7a ("btrfs:
sysfs, add devid/dev_state kobject and device attributes") and the name
should contain _devinfo_, there's one that does not conform, so unify it
with the rest.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:36 +01:00
Anand Jain
f3cd2c5811 btrfs: sysfs, rename device_link add/remove functions
Since commit 668e48af7a ("btrfs: sysfs, add devid/dev_state kobject and
device attributes"), the functions btrfs_sysfs_add_device_link() and
btrfs_sysfs_rm_device_link() do more than just adding and removing the
device link as its name indicated. Rename them to be more specific
that's about the directory with the attirbutes

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:35 +01:00
Anand Jain
1f6087e69c btrfs: sysfs, use btrfs_sysfs_remove_fsid to celanup errors in add_fsid
We have one simple function btrfs_sysfs_remove_fsid() to undo
btrfs_sysfs_add_fsid(), which also does proper checks before releasing
objects.

One difference, if btrfs_sysfs_remove_fsid is used that now we also call
kobject_del() which was missing before. This was tested (with kobject
debug turned on) and no change in behaviour was found.

This is a cleanup patch.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:35 +01:00
David Sterba
f657a31c86 btrfs: sink argument tree to __do_readpage
The tree pointer can be safely read from the inode, use it and drop the
redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:35 +01:00
David Sterba
b6660e80f1 btrfs: sink arugment tree to contiguous_readpages
The tree pointer can be safely read from the inode, use it and drop the
redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:35 +01:00
David Sterba
0d44fea77e btrfs: sink argument tree to __extent_read_full_page
The tree pointer can be safely read from the inode, use it and drop the
redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:35 +01:00
David Sterba
71ad38b44e btrfs: sink argument tree to extent_read_full_page
The tree pointer can be safely read from the page's inode, use it and
drop the redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:35 +01:00
David Sterba
b272ae22ac btrfs: drop argument tree from btrfs_lock_and_flush_ordered_range
The tree pointer can be safely read from the inode so we can drop the
redundant argument from btrfs_lock_and_flush_ordered_range.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:34 +01:00
David Sterba
ae6957ebbf btrfs: add assertions for tree == inode->io_tree to extent IO helpers
Add assertions to all helpers that get tree as argument and verify that
it's the same that can be obtained from the inode or from its pages. In
followup patches the redundant arguments and assertions will be removed
one by one.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:34 +01:00
David Sterba
0ceb34bf46 btrfs: drop argument tree from submit_extent_page
Now that we're sure the tree from argument is same as the one we can get
from the page's inode io_tree, drop the redundant argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:34 +01:00
David Sterba
45b08405b9 btrfs: remove extent_page_data::tree
All functions that set up extent_page_data::tree set it to the inode
io_tree. That's passed down the callstack that accesses either the same
inode or its pages. In the end submit_extent_page can pull the tree out
of the page and we don't have to store it in the structure.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:34 +01:00
David Sterba
bf31f87f71 btrfs: add wrapper for transaction abort predicate
The status of aborted transaction can change between calls and it needs
to be accessed by READ_ONCE. Add a helper that also wraps the unlikely
hint.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:34 +01:00
David Sterba
b908c334e7 btrfs: move root node locking helpers to locking.c
The helpers are related to locking so move them there, update comments.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:33 +01:00
Josef Bacik
0024652895 btrfs: rename btrfs_put_fs_root and btrfs_grab_fs_root
We are now using these for all roots, rename them to btrfs_put_root()
and btrfs_grab_root();

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:33 +01:00
Josef Bacik
bd647ce385 btrfs: add a leak check for roots
Now that we're going to start relying on getting ref counting right for
roots, add a list to track allocated roots and print out any roots that
aren't freed up at free_fs_info time.

Hide this behind CONFIG_BTRFS_DEBUG because this will just be used for
developers to verify they aren't breaking things.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:33 +01:00
Josef Bacik
8260edba67 btrfs: make the init of static elements in fs_info separate
In adding things like eb leak checking and root leak checking there were
a lot of weird corner cases that come from the fact that

  1) We do not init the fs_info until we get to open_ctree time in the
     normal case and

  2) The test infrastructure half-init's the fs_info for things that it
     needs.

This makes it really annoying to make changes because you have to add
init in two different places, have special cases for testing fs_info's
that may not have certain things initialized, and cases for fs_info's
that didn't make it to open_ctree and thus are not fully set up.

Fix this by extracting out the non-allocating init of the fs info into
it's own public function and use that to make sure we're all getting
consistent views of an allocated fs_info.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:33 +01:00
Josef Bacik
ae18c37ad5 btrfs: move fs_info init work into it's own helper function
open_ctree mixes initialization of fs stuff and fs_info stuff, which
makes it confusing when doing things like adding the root leak
detection.  Make a separate function that inits all the static
structures inside of the fs_info needed for the fs to operate, and then
call that before we start setting up the fs_info to be mounted.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:33 +01:00
Josef Bacik
141386e1a5 btrfs: free more things in btrfs_free_fs_info
Things like the percpu_counters, the mapping_tree, and the csum hash can
all be freed at btrfs_free_fs_info time, since the helpers all check if
the structure has been initialized already.  This significantly cleans
up the error cases in open_ctree.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:32 +01:00
Josef Bacik
bc44d7c4b2 btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root
Now that all callers of btrfs_get_fs_root are subsequently calling
btrfs_grab_fs_root and handling dropping the ref when they are done
appropriately, go ahead and push btrfs_grab_fs_root up into
btrfs_get_fs_root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:32 +01:00
Josef Bacik
81f096edf0 btrfs: use btrfs_put_fs_root to free roots always
If we are going to track leaked roots we need to free them all the same
way, so don't kfree() roots directly, use btrfs_put_fs_root.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:32 +01:00
Josef Bacik
4c78e9f596 btrfs: hold a ref on the root in open_ctree
We lookup the fs_root and put it in our fs_info directly, we should hold
a ref on this root for the lifetime of the fs_info.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:32 +01:00
Josef Bacik
0d4b046301 btrfs: export and rename free_fs_info
We're going to start freeing roots and doing other complicated things in
free_fs_info, so we need to move it to disk-io.c and export it in order
to use things lik btrfs_put_fs_root().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:32 +01:00
Josef Bacik
fbb0ce40d6 btrfs: hold a ref on the root in btrfs_check_uuid_tree_entry
We lookup the uuid of arbitrary subvolumes, hold a ref on the root while
we're doing this.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:31 +01:00
Josef Bacik
ca2037fba6 btrfs: hold a ref on the root in btrfs_recover_log_trees
We replay the log into arbitrary fs roots, hold a ref on the root while
we're doing this.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:31 +01:00
Josef Bacik
5119cfc36f btrfs: hold a ref on the root in create_pending_snapshot
We create the snapshot and then use it for a bunch of things, we need to
hold a ref on it while we're messing with it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:31 +01:00
Josef Bacik
5168489a07 btrfs: hold a ref on the root in get_subvol_name_from_objectid
We lookup the name of a subvol which means we'll cross into different
roots.  Hold a ref while we're doing the look ups in the fs_root we're
searching.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:31 +01:00
Josef Bacik
6f9a3da5da btrfs: hold a ref on the root in btrfs_ioctl_send
We lookup all the clone roots and the parent root for send, so we need
to hold refs on all of these roots while we're processing them.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:31 +01:00
Josef Bacik
fd79d43b34 btrfs: hold a ref on the root in scrub_print_warning_inode
We look up the root for the bytenr that is failing, so we need to hold a
ref on the root for that operation.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:30 +01:00
Josef Bacik
0b2dee5cff btrfs: hold a ref for the root in btrfs_find_orphan_roots
We lookup roots for every orphan item we have, we need to hold a ref on
the root while we're doing this work.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:30 +01:00
Josef Bacik
9f583209f2 btrfs: push grab_fs_root into read_fs_root
All of relocation uses read_fs_root to lookup fs roots, so push the
btrfs_grab_fs_root() up into that helper and remove the individual
calls.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:30 +01:00
Josef Bacik
932fd26df8 btrfs: hold a ref on the root in btrfs_recover_relocation
We look up the fs root in various places in here when recovering from a
crashed relcoation.  Make sure we hold a ref on the root whenever we
look them up.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:30 +01:00
Josef Bacik
76deacf023 btrfs: hold a ref on the root in create_reloc_inode
We're creating a reloc inode in the data reloc tree, we need to hold a
ref on the root while we're doing that.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:30 +01:00
Josef Bacik
3d7babdcf2 btrfs: hold a ref on the root in find_data_references
We're looking up the data references for the bytenr in a root, we need
to hold a ref on that root while we're doing that.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:30 +01:00
Josef Bacik
442b1ac524 btrfs: hold a ref on the root in record_reloc_root_in_trans
We are recording this root in the transaction, so we need to hold a ref
on it until we do that.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:29 +01:00
Josef Bacik
ab9737bd75 btrfs: hold a ref on the root in merge_reloc_roots
We look up the corresponding root for the reloc root, we need to hold a
ref while we're messing with it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:29 +01:00
Josef Bacik
db2c2ca2db btrfs: hold a ref on the root in prepare_to_merge
We look up the reloc roots corresponding root, we need to hold a ref on
that root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:29 +01:00
Josef Bacik
0b530bc5e1 btrfs: hold a ref on the root in build_backref_tree
This is trickier than the previous conversions.  We have backref_node's
that need to hold onto their root for their lifetime.  Do the read of
the root and grab the ref.  If at any point we don't use the root we
discard it, however if we use it in our backref node we don't free it
until we free the backref node.  Any time we switch the root's for the
backref node we need to drop our ref on the old root and grab the ref on
the new root, and if we dupe a node we need to get a ref on the root
there as well.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:29 +01:00
Josef Bacik
2a2b5d6202 btrfs: hold ref on root in btrfs_ioctl_default_subvol
We look up an arbitrary fs root here, we need to hold a ref on the root
for the duration.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:29 +01:00
Josef Bacik
04734e8448 btrfs: hold a ref on the root in btrfs_ioctl_get_subvol_info
We look up whatever root userspace has given us, we need to hold a ref
throughout this operation. Use 'root' only for the on fs root and not as
a temporary variable elsewhere.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:28 +01:00
Josef Bacik
b8a49ae191 btrfs: hold a ref on the root in btrfs_search_path_in_tree_user
We can wander into a different root, so grab a ref on the root we look
up.  Later on we make root = fs_info->tree_root so we need this separate
out label to make sure we do the right cleanup only in the case we're
looking up a different root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:28 +01:00
Josef Bacik
88234012be btrfs: hold a ref on the root in btrfs_search_path_in_tree
We look up an arbitrary fs root, we need to hold a ref on it while we're
doing our search.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:28 +01:00
Josef Bacik
3ca35e839e btrfs: hold a ref on the root in search_ioctl
We lookup a arbitrary fs root, we need to hold a ref on that root.  If
we're using our own inodes root then grab a ref on that as well to make
the cleanup easier.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:28 +01:00
Josef Bacik
fc92f79856 btrfs: hold a ref on the root in create_subvol
We're creating the new root here, but we should hold the ref until after
we've initialized the inode for it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:28 +01:00
Josef Bacik
8727002f79 btrfs: hold a ref on the root in fixup_tree_root_location
Looking up the inode from an arbitrary tree means we need to hold a ref
on that root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:28 +01:00
Josef Bacik
02162a0265 btrfs: hold a ref on the root in __btrfs_run_defrag_inode
We are looking up an arbitrary inode, we need to hold a ref on the root
while we're doing this.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:27 +01:00
Josef Bacik
bdf70b9e75 btrfs: hold a root ref in btrfs_get_dentry
Looking up the inode we need to search the root, make sure we hold a
reference on that root while we're doing the lookup.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:27 +01:00
Josef Bacik
9326f76f4b btrfs: hold a ref on the root in resolve_indirect_ref
We're looking up a random root, we need to hold a ref on it while we're
using it.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:27 +01:00
Josef Bacik
af01d2e53f btrfs: hold a ref on fs roots while they're in the radix tree
If the root is sitting in the radix tree, we should probably have a ref
for the radix tree.  Grab a ref on the root when we insert it, and drop
it when it gets deleted.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:27 +01:00
Josef Bacik
4b8b052888 btrfs: describe the space reservation system in general
Add another comment to cover how the space reservation system works
generally.  This covers the actual reservation flow, as well as how
flushing is handled.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:27 +01:00
Josef Bacik
6f4ad559ea btrfs: add a comment describing delalloc space reservation
delalloc space reservation is tricky because it encompasses both data
and metadata.  Make it clear what each side does, the general flow of
how space is moved throughout the lifetime of a write, and what goes
into the calculations.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:27 +01:00
Josef Bacik
734d8c15df btrfs: add a comment describing block reserves
This is a giant comment at the top of block-rsv.c describing generally
how block reserves work.  It is purely about the block reserves
themselves, and nothing to do with how the actual reservation system
works.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:26 +01:00
Josef Bacik
4cdfd93002 btrfs: handle NULL roots in btrfs_put/btrfs_grab_fs_root
We want to use this for dropping all roots, and in some error cases we
may not have a root, so handle this to make the cleanup code easier.
Make btrfs_grab_fs_root the same so we can use it in cases where the
root may not exist (like the quota root).

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:26 +01:00
Josef Bacik
a98db0f304 btrfs: make the fs root init functions static
Now that the orphan cleanup stuff doesn't use this directly we can just
make them static.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:26 +01:00
Josef Bacik
3619c94f07 btrfs: open code btrfs_read_fs_root_no_name
All this does is call btrfs_get_fs_root() with check_ref == true.  Just
use btrfs_get_fs_root() so we don't have a bunch of different helpers
that do the same thing.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:26 +01:00
Josef Bacik
83db2aadb3 btrfs: remove btrfs_read_fs_root, not used anymore
All helpers should either be using btrfs_get_fs_root() or
btrfs_read_tree_root().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:26 +01:00
Josef Bacik
3dbf1738a1 btrfs: make relocation use btrfs_read_tree_root()
Relocation has it's special roots, we don't want to save these in the
root cache either, so swap it to use btrfs_read_tree_root().  However
the reloc root does need REF_COWS set, so make sure we set it everywhere
we use this helper, as it no longer does the REF_COWS setting.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:25 +01:00
Josef Bacik
62a2c73ebd btrfs: export and use btrfs_read_tree_root for tree-log
Tree-log uses btrfs_read_fs_root to load its log, but this just calls
btrfs_read_tree_root.  We don't save the log roots in our root cache, so
just export this helper and use it in the logging code.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:25 +01:00
Josef Bacik
e59d18b45d btrfs: make btrfs_find_orphan_roots use btrfs_get_fs_root
btrfs_find_orphan_roots has this weird thing where it looks up the root
in cache to see if it is there before just reading the root.  But the
read it uses just reads the root, it doesn't do any of the init work, we
do that by hand here.  But this is unnecessary, all we really want is to
see if the root still exists and add it to the dead roots list to be
cleaned up, otherwise we delete the orphan item.

Fix this by just using btrfs_get_fs_root directly with check_ref set to
false so we get the orphan root items.  Then we just handle in cache and
out of cache roots the same, add them to the dead roots list and carry
on.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:25 +01:00
Josef Bacik
f39e457156 btrfs: move fs root init stuff into btrfs_init_fs_root
We have a helper for reading fs roots that just reads the fs root off
the disk and then sets REF_COWS and init's the inheritable flags.  Move
this into btrfs_init_fs_root so we can later get rid of this helper and
consolidate all of the fs root reading into one helper.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:25 +01:00
Josef Bacik
96dfcb46ff btrfs: push __setup_root into btrfs_alloc_root
There's no reason to not init the root at alloc time, and with later
patches it actually causes problems if we error out mounting the fs
before the tree_root is init'ed because we expect it to have a valid ref
count.  Fix this by pushing __setup_root into btrfs_alloc_root.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:25 +01:00
Josef Bacik
3f1c64ce04 btrfs: delete the ordered isize update code
Now that we have a safe way to update the isize, remove all of this code
as it's no longer needed.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:24 +01:00
Josef Bacik
d923afe96d btrfs: replace all uses of btrfs_ordered_update_i_size
Now that we have a safe way to update the i_size, replace all uses of
btrfs_ordered_update_i_size with btrfs_inode_safe_disk_i_size_write.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:24 +01:00
Josef Bacik
9ddc959e80 btrfs: use the file extent tree infrastructure
We want to use this everywhere we modify the file extent items
permanently.  These include:

  1) Inserting new file extents for writes and prealloc extents.
  2) Truncating inode items.
  3) btrfs_cont_expand().
  4) Insert inline extents.
  5) Insert new extents from log replay.
  6) Insert a new extent for clone, as it could be past i_size.
  7) Hole punching

For hole punching in particular it might seem it's not necessary because
anybody extending would use btrfs_cont_expand, however there is a corner
that still can give us trouble.  Start with an empty file and

fallocate KEEP_SIZE 1M-2M

We now have a 0 length file, and a hole file extent from 0-1M, and a
prealloc extent from 1M-2M.  Now

punch 1M-1.5M

Because this is past i_size we have

[HOLE EXTENT][ NOTHING ][PREALLOC]
[0        1M][1M   1.5M][1.5M  2M]

with an i_size of 0.  Now if we pwrite 0-1.5M we'll increas our i_size
to 1.5M, but our disk_i_size is still 0 until the ordered extent
completes.

However if we now immediately truncate 2M on the file we'll just call
btrfs_cont_expand(inode, 1.5M, 2M), since our old i_size is 1.5M.  If we
commit the transaction here and crash we'll expose the gap.

To fix this we need to clear the file extent mapping for the range that
we punched but didn't insert a corresponding file extent for.  This will
mean the truncate will only get an disk_i_size set to 1M if we crash
before the finish ordered io happens.

I've written an xfstest to reproduce the problem and validate this fix.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:24 +01:00
Josef Bacik
41a2ee75aa btrfs: introduce per-inode file extent tree
In order to keep track of where we have file extents on disk, and thus
where it is safe to adjust the i_size to, we need to have a tree in
place to keep track of the contiguous areas we have file extents for.

Add helpers to use this tree, as it's not required for NO_HOLES file
systems.  We will use this by setting DIRTY for areas we know we have
file extent item's set, and clearing it when we remove file extent items
for truncation.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:24 +01:00
Josef Bacik
790a1d44f9 btrfs: use btrfs_ordered_update_i_size in clone_finish_inode_update
We were using btrfs_i_size_write(), which unconditionally jacks up
inode->disk_i_size.  However since clone can operate on ranges we could
have pending ordered extents for a range prior to the start of our clone
operation and thus increase disk_i_size too far and have a hole with no
file extent.

Fix this by using the btrfs_ordered_update_i_size helper which will do
the right thing in the face of pending ordered extents outside of our
clone range.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:24 +01:00
Su Yue
cfe953c824 btrfs: update the comment of btrfs_control_ioctl()
Btrfsctl was removed in 2012, now the function btrfs_control_ioctl()
is only used for devices ioctls. So update the comment.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Su Yue <Damenly_Su@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:23 +01:00
Qu Wenruo
0c89138970 btrfs: relocation: Add introduction of how relocation works
Relocation is one of the most complex part of btrfs, while it's also the
foundation stone for online resizing, profile converting.

For such a complex facility, we should at least have some introduction
to it.

This patch will add an basic introduction at pretty a high level,
explaining:

- What relocation does
- How relocation is done
  Only mentioning how data reloc tree and reloc tree are involved in the
  operation.
  No details like the backref cache, or the data reloc tree contents.
- Which function to refer.

More detailed comments will be added for reloc tree creation, data reloc
tree creation and backref cache.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:23 +01:00
Filipe Manana
42836cf4ba Btrfs: don't iterate mod seq list when putting a tree mod seq
Each new element added to the mod seq list is always appended to the list,
and each one gets a sequence number coming from a counter which gets
incremented everytime a new element is added to the list (or a new node
is added to the tree mod log rbtree). Therefore the element with the
lowest sequence number is always the first element in the list.

So just remove the list iteration at btrfs_put_tree_mod_seq() that
computes the minimum sequence number in the list and replace it with
a check for the first element's sequence number.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:23 +01:00
Qu Wenruo
30b3688e1f btrfs: Add overview of device replace
The overview of btrfs dev-replace.  It mentions some corner cases caused
by the write duplication and scrub based data copy.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ adjust wording ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23 17:01:23 +01:00
Christoph Hellwig
693639994b xfs: remove xlog_state_want_sync
Open code the xlog_state_want_sync logic in its two callers given that
this function is a trivial wrapper around xlog_state_switch_iclogs.

Move the lockdep assert into xlog_state_switch_iclogs to not lose this
debugging aid, and improve the comment that documents
xlog_state_switch_iclogs as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-23 08:27:59 -07:00
Christoph Hellwig
5781464bd1 xfs: move the ioerror check out of xlog_state_clean_iclog
Use the shutdown flag in the log to bypass xlog_state_clean_iclog
entirely in case of a shut down log.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-23 08:27:59 -07:00
Christoph Hellwig
c814b4f24e xfs: refactor xlog_state_clean_iclog
Factor out a few self-contained helpers from xlog_state_clean_iclog, and
update the documentation so it primarily documents why things happens
instead of how.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-23 08:27:59 -07:00
Christoph Hellwig
12e6a0f449 xfs: remove the aborted parameter to xlog_state_done_syncing
We can just check for a shut down log all the way down in
xlog_cil_committed instead of passing the parameter.  This means a
slight behavior change in that we now also abort log items if the
shutdown came in halfway into the I/O completion processing, which
actually is the right thing to do.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-23 08:27:59 -07:00
Christoph Hellwig
a582f32fad xfs: simplify log shutdown checking in xfs_log_release_iclog
There is no need to check for the ioerror state before the lock, as
the shutdown case is not a fast path.  Also remove the call to force
shutdown the file system, as it must have been shut down already
for an iclog to be in the ioerror state.  Also clean up the flow of
the function a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-23 08:27:59 -07:00
Christoph Hellwig
f97a43e436 xfs: simplify the xfs_log_release_iclog calling convention
The only caller of xfs_log_release_iclog doesn't care about the return
value, so remove it.  Also don't bother passing the mount pointer,
given that we can trivially derive it from the iclog.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-23 08:27:58 -07:00
Christoph Hellwig
81e5b50a8f xfs: factor out a xlog_wait_on_iclog helper
Factor out the shared code to wait for a log force into a new helper.
This helper uses the XLOG_FORCED_SHUTDOWN check previous only used
by the unmount code over the equivalent iclog ioerror state used by
the other two functions.

There is a slight behavior change in that the force of the unmount
record is now accounted in the log force statistics.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-23 08:27:58 -07:00
Christoph Hellwig
c7cc296ddd xfs: merge xlog_cil_push into xlog_cil_push_work
xlog_cil_push is only called by xlog_cil_push_work, so merge the two
functions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-23 08:27:58 -07:00
Domenico Andreoli
56939e014a hibernate: Allow uswsusp to write to swap
It turns out that there is one use case for programs being able to
write to swap devices, and that is the userspace hibernation code.

Quick fix: disable the S_SWAPFILE check if hibernation is configured.

Fixes: dc617f29db ("vfs: don't allow writes to swap files")
Reported-by: Domenico Andreoli <domenico.andreoli@linux.com>
Reported-by: Marian Klein <mkleinsoft@gmail.com>
Signed-off-by: Domenico Andreoli <domenico.andreoli@linux.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-23 08:22:15 -07:00
Hillf Danton
a5318d3cdf io-uring: drop 'free_pfile' in struct io_file_put
Sync removal of file is only used in case of a GFP_KERNEL kmalloc
failure at the cost of io_file_put::done and work flush, while a
glich like it can be handled at the call site without too much pain.

That said, what is proposed is to drop sync removing of file, and
the kink in neck as well.

Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-23 09:22:15 -06:00
Hillf Danton
4afdb733b1 io-uring: drop completion when removing file
A case of task hung was reported by syzbot,

INFO: task syz-executor975:9880 blocked for more than 143 seconds.
      Not tainted 5.6.0-rc6-syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor975 D27576  9880   9878 0x80004000
Call Trace:
 schedule+0xd0/0x2a0 kernel/sched/core.c:4154
 schedule_timeout+0x6db/0xba0 kernel/time/timer.c:1871
 do_wait_for_common kernel/sched/completion.c:83 [inline]
 __wait_for_common kernel/sched/completion.c:104 [inline]
 wait_for_common kernel/sched/completion.c:115 [inline]
 wait_for_completion+0x26a/0x3c0 kernel/sched/completion.c:136
 io_queue_file_removal+0x1af/0x1e0 fs/io_uring.c:5826
 __io_sqe_files_update.isra.0+0x3a1/0xb00 fs/io_uring.c:5867
 io_sqe_files_update fs/io_uring.c:5918 [inline]
 __io_uring_register+0x377/0x2c00 fs/io_uring.c:7131
 __do_sys_io_uring_register fs/io_uring.c:7202 [inline]
 __se_sys_io_uring_register fs/io_uring.c:7184 [inline]
 __x64_sys_io_uring_register+0x192/0x560 fs/io_uring.c:7184
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

and bisect pointed to 05f3fb3c53 ("io_uring: avoid ring quiesce for
fixed file set unregister and update").

It is down to the order that we wait for work done before flushing it
while nobody is likely going to wake us up.

We can drop that completion on stack as flushing work itself is a sync
operation we need and no more is left behind it.

To that end, io_file_put::done is re-used for indicating if it can be
freed in the workqueue worker context.

Reported-and-Inspired-by: syzbot <syzbot+538d1957ce178382a394@syzkaller.appspotmail.com>
Signed-off-by: Hillf Danton <hdanton@sina.com>

Rename ->done to ->free_pfile

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-23 09:21:06 -06:00
Luis Henriques
c8d6ee0144 ceph: fix memory leak in ceph_cleanup_snapid_map()
kmemleak reports the following memory leak:

unreferenced object 0xffff88821feac8a0 (size 96):
  comm "kworker/1:0", pid 17, jiffies 4294896362 (age 20.512s)
  hex dump (first 32 bytes):
    a0 c8 ea 1f 82 88 ff ff 00 c9 ea 1f 82 88 ff ff  ................
    00 00 00 00 00 00 00 00 00 01 00 00 00 00 ad de  ................
  backtrace:
    [<00000000b3ea77fb>] ceph_get_snapid_map+0x75/0x2a0
    [<00000000d4060942>] fill_inode+0xb26/0x1010
    [<0000000049da6206>] ceph_readdir_prepopulate+0x389/0xc40
    [<00000000e2fe2549>] dispatch+0x11ab/0x1521
    [<000000007700b894>] ceph_con_workfn+0xf3d/0x3240
    [<0000000039138a41>] process_one_work+0x24d/0x590
    [<00000000eb751f34>] worker_thread+0x4a/0x3d0
    [<000000007e8f0d42>] kthread+0xfb/0x130
    [<00000000d49bd1fa>] ret_from_fork+0x3a/0x50

A kfree is missing while looping the 'to_free' list of ceph_snapid_map
objects.

Cc: stable@vger.kernel.org
Fixes: 75c9627efb ("ceph: map snapid to anonymous bdev ID")
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-23 13:07:08 +01:00
Ilya Dryomov
7614209736 ceph: check POOL_FLAG_FULL/NEARFULL in addition to OSDMAP_FULL/NEARFULL
CEPH_OSDMAP_FULL/NEARFULL aren't set since mimic, so we need to consult
per-pool flags as well.  Unfortunately the backwards compatibility here
is lacking:

- the change that deprecated OSDMAP_FULL/NEARFULL went into mimic, but
  was guarded by require_osd_release >= RELEASE_LUMINOUS
- it was subsequently backported to luminous in v12.2.2, but that makes
  no difference to clients that only check OSDMAP_FULL/NEARFULL because
  require_osd_release is not client-facing -- it is for OSDs

Since all kernels are affected, the best we can do here is just start
checking both map flags and pool flags and send that to stable.

These checks are best effort, so take osdc->lock and look up pool flags
just once.  Remove the FIXME, since filesystem quotas are checked above
and RADOS quotas are reflected in POOL_FLAG_FULL: when the pool reaches
its quota, both POOL_FLAG_FULL and POOL_FLAG_FULL_QUOTA are set.

Cc: stable@vger.kernel.org
Reported-by: Yanhu Cao <gmayyyha@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Sage Weil <sage@redhat.com>
2020-03-23 13:07:08 +01:00
Yilu Lin
97adda8b3a CIFS: Fix bug which the return value by asynchronous read is error
This patch is used to fix the bug in collect_uncached_read_data()
that rc is automatically converted from a signed number to an
unsigned number when the CIFS asynchronous read fails.
It will cause ctx->rc is error.

Example:
Share a directory and create a file on the Windows OS.
Mount the directory to the Linux OS using CIFS.
On the CIFS client of the Linux OS, invoke the pread interface to
deliver the read request.

The size of the read length plus offset of the read request is greater
than the maximum file size.

In this case, the CIFS server on the Windows OS returns a failure
message (for example, the return value of
smb2.nt_status is STATUS_INVALID_PARAMETER).

After receiving the response message, the CIFS client parses
smb2.nt_status to STATUS_INVALID_PARAMETER
and converts it to the Linux error code (rdata->result=-22).

Then the CIFS client invokes the collect_uncached_read_data function to
assign the value of rdata->result to rc, that is, rc=rdata->result=-22.

The type of the ctx->total_len variable is unsigned integer,
the type of the rc variable is integer, and the type of
the ctx->rc variable is ssize_t.

Therefore, during the ternary operation, the value of rc is
automatically converted to an unsigned number. The final result is
ctx->rc=4294967274. However, the expected result is ctx->rc=-22.

Signed-off-by: Yilu Lin <linyilu@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-03-22 22:49:10 -05:00
Murphy Zhou
ef4a632ccc CIFS: check new file size when extending file by fallocate
xfstests generic/228 checks if fallocate respect RLIMIT_FSIZE.
After fallocate mode 0 extending enabled, we can hit this failure.
Fix this by check the new file size with vfs helper, return
error if file size is larger then RLIMIT_FSIZE(ulimit -f).

This patch has been tested by LTP/xfstests aginst samba and
Windows server.

Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2020-03-22 22:49:10 -05:00
Steve French
8895c66f2b SMB3: Minor cleanup of protocol definitions
And add one missing define (COMPRESSION_TRANSFORM_ID) and
flag (TRANSFORM_FLAG_ENCRYPTED)

Signed-off-by: Steve French <stfrench@microsoft.com>
2020-03-22 22:49:10 -05:00
Steve French
8f23343131 SMB3: Additional compression structures
New transform header structures. See recent updates
to MS-SMB2 adding section 2.2.42.1 and 2.2.42.2

Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-03-22 22:49:10 -05:00
Steve French
2fe4f62de4 SMB3: Add new compression flags
Additional compression capabilities can now be negotiated and a
new compression algorithm.  Add the flags for these.

See newly updated MS-SMB2 sections 3.1.4.4.1 and 2.2.3.1.3

Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-03-22 22:49:10 -05:00
Gustavo A. R. Silva
cff2def598 cifs: smb2pdu.h: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-03-22 22:49:10 -05:00
Eric Biggers
dc920277f1 cifs: clear PF_MEMALLOC before exiting demultiplex thread
Leaving PF_MEMALLOC set when exiting a kthread causes it to remain set
during do_exit().  That can confuse things.  For example, if BSD process
accounting is enabled and the accounting file has FS_SYNC_FL set and is
located on an ext4 filesystem without a journal, then do_exit() can end
up calling ext4_write_inode().  That triggers the
WARN_ON_ONCE(current->flags & PF_MEMALLOC) there, as it assumes
(appropriately) that inodes aren't written when allocating memory.

This was originally reported for another kernel thread, xfsaild() [1].
cifs_demultiplex_thread() also exits with PF_MEMALLOC set, so it's
potentially subject to this same class of issue -- though I haven't been
able to reproduce the WARN_ON_ONCE() via CIFS, since unlike xfsaild(),
cifs_demultiplex_thread() is sent SIGKILL before exiting, and that
interrupts the write to the BSD process accounting file.

Either way, leaving PF_MEMALLOC set is potentially problematic.  Let's
clean this up by properly saving and restoring PF_MEMALLOC.

[1] https://lore.kernel.org/r/0000000000000e7156059f751d7b@google.com

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-03-22 22:49:10 -05:00
Gustavo A. R. Silva
266b9fecc5 cifs: cifspdu.h: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-03-22 22:49:10 -05:00
Steve French
ba55344f36 CIFS: Warn less noisily on default mount
The warning we print on mount about how to use less secure dialects
(when the user does not specify a version on mount) is useful
but is noisy to print on every default mount, and can be changed
to a warn_once.  Slightly updated the warning text as well to note
SMB3.1.1 which has been the default which is typically negotiated
(for a few years now) by most servers.

      "No dialect specified on mount. Default has changed to a more
       secure dialect, SMB2.1 or later (e.g. SMB3.1.1), from CIFS
       (SMB1). To use the less secure SMB1 dialect to access old
       servers which do not support SMB3.1.1 (or even SMB3 or SMB2.1)
       specify vers=1.0 on mount."

Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-03-22 22:49:09 -05:00
Qiujun Huang
f2d67931fd fs/cifs: fix gcc warning in sid_to_id
fix warning [-Wunused-but-set-variable] at variable 'rc',
keeping the code readable.

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-03-22 22:49:09 -05:00
Murphy Zhou
0667059d0b cifs: allow unlock flock and OFD lock across fork
Since commit d0677992d2 ("cifs: add support for flock") added
support for flock, LTP/flock03[1] testcase started to fail.

This testcase is testing flock lock and unlock across fork.
The parent locks file and starts the child process, in which
it unlock the same fd and lock the same file with another fd
again. All the lock and unlock operation should succeed.

Now the child process does not actually unlock the file, so
the following lock fails. Fix this by allowing flock and OFD
lock go through the unlock routine, not skipping if the unlock
request comes from another process.

Patch has been tested by LTP/xfstests on samba and Windows
server, v3.11, with or without cache=none mount option.

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/flock/flock03.c
Signed-off-by: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2020-03-22 22:49:09 -05:00
Steve French
c7e9f78f7b cifs: do d_move in rename
See commit 349457ccf2
"Allow file systems to manually d_move() inside of ->rename()"

Lessens possibility of race conditions in rename

Signed-off-by: Steve French <stfrench@microsoft.com>
2020-03-22 22:49:09 -05:00
Aurelien Aptel
69dda3059e cifs: add SMB2_open() arg to return POSIX data
allows SMB2_open() callers to pass down a POSIX data buffer that will
trigger requesting POSIX create context and parsing the response into
the provided buffer.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
2020-03-22 22:49:09 -05:00
Aurelien Aptel
3d519bd126 cifs: plumb smb2 POSIX dir enumeration
* add code to request POSIX info level
* parse dir entries and fill cifs_fattr to get correct inode data

since the POSIX payload is variable size the number of entries in a
FIND response needs to be computed differently.

Dirs and regular files are properly reported along with mode bits,
hardlink number, c/m/atime. No special files yet (see below).

Current experimental version of Samba with the extension unfortunately
has issues with wildcards and needs the following patch:

> --- i/source3/smbd/smb2_query_directory.c
> +++ w/source3/smbd/smb2_query_directory.c
> @@ -397,9 +397,7 @@ smbd_smb2_query_directory_send(TALLOC_CTX
> *mem_ctx,
> 		}
> 	}
>
> -       if (!state->smbreq->posix_pathnames) {
> 		wcard_has_wild = ms_has_wild(state->in_file_name);
> -       }
>
> 	/* Ensure we've canonicalized any search path if not a wildcard. */
> 	if (!wcard_has_wild) {
>

Also for special files despite reporting them as reparse point samba
doesn't set the reparse tag field. This patch will mark them as needing
re-evaluation but the re-evaluate code doesn't deal with it yet.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-03-22 22:49:09 -05:00
Aurelien Aptel
349e13ad30 cifs: add smb2 POSIX info level
* add new info level and structs for SMB2 posix extension
* add functions to parse and validate it

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-03-22 22:49:09 -05:00
Aurelien Aptel
2e8af978d9 cifs: rename posix create rsp
little progress on the posix create response.

* rename struct to create_posix_rsp to match with the request
  create_posix context
* make struct packed
* pass smb info struct for parse_posix_ctxt to fill
* use smb info struct as param
* update TODO

What needs to be done:

SMB2_open() has an optional smb info out argument that it will fill.
Callers making use of this are:

- smb3_query_mf_symlink (need to investigate)
- smb2_open_file

Callers of smb2_open_file (via server->ops->open) are passing an
smbinfo struct but that struct cannot hold POSIX information. All the
call stack needs to be changed for a different info type. Maybe pass
SMB generic struct like cifs_fattr instead.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-03-22 22:49:09 -05:00
Steve French
8fe0c2c2cb cifs: print warning mounting with vers=1.0
We really, really don't want people using insecure dialects
unless they realize what they are doing ...

Add mount warning if mounting with vers=1.0 (older SMB1/CIFS
dialect) instead of the default (SMB2.1 or later, typically
SMB3.1.1).

Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2020-03-22 22:49:09 -05:00
Steve French
cf5371ae46 smb3: fix performance regression with setting mtime
There are cases when we don't want to send the SMB2 flush operation
(e.g. when user specifies mount parm "nostrictsync") and it can be
a very expensive operation on the server.  In most cases in order
to set mtime, we simply need to flush (write) the dirtry pages from
the client and send the writes to the server not also send a flush
protocol operation to the server.

Fixes: aa081859b1 ("cifs: flush before set-info if we have writeable handles")
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-03-22 22:49:09 -05:00
Stefan Metzmacher
864138cb31 cifs: make use of cap_unix(ses) in cifs_reconnect_tcon()
cap_unix(ses) defaults to false for SMB2.

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-03-22 22:49:09 -05:00
Stefan Metzmacher
b08484d715 cifs: use mod_delayed_work() for &server->reconnect if already queued
mod_delayed_work() is safer than queue_delayed_work() if there's a
chance that the work is already in the queue.

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-03-22 22:49:09 -05:00
Stefan Metzmacher
e2e87519bd cifs: call wake_up(&server->response_q) inside of cifs_reconnect()
This means it's consistently called and the callers don't need to
care about it.

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-03-22 22:49:09 -05:00
Paulo Alcantara (SUSE)
bacd704a95 cifs: handle prefix paths in reconnect
For the case where we have a DFS path like below and we're currently
connected to targetA:

    //dfsroot/link -> //targetA/share/foo, //targetB/share/bar

after failover, we should make sure to update cifs_sb->prepath so the
next operations will use the new prefix path "/bar".

Besides, in order to simplify the use of different prefix paths,
enforce CIFS_MOUNT_USE_PREFIX_PATH for DFS mounts so we don't have to
revalidate the root dentry every time we set a new prefix path.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-03-22 22:49:09 -05:00
Steve French
ffdec8d642 cifs: do not ignore the SYNC flags in getattr
Check the AT_STATX_FORCE_SYNC flag and force an attribute
revalidation if requested by the caller, and if the caller
specificies AT_STATX_DONT_SYNC only revalidate cached attributes
if required.  In addition do not flush writes in getattr (which
can be expensive) if size or timestamps not requested by the
caller.

Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-03-22 22:49:09 -05:00
Pavel Begunkov
18a542ff19 io_uring: Fix ->data corruption on re-enqueue
work->data and work->list are shared in union. io_wq_assign_next() sets
->data if a req having a linked_timeout, but then io-wq may want to use
work->list, e.g. to do re-enqueue of a request, so corrupting ->data.

->data is not necessary, just remove it and extract linked_timeout
through @link_list.

Fixes: 60cf46ae60 ("io-wq: hash dependent work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-22 19:31:27 -06:00
Linus Torvalds
67d584e33e for-5.6-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl53Vh0ACgkQxWXV+ddt
 WDtfOQ//bbUyKXcdH0FBZOCEcJmegcK1eUFYqKrwR2bHGe5JRdLM8pAvjCcqmWeO
 jtaRiFC4NSCqTIl3mkBUb+XmQtjZwixBUHRxJpuEO8zqawvFZXTqg/KJklNvi2rd
 KdflSNia6KrozTT+B/lpwZ5emS+wSdj5XTZ6VGj4riwtphSfWAjOu+4cOASMeFu+
 Gfn+N9xu0ZcR/6zO20xAg0Xz+WU2uj4EfeM35dtRP2bPLG0yOGmiYT15Ll9h74Wm
 7F+28iNTQfYutAexGvUpiouanGXE+ka3TCsJg5LuVTpdKGraOVGEuX+RhsyoKQrB
 E8bk91fbkLlooluhUC306iNA9/+RN/yFGtILX8JsgI2Od26ZuU01l/OHrc19MDIm
 gw1w3PMsD/hXLsG5ba4QsIYOzXofSrPdWej29h/o5p0VEQrAoCJEpAi7fVsiJDR1
 sx6kCodw5jYhVs1P6DdXO1pgjE7iFUmjUQCFkl40edPMLy/LwB99A4zNnCOwI0KZ
 49CMWHDe+tXVJBTzPvtma/PycQHIxJYMf1f8ko9E4stB7HtfH4dnUERDkb1UwQ5n
 aJgyhsCCnp/EJoPunUT7g9nLUdyu0Rtwknn3NascWZEieX2QhKEF5RcjAUSL+Hlo
 jbGGvoLhG0nOtYkU7BNSQbL8wxPJEEAq8e6F4tWMcOkhX4pNZP8=
 =YkB0
 -----END PGP SIGNATURE-----

Merge tag 'for-5.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Two fixes.

  The first is a regression: when dropping some incompat bits the
  conditions were reversed. The other is a fix for rename whiteout
  potentially leaving stack memory linked to a list"

* tag 'for-5.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix removal of raid[56|1c34} incompat flags after removing block group
  btrfs: fix log context list corruption after rename whiteout error
2020-03-22 11:35:33 -07:00
Linus Torvalds
b3c03db67e Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "10 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  x86/mm: split vmalloc_sync_all()
  mm, slub: prevent kmalloc_node crashes and memory leaks
  mm/mmu_notifier: silence PROVE_RCU_LIST warnings
  epoll: fix possible lost wakeup on epoll_ctl() path
  mm: do not allow MADV_PAGEOUT for CoW pages
  mm, memcg: throttle allocators based on ancestral memory.high
  mm, memcg: fix corruption on 64-bit divisor in memory.high throttling
  page-flags: fix a crash at SetPageError(THP_SWAP)
  mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case
  memcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_event
2020-03-22 10:46:50 -07:00
Pavel Begunkov
f2cf11492b io-wq: close cancel gap for hashed linked work
After io_assign_current_work() of a linked work, it can be decided to
offloaded to another thread so doing io_wqe_enqueue(). However, until
next io_assign_current_work() it can be cancelled, that isn't handled.

Don't assign it, if it's not going to be executed.

Fixes: 60cf46ae60 ("io-wq: hash dependent work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-22 11:33:58 -06:00
Roman Penyaev
1b53734bd0 epoll: fix possible lost wakeup on epoll_ctl() path
This fixes possible lost wakeup introduced by commit a218cc4914.
Originally modifications to ep->wq were serialized by ep->wq.lock, but
in commit a218cc4914 ("epoll: use rwlock in order to reduce
ep_poll_callback() contention") a new rw lock was introduced in order to
relax fd event path, i.e. callers of ep_poll_callback() function.

After the change ep_modify and ep_insert (both are called on epoll_ctl()
path) were switched to ep->lock, but ep_poll (epoll_wait) was using
ep->wq.lock on wqueue list modification.

The bug doesn't lead to any wqueue list corruptions, because wake up
path and list modifications were serialized by ep->wq.lock internally,
but actual waitqueue_active() check prior wake_up() call can be
reordered with modifications of ep ready list, thus wake up can be lost.

And yes, can be healed by explicit smp_mb():

  list_add_tail(&epi->rdlink, &ep->rdllist);
  smp_mb();
  if (waitqueue_active(&ep->wq))
	wake_up(&ep->wp);

But let's make it simple, thus current patch replaces ep->wq.lock with
the ep->lock for wqueue modifications, thus wake up path always observes
activeness of the wqueue correcty.

Fixes: a218cc4914 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
Reported-by: Max Neunhoeffer <max@arangodb.com>
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Max Neunhoeffer <max@arangodb.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Christopher Kohlhoff <chris.kohlhoff@clearpool.io>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jes Sorensen <jes.sorensen@gmail.com>
Cc: <stable@vger.kernel.org>	[5.1+]
Link: http://lkml.kernel.org/r/20200214170211.561524-1-rpenyaev@suse.de
References: https://bugzilla.kernel.org/show_bug.cgi?id=205933
Bisected-by: Max Neunhoeffer <max@arangodb.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21 18:56:06 -07:00
Linus Torvalds
1ab7ea1f83 io_uring-5.6-20200320
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl51dbQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiV3EADJHB2r2hTTEym5u1PbrEEVkjvdL6InU8lD
 lFM7m2g6yZUncwm+aSZynHqAFY6Rd5Jk+gmYMuioi3ZxC2rs7jG1AOTpaeJYmhle
 lzkjqSLtl+gdPMA9ydivk1UwILFjtZKG1JNc++tnCn3q7+eCkgnWAlq5b7idG2eF
 BS0AEZP6Yz1zStTHLbHSB0StY8ovMIw0VaVQvguHLL9EBpbHmrs0cq3tipWkAyPR
 2YwnXbxsJySukkwmBKxEWrGUYDze56jqJIqdFsOE0+WtGV+nk7OScPseXAaP4/+G
 Vl23VNfryuZcsBUwI9tY1SzCFEXIwdXVGpCAYwQ/kU5WfvFpYaei+fXVNnL4kjR0
 PfpA6XnMsZ3DzqgepmUd92sAA56ZtBxuGjqcSYlg/JwjvUHdpaZDkE2WLqkAMeUN
 8A7cUw+R6XWQ2/y6ob7QvKiT/ZDR8GrYUl3EdGE3LhB1ZsvLXJDZpWipwQBzuk9R
 vJJOkGst38rjsWnb+nfeLh3AsgjF14wo+2vQL4mKs24xKTIvadHsFAZjKLXZ93Wf
 Vn58FaPOYIkjBidYLWb3dlO1ZR8S0803gohLkLV6adH8bCNCWxGTOR51DZLomAsb
 nAUCEAJaZrOqaQAuJAFNNpS8+/da3AIF4HVd2EdZ1yFXU15y0+zIxtROjKzg+OxO
 M3jC/Aet1Q==
 =IMcu
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.6-20200320' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Two different fixes in here:

   - Fix for a potential NULL pointer deref for links with async or
     drain marked (Pavel)

   - Fix for not properly checking RLIMIT_NOFILE for async punted
     operations.

     This affects openat/openat2, which were added this cycle, and
     accept4. I did a full audit of other cases where we might check
     current->signal->rlim[] and found only RLIMIT_FSIZE for buffered
     writes and fallocate. That one is fixed and queued for 5.7 and
     marked stable"

* tag 'io_uring-5.6-20200320' of git://git.kernel.dk/linux-block:
  io_uring: make sure accept honor rlimit nofile
  io_uring: make sure openat/openat2 honor rlimit nofile
  io_uring: NULL-deref for IOSQE_{ASYNC,DRAIN}
2020-03-21 11:54:47 -07:00
Filipe Manana
d8e6fd5c79 btrfs: fix removal of raid[56|1c34} incompat flags after removing block group
We are incorrectly dropping the raid56 and raid1c34 incompat flags when
there are still raid56 and raid1c34 block groups, not when we do not any
of those anymore. The logic just got unintentionally broken after adding
the support for the raid1c34 modes.

Fix this by clear the flags only if we do not have block groups with the
respective profiles.

Fixes: 9c907446dc ("btrfs: drop incompat bit for raid1c34 after last block group is gone")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-20 21:31:32 +01:00
Jens Axboe
4ed734b0d0 io_uring: honor original task RLIMIT_FSIZE
With the previous fixes for number of files open checking, I added some
debug code to see if we had other spots where we're checking rlimit()
against the async io-wq workers. The only one I found was file size
checking, which we should also honor.

During write and fallocate prep, store the max file size and override
that for the current ask if we're in io-wq worker context.

Cc: stable@vger.kernel.org # 5.1+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-20 11:41:23 -06:00
Jens Axboe
09952e3e78 io_uring: make sure accept honor rlimit nofile
Just like commit 4022e7af86, this fixes the fact that
IORING_OP_ACCEPT ends up using get_unused_fd_flags(), which checks
current->signal->rlim[] for limits.

Add an extra argument to __sys_accept4_file() that allows us to pass
in the proper nofile limit, and grab it at request prep time.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-20 08:48:36 -06:00
Jens Axboe
4022e7af86 io_uring: make sure openat/openat2 honor rlimit nofile
Dmitry reports that a test case shows that io_uring isn't honoring a
modified rlimit nofile setting. get_unused_fd_flags() checks the task
signal->rlimi[] for the limits. As this isn't easily inheritable,
provide a __get_unused_fd_flags() that takes the value instead. Then we
can grab it when the request is prepared (from the original task), and
pass that in when we do the async part part of the open.

Reported-by: Dmitry Kadashev <dkadashev@gmail.com>
Tested-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-20 08:47:27 -06:00
Eric Biggers
861261f2a9 ubifs: wire up FS_IOC_GET_ENCRYPTION_NONCE
This new ioctl retrieves a file's encryption nonce, which is useful for
testing.  See the corresponding fs/crypto/ patch for more details.

Link: https://lore.kernel.org/r/20200314205052.93294-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-03-19 21:57:06 -07:00
Eric Biggers
ee446e1af4 f2fs: wire up FS_IOC_GET_ENCRYPTION_NONCE
This new ioctl retrieves a file's encryption nonce, which is useful for
testing.  See the corresponding fs/crypto/ patch for more details.

Link: https://lore.kernel.org/r/20200314205052.93294-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-03-19 21:57:06 -07:00
Eric Biggers
7ec9f3b47a ext4: wire up FS_IOC_GET_ENCRYPTION_NONCE
This new ioctl retrieves a file's encryption nonce, which is useful for
testing.  See the corresponding fs/crypto/ patch for more details.

Link: https://lore.kernel.org/r/20200314205052.93294-3-ebiggers@kernel.org
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-03-19 21:56:59 -07:00
Eric Biggers
e98ad46475 fscrypt: add FS_IOC_GET_ENCRYPTION_NONCE ioctl
Add an ioctl FS_IOC_GET_ENCRYPTION_NONCE which retrieves the nonce from
an encrypted file or directory.  The nonce is the 16-byte random value
stored in the inode's encryption xattr.  It is normally used together
with the master key to derive the inode's actual encryption key.

The nonces are needed by automated tests that verify the correctness of
the ciphertext on-disk.  Except for the IV_INO_LBLK_64 case, there's no
way to replicate a file's ciphertext without knowing that file's nonce.

The nonces aren't secret, and the existing ciphertext verification tests
in xfstests retrieve them from disk using debugfs or dump.f2fs.  But in
environments that lack these debugging tools, getting the nonces by
manually parsing the filesystem structure would be very hard.

To make this important type of testing much easier, let's just add an
ioctl that retrieves the nonce.

Link: https://lore.kernel.org/r/20200314205052.93294-2-ebiggers@kernel.org
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-03-19 21:56:54 -07:00
David S. Miller
3ac9eb4210 RxRPC fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl5zWDsACgkQ+7dXa6fL
 C2upyg/+KFOmCLFEAgwRnBn4zDDcdDT9du25Duv2d/XfAo2Zx+Nbwm7jjKR/mrRZ
 mRbcvb8qj92O4dzMCwcqDGpKT3xJmCZhxJQORBm55Bjme7tJDqXuQVYp1fZVy3Ka
 XJS0jr4n5HTorW8iGSIPJmE76XpIPq0ANhPnLbq8wZELyw87K7+J5ZdHcnUh+myd
 uKs8sIQ8PQZg6JBBj5wPRgrAkOFUTTINiUqy37ADIY1oZyzW1rUlAeAxVXV7Dnx7
 G1HvlVaDw72G1XG4pn0pNBCdGJuNF0dG2zRbdjS+kGCmf6MB6x8e22JjWW9r+r9m
 iJd4B2R/3V/kUn4i3B+jfOWD5DKzCW4lDixh9D2LzM16GUinYQTkrH9e8jMBBJGW
 7p7X9Vl3o0Nt6NDVLmTKuyomvvtT/jMYiDtKjPuvxlPCGduXB8HvNRFxsKIEVRHi
 4RcdTqUSOsyUnOvTfDTfyBu1srKFqTC3HzAunntV88UfGtWdhXRCWMejHdNK3uI9
 BC4Ym6jkmFnbQzytW/6noprvVlDfgAuyplcyhnnJ5fVNm4YQ7lZLZPgf5TS+gchI
 fMwDfRz3hOLDZ5WjCx6QLT1NHaowQLTrzTq0X3uj2ZrcnRORURvk8GfamzZoS9a5
 omyQgBfm+1YpF2VwCyU42DytdmFDUCDofKondOXh8QciwhXqaRs=
 =bvsn
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-fixes-20200319' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc, afs: Interruptibility fixes

Here are a number of fixes for AF_RXRPC and AFS that make AFS system calls
less interruptible and so less likely to leave the filesystem in an
uncertain state.  There's also a miscellaneous patch to make tracing
consistent.

 (1) Firstly, abstract out the Tx space calculation in sendmsg.  Much the
     same code is replicated in a number of places that subsequent patches
     are going to alter, including adding another copy.

 (2) Fix Tx interruptibility by allowing a kernel service, such as AFS, to
     request that a call be interruptible only when waiting for a call slot
     to become available (ie. the call has not taken place yet) or that a
     call be not interruptible at all (e.g. when we want to do writeback
     and don't want a signal interrupting a VM-induced writeback).

 (3) Increase the minimum delay on MSG_WAITALL for userspace sendmsg() when
     waiting for Tx buffer space as a 2*RTT delay is really small over 10G
     ethernet and a 1 jiffy timeout might be essentially 0 if at the end of
     the jiffy period.

 (4) Fix some tracing output in AFS to make it consistent with rxrpc.

 (5) Make sure aborted asynchronous AFS operations are tidied up properly
     so we don't end up with stuck rxrpc calls.

 (6) Make AFS client calls uninterruptible in the Rx phase.  If we don't
     wait for the reply to be fully gathered, we can't update the local VFS
     state and we end up in an indeterminate state with respect to the
     server.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19 20:28:34 -07:00
Linus Torvalds
cd607737f3 three small smb3 fixes, 2 for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl5xH1sACgkQiiy9cAdy
 T1HzMgv/d27qMlDe1jrLgPY40FT6kjTfG6zKA8ikTg5LHt/esgqRrKsPTQVSVq/m
 f6ZVGNlcTDfwAq+90Rw38hreUKRYCkkVWoCEE9SUkCqlg/3MVMorA72p9eDnp0/u
 htADzvyBCNoMPJj1WGi5uyhGw58LBy5zWT4vibovGzEdlZ2Lv1qvVzyiGnju8ypy
 2+0cgGhucQ8jfEAjqEP28T7nCT96+G0KJGqXX122+Mrx/agjGQ2xCCZRIH5ndVnp
 VmaN7WxGQmN9AdLtsVgkrRa9VYtndspMzo7xUArrferlF/yLijvO2Lcu7o3QtH8N
 RvLSc0qOD7eH3ETcAwvYd/luGH5OvvZDu4jHphK9KBz9GtGGRCKc7nxElv13S4LJ
 27DG71x2XqTGmNoLmY57EZOtKVCsu6VBDlhq7u17RsYWDEurrvda0Nhe/Wo8P2yT
 dESnNEX5YGi+nWIjvxwRGMJ7Gb1ZXLdjkJC5QNzDID4AZVHE678AxDR+ZjkHCYLE
 Rsbsbmaw
 =x6+U
 -----END PGP SIGNATURE-----

Merge tag '5.6-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Three small smb3 fixes, two for stable"

* tag '5.6-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: fiemap: do not return EINVAL if get nothing
  CIFS: Increment num_remote_opens stats counter even in case of smb2_query_dir_first
  cifs: potential unintitliazed error code in cifs_getattr()
2020-03-19 10:19:11 -07:00
Christoph Hellwig
6471e9c5e7 xfs: remove the di_version field from struct icdinode
We know the version is 3 if on a v5 file system.   For earlier file
systems formats we always upgrade the remaining v1 inodes to v2 and
thus only use v2 inodes.  Use the xfs_sb_version_has_large_dinode
helper to check if we deal with small or large dinodes, and thus
remove the need for the di_version field in struct icdinode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-19 08:48:47 -07:00
Christoph Hellwig
5e28aafe70 xfs: simplify a check in xfs_ioctl_setattr_check_cowextsize
Only v5 file systems can have the reflink feature, and those will
always use the large dinode format.  Remove the extra check for the
inode version.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-19 08:48:47 -07:00
Christoph Hellwig
b3d1d37544 xfs: simplify di_flags2 inheritance in xfs_ialloc
di_flags2 is initialized to zero for v4 and earlier file systems.  This
means di_flags2 can only be non-zero for a v5 file systems, in which
case both the parent and child inodes can store the field.  Remove the
extra di_version check, and also remove the rather pointless local
di_flags2 variable while at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-19 08:48:47 -07:00
Christoph Hellwig
e9e2eae89d xfs: only check the superblock version for dinode size calculation
The size of the dinode structure is only dependent on the file system
version, so instead of checking the individual inode version just use
the newly added xfs_sb_version_has_large_dinode helper, and simplify
various calling conventions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-19 08:48:47 -07:00
Christoph Hellwig
b81b79f4ed xfs: add a new xfs_sb_version_has_v3inode helper
Add a new wrapper to check if a file system supports the v3 inode format
with a larger dinode core.  Previously we used xfs_sb_version_hascrc for
that, which is technically correct but a little confusing to read.

Also move xfs_dinode_good_version next to xfs_sb_version_has_v3inode
so that we have one place that documents the superblock version to
inode version relationship.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-19 08:47:34 -07:00
J. Bruce Fields
69afd26798 nfsd: fsnotify on rmdir under nfsd/clients/
Userspace should be able to monitor nfsd/clients/ to see when clients
come and go, but we're failing to send fsnotify events.

Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-19 11:33:42 -04:00
J. Bruce Fields
663e36f076 nfsd4: kill warnings on testing stateids with mismatched clientids
It's normal for a client to test a stateid from a previous instance,
e.g. after a network partition.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-19 10:51:42 -04:00
Linus Torvalds
dcf23ac3e8 locks: reinstate locks_delete_block optimization
There is measurable performance impact in some synthetic tests due to
commit 6d390e4b5d (locks: fix a potential use-after-free problem when
wakeup a waiter). Fix the race condition instead by clearing the
fl_blocker pointer after the wake_up, using explicit acquire/release
semantics.

This does mean that we can no longer use the clearing of fl_blocker as
the wait condition, so switch the waiters over to checking whether the
fl_blocked_member list_head is empty.

Reviewed-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Fixes: 6d390e4b5d (locks: fix a potential use-after-free problem when wakeup a waiter)
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-18 13:03:38 -07:00
Brian Foster
8a62714313 xfs: fix unmount hang and memory leak on shutdown during quotaoff
AIL removal of the quotaoff start intent and free of both quotaoff
intents is currently limited to the ->iop_committed() handler of the
end intent. This executes when the end intent is committed to the
on-disk log and marks the completion of the operation. The problem
with this is it assumes the success of the operation. If a shutdown
or other error occurs during the quotaoff, it's possible for the
quotaoff task to exit without removing the start intent from the
AIL. This results in an unmount hang as the AIL cannot be emptied.
Further, no other codepath frees the intents and so this is also a
memory leak vector.

First, update the high level quotaoff error path to directly remove
and free the quotaoff start intent if it still exists in the AIL at
the time of the error. Next, update both of the start and end
quotaoff intents with an ->iop_release() callback to properly handle
transaction abort.

This means that If the quotaoff start transaction aborts, it frees
the start intent in the transaction commit path. If the filesystem
shuts down before the end transaction allocates, the quotaoff
sequence removes and frees the start intent. If the end transaction
aborts, it removes the start intent and frees both. This ensures
that a shutdown does not result in a hung unmount and that memory is
not leaked regardless of when a quotaoff error occurs.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-03-18 08:12:23 -07:00
Brian Foster
854f82b1f6 xfs: factor out quotaoff intent AIL removal and memory free
AIL removal of the quotaoff start intent and free of both intents is
hardcoded to the ->iop_committed() handler of the end intent. Factor
out the start intent handling code so it can be used in a future
patch to properly handle quotaoff errors. Use xfs_trans_ail_remove()
instead of the _delete() variant to acquire the AIL lock and also
handle cases where an intent might not reside in the AIL at the
time of a failure.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-03-18 08:12:23 -07:00
Darrick J. Wong
59d677127c xfs: add support for rmap btree staging cursors
Add support for btree staging cursors for the rmap btrees.  This is
needed both for online repair and also to convert xfs_repair to use
btree bulk loading.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-03-18 08:12:23 -07:00
Darrick J. Wong
56e98164ff xfs: add support for refcount btree staging cursors
Add support for btree staging cursors for the refcount btrees.  This
is needed both for online repair and also to convert xfs_repair to use
btree bulk loading.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-03-18 08:12:23 -07:00
Darrick J. Wong
c29ce8f48e xfs: add support for inode btree staging cursors
Add support for btree staging cursors for the inode btrees.  This
is needed both for online repair and also to convert xfs_repair to use
btree bulk loading.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-03-18 08:12:23 -07:00
Darrick J. Wong
e6eb33d905 xfs: add support for free space btree staging cursors
Add support for btree staging cursors for the free space btrees.  This
is needed both for online repair and also to convert xfs_repair to use
btree bulk loading.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-03-18 08:12:23 -07:00
Darrick J. Wong
60e3d70707 xfs: support bulk loading of staged btrees
Add a new btree function that enables us to bulk load a btree cursor.
This will be used by the upcoming online repair patches to generate new
btrees.  This avoids the programmatic inefficiency of calling
xfs_btree_insert in a loop (which generates a lot of log traffic) in
favor of stamping out new btree blocks with ordered buffers, and then
committing both the new root and scheduling the removal of the old btree
blocks in a single transaction commit.

The design of this new generic code is based off the btree rebuilding
code in xfs_repair's phase 5 code, with the explicit goal of enabling us
to share that code between scrub and repair.  It has the additional
feature of being able to control btree block loading factors.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-03-18 08:12:23 -07:00
Darrick J. Wong
349e1c0380 xfs: introduce fake roots for inode-rooted btrees
Create an in-core fake root for inode-rooted btree types so that callers
can generate a whole new btree using the upcoming btree bulk load
function without making the new tree accessible from the rest of the
filesystem.  It is up to the individual btree type to provide a function
to create a staged cursor (presumably with the appropriate callouts to
update the fakeroot) and then commit the staged root back into the
filesystem.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-03-18 08:12:23 -07:00
Darrick J. Wong
e06536a692 xfs: introduce fake roots for ag-rooted btrees
Create an in-core fake root for AG-rooted btree types so that callers
can generate a whole new btree using the upcoming btree bulk load
function without making the new tree accessible from the rest of the
filesystem.  It is up to the individual btree type to provide a function
to create a staged cursor (presumably with the appropriate callouts to
update the fakeroot) and then commit the staged root back into the
filesystem.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-03-18 08:12:23 -07:00
Darrick J. Wong
608eb3cee7 xfs: replace open-coded bitmap weight logic
Add a xbitmap_hweight helper function so that we can get rid of the
open-coded loop.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-03-18 08:12:23 -07:00
Darrick J. Wong
00b10d487b xfs: rename xfs_bitmap to xbitmap
Shorten the name of xfs_bitmap to xbitmap since the scrub bitmap has
nothing to do with the libxfs bitmap.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-03-18 08:12:23 -07:00
Darrick J. Wong
37a6547d92 xfs: xrep_reap_extents should not destroy the bitmap
Remove the xfs_bitmap_destroy call from the end of xrep_reap_extents
because this sort of violates our rule that the function initializing a
structure should destroy it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-03-18 08:12:23 -07:00
yangerkun
d9973ce2fe iomap: fix comments in iomap_dio_rw
Double 'three' exists in the comments of iomap_dio_rw.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-18 08:04:36 -07:00
Christoph Hellwig
d981cb5b9f block: fix a device invalidation regression
Historically we only set the capacity to zero for devices that support
partitions (independ of actually having partitions created).  Doing that
is rather inconsistent, but changing it broke legacy udisks polling for
legacy ide-cdrom devices.  Use the crude a crude check for devices that
either are non-removable or partitionable to get the sane behavior for
most device while not breaking userspace for this particular setup.

Fixes: a1548b6744 ("block: move rescan_partitions to fs/block_dev.c")
Reported-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-18 08:47:04 -06:00
Greg Kroah-Hartman
526ee72dfd debugfs: remove return value of debugfs_create_file_size()
No one checks the return value of debugfs_create_file_size, as it's not
needed, so make the return value void, so that no one tries to do so in
the future.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20200309163640.237984-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-18 13:35:29 +01:00
Taehee Yoo
275678e7a9 debugfs: Check module state before warning in {full/open}_proxy_open()
When the module is being removed, the module state is set to
MODULE_STATE_GOING. At this point, try_module_get() fails.
And when {full/open}_proxy_open() is being called,
it calls try_module_get() to try to hold module reference count.
If it fails, it warns about the possibility of debugfs file leak.

If {full/open}_proxy_open() is called while the module is being removed,
it fails to hold the module.
So, It warns about debugfs file leak. But it is not the debugfs file
leak case. So, this patch just adds module state checking routine
in the {full/open}_proxy_open().

Test commands:
    #SHELL1
    while :
    do
        modprobe netdevsim
        echo 1 > /sys/bus/netdevsim/new_device
        modprobe -rv netdevsim
    done

    #SHELL2
    while :
    do
        cat /sys/kernel/debug/netdevsim/netdevsim1/ports/0/ipsec
    done

Splat looks like:
[  298.766738][T14664] debugfs file owner did not clean up at exit: ipsec
[  298.766766][T14664] WARNING: CPU: 2 PID: 14664 at fs/debugfs/file.c:312 full_proxy_open+0x10f/0x650
[  298.768595][T14664] Modules linked in: netdevsim(-) openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 n][  298.771343][T14664] CPU: 2 PID: 14664 Comm: cat Tainted: G        W         5.5.0+ #1
[  298.772373][T14664] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  298.773545][T14664] RIP: 0010:full_proxy_open+0x10f/0x650
[  298.774247][T14664] Code: 48 c1 ea 03 80 3c 02 00 0f 85 c1 04 00 00 49 8b 3c 24 e8 e4 b5 78 ff 84 c0 75 2d 4c 89 ee 48
[  298.776782][T14664] RSP: 0018:ffff88805b7df9b8 EFLAGS: 00010282[  298.777583][T14664] RAX: dffffc0000000008 RBX: ffff8880511725c0 RCX: 0000000000000000
[  298.778610][T14664] RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff8880540c5c14
[  298.779637][T14664] RBP: 0000000000000000 R08: fffffbfff15235ad R09: 0000000000000000
[  298.780664][T14664] R10: 0000000000000001 R11: 0000000000000000 R12: ffffffffc06b5000
[  298.781702][T14664] R13: ffff88804c234a88 R14: ffff88804c22dd00 R15: ffffffff8a1b5660
[  298.782722][T14664] FS:  00007fafa13a8540(0000) GS:ffff88806c800000(0000) knlGS:0000000000000000
[  298.783845][T14664] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  298.784672][T14664] CR2: 00007fafa0e9cd10 CR3: 000000004b286005 CR4: 00000000000606e0
[  298.785739][T14664] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  298.786769][T14664] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  298.787785][T14664] Call Trace:
[  298.788237][T14664]  do_dentry_open+0x63c/0xf50
[  298.788872][T14664]  ? open_proxy_open+0x270/0x270
[  298.789524][T14664]  ? __x64_sys_fchdir+0x180/0x180
[  298.790169][T14664]  ? inode_permission+0x65/0x390
[  298.790832][T14664]  path_openat+0xc45/0x2680
[  298.791425][T14664]  ? save_stack+0x69/0x80
[  298.791988][T14664]  ? save_stack+0x19/0x80
[  298.792544][T14664]  ? path_mountpoint+0x2e0/0x2e0
[  298.793233][T14664]  ? check_chain_key+0x236/0x5d0
[  298.793910][T14664]  ? sched_clock_cpu+0x18/0x170
[  298.794527][T14664]  ? find_held_lock+0x39/0x1d0
[  298.795153][T14664]  do_filp_open+0x16a/0x260
[ ... ]

Fixes: 9fd4dcece4 ("debugfs: prevent access to possibly dead file_operations at file open")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://lore.kernel.org/r/20200218043150.29447-1-ap420073@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-18 12:37:49 +01:00
Murphy Zhou
979a2665eb CIFS: fiemap: do not return EINVAL if get nothing
If we call fiemap on a truncated file with none blocks allocated,
it makes sense we get nothing from this call. No output means
no blocks have been counted, but the call succeeded. It's a valid
response.

Simple example reproducer:
xfs_io -f 'truncate 2M' -c 'fiemap -v' /cifssch/testfile
xfs_io: ioctl(FS_IOC_FIEMAP) ["/cifssch/testfile"]: Invalid argument

Signed-off-by: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2020-03-17 13:27:06 -05:00
Shyam Prasad N
1be1fa42eb CIFS: Increment num_remote_opens stats counter even in case of smb2_query_dir_first
The num_remote_opens counter keeps track of the number of open files which must be
maintained by the server at any point. This is a per-tree-connect counter, and the value
of this counter gets displayed in the /proc/fs/cifs/Stats output as a following...

Open files: 0 total (local), 1 open on server
                             ^^^^^^^^^^^^^^^^
As a thumb-rule, we want to increment this counter for each open/create that we
successfully execute on the server. Similarly, we should decrement the counter when
we successfully execute a close.

In this case, an increment was being missed in case of smb2_query_dir_first,
in case of successful open. As a result, we would underflow the counter and we
could even see the counter go to negative after sufficient smb2_query_dir_first calls.

I tested the stats counter for a bunch of filesystem operations with the fix.
And it looks like the counter looks correct to me.

I also check if we missed the increments and decrements elsewhere. It does not
seem so. Few other cases where an open is done and we don't increment the counter are
the compound calls where the corresponding close is also sent in the request.

Signed-off-by: Shyam Prasad N <nspmangalore@gmail.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2020-03-17 13:27:03 -05:00
Dan Carpenter
39946886fc cifs: potential unintitliazed error code in cifs_getattr()
Smatch complains that "rc" could be uninitialized.

    fs/cifs/inode.c:2206 cifs_getattr() error: uninitialized symbol 'rc'.

Changing it to "return 0;" improves readability as well.

Fixes: cc1baf98c8f6 ("cifs: do not ignore the SYNC flags in getattr")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-03-17 13:26:26 -05:00
Darrick J. Wong
77ca1eed5a xfs: fix incorrect test in xfs_alloc_ag_vextent_lastblock
When I lifted the code in xfs_alloc_ag_vextent_lastblock out of a loop,
I forgot to convert all the accesses to len to be pointer dereferences.

Coverity-id: 1457918
Fixes: 5113f8ec37 ("xfs: clean up weird while loop in xfs_alloc_ag_vextent_near")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-03-17 07:32:46 -07:00
Daniel Xu
0c47383ba3 kernfs: Add option to enable user xattrs
User extended attributes are useful as metadata storage for kernfs
consumers like cgroups. Especially in the case of cgroups, it is useful
to have a central metadata store that multiple processes/services can
use to coordinate actions.

A concrete example is for userspace out of memory killers. We want to
let delegated cgroup subtree owners (running as non-root) to be able to
say "please avoid killing this cgroup". This is especially important for
desktop linux as delegated subtrees owners are less likely to run as
root.

This patch introduces a new flag, KERNFS_ROOT_SUPPORT_USER_XATTR, that
lets kernfs consumers enable user xattr support. An initial limit of 128
entries or 128KB -- whichever is hit first -- is placed per cgroup
because xattrs come from kernel memory and we don't want to let
unprivileged users accidentally eat up too much kernel memory.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-03-16 15:53:47 -04:00
Daniel Xu
a46a22955b kernfs: Add removed_size out param for simple_xattr_set
This helps set up size accounting in the next commit. Without this out
param, it's difficult to find out the removed xattr size without taking
a lock for longer and walking the xattr linked list twice.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-03-16 15:53:47 -04:00
Daniel Xu
fdc85222d5 kernfs: kvmalloc xattr value instead of kmalloc
xattr values have a 64k maximum size. This can result in an order 4
kmalloc request which can be difficult to fulfill. Since xattrs do not
need physically contiguous memory, we can switch to kvmalloc and not
have to worry about higher order allocations failing.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-03-16 15:53:47 -04:00
Petr Vorel
6cbfad5f20 nfsd: remove read permission bit for ctl sysctl
It's meant to be write-only.

Fixes: 89c905becc ("nfsd: allow forced expiration of NFSv4 clients")
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:34 -04:00
Chuck Lever
3ac3711adb NFSD: Fix NFS server build errors
yuehaibing@huawei.com reports the following build errors arise when
CONFIG_NFSD_V4_2_INTER_SSC is set and the NFS client is not built
into the kernel:

fs/nfsd/nfs4proc.o: In function `nfsd4_do_copy':
nfs4proc.c:(.text+0x23b7): undefined reference to `nfs42_ssc_close'
fs/nfsd/nfs4proc.o: In function `nfsd4_copy':
nfs4proc.c:(.text+0x5d2a): undefined reference to `nfs_sb_deactive'
fs/nfsd/nfs4proc.o: In function `nfsd4_do_async_copy':
nfs4proc.c:(.text+0x61d5): undefined reference to `nfs42_ssc_open'
nfs4proc.c:(.text+0x6389): undefined reference to `nfs_sb_deactive'

The new inter-server copy code invokes client functions. Until the
NFS server has infrastructure to load the appropriate NFS client
modules to handle inter-server copy requests, let's constrain the
way this feature is built.

Reported-by: YueHaibing <yuehaibing@huawei.com>
Fixes: ce0887ac96 ("NFSD add nfs4 inter ssc to nfsd4_copy")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: YueHaibing <yuehaibing@huawei.com> # build-tested
2020-03-16 12:04:34 -04:00
Trond Myklebust
65286b883c nfsd: export upcalls must not return ESTALE when mountd is down
If the rpc.mountd daemon goes down, then that should not cause all
exports to start failing with ESTALE errors. Let's explicitly
distinguish between the cache upcall cases that need to time out,
and those that do not.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:33 -04:00
Trond Myklebust
6a30e47fa0 nfsd: Add tracepoints for update of the expkey and export cache entries
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:33 -04:00
Trond Myklebust
cf749f3cc7 nfsd: Add tracepoints for exp_find_key() and exp_get_by_name()
Add tracepoints for upcalls.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:33 -04:00
Trond Myklebust
f01274a933 nfsd: Add tracing to nfsd_set_fh_dentry()
Add tracing to allow us to figure out where any stale filehandle issues
may be originating from.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:33 -04:00
Trond Myklebust
a451b12311 nfsd: Don't add locks to closed or closing open stateids
In NFSv4, the lock stateids are tied to the lockowner, and the open stateid,
so that the action of closing the file also results in either an automatic
loss of the locks, or an error of the form NFS4ERR_LOCKS_HELD.

In practice this means we must not add new locks to the open stateid
after the close process has been invoked. In fact doing so, can result
in the following panic:

 kernel BUG at lib/list_debug.c:51!
 invalid opcode: 0000 [#1] SMP NOPTI
 CPU: 2 PID: 1085 Comm: nfsd Not tainted 5.6.0-rc3+ #2
 Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.14410784.B64.1908150010 08/15/2019
 RIP: 0010:__list_del_entry_valid.cold+0x31/0x55
 Code: 1a 3d 9b e8 74 10 c2 ff 0f 0b 48 c7 c7 f0 1a 3d 9b e8 66 10 c2 ff 0f 0b 48 89 f2 48 89 fe 48 c7 c7 b0 1a 3d 9b e8 52 10 c2 ff <0f> 0b 48 89 fe 4c 89 c2 48 c7 c7 78 1a 3d 9b e8 3e 10 c2 ff 0f 0b
 RSP: 0018:ffffb296c1d47d90 EFLAGS: 00010246
 RAX: 0000000000000054 RBX: ffff8ba032456ec8 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffff8ba039e99cc8 RDI: ffff8ba039e99cc8
 RBP: ffff8ba032456e60 R08: 0000000000000781 R09: 0000000000000003
 R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ba009a4abe0
 R13: ffff8ba032456e8c R14: 0000000000000000 R15: ffff8ba00adb01d8
 FS:  0000000000000000(0000) GS:ffff8ba039e80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fb213f0b008 CR3: 00000001347de006 CR4: 00000000003606e0
 Call Trace:
  release_lock_stateid+0x2b/0x80 [nfsd]
  nfsd4_free_stateid+0x1e9/0x210 [nfsd]
  nfsd4_proc_compound+0x414/0x700 [nfsd]
  ? nfs4svc_decode_compoundargs+0x407/0x4c0 [nfsd]
  nfsd_dispatch+0xc1/0x200 [nfsd]
  svc_process_common+0x476/0x6f0 [sunrpc]
  ? svc_sock_secure_port+0x12/0x30 [sunrpc]
  ? svc_recv+0x313/0x9c0 [sunrpc]
  ? nfsd_svc+0x2d0/0x2d0 [nfsd]
  svc_process+0xd4/0x110 [sunrpc]
  nfsd+0xe3/0x140 [nfsd]
  kthread+0xf9/0x130
  ? nfsd_destroy+0x50/0x50 [nfsd]
  ? kthread_park+0x90/0x90
  ret_from_fork+0x1f/0x40

The fix is to ensure that lock creation tests for whether or not the
open stateid is unhashed, and to fail if that is the case.

Fixes: 659aefb68e ("nfsd: Ensure we don't recognise lock stateids after freeing them")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:33 -04:00
Chuck Lever
7dcf4ab952 NFSD: Clean up nfsd4_encode_readv
Address some minor nits I noticed while working on this function.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:31 -04:00
Chuck Lever
412055398b nfsd: Fix NFSv4 READ on RDMA when using readv
svcrdma expects that the payload falls precisely into the xdr_buf
page vector. This does not seem to be the case for
nfsd4_encode_readv().

This code is called only when fops->splice_read is missing or when
RQ_SPLICE_OK is clear, so it's not a noticeable problem in many
common cases.

Add new transport method: ->xpo_read_payload so that when a READ
payload does not fit exactly in rq_res's page vector, the XDR
encoder can inform the RPC transport exactly where that payload is,
without the payload's XDR pad.

That way, when a Write chunk is present, the transport knows what
byte range in the Reply message is supposed to be matched with the
chunk.

Note that the Linux NFS server implementation of NFS/RDMA can
currently handle only one Write chunk per RPC-over-RDMA message.
This simplifies the implementation of this fix.

Fixes: b042098063 ("nfsd4: allow exotic read compounds")
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=198053
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:31 -04:00
Madhuparna Bhowmik
057a227435 fs: nfsd: fileache.c: Use built-in RCU list checking
list_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when  CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:31 -04:00
Madhuparna Bhowmik
36a8049181 fs: nfsd: nfs4state.c: Use built-in RCU list checking
list_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when  CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:31 -04:00
Scott Mayhew
7627d7dc79 nfsd: set the server_scope during service startup
Currently, nfsd4_encode_exchange_id() encodes the utsname nodename
string in the server_scope field.  In a multi-host container
environemnt, if an nfsd container is restarted on a different host than
it was originally running on, clients will see a server_scope mismatch
and will not attempt to reclaim opens.

Instead, set the server_scope while we're in a process context during
service startup, so we get the utsname nodename of the current process
and store that in nfsd_net.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
[bfields: fix up major_id too]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:30 -04:00
Linus Torvalds
34d5a4b336 Fix for yet another subtle futex issue. The futex code used ihold() to
prevent inodes from vanishing, but ihold() does not guarantee inode
 persistence. Replace the inode pointer with a per boot, machine wide,
 unique inode identifier. The second commit fixes the breakage of the hash
 mechanism whihc causes a 100% performance regression.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5uQJsTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoYp4EAC5fr/AyRaIn/AEIZUmoyK6ELUaknfH
 Z788avxDB/t5GkzC9A2dMpybYi78tzLSAEfB8jYgwbrqExapVtiqvjGZ1RIi3HoN
 f/DWLnOb2s+yYQ3BQlHu4RdKONEzCqBwKFpElGRv3JzCY8Qeh5cQBzdqzvOEFmYw
 P7DJVtJRZ2dud7AzJ+xk6KuNIKCT2F7Djmtop6nq1EVw0J/2oYOVgQu76APBj7cj
 32srLmpP4xcQiJmWLC5ksXKiZrMPnyNfwXhHFufNvJ2Re6+Wf8mmglqG/5DmA+Ns
 Sq3L7D7yXwtWQZ8Po1qnWhPDZVXQbWzHyTn4YAMJAK7yoO7mut8jgECt+A8vf4L+
 hsc41c6THfdCQQ9gmxLL+c08nZGlmvIC4/1RsihNZ3kd2o4k6Ah9xFp8lBFcpjWd
 7tuhakNqJvUOvB34t2AYqzMFbZ/FJG+QNGyIW0bTUn4YIgRPxI/zsdMxqGVBZ4oN
 0iuy1kPLGbGAnLU9thkiVMmAyaPesuiB6f+mmzobEUgGI35GrCJi6a4YaTG1sqFn
 Gl8oPzcU2n+DWbVBfJrVFHJye7oi78kCw6wpNLBCJQp8NP8doAH0Sgspglg52E/p
 G4GGLz0vGauHBC5wQ3WYiGLImWbzC1dwKdcNE7dhuTgXbhz8ChVlOSU9Fu4+pGpq
 6URL6DVTLwDZPg==
 =e2iB
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull futex fix from Thomas Gleixner:
 "Fix for yet another subtle futex issue.

  The futex code used ihold() to prevent inodes from vanishing, but
  ihold() does not guarantee inode persistence. Replace the inode
  pointer with a per boot, machine wide, unique inode identifier.

  The second commit fixes the breakage of the hash mechanism which
  causes a 100% performance regression"

* tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Unbreak futex hashing
  futex: Fix inode life-time issue
2020-03-15 12:55:52 -07:00
Darrick J. Wong
faf8ee8476 xfs: xfs_dabuf_map should return ENOMEM when map allocation fails
If the xfs_buf_map array allocation in xfs_dabuf_map fails for whatever
reason, we bail out with error code zero.  This will confuse callers, so
make sure that we return ENOMEM.  Allocation failure should never happen
with the small size of the array, but code defensively anyway.

Fixes: 45feef8f50 ("xfs: refactor xfs_dabuf_map")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2020-03-15 09:22:35 -07:00
Pavel Begunkov
60cf46ae60 io-wq: hash dependent work
Enable io-wq hashing stuff for dependent works simply by re-enqueueing
such requests.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-14 17:02:30 -06:00
Pavel Begunkov
8766dd516c io-wq: split hashing and enqueueing
It's a preparation patch removing io_wq_enqueue_hashed(), which
now should be done by io_wq_hash_work() + io_wq_enqueue().

Also, set hash value for dependant works, and do it as late as possible,
because req->file can be unavailable before. This hash will be ignored
by io-wq.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-14 17:02:28 -06:00
Pavel Begunkov
d78298e73a io-wq: don't resched if there is no work
This little tweak restores the behaviour that was before the recent
io_worker_handle_work() optimisation patches. It makes the function do
cond_resched() and flush_signals() only if there is an actual work to
execute.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-14 17:02:26 -06:00
Pavel Begunkov
f1d96a8fcb io_uring: NULL-deref for IOSQE_{ASYNC,DRAIN}
Processing links, io_submit_sqe() prepares requests, drops sqes, and
passes them with sqe=NULL to io_queue_sqe(). There IOSQE_DRAIN and/or
IOSQE_ASYNC requests will go through the same prep, which doesn't expect
sqe=NULL and fail with NULL pointer deference.

Always do full prepare including io_alloc_async_ctx() for linked
requests, and then it can skip the second preparation.

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-14 16:57:41 -06:00
Eric Whitney
2971148d0f ext4: remove map_from_cluster from ext4_ext_map_blocks
We can use the variable allocated_clusters rather than map_from_clusters
to control reserved block/cluster accounting in ext4_ext_map_blocks.
This eliminates a variable and associated code and improves readability
a little.

Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200311205125.25061-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-03-14 14:43:14 -04:00
Eric Whitney
3499046134 ext4: clean up ext4_ext_insert_extent() call in ext4_ext_map_blocks()
Now that the eofblocks code has been removed, we don't need to assign
0 to err before calling ext4_ext_insert_extent() since it will assign
a return value to ret anyway.  The variable free_on_err can be
eliminated and replaced by a reference to allocated_clusters which
clearly conveys the idea that newly allocated blocks should be freed
when recovering from an extent insertion failure.  The error handling
code itself should be restructured so that it errors out immediately on
an insertion failure in the case where no new blocks have been allocated
(bigalloc) rather than proceeding further into the mapping code.  The
initializer for fb_flags can also be rearranged for improved
readability.  Finally, insert a missing space in nearby code.

No known bugs are addressed by this patch - it's simply a cleanup.

Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20200311205033.25013-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-03-14 14:43:13 -04:00
Dmitry Monakhov
eb5760863f ext4: mark block bitmap corrupted when found instead of BUGON
We already has similar code in ext4_mb_complex_scan_group(), but
ext4_mb_simple_scan_group() still affected.

Other reports: https://www.spinics.net/lists/linux-ext4/msg60231.html

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Link: https://lore.kernel.org/r/20200310150156.641-1-dmonakhov@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-03-14 14:43:13 -04:00
Gustavo A. R. Silva
47b1030612 ext4: use flexible-array member for xattr structs
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: https://lore.kernel.org/r/20200309180813.GA3347@embeddedor
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-03-14 14:43:13 -04:00
Gustavo A. R. Silva
e32ac2459c ext4: use flexible-array member in struct fname
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: https://lore.kernel.org/r/20200309154838.GA31559@embeddedor
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-03-14 14:43:13 -04:00
Ritesh Harjani
d3b6f23f71 ext4: move ext4_fiemap to use iomap framework
This patch moves ext4_fiemap to use iomap framework.
For xattr a new 'ext4_iomap_xattr_ops' is added.

Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/b9f45c885814fcdd0631747ff0fe08886270828c.1582880246.git.riteshh@linux.ibm.com
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-03-14 14:43:13 -04:00
Ritesh Harjani
b2c5764262 ext4: make ext4_ind_map_blocks work with fiemap
For indirect block mapping if the i_block > max supported block in inode
then ext4_ind_map_blocks() returns a -EIO error. But in case of fiemap
this could be a valid query to ->iomap_begin call.
So check if the offset >= s_bitmap_maxbytes in ext4_iomap_begin_report(),
then simply skip calling ext4_map_blocks().

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/87fa0ddc5967fa707656212a3b66a7233425325c.1582880246.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-03-14 14:43:12 -04:00
Ritesh Harjani
ac58e4fb03 ext4: move ext4 bmap to use iomap infrastructure
ext4_iomap_begin is already implemented which provides ext4_map_blocks,
so just move the API from generic_block_bmap to iomap_bmap for iomap
conversion.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8bbd53bd719d5ccfecafcce93f2bf1d7955a44af.1582880246.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-03-14 14:43:12 -04:00
Ritesh Harjani
2f424a5a09 ext4: optimize ext4_ext_precache for 0 depth
This patch avoids the memory alloc & free path when depth is 0,
since anyway there is no extra caching done in that case.
So on checking depth 0, simply return early.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/93da0d0f073c73358e85bb9849d8a5378d1da539.1582880246.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-03-14 14:43:12 -04:00
Ritesh Harjani
6386722a32 ext4: add IOMAP_F_MERGED for non-extent based mapping
IOMAP_F_MERGED needs to be set in case of non-extent based mapping.
This is needed in later patches for conversion of ext4_fiemap to use iomap.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/a4764c91c08c16d4d4a4b36defb2a08625b0e9b3.1582880246.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-03-14 14:43:12 -04:00
David S. Miller
44ef976ab3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-03-13

The following pull-request contains BPF updates for your *net-next* tree.

We've added 86 non-merge commits during the last 12 day(s) which contain
a total of 107 files changed, 5771 insertions(+), 1700 deletions(-).

The main changes are:

1) Add modify_return attach type which allows to attach to a function via
   BPF trampoline and is run after the fentry and before the fexit programs
   and can pass a return code to the original caller, from KP Singh.

2) Generalize BPF's kallsyms handling and add BPF trampoline and dispatcher
   objects to be visible in /proc/kallsyms so they can be annotated in
   stack traces, from Jiri Olsa.

3) Extend BPF sockmap to allow for UDP next to existing TCP support in order
   in order to enable this for BPF based socket dispatch, from Lorenz Bauer.

4) Introduce a new bpftool 'prog profile' command which attaches to existing
   BPF programs via fentry and fexit hooks and reads out hardware counters
   during that period, from Song Liu. Example usage:

   bpftool prog profile id 337 duration 3 cycles instructions llc_misses

        4228 run_cnt
     3403698 cycles                                              (84.08%)
     3525294 instructions   #  1.04 insn per cycle               (84.05%)
          13 llc_misses     #  3.69 LLC misses per million isns  (83.50%)

5) Batch of improvements to libbpf, bpftool and BPF selftests. Also addition
   of a new bpf_link abstraction to keep in particular BPF tracing programs
   attached even when the applicaion owning them exits, from Andrii Nakryiko.

6) New bpf_get_current_pid_tgid() helper for tracing to perform PID filtering
   and which returns the PID as seen by the init namespace, from Carlos Neira.

7) Refactor of RISC-V JIT code to move out common pieces and addition of a
   new RV32G BPF JIT compiler, from Luke Nelson.

8) Add gso_size context member to __sk_buff in order to be able to know whether
   a given skb is GSO or not, from Willem de Bruijn.

9) Add a new bpf_xdp_output() helper which reuses XDP's existing perf RB output
   implementation but can be called from tracepoint programs, from Eelco Chaudron.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-13 20:52:03 -07:00
Al Viro
6dfd9fe54d follow_dotdot{,_rcu}(): switch to use of step_into()
gets the regular mount crossing on result of ..

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-13 21:09:13 -04:00
Al Viro
7521f22b3c handle_dots(), follow_dotdot{,_rcu}(): preparation to switch to step_into()
Right now the tail ends of follow_dotdot{,_rcu}() are pretty
much the open-coded analogues of step_into().  The differences:
	* the lack of proper LOOKUP_NO_XDEV handling in non-RCU case
(arguably a bug)
	* the lack of ->d_manage() handling (again, arguably a bug)

Adjust the calling conventions so that on the next step with could
just switch those functions to returning step_into().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-13 21:09:13 -04:00
Al Viro
957dd41d88 move handle_dots(), follow_dotdot() and follow_dotdot_rcu() past step_into()
pure move; we are going to have step_into() called by that bunch.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-13 21:09:13 -04:00
Al Viro
c9a0f75d81 follow_dotdot{,_rcu}(): lift LOOKUP_BENEATH checks out of loop
Behaviour change: LOOKUP_BENEATH lookup of .. in absolute root
yields an error even if it's not the process' root.  That's
possible only if you'd managed to escape chroot jail by way of
procfs symlinks, but IMO the resulting behaviour is not worse -
more consistent and easier to describe:
	".." in root is "stay where you are", uness LOOKUP_BENEATH
	has been given, in which case it's "fail with EXDEV".

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-13 21:09:13 -04:00
Al Viro
abc2c632e0 follow_dotdot{,_rcu}(): lift switching nd->path to parent out of loop
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-13 21:09:13 -04:00
Al Viro
a6a7eb7628 expand path_parent_directory() in its callers
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-13 21:09:13 -04:00
Al Viro
63b27720a4 path_parent_directory(): leave changing path->dentry to callers
Instead of returning 0, return new dentry; instead of returning
-ENOENT, return NULL.  Adjust the callers accordingly.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-13 21:09:13 -04:00