Commit Graph

68382 Commits

Author SHA1 Message Date
J. Bruce Fields
b2140338d8 nfsd: simplify nfsd4_change_info
It doesn't make sense to carry all these extra fields around.  Just
make everything into change attribute from the start.

This is just cleanup, there should be no change in behavior.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-12-09 09:39:37 -05:00
J. Bruce Fields
70b87f7729 nfsd: only call inode_query_iversion in the I_VERSION case
inode_query_iversion() can modify i_version.  Depending on the exported
filesystem, that may not be safe.  For example, if you're re-exporting
NFS, NFS stores the server's change attribute in i_version and does not
expect it to be modified locally.  This has been observed causing
unnecessary cache invalidations.

The way a filesystem indicates that it's OK to call
inode_query_iverson() is by setting SB_I_VERSION.

So, move the I_VERSION check out of encode_change(), where it's used
only in GETATTR responses, to nfsd4_change_attribute(), which is
also called for pre- and post- operation attributes.

(Note we could also pull the NFSEXP_V4ROOT case into
nfsd4_change_attribute() as well.  That would actually be a no-op,
since pre/post attrs are only used for metadata-modifying operations,
and V4ROOT exports are read-only.  But we might make the change in
the future just for simplicity.)

Reported-by: Daire Byrne <daire@dneg.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-12-09 09:39:37 -05:00
Cheng Lin
4a9d81caf8 nfs_common: need lock during iterate through the list
If the elem is deleted during be iterated on it, the iteration
process will fall into an endless loop.

kernel: NMI watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [nfsd:17137]

PID: 17137  TASK: ffff8818d93c0000  CPU: 4   COMMAND: "nfsd"
    [exception RIP: __state_in_grace+76]
    RIP: ffffffffc00e817c  RSP: ffff8818d3aefc98  RFLAGS: 00000246
    RAX: ffff881dc0c38298  RBX: ffffffff81b03580  RCX: ffff881dc02c9f50
    RDX: ffff881e3fce8500  RSI: 0000000000000001  RDI: ffffffff81b03580
    RBP: ffff8818d3aefca0   R8: 0000000000000020   R9: ffff8818d3aefd40
    R10: ffff88017fc03800  R11: ffff8818e83933c0  R12: ffff8818d3aefd40
    R13: 0000000000000000  R14: ffff8818e8391068  R15: ffff8818fa6e4000
    CS: 0010  SS: 0018
 #0 [ffff8818d3aefc98] opens_in_grace at ffffffffc00e81e3 [grace]
 #1 [ffff8818d3aefca8] nfs4_preprocess_stateid_op at ffffffffc02a3e6c [nfsd]
 #2 [ffff8818d3aefd18] nfsd4_write at ffffffffc028ed5b [nfsd]
 #3 [ffff8818d3aefd80] nfsd4_proc_compound at ffffffffc0290a0d [nfsd]
 #4 [ffff8818d3aefdd0] nfsd_dispatch at ffffffffc027b800 [nfsd]
 #5 [ffff8818d3aefe08] svc_process_common at ffffffffc02017f3 [sunrpc]
 #6 [ffff8818d3aefe70] svc_process at ffffffffc0201ce3 [sunrpc]
 #7 [ffff8818d3aefe98] nfsd at ffffffffc027b117 [nfsd]
 #8 [ffff8818d3aefec8] kthread at ffffffff810b88c1
 #9 [ffff8818d3aeff50] ret_from_fork at ffffffff816d1607

The troublemake elem:
crash> lock_manager ffff881dc0c38298
struct lock_manager {
  list = {
    next = 0xffff881dc0c38298,
    prev = 0xffff881dc0c38298
  },
  block_opens = false
}

Fixes: c87fb4a378 ("lockd: NLM grace period shouldn't block NFSv4 opens")
Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-12-09 09:38:34 -05:00
Dai Ngo
ca9364dde5 NFSD: Fix 5 seconds delay when doing inter server copy
Since commit b4868b44c5 ("NFSv4: Wait for stateid updates after
CLOSE/OPEN_DOWNGRADE"), every inter server copy operation suffers 5
seconds delay regardless of the size of the copy. The delay is from
nfs_set_open_stateid_locked when the check by nfs_stateid_is_sequential
fails because the seqid in both nfs4_state and nfs4_stateid are 0.

Fix by modifying nfs4_init_cp_state to return the stateid with seqid 1
instead of 0. This is also to conform with section 4.8 of RFC 7862.

Here is the relevant paragraph from section 4.8 of RFC 7862:

   A copy offload stateid's seqid MUST NOT be zero.  In the context of a
   copy offload operation, it is inappropriate to indicate "the most
   recent copy offload operation" using a stateid with a seqid of zero
   (see Section 8.2.2 of [RFC5661]).  It is inappropriate because the
   stateid refers to internal state in the server and there may be
   several asynchronous COPY operations being performed in parallel on
   the same file by the server.  Therefore, a copy offload stateid with
   a seqid of zero MUST be considered invalid.

Fixes: ce0887ac96 ("NFSD add nfs4 inter ssc to nfsd4_copy")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-12-09 09:38:34 -05:00
Chuck Lever
eb162e1772 NFSD: Fix sparse warning in nfs4proc.c
linux/fs/nfsd/nfs4proc.c:1542:24: warning: incorrect type in assignment (different base types)
linux/fs/nfsd/nfs4proc.c:1542:24:    expected restricted __be32 [assigned] [usertype] status
linux/fs/nfsd/nfs4proc.c:1542:24:    got int

Clean-up: The dup_copy_fields() function returns only zero, so make
it return void for now, and get rid of the return code check.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-12-09 09:38:34 -05:00
kazuo ito
4420440c57 nfsd: Fix message level for normal termination
The warning message from nfsd terminating normally
can confuse system adminstrators or monitoring software.

Though it's not exactly fair to pin-point a commit where it
originated, the current form in the current place started
to appear in:

Fixes: e096bbc648 ("knfsd: remove special handling for SIGHUP")
Signed-off-by: kazuo ito <kzpn200@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-12-09 09:38:33 -05:00
Gao Xiang
1825c8d7ce erofs: force inplace I/O under low memory scenario
Try to forcely switch to inplace I/O under low memory scenario in
order to avoid direct memory reclaim due to cached page allocation.

Link: https://lore.kernel.org/r/20201209123717.12430-1-hsiangkao@aol.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-12-09 20:42:02 +08:00
Anant Thazhemadam
e51d68e76d fs: quota: fix array-index-out-of-bounds bug by passing correct argument to vfs_cleanup_quota_inode()
When dquot_resume() was last updated, the argument that got passed
to vfs_cleanup_quota_inode was incorrectly set.

If type = -1 and dquot_load_quota_sb() returns a negative value,
then vfs_cleanup_quota_inode() gets called with -1 passed as an
argument, and this leads to an array-index-out-of-bounds bug.

Fix this issue by correctly passing the arguments.

Fixes: ae45f07d47 ("quota: Simplify dquot_resume()")
Link: https://lore.kernel.org/r/20201208194338.7064-1-anant.thazhemadam@gmail.com
Reported-by: syzbot+2643e825238d7aabb37f@syzkaller.appspotmail.com
Tested-by: syzbot+2643e825238d7aabb37f@syzkaller.appspotmail.com
CC: stable@vger.kernel.org
Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-12-09 10:07:10 +01:00
Darrick J. Wong
3945ae03d8 xfs: move kernel-specific superblock validation out of libxfs
A couple of the superblock validation checks apply only to the kernel,
so move them to xfs_fc_fill_super before we add the needsrepair "feature",
which will prevent the kernel (but not xfsprogs) from mounting the
filesystem.  This also reduces the diff between kernel and userspace
libxfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2020-12-08 19:30:10 -08:00
David Howells
4cb6829647 afs: Fix memory leak when mounting with multiple source parameters
There's a memory leak in afs_parse_source() whereby multiple source=
parameters overwrite fc->source in the fs_context struct without freeing
the previously recorded source.

Fix this by only permitting a single source parameter and rejecting with
an error all subsequent ones.

This was caught by syzbot with the kernel memory leak detector, showing
something like the following trace:

  unreferenced object 0xffff888114375440 (size 32):
    comm "repro", pid 5168, jiffies 4294923723 (age 569.948s)
    backtrace:
      slab_post_alloc_hook+0x42/0x79
      __kmalloc_track_caller+0x125/0x16a
      kmemdup_nul+0x24/0x3c
      vfs_parse_fs_string+0x5a/0xa1
      generic_parse_monolithic+0x9d/0xc5
      do_new_mount+0x10d/0x15a
      do_mount+0x5f/0x8e
      __do_sys_mount+0xff/0x127
      do_syscall_64+0x2d/0x3a
      entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 13fcc68370 ("afs: Add fs_context support")
Reported-by: syzbot+86dc6632faaca40133ab@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-08 15:59:25 -08:00
Daeho Jeong
6422a71ef4 f2fs: fix race of pending_pages in decompression
I found out f2fs_free_dic() is invoked in a wrong timing, but
f2fs_verify_bio() still needed the dic info and it triggered the
below kernel panic. It has been caused by the race condition of
pending_pages value between decompression and verity logic, when
the same compression cluster had been split in different bios.
By split bios, f2fs_verify_bio() ended up with decreasing
pending_pages value before it is reset to nr_cpages by
f2fs_decompress_pages() and caused the kernel panic.

[ 4416.564763] Unable to handle kernel NULL pointer dereference
               at virtual address 0000000000000000
...
[ 4416.896016] Workqueue: fsverity_read_queue f2fs_verity_work
[ 4416.908515] pc : fsverity_verify_page+0x20/0x78
[ 4416.913721] lr : f2fs_verify_bio+0x11c/0x29c
[ 4416.913722] sp : ffffffc019533cd0
[ 4416.913723] x29: ffffffc019533cd0 x28: 0000000000000402
[ 4416.913724] x27: 0000000000000001 x26: 0000000000000100
[ 4416.913726] x25: 0000000000000001 x24: 0000000000000004
[ 4416.913727] x23: 0000000000001000 x22: 0000000000000000
[ 4416.913728] x21: 0000000000000000 x20: ffffffff2076f9c0
[ 4416.913729] x19: ffffffff2076f9c0 x18: ffffff8a32380c30
[ 4416.913731] x17: ffffffc01f966d97 x16: 0000000000000298
[ 4416.913732] x15: 0000000000000000 x14: 0000000000000000
[ 4416.913733] x13: f074faec89ffffff x12: 0000000000000000
[ 4416.913734] x11: 0000000000001000 x10: 0000000000001000
[ 4416.929176] x9 : ffffffff20d1f5c7 x8 : 0000000000000000
[ 4416.929178] x7 : 626d7464ff286b6b x6 : ffffffc019533ade
[ 4416.929179] x5 : 000000008049000e x4 : ffffffff2793e9e0
[ 4416.929180] x3 : 000000008049000e x2 : ffffff89ecfa74d0
[ 4416.929181] x1 : 0000000000000c40 x0 : ffffffff2076f9c0
[ 4416.929184] Call trace:
[ 4416.929187]  fsverity_verify_page+0x20/0x78
[ 4416.929189]  f2fs_verify_bio+0x11c/0x29c
[ 4416.929192]  f2fs_verity_work+0x58/0x84
[ 4417.050667]  process_one_work+0x270/0x47c
[ 4417.055354]  worker_thread+0x27c/0x4d8
[ 4417.059784]  kthread+0x13c/0x320
[ 4417.063693]  ret_from_fork+0x10/0x18

Chao pointed this can happen by the below race condition.

Thread A        f2fs_post_read_wq          fsverity_wq
- f2fs_read_multi_pages()
  - f2fs_alloc_dic
   - dic->pending_pages = 2
   - submit_bio()
   - submit_bio()
               - f2fs_post_read_work() handle first bio
                - f2fs_decompress_work()
                 - __read_end_io()
                  - f2fs_decompress_pages()
                   - dic->pending_pages--
                - enqueue f2fs_verity_work()
                                           - f2fs_verity_work() handle first bio
                                            - f2fs_verify_bio()
                                             - dic->pending_pages--
               - f2fs_post_read_work() handle second bio
                - f2fs_decompress_work()
                - enqueue f2fs_verity_work()
                                            - f2fs_verify_pages()
                                            - f2fs_free_dic()

                                          - f2fs_verity_work() handle second bio
                                           - f2fs_verfy_bio()
                                                 - use-after-free on dic

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-08 15:39:14 -08:00
Chao Yu
96dd025195 f2fs: fix to account inline xattr correctly during recovery
During recovery, we may missed to update inline xattr count correctly,
fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-08 14:25:41 -08:00
Jack Qiu
8492156153 f2fs: inline: fix wrong inline inode stat
Miss to stat inline inode in f2fs_recover_inline_data.

Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-08 14:25:41 -08:00
Jack Qiu
6e5ca4fce7 f2fs: inline: correct comment in f2fs_recover_inline_data
In 3rd scene, it should remove data blocks instead of inline_data.

Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-08 14:25:41 -08:00
Yangtao Li
d540e35d4e f2fs: don't check PAGE_SIZE again in sanity_check_raw_super()
Many flash devices read and write a single IO based on a multiple
of 4KB, and we support only 4KB page cache size now.

Since we already check page size in init_f2fs_fs(), so remove page
size check in sanity_check_raw_super().

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Shaohua Liu <liush@allwinnertech.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-08 14:25:40 -08:00
Yangtao Li
b9ec10948f f2fs: convert to F2FS_*_INO macro
Use F2FS_ROOT_INO, F2FS_NODE_INO and F2FS_META_INO macro
for better code readability.

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Shaohua Liu <liush@allwinnertech.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-08 14:25:40 -08:00
Linus Torvalds
7d8761ba27 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull seq_file fix from Al Viro:
 "This fixes a regression introduced in this cycle wrt iov_iter based
  variant for reading a seq_file"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix return values of seq_read_iter()
2020-12-08 12:20:34 -08:00
Hillf Danton
f26c08b444 io_uring: fix file leak on error path of io ctx creation
Put file as part of error handling when setting up io ctx to fix
memory leaks like the following one.

   BUG: memory leak
   unreferenced object 0xffff888101ea2200 (size 256):
     comm "syz-executor355", pid 8470, jiffies 4294953658 (age 32.400s)
     hex dump (first 32 bytes):
       00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
       20 59 03 01 81 88 ff ff 80 87 a8 10 81 88 ff ff   Y..............
     backtrace:
       [<000000002e0a7c5f>] kmem_cache_zalloc include/linux/slab.h:654 [inline]
       [<000000002e0a7c5f>] __alloc_file+0x1f/0x130 fs/file_table.c:101
       [<000000001a55b73a>] alloc_empty_file+0x69/0x120 fs/file_table.c:151
       [<00000000fb22349e>] alloc_file+0x33/0x1b0 fs/file_table.c:193
       [<000000006e1465bb>] alloc_file_pseudo+0xb2/0x140 fs/file_table.c:233
       [<000000007118092a>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
       [<000000007118092a>] anon_inode_getfile+0xaa/0x120 fs/anon_inodes.c:74
       [<000000002ae99012>] io_uring_get_fd fs/io_uring.c:9198 [inline]
       [<000000002ae99012>] io_uring_create fs/io_uring.c:9377 [inline]
       [<000000002ae99012>] io_uring_setup+0x1125/0x1630 fs/io_uring.c:9411
       [<000000008280baad>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       [<00000000685d8cf0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported-by: syzbot+71c4697e27c99fddcf17@syzkaller.appspotmail.com
Fixes: 0f2122045b ("io_uring: don't rely on weak ->files references")
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-08 08:54:26 -07:00
Naohiro Aota
7b3d5a90cb btrfs: introduce ZONED feature flag
This patch introduces the ZONED incompat flag. The flag indicates that
the volume management will satisfy the constraints imposed by
host-managed zoned block devices (aligned chunk allocation, append-only
updates, reset zone after filled).

As the zoned support will happen incrementally due to enhancing some
core infrastructure like super block writes, tree-log, raid support, the
feature will appear in sysfs only on debug builds. It will be enabled
once the support is feature complete and applications can reliably check
whether zoned support is present or not.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:16 +01:00
Nikolay Borisov
a2633b6a29 btrfs: return bool from btrfs_should_end_transaction
Results in slightly smaller code.

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-11 (-11)
Function                                     old     new   delta
btrfs_should_end_transaction                  96      85     -11
Total: Before=20070, After=20059, chg -0.05%

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:16 +01:00
Nikolay Borisov
8a8f4deaba btrfs: return bool from should_end_transaction
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:15 +01:00
Nikolay Borisov
8df01fddb7 btrfs: remove err variable from do_relocation
It simply gets assigned to 'ret' in case of errors. The flow of the
while loop is not changed by this commit since the few call sites
that 'goto next' will simply break from the loop.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:15 +01:00
Nikolay Borisov
c6a592f2e2 btrfs: eliminate err variable from merge_reloc_root
In most cases when an error is returned from a function 'ret' is simply
assigned to 'err'. There is only one case where walk_up_reloc_tree can
return a positive value - in this case the code breaks from the loop and
ret is going to get its return value from btrfs_cow_block - either 0 or
negative. This retains the old logic of how 'err' used to be set at
this call site.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:15 +01:00
Nikolay Borisov
ee0d904fd9 btrfs: remove err variable from btrfs_delete_subvolume
Use only a single 'ret' to control whether we should abort the
transaction or not. That's fine, because if we abort a transaction then
btrfs_end_transaction will return the same value as passed to
btrfs_abort_transaction. No semantic changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:15 +01:00
Filipe Manana
c65ca98f9e btrfs: unlock path before checking if extent is shared during nocow writeback
When we are attempting to start writeback for an existing extent in NOCOW
mode, at run_delalloc_nocow(), we must check if the extent is shared, and
if it is, fallback to a COW write. However we do such check while still
holding a read lock on the leaf that contains the file extent item, and
that check, the call to btrfs_cross_ref_exist(), can take some time
because:

1) It needs to do a search on the extent tree, which obviously takes some
   time, specially if delayed references are being run at the moment, as
   we can block when trying to lock currently write locked btree nodes;

2) It needs to check the delayed references for any existing reference
   for our data extent, this requires acquiring the delayed references'
   spinlock and maybe block on the mutex of a delayed reference head in the
   case where there is a delayed reference for our data extent, in the
   worst case it makes us release the path on the extent tree and retry
   the whole process again (going back to step 1).

There are other operations we do while holding the leaf locked that can
take some significant time as well (specially all together):

* btrfs_extent_readonly() - to check if the block group containing the
  extent is currently in RO mode. This requires taking a spinlock and
  searching for the block group in a rbtree that can be big on large
  filesystems;

* csum_exist_in_range() - to search if there are any checksums in the
  csum tree for the extent. Like before, this can take some time if we are
  in a filesystem that has both COW and NOCOW files, in which case the
  csum tree is not empty;

* btrfs_inc_nocow_writers() - increment the number of nocow writers in the
  block group that contains the data extent. Needs to acquire a spinlock
  and search for the block group in a rbtree that can be big on large
  filesystems.

So just unlock the leaf (release the path) before doing all those checks,
since we do not need it anymore. In case we can not do a NOCOW write for
the extent, due to any of those checks failing, and the writeback range
goes beyond that extents' length, we will do another btree search for the
next file extent item.

The following script that calls dbench was used to measure the impact of
this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device
directly (no intermediary filesystem on the host) and using a non-debug
kernel (default configuration on Debian):

  $ cat test-dbench.sh
  #!/bin/bash

  DEV=/dev/sdk
  MNT=/mnt/sdk
  MOUNT_OPTIONS="-o ssd -o nodatacow"
  MKFS_OPTIONS="-m single -d single"

  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 300 64

  umount $MNT

Before this change:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    9326331     0.317   399.957
 Close        6851198     0.002     6.402
 Rename        394894     2.621   402.819
 Unlink       1883131     0.931   398.082
 Deltree          256    19.160   303.580
 Mkdir            128     0.003     0.016
 Qpathinfo    8452314     0.068   116.133
 Qfileinfo    1481921     0.001     5.081
 Qfsinfo      1549963     0.002     4.444
 Sfileinfo     759679     0.084    17.079
 Find         3268168     0.396   118.196
 WriteX       4653310     0.056   110.993
 ReadX        14618818     0.005    23.314
 LockX          30364     0.003     0.497
 UnlockX        30364     0.002     1.720
 Flush         653619    16.954   569.299

Throughput 966.651 MB/sec  64 clients  64 procs  max_latency=569.377 ms

After this change:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    9710433     0.302   232.449
 Close        7132948     0.002    11.496
 Rename        411144     2.452   131.805
 Unlink       1960961     0.893   230.383
 Deltree          256    14.858   198.646
 Mkdir            128     0.002     0.005
 Qpathinfo    8800890     0.066   111.588
 Qfileinfo    1542556     0.001     3.852
 Qfsinfo      1613835     0.002     5.483
 Sfileinfo     790871     0.081    19.492
 Find         3402743     0.386   120.185
 WriteX       4842918     0.054   179.312
 ReadX        15220407     0.005    32.435
 LockX          31612     0.003     1.533
 UnlockX        31612     0.002     1.047
 Flush         680567    16.320   463.323

Throughput 1016.59 MB/sec  64 clients  64 procs  max_latency=463.327 ms

+5.0% throughput, -20.5% max latency

Also, the following test using fio was run:

  $ cat test-fio.sh
  #!/bin/bash

  DEV=/dev/sdk
  MNT=/mnt/sdk
  MOUNT_OPTIONS="-o ssd -o nodatacow"
  MKFS_OPTIONS="-d single -m single"

  if [ $# -ne 4 ]; then
      echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE"
      exit 1
  fi

  NUM_JOBS=$1
  FILE_SIZE=$2
  FSYNC_FREQ=$3
  BLOCK_SIZE=$4

  cat <<EOF > /tmp/fio-job.ini
  [writers]
  rw=randwrite
  fsync=$FSYNC_FREQ
  fallocate=none
  group_reporting=1
  direct=0
  bs=$BLOCK_SIZE
  ioengine=sync
  size=$FILE_SIZE
  directory=$MNT
  numjobs=$NUM_JOBS
  EOF

  echo
  echo "Using fio config:"
  echo
  cat /tmp/fio-job.ini
  echo
  echo "mount options: $MOUNT_OPTIONS"
  echo

  mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
  mount $MOUNT_OPTIONS $DEV $MNT

  echo "Creating nodatacow files before fio runs..."
  for ((i = 0; i < $NUM_JOBS; i++)); do
      xfs_io -f -c "pwrite -b 128M 0 $FILE_SIZE" "$MNT/writers.$i.0"
  done
  sync

  fio /tmp/fio-job.ini
  umount $MNT

Before this change:

$ ./test-fio.sh 16 512M 2 4K
(...)
WRITE: bw=28.3MiB/s (29.6MB/s), 28.3MiB/s-28.3MiB/s (29.6MB/s-29.6MB/s), io=8192MiB (8590MB), run=289800-289800msec

After this change:

$ ./test-fio.sh 16 512M 2 4K
(...)
WRITE: bw=31.2MiB/s (32.7MB/s), 31.2MiB/s-31.2MiB/s (32.7MB/s-32.7MB/s), io=8192MiB (8590MB), run=262845-262845msec

+9.7% throughput, -9.8% runtime

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:15 +01:00
David Sterba
c7c01a4a25 btrfs: tree-checker: annotate all error branches as unlikely
The tree checker is called many times as it verifies metadata at
read/write time. The checks follow a simple pattern:

  if (error_condition) {
	  report_error();
	  return -EUCLEAN;
  }

All the error reporting functions are annotated as __cold that is
supposed to hint the compiler to move the statement block out of the hot
path. This does not seem to happen that often.

As the error condition is expected to be false almost always, we can
annotate it with 'unlikely' as this satisfies one of the few use cases
for the annotation. The expected outcome is a stronger hint to compiler
to reorder the checks

  test
  jump to exit
  test
  jump to exit
  ...

which can be observed in asm of eg. check_dir_item,
btrfs_check_chunk_valid, check_root_item or check_leaf.

There's a measurable run time improvement reported by Josef, the testing
workload went from 655 MiB/s to 677 MiB/s, which is about +3%.

There should be no functional changes but some of the conditions have
been rewritten to produce more readable result, some lines are longer
than 80, for the sake of readability.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:15 +01:00
David Sterba
a0f6d924ca btrfs: remove stub device info from messages when we have no fs_info
Without a NULL fs_info the helpers will print something like

	BTRFS error (device <unknown>): ...

This can happen in contexts where fs_info is not available at all or
it's potentially unsafe due to object lifetime. The <unknown> stub does
not bring much information and with the prefix makes the message
unnecessarily longer.

Remove it for the NULL fs_info case.

	BTRFS error: ...

Callers can add the device information to the message itself if needed.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:14 +01:00
Qu Wenruo
fb22e9c4cd btrfs: use detach_page_private() in alloc_extent_buffer()
In alloc_extent_buffer(), after we got a page from btree inode, we check
if that page has private pointer attached.

If attached, we check if the existing extent buffer has proper refs.
If not (the eb is being freed), we will detach that private eb pointer.

The point here is, we are detaching that eb pointer by calling:
- ClearPagePrivate()
- put_page()

The put_page() here is especially confusing, as it's decreasing the ref
from attach_page_private().  Without knowing that, it looks like the
put_page() is for the find_or_create_page() call, confusing the reader.

Since we're always modifying page private with attach_page_private() and
detach_page_private(), the only open-coded detach_page_private() here is
really confusing.

Fix it by calling detach_page_private().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:14 +01:00
Qu Wenruo
35478d053a btrfs: use nodesize to determine if we need readahead in btrfs_lookup_bio_sums
In btrfs_lookup_bio_sums() if the bio is pretty large, we want to
start readahead in the csum tree.

However the threshold is an immediate number, (PAGE_SIZE * 8), from the
initial btrfs merge.

The meaning of the value is pretty hard to guess, especially when the
immediate number is from the times when 4K sectorsize was the default
and only CRC32C was supported.

For the most common btrfs setup, CRC32 csum and 4K sectorsize,
it means just 32K read would kick readahead, while the csum itself is
only 32 bytes in size.

Now let's be more reasonable by taking both csum size and node size into
consideration.

If the csum size for the bio is larger than one leaf, then we kick the
readahead.  This means for current default btrfs, the threshold will be
16M.

This change should not change performance observably, thus this is
mostly a readability enhancement.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:14 +01:00
Qu Wenruo
829ddec922 btrfs: only clear EXTENT_LOCK bit in extent_invalidatepage
extent_invalidatepage() will try to clear all possible bits since it's
calling clear_extent_bit() with delete == 1.

This is currently fine, since for btree io tree, it only utilizes
EXTENT_LOCK bit.  But this could be a problem for later subpage support,
which will utilize extra io tree bit to represent additional info.

This patch will just convert that clear_extent_bit() to
unlock_extent_cached().

For current code since only EXTENT_LOCKED bit is utilized, this doesn't
change the behavior, but provides a much cleaner basis for incoming
subpage support.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:14 +01:00
Qu Wenruo
8e1dc982ed btrfs: remove unused parameter phy_offset from btrfs_validate_metadata_buffer
Parameter @phy_offset is the offset against the bio->bi_iter.bi_sector.
@phy_offset is mostly for data io to lookup the csum in btrfs_io_bio.

But for metadata, it's completely useless as metadata stores their own
csum in its header, so we can remove it.

Note: parameters @start and @end, they are not utilized at all for
current sectorsize == PAGE_SIZE case, as we can grab eb directly from
page.

But those two parameters are very important for later subpage support,
thus @start/@len are not touched here.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:14 +01:00
Qu Wenruo
2c36395430 btrfs: scrub: remove the anonymous structure from scrub_page
That anonymous structure serve no special purpose, just replace it with
regular members.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:14 +01:00
Qu Wenruo
f97e27e91d btrfs: use fixed width int type for extent_state::state
Currently the type is unsigned int which could change its width
depending on the architecture. We need up to 32 bits so make it
explicit.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:13 +01:00
Qu Wenruo
e09caaf913 btrfs: introduce helper to handle page status update in end_bio_extent_readpage()
Introduce a new helper to handle update page status in
end_bio_extent_readpage(). This will be later used for subpage support
where the page status update can be more complex than now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:13 +01:00
Qu Wenruo
94e8c95ccb btrfs: add structure to keep track of extent range in end_bio_extent_readpage
In end_bio_extent_readpage() we had a strange dance around
extent_start/extent_len.

Hidden behind the strange dance is, it's just calling
endio_readpage_release_extent() on each bvec range.

Here is an example to explain the original work flow:

  Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)

  end_bio_extent_extent_readpage() entered
  |- extent_start = 0;
  |- extent_end = 0;
  |- bio_for_each_segment_all() {
  |  |- /* Got the 1st bvec */
  |  |- start = SZ_1M;
  |  |- end = SZ_1M + SZ_4K - 1;
  |  |- update = 1;
  |  |- if (extent_len == 0) {
  |  |  |- extent_start = start; /* SZ_1M */
  |  |  |- extent_len = end + 1 - start; /* SZ_1M */
  |  |  }
  |  |
  |  |- /* Got the 2nd bvec */
  |  |- start = SZ_1M + 4K;
  |  |- end = SZ_1M + 4K - 1;
  |  |- update = 1;
  |  |- if (extent_start + extent_len == start) {
  |  |  |- extent_len += end + 1 - start; /* SZ_8K */
  |  |  }
  |  } /* All bio vec iterated */
  |
  |- if (extent_len) {
     |- endio_readpage_release_extent(tree, extent_start, extent_len,
				      update);
	/* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */

As the above flow shows, the existing code in end_bio_extent_readpage()
is accumulates extent_start/extent_len, and when the contiguous range
stops, calls endio_readpage_release_extent() for the range.

However current behavior has something not really considered:

- The inode can change
  For bio, its pages don't need to have contiguous page_offset.
  This means, even pages from different inodes can be packed into one
  bio.

- bvec cross page boundary
  There is a feature called multi-page bvec, where bvec->bv_len can go
  beyond bvec->bv_page boundary.

- Poor readability

This patch will address the problem:

- Introduce a proper structure, processed_extent, to record processed
  extent range

- Integrate inode/start/end/uptodate check into
  endio_readpage_release_extent()

- Add more comment on each step.
  This should greatly improve the readability, now in
  end_bio_extent_readpage() there are only two
  endio_readpage_release_extent() calls.

- Add inode check for contiguity
  Now we also ensure the inode is the same one before checking if the
  range is contiguous.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:13 +01:00
Qu Wenruo
b1d51f67c9 btrfs: tests: remove invalid extent-io test
In extent-io-test, there are two invalid tests:

- Invalid nodesize for test_eb_bitmaps()
  Instead of the sectorsize and nodesize combination passed in, we're
  always using hand-crafted nodesize, e.g:

	len = (sectorsize < BTRFS_MAX_METADATA_BLOCKSIZE)
		? sectorsize * 4 : sectorsize;

  In above case, if we have 32K page size, then we will get a length of
  128K, which is beyond max node size, and obviously invalid.

  The common page size goes up to 64K so we haven't hit that

- Invalid extent buffer bytenr
  For 64K page size, the only combination we're going to test is
  sectorsize = nodesize = 64K.
  However, in that case we will try to test an eb which bytenr is not
  sectorsize aligned:

	/* Do it over again with an extent buffer which isn't page-aligned. */
	eb = __alloc_dummy_extent_buffer(fs_info, nodesize / 2, len);

  Sector alignment is a hard requirement for any sector size.
  The only exception is superblock. But anything else should follow
  sector size alignment.

  This is definitely an invalid test case.

This patch will fix both problems by:

- Honor the sectorsize/nodesize combination
  Now we won't bother to hand-craft the length and use it as nodesize.

- Use sectorsize as the 2nd run extent buffer start
  This would test the case where extent buffer is aligned to sectorsize
  but not always aligned to nodesize.

Please note that, later subpage related cleanup will reduce
extent_buffer::pages[] to exactly what we need, making the sector
unaligned extent buffer operations cause problems.

Since only extent_io self tests utilize this, this patch is required for
all later cleanup/refactoring.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:13 +01:00
Tom Rix
445d8ab53f btrfs: sysfs: remove unneeded semicolon
A semicolon is not needed after a switch statement.

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:13 +01:00
Nikolay Borisov
95b982de37 btrfs: simplify return values in setup_nodes_for_search
The function is needlessly convoluted. Fix that by:

* removing redundant sret variable definition in both if arms

* replace the again/done labels with direct return statements, the
  function is short enough and doesn't do anything special upon exit

* remove BUG_ON on split_node returning a positive number - it can't
  happen as split_node returns either 0 or a negative error code.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:13 +01:00
Nikolay Borisov
d5286a92ea btrfs: remove useless return value statement in split_node
At the point when we set 'ret = 0' it's guaranteed that the function is
going to return 0 so directly return 0. No functional changes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:12 +01:00
Filipe Manana
f30bed8342 btrfs: remove unnecessary attempt to drop extent maps after adding inline extent
At inode.c:cow_file_range_inline(), after we insert the inline extent
in the fs/subvolume btree, we call btrfs_drop_extent_cache() to drop
all extent maps in the file range, however that is not necessary because
we have already done it in the call to btrfs_drop_extents(), which calls
btrfs_drop_extent_cache() for us, and since at this point we have the file
range locked in the inode's iotree (we are in the writeback path), we know
no other task can come in and read stale file extent items or find none
and therefore create either stale extent maps or an extent map that
represents a hole.

So just remove that unnecessary call to btrfs_drop_extent_cache(), as it's
doing nothing and only wasting time. This call has been around since 2008,
introduced in commit c8b978188c ("Btrfs: Add zlib compression support"),
but even back then it seems it was not necessary, since we had the range
locked in the inode's iotree and the call to btrfs_drop_extents() already
used to always call btrfs_drop_extent_cache().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:12 +01:00
Filipe Manana
bc5b5b1e51 btrfs: stop incrementing log batch when joining log transaction
When joining a log transaction we acquire the root's log mutex, then
increment the root's log batch and log writers counters while holding
the mutex. However we don't need to increment the log batch there,
because we are holding the mutex and incremented the log writers counter
as well, so any other task trying to sync log will wait for the current
task to finish its logging and still achieve the desired log batching.

Since the log batch counter is an atomic counter and is incremented twice
at the very beginning of the fsync callback (btrfs_sync_file()), once
before flushing delalloc and once again after waiting for writeback to
complete, eliminating its increment when joining the log transaction
may provide some performance gains in case we have multiple concurrent
tasks doing fsyncs against different files in the same subvolume, as it
reduces contention on the atomic (locking the cacheline and bouncing it).

When testing fio with 32 jobs, on a 8 cores VM, doing fsyncs against
different files of the same subvolume, on top of a zram device, I could
consistently see gains (higher throughput) between 1% to 2%, which is a
very low value and possibly hard to be observed with a real device (I
couldn't observe consistent gains with my low/mid end NVMe device).
So this change is mostly motivated to just simplify the logic, as updating
the log batch counter is only relevant when an fsync starts and while not
holding the root's log mutex.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:12 +01:00
Filipe Manana
f2f121ab50 btrfs: skip unnecessary searches for xattrs when logging an inode
Every time we log an inode we lookup in the fs/subvol tree for xattrs and
if we have any, log them into the log tree. However it is very common to
have inodes without any xattrs, so doing the search wastes times, but more
importantly it adds contention on the fs/subvol tree locks, either making
the logging code block and wait for tree locks or making the logging code
making other concurrent operations block and wait.

The most typical use cases where xattrs are used are when capabilities or
ACLs are defined for an inode, or when SELinux is enabled.

This change makes the logging code detect when an inode does not have
xattrs and skip the xattrs search the next time the inode is logged,
unless the inode is evicted and loaded again or a xattr is added to the
inode. Therefore skipping the search for xattrs on inodes that don't ever
have xattrs and are fsynced with some frequency.

The following script that calls dbench was used to measure the impact of
this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device
directly (no intermediary filesystem on the host) and using a non-debug
kernel (default configuration on Debian distributions):

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdk
  MNT=/mnt/sdk
  MOUNT_OPTIONS="-o ssd"

  mkfs.btrfs -f -m single -d single $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 200 40

  umount $MNT

The results before this change:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    5761605     0.172   312.057
 Close        4232452     0.002    10.927
 Rename        243937     1.406   277.344
 Unlink       1163456     0.631   298.402
 Deltree          160    11.581   221.107
 Mkdir             80     0.003     0.005
 Qpathinfo    5221410     0.065   122.309
 Qfileinfo     915432     0.001     3.333
 Qfsinfo       957555     0.003     3.992
 Sfileinfo     469244     0.023    20.494
 Find         2018865     0.448   123.659
 WriteX       2874851     0.049   118.529
 ReadX        9030579     0.004    21.654
 LockX          18754     0.003     4.423
 UnlockX        18754     0.002     0.331
 Flush         403792    10.944   359.494

Throughput 908.444 MB/sec  40 clients  40 procs  max_latency=359.500 ms

The results after this change:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    6442521     0.159   230.693
 Close        4732357     0.002    10.972
 Rename        272809     1.293   227.398
 Unlink       1301059     0.563   218.500
 Deltree          160     7.796    54.887
 Mkdir             80     0.008     0.478
 Qpathinfo    5839452     0.047   124.330
 Qfileinfo    1023199     0.001     4.996
 Qfsinfo      1070760     0.003     5.709
 Sfileinfo     524790     0.033    21.765
 Find         2257658     0.314   125.611
 WriteX       3211520     0.040   232.135
 ReadX        10098969     0.004    25.340
 LockX          20974     0.003     1.569
 UnlockX        20974     0.002     3.475
 Flush         451553    10.287   331.037

Throughput 1011.77 MB/sec  40 clients  40 procs  max_latency=331.045 ms

+10.8% throughput, -8.2% max latency

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:12 +01:00
Nikolay Borisov
1cab5e7283 btrfs: merge __set_extent_bit and set_extent_bit
There are only 2 direct calls to set_extent_bit outside of extent-io -
in btrfs_find_new_delalloc_bytes and btrfs_truncate_block, the rest are
thin wrappers around __set_extent_bit. This adds unnecessary indirection
and just makes it more annoying when looking at the various extent bit
manipulation functions.  This patch renames __set_extent_bit to
set_extent_bit effectively removing a level of indirection. No
functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat and remove __must_check ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:12 +01:00
Nikolay Borisov
729f796172 btrfs: make btrfs_update_inode_fallback take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:12 +01:00
Nikolay Borisov
b06359a325 btrfs: make btrfs_cont_expand take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:12 +01:00
Nikolay Borisov
217f42eb3d btrfs: make btrfs_truncate_block take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:11 +01:00
Nikolay Borisov
03fcb1ab6f btrfs: make btrfs_insert_replace_extent take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:11 +01:00
Nikolay Borisov
dea46d84a3 btrfs: make find_first_non_hole take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:11 +01:00
Nikolay Borisov
a4ba6cc03e btrfs: make maybe_insert_hole take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:11 +01:00
Nikolay Borisov
9a56fcd15a btrfs: make btrfs_update_inode take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:11 +01:00
Nikolay Borisov
dfeb9e7cc3 btrfs: make btrfs_update_inode_item take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:11 +01:00
Nikolay Borisov
f3fbcaef59 btrfs: make btrfs_delayed_update_inode take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:10 +01:00
Nikolay Borisov
72e7e6edd3 btrfs: make btrfs_finish_ordered_io btrfs_inode-centric
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:10 +01:00
Nikolay Borisov
507433985c btrfs: make btrfs_truncate_inode_items take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:10 +01:00
Nikolay Borisov
90dffd0cff btrfs: make insert_prealloc_file_extent take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:10 +01:00
Nikolay Borisov
76aea53796 btrfs: make btrfs_inode_safe_disk_i_size_write take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:10 +01:00
Josef Bacik
a55463c9f0 btrfs: remove extent_buffer::recursed
It is unused everywhere now, it can be removed.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:10 +01:00
Josef Bacik
0ecae6fffe btrfs: remove the recurse parameter from __btrfs_tree_read_lock
It is completely unused now, remove it.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:09 +01:00
Josef Bacik
fe596ca3d3 btrfs: use btrfs_tree_read_lock in btrfs_search_slot
We no longer use recursion, so
__btrfs_tree_read_lock(BTRFS_NESTING_NORMAL) == btrfs_tree_read_lock.
Replace this call with the simple helper.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:09 +01:00
Josef Bacik
1bb9659841 btrfs: merge back btrfs_read_lock_root_node helpers
We no longer have recursive locking and there's no need for separate
helpers that allowed the transition to rwsem with minimal code changes.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:09 +01:00
Josef Bacik
4048daedb9 btrfs: locking: remove the recursion handling code
Now that we're no longer using recursion, rip out all of the supporting
code.  Follow up patches will clean up the callers of these functions.

The extent_buffer::lock_owner is still retained as it allows safety
checks in btrfs_init_new_buffer for the case that the free space cache
is corrupted and we try to allocate a block that we are currently using
and have locked in the path.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:09 +01:00
Josef Bacik
2f5239dcb2 btrfs: remove btrfs_path::recurse
With my async free space cache loading patches ("btrfs: load free space
cache asynchronously") we no longer have a user of path->recurse and can
remove it.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:09 +01:00
Josef Bacik
0e46318df8 btrfs: unlock to current level in btrfs_next_old_leaf
Filipe reported the following lockdep splat

  ======================================================
  WARNING: possible circular locking dependency detected
  5.10.0-rc2-btrfs-next-71 #1 Not tainted
  ------------------------------------------------------
  find/324157 is trying to acquire lock:
  ffff8ebc48d293a0 (btrfs-tree-01#2/3){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]

  but task is already holding lock:
  ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (btrfs-tree-00){++++}-{3:3}:
	 lock_acquire+0xd8/0x490
	 down_write_nested+0x44/0x120
	 __btrfs_tree_lock+0x27/0x120 [btrfs]
	 btrfs_search_slot+0x2a3/0xc50 [btrfs]
	 btrfs_insert_empty_items+0x58/0xa0 [btrfs]
	 insert_with_overflow+0x44/0x110 [btrfs]
	 btrfs_insert_xattr_item+0xb8/0x1d0 [btrfs]
	 btrfs_setxattr+0xd6/0x4c0 [btrfs]
	 btrfs_setxattr_trans+0x68/0x100 [btrfs]
	 __vfs_setxattr+0x66/0x80
	 __vfs_setxattr_noperm+0x70/0x200
	 vfs_setxattr+0x6b/0x120
	 setxattr+0x125/0x240
	 path_setxattr+0xba/0xd0
	 __x64_sys_setxattr+0x27/0x30
	 do_syscall_64+0x33/0x80
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (btrfs-tree-01#2/3){++++}-{3:3}:
	 check_prev_add+0x91/0xc60
	 __lock_acquire+0x1689/0x3130
	 lock_acquire+0xd8/0x490
	 down_read_nested+0x45/0x220
	 __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
	 btrfs_next_old_leaf+0x27d/0x580 [btrfs]
	 btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
	 iterate_dir+0x170/0x1c0
	 __x64_sys_getdents64+0x83/0x140
	 do_syscall_64+0x33/0x80
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-tree-00);
				 lock(btrfs-tree-01#2/3);
				 lock(btrfs-tree-00);
    lock(btrfs-tree-01#2/3);

   *** DEADLOCK ***

  5 locks held by find/324157:
   #0: ffff8ebc502c6e00 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
   #1: ffff8eb97f689980 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: iterate_dir+0x52/0x1c0
   #2: ffff8ebaec00ca58 (btrfs-tree-02#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
   #3: ffff8eb98f986f78 (btrfs-tree-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
   #4: ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]

  stack backtrace:
  CPU: 2 PID: 324157 Comm: find Not tainted 5.10.0-rc2-btrfs-next-71 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   check_noncircular+0xff/0x110
   ? mark_lock.part.0+0x468/0xe90
   check_prev_add+0x91/0xc60
   __lock_acquire+0x1689/0x3130
   ? kvm_clock_read+0x14/0x30
   ? kvm_sched_clock_read+0x5/0x10
   lock_acquire+0xd8/0x490
   ? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
   down_read_nested+0x45/0x220
   ? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
   __btrfs_tree_read_lock+0x32/0x1a0 [btrfs]
   btrfs_next_old_leaf+0x27d/0x580 [btrfs]
   btrfs_real_readdir+0x1e3/0x4b0 [btrfs]
   iterate_dir+0x170/0x1c0
   __x64_sys_getdents64+0x83/0x140
   ? filldir+0x1d0/0x1d0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happens because btrfs_next_old_leaf searches down to our current
key, and then walks up the path until we can move to the next slot, and
then reads back down the path so we get the next leaf.

However it doesn't unlock any lower levels until it replaces them with
the new extent buffer.  This is technically fine, but of course causes
lockdep to complain, because we could be holding locks on lower levels
while locking upper levels.

Fix this by dropping all nodes below the level that we use as our new
starting point before we start reading back down the path.  This also
allows us to drop the nested/recursive locking magic, because we're no
longer locking two nodes at the same level anymore.

Reported-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:09 +01:00
Josef Bacik
ffeb03cfe2 btrfs: cleanup the locking in btrfs_next_old_leaf
We are carrying around this next_rw_lock from when we would do spinning
vs blocking read locks.  Now that we have the rwsem locking we can
simply use the read lock flag unconditionally and the read lock helpers.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:09 +01:00
Anand Jain
b2598edf8b btrfs: remove unused argument seed from btrfs_find_device
Commit 343694eee8d8 ("btrfs: switch seed device to list api"), missed to
check if the parameter seed is true in the function btrfs_find_device().
This tells it whether to traverse the seed device list or not.

After this commit, the argument is unused and can be removed.

In device_list_add() it's not necessary because fs_devices always points
to the device's fs_devices. So with the devid+uuid matching, it will
find the right device and return, thus not needing to traverse seed
devices.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:08 +01:00
Anand Jain
3a160a9331 btrfs: drop never met disk total bytes check in verify_one_dev_extent
Drop the condition in verify_one_dev_extent,
btrfs_device::disk_total_bytes is set even for a seed device. The
comment is wrong, the size is properly set when cloning the device.

Commit 1b3922a8bc ("btrfs: Use real device structure to verify
dev extent") introduced it but it's unclear why the total_disk_bytes
was 0.

Theoretically, all devices (including missing and seed) marked with the
BTRFS_DEV_STATE_IN_FS_METADATA flag gets the total_disk_bytes updated at
fill_device_from_item():

  open_ctree()
    btrfs_read_chunk_tree()
      read_one_dev()
        open_seed_device()
        fill_device_from_item()

Even if verify_one_dev_extent() reports total_disk_bytes == 0, then its
a bug to be fixed somewhere else and not in verify_one_dev_extent() as
it's just a messenger. It is never expected that a total_disk_bytes
shall be zero.

The function fill_device_from_item() does the job of reading it from the
item and updating btrfs_device::disk_total_bytes. So both the missing
device and the seed devices do have their disk_total_bytes updated.
btrfs_find_device can also return a device from fs_info->seed_list
because it searches it as well.

Furthermore, while removing the device if there is a power loss, we
could have a device with its total_bytes = 0, that's still valid.

Instead, introduce a check against maximum block device size in
read_one_dev().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:08 +01:00
Anand Jain
bacce86ae8 btrfs: drop unused argument step from btrfs_free_extra_devids
Commit cf89af146b ("btrfs: dev-replace: fail mount if we don't have
replace item with target device") dropped the multi stage operation of
btrfs_free_extra_devids() that does not need to check replace target
anymore and we can remove the 'step' argument.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:08 +01:00
Filipe Manana
2766ff6176 btrfs: update the number of bytes used by an inode atomically
There are several occasions where we do not update the inode's number of
used bytes atomically, resulting in a concurrent stat(2) syscall to report
a value of used blocks that does not correspond to a valid value, that is,
a value that does not match neither what we had before the operation nor
what we get after the operation completes.

In extreme cases it can result in stat(2) reporting zero used blocks, which
can cause problems for some userspace tools where they can consider a file
with a non-zero size and zero used blocks as completely sparse and skip
reading data, as reported/discussed a long time ago in some threads like
the following:

  https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html

The cases where this can happen are the following:

-> Case 1

If we do a write (buffered or direct IO) against a file region for which
there is already an allocated extent (or multiple extents), then we have a
short time window where we can report a number of used blocks to stat(2)
that does not take into account the file region being overwritten. This
short time window happens when completing the ordered extent(s).

This happens because when we drop the extents in the write range we
decrement the inode's number of bytes and later on when we insert the new
extent(s) we increment the number of bytes in the inode, resulting in a
short time window where a stat(2) syscall can get an incorrect number of
used blocks.

If we do writes that overwrite an entire file, then we have a short time
window where we report 0 used blocks to stat(2).

Example reproducer:

  $ cat reproducer-1.sh
  #!/bin/bash

  MNT=/mnt/sdi
  DEV=/dev/sdi

  stat_loop()
  {
      trap "wait; exit" SIGTERM
      local filepath=$1
      local expected=$2
      local got

      while :; do
          got=$(stat -c %b $filepath)
          if [ $got -ne $expected ]; then
             echo -n "ERROR: unexpected used blocks"
             echo " (got: $got expected: $expected)"
          fi
      done
  }

  mkfs.btrfs -f $DEV > /dev/null
  # mkfs.xfs -f $DEV > /dev/null
  # mkfs.ext4 -F $DEV > /dev/null
  # mkfs.f2fs -f $DEV > /dev/null
  # mkfs.reiserfs -f $DEV > /dev/null
  mount $DEV $MNT

  xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
  expected=$(stat -c %b $MNT/foobar)

  # Create a process to keep calling stat(2) on the file and see if the
  # reported number of blocks used (disk space used) changes, it should
  # not because we are not increasing the file size nor punching holes.
  stat_loop $MNT/foobar $expected &
  loop_pid=$!

  for ((i = 0; i < 50000; i++)); do
      xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
  done

  kill $loop_pid &> /dev/null
  wait

  umount $DEV

  $ ./reproducer-1.sh
  ERROR: unexpected used blocks (got: 0 expected: 128)
  ERROR: unexpected used blocks (got: 0 expected: 128)
  (...)

Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.

-> Case 2

If we do a buffered write against a file region that does not have any
allocated extents, like a hole or beyond EOF, then during ordered extent
completion we have a short time window where a concurrent stat(2) syscall
can report a number of used blocks that does not correspond to the value
before or after the write operation, a value that is actually larger than
the value after the write completes.

This happens because once we start a buffered write into an unallocated
file range we increment the inode's 'new_delalloc_bytes', to make sure
any stat(2) call gets a correct used blocks value before delalloc is
flushed and completes. However at ordered extent completion, after we
inserted the new extent, we increment the inode's number of bytes used
with the size of the new extent, and only later, when clearing the range
in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
counter with the size of the extent. So this results in a short time
window where a concurrent stat(2) syscall can report a number of used
blocks that accounts for the new extent twice.

Example reproducer:

  $ cat reproducer-2.sh
  #!/bin/bash

  MNT=/mnt/sdi
  DEV=/dev/sdi

  stat_loop()
  {
      trap "wait; exit" SIGTERM
      local filepath=$1
      local expected=$2
      local got

      while :; do
          got=$(stat -c %b $filepath)
          if [ $got -ne $expected ]; then
              echo -n "ERROR: unexpected used blocks"
              echo " (got: $got expected: $expected)"
          fi
      done
  }

  mkfs.btrfs -f $DEV > /dev/null
  # mkfs.xfs -f $DEV > /dev/null
  # mkfs.ext4 -F $DEV > /dev/null
  # mkfs.f2fs -f $DEV > /dev/null
  # mkfs.reiserfs -f $DEV > /dev/null
  mount $DEV $MNT

  touch $MNT/foobar
  write_size=$((64 * 1024))
  for ((i = 0; i < 16384; i++)); do
     offset=$(($i * $write_size))
     xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
     blocks_used=$(stat -c %b $MNT/foobar)

     # Fsync the file to trigger writeback and keep calling stat(2) on it
     # to see if the number of blocks used changes.
     stat_loop $MNT/foobar $blocks_used &
     loop_pid=$!
     xfs_io -c "fsync" $MNT/foobar

     kill $loop_pid &> /dev/null
     wait $loop_pid
  done

  umount $DEV

  $ ./reproducer-2.sh
  ERROR: unexpected used blocks (got: 265472 expected: 265344)
  ERROR: unexpected used blocks (got: 284032 expected: 283904)
  (...)

Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.

-> Case 3

Another case where such problems happen is during other operations that
replace extents in a file range with other extents. Those operations are
extent cloning, deduplication and fallocate's zero range operation.

The cause of the problem is similar to the first case. When we drop the
extents from a range, we decrement the inode's number of bytes, and later
on, after inserting the new extents we increment it. Since this is not
done atomically, a concurrent stat(2) call can see and return a number of
used blocks that is smaller than it should be, does not match the number
of used blocks before or after the clone/deduplication/zero operation.

Like for the first case, when doing a clone, deduplication or zero range
operation against an entire file, we end up having a time window where we
can report 0 used blocks to a stat(2) call.

Example reproducer:

  $ cat reproducer-3.sh
  #!/bin/bash

  MNT=/mnt/sdi
  DEV=/dev/sdi

  mkfs.btrfs -f $DEV > /dev/null
  # mkfs.xfs -f -m reflink=1 $DEV > /dev/null
  mount $DEV $MNT

  extent_size=$((64 * 1024))
  num_extents=16384
  file_size=$(($extent_size * $num_extents))

  # File foo has many small extents.
  xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
      > /dev/null
  # File bar has much less extents and has exactly the same data as foo.
  xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null

  expected=$(stat -c %b $MNT/foo)

  # Now deduplicate bar into foo. While the deduplication is in progres,
  # the number of used blocks/file size reported by stat should not change
  xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null  &
  dedupe_pid=$!
  while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
      used=$(stat -c %b $MNT/foo)
      if [ $used -ne $expected ]; then
          echo "Unexpected blocks used: $used (expected: $expected)"
      fi
  done

  umount $DEV

  $ ./reproducer-3.sh
  Unexpected blocks used: 2076800 (expected: 2097152)
  Unexpected blocks used: 2097024 (expected: 2097152)
  Unexpected blocks used: 2079872 (expected: 2097152)
  (...)

Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.

So fix this by:

1) Making btrfs_drop_extents() not decrement the VFS inode's number of
   bytes, and instead return the number of bytes;

2) Making any code that drops extents and adds new extents update the
   inode's number of bytes atomically, while holding the btrfs inode's
   spinlock, which is also used by the stat(2) callback to get the inode's
   number of bytes;

3) For ranges in the inode's iotree that are marked as 'delalloc new',
   corresponding to previously unallocated ranges, increment the inode's
   number of bytes when clearing the 'delalloc new' bit from the range,
   in the same critical section that decrements the inode's
   'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.

An alternative would be to have btrfs_getattr() wait for any IO (ordered
extents in progress) and locking the whole range (0 to (u64)-1) while it
it computes the number of blocks used. But that would mean blocking
stat(2), which is a very used syscall and expected to be fast, waiting
for writes, clone/dedupe, fallocate, page reads, fiemap, etc.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:08 +01:00
Filipe Manana
7f458a3873 btrfs: fix race when defragmenting leads to unnecessary IO
When defragmenting we skip ranges that have holes or inline extents, so that
we don't do unnecessary IO and waste space. We do this check when calling
should_defrag_range() at btrfs_defrag_file(). However we do it without
holding the inode's lock. The reason we do it like this is to avoid
blocking other tasks for too long, that possibly want to operate on other
file ranges, since after the call to should_defrag_range() and before
locking the inode, we trigger a synchronous page cache readahead. However
before we were able to lock the inode, some other task might have punched
a hole in our range, or we may now have an inline extent there, in which
case we should not set the range for defrag anymore since that would cause
unnecessary IO and make us waste space (i.e. allocating extents to contain
zeros for a hole).

So after we locked the inode and the range in the iotree, check again if
we have holes or an inline extent, and if we do, just skip the range.

I hit this while testing my next patch that fixes races when updating an
inode's number of bytes (subject "btrfs: update the number of bytes used
by an inode atomically"), and it depends on this change in order to work
correctly. Alternatively I could rework that other patch to detect holes
and flag their range with the 'new delalloc' bit, but this itself fixes
an efficiency problem due a race that from a functional point of view is
not harmful (it could be triggered with btrfs/062 from fstests).

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:08 +01:00
Filipe Manana
5893dfb98f btrfs: refactor btrfs_drop_extents() to make it easier to extend
There are many arguments for __btrfs_drop_extents() and its wrapper
btrfs_drop_extents(), which makes it hard to add more arguments to it and
requires changing every caller. I have added a couple myself back in 2014
commit 1acae57b16 ("Btrfs: faster file extent item replace operations")
and therefore know firsthand that it is a bit cumbersome to add additional
arguments to these functions.

Since I will need to add more arguments in a subsequent bug fix, this
change is preparatory work and adds a data structure that holds all the
arguments, for both input and output, that are passed to this function,
with some comments in the structure's definition mentioning what each
field is and how it relates to other fields.

Callers of this function need only to zero out the content of the
structure and setup only the fields they need. This also removes the
need to have both __btrfs_drop_extents() and btrfs_drop_extents(), so
now we have a single function named btrfs_drop_extents() that takes a
pointer to this new data structure (struct btrfs_drop_extents_args).

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:08 +01:00
Josef Bacik
e114c545bb btrfs: set the lockdep class for extent buffers on creation
Both Filipe and Fedora QA recently hit the following lockdep splat:

  WARNING: possible recursive locking detected
  5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1 Not tainted
  --------------------------------------------
  rsync/2610 is trying to acquire lock:
  ffff89617ed48f20 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140

  but task is already holding lock:
  ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140

  other info that might help us debug this:
   Possible unsafe locking scenario:
	 CPU0
	 ----
    lock(&eb->lock);
    lock(&eb->lock);

   *** DEADLOCK ***
   May be due to missing lock nesting notation
  2 locks held by rsync/2610:
   #0: ffff896107212b90 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: walk_component+0x10c/0x190
   #1: ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140

  stack backtrace:
  CPU: 1 PID: 2610 Comm: rsync Not tainted 5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
  Call Trace:
   dump_stack+0x8b/0xb0
   __lock_acquire.cold+0x12d/0x2a4
   ? kvm_sched_clock_read+0x14/0x30
   ? sched_clock+0x5/0x10
   lock_acquire+0xc8/0x400
   ? btrfs_tree_read_lock_atomic+0x34/0x140
   ? read_block_for_search.isra.0+0xdd/0x320
   _raw_read_lock+0x3d/0xa0
   ? btrfs_tree_read_lock_atomic+0x34/0x140
   btrfs_tree_read_lock_atomic+0x34/0x140
   btrfs_search_slot+0x616/0x9a0
   btrfs_lookup_dir_item+0x6c/0xb0
   btrfs_lookup_dentry+0xa8/0x520
   ? lockdep_init_map_waits+0x4c/0x210
   btrfs_lookup+0xe/0x30
   __lookup_slow+0x10f/0x1e0
   walk_component+0x11b/0x190
   path_lookupat+0x72/0x1c0
   filename_lookup+0x97/0x180
   ? strncpy_from_user+0x96/0x1e0
   ? getname_flags.part.0+0x45/0x1a0
   vfs_statx+0x64/0x100
   ? lockdep_hardirqs_on_prepare+0xff/0x180
   ? _raw_spin_unlock_irqrestore+0x41/0x50
   __do_sys_newlstat+0x26/0x40
   ? lockdep_hardirqs_on_prepare+0xff/0x180
   ? syscall_enter_from_user_mode+0x27/0x80
   ? syscall_enter_from_user_mode+0x27/0x80
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

I have also seen a report of lockdep complaining about the lock class
that was looked up being the same as the lock class on the lock we were
using, but I can't find the report.

These are problems that occur because we do not have the lockdep class
set on the extent buffer until _after_ we read the eb in properly.  This
is problematic for concurrent readers, because we will create the extent
buffer, lock it, and then attempt to read the extent buffer.

If a second thread comes in and tries to do a search down the same path
they'll get the above lockdep splat because the class isn't set properly
on the extent buffer.

There was a good reason for this, we generally didn't know the real
owner of the eb until we read it, specifically in refcounted roots.

However now all refcounted roots have the same class name, so we no
longer need to worry about this.  For non-refcounted trees we know
which root we're on based on the parent.

Fix this by setting the lockdep class on the eb at creation time instead
of read time.  This will fix the splat and the weirdness where the class
changes in the middle of locking the block.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:07 +01:00
Josef Bacik
3fbaf25817 btrfs: pass the owner_root and level to alloc_extent_buffer
Now that we've plumbed all of the callers to have the owner root and the
level, plumb it down into alloc_extent_buffer().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:07 +01:00
Josef Bacik
5d81230baa btrfs: pass the root owner and level around for readahead
The readahead infrastructure does raw reads of extent buffers, but we're
going to need to know their owner and level in order to set the lockdep
key properly, so plumb in the infrastructure that we'll need to have
this information when we start allocating extent buffers.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:07 +01:00
Josef Bacik
1b7ec85ef4 btrfs: pass root owner to read_tree_block
In order to properly set the lockdep class of a newly allocated block we
need to know the owner of the block.  For non-refcounted trees this is
straightforward, we always know in advance what tree we're reading from.
For refcounted trees we don't necessarily know, however all refcounted
trees share the same lockdep class name, tree-<level>.

Fix all the callers of read_tree_block() to pass in the root objectid
we're using.  In places like relocation and backref we could probably
unconditionally use 0, but just in case use the root when we have it,
otherwise use 0 in the cases we don't have the root as it's going to be
a refcounted tree anyway.

This is a preparation patch for further changes.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:07 +01:00
Josef Bacik
182c79fcb8 btrfs: use btrfs_read_node_slot in btrfs_qgroup_trace_subtree
We're open-coding btrfs_read_node_slot() here, replace with the helper.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:07 +01:00
Josef Bacik
3acfbd6a99 btrfs: use btrfs_read_node_slot in qgroup_trace_new_subtree_blocks
We're open-coding btrfs_read_node_slot() here, replace with the helper.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:07 +01:00
Josef Bacik
6b2cb7cb95 btrfs: use btrfs_read_node_slot in qgroup_trace_extent_swap
We're open-coding btrfs_read_node_slot() here, replace with the helper.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:06 +01:00
Josef Bacik
c990ada2a0 btrfs: use btrfs_read_node_slot in walk_down_tree
We're open-coding btrfs_read_node_slot() here, replace with the helper.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:06 +01:00
Josef Bacik
6b3426be27 btrfs: use btrfs_read_node_slot in replace_path
We're open-coding btrfs_read_node_slot() here, replace with the helper.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:06 +01:00
Josef Bacik
c975253682 btrfs: use btrfs_read_node_slot in do_relocation
We're open coding btrfs_read_node_slot in do_relocation, replace this
with the proper helper.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:06 +01:00
Josef Bacik
8ef385bbf0 btrfs: use btrfs_read_node_slot in walk_down_reloc_tree
We do not need to call read_tree_block() here, simply use the
btrfs_read_node_slot helper.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:06 +01:00
Josef Bacik
206983b72a btrfs: use btrfs_read_node_slot in btrfs_realloc_node
We have this open-coded nightmare in btrfs_realloc_node that does
the same thing that the normal read path does, which is to see if we
have the eb in memory already, and if not read it, and verify the eb is
uptodate.  Delete this open coding and simply use btrfs_read_node_slot.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:06 +01:00
Josef Bacik
bfb484d922 btrfs: cleanup extent buffer readahead
We're going to pass around more information when we allocate extent
buffers, in order to make that cleaner how we do readahead.  Most of the
callers have the parent node that we're getting our blockptr from, with
the sole exception of relocation which simply has the bytenr it wants to
read.

Add a helper that takes the current arguments that we need (bytenr and
gen), and add another helper for simply reading the slot out of a node.
In followup patches the helper that takes all the extra arguments will
be expanded, and the simpler helper won't need to have it's arguments
adjusted.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:05 +01:00
Josef Bacik
416e3445ef btrfs: remove lockdep classes for the fs tree
We have this weird problem where our lockdep class is set after we
read a tree block, which can race with concurrent readers and result in
erroneous lockdep errors.  We want to set the lockdep class at
allocation time if possible, but in certain cases we may not have the
actual root owner, such as with relocation or any backref lookups.  This
is only really a problem for reference counted trees, because all other
trees have their root reference set in their extent reference.  Remove
the fs tree specific lock class.  We need to still keep the reloc tree
one, it's still reference counted, because replace_path will lock the
reloc tree and the destination tree, and if they're both set to
tree-<level> we'll have issues.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:05 +01:00
Pavel Begunkov
3e48d8d254 btrfs: discard: reschedule work after sysfs param update
After sysfs updates discard's iops_limit or kbps_limit it also needs to
adjust current timer through rescheduling, otherwise the discard work
may wait for a long time for the previous timer to expire or bumped by
someone else.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:05 +01:00
Pavel Begunkov
df903e5d29 btrfs: don't miss async discards after scheduled work override
If btrfs_discard_schedule_work() is called with override=true, it sets
delay anew regardless how much time is left until the timer should have
fired. If delays are long (that can happen, for example, with low
kbps_limit), they might get constantly overridden without having a
chance to run the discard work.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:05 +01:00
Pavel Begunkov
6e88f116bd btrfs: discard: store async discard delay as ns not as jiffies
Most delay calculations are done in ns or ms, so store
discard_ctl->delay in ms and convert the final delay to jiffies only at
the end.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:05 +01:00
Pavel Begunkov
e50404a8a6 btrfs: discard: speed up async discard up to iops_limit
Instead of using iops_limit only for cutting off extremes, calculate the
discard delay directly from it, so it closely follows iops_limit and
doesn't under-discard even though quotas are not saturated.

The iops limit could be hit more often in some cases and could increase
the discard rate.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:05 +01:00
Qu Wenruo
480a8ec83b btrfs: scrub: refactor scrub_find_csum()
Function scrub_find_csum() is to locate the csum for bytenr @logical
from sctx->csum_list.

However it lacks a lot of comments to explain things like how the
csum_list is organized and why we need to drop csum range which is
before us.

Refactor the function by:

- Add more comments explaining the behavior
- Add comment explaining why we need to drop the csum range
- Put the csum copy in the main loop
  This is mostly for the incoming patches to make scrub_find_csum() able
  to find multiple checksums.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:05 +01:00
Qu Wenruo
96e63a45fb btrfs: scrub: remove the force parameter from scrub_pages
The @force parameter for scrub_pages() is to indicate whether we want to
force bio submission.  Currently it's only used for the super block,
and it can be easily determined by the @flags, so we can remove the
parameter.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:04 +01:00
Qu Wenruo
261d2dcb24 btrfs: scrub: distinguish scrub page from regular page
There are several call sites where we declare something like
"struct scrub_page *page".

This is confusing as we also use regular page in this code,
rename it to 'spage' where applicable.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:04 +01:00
Qu Wenruo
ac303b6987 btrfs: pass bvec to csum_dirty_buffer instead of page
Currently csum_dirty_buffer() uses page to grab extent buffer, but that
only works for sector size == PAGE_SIZE case.

For subpage we need page + page_offset to grab extent buffer.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:04 +01:00
Qu Wenruo
77bf40a2ba btrfs: extract extent buffer verification from btrfs_validate_metadata_buffer()
Currently btrfs_validate_metadata_buffer() only needs to handle one
extent buffer as currently one page maps to at most one extent buffer.

For incoming subpage support, we need to extend the support where one
page could contain multiple extent buffers.

Split the function so we can call validate_extent_buffer on extent
buffers independently.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:04 +01:00
Qu Wenruo
a26663e7a2 btrfs: make csum_tree_block() handle node smaller than page
For subpage size support, metadata blocks of nodesize are smaller than
one page and this needs to be handled when calculating the checksum.

The checksummed start and length need to be adjusted but only for the
first page:

- start is simply offset in the page

- length is nodesize (subpage) or PAGE_SIZE for all other cases

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:04 +01:00
Qu Wenruo
2f4d60dfae btrfs: grab fs_info from extent_buffer in btrfs_mark_buffer_dirty
Since commit f28491e0a6 ("Btrfs: move the extent buffer radix tree into
the fs_info"), fs_info can be grabbed from extent_buffer directly.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:04 +01:00
Qu Wenruo
478ef8868f btrfs: make buffer_radix take sector size units
For subpage sector size support, one page can contain multiple tree
blocks. The entries cannot be based on page size and index must be
derived from the sectorsize. No change for page size == sector size.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:03 +01:00
Qu Wenruo
0d01e247a0 btrfs: assert page mapping lock in attach_extent_buffer_page
When calling attach_extent_buffer_page(), either we're attaching
anonymous pages, called from btrfs_clone_extent_buffer(),
or we're attaching btree inode pages, called from alloc_extent_buffer().

For the latter case, we should hold page->mapping->private_lock to avoid
parallel changes to page->private.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:03 +01:00
Josef Bacik
bbb86a3717 btrfs: protect fs_info->caching_block_groups by block_group_cache_lock
I got the following lockdep splat

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0+ #101 Not tainted
  ------------------------------------------------------
  btrfs-cleaner/3445 is trying to acquire lock:
  ffff89dbec39ab48 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170

  but task is already holding lock:
  ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (&fs_info->commit_root_sem){++++}-{3:3}:
	 down_write+0x3d/0x70
	 btrfs_cache_block_group+0x2d5/0x510
	 find_free_extent+0xb6e/0x12f0
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb1/0x330
	 alloc_tree_block_no_bg_flush+0x4f/0x60
	 __btrfs_cow_block+0x11d/0x580
	 btrfs_cow_block+0x10c/0x220
	 commit_cowonly_roots+0x47/0x2e0
	 btrfs_commit_transaction+0x595/0xbd0
	 sync_filesystem+0x74/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0x14/0x30
	 btrfs_kill_super+0x12/0x20
	 deactivate_locked_super+0x36/0xa0
	 cleanup_mnt+0x12d/0x190
	 task_work_run+0x5c/0xa0
	 exit_to_user_mode_prepare+0x1df/0x200
	 syscall_exit_to_user_mode+0x54/0x280
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (&space_info->groups_sem){++++}-{3:3}:
	 down_read+0x40/0x130
	 find_free_extent+0x2ed/0x12f0
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb1/0x330
	 alloc_tree_block_no_bg_flush+0x4f/0x60
	 __btrfs_cow_block+0x11d/0x580
	 btrfs_cow_block+0x10c/0x220
	 commit_cowonly_roots+0x47/0x2e0
	 btrfs_commit_transaction+0x595/0xbd0
	 sync_filesystem+0x74/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0x14/0x30
	 btrfs_kill_super+0x12/0x20
	 deactivate_locked_super+0x36/0xa0
	 cleanup_mnt+0x12d/0x190
	 task_work_run+0x5c/0xa0
	 exit_to_user_mode_prepare+0x1df/0x200
	 syscall_exit_to_user_mode+0x54/0x280
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (btrfs-root-00){++++}-{3:3}:
	 __lock_acquire+0x1167/0x2150
	 lock_acquire+0xb9/0x3d0
	 down_read_nested+0x43/0x130
	 __btrfs_tree_read_lock+0x32/0x170
	 __btrfs_read_lock_root_node+0x3a/0x50
	 btrfs_search_slot+0x614/0x9d0
	 btrfs_find_root+0x35/0x1b0
	 btrfs_read_tree_root+0x61/0x120
	 btrfs_get_root_ref+0x14b/0x600
	 find_parent_nodes+0x3e6/0x1b30
	 btrfs_find_all_roots_safe+0xb4/0x130
	 btrfs_find_all_roots+0x60/0x80
	 btrfs_qgroup_trace_extent_post+0x27/0x40
	 btrfs_add_delayed_data_ref+0x3fd/0x460
	 btrfs_free_extent+0x42/0x100
	 __btrfs_mod_ref+0x1d7/0x2f0
	 walk_up_proc+0x11c/0x400
	 walk_up_tree+0xf0/0x180
	 btrfs_drop_snapshot+0x1c7/0x780
	 btrfs_clean_one_deleted_snapshot+0xfb/0x110
	 cleaner_kthread+0xd4/0x140
	 kthread+0x13a/0x150
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

  Chain exists of:
    btrfs-root-00 --> &space_info->groups_sem --> &fs_info->commit_root_sem

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(&fs_info->commit_root_sem);
				 lock(&space_info->groups_sem);
				 lock(&fs_info->commit_root_sem);
    lock(btrfs-root-00);

   *** DEADLOCK ***

  3 locks held by btrfs-cleaner/3445:
   #0: ffff89dbeaf28838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
   #1: ffff89dbeb6c7640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
   #2: ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

  stack backtrace:
  CPU: 0 PID: 3445 Comm: btrfs-cleaner Not tainted 5.9.0+ #101
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x8b/0xb0
   check_noncircular+0xcf/0xf0
   __lock_acquire+0x1167/0x2150
   ? __bfs+0x42/0x210
   lock_acquire+0xb9/0x3d0
   ? __btrfs_tree_read_lock+0x32/0x170
   down_read_nested+0x43/0x130
   ? __btrfs_tree_read_lock+0x32/0x170
   __btrfs_tree_read_lock+0x32/0x170
   __btrfs_read_lock_root_node+0x3a/0x50
   btrfs_search_slot+0x614/0x9d0
   ? find_held_lock+0x2b/0x80
   btrfs_find_root+0x35/0x1b0
   ? do_raw_spin_unlock+0x4b/0xa0
   btrfs_read_tree_root+0x61/0x120
   btrfs_get_root_ref+0x14b/0x600
   find_parent_nodes+0x3e6/0x1b30
   btrfs_find_all_roots_safe+0xb4/0x130
   btrfs_find_all_roots+0x60/0x80
   btrfs_qgroup_trace_extent_post+0x27/0x40
   btrfs_add_delayed_data_ref+0x3fd/0x460
   btrfs_free_extent+0x42/0x100
   __btrfs_mod_ref+0x1d7/0x2f0
   walk_up_proc+0x11c/0x400
   walk_up_tree+0xf0/0x180
   btrfs_drop_snapshot+0x1c7/0x780
   ? btrfs_clean_one_deleted_snapshot+0x73/0x110
   btrfs_clean_one_deleted_snapshot+0xfb/0x110
   cleaner_kthread+0xd4/0x140
   ? btrfs_alloc_root+0x50/0x50
   kthread+0x13a/0x150
   ? kthread_create_worker_on_cpu+0x40/0x40
   ret_from_fork+0x1f/0x30

while testing another lockdep fix.  This happens because we're using the
commit_root_sem to protect fs_info->caching_block_groups, which creates
a dependency on the groups_sem -> commit_root_sem, which is problematic
because we will allocate blocks while holding tree roots.  Fix this by
making the list itself protected by the fs_info->block_group_cache_lock.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:03 +01:00
Josef Bacik
e747853cae btrfs: load free space cache asynchronously
While documenting the usage of the commit_root_sem, I noticed that we do
not actually take the commit_root_sem in the case of the free space
cache.  This is problematic because we're supposed to hold that sem
while we're reading the commit roots, which is what we do for the free
space cache.

The reason I did it inline when I originally wrote the code was because
there's the case of unpinning where we need to make sure that the free
space cache is loaded if we're going to use the free space cache.  But
we can accomplish the same thing by simply waiting for the cache to be
loaded.

Rework this code to load the free space cache asynchronously.  This
allows us to greatly cleanup the caching code because now it's all
shared by the various caching methods.  We also are now in a position to
have the commit_root semaphore held while we're loading the free space
cache.  And finally our modification of ->last_byte_to_unpin is removed
because it can be handled in the proper way on commit.

Some care must be taken when replaying the log, when we expect that the
free space cache will be read entirely before we start excluding space
to replay. This could lead to overwriting space during replay.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:03 +01:00
Josef Bacik
4d7240f0ab btrfs: load the free space cache inode extents from commit root
Historically we've allowed recursive locking specifically for the free
space inode.  This is because we are only doing reads and know that it's
safe.  However we don't actually need this feature, we can get away with
reading the commit root for the extents.  In fact if we want to allow
asynchronous loading of the free space cache we have to use the commit
root, otherwise we will deadlock.

Switch to using the commit root for the file extents.  These are only
read at load time, and are replaced as soon as we start writing the
cache out to disk.  The cache is never read again, so this is
legitimate.  This matches what we do for the inode itself, as we read
that from the commit root as well.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:03 +01:00
Josef Bacik
cd79909bc7 btrfs: load free space cache into a temporary ctl
The free space cache has been special in that we would load it right
away instead of farming the work off to a worker thread.  This resulted
in some weirdness that had to be taken into account for this fact,
namely that if we every found a block group being cached the fast way we
had to wait for it to finish, because we could get the cache before it
had been validated and we may throw the cache away.

To handle this particular case instead create a temporary
btrfs_free_space_ctl to load the free space cache into.  Then once we've
validated that it makes sense, copy it's contents into the actual
block_group->free_space_ctl.  This allows us to avoid the problems of
needing to wait for the caching to complete, we can clean up the discard
extent handling stuff in __load_free_space_cache, and we no longer need
to do the merge_space_tree() because the space is added one by one into
the real free_space_ctl.  This will allow further reworks of how we
handle loading the free space cache.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:03 +01:00
Josef Bacik
66b53bae46 btrfs: cleanup btrfs_discard_update_discardable usage
This passes in the block_group and the free_space_ctl, but we can get
this from the block group itself.  Part of this is because we call it
from __load_free_space_cache, which can be called for the inode cache as
well.

Move that call into the block group specific load section, wrap it in
the right lock that we need for the assertion (but otherwise this is
safe without the lock because this happens in single-thread context).

Fix up the arguments to only take the block group.  Add a lockdep_assert
as well for good measure to make sure we don't mess up the locking
again.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:02 +01:00
Josef Bacik
2ca08c56e8 btrfs: explicitly protect ->last_byte_to_unpin in unpin_extent_range
Currently unpin_extent_range happens in the transaction commit context,
so we are protected from ->last_byte_to_unpin changing while we're
unpinning, because any new transactions would have to wait for us to
complete before modifying ->last_byte_to_unpin.

However in the future we may want to change how this works, for instance
with async unpinning or other such TODO items.  To prepare for that
future explicitly protect ->last_byte_to_unpin with the commit_root_sem
so we are sure it won't change while we're doing our work.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:02 +01:00
Josef Bacik
27d56e62e4 btrfs: update last_byte_to_unpin in switch_commit_roots
While writing an explanation for the need of the commit_root_sem for
btrfs_prepare_extent_commit, I realized we have a slight hole that could
result in leaked space if we have to do the old style caching.  Consider
the following scenario

 commit root
 +----+----+----+----+----+----+----+
 |\\\\|    |\\\\|\\\\|    |\\\\|\\\\|
 +----+----+----+----+----+----+----+
 0    1    2    3    4    5    6    7

 new commit root
 +----+----+----+----+----+----+----+
 |    |    |    |\\\\|    |    |\\\\|
 +----+----+----+----+----+----+----+
 0    1    2    3    4    5    6    7

Prior to this patch, we run btrfs_prepare_extent_commit, which updates
the last_byte_to_unpin, and then we subsequently run
switch_commit_roots.  In this example lets assume that
caching_ctl->progress == 1 at btrfs_prepare_extent_commit() time, which
means that cache->last_byte_to_unpin == 1.  Then we go and do the
switch_commit_roots(), but in the meantime the caching thread has made
some more progress, because we drop the commit_root_sem and re-acquired
it.  Now caching_ctl->progress == 3.  We swap out the commit root and
carry on to unpin.

The race can happen like:

  1) The caching thread was running using the old commit root when it
     found the extent for [2, 3);

  2) Then it released the commit_root_sem because it was in the last
     item of a leaf and the semaphore was contended, and set ->progress
     to 3 (value of 'last'), as the last extent item in the current leaf
     was for the extent for range [2, 3);

  3) Next time it gets the commit_root_sem, will start using the new
     commit root and search for a key with offset 3, so it never finds
     the hole for [2, 3).

  So the caching thread never saw [2, 3) as free space in any of the
  commit roots, and by the time finish_extent_commit() was called for
  the range [0, 3), ->last_byte_to_unpin was 1, so it only returned the
  subrange [0, 1) to the free space cache, skipping [2, 3).

In the unpin code we have last_byte_to_unpin == 1, so we unpin [0,1),
but do not unpin [2,3).  However because caching_ctl->progress == 3 we
do not see the newly freed section of [2,3), and thus do not add it to
our free space cache.  This results in us missing a chunk of free space
in memory (on disk too, unless we have a power failure before writing
the free space cache to disk).

Fix this by making sure the ->last_byte_to_unpin is set at the same time
that we swap the commit roots, this ensures that we will always be
consistent.

CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ update changelog with Filipe's review comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:02 +01:00
Josef Bacik
9076dbd5ee btrfs: do not shorten unpin len for caching block groups
While fixing up our ->last_byte_to_unpin locking I noticed that we will
shorten len based on ->last_byte_to_unpin if we're caching when we're
adding back the free space.  This is correct for the free space, as we
cannot unpin more than ->last_byte_to_unpin, however we use len to
adjust the ->bytes_pinned counters and such, which need to track the
actual pinned usage.  This could result in
WARN_ON(space_info->bytes_pinned) triggering at unmount time.

Fix this by using a local variable for the amount to add to free space
cache, and leave len untouched in this case.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:02 +01:00
David Sterba
dc51616486 btrfs: reorder extent buffer members for better packing
After the rwsem replaced the tree lock implementation, the extent buffer
got smaller but leaving some holes behind. By changing log_index type
and reordering, we can squeeze the size further to 240 bytes, measured on
release config on x86_64. Log_index spans only 3 values and needs to be
signed.

Before:

struct extent_buffer {
        u64                        start;                /*     0     8 */
        long unsigned int          len;                  /*     8     8 */
        long unsigned int          bflags;               /*    16     8 */
        struct btrfs_fs_info *     fs_info;              /*    24     8 */
        spinlock_t                 refs_lock;            /*    32     4 */
        atomic_t                   refs;                 /*    36     4 */
        atomic_t                   io_pages;             /*    40     4 */
        int                        read_mirror;          /*    44     4 */
        struct callback_head       callback_head __attribute__((__aligned__(8))); /*    48    16 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        pid_t                      lock_owner;           /*    64     4 */
        bool                       lock_recursed;        /*    68     1 */

        /* XXX 3 bytes hole, try to pack */

        struct rw_semaphore        lock;                 /*    72    40 */
        short int                  log_index;            /*   112     2 */

        /* XXX 6 bytes hole, try to pack */

        struct page *              pages[16];            /*   120   128 */

        /* size: 248, cachelines: 4, members: 14 */
        /* sum members: 239, holes: 2, sum holes: 9 */
        /* forced alignments: 1 */
        /* last cacheline: 56 bytes */
} __attribute__((__aligned__(8)));

After:

struct extent_buffer {
        u64                        start;                /*     0     8 */
        long unsigned int          len;                  /*     8     8 */
        long unsigned int          bflags;               /*    16     8 */
        struct btrfs_fs_info *     fs_info;              /*    24     8 */
        spinlock_t                 refs_lock;            /*    32     4 */
        atomic_t                   refs;                 /*    36     4 */
        atomic_t                   io_pages;             /*    40     4 */
        int                        read_mirror;          /*    44     4 */
        struct callback_head       callback_head __attribute__((__aligned__(8))); /*    48    16 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        pid_t                      lock_owner;           /*    64     4 */
        bool                       lock_recursed;        /*    68     1 */
        s8                         log_index;            /*    69     1 */

        /* XXX 2 bytes hole, try to pack */

        struct rw_semaphore        lock;                 /*    72    40 */
        struct page *              pages[16];            /*   112   128 */

        /* size: 240, cachelines: 4, members: 14 */
        /* sum members: 238, holes: 1, sum holes: 2 */
        /* forced alignments: 1 */
        /* last cacheline: 48 bytes */
} __attribute__((__aligned__(8)));

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:02 +01:00
Josef Bacik
b9729ce014 btrfs: locking: rip out path->leave_spinning
We no longer distinguish between blocking and spinning, so rip out all
this code.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:02 +01:00
Josef Bacik
ac5887c8e0 btrfs: locking: remove all the blocking helpers
Now that we're using a rw_semaphore we no longer need to indicate if a
lock is blocking or not, nor do we need to flip the entire path from
blocking to spinning.  Remove these helpers and all the places they are
called.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:01 +01:00
David Sterba
2ae0c2d80d btrfs: scrub: remove local copy of csum_size from context
The context structure unnecessarily stores copy of the checksum size,
that can be now easily obtained from fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:01 +01:00
David Sterba
419b791ce7 btrfs: check integrity: remove local copy of csum_size
The state structure unnecessarily stores copy of the checksum size, that
can be now easily obtained from fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:01 +01:00
David Sterba
713cebfb98 btrfs: remove unnecessary local variables for checksum size
Remove local variable that is then used just once and replace it with
fs_info::csum_size.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:00 +01:00
David Sterba
223486c27b btrfs: switch cached fs_info::csum_size from u16 to u32
The fs_info value is 32bit, switch also the local u16 variables. This
leads to a better assembly code generated due to movzwl.

This simple change will shave some bytes on x86_64 and release config:

   text    data     bss     dec     hex filename
1090000   17980   14912 1122892  11224c pre/btrfs.ko
1089794   17980   14912 1122686  11217e post/btrfs.ko

DELTA: -206

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:59 +01:00
David Sterba
55fc29bed8 btrfs: use cached value of fs_info::csum_size everywhere
btrfs_get_16 shows up in the system performance profiles (helper to read
16bit values from on-disk structures). This is partially because of the
checksum size that's frequently read along with data reads/writes, other
u16 uses are from item size or directory entries.

Replace all calls to btrfs_super_csum_size by the cached value from
fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:59 +01:00
David Sterba
fe5ecbe818 btrfs: precalculate checksums per leaf once
btrfs_csum_bytes_to_leaves shows up in system profiles, which makes it a
candidate for optimizations. After the 64bit division has been replaced
by shift, there's still a calculation done each time the function is
called: checksums per leaf.

As this is a constant value for the entire filesystem lifetime, we
can calculate it once at mount time and reuse. This also allows to
reduce the division to 64bit/32bit as we know the constant will always
fit the 32bit type.

Replace the open-coded rounding up with a macro that internally handles
the 64bit division and as it's now a short function, make it static
inline (slight code increase, slight stack usage reduction).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:58 +01:00
David Sterba
22b6331d96 btrfs: store precalculated csum_size in fs_info
In many places we need the checksum size and it is inefficient to read
it from the raw superblock. Store the value into fs_info, actual use
will be in followup patches.  The size is u32 as it allows to generate
better assembly than with u16.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:58 +01:00
David Sterba
265fdfa6ce btrfs: replace s_blocksize_bits with fs_info::sectorsize_bits
The value of super_block::s_blocksize_bits is the same as
fs_info::sectorsize_bits, but we don't need to do the extra dereferences
in many functions and storing the bits as u32 (in fs_info) generates
shorter assembly.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:58 +01:00
David Sterba
098e63082b btrfs: replace div_u64 by shift in free_space_bitmap_size
Change free_space_bitmap_size to take btrfs_fs_info so we can get the
sectorsize_bits to do calculations.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:57 +01:00
David Sterba
ab108d992b btrfs: use precalculated sectorsize_bits from fs_info
We do a lot of calculations where we divide or multiply by sectorsize.
We also know and make sure that sectorsize is a power of two, so this
means all divisions can be turned to shifts and avoid eg. expensive
u64/u32 divisions.

The type is u32 as it's more register friendly on x86_64 compared to u8
and the resulting assembly is smaller (movzbl vs movl).

There's also superblock s_blocksize_bits but it's usually one more
pointer dereference farther than fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:57 +01:00
Qu Wenruo
e940e9a7c7 btrfs: rename page_size to io_size in submit_extent_page
The variable @page_size in submit_extent_page() is not related to page
size.

It can already be smaller than PAGE_SIZE, so rename it to io_size to
reduce confusion, this is especially important for later subpage
support.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:56 +01:00
Qu Wenruo
8b8bbd461e btrfs: only require sector size alignment for page read
If we're reading partial page, btrfs will warn about this as read/write
is always done in sector size, which now equals page size.

But for the upcoming subpage read-only support, our data read is only
aligned to sectorsize, which can be smaller than page size.

Thus here we change the warning condition to check it against
sectorsize, the behavior is not changed for regular sectorsize ==
PAGE_SIZE case, and won't report error for subpage read.

Also, pass the proper start/end with bv_offset for check_data_csum() to
handle.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:56 +01:00
Qu Wenruo
12e3360f74 btrfs: rename pages_locked in process_pages_contig()
Function process_pages_contig() does not only handle page locking but
also other operations.  Rename the local variable pages_locked to
pages_processed to reduce confusion.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:55 +01:00
Qu Wenruo
265d4ac03f btrfs: sink parameter start and len to check_data_csum
For check_data_csum(), the page we're using is directly from the inode
mapping, thus it has valid page_offset().

We can use (page_offset() + pg_off) to replace @start parameter
completely, while the @len should always be sectorsize.

Since we're here, also add some comment, as there are quite some
confusion in words like start/offset, without explaining whether it's
file_offset or logical bytenr.

This should not affect the existing behavior, as for current sectorsize
== PAGE_SIZE case, @pgoff should always be 0, and len is always
PAGE_SIZE (or sectorsize from the dio read path).

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:55 +01:00
Qu Wenruo
8896a08d8e btrfs: replace fs_info and private_data with inode in btrfs_wq_submit_bio
All callers of btrfs_wq_submit_bio() pass struct inode as @private_data,
so there is no need for it to be (void *), replace it with "struct inode
*inode".

While we can extract fs_info from struct inode, also remove the @fs_info
parameter.

Since we're here, also replace all the (void *private_data) into (struct
inode *inode).

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:54 +01:00
Qu Wenruo
3f6bb4aeb5 btrfs: sink the failed_start parameter to set_extent_bit
The @failed_start parameter is only paired with @exclusive_bits, and
those parameters are only used for EXTENT_LOCKED bit, which have their
own wrappers lock_extent_bits().

Thus for regular set_extent_bit() calls, the failed_start makes no
sense, just sink the parameter.

Also, since @failed_start and @exclusive_bits are used in pairs, add
an assert to make it obvious.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:54 +01:00
Qu Wenruo
03509b781a btrfs: update the comment for find_first_extent_bit
The pitfall here is, if the parameter @bits has multiple bits set, we
will return the first range which just has one of the specified bits
set.

This is a little tricky if we want an exact match.  Anyway, update the
comment to make that clear.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:53 +01:00
Qu Wenruo
a3efb2f0ba btrfs: fix the comment on lock_extent_buffer_for_io
The return value of that function is completely wrong.

That function only returns 0 if the extent buffer doesn't need to be
submitted.  The "ret = 1" and "ret = 0" are determined by the return
value of "test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)".

And if we get ret == 1, it's because the extent buffer is dirty, and we
set its status to EXTENT_BUFFER_WRITE_BACK, and continue to page
locking.

While if we get ret == 0, it means the extent is not dirty from the
beginning, so we don't need to write it back.

The caller also follows this, in btree_write_cache_pages(), if
lock_extent_buffer_for_io() returns 0, we just skip the extent buffer
completely.

So the comment is completely wrong.

Since we're here, also change the description a little.  The write bio
flushing won't be visible to the caller, thus it's not an major feature.
In the main description, only describe the locking part to make the
point more clear.

For reference, added in commit 2e3c25136a ("btrfs: extent_io: add
proper error handling to lock_extent_buffer_for_io()")

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:53 +01:00
David Sterba
cc7c77146e btrfs: remove unnecessary casts in printk
Long time ago the explicit casts were necessary for u64 but we don't
need it.  Remove casts where the type matches, leaving only cases that
cast sector_t or loff_t.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:52 +01:00
David Sterba
c842268458 btrfs: add set/get accessors for root_item::drop_level
The drop_level member is used directly unlike all the other int types in
root_item. Add the definition and use it everywhere. The type is u8 so
there's no conversion necessary and the helpers are properly inlined,
this is for consistency.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:52 +01:00
David Sterba
f944d2cb20 btrfs: use root_item helpers for limit and flags in btrfs_create_tree
For consistency use the available helpers to set flags and limit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:51 +01:00
David Sterba
3b5418fba3 btrfs: check-integrity: use proper helper to access btrfs_header
There's one raw use of le->cpu conversion but we have a helper to do
that for us, so use it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:51 +01:00
David Sterba
09e3a28892 btrfs: send: use helpers to access root_item::ctransid
We have helpers to access the on-disk item members, use that for
root_item::ctransid instead of raw le64_to_cpu.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:51 +01:00
David Sterba
ab1405aa25 btrfs: generate lockdep keyset names at compile time
The names in btrfs_lockdep_keysets are generated from a simple pattern
using snprintf but we can generate them directly with some macro magic
and remove the helpers.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:50 +01:00
David Sterba
387824afd7 btrfs: use the right number of levels for lockdep keysets
BTRFS_MAX_LEVEL is 8 and the keyset table is supposed to have a key for
each level, but we'll never have more than 8 levels.  The values passed
to btrfs_set_buffer_lockdep_class are always derived from a valid extent
buffer.  Set the array sizes to the right value.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:50 +01:00
Goldwyn Rodrigues
ecfdc08b8c btrfs: remove dio iomap DSYNC workaround
This effectively reverts 09745ff88d93 ("btrfs: dio iomap DSYNC
workaround") now that the iomap API has been updated to allow
iomap_dio_complete() not to be called under i_rwsem anymore.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:49 +01:00
Goldwyn Rodrigues
a42fa64316 btrfs: call iomap_dio_complete() without inode_lock
If direct writes are called with O_DIRECT | O_DSYNC, it will result in a
deadlock because iomap_dio_rw() is called under i_rwsem which calls:

  iomap_dio_complete()
    generic_write_sync()
      btrfs_sync_file()

btrfs_sync_file() requires i_rwsem, so call __iomap_dio_rw() with the
i_rwsem locked, and call iomap_dio_complete() after unlocking i_rwsem.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:49 +01:00
Goldwyn Rodrigues
502756b380 btrfs: remove btrfs_inode::dio_sem
The inode dio_sem can be eliminated because all DIO synchronization is
now performed through inode->i_rwsem that provides the same guarantees.

This reduces btrfs_inode size by 40 bytes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:48 +01:00
Goldwyn Rodrigues
e9adabb971 btrfs: use shared lock for direct writes within EOF
Direct writes within EOF are safe to be performed with inode shared lock
to improve parallelization with other direct writes or reads because EOF
is not changed and there is no race with truncate().

Direct reads are already performed under shared inode lock.

This patch is precursor to removing btrfs_inode->dio_sem.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:48 +01:00
Goldwyn Rodrigues
c352370633 btrfs: push inode locking and unlocking into buffered/direct write
Push inode locking and unlocking closer to where we perform the I/O. For
this we need to move the write checks inside the respective functions as
well.

pos is evaluated after generic_write_checks because O_APPEND can change
iocb->ki_pos.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:48 +01:00
Goldwyn Rodrigues
a14b78ad06 btrfs: introduce btrfs_inode_lock()/unlock()
btrfs_inode_lock/unlock() are wrappers around inode locks, separating
the type of lock and actual locking.

- 0 - default, exclusive lock
- BTRFS_ILOCK_SHARED - for shared locks, for possible parallel DIO
- BTRFS_ILOCK_TRY - for the RWF_NOWAIT sequence

The bits SHARED and TRY can be combined together.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:47 +01:00
Goldwyn Rodrigues
b8d8e1fd57 btrfs: introduce btrfs_write_check()
btrfs_write_check() checks write parameters in one place before
beginning a write. This does away with inode_unlock() after every check.
In the later patches, it will help push inode_lock/unlock() in buffered
and direct write functions respectively.

generic_write_checks needs to be called before as it could truncate
iov_iter and its return used as count.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:47 +01:00
Goldwyn Rodrigues
c86537a42f btrfs: check FS error state bit early during write
fs_info::fs_state is a filesystem bit check as opposed to inode and can
be performed before we begin with write checks. This eliminates inode
lock/unlock in case the error bit is set.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:46 +01:00
Goldwyn Rodrigues
5e8b9ef303 btrfs: move pos increment and pagecache extension to btrfs_buffered_write
While we do this, correct the call to pagecache_isize_extended:

 - pagecache_isize_extended needs to be called to the start of the write
   as opposed to i_size

 - we don't need to check range before the call, this is done in the
   function

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:46 +01:00
Goldwyn Rodrigues
4e4cabece9 btrfs: split btrfs_direct_IO to read and write
The read and write DIO don't have anything in common except for the
call to iomap_dio_rw. Extract the write call into a new function to get
rid of conditional statements for direct write.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:45 +01:00
Anand Jain
3d8cc17a05 btrfs: sysfs: add per-fs attribute for read policy
Add

 /sys/fs/btrfs/UUID/read_policy

attribute so that the read policy for the raid1, raid1c34 and raid10 can
be tuned.

When this attribute is read, it will show all available policies, with
active policy in [ ]. The read_policy attribute can be written using one
of the items listed in there.

For example:
  $ cat /sys/fs/btrfs/UUID/read_policy
  [pid]
  $ echo pid > /sys/fs/btrfs/UUID/read_policy

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:45 +01:00
Anand Jain
33fd2f714c btrfs: create read policy framework
As of now, we use the pid method to read striped mirrored data, which
means process id determines the stripe id to read. This type of routing
typically helps in a system with many small independent processes tying
to read random data. On the other hand, the pid based read IO policy is
inefficient because if there is a single process trying to read a large
file, the overall disk bandwidth remains underutilized.

So this patch introduces a read policy framework so that we could add
more read policies, such as IO routing based on the device's wait-queue
or manual when we have a read-preferred device or a policy based on the
target storage caching.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:44 +01:00
Anand Jain
aaefed2078 btrfs: add helper for string match ignoring leading/trailing whitespace
Add a generic helper to match the string in a given buffer, and ignore
the leading and trailing whitespace.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename variables, add comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:44 +01:00
Filipe Manana
88090ad36a btrfs: do not start and wait for delalloc on snapshot roots on transaction commit
We do not need anymore to start writeback for delalloc of roots that are
being snapshotted and wait for it to complete. This was done in commit
609e804d77 ("Btrfs: fix file corruption after snapshotting due to mix
of buffered/DIO writes") to fix a type of file corruption where files in a
snapshot end up having their i_size updated in a non-ordered way, leaving
implicit file holes, when buffered IO writes that increase a file's size
are followed by direct IO writes that also increase the file's size.

This is not needed anymore because we now have a more generic mechanism
to prevent a non-ordered i_size update since commit 9ddc959e80
("btrfs: use the file extent tree infrastructure"), which addresses this
scenario involving snapshots as well.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:43 +01:00
Josef Bacik
196d59ab9c btrfs: switch extent buffer tree lock to rw_semaphore
Historically we've implemented our own locking because we wanted to be
able to selectively spin or sleep based on what we were doing in the
tree.  For instance, if all of our nodes were in cache then there's
rarely a reason to need to sleep waiting for node locks, as they'll
likely become available soon.  At the time this code was written the
rw_semaphore didn't do adaptive spinning, and thus was orders of
magnitude slower than our home grown locking.

However now the opposite is the case.  There are a few problems with how
we implement blocking locks, namely that we use a normal waitqueue and
simply wake everybody up in reverse sleep order.  This leads to some
suboptimal performance behavior, and a lot of context switches in highly
contended cases.  The rw_semaphores actually do this properly, and also
have adaptive spinning that works relatively well.

The locking code is also a bit of a bear to understand, and we lose the
benefit of lockdep for the most part because the blocking states of the
lock are simply ad-hoc and not mapped into lockdep.

So rework the locking code to drop all of this custom locking stuff, and
simply use a rw_semaphore for everything.  This makes the locking much
simpler for everything, as we can now drop a lot of cruft and blocking
transitions.  The performance numbers vary depending on the workload,
because generally speaking there doesn't tend to be a lot of contention
on the btree.  However, on my test system which is an 80 core single
socket system with 256GiB of RAM and a 2TiB NVMe drive I get the
following results (with all debug options off):

  dbench 200 baseline
  Throughput 216.056 MB/sec  200 clients  200 procs  max_latency=1471.197 ms

  dbench 200 with patch
  Throughput 737.188 MB/sec  200 clients  200 procs  max_latency=714.346 ms

Previously we also used fs_mark to test this sort of contention, and
those results are far less impressive, mostly because there's not enough
tasks to really stress the locking

  fs_mark -d /d[0-15] -S 0 -L 20 -n 100000 -s 0 -t 16

  baseline
    Average Files/sec:     160166.7
    p50 Files/sec:         165832
    p90 Files/sec:         123886
    p99 Files/sec:         123495

    real    3m26.527s
    user    2m19.223s
    sys     48m21.856s

  patched
    Average Files/sec:     164135.7
    p50 Files/sec:         171095
    p90 Files/sec:         122889
    p99 Files/sec:         113819

    real    3m29.660s
    user    2m19.990s
    sys     44m12.259s

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:43 +01:00
Nikolay Borisov
ecdcf3c259 btrfs: open code insert_orphan_item
Just open code it in its sole caller and remove a level of indirection.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:42 +01:00
Josef Bacik
9037d3cbcb btrfs: introduce mount option rescue=all
Now that we have the building blocks for some better recovery options
with corrupted file systems, add a rescue=all option to enable all of
the relevant rescue options.  This will allow distros to simply default
to rescue=all for the "oh dear lord the world's on fire" recovery
without needing to know all the different options that we have and may
add in the future.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:42 +01:00
Josef Bacik
882dbe0cec btrfs: introduce mount option rescue=ignoredatacsums
There are cases where you can end up with bad data csums because of
misbehaving applications.  This happens when an application modifies a
buffer in-flight when doing an O_DIRECT write.  In order to recover the
file we need a way to turn off data checksums so you can copy the file
off, and then you can delete the file and restore it properly later.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:42 +01:00
Josef Bacik
42437a6386 btrfs: introduce mount option rescue=ignorebadroots
In the face of extent root corruption, or any other core fs wide root
corruption we will fail to mount the file system.  This makes recovery
kind of a pain, because you need to fall back to userspace tools to
scrape off data.  Instead provide a mechanism to gracefully handle bad
roots, so we can at least mount read-only and possibly recover data from
the file system.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:41 +01:00
Josef Bacik
68319c18cb btrfs: show rescue=usebackuproot in /proc/mounts
The standalone option usebackuproot was intended as one-time use and it
was not necessary to keep it in the option list. Now that we're going to
have more rescue options, it's desirable to keep them intact as it could
be confusing why the option disappears.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ remove the btrfs_clear_opt part from open_ctree ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:41 +01:00
Josef Bacik
ab0b4a3ebf btrfs: add a helper to print out rescue= options
We're going to have a lot of rescue options, add a helper to collapse
the /proc/mounts output to rescue=option1:option2:option3 format.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:40 +01:00
Josef Bacik
ceafe3cc39 btrfs: sysfs: export supported rescue= mount options
We're going to be adding a variety of different rescue options, we
should advertise which ones we support to make user spaces life easier
in the future.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:40 +01:00
Josef Bacik
334c16d82c btrfs: push the NODATASUM check into btrfs_lookup_bio_sums
When we move to being able to handle NULL csum_roots it'll be cleaner to
just check in btrfs_lookup_bio_sums instead of at all of the caller
locations, so push the NODATASUM check into it as well so it's unified.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:39 +01:00
Josef Bacik
d70bf7484f btrfs: unify the ro checking for mount options
We're going to be adding more options that require RDONLY, so add a
helper to do the check and error out if we don't have RDONLY set.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:39 +01:00
Filipe Manana
a6889caf6e btrfs: do not start readahead for csum tree when scrubbing non-data block groups
When scrubbing a stripe of a block group we always start readahead for the
checksums btree and wait for it to complete, however when the blockgroup is
not a data block group (or a mixed block group) it is a waste of time to do
it, since there are no checksums for metadata extents in that btree.

So skip that when the block group does not have the data flag set, saving
some time doing memory allocations, queueing a job in the readahead work
queue, waiting for it to complete and potentially avoiding some IO as well
(when csum tree extents are not in memory already).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:39 +01:00
Filipe Manana
a57ad681f1 btrfs: assert we are holding the reada_lock when releasing a readahead zone
When we drop the last reference of a zone, we end up releasing it through
the callback reada_zone_release(), which deletes the zone from a device's
reada_zones radix tree. This tree is protected by the global readahead
lock at fs_info->reada_lock. Currently all places that are sure that they
are dropping the last reference on a zone, are calling kref_put() in a
critical section delimited by this lock, while all other places that are
sure they are not dropping the last reference, do not bother calling
kref_put() while holding that lock.

When working on the previous fix for hangs and use-after-frees in the
readahead code, my initial attempts were different and I actually ended
up having reada_zone_release() called when not holding the lock, which
resulted in weird and unexpected problems. So just add an assertion
there to detect such problem more quickly and make the dependency more
obvious.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:38 +01:00
Goldwyn Rodrigues
aa8c1a41a1 btrfs: set EXTENT_NORESERVE bits side btrfs_dirty_pages()
Set the extent bits EXTENT_NORESERVE inside btrfs_dirty_pages() as
opposed to calling set_extent_bits again later.

Fold check for written length within the function.

Note: EXTENT_NORESERVE is set before unlocking extents.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:38 +01:00
Goldwyn Rodrigues
13f0dd8f78 btrfs: use round_down while calculating start position in btrfs_dirty_pages()
round_down looks prettier than the bit mask operations.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:38 +01:00
Goldwyn Rodrigues
949b32732e btrfs: use iosize while reading compressed pages
While using compression, a submitted bio is mapped with a compressed bio
which performs the read from disk, decompresses and returns uncompressed
data to original bio. The original bio must reflect the uncompressed
size (iosize) of the I/O to be performed, or else the page just gets the
decompressed I/O length of data (disk_io_size). The compressed bio
checks the extent map and gets the correct length while performing the
I/O from disk.

This came up in subpage work when only compressed length of the original
bio was filled in the page. This worked correctly for pagesize ==
sectorsize because both compressed and uncompressed data are at pagesize
boundaries, and would end up filling the requested page.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:37 +01:00
Goldwyn Rodrigues
eefa45f593 btrfs: calculate num_pages, reserve_bytes once in btrfs_buffered_write
write_bytes can change in btrfs_check_nocow_lock(). Calculate variables
such as num_pages and reserve_bytes once we are sure of the value of
write_bytes so there is no need to re-calculate.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:37 +01:00
Nikolay Borisov
fb8a7e941b btrfs: calculate more accurate remaining time to sleep in transaction_kthread
If transaction_kthread is woken up before btrfs_fs_info::commit_interval
seconds have elapsed it will sleep for a fixed period of 5 seconds. This
is not a problem per-se but is not accurate. Instead the code should
sleep for an interval which guarantees on next wakeup commit_interval
would have passed. Since time tracking is not precise subtract 1 second
from delta to ensure the delay we end up waiting will be longer than
than the wake up period.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:36 +01:00
Nikolay Borisov
643900bee4 btrfs: record delta directly in transaction_kthread
Rename 'now' to 'delta' and store there the delta between transaction
start time and current time. This is in preparation for optimising the
sleep logic in the next patch. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:36 +01:00
Nikolay Borisov
e4e4288161 btrfs: remove redundant time check in transaction kthread loop
The value obtained from ktime_get_seconds() is guaranteed to be
monotonically increasing since it's taken from CLOCK_MONOTONIC. As
transaction_kthread obtains a reference to the currently running
transaction under holding btrfs_fs_info::trans_lock it's guaranteed to:

a) see an initialized 'cur', whose start_time is guaranteed to be smaller
   than 'now'

or

b) not obtain a 'cur' and simply go to sleep.

Given this remove the unnecessary check, if it sees
now < cur->start_time this would imply there are far greater problems on
the machine.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:19 +01:00
Gao Xiang
473e15b0c0 erofs: simplify try_to_claim_pcluster()
simplify try_to_claim_pcluster() by directly using cmpxchg() here
(the retry loop caused more overhead.) Also, move the chain loop
detection in and rename it to z_erofs_try_to_claim_pcluster().

Link: https://lore.kernel.org/r/20201208095834.3133565-3-hsiangkao@redhat.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-12-08 18:08:22 +08:00
Gao Xiang
bf225074ff erofs: insert to managed cache after adding to pcl
Previously, it could be some concern to call add_to_page_cache_lru()
with page->mapping == Z_EROFS_MAPPING_STAGING (!= NULL).

In contrast, page->private is used instead now, so partially revert
commit 5ddcee1f3a ("erofs: get rid of __stagingpage_alloc helper")
with some adaption for simplicity.

Link: https://lore.kernel.org/r/20201208095834.3133565-2-hsiangkao@redhat.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-12-08 18:08:21 +08:00
Gao Xiang
6aaa7b0664 erofs: get rid of magical Z_EROFS_MAPPING_STAGING
Previously, we played around with magical page->mapping for short-lived
temporary pages since we need to identify different types of pages in
the same pcluster but both invalidated and short-lived temporary pages
can have page->mapping == NULL. It was considered as safe because that
temporary pages are all non-LRU / non-movable pages.

This patch tends to use specific page->private to identify short-lived
pages instead so it won't rely on page->mapping anymore. Details are
described in "compress.h" as well.

Link: https://lore.kernel.org/r/20201208095834.3133565-1-hsiangkao@redhat.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-12-08 18:08:21 +08:00
Vladimir Zapolskiy
a426ce9d67 erofs: remove a void EROFS_VERSION macro set in Makefile
Since commit 4f761fa253 ("erofs: rename errln/infoln/debugln to
erofs_{err, info, dbg}") the defined macro EROFS_VERSION has no affect,
therefore removing it from the Makefile is a non-functional change.

Link: https://lore.kernel.org/r/20201030122839.25431-1-vladimir@tuxera.com
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Vladimir Zapolskiy <vladimir@tuxera.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-12-08 18:06:06 +08:00
Jaegeuk Kim
10208567f1 f2fs: introduce max_io_bytes, a sysfs entry, to limit bio size
This patch adds max_io_bytes to limit bio size when f2fs tries to merge
consecutive IOs. This can give a testing point to split out bios and check
end_io handles those bios correctly. This is used to capture a recent bug
on the decompression and fsverity flow.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-07 08:41:25 -08:00
Jaegeuk Kim
ec2ddf4994 f2fs: don't allow any writes on readonly mount
generic_make_request: Trying to write to read-only block-device dm-5 (partno 0)
WARNING: CPU: 7 PID: 546 at block/blk-core.c:2190 generic_make_request_checks+0x664/0x690
pc : generic_make_request_checks+0x664/0x690
lr : generic_make_request_checks+0x664/0x690
Call trace:
 generic_make_request_checks+0x664/0x690
 generic_make_request+0xf0/0x3a4
 submit_bio+0x80/0x250
 __submit_merged_bio+0x368/0x4e0
 __submit_merged_write_cond.llvm.12294350193007536502+0xe0/0x3e8
 f2fs_wait_on_page_writeback+0x84/0x128
 f2fs_convert_inline_page+0x35c/0x6f8
 f2fs_convert_inline_inode+0xe0/0x2e0
 f2fs_file_mmap+0x48/0x9c
 mmap_region+0x41c/0x74c
 do_mmap+0x40c/0x4fc
 vm_mmap_pgoff+0xb8/0x114
 vm_mmap+0x34/0x48
 elf_map+0x68/0x108
 load_elf_binary+0x538/0xb70
 search_binary_handler+0xac/0x1dc
 exec_binprm+0x50/0x15c
 __do_execve_file+0x620/0x740
 __arm64_sys_execve+0x54/0x68
 el0_svc_common+0x9c/0x168
 el0_svc_handler+0x60/0x6c
 el0_svc+0x8/0xc

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-07 08:41:16 -08:00
Pavel Begunkov
e8c954df23 io_uring: fix mis-seting personality's creds
After io_identity_cow() copies an work.identity it wants to copy creds
to the new just allocated id, not the old one. Otherwise it's
akin to req->work.identity->creds = req->work.identity->creds.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07 08:43:44 -07:00
Nikolay Borisov
ba1bc00f35 btrfs: use helpers to convert from seconds to jiffies in transaction_kthread
The kernel provides easy to understand helpers to convert from human
understandable units to the kernel-friendly 'jiffies'. So let's use
those to make the code easier to understand. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-07 16:24:00 +01:00
Anand Jain
089c8b0551 btrfs: sysfs: export filesystem generation
Matching with the information that's available from the ioctl
FS_INFO, add generation to the per-filesystem directory
/sys/fs/btrfs/UUID/generation, which could be used by scripts.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-07 16:23:59 +01:00
Menglong Dong
2bf509d96d coredump: fix core_pattern parse error
'format_corename()' will splite 'core_pattern' on spaces when it is in
pipe mode, and take helper_argv[0] as the path to usermode executable.
It works fine in most cases.

However, if there is a space between '|' and '/file/path', such as
'| /usr/lib/systemd/systemd-coredump %P %u %g', then helper_argv[0] will
be parsed as '', and users will get a 'Core dump to | disabled'.

It is not friendly to users, as the pattern above was valid previously.
Fix this by ignoring the spaces between '|' and '/file/path'.

Fixes: 315c69261d ("coredump: split pipe command whitespace before expanding template")
Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Wise <pabs3@bonedaddy.net>
Cc: Jakub Wilk <jwilk@jwilk.net> [https://bugs.debian.org/924398]
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/5fb62870.1c69fb81.8ef5d.af76@mx.google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-06 10:19:07 -08:00
Linus Torvalds
619ca2664c io_uring-5.10-2020-12-05
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/L/o4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpifFD/9tqDSwqcRIKRhSHqsD6OJRagZ0UB37cPvH
 ZXmtHblK9BkeXc83OPy+NmlcPq1mpDmbMgy0rls8UOGnZs0RINN9En5hHGOeRVD4
 Ma6Abh9ypzXhNKqY4HO+lcsN+elGqoaYNSnu/Nbcib7Z5yYzUSNDnpIeTQy7Onbw
 dTGNTJalZYqrQ+6ytglfVeQuXSSKYPJfjFdcYd1It6Ks0aSKDgcQxJZgPeuF6xE+
 mxEWADs+zpZG8UqZYyLnS6+9dxuJwhx+Cd47Ntky0VdaWmxCogQiI3X/Q3vspnVq
 gz9VqnUIA3Qgabm0UJkyjCi87W2KNDf4r5OOaQw4dLc0kDhIVS6ii67iGVs9Dl7F
 xCZMCE4pkEvs0pPYCyITUUNPh/fBONHgDHAWxxhztkE8ldvsAUmqNyV/LWMMvtSl
 SUQli6OZMVyyUX9sroHOWKhi8rQYt1UZa0tm+UZE/+YrZLfp6qBCZHFC6hdNfQoS
 k7PiuxecsH4i/OweJMPtggChJ7VWHZUorTljfHnl14oDuFuVD22AJvc3q+OSStI1
 Ua5N27b2/pWXnf9P3dPGzikprOU/vXPQnlifmBxXNj+/TD29+GPUHhV2+kibc5/F
 gMbrhX3TX4tstHIGxUCoeB8XepiBqOmn1eLkWNQQGi8RbqCKfE4j/5o07HgxST2P
 Ns2GaEKb4A==
 =CPrM
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.10-2020-12-05' of git://git.kernel.dk/linux-block

Pull io_uring fix from Jens Axboe:
 "Just a small fix this time, for an issue with 32-bit compat apps and
  buffer selection with recvmsg"

* tag 'io_uring-5.10-2020-12-05' of git://git.kernel.dk/linux-block:
  io_uring: fix recvmsg setup with compat buf-select
2020-12-05 14:39:59 -08:00
Linus Torvalds
d4e904198c Three smb3 fixes, two for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl/K3U0ACgkQiiy9cAdy
 T1Gewwv7Bjtq+c4tvc0LZj04oCfgsmcdu9s82Xff+e81lJlPyK+04zrp1kPiVQ8O
 iV5+bQr96BPgA96f2Et/OdnO4ei4T1vp9yTfXkTHQCWHQhd57N98TNudtJ+JNKCv
 ziaY+X6BFBBFnbCiPQxLZjvXsXenS3PFTq+wJB6ot4Qsj7z3kk88nXrvDDP7kiFL
 3WqZzUWKPJaRHjJQIrUch9M77P1cGk1duKT8Ydj/XKQujocWFWYXEWVYMGKcFFpK
 cXBEbN6vBUTogn6fA8lQKy3KTarRGT8qpl8c91UCZ61sH8t6RlHQ53rtPrLPG/9k
 KH8xyN+Lu4+w9LDni5HqsIHAqi0+dnPvKRZWIkxMI/ZbZcV4Mwoi4kMZ/H85psp6
 wEDgA6czaQSS9c3jZ0lvNgOHvpKkz1lz8sXwUuTYVehMaQrTcxNdEgh8TdlfZ9P6
 St404ZXSsthXChac2N22G9JX4XcM/wocSWq/sfpoUvr9eKJMjFMdB7rP7ORPQofb
 S/u8v5pD
 =3b+C
 -----END PGP SIGNATURE-----

Merge tag '5.10-rc6-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Three smb3 fixes (two for stable) fixing

   - a null pointer issue in a DFS error path

   - a problem with excessive padding when mounted with "idsfromsid"
     causing owner fields to get corrupted

   - a more recent problem with compounded reparse point query found in
     testing to the Linux kernel server"

* tag '5.10-rc6-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: refactor create_sd_buf() and and avoid corrupting the buffer
  cifs: add NULL check for ses->tcon_ipc
  smb3: set COMPOUND_FID to FileID field of subsequent compound request
2020-12-05 11:09:07 -08:00
Florent Revest
dba4a9256b net: Remove the err argument from sock_from_file
Currently, the sock_from_file prototype takes an "err" pointer that is
either not set or set to -ENOTSOCK IFF the returned socket is NULL. This
makes the error redundant and it is ignored by a few callers.

This patch simplifies the API by letting callers deduce the error based
on whether the returned socket is NULL or not.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Florent Revest <revest@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20201204113609.1850150-1-revest@google.com
2020-12-04 22:32:40 +01:00
Jakub Kicinski
a1dd1d8697 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-12-03

The main changes are:

1) Support BTF in kernel modules, from Andrii.

2) Introduce preferred busy-polling, from Björn.

3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh.

4) Memcg-based memory accounting for bpf objects, from Roman.

5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits)
  selftests/bpf: Fix invalid use of strncat in test_sockmap
  libbpf: Use memcpy instead of strncpy to please GCC
  selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module
  selftests/bpf: Add tp_btf CO-RE reloc test for modules
  libbpf: Support attachment of BPF tracing programs to kernel modules
  libbpf: Factor out low-level BPF program loading helper
  bpf: Allow to specify kernel module BTFs when attaching BPF programs
  bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
  selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF
  selftests/bpf: Add support for marking sub-tests as skipped
  selftests/bpf: Add bpf_testmod kernel module for testing
  libbpf: Add kernel module BTF support for CO-RE relocations
  libbpf: Refactor CO-RE relocs to not assume a single BTF object
  libbpf: Add internal helper to load BTF data by FD
  bpf: Keep module's btf_data_size intact after load
  bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
  selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP
  bpf: Adds support for setting window clamp
  samples/bpf: Fix spelling mistake "recieving" -> "receiving"
  bpf: Fix cold build of test_progs-no_alu32
  ...
====================

Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 07:48:12 -08:00
Giuseppe Scrivano
582f1fb6b7
fs, close_range: add flag CLOSE_RANGE_CLOEXEC
When the flag CLOSE_RANGE_CLOEXEC is set, close_range doesn't
immediately close the files but it sets the close-on-exec bit.

It is useful for e.g. container runtimes that usually install a
seccomp profile "as late as possible" before execv'ing the container
process itself.  The container runtime could either do:
  1                                  2
- install_seccomp_profile();       - close_range(MIN_FD, MAX_INT, 0);
- close_range(MIN_FD, MAX_INT, 0); - install_seccomp_profile();
- execve(...);                     - execve(...);

Both alternative have some disadvantages.

In the first variant the seccomp_profile cannot block the close_range
syscall, as well as opendir/read/close/... for the fallback on older
kernels.
In the second variant, close_range() can be used only on the fds
that are not going to be needed by the runtime anymore, and it must be
potentially called multiple times to account for the different ranges
that must be closed.

Using close_range(..., ..., CLOSE_RANGE_CLOEXEC) solves these issues.
The runtime is able to use the existing open fds, the seccomp profile
can block close_range() and the syscalls used for its fallback.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Link: https://lore.kernel.org/r/20201118104746.873084-2-gscrivan@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-12-04 12:06:15 +01:00
Ronnie Sahlberg
ea64370bca cifs: refactor create_sd_buf() and and avoid corrupting the buffer
When mounting with "idsfromsid" mount option, Azure
corrupted the owner SIDs due to excessive padding
caused by placing the owner fields at the end of the
security descriptor on create.  Placing owners at the
front of the security descriptor (rather than the end)
is also safer, as the number of ACEs (that follow it)
are variable.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Suggested-by: Rohith Surabattula <rohiths@microsoft.com>
CC: Stable <stable@vger.kernel.org> # v5.8
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-03 17:12:14 -06:00
Aurelien Aptel
59463eb888 cifs: add NULL check for ses->tcon_ipc
In some scenarios (DFS and BAD_NETWORK_NAME) set_root_set() can be
called with a NULL ses->tcon_ipc.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-03 17:06:03 -06:00
Namjae Jeon
7963178485 smb3: set COMPOUND_FID to FileID field of subsequent compound request
For an operation compounded with an SMB2 CREATE request, client must set
COMPOUND_FID(0xFFFFFFFFFFFFFFFF) to FileID field of smb2 ioctl.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Fixes: 2e4564b31b ("smb3: add support stat of WSL reparse points for special file types")
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-12-03 17:02:52 -06:00
Linus Torvalds
c82a505c00 Restore splice functionality for 9p
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAl/IvNUACgkQq06b7GqY
 5nBRAA//UQJ3ioEvTlIpdHR6wKeH9XV2bucqjbAO2PRU0r8qEBC6wRr8IDBv4J4D
 xUyQ+teWjBLoKekdhoXsz3xXWtwqPNchVXClFOtjsUqYn2F71pUYNhRYJL9Xy+yy
 pumUxG3vJpJ4QTWkuSeB7/Ha561waWHILIeo1knT6z0Q49jmYI1J9Bl84q7RiC/7
 wH+bgyE0RSORoIfkfapIvgwdjamUsnzMVhSVKNOol3gwkjFJGfOtk16VVSKq/RlK
 5uOnjmWiVAkfPdDOA93mu6KE3FA4FB/sQaESwBR2fawsd646WooG5hn8fhMa0kHP
 y6AZo6fdYglcOD/OE1S0GpcSOocAdglfcU51kfxX8Z7BwfAbl7WCPCS2Rl75PYFf
 PhTzidD+vQsroGEaCF2+w90yIZBGgLz0Lh6az3zmaNGo7lcrRyP6GzQMmKILp2O1
 MuPWNFRzt5ZAmyq7roVVclMzXnXEQM2v5YS3ib+wuUpQHWwGAUzdV6u9uFsp6vXx
 QSnsD4Y9wwMSI03QkkzW7TaH1laImRt820emivo3rOoWk1f7sCG5r3WlNPRIh1cN
 9n6oCCOvFgXpTmiE0btqSKeMMxCqjtPHGZ5dQTTnftf0euNXGIZDPMSL45ANDXh+
 5Ezhjggqa91ohUSBRJgwc9MGsZy0GxOneDWPMpXAIDe9OlRkSC0=
 =zCUA
 -----END PGP SIGNATURE-----

Merge tag '9p-for-5.10-rc7' of git://github.com/martinetd/linux

Pull 9p fixes from Dominique Martinet:
 "Restore splice functionality for 9p"

* tag '9p-for-5.10-rc7' of git://github.com/martinetd/linux:
  fs: 9p: add generic splice_write file operation
  fs: 9p: add generic splice_read file operations
2020-12-03 11:47:13 -08:00
Bob Peterson
6e5c4ea37a gfs2: in signal_our_withdraw wait for unfreeze of _this_ fs only
Function signal_our_withdraw needs to work on file systems that have been
partially frozen. To do this, it called flush_workqueue(gfs2_freeze_wq).
This this wrong because it waits for *ALL* file systems to be unfrozen, not
just the one we're withdrawing from. It should only wait for the targetted
file system to be unfrozen. Otherwise it would wait until ALL file systems
are thawed before signaling the withdraw.

This patch changes signal_our_withdraw so it calls flush_work() for the target
file system's freeze work (only) to be completed.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-12-03 17:04:41 +01:00
Bob Peterson
dd64fe8167 gfs2: Remove sb_start_write from gfs2_statfs_sync
Before this patch, function gfs2_statfs_sync called sb_start_write and
sb_end_write. This is completely unnecessary because, aside from grabbing
glocks, gfs2_statfs_sync does all its updates to statfs with a transaction:
gfs2_trans_begin and _end. And transactions always do sb_start_intwrite in
gfs2_trans_begin and sb_end_intwrite in gfs2_trans_end.

This patch simply removes the call to sb_start_write.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-12-03 17:04:24 +01:00
Chunguang Xu
ce3cca3374 ext4: simplify the code of mb_find_order_for_block
The code of mb_find_order_for_block is a bit obscure, but we can
simplify it with mb_find_buddy(), make the code more concise.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1604764698-4269-3-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-03 09:42:41 -05:00
Chunguang Xu
6bd97bf273 ext4: remove redundant mb_regenerate_buddy()
After this patch (163a203), if an abnormal bitmap is detected, we
will mark the group as corrupt, and we will not use this group in
the future. Therefore, it should be meaningless to regenerate the
buddy bitmap of this group, It might be better to delete it.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1604764698-4269-2-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-03 09:42:39 -05:00
Amir Goldstein
1a2620a998 inotify: convert to handle_inode_event() interface
Convert inotify to use the simple handle_inode_event() interface to
get rid of the code duplication between the generic helper
fsnotify_handle_event() and the inotify_handle_event() callback, which
also happen to be buggy code.

The bug will be fixed in the generic helper.

Link: https://lore.kernel.org/r/20201202120713.702387-3-amir73il@gmail.com
CC: stable@vger.kernel.org
Fixes: b9a1b97725 ("fsnotify: create method handle_inode_event() in fsnotify_operations")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-12-03 15:41:29 +01:00
Chunguang Xu
837c23fbc1 ext4: use ASSERT() to replace J_ASSERT()
There are currently multiple forms of assertion, such as J_ASSERT().
J_ASEERT() is provided for the jbd module, which is a public module.
Maybe we should use custom ASSERT() like other file systems, such as
xfs, which would be better.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1604764698-4269-1-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-03 09:36:57 -05:00
Roman Anufriev
ca9b404ff1 ext4: print quota journalling mode on (re-)mount
Right now, it is hard to understand which quota journalling type is enabled:
you need to be quite familiar with kernel code and trace it or really
understand what different combinations of fs flags/mount options lead to.

This patch adds printing of current quota jounalling mode on each
mount/remount, thus making it easier to check it at a glance/in autotests.
The semantics is similar to ext4 data journalling modes:

* journalled - quota configured, journalling will be enabled
* writeback  - quota configured, journalling won't be enabled
* none       - quota isn't configured
* disabled   - kernel compiled without CONFIG_QUOTA feature

Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1603336860-16153-2-git-send-email-dotdot@yandex-team.ru
Signed-off-by: Roman Anufriev <dotdot@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-03 09:18:48 -05:00
Roman Anufriev
f177ee0882 ext4: add helpers for checking whether quota can be enabled/is journalled
Right now, there are several places, where we check whether fs is
capable of enabling quota or if quota is journalled with quite long
and non-self-descriptive condition statements.

This patch wraps these statements into helpers for better readability
and easier usage.

Link: https://lore.kernel.org/r/1603336860-16153-1-git-send-email-dotdot@yandex-team.ru
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Roman Anufriev <dotdot@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-03 09:18:48 -05:00
Colin Ian King
face525ecb ext4: remove redundant assignment of variable ex
Variable ex is assigned a variable that is not being read, the assignment
is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20201021132326.148052-1-colin.king@canonical.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-03 09:18:48 -05:00
Xianting Tian
46bac53529 ext4: remove the null check of bio_vec page
bv_page can't be NULL in a valid bio_vec, so we can remove the NULL check,
as we did in other places when calling bio_for_each_segment_all() to go
through all bio_vec of a bio.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Link: https://lore.kernel.org/r/20201020082201.34257-1-tian.xianting@h3c.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-03 09:15:29 -05:00
Kaixu Xia
7b721e6d33 ext4: remove redundant operation that set bh to NULL
The out_fail branch path don't release the bh and the second bh is
valid only in the for statement, so we don't need to set them to NULL.

Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/1603194069-17557-1-git-send-email-kaixuxia@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-03 09:15:29 -05:00
Amir Goldstein
950cc0d2be fsnotify: generalize handle_inode_event()
The handle_inode_event() interface was added as (quoting comment):
"a simple variant of handle_event() for groups that only have inode
marks and don't have ignore mask".

In other words, all backends except fanotify.  The inotify backend
also falls under this category, but because it required extra arguments
it was left out of the initial pass of backends conversion to the
simple interface.

This results in code duplication between the generic helper
fsnotify_handle_event() and the inotify_handle_event() callback
which also happen to be buggy code.

Generalize the handle_inode_event() arguments and add the check for
FS_EXCL_UNLINK flag to the generic helper, so inotify backend could
be converted to use the simple interface.

Link: https://lore.kernel.org/r/20201202120713.702387-2-amir73il@gmail.com
CC: stable@vger.kernel.org
Fixes: b9a1b97725 ("fsnotify: create method handle_inode_event() in fsnotify_operations")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-12-03 14:58:35 +01:00
Aleksa Sarai
398840f8bb
openat2: reject RESOLVE_BENEATH|RESOLVE_IN_ROOT
This was an oversight in the original implementation, as it makes no
sense to specify both scoping flags to the same openat2(2) invocation
(before this patch, the result of such an invocation was equivalent to
RESOLVE_IN_ROOT being ignored).

This is a userspace-visible ABI change, but the only user of openat2(2)
at the moment is LXC which doesn't specify both flags and so no
userspace programs will break as a result.

Fixes: fddb5d430a ("open: introduce openat2(2) syscall")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: <stable@vger.kernel.org> # v5.6+
Link: https://lore.kernel.org/r/20201027235044.5240-2-cyphar@cyphar.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-12-03 10:15:50 +01:00
Jaegeuk Kim
a95ba66ac1 f2fs: avoid race condition for shrinker count
Light reported sometimes shinker gets nat_cnt < dirty_nat_cnt resulting in
wrong do_shinker work. Let's avoid to return insane overflowed value by adding
single tracking value.

Reported-by: Light Hsieh <Light.Hsieh@mediatek.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-03 00:59:26 -08:00
Daeho Jeong
5fdb322ff2 f2fs: add F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE
Added two ioctl to decompress/compress explicitly the compression
enabled file in "compress_mode=user" mount option.

Using these two ioctls, the users can make a control of compression
and decompression of their files.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-03 00:12:08 -08:00
Daeho Jeong
602a16d58e f2fs: add compress_mode mount option
We will add a new "compress_mode" mount option to control file
compression mode. This supports "fs" and "user". In "fs" mode (default),
f2fs does automatic compression on the compression enabled files.
In "user" mode, f2fs disables the automaic compression and gives the
user discretion of choosing the target file and the timing. It means
the user can do manual compression/decompression on the compression
enabled files using ioctls.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-03 00:11:57 -08:00
Yangtao Li
db48965264 f2fs: Remove unnecessary unlikely()
WARN_ON() already contains an unlikely(), so it's not necessary
to use unlikely.

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Shuosheng Huang <huangshuosheng@allwinnertech.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:23 -08:00
Jack Qiu
5335bfc6eb f2fs: init dirty_secmap incorrectly
section is dirty, but dirty_secmap may not set

Reported-by: Jia Yang <jiayang5@huawei.com>
Fixes: da52f8ade4 ("f2fs: get the right gc victim section when section has several segments")
Cc: <stable@vger.kernel.org>
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:23 -08:00
Jaegeuk Kim
b876f4c94c f2fs: remove buffer_head which has 32bits limit
This patch removes buffer_head dependency when getting block addresses.
Light reported there's a 32bit issue in f2fs_fiemap where map_bh.b_size is
32bits while len is 64bits given by user. This will give wrong length to
f2fs_map_block.

Reported-by: Light Hsieh <Light.Hsieh@mediatek.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:23 -08:00
Jaegeuk Kim
963ba7f983 f2fs: fix wrong block count instead of bytes
We should convert cur_lblock, a block count, to bytes for len.

Fixes: af4b6b8edf ("f2fs: introduce check_swap_activate_fast()")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:22 -08:00
Jaegeuk Kim
43b9d4b4d9 f2fs: use new conversion functions between blks and bytes
This patch cleans up blks and bytes conversions.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:22 -08:00
Jaegeuk Kim
6cbfcab5ff f2fs: rename logical_to_blk and blk_to_logical
This patch renames two functions like below having u64.
 - logical_to_blk to bytes_to_blks
 - blk_to_logical to blks_to_bytes

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:22 -08:00
Chao Yu
3a0a9cbc44 f2fs: fix kbytes written stat for multi-device case
For multi-device case, one f2fs image includes multi devices, so it
needs to account bytes written of all block devices belong to the image
rather than one main block device, fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:22 -08:00
Chao Yu
b28f047b28 f2fs: compress: support chksum
This patch supports to store chksum value with compressed
data, and verify the integrality of compressed data while
reading the data.

The feature can be enabled through specifying mount option
'compress_chksum'.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:22 -08:00
Chao Yu
493720a485 f2fs: fix to avoid REQ_TIME and CP_TIME collision
Lei Li reported a issue: if foreground operations are frequent, background
checkpoint may be always skipped due to below check, result in losing more
data after sudden power-cut.

f2fs_balance_fs_bg()
...
	if (!is_idle(sbi, REQ_TIME) &&
		(!excess_dirty_nats(sbi) && !excess_dirty_nodes(sbi)))
		return;

E.g:
cp_interval = 5 second
idle_interval = 2 second
foreground operation interval = 1 second (append 1 byte per second into file)

In such case, no matter when it calls f2fs_balance_fs_bg(), is_idle(, REQ_TIME)
returns false, result in skipping background checkpoint.

This patch changes as below to make trigger condition being more reasonable:
- trigger sync_fs() if dirty_{nats,nodes} and prefree segs exceeds threshold;
- skip triggering sync_fs() if there is any background inflight IO or there is
foreground operation recently and meanwhile cp_rwsem is being held by someone;

Reported-by: Lei Li <noctis.akm@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:21 -08:00
Sahitya Tummala
8769918bf0 f2fs: change to use rwsem for cp_mutex
Use rwsem to ensure serialization of the callers and to avoid
starvation of high priority tasks, when the system is under
heavy IO workload.

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:21 -08:00
Daniel Rosenberg
7ad08a58bf f2fs: Handle casefolding with Encryption
Expand f2fs's casefolding support to include encrypted directories.  To
index casefolded+encrypted directories, we use the SipHash of the
casefolded name, keyed by a key derived from the directory's fscrypt
master key.  This ensures that the dirhash doesn't leak information
about the plaintext filenames.

Encryption keys are unavailable during roll-forward recovery, so we
can't compute the dirhash when recovering a new dentry in an encrypted +
casefolded directory.  To avoid having to force a checkpoint when a new
file is fsync'ed, store the dirhash on-disk appended to i_name.

This patch incorporates work by Eric Biggers <ebiggers@google.com>
and Jaegeuk Kim <jaegeuk@kernel.org>.

Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:21 -08:00
Daniel Rosenberg
bb9cd9106b fscrypt: Have filesystems handle their d_ops
This shifts the responsibility of setting up dentry operations from
fscrypt to the individual filesystems, allowing them to have their own
operations while still setting fscrypt's d_revalidate as appropriate.

Most filesystems can just use generic_set_encrypted_ci_d_ops, unless
they have their own specific dentry operations as well. That operation
will set the minimal d_ops required under the circumstances.

Since the fscrypt d_ops are set later on, we must set all d_ops there,
since we cannot adjust those later on. This should not result in any
change in behavior.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:21 -08:00
Daniel Rosenberg
608af70351 libfs: Add generic function for setting dentry_ops
This adds a function to set dentry operations at lookup time that will
work for both encrypted filenames and casefolded filenames.

A filesystem that supports both features simultaneously can use this
function during lookup preparations to set up its dentry operations once
fscrypt no longer does that itself.

Currently the casefolding dentry operation are always set if the
filesystem defines an encoding because the features is toggleable on
empty directories. Unlike in the encryption case, the dentry operations
used come from the parent. Since we don't know what set of functions
we'll eventually need, and cannot change them later, we enable the
casefolding operations if the filesystem supports them at all.

By splitting out the various cases, we support as few dentry operations
as we can get away with, maximizing compatibility with overlayfs, which
will not function if a filesystem supports certain dentry_operations.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:21 -08:00
Zhang Qilong
beb78181f1 f2fs: Remove the redundancy initialization
There are two assignments are meaningless, and remove them.

Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:21 -08:00
Liu Song
9f7e334aec f2fs: remove writeback_inodes_sb in f2fs_remount
Since sync_inodes_sb has been used, there is no need to
use writeback_inodes_sb, so remove it.

Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:20 -08:00
Hyeongseok Kim
89ff600503 f2fs: fix double free of unicode map
In case of retrying fill_super with skip_recovery,
s_encoding for casefold would not be loaded again even though it's
already been freed because it's not NULL.
Set NULL after free to prevent double freeing when unmount.

Fixes: eca4873ee1 ("f2fs: Use generic casefolding support")
Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:20 -08:00
Chao Yu
34178b1bc4 f2fs: fix compat F2FS_IOC_{MOVE,GARBAGE_COLLECT}_RANGE
Eric reported a ioctl bug in below link:

https://lore.kernel.org/linux-f2fs-devel/20201103032234.GB2875@sol.localdomain/

That said, on some 32-bit architectures, u64 has only 32-bit alignment,
notably i386 and x86_32, so that size of struct f2fs_gc_range compiled
in x86_32 is 20 bytes, however the size in x86_64 is 24 bytes, binary
compiled in x86_32 can not call F2FS_IOC_GARBAGE_COLLECT_RANGE successfully
due to mismatched value of ioctl command in between binary and f2fs
module, similarly, F2FS_IOC_MOVE_RANGE will fail too.

In this patch we introduce two ioctls for compatibility of above special
32-bit binary:
- F2FS_IOC32_GARBAGE_COLLECT_RANGE
- F2FS_IOC32_MOVE_RANGE

Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:20 -08:00
Chao Yu
3a1b9eaf72 f2fs: avoid unneeded data copy in f2fs_ioc_move_range()
Fields in struct f2fs_move_range won't change in f2fs_ioc_move_range(),
let's avoid copying this structure's data to userspace.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 22:00:20 -08:00
Daeho Jeong
e1e8debec6 f2fs: add F2FS_IOC_SET_COMPRESS_OPTION ioctl
Added a new F2FS_IOC_SET_COMPRESS_OPTION ioctl to change file
compression option of a file.

struct f2fs_comp_option {
    u8 algorithm;         => compression algorithm
                          => 0:lzo, 1:lz4, 2:zstd, 3:lzorle
    u8 log_cluster_size;  => log scale cluster size
                          => 2 ~ 8
};

struct f2fs_comp_option option;

option.algorithm = 1;
option.log_cluster_size = 7;

ioctl(fd, F2FS_IOC_SET_COMPRESS_OPTION, &option);

Signed-off-by: Daeho Jeong <daehojeong@google.com>
[Chao Yu: remove f2fs_is_compress_algorithm_valid()]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-12-02 21:59:38 -08:00
Roman Gushchin
bcfe06bf26 mm: memcontrol: Use helpers to read page's memcg data
Patch series "mm: allow mapping accounted kernel pages to userspace", v6.

Currently a non-slab kernel page which has been charged to a memory cgroup
can't be mapped to userspace.  The underlying reason is simple: PageKmemcg
flag is defined as a page type (like buddy, offline, etc), so it takes a
bit from a page->mapped counter.  Pages with a type set can't be mapped to
userspace.

But in general the kmemcg flag has nothing to do with mapping to
userspace.  It only means that the page has been accounted by the page
allocator, so it has to be properly uncharged on release.

Some bpf maps are mapping the vmalloc-based memory to userspace, and their
memory can't be accounted because of this implementation detail.

This patchset removes this limitation by moving the PageKmemcg flag into
one of the free bits of the page->mem_cgroup pointer.  Also it formalizes
accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
adds several checks and removes a couple of obsolete functions.  As the
result the code became more robust with fewer open-coded bit tricks.

This patch (of 4):

Currently there are many open-coded reads of the page->mem_cgroup pointer,
as well as a couple of read helpers, which are barely used.

It creates an obstacle on a way to reuse some bits of the pointer for
storing additional bits of information.  In fact, we already do this for
slab pages, where the last bit indicates that a pointer has an attached
vector of objcg pointers instead of a regular memcg pointer.

This commits uses 2 existing helpers and introduces a new helper to
converts all read sides to calls of these helpers:
  struct mem_cgroup *page_memcg(struct page *page);
  struct mem_cgroup *page_memcg_rcu(struct page *page);
  struct mem_cgroup *page_memcg_check(struct page *page);

page_memcg_check() is intended to be used in cases when the page can be a
slab page and have a memcg pointer pointing at objcg vector.  It does
check the lowest bit, and if set, returns NULL.  page_memcg() contains a
VM_BUG_ON_PAGE() check for the page not being a slab page.

To make sure nobody uses a direct access, struct page's
mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
2020-12-02 18:28:05 -08:00
Eric Biggers
a14d0b6764 fscrypt: allow deleting files with unsupported encryption policy
Currently it's impossible to delete files that use an unsupported
encryption policy, as the kernel will just return an error when
performing any operation on the top-level encrypted directory, even just
a path lookup into the directory or opening the directory for readdir.

More specifically, this occurs in any of the following cases:

- The encryption context has an unrecognized version number.  Current
  kernels know about v1 and v2, but there could be more versions in the
  future.

- The encryption context has unrecognized encryption modes
  (FSCRYPT_MODE_*) or flags (FSCRYPT_POLICY_FLAG_*), an unrecognized
  combination of modes, or reserved bits set.

- The encryption key has been added and the encryption modes are
  recognized but aren't available in the crypto API -- for example, a
  directory is encrypted with FSCRYPT_MODE_ADIANTUM but the kernel
  doesn't have CONFIG_CRYPTO_ADIANTUM enabled.

It's desirable to return errors for most operations on files that use an
unsupported encryption policy, but the current behavior is too strict.
We need to allow enough to delete files, so that people can't be stuck
with undeletable files when downgrading kernel versions.  That includes
allowing directories to be listed and allowing dentries to be looked up.

Fix this by modifying the key setup logic to treat an unsupported
encryption policy in the same way as "key unavailable" in the cases that
are required for a recursive delete to work: preparing for a readdir or
a dentry lookup, revalidating a dentry, or checking whether an inode has
the same encryption policy as its parent directory.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201203022041.230976-10-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-12-02 18:25:01 -08:00
Eric Biggers
5b421f0880 fscrypt: unexport fscrypt_get_encryption_info()
Now that fscrypt_get_encryption_info() is only called from files in
fs/crypto/ (due to all key setup now being handled by higher-level
helper functions instead of directly by filesystems), unexport it and
move its declaration to fscrypt_private.h.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201203022041.230976-9-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-12-02 18:25:01 -08:00
Eric Biggers
de3cdc6e75 fscrypt: move fscrypt_require_key() to fscrypt_private.h
fscrypt_require_key() is now only used by files in fs/crypto/.  So
reduce its visibility to fscrypt_private.h.  This is also a prerequsite
for unexporting fscrypt_get_encryption_info().

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201203022041.230976-8-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-12-02 18:25:01 -08:00
Eric Biggers
7622350e5e fscrypt: move body of fscrypt_prepare_setattr() out-of-line
In preparation for reducing the visibility of fscrypt_require_key() by
moving it to fscrypt_private.h, move the call to it from
fscrypt_prepare_setattr() to an out-of-line function.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201203022041.230976-7-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-12-02 18:25:01 -08:00
Eric Biggers
ec0caa974c fscrypt: introduce fscrypt_prepare_readdir()
The last remaining use of fscrypt_get_encryption_info() from filesystems
is for readdir (->iterate_shared()).  Every other call is now in
fs/crypto/ as part of some other higher-level operation.

We need to add a new argument to fscrypt_get_encryption_info() to
indicate whether the encryption policy is allowed to be unrecognized or
not.  Doing this is easier if we can work with high-level operations
rather than direct filesystem use of fscrypt_get_encryption_info().

So add a function fscrypt_prepare_readdir() which wraps the call to
fscrypt_get_encryption_info() for the readdir use case.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201203022041.230976-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-12-02 18:25:01 -08:00
Eric Biggers
91d0d89241 ext4: don't call fscrypt_get_encryption_info() from dx_show_leaf()
The call to fscrypt_get_encryption_info() in dx_show_leaf() is too low
in the call tree; fscrypt_get_encryption_info() should have already been
called when starting the directory operation.  And indeed, it already
is.  Moreover, the encryption key is guaranteed to already be available
because dx_show_leaf() is only called when adding a new directory entry.

And even if the key wasn't available, dx_show_leaf() uses
fscrypt_fname_disk_to_usr() which knows how to create a no-key name.

So for the above reasons, and because it would be desirable to stop
exporting fscrypt_get_encryption_info() directly to filesystems, remove
the call to fscrypt_get_encryption_info() from dx_show_leaf().

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201203022041.230976-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-12-02 18:25:01 -08:00
Eric Biggers
a302052b95 ubifs: remove ubifs_dir_open()
Since encrypted directories can be opened and searched without their key
being available, and each readdir and ->lookup() tries to set up the
key, trying to set up the key in ->open() too isn't really useful.

Just remove it so that directories don't need an ->open() method
anymore, and so that we eliminate a use of fscrypt_get_encryption_info()
(which I'd like to stop exporting to filesystems).

Link: https://lore.kernel.org/r/20201203022041.230976-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-12-02 18:25:01 -08:00
Eric Biggers
73114b6d28 f2fs: remove f2fs_dir_open()
Since encrypted directories can be opened and searched without their key
being available, and each readdir and ->lookup() tries to set up the
key, trying to set up the key in ->open() too isn't really useful.

Just remove it so that directories don't need an ->open() method
anymore, and so that we eliminate a use of fscrypt_get_encryption_info()
(which I'd like to stop exporting to filesystems).

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20201203022041.230976-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-12-02 18:25:01 -08:00
Eric Biggers
65f62515e9 ext4: remove ext4_dir_open()
Since encrypted directories can be opened and searched without their key
being available, and each readdir and ->lookup() tries to set up the
key, trying to set up the key in ->open() too isn't really useful.

Just remove it so that directories don't need an ->open() method
anymore, and so that we eliminate a use of fscrypt_get_encryption_info()
(which I'd like to stop exporting to filesystems).

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201203022041.230976-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-12-02 18:25:01 -08:00
Linus Torvalds
34816d20f1 Various gfs2 fixes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl/H/pUUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqxhRAAuTKHz1jMYHNAWm1NVVRiGVENhjlV
 g5kRcRmf8m/VAWmI+HM0J34CbHpWc1IgJnjID/vZ/VB8QQPmoDoEFrD/ple9V6A4
 xQJzQcG9uvLir5h43R2SWR6OSSijtEDjvpQpRQSK3B2e1otRV+W7mLjPBkg410rY
 l9Cj01QhCwCv1dVO7VknSEB1Jpkhe0Vvr1z4DropeU5JEBSADTl5deTkFWbinDo1
 9nygDw01AdWXhR1FNJTAy8XJYlcM3dx6xDuHD9xJH9HciWsocOr+EqtPfJEEsCSf
 X6CrWY3mYZBr1kJFcvp2HD5rKYhw0xC0gZiHEnvmZx5XKTKoDKq2+2s2wQOwWk3B
 QIvEfLoAzV6mKBHRC8mCDw2bImj8FDrvBkmmJnlyL6nJDrriBXimFAJPX7MryKTJ
 jdvm0JFRyY0ja3LxLxBVwiKnKiw6yY6fPjpMWzi042P/PhshRSvazxgUf+pc69CB
 Czo3W9MnNGWzP73ALIsuVxG81RW48CvAJdXOmoWIqjEeqnC0YZ3vgwInQcqBzton
 fCI38XN6CG1UbwaVxoZGoPS5FIfBbrTL5v62e3U8JIv3J4ol5fWCEruUhImsByGI
 w3UkiuNqWjOab/774m9ZS0eBXCEn6LAizvZSgf3pt+8VRoN20RxF3k1Jhvm7g7I3
 IbOHK9kddrycH9Q=
 =EHzo
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v5.10-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fixes from Andreas Gruenbacher:
 "Various gfs2 fixes"

* tag 'gfs2-v5.10-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix deadlock between gfs2_{create_inode,inode_lookup} and delete_work_func
  gfs2: Upgrade shared glocks for atime updates
  gfs2: Don't freeze the file system during unmount
  gfs2: check for empty rgrp tree in gfs2_ri_update
  gfs2: set lockdep subclass for iopen glocks
  gfs2: Fix deadlock dumping resource group glocks
2020-12-02 17:25:23 -08:00
Arpitha Raghunandan
5f6b99d028 fs: ext4: Modify inode-test.c to use KUnit parameterized testing feature
Modify fs/ext4/inode-test.c to use the parameterized testing
feature of KUnit.

Signed-off-by: Arpitha Raghunandan <98.arpi@gmail.com>
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Tested-by: David Gow <davidgow@google.com>
Reviewed-by: David Gow <davidgow@google.com>
Reviewed-by: Iurii Zaikin <yzaikin@google.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2020-12-02 16:07:25 -07:00
NeilBrown
bf701b765e NFS: switch nfsiod to be an UNBOUND workqueue.
nfsiod is currently a concurrency-managed workqueue (CMWQ).
This means that workitems scheduled to nfsiod on a given CPU are queued
behind all other work items queued on any CMWQ on the same CPU.  This
can introduce unexpected latency.

Occaionally nfsiod can even cause excessive latency.  If the work item
to complete a CLOSE request calls the final iput() on an inode, the
address_space of that inode will be dismantled.  This takes time
proportional to the number of in-memory pages, which on a large host
working on large files (e.g..  5TB), can be a large number of pages
resulting in a noticable number of seconds.

We can avoid these latency problems by switching nfsiod to WQ_UNBOUND.
This causes each concurrent work item to gets a dedicated thread which
can be scheduled to an idle CPU.

There is precedent for this as several other filesystems use WQ_UNBOUND
workqueue for handling various async events.

Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: ada609ee2a ("workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:54 -05:00
Calum Mackay
9b82d88d59 lockd: don't use interval-based rebinding over TCP
NLM uses an interval-based rebinding, i.e. it clears the transport's
binding under certain conditions if more than 60 seconds have elapsed
since the connection was last bound.

This rebinding is not necessary for an autobind RPC client over a
connection-oriented protocol like TCP.

It can also cause problems: it is possible for nlm_bind_host() to clear
XPRT_BOUND whilst a connection worker is in the middle of trying to
reconnect, after it had already been checked in xprt_connect().

When the connection worker notices that XPRT_BOUND has been cleared
under it, in xs_tcp_finish_connecting(), that results in:

	xs_tcp_setup_socket: connect returned unhandled error -107

Worse, it's possible that the two can get into lockstep, resulting in
the same behaviour repeated indefinitely, with the above error every
300 seconds, without ever recovering, and the connection never being
established. This has been seen in practice, with a large number of NLM
client tasks, following a server restart.

The existing callers of nlm_bind_host & nlm_rebind_host should not need
to force the rebind, for TCP, so restrict the interval-based rebinding
to UDP only.

For TCP, we will still rebind when needed, e.g. on timeout, and connection
error (including closure), since connection-related errors on an existing
connection, ECONNREFUSED when trying to connect, and rpc_check_timeout(),
already unconditionally clear XPRT_BOUND.

To avoid having to add the fix, and explanation, to both nlm_bind_host()
and nlm_rebind_host(), remove the duplicate code from the former, and
have it call the latter.

Drop the dprintk, which adds no value over a trace.

Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Fixes: 35f5a422ce ("SUNRPC: new interface to force an RPC rebind")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:54 -05:00
Sargun Dhillon
d3ff46fe69 NFSv4: Refactor to use user namespaces for nfs4idmap
In several patches work has been done to enable NFSv4 to use user
namespaces:
58002399da: NFSv4: Convert the NFS client idmapper to use the container user namespace
3b7eb5e35d: NFS: When mounting, don't share filesystems between different user namespaces

Unfortunately, the userspace APIs were only such that the userspace facing
side of the filesystem (superblock s_user_ns) could be set to a non init
user namespace. This furthers the fs_context related refactoring, and
piggybacks on top of that logic, so the superblock user namespace, and the
NFS user namespace are the same.

Users can still use rpc.idmapd if they choose to, but there are complexities
with user namespaces and request-key that have yet to be addresssed.

Eventually, we will need to at least:
  * Come up with an upcall mechanism that can be triggered inside of the container,
    or safely triggered outside, with the requisite context to do the right
    mapping. * Handle whatever refactoring needs to be done in net/sunrpc.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Tested-by: Alban Crequy <alban.crequy@gmail.com>
Fixes: 62a55d088c ("NFS: Additional refactoring for fs_context conversion")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:54 -05:00
Sargun Dhillon
d18a9d3fa0 NFS: NFSv2/NFSv3: Use cred from fs_context during mount
There was refactoring done to use the fs_context for mounting done in:
62a55d088c: NFS: Additional refactoring for fs_context conversion

This made it so that the net_ns is fetched from the fs_context (the netns
that fsopen is called in). This change also makes it so that the credential
fetched during fsopen is used as well as the net_ns.

NFS has already had a number of changes to prepare it for user namespaces:
1a58e8a0e5: NFS: Store the credential of the mount process in the nfs_server
264d948ce7: NFS: Convert NFSv3 to use the container user namespace
c207db2f5d: NFS: Convert NFSv2 to use the container user namespace

Previously, different credentials could be used for creation of the
fs_context versus creation of the nfs_server, as FSCONFIG_CMD_CREATE did
the actual credential check, and that's where current_creds() were fetched.
This meant that the user namespace which fsopen was called in could be a
non-init user namespace. This still requires that the user that calls
FSCONFIG_CMD_CREATE has CAP_SYS_ADMIN in the init user ns.

This roughly allows a privileged user to mount on behalf of an unprivileged
usernamespace, by forking off and calling fsopen in the unprivileged user
namespace. It can then pass back that fsfd to the privileged process which
can configure the NFS mount, and then it can call FSCONFIG_CMD_CREATE
before switching back into the mount namespace of the container, and finish
up the mounting process and call fsmount and move_mount.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Tested-by: Alban Crequy <alban.crequy@gmail.com>
Fixes: 62a55d088c ("NFS: Additional refactoring for fs_context conversion")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:54 -05:00
Trond Myklebust
b6d49ecd10 NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode
When returning the layout in nfs4_evict_inode(), we need to ensure that
the layout is actually done being freed before we can proceed to free the
inode itself.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:54 -05:00
Trond Myklebust
17068466ad NFSv4: Fix open coded xdr_stream_remaining()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:54 -05:00
Trond Myklebust
9ed5af268e SUNRPC: Clean up the handling of page padding in rpc_prepare_reply_pages()
rpc_prepare_reply_pages() currently expects the 'hdrsize' argument to
contain the length of the data that we expect to want placed in the head
kvec plus a count of 1 word of padding that is placed after the page data.
This is very confusing when trying to read the code, and sometimes leads
to callers adding an arbitrary value of '1' just in order to satisfy the
requirement (whether or not the page data actually needs such padding).

This patch aims to clarify the code by changing the 'hdrsize' argument
to remove that 1 word of padding. This means we need to subtract the
padding from all the existing callers.

Fixes: 02ef04e432 ("NFS: Account for XDR pad of buf->pages")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
046e5ccb41 NFSv4: Fix the alignment of page data in the getdeviceinfo reply
We can fit the device_addr4 opaque data padding in the pages.

Fixes: cf500bac8f ("SUNRPC: Introduce rpc_prepare_reply_pages()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
9889981349 pNFS: Clean up open coded xdr string decoding
Use the existing xdr_stream_decode_string_dup() to safely decode into
kmalloced strings.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
9a7016319e pNFS/flexfiles: Fix up layoutstats reporting for non-TCP transports
Ensure that we report the correct netid when using UDP or RDMA
transports to the DSes.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
4be78d2681 NFSv4/pNFS: Store the transport type in struct nfs4_pnfs_ds_addr
We want to enable RDMA and UDP as valid transport methods if a
GETDEVICEINFO call specifies it. Do so by adding a parser for the
netid that translates it to an appropriate argument for the RPC
transport layer.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
190c75a31f pNFS: Add helpers for allocation/free of struct nfs4_pnfs_ds_addr
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
a12f996d34 NFSv4/pNFS: Use connections to a DS that are all of the same protocol family
If the pNFS metadata server advertises multiple addresses for the same
data server, we should try to connect to just one protocol family and
transport type on the assumption that homogeneity will improve performance.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
1c3695d0bb NFS: Switch mount code to use xprt_find_transport_ident()
Switch the mount code to use xprt_find_transport_ident() and to check
the results before allowing the mount to proceed.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
794092c57f NFS: Do uncached readdir when we're seeking a cookie in an empty page cache
If the directory is changing, causing the page cache to get invalidated
while we are listing the contents, then the NFS client is currently forced
to read in the entire directory contents from scratch, because it needs
to perform a linear search for the readdir cookie. While this is not
an issue for small directories, it does not scale to directories with
millions of entries.
In order to be able to deal with large directories that are changing,
add a heuristic to ensure that if the page cache is empty, and we are
searching for a cookie that is not the zero cookie, we just default to
performing uncached readdir.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
35df59d3ef NFS: Reduce number of RPC calls when doing uncached readdir
If we're doing uncached readdir, allocate multiple pages in order to
try to avoid duplicate RPC calls for the same getdents() call.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
762567b7c7 NFS: Optimisations for monotonically increasing readdir cookies
If the server is handing out monotonically increasing readdir cookie values,
then we can optimise away searches through pages that contain cookies that
lie outside our search range.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
b593c09f83 NFS: Improve handling of directory verifiers
If the server insists on using the readdir verifiers in order to allow
cookies to expire, then we should ensure that we cache the verifier
with the cookie, so that we can return an error if the application
tries to use the expired cookie.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
9fff59ed4c NFS: Handle NFS4ERR_NOT_SAME and NFSERR_BADCOOKIE from readdir calls
If the server returns NFS4ERR_NOT_SAME or tells us that the cookie is
bad in response to a READDIR call, then we should empty the page cache
so that we can fill it from scratch again.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
82e22a5e62 NFS: Allow the NFS generic code to pass in a verifier to readdir
If we're ever going to allow support for servers that use the readdir
verifier, then that use needs to be managed by the middle layers as
those need to be able to reject cookies from other verifiers.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
6c981eff23 NFS: Cleanup to remove nfs_readdir_descriptor_t typedef
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
6b75cf9e30 NFS: Reduce readdir stack usage
The descriptor and the struct nfs_entry are both large structures,
so don't allocate them from the stack.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
dbeaf8c984 NFS: nfs_do_filldir() does not return a value
Clean up nfs_do_filldir().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
93b8959a0a NFS: More readdir cleanups
Remove the redundant caching of the credential in struct
nfs_open_dir_context.
Pass the buffer size as an argument to nfs_readdir_xdr_filler().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
1a34c8c9a4 NFS: Support larger readdir buffers
Support readdir buffers of up to 1MB in size so that we can read
large directories using few RPC calls.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
a52a8a6ada NFS: Simplify struct nfs_cache_array_entry
We don't need to store a hash, so replace struct qstr with a simple
const char pointer and length.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
ed09222d65 NFS: Replace kmap() with kmap_atomic() in nfs_readdir_search_array()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
e762a63981 NFS: Remove unnecessary kmap in nfs_readdir_xdr_to_array()
The kmapped pointer is only used once per loop to check if we need to
exit.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
3b2a09f127 NFS: Don't discard readdir results
If a readdir call returns more data than we can fit into one page
cache page, then allocate a new one for that data rather than
discarding the data.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
1f1d4aa4e4 NFS: Clean up directory array handling
Refactor to use pagecache_get_page() so that we can fill the page
in multiple stages.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
972bcdf233 NFS: Clean up nfs_readdir_page_filler()
Clean up handling of the case where there are no entries in the readdir
reply.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
b1e21c9743 NFS: Clean up readdir struct nfs_cache_array
Since the 'eof_index' is only ever used as a flag, make it so.
Also add a flag to detect if the page has been completely filled.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
2e7a464179 NFS: Ensure contents of struct nfs_open_dir_context are consistent
Ensure that the contents of struct nfs_open_dir_context are consistent
by setting them under the file->f_lock from a private copy (that is
known to be consistent).

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:51 -05:00
Olga Kornievskaia
05ad917561 NFSv4.2: condition READDIR's mask for security label based on LSM state
Currently, the client will always ask for security_labels if the server
returns that it supports that feature regardless of any LSM modules
(such as Selinux) enforcing security policy. This adds performance
penalty to the READDIR operation.

Client adjusts superblock's support of the security_label based on
the server's support but also current client's configuration of the
LSM modules. Thus, prior to using the default bitmask in READDIR,
this patch checks the server's capabilities and then instructs
READDIR to remove FATTR4_WORD2_SECURITY_LABEL from the bitmask.

v5: fixing silly mistakes of the rushed v4
v4: simplifying logic
v3: changing label's initialization per Ondrej's comment
v2: dropping selinux hook and using the sb cap.

Suggested-by: Ondrej Mosnacek <omosnace@redhat.com>
Suggested-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: 2b0143b5c9 ("VFS: normal filesystems (and lustre): d_inode() annotations")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
76998ebb91 NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp
We need to respect the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp,
by timing out if the server is unavailable.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
3c5e9a59fa NFSv3: Add emulation of the lookupp() operation
In order to use the open_by_filehandle() operations on NFSv3, we need
to be able to emulate lookupp() so that nfs_get_parent() can be used
to convert disconnected dentries into connected ones.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
5f447cb881 NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
We want to reuse the lookup code in NFSv3 in order to emulate the
NFSv4 lookupp operation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:51 -05:00
Dai Ngo
bd75475c2f NFSv4.2: Fix 5 seconds delay when doing inter server copy
Since commit b4868b44c5 ("NFSv4: Wait for stateid updates after
CLOSE/OPEN_DOWNGRADE"), every inter server copy operation suffers 5
seconds delay regardless of the size of the copy. The delay is from
nfs_set_open_stateid_locked when the check by nfs_stateid_is_sequential
fails because the seqid in both nfs4_state and nfs4_stateid are 0.

Fix __nfs42_ssc_open to delay setting of NFS_OPEN_STATE in nfs4_state,
until after the call to update_open_stateid, to indicate this is the 1st
open. This fix is part of a 2 patches, the other patch is the fix in the
source server to return the stateid for COPY_NOTIFY request with seqid 1
instead of 0.

Fixes: ce0887ac96 ("NFSD add nfs4 inter ssc to nfsd4_copy")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-12-02 13:45:33 -05:00
Chuck Lever
5482e09a88 NFS: Fix rpcrdma_inline_fixup() crash with new LISTXATTRS operation
By switching to an XFS-backed export, I am able to reproduce the
ibcomp worker crash on my client with xfstests generic/013.

For the failing LISTXATTRS operation, xdr_inline_pages() is called
with page_len=12 and buflen=128.

- When ->send_request() is called, rpcrdma_marshal_req() does not
  set up a Reply chunk because buflen is smaller than the inline
  threshold. Thus rpcrdma_convert_iovs() does not get invoked at
  all and the transport's XDRBUF_SPARSE_PAGES logic is not invoked
  on the receive buffer.

- During reply processing, rpcrdma_inline_fixup() tries to copy
  received data into rq_rcv_buf->pages because page_len is positive.
  But there are no receive pages because rpcrdma_marshal_req() never
  allocated them.

The result is that the ibcomp worker faults and dies. Sometimes that
causes a visible crash, and sometimes it results in a transport hang
without other symptoms.

RPC/RDMA's XDRBUF_SPARSE_PAGES support is not entirely correct, and
should eventually be fixed or replaced. However, my preference is
that upper-layer operations should explicitly allocate their receive
buffers (using GFP_KERNEL) when possible, rather than relying on
XDRBUF_SPARSE_PAGES.

Reported-by: Olga kornievskaia <kolga@netapp.com>
Suggested-by: Olga kornievskaia <kolga@netapp.com>
Fixes: c10a75145f ("NFSv4.2: add the extended attribute proc functions.")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Olga kornievskaia <kolga@netapp.com>
Reviewed-by: Frank van der Linden <fllinden@amazon.com>
Tested-by: Olga kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-12-02 13:19:20 -05:00
Gabriel Krisman Bertazi
1446e1df9e kernel: Implement selective syscall userspace redirection
Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS.  This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition.  This is particularly important for Windows
games running over Wine.

The proposed interface looks like this:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])

The range [<offset>,<offset>+<length>) is a part of the process memory
map that is allowed to by-pass the redirection code and dispatch
syscalls directly, such that in fast paths a process doesn't need to
disable the trap nor the kernel has to check the selector.  This is
essential to return from SIGSYS to a blocked area without triggering
another SIGSYS from rt_sigreturn.

selector is an optional pointer to a char-sized userspace memory region
that has a key switch for the mechanism. This key switch is set to
either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the
redirection without calling the kernel.

The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.

Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support.  I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism.  And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.

For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only the direct dispatcher region to issue syscalls is
pretty much irrelevant.  The overhead of using the selector goes around
40ns for a native (unredirected) syscall in my system, and it is (as
expected) dominated by the supervisor-mode user-address access.  In
fact, with SMAP off, the overhead is consistently less than 5ns on my
test box.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
2020-12-02 15:07:56 +01:00
Christoph Hellwig
977115c0f6 block: stop using bdget_disk for partition 0
We can just dereference the point in struct gendisk instead.  Also
remove the now unused export.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
0d02129e76 block: merge struct block_device and struct hd_struct
Instead of having two structures that represent each block device with
different life time rules, merge them into a single one.  This also
greatly simplifies the reference counting rules, as we can use the inode
reference count as the main reference count for the new struct
block_device, with the device model reference front ending it for device
model interaction.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
9499ffc752 f2fs: remove a few bd_part checks
bd_part is never NULL for a block device in use by a file system, so
remove the checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
8446fe9255 block: switch partition lookup to use struct block_device
Use struct block_device to lookup partitions on a disk.  This removes
all usage of struct hd_struct from the I/O path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de>			[bcache]
Acked-by: Chao Yu <yuchao0@huawei.com>			[f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
cb8432d650 block: allocate struct hd_struct as part of struct bdev_inode
Allocate hd_struct together with struct block_device to pre-load
the lifetime rule changes in preparation of merging the two structures.

Note that part0 was previously embedded into struct gendisk, but is
a separate allocation now, and already points to the block_device instead
of the hd_struct.  The lifetime of struct gendisk is still controlled by
the struct device embedded in the part0 hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
1bdd5ae025 block: move holder_dir to struct block_device
Move the holder_dir field to struct block_device in preparation for
kill struct hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
231926dbf0 block: move the partition_meta_info to struct block_device
Move the partition_meta_info to struct block_device in preparation for
killing struct hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
15e3d2c5cd block: move disk stat accounting to struct block_device
Move the dkstats and stamp field to struct block_device in preparation
of killing struct hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
a782483cc1 block: remove the nr_sects field in struct hd_struct
Now that the hd_struct always has a block device attached to it, there is
no need for having two size field that just get out of sync.

Additionally the field in hd_struct did not use proper serialization,
possibly allowing for torn writes.  By only using the block_device field
this problem also gets fixed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de>			[bcache]
Acked-by: Chao Yu <yuchao0@huawei.com>			[f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
e6cb53827e block: initialize struct block_device in bdev_alloc
Don't play tricks with slab constructors as bdev structures tends to not
get reused very much, and this makes the code a lot less error prone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
37c3fc9abb block: simplify the block device claiming interface
Stop passing the whole device as a separate argument given that it
can be trivially deducted and cleanup the !holder debug check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig
a954ea8120 block: remove ->bd_contains
Now that each hd_struct has a reference to the corresponding
block_device, there is no need for the bd_contains pointer.  Add
a bdev_whole() helper to look up the whole device block_device
struture instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig
22ae8ce8b8 block: simplify bdev/disk lookup in blkdev_get
To simplify block device lookup and a few other upcoming areas, make sure
that we always have a struct block_device available for each disk and
each partition, and only find existing block devices in bdget.  The only
downside of this is that each device and partition uses a little more
memory.  The upside will be that a lot of code can be simplified.

With that all we need to look up the block device is to lookup the inode
and do a few sanity checks on the gendisk, instead of the separate lookup
for the gendisk.  For blk-cgroup which wants to access a gendisk without
opening it, a new blkdev_{get,put}_no_open low-level interface is added
to replace the previous get_gendisk use.

Note that the change to look up block device directly instead of the two
step lookup using struct gendisk causes a subtile change in behavior:
accessing a non-existing partition on an existing block device can now
cause a call to request_module.  That call is harmless, and in practice
no recent system will access these nodes as they aren't created by udev
and static /dev/ setups are unusual.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig
4e7b5671c6 block: remove i_bdev
Switch the block device lookup interfaces to directly work with a dev_t
so that struct block_device references are only acquired by the
blkdev_get variants (and the blk-cgroup special case).  This means that
we now don't need an extra reference in the inode and can generally
simplify handling of struct block_device to keep the lookups contained
in the core block layer code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Coly Li <colyli@suse.de>		[bcache]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig
7918f0f6fd block: opencode devcgroup_inode_permission
Just call devcgroup_check_permission to avoid various superflous checks
and a double conversion of the access flags.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig
63d9932cae block: move bdput() to the callers of __blkdev_get
This will allow for a more symmetric calling convention going forward.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig
5b56b6ed57 block: refactor blkdev_get
Move more code that is only run on the outer open but not the open of
the underlying whole device when opening a partition into blkdev_get,
which leads to a much easier to follow structure.

This allows to simplify the disk and module refcounting so that one
reference is held for each open, similar to what we do with normal
file operations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig
ec5d451438 block: refactor __blkdev_put
Reorder the code to have one big section for the last close, and to use
bdev_is_partition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig
3a4174e686 block: switch bdgrab to use igrab
All of the current callers already have a reference, but to prepare for
additional users ensure bdgrab returns NULL if the block device is being
freed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig
612c6aa781 block: change the hash used for looking up block devices
Adding the minor to the major creates tons of pointless conflicts. Just
use the dev_t itself, which is 32-bits and thus is guaranteed to fit
into ino_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig
8d65269fe8 block: add a bdev_kobj helper
Add a little helper to find the kobject for a struct block_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Coly Li <colyli@suse.de>		[bcache]
Acked-by: David Sterba <dsterba@suse.com>	[btrfs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig
040f04bd2e fs: simplify freeze_bdev/thaw_bdev
Store the frozen superblock in struct block_device to avoid the awkward
interface that can return a sb only used a cookie, an ERR_PTR or NULL.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Chao Yu <yuchao0@huawei.com>		[f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:38 -07:00
Christoph Hellwig
60b498852b fs: remove get_super_thawed and get_super_exclusive_thawed
Just open code the wait in the only caller of both functions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:38 -07:00
Dominique Martinet
960f4f8a4e fs: 9p: add generic splice_write file operation
The default splice operations got removed recently, add it back to 9p
with iter_file_splice_write like many other filesystems do.

Link: http://lkml.kernel.org/r/1606837496-21717-1-git-send-email-asmadeus@codewreck.org
Fixes: 36e2c7421f ("fs: don't allow splice read/write without explicit ops")
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-12-01 21:40:47 +01:00
Vasile-Laurentiu Stanimir
26fecbf760 pstore: Move kmsg_bytes default into Kconfig
While kmsg_bytes can be set for pstore via mount, if a crash occurs
before the mount only partial console log will be stored as kmsg_bytes
defaults to a potentially different hardcoded value in the kernel
(PSTORE_DEFAULT_KMSG_BYTES). This makes it impossible to analyze valuable
post-mortem data especially on the embedded development or in the process
of bringing up new boards. Change this value to be a Kconfig option
with the default of old PSTORE_DEFAULT_KMSG_BYTES

Signed-off-by: Vasile-Laurentiu Stanimir <vasile-laurentiu.stanimir@windriver.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-12-01 12:09:17 -08:00
Christoph Hellwig
b6f8ed33ab pstore/blk: remove {un,}register_pstore_blk
This interface is entirely unused, so remove them and various bits of
unreachable code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201016132047.3068029-4-hch@lst.de
2020-12-01 12:01:03 -08:00
Christoph Hellwig
cbf82e3503 pstore/zone: cap the maximum device size
Introduce an abritrary 128MiB cap to avoid malloc failures when using
a larger block device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: WeiXiong Liao <gmpy.liaowx@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201016132047.3068029-2-hch@lst.de
2020-12-01 11:32:55 -08:00
Toke Høiland-Jørgensen
cf03f316ad fs: 9p: add generic splice_read file operations
The v9fs file operations were missing the splice_read operations, which
breaks sendfile() of files on such a filesystem. I discovered this while
trying to load an eBPF program using iproute2 inside a 'virtme' environment
which uses 9pfs for the virtual file system. iproute2 relies on sendfile()
with an AF_ALG socket to hash files, which was erroring out in the virtual
environment.

Since generic_file_splice_read() seems to just implement splice_read in
terms of the read_iter operation, I simply added the generic implementation
to the file operations, which fixed the error I was seeing. A quick grep
indicates that this is what most other file systems do as well.

Link: http://lkml.kernel.org/r/20201201135409.55510-1-toke@redhat.com
Fixes: 36e2c7421f ("fs: don't allow splice read/write without explicit ops")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2020-12-01 17:53:49 +01:00
Dan Carpenter
cfd1d0f524 9p: Remove unnecessary IS_ERR() check
The "fid" variable can't be an error pointer so there is no need to
check.  The code is slightly cleaner if we move the increment before
the break and remove the NULL check as well.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2020-12-01 08:19:02 +01:00
Dan Carpenter
dfd375864a 9p: Uninitialized variable in v9fs_writeback_fid()
If v9fs_fid_lookup_with_uid() fails then "fid" is not initialized.

The v9fs_fid_lookup_with_uid() can't return NULL.  If it returns an
error pointer then we can still pass that to clone_fid() and it will
return the error pointer back again.

Fixes: 6636b6dcc3 ("9p: add refcount to p9_fid struct")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2020-12-01 08:18:57 +01:00
Tom Rix
28c332b941 gfs2: remove trailing semicolons from macro definitions
The macro use will already have a semicolon.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-12-01 00:25:21 +01:00
Andreas Gruenbacher
a55a47a3bc Revert "GFS2: Prevent delete work from occurring on glocks used for create"
Since commit a0e3cc65fa ("gfs2: Turn gl_delete into a delayed work"), we're
cancelling any pending delete work of an iopen glock before attaching a new
inode to that glock in gfs2_create_inode.  This means that delete_work_func can
no longer be queued or running when attaching the iopen glock to the new inode,
and we can revert commit a4923865ea ("GFS2: Prevent delete work from
occurring on glocks used for create"), which tried to achieve the same but in a
racy way.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-12-01 00:25:21 +01:00
Andreas Gruenbacher
e3a77eebfa gfs2: Make inode operations static
The inode operations are not used outside inode.c.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-12-01 00:25:21 +01:00
Andreas Gruenbacher
dd0ecf5441 gfs2: Fix deadlock between gfs2_{create_inode,inode_lookup} and delete_work_func
In gfs2_create_inode and gfs2_inode_lookup, make sure to cancel any pending
delete work before taking the inode glock.  Otherwise, gfs2_cancel_delete_work
may block waiting for delete_work_func to complete, and delete_work_func may
block trying to acquire the inode glock in gfs2_inode_lookup.

Reported-by: Alexander Aring <aahringo@redhat.com>
Fixes: a0e3cc65fa ("gfs2: Turn gl_delete into a delayed work")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-12-01 00:21:10 +01:00
Björn Töpel
7c951cafc0 net: Add SO_BUSY_POLL_BUDGET socket option
This option lets a user set a per socket NAPI budget for
busy-polling. If the options is not set, it will use the default of 8.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-3-bjorn.topel@gmail.com
2020-12-01 00:09:25 +01:00
Björn Töpel
7fd3253a7d net: Introduce preferred busy-polling
The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
option or system-wide using the /proc/sys/net/core/busy_read knob, is
an opportunistic. That means that if the NAPI context is not
scheduled, it will poll it. If, after busy-polling, the budget is
exceeded the busy-polling logic will schedule the NAPI onto the
regular softirq handling.

One implication of the behavior above is that a busy/heavy loaded NAPI
context will never enter/allow for busy-polling. Some applications
prefer that most NAPI processing would be done by busy-polling.

This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
in concert with the napi_defer_hard_irqs and gro_flush_timeout
knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
introduced in commit 6f8b12d661 ("net: napi: add hard irqs deferral
feature"), and allows for a user to defer interrupts to be enabled and
instead schedule the NAPI context from a watchdog timer. When a user
enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled,
and the NAPI context is being processed by a softirq, the softirq NAPI
processing will exit early to allow the busy-polling to be performed.

If the application stops performing busy-polling via a system call,
the watchdog timer defined by gro_flush_timeout will timeout, and
regular softirq handling will resume.

In summary; Heavy traffic applications that prefer busy-polling over
softirq processing should use this option.

Example usage:

  $ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
  $ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout

Note that the timeout should be larger than the userspace processing
window, otherwise the watchdog will timeout and fall back to regular
softirq processing.

Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
2020-12-01 00:09:25 +01:00
Paulo Alcantara
212253367d cifs: fix potential use-after-free in cifs_echo_request()
This patch fixes a potential use-after-free bug in
cifs_echo_request().

For instance,

  thread 1
  --------
  cifs_demultiplex_thread()
    clean_demultiplex_info()
      kfree(server)

  thread 2 (workqueue)
  --------
  apic_timer_interrupt()
    smp_apic_timer_interrupt()
      irq_exit()
        __do_softirq()
          run_timer_softirq()
            call_timer_fn()
	      cifs_echo_request() <- use-after-free in server ptr

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-11-30 15:23:45 -06:00
Paulo Alcantara
6988a619f5 cifs: allow syscalls to be restarted in __smb_send_rqst()
A customer has reported that several files in their multi-threaded app
were left with size of 0 because most of the read(2) calls returned
-EINTR and they assumed no bytes were read.  Obviously, they could
have fixed it by simply retrying on -EINTR.

We noticed that most of the -EINTR on read(2) were due to real-time
signals sent by glibc to process wide credential changes (SIGRT_1),
and its signal handler had been established with SA_RESTART, in which
case those calls could have been automatically restarted by the
kernel.

Let the kernel decide to whether or not restart the syscalls when
there is a signal pending in __smb_send_rqst() by returning
-ERESTARTSYS.  If it can't, it will return -EINTR anyway.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-11-30 15:23:31 -06:00
Chuck Lever
5cfc822f3e NFSD: Remove macros that are no longer used
Now that all the NFSv4 decoder functions have been converted to
make direct calls to the xdr helpers, remove the unused C macros.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:44 -05:00
Chuck Lever
d9b74bdac6 NFSD: Replace READ* macros in nfsd4_decode_compound()
And clean-up: Now that we have removed the DECODE_TAIL macro from
nfsd4_decode_compound(), we observe that there's no benefit for
nfsd4_decode_compound() to return nfs_ok or nfserr_bad_xdr only to
have its sole caller convert those values to one or zero,
respectively. Have nfsd4_decode_compound() return 1/0 instead.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:44 -05:00
Chuck Lever
3a237b4af5 NFSD: Make nfsd4_ops::opnum a u32
Avoid passing a "pointer to int" argument to xdr_stream_decode_u32.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:43 -05:00
Chuck Lever
2212036cad NFSD: Replace READ* macros in nfsd4_decode_listxattrs()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:43 -05:00
Chuck Lever
403366a7e8 NFSD: Replace READ* macros in nfsd4_decode_setxattr()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:43 -05:00
Chuck Lever
830c71502a NFSD: Replace READ* macros in nfsd4_decode_xattr_name()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:43 -05:00
Chuck Lever
3dfd0b0e15 NFSD: Replace READ* macros in nfsd4_decode_clone()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:43 -05:00
Chuck Lever
9d32b412fe NFSD: Replace READ* macros in nfsd4_decode_seek()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:43 -05:00
Chuck Lever
2846bb0525 NFSD: Replace READ* macros in nfsd4_decode_offload_status()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:43 -05:00
Chuck Lever
f9a953fb36 NFSD: Replace READ* macros in nfsd4_decode_copy_notify()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:43 -05:00
Chuck Lever
e8febea719 NFSD: Replace READ* macros in nfsd4_decode_copy()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:43 -05:00
Chuck Lever
f49e4b4d58 NFSD: Replace READ* macros in nfsd4_decode_nl4_server()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:43 -05:00
Chuck Lever
6aef27aaea NFSD: Replace READ* macros in nfsd4_decode_fallocate()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:42 -05:00
Chuck Lever
0d6467844d NFSD: Replace READ* macros in nfsd4_decode_reclaim_complete()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:42 -05:00
Chuck Lever
c95f2ec349 NFSD: Replace READ* macros in nfsd4_decode_destroy_clientid()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:42 -05:00
Chuck Lever
b7a0c8f6e7 NFSD: Replace READ* macros in nfsd4_decode_test_stateid()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:42 -05:00
Chuck Lever
cf907b1132 NFSD: Replace READ* macros in nfsd4_decode_sequence()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:42 -05:00
Chuck Lever
53d70873e3 NFSD: Replace READ* macros in nfsd4_decode_secinfo_no_name()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:42 -05:00
Chuck Lever
645fcad371 NFSD: Replace READ* macros in nfsd4_decode_layoutreturn()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:42 -05:00
Chuck Lever
c8e88e3aa7 NFSD: Replace READ* macros in nfsd4_decode_layoutget()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:42 -05:00
Chuck Lever
5185980d8a NFSD: Replace READ* macros in nfsd4_decode_layoutcommit()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:42 -05:00
Chuck Lever
044959715f NFSD: Replace READ* macros in nfsd4_decode_getdeviceinfo()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:42 -05:00
Chuck Lever
aec387d590 NFSD: Replace READ* macros in nfsd4_decode_free_stateid()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:41 -05:00
Chuck Lever
94e254af1f NFSD: Replace READ* macros in nfsd4_decode_destroy_session()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:41 -05:00
Chuck Lever
81243e3fe3 NFSD: Replace READ* macros in nfsd4_decode_create_session()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:41 -05:00
Chuck Lever
3a3f1fbacb NFSD: Add a helper to decode channel_attrs4
De-duplicate some code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:41 -05:00
Chuck Lever
10ff842281 NFSD: Add a helper to decode nfs_impl_id4
Refactor for clarity.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:41 -05:00
Chuck Lever
523ec6ed6f NFSD: Add a helper to decode state_protect4_a
Refactor for clarity.

Also, remove a stale comment. Commit ed94164398 ("nfsd: implement
machine credential support for some operations") added support for
SP4_MACH_CRED, so state_protect_a is no longer completely ignored.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:41 -05:00
Chuck Lever
547bfeb4cd NFSD: Add a separate decoder for ssv_sp_parms
Refactor for clarity.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:41 -05:00
Chuck Lever
2548aa784d NFSD: Add a separate decoder to handle state_protect_ops
Refactor for clarity and de-duplication of code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:41 -05:00
Chuck Lever
571e0451c4 NFSD: Replace READ* macros in nfsd4_decode_bind_conn_to_session()
A dedicated sessionid4 decoder is introduced that will be used by
other operation decoders in subsequent patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:41 -05:00
Chuck Lever
0f81d96098 NFSD: Replace READ* macros in nfsd4_decode_backchannel_ctl()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:40 -05:00
Chuck Lever
1a99440807 NFSD: Replace READ* macros in nfsd4_decode_cb_sec()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:40 -05:00
Chuck Lever
a4a80c15ca NFSD: Replace READ* macros in nfsd4_decode_release_lockowner()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:40 -05:00
Chuck Lever
244e2befcb NFSD: Replace READ* macros in nfsd4_decode_write()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:40 -05:00
Chuck Lever
67cd453eed NFSD: Replace READ* macros in nfsd4_decode_verify()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:40 -05:00
Chuck Lever
d1ca55149d NFSD: Replace READ* macros in nfsd4_decode_setclientid_confirm()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:40 -05:00
Chuck Lever
92fa6c08c2 NFSD: Replace READ* macros in nfsd4_decode_setclientid()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:40 -05:00
Chuck Lever
44592fe947 NFSD: Replace READ* macros in nfsd4_decode_setattr()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:40 -05:00
Chuck Lever
d0abdae519 NFSD: Replace READ* macros in nfsd4_decode_secinfo()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:40 -05:00
Chuck Lever
d12f90458d NFSD: Replace READ* macros in nfsd4_decode_renew()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:40 -05:00
Chuck Lever
ba881a0a53 NFSD: Replace READ* macros in nfsd4_decode_rename()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:39 -05:00
Chuck Lever
b7f5fbf219 NFSD: Replace READ* macros in nfsd4_decode_remove()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:39 -05:00
Chuck Lever
0dfaf2a371 NFSD: Replace READ* macros in nfsd4_decode_readdir()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:39 -05:00
Chuck Lever
3909c3bc60 NFSD: Replace READ* macros in nfsd4_decode_read()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:39 -05:00
Chuck Lever
a73bed9841 NFSD: Replace READ* macros in nfsd4_decode_putfh()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:39 -05:00
Chuck Lever
dca71651f0 NFSD: Replace READ* macros in nfsd4_decode_open_downgrade()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:39 -05:00
Chuck Lever
06bee693a1 NFSD: Replace READ* macros in nfsd4_decode_open_confirm()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:39 -05:00
Chuck Lever
61e5e0b3ec NFSD: Replace READ* macros in nfsd4_decode_open()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:39 -05:00
Chuck Lever
1708e50b01 NFSD: Add helper to decode OPEN's open_claim4 argument
Refactor for clarity.

Note that op_fname is the only instance of an NFSv4 filename stored
in a struct xdr_netobj. Convert it to a u32/char * pair so that the
new nfsd4_decode_filename() helper can be used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:39 -05:00
Chuck Lever
b07bebd9eb NFSD: Replace READ* macros in nfsd4_decode_share_deny()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:39 -05:00
Chuck Lever
9aa62f5199 NFSD: Replace READ* macros in nfsd4_decode_share_access()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:38 -05:00
Chuck Lever
e6ec04b27b NFSD: Add helper to decode OPEN's openflag4 argument
Refactor for clarity.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:38 -05:00
Chuck Lever
bf33bab3c4 NFSD: Add helper to decode OPEN's createhow4 argument
Refactor for clarity.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:38 -05:00
Chuck Lever
796dd1c6b6 NFSD: Add helper to decode NFSv4 verifiers
This helper will be used to simplify decoders in subsequent
patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:38 -05:00
Chuck Lever
3d5877e8e0 NFSD: Replace READ* macros in nfsd4_decode_lookup()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:38 -05:00
Chuck Lever
ca9cf9fc27 NFSD: Replace READ* macros in nfsd4_decode_locku()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:38 -05:00
Chuck Lever
0a146f04aa NFSD: Replace READ* macros in nfsd4_decode_lockt()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:38 -05:00
Chuck Lever
7c59deed5c NFSD: Replace READ* macros in nfsd4_decode_lock()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:38 -05:00
Chuck Lever
8918cc0d2b NFSD: Add helper for decoding locker4
Refactor for clarity.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:38 -05:00
Chuck Lever
144e826940 NFSD: Add helpers to decode a clientid4 and an NFSv4 state owner
These helpers will also be used to simplify decoders in subsequent
patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:38 -05:00
Chuck Lever
5dcbfabb67 NFSD: Relocate nfsd4_decode_opaque()
Enable nfsd4_decode_opaque() to be used in more decoders, and
replace the READ* macros in nfsd4_decode_opaque().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:37 -05:00
Chuck Lever
5c505d1286 NFSD: Replace READ* macros in nfsd4_decode_link()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:37 -05:00
Chuck Lever
f759eff260 NFSD: Replace READ* macros in nfsd4_decode_getattr()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:37 -05:00
Chuck Lever
95e6482ced NFSD: Replace READ* macros in nfsd4_decode_delegreturn()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:37 -05:00
Chuck Lever
000dfa18b3 NFSD: Replace READ* macros in nfsd4_decode_create()
A dedicated decoder for component4 is introduced here, which will be
used by other operation decoders in subsequent patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:37 -05:00
Chuck Lever
d1c263a031 NFSD: Replace READ* macros in nfsd4_decode_fattr()
Let's be more careful to avoid overrunning the memory that backs
the bitmap array. This requires updating the synopsis of
nfsd4_decode_fattr().

Bruce points out that a server needs to be careful to return nfs_ok
when a client presents bitmap bits the server doesn't support. This
includes bits in bitmap words the server might not yet support.

The current READ* based implementation is good about that, but that
requirement hasn't been documented.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:37 -05:00
Chuck Lever
66f0476c70 NFSD: Replace READ* macros that decode the fattr4 umask attribute
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:37 -05:00
Chuck Lever
dabe91828f NFSD: Replace READ* macros that decode the fattr4 security label attribute
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:37 -05:00
Chuck Lever
1c3eff7ea4 NFSD: Replace READ* macros that decode the fattr4 time_set attributes
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:37 -05:00
Chuck Lever
393c31dd27 NFSD: Replace READ* macros that decode the fattr4 owner_group attribute
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:36 -05:00
Chuck Lever
9853a5ac9b NFSD: Replace READ* macros that decode the fattr4 owner attribute
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:36 -05:00
Chuck Lever
1c8f0ad7dd NFSD: Replace READ* macros that decode the fattr4 mode attribute
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:36 -05:00
Chuck Lever
c941a96823 NFSD: Replace READ* macros that decode the fattr4 acl attribute
Refactor for clarity and to move infrequently-used code out of line.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:36 -05:00
Chuck Lever
2ac1b9b2af NFSD: Replace READ* macros that decode the fattr4 size attribute
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:36 -05:00
Chuck Lever
081d53fe0b NFSD: Change the way the expected length of a fattr4 is checked
Because the fattr4 is now managed in an xdr_stream, all that is
needed is to store the initial position of the stream before
decoding the attribute list. Then the actual length of the list
is computed using the final stream position, after decoding is
complete.

No behavior change is expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:36 -05:00
Chuck Lever
cbd9abb370 NFSD: Replace READ* macros in nfsd4_decode_commit()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:36 -05:00
Chuck Lever
d3d2f38154 NFSD: Replace READ* macros in nfsd4_decode_close()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:36 -05:00
Chuck Lever
d169a6a9e5 NFSD: Replace READ* macros in nfsd4_decode_access()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:36 -05:00
Chuck Lever
c1346a1216 NFSD: Replace the internals of the READ_BUF() macro
Convert the READ_BUF macro in nfs4xdr.c from open code to instead
use the new xdr_stream-style decoders already in use by the encode
side (and by the in-kernel NFS client implementation). Once this
conversion is done, each individual NFSv4 argument decoder can be
independently cleaned up to replace these macros with C code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:36 -05:00
Chuck Lever
08281341be NFSD: Add tracepoints in nfsd4_decode/encode_compound()
For troubleshooting purposes, record failures to decode NFSv4
operation arguments and encode operation results.

trace_nfsd_compound_decode_err() replaces the dprintk() call sites
that are embedded in READ_* macros that are about to be removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:35 -05:00
Chuck Lever
0dfdad1c1d NFSD: Add tracepoints in nfsd_dispatch()
For troubleshooting purposes, record GARBAGE_ARGS and CANT_ENCODE
failures.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:35 -05:00
Chuck Lever
788f7183fb NFSD: Add common helpers to decode void args and encode void results
Start off the conversion to xdr_stream by de-duplicating the functions
that decode void arguments and encode void results.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:35 -05:00
Chuck Lever
5191955d6f SUNRPC: Prepare for xdr_stream-style decoding on the server-side
A "permanent" struct xdr_stream is allocated in struct svc_rqst so
that it is usable by all server-side decoders. A per-rqst scratch
buffer is also allocated to handle decoding XDR data items that
cross page boundaries.

To demonstrate how it will be used, add the first call site for the
new svcxdr_init_decode() API.

As an additional part of the overall conversion, add symbolic
constants for successful and failed XDR operations. Returning "0" is
overloaded. Sometimes it means something failed, but sometimes it
means success. To make it more clear when XDR decoding functions
succeed or fail, introduce symbolic constants.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:35 -05:00
Chuck Lever
0ae4c3e8a6 SUNRPC: Add xdr_set_scratch_page() and xdr_reset_scratch_buffer()
Clean up: De-duplicate some frequently-used code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:35 -05:00
Huang Guobin
231307df24 nfsd: Fix error return code in nfsd_file_cache_init()
Fix to return PTR_ERR() error code from the error handling case instead of
0 in function nfsd_file_cache_init(), as done elsewhere in this function.

Fixes: 65294c1f2c5e7("nfsd: add a new struct file caching facility to nfsd")
Signed-off-by: Huang Guobin <huangguobin4@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:45:56 -05:00
Pavel Begunkov
2d280bc893 io_uring: fix recvmsg setup with compat buf-select
__io_compat_recvmsg_copy_hdr() with REQ_F_BUFFER_SELECT reads out iov
len but never assigns it to iov/fast_iov, leaving sr->len with garbage.
Hopefully, following io_buffer_select() truncates it to the selected
buffer size, but the value is still may be under what was specified.

Cc: <stable@vger.kernel.org> # 5.7
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-30 11:12:03 -07:00
Chuck Lever
f45a444cfe NFSD: Add SPDX header for fs/nfsd/trace.c
Clean up.

The file was contributed in 2014 by Christoph Hellwig in commit
31ef83dc05 ("nfsd: add trace events").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:24 -05:00
Chuck Lever
3a90e1dff1 NFSD: Remove extra "0x" in tracepoint format specifier
Clean up: %p adds its own 0x already.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:24 -05:00
Chuck Lever
b76278ae68 NFSD: Clean up the show_nf_may macro
Display all currently possible NFSD_MAY permission flags.

Move and rename show_nf_may with a more generic name because the
NFSD_MAY permission flags are used in other places besides the file
cache.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:24 -05:00
Alex Shi
71fd721839 nfsd/nfs3: remove unused macro nfsd3_fhandleres
The macro is unused, remove it to tame gcc warning:
fs/nfsd/nfs3proc.c:702:0: warning: macro "nfsd3_fhandleres" is not used
[-Wunused-macros]

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:23 -05:00
Tom Rix
25fef48bdb NFSD: A semicolon is not needed after a switch statement.
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:23 -05:00
Chuck Lever
76e5492b16 NFSD: Invoke svc_encode_result_payload() in "read" NFSD encoders
Have the NFSD encoders annotate the boundaries of every
direct-data-placement eligible result data payload. Then change
svcrdma to use that annotation instead of the xdr->page_len
when handling Write chunks.

For NFSv4 on RDMA, that enables the ability to recognize multiple
result payloads per compound. This is a pre-requisite for supporting
multiple Write chunks per RPC transaction.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:22 -05:00
Chuck Lever
03493bca08 SUNRPC: Rename svc_encode_read_payload()
Clean up: "result payload" is a less confusing name for these
payloads. "READ payload" reflects only the NFS usage.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:21 -05:00
Trond Myklebust
63e2fffa59 pNFS/flexfiles: Fix array overflow when flexfiles mirroring is enabled
If the flexfiles mirroring is enabled, then the read code expects to be
able to set pgio->pg_mirror_idx to point to the data server that is
being used for this particular read. However it does not change the
pg_mirror_count because we only need to send a single read.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-11-30 10:52:22 -05:00
Linus Torvalds
1214917e00 More EFI fixes for v5.10-rc:
- revert efivarfs kmemleak fix again - it was a false positive;
 - make CONFIG_EFI_EARLYCON depend on CONFIG_EFI explicitly so it does not
   pull in other dependencies unnecessarily if CONFIG_EFI is not set
 - defer attempts to load SSDT overrides from EFI vars until after the
   efivar layer is up.
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAl+/i/UACgkQw08iOZLZ
 jyRb+Qv/RVQSvZW+u6MrYPZthmVxnNQ4pFHKAjibD3h9isbrWdq6AETtNghUGWQr
 nr1WeOj4Qa2aDe4z63Sra3QNsdLnSn0FsHuJPD8rozMd4N4jiChtpLukdShibXXz
 25yfs7KpNyqwj3QnFd2LpJBXGqzdoKFrzWbnWSnFEMrSfkqptBhogslhVxzal8Uz
 4hUyGhe/iBfgU720uoVCmofPpYxqV/cndEmsnA8rZnW5yTPXIn6f9c4KUOcFxgLS
 LchxZYRd9GZoQ4Yt40ih9JX1ILZNhrhXh96cfNuUiVwnf9Lg7xSOGIX3+e3PR9Lz
 1VI4UuA7HM8qDZx+N5iiBRrBIgtdtMgKLEip3n+x84/7P/p26HXe3OJCPBWh0Q0q
 aWCcV8qiwju6VYK6dv4Gz+2/OYiSKXrXZCx57dnCr7tv5srYenwsvCCYa9wLTpMW
 16/DRmkHcfbAdfS38Bhs3qY/zK+He7XE/sPJao/NuVwoJ3Nu0dA4LR7JFG7TVjKm
 bepO3A8W
 =gvBz
 -----END PGP SIGNATURE-----

Merge tag 'efi-urgent-for-v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull EFI fixes from Borislav Petkov:
 "More EFI fixes forwarded from Ard Biesheuvel:

   - revert efivarfs kmemleak fix again - it was a false positive

   - make CONFIG_EFI_EARLYCON depend on CONFIG_EFI explicitly so it does
     not pull in other dependencies unnecessarily if CONFIG_EFI is not
     set

   - defer attempts to load SSDT overrides from EFI vars until after the
     efivar layer is up"

* tag 'efi-urgent-for-v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi: EFI_EARLYCON should depend on EFI
  efivarfs: revert "fix memory leak in efivarfs_create()"
  efi/efivars: Set generic ops before loading SSDT
2020-11-29 10:18:53 -08:00
Linus Torvalds
9223e74f99 io_uring-5.10-2020-11-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/BZBwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpku+EACqrIZ6wuA/mIWbKs9wWf280hRbIQL1q7zy
 Ejj2A6DfclZz8XuUUfrRQ2diDDPBNnfPVRfiIzAi/AO7f7jDn91+2ZYfoJJWLZbq
 kCDHScPI6+OjDho5r+3vocxdoXZfuYxBlKsIzdiLl32mGEWbYjd+rmlAVJ74FPYz
 mFddmQCZoTBuEArpW6h8bKoPLgoC4zzZ2MgcjjpHT7Me/ai8t7OoA+FBpUjRcu+q
 Bt/ZfjIiWOzss9+psk5BSrHo5yXY51TWiiuV8hVM3RwVsL2dabG4WPgaLzum3H9p
 UZ4NAvXdRX9gWfF85mo7PEo1/0REmRy4BOJQ8qtBnvq6eepttAXHBx0SFDfpcLzG
 oAB4HaxuixAdsutnIynBXad3MskhNtFzbz/4UqOcnKbpT5Zxi65YiBKAsNwXjIdU
 bc3rZ3hyP4FifzNFa86wFLQ9w9gKV8mEogAz122lxL44AyguZzwN1I3H/YSEt2iX
 tKqHNWxnrb9MP3ycAuMvwjF3aNruns3QVv5fZQ9CrAUnAVF6Q8o/E2aTt4rmiJUW
 kQAvQJQj6MhBi/GXJk6YmTTDYgEMZ5JxFV3IVW17YZiMlpvIFML4X2nKo12qhwnq
 8eZ0qJHFRyycW9eBY3qznw8S5Z7Nh8r91MjJs5sullHg99xKJbvR+IoswImfJMme
 Gk857imwGw==
 =HJnn
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.10-2020-11-27' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - Out of bounds fix for the cq size cap from earlier this release (Joseph)

 - iov_iter type check fix (Pavel)

 - Files grab + cancelation fix (Pavel)

* tag 'io_uring-5.10-2020-11-27' of git://git.kernel.dk/linux-block:
  io_uring: fix files grab/cancel race
  io_uring: fix ITER_BVEC check
  io_uring: fix shift-out-of-bounds when round up cq size
2020-11-27 12:56:04 -08:00
Linus Torvalds
a17a3ca55e for-5.10-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl/BGBQACgkQxWXV+ddt
 WDtRYQ/+IUGjJ4l6MyL3PwLgTVabKsSpm2R3y3M/0tVJ0FIhDXYbkpjB2CkpIdcH
 jUvHnRL4H59hwG6rwlXU2oC298FNbbrLGIU+c9DR50RuyQCDGnT02XvxwfDIEFzp
 WLNZ/CAhRkm6boj//70lV26BpeXT59KMYwNixfCNTXq9Ir0qYHHCGg4cEBQLS++2
 JUU8XVTLURIYiFOLbwmABI7V43OgDhdORIr+qnR8xjCUyhusZsjVVbvIdW3BDi/S
 wK7NJsfuqgsF0zD9URJjpwTFiJL9SvBLWR8JnM9NiLW3ZbkGBL+efL6mdWuH7534
 gruJRS2zYPMO2/Kpjy/31CWLap3PUSD3i8cKF+uo3liojWuSUhN8kfvcNxJVd5se
 NkEK+4zOjsDIVbv7gcjThSv4KTnOUO/XfN9TWUMuduaMBmGQNaQut1FpGV98utiK
 yW6x8xqcR4SI+lqY6ILqCK+qUHf19BLSsuyzZdTIontKKRA9F9hY8a4XTZuzTWml
 BGYmFGP640vOo8C9GjrQfpAwa7CB/DnF/cg1AAmuZ8vrEm9zYjmauFKK8ZPcveA3
 KGrnmIlYjhAIX16oRbfwOgj9D2xa1loBzJyHQHByvCMXGVFBnqRTRANRHFrdQWJB
 qh9+J4EJcUXPE9WGHxAW/g9vpFkV7IRABHs7aUB8zApxI9nGA0Q=
 =kcxn
 -----END PGP SIGNATURE-----

Merge tag 'for-5.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few fixes for various warnings that accumulated over past two weeks:

   - tree-checker: add missing return values for some errors

   - lockdep fixes
      - when reading qgroup config and starting quota rescan
      - reverse order of quota ioctl lock and VFS freeze lock

   - avoid accessing potentially stale fs info during device scan,
     reported by syzbot

   - add scope NOFS protection around qgroup relation changes

   - check for running transaction before flushing qgroups

   - fix tracking of new delalloc ranges for some cases"

* tag 'for-5.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix lockdep splat when enabling and disabling qgroups
  btrfs: do nofs allocations when adding and removing qgroup relations
  btrfs: fix lockdep splat when reading qgroup config on mount
  btrfs: tree-checker: add missing returns after data_ref alignment checks
  btrfs: don't access possibly stale fs_info data for printing duplicate device
  btrfs: tree-checker: add missing return after error in root_item
  btrfs: qgroup: don't commit transaction when we already hold the handle
  btrfs: fix missing delalloc new bit for new delalloc ranges
2020-11-27 12:42:13 -08:00
Ingo Molnar
a787bdaff8 Merge branch 'linus' into sched/core, to resolve semantic conflict
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-11-27 11:10:50 +01:00
Andreas Gruenbacher
82e938bd53 gfs2: Upgrade shared glocks for atime updates
Commit 20f829999c ("gfs2: Rework read and page fault locking") lifted
the glock lock taking from the low-level ->readpage and ->readahead
address space operations to the higher-level ->read_iter file and
->fault vm operations.  The glocks are still taken in LM_ST_SHARED mode
only.  On filesystems mounted without the noatime option, ->read_iter
sometimes needs to update the atime as well, though.  Right now, this
leads to a failed locking mode assertion in gfs2_dirty_inode.

Fix that by introducing a new update_time inode operation.  There, if
the glock is held non-exclusively, upgrade it to an exclusive lock.

Reported-by: Alexander Aring <aahringo@redhat.com>
Fixes: 20f829999c ("gfs2: Rework read and page fault locking")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-11-26 19:58:25 +01:00
Rustam Kovhaev
d24396c529 reiserfs: add check for an invalid ih_entry_count
when directory item has an invalid value set for ih_entry_count it might
trigger use-after-free or out-of-bounds read in bin_search_in_dir_item()

ih_entry_count * IH_SIZE for directory item should not be larger than
ih_item_len

Link: https://lore.kernel.org/r/20201101140958.3650143-1-rkovhaev@gmail.com
Reported-and-tested-by: syzbot+83b6f7cf9922cae5c4d7@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=83b6f7cf9922cae5c4d7
Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-11-26 16:57:28 +01:00
Pavel Begunkov
af60470347 io_uring: fix files grab/cancel race
When one task is in io_uring_cancel_files() and another is doing
io_prep_async_work() a race may happen. That's because after accounting
a request inflight in first call to io_grab_identity() it still may fail
and go to io_identity_cow(), which migh briefly keep dangling
work.identity and not only.

Grab files last, so io_prep_async_work() won't fail if it did get into
->inflight_list.

note: the bug shouldn't exist after making io_uring_cancel_files() not
poking into other tasks' requests.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-26 08:50:21 -07:00
Bob Peterson
f39e7d3aae gfs2: Don't freeze the file system during unmount
GFS2's freeze/thaw mechanism uses a special freeze glock to control its
operation. It does this with a sync glock operation (glops.c) called
freeze_go_sync. When the freeze glock is demoted (glock's do_xmote) the
glops function causes the file system to be frozen. This is intended. However,
GFS2's mount and unmount processes also hold the freeze glock to prevent other
processes, perhaps on different cluster nodes, from mounting the frozen file
system in read-write mode.

Before this patch, there was no check in freeze_go_sync for whether a freeze
in intended or whether the glock demote was caused by a normal unmount.
So it was trying to freeze the file system it's trying to unmount, which
ends up in a deadlock.

This patch adds an additional check to freeze_go_sync so that demotes of the
freeze glock are ignored if they come from the unmount process.

Fixes: 20b3291290 ("gfs2: Fix regression in freeze_go_sync")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-11-25 18:12:08 +01:00
Bob Peterson
778721510e gfs2: check for empty rgrp tree in gfs2_ri_update
If gfs2 tries to mount a (corrupt) file system that has no resource
groups it still tries to set preferences on the first one, which causes
a kernel null pointer dereference. This patch adds a check to function
gfs2_ri_update so this condition is detected and reported back as an
error.

Reported-by: syzbot+e3f23ce40269a4c9053a@syzkaller.appspotmail.com
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-11-25 18:10:55 +01:00
Ard Biesheuvel
ff04f3b6f2 efivarfs: revert "fix memory leak in efivarfs_create()"
The memory leak addressed by commit fe5186cf12 is a false positive:
all allocations are recorded in a linked list, and freed when the
filesystem is unmounted. This leads to double frees, and as reported
by David, leads to crashes if SLUB is configured to self destruct when
double frees occur.

So drop the redundant kfree() again, and instead, mark the offending
pointer variable so the allocation is ignored by kmemleak.

Cc: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com>
Fixes: fe5186cf12 ("efivarfs: fix memory leak in efivarfs_create()")
Reported-by: David Laight <David.Laight@aculab.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-11-25 16:55:02 +01:00
Linus Torvalds
127c501a03 Four smb3 fixes for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl+9eV4ACgkQiiy9cAdy
 T1EXOAv+JF6QZiwB6TELiusDLw+6UWpT7CcT1guL+eSrFVYEPIhqF4V4QXY0oA0Y
 F8jTpHxEx8wdECAPPNGHh/a4E+Y1vV/W8Nv5DkglAwjeXAD2Y84VAp8hH890jnn0
 M8I9qdnbfSodRueshpKScRPHbfp4Smlz1BR9R0syk7T7TmCy8aKNwYN1lBy5Nf9f
 ICMn1F5e9z4nX43NJIwzO+NSPehtLm8ULFZER/pQ+tGDhwXTdFc9HPzfu0ZoYbEO
 zADjmY4PItVYRINnWBntEBLYcAFeAB0finPTP2kCfXfRDF5cPgElp84F3Uro7se5
 bioboePO+bUS0jigIiP3qZ7zTHEdJoICsiJzVGmDZYsawK3MAwamp2EH3axAr4B/
 h4LULgN7nCatPW5lMo3/3EPZXVbXVTOYIB2REtqJugK8USQ9+v9SMLNT/qWn0GE5
 bzZoZ22wkHEOn4EIxYSCX4tgj9cJ2v9B/0NMpTQLTECKBQi3iV32GxLglvMwZru6
 eWKL5tZj
 =lxdD
 -----END PGP SIGNATURE-----

Merge tag '5.10-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Four smb3 fixes for stable: one fixes a memleak, the other three
  address a problem found with decryption offload that can cause a use
  after free"

* tag '5.10-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: Handle error case during offload read path
  smb3: Avoid Mid pending list corruption
  smb3: Call cifs reconnect from demultiplex thread
  cifs: fix a memleak with modefromsid
2020-11-24 15:33:18 -08:00
Eric Biggers
4a4b8721f1 fscrypt: simplify master key locking
The stated reasons for separating fscrypt_master_key::mk_secret_sem from
the standard semaphore contained in every 'struct key' no longer apply.

First, due to commit a992b20cd4 ("fscrypt: add
fscrypt_prepare_new_inode() and fscrypt_set_context()"),
fscrypt_get_encryption_info() is no longer called from within a
filesystem transaction.

Second, due to commit d3ec10aa95 ("KEYS: Don't write out to userspace
while holding key semaphore"), the semaphore for the "keyring" key type
no longer ranks above page faults.

That leaves performance as the only possible reason to keep the separate
mk_secret_sem.  Specifically, having mk_secret_sem reduces the
contention between setup_file_encryption_key() and
FS_IOC_{ADD,REMOVE}_ENCRYPTION_KEY.  However, these ioctls aren't
executed often, so this doesn't seem to be worth the extra complexity.

Therefore, simplify the locking design by just using key->sem instead of
mk_secret_sem.

Link: https://lore.kernel.org/r/20201117032626.320275-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-11-24 15:29:47 -08:00
Eric Biggers
234f1b7f8d fscrypt: remove unnecessary calls to fscrypt_require_key()
In an encrypted directory, a regular dentry (one that doesn't have the
no-key name flag) can only be created if the directory's encryption key
is available.

Therefore the calls to fscrypt_require_key() in __fscrypt_prepare_link()
and __fscrypt_prepare_rename() are unnecessary, as these functions
already check that the dentries they're given aren't no-key names.

Remove these unnecessary calls to fscrypt_require_key().

Link: https://lore.kernel.org/r/20201118075609.120337-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-11-24 15:10:27 -08:00
Eric Biggers
76786a0f08 ubifs: prevent creating duplicate encrypted filenames
As described in "fscrypt: add fscrypt_is_nokey_name()", it's possible to
create a duplicate filename in an encrypted directory by creating a file
concurrently with adding the directory's encryption key.

Fix this bug on ubifs by rejecting no-key dentries in ubifs_create(),
ubifs_mkdir(), ubifs_mknod(), and ubifs_symlink().

Note that ubifs doesn't actually report the duplicate filenames from
readdir, but rather it seems to replace the original dentry with a new
one (which is still wrong, just a different effect from ext4).

On ubifs, this fixes xfstest generic/595 as well as the new xfstest I
wrote specifically for this bug.

Fixes: f4f61d2cc6 ("ubifs: Implement encrypted filenames")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201118075609.120337-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-11-24 15:10:27 -08:00
Eric Biggers
bfc2b7e851 f2fs: prevent creating duplicate encrypted filenames
As described in "fscrypt: add fscrypt_is_nokey_name()", it's possible to
create a duplicate filename in an encrypted directory by creating a file
concurrently with adding the directory's encryption key.

Fix this bug on f2fs by rejecting no-key dentries in f2fs_add_link().

Note that the weird check for the current task in f2fs_do_add_link()
seems to make this bug difficult to reproduce on f2fs.

Fixes: 9ea97163c6 ("f2fs crypto: add filename encryption for f2fs_add_link")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201118075609.120337-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-11-24 15:10:27 -08:00
Eric Biggers
75d18cd186 ext4: prevent creating duplicate encrypted filenames
As described in "fscrypt: add fscrypt_is_nokey_name()", it's possible to
create a duplicate filename in an encrypted directory by creating a file
concurrently with adding the directory's encryption key.

Fix this bug on ext4 by rejecting no-key dentries in ext4_add_entry().

Note that the duplicate check in ext4_find_dest_de() sometimes prevented
this bug.  However in many cases it didn't, since ext4_find_dest_de()
doesn't examine every dentry.

Fixes: 4461471107 ("ext4 crypto: enable filename encryption")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201118075609.120337-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-11-24 15:10:27 -08:00
Eric Biggers
159e1de201 fscrypt: add fscrypt_is_nokey_name()
It's possible to create a duplicate filename in an encrypted directory
by creating a file concurrently with adding the encryption key.

Specifically, sys_open(O_CREAT) (or sys_mkdir(), sys_mknod(), or
sys_symlink()) can lookup the target filename while the directory's
encryption key hasn't been added yet, resulting in a negative no-key
dentry.  The VFS then calls ->create() (or ->mkdir(), ->mknod(), or
->symlink()) because the dentry is negative.  Normally, ->create() would
return -ENOKEY due to the directory's key being unavailable.  However,
if the key was added between the dentry lookup and ->create(), then the
filesystem will go ahead and try to create the file.

If the target filename happens to already exist as a normal name (not a
no-key name), a duplicate filename may be added to the directory.

In order to fix this, we need to fix the filesystems to prevent
->create(), ->mkdir(), ->mknod(), and ->symlink() on no-key names.
(->rename() and ->link() need it too, but those are already handled
correctly by fscrypt_prepare_rename() and fscrypt_prepare_link().)

In preparation for this, add a helper function fscrypt_is_nokey_name()
that filesystems can use to do this check.  Use this helper function for
the existing checks that fs/crypto/ does for rename and link.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201118075609.120337-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-11-24 15:10:27 -08:00
Alexander Aring
515b269d5b gfs2: set lockdep subclass for iopen glocks
This patch introduce a new globs attribute to define the subclass of the
glock lockref spinlock. This avoid the following lockdep warning, which
occurs when we lock an inode lock while an iopen lock is held:

============================================
WARNING: possible recursive locking detected
5.10.0-rc3+ #4990 Not tainted
--------------------------------------------
kworker/0:1/12 is trying to acquire lock:
ffff9067d45672d8 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: lockref_get+0x9/0x20

but task is already holding lock:
ffff9067da308588 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: delete_work_func+0x164/0x260

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&gl->gl_lockref.lock);
  lock(&gl->gl_lockref.lock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by kworker/0:1/12:
 #0: ffff9067c1bfdd38 ((wq_completion)delete_workqueue){+.+.}-{0:0}, at: process_one_work+0x1b7/0x540
 #1: ffffac594006be70 ((work_completion)(&(&gl->gl_delete)->work)){+.+.}-{0:0}, at: process_one_work+0x1b7/0x540
 #2: ffff9067da308588 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: delete_work_func+0x164/0x260

stack backtrace:
CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.10.0-rc3+ #4990
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Workqueue: delete_workqueue delete_work_func
Call Trace:
 dump_stack+0x8b/0xb0
 __lock_acquire.cold+0x19e/0x2e3
 lock_acquire+0x150/0x410
 ? lockref_get+0x9/0x20
 _raw_spin_lock+0x27/0x40
 ? lockref_get+0x9/0x20
 lockref_get+0x9/0x20
 delete_work_func+0x188/0x260
 process_one_work+0x237/0x540
 worker_thread+0x4d/0x3b0
 ? process_one_work+0x540/0x540
 kthread+0x127/0x140
 ? __kthread_bind_mask+0x60/0x60
 ret_from_fork+0x22/0x30

Suggested-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-11-24 23:45:58 +01:00
Alexander Aring
16e6281b6b gfs2: Fix deadlock dumping resource group glocks
Commit 0e539ca1bb ("gfs2: Fix NULL pointer dereference in gfs2_rgrp_dump")
introduced additional locking in gfs2_rgrp_go_dump, which is also used for
dumping resource group glocks via debugfs.  However, on that code path, the
glock spin lock is already taken in dump_glock, and taking it again in
gfs2_glock2rgrp leads to deadlock.  This can be reproduced with:

  $ mkfs.gfs2 -O -p lock_nolock /dev/FOO
  $ mount /dev/FOO /mnt/foo
  $ touch /mnt/foo/bar
  $ cat /sys/kernel/debug/gfs2/FOO/glocks

Fix that by not taking the glock spin lock inside the go_dump callback.

Fixes: 0e539ca1bb ("gfs2: Fix NULL pointer dereference in gfs2_rgrp_dump")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-11-24 23:45:58 +01:00
Pavel Begunkov
9c3a205c5f io_uring: fix ITER_BVEC check
iov_iter::type is a bitmask that also keeps direction etc., so it
shouldn't be directly compared against ITER_*. Use proper helper.

Fixes: ff6165b2d7 ("io_uring: retain iov_iter state over io_read/io_write calls")
Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Cc: <stable@vger.kernel.org> # 5.9
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-24 07:54:30 -07:00
Joseph Qi
eb2667b343 io_uring: fix shift-out-of-bounds when round up cq size
Abaci Fuzz reported a shift-out-of-bounds BUG in io_uring_create():

[ 59.598207] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
[ 59.599665] shift exponent 64 is too large for 64-bit type 'long unsigned int'
[ 59.601230] CPU: 0 PID: 963 Comm: a.out Not tainted 5.10.0-rc4+ #3
[ 59.602502] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 59.603673] Call Trace:
[ 59.604286] dump_stack+0x107/0x163
[ 59.605237] ubsan_epilogue+0xb/0x5a
[ 59.606094] __ubsan_handle_shift_out_of_bounds.cold+0xb2/0x20e
[ 59.607335] ? lock_downgrade+0x6c0/0x6c0
[ 59.608182] ? rcu_read_lock_sched_held+0xaf/0xe0
[ 59.609166] io_uring_create.cold+0x99/0x149
[ 59.610114] io_uring_setup+0xd6/0x140
[ 59.610975] ? io_uring_create+0x2510/0x2510
[ 59.611945] ? lockdep_hardirqs_on_prepare+0x286/0x400
[ 59.613007] ? syscall_enter_from_user_mode+0x27/0x80
[ 59.614038] ? trace_hardirqs_on+0x5b/0x180
[ 59.615056] do_syscall_64+0x2d/0x40
[ 59.615940] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 59.617007] RIP: 0033:0x7f2bb8a0b239

This is caused by roundup_pow_of_two() if the input entries larger
enough, e.g. 2^32-1. For sq_entries, it will check first and we allow
at most IORING_MAX_ENTRIES, so it is okay. But for cq_entries, we do
round up first, that may overflow and truncate it to 0, which is not
the expected behavior. So check the cq size first and then do round up.

Fixes: 88ec3211e4 ("io_uring: round-up cq size before comparing with rounded sq size")
Reported-by: Abaci Fuzz <abaci@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-24 07:54:30 -07:00
Thomas Gleixner
13c8da5db4 Merge branch 'sched/core' into core/mm
Pull the migrate disable mechanics which is a prerequisite for preemptible
kmap_local().
2020-11-24 11:26:11 +01:00
Eric Biggers
bde4933490 fs-verity: move structs needed for file signing to UAPI header
Although it isn't used directly by the ioctls,
"struct fsverity_descriptor" is required by userspace programs that need
to compute fs-verity file digests in a standalone way.  Therefore
it's also needed to sign files in a standalone way.

Similarly, "struct fsverity_formatted_digest" (previously called
"struct fsverity_signed_digest" which was misleading) is also needed to
sign files if the built-in signature verification is being used.

Therefore, move these structs to the UAPI header.

While doing this, try to make it clear that the signature-related fields
in fsverity_descriptor aren't used in the file digest computation.

Acked-by: Luca Boccassi <luca.boccassi@microsoft.com>
Link: https://lore.kernel.org/r/20201113211918.71883-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-11-23 19:30:14 -08:00
Filipe Manana
a855fbe692 btrfs: fix lockdep splat when enabling and disabling qgroups
When running test case btrfs/017 from fstests, lockdep reported the
following splat:

  [ 1297.067385] ======================================================
  [ 1297.067708] WARNING: possible circular locking dependency detected
  [ 1297.068022] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
  [ 1297.068322] ------------------------------------------------------
  [ 1297.068629] btrfs/189080 is trying to acquire lock:
  [ 1297.068929] ffff9f2725731690 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_enable+0xaf/0xa70 [btrfs]
  [ 1297.069274]
		 but task is already holding lock:
  [ 1297.069868] ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
  [ 1297.070219]
		 which lock already depends on the new lock.

  [ 1297.071131]
		 the existing dependency chain (in reverse order) is:
  [ 1297.071721]
		 -> #1 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}:
  [ 1297.072375]        lock_acquire+0xd8/0x490
  [ 1297.072710]        __mutex_lock+0xa3/0xb30
  [ 1297.073061]        btrfs_qgroup_inherit+0x59/0x6a0 [btrfs]
  [ 1297.073421]        create_subvol+0x194/0x990 [btrfs]
  [ 1297.073780]        btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
  [ 1297.074133]        __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
  [ 1297.074498]        btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
  [ 1297.074872]        btrfs_ioctl+0x1a90/0x36f0 [btrfs]
  [ 1297.075245]        __x64_sys_ioctl+0x83/0xb0
  [ 1297.075617]        do_syscall_64+0x33/0x80
  [ 1297.075993]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [ 1297.076380]
		 -> #0 (sb_internal#2){.+.+}-{0:0}:
  [ 1297.077166]        check_prev_add+0x91/0xc60
  [ 1297.077572]        __lock_acquire+0x1740/0x3110
  [ 1297.077984]        lock_acquire+0xd8/0x490
  [ 1297.078411]        start_transaction+0x3c5/0x760 [btrfs]
  [ 1297.078853]        btrfs_quota_enable+0xaf/0xa70 [btrfs]
  [ 1297.079323]        btrfs_ioctl+0x2c60/0x36f0 [btrfs]
  [ 1297.079789]        __x64_sys_ioctl+0x83/0xb0
  [ 1297.080232]        do_syscall_64+0x33/0x80
  [ 1297.080680]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [ 1297.081139]
		 other info that might help us debug this:

  [ 1297.082536]  Possible unsafe locking scenario:

  [ 1297.083510]        CPU0                    CPU1
  [ 1297.084005]        ----                    ----
  [ 1297.084500]   lock(&fs_info->qgroup_ioctl_lock);
  [ 1297.084994]                                lock(sb_internal#2);
  [ 1297.085485]                                lock(&fs_info->qgroup_ioctl_lock);
  [ 1297.085974]   lock(sb_internal#2);
  [ 1297.086454]
		  *** DEADLOCK ***
  [ 1297.087880] 3 locks held by btrfs/189080:
  [ 1297.088324]  #0: ffff9f2725731470 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0xa73/0x36f0 [btrfs]
  [ 1297.088799]  #1: ffff9f2702b60cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
  [ 1297.089284]  #2: ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
  [ 1297.089771]
		 stack backtrace:
  [ 1297.090662] CPU: 5 PID: 189080 Comm: btrfs Not tainted 5.10.0-rc4-btrfs-next-73 #1
  [ 1297.091132] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  [ 1297.092123] Call Trace:
  [ 1297.092629]  dump_stack+0x8d/0xb5
  [ 1297.093115]  check_noncircular+0xff/0x110
  [ 1297.093596]  check_prev_add+0x91/0xc60
  [ 1297.094076]  ? kvm_clock_read+0x14/0x30
  [ 1297.094553]  ? kvm_sched_clock_read+0x5/0x10
  [ 1297.095029]  __lock_acquire+0x1740/0x3110
  [ 1297.095510]  lock_acquire+0xd8/0x490
  [ 1297.095993]  ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
  [ 1297.096476]  start_transaction+0x3c5/0x760 [btrfs]
  [ 1297.096962]  ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
  [ 1297.097451]  btrfs_quota_enable+0xaf/0xa70 [btrfs]
  [ 1297.097941]  ? btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
  [ 1297.098429]  btrfs_ioctl+0x2c60/0x36f0 [btrfs]
  [ 1297.098904]  ? do_user_addr_fault+0x20c/0x430
  [ 1297.099382]  ? kvm_clock_read+0x14/0x30
  [ 1297.099854]  ? kvm_sched_clock_read+0x5/0x10
  [ 1297.100328]  ? sched_clock+0x5/0x10
  [ 1297.100801]  ? sched_clock_cpu+0x12/0x180
  [ 1297.101272]  ? __x64_sys_ioctl+0x83/0xb0
  [ 1297.101739]  __x64_sys_ioctl+0x83/0xb0
  [ 1297.102207]  do_syscall_64+0x33/0x80
  [ 1297.102673]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [ 1297.103148] RIP: 0033:0x7f773ff65d87

This is because during the quota enable ioctl we lock first the mutex
qgroup_ioctl_lock and then start a transaction, and starting a transaction
acquires a fs freeze semaphore (at the VFS level). However, every other
code path, except for the quota disable ioctl path, we do the opposite:
we start a transaction and then lock the mutex.

So fix this by making the quota enable and disable paths to start the
transaction without having the mutex locked, and then, after starting the
transaction, lock the mutex and check if some other task already enabled
or disabled the quotas, bailing with success if that was the case.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-23 21:16:43 +01:00
Filipe Manana
7aa6d35984 btrfs: do nofs allocations when adding and removing qgroup relations
When adding or removing a qgroup relation we are doing a GFP_KERNEL
allocation which is not safe because we are holding a transaction
handle open and that can make us deadlock if the allocator needs to
recurse into the filesystem. So just surround those calls with a
nofs context.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-23 21:16:40 +01:00
Filipe Manana
3d05cad3c3 btrfs: fix lockdep splat when reading qgroup config on mount
Lockdep reported the following splat when running test btrfs/190 from
fstests:

  [ 9482.126098] ======================================================
  [ 9482.126184] WARNING: possible circular locking dependency detected
  [ 9482.126281] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
  [ 9482.126365] ------------------------------------------------------
  [ 9482.126456] mount/24187 is trying to acquire lock:
  [ 9482.126534] ffffa0c869a7dac0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}, at: qgroup_rescan_init+0x43/0xf0 [btrfs]
  [ 9482.126647]
		 but task is already holding lock:
  [ 9482.126777] ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs]
  [ 9482.126886]
		 which lock already depends on the new lock.

  [ 9482.127078]
		 the existing dependency chain (in reverse order) is:
  [ 9482.127213]
		 -> #1 (btrfs-quota-00){++++}-{3:3}:
  [ 9482.127366]        lock_acquire+0xd8/0x490
  [ 9482.127436]        down_read_nested+0x45/0x220
  [ 9482.127528]        __btrfs_tree_read_lock+0x27/0x120 [btrfs]
  [ 9482.127613]        btrfs_read_lock_root_node+0x41/0x130 [btrfs]
  [ 9482.127702]        btrfs_search_slot+0x514/0xc30 [btrfs]
  [ 9482.127788]        update_qgroup_status_item+0x72/0x140 [btrfs]
  [ 9482.127877]        btrfs_qgroup_rescan_worker+0xde/0x680 [btrfs]
  [ 9482.127964]        btrfs_work_helper+0xf1/0x600 [btrfs]
  [ 9482.128039]        process_one_work+0x24e/0x5e0
  [ 9482.128110]        worker_thread+0x50/0x3b0
  [ 9482.128181]        kthread+0x153/0x170
  [ 9482.128256]        ret_from_fork+0x22/0x30
  [ 9482.128327]
		 -> #0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}:
  [ 9482.128464]        check_prev_add+0x91/0xc60
  [ 9482.128551]        __lock_acquire+0x1740/0x3110
  [ 9482.128623]        lock_acquire+0xd8/0x490
  [ 9482.130029]        __mutex_lock+0xa3/0xb30
  [ 9482.130590]        qgroup_rescan_init+0x43/0xf0 [btrfs]
  [ 9482.131577]        btrfs_read_qgroup_config+0x43a/0x550 [btrfs]
  [ 9482.132175]        open_ctree+0x1228/0x18a0 [btrfs]
  [ 9482.132756]        btrfs_mount_root.cold+0x13/0xed [btrfs]
  [ 9482.133325]        legacy_get_tree+0x30/0x60
  [ 9482.133866]        vfs_get_tree+0x28/0xe0
  [ 9482.134392]        fc_mount+0xe/0x40
  [ 9482.134908]        vfs_kern_mount.part.0+0x71/0x90
  [ 9482.135428]        btrfs_mount+0x13b/0x3e0 [btrfs]
  [ 9482.135942]        legacy_get_tree+0x30/0x60
  [ 9482.136444]        vfs_get_tree+0x28/0xe0
  [ 9482.136949]        path_mount+0x2d7/0xa70
  [ 9482.137438]        do_mount+0x75/0x90
  [ 9482.137923]        __x64_sys_mount+0x8e/0xd0
  [ 9482.138400]        do_syscall_64+0x33/0x80
  [ 9482.138873]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [ 9482.139346]
		 other info that might help us debug this:

  [ 9482.140735]  Possible unsafe locking scenario:

  [ 9482.141594]        CPU0                    CPU1
  [ 9482.142011]        ----                    ----
  [ 9482.142411]   lock(btrfs-quota-00);
  [ 9482.142806]                                lock(&fs_info->qgroup_rescan_lock);
  [ 9482.143216]                                lock(btrfs-quota-00);
  [ 9482.143629]   lock(&fs_info->qgroup_rescan_lock);
  [ 9482.144056]
		  *** DEADLOCK ***

  [ 9482.145242] 2 locks held by mount/24187:
  [ 9482.145637]  #0: ffffa0c8411c40e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xb9/0x400
  [ 9482.146061]  #1: ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs]
  [ 9482.146509]
		 stack backtrace:
  [ 9482.147350] CPU: 1 PID: 24187 Comm: mount Not tainted 5.10.0-rc4-btrfs-next-73 #1
  [ 9482.147788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  [ 9482.148709] Call Trace:
  [ 9482.149169]  dump_stack+0x8d/0xb5
  [ 9482.149628]  check_noncircular+0xff/0x110
  [ 9482.150090]  check_prev_add+0x91/0xc60
  [ 9482.150561]  ? kvm_clock_read+0x14/0x30
  [ 9482.151017]  ? kvm_sched_clock_read+0x5/0x10
  [ 9482.151470]  __lock_acquire+0x1740/0x3110
  [ 9482.151941]  ? __btrfs_tree_read_lock+0x27/0x120 [btrfs]
  [ 9482.152402]  lock_acquire+0xd8/0x490
  [ 9482.152887]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
  [ 9482.153354]  __mutex_lock+0xa3/0xb30
  [ 9482.153826]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
  [ 9482.154301]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
  [ 9482.154768]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
  [ 9482.155226]  qgroup_rescan_init+0x43/0xf0 [btrfs]
  [ 9482.155690]  btrfs_read_qgroup_config+0x43a/0x550 [btrfs]
  [ 9482.156160]  open_ctree+0x1228/0x18a0 [btrfs]
  [ 9482.156643]  btrfs_mount_root.cold+0x13/0xed [btrfs]
  [ 9482.157108]  ? rcu_read_lock_sched_held+0x5d/0x90
  [ 9482.157567]  ? kfree+0x31f/0x3e0
  [ 9482.158030]  legacy_get_tree+0x30/0x60
  [ 9482.158489]  vfs_get_tree+0x28/0xe0
  [ 9482.158947]  fc_mount+0xe/0x40
  [ 9482.159403]  vfs_kern_mount.part.0+0x71/0x90
  [ 9482.159875]  btrfs_mount+0x13b/0x3e0 [btrfs]
  [ 9482.160335]  ? rcu_read_lock_sched_held+0x5d/0x90
  [ 9482.160805]  ? kfree+0x31f/0x3e0
  [ 9482.161260]  ? legacy_get_tree+0x30/0x60
  [ 9482.161714]  legacy_get_tree+0x30/0x60
  [ 9482.162166]  vfs_get_tree+0x28/0xe0
  [ 9482.162616]  path_mount+0x2d7/0xa70
  [ 9482.163070]  do_mount+0x75/0x90
  [ 9482.163525]  __x64_sys_mount+0x8e/0xd0
  [ 9482.163986]  do_syscall_64+0x33/0x80
  [ 9482.164437]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [ 9482.164902] RIP: 0033:0x7f51e907caaa

This happens because at btrfs_read_qgroup_config() we can call
qgroup_rescan_init() while holding a read lock on a quota btree leaf,
acquired by the previous call to btrfs_search_slot_for_read(), and
qgroup_rescan_init() acquires the mutex qgroup_rescan_lock.

A qgroup rescan worker does the opposite: it acquires the mutex
qgroup_rescan_lock, at btrfs_qgroup_rescan_worker(), and then tries to
update the qgroup status item in the quota btree through the call to
update_qgroup_status_item(). This inversion of locking order
between the qgroup_rescan_lock mutex and quota btree locks causes the
splat.

Fix this simply by releasing and freeing the path before calling
qgroup_rescan_init() at btrfs_read_qgroup_config().

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-23 21:16:30 +01:00
David Sterba
6d06b0ad94 btrfs: tree-checker: add missing returns after data_ref alignment checks
There are sectorsize alignment checks that are reported but then
check_extent_data_ref continues. This was not intended, wrong alignment
is not a minor problem and we should return with error.

CC: stable@vger.kernel.org # 5.4+
Fixes: 0785a9aacf ("btrfs: tree-checker: Add EXTENT_DATA_REF check")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-23 21:16:21 +01:00
Johannes Thumshirn
0697d9a610 btrfs: don't access possibly stale fs_info data for printing duplicate device
Syzbot reported a possible use-after-free when printing a duplicate device
warning device_list_add().

At this point it can happen that a btrfs_device::fs_info is not correctly
setup yet, so we're accessing stale data, when printing the warning
message using the btrfs_printk() wrappers.

  ==================================================================
  BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
  Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068

  CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x1d6/0x29e lib/dump_stack.c:118
   print_address_description+0x66/0x620 mm/kasan/report.c:383
   __kasan_report mm/kasan/report.c:513 [inline]
   kasan_report+0x132/0x1d0 mm/kasan/report.c:530
   btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
   device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943
   btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359
   btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   fc_mount fs/namespace.c:978 [inline]
   vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
   btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   do_new_mount fs/namespace.c:2875 [inline]
   path_mount+0x179d/0x29e0 fs/namespace.c:3192
   do_mount fs/namespace.c:3205 [inline]
   __do_sys_mount fs/namespace.c:3413 [inline]
   __se_sys_mount+0x126/0x180 fs/namespace.c:3390
   do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x44840a
  RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5
  RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a
  RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630
  RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a
  R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003

  Allocated by task 6945:
   kasan_save_stack mm/kasan/common.c:48 [inline]
   kasan_set_track mm/kasan/common.c:56 [inline]
   __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461
   kmalloc_node include/linux/slab.h:577 [inline]
   kvmalloc_node+0x81/0x110 mm/util.c:574
   kvmalloc include/linux/mm.h:757 [inline]
   kvzalloc include/linux/mm.h:765 [inline]
   btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   fc_mount fs/namespace.c:978 [inline]
   vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
   btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   do_new_mount fs/namespace.c:2875 [inline]
   path_mount+0x179d/0x29e0 fs/namespace.c:3192
   do_mount fs/namespace.c:3205 [inline]
   __do_sys_mount fs/namespace.c:3413 [inline]
   __se_sys_mount+0x126/0x180 fs/namespace.c:3390
   do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Freed by task 6945:
   kasan_save_stack mm/kasan/common.c:48 [inline]
   kasan_set_track+0x3d/0x70 mm/kasan/common.c:56
   kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355
   __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422
   __cache_free mm/slab.c:3418 [inline]
   kfree+0x113/0x200 mm/slab.c:3756
   deactivate_locked_super+0xa7/0xf0 fs/super.c:335
   btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   fc_mount fs/namespace.c:978 [inline]
   vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
   btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x88/0x270 fs/super.c:1547
   do_new_mount fs/namespace.c:2875 [inline]
   path_mount+0x179d/0x29e0 fs/namespace.c:3192
   do_mount fs/namespace.c:3205 [inline]
   __do_sys_mount fs/namespace.c:3413 [inline]
   __se_sys_mount+0x126/0x180 fs/namespace.c:3390
   do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  The buggy address belongs to the object at ffff8880878e0000
   which belongs to the cache kmalloc-16k of size 16384
  The buggy address is located 1704 bytes inside of
   16384-byte region [ffff8880878e0000, ffff8880878e4000)
  The buggy address belongs to the page:
  page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0
  head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0
  flags: 0xfffe0000010200(slab|head)
  raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00
  raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  >ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
				    ^
   ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==================================================================

The syzkaller reproducer for this use-after-free crafts a filesystem image
and loop mounts it twice in a loop. The mount will fail as the crafted
image has an invalid chunk tree. When this happens btrfs_mount_root() will
call deactivate_locked_super(), which then cleans up fs_info and
fs_info::sb. If a second thread now adds the same block-device to the
filesystem, it will get detected as a duplicate device and
device_list_add() will reject the duplicate and print a warning. But as
the fs_info pointer passed in is non-NULL this will result in a
use-after-free.

Instead of printing possibly uninitialized or already freed memory in
btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the
device name will be skipped altogether.

There was a slightly different approach discussed in
https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u

Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com
Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-23 21:16:12 +01:00
Jens Axboe
36f4fa6886 io_uring: add support for shutdown(2)
This adds support for the shutdown(2) system call, which is useful for
dealing with sockets.

shutdown(2) may block, so we have to punt it to async context.

Suggested-by: Norman Maurer <norman.maurer@googlemail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-23 09:15:15 -07:00
Jens Axboe
ce59fc69b1 io_uring: allow SQPOLL with CAP_SYS_NICE privileges
CAP_SYS_ADMIN is too restrictive for a lot of uses cases, allow
CAP_SYS_NICE based on the premise that such users are already allowed
to raise the priority of tasks.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-23 09:15:15 -07:00
Gustavo A. R. Silva
8fca3c8a34 ext2: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of just letting the code
fall through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Link: https://lore.kernel.org/r/73d8ae2d06d639815672ee9ee4550ea4bfa08489.1605896059.git.gustavoars@kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-11-23 10:36:53 +01:00
Linus Torvalds
68d3fa235f Couple of EFI fixes for v5.10:
- fix memory leak in efivarfs driver
 - fix HYP mode issue in 32-bit ARM version of the EFI stub when built in
   Thumb2 mode
 - avoid leaking EFI pgd pages on allocation failure
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEnNKg2mrY9zMBdeK7wjcgfpV0+n0FAl+tRcoACgkQwjcgfpV0
 +n1LZQf/af1p9A0zT1nC3IrcaABceJDnD3dJuV9SD6QhFuD2Dw/Mshr+MVzDsO+3
 btvuu0r4CzQ5ajfpfcGcvBFFWbbPTwKvWQe++9Unwoz5acw7hpV5yxNwMivdaJEh
 3o4pkgpCmwtliTwiroDficO7Vlqefqf4LZd7/iQYQTnuPK3waYQBjwp9t2D9tlx7
 kXiEQDP2BDRCUrKEjlR7AHTZ156mw+UsiquAuxMCGTKBqwiELEEV6aPseqa5MmNV
 RDV1IXWdhOQyQfzg0s6vTzwGeN0JubSxHng6O9UbE+tctz4EqaaHIEsRuMBq8oLD
 Y8JeGp1ovypTJxeLE5t6eEzsfvTRsg==
 =GnmM
 -----END PGP SIGNATURE-----

Merge tag 'efi-urgent-for-v5.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull EFI fixes from Borislav Petkov:
 "Forwarded EFI fixes from Ard Biesheuvel:

   - fix memory leak in efivarfs driver

   - fix HYP mode issue in 32-bit ARM version of the EFI stub when built
     in Thumb2 mode

   - avoid leaking EFI pgd pages on allocation failure"

* tag 'efi-urgent-for-v5.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi/x86: Free efi_pgd with free_pages()
  efivarfs: fix memory leak in efivarfs_create()
  efi/arm: set HSCTLR Thumb2 bit correctly for HVC calls from HYP
2020-11-22 13:05:48 -08:00
Linus Torvalds
4a51c60a11 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "8 patches.

  Subsystems affected by this patch series: mm (madvise, pagemap,
  readahead, memcg, userfaultfd), kbuild, and vfs"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: fix madvise WILLNEED performance problem
  libfs: fix error cast of negative value in simple_attr_write()
  mm/userfaultfd: do not access vma->vm_mm after calling handle_userfault()
  mm: memcg/slab: fix root memcg vmstats
  mm: fix readahead_page_batch for retry entries
  mm: fix phys_to_target_node() and memory_add_physaddr_to_nid() exports
  compiler-clang: remove version check for BPF Tracing
  mm/madvise: fix memory leak from process_madvise
2020-11-22 12:14:46 -08:00
Linus Torvalds
a7f07fc14f A final set of miscellaneous bug fixes for ext4
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl+6tkoACgkQ8vlZVpUN
 gaO1xgf/aJ5chEWFEQVrdSdd+cvhuILz1Hp2iW8xdZgPeN2ovWIC3LPCTDr9FWB0
 MhpGS9avYIf8mHZgsw7HVzqUv6gPcT0khragPp348QJzxnbz/saZ5ujK/WR2zJxr
 SoB9f2vdqW0gBbKMO6avXm0gTnuNemcK5oH6tzI5ECBpV3Ltk1dJWtgQkVp9rAyP
 EFEb9hUYpdZ3J1cm8SCUIO99Tu2KMd+yNRv42z0BKNTfBNe2P5aG56p1sMYMcIr7
 BiVUrhPkbAf3gMsMDzZQE5mHZHyzJoHNHssLHWcEU/o9Wd2wZHjAEzzn9Tz49rUg
 yhYTrhLcQJcfmL0XvgrIhJaXTfMc6w==
 =s/1A
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "A final set of miscellaneous bug fixes for ext4"

* tag 'ext4_for_linus_fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix bogus warning in ext4_update_dx_flag()
  jbd2: fix kernel-doc markups
  ext4: drop fast_commit from /proc/mounts
2020-11-22 11:39:32 -08:00
David Howells
a9e5c87ca7 afs: Fix speculative status fetch going out of order wrt to modifications
When doing a lookup in a directory, the afs filesystem uses a bulk
status fetch to speculatively retrieve the statuses of up to 48 other
vnodes found in the same directory and it will then either update extant
inodes or create new ones - effectively doing 'lookup ahead'.

To avoid the possibility of deadlocking itself, however, the filesystem
doesn't lock all of those inodes; rather just the directory inode is
locked (by the VFS).

When the operation completes, afs_inode_init_from_status() or
afs_apply_status() is called, depending on whether the inode already
exists, to commit the new status.

A case exists, however, where the speculative status fetch operation may
straddle a modification operation on one of those vnodes.  What can then
happen is that the speculative bulk status RPC retrieves the old status,
and whilst that is happening, the modification happens - which returns
an updated status, then the modification status is committed, then we
attempt to commit the speculative status.

This results in something like the following being seen in dmesg:

	kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus

showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus
say that the vnode had data version 8 when we'd already recorded version
9 due to a local modification.  This was causing the cache to be
invalidated for that vnode when it shouldn't have been.  If it happens
on a data file, this might lead to local changes being lost.

Fix this by ignoring speculative status updates if the data version
doesn't match the expected value.

Note that it is possible to get a DV regression if a volume gets
restored from a backup - but we should get a callback break in such a
case that should trigger a recheck anyway.  It might be worth checking
the volume creation time in the volsync info and, if a change is
observed in that (as would happen on a restore), invalidate all caches
associated with the volume.

Fixes: 5cf9dd55a0 ("afs: Prospectively look up extra files when doing a single lookup")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-22 11:27:03 -08:00
Yicong Yang
488dac0c92 libfs: fix error cast of negative value in simple_attr_write()
The attr->set() receive a value of u64, but simple_strtoll() is used for
doing the conversion.  It will lead to the error cast if user inputs a
negative value.

Use kstrtoull() instead of simple_strtoll() to convert a string got from
the user to an unsigned value.  The former will return '-EINVAL' if it
gets a negetive value, but the latter can't handle the situation
correctly.  Make 'val' unsigned long long as what kstrtoull() takes,
this will eliminate the compile warning on no 64-bit architectures.

Fixes: f7b88631a8 ("fs/libfs.c: fix simple_attr_write() on 32bit machines")
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lkml.kernel.org/r/1605341356-11872-1-git-send-email-yangyicong@hisilicon.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-22 10:48:22 -08:00
Linus Torvalds
a349e4c659 Fixes for 5.10-rc5:
- Fix various deficiencies in online fsck's metadata checking code.
 - Fix an integer casting bug in the xattr code on 32-bit systems.
 - Fix a hang in an inode walk when the inode index is corrupt.
 - Fix error codes being dropped when initializing per-AG structures
 - Fix nowait directio writes that partially succeed but return EAGAIN.
 - Revert last week's rmap comparison patch because it was wrong.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl+38UwACgkQ+H93GTRK
 tOtMFQ/9EV2/I673TJj8GY5gJE9mUXGbiUyMedl8XammhNPL0JZMAGsnMXBWCtjO
 0pmO+TS4epwlYpZ4XVLlxNUxUwDkxgq3nHKbEz2jezepKPTCasL7XZOECGQ1gKdA
 11BRNP7Y91ndYtVZGxHu+92oeZAzJgTh6OVYtJytniTgF9r96hgr/+3dA8GQxkqm
 bkkfWfKxxCwMYLRLRNcnVbkj0xDMgmKOILyFR63ZhW8RtrfmdIUYDUty7RGvj4bJ
 csZmrkcu/wIj+9NeXw8KS5KpNOWu2q3baORXe6EodoVgFMa4I11kiuGucZehsIbH
 yNgTLDaFNUm1aBCkSrYtz7m4iwLq8No7XB/OIXrALSd5yJqaXhDyMnEV/tBeAL7D
 MXn032Sc6hPSyGBtCmurTSo61oKP3HjgMXA4vvNw5CxJ7Q4EoZyBXCdHtZcRnB7+
 MSa+ylBTbmP/AJ2AQrPiArGlAKUTnJM6WknIBCWiIueRtadTh1cquBFVbDxoEIX5
 eKcjdQrX2xNrFNE2rRuYI4ml+wwtdgk7JO41gjAw+NA2V1LJW6Q5A5RKX2PiOidC
 oGdNPTLG7Rfh7sMaPo66X3xTQPoOwcV0O+ArXlFNDBZXDUw0d1tWzVfYo+/2Zym6
 3sFcTKMdTKtG8NasNjvbanmZTV1VLbZAJRdevH1NFAWUICiTBkY=
 =HoPI
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.10-fixes-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "The critical fixes are for a crash that someone reported in the xattr
  code on 32-bit arm last week; and a revert of the rmap key comparison
  change from last week as it was totally wrong. I need a vacation. :(

  Summary:

   - Fix various deficiencies in online fsck's metadata checking code

   - Fix an integer casting bug in the xattr code on 32-bit systems

   - Fix a hang in an inode walk when the inode index is corrupt

   - Fix error codes being dropped when initializing per-AG structures

   - Fix nowait directio writes that partially succeed but return EAGAIN

   - Revert last week's rmap comparison patch because it was wrong"

* tag 'xfs-5.10-fixes-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: revert "xfs: fix rmap key and record comparison functions"
  xfs: don't allow NOWAIT DIO across extent boundaries
  xfs: return corresponding errcode if xfs_initialize_perag() fail
  xfs: ensure inobt record walks always make forward progress
  xfs: fix forkoff miscalculation related to XFS_LITINO(mp)
  xfs: directory scrub should check the null bestfree entries too
  xfs: strengthen rmap record flags checking
  xfs: fix the minrecs logic when dealing with inode root child blocks
2020-11-21 10:36:25 -08:00
Linus Torvalds
ba911108f4 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl+4JHgACgkQnJ2qBz9k
 QNlXOQf/YHs4q4HgBI0tsStS/U4xFtmY77Rcm1pllqH6BZPBg1vRpzfh7hZvIPMa
 GceTcMAX4OmG6++fRzVgNDIuem3Jl0oDCm++pWPev+S/V06PuTu36viuFWJ3e/5g
 0wDLYXRj4dRUiQtjbSkI7LAgIX1wbTANOKSZeaKFYaGHfEcFm1GkHUuHzEBVX1Jw
 bRpaod3ikmjoaoI6TTZlKKnrKksSw6F5wHUiHu2ZHdZ6kQ36elwHFu8QXJCzkZ7F
 F9vt4IIKq6xzEVdwDXAPjsFkPp2B4Bz+AgcSpoitg/2L5hc2d/kxgI4zvpXY8TGs
 hpW6YPXEXIjhHjKX22f99ThI4BqXww==
 =bBTT
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fanotify fix from Jan Kara:
 "A single fanotify fix from Amir"

* tag 'fsnotify_for_v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fanotify: fix logic of reporting name info with watched parent
2020-11-21 10:33:33 -08:00
Linus Torvalds
fa5fca78bb io_uring-5.10-2020-11-20
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+4DAwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphdOD/9xOEnYPuekvVH9G9nyNd//Q9fPArG2+j6V
 /MCnze07GNtDt7z15oR+T07hKXmf+Ejh4nu3JJ6MUNfe/47hhJqHSxRHU6+PJCjk
 hPrsaTsDedxxLEDiLmvhXnUPzfVzJtefxVAAaKikWOb3SBqLdh7xTFSlor1HbRBl
 Zk4d343cjBDYfvSSt/zMWDzwwvramdz7rJnnPMKXITu64ITL5314vuK2YVZmBOet
 YujSah7J8FL1jKhiG1Iw5rayd2Q3smnHWIEQ+lvW6WiTvMJMLOxif2xNF4/VEZs1
 CBGJUQt42LI6QGEzRBHohcefZFuPGoxnduSzHCOIhh7d6+k+y9mZfsPGohr3g9Ov
 NotXpVonnA7GbRqzo1+IfBRve7iRONdZ3/LBwyRmqav4I4jX68wXBNH5IDpVR0Sn
 c31avxa/ZL7iLIBx32enp0/r3mqNTQotEleSLUdyJQXAZTyG2INRhjLLXTqSQ5BX
 oVp0fZzKCwsr6HCPZpXZ/f2G7dhzuF0ghoceC02GsOVooni22gdVnQj+AWNus398
 e+wcimT4MX6AHNFxO2aUtJow0KWWZRzC1p5Mxu/9W3YiMtJiC0YOGePfSqiTqX0g
 Uk0H5dOAgBUQrAsusf7bKr0K6W25yEk/JipxhWqi0rC71x42mLTsCT1wxSCvLwqs
 WxhdtVKroQ==
 =7PAe
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.10-2020-11-20' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Mostly regression or stable fodder:

   - Disallow async path resolution of /proc/self

   - Tighten constraints for segmented async buffered reads

   - Fix double completion for a retry error case

   - Fix for fixed file life times (Pavel)"

* tag 'io_uring-5.10-2020-11-20' of git://git.kernel.dk/linux-block:
  io_uring: order refnode recycling
  io_uring: get an active ref_node from files_data
  io_uring: don't double complete failed reissue request
  mm: never attempt async page lock if we've transferred data already
  io_uring: handle -EOPNOTSUPP on path resolution
  proc: don't allow async path resolution of /proc/self components
2020-11-20 11:47:22 -08:00
YiFei Zhu
0d8315dddd seccomp/cache: Report cache data through /proc/pid/seccomp_cache
Currently the kernel does not provide an infrastructure to translate
architecture numbers to a human-readable name. Translating syscall
numbers to syscall names is possible through FTRACE_SYSCALL
infrastructure but it does not provide support for compat syscalls.

This will create a file for each PID as /proc/pid/seccomp_cache.
The file will be empty when no seccomp filters are loaded, or be
in the format of:
<arch name> <decimal syscall number> <ALLOW | FILTER>
where ALLOW means the cache is guaranteed to allow the syscall,
and filter means the cache will pass the syscall to the BPF filter.

For the docker default profile on x86_64 it looks like:
x86_64 0 ALLOW
x86_64 1 ALLOW
x86_64 2 ALLOW
x86_64 3 ALLOW
[...]
x86_64 132 ALLOW
x86_64 133 ALLOW
x86_64 134 FILTER
x86_64 135 FILTER
x86_64 136 FILTER
x86_64 137 ALLOW
x86_64 138 ALLOW
x86_64 139 FILTER
x86_64 140 ALLOW
x86_64 141 ALLOW
[...]

This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default
of N because I think certain users of seccomp might not want the
application to know which syscalls are definitely usable. For
the same reason, it is also guarded by CAP_SYS_ADMIN.

Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/94e663fa53136f5a11f432c661794d1ee7060779.1605101222.git.yifeifz2@illinois.edu
2020-11-20 11:16:35 -08:00
Eric Biggers
a24d22b225 crypto: sha - split sha.h into sha1.h and sha2.h
Currently <crypto/sha.h> contains declarations for both SHA-1 and SHA-2,
and <crypto/sha3.h> contains declarations for SHA-3.

This organization is inconsistent, but more importantly SHA-1 is no
longer considered to be cryptographically secure.  So to the extent
possible, SHA-1 shouldn't be grouped together with any of the other SHA
versions, and usage of it should be phased out.

Therefore, split <crypto/sha.h> into two headers <crypto/sha1.h> and
<crypto/sha2.h>, and make everyone explicitly specify whether they want
the declarations for SHA-1, SHA-2, or both.

This avoids making the SHA-1 declarations visible to files that don't
want anything to do with SHA-1.  It also prepares for potentially moving
sha1.h into a new insecure/ or dangerous/ directory.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-11-20 14:45:33 +11:00
Jan Kara
f902b21650 ext4: fix bogus warning in ext4_update_dx_flag()
The idea of the warning in ext4_update_dx_flag() is that we should warn
when we are clearing EXT4_INODE_INDEX on a filesystem with metadata
checksums enabled since after clearing the flag, checksums for internal
htree nodes will become invalid. So there's no need to warn (or actually
do anything) when EXT4_INODE_INDEX is not set.

Link: https://lore.kernel.org/r/20201118153032.17281-1-jack@suse.cz
Fixes: 48a3431195 ("ext4: fix checksum errors with indexed dirs")
Reported-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2020-11-19 22:41:10 -05:00
Mauro Carvalho Chehab
2bf31d9442 jbd2: fix kernel-doc markups
Kernel-doc markup should use this format:
        identifier - description

They should not have any type before that, as otherwise
the parser won't do the right thing.

Also, some identifiers have different names between their
prototypes and the kernel-doc markup.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/72f5c6628f5f278d67625f60893ffbc2ca28d46e.1605521731.git.mchehab+huawei@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-19 22:38:29 -05:00
Darrick J. Wong
eb8409071a xfs: revert "xfs: fix rmap key and record comparison functions"
This reverts commit 6ff646b2ce.

Your maintainer committed a major braino in the rmap code by adding the
attr fork, bmbt, and unwritten extent usage bits into rmap record key
comparisons.  While XFS uses the usage bits *in the rmap records* for
cross-referencing metadata in xfs_scrub and xfs_repair, it only needs
the owner and offset information to distinguish between reverse mappings
of the same physical extent into the data fork of a file at multiple
offsets.  The other bits are not important for key comparisons for index
lookups, and never have been.

Eric Sandeen reports that this causes regressions in generic/299, so
undo this patch before it does more damage.

Reported-by: Eric Sandeen <sandeen@sandeen.net>
Fixes: 6ff646b2ce ("xfs: fix rmap key and record comparison functions")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2020-11-19 15:17:50 -08:00
Theodore Ts'o
704c2317ca ext4: drop fast_commit from /proc/mounts
The options in /proc/mounts must be valid mount options --- and
fast_commit is not a mount option.  Otherwise, command sequences like
this will fail:

    # mount /dev/vdc /vdc
    # mkdir -p /vdc/phoronix_test_suite /pts
    # mount --bind /vdc/phoronix_test_suite /pts
    # mount -o remount,nodioread_nolock /pts
    mount: /pts: mount point not mounted or bad option.

And in the system logs, you'll find:

    EXT4-fs (vdc): Unrecognized mount option "fast_commit" or missing value

Fixes: 995a3ed67f ("ext4: add fast_commit feature and handling for extended mount options")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-19 15:41:57 -05:00
Dave Chinner
883a790a84 xfs: don't allow NOWAIT DIO across extent boundaries
Jens has reported a situation where partial direct IOs can be issued
and completed yet still return -EAGAIN. We don't want this to report
a short IO as we want XFS to complete user DIO entirely or not at
all.

This partial IO situation can occur on a write IO that is split
across an allocated extent and a hole, and the second mapping is
returning EAGAIN because allocation would be required.

The trivial reproducer:

$ sudo xfs_io -fdt -c "pwrite 0 4k" -c "pwrite -V 1 -b 8k -N 0 8k" /mnt/scr/foo
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0001 sec (27.509 MiB/sec and 7042.2535 ops/sec)
pwrite: Resource temporarily unavailable
$

The pwritev2(0, 8kB, RWF_NOWAIT) call returns EAGAIN having done
the first 4kB write:

 xfs_file_direct_write: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 0x2000
 iomap_apply:          dev 259:1 ino 0x83 pos 0 length 8192 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
 xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
 xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
 xfs_iomap_found:      dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 8192 fork data startoff 0x0 startblock 24 blockcount 0x1
 iomap_apply_dstmap:   dev 259:1 ino 0x83 bdev 259:1 addr 102400 offset 0 length 4096 type MAPPED flags DIRTY

Here the first iomap loop has mapped the first 4kB of the file and
issued the IO, and we enter the second iomap_apply loop:

 iomap_apply: dev 259:1 ino 0x83 pos 4096 length 4096 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
 xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
 xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin

And we exit with -EAGAIN out because we hit the allocate case trying
to make the second 4kB block.

Then IO completes on the first 4kB and the original IO context
completes and unlocks the inode, returning -EAGAIN to userspace:

 xfs_end_io_direct_write: dev 259:1 ino 0x83 isize 0x1000 disize 0x1000 offset 0x0 count 4096
 xfs_iunlock:          dev 259:1 ino 0x83 flags IOLOCK_SHARED caller xfs_file_dio_aio_write

There are other vectors to the same problem when we re-enter the
mapping code if we have to make multiple mappinfs under NOWAIT
conditions. e.g. failing trylocks, COW extents being found,
allocation being required, and so on.

Avoid all these potential problems by only allowing IOMAP_NOWAIT IO
to go ahead if the mapping we retrieve for the IO spans an entire
allocated extent. This avoids the possibility of subsequent mappings
to complete the IO from triggering NOWAIT semantics by any means as
NOWAIT IO will now only enter the mapping code once per NOWAIT IO.

Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-11-19 08:59:11 -08:00
Dominique Martinet
5bfe97d738 9p: Fix writeback fid incorrectly being attached to dentry
v9fs_dir_release needs fid->ilist to have been initialized for filp's
fid, not the inode's writeback fid's.

With refcounting this can be improved on later but this appears to fix
null deref issues.

Link: http://lkml.kernel.org/r/1605802012-31133-3-git-send-email-asmadeus@codewreck.org
Fixes: 6636b6dcc3 ("fs/9p: track open fids")
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2020-11-19 17:22:28 +01:00
Dominique Martinet
ff5e72ebef 9p: apply review requests for fid refcounting
Fix style issues in parent commit ("apply review requests for fid
refcounting"), no functional change.

Link: http://lkml.kernel.org/r/1605802012-31133-2-git-send-email-asmadeus@codewreck.org
Fixes: 6636b6dcc3 ("9p: add refcount to p9_fid struct")
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2020-11-19 17:21:34 +01:00
Jianyong Wu
6636b6dcc3 9p: add refcount to p9_fid struct
Fix race issue in fid contention.

Eric's and Greg's patch offer a mechanism to fix open-unlink-f*syscall
bug in 9p. But there is race issue in fid parallel accesses.
As Greg's patch stores all of fids from opened files into according inode,
so all the lookup fid ops can retrieve fid from inode preferentially. But
there is no mechanism to handle the fid contention issue. For example,
there are two threads get the same fid in the same time and one of them
clunk the fid before the other thread ready to discard the fid. In this
scenario, it will lead to some fatal problems, even kernel core dump.

I introduce a mechanism to fix this race issue. A counter field introduced
into p9_fid struct to store the reference counter to the fid. When a fid
is allocated from the inode or dentry, the counter will increase, and
will decrease at the end of its occupation. It is guaranteed that the
fid won't be clunked before the reference counter go down to 0, then
we can avoid the clunked fid to be used.

tests:
race issue test from the old test case:
for file in {01..50}; do touch f.${file}; done
seq 1 1000 | xargs -n 1 -P 50 -I{} cat f.* > /dev/null

open-unlink-f*syscall test:
I have tested for f*syscall include: ftruncate fstat fchown fchmod faccessat.

Link: http://lkml.kernel.org/r/20200923141146.90046-5-jianyong.wu@arm.com
Fixes: 478ba09edc ("fs/9p: search open fids first")
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2020-11-19 17:20:39 +01:00
Yu Kuai
595189c25c xfs: return corresponding errcode if xfs_initialize_perag() fail
In xfs_initialize_perag(), if kmem_zalloc(), xfs_buf_hash_init(), or
radix_tree_preload() failed, the returned value 'error' is not set
accordingly.

Reported-as-fixing: 8b26c5825e ("xfs: handle ENOMEM correctly during initialisation of perag structures")
Fixes: 9b24717979 ("xfs: cache unlinked pointers in an rhashtable")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-11-18 09:23:51 -08:00
Darrick J. Wong
27c14b5daa xfs: ensure inobt record walks always make forward progress
The aim of the inode btree record iterator function is to call a
callback on every record in the btree.  To avoid having to tear down and
recreate the inode btree cursor around every callback, it caches a
certain number of records in a memory buffer.  After each batch of
callback invocations, we have to perform a btree lookup to find the
next record after where we left off.

However, if the keys of the inode btree are corrupt, the lookup might
put us in the wrong part of the inode btree, causing the walk function
to loop forever.  Therefore, we add extra cursor tracking to make sure
that we never go backwards neither when performing the lookup nor when
jumping to the next inobt record.  This also fixes an off by one error
where upon resume the lookup should have been for the inode /after/ the
point at which we stopped.

Found by fuzzing xfs/460 with keys[2].startino = ones causing bulkstat
and quotacheck to hang.

Fixes: a211432c27 ("xfs: create simplified inode walk function")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-11-18 09:23:51 -08:00
Gao Xiang
ada49d64fb xfs: fix forkoff miscalculation related to XFS_LITINO(mp)
Currently, commit e9e2eae89d dropped a (int) decoration from
XFS_LITINO(mp), and since sizeof() expression is also involved,
the result of XFS_LITINO(mp) is simply as the size_t type
(commonly unsigned long).

Considering the expression in xfs_attr_shortform_bytesfit():
  offset = (XFS_LITINO(mp) - bytes) >> 3;
let "bytes" be (int)340, and
    "XFS_LITINO(mp)" be (unsigned long)336.

on 64-bit platform, the expression is
  offset = ((unsigned long)336 - (int)340) >> 3 =
           (int)(0xfffffffffffffffcUL >> 3) = -1

but on 32-bit platform, the expression is
  offset = ((unsigned long)336 - (int)340) >> 3 =
           (int)(0xfffffffcUL >> 3) = 0x1fffffff
instead.

so offset becomes a large positive number on 32-bit platform, and
cause xfs_attr_shortform_bytesfit() returns maxforkoff rather than 0.

Therefore, one result is
  "ASSERT(new_size <= XFS_IFORK_SIZE(ip, whichfork));"

assertion failure in xfs_idata_realloc(), which was also the root
cause of the original bugreport from Dennis, see:
   https://bugzilla.redhat.com/show_bug.cgi?id=1894177

And it can also be manually triggered with the following commands:
  $ touch a;
  $ setfattr -n user.0 -v "`seq 0 80`" a;
  $ setfattr -n user.1 -v "`seq 0 80`" a

on 32-bit platform.

Fix the case in xfs_attr_shortform_bytesfit() by bailing out
"XFS_LITINO(mp) < bytes" in advance suggested by Eric and a misleading
comment together with this bugfix suggested by Darrick. It seems the
other users of XFS_LITINO(mp) are not impacted.

Fixes: e9e2eae89d ("xfs: only check the superblock version for dinode size calculation")
Cc: <stable@vger.kernel.org> # 5.7+
Reported-and-tested-by: Dennis Gilmore <dgilmore@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-11-18 09:23:51 -08:00
Darrick J. Wong
6b48e5b8a2 xfs: directory scrub should check the null bestfree entries too
Teach the directory scrubber to check all the bestfree entries,
including the null ones.  We want to be able to detect the case where
the entry is null but there actually /is/ a directory data block.

Found by fuzzing lbests[0] = ones in xfs/391.

Fixes: df481968f3 ("xfs: scrub directory freespace")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-11-18 09:23:50 -08:00
Darrick J. Wong
498fe261f0 xfs: strengthen rmap record flags checking
We always know the correct state of the rmap record flags (attr, bmbt,
unwritten) so check them by direct comparison.

Fixes: d852657ccf ("xfs: cross-reference reverse-mapping btree")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-11-18 09:23:50 -08:00
Darrick J. Wong
e95b6c3ef1 xfs: fix the minrecs logic when dealing with inode root child blocks
The comment and logic in xchk_btree_check_minrecs for dealing with
inode-rooted btrees isn't quite correct.  While the direct children of
the inode root are allowed to have fewer records than what would
normally be allowed for a regular ondisk btree block, this is only true
if there is only one child block and the number of records don't fit in
the inode root.

Fixes: 08a3a692ef ("xfs: btree scrub should check minrecs")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-11-18 09:23:50 -08:00
Bob Peterson
20b3291290 gfs2: Fix regression in freeze_go_sync
Patch 541656d3a5 ("gfs2: freeze should work on read-only mounts") changed
the check for glock state in function freeze_go_sync() from "gl->gl_state
== LM_ST_SHARED" to "gl->gl_req == LM_ST_EXCLUSIVE".  That's wrong and it
regressed gfs2's freeze/thaw mechanism because it caused only the freezing
node (which requests the glock in EX) to queue freeze work.

All nodes go through this go_sync code path during the freeze to drop their
SHared hold on the freeze glock, allowing the freezing node to acquire it
in EXclusive mode. But all the nodes must freeze access to the file system
locally, so they ALL must queue freeze work. The freeze_work calls
freeze_func, which makes a request to reacquire the freeze glock in SH,
effectively blocking until the thaw from the EX holder. Once thawed, the
freezing node drops its EX hold on the freeze glock, then the (blocked)
freeze_func reacquires the freeze glock in SH again (on all nodes, including
the freezer) so all nodes go back to a thawed state.

This patch changes the check back to gl_state == LM_ST_SHARED like it was
prior to 541656d3a5.

Fixes: 541656d3a5 ("gfs2: freeze should work on read-only mounts")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-11-18 16:28:11 +01:00
Pavel Begunkov
e297822b20 io_uring: order refnode recycling
Don't recycle a refnode until we're done with all requests of nodes
ejected before.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-18 08:02:10 -07:00
Pavel Begunkov
1e5d770bb8 io_uring: get an active ref_node from files_data
An active ref_node always can be found in ctx->files_data, it's much
safer to get it this way instead of poking into files_data->ref_list.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-18 08:02:10 -07:00
Jens Axboe
c993df5a68 io_uring: don't double complete failed reissue request
Zorro reports that an xfstest test case is failing, and it turns out that
for the reissue path we can potentially issue a double completion on the
request for the failure path. There's an issue around the retry as well,
but for now, at least just make sure that we handle the error path
correctly.

Cc: stable@vger.kernel.org
Fixes: b63534c41e ("io_uring: re-issue block requests that failed because of resources")
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-17 15:17:29 -07:00
Thomas Gleixner
4cffe21d4a Merge branch 'x86/entry' into core/entry
Prepare for the merging of the syscall_work series which conflicts with the
TIF bits overhaul in X86.
2020-11-16 20:51:59 +01:00
Eric Biggers
3ceb6543e9 fscrypt: remove kernel-internal constants from UAPI header
There isn't really any valid reason to use __FSCRYPT_MODE_MAX or
FSCRYPT_POLICY_FLAGS_VALID in a userspace program.  These constants are
only meant to be used by the kernel internally, and they are defined in
the UAPI header next to the mode numbers and flags only so that kernel
developers don't forget to update them when adding new modes or flags.

In https://lkml.kernel.org/r/20201005074133.1958633-2-satyat@google.com
there was an example of someone wanting to use __FSCRYPT_MODE_MAX in a
user program, and it was wrong because the program would have broken if
__FSCRYPT_MODE_MAX were ever increased.  So having this definition
available is harmful.  FSCRYPT_POLICY_FLAGS_VALID has the same problem.

So, remove these definitions from the UAPI header.  Replace
FSCRYPT_POLICY_FLAGS_VALID with just listing the valid flags explicitly
in the one kernel function that needs it.  Move __FSCRYPT_MODE_MAX to
fscrypt_private.h, remove the double underscores (which were only
present to discourage use by userspace), and add a BUILD_BUG_ON() and
comments to (hopefully) ensure it is kept in sync.

Keep the old name FS_POLICY_FLAGS_VALID, since it's been around for
longer and there's a greater chance that removing it would break source
compatibility with some program.  Indeed, mtd-utils is using it in
an #ifdef, and removing it would introduce compiler warnings (about
FS_POLICY_FLAGS_PAD_* being redefined) into the mtd-utils build.
However, reduce its value to 0x07 so that it only includes the flags
with old names (the ones present before Linux 5.4), and try to make it
clear that it's now "frozen" and no new flags should be added to it.

Fixes: 2336d0deb2 ("fscrypt: use FSCRYPT_ prefix for uapi constants")
Cc: <stable@vger.kernel.org> # v5.4+
Link: https://lore.kernel.org/r/20201024005132.495952-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-11-16 11:41:12 -08:00
Eric Biggers
ed45e20164 fs-verity: rename "file measurement" to "file digest"
I originally chose the name "file measurement" to refer to the fs-verity
file digest to avoid confusion with traditional full-file digests or
with the bare root hash of the Merkle tree.

But the name "file measurement" hasn't caught on, and usually people are
calling it something else, usually the "file digest".  E.g. see
"struct fsverity_digest" and "struct fsverity_formatted_digest", the
libfsverity_compute_digest() and libfsverity_sign_digest() functions in
libfsverity, and the "fsverity digest" command.

Having multiple names for the same thing is always confusing.

So to hopefully avoid confusion in the future, rename
"fs-verity file measurement" to "fs-verity file digest".

This leaves FS_IOC_MEASURE_VERITY as the only reference to "measure" in
the kernel, which makes some amount of sense since the ioctl is actively
"measuring" the file.

I'll be renaming this in fsverity-utils too (though similarly the
'fsverity measure' command, which is a wrapper for
FS_IOC_MEASURE_VERITY, will stay).

Acked-by: Luca Boccassi <luca.boccassi@microsoft.com>
Link: https://lore.kernel.org/r/20201113211918.71883-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-11-16 11:40:12 -08:00
Eric Biggers
9e90f30e78 fs-verity: rename fsverity_signed_digest to fsverity_formatted_digest
The name "struct fsverity_signed_digest" is causing confusion because it
isn't actually a signed digest, but rather it's the way that the digest
is formatted in order to be signed.  Rename it to
"struct fsverity_formatted_digest" to prevent this confusion.

Also update the struct's comment to clarify that it's specific to the
built-in signature verification support and isn't a requirement for all
fs-verity users.

I'll be renaming this struct in fsverity-utils too.

Acked-by: Luca Boccassi <luca.boccassi@microsoft.com>
Link: https://lore.kernel.org/r/20201113211918.71883-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-11-16 11:40:11 -08:00
Eric Biggers
7bf765dd84 fs-verity: remove filenames from file comments
Embedding the file path inside kernel source code files isn't
particularly useful as often files are moved around and the paths become
incorrect.  checkpatch.pl warns about this since v5.10-rc1.

Acked-by: Luca Boccassi <luca.boccassi@microsoft.com>
Link: https://lore.kernel.org/r/20201113211918.71883-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-11-16 11:40:10 -08:00
Christoph Hellwig
5a5678ff3a block: unexport revalidate_disk_size
revalidate_disk_size is now only called from set_capacity_and_notify,
so drop the export.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:34:15 -07:00
Christoph Hellwig
99473d9db9 block: remove the call to __invalidate_device in check_disk_size_change
__invalidate_device without the kill_dirty parameter just invalidates
various clean entries in caches, which doesn't really help us with
anything, but can cause all kinds of horrible lock orders due to how
it calls into the file system.  The only reason this hasn't been a
major issue is because so many people use partitions, for which no
invalidation was performed anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:34:14 -07:00
Rohith Surabattula
1254100030 smb3: Handle error case during offload read path
Mid callback needs to be called only when valid data is
read into pages.

These patches address a problem found during decryption offload:
      CIFS: VFS: trying to dequeue a deleted mid
that could cause a refcount use after free:
      Workqueue: smb3decryptd smb2_decrypt_offload [cifs]

Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org> #5.4+
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-11-15 23:05:33 -06:00
Rohith Surabattula
ac873aa3dc smb3: Avoid Mid pending list corruption
When reconnect happens Mid queue can be corrupted when both
demultiplex and offload thread try to dequeue the MID from the
pending list.

These patches address a problem found during decryption offload:
         CIFS: VFS: trying to dequeue a deleted mid
that could cause a refcount use after free:
         Workqueue: smb3decryptd smb2_decrypt_offload [cifs]

Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org> #5.4+
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-11-15 23:05:33 -06:00
Rohith Surabattula
de9ac0a6e9 smb3: Call cifs reconnect from demultiplex thread
cifs_reconnect needs to be called only from demultiplex thread.
skip cifs_reconnect in offload thread. So, cifs_reconnect will be
called by demultiplex thread in subsequent request.

These patches address a problem found during decryption offload:
     CIFS: VFS: trying to dequeue a deleted mid
that can cause a refcount use after free:

[ 1271.389453] Workqueue: smb3decryptd smb2_decrypt_offload [cifs]
[ 1271.389456] RIP: 0010:refcount_warn_saturate+0xae/0xf0
[ 1271.389457] Code: fa 1d 6a 01 01 e8 c7 44 b1 ff 0f 0b 5d c3 80 3d e7 1d 6a 01 00 75 91 48 c7 c7 d8 be 1d a2 c6 05 d7 1d 6a 01 01 e8 a7 44 b1 ff <0f> 0b 5d c3 80 3d c5 1d 6a 01 00 0f 85 6d ff ff ff 48 c7 c7 30 bf
[ 1271.389458] RSP: 0018:ffffa4cdc1f87e30 EFLAGS: 00010286
[ 1271.389458] RAX: 0000000000000000 RBX: ffff9974d2809f00 RCX: ffff9974df898cc8
[ 1271.389459] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9974df898cc0
[ 1271.389460] RBP: ffffa4cdc1f87e30 R08: 0000000000000004 R09: 00000000000002c0
[ 1271.389460] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9974b7fdb5c0
[ 1271.389461] R13: ffff9974d2809f00 R14: ffff9974ccea0a80 R15: ffff99748e60db80
[ 1271.389462] FS:  0000000000000000(0000) GS:ffff9974df880000(0000) knlGS:0000000000000000
[ 1271.389462] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1271.389463] CR2: 000055c60f344fe4 CR3: 0000001031a3c002 CR4: 00000000003706e0
[ 1271.389465] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1271.389465] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1271.389466] Call Trace:
[ 1271.389483]  cifs_mid_q_entry_release+0xce/0x110 [cifs]
[ 1271.389499]  smb2_decrypt_offload+0xa9/0x1c0 [cifs]
[ 1271.389501]  process_one_work+0x1e8/0x3b0
[ 1271.389503]  worker_thread+0x50/0x370
[ 1271.389504]  kthread+0x12f/0x150
[ 1271.389506]  ? process_one_work+0x3b0/0x3b0
[ 1271.389507]  ? __kthread_bind_mask+0x70/0x70
[ 1271.389509]  ret_from_fork+0x22/0x30

Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org> #5.4+
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-11-15 23:05:33 -06:00
Namjae Jeon
9812857208 cifs: fix a memleak with modefromsid
kmemleak reported a memory leak allocated in query_info() when cifs is
working with modefromsid.

  backtrace:
    [<00000000aeef6a1e>] slab_post_alloc_hook+0x58/0x510
    [<00000000b2f7a440>] __kmalloc+0x1a0/0x390
    [<000000006d470ebc>] query_info+0x5b5/0x700 [cifs]
    [<00000000bad76ce0>] SMB2_query_acl+0x2b/0x30 [cifs]
    [<000000001fa09606>] get_smb2_acl_by_path+0x2f3/0x720 [cifs]
    [<000000001b6ebab7>] get_smb2_acl+0x75/0x90 [cifs]
    [<00000000abf43904>] cifs_acl_to_fattr+0x13b/0x1d0 [cifs]
    [<00000000a5372ec3>] cifs_get_inode_info+0x4cd/0x9a0 [cifs]
    [<00000000388e0a04>] cifs_revalidate_dentry_attr+0x1cd/0x510 [cifs]
    [<0000000046b6b352>] cifs_getattr+0x8a/0x260 [cifs]
    [<000000007692c95e>] vfs_getattr_nosec+0xa1/0xc0
    [<00000000cbc7d742>] vfs_getattr+0x36/0x40
    [<00000000de8acf67>] vfs_statx_fd+0x4a/0x80
    [<00000000a58c6adb>] __do_sys_newfstat+0x31/0x70
    [<00000000300b3b4e>] __x64_sys_newfstat+0x16/0x20
    [<000000006d8e9c48>] do_syscall_64+0x37/0x80

This patch add missing kfree for pntsd when mounting modefromsid option.

Cc: Stable <stable@vger.kernel.org> # v5.4+
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-11-15 23:05:33 -06:00
Al Viro
4bbf439b09 fix return values of seq_read_iter()
Unlike ->read(), ->read_iter() instances *must* return the amount
of data they'd left in iterator.  For ->read() returning less than
it has actually copied is a QoI issue; read(fd, unmapped_page - 5, 8)
is allowed to fill all 5 bytes of destination and return 4; it's
not nice to caller, but POSIX allows pretty much anything in such
situation, up to and including a SIGSEGV.

generic_file_splice_read() uses pipe-backed iterator as destination;
there a short copy comes from pipe being full, not from running into
an un{mapped,writable} page in the middle of destination as we
have for iovec-backed iterators read(2) uses.  And there we rely
upon the ->read_iter() reporting the actual amount it has left
in destination.

Conversion of a ->read() instance into ->read_iter() has to watch
out for that.  If you really need an "all or nothing" kind of
behaviour somewhere, you need to do iov_iter_revert() to prune
the partial copy.

In case of seq_read_iter() we can handle short copy just fine;
the data is in m->buf and next call will fetch it from there.

Fixes: d4d50710a8 (seq_file: add seq_read_iter)
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-11-15 22:12:53 -05:00
David Woodhouse
28f1326710 eventfd: Export eventfd_ctx_do_read()
Where events are consumed in the kernel, for example by KVM's
irqfd_wakeup() and VFIO's virqfd_wakeup(), they currently lack a
mechanism to drain the eventfd's counter.

Since the wait queue is already locked while the wakeup functions are
invoked, all they really need to do is call eventfd_ctx_do_read().

Add a check for the lock, and export it for them.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20201027135523.646811-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:10 -05:00
Linus Torvalds
e28c0d7c92 Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "14 patches.

  Subsystems affected by this patch series: mm (migration, vmscan, slub,
  gup, memcg, hugetlbfs), mailmap, kbuild, reboot, watchdog, panic, and
  ocfs2"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  ocfs2: initialize ip_next_orphan
  panic: don't dump stack twice on warn
  hugetlbfs: fix anon huge page migration race
  mm: memcontrol: fix missing wakeup polling thread
  kernel/watchdog: fix watchdog_allowed_mask not used warning
  reboot: fix overflow parsing reboot cpu number
  Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint"
  compiler.h: fix barrier_data() on clang
  mm/gup: use unpin_user_pages() in __gup_longterm_locked()
  mm/slub: fix panic in slab_alloc_node()
  mailmap: fix entry for Dmitry Baryshkov/Eremin-Solenikov
  mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit
  mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate
  mm/compaction: count pages and stop correctly during page isolation
2020-11-14 12:35:11 -08:00
David Howells
3ad216ee73 afs: Fix afs_write_end() when called with copied == 0 [ver #3]
When afs_write_end() is called with copied == 0, it tries to set the
dirty region, but there's no way to actually encode a 0-length region in
the encoding in page->private.

"0,0", for example, indicates a 1-byte region at offset 0.  The maths
miscalculates this and sets it incorrectly.

Fix it to just do nothing but unlock and put the page in this case.  We
don't actually need to mark the page dirty as nothing presumably
changed.

Fixes: 65dd2d6072 ("afs: Alter dirty range encoding in page->private")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-14 11:51:18 -08:00
Wengang Wang
f5785283dd ocfs2: initialize ip_next_orphan
Though problem if found on a lower 4.1.12 kernel, I think upstream has
same issue.

In one node in the cluster, there is the following callback trace:

   # cat /proc/21473/stack
   __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
   ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
   ocfs2_evict_inode+0x152/0x820 [ocfs2]
   evict+0xae/0x1a0
   iput+0x1c6/0x230
   ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
   ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
   ocfs2_dir_foreach+0x29/0x30 [ocfs2]
   ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
   ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
   process_one_work+0x169/0x4a0
   worker_thread+0x5b/0x560
   kthread+0xcb/0xf0
   ret_from_fork+0x61/0x90

The above stack is not reasonable, the final iput shouldn't happen in
ocfs2_orphan_filldir() function.  Looking at the code,

  2067         /* Skip inodes which are already added to recover list, since dio may
  2068          * happen concurrently with unlink/rename */
  2069         if (OCFS2_I(iter)->ip_next_orphan) {
  2070                 iput(iter);
  2071                 return 0;
  2072         }
  2073

The logic thinks the inode is already in recover list on seeing
ip_next_orphan is non-NULL, so it skip this inode after dropping a
reference which incremented in ocfs2_iget().

While, if the inode is already in recover list, it should have another
reference and the iput() at line 2070 should not be the final iput
(dropping the last reference).  So I don't think the inode is really in
the recover list (no vmcore to confirm).

Note that ocfs2_queue_orphans(), though not shown up in the call back
trace, is holding cluster lock on the orphan directory when looking up
for unlinked inodes.  The on disk inode eviction could involve a lot of
IOs which may need long time to finish.  That means this node could hold
the cluster lock for very long time, that can lead to the lock requests
(from other nodes) to the orhpan directory hang for long time.

Looking at more on ip_next_orphan, I found it's not initialized when
allocating a new ocfs2_inode_info structure.

This causes te reflink operations from some nodes hang for very long
time waiting for the cluster lock on the orphan directory.

Fix: initialize ip_next_orphan as NULL.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201109171746.27884-1-wen.gang.wang@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-14 11:26:04 -08:00
Jens Axboe
944d1444d5 io_uring: handle -EOPNOTSUPP on path resolution
Any attempt to do path resolution on /proc/self from an async worker will
yield -EOPNOTSUPP. We can safely do that resolution from the task itself,
and without blocking, so retry it from there.

Ideally io_uring would know this upfront and not have to go through the
worker thread to find out, but that doesn't currently seem feasible.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-14 10:22:30 -07:00
Alex Shi
65cdb4a214 configfs: fix kernel-doc markup issue
Add explanation for 'frag' parameter to avoid kernel-doc issue:
fs/configfs/dir.c:277: warning: Function parameter or member 'frag' not
described in 'configfs_create_dir'

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-11-14 10:22:45 +01:00
Linus Torvalds
f01c30de86 More VFS fixes for 5.10-rc4:
- Minor cleanups of the sb_start_* fs freeze helpers.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl+sDaIACgkQ+H93GTRK
 tOu4sw//bIdBw11YfI9sPtMJR/RkK3lm/pU4A/eJYGD65Mzk8J4kNi6jXKuyqQ8e
 /RpTqKWOwVW05Qg5HlKTxXRyr5Q788+EuBQH2t8VukWVdAgK2TFvNTTXb7QDsNSD
 SneC7Sox3CEO+vYnBsr7tUjfl7AYH0uFTxLkvpYqSQBn2+jo2x0s7NyKKZSDAASI
 +Rmhinw4QjjAHYC54nBy6Q47XhrZJj7XCODJdEql81cKSJUvjCo3url3sNvGXXNW
 oXbs5IO5cVQrQx6n9rQxCfkN1dz9c/CBopYFwdgmg76Bj4VLSzCYVecnMeDl53pV
 3jXesNtJcR2dz64e98K1Moof2dHSm0/NP0Q7KnMYEaGEl6tAtyjSx9lL2Qd6npG+
 mG460UHd/7RHXoH/BTaCrtHHyA4pApHMqf+w3R2ienxrltKUJAEfGM/5x8o0ikWx
 laeT0L/m6Yv/dGnDvNthhoF84tCiQUnxg+UeXiKv4R9uFL1bKMFPw5i1zWuXqqaX
 yZPqUY1tiecQskr89AimOVI64L2MJ4DgBey1JzNL/XzPtw55Qu+LR6MkkaIC08Wu
 ubGJTm6fPw3Cz8JYgn4WIgKB9Q7yAoKsyl0mGLQh2SJT1FS8WLct+SRPwXcMVfJT
 VpkgjJW/ak5L+XfQU6Ev39zUasEAqdaxvPoTxUfne6spUiNbgrk=
 =ZC9a
 -----END PGP SIGNATURE-----

Merge tag 'vfs-5.10-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull fs freeze fix and cleanups from Darrick Wong:
 "A single vfs fix for 5.10, along with two subsequent cleanups.

  A very long time ago, a hack was added to the vfs fs freeze protection
  code to work around lockdep complaints about XFS, which would try to
  run a transaction (which requires intwrite protection) to finalize an
  xfs freeze (by which time the vfs had already taken intwrite).

  Fast forward a few years, and XFS fixed the recursive intwrite problem
  on its own, and the hack became unnecessary. Fast forward almost a
  decade, and latent bugs in the code converting this hack from freeze
  flags to freeze locks combine with lockdep bugs to make this reproduce
  frequently enough to notice page faults racing with freeze.

  Since the hack is unnecessary and causes thread race errors, just get
  rid of it completely. Making this kind of vfs change midway through a
  cycle makes me nervous, but a large enough number of the usual
  VFS/ext4/XFS/btrfs suspects have said this looks good and solves a
  real problem vector.

  And once that removal is done, __sb_start_write is now simple enough
  that it becomes possible to refactor the function into smaller,
  simpler static inline helpers in linux/fs.h. The cleanup is
  straightforward.

  Summary:

   - Finally remove the "convert to trylock" weirdness in the fs freezer
     code. It was necessary 10 years ago to deal with nested
     transactions in XFS, but we've long since removed that; and now
     this is causing subtle race conditions when lockdep goes offline
     and sb_start_* aren't prepared to retry a trylock failure.

   - Minor cleanups of the sb_start_* fs freeze helpers"

* tag 'vfs-5.10-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  vfs: move __sb_{start,end}_write* to fs.h
  vfs: separate __sb_start_write into blocking and non-blocking helpers
  vfs: remove lockdep bogosity in __sb_start_write
2020-11-13 16:07:53 -08:00
Linus Torvalds
d9315f5634 Fixes for 5.10-rc4:
- Fix a fairly serious problem where the reverse mapping btree key
 comparison functions were silently ignoring parts of the keyspace when
 doing comparisons.
 - Fix a thinko in the online refcount scrubber.
 - Fix a missing unlock in the pnfs code.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl+sDJYACgkQ+H93GTRK
 tOtQOg//ZMCNB9wN/xZxlFHYIxD/AwnzuiVrfbQTT38JNg9fR9aHJxPcEbovplq+
 QosISoi/ooTbiyIz9p31RbGPGH7y26Oo9CcENWgVkiOdI6KTW0v41181/gAWpegQ
 u9V9aTARb2cerKVSc/TONitOgkzEu69J4GoG2A6pbWyCoUKvOrne9+v785BjpHdU
 3IdT30DQiW8QZodSww+i42fRrZhpkkmIELbZV7PKTCIJXRifAr2oAE5CBQaFHJri
 gh1Wgc7snn+fiQ8xLsG7u+Zl1bS6dC7wO8YksSX77V9CeaHvbAJxh16rBXXdaHAi
 TR5rymJ1+VB5SD9yVTyE7szQ9U6eo8nMktxxP6/Iejy6IiYSV0C/eQtKx/aYt6lU
 ZjoKnwxsiNXw6K+f+6AIFfPP3M4OmtOoQK8mzsl1rNVY6P3ZUtZQ0GoCnjPOtyfa
 PChG6eDzCcNVRUpzPAcVhxBWi8ilyMtEjJps+aBm5NQGnuyZ+PSZDLxYZCq5mOik
 m9uvIYDRvi6l9StShxi2DtcrYD665ZPWDAMeYXV1CxockjqdbMn+j+SiK6MJ0bzb
 9fL5IR+RphK3aZ4+U9PCJBPNK25Dd9rMaFIfb3FzZmokTDlBFovq26LCEshAH5rg
 WLZvynY3wF9TIqxnD8H9fGxNHJ5cbfR4tvMUOdXwCg0dAtFHoDI=
 =5qeU
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.10-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:

 - Fix a fairly serious problem where the reverse mapping btree key
   comparison functions were silently ignoring parts of the keyspace
   when doing comparisons

 - Fix a thinko in the online refcount scrubber

 - Fix a missing unlock in the pnfs code

* tag 'xfs-5.10-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix a missing unlock on error in xfs_fs_map_blocks
  xfs: fix brainos in the refcount scrubber's rmap fragment processor
  xfs: fix rmap key and record comparison functions
  xfs: set the unwritten bit in rmap lookup flags in xchk_bmap_get_rmapextents
  xfs: fix flags argument to rmap lookup when converting shared file rmaps
2020-11-13 16:01:44 -08:00
Jens Axboe
8d4c3e76e3 proc: don't allow async path resolution of /proc/self components
If this is attempted by a kthread, then return -EOPNOTSUPP as we don't
currently support that. Once we can get task_pid_ptr() doing the right
thing, then this can go away again.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-13 16:47:52 -07:00
Linus Torvalds
1b1e9262ca io_uring-5.10-2020-11-13
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+u9+4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnyID/4kaa3gPcRhqCMZ2vSqQjwdMgKNON7b8Qh2
 Q9K/dsJn+jFDCKqZcX3hjG5kLziqRiIvMVtJfCYLm00p8hPfbxrmuK6iq1cxbkCg
 Jm9FqR/5O2ERMN+zQHZEccDhv0knnc33hLVAJLWneAH2enGMbPdvnOoez8ixfVCM
 IxEO5SRk0inrxgUOG5Qp78zqR+Znx3KQudpAJpYqP4/5T+59gI7yhkkvgiJDWlF8
 o+kMsiLHToVEoU/MTa/npfdEx+Ac+XryQ+3QAzMPL4miUHcgLwpF2cOZ02Op2hd1
 xf4/w4zwFRC9JGmgpBg26Woyen4e3o89RaB2lFrT/35PKwJU6u6UOUXYYrSoNt+m
 Yhr90mHqpS/iAQ2ejtEQVkgiXsjp4ovnocWcaqGhhh9k2NYbCaw35OtciWaCdOYW
 G4qnUClueV84XxYt3nzVUl99BkERoRUjiToG82L/4opO9FOAAXR/n4/a9R7ru4ZX
 FOE4H52r+PVsgpkAFiR0fX8BY44SSXoRQ3RPmns0iJpQBYS+R7GJpzvdU8Hfd+5j
 rrdD+XGZmhft39E1Yz2Ejtb805spoDHo08oHvVRMWdlmYFP09nNsqgcQOVksiAm1
 t5Dn5PMBjoFffPIaQDDKZqPM11tYk3pFV9VIx8u9tS0jqPsoY/I7p8l0zcOMz4pc
 NaG0aeY82w==
 =FEw9
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.10-2020-11-13' of git://git.kernel.dk/linux-block

Pull io_uring fix from Jens Axboe:
 "A single fix in here, for a missed rounding case at setup time, which
  caused an otherwise legitimate setup case to return -EINVAL if used
  with unaligned ring size values"

* tag 'io_uring-5.10-2020-11-13' of git://git.kernel.dk/linux-block:
  io_uring: round-up cq size before comparing with rounded sq size
2020-11-13 15:05:19 -08:00
Dave Kleikamp
c61b3e4839 jfs: Fix array index bounds check in dbAdjTree
Bounds checking tools can flag a bug in dbAdjTree() for an array index
out of bounds in dmt_stree. Since dmt_stree can refer to the stree in
both structures dmaptree and dmapctl, use the larger array to eliminate
the false positive.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
2020-11-13 16:03:07 -06:00
Daniel Xu
1a49a97df6 btrfs: tree-checker: add missing return after error in root_item
There's a missing return statement after an error is found in the
root_item, this can cause further problems when a crafted image triggers
the error.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210181
Fixes: 259ee7754b ("btrfs: tree-checker: Add ROOT_ITEM check")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-13 22:18:10 +01:00
Qu Wenruo
6f23277a49 btrfs: qgroup: don't commit transaction when we already hold the handle
[BUG]
When running the following script, btrfs will trigger an ASSERT():

  #/bin/bash
  mkfs.btrfs -f $dev
  mount $dev $mnt
  xfs_io -f -c "pwrite 0 1G" $mnt/file
  sync
  btrfs quota enable $mnt
  btrfs quota rescan -w $mnt

  # Manually set the limit below current usage
  btrfs qgroup limit 512M $mnt $mnt

  # Crash happens
  touch $mnt/file

The dmesg looks like this:

  assertion failed: refcount_read(&trans->use_count) == 1, in fs/btrfs/transaction.c:2022
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.h:3230!
  invalid opcode: 0000 [#1] SMP PTI
  RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
   btrfs_commit_transaction.cold+0x11/0x5d [btrfs]
   try_flush_qgroup+0x67/0x100 [btrfs]
   __btrfs_qgroup_reserve_meta+0x3a/0x60 [btrfs]
   btrfs_delayed_update_inode+0xaa/0x350 [btrfs]
   btrfs_update_inode+0x9d/0x110 [btrfs]
   btrfs_dirty_inode+0x5d/0xd0 [btrfs]
   touch_atime+0xb5/0x100
   iterate_dir+0xf1/0x1b0
   __x64_sys_getdents64+0x78/0x110
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7fb5afe588db

[CAUSE]
In try_flush_qgroup(), we assume we don't hold a transaction handle at
all.  This is true for data reservation and mostly true for metadata.
Since data space reservation always happens before we start a
transaction, and for most metadata operation we reserve space in
start_transaction().

But there is an exception, btrfs_delayed_inode_reserve_metadata().
It holds a transaction handle, while still trying to reserve extra
metadata space.

When we hit EDQUOT inside btrfs_delayed_inode_reserve_metadata(), we
will join current transaction and commit, while we still have
transaction handle from qgroup code.

[FIX]
Let's check current->journal before we join the transaction.

If current->journal is unset or BTRFS_SEND_TRANS_STUB, it means
we are not holding a transaction, thus are able to join and then commit
transaction.

If current->journal is a valid transaction handle, we avoid committing
transaction and just end it

This is less effective than committing current transaction, as it won't
free metadata reserved space, but we may still free some data space
before new data writes.

Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1178634
Fixes: c53e965360 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-13 22:17:57 +01:00
Filipe Manana
c334730988 btrfs: fix missing delalloc new bit for new delalloc ranges
When doing a buffered write, through one of the write family syscalls, we
look for ranges which currently don't have allocated extents and set the
'delalloc new' bit on them, so that we can report a correct number of used
blocks to the stat(2) syscall until delalloc is flushed and ordered extents
complete.

However there are a few other places where we can do a buffered write
against a range that is mapped to a hole (no extent allocated) and where
we do not set the 'new delalloc' bit. Those places are:

- Doing a memory mapped write against a hole;

- Cloning an inline extent into a hole starting at file offset 0;

- Calling btrfs_cont_expand() when the i_size of the file is not aligned
  to the sector size and is located in a hole. For example when cloning
  to a destination offset beyond EOF.

So after such cases, until the corresponding delalloc range is flushed and
the respective ordered extents complete, we can report an incorrect number
of blocks used through the stat(2) syscall.

In some cases we can end up reporting 0 used blocks to stat(2), which is a
particular bad value to report as it may mislead tools to think a file is
completely sparse when its i_size is not zero, making them skip reading
any data, an undesired consequence for tools such as archivers and other
backup tools, as reported a long time ago in the following thread (and
other past threads):

  https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html

Example reproducer:

  $ cat reproducer.sh
  #!/bin/bash

  MNT=/mnt/sdi
  DEV=/dev/sdi

  mkfs.btrfs -f $DEV > /dev/null
  # mkfs.xfs -f $DEV > /dev/null
  # mkfs.ext4 -F $DEV > /dev/null
  # mkfs.f2fs -f $DEV > /dev/null
  mount $DEV $MNT

  xfs_io -f -c "truncate 64K"   \
      -c "mmap -w 0 64K"        \
      -c "mwrite -S 0xab 0 64K" \
      -c "munmap"               \
      $MNT/foo

  blocks_used=$(stat -c %b $MNT/foo)
  echo "blocks used: $blocks_used"

  if [ $blocks_used -eq 0 ]; then
      echo "ERROR: blocks used is 0"
  fi

  umount $DEV

  $ ./reproducer.sh
  blocks used: 0
  ERROR: blocks used is 0

So move the logic that decides to set the 'delalloc bit' bit into the
function btrfs_set_extent_delalloc(), since that is what we use for all
those missing cases as well as for the cases that currently work well.

This change is also preparatory work for an upcoming patch that fixes
other problems related to tracking and reporting the number of bytes used
by an inode.

CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-13 22:15:59 +01:00
Dinghao Liu
751341b4d7 jfs: Fix memleak in dbAdjCtl
When dbBackSplit() fails, mp should be released to
prevent memleak. It's the same when dbJoin() fails.

Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2020-11-13 13:43:02 -06:00
Randy Dunlap
ed1c9a7a85 jfs: delete duplicated words + other fixes
Delete repeated words in fs/jfs/.
{for, allocation, if, the}
Insert "is" in one place to correct the grammar.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
To: linux-fsdevel@vger.kernel.org
Cc: jfs-discussion@lists.sourceforge.net
2020-11-13 13:36:00 -06:00
Steven Rostedt (VMware)
d19ad0775d ftrace: Have the callbacks receive a struct ftrace_regs instead of pt_regs
In preparation to have arguments of a function passed to callbacks attached
to functions as default, change the default callback prototype to receive a
struct ftrace_regs as the forth parameter instead of a pt_regs.

For callbacks that set the FL_SAVE_REGS flag in their ftrace_ops flags, they
will now need to get the pt_regs via a ftrace_get_regs() helper call. If
this is called by a callback that their ftrace_ops did not have a
FL_SAVE_REGS flag set, it that helper function will return NULL.

This will allow the ftrace_regs to hold enough just to get the parameters
and stack pointer, but without the worry that callbacks may have a pt_regs
that is not completely filled.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-13 12:14:55 -05:00
Linus Torvalds
d3ba7afcc1 Two ext4 bug fixes, one via a revert of a commit sent during the merge window.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl+t+7EACgkQ8vlZVpUN
 gaN6nQf+OzMMrP/QWF6fRG/ocQTgm4UZ/lo3REfZa8dRrFH+6qjtoFrmSnK7e+MJ
 V+639IYvHknDEgvap2yF8S6g06nAqb2HeSCHnkxdS3tCh5ZLgo2XmFOtB/WxZLnU
 Cx8dv9kw+mWJPdoRqJ+A4jn5cW2j3VLGNyJIdyIikkTb8L92fZRa/jKVZeIb84xX
 FEyshnzb3rV6ba0XdE99gWkabIAnnIsSwkF6SPhcqJpI3Lt1jkkV3D5h6DDoDz8d
 YpIA/6oPhEM2KwRgx9RJPdNRzHgmwWr2ti/0YLqlLNHWz1oZqi9K6yimXCfccwSU
 oCdK38tMWAFNiOGaijYx5xS3oNV+Dg==
 =9MzH
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_bugfixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Two ext4 bug fixes, one being a revert of a commit sent during the
  merge window"

* tag 'ext4_for_linus_bugfixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  Revert "ext4: fix superblock checksum calculation race"
  ext4: handle dax mount option collision
2020-11-13 09:05:33 -08:00
Ira Weiny
a6fbd0ab3d fs/ext2: Use ext2_put_page
There are 3 places in namei.c where the equivalent of ext2_put_page() is
open coded on a page which was returned from the ext2_get_page() call
[through the use of ext2_find_entry() and ext2_dotdot()].

Move ext2_put_page() to ext2.h and use it in namei.c

Also add a comment regarding the proper way to release the page returned
from ext2_find_entry() and ext2_dotdot().

Link: https://lore.kernel.org/r/20201112174244.701325-1-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Signed-off-by: Jan Kara <jack@suse.cz>
2020-11-13 11:19:30 +01:00
Linus Torvalds
585e5b17b9 another fscrypt fix for 5.10-rc4
Fix a regression where new files weren't using inline encryption when
 they should be.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCX63AIhQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKzlsAP9/m9XfxW3SwG4D1dnajXQPNZgsaby2
 AxkqJyjxq3kBvQEAo8fPe8uURAzYBA9C5tcP0+QCB3jqZkHu0HVCeQKvXwI=
 =zldW
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fscrypt fix from Eric Biggers:
 "Fix a regression where new files weren't using inline encryption when
  they should be"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fscrypt: fix inline encryption not used on new files
2020-11-12 16:39:58 -08:00
Linus Torvalds
20ca21dfcc Fix jdata data corruption and glock reference leak
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl+ttO4UHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqAdg//Xssi+p1N13BfAdyYdPoqEQ9JKSZH
 vEwth53ASsEy+WXz7TMulmrwJXWyQXfEibPvGLrHZZ0zgdw3DyLqGCnDCJOqmdOQ
 /e08MmJ9LdGz06IHnlzhWj3Lm4KpJaFzil4HBYE8Jlu4PimLVKcIQoDRRMV2DURl
 arPSm/dJtqUFVDj9+bawq5mRmxA0gXPFspf857wnDzNB+hGlll2wvK70vapNlg39
 hNVjDMfhb04CVNsJoVZS4wRI3TlwvOjlxB9WKvNvyRDF1jQT8bSX8UJyq5Qupf+K
 /HJvZFrm0gv6W0Z9UFMy5JXIQ5NTriBzrdu5rgzhKPA/r8oV0/8i0pq/macXHOQF
 shPNZQXdsN6uCK57JNugSW6C96l2K4kP7rOSjTE6N0WFo9y/u2bCAbo0+hh0lvns
 2sSKLX2ZtVwvyy0LVcAJa0Q1ZU9CK5F4J8F2Zy3DFOJqV1GuRCh0LBgyWoL4jJEQ
 J3JLJdUevP7E7dvXIwzsDNWOUiRy0xAqFQOIcdvt4WhMsH7QsHIiYjodHFLKx2qq
 Xk9YSTua7A+UjpLDyt2iMbumplMomQqx72NLUM5Kv9r+kSId4Ird/Q8HYHvWezzY
 xUjIvOAMXI/ZKsrJRwE0V2xI+q2wmHPFYrBjl0CymWsCksc9kB1wzkqQYp/Tcxxm
 hhp3iPPY/mEXr4s=
 =bTtI
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v5.10-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fixes from Andreas Gruenbacher:
 "Fix jdata data corruption and glock reference leak"

* tag 'gfs2-v5.10-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix case in which ail writes are done to jdata holes
  Revert "gfs2: Ignore journal log writes for jdata holes"
  gfs2: fix possible reference leak in gfs2_check_blk_type
2020-11-12 16:37:14 -08:00
Linus Torvalds
200f9d21aa NFS Client Bugfixes for Linux 5.10-rc4
- Stable fixes:
   - Fix failure to unregister shrinker
 
 - Other fixes:
   - Fix unnecessary locking to clear up some contention
   - Fix listxattr receive buffer size
   - Fix default mount options for nfsroot
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl+tbGIACgkQ18tUv7Cl
 QOvRyxAA4YXD1dlnO2Xbqo7ZyrgoZkVn08rb9yloeCuCNJDZPDSXt2QHAKdbmMU+
 8dxpcWN/8RUEUJK3cccNf2+XV/AWqqaFnFXylcfXLUnjZx0f30ou+HO+BRZFInVd
 OgG3njO94jV1B3RK38J7jyVRqx3hd0Vkq9Ja4LVF2l/x9ueGrj+pOdNauWr1JhFo
 6l4Fk2PKakLKJGsxLXmKlBb7p+EEwKa1qRov8SED33uTZkSnbFOmbxtEp1bu7sQx
 UKBTLADny9FClA1sjM45XN2nLS99/uUl/CaRKm/GB5nP4WKG4J3HgziAAvVglHcP
 yrUIiwLaUGZvteiO5O6NJqZpk6NyzWnBo4ZDt/TZcQ5nvK7uD6buUbDFFn++lbKm
 qwVWCnsme7sx3zVLLS4pY2GXnNNkGozjyrQOV0NQx1QphfalKsXHxeXikY+dkXr5
 FZwKodWxiKlsZj8cyOVjrm9q3+EsBnW8FyitgVQH4QIvcU9Z9zdB5QFyy7KsG4bw
 3iKsbz4HsJ0K10m7ykNEcR5R6XQBnFVWGxAHkQ3qbxzw9hYvhEebP/N2P7x3DC1X
 3gVPDto03Vc5PsuGoXm50kqXpRD3w+fnpf+HMZFmRbqjanqBHvgyYu58Zy0fXEnQ
 VigUcvsjAJhmoneahO3va8HF3a70PPqhzTTVKtfORBNg9uHmS1M=
 =7a8T
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.10-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "Stable fixes:
  - Fix failure to unregister shrinker

  Other fixes:
  - Fix unnecessary locking to clear up some contention
  - Fix listxattr receive buffer size
  - Fix default mount options for nfsroot"

* tag 'nfs-for-5.10-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Remove unnecessary inode lock in nfs_fsync_dir()
  NFS: Remove unnecessary inode locking in nfs_llseek_dir()
  NFS: Fix listxattr receive buffer size
  NFSv4.2: fix failure to unregister shrinker
  nfsroot: Default mount option should ask for built-in NFS version
2020-11-12 13:49:12 -08:00
Bob Peterson
4e79e3f08e gfs2: Fix case in which ail writes are done to jdata holes
Patch b2a846dbef ("gfs2: Ignore journal log writes for jdata holes")
tried (unsuccessfully) to fix a case in which writes were done to jdata
blocks, the blocks are sent to the ail list, then a punch_hole or truncate
operation caused the blocks to be freed. In other words, the ail items
are for jdata holes. Before b2a846dbef, the jdata hole caused function
gfs2_block_map to return -EIO, which was eventually interpreted as an
IO error to the journal, and then withdraw.

This patch changes function gfs2_get_block_noalloc, which is only used
for jdata writes, so it returns -ENODATA rather than -EIO, and when
-ENODATA is returned to gfs2_ail1_start_one, the error is ignored.
We can safely ignore it because gfs2_ail1_start_one is only called
when the jdata pages have already been written and truncated, so the
ail1 content no longer applies.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-11-12 18:55:20 +01:00
Bob Peterson
d3039c0615 Revert "gfs2: Ignore journal log writes for jdata holes"
This reverts commit b2a846dbef.

That commit changed the behavior of function gfs2_block_map to return
-ENODATA in cases where a hole (IOMAP_HOLE) is encountered and create is
false.  While that fixed the intended problem for jdata, it also broke
other callers of gfs2_block_map such as some jdata block reads.  Before
the patch, an encountered hole would be skipped and the buffer seen as
unmapped by the caller.  The patch changed the behavior to return
-ENODATA, which is interpreted as an error by the caller.

The -ENODATA return code should be restricted to the specific case where
jdata holes are encountered during ail1 writes.  That will be done in a
later patch.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-11-12 18:41:57 +01:00