linux

Author	SHA1	Message	Date
Pavel Begunkov	b64e3444d4	io_uring: simplify io_req_map_rw() Don't deref req->io->rw every time, but put it in a local variable. This looks prettier, generates less instructions, and doesn't break alias analysis. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-24 12:55:44 -06:00
Pavel Begunkov	e73751225b	io_uring: replace rw->task_work with rq->task_work io_kiocb::task_work was de-unionised, and is not planned to be shared back, because it's too useful and commonly used. Hence, instead of keeping a separate task_work in struct io_async_rw just reuse req->task_work. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-24 12:55:44 -06:00
Pavel Begunkov	2ae523ed07	io_uring: extract io_sendmsg_copy_hdr() Don't repeat send msg initialisation code, it's error prone. Extract and use a helper function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-24 12:55:44 -06:00
Pavel Begunkov	1400e69705	io_uring: use more specific type in rcv/snd msg cp send/recv msghdr initialisation works with struct io_async_msghdr, but pulls the whole struct io_async_ctx for no reason. That complicates it with composite accessing, e.g. io->msg. Use and pass the most specific type, which is struct io_async_msghdr. It is the larget field in union io_async_ctx and doesn't save stack space, but looks clearer. The most of the changes are replacing "io->msg." with "iomsg->" Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-24 12:55:44 -06:00
Pavel Begunkov	270a594070	io_uring: rename sr->msg into umsg Every second field in send/recv is called msg, make it a bit more understandable by renaming ->msg, which is a user provided ptr, to ->umsg. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-24 12:55:44 -06:00
Dmitry Vyukov	b36200f543	io_uring: fix sq array offset calculation rings_size() sets sq_offset to the total size of the rings (the returned value which is used for memory allocation). This is wrong: sq array should be located within the rings, not after them. Set sq_offset to where it should be. Fixes: `75b28affdd` ("io_uring: allocate the two rings together") Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Hristo Venev <hristo@venev.name> Cc: io-uring@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-24 12:55:44 -06:00
Jens Axboe	760618f7a8	Merge branch 'io_uring-5.8' into for-5.9/io_uring Merge in io_uring-5.8 fixes, as changes/cleanups to how we do locked mem accounting require a fixup, and only one of the spots are noticed by git as the other merges cleanly. The flags fix from io_uring-5.8 also causes a merge conflict, the leak fix for recvmsg, the double poll fix, and the link failure locking fix. * io_uring-5.8: io_uring: fix lockup in io_fail_links() io_uring: fix ->work corruption with poll_add io_uring: missed req_init_async() for IOSQE_ASYNC io_uring: always allow drain/link/hardlink/async sqe flags io_uring: ensure double poll additions work with both request types io_uring: fix recvmsg memory leak with buffer selection io_uring: fix not initialised work->flags io_uring: fix missing msg_name assignment io_uring: account user memory freed when exit has been queued io_uring: fix memleak in io_sqe_files_register() io_uring: fix memleak in __io_sqe_files_update() io_uring: export cq overflow status to userspace Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-24 12:53:31 -06:00
Pavel Begunkov	4ae6dbd683	io_uring: fix lockup in io_fail_links() io_fail_links() doesn't consider REQ_F_COMP_LOCKED leading to nested spin_lock(completion_lock) and lockup. [ 197.680409] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 6-... } 18239 jiffies s: 1421 root: 0x40/. [ 197.680411] rcu: blocking rcu_node structures: [ 197.680412] Task dump for CPU 6: [ 197.680413] link-timeout R running task 0 1669 1 0x8000008a [ 197.680414] Call Trace: [ 197.680420] ? io_req_find_next+0xa0/0x200 [ 197.680422] ? io_put_req_find_next+0x2a/0x50 [ 197.680423] ? io_poll_task_func+0xcf/0x140 [ 197.680425] ? task_work_run+0x67/0xa0 [ 197.680426] ? do_exit+0x35d/0xb70 [ 197.680429] ? syscall_trace_enter+0x187/0x2c0 [ 197.680430] ? do_group_exit+0x43/0xa0 [ 197.680448] ? __x64_sys_exit_group+0x18/0x20 [ 197.680450] ? do_syscall_64+0x52/0xa0 [ 197.680452] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-24 12:51:33 -06:00
Pavel Begunkov	d5e16d8e23	io_uring: fix ->work corruption with poll_add req->work might be already initialised by the time it gets into __io_arm_poll_handler(), which will corrupt it by using fields that are in an union with req->work. Luckily, the only side effect is missing put_creds(). Clean req->work before going there. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-24 12:51:33 -06:00
Li Guifu	99c787cfd2	f2fs: fix use-after-free issue During umount, f2fs_put_super() unregisters procfs entries after f2fs_destroy_segment_manager(), it may cause use-after-free issue when umount races with procfs accessing, fix it by relocating f2fs_unregister_sysfs(). [Chao Yu: change commit title/message a bit] Signed-off-by: Li Guifu <bluce.liguifu@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-23 20:22:42 -07:00
Jia Yang	68e79baf41	f2fs: Change the type of f2fs_flush_inline_data() to void The return value of f2fs_flush_inline_data() is not used, so delete it. Signed-off-by: Jia Yang <jiayang5@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-23 20:22:37 -07:00
Steve French	0e6705182d	Revert "cifs: Fix the target file was deleted when rename failed." This reverts commit `9ffad9263b`. Upon additional testing with older servers, it was found that the original commit introduced a regression when using the old SMB1 dialect and rsyncing over an existing file. The patch will need to be respun to address this, likely including a larger refactoring of the SMB1 and SMB3 rename code paths to make it less confusing and also to address some additional rename error cases that SMB3 may be able to workaround. Signed-off-by: Steve French <stfrench@microsoft.com> Reported-by: Patrick Fernie <patrick.fernie@gmail.com> CC: Stable <stable@vger.kernel.org> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Acked-by: Pavel Shilovsky <pshilov@microsoft.com> Acked-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>	2020-07-23 15:44:11 -05:00
Pavel Begunkov	3e863ea3bb	io_uring: missed req_init_async() for IOSQE_ASYNC IOSQE_ASYNC branch of io_queue_sqe() is another place where an unitialised req->work can be accessed (i.e. prior io_req_init_async()). Nothing really bad though, it just looses IO_WQ_WORK_CONCURRENT flag. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-23 11:20:55 -06:00
Peter Enderborg	a24c6f7bc9	debugfs: Add access restriction option Since debugfs include sensitive information it need to be treated carefully. But it also has many very useful debug functions for userspace. With this option we can have same configuration for system with need of debugfs and a way to turn it off. This gives a extra protection for exposure on systems where user-space services with system access are attacked. It is controlled by a configurable default value that can be override with a kernel command line parameter. (debugfs=) It can be on or off, but also internally on but not seen from user-space. This no-mount mode do not register a debugfs as filesystem, but client can register their parts in the internal structures. This data can be readed with a debugger or saved with a crashkernel. When it is off clients get EPERM error when accessing the functions for registering their components. Signed-off-by: Peter Enderborg <peter.enderborg@sony.com> Link: https://lore.kernel.org/r/20200716071511.26864-3-peter.enderborg@sony.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-07-23 17:10:25 +02:00
J. Bruce Fields	9affa43581	nfsd4: fix NULL dereference in nfsd/clients display code We hold the cl_lock here, and that's enough to keep stateid's from going away, but it's not enough to prevent the files they point to from going away. Take fi_lock and a reference and check for NULL, as we do in other code. Reported-by: NeilBrown <neilb@suse.de> Fixes: `78599c42ae` ("nfsd4: add file to display list of client's opens") Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2020-07-22 16:47:14 -04:00
Eric Biggers	f3db0bed45	fs-verity: use smp_load_acquire() for ->i_verity_info Normally smp_store_release() or cmpxchg_release() is paired with smp_load_acquire(). Sometimes smp_load_acquire() can be replaced with the more lightweight READ_ONCE(). However, for this to be safe, all the published memory must only be accessed in a way that involves the pointer itself. This may not be the case if allocating the object also involves initializing a static or global variable, for example. fsverity_info::tree_params.hash_alg->tfm is a crypto_ahash object that's internal to and is allocated by the crypto subsystem. So by using READ_ONCE() for ->i_verity_info, we're relying on internal implementation details of the crypto subsystem. Remove this fragile assumption by using smp_load_acquire() instead. Also fix the cmpxchg logic to correctly execute an ACQUIRE barrier when losing the cmpxchg race, since cmpxchg doesn't guarantee a memory barrier on failure. (Note: I haven't seen any real-world problems here. This change is just fixing the code to be guaranteed correct and less fragile.) Fixes: `fd2d1acfca` ("fs-verity: add the hook for file ->open()") Link: https://lore.kernel.org/r/20200721225920.114347-6-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-07-21 16:02:41 -07:00
Eric Biggers	ab673b9874	fscrypt: use smp_load_acquire() for ->i_crypt_info Normally smp_store_release() or cmpxchg_release() is paired with smp_load_acquire(). Sometimes smp_load_acquire() can be replaced with the more lightweight READ_ONCE(). However, for this to be safe, all the published memory must only be accessed in a way that involves the pointer itself. This may not be the case if allocating the object also involves initializing a static or global variable, for example. fscrypt_info includes various sub-objects which are internal to and are allocated by other kernel subsystems such as keyrings and crypto. So by using READ_ONCE() for ->i_crypt_info, we're relying on internal implementation details of these other kernel subsystems. Remove this fragile assumption by using smp_load_acquire() instead. (Note: I haven't seen any real-world problems here. This change is just fixing the code to be guaranteed correct and less fragile.) Fixes: `e37a784d8b` ("fscrypt: use READ_ONCE() to access ->i_crypt_info") Link: https://lore.kernel.org/r/20200721225920.114347-5-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-07-21 16:02:13 -07:00
Eric Biggers	777afe4e68	fscrypt: use smp_load_acquire() for ->s_master_keys Normally smp_store_release() or cmpxchg_release() is paired with smp_load_acquire(). Sometimes smp_load_acquire() can be replaced with the more lightweight READ_ONCE(). However, for this to be safe, all the published memory must only be accessed in a way that involves the pointer itself. This may not be the case if allocating the object also involves initializing a static or global variable, for example. super_block::s_master_keys is a keyring, which is internal to and is allocated by the keyrings subsystem. By using READ_ONCE() for it, we're relying on internal implementation details of the keyrings subsystem. Remove this fragile assumption by using smp_load_acquire() instead. (Note: I haven't seen any real-world problems here. This change is just fixing the code to be guaranteed correct and less fragile.) Fixes: `22d94f493b` ("fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl") Link: https://lore.kernel.org/r/20200721225920.114347-4-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-07-21 16:02:13 -07:00
Eric Biggers	97c6327f71	fscrypt: use smp_load_acquire() for fscrypt_prepared_key Normally smp_store_release() or cmpxchg_release() is paired with smp_load_acquire(). Sometimes smp_load_acquire() can be replaced with the more lightweight READ_ONCE(). However, for this to be safe, all the published memory must only be accessed in a way that involves the pointer itself. This may not be the case if allocating the object also involves initializing a static or global variable, for example. fscrypt_prepared_key includes a pointer to a crypto_skcipher object, which is internal to and is allocated by the crypto subsystem. By using READ_ONCE() for it, we're relying on internal implementation details of the crypto subsystem. Remove this fragile assumption by using smp_load_acquire() instead. (Note: I haven't seen any real-world problems here. This change is just fixing the code to be guaranteed correct and less fragile.) Fixes: `5fee36095c` ("fscrypt: add inline encryption support") Cc: Satya Tangirala <satyat@google.com> Link: https://lore.kernel.org/r/20200721225920.114347-3-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-07-21 16:02:13 -07:00
Eric Biggers	bd0d97b719	fscrypt: switch fscrypt_do_sha256() to use the SHA-256 library fscrypt_do_sha256() is only used for hashing encrypted filenames to create no-key tokens, which isn't performance-critical. Therefore a C implementation of SHA-256 is sufficient. Also, the logic to create no-key tokens is always potentially needed. This differs from fscrypt's other dependencies on crypto API algorithms, which are conditionally needed depending on what encryption policies userspace is using. Therefore, for fscrypt there isn't much benefit to allowing SHA-256 to be a loadable module. So, make fscrypt_do_sha256() use the SHA-256 library instead of the crypto_shash API. This is much simpler, since it avoids having to implement one-time-init (which is hard to do correctly, and in fact was implemented incorrectly) and handle failures to allocate the crypto_shash object. Fixes: `edc440e3d2` ("fscrypt: improve format of no-key names") Cc: Daniel Rosenberg <drosen@google.com> Link: https://lore.kernel.org/r/20200721225920.114347-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-07-21 16:02:13 -07:00
Boris Burkov	48cfa61b58	btrfs: fix mount failure caused by race with umount It is possible to cause a btrfs mount to fail by racing it with a slow umount. The crux of the sequence is generic_shutdown_super not yet calling sop->put_super before btrfs_mount_root calls btrfs_open_devices. If that occurs, btrfs_open_devices will decide the opened counter is non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to 0. From here, mount will call sget which will result in grab_super trying to take the super block umount semaphore. That semaphore will be held by the slow umount, so mount will block. Before up-ing the semaphore, umount will delete the super block, resulting in mount's sget reliably allocating a new one, which causes the mount path to dutifully fill it out, and increment total_rw_bytes a second time, which causes the mount to fail, as we see double the expected bytes. Here is the sequence laid out in greater detail: CPU0 CPU1 down_write sb->s_umount btrfs_kill_super kill_anon_super(sb) generic_shutdown_super(sb); shrink_dcache_for_umount(sb); sync_filesystem(sb); evict_inodes(sb); // SLOW btrfs_mount_root btrfs_scan_one_device fs_devices = device->fs_devices fs_info->fs_devices = fs_devices // fs_devices-opened makes this a no-op btrfs_open_devices(fs_devices, mode, fs_type) s = sget(fs_type, test, set, flags, fs_info); find sb in s_instances grab_super(sb); down_write(&s->s_umount); // blocks sop->put_super(sb) // sb->fs_devices->opened == 2; no-op spin_lock(&sb_lock); hlist_del_init(&sb->s_instances); spin_unlock(&sb_lock); up_write(&sb->s_umount); return 0; retry lookup don't find sb in s_instances (deleted by CPU0) s = alloc_super return s; btrfs_fill_super(s, fs_devices, data) open_ctree // fs_devices total_rw_bytes improperly set! btrfs_read_chunk_tree read_one_dev // increment total_rw_bytes again!! super_total_bytes < fs_devices->total_rw_bytes // ERROR!!! To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree before the calls to read_one_dev, while holding the sb umount semaphore and the uuid mutex. To reproduce, it is sufficient to dirty a decent number of inodes, then quickly umount and mount. for i in $(seq 0 500) do dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1 done umount /mnt/foo& mount /mnt/foo does the trick for me. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-21 22:08:54 +02:00
Robbie Ko	5909ca110b	btrfs: fix page leaks after failure to lock page for delalloc When locking pages for delalloc, we check if it's dirty and mapping still matches. If it does not match, we need to return -EAGAIN and release all pages. Only the current page was put though, iterate over all the remaining pages too. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-21 22:08:53 +02:00
Qu Wenruo	fa91e4aa17	btrfs: qgroup: fix data leak caused by race between writeback and truncate [BUG] When running tests like generic/013 on test device with btrfs quota enabled, it can normally lead to data leak, detected at unmount time: BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 4096 ------------[ cut here ]------------ WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs] RIP: 0010:close_ctree+0x1dc/0x323 [btrfs] Call Trace: btrfs_put_super+0x15/0x17 [btrfs] generic_shutdown_super+0x72/0x110 kill_anon_super+0x18/0x30 btrfs_kill_super+0x17/0x30 [btrfs] deactivate_locked_super+0x3b/0xa0 deactivate_super+0x40/0x50 cleanup_mnt+0x135/0x190 __cleanup_mnt+0x12/0x20 task_work_run+0x64/0xb0 __prepare_exit_to_usermode+0x1bc/0x1c0 __syscall_return_slowpath+0x47/0x230 do_syscall_64+0x64/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace caf08beafeca2392 ]--- BTRFS error (device dm-3): qgroup reserved space leaked [CAUSE] In the offending case, the offending operations are: 2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0 2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0 The following sequence of events could happen after the writev(): CPU1 (writeback) \| CPU2 (truncate) ----------------------------------------------------------------- btrfs_writepages() \| \|- extent_write_cache_pages() \| \|- Got page for 1003520 \| \| 1003520 is Dirty, no writeback \| \| So (!clear_page_dirty_for_io()) \| \| gets called for it \| \|- Now page 1003520 is Clean. \| \| \| btrfs_setattr() \| \| \|- btrfs_setsize() \| \| \|- truncate_setsize() \| \| New i_size is 18388 \|- __extent_writepage() \| \| \|- page_offset() > i_size \| \|- btrfs_invalidatepage() \| \|- Page is clean, so no qgroup \| callback executed This means, the qgroup reserved data space is not properly released in btrfs_invalidatepage() as the page is Clean. [FIX] Instead of checking the dirty bit of a page, call btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage(). As qgroup rsv are completely bound to the QGROUP_RESERVED bit of io_tree, not bound to page status, thus we won't cause double freeing anyway. Fixes: `0b34c261e2` ("btrfs: qgroup: Prevent qgroup->reserved from going subzero") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-21 22:08:32 +02:00
Filipe Manana	580c079b57	btrfs: fix double free on ulist after backref resolution failure At btrfs_find_all_roots_safe() we allocate a ulist and set the roots argument to point to it. However if later we fail due to an error returned by find_parent_nodes(), we free that ulist but leave a dangling pointer in the roots argument. Upon receiving the error, a caller of this function can attempt to free the same ulist again, resulting in an invalid memory access. One such scenario is during qgroup accounting: btrfs_qgroup_account_extents() --> calls btrfs_find_all_roots() passes &new_roots (a stack allocated pointer) to btrfs_find_all_roots() --> btrfs_find_all_roots() just calls btrfs_find_all_roots_safe() passing &new_roots to it --> allocates ulist and assigns its address to roots (which points to new_roots from btrfs_qgroup_account_extents()) --> find_parent_nodes() returns an error, so we free the ulist and leave roots pointing to it after returning --> btrfs_qgroup_account_extents() sees btrfs_find_all_roots() returned an error and jumps to the label 'cleanup', which just tries to free again the same ulist Stack trace example: ------------[ cut here ]------------ BTRFS: tree first key check failed WARNING: CPU: 1 PID: 1763215 at fs/btrfs/disk-io.c:422 btrfs_verify_level_key+0xe0/0x180 [btrfs] Modules linked in: dm_snapshot dm_thin_pool (...) CPU: 1 PID: 1763215 Comm: fsstress Tainted: G W 5.8.0-rc3-btrfs-next-64 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_verify_level_key+0xe0/0x180 [btrfs] Code: 28 5b 5d (...) RSP: 0018:ffffb89b473779a0 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff90397759bf08 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff RBP: ffff9039a419c000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: ffffb89b43301000 R12: 000000000000005e R13: ffffb89b47377a2e R14: ffffb89b473779af R15: 0000000000000000 FS: 00007fc47e1e1000(0000) GS:ffff9039ac200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc47e1df000 CR3: 00000003d9e4e001 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: read_block_for_search+0xf6/0x350 [btrfs] btrfs_next_old_leaf+0x242/0x650 [btrfs] resolve_indirect_refs+0x7cf/0x9e0 [btrfs] find_parent_nodes+0x4ea/0x12c0 [btrfs] btrfs_find_all_roots_safe+0xbf/0x130 [btrfs] btrfs_qgroup_account_extents+0x9d/0x390 [btrfs] btrfs_commit_transaction+0x4f7/0xb20 [btrfs] btrfs_sync_file+0x3d4/0x4d0 [btrfs] do_fsync+0x38/0x70 __x64_sys_fdatasync+0x13/0x20 do_syscall_64+0x5c/0xe0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fc47e2d72e3 Code: Bad RIP value. RSP: 002b:00007fffa32098c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc47e2d72e3 RDX: 00007fffa3209830 RSI: 00007fffa3209830 RDI: 0000000000000003 RBP: 000000000000072e R08: 0000000000000001 R09: 0000000000000003 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e8 R13: 0000000051eb851f R14: 00007fffa3209970 R15: 00005607c4ac8b50 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffffb8eb5e85>] copy_process+0x755/0x1eb0 softirqs last enabled at (0): [<ffffffffb8eb5e85>] copy_process+0x755/0x1eb0 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace 8639237550317b48 ]--- BTRFS error (device sdc): tree first key mismatch detected, bytenr=62324736 parent_transid=94 key expected=(262,108,1351680) has=(259,108,1921024) general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 2 PID: 1763215 Comm: fsstress Tainted: G W 5.8.0-rc3-btrfs-next-64 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:ulist_release+0x14/0x60 [btrfs] Code: c7 07 00 (...) RSP: 0018:ffffb89b47377d60 EFLAGS: 00010282 RAX: 6b6b6b6b6b6b6b6b RBX: ffff903959b56b90 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000270024 RDI: ffff9036e2adc840 RBP: ffff9036e2adc848 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9036e2adc840 R13: 0000000000000015 R14: ffff9039a419ccf8 R15: ffff90395d605840 FS: 00007fc47e1e1000(0000) GS:ffff9039ac600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8c1c0a51c8 CR3: 00000003d9e4e004 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ulist_free+0x13/0x20 [btrfs] btrfs_qgroup_account_extents+0xf3/0x390 [btrfs] btrfs_commit_transaction+0x4f7/0xb20 [btrfs] btrfs_sync_file+0x3d4/0x4d0 [btrfs] do_fsync+0x38/0x70 __x64_sys_fdatasync+0x13/0x20 do_syscall_64+0x5c/0xe0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fc47e2d72e3 Code: Bad RIP value. RSP: 002b:00007fffa32098c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc47e2d72e3 RDX: 00007fffa3209830 RSI: 00007fffa3209830 RDI: 0000000000000003 RBP: 000000000000072e R08: 0000000000000001 R09: 0000000000000003 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e8 R13: 0000000051eb851f R14: 00007fffa3209970 R15: 00005607c4ac8b50 Modules linked in: dm_snapshot dm_thin_pool (...) ---[ end trace 8639237550317b49 ]--- RIP: 0010:ulist_release+0x14/0x60 [btrfs] Code: c7 07 00 (...) RSP: 0018:ffffb89b47377d60 EFLAGS: 00010282 RAX: 6b6b6b6b6b6b6b6b RBX: ffff903959b56b90 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000270024 RDI: ffff9036e2adc840 RBP: ffff9036e2adc848 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9036e2adc840 R13: 0000000000000015 R14: ffff9039a419ccf8 R15: ffff90395d605840 FS: 00007fc47e1e1000(0000) GS:ffff9039ad200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6a776f7d40 CR3: 00000003d9e4e002 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fix this by making btrfs_find_all_roots_safe() set *roots to NULL after it frees the ulist. Fixes: `8da6d5815c` ("Btrfs: added btrfs_find_all_roots()") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-21 21:59:15 +02:00
Daeho Jeong	9af846486d	f2fs: add F2FS_IOC_SEC_TRIM_FILE ioctl Added a new ioctl to send discard commands or/and zero out to selected data area of a regular file for security reason. The way of handling range.len of F2FS_IOC_SEC_TRIM_FILE: 1. Added -1 value support for range.len to secure trim the whole blocks starting from range.start regardless of i_size. 2. If the end of the range passes over the end of file, it means until the end of file (i_size). 3. ignored the case of that range.len is zero to prevent the function from making end_addr zero and triggering different behaviour of the function. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-21 12:58:11 -07:00
Jaegeuk Kim	b0f3b87fb3	f2fs: should avoid inode eviction in synchronous path https://bugzilla.kernel.org/show_bug.cgi?id=208565 PID: 257 TASK: ecdd0000 CPU: 0 COMMAND: "init" #0 [<c0b420ec>] (__schedule) from [<c0b423c8>] #1 [<c0b423c8>] (schedule) from [<c0b459d4>] #2 [<c0b459d4>] (rwsem_down_read_failed) from [<c0b44fa0>] #3 [<c0b44fa0>] (down_read) from [<c044233c>] #4 [<c044233c>] (f2fs_truncate_blocks) from [<c0442890>] #5 [<c0442890>] (f2fs_truncate) from [<c044d408>] #6 [<c044d408>] (f2fs_evict_inode) from [<c030be18>] #7 [<c030be18>] (evict) from [<c030a558>] #8 [<c030a558>] (iput) from [<c047c600>] #9 [<c047c600>] (f2fs_sync_node_pages) from [<c0465414>] #10 [<c0465414>] (f2fs_write_checkpoint) from [<c04575f4>] #11 [<c04575f4>] (f2fs_sync_fs) from [<c0441918>] #12 [<c0441918>] (f2fs_do_sync_file) from [<c0441098>] #13 [<c0441098>] (f2fs_sync_file) from [<c0323fa0>] #14 [<c0323fa0>] (vfs_fsync_range) from [<c0324294>] #15 [<c0324294>] (do_fsync) from [<c0324014>] #16 [<c0324014>] (sys_fsync) from [<c0108bc0>] This can be caused by flush_dirty_inode() in f2fs_sync_node_pages() where iput() requires f2fs_lock_op() again resulting in livelock. Reported-by: Zhiguo Niu <Zhiguo.Niu@unisoc.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-21 12:55:54 -07:00
Eric Biggers	f000223c98	fscrypt: restrict IV_INO_LBLK_* to AES-256-XTS IV_INO_LBLK_* exist only because of hardware limitations, and currently the only known use case for them involves AES-256-XTS. Therefore, for now only allow them in combination with AES-256-XTS. This way we don't have to worry about them being combined with other encryption modes. (To be clear, combining IV_INO_LBLK_* with other encryption modes should work just fine. It's just not being tested, so we can't be 100% sure it works. So with no known use case, it's best to disallow it for now, just like we don't allow other weird combinations like AES-256-XTS contents encryption with Adiantum filenames encryption.) This can be relaxed later if a use case for other combinations arises. Fixes: `b103fb7653` ("fscrypt: add support for IV_INO_LBLK_64 policies") Fixes: `e3b1078bed` ("fscrypt: add support for IV_INO_LBLK_32 policies") Link: https://lore.kernel.org/r/20200721181012.39308-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-07-21 11:12:57 -07:00
Eric W. Biederman	be619f7f06	exec: Implement kernel_execve To allow the kernel not to play games with set_fs to call exec implement kernel_execve. The function kernel_execve takes pointers into kernel memory and copies the values pointed to onto the new userspace stack. The calls with arguments from kernel space of do_execve are replaced with calls to kernel_execve. The calls do_execve and do_execveat are made static as there are now no callers outside of exec. The comments that mention do_execve are updated to refer to kernel_execve or execve depending on the circumstances. In addition to correcting the comments, this makes it easy to grep for do_execve and verify it is not used. Inspired-by: https://lkml.kernel.org/r/20200627072704.2447163-1-hch@lst.de Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/87wo365ikj.fsf@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-07-21 08:24:52 -05:00
Eric W. Biederman	d8b9cd549e	exec: Factor bprm_stack_limits out of prepare_arg_pages In preparation for implementiong kernel_execve (which will take kernel pointers not userspace pointers) factor out bprm_stack_limits out of prepare_arg_pages. This separates the counting which depends upon the getting data from userspace from the calculations of the stack limits which is usable in kernel_execve. The remove prepare_args_pages and compute bprm->argc and bprm->envc directly in do_execveat_common, before bprm_stack_limits is called. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lkml.kernel.org/r/87365u6x60.fsf@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-07-21 08:24:52 -05:00
Eric W. Biederman	0c9cdff054	exec: Factor bprm_execve out of do_execve_common Currently it is necessary for the usermode helper code and the code that launches init to use set_fs so that pages coming from the kernel look like they are coming from userspace. To allow that usage of set_fs to be removed cleanly the argument copying from userspace needs to happen earlier. Factor bprm_execve out of do_execve_common to separate out the copying of arguments to the newe stack, and the rest of exec. In separating bprm_execve from do_execve_common the copying of the arguments onto the new stack happens earlier. As the copying of the arguments does not depend any security hooks, files, the file table, current->in_execve, current->fs->in_exec, bprm->unsafe, or creds this is safe. Likewise the security hook security_creds_for_exec does not depend upon preventing the argument copying from happening. In addition to making it possible to implement kernel_execve that performs the copying differently, this separation of bprm_execve from do_execve_common makes for a nice separation of responsibilities making the exec code easier to navigate. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lkml.kernel.org/r/878sfm6x6x.fsf@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-07-21 08:24:52 -05:00
Eric W. Biederman	f18ac551e5	exec: Move bprm_mm_init into alloc_bprm Currently it is necessary for the usermode helper code and the code that launches init to use set_fs so that pages coming from the kernel look like they are coming from userspace. To allow that usage of set_fs to be removed cleanly the argument copying from userspace needs to happen earlier. Move the allocation and initialization of bprm->mm into alloc_bprm so that the bprm->mm is available early to store the new user stack into. This is a prerequisite for copying argv and envp into the new user stack early before ther rest of exec. To keep the things consistent the cleanup of bprm->mm is moved into free_bprm. So that bprm->mm will be cleaned up whenever bprm->mm is allocated and free_bprm are called. Moving bprm_mm_init earlier is safe as it does not depend on any files, current->in_execve, current->fs->in_exec, bprm->unsafe, or the if the file table is shared. (AKA bprm_mm_init does not depend on any of the code that happens between alloc_bprm and where it was previously called.) This moves bprm->mm cleanup after current->fs->in_exec is set to 0. This is safe because current->fs->in_exec is only used to preventy taking an additional reference on the fs_struct. This moves bprm->mm cleanup after current->in_execve is set to 0. This is safe because current->in_execve is only used by the lsms (apparmor and tomoyou) and always for LSM specific functions, never for anything to do with the mm. This adds bprm->mm cleanup into the successful return path. This is safe because being on the successful return path implies that begin_new_exec succeeded and set brpm->mm to NULL. As bprm->mm is NULL bprm cleanup I am moving into free_bprm will do nothing. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lkml.kernel.org/r/87eepe6x7p.fsf@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-07-21 08:24:52 -05:00
Eric W. Biederman	60d9ad1d1d	exec: Move initialization of bprm->filename into alloc_bprm Currently it is necessary for the usermode helper code and the code that launches init to use set_fs so that pages coming from the kernel look like they are coming from userspace. To allow that usage of set_fs to be removed cleanly the argument copying from userspace needs to happen earlier. Move the computation of bprm->filename and possible allocation of a name in the case of execveat into alloc_bprm to make that possible. The exectuable name, the arguments, and the environment are copied into the new usermode stack which is stored in bprm until exec passes the point of no return. As the executable name is copied first onto the usermode stack it needs to be known. As there are no dependencies to computing the executable name, compute it early in alloc_bprm. As an implementation detail if the filename needs to be generated because it embeds a file descriptor store that filename in a new field bprm->fdpath, and free it in free_bprm. Previously this was done in an independent variable pathbuf. I have renamed pathbuf fdpath because fdpath is more suggestive of what kind of path is in the variable. I moved fdpath into struct linux_binprm because it is tightly tied to the other variables in struct linux_binprm, and as such is needed to allow the call alloc_binprm to move. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lkml.kernel.org/r/87k0z66x8f.fsf@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-07-21 08:24:52 -05:00
Eric W. Biederman	0a8f36eb48	exec: Factor out alloc_bprm Currently it is necessary for the usermode helper code and the code that launches init to use set_fs so that pages coming from the kernel look like they are coming from userspace. To allow that usage of set_fs to be removed cleanly the argument copying from userspace needs to happen earlier. Move the allocation of the bprm into it's own function (alloc_bprm) and move the call of alloc_bprm before unshare_files so that bprm can ultimately be allocated, the arguments can be placed on the new stack, and then the bprm can be passed into the core of exec. Neither the allocation of struct binprm nor the unsharing depend upon each other so swapping the order in which they are called is trivially safe. To keep things consistent the order of cleanup at the end of do_execve_common swapped to match the order of initialization. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/87pn8y6x9a.fsf@x220.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-07-21 08:24:44 -05:00
Ilya Ponetayev	db415f7aae	exfat: fix name_hash computation on big endian systems On-disk format for name_hash field is LE, so it must be explicitly transformed on BE system for proper result. Fixes: `370e812b3e` ("exfat: add nls operations") Cc: stable@vger.kernel.org # v5.7 Signed-off-by: Chen Minqiang <ptpt52@gmail.com> Signed-off-by: Ilya Ponetayev <i.ponetaev@ndmsystems.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-07-21 10:44:19 +09:00
Hyeongseok Kim	41e3928f8c	exfat: fix wrong size update of stream entry by typo The stream.size field is updated to the value of create timestamp of the file entry. Fix this to use correct stream entry pointer. Fixes: `29bbb14bfc` ("exfat: fix incorrect update of stream entry in __exfat_truncate()") Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-07-21 10:44:15 +09:00
Namjae Jeon	d2fa0c337d	exfat: fix wrong hint_stat initialization in exfat_find_dir_entry() We found the wrong hint_stat initialization in exfat_find_dir_entry(). It should be initialized when cluster is EXFAT_EOF_CLUSTER. Fixes: `ca06197382` ("exfat: add directory operations") Cc: stable@vger.kernel.org # v5.7 Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-07-21 10:44:10 +09:00
Namjae Jeon	43946b7049	exfat: fix overflow issue in exfat_cluster_to_sector() An overflow issue can occur while calculating sector in exfat_cluster_to_sector(). It needs to cast clus's type to sector_t before left shifting. Fixes: `1acf1a564b` ("exfat: add in-memory and on-disk structures and headers") Cc: stable@vger.kernel.org # v5.7 Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-07-21 10:44:06 +09:00
Eric Biggers	1d6217a4f9	fscrypt: rename FS_KEY_DERIVATION_NONCE_SIZE The name "FS_KEY_DERIVATION_NONCE_SIZE" is a bit outdated since due to the addition of FSCRYPT_POLICY_FLAG_DIRECT_KEY, the file nonce may now be used as a tweak instead of for key derivation. Also, we're now prefixing the fscrypt constants with "FSCRYPT_" instead of "FS_". Therefore, rename this constant to FSCRYPT_FILE_NONCE_SIZE. Link: https://lore.kernel.org/r/20200708215722.147154-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-07-20 17:26:33 -07:00
Eric Biggers	e455de313e	fscrypt: add comments that describe the HKDF info strings Each HKDF context byte is associated with a specific format of the remaining part of the application-specific info string. Add comments so that it's easier to keep track of what these all are. Link: https://lore.kernel.org/r/20200708215529.146890-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-07-20 17:26:32 -07:00
Randy Dunlap	887e037391	f2fs: segment.h: delete a duplicated word Drop the repeated word "the" in a comment. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: linux-f2fs-devel@lists.sourceforge.net Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-20 15:47:38 -07:00
Chao Yu	02772fbfcb	f2fs: compress: fix to avoid memory leak on cc->cpages Memory allocated for storing compressed pages' poitner should be released after f2fs_write_compressed_pages(), otherwise it will cause memory leak issue. Signed-off-by: Chao Yu <yuchao0@huawei.com> Fixes: `4c8ff7095b` ("f2fs: support data compression") Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-20 15:47:38 -07:00
Eric Biggers	3357af8f1a	f2fs: use generic names for generic ioctls Don't define F2FS_IOC_* aliases to ioctls that already have a generic FS_IOC_* name. These aliases are unnecessary, and they make it unclear which ioctls are f2fs-specific and which are generic. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-20 15:47:32 -07:00
Johannes Thumshirn	89ee72376b	zonefs: count pages after truncating the iterator Count pages after possibly truncating the iterator to the maximum zone append size, not before. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>	2020-07-20 17:59:31 +09:00
Damien Le Moal	01b2651cfb	zonefs: Fix compilation warning Avoid the compilation warning "Variable 'ret' is reassigned a value before the old one has been used." in zonefs_create_zgroup() by setting ret for the error path only if an error happens. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>	2020-07-20 17:57:50 +09:00
Greg Kroah-Hartman	6bdb486c5a	Merge 5.8-rc6 into driver-core-next We need the driver core fixes in here too. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-07-20 09:31:35 +02:00
Adrian Reber	12886f8ab1	proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE Opening files in /proc/pid/map_files when the current user is CAP_CHECKPOINT_RESTORE capable in the root namespace is useful for checkpointing and restoring to recover files that are unreachable via the file system such as deleted files, or memfd files. Signed-off-by: Adrian Reber <areber@redhat.com> Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Link: https://lore.kernel.org/r/20200719100418.2112740-5-areber@redhat.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2020-07-19 20:14:42 +02:00
Jianyong Wu	aab6c873cf	9p: remove unused code in 9p These codes have been commented out since 2007 and lay in kernel since then. So, it's better to remove them. Link: http://lkml.kernel.org/r/20200628074337.45895-1-jianyong.wu@arm.com Signed-off-by: Jianyong Wu <jianyong.wu@arm.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>	2020-07-19 14:58:47 +02:00
Zheng Bin	cb0aae0e31	9p: Fix memory leak in v9fs_mount v9fs_mount v9fs_session_init v9fs_cache_session_get_cookie v9fs_random_cachetag -->alloc cachetag v9ses->fscache = fscache_acquire_cookie -->maybe NULL sb = sget -->fail, goto clunk clunk_fid: v9fs_session_close if (v9ses->fscache) -->NULL kfree(v9ses->cachetag) Thus memleak happens. Link: http://lkml.kernel.org/r/20200615012153.89538-1-zhengbin13@huawei.com Fixes: `60e78d2c99` ("9p: Add fscache support to 9p") Cc: <stable@vger.kernel.org> # v2.6.32+ Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>	2020-07-19 14:58:47 +02:00
Jianyong Wu	6624664160	9p: retrieve fid from file when file instance exist. In the current setattr implementation in 9p, fid is always retrieved from dentry no matter file instance exists or not. If so, there may be some info related to opened file instance dropped. So it's better to retrieve fid from file instance when it is passed to setattr. for example: fd=open("tmp", O_RDWR); ftruncate(fd, 10); The file context related with the fd will be lost as fid is always retrieved from dentry, then the backend can't get the info of file context. It is against the original intention of user and may lead to bug. Link: http://lkml.kernel.org/r/20200710101548.10108-1-jianyong.wu@arm.com Signed-off-by: Jianyong Wu <jianyong.wu@arm.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>	2020-07-19 14:58:47 +02:00
Daniele Albano	61710e437f	io_uring: always allow drain/link/hardlink/async sqe flags We currently filter these for timeout_remove/async_cancel/files_update, but we only should be filtering for fixed file and buffer select. This also causes a second read of sqe->flags, which isn't needed. Just check req->flags for the relevant bits. This then allows these commands to be used in links, for example, like everything else. Signed-off-by: Daniele Albano <d.albano@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-18 14:15:16 -06:00
Jens Axboe	807abcb088	io_uring: ensure double poll additions work with both request types The double poll additions were centered around doing POLL_ADD on file descriptors that use more than one waitqueue (typically one for read, one for write) when being polled. However, it can also end up being triggered for when we use poll triggered retry. For that case, we cannot safely use req->io, as that could be used by the request type itself. Add a second io_poll_iocb pointer in the structure we allocate for poll based retry, and ensure we use the right one from the two paths. Fixes: `18bceab101` ("io_uring: allow POLL_ADD with double poll_wait() users") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-17 19:41:05 -06:00
Linus Torvalds	6a70f89cc5	More NFS Client Bugfixes for Linux 5.8 Bugfixes: - NFS: Fix interrupted slots by using the SEQUENCE operation - SUNRPC: reverte `d03727b248` to fix unkillable IOs - xprtrdma: Fix double-free in rpcrdma_ep_create() - xprtrdma: Fix recursion into rpcrdma_xprt_disconnect() - xprtrdma: Fix return code from rpcrdma_xprt_connect() - xprtrdma: Fix handling of connect errors - xprtrdma: Fix incorrect header size calculations -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl8SFqQACgkQ18tUv7Cl QOt76Q/9H/SrhERk64bErRLIvp0Gy6fI+U++pvNhDEUQfsPcrSYszMwmjBQ0yVet xZv/TP70GRY3EzKQ7TG8X9dX/PIRRdM9Sg6Xg4dLiJTJfVOj01ddwLHkJ0yt67wS 7TrWtIvL98vAhQRggHBg9E9ZgFFn1mCLmGQSgvBnQM9wlUmIrCf1UptlSNlnuGjn ogoT5DM9L5X5Z2AGDvLGHWviS8+yl55JmBolGz7yoE/wDWcAASCwK23cXPDfdLCe kbUaV06uD/77lIirM0PazIMJWmrZvc4+AyJBu1VOAK6RWkZ1ED47dQq9FMZpyYfK gvrw771Le4bhbZpIG1u4pS5/bpTQGdHJ4TRCA409rkSX5J+FxUsaLYnAcPtRVuiE 2dPOAQYrPgB0P+gb1mKfRD/c34+ooY3KUDfn28RFyaFdN9g4+bkLa0zYkj77q/l8 F938ejd7K3gpsKLssOjLajL+4TJC+f7aFPEAK5qKFpA/t1riH3qTvy8H/wjXumSF MaNezJbPoDRTpLzNbeld8MKN5/4wegnhym2k6MIuobAzTF+rL/VJZA3Ke70PACzy rISzBV0Stp8rT3pZFa6rWsPjYzvTd5rskWyqBD+Ep0XincxHNr0mrgRUKSYfdser +YnPDJb69rPFNz/XyvDFJW4nRKK/SnzHpvMngkBFUyzsdgvHYfs= =LUoM -----END PGP SIGNATURE----- Merge tag 'nfs-for-5.8-3' of git://git.linux-nfs.org/projects/anna/linux-nfs into master Pull NFS client fixes from Anna Schumaker: "A few more NFS client bugfixes for Linux 5.8: NFS: - Fix interrupted slots by using the SEQUENCE operation SUNRPC: - revert `d03727b248` to fix unkillable IOs xprtrdma: - Fix double-free in rpcrdma_ep_create() - Fix recursion into rpcrdma_xprt_disconnect() - Fix return code from rpcrdma_xprt_connect() - Fix handling of connect errors - Fix incorrect header size calculations" * tag 'nfs-for-5.8-3' of git://git.linux-nfs.org/projects/anna/linux-nfs: SUNRPC reverting `d03727b248` ("NFSv4 fix CLOSE not waiting for direct IO compeletion") xprtrdma: fix incorrect header size calculations NFS: Fix interrupted slots by sending a solo SEQUENCE operation xprtrdma: Fix handling of connect errors xprtrdma: Fix return code from rpcrdma_xprt_connect() xprtrdma: Fix recursion into rpcrdma_xprt_disconnect() xprtrdma: Fix double-free in rpcrdma_ep_create()	2020-07-17 16:37:52 -07:00
Eric Sandeen	4750a171c3	xfs: preserve inode versioning across remounts The MS_I_VERSION mount flag is exposed via the VFS, as documented in the mount manpages etc; see the iversion and noiversion mount options in mount(8). As a result, mount -o remount looks for this option in /proc/mounts and will only send the I_VERSION flag back in during remount it it is present. Since it's not there, a remount will /remove/ the I_VERSION flag at the vfs level, and iversion functionality is lost. xfs v5 superblocks intend to always have i_version enabled; it is set as a default at mount time, but is lost during remount for the reasons above. The generic fix would be to expose this documented option in /proc/mounts, but since that was rejected, fix it up again in the xfs remount path instead, so that at least xfs won't suffer from this misbehavior. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-17 13:20:20 -07:00
Olga Kornievskaia	65caafd0d2	SUNRPC reverting `d03727b248` ("NFSv4 fix CLOSE not waiting for direct IO compeletion") Reverting commit `d03727b248` "NFSv4 fix CLOSE not waiting for direct IO compeletion". This patch made it so that fput() by calling inode_dio_done() in nfs_file_release() would wait uninterruptably for any outstanding directIO to the file (but that wait on IO should be killable). The problem the patch was also trying to address was REMOVE returning ERR_ACCESS because the file is still opened, is supposed to be resolved by server returning ERR_FILE_OPEN and not ERR_ACCESS. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-07-17 14:47:38 -04:00
Linus Torvalds	4ebf8d7649	io_uring-5.8-2020-07-17 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8RyOwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpuNmD/sFxMpo0Q4szKSdFY16RxmLbeeCG8eQC+6P Zqqd4t4tpr1tamSf4pya8zh7ivkfPlm+IFQopEEXbDAZ5P8TwF59KvABRbUYbCFM ldQzJgvRwoTIhs0ojIY6CPMAxbpDLx8mpwgbzcjuKxbGDHEnndXPDbNO/8olxAaa Ace5zk7TpY9YDtEXr1qe3y0riw11o/E9S/iX+M/z1KGKQcx01jU4hwesuzssde4J rEG3TYFiHCkhfB0AtGj3zYInCYIXqqJRqEv9NP0npWB1IWbyLy9XatEDCx8aIblA HICy09+4v5HR5h4vByRGOvT28rl//7ZB4tdzkunLWYrxYkYOqypsRI8NeDelxtWa Iv+1Og94lQnjwOF9Iqz/q2z/OfpxlJpOvy8d5xWjhiNr9oc5ugAqVUiFjuQ6XnVG mNJA21pJwzpesggOErIYjI13JvwW3aFylAB3fBPitHcmCusElnLunSs3/zhr9NY7 BJomUwC/KCmcp/X/WX2W5LoKEnG9WnVrJJmDWjz1wQLziKa7dvHAGUGFArpGJmJ3 TUGefdBi6q2nC7o+K26pwFfjQpA1Myf8Vp6qS957YQ7kZoI1a0bCuxp/rrq1gNFt 8HeKf4jmfqcBZeTPlZDyMWwC5F1MpK9V+KComBqSA8x5/Q0mV6BsOvY5mMHfCgky HuD7ERgs3g== =Aucc -----END PGP SIGNATURE----- Merge tag 'io_uring-5.8-2020-07-17' of git://git.kernel.dk/linux-block into master Pull io_uring fix from Jens Axboe: "Fix for a case where, with automatic buffer selection, we can leak the buffer descriptor for recvmsg" * tag 'io_uring-5.8-2020-07-17' of git://git.kernel.dk/linux-block: io_uring: fix recvmsg memory leak with buffer selection	2020-07-17 10:47:51 -07:00
Linus Torvalds	0dd68a34ec	fuse fixes for 5.8-rc6 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXxGFQwAKCRDh3BK/laaZ PDYzAP9bXxHQaRdetnj6lOGNWjmVmiHfntxHqkl6QjZf6e1WlwD+NRXayVTc+Lzw M1pBK6kqovMQVWkyFfA3dTq/BZMzfAc= =9GPn -----END PGP SIGNATURE----- Merge tag 'fuse-fixes-5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse into master Pull fuse fixes from Miklos Szeredi: - two regressions in this cycle caused by the conversion of writepage list to an rb_tree - two regressions in v5.4 cause by the conversion to the new mount API - saner behavior of fsconfig(2) for the reconfigure case - an ancient issue with FS_IOC_{GET,SET}FLAGS ioctls * tag 'fuse-fixes-5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: Fix parameter for FS_IOC_{GET,SET}FLAGS fuse: don't ignore errors from fuse_writepages_fill() fuse: clean up condition for writepage sending fuse: reject options on reconfigure via fsconfig(2) fuse: ignore 'data' argument of mount(..., MS_REMOUNT) fuse: use ->reconfigure() instead of ->remount_fs() fuse: fix warning in tree_insert() and clean up writepage insertion fuse: move rb_erase() before tree_insert()	2020-07-17 10:36:19 -07:00
Linus Torvalds	44fea37378	overlayfs fixes for 5.8-rc6 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXxGF+QAKCRDh3BK/laaZ PCHnAQCqNxcxncKMebpJ2hNIEPuSvUPRA4+iOOnc+9HTZ4A09wD/d/8ryybORTZN IHq2PpQUtuGgASv6GrptJSmpDvG6RA0= =lOD9 -----END PGP SIGNATURE----- Merge tag 'ovl-fixes-5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into master Pull overlayfs fixes from Miklos Szeredi: - fix a regression introduced in v4.20 in handling a regenerated squashfs lower layer - two regression fixes for this cycle, one of which is Oops inducing - miscellaneous issues * tag 'ovl-fixes-5.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: fix lookup of indexed hardlinks with metacopy ovl: fix unneeded call to ovl_change_flags() ovl: fix mount option checks for nfs_export with no upperdir ovl: force read-only sb on failure to create index dir ovl: fix regression with re-formatted lower squashfs ovl: fix oops in ovl_indexdir_cleanup() with nfs_export=on ovl: relax WARN_ON() when decoding lower directory file handle ovl: remove not used argument in ovl_check_origin ovl: change ovl_copy_up_flags static ovl: inode reference leak in ovl_is_inuse true case.	2020-07-17 10:29:19 -07:00
Olga Kornievskaia	dbc4fec6b6	NFSv4.0 allow nconnect for v4.0 It looks like this "else" is just a typo. It turns off nconnect for NFSv4.0 even though it works for every other version. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-07-17 13:16:23 -04:00
He Zhe	ab91e7a6da	freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for NFS commit `0688e64bc6` ("NFS: Allow signal interruption of NFS4ERR_DELAYed operations") introduces nfs4_delay_interruptible which also needs an _unsafe version to avoid the following call trace for the same reason explained in commit `416ad3c9c0` ("freezer: add unsafe versions of freezable helpers for NFS") CPU: 4 PID: 3968 Comm: rm Tainted: G W 5.8.0-rc4 #1 Hardware name: Marvell OcteonTX CN96XX board (DT) Call trace: dump_backtrace+0x0/0x1dc show_stack+0x20/0x30 dump_stack+0xdc/0x150 debug_check_no_locks_held+0x98/0xa0 nfs4_delay_interruptible+0xd8/0x120 nfs4_handle_exception+0x130/0x170 nfs4_proc_rmdir+0x8c/0x220 nfs_rmdir+0xa4/0x360 vfs_rmdir.part.0+0x6c/0x1b0 do_rmdir+0x18c/0x210 __arm64_sys_unlinkat+0x64/0x7c el0_svc_common.constprop.0+0x7c/0x110 do_el0_svc+0x24/0xa0 el0_sync_handler+0x13c/0x1b8 el0_sync+0x158/0x180 Signed-off-by: He Zhe <zhe.he@windriver.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-07-17 13:12:44 -04:00
Trond Myklebust	57f80c0eda	Merge branch 'xattr-devel'	2020-07-17 10:35:48 -04:00
Kees Cook	3f649ab728	treewide: Remove uninitialized_var() usage Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' \| cut -d: -f1 \| sort -u \| \ xargs perl -pi -e \ 's/\buninitialized_var$([^$]+)\)/\1/g; s:\s/\ (GCC be quiet\|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-16 12:35:15 -07:00
Jason Yan	4df7a75f69	f2fs: Eliminate usage of uninitialized_var() macro This is an effort to eliminate the uninitialized_var() macro[1]. The use of this macro is the wrong solution because it forces off ANY analysis by the compiler for a given variable. It even masks "unused variable" warnings. Quoted from Linus[2]: "It's a horrible thing to use, in that it adds extra cruft to the source code, and then shuts up a compiler warning (even the _reliable_ warnings from gcc)." Fix it by remove this variable since it is not needed at all. [1] https://github.com/KSPP/linux/issues/81 [2] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Suggested-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Link: https://lore.kernel.org/r/20200615085132.166470-1-yanaijie@huawei.com Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-16 12:32:26 -07:00
Christoph Hellwig	5b642d8b9f	block: integrate bd_start_claiming into __blkdev_get bd_start_claiming duplicates a lot of the work done in __blkdev_get. Integrate the two functions to avoid the duplicate work, and to do the right thing for the md -ERESTARTSYS corner case. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-16 09:35:44 -06:00
Christoph Hellwig	ecbe6bc000	block: use bd_prepare_to_claim directly in the loop driver The arcane magic in bd_start_claiming is only needed to be able to claim a block_device that hasn't been fully set up. Switch the loop driver that claims from the ioctl path with a fully set up struct block_device to just use the much simpler bd_prepare_to_claim directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-16 09:35:44 -06:00
Christoph Hellwig	58e46ed9cc	block: refactor bd_start_claiming Move the locking and assignment of bd_claiming from bd_start_claiming to bd_prepare_to_claim. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-16 09:35:44 -06:00
Christoph Hellwig	c5638ab417	block: simplify the restart case in __blkdev_get Insted of duplicating all the cleanup logic jump to the code that cleans up anyway, and restart after that. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-16 09:35:44 -06:00
Christoph Hellwig	9e96c8c0e9	fs: add a vfs_fchmod helper Add a helper for struct file based chmode operations. To be used by the initramfs code soon. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-07-16 15:33:04 +02:00
Christoph Hellwig	c04011fe8c	fs: add a vfs_fchown helper Add a helper for struct file based chown operations. To be used by the initramfs code soon. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-07-16 15:33:00 +02:00
Amir Goldstein	4518dfcf76	ovl: fix lookup of indexed hardlinks with metacopy We recently moved setting inode flag OVL_UPPERDATA to ovl_lookup(). When looking up an overlay dentry, upperdentry may be found by index and not by name. In that case, we fail to read the metacopy xattr and falsly set the OVL_UPPERDATA on the overlay inode. This caused a regression in xfstest overlay/033 when run with OVERLAY_MOUNT_OPTIONS="-o metacopy=on". Fixes: `28166ab3c8` ("ovl: initialize OVL_UPPERDATA in ovl_lookup()") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-16 07:24:47 +02:00
Amir Goldstein	81a33c1ee9	ovl: fix unneeded call to ovl_change_flags() The check if user has changed the overlay file was wrong, causing unneeded call to ovl_change_flags() including taking f_lock on every file access. Fixes: `d989903058` ("ovl: do not generate duplicate fsnotify events for "fake" path") Cc: <stable@vger.kernel.org> # v4.19+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-16 07:24:47 +02:00
David Howells	811f04bac1	afs: Fix interruption of operations The afs filesystem driver allows unstarted operations to be cancelled by signal, but most of these can easily be restarted (mkdir for example). The primary culprits for reproducing this are those applications that use SIGALRM to display a progress counter. File lock-extension operation is marked uninterruptible as we have a limited time in which to do it, and the release op is marked uninterruptible also as if we fail to unlock a file, we'll have to wait 20 mins before anyone can lock it again. The store operation logs a warning if it gets interruption, e.g.: kAFS: Unexpected error from FS.StoreData -4 because it's run from the background - but it can also be run from fdatasync()-type things. However, store options aren't marked interruptible at the moment. Fix this in the following ways: (1) Mark store operations as uninterruptible. It might make sense to relax this for certain situations, but I'm not sure how to make sure that background store ops aren't affected by signals to foreground processes that happen to trigger them. (2) In afs_get_io_locks(), where we're getting the serialisation lock for talking to the fileserver, return ERESTARTSYS rather than EINTR because a lot of the operations (e.g. mkdir) are restartable if we haven't yet started sending the op to the server. Fixes: `e49c7b2f6d` ("afs: Build an abstraction around an "operation" concept") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-07-15 15:49:04 -07:00
Amir Goldstein	f0e1266ed2	ovl: fix mount option checks for nfs_export with no upperdir Without upperdir mount option, there is no index dir and the dependency checks nfs_export => index for mount options parsing are incorrect. Allow the combination nfs_export=on,index=off with no upperdir and move the check for dependency redirect_dir=nofollow for non-upper mount case to mount options parsing. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-16 00:11:15 +02:00
Amir Goldstein	470c156361	ovl: force read-only sb on failure to create index dir With index feature enabled, on failure to create index dir, overlay is being mounted read-only. However, we do not forbid user to remount overlay read-write. Fix that by setting ofs->workdir to NULL, which prevents remount read-write. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-16 00:11:15 +02:00
Amir Goldstein	a888db3101	ovl: fix regression with re-formatted lower squashfs Commit `9df085f3c9` ("ovl: relax requirement for non null uuid of lower fs") relaxed the requirement for non null uuid with single lower layer to allow enabling index and nfs_export features with single lower squashfs. Fabian reported a regression in a setup when overlay re-uses an existing upper layer and re-formats the lower squashfs image. Because squashfs has no uuid, the origin xattr in upper layer are decoded from the new lower layer where they may resolve to a wrong origin file and user may get an ESTALE or EIO error on lookup. To avoid the reported regression while still allowing the new features with single lower squashfs, do not allow decoding origin with lower null uuid unless user opted-in to one of the new features that require following the lower inode of non-dir upper (index, xino, metacopy). Reported-by: Fabian <godi.beat@gmx.net> Link: https://lore.kernel.org/linux-unionfs/32532923.JtPX5UtSzP@fgdesktop/ Fixes: `9df085f3c9` ("ovl: relax requirement for non null uuid of lower fs") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-16 00:10:31 +02:00
Amir Goldstein	20396365a1	ovl: fix oops in ovl_indexdir_cleanup() with nfs_export=on Mounting with nfs_export=on, xfstests overlay/031 triggers a kernel panic since v5.8-rc1 overlayfs updates. overlayfs: orphan index entry (index/00fb1..., ftype=4000, nlink=2) BUG: kernel NULL pointer dereference, address: 0000000000000030 RIP: 0010:ovl_cleanup_and_whiteout+0x28/0x220 [overlay] Bisect point at commit `c21c839b84` ("ovl: whiteout inode sharing") Minimal reproducer: -------------------------------------------------- rm -rf l u w m mkdir -p l u w m mkdir -p l/testdir touch l/testdir/testfile mount -t overlay -o lowerdir=l,upperdir=u,workdir=w,nfs_export=on overlay m echo 1 > m/testdir/testfile umount m rm -rf u/testdir mount -t overlay -o lowerdir=l,upperdir=u,workdir=w,nfs_export=on overlay m umount m -------------------------------------------------- When mount with nfs_export=on, and fail to verify an orphan index, we're cleaning this index from indexdir by calling ovl_cleanup_and_whiteout(). This dereferences ofs->workdir, that was earlier set to NULL. The design was that ovl->workdir will point at ovl->indexdir, but we are assigning ofs->indexdir to ofs->workdir only after ovl_indexdir_cleanup(). There is no reason not to do it sooner, because once we get success from ofs->indexdir = ovl_workdir_create(... there is no turning back. Reported-and-tested-by: Murphy Zhou <jencce.kernel@gmail.com> Fixes: `c21c839b84` ("ovl: whiteout inode sharing") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-16 00:09:59 +02:00
Amir Goldstein	124c2de2c0	ovl: relax WARN_ON() when decoding lower directory file handle Decoding a lower directory file handle to overlay path with cold inode/dentry cache may go as follows: 1. Decode real lower file handle to lower dir path 2. Check if lower dir is indexed (was copied up) 3. If indexed, get the upper dir path from index 4. Lookup upper dir path in overlay 5. If overlay path found, verify that overlay lower is the lower dir from step 1 On failure to verify step 5 above, user will get an ESTALE error and a WARN_ON will be printed. A mismatch in step 5 could be a result of lower directory that was renamed while overlay was offline, after that lower directory has been copied up and indexed. This is a scripted reproducer based on xfstest overlay/052: # Create lower subdir create_dirs create_test_files $lower/lowertestdir/subdir mount_dirs # Copy up lower dir and encode lower subdir file handle touch $SCRATCH_MNT/lowertestdir test_file_handles $SCRATCH_MNT/lowertestdir/subdir -p -o $tmp.fhandle # Rename lower dir offline unmount_dirs mv $lower/lowertestdir $lower/lowertestdir.new/ mount_dirs # Attempt to decode lower subdir file handle test_file_handles $SCRATCH_MNT -p -i $tmp.fhandle Since this WARN_ON() can be triggered by user we need to relax it. Fixes: `4b91c30a5a` ("ovl: lookup connected ancestor of dir in inode cache") Cc: <stable@vger.kernel.org> # v4.16+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-16 00:09:17 +02:00
youngjun	d78a0dcf64	ovl: remove not used argument in ovl_check_origin ovl_check_origin outparam 'ctrp' argument not used by caller. So remove this argument. Signed-off-by: youngjun <her0gyugyu@gmail.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-16 00:06:16 +02:00
youngjun	5ac8e8025a	ovl: change ovl_copy_up_flags static "ovl_copy_up_flags" is used in copy_up.c. so, change it static. Signed-off-by: youngjun <her0gyugyu@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-16 00:06:16 +02:00
youngjun	24f14009b8	ovl: inode reference leak in ovl_is_inuse true case. When "ovl_is_inuse" true case, trap inode reference not put. plus adding the comment explaining sequence of ovl_is_inuse after ovl_setup_trap. Fixes: `0be0bfd2de` ("ovl: fix regression caused by overlapping layers detection") Cc: <stable@vger.kernel.org> # v4.19+ Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: youngjun <her0gyugyu@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-16 00:05:40 +02:00
Pavel Begunkov	681fda8d27	io_uring: fix recvmsg memory leak with buffer selection io_recvmsg() doesn't free memory allocated for struct io_buffer. This can causes a leak when used with automatic buffer selection. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-15 13:35:56 -06:00
Amir Goldstein	9c61f3b560	fanotify: break up fanotify_alloc_event() Break up fanotify_alloc_event() into helpers by event struct type. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-15 17:41:33 +02:00
Amir Goldstein	b8a6c3a2f0	fanotify: create overflow event type The special overflow event is allocated as struct fanotify_path_event, but with a null path. Use a special event type to identify the overflow event, so the helper fanotify_has_event_path() will always indicate a non null path. Allocating the overflow event doesn't need any of the fancy stuff in fanotify_alloc_event(), so create a simplified helper for allocating the overflow event. There is also no need to store and report the pid with an overflow event. Link: https://lore.kernel.org/r/20200708111156.24659-7-amir73il@gmail.com Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-15 17:37:03 +02:00
Amir Goldstein	956235afd1	inotify: do not use objectid when comparing events inotify's event->wd is the object identifier. Compare that instead of the common fsnotidy event objectid, so we can get rid of the objectid field later. Link: https://lore.kernel.org/r/20200708111156.24659-6-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-15 17:36:58 +02:00
Amir Goldstein	9991bb84b2	kernfs: do not call fsnotify() with name without a parent When creating an FS_MODIFY event on inode itself (not on parent) the file_name argument should be NULL. The change to send a non NULL name to inode itself was done on purpuse as part of another commit, as Tejun writes: "...While at it, supply the target file name to fsnotify() from kernfs_node->name.". But this is wrong practice and inconsistent with inotify behavior when watching a single file. When a child is being watched (as opposed to the parent directory) the inotify event should contain the watch descriptor, but not the file name. Fixes: `df6a58c5c5` ("kernfs: don't depend on d_find_any_alias()...") Link: https://lore.kernel.org/r/20200708111156.24659-5-amir73il@gmail.com Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-15 17:36:52 +02:00
Amir Goldstein	9a02aa40dd	nfsd: use fsnotify_data_inode() to get the unlinked inode The inode argument to handle_event() is about to become obsolete. Link: https://lore.kernel.org/r/20200708111156.24659-4-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-15 17:36:47 +02:00
Amir Goldstein	cbcf47adc8	fsnotify: return non const from fsnotify_data_inode() Return non const inode pointer from fsnotify_data_inode(). None of the fsnotify hooks pass const inode pointer as data and callers often need to cast to a non const pointer. Link: https://lore.kernel.org/r/20200708111156.24659-3-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-15 17:36:45 +02:00
Amir Goldstein	c738fbabb0	fsnotify: fold fsnotify() call into fsnotify_parent() All (two) callers of fsnotify_parent() also call fsnotify() to notify the child inode. Move the second fsnotify() call into fsnotify_parent(). This will allow more flexibility in making decisions about which of the two event falvors should be sent. Using 'goto notify_child' in the inline helper seems a bit strange, but it mimics the code in __fsnotify_parent() for clarity and the goto pattern will become less strage after following patches are applied. Link: https://lore.kernel.org/r/20200708111156.24659-2-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-15 17:36:41 +02:00
Mel Gorman	71d734103e	fsnotify: Rearrange fast path to minimise overhead when there is no watcher The fsnotify paths are trivial to hit even when there are no watchers and they are surprisingly expensive. For example, every successful vfs_write() hits fsnotify_modify which calls both fsnotify_parent and fsnotify unless FMODE_NONOTIFY is set which is an internal flag invisible to userspace. As it stands, fsnotify_parent is a guaranteed functional call even if there are no watchers and fsnotify() does a substantial amount of unnecessary work before it checks if there are any watchers. A perf profile showed that applying mnt->mnt_fsnotify_mask in fnotify() was almost half of the total samples taken in that function during a test. This patch rearranges the fast paths to reduce the amount of work done when there are no watchers. The test motivating this was "perf bench sched messaging --pipe". Despite the fact the pipes are anonymous, fsnotify is still called a lot and the overhead is noticeable even though it's completely pointless. It's likely the overhead is negligible for real IO so this is an extreme example. This is a comparison of hackbench using processes and pipes on a 1-socket machine with 8 CPU threads without fanotify watchers. 5.7.0 5.7.0 vanilla fastfsnotify-v1r1 Amean 1 0.4837 ( 0.00%) 0.4630 * 4.27%* Amean 3 1.5447 ( 0.00%) 1.4557 ( 5.76%) Amean 5 2.6037 ( 0.00%) 2.4363 ( 6.43%) Amean 7 3.5987 ( 0.00%) 3.4757 ( 3.42%) Amean 12 5.8267 ( 0.00%) 5.6983 ( 2.20%) Amean 18 8.4400 ( 0.00%) 8.1327 ( 3.64%) Amean 24 11.0187 ( 0.00%) 10.0290 * 8.98%* Amean 30 13.1013 ( 0.00%) 12.8510 ( 1.91%) Amean 32 13.9190 ( 0.00%) 13.2410 ( 4.87%) 5.7.0 5.7.0 vanilla fastfsnotify-v1r1 Duration User 157.05 152.79 Duration System 1279.98 1219.32 Duration Elapsed 182.81 174.52 This is showing that the latencies are improved by roughly 2-9%. The variability is not shown but some of these results are within the noise as this workload heavily overloads the machine. That said, the system CPU usage is reduced by quite a bit so it makes sense to avoid the overhead even if it is a bit tricky to detect at times. A perf profile of just 1 group of tasks showed that 5.14% of samples taken were in either fsnotify() or fsnotify_parent(). With the patch, 2.8% of samples were in fsnotify, mostly function entry and the initial check for watchers. The check for watchers is complicated enough that inlining it may be controversial. [Amir] Slightly simplify with mnt_or_sb_mask => marks_mask Link: https://lore.kernel.org/r/20200708111156.24659-1-amir73il@gmail.com Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-15 15:29:10 +02:00
Jan Kara	47aaabdedf	fanotify: Avoid softlockups when reading many events When user provides large buffer for events and there are lots of events available, we can try to copy them all to userspace without scheduling which can softlockup the kernel (furthermore exacerbated by the contention on notification_lock). Add a scheduling point after copying each event. Note that usually the real underlying problem is the cost of fanotify event merging and the resulting contention on notification_lock but this is a cheap way to somewhat reduce the problem until we can properly address that. Reported-by: Francesco Ruggeri <fruggeri@arista.com> Link: https://lore.kernel.org/lkml/20200714025417.A25EB95C0339@us180.sjc.aristanetworks.com Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-15 15:23:28 +02:00
Chirantan Ekbote	31070f6cce	fuse: Fix parameter for FS_IOC_{GET,SET}FLAGS The ioctl encoding for this parameter is a long but the documentation says it should be an int and the kernel drivers expect it to be an int. If the fuse driver treats this as a long it might end up scribbling over the stack of a userspace process that only allocated enough space for an int. This was previously discussed in [1] and a patch for fuse was proposed in [2]. From what I can tell the patch in [2] was nacked in favor of adding new, "fixed" ioctls and using those from userspace. However there is still no "fixed" version of these ioctls and the fact is that it's sometimes infeasible to change all userspace to use the new one. Handling the ioctls specially in the fuse driver seems like the most pragmatic way for fuse servers to support them without causing crashes in userspace applications that call them. [1]: https://lore.kernel.org/linux-fsdevel/20131126200559.GH20559@hall.aurel32.net/T/ [2]: https://sourceforge.net/p/fuse/mailman/message/31771759/ Signed-off-by: Chirantan Ekbote <chirantan@chromium.org> Fixes: `59efec7b90` ("fuse: implement ioctl support") Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-15 14:18:20 +02:00
Alexei Starovoitov	ec2ffdf65f	Merge branch 'usermode-driver-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace into bpf-next	2020-07-14 12:18:01 -07:00
He Zhe	59679d9933	freezer: Add unsafe version of freezable_schedule_timeout_interruptible() for NFS commit `0688e64bc6` ("NFS: Allow signal interruption of NFS4ERR_DELAYed operations") introduces nfs4_delay_interruptible which also needs an _unsafe version to avoid the following call trace for the same reason explained in commit `416ad3c9c0` ("freezer: add unsafe versions of freezable helpers for NFS") CPU: 4 PID: 3968 Comm: rm Tainted: G W 5.8.0-rc4 #1 Hardware name: Marvell OcteonTX CN96XX board (DT) Call trace: dump_backtrace+0x0/0x1dc show_stack+0x20/0x30 dump_stack+0xdc/0x150 debug_check_no_locks_held+0x98/0xa0 nfs4_delay_interruptible+0xd8/0x120 nfs4_handle_exception+0x130/0x170 nfs4_proc_rmdir+0x8c/0x220 nfs_rmdir+0xa4/0x360 vfs_rmdir.part.0+0x6c/0x1b0 do_rmdir+0x18c/0x210 __arm64_sys_unlinkat+0x64/0x7c el0_svc_common.constprop.0+0x7c/0x110 do_el0_svc+0x24/0xa0 el0_sync_handler+0x13c/0x1b8 el0_sync+0x158/0x180 Signed-off-by: He Zhe <zhe.he@windriver.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-07-14 19:25:41 +02:00
YueHaibing	8464e650b9	xfs: remove duplicated include from xfs_buf_item.c Remove duplicated include. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2020-07-14 08:47:33 -07:00
Christoph Hellwig	76622c88c2	xfs: remove SYNC_WAIT and SYNC_TRYLOCK These two definitions are unused now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2020-07-14 08:47:33 -07:00
Gao Xiang	92a005448f	xfs: get rid of unnecessary xfs_perag_{get,put} pairs In the course of some operations, we look up the perag from the mount multiple times to get or change perag information. These are often very short pieces of code, so while the lookup cost is generally low, the cost of the lookup is far higher than the cost of the operation we are doing on the perag. Since we changed buffers to hold references to the perag they are cached in, many modification contexts already hold active references to the perag that are held across these operations. This is especially true for any operation that is serialised by an allocation group header buffer. In these cases, we can just use the buffer's reference to the perag to avoid needing to do lookups to access the perag. This means that many operations don't need to do perag lookups at all to access the perag because they've already looked up objects that own persistent references and hence can use that reference instead. Cc: Dave Chinner <dchinner@redhat.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-14 08:47:33 -07:00
Vasily Averin	7779b047a5	fuse: don't ignore errors from fuse_writepages_fill() fuse_writepages() ignores some errors taken from fuse_writepages_fill() I believe it is a bug: if .writepages is called with WB_SYNC_ALL it should either guarantee that all data was successfully saved or return error. Fixes: `26d614df1d` ("fuse: Implement writepages callback") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-14 14:45:42 +02:00
Miklos Szeredi	6ddf3af93e	fuse: clean up condition for writepage sending fuse_writepages_fill uses following construction: if (wpa && ap->num_pages && (A \|\| B \|\| C)) { action; } else if (wpa && D) { if (E) { the same action; } } - ap->num_pages check is always true and can be removed - "if" and "else if" calls the same action and can be merged. Move checking A, B, C, D, E conditions to a helper, add comments. Original-patch-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-14 14:45:41 +02:00
Miklos Szeredi	b330966f79	fuse: reject options on reconfigure via fsconfig(2) Previous patch changed handling of remount/reconfigure to ignore all options, including those that are unknown to the fuse kernel fs. This was done for backward compatibility, but this likely only affects the old mount(2) API. The new fsconfig(2) based reconfiguration could possibly be improved. This would make the new API less of a drop in replacement for the old, OTOH this is a good chance to get rid of some weirdnesses in the old API. Several other behaviors might make sense: 1) unknown options are rejected, known options are ignored 2) unknown options are rejected, known options are rejected if the value is changed, allowed otherwise 3) all options are rejected Prior to the backward compatibility fix to ignore all options all known options were accepted (1), even if they change the value of a mount parameter; fuse_reconfigure() does not look at the config values set by fuse_parse_param(). To fix that we'd need to verify that the value provided is the same as set in the initial configuration (2). The major drawback is that this is much more complex than just rejecting all attempts at changing options (3); i.e. all options signify initial configuration values and don't make sense on reconfigure. This patch opts for (3) with the rationale that no mount options are reconfigurable in fuse. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-14 14:45:41 +02:00
Miklos Szeredi	e8b20a474c	fuse: ignore 'data' argument of mount(..., MS_REMOUNT) The command mount -o remount -o unknownoption /mnt/fuse succeeds on kernel versions prior to v5.4 and fails on kernel version at or after. This is because fuse_parse_param() rejects any unrecognised options in case of FS_CONTEXT_FOR_RECONFIGURE, just as for FS_CONTEXT_FOR_MOUNT. This causes a regression in case the fuse filesystem is in fstab, since remount sends all options found there to the kernel; even ones that are meant for the initial mount and are consumed by the userspace fuse server. Fix this by ignoring mount options, just as fuse_remount_fs() did prior to the conversion to the new API. Reported-by: Stefan Priebe <s.priebe@profihost.ag> Fixes: `c30da2e981` ("fuse: convert to use the new mount API") Cc: <stable@vger.kernel.org> # v5.4 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-14 14:45:41 +02:00
Miklos Szeredi	0189a2d367	fuse: use ->reconfigure() instead of ->remount_fs() s_op->remount_fs() is only called from legacy_reconfigure(), which is not used after being converted to the new API. Convert to using ->reconfigure(). This restores the previous behavior of syncing the filesystem and rejecting MS_MANDLOCK on remount. Fixes: `c30da2e981` ("fuse: convert to use the new mount API") Cc: <stable@vger.kernel.org> # v5.4 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-14 14:45:41 +02:00
Miklos Szeredi	c146024ec4	fuse: fix warning in tree_insert() and clean up writepage insertion fuse_writepages_fill() calls tree_insert() with ap->num_pages = 0 which triggers the following warning: WARNING: CPU: 1 PID: 17211 at fs/fuse/file.c:1728 tree_insert+0xab/0xc0 [fuse] RIP: 0010:tree_insert+0xab/0xc0 [fuse] Call Trace: fuse_writepages_fill+0x5da/0x6a0 [fuse] write_cache_pages+0x171/0x470 fuse_writepages+0x8a/0x100 [fuse] do_writepages+0x43/0xe0 Fix up the warning and clean up the code around rb-tree insertion: - Rename tree_insert() to fuse_insert_writeback() and make it return the conflicting entry in case of failure - Re-add tree_insert() as a wrapper around fuse_insert_writeback() - Rename fuse_writepage_in_flight() to fuse_writepage_add() and reverse the meaning of the return value to mean + "true" in case the writepage entry was successfully added + "false" in case it was in-fligt queued on an existing writepage entry's auxiliary list or the existing writepage entry's temporary page updated Switch from fuse_find_writeback() + tree_insert() to fuse_insert_writeback() - Move setting orig_pages to before inserting/updating the entry; this may result in the orig_pages value being discarded later in case of an in-flight request - In case of a new writepage entry use fuse_writepage_add() unconditionally, only set data->wpa if the entry was added. Fixes: `6b2fb79963` ("fuse: optimize writepages search") Reported-by: kernel test robot <rong.a.chen@intel.com> Original-path-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-14 14:45:41 +02:00
Miklos Szeredi	69a6487ac0	fuse: move rb_erase() before tree_insert() In fuse_writepage_end() the old writepages entry needs to be removed from the rbtree before inserting the new one, otherwise tree_insert() would fail. This is a very rare codepath and no reproducer exists. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-14 14:45:41 +02:00
Alexander A. Klimov	248727a498	udf: Replace HTTP links with HTTPS ones Rationale: Reduces attack surface on kernel devs opening the links for MITM as HTTPS traffic is much harder to manipulate. Deterministic algorithm: For each file: If not .svg: For each line: If doesn't contain `\bxmlns\b`: For each link, `\bhttp://[^# \t\r\n]*(?:\w\|/)`: If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`: If both the HTTP and HTTPS versions return 200 OK and serve the same content: Replace HTTP with HTTPS. Link: https://lore.kernel.org/r/20200713200738.37800-1-grandmaster@al2klimov.de Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de> Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-14 14:37:39 +02:00
Frank van der Linden	95ad37f90c	NFSv4.2: add client side xattr caching. Implement client side caching for NFSv4.2 extended attributes. The cache is a per-inode hashtable, with name/value entries. There is one special entry for the listxattr cache. NFS inodes have a pointer to a cache structure. The cache structure is allocated on demand, freed when the cache is invalidated. Memory shrinkers keep the size in check. Large entries (> PAGE_SIZE) are collected by a separate shrinker, and freed more aggressively than others. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-07-13 17:52:46 -04:00
Frank van der Linden	012a211abd	NFSv4.2: hook in the user extended attribute handlers Now that all the lower level code is there to make the RPC calls, hook it in to the xattr handlers and the listxattr entry point, to make them available. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-07-13 17:52:45 -04:00
Frank van der Linden	c10a75145f	NFSv4.2: add the extended attribute proc functions. Implement the extended attribute procedures for NFSv4.2 extended attribute support (RFC 8276). Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-07-13 17:52:45 -04:00
Frank van der Linden	ccde1e9c01	nfs: make the buf_to_pages_noslab function available to the nfs code Make the buf_to_pages_noslab function available to the rest of the NFS code. Rename it to nfs4_buf_to_pages_noslab to be consistent. This will be used later in the NFSv4.2 xattr code. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-07-13 17:52:45 -04:00
Frank van der Linden	0f44da51ae	nfs: define and use the NFS_INO_INVALID_XATTR flag Define the NFS_INO_INVALID_XATTR flag, to be used for the NFSv4.2 xattr cache, and use it where appropriate. No functional change as yet. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-07-13 17:52:45 -04:00
Frank van der Linden	1b523ca972	nfs: modify update_changeattr to deal with regular files Until now, change attributes in change_info form were only returned by directory operations. However, they are also used for the RFC 8276 extended attribute operations, which work on both directories and regular files. Modify update_changeattr to deal: * Rename it to nfs4_update_changeattr and make it non-static. * Don't always use INO_INVALID_DATA, this isn't needed for a directory that only had its extended attributes changed by us. * Existing callers now always pass in INO_INVALID_DATA. For the current callers of this function, behavior is unchanged. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-07-13 17:52:45 -04:00
Frank van der Linden	72832a2453	NFSv4.2: query the extended attribute access bits RFC 8276 defines separate ACCESS bits for extended attribute checking. Query them in nfs_do_access and opendata. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-07-13 17:52:45 -04:00
Frank van der Linden	d2ae4f8b21	nfs: define nfs_access_get_cached function The only consumer of nfs_access_get_cached_rcu and nfs_access_cached calls these static functions in order to first try RCU access, and then locked access. Combine them in to a single function, and call that. Make this function available to the rest of the NFS code. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-07-13 17:52:45 -04:00
Frank van der Linden	3e1f02123f	NFSv4.2: add client side XDR handling for extended attributes Define the argument and response structures that will be used for RFC 8276 extended attribute RPC calls, and implement the necessary functions to encode/decode the extended attribute operations. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-07-13 17:52:45 -04:00
Frank van der Linden	b78ef845c3	NFSv4.2: query the server for extended attribute support Query the server for extended attribute support, and record it as the NFS_CAP_XATTR flag in the server capabilities. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-07-13 17:52:45 -04:00
Frank van der Linden	04a5da690e	NFSv4.2: define limits and sizes for user xattr handling Set limits for extended attributes (attribute value size and listxattr buffer size), based on the fs-independent limits (XATTR_*_MAX). Define the maximum XDR sizes for the RFC 8276 XATTR operations. In the case of operations that carry a larger payload (SETXATTR, GETXATTR, LISTXATTR), these exclude that payload, which is added as separate pages, like other operations do. Define, much like for read and write operations, the maximum overhead sizes for get/set/listxattr, and use them to limit the maximum payload size for those operations, in combination with the channel attributes. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-07-13 17:52:45 -04:00
Scott Mayhew	df60446cd1	nfsd: avoid a NULL dereference in __cld_pipe_upcall() If the rpc_pipefs is unmounted, then the rpc_pipe->dentry becomes NULL and dereferencing the dentry->d_sb will trigger an oops. The only reason we're doing that is to determine the nfsd_net, which could instead be passed in by the caller. So do that instead. Fixes: `11a60d1592` ("nfsd: add a "GetVersion" upcall for nfsdcld") Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-07-13 17:28:46 -04:00
J. Bruce Fields	94415b06eb	nfsd4: a client's own opens needn't prevent delegations We recently fixed lease breaking so that a client's actions won't break its own delegations. But we still have an unnecessary self-conflict when granting delegations: a client's own write opens will prevent us from handing out a read delegation even when no other client has the file open for write. Fix that by turning off the checks for conflicting opens under vfs_setlease, and instead performing those checks in the nfsd code. We don't depend much on locks here: instead we acquire the delegation, then check for conflicts, and drop the delegation again if we find any. The check beforehand is an optimization of sorts, just to avoid acquiring the delegation unnecessarily. There's a race where the first check could cause us to deny the delegation when we could have granted it. But, that's OK, delegation grants are optional (and probably not even a good idea in that case). Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-07-13 17:28:46 -04:00
Xu Wang	0b7cd9d9ca	nfsd: Use seq_putc() in two functions A single character (line break) should be put into a sequence. Thus use the corresponding function "seq_putc()". Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-07-13 17:28:46 -04:00
Frank van der Linden	0e885e846d	nfsd: add fattr support for user extended attributes Check if user extended attributes are supported for an inode, and return the answer when being queried for file attributes. An exported filesystem can now signal its RFC8276 user extended attributes capability. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-07-13 17:27:03 -04:00
Frank van der Linden	23e50fe3a5	nfsd: implement the xattr functions and en/decode logic Implement the main entry points for the *XATTR operations. Add functions to calculate the reply size for the user extended attribute operations, and implement the XDR encode / decode logic for these operations. Add the user extended attributes operations to nfsd4_ops. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-07-13 17:27:03 -04:00
Frank van der Linden	6178713bd4	nfsd: add structure definitions for xattr requests / responses Add the structures used in extended attribute request / response handling. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-07-13 17:27:03 -04:00
Frank van der Linden	c11d7fd1b3	nfsd: take xattr bits into account for permission checks Since the NFSv4.2 extended attributes extension defines 3 new access bits for xattr operations, take them in to account when validating what the client is asking for, and when checking permissions. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-07-13 17:27:03 -04:00
Frank van der Linden	32119446bb	nfsd: define xattr functions to call into their vfs counterparts This adds the filehandle based functions for the xattr operations that call in to the vfs layer to do the actual work. Signed-off-by: Frank van der Linden <fllinden@amazon.com> [ cel: address checkpatch.pl complaint ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-07-13 17:27:03 -04:00
Frank van der Linden	4dd05fceb7	nfsd: add defines for NFSv4.2 extended attribute support Add defines for server-side extended attribute support. Most have already been added as part of client support, but these are the network order error codes for the noxattr and xattr2big errors, and the addition of the xattr support to the supported file attributes (if configured). Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-07-13 17:27:03 -04:00
Frank van der Linden	874c7b8ea5	nfsd: split off the write decode code into a separate function nfs4_decode_write has code to parse incoming XDR write data in to a kvec head, and a list of pages. Put this code in to a separate function, so that it can be used later by the xattr code, for setxattr. No functional change. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-07-13 17:27:03 -04:00
Frank van der Linden	cab8d289c5	xattr: add a function to check if a namespace is supported Add a function that checks is an extended attribute namespace is supported for an inode, meaning that a handler must be present for either the whole namespace, or at least one synthetic xattr in the namespace. To be used by the nfs server code when being queried for extended attributes support. Cc: linux-fsdevel@vger.kernel.org Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-07-13 17:27:03 -04:00
Frank van der Linden	08b5d5014a	xattr: break delegations in {set,remove}xattr set/removexattr on an exported filesystem should break NFS delegations. This is true in general, but also for the upcoming support for RFC 8726 (NFSv4 extended attribute support). Make sure that they do. Additionally, they need to grow a _locked variant, since callers might call this with i_rwsem held (like the NFS server code). Cc: stable@vger.kernel.org # v4.9+ Cc: linux-fsdevel@vger.kernel.org Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-07-13 17:27:03 -04:00
Kees Cook	173817151b	fs: Expand __receive_fd() to accept existing fd Expand __receive_fd() with support for replace_fd() for the coming seccomp "addfd" ioctl(). Add new wrapper receive_fd_replace() for the new behavior and update existing wrappers to retain old behavior. Thanks to Colin Ian King <colin.king@canonical.com> for pointing out an uninitialized variable exposure in an earlier version of this patch. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dmitry Kadashev <dkadashev@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-13 11:03:45 -07:00
Kees Cook	deefa7f350	fs: Add receive_fd() wrapper for __receive_fd() For both pidfd and seccomp, the __user pointer is not used. Update __receive_fd() to make writing to ufd optional via a NULL check. However, for the receive_fd_user() wrapper, ufd is NULL checked so an -EFAULT can be returned to avoid changing the SCM_RIGHTS interface behavior. Add new wrapper receive_fd() for pidfd and seccomp that does not use the ufd argument. For the new helper, the allocated fd needs to be returned on success. Update the existing callers to handle it. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-13 11:03:44 -07:00
Kees Cook	6659061045	fs: Move __scm_install_fd() to __receive_fd() In preparation for users of the "install a received file" logic outside of net/ (pidfd and seccomp), relocate and rename __scm_install_fd() from net/core/scm.c to __receive_fd() in fs/file.c, and provide a wrapper named receive_fd_user(), as future patches will change the interface to __receive_fd(). Additionally add a comment to fd_install() as a counterpoint to how __receive_fd() interacts with fput(). Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Dmitry Kadashev <dkadashev@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Ido Schimmel <idosch@idosch.org> Cc: Ioana Ciornei <ioana.ciornei@nxp.com> Cc: linux-fsdevel@vger.kernel.org Cc: netdev@vger.kernel.org Reviewed-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-13 11:03:44 -07:00
Anna Schumaker	913fadc5b1	NFS: Fix interrupted slots by sending a solo SEQUENCE operation We used to do this before `3453d5708b`, but this was changed to better handle the NFS4ERR_SEQ_MISORDERED error code. This commit fixed the slot re-use case when the server doesn't receive the interrupted operation, but if the server does receive the operation then it could still end up replying to the client with mis-matched operations from the reply cache. We can fix this by sending a SEQUENCE to the server while recovering from a SEQ_MISORDERED error when we detect that we are in an interrupted slot situation. Fixes: `3453d5708b` (NFSv4.1: Avoid false retries when RPC calls are interrupted) Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-07-13 10:50:41 -04:00
Trond Myklebust	18eb87f444	pNFS/flexfiles: The mirror count could depend on the layout segment range Make sure we specify the layout segment range when calculating the mirror count. In theory, that number could depend on the range to which we're writing. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-07-12 23:49:55 -04:00
Trond Myklebust	f97ff92bd1	pNFS/flexfiles: Clean up redundant calls to pnfs_put_lseg() Both nfs_pageio_reset_read_mds() and nfs_pageio_reset_write_mds() do call pnfs_generic_pg_cleanup() for us. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-07-12 23:49:55 -04:00
Trond Myklebust	ac7cbb2211	NFS: Allow applications to speed up readdir+statx() using AT_STATX_DONT_SYNC If the application uses the AT_STATX_DONT_SYNC flag after doing readdir(), then we should still mark the parent inode as seeing a readdirplus hit. That ensures that we continue to use readdirplus in the 'ls -l' type of workflow to do fast lookups of the dentries. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-07-12 23:49:55 -04:00
Linus Torvalds	4437dd6e8f	io_uring-5.8-2020-07-12 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8LOlAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpsYdD/995gx1/DL98raclVmMJ5i2c4fMEV6th2W0 C/SsbhpLhIurmoKVwjkz24vLNPXsr2E+XV8VMB0aClDrau22/bPGVcl8vfXwg8EK yWrwGvpRXD9jNctGEBBREdRG+CWBsGrhDj8NidVyHfzOIvzFqepR/Ruj4BQydesk Ic70MmTQLDBlmGqR0iuL1TfOkBwIcG9x6JST32XKLA+KLeHV9pPB4qp/5ULiS0Xq +lXw2TqQQ8CtzyIXDMAA0+m4FA/clNSClNUpsOT9me1JulAdDq7GLnG1mh0wDBvr 1y8CiuICgO1iReNn4bXlW4DzegtE0f32omq14GrMeNVK5rQh4JEusneiszKQZJ3u VJRPnXlc5l1LXbYyJbxPDA31Vqt2eIlBkl9VRM+BeUNe+hPZai9PXDYDVeipokaa amXdJm5FxoOLWghpD1c7Tkm0aBYgEr6wFu9UAjmQfN5M8agDkD42odIt7u1a/uQF ZRxaveZs+1IexGO0t84CMRaHZKcvcKTCo6VcJyvDj6vj8Fz3ibDOTNHPCd5fRQTg UnAk5a7lwaOPODFtpiPfvVwsIx60HAmRcw7ZruXAuijEmT/Yht87L8WZ94cLJe4X SeQTSCuHJZJQyWsglK4/Ap5Z5yG9HFjzwYfVnCeIXZt8SEv7fRVdGhRSKemM4Xgq cRHZ/Gu0+g== =g4p9 -----END PGP SIGNATURE----- Merge tag 'io_uring-5.8-2020-07-12' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "Two late fixes again: - Fix missing msg_name assignment in certain cases (Pavel) - Correct a previous fix for full coverage (Pavel)" * tag 'io_uring-5.8-2020-07-12' of git://git.kernel.dk/linux-block: io_uring: fix not initialised work->flags io_uring: fix missing msg_name assignment	2020-07-12 12:17:58 -07:00
Linus Torvalds	72c34e8d70	for-5.8-rc4-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl8K3ugACgkQxWXV+ddt WDsDNBAAn5iaMNwlCBYpwAaWlltMog3SKg+vgpEcFD9qLlmimW/1TlrjjGRzp6Mn nnNp+YjYDotqU9pP1OwESpY1LTzuVQlQL1yaiPLrehw/WsZgjdDWBk/EyU0n1vz1 Sr5wcyCVyVZZyO2/BEVTDhkvu+sj9Rcwo2QCsC2aIOTVSfQGFSklMp2VNdu2YQBy zyTOhbwpn3OPPZsvScEujvSY9oUAN3J8WYA9jmgtwjZD7sr6UNyNI9vy8woi0VAQ Uo7nXc43ZcS1xTwziGOpC6fZi90zrF7ZvfFT0qY92EEDcAQcCzPDl6f4OnAjr6/b rnZcLvusEcENjFQn3pD7fCuXiIRrN8eHspj5+K/oRBTXWC5AykBwsLWt7M+tTMYa ljEBRZlQlHMlC3xSEZNDccEvScXrEIu3Q2WrTOTXSgXi4e3q89VUTEIjAhfnTTzJ VwHhGZIB6o+V7wZ0EhWdt9b1/Ro/AcADddV+AxTsfC1YCHVZOsSSa3DxV243ORsA /U3t2a4SMp/iSHTtoLIwbr/O1Uj9UaOk2n1DcNbGIgdn14yYt6YWOhvrOPBampEa zfBzmAOx9r5Mf2wWD0iTm4gJEZsrB+IpboYZ6cuBcOI29+A4k0POBfRLXgf8/jMo 5kBWm+C3KKkZO8u/Z4gtVG1ZFdxsnYAc+q+UXS5ZSJMH+++UoZQ= =hTok -----END PGP SIGNATURE----- Merge tag 'for-5.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two refcounting fixes and one prepartory patch for upcoming splice cleanup: - fix double put of block group with nodatacow - fix missing block group put when remounting with discard=async - explicitly set splice callback (no functional change), to ease integrating splice cleanup patches" * tag 'for-5.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: wire up iter_file_splice_write btrfs: fix double put of block group with nocow btrfs: discard: add missing put when grabbing block group from unused list	2020-07-12 10:58:35 -07:00
Pavel Begunkov	16d598030a	io_uring: fix not initialised work->flags `59960b9deb` ("io_uring: fix lazy work init") tried to fix missing io_req_init_async(), but left out work.flags and hash. Do it earlier. Fixes: `7cdaf587de` ("io_uring: avoid whole io_wq_work copy for requests completed inline") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-12 09:40:50 -06:00
Pavel Begunkov	dd821e0c95	io_uring: fix missing msg_name assignment Ensure to set msg.msg_name for the async portion of send/recvmsg, as the header copy will copy to/from it. Cc: stable@vger.kernel.org # v5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-12 09:40:25 -06:00
David S. Miller	71930d6102	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net All conflicts seemed rather trivial, with some guidance from Saeed Mameed on the tc_ct.c one. Signed-off-by: David S. Miller <davem@davemloft.net>	2020-07-11 00:46:00 -07:00
Linus Torvalds	5ab39e08ff	4 cifs/smb3 fixes: the three for stable fix problems found recently with change notification including a reference count leak -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl8I/pEACgkQiiy9cAdy T1HEzAv/frpjB9Ae4EH4oI9bqAPpIvmz0OG8qjiVhA7EMuKCPzj4DdP6HwgJ41Rr GS6IaltHHYRFmv2rKPFm9BJs0l03XJmkOlx/D+mvh0YJiNAYGan99wCVuhjotRkk 1LLpG8ZKRSGtxc+gkyaAUiKMZPHtDxvfy/VUsoCzOJBUDVByeH2LlwanXGibWEmw DHap06dClRZ0nLf7BvfL5vtFdX5If9CUOiwgj6PY/Oy+/hhq/XjQJClhVSH1tGhG wXMOpq5gANG0MOCkJh6GVykawjSE0gz/jc6d+zlUW4stqMLQMz7kfIqLy/OZ2z4/ zggpKmxN969ZyhLxldEJguet5+gHu7rX7dQMJb83UBtV9uC8mtgVR/5vTMCtrGBX c9YNdzOnk7Djhb9ka1sx9KAORwagphYS4BIAhk2bQbOcW8OfkiRHVLaDou39Vbn3 3rBVTFDg7Kdg1yFhPXZM1r4HBcWupVHq/jh+kCoHcPVrwYBwNDTflTWogDjWkzph Bc+39FSC =c1VA -----END PGP SIGNATURE----- Merge tag '5.8-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "Four cifs/smb3 fixes: the three for stable fix problems found recently with change notification including a reference count leak" * tag '5.8-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: update internal module version number cifs: fix reference leak for tlink smb3: fix unneeded error message on change notify cifs: remove the retry in cifs_poxis_lock_set smb3: fix access denied on change notify request to some servers	2020-07-10 21:16:48 -07:00
Kees Cook	c818c03b66	seccomp: Report number of loaded filters in /proc/$pid/status A common question asked when debugging seccomp filters is "how many filters are attached to your process?" Provide a way to easily answer this question through /proc/$pid/status with a "Seccomp_filters" line. Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-10 16:01:51 -07:00
Jakub Kicinski	a2b992c828	debugfs: make sure we can remove u32_array files cleanly debugfs_create_u32_array() allocates a small structure to wrap the data and size information about the array. If users ever try to remove the file this leads to a leak since nothing ever frees this wrapper. That said there are no upstream users of debugfs_create_u32_array() that'd remove a u32 array file (we only have one u32 array user in CMA), so there is no real bug here. Make callers pass a wrapper they allocated. This way the lifetime management of the wrapper is on the caller, and we can avoid the potential leak in debugfs. CC: Chucheng Luo <luochucheng@vivo.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-07-10 13:54:00 -07:00
Linus Torvalds	a581387e41	io_uring-5.8-2020-07-10 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8IkFIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgph7tEACdkJ9CupSjVQBJ35/2wBiUCU6lh0iUbru3 oURuw9AdjzT75en68LOyrhfdGCkNX8xrIoHnh9pGukoPid//R3rAsk+5j4Doc7SY FBZuH7HbOC/gdTAzSd0tLyaddmf06eq0PPmlWQGUYu29Vy49nvLDM2V+455OgkT9 csCNk/Fq/np7IKVZW56KZ1Iz90Dtp8BVwFJeFdzFK5cpCjKXTyvzS/5GEFqz0rc3 AwvxkBCpDjXQqh5OOs2qM18QUeJAQV0PX1x+XjtASMB3dr4W8D5KDEhaQwiXiGOT j7yHX/JThtOYI41IiKpF/LP9EQAkavIT1n7ZjXwablq6ecQjj24DxCe0h1uwnKyW G5Hztjcb2YMhhDVj4q6m/uzD13+ccIeEmVoKjRuGj+0YDFETIBYK8U4GElu7VUIr yUKqy7jFEHKFZj5eQz5JDl08+zq/ocZHJctqZhCB3DWMbE5f+VCzNGEqGGs2IiRS J7/S8jB1j7Te8jFXBN3uaNpuj+U/JlYCPqO83fxF14ar3wZ1mblML4cw0sXUJazO 1YI3LU91qBeCNK/9xUc43MzzG/SgdIKJjfoaYYPBPMG5SzVYAtRoUUiNf942UE1k 7u8/leql+RqwSLedjU/AQqzs29p7wcK8nzFBApmTfiiBHYcRbJ6Fxp3/UwTOR3tQ kwmkUtivtQ== =31iz -----END PGP SIGNATURE----- Merge tag 'io_uring-5.8-2020-07-10' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: - Fix memleak for error path in registered files (Yang) - Export CQ overflow state in flags, necessary to fix a case where liburing doesn't know if it needs to enter the kernel (Xiaoguang) - Fix for a regression in when user memory is accounted freed, causing issues with back-to-back ring exit + init if the ulimit -l setting is very tight. * tag 'io_uring-5.8-2020-07-10' of git://git.kernel.dk/linux-block: io_uring: account user memory freed when exit has been queued io_uring: fix memleak in io_sqe_files_register() io_uring: fix memleak in __io_sqe_files_update() io_uring: export cq overflow status to userspace	2020-07-10 09:57:57 -07:00
Linus Torvalds	b1b11d0063	cleanup in-kernel read and write operations Reshuffle the (__)kernel_read and (__)kernel_write helpers, and ensure all users of in-kernel file I/O use them if they don't use iov_iter based methods already. -----BEGIN PGP SIGNATURE----- iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl8Ij8gLHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYOcpBAAn157ooLqRrqQisEA6j59rTgkHUuqZMUx+8XjiivX baHQPmgctza1Xzjc4PjJ1owtLpt4ywcTpY8IDj3vZF1PpffeeuWVzxMTk/aIvhNN zPK2SJpRlDQHErKEhkTTOfOYoFTgc7vPa5Hvm6AEMaJs8oPtGZ2rnQHzPXENl/TY TgcLd1ou3iuw19UIAfB+EfuC9uhq7pCPu9+tryNyT2IfM7fqdsIhRESpcodg1ve+ 1k6leFIBrXa3MWiBGVUGCrSmlpP9xd22Zl8D/w60WeYWeg7szZoUK2bjhbdIEDZI tTwkdZ73IKpcxOyzUVbfr2hqNa94zrXCKQGfEGVS/7arV7QH4yvhg9NU9lqVXZKV ruPoyjsmJkHW52FfEEv1Gfrd6v4H6qZ6iyJEm3ZYNGul85O97t1xA/kKxAIwMuPa nFhhxHIooT/We3Ao77FROhIob4D5AOfOI4gvkTE15YMzsNxT/yjilQjdDFR5An6A ckzqb+VyDvcTx2gxR/qaol7b4lzmri4S/8Jt7WXjHOtNe9eXC4kl44leitK5j31H fHZNyMLJ2+/JF5pGB2rNRNnTeQ7lXKob4Y+qAjRThddDxtdsf5COdZAiIiZbRurR Ogl2k3sMDdHgNfycK2Bg5Fab9OIWePQlpcGU14afUSPviuNkIYKLGrx92ZWef53j loI= =eYsI -----END PGP SIGNATURE----- Merge tag 'cleanup-kernel_read_write' of git://git.infradead.org/users/hch/misc Pull in-kernel read and write op cleanups from Christoph Hellwig: "Cleanup in-kernel read and write operations Reshuffle the (__)kernel_read and (__)kernel_write helpers, and ensure all users of in-kernel file I/O use them if they don't use iov_iter based methods already. The new WARN_ONs in combination with syzcaller already found a missing input validation in 9p. The fix should be on your way through the maintainer ASAP". [ This is prep-work for the real changes coming 5.9 ] * tag 'cleanup-kernel_read_write' of git://git.infradead.org/users/hch/misc: fs: remove __vfs_read fs: implement kernel_read using __kernel_read integrity/ima: switch to using __kernel_read fs: add a __kernel_read helper fs: remove __vfs_write fs: implement kernel_write using __kernel_write fs: check FMODE_WRITE in __kernel_write fs: unexport __kernel_write bpfilter: switch to kernel_write autofs: switch to kernel_write cachefiles: switch to kernel_write	2020-07-10 09:45:15 -07:00
Linus Torvalds	d02b0478c1	Fix gfs2 readahead deadlocks -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl8IgKwUHGFncnVlbmJh QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpm2Q//bj3ZDIjep9a4d7mRVGeX3OeslLzk NDB2Vu03B0oZKQFYbQNdpblxy2Cfyz4m8xkNCdsD8EQ2d1zaPWhywJ6vxc1VO5Dw wRODwRMgVe0hd9dLR8b8GzUO0+4ncpjqmyEyrCRjwPRkghcX8uuSTifXtY+yeDEv X2BHlSGMjqCFBfq+RTa8Fi3wWFy9QhGy74QVoidMM0ulFLJbWSu0EnCXZ+hZQ4vR sJokd2SDSP60LE964CwMxuMNUNwSMwL3VrlUm74qx1WVCK8lyYtm231E5CAHRbAw C/f6sIKoyzyfJbv2HqgvMXvh72hO4MaJgIb8Pbht8a9GZdfk6i2JbcNmHXXk5OMN GkYLLhkDrj4X/MChNuk20Zsylaij1+CCLb6C4UsQeXF0e/QA6iYIGRmpApGN2gNP IA8rTz4Ibmd5ZpVMJNPOGSbq3fpPEboEoxVn+fWVvhDTopATxYS85tKqU5Bfvdr5 QcBqqeAL9yludQa520C1lIbGDBOJ57LisybMBVufklx8ZtFNNbHyB/b1YnfUBvRF 8WXVpYkh1ckB4VvVj7qnKY2/JJT0VVhQmTogqwqZy9m+Nb8I4l0pemUsJnypS0qs KmoBvZmhWhE3tnqmCVzSvuHzO/eYGSfN91AavGBaddFzsqLLe8Hkm8kzlS5bZxGn OVWGWVvuoSu72s8= =dfnJ -----END PGP SIGNATURE----- Merge tag 'gfs2-v5.8-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 fixes from Andreas Gruenbacher: "Fix gfs2 readahead deadlocks by adding a IOCB_NOIO flag that allows gfs2 to use the generic fiel read iterator functions without having to worry about being called back while holding locks". * tag 'gfs2-v5.8-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Rework read and page fault locking fs: Add IOCB_NOIO flag for generic_file_read_iter	2020-07-10 08:53:21 -07:00
Jens Axboe	309fc03a32	io_uring: account user memory freed when exit has been queued We currently account the memory after the exit work has been run, but that leaves a gap where a process has closed its ring and until the memory has been accounted as freed. If the memlocked ulimit is borderline, then that can introduce spurious setup errors returning -ENOMEM because the free work hasn't been run yet. Account this as freed when we close the ring, as not to expose a tiny gap where setting up a new ring can fail. Fixes: `85faa7b834` ("io_uring: punt final io_ring_ctx wait-and-free to workqueue") Cc: stable@vger.kernel.org # v5.7 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-10 09:18:35 -06:00
Yang Yingliang	667e57da35	io_uring: fix memleak in io_sqe_files_register() I got a memleak report when doing some fuzz test: BUG: memory leak unreferenced object 0x607eeac06e78 (size 8): comm "test", pid 295, jiffies 4294735835 (age 31.745s) hex dump (first 8 bytes): 00 00 00 00 00 00 00 00 ........ backtrace: [<00000000932632e6>] percpu_ref_init+0x2a/0x1b0 [<0000000092ddb796>] __io_uring_register+0x111d/0x22a0 [<00000000eadd6c77>] __x64_sys_io_uring_register+0x17b/0x480 [<00000000591b89a6>] do_syscall_64+0x56/0xa0 [<00000000864a281d>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Call percpu_ref_exit() on error path to avoid refcount memleak. Fixes: `05f3fb3c53` ("io_uring: avoid ring quiesce for fixed file set unregister and update") Cc: stable@vger.kernel.org Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-10 07:50:21 -06:00
Xu Wang	c80a67bd5d	debugfs: file: Remove unnecessary cast in kfree() Remove unnecassary casts in the argument to kfree. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Link: https://lore.kernel.org/r/20200709054033.30148-1-vulab@iscas.ac.cn Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-07-10 15:07:56 +02:00
Jens Axboe	4349f30ecb	io_uring: remove dead 'ctx' argument and move forward declaration We don't use 'ctx' at all in io_sq_thread_drop_mm(), it just works on the mm of the current task. Drop the argument. Move io_file_put_work() to where we have the other forward declarations of functions. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-09 15:07:01 -06:00
Christoph Hellwig	d777659113	btrfs: wire up iter_file_splice_write btrfs implements the iter_write op and thus can use the more efficient iov_iter based splice implementation. For now falling back to the less efficient default is pretty harmless, but I have a pending series that removes the default, and thus would cause btrfs to not support splice at all. Reported-by: Andy Lavr <andy.lavr@gmail.com> Tested-by: Andy Lavr <andy.lavr@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-09 19:57:58 +02:00
Waiman Long	c3f2375b90	xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim Depending on the workloads, the following circular locking dependency warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo lock) may show up: ====================================================== WARNING: possible circular locking dependency detected 5.0.0-rc1+ #60 Tainted: G W ------------------------------------------------------ fsfreeze/4346 is trying to acquire lock: 0000000026f1d784 (fs_reclaim){+.+.}, at: fs_reclaim_acquire.part.19+0x5/0x30 but task is already holding lock: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 which lock already depends on the new lock. : Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sb_internal); lock(fs_reclaim); lock(sb_internal); lock(fs_reclaim); * DEADLOCK * 4 locks held by fsfreeze/4346: #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650 #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290 #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650 #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 stack backtrace: Call Trace: dump_stack+0xe0/0x19a print_circular_bug.isra.10.cold.34+0x2f4/0x435 check_prev_add.constprop.19+0xca1/0x15f0 validate_chain.isra.14+0x11af/0x3b50 __lock_acquire+0x728/0x1200 lock_acquire+0x269/0x5a0 fs_reclaim_acquire.part.19+0x29/0x30 fs_reclaim_acquire+0x19/0x20 kmem_cache_alloc+0x3e/0x3f0 kmem_zone_alloc+0x79/0x150 xfs_trans_alloc+0xfa/0x9d0 xfs_sync_sb+0x86/0x170 xfs_log_sbcount+0x10f/0x140 xfs_quiesce_attr+0x134/0x270 xfs_fs_freeze+0x4a/0x70 freeze_super+0x1af/0x290 do_vfs_ioctl+0xedc/0x16c0 ksys_ioctl+0x41/0x80 __x64_sys_ioctl+0x73/0xa9 do_syscall_64+0x18f/0xd23 entry_SYSCALL_64_after_hwframe+0x49/0xbe This is a false positive as all the dirty pages are flushed out before the filesystem can be frozen. One way to avoid this splat is to add GFP_NOFS to the affected allocation calls by using the memalloc_nofs_save()/memalloc_nofs_restore() pair. This shouldn't matter unless the system is really running out of memory. In that particular case, the filesystem freeze operation may fail while it was succeeding previously. Without this patch, the command sequence below will show that the lock dependency chain sb_internal -> fs_reclaim exists. # fsfreeze -f /home # fsfreeze --unfreeze /home # grep -i fs_reclaim -C 3 /proc/lockdep_chains \| grep -C 5 sb_internal After applying the patch, such sb_internal -> fs_reclaim lock dependency chain can no longer be found. Because of that, the locking dependency warning will not be shown. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-07-09 09:16:38 -07:00
Josef Bacik	230ed39743	btrfs: fix double put of block group with nocow While debugging a patch that I wrote I was hitting use-after-free panics when accessing block groups on unmount. This turned out to be because in the nocow case if we bail out of doing the nocow for whatever reason we need to call btrfs_dec_nocow_writers() if we called the inc. This puts our block group, but a few error cases does if (nocow) { btrfs_dec_nocow_writers(); goto error; } unfortunately, error is error: if (nocow) btrfs_dec_nocow_writers(); so we get a double put on our block group. Fix this by dropping the error cases calling of btrfs_dec_nocow_writers(), as it's handled at the error label now. Fixes: `762bf09893` ("btrfs: improve error handling in run_delalloc_nocow") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-09 17:44:26 +02:00
Jens Axboe	2bc9930e78	io_uring: get rid of __req_need_defer() We just have one caller of this, req_need_defer(), just inline the code in there instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-09 09:43:27 -06:00
Steve French	a8dab63ea6	cifs: update internal module version number To 2.28 Signed-off-by: Steve French <stfrench@microsoft.com>	2020-07-09 10:07:09 -05:00
Ronnie Sahlberg	a77592a700	cifs: fix reference leak for tlink Don't leak a reference to tlink during the NOTIFY ioctl Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> CC: Stable <stable@vger.kernel.org> # v5.6+	2020-07-09 10:06:52 -05:00
Ard Biesheuvel	f88814cc25	efi/efivars: Expose RT service availability via efivars abstraction Commit `bf67fad19e` ("efi: Use more granular check for availability for variable services") introduced a check into the efivarfs, efi-pstore and other drivers that aborts loading of the module if not all three variable runtime services (GetVariable, SetVariable and GetNextVariable) are supported. However, this results in efivarfs being unavailable entirely if only SetVariable support is missing, which is only needed if you want to make any modifications. Also, efi-pstore and the sysfs EFI variable interface could be backed by another implementation of the 'efivars' abstraction, in which case it is completely irrelevant which services are supported by the EFI firmware. So make the generic 'efivars' abstraction dependent on the availibility of the GetVariable and GetNextVariable EFI runtime services, and add a helper 'efivar_supports_writes()' to find out whether the currently active efivars abstraction supports writes (and wire it up to the availability of SetVariable for the generic one). Then, use the efivar_supports_writes() helper to decide whether to permit efivarfs to be mounted read-write, and whether to enable efi-pstore or the sysfs EFI variable interface altogether. Fixes: `bf67fad19e` ("efi: Use more granular check for availability for variable services") Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Tested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>	2020-07-09 10:14:29 +03:00
Alexander A. Klimov	1f1a5be80c	Replace HTTP links with HTTPS ones: DISKQUOTA Rationale: Reduces attack surface on kernel devs opening the links for MITM as HTTPS traffic is much harder to manipulate. Deterministic algorithm: For each file: If not .svg: For each line: If doesn't contain `\bxmlns\b`: For each link, `\bhttp://[^# \t\r\n]*(?:\w\|/)`: If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`: If both the HTTP and HTTPS versions return 200 OK and serve the same content: Replace HTTP with HTTPS. Link: https://lore.kernel.org/r/20200708171905.15396-1-grandmaster@al2klimov.de Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de> Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-09 08:14:01 +02:00
Chengguang Xu	1197d04fd3	ext2: initialize quota info in ext2_xattr_set() In order to correctly account/limit space usage, should initialize quota info before calling quota related functions. Link: https://lore.kernel.org/r/20200626054959.114177-1-cgxu519@mykernel.net Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-09 08:14:01 +02:00
Chengguang Xu	cf1013f441	ext2: fix some incorrect comments in inode.c There are some incorrect comments in inode.c, so fix them properly. Link: https://lore.kernel.org/r/20200703124411.24085-1-cgxu519@mykernel.net Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-09 08:14:01 +02:00
Chengguang Xu	30b42a714d	ext2: remove nocheck option Remove useless nocheck option. Link: https://lore.kernel.org/r/20200619073144.4701-1-cgxu519@mykernel.net Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-09 08:14:01 +02:00
Mikulas Patocka	bc2fbaa4d3	ext2: fix missing percpu_counter_inc sbi->s_freeinodes_counter is only decreased by the ext2 code, it is never increased. This patch fixes it. Note that sbi->s_freeinodes_counter is only used in the algorithm that tries to find the group for new allocations, so this bug is not easily visible (the only visibility is that the group finding algorithm selects inoptinal result). Link: https://lore.kernel.org/r/alpine.LRH.2.02.2004201538300.19436@file01.intranet.prod.int.rdu2.redhat.com Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-09 08:14:01 +02:00
zhangyi (F)	a43850a380	ext2: ext2_find_entry() return -ENOENT if no entry found Almost all callers of ext2_find_entry() transform NULL return value to -ENOENT, so just let ext2_find_entry() retuen -ENOENT instead of NULL if no valid entry found, and also switch to check the return value of ext2_inode_by_name() in ext2_lookup() and ext2_get_parent(). Link: https://lore.kernel.org/r/20200608034043.10451-2-yi.zhang@huawei.com Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-09 08:14:00 +02:00
zhangyi (F)	b4962091a5	ext2: propagate errors up to ext2_find_entry()'s callers The same to commit <36de928641ee4> (ext4: propagate errors up to ext4_find_entry()'s callers') in ext4, also return error instead of NULL pointer in case of some error happens in ext2_find_entry() (e.g. -ENOMEM or -EIO). This could avoid a negative dentry cache entry installed even it failed to read directory block due to IO error. Link: https://lore.kernel.org/r/20200608034043.10451-1-yi.zhang@huawei.com Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-09 08:14:00 +02:00
Chengguang Xu	1fcbcf06e4	ext2: fix improper assignment for e_value_offs In the process of changing value for existing EA, there is an improper assignment of e_value_offs(setting to 0), because it will be reset to incorrect value in the following loop(shifting EA values before target). Delayed assignment can avoid this issue. Link: https://lore.kernel.org/r/20200603084429.25344-1-cgxu519@mykernel.net Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Signed-off-by: Jan Kara <jack@suse.cz>	2020-07-09 08:14:00 +02:00
Chao Yu	aff6fbbe8e	f2fs: don't keep meta inode pages used for compressed block migration meta inode's pages are used for encrypted, verity and compressed blocks, so the meta inode's cache invalidation condition in do_checkpoint() should consider compression as well, not just for verity and encryption, fix it. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-08 22:28:34 -07:00
Yang Yingliang	f3bd9dae37	io_uring: fix memleak in __io_sqe_files_update() I got a memleak report when doing some fuzz test: BUG: memory leak unreferenced object 0xffff888113e02300 (size 488): comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ a0 a4 ce 19 81 88 ff ff 60 ce 09 0d 81 88 ff ff ........`....... backtrace: [<00000000129a84ec>] kmem_cache_zalloc include/linux/slab.h:659 [inline] [<00000000129a84ec>] __alloc_file+0x25/0x310 fs/file_table.c:101 [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151 [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193 [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233 [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline] [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74 [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720 [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359 [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 BUG: memory leak unreferenced object 0xffff8881152dd5e0 (size 16): comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s) hex dump (first 16 bytes): 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<0000000074caa794>] kmem_cache_zalloc include/linux/slab.h:659 [inline] [<0000000074caa794>] lsm_file_alloc security/security.c:567 [inline] [<0000000074caa794>] security_file_alloc+0x32/0x160 security/security.c:1440 [<00000000c6745ea3>] __alloc_file+0xba/0x310 fs/file_table.c:106 [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151 [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193 [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233 [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline] [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74 [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720 [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359 [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 If io_sqe_file_register() failed, we need put the file that get by fget() to avoid the memleak. Fixes: `c3a31e6056` ("io_uring: add support for IORING_REGISTER_FILES_UPDATE") Cc: stable@vger.kernel.org Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-08 20:16:19 -06:00
Xiaoguang Wang	6d5f904904	io_uring: export cq overflow status to userspace For those applications which are not willing to use io_uring_enter() to reap and handle cqes, they may completely rely on liburing's io_uring_peek_cqe(), but if cq ring has overflowed, currently because io_uring_peek_cqe() is not aware of this overflow, it won't enter kernel to flush cqes, below test program can reveal this bug: static void test_cq_overflow(struct io_uring ring) { struct io_uring_cqe cqe; struct io_uring_sqe sqe; int issued = 0; int ret = 0; do { sqe = io_uring_get_sqe(ring); if (!sqe) { fprintf(stderr, "get sqe failed\n"); break;; } ret = io_uring_submit(ring); if (ret <= 0) { if (ret != -EBUSY) fprintf(stderr, "sqe submit failed: %d\n", ret); break; } issued++; } while (ret > 0); assert(ret == -EBUSY); printf("issued requests: %d\n", issued); while (issued) { ret = io_uring_peek_cqe(ring, &cqe); if (ret) { if (ret != -EAGAIN) { fprintf(stderr, "peek completion failed: %s\n", strerror(ret)); break; } printf("left requets: %d\n", issued); continue; } io_uring_cqe_seen(ring, cqe); issued--; printf("left requets: %d\n", issued); } } int main(int argc, char argv[]) { int ret; struct io_uring ring; ret = io_uring_queue_init(16, &ring, 0); if (ret) { fprintf(stderr, "ring setup failed: %d\n", ret); return 1; } test_cq_overflow(&ring); return 0; } To fix this issue, export cq overflow status to userspace by adding new IORING_SQ_CQ_OVERFLOW flag, then helper functions() in liburing, such as io_uring_peek_cqe, can be aware of this cq overflow and do flush accordingly. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-08 19:17:06 -06:00
Christoph Hellwig	21cf866145	writeback: remove bdi->congested_fn Except for pktdvd, the only places setting congested bits are file systems that allocate their own backing_dev_info structures. And pktdvd is a deprecated driver that isn't useful in stack setup either. So remove the dead congested_fn stacking infrastructure. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [axboe: fixup unused variables in bcache/request.c] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-08 17:20:46 -06:00
Christoph Hellwig	13ab64880e	isofs: remove a stale comment check_disk_change isn't for consumers of the block layer, so remove the comment mentioning it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-08 16:20:01 -06:00
Christoph Hellwig	9a3ffbbc65	block: remove flush_disk flush_disk has only two callers, so open code it there. That also helps clarifying the error message for the particular case, and allows to remove setting bd_invalidated in check_disk_size_change, which will be cleared again instantly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-08 16:20:01 -06:00
Jens Axboe	5acbbc8ed3	io_uring: only call kfree() for a non-zero pointer It's safe to call kfree() with a NULL pointer, but it's also pointless. Most of the time we don't have any data to free, and at millions of requests per second, the redundant function call adds noticeable overhead (about 1.3% of the runtime). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-08 15:15:26 -06:00
Dan Carpenter	aa340845ae	io_uring: fix a use after free in io_async_task_func() The "apoll" variable is freed and then used on the next line. We need to move the free down a few lines. Fixes: `0be0b0e33b` ("io_uring: simplify io_async_task_func()") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-08 13:15:04 -06:00
Eric Biggers	4f74d15fe4	ext4: add inline encryption support Wire up ext4 to support inline encryption via the helper functions which fs/crypto/ now provides. This includes: - Adding a mount option 'inlinecrypt' which enables inline encryption on encrypted files where it can be used. - Setting the bio_crypt_ctx on bios that will be submitted to an inline-encrypted file. Note: submit_bh_wbc() in fs/buffer.c also needed to be patched for this part, since ext4 sometimes uses ll_rw_block() on file data. - Not adding logically discontiguous data to bios that will be submitted to an inline-encrypted file. - Not doing filesystem-layer crypto on inline-encrypted files. Co-developed-by: Satya Tangirala <satyat@google.com> Signed-off-by: Satya Tangirala <satyat@google.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20200702015607.1215430-5-satyat@google.com Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-07-08 10:29:43 -07:00
Satya Tangirala	27aacd28ea	f2fs: add inline encryption support Wire up f2fs to support inline encryption via the helper functions which fs/crypto/ now provides. This includes: - Adding a mount option 'inlinecrypt' which enables inline encryption on encrypted files where it can be used. - Setting the bio_crypt_ctx on bios that will be submitted to an inline-encrypted file. - Not adding logically discontiguous data to bios that will be submitted to an inline-encrypted file. - Not doing filesystem-layer crypto on inline-encrypted files. This patch includes a fix for a race during IPU by Sahitya Tummala <stummala@codeaurora.org> Signed-off-by: Satya Tangirala <satyat@google.com> Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Link: https://lore.kernel.org/r/20200702015607.1215430-4-satyat@google.com Co-developed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-07-08 10:29:43 -07:00
Satya Tangirala	5fee36095c	fscrypt: add inline encryption support Add support for inline encryption to fs/crypto/. With "inline encryption", the block layer handles the decryption/encryption as part of the bio, instead of the filesystem doing the crypto itself via Linux's crypto API. This model is needed in order to take advantage of the inline encryption hardware present on most modern mobile SoCs. To use inline encryption, the filesystem needs to be mounted with '-o inlinecrypt'. Blk-crypto will then be used instead of the traditional filesystem-layer crypto whenever possible to encrypt the contents of any encrypted files in that filesystem. Fscrypt still provides the key and IV to use, and the actual ciphertext on-disk is still the same; therefore it's testable using the existing fscrypt ciphertext verification tests. Note that since blk-crypto has a fallback to Linux's crypto API, and also supports all the encryption modes currently supported by fscrypt, this feature is usable and testable even without actual inline encryption hardware. Per-filesystem changes will be needed to set encryption contexts when submitting bios and to implement the 'inlinecrypt' mount option. This patch just adds the common code. Signed-off-by: Satya Tangirala <satyat@google.com> Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20200702015607.1215430-3-satyat@google.com Co-developed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com>	2020-07-08 10:29:30 -07:00
Chao Yu	9627a7b31f	f2fs: fix error path in do_recover_data() - don't panic kernel if f2fs_get_node_page() fails in f2fs_recover_inline_data() or f2fs_recover_inline_xattr(); - return error number of f2fs_truncate_blocks() to f2fs_recover_inline_data()'s caller; Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-08 10:11:19 -07:00
Chao Yu	f567adb034	f2fs: fix to wait GCed compressed page writeback like we did for encrypted page. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-08 10:11:19 -07:00
Dehe Gu	ffcde4b29a	f2fs: remove write attribute of main_blkaddr sysfs node Fuzzing main_blkaddr sysfs node will corrupt this field's value, causing kernel panic, remove its write attribute to avoid potential security risk. [Chao Yu: add description] Signed-off-by: Dehe Gu <gudehe@huawei.com> Signed-off-by: Daiyue Zhang <zhangdaiyue1@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-08 10:11:19 -07:00
Steve French	8668115cf2	smb3: fix unneeded error message on change notify We should not be logging a warning repeatedly on change notify. CC: Stable <stable@vger.kernel.org> # v5.6+ Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2020-07-08 03:59:02 -05:00
Christoph Hellwig	775802c057	fs: remove __vfs_read Fold it into the two callers. Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 08:27:57 +02:00
Christoph Hellwig	6209dd9132	fs: implement kernel_read using __kernel_read Consolidate the two in-kernel read helpers to make upcoming changes easier. The only difference are the missing call to rw_verify_area in kernel_read, and an access_ok check that doesn't make sense for kernel buffers to start with. Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 08:27:57 +02:00
Christoph Hellwig	61a707c543	fs: add a __kernel_read helper This is the counterpart to __kernel_write, and skip the rw_verify_area call compared to kernel_read. Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 08:27:56 +02:00
Christoph Hellwig	53ad86266b	fs: remove __vfs_write Fold it into the two callers. Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 08:27:56 +02:00
Christoph Hellwig	81238b2cff	fs: implement kernel_write using __kernel_write Consolidate the two in-kernel write helpers to make upcoming changes easier. The only difference are the missing call to rw_verify_area in kernel_write, and an access_ok check that doesn't make sense for kernel buffers to start with. Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 08:27:56 +02:00
Christoph Hellwig	a01ac27be4	fs: check FMODE_WRITE in __kernel_write Add a WARN_ON_ONCE if the file isn't actually open for write. This matches the check done in vfs_write, but actually warn warns as a kernel user calling write on a file not opened for writing is a pretty obvious programming error. Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 08:27:56 +02:00
Christoph Hellwig	9db9775224	fs: unexport __kernel_write This is a very special interface that skips sb_writes protection, and not used by modules anymore. Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 08:27:56 +02:00
Christoph Hellwig	13c164b1a1	autofs: switch to kernel_write While pipes don't really need sb_writers projection, __kernel_write is an interface better kept private, and the additional rw_verify_area does not hurt here. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ian Kent <raven@themaw.net>	2020-07-08 08:27:56 +02:00
Christoph Hellwig	97c7990c4b	cachefiles: switch to kernel_write __kernel_write doesn't take a sb_writers references, which we need here. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Howells <dhowells@redhat.com>	2020-07-08 08:27:56 +02:00
Daeho Jeong	0e5e81114d	f2fs: add GC_URGENT_LOW mode in gc_urgent Added a new gc_urgent mode, GC_URGENT_LOW, in which mode F2FS will lower the bar of checking idle in order to process outstanding discard commands and GC a little bit aggressively. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:49 -07:00
Jaegeuk Kim	6b12367da2	f2fs: avoid readahead race condition If two readahead threads having same offset enter in readpages, every read IOs are split and issued to the disk which giving lower bandwidth. This patch tries to avoid redundant readahead calls. Fixes one build error reported by Randy. Fix build error when F2FS_FS_COMPRESSION is not set/enabled. This label is needed in either case. ../fs/f2fs/data.c: In function ‘f2fs_mpage_readpages’: ../fs/f2fs/data.c:2327:5: error: label ‘next_page’ used but not defined goto next_page; Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:48 -07:00
Chao Yu	d7cd3702ca	f2fs: fix return value of move_data_block() If f2fs_grab_cache_page() fails, it needs to return -ENOMEM. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:48 -07:00
Jia Yang	b7973091f0	f2fs: add parameter op_flag in f2fs_submit_page_read() The parameter op_flag is not used in f2fs_get_read_data_page(), but it is used in f2fs_grab_read_bio(). Obviously, op_flag is not passed to f2fs_grab_read_bio() successfully. We need to add parameter in f2fs_submit_page_read() to pass it. The case: - gc_data_segment - f2fs_get_read_data_page(.., op_flag = REQ_RAHEAD,..) - f2fs_submit_page_read - f2fs_grab_read_bio(.., op_flag = 0, ..) Signed-off-by: Jia Yang <jiayang5@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:48 -07:00
Chao Yu	901d745f8e	f2fs: split f2fs_allocate_new_segments() to two independent functions: - f2fs_allocate_new_segment() for specified type segment allocation - f2fs_allocate_new_segments() for all data type segments allocation Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:48 -07:00
Yubo Feng	9039d8355d	f2fs: lost matching-pair of trace in f2fs_truncate_inode_blocks if get_node_path() return -E2BIG and trace of f2fs_truncate_inode_blocks_enter/exit enabled then the matching-pair of trace_exit will lost in log. Signed-off-by: Yubo Feng <fengyubo3@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:47 -07:00
Yu Changchun	29b993c7cd	f2fs: fix an oops in f2fs_is_compressed_page This patch is to fix a crash: #3 [ffffb6580689f898] oops_end at ffffffffa2835bc2 #4 [ffffb6580689f8b8] no_context at ffffffffa28766e7 #5 [ffffb6580689f920] async_page_fault at ffffffffa320135e [exception RIP: f2fs_is_compressed_page+34] RIP: ffffffffa2ba83a2 RSP: ffffb6580689f9d8 RFLAGS: 00010213 RAX: 0000000000000001 RBX: fffffc0f50b34bc0 RCX: 0000000000002122 RDX: 0000000000002123 RSI: 0000000000000c00 RDI: fffffc0f50b34bc0 RBP: ffff97e815a40178 R8: 0000000000000000 R9: ffff97e83ffc9000 R10: 0000000000032300 R11: 0000000000032380 R12: ffffb6580689fa38 R13: fffffc0f50b34bc0 R14: ffff97e825cbd000 R15: 0000000000000c00 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #6 [ffffb6580689f9d8] __is_cp_guaranteed at ffffffffa2b7ea98 #7 [ffffb6580689f9f0] f2fs_submit_page_write at ffffffffa2b81a69 #8 [ffffb6580689fa30] f2fs_do_write_meta_page at ffffffffa2b99777 #9 [ffffb6580689fae0] __f2fs_write_meta_page at ffffffffa2b75f1a #10 [ffffb6580689fb18] f2fs_sync_meta_pages at ffffffffa2b77466 #11 [ffffb6580689fc98] do_checkpoint at ffffffffa2b78e46 #12 [ffffb6580689fd88] f2fs_write_checkpoint at ffffffffa2b79c29 #13 [ffffb6580689fdd0] f2fs_sync_fs at ffffffffa2b69d95 #14 [ffffb6580689fe20] sync_filesystem at ffffffffa2ad2574 #15 [ffffb6580689fe30] generic_shutdown_super at ffffffffa2a9b582 #16 [ffffb6580689fe48] kill_block_super at ffffffffa2a9b6d1 #17 [ffffb6580689fe60] kill_f2fs_super at ffffffffa2b6abe1 #18 [ffffb6580689fea0] deactivate_locked_super at ffffffffa2a9afb6 #19 [ffffb6580689feb8] cleanup_mnt at ffffffffa2abcad4 #20 [ffffb6580689fee0] task_work_run at ffffffffa28bca28 #21 [ffffb6580689ff00] exit_to_usermode_loop at ffffffffa28050b7 #22 [ffffb6580689ff38] do_syscall_64 at ffffffffa280560e #23 [ffffb6580689ff50] entry_SYSCALL_64_after_hwframe at ffffffffa320008c This occurred when umount f2fs if enable F2FS_FS_COMPRESSION with F2FS_IO_TRACE. Fixes it by adding IS_IO_TRACED_PAGE to check validity of pid for page_private. Signed-off-by: Yu Changchun <yuchangchun1@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:47 -07:00
Lihong Kou	9a99c17dab	f2fs: make trace enter and end in pairs for unlink In the f2fs_unlink we do not add trace end for some error paths, just add. Signed-off-by: Lihong Kou <koulihong@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:47 -07:00
Chao Yu	eb1353cfa9	f2fs: fix to check page dirty status before writeback In f2fs_write_raw_pages(), we need to check page dirty status before writeback, because there could be a racer (e.g. reclaimer) helps writebacking the dirty page. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:47 -07:00
Wang Xiaojun	d078319d06	f2fs: remove the unused compr parameter The parameter compr is unused in the f2fs_cluster_blocks function so we no longer need to pass it as a parameter. Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:46 -07:00
Chao Yu	dd5a09bd05	f2fs: support to trace f2fs_fiemap() to show f2fs_fiemap()'s result as below: f2fs_fiemap: dev = (251,0), ino = 7, lblock:0, pblock:1625292800, len:2097152, flags:0, ret:0 Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:46 -07:00
Chao Yu	b79b0a310b	f2fs: support to trace f2fs_bmap() to show f2fs_bmap()'s result as below: f2fs_bmap: dev = (251,0), ino = 7, lblock:0, pblock:396800 Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:46 -07:00
Chao Yu	250e84d725	f2fs: fix wrong return value of f2fs_bmap_compress() If compression is disable, we should return zero rather than -EOPNOTSUPP to indicate f2fs_bmap() is not supported. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:46 -07:00
Liu Song	b815bdc781	f2fs: remove useless parameter of __insert_free_nid() In current version, @state will only be FREE_NID. This parameter has no real effect so remove it to keep clean. Signed-off-by: Liu Song <liu.song11@zte.com.cn> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:45 -07:00
Liu Song	e5cc2c5563	f2fs: fix typo in comment of f2fs_do_add_link stakable/stackable Signed-off-by: Liu Song <fishland@aliyun.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:45 -07:00
Chao Yu	a6d601f30d	f2fs: fix to wait page writeback before update Filesystem including f2fs should support stable page for special device like software raid, however there is one missing path that page could be updated while it is writeback state as below, fix this. - gc_node_segment - f2fs_move_node_page - __write_node_page - set_page_writeback - do_read_inode - f2fs_init_extent_tree - __f2fs_init_extent_tree i_ext->len = 0; Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:45 -07:00
Chao Yu	0759e2c151	f2fs: show more debug info for per-temperature log - Add to account and show per-log dirty_seg, full_seg and valid_blocks in debugfs. - reformat printed info. TYPE segno secno zoneno dirty_seg full_seg valid_blk - COLD data: 1523 1523 1523 1 0 399 - WARM data: 769 769 769 20 255 133098 - HOT data: 767 767 767 9 0 167 - Dir dnode: 22 22 22 3 0 70 - File dnode: 722 722 722 14 10 6505 - Indir nodes: 2 2 2 1 0 3 Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:45 -07:00
Qilong Zhang	9776750078	f2fs: add f2fs_gc exception handle in f2fs_ioc_gc_range When f2fs_ioc_gc_range performs multiple segments gc ops, the return value of f2fs_ioc_gc_range is determined by the last segment gc ops. If its ops failed, the f2fs_ioc_gc_range will be considered to be failed despite some of previous segments gc ops succeeded. Therefore, so we fix: Redefine the return value of getting victim ops and add exception handle for f2fs_gc. In particular, 1).if target has no valid block, it will go on. 2).if target sectoion has valid block(s), but it is current section, we will reminder the caller. Signed-off-by: Qilong Zhang <zhangqilong3@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:44 -07:00
Chao Yu	f608c38c59	f2fs: clean up parameter of f2fs_allocate_data_block() Use validation of @fio to inidcate whether caller want to serialize IOs in io.io_list or not, then @add_list will be redundant, remove it. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:44 -07:00
Chao Yu	79963d967b	f2fs: shrink node_write lock coverage - to avoid race between checkpoint and quota file writeback, it just needs to hold read lock of node_write in writeback path. - node_write lock has covered all LFS data write paths, it's not necessary, we only need to hold node_write lock at write path of quota file. This refactors commit `ca7f76e680` ("f2fs: fix wrong discard space"). Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:44 -07:00
Chao Yu	0ef818335f	f2fs: add prefix for exported symbols to avoid polluting global symbol namespace. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-07-07 21:51:43 -07:00
yangerkun	2e98c01846	cifs: remove the retry in cifs_poxis_lock_set The caller of cifs_posix_lock_set will do retry(like fcntl_setlk64->do_lock_file_wait) if we will wait for any file_lock. So the retry in cifs_poxis_lock_set seems duplicated, remove it to make a cleanup. Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: NeilBrown <neilb@suse.de>	2020-07-07 23:51:16 -05:00
Steve French	4ef9b4f1a7	smb3: fix access denied on change notify request to some servers read permission, not just read attributes permission, is required on the directory. See MS-SMB2 (protocol specification) section 3.3.5.19. Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org> # v5.6+ Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2020-07-07 18:24:39 -05:00
Andreas Gruenbacher	20f829999c	gfs2: Rework read and page fault locking So far, gfs2 has taken the inode glocks inside the ->readpage and ->readahead address space operations. Since commit `d4388340ae` ("fs: convert mpage_readpages to mpage_readahead"), gfs2_readahead is passed the pages to read ahead locked. With that, the current holder of the inode glock may be trying to lock one of those pages while gfs2_readahead is trying to take the inode glock, resulting in a deadlock. Fix that by moving the lock taking to the higher-level ->read_iter file and ->fault vm operations. This also gets rid of an ugly lock inversion workaround in gfs2_readpage. The cache consistency model of filesystems like gfs2 is such that if data is found in the page cache, the data is up to date and can be used without taking any filesystem locks. If a page is not cached, filesystem locks must be taken before populating the page cache. To avoid taking the inode glock when the data is already cached, gfs2_file_read_iter first tries to read the data with the IOCB_NOIO flag set. If that fails, the inode glock is taken and the operation is retried with the IOCB_NOIO flag cleared. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-07-07 23:40:12 +02:00
Linus Torvalds	aa27b32b76	for-5.8-rc4-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl8EdTkACgkQxWXV+ddt WDv6xA/9Hguo/k6oj/7Nl9n3UUZ7gp44R/jy37fhMuNcwuEDuqIEfAgGXupdJVaj pYDorUMRUQfI2yLB1iHAnPgBMKBidSroDsdrRHKuimnhABSO2/KX/KXPianIIRGi wPvqZR04L565LNpRlDQx7OYkJWey7b6xf47UZqDglivnKY1OwCJlXgfCj/9FApr0 Y+PVlgEU78ExTeAHs/h8ofZ/f5T2eqiluBSFVykzCg1NngaQVOKpN3gnWEatUAvM ekm6U4E1ZR9oOprdhlf6V96ztGzVTRKB1vFIeCvJLqLNIe+0pxlRfRn2aOj8vzEO DRjgOlhyAIgypp78SwCspjhvejvVneSFdEGSVvHOw1ombB//OJ1qBb5G/lIcwCj3 PZ3OnQJV7+/Ty7Xt/X26W841zvnu90K0di0CsOPehtbkgkR4txgHCJB9mSlsMugN awN5Ryy1rw1cAM5GspXG9EEOvJmnSizQf4BcK649IG5eUKThYYLc5mp68jiMljs0 NHFPg5P4yTRjk7Yqgxq5VvTPLLJo5j5xxqtY/1zDWuguRa40wIoy/JUJaJoPg9Vd 221/qRG4R4xGyZXGx6XTiWK+M3qjTlS9My9tGoWygwlExRkr7Uli9Ikef3U0tBoF bjTcfCNOuCp+JECHNcnMZ9fhhFaMwIL1V4OflB1iicBAtXxo8Lk= =+4BZ -----END PGP SIGNATURE----- Merge tag 'for-5.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - regression fix of a leak in global block reserve accounting - fix a (hard to hit) race of readahead vs releasepage that could lead to crash - convert all remaining uses of comment fall through annotations to the pseudo keyword - fix crash when mounting a fuzzed image with -o recovery * tag 'for-5.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: reset tree root pointer after error in init_tree_roots btrfs: fix reclaim_size counter leak after stealing from global reserve btrfs: fix fatal extent_buffer readahead vs releasepage race btrfs: convert comments to fallthrough annotations	2020-07-07 14:10:33 -07:00
Pavel Begunkov	b2edc0a77f	io_uring: don't burn CPU for iopoll on exit First of all don't spin in io_ring_ctx_wait_and_kill() on iopoll. Requests won't complete faster because of that, but only lengthen io_uring_release(). The same goes for offloaded cleanup in io_ring_exit_work() -- it already has waiting loop, don't do blocking active spinning. For that, pass min=0 into io_iopoll_[try_]reap_events(), so it won't actively spin. Leave the function if io_do_iopoll() there can't complete a request to sleep in io_ring_exit_work(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-07 12:00:03 -06:00
Pavel Begunkov	7668b92a69	io_uring: remove nr_events arg from iopoll_check() Nobody checks io_iopoll_check()'s output parameter @nr_events. Remove the parameter and declare it further down the stack. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-07 12:00:03 -06:00
Pavel Begunkov	9dedd56301	io_uring: partially inline io_iopoll_getevents() io_iopoll_reap_events() doesn't care about returned valued of io_iopoll_getevents() and does the same checks for list emptiness and need_resched(). Just use io_do_iopoll(). io_sq_thread() doesn't check return value as well. It also passes min=0, so there never be the second iteration inside io_poll_getevents(). Inline it there too. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-07 12:00:03 -06:00
Darrick J. Wong	2fb94e36b6	xfs: rtbitmap scrubber should check inode size Make sure the rtbitmap is large enough to store the entire bitmap. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com>	2020-07-07 07:15:09 -07:00
Darrick J. Wong	f866560be2	xfs: rtbitmap scrubber should verify written extents Ensure that the realtime bitmap file is backed entirely by written extents. No holes, no unwritten blocks, etc. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com>	2020-07-07 07:15:09 -07:00
Dave Chinner	e2705b0304	xfs: remove xfs_inobp_check() This debug code is called on every xfs_iflush() call, which then checks every inode in the buffer for non-zero unlinked list field. Hence it checks every inode in the cluster buffer every time a single inode on that cluster it flushed. This is resulting in: - 38.91% 5.33% [kernel] [k] xfs_iflush - 17.70% xfs_iflush - 9.93% xfs_inobp_check 4.36% xfs_buf_offset 10% of the CPU time spent flushing inodes is repeatedly checking unlinked fields in the buffer. We don't need to do this. The other place we call xfs_inobp_check() is xfs_iunlink_update_dinode(), and this is after we've done this assert for the agino we are about to write into that inode: ASSERT(xfs_verify_agino_or_null(mp, agno, next_agino)); which means we've already checked that the agino we are about to write is not 0 on debug kernels. The inode buffer verifiers do everything else we need, so let's just remove this debug code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:09 -07:00
Dave Chinner	a69a1dc284	xfs: factor xfs_iflush_done xfs_iflush_done() does 3 distinct operations to the inodes attached to the buffer. Separate these operations out into functions so that it is easier to modify these operations independently in future. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:09 -07:00
Dave Chinner	5717ea4d52	xfs: rework xfs_iflush_cluster() dirty inode iteration Now that we have all the dirty inodes attached to the cluster buffer, we don't actually have to do radix tree lookups to find them. Sure, the radix tree is efficient, but walking a linked list of just the dirty inodes attached to the buffer is much better. We are also no longer dependent on having a locked inode passed into the function to determine where to start the lookup. This means we can drop it from the function call and treat all inodes the same. We also make xfs_iflush_cluster skip inodes marked with XFS_IRECLAIM. This we avoid races with inodes that reclaim is actively referencing or are being re-initialised by inode lookup. If they are actually dirty, they'll get written by a future cluster flush.... We also add a shutdown check after obtaining the flush lock so that we catch inodes that are dirty in memory and may have inconsistent state due to the shutdown in progress. We abort these inodes directly and so they remove themselves directly from the buffer list and the AIL rather than having to wait for the buffer to be failed and callbacks run to be processed correctly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:09 -07:00
Dave Chinner	e6187b3444	xfs: rename xfs_iflush_int() with xfs_iflush() gone, we can rename xfs_iflush_int() back to xfs_iflush(). Also move it up above xfs_iflush_cluster() so we don't need the forward definition any more. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:08 -07:00
Dave Chinner	90c60e1640	xfs: xfs_iflush() is no longer necessary Now we have a cached buffer on inode log items, we don't need to do buffer lookups when flushing inodes anymore - all we need to do is lock the buffer and we are ready to go. This largely gets rid of the need for xfs_iflush(), which is essentially just a mechanism to look up the buffer and flush the inode to it. Instead, we can just call xfs_iflush_cluster() with a few modifications to ensure it also flushes the inode we already hold locked. This allows the AIL inode item pushing to be almost entirely non-blocking in XFS - we won't block unless memory allocation for the cluster inode lookup blocks or the block device queues are full. Writeback during inode reclaim becomes a little more complex because we now have to lock the buffer ourselves, but otherwise this change is largely a functional no-op that removes a whole lot of code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:08 -07:00
Dave Chinner	48d55e2ae3	xfs: attach inodes to the cluster buffer when dirtied Rather than attach inodes to the cluster buffer just when we are doing IO, attach the inodes to the cluster buffer when they are dirtied. The means the buffer always carries a list of dirty inodes that reference it, and we can use that list to make more fundamental changes to inode writeback that aren't otherwise possible. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:08 -07:00
Dave Chinner	71e3e35646	xfs: rework stale inodes in xfs_ifree_cluster Once we have inodes pinning the cluster buffer and attached whenever they are dirty, we no longer have a guarantee that the items are flush locked when we lock the cluster buffer. Hence we cannot just walk the buffer log item list and modify the attached inodes. If the inode is not flush locked, we have to ILOCK it first and then flush lock it to do all the prerequisite checks needed to avoid races with other code. This is already handled by xfs_ifree_get_one_inode(), so rework the inode iteration loop and function to update all inodes in cache whether they are attached to the buffer or not. Note: we also remove the copying of the log item lsn to the ili_flush_lsn as xfs_iflush_done() now uses the XFS_ISTALE flag to trigger aborts and so flush lsn matching is not needed in IO completion for processing freed inodes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:08 -07:00
Dave Chinner	02511a5a6a	xfs: clean up inode reclaim comments Inode reclaim is quite different now to the way described in various comments, so update all the comments explaining what it does and how it works. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:08 -07:00
Dave Chinner	4d0bab3a44	xfs: remove SYNC_WAIT from xfs_reclaim_inodes() Clean up xfs_reclaim_inodes() callers. Most callers want blocking behaviour, so just make the existing SYNC_WAIT behaviour the default. For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag() directly because we just want optimistic clean inode reclaim to be done in the background. For xfs_quiesce_attr() we can just remove the inode reclaim calls as they are a historic relic that was required to flush dirty inodes that contained unlogged changes. We now log all changes to the inodes, so the sync AIL push from xfs_log_quiesce() called by xfs_quiesce_attr() will do all the required inode writeback for freeze. Seeing as we now want to loop until all reclaimable inodes have been reclaimed, make xfs_reclaim_inodes() loop on the XFS_ICI_RECLAIM_TAG tag rather than having xfs_reclaim_inodes_ag() tell it that inodes were skipped. This is much more reliable and will always loop until all reclaimable inodes are reclaimed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:08 -07:00
Dave Chinner	50718b8d73	xfs: remove SYNC_TRYLOCK from inode reclaim All background reclaim is SYNC_TRYLOCK already, and even blocking reclaim (SYNC_WAIT) can use trylock mechanisms as xfs_reclaim_inodes_ag() will keep cycling until there are no more reclaimable inodes. Hence we can kill SYNC_TRYLOCK from inode reclaim and make everything unconditionally non-blocking. We remove all the optimistic "avoid blocking on locks" checks done in xfs_reclaim_inode_grab() as nothing blocks on locks anymore. Further, checking XFS_IFLOCK optimistically can result in detecting inodes in the process of being cleaned (i.e. between being removed from the AIL and having the flush lock dropped), so for xfs_reclaim_inodes() to reliably reclaim all inodes we need to drop these checks anyway. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:08 -07:00
Dave Chinner	9552e14d3e	xfs: don't block inode reclaim on the ILOCK When we attempt to reclaim an inode, the first thing we do is take the inode lock. This is blocking right now, so if the inode being accessed by something else (e.g. being flushed to the cluster buffer) we will block here. Change this to a trylock so that we do not block inode reclaim unnecessarily here. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:08 -07:00
Dave Chinner	0e8e2c6343	xfs: allow multiple reclaimers per AG Inode reclaim will still throttle direct reclaim on the per-ag reclaim locks. This is no longer necessary as reclaim can run non-blocking now. Hence we can remove these locks so that we don't arbitrarily block reclaimers just because there are more direct reclaimers than there are AGs. This can result in multiple reclaimers working on the same range of an AG, but this doesn't cause any apparent issues. Optimising the spread of concurrent reclaimers for best efficiency can be done in a future patchset. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:07 -07:00
Dave Chinner	617825fe34	xfs: remove IO submission from xfs_reclaim_inode() We no longer need to issue IO from shrinker based inode reclaim to prevent spurious OOM killer invocation. This leaves only the global filesystem management operations such as unmount needing to writeback dirty inodes and reclaim them. Instead of using the reclaim pass to write dirty inodes before reclaiming them, use the AIL to push all the dirty inodes before we try to reclaim them. This allows us to remove all the conditional SYNC_WAIT locking and the writeback code from xfs_reclaim_inode() and greatly simplify the checks we need to do to reclaim an inode. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:07 -07:00
Dave Chinner	993f951f50	xfs: make inode reclaim almost non-blocking Now that dirty inode writeback doesn't cause read-modify-write cycles on the inode cluster buffer under memory pressure, the need to throttle memory reclaim to the rate at which we can clean dirty inodes goes away. That is due to the fact that we no longer thrash inode cluster buffers under memory pressure to clean dirty inodes. This means inode writeback no longer stalls on memory allocation or read IO, and hence can be done asynchronously without generating memory pressure. As a result, blocking inode writeback in reclaim is no longer necessary to prevent reclaim priority windup as cleaning dirty inodes is no longer dependent on having memory reserves available for the filesystem to make progress reclaiming inodes. Hence we can convert inode reclaim to be non-blocking for shrinker callouts, both for direct reclaim and kswapd. On a vanilla kernel, running a 16-way fsmark create workload on a 4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via userspace mlock(). The OOM killer gets invoked at 15GB of pinned RAM. Without the inode cluster pinning, this non-blocking reclaim patch triggers premature OOM killer invocation with the same memory pinning, sometimes with as much as 45% of RAM being free. It's trivially easy to trigger the OOM killer when reclaim does not block. With pinning inode clusters in RAM and then adding this patch, I can reliably pin 14.5GB of RAM and still have the fsmark workload run to completion. The OOM killer gets invoked 14.75GB of pinned RAM, which is only a small amount of memory less than the vanilla kernel. It is much more reliable than just with async reclaim alone. simoops shows that allocation stalls go away when async reclaim is used. Vanilla kernel: Run time: 1924 seconds Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792) Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936) Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640) work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70) alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02) With inode cluster pinning and async reclaim: Run time: 1924 seconds Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216) Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504) Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256) work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34) alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03) Latencies don't really change much, nor does the work rate. However, allocation almost never stalls with these changes, whilst the vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample period. This difference is due to inode reclaim being largely non-blocking now. IOWs, once we have pinned inode cluster buffers, we can make inode reclaim non-blocking without a major risk of premature and/or spurious OOM killer invocation, and without any changes to memory reclaim infrastructure. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:07 -07:00
Dave Chinner	298f7bec50	xfs: pin inode backing buffer to the inode log item When we dirty an inode, we are going to have to write it disk at some point in the near future. This requires the inode cluster backing buffer to be present in memory. Unfortunately, under severe memory pressure we can reclaim the inode backing buffer while the inode is dirty in memory, resulting in stalling the AIL pushing because it has to do a read-modify-write cycle on the cluster buffer. When we have no memory available, the read of the cluster buffer blocks the AIL pushing process, and this causes all sorts of issues for memory reclaim as it requires inode writeback to make forwards progress. Allocating a cluster buffer causes more memory pressure, and results in more cluster buffers to be reclaimed, resulting in more RMW cycles to be done in the AIL context and everything then backs up on AIL progress. Only the synchronous inode cluster writeback in the the inode reclaim code provides some level of forwards progress guarantees that prevent OOM-killer rampages in this situation. Fix this by pinning the inode backing buffer to the inode log item when the inode is first dirtied (i.e. in xfs_trans_log_inode()). This may mean the first modification of an inode that has been held in cache for a long time may block on a cluster buffer read, but we can do that in transaction context and block safely until the buffer has been allocated and read. Once we have the cluster buffer, the inode log item takes a reference to it, pinning it in memory, and attaches it to the log item for future reference. This means we can always grab the cluster buffer from the inode log item when we need it. When the inode is finally cleaned and removed from the AIL, we can drop the reference the inode log item holds on the cluster buffer. Once all inodes on the cluster buffer are clean, the cluster buffer will be unpinned and it will be available for memory reclaim to reclaim again. This avoids the issues with needing to do RMW cycles in the AIL pushing context, and hence allows complete non-blocking inode flushing to be performed by the AIL pushing context. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:07 -07:00
Dave Chinner	e98084b8be	xfs: move xfs_clear_li_failed out of xfs_ail_delete_one() xfs_ail_delete_one() is called directly from dquot and inode IO completion, as well as from the generic xfs_trans_ail_delete() function. Inodes are about to have their own failure handling, and dquots will in future, too. Pull the clearing of the LI_FAILED flag up into the callers so we can customise the code appropriately. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:07 -07:00
Dave Chinner	3536b61e74	xfs: unwind log item error flagging When an buffer IO error occurs, we want to mark all the log items attached to the buffer as failed. Open code the error handling loop so that we can modify the flagging for the different types of objects directly and independently of each other. This also allows us to remove the ->iop_error method from the log item operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:07 -07:00
Dave Chinner	428947e9d5	xfs: handle buffer log item IO errors directly Currently when a buffer with attached log items has an IO error it called ->iop_error for each attched log item. These all call xfs_set_li_failed() to handle the error, but we are about to change the way log items manage buffers. hence we first need to remove the per-item dependency on buffer handling done by xfs_set_li_failed(). We already have specific buffer type IO completion routines, so move the log item error handling out of the generic error handling and into the log item specific functions so we can implement per-type error handling easily. This requires a more complex return value from the error handling code so that we can take the correct action the failure handling requires. This results in some repeated boilerplate in the functions, but that can be cleaned up later once all the changes cascade through this code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:07 -07:00
Dave Chinner	2ef3f7f5db	xfs: get rid of log item callbacks They are not used anymore, so remove them from the log item and the buffer iodone attachment interfaces. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:07 -07:00
Dave Chinner	fec671cd35	xfs: clean up the buffer iodone callback functions Now that we've sorted inode and dquot buffers, we can apply the same cleanups to dirty buffers with buffer log items. They only have one callback, too, so we don't need the log item callback. Collapse the iodone functions and remove all the now unnecessary infrastructure around callback processing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-07 07:15:07 -07:00
Qu Wenruo	04e484c597	btrfs: discard: add missing put when grabbing block group from unused list [BUG] The following small test script can trigger ASSERT() at unmount time: mkfs.btrfs -f $dev mount $dev $mnt mount -o remount,discard=async $mnt umount $mnt The call trace: assertion failed: atomic_read(&block_group->count) == 1, in fs/btrfs/block-group.c:3431 ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3204! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 4 PID: 10389 Comm: umount Tainted: G O 5.8.0-rc3-custom+ #68 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: btrfs_free_block_groups.cold+0x22/0x55 [btrfs] close_ctree+0x2cb/0x323 [btrfs] btrfs_put_super+0x15/0x17 [btrfs] generic_shutdown_super+0x72/0x110 kill_anon_super+0x18/0x30 btrfs_kill_super+0x17/0x30 [btrfs] deactivate_locked_super+0x3b/0xa0 deactivate_super+0x40/0x50 cleanup_mnt+0x135/0x190 __cleanup_mnt+0x12/0x20 task_work_run+0x64/0xb0 __prepare_exit_to_usermode+0x1bc/0x1c0 __syscall_return_slowpath+0x47/0x230 do_syscall_64+0x64/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The code: ASSERT(atomic_read(&block_group->count) == 1); btrfs_put_block_group(block_group); [CAUSE] Obviously it's some btrfs_get_block_group() call doesn't get its put call. The offending btrfs_get_block_group() happens here: void btrfs_mark_bg_unused(struct btrfs_block_group bg) { if (list_empty(&bg->bg_list)) { btrfs_get_block_group(bg); list_add_tail(&bg->bg_list, &fs_info->unused_bgs); } } So every call sites removing the block group from unused_bgs list should reduce the ref count of that block group. However for async discard, it didn't follow the call convention: void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info fs_info) { list_for_each_entry_safe(block_group, next, &fs_info->unused_bgs, bg_list) { list_del_init(&block_group->bg_list); btrfs_discard_queue_work(&fs_info->discard_ctl, block_group); } } And in btrfs_discard_queue_work(), it doesn't call btrfs_put_block_group() either. [FIX] Fix the problem by reducing the reference count when we grab the block group from unused_bgs list. Reported-by: Marcos Paulo de Souza <mpdesouza@suse.com> Fixes: `6e80d4f8c4` ("btrfs: handle empty block_group removal for async discard") CC: stable@vger.kernel.org # 5.6+ Tested-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-07 16:06:28 +02:00
Matteo Croce	fd49e03280	pstore: Fix linking when crypto API disabled When building a kernel with CONFIG_PSTORE=y and CONFIG_CRYPTO not set, a build error happens: ld: fs/pstore/platform.o: in function `pstore_dump': platform.c:(.text+0x3f9): undefined reference to `crypto_comp_compress' ld: fs/pstore/platform.o: in function `pstore_get_backend_records': platform.c:(.text+0x784): undefined reference to `crypto_comp_decompress' This because some pstore code uses crypto_comp_(de)compress regardless of the CONFIG_CRYPTO status. Fix it by wrapping the (de)compress usage by IS_ENABLED(CONFIG_PSTORE_COMPRESS) Signed-off-by: Matteo Croce <mcroce@linux.microsoft.com> Link: https://lore.kernel.org/lkml/20200706234045.9516-1-mcroce@linux.microsoft.com Fixes: `cb3bee0369` ("pstore: Use crypto compress API") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2020-07-06 19:42:31 -07:00
Andreas Gruenbacher	856473cd5d	iomap: Make sure iomap_end is called after iomap_begin Make sure iomap_end is always called when iomap_begin succeeds. Without this fix, iomap_end won't be called when a filesystem's iomap_begin operation returns an invalid mapping, bypassing any unlocking done in iomap_end. With this fix, the unlocking will still happen. This bug was found by Bob Peterson during code review. It's unlikely that such iomap_begin bugs will survive to affect users, so backporting this fix seems unnecessary. Fixes: `ae259a9c85` ("fs: introduce iomap infrastructure") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-06 10:49:27 -07:00
Dave Chinner	6f5de1808e	xfs: use direct calls for dquot IO completion Similar to inodes, we can call the dquot IO completion functions directly from the buffer completion code, removing another user of log item callbacks for IO completion processing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-06 10:46:59 -07:00
Dave Chinner	aac855ab1a	xfs: make inode IO completion buffer centric Having different io completion callbacks for different inode states makes things complex. We can detect if the inode is stale via the XFS_ISTALE flag in IO completion, so we don't need a special callback just for this. This means inodes only have a single iodone callback, and inode IO completion is entirely buffer centric at this point. Hence we no longer need to use a log item callback at all as we can just call xfs_iflush_done() directly from the buffer completions and walk the buffer log item list to complete the all inodes under IO. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-06 10:46:59 -07:00
Dave Chinner	a7e134ef37	xfs: clean up whacky buffer log item list reinit When we've emptied the buffer log item list, it does a list_del_init on itself to reset it's pointers to itself. This is unnecessary as the list is already empty at this point - it was a left-over fragment from the list_head conversion of the buffer log item list. Remove them. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-06 10:46:59 -07:00
Dave Chinner	b01d1461ae	xfs: call xfs_buf_iodone directly All unmarked dirty buffers should be in the AIL and have log items attached to them. Hence when they are written, we will run a callback to remove the item from the AIL if appropriate. Now that we've handled inode and dquot buffers, all remaining calls are to xfs_buf_iodone() and so we can hard code this rather than use an indirect call. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-06 10:46:58 -07:00
Dave Chinner	9fe5c77cbe	xfs: mark log recovery buffers for completion Log recovery has it's own buffer write completion handler for buffers that it directly recovers. Convert these to direct calls by flagging these buffers as being log recovery buffers. The flag will get cleared by the log recovery IO completion routine, so it will never leak out of log recovery. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-06 10:46:58 -07:00
Dave Chinner	0c7e5afbea	xfs: mark dquot buffers in cache dquot buffers always have write IO callbacks, so by marking them directly we can avoid needing to attach ->b_iodone functions to them. This avoids an indirect call, and makes future modifications much simpler. This is largely a rearrangement of the code at this point - no IO completion functionality changes at this point, just how the code is run is modified. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-06 10:46:58 -07:00
Dave Chinner	f593bf144c	xfs: mark inode buffers in cache Inode buffers always have write IO callbacks, so by marking them directly we can avoid needing to attach ->b_iodone functions to them. This avoids an indirect call, and makes future modifications much simpler. While this is largely a refactor of existing functionality, we broaden the scope of the flag to beyond where inodes are explicitly attached because future changes need to know what type of log items are attached to the buffer. Adding this buffer flag may invoke the inode iodone callback in cases where it wouldn't have been previously, but this is not a functional change because the callback is identical to the normal buffer write iodone callback when inodes are not attached. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-06 10:46:58 -07:00
Dave Chinner	1319ebefd6	xfs: add an inode item lock The inode log item is kind of special in that it can be aggregating new changes in memory at the same time time existing changes are being written back to disk. This means there are fields in the log item that are accessed concurrently from contexts that don't share any locking at all. e.g. updating ili_last_fields occurs at flush time under the ILOCK_EXCL and flush lock at flush time, under the flush lock at IO completion time, and is read under the ILOCK_EXCL when the inode is logged. Hence there is no actual serialisation between reading the field during logging of the inode in transactions vs clearing the field in IO completion. We currently get away with this by the fact that we are only clearing fields in IO completion, and nothing bad happens if we accidentally log more of the inode than we actually modify. Worst case is we consume a tiny bit more memory and log bandwidth. However, if we want to do more complex state manipulations on the log item that requires updates at all three of these potential locations, we need to have some mechanism of serialising those operations. To do this, introduce a spinlock into the log item to serialise internal state. This could be done via the xfs_inode i_flags_lock, but this then leads to potential lock inversion issues where inode flag updates need to occur inside locks that best nest inside the inode log item locks (e.g. marking inodes stale during inode cluster freeing). Using a separate spinlock avoids these sorts of problems and simplifies future code. This does not touch the use of ili_fields in the item formatting code - that is entirely protected by the ILOCK_EXCL at this point in time, so it remains untouched. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-06 10:46:58 -07:00
Dave Chinner	1dfde687a6	xfs: remove logged flag from inode log item This was used to track if the item had logged fields being flushed to disk. We log everything in the inode these days, so this logic is no longer needed. Remove it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-06 10:46:58 -07:00
Dave Chinner	96355d5a1f	xfs: Don't allow logging of XFS_ISTALE inodes In tracking down a problem in this patchset, I discovered we are reclaiming dirty stale inodes. This wasn't discovered until inodes were always attached to the cluster buffer and then the rcu callback that freed inodes was assert failing because the inode still had an active pointer to the cluster buffer after it had been reclaimed. Debugging the issue indicated that this was a pre-existing issue resulting from the way the inodes are handled in xfs_inactive_ifree. When we free a cluster buffer from xfs_ifree_cluster, all the inodes in cache are marked XFS_ISTALE. Those that are clean have nothing else done to them and so eventually get cleaned up by background reclaim. i.e. it is assumed we'll never dirty/relog an inode marked XFS_ISTALE. On journal commit dirty stale inodes as are handled by both buffer and inode log items to run though xfs_istale_done() and removed from the AIL (buffer log item commit) or the log item will simply unpin it because the buffer log item will clean it. What happens to any specific inode is entirely dependent on which log item wins the commit race, but the result is the same - stale inodes are clean, not attached to the cluster buffer, and not in the AIL. Hence inode reclaim can just free these inodes without further care. However, if the stale inode is relogged, it gets dirtied again and relogged into the CIL. Most of the time this isn't an issue, because relogging simply changes the inode's location in the current checkpoint. Problems arise, however, when the CIL checkpoints between two transactions in the xfs_inactive_ifree() deferops processing. This results in the XFS_ISTALE inode being redirtied and inserted into the CIL without any of the other stale cluster buffer infrastructure being in place. Hence on journal commit, it simply gets unpinned, so it remains dirty in memory. Everything in inode writeback avoids XFS_ISTALE inodes so it can't be written back, and it is not tracked in the AIL so there's not even a trigger to attempt to clean the inode. Hence the inode just sits dirty in memory until inode reclaim comes along, sees that it is XFS_ISTALE, and goes to reclaim it. This reclaiming of a dirty inode caused use after free, list corruptions and other nasty issues later in this patchset. Hence this patch addresses a violation of the "never log XFS_ISTALE inodes" caused by the deferops processing rolling a transaction and relogging a stale inode in xfs_inactive_free. It also adds a bunch of asserts to catch this problem in debug kernels so that we don't reintroduce this problem in future. Reproducer for this issue was generic/558 on a v4 filesystem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-06 10:46:58 -07:00
Yafang Shao	0d5a57140b	xfs: remove useless definitions in xfs_linux.h Remove current_pid(), current_test_flags() and current_clear_flags_nested(), because they are useless. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-06 10:46:58 -07:00
Dave Chinner	cd647d5651	xfs: use MMAPLOCK around filemap_map_pages() The page faultround path ->map_pages is implemented in XFS via filemap_map_pages(). This function checks that pages found in page cache lookups have not raced with truncate based invalidation by checking page->mapping is correct and page->index is within EOF. However, we've known for a long time that this is not sufficient to protect against races with invalidations done by operations that do not change EOF. e.g. hole punching and other fallocate() based direct extent manipulations. The way we protect against these races is we wrap the page fault operations in a XFS_MMAPLOCK_SHARED lock so they serialise against fallocate and truncate before calling into the filemap function that processes the fault. Do the same for XFS's ->map_pages implementation to close this potential data corruption issue. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-06 10:46:58 -07:00
Darrick J. Wong	e2aaee9cd3	xfs: move helpers that lock and unlock two inodes against userspace IO Move the double-inode locking helpers to xfs_inode.c since they're not specific to reflink. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-07-06 10:46:57 -07:00
Darrick J. Wong	10b4bd6c9c	xfs: refactor locking and unlocking two inodes against userspace IO Refactor the two functions that we use to lock and unlock two inodes to block userspace from initiating IO against a file, whether via system calls or mmap activity. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-07-06 10:46:57 -07:00
Darrick J. Wong	451d34ee07	xfs: fix xfs_reflink_remap_prep calling conventions Fix the return value of xfs_reflink_remap_prep so that its return value conventions match the rest of xfs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-07-06 10:46:57 -07:00
Darrick J. Wong	168eae803c	xfs: reflink can skip remap existing mappings If the source and destination map are identical, we can skip the remap step to save some time. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-07-06 10:46:57 -07:00
Darrick J. Wong	94b941fd7a	xfs: only reserve quota blocks if we're mapping into a hole When logging quota block count updates during a reflink operation, we only log the /delta/ of the block count changes to the dquot. Since we now know ahead of time the extent type of both dmap and smap (and that they have the same length), we know that we only need to reserve quota blocks for dmap's blockcount if we're mapping it into a hole. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-07-06 10:46:57 -07:00
Darrick J. Wong	aa5d0ba0b5	xfs: only reserve quota blocks for bmbt changes if we're changing the data fork Now that we've reworked xfs_reflink_remap_extent to remap only one extent per transaction, we actually know if the extent being removed is an allocated mapping. This means that we now know ahead of time if we're going to be touching the data fork. Since we only need blocks for a bmbt split if we're going to update the data fork, we only need to get quota reservation if we know we're going to touch the data fork. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-07-06 10:46:57 -07:00
Darrick J. Wong	00fd1d56dd	xfs: redesign the reflink remap loop to fix blkres depletion crash The existing reflink remapping loop has some structural problems that need addressing: The biggest problem is that we create one transaction for each extent in the source file without accounting for the number of mappings there are for the same range in the destination file. In other words, we don't know the number of remap operations that will be necessary and we therefore cannot guess the block reservation required. On highly fragmented filesystems (e.g. ones with active dedupe) we guess wrong, run out of block reservation, and fail. The second problem is that we don't actually use the bmap intents to their full potential -- instead of calling bunmapi directly and having to deal with its backwards operation, we could call the deferred ops xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead. This makes the frontend loop much simpler. Solve all of these problems by refactoring the remapping loops so that we only perform one remapping operation per transaction, and each operation only tries to remap a single extent from source to dest. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reported-by: Edwin Török <edwin@etorok.net> Tested-by: Edwin Török <edwin@etorok.net>	2020-07-06 10:46:57 -07:00
Darrick J. Wong	877f58f536	xfs: rename xfs_bmap_is_real_extent to is_written_extent The name of this predicate is a little misleading -- it decides if the extent mapping is allocated and written. Change the name to be more direct, as we're going to add a new predicate in the next patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-07-06 10:46:57 -07:00
Darrick J. Wong	83895227ab	xfs: fix reflink quota reservation accounting error Quota reservations are supposed to account for the blocks that might be allocated due to a bmap btree split. Reflink doesn't do this, so fix this to make the quota accounting more accurate before we start rearranging things. Fixes: `862bb360ef` ("xfs: reflink extents from one file to another") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-07-06 10:46:56 -07:00
Darrick J. Wong	eb0efe5063	xfs: don't eat an EIO/ENOSPC writeback error when scrubbing data fork The data fork scrubber calls filemap_write_and_wait to flush dirty pages and delalloc reservations out to disk prior to checking the data fork's extent mappings. Unfortunately, this means that scrub can consume the EIO/ENOSPC errors that would otherwise have stayed around in the address space until (we hope) the writer application calls fsync to persist data and collect errors. The end result is that programs that wrote to a file might never see the error code and proceed as if nothing were wrong. xfs_scrub is not in a position to notify file writers about the writeback failure, and it's only here to check metadata, not file contents. Therefore, if writeback fails, we should stuff the error code back into the address space so that an fsync by the writer application can pick that up. Fixes: `99d9d8d05d` ("xfs: scrub inode block mappings") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2020-07-06 10:46:56 -07:00
Brian Foster	f74681ba20	xfs: preserve rmapbt swapext block reservation from freed blocks The rmapbt extent swap algorithm remaps individual extents between the source inode and the target to trigger reverse mapping metadata updates. If either inode straddles a format or other bmap allocation boundary, the individual unmap and map cycles can trigger repeated bmap block allocations and frees as the extent count bounces back and forth across the boundary. While net block usage is bound across the swap operation, this behavior can prematurely exhaust the transaction block reservation because it continuously drains as the transaction rolls. Each allocation accounts against the reservation and each free returns to global free space on transaction roll. The previous workaround to this problem attempted to detect this boundary condition and provide surplus block reservation to acommodate it. This is insufficient because more remaps can occur than implied by the extent counts; if start offset boundaries are not aligned between the two inodes, for example. To address this problem more generically and dynamically, add a transaction accounting mode that returns freed blocks to the transaction reservation instead of the superblock counters on transaction roll and use it when the rmapbt based algorithm is active. This allows the chain of remap transactions to preserve the block reservation based own its own frees and prevent premature exhaustion regardless of the remap pattern. Note that this is only safe for superblocks with lazy sb accounting, but the latter is required for v5 supers and the rmap feature depends on v5. Fixes: `b3fed43482` ("xfs: account format bouncing into rmapbt swapext tx reservation") Root-caused-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-06 10:46:56 -07:00
Keyur Patel	06734e3c95	xfs: Couple of typo fixes in comments ./xfs/libxfs/xfs_inode_buf.c:56: unnecssary ==> unnecessary ./xfs/libxfs/xfs_inode_buf.c:59: behavour ==> behaviour ./xfs/libxfs/xfs_inode_buf.c:206: unitialized ==> uninitialized Signed-off-by: Keyur Patel <iamkeyur96@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-07-06 10:46:56 -07:00
Pavel Begunkov	3fcee5a6d5	io_uring: briefly loose locks while reaping events It's not nice to hold @uring_lock for too long io_iopoll_reap_events(). For instance, the lock is needed to publish requests to @poll_list, and that locks out tasks doing that for no good reason. Loose it occasionally. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-06 09:06:20 -06:00
Pavel Begunkov	eba0a4dd2a	io_uring: fix stopping iopoll'ing too early Nobody adjusts *nr_events (number of completed requests) before calling io_iopoll_getevents(), so the passed @min shouldn't be adjusted as well. Othewise it can return less than initially asked @min without hitting need_resched(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-06 09:06:20 -06:00
Pavel Begunkov	3aadc23e60	io_uring: don't delay iopoll'ed req completion ->iopoll() may have completed current request, but instead of reaping it, io_do_iopoll() just continues with the next request in the list. As a result it can leave just polled and completed request in the list up until next syscall. Even outer loop in io_iopoll_getevents() doesn't help the situation. E.g. poll_list: req0 -> req1 If req0->iopoll() completed both requests, and @min<=1, then @req0 will be left behind. Check whether a req was completed after ->iopoll(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-06 09:06:20 -06:00
Pavel Begunkov	8b3656af2a	io_uring: fix lost cqe->flags Don't forget to fill cqe->flags properly in io_submit_flush_completions() Fixes: `a1d7c393c4` ("io_uring: enable READ/WRITE to use deferred completions") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-05 15:07:50 -06:00
Pavel Begunkov	652532ad45	io_uring: keep queue_sqe()'s fail path separately A preparation path, extracts error path into a separate block. It looks saner then calling req_set_fail_links() after io_put_req_find_next(), even though it have been working well. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-05 15:07:37 -06:00
Pavel Begunkov	6df1db6b54	io_uring: fix mis-refcounting linked timeouts io_prep_linked_timeout() sets REQ_F_LINK_TIMEOUT altering refcounting of the following linked request. After that someone should call io_queue_linked_timeout(), otherwise a submission reference of the linked timeout won't be ever dropped. That's what happens in io_steal_work() if io-wq decides to postpone linked request with io_wqe_enqueue(). io_queue_linked_timeout() can also be potentially called twice without synchronisation during re-submission, e.g. io_rw_resubmit(). There are the rules, whoever did io_prep_linked_timeout() must also call io_queue_linked_timeout(). To not do it twice, io_prep_linked_timeout() will return non NULL only for the first call. That's controlled by REQ_F_LINK_TIMEOUT flag. Also kill REQ_F_QUEUE_TIMEOUT. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-05 15:07:35 -06:00
Jens Axboe	c2c4c83c58	io_uring: use new io_req_task_work_add() helper throughout Since we now have that in the 5.9 branch, convert the existing users of task_work_add() to use this new helper. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-05 15:07:31 -06:00
Jens Axboe	4c6e277c4c	io_uring: abstract out task work running Provide a helper to run task_work instead of checking and running manually in a bunch of different spots. While doing so, also move the task run state setting where we run the task work. Then we can move it out of the callback helpers. This also helps ensure we only do this once per task_work list run, not per task_work item. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-05 15:05:22 -06:00
Jens Axboe	58c6a581de	Merge branch 'io_uring-5.8' into for-5.9/io_uring Pull in task_work changes from the 5.8 series, as we'll need to apply the same kind of changes to other parts in the 5.9 branch. * io_uring-5.8: io_uring: fix regression with always ignoring signals in io_cqring_wait() io_uring: use signal based task_work running task_work: teach task_work_add() to do signal_wake_up()	2020-07-05 15:04:17 -06:00
Alexander A. Klimov	cba22b1c59	Replace HTTP links with HTTPS ones: CIFS Rationale: Reduces attack surface on kernel devs opening the links for MITM as HTTPS traffic is much harder to manipulate. Deterministic algorithm: For each file: If not .svg: For each line: If doesn't contain `\bxmlns\b`: For each link, `\bhttp://[^# \t\r\n]*(?:\w\|/)`: If both the HTTP and HTTPS versions return 200 OK and serve the same content: Replace HTTP with HTTPS. Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Link: https://lore.kernel.org/r/20200627103125.71828-1-grandmaster@al2klimov.de Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2020-07-05 14:23:38 -06:00
Linus Torvalds	9fbe565cb7	io_uring-5.8-2020-07-05 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8BDx4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvhzD/4rxzJsn6ukrsxMXFaKIrjZ/hkcRJIMNozz YWu4PwcDvszvZu66MeAu0tnCttzxlIgP8oCm6cx9ImMQwkYIVbV0q1XJ3wmzUQpZ pEDW4j0j8hgcLhfZH9ojUAkTP8TnltakxkrwC6egUvnT0vuKDUy5ISbkl4uxWYpH p4Dq7ASqy8xjtzac/VLTSzBgzhTMSic5NMJY21md9eAaFB1vYBmDyHB3O1bEk4kw pvWGFm7a4qssnAB61SMfq3nWQ9UA0+XX4a+CWEzJIMqj4H6UpjOCQU23X1AlaLJX ILeq26PwoZQF8cS4D83tMnmPWz1LqslBgnUuAGCVLsT7omvhDLM75iFBpMzWglLu No8TlxLZ+Dga04vpjeEptWqSfUS6K879cNJuFGjadBogq06SImIVDHXXTrPhtCGg B9+uFHkOUlIkjM5h2zqdkmhnbf0sWodowIrx7+aL294QVlqnY0uBR9eh6+CSKT+h PhJ+FhN+N6B1dTyryaO5hMjyg0h4ZpvIMT3HBpNXtnRVlUT2+OYN3g5HHt6z//Rp eeJTh7pnY7uT60c8x96kySwQIydXSKBI+7ysLlntgiyvutbzaC5Fq7/f1YTWyNVk zqM/+FuJUsstu0y/GBEDpglpL1+S9VjNcJUDpUMUKwCAkh7TnI/ATo1rn9GiM1n1 SQZ4HcaCYw== =Uawr -----END PGP SIGNATURE----- Merge tag 'io_uring-5.8-2020-07-05' of git://git.kernel.dk/linux-block Pull io_uring fix from Jens Axboe: "Andres reported a regression with the fix that was merged earlier this week, where his setup of using signals to interrupt io_uring CQ waits no longer worked correctly. Fix this, and also limit our use of TWA_SIGNAL to the case where we need it, and continue using TWA_RESUME for task_work as before. Since the original is marked for 5.7 stable, let's flush this one out early" * tag 'io_uring-5.8-2020-07-05' of git://git.kernel.dk/linux-block: io_uring: fix regression with always ignoring signals in io_cqring_wait()	2020-07-05 10:41:33 -07:00
Jens Axboe	b7db41c9e0	io_uring: fix regression with always ignoring signals in io_cqring_wait() When switching to TWA_SIGNAL for task_work notifications, we also made any signal based condition in io_cqring_wait() return -ERESTARTSYS. This breaks applications that rely on using signals to abort someone waiting for events. Check if we have a signal pending because of queued task_work, and repeat the signal check once we've run the task_work. This provides a reliable way of telling the two apart. Additionally, only use TWA_SIGNAL if we are using an eventfd. If not, we don't have the dependency situation described in the original commit, and we can get by with just using TWA_RESUME like we previously did. Fixes: `ce593a6c48` ("io_uring: use signal based task_work running") Cc: stable@vger.kernel.org # v5.7 Reported-by: Andres Freund <andres@anarazel.de> Tested-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-04 13:44:45 -06:00
Eric W. Biederman	25cf336de5	exec: Remove do_execve_file Now that the last callser has been removed remove this code from exec. For anyone thinking of resurrecing do_execve_file please note that the code was buggy in several fundamental ways. - It did not ensure the file it was passed was read-only and that deny_write_access had been called on it. Which subtlely breaks invaniants in exec. - The caller of do_execve_file was expected to hold and put a reference to the file, but an extra reference for use by exec was not taken so that when exec put it's reference to the file an underflow occured on the file reference count. - The point of the interface was so that a pathname did not need to exist. Which breaks pathname based LSMs. Tetsuo Handa originally reported these issues[1]. While it was clear that deny_write_access was missing the fundamental incompatibility with the passed in O_RDWR filehandle was not immediately recognized. All of these issues were fixed by modifying the usermode driver code to have a path, so it did not need this hack. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> [1] https://lore.kernel.org/linux-fsdevel/2a8775b4-1dd5-9d5c-aa42-9872445e0942@i-love.sakura.ne.jp/ v1: https://lkml.kernel.org/r/871rm2f0hi.fsf_-_@x220.int.ebiederm.org v2: https://lkml.kernel.org/r/87lfk54p0m.fsf_-_@x220.int.ebiederm.org Link: https://lkml.kernel.org/r/20200702164140.4468-10-ebiederm@xmission.com Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-07-04 09:35:43 -05:00
Linus Torvalds	8b082a41da	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull sysctl fix from Al Viro: "Another regression fix for sysctl changes this cycle..." * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Call sysctl_head_finish on error	2020-07-03 23:20:14 -07:00
Linus Torvalds	b8e516b367	8 cifs/smb3 fixes, most when specifying the multiuser mount flag, 5 of the fixes for stable. -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl7/s1YACgkQiiy9cAdy T1GhzgwAqARAg1iCUEDjyy7VEZ9HNA3X87GxM7zkid5Fz2WTDlHLBQL6LWZkLODK PIz8IP4V3DoBddN2DGlqIiCZmCMDn2bBN+6u1O2TkR2lv2w3ASxzwYSMQWqUUw6U a03BkDZNFE4fJq5pPDdVaVzDss4tuNKW8N5RvptRqlbLp74SRUgMjVyyWwN4UunW AHH3VqRCWJJj6Yp6MAx3rtoEiAtjTt9Ej3Fb2MXdF5jZObzI3LOY13Z09QIWbE3P Sh7Py66CSG7UYYkQqoe43zYwxeOgo6FAYWxIULTPJYdFIi5+RHPQ0SYc6+BHfDRo AHMchJpwZ/j4JOeIJGDItuUQPVnwYAOZ+75s7ofhAbG95kwcfs+AkDoLqkM8IWpu LS5rHi7sOA4GK8Hio9xp+MgttsmXRcnBQ4ShBoTaDBKa7v/NeRAPolsD5FgZWunO CKRDsDD5hKO2bQsJk4te35/IQpxRpEiiGmMpyaNUdaCdhXxcPHCYEWYdw9EnTP6i 1xc7au/u =laR3 -----END PGP SIGNATURE----- Merge tag '5.8-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "Eight cifs/smb3 fixes, most when specifying the multiuser mount flag. Five of the fixes are for stable" * tag '5.8-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: prevent truncation from long to int in wait_for_free_credits cifs: Fix the target file was deleted when rename failed. SMB3: Honor 'posix' flag for multiuser mounts SMB3: Honor 'handletimeout' flag for multiuser mounts SMB3: Honor lease disabling for multiuser mounts SMB3: Honor persistent/resilient handle flags for multiuser mounts SMB3: Honor 'seal' flag for multiuser mounts cifs: Display local UID details for SMB sessions in DebugData	2020-07-03 23:03:45 -07:00
Linus Torvalds	0c7d7d1fad	Changes for 5.8-rc4: - Fix a use-after-free bug when the fs shuts down. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl77c9oACgkQ+H93GTRK tOvYow/+O61y32kJzJ6xJ5QoGE5K10CXp4cpBwOgG/6POERIDBirgAMqKAx8rEu2 go36qUVzwyveaB4iyuIkw5K+odZbHpGiuQqiGu8Aw4XBEAEhDqPPsnHhllSi/VOq 4AjvfYefmj0ALQay1pzGpR2h5+03JwOw0ZFcmBl5QSTMLwZQZ1PoU8ujiYPSxsUr m9dcGZtGU16mmDgORzYTDnSYSKruhJDSD5IxsID+QST0wK4MuPkJr0T+ZQmziWb1 xmHE1aTGDZcrYG4+x2Pzop822mrnuMBnMaX8KOOtiZAhtKb19sAf0OUWBkWOQYvb Vk3mIDDz910vd/szw3B5KMVNkeYoRtHAQztpLyfJ3Gtxmt5g/4ZX2fEX/KWJbV5D 82GLf4gB4GhTQFUwcTmgtmCCd87aH0ABiCEnURN6R04tOCXgc7bYn0XXvZL6Axd/ 25bkhlDdmOfveAZrZ1WKWSEYN/at9R5iqYsbWH1FmoE6h82OvZDxweB7P4r66KwT pMhYRqKrRrwNtarPmn6bC/8Ci5h6vl8MOP3+mTYx2eXU0aFe8OCYpPNFsSufxx+e hHTHUKxCSQDfYHbIqXVF5F8G7msh4ue+687UIo7seWyzTU4PaphH0kaHKV/+fiY1 zU9+GNjk0iVZy6cG+25ZcMPQmYv7qdrvm6XuxciywFim7l2wq0s= =BnJO -----END PGP SIGNATURE----- Merge tag 'xfs-5.8-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fix from Darrick Wong: "Fix a use-after-free bug when the fs shuts down" * tag 'xfs-5.8-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix use-after-free on CIL context on shutdown	2020-07-03 14:46:46 -07:00
Linus Torvalds	bf2d63694e	Various gfs2 fixes -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl7/A8kUHGFncnVlbmJh QHJlZGhhdC5jb20ACgkQ1b+f6wMTZToR8A//RI5c1W9De9ST6ITDsVLVpKX7ugRl Zo9ORCRgnnBuJFGQ6VtPuren9Rd6nxIOwf8MeFHjkgapIcXVM/f81Hx7WTUS7KOP dakeOOWPw1Ue3hcnpxjT4fTgZo43u1VoKbFGMpUPK0zvGrYh8fa4euAhALGIUB0X jc12mX/FdpQY+9YurY45GJ186tC0aZp1kEK+mA5apPuIA+9pUqbPIt6tmLV1wRry 2a2fTZXI0MN0/u4ZhHSFcJVj4k6xdLLQodJHW+FAPh50vHf7W++DL6mJ5V6Svp8N vaWUdTbLh8wp/IxD+G4tnPk2BFKvGRwXZV9ZmrivxcwRKc4bgA878yjU6Ox4B0XB EzXcTPeO0zGHtXBv49Pig0dZDllgvlDL/us2rnt+APZ99yWsX8GzJ1BDP4gT4var 3IuNxyvpeAth9zth/aOzS2pE50yano64CQZq/NsuvAz4MTI8zms/o2ioH/aPbwfk DFkAQHzlNPZlbt6jF6WJxF4R0Mrs8g1oTF5eH0sIpfMUxtq0Ay9dlHDEY1yjl/Jw 2KBuKLF6P6IZN3aSolawY29dcnMp5vbRDLdbqNzz2/DzAQmqjd4IcY0gK/nHTc8i 2TLyffQ6YTdEDXKIDqfPNDxjcYLKd1XCIyogXt6Rw57C3YwIbDY/drPtFUpUCUFI Oaf4j8M9W+gamzU= =GQwd -----END PGP SIGNATURE----- Merge tag 'gfs2-v5.8-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 fixes from Andreas Gruenbacher: "Various gfs2 fixes" * tag 'gfs2-v5.8-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: The freeze glock should never be frozen gfs2: When freezing gfs2, use GL_EXACT and not GL_NOCACHE gfs2: read-only mounts should grab the sd_freeze_gl glock gfs2: freeze should work on read-only mounts gfs2: eliminate GIF_ORDERED in favor of list_empty gfs2: Don't sleep during glock hash walk gfs2: fix trans slab error when withdraw occurs inside log_flush gfs2: Don't return NULL from gfs2_inode_lookup	2020-07-03 12:01:04 -07:00
Matthew Wilcox (Oracle)	d4d80e6992	Call sysctl_head_finish on error This error path returned directly instead of calling sysctl_head_finish(). Fixes: `ef9d965bc8` ("sysctl: reject gigantic reads/write to sysctl files") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-07-03 14:10:46 -04:00
Bob Peterson	c860f8ffbe	gfs2: The freeze glock should never be frozen Before this patch, some gfs2 code locked the freeze glock with LM_FLAG_NOEXP (Do not freeze) flag, and some did not. We never want to freeze the freeze glock, so this patch makes it consistently use LM_FLAG_NOEXP always. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2020-07-03 12:05:35 +02:00
Bob Peterson	623ba664b7	gfs2: When freezing gfs2, use GL_EXACT and not GL_NOCACHE Before this patch, the freeze code in gfs2 specified GL_NOCACHE in several places. That's wrong because we always want to know the state of whether the file system is frozen. There was also a problem with freeze/thaw transitioning the glock from frozen (EX) to thawed (SH) because gfs2 will normally grant glocks in EX to processes that request it in SH mode, unless GL_EXACT is specified. Therefore, the freeze/thaw code, which tried to reacquire the glock in SH mode would get the glock in EX mode, and miss the transition from EX to SH. That made it think the thaw had completed normally, but since the glock was still cached in EX, other nodes could not freeze again. This patch removes the GL_NOCACHE flag to allow the freeze glock to be cached. It also adds the GL_EXACT flag so the glock is fully transitioned from EX to SH, thereby allowing future freeze operations. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2020-07-03 12:05:35 +02:00
Bob Peterson	b780cc615b	gfs2: read-only mounts should grab the sd_freeze_gl glock Before this patch, only read-write mounts would grab the freeze glock in read-only mode, as part of gfs2_make_fs_rw. So the freeze glock was never initialized. That meant requests to freeze, which request the glock in EX, were granted without any state transition. That meant you could mount a gfs2 file system, which is currently frozen on a different cluster node, in read-only mode. This patch makes read-only mounts lock the freeze glock in SH mode, which will block for file systems that are frozen on another node. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2020-07-03 12:05:35 +02:00
Bob Peterson	541656d3a5	gfs2: freeze should work on read-only mounts Before this patch, function freeze_go_sync, called when promoting the freeze glock, was testing for the SDF_JOURNAL_LIVE superblock flag. That's only set for read-write mounts. Read-only mounts don't use a journal, so the bit is never set, so the freeze never happened. This patch removes the check for SDF_JOURNAL_LIVE for freeze requests but still checks it when deciding whether to flush a journal. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2020-07-03 12:05:35 +02:00
Bob Peterson	7542486b89	gfs2: eliminate GIF_ORDERED in favor of list_empty In several places, we used the GIF_ORDERED inode flag to determine if an inode was on the ordered writes list. However, since we always held the sd_ordered_lock spin_lock during the manipulation, we can just as easily check list_empty(&ip->i_ordered) instead. This allows us to keep more than one ordered writes list to make journal writing improvements. This patch eliminates GIF_ORDERED in favor of checking list_empty. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2020-07-03 12:05:34 +02:00
Linus Torvalds	083176c86f	Fixes for a umask bug on exported filesystems lacking ACL support, a leak and a module unloading bug in the /proc/fs/nfsd/clients/ code, and a compile warning. -----BEGIN PGP SIGNATURE----- iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl79+IoVHGJmaWVsZHNA ZmllbGRzZXMub3JnAAoJECebzXlCjuG+6I8P/24e9/W50SUBsQYseG2BpwjR/RsQ YjMbqrf1XOxo+8axNpdbe0bhq2jWiyQz0esnF33RlztlDSJmSNJfueWDSezKzKwC o8afQx0qJaVZUsT/XAXa2Hk2OZd2ZYF6f3DGMiz+knBGdAzSwJjpgqhzocMCQ3Hr t/PG6DJazLB3VDIe1VziTet2uv52A0A+uBYKguK/QPlpae2uXKFJ8U7v6wCsU395 Sqd2/X2KGbeYoCrWsmpvdCDVeNmAbI0KlhY8pR6BHqGp7TYm4+AueqWzpYHlNHei PukM8AROoTBEAc6Wiqqmp0UKRR+Qn/9NIuvQtvBnC6WGIPjEG1hTkAwlRfT6VYvn oPg4oekKjRJLz/TSaqfJRpli5GwxfWAW14LTZZT+Xe0/7FhVe28/R8F1dP5ZJaeq h9//4rCt/yUYAQq1odOMbNCr0rGVcKzdSN3E36OvJFVQ9bMyXHKetKHywOki13w5 M8UQK5zb21ghT7OSICmeRXHqsXRmTFO8QhUZ8L63Qb2hfiQ5fVQdSiHmM8iRcwWY bxqrSs8YV7i+I0i1YYTYWmmFgP8Y11sL7ovAEs86cP2Rk58Bk5VA2TPT114W53AD xaZHpjsH0AfZS87dEVdvS2/dAdtbHZsFwHxGnfvyl/CKTqoz5yY5etcwULQI3+XQ 8kG8FOFpt7T/5zmB =wF3M -----END PGP SIGNATURE----- Merge tag 'nfsd-5.8-1' of git://linux-nfs.org/~bfields/linux Pull nfsd fixes from Bruce Fields: "Fixes for a umask bug on exported filesystems lacking ACL support, a leak and a module unloading bug in the /proc/fs/nfsd/clients/ code, and a compile warning" * tag 'nfsd-5.8-1' of git://linux-nfs.org/~bfields/linux: SUNRPC: Add missing definition of ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE nfsd: fix nfsdfs inode reference count leak nfsd4: fix nfsdfs reference count loop nfsd: apply umask on fs without ACL support	2020-07-02 20:35:33 -07:00
Linus Torvalds	c93493b7cd	io_uring-5.8-2020-07-01 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl79YU0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgplHKD/9rgv0c1I7dCh6MgQKxT+2z/eZcaPO3PekW sbn8yC8RiSIL85Av1zEfC1wAp+Mp21QlFKXFiZ6BJj5bdDbbshLk0WdbnxvuM+9I gyngTI/+em5D/WCcetAkPjnMTDq0m4l0UXd91fyNAeErmYZbvhL5dXihZsBJ3T9c Bprn4RzWwrUsUwGn8qIEZhx2UovMrzXJHGFxWXh/81YHkh7Y4mjvATKxtECIliW/ +QQJDU7Tf3gZw+ETPIDOEB9Hl9c9W+9fcWWzmrXzViUyy54IMbF4qyJpWcGaRh6c sO3apymwu7wwAUbQcE8IWr3ZLZDtw68AgUdZ5b/T0c2fEwqsI/UDMhBbELiuqcT0 MAoQdUSNNqZTti0PX5vg5CQlCFzjnl2uIwHF6LVSbrqgyqxiC3Qrus/FYSaf3x9h bAmNgWC9DeKp/wtEKMuBXaOm7RjrEutD5hjJYfVK/AkvKTZyZDx3vZ9FRH8WtrII 7KhUI3DPSZCeWlcpDtK+0fEqtqTw6OtCQ8U5vKSnJjoRSXLUtuk6IYbp/tqNxwe/ 0d+U6R+w513jVlXARUP48mV7tzpESp2MLP6Nd2Is/OD5tePWzQEZinpKzsFP4djH d2PT5FFGPCw9yBk03sI1Je/CFqVYwCGqav6h8dKKVBanMjoEdL4U1PMhI48Zua+9 M8pqRHoeDA== =4lvI -----END PGP SIGNATURE----- Merge tag 'io_uring-5.8-2020-07-01' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "One fix in here, for a regression in 5.7 where a task is waiting in the kernel for a condition, but that condition won't become true until task_work is run. And the task_work can't be run exactly because the task is waiting in the kernel, so we'll never make any progress. One example of that is registering an eventfd and queueing io_uring work, and then the task goes and waits in eventfd read with the expectation that it'll get woken (and read an event) when the io_uring request completes. The io_uring request is finished through task_work, which won't get run while the task is looping in eventfd read" * tag 'io_uring-5.8-2020-07-01' of git://git.kernel.dk/linux-block: io_uring: use signal based task_work running task_work: teach task_work_add() to do signal_wake_up()	2020-07-02 14:56:22 -07:00
Josef Bacik	0465337c55	btrfs: reset tree root pointer after error in init_tree_roots Eric reported an issue where mounting -o recovery with a fuzzed fs resulted in a kernel panic. This is because we tried to free the tree node, except it was an error from the read. Fix this by properly resetting the tree_root->node == NULL in this case. The panic was the following BTRFS warning (device loop0): failed to read tree root BUG: kernel NULL pointer dereference, address: 000000000000001f RIP: 0010:free_extent_buffer+0xe/0x90 [btrfs] Call Trace: free_root_extent_buffers.part.0+0x11/0x30 [btrfs] free_root_pointers+0x1a/0xa2 [btrfs] open_ctree+0x1776/0x18a5 [btrfs] btrfs_mount_root.cold+0x13/0xfa [btrfs] ? selinux_fs_context_parse_param+0x37/0x80 legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 fc_mount+0xe/0x30 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x147/0x3e0 [btrfs] ? cred_has_capability+0x7c/0x120 ? legacy_get_tree+0x27/0x40 legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 do_mount+0x735/0xa40 __x64_sys_mount+0x8e/0xd0 do_syscall_64+0x4d/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Nik says: this is problematic only if we fail on the last iteration of the loop as this results in init_tree_roots returning err value with tree_root->node = -ERR. Subsequently the caller does: fail_tree_roots which calls free_root_pointers on the bogus value. Reported-by: Eric Sandeen <sandeen@redhat.com> Fixes: `b8522a1e5f` ("btrfs: Factor out tree roots initialization during mount") CC: stable@vger.kernel.org # 5.5+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add details how the pointer gets dereferenced ] Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-02 10:27:12 +02:00
Filipe Manana	6d548b9e5d	btrfs: fix reclaim_size counter leak after stealing from global reserve Commit `7f9fe61440` ("btrfs: improve global reserve stealing logic"), added in the 5.8 merge window, introduced another leak for the space_info's reclaim_size counter. This is very often triggered by the test cases generic/269 and generic/416 from fstests, producing a stack trace like the following during unmount: [37079.155499] ------------[ cut here ]------------ [37079.156844] WARNING: CPU: 2 PID: 2000423 at fs/btrfs/block-group.c:3422 btrfs_free_block_groups+0x2eb/0x300 [btrfs] [37079.158090] Modules linked in: dm_snapshot btrfs dm_thin_pool (...) [37079.164440] CPU: 2 PID: 2000423 Comm: umount Tainted: G W 5.7.0-rc7-btrfs-next-62 #1 [37079.165422] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), (...) [37079.167384] RIP: 0010:btrfs_free_block_groups+0x2eb/0x300 [btrfs] [37079.168375] Code: bd 58 ff ff ff 00 4c 8d (...) [37079.170199] RSP: 0018:ffffaa53875c7de0 EFLAGS: 00010206 [37079.171120] RAX: ffff98099e701cf8 RBX: ffff98099e2d4000 RCX: 0000000000000000 [37079.172057] RDX: 0000000000000001 RSI: ffffffffc0acc5b1 RDI: 00000000ffffffff [37079.173002] RBP: ffff98099e701cf8 R08: 0000000000000000 R09: 0000000000000000 [37079.173886] R10: 0000000000000000 R11: 0000000000000000 R12: ffff98099e701c00 [37079.174730] R13: ffff98099e2d5100 R14: dead000000000122 R15: dead000000000100 [37079.175578] FS: 00007f4d7d0a5840(0000) GS:ffff9809ec600000(0000) knlGS:0000000000000000 [37079.176434] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [37079.177289] CR2: 0000559224dcc000 CR3: 000000012207a004 CR4: 00000000003606e0 [37079.178152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [37079.178935] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [37079.179675] Call Trace: [37079.180419] close_ctree+0x291/0x2d1 [btrfs] [37079.181162] generic_shutdown_super+0x6c/0x100 [37079.181898] kill_anon_super+0x14/0x30 [37079.182641] btrfs_kill_super+0x12/0x20 [btrfs] [37079.183371] deactivate_locked_super+0x31/0x70 [37079.184012] cleanup_mnt+0x100/0x160 [37079.184650] task_work_run+0x68/0xb0 [37079.185284] exit_to_usermode_loop+0xf9/0x100 [37079.185920] do_syscall_64+0x20d/0x260 [37079.186556] entry_SYSCALL_64_after_hwframe+0x49/0xb3 [37079.187197] RIP: 0033:0x7f4d7d2d9357 [37079.187836] Code: eb 0b 00 f7 d8 64 89 01 48 (...) [37079.189180] RSP: 002b:00007ffee4e0d368 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [37079.189845] RAX: 0000000000000000 RBX: 00007f4d7d3fb224 RCX: 00007f4d7d2d9357 [37079.190515] RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000559224dc5c90 [37079.191173] RBP: 0000559224dc1970 R08: 0000000000000000 R09: 00007ffee4e0c0e0 [37079.191815] R10: 0000559224dc7b00 R11: 0000000000000246 R12: 0000000000000000 [37079.192451] R13: 0000559224dc5c90 R14: 0000559224dc1a80 R15: 0000559224dc1ba0 [37079.193096] irq event stamp: 0 [37079.193729] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [37079.194379] hardirqs last disabled at (0): [<ffffffff97ab8935>] copy_process+0x755/0x1ea0 [37079.195033] softirqs last enabled at (0): [<ffffffff97ab8935>] copy_process+0x755/0x1ea0 [37079.195700] softirqs last disabled at (0): [<0000000000000000>] 0x0 [37079.196318] ---[ end trace b32710d864dea887 ]--- In the past commit `d611add48b` ("btrfs: fix reclaim counter leak of space_info objects") fixed similar cases. That commit however has a date more recent (April 7 2020) then the commit mentioned before (March 13 2020), however it was merged in kernel 5.7 while the older commit, which introduces a new leak, was merged only in the 5.8 merge window. So the leak sneaked in unnoticed. Fix this by making steal_from_global_rsv() remove the ticket using the helper remove_ticket(), which decrements the reclaim_size counter of the space_info object. Fixes: `7f9fe61440` ("btrfs: improve global reserve stealing logic") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-02 10:18:34 +02:00
Boris Burkov	6bf9cd2eed	btrfs: fix fatal extent_buffer readahead vs releasepage race Under somewhat convoluted conditions, it is possible to attempt to release an extent_buffer that is under io, which triggers a BUG_ON in btrfs_release_extent_buffer_pages. This relies on a few different factors. First, extent_buffer reads done as readahead for searching use WAIT_NONE, so they free the local extent buffer reference while the io is outstanding. However, they should still be protected by TREE_REF. However, if the system is doing signficant reclaim, and simultaneously heavily accessing the extent_buffers, it is possible for releasepage to race with two concurrent readahead attempts in a way that leaves TREE_REF unset when the readahead extent buffer is released. Essentially, if two tasks race to allocate a new extent_buffer, but the winner who attempts the first io is rebuffed by a page being locked (likely by the reclaim itself) then the loser will still go ahead with issuing the readahead. The loser's call to find_extent_buffer must also race with the reclaim task reading the extent_buffer's refcount as 1 in a way that allows the reclaim to re-clear the TREE_REF checked by find_extent_buffer. The following represents an example execution demonstrating the race: CPU0 CPU1 CPU2 reada_for_search reada_for_search readahead_tree_block readahead_tree_block find_create_tree_block find_create_tree_block alloc_extent_buffer alloc_extent_buffer find_extent_buffer // not found allocates eb lock pages associate pages to eb insert eb into radix tree set TREE_REF, refs == 2 unlock pages read_extent_buffer_pages // WAIT_NONE not uptodate (brand new eb) lock_page if !trylock_page goto unlock_exit // not an error free_extent_buffer release_extent_buffer atomic_dec_and_test refs to 1 find_extent_buffer // found try_release_extent_buffer take refs_lock reads refs == 1; no io atomic_inc_not_zero refs to 2 mark_buffer_accessed check_buffer_tree_ref // not STALE, won't take refs_lock refs == 2; TREE_REF set // no action read_extent_buffer_pages // WAIT_NONE clear TREE_REF release_extent_buffer atomic_dec_and_test refs to 1 unlock_page still not uptodate (CPU1 read failed on trylock_page) locks pages set io_pages > 0 submit io return free_extent_buffer release_extent_buffer dec refs to 0 delete from radix tree btrfs_release_extent_buffer_pages BUG_ON(io_pages > 0)!!! We observe this at a very low rate in production and were also able to reproduce it in a test environment by introducing some spurious delays and by introducing probabilistic trylock_page failures. To fix it, we apply check_tree_ref at a point where it could not possibly be unset by a competing task: after io_pages has been incremented. All the codepaths that clear TREE_REF check for io, so they would not be able to clear it after this point until the io is done. Stack trace, for reference: [1417839.424739] ------------[ cut here ]------------ [1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841! [1417839.447024] invalid opcode: 0000 [#1] SMP [1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0 [1417839.517008] Code: ed e9 ... [1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202 [1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028 [1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0 [1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238 [1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000 [1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90 [1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000 [1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0 [1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1417839.731320] Call Trace: [1417839.737103] release_extent_buffer+0x39/0x90 [1417839.746913] read_block_for_search.isra.38+0x2a3/0x370 [1417839.758645] btrfs_search_slot+0x260/0x9b0 [1417839.768054] btrfs_lookup_file_extent+0x4a/0x70 [1417839.778427] btrfs_get_extent+0x15f/0x830 [1417839.787665] ? submit_extent_page+0xc4/0x1c0 [1417839.797474] ? __do_readpage+0x299/0x7a0 [1417839.806515] __do_readpage+0x33b/0x7a0 [1417839.815171] ? btrfs_releasepage+0x70/0x70 [1417839.824597] extent_readpages+0x28f/0x400 [1417839.833836] read_pages+0x6a/0x1c0 [1417839.841729] ? startup_64+0x2/0x30 [1417839.849624] __do_page_cache_readahead+0x13c/0x1a0 [1417839.860590] filemap_fault+0x6c7/0x990 [1417839.869252] ? xas_load+0x8/0x80 [1417839.876756] ? xas_find+0x150/0x190 [1417839.884839] ? filemap_map_pages+0x295/0x3b0 [1417839.894652] __do_fault+0x32/0x110 [1417839.902540] __handle_mm_fault+0xacd/0x1000 [1417839.912156] handle_mm_fault+0xaa/0x1c0 [1417839.921004] __do_page_fault+0x242/0x4b0 [1417839.930044] ? page_fault+0x8/0x30 [1417839.937933] page_fault+0x1e/0x30 [1417839.945631] RIP: 0033:0x33c4bae [1417839.952927] Code: Bad RIP value. [1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206 [1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000 [1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002 [1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8 [1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79 [1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40 CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-02 10:18:33 +02:00
Marcos Paulo de Souza	c730ae0c6b	btrfs: convert comments to fallthrough annotations Convert fall through comments to the pseudo-keyword which is now the preferred way. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-07-02 10:18:30 +02:00
Ronnie Sahlberg	19e888678b	cifs: prevent truncation from long to int in wait_for_free_credits The wait_event_... defines evaluate to long so we should not assign it an int as this may truncate the value. Reported-by: Marshall Midden <marshallmidden@gmail.com> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-07-01 20:01:26 -05:00
Zhang Xiaoxu	9ffad9263b	cifs: Fix the target file was deleted when rename failed. When xfstest generic/035, we found the target file was deleted if the rename return -EACESS. In cifs_rename2, we unlink the positive target dentry if rename failed with EACESS or EEXIST, even if the target dentry is positived before rename. Then the existing file was deleted. We should just delete the target file which created during the rename. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2020-07-01 19:41:56 -05:00
Paul Aurich	5391b8e1b7	SMB3: Honor 'posix' flag for multiuser mounts The flag from the primary tcon needs to be copied into the volume info so that cifs_get_tcon will try to enable extensions on the per-user tcon. At that point, since posix extensions must have already been enabled on the superblock, don't try to needlessly adjust the mount flags. Fixes: `ce558b0e17` ("smb3: Add posix create context for smb3.11 posix mounts") Fixes: `b326614ea2` ("smb3: allow "posix" mount option to enable new SMB311 protocol extensions") Signed-off-by: Paul Aurich <paul@darkrain42.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2020-07-01 19:41:36 -05:00
Paul Aurich	6b356f6cf9	SMB3: Honor 'handletimeout' flag for multiuser mounts Fixes: `ca567eb2b3` ("SMB3: Allow persistent handle timeout to be configurable on mount") Signed-off-by: Paul Aurich <paul@darkrain42.org> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2020-07-01 19:40:33 -05:00
Paul Aurich	ad35f169db	SMB3: Honor lease disabling for multiuser mounts Fixes: `3e7a02d478` ("smb3: allow disabling requesting leases") Signed-off-by: Paul Aurich <paul@darkrain42.org> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2020-07-01 19:40:17 -05:00
Paul Aurich	00dfbc2f9c	SMB3: Honor persistent/resilient handle flags for multiuser mounts Without this: - persistent handles will only be enabled for per-user tcons if the server advertises the 'Continuous Availabity' capability - resilient handles would never be enabled for per-user tcons Signed-off-by: Paul Aurich <paul@darkrain42.org> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2020-07-01 19:40:06 -05:00
Paul Aurich	cc15461c73	SMB3: Honor 'seal' flag for multiuser mounts Ensure multiuser SMB3 mounts use encryption for all users' tcons if the mount options are configured to require encryption. Without this, only the primary tcon and IPC tcons are guaranteed to be encrypted. Per-user tcons would only be encrypted if the server was configured to require encryption. Signed-off-by: Paul Aurich <paul@darkrain42.org> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2020-07-01 19:38:46 -05:00
Paul Aurich	aadd69cad0	cifs: Display local UID details for SMB sessions in DebugData This is useful for distinguishing SMB sessions on a multiuser mount. Signed-off-by: Paul Aurich <paul@darkrain42.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2020-07-01 19:38:19 -05:00
Christoph Hellwig	1008fe6dc3	block: remove the all_bdevs list Instead just iterate over the inodes for the block device superblock. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 08:08:25 -06:00
Christoph Hellwig	e556f6ba10	block: remove the bd_queue field from struct block_device Just use bd_disk->queue instead. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 08:08:20 -06:00
Christoph Hellwig	6b7b181b67	block: remove the bd_block_size field from struct block_device We can trivially calculate the block size from the inodes i_blkbits variable. Use that instead of keeping two redundant copies of the information in slightly different formats. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 08:08:17 -06:00
Christoph Hellwig	5ff9f19231	block: simplify set_init_blocksize The loop to increase the initial block size doesn't really make any sense, as the AND operation won't match for powers of two if it didn't for the initial block size. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 08:08:17 -06:00
Christoph Hellwig	ed9b3196d2	fs: remove a weird comment in submit_bh_wbc All bios can get remapped if submitted to partitions. No need to comment on that. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 07:27:23 -06:00
Linus Torvalds	edb543cfe5	Description for this pull request: - Zero out unused characters of FileName field to avoid a complaint from some fsck tool. - Fix memory leak on error paths. - Fix unnecessary VOL_DIRTY set when calling rmdir on non-empty directory. - Call sync_filesystem() for read-only remount(Fix generic/452 test in xfstests) - Add own fsync() to flush dirty metadata. -----BEGIN PGP SIGNATURE----- iQJMBAABCgA2FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAl76ldAYHG5hbWphZS5q ZW9uQHNhbXN1bmcuY29tAAoJEGcL+wNRRCEIjoEQAKYh1yAd9xYqYznwWKgRa76d DyRfuDzIcgPoM8C00sys237OGhb2iXlyLAWQ1Ag6kIZkxjCPjMZ3Ma+piqi0sEvG YXhrDdSkAstsbRiQ/Z/pFSFPBmI8wej64uMR1eZOtY5ms0VPtau3paX6JWBhiGZU cmS3ggUFUvOlky9vKCRX2kaPSVyN+VpUMiGe2jfa8x5y6ZRLWPgkQwfVYk38O4zS Z4x/UZiokfUXqrh5kPVWDxk96oWq2c+KLxmRawjEA9IOvgqs2ydbcAQnGx5fkHAO d+aqLjo3XsMlN7dfB9xKhFjRrZL6MggU2Ptu/BoEb5RsyPUGk/wCYQMjAykeBLtT VC+3tGQob3GEgeVdhogrPOhPCNv3Pxgl8XBigE8sDMtvdoqrHeP83i8fYCcUb3jY ENjSaIZxD/kOtjf2nbgz6FDJhJQSsoFP+oKqndPc9umD5mM0Foj+NZ9cevdNvLsd qqanWxbdfgI6iCSg8S8dJE4PTSI2o08MY+Nh+NA6MktIEOaQDy3ncXjk/XZ7oX42 4zMrvNvTX894vcpCDNaa+ZW1NVSTWaIf+saHRqqnsU6nouQL0VmsTK4SAGqtGeFb vZobK4z8qy3uliKiGtjbc3DYA1gB9lJCKNCXLaFuCD6amAPufXWDeeVp8AzA5AVh AqS+oQIoO8yCf9GyBAL7 =lu+d -----END PGP SIGNATURE----- Merge tag 'exfat-for-5.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat fixes from Namjae Jeon: - Zero out unused characters of FileName field to avoid a complaint from some fsck tool. - Fix memory leak on error paths. - Fix unnecessary VOL_DIRTY set when calling rmdir on non-empty directory. - Call sync_filesystem() for read-only remount (Fix generic/452 test in xfstests) - Add own fsync() to flush dirty metadata. * tag 'exfat-for-5.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: flush dirty metadata in fsync exfat: move setting VOL_DIRTY over exfat_remove_entries() exfat: call sync_filesystem for read-only remount exfat: add missing brelse() calls on error paths exfat: Set the unused characters of FileName field to the value 0000h	2020-06-30 12:35:11 -07:00
Jens Axboe	ce593a6c48	io_uring: use signal based task_work running Since 5.7, we've been using task_work to trigger async running of requests in the context of the original task. This generally works great, but there's a case where if the task is currently blocked in the kernel waiting on a condition to become true, it won't process task_work. Even though the task is woken, it just checks whatever condition it's waiting on, and goes back to sleep if it's still false. This is a problem if that very condition only becomes true when that task_work is run. An example of that is the task registering an eventfd with io_uring, and it's now blocked waiting on an eventfd read. That read could depend on a completion event, and that completion event won't get trigged until task_work has been run. Use the TWA_SIGNAL notification for task_work, so that we ensure that the task always runs the work when queued. Cc: stable@vger.kernel.org # v5.7 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 12:39:05 -06:00
Pavel Begunkov	8eb06d7e8d	io_uring: fix missing ->mm on exit There is a fancy bug, where exiting user task may not have ->mm, that makes task_works to try to do kthread_use_mm(ctx->sqo_mm). Don't do that if sqo_mm is NULL. [ 290.460558] WARNING: CPU: 6 PID: 150933 at kernel/kthread.c:1238 kthread_use_mm+0xf3/0x110 [ 290.460579] CPU: 6 PID: 150933 Comm: read-write2 Tainted: G I E 5.8.0-rc2-00066-g9b21720607cf #531 [ 290.460580] RIP: 0010:kthread_use_mm+0xf3/0x110 ... [ 290.460584] Call Trace: [ 290.460584] __io_sq_thread_acquire_mm.isra.0.part.0+0x25/0x30 [ 290.460584] __io_req_task_submit+0x64/0x80 [ 290.460584] io_req_task_submit+0x15/0x20 [ 290.460585] task_work_run+0x67/0xa0 [ 290.460585] do_exit+0x35d/0xb70 [ 290.460585] do_group_exit+0x43/0xa0 [ 290.460585] get_signal+0x140/0x900 [ 290.460586] do_signal+0x37/0x780 [ 290.460586] __prepare_exit_to_usermode+0x126/0x1c0 [ 290.460586] __syscall_return_slowpath+0x3b/0x1c0 [ 290.460587] do_syscall_64+0x5f/0xa0 [ 290.460587] entry_SYSCALL_64_after_hwframe+0x44/0xa9 following with faults. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 09:33:02 -06:00
Pavel Begunkov	3fa5e0f331	io_uring: optimise io_req_find_next() fast check gcc 9.2.0 compiles io_req_find_next() as a separate function leaving the first REQ_F_LINK_HEAD fast check not inlined. Help it by splitting out the check from the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 09:32:04 -06:00
Pavel Begunkov	0be0b0e33b	io_uring: simplify io_async_task_func() Greatly simplify io_async_task_func() removing duplicated functionality of __io_req_task_submit(). This do one extra spin lock/unlock for cancelled poll case, but that shouldn't happen often. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 09:32:04 -06:00
Pavel Begunkov	ea1164e574	io_uring: fix NULL mm in io_poll_task_func() io_poll_task_func() hand-coded link submission forgetting to set TASK_RUNNING, acquire mm, etc. Call existing helper for that, i.e. __io_req_task_submit(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 09:32:04 -06:00
Pavel Begunkov	cf2f54255d	io_uring: don't fail iopoll requeue without ->mm Actually, io_iopoll_queue() may have NULL ->mm, that's if SQ thread didn't grabbed mm before doing iopoll. Don't fail reqs there, as after recent changes it won't be punted directly but rather through task_work. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 09:32:04 -06:00
Jens Axboe	ab0b6451db	io_uring: clean up io_kill_linked_timeout() locking Avoid jumping through hoops to silence unused variable warnings, and also fix sparse rightfully complaining about the locking context: fs/io_uring.c:1593:39: warning: context imbalance in 'io_kill_linked_timeout' - unexpected unlock Provide the functional helper as __io_kill_linked_timeout(), and have separate the locking from it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 08:43:15 -06:00
Pavel Begunkov	cbdcb4357c	io_uring: do grab_env() just before punting Currently io_steal_work() is disabled, and every linked request should go through task_work for initialisation. Do io_req_work_grab_env() just before io-wq punting and for the whole link, so any request reachable by io_steal_work() is prepared. This is also interesting for another reason -- it localises io_req_work_grab_env() into one place just before io-wq punting, helping to to better manage req->work lifetime and add some neat cleanup/optimisations later. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 08:40:00 -06:00
Pavel Begunkov	debb85f496	io_uring: factor out grab_env() from defer_prep() Remove io_req_work_grab_env() call from io_req_defer_prep(), just call it when neccessary. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 08:39:59 -06:00
Pavel Begunkov	edcdfcc149	io_uring: do init work in grab_env() Place io_req_init_async() in io_req_work_grab_env() so it won't be forgotten. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 08:39:59 -06:00
Pavel Begunkov	351fd53595	io_uring: don't pass def into io_req_work_grab_env Remove struct io_op_def *def parameter from io_req_work_grab_env(), it's trivially deducible from req->opcode and fast. The API is cleaner this way, and also helps the complier to understand that it's a real constant and could be register-cached. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 08:39:59 -06:00
Pavel Begunkov	ecfc517774	io_uring: fix potential use after free on fallback request free After __io_free_req() puts a ctx ref, it should be assumed that the ctx may already be gone. However, it can be accessed when putting the fallback req. Free the req first and then put the ctx. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 08:39:59 -06:00
Pavel Begunkov	8eb7e2d007	io_uring: kill REQ_F_TIMEOUT_NOSEQ There are too many useless flags, kill REQ_F_TIMEOUT_NOSEQ, which can be easily infered from req.timeout itself. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 08:39:59 -06:00
Pavel Begunkov	a1a4661691	io_uring: kill REQ_F_TIMEOUT Now REQ_F_TIMEOUT is set but never used, kill it Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 08:39:59 -06:00
Pavel Begunkov	9b5f7bd932	io_uring: replace find_next() out param with ret Generally, it's better to return a value directly than having out parameter. It's cleaner and saves from some kinds of ugly bugs. May also be faster. Return next request from io_req_find_next() and friends directly instead of passing out parameter. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 08:39:57 -06:00
Pavel Begunkov	7c86ffeeed	io_uring: deduplicate freeing linked timeouts Linked timeout cancellation code is repeated in in io_req_link_next() and io_fail_links(), and they differ in details even though shouldn't. Basing on the fact that there is maximum one armed linked timeout in a link, and it immediately follows the head, extract a function that will check for it and defuse. Justification: - DRY and cleaner - better inlining for io_req_link_next() (just 1 call site now) - isolates linked_timeouts from common path - reduces time under spinlock for failed links - actually less code Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: fold in locking fix for io_fail_links()] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-30 08:38:58 -06:00
Herbert Xu	7999096fa9	iov_iter: Move unnecessary inclusion of crypto/hash.h The header file linux/uio.h includes crypto/hash.h which pulls in most of the Crypto API. Since linux/uio.h is used throughout the kernel this means that every tiny bit of change to the Crypto API causes the entire kernel to get rebuilt. This patch fixes this by moving it into lib/iov_iter.c instead where it is actually used. This patch also fixes the ifdef to use CRYPTO_HASH instead of just CRYPTO which does not guarantee the existence of ahash. Unfortunately a number of drivers were relying on linux/uio.h to provide access to linux/slab.h. This patch adds inclusions of linux/slab.h as detected by build failures. Also skbuff.h was relying on this to provide a declaration for ahash_request. This patch adds a forward declaration instead. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-06-30 09:34:23 -04:00
Andreas Gruenbacher	34244d711d	gfs2: Don't sleep during glock hash walk In flush_delete_work, instead of flushing each individual pending delayed work item, cancel and re-queue them for immediate execution. The waiting isn't needed here because we're already waiting for all queued work items to complete in gfs2_flush_delete_work. This makes the code more efficient, but more importantly, it avoids sleeping during a rhashtable walk, inside rcu_read_lock(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-30 13:04:45 +02:00
Bob Peterson	58e08e8d83	gfs2: fix trans slab error when withdraw occurs inside log_flush Log flush operations (gfs2_log_flush()) can target a specific transaction. But if the function encounters errors (e.g. io errors) and withdraws, the transaction was only freed it if was queued to one of the ail lists. If the withdraw occurred before the transaction was queued to the ail1 list, function ail_drain never freed it. The result was: BUG gfs2_trans: Objects remaining in gfs2_trans on __kmem_cache_shutdown() This patch makes log_flush() add the targeted transaction to the ail1 list so that function ail_drain() will find and free it properly. Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-30 13:04:45 +02:00
Andreas Gruenbacher	5902f4dd6e	gfs2: Don't return NULL from gfs2_inode_lookup Callers expect gfs2_inode_lookup to return an inode pointer or ERR_PTR(error). Commit `b66648ad6d` caused it to return NULL instead of ERR_PTR(-ESTALE) in some cases. Fix that. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: `b66648ad6d` ("gfs2: Move inode generation number check into gfs2_inode_lookup") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-30 13:04:45 +02:00
Paul E. McKenney	9f47eb5461	fs/btrfs: Add cond_resched() for try_release_extent_mapping() stalls Very large I/Os can cause the following RCU CPU stall warning: RIP: 0010:rb_prev+0x8/0x50 Code: 49 89 c0 49 89 d1 48 89 c2 48 89 f8 e9 e5 fd ff ff 4c 89 48 10 c3 4c = 89 06 c3 4c 89 40 10 c3 0f 1f 00 48 8b 0f 48 39 cf 74 38 <48> 8b 47 10 48 85 c0 74 22 48 8b 50 08 48 85 d2 74 0c 48 89 d0 48 RSP: 0018:ffffc9002212bab0 EFLAGS: 00000287 ORIG_RAX: ffffffffffffff13 RAX: ffff888821f93630 RBX: ffff888821f93630 RCX: ffff888821f937e0 RDX: 0000000000000000 RSI: 0000000000102000 RDI: ffff888821f93630 RBP: 0000000000103000 R08: 000000000006c000 R09: 0000000000000238 R10: 0000000000102fff R11: ffffc9002212bac8 R12: 0000000000000001 R13: ffffffffffffffff R14: 0000000000102000 R15: ffff888821f937e0 __lookup_extent_mapping+0xa0/0x110 try_release_extent_mapping+0xdc/0x220 btrfs_releasepage+0x45/0x70 shrink_page_list+0xa39/0xb30 shrink_inactive_list+0x18f/0x3b0 shrink_lruvec+0x38e/0x6b0 shrink_node+0x14d/0x690 do_try_to_free_pages+0xc6/0x3e0 try_to_free_mem_cgroup_pages+0xe6/0x1e0 reclaim_high.constprop.73+0x87/0xc0 mem_cgroup_handle_over_high+0x66/0x150 exit_to_usermode_loop+0x82/0xd0 do_syscall_64+0xd4/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 On a PREEMPT=n kernel, the try_release_extent_mapping() function's "while" loop might run for a very long time on a large I/O. This commit therefore adds a cond_resched() to this loop, providing RCU any needed quiescent states. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-06-29 11:58:50 -07:00
J. Bruce Fields	bf2654017e	nfsd: fix nfsdfs inode reference count leak I don't understand this code well, but I'm seeing a warning about a still-referenced inode on unmount, and every other similar filesystem does a dput() here. Fixes: `e8a79fb14f` ("nfsd: add nfsd/clients directory") Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2020-06-29 14:48:28 -04:00
J. Bruce Fields	681370f4b0	nfsd4: fix nfsdfs reference count loop We don't drop the reference on the nfsdfs filesystem with mntput(nn->nfsd_mnt) until nfsd_exit_net(), but that won't be called until the nfsd module's unloaded, and we can't unload the module as long as there's a reference on nfsdfs. So this prevents module unloading. Fixes: `2c830dd720` ("nfsd: persist nfsd filesystem across mounts") Reported-and-Tested-by: Luo Xiaogang <lxgrxd@163.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2020-06-29 14:48:02 -04:00
Mel Gorman	b6509f6a8c	Revert "fs: Do not check if there is a fsnotify watcher on pseudo inodes" This reverts commit `e9c15badbb` ("fs: Do not check if there is a fsnotify watcher on pseudo inodes"). The commit intended to eliminate fsnotify-related overhead for pseudo inodes but it is broken in concept. inotify can receive events of pipe files under /proc/X/fd and chromium relies on close and open events for sandboxing. Maxim Levitsky reported the following Chromium starts as a white rectangle, shows few white rectangles that resemble its notifications and then crashes. The stdout output from chromium: [mlevitsk@starship ~]$chromium-freeworld mesa: for the --simplifycfg-sink-common option: may only occur zero or one times! mesa: for the --global-isel-abort option: may only occur zero or one times! [3379:3379:0628/135151.440930:ERROR:browser_switcher_service.cc(238)] XXX Init() ../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:CRASHING:seccomp-bpf failure in syscall 0072 Received signal 11 SEGV_MAPERR 0000004a9048 Crashes are not universal but even if chromium does not crash, it certainly does not work properly. While filtering just modify and access might be safe, the benefit is not worth the risk hence the revert. Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Fixes: `e9c15badbb` ("fs: Do not check if there is a fsnotify watcher on pseudo inodes") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-29 09:40:55 -07:00
Pavel Begunkov	fb49278624	io_uring: fix missing wake_up io_rw_reissue() Don't forget to wake up a process to which io_rw_reissue() added task_work. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-29 07:43:03 -06:00
Sungjong Seo	5267456e95	exfat: flush dirty metadata in fsync generic_file_fsync() exfat used could not guarantee the consistency of a file because it has flushed not dirty metadata but only dirty data pages for a file. Instead of that, use exfat_file_fsync() for files and directories so that it guarantees to commit both the metadata and data pages for a file. Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-29 17:11:18 +09:00
Namjae Jeon	3bcfb70109	exfat: move setting VOL_DIRTY over exfat_remove_entries() Move setting VOL_DIRTY over exfat_remove_entries() to avoid unneeded leaving VOL_DIRTY on -ENOTEMPTY. Fixes: `5f2aa07507` ("exfat: add inode operations") Cc: stable@vger.kernel.org # v5.7 Reported-by: Tetsuhiro Kohada <kohada.t2@gmail.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-29 17:11:13 +09:00
Hyunchul Lee	a0271a15cf	exfat: call sync_filesystem for read-only remount We need to commit dirty metadata and pages to disk before remounting exfat as read-only. This fixes a failure in xfstests generic/452 generic/452 does the following: cp something <exfat>/ mount -o remount,ro <exfat> the <exfat>/something is corrupted. because while exfat is remounted as read-only, exfat doesn't have a chance to commit metadata and vfs invalidates page caches in a block device. Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Acked-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-29 17:11:08 +09:00
Dan Carpenter	e8dd3cda86	exfat: add missing brelse() calls on error paths If the second exfat_get_dentry() call fails then we need to release "old_bh" before returning. There is a similar bug in exfat_move_file(). Fixes: `5f2aa07507` ("exfat: add inode operations") Reported-by: Markus Elfring <Markus.Elfring@web.de> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-29 17:11:05 +09:00
Hyeongseok.Kim	4ba6ccd695	exfat: Set the unused characters of FileName field to the value 0000h Some fsck tool complain that padding part of the FileName field is not set to the value 0000h. So let's maintain filesystem cleaner, as exfat's spec. recommendation. Signed-off-by: Hyeongseok.Kim <Hyeongseok@gmail.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-29 17:11:00 +09:00
Linus Torvalds	bc53f67d24	- Fix build regression on v4.8 and older - Robustness fix for TPM log parsing code - kobject refcount fix for the ESRT parsing code - Two efivarfs fixes to make it behave more like an ordinary file system - Style fixup for zero length arrays - Fix a regression in path separator handling in the initrd loader - Fix a missing prototype warning - Add some kerneldoc headers for newly introduced stub routines - Allow support for SSDT overrides via EFI variables to be disabled - Report CPU mode and MMU state upon entry for 32-bit ARM - Use the correct stack pointer alignment when entering from mixed mode Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl74344RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1heMw//b9UPgWlkH2xnAjo9QeFvounyT8XrLLnW QkhkiIGDvM2qWUmRotRrxRq39P9A+AH4x0krWTZam67W1OuWleUjwQWrnYE8vhql xdIAJmD1oWTi07p4SFzLVA7mJvMX5xenCYvGTALoHtsGnLbOiRGSSTnuXZr1c6Kd 2XcY89kpcZGXgw9VCNV2Ez1g0OlCHS1N5LV31WGUcFl30Q3aZpdLmnFUzKLUbRgb sTNMlu2mLGSs/ZaTAaOGNzFkxGVJI2+0C+ApKvmR9WR7+5n9Brs27RSLgPMViXun BnsTewMdxNBXITgLxcUEtngPEWIzqrwJVbLaZVeWcWez0g11GIt0+wonpRnxWjHA XgQm00sK4HIvs+3YWUJ1PpXyjUmiPvOKZM5um9zsCiYml+RzzIm6bznII4Lh7rQe 4kOLXkxaww+LS4r3+si6Q16og4zd/zZs4MoxaF7frTJ6oiUWOpBJqdf92Kiz0DaS kfQ2I3d/PdZvWuNIiBCfX9bjd7q0zq0zyIghP7460lx88aaHb20samTtl+qjN4MM Wpik/soeYi5pICDRRwiAHhpgK+li4LLjP3D81rYX8pEaAiubpjCwqLxIexQ6XJCV UZAR4swswrYntdXfUMmRnPBsLWWLePq6sRAvlent2si2cp+65f8I1xZ0ClK7YMjr qXUW7jOp/88= =F0bv -----END PGP SIGNATURE----- Merge tag 'efi-urgent-2020-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI fixes from Ingo Molnar: - Fix build regression on v4.8 and older - Robustness fix for TPM log parsing code - kobject refcount fix for the ESRT parsing code - Two efivarfs fixes to make it behave more like an ordinary file system - Style fixup for zero length arrays - Fix a regression in path separator handling in the initrd loader - Fix a missing prototype warning - Add some kerneldoc headers for newly introduced stub routines - Allow support for SSDT overrides via EFI variables to be disabled - Report CPU mode and MMU state upon entry for 32-bit ARM - Use the correct stack pointer alignment when entering from mixed mode * tag 'efi-urgent-2020-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi/libstub: arm: Print CPU boot mode and MMU state at boot efi/libstub: arm: Omit arch specific config table matching array on arm64 efi/x86: Setup stack correctly for efi_pe_entry efi: Make it possible to disable efivar_ssdt entirely efi/libstub: Descriptions for stub helper functions efi/libstub: Fix path separator regression efi/libstub: Fix missing-prototype warning for skip_spaces() efi: Replace zero-length array and use struct_size() helper efivarfs: Don't return -EINTR when rate-limiting reads efivarfs: Update inode modification time for successful writes efi/esrt: Fix reference count leak in esre_create_sysfs_entry. efi/tpm: Verify event log header before parsing efi/x86: Fix build with gcc 4	2020-06-28 11:42:16 -07:00
Pavel Begunkov	f3a6fa2267	io_uring: fix iopoll -EAGAIN handling req->iopoll() is not necessarily called by a task that submitted a request. Because of that, it's dangerous to grab_env() and punt async on -EGAIN, potentially grabbing another task's mm and corrupting its memory. Do resubmit from the submitter task context. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-28 08:13:03 -06:00
Pavel Begunkov	3adfecaa64	io_uring: do task_work_run() during iopoll There are a lot of new users of task_work, and some of task_work_add() may happen while we do io polling, thus make iopoll from time to time to do task_work_run(), so it doesn't poll for sitting there reqs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-28 08:13:03 -06:00
Pavel Begunkov	6795c5aba2	io_uring: clean up req->result setting by rw Assign req->result to io_size early in io_{read,write}(), it's enough and makes it more straightforward. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-28 08:10:17 -06:00
Pavel Begunkov	9b0d911acc	io_uring: kill REQ_F_LINK_NEXT After pulling nxt from a request, it's no more a links head, so clear REQ_F_LINK_HEAD. Absence of this flag also indicates that there are no linked requests, so replacing REQ_F_LINK_NEXT, which can be killed. Linked timeouts also behave leaving the flag intact when necessary. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-28 08:10:17 -06:00
Pavel Begunkov	2d6500d44c	io_uring: cosmetic changes for batch free Move all batch free bits close to each other and rename in a consistent way. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-28 08:10:17 -06:00
Pavel Begunkov	c352438333	io_uring: batch-free linked requests as well There is no reason to not batch deallocation of linked requests. Take away its next req first and handle it as everything else in io_req_multi_free(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-28 08:10:17 -06:00
Pavel Begunkov	2757a23e7f	io_uring: dismantle req early and remove need_iter Every request in io_req_multi_free() is has ->file set. Instead of pointlessly defering and counting reqs with file, dismantle it on place and save for batch dealloc. It also saves us from potentially skipping io_cleanup_req(), put_task(), etc. Never happens though, becacuse ->file is always there. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-28 08:10:17 -06:00
Pavel Begunkov	e6543a816e	io_uring: remove inflight batching in free_many() io_free_req_many() is used only for iopoll requests, i.e. reads/writes. Hence no need to batch inflight unhooking. For safety, it'll be done by io_dismantle_req(), which replaces __io_req_aux_free(), and looks more solid and cleaner. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-28 08:10:17 -06:00
Pavel Begunkov	8c9cb6cd9a	io_uring: fix refs underflow in io_iopoll_queue() Now io_complete_rw_common() puts a ref, extra io_req_put() in io_iopoll_queue() causes undeflow. Remove it. [ 455.998620] refcount_t: underflow; use-after-free. [ 455.998743] WARNING: CPU: 6 PID: 285394 at lib/refcount.c:28 refcount_warn_saturate+0xae/0xf0 [ 455.998772] CPU: 6 PID: 285394 Comm: read-write2 Tainted: G I E 5.8.0-rc2-00048-g1b1aa738f167-dirty #509 [ 455.998772] RIP: 0010:refcount_warn_saturate+0xae/0xf0 ... [ 455.998778] Call Trace: [ 455.998778] io_put_req+0x44/0x50 [ 455.998778] io_iopoll_complete+0x245/0x370 [ 455.998779] io_iopoll_getevents+0x12f/0x1a0 [ 455.998779] io_iopoll_reap_events.part.0+0x5e/0xa0 [ 455.998780] io_ring_ctx_wait_and_kill+0x132/0x1c0 [ 455.998780] io_uring_release+0x20/0x30 [ 455.998780] __fput+0xcd/0x230 [ 455.998781] ____fput+0xe/0x10 [ 455.998781] task_work_run+0x67/0xa0 [ 455.998781] do_exit+0x35d/0xb70 [ 455.998782] do_group_exit+0x43/0xa0 [ 455.998783] get_signal+0x140/0x900 [ 455.998783] do_signal+0x37/0x780 [ 455.998784] __prepare_exit_to_usermode+0x126/0x1c0 [ 455.998785] __syscall_return_slowpath+0x3b/0x1c0 [ 455.998785] do_syscall_64+0x5f/0xa0 [ 455.998785] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: `a1d7c393c4` ("io_uring: enable READ/WRITE to use deferred completions") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-28 08:10:17 -06:00
Pavel Begunkov	710c2bfb66	io_uring: fix missing io_grab_files() We won't have valid ring_fd, ring_file in task work. Grab files early. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-28 08:10:17 -06:00
Pavel Begunkov	a6d45dd0d4	io_uring: don't mark link's head for_async No reason to mark a head of a link as for-async in io_req_defer_prep(). grab_env(), etc. That will be done further during submission if neccessary. Mark for_async=false saving extra grab_env() in many cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-28 08:10:17 -06:00
Pavel Begunkov	1bcb8c5d65	io_uring: fix feeding io-wq with uninit reqs io_steal_work() can't be sure that @nxt has req->work properly set, so we can't pass it to io-wq as is. A dirty quick fix -- drag it through io_req_task_queue(), and always return NULL from io_steal_work(). e.g. [ 50.770161] BUG: kernel NULL pointer dereference, address: 00000000 [ 50.770164] #PF: supervisor write access in kernel mode [ 50.770164] #PF: error_code(0x0002) - not-present page [ 50.770168] CPU: 1 PID: 1448 Comm: io_wqe_worker-0 Tainted: G I 5.8.0-rc2-00035-g2237d76530eb-dirty #494 [ 50.770172] RIP: 0010:override_creds+0x19/0x30 ... [ 50.770183] io_worker_handle_work+0x25c/0x430 [ 50.770185] io_wqe_worker+0x2a0/0x350 [ 50.770190] kthread+0x136/0x180 [ 50.770194] ret_from_fork+0x22/0x30 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-28 08:10:17 -06:00
Pavel Begunkov	906a8c3fdb	io_uring: fix punting req w/o grabbed env It's not enough to check for REQ_F_WORK_INITIALIZED and punt async assuming that io_req_work_grab_env() was called, it may not have been. E.g. io_close_prep() and personality path set the flag without further async init. As a quick fix, always pass next work through io_req_task_queue(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-28 08:10:16 -06:00
Pavel Begunkov	8ef77766ba	io_uring: fix req->work corruption req->work and req->task_work are in a union, so io_req_task_queue() screws everything that was in work. De-union them for now. [ 704.367253] BUG: unable to handle page fault for address: ffffffffaf7330d0 [ 704.367256] #PF: supervisor write access in kernel mode [ 704.367256] #PF: error_code(0x0003) - permissions violation [ 704.367261] CPU: 6 PID: 1654 Comm: io_wqe_worker-0 Tainted: G I 5.8.0-rc2-00038-ge28d0bdc4863-dirty #498 [ 704.367265] RIP: 0010:_raw_spin_lock+0x1e/0x36 ... [ 704.367276] __alloc_fd+0x35/0x150 [ 704.367279] __get_unused_fd_flags+0x25/0x30 [ 704.367280] io_openat2+0xcb/0x1b0 [ 704.367283] io_issue_sqe+0x36a/0x1320 [ 704.367294] io_wq_submit_work+0x58/0x160 [ 704.367295] io_worker_handle_work+0x2a3/0x430 [ 704.367296] io_wqe_worker+0x2a0/0x350 [ 704.367301] kthread+0x136/0x180 [ 704.367304] ret_from_fork+0x22/0x30 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-28 08:10:10 -06:00
David Howells	719fdd3292	afs: Fix storage of cell names The cell name stored in the afs_cell struct is a 64-char + NUL buffer - when it needs to be able to handle up to AFS_MAXCELLNAME (256 chars) + NUL. Fix this by changing the array to a pointer and allocating the string. Found using Coverity. Fixes: `989782dcdc` ("afs: Overhaul cell database management") Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-27 22:04:24 -07:00
Linus Torvalds	916a3b0fc1	6 cifs/smb3 fixes, 3 for stable. Fixes xfstests 451, 313 and 316 -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl73SpUACgkQiiy9cAdy T1EseAwAkwkY8r/7LbNDnil+xQivdCfxuc+FCYXpw3HBvR/Zfjb+n/01RIpJoJw7 kl6MyFUBALrNFY6DvhsNErn7cP9O5Fjg73AfDfE2ySG4N+xZt+EcbbNZ6MtWwdQQ a+ZzelGkT1lg+x4Xzz6oy9eWjvHPu6V9e8ycWjl2uRc7I19ze9NinV0rWOp80DAN uiVEZo/5f28qTYIVP9rFayKN4TcOQYYRYLukP9zH9s0EBvLYQHGefvE8f01iLdm4 JyDi/4hmGIS4e7IaROImX25DKxPQTVUytjhmxHdmjg1Or0O3WMSr7zLWauJNn1G8 /820ec/CgBLtqpD6Y9vUar01+U3Q7Qms/UrEwx+WVVpZPDFVNKDfd6aLlj+UCJeQ PHERRVKdHMyz5iaqY4hZhS90uizt4mHAmoNf+YcbjdaiBvebqaAuo/foIwadYBEm 1ZGevYUIt3cpvbAIv/I3OSrTSvY1/OQZmkHj5IZ0iZXdJaMeOhgrXYyIeL95aJEU d6x8VYpI =5BTi -----END PGP SIGNATURE----- Merge tag '5.8-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "Six cifs/smb3 fixes, three of them for stable. Fixes xfstests 451, 313 and 316" * tag '5.8-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: misc: Use array_size() in if-statement controlling expression cifs: update ctime and mtime during truncate cifs/smb3: Fix data inconsistent when punch hole cifs/smb3: Fix data inconsistent when zero file range cifs: Fix double add page to memcg when cifs_readpages cifs: Fix cached_fid refcnt leak in open_shroot	2020-06-27 15:24:04 -07:00
Linus Torvalds	4e99b32169	NFS Client Bugfixes for Linux 5.8-rc Stable Fixes: - xprtrdma: Fix handling of RDMA_ERROR replies - sunrpc: Fix rollback in rpc_gssd_dummy_populate() - pNFS/flexfiles: Fix list corruption if the mirror count changes - NFSv4: Fix CLOSE not waiting for direct IO completion - SUNRPC: Properly set the @subbuf parameter of xdr_buf_subsegment() Other Fixes: - xprtrdma: Fix a use-after-free with r_xprt->rx_ep - Fix other xprtrdma races during disconnect - NFS: Fix memory leak of export_path -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl72aFkACgkQ18tUv7Cl QOsWfg//ewmCjJV1LGJJM2ntxcN9xAZJdIY3cuYfxQaDr/qwdbgh8DNbPlkImaoB aW5DVciqKJ8HpJchko4wYvNbbnAI32Kd87RcmUYoXwwUY+H2kwuOf41Vm4jfrScF NHiN5b5GTUz2X/s83NsbE9uGCFE1TS8pJkn6chVEWJY+QOjWpQmJrFQ0E9ULwP1O g46Dym9RtILrsNyGcSks6Rnts4Ujm3+PDW+hLWjGzwotDgMS2LGZ7oQpfcs0NvHs A3RjSOywltockeKvqchibTZMAXjIxqLV8cmo6AsT2H3llGbr+F61DkBMuTgqozhp QAONwvxDv6EcnsS5NnOJJdhwG7IK1dPIA5oxmGq7XlhShZF+hrfvGYyhkmDkdf8V 9wfpV6foPC07hTcd+h0+A5DTh4Bxi71q+VIvVyQzgvX4UgRMrRptkNUzAm/Tn56C JoFtjxswy0W476rqYaIJKjs/Mv1eozwvEifIuwpMu+VWiwiNEygNKyvmdVYxeDmv 13hjXVbQCCjyPvQSmBRKUEOR07DxHUt5Kcy9xHQ5ZXr5KdCERSt9MfXucxUxMQTA JG143HPt3P7tkr+1wIyerN94w0kZGQqtQR/BHd5Ms0abrv+jgqjQVleFd4vX2igU o/pCH4SLEhEndChU6lvv534ilRSH5LLQifyV2ThFFdZpOhtw7tU= =BzX9 -----END PGP SIGNATURE----- Merge tag 'nfs-for-5.8-2' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client bugfixes from Anna Schumaker: "Stable Fixes: - xprtrdma: Fix handling of RDMA_ERROR replies - sunrpc: Fix rollback in rpc_gssd_dummy_populate() - pNFS/flexfiles: Fix list corruption if the mirror count changes - NFSv4: Fix CLOSE not waiting for direct IO completion - SUNRPC: Properly set the @subbuf parameter of xdr_buf_subsegment() Other Fixes: - xprtrdma: Fix a use-after-free with r_xprt->rx_ep - Fix other xprtrdma races during disconnect - NFS: Fix memory leak of export_path" * tag 'nfs-for-5.8-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: SUNRPC: Properly set the @subbuf parameter of xdr_buf_subsegment() NFSv4 fix CLOSE not waiting for direct IO compeletion pNFS/flexfiles: Fix list corruption if the mirror count changes nfs: Fix memory leak of export_path sunrpc: fixed rollback in rpc_gssd_dummy_populate() xprtrdma: Fix handling of RDMA_ERROR replies xprtrdma: Clean up disconnect xprtrdma: Clean up synopsis of rpcrdma_flush_disconnect() xprtrdma: Use re_connect_status safely in rpcrdma_xprt_connect() xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed	2020-06-27 09:35:47 -07:00
Linus Torvalds	ab0f2473d3	io_uring-5.8-2020-06-26 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl72TkcQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpsr2D/40VWZtyhFeUlpE+Qiodz0ZSmREZxCj1QQ1 oE8vvyfGVjYbGAeWUsR7hpXBTv9bnEHpF7GRumjOAuML5+cuhh1XSWUHitnJcuHC dX6K3ueh6x8l3mn2EKK/NOaHT6/4STwO7er3lX1wQIAAhIXp7Och2geOL+a3PoZd NMcGaQ3aPrr0Qo7hW7ZMaAmYROewLvZ7p8aIowmBXqTT1Qxy9Ig29HtaDbEkno0X TWy/tuU73nli4QWwWIst14Oeqfm81xDjLRSDa9tID0nvn5ZtB6wy8yAa0QRZS90w t9dB02VVQl+Ql4ZzrXnRTJciP6B4jFvir61oq9vSnDp51LQGyQb/rATXaoiEPPc3 uQARCrB4MDAWFs70BX6MFprI0NNZIdCZK+Okaki2HsjnI5uJQvN5Hrlmo1Khyate doO9HjQtDenFyQcha+ea0SUWzXKV/Uss4WemES5Sem6CFPVMkZ/vco2d7D6PEJc1 AX5efoiBcd/NNL5XfVQoe7HTuCHIczXXEHP2FAgJc8q1lp7ROUnWQZsm5968ERqs pelRq5jHNd9ZF29jEfnYvxidJCc1+34YrKmQ9OPgJkqaoQ9aBGANsI9eM6cQ5CLx X7riSQh+BTqdAtczT5HDFX15GF9VxsD3CGaOrhG1f7aZm7J19bIImP5+Uh/AHY49 iBkyVZ7fNA== =ar3Q -----END PGP SIGNATURE----- Merge tag 'io_uring-5.8-2020-06-26' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "Three small fixes: - Close a corner case for polled IO resubmission (Pavel) - Toss commands when exiting (Pavel) - Fix SQPOLL conditional reschedule on perpetually busy submit (Xuan)" * tag 'io_uring-5.8-2020-06-26' of git://git.kernel.dk/linux-block: io_uring: fix current->mm NULL dereference on exit io_uring: fix hanging iopoll in case of -EAGAIN io_uring: fix io_sq_thread no schedule when busy	2020-06-27 09:02:49 -07:00
Randy Dunlap	1e16c2f917	io_uring: fix function args for !CONFIG_NET Fix build errors when CONFIG_NET is not set/enabled: ../fs/io_uring.c:5472:10: error: too many arguments to function ‘io_sendmsg’ ../fs/io_uring.c:5474:10: error: too many arguments to function ‘io_send’ ../fs/io_uring.c:5484:10: error: too many arguments to function ‘io_recvmsg’ ../fs/io_uring.c:5486:10: error: too many arguments to function ‘io_recv’ ../fs/io_uring.c:5510:9: error: too many arguments to function ‘io_accept’ ../fs/io_uring.c:5518:9: error: too many arguments to function ‘io_connect’ Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: io-uring@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-26 19:46:18 -06:00
Jens Axboe	2237d76530	Merge branch 'io_uring-5.8' into for-5.9/io_uring Merge in changes that went into 5.8-rc3. GIT will silently do the merge, but we still need a tweak on top of that since io_complete_rw_common() was modified to take a io_comp_state pointer. The auto-merge fails on that, and we end up with something that doesn't compile. * io_uring-5.8: io_uring: fix current->mm NULL dereference on exit io_uring: fix hanging iopoll in case of -EAGAIN io_uring: fix io_sq_thread no schedule when busy Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-26 13:44:16 -06:00
Linus Torvalds	7c902e2730	Merge branch 'akpm' (patches from Andrew) Merge misx fixes from Andrew Morton: "31 patches. Subsystems affected by this patch series: hotfixes, mm/pagealloc, kexec, ocfs2, lib, mm/slab, mm/slab, mm/slub, mm/swap, mm/pagemap, mm/vmalloc, mm/memcg, mm/gup, mm/thp, mm/vmscan, x86, mm/memory-hotplug, MAINTAINERS" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (31 commits) MAINTAINERS: update info for sparse mm/memory_hotplug.c: fix false softlockup during pfn range removal mm: remove vmalloc_exec arm64: use PAGE_KERNEL_ROX directly in alloc_insn_page x86/hyperv: allocate the hypercall page with only read and execute bits mm/memory: fix IO cost for anonymous page mm/swap: fix for "mm: workingset: age nonresident information alongside anonymous pages" mm: workingset: age nonresident information alongside anonymous pages doc: THP CoW fault no longer allocate THP docs: mm/gup: minor documentation update mm/memcontrol.c: prevent missed memory.low load tears mm/memcontrol.c: add missed css_put() mm: memcontrol: handle div0 crash race condition in memory.low mm/vmalloc.c: fix a warning while make xmldocs media: omap3isp: remove cacheflush.h make asm-generic/cacheflush.h more standalone mm/debug_vm_pgtable: fix build failure with powerpc 8xx mm/memory.c: properly pte_offset_map_lock/unlock in vm_insert_pages() mm: fix swap cache node allocation mask slub: cure list_slab_objects() from double fix ...	2020-06-26 12:19:36 -07:00
Pavel Begunkov	f4db7182e0	io-wq: return next work from ->do_work() directly It's easier to return next work from ->do_work() than having an in-out argument. Looks nicer and easier to compile. Also, merge io_wq_assign_next() into its only user. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-26 10:34:27 -06:00
Pavel Begunkov	e883a79d8c	io-wq: compact io-wq flags numbers Renumerate IO_WQ flags, so they take adjacent bits Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-26 10:34:27 -06:00
Jens Axboe	c40f63790e	io_uring: use task_work for links if possible Currently links are always done in an async fashion, unless we catch them inline after we successfully complete a request without having to resort to blocking. This isn't necessarily the most efficient approach, it'd be more ideal if we could just use the task_work handling for this. Outside of saving an async jump, we can also do less prep work for these kinds of requests. Running dependent links from the task_work handler yields some nice performance benefits. As an example, examples/link-cp from the liburing repository uses read+write links to implement a copy operation. Without this patch, the a cache fold 4G file read from a VM runs in about 3 seconds: $ time examples/link-cp /data/file /dev/null real 0m2.986s user 0m0.051s sys 0m2.843s and a subsequent cache hot run looks like this: $ time examples/link-cp /data/file /dev/null real 0m0.898s user 0m0.069s sys 0m0.797s With this patch in place, the cold case takes about 2.4 seconds: $ time examples/link-cp /data/file /dev/null real 0m2.400s user 0m0.020s sys 0m2.366s and the cache hot case looks like this: $ time examples/link-cp /data/file /dev/null real 0m0.676s user 0m0.010s sys 0m0.665s As expected, the (mostly) cache hot case yields the biggest improvement, running about 25% faster with this change, while the cache cold case yields about a 20% increase in performance. Outside of the performance increase, we're using less CPU as well, as we're not using the async offload threads at all for this anymore. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-26 10:34:23 -06:00
Olga Kornievskaia	d03727b248	NFSv4 fix CLOSE not waiting for direct IO compeletion Figuring out the root case for the REMOVE/CLOSE race and suggesting the solution was done by Neil Brown. Currently what happens is that direct IO calls hold a reference on the open context which is decremented as an asynchronous task in the nfs_direct_complete(). Before reference is decremented, control is returned to the application which is free to close the file. When close is being processed, it decrements its reference on the open_context but since directIO still holds one, it doesn't sent a close on the wire. It returns control to the application which is free to do other operations. For instance, it can delete a file. Direct IO is finally releasing its reference and triggering an asynchronous close. Which races with the REMOVE. On the server, REMOVE can be processed before the CLOSE, failing the REMOVE with EACCES as the file is still opened. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Suggested-by: Neil Brown <neilb@suse.com> CC: stable@vger.kernel.org Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-06-26 08:43:14 -04:00
Trond Myklebust	8b04013737	pNFS/flexfiles: Fix list corruption if the mirror count changes If the mirror count changes in the new layout we pick up inside ff_layout_pg_init_write(), then we can end up adding the request to the wrong mirror and corrupting the mirror->pg_list. Fixes: `d600ad1f2b` ("NFS41: pop some layoutget errors to application") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-06-26 08:43:14 -04:00
Tom Rix	4659ed7cc8	nfs: Fix memory leak of export_path The try_location function is called within a loop by nfs_follow_referral. try_location calls nfs4_pathname_string to created the export_path. nfs4_pathname_string allocates the memory. export_path is stored in the nfs_fs_context/fs_context structure similarly as hostname and source. But whereas the ctx hostname and source are freed before assignment, export_path is not. So if there are multiple loops, the new export_path will overwrite the old without the old being freed. So call kfree for export_path. Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-06-26 08:43:14 -04:00
Junxiao Bi	9277f8334f	ocfs2: fix value of OCFS2_INVALID_SLOT In the ocfs2 disk layout, slot number is 16 bits, but in ocfs2 implementation, slot number is 32 bits. Usually this will not cause any issue, because slot number is converted from u16 to u32, but OCFS2_INVALID_SLOT was defined as -1, when an invalid slot number from disk was obtained, its value was (u16)-1, and it was converted to u32. Then the following checking in get_local_system_inode will be always skipped: static struct inode *get_local_system_inode(struct ocfs2_super osb, int type, u32 slot) { BUG_ON(slot == OCFS2_INVALID_SLOT); ... } Link: http://lkml.kernel.org/r/20200616183829.87211-5-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-26 00:27:37 -07:00
Junxiao Bi	e5a15e17a7	ocfs2: fix panic on nfs server over ocfs2 The following kernel panic was captured when running nfs server over ocfs2, at that time ocfs2_test_inode_bit() was checking whether one inode locating at "blkno" 5 was valid, that is ocfs2 root inode, its "suballoc_slot" was OCFS2_INVALID_SLOT(65535) and it was allocted from //global_inode_alloc, but here it wrongly assumed that it was got from per slot inode alloctor which would cause array overflow and trigger kernel panic. BUG: unable to handle kernel paging request at 0000000000001088 IP: [<ffffffff816f6898>] _raw_spin_lock+0x18/0xf0 PGD 1e06ba067 PUD 1e9e7d067 PMD 0 Oops: 0002 [#1] SMP CPU: 6 PID: 24873 Comm: nfsd Not tainted 4.1.12-124.36.1.el6uek.x86_64 #2 Hardware name: Huawei CH121 V3/IT11SGCA1, BIOS 3.87 02/02/2018 RIP: _raw_spin_lock+0x18/0xf0 RSP: e02b:ffff88005ae97908 EFLAGS: 00010206 RAX: ffff88005ae98000 RBX: 0000000000001088 RCX: 0000000000000000 RDX: 0000000000020000 RSI: 0000000000000009 RDI: 0000000000001088 RBP: ffff88005ae97928 R08: 0000000000000000 R09: ffff880212878e00 R10: 0000000000007ff0 R11: 0000000000000000 R12: 0000000000001088 R13: ffff8800063c0aa8 R14: ffff8800650c27d0 R15: 000000000000ffff FS: 0000000000000000(0000) GS:ffff880218180000(0000) knlGS:ffff880218180000 CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000001088 CR3: 00000002033d0000 CR4: 0000000000042660 Call Trace: igrab+0x1e/0x60 ocfs2_get_system_file_inode+0x63/0x3a0 [ocfs2] ocfs2_test_inode_bit+0x328/0xa00 [ocfs2] ocfs2_get_parent+0xba/0x3e0 [ocfs2] reconnect_path+0xb5/0x300 exportfs_decode_fh+0xf6/0x2b0 fh_verify+0x350/0x660 [nfsd] nfsd4_putfh+0x4d/0x60 [nfsd] nfsd4_proc_compound+0x3d3/0x6f0 [nfsd] nfsd_dispatch+0xe0/0x290 [nfsd] svc_process_common+0x412/0x6a0 [sunrpc] svc_process+0x123/0x210 [sunrpc] nfsd+0xff/0x170 [nfsd] kthread+0xcb/0xf0 ret_from_fork+0x61/0x90 Code: 83 c2 02 0f b7 f2 e8 18 dc 91 ff 66 90 eb bf 0f 1f 40 00 55 48 89 e5 41 56 41 55 41 54 53 0f 1f 44 00 00 48 89 fb ba 00 00 02 00 <f0> 0f c1 17 89 d0 45 31 e4 45 31 ed c1 e8 10 66 39 d0 41 89 c6 RIP _raw_spin_lock+0x18/0xf0 CR2: 0000000000001088 ---[ end trace 7264463cd1aac8f9 ]--- Kernel panic - not syncing: Fatal exception Link: http://lkml.kernel.org/r/20200616183829.87211-4-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-26 00:27:37 -07:00
Junxiao Bi	7569d3c754	ocfs2: load global_inode_alloc Set global_inode_alloc as OCFS2_FIRST_ONLINE_SYSTEM_INODE, that will make it load during mount. It can be used to test whether some global/system inodes are valid. One use case is that nfsd will test whether root inode is valid. Link: http://lkml.kernel.org/r/20200616183829.87211-3-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-26 00:27:37 -07:00
Junxiao Bi	4cd9973f9f	ocfs2: avoid inode removal while nfsd is accessing it Patch series "ocfs2: fix nfsd over ocfs2 issues", v2. This is a series of patches to fix issues on nfsd over ocfs2. patch 1 is to avoid inode removed while nfsd access it patch 2 & 3 is to fix a panic issue. This patch (of 4): When nfsd is getting file dentry using handle or parent dentry of some dentry, one cluster lock is used to avoid inode removed from other node, but it still could be removed from local node, so use a rw lock to avoid this. Link: http://lkml.kernel.org/r/20200616183829.87211-1-junxiao.bi@oracle.com Link: http://lkml.kernel.org/r/20200616183829.87211-2-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-26 00:27:36 -07:00
Linus Torvalds	52366a107b	\n -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl706ikACgkQnJ2qBz9k QNkk/Af9E2/VzEy4CNsGWTBdxRCZQ12Q3n1pe+ReqkmQDEWjN4FxTuhukw9dtsxE a6ZIm9EXOyFmu+LnrSFoskWDBDCrgwo2zOF2kW/pjs9KRW04l0sWuGEI5btKW9/2 Q/uFUJjpgrQ3sxSbj2Df0Q6k0CVBQMTzoJvH2QobViRgzoJeSMr0nE+Sw7PRHzOB Wh3Fis65B8ZrxBMnTPuwzo3zLrvvqtzW6MGRSK0HxOBR1R9KCWvkJgBdyMy80/tg bX2VvpUL6FRUmc36B1VJ/d3hon13nQ0GthTvD1FuBYHmVf/z5AU1gtQOIGl5QkWi Q6PoW+lL8m+gTcN29stz1KHHrvhPbQ== =nQGb -----END PGP SIGNATURE----- Merge tag 'fsnotify_for_v5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify fixlet from Jan Kara: "A performance improvement to reduce impact of fsnotify for inodes where it isn't used" * tag 'fsnotify_for_v5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fs: Do not check if there is a fsnotify watcher on pseudo inodes	2020-06-25 13:02:58 -07:00
Jens Axboe	a1d7c393c4	io_uring: enable READ/WRITE to use deferred completions A bit more surgery required here, as completions are generally done through the kiocb->ki_complete() callback, even if they complete inline. This enables the regular read/write path to use the io_comp_state logic to batch inline completions. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-25 07:23:49 -06:00
Jens Axboe	229a7b6350	io_uring: pass in completion state to appropriate issue side handlers Provide the completion state to the handlers that we know can complete inline, so they can utilize this for batching completions. Cap the max batch count at 32. This should be enough to provide a good amortization of the cost of the lock+commit dance for completions, while still being low enough not to cause any real latency issues for SQPOLL applications. Xuan Zhuo <xuanzhuo@linux.alibaba.com> reports that this changes his profile from: 17.97% [kernel] [k] copy_user_generic_unrolled 13.92% [kernel] [k] io_commit_cqring 11.04% [kernel] [k] __io_cqring_fill_event 10.33% [kernel] [k] udp_recvmsg 5.94% [kernel] [k] skb_release_data 4.31% [kernel] [k] udp_rmem_release 2.68% [kernel] [k] __check_object_size 2.24% [kernel] [k] __slab_free 2.22% [kernel] [k] _raw_spin_lock_bh 2.21% [kernel] [k] kmem_cache_free 2.13% [kernel] [k] free_pcppages_bulk 1.83% [kernel] [k] io_submit_sqes 1.38% [kernel] [k] page_frag_free 1.31% [kernel] [k] inet_recvmsg to 19.99% [kernel] [k] copy_user_generic_unrolled 11.63% [kernel] [k] skb_release_data 9.36% [kernel] [k] udp_rmem_release 8.64% [kernel] [k] udp_recvmsg 6.21% [kernel] [k] __slab_free 4.39% [kernel] [k] __check_object_size 3.64% [kernel] [k] free_pcppages_bulk 2.41% [kernel] [k] kmem_cache_free 2.00% [kernel] [k] io_submit_sqes 1.95% [kernel] [k] page_frag_free 1.54% [kernel] [k] io_put_req [...] 0.07% [kernel] [k] io_commit_cqring 0.44% [kernel] [k] __io_cqring_fill_event Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-25 07:23:46 -06:00
Jens Axboe	f13fad7ba4	io_uring: pass down completion state on the issue side No functional changes in this patch, just in preparation for having the completion state be available on the issue side. Later on, this will allow requests that complete inline to be completed in batches. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-25 07:23:44 -06:00
Jens Axboe	013538bd65	io_uring: add 'io_comp_state' to struct io_submit_state No functional changes in this patch, just in preparation for passing back pending completions to the caller and completing them in a batched fashion. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-25 07:22:50 -06:00
Jens Axboe	e1e16097e2	io_uring: provide generic io_req_complete() helper We have lots of callers of: io_cqring_add_event(req, result); io_put_req(req); Provide a helper that does this for us. It helps clean up the code, and also provides a more convenient location for us to change the completion handling. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-25 07:22:41 -06:00
Pavel Begunkov	d3cac64c49	io_uring: fix NULL-mm for linked reqs __io_queue_sqe() tries to handle all request of a link, so it's not enough to grab mm in io_sq_thread_acquire_mm() based just on the head. Don't check req->needs_mm and do it always. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>	2020-06-25 07:22:38 -06:00
Pavel Begunkov	d60b5fbc1c	io_uring: fix current->mm NULL dereference on exit Don't reissue requests from io_iopoll_reap_events(), the task may not have mm, which ends up with NULL. It's better to kill everything off on exit anyway. [ 677.734670] RIP: 0010:io_iopoll_complete+0x27e/0x630 ... [ 677.734679] Call Trace: [ 677.734695] ? __send_signal+0x1f2/0x420 [ 677.734698] ? _raw_spin_unlock_irqrestore+0x24/0x40 [ 677.734699] ? send_signal+0xf5/0x140 [ 677.734700] io_iopoll_getevents+0x12f/0x1a0 [ 677.734702] io_iopoll_reap_events.part.0+0x5e/0xa0 [ 677.734703] io_ring_ctx_wait_and_kill+0x132/0x1c0 [ 677.734704] io_uring_release+0x20/0x30 [ 677.734706] __fput+0xcd/0x230 [ 677.734707] ____fput+0xe/0x10 [ 677.734709] task_work_run+0x67/0xa0 [ 677.734710] do_exit+0x35d/0xb70 [ 677.734712] do_group_exit+0x43/0xa0 [ 677.734713] get_signal+0x140/0x900 [ 677.734715] do_signal+0x37/0x780 [ 677.734717] ? enqueue_hrtimer+0x41/0xb0 [ 677.734718] ? recalibrate_cpu_khz+0x10/0x10 [ 677.734720] ? ktime_get+0x3e/0xa0 [ 677.734721] ? lapic_next_deadline+0x26/0x30 [ 677.734723] ? tick_program_event+0x4d/0x90 [ 677.734724] ? __hrtimer_get_next_event+0x4d/0x80 [ 677.734726] __prepare_exit_to_usermode+0x126/0x1c0 [ 677.734741] prepare_exit_to_usermode+0x9/0x40 [ 677.734742] idtentry_exit_cond_rcu+0x4c/0x60 [ 677.734743] sysvec_reschedule_ipi+0x92/0x160 [ 677.734744] ? asm_sysvec_reschedule_ipi+0xa/0x20 [ 677.734745] asm_sysvec_reschedule_ipi+0x12/0x20 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-25 07:20:43 -06:00
Pavel Begunkov	cd664b0e35	io_uring: fix hanging iopoll in case of -EAGAIN io_do_iopoll() won't do anything with a request unless req->iopoll_completed is set. So io_complete_rw_iopoll() has to set it, otherwise io_do_iopoll() will poll a file again and again even though the request of interest was completed long time ago. Also, remove -EAGAIN check from io_issue_sqe() as it races with the changed lines. The request will take the long way and be resubmitted from io_iopoll*(). io_kiocb's result and iopoll_completed") Fixes: `bbde017a32` ("io_uring: add memory barrier to synchronize Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-25 07:20:43 -06:00
Linus Torvalds	8be3a53e18	Changes since last update: Fix a regression which uses potential uninitialized high 32-bit value unexpectedly recently observed with specific compiler options. -----BEGIN PGP SIGNATURE----- iIsEABYIADMWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCXvO6thUcaHNpYW5na2Fv QHJlZGhhdC5jb20ACgkQOTcx3B+15gT8eQEA/W9d/II6pqD1KD7Oh7K8AIt7kU46 JTBY6bA/lmMC/GkA/1cqAOxDfEGmWzH5Y/Hz7CLgnsRQYo90i9JZ1tcFAWkK =kUeU -----END PGP SIGNATURE----- Merge tag 'erofs-for-5.8-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fix from Gao Xiang: "Fix a regression which uses potential uninitialized high 32-bit value unexpectedly recently observed with specific compiler options" * tag 'erofs-for-5.8-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: fix partially uninitialized misuse in z_erofs_onlinepage_fixup	2020-06-24 17:39:30 -07:00
Christoph Hellwig	621c1f4294	block: move struct block_device to blk_types.h Move the struct block_device definition together with most of the block layer definitions, as it has nothing to do with the rest of fs.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:16:02 -06:00
Christoph Hellwig	3f1266f1f8	block: move block-related definitions out of fs.h Move most of the block related definition out of fs.h into more suitable headers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:16:02 -06:00
Christoph Hellwig	764b23bd9a	block: mark bd_finish_claiming static Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:16:02 -06:00
Gao Xiang	3c59728288	erofs: fix partially uninitialized misuse in z_erofs_onlinepage_fixup Hongyu reported "id != index" in z_erofs_onlinepage_fixup() with specific aarch64 environment easily, which wasn't shown before. After digging into that, I found that high 32 bits of page->private was set to 0xaaaaaaaa rather than 0 (due to z_erofs_onlinepage_init behavior with specific compiler options). Actually we only use low 32 bits to keep the page information since page->private is only 4 bytes on most 32-bit platforms. However z_erofs_onlinepage_fixup() uses the upper 32 bits by mistake. Let's fix it now. Reported-and-tested-by: Hongyu Jin <hongyu.jin@unisoc.com> Fixes: `3883a79abd` ("staging: erofs: introduce VLE decompression support") Cc: <stable@vger.kernel.org> # 4.19+ Reviewed-by: Chao Yu <yuchao0@huawei.com> Link: https://lore.kernel.org/r/20200618234349.22553-1-hsiangkao@aol.com Signed-off-by: Gao Xiang <hsiangkao@redhat.com>	2020-06-24 09:47:44 +08:00
Gustavo A. R. Silva	bf1028a41e	cifs: misc: Use array_size() in if-statement controlling expression Use array_size() instead of the open-coded version in the controlling expression of the if statement. Also, while there, use the preferred form for passing a size of a struct. The alternative form where struct name is spelled out hurts readability and introduces an opportunity for a bug when the pointer variable type is changed but the corresponding sizeof that is passed as argument is not. This issue was found with the help of Coccinelle and, audited and fixed manually. Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-06-23 19:06:27 -05:00
Zhang Xiaoxu	5618303d85	cifs: update ctime and mtime during truncate As the man description of the truncate, if the size changed, then the st_ctime and st_mtime fields should be updated. But in cifs, we doesn't do it. It lead the xfstests generic/313 failed. So, add the ATTR_MTIME\|ATTR_CTIME flags on attrs when change the file size Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-06-23 19:06:27 -05:00
Zhang Xiaoxu	acc91c2d8d	cifs/smb3: Fix data inconsistent when punch hole When punch hole success, we also can read old data from file: # strace -e trace=pread64,fallocate xfs_io -f -c "pread 20 40" \ -c "fpunch 20 40" -c"pread 20 40" file pread64(3, " version 5.8.0-rc1+"..., 40, 20) = 40 fallocate(3, FALLOC_FL_KEEP_SIZE\|FALLOC_FL_PUNCH_HOLE, 20, 40) = 0 pread64(3, " version 5.8.0-rc1+"..., 40, 20) = 40 CIFS implements the fallocate(FALLOCATE_FL_PUNCH_HOLE) with send SMB ioctl(FSCTL_SET_ZERO_DATA) to server. It just set the range of the remote file to zero, but local page caches not updated, then the local page caches inconsistent with server. Also can be found by xfstests generic/316. So, we need to remove the page caches before send the SMB ioctl(FSCTL_SET_ZERO_DATA) to server. Fixes: `31742c5a33` ("enable fallocate punch hole ("fallocate -p") for SMB3") Suggested-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Cc: stable@vger.kernel.org # v3.17 Signed-off-by: Steve French <stfrench@microsoft.com>	2020-06-23 19:06:27 -05:00
Zhang Xiaoxu	6b69040247	cifs/smb3: Fix data inconsistent when zero file range CIFS implements the fallocate(FALLOC_FL_ZERO_RANGE) with send SMB ioctl(FSCTL_SET_ZERO_DATA) to server. It just set the range of the remote file to zero, but local page cache not update, then the data inconsistent with server, which leads the xfstest generic/008 failed. So we need to remove the local page caches before send SMB ioctl(FSCTL_SET_ZERO_DATA) to server. After next read, it will re-cache it. Fixes: `30175628bf` ("[SMB3] Enable fallocate -z support for SMB3 mounts") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Cc: stable@vger.kernel.org # v3.17 Signed-off-by: Steve French <stfrench@microsoft.com>	2020-06-23 19:06:27 -05:00
Xuan Zhuo	b772f07add	io_uring: fix io_sq_thread no schedule when busy When the user consumes and generates sqe at a fast rate, io_sqring_entries can always get sqe, and ret will not be equal to -EBUSY, so that io_sq_thread will never call cond_resched or schedule, and then we will get the following system error prompt: rcu: INFO: rcu_sched self-detected stall on CPU or watchdog: BUG: soft lockup-CPU#23 stuck for 112s! [io_uring-sq:1863] This patch checks whether need to call cond_resched() by checking the need_resched() function every cycle. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-23 11:54:30 -06:00
Zhang Xiaoxu	95a3d8f3af	cifs: Fix double add page to memcg when cifs_readpages When xfstests generic/451, there is an BUG at mm/memcontrol.c: page:ffffea000560f2c0 refcount:2 mapcount:0 mapping:000000008544e0ea index:0xf mapping->aops:cifs_addr_ops dentry name:"tst-aio-dio-cycle-write.451" flags: 0x2fffff80000001(locked) raw: 002fffff80000001 ffffc90002023c50 ffffea0005280088 ffff88815cda0210 raw: 000000000000000f 0000000000000000 00000002ffffffff ffff88817287d000 page dumped because: VM_BUG_ON_PAGE(page->mem_cgroup) page->mem_cgroup:ffff88817287d000 ------------[ cut here ]------------ kernel BUG at mm/memcontrol.c:2659! invalid opcode: 0000 [#1] SMP CPU: 2 PID: 2038 Comm: xfs_io Not tainted 5.8.0-rc1 #44 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_ 073836-buildvm-ppc64le-16.ppc.4 RIP: 0010:commit_charge+0x35/0x50 Code: 0d 48 83 05 54 b2 02 05 01 48 89 77 38 c3 48 c7 c6 78 4a ea ba 48 83 05 38 b2 02 05 01 e8 63 0d9 RSP: 0018:ffffc90002023a50 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff88817287d000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff88817ac97ea0 RDI: ffff88817ac97ea0 RBP: ffffea000560f2c0 R08: 0000000000000203 R09: 0000000000000005 R10: 0000000000000030 R11: ffffc900020237a8 R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000000001 R15: ffff88815a1272c0 FS: 00007f5071ab0800(0000) GS:ffff88817ac80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055efcd5ca000 CR3: 000000015d312000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: mem_cgroup_charge+0x166/0x4f0 __add_to_page_cache_locked+0x4a9/0x710 add_to_page_cache_locked+0x15/0x20 cifs_readpages+0x217/0x1270 read_pages+0x29a/0x670 page_cache_readahead_unbounded+0x24f/0x390 __do_page_cache_readahead+0x3f/0x60 ondemand_readahead+0x1f1/0x470 page_cache_async_readahead+0x14c/0x170 generic_file_buffered_read+0x5df/0x1100 generic_file_read_iter+0x10c/0x1d0 cifs_strict_readv+0x139/0x170 new_sync_read+0x164/0x250 __vfs_read+0x39/0x60 vfs_read+0xb5/0x1e0 ksys_pread64+0x85/0xf0 __x64_sys_pread64+0x22/0x30 do_syscall_64+0x69/0x150 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f5071fcb1af Code: Bad RIP value. RSP: 002b:00007ffde2cdb8e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011 RAX: ffffffffffffffda RBX: 00007ffde2cdb990 RCX: 00007f5071fcb1af RDX: 0000000000001000 RSI: 000055efcd5ca000 RDI: 0000000000000003 RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000001000 R11: 0000000000000293 R12: 0000000000000001 R13: 000000000009f000 R14: 0000000000000000 R15: 0000000000001000 Modules linked in: ---[ end trace 725fa14a3e1af65c ]--- Since commit `3fea5a499d` ("mm: memcontrol: convert page cache to a new mem_cgroup_charge() API") not cancel the page charge, the pages maybe double add to pagecache: thread1 \| thread2 cifs_readpages readpages_get_pages add_to_page_cache_locked(head,index=n)=0 \| readpages_get_pages \| add_to_page_cache_locked(head,index=n+1)=0 add_to_page_cache_locked(head, index=n+1)=-EEXIST then, will next loop with list head page's index=n+1 and the page->mapping not NULL readpages_get_pages add_to_page_cache_locked(head, index=n+1) commit_charge VM_BUG_ON_PAGE So, we should not do the next loop when any page add to page cache failed. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>	2020-06-23 12:04:52 -05:00
Linus Torvalds	3e08a95294	for-5.8-rc2-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl7yABEACgkQxWXV+ddt WDtGoQ//cBWRRWLlLTRgpaKnY6t8JgVUqNvPJISHHf45cNbOJh0yo8hUuKMW+440 8ovYqtFoZD+JHcHDE2sMueHBFe38rG5eT/zh8j/ruhBzeJcTb3lSYz53d7sfl5kD cIVngPEVlGziDqW2PsWLlyh8ulBGzY3YmS6kAEkyP/6/uhE/B1dq6qn3GUibkbKI dfNjHTLwZVmwnqoxLu8ZE2/hHFbzhl0sm09snsXYSVu13g36+edp0Z+pF0MlKGVk G6YrnZcts8TWwneZ4nogD9f2CMvzMhYDDLyEjsX0Ouhb+Cu2WNxdfrJ2ZbPNU82w EGbo451mIt6Ht8wicEjh27LWLI7YMraF/Ig/ODMdvFBYDbhl4voX2t+4n+p5Czbg AW6Wtg/q5EaaNFqrTsqAAiUn0+R3sMiDWrE0AewcE7syPGqQ2XMwP4la5pZ36rz8 8Vo5KIGo44PIJ1dMwcX+bg3HTtUnBJSxE5fUi0rJ3ZfHKGjLS79VonEeQjh3QD6W 0UlK+jCjo6KZoe33XdVV2hVkHd63ZIlliXWv0LOR+gpmqqgW2b3wf181zTvo/5sI v0fDjstA9caqf68ChPE9jJi7rZPp/AL1yAQGEiNzjKm4U431TeZJl2cpREicMJDg FCDU51t9425h8BFkM4scErX2/53F1SNNNSlAsFBGvgJkx6rTENs= =/eCR -----END PGP SIGNATURE----- Merge tag 'for-5.8-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A number of fixes, located in two areas, one performance fix and one fixup for better integration with another patchset. - bug fixes in nowait aio: - fix snapshot creation hang after nowait-aio was used - fix failure to write to prealloc extent past EOF - don't block when extent range is locked - block group fixes: - relocation failure when scrub runs in parallel - refcount fix when removing fails - fix race between removal and creation - space accounting fixes - reinstante fast path check for log tree at unlink time, fixes performance drop up to 30% in REAIM - kzfree/kfree fixup to ease treewide patchset renaming kzfree" * tag 'for-5.8-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: use kfree() in btrfs_ioctl_get_subvol_info() btrfs: fix RWF_NOWAIT writes blocking on extent locks and waiting for IO btrfs: fix RWF_NOWAIT write not failling when we need to cow btrfs: fix failure of RWF_NOWAIT write into prealloc extent beyond eof btrfs: fix hang on snapshot creation after RWF_NOWAIT write btrfs: check if a log root exists before locking the log_mutex on unlink btrfs: fix bytes_may_use underflow when running balance and scrub in parallel btrfs: fix data block group relocation failure due to concurrent scrub btrfs: fix race between block group removal and block group creation btrfs: fix a block group ref counter leak after failure to remove block group	2020-06-23 09:20:11 -07:00
Dave Chinner	c7f87f3984	xfs: fix use-after-free on CIL context on shutdown xlog_wait() on the CIL context can reference a freed context if the waiter doesn't get scheduled before the CIL context is freed. This can happen when a task is on the hard throttle and the CIL push aborts due to a shutdown. This was detected by generic/019: thread 1 thread 2 __xfs_trans_commit xfs_log_commit_cil <CIL size over hard throttle limit> xlog_wait schedule xlog_cil_push_work wake_up_all <shutdown aborts commit> xlog_cil_committed kmem_free remove_wait_queue spin_lock_irqsave --> UAF Fix it by moving the wait queue to the CIL rather than keeping it in in the CIL context that gets freed on push completion. Because the wait queue is now independent of the CIL context and we might have multiple contexts in flight at once, only wake the waiters on the push throttle when the context we are pushing is over the hard throttle size threshold. Fixes: `0e7ab7efe7` ("xfs: Throttle commits on delayed background CIL push") Reported-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-06-22 19:22:57 -07:00
Xiyu Yang	77577de641	cifs: Fix cached_fid refcnt leak in open_shroot open_shroot() invokes kref_get(), which increases the refcount of the "tcon->crfid" object. When open_shroot() returns not zero, it means the open operation failed and close_shroot() will not be called to decrement the refcount of the "tcon->crfid". The reference counting issue happens in one normal path of open_shroot(). When the cached root have been opened successfully in a concurrent process, the function increases the refcount and jump to "oshr_free" to return. However the current return value "rc" may not equal to 0, thus the increased refcount will not be balanced outside the function, causing a refcnt leak. Fix this issue by setting the value of "rc" to 0 before jumping to "oshr_free" label. Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org>	2020-06-21 22:34:50 -05:00
Pavel Begunkov	f6b6c7d6a9	io_uring: kill NULL checks for submit state After recent changes, io_submit_sqes() always passes valid submit state, so kill leftovers checking it for NULL. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:46:05 -06:00
Pavel Begunkov	b90cd197f9	io_uring: set @poll->file after @poll init It's a good practice to modify fields of a struct after but not before it was initialised. Even though io_init_poll_iocb() doesn't touch poll->file, call it first. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:46:05 -06:00
Pavel Begunkov	24c7467863	io_uring: remove REQ_F_MUST_PUNT REQ_F_MUST_PUNT may seem looking good and clear, but it's the same as not having REQ_F_NOWAIT set. That rather creates more confusion. Moreover, it doesn't even affect any behaviour (e.g. see the patch removing it from io_{read,write}). Kill theg flag and update already outdated comments. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:46:05 -06:00
Pavel Begunkov	62ef731650	io_uring: remove setting REQ_F_MUST_PUNT in rw io_{read,write}() { ... copy_iov: // prep async if (!(flags & REQ_F_NOWAIT) && !file_can_poll(file)) flags \|= REQ_F_MUST_PUNT; } REQ_F_MUST_PUNT there is pointless, because if it happens then REQ_F_NOWAIT is known to be _not_ set, and the request will go async path in __io_queue_sqe() anyway. file_can_poll() check is also repeated in arm_poll*(), so don't need it. Remove the mentioned assignment REQ_F_MUST_PUNT in preparation for killing the flag. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:46:03 -06:00
Jens Axboe	bcf5a06304	io_uring: support true async buffered reads, if file provides it If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt the buffered read to an io-wq worker. Instead we can rely on page unlocking callbacks to support retry based async IO. This is a lot more efficient than doing async thread offload. The retry is done similarly to how we handle poll based retry. From the unlock callback, we simply queue the retry to a task_work based handler. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:44:26 -06:00
Jens Axboe	8730f12b79	btrfs: flag files as supporting buffered async reads btrfs uses generic_file_read_iter(), which already supports this. Acked-by: Chris Mason <clm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:44:25 -06:00
Jens Axboe	f89fb730aa	xfs: flag files as supporting buffered async reads XFS uses generic_file_read_iter(), which already supports this. Acked-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:44:25 -06:00
Jens Axboe	a304f07448	block: flag block devices as supporting IOCB_WAITQ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:44:25 -06:00
Jens Axboe	b63534c41e	io_uring: re-issue block requests that failed because of resources Mark the plug with nowait == true, which will cause requests to avoid blocking on request allocation. If they do, we catch them and reissue them from a task_work based handler. Normally we can catch -EAGAIN directly, but the hard case is for split requests. As an example, the application issues a 512KB request. The block core will split this into 128KB if that's the max size for the device. The first request issues just fine, but we run into -EAGAIN for some latter splits for the same request. As the bio is split, we don't get to see the -EAGAIN until one of the actual reads complete, and hence we cannot handle it inline as part of submission. This does potentially cause re-reads of parts of the range, as the whole request is reissued. There's currently no better way to handle this. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:44:25 -06:00
Jens Axboe	4503b7676a	io_uring: catch -EIO from buffered issue request failure -EIO bubbles up like -EAGAIN if we fail to allocate a request at the lower level. Play it safe and treat it like -EAGAIN in terms of sync retry, to avoid passing back an errant -EIO. Catch some of these early for block based file, as non-mq devices generally do not support NOWAIT. That saves us some overhead by not first trying, then retrying from async context. We can go straight to async punt instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:44:25 -06:00
Jens Axboe	ac8691c415	io_uring: always plug for any number of IOs Currently we only plug if we're doing more than two request. We're going to be relying on always having the plug there to pass down information, so plug unconditionally. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:44:25 -06:00
Bijan Mottahedeh	2e0464d48f	io_uring: separate reporting of ring pages from registered pages Ring pages are not pinned so it is more appropriate to report them as locked. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:44:01 -06:00
Bijan Mottahedeh	309758254e	io_uring: report pinned memory usage Report pinned memory usage always, regardless of whether locked memory limit is enforced. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:44:01 -06:00
Bijan Mottahedeh	aad5d8da1b	io_uring: rename ctx->account_mem field Rename account_mem to limit_name to clarify its purpose. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:44:01 -06:00
Bijan Mottahedeh	a087e2b519	io_uring: add wrappers for memory accounting Facilitate separation of locked memory usage reporting vs. limiting for upcoming patches. No functional changes. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> [axboe: kill unnecessary () around return in io_account_mem()] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:44:00 -06:00
Jiufei Xue	a31eb4a2f1	io_uring: use EPOLLEXCLUSIVE flag to aoid thundering herd type behavior Applications can pass this flag in to avoid accept thundering herd. Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:44:00 -06:00
Jiufei Xue	5769a351b8	io_uring: change the poll type to be 32-bits poll events should be 32-bits to cover EPOLLEXCLUSIVE. Explicit word-swap the poll32_events for big endian to make sure the ABI is not changed. We call this feature IORING_FEAT_POLL_32BITS, applications who want to use EPOLLEXCLUSIVE should check the feature bit first. Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-21 20:44:00 -06:00
Linus Torvalds	8b6ddd10d6	A few fixes and small cleanups for tracing: - Have recordmcount work with > 64K sections (to support LTO) - kprobe RCU fixes - Correct a kprobe critical section with missing mutex - Remove redundant arch_disarm_kprobe() call - Fix lockup when kretprobe triggers within kprobe_flush_task() - Fix memory leak in fetch_op_data operations - Fix sleep in atomic in ftrace trace array sample code - Free up memory on failure in sample trace array code - Fix incorrect reporting of function_graph fields in format file - Fix quote within quote parsing in bootconfig - Fix return value of bootconfig tool - Add testcases for bootconfig tool - Fix maybe uninitialized warning in ftrace pid file code - Remove unused variable in tracing_iter_reset() - Fix some typos -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXu1jrRQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qoCMAP91nOccE3X+Nvc3zET3isDWnl1tWJxk icsBgN/JwBRuTAD/dnWTHIWM2/5lTiagvyVsmINdJHP6JLr8T7dpN9tlxAQ= =Cuo7 -----END PGP SIGNATURE----- Merge tag 'trace-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: - Have recordmcount work with > 64K sections (to support LTO) - kprobe RCU fixes - Correct a kprobe critical section with missing mutex - Remove redundant arch_disarm_kprobe() call - Fix lockup when kretprobe triggers within kprobe_flush_task() - Fix memory leak in fetch_op_data operations - Fix sleep in atomic in ftrace trace array sample code - Free up memory on failure in sample trace array code - Fix incorrect reporting of function_graph fields in format file - Fix quote within quote parsing in bootconfig - Fix return value of bootconfig tool - Add testcases for bootconfig tool - Fix maybe uninitialized warning in ftrace pid file code - Remove unused variable in tracing_iter_reset() - Fix some typos * tag 'trace-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace: Fix maybe-uninitialized compiler warning tools/bootconfig: Add testcase for show-command and quotes test tools/bootconfig: Fix to return 0 if succeeded to show the bootconfig tools/bootconfig: Fix to use correct quotes for value proc/bootconfig: Fix to use correct quotes for value tracing: Remove unused event variable in tracing_iter_reset tracing/probe: Fix memleak in fetch_op_data operations trace: Fix typo in allocate_ftrace_ops()'s comment tracing: Make ftrace packed events have align of 1 sample-trace-array: Remove trace_array 'sample-instance' sample-trace-array: Fix sleeping function called from invalid context kretprobe: Prevent triggering kretprobe from within kprobe_flush_task kprobes: Remove redundant arch_disarm_kprobe() call kprobes: Fix to protect kick_kprobe_optimizer() by kprobe_mutex kprobes: Use non RCU traversal APIs on kprobe_tables if possible kprobes: Suppress the suspicious RCU warning on kprobes recordmcount: support >64k sections	2020-06-20 13:17:47 -07:00
David Howells	5481fc6eb8	afs: Fix hang on rmmod due to outstanding timer The fileserver probe timer, net->fs_probe_timer, isn't cancelled when the kafs module is being removed and so the count it holds on net->servers_outstanding doesn't get dropped.. This causes rmmod to wait forever. The hung process shows a stack like: afs_purge_servers+0x1b5/0x23c [kafs] afs_net_exit+0x44/0x6e [kafs] ops_exit_list+0x72/0x93 unregister_pernet_operations+0x14c/0x1ba unregister_pernet_subsys+0x1d/0x2a afs_exit+0x29/0x6f [kafs] __do_sys_delete_module.isra.0+0x1a2/0x24b do_syscall_64+0x51/0x95 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fix this by: (1) Attempting to cancel the probe timer and, if successful, drop the count that the timer was holding. (2) Make the timer function just drop the count and not schedule the prober if the afs portion of net namespace is being destroyed. Also, whilst we're at it, make the following changes: (3) Initialise net->servers_outstanding to 1 and decrement it before waiting on it so that it doesn't generate wake up events by being decremented to 0 until we're cleaning up. (4) Switch the atomic_dec() on ->servers_outstanding for ->fs_timer in afs_purge_servers() to use the helper function for that. Fixes: `f6cbb368bc` ("afs: Actively poll fileservers to maintain NAT or firewall openings") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-20 12:01:58 -07:00
David Howells	f8ea5c7bce	afs: Fix afs_do_lookup() to call correct fetch-status op variant Fix afs_do_lookup()'s fallback case for when FS.InlineBulkStatus isn't supported by the server. In the fallback, it calls FS.FetchStatus for the specific vnode it's meant to be looking up. Commit `b6489a49f7` broke this by renaming one of the two identically-named afs_fetch_status_operation descriptors to something else so that one of them could be made non-static. The site that used the renamed one, however, wasn't renamed and didn't produce any warning because the other was declared in a header. Fix this by making afs_do_lookup() use the renamed variant. Note that there are two variants of the success method because one is called from ->lookup() where we may or may not have an inode, but can't call iget until after we've talked to the server - whereas the other is called from within iget where we have an inode, but it may or may not be initialised. The latter variant expects there to be an inode, but because it's being called from there former case, there might not be - resulting in an oops like the following: BUG: kernel NULL pointer dereference, address: 00000000000000b0 ... RIP: 0010:afs_fetch_status_success+0x27/0x7e ... Call Trace: afs_wait_for_operation+0xda/0x234 afs_do_lookup+0x2fe/0x3c1 afs_lookup+0x3c5/0x4bd __lookup_slow+0xcd/0x10f walk_component+0xa2/0x10c path_lookupat.isra.0+0x80/0x110 filename_lookup+0x81/0x104 vfs_statx+0x76/0x109 __do_sys_newlstat+0x39/0x6b do_syscall_64+0x4c/0x78 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: `b6489a49f7` ("afs: Fix silly rename") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-20 12:01:58 -07:00
Linus Torvalds	4333a9b0b6	io_uring-5.8-2020-06-19 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7s0e0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgppsQD/9ZD5Rrr1fRqZw29UMvjoKu53Plvc14GA7R Tv44p/qF8KD7mkSY+YB3OIRh+NP7fHPXIJLotjZU9CFpgQtTjkiYbVCxPH5TpwZJ YuEODrLuwuy3o5MU+a4t5uCBoGqDq5Fnz0Kh5kFfOl1D8qBzqczNzL0Ygn1FyoLd LcCchC+kjROX6F+Oo+0onQeOFipSjxkG6LOThSiFsLJAL8huVLDem7ihon/HvqYw 68lAj2X0QlaIMzk0yKJ2LFovQRhk+nlWGtW6XzVCPetbEkFdmGOcEl13orwh/say tkzKGN8O0JLThIxWqEQn1MHK9MeaKlnS9j2tFI3suR65xvjlxE8+cxsmlg/wNzGo UyQgh1M8QvPNvDAXCL4q1k2QmCH0YwTY+pHqCIFDp37LRE6ZPboNWEV55YVB8VpL axLPf89Any8ta3YICFq/Zmm03A/GUmLsxWspgbtOZMT40loNgZ3YoDR7cPfE76jU N2XEZEVOQrvodXW6fjqfx6AAYraxhDo2gh4SZhF0ydXDgGTmK6BcHHu7a3lSp7+e eKgYDgkMsa7bP5Cm3+VaNuv7db84kPEBeXLO5zJ86N/nKk8JE97Tl6uiVC8maBC2 r9ftsQd3fXkwDYmUk81EOjp2+YXY2zEt3vs8cP3euGt39qwjy4mdN05UQa5Xm5BS XLWmpO2teA== =tmwG -----END PGP SIGNATURE----- Merge tag 'io_uring-5.8-2020-06-19' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: - Catch a case where io_sq_thread() didn't do proper mm acquire - Ensure poll completions are reaped on shutdown - Async cancelation and run fixes (Pavel) - io-poll race fixes (Xiaoguang) - Request cleanup race fix (Xiaoguang) * tag 'io_uring-5.8-2020-06-19' of git://git.kernel.dk/linux-block: io_uring: fix possible race condition against REQ_F_NEED_CLEANUP io_uring: reap poll completions while waiting for refs to drop on exit io_uring: acquire 'mm' for task_work for SQPOLL io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed io_uring: don't fail links for EAGAIN error in IOPOLL mode io_uring: cancel by ->task not pid io_uring: lazy get task io_uring: batch cancel in io_uring_cancel_files() io_uring: cancel all task's requests on exit io-wq: add an option to cancel all matched reqs io-wq: reorder cancellation pending -> running io_uring: fix lazy work init	2020-06-19 13:16:58 -07:00
Linus Torvalds	d2b1c81f5f	block-5.8-2020-06-19 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7s0SAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpp+YEACVqFvsfzxKCqa61IzyuOaPfnj9awyP+MY2 7V6y9sDDHL8sp6aPDbHvqFnqz0O7E+7nHVZD2rf2qc6tKKMvJYNO/BFZSXPvWTZV KQ4cBChf/LDwqAKOnI4ZhmF5UcSyyob1yMy4uJ+U0gQiXXrRMbwJ3N1K24a9dr4c epkzGavR0Q+PJ9BbUgjACjbRdT+vrP4bOu0cuyCGkIpD9eCerKJ6mFaUAj0FDthD bg4BJj+c8Ij6LO0V++Wga6OxccmL43KeP0ky8B3x07PfAl+tDWqsbHSlU2YPtdcq 5nKgMMTW16mVnZeO2/W0JB7tn89VubsmyvIFcm2KNeeRqSnEZyW9HI8n4kq994Ju xMH24lgbsU4trNeYkgOmzPoJJZ+LShkn+rnldyI1U/fhpEYub7DqfVySuT7ti9in uFpQdeRUmPsdw92F3+o6h8OYAflpcQQ7CblkzxPEeV4OyzOZasb+S9tMNPe59KBh 0MtHv9IfzgtDihR6HuXifitXaP+GtH4x3D2z0dzEdooHKHC/+P3WycS5daG+3WKQ xV5lJruvpTuxhXKLFAH0wRrxnVlB0VUvhQ21T3WgHrwF0btbdmQMHFc83XOxBIB4 jHWJMHGc4xp1ZdpWFBC8Cj79OmJh1w/ao8+/cf8SUoTB0LzFce1B8LvwnxgpcpUk VjIOrl7zhQ== =LeLd -----END PGP SIGNATURE----- Merge tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - Use import_uuid() where appropriate (Andy) - bcache fixes (Coly, Mauricio, Zhiqiang) - blktrace sparse warnings fix (Jan) - blktrace concurrent setup fix (Luis) - blkdev_get use-after-free fix (Jason) - Ensure all blk-mq maps are updated (Weiping) - Loop invalidate bdev fix (Zheng) * tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block: block: make function 'kill_bdev' static loop: replace kill_bdev with invalidate_bdev partitions/ldm: Replace uuid_copy() with import_uuid() where it makes sense block: update hctx map when use multiple maps blktrace: Avoid sparse warnings when assigning q->blk_trace blktrace: break out of blktrace setup on concurrent calls block: Fix use-after-free in blkdev_get() trace/events/block.h: drop kernel-doc for dropped function parameter blk-mq: Remove redundant 'return' statement bcache: pr_info() format clean up in bcache_device_init() bcache: use delayed kworker fo asynchronous devices registration bcache: check and adjust logical block size for backing devices bcache: fix potential deadlock problem in btree_gc_coalesce	2020-06-19 13:11:26 -07:00
Wang Xiaojun	ba87a45c23	f2fs: use kfree() to free variables allocated by match_strdup() Use kfree() instead of kvfree() to free variables allocated by match_strdup(). Because the memory is allocated with kmalloc inside match_strdup(). Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-06-18 12:37:47 -07:00
Linus Torvalds	5e857ce6ea	Merge branch 'hch' (maccess patches from Christoph Hellwig) Merge non-faulting memory access cleanups from Christoph Hellwig: "Andrew and I decided to drop the patches implementing your suggested rename of the probe_kernel_* and probe_user_* helpers from -mm as there were way to many conflicts. After -rc1 might be a good time for this as all the conflicts are resolved now" This also adds a type safety checking patch on top of the renaming series to make the subtle behavioral difference between 'get_user()' and 'get_kernel_nofault()' less potentially dangerous and surprising. * emailed patches from Christoph Hellwig <hch@lst.de>: maccess: make get_kernel_nofault() check for minimal type compatibility maccess: rename probe_kernel_address to get_kernel_nofault maccess: rename probe_user_{read,write} to copy_{from,to}_user_nofault maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault	2020-06-18 12:35:51 -07:00
Jack Qiu	da52f8ade4	f2fs: get the right gc victim section when section has several segments Assume each section has 4 segment: .___________________________. \|_Segment0_\|_..._\|_Segment3_\| . . . . .__________. \|_section0_\| Segment 0~2 has 0 valid block, segment 3 has 512 valid blocks. It will fail if we want to gc section0 in this scenes, because all 4 segments in section0 is not dirty. So we should use dirty section bitmap instead of dirty segment bitmap to get right victim section. Signed-off-by: Jack Qiu <jack.qiu@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-06-18 12:35:38 -07:00
Wuyun Zhao	db5ae36329	f2fs: fix a race condition between f2fs_write_end_io and f2fs_del_fsync_node_entry Under some condition, the __write_node_page will submit a page which is not f2fs_in_warm_node_list and will not call f2fs_add_fsync_node_entry. f2fs_gc continue to run to invoke f2fs_iget -> do_read_inode to read the same node page and set code node, which make f2fs_in_warm_node_list become true, that will cause f2fs_bug_on in f2fs_del_fsync_node_entry when f2fs_write_end_io called. - f2fs_write_end_io - f2fs_iget - do_read_inode - set_cold_node recover cold node flag - f2fs_in_warm_node_list - is_cold_node if node is cold, assume we have added node to fsync_node_list during writepages() - f2fs_del_fsync_node_entry - f2fs_bug_on() due to node page is not in fsync_node_list [ 34.966133] Call trace: [ 34.969902] f2fs_del_fsync_node_entry+0x100/0x108 [ 34.976071] f2fs_write_end_io+0x1e0/0x288 [ 34.981539] bio_endio+0x248/0x270 [ 34.986289] blk_update_request+0x2b0/0x4d8 [ 34.991841] scsi_end_request+0x40/0x440 [ 34.997126] scsi_io_completion+0xa4/0x748 [ 35.002593] scsi_finish_command+0xdc/0x110 [ 35.008143] scsi_softirq_done+0x118/0x150 [ 35.013610] blk_done_softirq+0x8c/0xe8 [ 35.018811] __do_softirq+0x2e8/0x578 [ 35.023828] irq_exit+0xfc/0x120 [ 35.028398] handle_IPI+0x1d8/0x330 [ 35.033233] gic_handle_irq+0x110/0x1d4 [ 35.038433] el1_irq+0xb4/0x130 [ 35.042917] kmem_cache_alloc+0x3f0/0x418 [ 35.048288] radix_tree_node_alloc+0x50/0xf8 [ 35.053933] __radix_tree_create+0xf8/0x188 [ 35.059484] __radix_tree_insert+0x3c/0x128 [ 35.065035] add_gc_inode+0x90/0x118 [ 35.069967] f2fs_gc+0x1b80/0x2d70 [ 35.074718] f2fs_disable_checkpoint+0x94/0x1d0 [ 35.080621] f2fs_fill_super+0x10c4/0x1b88 [ 35.086088] mount_bdev+0x194/0x1e0 [ 35.090923] f2fs_mount+0x40/0x50 [ 35.095589] mount_fs+0xb4/0x190 [ 35.100159] vfs_kern_mount+0x80/0x1d8 [ 35.105260] do_mount+0x478/0xf18 [ 35.109926] ksys_mount+0x90/0xd0 [ 35.114592] __arm64_sys_mount+0x24/0x38 Signed-off-by: Wuyun Zhao <zhaowuyun@wingtech.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-06-18 12:34:52 -07:00
Wei Fang	6f6489288e	f2fs: remove useless truncate in f2fs_collapse_range() Since offset < new_size, no need to do truncate_pagecache() again with new_size. Signed-off-by: Wei Fang <fangwei1@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-06-18 12:33:18 -07:00
Denis Efremov	742532d11d	f2fs: use kfree() instead of kvfree() to free superblock data Use kfree() instead of kvfree() to free super in read_raw_super_block() because the memory is allocated with kzalloc() in the function. Use kfree() instead of kvfree() to free sbi, raw_super in f2fs_fill_super() and f2fs_put_super() because the memory is allocated with kzalloc(). Signed-off-by: Denis Efremov <efremov@linux.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-06-18 12:33:18 -07:00
Jaegeuk Kim	99bbe30701	f2fs: avoid checkpatch error ERROR:INITIALISED_STATIC: do not initialise statics to NULL Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-06-18 12:33:11 -07:00
Zheng Bin	3373a3461a	block: make function 'kill_bdev' static kill_bdev does not have any external user, so make it static. Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-18 09:24:35 -06:00
Xiaoguang Wang	6f2cc1664d	io_uring: fix possible race condition against REQ_F_NEED_CLEANUP In io_read() or io_write(), when io request is submitted successfully, it'll go through the below sequence: kfree(iovec); req->flags &= ~REQ_F_NEED_CLEANUP; return ret; But clearing REQ_F_NEED_CLEANUP might be unsafe. The io request may already have been completed, and then io_complete_rw_iopoll() and io_complete_rw() will be called, both of which will also modify req->flags if needed. This causes a race condition, with concurrent non-atomic modification of req->flags. To eliminate this race, in io_read() or io_write(), if io request is submitted successfully, we don't remove REQ_F_NEED_CLEANUP flag. If REQ_F_NEED_CLEANUP is set, we'll leave __io_req_aux_free() to the iovec cleanup work correspondingly. Cc: stable@vger.kernel.org Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-18 08:32:44 -06:00
Jens Axboe	56952e91ac	io_uring: reap poll completions while waiting for refs to drop on exit If we're doing polled IO and end up having requests being submitted async, then completions can come in while we're waiting for refs to drop. We need to reap these manually, as nobody else will be looking for them. Break the wait into 1/20th of a second time waits, and check for done poll completions if we time out. Otherwise we can have done poll completions sitting in ctx->poll_list, which needs us to reap them but we're just waiting for them. Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-17 15:05:08 -06:00
Jens Axboe	9d8426a091	io_uring: acquire 'mm' for task_work for SQPOLL If we're unlucky with timing, we could be running task_work after having dropped the memory context in the sq thread. Since dropping the context requires a runnable task state, we cannot reliably drop it as part of our check-for-work loop in io_sq_thread(). Instead, abstract out the mm acquire for the sq thread into a helper, and call it from the async task work handler. Cc: stable@vger.kernel.org # v5.7 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-17 12:49:16 -06:00
Xiaoguang Wang	bbde017a32	io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed In io_complete_rw_iopoll(), stores to io_kiocb's result and iopoll completed are two independent store operations, to ensure that once iopoll_completed is ture and then req->result must been perceived by the cpu executing io_do_iopoll(), proper memory barrier should be used. And in io_do_iopoll(), we check whether req->result is EAGAIN, if it is, we'll need to issue this io request using io-wq again. In order to just issue a single smp_rmb() on the completion side, move the re-submit work to io_iopoll_complete(). Cc: stable@vger.kernel.org Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> [axboe: don't set ->iopoll_completed for -EAGAIN retry] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-17 12:49:09 -06:00
Xiaoguang Wang	2d7d67920e	io_uring: don't fail links for EAGAIN error in IOPOLL mode In IOPOLL mode, for EAGAIN error, we'll try to submit io request again using io-wq, so don't fail rest of links if this io request has links. Cc: stable@vger.kernel.org Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-17 12:49:01 -06:00
Christoph Hellwig	fe557319aa	maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault Better describe what these functions do. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-17 10:57:41 -07:00
J. Bruce Fields	22cf8419f1	nfsd: apply umask on fs without ACL support The server is failing to apply the umask when creating new objects on filesystems without ACL support. To reproduce this, you need to use NFSv4.2 and a client and server recent enough to support umask, and you need to export a filesystem that lacks ACL support (for example, ext4 with the "noacl" mount option). Filesystems with ACL support are expected to take care of the umask themselves (usually by calling posix_acl_create). For filesystems without ACL support, this is up to the caller of vfs_create(), vfs_mknod(), or vfs_mkdir(). Reported-by: Elliott Mitchell <ehem+debian@m5p.com> Reported-by: Salvatore Bonaccorso <carnil@debian.org> Tested-by: Salvatore Bonaccorso <carnil@debian.org> Fixes: `47057abde5` ("nfsd: add support for the umask attribute") Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2020-06-17 10:48:58 -04:00
Masami Hiramatsu	4e264ffd95	proc/bootconfig: Fix to use correct quotes for value Fix /proc/bootconfig to select double or single quotes corrctly according to the value. If a bootconfig value includes a double quote character, we must use single-quotes to quote that value. This modifies if() condition and blocks for avoiding double-quote in value check in 2 places. Anyway, since xbc_array_for_each_value() can handle the array which has a single node correctly. Thus, if (vnode && xbc_node_is_array(vnode)) { xbc_array_for_each_value(vnode) /* vnode->next != NULL / ... } else { snprintf(val); / val is an empty string if !vnode / } is equivalent to if (vnode) { xbc_array_for_each_value(vnode) / vnode->next can be NULL / ... } else { snprintf(""); / value is always empty */ } Link: http://lkml.kernel.org/r/159230244786.65555.3763894451251622488.stgit@devnote2 Cc: stable@vger.kernel.org Fixes: `c1a3c36017` ("proc: bootconfig: Add /proc/bootconfig to show boot config list") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2020-06-16 21:21:03 -04:00
Linus Torvalds	26c20ffcb5	AFS fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7pMu8ACgkQ+7dXa6fL C2sgNRAAnOCq281ojebwVSIkRDVGlxBODNeJtcgOOC4ib3jZM++vhdnnJgJIr8kc UOQ+LF4E5hNgwELubCrLOx/AjIzVuzfrreFNOPh3P3TSjyxW/7AU+tFGkdnLkYun NyadOXxI9Dk84UBN1LrmRm3ccAbF6nDf/KcPykS0oAEh12LVm6sDpVJz9+1uclnK Xq0rgl+zrR0+SPplPYz4P/OEPTgNfpLV9DHVYfkvsvEhwb/TaUmiLj9SEgndp+fg L3CT66QXoG9zds9hYFVODQM3devaXOpGNU0vsc9+Xg57BWuYvVed24eH5oBrcBQo F5kon+mcZlHtmTG87UJ6vFUwfHGeYqKKRb9XTbKbATtIWvkB3XM4Jz/XUlaAIE+R y0njNYEoIn4wHkleL/KeHmFPFSYG7pZpAN3wqhXZ9wVptXRDSB10OK3vpgLD/2rM V68FmBin6eStE5qZ8Mu9qMQxXb1buknoef37FIXUozjc+VMPrg5dbG6GjcW/CqIC LynaNUvrQOvF0ZFVzMt7ffZPrdDYlqqzyN0bReMdibys4BPKo24gSr5aVMLt7YXf ZaJeApcSdsphs4uUmtHKlHYgUQrSEl9pSGmc4hcq9bNIKHo9S618LG9uuUplOjdP j0L8N6uWBHQCjAvu6kDm8Wp5pRPPUnTgaXDsok7yP2GLRqBEm3Q= =bYOZ -----END PGP SIGNATURE----- Merge tag 'afs-fixes-20200616' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS fixes from David Howells: "I've managed to get xfstests kind of working with afs. Here are a set of patches that fix most of the bugs found. There are a number of primary issues: - Incorrect handling of mtime and non-handling of ctime. It might be argued, that the latter isn't a bug since the AFS protocol doesn't support ctime, but I should probably still update it locally. - Shared-write mmap, truncate and writeback bugs. This includes not changing i_size under the callback lock, overwriting local i_size with the reply from the server after a partial writeback, not limiting the writeback from an mmapped page to EOF. - Checks for an abort code indicating that the primary vnode in an operation was deleted by a third-party are done in the wrong place. - Silly rename bugs. This includes an incomplete conversion to the new operation handling, duplicate nlink handling, nlink changing not being done inside the callback lock and insufficient handling of third-party conflicting directory changes. And some secondary ones: - The UAEOVERFLOW abort code should map to EOVERFLOW not EREMOTEIO. - Remove a couple of unused or incompletely used bits. - Remove a couple of redundant success checks. These seem to fix all the data-corruption bugs found by ./check -afs -g quick along with the obvious silly rename bugs and time bugs. There are still some test failures, but they seem to fall into two classes: firstly, the authentication/security model is different to the standard UNIX model and permission is arbitrated by the server and cached locally; and secondly, there are a number of features that AFS does not support (such as mknod). But in these cases, the tests themselves need to be adapted or skipped. Using the in-kernel afs client with xfstests also found a bug in the AuriStor AFS server that has been fixed for a future release" * tag 'afs-fixes-20200616' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Fix silly rename afs: afs_vnode_commit_status() doesn't need to check the RPC error afs: Fix use of afs_check_for_remote_deletion() afs: Remove afs_operation::abort_code afs: Fix yfs_fs_fetch_status() to honour vnode selector afs: Remove yfs_fs_fetch_file_status() as it's not used afs: Fix the mapping of the UAEOVERFLOW abort code afs: Fix truncation issues and mmap writeback size afs: Concoct ctimes afs: Fix EOF corruption afs: afs_write_end() should change i_size under the right lock afs: Fix non-setting of mtime when writing into mmap	2020-06-16 17:40:51 -07:00
Linus Torvalds	ffbc93768e	flexible-array member conversion patches for 5.8-rc2 Hi Linus, Please, pull the following patches that replace zero-length arrays with flexible-array members. Notice that all of these patches have been baking in linux-next for two development cycles now. There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. C99 introduced “flexible array members”, which lacks a numeric size for the array declaration entirely: struct something { size_t count; struct foo items[]; }; This is the way the kernel expects dynamically sized trailing elements to be declared. It allows the compiler to generate errors when the flexible array does not occur last in the structure, which helps to prevent some kind of undefined behavior[3] bugs from being inadvertently introduced to the codebase. It also allows the compiler to correctly analyze array sizes (via sizeof(), CONFIG_FORTIFY_SOURCE, and CONFIG_UBSAN_BOUNDS). For instance, there is no mechanism that warns us that the following application of the sizeof() operator to a zero-length array always results in zero: struct something { size_t count; struct foo items[0]; }; struct something instance; instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL); instance->count = count; size = sizeof(instance->items) instance->count; memcpy(instance->items, source, size); At the last line of code above, size turns out to be zero, when one might have thought it represents the total size in bytes of the dynamic memory recently allocated for the trailing array items. Here are a couple examples of this issue[4][5]. Instead, flexible array members have incomplete type, and so the sizeof() operator may not be applied[6], so any misuse of such operators will be immediately noticed at build time. The cleanest and least error-prone way to implement this is through the use of a flexible array member: struct something { size_t count; struct foo items[]; }; struct something instance; instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL); instance->count = count; size = sizeof(instance->items[0]) instance->count; memcpy(instance->items, source, size); Thanks -- Gustavo [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://github.com/KSPP/linux/issues/21 [3] https://git.kernel.org/linus/76497732932f15e7323dc805e8ea8dc11bb587cf [4] https://git.kernel.org/linus/f2cd32a443da694ac4e28fbf4ac6f9d5cc63a539 [5] https://git.kernel.org/linus/ab91c2a89f86be2898cee208d492816ec238b2cf [6] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAl7oSmYACgkQRwW0y0cG 2zGEiw/9FiH3MBwMlPVJPcneY1wCH/N6ZSf+kr7SJiVwV/YbBe9EWuaKZ0D4vAWm kTACkOfsZ1me1OKz9wNrOxn0zezTMFQK2PLPgzKIPuK0Hg8MW1EU63RIRsnr0bPc b90wZwyBQtLbGRC3/9yAACKwFZe/SeYoV5rr8uylffA35HZW3SZbTex6XnGCF9Q5 UYwnz7vNg+9VH1GRQeB5jlqL7mAoRzJ49I/TL3zJr04Mn+xC+vVBS7XwipDd03p+ foC6/KmGhlCO9HMPASReGrOYNPydDAMKLNPdIfUlcTKHWsoTjGOcW/dzfT4rUu6n nKr5rIqJ4FdlIvXZL5P5w7Uhkwbd3mus5G0HBk+V/cUScckCpBou+yuGzjxXSitQ o0qPsGjWr3v+gxRWHj8YO/9MhKKKW0Iy+QmAC9+uLnbfJdbUwYbLIXbsOKnokCA8 jkDEr64F5hFTKtajIK4VToJK1CsM3D9dwTub27lwZysHn3RYSQdcyN+9OiZgdzpc GlI6QoaqKR9AT4b/eBmqlQAKgA07zSQ5RsIjRm6hN3d7u/77x2kyrreo+trJyVY2 F17uEOzfTqZyxtkPayE8DVjTtbByoCuBR0Vm1oMAFxjyqZQY5daalB0DKd1mdYqi khIXqNAuYqHOb898fEuzidjV38hxZ9y8SAym3P7WnYl+Hxz+8Jo= =8HUQ -----END PGP SIGNATURE----- Merge tag 'flex-array-conversions-5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull flexible-array member conversions from Gustavo A. R. Silva: "Replace zero-length arrays with flexible-array members. Notice that all of these patches have been baking in linux-next for two development cycles now. There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. C99 introduced “flexible array members”, which lacks a numeric size for the array declaration entirely: struct something { size_t count; struct foo items[]; }; This is the way the kernel expects dynamically sized trailing elements to be declared. It allows the compiler to generate errors when the flexible array does not occur last in the structure, which helps to prevent some kind of undefined behavior[3] bugs from being inadvertently introduced to the codebase. It also allows the compiler to correctly analyze array sizes (via sizeof(), CONFIG_FORTIFY_SOURCE, and CONFIG_UBSAN_BOUNDS). For instance, there is no mechanism that warns us that the following application of the sizeof() operator to a zero-length array always results in zero: struct something { size_t count; struct foo items[0]; }; struct something instance; instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL); instance->count = count; size = sizeof(instance->items) instance->count; memcpy(instance->items, source, size); At the last line of code above, size turns out to be zero, when one might have thought it represents the total size in bytes of the dynamic memory recently allocated for the trailing array items. Here are a couple examples of this issue[4][5]. Instead, flexible array members have incomplete type, and so the sizeof() operator may not be applied[6], so any misuse of such operators will be immediately noticed at build time. The cleanest and least error-prone way to implement this is through the use of a flexible array member: struct something { size_t count; struct foo items[]; }; struct something instance; instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL); instance->count = count; size = sizeof(instance->items[0]) instance->count; memcpy(instance->items, source, size); instead" [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://github.com/KSPP/linux/issues/21 [3] commit `7649773293` ("cxgb3/l2t: Fix undefined behaviour") [4] commit `f2cd32a443` ("rndis_wlan: Remove logically dead code") [5] commit `ab91c2a89f` ("tpm: eventlog: Replace zero-length array with flexible-array member") [6] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html * tag 'flex-array-conversions-5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (41 commits) w1: Replace zero-length array with flexible-array tracing/probe: Replace zero-length array with flexible-array soc: ti: Replace zero-length array with flexible-array tifm: Replace zero-length array with flexible-array dmaengine: tegra-apb: Replace zero-length array with flexible-array stm class: Replace zero-length array with flexible-array Squashfs: Replace zero-length array with flexible-array ASoC: SOF: Replace zero-length array with flexible-array ima: Replace zero-length array with flexible-array sctp: Replace zero-length array with flexible-array phy: samsung: Replace zero-length array with flexible-array RxRPC: Replace zero-length array with flexible-array rapidio: Replace zero-length array with flexible-array media: pwc: Replace zero-length array with flexible-array firmware: pcdp: Replace zero-length array with flexible-array oprofile: Replace zero-length array with flexible-array block: Replace zero-length array with flexible-array tools/testing/nvdimm: Replace zero-length array with flexible-array libata: Replace zero-length array with flexible-array kprobes: Replace zero-length array with flexible-array ...	2020-06-16 17:23:57 -07:00
Christian Brauner	60997c3d45	close_range: add CLOSE_RANGE_UNSHARE One of the use-cases of close_range() is to drop file descriptors just before execve(). This would usually be expressed in the sequence: unshare(CLONE_FILES); close_range(3, ~0U); as pointed out by Linus it might be desirable to have this be a part of close_range() itself under a new flag CLOSE_RANGE_UNSHARE. This expands {dup,unshare)_fd() to take a max_fds argument that indicates the maximum number of file descriptors to copy from the old struct files. When the user requests that all file descriptors are supposed to be closed via close_range(min, max) then we can cap via unshare_fd(min) and hence don't need to do any of the heavy fput() work for everything above min. The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in fact currently share our file descriptor table we create a new private copy. We then close all fds in the requested range and finally after we're done we install the new fd table. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2020-06-17 00:07:38 +02:00
Christian Brauner	278a5fbaed	open: add close_range() This adds the close_range() syscall. It allows to efficiently close a range of file descriptors up to all file descriptors of a calling task. I was contacted by FreeBSD as they wanted to have the same close_range() syscall as we proposed here. We've coordinated this and in the meantime, Kyle was fast enough to merge close_range() into FreeBSD already in April: https://reviews.freebsd.org/D21627 https://svnweb.freebsd.org/base?view=revision&revision=359836 and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2]) once its merged in Linux too. Python is in the process of switching to close_range() on FreeBSD and they are waiting on us to merge this to switch on Linux as well: https://bugs.python.org/issue38061 The syscall came up in a recent discussion around the new mount API and making new file descriptor types cloexec by default. During this discussion, Al suggested the close_range() syscall (cf. [1]). Note, a syscall in this manner has been requested by various people over time. First, it helps to close all file descriptors of an exec()ing task. This can be done safely via (quoting Al's example from [1] verbatim): /* that exec is sensitive / unshare(CLONE_FILES); / we don't want anything past stderr here / close_range(3, ~0U); execve(....); The code snippet above is one way of working around the problem that file descriptors are not cloexec by default. This is aggravated by the fact that we can't just switch them over without massively regressing userspace. For a whole class of programs having an in-kernel method of closing all file descriptors is very helpful (e.g. demons, service managers, programming language standard libraries, container managers etc.). (Please note, unshare(CLONE_FILES) should only be needed if the calling task is multi-threaded and shares the file descriptor table with another thread in which case two threads could race with one thread allocating file descriptors and the other one closing them via close_range(). For the general case close_range() before the execve() is sufficient.) Second, it allows userspace to avoid implementing closing all file descriptors by parsing through /proc/<pid>/fd/ and calling close() on each file descriptor. From looking at various large(ish) userspace code bases this or similar patterns are very common in: - service managers (cf. [4]) - libcs (cf. [6]) - container runtimes (cf. [5]) - programming language runtimes/standard libraries - Python (cf. [2]) - Rust (cf. [7], [8]) As Dmitry pointed out there's even a long-standing glibc bug about missing kernel support for this task (cf. [3]). In addition, the syscall will also work for tasks that do not have procfs mounted and on kernels that do not have procfs support compiled in. In such situations the only way to make sure that all file descriptors are closed is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE, OPEN_MAX trickery (cf. comment [8] on Rust). The performance is striking. For good measure, comparing the following simple close_all_fds() userspace implementation that is essentially just glibc's version in [6]: static int close_all_fds(void) { int dir_fd; DIR dir; struct dirent direntp; dir = opendir("/proc/self/fd"); if (!dir) return -1; dir_fd = dirfd(dir); while ((direntp = readdir(dir))) { int fd; if (strcmp(direntp->d_name, ".") == 0) continue; if (strcmp(direntp->d_name, "..") == 0) continue; fd = atoi(direntp->d_name); if (fd == dir_fd \|\| fd == 0 \|\| fd == 1 \|\| fd == 2) continue; close(fd); } closedir(dir); return 0; } to close_range() yields: 1. closing 4 open files: - close_all_fds(): ~280 us - close_range(): ~24 us 2. closing 1000 open files: - close_all_fds(): ~5000 us - close_range(): ~800 us close_range() is designed to allow for some flexibility. Specifically, it does not simply always close all open file descriptors of a task. Instead, callers can specify an upper bound. This is e.g. useful for scenarios where specific file descriptors are created with well-known numbers that are supposed to be excluded from getting closed. For extra paranoia close_range() comes with a flags argument. This can e.g. be used to implement extension. Once can imagine userspace wanting to stop at the first error instead of ignoring errors under certain circumstances. There might be other valid ideas in the future. In any case, a flag argument doesn't hurt and keeps us on the safe side. From an implementation side this is kept rather dumb. It saw some input from David and Jann but all nonsense is obviously my own! - Errors to close file descriptors are currently ignored. (Could be changed by setting a flag in the future if needed.) - __close_range() is a rather simplistic wrapper around __close_fd(). My reasoning behind this is based on the nature of how __close_fd() needs to release an fd. But maybe I misunderstood specifics: We take the files_lock and rcu-dereference the fdtable of the calling task, we find the entry in the fdtable, get the file and need to release files_lock before calling filp_close(). In the meantime the fdtable might have been altered so we can't just retake the spinlock and keep the old rcu-reference of the fdtable around. Instead we need to grab a fresh reference to the fdtable. If my reasoning is correct then there's really no point in fancyfying __close_range(): We just need to rcu-dereference the fdtable of the calling task once to cap the max_fd value correctly and then go on calling __close_fd() in a loop. /* References */ [1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/ [2]: `9e4f2f3a6b/Modules/_posixsubprocess.c (L220)` [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7 [4]: `5238e95759/src/basic/fd-util.c (L217)` [5]: `ddf4b77e11/src/lxc/start.c (L236)` [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17 Note that this is an internal implementation that is not exported. Currently, libc seems to not provide an exported version of this because of missing kernel support to do this. Note, in a recent patch series Florian made grantpt() a nop thereby removing the code referenced here. [7]: https://github.com/rust-lang/rust/issues/12148 [8]: `5f47c0613e/src/libstd/sys/unix/process2.rs (L303-L308)` Rust's solution is slightly different but is equally unperformant. Rust calls getdtablesize() which is a glibc library function that simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then goes on to call close() on each fd. That's obviously overkill for most tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or OPEN_MAX. Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set to 1024. Even in this case, there's a very high chance that in the common case Rust is calling the close() syscall 1021 times pointlessly if the task just has 0, 1, and 2 open. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Kyle Evans <self@kyle-evans.net> Cc: Jann Horn <jannh@google.com> Cc: David Howells <dhowells@redhat.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Florian Weimer <fweimer@redhat.com> Cc: linux-api@vger.kernel.org	2020-06-17 00:05:19 +02:00
David Howells	b6489a49f7	afs: Fix silly rename Fix AFS's silly rename by the following means: (1) Set the destination directory in afs_do_silly_rename() so as to avoid misbehaviour and indicate that the directory data version will increment by 1 so as to avoid warnings about unexpected changes in the DV. Also indicate that the ctime should be updated to avoid xfstest grumbling. (2) Note when the server indicates that a directory changed more than we expected (AFS_OPERATION_DIR_CONFLICT), indicating a conflict with a third party change, checking on successful completion of unlink and rename. The problem is that the FS.RemoveFile RPC op doesn't report the status of the unlinked file, though YFS.RemoveFile2 does. This can be mitigated by the assumption that if the directory DV cranked by exactly 1, we can be sure we removed one link from the file; further, ordinarily in AFS, files cannot be hardlinked across directories, so if we reduce nlink to 0, the file is deleted. However, if the directory DV jumps by more than 1, we cannot know if a third party intervened by adding or removing a link on the file we just removed a link from. The same also goes for any vnode that is at the destination of the FS.Rename RPC op. (3) Make afs_vnode_commit_status() apply the nlink drop inside the cb_lock section along with the other attribute updates if ->op_unlinked is set on the descriptor for the appropriate vnode. (4) Issue a follow up status fetch to the unlinked file in the event of a third party conflict that makes it impossible for us to know if we actually deleted the file or not. (5) Provide a flag, AFS_VNODE_SILLY_DELETED, to make afs_getattr() lie to the user about the nlink of a silly deleted file so that it appears as 0, not 1. Found with the generic/035 and generic/084 xfstests. Fixes: `e49c7b2f6d` ("afs: Build an abstraction around an "operation" concept") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-16 22:00:28 +01:00
Waiman Long	b091f7fede	btrfs: use kfree() in btrfs_ioctl_get_subvol_info() In btrfs_ioctl_get_subvol_info(), there is a classic case where kzalloc() was incorrectly paired with kzfree(). According to David Sterba, there isn't any sensitive information in the subvol_info that needs to be cleared before freeing. So kzfree() isn't really needed, use kfree() instead. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-06-16 19:24:03 +02:00
Filipe Manana	5dbb75ed69	btrfs: fix RWF_NOWAIT writes blocking on extent locks and waiting for IO A RWF_NOWAIT write is not supposed to wait on filesystem locks that can be held for a long time or for ongoing IO to complete. However when calling check_can_nocow(), if the inode has prealloc extents or has the NOCOW flag set, we can block on extent (file range) locks through the call to btrfs_lock_and_flush_ordered_range(). Such lock can take a significant amount of time to be available. For example, a fiemap task may be running, and iterating through the entire file range checking all extents and doing backref walking to determine if they are shared, or a readpage operation may be in progress. Also at btrfs_lock_and_flush_ordered_range(), called by check_can_nocow(), after locking the file range we wait for any existing ordered extent that is in progress to complete. Another operation that can take a significant amount of time and defeat the purpose of RWF_NOWAIT. So fix this by trying to lock the file range and if it's currently locked return -EAGAIN to user space. If we are able to lock the file range without waiting and there is an ordered extent in the range, return -EAGAIN as well, instead of waiting for it to complete. Finally, don't bother trying to lock the snapshot lock of the root when attempting a RWF_NOWAIT write, as that is only important for buffered writes. Fixes: `edf064e7c6` ("btrfs: nowait aio support") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-06-16 19:22:45 +02:00
Filipe Manana	260a63395f	btrfs: fix RWF_NOWAIT write not failling when we need to cow If we attempt to do a RWF_NOWAIT write against a file range for which we can only do NOCOW for a part of it, due to the existence of holes or shared extents for example, we proceed with the write as if it were possible to NOCOW the whole range. Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ touch /mnt/sdj/bar $ chattr +C /mnt/sdj/bar $ xfs_io -d -c "pwrite -S 0xab -b 256K 0 256K" /mnt/bar wrote 262144/262144 bytes at offset 0 256 KiB, 1 ops; 0.0003 sec (694.444 MiB/sec and 2777.7778 ops/sec) $ xfs_io -c "fpunch 64K 64K" /mnt/bar $ sync $ xfs_io -d -c "pwrite -N -V 1 -b 128K -S 0xfe 0 128K" /mnt/bar wrote 131072/131072 bytes at offset 0 128 KiB, 1 ops; 0.0007 sec (160.051 MiB/sec and 1280.4097 ops/sec) This last write should fail with -EAGAIN since the file range from 64K to 128K is a hole. On xfs it fails, as expected, but on ext4 it currently succeeds because apparently it is expensive to check if there are extents allocated for the whole range, but I'll check with the ext4 people. Fix the issue by checking if check_can_nocow() returns a number of NOCOW'able bytes smaller then the requested number of bytes, and if it does return -EAGAIN. Fixes: `edf064e7c6` ("btrfs: nowait aio support") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-06-16 19:22:37 +02:00
Filipe Manana	4b1946284d	btrfs: fix failure of RWF_NOWAIT write into prealloc extent beyond eof If we attempt to write to prealloc extent located after eof using a RWF_NOWAIT write, we always fail with -EAGAIN. We do actually check if we have an allocated extent for the write at the start of btrfs_file_write_iter() through a call to check_can_nocow(), but later when we go into the actual direct IO write path we simply return -EAGAIN if the write starts at or beyond EOF. Trivial to reproduce: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ touch /mnt/foo $ chattr +C /mnt/foo $ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foo wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0004 sec (135.575 MiB/sec and 34707.1584 ops/sec) $ xfs_io -c "falloc -k 64K 1M" /mnt/foo $ xfs_io -d -c "pwrite -N -V 1 -S 0xfe -b 64K 64K 64K" /mnt/foo pwrite: Resource temporarily unavailable On xfs and ext4 the write succeeds, as expected. Fix this by removing the wrong check at btrfs_direct_IO(). Fixes: `edf064e7c6` ("btrfs: nowait aio support") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-06-16 19:22:31 +02:00
Filipe Manana	f2cb2f39cc	btrfs: fix hang on snapshot creation after RWF_NOWAIT write If we do a successful RWF_NOWAIT write we end up locking the snapshot lock of the inode, through a call to check_can_nocow(), but we never unlock it. This means the next attempt to create a snapshot on the subvolume will hang forever. Trivial reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ touch /mnt/foobar $ chattr +C /mnt/foobar $ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foobar $ xfs_io -d -c "pwrite -N -V 1 -S 0xfe 0 64K" /mnt/foobar $ btrfs subvolume snapshot -r /mnt /mnt/snap --> hangs Fix this by unlocking the snapshot lock if check_can_nocow() returned success. Fixes: `edf064e7c6` ("btrfs: nowait aio support") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-06-16 19:22:27 +02:00
Filipe Manana	e7a79811d0	btrfs: check if a log root exists before locking the log_mutex on unlink This brings back an optimization that commit `e678934cbe` ("btrfs: Remove unnecessary check from join_running_log_trans") removed, but in a different form. So it's almost equivalent to a revert. That commit removed an optimization where we avoid locking a root's log_mutex when there is no log tree created in the current transaction. The affected code path is triggered through unlink operations. That commit was based on the assumption that the optimization was not necessary because we used to have the following checks when the patch was authored: int btrfs_del_dir_entries_in_log(...) { (...) if (dir->logged_trans < trans->transid) return 0; ret = join_running_log_trans(root); (...) } int btrfs_del_inode_ref_in_log(...) { (...) if (inode->logged_trans < trans->transid) return 0; ret = join_running_log_trans(root); (...) } However before that patch was merged, another patch was merged first which replaced those checks because they were buggy. That other patch corresponds to commit `803f0f64d1` ("Btrfs: fix fsync not persisting dentry deletions due to inode evictions"). The assumption that if the logged_trans field of an inode had a smaller value then the current transaction's generation (transid) meant that the inode was not logged in the current transaction was only correct if the inode was not evicted and reloaded in the current transaction. So the corresponding bug fix changed those checks and replaced them with the following helper function: static bool inode_logged(struct btrfs_trans_handle trans, struct btrfs_inode inode) { if (inode->logged_trans == trans->transid) return true; if (inode->last_trans == trans->transid && test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) && !test_bit(BTRFS_FS_LOG_RECOVERING, &trans->fs_info->flags)) return true; return false; } So if we have a subvolume without a log tree in the current transaction (because we had no fsyncs), every time we unlink an inode we can end up trying to lock the log_mutex of the root through join_running_log_trans() twice, once for the inode being unlinked (by btrfs_del_inode_ref_in_log()) and once for the parent directory (with btrfs_del_dir_entries_in_log()). This means if we have several unlink operations happening in parallel for inodes in the same subvolume, and the those inodes and/or their parent inode were changed in the current transaction, we end up having a lot of contention on the log_mutex. The test robots from intel reported a -30.7% performance regression for a REAIM test after commit `e678934cbe` ("btrfs: Remove unnecessary check from join_running_log_trans"). So just bring back the optimization to join_running_log_trans() where we check first if a log root exists before trying to lock the log_mutex. This is done by checking for a bit that is set on the root when a log tree is created and removed when a log tree is freed (at transaction commit time). Commit `e678934cbe` ("btrfs: Remove unnecessary check from join_running_log_trans") was merged in the 5.4 merge window while commit `803f0f64d1` ("Btrfs: fix fsync not persisting dentry deletions due to inode evictions") was merged in the 5.3 merge window. But the first commit was actually authored before the second commit (May 23 2019 vs June 19 2019). Reported-by: kernel test robot <rong.a.chen@intel.com> Link: https://lore.kernel.org/lkml/20200611090233.GL12456@shao2-debian/ Fixes: `e678934cbe` ("btrfs: Remove unnecessary check from join_running_log_trans") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-06-16 19:22:23 +02:00
Filipe Manana	6bd335b469	btrfs: fix bytes_may_use underflow when running balance and scrub in parallel When balance and scrub are running in parallel it is possible to end up with an underflow of the bytes_may_use counter of the data space_info object, which triggers a warning like the following: [134243.793196] BTRFS info (device sdc): relocating block group 1104150528 flags data [134243.806891] ------------[ cut here ]------------ [134243.807561] WARNING: CPU: 1 PID: 26884 at fs/btrfs/space-info.h:125 btrfs_add_reserved_bytes+0x1da/0x280 [btrfs] [134243.808819] Modules linked in: btrfs blake2b_generic xor (...) [134243.815779] CPU: 1 PID: 26884 Comm: kworker/u8:8 Tainted: G W 5.6.0-rc7-btrfs-next-58 #5 [134243.816944] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [134243.818389] Workqueue: writeback wb_workfn (flush-btrfs-108483) [134243.819186] RIP: 0010:btrfs_add_reserved_bytes+0x1da/0x280 [btrfs] [134243.819963] Code: 0b f2 85 (...) [134243.822271] RSP: 0018:ffffa4160aae7510 EFLAGS: 00010287 [134243.822929] RAX: 000000000000c000 RBX: ffff96159a8c1000 RCX: 0000000000000000 [134243.823816] RDX: 0000000000008000 RSI: 0000000000000000 RDI: ffff96158067a810 [134243.824742] RBP: ffff96158067a800 R08: 0000000000000001 R09: 0000000000000000 [134243.825636] R10: ffff961501432a40 R11: 0000000000000000 R12: 000000000000c000 [134243.826532] R13: 0000000000000001 R14: ffffffffffff4000 R15: ffff96158067a810 [134243.827432] FS: 0000000000000000(0000) GS:ffff9615baa00000(0000) knlGS:0000000000000000 [134243.828451] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [134243.829184] CR2: 000055bd7e414000 CR3: 00000001077be004 CR4: 00000000003606e0 [134243.830083] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [134243.830975] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [134243.831867] Call Trace: [134243.832211] find_free_extent+0x4a0/0x16c0 [btrfs] [134243.832846] btrfs_reserve_extent+0x91/0x180 [btrfs] [134243.833487] cow_file_range+0x12d/0x490 [btrfs] [134243.834080] fallback_to_cow+0x82/0x1b0 [btrfs] [134243.834689] ? release_extent_buffer+0x121/0x170 [btrfs] [134243.835370] run_delalloc_nocow+0x33f/0xa30 [btrfs] [134243.836032] btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs] [134243.836725] ? find_lock_delalloc_range+0x221/0x250 [btrfs] [134243.837450] writepage_delalloc+0xe8/0x150 [btrfs] [134243.838059] __extent_writepage+0xe8/0x4c0 [btrfs] [134243.838674] extent_write_cache_pages+0x237/0x530 [btrfs] [134243.839364] extent_writepages+0x44/0xa0 [btrfs] [134243.839946] do_writepages+0x23/0x80 [134243.840401] __writeback_single_inode+0x59/0x700 [134243.841006] writeback_sb_inodes+0x267/0x5f0 [134243.841548] __writeback_inodes_wb+0x87/0xe0 [134243.842091] wb_writeback+0x382/0x590 [134243.842574] ? wb_workfn+0x4a2/0x6c0 [134243.843030] wb_workfn+0x4a2/0x6c0 [134243.843468] process_one_work+0x26d/0x6a0 [134243.843978] worker_thread+0x4f/0x3e0 [134243.844452] ? process_one_work+0x6a0/0x6a0 [134243.844981] kthread+0x103/0x140 [134243.845400] ? kthread_create_worker_on_cpu+0x70/0x70 [134243.846030] ret_from_fork+0x3a/0x50 [134243.846494] irq event stamp: 0 [134243.846892] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [134243.847682] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 [134243.848687] softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 [134243.849913] softirqs last disabled at (0): [<0000000000000000>] 0x0 [134243.850698] ---[ end trace bd7c03622e0b0a96 ]--- [134243.851335] ------------[ cut here ]------------ When relocating a data block group, for each extent allocated in the block group we preallocate another extent with the same size for the data relocation inode (we do it at prealloc_file_extent_cluster()). We reserve space by calling btrfs_check_data_free_space(), which ends up incrementing the data space_info's bytes_may_use counter, and then call btrfs_prealloc_file_range() to allocate the extent, which always decrements the bytes_may_use counter by the same amount. The expectation is that writeback of the data relocation inode always follows a NOCOW path, by writing into the preallocated extents. However, when starting writeback we might end up falling back into the COW path, because the block group that contains the preallocated extent was turned into RO mode by a scrub running in parallel. The COW path then calls the extent allocator which ends up calling btrfs_add_reserved_bytes(), and this function decrements the bytes_may_use counter of the data space_info object by an amount corresponding to the size of the allocated extent, despite we haven't previously incremented it. When the counter currently has a value smaller then the allocated extent we reset the counter to 0 and emit a warning, otherwise we just decrement it and slowly mess up with this counter which is crucial for space reservation, the end result can be granting reserved space to tasks when there isn't really enough free space, and having the tasks fail later in critical places where error handling consists of a transaction abort or hitting a BUG_ON(). Fix this by making sure that if we fallback to the COW path for a data relocation inode, we increment the bytes_may_use counter of the data space_info object. The COW path will then decrement it at btrfs_add_reserved_bytes() on success or through its error handling part by a call to extent_clear_unlock_delalloc() (which ends up calling btrfs_clear_delalloc_extent() that does the decrement operation) in case of an error. Test case btrfs/061 from fstests could sporadically trigger this. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-06-16 19:21:31 +02:00
Filipe Manana	432cd2a10f	btrfs: fix data block group relocation failure due to concurrent scrub When running relocation of a data block group while scrub is running in parallel, it is possible that the relocation will fail and abort the current transaction with an -EINVAL error: [134243.988595] BTRFS info (device sdc): found 14 extents, stage: move data extents [134243.999871] ------------[ cut here ]------------ [134244.000741] BTRFS: Transaction aborted (error -22) [134244.001692] WARNING: CPU: 0 PID: 26954 at fs/btrfs/ctree.c:1071 __btrfs_cow_block+0x6a7/0x790 [btrfs] [134244.003380] Modules linked in: btrfs blake2b_generic xor raid6_pq (...) [134244.012577] CPU: 0 PID: 26954 Comm: btrfs Tainted: G W 5.6.0-rc7-btrfs-next-58 #5 [134244.014162] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [134244.016184] RIP: 0010:__btrfs_cow_block+0x6a7/0x790 [btrfs] [134244.017151] Code: 48 c7 c7 (...) [134244.020549] RSP: 0018:ffffa41607863888 EFLAGS: 00010286 [134244.021515] RAX: 0000000000000000 RBX: ffff9614bdfe09c8 RCX: 0000000000000000 [134244.022822] RDX: 0000000000000001 RSI: ffffffffb3d63980 RDI: 0000000000000001 [134244.024124] RBP: ffff961589e8c000 R08: 0000000000000000 R09: 0000000000000001 [134244.025424] R10: ffffffffc0ae5955 R11: 0000000000000000 R12: ffff9614bd530d08 [134244.026725] R13: ffff9614ced41b88 R14: ffff9614bdfe2a48 R15: 0000000000000000 [134244.028024] FS: 00007f29b63c08c0(0000) GS:ffff9615ba600000(0000) knlGS:0000000000000000 [134244.029491] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [134244.030560] CR2: 00007f4eb339b000 CR3: 0000000130d6e006 CR4: 00000000003606f0 [134244.031997] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [134244.033153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [134244.034484] Call Trace: [134244.034984] btrfs_cow_block+0x12b/0x2b0 [btrfs] [134244.035859] do_relocation+0x30b/0x790 [btrfs] [134244.036681] ? do_raw_spin_unlock+0x49/0xc0 [134244.037460] ? _raw_spin_unlock+0x29/0x40 [134244.038235] relocate_tree_blocks+0x37b/0x730 [btrfs] [134244.039245] relocate_block_group+0x388/0x770 [btrfs] [134244.040228] btrfs_relocate_block_group+0x161/0x2e0 [btrfs] [134244.041323] btrfs_relocate_chunk+0x36/0x110 [btrfs] [134244.041345] btrfs_balance+0xc06/0x1860 [btrfs] [134244.043382] ? btrfs_ioctl_balance+0x27c/0x310 [btrfs] [134244.045586] btrfs_ioctl_balance+0x1ed/0x310 [btrfs] [134244.045611] btrfs_ioctl+0x1880/0x3760 [btrfs] [134244.049043] ? do_raw_spin_unlock+0x49/0xc0 [134244.049838] ? _raw_spin_unlock+0x29/0x40 [134244.050587] ? __handle_mm_fault+0x11b3/0x14b0 [134244.051417] ? ksys_ioctl+0x92/0xb0 [134244.052070] ksys_ioctl+0x92/0xb0 [134244.052701] ? trace_hardirqs_off_thunk+0x1a/0x1c [134244.053511] __x64_sys_ioctl+0x16/0x20 [134244.054206] do_syscall_64+0x5c/0x280 [134244.054891] entry_SYSCALL_64_after_hwframe+0x49/0xbe [134244.055819] RIP: 0033:0x7f29b51c9dd7 [134244.056491] Code: 00 00 00 (...) [134244.059767] RSP: 002b:00007ffcccc1dd08 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [134244.061168] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f29b51c9dd7 [134244.062474] RDX: 00007ffcccc1dda0 RSI: 00000000c4009420 RDI: 0000000000000003 [134244.063771] RBP: 0000000000000003 R08: 00005565cea4b000 R09: 0000000000000000 [134244.065032] R10: 0000000000000541 R11: 0000000000000202 R12: 00007ffcccc2060a [134244.066327] R13: 00007ffcccc1dda0 R14: 0000000000000002 R15: 00007ffcccc1dec0 [134244.067626] irq event stamp: 0 [134244.068202] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [134244.069351] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 [134244.070909] softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 [134244.072392] softirqs last disabled at (0): [<0000000000000000>] 0x0 [134244.073432] ---[ end trace bd7c03622e0b0a99 ]--- The -EINVAL error comes from the following chain of function calls: __btrfs_cow_block() <-- aborts the transaction btrfs_reloc_cow_block() replace_file_extents() get_new_location() <-- returns -EINVAL When relocating a data block group, for each allocated extent of the block group, we preallocate another extent (at prealloc_file_extent_cluster()), associated with the data relocation inode, and then dirty all its pages. These preallocated extents have, and must have, the same size that extents from the data block group being relocated have. Later before we start the relocation stage that updates pointers (bytenr field of file extent items) to point to the the new extents, we trigger writeback for the data relocation inode. The expectation is that writeback will write the pages to the previously preallocated extents, that it follows the NOCOW path. That is generally the case, however, if a scrub is running it may have turned the block group that contains those extents into RO mode, in which case writeback falls back to the COW path. However in the COW path instead of allocating exactly one extent with the expected size, the allocator may end up allocating several smaller extents due to free space fragmentation - because we tell it at cow_file_range() that the minimum allocation size can match the filesystem's sector size. This later breaks the relocation's expectation that an extent associated to a file extent item in the data relocation inode has the same size as the respective extent pointed by a file extent item in another tree - in this case the extent to which the relocation inode poins to is smaller, causing relocation.c:get_new_location() to return -EINVAL. For example, if we are relocating a data block group X that has a logical address of X and the block group has an extent allocated at the logical address X + 128KiB with a size of 64KiB: 1) At prealloc_file_extent_cluster() we allocate an extent for the data relocation inode with a size of 64KiB and associate it to the file offset 128KiB (X + 128KiB - X) of the data relocation inode. This preallocated extent was allocated at block group Z; 2) A scrub running in parallel turns block group Z into RO mode and starts scrubing its extents; 3) Relocation triggers writeback for the data relocation inode; 4) When running delalloc (btrfs_run_delalloc_range()), we try first the NOCOW path because the data relocation inode has BTRFS_INODE_PREALLOC set in its flags. However, because block group Z is in RO mode, the NOCOW path (run_delalloc_nocow()) falls back into the COW path, by calling cow_file_range(); 5) At cow_file_range(), in the first iteration of the while loop we call btrfs_reserve_extent() to allocate a 64KiB extent and pass it a minimum allocation size of 4KiB (fs_info->sectorsize). Due to free space fragmentation, btrfs_reserve_extent() ends up allocating two extents of 32KiB each, each one on a different iteration of that while loop; 6) Writeback of the data relocation inode completes; 7) Relocation proceeds and ends up at relocation.c:replace_file_extents(), with a leaf which has a file extent item that points to the data extent from block group X, that has a logical address (bytenr) of X + 128KiB and a size of 64KiB. Then it calls get_new_location(), which does a lookup in the data relocation tree for a file extent item starting at offset 128KiB (X + 128KiB - X) and belonging to the data relocation inode. It finds a corresponding file extent item, however that item points to an extent that has a size of 32KiB, which doesn't match the expected size of 64KiB, resuling in -EINVAL being returned from this function and propagated up to __btrfs_cow_block(), which aborts the current transaction. To fix this make sure that at cow_file_range() when we call the allocator we pass it a minimum allocation size corresponding the desired extent size if the inode belongs to the data relocation tree, otherwise pass it the filesystem's sector size as the minimum allocation size. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-06-16 19:21:25 +02:00
Filipe Manana	ffcb9d4457	btrfs: fix race between block group removal and block group creation There is a race between block group removal and block group creation when the removal is completed by a task running fitrim or scrub. When this happens we end up failing the block group creation with an error -EEXIST since we attempt to insert a duplicate block group item key in the extent tree. That results in a transaction abort. The race happens like this: 1) Task A is doing a fitrim, and at btrfs_trim_block_group() it freezes block group X with btrfs_freeze_block_group() (until very recently that was named btrfs_get_block_group_trimming()); 2) Task B starts removing block group X, either because it's now unused or due to relocation for example. So at btrfs_remove_block_group(), while holding the chunk mutex and the block group's lock, it sets the 'removed' flag of the block group and it sets the local variable 'remove_em' to false, because the block group is currently frozen (its 'frozen' counter is > 0, until very recently this counter was named 'trimming'); 3) Task B unlocks the block group and the chunk mutex; 4) Task A is done trimming the block group and unfreezes the block group by calling btrfs_unfreeze_block_group() (until very recently this was named btrfs_put_block_group_trimming()). In this function we lock the block group and set the local variable 'cleanup' to true because we were able to decrement the block group's 'frozen' counter down to 0 and the flag 'removed' is set in the block group. Since 'cleanup' is set to true, it locks the chunk mutex and removes the extent mapping representing the block group from the mapping tree; 5) Task C allocates a new block group Y and it picks up the logical address that block group X had as the logical address for Y, because X was the block group with the highest logical address and now the second block group with the highest logical address, the last in the fs mapping tree, ends at an offset corresponding to block group X's logical address (this logical address selection is done at volumes.c:find_next_chunk()). At this point the new block group Y does not have yet its item added to the extent tree (nor the corresponding device extent items and chunk item in the device and chunk trees). The new group Y is added to the list of pending block groups in the transaction handle; 6) Before task B proceeds to removing the block group item for block group X from the extent tree, which has a key matching: (X logical offset, BTRFS_BLOCK_GROUP_ITEM_KEY, length) task C while ending its transaction handle calls btrfs_create_pending_block_groups(), which finds block group Y and tries to insert the block group item for Y into the exten tree, which fails with -EEXIST since logical offset is the same that X had and task B hasn't yet deleted the key from the extent tree. This failure results in a transaction abort, producing a stack like the following: ------------[ cut here ]------------ BTRFS: Transaction aborted (error -17) WARNING: CPU: 2 PID: 19736 at fs/btrfs/block-group.c:2074 btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs] Modules linked in: btrfs blake2b_generic xor raid6_pq (...) CPU: 2 PID: 19736 Comm: fsstress Tainted: G W 5.6.0-rc7-btrfs-next-58 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs] Code: ff ff ff 48 8b 55 50 f0 48 (...) RSP: 0018:ffffa4160a1c7d58 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff961581909d98 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffffb3d63990 RDI: 0000000000000001 RBP: ffff9614f3356a58 R08: 0000000000000000 R09: 0000000000000001 R10: ffff9615b65b0040 R11: 0000000000000000 R12: ffff961581909c10 R13: ffff9615b0c32000 R14: ffff9614f3356ab0 R15: ffff9614be779000 FS: 00007f2ce2841e80(0000) GS:ffff9615bae00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555f18780000 CR3: 0000000131d34005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_start_dirty_block_groups+0x398/0x4e0 [btrfs] btrfs_commit_transaction+0xd0/0xc50 [btrfs] ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs] ? __ia32_sys_fdatasync+0x20/0x20 iterate_supers+0xdb/0x180 ksys_sync+0x60/0xb0 __ia32_sys_sync+0xa/0x10 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f2ce1d4d5b7 Code: 83 c4 08 48 3d 01 (...) RSP: 002b:00007ffd8b558c58 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2 RAX: ffffffffffffffda RBX: 000000000000002c RCX: 00007f2ce1d4d5b7 RDX: 00000000ffffffff RSI: 00000000186ba07b RDI: 000000000000002c RBP: 0000555f17b9e520 R08: 0000000000000012 R09: 000000000000ce00 R10: 0000000000000078 R11: 0000000000000202 R12: 0000000000000032 R13: 0000000051eb851f R14: 00007ffd8b558cd0 R15: 0000555f1798ec20 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace bd7c03622e0b0a9c ]--- Fix this simply by making btrfs_remove_block_group() remove the block group's item from the extent tree before it flags the block group as removed. Also make the free space deletion from the free space tree before flagging the block group as removed, to avoid a similar race with adding and removing free space entries for the free space tree. Fixes: `04216820fe` ("Btrfs: fix race between fs trimming and block group remove/allocation") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-06-16 19:20:58 +02:00
Filipe Manana	9fecd13202	btrfs: fix a block group ref counter leak after failure to remove block group When removing a block group, if we fail to delete the block group's item from the extent tree, we jump to the 'out' label and end up decrementing the block group's reference count once only (by 1), resulting in a counter leak because the block group at that point was already removed from the block group cache rbtree - so we have to decrement the reference count twice, once for the rbtree and once for our lookup at the start of the function. There is a second bug where if removing the free space tree entries (the call to remove_block_group_free_space()) fails we end up jumping to the 'out_put_group' label but end up decrementing the reference count only once, when we should have done it twice, since we have already removed the block group from the block group cache rbtree. This happens because the reference count decrement for the rbtree reference happens after attempting to remove the free space tree entries, which is far away from the place where we remove the block group from the rbtree. To make things less error prone, decrement the reference count for the rbtree immediately after removing the block group from it. This also eleminates the need for two different exit labels on error, renaming 'out_put_label' to just 'out' and removing the old 'out'. Fixes: `f6033c5e33` ("btrfs: fix block group leak when removing fails") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2020-06-16 19:20:51 +02:00
Jason Yan	2d3a8e2ded	block: Fix use-after-free in blkdev_get() In blkdev_get() we call __blkdev_get() to do some internal jobs and if there is some errors in __blkdev_get(), the bdput() is called which means we have released the refcount of the bdev (actually the refcount of the bdev inode). This means we cannot access bdev after that point. But acctually bdev is still accessed in blkdev_get() after calling __blkdev_get(). This results in use-after-free if the refcount is the last one we released in __blkdev_get(). Let's take a look at the following scenerio: CPU0 CPU1 CPU2 blkdev_open blkdev_open Remove disk bd_acquire blkdev_get __blkdev_get del_gendisk bdev_unhash_inode bd_acquire bdev_get_gendisk bd_forget failed because of unhashed bdput bdput (the last one) bdev_evict_inode access bdev => use after free [ 459.350216] BUG: KASAN: use-after-free in __lock_acquire+0x24c1/0x31b0 [ 459.351190] Read of size 8 at addr ffff88806c815a80 by task syz-executor.0/20132 [ 459.352347] [ 459.352594] CPU: 0 PID: 20132 Comm: syz-executor.0 Not tainted 4.19.90 #2 [ 459.353628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 459.354947] Call Trace: [ 459.355337] dump_stack+0x111/0x19e [ 459.355879] ? __lock_acquire+0x24c1/0x31b0 [ 459.356523] print_address_description+0x60/0x223 [ 459.357248] ? __lock_acquire+0x24c1/0x31b0 [ 459.357887] kasan_report.cold+0xae/0x2d8 [ 459.358503] __lock_acquire+0x24c1/0x31b0 [ 459.359120] ? _raw_spin_unlock_irq+0x24/0x40 [ 459.359784] ? lockdep_hardirqs_on+0x37b/0x580 [ 459.360465] ? _raw_spin_unlock_irq+0x24/0x40 [ 459.361123] ? finish_task_switch+0x125/0x600 [ 459.361812] ? finish_task_switch+0xee/0x600 [ 459.362471] ? mark_held_locks+0xf0/0xf0 [ 459.363108] ? __schedule+0x96f/0x21d0 [ 459.363716] lock_acquire+0x111/0x320 [ 459.364285] ? blkdev_get+0xce/0xbe0 [ 459.364846] ? blkdev_get+0xce/0xbe0 [ 459.365390] __mutex_lock+0xf9/0x12a0 [ 459.365948] ? blkdev_get+0xce/0xbe0 [ 459.366493] ? bdev_evict_inode+0x1f0/0x1f0 [ 459.367130] ? blkdev_get+0xce/0xbe0 [ 459.367678] ? destroy_inode+0xbc/0x110 [ 459.368261] ? mutex_trylock+0x1a0/0x1a0 [ 459.368867] ? __blkdev_get+0x3e6/0x1280 [ 459.369463] ? bdev_disk_changed+0x1d0/0x1d0 [ 459.370114] ? blkdev_get+0xce/0xbe0 [ 459.370656] blkdev_get+0xce/0xbe0 [ 459.371178] ? find_held_lock+0x2c/0x110 [ 459.371774] ? __blkdev_get+0x1280/0x1280 [ 459.372383] ? lock_downgrade+0x680/0x680 [ 459.373002] ? lock_acquire+0x111/0x320 [ 459.373587] ? bd_acquire+0x21/0x2c0 [ 459.374134] ? do_raw_spin_unlock+0x4f/0x250 [ 459.374780] blkdev_open+0x202/0x290 [ 459.375325] do_dentry_open+0x49e/0x1050 [ 459.375924] ? blkdev_get_by_dev+0x70/0x70 [ 459.376543] ? __x64_sys_fchdir+0x1f0/0x1f0 [ 459.377192] ? inode_permission+0xbe/0x3a0 [ 459.377818] path_openat+0x148c/0x3f50 [ 459.378392] ? kmem_cache_alloc+0xd5/0x280 [ 459.379016] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 459.379802] ? path_lookupat.isra.0+0x900/0x900 [ 459.380489] ? __lock_is_held+0xad/0x140 [ 459.381093] do_filp_open+0x1a1/0x280 [ 459.381654] ? may_open_dev+0xf0/0xf0 [ 459.382214] ? find_held_lock+0x2c/0x110 [ 459.382816] ? lock_downgrade+0x680/0x680 [ 459.383425] ? __lock_is_held+0xad/0x140 [ 459.384024] ? do_raw_spin_unlock+0x4f/0x250 [ 459.384668] ? _raw_spin_unlock+0x1f/0x30 [ 459.385280] ? __alloc_fd+0x448/0x560 [ 459.385841] do_sys_open+0x3c3/0x500 [ 459.386386] ? filp_open+0x70/0x70 [ 459.386911] ? trace_hardirqs_on_thunk+0x1a/0x1c [ 459.387610] ? trace_hardirqs_off_caller+0x55/0x1c0 [ 459.388342] ? do_syscall_64+0x1a/0x520 [ 459.388930] do_syscall_64+0xc3/0x520 [ 459.389490] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 459.390248] RIP: 0033:0x416211 [ 459.390720] Code: 75 14 b8 02 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 04 19 00 00 c3 48 83 ec 08 e8 0a fa ff ff 48 89 04 24 b8 02 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 53 fa ff ff 48 89 d0 48 83 c4 08 48 3d 01 [ 459.393483] RSP: 002b:00007fe45dfe9a60 EFLAGS: 00000293 ORIG_RAX: 0000000000000002 [ 459.394610] RAX: ffffffffffffffda RBX: 00007fe45dfea6d4 RCX: 0000000000416211 [ 459.395678] RDX: 00007fe45dfe9b0a RSI: 0000000000000002 RDI: 00007fe45dfe9b00 [ 459.396758] RBP: 000000000076bf20 R08: 0000000000000000 R09: 000000000000000a [ 459.397930] R10: 0000000000000075 R11: 0000000000000293 R12: 00000000ffffffff [ 459.399022] R13: 0000000000000bd9 R14: 00000000004cdb80 R15: 000000000076bf2c [ 459.400168] [ 459.400430] Allocated by task 20132: [ 459.401038] kasan_kmalloc+0xbf/0xe0 [ 459.401652] kmem_cache_alloc+0xd5/0x280 [ 459.402330] bdev_alloc_inode+0x18/0x40 [ 459.402970] alloc_inode+0x5f/0x180 [ 459.403510] iget5_locked+0x57/0xd0 [ 459.404095] bdget+0x94/0x4e0 [ 459.404607] bd_acquire+0xfa/0x2c0 [ 459.405113] blkdev_open+0x110/0x290 [ 459.405702] do_dentry_open+0x49e/0x1050 [ 459.406340] path_openat+0x148c/0x3f50 [ 459.406926] do_filp_open+0x1a1/0x280 [ 459.407471] do_sys_open+0x3c3/0x500 [ 459.408010] do_syscall_64+0xc3/0x520 [ 459.408572] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 459.409415] [ 459.409679] Freed by task 1262: [ 459.410212] __kasan_slab_free+0x129/0x170 [ 459.410919] kmem_cache_free+0xb2/0x2a0 [ 459.411564] rcu_process_callbacks+0xbb2/0x2320 [ 459.412318] __do_softirq+0x225/0x8ac Fix this by delaying bdput() to the end of blkdev_get() which means we have finished accessing bdev. Fixes: `77ea887e43` ("implement in-kernel gendisk events handling") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <ming.lei@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-16 10:33:12 -06:00
David Howells	7c295eec1e	afs: afs_vnode_commit_status() doesn't need to check the RPC error afs_vnode_commit_status() is only ever called if op->error is 0, so remove the op->error checks from the function. Fixes: `e49c7b2f6d` ("afs: Build an abstraction around an "operation" concept") Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-16 16:26:57 +01:00
David Howells	728279a5a1	afs: Fix use of afs_check_for_remote_deletion() afs_check_for_remote_deletion() checks to see if error ENOENT is returned by the server in response to an operation and, if so, marks the primary vnode as having been deleted as the FID is no longer valid. However, it's being called from the operation success functions, where no abort has happened - and if an inline abort is recorded, it's handled by afs_vnode_commit_status(). Fix this by actually calling the operation aborted method if provided and having that point to afs_check_for_remote_deletion(). Fixes: `e49c7b2f6d` ("afs: Build an abstraction around an "operation" concept") Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-16 16:26:57 +01:00
David Howells	44767c3531	afs: Remove afs_operation::abort_code Remove afs_operation::abort_code as it's read but never set. Use ac.abort_code instead. Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-16 16:26:57 +01:00
David Howells	9bd87ec631	afs: Fix yfs_fs_fetch_status() to honour vnode selector Fix yfs_fs_fetch_status() to honour the vnode selector in op->fetch_status.which as does afs_fs_fetch_status() that allows afs_do_lookup() to use this as an alternative to the InlineBulkStatus RPC call if not implemented by the server. This doesn't matter in the current code as YFS servers always implement InlineBulkStatus, but a subsequent will call it on YFS servers too in some circumstances. Fixes: `e49c7b2f6d` ("afs: Build an abstraction around an "operation" concept") Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-16 16:26:57 +01:00
David Howells	6c85cacc8c	afs: Remove yfs_fs_fetch_file_status() as it's not used Remove yfs_fs_fetch_file_status() as it's no longer used. Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-16 16:26:57 +01:00
Mel Gorman	e9c15badbb	fs: Do not check if there is a fsnotify watcher on pseudo inodes The kernel uses internal mounts created by kern_mount() and populated with files with no lookup path by alloc_file_pseudo() for a variety of reasons. An example of such a mount is for anonymous pipes. For pipes, every vfs_write() regardless of filesystem, calls fsnotify_modify() to notify of any changes which incurs a small amount of overhead in fsnotify even when there are no watchers. It can also trigger for reads and readv and writev, it was simply vfs_write() that was noticed first. A patch is pending that reduces, but does not eliminate, the overhead of fsnotify but for files that cannot be looked up via a path, even that small overhead is unnecessary. The user API for all notification subsystems (inotify, fanotify, ...) is based on the pathname and a dirfd and proc entries appear to be the only visible representation of the files. Proc does not have the same pathname as the internal entry and the proc inode is not the same as the internal inode so even if fanotify is used on a file under /proc/XX/fd, no useful events are notified. This patch changes alloc_file_pseudo() to always opt out of fsnotify by setting FMODE_NONOTIFY flag so that no check is made for fsnotify watchers on pseudo files. This should be safe as the underlying helper for the dentry is d_alloc_pseudo() which explicitly states that no lookups are ever performed meaning that fanotify should have nothing useful to attach to. The test motivating this was "perf bench sched messaging --pipe". On a single-socket machine using threads the difference of the patch was as follows. 5.7.0 5.7.0 vanilla nofsnotify-v1r1 Amean 1 1.3837 ( 0.00%) 1.3547 ( 2.10%) Amean 3 3.7360 ( 0.00%) 3.6543 ( 2.19%) Amean 5 5.8130 ( 0.00%) 5.7233 * 1.54%* Amean 7 8.1490 ( 0.00%) 7.9730 * 2.16%* Amean 12 14.6843 ( 0.00%) 14.1820 ( 3.42%) Amean 18 21.8840 ( 0.00%) 21.7460 ( 0.63%) Amean 24 28.8697 ( 0.00%) 29.1680 ( -1.03%) Amean 30 36.0787 ( 0.00%) 35.2640 * 2.26%* Amean 32 38.0527 ( 0.00%) 38.1223 ( -0.18%) The difference is small but in some cases it's outside the noise so while marginal, there is still some small benefit to ignoring fsnotify for files allocated via alloc_file_pseudo() in some cases. Link: https://lore.kernel.org/r/20200615121358.GF3183@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-06-16 09:40:45 +02:00
Gustavo A. R. Silva	b2b32e3aa0	Squashfs: Replace zero-length array with flexible-array There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://github.com/KSPP/linux/issues/21 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>	2020-06-15 23:08:32 -05:00
Gustavo A. R. Silva	6112bad79f	jffs2: Replace zero-length array with flexible-array There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://github.com/KSPP/linux/issues/21 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>	2020-06-15 23:08:31 -05:00
Gustavo A. R. Silva	241cb28e38	aio: Replace zero-length array with flexible-array There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://github.com/KSPP/linux/issues/21 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>	2020-06-15 23:08:25 -05:00
Linus Torvalds	3be20b6fc1	This is the second round of ext4 commits for 5.8 merge window. It includes the per-inode DAX support, which was dependant on the DAX infrastructure which came in via the XFS tree, and a number of regression and bug fixes; most notably the "BUG: using smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported by syzkaller. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl7mgCcACgkQ8vlZVpUN gaPftwf8C4w/7SG+CYLdwg0d2u9TKk77yDuWaioFHOcMSjZvG4TCSgtMhZxQnyty 9t4yqacILx12pCj/mZnrZp5BOSn9O2ZbuDoXNKNrFXU0BF+CsbnhvJvrrh1j/MUa PPtcqyGFdOLSDvHSD9xPVT76juwh79aR8vB7qnQXaEO5wcLodZWoqBEFSKCl6Bo8 hjXs1EXidusKsoarQxW6mEITmnhU2S2fuCVDgVcoM/LmKwzbgqvlWrentq9u8qLH W+XbjWgUtCM1byeDZWqe5FYyyJ8x+dTv7H5an3KR92EN6hKo5AOvzA0I41pZscq/ bJ9p2THDxJQX4rJBevGAS5mZ6hTkRw== =z6eO -----END PGP SIGNATURE----- Merge tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull more ext4 updates from Ted Ts'o: "This is the second round of ext4 commits for 5.8 merge window [1]. It includes the per-inode DAX support, which was dependant on the DAX infrastructure which came in via the XFS tree, and a number of regression and bug fixes; most notably the "BUG: using smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported by syzkaller" [1] The pull request actually came in 15 minutes after I had tagged the rc1 release. Tssk, tssk, late.. - Linus * tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error handlers ext4: support xattr gnu.* namespace for the Hurd ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr ext4: avoid utf8_strncasecmp() with unstable name ext4: stop overwrite the errcode in ext4_setup_super ext4: fix partial cluster initialization when splitting extent ext4: avoid race conditions when remounting with options that change dax Documentation/dax: Update DAX enablement for ext4 fs/ext4: Introduce DAX inode flag fs/ext4: Remove jflag variable fs/ext4: Make DAX mount option a tri-state fs/ext4: Only change S_DAX on inode load fs/ext4: Update ext4_should_use_dax() fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS fs/ext4: Disallow verity if inode is DAX fs/ext4: Narrow scope of DAX check in setflags	2020-06-15 09:32:10 -07:00
Pavel Begunkov	801dd57bd1	io_uring: cancel by ->task not pid For an exiting process it tries to cancel all its inflight requests. Use req->task to match such instead of work.pid. We always have req->task set, and it will be valid because we're matching only current exiting task. Also, remove work.pid and everything related, it's useless now. Reported-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-15 08:51:38 -06:00
Pavel Begunkov	4dd2824d6d	io_uring: lazy get task There will be multiple places where req->task is used, so refcount-pin it lazily with introduced *io_{get,put}_req_task(). We need to always have valid ->task for cancellation reasons, but don't care about pinning it in some cases. That's why it sets req->task in io_req_init() and implements get/put laziness with a flag. This also removes using @current from polling io_arm_poll_handler(), etc., but doesn't change observable behaviour. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-15 08:51:35 -06:00
Pavel Begunkov	67c4d9e693	io_uring: batch cancel in io_uring_cancel_files() Instead of waiting for each request one by one, first try to cancel all of them in a batched manner, and then go over inflight_list/etc to reap leftovers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-15 08:51:34 -06:00
Pavel Begunkov	44e728b8aa	io_uring: cancel all task's requests on exit If a process is going away, io_uring_flush() will cancel only 1 request with a matching pid. Cancel all of them Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-15 08:51:34 -06:00
Pavel Begunkov	4f26bda152	io-wq: add an option to cancel all matched reqs This adds support for cancelling all io-wq works matching a predicate. It isn't used yet, so no change in observable behaviour. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-15 08:51:34 -06:00
Pavel Begunkov	f4c2665e33	io-wq: reorder cancellation pending -> running Go all over all pending lists and cancel works there, and only then try to match running requests. No functional changes here, just a preparation for bulk cancellation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-15 08:51:33 -06:00
David Howells	4ec89596d0	afs: Fix the mapping of the UAEOVERFLOW abort code Abort code UAEOVERFLOW is returned when we try and set a time that's out of range, but it's currently mapped to EREMOTEIO by the default case. Fix UAEOVERFLOW to map instead to EOVERFLOW. Found with the generic/258 xfstest. Note that the test is wrong as it assumes that the filesystem will support a pre-UNIX-epoch date. Fixes: `1eda8bab70` ("afs: Add support for the UAE error table") Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-15 15:41:03 +01:00
David Howells	793fe82ee3	afs: Fix truncation issues and mmap writeback size Fix the following issues: (1) Fix writeback to reduce the size of a store operation to i_size, effectively discarding the extra data. The problem comes when afs_page_mkwrite() records that a page is about to be modified by mmap(). It doesn't know what bits of the page are going to be modified, so it records the whole page as being dirty (this is stored in page->private as start and end offsets). Without this, the marshalling for the store to the server extends the size of the file to the end of the page (in afs_fs_store_data() and yfs_fs_store_data()). (2) Fix setattr to actually truncate the pagecache, thereby clearing the discarded part of a file. (3) Fix setattr to check that the new size is okay and to disable ATTR_SIZE if i_size wouldn't change. (4) Force i_size to be updated as the result of a truncate. (5) Don't truncate if ATTR_SIZE is not set. (6) Call pagecache_isize_extended() if the file was enlarged. Note that truncate_set_size() isn't used because the setting of i_size is done inside afs_vnode_commit_status() under the vnode->cb_lock. Found with the generic/029 and generic/393 xfstests. Fixes: `31143d5d51` ("AFS: implement basic file write support") Fixes: `4343d00872` ("afs: Get rid of the afs_writeback record") Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-15 15:41:02 +01:00
David Howells	da8d075512	afs: Concoct ctimes The in-kernel afs filesystem ignores ctime because the AFS fileserver protocol doesn't support ctimes. This, however, causes various xfstests to fail. Work around this by: (1) Setting ctime to attr->ia_ctime in afs_setattr(). (2) Not ignoring ATTR_MTIME_SET, ATTR_TIMES_SET and ATTR_TOUCH settings. (3) Setting the ctime from the server mtime when on the target file when creating a hard link to it. (4) Setting the ctime on directories from their revised mtimes when renaming/moving a file. Found by the generic/221 and generic/309 xfstests. Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-15 15:41:02 +01:00
David Howells	3f4aa98181	afs: Fix EOF corruption When doing a partial writeback, afs_write_back_from_locked_page() may generate an FS.StoreData RPC request that writes out part of a file when a file has been constructed from pieces by doing seek, write, seek, write, ... as is done by ld. The FS.StoreData RPC is given the current i_size as the file length, but the server basically ignores it unless the data length is 0 (in which case it's just a truncate operation). The revised file length returned in the result of the RPC may then not reflect what we suggested - and this leads to i_size getting moved backwards - which causes issues later. Fix the client to take account of this by ignoring the returned file size unless the data version number jumped unexpectedly - in which case we're going to have to clear the pagecache and reload anyway. This can be observed when doing a kernel build on an AFS mount. The following pair of commands produce the issue: ld -m elf_x86_64 -z max-page-size=0x200000 --emit-relocs \ -T arch/x86/realmode/rm/realmode.lds \ arch/x86/realmode/rm/header.o \ arch/x86/realmode/rm/trampoline_64.o \ arch/x86/realmode/rm/stack.o \ arch/x86/realmode/rm/reboot.o \ -o arch/x86/realmode/rm/realmode.elf arch/x86/tools/relocs --realmode \ arch/x86/realmode/rm/realmode.elf \ >arch/x86/realmode/rm/realmode.relocs This results in the latter giving: Cannot read ELF section headers 0/18: Success as the realmode.elf file got corrupted. The sequence of events can also be driven with: xfs_io -t -f \ -c "pwrite -S 0x58 0 0x58" \ -c "pwrite -S 0x59 10000 1000" \ -c "close" \ /afs/example.com/scratch/a Fixes: `31143d5d51` ("AFS: implement basic file write support") Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-15 15:41:02 +01:00
David Howells	1f32ef7989	afs: afs_write_end() should change i_size under the right lock Fix afs_write_end() to change i_size under vnode->cb_lock rather than ->wb_lock so that it doesn't race with afs_vnode_commit_status() and afs_getattr(). The ->wb_lock is only meant to guard access to ->wb_keys which isn't accessed by that piece of code. Fixes: `4343d00872` ("afs: Get rid of the afs_writeback record") Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-15 15:41:02 +01:00
David Howells	bb41348928	afs: Fix non-setting of mtime when writing into mmap The mtime on an inode needs to be updated when a write is made into an mmap'ed section. There are three ways in which this could be done: update it when page_mkwrite is called, update it when a page is changed from dirty to writeback or leave it to the server and fix the mtime up from the reply to the StoreData RPC. Found with the generic/215 xfstest. Fixes: `1cf7a1518a` ("afs: Implement shared-writeable mmap") Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-15 15:41:02 +01:00
Pavel Begunkov	59960b9deb	io_uring: fix lazy work init Don't leave garbage in req.work before punting async on -EAGAIN in io_iopoll_queue(). [ 140.922099] general protection fault, probably for non-canonical address 0xdead000000000100: 0000 [#1] PREEMPT SMP PTI ... [ 140.922105] RIP: 0010:io_worker_handle_work+0x1db/0x480 ... [ 140.922114] Call Trace: [ 140.922118] ? __next_timer_interrupt+0xe0/0xe0 [ 140.922119] io_wqe_worker+0x2a9/0x360 [ 140.922121] ? _raw_spin_unlock_irqrestore+0x24/0x40 [ 140.922124] kthread+0x12c/0x170 [ 140.922125] ? io_worker_handle_work+0x480/0x480 [ 140.922126] ? kthread_park+0x90/0x90 [ 140.922127] ret_from_fork+0x22/0x30 Fixes: `7cdaf587de` ("io_uring: avoid whole io_wq_work copy for requests completed inline") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-15 08:37:55 -06:00
Tony Luck	4353f03317	efivarfs: Don't return -EINTR when rate-limiting reads Applications that read EFI variables may see a return value of -EINTR if they exceed the rate limit and a signal delivery is attempted while the process is sleeping. This is quite surprising to the application, which probably doesn't have code to handle it. Change the interruptible sleep to a non-interruptible one. Reported-by: Lennart Poettering <mzxreary@0pointer.de> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200528194905.690-3-tony.luck@intel.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org>	2020-06-15 14:38:56 +02:00
Tony Luck	2096721f15	efivarfs: Update inode modification time for successful writes Some applications want to be able to see when EFI variables have been updated. Update the modification time for successful writes. Reported-by: Lennart Poettering <mzxreary@0pointer.de> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200528194905.690-2-tony.luck@intel.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org>	2020-06-15 14:38:56 +02:00
Jan Kara	5fcd57505c	writeback: Drop I_DIRTY_TIME_EXPIRE The only use of I_DIRTY_TIME_EXPIRE is to detect in __writeback_single_inode() that inode got there because flush worker decided it's time to writeback the dirty inode time stamps (either because we are syncing or because of age). However we can detect this directly in __writeback_single_inode() and there's no need for the strange propagation with I_DIRTY_TIME_EXPIRE flag. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2020-06-15 09:18:46 +02:00
Jan Kara	f9cae926f3	writeback: Fix sync livelock due to b_dirty_time processing When we are processing writeback for sync(2), move_expired_inodes() didn't set any inode expiry value (older_than_this). This can result in writeback never completing if there's steady stream of inodes added to b_dirty_time list as writeback rechecks dirty lists after each writeback round whether there's more work to be done. Fix the problem by using sync(2) start time is inode expiry value when processing b_dirty_time list similarly as for ordinarily dirtied inodes. This requires some refactoring of older_than_this handling which simplifies the code noticeably as a bonus. Fixes: `0ae45f63d4` ("vfs: add support for a lazytime mount option") CC: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2020-06-15 09:18:45 +02:00
Jan Kara	5afced3bf2	writeback: Avoid skipping inode writeback Inode's i_io_list list head is used to attach inode to several different lists - wb->{b_dirty, b_dirty_time, b_io, b_more_io}. When flush worker prepares a list of inodes to writeback e.g. for sync(2), it moves inodes to b_io list. Thus it is critical for sync(2) data integrity guarantees that inode is not requeued to any other writeback list when inode is queued for processing by flush worker. That's the reason why writeback_single_inode() does not touch i_io_list (unless the inode is completely clean) and why __mark_inode_dirty() does not touch i_io_list if I_SYNC flag is set. However there are two flaws in the current logic: 1) When inode has only I_DIRTY_TIME set but it is already queued in b_io list due to sync(2), concurrent __mark_inode_dirty(inode, I_DIRTY_SYNC) can still move inode back to b_dirty list resulting in skipping writeback of inode time stamps during sync(2). 2) When inode is on b_dirty_time list and writeback_single_inode() races with __mark_inode_dirty() like: writeback_single_inode() __mark_inode_dirty(inode, I_DIRTY_PAGES) inode->i_state \|= I_SYNC __writeback_single_inode() inode->i_state \|= I_DIRTY_PAGES; if (inode->i_state & I_SYNC) bail if (!(inode->i_state & I_DIRTY_ALL)) - not true so nothing done We end up with I_DIRTY_PAGES inode on b_dirty_time list and thus standard background writeback will not writeback this inode leading to possible dirty throttling stalls etc. (thanks to Martijn Coenen for this analysis). Fix these problems by tracking whether inode is queued in b_io or b_more_io lists in a new I_SYNC_QUEUED flag. When this flag is set, we know flush worker has queued inode and we should not touch i_io_list. On the other hand we also know that once flush worker is done with the inode it will requeue the inode to appropriate dirty list. When I_SYNC_QUEUED is not set, __mark_inode_dirty() can (and must) move inode to appropriate dirty list. Reported-by: Martijn Coenen <maco@android.com> Reviewed-by: Martijn Coenen <maco@android.com> Tested-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixes: `0ae45f63d4` ("vfs: add support for a lazytime mount option") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>	2020-06-15 09:18:45 +02:00
Jan Kara	b35250c081	writeback: Protect inode->i_io_list with inode->i_lock Currently, operations on inode->i_io_list are protected by wb->list_lock. In the following patches we'll need to maintain consistency between inode->i_state and inode->i_io_list so change the code so that inode->i_lock protects also all inode's i_io_list handling. Reviewed-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> CC: stable@vger.kernel.org # Prerequisite for "writeback: Avoid skipping inode writeback" Signed-off-by: Jan Kara <jack@suse.cz>	2020-06-15 09:18:11 +02:00
Al Viro	067c054fb9	dlmfs: clean up dlmfs_file_{read,write}() a bit The damn file is constant-sized - 64 bytes. IOW, * i_size_read() is pointless * so's dynamic allocation * so's the 'size' argument of user_dlm_read_lvb() * ... and so's open-coding simple_read_from_buffer(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-06-14 19:04:42 -04:00
Linus Torvalds	9d645db853	for-5.8-part2-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl7lZwgACgkQxWXV+ddt WDuj6g/9E2JtqeO8zRMLb+Do/n5YX0dFHt+dM1AGY+nw8hb3U9Vlgc8KJa7UpZFX opl1i9QL+cJLoZMZL5xZhDouMQlum5cGVV3hLwqEPYetRF/ytw/kunWAg5o8OW1R sJxGcjyiiKpZLVx6nMjGnYjsrbOJv0HlaWfY3NCon4oQ8yQTzTPMPBevPWRM7Iqw Ssi8pA8zXCc2QoLgyk6Pe/IGeox8+z9RA2akHkJIdMWiPHm43RDF4Yx3Yl9NHHZA M+pLVKjZoejqwVaai8osBqWVw4Ypax1+CJit6iHGwJDkQyFPcMXMsOc5ZYBnT5or k/ceVMCs+ejvCK1+L30u7FQRiDqf5Fwhf/SGfq7+y83KbEjMfWOya3Lyk47fbDD4 776rSaS6ejqVklWppbaPhntSrBtPR1NaDOfi55bc9TOe+yW7Du+AsQMlEE0bTJaW eHl+A4AP/nDlo8Etn1jTWd023bzzO+iySMn3YZfK0vw3vkj3JfrCGXx6DEYipOou uEUj0jDo/rdiB5S3GdUCujjaPgm/f0wkPudTRB9lpxJas2qFU+qo2TLJhEleELwj m4laz7W7S+nUFP0LRl8O82AzBfjm+oHjWTpfdloT6JW9Da8/iuZ/x9VBWQ8mFJwX U0cR3zVqUuWcK78fZa/FFgGPBxlwUv2j+OhRGsS0/orDRlrwcXo= =5S0s -----END PGP SIGNATURE----- Merge tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This reverts the direct io port to iomap infrastructure of btrfs merged in the first pull request. We found problems in invalidate page that don't seem to be fixable as regressions or without changing iomap code that would not affect other filesystems. There are four reverts in total, but three of them are followup cleanups needed to revert `a43a67a2d7` cleanly. The result is the buffer head based implementation of direct io. Reverts are not great, but under current circumstances I don't see better options" * tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Revert "btrfs: switch to iomap_dio_rw() for dio" Revert "fs: remove dio_end_io()" Revert "btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK" Revert "btrfs: split btrfs_direct_IO to read and write part"	2020-06-14 09:47:25 -07:00
David Sterba	55e20bd12a	Revert "btrfs: switch to iomap_dio_rw() for dio" This reverts commit `a43a67a2d7`. This patch reverts the main part of switching direct io implementation to iomap infrastructure. There's a problem in invalidate page that couldn't be solved as regression in this development cycle. The problem occurs when buffered and direct io are mixed, and the ranges overlap. Although this is not recommended, filesystems implement measures or fallbacks to make it somehow work. In this case, fallback to buffered IO would be an option for btrfs (this already happens when direct io is done on compressed data), but the change would be needed in the iomap code, bringing new semantics to other filesystems. Another problem arises when again the buffered and direct ios are mixed, invalidation fails, then -EIO is set on the mapping and fsync will fail, though there's no real error. There have been discussions how to fix that, but revert seems to be the least intrusive option. Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/ Signed-off-by: David Sterba <dsterba@suse.com>	2020-06-14 01:19:02 +02:00
Linus Torvalds	f82e7b57b5	12 cifs/smb3 fixes, 2 for stable. Adds support for idsfromsid on create and chgrp/chown. Improves query info (getattr) when posix extensions negotiated. -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl7kX6EACgkQiiy9cAdy T1GdiQwAqMaDRVLZWeV5Uc0EM9AGkrWVu6F5n9nBzuKDTXCAf8aKCyiyYdMz/P20 belQA3bPG4jkLa/4Or1XfTY2OSSV4eGBlTfjHNeW2ZJ5pJWGInqCHuVco/M98om8 57JMTMZDTxN6884U+v3bBl4jDE6MqK3QS0WfA63ufd0T8ZnFOGDBn1DieJKbViyy ZckpDH0etaAxO171SV5VwzbFe9U7OeTXupD8LYEHngR7vfaFCkX6ZftYYN0aWsvs uL3p6K1kiNNxTXm0M3Hw6Gpk1nEAM9/6nOR6+TUppor+rQVJCH5F7NKQVrR92MDq Qgwldt16DP1NjOb0q5L37HIg+9kD2kshKs9CErneen6eWtcfiN0HYT35hBxVi7RT XT/dMt17wq3waoq92+RY3U4vb47QVWS6asH4/sqsTqUMWrlEYNGkEuCfeniZzJfO bxglNPVafQ5qy2DWBzsAUX/isaR06FihEKqODK+K78KGTptim/+ip9+yXGjM6ne2 lhdWspC5 =Iwqj -----END PGP SIGNATURE----- Merge tag '5.8-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6 Pull more cifs updates from Steve French: "12 cifs/smb3 fixes, 2 for stable. - add support for idsfromsid on create and chgrp/chown allowing ability to save owner information more naturally for some workloads - improve query info (getattr) when SMB3.1.1 posix extensions are negotiated by using new query info level" * tag '5.8-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6: smb3: Add debug message for new file creation with idsfromsid mount option cifs: fix chown and chgrp when idsfromsid mount option enabled smb3: allow uid and gid owners to be set on create with idsfromsid mount option smb311: Add tracepoints for new compound posix query info smb311: add support for using info level for posix extensions query smb311: Add support for lookup with posix extensions query info smb311: Add support for SMB311 query info (non-compounded) SMB311: Add support for query info using posix extensions (level 100) smb3: add indatalen that can be a non-zero value to calculation of credit charge in smb2 ioctl smb3: fix typo in mount options displayed in /proc/mounts cifs: Add get_security_type_str function to return sec type. smb3: extend fscache mount volume coherency check	2020-06-13 13:43:56 -07:00
Linus Torvalds	6adc19fd13	Kbuild updates for v5.8 (2nd) - fix build rules in binderfs sample - fix build errors when Kbuild recurses to the top Makefile - covert '---help---' in Kconfig to 'help' -----BEGIN PGP SIGNATURE----- iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl7lBuYVHG1hc2FoaXJv eUBrZXJuZWwub3JnAAoJED2LAQed4NsGHvIP/3iErjPshpg/phwH8NTCS4SFkiti BZRM+2lupSn7Qs53BTpVzIkXoHBJQZlJxlQ5HY8ScO+fiz28rKZr+b40us+je1Q+ SkvSPfwZzxjEg7lAZutznG4KgItJLWJKmDyh9T8Y8TAuG4f8WO0hKnXoAp3YorS2 zppEIxso8O5spZPjp+fF/fPbxPjIsabGK7Jp2LpSVFR5pVDHI/ycTlKQS+MFpMEx 6JIpdFRw7TkvKew1dr5uAWT5btWHatEqjSR3JeyVHv3EICTGQwHmcHK67cJzGInK T51+DT7/CpKtmRgGMiTEu/INfMzzoQAKl6Fcu+vMaShTN97Hk9DpdtQyvA6P/h3L 8GA4UBct05J7fjjIB7iUD+GYQ0EZbaFujzRXLYk+dQqEJRbhcCwvdzggGp0WvGRs 1f8/AIpgnQv8JSL/bOMgGMS5uL2dSLsgbzTdr6RzWf1jlYdI1i4u7AZ/nBrwWP+Z iOBkKsVceEoJrTbaynl3eoYqFLtWyDau+//oBc2gUvmhn8ioM5dfqBRiJjxJnPG9 /giRj6xRIqMMEw8Gg8PCG7WebfWxWyaIQwlWBbPok7DwISURK5mvOyakZL+Q25/y 6MBr2H8NEJsf35q0GTINpfZnot7NX4JXrrndJH8NIRC7HEhwd29S041xlQJdP0rs E76xsOr3hrAmBu4P =1NIT -----END PGP SIGNATURE----- Merge tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull more Kbuild updates from Masahiro Yamada: - fix build rules in binderfs sample - fix build errors when Kbuild recurses to the top Makefile - covert '---help---' in Kconfig to 'help' * tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: treewide: replace '---help---' in Kconfig files with 'help' kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables samples: binderfs: really compile this sample and fix build issues	2020-06-13 13:29:16 -07:00
Linus Torvalds	593bd5e5d3	New code for 5.8: - Fix an integer overflow problem in the unshare actor. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl7fCTEACgkQ+H93GTRK tOuQxQ//Ya/xLx9UPoZepTzjHQKl2MlYVYRfKCL60NrH6kNpvq9jyGiPg6xOXc3g KGTe23YDiuP80L3hpIZ9yj/SbJAItI8LsqHHrvVDbAdVSQdK56ajZqq3xwyvOC9u RqCkGkVzRE+nmToJQbYCSmPA446aqMWuCpmlsTbuGmjvkRKAMgBBG/66nbcplQnC eeflcVW7IdbbQ45K8QpyP4AeNMobc26B7zmWqXYeZuMxHcFsrnvld3pgke39i8Hk k0SzMenGddYfb6/FknnxHASMnqnhE7lA1YyWe7F3uDM8OwmpNIseBysqm+6tETkn DBlcpVeENNJB7ygPhqOJXmmDGnap5Y7vwhAc8jX84yuXRkd0gx5aTRIyH8cNp9lQ TRwoVY9DTUkUlMkSLpgeCFIOR5SyOW3H4xZV4PC0sJxAWtM0J3B8A5zvAjQ5kVRP 79gVRpl2OUj648nbrPRwhDBwnNZAhflRVvBh9kasteA7SAtuGJFJKZZ162Smltz2 1E9i/2CvUUartNOjKkT3qPzAF6B1Je3AGTMwuDPhcYX9bdW+9pCD09yi1CiGOn7S QuuwyHTAcLRtZiShNCG6zQhqq++zQCZ58J1IBHYajE73YM1+8r/5wCfTIhB+CPuf J0rjqS+d151d2qMBnK6oag0t2u5Hj+xlcJw9QnQGqPKs6yIktA0= =s+Pr -----END PGP SIGNATURE----- Merge tag 'iomap-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull iomap fix from Darrick Wong: "A single iomap bug fix for a variable type mistake on 32-bit architectures, fixing an integer overflow problem in the unshare actor" * tag 'iomap-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: Fix unsharing of an extent >2GB on a 32-bit machine	2020-06-13 12:44:30 -07:00
Linus Torvalds	c555722768	Fixes for 5.8: - Fix a resource leak on an error bailout. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl7fCpAACgkQ+H93GTRK tOtlog//ZkKRzp72HXCTgGpQj0IjkCjuZlz0F8FpdVhl9lOANaZPoXDbCIAax8q1 67wfDG7p8wl109KZnMuaPPXSC5KlynaWphSs7XMXqgLFXViha31c6U7PSMyxZmBB 674hE9eKnVNjhkMk98MtVV3ShWge9T5yGVXYhQbXMWDx8GCdNd9NEP3qnMcBEaLt EPl6yoOfdNnKo37ptrt1Qb2NgORDBDDHYPr6SX/xEYDsppDLp8u+k/YGhuoJVtdc HGR08ryIn6lctvkLbqDxtFzFxIL8Za7AHrBXilgioJYRJ78v7VyCnj1u8eT/axsa ZUis/sQXjgvSvlsGZQZkyPdtnfhFbzXCeulyQvrMnEheMuz691dljMid3fEBkfmq SubqE+HDP8aC6Zs9EkV/lEtdTH+EQ2ojZHH9s5oi6qbvilfFxyoPUfIxog+bhqPO fwl1sL2nb/eQuBF+DeHg4UxP9WzA06Z1q9nZpDjrY224aMOWnrN8TBOKv4FZiRDt M1l3VXcVsaDbCmbOsCTXdLh0Ap3przjk4hFPOjPJxlTzTNO9rPLhopvuLd+J3quA fzNNBA4bMSq3IFSg3VEC2U3YgF3anGrt8PuopIwCH8muc9agCs//fI3Y/eI4k9oT VOUPSxKckZ6SAEhIr7uTyKFzS+yNFBaYN/Y0FqDGnzbf5Bqr9NM= =C0rZ -----END PGP SIGNATURE----- Merge tag 'xfs-5.8-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fix from Darrick Wong: "We've settled down into the bugfix phase; this one fixes a resource leak on an error bailout path" * tag 'xfs-5.8-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: Add the missed xfs_perag_put() for xfs_ifree_cluster()	2020-06-13 12:40:24 -07:00
Masahiro Yamada	a7f7f6248d	treewide: replace '---help---' in Kconfig files with 'help' Since commit `84af7a6194` ("checkpatch: kconfig: prefer 'help' over '---help---'"), the number of '---help---' has been gradually decreasing, but there are still more than 2400 instances. This commit finishes the conversion. While I touched the lines, I also fixed the indentation. There are a variety of indentation styles found. a) 4 spaces + '---help---' b) 7 spaces + '---help---' c) 8 spaces + '---help---' d) 1 space + 1 tab + '---help---' e) 1 tab + '---help---' (correct indentation) f) 1 tab + 1 space + '---help---' g) 1 tab + 2 spaces + '---help---' In order to convert all of them to 1 tab + 'help', I ran the following commend: $ find . -name 'Kconfig' \| xargs sed -i 's/^[[:space:]]---help---/\thelp/' Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>	2020-06-14 01:57:21 +09:00
Linus Torvalds	6c32978414	Notifications over pipes + Keyring notifications -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7U/i8ACgkQ+7dXa6fL C2u2eg/+Oy6ybq0hPovYVkFI9WIG7ZCz7w9Q6BEnfYMqqn3dnfJxKQ3l4pnQEOWw f4QfvpvevsYfMtOJkYcG6s66rQgbFdqc5TEyBBy0QNp3acRolN7IXkcopvv9xOpQ JxedpbFG1PTFLWjvBpyjlrUPouwLzq2FXAf1Ox0ZIMw6165mYOMWoli1VL8dh0A0 Ai7JUB0WrvTNbrwhV413obIzXT/rPCdcrgbQcgrrLPex8lQ47ZAE9bq6k4q5HiwK KRzEqkQgnzId6cCNTFBfkTWsx89zZunz7jkfM5yx30MvdAtPSxvvpfIPdZRZkXsP E2K9Fk1/6OQZTC0Op3Pi/bt+hVG/mD1p0sQUDgo2MO3qlSS+5mMkR8h3mJEgwK12 72P4YfOJkuAy2z3v4lL0GYdUDAZY6i6G8TMxERKu/a9O3VjTWICDOyBUS6F8YEAK C7HlbZxAEOKTVK0BTDTeEUBwSeDrBbvH6MnRlZCG5g1Fos2aWP0udhjiX8IfZLO7 GN6nWBvK1fYzfsUczdhgnoCzQs3suoDo04HnsTPGJ8De52T4x2RsjV+gPx0nrNAq eWChl1JvMWsY2B3GLnl9XQz4NNN+EreKEkk+PULDGllrArrPsp5Vnhb9FJO1PVCU hMDJHohPiXnKbc8f4Bd78OhIvnuoGfJPdM5MtNe2flUKy2a2ops= =YTGf -----END PGP SIGNATURE----- Merge tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull notification queue from David Howells: "This adds a general notification queue concept and adds an event source for keys/keyrings, such as linking and unlinking keys and changing their attributes. Thanks to Debarshi Ray, we do have a pull request to use this to fix a problem with gnome-online-accounts - as mentioned last time: https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47 Without this, g-o-a has to constantly poll a keyring-based kerberos cache to find out if kinit has changed anything. [ There are other notification pending: mount/sb fsinfo notifications for libmount that Karel Zak and Ian Kent have been working on, and Christian Brauner would like to use them in lxc, but let's see how this one works first ] LSM hooks are included: - A set of hooks are provided that allow an LSM to rule on whether or not a watch may be set. Each of these hooks takes a different "watched object" parameter, so they're not really shareable. The LSM should use current's credentials. [Wanted by SELinux & Smack] - A hook is provided to allow an LSM to rule on whether or not a particular message may be posted to a particular queue. This is given the credentials from the event generator (which may be the system) and the watch setter. [Wanted by Smack] I've provided SELinux and Smack with implementations of some of these hooks. WHY === Key/keyring notifications are desirable because if you have your kerberos tickets in a file/directory, your Gnome desktop will monitor that using something like fanotify and tell you if your credentials cache changes. However, we also have the ability to cache your kerberos tickets in the session, user or persistent keyring so that it isn't left around on disk across a reboot or logout. Keyrings, however, cannot currently be monitored asynchronously, so the desktop has to poll for it - not so good on a laptop. This facility will allow the desktop to avoid the need to poll. DESIGN DECISIONS ================ - The notification queue is built on top of a standard pipe. Messages are effectively spliced in. The pipe is opened with a special flag: pipe2(fds, O_NOTIFICATION_PIPE); The special flag has the same value as O_EXCL (which doesn't seem like it will ever be applicable in this context)[?]. It is given up front to make it a lot easier to prohibit splice&co from accessing the pipe. [?] Should this be done some other way? I'd rather not use up a new O_* flag if I can avoid it - should I add a pipe3() system call instead? The pipe is then configured:: ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth); ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter); Messages are then read out of the pipe using read(). - It should be possible to allow write() to insert data into the notification pipes too, but this is currently disabled as the kernel has to be able to insert messages into the pipe without holding pipe->mutex and the code to make this work needs careful auditing. - sendfile(), splice() and vmsplice() are disabled on notification pipes because of the pipe->mutex issue and also because they sometimes want to revert what they just did - but one or more notification messages might've been interleaved in the ring. - The kernel inserts messages with the wait queue spinlock held. This means that pipe_read() and pipe_write() have to take the spinlock to update the queue pointers. - Records in the buffer are binary, typed and have a length so that they can be of varying size. This allows multiple heterogeneous sources to share a common buffer; there are 16 million types available, of which I've used just a few, so there is scope for others to be used. Tags may be specified when a watchpoint is created to help distinguish the sources. - Records are filterable as types have up to 256 subtypes that can be individually filtered. Other filtration is also available. - Notification pipes don't interfere with each other; each may be bound to a different set of watches. Any particular notification will be copied to all the queues that are currently watching for it - and only those that are watching for it. - When recording a notification, the kernel will not sleep, but will rather mark a queue as having lost a message if there's insufficient space. read() will fabricate a loss notification message at an appropriate point later. - The notification pipe is created and then watchpoints are attached to it, using one of: keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01); watch_mount(AT_FDCWD, "/", 0, fd, 0x02); watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03); where in both cases, fd indicates the queue and the number after is a tag between 0 and 255. - Watches are removed if either the notification pipe is destroyed or the watched object is destroyed. In the latter case, a message will be generated indicating the enforced watch removal. Things I want to avoid: - Introducing features that make the core VFS dependent on the network stack or networking namespaces (ie. usage of netlink). - Dumping all this stuff into dmesg and having a daemon that sits there parsing the output and distributing it as this then puts the responsibility for security into userspace and makes handling namespaces tricky. Further, dmesg might not exist or might be inaccessible inside a container. - Letting users see events they shouldn't be able to see. TESTING AND MANPAGES ==================== - The keyutils tree has a pipe-watch branch that has keyctl commands for making use of notifications. Proposed manual pages can also be found on this branch, though a couple of them really need to go to the main manpages repository instead. If the kernel supports the watching of keys, then running "make test" on that branch will cause the testing infrastructure to spawn a monitoring process on the side that monitors a notifications pipe for all the key/keyring changes induced by the tests and they'll all be checked off to make sure they happened. https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch - A test program is provided (samples/watch_queue/watch_test) that can be used to monitor for keyrings, mount and superblock events. Information on the notifications is simply logged to stdout" * tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: smack: Implement the watch_key and post_notification hooks selinux: Implement the watch_key security hook keys: Make the KEY_NEED_* perms an enum rather than a mask pipe: Add notification lossage handling pipe: Allow buffers to be marked read-whole-or-error for notifications Add sample notification program watch_queue: Add a key/keyring notification facility security: Add hooks to rule on setting a watch pipe: Add general notification queue support pipe: Add O_NOTIFICATION_PIPE security: Add a hook for the point of notification insertion uapi: General notification queue definitions	2020-06-13 09:56:21 -07:00
Steve French	a7a519a492	smb3: Add debug message for new file creation with idsfromsid mount option Pavel noticed that a debug message (disabled by default) in creating the security descriptor context could be useful for new file creation owner fields (as we already have for the mode) when using mount parm idsfromsid. [38120.392272] CIFS: FYI: owner S-1-5-88-1-0, group S-1-5-88-2-0 [38125.792637] CIFS: FYI: owner S-1-5-88-1-1000, group S-1-5-88-2-1000 Also cleans up a typo in a comment Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2020-06-12 16:31:06 -05:00
Linus Torvalds	44ebe016df	Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull proc fix from Eric Biederman: "Much to my surprise syzbot found a very old bug in proc that the recent changes made easier to reproce. This bug is subtle enough it looks like it fooled everyone who should know better" * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: Use new_inode not new_inode_pseudo	2020-06-12 12:38:18 -07:00
Eric W. Biederman	ef1548adad	proc: Use new_inode not new_inode_pseudo Recently syzbot reported that unmounting proc when there is an ongoing inotify watch on the root directory of proc could result in a use after free when the watch is removed after the unmount of proc when the watcher exits. Commit `69879c01a0` ("proc: Remove the now unnecessary internal mount of proc") made it easier to unmount proc and allowed syzbot to see the problem, but looking at the code it has been around for a long time. Looking at the code the fsnotify watch should have been removed by fsnotify_sb_delete in generic_shutdown_super. Unfortunately the inode was allocated with new_inode_pseudo instead of new_inode so the inode was not on the sb->s_inodes list. Which prevented fsnotify_unmount_inodes from finding the inode and removing the watch as well as made it so the "VFS: Busy inodes after unmount" warning could not find the inodes to warn about them. Make all of the inodes in proc visible to generic_shutdown_super, and fsnotify_sb_delete by using new_inode instead of new_inode_pseudo. The only functional difference is that new_inode places the inodes on the sb->s_inodes list. I wrote a small test program and I can verify that without changes it can trigger this issue, and by replacing new_inode_pseudo with new_inode the issues goes away. Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/000000000000d788c905a7dfa3f4@google.com Reported-by: syzbot+7d2debdcdb3cb93c1e5e@syzkaller.appspotmail.com Fixes: `0097875bd4` ("proc: Implement /proc/thread-self to point at the directory of the current thread") Fixes: `021ada7dff` ("procfs: switch /proc/self away from proc_dir_entry") Fixes: `51f0885e54` ("vfs,proc: guarantee unique inodes in /proc") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2020-06-12 14:13:33 -05:00
zhangyi (F)	7b97d868b7	ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error handlers In the ext4 filesystem with errors=panic, if one process is recording errno in the superblock when invoking jbd2_journal_abort() due to some error cases, it could be raced by another __ext4_abort() which is setting the SB_RDONLY flag but missing panic because errno has not been recorded. jbd2_journal_commit_transaction() jbd2_journal_abort() journal->j_flags \|= JBD2_ABORT; jbd2_journal_update_sb_errno() \| ext4_journal_check_start() \| __ext4_abort() \| sb->s_flags \|= SB_RDONLY; \| if (!JBD2_REC_ERR) \| return; journal->j_flags \|= JBD2_REC_ERR; Finally, it will no longer trigger panic because the filesystem has already been set read-only. Fix this by introduce j_abort_mutex to make sure journal abort is completed before panic, and remove JBD2_REC_ERR flag. Fixes: `4327ba52af` ("ext4, jbd2: ensure entering into panic after recording an error in superblock") Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200609073540.3810702-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-12 14:51:41 -04:00
Steve French	a660339827	cifs: fix chown and chgrp when idsfromsid mount option enabled idsfromsid was ignored in chown and chgrp causing it to fail when upcalls were not configured for lookup. idsfromsid allows mapping users when setting user or group ownership using "special SID" (reserved for this). Add support for chmod and chgrp when idsfromsid mount option is enabled. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2020-06-12 13:21:32 -05:00
Steve French	975221eca5	smb3: allow uid and gid owners to be set on create with idsfromsid mount option Currently idsfromsid mount option allows querying owner information from the special sids used to represent POSIX uids and gids but needed changes to populate the security descriptor context with the owner information when idsfromsid mount option was used. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2020-06-12 13:21:15 -05:00
Jan (janneke) Nieuwenhuizen	88ee9d571b	ext4: support xattr gnu.* namespace for the Hurd The Hurd gained[0] support for moving the translator and author fields out of the inode and into the "gnu.*" xattr namespace. In anticipation of that, an xattr INDEX was reserved[1]. The Hurd has now been brought into compliance[2] with that. This patch adds support for reading and writing such attributes from Linux; you can now do something like mkdir -p hurd-root/servers/socket touch hurd-root/servers/socket/1 setfattr --name=gnu.translator --value='"/hurd/pflocal\0"' \ hurd-root/servers/socket/1 getfattr --name=gnu.translator hurd-root/servers/socket/1 # file: 1 gnu.translator="/hurd/pflocal" to setup a pipe translator, which is being used to create[3] a vm-image for the Hurd from GNU Guix. [0] https://summerofcode.withgoogle.com/projects/#5869799859027968 [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3980bd3b406addb327d858aebd19e229ea340b9a [2] https://git.savannah.gnu.org/cgit/hurd/hurd.git/commit/?id=a04c7bf83172faa7cb080fbe3b6c04a8415ca645 [3] https://git.savannah.gnu.org/cgit/guix.git/log/?h=wip-hurd-vm Signed-off-by: Jan Nieuwenhuizen <janneke@gnu.org> Link: https://lore.kernel.org/r/20200525193940.878-1-janneke@gnu.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-12 13:23:34 -04:00
Steve French	e4bd7c4a8d	smb311: Add tracepoints for new compound posix query info Add dynamic tracepoints for new SMB3.1.1. posix extensions query info level (100) Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2020-06-12 08:55:18 -05:00
Steve French	d313852d7a	smb311: add support for using info level for posix extensions query Adds calls to the newer info level for query info using SMB3.1.1 posix extensions. The remaining two places that call the older query info (non-SMB3.1.1 POSIX) require passing in the fid and can be updated in a later patch. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2020-06-12 08:54:12 -05:00
Steve French	790434ff98	smb311: Add support for lookup with posix extensions query info Improve support for lookup when using SMB3.1.1 posix mounts. Use new info level 100 (posix query info) Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2020-06-12 06:21:19 -05:00
Steve French	b1bc1874b8	smb311: Add support for SMB311 query info (non-compounded) Add worker function for non-compounded SMB3.1.1 POSIX Extensions query info. This is needed for revalidate of root (cached) directory for example. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2020-06-12 06:21:06 -05:00
Steve French	6a5f6592a0	SMB311: Add support for query info using posix extensions (level 100) Adds support for better query info on dentry revalidation (using the SMB3.1.1 POSIX extensions level 100). Followon patch will add support for translating the UID/GID from the SID and also will add support for using the posix query info on lookup. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2020-06-12 06:20:38 -05:00
Namjae Jeon	ebf57440ec	smb3: add indatalen that can be a non-zero value to calculation of credit charge in smb2 ioctl Some of tests in xfstests failed with cifsd kernel server since commit `e80ddeb2f7`. cifsd kernel server validates credit charge from client by calculating it base on max((InputCount + OutputCount) and (MaxInputResponse + MaxOutputResponse)) according to specification. MS-SMB2 specification describe credit charge calculation of smb2 ioctl : If Connection.SupportsMultiCredit is TRUE, the server MUST validate CreditCharge based on the maximum of (InputCount + OutputCount) and (MaxInputResponse + MaxOutputResponse), as specified in section 3.3.5.2.5. If the validation fails, it MUST fail the IOCTL request with STATUS_INVALID_PARAMETER. This patch add indatalen that can be a non-zero value to calculation of credit charge in SMB2_ioctl_init(). Fixes: `e80ddeb2f7` ("smb3: fix incorrect number of credits when ioctl MaxOutputResponse > 64K") Cc: Stable <stable@vger.kernel.org> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Cc: Steve French <smfrench@gmail.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-06-12 06:20:17 -05:00
Linus Torvalds	b1a6274994	Merge branch 'akpm' (patches from Andrew) Pull updates from Andrew Morton: "A few fixes and stragglers. Subsystems affected by this patch series: mm/memory-failure, ocfs2, lib/lzo, misc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: amdgpu: a NULL ->mm does not mean a thread is a kthread lib/lzo: fix ambiguous encoding bug in lzo-rle ocfs2: fix build failure when TCP/IP is disabled mm/memory-failure: send SIGBUS(BUS_MCEERR_AR) only to current thread mm/memory-failure: prioritize prctl(PR_MCE_KILL) over vm.memory_failure_early_kill	2020-06-11 18:18:50 -07:00
Tom Seewald	fce1affe4e	ocfs2: fix build failure when TCP/IP is disabled After commit `12abc5ee78` ("tcp: add tcp_sock_set_nodelay") and commit `c488aeadcb` ("tcp: add tcp_sock_set_user_timeout"), building the kernel with OCFS2_FS=y but without INET=y causes it to fail with: ld: fs/ocfs2/cluster/tcp.o: in function `o2net_accept_many': tcp.c:(.text+0x21b1): undefined reference to `tcp_sock_set_nodelay' ld: tcp.c:(.text+0x21c1): undefined reference to `tcp_sock_set_user_timeout' ld: fs/ocfs2/cluster/tcp.o: in function `o2net_start_connect': tcp.c:(.text+0x2633): undefined reference to `tcp_sock_set_nodelay' ld: tcp.c:(.text+0x2643): undefined reference to `tcp_sock_set_user_timeout' This is due to tcp_sock_set_nodelay() and tcp_sock_set_user_timeout() being declared in linux/tcp.h and defined in net/ipv4/tcp.c, which depend on TCP/IP being enabled. To fix this, make OCFS2_FS depend on INET=y which already requires NET=y. Fixes: `12abc5ee78` ("tcp: add tcp_sock_set_nodelay") Fixes: `c488aeadcb` ("tcp: add tcp_sock_set_user_timeout") Signed-off-by: Tom Seewald <tseewald@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: David S. Miller <davem@davemloft.net> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Link: http://lkml.kernel.org/r/20200606190827.23954-1-tseewald@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-11 18:17:47 -07:00
Linus Torvalds	b961f8dc89	io_uring-5.8-2020-06-11 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7iocEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpj96EACRUW8F6Y9qibPIIYGOAdpW5vf6hdW88oan hkxOr2+y+9Odyn3WAnQtuMvmIAyOnIpVB1PiGtiXY1mmESWwbFZuxo6m1u4PiqZF rmvThcrx/o7T1hPzPJt2dUZmR6qBY2rbkGaruD14bcn36DW6fkAicZmsl7UluKTm pKE2wsxKsjGkcvElYsLYZbVm/xGe+UldaSpNFSp8b+yCAaH6eJLfhjeVC4Db7Yzn v3Liz012Xed3nmHktgXrihK8vQ1P7zOFaISJlaJ9yRK4z3VAF7wTgvZUjeYGP5FS GnUW/2p7UOsi5QkX9w2ZwPf/d0aSLZ/Va/5PjZRzAjNORMY5sjPtsfzqdKCohOhq q8qanyU1pOXRKf1cOEzU40hS81ZDRmoQRTCym6vgwHZrmVtcNnL/Af9soGrWIA8m +U6S2fpfuxeNP017HSzLHWtCGEOGYvhEc1D70mNBSIB8lElNvNVI6hWZOmxWkbKn w3O2JIfh9bl9Pk2espwZykJmzehYECP/H8wyhTlF3vBWieFF4uRucBgsmFgQmhvg NWQ7Iea49zOBt3IV3+LIRS2ulpXe7uu4WJYMa6da5o0a11ayNkngrh5QnBSSJ2rR HRUKZ9RA99A5edqyxEujDW2QABycNiYdo8ua2gYEFBvRNc9ff1l2CqWAk0n66uxE 4vj4jmVJHg== =evRQ -----END PGP SIGNATURE----- Merge tag 'io_uring-5.8-2020-06-11' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "A few late stragglers in here. In particular: - Validate full range for provided buffers (Bijan) - Fix bad use of kfree() in buffer registration failure (Denis) - Don't allow close of ring itself, it's not fully safe. Making it fully safe would require making the system call more expensive, which isn't worth it. - Buffer selection fix - Regression fix for O_NONBLOCK retry - Make IORING_OP_ACCEPT honor O_NONBLOCK (Jiufei) - Restrict opcode handling for SQ/IOPOLL (Pavel) - io-wq work handling cleanups and improvements (Pavel, Xiaoguang) - IOPOLL race fix (Xiaoguang)" * tag 'io_uring-5.8-2020-06-11' of git://git.kernel.dk/linux-block: io_uring: fix io_kiocb.flags modification race in IOPOLL mode io_uring: check file O_NONBLOCK state for accept io_uring: avoid unnecessary io_wq_work copy for fast poll feature io_uring: avoid whole io_wq_work copy for requests completed inline io_uring: allow O_NONBLOCK async retry io_wq: add per-wq work handler instead of per work io_uring: don't arm a timeout through work.func io_uring: remove custom ->func handlers io_uring: don't derive close state from ->func io_uring: use kvfree() in io_sqe_buffer_register() io_uring: validate the full range of provided buffers for access io_uring: re-set iov base/len for buffer select retry io_uring: move send/recv IOPOLL check into prep io_uring: deduplicate io_openat{,2}_prep() io_uring: do build_open_how() only once io_uring: fix {SQ,IO}POLL with unsupported opcodes io_uring: disallow close of ring itself	2020-06-11 16:10:08 -07:00
David Howells	b3597945c8	afs: Fix afs_store_data() to set mtime in new operation descriptor Fix afs_store_data() so that it sets the mtime in the new operation descriptor otherwise the mtime on the server gets set to 0 when a write is stored to the server. Fixes: `e49c7b2f6d` ("afs: Build an abstraction around an "operation" concept") Reported-by: Dave Botsch <botsch@cnf.cornell.edu> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-11 16:04:30 -07:00
Linus Torvalds	623f6dc593	Merge branch 'akpm' (patches from Andrew) Merge some more updates from Andrew Morton: - various hotfixes and minor things - hch's use_mm/unuse_mm clearnups Subsystems affected by this patch series: mm/hugetlb, scripts, kcov, lib, nilfs, checkpatch, lib, mm/debug, ocfs2, lib, misc. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: kernel: set USER_DS in kthread_use_mm kernel: better document the use_mm/unuse_mm API contract kernel: move use_mm/unuse_mm to kthread.c kernel: move use_mm/unuse_mm to kthread.c stacktrace: cleanup inconsistent variable type lib: test get_count_order/long in test_bitops.c mm: add comments on pglist_data zones ocfs2: fix spelling mistake and grammar mm/debug_vm_pgtable: fix kernel crash by checking for THP support lib: fix bitmap_parse() on 64-bit big endian archs checkpatch: correct check for kernel parameters doc nilfs2: fix null pointer dereference at nilfs_segctor_do_construct() lib/lz4/lz4_decompress.c: document deliberate use of `&' kcov: check kcov_softirq in kcov_remote_stop() scripts/spelling: add a few more typos khugepaged: selftests: fix timeout condition in wait_for_scan()	2020-06-11 13:25:53 -07:00
Linus Torvalds	a539568299	NFS Client Updates for Linux 5.8 New features and improvements: - Sunrpc receive buffer sizes only change when establishing a GSS credentials - Add more sunrpc tracepoints - Improve on tracepoints to capture internal NFS I/O errors Other bugfixes and cleanups: - Move a dprintk() to after a call to nfs_alloc_fattr() - Fix off-by-one issues in rpc_ntop6 - Fix a few coccicheck warnings - Use the correct SPDX license identifiers - Fix rpc_call_done assignment for BIND_CONN_TO_SESSION - Replace zero-length array with flexible array - Remove duplicate headers - Set invalid blocks after NFSv4 writes to update space_used attribute - Fix direct WRITE throughput regression -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl7ibyIACgkQ18tUv7Cl QOsOHBAA1A1stYld0gOhKZtMqxRJi3fnJ5mgroLGtyVQe8uAjpD8Ib1oRleC4MJq ifpYPozIhMZQCvDiGTAKJ8629OYiXGrN8D5nV6Y2tEGpu5wYv98MyZlU9Y8rVzCP 5vsIMUp5XH8y2wYO8k7fDPPxWNH9Ax89wz5OI16mZxgY/LDm4ojZq+pGbYnWZa4w oK6Efa66z7yQkPV8oIWuvLe1zZYWGAPibBEwJbrvUWyfygB3owI36sc6nuiEQM+4 hD3h5UtVn8BnudUqvLLa21rnQROMFpgYf4Q/2A1UaNfyRAPoPXMztECBSEYXO0L4 saiMc5o/yTTBCC0ZjV1F+xuGQzMgSQ83KOdbr+a+upvBeFpBynJxccdvMTDEam+q rl7Ypdc42CsTZ1aVWG/AoIk6GENzR0tXqNR6BcDjYG/yRWvnt/RIZlp6G67IbtRH b9we+3MbI/lTBoCFGahkkBYO3elTNwilxH3pWcRi8ehNn0GPjlLqHePR17Tmq1tL QycDlm7QB1m5xNsOOLaBoB4SyguPV0SBprZJ4yYU1B3KC3bGurZVK3+TSLXQrO9V 12RLDt4AOGr0TlctBIhNbkGp8xHY6Dg7HgbdjdrVq8Y9YCfg0C37789BnZA5nVxF 4L101lsTI0puymh+MwmhiyOvCldn30f+MjuWJSm17Id+eRIxYj4= =a84h -----END PGP SIGNATURE----- Merge tag 'nfs-for-5.8-1' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client updates from Anna Schumaker: "New features and improvements: - Sunrpc receive buffer sizes only change when establishing a GSS credentials - Add more sunrpc tracepoints - Improve on tracepoints to capture internal NFS I/O errors Other bugfixes and cleanups: - Move a dprintk() to after a call to nfs_alloc_fattr() - Fix off-by-one issues in rpc_ntop6 - Fix a few coccicheck warnings - Use the correct SPDX license identifiers - Fix rpc_call_done assignment for BIND_CONN_TO_SESSION - Replace zero-length array with flexible array - Remove duplicate headers - Set invalid blocks after NFSv4 writes to update space_used attribute - Fix direct WRITE throughput regression" * tag 'nfs-for-5.8-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits) NFS: Fix direct WRITE throughput regression SUNRPC: rpc_xprt lifetime events should record xprt->state xprtrdma: Make xprt_rdma_slot_table_entries static nfs: set invalid blocks after NFSv4 writes NFS: remove redundant initialization of variable result sunrpc: add missing newline when printing parameter 'auth_hashtable_size' by sysfs NFS: Add a tracepoint in nfs_set_pgio_error() NFS: Trace short NFS READs NFS: nfs_xdr_status should record the procedure name SUNRPC: Set SOFTCONN when destroying GSS contexts SUNRPC: rpc_call_null_helper() should set RPC_TASK_SOFT SUNRPC: rpc_call_null_helper() already sets RPC_TASK_NULLCREDS SUNRPC: trace RPC client lifetime events SUNRPC: Trace transport lifetime events SUNRPC: Split the xdr_buf event class SUNRPC: Add tracepoint to rpc_call_rpcerror() SUNRPC: Update the RPC_SHOW_SOCKET() macro SUNRPC: Update the rpc_show_task_flags() macro SUNRPC: Trace GSS context lifetimes SUNRPC: receive buffer size estimation values almost never change ...	2020-06-11 12:22:41 -07:00
Linus Torvalds	7cf035cc83	Third part of new DAX code for 5.8: - Teach XFS to ask the VFS to drop an inode if the administrator changes the FS_XFLAG_DAX inode flag such that the S_DAX state would change. This can result in files changing access modes without requiring an unmount cycle. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl7WezgACgkQ+H93GTRK tOvGCg//QCKK9yEBE0sjR0hjeJwmvoxAtdCGyrq+h1wZ1OI+iZCy3K4D9+/NSl/Z SajLB2bK2mJBBZmERJlRITnTwSFbHX5roMHK4NDTqRAusHdm92JxMWds5ZMXGuTD HiJ3Luu4VWHBuraRB5JKxHWyDc57xSE0kC1KwPySQCNAuDv+YS42uZMAM6284T/J hkE4odUztykN+tOZj/nY+FiRjhzLF93cghMmRlDlgmibyHGLMhrfEmaMj99eCA5O PVepVA6Vk0IHWviAWS8vqX2LADQaP1U2RP4racBAdmk+Z6kqUV+KSotJU1O+6ey/ Zfnr4VytKxmxngXhR4wnQcu3sIZqDdZRF4mcngE/G1yJXKim5iwVX+D04BsvDOpo OpND9+dOuh4ecbayfOGxp9lmBsPQBzI/qPpUcpbtsRd5xN5IyTL6MyNsSPJg2ZrG 5BaSUO+Hw6hmBcB5MLF36XBuj25QEITl/hxy66Ym2BiQT5im9a3JxRlj8OvoWEVI 1323WehvPSA1bl7m1mNQyE7/h7TjeGA6LiTplSxPqencKzXn93wcTDV09GOZ11No UEb+yC97hzRepEq4hSTr7tu18RN04ryA28/Vtcm/YeeM2bQ5WpI6jTd3b7g55BE2 EuUGAUFLZRKsvSnLhg/GbuaW442osAKuNttIL1NEyPs0Iw76Ero= =mKZf -----END PGP SIGNATURE----- Merge tag 'vfs-5.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull DAX updates part three from Darrick Wong: "Now that the xfs changes have landed, this third piece changes the FS_XFLAG_DAX ioctl code in xfs to request that the inode be reloaded after the last program closes the file, if doing so would make a S_DAX change happen. The goal here is to make dax access mode switching quicker when possible. Summary: - Teach XFS to ask the VFS to drop an inode if the administrator changes the FS_XFLAG_DAX inode flag such that the S_DAX state would change. This can result in files changing access modes without requiring an unmount cycle" * tag 'vfs-5.8-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: fs/xfs: Update xfs_ioctl_setattr_dax_invalidate() fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags() fs/xfs: Create function xfs_inode_should_enable_dax() fs/xfs: Make DAX mount option a tri-state fs/xfs: Change XFS_MOUNT_DAX to XFS_MOUNT_DAX_ALWAYS fs/xfs: Remove unnecessary initialization of i_rwsem	2020-06-11 10:48:12 -07:00
Chuck Lever	ba838a75e7	NFS: Fix direct WRITE throughput regression I measured a 50% throughput regression for large direct writes. The observed on-the-wire behavior is that the client sends every NFS WRITE twice: once as an UNSTABLE WRITE plus a COMMIT, and once as a FILE_SYNC WRITE. This is because the nfs_write_match_verf() check in nfs_direct_commit_complete() fails for every WRITE. Buffered writes use nfs_write_completion(), which sets req->wb_verf correctly. Direct writes use nfs_direct_write_completion(), which does not set req->wb_verf at all. This leaves req->wb_verf set to all zeroes for every direct WRITE, and thus nfs_direct_commit_completion() always sets NFS_ODIRECT_RESCHED_WRITES. This fix appears to restore nearly all of the lost performance. Fixes: `1f28476dcb` ("NFS: Fix O_DIRECT commit verifier handling") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-06-11 13:33:48 -04:00
Zheng Bin	3a39e77869	nfs: set invalid blocks after NFSv4 writes Use the following command to test nfsv4(size of file1M is 1MB): mount -t nfs -o vers=4.0,actimeo=60 127.0.0.1/dir1 /mnt cp file1M /mnt du -h /mnt/file1M -->0 within 60s, then 1M When write is done(cp file1M /mnt), will call this: nfs_writeback_done nfs4_write_done nfs4_write_done_cb nfs_writeback_update_inode nfs_post_op_update_inode_force_wcc_locked(change, ctime, mtime nfs_post_op_update_inode_force_wcc_locked nfs_set_cache_invalid nfs_refresh_inode_locked nfs_update_inode nfsd write response contains change, ctime, mtime, the flag will be clear after nfs_update_inode. Howerver, write response does not contain space_used, previous open response contains space_used whose value is 0, so inode->i_blocks is still 0. nfs_getattr -->called by "du -h" do_update \|= force_sync \|\| nfs_attribute_cache_expired -->false in 60s cache_validity = READ_ONCE(NFS_I(inode)->cache_validity) do_update \|= cache_validity & (NFS_INO_INVALID_ATTR -->false if (do_update) { __nfs_revalidate_inode } Within 60s, does not send getattr request to nfsd, thus "du -h /mnt/file1M" is 0. Add a NFS_INO_INVALID_BLOCKS flag, set it when nfsv4 write is done. Fixes: `16e1437517` ("NFS: More fine grained attribute tracking") Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-06-11 13:33:48 -04:00
Colin Ian King	86b936672e	NFS: remove redundant initialization of variable result The variable result is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-06-11 13:33:48 -04:00
Chuck Lever	cd2ed9bdc0	NFS: Add a tracepoint in nfs_set_pgio_error() Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-06-11 13:33:48 -04:00
Chuck Lever	fd2b612141	NFS: Trace short NFS READs A short read can generate an -EIO error without there being an error on the wire. This tracepoint acts as an eyecatcher when there is no obvious I/O error. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-06-11 13:33:48 -04:00
Chuck Lever	5be5945864	NFS: nfs_xdr_status should record the procedure name When sunrpc trace points are not enabled, the recorded task ID information alone is not helpful. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-06-11 13:33:48 -04:00
Linus Torvalds	c742b63473	Highlights: - Keep nfsd clients from unnecessarily breaking their own delegations: Note this requires a small kthreadd addition, discussed at: https://lore.kernel.org/r/1588348912-24781-1-git-send-email-bfields@redhat.com The result is Tejun Heo's suggestion, and he was OK with this going through my tree. - Patch nfsd/clients/ to display filenames, and to fix byte-order when displaying stateid's. - fix a module loading/unloading bug, from Neil Brown. - A big series from Chuck Lever with RPC/RDMA and tracing improvements, and lay some groundwork for RPC-over-TLS. Note Stephen Rothwell spotted two conflicts in linux-next. Both should be straightforward: include/trace/events/sunrpc.h https://lore.kernel.org/r/20200529105917.50dfc40f@canb.auug.org.au net/sunrpc/svcsock.c https://lore.kernel.org/r/20200529131955.26c421db@canb.auug.org.au -----BEGIN PGP SIGNATURE----- iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl7iRYwVHGJmaWVsZHNA ZmllbGRzZXMub3JnAAoJECebzXlCjuG+yx8QALIfyz/ziPgjGBnNJGCW8BjWHz7+ rGI+1SP2EUpgJ0fGJc9MpGyYTa5T3pTgsENnIRtegyZDISg2OQ5GfifpkTz4U7vg QbWRihs/W9EhltVYhKvtLASAuSAJ8ETbDfLXVb2ncY7iO6JNvb22xwsgKZILmzm1 uG4qSszmBZzpMUUy51kKJYJZ3ysP+v14qOnyOXEoeEMuJYNK9FkQ9bSPZ6wTJNOn hvZBMbU7LzRyVIvp358mFHY+vwq5qBNkJfVrZBkURGn4OxWPbWDXzqOi0Zs1oBjA L+QODIbTLGkopu/rD0r1b872PDtket7p5zsD8MreeI1vJOlt3xwqdCGlicIeNATI b0RG7sqh+pNv0mvwLxSNTf3rO0EKW6tUySqCnQZUAXFGRH0nYM2TWze4HUr2zfWT EgRMwxHY/AZUStZBuCIHPJ6inWnKuxSUELMf2a9JHO1BJc/yClRgmwJGdthVwb9u GP6F3/maFu+9YOO6iROMsqtxDA+q5vch5IBzevNOOBDEQDKqENmogR/knl9DmAhF sr+FOa3O0u6S4tgXw/TU97JS/h1L2Hu6QVEwU2iVzWtlUUOFVMZQODJTB6Lts4Ka gKzYXWvCHN+LyETsN6q7uHFg9mtO7xO5vrrIgo72SuVCscDw/8iHkoOOFLief+GE O0fR0IYjW8U1Rkn2 =YEf0 -----END PGP SIGNATURE----- Merge tag 'nfsd-5.8' of git://linux-nfs.org/~bfields/linux Pull nfsd updates from Bruce Fields: "Highlights: - Keep nfsd clients from unnecessarily breaking their own delegations. Note this requires a small kthreadd addition. The result is Tejun Heo's suggestion (see link), and he was OK with this going through my tree. - Patch nfsd/clients/ to display filenames, and to fix byte-order when displaying stateid's. - fix a module loading/unloading bug, from Neil Brown. - A big series from Chuck Lever with RPC/RDMA and tracing improvements, and lay some groundwork for RPC-over-TLS" Link: https://lore.kernel.org/r/1588348912-24781-1-git-send-email-bfields@redhat.com * tag 'nfsd-5.8' of git://linux-nfs.org/~bfields/linux: (49 commits) sunrpc: use kmemdup_nul() in gssp_stringify() nfsd: safer handling of corrupted c_type nfsd4: make drc_slab global, not per-net SUNRPC: Remove unreachable error condition in rpcb_getport_async() nfsd: Fix svc_xprt refcnt leak when setup callback client failed sunrpc: clean up properly in gss_mech_unregister() sunrpc: svcauth_gss_register_pseudoflavor must reject duplicate registrations. sunrpc: check that domain table is empty at module unload. NFSD: Fix improperly-formatted Doxygen comments NFSD: Squash an annoying compiler warning SUNRPC: Clean up request deferral tracepoints NFSD: Add tracepoints for monitoring NFSD callbacks NFSD: Add tracepoints to the NFSD state management code NFSD: Add tracepoints to NFSD's duplicate reply cache SUNRPC: svc_show_status() macro should have enum definitions SUNRPC: Restructure svc_udp_recvfrom() SUNRPC: Refactor svc_recvfrom() SUNRPC: Clean up svc_release_skb() functions SUNRPC: Refactor recvfrom path dealing with incomplete TCP receives SUNRPC: Replace dprintk() call sites in TCP receive path ...	2020-06-11 10:33:13 -07:00
Xiaoguang Wang	65a6543da3	io_uring: fix io_kiocb.flags modification race in IOPOLL mode While testing io_uring in arm, we found sometimes io_sq_thread() keeps polling io requests even though there are not inflight io requests in block layer. After some investigations, found a possible race about io_kiocb.flags, see below race codes: 1) in the end of io_write() or io_read() req->flags &= ~REQ_F_NEED_CLEANUP; kfree(iovec); return ret; 2) in io_complete_rw_iopoll() if (res != -EAGAIN) req->flags \|= REQ_F_IOPOLL_COMPLETED; In IOPOLL mode, io requests still maybe completed by interrupt, then above codes are not safe, concurrent modifications to req->flags, which is not protected by lock or is not atomic modifications. I also had disassemble io_complete_rw_iopoll() in arm: req->flags \|= REQ_F_IOPOLL_COMPLETED; 0xffff000008387b18 <+76>: ldr w0, [x19,#104] 0xffff000008387b1c <+80>: orr w0, w0, #0x1000 0xffff000008387b20 <+84>: str w0, [x19,#104] Seems that the "req->flags \|= REQ_F_IOPOLL_COMPLETED;" is load and modification, two instructions, which obviously is not atomic. To fix this issue, add a new iopoll_completed in io_kiocb to indicate whether io request is completed. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-11 09:45:21 -06:00
Ritesh Harjani	8119853653	ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr Simplify reading a seq variable by directly using this_cpu_read API instead of doing this_cpu_ptr and then dereferencing it. This also avoid the below kernel BUG: which happens when CONFIG_DEBUG_PREEMPT is enabled BUG: using smp_processor_id() in preemptible [00000000] code: syz-fuzzer/6927 caller is ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711 CPU: 1 PID: 6927 Comm: syz-fuzzer Not tainted 5.7.0-next-20200602-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x18f/0x20d lib/dump_stack.c:118 check_preemption_disabled+0x20d/0x220 lib/smp_processor_id.c:48 ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711 ext4_ext_map_blocks+0x201b/0x33e0 fs/ext4/extents.c:4244 ext4_map_blocks+0x4cb/0x1640 fs/ext4/inode.c:626 ext4_getblk+0xad/0x520 fs/ext4/inode.c:833 ext4_bread+0x7c/0x380 fs/ext4/inode.c:883 ext4_append+0x153/0x360 fs/ext4/namei.c:67 ext4_init_new_dir fs/ext4/namei.c:2757 [inline] ext4_mkdir+0x5e0/0xdf0 fs/ext4/namei.c:2802 vfs_mkdir+0x419/0x690 fs/namei.c:3632 do_mkdirat+0x21e/0x280 fs/namei.c:3655 do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:359 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 42f56b7a4a7d ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling") Suggested-by: Borislav Petkov <bp@alien8.de> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reported-by: syzbot+82f324bb69744c5f6969@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/534f275016296996f54ecf65168bb3392b6f653d.1591699601.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-11 11:03:26 -04:00
Eric Biggers	2ce3ee931a	ext4: avoid utf8_strncasecmp() with unstable name If the dentry name passed to ->d_compare() fits in dentry::d_iname, then it may be concurrently modified by a rename. This can cause undefined behavior (possibly out-of-bounds memory accesses or crashes) in utf8_strncasecmp(), since fs/unicode/ isn't written to handle strings that may be concurrently modified. Fix this by first copying the filename to a stack buffer if needed. This way we get a stable snapshot of the filename. Fixes: `b886ee3e77` ("ext4: Support case-insensitive file name lookups") Cc: <stable@vger.kernel.org> # v5.2+ Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Daniel Rosenberg <drosen@google.com> Cc: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20200601200543.59417-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-11 11:01:33 -04:00
yangerkun	5adaccac46	ext4: stop overwrite the errcode in ext4_setup_super Now the errcode from ext4_commit_super will overwrite EROFS exists in ext4_setup_super. Actually, no need to call ext4_commit_super since we will return EROFS. Fix it by goto done directly. Fixes: `c89128a008` ("ext4: handle errors on ext4_commit_super") Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200601073404.3712492-1-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-11 10:59:38 -04:00
Jeffle Xu	cfb3c85a60	ext4: fix partial cluster initialization when splitting extent Fix the bug when calculating the physical block number of the first block in the split extent. This bug will cause xfstests shared/298 failure on ext4 with bigalloc enabled occasionally. Ext4 error messages indicate that previously freed blocks are being freed again, and the following fsck will fail due to the inconsistency of block bitmap and bg descriptor. The following is an example case: 1. First, Initialize a ext4 filesystem with cluster size '16K', block size '4K', in which case, one cluster contains four blocks. 2. Create one file (e.g., xxx.img) on this ext4 filesystem. Now the extent tree of this file is like: ... 36864:[0]4:220160 36868:[0]14332:145408 51200:[0]2:231424 ... 3. Then execute PUNCH_HOLE fallocate on this file. The hole range is like: .. ext4_ext_remove_space: dev 254,16 ino 12 since 49506 end 49506 depth 1 ext4_ext_remove_space: dev 254,16 ino 12 since 49544 end 49546 depth 1 ext4_ext_remove_space: dev 254,16 ino 12 since 49605 end 49607 depth 1 ... 4. Then the extent tree of this file after punching is like ... 49507:[0]37:158047 49547:[0]58:158087 ... 5. Detailed procedure of punching hole [49544, 49546] 5.1. The block address space: ``` lblk ~49505 49506 49507~49543 49544~49546 49547~ ---------+------+-------------+----------------+-------- extent \| hole \| extent \| hole \| extent ---------+------+-------------+----------------+-------- pblk ~158045 158046 158047~158083 158084~158086 158087~ ``` 5.2. The detailed layout of cluster 39521: ``` cluster 39521 <-------------------------------> hole extent <----------------------><-------- lblk 49544 49545 49546 49547 +-------+-------+-------+-------+ \| \| \| \| \| +-------+-------+-------+-------+ pblk 158084 1580845 158086 158087 ``` 5.3. The ftrace output when punching hole [49544, 49546]: - ext4_ext_remove_space (start 49544, end 49546) - ext4_ext_rm_leaf (start 49544, end 49546, last_extent [49507(158047), 40], partial [pclu 39522 lblk 0 state 2]) - ext4_remove_blocks (extent [49507(158047), 40], from 49544 to 49546, partial [pclu 39522 lblk 0 state 2] - ext4_free_blocks: (block 158084 count 4) - ext4_mballoc_free (extent 1/6753/1) 5.4. Ext4 error message in dmesg: EXT4-fs error (device vdb): mb_free_blocks:1457: group 1, block 158084:freeing already freed block (bit 6753); block bitmap corrupt. EXT4-fs error (device vdb): ext4_mb_generate_buddy:747: group 1, block bitmap and bg descriptor inconsistent: 19550 vs 19551 free clusters In this case, the whole cluster 39521 is freed mistakenly when freeing pblock 158084~158086 (i.e., the first three blocks of this cluster), although pblock 158087 (the last remaining block of this cluster) has not been freed yet. The root cause of this isuue is that, the pclu of the partial cluster is calculated mistakenly in ext4_ext_remove_space(). The correct partial_cluster.pclu (i.e., the cluster number of the first block in the next extent, that is, lblock 49597 (pblock 158086)) should be 39521 rather than 39522. Fixes: `f4226d9ea4` ("ext4: fix partial cluster initialization") Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Eric Whitney <enwlinux@gmail.com> Cc: stable@kernel.org # v3.19+ Link: https://lore.kernel.org/r/1590121124-37096-1-git-send-email-jefflexu@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-11 10:57:40 -04:00
Theodore Ts'o	829b37b8cd	ext4: avoid race conditions when remounting with options that change dax Trying to change dax mount options when remounting could allow mount options to be enabled for a small amount of time, and then the mount option change would be reverted. In the case of "mount -o remount,dax", this can cause a race where files would temporarily treated as DAX --- and then not. Cc: stable@kernel.org Reported-by: syzbot+bca9799bf129256190da@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-11 10:54:07 -04:00
Theodore Ts'o	68cd44920d	Enable ext4 support for per-file/directory dax operations This adds the same per-file/per-directory DAX support for ext4 as was done for xfs, now that we finally have consensus over what the interface should be.	2020-06-11 10:51:44 -04:00
Christoph Hellwig	37c54f9bd4	kernel: set USER_DS in kthread_use_mm Some architectures like arm64 and s390 require USER_DS to be set for kernel threads to access user address space, which is the whole purpose of kthread_use_mm, but other like x86 don't. That has lead to a huge mess where some callers are fixed up once they are tested on said architectures, while others linger around and yet other like io_uring try to do "clever" optimizations for what usually is just a trivial asignment to a member in the thread_struct for most architectures. Make kthread_use_mm set USER_DS, and kthread_unuse_mm restore to the previous value instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Felipe Balbi <balbi@kernel.org> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhi Wang <zhi.a.wang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: http://lkml.kernel.org/r/20200404094101.672954-7-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-10 19:14:18 -07:00
Christoph Hellwig	f5678e7f2a	kernel: better document the use_mm/unuse_mm API contract Switch the function documentation to kerneldoc comments, and add WARN_ON_ONCE asserts that the calling thread is a kernel thread and does not have ->mm set (or has ->mm set in the case of unuse_mm). Also give the functions a kthread_ prefix to better document the use case. [hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio] Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de [sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename] Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.au Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [usb] Acked-by: Haren Myneni <haren@linux.ibm.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Felipe Balbi <balbi@kernel.org> Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhi Wang <zhi.a.wang@intel.com> Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-10 19:14:18 -07:00
Christoph Hellwig	9bf5b9eb23	kernel: move use_mm/unuse_mm to kthread.c Patch series "improve use_mm / unuse_mm", v2. This series improves the use_mm / unuse_mm interface by better documenting the assumptions, and my taking the set_fs manipulations spread over the callers into the core API. This patch (of 3): Use the proper API instead. Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de These helpers are only for use with kernel threads, and I will tie them more into the kthread infrastructure going forward. Also move the prototypes to kthread.h - mmu_context.h was a little weird to start with as it otherwise contains very low-level MM bits. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Felipe Balbi <balbi@kernel.org> Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhi Wang <zhi.a.wang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de Link: http://lkml.kernel.org/r/20200416053158.586887-1-hch@lst.de Link: http://lkml.kernel.org/r/20200404094101.672954-5-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-10 19:14:18 -07:00
Keyur Patel	cc989e7847	ocfs2: fix spelling mistake and grammar ./ocfs2/mmap.c:65: bebongs ==> belonging Signed-off-by: Keyur Patel <iamkeyur96@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Link: http://lkml.kernel.org/r/20200608014818.102358-1-iamkeyur96@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-10 19:14:18 -07:00
Ryusuke Konishi	8301c719a2	nilfs2: fix null pointer dereference at nilfs_segctor_do_construct() After commit `c3aab9a0bd` ("mm/filemap.c: don't initiate writeback if mapping has no dirty pages"), the following null pointer dereference has been reported on nilfs2: BUG: kernel NULL pointer dereference, address: 00000000000000a8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI ... RIP: 0010:percpu_counter_add_batch+0xa/0x60 ... Call Trace: __test_set_page_writeback+0x2d3/0x330 nilfs_segctor_do_construct+0x10d3/0x2110 [nilfs2] nilfs_segctor_construct+0x168/0x260 [nilfs2] nilfs_segctor_thread+0x127/0x3b0 [nilfs2] kthread+0xf8/0x130 ... This crash turned out to be caused by set_page_writeback() call for segment summary buffers at nilfs_segctor_prepare_write(). set_page_writeback() can call inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK) where inode_to_wb(inode) is NULL if the inode of underlying block device does not have an associated wb. This fixes the issue by calling inode_attach_wb() in advance to ensure to associate the bdev inode with its wb. Fixes: `c3aab9a0bd` ("mm/filemap.c: don't initiate writeback if mapping has no dirty pages") Reported-by: Walton Hoops <me@waltonhoops.com> Reported-by: Tomas Hlavaty <tom@logand.com> Reported-by: ARAI Shun-ichi <hermes@ceres.dti.ne.jp> Reported-by: Hideki EIRAKU <hdk1983@gmail.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> [5.4+] Link: http://lkml.kernel.org/r/20200608.011819.1399059588922299158.konishi.ryusuke@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-10 19:14:17 -07:00
Linus Torvalds	b29482fde6	Merge branch 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull epoll update from Al Viro: "epoll conversion to read_iter from Jens; I thought there might be more epoll stuff this cycle, but uaccess took too much time" * 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: eventfd: convert to f_op->read_iter()	2020-06-10 18:09:13 -07:00
Jiufei Xue	e697deed83	io_uring: check file O_NONBLOCK state for accept If the socket is O_NONBLOCK, we should complete the accept request with -EAGAIN when data is not ready. Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-10 18:06:16 -06:00
Xiaoguang Wang	405a5d2b27	io_uring: avoid unnecessary io_wq_work copy for fast poll feature Basically IORING_OP_POLL_ADD command and async armed poll handlers for regular commands don't touch io_wq_work, so only REQ_F_WORK_INITIALIZED is set, can we do io_wq_work copy and restore. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-10 17:58:46 -06:00
Xiaoguang Wang	7cdaf587de	io_uring: avoid whole io_wq_work copy for requests completed inline If requests can be submitted and completed inline, we don't need to initialize whole io_wq_work in io_init_req(), which is an expensive operation, add a new 'REQ_F_WORK_INITIALIZED' to determine whether io_wq_work is initialized and add a helper io_req_init_async(), users must call io_req_init_async() for the first time touching any members of io_wq_work. I use /dev/nullb0 to evaluate performance improvement in my physical machine: modprobe null_blk nr_devices=1 completion_nsec=0 sudo taskset -c 60 fio -name=fiotest -filename=/dev/nullb0 -iodepth=128 -thread -rw=read -ioengine=io_uring -direct=1 -bs=4k -size=100G -numjobs=1 -time_based -runtime=120 before this patch: Run status group 0 (all jobs): READ: bw=724MiB/s (759MB/s), 724MiB/s-724MiB/s (759MB/s-759MB/s), io=84.8GiB (91.1GB), run=120001-120001msec With this patch: Run status group 0 (all jobs): READ: bw=761MiB/s (798MB/s), 761MiB/s-761MiB/s (798MB/s-798MB/s), io=89.2GiB (95.8GB), run=120001-120001msec About 5% improvement. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-10 17:58:46 -06:00
Linus Torvalds	4dbb29fe9d	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "A couple of trivial patches that fell through the cracks last cycle" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: fix indentation in deactivate_super() vfs: Remove duplicated d_mountpoint check in __is_local_mountpoint	2020-06-10 16:09:11 -07:00
Linus Torvalds	1c38372662	Merge branch 'work.sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull sysctl fixes from Al Viro: "Fixups to regressions in sysctl series" * 'work.sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: sysctl: reject gigantic reads/write to sysctl files cdrom: fix an incorrect __user annotation on cdrom_sysctl_info trace: fix an incorrect __user annotation on stack_trace_sysctl random: fix an incorrect __user annotation on proc_do_entropy net/sysctl: remove leftover __user annotations on neigh_proc_dointvec* net/sysctl: use cpumask_parse in flow_limit_cpu_sysctl	2020-06-10 16:05:54 -07:00
Linus Torvalds	4382a79b27	Merge branch 'uaccess.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc uaccess updates from Al Viro: "Assorted uaccess patches for this cycle - the stuff that didn't fit into thematic series" * 'uaccess.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: bpf: make bpf_check_uarg_tail_zero() use check_zeroed_user() x86: kvm_hv_set_msr(): use __put_user() instead of 32bit __clear_user() user_regset_copyout_zero(): use clear_user() TEST_ACCESS_OK _never_ had been checked anywhere x86: switch cp_stat64() to unsafe_put_user() binfmt_flat: don't use __put_user() binfmt_elf_fdpic: don't use __... uaccess primitives binfmt_elf: don't bother with __{put,copy_to}_user() pselect6() and friends: take handling the combined 6th/7th args into helper	2020-06-10 16:02:54 -07:00
Linus Torvalds	79ca035d2d	Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull proc fix from Eric Biederman: "Syzbot found a NULL pointer dereference if kzalloc of s_fs_info fails" * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: s_fs_info may be NULL when proc_kill_sb is called	2020-06-10 15:00:11 -07:00
Alexey Gladkov	058f2e4da7	proc: s_fs_info may be NULL when proc_kill_sb is called syzbot found that proc_fill_super() fails before filling up sb->s_fs_info, deactivate_locked_super() will be called and sb->s_fs_info will be NULL. The proc_kill_sb() does not expect fs_info to be NULL which is wrong. Link: https://lore.kernel.org/lkml/0000000000002d7ca605a7b8b1c5@google.com Reported-by: syzbot+4abac52934a48af5ff19@syzkaller.appspotmail.com Fixes: `fa10fed30f` ("proc: allow to mount many instances of proc in one pid namespace") Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-06-10 14:54:54 -05:00
Christoph Hellwig	ef9d965bc8	sysctl: reject gigantic reads/write to sysctl files Instead of triggering a WARN_ON deep down in the page allocator just give up early on allocations that are way larger than the usual sysctl values. Fixes: `32927393dc` ("sysctl: pass kernel pointers to ->proc_handler") Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-06-10 14:11:33 -04:00
Steve French	7866c177a0	smb3: fix typo in mount options displayed in /proc/mounts Missing the final 's' in "max_channels" mount option when displayed in /proc/mounts (or by mount command) CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>	2020-06-10 12:05:15 -05:00
Jens Axboe	c5b856255c	io_uring: allow O_NONBLOCK async retry We can assume that O_NONBLOCK is always honored, even if we don't have a ->read/write_iter() for the file type. Also unify the read/write checking for allowing async punt, having the write side factoring in the REQ_F_NOWAIT flag as well. Cc: stable@vger.kernel.org Fixes: `490e89676a` ("io_uring: only force async punt if poll based retry can't handle it") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-09 19:38:24 -06:00
Linus Torvalds	5b14671be5	fuse update for 5.8 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXt/0GAAKCRDh3BK/laaZ PIJjAP48TurDqomsQMBLiOsSUy0YIhd5QC/G5MYLKSBojXoR+gD+KfqXhVIDz0En OI+K4674cNhf4CXNzUedU3qSOaJLfAU= =PqbB -----END PGP SIGNATURE----- Merge tag 'fuse-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Fix a rare deadlock in virtiofs - Fix st_blocks in writeback cache mode - Fix wrong checks in splice move causing spurious warnings - Fix a race between a GETATTR request and a FUSE_NOTIFY_INVAL_INODE notification - Use rb-tree instead of linear search for pages currently under writeout by userspace - Fix copy_file_range() inconsistencies * tag 'fuse-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: copy_file_range should truncate cache fuse: fix copy_file_range cache issues fuse: optimize writepages search fuse: update attr_version counter on fuse_notify_inval_inode() fuse: don't check refcount after stealing page fuse: fix weird page warning fuse: use dump_page virtiofs: do not use fuse_fill_super_common() for device installation fuse: always allow query of st_dev fuse: always flush dirty data on close(2) fuse: invalidate inode attr in writeback cache mode fuse: Update stale comment in queue_interrupt() fuse: BUG_ON correction in fuse_dev_splice_write() virtiofs: Add mount option and atime behavior to the doc virtiofs: schedule blocking async replies in separate worker	2020-06-09 15:48:24 -07:00
Linus Torvalds	52435c86bf	overlayfs update for 5.8 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXt9klAAKCRDh3BK/laaZ PBeeAP9GRI0yajPzBzz2ZK9KkDc6A7wPiaAec+86Q+c02VncVwEAvq5Pi4um5RTZ 7SVv56ggKO3Cqx779zVyZTRYDs3+YA4= =bpKI -----END PGP SIGNATURE----- Merge tag 'ovl-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs updates from Miklos Szeredi: "Fixes: - Resolve mount option conflicts consistently - Sync before remount R/O - Fix file handle encoding corner cases - Fix metacopy related issues - Fix an unintialized return value - Add missing permission checks for underlying layers Optimizations: - Allow multipe whiteouts to share an inode - Optimize small writes by inheriting SB_NOSEC from upper layer - Do not call ->syncfs() multiple times for sync(2) - Do not cache negative lookups on upper layer - Make private internal mounts longterm" * tag 'ovl-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (27 commits) ovl: remove unnecessary lock check ovl: make oip->index bool ovl: only pass ->ki_flags to ovl_iocb_to_rwf() ovl: make private mounts longterm ovl: get rid of redundant members in struct ovl_fs ovl: add accessor for ofs->upper_mnt ovl: initialize error in ovl_copy_xattr ovl: drop negative dentry in upper layer ovl: check permission to open real file ovl: call secutiry hook in ovl_real_ioctl() ovl: verify permissions in ovl_path_open() ovl: switch to mounter creds in readdir ovl: pass correct flags for opening real directory ovl: fix redirect traversal on metacopy dentries ovl: initialize OVL_UPPERDATA in ovl_lookup() ovl: use only uppermetacopy state in ovl_lookup() ovl: simplify setting of origin for index lookup ovl: fix out of bounds access warning in ovl_check_fb_len() ovl: return required buffer size for file handles ovl: sync dirty data when remounting to ro mode ...	2020-06-09 15:40:50 -07:00
Linus Torvalds	4964dd2914	AFS fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7fxCMACgkQ+7dXa6fL C2tUNw//VdGTqw3SstdpCqlIQptfXjxIJqFBp5QsQ95b2yHEFcmaeqLP9SWOMPZ9 588xBMj4K9iN9WZgJdJwTGj5D1XmRwiISYnDh1pzNUPH2IrnlcTXfpsZ+BSWst/P XMJ1N3ZtzNymYTi4wzxcV/SjMJd4eX75jBiJn9rgp2PzVnaCUl+a21sLvryON2Eu YOpg08IlpRYLMsuHiISrqqgy7/iUzo34RXWbhgJV69zG07xoHtBwGf/iwetbh56G lmKp3xix1Fq181HyQLOn4/QkXTIFR1kyKFDdI3HxdRXsBybSZUk9IGaf74o09c7M PTS1FsF0lnGz+61Y5mb1UzPRlJSx8CAeCBSN6KWlbm/g9WPXHuxJFwT/KVPV5T13 qEYn1U1RL8yMlbjezMDM5TXSM4IhBRxn9hQ3qbhamiDI0AT8kr55lJKbxqTO8Cvv 6ZjYfPtX0D5Z9kce4E4nWFiRWb8xRsANqWdkVsIcxp1yyZvGkHot5LaCedagF+0b 71Zs0A5YMA4VaaL7sn1xMt/tjZ2ei+5a+cLbfEOoCniE6XqpLm5ZR6WwdbzwwUbV hzUn3ByQ5eqC5tkwehNufDLFt1HOzMbnTG9eCxbpUXAH1Q8CDwuROUMf4FGjR97K eeYHcEqLhEKgKRCmGYT00K/q1Ce39rcgGZLygdbbwaigEUCjLvU= =rH8G -----END PGP SIGNATURE----- Merge tag 'afs-fixes-20200609' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS fixes from David Howells: "A set of small patches to fix some things, most of them minor. - Fix a memory leak in afs_put_sysnames() - Fix an oops in AFS file locking - Fix new use of BUG() - Fix debugging statements containing %px - Remove afs_zero_fid as it's unused - Make afs_zap_data() static" * tag 'afs-fixes-20200609' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Make afs_zap_data() static afs: Remove afs_zero_fid as it's not used afs: Fix debugging statements with %px to be %p afs: Fix use of BUG() afs: Fix file locking afs: Fix memory leak in afs_put_sysnames()	2020-06-09 15:38:46 -07:00
Linus Torvalds	42612e7763	f2fs-for-5.8-rc1 In this round, we've added some knobs to enhance compression feature and harden testing environment. In addition, we've fixed several bugs reported from Android devices such as long discarding latency, device hanging during quota_sync, etc. Enhancement: - support lzo-rle algorithm - add two ioctls to release and reserve blocks for compression - support partial truncation/fiemap on compressed file - introduce sysfs entries to attach IO flags explicitly - add iostat trace point along with read io stat Bug fix: - fix long discard latency - flush quota data by f2fs_quota_sync correctly - fix to recover parent inode number for power-cut recovery - fix lz4/zstd output buffer budget - parse checkpoint mount option correctly - avoid inifinite loop to wait for flushing node/meta pages - manage discard space correctly And some refactoring and clean up patches were added. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl7fojgACgkQQBSofoJI UNKaZQ//Rd6r7Z25SJkAoy+y/m6QDaKg4Ap1wR6+QmirR7HNxtpr3dXSVvmj4Xhu ZDJ3LHmerFiwR/X4zFPud+PAoBe3gJa2k7GT8q0g4YkgLy0hfX9PXt0t3I9F8vlk 8m34j+hQaL9/3FBK4/PSG541vR/UUnwvu6t2pJMnz7rgnLej5I6yOIaoaihz7m+i k0ofK5ckuTNcZReAZ2tCIehQku7tDOBLdS5KxvBZBgRh0i5iSXXIa4ddvaMJdT/M WcjTZ6N8bFu0hCZ5hz9dyGGYo1XchQosLdLGhcEugsyxNp9Yuftyf5/Ie1wJNiEl ZsoRc15X7wfRPKKMMyDFljzPBPFiHr78p30uJ34bcYCu0j0CYi+gbKQztmEMZ2dy 9M+sDG3jd5R7ACXrwS2ElSEDyLBnTaxbeSdCpErGjn/U19TLllbzhnMA9KR9elDI pEWgRc7DPmPbRZaStXMxIamf7pbmUSm0akAYbzGFvMHcSx4MXuQFICGK9t/mhSDm sO2b1Ir39yk65sVNdjFsnqDsi6jTPgrLSe3FY4eMhkn15OSiVGhcz7ddQMD7Fbuq WLpHFqER650I28i0EXh8bxzjkrj+aJQKhGcVbmwVS33MtKVfBdh4GfQMvS6MbeOM MsZ10E7Dr9ildKxqHP5SgLlggkl512lpj3+d6j0mUSSSUP2jtUw= =MiEC -----END PGP SIGNATURE----- Merge tag 'f2fs-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, we've added some knobs to enhance compression feature and harden testing environment. In addition, we've fixed several bugs reported from Android devices such as long discarding latency, device hanging during quota_sync, etc. Enhancements: - support lzo-rle algorithm - add two ioctls to release and reserve blocks for compression - support partial truncation/fiemap on compressed file - introduce sysfs entries to attach IO flags explicitly - add iostat trace point along with read io stat Bug fixes: - fix long discard latency - flush quota data by f2fs_quota_sync correctly - fix to recover parent inode number for power-cut recovery - fix lz4/zstd output buffer budget - parse checkpoint mount option correctly - avoid inifinite loop to wait for flushing node/meta pages - manage discard space correctly And some refactoring and clean up patches were added" * tag 'f2fs-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (51 commits) f2fs: attach IO flags to the missing cases f2fs: add node_io_flag for bio flags likewise data_io_flag f2fs: remove unused parameter of f2fs_put_rpages_mapping() f2fs: handle readonly filesystem in f2fs_ioc_shutdown() f2fs: avoid utf8_strncasecmp() with unstable name f2fs: don't return vmalloc() memory from f2fs_kmalloc() f2fs: fix retry logic in f2fs_write_cache_pages() f2fs: fix wrong discard space f2fs: compress: don't compress any datas after cp stop f2fs: remove unneeded return value of __insert_discard_tree() f2fs: fix wrong value of tracepoint parameter f2fs: protect new segment allocation in expand_inode_data f2fs: code cleanup by removing ifdef macro surrounding f2fs: avoid inifinite loop to wait for flushing node pages at cp_error f2fs: flush dirty meta pages when flushing them f2fs: fix checkpoint=disable:%u%% f2fs: compress: fix zstd data corruption f2fs: add compressed/gc data read IO stat f2fs: fix potential use-after-free issue f2fs: compress: don't handle non-compressed data in workqueue ...	2020-06-09 11:28:59 -07:00
Linus Torvalds	ad57a1022f	Description for this pull request: * Bug fixes - Fix memory leak on mount failure with iocharset= option. - Fix Incorrect update of stream entry. - Fix cluster range validation error. * Clean-up codes - Remove unused code and unneeded assignment. - Rename variables in exfat structure as specification. - Reorganize boot sector analysis code. - Simplify exfat_utf8_d_hash and exfat_utf8_d_cmp(). - Optimize exfat entry cache functions. - Improve wording of EXFAT_DEFAULT_IOCHARSET config option. * New Feature - Add boot region verification. -----BEGIN PGP SIGNATURE----- iQJMBAABCgA2FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAl7fQBwYHG5hbWphZS5q ZW9uQHNhbXN1bmcuY29tAAoJEGcL+wNRRCEIyWMP/2CqlryPilKiXj/C2n9r2s5O 7NNABC7xhyILk9fGz/mUOGohqBQXNNbZUDS17m2xbygw3vkXYN72ejDb/1DLVU8E LsYd85Pj8l7kMnOmjXKNLetoql1S3nm19PgIB7GYNI/BfeBFXcyxQdOTOlwq28w7 PkfnWhnvnIxTfbTJj6EFB5tPYDycpm32LiUSQqsAmy2i0pC9WY6w4PnJz/c8wiqe +LZkLtZ1blGSKLY6C1FotVi7OmjiRWm0e+sdPE/Rsaxb/nnL/S7Nt03GPHZMkGxm eVq5MBUadQAr61duIWKcF7dFUmqqVTAO/bgYrxB4ljd/1j1lwWwZjD7iLnbsOfOy +Go5NsDoLEySKp7JSkLJ8S6mdKsAyAf4TK8diZlIGGfF7jV6puo3h9yDk0e6U0/G E613f60O5bymQWe9STLiJwMo65M7rjzuT3WUcTFuf58LqS6UR+ngq089V4lV720N USxZu7wtO5m0j5feXY72x6E/xaL1wqbMuHr0defQZ9CN8JZKCRtthletjI8TVDOZ hxIASZacQdWkWBL4mCs3lmaflSaD32J7RxPSqnQHMxrB6UVh9lT97rQBGGnbyRyL 2Hqwe8cUk/ki6fOmpNvyIUh01S+wtgVGuAAEoKPEIKGmDw1KeAGXOpVX1NPcbZWT s7HTy7H3SfAnNAED8+Ct =Dgtx -----END PGP SIGNATURE----- Merge tag 'exfat-for-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat update from Namjae Jeon: "Bug fixes: - Fix memory leak on mount failure with iocharset= option - Fix incorrect update of stream entry - Fix cluster range validation error Clean-ups: - Remove unused code and unneeded assignment - Rename variables in exfat structure as specification - Reorganize boot sector analysis code - Simplify exfat_utf8_d_hash and exfat_utf8_d_cmp() - Optimize exfat entry cache functions - Improve wording of EXFAT_DEFAULT_IOCHARSET config option New Feature: - Add boot region verification" * tag 'exfat-for-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: Fix potential use after free in exfat_load_upcase_table() exfat: fix range validation error in alloc and free cluster exfat: fix incorrect update of stream entry in __exfat_truncate() exfat: fix memory leak in exfat_parse_param() exfat: remove unnecessary reassignment of p_uniname->name_len exfat: standardize checksum calculation exfat: add boot region verification exfat: separate the boot sector analysis exfat: redefine PBR as boot_sector exfat: optimize dir-cache exfat: replace 'time_ms' with 'time_cs' exfat: remove the assignment of 0 to bool variable exfat: Remove unused functions exfat_high_surrogate() and exfat_low_surrogate() exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF exfat: Improve wording of EXFAT_DEFAULT_IOCHARSET config option exfat: Use a more common logging style exfat: Simplify exfat_utf8_d_cmp() for code points above U+FFFF	2020-06-09 11:24:59 -07:00
David Sterba	f1084bc60a	Revert "fs: remove dio_end_io()" This reverts commit `b75b7ca7c2`. The patch restores a helper that was not necessary after direct IO port to iomap infrastructure, which gets reverted. Signed-off-by: David Sterba <dsterba@suse.com>	2020-06-09 19:23:18 +02:00
David Sterba	8e0fa5d7b3	Revert "btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK" This reverts commit `5f008163a5`. The patch is a simplification after direct IO port to iomap infrastructure, which gets reverted. Signed-off-by: David Sterba <dsterba@suse.com>	2020-06-09 19:21:48 +02:00
David Sterba	f4c48b4408	Revert "btrfs: split btrfs_direct_IO to read and write part" This reverts commit `d8f3e73587`. The patch is a cleanup of direct IO port to iomap infrastructure, which gets reverted. Signed-off-by: David Sterba <dsterba@suse.com>	2020-06-09 19:19:27 +02:00
David Howells	c68421bbad	afs: Make afs_zap_data() static Make afs_zap_data() static as it's only used in the file in which it is defined. Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-09 18:17:14 +01:00
David Howells	4a06fa5403	afs: Remove afs_zero_fid as it's not used Remove afs_zero_fid as it's not used. Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-09 18:17:14 +01:00
David Howells	fed79fd783	afs: Fix debugging statements with %px to be %p Fix a couple of %px to be %p in debugging statements. Fixes: `e49c7b2f6d` ("afs: Build an abstraction around an "operation" concept") Fixes: `8a070a9648` ("afs: Detect cell aliases 1 - Cells with root volumes") Reported-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org>	2020-06-09 18:17:14 +01:00
Linus Torvalds	595a56ac1b	linux-kselftest-kunit-5.8-rc1 This Kunit update for Linux 5.8-rc1 consists of: - Several config fragment fixes from Anders Roxell to improve test coverage. - Improvements to kunit run script to use defconfig as default and restructure the code for config/build/exec/parse from Vitor Massaru Iha and David Gow. - Miscellaneous documentation warn fix. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAl7etrcACgkQCwJExA0N QxzGYg/+KHpPhB31IAjNFKCRqwDooftst3dohhzguxJLpDHdEmVJ4moQhLr4gL+/ qpi3T9hr4Rx++n/A5NoxDvyJvGr+FAL40U+Of7F2UyHpqQmfKPj37I+yvyeR1JEL z4+yXEpfQLZaQkmZ7f3GWHyqN3+xwvyTEy7NYUad7xMxLF/99No+I6RMD6yp3srS wUUeuBIesSFT0LXYrgI+wgsNGUESlj/McjiP5eMj6UtlMgKpzmfzH56Fia8uw1pw 6QtpntxDHjtxVfp8YKM4qExI54YI2t6sgHTIoOUsMWD5Q2kHd8kNf1L+lb1sKYUF j7lzol5nuqqchAVQYjHzNHa8XKndvexGyWMsPz1gAnkpgVrvBTSFcavdDpDuDZ0T HoJZnk9XPsguBQjDxapzPYfAQ81Un/rEmZQ8/X2TaNjdSIH1hHljhaP2OZ6eND/Q iobq9x8nC9D95TIqjDbRw3Sp2na/pZLN8Gp27hmKlc+L1XzV8NuZe/WGOUe3lsrq fG1ZSLo/iRau8gHuF6fRSrGIzQSCEMGKl3jlQ28OT9HGMAgTlncEwVzQId48/AsS UOY+bIAnRZuK+B5F/vw6L3o1e3c17z5bruVlb0M0alP5b7P9/3WLNHsHA3r8haZF F6PwIWu41wdRjJf2HI7zD5LaQe/7oU3jfwvuA7n2z8Py+zGx7m4= =S+HY -----END PGP SIGNATURE----- Merge tag 'linux-kselftest-kunit-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull Kunit updates from Shuah Khan: "This consists of: - Several config fragment fixes from Anders Roxell to improve test coverage. - Improvements to kunit run script to use defconfig as default and restructure the code for config/build/exec/parse from Vitor Massaru Iha and David Gow. - Miscellaneous documentation warn fix" * tag 'linux-kselftest-kunit-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: security: apparmor: default KUNIT_* fragments to KUNIT_ALL_TESTS fs: ext4: default KUNIT_* fragments to KUNIT_ALL_TESTS drivers: base: default KUNIT_* fragments to KUNIT_ALL_TESTS lib: Kconfig.debug: default KUNIT_* fragments to KUNIT_ALL_TESTS kunit: default KUNIT_* fragments to KUNIT_ALL_TESTS kunit: Kconfig: enable a KUNIT_ALL_TESTS fragment kunit: Fix TabError, remove defconfig code and handle when there is no kunitconfig kunit: use KUnit defconfig by default kunit: use --build_dir=.kunit as default Documentation: test.h - fix warnings kunit: kunit_tool: Separate out config/build/exec/parse	2020-06-09 10:04:47 -07:00
Michel Lespinasse	c1e8d7c6a7	mmap locking API: convert mmap_sem comments Convert comments that reference mmap_sem to reference mmap_lock instead. [akpm@linux-foundation.org: fix up linux-next leftovers] [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil] [akpm@linux-foundation.org: more linux-next fixups, per Michel] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-09 09:39:14 -07:00
Michel Lespinasse	3e4e28c5a8	mmap locking API: convert mmap_sem API comments Convert comments that reference old mmap_sem APIs to reference corresponding new mmap locking APIs instead. Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-09 09:39:14 -07:00
Michel Lespinasse	42fc541404	mmap locking API: add mmap_assert_locked() and mmap_assert_write_locked() Add new APIs to assert that mmap_sem is held. Using this instead of rwsem_is_locked and lockdep_assert_held[_write] makes the assertions more tolerant of future changes to the lock type. Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-10-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-09 09:39:14 -07:00
Michel Lespinasse	89154dd531	mmap locking API: convert mmap_sem call sites missed by coccinelle Convert the last few remaining mmap_sem rwsem calls to use the new mmap locking API. These were missed by coccinelle for some reason (I think coccinelle does not support some of the preprocessor constructs in these files ?) [akpm@linux-foundation.org: convert linux-next leftovers] [akpm@linux-foundation.org: more linux-next leftovers] [akpm@linux-foundation.org: more linux-next leftovers] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-6-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-09 09:39:14 -07:00
Michel Lespinasse	d8ed45c5dc	mmap locking API: use coccinelle to convert mmap_sem rwsem call sites This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock \| -down_write +mmap_write_lock \| -down_write_killable +mmap_write_lock_killable \| -down_write_trylock +mmap_write_trylock \| -up_write +mmap_write_unlock \| -downgrade_write +mmap_write_downgrade \| -down_read +mmap_read_lock \| -down_read_killable +mmap_read_lock_killable \| -down_read_trylock +mmap_read_trylock \| -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-09 09:39:14 -07:00
Mike Rapoport	e31cf2f4ca	mm: don't include asm/pgtable.h if linux/mm.h is already included Patch series "mm: consolidate definitions of page table accessors", v2. The low level page table accessors (pXY_index(), pXY_offset()) are duplicated across all architectures and sometimes more than once. For instance, we have 31 definition of pgd_offset() for 25 supported architectures. Most of these definitions are actually identical and typically it boils down to, e.g. static inline unsigned long pmd_index(unsigned long address) { return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1); } static inline pmd_t pmd_offset(pud_t pud, unsigned long address) { return (pmd_t )pud_page_vaddr(pud) + pmd_index(address); } These definitions can be shared among 90% of the arches provided XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined. For architectures that really need a custom version there is always possibility to override the generic version with the usual ifdefs magic. These patches introduce include/linux/pgtable.h that replaces include/asm-generic/pgtable.h and add the definitions of the page table accessors to the new header. This patch (of 12): The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the functions involving page table manipulations, e.g. pte_alloc() and pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h> in the files that include <linux/mm.h>. The include statements in such cases are remove with a simple loop: for f in $(git grep -l "include <linux/mm.h>") ; do sed -i -e '/include <asm\/pgtable.h>/ d' $f done Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-09 09:39:13 -07:00
David Howells	9ca0652596	afs: Fix use of BUG() Fix afs_compare_addrs() to use WARN_ON(1) instead of BUG() and return 1 (ie. srx_a > srx_b). There's no point trying to put actual error handling in as this should not occur unless a new transport address type is allowed by AFS. And even if it does, in this particular case, it'll just never match unknown types of addresses. This BUG() was more of a 'you need to add a case here' indicator. Reported-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org>	2020-06-09 17:21:03 +01:00
David Howells	5749ce92c4	afs: Fix file locking Fix AFS file locking to use the correct vnode pointer and remove a member of the afs_operation struct that is never set, but it is read and followed, causing an oops. This can be triggered by: flock -s /afs/example.com/foo sleep 1 when it calls the kernel to get a file lock. Fixes: `e49c7b2f6d` ("afs: Build an abstraction around an "operation" concept") Reported-by: Dave Botsch <botsch@cnf.cornell.edu> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Dave Botsch <botsch@cnf.cornell.edu>	2020-06-09 15:22:06 +01:00
Zhihao Cheng	2ca068be09	afs: Fix memory leak in afs_put_sysnames() Fix afs_put_sysnames() to actually free the specified afs_sysnames object after its reference count has been decreased to zero and its contents have been released. Fixes: `6f8880d8e6` ("afs: Implement @sys substitution handling") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-09 15:22:06 +01:00
Dan Carpenter	fc961522dd	exfat: Fix potential use after free in exfat_load_upcase_table() This code calls brelse(bh) and then dereferences "bh" on the next line resulting in a possible use after free. The brelse() should just be moved down a line. Fixes: b676fdbcf4c8 ("exfat: standardize checksum calculation") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-09 16:50:18 +09:00
hyeongseok.kim	a949824f01	exfat: fix range validation error in alloc and free cluster There is check error in range condition that can never be entered even with invalid input. Replace incorrent checking code with already existing valid checker. Signed-off-by: hyeongseok.kim <hyeongseok@gmail.com> Acked-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-09 16:50:12 +09:00
Namjae Jeon	29bbb14bfc	exfat: fix incorrect update of stream entry in __exfat_truncate() At truncate, there is a problem of incorrect updating in the file entry pointer instead of stream entry. This will cause the problem of overwriting the time field of the file entry to new_size. Fix it to update stream entry. Fixes: `98d917047e` ("exfat: add file operations") Cc: stable@vger.kernel.org # v5.7 Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-09 16:50:07 +09:00
Al Viro	f341a7d8dc	exfat: fix memory leak in exfat_parse_param() butt3rflyh4ck reported memory leak found by syzkaller. A param->string held by exfat_mount_options. BUG: memory leak unreferenced object 0xffff88801972e090 (size 8): comm "syz-executor.2", pid 16298, jiffies 4295172466 (age 14.060s) hex dump (first 8 bytes): 6b 6f 69 38 2d 75 00 00 koi8-u.. backtrace: [<000000005bfe35d6>] kstrdup+0x36/0x70 mm/util.c:60 [<0000000018ed3277>] exfat_parse_param+0x160/0x5e0 fs/exfat/super.c:276 [<000000007680462b>] vfs_parse_fs_param+0x2b4/0x610 fs/fs_context.c:147 [<0000000097c027f2>] vfs_parse_fs_string+0xe6/0x150 fs/fs_context.c:191 [<00000000371bf78f>] generic_parse_monolithic+0x16f/0x1f0 fs/fs_context.c:231 [<000000005ce5eb1b>] do_new_mount fs/namespace.c:2812 [inline] [<000000005ce5eb1b>] do_mount+0x12bb/0x1b30 fs/namespace.c:3141 [<00000000b642040c>] __do_sys_mount fs/namespace.c:3350 [inline] [<00000000b642040c>] __se_sys_mount fs/namespace.c:3327 [inline] [<00000000b642040c>] __x64_sys_mount+0x18f/0x230 fs/namespace.c:3327 [<000000003b024e98>] do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295 [<00000000ce2b698c>] entry_SYSCALL_64_after_hwframe+0x49/0xb3 exfat_free() should call exfat_free_iocharset(), to prevent a leak in case we fail after parsing iocharset= but before calling get_tree_bdev(). Additionally, there's no point copying param->string in exfat_parse_param() - just steal it, leaving NULL in param->string. That's independent from the leak or fix thereof - it's simply avoiding an extra copy. Fixes: `719c1e1829` ("exfat: add super block operations") Cc: stable@vger.kernel.org # v5.7 Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-09 16:50:02 +09:00
Namjae Jeon	f78059805f	exfat: remove unnecessary reassignment of p_uniname->name_len kbuild test robot reported : fs/exfat/nls.c:531:22: warning: Variable 'p_uniname->name_len' is reassigned a value before the old one has been used. The reassignment of p_uniname->name_len is not needed and remove it. Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-09 16:49:32 +09:00
Tetsuhiro Kohada	5875bf287d	exfat: standardize checksum calculation To clarify that it is a 16-bit checksum, the parts related to the 16-bit checksum are renamed and change type to u16. Furthermore, replace checksum calculation in exfat_load_upcase_table() with exfat_calc_checksum32(). Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-09 16:49:25 +09:00
Tetsuhiro Kohada	476189c0ef	exfat: add boot region verification Add Boot-Regions verification specified in exFAT specification. Note that the checksum type is strongly related to the raw structure, so the'u32 'type is used to clarify the number of bits. Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-09 16:49:19 +09:00
Tetsuhiro Kohada	33404a1598	exfat: separate the boot sector analysis Separate the boot sector analysis to read_boot_sector(). And add a check for the fs_name field. Furthermore, add a strict consistency check, because overlapping areas can cause serious corruption. Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-09 16:49:14 +09:00
Tetsuhiro Kohada	181a9e8009	exfat: redefine PBR as boot_sector Aggregate PBR related definitions and redefine as "boot_sector" to comply with the exFAT specification. And, rename variable names including 'pbr'. Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-09 16:49:10 +09:00
Tetsuhiro Kohada	943af1fdac	exfat: optimize dir-cache Optimize directory access based on exfat_entry_set_cache. - Hold bh instead of copied d-entry. - Modify bh->data directly instead of the copied d-entry. - Write back the retained bh instead of rescanning the d-entry-set. And - Remove unused cache related definitions. Signed-off-by: Tetsuhiro Kohada <kohada.tetsuhiro@dc.mitsubishielectric.co.jp> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-09 16:49:05 +09:00
Tetsuhiro Kohada	ed0f84d30b	exfat: replace 'time_ms' with 'time_cs' Replace time_ms with time_cs in the file directory entry structure and related functions. The unit of create_time_ms/modify_time_ms in File Directory Entry are not 'milli-second', but 'centi-second'. The exfat specification uses the term '10ms', but instead use 'cs' as in msdos_fs.h. Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-09 16:49:00 +09:00
Jason Yan	cdc06129a6	exfat: remove the assignment of 0 to bool variable There is no need to init 'sync' in exfat_set_vol_flags(). This also fixes the following coccicheck warning: fs/exfat/super.c:104:6-10: WARNING: Assignment of 0/1 to bool variable Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-09 16:48:53 +09:00
Pali Rohár	6778337a7a	exfat: Remove unused functions exfat_high_surrogate() and exfat_low_surrogate() After applying previous two patches, these functions are not used anymore. Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-09 16:48:49 +09:00
Pali Rohár	dddf7da398	exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF Function partial_name_hash() takes long type value into which can be stored one Unicode code point. Therefore conversion from UTF-32 to UTF-16 is not needed. Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-09 16:48:44 +09:00
Geert Uytterhoeven	31f5acc0aa	exfat: Improve wording of EXFAT_DEFAULT_IOCHARSET config option - Use consistent capitalization for "exFAT". - Fix grammar, - Split long sentence. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-09 16:48:39 +09:00
Joe Perches	d1727d55c0	exfat: Use a more common logging style Remove the direct use of KERN_<LEVEL> in functions by creating separate exfat_<level> macros. Miscellanea: o Remove several unnecessary terminating newlines in formats o Realign arguments and fit to 80 columns where appropriate Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-09 16:48:34 +09:00
Pali Rohár	197298a649	exfat: Simplify exfat_utf8_d_cmp() for code points above U+FFFF If two Unicode code points represented in UTF-16 are different then also their UTF-32 representation must be different. Therefore conversion from UTF-32 to UTF-16 is not needed. Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>	2020-06-09 16:48:28 +09:00
Kenneth D'souza	0b0430c6a1	cifs: Add get_security_type_str function to return sec type. This code is more organized and robust. Signed-off-by: Kenneth D'souza <kdsouza@redhat.com> Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-06-08 23:57:21 -05:00
Matthew Wilcox (Oracle)	d4ff3b2ef9	iomap: Fix unsharing of an extent >2GB on a 32-bit machine Widen the type used for counting the number of bytes unshared. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-06-08 20:58:29 -07:00
Chuhong Yuan	8cc0072469	xfs: Add the missed xfs_perag_put() for xfs_ifree_cluster() xfs_ifree_cluster() calls xfs_perag_get() at the beginning, but forgets to call xfs_perag_put() in one failed path. Add the missed function call to fix it. Fixes: `ce92464c18` ("xfs: make xfs_trans_get_buf return an error code") Signed-off-by: Chuhong Yuan <hslester96@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-06-08 20:57:03 -07:00
Jaegeuk Kim	b7b911d59d	f2fs: attach IO flags to the missing cases This adds more IOs to attach flags. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-06-08 20:37:54 -07:00
Jaegeuk Kim	32b6aba85c	f2fs: add node_io_flag for bio flags likewise data_io_flag This patch adds another way to attach bio flags to node writes. Description: Give a way to attach REQ_META\|FUA to node writes given temperature-based bits. Now the bits indicate: * REQ_META \| REQ_FUA \| * 5 \| 4 \| 3 \| 2 \| 1 \| 0 \| * Cold \| Warm \| Hot \| Cold \| Warm \| Hot \| Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-06-08 20:37:54 -07:00
Chao Yu	bc67c5d0ce	f2fs: remove unused parameter of f2fs_put_rpages_mapping() Just cleanup, no logic change. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-06-08 20:37:53 -07:00
Chao Yu	8626441f05	f2fs: handle readonly filesystem in f2fs_ioc_shutdown() If mountpoint is readonly, we should allow shutdowning filesystem successfully, this fixes issue found by generic/599 testcase of xfstest. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-06-08 20:37:53 -07:00
Eric Biggers	fc3bb095ab	f2fs: avoid utf8_strncasecmp() with unstable name If the dentry name passed to ->d_compare() fits in dentry::d_iname, then it may be concurrently modified by a rename. This can cause undefined behavior (possibly out-of-bounds memory accesses or crashes) in utf8_strncasecmp(), since fs/unicode/ isn't written to handle strings that may be concurrently modified. Fix this by first copying the filename to a stack buffer if needed. This way we get a stable snapshot of the filename. Fixes: `2c2eb7a300` ("f2fs: Support case-insensitive file name lookups") Cc: <stable@vger.kernel.org> # v5.4+ Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Daniel Rosenberg <drosen@google.com> Cc: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-06-08 20:37:53 -07:00
Eric Biggers	0b6d4ca04a	f2fs: don't return vmalloc() memory from f2fs_kmalloc() kmalloc() returns kmalloc'ed memory, and kvmalloc() returns either kmalloc'ed or vmalloc'ed memory. But the f2fs wrappers, f2fs_kmalloc() and f2fs_kvmalloc(), both return both kinds of memory. It's redundant to have two functions that do the same thing, and also breaking the standard naming convention is causing bugs since people assume it's safe to kfree() memory allocated by f2fs_kmalloc(). See e.g. the various allocations in fs/f2fs/compress.c. Fix this by making f2fs_kmalloc() just use kmalloc(). And to avoid re-introducing the allocation failures that the vmalloc fallback was intended to fix, convert the largest allocations to use f2fs_kvmalloc(). Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-06-08 20:34:58 -07:00
Linus Torvalds	95288a9b3b	The highlights are: - OSD/MDS latency and caps cache metrics infrastructure for the filesytem (Xiubo Li). Currently available through debugfs and will be periodically sent to the MDS in the future. - support for replica reads (balanced and localized reads) for rbd and the filesystem (myself). The default remains to always read from primary, users can opt-in with the new crush_location and read_from_replica options. Note that reading from replica is safe for general use only since Octopus. - support for RADOS allocation hint flags (myself). Currently used by rbd to propagate the compressible/incompressible hint given with the new compression_hint map option and ready for passing on more advanced hints, e.g. based on fadvise() from the filesystem. - support for efficient cross-quota-realm renames (Luis Henriques) - assorted cap handling improvements and cleanups, particularly untangling some of the locking (Jeff Layton) -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl7eZP0THGlkcnlvbW92 QGdtYWlsLmNvbQAKCRBKf944AhHziwJDB/98bH+dsJidUkRctVerX933DvgmRGva sIxR0otqCK2zlucKSy8R8awbhVQ2lz4DQm9vrlwFQHBjZqXnrMzDG4rd/PukmKap l8DjHRgEsH698zjwDlyyz7/1ZqOOUcCKr5fly3Erqr92yWGoy2ve76LtTKgB5jnv wdwMk5v/NBWoxZ3Q1cvexbCtc60l0FCSH4FnH7NtT8eR9zCmL9vlpZWdjKi+U5em 6tTONuSq+0F4a9eXEv6QHEjRjkRo1WlttGdK3bX7mXD4O22TslgKg9hYsVoQVTiW Cc9n6Pggv2tbUnPgn/x342W26QyMgcoHCzrYPR7w0JrU61TzBewxqfpg =4fqQ -----END PGP SIGNATURE----- Merge tag 'ceph-for-5.8-rc1' of git://github.com/ceph/ceph-client Pull ceph updates from Ilya Dryomov: "The highlights are: - OSD/MDS latency and caps cache metrics infrastructure for the filesytem (Xiubo Li). Currently available through debugfs and will be periodically sent to the MDS in the future. - support for replica reads (balanced and localized reads) for rbd and the filesystem (myself). The default remains to always read from primary, users can opt-in with the new crush_location and read_from_replica options. Note that reading from replica is safe for general use only since Octopus. - support for RADOS allocation hint flags (myself). Currently used by rbd to propagate the compressible/incompressible hint given with the new compression_hint map option and ready for passing on more advanced hints, e.g. based on fadvise() from the filesystem. - support for efficient cross-quota-realm renames (Luis Henriques) - assorted cap handling improvements and cleanups, particularly untangling some of the locking (Jeff Layton)" * tag 'ceph-for-5.8-rc1' of git://github.com/ceph/ceph-client: (29 commits) rbd: compression_hint option libceph: support for alloc hint flags libceph: read_from_replica option libceph: support for balanced and localized reads libceph: crush_location infrastructure libceph: decode CRUSH device/bucket types and names libceph: add non-asserting rbtree insertion helper ceph: skip checking caps when session reconnecting and releasing reqs ceph: make sure mdsc->mutex is nested in s->s_mutex to fix dead lock ceph: don't return -ESTALE if there's still an open file libceph, rbd: replace zero-length array with flexible-array ceph: allow rename operation under different quota realms ceph: normalize 'delta' parameter usage in check_quota_exceeded ceph: ceph_kick_flushing_caps needs the s_mutex ceph: request expedited service on session's last cap flush ceph: convert mdsc->cap_dirty to a per-session list ceph: reset i_requested_max_size if file write is not wanted ceph: throw a warning if we destroy session with mutex still locked ceph: fix potential race in ceph_check_caps ceph: document what protects i_dirty_item and i_flushing_item ...	2020-06-08 12:49:18 -07:00
Pavel Begunkov	f5fa38c59c	io_wq: add per-wq work handler instead of per work io_uring is the only user of io-wq, and now it uses only io-wq callback for all its requests, namely io_wq_submit_work(). Instead of storing work->runner callback in each instance of io_wq_work, keep it in io-wq itself. pros: - reduces io_wq_work size - more robust -- ->func won't be invalidated with mem{cpy,set}(req) - helps other work Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-08 13:47:37 -06:00
Pavel Begunkov	d4c81f3852	io_uring: don't arm a timeout through work.func Remove io_link_work_cb() -- the last custom work.func. Not the prettiest thing, but works. Instead of queueing a linked timeout in io_link_work_cb() mark a request with REQ_F_QUEUE_TIMEOUT and do enqueueing based on the flag in io_wq_submit_work(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-08 13:47:37 -06:00
Pavel Begunkov	ac45abc0e2	io_uring: remove custom ->func handlers In preparation of getting rid of work.func, this removes almost all custom instances of it, leaving only io_wq_submit_work() and io_link_work_cb(). And the last one will be dealt later. Nothing fancy, just routinely remove *_finish() function and inline what's left. E.g. remove io_fsync_finish() + inline __io_fsync() into io_fsync(). As no users of io_req_cancelled() are left, delete it as well. The patch adds extra switch lookup on cold-ish path, but that's overweighted by nice diffstat and other benefits of the following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-08 13:47:37 -06:00
Pavel Begunkov	3af73b286c	io_uring: don't derive close state from ->func Relying on having a specific work.func is dangerous, even if an opcode handler set it itself. E.g. io_wq_assign_next() can modify it. io_close() sets a custom work.func to indicate that __close_fd_get_file() was already called. Fortunately, there is no bugs with io_wq_assign_next() and close yet. Still, do it safe and always be prepared to be called through io_wq_submit_work(). Zero req->close.put_file in prep, and call __close_fd_get_file() IFF it's NULL. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-08 13:47:37 -06:00
Linus Torvalds	ca687877e0	Changes in gfs2: - An iopen glock locking scheme rework that speeds up deletes of inodes accessed from multiple nodes. - Various bug fixes and debugging improvements. - Convert gfs2-glocks.txt to ReST. -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl7eYjMUHGFncnVlbmJh QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTr/9g//cJ6jgiD/+qzh0VzougVksVZIduAl RMB+kldOjBS2ORbyYM87Jm1tdyakgZuFO91XlSwChWRdC3Y2mqdaJIEE/kATqfY9 7Frlw++SyFKLvIf04kDYGk2hXX+umXXYFfrIiKb0tzDSGkPRaARUb3RM4TRvlSDP /U0JlYA/4aXMUge+VpYsbpSGeqHNfEzmcmCyPXGmZYyh1MZ/RocMZFYEsP9NH82J l07fxowUd10LJPEmBajzjD2NmEvjdvF4gBCOfJVNIfOzCj0CwXL3vmtu1SUNOKr+ em266EWZF89eOcvfdtE6xF0w81oCAK43wYRjIODSgI9JCLXGiOYmlWZVwZoqu5iy 2GQDhj/taq3ObuVqjR5n6GYuqMoJ+D0LSD13qccDALq/Bdy4lq9TMLSdDbkrVIM/ 8BVn0nI5MzUlTV3mq6uxhU0HqtYDwUEiHWURWw6bYRug5OvQy3nbg/XZptYlnD87 XQccE4ErjlgSHLiYx1YckFz/GG6ytrRAKl9bGMkZ0u2+XmDsH+iJJgLcaXCPUP9h /hhYagKI55UBDer7we4tppbu+gnJrg3PXlgImf53iMc7ia0KQHd+TfSIFkGPuydI aEKKhIQzje23JayMbPRnwPbNlw9zU1loPi7hPS3rCpDY+w8oawpFyIieEOcxJyEt pYkOt4BQi9LvpGc= =PsCY -----END PGP SIGNATURE----- Merge tag 'gfs2-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Andreas Gruenbacher: - An iopen glock locking scheme rework that speeds up deletes of inodes accessed from multiple nodes - Various bug fixes and debugging improvements - Convert gfs2-glocks.txt to ReST * tag 'gfs2-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: fix use-after-free on transaction ail lists gfs2: new slab for transactions gfs2: initialize transaction tr_ailX_lists earlier gfs2: Smarter iopen glock waiting gfs2: Wake up when setting GLF_DEMOTE gfs2: Check inode generation number in delete_work_func gfs2: Move inode generation number check into gfs2_inode_lookup gfs2: Minor gfs2_lookup_by_inum cleanup gfs2: Try harder to delete inodes locally gfs2: Give up the iopen glock on contention gfs2: Turn gl_delete into a delayed work gfs2: Keep track of deleted inode generations in LVBs gfs2: Allow ASPACE glocks to also have an lvb gfs2: instrumentation wrt log_flush stuck gfs2: introduce new gfs2_glock_assert_withdraw gfs2: print mapping->nrpages in glock dump for address space glocks gfs2: Only do glock put in gfs2_create_inode for free inodes gfs2: Allow lock_nolock mount to specify jid=X gfs2: Don't ignore inode write errors during inode_go_sync docs: filesystems: convert gfs2-glocks.txt to ReST	2020-06-08 12:47:09 -07:00
Linus Torvalds	20b0d06722	Merge branch 'akpm' (patches from Andrew) Merge still more updates from Andrew Morton: "Various trees. Mainly those parts of MM whose linux-next dependents are now merged. I'm still sitting on ~160 patches which await merges from -next. Subsystems affected by this patch series: mm/proc, ipc, dynamic-debug, panic, lib, sysctl, mm/gup, mm/pagemap" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (52 commits) doc: cgroup: update note about conditions when oom killer is invoked module: move the set_fs hack for flush_icache_range to m68k nommu: use flush_icache_user_range in brk and mmap binfmt_flat: use flush_icache_user_range exec: use flush_icache_user_range in read_code exec: only build read_code when needed m68k: implement flush_icache_user_range arm: rename flush_cache_user_range to flush_icache_user_range xtensa: implement flush_icache_user_range sh: implement flush_icache_user_range asm-generic: add a flush_icache_user_range stub mm: rename flush_icache_user_range to flush_icache_user_page arm,sparc,unicore32: remove flush_icache_user_range riscv: use asm-generic/cacheflush.h powerpc: use asm-generic/cacheflush.h openrisc: use asm-generic/cacheflush.h m68knommu: use asm-generic/cacheflush.h microblaze: use asm-generic/cacheflush.h ia64: use asm-generic/cacheflush.h hexagon: use asm-generic/cacheflush.h ...	2020-06-08 11:11:38 -07:00
Christoph Hellwig	79ef1e1fff	binfmt_flat: use flush_icache_user_range load_flat_file works on user addresses. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Greg Ungerer <gerg@linux-m68k.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/20200515143646.3857579-28-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-08 11:05:58 -07:00
Christoph Hellwig	bce2b68b89	exec: use flush_icache_user_range in read_code read_code operates on user addresses. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/20200515143646.3857579-27-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-08 11:05:58 -07:00
Christoph Hellwig	48304f7994	exec: only build read_code when needed Only build read_code when binary formats that use it are built into the kernel. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/20200515143646.3857579-26-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-08 11:05:58 -07:00
Guilherme G. Piccoli	f117955a22	kernel/watchdog.c: convert {soft/hard}lockup boot parameters to sysctl aliases After a recent change introduced by Vlastimil's series [0], kernel is able now to handle sysctl parameters on kernel command line; also, the series introduced a simple infrastructure to convert legacy boot parameters (that duplicate sysctls) into sysctl aliases. This patch converts the watchdog parameters softlockup_panic and {hard,soft}lockup_all_cpu_backtrace to use the new alias infrastructure. It fixes the documentation too, since the alias only accepts values 0 or 1, not the full range of integers. We also took the opportunity here to improve the documentation of the previously converted hung_task_panic (see the patch series [0]) and put the alias table in alphabetical order. [0] http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Link: http://lkml.kernel.org/r/20200507214624.21911-1-gpiccoli@canonical.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-08 11:05:56 -07:00
Vlastimil Babka	b467f3ef3c	kernel/hung_task convert hung_task_panic boot parameter to sysctl We can now handle sysctl parameters on kernel command line and have infrastructure to convert legacy command line options that duplicate sysctl to become a sysctl alias. This patch converts the hung_task_panic parameter. Note that the sysctl handler is more strict and allows only 0 and 1, while the legacy parameter allowed any non-zero value. But there is little reason anyone would not be using 1. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: David Rientjes <rientjes@google.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20200427180433.7029-4-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-08 11:05:56 -07:00
Vlastimil Babka	0a477e1ae2	kernel/sysctl: support handling command line aliases We can now handle sysctl parameters on kernel command line, but historically some parameters introduced their own command line equivalent, which we don't want to remove for compatibility reasons. We can, however, convert them to the generic infrastructure with a table translating the legacy command line parameters to their sysctl names, and removing the one-off param handlers. This patch adds the support and makes the first conversion to demonstrate it, on the (deprecated) numa_zonelist_order parameter. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: David Rientjes <rientjes@google.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20200427180433.7029-3-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-08 11:05:56 -07:00
Vlastimil Babka	3db978d480	kernel/sysctl: support setting sysctl parameters from kernel command line Patch series "support setting sysctl parameters from kernel command line", v3. This series adds support for something that seems like many people always wanted but nobody added it yet, so here's the ability to set sysctl parameters via kernel command line options in the form of sysctl.vm.something=1 The important part is Patch 1. The second, not so important part is an attempt to clean up legacy one-off parameters that do the same thing as a sysctl. I don't want to remove them completely for compatibility reasons, but with generic sysctl support the idea is to remove the one-off param handlers and treat the parameters as aliases for the sysctl variants. I have identified several parameters that mention sysctl counterparts in Documentation/admin-guide/kernel-parameters.txt but there might be more. The conversion also has varying level of success: - numa_zonelist_order is converted in Patch 2 together with adding the necessary infrastructure. It's easy as it doesn't really do anything but warn on deprecated value these days. - hung_task_panic is converted in Patch 3, but there's a downside that now it only accepts 0 and 1, while previously it was any integer value - nmi_watchdog maps to two sysctls nmi_watchdog and hardlockup_panic, so there's no straighforward conversion possible - traceoff_on_warning is a flag without value and it would be required to handle that somehow in the conversion infractructure, which seems pointless for a single flag This patch (of 5): A recently proposed patch to add vm_swappiness command line parameter in addition to existing sysctl [1] made me wonder why we don't have a general support for passing sysctl parameters via command line. Googling found only somebody else wondering the same [2], but I haven't found any prior discussion with reasons why not to do this. Settings the vm_swappiness issue aside (the underlying issue might be solved in a different way), quick search of kernel-parameters.txt shows there are already some that exist as both sysctl and kernel parameter - hung_task_panic, nmi_watchdog, numa_zonelist_order, traceoff_on_warning. A general mechanism would remove the need to add more of those one-offs and might be handy in situations where configuration by e.g. /etc/sysctl.d/ is impractical. Hence, this patch adds a new parse_args() pass that looks for parameters prefixed by 'sysctl.' and tries to interpret them as writes to the corresponding sys/ files using an temporary in-kernel procfs mount. This mechanism was suggested by Eric W. Biederman [3], as it handles all dynamically registered sysctl tables, even though we don't handle modular sysctls. Errors due to e.g. invalid parameter name or value are reported in the kernel log. The processing is hooked right before the init process is loaded, as some handlers might be more complicated than simple setters and might need some subsystems to be initialized. At the moment the init process can be started and eventually execute a process writing to /proc/sys/ then it should be also fine to do that from the kernel. Sysctls registered later on module load time are not set by this mechanism - it's expected that in such scenarios, setting sysctl values from userspace is practical enough. [1] https://lore.kernel.org/r/BL0PR02MB560167492CA4094C91589930E9FC0@BL0PR02MB5601.namprd02.prod.outlook.com/ [2] https://unix.stackexchange.com/questions/558802/how-to-set-sysctl-using-kernel-command-line-parameter [3] https://lore.kernel.org/r/87bloj2skm.fsf@x220.int.ebiederm.org/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Link: http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz Link: http://lkml.kernel.org/r/20200427180433.7029-2-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-08 11:05:56 -07:00
Linus Torvalds	63d72b93f2	vfs: clean up posix_acl_permission() logic aroudn MAY_NOT_BLOCK posix_acl_permission() does not care about MAY_NOT_BLOCK, and in fact the permission logic internally must not check that bit (it's only for upper layers to decide whether they can block to do IO to look up the acl information or not). But the way the code was written, it _looked_ like it cared, since the function explicitly did not mask that bit off. But it has exactly two callers: one for when that bit is set, which first clears the bit before calling posix_acl_permission(), and the other call site when that bit was clear. So stop the silly games "saving" the MAY_NOT_BLOCK bit that must not be used for the actual permission test, and that currently is pointlessly cleared by the callers when the function itself should just not care. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-08 11:04:19 -07:00
Linus Torvalds	5fc475b749	vfs: do not do group lookup when not necessary Rasmus Villemoes points out that the 'in_group_p()' tests can be a noticeable expense, and often completely unnecessary. A common situation is that the 'group' bits are the same as the 'other' bits wrt the permissions we want to test. So rewrite 'acl_permission_check()' to not bother checking for group ownership when the permission check doesn't care. For example, if we're asking for read permissions, and both 'group' and 'other' allow reading, there's really no reason to check if we're part of the group or not: either way, we'll allow it. Rasmus says: "On a bog-standard Ubuntu 20.04 install, a workload consisting of compiling lots of userspace programs (i.e., calling lots of short-lived programs that all need to get their shared libs mapped in, and the compilers poking around looking for system headers - lots of /usr/lib, /usr/bin, /usr/include/ accesses) puts in_group_p around 0.1% according to perf top. System-installed files are almost always 0755 (directories and binaries) or 0644, so in most cases, we can avoid the binary search and the cost of pulling the cred->groups array and in_group_p() .text into the cpu cache" Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-08 11:04:19 -07:00
Denis Efremov	a8c73c1a61	io_uring: use kvfree() in io_sqe_buffer_register() Use kvfree() to free the pages and vmas, since they are allocated by kvmalloc_array() in a loop. Fixes: `d4ef647510` ("io_uring: avoid page allocation warnings") Signed-off-by: Denis Efremov <efremov@linux.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200605093203.40087-1-efremov@linux.com	2020-06-08 09:39:13 -06:00
Bijan Mottahedeh	efe68c1ca8	io_uring: validate the full range of provided buffers for access Account for the number of provided buffers when validating the address range. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-08 09:39:13 -06:00
youngjun	2068cf7dfb	ovl: remove unnecessary lock check Directory is always locked until "out_unlock" label. So lock check is not needed. Signed-off-by: youngjun <her0gyugyu@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-08 09:57:19 +02:00
Linus Torvalds	a2b447066c	Tag summary + Features - Replace zero-length array with flexible-array - add a valid state flags check - add consistency check between state and dfa diff encode flags - add apparmor subdir to proc attr interface - fail unpack if profile mode is unknown - add outofband transition and use it in xattr match - ensure that dfa state tables have entries + Cleanups - Use true and false for bool variable - Remove semicolon - Clean code by removing redundant instructions - Replace two seq_printf() calls by seq_puts() in aa_label_seq_xprint() - remove duplicate check of xattrs on profile attachment - remove useless aafs_create_symlink + Bug fixes - Fix memory leak of profile proxy - fix introspection of of task mode for unconfined tasks - fix nnp subset test for unconfined - check/put label on apparmor_sk_clone_security() -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE7cSDD705q2rFEEf7BS82cBjVw9gFAl7dUf4ACgkQBS82cBjV w9j8rA//R3qbVeiN3SJtxLhiF3AAdP2cVbZ/mAhQLwYObI6flb1bliiahJHRf8Ey FaVb4srOH8NlmzNINZehXOvD3UDwX/sbpw8h0Y0JolO+v1m3UXkt/eRoMt6gRz7I jtaImY1/V+G4O5rV5fGA1HQI8Geg+W9Abt32d16vyKIIpnBS/Pfv8ppM0NcHCZ4G e8935T/dMNK5K0Y7HNb1nMjyzEr0LtEXvXznBOrGVpCtDQ45m0/NBvAqpfhuKsVm FE5Na8rgtiB9sU72LaoNXNr8Y5LVgkXPmBr/e1FqZtF01XEarKb7yJDGOLrLpp1o rGYpY9DQSBT/ZZrwMaLFqCd1XtnN1BAmhlM6TXfnm25ArEnQ49ReHFc7ZHZRSTZz LWVBD6atZbapvqckk1SU49eCLuGs5wmRj/CmwdoQUbZ+aOfR68zF+0PANbP5xDo4 862MmeMsm8JHndeCelpZQRbhtXt0t9MDzwMBevKhxV9hbpt4g8DcnC5tNUc9AnJi qJDsMkytYhazIW+/4MsnLTo9wzhqzXq5kBeE++Xl7vDE/V+d5ocvQg73xtwQo9sx LzMlh3cPmBvOnlpYfnONZP8pJdjDAuESsi/H5+RKQL3cLz7NX31CLWR8dXLBHy80 Dvxqvy84Cf7buigqwSzgAGKjDI5HmeOECAMjpLbEB2NS9xxQYuk= =U7d2 -----END PGP SIGNATURE----- Merge tag 'apparmor-pr-2020-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor Pull apparmor updates from John Johansen: "Features: - Replace zero-length array with flexible-array - add a valid state flags check - add consistency check between state and dfa diff encode flags - add apparmor subdir to proc attr interface - fail unpack if profile mode is unknown - add outofband transition and use it in xattr match - ensure that dfa state tables have entries Cleanups: - Use true and false for bool variable - Remove semicolon - Clean code by removing redundant instructions - Replace two seq_printf() calls by seq_puts() in aa_label_seq_xprint() - remove duplicate check of xattrs on profile attachment - remove useless aafs_create_symlink Bug fixes: - Fix memory leak of profile proxy - fix introspection of of task mode for unconfined tasks - fix nnp subset test for unconfined - check/put label on apparmor_sk_clone_security()" * tag 'apparmor-pr-2020-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor: apparmor: Fix memory leak of profile proxy apparmor: fix introspection of of task mode for unconfined tasks apparmor: check/put label on apparmor_sk_clone_security() apparmor: Use true and false for bool variable security/apparmor/label.c: Clean code by removing redundant instructions apparmor: Replace zero-length array with flexible-array apparmor: ensure that dfa state tables have entries apparmor: remove duplicate check of xattrs on profile attachment. apparmor: add outofband transition and use it in xattr match apparmor: fail unpack if profile mode is unknown apparmor: fix nnp subset test for unconfined apparmor: remove useless aafs_create_symlink apparmor: add proc subdir to attrs apparmor: add consistency check between state and dfa diff encode flags apparmor: add a valid state flags check AppArmor: Remove semicolon apparmor: Replace two seq_printf() calls by seq_puts() in aa_label_seq_xprint()	2020-06-07 16:04:49 -07:00
Linus Torvalds	f558b8364e	Driver core patches for 5.8-rc1 Here is the set of driver core patches for 5.8-rc1. Not all that huge this release, just a number of small fixes and updates: - software node fixes - kobject now sends KOBJ_REMOVE when it is removed from sysfs, not when it is removed from memory (which could come much later) - device link additions and fixes based on testing on more devices - firmware core cleanups - other minor changes, full details in the shortlog All have been in linux-next for a while with no reported issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXtzmXg8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ymaAQCfZZ9prH3AMLF7DIkG3vMw0njLXt0An2FxrKYU wetHRG4KL9vTkdz7+TqU =t5LE -----END PGP SIGNATURE----- Merge tag 'driver-core-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the set of driver core patches for 5.8-rc1. Not all that huge this release, just a number of small fixes and updates: - software node fixes - kobject now sends KOBJ_REMOVE when it is removed from sysfs, not when it is removed from memory (which could come much later) - device link additions and fixes based on testing on more devices - firmware core cleanups - other minor changes, full details in the shortlog All have been in linux-next for a while with no reported issues" * tag 'driver-core-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (23 commits) driver core: Update device link status correctly for SYNC_STATE_ONLY links firmware_loader: change enum fw_opt to u32 software node: implement software_node_unregister() kobject: send KOBJ_REMOVE uevent when the object is removed from sysfs driver core: Remove unnecessary is_fwnode_dev variable in device_add() drivers property: When no children in primary, try secondary driver core: platform: Fix spelling errors in platform.c driver core: Remove check in driver_deferred_probe_force_trigger() of: platform: Batch fwnode parsing when adding all top level devices driver core: fw_devlink: Add support for batching fwnode parsing driver core: Look for waiting consumers only for a fwnode's primary device driver core: Move code to the right part of the file Revert "Revert "driver core: Set fw_devlink to "permissive" behavior by default"" drivers: base: Fix NULL pointer exception in __platform_driver_probe() if a driver developer is foolish firmware_loader: move fw_fallback_config to a private kernel symbol namespace driver core: Add missing '\n' in log messages driver/base/soc: Use kobj_to_dev() API Add documentation on meaning of -EPROBE_DEFER driver core: platform: remove redundant assignment to variable ret debugfs: Use the correct style for SPDX License Identifier ...	2020-06-07 10:53:36 -07:00
Linus Torvalds	3b69e8b457	Fix for arch/sh build regression with newer binutils, removal of SH5, fixes for module exports, and misc cleanup. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQEcBAABAgAGBQJe28l0AAoJELcQ+SIFb8HaKFMH/0T7tHfWit4+efmeDLhfrewd Fq9lLnEGmLy82AZqmd730gvD2ckbjUCm0ikKC79sCd14r3bIB1RCDKfXbY6rB3uI EDijbkzsjfOYG9ZAiDYTIbyrM2u2/1PzFiYTxHVDtPLbCPGfacbcfrDL+u143IXP ez/RHGLE6uYDvKi0Y0/VDKgMCW9bNlcEkL2/tKFVg2cipDi2Lfmi3Jss/id+5uOI N8XeZoyHjyWr7GeRZwN/hNPLDvLY//Uf5q6RB9VrTsN4Vrja7kjWMZkgsGkmGbNo f6BbLenq+KMfOSJrIzS3MgTRinoqRF5S518pkbGtgRQn0rZKfd6h85DG15RlPGk= =Ktnp -----END PGP SIGNATURE----- Merge tag 'sh-for-5.8' of git://git.libc.org/linux-sh Pull arch/sh updates from Rich Felker: "Fix for arch/sh build regression with newer binutils, removal of SH5, fixes for module exports, and misc cleanup" * tag 'sh-for-5.8' of git://git.libc.org/linux-sh: sh: remove sh5 support sh: add missing EXPORT_SYMBOL() for __delay sh: Convert ins[bwl]/outs[bwl] macros to inline functions sh: Convert iounmap() macros to inline functions sh: Add missing DECLARE_EXPORT() for __ashiftrt_r4_xx sh: configs: Cleanup old Kconfig IO scheduler options arch/sh: vmlinux.scr sh: Replace CONFIG_MTD_M25P80 with CONFIG_MTD_SPI_NOR in sh7757lcr_defconfig sh: sh4a: Bring back tmu3_device early device	2020-06-06 15:22:01 -07:00
Zou Wei	9fa88c5d3f	hpfs: fix warning due to superfluous semicolon Fixes coccicheck warning: fs/hpfs/buffer.c:56:2-3: Unneeded semicolon Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zou Wei <zou_wei@huawei.com> Signed-off-by: Mikulas Patocka <mikulas@twibright.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-06 10:08:17 -07:00
Steve French	5865985416	smb3: extend fscache mount volume coherency check It is better to check volume id and creation time, not just the root inode number to verify if the volume has changed when remounting. Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-06-06 11:16:25 -05:00
Linus Torvalds	aaa2faab4e	orangefs: a conversion and a cleanup... Conversion: John Hubbard's conversion from get_user_pages() to pin_user_pages() cleanup: Colin Ian King's removal of an unneeded variable initialization. -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEIGSFVdO6eop9nER2z0QOqevODb4FAl7ao4AACgkQz0QOqevO Db4MqRAAjb+nRMDA6aYfCY1Veh9BUnVeVPAqWdlSPbo7TdDfhCBJKgwZ8Y3GmOnR vYnVn6Yz/uhFwoZWmSd0AXgB55kGKyXNcHlyPO4FcaWGDNd/Dn8WWc7/lsqHnDFX cg0Ioy7VeYS+Y+iW3ZkEjkeGyVFProl/OsJrf9vfJiuyZrLp4th/ctlbV/sIA2R6 XNk0ld21gEB5YbrTCQebtyhdJLp9+hhI0BB2Lxm/JUZyyK4J2rN8H+SF/5JE8wEj SJD7K5kukxED2Kh3pU1fvRVr0VvHZjLHQav6TgW6GPokmb6EWZwvIYfUa8go50Jz 5fkpyRc8d3zibxDSdL6/Gr6mZQxceFnYvfPs27Vq3O08J/dhWHX0LlKctD4pEHNR NUsF8Piko+16JwPg+EeXTIsYrMiW8g5FhoTCl7FILQ06F2P6NDCyOUyHiLOU4KaN +bp6YmIyJBfIBgXz58Pqq2JwibkN6zYiRrb5jDDWv2Nw9ykpjNykrAXH6Fbz8JKV dPbKkxzx5HQuHBaxYCRKV4r3UNBYKAECrWcLQglG8D+EzvYrBxFZQoPu3qapJPJG 8fQKCKqU8hM6BGYpI76FIK8o/BrOdYr0ttqDsambDWt/uRXRKXwvKJVz3wOB8Q7n om+XlIk4UE/7vgpTru/wzVX0pmq8WWskbDOKkGqgeUC0BLuZcJw= =3A+d -----END PGP SIGNATURE----- Merge tag 'for-linus-5.8-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs updates from Mike Marshall: - John Hubbard's conversion from get_user_pages() to pin_user_pages() - Colin Ian King's removal of an unneeded variable initialization * tag 'for-linus-5.8-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: orangefs: convert get_user_pages() --> pin_user_pages() orangefs: remove redundant assignment to variable ret	2020-06-05 16:44:36 -07:00
Linus Torvalds	e3cea0cad1	dlm for 5.8 This set includes a couple minor cleanups, and dropping the interruptible from a wait_event that waits for an event from the userspace cluster management. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJe2op+AAoJEDgbc8f8gGmqJKkP/jE0BCnE7VbpID0sDaBLpntk +v53jxR/bpw6g+LwlJGBDLCRrcn8MnhC72W1OPZyFJ4AiuN5BHdvTVVSPSCBSHnU WfJKqbVSRt+PRaATJd8iCLdoMj858vLWFo380XVMTRsAaxYJ4dwRUMBUOx/pL8Dy pcDXDgWTC06nuRmSY8eYJYJWX3c6SdpHHbg4IA127Y9oTjIAvYyeFC4ohRtRJqrj bQlpjnfJ9bNwd5N15turGiR5IzY7UXI9wa/qylngr5B5gVxJttFpaE4OFF80HOkY Xxqqqhi9dIcyAAzftJaRQpyAq6kRmFgFAjH5Rvx+7n0BEMe+AAhQC9Coei1Zysh1 fssYrxRC3fqX1kg2alhpdBdCugFq7IUdFuwH044M+w8jcsoggGq8vGRMJEuKgnnm kLLBEi5HtAv0S9rAE9Acnqanc6lRvzJIc0vysG/0LQqzWr+F7HtRnS7kkOkO2+e1 014k20FmooJiu5XGrCz+dvmnIACpr6Z+S119uXntYGDDA+YGcJQTMs6pzLh5UTqB 40aIZPPRRj6K/Vvx2VJgEbzwmwWjjdtsYkcJEq+QPPsAkPa8EnbqICCyJDkUuY5K kltYHQ2IOa466KC8JHBS9jdVnHftk0jIl75eiP7vK+ZUEW+iwRgEXY11oiWeY23J EKfPgi0oikfmlj2Y3jme =VTRw -----END PGP SIGNATURE----- Merge tag 'dlm-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This set includes a couple minor cleanups, and dropping the interruptible from a wait_event that waits for an event from the userspace cluster management" * tag 'dlm-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: remove BUG() before panic() dlm: Switch to using wait_event() fs:dlm:remove unneeded semicolon in rcom.c dlm: user: Replace zero-length array with flexible-array member dlm: dlm_internal: Replace zero-length array with flexible-array member	2020-06-05 16:43:16 -07:00
Linus Torvalds	3803d5e4d3	22 changesets, 2 for stable. Includes big performance improvement for large i/o when using multichannel, also includes DFS fixes -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl7aelsACgkQiiy9cAdy T1HDmwv9Fj6OaXXx+btNvbB6xTWvCwMVKHwTPURMx+IjBYjJC65yPGkInPPkfUVo 7L9h55XCLwFohECleZJCkKOrJtnX1P8SsHtZck6QqjvUETJl/L3pAXpMMYACHLpg x4DE/NFkcW95J38s9Jtjhphq8ZGUhuDhaT+QeEd2Iq8HzAxk5ND47ZXkomMx1EEM ZsOrmJF+k2YQyDDpfhJeVF5iZDkbpASqA/TlLxxGH34IdAZIUB9qtGKADNLZ6YyT qpG601CSrEdl3tVY+SlRMHqwVTRhCViPD6Q3fMw8Xha436RIiWJJ4Rvn6bSP/ZQl PDPuSVRB2zmd70C/3ojXdku9+VfQLO52qkO3bf2IjgVJ3ARrxFxW7cb7bmYRqdyT WI5N1+8gETrIAK7aB3QKdmkcRFDtJD3wOTfBcgctuB8WrYrDvW2MNKkPbQdY5tnN xfp4f10Dg4d+/8knSytJrdKkDublU0kGbfLAa0oupjAzV6WB0qNt0TNGxE42L+ug iFSJOZxi =gCcP -----END PGP SIGNATURE----- Merge tag '5.8-rc-smb3-fixes-part-1' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs updates from Steve French: "22 changesets, 2 for stable. Includes big performance improvement for large i/o when using multichannel, also includes DFS fixes" * tag '5.8-rc-smb3-fixes-part-1' of git://git.samba.org/sfrench/cifs-2.6: (22 commits) cifs: update internal module version number cifs: multichannel: try to rebind when reconnecting a channel cifs: multichannel: use pointer for binding channel smb3: remove static checker warning cifs: multichannel: move channel selection above transport layer cifs: multichannel: always zero struct cifs_io_parms cifs: dump Security Type info in DebugData smb3: fix incorrect number of credits when ioctl MaxOutputResponse > 64K smb3: default to minimum of two channels when multichannel specified cifs: multichannel: move channel selection in function cifs: fix minor typos in comments and log messages smb3: minor update to compression header definitions cifs: minor fix to two debug messages cifs: Standardize logging output smb3: Add new parm "nodelete" cifs: move some variables off the stack in smb2_ioctl_query_info cifs: reduce stack use in smb2_compound_op cifs: get rid of unused parameter in reconn_setup_dfs_targets() cifs: handle hostnames that resolve to same ip in failover cifs: set up next DFS target before generic_ip_connect() ...	2020-06-05 16:40:53 -07:00
Linus Torvalds	9daa0a27a0	AFS Changes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7ZC5kACgkQ+7dXa6fL C2uv9A/+NKlTSXyv2ZuvtmXADelndcXJ+nC+3bwI7Jh43aa8uCCsAVYD0VE+dxor Ingj/LUJ2sjjp6RXCeeqqETXCoCVt0zK2g216+An7k84KJ+ms+MDa8dNN7l6280S 1jw4hnT0+g9Ln6elgqBroV980MJC2NGL0Eaete8zFO8UqYZy5w1ge0HfGck2l45U 2lr6egCWYSUPmtFKXJnLV8luwRvq7DzvTk9WrJu3kwOjaY1AQP1+1VpdhChJLrRc /4Ddy1On5IXiFrPi5OtHA422bfirUpIv2HbmI047W9uiZ05MiXwSvNS1qJLTa1AA T/SK88d3FCeSYw3olAne2kEl9uewvGByr98fDKFOcDHZj18abd9/VtUp33RXxYBy lN2wqlWP++LlZ4sMCbbvLXX8OB1tekQzWQC0vJ5rhRSgveOlhL9TLG2Y05xokFs+ AwK8zTlDIZ6Pa/JIHfp2E0ZhXEazWTSmP+d7NkgzF0iiORukvsmxjOVUZC4+UCqK rYN6goJ5g8qpejRv5NhfP6/olb1NK33f/F2QSSFfxv9zda4HNlayvcoSnFrdUEnt IfBhSKPkeDVWs1yse7glDuw19tHp94B9UYwJ46qfHngQPArgy+gp23d0cSy41Pr5 FRQ23eNvBWIP4srt1gSCBexSGA1h/ACji41CPTJbF2jg5uWFAUE= =YVwD -----END PGP SIGNATURE----- Merge tag 'afs-next-20200604' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS updates from David Howells: "There's some core VFS changes which affect a couple of filesystems: - Make the inode hash table RCU safe and providing some RCU-safe accessor functions. The search can then be done without taking the inode_hash_lock. Care must be taken because the object may be being deleted and no wait is made. - Allow iunique() to avoid taking the inode_hash_lock. - Allow AFS's callback processing to avoid taking the inode_hash_lock when using the inode table to find an inode to notify. - Improve Ext4's time updating. Konstantin Khlebnikov said "For now, I've plugged this issue with try-lock in ext4 lazy time update. This solution is much better." Then there's a set of changes to make a number of improvements to the AFS driver: - Improve callback (ie. third party change notification) processing by: (a) Relying more on the fact we're doing this under RCU and by using fewer locks. This makes use of the RCU-based inode searching outlined above. (b) Moving to keeping volumes in a tree indexed by volume ID rather than a flat list. (c) Making the server and volume records logically part of the cell. This means that a server record now points directly at the cell and the tree of volumes is there. This removes an N:M mapping table, simplifying things. - Improve keeping NAT or firewall channels open for the server callbacks to reach the client by actively polling the fileserver on a timed basis, instead of only doing it when we have an operation to process. - Improving detection of delayed or lost callbacks by including the parent directory in the list of file IDs to be queried when doing a bulk status fetch from lookup. We can then check to see if our copy of the directory has changed under us without us getting notified. - Determine aliasing of cells (such as a cell that is pointed to be a DNS alias). This allows us to avoid having ambiguity due to apparently different cells using the same volume and file servers. - Improve the fileserver rotation to do more probing when it detects that all of the addresses to a server are listed as non-responsive. It's possible that an address that previously stopped responding has become responsive again. Beyond that, lay some foundations for making some calls asynchronous: - Turn the fileserver cursor struct into a general operation struct and hang the parameters off of that rather than keeping them in local variables and hang results off of that rather than the call struct. - Implement some general operation handling code and simplify the callers of operations that affect a volume or a volume component (such as a file). Most of the operation is now done by core code. - Operations are supplied with a table of operations to issue different variants of RPCs and to manage the completion, where all the required data is held in the operation object, thereby allowing these to be called from a workqueue. - Put the standard "if (begin), while(select), call op, end" sequence into a canned function that just emulates the current behaviour for now. There are also some fixes interspersed: - Don't let the EACCES from ICMP6 mapping reach the user as such, since it's confusing as to whether it's a filesystem error. Convert it to EHOSTUNREACH. - Don't use the epoch value acquired through probing a server. If we have two servers with the same UUID but in different cells, it's hard to draw conclusions from them having different epoch values. - Don't interpret the argument to the CB.ProbeUuid RPC as a fileserver UUID and look up a fileserver from it. - Deal with servers in different cells having the same UUIDs. In the event that a CB.InitCallBackState3 RPC is received, we have to break the callback promises for every server record matching that UUID. - Don't let afs_statfs return values that go below 0. - Don't use running fileserver probe state to make server selection and address selection decisions on. Only make decisions on final state as the running state is cleared at the start of probing" Acked-by: Al Viro <viro@zeniv.linux.org.uk> (fs/inode.c part) * tag 'afs-next-20200604' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (27 commits) afs: Adjust the fileserver rotation algorithm to reprobe/retry more quickly afs: Show more a bit more server state in /proc/net/afs/servers afs: Don't use probe running state to make decisions outside probe code afs: Fix afs_statfs() to not let the values go below zero afs: Fix the by-UUID server tree to allow servers with the same UUID afs: Reorganise volume and server trees to be rooted on the cell afs: Add a tracepoint to track the lifetime of the afs_volume struct afs: Detect cell aliases 3 - YFS Cells with a canonical cell name op afs: Detect cell aliases 2 - Cells with no root volumes afs: Detect cell aliases 1 - Cells with root volumes afs: Implement client support for the YFSVL.GetCellName RPC op afs: Retain more of the VLDB record for alias detection afs: Fix handling of CB.ProbeUuid cache manager op afs: Don't get epoch from a server because it may be ambiguous afs: Build an abstraction around an "operation" concept afs: Rename struct afs_fs_cursor to afs_operation afs: Remove the error argument from afs_protocol_error() afs: Set error flag rather than return error from file status decode afs: Make callback processing more efficient. afs: Show more information in /proc/net/afs/servers ...	2020-06-05 16:26:36 -07:00
Linus Torvalds	0b166a57e6	A lot of bug fixes and cleanups for ext4, including: * Fix performance problems found in dioread_nolock now that it is the default, caused by transaction leaks. * Clean up fiemap handling in ext4 * Clean up and refactor multiple block allocator (mballoc) code * Fix a problem with mballoc with a smaller file systems running out of blocks because they couldn't properly use blocks that had been reserved by inode preallocation. * Fixed a race in ext4_sync_parent() versus rename() * Simplify the error handling in the extent manipulation code * Make sure all metadata I/O errors are felected to ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers. * Avoid passing an error pointer to brelse in ext4_xattr_set() * Fix race which could result to freeing an inode on the dirty last in data=journal mode. * Fix refcount handling if ext4_iget() fails * Fix a crash in generic/019 caused by a corrupted extent node -----BEGIN PGP SIGNATURE----- iQEyBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl7Ze8kACgkQ8vlZVpUN gaNChAf4xn0ytFSrweI/S2Sp05G/2L/ocZ2TZZk2ZdGeN1E+ABdSIv/zIF9zuFgZ /pY/C+fyEZWt4E3FlNO8gJzoEedkzMCMnUhSIfI+wZbcclyTOSNMJtnrnJKAEtVH HOvGZJmg357jy407RCGhZpJ773nwU2xhBTr5OFxvSf9mt/vzebxIOnw5D7HPlC1V Fgm6Du8q+tRrPsyjv1Yu4pUEVXMJ7qUcvt326AXVM3kCZO1Aa5GrURX0w3J4mzW1 tc1tKmtbLcVVYTo9CwHXhk/edbxrhAydSP2iACand3tK6IJuI6j9x+bBJnxXitnr vsxsfTYMG18+2SxrJ9LwmagqmrRq =HMTs -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A lot of bug fixes and cleanups for ext4, including: - Fix performance problems found in dioread_nolock now that it is the default, caused by transaction leaks. - Clean up fiemap handling in ext4 - Clean up and refactor multiple block allocator (mballoc) code - Fix a problem with mballoc with a smaller file systems running out of blocks because they couldn't properly use blocks that had been reserved by inode preallocation. - Fixed a race in ext4_sync_parent() versus rename() - Simplify the error handling in the extent manipulation code - Make sure all metadata I/O errors are felected to ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers. - Avoid passing an error pointer to brelse in ext4_xattr_set() - Fix race which could result to freeing an inode on the dirty last in data=journal mode. - Fix refcount handling if ext4_iget() fails - Fix a crash in generic/019 caused by a corrupted extent node" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits) ext4: avoid unnecessary transaction starts during writeback ext4: don't block for O_DIRECT if IOCB_NOWAIT is set ext4: remove the access_ok() check in ext4_ioctl_get_es_cache fs: remove the access_ok() check in ioctl_fiemap fs: handle FIEMAP_FLAG_SYNC in fiemap_prep fs: move fiemap range validation into the file systems instances iomap: fix the iomap_fiemap prototype fs: move the fiemap definitions out of fs.h fs: mark __generic_block_fiemap static ext4: remove the call to fiemap_check_flags in ext4_fiemap ext4: split _ext4_fiemap ext4: fix fiemap size checks for bitmap files ext4: fix EXT4_MAX_LOGICAL_BLOCK macro add comment for ext4_dir_entry_2 file_type member jbd2: avoid leaking transaction credits when unreserving handle ext4: drop ext4_journal_free_reserved() ext4: mballoc: use lock for checking free blocks while retrying ext4: mballoc: refactor ext4_mb_good_group() ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling ext4: mballoc: refactor ext4_mb_discard_preallocations() ...	2020-06-05 16:19:28 -07:00
Linus Torvalds	242b233198	RDMA 5.8 merge window pull request A few large, long discussed works this time. The RNBD block driver has been posted for nearly two years now, and the removal of FMR has been a recurring discussion theme for a long time. The usual smattering of features and bug fixes. - Various small driver bugs fixes in rxe, mlx5, hfi1, and efa - Continuing driver cleanups in bnxt_re, hns - Big cleanup of mlx5 QP creation flows - More consistent use of src port and flow label when LAG is used and a mlx5 implementation - Additional set of cleanups for IB CM - 'RNBD' network block driver and target. This is a network block RDMA device specific to ionos's cloud environment. It brings strong multipath and resiliency capabilities. - Accelerated IPoIB for HFI1 - QP/WQ/SRQ ioctl migration for uverbs, and support for multiple async fds - Support for exchanging the new IBTA defiend ECE data during RDMA CM exchanges - Removal of the very old and insecure FMR interface from all ULPs and drivers. FRWR should be preferred for at least a decade now. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl7X/IwACgkQOG33FX4g mxp2uw/+MI2S/aXqEBvZfTT8yrkAwqYezS0VeTDnwH/T6UlTMDhHVN/2Ji3tbbX3 FEKT1i2mnAL5RqUAL1lr9g4sG/bVozrpN46Ws5Lu9dTbIPLKTNPWDuLFQDUShKY7 OyMI/bRx6anGnsOy20iiBqnrQbrrZj5TECgnmrkAl62QFdcl7aBWe/yYjy4CT11N ub+aBXBREN1F1pc0HIjd2tI+8gnZc+mNm1LVVDRH9Capun/pI26qDNh7e6QwGyIo n8ItraC8znLwv/nsUoTE7/JRcsTEe6vJI26PQmczZfNJs/4O65G7fZg0eSBseZYi qKf7Uwtb3qW0R7jRUMEgFY4DKXVAA0G2ph40HXBuzOSsqlT6HqYMO2wgG8pJkrTc qAjoSJGzfAHIsjxzxKI8wKuufCddjCm30VWWU7EKeriI6h1J0uPVqKkQMfYBTkik 696eZSBycAVgwayOng3XaehiTxOL7qGMTjUpDjUR6UscbiPG919vP+QsbIUuBXdb YoddBQJdyGJiaCXv32ciJjo9bjPRRi/bII7Q5qzCNI2mi4ZVbudF4ffzyQvdHtNJ nGnpRXoPi7kMvUrKTMPWkFjj0R5/UsPszsA51zbxPydfgBe0Dlc2PrrIG8dlzYAp wbV0Lec+iJucKlt7EZtrjz1xOiOOaQt/5/cW1bWqL+wk2t6gAuY= =9zTe -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull rdma updates from Jason Gunthorpe: "A more active cycle than most of the recent past, with a few large, long discussed works this time. The RNBD block driver has been posted for nearly two years now, and flowing through RDMA due to it also introducing a new ULP. The removal of FMR has been a recurring discussion theme for a long time. And the usual smattering of features and bug fixes. Summary: - Various small driver bugs fixes in rxe, mlx5, hfi1, and efa - Continuing driver cleanups in bnxt_re, hns - Big cleanup of mlx5 QP creation flows - More consistent use of src port and flow label when LAG is used and a mlx5 implementation - Additional set of cleanups for IB CM - 'RNBD' network block driver and target. This is a network block RDMA device specific to ionos's cloud environment. It brings strong multipath and resiliency capabilities. - Accelerated IPoIB for HFI1 - QP/WQ/SRQ ioctl migration for uverbs, and support for multiple async fds - Support for exchanging the new IBTA defiend ECE data during RDMA CM exchanges - Removal of the very old and insecure FMR interface from all ULPs and drivers. FRWR should be preferred for at least a decade now" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (247 commits) RDMA/cm: Spurious WARNING triggered in cm_destroy_id() RDMA/mlx5: Return ECE DC support RDMA/mlx5: Don't rely on FW to set zeros in ECE response RDMA/mlx5: Return an error if copy_to_user fails IB/hfi1: Use free_netdev() in hfi1_netdev_free() RDMA/hns: Uninitialized variable in modify_qp_init_to_rtr() RDMA/core: Move and rename trace_cm_id_create() IB/hfi1: Fix hfi1_netdev_rx_init() error handling RDMA: Remove 'max_map_per_fmr' RDMA: Remove 'max_fmr' RDMA/core: Remove FMR device ops RDMA/rdmavt: Remove FMR memory registration RDMA/mthca: Remove FMR support for memory registration RDMA/mlx4: Remove FMR support for memory registration RDMA/i40iw: Remove FMR leftovers RDMA/bnxt_re: Remove FMR leftovers RDMA/mlx5: Remove FMR leftovers RDMA/core: Remove FMR pool API RDMA/rds: Remove FMR support for memory registration RDMA/srp: Remove support for FMR memory registration ...	2020-06-05 14:05:57 -07:00
Linus Torvalds	ac7b34218a	Split the old READ_IMPLIES_EXEC workaround from executable PT_GNU_STACK now that toolchains long support PT_GNU_STACK marking and there's no need anymore to force modern programs into having all its user mappings executable instead of only the stack and the PROT_EXEC ones. Disable that automatic READ_IMPLIES_EXEC forcing on x86-64 and arm64. Add tables documenting how READ_IMPLIES_EXEC is handled on x86-64, arm and arm64. By Kees Cook. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl7YFDIACgkQEsHwGGHe VUpnzxAAmXdODNOb1gGQvt+KJthkfkWh2A2R+tWxCRmFtjFTcS/eRxFfvGu2KmFY 2b2AcJzuJeGjs7WIvQU0pkR2p6STyzuSBBLj5J/OJR9FonQ4pPah38df4A0fOgI6 GJyJV9Ie7O2Ph1w2iLOeWBdmR90CnYuabxsfipgOL+sjHlEI0RqLSDgARRQsxTEj KM+JVAFD472KcUJnQKBVBOD1I1DOVBGu12r3y6chgsOtwshLNW/cO15cDgYrgnJZ OlR3EIUukCEEc1KQzUCihsypLuGfrmdq1MyPN8CME8gLfmOBsJyGRDhvmdbS+Wxh kAMYQ9BuNP/jMVtN950qV0qUtnZCeIPlj1sDb9STWz5fInLsXDSCS0eYi32yBFi+ 7yviVU95ml6Mda1Qd5axItTHFAjKIn0qfMZszkLOtUszIzNinCgH7t3ThoXeV223 BqrpntRwiGZVpXDdcp0QFYBsWSMchR47yuhL8pB4SWxQzgNzXqAEg2KFQU0XMDKp pdia9IzUozg/BrjG5cnRfZhq2lBra7fy3Dn6fw5+NR5vqhka0Wr8L6dyM1Rj74EU HPk5bRXgt0OIiIFPi4139ApY7k+8j2nbf12qUchue1ZVVKzbvK996FDXbrGgW3zD Wis1wglxB9urSUTmC1bMOeyOd+gebo3i/ACAjgSo+EbDN7qW0Qw= =2L7y -----END PGP SIGNATURE----- Merge tag 'core_core_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull READ_IMPLIES_EXEC changes from Borislav Petkov: "Split the old READ_IMPLIES_EXEC workaround from executable PT_GNU_STACK now that toolchains long support PT_GNU_STACK marking and there's no need anymore to force modern programs into having all its user mappings executable instead of only the stack and the PROT_EXEC ones. Disable that automatic READ_IMPLIES_EXEC forcing on x86-64 and arm64. Add tables documenting how READ_IMPLIES_EXEC is handled on x86-64, arm and arm64. By Kees Cook" * tag 'core_core_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: arm64/elf: Disable automatic READ_IMPLIES_EXEC for 64-bit address spaces arm32/64/elf: Split READ_IMPLIES_EXEC from executable PT_GNU_STACK arm32/64/elf: Add tables to document READ_IMPLIES_EXEC x86/elf: Disable automatic READ_IMPLIES_EXEC on 64-bit x86/elf: Split READ_IMPLIES_EXEC from executable PT_GNU_STACK x86/elf: Add table to document READ_IMPLIES_EXEC	2020-06-05 13:45:21 -07:00
Andreas Gruenbacher	300e549b6e	Merge branch 'gfs2-iopen' into for-next	2020-06-05 21:25:36 +02:00
Bob Peterson	83d060ca8d	gfs2: fix use-after-free on transaction ail lists Before this patch, transactions could be merged into the system transaction by function gfs2_merge_trans(), but the transaction ail lists were never merged. Because the ail flushing mechanism can run separately, bd elements can be attached to the transaction's buffer list during the transaction (trans_add_meta, etc) but quickly moved to its ail lists. Later, in function gfs2_trans_end, the transaction can be freed (by gfs2_trans_end) while it still has bd elements queued to its ail lists, which can cause it to either lose track of the bd elements altogether (memory leak) or worse, reference the bd elements after the parent transaction has been freed. Although I've not seen any serious consequences, the problem becomes apparent with the previous patch's addition of: gfs2_assert_warn(sdp, list_empty(&tr->tr_ail1_list)); to function gfs2_trans_free(). This patch adds logic into gfs2_merge_trans() to move the merged transaction's ail lists to the sdp transaction. This prevents the use-after-free. To do this properly, we need to hold the ail lock, so we pass sdp into the function instead of the transaction itself. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-05 21:24:25 +02:00
Bob Peterson	b839dadae8	gfs2: new slab for transactions This patch adds a new slab for gfs2 transactions. That allows us to reduce kernel memory fragmentation, have better organization of data for analysis of vmcore dumps. A new centralized function is added to free the slab objects, and it exposes use-after-free by giving warnings if a transaction is freed while it still has bd elements attached to its buffers or ail lists. We make sure to initialize those transaction ail lists so we can check their integrity when freeing. At a later time, we should add a slab initialization function to make it more efficient, but for this initial patch I wanted to minimize the impact. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-05 21:24:25 +02:00
Bob Peterson	cbcc89b630	gfs2: initialize transaction tr_ailX_lists earlier Since transactions may be freed shortly after they're created, before a log_flush occurs, we need to initialize their ail1 and ail2 lists earlier. Before this patch, the ail1 list was initialized in gfs2_log_flush(). This moves the initialization to the point when the transaction is first created. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-05 21:24:25 +02:00
Andreas Gruenbacher	9e8990dea9	gfs2: Smarter iopen glock waiting When trying to upgrade the iopen glock from a shared to an exclusive lock in gfs2_evict_inode, abort the wait if there is contention on the corresponding inode glock: in that case, the inode must still be in active use on another node, and we're not guaranteed to get the iopen glock anytime soon. To make this work even better, when we notice contention on the iopen glock and we can't evict the corresponsing inode and release the iopen glock immediately, poke the inode glock. The other node(s) trying to acquire the lock can then abort instead of timing out. Thanks to Heinz Mauelshagen for pointing out a locking bug in a previous version of this patch. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-05 20:19:21 +02:00
Andreas Gruenbacher	35b6f8fbcf	gfs2: Wake up when setting GLF_DEMOTE Wake up the sdp->sd_async_glock_wait wait queue when setting the GLF_DEMOTE flag. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-05 20:19:21 +02:00
Andreas Gruenbacher	b0dcffd8da	gfs2: Check inode generation number in delete_work_func In delete_work_func, if the iopen glock still has an inode attached, limit the inode lookup to that specific generation number: in the likely case that the inode was deleted on the node on which the inode's link count dropped to zero, we can skip verifying the on-disk block type and reading in the inode. The same applies if another node that had the inode open managed to delete the inode before us. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-05 20:19:21 +02:00
Andreas Gruenbacher	b66648ad6d	gfs2: Move inode generation number check into gfs2_inode_lookup Move the inode generation number check from gfs2_lookup_by_inum into gfs2_inode_lookup: gfs2_inode_lookup may be able to decide that an inode with the given inode generation number cannot exist without having to verify the block type or reading the inode from disk. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-05 20:19:21 +02:00
Andreas Gruenbacher	6bdcadea75	gfs2: Minor gfs2_lookup_by_inum cleanup Use a zero no_formal_ino instead of a NULL pointer to indicate that any inode generation number will qualify: a valid inode never has a zero no_formal_ino. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-05 20:19:21 +02:00
Andreas Gruenbacher	9e73330f29	gfs2: Try harder to delete inodes locally When an inode's link count drops to zero and the inode is cached on other nodes, the current behavior of gfs2 is to immediately give up and to rely on the other node(s) to delete the inode if there is iopen glock contention. This leads to resource group glock bouncing and the loss of caching. With the previous patches in place, we can fix that by not giving up immediately. When the inode is still open on other nodes, those nodes won't be able to evict the inode and give up the iopen glock. In that case, our lock conversion request will time out. The unlink system call will block for the duration of the iopen lock conversion request. We're also holding the inode glock in EX mode for an extended duration, so other nodes won't be able to make progress on the inode, either. This is worse than what we had before, but we can prevent other nodes from getting stuck by aborting our iopen locking request if there is contention on the inode glock. This will the the subject of a future patch. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-05 20:19:21 +02:00
Andreas Gruenbacher	8c7b9262a8	gfs2: Give up the iopen glock on contention When there's contention on the iopen glock, it means that the link count of the corresponding inode has dropped to zero on a remote node which is now trying to delete the inode. In that case, try to evict the inode so that the iopen glock will be released, which will allow the remote node to do its job. When the inode is still open locally, the inode's reference count won't drop to zero and so we'll keep holding the inode and its iopen glock. The remote node will time out its request to grab the iopen glock, and when the inode is finally closed locally, we'll try to delete it ourself. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-05 20:19:21 +02:00
Andreas Gruenbacher	a0e3cc65fa	gfs2: Turn gl_delete into a delayed work This requires flushing delayed work items in gfs2_make_fs_ro (which is called before unmounting a filesystem). When inodes are deleted and then recreated, pending gl_delete work items would have no effect because the inode generations will have changed, so we can cancel any pending gl_delete works before reusing iopen glocks. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-05 20:19:21 +02:00
Andreas Gruenbacher	f286d627ef	gfs2: Keep track of deleted inode generations in LVBs When deleting an inode, keep track of the generation of the deleted inode in the inode glock Lock Value Block (LVB). When trying to delete an inode remotely, check the last-known inode generation against the deleted inode generation to skip duplicate remote deletes. This avoids taking the resource group glock in order to verify the block type. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-05 20:19:20 +02:00
Bob Peterson	15f2547b41	gfs2: Allow ASPACE glocks to also have an lvb Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-05 20:18:59 +02:00
Bob Peterson	d5dc3d9677	gfs2: instrumentation wrt log_flush stuck This adds checks for gfs2_log_flush being stuck, similarly to the check in gfs2_ail1_flush. To faciliate this and make the strings easy to grep we move the ail1 emptying to its own function, empty_ail1_list. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-05 19:35:54 +02:00
Bob Peterson	ea4e61c7f4	gfs2: introduce new gfs2_glock_assert_withdraw Before this patch, asserts based on glocks did not print the glock with the error. This patch introduces a new macro, gfs2_glock_assert_withdraw which first prints the glock, then takes the assert. This also changes a few glock asserts to the new macro. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-05 16:44:29 +02:00
Bob Peterson	7e901d6e95	gfs2: print mapping->nrpages in glock dump for address space glocks This patch makes the glock dumps in debugfs print the number of pages (nrpages) for address space glocks. This will aid in debugging. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2020-06-05 14:58:23 +02:00
Linus Torvalds	886d7de631	Merge branch 'akpm' (patches from Andrew) Merge yet more updates from Andrew Morton: - More MM work. 100ish more to go. Mike Rapoport's "mm: remove __ARCH_HAS_5LEVEL_HACK" series should fix the current ppc issue - Various other little subsystems * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits) lib/ubsan.c: fix gcc-10 warnings tools/testing/selftests/vm: remove duplicate headers selftests: vm: pkeys: fix multilib builds for x86 selftests: vm: pkeys: use the correct page size on powerpc selftests/vm/pkeys: override access right definitions on powerpc selftests/vm/pkeys: test correct behaviour of pkey-0 selftests/vm/pkeys: introduce a sub-page allocator selftests/vm/pkeys: detect write violation on a mapped access-denied-key page selftests/vm/pkeys: associate key on a mapped page and detect write violation selftests/vm/pkeys: associate key on a mapped page and detect access violation selftests/vm/pkeys: improve checks to determine pkey support selftests/vm/pkeys: fix assertion in test_pkey_alloc_exhaust() selftests/vm/pkeys: fix number of reserved powerpc pkeys selftests/vm/pkeys: introduce powerpc support selftests/vm/pkeys: introduce generic pkey abstractions selftests: vm: pkeys: use the correct huge page size selftests/vm/pkeys: fix alloc_random_pkey() to make it really random selftests/vm/pkeys: fix assertion in pkey_disable_set/clear() selftests/vm/pkeys: fix pkey_disable_clear() selftests: vm: pkeys: add helpers for pkey bits ...	2020-06-04 19:18:29 -07:00
Christoph Hellwig	762a3af6fa	exec: open code copy_string_kernel Currently copy_string_kernel is just a wrapper around copy_strings that simplifies the calling conventions and uses set_fs to allow passing a kernel pointer. But due to the fact the we only need to handle a single kernel argument pointer, the logic can be sigificantly simplified while getting rid of the set_fs. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/20200501104105.2621149-3-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-04 19:06:26 -07:00
Christoph Hellwig	986db2d14a	exec: simplify the copy_strings_kernel calling convention copy_strings_kernel is always used with a single argument, adjust the calling convention to that. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/20200501104105.2621149-2-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-04 19:06:26 -07:00
Joe Perches	a396301578	fs/seq_file.c: seq_read: Update pr_info_ratelimited Use a more common logging style. Add and use pr_fmt, coalesce the format string, align arguments, use better grammar. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vasily Averin <vvs@virtuozzo.com> Link: http://lkml.kernel.org/r/96ff603230ca1bd60034c36519be3930c3a3a226.camel@perches.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-04 19:06:25 -07:00
OGAWA Hirofumi	898310032b	fat: improve the readahead for FAT entries Current readahead for FAT entries is very simple but is having some flaws, so it is not working well for some environments. This patch improves the readahead more or less. The key points of modification are, - make the readahead size tunable by using bdi->ra_pages - care the bdi->io_pages to avoid the small size I/O request - update readahead window before fully exhausting With this patch, on slow USB connected 2TB hdd: [before] 383.18sec [after] 51.03sec Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: hyeongseok.kim <hyeongseok.kim@lge.com> Reviewed-by: hyeongseok.kim <hyeongseok.kim@lge.com> Link: http://lkml.kernel.org/r/87d08e1dlh.fsf@mail.parknet.co.jp Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-04 19:06:25 -07:00
OGAWA Hirofumi	b1b65750b8	fat: don't allow to mount if the FAT length == 0 If FAT length == 0, the image doesn't have any data. And it can be the cause of overlapping the root dir and FAT entries. Also Windows treats it as invalid format. Reported-by: syzbot+6f1624f937d9d6911e2d@syzkaller.appspotmail.com Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Link: http://lkml.kernel.org/r/87r1wz8mrd.fsf@mail.parknet.co.jp Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-04 19:06:25 -07:00
Anthony Iliopoulos	852991dd3a	fs/binfmt_elf: remove redundant elf_map ifndef The ifndef was added a long time ago to support archs that would define their own mapping function. The last user was the metag arch which was removed from the tree, and as such there are no users left. Let's kill it. Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200402161543.4119-1-ailiop@suse.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-04 19:06:25 -07:00
Alexey Dobriyan	8977a27b66	proc: rename "catch" function argument "catch" is reserved keyword in C++, rename it to something both gcc and g++ accept. Rename "ign" for symmetry. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200331210905.GA31680@avx2 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-04 19:06:24 -07:00
Linus Torvalds	15a2bc4dbb	Merge branch 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull execve updates from Eric Biederman: "Last cycle for the Nth time I ran into bugs and quality of implementation issues related to exec that could not be easily be fixed because of the way exec is implemented. So I have been digging into exec and cleanup up what I can. I don't think I have exec sorted out enough to fix the issues I started with but I have made some headway this cycle with 4 sets of changes. - promised cleanups after introducing exec_update_mutex - trivial cleanups for exec - control flow simplifications - remove the recomputation of bprm->cred The net result is code that is a bit easier to understand and work with and a decrease in the number of lines of code (if you don't count the added tests)" * 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (24 commits) exec: Compute file based creds only once exec: Add a per bprm->file version of per_clear binfmt_elf_fdpic: fix execfd build regression selftests/exec: Add binfmt_script regression test exec: Remove recursion from search_binary_handler exec: Generic execfd support exec/binfmt_script: Don't modify bprm->buf and then return -ENOEXEC exec: Move the call of prepare_binprm into search_binary_handler exec: Allow load_misc_binary to call prepare_binprm unconditionally exec: Convert security_bprm_set_creds into security_bprm_repopulate_creds exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds exec: Teach prepare_exec_creds how exec treats uids & gids exec: Set the point of no return sooner exec: Move handling of the point of no return to the top level exec: Run sync_mm_rss before taking exec_update_mutex exec: Fix spelling of search_binary_handler in a comment exec: Move the comment from above de_thread to above unshare_sighand exec: Rename flush_old_exec begin_new_exec exec: Move most of setup_new_exec into flush_old_exec exec: In setup_new_exec cache current in the local variable me ...	2020-06-04 14:07:08 -07:00
Linus Torvalds	9ff7258575	Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull proc updates from Eric Biederman: "This has four sets of changes: - modernize proc to support multiple private instances - ensure we see the exit of each process tid exactly - remove has_group_leader_pid - use pids not tasks in posix-cpu-timers lookup Alexey updated proc so each mount of proc uses a new superblock. This allows people to actually use mount options with proc with no fear of messing up another mount of proc. Given the kernel's internal mounts of proc for things like uml this was a real problem, and resulted in Android's hidepid mount options being ignored and introducing security issues. The rest of the changes are small cleanups and fixes that came out of my work to allow this change to proc. In essence it is swapping the pids in de_thread during exec which removes a special case the code had to handle. Then updating the code to stop handling that special case" * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: proc_pid_ns takes super_block as an argument remove the no longer needed pid_alive() check in __task_pid_nr_ns() posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type posix-cpu-timers: Extend rcu_read_lock removing task_struct references signal: Remove has_group_leader_pid exec: Remove BUG_ON(has_group_leader_pid) posix-cpu-timer: Unify the now redundant code in lookup_task posix-cpu-timer: Tidy up group_leader logic in lookup_task proc: Ensure we see the exit of each process tid exactly once rculist: Add hlists_swap_heads_rcu proc: Use PIDTYPE_TGID in next_tgid Use proc_pid_ns() to get pid_namespace from the proc superblock proc: use named enums for better readability proc: use human-readable values for hidepid docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior proc: add option to mount only a pids subset proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option proc: allow to mount many instances of proc in one pid namespace proc: rename struct proc_fs_info to proc_fs_opts	2020-06-04 13:54:34 -07:00
Linus Torvalds	051c3556e3	\n -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl7Y7B4ACgkQnJ2qBz9k QNnWtAf+OJz782G6BsJrZtOgm5Vm+CSHmdKN8GHnDACT+mlNrTrLZi1OvfWjXtU/ UxX+l9w3OU/RW5uiMYrgN1Ajt5eIxT7AmszA1v7hbpLwIQzstW23DgEZLwB74+JA xLMH7xCb2jiVXWb0yQPLTiVHfGN99I4RHSWnc+OaIXe6qO6yIS3uS/k7PWMk9sSx BRfDKAxXjoz6Is9r6BYg1Ds4ZsmwmouoDIoA5h/PhRH07VArqTkMw3ahy2rZ61Ls 1IkU8zYKZdV2oKTRfQYxlCaEWE+65GZerTyAPuzHya93pAXAlfosIiXg6EnjiovB jseIlGbzVtZbuAug+OhXivd2U7H+Aw== =lWbb -----END PGP SIGNATURE----- Merge tag 'for_v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext2 and reiserfs cleanups from Jan Kara: "Two small cleanups for ext2 and one for reiserfs" * tag 'for_v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: reiserfs: Replace kmalloc with kcalloc in the comment ext2: code cleanup by removing ifdef macro surrounding ext2: Fix i_op setting for special inode	2020-06-04 13:53:10 -07:00
Linus Torvalds	07c8f3bfef	\n -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl7Y2McACgkQnJ2qBz9k QNlHzwf/e4oz9oRCXPqBwh6C318nl6ksQO5ooW+Dhb535cr/Cn99nuZa3GrvW+aq eSbypsvZQMguk0/okEc4jcTgLmEw+KubpBXOi/DJZ9dzGQrvjT2nBkQmaTqwp9dO WMZcJLmszkrtokjKD4lVjyQArcwqQF/v/moEKIImw5A6CY4R4odTaUOCPnTwF7P6 OXsDPwRfAccJ25ZUZ8hjc+fRl/Ncex6szciaJ08T4btlaAtc5UIn5Sy/u8BqNNiw 0VRheD4sJ2c25hLOIQJ5RETIeuYaRcR/BA3vm+k1d2iIiw4ubj9+ppwiaWOryA9U 5fXnBmXKuUUrwFihzmiLSckIpm3IPg== =kghV -----END PGP SIGNATURE----- Merge tag 'fsnotify_for_v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "Several smaller fixes and cleanups for fsnotify subsystem" * tag 'fsnotify_for_v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: fix ignore mask logic for events on child and on dir fanotify: don't write with size under sizeof(response) fsnotify: Remove proc_fs.h include fanotify: remove reference to fill_event_metadata() fsnotify: add mutex destroy fanotify: prefix should_merge() fanotify: Replace zero-length array with flexible-array inotify: Fix error return code assignment flow. fsnotify: Add missing annotation for fsnotify_finish_user_wait() and for fsnotify_prepare_user_wait()	2020-06-04 13:51:54 -07:00
Linus Torvalds	d77d1dbba9	zonefs changes for 5.8 Only one patch in this pull request to cleanup handling of uuid using the import_uuid() helper, from Andy. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCXtg3xAAKCRDdoc3SxdoY dp+AAQCIAQpe4qyF5hJtwLPY+qffDDuHDxHjrERpA6c7fpKicgD+K6uDIwZ8Y6L8 XXYPmKer58rV61jX4hvZGCAYwLmzRwA= =Vc+W -----END PGP SIGNATURE----- Merge tag 'zonefs-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs Pull zonefs update from Damien Le Moal: "Only one patch in this pull request to cleanup handling of uuid using the import_uuid() helper, from Andy" * tag 'zonefs-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs: zonefs: Replace uuid_copy() with import_uuid()	2020-06-04 13:50:13 -07:00
Steve French	331cc667a9	cifs: update internal module version number To 2.27 Signed-off-by: Steve French <stfrench@microsoft.com>	2020-06-04 13:50:55 -05:00
Aurelien Aptel	2f58967979	cifs: multichannel: try to rebind when reconnecting a channel first steps in trying to make channels properly reconnect. * add cifs_ses_find_chan() function to find the enclosing cifs_chan struct it belongs to * while we have the session lock and are redoing negprot and sess.setup in smb2_reconnect() redo the binding of channels. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-06-04 13:50:55 -05:00
Aurelien Aptel	8eec79540d	cifs: multichannel: use pointer for binding channel Add a cifs_chan pointer in struct cifs_ses that points to the channel currently being bound if ses->binding is true. Previously it was always the channel past the established count. This will make reconnecting (and rebinding) a channel easier later on. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-06-04 13:50:55 -05:00
Steve French	edb1613536	smb3: remove static checker warning Remove static checker warning pointed out by Dan Carpenter: The patch feeaec621c09: "cifs: multichannel: move channel selection above transport layer" from Apr 24, 2020, leads to the following static checker warning: fs/cifs/smb2pdu.c:149 smb2_hdr_assemble() error: we previously assumed 'tcon->ses' could be null (see line 133) Reported-by: Dan Carpenter <dan.carpenter@oracle.com> CC: Aurelien Aptel <aptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-06-04 13:50:55 -05:00
Aurelien Aptel	352d96f3ac	cifs: multichannel: move channel selection above transport layer Move the channel (TCP_Server_Info) selection from the tranport layer to higher in the call stack so that: - credit handling is done with the server that will actually be used to send. ->wait_mtu_credit * ->set_credits / set_credits * ->add_credits / add_credits * add_credits_and_wake_if - potential reconnection (smb2_reconnect) done when initializing a request is checked and done with the server that will actually be used to send. To do this: - remove the cifs_pick_channel() call out of compound_send_recv() - select channel and pass it down by adding a cifs_pick_channel(ses) call in: - smb311_posix_mkdir - SMB2_open - SMB2_ioctl - __SMB2_close - query_info - SMB2_change_notify - SMB2_flush - smb2_async_readv (if none provided in context param) - SMB2_read (if none provided in context param) - smb2_async_writev (if none provided in context param) - SMB2_write (if none provided in context param) - SMB2_query_directory - send_set_info - SMB2_oplock_break - SMB311_posix_qfs_info - SMB2_QFS_info - SMB2_QFS_attr - smb2_lockv - SMB2_lease_break - smb2_compound_op - smb2_set_ea - smb2_ioctl_query_info - smb2_query_dir_first - smb2_query_info_comound - smb2_query_symlink - cifs_writepages - cifs_write_from_iter - cifs_send_async_read - cifs_read - cifs_readpages - add TCP_Server_Info *server param argument to: - cifs_send_recv - compound_send_recv - SMB2_open_init - SMB2_query_info_init - SMB2_set_info_init - SMB2_close_init - SMB2_ioctl_init - smb2_iotcl_req_init - SMB2_query_directory_init - SMB2_notify_init - SMB2_flush_init - build_qfs_info_req - smb2_hdr_assemble - smb2_reconnect - fill_small_buf - smb2_plain_req_init - __smb2_plain_req_init The read/write codepath is different than the rest as it is using pages, io iterators and async calls. To deal with those we add a server pointer in the cifs_writedata/cifs_readdata/cifs_io_parms context struct and set it in: - cifs_writepages (wdata) - cifs_write_from_iter (wdata) - cifs_readpages (rdata) - cifs_send_async_read (rdata) The [rw]data->server pointer is eventually copied to cifs_io_parms->server to pass it down to SMB2_read/SMB2_write. If SMB2_read/SMB2_write is called from a different place that doesn't set the server field it will pick a channel. Some places do not pick a channel and just use ses->server or cifs_ses_server(ses). All cifs_ses_server(ses) calls are in codepaths involving negprot/sess.setup. - SMB2_negotiate (binding channel) - SMB2_sess_alloc_buffer (binding channel) - SMB2_echo (uses provided one) - SMB2_logoff (uses master) - SMB2_tdis (uses master) (list not exhaustive) Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-06-04 13:50:55 -05:00
Aurelien Aptel	7c06514afd	cifs: multichannel: always zero struct cifs_io_parms SMB2_read/SMB2_write check and use cifs_io_parms->server, which might be uninitialized memory. This change makes all callers zero-initialize the struct. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-06-04 13:50:55 -05:00
Kenneth D'souza	8e84a61a9c	cifs: dump Security Type info in DebugData Currently the end user is unaware with what sec type the cifs share is mounted if no sec=<type> option is parsed. With this patch one can easily check from DebugData. Example: 1) Name: x.x.x.x Uses: 1 Capability: 0x8001f3fc Session Status: 1 Security type: RawNTLMSSP Signed-off-by: Kenneth D'souza <kdsouza@redhat.com> Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Aurelien Aptel <aaptel@suse.com>	2020-06-04 13:50:38 -05:00
Sahitya Tummala	e78790f84a	f2fs: fix retry logic in f2fs_write_cache_pages() In case a compressed file is getting overwritten, the current retry logic doesn't include the current page to be retried now as it sets the new start index as 0 and new end index as writeback_index - 1. This causes the corresponding cluster to be uncompressed and written as normal pages without compression. Fix this by allowing writeback to be retried for the current page as well (in case of compressed page getting retried due to index mismatch with cluster index). So that this cluster can be written compressed in case of overwrite. Also, align f2fs_write_cache_pages() according to the change - <64081362e8ff>("mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock"). Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-06-04 11:45:09 -07:00
Jens Axboe	dddb3e26f6	io_uring: re-set iov base/len for buffer select retry We already have the buffer selected, but we should set the iter list again. Cc: stable@vger.kernel.org # v5.7 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-04 11:45:29 -06:00
Pavel Begunkov	d2b6f48b69	io_uring: move send/recv IOPOLL check into prep Fail recv/send in case of IORING_SETUP_IOPOLL earlier during prep, so it'd be done only once. Removes duplication as well Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-04 11:14:19 -06:00
Pavel Begunkov	ec65fea5a8	io_uring: deduplicate io_openat{,2}_prep() io_openat_prep() and io_openat2_prep() are identical except for how struct open_how is built. Deduplicate it with a helper. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-04 11:14:19 -06:00
Pavel Begunkov	25e72d1012	io_uring: do build_open_how() only once build_open_how() is just adjusting open_flags/mode. Do it once during prep. It looks better than storing raw values for the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-04 11:14:19 -06:00
Pavel Begunkov	3232dd02af	io_uring: fix {SQ,IO}POLL with unsupported opcodes IORING_SETUP_IOPOLL is defined only for read/write, other opcodes should be disallowed, otherwise it'll get an error as below. Also refuse open/close with SQPOLL, as the polling thread wouldn't know which file table to use. RIP: 0010:io_iopoll_getevents+0x111/0x5a0 Call Trace: ? _raw_spin_unlock_irqrestore+0x24/0x40 ? do_send_sig_info+0x64/0x90 io_iopoll_reap_events.part.0+0x5e/0xa0 io_ring_ctx_wait_and_kill+0x132/0x1c0 io_uring_release+0x20/0x30 __fput+0xcd/0x230 ____fput+0xe/0x10 task_work_run+0x67/0xa0 do_exit+0x353/0xb10 ? handle_mm_fault+0xd4/0x200 ? syscall_trace_enter+0x18c/0x2c0 do_group_exit+0x43/0xa0 __x64_sys_exit_group+0x18/0x20 do_syscall_64+0x60/0x1e0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: allow provide/remove buffers and files update] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-04 11:13:53 -06:00
David Howells	8409f67b64	afs: Adjust the fileserver rotation algorithm to reprobe/retry more quickly Adjust the fileserver rotation algorithm so that if we've tried all the addresses on a server (cumulatively over multiple operations) until we've run out of untried addresses, immediately reprobe all that server's interfaces and retry the op at least once before we move onto the next server. Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-04 15:37:58 +01:00
David Howells	32275d3f75	afs: Show more a bit more server state in /proc/net/afs/servers Display more information about the state of a server record, including the flags, rtt and break counter plus the probe state for each server in /proc/net/afs/servers. Rearrange the server flags a bit to make them easier to read at a glance in the proc file. Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-04 15:37:58 +01:00
David Howells	f3c130e6e6	afs: Don't use probe running state to make decisions outside probe code Don't use the running state for fileserver probes to make decisions about which server to use as the state is cleared at the start of a probe and also intermediate values might be misleading. Instead, add a separate 'latest known' rtt in the afs_server struct and a flag to indicate if the server is known to be responding and update these as and when we know what to change them to. Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-04 15:37:58 +01:00
David Howells	f11a016a85	afs: Fix afs_statfs() to not let the values go below zero Fix afs_statfs() so that the value for f_bavail and f_bfree don't go "negative" if the number of blocks in use by a volume exceeds the max quota for that volume. Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-04 15:37:58 +01:00
David Howells	3c4c4075fc	afs: Fix the by-UUID server tree to allow servers with the same UUID Whilst it shouldn't happen, it is possible for multiple fileservers to share a UUID, particularly if an entire cell has been duplicated, UUIDs and all. In such a case, it's not necessarily possible to map the effect of the CB.InitCallBackState3 incoming RPC to a specific server unambiguously by UUID and thus to a specific cell. Indeed, there's a problem whereby multiple server records may need to occupy the same spot in the rb_tree rooted in the afs_net struct. Fix this by allowing servers to form a list, with the head of the list in the tree. When the front entry in the list is removed, the second in the list just replaces it. afs_init_callback_state() then just goes down the line, poking each server in the list. This means that some servers will be unnecessarily poked, unfortunately. An alternative would be to route by call parameters. Reported-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Fixes: `d2ddc776a4` ("afs: Overhaul volume and server record caching and fileserver rotation")	2020-06-04 15:37:57 +01:00
David Howells	20325960f8	afs: Reorganise volume and server trees to be rooted on the cell Reorganise afs_volume objects such that they're in a tree keyed on volume ID, rooted at on an afs_cell object rather than being in multiple trees, each of which is rooted on an afs_server object. afs_server structs become per-cell and acquire a pointer to the cell. The process of breaking a callback then starts with finding the server by its network address, following that to the cell and then looking up each volume ID in the volume tree. This is simpler than the afs_vol_interest/afs_cb_interest N:M mapping web and allows those structs and the code for maintaining them to be simplified or removed. It does make a couple of things a bit more tricky, though: (1) Operations now start with a volume, not a server, so there can be more than one answer as to whether or not the server we'll end up using supports the FS.InlineBulkStatus RPC. (2) CB RPC operations that specify the server UUID. There's still a tree of servers by UUID on the afs_net struct, but the UUIDs in it aren't guaranteed unique. Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-04 15:37:57 +01:00
David Howells	cca37d45d5	afs: Add a tracepoint to track the lifetime of the afs_volume struct Add a tracepoint to track the lifetime of the afs_volume struct. Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-04 15:37:57 +01:00
David Howells	6dfdf5369c	afs: Detect cell aliases 3 - YFS Cells with a canonical cell name op YFS Volume Location servers have an operation by which the cell name may be queried. Use this to find out what a YFS server thinks the canonical cell name should be. Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-04 15:37:57 +01:00
David Howells	6ef350b184	afs: Detect cell aliases 2 - Cells with no root volumes Implement the second phase of cell alias detection. This part handles alias detection for cells that don't have root.cell volumes and so we have to find some other volume or fileserver to query. We take the first volume from each such cell and attempt to look it up in the new cell. If found, we compare the records, if they are the same, we judge the cell names to be aliases. Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-04 15:37:57 +01:00
David Howells	8a070a9648	afs: Detect cell aliases 1 - Cells with root volumes Put in the first phase of cell alias detection. This part handles alias detection for cells that have root.cell volumes (which is expected to be likely). When a cell becomes newly active, it is probed for its root.cell volume, and if it has one, this volume is compared against other root.cell volumes to find out if the list of fileserver UUIDs have any in common - and if that's the case, do the address lists of those fileservers have any addresses in common. If they do, the new cell is adjudged to be an alias of the old cell and the old cell is used instead. Comparing is aided by the server list in struct afs_server_list being sorted in UUID order and the addresses in the fileserver address lists being sorted in address order. The cell then retains the afs_volume object for the root.cell volume, even if it's not mounted for future alias checking. This necessary because: (1) Whilst fileservers have UUIDs that are meant to be globally unique, in practice they are not because cells get cloned without changing the UUIDs - so afs_server records need to be per cell. (2) Sometimes the DNS is used to make cell aliases - but if we don't know they're the same, we may end up with multiple superblocks and multiple afs_server records for the same thing, impairing our ability to deliver callback notifications of third party changes (3) The fileserver RPC API doesn't contain the cell name, so it can't tell us which cell it's notifying and can't see that a change made to to one cell should notify the same client that's also accessed as the other cell. Reported-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-04 15:37:57 +01:00
David Howells	c3e9f88826	afs: Implement client support for the YFSVL.GetCellName RPC op Implement client support for the YFSVL.GetCellName RPC operation by which YFS permits the canonical cell name to be queried from a VL server. Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-04 15:37:57 +01:00
David Howells	194d28cf19	afs: Retain more of the VLDB record for alias detection Save more bits from the volume location database record obtained for a server so that we can use this information in cell alias detection. Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-04 15:37:57 +01:00
David Howells	3120c170ef	afs: Fix handling of CB.ProbeUuid cache manager op The AFS filesystem driver is handling the CB.ProbeUuid request incorrectly. The UUID presented in the request is that of the cache manager, not the fileserver, so afs_deliver_cb_probe_uuid() shouldn't be using that UUID to look up the server. Fix this by looking up the server by address instead. Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-04 15:37:57 +01:00
David Howells	44746355cc	afs: Don't get epoch from a server because it may be ambiguous Don't get the epoch from a server, particularly one that we're looking up by UUID, as UUIDs may be ambiguous and may map to more than one server - so we can't draw any conclusions from it. Reported-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-04 15:37:56 +01:00
David Howells	e49c7b2f6d	afs: Build an abstraction around an "operation" concept Turn the afs_operation struct into the main way that most fileserver operations are managed. Various things are added to the struct, including the following: (1) All the parameters and results of the relevant operations are moved into it, removing corresponding fields from the afs_call struct. afs_call gets a pointer to the op. (2) The target volume is made the main focus of the operation, rather than the target vnode(s), and a bunch of op->vnode->volume are made op->volume instead. (3) Two vnode records are defined (op->file[]) for the vnode(s) involved in most operations. The vnode record (struct afs_vnode_param) contains: - The vnode pointer. - The fid of the vnode to be included in the parameters or that was returned in the reply (eg. FS.MakeDir). - The status and callback information that may be returned in the reply about the vnode. - Callback break and data version tracking for detecting simultaneous third-parth changes. (4) Pointers to dentries to be updated with new inodes. (5) An operations table pointer. The table includes pointers to functions for issuing AFS and YFS-variant RPCs, handling the success and abort of an operation and handling post-I/O-lock local editing of a directory. To make this work, the following function restructuring is made: (A) The rotation loop that issues calls to fileservers that can be found in each function that wants to issue an RPC (such as afs_mkdir()) is extracted out into common code, in a new file called fs_operation.c. (B) The rotation loops, such as the one in afs_mkdir(), are replaced with a much smaller piece of code that allocates an operation, sets the parameters and then calls out to the common code to do the actual work. (C) The code for handling the success and failure of an operation are moved into operation functions (as (5) above) and these are called from the core code at appropriate times. (D) The pseudo inode getting stuff used by the dynamic root code is moved over into dynroot.c. (E) struct afs_iget_data is absorbed into the operation struct and afs_iget() expects to be given an op pointer and a vnode record. (F) Point (E) doesn't work for the root dir of a volume, but we know the FID in advance (it's always vnode 1, unique 1), so a separate inode getter, afs_root_iget(), is provided to special-case that. (G) The inode status init/update functions now also take an op and a vnode record. (H) The RPC marshalling functions now, for the most part, just take an afs_operation struct as their only argument. All the data they need is held there. The result delivery functions write their answers there as well. (I) The call is attached to the operation and then the operation core does the waiting. And then the new operation code is, for the moment, made to just initialise the operation, get the appropriate vnode I/O locks and do the same rotation loop as before. This lays the foundation for the following changes in the future: () Overhauling the rotation (again). () Support for asynchronous I/O, where the fileserver rotation must be done asynchronously also. Signed-off-by: David Howells <dhowells@redhat.com>	2020-06-04 15:37:17 +01:00
Miklos Szeredi	74c6e384e9	ovl: make oip->index bool ovl_get_inode() uses oip->index as a bool value, not as a pointer. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-04 10:48:19 +02:00
Miklos Szeredi	b778e1ee1a	ovl: only pass ->ki_flags to ovl_iocb_to_rwf() Next patch will want to pass a modified set of flags, so... Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-04 10:48:19 +02:00
Miklos Szeredi	df820f8de4	ovl: make private mounts longterm Overlayfs is using clone_private_mount() to create internal mounts for underlying layers. These are used for operations requiring a path, such as dentry_open(). Since these private mounts are not in any namespace they are treated as short term, "detached" mounts and mntput() involves taking the global mount_lock, which can result in serious cacheline pingpong. Make these private mounts longterm instead, which trade the penalty on mntput() for a slightly longer shutdown time due to an added RCU grace period when putting these mounts. Introduce a new helper kern_unmount_many() that can take care of multiple longterm mounts with a single RCU grace period. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-04 10:48:19 +02:00
Miklos Szeredi	b8e42a651b	ovl: get rid of redundant members in struct ovl_fs ofs->upper_mnt is copied to ->layers[0].mnt and ->layers[0].trap could be used instead of a separate ->upperdir_trap. Split the lowerdir option early to get the number of layers, then allocate the ->layers array, and finally fill the upper and lower layers, as before. Get rid of path_put_init() in ovl_lower_dir(), since the only caller will take care of that. [Colin Ian King] Fix null pointer dereference on null stack pointer on error return found by Coverity. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-04 10:48:19 +02:00
Miklos Szeredi	08f4c7c86d	ovl: add accessor for ofs->upper_mnt Next patch will remove ofs->upper_mnt, so add an accessor function for this field. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-04 10:48:19 +02:00
Yuxuan Shui	520da69d26	ovl: initialize error in ovl_copy_xattr In ovl_copy_xattr, if all the xattrs to be copied are overlayfs private xattrs, the copy loop will terminate without assigning anything to the error variable, thus returning an uninitialized value. If ovl_copy_xattr is called from ovl_clear_empty, this uninitialized error value is put into a pointer by ERR_PTR(), causing potential invalid memory accesses down the line. This commit initialize error with 0. This is the correct value because when there's no xattr to copy, because all xattrs are private, ovl_copy_xattr should succeed. This bug is discovered with the help of INIT_STACK_ALL and clang. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com> Link: https://bugs.chromium.org/p/chromium/issues/detail?id=1050405 Fixes: `0956254a2d` ("ovl: don't copy up opaqueness") Cc: stable@vger.kernel.org # v4.8 Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-04 10:48:19 +02:00
Steve French	e80ddeb2f7	smb3: fix incorrect number of credits when ioctl MaxOutputResponse > 64K We were not checking to see if ioctl requests asked for more than 64K (ie when CIFSMaxBufSize was > 64K) so when setting larger CIFSMaxBufSize then ioctls would fail with invalid parameter errors. When requests ask for more than 64K in MaxOutputResponse then we need to ask for more than 1 credit. Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2020-06-04 01:13:41 -05:00
Steve French	1ee0e6d47d	smb3: default to minimum of two channels when multichannel specified When "multichannel" is specified on mount, make sure to default to at least two channels. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2020-06-04 01:13:37 -05:00
Linus Torvalds	ee01c4d72a	Merge branch 'akpm' (patches from Andrew) Merge more updates from Andrew Morton: "More mm/ work, plenty more to come Subsystems affected by this patch series: slub, memcg, gup, kasan, pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs, thp, mmap, kconfig" * akpm: (131 commits) arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined riscv: support DEBUG_WX mm: add DEBUG_WX support drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid() powerpc/mm: drop platform defined pmd_mknotpresent() mm: thp: don't need to drain lru cache when splitting and mlocking THP hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs sparc32: register memory occupied by kernel as memblock.memory include/linux/memblock.h: fix minor typo and unclear comment mm, mempolicy: fix up gup usage in lookup_node tools/vm/page_owner_sort.c: filter out unneeded line mm: swap: memcg: fix memcg stats for huge pages mm: swap: fix vmstats for huge pages mm: vmscan: limit the range of LRU type balancing mm: vmscan: reclaim writepage is IO cost mm: vmscan: determine anon/file pressure balance at the reclaim root mm: balance LRU lists based on relative thrashing mm: only count actual rotations as LRU reclaim cost ...	2020-06-03 20:24:15 -07:00
Jan Kara	6b8ed62008	ext4: avoid unnecessary transaction starts during writeback ext4_writepages() currently works in a loop like: start a transaction scan inode for pages to write map and submit these pages stop the transaction This loop results in starting transaction once more than is needed because in the last iteration we start a transaction only to scan the inode and find there are no pages to write. This can be significant increase in number of transaction starts for single-extent files or files that have all blocks already mapped. Furthermore we already know from previous iteration whether there are more pages to write or not. So propagate the information from mpage_prepare_extent_to_map() and avoid unnecessary looping in case there are no more pages to write. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200525081215.29451-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:56 -04:00
Jens Axboe	6e014c621e	ext4: don't block for O_DIRECT if IOCB_NOWAIT is set Running with some debug patches to detect illegal blocking triggered the extend/unaligned condition in ext4. If ext4 needs to extend the file (and hence go to buffered IO), or if the app is doing unaligned IO, then ext4 asks the iomap code to wait for IO completion. If the caller asked for no-wait semantics by setting IOCB_NOWAIT, then ext4 should return -EAGAIN instead. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/76152096-2bbb-7682-8fce-4cb498bcd909@kernel.dk Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:55 -04:00
Christoph Hellwig	ba98890393	ext4: remove the access_ok() check in ext4_ioctl_get_es_cache access_ok just checks we are fed a proper user pointer. We also do that in copy_to_user itself, so no need to do this early. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200523073016.2944131-10-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:55 -04:00
Christoph Hellwig	c7d216e8c4	fs: remove the access_ok() check in ioctl_fiemap access_ok just checks we are fed a proper user pointer. We also do that in copy_to_user itself, so no need to do this early. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-9-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:55 -04:00
Christoph Hellwig	45dd052e67	fs: handle FIEMAP_FLAG_SYNC in fiemap_prep By moving FIEMAP_FLAG_SYNC handling to fiemap_prep we ensure it is handled once instead of duplicated, but can still be done under fs locks, like xfs/iomap intended with its duplicate handling. Also make sure the error value of filemap_write_and_wait is propagated to user space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-8-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:55 -04:00
Christoph Hellwig	cddf8a2c4a	fs: move fiemap range validation into the file systems instances Replace fiemap_check_flags with a fiemap_prep helper that also takes the inode and mapped range, and performs the sanity check and truncation previously done in fiemap_check_range. This way the validation is inside the file system itself and thus properly works for the stacked overlayfs case as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-7-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:55 -04:00
Christoph Hellwig	2732881894	iomap: fix the iomap_fiemap prototype iomap_fiemap should take u64 start and len arguments, just like the ->fiemap prototype. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-6-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:55 -04:00
Christoph Hellwig	10c5db2864	fs: move the fiemap definitions out of fs.h No need to pull the fiemap definitions into almost every file in the kernel build. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-5-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:55 -04:00
Christoph Hellwig	44ebcd06bb	fs: mark __generic_block_fiemap static There is no caller left outside of ioctl.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-4-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:54 -04:00
Christoph Hellwig	da565e792b	ext4: remove the call to fiemap_check_flags in ext4_fiemap iomap_fiemap already calls fiemap_check_flags first thing, so this additional check is redundant. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200523073016.2944131-3-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:54 -04:00
Christoph Hellwig	03a5ed24c9	ext4: split _ext4_fiemap The fiemap and EXT4_IOC_GET_ES_CACHE cases share almost no code, so split them into entirely separate functions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200523073016.2944131-2-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:54 -04:00
Christoph Hellwig	328e24ae14	ext4: fix fiemap size checks for bitmap files Add an extra validation of the len parameter, as for ext4 some files might have smaller file size limits than others. This also means the redundant size check in ext4_ioctl_get_es_cache can go away, as all size checking is done in the shared fiemap handler. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200505154324.3226743-3-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:54 -04:00
Ritesh Harjani	175efa81fe	ext4: fix EXT4_MAX_LOGICAL_BLOCK macro ext4 supports max number of logical blocks in a file to be 0xffffffff. (This is since ext4_extent's ee_block is __le32). This means that EXT4_MAX_LOGICAL_BLOCK should be 0xfffffffe (starting from 0 logical offset). This patch fixes this. The issue was seen when ext4 moved to iomap_fiemap API and when overlayfs was mounted on top of ext4. Since overlayfs was missing filemap_check_ranges(), so it could pass a arbitrary huge length which lead to overflow of map.m_len logic. This patch fixes that. Fixes: `d3b6f23f71` ("ext4: move ext4_fiemap to use iomap framework") Reported-by: syzbot+77fa5bdb65cc39711820@syzkaller.appspotmail.com Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20200505154324.3226743-2-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:54 -04:00
Jonathan Grant	9f364e1d95	add comment for ext4_dir_entry_2 file_type member Signed-off-by: Jonathan Grant <jg@jguk.org> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/ad3290d5-86af-99c1-f9d5-cd1bab710429@jguk.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:54 -04:00
Jan Kara	14ff628630	jbd2: avoid leaking transaction credits when unreserving handle When reserved transaction handle is unused, we subtract its reserved credits in __jbd2_journal_unreserve_handle() called from jbd2_journal_stop(). However this function forgets to remove reserved credits from transaction->t_outstanding_credits and thus the transaction space that was reserved remains effectively leaked. The leaked transaction space can be quite significant in some cases and leads to unnecessarily small transactions and thus reducing throughput of the journalling machinery. E.g. fsmark workload creating lots of 4k files was observed to have about 20% lower throughput due to this when ext4 is mounted with dioread_nolock mount option. Subtract reserved credits from t_outstanding_credits as well. CC: stable@vger.kernel.org Fixes: `8f7d89f368` ("jbd2: transaction reservation support") Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200520133119.1383-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:53 -04:00
Jan Kara	dfcd4489e2	ext4: drop ext4_journal_free_reserved() Remove ext4_journal_free_reserved() function. It is never used. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20200520133119.1383-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:53 -04:00
Ritesh Harjani	993778306e	ext4: mballoc: use lock for checking free blocks while retrying Currently while doing block allocation grp->bb_free may be getting modified if discard is happening in parallel. For e.g. consider a case where there are lot of threads who have preallocated lot of blocks and there is a thread which is trying to discard all of this group's PA. Now it could happen that we see all of those group's bb_free is zero and fail the allocation while there is sufficient space if we free up all the PA. So this patch adds another flag "EXT4_MB_STRICT_CHECK" which will be set if we are unable to allocate any blocks in the first try (since we may not have considered blocks about to be discarded from PA lists). So during retry attempt to allocate blocks we will use ext4_lock_group() for checking if the group is good or not. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/9cb740a117c958c36596f167b12af1beae9a68b7.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:53 -04:00
Ritesh Harjani	8ef123fe02	ext4: mballoc: refactor ext4_mb_good_group() ext4_mb_good_group() definition was changed some time back and now it even initializes the buddy cache (via ext4_mb_init_group()), if in case the EXT4_MB_GRP_NEED_INIT() is true for a group. Note that ext4_mb_init_group() could sleep and so should not be called under a spinlock held. This is fine as of now because ext4_mb_good_group() is called before loading the buddy bitmap without ext4_lock_group() held and again called after loading the bitmap, only this time with ext4_lock_group() held. But still this whole thing is confusing. So this patch refactors out ext4_mb_good_group_nolock() which should be called when without holding ext4_lock_group(). Also in further patches we hold the spinlock (ext4_lock_group()) while doing any calculations which involves grp->bb_free or grp->bb_fragments. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/d9f7d031a5fbe1c943fae6bf1ff5cdf0604ae722.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:53 -04:00
Ritesh Harjani	07b5b8e1ac	ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling There could be a race in function ext4_mb_discard_group_preallocations() where the 1st thread may iterate through group's bb_prealloc_list and remove all the PAs and add to function's local list head. Now if the 2nd thread comes in to discard the group preallocations, it will see that the group->bb_prealloc_list is empty and will return 0. Consider for a case where we have less number of groups (for e.g. just group 0), this may even return an -ENOSPC error from ext4_mb_new_blocks() (where we call for ext4_mb_discard_group_preallocations()). But that is wrong, since 2nd thread should have waited for 1st thread to release all the PAs and should have retried for allocation. Since 1st thread was anyway going to discard the PAs. The algorithm using this percpu seq counter goes below: 1. We sample the percpu discard_pa_seq counter before trying for block allocation in ext4_mb_new_blocks(). 2. We increment this percpu discard_pa_seq counter when we either allocate or free these blocks i.e. while marking those blocks as used/free in mb_mark_used()/mb_free_blocks(). 3. We also increment this percpu seq counter when we successfully identify that the bb_prealloc_list is not empty and hence proceed for discarding of those PAs inside ext4_mb_discard_group_preallocations(). Now to make sure that the regular fast path of block allocation is not affected, as a small optimization we only sample the percpu seq counter on that cpu. Only when the block allocation fails and when freed blocks found were 0, that is when we sample percpu seq counter for all cpus using below function ext4_get_discard_pa_seq_sum(). This happens after making sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty. It can be well argued that why don't just check for grp->bb_free to see if there are any free blocks to be allocated. So here are the two concerns which were discussed:- 1. If for some reason the blocks available in the group are not appropriate for allocation logic (say for e.g. EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then the retry logic may result into infinte looping since grp->bb_free is non-zero. 2. Also before preallocation was clubbed with block allocation with the same ext4_lock_group() held, there were lot of races where grp->bb_free could not be reliably relied upon. Due to above, this patch considers discard_pa_seq logic to determine if we should retry for block allocation. Say if there are are n threads trying for block allocation and none of those could allocate or discard any of the blocks, then all of those n threads will fail the block allocation and return -ENOSPC error. (Since the seq counter for all of those will match as no block allocation/discard was done during that duration). Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/7f254686903b87c419d798742fd9a1be34f0657b.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:53 -04:00
Ritesh Harjani	cf5e2ca6c9	ext4: mballoc: refactor ext4_mb_discard_preallocations() Implement ext4_mb_discard_preallocations_should_retry() which we will need in later patches to add more logic like check for sequence number match to see if we should retry for block allocation or not. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/1cfae0098d2aa9afbeb59331401258182868c8f2.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-06-03 23:16:53 -04:00

... 12 13 14 15 16 ...

66232 Commits