Commit Graph

62832 Commits

Author SHA1 Message Date
Steve French
87f93d82e0 smb3: fix problem with null cifs super block with previous patch
Add check for null cifs_sb to create_options helper

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-02-05 06:32:19 -06:00
Linus Torvalds
51a198e89a Merge tag 'jfs-5.6' of git://github.com/kleikamp/linux-shaggy
Pull jfs update from David Kleikamp:
 "Trivial cleanup for jfs"

* tag 'jfs-5.6' of git://github.com/kleikamp/linux-shaggy:
  jfs: remove unused MAXL2PAGES
2020-02-05 05:28:20 +00:00
Linus Torvalds
72f582ff85 Merge branch 'work.recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs recursive removal updates from Al Viro:
 "We have quite a few places where synthetic filesystems do an
  equivalent of 'rm -rf', with varying amounts of code duplication,
  wrong locking, etc. That really ought to be a library helper.

  Only debugfs (and very similar tracefs) are converted here - I have
  more conversions, but they'd never been in -next, so they'll have to
  wait"

* 'work.recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  simple_recursive_removal(): kernel-side rm -rf for ramfs-style filesystems
2020-02-05 05:09:46 +00:00
Linus Torvalds
bddea11b1b Merge branch 'imm.timestamp' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs timestamp updates from Al Viro:
 "More 64bit timestamp work"

* 'imm.timestamp' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kernfs: don't bother with timestamp truncation
  fs: Do not overload update_time
  fs: Delete timespec64_trunc()
  fs: ubifs: Eliminate timespec64_trunc() usage
  fs: ceph: Delete timespec64_trunc() usage
  fs: cifs: Delete usage of timespec64_trunc
  fs: fat: Eliminate timespec64_trunc() usage
  utimes: Clamp the timestamps in notify_change()
2020-02-05 05:02:42 +00:00
Jens Axboe
2faf852d1b io_uring: cleanup fixed file data table references
syzbot reports a use-after-free in io_ring_file_ref_switch() when it
tries to switch back to percpu mode. When we put the final reference to
the table by calling percpu_ref_kill_and_confirm(), we don't want the
zero reference to queue async work for flushing the potentially queued
up items. We currently do a few flush_work(), but they merely paper
around the issue, since the work item may not have been queued yet
depending on the when the percpu-ref callback gets run.

Coming into the file unregister, we know we have the ring quiesced.
io_ring_file_ref_switch() can check for whether or not the ref is dying
or not, and not queue anything async at that point. Once the ref has
been confirmed killed, flush any potential items manually.

Reported-by: syzbot+7caeaea49c2c8a591e3d@syzkaller.appspotmail.com
Fixes: 05f3fb3c53 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-04 20:04:18 -07:00
Jens Axboe
df069d80c8 io_uring: spin for sq thread to idle on shutdown
As part of io_uring shutdown, we cancel work that is pending and won't
necessarily complete on its own. That includes requests like poll
commands and timeouts.

If we're using SQPOLL for kernel side submission and we shutdown the
ring immediately after queueing such work, we can race with the sqthread
doing the submission. This means we may miss cancelling some work, which
results in the io_uring shutdown hanging forever.

Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-04 16:48:34 -07:00
Vasily Averin
9f198a2ac5 help_next should increase position index
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.

https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2020-02-04 15:22:04 -05:00
Robert Milkowski
7dc2993a9e NFSv4.0: nfs4_do_fsinfo() should not do implicit lease renewals
Currently, each time nfs4_do_fsinfo() is called it will do an implicit
NFS4 lease renewal, which is not compliant with the NFS4 specification.
This can result in a lease being expired by an NFS server.

Commit 83ca7f5ab3 ("NFS: Avoid PUTROOTFH when managing leases")
introduced implicit client lease renewal in nfs4_do_fsinfo(),
which can result in the NFSv4.0 lease to expire on a server side,
and servers returning NFS4ERR_EXPIRED or NFS4ERR_STALE_CLIENTID.

This can easily be reproduced by frequently unmounting a sub-mount,
then stat'ing it to get it mounted again, which will delay or even
completely prevent client from sending RENEW operations if no other
NFS operations are issued. Eventually nfs server will expire client's
lease and return an error on file access or next RENEW.

This can also happen when a sub-mount is automatically unmounted
due to inactivity (after nfs_mountpoint_expiry_timeout), then it is
mounted again via stat(). This can result in a short window during
which client's lease will expire on a server but not on a client.
This specific case was observed on production systems.

This patch removes the implicit lease renewal from nfs4_do_fsinfo().

Fixes: 83ca7f5ab3 ("NFS: Avoid PUTROOTFH when managing leases")
Signed-off-by: Robert Milkowski <rmilkowski@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-04 12:27:55 -05:00
Robert Milkowski
924491f2e4 NFSv4: try lease recovery on NFS4ERR_EXPIRED
Currently, if an nfs server returns NFS4ERR_EXPIRED to open(),
we return EIO to applications without even trying to recover.

Fixes: 272289a3df ("NFSv4: nfs4_do_handle_exception() handle revoke/expiry of a single stateid")
Signed-off-by: Robert Milkowski <rmilkowski@gmail.com>
Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-04 12:08:24 -05:00
Wenwen Wang
123c23c6a7 NFS: Fix memory leaks
In _nfs42_proc_copy(), 'res->commit_res.verf' is allocated through
kzalloc() if 'args->sync' is true. In the following code, if
'res->synchronous' is false, handle_async_copy() will be invoked. If an
error occurs during the invocation, the following code will not be executed
and the error will be returned . However, the allocated
'res->commit_res.verf' is not deallocated, leading to a memory leak. This
is also true if the invocation of process_copy_commit() returns an error.

To fix the above leaks, redirect the execution to the 'out' label if an
error is encountered.

Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-04 11:01:54 -05:00
Dai Ngo
227823d207 nfs: optimise readdir cache page invalidation
When the directory is large and it's being modified by one client
while another client is doing the 'ls -l' on the same directory then
the cache page invalidation from nfs_force_use_readdirplus causes
the reading client to keep restarting READDIRPLUS from cookie 0
which causes the 'ls -l' to take a very long time to complete,
possibly never completing.

Currently when nfs_force_use_readdirplus is called to switch from
READDIR to READDIRPLUS, it invalidates all the cached pages of the
directory. This cache page invalidation causes the next nfs_readdir
to re-read the directory content from cookie 0.

This patch is to optimise the cache invalidation in
nfs_force_use_readdirplus by only truncating the cached pages from
last page index accessed to the end the file. It also marks the
inode to delay invalidating all the cached page of the directory
until the next initial nfs_readdir of the next 'ls' instance.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com>
[Anna - Fix conflicts with Trond's readdir patches]
[Anna - Remove redundant call to nfs_zap_mapping()]
[Anna - Replace d_inode(file_dentry(desc->file)) with file_inode(desc->file)]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-04 10:50:44 -05:00
Linus Torvalds
7f879e1a94 Merge tag 'ovl-update-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs update from Miklos Szeredi:

 - Try to preserve holes in sparse files when copying up, thus saving
   disk space and improving performance.

 - Fix a performance regression introduced in v4.19 by preserving
   asynchronicity of IO when fowarding to underlying layers. Add VFS
   helpers to submit async iocbs.

 - Fix a regression in lseek(2) introduced in v4.19 that breaks >2G
   seeks on 32bit kernels.

 - Fix a corner case where st_ino/st_dev was not preserved across copy
   up.

 - Miscellaneous fixes and cleanups.

* tag 'ovl-update-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix lseek overflow on 32bit
  ovl: add splice file read write helper
  ovl: implement async IO routines
  vfs: add vfs_iocb_iter_[read|write] helper functions
  ovl: layer is const
  ovl: fix corner case of non-constant st_dev;st_ino
  ovl: fix corner case of conflicting lower layer uuid
  ovl: generalize the lower_fs[] array
  ovl: simplify ovl_same_sb() helper
  ovl: generalize the lower_layers[] array
  ovl: improving copy-up efficiency for big sparse file
  ovl: use ovl_inode_lock in ovl_llseek()
  ovl: use pr_fmt auto generate prefix
  ovl: fix wrong WARN_ON() in ovl_cache_update_ino()
2020-02-04 11:45:21 +00:00
Masahiro Yamada
45586c7078 treewide: remove redundant IS_ERR() before error code check
'PTR_ERR(p) == -E*' is a stronger condition than IS_ERR(p).
Hence, IS_ERR(p) is unneeded.

The semantic patch that generates this commit is as follows:

// <smpl>
@@
expression ptr;
constant error_code;
@@
-IS_ERR(ptr) && (PTR_ERR(ptr) == - error_code)
+PTR_ERR(ptr) == - error_code
// </smpl>

Link: http://lkml.kernel.org/r/20200106045833.1725-1-masahiroy@kernel.org
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Acked-by: Stephen Boyd <sboyd@kernel.org> [drivers/clk/clk.c]
Acked-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> [GPIO]
Acked-by: Wolfram Sang <wsa@the-dreams.de> [drivers/i2c]
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [acpi/scan.c]
Acked-by: Rob Herring <robh@kernel.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04 03:05:27 +00:00
Alexey Dobriyan
97a32539b9 proc: convert everything to "struct proc_ops"
The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
seq_file.h.

Conversion rule is:

	llseek		=> proc_lseek
	unlocked_ioctl	=> proc_ioctl

	xxx		=> proc_xxx

	delete ".owner = THIS_MODULE" line

[akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
[sfr@canb.auug.org.au: fix kernel/sched/psi.c]
  Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04 03:05:26 +00:00
Alexey Dobriyan
d56c0d45f0 proc: decouple proc from VFS with "struct proc_ops"
Currently core /proc code uses "struct file_operations" for custom hooks,
however, VFS doesn't directly call them.  Every time VFS expands
file_operations hook set, /proc code bloats for no reason.

Introduce "struct proc_ops" which contains only those hooks which /proc
allows to call into (open, release, read, write, ioctl, mmap, poll).  It
doesn't contain module pointer as well.

Save ~184 bytes per usage:

	add/remove: 26/26 grow/shrink: 1/4 up/down: 1922/-6674 (-4752)
	Function                                     old     new   delta
	sysvipc_proc_ops                               -      72     +72
				...
	config_gz_proc_ops                             -      72     +72
	proc_get_inode                               289     339     +50
	proc_reg_get_unmapped_area                   110     107      -3
	close_pdeo                                   227     224      -3
	proc_reg_open                                289     284      -5
	proc_create_data                              60      53      -7
	rt_cpu_seq_fops                              256       -    -256
				...
	default_affinity_proc_fops                   256       -    -256
	Total: Before=5430095, After=5425343, chg -0.09%

Link: http://lkml.kernel.org/r/20191225172228.GA13378@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04 03:05:26 +00:00
Steven Price
b7a16c7ad7 mm: pagewalk: add 'depth' parameter to pte_hole
The pte_hole() callback is called at multiple levels of the page tables.
Code dumping the kernel page tables needs to know what at what depth the
missing entry is.  Add this is an extra parameter to pte_hole().  When the
depth isn't know (e.g.  processing a vma) then -1 is passed.

The depth that is reported is the actual level where the entry is missing
(ignoring any folding that is in place), i.e.  any levels where
PTRS_PER_P?D is set to 1 are ignored.

Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
natural numbers as levels 2/3/4.

Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Tested-by: Zong Li <zong.li@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04 03:05:25 +00:00
David Hildenbrand
abec749fac fs/proc/page.c: allow inspection of last section and fix end detection
If max_pfn does not fall onto a section boundary, it is possible to
inspect PFNs up to max_pfn, and PFNs above max_pfn, however, max_pfn
itself can't be inspected.  We can have a valid (and online) memmap at and
above max_pfn if max_pfn is not aligned to a section boundary.  The whole
early section has a memmap and is marked online.  Being able to inspect
the state of these PFNs is valuable for debugging, especially because
max_pfn can change on memory hotplug and expose these memmaps.

Also, querying page flags via "./page-types -r -a 0x144001,"
(tools/vm/page-types.c) inside a x86-64 guest with 4160MB under QEMU
results in an (almost) endless loop in user space, because the end is not
detected properly when starting after max_pfn.

Instead, let's allow to inspect all pages in the highest section and
return 0 directly if we try to access pages above that section.

While at it, check the count before adjusting it, to avoid masking user
errors.

Link: http://lkml.kernel.org/r/20191211163201.17179-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04 03:05:23 +00:00
Gang He
2d797e9ff9 ocfs2: fix oops when writing cloned file
Writing a cloned file triggers a kernel oops and the user-space command
process is also killed by the system.  The bug can be reproduced stably
via:

1) create a file under ocfs2 file system directory.

  journalctl -b > aa.txt

2) create a cloned file for this file.

  reflink aa.txt bb.txt

3) write the cloned file with dd command.

  dd if=/dev/zero of=bb.txt bs=512 count=1 conv=notrunc

The dd command is killed by the kernel, then you can see the oops message
via dmesg command.

[  463.875404] BUG: kernel NULL pointer dereference, address: 0000000000000028
[  463.875413] #PF: supervisor read access in kernel mode
[  463.875416] #PF: error_code(0x0000) - not-present page
[  463.875418] PGD 0 P4D 0
[  463.875425] Oops: 0000 [#1] SMP PTI
[  463.875431] CPU: 1 PID: 2291 Comm: dd Tainted: G           OE     5.3.16-2-default
[  463.875433] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  463.875500] RIP: 0010:ocfs2_refcount_cow+0xa4/0x5d0 [ocfs2]
[  463.875505] Code: 06 89 6c 24 38 89 eb f6 44 24 3c 02 74 be 49 8b 47 28
[  463.875508] RSP: 0018:ffffa2cb409dfce8 EFLAGS: 00010202
[  463.875512] RAX: ffff8b1ebdca8000 RBX: 0000000000000001 RCX: ffff8b1eb73a9df0
[  463.875515] RDX: 0000000000056a01 RSI: 0000000000000000 RDI: 0000000000000000
[  463.875517] RBP: 0000000000000001 R08: ffff8b1eb73a9de0 R09: 0000000000000000
[  463.875520] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[  463.875522] R13: ffff8b1eb922f048 R14: 0000000000000000 R15: ffff8b1eb922f048
[  463.875526] FS:  00007f8f44d15540(0000) GS:ffff8b1ebeb00000(0000) knlGS:0000000000000000
[  463.875529] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  463.875532] CR2: 0000000000000028 CR3: 000000003c17a000 CR4: 00000000000006e0
[  463.875546] Call Trace:
[  463.875596]  ? ocfs2_inode_lock_full_nested+0x18b/0x960 [ocfs2]
[  463.875648]  ocfs2_file_write_iter+0xaf8/0xc70 [ocfs2]
[  463.875672]  new_sync_write+0x12d/0x1d0
[  463.875688]  vfs_write+0xad/0x1a0
[  463.875697]  ksys_write+0xa1/0xe0
[  463.875710]  do_syscall_64+0x60/0x1f0
[  463.875743]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  463.875758] RIP: 0033:0x7f8f4482ed44
[  463.875762] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 80 00 00 00
[  463.875765] RSP: 002b:00007fff300a79d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  463.875769] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8f4482ed44
[  463.875771] RDX: 0000000000000200 RSI: 000055f771b5c000 RDI: 0000000000000001
[  463.875774] RBP: 0000000000000200 R08: 00007f8f44af9c78 R09: 0000000000000003
[  463.875776] R10: 000000000000089f R11: 0000000000000246 R12: 000055f771b5c000
[  463.875779] R13: 0000000000000200 R14: 0000000000000000 R15: 000055f771b5c000

This regression problem was introduced by commit e74540b285 ("ocfs2:
protect extent tree in ocfs2_prepare_inode_for_write()").

Link: http://lkml.kernel.org/r/20200121050153.13290-1-ghe@suse.com
Fixes: e74540b285 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()").
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04 03:05:23 +00:00
Al Viro
12efec5602 saner copy_mount_options()
don't bother with the byte-by-byte loops, etc.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-03 21:23:33 -05:00
Jens Axboe
01d7a35687 aio: prevent potential eventfd recursion on poll
If we have nested or circular eventfd wakeups, then we can deadlock if
we run them inline from our poll waitqueue wakeup handler. It's also
possible to have very long chains of notifications, to the extent where
we could risk blowing the stack.

Check the eventfd recursion count before calling eventfd_signal(). If
it's non-zero, then punt the signaling to async context. This is always
safe, as it takes us out-of-line in terms of stack and locking context.

Cc: stable@vger.kernel.org # 4.19+
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:47 -07:00
Pavel Begunkov
3e577dcd73 io_uring: put the flag changing code in the same spot
Both iocb_flags() and kiocb_set_rw_flags() are inline and modify
kiocb->ki_flags. Place them close, so they can be potentially better
optimised.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:47 -07:00
Pavel Begunkov
6c8a313469 io_uring: iterate req cache backwards
Grab requests from cache-array from the end, so can get by only
free_reqs.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:47 -07:00
Jens Axboe
3e69426da2 io_uring: punt even fadvise() WILLNEED to async context
Andres correctly points out that read-ahead can block, if it needs to
read in meta data (or even just through the page cache page allocations).
Play it safe for now and just ensure WILLNEED is also punted to async
context.

While in there, allow the file settings hints from non-blocking
context. They don't need to start/do IO, and we can safely do them
inline.

Fixes: 4840e418c2 ("io_uring: add IORING_OP_FADVISE")
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:47 -07:00
Jens Axboe
1a417f4e61 io_uring: fix sporadic double CQE entry for close
We punt close to async for the final fput(), but we log the completion
even before that even in that case. We rely on the request not having
a files table assigned to detect what the final async close should do.
However, if we punt the async queue to __io_queue_sqe(), we'll get
->files assigned and this makes io_close_finish() think it should both
close the filp again (which does no harm) AND log a new CQE event for
this request. This causes duplicate CQEs.

Queue the request up for async manually so we don't grab files
needlessly and trigger this condition.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:47 -07:00
Pavel Begunkov
9250f9ee19 io_uring: remove extra ->file check
It won't ever get into io_prep_rw() when req->file haven't been set in
io_req_set_file(), hence remove the check.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:47 -07:00
Jens Axboe
5d204bcfa0 io_uring: don't map read/write iovec potentially twice
If we have a read/write that is deferred, we already setup the async IO
context for that request, and mapped it. When we later try and execute
the request and we get -EAGAIN, we don't want to attempt to re-map it.
If we do, we end up with garbage in the iovec, which typically leads
to an -EFAULT or -EINVAL completion.

Cc: stable@vger.kernel.org # 5.5
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:47 -07:00
Jens Axboe
0b7b21e42b io_uring: use the proper helpers for io_send/recv
Don't use the recvmsg/sendmsg helpers, use the same helpers that the
recv(2) and send(2) system calls use.

Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:47 -07:00
Jens Axboe
f0b493e6b9 io_uring: prevent potential eventfd recursion on poll
If we have nested or circular eventfd wakeups, then we can deadlock if
we run them inline from our poll waitqueue wakeup handler. It's also
possible to have very long chains of notifications, to the extent where
we could risk blowing the stack.

Check the eventfd recursion count before calling eventfd_signal(). If
it's non-zero, then punt the signaling to async context. This is always
safe, as it takes us out-of-line in terms of stack and locking context.

Cc: stable@vger.kernel.org # 5.1+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:47 -07:00
Jens Axboe
b5e683d5ca eventfd: track eventfd_signal() recursion depth
eventfd use cases from aio and io_uring can deadlock due to circular
or resursive calling, when eventfd_signal() tries to grab the waitqueue
lock. On top of that, it's also possible to construct notification
chains that are deep enough that we could blow the stack.

Add a percpu counter that tracks the percpu recursion depth, warn if we
exceed it. The counter is also exposed so that users of eventfd_signal()
can do the right thing if it's non-zero in the context where it is
called.

Cc: stable@vger.kernel.org # 4.19+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03 17:27:38 -07:00
Amir Goldstein
0f060936e4 SMB3: Backup intent flag missing from some more ops
When "backup intent" is requested on the mount (e.g. backupuid or
backupgid mount options), the corresponding flag was missing from
some of the operations.

Change all operations to use the macro cifs_create_options() to
set the backup intent flag if needed.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-02-03 16:12:47 -06:00
Trond Myklebust
93a6ab7b69 NFS: Switch readdir to using iterate_shared()
Now that the page cache locking is repaired, we should be able to
switch to using iterate_shared() for improved concurrency when
doing readdir().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03 16:37:51 -05:00
Trond Myklebust
3803d6721b NFS: Use kmemdup_nul() in nfs_readdir_make_qstr()
The directory strings stored in the readdir cache may be used with
printk(), so it is better to ensure they are nul-terminated.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03 16:37:45 -05:00
Trond Myklebust
114de38225 NFS: Directory page cache pages need to be locked when read
When a NFS directory page cache page is removed from the page cache,
its contents are freed through a call to nfs_readdir_clear_array().
To prevent the removal of the page cache entry until after we've
finished reading it, we must take the page lock.

Fixes: 11de3b11e0 ("NFS: Fix a memory leak in nfs_readdir")
Cc: stable@vger.kernel.org # v2.6.37+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03 16:37:17 -05:00
Trond Myklebust
4b310319c6 NFS: Fix memory leaks and corruption in readdir
nfs_readdir_xdr_to_array() must not exit without having initialised
the array, so that the page cache deletion routines can safely
call nfs_readdir_clear_array().
Furthermore, we should ensure that if we exit nfs_readdir_filler()
with an error, we free up any page contents to prevent a leak
if we try to fill the page again.

Fixes: 11de3b11e0 ("NFS: Fix a memory leak in nfs_readdir")
Cc: stable@vger.kernel.org # v2.6.37+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03 16:35:17 -05:00
Trond Myklebust
a8bd9ddf39 NFS: Replace various occurrences of kstrndup() with kmemdup_nul()
When we already know the string length, it is more efficient to
use kmemdup_nul().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
[Anna - Changes to super.c were already made during fscontext conversion]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03 16:35:07 -05:00
Trond Myklebust
10717f4563 NFSv4: Limit the total number of cached delegations
Delegations can be expensive to return, and can cause scalability issues
for the server. Let's therefore try to limit the number of inactive
delegations we hold.
Once the number of delegations is above a certain threshold, start
to return them on close.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03 16:35:07 -05:00
Trond Myklebust
d2269ea14e NFSv4: Add accounting for the number of active delegations held
In order to better manage our delegation caching, add a counter
to track the number of active delegations.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03 16:35:07 -05:00
Trond Myklebust
b7b7dac684 NFSv4: Try to return the delegation immediately when marked for return on close
Add a routine to return the delegation immediately upon close of the
file if it was marked for return-on-close.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03 16:35:07 -05:00
Trond Myklebust
0d10416797 NFS: Clear NFS_DELEGATION_RETURN_IF_CLOSED when the delegation is returned
If a delegation is marked as needing to be returned when the file is
closed, then don't clear that marking until we're ready to return
it.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03 16:35:07 -05:00
Trond Myklebust
f885ea640d NFSv4: nfs_inode_evict_delegation() should set NFS_DELEGATION_RETURNING
In particular, the pnfs return-on-close code will check for that flag,
so ensure we set it appropriately.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03 16:35:07 -05:00
Trond Myklebust
65f5160376 NFS: nfs_find_open_context() should use cred_fscmp()
We want to find open contexts that match our filesystem access
properties. They don't have to exactly match the cred.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03 16:35:07 -05:00
Trond Myklebust
9a206de2ea NFS: nfs_access_get_cached_rcu() should use cred_fscmp()
We do not need to have the rcu lookup method fail in the case where
the fsuid/fsgid and supplemental groups match.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03 16:35:07 -05:00
Trond Myklebust
3871224787 NFSv4: pnfs_roc() must use cred_fscmp() to compare creds
When comparing two 'struct cred' for equality w.r.t. behaviour under
filesystem access, we need to use cred_fscmp().

Fixes: a52458b48a ("NFS/NFSD/SUNRPC: replace generic creds with 'struct cred'.")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03 16:35:07 -05:00
Linus Torvalds
ad80142836 Merge tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull more btrfs updates from David Sterba:
 "Fixes that arrived after the merge window freeze, mostly stable
  material.

   - fix race in tree-mod-log element tracking

   - fix bio flushing inside extent writepages

   - fix assertion when in-memory tracking of discarded extents finds an
     empty tree (eg. after adding a new device)

   - update logic of temporary read-only block groups to take into
     account overcommit

   - fix some fixup worker corner cases:
       - page could not go through proper COW cycle and the dirty status
         is lost due to page migration
       - deadlock if delayed allocation is performed under page lock

   - fix send emitting invalid clones within the same file

   - fix statfs reporting 0 free space when global block reserve size is
     larger than remaining free space but there is still space for new
     chunks"

* tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: do not zero f_bavail if we have available space
  Btrfs: send, fix emission of invalid clone operations within the same file
  btrfs: do not do delalloc reservation under page lock
  btrfs: drop the -EBUSY case in __extent_writepage_io
  Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker
  btrfs: take overcommit into account in inc_block_group_ro
  btrfs: fix force usage in inc_block_group_ro
  btrfs: Correctly handle empty trees in find_first_clear_extent_bit
  btrfs: flush write bio if we loop in extent_write_cache_pages
  Btrfs: fix race between adding and putting tree mod seq elements and nodes
2020-02-03 17:03:42 +00:00
Masahiro Yamada
5f2fb52fac kbuild: rename hostprogs-y/always to hostprogs/always-y
In old days, the "host-progs" syntax was used for specifying host
programs. It was renamed to the current "hostprogs-y" in 2004.

It is typically useful in scripts/Makefile because it allows Kbuild to
selectively compile host programs based on the kernel configuration.

This commit renames like follows:

  always       ->  always-y
  hostprogs-y  ->  hostprogs

So, scripts/Makefile will look like this:

  always-$(CONFIG_BUILD_BIN2C) += ...
  always-$(CONFIG_KALLSYMS)    += ...
      ...
  hostprogs := $(always-y) $(always-m)

I think this makes more sense because a host program is always a host
program, irrespective of the kernel configuration. We want to specify
which ones to compile by CONFIG options, so always-y will be handier.

The "always", "hostprogs-y", "hostprogs-m" will be kept for backward
compatibility for a while.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-02-04 01:53:07 +09:00
Alex Shi
c0399cf668 NFS: remove unused macros
MNT_fhs_status_sz/MNT_fhandle3_sz are never used after they were
introduced. So better to remove them.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: linux-nfs@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03 10:43:06 -05:00
Carlos Maiolino
324282c025 fibmap: Reject negative block numbers
FIBMAP receives an integer from userspace which is then implicitly converted
into sector_t to be passed to bmap(). No check is made to ensure userspace
didn't send a negative block number, which can end up in an underflow, and
returning to userspace a corrupted block address.

As a side-effect, the underflow caused by a negative block here, will
trigger the WARN() in iomap_bmap_actor(), which is how this issue was
first discovered.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-03 08:05:58 -05:00
Carlos Maiolino
0d89fdae2a fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
Now we have the possibility of proper error return in bmap, use bmap()
function in ioctl_fibmap() instead of calling ->bmap method directly.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-03 08:05:57 -05:00
Carlos Maiolino
569d2056de ecryptfs: drop direct calls to ->bmap
Replace direct ->bmap calls by bmap() method.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-03 08:05:57 -05:00
Carlos Maiolino
10d83e11a5 cachefiles: drop direct usage of ->bmap method.
Replace the direct usage of ->bmap method by a bmap() call.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-03 08:05:56 -05:00