Commit Graph

57729 Commits

Author SHA1 Message Date
Filipe Manana
06fe39ab15 Btrfs: do not overwrite scrub error with fault error in scrub ioctl
If scrub returned an error and then the copy_to_user() call did not
succeed, we would overwrite the error returned by scrub with -EFAULT.
Fix that by calling copy_to_user() only if btrfs_scrub_dev() returned
success.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:15 +01:00
Nikolay Borisov
bc9a8bf79c btrfs: Make first argument of btrfs_run_delalloc_range directly an inode
Since this function is no longer a callback there is no need to have
its first argument obfuscated with a void *. Change it directly to a
pointer to an inode. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:15 +01:00
Julia Lawall
9cf10cc195 Btrfs: drop useless LIST_HEAD in merge_reloc_root
Drop LIST_HEAD where the variable it declares is never used.

The uses were removed in 3fd0a5585e ("Btrfs: Metadata ENOSPC
handling for balance"), but not the declaration.

The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
identifier x;
@@
- LIST_HEAD(x);
  ... when != x
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:15 +01:00
David S. Miller
70f3522614 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three conflicts, one of which, for marvell10g.c is non-trivial and
requires some follow-up from Heiner or someone else.

The issue is that Heiner converted the marvell10g driver over to
use the generic c45 code as much as possible.

However, in 'net' a bug fix appeared which makes sure that a new
local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0
is cleared.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24 12:06:19 -08:00
Christoph Hellwig
81214bab58 iomap: wire up the iopoll method
Store the request queue the last bio was submitted to in the iocb
private data in addition to the cookie so that we find the right block
device.  Also refactor the common direct I/O bio submission code into a
nice little helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Modified to use bio_set_polled().

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-24 08:20:17 -07:00
Jens Axboe
0bbb280d7b block: add bio_set_polled() helper
For the upcoming async polled IO, we can't sleep allocating requests.
If we do, then we introduce a deadlock where the submitter already
has async polled IO in-flight, but can't wait for them to complete
since polled requests must be active found and reaped.

Utilize the helper in the blockdev DIRECT_IO code.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-24 08:20:17 -07:00
Christoph Hellwig
eae83ce10b block: wire up block device iopoll method
Just call blk_poll on the iocb cookie, we can derive the block device
from the inode trivially.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-24 08:20:17 -07:00
Hou Tao
2fe8b2d557 ubifs: Reject unsupported ioctl flags explicitly
Reject unsupported ioctl flags explicitly, so the following command
on a regular ubifs file will fail:
	chattr +d ubifs_file

And xfstests generic/424 will pass.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-02-24 11:40:46 +01:00
Trond Myklebust
5085607d20 NFS/pnfs: Bulk destroy of layouts needs to be safe w.r.t. umount
If a bulk layout recall or a metadata server reboot coincides with a
umount, then holding a reference to an inode is unsafe unless we
also hold a reference to the super block.

Fixes: fd9a8d7160 ("NFSv4.1: Fix bulk recall and destroy of layouts")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-23 13:59:29 -05:00
Geert Uytterhoeven
12e1e7af1a vfs: Make __vfs_write() static
__vfs_write() was unexported, and removed from <linux/fs.h>, but
forgotten to be made static.

Fixes: eb031849d5 ("fs: unexport __vfs_read/__vfs_write")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-22 00:30:05 -05:00
Bart Van Assche
d3d6a18d7d aio: Fix locking in aio_poll()
wake_up_locked() may but does not have to be called with interrupts
disabled. Since the fuse filesystem calls wake_up_locked() without
disabling interrupts aio_poll_wake() may be called with interrupts
enabled. Since the kioctx.ctx_lock may be acquired from IRQ context,
all code that acquires that lock from thread context must disable
interrupts. Hence change the spin_trylock() call in aio_poll_wake()
into a spin_trylock_irqsave() call. This patch fixes the following
lockdep complaint:

=====================================================
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
5.0.0-rc4-next-20190131 #23 Not tainted
-----------------------------------------------------
syz-executor2/13779 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
0000000098ac1230 (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:329 [inline]
0000000098ac1230 (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1772 [inline]
0000000098ac1230 (&fiq->waitq){+.+.}, at: __io_submit_one fs/aio.c:1875 [inline]
0000000098ac1230 (&fiq->waitq){+.+.}, at: io_submit_one+0xedf/0x1cf0 fs/aio.c:1908

and this task is already holding:
000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline]
000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline]
000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline]
000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908
which would create a new lock dependency:
 (&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.}

but this new dependency connects a SOFTIRQ-irq-safe lock:
 (&(&ctx->ctx_lock)->rlock){..-.}

... which became SOFTIRQ-irq-safe at:
  lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
  __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
  _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
  spin_lock_irq include/linux/spinlock.h:354 [inline]
  free_ioctx_users+0x2d/0x4a0 fs/aio.c:610
  percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline]
  percpu_ref_put include/linux/percpu-refcount.h:301 [inline]
  percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline]
  percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158
  __rcu_reclaim kernel/rcu/rcu.h:240 [inline]
  rcu_do_batch kernel/rcu/tree.c:2486 [inline]
  invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline]
  rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780
  __do_softirq+0x266/0x95a kernel/softirq.c:292
  run_ksoftirqd kernel/softirq.c:654 [inline]
  run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
  smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
  kthread+0x357/0x430 kernel/kthread.c:247
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352

to a SOFTIRQ-irq-unsafe lock:
 (&fiq->waitq){+.+.}

... which became SOFTIRQ-irq-unsafe at:
...
  lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
  __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
  _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
  spin_lock include/linux/spinlock.h:329 [inline]
  flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
  fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
  fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
  fuse_send_init fs/fuse/inode.c:989 [inline]
  fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
  mount_nodev+0x68/0x110 fs/super.c:1392
  fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
  legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
  vfs_get_tree+0x123/0x450 fs/super.c:1481
  do_new_mount fs/namespace.c:2610 [inline]
  do_mount+0x1436/0x2c40 fs/namespace.c:2932
  ksys_mount+0xdb/0x150 fs/namespace.c:3148
  __do_sys_mount fs/namespace.c:3162 [inline]
  __se_sys_mount fs/namespace.c:3159 [inline]
  __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
  do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&fiq->waitq);
                               local_irq_disable();
                               lock(&(&ctx->ctx_lock)->rlock);
                               lock(&fiq->waitq);
  <Interrupt>
    lock(&(&ctx->ctx_lock)->rlock);

 *** DEADLOCK ***

1 lock held by syz-executor2/13779:
 #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:354 [inline]
 #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1771 [inline]
 #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one fs/aio.c:1875 [inline]
 #0: 000000003c46111c (&(&ctx->ctx_lock)->rlock){..-.}, at: io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908

the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
-> (&(&ctx->ctx_lock)->rlock){..-.} {
   IN-SOFTIRQ-W at:
                    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
                    __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
                    _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
                    spin_lock_irq include/linux/spinlock.h:354 [inline]
                    free_ioctx_users+0x2d/0x4a0 fs/aio.c:610
                    percpu_ref_put_many include/linux/percpu-refcount.h:285 [inline]
                    percpu_ref_put include/linux/percpu-refcount.h:301 [inline]
                    percpu_ref_call_confirm_rcu lib/percpu-refcount.c:123 [inline]
                    percpu_ref_switch_to_atomic_rcu+0x3e7/0x520 lib/percpu-refcount.c:158
                    __rcu_reclaim kernel/rcu/rcu.h:240 [inline]
                    rcu_do_batch kernel/rcu/tree.c:2486 [inline]
                    invoke_rcu_callbacks kernel/rcu/tree.c:2799 [inline]
                    rcu_core+0x928/0x1390 kernel/rcu/tree.c:2780
                    __do_softirq+0x266/0x95a kernel/softirq.c:292
                    run_ksoftirqd kernel/softirq.c:654 [inline]
                    run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
                    smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
                    kthread+0x357/0x430 kernel/kthread.c:247
                    ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
   INITIAL USE at:
                   lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
                   __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
                   _raw_spin_lock_irq+0x60/0x80 kernel/locking/spinlock.c:160
                   spin_lock_irq include/linux/spinlock.h:354 [inline]
                   __do_sys_io_cancel fs/aio.c:2052 [inline]
                   __se_sys_io_cancel fs/aio.c:2035 [inline]
                   __x64_sys_io_cancel+0xd5/0x5a0 fs/aio.c:2035
                   do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
                   entry_SYSCALL_64_after_hwframe+0x49/0xbe
 }
 ... key      at: [<ffffffff8a574140>] __key.52370+0x0/0x40
 ... acquired at:
   lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
   __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
   _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
   spin_lock include/linux/spinlock.h:329 [inline]
   aio_poll fs/aio.c:1772 [inline]
   __io_submit_one fs/aio.c:1875 [inline]
   io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
   __do_sys_io_submit fs/aio.c:1953 [inline]
   __se_sys_io_submit fs/aio.c:1923 [inline]
   __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
   do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

the dependencies between the lock to be acquired
 and SOFTIRQ-irq-unsafe lock:
-> (&fiq->waitq){+.+.} {
   HARDIRQ-ON-W at:
                    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
                    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
                    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
                    spin_lock include/linux/spinlock.h:329 [inline]
                    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
                    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
                    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
                    fuse_send_init fs/fuse/inode.c:989 [inline]
                    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
                    mount_nodev+0x68/0x110 fs/super.c:1392
                    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
                    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
                    vfs_get_tree+0x123/0x450 fs/super.c:1481
                    do_new_mount fs/namespace.c:2610 [inline]
                    do_mount+0x1436/0x2c40 fs/namespace.c:2932
                    ksys_mount+0xdb/0x150 fs/namespace.c:3148
                    __do_sys_mount fs/namespace.c:3162 [inline]
                    __se_sys_mount fs/namespace.c:3159 [inline]
                    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
                    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
   SOFTIRQ-ON-W at:
                    lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
                    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
                    _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
                    spin_lock include/linux/spinlock.h:329 [inline]
                    flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
                    fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
                    fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
                    fuse_send_init fs/fuse/inode.c:989 [inline]
                    fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
                    mount_nodev+0x68/0x110 fs/super.c:1392
                    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
                    legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
                    vfs_get_tree+0x123/0x450 fs/super.c:1481
                    do_new_mount fs/namespace.c:2610 [inline]
                    do_mount+0x1436/0x2c40 fs/namespace.c:2932
                    ksys_mount+0xdb/0x150 fs/namespace.c:3148
                    __do_sys_mount fs/namespace.c:3162 [inline]
                    __se_sys_mount fs/namespace.c:3159 [inline]
                    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
                    do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
   INITIAL USE at:
                   lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
                   __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
                   _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
                   spin_lock include/linux/spinlock.h:329 [inline]
                   flush_bg_queue+0x1f3/0x3c0 fs/fuse/dev.c:415
                   fuse_request_queue_background+0x2d1/0x580 fs/fuse/dev.c:676
                   fuse_request_send_background+0x58/0x120 fs/fuse/dev.c:687
                   fuse_send_init fs/fuse/inode.c:989 [inline]
                   fuse_fill_super+0x13bb/0x1730 fs/fuse/inode.c:1214
                   mount_nodev+0x68/0x110 fs/super.c:1392
                   fuse_mount+0x2d/0x40 fs/fuse/inode.c:1239
                   legacy_get_tree+0xf2/0x200 fs/fs_context.c:590
                   vfs_get_tree+0x123/0x450 fs/super.c:1481
                   do_new_mount fs/namespace.c:2610 [inline]
                   do_mount+0x1436/0x2c40 fs/namespace.c:2932
                   ksys_mount+0xdb/0x150 fs/namespace.c:3148
                   __do_sys_mount fs/namespace.c:3162 [inline]
                   __se_sys_mount fs/namespace.c:3159 [inline]
                   __x64_sys_mount+0xbe/0x150 fs/namespace.c:3159
                   do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
                   entry_SYSCALL_64_after_hwframe+0x49/0xbe
 }
 ... key      at: [<ffffffff8a60dec0>] __key.43450+0x0/0x40
 ... acquired at:
   lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
   __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
   _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
   spin_lock include/linux/spinlock.h:329 [inline]
   aio_poll fs/aio.c:1772 [inline]
   __io_submit_one fs/aio.c:1875 [inline]
   io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
   __do_sys_io_submit fs/aio.c:1953 [inline]
   __se_sys_io_submit fs/aio.c:1923 [inline]
   __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
   do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

stack backtrace:
CPU: 0 PID: 13779 Comm: syz-executor2 Not tainted 5.0.0-rc4-next-20190131 #23
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_bad_irq_dependency kernel/locking/lockdep.c:1573 [inline]
 check_usage.cold+0x60f/0x940 kernel/locking/lockdep.c:1605
 check_irq_usage kernel/locking/lockdep.c:1650 [inline]
 check_prev_add_irq kernel/locking/lockdep_states.h:8 [inline]
 check_prev_add kernel/locking/lockdep.c:1860 [inline]
 check_prevs_add kernel/locking/lockdep.c:1968 [inline]
 validate_chain kernel/locking/lockdep.c:2339 [inline]
 __lock_acquire+0x1f12/0x4790 kernel/locking/lockdep.c:3320
 lock_acquire+0x16f/0x3f0 kernel/locking/lockdep.c:3826
 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
 _raw_spin_lock+0x2f/0x40 kernel/locking/spinlock.c:144
 spin_lock include/linux/spinlock.h:329 [inline]
 aio_poll fs/aio.c:1772 [inline]
 __io_submit_one fs/aio.c:1875 [inline]
 io_submit_one+0xedf/0x1cf0 fs/aio.c:1908
 __do_sys_io_submit fs/aio.c:1953 [inline]
 __se_sys_io_submit fs/aio.c:1923 [inline]
 __x64_sys_io_submit+0x1bd/0x580 fs/aio.c:1923
 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>
Fixes: e8693bcfa0 ("aio: allow direct aio poll comletions for keyed wakeups") # v4.19
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
[ bvanassche: added a comment ]
Reluctantly-Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-21 22:16:47 -05:00
Trond Myklebust
6f9449be53 NFS: Fix a soft lockup in the delegation recovery code
Fix a soft lockup when NFS client delegation recovery is attempted
but the inode is in the process of being freed. When the
igrab(inode) call fails, and we have to restart the recovery process,
we need to ensure that we won't attempt to recover the same delegation
again.

Fixes: 45870d6909 ("NFSv4.1: Test delegation stateids when server...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-21 14:51:25 -05:00
Jan Kara
52b9666efd udf: Drop pointless check from udf_sync_fs()
The check if (bh) in udf_sync_fs() is pointless as we cannot have
sbi->s_lvid_dirty and !sbi->s_lvid_bh (as already asserted by
udf_updated_lvid()). So just drop the pointless check.

Reviewed-by: Steven J. Magnani <steve@digidescorp.com>
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-21 19:25:36 +01:00
Trond Myklebust
3453d5708b NFSv4.1: Avoid false retries when RPC calls are interrupted
A 'false retry' in NFSv4.1 occurs when the client attempts to transmit a
new RPC call using a slot+sequence number combination that references an
already cached one. Currently, the Linux NFS client will do this if a
user process interrupts an RPC call that is in progress.
The problem with doing so is that we defeat the main mechanism used by
the server to differentiate between a new call and a replayed one. Even
if the server is able to perfectly cache the arguments of the old call,
it cannot know if the client intended to replay or send a new call.

The obvious fix is to bump the sequence number pre-emptively if an
RPC call is interrupted, but in order to deal with the corner cases
where the interrupted call is not actually received and processed by
the server, we need to interpret the error NFS4ERR_SEQ_MISORDERED
as a sign that we need to either wait or locate a correct sequence
number that lies between the value we sent, and the last value that
was acked by a SEQUENCE call on that slot.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Jason Tibbitts <tibbs@math.uh.edu>
2019-02-21 13:22:43 -05:00
Linus Torvalds
8a61716ff2 Two bug fixes for old issues, both marked for stable.
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlxu23wTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi04sB/9/Lm694io6coIW/8mTVjKs9iF7+AkA
 V11Pu3VHnp8okcNs3iSv6rtbPgLrONP5fg7OC9xAzzsEjA8kitnV7kC0RunzuUiw
 BnZcE0v5Eh4NB34DW4w4I0JDkEH75zlsw/ar8HDvgxTw9hCgV56n1oic+MkqAuiL
 5TwBjuLL5Nl4tA9djcMtHhBy3u+jQXAEDJ9tiKFBhzQOPo73N2IyZyY+Kz3qqygV
 XLWvpmW6PA9jJt8pUGRCqpahaYAY7+cyzOUJbqdRUpeESVl/2Epr7zJZo9LHQtsE
 GympS5weYMg6xLoZHQXvLbb77E/SlXkeVyNzBw7LnV6us2PS4tbgSt2J
 =alOG
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.0-rc8' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Two bug fixes for old issues, both marked for stable"

* tag 'ceph-for-5.0-rc8' of git://github.com/ceph/ceph-client:
  ceph: avoid repeatedly adding inode to mdsc->snap_flush_list
  libceph: handle an empty authorize reply
2019-02-21 09:43:37 -08:00
Michal Hocko
b2b469939e proc, oom: do not report alien mms when setting oom_score_adj
Tetsuo has reported that creating a thousands of processes sharing MM
without SIGHAND (aka alien threads) and setting
/proc/<pid>/oom_score_adj will swamp the kernel log and takes ages [1]
to finish.  This is especially worrisome that all that printing is done
under RCU lock and this can potentially trigger RCU stall or softlockup
detector.

The primary reason for the printk was to catch potential users who might
depend on the behavior prior to 44a70adec9 ("mm, oom_adj: make sure
processes sharing mm have same view of oom_score_adj") but after more
than 2 years without a single report I guess it is safe to simply remove
the printk altogether.

The next step should be moving oom_score_adj over to the mm struct and
remove all the tasks crawling as suggested by [2]

[1] http://lkml.kernel.org/r/97fce864-6f75-bca5-14bc-12c9f890e740@i-love.sakura.ne.jp
[2] http://lkml.kernel.org/r/20190117155159.GA4087@dhcp22.suse.cz

Link: http://lkml.kernel.org/r/20190212102129.26288-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Yong-Taek Lee <ytk.lee@samsung.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-21 09:01:00 -08:00
Konstantin Khlebnikov
bc1d69d615 ext4: add sysfs attr /sys/fs/ext4/<disk>/journal_task
This is useful for moving journal thread into cgroup or
for tracing it with ftrace/perf/blktrace.

For now the only way is `pgrep jbd2/$DISK` but this is not reliable:
name may be longer than "comm" limit and any task could mock it.

Attribute shows pid in current pid-namespace or 0 if task is unreachable.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-02-21 11:49:27 -05:00
Geert Uytterhoeven
231fe82b56 ext4: Change debugging support help prefix from EXT4 to Ext4
All other configuration options for the ext* family of file systems use
"Ext%u" instead of "EXT%u".

Fixes: 6ba495e925 ("ext4: Add configurable run-time mballoc debugging")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-02-21 11:37:28 -05:00
zhangyi (F)
ddccb6dbe7 ext4: fix compile error when using BUFFER_TRACE
Fix compile error below when using BUFFER_TRACE.

fs/ext4/inode.c: In function ‘ext4_expand_extra_isize’:
fs/ext4/inode.c:5979:19: error: request for member ‘bh’ in something not a structure or union
  BUFFER_TRACE(iloc.bh, "get_write_access");

Fixes: c03b45b853 ("ext4, project: expand inode extra size if possible")
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-02-21 11:29:10 -05:00
zhangyi (F)
01215d3edb jbd2: fix compile warning when using JBUFFER_TRACE
The jh pointer may be used uninitialized in the two cases below and the
compiler complain about it when enabling JBUFFER_TRACE macro, fix them.

In file included from fs/jbd2/transaction.c:19:0:
fs/jbd2/transaction.c: In function ‘jbd2_journal_get_undo_access’:
./include/linux/jbd2.h:1637:38: warning: ‘jh’ is used uninitialized in this function [-Wuninitialized]
 #define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0)
                                      ^
fs/jbd2/transaction.c:1219:23: note: ‘jh’ was declared here
  struct journal_head *jh;
                       ^
In file included from fs/jbd2/transaction.c:19:0:
fs/jbd2/transaction.c: In function ‘jbd2_journal_dirty_metadata’:
./include/linux/jbd2.h:1637:38: warning: ‘jh’ may be used uninitialized in this function [-Wmaybe-uninitialized]
 #define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0)
                                      ^
fs/jbd2/transaction.c:1332:23: note: ‘jh’ was declared here
  struct journal_head *jh;
                       ^

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
2019-02-21 11:24:09 -05:00
Dan Carpenter
7159a986b4 ext4: fix some error pointer dereferences
We can't pass error pointers to brelse().

Fixes: fb265c9cb4 ("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-02-21 11:17:34 -05:00
Colin Ian King
081a8ae2a5 xfs: fix uninitialized error variable
A previous commit removed the initialization of variable 'error' to zero,
and can cause a bogus error return.  This occurs when error contains a
non-zero garbage value and the call to xchk_should_terminate detects a
pending fatal signal and checks for a zero error before setting it
to -EAGAIN. Fix the issue by initializing error to zero.

Fixes: b9454fe056 ("xfs: clean up the inode cluster checking in the inobt scrub")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Christoph Hellwig
66ae56a53f xfs: introduce an always_cow mode
Add a mode where XFS never overwrites existing blocks in place.  This
is to aid debugging our COW code, and also put infatructure in place
for things like possible future support for zoned block devices, which
can't support overwrites.

This mode is enabled globally by doing a:

    echo 1 > /sys/fs/xfs/debug/always_cow

Note that the parameter is global to allow running all tests in xfstests
easily in this mode, which would not easily be possible with a per-fs
sysfs file.

In always_cow mode persistent preallocations are disabled, and fallocate
will fail when called with a 0 mode (with our without
FALLOC_FL_KEEP_SIZE), and not create unwritten extent for zeroed space
when called with FALLOC_FL_ZERO_RANGE or FALLOC_FL_UNSHARE_RANGE.

There are a few interesting xfstests failures when run in always_cow
mode:

 - generic/392 fails because the bytes used in the file used to test
   hole punch recovery are less after the log replay.  This is
   because the blocks written and then punched out are only freed
   with a delay due to the logging mechanism.
 - xfs/170 will fail as the already fragile file streams mechanism
   doesn't seem to interact well with the COW allocator
 - xfs/180 xfs/182 xfs/192 xfs/198 xfs/204 and xfs/208 will claim
   the file system is badly fragmented, but there is not much we
   can do to avoid that when always writing out of place
 - xfs/205 fails because overwriting a file in always_cow mode
   will require new space allocation and the assumption in the
   test thus don't work anymore.
 - xfs/326 fails to modify the file at all in always_cow mode after
   injecting the refcount error, leading to an unexpected md5sum
   after the remount, but that again is expected

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Christoph Hellwig
c4feb0b194 xfs: report IOMAP_F_SHARED from xfs_file_iomap_begin_delay
No user of it in the iomap code at the moment, but we should not
actively report wrong information if we can trivially get it right.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Christoph Hellwig
26b91c728b xfs: make COW fork unwritten extent conversions more robust
If we have racing buffered and direct I/O COW fork extents under
writeback can have been moved to the data fork by the time we call
xfs_reflink_convert_cow from xfs_submit_ioend.  This would be mostly
harmless as the block numbers don't change by this move, except for
the fact that xfs_bmapi_write will crash or trigger asserts when
not finding existing extents, even despite trying to paper over this
with the XFS_BMAPI_CONVERT_ONLY flag.

Instead of special casing non-transaction conversions in the already
way too complicated xfs_bmapi_write just add a new helper for the much
simpler non-transactional COW fork case, which simplify ignores not
found extents.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Christoph Hellwig
db46e604ad xfs: merge COW handling into xfs_file_iomap_begin_delay
Besides simplifying the code a bit this allows to actually implement
the behavior of using COW preallocation for non-COW data mentioned
in the current comments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Christoph Hellwig
12df89f28f xfs: also truncate holes covered by COW blocks
This only matters if we want to write data through the COW fork that is
not actually an overwrite of existing data.  Reasons for that are
speculative COW fork allocations using the cowextsize, or a mode where
we always write through the COW fork.  Currently both can't actually
happen, but I plan to enable them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Christoph Hellwig
78f0cc9d55 xfs: don't use delalloc extents for COW on files with extsize hints
While using delalloc for extsize hints is generally a good idea, the
current code that does so only for COW doesn't help us much and creates
a lot of special cases.  Switch it to use real allocations like we
do for direct I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Christoph Hellwig
60271ab79d xfs: fix SEEK_DATA for speculative COW fork preallocation
We speculatively allocate extents in the COW fork to reduce
fragmentation.  But when we write data into such COW fork blocks that
do now shadow an allocation in the data fork SEEK_DATA will not
correctly report it, as it only looks at the data fork extents.
The only reason why that hasn't been an issue so far is because
we even use these speculative COW fork preallocations over holes in
the data fork at all for buffered writes, and blocks in the COW
fork that are written by direct writes are moved into the data
fork immediately at I/O completion time.

Add a new set of iomap_ops for SEEK_HOLE/SEEK_DATA which looks into
both the COW and data fork, and reports all COW extents as unwritten
to the iomap layer.  While this isn't strictly true for COW fork
extents that were already converted to real extents, the practical
semantics that you can't read data from them until they are moved
into the data fork are very similar, and this will force the iomap
layer into probing the extents for actually present data.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Christoph Hellwig
16be143373 xfs: make xfs_bmbt_to_iomap more useful
Move checking for invalid zero blocks and setting of various iomap flags
into this helper.  Also make it deal with "raw" delalloc extents to
avoid clutter in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-21 07:55:07 -08:00
Mathieu Malaterre
793bc5181b ext4: annotate more implicit fall throughs
There is a plan to build the kernel with -Wimplicit-fallthrough and
these places in the code produced warnings (W=1). Fix them up.

This commit remove the following warnings:

  fs/ext4/indirect.c:1182:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
  fs/ext4/indirect.c:1188:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
  fs/ext4/indirect.c:1432:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
  fs/ext4/indirect.c:1440:6: warning: this statement may fall through [-Wimplicit-fallthrough=]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2019-02-21 10:51:27 -05:00
Mathieu Malaterre
034f891a84 ext4: annotate implicit fall throughs
There is a plan to build the kernel with -Wimplicit-fallthrough and
these places in the code produced warnings (W=1). Fix them up.

This commit remove the following warnings:

  fs/ext4/hash.c:233:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
  fs/ext4/hash.c:246:15: warning: this statement may fall through [-Wimplicit-fallthrough=]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2019-02-21 10:49:53 -05:00
J. Bruce Fields
c54f24e338 nfsd: fix performance-limiting session calculation
We're unintentionally limiting the number of slots per nfsv4.1 session
to 10.  Often more than 10 simultaneous RPCs are needed for the best
performance.

This calculation was meant to prevent any one client from using up more
than a third of the limit we set for total memory use across all clients
and sessions.  Instead, it's limiting the client to a third of the
maximum for a single session.

Fix this.

Reported-by: Chris Tracy <ctracy@engr.scu.edu>
Cc: stable@vger.kernel.org
Fixes: de766e5704 "nfsd: give out fewer session slots as limit approaches"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-02-21 10:47:00 -05:00
Jan Kara
b519057981 fanotify: Make waits for fanotify events only killable
Making waits for response to fanotify permission events interruptible
can result in EINTR returns from open(2) or other syscalls when there's
e.g. AV software that's monitoring the file. Orion reports that e.g.
bash is complaining like:

bash: /etc/bash_completion.d/itweb-settings.bash: Interrupted system call

So for now convert the wait from interruptible to only killable one.
That is mostly invisible to userspace. Sadly this breaks hibernation
with fanotify permission events pending again but we have to put more
thought into how to fix this without regressing userspace visible
behavior.

Reported-by: Orion Poplawski <orion@nwra.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-21 11:47:23 +01:00
ZhangXiaoxu
ded52fbe70 nfs: fix xfstest generic/099 failed on nfsv3
After setxattr, the nfsv3 cached the acl which set by user.

But at the backend, the shared file system (eg. ext4) will check
the acl, if it can merged with mode, it won't add acl to the file.
So, the nfsv3 cached acl is redundant.

Don't 'set_cached_acl' when setxattr.

Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20 17:33:55 -05:00
Kazuo Ito
2cde04e90d pNFS: Avoid read/modify/write when it is not necessary
As the block and SCSI layouts can only read/write fixed-length
blocks, we must perform read-modify-write when data to be written is
not aligned to a block boundary or smaller than the block size.
(612aa983a0 pnfs: add flag to force read-modify-write in ->write_begin)

The current code tries to see if we have to do read-modify-write
on block-oriented pNFS layouts by just checking !PageUptodate(page),
but the same condition also applies for overwriting of any uncached
potions of existing files, making such operations excessively slow
even it is block-aligned.

The change does not affect the optimization for modify-write-read
cases (38c73044f5 NFS: read-modify-write page updating),
because partial update of !PageUptodate() pages can only happen
in layouts that can do arbitrary length read/write and never
in block-based ones.

Testing results:

We ran fio on one of the pNFS clients running 4.20 kernel
(vanilla and patched) in this configuration to read/write/overwrite
files on the storage array, exported as pnfs share by the server.

 pNFS clients ---1G Ethernet--- pNFS server
 (HP DL360 G8)                  (HP DL360 G8)
       |                              |
       |                              |
       +------8G Fiber Channel--------+
                     |
               Storage Array
                 (HP P6350)

Throughput of overwrite (both buffered and O_SYNC) is noticeably
improved.

Ops.     |block size|   Throughput   |
         |  (KiB)   |    (MiB/s)     |
         |          |  4.20 | patched|
---------+----------+----------------+
buffered |         4|  21.3 |  232   |
overwrite|        32|  22.2 |  256   |
         |       512|  22.4 |  260   |
---------+----------+----------------+
O_SYNC   |         4|   3.84|    4.77|
overwrite|        32|  12.2 |   32.0 |
         |       512|  18.5 |  152   |
---------+----------+----------------+

Read and write (buffered and O_SYNC) by the same client remain unchanged
by the patch either negatively or positively, as they should do.

Ops.     |block size|   Throughput   |
         |  (KiB)   |    (MiB/s)     |
         |          |  4.20 | patched|
---------+----------+----------------+
read     |         4| 548   |  550   |
         |        32| 547   |  551   |
         |       512| 548   |  551   |
---------+----------+----------------+
buffered |         4| 237   |  244   |
write    |        32| 261   |  268   |
         |       512| 265   |  272   |
---------+----------+----------------+
O_SYNC   |         4|   0.46|    0.46|
write    |        32|   3.60|    3.57|
         |       512| 105   |  106   |
---------+----------+----------------+

Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp>
Tested-by: Hiroyuki Watanabe <watanabe.hiroyuki@lab.ntt.co.jp>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20 17:33:55 -05:00
Kazuo Ito
97ae91bbf3 pNFS: Fix potential corruption of page being written
nfs_want_read_modify_write() didn't check for !PagePrivate when pNFS
block or SCSI layout was in use, therefore we could lose data forever
if the page being written was filled by a read before completion.

Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20 17:33:55 -05:00
zhangliguang
bf211ca1a8 NFS: Fix typo in comments of nfs_readdir_alloc_pages()
This fixes the typo in comments of nfs_readdir_alloc_pages().
Because nfs_readdir_large_page and nfs_readdir_free_pagearray had been
renamed.

Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20 17:33:55 -05:00
zhangliguang
42f72cf368 NFS: Remove redundant semicolon
This removes redundant semicolon for ending code.

Fixes: c7944ebb9c ("NFSv4: Fix lookup revalidate of regular files")
Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20 17:33:55 -05:00
luanshi
be4c2d4723 NFS: readdirplus optimization by cache mechanism
When listing very large directories via NFS, clients may take a long
time to complete. There are about three factors involved:

First of all, ls and practically every other method of listing a
directory including python os.listdir and find rely on libc readdir().
However readdir() only reads 32K of directory entries at a time, which
means that if you have a lot of files in the same directory, it is going
to take an insanely long time to read all the directory entries.

Secondly, libc readdir() reads 32K of directory entries at a time, in
kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will
be called for one page, which introduces many readdirplus rpc calls.

Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry)
to fill one page (filled by dentry), we found that nearly one third of
data was wasted.

To solve above problems, pagecache mechanism was introduced. One NFS
readdirplus rpc will ask for a large data (more than 32k), the data can
fill more than one page, the cached pages can be used for next readdir
call. This can reduce many readdirplus rpc calls and improve readdirplus
performance.

TESTING:
When listing very large directories(include 300 thousand files) via NFS

time ls -l /nfs_mount | wc -l

without the patch:
300001
real    1m53.524s
user    0m2.314s
sys     0m2.599s

with the patch:
300001
real    0m23.487s
user    0m2.305s
sys     0m2.558s

Improved performance: 79.6%
readdirplus rpc calls decrease: 85%

Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20 17:33:55 -05:00
Eric W. Biederman
40cc394be1 fs/nfs: Fix nfs_parse_devname to not modify it's argument
In the rare and unsupported case of a hostname list nfs_parse_devname
will modify dev_name.  There is no need to modify dev_name as the all
that is being computed is the length of the hostname, so the computed
length can just be shorted.

Fixes: dc04589827 ("NFS: Use common device name parsing logic for NFSv4 and NFSv2/v3")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20 17:33:55 -05:00
Julia Lawall
45bb8d8027 NFS: drop useless LIST_HEAD
Drop LIST_HEAD where the variable it declares has never
been used.

The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
identifier x;
@@
- LIST_HEAD(x);
  ... when != x
// </smpl>

Fixes: 0e20162ed1 ("NFSv4.1 Use MDS auth flavor for data server connection")
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20 17:33:55 -05:00
Trond Myklebust
e9acf2105f NFS: Fix sparse annotations for nfs_set_open_stateid_locked()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20 15:14:21 -05:00
Trond Myklebust
302fad7bd5 NFS: Fix up documentation warnings
Fix up some compiler warnings about function parameters, etc not being
correctly described or formatted.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20 15:14:21 -05:00
Trond Myklebust
2dc23afffb NFS: ENOMEM should also be a fatal error.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20 15:14:21 -05:00
Trond Myklebust
7dc58ca5d8 NFS: EINTR is also a fatal error.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20 15:14:21 -05:00
Trond Myklebust
875bc3fbf2 NFS: Ensure NFS writeback allocations don't recurse back into NFS.
All the allocations that we can hit in the NFS layer and sunrpc layers
themselves are already marked as GFP_NOFS, but we need to ensure that
any calls to generic kernel functionality do the right thing as well.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20 15:14:20 -05:00
Trond Myklebust
df3accb849 NFS: Pass error information to the pgio error cleanup routine
Allow the caller to pass error information when cleaning up a failed
I/O request so that we can conditionally take action to cancel the
request altogether if the error turned out to be fatal.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20 15:14:20 -05:00
Trond Myklebust
078b5fd92c NFS: Clean up list moves of struct nfs_page
In several places we're just moving the struct nfs_page from one list to
another by first removing from the existing list, then adding to the new
one.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20 15:14:20 -05:00
Trond Myklebust
8127d82705 NFS: Don't recoalesce on error in nfs_pageio_complete_mirror()
If the I/O completion failed with a fatal error, then we should just
exit nfs_pageio_complete_mirror() rather than try to recoalesce.

Fixes: a7d42ddb30 ("nfs: add mirroring support to pgio layer")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.0+
2019-02-20 15:14:20 -05:00
Trond Myklebust
4d91969ed4 NFS: Fix an I/O request leakage in nfs_do_recoalesce
Whether we need to exit early, or just reprocess the list, we
must not lost track of the request which failed to get recoalesced.

Fixes: 03d5eb65b5 ("NFS: Fix a memory leak in nfs_do_recoalesce")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.0+
2019-02-20 15:14:20 -05:00
Trond Myklebust
f57dcf4c72 NFS: Fix I/O request leakages
When we fail to add the request to the I/O queue, we currently leave it
to the caller to free the failed request. However since some of the
requests that fail are actually created by nfs_pageio_add_request()
itself, and are not passed back the caller, this leads to a leakage
issue, which can again cause page locks to leak.

This commit addresses the leakage by freeing the created requests on
error, using desc->pg_completion_ops->error_cleanup()

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Fixes: a7d42ddb30 ("nfs: add mirroring support to pgio layer")
Cc: stable@vger.kernel.org # v4.0: c18b96a1b8: nfs: clean up rest of reqs
Cc: stable@vger.kernel.org # v4.0: d600ad1f2b: NFS41: pop some layoutget
Cc: stable@vger.kernel.org # v4.0+
2019-02-20 15:14:20 -05:00
Mike Marshall
6e356d4595 orangefs: remove two un-needed BUG_ONs...
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-02-20 15:12:52 -05:00
Linus Torvalds
1f5a018c5b Merge branch 'fixes-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull keys fixes from James Morris:

 - Handle quotas better, allowing full quota to be reached.

 - Fix the creation of shortcuts in the assoc_array internal
   representation when the index key needs to be an exact multiple of
   the machine word size.

 - Fix a dependency loop between the request_key contruction record and
   the request_key authentication key. The construction record isn't
   really necessary and can be dispensed with.

 - Set the timestamp on a new key rather than leaving it as 0. This
   would ordinarily be fine - provided the system clock is never set to
   a time before 1970

* 'fixes-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  keys: Timestamp new keys
  keys: Fix dependency loop between construction record and auth key
  assoc_array: Fix shortcut creation
  KEYS: allow reaching the keys quotas exactly
2019-02-20 09:09:33 -08:00
David S. Miller
375ca548f7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two easily resolvable overlapping change conflicts, one in
TCP and one in the eBPF verifier.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-20 00:34:07 -08:00
YueHaibing
f612acfae8 exec: Fix mem leak in kernel_read_file
syzkaller report this:
BUG: memory leak
unreferenced object 0xffffc9000488d000 (size 9195520):
  comm "syz-executor.0", pid 2752, jiffies 4294787496 (age 18.757s)
  hex dump (first 32 bytes):
    ff ff ff ff ff ff ff ff a8 00 00 00 01 00 00 00  ................
    02 00 00 00 00 00 00 00 80 a1 7a c1 ff ff ff ff  ..........z.....
  backtrace:
    [<000000000863775c>] __vmalloc_node mm/vmalloc.c:1795 [inline]
    [<000000000863775c>] __vmalloc_node_flags mm/vmalloc.c:1809 [inline]
    [<000000000863775c>] vmalloc+0x8c/0xb0 mm/vmalloc.c:1831
    [<000000003f668111>] kernel_read_file+0x58f/0x7d0 fs/exec.c:924
    [<000000002385813f>] kernel_read_file_from_fd+0x49/0x80 fs/exec.c:993
    [<0000000011953ff1>] __do_sys_finit_module+0x13b/0x2a0 kernel/module.c:3895
    [<000000006f58491f>] do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
    [<00000000ee78baf4>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<00000000241f889b>] 0xffffffffffffffff

It should goto 'out_free' lable to free allocated buf while kernel_read
fails.

Fixes: 39d637af5a ("vfs: forbid write access when reading a file into memory")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-18 21:26:24 -05:00
Kees Cook
b5372fe5dc exec: load_script: Do not exec truncated interpreter path
Commit 8099b047ec ("exec: load_script: don't blindly truncate
shebang string") was trying to protect against a confused exec of a
truncated interpreter path. However, it was overeager and also refused
to truncate arguments as well, which broke userspace, and it was
reverted. This attempts the protection again, but allows arguments to
remain truncated. In an effort to improve readability, helper functions
and comments have been added.

Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Samuel Dionne-Riel <samuel@dionne-riel.com>
Cc: Richard Weinberger <richard.weinberger@gmail.com>
Cc: Graham Christensen <graham@grahamc.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-18 16:49:36 -08:00
Darrick J. Wong
15baadf72c xfs: fix xfs_buf magic number endian checks
Create a separate magic16 check function so that we don't run afoul of
static checkers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-18 09:38:41 -08:00
Yan, Zheng
04242ff3ac ceph: avoid repeatedly adding inode to mdsc->snap_flush_list
Otherwise, mdsc->snap_flush_list may get corrupted.

Cc: stable@vger.kernel.org
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-02-18 18:08:29 +01:00
yangerkun
93bc420ed4 ext2: support statx syscall
Since statx, every filesystem should fill the attributes/attributes_mask
in routine getattr. But the generic_fillattr has not fill that, so add
ext2_getattr to do this. This can fix generic/424 while testing ext2.

Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18 15:14:43 +01:00
Jan Kara
fabf7f29b3 fanotify: Use interruptible wait when waiting for permission events
When waiting for response to fanotify permission events, we currently
use uninterruptible waits. That makes code simple however it can cause
lots of processes to end up in uninterruptible sleep with hard reboot
being the only alternative in case fanotify listener process stops
responding (e.g. due to a bug in its implementation). Uninterruptible
sleep also makes system hibernation fail if the listener gets frozen
before the process generating fanotify permission event.

Fix these problems by using interruptible sleep for waiting for response
to fanotify event. This is slightly tricky though - we have to
detect when the event got already reported to userspace as in that
case we must not free the event. Instead we push the responsibility for
freeing the event to the process that will write response to the
event.

Reported-by: Orion Poplawski <orion@nwra.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18 12:41:16 +01:00
Jan Kara
40873284d7 fanotify: Track permission event state
Track whether permission event got already reported to userspace and
whether userspace already answered to the permission event. Protect
stores to this field together with updates to ->response field by
group->notification_lock. This will allow aborting wait for reply to
permission event from userspace.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18 12:41:16 +01:00
Jan Kara
ca6f86998d fanotify: Simplify cleaning of access_list
Simplify iteration cleaning access_list in fanotify_release(). That will
make following changes more obvious.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18 12:41:16 +01:00
Jan Kara
f7db89accc fsnotify: Create function to remove event from notification list
Create function to remove event from the notification list. Later it will
be used from more places.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18 12:41:16 +01:00
Jan Kara
8c5544666c fanotify: Move locking inside get_one_event()
get_one_event() has a single caller and that just locks
notification_lock around the call. Move locking inside get_one_event()
as that will make using ->response field for permission event state
easier.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18 12:40:46 +01:00
Jan Kara
af6a511306 fanotify: Fold dequeue_event() into process_access_response()
Fold dequeue_event() into process_access_response(). This will make
changes to use of ->response field easier.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-18 11:49:36 +01:00
Christoph Hellwig
7588cbeec6 xfs: retry COW fork delalloc conversion when no extent was found
While we can only truncate a block under the page lock for the current
page, there is no high-level synchronization for moving extents from the
COW to the data fork.  This means that for example we can have another
thread doing a direct I/O completion that moves extents from the COW to
the data fork race with writeback.  While this race is very hard to hit
the always_cow seems to reproduce it reasonably well, and it also exists
without that.  Because of that there is a chance that a delalloc
conversion for the COW fork might not find any extents to convert.  In
that case we should retry the whole block lookup and now find the blocks
in the data fork.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:54 -08:00
Christoph Hellwig
19c8e4e258 xfs: remove the truncate short cut in xfs_map_blocks
Now that we properly handle the race with truncate in the delalloc
allocator there is no need to short cut this exceptional case earlier
on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:54 -08:00
Christoph Hellwig
4ad765edb0 xfs: move xfs_iomap_write_allocate to xfs_aops.c
This function is a small wrapper only used by the writeback code, so
move it together with the writeback code and simplify it down to the
glorified do { } while loop that is now is.

A few bits intentionally got lost here: no need to call xfs_qm_dqattach
because quotas are always attached when we create the delalloc
reservation, and no need for the imap->br_startblock == 0 check given
that xfs_bmapi_convert_delalloc already has a WARN_ON_ONCE for exactly
that condition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:54 -08:00
Christoph Hellwig
125851ac92 xfs: move stat accounting to xfs_bmapi_convert_delalloc
This way we can actually count how many bytes got converted and how many
calls we need, unlike in the caller which doesn't have the detailed
view.

Note that this includes a slight change in behavior as the
xs_xstrat_quick is now bumped for every allocation instead of just the
one covering the requested writeback offset, which makes a lot more
sense.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:54 -08:00
Christoph Hellwig
491ce61e93 xfs: move transaction handling to xfs_bmapi_convert_delalloc
No need to deal with the transaction and the inode locking in the
caller. Note that we also switch to passing whichfork as the second
paramter, matching what most related functions do.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:54 -08:00
Christoph Hellwig
d8ae82e394 xfs: split XFS_BMAPI_DELALLOC handling from xfs_bmapi_write
Delalloc conversion has traditionally been part of our function to
allocate blocks on disk (first xfs_bmapi, then xfs_bmapi_write), but
delalloc conversion is a little special as we really do not want
to allocate blocks over holes, for which we don't have reservations.

Split the delalloc conversions into a separate helper to keep the
code simple and structured.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:54 -08:00
Christoph Hellwig
c8b54673b3 xfs: factor out two helpers from xfs_bmapi_write
We want to be able to reuse them for the upcoming dedidcated delalloc
convert routine.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:54 -08:00
Christoph Hellwig
b101e3342a xfs: simplify the xfs_bmap_btree_to_extents calling conventions
Move boilerplate code from the callers into xfs_bmap_btree_to_extents:

 - exit early without failure if we don't need to convert to the
   extent format
 - assert that we have a btree cursor
 - don't reinitialize the passed in logflags argument

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:53 -08:00
Christoph Hellwig
b4e29032f2 xfs: remove the s_maxbytes checks in xfs_map_blocks
We already ensure all data fits into s_maxbytes in the write / fault
path.  The only reason we have them here is that they were copy and
pasted from xfs_bmapi_read when we stopped using that function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:53 -08:00
Christoph Hellwig
be225fec72 xfs: remove the io_type field from the writeback context and ioend
The io_type field contains what is basically a summary of information
from the inode fork and the imap.  But we can just as easily use that
information directly, simplifying a few bits here and there and
improving the trace points.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-17 11:55:53 -08:00
Linus Torvalds
88fe73cb80 Two small fixes, one for crashes using nfs/krb5 with older enctypes, one
that could prevent clients from reclaiming state after a kernel upgrade.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJcZzZHAAoJECebzXlCjuG+EOsQALVuwSJqQh4GUVMSBYzL6Ov4
 SfinB8LJ8/1HwngSvRB3xQ4HiOtpFSNkjzfFYE7epy6augY8tRRnHGbnlHbsG5vI
 wQqTR6PbSq2mupgpi2WGRlRh521SDOi8V49fplUC+FuV7dJT/wm0hgdKsHCPHPX4
 TEYPglsvG6PLu5IcAofNac9PVZH21s3yVIKvqd6yifED5lhopdNw210s5DtzvugI
 g2JgHOhTfana+xQS/cJ1U8JHbbpM7jwOXAJ7IWD8k4GXdAW03X6jNOcseudcBTQY
 qSL33//6Xdu0r0uI21z4ZWxSWCOtt8YvnbMoG4EBqh3DpKbUpExh8j4eIyNPSuSF
 Y/8iAVJ9KWYhWO+IVPqvHVXz4mCIDK+f7iJ/m+lLjOQmWkpp6koeUDjKs4k9zBUC
 mbGTOrh0TJzXvKWKEU5Qy7meZVJGUpV+9ca+cDs5XN7Xa3blTp+5VrRVeDgKO5Kx
 OF3Y3IBOWhqN7+kEH98RvdZAmtbO0zg02IEIHOMPxH69JU8o0EsEni1LXsqDJrRi
 sLVYXvLwdPLfkqSjpI8xNeaoFXeelopx8Re+2oNEFIEvsfeT5XikbQHoqgFJNsyk
 hz7PHwuyGjc6NJRRSBUKYouWKPP4rrM7ZiOSyIEDYIIwyhBirpjrECaHzdi3D5j+
 xUyFGMF5F3wk1fdQHPfD
 =NopI
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.0-2' of git://linux-nfs.org/~bfields/linux

Pull more nfsd fixes from Bruce Fields:
 "Two small fixes, one for crashes using nfs/krb5 with older enctypes,
  one that could prevent clients from reclaiming state after a kernel
  upgrade"

* tag 'nfsd-5.0-2' of git://linux-nfs.org/~bfields/linux:
  sunrpc: fix 4 more call sites that were using stack memory with a scatterlist
  Revert "nfsd4: return default lease period"
2019-02-16 17:38:01 -08:00
Linus Torvalds
55638c520b More NFS client fixes for Linux 5.0
- Make sure Send CQ is allocated on an existing compvec
 - Properly check debugfs dentry before using it
 - Don't use page_file_mapping() after removing a page
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlxnMQ0ACgkQ18tUv7Cl
 QOsbhQ//VhgoXX25xHrApLz8wMuYPNOboDFSUf0O1GWoHi3opHnP+9LPf/iZkRQy
 YS0ufcO95i1LGjZLb8ac9hBWkko8TBl/dIONsG4ppf2bAbiVuag848wehi8hsGba
 zaSsXV6qdibq4qZsyK35hh0cHVHDgB1EMTu7AVORdvXsTHVX3xL86vts2y2VSLKv
 w9yKQBg4E4pWwENi7v77icSuGg/WpwfKnYxBzG6JPXuHQLGidyc/HrnVmLwhd6DQ
 0Sa6nzOAvgjjgVibB+tJfsitScmMTsaxulvHsm5iLjPJZ8SUjxYvAPl3AZdCYPvU
 XaADy8nrvXJUe9APhMINbkoxnF4W/OPnUMG3bWkWp2LeNZvk5l7VOzTW5Sh49Xyk
 pBAOd7qr3kfjFdvzypVz9NeXuS6BsTUA6LAudo8rF7nxi8jHPp6L+zZNWVrPIjY0
 +bNIj3K1Bji3jU9vTHyTzxDRB/4ZnzJaPF2Gv/5Y2cvkI7mfzHUz5p6cAU1OPIVB
 kuhZXkQFEPSS2OV6MUOe/HgmtY0oLM3XU9cEaFkLz59D1kb1fjO/yUu9YBQMq6Ke
 o6b7Dwh4WvLVN/AbgegKOnp5G0/ljmz6y7ML0AElYXg1iT4k0zE+qJpMWhOTRJnd
 +jf4hSS+l7p7D1ed+uqdMS/jc1s5vcuxwYDQUIutELjA/TCbLNI=
 =28v+
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.0-4' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull more NFS client fixes from Anna Schumaker:
 "Three fixes this time.

  Nicolas's is for xprtrdma completion vector allocation on single-core
  systems. Greg's adds an error check when allocating a debugfs dentry.
  And Ben's is an additional fix for nfs_page_async_flush() to prevent
  pages from accidentally getting truncated.

  Summary:

   - Make sure Send CQ is allocated on an existing compvec

   - Properly check debugfs dentry before using it

   - Don't use page_file_mapping() after removing a page"

* tag 'nfs-for-5.0-4' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Don't use page_file_mapping after removing the page
  rpc: properly check debugfs dentry before using it
  xprtrdma: Make sure Send CQ is allocated on an existing compvec
2019-02-16 17:33:39 -08:00
Aurelien Jarno
cc4b1242d7 vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1
The preadv2 and pwritev2 syscalls are supposed to emulate the readv and
writev syscalls when offset == -1. Therefore the compat code should
check for offset before calling do_compat_preadv64 and
do_compat_pwritev64. This is the case for the preadv2 and pwritev2
syscalls, but handling of offset == -1 is missing in their 64-bit
equivalent.

This patch fixes that, calling do_compat_readv and do_compat_writev when
offset == -1. This fixes the following glibc tests on x32:
 - misc/tst-preadvwritev2
 - misc/tst-preadvwritev64v2

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-16 10:21:52 -05:00
David Howells
822ad64d7e keys: Fix dependency loop between construction record and auth key
In the request_key() upcall mechanism there's a dependency loop by which if
a key type driver overrides the ->request_key hook and the userspace side
manages to lose the authorisation key, the auth key and the internal
construction record (struct key_construction) can keep each other pinned.

Fix this by the following changes:

 (1) Killing off the construction record and using the auth key instead.

 (2) Including the operation name in the auth key payload and making the
     payload available outside of security/keys/.

 (3) The ->request_key hook is given the authkey instead of the cons
     record and operation name.

Changes (2) and (3) allow the auth key to naturally be cleaned up if the
keyring it is in is destroyed or cleared or the auth key is unlinked.

Fixes: 7ee02a316600 ("keys: Fix dependency loop between construction record and auth key")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <james.morris@microsoft.com>
2019-02-15 14:12:09 -08:00
David S. Miller
3313da8188 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The netfilter conflicts were rather simple overlapping
changes.

However, the cls_tcindex.c stuff was a bit more complex.

On the 'net' side, Cong is fixing several races and memory
leaks.  Whilst on the 'net-next' side we have Vlad adding
the rtnl-ness support.

What I've decided to do, in order to resolve this, is revert the
conversion over to using a workqueue that Cong did, bringing us back
to pure RCU.  I did it this way because I believe that either Cong's
races don't apply with have Vlad did things, or Cong will have to
implement the race fix slightly differently.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-15 12:38:38 -08:00
Jens Axboe
6fb845f0e7 Linux 5.0-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxgqNUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwsoH+OVXu0NQofwTvVru
 8lgF3BSDG2mhf7mxbBBlBizGVy9jnjRNGCFMC+Jq8IwiFLwprja/G27kaDTkpuF1
 PHC3yfjKvjTeUP5aNdHlmxv6j1sSJfZl0y46DQal4UeTG/Giq8TFTi+Tbz7Wb/WV
 yCx4Lr8okAwTuNhnL8ojUCVIpd3c8QsyR9v6nEQ14Mj+MvEbokyTkMJV0bzOrM38
 JOB+/X1XY4JPZ6o3MoXrBca3bxbAJzMneq+9CWw1U5eiIG3msg4a+Ua3++RQMDNr
 8BP0yCZ6wo32S8uu0PI6HrZaBnLYi5g9Wh7Q7yc0mn1Uh1zWFykA6TtqK90agJeR
 A6Ktjw==
 =scY4
 -----END PGP SIGNATURE-----

Merge tag 'v5.0-rc6' into for-5.1/block

Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c.
This is needed since io_uring is now based on the block branch,
to avoid a conflict between the multi-page bvecs and the bits
of io_uring that touch the core block parts.

* tag 'v5.0-rc6': (525 commits)
  Linux 5.0-rc6
  x86/mm: Make set_pmd_at() paravirt aware
  MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
  blk-mq: remove duplicated definition of blk_mq_freeze_queue
  Blk-iolatency: warn on negative inflight IO counter
  blk-iolatency: fix IO hang due to negative inflight counter
  MAINTAINERS: unify reference to xen-devel list
  x86/mm/cpa: Fix set_mce_nospec()
  futex: Handle early deadlock return correctly
  futex: Fix barrier comment
  net: dsa: b53: Fix for failure when irq is not defined in dt
  blktrace: Show requests without sector
  mips: cm: reprime error cause
  mips: loongson64: remove unreachable(), fix loongson_poweroff().
  sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
  geneve: should not call rt6_lookup() when ipv6 was disabled
  KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
  KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
  kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
  signal: Better detection of synchronous signals
  ...
2019-02-15 08:43:59 -07:00
Ming Lei
07173c3ec2 block: enable multipage bvecs
This patch pulls the trigger for multi-page bvecs.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:12 -07:00
Ming Lei
6dc4f100c1 block: allow bio_for_each_segment_all() to iterate over multi-page bvec
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.

Given it is just one mechannical & simple change on all bio_for_each_segment_all()
users, this patch does tree-wide change in one single patch, so that we can
avoid to use a temporary helper for this conversion.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:11 -07:00
Ming Lei
c3a7ce7380 btrfs: use mp_bvec_last_segment to get bio's last page
Preparing for supporting multi-page bvec.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:11 -07:00
Ming Lei
f70f446407 fs/buffer.c: use bvec iterator to truncate the bio
Once multi-page bvec is enabled, the last bvec may include more than one
page, this patch use mp_bvec_last_segment() to truncate the bio.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:11 -07:00
Christoph Hellwig
8a2ee44a37 btrfs: look at bi_size for repair decisions
bio_readpage_error currently uses bi_vcnt to decide if it is worth
retrying an I/O.  But the vector count is mostly an implementation
artifact - it really should figure out if there is more than a
single sector worth retrying.  Use bi_size for that and shift by
PAGE_SHIFT.  This really should be blocks/sectors, but given that
btrfs doesn't support a sector size different from the PAGE_SIZE
using the page size keeps the changes to a minimum.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:10 -07:00
Darrick J. Wong
c4a6bf7f6c xfs: don't ever put nlink > 0 inodes on the unlinked list
When XFS creates an O_TMPFILE file, the inode is created with nlink = 1,
put on the unlinked list, and then the VFS sets nlink = 0 in d_tmpfile.
If we crash before anything logs the inode (it's dirty incore but the
vfs doesn't tell us it's dirty so we never log that change), the iunlink
processing part of recovery will then explode with a pile of:

XFS: Assertion failed: VFS_I(ip)->i_nlink == 0, file:
fs/xfs/xfs_log_recover.c, line: 5072

Worse yet, since nlink is nonzero, the inodes also don't get cleaned up
and they just leak until the next xfs_repair run.

Therefore, change xfs_iunlink to require that inodes being put on the
unlinked list have nlink == 0, change the tmpfile callers to instantiate
nodes that way, and set the nlink to 1 just prior to calling d_tmpfile.
Fix the comment for xfs_iunlink while we're at it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-02-14 22:42:57 -08:00
Darrick J. Wong
15a268d9f2 xfs: reserve blocks for ifree transaction during log recovery
Log recovery frees all the inodes stored in the unlinked list, which can
cause expansion of the free inode btree.  The ifree code skips block
reservations if it thinks there's a per-AG space reservation, but we
don't set up the reservation until after log recovery, which means that
a finobt expansion blows up in xfs_trans_mod_sb when we exceed the
transaction's block reservation.

To fix this, we set the "no finobt reservation" flag to true when we
create the xfs_mount and only set it to false if we confirm that every
AG had enough free space to put aside for the finobt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-02-14 22:42:57 -08:00
Darrick J. Wong
e1f6ca1138 xfs: rename m_inotbt_nores to m_finobt_nores
Rename this flag variable to imply more strongly that it's related to
the free inode btree (finobt) operation.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-02-14 22:42:57 -08:00
Linus Torvalds
cb5b020a8d Revert "exec: load_script: don't blindly truncate shebang string"
This reverts commit 8099b047ec.

It turns out that people do actually depend on the shebang string being
truncated, and on the fact that an interpreter (like perl) will often
just re-interpret it entirely to get the full argument list.

Reported-by: Samuel Dionne-Riel <samuel@dionne-riel.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-14 15:02:18 -08:00
Andreas Dilger
c9e716eb9b ext4: don't update s_rev_level if not required
Don't update the superblock s_rev_level during mount if it isn't
actually necessary, only if superblock features are being set by
the kernel.  This was originally added for ext3 since it always
set the INCOMPAT_RECOVER and HAS_JOURNAL features during mount,
but this is not needed since no journal mode was added to ext4.

That will allow Geert to mount his 20-year-old ext2 rev 0.0 m68k
filesystem, as a testament of the backward compatibility of ext4.

Fixes: 0390131ba8 ("ext4: Allow ext4 to run without a journal")
Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-02-14 17:52:18 -05:00
Theodore Ts'o
a58ca99266 jbd2: fold jbd2_superblock_csum_{verify,set} into their callers
The functions jbd2_superblock_csum_verify() and
jbd2_superblock_csum_set() only get called from one location, so to
simplify things, fold them into their callers.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-02-14 16:28:14 -05:00
Theodore Ts'o
538bcaa626 jbd2: fix race when writing superblock
The jbd2 superblock is lockless now, so there is probably a race
condition between writing it so disk and modifing contents of it, which
may lead to checksum error. The following race is the one case that we
have captured.

jbd2                                fsstress
jbd2_journal_commit_transaction
 jbd2_journal_update_sb_log_tail
  jbd2_write_superblock
   jbd2_superblock_csum_set         jbd2_journal_revoke
                                     jbd2_journal_set_features(revork)
                                     modify superblock
   submit_bh(checksum incorrect)

Fix this by locking the buffer head before modifing it.  We always
write the jbd2 superblock after we modify it, so this just means
calling the lock_buffer() a little earlier.

This checksum corruption problem can be reproduced by xfstests
generic/475.

Reported-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-02-14 16:27:14 -05:00
Thomas Gleixner
d869f86645 Merge branch 'linus' into irq/core
Pick up upstream changes to avoid conflicts for pending patches.
2019-02-14 22:26:50 +01:00
Bob Peterson
23e93c9b2c Revert "gfs2: read journal in large chunks to locate the head"
This reverts commit 2a5f14f279.

This patch causes xfstests generic/311 to fail. Reverting this for
now until we have a proper fix.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-14 09:52:51 -08:00
Darrick J. Wong
3b50086f0c xfs: don't overflow xattr listent buffer
For VFS listxattr calls, xfs_xattr_put_listent calls
__xfs_xattr_put_listent twice if it sees an attribute
"trusted.SGI_ACL_FILE": once for that name, and again for
"system.posix_acl_access".  Unfortunately, if we happen to run out of
buffer space while emitting the first name, we set count to -1 (so that
we can feed ERANGE to the caller).  The second invocation doesn't check that
the context parameters make sense and overwrites the byte before the
buffer, triggering a KASAN report:

==================================================================
BUG: KASAN: slab-out-of-bounds in strncpy+0xb3/0xd0
Write of size 1 at addr ffff88807fbd317f by task syz/1113

CPU: 3 PID: 1113 Comm: syz Not tainted 5.0.0-rc6-xfsx #rc6
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0xcc/0x180
 print_address_description+0x6c/0x23c
 kasan_report.cold.3+0x1c/0x35
 strncpy+0xb3/0xd0
 __xfs_xattr_put_listent+0x1a9/0x2c0 [xfs]
 xfs_attr_list_int_ilocked+0x11af/0x1800 [xfs]
 xfs_attr_list_int+0x20c/0x2e0 [xfs]
 xfs_vn_listxattr+0x225/0x320 [xfs]
 listxattr+0x11f/0x1b0
 path_listxattr+0xbd/0x130
 do_syscall_64+0x139/0x560

While we're at it we add an assert to the other put_listent to avoid
this sort of thing ever happening to the attrlist_by_handle code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-02-14 09:36:52 -08:00
J. Bruce Fields
3bf6b57ec2 Revert "nfsd4: return default lease period"
This reverts commit d6ebf5088f.

I forgot that the kernel's default lease period should never be
decreased!

After a kernel upgrade, the kernel has no way of knowing on its own what
the previous lease time was.  Unless userspace tells it otherwise, it
will assume the previous lease period was the same.

So if we decrease this value in a kernel upgrade, we end up enforcing a
grace period that's too short, and clients will fail to reclaim state in
time.  Symptoms may include EIO and log messages like "NFS:
nfs4_reclaim_open_state: Lock reclaim failed!"

There was no real justification for the lease period decrease anyway.

Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Fixes: d6ebf5088f "nfsd4: return default lease period"
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-02-14 12:33:19 -05:00
Jan Kara
53136b393c fanotify: Select EXPORTFS
Fanotify now uses exportfs_encode_inode_fh() so it needs to select
EXPORTFS.

Fixes: e9e0c89030 "fanotify: encode file identifier for FAN_REPORT_FID"
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-14 17:46:33 +01:00
Chuck Lever
02ef04e432 NFS: Account for XDR pad of buf->pages
Certain NFS results (eg. READLINK) might expect a data payload that
is not an exact multiple of 4 bytes. In this case, XDR encoding
is required to pad that payload so its length on the wire is a
multiple of 4 bytes. The constants that define the maximum size of
each NFS result do not appear to account for this extra word.

In each case where the data payload is to be received into pages:

- 1 word is added to the size of the receive buffer allocated by
  call_allocate

- rpc_inline_rcv_pages subtracts 1 word from @hdrsize so that the
  extra buffer space falls into the rcv_buf's tail iovec

- If buf->pagelen is word-aligned, an XDR pad is not needed and
  is thus removed from the tail

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-02-14 10:13:49 -05:00
Chuck Lever
cf500bac8f SUNRPC: Introduce rpc_prepare_reply_pages()
prepare_reply_buffer() and its NFSv4 equivalents expose the details
of the RPC header and the auth slack values to upper layer
consumers, creating a layering violation, and duplicating code.

Remedy these issues by adding a new RPC client API that hides those
details from upper layers in a common helper function.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-02-14 10:04:37 -05:00
Chuck Lever
f23f658404 NFS: Add trace events to report non-zero NFS status codes
These can help field troubleshooting without needing the overhead
of a full network capture (ie, tcpdump).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-02-13 12:03:21 -05:00
Chuck Lever
eb72f484a5 NFS: Remove print_overflow_msg()
This issue is now captured by a trace point in the RPC client.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-02-13 11:53:45 -05:00
Chuck Lever
0ccc61b1c7 SUNRPC: Add xdr_stream::rqst field
Having access to the controlling rpc_rqst means a trace point in the
XDR code can report:

 - the XID
 - the task ID and client ID
 - the p_name of RPC being processed

Subsequent patches will introduce such trace points.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-02-13 11:05:50 -05:00
Chad Austin
fabf7e0262 fuse: cache readdir calls if filesystem opts out of opendir
If a filesystem returns ENOSYS from opendir and thus opts out of
opendir and releasedir requests, it almost certainly would also like
readdir results cached. Default open_flags to FOPEN_KEEP_CACHE and
FOPEN_CACHE_DIR in that case.

With this patch, I've measured recursive directory enumeration across
large FUSE mounts to be faster than native mounts.

Signed-off-by: Chad Austin <chadaustin@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:15 +01:00
Chad Austin
d9a9ea94f7 fuse: support clients that don't implement 'opendir'
Allow filesystems to return ENOSYS from opendir, preventing the kernel from
sending opendir and releasedir messages in the future. This avoids
userspace transitions when filesystems don't need to keep track of state
per directory handle.

A new capability flag, FUSE_NO_OPENDIR_SUPPORT, parallels
FUSE_NO_OPEN_SUPPORT, indicating the new semantics for returning ENOSYS
from opendir.

Signed-off-by: Chad Austin <chadaustin@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:15 +01:00
Miklos Szeredi
2f7b6f5bed fuse: lift bad inode checks into callers
Bad inode checks were done  done in various places, and move them into
fuse_file_{read|write}_iter().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:15 +01:00
Miklos Szeredi
55752a3aba fuse: multiplex cached/direct_io file operations
This is cleanup, as well as allowing switching between I/O modes while the
file is open in the future.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:15 +01:00
Miklos Szeredi
d4136d6075 fuse add copy_file_range to direct io fops
Nothing preventing copy_file_range to work on files opened with
FOPEN_DIRECT_IO.

Fixes: 88bc7d5097 ("fuse: add support for copy_file_range()")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:14 +01:00
Miklos Szeredi
3c3db095b6 fuse: use iov_iter based generic splice helpers
The default splice implementation is grossly inefficient and the iter based
ones work just fine, so use those instead.  I've measured an 8x speedup for
splice write (with len = 128k).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:14 +01:00
Martin Raiber
23c94e1cdc fuse: Switch to using async direct IO for FOPEN_DIRECT_IO
Switch to using the async directo IO code path in fuse_direct_read_iter()
and fuse_direct_write_iter().  This is especially important in connection
with loop devices with direct IO enabled as loop assumes async direct io is
actually async.

Signed-off-by: Martin Raiber <martin@urbackup.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:14 +01:00
Miklos Szeredi
75126f5504 fuse: use atomic64_t for khctr
...to get rid of one more fc->lock use.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:14 +01:00
Miklos Szeredi
eb98e3bdf3 fuse: clean up aborted
The only caller that needs fc->aborted set is fuse_conn_abort_write().
Setting fc->aborted is now racy (fuse_abort_conn() may already be in
progress or finished) but there's no reason to care.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:14 +01:00
Kirill Tkhai
6b675738ce fuse: Protect ff->reserved_req via corresponding fi->lock
This is rather natural action after previous patches, and it just decreases
load of fc->lock.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:14 +01:00
Kirill Tkhai
c9d8f5f069 fuse: Protect fi->nlookup with fi->lock
This continues previous patch and introduces the same protection for
nlookup field.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:14 +01:00
Kirill Tkhai
f15ecfef05 fuse: Introduce fi->lock to protect write related fields
To minimize contention of fc->lock, this patch introduces a new spinlock
for protection fuse_inode metadata:

fuse_inode:
	writectr
	writepages
	write_files
	queued_writes
	attr_version

inode:
	i_size
	i_nlink
	i_mtime
	i_ctime

Also, it protects the fields changed in fuse_change_attributes_common()
(too many to list).

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:14 +01:00
Kirill Tkhai
4510d86fbb fuse: Convert fc->attr_version into atomic64_t
This patch makes fc->attr_version of atomic64_t type, so fc->lock won't be
needed to read or modify it anymore.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:13 +01:00
Kirill Tkhai
ebf84d0c72 fuse: Add fuse_inode argument to fuse_prepare_release()
Here is preparation for next patches, which introduce new fi->lock for
protection of ff->write_entry linked into fi->write_files.

This patch just passes new argument to the function.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:13 +01:00
Kirill Tkhai
b782911b52 fuse: Verify userspace asks to requeue interrupt that we really sent
When queue_interrupt() is called from fuse_dev_do_write(), it came from
userspace directly. Userspace may pass any request id, even the request's
we have not interrupted (or even background's request). This patch adds
sanity check to make kernel safe against that.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:13 +01:00
Kirill Tkhai
7407a10de5 fuse: Do some refactoring in fuse_dev_do_write()
This is needed for next patch.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:13 +01:00
Kirill Tkhai
5e0fed717a fuse: Wake up req->waitq of only if not background
Currently, we wait on req->waitq in request_wait_answer() function only,
and it's never used for background requests.  Since wake_up() is not a
light-weight macros, instead of this, it unfolds in really called function,
which makes locking operations taking some cpu cycles, let's avoid its call
for the case we definitely know it's completely useless.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:13 +01:00
Kirill Tkhai
217316a601 fuse: Optimize request_end() by not taking fiq->waitq.lock
We take global fiq->waitq.lock every time, when we are in this function,
but interrupted requests are just small subset of all requests. This patch
optimizes request_end() and makes it to take the lock when it's really
needed.

queue_interrupt() needs small change for that. After req is linked to
interrupt list, we do smp_mb() and check for FR_FINISHED again. In case of
FR_FINISHED bit has appeared, we remove req and leave the function:

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:13 +01:00
Kirill Tkhai
8da6e91832 fuse: Kill fasync only if interrupt is queued in queue_interrupt()
We should sent signal only in case of interrupt is really queued.  Not a
real problem, but this makes the code clearer and intuitive.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:13 +01:00
Kirill Tkhai
340617508d fuse: Remove stale comment in end_requests()
Function end_requests() does not take fc->lock.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:12 +01:00
Kirill Tkhai
c5de16cca2 fuse: Replace page without copying in fuse_writepage_in_flight()
It looks like we can optimize page replacement and avoid copying by simple
updating the request's page.

[SzM: swap with new request's tmp page to avoid use after free.]

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:12 +01:00
Miklos Szeredi
e2653bd53a fuse: fix leaked aux requests
Auxiliary requests chained on req->misc.write.next may be leaked on
truncate.  Free these as well if the parent request was truncated off.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:12 +01:00
Miklos Szeredi
419234d595 fuse: only reuse auxiliary request in fuse_writepage_in_flight()
Don't reuse the queued request, even if it only contains a single page.
This is needed because previous locking changes (spliting out
fiq->waitq.lock from fc->lock) broke the assumption that request will
remain in FR_PENDING at least until the new page contents are copied.

This fix removes a slight optimization for a rare corner case, so we really
shoudln't care.

Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Fixes: fd22d62ed0 ("fuse: no fc->lock for iqueue parts")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:12 +01:00
Miklos Szeredi
7f305ca192 fuse: clean up fuse_writepage_in_flight()
Restructure the function to better separate the locked and the unlocked
parts.  Use the "old_req" local variable to mean only the queued request,
and not any auxiliary requests added onto its misc.write.next list.  These
changes are in preparation for the following patch.

Also turn BUG_ON instances into WARN_ON and add a header comment explaining
what the function does.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:12 +01:00
Miklos Szeredi
2fe93bd432 fuse: extract fuse_find_writeback() helper
Call this from fuse_range_is_writeback() and fuse_writepage_in_flight().

Turn a BUG_ON() into a WARN_ON() in the process.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 13:15:12 +01:00
Vivek Goyal
993a0b2aec ovl: Do not lose security.capability xattr over metadata file copy-up
If a file has been copied up metadata only, and later data is copied up,
upper loses any security.capability xattr it has (underlying filesystem
clears it as upon file write).

From a user's point of view, this is just a file copy-up and that should
not result in losing security.capability xattr.  Hence, before data copy
up, save security.capability xattr (if any) and restore it on upper after
data copy up is complete.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Fixes: 0c28887493 ("ovl: A new xattr OVL_XATTR_METACOPY for file on upper")
Cc: <stable@vger.kernel.org> # v4.19+
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13 11:14:46 +01:00
Ira Weiny
0cefc36b32 fs/dax: NIT fix comment regarding start/end vs range
Fixes: ac46d4f3c4 ("mm/mmu_notifier: use structure for invalidate_range_start/end calls v2")
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-02-12 20:24:15 -08:00
Souptick Joarder
c9aed74e6a fs/dax: Convert to use vmf_error()
This code is converted to use vmf_error().

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-02-12 20:22:45 -08:00
Linus Torvalds
1f947a7a01 Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "6 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: proc: smaps_rollup: fix pss_locked calculation
  Rename include/{uapi => }/asm-generic/shmparam.h really
  Revert "mm: use early_pfn_to_nid in page_ext_init"
  mm/gup: fix gup_pmd_range() for dax
  Revert "mm: slowly shrink slabs with a relatively small number of objects"
  Revert "mm: don't reclaim inodes with many attached pages"
2019-02-12 17:15:33 -08:00
Sandeep Patil
27dd768ed8 mm: proc: smaps_rollup: fix pss_locked calculation
The 'pss_locked' field of smaps_rollup was being calculated incorrectly.
It accumulated the current pss everytime a locked VMA was found.  Fix
that by adding to 'pss_locked' the same time as that of 'pss' if the vma
being walked is locked.

Link: http://lkml.kernel.org/r/20190203065425.14650-1-sspatil@android.com
Fixes: 493b0e9d94 ("mm: add /proc/pid/smaps_rollup")
Signed-off-by: Sandeep Patil <sspatil@android.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: <stable@vger.kernel.org>	[4.14.x, 4.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-12 16:33:18 -08:00
Dave Chinner
69056ee6a8 Revert "mm: don't reclaim inodes with many attached pages"
This reverts commit a76cf1a474 ("mm: don't reclaim inodes with many
attached pages").

This change causes serious changes to page cache and inode cache
behaviour and balance, resulting in major performance regressions when
combining worklaods such as large file copies and kernel compiles.

  https://bugzilla.kernel.org/show_bug.cgi?id=202441

This change is a hack to work around the problems introduced by changing
how agressive shrinkers are on small caches in commit 172b06c32b ("mm:
slowly shrink slabs with a relatively small number of objects").  It
creates more problems than it solves, wasn't adequately reviewed or
tested, so it needs to be reverted.

Link: http://lkml.kernel.org/r/20190130041707.27750-2-david@fromorbit.com
Fixes: a76cf1a474 ("mm: don't reclaim inodes with many attached pages")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Wolfgang Walter <linux@stwm.de>
Cc: Roman Gushchin <guro@fb.com>
Cc: Spock <dairinin@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-12 16:33:18 -08:00
Kees Cook
93ee4b7d9f pstore/ram: Avoid needless alloc during header write
Since the header is a fixed small maximum size, just use a stack variable
to avoid memory allocation in the write path.

Signed-off-by: Kees Cook <keescook@chromium.org>
2019-02-12 13:45:53 -08:00
Benjamin Coddington
d2ceb7e570 NFS: Don't use page_file_mapping after removing the page
If nfs_page_async_flush() removes the page from the mapping, then we can't
use page_file_mapping() on it as nfs_updatepate() is wont to do when
receiving an error.  Instead, push the mapping to the stack before the page
is possibly truncated.

Fixes: 8fc75bed96 ("NFS: Fix up return value on fatal errors in nfs_page_async_flush()")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-02-12 15:56:28 -05:00
Yue Hu
47afd7ae65 pstore/ram: Add kmsg hlen zero check to ramoops_pstore_write()
If zero-length header happened in ramoops_write_kmsg_hdr(), that means
we will not be able to read back dmesg record later, since it will be
treated as invalid header in ramoops_pstore_read(). So we should not
execute the following code but return the error.

Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2019-02-12 12:38:54 -08:00
Yue Hu
1e0f67a96a pstore/ram: Move initialization earlier
Since only one single ramoops area allowed at a time, other probes
(like device tree) are meaningless, as it will waste CPU resources.
So let's check for being already initialized first.

Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2019-02-12 12:10:43 -08:00
Yue Hu
4c6c4d3453 pstore: Avoid writing records with zero size
Sometimes pstore_console_write() will write records with zero size
to persistent ram zone, which is unnecessary. It will only increase
resource consumption. Also adjust ramoops_write_kmsg_hdr() to have
same logic if memory allocation fails.

Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2019-02-12 12:09:49 -08:00
Linus Torvalds
8ae757efd3 Revert a commit which landed in v5.0-rc1 since it makes fsync in ext4
nojournal mode unsafe.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlxiSNUACgkQ8vlZVpUN
 gaPcCgf+LTKVur85OUKqeEDRfiCYjNemzgSOve8EtiGnPzG5s8vVeVQ5n880lTzY
 c/Q2/CotN1/0slWxF8WmZ6xTzdrdadtdJ2iv6O5R58uj3lz3UK4cXJf+XmvQRj3L
 5UqMygRN1omKFCp+dB38SeZuE1amV/tV61Q3ATnVMLi+xZcH8lV+efulNi19p2Gd
 EC7DYs2seAiNEJAh9nVSL1LWo3i0OH5JJQCv08+c71CPPqixAYuuEy59q55k6OvW
 Jp8mhYd4/5ueoV+rxH8Gft52MaOPvPHaxWvxrIbpKb5u8EZ6WUXGZ74qdK3KG6hG
 k29IX8XoXWlTHkfSiOOZLUJuW7n+XQ==
 =w43x
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fix from Ted Ts'o:
 "Revert a commit which landed in v5.0-rc1 since it makes fsync in ext4
  nojournal mode unsafe"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  Revert "ext4: use ext4_write_inode() when fsyncing w/o a journal"
2019-02-12 09:57:45 -08:00
Brian Foster
670105de15 xfs: compile time offset checks for common v4/v5 metadata
The v5 superblock format added various metadata fields (such as crc,
metadata lsn, owner uuid, etc.) to v4 metadata headers or created
new v5 headers for blocks where no such headers existed on v4. Where
v4 headers did exist, the v5 structures are careful to place v4
metadata at the original location. For example, the magic value is
expected to be at the same location in certain blocks to facilitate
version detection.

While failure of this invariant is likely to cause severe and
obvious problems at runtime, we can detect this condition at compile
time via the more recently added on-disk format check
infrastructure. Since there is no runtime cost, add some offset
checks that start with v5 structure definitions, traverse down to
the first bit of common metadata with v4 and ensure that common
metadata is at the expected offset. Note that we don't care about
blocks which had no v4 header because there is no common metadata in
those cases. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
9228d751eb xfs: use buf ops magic to detect btree block type
Now that we encode block magic numbers in all the buffer ops, use that
for block type detection in the ag header repair code instead of
encoding magics directly in the repair code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
4260baac62 xfs: add magic numbers to dquot buffer ops
Add dquot magic numbers to the buffer ops type, in case we ever want to
use them.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
2bfe7069f7 xfs: add inode magic to inode verifier
Use xfs_verify_magic to check the magic numbers of inodes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Brian Foster
8764f98351 xfs: factor xfs_da3_blkinfo verification into common helper
With the verifier magic value helper in place, we've left a bit more
duplicate code across the verifiers that involve struct
xfs_da3_blkinfo. This includes the da node, xattr leaf and dir leaf
verifiers, all of which perform similar checks for v4 and v5
filesystems.

Create a common helper to verify an xfs_da3_blkinfo structure,
taking care to only access v5 fields where appropriate, and refactor
the aforementioned verifiers to use the helper. No functional
changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
39708c20ab xfs: miscellaneous verifier magic value fixups
Most buffer verifiers have hardcoded magic value checks
conditionalized on the version of the filesystem. The magic value
field of the verifier structure facilitates abstraction of some of
this code. Populate the ->magic field of various verifiers to take
advantage of this abstraction. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
09f420197d xfs: use verifier magic field in dir2 leaf verifiers
The dir2 leaf verifiers share the same underlying structure
verification code, but implement six accessor functions to multiplex
the code across the two verifiers. Further, the magic value isn't
sufficiently abstracted such that the common helper has to manually
fix up the magic from the caller on v5 filesystems.

Use the magic field in the verifier structure to eliminate the
duplicate code and clean this all up. No functional change.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
b8f8980166 xfs: distinguish between bnobt and cntbt magic values
The allocation btree verifiers share code that is unable to detect
cross-tree magic value corruptions such as a bnobt block with a
cntbt magic value. Populate the b_ops->magic field of the associated
verifier structures such that the structure verifier can check the
magic value against the expected value based on tree type.

The btree level check requires knowledge of the tree type to
determine the appropriate maximum value. This was previously part of
the hardcoded magic value checks. With that code removed, peek at
the first magic value in the verifier to determine the expected tree
type of the current block.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
27df4f5045 xfs: split up allocation btree verifier
Similar to the inode btree verifier, the same allocation btree
verifier structure is shared between the by-bno (bnobt) and by-size
(cntbt) btrees. This prevents the ability to distinguish magic
values between them. Separate the verifier into two, one for each
tree, and assign them appropriately. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
8473fee340 xfs: distinguish between inobt and finobt magic values
The inode btree verifier code is shared between the inode btree and
free inode btree because the underlying metadata formats are
essentially equivalent. A side effect of this is that the verifier
cannot determine whether a particular btree block should have an
inobt or finobt magic value.

This logic allows an unfortunate xfs_repair bug to escape detection
where certain level > 0 nodes of the finobt are stamped with inobt
magic by xfs_repair finobt reconstruction. This is fortunately not a
severe problem since the inode btree magic values do not contribute
to any changes in kernel behavior, but we do need a means to detect
and prevent this problem in the future.

Add a field to xfs_buf_ops to store the v4 and v5 superblock magic
values expected by a particular verifier. Add a helper to check an
on-disk magic value against the value expected by the verifier. Call
the helper from the shared [f]inobt verifier code for magic value
verification. This ensures that the inode btree blocks each have the
appropriate magic value based on specific tree type and superblock
version.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
01e68f40bf xfs: create a separate finobt verifier
The inobt verifier is reused for the inobt and finobt, which
prevents the ability to distinguish between magic values on a
per-tree basis. Create a separate finobt structure in preparation
for changes to enforce the appropriate magic value for the
associated tree. This patch has no functional change.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
e34d3e74eb xfs: always check magic values in on-disk byte order
Most verifiers that check on-disk magic values convert the CPU
endian magic value constant to disk endian to facilitate compile
time optimization of the byte swap and reduce the need for runtime
byte swaps in buffer verifiers. Several buffer verifiers do not
follow this pattern. Update those verifiers for consistency.

Also fix up a random typo in the inode readahead verifier name.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
75d0230314 xfs: clarify documentation for the function to reverify buffers
Improve the documentation around xfs_buf_ensure_ops, which is the
function that is responsible for cleaning up the b_ops state of buffers
that go through xrep_findroot_block but don't match anything.  Rename
the function to xfs_buf_reverify.

[darrick: this started off as bfoster mods of a previous patch of mine,
but the renaming part is now this separate patch.]

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
9b24717979 xfs: cache unlinked pointers in an rhashtable
Use a rhashtable to cache the unlinked list incore.  This should speed
up unlinked processing considerably when there are a lot of inodes on
the unlinked list because iunlink_remove no longer has to traverse an
entire bucket list to find which inode points to the one being removed.

The incore list structure records "X.next_unlinked = Y" relations, with
the rhashtable using Y to index the records.  This makes finding the
inode X that points to a inode Y very quick.  If our cache fails to find
anything we can always fall back on the old method.

FWIW this drastically reduces the amount of time it takes to remove
inodes from the unlinked list.  I wrote a program to open a lot of
O_TMPFILE files and then close them in the same order, which takes
a very long time if we have to traverse the unlinked lists.  With the
ptach, I see:

+ /d/t/tmpfile/tmpfile
Opened 193531 files in 6.33s.
Closed 193531 files in 5.86s

real    0m12.192s
user    0m0.064s
sys     0m11.619s
+ cd /
+ umount /mnt

real    0m0.050s
user    0m0.004s
sys     0m0.030s

And without the patch:

+ /d/t/tmpfile/tmpfile
Opened 193588 files in 6.35s.
Closed 193588 files in 751.61s

real    12m38.853s
user    0m0.084s
sys     12m34.470s
+ cd /
+ umount /mnt

real    0m0.086s
user    0m0.000s
sys     0m0.060s

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
4664c66c91 xfs: add tracepoints for high level iunlink operations
Add tracepoints so we can associate high level operations with low level
updates.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
b1d2a068ea xfs: refactor inode update in iunlink_remove
In xfs_iunlink_remove we have two identical calls to
xfs_iunlink_update_inode, so move it out of the if statement to simplify
the code some more.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
23ffa52cc7 xfs: refactor unlinked list search and mapping to a separate function
There's a loop that searches an unlinked bucket list to find the inode
that points to a given inode.  Hoist this into a separate function;
later we'll use our iunlink backref cache to bypass the slow list
operation.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
f2fc16a3d7 xfs: refactor inode unlinked pointer update functions
Hoist the functions that update an inode's unlinked pointer updates into
a helper.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
86bfd3750f xfs: strengthen AGI unlinked inode bucket pointer checks
Strengthen our checking of the AGI unlinked pointers when we start to
use them for updating the metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
9a4a511864 xfs: refactor AGI unlinked bucket updates
Split the AGI unlinked bucket updates into a separate function.  No
functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
7d36c19538 xfs: add xfs_verify_agino_or_null helper
Add a new helper to check that a per-AG inode pointer is either null or
points somewhere valid within that AG.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-02-11 16:07:01 -08:00
Darrick J. Wong
5837f62592 xfs: clean up iunlink functions
Fix some indentation issues with the iunlink functions and reorganize
the tops of the functions to be identical.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-02-11 16:07:01 -08:00
Brian Foster
c2b3164320 xfs: use the latest extent at writeback delalloc conversion time
The writeback delalloc conversion code is racy with respect to
changes in the currently cached file mapping outside of the current
page. This is because the ilock is cycled between the time the
caller originally looked up the mapping and across each real
allocation of the provided file range. This code has collected
various hacks over the years to help combat the symptoms of these
races (i.e., truncate race detection, allocation into hole
detection, etc.), but none address the fundamental problem that the
imap may not be valid at allocation time.

Rather than continue to use race detection hacks, update writeback
delalloc conversion to a model that explicitly converts the delalloc
extent backing the current file offset being processed. The current
file offset is the only block we can trust to remain once the ilock
is dropped because any operation that can remove the block
(truncate, hole punch, etc.) must flush and discard pagecache pages
first.

Modify xfs_iomap_write_allocate() to use the xfs_bmapi_delalloc()
mechanism to request allocation of the entire delalloc extent
backing the current offset instead of assuming the extent passed by
the caller is unchanged. Record the range specified by the caller
and apply it to the resulting allocated extent so previous checks by
the caller for COW fork overlap are not lost. Finally, overload the
bmapi delalloc flag with the range reval flag behavior since this is
the only use case for both.

This ensures that writeback always picks up the correct
and current extent associated with the page, regardless of races
with other extent modifying operations. If operating on a data fork
and the COW overlap state has changed since the ilock was cycled,
the caller revalidates against the COW fork sequence number before
using the imap for the next block.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
627209fbcc xfs: create delalloc bmapi wrapper for full extent allocation
The writeback delalloc conversion code is racy with respect to
changes in the currently cached file mapping. This stems from the
fact that the bmapi allocation code requires a file range to
allocate and the writeback conversion code assumes the range of the
currently cached mapping is still valid with respect to the fork. It
may not be valid, however, because the ilock is cycled (potentially
multiple times) between the time the cached mapping was populated
and the delalloc conversion occurs.

To facilitate a solution to this problem, create a new
xfs_bmapi_delalloc() wrapper to xfs_bmapi_write() that takes a file
(FSB) offset and attempts to allocate whatever delalloc extent backs
the offset. Use a new bmapi flag to cause xfs_bmapi_write() to set
the range based on the extent backing the bno parameter unless bno
lands in a hole. If bno does land in a hole, fall back to the
current behavior (which may result in an error or quietly skipping
holes in the specified range depending on other parameters). This
patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
3b35089807 xfs: remove superfluous writeback mapping eof trimming
Now that the cached writeback mapping is explicitly invalidated on
data fork changes, the EOF trimming band-aid is no longer necessary.
Remove xfs_trim_extent_eof() as well since it has no other users.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
d9252d526b xfs: validate writeback mapping using data fork seq counter
The writeback code caches the current extent mapping across multiple
xfs_do_writepage() calls to avoid repeated lookups for sequential
pages backed by the same extent. This is known to be slightly racy
with extent fork changes in certain difficult to reproduce
scenarios. The cached extent is trimmed to within EOF to help avoid
the most common vector for this problem via speculative
preallocation management, but this is a band-aid that does not
address the fundamental problem.

Now that we have an xfs_ifork sequence counter mechanism used to
facilitate COW writeback, we can use the same mechanism to validate
consistency between the data fork and cached writeback mappings. On
its face, this is somewhat of a big hammer approach because any
change to the data fork invalidates any mapping currently cached by
a writeback in progress regardless of whether the data fork change
overlaps with the range under writeback. In practice, however, the
impact of this approach is minimal in most cases.

First, data fork changes (delayed allocations) caused by sustained
sequential buffered writes are amortized across speculative
preallocations. This means that a cached mapping won't be
invalidated by each buffered write of a common file copy workload,
but rather only on less frequent allocation events. Second, the
extent tree is always entirely in-core so an additional lookup of a
usable extent mostly costs a shared ilock cycle and in-memory tree
lookup. This means that a cached mapping reval is relatively cheap
compared to the I/O itself. Third, spurious invalidations don't
impact ioend construction. This means that even if the same extent
is revalidated multiple times across multiple writepage instances,
we still construct and submit the same size ioend (and bio) if the
blocks are physically contiguous.

Update struct xfs_writepage_ctx with a new field to hold the
sequence number of the data fork associated with the currently
cached mapping. Check the wpc seqno against the data fork when the
mapping is validated and reestablish the mapping whenever the fork
has changed since the mapping was cached. This ensures that
writeback always uses a valid extent mapping and thus prevents lost
writebacks and stale delalloc block problems.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:01 -08:00
Brian Foster
9f9bc034b8 xfs: update fork seq counter on data fork changes
The sequence counter in the xfs_ifork structure is only updated on
COW forks. This is because the counter is currently only used to
optimize out repetitive COW fork checks at writeback time.

Tweak the extent code to update the seq counter regardless of the
fork type in preparation for using this counter on data forks as
well.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:00 -08:00
Marco Benatto
d519da41e2 xfs: Introduce XFS_PTAG_VERIFIER_ERROR panic mask
Currently we have a few PTAGs in place allowing us to transform a filesystem
error in a BUG() call.  However, we don't have a panic tag for corrupt
metadata, so introduce XFS_PTAG_VERIFIER_ERROR so that the administrator can
use the fs.xfs.panic_mask sysctl knob to convert any error detected by buffer
verifiers into a kernel panic.

Signed-off-by: Marco Benatto <mbenatto@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
[darrick: light editing of commit message]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:00 -08:00
YueHaibing
e88db81645 xfs: remove duplicated xfs_defer.h
Remove duplicated include.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-11 16:07:00 -08:00
Darrick J. Wong
654805367d xfs: check attribute name validity
Check extended attribute entry names for invalid characters.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:40 -08:00
Darrick J. Wong
e5d7d51b34 xfs: check directory name validity
Check directory entry names for invalid characters.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:40 -08:00
Darrick J. Wong
87c9607df2 xfs: fix off-by-one error in rtbitmap cross-reference
Fix an off-by-one error in the realtime bitmap "is used" cross-reference
helper function if the realtime extent size is a single block.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:40 -08:00
Darrick J. Wong
f8c1d7023e xfs: scrub should flag dir/attr offsets that aren't mappable with xfs_dablk_t
Teach scrub to flag extent maps that exceed the range that can be mapped
with a xfs_dablk_t.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:40 -08:00
Darrick J. Wong
3258cb208c xfs: abort xattr scrub if fatal signals are pending
The extended attribute scrubber should abort the "read all attrs" loop
if there's a fatal signal pending on the process.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:39 -08:00
Darrick J. Wong
f9e63342b8 xfs: consolidate scrub dinode mapping code into a single function
Move all the confusing dinode mapping code that's split between
xchk_iallocbt_check_cluster and xchk_iallocbt_check_cluster_ifree into
the first function so that it's clearer how we find the dinode for a
given inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:39 -08:00
Darrick J. Wong
4539b8a780 xfs: scrub big block inode btrees correctly
Teach scrub how to handle the case that there are one or more inobt
records covering a given inode cluster.  This fixes the operation on big
block filesystems (e.g. 64k blocks, 512 byte inodes).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:39 -08:00
Darrick J. Wong
b9454fe056 xfs: clean up the inode cluster checking in the inobt scrub
The code to check inobt records against inode clusters is a mess of
poorly named variables and unnecessary parameters.  Clean the
unnecessary inode number parameters out of _check_cluster_freemask in
favor of computing them inside the function instead of making the caller
do it.  In xchk_iallocbt_check_cluster, rename the variables to make it
more obvious just what chunk_ino and cluster_ino represent.

Add a tracepoint to make it easier to track each inode cluster as we
scrub it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:39 -08:00
Darrick J. Wong
a1954242fa xfs: hoist inode cluster checks out of loop
Hoist the inode cluster checks out of the inobt record check loop into
a separate function in preparation for refactoring of that loop.  No
functional changes here; that's in the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:39 -08:00
Darrick J. Wong
22234c62f9 xfs: check inobt record alignment on big block filesystems
On a big block filesystem, there may be multiple inobt records covering
a single inode cluster.  These records obviously won't be aligned to
cluster alignment rules, and they must cover the entire cluster.  Teach
scrub to check for these things.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:39 -08:00
Darrick J. Wong
c050fdfeb5 xfs: check the ir_startino alignment directly
In xchk_iallocbt_rec, check the alignment of ir_startino by converting
the inode cluster block alignment into units of inodes instead of the
other way around (converting ir_startino to blocks).  This prevents us
from tripping over off-by-one errors in ir_startino which are obscured
by the inode -> block conversion.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:38 -08:00
Darrick J. Wong
435dcf0787 xfs: never try to scrub more than 64 inodes per inobt record
Make sure we never check more than XFS_INODES_PER_CHUNK inodes for any
given inobt record since there can be more than one inobt record mapped
to an inode cluster.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-11 16:06:38 -08:00
Jan Kara
f96c3ac8df ext4: fix crash during online resizing
When computing maximum size of filesystem possible with given number of
group descriptor blocks, we forget to include s_first_data_block into
the number of blocks. Thus for filesystems with non-zero
s_first_data_block it can happen that computed maximum filesystem size
is actually lower than current filesystem size which confuses the code
and eventually leads to a BUG_ON in ext4_alloc_group_tables() hitting on
flex_gd->count == 0. The problem can be reproduced like:

truncate -s 100g /tmp/image
mkfs.ext4 -b 1024 -E resize=262144 /tmp/image 32768
mount -t ext4 -o loop /tmp/image /mnt
resize2fs /dev/loop0 262145
resize2fs /dev/loop0 300000

Fix the problem by properly including s_first_data_block into the
computed number of filesystem blocks.

Fixes: 1c6bd7173d "ext4: convert file system to meta_bg if needed..."
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2019-02-11 13:30:32 -05:00
Steve Magnani
4f5edd82eb udf: disallow RW mount without valid integrity descriptor
Refuse to mount a volume read-write without a coherent Logical Volume
Integrity Descriptor, because we can't generate truly unique IDs without
one.

This fixes a bug where all inodes created on a UDF filesystem following
mount without a coherent LVID are assigned unique ID 0 which can then
confuse other UDF implementations.

Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-11 18:31:35 +01:00
Steve Magnani
e8b4274735 udf: finalize integrity descriptor before writeback
Make sure the CRC and tag checksum of the Logical Volume Integrity
Descriptor are valid before the structure is written out to disk.
Otherwise, unless the filesystem is unmounted gracefully, the on-disk
LVID will be invalid - which is unnecessary filesystem damage.

Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-11 09:26:02 +01:00
Steve Magnani
ebbd5e99f6 udf: factor out LVID finalization for reuse
Centralize timestamping and CRC/checksum updating of the in-core
Logical Volume Integrity Descriptor, in preparation for adding
a third site where this functionality is needed.

Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-11 09:24:07 +01:00
Greg Kroah-Hartman
9481caf39b Merge 5.0-rc6 into driver-core-next
We need the debugfs fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-11 09:09:02 +01:00
Ingo Molnar
c9ba7560c5 Linux 5.0-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxgqNUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwsoH+OVXu0NQofwTvVru
 8lgF3BSDG2mhf7mxbBBlBizGVy9jnjRNGCFMC+Jq8IwiFLwprja/G27kaDTkpuF1
 PHC3yfjKvjTeUP5aNdHlmxv6j1sSJfZl0y46DQal4UeTG/Giq8TFTi+Tbz7Wb/WV
 yCx4Lr8okAwTuNhnL8ojUCVIpd3c8QsyR9v6nEQ14Mj+MvEbokyTkMJV0bzOrM38
 JOB+/X1XY4JPZ6o3MoXrBca3bxbAJzMneq+9CWw1U5eiIG3msg4a+Ua3++RQMDNr
 8BP0yCZ6wo32S8uu0PI6HrZaBnLYi5g9Wh7Q7yc0mn1Uh1zWFykA6TtqK90agJeR
 A6Ktjw==
 =scY4
 -----END PGP SIGNATURE-----

Merge tag 'v5.0-rc6' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-11 08:01:50 +01:00
Theodore Ts'o
6e589291f4 ext4: disallow files with EXT4_JOURNAL_DATA_FL from EXT4_IOC_SWAP_BOOT
A malicious/clueless root user can use EXT4_IOC_SWAP_BOOT to force a
corner casew which can lead to the file system getting corrupted.
There's no usefulness to allowing this, so just prohibit this case.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-02-11 01:07:10 -05:00
yangerkun
abdc644e8c ext4: add mask of ext4 flags to swap
The reason is that while swapping two inode, we swap the flags too.
Some flags such as EXT4_JOURNAL_DATA_FL can really confuse the things
since we're not resetting the address operations structure.  The
simplest way to keep things sane is to restrict the flags that can be
swapped.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2019-02-11 00:35:06 -05:00
yangerkun
aa507b5faf ext4: update quota information while swapping boot loader inode
While do swap between two inode, they swap i_data without update
quota information. Also, swap_inode_boot_loader can do "revert"
somtimes, so update the quota while all operations has been finished.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2019-02-11 00:14:02 -05:00
yangerkun
a46c68a318 ext4: cleanup pagecache before swap i_data
While do swap, we should make sure there has no new dirty page since we
should swap i_data between two inode:
1.We should lock i_mmap_sem with write to avoid new pagecache from mmap
read/write;
2.Change filemap_flush to filemap_write_and_wait and move them to the
space protected by inode lock to avoid new pagecache from buffer read/write.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2019-02-11 00:05:24 -05:00
yangerkun
67a11611e1 ext4: fix check of inode in swap_inode_boot_loader
Before really do swap between inode and boot inode, something need to
check to avoid invalid or not permitted operation, like does this inode
has inline data. But the condition check should be protected by inode
lock to avoid change while swapping. Also some other condition will not
change between swapping, but there has no problem to do this under inode
lock.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2019-02-11 00:02:05 -05:00
Xiaoguang Wang
a297b2fcee ext4: unlock unused_pages timely when doing writeback
In mpage_add_bh_to_extent(), when accumulated extents length is greater
than MAX_WRITEPAGES_EXTENT_LEN or buffer head's b_stat is not equal, we
will not continue to search unmapped area for this page, but note this
page is locked, and will only be unlocked in mpage_release_unused_pages()
after ext4_io_submit, if io also is throttled by blk-throttle or similar
io qos, we will hold this page locked for a while, it's unnecessary.

I think the best fix is to refactor mpage_add_bh_to_extent() to let it
return some hints whether to unlock this page, but given that we will
improve dioread_nolock later, we can let it done later, so currently
the simple fix would just call mpage_release_unused_pages() before
ext4_io_submit().

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-02-10 23:53:21 -05:00
zhangyi (F)
16e08b14a4 ext4: cleanup clean_bdev_aliases() calls
Now, we have already handle all cases of forgetting buffer in
jbd2_journal_forget(), the buffer should not be mapped to blockdevice
when reallocating it. So this patch remove all clean_bdev_aliases() and
clean_bdev_bh_alias() calls which were invoked by ext4 explicitly.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-02-10 23:32:07 -05:00
zhangyi (F)
597599268e jbd2: discard dirty data when forgetting an un-journalled buffer
We do not unmap and clear dirty flag when forgetting a buffer without
journal or does not belongs to any transaction, so the invalid dirty
data may still be written to the disk later. It's fine if the
corresponding block is never used before the next mount, and it's also
fine that we invoke clean_bdev_aliases() related functions to unmap
the block device mapping when re-allocating such freed block as data
block. But this logic is somewhat fragile and risky that may lead to
data corruption if we forget to clean bdev aliases. So, It's better to
discard dirty data during forget time.

We have been already handled all the cases of forgetting journalled
buffer, this patch deal with the remaining two cases.

- buffer is not journalled yet,
- buffer is journalled but doesn't belongs to any transaction.

We invoke __bforget() instead of __brelese() when forgetting an
un-journalled buffer in jbd2_journal_forget(). After this patch we can
remove all clean_bdev_aliases() related calls in ext4.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-02-10 23:26:06 -05:00
zhangyi (F)
904cdbd41d jbd2: clear dirty flag when revoking a buffer from an older transaction
Now, we capture a data corruption problem on ext4 while we're truncating
an extent index block. Imaging that if we are revoking a buffer which
has been journaled by the committing transaction, the buffer's jbddirty
flag will not be cleared in jbd2_journal_forget(), so the commit code
will set the buffer dirty flag again after refile the buffer.

fsx                               kjournald2
                                  jbd2_journal_commit_transaction
jbd2_journal_revoke                commit phase 1~5...
 jbd2_journal_forget
   belongs to older transaction    commit phase 6
   jbddirty not clear               __jbd2_journal_refile_buffer
                                     __jbd2_journal_unfile_buffer
                                      test_clear_buffer_jbddirty
                                       mark_buffer_dirty

Finally, if the freed extent index block was allocated again as data
block by some other files, it may corrupt the file data after writing
cached pages later, such as during unmount time. (In general,
clean_bdev_aliases() related helpers should be invoked after
re-allocation to prevent the above corruption, but unfortunately we
missed it when zeroout the head of extra extent blocks in
ext4_ext_handle_unwritten_extents()).

This patch mark buffer as freed and set j_next_transaction to the new
transaction when it already belongs to the committing transaction in
jbd2_journal_forget(), so that commit code knows it should clear dirty
bits when it is done with the buffer.

This problem can be reproduced by xfstests generic/455 easily with
seeds (3246 3247 3248 3249).

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2019-02-10 23:23:04 -05:00
Nikolay Borisov
82dd124c40 ext4: replace opencoded i_writecount usage with inode_is_open_for_write()
There is a function which clearly conveys the objective of checking
i_writecount. Additionally the usage in ext4_mb_initialize_context was
wrong, since a node would have wrongfully been reported as writable if
i_writecount had a negative value (MMAP_DENY_WRITE).

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-02-10 23:04:16 -05:00
Thomas Gleixner
c2da3f1b71 proc/stat: Make the interrupt statistics more efficient
Waiman reported that on large systems with a large amount of interrupts the
readout of /proc/stat takes a long time to sum up the interrupt
statistics. In principle this is not a problem. but for unknown reasons
some enterprise quality software reads /proc/stat with a high frequency.

The reason for this is that interrupt statistics are accounted per cpu. So
the /proc/stat logic has to sum up the interrupt stats for each interrupt.

The interrupt core provides now a per interrupt summary counter which can
be used to avoid the summation loops completely except for interrupts
marked PER_CPU which are only a small fraction of the interrupt space if at
all.

Another simplification is to iterate only over the active interrupts and
skip the potentially large gaps in the interrupt number space and just
print zeros for the gaps without going into the interrupt core in the first
place.

Waiman provided test results from a 4-socket IvyBridge-EX system (60-core
120-thread, 3016 irqs) excuting a test program which reads /proc/stat
50,000 times:

   Before: 18.436s (sys 18.380s)
   After:   3.769s (sys  3.742s)

Reported-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: https://lkml.kernel.org/r/20190208135021.013828701@linutronix.de
2019-02-10 21:34:46 +01:00
Thomas Gleixner
41ea39101d y2038: Add time64 system calls
This series finally gets us to the point of having system calls with
 64-bit time_t on all architectures, after a long time of incremental
 preparation patches.
 
 There was actually one conversion that I missed during the summer,
 i.e. Deepa's timex series, which I now updated based the 5.0-rc1 changes
 and review comments.
 
 The following system calls are now added on all 32-bit architectures
 using the same system call numbers:
 
 403 clock_gettime64
 404 clock_settime64
 405 clock_adjtime64
 406 clock_getres_time64
 407 clock_nanosleep_time64
 408 timer_gettime64
 409 timer_settime64
 410 timerfd_gettime64
 411 timerfd_settime64
 412 utimensat_time64
 413 pselect6_time64
 414 ppoll_time64
 416 io_pgetevents_time64
 417 recvmmsg_time64
 418 mq_timedsend_time64
 419 mq_timedreceiv_time64
 420 semtimedop_time64
 421 rt_sigtimedwait_time64
 422 futex_time64
 423 sched_rr_get_interval_time64
 
 Each one of these corresponds directly to an existing system call
 that includes a 'struct timespec' argument, or a structure containing
 a timespec or (in case of clock_adjtime) timeval. Not included here
 are new versions of getitimer/setitimer and getrusage/waitid, which
 are planned for the future but only needed to make a consistent API
 rather than for correct operation beyond y2038. These four system
 calls are based on 'timeval', and it has not been finally decided
 what the replacement kernel interface will use instead.
 
 So far, I have done a lot of build testing across most architectures,
 which has found a number of bugs. Runtime testing so far included
 testing LTP on 32-bit ARM with the existing system calls, to ensure
 we do not regress for existing binaries, and a test with a 32-bit
 x86 build of LTP against a modified version of the musl C library
 that has been adapted to the new system call interface [3].
 This library can be used for testing on all architectures supported
 by musl-1.1.21, but it is not how the support is getting integrated
 into the official musl release. Official musl support is planned
 but will require more invasive changes to the library.
 
 Link: https://lore.kernel.org/lkml/20190110162435.309262-1-arnd@arndb.de/T/
 Link: https://lore.kernel.org/lkml/20190118161835.2259170-1-arnd@arndb.de/
 Link: https://git.linaro.org/people/arnd/musl-y2038.git/ [2]
 Signed-off-by: Arnd Bergmann <arnd@arndb.de>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJcXf7/AAoJEGCrR//JCVInPSUP/RhsQSCKMGtONB/vVICQhwep
 PybhzBSpHWFxszzTi6BEPN1zS9B069G9mDollRBYZCckyPqL/Bv6sI/vzQZdNk01
 Q6Nw92OnNE1QP8owZ5TjrZhpbtopWdqIXjsbGZlloUemvuJP2JwvKovQUcn5CPTQ
 jbnqU04CVyFFJYVxAnGJ+VSeWNrjW/cm/m+rhLFjUcwW7Y3aodxsPqPP6+K9hY9P
 yIWfcH42WBeEWGm1RSBOZOScQl4SGCPUAhFydl/TqyEQagyegJMIyMOv9wZ5AuTT
 xK644bDVmNsrtJDZDpx+J8hytXCk1LrnKzkHR/uK80iUIraF/8D7PlaPgTmEEjko
 XcrywEkvkXTVU3owCm2/sbV+8fyFKzSPipnNfN1JNxEX71A98kvMRtPjDueQq/GA
 Yh81rr2YLF2sUiArkc2fNpENT7EGhrh1q6gviK3FB8YDgj1kSgPK5wC/X0uolC35
 E7iC2kg4NaNEIjhKP/WKluCaTvjRbvV+0IrlJLlhLTnsqbA57ZKCCteiBrlm7wQN
 4csUtCyxchR9Ac2o/lj+Mf53z68Zv74haIROp18K2dL7ZpVcOPnA3XHeauSAdoyp
 wy2Ek6ilNvlNB+4x+mRntPoOsyuOUGv7JXzB9JvweLWUd9G7tvYeDJQp/0YpDppb
 K4UWcKnhtEom0DgK08vY
 =IZVb
 -----END PGP SIGNATURE-----

Merge tag 'y2038-new-syscalls' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground into timers/2038

Pull y2038 - time64 system calls from Arnd Bergmann:

This series finally gets us to the point of having system calls with 64-bit
time_t on all architectures, after a long time of incremental preparation
patches.

There was actually one conversion that I missed during the summer,
i.e. Deepa's timex series, which I now updated based the 5.0-rc1 changes
and review comments.

The following system calls are now added on all 32-bit architectures using
the same system call numbers:

403 clock_gettime64
404 clock_settime64
405 clock_adjtime64
406 clock_getres_time64
407 clock_nanosleep_time64
408 timer_gettime64
409 timer_settime64
410 timerfd_gettime64
411 timerfd_settime64
412 utimensat_time64
413 pselect6_time64
414 ppoll_time64
416 io_pgetevents_time64
417 recvmmsg_time64
418 mq_timedsend_time64
419 mq_timedreceiv_time64
420 semtimedop_time64
421 rt_sigtimedwait_time64
422 futex_time64
423 sched_rr_get_interval_time64

Each one of these corresponds directly to an existing system call that
includes a 'struct timespec' argument, or a structure containing a timespec
or (in case of clock_adjtime) timeval. Not included here are new versions
of getitimer/setitimer and getrusage/waitid, which are planned for the
future but only needed to make a consistent API rather than for correct
operation beyond y2038. These four system calls are based on 'timeval', and
it has not been finally decided what the replacement kernel interface will
use instead.

So far, I have done a lot of build testing across most architectures, which
has found a number of bugs. Runtime testing so far included testing LTP on
32-bit ARM with the existing system calls, to ensure we do not regress for
existing binaries, and a test with a 32-bit x86 build of LTP against a
modified version of the musl C library that has been adapted to the new
system call interface [3].  This library can be used for testing on all
architectures supported by musl-1.1.21, but it is not how the support is
getting integrated into the official musl release. Official musl support is
planned but will require more invasive changes to the library.

Link: https://lore.kernel.org/lkml/20190110162435.309262-1-arnd@arndb.de/T/
Link: https://lore.kernel.org/lkml/20190118161835.2259170-1-arnd@arndb.de/
Link: https://git.linaro.org/people/arnd/musl-y2038.git/ [2]
2019-02-10 21:24:43 +01:00
Linus Torvalds
e5a8a11632 for-linus-20190209
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlxfARoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjsgEACP8vQzbvsOZOxHKi9Vcd8ziwyjyBebNh4F
 cKOx2Blgv0ReVAqLOVp9VJOJQoVQumV1btaA2YrmevxnCMpNUBpbP6G02tAqe9Z+
 D75FSpZXy4UvcMSlhfc/iB/RMI06benI9LnuL7zbzIQtrbtu+OFRnO6fpQOVGLxT
 Qa1wt/Rgahc48L4aHnIgPn0nyBRsEvuhC6FjI2D8akDaNiaHzwtGbpx7yDdmLNml
 fCzC2uSRJ31bXsO/5/fJorinaJ56r5N8aHaINYwXDv8zd8i94nQZhITAasXub1Km
 0nyuAg/fSzIdkrGmPINTKFaGYsOfRwpS4C4vagreBhzjfolPY0z9sQEQ63gZzDrd
 mAjHPxLTd165OLlR/RxoMC8AjZCZ0/YQaucxUOPkaIHfth5/dy5BFaCkWyA/I7/Z
 VnAyq0SqeL4hgIOGxZM0HeehKx+palNdJNZTcY7vF/7MVPuh5WM6z/FWsFa8k+ss
 B9YN4wchh7I8EVbLmfz9s/eqabRWF3Agh1dE+yAKwt1KIWHaMXWZTRQnj/69fs2e
 s3pwVMiiSz6K/Xnoe12nmQ4K0XeyKNROO78IIGY/Oa0Pe/hzCAaJMRMDsLp5EcJj
 dxpoi1OfGHMGoqYhL6tx6Atq5f6CMDrS28k/D44DHfO7T1qQGVy1A9SY7ZCfM5+c
 HKxTuRh8mg==
 =tuL6
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190209' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request from Christoph, fixing namespace locking when
   dealing with the effects log, and a rapid add/remove issue (Keith)

 - blktrace tweak, ensuring requests with -1 sectors are shown (Jan)

 - link power management quirk for a Smasung SSD (Hans)

 - m68k nfblock dynamic major number fix (Chengguang)

 - series fixing blk-iolatency inflight counter issue (Liu)

 - ensure that we clear ->private when setting up the aio kiocb (Mike)

 - __find_get_block_slow() rate limit print (Tetsuo)

* tag 'for-linus-20190209' of git://git.kernel.dk/linux-block:
  blk-mq: remove duplicated definition of blk_mq_freeze_queue
  Blk-iolatency: warn on negative inflight IO counter
  blk-iolatency: fix IO hang due to negative inflight counter
  blktrace: Show requests without sector
  fs: ratelimit __find_get_block_slow() failure message.
  m68k: set proper major_num when specifying module param major_num
  libata: Add NOLPM quirk for SAMSUNG MZ7TE512HMHP-000L1 SSD
  nvme-pci: fix rapid add remove sequence
  nvme: lock NS list changes while handling command effects
  aio: initialize kiocb private in case any filesystems expect it.
2019-02-09 10:26:09 -08:00
David S. Miller
a655fe9f19 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
An ipvlan bug fix in 'net' conflicted with the abstraction away
of the IPV6 specific support in 'net-next'.

Similarly, a bug fix for mlx5 in 'net' conflicted with the flow
action conversion in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08 15:00:17 -08:00
Linus Torvalds
8c8e62cc98 Driver core fixes for 5.0-rc6
Here are some driver core fixes for 5.0-rc6.
 
 Well, not so much "driver core" as "debugfs".  There's a lot of
 outstanding debugfs cleanup patches coming in through different
 subsystem trees, and in that process the debugfs core was found that it
 really should return errors when something bad happens, to prevent
 random files from showing up in the root of debugfs afterward.  So
 debugfs was fixed up to handle this properly, and then two fixes for
 the relay and blk-mq code was needed as it was making invalid
 assumptions about debugfs return values.
 
 There's also a cacheinfo fix in here that resolves a tiny issue.
 
 All of these have been in linux-next for over a week with no reported
 problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXF069g8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yk0+gCgy9PTVAJR5ZbYtWTJOTdBnd7pfqMAoMuGxc+6
 LLEbfSykLRxEf5SeOJun
 =KP8e
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Here are some driver core fixes for 5.0-rc6.

  Well, not so much "driver core" as "debugfs". There's a lot of
  outstanding debugfs cleanup patches coming in through different
  subsystem trees, and in that process the debugfs core was found that
  it really should return errors when something bad happens, to prevent
  random files from showing up in the root of debugfs afterward. So
  debugfs was fixed up to handle this properly, and then two fixes for
  the relay and blk-mq code was needed as it was making invalid
  assumptions about debugfs return values.

  There's also a cacheinfo fix in here that resolves a tiny issue.

  All of these have been in linux-next for over a week with no reported
  problems"

* tag 'driver-core-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  blk-mq: protect debugfs_create_files() from failures
  relay: check return of create_buf_file() properly
  debugfs: debugfs_lookup() should return NULL if not found
  debugfs: return error values, not NULL
  debugfs: fix debugfs_rename parameter checking
  cacheinfo: Keep the old value if of_property_read_u32 fails
2019-02-08 10:53:44 -08:00
Linus Torvalds
bd5ff862ec Changes since last update:
- Fix cache coherency problem with writeback mappings
 - Fix buffer deadlock when shutting fs down
 - Fix a null pointer dereference when running online repair
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlxYbyEACgkQ+H93GTRK
 tOtLNg//VIiU6w1EgiQBgk8XxN3i9YQLBIzfbtgZGtLJWX8zYnGx9H4iVfZ9UDr8
 eKFoPVLvqZo5mutuot+3+ps5H4z6g9BotS048FJjFoarQwtYqnG3tcFkmLDInKW1
 jTPWBV/P7w+ODyPO082SyQ+Zn9pooyXkPoBbgA+vbQoqIsY5IF7VeFasrFMtRu21
 PEm/CpMxK5VMly+5ceOoqtdlWvRPDfczLfzW/iDZ4Qs2itUqFA6TJo5TD7kd4A/f
 yrjV6H5tWtp0uvBCBDq4W225uVUFVWC+wTrrII6qbvuDBNWfBsQ65GczYubAAu9X
 kdJdY3xj/Br1dk6jLTciCjihbjJ49xaxXfLAokNkh1pjHqHyinB5ALXC0dG4o+eo
 d5y5qo10zt3HZ8Kzr8753SzxRBjGbQhok+ytrBSpX8GckhAXmH5S6WZDFDh6PbJj
 5PSwvL7FNbS4M/Myjl+dwk3kWLVrGV2SglOJxCCsqCZPxNzopIrNf1uazLTZV+/2
 d+G7LQPXSvjK/iLfDQH/6sBIREx0nd3H/6mnmWBg/1xMD6z/Hgn8GJvAE8luRtOi
 usXYcjlkSEOSwxbUC4fCo0CrPp8DOHbrEEO4pavTN+GVIYsIen0ghq/x0HfcOCEu
 XguyRTYdQXnLZNo3zCqmnU4/C1W2L5Oce4IDznH8PEIgqAqXXrQ=
 =XsPH
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "Here are a handful of XFS fixes to fix a data corruption problem, a
  crasher bug, and a deadlock.

  Summary:

   - Fix cache coherency problem with writeback mappings

   - Fix buffer deadlock when shutting fs down

   - Fix a null pointer dereference when running online repair"

* tag 'xfs-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: set buffer ops when repair probes for btree type
  xfs: end sync buffer I/O properly on shutdown error
  xfs: eof trim writeback mapping as soon as it is cached
2019-02-08 10:46:14 -08:00
Ayush Mittal
26e28d68b1 kernfs: Allocating memory for kernfs_iattrs with kmem_cache.
Creating a new cache for kernfs_iattrs.
Currently, memory is allocated with kzalloc() which
always gives aligned memory. On ARM, this is 64 byte aligned.
To avoid the wastage of memory in aligning the size requested,
a new cache for kernfs_iattrs is created.

Size of struct kernfs_iattrs is 80 Bytes.
On ARM, it will come in kmalloc-128 slab.
and it will come in kmalloc-192 slab if debug info is enabled.
Extra bytes taken 48 bytes.

Total number of objects created : 4096
Total saving = 48*4096 = 192 KB

After creating new slab(When debug info is enabled) :
sh-3.2# cat /proc/slabinfo
...
kernfs_iattrs_cache   4069   4096    128   32    1 : tunables    0    0    0 : slabdata    128    128      0
...

All testing has been done on ARM target.

Signed-off-by: Ayush Mittal <ayush.m@samsung.com>
Signed-off-by: Vaneet Narang <v.narang@samsung.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-08 12:57:32 +01:00
Ondrej Mosnacek
5b2f2bd62e sysfs: remove unused include of kernfs-internal.h
This include is not needed (fs/sysfs/file.c builds just fine without
it). Remove it.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-08 12:57:31 +01:00
Linus Torvalds
ee6c0737a0 Two small nfsd bugfixes for 5.0, for an RDMA bug and a file clone bug.
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJcXJY6AAoJECebzXlCjuG+scAP/jm9ETUxC3E9ZcVetSLs7vf1
 AxHVsr3E3qf9uViq/+1NBRJ/BKE1Porzi1Uz01ie5MdY6SHWFMxvIvZrTRuEAhbc
 VA9xeRcnX+7RkPWik3sepSUieVy8KgJDMxdE07HwwyzST14I1s5Ev79wo6XiYlTw
 3+ZdZe19Y4owmTkbDiLsxJVsI2Y8b+9BIhZ9/ICRyFzZclnyLdO15HTDr4q7E9cw
 ZEZMOoljX4cjY8cD7tqf68QECxZYm4a8Ba+L6P+oqKajq/6yUrocXA2UG65EMtIQ
 LtMIdpkC+zHFagRQBC3ymqEWTpX9ED6TA4H4ZSdh6UH8NwYXHsxQUYfI4Oetju/B
 iqWBbXpwg2jMNRDhXS/KdNezYcGjYGJZ03dezeSP/GwojoiKALD0iXxWU2zyq0Qs
 crnhmc2j1wZZl5CFXLCYwjDjHbeH6gWGfLuzAv1Q9/jQUitQ2CpC4t1MCRdOlWBt
 cqjCleF6Rd3oVk51BdYPm5OyCHyQjrrXmOsx8aHkY774p7TqsaHBjSreTBjQWhO8
 wAfnr6yS4/21Hfrji52Nf4Q6UJ1FFEWB0wXJhYAHzem1RsPGSbnEDam6Bpd6Bh0d
 ZxAU4spoVhezQPjF4JCmHfiAJ7rbegfltJa669rE5L1kbUYYE6TAAcOm8RTSWoPZ
 iGxkg/en5XiGsOGH9AyB
 =++rV
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.0-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "Two small nfsd bugfixes for 5.0, for an RDMA bug and a file clone bug"

* tag 'nfsd-5.0-1' of git://linux-nfs.org/~bfields/linux:
  svcrdma: Remove max_sge check at connect time
  nfsd: Fix error return values for nfsd4_clone_file_range()
2019-02-07 15:44:45 -07:00
Davidlohr Bueso
70f8a3ca68 mm: make mm->pinned_vm an atomic64 counter
Taking a sleeping lock to _only_ increment a variable is quite the
overkill, and pretty much all users do this. Furthermore, some drivers
(ie: infiniband and scif) that need pinned semantics can go to quite
some trouble to actually delay via workqueue (un)accounting for pinned
pages when not possible to acquire it.

By making the counter atomic we no longer need to hold the mmap_sem and
can simply some code around it for pinned_vm users. The counter is 64-bit
such that we need not worry about overflows such as rdma user input
controlled from userspace.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07 12:54:02 -07:00
Amir Goldstein
e7fce6d94c fanotify: report FAN_ONDIR to listener with FAN_REPORT_FID
dirent modification events (create/delete/move) do not carry the
child entry name/inode information. Instead, we report FAN_ONDIR
for mkdir/rmdir so user can differentiate them from creat/unlink.

This is consistent with inotify reporting IN_ISDIR with dirent events
and is useful for implementing recursive directory tree watcher.

We avoid merging dirent events referring to subdirs with dirent events
referring to non subdirs, otherwise, user won't be able to tell from a
mask FAN_CREATE|FAN_DELETE|FAN_ONDIR if it describes mkdir+unlink pair
or rmdir+create pair of events.

For backward compatibility and consistency, do not report FAN_ONDIR
to user in legacy fanotify mode (reporting fd) and report FAN_ONDIR
to user in FAN_REPORT_FID mode for all event types.

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:47:32 +01:00
Amir Goldstein
235328d1fa fanotify: add support for create/attrib/move/delete events
Add support for events with data type FSNOTIFY_EVENT_INODE
(e.g. create/attrib/move/delete) for inode and filesystem mark types.

The "inode" events do not carry enough information (i.e. path) to
report event->fd, so we do not allow setting a mask for those events
unless group supports reporting fid.

The "inode" events are not supported on a mount mark, because they do
not carry enough information (i.e. path) to be filtered by mount point.

The "dirent" events (create/move/delete) report the fid of the parent
directory where events took place without specifying the filename of the
child. In the future, fanotify may get support for reporting filename
information for those events.

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:43:23 +01:00
Amir Goldstein
83b535d289 fanotify: support events with data type FSNOTIFY_EVENT_INODE
When event data type is FSNOTIFY_EVENT_INODE, we don't have a refernece
to the mount, so we will not be able to open a file descriptor when user
reads the event. However, if the listener has enabled reporting file
identifier with the FAN_REPORT_FID init flag, we allow reporting those
events and we use an identifier inode to encode fid.

The inode to use as identifier when reporting fid depends on the event.
For dirent modification events, we report the modified directory inode
and we report the "victim" inode otherwise.
For example:
FS_ATTRIB reports the child inode even if reported on a watched parent.
FS_CREATE reports the modified dir inode and not the created inode.

[JK: Fixup condition in fanotify_group_event_mask()]

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:38:36 +01:00
Amir Goldstein
0321e03cb4 fanotify: check FS_ISDIR flag instead of d_is_dir()
All fsnotify hooks set the FS_ISDIR flag for events that happen
on directory victim inodes except for fsnotify_perm().

Add the missing FS_ISDIR flag in fsnotify_perm() hook and let
fanotify_group_event_mask() check the FS_ISDIR flag instead of
checking if path argument is a directory.

This is needed for fanotify support for event types that do not
carry path information.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:38:36 +01:00
Amir Goldstein
0a20df7ed3 fsnotify: report FS_ISDIR flag with MOVE_SELF and DELETE_SELF events
We need to report FS_ISDIR flag with MOVE_SELF and DELETE_SELF events
for fanotify, because fanotify API requires the user to explicitly
request events on directories by FAN_ONDIR flag.

inotify never reported IN_ISDIR with those events. It looks like an
oversight, but to avoid the risk of breaking existing inotify programs,
mask the FS_ISDIR flag out when reprting those events to inotify backend.

We also add the FS_ISDIR flag with FS_ATTRIB event in the case of rename
over an empty target directory. inotify did not report IN_ISDIR in this
case, but it normally does report IN_ISDIR along with IN_ATTRIB event,
so in this case, we do not mask out the FS_ISDIR flag.

[JK: Simplify the checks in fsnotify_move()]

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:38:35 +01:00
Amir Goldstein
73072283a2 fanotify: use vfs_get_fsid() helper instead of vfs_statfs()
This is a cleanup that doesn't change any logic.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:38:35 +01:00
Amir Goldstein
ec86ff5689 vfs: add vfs_get_fsid() helper
Wrapper around statfs() interface.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:38:35 +01:00
Amir Goldstein
77115225ac fanotify: cache fsid in fsnotify_mark_connector
For FAN_REPORT_FID, we need to encode fid with fsid of the filesystem on
every event. To avoid having to call vfs_statfs() on every event to get
fsid, we store the fsid in fsnotify_mark_connector on the first time we
add a mark and on handle event we use the cached fsid.

Subsequent calls to add mark on the same object are expected to pass the
same fsid, so the call will fail on cached fsid mismatch.

If an event is reported on several mark types (inode, mount, filesystem),
all connectors should already have the same fsid, so we use the cached
fsid from the first connector.

[JK: Simplify code flow around fanotify_get_fid()
     make fsid argument of fsnotify_add_mark_locked() unconditional]

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:38:35 +01:00
Amir Goldstein
a8b13aa20a fanotify: enable FAN_REPORT_FID init flag
When setting up an fanotify listener, user may request to get fid
information in event instead of an open file descriptor.

The fid obtained with event on a watched object contains the file
handle returned by name_to_handle_at(2) and fsid returned by statfs(2).

Restrict FAN_REPORT_FID to class FAN_CLASS_NOTIF, because we have have
no good reason to support reporting fid on permission events.

When setting a mark, we need to make sure that the filesystem
supports encoding file handles with name_to_handle_at(2) and that
statfs(2) encodes a non-zero fsid.

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:38:35 +01:00
Amir Goldstein
5e469c830f fanotify: copy event fid info to user
If group requested FAN_REPORT_FID and event has file identifier,
copy that information to user reading the event after event metadata.

fid information is formatted as struct fanotify_event_info_fid
that includes a generic header struct fanotify_event_info_header,
so that other info types could be defined in the future using the
same header.

metadata->event_len includes the length of the fid information.

The fid information includes the filesystem's fsid (see statfs(2))
followed by an NFS file handle of the file that could be passed as
an argument to open_by_handle_at(2).

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:38:35 +01:00
Amir Goldstein
e9e0c89030 fanotify: encode file identifier for FAN_REPORT_FID
When user requests the flag FAN_REPORT_FID in fanotify_init(),
a unique file identifier of the event target object will be reported
with the event.

The file identifier includes the filesystem's fsid (i.e. from statfs(2))
and an NFS file handle of the file (i.e. from name_to_handle_at(2)).

The file identifier makes holding the path reference and passing a file
descriptor to user redundant, so those are disabled in a group with
FAN_REPORT_FID.

Encode fid and store it in event for a group with FAN_REPORT_FID.
Up to 12 bytes of file handle on 32bit arch (16 bytes on 64bit arch)
are stored inline in fanotify_event struct. Larger file handles are
stored in an external allocated buffer.

On failure to encode fid, we print a warning and queue the event
without the fid information.

[JK: Fold part of later patched into this one to use
exportfs_encode_inode_fh() right away]

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:38:34 +01:00
Amir Goldstein
bb2f7b4542 fanotify: open code fill_event_metadata()
The helper is quite trivial and open coding it will make it easier
to implement copying event fid info to user.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-07 16:37:01 +01:00
Linus Torvalds
076a3f5537 fuse fixes for 5.0-rc6
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXFls+wAKCRDh3BK/laaZ
 PC+hAQDRkyJeAmMzpHwvv/IASqpJgc6HrSzH0p201lDyARcKIAD+MWxZHYP4ltAn
 WVTLIvYT1xsoqGG3plfZ/d1iNbAWcwU=
 =cL/O
 -----END PGP SIGNATURE-----

Merge tag 'fuse-fixes-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse fixes from Miklos Szeredi:
 "A fix for a CUSE regression introduced in v4.20, as well as fixes for
  a couple of old bugs"

* tag 'fuse-fixes-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: decrement NR_WRITEBACK_TEMP on the right page
  fuse: call pipe_buf_release() under pipe lock
  cuse: fix ioctl
  fuse: handle zero sized retrieve correctly
2019-02-07 07:52:08 +00:00
Arnd Bergmann
8dabe7245b y2038: syscalls: rename y2038 compat syscalls
A lot of system calls that pass a time_t somewhere have an implementation
using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
been reworked so that this implementation can now be used on 32-bit
architectures as well.

The missing step is to redefine them using the regular SYSCALL_DEFINEx()
to get them out of the compat namespace and make it possible to build them
on 32-bit architectures.

Any system call that ends in 'time' gets a '32' suffix on its name for
that version, while the others get a '_time32' suffix, to distinguish
them from the normal version, which takes a 64-bit time argument in the
future.

In this step, only 64-bit architectures are changed, doing this rename
first lets us avoid touching the 32-bit architectures twice.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-02-07 00:13:27 +01:00
Dan Carpenter
1c3da4452d nfsd: fix an IS_ERR() vs NULL check
The get_backchannel_cred() used to return error pointers on error but
now it returns NULL pointers.

Fixes: 97f68c6b02 ("SUNRPC: add 'struct cred *' to auth_cred and rpc_cre")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-02-06 15:37:14 -05:00
Trond Myklebust
e3fdc89ca4 nfsd: Fix error return values for nfsd4_clone_file_range()
If the parameter 'count' is non-zero, nfsd4_clone_file_range() will
currently clobber all errors returned by vfs_clone_file_range() and
replace them with EINVAL.

Fixes: 42ec3d4c02 ("vfs: make remap_file_range functions take and...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-02-06 15:32:05 -05:00
Tetsuo Handa
43636c804d fs: ratelimit __find_get_block_slow() failure message.
When something let __find_get_block_slow() hit all_mapped path, it calls
printk() for 100+ times per a second. But there is no need to print same
message with such high frequency; it is just asking for stall warning, or
at least bloating log files.

  [  399.866302][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
  [  399.873324][T15342] b_state=0x00000029, b_size=512
  [  399.878403][T15342] device loop0 blocksize: 4096
  [  399.883296][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
  [  399.890400][T15342] b_state=0x00000029, b_size=512
  [  399.895595][T15342] device loop0 blocksize: 4096
  [  399.900556][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8
  [  399.907471][T15342] b_state=0x00000029, b_size=512
  [  399.912506][T15342] device loop0 blocksize: 4096

This patch reduces frequency to up to once per a second, in addition to
concatenating three lines into one.

  [  399.866302][T15342] __find_get_block_slow() failed. block=1, b_blocknr=8, b_state=0x00000029, b_size=512, device loop0 blocksize: 4096

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-06 12:58:56 -07:00
Matthew Wilcox
fd9dc93e36 XArray: Change xa_insert to return -EBUSY
Userspace translates EEXIST to "File exists" which isn't a very good
error message for the problem.  "Device or resource busy" is a better
indication of what went wrong.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2019-02-06 13:12:15 -05:00
Mike Marshall
ec51f8ee1e aio: initialize kiocb private in case any filesystems expect it.
A recent optimization had left private uninitialized.

Fixes: 2bc4ca9bb6 ("aio: don't zero entire aio_kiocb aio_get_req()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-06 08:04:22 -07:00
Amir Goldstein
33913997d5 fanotify: rename struct fanotify_{,perm_}event_info
struct fanotify_event_info "inherits" from struct fsnotify_event and
therefore a more appropriate (and short) name for it is fanotify_event.
Same for struct fanotify_perm_event_info, which now "inherits" from
struct fanotify_event.

We plan to reuse the name struct fanotify_event_info for user visible
event info record format.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-06 15:25:45 +01:00
Amir Goldstein
a0a92d261f fsnotify: move mask out of struct fsnotify_event
Common fsnotify_event helpers have no need for the mask field.
It is only used by backend code, so move the field out of the
abstract fsnotify_event struct and into the concrete backend
event structs.

This change packs struct inotify_event_info better on 64bit
machine and will allow us to cram some more fields into
struct fanotify_event_info.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-06 15:25:11 +01:00
Amir Goldstein
45a9fb3725 fsnotify: send all event types to super block marks
So far, existence of super block marks was checked only on events with
data type FSNOTIFY_EVENT_PATH. Use the super block of the "to_tell" inode
to report the events of all event types to super block marks.

This change has no effect on current backends. Soon, this will allow
fanotify backend to receive all event types on a super block mark.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-02-06 15:20:30 +01:00
Christoph Hellwig
80f2121380 scsi: fs: remove exofs
This was an example for using the SCSI OSD protocol, which we're trying
to remove.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-02-05 21:28:13 -05:00
Mimi Zohar
fdb2410f77 ima: define ima_post_create_tmpfile() hook and add missing call
If tmpfiles can be made persistent, then newly created tmpfiles need to
be treated like any other new files in policy.

This patch indicates which newly created tmpfiles are in policy, causing
the file hash to be calculated on __fput().

Reported-by: Ignaz Forster <ignaz.forster@gmx.de>
[rgoldwyn@suse.com: Call ima_post_create_tmpfile() in vfs_tmpfile() as
opposed to do_tmpfile(). This will help the case for overlayfs where
copy_up is denied while overwriting a file.]
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2019-02-04 17:36:01 -05:00
Vivek Goyal
5f32879ea3 ovl: During copy up, first copy up data and then xattrs
If a file with capability set (and hence security.capability xattr) is
written kernel clears security.capability xattr. For overlay, during file
copy up if xattrs are copied up first and then data is, copied up. This
means data copy up will result in clearing of security.capability xattr
file on lower has. And this can result into surprises. If a lower file has
CAP_SETUID, then it should not be cleared over copy up (if nothing was
actually written to file).

This also creates problems with chown logic where it first copies up file
and then tries to clear setuid bit. But by that time security.capability
xattr is already gone (due to data copy up), and caller gets -ENODATA.
This has been reported by Giuseppe here.

https://github.com/containers/libpod/issues/2015#issuecomment-447824842

Fix this by copying up data first and then metadta. This is a regression
which has been introduced by my commit as part of metadata only copy up
patches.

TODO: There will be some corner cases where a file is copied up metadata
only and later data copy up happens and that will clear security.capability
xattr. Something needs to be done about that too.

Fixes: bd64e57586 ("ovl: During copy up, first copy up metadata and then data")
Cc: <stable@vger.kernel.org> # v4.19+
Reported-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-04 09:09:57 +01:00
Elena Reshetova
d036bda7d0 sched/core: Convert sighand_struct.count to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:

 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable sighand_struct.count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

** Important note for maintainers:

Some functions from refcount_t API defined in lib/refcount.c
have different memory ordering guarantees than their atomic
counterparts.

The full comparison can be seen in
https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
in state to be merged to the documentation tree.

Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in
some rare cases it might matter.

Please double check that you don't have some undocumented
memory guarantees for this variable usage.

For the sighand_struct.count it might make a difference
in following places:

 - __cleanup_sighand: decrement in refcount_dec_and_test() only
   provides RELEASE ordering and control dependency on success
   vs. fully ordered atomic counterpart

Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: viro@zeniv.linux.org.uk
Link: https://lkml.kernel.org/r/1547814450-18902-2-git-send-email-elena.reshetova@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04 08:53:52 +01:00
Darrick J. Wong
add46b3b02 xfs: set buffer ops when repair probes for btree type
In xrep_findroot_block, we work out the btree type and correctness of a
given block by calling different btree verifiers on root block
candidates.  However, we leave the NULL b_ops while ->verify_read
validates the block, which means that if the verifier calls
xfs_buf_verifier_error it'll crash on the null b_ops.  Fix it to set
b_ops before calling the verifier and unsetting it if the verifier
fails.

Furthermore, improve the documentation around xfs_buf_ensure_ops, which
is the function that is responsible for cleaning up the b_ops state of
buffers that go through xrep_findroot_block but don't match anything.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-03 14:03:59 -08:00
Brian Foster
465fa17f4a xfs: end sync buffer I/O properly on shutdown error
As of commit e339dd8d8b ("xfs: use sync buffer I/O for sync delwri
queue submission"), the delwri submission code uses sync buffer I/O
for sync delwri I/O. Instead of waiting on async I/O to unlock the
buffer, it uses the underlying sync I/O completion mechanism.

If delwri buffer submission fails due to a shutdown scenario, an
error is set on the buffer and buffer completion never occurs. This
can cause xfs_buf_delwri_submit() to deadlock waiting on a
completion event.

We could check the error state before waiting on such buffers, but
that doesn't serialize against the case of an error set via a racing
I/O completion. Instead, invoke I/O completion in the shutdown case
regardless of buffer I/O type.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-03 14:03:06 -08:00
Brian Foster
aa6ee4ab69 xfs: eof trim writeback mapping as soon as it is cached
The cached writeback mapping is EOF trimmed to try and avoid races
between post-eof block management and writeback that result in
sending cached data to a stale location. The cached mapping is
currently trimmed on the validation check, which leaves a race
window between the time the mapping is cached and when it is trimmed
against the current inode size.

For example, if a new mapping is cached by delalloc conversion on a
blocksize == page size fs, we could cycle various locks, perform
memory allocations, etc.  in the writeback codepath before the
associated mapping is eventually trimmed to i_size. This leaves
enough time for a post-eof truncate and file append before the
cached mapping is trimmed. The former event essentially invalidates
a range of the cached mapping and the latter bumps the inode size
such the trim on the next writepage event won't trim all of the
invalid blocks. fstest generic/464 reproduces this scenario
occasionally and causes a lost writeback and stale delalloc blocks
warning on inode inactivation.

To work around this problem, trim the cached writeback mapping as
soon as it is cached in addition to on subsequent validation checks.
This is a minor tweak to tighten the race window as much as possible
until a proper invalidation mechanism is available.

Fixes: 40214d128e ("xfs: trim writepage mapping to within eof")
Cc: <stable@vger.kernel.org> # v4.14+
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-03 14:02:49 -08:00
Deepa Dinamani
45bdc66159 socket: Rename SO_RCVTIMEO/ SO_SNDTIMEO with _OLD suffixes
SO_RCVTIMEO and SO_SNDTIMEO socket options use struct timeval
as the time format. struct timeval is not y2038 safe.
The subsequent patches in the series add support for new socket
timeout options with _NEW suffix that will use y2038 safe
data structures. Although the existing struct timeval layout
is sufficiently wide to represent timeouts, because of the way
libc will interpret time_t based on user defined flag, these
new flags provide a way of having a structure that is the same
for all architectures consistently.
Rename the existing options with _OLD suffix forms so that the
right option is enabled for userspace applications according
to the architecture and time_t definition of libc.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Cc: ccaulfie@redhat.com
Cc: deller@gmx.de
Cc: paulus@samba.org
Cc: ralf@linux-mips.org
Cc: rth@twiddle.net
Cc: cluster-devel@redhat.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 11:17:31 -08:00
Linus Torvalds
312b3a93dd for-5.0-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlxWtJgACgkQxWXV+ddt
 WDsDow//ZpnyDwQWvSIfF2UUQOPlcBjbHKuBA7rU0wdybW635QYGR0mqnI+1VnMj
 7ssUkeN6N0a2gQzrUG4Y+zpdzOWv2xQ4jKZ9GMOp9gwyzEFyPkcFXOnmM8UfYtVu
 e3fK65e8BZHmTeu0kGKah4Dt1g0t4fUmhsKR4Pfp5YNJC+zuuGTwUW1K/ZQHXJ+3
 kTHc7WP1lsF7wgaZ+Gl+Kvp8fVrHVdygMVTdRBW8QaBgPLa/KExvK62jW+NmCYhj
 7OIkWdew7e8IXc3Ie5IbOomHAv7IgqqgiO9VO9+n0EpyV4UobUgxrgBKJ+0yc1Ya
 eidbKhMslwUE50y00JVm+vw0gwQHkR+hZDn/mRB6xiIeI8tu/yQIJZ6AhYJXoByR
 cu8+SNO5Z5dOZ1f146ZH8lnkr24tuSnkDUhbRDR5pdb4tAHHej2ALzhbfbwbPEpF
 IverYLw5fOMGeRU/mBsjkVadpSZ4S0HVNU85ERdhLtVLK1PSaY2UkUaA+Ii5y7au
 EYDjaGMflmJ8cAVqgtgedEff2n8OKDnzRZlz4IPLI73MVSITGZkM7PmYmYsLm2Bs
 mDPnmyqR8kzcd1RRtSeZTvqOpAIZG+QUOmD2jiKrchmp54Sz0V/HJWRs3aQybD6Q
 ph0yAcbkvgp/ewe5IFgaI0kcyH7zYdL6GtiI2WUE3/8DObgrgsA=
 =E2PP
 -----END PGP SIGNATURE-----

Merge tag 'for-5.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - regression fix: transaction commit can run away due to delayed ref
   waiting heuristic, this is not necessary now because of the proper
   reservation mechanism introduced in 5.0

 - regression fix: potential crash due to use-before-check of an ERR_PTR
   return value

 - fix for transaction abort during transaction commit that needs to
   properly clean up pending block groups

 - fix deadlock during b-tree node/leaf splitting, when this happens on
   some of the fundamental trees, we must prevent new tree block
   allocation to re-enter indirectly via the block group flushing path

 - potential memory leak after errors during mount

* tag 'for-5.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: On error always free subvol_name in btrfs_mount
  btrfs: clean up pending block groups when transaction commit aborts
  btrfs: fix potential oops in device_list_add
  btrfs: don't end the transaction for delayed refs in throttle
  Btrfs: fix deadlock when allocating tree block during leaf/node split
2019-02-03 08:48:33 -08:00
Linus Torvalds
b9de6efed2 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "24 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (24 commits)
  autofs: fix error return in autofs_fill_super()
  autofs: drop dentry reference only when it is never used
  fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()
  mm: migrate: don't rely on __PageMovable() of newpage after unlocking it
  psi: clarify the Kconfig text for the default-disable option
  mm, memory_hotplug: __offline_pages fix wrong locking
  mm: hwpoison: use do_send_sig_info() instead of force_sig()
  kasan: mark file common so ftrace doesn't trace it
  init/Kconfig: fix grammar by moving a closing parenthesis
  lib/test_kmod.c: potential double free in error handling
  mm, oom: fix use-after-free in oom_kill_process
  mm/hotplug: invalid PFNs from pfn_to_online_page()
  mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages
  psi: fix aggregation idle shut-off
  mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zone
  mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone
  oom, oom_reaper: do not enqueue same task twice
  mm: migrate: make buffer_migrate_page_norefs() actually succeed
  kernel/exit.c: release ptraced tasks before zap_pid_ns_processes
  x86_64: increase stack size for KASAN_EXTRA
  ...
2019-02-02 09:32:58 -08:00
Linus Torvalds
33640d718c SMB3 fixes, some from this week's SMB3 test evemt, 5 for stable and a particularly important one for queryxattr (see xfstests 70 and 117)
-----BEGIN PGP SIGNATURE-----
 
 iQGyBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlxUf8kACgkQiiy9cAdy
 T1Fy8Qv4rlgVRBPcSJpvZKhFjMd5KskVIDnP0tOc5od+Mg+547eAg7UKKnJVDgKS
 OUtiP8uC/UErWvQ214hoXF1sMwoG+rDdTRdIOVDLD2a2gIIJzaIPADhYt4w5kiAX
 nO9NwPwk03KTzNUQmpdFPofbLK4csmJUR4kBNE+XhZTulmw9BkplGXnNkQGjsEdf
 nY6YdfqofE//DmR/KB1BylD0lxDtnfuB5zoELPmM6X4iWtD9W6pKVg23huEv+Dmh
 80LdYt77vI4RKEzOAmsYEpROdAmlCksVQC1nlh5EHOuMN/9TCdpdWRahJWWhbbKr
 /b8TuD+0QUUKEQdGvwM8DM/OCjnwoJphzI8CK1VzRpM4XkZxpQTzng2zVFPNzs4Y
 wwKCsxxTO7ePGg6hE/DTnRn9XJtKbR1nQsIjHcw6rE0ydl+P6YRZDUEehqyRU+jd
 Z09XCPS1+zeUPE6PRE9wlrIt7QdxOynufhaWRKf18b2b4cfKbNGcd+rMMZPagUQe
 x2CMkEU=
 =qX/z
 -----END PGP SIGNATURE-----

Merge tag '5.0-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb3 fixes from Steve French:
 "SMB3 fixes, some from this week's SMB3 test evemt, 5 for stable and a
  particularly important one for queryxattr (see xfstests 70 and 117)"

* tag '5.0-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: update internal module version number
  CIFS: fix use-after-free of the lease keys
  CIFS: Do not consider -ENODATA as stat failure for reads
  CIFS: Do not count -ENODATA as failure for query directory
  CIFS: Fix trace command logging for SMB2 reads and writes
  CIFS: Fix possible oops and memory leaks in async IO
  cifs: limit amount of data we request for xattrs to CIFSMaxBufSize
  cifs: fix computation for MAX_SMB2_HDR_SIZE
2019-02-01 16:53:01 -08:00
Ian Kent
f585b283e3 autofs: fix error return in autofs_fill_super()
In autofs_fill_super() on error of get inode/make root dentry the return
should be ENOMEM as this is the only failure case of the called
functions.

Link: http://lkml.kernel.org/r/154725123240.11260.796773942606871359.stgit@pluto-themaw-net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:24 -08:00
Pan Bian
63ce5f552b autofs: drop dentry reference only when it is never used
autofs_expire_run() calls dput(dentry) to drop the reference count of
dentry.  However, dentry is read via autofs_dentry_ino(dentry) after
that.  This may result in a use-free-bug.  The patch drops the reference
count of dentry only when it is never used.

Link: http://lkml.kernel.org/r/154725122396.11260.16053424107144453867.stgit@pluto-themaw-net
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:24 -08:00
Jan Kara
c27d82f52f fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()
When superblock has lots of inodes without any pagecache (like is the
case for /proc), drop_pagecache_sb() will iterate through all of them
without dropping sb->s_inode_list_lock which can lead to softlockups
(one of our customers hit this).

Fix the problem by going to the slow path and doing cond_resched() in
case the process needs rescheduling.

Link: http://lkml.kernel.org/r/20190114085343.15011-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:24 -08:00
Alexey Dobriyan
1fde6f21d9 proc: fix /proc/net/* after setns(2)
/proc entries under /proc/net/* can't be cached into dcache because
setns(2) can change current net namespace.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: avoid vim miscolorization]
[adobriyan@gmail.com: write test, add dummy ->d_revalidate hook: necessary if /proc/net/* is pinned at setns time]
  Link: http://lkml.kernel.org/r/20190108192350.GA12034@avx2
Link: http://lkml.kernel.org/r/20190107162336.GA9239@avx2
Fixes: 1da4d377f9 ("proc: revalidate misc dentries")
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reported-by: Mateusz Stępień <mateusz.stepien@netrounds.com>
Reported-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:22 -08:00
Linus Torvalds
9ace868a17 Changes since last update:
- fix page migration when using iomap for pagecache management
 - fix a use-after-free bug in the directio code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlxPK1IACgkQ+H93GTRK
 tOvzeA/+I4bWVmovfV+EGFzHSV6zRsy17v6c4ncMrdia41rhmvxl+sAJgj2+uFCb
 J37cCMpPrmAOz+JGIW7PbCt8uzmwaXOfB9p9N58wM+hSSxtlN+wZFsIaoepOUTDK
 t3e2L7QQxQjN9HXZU0RNUi/zgS3poDfzap7cZ71spBxX5hVd1zVQa0q/o5OXr7OI
 sBlZLsIOhmS8WU2TmfkwzUVi+/FR4dCgyP8eDAGho/KbwvO9sfWzLNxf0U/ORWfA
 JG+2LX42eKKjX6wo39zW1mAXwOBhnLnCqOOgKrDy8XRXPARiNAzLNn0AwhxJSAqD
 z3qE298Oag6gZo0lNJjDXIko5D9y2koWjQe7z4fJ2JYpV5mSyq8/F4XDNO2FKIap
 07p0OBGa1yfwQYmS5TrhJvYwvsHqTNs122jpowqeD3o0Xh64y2TZLEMdhhIsRjga
 +S9OSVQ15JDf/QeI5LPGK6Oc6B3JnRYgrYf7g7DYu4eqEsJ3V3pqtbXzjxGkzUjx
 5xf85ujuRUQoKCPZQ00ewmsfZMfOcaYqfhosx6LvR6ZPvPH3Ex3nLHks5ZV/eXfR
 Auusq6XMiHb5ljukfVCa0WStntUl5gMaRha5QJy1Vg5Zd1ikvTkB5CkJFlODJ8hS
 GaEIa58Gf75dkHOvm4bFEOoy2EcE3I+qdwR+0xgg+6gDlNeZZNQ=
 =KYgc
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap fixes from Darrick Wong:
 "A couple of iomap fixes to eliminate some memory corruption and hang
  problems that were reported:

   - fix page migration when using iomap for pagecache management

   - fix a use-after-free bug in the directio code"

* tag 'iomap-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: fix a use after free in iomap_dio_rw
  iomap: get/put the page in iomap_page_create/release()
2019-02-01 10:30:18 -08:00
Jann Horn
01e7187b41 pipe: stop using ->can_merge
Al Viro pointed out that since there is only one pipe buffer type to which
new data can be appended, it isn't necessary to have a ->can_merge field in
struct pipe_buf_operations, we can just check for a magic type.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-01 02:01:45 -05:00
Jann Horn
a0ce2f0aa6 splice: don't merge into linked buffers
Before this patch, it was possible for two pipes to affect each other after
data had been transferred between them with tee():

============
$ cat tee_test.c

int main(void) {
  int pipe_a[2];
  if (pipe(pipe_a)) err(1, "pipe");
  int pipe_b[2];
  if (pipe(pipe_b)) err(1, "pipe");
  if (write(pipe_a[1], "abcd", 4) != 4) err(1, "write");
  if (tee(pipe_a[0], pipe_b[1], 2, 0) != 2) err(1, "tee");
  if (write(pipe_b[1], "xx", 2) != 2) err(1, "write");

  char buf[5];
  if (read(pipe_a[0], buf, 4) != 4) err(1, "read");
  buf[4] = 0;
  printf("got back: '%s'\n", buf);
}
$ gcc -o tee_test tee_test.c
$ ./tee_test
got back: 'abxx'
$
============

As suggested by Al Viro, fix it by creating a separate type for
non-mergeable pipe buffers, then changing the types of buffers in
splice_pipe_to_pipe() and link_pipe().

Cc: <stable@vger.kernel.org>
Fixes: 7c77f0b3f9 ("splice: implement pipe to pipe splicing")
Fixes: 70524490ee ("[PATCH] splice: add support for sys_tee()")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-01 02:01:45 -05:00
Chandan Rajendra
fbdb440132 copy_mount_string: Limit string length to PATH_MAX
On ppc64le, When a string with PAGE_SIZE - 1 (i.e. 64k-1) length is
passed as a "filesystem type" argument to the mount(2) syscall,
copy_mount_string() ends up allocating 64k (the PAGE_SIZE on ppc64le)
worth of space for holding the string in kernel's address space.

Later, in set_precision() (invoked by get_fs_type() ->
__request_module() -> vsnprintf()), we end up assigning
strlen(fs-type-string) i.e. 65535 as the
value to 'struct printf_spec'->precision member. This field has a width
of 16 bits and it is a signed data type. Hence an invalid value ends
up getting assigned. This causes the "WARN_ONCE(spec->precision != prec,
"precision %d too large", prec)" statement inside set_precision() to be
executed.

This commit fixes the bug by limiting the length of the string passed by
copy_mount_string() to strndup_user() to PATH_MAX.

Signed-off-by: Chandan Rajendra <chandan@linux.ibm.com>
Reported-by: Abdul Haleem <abdhalee@linux.ibm.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-01 01:57:33 -05:00
Christoph Hellwig
801e523796 fs: move generic stat response attr handling to vfs_getattr_nosec
generic_fillattr is an optional helper that isn't used by all file
systems, move handling purely based on inode flags to vfs_getattr_nosec,
which is common code.

This fixes setting this flag for file systems not using generic_fillattr
like xfs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-01 01:55:45 -05:00
Christoph Hellwig
5678b5d6a8 orangefs: don't reinitialize result_mask in ->getattr
The caller already initializes it to the basic stats.  Just
clear not supported default bits where needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-01 01:55:45 -05:00
Xiaoguang Wang
53cf978457 jbd2: fix deadlock while checkpoint thread waits commit thread to finish
This issue was found when I tried to put checkpoint work in a separate thread,
the deadlock below happened:
         Thread1                                |   Thread2
__jbd2_log_wait_for_space                       |
jbd2_log_do_checkpoint (hold j_checkpoint_mutex)|
  if (jh->b_transaction != NULL)                |
    ...                                         |
    jbd2_log_start_commit(journal, tid);        |jbd2_update_log_tail
                                                |  will lock j_checkpoint_mutex,
                                                |  but will be blocked here.
                                                |
    jbd2_log_wait_commit(journal, tid);         |
    wait_event(journal->j_wait_done_commit,     |
     !tid_gt(tid, journal->j_commit_sequence)); |
     ...                                        |wake_up(j_wait_done_commit)
  }                                             |

then deadlock occurs, Thread1 will never be waken up.

To fix this issue, drop j_checkpoint_mutex in jbd2_log_do_checkpoint()
when we are going to wait for transaction commit.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-01-31 23:42:11 -05:00
Theodore Ts'o
8fdd60f2ae Revert "ext4: use ext4_write_inode() when fsyncing w/o a journal"
This reverts commit ad211f3e94.

As Jan Kara pointed out, this change was unsafe since it means we lose
the call to sync_mapping_buffers() in the nojournal case.  The
original point of the commit was avoid taking the inode mutex (since
it causes a lockdep warning in generic/113); but we need the mutex in
order to call sync_mapping_buffers().

The real fix to this problem was discussed here:

https://lore.kernel.org/lkml/20181025150540.259281-4-bvanassche@acm.org

The proposed patch was to fix a syzbot complaint, but the problem can
also demonstrated via "kvm-xfstests -c nojournal generic/113".
Multiple solutions were discused in the e-mail thread, but none have
landed in the kernel as of this writing.  Anyway, commit
ad211f3e94 is absolutely the wrong way to suppress the lockdep, so
revert it.

Fixes: ad211f3e94 ("ext4: use ext4_write_inode() when fsyncing w/o a journal")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported: Jan Kara <jack@suse.cz>
2019-01-31 23:41:11 -05:00
Andreas Gruenbacher
e74c98ca2d gfs2: Revert "Fix loop in gfs2_rbm_find"
This reverts commit 2d29f6b96d.

It turns out that the fix can lead to a ~20 percent performance regression
in initial writes to the page cache according to iozone.  Let's revert this
for now to have more time for a proper fix.

Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-31 11:45:11 -08:00
Linus Torvalds
937108b093 NFS client fixes for Linux 5.0
Stable bugfix:
 - Fix up return value on fatal errors in nfs_page_async_flush()
 
 Other bugfix:
 - Fix NULL pointer dereference of dev_name
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlxTOEsACgkQ18tUv7Cl
 QOuS2A//U2J1xz2N8R/k9I4puMXss+DpUAfryNRrDul0qL4tsr7UhHzHezJVl17X
 coPGA/YD+voybyT+eYeACCHUhDMNN8gj2KoCMlE1ueWAbiCOxrS4NgFM2djO3lka
 dlfqgSbVS1Z7+KtEEiFGq/HiF6y0WxanMBHnfhllNbXBDE6W0/+EPdgjX7fZF3FF
 AS6QQmruXL/b1/hJasfTsF3wcHs3y+Y23RP85j4F8aYrcWLOyPUhhuzv/o6Zoh37
 fqltMxueWy+2qpn8dBE+9ILuKnUxnIsIwpF4YFhI7XrQlqMIWYMrShiqSDqYeVUP
 3qdX8LtRR2VsNCTDR9HamVtCkbi9DkJRXQA/fChVPiLA+P0W2Q2uiKsNKEijuZdl
 9fvl9aIL/+glczHrZeJTKellFSEocaZ/L5gVmpM6Fk8zyFitP0+nkO40g/qou+A0
 O77A+EK9v4XPe8z87kwrZhphT12QZK2oIPMAZDnjitktbuObip0Wva4w92KnIqK0
 QPIN081oxNF7BnWEUESCTeqXl670lV83Xek1eVHSCTnFOI68riP1YoUQlIhujV/R
 82J+y6HJYtLDj87NuJrAAXtUrtzAPDr39TJr3V2aH0kdpPajUAhkC3gLix13ORyM
 cmP3K1M3U5f3HAElrywqQrGcxaYKN/Hpfb2427vEnbxieTKElVo=
 =ZOXa
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.0-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "This addresses two bugs, one in the error code handling of
  nfs_page_async_flush() and one to fix a potential NULL pointer
  dereference in nfs_parse_devname().

  Stable bugfix:
   - Fix up return value on fatal errors in nfs_page_async_flush()

  Other bugfix:
   - Fix NULL pointer dereference of dev_name"

* tag 'nfs-for-5.0-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Fix up return value on fatal errors in nfs_page_async_flush()
  nfs: Fix NULL pointer dereference of dev_name
2019-01-31 10:13:05 -08:00
Steve French
b9b9378b49 cifs: update internal module version number
To 2.17

Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-31 07:05:06 -06:00
Aurelien Aptel
d339adc12a CIFS: fix use-after-free of the lease keys
The request buffers are freed right before copying the pointers.
Use the func args instead which are identical and still valid.

Simple reproducer (requires KASAN enabled) on a cifs mount:

echo foo > foo ; tail -f foo & rm foo

Cc: <stable@vger.kernel.org> # 4.20
Fixes: 179e44d49c ("smb3: add tracepoint for sending lease break responses to server")
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
2019-01-31 07:03:20 -06:00
Jan Kara
1c2d14212b ext2: Fix underflow in ext2_max_size()
When ext2 filesystem is created with 64k block size, ext2_max_size()
will return value less than 0. Also, we cannot write any file in this fs
since the sb->maxbytes is less than 0. The core of the problem is that
the size of block index tree for such large block size is more than
i_blocks can carry. So fix the computation to count with this
possibility.

File size limits computed with the new function for the full range of
possible block sizes look like:

bits file_size
10     17247252480
11    275415851008
12   2196873666560
13   2197948973056
14   2198486220800
15   2198754754560
16   2198888906752

CC: stable@vger.kernel.org
Reported-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-01-31 10:59:12 +01:00
Richard Guy Briggs
57d4657716 audit: ignore fcaps on umount
Don't fetch fcaps when umount2 is called to avoid a process hang while
it waits for the missing resource to (possibly never) re-appear.

Note the comment above user_path_mountpoint_at():
 * A umount is a special case for path walking. We're not actually interested
 * in the inode in this situation, and ESTALE errors can be a problem.  We
 * simply want track down the dentry and vfsmount attached at the mountpoint
 * and avoid revalidating the last component.

This can happen on ceph, cifs, 9p, lustre, fuse (gluster) or NFS.

Please see the github issue tracker
https://github.com/linux-audit/audit-kernel/issues/100

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: merge fuzz in audit_log_fcaps()]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-01-30 20:51:47 -05:00
Al Viro
f3a09c9201 introduce fs_context methods
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30 17:44:27 -05:00
Al Viro
e1a91586d5 fs_context flavour for submounts
This is an eventual replacement for vfs_submount() uses.  Unlike the
"mount" and "remount" cases, the users of that thing are not in VFS -
they are buried in various ->d_automount() instances and rather than
converting them all at once we introduce the (thankfully small and
simple) infrastructure here and deal with the prospective users in
afs, nfs, etc. parts of the series.

Here we just introduce a new constructor (fs_context_for_submount())
along with the corresponding enum constant to be put into fc->purpose
for those.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30 17:44:27 -05:00
David Howells
8d0347f6c3 convert do_remount_sb() to fs_context
Replace do_remount_sb() with a function, reconfigure_super(), that's
fs_context aware.  The fs_context is expected to be parameterised already
and have ->root pointing to the superblock to be reconfigured.

A legacy wrapper is provided that is intended to be called from the
fs_context ops when those appear, but for now is called directly from
reconfigure_super().  This wrapper invokes the ->remount_fs() superblock op
for the moment.  It is intended that the remount_fs() op will be phased
out.

The fs_context->purpose is set to FS_CONTEXT_FOR_RECONFIGURE to indicate
that the context is being used for reconfiguration.

do_umount_root() is provided to consolidate remount-to-R/O for umount and
emergency remount by creating a context and invoking reconfiguration.

do_remount(), do_umount() and do_emergency_remount_callback() are switched
to use the new process.

[AV -- fold UMOUNT and EMERGENCY_REMOUNT in; fixes the
umount / bug, gets rid of pointless complexity]
[AV -- set ->net_ns in all cases; nfs remount will need that]
[AV -- shift security_sb_remount() call into reconfigure_super(); the callers
that didn't do security_sb_remount() have NULL fc->security anyway, so it's
a no-op for them]

Signed-off-by: David Howells <dhowells@redhat.com>
Co-developed-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30 17:44:26 -05:00
Al Viro
c9ce29ed79 vfs_get_tree(): evict the call of security_sb_kern_mount()
Right now vfs_get_tree() calls security_sb_kern_mount() (i.e.
mount MAC) unless it gets MS_KERNMOUNT or MS_SUBMOUNT in flags.
Doing it that way is both clumsy and imprecise.

Consider the callers' tree of vfs_get_tree():
vfs_get_tree()
        <- do_new_mount()
	<- vfs_kern_mount()
		<- simple_pin_fs()
		<- vfs_submount()
		<- kern_mount_data()
		<- init_mount_tree()
		<- btrfs_mount()
			<- vfs_get_tree()
		<- nfs_do_root_mount()
			<- nfs4_try_mount()
				<- nfs_fs_mount()
					<- vfs_get_tree()
			<- nfs4_referral_mount()

do_new_mount() always does need MAC (we are guaranteed that neither
MS_KERNMOUNT nor MS_SUBMOUNT will be passed there).

simple_pin_fs(), vfs_submount() and kern_mount_data() pass explicit
flags inhibiting that check.  So does nfs4_referral_mount() (the
flags there are ulimately coming from vfs_submount()).

init_mount_tree() is called too early for anything LSM-related; it
doesn't matter whether we attempt those checks, they'll do nothing.

Finally, in case of btrfs_mount() and nfs_fs_mount(), doing MAC
is pointless - either the caller will do it, or the flags are
such that we wouldn't have done it either.

In other words, the one and only case when we want that check
done is when we are called from do_new_mount(), and there we
want it unconditionally.

So let's simply move it there.  The superblock is still locked,
so nobody is going to get access to it (via ustat(2), etc.)
until we get a chance to apply the checks - we are free to
move them to any point up to where we drop ->s_umount (in
do_new_mount_fc()).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30 17:44:26 -05:00
David Howells
132e460848 new helper: do_new_mount_fc()
Create an fs_context-aware version of do_new_mount().  This takes an
fs_context with a superblock already attached to it.

Make do_new_mount() use do_new_mount_fc() rather than do_new_mount(); this
allows the consolidation of the mount creation, check and add steps.

To make this work, mount_too_revealing() is changed to take a superblock
rather than a mount (which the fs_context doesn't have available), allowing
this check to be done before the mount object is created.

Signed-off-by: David Howells <dhowells@redhat.com>
Co-developed-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30 17:44:25 -05:00
Al Viro
a0c9a8b8fd teach vfs_get_tree() to handle subtype, switch do_new_mount() to it
Roll the handling of subtypes into do_new_mount() and vfs_get_tree().  The
former determines any subtype string and hangs it off the fs_context; the
latter applies it.

Make do_new_mount() create, parameterise and commit an fs_context and
create a mount for itself rather than calling vfs_kern_mount().

[AV -- missing kstrdup()]
[AV -- ... and no kstrdup() if we get to setting ->s_submount - we
simply transfer it from fc, leaving NULL behind]
[AV -- constify ->s_submount, while we are at it]

Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30 17:44:25 -05:00
Al Viro
8f2918898e new helpers: vfs_create_mount(), fc_mount()
Create a new helper, vfs_create_mount(), that creates a detached vfsmount
object from an fs_context that has a superblock attached to it.

Almost all uses will be paired with immediately preceding vfs_get_tree();
add a helper for such combination.

Switch vfs_kern_mount() to use this.

NOTE: mild behaviour change; passing NULL as 'device name' to
something like procfs will change /proc/*/mountstats - "device none"
instead on "no device".  That is consistent with /proc/mounts et.al.

[do'h - EXPORT_SYMBOL_GPL slipped in by mistake; removed]
[AV -- remove confused comment from vfs_create_mount()]
[AV -- removed the second argument]

Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30 17:44:24 -05:00
David Howells
9bc61ab18b vfs: Introduce fs_context, switch vfs_kern_mount() to it.
Introduce a filesystem context concept to be used during superblock
creation for mount and superblock reconfiguration for remount.  This is
allocated at the beginning of the mount procedure and into it is placed:

 (1) Filesystem type.

 (2) Namespaces.

 (3) Source/Device names (there may be multiple).

 (4) Superblock flags (SB_*).

 (5) Security details.

 (6) Filesystem-specific data, as set by the mount options.

Accessor functions are then provided to set up a context, parameterise it
from monolithic mount data (the data page passed to mount(2)) and tear it
down again.

A legacy wrapper is provided that implements what will be the basic
operations, wrapping access to filesystems that aren't yet aware of the
fs_context.

Finally, vfs_kern_mount() is changed to make use of the fs_context and
mount_fs() is replaced by vfs_get_tree(), called from vfs_kern_mount().
[AV -- add missing kstrdup()]
[AV -- put_cred() can be unconditional - fc->cred can't be NULL]
[AV -- take legacy_validate() contents into legacy_parse_monolithic()]
[AV -- merge KERNEL_MOUNT and USER_MOUNT]
[AV -- don't unlock superblock on success return from vfs_get_tree()]
[AV -- kill 'reference' argument of init_fs_context()]

Signed-off-by: David Howells <dhowells@redhat.com>
Co-developed-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30 17:44:23 -05:00
Al Viro
74e831221c saner handling of temporary namespaces
mount_subtree() creates (and soon destroys) a temporary namespace,
so that automounts could function normally.  These beasts should
never become anyone's current namespaces; they don't, but it would
be better to make prevention of that more straightforward.  And
since they don't become anyone's current namespace, we don't need
to bother with reserving procfs inums for those.

Teach alloc_mnt_ns() to skip inum allocation if told so, adjust
put_mnt_ns() accordingly, make mount_subtree() use temporary
(anon) namespace.  is_anon_ns() checks if a namespace is such.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30 17:44:07 -05:00
Al Viro
3bd045cc9c separate copying and locking mount tree on cross-userns copies
Rather than having propagate_mnt() check doing unprivileged copies,
lock them before commit_tree().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30 17:14:50 -05:00
Waiman Long
af0c9af1b3 fs/dcache: Track & report number of negative dentries
The current dentry number tracking code doesn't distinguish between
positive & negative dentries.  It just reports the total number of
dentries in the LRU lists.

As excessive number of negative dentries can have an impact on system
performance, it will be wise to track the number of positive and
negative dentries separately.

This patch adds tracking for the total number of negative dentries in
the system LRU lists and reports it in the 5th field in the
/proc/sys/fs/dentry-state file.  The number, however, does not include
negative dentries that are in flight but not in the LRU yet as well as
those in the shrinker lists which are on the way out anyway.

The number of positive dentries in the LRU lists can be roughly found by
subtracting the number of negative dentries from the unused count.

Matthew Wilcox had confirmed that since the introduction of the
dentry_stat structure in 2.1.60, the dummy array was there, probably for
future extension.  They were not replacements of pre-existing fields.
So no sane applications that read the value of /proc/sys/fs/dentry-state
will do dummy thing if the last 2 fields of the sysctl parameter are not
zero.  IOW, it will be safe to use one of the dummy array entry for
negative dentry count.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-30 11:02:11 -08:00
Waiman Long
1dbd449c99 fs/dcache: Fix incorrect nr_dentry_unused accounting in shrink_dcache_sb()
The nr_dentry_unused per-cpu counter tracks dentries in both the LRU
lists and the shrink lists where the DCACHE_LRU_LIST bit is set.

The shrink_dcache_sb() function moves dentries from the LRU list to a
shrink list and subtracts the dentry count from nr_dentry_unused.  This
is incorrect as the nr_dentry_unused count will also be decremented in
shrink_dentry_list() via d_shrink_del().

To fix this double decrement, the decrement in the shrink_dcache_sb()
function is taken out.

Fixes: 4e717f5c10 ("list_lru: remove special case function list_lru_dispose_all."
Cc: stable@kernel.org
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-30 11:02:11 -08:00
Eric W. Biederman
532b618bdf btrfs: On error always free subvol_name in btrfs_mount
The subvol_name is allocated in btrfs_parse_subvol_options and is
consumed and freed in mount_subvol.  Add a free to the error paths that
don't call mount_subvol so that it is guaranteed that subvol_name is
freed when an error happens.

Fixes: 312c89fbca ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()")
Cc: stable@vger.kernel.org # v4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-30 18:16:47 +01:00
David Sterba
c7cc64a985 btrfs: clean up pending block groups when transaction commit aborts
The fstests generic/475 stresses transaction aborts and can reveal
space accounting or use-after-free bugs regarding block goups.

In this case the pending block groups that remain linked to the
structures after transaction commit aborts in the middle.

The corrupted slabs lead to failures in following tests, eg. generic/476

  [ 8172.752887] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
  [ 8172.755799] #PF error: [normal kernel read fault]
  [ 8172.757571] PGD 661ae067 P4D 661ae067 PUD 3db8e067 PMD 0
  [ 8172.759000] Oops: 0000 [#1] PREEMPT SMP
  [ 8172.760209] CPU: 0 PID: 39 Comm: kswapd0 Tainted: G        W         5.0.0-rc2-default #408
  [ 8172.762495] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
  [ 8172.765772] RIP: 0010:shrink_page_list+0x2f9/0xe90
  [ 8172.770453] RSP: 0018:ffff967f00663b18 EFLAGS: 00010287
  [ 8172.771184] RAX: 0000000000000000 RBX: ffff967f00663c20 RCX: 0000000000000000
  [ 8172.772850] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8c0620ab20e0
  [ 8172.774629] RBP: ffff967f00663dd8 R08: 0000000000000000 R09: 0000000000000000
  [ 8172.776094] R10: ffff8c0620ab22f8 R11: ffff8c063f772688 R12: ffff967f00663b78
  [ 8172.777533] R13: ffff8c063f625600 R14: ffff8c063f625608 R15: dead000000000200
  [ 8172.778886] FS:  0000000000000000(0000) GS:ffff8c063d400000(0000) knlGS:0000000000000000
  [ 8172.780545] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 8172.781787] CR2: 0000000000000058 CR3: 000000004e962000 CR4: 00000000000006f0
  [ 8172.783547] Call Trace:
  [ 8172.784112]  shrink_inactive_list+0x194/0x410
  [ 8172.784747]  shrink_node_memcg.constprop.85+0x3a5/0x6a0
  [ 8172.785472]  shrink_node+0x62/0x1e0
  [ 8172.786011]  balance_pgdat+0x216/0x460
  [ 8172.786577]  kswapd+0xe3/0x4a0
  [ 8172.787085]  ? finish_wait+0x80/0x80
  [ 8172.787795]  ? balance_pgdat+0x460/0x460
  [ 8172.788799]  kthread+0x116/0x130
  [ 8172.789640]  ? kthread_create_on_node+0x60/0x60
  [ 8172.790323]  ret_from_fork+0x24/0x30
  [ 8172.794253] CR2: 0000000000000058

or accounting errors at umount time:

  [ 8159.537251] WARNING: CPU: 2 PID: 19031 at fs/btrfs/extent-tree.c:5987 btrfs_free_block_groups+0x3d5/0x410 [btrfs]
  [ 8159.543325] CPU: 2 PID: 19031 Comm: umount Tainted: G        W         5.0.0-rc2-default #408
  [ 8159.545472] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
  [ 8159.548155] RIP: 0010:btrfs_free_block_groups+0x3d5/0x410 [btrfs]
  [ 8159.554030] RSP: 0018:ffff967f079cbde8 EFLAGS: 00010206
  [ 8159.555144] RAX: 0000000001000000 RBX: ffff8c06366cf800 RCX: 0000000000000000
  [ 8159.556730] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff8c06255ad800
  [ 8159.558279] RBP: ffff8c0637ac0000 R08: 0000000000000001 R09: 0000000000000000
  [ 8159.559797] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8c0637ac0108
  [ 8159.561296] R13: ffff8c0637ac0158 R14: 0000000000000000 R15: dead000000000100
  [ 8159.562852] FS:  00007f7f693b9fc0(0000) GS:ffff8c063d800000(0000) knlGS:0000000000000000
  [ 8159.564839] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 8159.566160] CR2: 00007f7f68fab7b0 CR3: 000000000aec7000 CR4: 00000000000006e0
  [ 8159.567898] Call Trace:
  [ 8159.568597]  close_ctree+0x17f/0x350 [btrfs]
  [ 8159.569628]  generic_shutdown_super+0x64/0x100
  [ 8159.570808]  kill_anon_super+0x14/0x30
  [ 8159.571857]  btrfs_kill_super+0x12/0xa0 [btrfs]
  [ 8159.573063]  deactivate_locked_super+0x29/0x60
  [ 8159.574234]  cleanup_mnt+0x3b/0x70
  [ 8159.575176]  task_work_run+0x98/0xc0
  [ 8159.576177]  exit_to_usermode_loop+0x83/0x90
  [ 8159.577315]  do_syscall_64+0x15b/0x180
  [ 8159.578339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

This fix is based on 2 Josef's patches that used sideefects of
btrfs_create_pending_block_groups, this fix introduces the helper that
does what we need.

CC: stable@vger.kernel.org # 4.4+
CC: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-30 18:16:47 +01:00
Al Viro
92900e5160 btrfs: fix potential oops in device_list_add
alloc_fs_devices() can return ERR_PTR(-ENOMEM), so dereferencing its
result before the check for IS_ERR() is a bad idea.

Fixes: d1a6300282 ("btrfs: add members to fs_devices to track fsid changes")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-30 18:16:40 +01:00
Greg Kroah-Hartman
37ea7b630a debugfs: debugfs_lookup() should return NULL if not found
Lots of callers of debugfs_lookup() were just checking NULL to see if
the file/directory was found or not.  By changing this in ff9fb72bc0
("debugfs: return error values, not NULL") we caused some subsystems to
easily crash.

Fixes: ff9fb72bc0 ("debugfs: return error values, not NULL")
Reported-by: syzbot+b382ba6a802a3d242790@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-30 12:39:49 +01:00
Pavel Shilovsky
082aaa8700 CIFS: Do not consider -ENODATA as stat failure for reads
When doing reads beyound the end of a file the server returns
error STATUS_END_OF_FILE error which is mapped to -ENODATA.
Currently we report it as a failure which confuses read stats.
Change it to not consider -ENODATA as failure for stat purposes.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2019-01-29 17:27:16 -06:00
Pavel Shilovsky
8e6e72aece CIFS: Do not count -ENODATA as failure for query directory
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2019-01-29 17:24:53 -06:00
Pavel Shilovsky
7d42e72fe8 CIFS: Fix trace command logging for SMB2 reads and writes
Currently we log success once we send an async IO request to
the server. Instead we need to analyse a response and then log
success or failure for a particular command. Also fix argument
list for read logging.

Cc: <stable@vger.kernel.org> # 4.18
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-29 17:19:56 -06:00
Pavel Shilovsky
9bda8723da CIFS: Fix possible oops and memory leaks in async IO
Allocation of a page array for non-cached IO was separated from
allocation of rdata and wdata structures and this introduced memory
leaks and a possible null pointer dereference. This patch fixes
these problems.

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-29 17:19:47 -06:00
Ronnie Sahlberg
c4627e66f7 cifs: limit amount of data we request for xattrs to CIFSMaxBufSize
minus the various headers and blobs that will be part of the reply.

or else we might trigger a session reconnect.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-01-29 16:17:25 -06:00
Ronnie Sahlberg
58d15ed120 cifs: fix computation for MAX_SMB2_HDR_SIZE
The size of the fixed part of the create response is 88 bytes not 56.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-01-29 16:15:08 -06:00
Trond Myklebust
8fc75bed96 NFS: Fix up return value on fatal errors in nfs_page_async_flush()
Ensure that we return the fatal error value that caused us to exit
nfs_page_async_flush().

Fixes: c373fff7bd ("NFSv4: Don't special case "launder"")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.12+
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-29 16:33:24 -05:00
Greg Kroah-Hartman
ff9fb72bc0 debugfs: return error values, not NULL
When an error happens, debugfs should return an error pointer value, not
NULL.  This will prevent the totally theoretical error where a debugfs
call fails due to lack of memory, returning NULL, and that dentry value
is then passed to another debugfs call, which would end up succeeding,
creating a file at the root of the debugfs tree, but would then be
impossible to remove (because you can not remove the directory NULL).

So, to make everyone happy, always return errors, this makes the users
of debugfs much simpler (they do not have to ever check the return
value), and everyone can rest easy.

Reported-by: Gary R Hook <ghook@amd.com>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Reported-by: Michal Hocko <mhocko@kernel.org>
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reported-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-29 21:28:35 +01:00
Liu Xiang
4bc74ba1c7 ext2: Fix a typo in comment
Fix a typo in ext2_get_blocks comment.

Signed-off-by: Liu Xiang <liu.xiang6@zte.com.cn>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-01-29 16:43:54 +01:00
Yao Liu
80ff001724 nfs: Fix NULL pointer dereference of dev_name
There is a NULL pointer dereference of dev_name in nfs_parse_devname()

The oops looks something like:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
  ...
  RIP: 0010:nfs_fs_mount+0x3b6/0xc20 [nfs]
  ...
  Call Trace:
   ? ida_alloc_range+0x34b/0x3d0
   ? nfs_clone_super+0x80/0x80 [nfs]
   ? nfs_free_parsed_mount_data+0x60/0x60 [nfs]
   mount_fs+0x52/0x170
   ? __init_waitqueue_head+0x3b/0x50
   vfs_kern_mount+0x6b/0x170
   do_mount+0x216/0xdc0
   ksys_mount+0x83/0xd0
   __x64_sys_mount+0x25/0x30
   do_syscall_64+0x65/0x220
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fix this by adding a NULL check on dev_name

Signed-off-by: Yao Liu <yotta.liu@ucloud.cn>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-28 12:08:30 -05:00
Liu Xiang
0b7a814c26 ext2: Remove redundant check for finding no group
When best_desc keeps NULL, best_group keeps -1, too. So we can
return best_group directly.

Signed-off-by: Liu Xiang <liu.xiang6@zte.com.cn>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-01-28 15:51:11 +01:00
Mathieu Malaterre
f068ebd13b ext2: Annotate implicit fall through in __ext2_truncate_blocks
There is a plan to build the kernel with -Wimplicit-fallthrough and
these places in the code produced warnings (W=1).

This commit removes the following warnings:

  fs/ext2/inode.c:1237:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
  fs/ext2/inode.c:1244:7: warning: this statement may fall through [-Wimplicit-fallthrough=]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-01-28 15:50:45 +01:00
Josef Bacik
302167c50b btrfs: don't end the transaction for delayed refs in throttle
Previously callers to btrfs_end_transaction_throttle() would commit the
transaction if there wasn't enough delayed refs space.  This happens in
relocation, and if the fs is relatively empty we'll run out of delayed
refs space basically immediately, so we'll just be stuck in this loop of
committing the transaction over and over again.

This code existed because we didn't have a good feedback mechanism for
running delayed refs, but with the delayed refs rsv we do now.  Delete
this throttling code and let the btrfs_start_transaction() in relocation
deal with putting pressure on the delayed refs infrastructure.  With
this patch we no longer take 5 minutes to balance a metadata only fs.

Qu has submitted a fstest to catch slow balance or excessive transaction
commits. Steps to reproduce:

* create subvolume
* create many (eg. 16000) inlined files, of size 2KiB
* iteratively snapshot and touch several files to trigger metadata
  updates
* start balance -m

Reported-by: Qu Wenruo <wqu@suse.com>
Fixes: 64403612b7 ("btrfs: rework btrfs_check_space_for_delayed_refs")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add tags and steps to reproduce ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-28 15:41:11 +01:00
Filipe Manana
a627947076 Btrfs: fix deadlock when allocating tree block during leaf/node split
When splitting a leaf or node from one of the trees that are modified when
flushing pending block groups (extent, chunk, device and free space trees),
we need to allocate a new tree block, which in turn can result in the need
to allocate a new block group. After allocating the new block group we may
need to flush new block groups that were previously allocated during the
course of the current transaction, which is what may cause a deadlock due
to attempts to write lock twice the same leaf or node, as when splitting
a leaf or node we are holding a write lock on it and its parent node.

The same type of deadlock can also happen when increasing the tree's
height, since we are holding a lock on the existing root while allocating
the tree block to use as the new root node.

An example trace when the deadlock happens during the leaf split path is:

  [27175.293054] CPU: 0 PID: 3005 Comm: kworker/u17:6 Tainted: G        W         4.19.16 #1
  [27175.293942] Hardware name: Penguin Computing Relion 1900/MD90-FS0-ZB-XX, BIOS R15 06/25/2018
  [27175.294846] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
  (...)
  [27175.298384] RSP: 0018:ffffab2087107758 EFLAGS: 00010246
  [27175.299269] RAX: 0000000000000bbd RBX: ffff9fadc7141c48 RCX: 0000000000000001
  [27175.300155] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff9fadc7141c48
  [27175.301023] RBP: 0000000000000001 R08: ffff9faeb6ac1040 R09: ffff9fa9c0000000
  [27175.301887] R10: 0000000000000000 R11: 0000000000000040 R12: ffff9fb21aac8000
  [27175.302743] R13: ffff9fb1a64d6a20 R14: 0000000000000001 R15: ffff9fb1a64d6a18
  [27175.303601] FS:  0000000000000000(0000) GS:ffff9fb21fa00000(0000) knlGS:0000000000000000
  [27175.304468] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [27175.305339] CR2: 00007fdc8743ead8 CR3: 0000000763e0a006 CR4: 00000000003606f0
  [27175.306220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [27175.307087] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [27175.307940] Call Trace:
  [27175.308802]  btrfs_search_slot+0x779/0x9a0 [btrfs]
  [27175.309669]  ? update_space_info+0xba/0xe0 [btrfs]
  [27175.310534]  btrfs_insert_empty_items+0x67/0xc0 [btrfs]
  [27175.311397]  btrfs_insert_item+0x60/0xd0 [btrfs]
  [27175.312253]  btrfs_create_pending_block_groups+0xee/0x210 [btrfs]
  [27175.313116]  do_chunk_alloc+0x25f/0x300 [btrfs]
  [27175.313984]  find_free_extent+0x706/0x10d0 [btrfs]
  [27175.314855]  btrfs_reserve_extent+0x9b/0x1d0 [btrfs]
  [27175.315707]  btrfs_alloc_tree_block+0x100/0x5b0 [btrfs]
  [27175.316548]  split_leaf+0x130/0x610 [btrfs]
  [27175.317390]  btrfs_search_slot+0x94d/0x9a0 [btrfs]
  [27175.318235]  btrfs_insert_empty_items+0x67/0xc0 [btrfs]
  [27175.319087]  alloc_reserved_file_extent+0x84/0x2c0 [btrfs]
  [27175.319938]  __btrfs_run_delayed_refs+0x596/0x1150 [btrfs]
  [27175.320792]  btrfs_run_delayed_refs+0xed/0x1b0 [btrfs]
  [27175.321643]  delayed_ref_async_start+0x81/0x90 [btrfs]
  [27175.322491]  normal_work_helper+0xd0/0x320 [btrfs]
  [27175.323328]  ? move_linked_works+0x6e/0xa0
  [27175.324160]  process_one_work+0x191/0x370
  [27175.324976]  worker_thread+0x4f/0x3b0
  [27175.325763]  kthread+0xf8/0x130
  [27175.326531]  ? rescuer_thread+0x320/0x320
  [27175.327284]  ? kthread_create_worker_on_cpu+0x50/0x50
  [27175.328027]  ret_from_fork+0x35/0x40
  [27175.328741] ---[ end trace 300a1b9f0ac30e26 ]---

Fix this by preventing the flushing of new blocks groups when splitting a
leaf/node and when inserting a new root node for one of the trees modified
by the flushing operation, similar to what is done when COWing a node/leaf
from on of these trees.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202383
Reported-by: Eli V <eliventer@gmail.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-28 15:04:58 +01:00
Christoph Hellwig
4ea899ead2 iomap: fix a use after free in iomap_dio_rw
Introduce a local wait_for_completion variable to avoid an access to the
potentially freed dio struture after dropping the last reference count.

Also use the chance to document the completion behavior to make the
refcounting clear to the reader of the code.

Fixes: ff6a9292e6 ("iomap: implement direct I/O")
Reported-by: Chandan Rajendra <chandan@linux.ibm.com>
Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Chandan Rajendra <chandan@linux.ibm.com>
Tested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-01-27 08:47:42 -08:00
Piotr Jaroszynski
8e47a45732 iomap: get/put the page in iomap_page_create/release()
migrate_page_move_mapping() expects pages with private data set to have
a page_count elevated by 1.  This is what used to happen for xfs through
the buffer_heads code before the switch to iomap in commit 82cb14175e
("xfs: add support for sub-pagesize writeback without buffer_heads").
Not having the count elevated causes move_pages() to fail on memory
mapped files coming from xfs.

Make iomap compatible with the migrate_page_move_mapping() assumption by
elevating the page count as part of iomap_page_create() and lowering it
in iomap_page_release().

It causes the move_pages() syscall to misbehave on memory mapped files
from xfs.  It does not not move any pages, which I suppose is "just" a
perf issue, but it also ends up returning a positive number which is out
of spec for the syscall.  Talking to Michal Hocko, it sounds like
returning positive numbers might be a necessary update to move_pages()
anyway though.

Fixes: 82cb14175e ("xfs: add support for sub-pagesize writeback without buffer_heads")
Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com>
[hch: actually get/put the page iomap_migrate_page() to make it work
      properly]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-01-27 08:46:45 -08:00
Linus Torvalds
7c2614bf7a a set of small smb3 fixes, some fixing various crediting issues discovered during xfstest runs, five for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlxLV8IACgkQiiy9cAdy
 T1HRIgv/fieDI6HBXAoK9mXzVLJXCZ1IxVZTygC3Xn/bCaxHerS+QSe88w0r1QTZ
 KLCoT7pBBy06uTowX4/cAF0IYTStsYWngBsH3+HL4EEchkZAVe02wAJ8QOCXs9YU
 UjoMvNLmcyrzwNt8EduNM2jUUpOL8EHiZOQKUvLt1Vdo3xBtyB3Nkttji92YLOQb
 Cxs0RJdKNaalMphOQwAtbdbZwlL79KGSA0gMm9W2VD744YKMjQ3uGXaTuoe6+w19
 voYKVTSelgyZg51yzF20x1Pl3OV/8L6tWgN1BIcKZbfGBshJhhcD93Pef7bmhxNS
 Y2yF3SPr/zcdgEkwigVlNs1uBCBBeU04/yQr+YvJwNcjsRTcTKOMuquxq4gXM1P8
 7f8gA1u0n1H70YKK29sd3sl7Iu8rHVacwlxjfoAcCjMe0VQvOJtR+V/zwP2uWr8T
 IjX05unBe/B1v+NXqNdkOpE6ZaiJtt2ny5NDZhfeubOouBLsey7/g/gSx48ICrq+
 U20kKsb6
 =CCH2
 -----END PGP SIGNATURE-----

Merge tag '5.0-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb3 fixes from Steve French:
 "A set of small smb3 fixes, some fixing various crediting issues
  discovered during xfstest runs, five for stable"

* tag '5.0-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: print CIFSMaxBufSize as part of /proc/fs/cifs/DebugData
  smb3: add credits we receive from oplock/break PDUs
  CIFS: Fix mounts if the client is low on credits
  CIFS: Do not assume one credit for async responses
  CIFS: Fix credit calculations in compound mid callback
  CIFS: Fix credit calculation for encrypted reads with errors
  CIFS: Fix credits calculations for reads with errors
  CIFS: Do not reconnect TCP session in add_credits()
  smb3: Cleanup license mess
  CIFS: Fix possible hang during async MTU reads and writes
  cifs: fix memory leak of an allocated cifs_ntsd structure
2019-01-26 15:38:22 -08:00
Linus Torvalds
6b8f915916 for-linus-20190125
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlxLdgsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoVGD/4sGYQqfiXogQIJYbPH2RRPrJuLIIITjiAv
 lPXX1wx/tvz/ktwKJE2OiWTES0JjdH1HlC+7/0L/fLb8DXiKBmFuUHlwhureoL9Y
 o//BIQKuaje35kyHITTy2UAJOOqnNtJTaAP2AfkL+eOcj/V/G5rJIfLGs9QtuAR7
 sRJ+uhg1EbW/+uO0bULDmG6WUjxFu8mqcw3i6g0VVLVOnXB2EKcZTl3KPrdAXrUp
 XtmouERga6OfAUSJyZmTSV136mL+opRB2WFFVeIzjQfLmyItDGbSX/YPS8oJ2pow
 v7630F+CMrd4aKpqqtnAhfWpGqd0Xw7cYfZ9MKTJmZPmGzf9a1fQFpmgZosD4Dh3
 7MrhboU4TUt9PdXESA7CmE7LkTp99ghfj5/ysKrSV5h3HsH2RbLxJk91Rx3vmAWD
 u1xWRYL+GYLH6ZwOLvM1iqBrrLN3mUyrx98SaMgoXuqNzmQmgz9LPeA0Gt09FJbo
 uj+ebg4dRwuThjni4xQhl3zL2RQy7nlTDFDdKOz/XoiYk2NUVksss+sxGjNarHj0
 b5pCD4HOp57OreGExaOARpBRah5HSNdQtBRsIOnbyEq6f/e1LsIY23Z9nNF0deGO
 sZzgsbnsn+zg8bC6T/Gk4UY6XdCcgaS3SL04SVKAE3lO6A4C/Awo8DgD9Bl1zpC1
 HQlNkl5fBg==
 =iucY
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190125' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A collection of fixes for this release. This contains:

   - Silence sparse rightfully complaining about non-static wbt
     functions (Bart)

   - Fixes for the zoned comments/ioctl documentation (Damien)

   - direct-io fix that's been lingering for a while (Ernesto)

   - cgroup writeback fix (Tejun)

   - Set of NVMe patches for nvme-rdma/tcp (Sagi, Hannes, Raju)

   - Block recursion tracking fix (Ming)

   - Fix debugfs command flag naming for a few flags (Jianchao)"

* tag 'for-linus-20190125' of git://git.kernel.dk/linux-block:
  block: Fix comment typo
  uapi: fix ioctl documentation
  blk-wbt: Declare local functions static
  blk-mq: fix the cmd_flag_name array
  nvme-multipath: drop optimization for static ANA group IDs
  nvmet-rdma: fix null dereference under heavy load
  nvme-rdma: rework queue maps handling
  nvme-tcp: fix timeout handler
  nvme-rdma: fix timeout handler
  writeback: synchronize sync(2) against cgroup writeback membership switches
  block: cover another queue enter recursion via BIO_QUEUE_ENTERED
  direct-io: allow direct writes to empty inodes
2019-01-26 12:42:41 -08:00
Richard Guy Briggs
4b7d248b3a audit: move loginuid and sessionid from CONFIG_AUDITSYSCALL to CONFIG_AUDIT
loginuid and sessionid (and audit_log_session_info) should be part of
CONFIG_AUDIT scope and not CONFIG_AUDITSYSCALL since it is used in
CONFIG_CHANGE, ANOM_LINK, FEATURE_CHANGE (and INTEGRITY_RULE), none of
which are otherwise dependent on AUDITSYSCALL.

Please see github issue
https://github.com/linux-audit/audit-kernel/issues/104

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: tweaked subject line for better grep'ing]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-01-25 13:03:23 -05:00
Greg Kroah-Hartman
d88c93f090 debugfs: fix debugfs_rename parameter checking
debugfs_rename() needs to check that the dentries passed into it really
are valid, as sometimes they are not (i.e. if the return value of
another debugfs call is passed into this one.)  So fix this up by
properly checking if the two parent directories are errors (they are
allowed to be NULL), and if the dentry to rename is not NULL or an
error.

Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-25 12:56:32 +01:00
Eric Biggers
231baecdef crypto: clarify name of WEAK_KEY request flag
CRYPTO_TFM_REQ_WEAK_KEY confuses newcomers to the crypto API because it
sounds like it is requesting a weak key.  Actually, it is requesting
that weak keys be forbidden (for algorithms that have the notion of
"weak keys"; currently only DES and XTS do).

Also it is only one letter away from CRYPTO_TFM_RES_WEAK_KEY, with which
it can be easily confused.  (This in fact happened in the UX500 driver,
though just in some debugging messages.)

Therefore, make the intent clear by renaming it to
CRYPTO_TFM_REQ_FORBID_WEAK_KEYS.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-01-25 18:41:52 +08:00
Ronnie Sahlberg
a5f1a81f70 cifs: print CIFSMaxBufSize as part of /proc/fs/cifs/DebugData
Was helpful in debug for some recent problems.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24 14:52:06 -06:00
Ronnie Sahlberg
2e5700bdde smb3: add credits we receive from oplock/break PDUs
Otherwise we gradually leak credits leading to potential
hung session.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24 14:52:06 -06:00
Pavel Shilovsky
6a9cbdd1ce CIFS: Fix mounts if the client is low on credits
If the server doesn't grant us at least 3 credits during the mount
we won't be able to complete it because query path info operation
requires 3 credits. Use the cached file handle if possible to allow
the mount to succeed.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24 14:52:06 -06:00
Pavel Shilovsky
0fd1d37b05 CIFS: Do not assume one credit for async responses
If we don't receive a response we can't assume that the server
granted one credit. Assume zero credits in such cases.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24 14:52:06 -06:00
Pavel Shilovsky
3d3003fce8 CIFS: Fix credit calculations in compound mid callback
The current code doesn't do proper accounting for credits
in SMB1 case: it adds one credit per response only if we get
a complete response while it needs to return it unconditionally.
Fix this and also include malformed responses for SMB2+ into
accounting for credits because such responses have Credit
Granted field, thus nothing prevents to get a proper credit
value from them.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24 14:52:06 -06:00
Pavel Shilovsky
ec678eae74 CIFS: Fix credit calculation for encrypted reads with errors
We do need to account for credits received in error responses
to read requests on encrypted sessions.

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24 14:52:05 -06:00
Pavel Shilovsky
8004c78c68 CIFS: Fix credits calculations for reads with errors
Currently we mark MID as malformed if we get an error from server
in a read response. This leads to not properly processing credits
in the readv callback. Fix this by marking such a response as
normal received response and process it appropriately.

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24 14:52:05 -06:00
Pavel Shilovsky
ef68e83184 CIFS: Do not reconnect TCP session in add_credits()
When executing add_credits() we currently call cifs_reconnect()
if the number of credits is zero and there are no requests in
flight. In this case we may call cifs_reconnect() recursively
twice and cause memory corruption given the following sequence
of functions:

mid1.callback() -> add_credits() -> cifs_reconnect() ->
-> mid2.callback() -> add_credits() -> cifs_reconnect().

Fix this by avoiding to call cifs_reconnect() in add_credits()
and checking for zero credits in the demultiplex thread.

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24 14:50:57 -06:00
Varad Gautam
73052b0dae fs/devpts: always delete dcache dentry-s in dput()
d_delete only unhashes an entry if it is reached with
dentry->d_lockref.count != 1. Prior to commit 8ead9dd547 ("devpts:
more pty driver interface cleanups"), d_delete was called on a dentry
from devpts_pty_kill with two references held, which would trigger the
unhashing, and the subsequent dputs would release it.

Commit 8ead9dd547 reworked devpts_pty_kill to stop acquiring the second
reference from d_find_alias, and the d_delete call left the dentries
still on the hashed list without actually ever being dropped from dcache
before explicit cleanup. This causes the number of negative dentries for
devpts to pile up, and an `ls /dev/pts` invocation can take seconds to
return.

Provide always_delete_dentry() from simple_dentry_operations
as .d_delete for devpts, to make the dentry be dropped from dcache.

Without this cleanup, the number of dentries in /dev/pts/ can be grown
arbitrarily as:

`python -c 'import pty; pty.spawn(["ls", "/dev/pts"])'`

A systemtap probe on dcache_readdir to count d_subdirs shows this count
to increase with each pty spawn invocation above:

probe kernel.function("dcache_readdir") {
    subdirs = &@cast($file->f_path->dentry, "dentry")->d_subdirs;
    p = subdirs;
    p = @cast(p, "list_head")->next;
    i = 0
    while (p != subdirs) {
      p = @cast(p, "list_head")->next;
      i = i+1;
    }
    printf("number of dentries: %d\n", i);
}

Fixes: 8ead9dd547 ("devpts: more pty driver interface cleanups")
Signed-off-by: Varad Gautam <vrd@amazon.de>
Reported-by: Zheng Wang <wanz@amazon.de>
Reported-by: Brandon Schwartz <bsschwar@amazon.de>
Root-caused-by: Maximilian Heyne <mheyne@amazon.de>
Root-caused-by: Nicolas Pernas Maradei <npernas@amazon.de>
CC: David Woodhouse <dwmw@amazon.co.uk>
CC: Maximilian Heyne <mheyne@amazon.de>
CC: Stefan Nuernberger <snu@amazon.de>
CC: Amit Shah <aams@amazon.de>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Matthew Wilcox <willy@infradead.org>
CC: Eric Biggers <ebiggers@google.com>
CC: <stable@vger.kernel.org> # 4.9+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-24 13:38:30 -05:00
Linus Torvalds
c04e2a780c \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlxJ6LUACgkQnJ2qBz9k
 QNn75Qf/U60RPju54Kdyepo4iMFoQXS/VJg87bpRetUdN9vKU3CbnyYAx43ftPOk
 GH+UU2u+24DX2AWbzy4yx2CJP1YUUF1oa2QkSW2P9WWn1NYU3goWha1Il8bc96Kq
 TQU/ypl83RoaFtw4JwpL27ot39UH5WWafYiackqzOYIuAdpgei2aaZFNvVeb9Ije
 9udt7ygcTn94DVdYAFI6NeH5KAA+kzpIyWWRCWlmESsNYnCUQntSLm3tQG8sTHZt
 sXk2Kp1r7XFl/zkg9iFSOhSIIyVizSC4cHrE0iJ/0kLgoJjfjbOa1jZ5Eeq8yTc3
 UsXI5ICNS3RuZWwSWkh759JnhQ1R/A==
 =cB63
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull inotify fix from Jan Kara:
 "Fix a file refcount leak in an inotify error path"

* tag 'fsnotify_for_v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  inotify: Fix fd refcount leak in inotify_add_watch().
2019-01-25 06:03:24 +13:00
Thomas Gleixner
b0b2cac7e2 smb3: Cleanup license mess
Precise and non-ambiguous license information is important. The recently
added aegis header file has a SPDX license identifier, which is nice, but
at the same time it has a contradictionary license boiler plate text.

  SPDX-License-Identifier: GPL-2.0

versus

  *   This program is free software;  you can redistribute it and/or modify
  *   it under the terms of the GNU General Public License as published by
  *   the Free Software Foundation; either version 2 of the License, or
  *   (at your option) any later version.

Oh well.

Assuming that the SPDX identifier is correct and according to x86/hyper-v
contributions from Microsoft GPL V2 only is the usual license.

Remove the boiler plate as it is wrong and even if correct it is redundant.

Fixes: eccb4422cf ("smb3: Add ftrace tracepoints for improved SMB3 debugging")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steve French <sfrench@samba.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24 09:37:33 -06:00
Pavel Shilovsky
acc58d0bab CIFS: Fix possible hang during async MTU reads and writes
When doing MTU i/o we need to leave some credits for
possible reopen requests and other operations happening
in parallel. Currently we leave 1 credit which is not
enough even for reopen only: we need at least 2 credits
if durable handle reconnect fails. Also there may be
other operations at the same time including compounding
ones which require 3 credits at a time each. Fix this
by leaving 8 credits which is big enough to cover most
scenarios.

Was able to reproduce this when server was configured
to give out fewer credits than usual.

The proper fix would be to reconnect a file handle first
and then obtain credits for an MTU request but this leads
to bigger code changes and should happen in other patches.

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24 09:37:33 -06:00
Colin Ian King
73aaf920cc cifs: fix memory leak of an allocated cifs_ntsd structure
The call to SMB2_queary_acl can allocate memory to pntsd and also
return a failure via a call to SMB2_query_acl (and then query_info).
This occurs when query_info allocates the structure and then in
query_info the call to smb2_validate_and_copy_iov fails. Currently the
failure just returns without kfree'ing pntsd hence causing a memory
leak.

Currently, *data is allocated if it's not already pointing to a buffer,
so it needs to be kfree'd only if was allocated in query_info, so the
fix adds an allocated flag to track this.  Also set *dlen to zero on
an error just to be safe since *data is kfree'd.

Also set errno to -ENOMEM if the allocation of *data fails.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Dan Carpener <dan.carpenter@oracle.com>
2019-01-24 09:37:33 -06:00
Eric Biggers
f5e55e777c fscrypt: return -EXDEV for incompatible rename or link into encrypted dir
Currently, trying to rename or link a regular file, directory, or
symlink into an encrypted directory fails with EPERM when the source
file is unencrypted or is encrypted with a different encryption policy,
and is on the same mountpoint.  It is correct for the operation to fail,
but the choice of EPERM breaks tools like 'mv' that know to copy rather
than rename if they see EXDEV, but don't know what to do with EPERM.

Our original motivation for EPERM was to encourage users to securely
handle their data.  Encrypting files by "moving" them into an encrypted
directory can be insecure because the unencrypted data may remain in
free space on disk, where it can later be recovered by an attacker.
It's much better to encrypt the data from the start, or at least try to
securely delete the source data e.g. using the 'shred' program.

However, the current behavior hasn't been effective at achieving its
goal because users tend to be confused, hack around it, and complain;
see e.g. https://github.com/google/fscrypt/issues/76.  And in some cases
it's actually inconsistent or unnecessary.  For example, 'mv'-ing files
between differently encrypted directories doesn't work even in cases
where it can be secure, such as when in userspace the same passphrase
protects both directories.  Yet, you *can* already 'mv' unencrypted
files into an encrypted directory if the source files are on a different
mountpoint, even though doing so is often insecure.

There are probably better ways to teach users to securely handle their
files.  For example, the 'fscrypt' userspace tool could provide a
command that migrates unencrypted files into an encrypted directory,
acting like 'shred' on the source files and providing appropriate
warnings depending on the type of the source filesystem and disk.

Receiving errors on unimportant files might also force some users to
disable encryption, thus making the behavior counterproductive.  It's
desirable to make encryption as unobtrusive as possible.

Therefore, change the error code from EPERM to EXDEV so that tools
looking for EXDEV will fall back to a copy.

This, of course, doesn't prevent users from still doing the right things
to securely manage their files.  Note that this also matches the
behavior when a file is renamed between two project quota hierarchies;
so there's precedent for using EXDEV for things other than mountpoints.

xfstests generic/398 will require an update with this change.

[Rewritten from an earlier patch series by Michael Halcrow.]

Cc: Michael Halcrow <mhalcrow@google.com>
Cc: Joe Richey <joerichey@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-01-23 23:56:43 -05:00
Chandan Rajendra
643fa9612b fscrypt: remove filesystem specific build config option
In order to have a common code base for fscrypt "post read" processing
for all filesystems which support encryption, this commit removes
filesystem specific build config option (e.g. CONFIG_EXT4_FS_ENCRYPTION)
and replaces it with a build option (i.e. CONFIG_FS_ENCRYPTION) whose
value affects all the filesystems making use of fscrypt.

Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-01-23 23:56:43 -05:00
Chandan Rajendra
62230e0d70 f2fs: use IS_ENCRYPTED() to check encryption status
This commit removes the f2fs specific f2fs_encrypted_inode() and makes
use of the generic IS_ENCRYPTED() macro to check for the encryption
status of an inode.

Acked-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-01-23 23:56:43 -05:00
Chandan Rajendra
592ddec757 ext4: use IS_ENCRYPTED() to check encryption status
This commit removes the ext4 specific ext4_encrypted_inode() and makes
use of the generic IS_ENCRYPTED() macro to check for the encryption
status of an inode.

Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-01-23 23:56:43 -05:00
Eric Biggers
1058ef0dcb fscrypt: remove CRYPTO_CTR dependency
fscrypt doesn't use the CTR mode of operation for anything, so there's
no need to select CRYPTO_CTR.  It was added by commit 71dea01ea2
("ext4 crypto: require CONFIG_CRYPTO_CTR if ext4 encryption is
enabled").  But, I've been unable to identify the arm64 crypto bug it
was supposedly working around.

I suspect the issue was seen only on some old Android device kernel
(circa 3.10?).  So if the fix wasn't mistaken, the real bug is probably
already fixed.  Or maybe it was actually a bug in a non-upstream crypto
driver.

So, remove the dependency.  If it turns out there's actually still a
bug, we'll fix it properly.

Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-01-23 23:56:43 -05:00
Greg Kroah-Hartman
2abbf9a4d2 gfs: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

There is no need to save the dentries for the debugfs files, so drop
those variables to save a bit of space and make the code simpler.

Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: cluster-devel@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-01-23 12:30:34 +01:00
James Morris
9624d5c9c7 Linux 5.0-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxFDv0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGBPsH/3Ij47fut8kwxGSX
 Tmx7Y+VYftRiKSwK3+HxsCvde3scqfkxAukb3HeJDzZdpnouT0k4nqUYQabAANi/
 MdaO+NSBRp/NjzZcpFG9QAroIQ2G2sRQ4E8ldFcNmdsjZWlUfKIHPfYHzvvc06L4
 MhvdkpMa/p51Jz9egQs0kfSvrb6fh4OEDTI19/aaGR0oJBhoGhLrqTI+vdYhMiyO
 wWtUXgZfsmlCBdAQLRh04CxGTc/32VApoB/SwP9sF+xD3gcL0mPFNKUociio6K2Y
 a7u7yuzUKvVwuafVgX9QT+f+je5/5u+WFsG/26cfXzizZoNWW5oDl3sBD3hRNkvt
 J13lB1w=
 =ch+/
 -----END PGP SIGNATURE-----

Merge tag 'v5.0-rc3' into next-general

Sync to Linux 5.0-rc3 to pull in the VFS changes which impacted a lot
of the LSM code.
2019-01-22 14:33:10 -08:00
Tejun Heo
7fc5854f8c writeback: synchronize sync(2) against cgroup writeback membership switches
sync_inodes_sb() can race against cgwb (cgroup writeback) membership
switches and fail to writeback some inodes.  For example, if an inode
switches to another wb while sync_inodes_sb() is in progress, the new
wb might not be visible to bdi_split_work_to_wbs() at all or the inode
might jump from a wb which hasn't issued writebacks yet to one which
already has.

This patch adds backing_dev_info->wb_switch_rwsem to synchronize cgwb
switch path against sync_inodes_sb() so that sync_inodes_sb() is
guaranteed to see all the target wbs and inodes can't jump wbs to
escape syncing.

v2: Fixed misplaced rwsem init.  Spotted by Jiufei.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jiufei Xue <xuejiufei@gmail.com>
Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.com
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-22 14:39:38 -07:00
Ernesto A. Fernández
8b9433eb4d direct-io: allow direct writes to empty inodes
On a DIO_SKIP_HOLES filesystem, the ->get_block() method is currently
not allowed to create blocks for an empty inode.  This confusion comes
from trying to bit shift a negative number, so check the size of the
inode first.

The problem is most visible for hfsplus, because the fallback to
buffered I/O doesn't happen and the write fails with EIO.  This is in
part the fault of the module, because it gives a wrong return value on
->get_block(); that will be fixed in a separate patch.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-22 08:26:44 -07:00
Greg Kroah-Hartman
21acc07d33 f2fs: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Cc: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22 14:25:25 +01:00
Jan Kara
032cdc3979 ext2: Set superblock revision when enabling xattr feature
When setting the first xattr, we automatically enable
EXT2_FEATURE_COMPAT_EXT_ATTR. However we forget to call
ext2_update_dynamic_rev() so in theory if the filesystem was created as
ancient one without features support, this could be missed. The
consequences are minor anyway - since the feature is compat one, only
old e2fsck which does not understand xattrs could do something bad.

Reported-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-01-22 12:35:36 +01:00
Liu Xiang
f6f5014a1d ext2: Remove redundant check on s_inode_size
The case of (EXT2_INODE_SIZE(sb) == 0) is included in
(sbi->s_inode_size < EXT2_GOOD_OLD_INODE_SIZE).
So there is no need to check again.

Signed-off-by: Liu Xiang <liu.xiang6@zte.com.cn>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-01-22 11:38:20 +01:00
Chengguang Xu
6a03e6a8dc ext2: set proper return code
Set proper return code when failing from allocating
memory in ext2_fill_super().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-01-22 11:38:15 +01:00
Sergey Senozhatsky
0eeb27311f debugfs: debugfs_use_start/finish do not exist anymore
debugfs_use_file_start() and debugfs_use_file_finish() do not exist
since commit c9afbec270 ("debugfs: purge obsolete SRCU based removal
protection"); tweak debugfs_create_file_unsafe() comment.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22 10:30:35 +01:00
Yue Hu
182ca6e0ae pstore/ram: Replace dummy_data heap memory with stack memory
In ramoops_register_dummy() dummy_data is allocated via kzalloc()
then it will always occupy the heap space after register platform
device via platform_device_register_data(), but it will not be
used any more. So let's free it for system usage, replace it with
stack memory is better due to small size.

Signed-off-by: Yue Hu <huyue2@yulong.com>
[kees: add required memset and adjust sizeof() argument]
Signed-off-by: Kees Cook <keescook@chromium.org>
2019-01-21 19:32:17 -08:00
Phillip Potter
e108921894 ext2: use common file type conversion
Deduplicate the ext2 file type conversion implementation and remove
EXT2_FT_* definitions - file systems that use the same file types as
defined by POSIX do not need to define their own versions and can
use the common helper functions decared in fs_types.h and implemented
in fs_types.c

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-01-21 17:48:17 +01:00
Phillip Potter
bbe7449e25 fs: common implementation of file type
Many file systems use a copy&paste implementation
of dirent to on-disk file type conversions.

Create a common implementation to be used by file systems
with some useful conversion helpers to reduce open coded
file type conversions in file system code.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
2019-01-21 17:48:13 +01:00
Thomas Gleixner
74827ee295 ceph: quota: cleanup license mess
Precise and non-ambiguous license information is important. The recently
added quota.c file has a SPDX license identifier, which is nice, but
at the same time it has a contradictionary license boiler plate text.

  SPDX-License-Identifier: GPL-2.0

versus

  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License
  * as published by the Free Software Foundation; either version 2
  * of the License, or (at your option) any later version.

Oh well.

As the other ceph related files are licensed under the GPL v2 only, it's
assumed that the SPDX id is correct and the boiler plate was randomly
copied into that patch.

Remove the boiler plate as it is wrong and even if correct it is redundant.

Fixes: fb18a57568 ("ceph: quota: add initial infrastructure to support cephfs quotas")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Luis Henriques <lhenriques@suse.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Sage Weil <sage@redhat.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: ceph-devel@vger.kernel.org
Acked-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-01-21 14:53:23 +01:00
Yan, Zheng
d95e674c01 ceph: clear inode pointer when snap realm gets dropped by its inode
snap realm and corresponding inode have pointers to each other.
The two pointer should get clear at the same time. Otherwise,
snap realm's pointer may reference freed inode.

Cc: stable@vger.kernel.org # 4.17+
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-01-21 14:52:41 +01:00
Linus Torvalds
1e556ba3b6 Fixes for pstore/ram
- Fix console ramoops to show the previous boot logs (Sai Prakash Ranjan)
 - Avoid allocation and leak of platform data
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlxE/QYWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsVXD/94pJsNHogOle3XnZJst/1cV4+R
 KTyQ6QQ1184lI5FPGJliV3veCuKnTY4Q7e5vCl2mzEvds+VRxu2yUm+LFxKVj04W
 yiRNv8Yn3+3eKxNufqLa9ySbsSZyIpoCKQ/20nbC5cNM3r+Gq12GUnDBixtGdl9V
 fB316emIaKR70Qvi9KlfwDTcjh+SWBR2u8L0YsmTaoJkjtaCJhekN3hfya6Wnjf5
 1sj1fCDzteda/Ld0gXoImHPvt0UcHPJDp1+ZhY0ja64QIuoWwDxsisp4X4nS8iFD
 oak9BWRDhRwSzPnyCu8BI3Hc4ayxcBYR/vmBS12LTkIJVdR1CaqfeDYvZexyX04n
 TYZnmPeZ+2R3wzA9aCLjUDEAYNi403sLqIvR99jlrOoiMo+5Bf05TvBG8lND2dQg
 A13854vg9ssxfiuEmvxJTg8SLtcFiqEtrzZhqaLVfuf4cqaN3o0Ed+A0gjZadwwD
 TPxiMh4YEZuw+qszZwyzi7uECKOvJIJeLCL8CApPvihCNFy2vKF5wj58ZizXXmoT
 Orq6DJ1ITx4DTkYIgtiTC9HKDuvNDZ/ncdb5QoqlpoTe01r8GQOzffZ+tD3xRTNc
 8+yAdK0/bTjLVNbFX/Q4dwagh0I8EQSrdw8CmmQs0lsCvfyYTfHz8uPeBQ7rrzeN
 2r1H7a/OKpL/duTkEw==
 =m6sX
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore fixes from Kees Cook:

 - Fix console ramoops to show the previous boot logs (Sai Prakash
   Ranjan)

 - Avoid allocation and leak of platform data

* tag 'pstore-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore/ram: Avoid allocation and leak of platform data
  pstore/ram: Fix console ramoops to show the previous boot logs
2019-01-21 13:12:03 +13:00
Kees Cook
5631e8576a pstore/ram: Avoid allocation and leak of platform data
Yue Hu noticed that when parsing device tree the allocated platform data
was never freed. Since it's not used beyond the function scope, this
switches to using a stack variable instead.

Reported-by: Yue Hu <huyue2@yulong.com>
Fixes: 35da60941e ("pstore/ram: add Device Tree bindings")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2019-01-20 14:44:52 -08:00
Linus Torvalds
1be969f468 for-5.0-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlxElHsACgkQxWXV+ddt
 WDsF8Q/6A+l/Ku8D+xSmw+JGXGNroX1V62sxVYWqIgrkYNg6iSMjicqF3aN1bCbR
 LqvOQ5skerrKteNYIPbTTOD5Xp37ccinSeEWEF0ktFkeU1G6yJo7aRsnQlO1sduk
 2PKUOA1/PdeTBiOj1bej/1ybhtIW+d0MaoPtnUCMC8DD/ihfmU332+KC8VmRUYGZ
 4kT1DvKfBOSVz1UTl6OJdWo76crvjz0eGVnH1YG7DoFbpVzAbphHE7+aC/WkHQme
 X5Ux2NtYVMe/0IGAC7kJnj4jCZ1weAdlmvzzagjzGtWdhWgTxntiJ/FVJs9nO8Dm
 G/pVtD8RVjgMkPOWcfw5fdderrBJqjVGgl4VDrDLqjO9OTGNFJs+HgcPRi6Oli28
 sA+HG+U34YzSfKY0L9eAmpkNxMjWywBuXTQIAlMhHNZCL0vZ2K5EYa3dTp9OswAW
 IIcOh/LfZxiomvMvUqQWcRCy5y/b+cYjOjbHwkrw+ewd3IWXVLG8YLMyZI3vnHKu
 /f1xn6KCap9a1cS4LwyK6gzstEugn0MYmnmD/Jx8I1BJFBt55Q31ES6tPHgTmh/d
 QjveRjMkxNCql4h5Hq0+LiXoSoocBmsO0wrs2QrWSx4PBnJsvjySnWr/8GfAOj79
 BhnuQFxbr/BkNyBzvKrjoI+zZnrVm0cBU59lP6PzN75+kQTaIfs=
 =hGlM
 -----END PGP SIGNATURE-----

Merge tag 'for-5.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A handful of fixes (some of them in testing for a long time):

   - fix some test failures regarding cleanup after transaction abort

   - revert of a patch that could cause a deadlock

   - delayed iput fixes, that can help in ENOSPC situation when there's
     low space and a lot data to write"

* tag 'for-5.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: wakeup cleaner thread when adding delayed iput
  btrfs: run delayed iputs before committing
  btrfs: wait on ordered extents on abort cleanup
  btrfs: handle delayed ref head accounting cleanup in abort
  Revert "btrfs: balance dirty metadata pages in btrfs_finish_ordered_io"
2019-01-21 07:35:26 +13:00
Linus Torvalds
b0efca46b5 NFS client fixes for Linux 5.0
Stable bugfixes:
 - Fix TCP receive code on archs with flush_dcache_page()
 
 Other bugfixes:
 - Fix error code in rpcrdma_buffer_create()
 - Fix a double free in rpcrdma_send_ctxs_create()
 - Fix kernel BUG at kernel/cred.c:825
 - Fix unnecessary retry in nfs42_proc_copy_file_range()
 - Ensure rq_bytes_sent is reset before request transmission
 - Ensure we respect the RPCSEC_GSS sequence number limit
 - Address Kerberos performance/behavior regression
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlxCH1kACgkQ18tUv7Cl
 QOs4rBAAymqyhUzNgap1TX/KezFxqii7CVMVabrA5eGN+ZXbSVAZkwy7BMZWwVIp
 tEvD7lxWtF11x7bQDw7Xz+ruBCjLdD0RQIFnlBpVKqsRy9oSRA4PsgSbuIFaw+gX
 Bun4Z0xmOCPF7knRv6gQonArEZfHeokIIN8AtSBtWVByaOrnZwgDkNTIub8akpUl
 FQlzgq7lTydVzNcju2ImBeubU7KgFEu0F2Zub5z/iR+F2Mx/bAju8Q4YeVlPyD8U
 QJoIBlXAvgK8LK4bZCh40zPeEt0TMWXnW7o0JHgVQ0g6VbT+hp17I7fz91xEazye
 qbjpIJIjv5daEv0REM8t5ZCZB3tEatVjb4EQWXp0gJYb0l5E3I/O+7MO44n4uMYx
 s3UTxzM6NjwCtlgmn4tYUj+vEIExQHUUnwOl02e5iEa7bqNNY75ehAhj5Rh7iQBH
 H4b+OVuqc608q87rNePdK1LRyh0/u1cDI1kDAQoIP2omlb5hJQGk0Nuz9G2BodIj
 rP0x7nV+ykOXZtr6TR+RvaksL1W39PzVKYA0aL+e2gbcv4YO+Oq1phvNKwRWPM4a
 g08r/kvifS5h6/Jq8Wmn83f1vAOX7Sf23RtEoj+t9hc4S4JbsV2iYK3PY3eWbSYE
 Oz0Vt4gvBBJ+0rHJ10BsQ7686OQkyMKpIlvmx6O5mWVlthovbJM=
 =6Nzz
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.0-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "These are mostly fixes for SUNRPC bugs, with a single v4.2
  copy_file_range() fix mixed in.

  Stable bugfixes:
   - Fix TCP receive code on archs with flush_dcache_page()

  Other bugfixes:
   - Fix error code in rpcrdma_buffer_create()
   - Fix a double free in rpcrdma_send_ctxs_create()
   - Fix kernel BUG at kernel/cred.c:825
   - Fix unnecessary retry in nfs42_proc_copy_file_range()
   - Ensure rq_bytes_sent is reset before request transmission
   - Ensure we respect the RPCSEC_GSS sequence number limit
   - Address Kerberos performance/behavior regression"

* tag 'nfs-for-5.0-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  SUNRPC: Address Kerberos performance/behavior regression
  SUNRPC: Ensure we respect the RPCSEC_GSS sequence number limit
  SUNRPC: Ensure rq_bytes_sent is reset before request transmission
  NFSv4.2 fix unnecessary retry in nfs4_copy_file_range
  sunrpc: kernel BUG at kernel/cred.c:825!
  SUNRPC: Fix TCP receive code on archs with flush_dcache_page()
  xprtrdma: Double free in rpcrdma_sendctxs_create()
  xprtrdma: Fix error code in rpcrdma_buffer_create()
2019-01-20 09:27:38 +12:00
Linus Torvalds
0facb89245 for-linus-20190118
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlxCLxcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmBID/0SpsLsnoeki37Wei0HzBZeRdFGIXFfg1XB
 +s66nJUL+igHIIwH+EO7Ct+vw8jfCWZfasI+ThGtkgRg88bqJO3HEbE1MQY8KlTh
 Dm9LRHH9Jp4OYbbag2RXammaD4XKi980fIyVTp9R0keszQroSUTXXS40fksszTtB
 K4LEBiAxGJgin60W9RgbbK93n/pw17U5Wu2WID6ZfVxZYVzpXl0iIQXUz1zmnpy5
 3aP7ORqJGnH7WQiV57LMpq5/QFBoKrfUsjtxpnJ+i2694sLYX8NCqnH82SUliSU+
 NN3Vxic1ozmZEADFKg4dfmgVqqyJaoVCRvsA5i5YjqG9m65G/S7/BmsaWbq5Ixf2
 N1KQr/36vkYGYWqtb2rbYRVf9XfbxWKp1wuSaX3N1Vh1/Ee180qD68+wpa/0mtkh
 2dXV6CQP9De4BXH6mxj3Yr7LI2a2r7KLYhULZCAvLMvDtsBrPyHAmRRgXxaLlKL8
 QMFYYO/1wjlzpGnn4IxP8sDMH6N2shtZHAmC2n9TM2gmk5tA0OmajJJ4uxNbtyhw
 +ZlDco3aw/x0v2onjlbTiPij2cReRgwN+tVaEvn0dj3uxaisCru60iYdnr9FX4nL
 g3NxNUJbUhl6YMwvNMI0GKfz1IILzjPgJIhDqSHvYhtCA1w0DMVXs3Qv4HXoVh3m
 ffdby6/Yqw==
 =5yJp
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190118' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - block size setting fixes for loop/nbd (Jan Kara)

 - md bio_alloc_mddev() cleanup (Marcos)

 - Ensure we don't lose the REQ_INTEGRITY flag (Ming)

 - Two NVMe fixes by way of Christoph:
    - Fix NVMe IRQ calculation (Ming)
    - Uninitialized variable in nvmet-tcp (Sagi)

 - BFQ comment fix (Paolo)

 - License cleanup for recently added blk-mq-debugfs-zoned (Thomas)

* tag 'for-linus-20190118' of git://git.kernel.dk/linux-block:
  block: Cleanup license notice
  nvme-pci: fix nvme_setup_irqs()
  nvmet-tcp: fix uninitialized variable access
  block: don't lose track of REQ_INTEGRITY flag
  blockdev: Fix livelocks on loop device
  nbd: Use set_blocksize() to set device blocksize
  md: Make bio_alloc_mddev use bio_alloc_bioset
  block, bfq: fix comments on __bfq_deactivate_entity
2019-01-20 09:12:50 +12:00
Josef Bacik
fd340d0f68 btrfs: wakeup cleaner thread when adding delayed iput
The cleaner thread usually takes care of delayed iputs, with the
exception of the btrfs_end_transaction_throttle path.  Delaying iputs
means we are potentially delaying the eviction of an inode and it's
respective space.  The cleaner thread only gets woken up every 30
seconds, or when we require space.  If there are a lot of inodes that
need to be deleted we could induce a serious amount of latency while we
wait for these inodes to be evicted.  So instead wakeup the cleaner if
it's not already awake to process any new delayed iputs we add to the
list.  If we suddenly need space we will less likely be backed up
behind a bunch of inodes that are waiting to be deleted, and we could
possibly free space before we need to get into the flushing logic which
will save us some latency.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-18 17:27:23 +01:00
Josef Bacik
3ec9a4c81c btrfs: run delayed iputs before committing
Delayed iputs means we can have final iputs of deleted inodes in the
queue, which could potentially generate a lot of pinned space that could
be free'd.  So before we decide to commit the transaction for ENOPSC
reasons, run the delayed iputs so that any potential space is free'd up.
If there is and we freed enough we can then commit the transaction and
potentially be able to make our reservation.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-18 17:27:21 +01:00
Josef Bacik
74d5d229b1 btrfs: wait on ordered extents on abort cleanup
If we flip read-only before we initiate writeback on all dirty pages for
ordered extents we've created then we'll have ordered extents left over
on umount, which results in all sorts of bad things happening.  Fix this
by making sure we wait on ordered extents if we have to do the aborted
transaction cleanup stuff.

generic/475 can produce this warning:

 [ 8531.177332] WARNING: CPU: 2 PID: 11997 at fs/btrfs/disk-io.c:3856 btrfs_free_fs_root+0x95/0xa0 [btrfs]
 [ 8531.183282] CPU: 2 PID: 11997 Comm: umount Tainted: G        W 5.0.0-rc1-default+ #394
 [ 8531.185164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
 [ 8531.187851] RIP: 0010:btrfs_free_fs_root+0x95/0xa0 [btrfs]
 [ 8531.193082] RSP: 0018:ffffb1ab86163d98 EFLAGS: 00010286
 [ 8531.194198] RAX: ffff9f3449494d18 RBX: ffff9f34a2695000 RCX:0000000000000000
 [ 8531.195629] RDX: 0000000000000002 RSI: 0000000000000001 RDI:0000000000000000
 [ 8531.197315] RBP: ffff9f344e930000 R08: 0000000000000001 R09:0000000000000000
 [ 8531.199095] R10: 0000000000000000 R11: ffff9f34494d4ff8 R12:ffffb1ab86163dc0
 [ 8531.200870] R13: ffff9f344e9300b0 R14: ffffb1ab86163db8 R15:0000000000000000
 [ 8531.202707] FS:  00007fc68e949fc0(0000) GS:ffff9f34bd800000(0000)knlGS:0000000000000000
 [ 8531.204851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [ 8531.205942] CR2: 00007ffde8114dd8 CR3: 000000002dfbd000 CR4:00000000000006e0
 [ 8531.207516] Call Trace:
 [ 8531.208175]  btrfs_free_fs_roots+0xdb/0x170 [btrfs]
 [ 8531.210209]  ? wait_for_completion+0x5b/0x190
 [ 8531.211303]  close_ctree+0x157/0x350 [btrfs]
 [ 8531.212412]  generic_shutdown_super+0x64/0x100
 [ 8531.213485]  kill_anon_super+0x14/0x30
 [ 8531.214430]  btrfs_kill_super+0x12/0xa0 [btrfs]
 [ 8531.215539]  deactivate_locked_super+0x29/0x60
 [ 8531.216633]  cleanup_mnt+0x3b/0x70
 [ 8531.217497]  task_work_run+0x98/0xc0
 [ 8531.218397]  exit_to_usermode_loop+0x83/0x90
 [ 8531.219324]  do_syscall_64+0x15b/0x180
 [ 8531.220192]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 [ 8531.221286] RIP: 0033:0x7fc68e5e4d07
 [ 8531.225621] RSP: 002b:00007ffde8116608 EFLAGS: 00000246 ORIG_RAX:00000000000000a6
 [ 8531.227512] RAX: 0000000000000000 RBX: 00005580c2175970 RCX:00007fc68e5e4d07
 [ 8531.229098] RDX: 0000000000000001 RSI: 0000000000000000 RDI:00005580c2175b80
 [ 8531.230730] RBP: 0000000000000000 R08: 00005580c2175ba0 R09:00007ffde8114e80
 [ 8531.232269] R10: 0000000000000000 R11: 0000000000000246 R12:00005580c2175b80
 [ 8531.233839] R13: 00007fc68eac61c4 R14: 00005580c2175a68 R15:0000000000000000

Leaving a tree in the rb-tree:

3853 void btrfs_free_fs_root(struct btrfs_root *root)
3854 {
3855         iput(root->ino_cache_inode);
3856         WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));

CC: stable@vger.kernel.org
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add stacktrace ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-18 17:24:19 +01:00
Josef Bacik
31890da0bf btrfs: handle delayed ref head accounting cleanup in abort
We weren't doing any of the accounting cleanup when we aborted
transactions.  Fix this by making cleanup_ref_head_accounting global and
calling it from the abort code, this fixes the issue where our
accounting was all wrong after the fs aborts.

The test generic/475 on a 2G VM can trigger the problems eg.:

  [ 8502.136957] WARNING: CPU: 0 PID: 11064 at fs/btrfs/extent-tree.c:5986 btrfs_free_block_grou +ps+0x3dc/0x410 [btrfs]
  [ 8502.148372] CPU: 0 PID: 11064 Comm: umount Not tainted 5.0.0-rc1-default+ #394
  [ 8502.150807] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626 +cc-prebuilt.qemu-project.org 04/01/2014
  [ 8502.154317] RIP: 0010:btrfs_free_block_groups+0x3dc/0x410 [btrfs]
  [ 8502.160623] RSP: 0018:ffffb1ab84b93de8 EFLAGS: 00010206
  [ 8502.161906] RAX: 0000000001000000 RBX: ffff9f34b1756400 RCX: 0000000000000000
  [ 8502.163448] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff9f34b1755400
  [ 8502.164906] RBP: ffff9f34b7e8c000 R08: 0000000000000001 R09: 0000000000000000
  [ 8502.166716] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9f34b7e8c108
  [ 8502.168498] R13: ffff9f34b7e8c158 R14: 0000000000000000 R15: dead000000000100
  [ 8502.170296] FS:  00007fb1cf15ffc0(0000) GS:ffff9f34bd400000(0000) knlGS:0000000000000000
  [ 8502.172439] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 8502.173669] CR2: 00007fb1ced507b0 CR3: 000000002f7a6000 CR4: 00000000000006f0
  [ 8502.175094] Call Trace:
  [ 8502.175759]  close_ctree+0x17f/0x350 [btrfs]
  [ 8502.176721]  generic_shutdown_super+0x64/0x100
  [ 8502.177702]  kill_anon_super+0x14/0x30
  [ 8502.178607]  btrfs_kill_super+0x12/0xa0 [btrfs]
  [ 8502.179602]  deactivate_locked_super+0x29/0x60
  [ 8502.180595]  cleanup_mnt+0x3b/0x70
  [ 8502.181406]  task_work_run+0x98/0xc0
  [ 8502.182255]  exit_to_usermode_loop+0x83/0x90
  [ 8502.183113]  do_syscall_64+0x15b/0x180
  [ 8502.183919]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Corresponding to

  release_global_block_rsv() {
  ...
  WARN_ON(fs_info->delayed_refs_rsv.reserved > 0);

CC: stable@vger.kernel.org
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add log dump ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-18 17:10:04 +01:00
David Sterba
77b7aad195 Revert "btrfs: balance dirty metadata pages in btrfs_finish_ordered_io"
This reverts commit e73e81b6d0.

This patch causes a few problems:

- adds latency to btrfs_finish_ordered_io
- as btrfs_finish_ordered_io is used for free space cache, generating
  more work from btrfs_btree_balance_dirty_nodelay could end up in the
  same workque, effectively deadlocking

12260 kworker/u96:16+btrfs-freespace-write D
[<0>] balance_dirty_pages+0x6e6/0x7ad
[<0>] balance_dirty_pages_ratelimited+0x6bb/0xa90
[<0>] btrfs_finish_ordered_io+0x3da/0x770
[<0>] normal_work_helper+0x1c5/0x5a0
[<0>] process_one_work+0x1ee/0x5a0
[<0>] worker_thread+0x46/0x3d0
[<0>] kthread+0xf5/0x130
[<0>] ret_from_fork+0x24/0x30
[<0>] 0xffffffffffffffff

Transaction commit will wait on the freespace cache:

838 btrfs-transacti D
[<0>] btrfs_start_ordered_extent+0x154/0x1e0
[<0>] btrfs_wait_ordered_range+0xbd/0x110
[<0>] __btrfs_wait_cache_io+0x49/0x1a0
[<0>] btrfs_write_dirty_block_groups+0x10b/0x3b0
[<0>] commit_cowonly_roots+0x215/0x2b0
[<0>] btrfs_commit_transaction+0x37e/0x910
[<0>] transaction_kthread+0x14d/0x180
[<0>] kthread+0xf5/0x130
[<0>] ret_from_fork+0x24/0x30
[<0>] 0xffffffffffffffff

And then writepages ends up waiting on transaction commit:

9520 kworker/u96:13+flush-btrfs-1 D
[<0>] wait_current_trans+0xac/0xe0
[<0>] start_transaction+0x21b/0x4b0
[<0>] cow_file_range_inline+0x10b/0x6b0
[<0>] cow_file_range.isra.69+0x329/0x4a0
[<0>] run_delalloc_range+0x105/0x3c0
[<0>] writepage_delalloc+0x119/0x180
[<0>] __extent_writepage+0x10c/0x390
[<0>] extent_write_cache_pages+0x26f/0x3d0
[<0>] extent_writepages+0x4f/0x80
[<0>] do_writepages+0x17/0x60
[<0>] __writeback_single_inode+0x59/0x690
[<0>] writeback_sb_inodes+0x291/0x4e0
[<0>] __writeback_inodes_wb+0x87/0xb0
[<0>] wb_writeback+0x3bb/0x500
[<0>] wb_workfn+0x40d/0x610
[<0>] process_one_work+0x1ee/0x5a0
[<0>] worker_thread+0x1e0/0x3d0
[<0>] kthread+0xf5/0x130
[<0>] ret_from_fork+0x24/0x30
[<0>] 0xffffffffffffffff

Eventually, we have every process in the system waiting on
balance_dirty_pages(), and nobody is able to make progress on page
writeback.

The original patch tried to fix an OOM condition, that happened on 4.4 but no
success reproducing that on later kernels (4.19 and 4.20). This is more likely
a problem in OOM itself.

Link: https://lore.kernel.org/linux-btrfs/20180528054821.9092-1-ethanlien@synology.com/
Reported-by: Chris Mason <clm@fb.com>
CC: stable@vger.kernel.org # 4.18+
CC: ethanlien <ethanlien@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-18 17:09:55 +01:00
Stephen Martin
4bd4e92cfe sysfs: fix blank line coding style warning
Fixed a coding style issue.

Signed-off-by: Stephen Martin <lockwood@opperline.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-18 16:45:17 +01:00
Linus Torvalds
a3a80255d5 AFS fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXECcyPu3V2unywtrAQIpGw//ctcGHg2sfUEra17pvlKEbpZe9bMeJxp6
 UY2YR5gPpiYMmvNe8hLl3I68b43h03jGOx+KmqowqX7anNq3o2nMy0pDbuGmKtuS
 5NmIOECAml8k2uPSpacAF2s6TsxB2lTDYwZdyeuRmZ4scOTujNby33RlijGIxX4s
 87WJFRuCacm9I1KkiKKn4PWoYGjDdsZ7rsDyEeBmQ/MiKOSLG4QP5XuNr4X9zFMX
 r8uF3N8h/NzJWefEirc2DPFfiWLJqkyclq9tgsTB1Z3l+x5u/MHnIg3rZpZhH0uC
 GhGWjlGYqTxOwzYCaOOsNIDRF4rAGPi3lzuJXjONnhvbOO7DCGJ+Mo2obxkAqLL6
 PrtFuQvgXIOl/k8y6AdckuPPu/OMHT1hyY0PQXmTGhHAAfPP3RHoPl7owEjmAbRg
 hvRkFDSIKZ4Kr1nKP4vwaJYEKtxUQrkOwZKmN6ve31ZeJsyrH12MsCgWvp7oQwRJ
 fVbk8DWVRtYzy4RaO3Xr+0WfD+03dDi6KKCPPiC2gtNKOO+1Kco/EtIqPi6SXg0m
 ee/mFOkRsmEh++iNxS58qLxH37On6GSYOElSIMN0NJDNA2TzLrvidXlMSOTD1hj4
 n6gL38E3br/CIimKPYm87qi6yC59CAtrJCulYiPOoMc20eaEUP4DIXN8yjgjLNQp
 Hj5M9GRTwXk=
 =cUrU
 -----END PGP SIGNATURE-----

Merge tag 'afs-fixes-20190117' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull AFS fixes from David Howells:
 "Here's a set of fixes for AFS:

   - Use struct_size() for kzalloc() size calculation.

   - When calling YFS.CreateFile rather than AFS.CreateFile, it is
     possible to create a file with a file lock already held. The
     default value indicating no lock required is actually -1, not 0.

   - Fix an oops in inode/vnode validation if the target inode doesn't
     have a server interest assigned (ie. a server that will notify us
     of changes by third parties).

   - Fix refcounting of keys in file locking.

   - Fix a race in refcounting asynchronous operations in the event of
     an error during request transmission. The provision of a dedicated
     function to get an extra ref on a call is split into a separate
     commit"

* tag 'afs-fixes-20190117' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix race in async call refcounting
  afs: Provide a function to get a ref on a call
  afs: Fix key refcounting in file locking code
  afs: Don't set vnode->cb_s_break in afs_validate()
  afs: Set correct lock type for the yfs CreateFile
  afs: Use struct_size() in kzalloc()
2019-01-18 06:27:24 +12:00
Sai Prakash Ranjan
6a4c9ab13f pstore/ram: Fix console ramoops to show the previous boot logs
commit b05c950698 ("pstore/ram: Simplify ramoops_get_next_prz()
arguments") changed update assignment in getting next persistent ram zone
by adding a check for record type. But the check always returns true since
the record type is assigned 0. And this breaks console ramoops by showing
current console log instead of previous log on warm reset and hard reset
(actually hard reset should not be showing any logs).

Fix this by having persistent ram zone type check instead of record type
check. Tested this on SDM845 MTP and dragonboard 410c.

Reproducing this issue is simple as below:

1. Trigger hard reset and mount pstore. Will see console-ramoops
   record in the mounted location which is the current log.

2. Trigger warm reset and mount pstore. Will see the current
   console-ramoops record instead of previous record.

Fixes: b05c950698 ("pstore/ram: Simplify ramoops_get_next_prz() arguments")
Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[kees: dropped local variable usage]
Signed-off-by: Kees Cook <keescook@chromium.org>
2019-01-17 09:14:06 -08:00
Al Viro
6d7fbce7da kill kernfs_pin_sb()
unused now and impossible to use safely anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-17 12:02:57 -05:00
Al Viro
399504e21a fix cgroup_do_mount() handling of failure exits
same story as with last May fixes in sysfs (7b745a4e40
"unfuck sysfs_mount()"); new_sb is left uninitialized
in case of early errors in kernfs_mount_ns() and papering
over it by treating any error from kernfs_mount_ns() as
equivalent to !new_ns ends up conflating the cases when
objects had never been transferred to a superblock with
ones when that has happened and resulting new superblock
had been dropped.  Easily fixed (same way as in sysfs
case).  Additionally, there's a superblock leak on
kernfs_node_dentry() failure *and* a dentry leak inside
kernfs_node_dentry() itself - the latter on probably
impossible errors, but the former not impossible to trigger
(as the matter of fact, injecting allocation failures
at that point *does* trigger it).

Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-17 11:52:49 -05:00
David Howells
34fa47612b afs: Fix race in async call refcounting
There's a race between afs_make_call() and afs_wake_up_async_call() in the
case that an error is returned from rxrpc_kernel_send_data() after it has
queued the final packet.

afs_make_call() will try and clean up the mess, but the call state may have
been moved on thereby causing afs_process_async_call() to also try and to
delete the call.

Fix this by:

 (1) Getting an extra ref for an asynchronous call for the call itself to
     hold.  This makes sure the call doesn't evaporate on us accidentally
     and will allow the call to be retained by the caller in a future
     patch.  The ref is released on leaving afs_make_call() or
     afs_wait_for_call_to_complete().

 (2) In the event of an error from rxrpc_kernel_send_data():

     (a) Don't set the call state to AFS_CALL_COMPLETE until *after* the
     	 call has been aborted and ended.  This prevents
     	 afs_deliver_to_call() from doing anything with any notifications
     	 it gets.

     (b) Explicitly end the call immediately to prevent further callbacks.

     (c) Cancel any queued async_work and wait for the work if it's
     	 executing.  This allows us to be sure the race won't recur when we
     	 change the state.  We put the work queue's ref on the call if we
     	 managed to cancel it.

     (d) Put the call's ref that we got in (1).  This belongs to us as long
     	 as the call is in state AFS_CALL_CL_REQUESTING.

Fixes: 341f741f04 ("afs: Refcount the afs_call struct")
Signed-off-by: David Howells <dhowells@redhat.com>
2019-01-17 15:17:28 +00:00
David Howells
7a75b0079a afs: Provide a function to get a ref on a call
Provide a function to get a reference on an afs_call struct.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-01-17 15:17:28 +00:00
David Howells
59d49076ae afs: Fix key refcounting in file locking code
Fix the refcounting of the authentication keys in the file locking code.
The vnode->lock_key member points to a key on which it expects to be
holding a ref, but it isn't always given an extra ref, however.

Fixes: 0fafdc9f88 ("afs: Fix file locking")
Signed-off-by: David Howells <dhowells@redhat.com>
2019-01-17 15:17:28 +00:00
Marc Dionne
4882a27cec afs: Don't set vnode->cb_s_break in afs_validate()
A cb_interest record is not necessarily attached to the vnode on entry to
afs_validate(), which can cause an oops when we try to bring the vnode's
cb_s_break up to date in the default case (ie. no current callback promise
and the vnode has not been deleted).

Fix this by simply removing the line, as vnode->cb_s_break will be set when
needed by afs_register_server_cb_interest() when we next get a callback
promise from RPC call.

The oops looks something like:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    ...
    RIP: 0010:afs_validate+0x66/0x250 [kafs]
    ...
    Call Trace:
     afs_d_revalidate+0x8d/0x340 [kafs]
     ? __d_lookup+0x61/0x150
     lookup_dcache+0x44/0x70
     ? lookup_dcache+0x44/0x70
     __lookup_hash+0x24/0xa0
     do_unlinkat+0x11d/0x2c0
     __x64_sys_unlink+0x23/0x30
     do_syscall_64+0x4d/0xf0
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: ae3b7361dc ("afs: Fix validation/callback interaction")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2019-01-17 15:15:52 +00:00
Miklos Szeredi
a2ebba8241 fuse: decrement NR_WRITEBACK_TEMP on the right page
NR_WRITEBACK_TEMP is accounted on the temporary page in the request, not
the page cache page.

Fixes: 8b284dc472 ("fuse: writepages: handle same page rewrites")
Cc: <stable@vger.kernel.org> # v3.13
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-01-16 10:27:59 +01:00
Jann Horn
9509941e9c fuse: call pipe_buf_release() under pipe lock
Some of the pipe_buf_release() handlers seem to assume that the pipe is
locked - in particular, anon_pipe_buf_release() accesses pipe->tmp_page
without taking any extra locks. From a glance through the callers of
pipe_buf_release(), it looks like FUSE is the only one that calls
pipe_buf_release() without having the pipe locked.

This bug should only lead to a memory leak, nothing terrible.

Fixes: dd3bb14f44 ("fuse: support splice() writing to fuse device")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-01-16 10:27:59 +01:00
Miklos Szeredi
8a3177db59 cuse: fix ioctl
cuse_process_init_reply() doesn't initialize fc->max_pages and thus all
cuse bases ioctls fail with ENOMEM.

Reported-by: Andreas Steinmetz <ast@domdv.de>
Fixes: 5da784cce4 ("fuse: add max_pages to init_out")
Cc: <stable@vger.kernel.org> # v4.20
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-01-16 10:27:59 +01:00