Commit Graph

51627 Commits

Author SHA1 Message Date
Steve French
23586b66d8 Fix SMB3.1.1 guest authentication to Samba
Samba rejects SMB3.1.1 dialect (vers=3.1.1) negotiate requests from
the kernel client due to the two byte pad at the end of the negotiate
contexts.

CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2017-09-19 20:22:14 -05:00
Al Viro
3968cf6238 get_compat_sigset()
similar to put_compat_sigset()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-19 17:56:01 -04:00
Deepa Dinamani
fa2e62a540 io_getevents: Use timespec64 to represent timeouts
struct timespec is not y2038 safe. Use y2038 safe
struct timespec64 to represent timeouts.
The system call interface itself will be changed as
part of different series.

Timeouts will not really need more than 32 bits.
But, replacing these with timespec64 helps verification
of a y2038 safe kernel by getting rid of timespec
internally.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: linux-aio@kvack.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-19 17:55:59 -04:00
Deepa Dinamani
36819ad093 select: Use get/put_timespec64
Usage of these apis and their compat versions makes
the syscalls: select family of syscalls and their
compat implementations simpler.

This is a preparatory patch to isolate data conversions to
struct timespec64 at userspace boundaries. This helps contain
the changes needed to transition to new y2038 safe types.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-19 17:55:58 -04:00
Yan, Zheng
717e6f2893 ceph: avoid panic in create_session_open_msg() if utsname() returns NULL
utsname() can return NULL while process is exiting. Kernel releases
file locks during process exits. We send request to mds when releasing
file lock. So it's possible that we open mds session while process is
exiting. utsname() is called in create_session_open_msg().

Link: http://tracker.ceph.com/issues/21275
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
[idryomov@gmail.com: drop utsname.h include from mds_client.c]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-19 21:04:52 +02:00
Linus Torvalds
24420862bf Convert default dialect to smb2.1 or later to allow connecting to Windows 7 for example, also includes some fixes for stable
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQGcBAABAgAGBQJZwBUiAAoJEIosvXAHck9RbygL/0jG69MaHf1HSZdZQDmHQZFk
 9U3M8vhufxWx4bkM574Hm3TJJMZ4X5GNLB9oqOKdHa6Ciz2XRqwsGkIuqWuGCFmo
 vwBtHGwz3iGVaOb1LMzqxRKG/W1LJgGFZe57ZwGhwglDa6rMn46ygT11e8fH0+g1
 OI1OhUxDqLE8EDLHcfAaTlrRze8tNnYiEsRYU7qx6k4yeh5r3o9UMU3dqdMyEaw2
 pM2xLyBp8pZDcfCMrZwSoSFS4zt8eD5C9l7TDsoixxhChf2cVGA1nfbQUPYHfVd4
 tV4MmpHZtD9Uay4+L+tUAv1lAmuM93AMoltLXk34RojiPLjWzr5MSqdo2L0bEjYe
 SaItonZhOA/Y+A6VpvRKQBxOmSdgNhcyenUsQE8ybifPAksqIGlzc3Ba9iGLcVrp
 Izz4WARBrDAXC+FTt9N6kfVSN3QGIrYeZ2uwYmZCadK7GHu6O2s3wtYXa/0mewXG
 9s7s26kQwvJUAIcK3xu61HiJBx0jCW5DjKeejoaXxw==
 =sn5Y
 -----END PGP SIGNATURE-----

Merge tag '4.14-smb3-multidialect-support-and-fixes-for-stable' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Convert default dialect to smb2.1 or later to allow connecting to
  Windows 7 for example, also includes some fixes for stable"

* tag '4.14-smb3-multidialect-support-and-fixes-for-stable' of git://git.samba.org/sfrench/cifs-2.6:
  Update version of cifs module
  cifs: hide unused functions
  SMB3: Add support for multidialect negotiate (SMB2.1 and later)
  CIFS/SMB3: Update documentation to reflect SMB3 and various changes
  cifs: check rsp for NULL before dereferencing in SMB2_open
2017-09-19 08:35:42 -07:00
Eric W. Biederman
54640d2387 fcntl: Don't set si_code to SI_SIGIO when sig == SIGPOLL
When fixing things to avoid ambiguous cases I had a thinko
and included SIGPOLL/SIGIO in with all of the other signals
that have signal specific si_codes.  Which is completely wrong.

Fix that.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-09-18 22:51:14 -05:00
Arnd Bergmann
0ab0b271bf isofs: fix build regression
The new isofs_show_options() function fails to build when CONFIG_NLS
is disabled:

fs/isofs/inode.c: In function 'isofs_show_options':
fs/isofs/inode.c:518:44: error: 'CONFIG_NLS_DEFAULT' undeclared (first use in this function)
fs/isofs/inode.c:518:44: note: each undeclared identifier is reported only once for each function it appears in

This adds a check for CONFIG_JOLIET (which selects NLS), matching
the other uses of the iocharset handling in this file.

Fixes: 6fecb86a44f5 ("isofs: Implement show_options")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-09-18 12:24:26 +02:00
Konstantin Khlebnikov
0a51fb7174 quota: add missing lock into __dquot_transfer()
Lock dq_dqb_lock around dquot_decr_inodes()

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 7b9ca4c61b ("quota: Reduce contention on dq_data_lock")
Signed-off-by: Jan Kara <jack@suse.cz>
2017-09-18 12:23:58 +02:00
Steve French
94a9daeaec Update version of cifs module
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-09-17 23:10:48 -05:00
Arnd Bergmann
1368f155a9 cifs: hide unused functions
The newly added SMB2+ attribute support causes unused function
warnings when CONFIG_CIFS_XATTR is disabled:

fs/cifs/smb2ops.c:563:1: error: 'smb2_set_ea' defined but not used [-Werror=unused-function]
 smb2_set_ea(const unsigned int xid, struct cifs_tcon *tcon,
fs/cifs/smb2ops.c:513:1: error: 'smb2_query_eas' defined but not used [-Werror=unused-function]
 smb2_query_eas(const unsigned int xid, struct cifs_tcon *tcon,

This adds another #ifdef around the affected functions.

Fixes: 5517554e43 ("cifs: Add support for writing attributes on SMB2+")
Fixes: 95907fea4f ("cifs: Add support for reading attributes on SMB2+")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-09-17 23:10:48 -05:00
Steve French
9764c02fcb SMB3: Add support for multidialect negotiate (SMB2.1 and later)
With the need to discourage use of less secure dialect, SMB1 (CIFS),
we temporarily upgraded the dialect to SMB3 in 4.13, but since there
are various servers which only support SMB2.1 (2.1 is more secure
than CIFS/SMB1) but not optimal for a default dialect - add support
for multidialect negotiation.  cifs.ko will now request SMB2.1
or later (ie SMB2.1 or SMB3.0, SMB3.02) and the server will
pick the latest most secure one it can support.

In addition since we are sending multidialect negotiate, add
support for secure negotiate to validate that a man in the
middle didn't downgrade us.

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org> # 4.13+
2017-09-17 23:10:48 -05:00
Linus Torvalds
0666f560b7 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc fixes from Thomas Gleixner:

  - A fix for a user space regression in /proc/$PID/stat

  - A couple of objtool fixes:
     ~ Plug a memory leak
     ~ Avoid accessing empty sections which upsets certain binutil
       versions
     ~ Prevent corrupting the obj file when section sizes did not change

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  fs/proc: Report eip/esp in /prod/PID/stat for coredumping
  objtool: Fix object file corruption
  objtool: Do not retrieve data from empty sections
  objtool: Fix memory leak in elf_create_rela_section()
2017-09-17 08:16:36 -07:00
John Ogness
fd7d56270b fs/proc: Report eip/esp in /prod/PID/stat for coredumping
Commit 0a1eb2d474 ("fs/proc: Stop reporting eip and esp in
/proc/PID/stat") stopped reporting eip/esp because it is
racy and dangerous for executing tasks. The comment adds:

    As far as I know, there are no use programs that make any
    material use of these fields, so just get rid of them.

However, existing userspace core-dump-handler applications (for
example, minicoredumper) are using these fields since they
provide an excellent cross-platform interface to these valuable
pointers. So that commit introduced a user space visible
regression.

Partially revert the change and make the readout possible for
tasks with the proper permissions and only if the target task
has the PF_DUMPCORE flag set.

Fixes: 0a1eb2d474 ("fs/proc: Stop reporting eip and esp in> /proc/PID/stat")
Reported-by: Marco Felsch <marco.felsch@preh.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: stable@vger.kernel.org
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/87poatfwg6.fsf@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-09-15 23:31:16 +02:00
Linus Torvalds
30db202e54 Some cleanups and a big bug fix for ACLs. When I was reviewing Jan Kara's
ACL patch, I realized that Orangefs ACL code was busted, not just in the
 kernel module, but in the server as well. I've been working on the
 code in the server mostly, but here's one kernel patch, there
 will be more.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZu8QoAAoJEM9EDqnrzg2+cJIP/2A8cCLk5HZf/DU0R9AUTyZx
 +vZF0xdbesUuOkOAw5phYor31bToD8XsPKY7z/OlUjXJR09V2DukCX4YRDyrOibJ
 nTUxWk1JnCArUeefIkD8N/4EtuQ7n8mdhAeln4//vjzGKXB/BgmTNhXe3+8RTj/t
 IIVV+T99aNth4JD9K7Uux/vtoZA7kSvIbPQHRzFRr38GORJOkjB8b3mluxwjDkS/
 S2DJND/mneQSPeh7VylKGSPHTqQcv9eg83/muyEoaWcd94QT2pZx9ZEYQrZ1hKus
 1Nk1xmaJXqbJ1V+9qKJT9cxiDwE3uz5TQ6JSeB91ca50DcO/Up9EIdPMF0P13J21
 0mh5/OuiffVdBYYPIWOe2KdpXOw9aBQqQyNA2MZdk1hotW0o/FlxNx4qtuVeQpqo
 f3U0hQBQQD86nLylw6QDu5sD8Bxxx4ihM5szuHn9YlStYUgPbBdPZHsFTk2LAmp8
 71UobSnSQGaOop6pfAJXW2y7m790BwYQhK7vSozQtTLMNU7EzPItIT6oQM7pjS3M
 1Kh6a5XXwH+imbiaeLRBsfNA293eDuRhz9wsZeLnCV0Tt34bp+UG0FMfP+gceDQn
 4hFEPnzWVAQpCyJBq4AMCH/fitawQiLcqgPBjiMVOl6pnznd5MK5C4T9XfaNf5R6
 t2JWF/PS+plOa02uq4Fu
 =wR7F
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.14-ofs2' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs updates from Mike Marshall:
 "Some cleanups and a big bug fix for ACLs.

  When I was reviewing Jan Kara's ACL patch, I realized that Orangefs
  ACL code was busted, not just in the kernel module, but in the server
  as well. I've been working on the code in the server mostly, but
  here's one kernel patch, there will be more"

* tag 'for-linus-4.14-ofs2' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: Adjust three checks for null pointers
  orangefs: Use kcalloc() in orangefs_prepare_cdm_array()
  orangefs: Delete error messages for a failed memory allocation in five functions
  orangefs: constify xattr_handler structure
  orangefs: don't call filemap_write_and_wait from fsync
  orangefs: off by ones in xattr size checks
  orangefs: documentation clean up
  orangefs: react properly to posix_acl_update_mode's aftermath.
  orangefs: Don't clear SGID when inheriting ACLs
2017-09-15 12:16:18 -07:00
Mimi Zohar
711aab1dbb vfs: constify path argument to kernel_read_file_from_path
This patch constifies the path argument to kernel_read_file_from_path().

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-14 20:18:45 -07:00
Linus Torvalds
6ed0529fef NFS client bugfixes for Linux 4.14
Hightlights include:
 
 bugfixes:
 - Various changes relating to reporting IO errors.
 - pnfs: Use the standard I/O stateid when calling LAYOUTGET
 
 Features:
 - Add static NFS I/O tracepoints for debugging
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZuuXZAAoJEGcL54qWCgDy1MAQAIMJR7SZhc6i5PpGXDLx+lRp
 HiNldIlW/TXMcHKQ/35cIkt+KPy49EJiB54USZkdiKeiGqMUKtrGgxLVqLY6GELV
 gORRHxYY4SkNCWSDqwkpM+4XCQ56Tqrh6PxPmiOd+zsKAiqWzL5WFydbdVZcN3Dr
 e/lVHktefKZ3ks0DJ+qbYY2MCVqJ8HLlsQ7ZhvGzNNXHHhmWZHAL2LtXdAZ3e5Wt
 fkEuZLhSSVT52VFreS8Z+SZr8Q5kbaNrihT+rNuE3IDg+HqJ0+P99ZbABt0ZIzwI
 9Om8R7WKhDeZQKOVHN6UvLX6p04U56VsM43PLdenQ9JeMTjgjSCA47s2rYl/b9GM
 hPgp51urhj2k7CrjVE9oMOdgMBO6lCpeQT4K4PAtQeeTWb00sPOGr2PGoPb6r1Pa
 UqybzwdKZxpv+/jIiZffd+GYRoETmDW4UnlXq0aE2IMxXUWVI+liJuKVXk/zT5cz
 N/n4rJrJTEJFSOMd0UF8TfbnsR+OgNDo76HWOBWbJ2JZ46qmCufZAOgXhN7XtC0O
 Kbatgsbi9lRBRBT80Agdr3OOoT2mzuzaY+VwrfiV2vkLOsdhmW039oUUws4V91ii
 TksS9JFBtJP30m/qxEUDSisrQpLrEcPaT4zYY3tFHywL2Dn1t300iTPeW+LXX0c7
 aJfE4XNfYgs7BEHgl1Q3
 =p93H
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull more NFS client updates from Trond Myklebust:
 "Hightlights include:

  Bugfixes:
   - Various changes relating to reporting IO errors.
   - pnfs: Use the standard I/O stateid when calling LAYOUTGET

  Features:
   - Add static NFS I/O tracepoints for debugging"

* tag 'nfs-for-4.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: various changes relating to reporting IO errors.
  NFS: Add static NFS I/O tracepoints
  pNFS: Use the standard I/O stateid when calling LAYOUTGET
2017-09-14 20:04:32 -07:00
Linus Torvalds
9e0ce554b0 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc leftovers from Al Viro.

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix the __user misannotations in asm-generic get_user/put_user
  fput: Don't reinvent the wheel but use existing llist API
  namespace.c: Don't reinvent the wheel but use existing llist API
2017-09-14 20:01:41 -07:00
Linus Torvalds
e253d98f5b Merge branch 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull nowait read support from Al Viro:
 "Support IOCB_NOWAIT for buffered reads and block devices"

* 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  block_dev: support RFW_NOWAIT on block device nodes
  fs: support RWF_NOWAIT for buffered reads
  fs: support IOCB_NOWAIT in generic_file_buffered_read
  fs: pass iocb to do_generic_file_read
2017-09-14 19:29:55 -07:00
Linus Torvalds
0f0d12728e Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount flag updates from Al Viro:
 "Another chunk of fmount preparations from dhowells; only trivial
  conflicts for that part. It separates MS_... bits (very grotty
  mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
  only a small subset of MS_... stuff).

  This does *not* convert the filesystems to new constants; only the
  infrastructure is done here. The next step in that series is where the
  conflicts would be; that's the conversion of filesystems. It's purely
  mechanical and it's better done after the merge, so if you could run
  something like

	list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

	sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
	        -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
	        -e 's/\<MS_NODEV\>/SB_NODEV/g' \
	        -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
	        -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
	        -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
	        -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
	        -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
	        -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
	        -e 's/\<MS_SILENT\>/SB_SILENT/g' \
	        -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
	        -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
	        -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
	        -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
	        $list

  and commit it with something along the lines of 'convert filesystems
  away from use of MS_... constants' as commit message, it would save a
  quite a bit of headache next cycle"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: Differentiate mount flags (MS_*) from internal superblock flags
  VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
  vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
2017-09-14 18:54:01 -07:00
Linus Torvalds
581bfce969 Merge branch 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more set_fs removal from Al Viro:
 "Christoph's 'use kernel_read and friends rather than open-coding
  set_fs()' series"

* 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: unexport vfs_readv and vfs_writev
  fs: unexport vfs_read and vfs_write
  fs: unexport __vfs_read/__vfs_write
  lustre: switch to kernel_write
  gadget/f_mass_storage: stop messing with the address limit
  mconsole: switch to kernel_read
  btrfs: switch write_buf to kernel_write
  net/9p: switch p9_fd_read to kernel_write
  mm/nommu: switch do_mmap_private to kernel_read
  serial2002: switch serial2002_tty_write to kernel_{read/write}
  fs: make the buf argument to __kernel_write a void pointer
  fs: fix kernel_write prototype
  fs: fix kernel_read prototype
  fs: move kernel_read to fs/read_write.c
  fs: move kernel_write to fs/read_write.c
  autofs4: switch autofs4_write to __kernel_write
  ashmem: switch to ->read_iter
2017-09-14 18:13:32 -07:00
Linus Torvalds
cc73fee0ba Merge branch 'work.ipc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ipc compat cleanup and 64-bit time_t from Al Viro:
 "IPC copyin/copyout sanitizing, including 64bit time_t work from Deepa
  Dinamani"

* 'work.ipc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  utimes: Make utimes y2038 safe
  ipc: shm: Make shmid_kernel timestamps y2038 safe
  ipc: sem: Make sem_array timestamps y2038 safe
  ipc: msg: Make msg_queue timestamps y2038 safe
  ipc: mqueue: Replace timespec with timespec64
  ipc: Make sys_semtimedop() y2038 safe
  get rid of SYSVIPC_COMPAT on ia64
  semtimedop(): move compat to native
  shmat(2): move compat to native
  msgrcv(2), msgsnd(2): move compat to native
  ipc(2): move compat to native
  ipc: make use of compat ipc_perm helpers
  semctl(): move compat to native
  semctl(): separate all layout-dependent copyin/copyout
  msgctl(): move compat to native
  msgctl(): split the actual work from copyin/copyout
  ipc: move compat shmctl to native
  shmctl: split the work from copyin/copyout
2017-09-14 17:37:26 -07:00
Linus Torvalds
e7cdb60fd2 Merge branch 'zstd-minimal' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull zstd support from Chris Mason:
 "Nick Terrell's patch series to add zstd support to the kernel has been
  floating around for a while. After talking with Dave Sterba, Herbert
  and Phillip, we decided to send the whole thing in as one pull
  request.

  zstd is a big win in speed over zlib and in compression ratio over
  lzo, and the compression team here at FB has gotten great results
  using it in production. Nick will continue to update the kernel side
  with new improvements from the open source zstd userland code.

  Nick has a number of benchmarks for the main zstd code in his lib/zstd
  commit:

      I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB
      of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel
      Core i7 processor, 16 GB of RAM, and a SSD. I benchmarked using
      `silesia.tar` [3], which is 211,988,480 B large. Run the following
      commands for the benchmark:

        sudo modprobe zstd_compress_test
        sudo mknod zstd_compress_test c 245 0
        sudo cp silesia.tar zstd_compress_test

      The time is reported by the time of the userland `cp`.
      The MB/s is computed with

        1,536,217,008 B / time(buffer size, hash)

      which includes the time to copy from userland.
      The Adjusted MB/s is computed with

        1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)).

      The memory reported is the amount of memory the compressor
      requests.

        | Method   | Size (B) | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
        |----------|----------|----------|-------|---------|----------|----------|
        | none     | 11988480 |    0.100 |     1 | 2119.88 |        - |        - |
        | zstd -1  | 73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
        | zstd -3  | 66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
        | zstd -5  | 65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
        | zstd -10 | 60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
        | zstd -15 | 58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
        | zstd -19 | 54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
        | zlib -1  | 77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
        | zlib -3  | 72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
        | zlib -6  | 68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
        | zlib -9  | 67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |

      I benchmarked zstd decompression using the same method on the same
      machine. The benchmark file is located in the upstream zstd repo
      under `contrib/linux-kernel/zstd_decompress_test.c` [4]. The
      memory reported is the amount of memory required to decompress
      data compressed with the given compression level. If you know the
      maximum size of your input, you can reduce the memory usage of
      decompression irrespective of the compression level.

        | Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
        |----------|----------|---------|---------------|-------------|
        | none     |    0.025 | 8479.54 |             - |           - |
        | zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
        | zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
        | zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
        | zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
        | zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
        | zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
        | zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
        | zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
        | zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
        | zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

  I ran a long series of tests and benchmarks on the btrfs side and the
  gains are very similar to the core benchmarks Nick ran"

* 'zstd-minimal' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  squashfs: Add zstd support
  btrfs: Add zstd support
  lib: Add zstd modules
  lib: Add xxhash module
2017-09-14 17:30:49 -07:00
Linus Torvalds
dff4d1f6fe - Some request-based DM core and DM multipath fixes and cleanups
- Constify a few variables in DM core and DM integrity
 
 - Add bufio optimization and checksum failure accounting to DM integrity
 
 - Fix DM integrity to avoid checking integrity of failed reads
 
 - Fix DM integrity to use init_completion
 
 - A couple DM log-writes target fixes
 
 - Simplify DAX flushing by eliminating the unnecessary flush abstraction
   that was stood up for DM's use.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJZuo8UAAoJEMUj8QotnQNa5BEIANO4mHh1nrzEbH72a4RCLgxV
 H1Pk1zZx/W1bhOOmcRRhxCSM85dPgsCegc5EmpwLZEMavQrP9UZblHcYOUsyIx7W
 S/lWa+soOq/5N2OveROc4WdoWVs50UFmc1+BcClc4YrEe+15XC3R0VMkjX2b/hUL
 o2eYhPjpMlgaorMtRRU6MAooo2fBRQ9m05aPeVgd35fxibrE7PZm+EYW09wa0STi
 9ufuDXJf8+TtFP/38BD41LbUEskuHUZTSDeAJ+3DBaTtfEZcZYxsst4P9JangsHx
 jqqqI9aYzFD2a27fl9WLhCvm40YFiKp5nwzED0RZjzWxVa/jTShX7a49BdzTTfw=
 =rkSB
 -----END PGP SIGNATURE-----

Merge tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Some request-based DM core and DM multipath fixes and cleanups

 - Constify a few variables in DM core and DM integrity

 - Add bufio optimization and checksum failure accounting to DM
   integrity

 - Fix DM integrity to avoid checking integrity of failed reads

 - Fix DM integrity to use init_completion

 - A couple DM log-writes target fixes

 - Simplify DAX flushing by eliminating the unnecessary flush
   abstraction that was stood up for DM's use.

* tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dax: remove the pmem_dax_ops->flush abstraction
  dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACK
  dm integrity: make blk_integrity_profile structure const
  dm integrity: do not check integrity for failed read operations
  dm log writes: fix >512b sectorsize support
  dm log writes: don't use all the cpu while waiting to log blocks
  dm ioctl: constify ioctl lookup table
  dm: constify argument arrays
  dm integrity: count and display checksum failures
  dm integrity: optimize writing dm-bufio buffers that are partially changed
  dm rq: do not update rq partially in each ending bio
  dm rq: make dm-sq requeuing behavior consistent with dm-mq behavior
  dm mpath: complain about unsupported __multipath_map_bio() return values
  dm mpath: avoid that building with W=1 causes gcc 7 to complain about fall-through
2017-09-14 13:43:16 -07:00
Markus Elfring
0b08273c8a orangefs: Adjust three checks for null pointers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The script “checkpatch.pl” pointed information out like the following.

Comparison to NULL could be written !…

Thus fix affected source code places.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-09-14 14:58:31 -04:00
Markus Elfring
5e273a0e06 orangefs: Use kcalloc() in orangefs_prepare_cdm_array()
* A multiplication for the size determination of a memory allocation
  indicated that an array data structure should be processed.
  Thus use the corresponding function "kcalloc".

  This issue was detected by using the Coccinelle software.

* Replace the specification of a data structure by a pointer dereference
  to make the corresponding size determination a bit safer according to
  the Linux coding style convention.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-09-14 14:58:30 -04:00
Markus Elfring
07a258531c orangefs: Delete error messages for a failed memory allocation in five functions
Omit an extra message for a memory allocation failure in these functions.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-09-14 14:58:29 -04:00
Julia Lawall
1217444405 orangefs: constify xattr_handler structure
The xattr_handler structure is only stored in an array of const
structures.  Thus the xattr_handler structure itself can be
const.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-09-14 14:58:29 -04:00
Jeff Layton
49e5571324 orangefs: don't call filemap_write_and_wait from fsync
Orangefs doesn't do buffered writes yet, so there's no point in
initiating and waiting for writeback.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-09-14 14:58:28 -04:00
Dan Carpenter
5f13e58767 orangefs: off by ones in xattr size checks
A previous patch which claimed to remove off by ones actually introduced
them.

strlen() returns the length of the string not including the NUL
character.  We are using strcpy() to copy "name" into a buffer which is
ORANGEFS_MAX_XATTR_NAMELEN characters long.  We should make sure to
leave space for the NUL, otherwise we're writing one character beyond
the end of the buffer.

Fixes: e675c5ec51 ("orangefs: clean up oversize xattr validation")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-09-14 14:58:27 -04:00
Mike Marshall
4bef69000d orangefs: react properly to posix_acl_update_mode's aftermath.
posix_acl_update_mode checks to see if the permissions
described by the ACL can be encoded into the
object's mode. If so, it sets "acl" to NULL
and "mode" to the new desired value. Prior to this patch
we failed to actually propagate the new mode back to the
server.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-09-14 14:54:38 -04:00
Jan Kara
b5accbb0df orangefs: Don't clear SGID when inheriting ACLs
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.

Fix the problem by creating __orangefs_set_acl() function that does not
call posix_acl_update_mode() and use it when inheriting ACLs. That
prevents SGID bit clearing and the mode has been properly set by
posix_acl_create() anyway.

Fixes: 073931017b
CC: stable@vger.kernel.org
CC: Mike Marshall <hubcap@omnibond.com>
CC: pvfs2-developers@beowulf-underground.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-09-14 14:54:37 -04:00
Michal Hocko
0ee931c4e3 mm: treewide: remove GFP_TEMPORARY allocation flag
GFP_TEMPORARY was introduced by commit e12ba74d8f ("Group short-lived
and reclaimable kernel allocations") along with __GFP_RECLAIMABLE.  It's
primary motivation was to allow users to tell that an allocation is
short lived and so the allocator can try to place such allocations close
together and prevent long term fragmentation.  As much as this sounds
like a reasonable semantic it becomes much less clear when to use the
highlevel GFP_TEMPORARY allocation flag.  How long is temporary? Can the
context holding that memory sleep? Can it take locks? It seems there is
no good answer for those questions.

The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
__GFP_RECLAIMABLE which in itself is tricky because basically none of
the existing caller provide a way to reclaim the allocated memory.  So
this is rather misleading and hard to evaluate for any benefits.

I have checked some random users and none of them has added the flag
with a specific justification.  I suspect most of them just copied from
other existing users and others just thought it might be a good idea to
use without any measuring.  This suggests that GFP_TEMPORARY just
motivates for cargo cult usage without any reasoning.

I believe that our gfp flags are quite complex already and especially
those with highlevel semantic should be clearly defined to prevent from
confusion and abuse.  Therefore I propose dropping GFP_TEMPORARY and
replace all existing users to simply use GFP_KERNEL.  Please note that
SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
so they will be placed properly for memory fragmentation prevention.

I can see reasons we might want some gfp flag to reflect shorterm
allocations but I propose starting from a clear semantic definition and
only then add users with proper justification.

This was been brought up before LSF this year by Matthew [1] and it
turned out that GFP_TEMPORARY really doesn't have a clear semantic.  It
seems to be a heuristic without any measured advantage for most (if not
all) its current users.  The follow up discussion has revealed that
opinions on what might be temporary allocation differ a lot between
developers.  So rather than trying to tweak existing users into a
semantic which they haven't expected I propose to simply remove the flag
and start from scratch if we really need a semantic for short term
allocations.

[1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org

[akpm@linux-foundation.org: fix typo]
[akpm@linux-foundation.org: coding-style fixes]
[sfr@canb.auug.org.au: drm/i915: fix up]
  Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-13 18:53:16 -07:00
Arnd Bergmann
ebfddb3d44 fscache: fix fscache_objlist_show format processing
gcc points out a minor bug in the handling of unknown cookie types,
which could result in a string overflow when the integer is copied into
a 3-byte string:

  fs/fscache/object-list.c: In function 'fscache_objlist_show':
  fs/fscache/object-list.c:265:19: error: 'sprintf' may write a terminating nul past the end of the destination [-Werror=format-overflow=]
   sprintf(_type, "%02u", cookie->def->type);
                  ^~~~~~
  fs/fscache/object-list.c:265:4: note: 'sprintf' output between 3 and 4 bytes into a destination of size 3

This is currently harmless as no code sets a type other than 0 or 1, but
it makes sense to use snprintf() here to avoid overflowing the array if
that changes.

Link: http://lkml.kernel.org/r/20170714120720.906842-22-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-13 18:53:15 -07:00
Arnd Bergmann
6dec0dd4a6 procfs: remove unused variable
In NOMMU configurations, we get a warning about a variable that has become
unused:

  fs/proc/task_nommu.c: In function 'nommu_vma_show':
  fs/proc/task_nommu.c:148:28: error: unused variable 'priv' [-Werror=unused-variable]

Link: http://lkml.kernel.org/r/20170911200231.3171415-1-arnd@arndb.de
Fixes: 1240ea0dc3 ("fs, proc: remove priv argument from is_stack")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-13 18:53:15 -07:00
Linus Torvalds
e7989f973a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
 "This fixes a regression (spotted by the Sandstorm.io folks) in the pid
  namespace handling introduced in 4.12.

  There's also a fix for honoring sync/dsync flags for pwritev2()"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: getattr cleanup
  fuse: honor iocb sync flags on write
  fuse: allow server to run in different pid_ns
2017-09-13 10:10:19 -07:00
Linus Torvalds
c353f88f3d Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
 "This fixes d_ino correctness in readdir, which brings overlayfs on par
  with normal filesystems regarding inode number semantics, as long as
  all layers are on the same filesystem.

  There are also some bug fixes, one in particular (random ioctl's
  shouldn't be able to modify lower layers) that touches some vfs code,
  but of course no-op for non-overlay fs"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix false positive ESTALE on lookup
  ovl: don't allow writing ioctl on lower layer
  ovl: fix relatime for directories
  vfs: add flags to d_real()
  ovl: cleanup d_real for negative
  ovl: constant d_ino for non-merge dirs
  ovl: constant d_ino across copy up
  ovl: fix readdir error value
  ovl: check snprintf return
2017-09-13 09:11:44 -07:00
Linus Torvalds
6d8ef53e8b for-f2fs-4.14
In this round, we've mostly tuned f2fs to provide better user experience
 for Android. Especially, we've worked on atomic write feature again with
 SQLite community in order to support it officially. And we added or modified
 several facilities to analyze and enhance IO behaviors.
 
 Major changes include:
 - add app/fs io stat
 - add inode checksum feature
 - support project/journalled quota
 - enhance atomic write with new ioctl() which exposes feature set
 - enhance background gc/discard/fstrim flows with new gc_urgent mode
 - add F2FS_IOC_FS{GET,SET}XATTR
 - fix some quota flows
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlm4brsACgkQQBSofoJI
 UNK6dw/+Jd0j2whU5oxRcxZ6aL1Pj9fp2IdnGk1NbAI2mKAAlGknE/8CDS9OOMdO
 y8O0x3H271DXTMfHJAq2pAfJzcMhiT/Wmw2UsHvmHU0mPmfDcSBKEqPQj6Nbl483
 4s1dyt20InfHsVaKhUWAhov14bxLSiQTfeFH0SL2qv/NTp1Xlp6mwQvKCrmNNxud
 coUL45Zk5uVAVckR0hsyfqudvdXM1LTDG0Y6/j0IaFtO9HqyAEgkILiSqL65TpBV
 2OrXsTf0p2HN9g8vSUUouyD4Oj+q1OHt+VN7gw03xXm3TqAaqnkpIq/dtGLEPyM5
 HD6Q2nDHDTLeKO2Ibi9C0f+bph4UqrCq/eoAjG1sM+6Sm+Hyf193FLR/E2R9aj8w
 ++lCoHUSf/krrMs9d+vnNWaTsKszAbAQRLiZaSHi21+0lcDZtYejNsm52LpDMAfO
 jzz+TTOvXTSlHWSlt8DRKVolNhMRFy9OYIJ0schYYD6FJldARmBMfcZosrhL1Xoh
 oU/bBaXwMv1XOWAOGCQbGrqREiciqXbKDGPQJq65Zvn60U6YzZf04wDbm0zXku5E
 x7S8kPxz8c/010JHIxvULZRamlvXSjFevbAa+QtNsEhlj6DkDSdisMj+w7/jU4Yx
 uInHojIq7ARJO0SBIoYFkz3+/2w++McCK0b/gpx1WHsN8I013zs=
 =w4KH
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've mostly tuned f2fs to provide better user
  experience for Android. Especially, we've worked on atomic write
  feature again with SQLite community in order to support it officially.
  And we added or modified several facilities to analyze and enhance IO
  behaviors.

  Major changes include:
   - add app/fs io stat
   - add inode checksum feature
   - support project/journalled quota
   - enhance atomic write with new ioctl() which exposes feature set
   - enhance background gc/discard/fstrim flows with new gc_urgent mode
   - add F2FS_IOC_FS{GET,SET}XATTR
   - fix some quota flows"

* tag 'f2fs-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (63 commits)
  f2fs: hurry up to issue discard after io interruption
  f2fs: fix to show correct discard_granularity in sysfs
  f2fs: detect dirty inode in evict_inode
  f2fs: clear radix tree dirty tag of pages whose dirty flag is cleared
  f2fs: speed up gc_urgent mode with SSR
  f2fs: better to wait for fstrim completion
  f2fs: avoid race in between read xattr & write xattr
  f2fs: make get_lock_data_page to handle encrypted inode
  f2fs: use generic terms used for encrypted block management
  f2fs: introduce f2fs_encrypted_file for clean-up
  Revert "f2fs: add a new function get_ssr_cost"
  f2fs: constify super_operations
  f2fs: fix to wake up all sleeping flusher
  f2fs: avoid race in between atomic_read & atomic_inc
  f2fs: remove unneeded parameter of change_curseg
  f2fs: update i_flags correctly
  f2fs: don't check inode's checksum if it was dirtied or writebacked
  f2fs: don't need to update inode checksum for recovery
  f2fs: trigger fdatasync for non-atomic_write file
  f2fs: fix to avoid race in between aio and gc
  ...
2017-09-12 20:05:58 -07:00
Linus Torvalds
cdb897e327 The highlights include:
* a large series of fixes and improvements to the snapshot-handling
    code (Zheng Yan)
 
  * individual read/write OSD requests passed down to libceph are now
    limited to 16M in size to avoid hitting OSD-side limits (Zheng Yan)
 
  * encode MStatfs v2 message to allow for more accurate space usage
    reporting (Douglas Fuller)
 
  * switch to the new writeback error tracking infrastructure (Jeff
    Layton)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJZuAC0AAoJEEp/3jgCEfOLb14H/REYq4fDDkUa70L4leKWWdCa
 n71ipkKeoorfivts71iOtGMJfK+Z6ax+dq1PvBWMy6PtzXS/+2B+t2XwILvLiwWH
 h87i44bY68aLWRTSusgTfB+I7gyVrWN0WMLznZ5rfM9XuyPv+RPyJYh3EhxWI5+U
 2kOHFEc+cPL6mAshGmB8lIzKOWTfmBiw28ulICwlcazm79hh39aNBQE546lS8gA3
 kXuJ55odojPgXOYh+vs60raIBnm6flek1jLxBGYG3MU4gv0VVWOyW0eWeuqW+EcR
 6dVYlzg1xGlPp+vRmDZQuv/E2MafBxdcil/RrdLeqcx/Hf1KJBzcLgUzIMbnOAI=
 =YDZP
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.14-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The highlights include:

   - a large series of fixes and improvements to the snapshot-handling
     code (Zheng Yan)

   - individual read/write OSD requests passed down to libceph are now
     limited to 16M in size to avoid hitting OSD-side limits (Zheng Yan)

   - encode MStatfs v2 message to allow for more accurate space usage
     reporting (Douglas Fuller)

   - switch to the new writeback error tracking infrastructure (Jeff
     Layton)"

* tag 'ceph-for-4.14-rc1' of git://github.com/ceph/ceph-client: (35 commits)
  ceph: stop on-going cached readdir if mds revokes FILE_SHARED cap
  ceph: wait on writeback after writing snapshot data
  ceph: fix capsnap dirty pages accounting
  ceph: ignore wbc->range_{start,end} when write back snapshot data
  ceph: fix "range cyclic" mode writepages
  ceph: cleanup local variables in ceph_writepages_start()
  ceph: optimize pagevec iterating in ceph_writepages_start()
  ceph: make writepage_nounlock() invalidate page that beyonds EOF
  ceph: properly get capsnap's size in get_oldest_context()
  ceph: remove stale check in ceph_invalidatepage()
  ceph: queue cap snap only when snap realm's context changes
  ceph: handle race between vmtruncate and queuing cap snap
  ceph: fix message order check in handle_cap_export()
  ceph: fix NULL pointer dereference in ceph_flush_snaps()
  ceph: adjust 36 checks for NULL pointers
  ceph: delete an unnecessary return statement in update_dentry_lease()
  ceph: ENOMEM pr_err in __get_or_create_frag() is redundant
  ceph: check negative offsets in ceph_llseek()
  ceph: more accurate statfs
  ceph: properly set snap follows for cap reconnect
  ...
2017-09-12 20:03:53 -07:00
Richard Wareing
b31ff3cdf5 xfs: XFS_IS_REALTIME_INODE() should be false if no rt device present
If using a kernel with CONFIG_XFS_RT=y and we set the RHINHERIT flag on
a directory in a filesystem that does not have a realtime device and
create a new file in that directory, it gets marked as a real time file.
When data is written and a fsync is issued, the filesystem attempts to
flush a non-existent rt device during the fsync process.

This results in a crash dereferencing a null buftarg pointer in
xfs_blkdev_issue_flush():

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  IP: xfs_blkdev_issue_flush+0xd/0x20
  .....
  Call Trace:
    xfs_file_fsync+0x188/0x1c0
    vfs_fsync_range+0x3b/0xa0
    do_fsync+0x3d/0x70
    SyS_fsync+0x10/0x20
    do_syscall_64+0x4d/0xb0
    entry_SYSCALL64_slow_path+0x25/0x25

Setting RT inode flags does not require special privileges so any
unprivileged user can cause this oops to occur.  To reproduce, confirm
kernel is compiled with CONFIG_XFS_RT=y and run:

  # mkfs.xfs -f /dev/pmem0
  # mount /dev/pmem0 /mnt/test
  # mkdir /mnt/test/foo
  # xfs_io -c 'chattr +t' /mnt/test/foo
  # xfs_io -f -c 'pwrite 0 5m' -c fsync /mnt/test/foo/bar

Or just run xfstests with MKFS_OPTIONS="-d rtinherit=1" and wait.

Kernels built with CONFIG_XFS_RT=n are not exposed to this bug.

Fixes: f538d4da8d ("[XFS] write barrier support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Richard Wareing <rwareing@fb.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-12 20:02:22 -07:00
Ronnie Sahlberg
bf2afee14e cifs: check rsp for NULL before dereferencing in SMB2_open
In SMB2_open there are several paths where the SendReceive2
call will return an error before it sets rsp_iov.iov_base
thus leaving iov_base uninitialized.

Thus we need to check rsp before we dereference it in
the call to get_rfc1002_length().

A report of this issue was previously reported in
http://www.spinics.net/lists/linux-cifs/msg12846.html

RH-bugzilla : 1476151

Version 2 :
* Lets properly initialize rsp_iov before we use it.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>.
Signed-off-by: Steve French <smfrench@gmail.com>
Reported-by: Xiaoli Feng <xifeng@redhat.com>
CC: Stable <stable@vger.kernel.org>
2017-09-12 18:11:44 -05:00
Chao Yu
e6c6de18f0 f2fs: hurry up to issue discard after io interruption
Once we encounter I/O interruption during issuing discards, we will delay
long time before next round, but if system status is I/O idle during the
time, it may loses opportunity to issue discards. So this patch changes
to hurry up to issue discard after io interruption.

Besides, this patch also fixes to issue discards accurately with assigned
rate.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-12 10:02:55 -07:00
Chao Yu
80647e5f4c f2fs: fix to show correct discard_granularity in sysfs
Fix below incorrect display when reading discard_granularity sysfs node.

$ cat /sys/fs/f2fs/<device>/discard_granularity
$ 16
$ echo 32 > /sys/fs/f2fs/<device>/discard_granularity
$ cat /sys/fs/f2fs/<device>/discard_granularity
$ 16

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-12 10:02:47 -07:00
Chao Yu
ca7d802a7d f2fs: detect dirty inode in evict_inode
Add a bugon in f2fs_evict_inode to detect inconsistent status between
inode cache and related node page cache.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-12 10:02:39 -07:00
Amir Goldstein
939ae4efd5 ovl: fix false positive ESTALE on lookup
Commit b9ac5c274b ("ovl: hash overlay non-dir inodes by copy up origin")
verifies that the origin lower inode stored in the overlayfs inode matched
the inode of a copy up origin dentry found by lookup.

There is a false positive result in that check when lower fs does not
support file handles and copy up origin cannot be followed by file handle
at lookup time.

The false negative happens when finding an overlay inode in cache on a
copied up overlay dentry lookup. The overlay inode still 'remembers' the
copy up origin inode, but the copy up origin dentry is not available for
verification.

Relax the check in case copy up origin dentry is not available.

Fixes: b9ac5c274b ("ovl: hash overlay non-dir inodes by copy up...")
Cc: <stable@vger.kernel.org> # v4.13
Reported-by: Jordi Pujol <jordipujolp@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-09-12 17:22:20 +02:00
Miklos Szeredi
5b97eeacbd fuse: getattr cleanup
The refreshed argument isn't used by any caller, get rid of it.

Use a helper for just updating the inode (no need to fill in a kstat).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-09-12 16:57:54 +02:00
Miklos Szeredi
e1c0eecba1 fuse: honor iocb sync flags on write
If the IOCB_DSYNC flag is set a sync is not being performed by
fuse_file_write_iter.

Honor IOCB_DSYNC/IOCB_SYNC by setting O_DYSNC/O_SYNC respectively in the
flags filed of the write request.

We don't need to sync data or metadata, since fuse_perform_write() does
write-through and the filesystem is responsible for updating file times.

Original patch by Vitaly Zolotusky.

Reported-by: Nate Clark <nate@neworld.us>
Cc: Vitaly Zolotusky <vitaly@unitc.com>.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-09-12 16:57:53 +02:00
Miklos Szeredi
5d6d3a301c fuse: allow server to run in different pid_ns
Commit 0b6e9ea041 ("fuse: Add support for pid namespaces") broke
Sandstorm.io development tools, which have been sending FUSE file
descriptors across PID namespace boundaries since early 2014.

The above patch added a check that prevented I/O on the fuse device file
descriptor if the pid namespace of the reader/writer was different from the
pid namespace of the mounter.  With this change passing the device file
descriptor to a different pid namespace simply doesn't work.  The check was
added because pids are transferred to/from the fuse userspace server in the
namespace registered at mount time.

To fix this regression, remove the checks and do the following:

1) the pid in the request header (the pid of the task that initiated the
filesystem operation) is translated to the reader's pid namespace.  If a
mapping doesn't exist for this pid, then a zero pid is used.  Note: even if
a mapping would exist between the initiator task's pid namespace and the
reader's pid namespace the pid will be zero if either mapping from
initator's to mounter's namespace or mapping from mounter's to reader's
namespace doesn't exist.

2) The lk.pid value in setlk/setlkw requests and getlk reply is left alone.
Userspace should not interpret this value anyway.  Also allow the
setlk/setlkw operations if the pid of the task cannot be represented in the
mounter's namespace (pid being zero in that case).

Reported-by: Kenton Varda <kenton@sandstorm.io>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 0b6e9ea041 ("fuse: Add support for pid namespaces")
Cc: <stable@vger.kernel.org> # v4.12+
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
2017-09-12 16:57:53 +02:00
Linus Torvalds
8e7757d83d NFS client updates for Linux 4.14
Hightlights include:
 
 Stable bugfixes:
 - Fix mirror allocation in the writeback code to avoid a use after free
 - Fix the O_DSYNC writes to use the correct byte range
 - Fix 2 use after free issues in the I/O code
 
 Features:
 - Writeback fixes to split up the inode->i_lock in order to reduce contention
 - RPC client receive fixes to reduce the amount of time the
   xprt->transport_lock is held when receiving data from a socket into am
   XDR buffer.
 - Ditto fixes to reduce contention between call side users of the rdma
   rb_lock, and its use in rpcrdma_reply_handler.
 - Re-arrange rdma stats to reduce false cacheline sharing.
 - Various rdma cleanups and optimisations.
 - Refactor the NFSv4.1 exchange id code and clean up the code.
 - Const-ify all instances of struct rpc_xprt_ops
 
 Bugfixes:
 - Fix the NFSv2 'sec=' mount option.
 - NFSv4.1: don't use machine credentials for CLOSE when using 'sec=sys'
 - Fix the NFSv3 GRANT callback when the port changes on the server.
 - Fix livelock issues with COMMIT
 - NFSv4: Use correct inode in _nfs4_opendata_to_nfs4_state() when doing
   and NFSv4.1 open by filehandle.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZtbvIAAoJEGcL54qWCgDy/boP/jRuVk6B2VyhWnJkOgdQzIN3
 Q8PIR0oxkywH2MI7c9/G2k5b/HD9BK2iQrXzIoPxRuPrckKLwzqYclzG8PR4Niyg
 D3CCzrvGcEXZrv/nHQ+HDMD0ZuUyXFqhrYeyQwNSJ9p/oP0gaxnYwteennfJVa99
 mv6+LdoY+lzVYJI1gmMHVF2zOhN+rTe7xUVnjYnsVCpwMvL+u992oZl3qQJRFG6b
 HlXOy7h5JRFyue61P20PSgh9D1JUWWYD/V0EG+7cIvByAg5KxhvVgjqSsTTT7FXe
 Omn4fTv1MFzk8er9qYFRjpM2IoIdAejFMqX3/PxQVr2qOFNmHYrq+WsdWNQEr/Wu
 WREJu5Ac1Hboe2/scA+DtuVPFePPPyrolhwk533aNWrdDywg01e0XqBEDKR/atJd
 u5lvW20UfLQuCFLOpaxDpq2ngQSOg6t96N36tsydG0SAVpiydOPMLqkQi7Nb3aoB
 79xGpmtnijP5T6jnOI2/nexM08OMTI0BhMbXJC5v1+lnxIJKcKdnGlTM4UJyxUMq
 /3dFI4IQZLfkMEjIvZFoi+nKWx3DYhiUhkKhbBYwtB4P4q8Z2qKTPHFxORz9griZ
 Pa+8BPuDuodIWuDD97q1Dnw2NWjQim8Rx/ce4c8FHGzwMJLPkcVqk+guGsub5IdO
 7qF7Vvv02gJ48TAqTBDf
 =1Ssl
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Hightlights include:

  Stable bugfixes:
   - Fix mirror allocation in the writeback code to avoid a use after
     free
   - Fix the O_DSYNC writes to use the correct byte range
   - Fix 2 use after free issues in the I/O code

  Features:
   - Writeback fixes to split up the inode->i_lock in order to reduce
     contention
   - RPC client receive fixes to reduce the amount of time the
     xprt->transport_lock is held when receiving data from a socket into
     am XDR buffer.
   - Ditto fixes to reduce contention between call side users of the
     rdma rb_lock, and its use in rpcrdma_reply_handler.
   - Re-arrange rdma stats to reduce false cacheline sharing.
   - Various rdma cleanups and optimisations.
   - Refactor the NFSv4.1 exchange id code and clean up the code.
   - Const-ify all instances of struct rpc_xprt_ops

  Bugfixes:
   - Fix the NFSv2 'sec=' mount option.
   - NFSv4.1: don't use machine credentials for CLOSE when using
     'sec=sys'
   - Fix the NFSv3 GRANT callback when the port changes on the server.
   - Fix livelock issues with COMMIT
   - NFSv4: Use correct inode in _nfs4_opendata_to_nfs4_state() when
     doing and NFSv4.1 open by filehandle"

* tag 'nfs-for-4.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (69 commits)
  NFS: Count the bytes of skipped subrequests in nfs_lock_and_join_requests()
  NFS: Don't hold the group lock when calling nfs_release_request()
  NFS: Remove pnfs_generic_transfer_commit_list()
  NFS: nfs_lock_and_join_requests and nfs_scan_commit_list can deadlock
  NFS: Fix 2 use after free issues in the I/O code
  NFS: Sync the correct byte range during synchronous writes
  lockd: Delete an error message for a failed memory allocation in reclaimer()
  NFS: remove jiffies field from access cache
  NFS: flush data when locking a file to ensure cache coherence for mmap.
  SUNRPC: remove some dead code.
  NFS: don't expect errors from mempool_alloc().
  xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handler
  xprtrdma: Re-arrange struct rx_stats
  NFS: Fix NFSv2 security settings
  NFSv4.1: don't use machine credentials for CLOSE when using 'sec=sys'
  SUNRPC: ECONNREFUSED should cause a rebind.
  NFS: Remove unused parameter gfp_flags from nfs_pageio_init()
  NFSv4: Fix up mirror allocation
  SUNRPC: Add a separate spinlock to protect the RPC request receive list
  SUNRPC: Cleanup xs_tcp_read_common()
  ...
2017-09-11 22:01:44 -07:00
Daeho Jeong
0abd8e70d2 f2fs: clear radix tree dirty tag of pages whose dirty flag is cleared
On a senario like writing out the first dirty page of the inode
as the inline data, we only cleared dirty flags of the pages, but
didn't clear the dirty tags of those pages in the radix tree.

If we don't clear the dirty tags of the pages in the radix tree, the
inodes which contain the pages will be marked with I_DIRTY_PAGES again
and again, and writepages() for the inodes will be invoked in every
writeback period. As a result, nothing will be done in every
writepages() for the inodes and it will just consume CPU time
meaninglessly.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-11 21:32:38 -07:00
NeilBrown
bf4b490597 NFS: various changes relating to reporting IO errors.
1/ remove 'start' and 'end' args from nfs_file_fsync_commit().
   They aren't used.

2/ Make nfs_context_set_write_error() a "static inline" in internal.h
   so we can...

3/ Use nfs_context_set_write_error() instead of mapping_set_error()
   if nfs_pageio_add_request() fails before sending any request.
   NFS generally keeps errors in the open_context, not the mapping,
   so this is more consistent.

4/ If filemap_write_and_write_range() reports any error, still
   check ctx->error.  The value in ctx->error is likely to be
   more useful.  As part of this, NFS_CONTEXT_ERROR_WRITE is
   cleared slightly earlier, before nfs_file_fsync_commit() is called,
   rather than at the start of that function.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-11 22:28:56 -04:00
Chuck Lever
8224b2734a NFS: Add static NFS I/O tracepoints
Tools like tcpdump and rpcdebug can be very useful. But there are
plenty of environments where they are difficult or impossible to
use. For example, we've had customers report I/O failures during
workloads so heavy that collecting network traffic or enabling
RPC debugging are themselves onerous.

The kernel's static tracepoints are lightweight (less likely to
introduce timing changes) and efficient (the trace data is compact).
They also work in scenarios where capturing network traffic is not
possible due to lack of hardware support (some InfiniBand HCAs) or
where data or network privacy is a concern.

Introduce tracepoints that show when an NFS READ, WRITE, or COMMIT
is initiated, and when it completes. Record the arguments and
results of each operation, which are not shown by existing sunrpc
module's tracepoints.

For instance, the recorded offset and count can be used to match an
"initiate" event to a "done" event. If an NFS READ result returns
fewer bytes than requested or zero, seeing the EOF flag can be
probative. Seeing an NFS4ERR_BAD_STATEID result is also indication
of a particular class of problems. The timing information attached
to each event record can often be useful as well.

Usage example:

[root@manet tmp]# trace-cmd record -e nfs:*initiate* -e nfs:*done
/sys/kernel/debug/tracing/events/nfs/*initiate*/filter
/sys/kernel/debug/tracing/events/nfs/*done/filter
Hit Ctrl^C to stop recording
^CKernel buffer statistics:
  Note: "entries" are the entries left in the kernel ring buffer and are not
        recorded in the trace data. They should all be zero.

CPU: 0
entries: 0
overrun: 0
commit overrun: 0
bytes: 3680
oldest event ts:    78.367422
now ts:   100.124419
dropped events: 0
read events: 74

... and so on.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-11 22:20:38 -04:00
Trond Myklebust
70d2f7b1ea pNFS: Use the standard I/O stateid when calling LAYOUTGET
Instead of having a private method for copying the open/delegation stateid,
use the same call that is used for standard I/O through the MDS.

Note that this means we transmit the stateid with a zero seqid, avoiding
issues with NFS4ERR_OLD_STATEID.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-11 22:19:00 -04:00
Linus Torvalds
dd198ce714 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "Life has been busy and I have not gotten half as much done this round
  as I would have liked. I delayed it so that a minor conflict
  resolution with the mips tree could spend a little time in linux-next
  before I sent this pull request.

  This includes two long delayed user namespace changes from Kirill
  Tkhai. It also includes a very useful change from Serge Hallyn that
  allows the security capability attribute to be used inside of user
  namespaces. The practical effect of this is people can now untar
  tarballs and install rpms in user namespaces. It had been suggested to
  generalize this and encode some of the namespace information
  information in the xattr name. Upon close inspection that makes the
  things that should be hard easy and the things that should be easy
  more expensive.

  Then there is my bugfix/cleanup for signal injection that removes the
  magic encoding of the siginfo union member from the kernel internal
  si_code. The mips folks reported the case where I had used FPE_FIXME
  me is impossible so I have remove FPE_FIXME from mips, while at the
  same time including a return statement in that case to keep gcc from
  complaining about unitialized variables.

  I almost finished the work to get make copy_siginfo_to_user a trivial
  copy to user. The code is available at:

     git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git neuter-copy_siginfo_to_user-v3

  But I did not have time/energy to get the code posted and reviewed
  before the merge window opened.

  I was able to see that the security excuse for just copying fields
  that we know are initialized doesn't work in practice there are buggy
  initializations that don't initialize the proper fields in siginfo. So
  we still sometimes copy unitialized data to userspace"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  Introduce v3 namespaced file capabilities
  mips/signal: In force_fcr31_sig return in the impossible case
  signal: Remove kernel interal si_code magic
  fcntl: Don't use ambiguous SIG_POLL si_codes
  prctl: Allow local CAP_SYS_ADMIN changing exe_file
  security: Use user_namespace::level to avoid redundant iterations in cap_capable()
  userns,pidns: Verify the userns for new pid namespaces
  signal/testing: Don't look for __SI_FAULT in userspace
  signal/mips: Document a conflict with SI_USER with SIGFPE
  signal/sparc: Document a conflict with SI_USER with SIGFPE
  signal/ia64: Document a conflict with SI_USER with SIGFPE
  signal/alpha: Document a conflict with SI_USER for SIGTRAP
2017-09-11 18:34:47 -07:00
Jaegeuk Kim
b3a97a2a9a f2fs: speed up gc_urgent mode with SSR
This patch activates SSR in gc_urgent mode.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-11 17:22:18 -07:00
Jaegeuk Kim
1eb1ef4a8e f2fs: better to wait for fstrim completion
In android, we'd better wait for fstrim completion instead of issuing the
discard commands asynchronous.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-11 17:22:12 -07:00
Linus Torvalds
89fd915c40 libnvdimm for 4.14
* Media error handling support in the Block Translation Table (BTT)
   driver is reworked to address sleeping-while-atomic locking and
   memory-allocation-context conflicts.
 
 * The dax_device lookup overhead for xfs and ext4 is moved out of the
   iomap hot-path to a mount-time lookup.
 
 * A new 'ecc_unit_size' sysfs attribute is added to advertise the
   read-modify-write boundary property of a persistent memory range.
 
 * Preparatory fix-ups for arm and powerpc pmem support are included
   along with other miscellaneous fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZtsAGAAoJEB7SkWpmfYgCrzMP/2vPvZvrFjZn5pAoZjlmTmHM
 ySceoOC7vwvVXIsSs52FhSjcxEoXo9cklXPwhXOPVtVUFdSDJBUOIUxwIziE6Y+5
 sFJ2xT9K+5zKBUiXJwqFQDg52dn//eBNnnnDz+HQrBSzGrbWQhIZY2m19omPzv1I
 BeN0OCGOdW3cjSo3BCFl1d+KrSl704e7paeKq/TO3GIiAilIXleTVxcefEEodV2K
 ZvWHpFIhHeyN8dsF8teI952KcCT92CT/IaabxQIwCxX0/8/GFeDc5aqf77qiYWKi
 uxCeQXdgnaE8EZNWZWGWIWul6eYEkoCNbLeUQ7eJnECq61VxVajJS0NyGa5T9OiM
 P046Bo2b1b3R0IHxVIyVG0ZCm3YUMAHSn/3uRxPgESJ4bS/VQ3YP5M6MLxDOlc90
 IisLilagitkK6h8/fVuVrwciRNQ71XEC34t6k7GCl/1ZnLlLT+i4/jc5NRZnGEZh
 aXAAGHdteQ+/mSz6p2UISFUekbd6LerwzKRw8ibDvH6pTud8orYR7g2+JoGhgb6Y
 pyFVE8DhIcqNKAMxBsjiRZ46OQ7qrT+AemdAG3aVv6FaNoe4o5jPLdw2cEtLqtpk
 +DNm0/lSWxxxozjrvu6EUZj6hk8R5E19XpRzV5QJkcKUXMu7oSrFLdMcC4FeIjl9
 K4hXLV3fVBVRMiS0RA6z
 =5iGY
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm from Dan Williams:
 "A rework of media error handling in the BTT driver and other updates.
  It has appeared in a few -next releases and collected some late-
  breaking build-error and warning fixups as a result.

  Summary:

   - Media error handling support in the Block Translation Table (BTT)
     driver is reworked to address sleeping-while-atomic locking and
     memory-allocation-context conflicts.

   - The dax_device lookup overhead for xfs and ext4 is moved out of the
     iomap hot-path to a mount-time lookup.

   - A new 'ecc_unit_size' sysfs attribute is added to advertise the
     read-modify-write boundary property of a persistent memory range.

   - Preparatory fix-ups for arm and powerpc pmem support are included
     along with other miscellaneous fixes"

* tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
  libnvdimm, btt: fix format string warnings
  libnvdimm, btt: clean up warning and error messages
  ext4: fix null pointer dereference on sbi
  libnvdimm, nfit: move the check on nd_reserved2 to the endpoint
  dax: fix FS_DAX=n BLOCK=y compilation
  libnvdimm: fix integer overflow static analysis warning
  libnvdimm, nd_blk: remove mmio_flush_range()
  libnvdimm, btt: rework error clearing
  libnvdimm: fix potential deadlock while clearing errors
  libnvdimm, btt: cache sector_size in arena_info
  libnvdimm, btt: ensure that flags were also unchanged during a map_read
  libnvdimm, btt: refactor map entry operations with macros
  libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path
  libnvdimm, nfit: export an 'ecc_unit_size' sysfs attribute
  ext4: perform dax_device lookup at mount
  ext2: perform dax_device lookup at mount
  xfs: perform dax_device lookup at mount
  dax: introduce a fs_dax_get_by_bdev() helper
  libnvdimm, btt: check memory allocation failure
  libnvdimm, label: fix index block size calculation
  ...
2017-09-11 13:10:57 -07:00
Mikulas Patocka
c3ca015fab dax: remove the pmem_dax_ops->flush abstraction
Commit abebfbe2f7 ("dm: add ->flush() dax operation support") is
buggy. A DM device may be composed of multiple underlying devices and
all of them need to be flushed. That commit just routes the flush
request to the first device and ignores the other devices.

It could be fixed by adding more complex logic to the device mapper. But
there is only one implementation of the method pmem_dax_ops->flush - that
is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
don't need the pmem_dax_ops->flush abstraction at all, we can call
arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
can't ever reach anything different from arch_wb_cache_pmem().

It should be also pointed out that for some uses of persistent memory it
is needed to flush only a very small amount of data (such as 1 cacheline),
and it would be overkill if we go through that device mapper machinery for
a single flushed cache line.

Fix this by removing the pmem_dax_ops->flush abstraction and call
arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
mapper code that forwards the flushes.

Fixes: abebfbe2f7 ("dm: add ->flush() dax operation support")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-09-11 11:00:55 -04:00
Nicolas Pitre
cdf38888ed binfmt_elf_fdpic: fix crash on MMU system with dynamic binaries
In elf_fdpic_map_file() there is a test to ensure the dynamic section in
user space is properly terminated. However it does so by dereferencing
a user address directly. Add proper user space accessor.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Mickael GUENE <mickael.guene@st.com>
Tested-by: Vincent Abriou <vincent.abriou@st.com>
Tested-by: Andras Szemzo <szemzo.andras@gmail.com>
2017-09-10 19:31:47 -04:00
Nicolas Pitre
4755200b6b binfmt_elf: don't attempt to load FDPIC binaries
On platforms where both ELF and ELF-FDPIC variants are available, the
regular ELF loader will happily identify FDPIC binaries as proper ELF
and load them without the necessary FDPIC fixups, resulting in an
immediate user space crash. Let's prevent binflt_elf from loading those
binaries so binfmt_elf_fdpic has a chance to pick them up. For those
architectures that don't define elf_check_fdpic(), a default version
returning false is provided.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Mickael GUENE <mickael.guene@st.com>
Tested-by: Vincent Abriou <vincent.abriou@st.com>
Tested-by: Andras Szemzo <szemzo.andras@gmail.com>
2017-09-10 19:31:47 -04:00
Nicolas Pitre
382e67aec6 ARM: enable elf_fdpic on systems with an MMU
Provide the necessary changes to be able to execute ELF-FDPIC binaries
on ARM systems with an MMU.

The default for CONFIG_BINFMT_ELF_FDPIC is also set to n if the regular
ELF loader is already configured so not to force FDPIC support on
everyone. Given that CONFIG_BINFMT_ELF depends on CONFIG_MMU, this means
CONFIG_BINFMT_ELF_FDPIC will still default to y when !MMU.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Mickael GUENE <mickael.guene@st.com>
Tested-by: Vincent Abriou <vincent.abriou@st.com>
Tested-by: Andras Szemzo <szemzo.andras@gmail.com>
2017-09-10 19:31:46 -04:00
Nicolas Pitre
50b2b2e691 ARM: add ELF_FDPIC support
This includes the necessary code to recognise the FDPIC format on ARM
and the ptrace command definitions used by the common ptrace code.

Based on patches originally from Mickael Guene <mickael.guene@st.com>.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Mickael GUENE <mickael.guene@st.com>
Tested-by: Vincent Abriou <vincent.abriou@st.com>
Tested-by: Andras Szemzo <szemzo.andras@gmail.com>
2017-09-10 19:31:46 -04:00
Linus Torvalds
a59e57da49 MTD changes for 4.14:
General updates:
  * Constify pci_device_id in various drivers
  * Constify device_type
  * Remove pad control code from the Gemini driver
  * Use %pOF to print OF node full_name
  * Various fixes in the physmap_of driver
  * Remove unused vars in mtdswap
  * Check devm_kzalloc() return value in the spear_smi driver
  * Check clk_prepare_enable() return code in the st_spi_fsm driver
  * Create per MTD device debugfs enties
 
 NAND updates, from Boris Brezillon:
  * Fix memory leaks in the core
  * Remove unused NAND locking support
  * Rename nand.h into rawnand.h (preparing support for spi NANDs)
  * Use NAND_MAX_ID_LEN where appropriate
  * Fix support for 20nm Hynix chips
  * Fix support for Samsung and Hynix SLC NANDs
  * Various cleanup, improvements and fixes in the qcom driver
  * Fixes for bugs detected by various static code analysis tools
  * Fix mxc ooblayout definition
  * Add a new part_parsers to tmio and sharpsl platform data in order to
    define a custom list of partition parsers
  * Request the reset line in exclusive mode in the sunxi driver
  * Fix a build error in the orion-nand driver when compiled for ARMv4
  * Allow 64-bit mvebu platforms to select the PXA3XX driver
 
 SPI NOR updates, from Cyrille Pitchen and Marek Vasut:
  * add support to the JEDEC JESD216B specification (SFDP tables).
  * add support to the Intel Denverton SPI flash controller.
  * fix error recovery for Spansion/Cypress SPI NOR memories.
  * fix 4-byte address management for the Aspeed SPI controller.
  * add support to some Microchip SST26 memory parts
  * remove unneeded pinctrl header Write a message for tag:
 -----BEGIN PGP SIGNATURE-----
 
 iQJABAABCAAqBQJZrav6Ixxib3Jpcy5icmV6aWxsb25AZnJlZS1lbGVjdHJvbnMu
 Y29tAAoJEGXtNgF+CLcABwkP/joDrq09RIC9n5gP+ubJe6O1jKvNWDd6bIVXD3Ke
 73R0a0ANwwWlNYWTChTdrb8UeewVS1bzutyy5O2Sbdb6Jc6s7xkfQDTsbET2HWOK
 S7Lt/zjlC6/6cow59B6h43PGS6wmIFaZD3K+70sGhvFnV8epVUzS2Aa783xS8LXm
 so2djZOdUYnW+yE0eho24VQR6nS4YP4Vc+7Mm9skjU0ifjB9mJiWRkzoQnqIgORO
 M+Iab+qjDs9KR/edWh6mZtnvjps0VSW4I40YsClpcgIn550w1DSXe4u6/8Nk+2Bp
 gfrALls91gob0ocxmEdIyLID+M0410HcN/Lvh36nw+tkkGTaXf0D6mkqzdKNrZ3w
 yz+UV9uf19kr1c6zFGcCvUlD0btn9KT+F2legnhgURtwUyDFQcaYQlkpDIeEzUMV
 ZrtzKbSE2v9810YKXjtCnseewdP+Eph/ewN6ODX5yg/fs8K0fyQYTRtYYM50U69X
 md8zznBBDPhJVu5T2Of7my9V1SxvCP8a7LrKjAXuFHpZ/CHiPe+QOWBgG2L+zXXT
 e10/rTg7T2pcyKpBvL/3/mCYeJ+Iup3lKT1EHGCXcKnLGecVgOsbvdG+JnvQMI2J
 FLmu1exvrzi0Gcrs/05hqwyUvkHZ5FB1a+heNOtmQ+h1U0ElXqILyu7brzghupRe
 3phO
 =UgCd
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20170904' of git://git.infradead.org/linux-mtd

Pull MTD updates from Boris Brezillon:
 "General updates:
   - Constify pci_device_id in various drivers
   - Constify device_type
   - Remove pad control code from the Gemini driver
   - Use %pOF to print OF node full_name
   - Various fixes in the physmap_of driver
   - Remove unused vars in mtdswap
   - Check devm_kzalloc() return value in the spear_smi driver
   - Check clk_prepare_enable() return code in the st_spi_fsm driver
   - Create per MTD device debugfs enties

  NAND updates, from Boris Brezillon:
   - Fix memory leaks in the core
   - Remove unused NAND locking support
   - Rename nand.h into rawnand.h (preparing support for spi NANDs)
   - Use NAND_MAX_ID_LEN where appropriate
   - Fix support for 20nm Hynix chips
   - Fix support for Samsung and Hynix SLC NANDs
   - Various cleanup, improvements and fixes in the qcom driver
   - Fixes for bugs detected by various static code analysis tools
   - Fix mxc ooblayout definition
   - Add a new part_parsers to tmio and sharpsl platform data in order
     to define a custom list of partition parsers
   - Request the reset line in exclusive mode in the sunxi driver
   - Fix a build error in the orion-nand driver when compiled for ARMv4
   - Allow 64-bit mvebu platforms to select the PXA3XX driver

  SPI NOR updates, from Cyrille Pitchen and Marek Vasut:
   - add support to the JEDEC JESD216B specification (SFDP tables).
   - add support to the Intel Denverton SPI flash controller.
   - fix error recovery for Spansion/Cypress SPI NOR memories.
   - fix 4-byte address management for the Aspeed SPI controller.
   - add support to some Microchip SST26 memory parts
   - remove unneeded pinctrl header Write a message for tag:"

* tag 'for-linus-20170904' of git://git.infradead.org/linux-mtd: (74 commits)
  mtd: nand: complain loudly when chip->bits_per_cell is not correctly initialized
  mtd: nand: make Samsung SLC NAND usable again
  mtd: nand: tmio: Register partitions using the parsers
  mfd: tmio: Add partition parsers platform data
  mtd: nand: sharpsl: Register partitions using the parsers
  mtd: nand: sharpsl: Add partition parsers platform data
  mtd: nand: qcom: Support for IPQ8074 QPIC NAND controller
  mtd: nand: qcom: support for IPQ4019 QPIC NAND controller
  dt-bindings: qcom_nandc: IPQ8074 QPIC NAND documentation
  dt-bindings: qcom_nandc: IPQ4019 QPIC NAND documentation
  dt-bindings: qcom_nandc: fix the ipq806x device tree example
  mtd: nand: qcom: support for different DEV_CMD register offsets
  mtd: nand: qcom: QPIC data descriptors handling
  mtd: nand: qcom: enable BAM or ADM mode
  mtd: nand: qcom: erased codeword detection configuration
  mtd: nand: qcom: support for read location registers
  mtd: nand: qcom: support for passing flags in DMA helper functions
  mtd: nand: qcom: add BAM DMA descriptor handling
  mtd: nand: qcom: allocate BAM transaction
  mtd: nand: qcom: DMA mapping support for register read buffer
  ...
2017-09-09 14:48:21 -07:00
Trond Myklebust
1bd5d6d08e NFS: Count the bytes of skipped subrequests in nfs_lock_and_join_requests()
If we skip a subrequest due to a zero refcount, we should still count
the byte range that it covered so that we accurately reconstruct the
original request size.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-09 16:43:09 -04:00
Linus Torvalds
ad9a19d003 More RDMA work and some op-structure constification from Chuck Lever,
and a small cleanup to our xdr encoding.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZst0LAAoJECebzXlCjuG+o30QALbchoIvs7BiDrUxYMfJ2nCa
 7UW69STwX79B3NZTg7RrScFTLPEFW9DMpb/Og7AYTH3/wdgGYQNM1UxGUYe7IxSN
 xemH7BSmQzJ7ryaxouO/jskUw5nvNRXhY0PMxJApjrCs837vTjduIVw9zUa8EDeH
 9toxpTM4k3z/1myj60PuHnuQF9EyLDL6W581loDF04nQB3pVRbAZOh1lUeqMgLUd
 7IF+CDECFcjL7oZSA3wDGpsVySLdZ+GYxloFIDO/d8kHEsZD3OaN2MdfRki8EOSQ
 qibTYO0284VeyNLUOIHjspqbDh0Lr2F7VolMmlM5GF1IuApih0/QYidqsH6/As3U
 JIAK53vgqZfK2qI0ud7dGGFEnT/vlE7pQiXiza36xI8YZu4Xz6uGbM41p38RU8jO
 3fr38xdPqqO7YE6F7ZUHYyrmW81Vi0lFdQkw1DBEipHV8UquuCmdtAeR9xgDsdQ/
 LsMVevM1mF+19krOIGbBnENq1GX78ecfHEYGxlTjf/MeO4JYl+8/x7Ow2e/ZbwSa
 7hpUeCiVuVmy1hqOEtraBl5caAG0hCE8PeGRrdr5dA6ZS9YTm0ANgtxndKabwDh2
 CjXF3gRnQNUGdFGCi/fmvfb89tVNj1tL52pbQqfgOb/VFrrL328vyNNg/1p2VY4Q
 qzmKtxZhi/XBewQjaSQl
 =E3UQ
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.14' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "More RDMA work and some op-structure constification from Chuck Lever,
  and a small cleanup to our xdr encoding"

* tag 'nfsd-4.14' of git://linux-nfs.org/~bfields/linux:
  svcrdma: Estimate Send Queue depth properly
  rdma core: Add rdma_rw_mr_payload()
  svcrdma: Limit RQ depth
  svcrdma: Populate tail iovec when receiving
  nfsd: Incoming xdr_bufs may have content in tail buffer
  svcrdma: Clean up svc_rdma_build_read_chunk()
  sunrpc: Const-ify struct sv_serv_ops
  nfsd: Const-ify NFSv4 encoding and decoding ops arrays
  sunrpc: Const-ify instances of struct svc_xprt_ops
  nfsd4: individual encoders no longer see error cases
  nfsd4: skip encoder in trivial error cases
  nfsd4: define ->op_release for compound ops
  nfsd4: opdesc will be useful outside nfs4proc.c
  nfsd4: move some nfsd4 op definitions to xdr4.h
2017-09-09 13:31:49 -07:00
Linus Torvalds
66ba772ee3 Merge branch 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "The changes range through all types: cleanups, core chagnes, sanity
  checks, fixes, other user visible changes, detailed list below:

   - deprecated: user transaction ioctl

   - mount option ssd does not change allocation alignments

   - degraded read-write mount is allowed if all the raid profile
     constraints are met, now based on more accurate check

   - defrag: do not reset compression afterwards; the NOCOMPRESS flag
     can be now overriden by defrag

   - prep work for better extent reference tracking (related to the
     qgroup slowness with balance)

   - prep work for compression heuristics

   - memory allocation reductions (may help latencies on a loaded
     system)

   - better accounting for io waiting states

   - error handling improvements (removed BUGs)

   - added more sanity checks for shared refs

   - fix readdir vs pagefault deadlock under some circumstances

   - fix for 'no-hole' mode, certain combination of compressed and
     inline extents

   - send: fix emission of invalid clone operations

   - fixup file mode if setting acls fail

   - more fixes from fuzzing

   - oher cleanups"

* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (104 commits)
  btrfs: submit superblock io with REQ_META and REQ_PRIO
  btrfs: remove unnecessary memory barrier in btrfs_direct_IO
  btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent
  btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent
  btrfs: pass fs_info to btrfs_del_root instead of tree_root
  Btrfs: add one more sanity check for shared ref type
  Btrfs: remove BUG_ON in __add_tree_block
  Btrfs: remove BUG() in add_data_reference
  Btrfs: remove BUG() in print_extent_item
  Btrfs: remove BUG() in btrfs_extent_inline_ref_size
  Btrfs: convert to use btrfs_get_extent_inline_ref_type
  Btrfs: add a helper to retrive extent inline ref type
  btrfs: scrub: simplify scrub worker initialization
  btrfs: scrub: clean up division in scrub_find_csum
  btrfs: scrub: clean up division in __scrub_mark_bitmap
  btrfs: scrub: use bool for flush_all_writes
  btrfs: preserve i_mode if __btrfs_set_acl() fails
  btrfs: Remove extraneous chunk_objectid variable
  btrfs: Remove chunk_objectid argument from btrfs_make_block_group
  btrfs: Remove extra parentheses from condition in copy_items()
  ...
2017-09-09 13:27:51 -07:00
Trond Myklebust
8b77484f2b NFS: Don't hold the group lock when calling nfs_release_request()
That can deadlock if this is the last reference since
nfs_page_group_destroy() calls nfs_page_group_sync_on_bit().
Note that even if the page was removed from the subpage list,
the req->wb_head could still be pointing to the old head.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-09 15:36:40 -04:00
Trond Myklebust
5d2a9d9dac NFS: Remove pnfs_generic_transfer_commit_list()
It's pretty much a duplicate of nfs_scan_commit_list() that also
clears the PG_COMMIT_TO_DS flag.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-09 12:51:40 -04:00
Trond Myklebust
137da553db NFS: nfs_lock_and_join_requests and nfs_scan_commit_list can deadlock
Since the commit list is not ordered, it is possible for nfs_scan_commit_list
to hold a request that nfs_lock_and_join_requests() is waiting for, while
at the same time trying to grab a request that nfs_lock_and_join_requests
already holds.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-09 12:28:01 -04:00
Sean Purcell
87bf54bb43 squashfs: Add zstd support
Add zstd compression and decompression support to SquashFS. zstd is a
great fit for SquashFS because it can compress at ratios approaching xz,
while decompressing twice as fast as zlib. For SquashFS in particular,
it can decompress as fast as lzo and lz4. It also has the flexibility
to turn down the compression ratio for faster compression times.

The compression benchmark is run on the file tree from the SquashFS archive
found in ubuntu-16.10-desktop-amd64.iso [1]. It uses `mksquashfs` with the
default block size (128 KB) and and various compression algorithms/levels.
xz and zstd are also benchmarked with 256 KB blocks. The decompression
benchmark times how long it takes to `tar` the file tree into `/dev/null`.
See the benchmark file in the upstream zstd source repository located under
`contrib/linux-kernel/squashfs-benchmark.sh` [2] for details.

I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD.

| Method         | Ratio | Compression MB/s | Decompression MB/s |
|----------------|-------|------------------|--------------------|
| gzip           |  2.92 |               15 |                128 |
| lzo            |  2.64 |              9.5 |                217 |
| lz4            |  2.12 |               94 |                218 |
| xz             |  3.43 |              5.5 |                 35 |
| xz 256 KB      |  3.53 |              5.4 |                 40 |
| zstd 1         |  2.71 |               96 |                210 |
| zstd 5         |  2.93 |               69 |                198 |
| zstd 10        |  3.01 |               41 |                225 |
| zstd 15        |  3.13 |             11.4 |                224 |
| zstd 16 256 KB |  3.24 |              8.1 |                210 |

This patch was written by Sean Purcell <me@seanp.xyz>, but I will be
taking over the submission process.

[1] http://releases.ubuntu.com/16.10/
[2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/squashfs-benchmark.sh

zstd source repository: https://github.com/facebook/zstd

Signed-off-by: Sean Purcell <me@seanp.xyz>
Signed-off-by: Nick Terrell <terrelln@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: Phillip Lougher <phillip@squashfs.org.uk>
2017-09-08 19:33:25 -07:00
Trond Myklebust
196639ebbe NFS: Fix 2 use after free issues in the I/O code
The writeback code wants to send a commit after processing the pages,
which is why we want to delay releasing the struct path until after
that's done.

Also, the layout code expects that we do not free the inode before
we've put the layout segments in pnfs_writehdr_free() and
pnfs_readhdr_free()

Fixes: 919e3bd9a8 ("NFS: Ensure we commit after writeback is complete")
Fixes: 4714fb51fd ("nfs: remove pgio_header refcount, related cleanup")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-08 22:07:52 -04:00
OGAWA Hirofumi
5680db4b66 vfat: deduplicate hex2bin()
We may use hex2bin() instead of custom approach.

Link: http://lkml.kernel.org/r/87zibktpil.fsf@devron
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:50 -07:00
Tomohiro Kusumi
b9fa2ad1ed autofs: use unsigned int/long instead of uint/ulong for ioctl args
The standard types unsigned int and unsigned long should be used for
.compat_ioctl.  autofs is the only fs using uing/ulong for this, and these
are even the only uint/ulong in the entire autofs code.

Drop unneeded long cast in return value of autofs_dev_ioctl_compat().
It's already long.

Link: http://lkml.kernel.org/r/150285069709.4670.3884827966280147529.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:50 -07:00
Tomohiro Kusumi
7ed1da84b3 autofs: drop wrong comment
This comment was correct when it was added in 8d7b48e0 ("autofs4: add
miscellaneous device for ioctls") in 2008, but not after 4e44b685 "Get rid
of path_lookup in autofs4" in 2009 which introduced find_autofs_mount().

Link: http://lkml.kernel.org/r/150285069148.4670.17959501481201077445.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:50 -07:00
Tomohiro Kusumi
718b303b49 autofs: use AUTOFS_DEV_IOCTL_SIZE
Use a macro which defines misc-dev ioctl parameter size (excluding a path
beyond &path[0]) since it's been used to initialize and copy this
structure ever since it first appeared in 8d7b48e0 in 2008.

(or simply get rid of this if this is just unnecessary abstraction when
all it needs is sizeof(struct autofs_dev_ioctl))

Edit: raven@themaw.net
That's a good point but I'd prefer to keep the macro define.
End edit: raven@themaw.net

Link: http://lkml.kernel.org/r/150285068577.4670.2599968823770600622.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:50 -07:00
Tomohiro Kusumi
ed837d0901 autofs: non functional header inclusion cleanup
Having header includes before any macro (without any dependency) simply
looks normal.  No reason to have these macros in between.

Link: http://lkml.kernel.org/r/150285068011.4670.10271483982093996996.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:50 -07:00
Ian Kent
3dd8f7c3b7 autofs: make dev ioctl version and ismountpoint user accessible
Some of the autofs miscellaneous device ioctls need to be accessable to
user space applications without CAP_SYS_ADMIN to get information about
autofs mounts.

Link: http://lkml.kernel.org/r/150216642517.11652.2338933266137331637.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Colin Walters <walters@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:50 -07:00
Ian Kent
e54c7bcbf1 autofs: make disc device user accessible
The autofs miscellanous device ioctls that shouldn't require
CAP_SYS_ADMIN need to be accessible to user space applications in order
to be able to get information about autofs mounts.

The module checks capabilities so the miscelaneous device should be fine
with broad permissions.

Link: http://lkml.kernel.org/r/150216641928.11652.7388977863125547969.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Colin Walters <walters@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:50 -07:00
Ian Kent
42f4614821 autofs: fix AT_NO_AUTOMOUNT not being honored
The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT which
is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering of an
automount by the call.  But this flag is unconditionally cleared for all
stat family system calls except statx().

stat family system calls have always triggered mount requests for the
negative dentry case in follow_automount() which is intended but prevents
the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.

In order to handle the AT_NO_AUTOMOUNT for both system calls the negative
dentry case in follow_automount() needs to be changed to return ENOENT
when the LOOKUP_AUTOMOUNT flag is clear (and the other required flags are
clear).

AFAICT this change doesn't have any noticable side effects and may, in
some use cases (although I didn't see it in testing) prevent unnecessary
callbacks to the automount daemon.

It's also possible that a stat family call has been made with a path that
is in the process of being mounted by some other process.  But stat family
calls should return the automount state of the path as it is "now" so it
shouldn't wait for mount completion.

This is the same semantic as the positive dentry case already handled.

Link: http://lkml.kernel.org/r/150216641255.11652.4204561328197919771.stgit@pluto.themaw.net
Fixes: deccf497d8 ("Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()")
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: David Howells <dhowells@redhat.com>
Cc: Colin Walters <walters@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:50 -07:00
Markus Elfring
9367bb730e binfmt_flat: delete two error messages for a failed memory allocation in decompress_exec()
Omit extra messages for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Link: http://lkml.kernel.org/r/f92aac79-b05e-321a-1a19-d38c7159ee9c@users.sourceforge.net
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:50 -07:00
Davidlohr Bueso
b2ac2ea629 fs/epoll: use faster rb_first_cached()
...  such that we can avoid the tree walks to get the node with the
smallest key.  Semantically the same, as the previously used rb_first(),
but O(1).  The main overhead is the extra footprint for the cached rb_node
pointer, which should not matter for epoll.

Link: http://lkml.kernel.org/r/20170719014603.19029-15-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:49 -07:00
Davidlohr Bueso
410bd5ecb2 procfs: use faster rb_first_cached()
...  such that we can avoid the tree walks to get the node with the
smallest key.  Semantically the same, as the previously used rb_first(),
but O(1).  The main overhead is the extra footprint for the cached rb_node
pointer, which should not matter for procfs.

Link: http://lkml.kernel.org/r/20170719014603.19029-14-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:49 -07:00
Davidlohr Bueso
f808c13fd3 lib/interval_tree: fast overlap detection
Allow interval trees to quickly check for overlaps to avoid unnecesary
tree lookups in interval_tree_iter_first().

As of this patch, all interval tree flavors will require using a
'rb_root_cached' such that we can have the leftmost node easily
available.  While most users will make use of this feature, those with
special functions (in addition to the generic insert, delete, search
calls) will avoid using the cached option as they can do funky things
with insertions -- for example, vma_interval_tree_insert_after().

[jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
  Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Doug Ledford <dledford@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Christian Benvenuti <benve@cisco.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:49 -07:00
David Rientjes
1403830294 fs, proc: unconditional cond_resched when reading smaps
If there are large numbers of hugepages to iterate while reading
/proc/pid/smaps, the page walk never does cond_resched().  On archs
without split pmd locks, there can be significant and observable
contention on mm->page_table_lock which cause lengthy delays without
rescheduling.

Always reschedule in smaps_pte_range() if necessary since the pagewalk
iteration can be expensive.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708211405520.131071@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:47 -07:00
Alexey Dobriyan
855d97657d proc: uninline proc_create()
Save some code from ~320 invocations all clearing last argument.

	add/remove: 3/0 grow/shrink: 0/158 up/down: 45/-702 (-657)
	function                                     old     new   delta
	proc_create                                    -      17     +17
	__ksymtab_proc_create                          -      16     +16
	__kstrtab_proc_create                          -      12     +12
	yam_init_driver                              301     298      -3

		...

	cifs_proc_init                               249     228     -21
	via_fb_pci_probe                            2304    2280     -24

Link: http://lkml.kernel.org/r/20170819094702.GA27864@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:47 -07:00
Michal Hocko
1240ea0dc3 fs, proc: remove priv argument from is_stack
Commit b18cb64ead ("fs/proc: Stop trying to report thread stacks")
removed the priv parameter user in is_stack so the argument is
redundant.  Drop it.

[arnd@arndb.de: remove unused variable]
  Link: http://lkml.kernel.org/r/20170801120150.1520051-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20170728075833.7241-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:47 -07:00
Andrea Arcangeli
656710a60e userfaultfd: non-cooperative: closing the uffd without triggering SIGBUS
This is an enhancement to avoid a non cooperative userfaultfd manager
having to unregister all regions before it can close the uffd after all
userfaultfd activity completed.

The UFFDIO_UNREGISTER would serialize against the handle_userfault by
taking the mmap_sem for writing, but we can simply repeat the page fault
if we detect the uffd was closed and so the regular page fault paths
should takeover.

Link: http://lkml.kernel.org/r/20170823181227.19926-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:47 -07:00
Jérôme Glisse
df6ad69838 mm/device-public-memory: device memory cache coherent with CPU
Platform with advance system bus (like CAPI or CCIX) allow device memory
to be accessible from CPU in a cache coherent fashion.  Add a new type of
ZONE_DEVICE to represent such memory.  The use case are the same as for
the un-addressable device memory but without all the corners cases.

Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sherry Cheung <SCheung@nvidia.com>
Cc: Subhash Gutti <sgutti@nvidia.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:46 -07:00
Jérôme Glisse
2916ecc0f9 mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY
Introduce a new migration mode that allow to offload the copy to a device
DMA engine.  This changes the workflow of migration and not all
address_space migratepage callback can support this.

This is intended to be use by migrate_vma() which itself is use for thing
like HMM (see include/linux/hmm.h).

No additional per-filesystem migratepage testing is needed.  I disables
MIGRATE_SYNC_NO_COPY in all problematic migratepage() callback and i
added comment in those to explain why (part of this patch).  The commit
message is unclear it should say that any callback that wish to support
this new mode need to be aware of the difference in the migration flow
from other mode.

Some of these callbacks do extra locking while copying (aio, zsmalloc,
balloon, ...) and for DMA to be effective you want to copy multiple
pages in one DMA operations.  But in the problematic case you can not
easily hold the extra lock accross multiple call to this callback.

Usual flow is:

For each page {
 1 - lock page
 2 - call migratepage() callback
 3 - (extra locking in some migratepage() callback)
 4 - migrate page state (freeze refcount, update page cache, buffer
     head, ...)
 5 - copy page
 6 - (unlock any extra lock of migratepage() callback)
 7 - return from migratepage() callback
 8 - unlock page
}

The new mode MIGRATE_SYNC_NO_COPY:
 1 - lock multiple pages
For each page {
 2 - call migratepage() callback
 3 - abort in all problematic migratepage() callback
 4 - migrate page state (freeze refcount, update page cache, buffer
     head, ...)
} // finished all calls to migratepage() callback
 5 - DMA copy multiple pages
 6 - unlock all the pages

To support MIGRATE_SYNC_NO_COPY in the problematic case we would need a
new callback migratepages() (for instance) that deals with multiple
pages in one transaction.

Because the problematic cases are not important for current usage I did
not wanted to complexify this patchset even more for no good reason.

Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Sherry Cheung <SCheung@nvidia.com>
Cc: Subhash Gutti <sgutti@nvidia.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:46 -07:00
Jérôme Glisse
5042db43cc mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory
HMM (heterogeneous memory management) need struct page to support
migration from system main memory to device memory.  Reasons for HMM and
migration to device memory is explained with HMM core patch.

This patch deals with device memory that is un-addressable memory (ie CPU
can not access it).  Hence we do not want those struct page to be manage
like regular memory.  That is why we extend ZONE_DEVICE to support
different types of memory.

A persistent memory type is define for existing user of ZONE_DEVICE and a
new device un-addressable type is added for the un-addressable memory
type.  There is a clear separation between what is expected from each
memory type and existing user of ZONE_DEVICE are un-affected by new
requirement and new use of the un-addressable type.  All specific code
path are protect with test against the memory type.

Because memory is un-addressable we use a new special swap type for when a
page is migrated to device memory (this reduces the number of maximum swap
file).

The main two additions beside memory type to ZONE_DEVICE is two callbacks.
First one, page_free() is call whenever page refcount reach 1 (which
means the page is free as ZONE_DEVICE page never reach a refcount of 0).
This allow device driver to manage its memory and associated struct page.

The second callback page_fault() happens when there is a CPU access to an
address that is back by a device page (which are un-addressable by the
CPU).  This callback is responsible to migrate the page back to system
main memory.  Device driver can not block migration back to system memory,
HMM make sure that such page can not be pin into device memory.

If device is in some error condition and can not migrate memory back then
a CPU page fault to device memory should end with SIGBUS.

[arnd@arndb.de: fix warning]
  Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Sherry Cheung <SCheung@nvidia.com>
Cc: Subhash Gutti <sgutti@nvidia.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:46 -07:00
Naoya Horiguchi
ab6e3d0939 mm: soft-dirty: keep soft-dirty bits over thp migration
Soft dirty bit is designed to keep tracked over page migration.  This
patch makes it work in the same manner for thp migration too.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:45 -07:00
Zi Yan
84c3fc4e9c mm: thp: check pmd migration entry in common path
When THP migration is being used, memory management code needs to handle
pmd migration entries properly.  This patch uses !pmd_present() or
is_swap_pmd() (depending on whether pmd_none() needs separate code or
not) to check pmd migration entries at the places where a pmd entry is
present.

Since pmd-related code uses split_huge_page(), split_huge_pmd(),
pmd_trans_huge(), pmd_trans_unstable(), or
pmd_none_or_trans_huge_or_clear_bad(), this patch:

1. adds pmd migration entry split code in split_huge_pmd(),

2. takes care of pmd migration entries whenever pmd_trans_huge() is present,

3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.

Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
them.

Until this commit, a pmd entry should be:
1. pointing to a pte page,
2. is_swap_pmd(),
3. pmd_trans_huge(),
4. pmd_devmap(), or
5. pmd_none().

Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:45 -07:00
Yunlei He
27161f13e3 f2fs: avoid race in between read xattr & write xattr
Thread A:					Thread B:
-f2fs_getxattr
   -lookup_all_xattrs
      -xnid = F2FS_I(inode)->i_xattr_nid;
						-f2fs_setxattr
						    -__f2fs_setxattr
						        -write_all_xattrs
						            -truncate_xattr_node
							          ...  ...
						-write_checkpoint
								  ...  ...
						-alloc_nid   <- nid reuse
          -get_node_page
              -f2fs_bug_on  <- nid != node_footer->nid

It's need a rw_sem to avoid the race

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-07 20:57:20 -07:00
Jaegeuk Kim
13ba41e346 f2fs: make get_lock_data_page to handle encrypted inode
This patch refactors get_lock_data_page() to handle encryption case directly.
In order to do that, it introduces common f2fs_submit_page_read().

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-07 20:55:51 -07:00
Linus Torvalds
828f4257d1 This series has the ultimate goal of providing a sane stack rlimit when
running set*id processes. To do this, the bprm_secureexec LSM hook is
 collapsed into the bprm_set_creds hook so the secureexec-ness of an exec
 can be determined early enough to make decisions about rlimits and the
 resulting memory layouts. Other logic acting on the secureexec-ness of an
 exec is similarly consolidated. Capabilities needed some special handling,
 but the refactoring removed other special handling, so that was a wash.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJZrwRKAAoJEIly9N/cbcAmhboP/iwLbYfWngIJdu3pYKrW+CEg
 uUVY6RNnsumJ5yEhD/yQKXSPmZ8PkC8vexPYvf8TcPOlMRQuhVvdiR0FfSUvkMWy
 pB8ZVCyAV1uSnW4BH61FCxHInrahy8jlvQwnAujvw+FNxhcQjyEGKupOLIMGLioQ
 8G5Ihf+hOjiXRhKbXueQi89n8i4jEI5YTH1RnC+Gsy8jG11EC9BhPddKSMaUKZA3
 HYYqUyV0daYpGuxTOxaRdDO5wb6rlS+B46hqtOsSsIBOQkCjnLCRcdeMCqvXjQmv
 kyZj03cPlUjEHqh3d3nB6utvVWReGf/p986//kQjT1OZPhATbySAu7wUHoLik3dU
 zuexudNTBROf6YXahMxSJp348GS++xoBFARa78402E++U7C4/eoclbLCWAylBwVA
 H+QAHFYRC2WFoskejSYBRPz6HLr1SIaSYMsKbkHqP07zi6p3ic2Uq3XvOP2zL/5p
 l/mXa1Fs2vcDOWPER8a8b9mVkJDvuXj6J11lG+q80UWAWC3sd9GkSwOen80ps3Xo
 /7dd+h2BAJSSVxZQFxd5YCx99mT0ntQZ797PhjxOY6SX/xUdOCAp9x1zDU5OUovP
 q2ty3UTd7tq8h1RnHOnrn9cKmMmI7kpBvEfPGM507cEVjyfsMu2jJtUxN9dXOAkB
 aebEsg3C8M6z5OdGVpWH
 =Yva4
 -----END PGP SIGNATURE-----

Merge tag 'secureexec-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull secureexec update from Kees Cook:
 "This series has the ultimate goal of providing a sane stack rlimit
  when running set*id processes.

  To do this, the bprm_secureexec LSM hook is collapsed into the
  bprm_set_creds hook so the secureexec-ness of an exec can be
  determined early enough to make decisions about rlimits and the
  resulting memory layouts. Other logic acting on the secureexec-ness of
  an exec is similarly consolidated. Capabilities needed some special
  handling, but the refactoring removed other special handling, so that
  was a wash"

* tag 'secureexec-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  exec: Consolidate pdeath_signal clearing
  exec: Use sane stack rlimit under secureexec
  exec: Consolidate dumpability logic
  smack: Remove redundant pdeath_signal clearing
  exec: Use secureexec for clearing pdeath_signal
  exec: Use secureexec for setting dumpability
  LSM: drop bprm_secureexec hook
  commoncap: Move cap_elevated calculation into bprm_set_creds
  commoncap: Refactor to remove bprm_secureexec hook
  smack: Refactor to remove bprm_secureexec hook
  selinux: Refactor to remove bprm_secureexec hook
  apparmor: Refactor to remove bprm_secureexec hook
  binfmt: Introduce secureexec flag
  exec: Correct comments about "point of no return"
  exec: Rename bprm->cred_prepared to called_set_creds
2017-09-07 20:35:29 -07:00
Linus Torvalds
21d236bf2b Make pstore permissions more versatile by removing CAP_SYSLOG requirement
and defining more restrictive root directory DAC permissions default
 (0750, which can be adjust after boot unlike the CAP_SYSLOG check).
 Suggested by Nick Kralevich.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJZrv+iAAoJEIly9N/cbcAmZXcP/jZ7dW3zQiZ2q6YQDokaABT4
 AZxGdDrogLQ6wWmV+ApHIYEOTcVvbswvBLwKIE7l9XpG41tIKUe4h9iCVvpBSARP
 SpyeawztJ8KNw00EFZWP/hOxCXHeausilea/1zh/+Rt5VhU2YIw/fhew821bjLmh
 3exBjoLcWSHHCUY/e9ByMB0mB0SYUmnqhFub77Z6zZMhaRw9/gvPibS1DdmjGPPI
 Rq0zejFAqXy50rmbKVTT2QQPq/gQnUyb/Q216ytbSUntaAwfISDrwN74slupjG3S
 Vrca+BxThJYZ+rnbqjMDoROgKAYNqyIlvFVCO3H6DUqnPnGROIAeGELAcGyncUo+
 6Mdpumhy25K0+YbJkNYxm1cyH0w47EWpIqBqPTh1IhuedDB5cpdamR88dShmMzNA
 XhvMhe9eNxI5ZzOg8X8qCEc/hRZoZj5F4m2R+Wh55YRH3rDtuaIzONPvGyJfYYVS
 tY8ut/r8+qMID9I4qLtIAmVX2rzR/6BG7H3ofApY0OGFRmCt0nicUdN56JJ+GNRf
 7XfpEXDL+sG3fkUk8oQSfSEhLuOseTazLuxrQAWJIZ3FZ4JnRW/a/izlbsI2+nvy
 FcC1+tG43ISwir5jZzNznYNrGM01TdFwQ5izKE3E1U+xsBRbR7OT8Y0005Z+GUwW
 6feSKts8UKq4tFNt1WY9
 =+gsj
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore update from Kees Cook:
 "Make pstore permissions more versatile by removing CAP_SYSLOG
  requirement and defining more restrictive root directory DAC
  permissions default (0750, which can be adjust after boot unlike the
  CAP_SYSLOG check).

  Suggested by Nick Kralevich"

* tag 'pstore-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  Revert "pstore: Honor dmesg_restrict sysctl on dmesg dumps"
  pstore: Make default pstorefs root dir perms 0750
2017-09-07 19:58:56 -07:00
Linus Torvalds
8dc5b3a6cb enable xattr support for smb3 and also a bugfix
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQGcBAABAgAGBQJZsb0OAAoJEIosvXAHck9RgOYL/0AAqRPEt1mvZTIm51LXxSYy
 4Iyyz8AEHFp4ZW5jctblm49/4mh3rJ4O7Ig50de00y+XkoJIzACgf1lazaTTCcNQ
 lspHnfqCgR1DvF+A7OxbfS4CkF43SXFnL3GcoiW5zpV8QndM3DWLVIj1AHpEtEOX
 sa3mSdFVhdEP6ka5Q98vam4N6jRnvz1gapLQKHpRTuVAGYZAw44+8HJKxN5btTIP
 F+9X4zCyNjDDsuKoxAkVZmo/k7cxbKQBRjq9fNLHPR/GPEy06I+j2AqTnd7Gc/o+
 2hw0X4eF8N7HbEujnvEcqPLCqwNVtPNdlOelICNmNSo4OYhrhb3u93hEv7oFuVLH
 /PlGrzRYi6TEf1aBiftfA9aNVNzCHtnqRIZiC9+LhCiI77H9ef4/Fl7i6xsoKcAi
 XA0RyslWNbhlQBWMXeQN7k2lyZIx4+Kq7xnnW1FBPLIuLMCrZAS0/IAcakKf3Mud
 zDzOKI7ZRWEIEaSsM/I/QFfagkMuKTSea6XovkogUw==
 =+DYj
 -----END PGP SIGNATURE-----

Merge tag '4.14-smb3-xattr-enable' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs update from Steve French:
 "Enable xattr support for smb3 and also a bugfix"

* tag '4.14-smb3-xattr-enable' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Check for timeout on Negotiate stage
  cifs: Add support for writing attributes on SMB2+
  cifs: Add support for reading attributes on SMB2+
2017-09-07 16:06:14 -07:00
Linus Torvalds
2500e287bc Merge git://git.kvack.org/~bcrl/aio-next
Pull aio fix from Ben LaHaise:
 "Improve aio-nr counting on large SMP systems.

  It has been in linux-next for quite some time"

* git://git.kvack.org/~bcrl/aio-next:
  fs: aio: fix the increment of aio-nr and counting against aio-max-nr
2017-09-07 15:51:11 -07:00
Linus Torvalds
ae8ac6b7db Merge branch 'quota_scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota scaling updates from Jan Kara:
 "This contains changes to make the quota subsystem more scalable.

  Reportedly it improves number of files created per second on ext4
  filesystem on fast storage by about a factor of 2x"

* 'quota_scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (28 commits)
  quota: Add lock annotations to struct members
  quota: Reduce contention on dq_data_lock
  fs: Provide __inode_get_bytes()
  quota: Inline dquot_[re]claim_reserved_space() into callsite
  quota: Inline inode_{incr,decr}_space() into callsites
  quota: Inline functions into their callsites
  ext4: Disable dirty list tracking of dquots when journalling quotas
  quota: Allow disabling tracking of dirty dquots in a list
  quota: Remove dq_wait_unused from dquot
  quota: Move locking into clear_dquot_dirty()
  quota: Do not dirty bad dquots
  quota: Fix possible corruption of dqi_flags
  quota: Propagate ->quota_read errors from v2_read_file_info()
  quota: Fix error codes in v2_read_file_info()
  quota: Push dqio_sem down to ->read_file_info()
  quota: Push dqio_sem down to ->write_file_info()
  quota: Push dqio_sem down to ->get_next_id()
  quota: Push dqio_sem down to ->release_dqblk()
  quota: Remove locking for writing to the old quota format
  quota: Do not acquire dqio_sem for dquot overwrites in v2 format
  ...
2017-09-07 15:19:35 -07:00
Linus Torvalds
460352c2f1 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF, reiserfs, quota, fsnotify cleanups from Jan Kara:
 "Several UDF, reiserfs, quota and fsnotify cleanups.

  Note that there is also a patch updating MAINTAINERS entry for
  notification subsystem to point to me as a maintainer since current
  entries are stale"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: make dnotify_fsnotify_ops const
  isofs: Delete an unnecessary variable initialisation in isofs_read_inode()
  isofs: Adjust four checks for null pointers
  isofs: Delete an error message for a failed memory allocation in isofs_read_inode()
  quota_v2: Delete an error message for a failed memory allocation in v2_read_file_info()
  fs-udf: Delete an error message for a failed memory allocation in two functions
  fs-udf: Improve six size determinations
  fs-udf: Adjust two checks for null pointers
  reiserfs: fix spelling mistake: "tranasction" -> "transaction"
  MAINTAINERS: Update entries for notification subsystem
  uapi/linux/quota.h: Do not include linux/errno.h
2017-09-07 14:53:17 -07:00
Linus Torvalds
c0da4fa0d1 media updates for v4.14-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZsSBoAAoJEAhfPr2O5OEVDc4QAJZSuVYmyLgvtmPxhyqgCvkz
 I0DmWM4ZtK2VT/xJ/AA23z8IiLKi2+pDC0Xx6/aIiA665cyl3oPUdkKIaHW9Z6+A
 fV8gSFkmGkluQb9mP/KdHYI2oSeEv2ivCa1kfaApYcoBa904z8uU++z15Iu5p/+m
 fjpc2vnc9rax0Vuwmgv7p1CL4j4e/ja0siCSCGbu2ad50KqP4ytnBooNPQOQt89D
 L+Av5MeGml/CTUUnAFjWfSmQ72Ht8GhoBBKc6wGoq9x3GTckDDTqy8BAqGt4UQnu
 fR0mb71zuSVmTjxRe7tc/74m3ReaeSHzQeHJhjdQslvNmV3RVQgk/6CCsmqNEegr
 rbC3glQCM+gp5YywCjRL6DCPsoqvjexLtPQjMZIGYxgSYQUyXGOxilgmj9+73761
 6aOl0nqdgN+vlWzaSeDF9EQxRsc+cCq/Po8/xuPE/Pzs6zTQwU+6b+ADLf9jCyDP
 LTC49wOj24SoWiTlG1FTct2ogZ3h5wNPWlurBtmyiFJn+43RpsH5IW9wLilCjeiE
 6JeCWEIBglCCq/TVCzETKNSaixDL6/lMQ9uRdCpIO4VLyoS6S9pZASNPBmQ1h7h/
 oTjYDeWirIthNOccstbBoJQYSX62CqAIW3wq5ME6PAgM+ioiLXLYk0fV3yBKoBNW
 Z0SBeTcuPxWmfzuxMtik
 =fNM2
 -----END PGP SIGNATURE-----

Merge tag 'media/v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media updates from Mauro Carvalho Chehab:
 "Brazil's Independence Day pull request :-)

  This is one of the biggest media pull requests, with 625 patches
  affecting almost all parts of media (RC, DVB, V4L2, CEC, docs).

  This contains:

   - A lot of new drivers:
     * DVB frontends: mxl5xx, stv0910, stv6111;
     * camera flash: as3645a led driver;
     * HDMI receiver: adv748X;
     * camera sensor: Omnivision 6650 5M driver (ov6650);
     * HDMI CEC: ao-cec meson driver;
     * V4L2: Qualcom camss driver;
     * Remote controller: gpio-ir-tx, pwm-ir-tx and zx-irdec drivers.

   - The DDbridge DVB driver got a massive update, with makes it in sync
     with modern hardware from that vendor;

   - There's an important milestone on this series: the DVB
     documentation was written in 2003, but only started to be updated
     in 2007. It also used to contain several gaps from the time it was
     kept out of tree, mentioning error codes and device nodes that
     never existed upstream. On this series, it received a massive
     update: all non-deprecated digital TV APIs are now in sync with the
     current implementation;

   - Some DVB APIs that aren't used by any upstream driver got removed;

   - Other parts of the media documentation algo got updated, fixing
     some bugs on its PDF output and making it compatible with Sphinx
     version 1.6.

     As the number of hacks required to build PDF output reduced, I hope
     we'll have less troubles as newer versions of our documentation
     toolchain are released (famous last words);

   - As usual, lots of driver cleanups and improvements"

* tag 'media/v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (624 commits)
  media: leds: as3645a: add V4L2_FLASH_LED_CLASS dependency
  media: get rid of removed DMX_GET_CAPS and DMX_SET_SOURCE leftovers
  media: Revert "[media] v4l: async: make v4l2 coexist with devicetree nodes in a dt overlay"
  media: staging: atomisp: sh_css_calloc shall return a pointer to the allocated space
  media: Revert "[media] lirc_dev: remove superfluous get/put_device() calls"
  media: add qcom_camss.rst to v4l-drivers rst file
  media: dvb headers: make checkpatch happier
  media: dvb uapi: move frontend legacy API to another part of the book
  media: pixfmt-srggb12p.rst: better format the table for PDF output
  media: docs-rst: media: Don't use \small for V4L2_PIX_FMT_SRGGB10 documentation
  media: index.rst: don't write "Contents:" on PDF output
  media: pixfmt*.rst: replace a two dots by a comma
  media: vidioc-g-fmt.rst: adjust table format
  media: vivid.rst: add a blank line to correct ReST format
  media: v4l2 uapi book: get rid of driver programming's chapter
  media: format.rst: use the right markup for important notes
  media: docs-rst: cardlists: change their format to flat-tables
  media: em28xx-cardlist.rst: update to reflect last changes
  media: v4l2-event.rst: adjust table to fit on PDF output
  media: docs: don't show ToC for each part on PDF output
  ...
2017-09-07 12:53:14 -07:00
Linus Torvalds
a0725ab0c7 Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the first pull request for 4.14, containing most of the code
  changes. It's a quiet series this round, which I think we needed after
  the churn of the last few series. This contains:

   - Fix for a registration race in loop, from Anton Volkov.

   - Overflow complaint fix from Arnd for DAC960.

   - Series of drbd changes from the usual suspects.

   - Conversion of the stec/skd driver to blk-mq. From Bart.

   - A few BFQ improvements/fixes from Paolo.

   - CFQ improvement from Ritesh, allowing idling for group idle.

   - A few fixes found by Dan's smatch, courtesy of Dan.

   - A warning fixup for a race between changing the IO scheduler and
     device remova. From David Jeffery.

   - A few nbd fixes from Josef.

   - Support for cgroup info in blktrace, from Shaohua.

   - Also from Shaohua, new features in the null_blk driver to allow it
     to actually hold data, among other things.

   - Various corner cases and error handling fixes from Weiping Zhang.

   - Improvements to the IO stats tracking for blk-mq from me. Can
     drastically improve performance for fast devices and/or big
     machines.

   - Series from Christoph removing bi_bdev as being needed for IO
     submission, in preparation for nvme multipathing code.

   - Series from Bart, including various cleanups and fixes for switch
     fall through case complaints"

* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
  kernfs: checking for IS_ERR() instead of NULL
  drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
  drbd: Fix allyesconfig build, fix recent commit
  drbd: switch from kmalloc() to kmalloc_array()
  drbd: abort drbd_start_resync if there is no connection
  drbd: move global variables to drbd namespace and make some static
  drbd: rename "usermode_helper" to "drbd_usermode_helper"
  drbd: fix race between handshake and admin disconnect/down
  drbd: fix potential deadlock when trying to detach during handshake
  drbd: A single dot should be put into a sequence.
  drbd: fix rmmod cleanup, remove _all_ debugfs entries
  drbd: Use setup_timer() instead of init_timer() to simplify the code.
  drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
  drbd: new disk-option disable-write-same
  drbd: Fix resource role for newly created resources in events2
  drbd: mark symbols static where possible
  drbd: Send P_NEG_ACK upon write error in protocol != C
  drbd: add explicit plugging when submitting batches
  drbd: change list_for_each_safe to while(list_first_entry_or_null)
  drbd: introduce drbd_recv_header_maybe_unplug
  ...
2017-09-07 11:59:42 -07:00
Mauricio Faria de Oliveira
2a8a98673c fs: aio: fix the increment of aio-nr and counting against aio-max-nr
Currently, aio-nr is incremented in steps of 'num_possible_cpus() * 8'
for io_setup(nr_events, ..) with 'nr_events < num_possible_cpus() * 4':

    ioctx_alloc()
    ...
        nr_events = max(nr_events, num_possible_cpus() * 4);
        nr_events *= 2;
    ...
        ctx->max_reqs = nr_events;
    ...
        aio_nr += ctx->max_reqs;
    ....

This limits the number of aio contexts actually available to much less
than aio-max-nr, and is increasingly worse with greater number of CPUs.

For example, with 64 CPUs, only 256 aio contexts are actually available
(with aio-max-nr = 65536) because the increment is 512 in that scenario.

Note: 65536 [max aio contexts] / (64*4*2) [increment per aio context]
is 128, but make it 256 (double) as counting against 'aio-max-nr * 2':

    ioctx_alloc()
    ...
        if (aio_nr + nr_events > (aio_max_nr * 2UL) ||
        ...
            goto err_ctx;
    ...

This patch uses the original value of nr_events (from userspace) to
increment aio-nr and count against aio-max-nr, which resolves those.

Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Reported-by: Lekshmi C. Pillai <lekshmi.cpillai@in.ibm.com>
Tested-by: Lekshmi C. Pillai <lekshmi.cpillai@in.ibm.com>
Tested-by: Paul Nguyen <nguyenp@us.ibm.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2017-09-07 12:28:28 -04:00
tarangg@amazon.com
e973b1a599 NFS: Sync the correct byte range during synchronous writes
Since commit 18290650b1 ("NFS: Move buffered I/O locking into
nfs_file_write()") nfs_file_write() has not flushed the correct byte
range during synchronous writes.  generic_write_sync() expects that
iocb->ki_pos points to the right edge of the range rather than the
left edge.

To replicate the problem, open a file with O_DSYNC, have the client
write at increasing offsets, and then print the successful offsets.
Block port 2049 partway through that sequence, and observe that the
client application indicates successful writes in advance of what the
server received.

Fixes: 18290650b1 ("NFS: Move buffered I/O locking into nfs_file_write()")
Signed-off-by: Jacob Strauss <jsstraus@amazon.com>
Signed-off-by: Tarang Gupta <tarangg@amazon.com>
Tested-by: Tarang Gupta <tarangg@amazon.com>
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-07 11:07:13 -04:00
Linus Torvalds
d34fc1adf0 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - various misc bits

 - DAX updates

 - OCFS2

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (119 commits)
  mm,fork: introduce MADV_WIPEONFORK
  x86,mpx: make mpx depend on x86-64 to free up VMA flag
  mm: add /proc/pid/smaps_rollup
  mm: hugetlb: clear target sub-page last when clearing huge page
  mm: oom: let oom_reap_task and exit_mmap run concurrently
  swap: choose swap device according to numa node
  mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
  mm, oom: do not rely on TIF_MEMDIE for memory reserves access
  z3fold: use per-cpu unbuddied lists
  mm, swap: don't use VMA based swap readahead if HDD is used as swap
  mm, swap: add sysfs interface for VMA based swap readahead
  mm, swap: VMA based swap readahead
  mm, swap: fix swap readahead marking
  mm, swap: add swap readahead hit statistics
  mm/vmalloc.c: don't reinvent the wheel but use existing llist API
  mm/vmstat.c: fix wrong comment
  selftests/memfd: add memfd_create hugetlbfs selftest
  mm/shmem: add hugetlbfs support to memfd_create()
  mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
  mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
  ...
2017-09-06 20:49:49 -07:00
Rik van Riel
d2cd9ede6e mm,fork: introduce MADV_WIPEONFORK
Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
in the child process after fork.  This differs from MADV_DONTFORK in one
important way.

If a child process accesses memory that was MADV_WIPEONFORK, it will get
zeroes.  The address ranges are still valid, they are just empty.

If a child process accesses memory that was MADV_DONTFORK, it will get a
segmentation fault, since those address ranges are no longer valid in
the child after fork.

Since MADV_DONTFORK also seems to be used to allow very large programs
to fork in systems with strict memory overcommit restrictions, changing
the semantics of MADV_DONTFORK might break existing programs.

MADV_WIPEONFORK only works on private, anonymous VMAs.

The use case is libraries that store or cache information, and want to
know that they need to regenerate it in the child process after fork.

Examples of this would be:
 - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
   check, which is too slow without a PID cache)
 - PKCS#11 API reinitialization check (mandated by specification)
 - glibc's upcoming PRNG (reseed after fork)
 - OpenSSL PRNG (reseed after fork)

The security benefits of a forking server having a re-inialized PRNG in
every child process are pretty obvious.  However, due to libraries
having all kinds of internal state, and programs getting compiled with
many different versions of each library, it is unreasonable to expect
calling programs to re-initialize everything manually after fork.

A further complication is the proliferation of clone flags, programs
bypassing glibc's functions to call clone directly, and programs calling
unshare, causing the glibc pthread_atfork hook to not get called.

It would be better to have the kernel take care of this automatically.

The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
MADV_WIPEONFORK.

This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:

    https://man.openbsd.org/minherit.2

[akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Florian Weimer <fweimer@redhat.com>
Reported-by: Colm MacCártaigh <colm@allcosts.net>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:30 -07:00
Daniel Colascione
493b0e9d94 mm: add /proc/pid/smaps_rollup
/proc/pid/smaps_rollup is a new proc file that improves the performance
of user programs that determine aggregate memory statistics (e.g., total
PSS) of a process.

Android regularly "samples" the memory usage of various processes in
order to balance its memory pool sizes.  This sampling process involves
opening /proc/pid/smaps and summing certain fields.  For very large
processes, sampling memory use this way can take several hundred
milliseconds, due mostly to the overhead of the seq_printf calls in
task_mmu.c.

smaps_rollup improves the situation.  It contains most of the fields of
/proc/pid/smaps, but instead of a set of fields for each VMA,
smaps_rollup instead contains one synthetic smaps-format entry
representing the whole process.  In the single smaps_rollup synthetic
entry, each field is the summation of the corresponding field in all of
the real-smaps VMAs.  Using a common format for smaps_rollup and smaps
allows userspace parsers to repurpose parsers meant for use with
non-rollup smaps for smaps_rollup, and it allows userspace to switch
between smaps_rollup and smaps at runtime (say, based on the
availability of smaps_rollup in a given kernel) with minimal fuss.

By using smaps_rollup instead of smaps, a caller can avoid the
significant overhead of formatting, reading, and parsing each of a large
process's potentially very numerous memory mappings.  For sampling
system_server's PSS in Android, we measured a 12x speedup, representing
a savings of several hundred milliseconds.

One alternative to a new per-process proc file would have been including
PSS information in /proc/pid/status.  We considered this option but
thought that PSS would be too expensive (by a few orders of magnitude)
to collect relative to what's already emitted as part of
/proc/pid/status, and slowing every user of /proc/pid/status for the
sake of readers that happen to want PSS feels wrong.

The code itself works by reusing the existing VMA-walking framework we
use for regular smaps generation and keeping the mem_size_stats
structure around between VMA walks instead of using a fresh one for each
VMA.  In this way, summation happens automatically.  We let seq_file
walk over the VMAs just as it does for regular smaps and just emit
nothing to the seq_file until we hit the last VMA.

Benchmarks:

    using smaps:
    iterations:1000 pid:1163 pss:220023808
    0m29.46s real 0m08.28s user 0m20.98s system

    using smaps_rollup:
    iterations:1000 pid:1163 pss:220702720
    0m04.39s real 0m00.03s user 0m04.31s system

We're using the PSS samples we collect asynchronously for
system-management tasks like fine-tuning oom_adj_score, memory use
tracking for debugging, application-level memory-use attribution, and
deciding whether we want to kill large processes during system idle
maintenance windows.  Android has been using PSS for these purposes for
a long time; as the average process VMA count has increased and and
devices become more efficiency-conscious, PSS-collection inefficiency
has started to matter more.  IMHO, it'd be a lot safer to optimize the
existing PSS-collection model, which has been fine-tuned over the years,
instead of changing the memory tracking approach entirely to work around
smaps-generation inefficiency.

Tim said:

: There are two main reasons why Android gathers PSS information:
:
: 1. Android devices can show the user the amount of memory used per
:    application via the settings app.  This is a less important use case.
:
: 2. We log PSS to help identify leaks in applications.  We have found
:    an enormous number of bugs (in the Android platform, in Google's own
:    apps, and in third-party applications) using this data.
:
: To do this, system_server (the main process in Android userspace) will
: sample the PSS of a process three seconds after it changes state (for
: example, app is launched and becomes the foreground application) and about
: every ten minutes after that.  The net result is that PSS collection is
: regularly running on at least one process in the system (usually a few
: times a minute while the screen is on, less when screen is off due to
: suspend).  PSS of a process is an incredibly useful stat to track, and we
: aren't going to get rid of it.  We've looked at some very hacky approaches
: using RSS ("take the RSS of the target process, subtract the RSS of the
: zygote process that is the parent of all Android apps") to reduce the
: accounting time, but it regularly overestimated the memory used by 20+
: percent.  Accordingly, I don't think that there's a good alternative to
: using PSS.
:
: We started looking into PSS collection performance after we noticed random
: frequency spikes while a phone's screen was off; occasionally, one of the
: CPU clusters would ramp to a high frequency because there was 200-300ms of
: constant CPU work from a single thread in the main Android userspace
: process.  The work causing the spike (which is reasonable governor
: behavior given the amount of CPU time needed) was always PSS collection.
: As a result, Android is burning more power than we should be on PSS
: collection.
:
: The other issue (and why I'm less sure about improving smaps as a
: long-term solution) is that the number of VMAs per process has increased
: significantly from release to release.  After trying to figure out why we
: were seeing these 200-300ms PSS collection times on Android O but had not
: noticed it in previous versions, we found that the number of VMAs in the
: main system process increased by 50% from Android N to Android O (from
: ~1800 to ~2700) and varying increases in every userspace process.  Android
: M to N also had an increase in the number of VMAs, although not as much.
: I'm not sure why this is increasing so much over time, but thinking about
: ASLR and ways to make ASLR better, I expect that this will continue to
: increase going forward.  I would not be surprised if we hit 5000 VMAs on
: the main Android process (system_server) by 2020.
:
: If we assume that the number of VMAs is going to increase over time, then
: doing anything we can do to reduce the overhead of each VMA during PSS
: collection seems like the right way to go, and that means outputting an
: aggregate statistic (to avoid whatever overhead there is per line in
: writing smaps and in reading each line from userspace).

Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.com
Signed-off-by: Daniel Colascione <dancol@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sonny Rao <sonnyrao@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:30 -07:00
Andrea Arcangeli
a36985d31a userfaultfd: provide pid in userfault msg - add feat union
No ABI change, but this will make it more explicit to software that ptid
is only available if requested by passing UFFD_FEATURE_THREAD_ID to
UFFDIO_API.  The fact it's a union will also self document it shouldn't
be taken for granted there's a tpid there.

Link: http://lkml.kernel.org/r/20170802165145.22628-7-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Alexey Perevalov <a.perevalov@samsung.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:29 -07:00
Alexey Perevalov
9d4ac93482 userfaultfd: provide pid in userfault msg
It could be useful for calculating downtime during postcopy live
migration per vCPU.  Side observer or application itself will be
informed about proper task's sleep during userfaultfd processing.

Process's thread id is being provided when user requeste it by setting
UFFD_FEATURE_THREAD_ID bit into uffdio_api.features.

Link: http://lkml.kernel.org/r/20170802165145.22628-6-aarcange@redhat.com
Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:29 -07:00
Prakash Sangappa
2d6d6f5a09 mm: userfaultfd: add feature to request for a signal delivery
In some cases, userfaultfd mechanism should just deliver a SIGBUS signal
to the faulting process, instead of the page-fault event.  Dealing with
page-fault event using a monitor thread can be an overhead in these
cases.  For example applications like the database could use the
signaling mechanism for robustness purpose.

Database uses hugetlbfs for performance reason.  Files on hugetlbfs
filesystem are created and huge pages allocated using fallocate() API.
Pages are deallocated/freed using fallocate() hole punching support.
These files are mmapped and accessed by many processes as shared memory.
The database keeps track of which offsets in the hugetlbfs file have
pages allocated.

Any access to mapped address over holes in the file, which can occur due
to bugs in the application, is considered invalid and expect the process
to simply receive a SIGBUS.  However, currently when a hole in the file
is accessed via the mapped address, kernel/mm attempts to automatically
allocate a page at page fault time, resulting in implicitly filling the
hole in the file.  This may not be the desired behavior for applications
like the database that want to explicitly manage page allocations of
hugetlbfs files.

Using userfaultfd mechanism with this support to get a signal, database
application can prevent pages from being allocated implicitly when
processes access mapped address over holes in the file.

This patch adds UFFD_FEATURE_SIGBUS feature to userfaultfd mechnism to
request for a SIGBUS signal.

See following for previous discussion about the database requirement
leading to this proposal as suggested by Andrea.

http://www.spinics.net/lists/linux-mm/msg129224.html

Link: http://lkml.kernel.org/r/1501552446-748335-2-git-send-email-prakash.sangappa@oracle.com
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:29 -07:00
Michal Hocko
c41f012ade mm: rename global_page_state to global_zone_page_state
global_page_state is error prone as a recent bug report pointed out [1].
It only returns proper values for zone based counters as the enum it
gets suggests.  We already have global_node_page_state so let's rename
global_page_state to global_zone_page_state to be more explicit here.
All existing users seems to be correct:

$ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
      2 NR_BOUNCE
      2 NR_FREE_CMA_PAGES
     11 NR_FREE_PAGES
      1 NR_KERNEL_STACK_KB
      1 NR_MLOCK
      2 NR_PAGETABLE

This patch shouldn't introduce any functional change.

[1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp

Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:29 -07:00
Jeff Layton
de23abd151 fs/sync.c: remove unnecessary NULL f_mapping check in sync_file_range
fsync codepath assumes that f_mapping can never be NULL, but
sync_file_range has a check for that.

Remove the one from sync_file_range as I don't see how you'd ever get a
NULL pointer in here.

Link: http://lkml.kernel.org/r/20170525110509.9434-1-jlayton@redhat.com
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:28 -07:00
Mike Rapoport
ce53e8e6f2 userfaultfd: report UFFDIO_ZEROPAGE as available for shmem VMAs
Now when shmem VMAs can be filled with zero page via userfaultfd we can
report that UFFDIO_ZEROPAGE is available for those VMAs

Link: http://lkml.kernel.org/r/1497939652-16528-7-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:28 -07:00
Jan Kara
397162ffa2 mm: remove nr_pages argument from pagevec_lookup{,_range}()
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages.

Just drop the argument.

Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:27 -07:00
Jan Kara
8338141f0f fs: use pagevec_lookup_range() in page_cache_seek_hole_data()
We want only pages from given range in page_cache_seek_hole_data().  Use
pagevec_lookup_range() instead of pagevec_lookup() and remove
unnecessary code.

Note that the check for getting less pages than desired can be removed
because index gets updated by pagevec_lookup_range().

Link: http://lkml.kernel.org/r/20170726114704.7626-9-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:27 -07:00
Jan Kara
48f2301c07 hugetlbfs: use pagevec_lookup_range() in remove_inode_hugepages()
We want only pages from given range in remove_inode_hugepages().  Use
pagevec_lookup_range() instead of pagevec_lookup().

Link: http://lkml.kernel.org/r/20170726114704.7626-8-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:27 -07:00
Jan Kara
2b85a6171d ext4: use pagevec_lookup_range() in writeback code
Both occurences of pagevec_lookup() actually want only pages from a
given range.  Use pagevec_lookup_range() for the lookup.

Link: http://lkml.kernel.org/r/20170726114704.7626-7-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:27 -07:00
Jan Kara
dec0da7b60 ext4: use pagevec_lookup_range() in ext4_find_unwritten_pgoff()
Use pagevec_lookup_range() in ext4_find_unwritten_pgoff() since we are
interested only in pages in the given range.  Simplify the logic as a
result of not getting pages out of range and index getting automatically
advanced.

Link: http://lkml.kernel.org/r/20170726114704.7626-6-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
Jan Kara
c10f778ddf fs: fix performance regression in clean_bdev_aliases()
Commit e64855c6cf ("fs: Add helper to clean bdev aliases under a bh
and use it") added a wrapper for clean_bdev_aliases() that invalidates
bdev aliases underlying a single buffer head.

However this has caused a performance regression for bonnie++ benchmark
on ext4 filesystem when delayed allocation is turned off (ext3 mode) -
average of 3 runs:

  Hmean SeqOut Char  164787.55 (  0.00%) 107189.06 (-34.95%)
  Hmean SeqOut Block 219883.89 (  0.00%) 168870.32 (-23.20%)

The reason for this regression is that clean_bdev_aliases() is slower
when called for a single block because pagevec_lookup() it uses will end
up iterating through the radix tree until it finds a page (which may
take a while) but we are only interested whether there's a page at a
particular index.

Fix the problem by using pagevec_lookup_range() instead which avoids the
needless iteration.

Fixes: e64855c6cf ("fs: Add helper to clean bdev aliases under a bh and use it")
Link: http://lkml.kernel.org/r/20170726114704.7626-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
Jan Kara
d72dc8a25a mm: make pagevec_lookup() update index
Make pagevec_lookup() (and underlying find_get_pages()) update index to
the next page where iteration should continue.  Most callers want this
and also pagevec_lookup_tag() already does this.

Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
Jan Kara
26b433d0da fscache: remove unused ->now_uncached callback
Patch series "Ranged pagevec lookup", v2.

In this series I make pagevec_lookup() update the index (to be
consistent with pagevec_lookup_tag() and also as a preparation for
ranged lookups), provide ranged variant of pagevec_lookup() and use it
in places where it makes sense.  This not only removes some common code
but is also a measurable performance win for some use cases (see patch
4/10) where radix tree is sparse and searching & grabing of a page after
the end of the range has measurable overhead.

This patch (of 10):

The callback doesn't ever get called.  Remove it.

Link: http://lkml.kernel.org/r/20170726114704.7626-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
Jun Piao
964f14a0d3 ocfs2: clean up some dead code
clean up some unused functions and parameters.

Link: http://lkml.kernel.org/r/598A5E21.2080807@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Jan Kara
01ffb56bc1 ocfs2: make ocfs2_set_acl() static
The function is never called outside of fs/ocfs2/acl.c.

Link: http://lkml.kernel.org/r/20170801141252.19675-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Nicolas Iooss
2f52074d35 dax: initialize variable pfn before using it
dax_pmd_insert_mapping() contains the following code:

        pfn_t pfn;
        if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
            goto fallback;
        /* ... */
    fallback:
      trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);

When the condition in the if statement fails, the function calls
trace_dax_pmd_insert_mapping_fallback() with an uninitialized pfn value.

This issue has been found while building the kernel with clang.  The
compiler reported:

    fs/dax.c:1280:6: error: variable 'pfn' is used uninitialized
    whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
        if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    fs/dax.c:1310:60: note: uninitialized use occurs here
      trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
                                                                     ^~~

Link: http://lkml.kernel.org/r/20170903083000.587-1-nicolas.iooss_linux@m4x.org
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Ross Zwisler
917f34526c dax: use PG_PMD_COLOUR instead of open coding
Use ~PG_PMD_COLOUR in dax_entry_waitqueue() instead of open coding an
equivalent page offset mask.

Link: http://lkml.kernel.org/r/20170822222436.18926-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Slusarz, Marcin" <marcin.slusarz@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Ross Zwisler
a2e050f5a9 dax: explain how read(2)/write(2) addresses are validated
Add a comment explaining how the user addresses provided to read(2) and
write(2) are validated in the DAX I/O path.

We call dax_copy_from_iter() or copy_to_iter() on these without calling
access_ok() first in the DAX code, and there was a concern that the user
might be able to read/write to arbitrary kernel addresses with this
path.

Link: http://lkml.kernel.org/r/20170816173615.10098-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Ross Zwisler
527b19d080 dax: move all DAX radix tree defs to fs/dax.c
Now that we no longer insert struct page pointers in DAX radix trees the
page cache code no longer needs to know anything about DAX exceptional
entries.  Move all the DAX exceptional entry definitions from dax.h to
fs/dax.c.

Link: http://lkml.kernel.org/r/20170724170616.25810-6-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Ross Zwisler
d01ad197ac dax: remove DAX code from page_cache_tree_insert()
Now that we no longer insert struct page pointers in DAX radix trees we
can remove the special casing for DAX in page_cache_tree_insert().

This also allows us to make dax_wake_mapping_entry_waiter() local to
fs/dax.c, removing it from dax.h.

Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Ross Zwisler
91d25ba8a6 dax: use common 4k zero page for dax mmap reads
When servicing mmap() reads from file holes the current DAX code
allocates a page cache page of all zeroes and places the struct page
pointer in the mapping->page_tree radix tree.

This has three major drawbacks:

1) It consumes memory unnecessarily. For every 4k page that is read via
   a DAX mmap() over a hole, we allocate a new page cache page. This
   means that if you read 1GiB worth of pages, you end up using 1GiB of
   zeroed memory. This is easily visible by looking at the overall
   memory consumption of the system or by looking at /proc/[pid]/smaps:

	7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12   /root/dax/data
	Size:            1048576 kB
	Rss:             1048576 kB
	Pss:             1048576 kB
	Shared_Clean:          0 kB
	Shared_Dirty:          0 kB
	Private_Clean:   1048576 kB
	Private_Dirty:         0 kB
	Referenced:      1048576 kB
	Anonymous:             0 kB
	LazyFree:              0 kB
	AnonHugePages:         0 kB
	ShmemPmdMapped:        0 kB
	Shared_Hugetlb:        0 kB
	Private_Hugetlb:       0 kB
	Swap:                  0 kB
	SwapPss:               0 kB
	KernelPageSize:        4 kB
	MMUPageSize:           4 kB
	Locked:                0 kB

2) It is slower than using a common zero page because each page fault
   has more work to do. Instead of just inserting a common zero page we
   have to allocate a page cache page, zero it, and then insert it. Here
   are the average latencies of dax_load_hole() as measured by ftrace on
   a random test box:

    Old method, using zeroed page cache pages:	3.4 us
    New method, using the common 4k zero page:	0.8 us

   This was the average latency over 1 GiB of sequential reads done by
   this simple fio script:

     [global]
     size=1G
     filename=/root/dax/data
     fallocate=none
     [io]
     rw=read
     ioengine=mmap

3) The fact that we had to check for both DAX exceptional entries and
   for page cache pages in the radix tree made the DAX code more
   complex.

Solve these issues by following the lead of the DAX PMD code and using a
common 4k zero page instead.  As with the PMD code we will now insert a
DAX exceptional entry into the radix tree instead of a struct page
pointer which allows us to remove all the special casing in the DAX
code.

Note that we do still pretty aggressively check for regular pages in the
DAX radix tree, especially where we take action based on the bits set in
the page.  If we ever find a regular page in our radix tree now that
most likely means that someone besides DAX is inserting pages (which has
happened lots of times in the past), and we want to find that out early
and fail loudly.

This solution also removes the extra memory consumption.  Here is that
same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
code:

	7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12   /root/dax/data
	Size:            1048576 kB
	Rss:                   0 kB
	Pss:                   0 kB
	Shared_Clean:          0 kB
	Shared_Dirty:          0 kB
	Private_Clean:         0 kB
	Private_Dirty:         0 kB
	Referenced:            0 kB
	Anonymous:             0 kB
	LazyFree:              0 kB
	AnonHugePages:         0 kB
	ShmemPmdMapped:        0 kB
	Shared_Hugetlb:        0 kB
	Private_Hugetlb:       0 kB
	Swap:                  0 kB
	SwapPss:               0 kB
	KernelPageSize:        4 kB
	MMUPageSize:           4 kB
	Locked:                0 kB

Overall system memory consumption is similarly improved.

Another major change is that we remove dax_pfn_mkwrite() from our fault
flow, and instead rely on the page fault itself to make the PTE dirty
and writeable.  The following description from the patch adding the
vm_insert_mixed_mkwrite() call explains this a little more:

   "To be able to use the common 4k zero page in DAX we need to have our
    PTE fault path look more like our PMD fault path where a PTE entry
    can be marked as dirty and writeable as it is first inserted rather
    than waiting for a follow-up dax_pfn_mkwrite() =>
    finish_mkwrite_fault() call.

    Right now we can rely on having a dax_pfn_mkwrite() call because we
    can distinguish between these two cases in do_wp_page():

            case 1: 4k zero page => writable DAX storage
            case 2: read-only DAX storage => writeable DAX storage

    This distinction is made by via vm_normal_page(). vm_normal_page()
    returns false for the common 4k zero page, though, just as it does
    for DAX ptes. Instead of special casing the DAX + 4k zero page case
    we will simplify our DAX PTE page fault sequence so that it matches
    our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
    We will instead use dax_iomap_fault() to handle write-protection
    faults.

    This means that insert_pfn() needs to follow the lead of
    insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
    'mkwrite' is set insert_pfn() will do the work that was previously
    done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"

Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Ross Zwisler
e30331ff05 dax: relocate some dax functions
dax_load_hole() will soon need to call dax_insert_mapping_entry(), so it
needs to be moved lower in dax.c so the definition exists.

dax_wake_mapping_entry_waiter() will soon be removed from dax.h and be
made static to dax.c, so we need to move its definition above all its
callers.

Link: http://lkml.kernel.org/r/20170724170616.25810-3-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Linus Torvalds
aae3dbb477 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Support ipv6 checksum offload in sunvnet driver, from Shannon
    Nelson.

 2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
    Dumazet.

 3) Allow generic XDP to work on virtual devices, from John Fastabend.

 4) Add bpf device maps and XDP_REDIRECT, which can be used to build
    arbitrary switching frameworks using XDP. From John Fastabend.

 5) Remove UFO offloads from the tree, gave us little other than bugs.

 6) Remove the IPSEC flow cache, from Florian Westphal.

 7) Support ipv6 route offload in mlxsw driver.

 8) Support VF representors in bnxt_en, from Sathya Perla.

 9) Add support for forward error correction modes to ethtool, from
    Vidya Sagar Ravipati.

10) Add time filter for packet scheduler action dumping, from Jamal Hadi
    Salim.

11) Extend the zerocopy sendmsg() used by virtio and tap to regular
    sockets via MSG_ZEROCOPY. From Willem de Bruijn.

12) Significantly rework value tracking in the BPF verifier, from Edward
    Cree.

13) Add new jump instructions to eBPF, from Daniel Borkmann.

14) Rework rtnetlink plumbing so that operations can be run without
    taking the RTNL semaphore. From Florian Westphal.

15) Support XDP in tap driver, from Jason Wang.

16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.

17) Add Huawei hinic ethernet driver.

18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
    Delalande.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
  i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
  i40e: avoid NVM acquire deadlock during NVM update
  drivers: net: xgene: Remove return statement from void function
  drivers: net: xgene: Configure tx/rx delay for ACPI
  drivers: net: xgene: Read tx/rx delay for ACPI
  rocker: fix kcalloc parameter order
  rds: Fix non-atomic operation on shared flag variable
  net: sched: don't use GFP_KERNEL under spin lock
  vhost_net: correctly check tx avail during rx busy polling
  net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
  rxrpc: Make service connection lookup always check for retry
  net: stmmac: Delete dead code for MDIO registration
  gianfar: Fix Tx flow control deactivation
  cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
  cxgb4: Fix pause frame count in t4_get_port_stats
  cxgb4: fix memory leak
  tun: rename generic_xdp to skb_xdp
  tun: reserve extra headroom only when XDP is set
  net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
  net: dsa: bcm_sf2: Advertise number of egress queues
  ...
2017-09-06 14:45:08 -07:00
Linus Torvalds
ec3604c7a5 Writeback error handling fixes for v4.14
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZrTy3AAoJEAAOaEEZVoIVaucP/ApBAj2S5wzvlV1u6l8E6ae7
 ZeEEZfcWwzRYlKjZAkTWqj9XvGpDGO5gLq4wsZK2edFAq++/MJF8ZVtN4tdZ1kUZ
 DUvRodtVOrT08Kp9wZXGT7JOFrf6U/6gMcR6p0MuWnHndeKYvlpcFi9NPT4EC9/z
 Zm9V7gtlPdSOha7eaSjUS0+vLERkxqXLBW3Av9QUOBP/lbI3lqIroGKeHDYnVdya
 2P/k5EcRRJMyJP6TqyYxmmJl+UWjJFMLvnlUDBslHnD/u3mIUhw3JLHYBjn5dZRE
 Xjq56IDPoXDUvzlBhtn/Uqyx+/wtwsNsylpmKv6K5G1JfdeuSsPVsCey+A1cqV64
 LpE5896wf9TmnmI9LNyh6vDn925xPSGBiF45UEp5f9aO7jXeY0MaEZ8g+ENqFIDK
 v4gtZdS9FhYHV+/l4qEwYMKrqSbwKEs1r1FT+f4wnABby1ojfdA57ZPlp5PV2Vjp
 szTp88Zkb7cMvZwEnWwxWofcJNmgS7uNahvnQF3IJ4ITsioEkuyYR3K4ZQMaaaV9
 wCp6G0FhXZaK3OI7o9WiDwaO2elp9Hxc8bnqKpiBbHZkY0NLh7/++5VxpeNbTHFy
 AGijQiiKGNNyYqNj93wq9jpVdMNjB0pXrHRxfav8v7MtQ+WfbEoAENF4T7hN7iXn
 UuF6eSWEC5O1UCRUk1A+
 =LLY3
 -----END PGP SIGNATURE-----

Merge tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull writeback error handling updates from Jeff Layton:
 "This pile continues the work from last cycle on better tracking
  writeback errors. In v4.13 we added some basic errseq_t infrastructure
  and converted a few filesystems to use it.

  This set continues refining that infrastructure, adds documentation,
  and converts most of the other filesystems to use it. The main
  exception at this point is the NFS client"

* tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  ecryptfs: convert to file_write_and_wait in ->fsync
  mm: remove optimizations based on i_size in mapping writeback waits
  fs: convert a pile of fsync routines to errseq_t based reporting
  gfs2: convert to errseq_t based writeback error reporting for fsync
  fs: convert sync_file_range to use errseq_t based error-tracking
  mm: add file_fdatawait_range and file_write_and_wait
  fuse: convert to errseq_t based error tracking for fsync
  mm: consolidate dax / non-dax checks for writeback
  Documentation: add some docs for errseq_t
  errseq: rename __errseq_set to errseq_set
2017-09-06 14:11:03 -07:00
Linus Torvalds
066dea8c30 File locking related changes for v4.14
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZrTzyAAoJEAAOaEEZVoIVj8wP/0sOxG+7vEEpe4uj2W52aq9T
 Y39/ZLfRTLm9SqgH61lkN+IyUsvDx+IP1ws2LBhp0IDRD9m40wdILhHZRWXJcRW2
 ApEfmXF+rxnZZ6725ixX9w4Ylab2ZeGmKbzaG4wIjxfddftewZkJvFQcb1LZDfWq
 1N0SF4KWoWN6t26Du5CHmYSj/Sz6YGrWGhF22u3mNfkGL+MmuKbz+kB3W+0q2NUF
 ZjkOIH9WcRiXgSlcHPBLre2EKHqHaNgb0s4Iofd3ZEe50v1NwY/vBMefxuwRdgKS
 kpLhIKIYMawrHn2rpV0jm12qdgCYj+t2kbVIUBDn3unBP2zYA0e/oo5HNIrroVlk
 Q6aGwmW0LN60rpd5qcRuNS1p1h2id2HpxEe98dsski6T8CVnj/nvu7EIxmWM02cf
 g2HeOd7bnl3+uu7SwSTkOVb6G7Kjn+Xufiz/n11mK6fl2jvOyWZZmDqhhjWAYJ8r
 t5mQVWJdEV12+6+A1WSv9DeS3TUgdYPCF8dzDtF+JVn3WEmxYHywH36Y3hKKz+BA
 gFEhnHvlyaVvpXCr8Y5BqNSfEfvZe/YUnmVReHpgBU/U4GJ17iQYk/g2vfmPLmsN
 IZ2OGCrDUc/LfdWc4llRyQBvlGT1KujaT0tbN7xnuWcS2qWdsfX4jDtDUH9E6pvK
 TB6Sw4Ike0ixamG8N8q/
 =VPMU
 -----END PGP SIGNATURE-----

Merge tag 'locks-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking updates from Jeff Layton:
 "This pile just has a few file locking fixes from Ben Coddington. There
  are a couple of cleanup patches + an attempt to bring sanity to the
  l_pid value that is reported back to userland on an F_GETLK request.

  After a few gyrations, he came up with a way for filesystems to
  communicate to the VFS layer code whether the pid should be translated
  according to the namespace or presented as-is to userland"

* tag 'locks-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  locks: restore a warn for leaked locks on close
  fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks
  fs/locks: Use allocation rather than the stack in fcntl_getlk()
2017-09-06 13:43:26 -07:00
Linus Torvalds
c7f396f12f dlm for 4.14
This set includes a bunch of minor code cleanups that
 have accumulated, probably from code analyzers people
 like to run.  There is one nice fix that avoids some
 socket leaks by switching to use sock_create_lite().
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZrtP1AAoJEDgbc8f8gGmqH3YQALZouj0tzatxHfWlGcMHoufL
 M2wmQragG4qOI1w9vNKmo6GctQ+Teqholnt2gRHputxNLzPUXPzNgAR1/O7O8741
 TCB/fhR16KqMMP4bTa2GQ73WVFhohkh8xSvtcWkhCqC+Ti/qx2FN7LZ7Mxn6Muje
 IC7E+Oy2Xr64lUb1CsfpXXel8vs+ujoMIAZiU4P/PgCzYX5FvaFWJ9VCwgYzfIuN
 zj2O1txau5xW2fZmD5GRmgWY/g5wCPcPxwCdZacqrL7yNiU1wsrhYFds0AiGSPJC
 D/wMX9a0GN28L+zW0eLEVI+lIk8f5Az+DOrw5UUFNwDd4ejDWaS0dtMNThYu6VvD
 x6+JZhgZHcj3Df/s4PMZvPkCx+8ZeRGK9RK+jlkEVfO8aIE39gi6mC+EuTJmZe/m
 PAB7O2OG0FTUPoY+t/5wKaz1g6qSHQ2fQZb8rAMoUFWwFJWXp3q7/tlZN4dlwIDI
 2yp9UN09ug3tICcne/gvmJ5x8lVN3Eh6XHkbO1qedsv45SYKdOwPmvyp2XTZooJK
 kg827Z+deRmvTfX3gzEsEO1caabtYDOrZ23RHJxqViNdZbMn3Tifc2pBZCIJHfmu
 4My+Midl38Ch9SUx30ePwjyJ9+Ptsm7KiSOIvrtRZV/1bPEqRP03suxEZvmCA/z4
 elMj1gKykj/GHXZ+cHlX
 =lGzS
 -----END PGP SIGNATURE-----

Merge tag 'dlm-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
 "This set includes a bunch of minor code cleanups that have
  accumulated, probably from code analyzers people like to run. There is
  one nice fix that avoids some socket leaks by switching to use
  sock_create_lite()"

* tag 'dlm-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: use sock_create_lite inside tcp_accept_from_sock
  uapi linux/dlm_netlink.h: include linux/dlmconstants.h
  dlm: avoid double-free on error path in dlm_device_{register,unregister}
  dlm: constify kset_uevent_ops structure
  dlm: print log message when cluster name is not set
  dlm: Delete an unnecessary variable initialisation in dlm_ls_start()
  dlm: Improve a size determination in two functions
  dlm: Use kcalloc() in two functions
  dlm: Use kmalloc_array() in make_member_array()
  dlm: Delete an error message for a failed memory allocation in dlm_recover_waiters_pre()
  dlm: Improve a size determination in dlm_recover_waiters_pre()
  dlm: Use kcalloc() in dlm_scan_waiters()
  dlm: Improve a size determination in table_seq_start()
  dlm: Add spaces for better code readability
  dlm: Replace six seq_puts() calls by seq_putc()
  dlm: Make dismatch error message more clear
  dlm: Fix kernel memory disclosure
2017-09-06 13:39:23 -07:00
Linus Torvalds
be6297e9be Scalability improvements when allocating inodes, and some
miscellaneous bug fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlms4KwACgkQ8vlZVpUN
 gaMsNwf9EDIaB7uMAFIRw8TzszbYKE3K3T412s18zce+kYea6wZPkQavWH/qMYgU
 r8jmXfDi3KZJJBI8fFW4qh36qs/fJoOeQOD3BTHcczEJqaaOiahaxfSylTwezDfw
 fIkEMCfBj6Vyldo0aKrtM4iU07Njj7QmYBtsiJo1kpyAuZ7wuoaiyizCLRb0fhLB
 AnBOs2ur9fQvn954M03tJIKpxFgmbpofM7yMtJYpW9dHCCWe2G+sIdBy/W6vVTxt
 sJIzUGyyEFs9Hr8U8THbuo5XgScFh+NEOkj/cf60t5ZYxwZzJvY7+Oj7BQwxEM1p
 Efcjz2kRLU8qSYHOHxCar0D3MW+oNQ==
 =E7gL
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Scalability improvements when allocating inodes, and some
  miscellaneous bug fixes and cleanups"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: avoid Y2038 overflow in recently_deleted()
  ext4: fix fault handling when mounted with -o dax,ro
  ext4: fix quota inconsistency during orphan cleanup for read-only mounts
  ext4: fix incorrect quotaoff if the quota feature is enabled
  ext4: remove useless test and assignment in strtohash functions
  ext4: backward compatibility support for Lustre ea_inode implementation
  ext4: remove timebomb in ext4_decode_extra_time()
  ext4: use sizeof(*ptr)
  ext4: in ext4_seek_{hole,data}, return -ENXIO for negative offsets
  ext4: reduce lock contention in __ext4_new_inode
  ext4: cleanup goto next group
  ext4: do not unnecessarily allocate buffer in recently_deleted()
2017-09-06 12:59:41 -07:00
Linus Torvalds
5791577963 Updates for 4.14:
- Write unmount record for a ro mount to avoid unnecessary log replay
 - Clean up orphaned inodes when mounting fs readonly
 - Resubmit inode log items when buffer writeback fails to avoid umount hang
 - Fix log recovery corruption problems when log headers wrap around the end
 - Avoid infinite loop searching for free inodes when inode counters are wrong
 - Evict inodes involved with log redo so that we don't leak them later
 - Fix a potential race between reclaim and inode cluster freeing
 - Refactor the inode joining code w.r.t. transaction rolling & deferred ops
 - Fix a bug where the log doesn't properly deal with dirty buffers that
   are about to become ordered buffers
 - Fix the extent swap code to deal with making dirty buffers ordered properly
 - Consolidate page fault handlers
 - Refactor the incore extent manipulation functions to use the iext
   abstractions instead of directly modifying with extent data
 - Disable crashy chattr +/-x until we fix it
 - Don't allow us to set S_DAX for v2 inodes
 - Various cleanups
 - Clarify some documentation
 - Fix a problem where fsync and a log commit race to send the disk a
   flush command, resulting in a small window where power fail data loss
   could occur
 - Simplify some rmap operations in the fcollapse code
 - Fix some use-after-free problems in async writeback
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJZrEAQAAoJEPh/dxk0SrTrxKEP/3y8sLWdy4fUdPpVkwZteXwc
 zGyYaLrmKRc5i6abBNtLCZoRJGfRdvVyPrhQ1q3mt8H//xuURgqgFFyjj3wAdsLf
 sDejIHhdsc8/VcuLLtCW3rEYg58hJ89hW7d1InCP0tvqWmljh9svhzXebtwUvNNF
 /2fHIUXUiAxLbgjv/N2i/smlLl0zdx6C2x1TlJmfwer0UMTAnlmbFWxCqmtUZwSl
 QSuGgn1wo3dkId9aFoNwQmSCFeYcxQlpaInJEzUiVQOA4dbphXHO9Bsx0eOkpDuz
 39waaX0fld8LEfIQGmUQ995UkAwfk/asjgDSApyXdkMayNWhi0KpRl1zXgCb8BbL
 m7vYJhIfJ399+jbNPe1+htn3I16AmpvAai9MNJidFclWwqFEuQEnxZccdtTIAiRv
 XuYiq9hN2NOwlwPUYfrZxfx34fdocRyHmGVs3i7P3/qPWd5Hx6+FpQTOngciS7MN
 6xnM8PbnrLadw3ooMDEKgWsN805BQALiwzDRggoAXG1Pm2SqFnLD/dAR4c7R3nR8
 vvYlfGHnd38aMlW73IALkkGJqZy/bHPFhrbvpjXyIG6SYwCjrWrO0chM0O8MCRrF
 MIW3rM5hYIE8aCkpJ2mxvcQalmSAlSPVKlmgvSK4S1Sz4kcywxskNhch8uNkb5uy
 WUHhrJz+wBjdjrDOU3aL
 =jBdo
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.14-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull XFS updates from Darrick Wong:
 "Here are the changes for xfs for 4.14. Most of these are cleanups and
  fixes for bad behavior, as we're mostly focusing on improving
  reliablity this cycle (read: there's potentially a lot of stuff on the
  horizon for 4.15 so better to spend a few weeks killing other bugs
  now).

  Summary:

   - Write unmount record for a ro mount to avoid unnecessary log replay

   - Clean up orphaned inodes when mounting fs readonly

   - Resubmit inode log items when buffer writeback fails to avoid
     umount hang

   - Fix log recovery corruption problems when log headers wrap around
     the end

   - Avoid infinite loop searching for free inodes when inode counters
     are wrong

   - Evict inodes involved with log redo so that we don't leak them
     later

   - Fix a potential race between reclaim and inode cluster freeing

   - Refactor the inode joining code w.r.t. transaction rolling &
     deferred ops

   - Fix a bug where the log doesn't properly deal with dirty buffers
     that are about to become ordered buffers

   - Fix the extent swap code to deal with making dirty buffers ordered
     properly

   - Consolidate page fault handlers

   - Refactor the incore extent manipulation functions to use the iext
     abstractions instead of directly modifying with extent data

   - Disable crashy chattr +/-x until we fix it

   - Don't allow us to set S_DAX for v2 inodes

   - Various cleanups

   - Clarify some documentation

   - Fix a problem where fsync and a log commit race to send the disk a
     flush command, resulting in a small window where power fail data
     loss could occur

   - Simplify some rmap operations in the fcollapse code

   - Fix some use-after-free problems in async writeback"

* tag 'xfs-4.14-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (44 commits)
  xfs: use kmem_free to free return value of kmem_zalloc
  xfs: open code end_buffer_async_write in xfs_finish_page_writeback
  xfs: don't set v3 xflags for v2 inodes
  xfs: fix compiler warnings
  fsmap: fix documentation of FMR_OF_LAST
  xfs: simplify the rmap code in xfs_bmse_merge
  xfs: remove unused flags arg from xfs_file_iomap_begin_delay
  xfs: fix incorrect log_flushed on fsync
  xfs: disable per-inode DAX flag
  xfs: replace xfs_qm_get_rtblks with a direct call to xfs_bmap_count_leaves
  xfs: rewrite xfs_bmap_count_leaves using xfs_iext_get_extent
  xfs: use xfs_iext_*_extent helpers in xfs_bmap_split_extent_at
  xfs: use xfs_iext_*_extent helpers in xfs_bmap_shift_extents
  xfs: move some code around inside xfs_bmap_shift_extents
  xfs: use xfs_iext_get_extent in xfs_bmap_first_unused
  xfs: switch xfs_bmap_local_to_extents to use xfs_iext_insert
  xfs: add a xfs_iext_update_extent helper
  xfs: consolidate the various page fault handlers
  iomap: return VM_FAULT_* codes from iomap_page_mkwrite
  xfs: relog dirty buffers during swapext bmbt owner change
  ...
2017-09-06 12:19:23 -07:00
Linus Torvalds
77d0ab600a We've got a whopping 29 GFS2 patches for this merge window, mainly
because we held some back from the previous merge window until we
 could get them perfected and well tested. We have a couple patch
 sets, including my patch set for protecting glock gl_object and
 Andreas Gruenbacher's patch set to fix the long-standing shrink-
 slab hang, plus a bunch of assorted bugs and cleanups:
 
 1. I fixed a bug whereby an IO error would lead to a double-brelse.
 2. Andreas Gruenbacher made a minor cleanup to call his relatively
    new function, gfs2_holder_initialized, rather than doing it
    manually. This was just missed by a previous patch set.
 3. Jan Kara fixed a bug whereby the SGID was being cleared when
    inheriting ACLs.
 4. Andreas found a bug and fixed it in his previous patch,
    "Get rid of flush_delayed_work in gfs2_evict_inode". A call to
    flush_delayed_work was deleted from *gfs2_inode_lookup and added
    to gfs2_create_inode.
 5. Wang Xibo found and fixed a list_add call in inode_go_lock
    that specified the parameters in the wrong order.
 6. Coly Li submitted a patch to add the REQ_PRIO to some of GFS2's
    metadata reads that were accidentally missing them.
 7 - 10. I submitted a 4-patch set to protect the glock gl_object
    field. GFS2 was setting and checking gl_object with no locking
    mechanism, so the value was occasionally stomped on, which caused
    file system corruption.
 11. I submitted a small cleanup to function gfs2_clear_rgrpd.
    It was needlessly adding rgrp glocks to the lru list, then pulling
    them back off immediately. The rgrp glocks don't use the lru list
    anyway, so doing so was just a waste of time.
 12. I submitted a patch that checks the GLOF_LRU flag on a glock
    before trying to remove it from the lru_list. This avoids a lot
    of unnecessary spin_lock contention.
 13. I submitted a patch to delete GFS2's debugfs files only after
    we evict all the glocks. Before this patch, GFS2 would delete the
    debugfs files, and if unmount hung waiting for a glock, there was
    no way to debug the problem. Now, if a hang occurs during umount,
    we can examine the debugfs files to figure out why it's hung.
 14. Andreas Gruenbacher submitted a patch to fix some trivial typos.
 15 - 19. Andreas also submitted a five-part patch set to fix the
    longstanding hang involving the slab shrinker: dlm requires
    memory, calls the inode shrinker, which calls gfs2's evict, which
    calls back into DLM before it can evict an inode.
 20. Abhi Das submitted a patch to forcibly flush the active items
    list to relieve memory pressure. This fixes a long-standing bug
    whereby GFS2 was getting hung permanently in balance_dirty_pages.
 21. Thomas Tai submitted a patch to fix a slab corruption problem
    due to a residual pointer left in the lock_dlm lockstruct.
 22. I submitted a patch to withdraw the file system if IO errors
    are encountered while writing to the journals or statfs system
    file which were previously not being sent back up. Before, some
    IO errors were sometimes not be detected for several hours, and
    at recovery time, the journal errors made journal replay
    impossible.
 23. Andreas has a patch to fix an annoying format-truncation compiler
    warning so GFS2 compiles cleanly.
 24. I have a patch that fixes a handful of sparse compiler warnings.
 25. Andreas fixed up an useless gl_object warning caused by an
    earlier patch.
 26. Arvind Yadav added a patch to properly constify our rhashtable
    params declare.
 27. I added a patch to fix a regression caused by the non-recursive
    delete and truncate patch that caused file system blocks to not
    be properly freed.
 28. Ernesto A. Fernández added a patch to fix a place where GFS2
    would send back the wrong return code setting extended attributes.
 29. Ernesto also added a patch to fix a case in which GFS2 was
    improperly setting an inode's i_mode, potentially granting access
    to the wrong users.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJZrMC2AAoJENeLYdPf93o7PIQIAKY4hdC2pMM5tiiIHx5fPAAr
 tjpVuFkDQzyEaTb9sArVLxEdva3ShKERQKoYq/VVxqbAEwPgXbzJFNNil1WTJi1t
 J2gE4wE4G5x1+A7XDzCdPI8KAcF+yX63AaFYlVKyuZSq5w7njIRc1Vk+TFiIexxC
 xb0nP0g9L6Zt114rE8kfi0/GLjTO9vOKM3XsJgG612I3/cs3RUx4gJ+nSUG0bYLA
 qoBIXEJ3SFHw2Zr/LgHZ9QDHnlPVl3bjg03sRQaWZms7XbLegDBYsDSvS1HLZ300
 gjTc0Dgz/6KwzDVJ7cZ/fPNYtIFY58tKs6aqqDTrCncsX9nPjcTAxYkBNWsFyZM=
 =tXJ8
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-4.14.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull GFS2 updates from Bob Peterson:
 "We've got a whopping 29 GFS2 patches for this merge window, mainly
  because we held some back from the previous merge window until we
  could get them perfected and well tested. We have a couple patch sets,
  including my patch set for protecting glock gl_object and Andreas
  Gruenbacher's patch set to fix the long-standing shrink- slab hang,
  plus a bunch of assorted bugs and cleanups.

  Summary:

   - I fixed a bug whereby an IO error would lead to a double-brelse.

   - Andreas Gruenbacher made a minor cleanup to call his relatively new
     function, gfs2_holder_initialized, rather than doing it manually.
     This was just missed by a previous patch set.

   - Jan Kara fixed a bug whereby the SGID was being cleared when
     inheriting ACLs.

   - Andreas found a bug and fixed it in his previous patch, "Get rid of
     flush_delayed_work in gfs2_evict_inode". A call to
     flush_delayed_work was deleted from *gfs2_inode_lookup and added to
     gfs2_create_inode.

   - Wang Xibo found and fixed a list_add call in inode_go_lock that
     specified the parameters in the wrong order.

   - Coly Li submitted a patch to add the REQ_PRIO to some of GFS2's
     metadata reads that were accidentally missing them.

   - I submitted a 4-patch set to protect the glock gl_object field.
     GFS2 was setting and checking gl_object with no locking mechanism,
     so the value was occasionally stomped on, which caused file system
     corruption.

   - I submitted a small cleanup to function gfs2_clear_rgrpd. It was
     needlessly adding rgrp glocks to the lru list, then pulling them
     back off immediately. The rgrp glocks don't use the lru list
     anyway, so doing so was just a waste of time.

   - I submitted a patch that checks the GLOF_LRU flag on a glock before
     trying to remove it from the lru_list. This avoids a lot of
     unnecessary spin_lock contention.

   - I submitted a patch to delete GFS2's debugfs files only after we
     evict all the glocks. Before this patch, GFS2 would delete the
     debugfs files, and if unmount hung waiting for a glock, there was
     no way to debug the problem. Now, if a hang occurs during umount,
     we can examine the debugfs files to figure out why it's hung.

   - Andreas Gruenbacher submitted a patch to fix some trivial typos.

   - Andreas also submitted a five-part patch set to fix the
     longstanding hang involving the slab shrinker: dlm requires memory,
     calls the inode shrinker, which calls gfs2's evict, which calls
     back into DLM before it can evict an inode.

   - Abhi Das submitted a patch to forcibly flush the active items list
     to relieve memory pressure. This fixes a long-standing bug whereby
     GFS2 was getting hung permanently in balance_dirty_pages.

   - Thomas Tai submitted a patch to fix a slab corruption problem due
     to a residual pointer left in the lock_dlm lockstruct.

   - I submitted a patch to withdraw the file system if IO errors are
     encountered while writing to the journals or statfs system file
     which were previously not being sent back up. Before, some IO
     errors were sometimes not be detected for several hours, and at
     recovery time, the journal errors made journal replay impossible.

   - Andreas has a patch to fix an annoying format-truncation compiler
     warning so GFS2 compiles cleanly.

   - I have a patch that fixes a handful of sparse compiler warnings.

   - Andreas fixed up an useless gl_object warning caused by an earlier
     patch.

   - Arvind Yadav added a patch to properly constify our rhashtable
     params declare.

   - I added a patch to fix a regression caused by the non-recursive
     delete and truncate patch that caused file system blocks to not be
     properly freed.

   - Ernesto A. Fernández added a patch to fix a place where GFS2 would
     send back the wrong return code setting extended attributes.

   - Ernesto also added a patch to fix a case in which GFS2 was
     improperly setting an inode's i_mode, potentially granting access
     to the wrong users"

* tag 'gfs2-4.14.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (29 commits)
  gfs2: preserve i_mode if __gfs2_set_acl() fails
  gfs2: don't return ENODATA in __gfs2_xattr_set unless replacing
  GFS2: Fix non-recursive truncate bug
  gfs2: constify rhashtable_params
  GFS2: Fix gl_object warnings
  GFS2: Fix up some sparse warnings
  gfs2: Silence gcc format-truncation warning
  GFS2: Withdraw for IO errors writing to the journal or statfs
  gfs2: fix slab corruption during mounting and umounting gfs file system
  gfs2: forcibly flush ail to relieve memory pressure
  gfs2: Clean up waiting on glocks
  gfs2: Defer deleting inodes under memory pressure
  gfs2: gfs2_evict_inode: Put glocks asynchronously
  gfs2: Get rid of gfs2_set_nlink
  gfs2: gfs2_glock_get: Wait on freeing glocks
  gfs2: Fix trivial typos
  GFS2: Delete debugfs files only after we evict the glocks
  GFS2: Don't waste time locking lru_lock for non-lru glocks
  GFS2: Don't bother trying to add rgrps to the lru list
  GFS2: Clear gl_object when deleting an inode in gfs2_delete_inode
  ...
2017-09-06 11:42:31 -07:00
Yan, Zheng
15b51bd6ba ceph: stop on-going cached readdir if mds revokes FILE_SHARED cap
If directory's FILE_SHARED cap get revoked, dentry in the directory
can get spliced into other directory (Eg, other client move the
dentry into directory B, then we do readdir on directory B). So we
should stop on-going cached readdir. this can be achieved by marking
dir not complete, because __dcache_readdir() checks dir completeness
before emitting each dentry.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:57:00 +02:00
Yan, Zheng
f275635ee0 ceph: wait on writeback after writing snapshot data
In sync mode, writepages() needs to write all dirty pages. But
it can only write dirty pages associated with the oldest snapc.
To write dirty pages associated with next snapc, it needs to wait
until current writes complete.

Without this wait, writepages() keeps looking up dirty pages, but
the found dirty pages are not writeable. It wastes CPU time.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:57:00 +02:00
Yan, Zheng
7e1ee54a07 ceph: fix capsnap dirty pages accounting
writepages_finish() calls ceph_put_wrbuffer_cap_refs() once for
all pages, parameter snapc is set to req->r_snapc. So writepages()
shouldn't write dirty pages associated with different snapc in
one OSD request.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:59 +02:00
Yan, Zheng
2a2d927e35 ceph: ignore wbc->range_{start,end} when write back snapshot data
writepages() needs to write dirty pages to OSD in strict order of
snapshot context. It must first write dirty pages associated with
the oldest snapshot context. In the write range case, dirty pages
in the specified range can be associated with newer snapc. They
are not writeable until we write all dirty pages associated with
the oldest snapc.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:58 +02:00
Yan, Zheng
590e9d9861 ceph: fix "range cyclic" mode writepages
In range cyclic mode, writepages() should first write dirty pages
in range [writeback_index, (pgoff_t)-1], then write pages in range
[0, writeback_index -1]. Besides, if writepages() encounters a page
that beyond EOF, it should restart from the beginning.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:57 +02:00
Yan, Zheng
0e5ecac716 ceph: cleanup local variables in ceph_writepages_start()
Remove two variables and define variables of same type together.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:57 +02:00
Yan, Zheng
0713e5f24b ceph: optimize pagevec iterating in ceph_writepages_start()
ceph_writepages_start() supports writing non-continuous pages.
If it encounters a non-dirty or non-writeable page in pagevec,
it can continue to check the rest pages in pagevec.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:56 +02:00
Yan, Zheng
05455e1177 ceph: make writepage_nounlock() invalidate page that beyonds EOF
Otherwise, the page left in state that page is associated with a
snapc, but (PageDirty(page) || PageWriteback(page)) is false.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:56 +02:00
Yan, Zheng
1f934b00e9 ceph: properly get capsnap's size in get_oldest_context()
capsnap's size is set by __ceph_finish_cap_snap(). If capsnap is under
writing, its size is zero. In this case, get_oldest_context() should
read i_size. Besides, ceph_writepages_start() should re-check capsnap's
size after dirty pages get locked.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:55 +02:00
Yan, Zheng
b072d77466 ceph: remove stale check in ceph_invalidatepage()
Both set_page_dirty and truncate_complete_page should be called
for locked page, they can't race with each other.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:55 +02:00
Yan, Zheng
3ae0bebc49 ceph: queue cap snap only when snap realm's context changes
If we create capsnap when snap realm's context does not change, the
new capsnap's snapc is equal to ci->i_head_snapc. Page writeback code
can't differentiates dirty pages associated with the new capsnap from
dirty pages associated with i_head_snapc.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:54 +02:00
Yan, Zheng
c8fd0d37f8 ceph: handle race between vmtruncate and queuing cap snap
It's possible that we create a cap snap while there is pending
vmtruncate (truncate hasn't been processed by worker thread).
We should truncate dirty pages beyond capsnap->size in that case.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:53 +02:00
Yan, Zheng
fa0aa3b839 ceph: fix message order check in handle_cap_export()
If caps for importer mds exists, but cap id mismatch, client should
have received corresponding import message. Because cap ID does not
change as long as client holds the caps.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:53 +02:00
Yan, Zheng
c858a0709f ceph: fix NULL pointer dereference in ceph_flush_snaps()
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:52 +02:00
Markus Elfring
d37b1d9943 ceph: adjust 36 checks for NULL pointers
The script “checkpatch.pl” pointed information out like the following.

Comparison to NULL could be written ...

Thus fix the affected source code places.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:52 +02:00
Markus Elfring
b529d1b382 ceph: delete an unnecessary return statement in update_dentry_lease()
The script "checkpatch.pl" pointed information out like the following.

WARNING: void function return statements are not generally useful

Thus remove such a statement in the affected function.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:51 +02:00
Markus Elfring
51308806ff ceph: ENOMEM pr_err in __get_or_create_frag() is redundant
Omit an extra message for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:51 +02:00
Luis Henriques
397f238994 ceph: check negative offsets in ceph_llseek()
When a user requests SEEK_HOLE or SEEK_DATA with a negative offset
ceph_llseek should return -ENXIO.  Currently -EINVAL is being returned for
SEEK_DATA and 0 for SEEK_HOLE.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:50 +02:00
Douglas Fuller
06d74376c8 ceph: more accurate statfs
Improve accuracy of statfs reporting for Ceph filesystems comprising
exactly one data pool. In this case, the Ceph monitor can now report
the space usage for the single data pool instead of the global data
for the entire Ceph cluster. Include support for this message in
mon_client and leverage it in ceph/super.

Signed-off-by: Douglas Fuller <dfuller@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:49 +02:00
Yan, Zheng
92776fd2c2 ceph: properly set snap follows for cap reconnect
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:49 +02:00
Yan, Zheng
b178cf4304 ceph: don't use CEPH_OSD_FLAG_ORDERSNAP
Inode can be moved between snap realms. It's possible inode is moved
into a snap realm whose seq number is smaller than old snap realm's.
So there is no guarantee that seq number inode's snap context always
increases.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:48 +02:00
Yan, Zheng
1c0a9c2d97 ceph: include snapc in debug message of write
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:48 +02:00
Yan, Zheng
24d063acc2 ceph: make sure flushsnap messages are sent in proper order
Before sending new flushsnap message, check if there are old
flushsnap messages that need to be re-sent. If there are, re-send
old messages first. This guarantees ordering of flushsnap messages.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:47 +02:00
Yan, Zheng
a5cd74ad38 ceph: fix -EOLDSNAPC handling
Need to drop cap reference before retry. Besides, it's better to
redo file write checks for each retry because we re-lock inode.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:47 +02:00
Yan, Zheng
5d37ca1480 ceph: send LSSNAP request to auth mds of directory inode
Snapdir inode has no capability. __choose_mds() should choose mds
base on capabilities of snapdir's parent inode.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:46 +02:00
Yan, Zheng
8d45b911a9 ceph: don't fill readdir cache for LSSNAP reply
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:45 +02:00
Yan, Zheng
9a86962b35 ceph: cleanup ceph_readdir_prepopulate()
In LSSNAP case, req->r_dentry is already set to snapdir dentry.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:45 +02:00
Jeff Layton
b74fceae73 ceph: use errseq_t for writeback error reporting
Ensure that when writeback errors are marked that we report those to all
file descriptions that were open at the time of the error.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:44 +02:00
Yan, Zheng
95569713af ceph: new cap message flags indicate if there is pending capsnap
These flags tell mds if there is pending capsnap explicitly.
Without this explicit notification, mds can only conclude if
client has pending capsnap. The method mds use is inefficient
and error-prone.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:44 +02:00
Yanhu Cao
3fb99d483e ceph: nuke startsync op
startsync is a no-op, has been for years.  Remove it.

Link: http://tracker.ceph.com/issues/20604
Signed-off-by: Yanhu Cao <gmayyyha@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:43 +02:00
Yan, Zheng
4214fb158c ceph: validate correctness of some mount options
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:42 +02:00
Yan, Zheng
95cca2b44e ceph: limit osd write size
OSD has a configurable limitation of max write size. OSD return
error if write request size is larger than the limitation. For now,
set max write size to CEPH_MSG_MAX_DATA_LEN. It should be small
enough.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:41 +02:00
Yan, Zheng
aa187926b7 ceph: limit osd read size to CEPH_MSG_MAX_DATA_LEN
libceph returns -EIO when read size > CEPH_MSG_MAX_DATA_LEN.

Link: http://tracker.ceph.com/issues/20528
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:56:03 +02:00
Yan, Zheng
2ae409dc6a ceph: remove unused cap_release_safety mount option
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-06 19:43:05 +02:00
Markus Elfring
58a69893a9 lockd: Delete an error message for a failed memory allocation in reclaimer()
Omit an extra message for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-06 12:33:56 -04:00
NeilBrown
03c6f7d64a NFS: remove jiffies field from access cache
This field hasn't been used since commit 57b691819e ("NFS: Cache
access checks more aggressively").

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-06 12:32:37 -04:00
NeilBrown
779eafab06 NFS: flush data when locking a file to ensure cache coherence for mmap.
When a byte range lock (or flock) is taken out on an NFS file, the
validity of the cached data is checked and the inode is marked
NFS_INODE_INVALID_DATA.  However the cached data isn't flushed from
the page cache.

This is sufficient for future read() requests or mmap() requests as
they call nfs_revalidate_mapping() which performs the flush if
necessary.

However an existing mapping is not affected.  Accessing data through
that mapping will continue to return old data even though the inode is
marked NFS_INODE_INVALID_DATA.

This can easily be confirmed using the 'nfs' tool in
  git://github.com/okirch/twopence-nfs.git
and running

   nfs coherence FILENAME
on one client, and
   nfs coherence -r FILENAME
on another client.

It appears that prior to Linux 2.6.0 this worked correctly.

However commit:

http://git.kernel.org/cgit/linux/kernel/git/history/history.git/commit/?id=ca9268fe3ddd075714005adecd4afbd7f9ab87d0

removed the call to inode_invalidate_pages() from nfs_zap_caches().  I
haven't tested this code, but inspection suggests that prior to this
commit, file locking would invalidate all inode pages.

This patch adds a call to nfs_revalidate_mapping() after a
successful SETLK so that invalid data is flushed.  With this patch the
above test passes.  To minimize impact (and possibly avoid a GETATTR
call) this only happens if the mapping might be mapped into
userspace.

Cc: Olaf Kirch <okir@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-06 12:31:15 -04:00
NeilBrown
237f8306c3 NFS: don't expect errors from mempool_alloc().
Commit fbe77c30e9 ("NFS: move rw_mode to nfs_pageio_header")
reintroduced some pointless code that commit 518662e0fc ("NFS: fix
usage of mempools.") had recently removed.

Remove it again.

Cc: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-06 12:31:15 -04:00
Jaegeuk Kim
d4c759ee5f f2fs: use generic terms used for encrypted block management
This patch renames functions regarding to buffer management via META_MAPPING
used for encrypted blocks especially. We can actually use them in generic way.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 20:21:48 -07:00
Jaegeuk Kim
1958593e4f f2fs: introduce f2fs_encrypted_file for clean-up
This patch replaces (f2fs_encrypted_inode() && S_ISREG()) with
f2fs_encrypted_file(), which gives no functional change.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 20:11:29 -07:00
Chuck Lever
eae03e2ac8 nfsd: Incoming xdr_bufs may have content in tail buffer
Since the beginning, svcsock has built a received RPC Call message
by populating the xdr_buf's head, then placing the remaining
message bytes in the xdr_buf's page list. The xdr_buf's tail is
never populated.

This means that an NFSv4 COMPOUND containing an NFS WRITE operation
plus trailing operations has a page list that contains the WRITE
data payload followed by the trailing operations. NFSv4 XDR decoders
will not look in the xdr_buf's tail, ever, because svcsock never put
anything there.

To support transports that can pass the write payload in the
xdr_buf's pagelist and trailing content in the xdr_buf's tail,
introduce logic in READ_BUF that switches to the xdr_buf's tail vec
when the decoder runs out of content in rq_arg.pages.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-09-05 15:15:29 -04:00
J. Bruce Fields
0828170f3d merge nfsd 4.13 bugfixes into nfsd for-4.14 branch 2017-09-05 15:11:47 -04:00
Linus Torvalds
bafb0762cb Char/Misc drivers for 4.14-rc1
Here is the big char/misc driver update for 4.14-rc1.
 
 Lots of different stuff in here, it's been an active development cycle
 for some reason.  Highlights are:
   - updated binder driver, this brings binder up to date with what
     shipped in the Android O release, plus some more changes that
     happened since then that are in the Android development trees.
   - coresight updates and fixes
   - mux driver file renames to be a bit "nicer"
   - intel_th driver updates
   - normal set of hyper-v updates and changes
   - small fpga subsystem and driver updates
   - lots of const code changes all over the driver trees
   - extcon driver updates
   - fmc driver subsystem upadates
   - w1 subsystem minor reworks and new features and drivers added
   - spmi driver updates
 
 Plus a smattering of other minor driver updates and fixes.
 
 All of these have been in linux-next with no reported issues for a
 while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWa1+Ew8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yl26wCgquufNylfhxr65NbJrovduJYzRnUAniCivXg8
 bePIh/JI5WxWoHK+wEbY
 =hYWx
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver updates from Greg KH:
 "Here is the big char/misc driver update for 4.14-rc1.

  Lots of different stuff in here, it's been an active development cycle
  for some reason. Highlights are:

   - updated binder driver, this brings binder up to date with what
     shipped in the Android O release, plus some more changes that
     happened since then that are in the Android development trees.

   - coresight updates and fixes

   - mux driver file renames to be a bit "nicer"

   - intel_th driver updates

   - normal set of hyper-v updates and changes

   - small fpga subsystem and driver updates

   - lots of const code changes all over the driver trees

   - extcon driver updates

   - fmc driver subsystem upadates

   - w1 subsystem minor reworks and new features and drivers added

   - spmi driver updates

  Plus a smattering of other minor driver updates and fixes.

  All of these have been in linux-next with no reported issues for a
  while"

* tag 'char-misc-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (244 commits)
  ANDROID: binder: don't queue async transactions to thread.
  ANDROID: binder: don't enqueue death notifications to thread todo.
  ANDROID: binder: Don't BUG_ON(!spin_is_locked()).
  ANDROID: binder: Add BINDER_GET_NODE_DEBUG_INFO ioctl
  ANDROID: binder: push new transactions to waiting threads.
  ANDROID: binder: remove proc waitqueue
  android: binder: Add page usage in binder stats
  android: binder: fixup crash introduced by moving buffer hdr
  drivers: w1: add hwmon temp support for w1_therm
  drivers: w1: refactor w1_slave_show to make the temp reading functionality separate
  drivers: w1: add hwmon support structures
  eeprom: idt_89hpesx: Support both ACPI and OF probing
  mcb: Fix an error handling path in 'chameleon_parse_cells()'
  MCB: add support for SC31 to mcb-lpc
  mux: make device_type const
  char: virtio: constify attribute_group structures.
  Documentation/ABI: document the nvmem sysfs files
  lkdtm: fix spelling mistake: "incremeted" -> "incremented"
  perf: cs-etm: Fix ETMv4 CONFIGR entry in perf.data file
  nvmem: include linux/err.h from header
  ...
2017-09-05 11:08:17 -07:00
Yunlong Song
2afce76a11 Revert "f2fs: add a new function get_ssr_cost"
This reverts commit b7b7c4cf1c.

se->ckpt_valid_blocks will never be smaller than se->valid_blocks, so just
remove get_ssr_cost.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 10:52:28 -07:00
Arvind Yadav
f62fc9f976 f2fs: constify super_operations
super_operations are not supposed to change at runtime.
"struct super_block" working with super_operations provided
by <linux/fs.h> work with const super_operations. So mark
the non-const structs as const

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 10:50:24 -07:00
Chao Yu
d3238691ed f2fs: fix to wake up all sleeping flusher
In scenario of remount_ro vs flush, after flush_thread exits in
->remount_fs, flusher will only clean up golbal issue_list, but
without waking up flushers waiting on that list, result in hang
related user threads.

In order to fix this issue, this patch enables the flusher to
take charge of issue_flush thread: executes merged flush command,
and wake up all sleeping flushers.

Fixes: 5eba8c5d1f ("f2fs: fix to access nullified flush_cmd_control pointer")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 10:50:23 -07:00
Chao Yu
edd748e6c8 f2fs: avoid race in between atomic_read & atomic_inc
Previously, we will miss merging flush command during fsync due to below
race condition:

Thread A 		Thread B		Thread C
- f2fs_issue_flush
 - atomic_read(&issing_flush)
			- f2fs_issue_flush
			 - atomic_read(&issing_flush)
						- f2fs_issue_flush
						 - atomic_read(&issing_flush)
  - atomic_inc(&issing_flush)
			  - atomic_inc(&issing_flush)
						  - atomic_inc(&issing_flush)
   - submit_flush_wait
			   - submit_flush_wait
						   - submit_flush_wait

It needs to use atomic_inc_return instead to avoid such race.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 10:50:22 -07:00
Chao Yu
025d63a486 f2fs: remove unneeded parameter of change_curseg
allocate_segment_by_default is the only caller of change_curseg passing
@reuse with 'false', but commit 763bfe1bc5 ("f2fs: remove reusing any
prefree segments") removes the calling, after that, @reuse in
change_curseg always be true, so, let's clean up the unneeded parameter.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 10:50:21 -07:00
Chao Yu
11f5020d2f f2fs: update i_flags correctly
f2fs enables hash-indexed directory by default, so we need to tag
FS_INDEX_FL in inode::i_flags during directory creataion, in order
to show correct status of inode in lsattr:

Before:
------------------- /mnt/f2fs/dir/
After:
-----------I------- /mnt/f2fs/dir/

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 10:50:21 -07:00
Jaegeuk Kim
ee60523499 f2fs: don't check inode's checksum if it was dirtied or writebacked
If another thread already made the page dirtied or writebacked, we must avoid
to verify checksum. If we got an error, we need to remove its uptodate as well.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 10:50:11 -07:00
Jaegeuk Kim
a298d57f96 f2fs: don't need to update inode checksum for recovery
This patch fixes "f2fs: support inode checksum".
The recovered inode page will be rewritten with valid checksum.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-05 10:49:37 -07:00
Linus Torvalds
44b1671fae Driver core update for 4.14-rc1
Here is the "big" driver core update for 4.14-rc1.
 
 It's really not all that big, the largest thing here being some firmware
 tests to help ensure that that crazy api is working properly.
 
 There's also a new uevent for when a driver is bound or unbound from a
 device, fixing a hole in the driver model that's been there since the
 very beginning.  Many thanks to Dmitry for being persistent and pointing
 out how wrong I was about this all along :)
 
 Patches for the new uevents are already in the systemd tree, if people
 want to play around with them.
 
 Otherwise just a number of other small api changes and updates here,
 nothing major.  All of these patches have been in linux-next for a
 while with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWa1/IQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yn8jACfdQg+YXGxTExonxnyiWgoDMMSO2gAn1ETOaak
 itLO5ll4b6EQ0r3pU27d
 =pCYl
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core update from Greg KH:
 "Here is the "big" driver core update for 4.14-rc1.

  It's really not all that big, the largest thing here being some
  firmware tests to help ensure that that crazy api is working properly.

  There's also a new uevent for when a driver is bound or unbound from a
  device, fixing a hole in the driver model that's been there since the
  very beginning. Many thanks to Dmitry for being persistent and
  pointing out how wrong I was about this all along :)

  Patches for the new uevents are already in the systemd tree, if people
  want to play around with them.

  Otherwise just a number of other small api changes and updates here,
  nothing major. All of these patches have been in linux-next for a
  while with no reported issues"

* tag 'driver-core-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (28 commits)
  driver core: bus: Fix a potential double free
  Do not disable driver and bus shutdown hook when class shutdown hook is set.
  base: topology: constify attribute_group structures.
  base: Convert to using %pOF instead of full_name
  kernfs: Clarify lockdep name for kn->count
  fbdev: uvesafb: remove DRIVER_ATTR() usage
  xen: xen-pciback: remove DRIVER_ATTR() usage
  driver core: Document struct device:dma_ops
  mod_devicetable: Remove excess description from structured comment
  test_firmware: add batched firmware tests
  firmware: enable a debug print for batched requests
  firmware: define pr_fmt
  firmware: send -EINTR on signal abort on fallback mechanism
  test_firmware: add test case for SIGCHLD on sync fallback
  initcall_debug: add deferred probe times
  Input: axp20x-pek - switch to using devm_device_add_group()
  Input: synaptics_rmi4 - use devm_device_add_group() for attributes in F01
  Input: gpio_keys - use devm_device_add_group() for attributes
  driver core: add devm_device_add_group() and friends
  driver core: add device_{add|remove}_group() helpers
  ...
2017-09-05 10:41:21 -07:00
Colin Ian King
aed9eb1b21 ext4: fix null pointer dereference on sbi
In the case of a kzalloc failure when allocating sbi we end up
with a null pointer dereference on sbi when assigning sbi->s_daxdev.
Fix this by moving the assignment of sbi->s_daxdev to after the
null pointer check of sbi.

Detected by CoverityScan CID#1455379 ("Dereference before null check")

Fixes: 5e405595e5 ("ext4: perform dax_device lookup at mount")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-09-05 10:02:08 -07:00
Mauro Carvalho Chehab
4cd7d6c957 media: get rid of removed DMX_GET_CAPS and DMX_SET_SOURCE leftovers
Those two ioctls were never used within the Kernel. Still, there
used to have compat32 code there (and an if #0 block at the core).

Get rid of them.

Fixes: 286fe1ca3f ("media: dmx.h: get rid of DMX_GET_CAPS")
Fixes: 13adefbe9e ("media: dmx.h: get rid of DMX_SET_SOURCE")
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-05 08:25:07 -04:00
Miklos Szeredi
7c6893e3c9 ovl: don't allow writing ioctl on lower layer
Problem with ioctl() is that it's a file operation, yet often used as an
inode operation (i.e. modify the inode despite the file being opened for
read-only).

mnt_want_write_file() is used by filesystems in such cases to get write
access on an arbitrary open file.

Since overlayfs lets filesystems do all file operations, including ioctl,
this can lead to mnt_want_write_file() returning OK for a lower file and
modification of that lower file.

This patch prevents modification by checking if the file is from an
overlayfs lower layer and returning EPERM in that case.

Need to introduce a mnt_want_write_file_path() variant that still does the
old thing for inode operations that can do the copy up + modification
correctly in such cases (fchown, fsetxattr, fremovexattr).

This does not address the correctness of such ioctls on overlayfs (the
correct way would be to copy up and attempt to perform ioctl on upper
file).

In theory this could be a regression.  We very much hope that nobody is
relying on such a hack in any sane setup.

While this patch meddles in VFS code, it has no effect on non-overlayfs
filesystems.

Reported-by: "zhangyi (F)" <yi.zhang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-09-05 12:53:12 +02:00
Miklos Szeredi
cd91304e71 ovl: fix relatime for directories
Need to treat non-regular overlayfs files the same as regular files when
checking for an atime update.

Add a d_real() flag to make it return the upper dentry for all file types.

Reported-by: "zhangyi (F)" <yi.zhang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-09-05 12:53:11 +02:00
Samuel Cabrero
76e752701a cifs: Check for timeout on Negotiate stage
Some servers seem to accept connections while booting but never send
the SMBNegotiate response neither close the connection, causing all
processes accessing the share hang on uninterruptible sleep state.

This happens when the cifs_demultiplex_thread detects the server is
unresponsive so releases the socket and start trying to reconnect.
At some point, the faulty server will accept the socket and the TCP
status will be set to NeedNegotiate. The first issued command accessing
the share will start the negotiation (pid 5828 below), but the response
will never arrive so other commands will be blocked waiting on the mutex
(pid 55352).

This patch checks for unresponsive servers also on the negotiate stage
releasing the socket and reconnecting if the response is not received
and checking again the tcp state when the mutex is acquired.

PID: 55352  TASK: ffff880fd6cc02c0  CPU: 0   COMMAND: "ls"
 #0 [ffff880fd9add9f0] schedule at ffffffff81467eb9
 #1 [ffff880fd9addb38] __mutex_lock_slowpath at ffffffff81468fe0
 #2 [ffff880fd9addba8] mutex_lock at ffffffff81468b1a
 #3 [ffff880fd9addbc0] cifs_reconnect_tcon at ffffffffa042f905 [cifs]
 #4 [ffff880fd9addc60] smb_init at ffffffffa042faeb [cifs]
 #5 [ffff880fd9addca0] CIFSSMBQPathInfo at ffffffffa04360b5 [cifs]
 ....

Which is waiting a mutex owned by:

PID: 5828   TASK: ffff880fcc55e400  CPU: 0   COMMAND: "xxxx"
 #0 [ffff880fbfdc19b8] schedule at ffffffff81467eb9
 #1 [ffff880fbfdc1b00] wait_for_response at ffffffffa044f96d [cifs]
 #2 [ffff880fbfdc1b60] SendReceive at ffffffffa04505ce [cifs]
 #3 [ffff880fbfdc1bb0] CIFSSMBNegotiate at ffffffffa0438d79 [cifs]
 #4 [ffff880fbfdc1c50] cifs_negotiate_protocol at ffffffffa043b383 [cifs]
 #5 [ffff880fbfdc1c80] cifs_reconnect_tcon at ffffffffa042f911 [cifs]
 #6 [ffff880fbfdc1d20] smb_init at ffffffffa042faeb [cifs]
 #7 [ffff880fbfdc1d60] CIFSSMBQFSInfo at ffffffffa0434eb0 [cifs]
 ....

Signed-off-by: Samuel Cabrero <scabrero@suse.de>
Reviewed-by: Aurélien Aptel <aaptel@suse.de>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-09-04 20:55:29 -05:00
Christoph Hellwig
9725d4cef6 fs: unexport vfs_readv and vfs_writev
We've got no modular users left, and any potential modular user is better
of with iov_iter based variants.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:05:16 -04:00
Christoph Hellwig
bd8df82be6 fs: unexport vfs_read and vfs_write
No modular users left.  Given that they take user pointers there is no
good reason to export it to drivers to start with.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:05:16 -04:00
Christoph Hellwig
eb031849d5 fs: unexport __vfs_read/__vfs_write
No modular users left, and any new ones should use kernel_read/write
or iov_iter variants instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:05:16 -04:00
Christoph Hellwig
8e93157bdd btrfs: switch write_buf to kernel_write
Instead of playing with the addressing limits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:05:16 -04:00
Christoph Hellwig
73e18f7c0b fs: make the buf argument to __kernel_write a void pointer
This matches kernel_read and kernel_write and avoids any need for casts in
the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:05:15 -04:00
Christoph Hellwig
e13ec939e9 fs: fix kernel_write prototype
Make the position an in/out argument like all the other read/write
helpers and and make the buf argument a void pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:05:15 -04:00
Christoph Hellwig
bdd1d2d3d2 fs: fix kernel_read prototype
Use proper ssize_t and size_t types for the return value and count
argument, move the offset last and make it an in/out argument like
all other read/write helpers, and make the buf argument a void pointer
to get rid of lots of casts in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:05:15 -04:00
Christoph Hellwig
c41fbad015 fs: move kernel_read to fs/read_write.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:05:15 -04:00
Christoph Hellwig
ac452acae1 fs: move kernel_write to fs/read_write.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:05:15 -04:00
Christoph Hellwig
317d5a5f0f autofs4: switch autofs4_write to __kernel_write
Instead of playing games with the address limit..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:05:15 -04:00
Christoph Hellwig
c35fc7a5ab block_dev: support RFW_NOWAIT on block device nodes
All support is already there in the generic code, we just need to wire
it up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:04:23 -04:00
Christoph Hellwig
91f9943e1c fs: support RWF_NOWAIT for buffered reads
This is based on the old idea and code from Milosz Tanski.  With the aio
nowait code it becomes mostly trivial now.  Buffered writes continue to
return -EOPNOTSUPP if RWF_NOWAIT is passed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-04 19:04:23 -04:00
Miklos Szeredi
495e642939 vfs: add flags to d_real()
Add a separate flags argument (in addition to the open flags) to control
the behavior of d_real().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-09-04 21:42:22 +02:00
Ronnie Sahlberg
5517554e43 cifs: Add support for writing attributes on SMB2+
This adds support for writing extended attributes on SMB2+ shares.
Attributes can be written using the setfattr command.

RH-bz: 1110709

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-09-04 14:03:45 -05:00
Ronnie Sahlberg
95907fea4f cifs: Add support for reading attributes on SMB2+
SMB1 already has support to read attributes. This adds similar support
to SMB2+.

With this patch, tools such as 'getfattr' will now work with SMB2+ shares.

RH-bz: 1110709

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-09-04 14:03:41 -05:00
Linus Torvalds
5f82e71a00 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:

 - Add 'cross-release' support to lockdep, which allows APIs like
   completions, where it's not the 'owner' who releases the lock, to be
   tracked. It's all activated automatically under
   CONFIG_PROVE_LOCKING=y.

 - Clean up (restructure) the x86 atomics op implementation to be more
   readable, in preparation of KASAN annotations. (Dmitry Vyukov)

 - Fix static keys (Paolo Bonzini)

 - Add killable versions of down_read() et al (Kirill Tkhai)

 - Rework and fix jump_label locking (Marc Zyngier, Paolo Bonzini)

 - Rework (and fix) tlb_flush_pending() barriers (Peter Zijlstra)

 - Remove smp_mb__before_spinlock() and convert its usages, introduce
   smp_mb__after_spinlock() (Peter Zijlstra)

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
  locking/lockdep/selftests: Fix mixed read-write ABBA tests
  sched/completion: Avoid unnecessary stack allocation for COMPLETION_INITIALIZER_ONSTACK()
  acpi/nfit: Fix COMPLETION_INITIALIZER_ONSTACK() abuse
  locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures
  smp: Avoid using two cache lines for struct call_single_data
  locking/lockdep: Untangle xhlock history save/restore from task independence
  locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being
  futex: Remove duplicated code and fix undefined behaviour
  Documentation/locking/atomic: Finish the document...
  locking/lockdep: Fix workqueue crossrelease annotation
  workqueue/lockdep: 'Fix' flush_work() annotation
  locking/lockdep/selftests: Add mixed read-write ABBA tests
  mm, locking/barriers: Clarify tlb_flush_pending() barriers
  locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS truly non-interactive
  locking/lockdep: Explicitly initialize wq_barrier::done::map
  locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS
  locking/lockdep: Reword title of LOCKDEP_CROSSRELEASE config
  locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE part of CONFIG_PROVE_LOCKING
  locking/refcounts, x86/asm: Implement fast refcount overflow protection
  locking/lockdep: Fix the rollback and overwrite detection logic in crossrelease
  ...
2017-09-04 11:52:29 -07:00
Linus Torvalds
f213a6c84c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - fix affine wakeups (Peter Zijlstra)

   - improve CPU onlining (and general bootup) scalability on systems
     with ridiculous number (thousands) of CPUs (Peter Zijlstra)

   - sched/numa updates (Rik van Riel)

   - sched/deadline updates (Byungchul Park)

   - sched/cpufreq enhancements and related cleanups (Viresh Kumar)

   - sched/debug enhancements (Xie XiuQi)

   - various fixes"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  sched/debug: Optimize sched_domain sysctl generation
  sched/topology: Avoid pointless rebuild
  sched/topology, cpuset: Avoid spurious/wrong domain rebuilds
  sched/topology: Improve comments
  sched/topology: Fix memory leak in __sdt_alloc()
  sched/completion: Document that reinit_completion() must be called after complete_all()
  sched/autogroup: Fix error reporting printk text in autogroup_create()
  sched/fair: Fix wake_affine() for !NUMA_BALANCING
  sched/debug: Intruduce task_state_to_char() helper function
  sched/debug: Show task state in /proc/sched_debug
  sched/debug: Use task_pid_nr_ns in /proc/$pid/sched
  sched/core: Remove unnecessary initialization init_idle_bootup_task()
  sched/deadline: Change return value of cpudl_find()
  sched/deadline: Make find_later_rq() choose a closer CPU in topology
  sched/numa: Scale scan period with tasks in group and shared/private
  sched/numa: Slow down scan rate if shared faults dominate
  sched/pelt: Fix false running accounting
  sched: Mark pick_next_task_dl() and build_sched_domain() as static
  sched/cpupri: Don't re-initialize 'struct cpupri'
  sched/deadline: Don't re-initialize 'struct cpudl'
  ...
2017-09-04 09:10:24 -07:00
Miklos Szeredi
191a3980c6 ovl: cleanup d_real for negative
d_real() is never called with a negative dentry.  So remove the
d_is_negative() check (which would never trigger anyway, since d_is_reg()
returns false for a negative dentry).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-09-04 16:44:42 +02:00
Ingo Molnar
edc2988c54 Merge branch 'linus' into locking/core, to fix up conflicts
Conflicts:
	mm/page_alloc.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-04 11:01:18 +02:00
Deepa Dinamani
aaed2dd8a3 utimes: Make utimes y2038 safe
struct timespec is not y2038 safe on 32 bit machines.
Replace timespec with y2038 safe struct timespec64.

Note that the patch only changes the internals without
modifying the syscall interfaces. This will be part
of a separate series.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-03 20:24:30 -04:00
Linus Torvalds
69c0067aa3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc fixes from Al Viro:
 "Loose ends and regressions from the last merge window.

  Strictly speaking, only binfmt_flat thing is a build regression per
  se - the rest is 'only sparse cares about that' stuff"

[ This came in before the 4.13 release and could have gone there, but it
  was late in the release and nothing seemed critical enough to care, so
  I'm pulling it in the 4.14 merge window instead  - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  binfmt_flat: fix arch/m32r and arch/microblaze flat_put_addr_at_rp()
  compat_hdio_ioctl: Fix a declaration
  <linux/uaccess.h>: Fix copy_in_user() declaration
  annotate RWF_... flags
  teach SYSCALL_DEFINE/COMPAT_SYSCALL_DEFINE to handle __bitwise arguments
2017-09-03 16:09:03 -07:00
Pan Bian
6c370590cf xfs: use kmem_free to free return value of kmem_zalloc
In function xfs_test_remount_options(), kfree() is used to free memory
allocated by kmem_zalloc(). But it is better to use kmem_free().

Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-03 10:40:46 -07:00
Christoph Hellwig
8353a814f2 xfs: open code end_buffer_async_write in xfs_finish_page_writeback
Our loop in xfs_finish_page_writeback, which iterates over all buffer
heads in a page and then calls end_buffer_async_write, which also
iterates over all buffers in the page to check if any I/O is in flight
is not only inefficient, but also potentially dangerous as
end_buffer_async_write can cause the page and all buffers to be freed.

Replace it with a single loop that does the work of end_buffer_async_write
on a per-page basis.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-03 10:40:45 -07:00
Christoph Hellwig
dd60687ee5 xfs: don't set v3 xflags for v2 inodes
Reject attempts to set XFLAGS that correspond to di_flags2 inode flags
if the inode isn't a v3 inode, because di_flags2 only exists on v3.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-02 08:22:19 -07:00
Darrick J. Wong
7bf7a193a9 xfs: fix compiler warnings
Fix up all the compiler warnings that have crept in.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-09-02 08:22:19 -07:00
Linus Torvalds
d0d6ab53c9 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs version warning fix from Steve French:
 "As requested, additional kernel warning messages to clarify the
  default dialect changes"

[ There is still some discussion about exactly which version should be
  the new default.  Longer-term we have auto-negotiation coming, but
  that's not there yet..  - Linus ]

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  Fix warning messages when mounting to older servers
2017-09-01 20:57:27 -07:00
David S. Miller
6026e043d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 17:42:05 -07:00
Darrick J. Wong
4cc1ee5e65 xfs: simplify the rmap code in xfs_bmse_merge
In Christoph's patch to refactor xfs_bmse_merge, the updated rmap code
does more work than it needs to (because map-extent auto-merges
records).  Remove the unnecessary unmap and save ourselves a deferred
op.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-09-01 13:08:26 -07:00
Eric Sandeen
f91fb956f2 xfs: remove unused flags arg from xfs_file_iomap_begin_delay
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:26 -07:00
Amir Goldstein
47c7d0b195 xfs: fix incorrect log_flushed on fsync
When calling into _xfs_log_force{,_lsn}() with a pointer
to log_flushed variable, log_flushed will be set to 1 if:
1. xlog_sync() is called to flush the active log buffer
AND/OR
2. xlog_wait() is called to wait on a syncing log buffers

xfs_file_fsync() checks the value of log_flushed after
_xfs_log_force_lsn() call to optimize away an explicit
PREFLUSH request to the data block device after writing
out all the file's pages to disk.

This optimization is incorrect in the following sequence of events:

 Task A                    Task B
 -------------------------------------------------------
 xfs_file_fsync()
   _xfs_log_force_lsn()
     xlog_sync()
        [submit PREFLUSH]
                           xfs_file_fsync()
                             file_write_and_wait_range()
                               [submit WRITE X]
                               [endio  WRITE X]
                             _xfs_log_force_lsn()
                               xlog_wait()
        [endio  PREFLUSH]

The write X is not guarantied to be on persistent storage
when PREFLUSH request in completed, because write A was submitted
after the PREFLUSH request, but xfs_file_fsync() of task A will
be notified of log_flushed=1 and will skip explicit flush.

If the system crashes after fsync of task A, write X may not be
present on disk after reboot.

This bug was discovered and demonstrated using Josef Bacik's
dm-log-writes target, which can be used to record block io operations
and then replay a subset of these operations onto the target device.
The test goes something like this:
- Use fsx to execute ops of a file and record ops on log device
- Every now and then fsync the file, store md5 of file and mark
  the location in the log
- Then replay log onto device for each mark, mount fs and compare
  md5 of file to stored value

Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <jbacik@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:26 -07:00
Christoph Hellwig
742d842907 xfs: disable per-inode DAX flag
Currently flag switching can be used to easily crash the kernel.  Disable
the per-inode DAX flag until that is sorted out.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:26 -07:00
Christoph Hellwig
8bfadd8d03 xfs: replace xfs_qm_get_rtblks with a direct call to xfs_bmap_count_leaves
Use the existing functionality instead of directly poking into the extent
list.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:26 -07:00
Christoph Hellwig
e17a5c6f0e xfs: rewrite xfs_bmap_count_leaves using xfs_iext_get_extent
This avoids poking into the internals of the extent list.  Also return
the number of extents as the return value instead of an additional
by reference argument, and make it available to callers outside of
xfs_bmap_util.c

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:26 -07:00
Christoph Hellwig
4c35445b59 xfs: use xfs_iext_*_extent helpers in xfs_bmap_split_extent_at
This abstracts the function away from details of the low-level extent
list implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:25 -07:00
Christoph Hellwig
4da6b514ea xfs: use xfs_iext_*_extent helpers in xfs_bmap_shift_extents
This abstracts the function away from details of the low-level extent
list implementation.

Note that it seems like the previous implementation of rmap for
the merge case was completely broken, but it no seems appear to
trigger that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:25 -07:00
Christoph Hellwig
05b7c8ab2b xfs: move some code around inside xfs_bmap_shift_extents
For the first right move we need to look up next_fsb.  That means
our last fsb that contains next_fsb must also be the current extent,
so take advantage of that by moving the code around a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 13:08:25 -07:00
Oleg Nesterov
138e4ad67a epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove()
The race was introduced by me in commit 971316f050 ("epoll:
ep_unregister_pollwait() can use the freed pwq->whead").  I did not
realize that nothing can protect eventpoll after ep_poll_callback() sets
->whead = NULL, only whead->lock can save us from the race with
ep_free() or ep_remove().

Move ->whead = NULL to the end of ep_poll_callback() and add the
necessary barriers.

TODO: cleanup the ewake/EPOLLEXCLUSIVE logic, it was confusing even
before this patch.

Hopefully this explains use-after-free reported by syzcaller:

	BUG: KASAN: use-after-free in debug_spin_lock_before
	...
	 _raw_spin_lock_irqsave+0x4a/0x60 kernel/locking/spinlock.c:159
	 ep_poll_callback+0x29f/0xff0 fs/eventpoll.c:1148

this is spin_lock(eventpoll->lock),

	...
	Freed by task 17774:
	...
	 kfree+0xe8/0x2c0 mm/slub.c:3883
	 ep_free+0x22c/0x2a0 fs/eventpoll.c:865

Fixes: 971316f050 ("epoll: ep_unregister_pollwait() can use the freed pwq->whead")
Reported-by: 范龙飞 <long7573@126.com>
Cc: stable@vger.kernel.org
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-01 13:07:35 -07:00
Serge E. Hallyn
8db6c34f1d Introduce v3 namespaced file capabilities
Root in a non-initial user ns cannot be trusted to write a traditional
security.capability xattr.  If it were allowed to do so, then any
unprivileged user on the host could map his own uid to root in a private
namespace, write the xattr, and execute the file with privilege on the
host.

However supporting file capabilities in a user namespace is very
desirable.  Not doing so means that any programs designed to run with
limited privilege must continue to support other methods of gaining and
dropping privilege.  For instance a program installer must detect
whether file capabilities can be assigned, and assign them if so but set
setuid-root otherwise.  The program in turn must know how to drop
partial capabilities, and do so only if setuid-root.

This patch introduces v3 of the security.capability xattr.  It builds a
vfs_ns_cap_data struct by appending a uid_t rootid to struct
vfs_cap_data.  This is the absolute uid_t (that is, the uid_t in user
namespace which mounted the filesystem, usually init_user_ns) of the
root id in whose namespaces the file capabilities may take effect.

When a task asks to write a v2 security.capability xattr, if it is
privileged with respect to the userns which mounted the filesystem, then
nothing should change.  Otherwise, the kernel will transparently rewrite
the xattr as a v3 with the appropriate rootid.  This is done during the
execution of setxattr() to catch user-space-initiated capability writes.
Subsequently, any task executing the file which has the noted kuid as
its root uid, or which is in a descendent user_ns of such a user_ns,
will run the file with capabilities.

Similarly when asking to read file capabilities, a v3 capability will
be presented as v2 if it applies to the caller's namespace.

If a task writes a v3 security.capability, then it can provide a uid for
the xattr so long as the uid is valid in its own user namespace, and it
is privileged with CAP_SETFCAP over its namespace.  The kernel will
translate that rootid to an absolute uid, and write that to disk.  After
this, a task in the writer's namespace will not be able to use those
capabilities (unless rootid was 0), but a task in a namespace where the
given uid is root will.

Only a single security.capability xattr may exist at a time for a given
file.  A task may overwrite an existing xattr so long as it is
privileged over the inode.  Note this is a departure from previous
semantics, which required privilege to remove a security.capability
xattr.  This check can be re-added if deemed useful.

This allows a simple setxattr to work, allows tar/untar to work, and
allows us to tar in one namespace and untar in another while preserving
the capability, without risking leaking privilege into a parent
namespace.

Example using tar:

 $ cp /bin/sleep sleepx
 $ mkdir b1 b2
 $ lxc-usernsexec -m b:0:100000:1 -m b:1:$(id -u):1 -- chown 0:0 b1
 $ lxc-usernsexec -m b:0:100001:1 -m b:1:$(id -u):1 -- chown 0:0 b2
 $ lxc-usernsexec -m b:0:100000:1000 -- tar --xattrs-include=security.capability --xattrs -cf b1/sleepx.tar sleepx
 $ lxc-usernsexec -m b:0:100001:1000 -- tar --xattrs-include=security.capability --xattrs -C b2 -xf b1/sleepx.tar
 $ lxc-usernsexec -m b:0:100001:1000 -- getcap b2/sleepx
   b2/sleepx = cap_sys_admin+ep
 # /opt/ltp/testcases/bin/getv3xattr b2/sleepx
   v3 xattr, rootid is 100001

A patch to linux-test-project adding a new set of tests for this
functionality is in the nsfscaps branch at github.com/hallyn/ltp

Changelog:
   Nov 02 2016: fix invalid check at refuse_fcap_overwrite()
   Nov 07 2016: convert rootid from and to fs user_ns
   (From ebiederm: mar 28 2017)
     commoncap.c: fix typos - s/v4/v3
     get_vfs_caps_from_disk: clarify the fs_ns root access check
     nsfscaps: change the code split for cap_inode_setxattr()
   Apr 09 2017:
       don't return v3 cap for caps owned by current root.
      return a v2 cap for a true v2 cap in non-init ns
   Apr 18 2017:
      . Change the flow of fscap writing to support s_user_ns writing.
      . Remove refuse_fcap_overwrite().  The value of the previous
        xattr doesn't matter.
   Apr 24 2017:
      . incorporate Eric's incremental diff
      . move cap_convert_nscap to setxattr and simplify its usage
   May 8, 2017:
      . fix leaking dentry refcount in cap_inode_getsecurity

Signed-off-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-09-01 14:57:15 -05:00
Linus Torvalds
b8a78bb4d1 ceph fscache page locking fix from Zheng, marked for stable.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJZqYGlAAoJEEp/3jgCEfOLx58H/1jnP79H03/kchVBCGPLCKjs
 E+pgHpb2922EGeYmEUoxfq627SiCODap/jo6JFVpsd+JnmHLZiMzmEzGpDce6fn9
 /YY5u3WNtmnKtyPvl0kzspK0ujaeCuiRyarULXBiHveL2ZQINKus4F9MiZphNnt4
 X4hgo866+esEf6LocuEkMoEGvgN7vk/Q9nDPgD/YoFrhCuwdvLJpBnE65CGbQyk5
 n3g0qlBR+yorDr1stdlSyVUDPkF5FQjhQTqkpi1oPAhsNPKgVPyZzRIEQEA+nI+N
 wTsQ0SMKfST4PNaRNdUuO1xwszYziqqlLZ2KwaaLIDHlElcbQR1S3GUKz6hddJc=
 =oesm
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.13-rc8' of git://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "ceph fscache page locking fix from Zheng, marked for stable"

* tag 'ceph-for-4.13-rc8' of git://github.com/ceph/ceph-client:
  ceph: fix readpage from fscache
2017-09-01 12:46:30 -07:00
Christoph Hellwig
f2285c148c xfs: use xfs_iext_get_extent in xfs_bmap_first_unused
Use the bmap abstraction instead of open-coding bmbt details here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
50bb44c286 xfs: switch xfs_bmap_local_to_extents to use xfs_iext_insert
Use the helper instead of open coding it, to provide a better abstraction
for the scalable extent list work.  This also gets an additional assert
and trace point for free.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
67e4e69cb2 xfs: add a xfs_iext_update_extent helper
This helper is used to update an extent record based on the extent index,
and can be used to provide a level of abstractions between callers that
want to modify in-core extent records and the details of the extent list
implementation.

Also switch all users of the xfs_bmbt_set_all(xfs_iext_get_ext(...))
pattern to this new helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
d522d569d6 xfs: consolidate the various page fault handlers
Add a new __xfs_filemap_fault helper that implements all four page fault
callouts, and make these methods themselves small stubs that set the
correct write_fault flag, and exit early for the non-DAX case for the
hugepage related ones.

Also remove the extra size checking in the pfn_fault path, which is now
handled in the core DAX code.

Life would be so much simpler if we only had one method for all this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
e7647fb491 iomap: return VM_FAULT_* codes from iomap_page_mkwrite
All callers will need the VM_FAULT_* flags, so convert in the helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
2dd3d709fc xfs: relog dirty buffers during swapext bmbt owner change
The owner change bmbt scan that occurs during extent swap operations
does not handle ordered buffer failures. Buffers that cannot be
marked ordered must be physically logged so previously dirty ranges
of the buffer can be relogged in the transaction.

Since the bmbt scan may need to process and potentially log a large
number of blocks, we can't expect to complete this operation in a
single transaction. Update extent swap to use a permanent
transaction with enough log reservation to physically log a buffer.
Update the bmbt scan to physically log any buffers that cannot be
ordered and to terminate the scan with -EAGAIN. On -EAGAIN, the
caller rolls the transaction and restarts the scan. Finally, update
the bmbt scan helper function to skip bmbt blocks that already match
the expected owner so they are not reprocessed after scan restarts.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[darrick: fix the xfs_trans_roll call]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
a5814bceea xfs: disallow marking previously dirty buffers as ordered
Ordered buffers are used in situations where the buffer is not
physically logged but must pass through the transaction/logging
pipeline for a particular transaction. As a result, ordered buffers
are not unpinned and written back until the transaction commits to
the log. Ordered buffers have a strict requirement that the target
buffer must not be currently dirty and resident in the log pipeline
at the time it is marked ordered. If a dirty+ordered buffer is
committed, the buffer is reinserted to the AIL but not physically
relogged at the LSN of the associated checkpoint. The buffer log
item is assigned the LSN of the latest checkpoint and the AIL
effectively releases the previously logged buffer content from the
active log before the buffer has been written back. If the tail
pushes forward and a filesystem crash occurs while in this state, an
inconsistent filesystem could result.

It is currently the caller responsibility to ensure an ordered
buffer is not already dirty from a previous modification. This is
unclear and error prone when not used in situations where it is
guaranteed a buffer has not been previously modified (such as new
metadata allocations).

To facilitate general purpose use of ordered buffers, update
xfs_trans_ordered_buf() to conditionally order the buffer based on
state of the log item and return the status of the result. If the
bli is dirty, do not order the buffer and return false. The caller
must either physically log the buffer (having acquired the
appropriate log reservation) or push it from the AIL to clean it
before it can be marked ordered in the current transaction.

Note that ordered buffers are currently only used in two situations:
1.) inode chunk allocation where previously logged buffers are not
possible and 2.) extent swap which will be updated to handle ordered
buffer failures in a separate patch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
6fb10d6d22 xfs: move bmbt owner change to last step of extent swap
The extent swap operation currently resets bmbt block owners before
the inode forks are swapped. The bmbt buffers are marked as ordered
so they do not have to be physically logged in the transaction.

This use of ordered buffers is not safe as bmbt buffers may have
been previously physically logged. The bmbt owner change algorithm
needs to be updated to physically log buffers that are already dirty
when/if they are encountered. This means that an extent swap will
eventually require multiple rolling transactions to handle large
btrees. In addition, all inode related changes must be logged before
the bmbt owner change scan begins and can roll the transaction for
the first time to preserve fs consistency via log recovery.

In preparation for such fixes to the bmbt owner change algorithm,
refactor the bmbt scan out of the extent fork swap code to the last
operation before the transaction is committed. Update
xfs_swap_extent_forks() to only set the inode log flags when an
owner change scan is necessary. Update xfs_swap_extents() to trigger
the owner change based on the inode log flags. Note that since the
owner change now occurs after the extent fork swap, the inode btrees
must be fixed up with the inode number of the current inode (similar
to log recovery).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
99c794c639 xfs: skip bmbt block ino validation during owner change
Extent swap uses xfs_btree_visit_blocks() to fix up bmbt block
owners on v5 (!rmapbt) filesystems. The bmbt scan uses
xfs_btree_lookup_get_block() to read bmbt blocks which verifies the
current owner of the block against the parent inode of the bmbt.
This works during extent swap because the bmbt owners are updated to
the opposite inode number before the inode extent forks are swapped.

The modified bmbt blocks are marked as ordered buffers which allows
everything to commit in a single transaction. If the transaction
commits to the log and the system crashes such that recovery of the
extent swap is required, log recovery restarts the bmbt scan to fix
up any bmbt blocks that may have not been written back before the
crash. The log recovery bmbt scan occurs after the inode forks have
been swapped, however. This causes the bmbt block owner verification
to fail, leads to log recovery failure and requires xfs_repair to
zap the log to recover.

Define a new invalid inode owner flag to inform the btree block
lookup mechanism that the current inode may be invalid with respect
to the current owner of the bmbt block. Set this flag on the cursor
used for change owner scans to allow this operation to work at
runtime and during log recovery.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Fixes: bb3be7e7c ("xfs: check for bogus values in btree block headers")
Cc: stable@vger.kernel.org
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
8dc518dfa7 xfs: don't log dirty ranges for ordered buffers
Ordered buffers are attached to transactions and pushed through the
logging infrastructure just like normal buffers with the exception
that they are not actually written to the log. Therefore, we don't
need to log dirty ranges of ordered buffers. xfs_trans_log_buf() is
called on ordered buffers to set up all of the dirty state on the
transaction, buffer and log item and prepare the buffer for I/O.

Now that xfs_trans_dirty_buf() is available, call it from
xfs_trans_ordered_buf() so the latter is now mutually exclusive with
xfs_trans_log_buf(). This reflects the implementation of ordered
buffers and helps eliminate confusion over the need to log ranges of
ordered buffers just to set up internal log state.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
9684010d38 xfs: refactor buffer logging into buffer dirtying helper
xfs_trans_log_buf() is responsible for logging the dirty segments of
a buffer along with setting all of the necessary state on the
transaction, buffer, bli, etc., to ensure that the associated items
are marked as dirty and prepared for I/O. We have a couple use cases
that need to to dirty a buffer in a transaction without actually
logging dirty ranges of the buffer.  One existing use case is
ordered buffers, which are currently logged with arbitrary ranges to
accomplish this even though the content of ordered buffers is never
written to the log. Another pending use case is to relog an already
dirty buffer across rolled transactions within the deferred
operations infrastructure. This is required to prevent a held
(XFS_BLI_HOLD) buffer from pinning the tail of the log.

Refactor xfs_trans_log_buf() into a new function that contains all
of the logic responsible to dirty the transaction, lidp, buffer and
bli. This new function can be used in the future for the use cases
outlined above. This patch does not introduce functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
e9385cc6fb xfs: ordered buffer log items are never formatted
Ordered buffers pass through the logging infrastructure without ever
being written to the log. The way this works is that the ordered
buffer status is transferred to the log vector at commit time via
the ->iop_size() callback. In xlog_cil_insert_format_items(),
ordered log vectors bypass ->iop_format() processing altogether.

Therefore it is unnecessary for xfs_buf_item_format() to handle
ordered buffers. Remove the unnecessary logic and assert that an
ordered buffer never reaches this point.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
6453c65d35 xfs: remove unnecessary dirty bli format check for ordered bufs
xfs_buf_item_unlock() historically checked the dirty state of the
buffer by manually checking the buffer log formats for dirty
segments. The introduction of ordered buffers invalidated this check
because ordered buffers have dirty bli's but no dirty (logged)
segments. The check was updated to accommodate ordered buffers by
looking at the bli state first and considering the blf only if the
bli is clean.

This logic is safe but unnecessary. There is no valid case where the
bli is clean yet the blf has dirty segments. The bli is set dirty
whenever the blf is logged (via xfs_trans_log_buf()) and the blf is
cleared in the only place BLI_DIRTY is cleared (xfs_trans_binval()).

Remove the conditional blf dirty checks and replace with an assert
that should catch any discrepencies between bli and blf dirty
states. Refactor the old blf dirty check into a helper function to
be used by the assert.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Brian Foster
a4f6cf6b2b xfs: open-code xfs_buf_item_dirty()
It checks a single flag and has one caller. It probably isn't worth
its own function.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
8ad7c629b1 xfs: remove the ip argument to xfs_defer_finish
And instead require callers to explicitly join the inode using
xfs_defer_ijoin.  Also consolidate the defer error handling in
a few places using a goto label.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
882d8785fb xfs: rename xfs_defer_join to xfs_defer_ijoin
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Christoph Hellwig
411350df14 xfs: refactor xfs_trans_roll
Split xfs_trans_roll into a low-level helper that just rolls the
actual transaction and a new higher level xfs_trans_roll_inode
that takes care of logging and rejoining the inode.  This gets
rid of the NULL inode case, and allows to simplify the special
cases in the deferred operation code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Omar Sandoval
f2e9ad212d xfs: check for race with xfs_reclaim_inode() in xfs_ifree_cluster()
After xfs_ifree_cluster() finds an inode in the radix tree and verifies
that the inode number is what it expected, xfs_reclaim_inode() can swoop
in and free it. xfs_ifree_cluster() will then happily continue working
on the freed inode. Most importantly, it will mark the inode stale,
which will probably be overwritten when the inode slab object is
reallocated, but if it has already been reallocated then we can end up
with an inode spuriously marked stale.

In 8a17d7dded ("xfs: mark reclaimed inodes invalid earlier") we added
a second check to xfs_iflush_cluster() to detect this race, but the
similar RCU lookup in xfs_ifree_cluster() needs the same treatment.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 10:55:30 -07:00
Darrick J. Wong
799ea9e9c5 xfs: evict all inodes involved with log redo item
When we introduced the bmap redo log items, we set MS_ACTIVE on the
mountpoint and XFS_IRECOVERY on the inode to prevent unlinked inodes
from being truncated prematurely during log recovery.  This also had the
effect of putting linked inodes on the lru instead of evicting them.

Unfortunately, we neglected to find all those unreferenced lru inodes
and evict them after finishing log recovery, which means that we leak
them if anything goes wrong in the rest of xfs_mountfs, because the lru
is only cleaned out on unmount.

Therefore, evict unreferenced inodes in the lru list immediately
after clearing MS_ACTIVE.

Fixes: 17c12bcd30 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: viro@ZenIV.linux.org.uk
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-09-01 10:55:30 -07:00
Dan Carpenter
ef13ecbc13 kernfs: checking for IS_ERR() instead of NULL
The kernfs_get_inode() returns NULL on error, it never returns error
pointers.

Fixes: aa81882534 ("kernfs: add exportfs operations")
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-01 08:17:55 -06:00
Boris Brezillon
d1f936d736 This pull request contains the following core changes:
* Fix memory leaks in the core
 * Remove unused NAND locking support
 * Rename nand.h into rawnand.h (preparing support for spi NANDs)
 * Use NAND_MAX_ID_LEN where appropriate
 * Fix support for 20nm Hynix chips
 * Fix support for Samsung and Hynix SLC NANDs
 
 and the following driver changes:
 
 * Various cleanup, improvements and fixes in the qcom driver
 * Fixes for bugs detected by various static code analysis tools
 * Fix mxc ooblayout definition
 * Add a new part_parsers to tmio and sharpsl platform data in order to
   define a custom list of partition parsers
 * Request the reset line in exclusive mode in the sunxi driver
 * Fix a build error in the orion-nand driver when compiled for ARMv4
 * Allow 64-bit mvebu platforms to select the PXA3XX driver
 -----BEGIN PGP SIGNATURE-----
 
 iQJABAABCAAqBQJZpwMZIxxib3Jpcy5icmV6aWxsb25AZnJlZS1lbGVjdHJvbnMu
 Y29tAAoJEGXtNgF+CLcAJEoP/jRnEjPznWT3+ngw6k/rnykkn/wexKV3iyX/6b71
 MQT/ZFuT3HsHnUjyprPvyRWJeKun6XyIH5fk7FlXIei9TaWCt6/UGTKousaPKeR2
 maggGeEjxGVpHJM/jpIYCyjt83zpezBqTupv52XhXxPaU7ROSpuHCd92YcPzIaT5
 tcn8JrI7TGuGlBrBbA2y8ZrPtuug3IKqUfpIiQmoqr0jzQR+AbZKHg0kk5a8piOn
 OguK67uhxeOvq831bGPehCPDbuE0loNi4CssayJ1HrisfS95kH/cqrveapgKsUG/
 fxaHh1i65I0lxa8sgUgeUiU04Zsy1YcgNbCj41AY4AHnjJ0+Qp1cV6KAB/x5/wH9
 ES/fW06how+1BLEeLvOr+rIQ41WeP0qV2H3r/PtkeswKKAV3gSERBXVHmg1E7Yum
 HkmPqzhu+nSk3mP7p3yxpd7EwWQh2xpvVYrfQ5vQbtdfm8Uw9n6S8x+O89ch9wWi
 +KbMWFsmF78nuRWW3WTsaiOFKcTRLBT5RoU2z4i9hCvQT2Pnx3SBhHrIj6xqBI1S
 8MpeDdlXHmRQfZxj+jDqU77JYqVEmy/5it9OjhjMpOqxfCf4K6Nlb75TEdR5Nh/9
 BA1qqTBEslg3UqS8ofGFHGFZWrW3JHf02SYo2zU9IvBninh3HrqHiFhBc3p1xNDE
 7GwD
 =bC1+
 -----END PGP SIGNATURE-----

Merge tag 'nand/for-4.14' of git://git.infradead.org/l2-mtd into mtd/next

From Boris:
"
This pull request contains the following core changes:

* Fix memory leaks in the core
* Remove unused NAND locking support
* Rename nand.h into rawnand.h (preparing support for spi NANDs)
* Use NAND_MAX_ID_LEN where appropriate
* Fix support for 20nm Hynix chips
* Fix support for Samsung and Hynix SLC NANDs

and the following driver changes:

* Various cleanup, improvements and fixes in the qcom driver
* Fixes for bugs detected by various static code analysis tools
* Fix mxc ooblayout definition
* Add a new part_parsers to tmio and sharpsl platform data in order to
  define a custom list of partition parsers
* Request the reset line in exclusive mode in the sunxi driver
* Fix a build error in the orion-nand driver when compiled for ARMv4
* Allow 64-bit mvebu platforms to select the PXA3XX driver
"
2017-09-01 15:34:30 +02:00
Steve French
7e682f766f Fix warning messages when mounting to older servers
When mounting to older servers, such as Windows XP (or even Windows 7),
the limited error messages that can be passed back to user space can
get confusing since the default dialect has changed from SMB1 (CIFS) to
more secure SMB3 dialect. Log additional information when the user chooses
to use the default dialects and when the server does not support the
dialect requested.

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-09-01 00:18:44 -05:00
Linus Torvalds
e89ce1f89f two cifs bug fixes for stable
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQGcBAABAgAGBQJZpxWIAAoJEIosvXAHck9Rl1AL/2wNdEvDvDYZ48IeJN/R0Agb
 nhsXycDshjjUSMghm5IMXvaOitwQL3bcbVeGBDOF4UYoDgJZSuOeOvzYhAJaykiy
 Cs2rwve9Vuw9pliY1D4705g2jduvca6KP4Rzcmf8zeRqG8siQLDlFI8MqYcWiz6d
 F8O9EH+n+Daqf4L5b9IB/3aXZiQLDXKC5HjwJJlPjF6fhkqMbwFLSgTYPmCoM2M2
 oC4jDIPIvM7+p5K8Hs3UvkUaeUoMV1Hx/bba3gGFLK0lnGiD2davZPNBLR46nt6j
 L6l0ujg1pv0kOnTsyfiZp/m43UF5IGeLacBj5DczDrio/8FYKcLCoTIXpWB/dMn/
 NR9kGw/Y9M7jv+qumwrFzMaqv4ztMc2QssFsm37gHEtkEIBPiPTYixRIG/OkVRo7
 u8sZic1DvFot0YalZqpIX6Ss6egiY5CI4q7AWU8XwpkiW+K/RBN8g31Cmcx6scU2
 6/ywpNm2tP5D7AfH13jQh8CQOstnrFH50fUCyXpASA==
 =VgIs
 -----END PGP SIGNATURE-----

Merge tag 'cifs-fixes-for-4.13-rc7-and-stable' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Two cifs bug fixes for stable"

* tag 'cifs-fixes-for-4.13-rc7-and-stable' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: remove endian related sparse warning
  CIFS: Fix maximum SMB2 header size
2017-08-31 18:45:04 -07:00
Linus Torvalds
ea25c43179 Merge branch 'mmu_notifier_fixes'
Merge mmu_notifier fixes from Jérôme Glisse:
 "The invalidate_page callback suffered from 2 pitfalls. First it used
  to happen after page table lock was release and thus a new page might
  have been setup for the virtual address before the call to
  invalidate_page().

  This is in a weird way fixed by commit c7ab0d2fdc ("mm: convert
  try_to_unmap_one() to use page_vma_mapped_walk()") which moved the
  callback under the page table lock. Which also broke several existing
  user of the mmu_notifier API that assumed they could sleep inside this
  callback.

  The second pitfall was invalidate_page being the only callback not
  taking a range of address in respect to invalidation but was giving an
  address and a page. Lot of the callback implementer assumed this could
  never be THP and thus failed to invalidate the appropriate range for
  THP pages.

  By killing this callback we unify the mmu_notifier callback API to
  always take a virtual address range as input.

  There is now two clear API (I am not mentioning the youngess API which
  is seldomly used):

   - invalidate_range_start()/end() callback (which allow you to sleep)

   - invalidate_range() where you can not sleep but happen right after
     page table update under page table lock

  Note that a lot of existing user feels broken in respect to
  range_start/ range_end. Many user only have range_start() callback but
  there is nothing preventing them to undo what was invalidated in their
  range_start() callback after it returns but before any CPU page table
  update take place.

  The code pattern use in kvm or umem odp is an example on how to
  properly avoid such race. In a nutshell use some kind of sequence
  number and active range invalidation counter to block anything that
  might undo what the range_start() callback did.

  If you do not care about keeping fully in sync with CPU page table (ie
  you can live with CPU page table pointing to new different page for a
  given virtual address) then you can take a reference on the pages
  inside the range_start callback and drop it in range_end or when your
  driver is done with those pages.

  Last alternative is to use invalidate_range() if you can do
  invalidation without sleeping as invalidate_range() callback happens
  under the CPU page table spinlock right after the page table is
  updated.

  The first two patches convert existing mmu_notifier_invalidate_page()
  calls to mmu_notifier_invalidate_range() and bracket those call with
  call to mmu_notifier_invalidate_range_start()/end().

  The next ten patches remove existing invalidate_page() callback as it
  can no longer happen.

  Finally the last page remove the invalidate_page() callback completely
  so it can RIP.

  Changes since v1:
   - remove more dead code in kvm (no testing impact)
   - more accurate end address computation (patch 2) in page_mkclean_one
     and try_to_unmap_one
   - added tested-by/reviewed-by gotten so far"

* emailed patches from Jérôme Glisse <jglisse@redhat.com>:
  mm/mmu_notifier: kill invalidate_page
  KVM: update to new mmu_notifier semantic v2
  xen/gntdev: update to new mmu_notifier semantic
  sgi-gru: update to new mmu_notifier semantic
  misc/mic/scif: update to new mmu_notifier semantic
  iommu/intel: update to new mmu_notifier semantic
  iommu/amd: update to new mmu_notifier semantic
  IB/hfi1: update to new mmu_notifier semantic
  IB/umem: update to new mmu_notifier semantic
  drm/amdgpu: update to new mmu_notifier semantic
  powerpc/powernv: update to new mmu_notifier semantic
  mm/rmap: update to new mmu_notifier semantic v2
  dax: update to new mmu_notifier semantic
2017-08-31 17:30:01 -07:00
Dave Kleikamp
c227390c91 jfs should use MAX_LFS_FILESIZE when calculating s_maxbytes
jfs had previously avoided the use of MAX_LFS_FILESIZE because it hadn't
accounted for the whole 32-bit index range on 32-bit systems.  That has
been fixed by commit 0cc3b0ec23 ("Clarify (and fix) MAX_LFS_FILESIZE
macros"), so we can simplify the code now.

Suggested by Andreas Dilger.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: jfs-discussion@lists.sourceforge.net
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-31 17:02:21 -07:00
Jérôme Glisse
a4d1a88525 dax: update to new mmu_notifier semantic
Replace all mmu_notifier_invalidate_page() calls by *_invalidate_range()
and make sure it is bracketed by calls to *_invalidate_range_start()/end().

Note that because we can not presume the pmd value or pte value we have
to assume the worst and unconditionaly report an invalidation as
happening.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Bernhard Held <berny156@gmx.de>
Cc: Adam Borowski <kilobyte@angband.pl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: axie <axie@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-31 16:12:59 -07:00
Yan, Zheng
dd2bc47348 ceph: fix readpage from fscache
ceph_readpage() unlocks page prematurely prematurely in the case
that page is reading from fscache. Caller of readpage expects that
page is uptodate when it get unlocked. So page shoule get locked
by completion callback of fscache_read_or_alloc_pages()

Cc: stable@vger.kernel.org # 4.1+, needs backporting for < 4.7
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-01 00:04:26 +02:00
Christoph Hellwig
ddef7ed2b5 annotate RWF_... flags
[AV: added missing annotations in syscalls.h/compat.h]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-08-31 17:32:38 -04:00
Dan Williams
5e405595e5 ext4: perform dax_device lookup at mount
The ->iomap_begin() operation is a hot path, so cache the
fs_dax_get_by_host() result at mount time to avoid the incurring the
hash lookup overhead on a per-i/o basis.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31 11:12:13 -07:00
Dan Williams
8cf037a8b2 ext2: perform dax_device lookup at mount
The ->iomap_begin() operation is a hot path, so cache the
fs_dax_get_by_host() result at mount time to avoid the incurring the
hash lookup overhead on a per-i/o basis.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31 09:33:30 -07:00
Dan Williams
486aff5e04 xfs: perform dax_device lookup at mount
The ->iomap_begin() operation is a hot path, so cache the
fs_dax_get_by_host() result at mount time to avoid the incurring the
hash lookup overhead on a per-i/o basis.

Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31 09:31:47 -07:00
Andreas Dilger
b5f515735b ext4: avoid Y2038 overflow in recently_deleted()
Avoid a 32-bit time overflow in recently_deleted() since i_dtime
(inode deletion time) is stored only as a 32-bit value on disk.
Since i_dtime isn't used for much beyond a boolean value in e2fsck
and is otherwise only used in this function in the kernel, there is
no benefit to use more space in the inode for this field on disk.

Instead, compare only the relative deletion time with the low
32 bits of the time using the newly-added time_before32() helper,
which is similar to time_before() and time_after() for jiffies.

Increase RECENTCY_DIRTY to 300s based on Ted's comments about
usage experience at Google.

Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
2017-08-31 11:09:45 -04:00
Ernesto A. Fernández
309e8cda59 gfs2: preserve i_mode if __gfs2_set_acl() fails
When changing a file's acl mask, __gfs2_set_acl() will first set the
group bits of i_mode to the value of the mask, and only then set the
actual extended attribute representing the new acl.

If the second part fails (due to lack of space, for example) and the
file had no acl attribute to begin with, the system will from now on
assume that the mask permission bits are actual group permission bits,
potentially granting access to the wrong users.

Prevent this by only changing the inode mode after the acl has been set.

Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-31 07:53:15 -05:00
Ernesto A. Fernández
54aae14bee gfs2: don't return ENODATA in __gfs2_xattr_set unless replacing
The function __gfs2_xattr_set() will return -ENODATA when called to
remove a xattr that does not exist. The result is that setfacl will
show an exit status of 1 when called to set only a file's mode bits
(on a file with no ACLs), despite succeeding. A "No data available"
error will be printed as well.

To fix this return 0 instead, except when the XATTR_REPLACE flag is
set, in which case -ENODATA is appropriate. This is consistent with
how most other xattr setting functions work, in other filesystems.

Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-31 07:43:03 -05:00
Steve French
6e3c1529c3 CIFS: remove endian related sparse warning
Recent patch had an endian warning ie
cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup()

Signed-off-by: Steve French <smfrench@gmail.com>
CC: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-08-30 14:43:11 -05:00
Pavel Shilovsky
9e37b1784f CIFS: Fix maximum SMB2 header size
Currently the maximum size of SMB2/3 header is set incorrectly which
leads to hanging of directory listing operations on encrypted SMB3
connections. Fix this by setting the maximum size to 170 bytes that
is calculated as RFC1002 length field size (4) + transform header
size (52) + SMB2 header size (64) + create response size (56).

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
2017-08-30 14:42:30 -05:00
Bob Peterson
c4a9d1892f GFS2: Fix non-recursive truncate bug
Before this patch if you truncated a file to a smaller size it
wasn't freeing all the blocks properly. There are two reasons.

First, the metapath comparison was not comparing previous heights.
I added a function, mp_eq_to_hgt, which checks the metapath at
all heights prior to the target height.

Second, in function find_nonnull_ptr, it needed to zero out all
pointers for heights following the target height. Translated into
decimal integer terms, this way a number like 299, when incremented,
becomes 300, not 399. The 2 gets incremented to 3, and the following
digits need to be reset.

These two things allow the truncate state machine to properly find
the blocks it needs to delete.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-30 13:29:22 -05:00
Bhumika Goyal
c9ea9df303 fsnotify: make dnotify_fsnotify_ops const
Make this const as it is never modified.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-30 16:02:48 +02:00
Arvind Yadav
d296b15ed5 gfs2: constify rhashtable_params
rhashtable_params are not supposed to change at runtime. All
Functions rhashtable_* working with const rhashtable_params
provided by <linux/rhashtable.h>. So mark the non-const structs
as const.

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-30 08:14:39 -05:00
Andreas Gruenbacher
7023a0b16f GFS2: Fix gl_object warnings
The following cleanup is needed to avoid spilling the syslog with
false warnings.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-30 08:14:27 -05:00
Chao Yu
774e1b78a0 f2fs: trigger fdatasync for non-atomic_write file
Sqlite only cares about synchronization of file data instead of other data
unrelated attribute of inode, so in commit flow, call fdatasync is enough.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29 10:05:42 -07:00
Chao Yu
73ac2f4e82 f2fs: fix to avoid race in between aio and gc
We won't wait DIO synchronously when doing AIO, so there will be potential
IO reorder in between AIO and GC, which will cause data corruption.

This patch adds inode_dio_wait to serialize aio and data GC to avoid this
issue.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29 10:05:42 -07:00
Jaegeuk Kim
01983c715a f2fs: wake up discard_thread iff there is a candidate
This patch fixes to avoid needless wake ups.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29 10:05:33 -07:00
Jaegeuk Kim
adb6dc1971 f2fs: return error when accessing insane flie offset
If file offset is insane, we have to return error instead of kernel panic.

Reported-by: Eric Zhang <followme999@163.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29 10:02:58 -07:00
Chao Yu
0adf6a1b79 f2fs: trigger normal fsync for non-atomic_write file
If file was not opened with atomic write mode, but user uses atomic write
ioctl to fsync datas, in the flow, we should not fsync that file with
atomic write mode.

Fixes: 608514deba ("f2fs: set fsync mark only for the last dnode")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29 10:02:57 -07:00
Chao Yu
84a23fbe96 f2fs: clear FI_HOT_DATA correctly
This patch fixes to clear FI_HOT_DATA correctly in below path:
- error handling in f2fs_ioc_start_atomic_write
- after commit atomic write in f2fs_ioc_commit_atomic_write
- after drop atomic write in drop_inmem_pages

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29 10:02:56 -07:00
Chao Yu
6f890df0a7 f2fs: fix out-of-order execution in f2fs_issue_flush
In f2fs_issue_flush, due to out-of-order execution of CPU, wake_up can
be called before we insert issue_list, result in long latency of
wait_for_completion. Fix this by adding smp_mb() to force the order of
related codes.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29 10:02:55 -07:00
Jaegeuk Kim
5f656541ff f2fs: issue discard commands if gc_urgent is set
It's time to issue all the discard commands, if user sets the idle time.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-29 10:02:47 -07:00
David Howells
e833251ad8 rxrpc: Add notification of end-of-Tx phase
Add a callback to rxrpc_kernel_send_data() so that a kernel service can get
a notification that the AF_RXRPC call has transitioned out the Tx phase and
is now waiting for a reply or a final ACK.

This is called from AF_RXRPC with the call state lock held so the
notification is guaranteed to come before any reply is passed back.

Further, modify the AFS filesystem to make use of this so that we don't have
to change the afs_call state before sending the last bit of data.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-08-29 10:55:20 +01:00
Helge Deller
79de3cbe9a fs/select: Fix memory corruption in compat_get_fd_set()
Commit 464d62421c ("select: switch compat_{get,put}_fd_set() to
compat_{get,put}_bitmap()") changed the calculation on how many bytes
need to be zeroed when userspace handed over a NULL pointer for a fdset
array in the select syscall.

The calculation was changed in compat_get_fd_set() wrongly from
	memset(fdset, 0, ((nr + 1) & ~1)*sizeof(compat_ulong_t));
to
	memset(fdset, 0, ALIGN(nr, BITS_PER_LONG));

The ALIGN(nr, BITS_PER_LONG) calculates the number of _bits_ which need
to be zeroed in the target fdset array (rounded up to the next full bits
for an unsigned long).

But the memset() call expects the number of _bytes_ to be zeroed.

This leads to clearing more memory than wanted (on the stack area or
even at kmalloc()ed memory areas) and to random kernel crashes as we
have seen them on the parisc platform.

The correct change should have been

	memset(fdset, 0, (ALIGN(nr, BITS_PER_LONG) / BITS_PER_LONG) * BYTES_PER_LONG);

which is the same as can be archieved with a call to

	zero_fd_set(nr, fdset).

Fixes: 464d62421c ("select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()"
Acked-by:: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-28 16:09:19 -07:00
Waiman Long
39bf04db6b kernfs: Clarify lockdep name for kn->count
The reference count in kernfs_node structure is treated like a rwsem by
using lockdep instrumentation code. The lockdep name, however, is still
"s_active" which is carried over from the old sysfs code. As s_active
is no longer the variable name, its use may confuse users on where the
lock is when it is reported by lockdep. So it is changed to "kn->count"
which is how this variable is normally referenced in kernfs code.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-28 16:50:15 +02:00
Greg Kroah-Hartman
9749c37275 Merge 4.13-rc7 into char-misc-next
We want the binder fix in here as well for testing and merge issues.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-28 10:19:01 +02:00
Byungchul Park
b9ea557ee9 fput: Don't reinvent the wheel but use existing llist API
Although llist provides proper APIs, they are not used. Make them used.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-08-28 00:50:23 -04:00
Byungchul Park
2978573578 namespace.c: Don't reinvent the wheel but use existing llist API
Although llist provides proper APIs, they are not used. Make them used.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-08-28 00:50:22 -04:00
Linus Torvalds
b3242dba9f Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "6 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm/memblock.c: reversed logic in memblock_discard()
  fork: fix incorrect fput of ->exe_file causing use-after-free
  mm/madvise.c: fix freeing of locked page with MADV_FREE
  dax: fix deadlock due to misaligned PMD faults
  mm, shmem: fix handling /sys/kernel/mm/transparent_hugepage/shmem_enabled
  PM/hibernate: touch NMI watchdog when creating snapshot
2017-08-25 18:02:27 -07:00
Linus Torvalds
105065c3f7 Two nfsd bugfixes, neither 4.13 regressions, but both potentially
serious.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZoFxzAAoJECebzXlCjuG+T+4QAJhvEAPfoqxAJcjpy5Wgal96
 1QmHR1owRyA85MMVHhnVUClzzezECc8uXOxRvRFx+4pCW4PRwY3CRa6H0Acrte0l
 npxWi6CiOkuLTCA+NNVnJAty7zBp2Ag0hYJc2NFwhZJ1cVOcIab6Pc7U6jyoB7Nh
 d10rmB7eYsevZgKaCwxxlieFIkIDrPhIJzku5Zy7PXneITzDKX8kEaIs+JkuJ3xt
 H2w3ERpeeDVDlRd6ffo2OwXKaQkCmMNb64c2YA6yZptOHikuR5ARuvZxbOGveHrM
 uCrxAFgETBIusmBC45W9MmTw4c3GgDcW8/yx09pLWD7UDwsbOLMspXl9usX5sgaq
 Py3HpyPpZjovmfJUCI4UW/RWyo4El5T3IlknHjjg5AfnA3fe15xZVKcmKetVe4k9
 QxWKenwv+0hnOztF5Xotiysw+08aF6rIe3QQ/n6ZMathZAqvaaKsHa5TICL78anO
 F1WqwEKx7c7wg1ZnvV2uAeVsGobHi6Y5LAsyKx3dZMfZmVjqZe4wxGSD5eFAore5
 t4QWDWnLY0t/iPrYpLB1vINXvgD1T6b3rvnMiwm2B+ITMNzNOgLK0vYsNjzsk0uL
 gIOGma2LN7HwtKlsZHZewsR2rsIPcQ4D9FfPZBo1+jSYLzL4ktHWTalFCngwylhe
 y7iV/D+jvrHzrMr9T6rl
 =L3ES
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.13-2' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "Two nfsd bugfixes, neither 4.13 regressions, but both potentially
  serious"

* tag 'nfsd-4.13-2' of git://linux-nfs.org/~bfields/linux:
  net: sunrpc: svcsock: fix NULL-pointer exception
  nfsd: Limit end of page list when decoding NFSv4 WRITE
2017-08-25 17:27:26 -07:00
Linus Torvalds
8c7932a32e some bug fixes for stable for cifs
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQGcBAABAgAGBQJZn2HQAAoJEIosvXAHck9RXMAL/iVeR4DjmXLwGQtOIQUzj0pv
 0JRubkh8/ud5VvfznjDvy0bBl/jodCK6N2wU7iqBhJUYW5Tc/TLaRt6MZ2KT4pLo
 PrD64hdjEtxkU5si+LOVLU11KndEIIQUV5+Mh9Zqj51DTHsyXJHPi/98HjNJm5Gq
 pXfUk+4eq229Pqq1JuPtfPaNHH/fZCODLf82vDQZedlaZhzHgXtDg6iQM0SalNhg
 iQSAWvmFr5lHlMs5/QMkhurvSaS38GXd+npWUGlJmFymlQbpqzpPGdYMgjnzLxDC
 Jw/Uowzo136CWSkSQV2DudKveNfIrVDYGgb97NgtZxsXYlBuJu4rCJvpLOsm6zap
 ZRnSReRvEIr6/TvMJ2wnRioz0JkbpPz8gMg7EUzfaexZtuAHXx6bguf2RjrnLJiH
 jhV+U+1uwTOgJejbvju/KVV6AP9kECyE5tZjuDF8FenfWkboqAYNaxxWVAfZreF5
 wMF0FeJWoGUxwYgRvd8neG1VWB5LQO8rNaQmYNBi7w==
 =MlGX
 -----END PGP SIGNATURE-----

Merge tag 'cifs-fixes-for-4.13-rc6-and-stable' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Some bug fixes for stable for cifs"

* tag 'cifs-fixes-for-4.13-rc6-and-stable' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup()
  cifs: Fix df output for users with quota limits
2017-08-25 17:22:33 -07:00
Bob Peterson
27c3b415f6 GFS2: Fix up some sparse warnings
This patch cleans up various pieces of GFS2 to avoid sparse errors.
This doesn't fix them all, but it fixes several. The first error,
in function glock_hash_walk was a genuine bug where the rhashtable
could be started and not stopped.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-25 18:47:18 -05:00
Ross Zwisler
fffa281b48 dax: fix deadlock due to misaligned PMD faults
In DAX there are two separate places where the 2MiB range of a PMD is
defined.

The first is in the page tables, where a PMD mapping inserted for a
given address spans from (vmf->address & PMD_MASK) to ((vmf->address &
PMD_MASK) + PMD_SIZE - 1).  That is, from the 2MiB boundary below the
address to the 2MiB boundary above the address.

So, for example, a fault at address 3MiB (0x30 0000) falls within the
PMD that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000).

The second PMD range is in the mapping->page_tree, where a given file
offset is covered by a radix tree entry that spans from one 2MiB aligned
file offset to another 2MiB aligned file offset.

So, for example, the file offset for 3MiB (pgoff 768) falls within the
PMD range for the order 9 radix tree entry that ranges from 2MiB (pgoff
512) to 4MiB (pgoff 1024).

This system works so long as the addresses and file offsets for a given
mapping both have the same offsets relative to the start of each PMD.

Consider the case where the starting address for a given file isn't 2MiB
aligned - say our faulting address is 3 MiB (0x30 0000), but that
corresponds to the beginning of our file (pgoff 0).  Now all the PMDs in
the mapping are misaligned so that the 2MiB range defined in the page
tables never matches up with the 2MiB range defined in the radix tree.

The current code notices this case for DAX faults to storage with the
following test in dax_pmd_insert_mapping():

	if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
		goto unlock_fallback;

This test makes sure that the pfn we get from the driver is 2MiB
aligned, and relies on the assumption that the 2MiB alignment of the pfn
we get back from the driver matches the 2MiB alignment of the faulting
address.

However, faults to holes were not checked and we could hit the problem
described above.

This was reported in response to the NVML nvml/src/test/pmempool_sync
TEST5:

	$ cd nvml/src/test/pmempool_sync
	$ make TEST5

You can grab NVML here:

	https://github.com/pmem/nvml/

The dmesg warning you see when you hit this error is:

  WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310

Where we notice in dax_insert_mapping_entry() that the radix tree entry
we are about to replace doesn't match the locked entry that we had
previously inserted into the tree.  This happens because the initial
insertion was done in grab_mapping_entry() using a pgoff calculated from
the faulting address (vmf->address), and the replacement in
dax_pmd_load_hole() => dax_insert_mapping_entry() is done using
vmf->pgoff.

In our failure case those two page offsets (one calculated from
vmf->address, one using vmf->pgoff) point to different order 9 radix
tree entries.

This failure case can result in a deadlock because the radix tree unlock
also happens on the pgoff calculated from vmf->address.  This means that
the locked radix tree entry that we swapped in to the tree in
dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all
future faults to that 2MiB range will block forever.

Fix this by validating that the faulting address's PMD offset matches
the PMD offset from the start of the file.  This check is done at the
very beginning of the fault and covers faults that would have mapped to
storage as well as faults to holes.  I left the COLOUR check in
dax_pmd_insert_mapping() in place in case we ever hit the insanity
condition where the alignment of the pfn we get from the driver doesn't
match the alignment of the userspace address.

Link: http://lkml.kernel.org/r/20170822222436.18926-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: "Slusarz, Marcin" <marcin.slusarz@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-25 16:12:46 -07:00
Andreas Gruenbacher
561b796987 gfs2: Silence gcc format-truncation warning
Enlarge sd_fsname to be big enough for the longest long lock table name
and an arbitrary journal number.  This silences two -Wformat-truncation
warnings with gcc 7.1.1.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-25 10:59:21 -05:00
Bob Peterson
942b0cddfb GFS2: Withdraw for IO errors writing to the journal or statfs
Before this patch, if GFS2 encountered IO errors while writing to
the journal, it would not report the problem, so they would go
unnoticed, sometimes for many hours. Sometimes this would only be
noticed later, when recovery tried to do journal replay and failed
due to invalid metadata at the blocks that resulted in IO errors.

This patch makes GFS2's log daemon check for IO errors. If it
encounters one, it withdraws from the file system and reports
why in dmesg. A similar action is taken when IO errors occur when
writing to the system statfs file.

These errors are also reported back to any callers of fsync, since
that requires the journal to be flushed. Therefore, any IO errors
that would previously go unnoticed are now noticed and the file
system is withdrawn as early as possible, thus preventing further
file system damage.

Also note that this reintroduces superblock variable sd_log_error,
which Christoph removed with commit f729b66fca.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-25 10:59:09 -05:00
Ingo Molnar
3a9ff4fd04 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:07:13 +02:00
Ingo Molnar
10c9850cb2 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:04:51 +02:00
Chuck Lever
afea5657c2 sunrpc: Const-ify struct sv_serv_ops
Close an attack vector by moving the arrays of per-server methods to
read-only memory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24 22:13:50 -04:00
Chuck Lever
c1df609d9d nfsd: Const-ify NFSv4 encoding and decoding ops arrays
Close an attack vector by moving the arrays of encoding and decoding
methods to read-only memory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24 22:13:50 -04:00
J. Bruce Fields
bac966d606 nfsd4: individual encoders no longer see error cases
With a few exceptions, most individual encoders don't handle error
cases.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24 22:12:49 -04:00
J. Bruce Fields
b7571e4cd3 nfsd4: skip encoder in trivial error cases
Most encoders do nothing in the error case.  But they can still screw
things up in that case: most errors happen very early in rpc processing,
possibly before argument fields are filled in and bounds-tested, so
encoders that do anything other than immediately bail on error can
easily crash in odd error cases.

So just handle errors centrally most of the time to remove the chance of
error.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24 22:12:48 -04:00
J. Bruce Fields
34b1744c91 nfsd4: define ->op_release for compound ops
Run a separate ->op_release function if necessary instead of depending
on the xdr encoder to do this.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24 22:12:48 -04:00
J. Bruce Fields
f4f9ef4a1b nfsd4: opdesc will be useful outside nfs4proc.c
Trivial cleanup, no change in behavior.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24 21:12:20 -04:00
Chuck Lever
fc788f64f1 nfsd: Limit end of page list when decoding NFSv4 WRITE
When processing an NFSv4 WRITE operation, argp->end should never
point past the end of the data in the final page of the page list.
Otherwise, nfsd4_decode_compound can walk into uninitialized memory.

More critical, nfsd4_decode_write is failing to increment argp->pagelen
when it increments argp->pagelist.  This can cause later xdr decoders
to assume more data is available than really is, which can cause server
crashes on malformed requests.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24 18:05:30 -04:00
Linus Torvalds
b71a5e3fe8 Merge branch 'for-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
 "We have one more fixup that stems from the blk_status_t conversion
  that did not quite cover everything.

  The normal cases were not affected because the code is 0, but any
  error and retries could mix up new and old values"

* 'for-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix blk_status_t/errno confusion
2017-08-24 14:10:31 -07:00
Eric W. Biederman
311fc65c9f pty: Repair TIOCGPTPEER
The implementation of TIOCGPTPEER has two issues.

When /dev/ptmx (as opposed to /dev/pts/ptmx) is opened the wrong
vfsmount is passed to dentry_open.  Which results in the kernel displaying
the wrong pathname for the peer.

The second is simply by caching the vfsmount and dentry of the peer it leaves
them open, in a way they were not previously Which because of the inreased
reference counts can cause unnecessary behaviour differences resulting in
regressions.

To fix these move the ioctl into tty_io.c at a generic level allowing
the ioctl to have access to the struct file on which the ioctl is
being called.  This allows the path of the slave to be derived when
opening the slave through TIOCGPTPEER instead of requiring the path to
the slave be cached.  Thus removing the need for caching the path.

A new function devpts_ptmx_path is factored out of devpts_acquire and
used to implement a function devpts_mntget.   The new function devpts_mntget
takes a filp to perform the lookup on and fsi so that it can confirm
that the superblock that is found by devpts_ptmx_path is the proper superblock.

v2: Lots of fixes to make the code actually work
v3: Suggestions by Linus
    - Removed the unnecessary initialization of filp in ptm_open_peer
    - Simplified devpts_ptmx_path as gotos are no longer required

[ This is the fix for the issue that was reverted in commit
  143c97cc65, but this time without breaking 'pbuilder' due to
  increased reference counts   - Linus ]

Fixes: 54ebbfb160 ("tty: add TIOCGPTPEER ioctl")
Reported-by: Christian Brauner <christian.brauner@canonical.com>
Reported-and-tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-24 13:23:03 -07:00
Randy Dodgen
fd96b8da68 ext4: fix fault handling when mounted with -o dax,ro
If an ext4 filesystem is mounted with both the DAX and read-only
options, executables on that filesystem will fail to start (claiming
'Segmentation fault') due to the fault handler returning
VM_FAULT_SIGBUS.

This is due to the DAX fault handler (see ext4_dax_huge_fault)
attempting to write to the journal when FAULT_FLAG_WRITE is set. This is
the wrong behavior for write faults which will lead to a COW page; in
particular, this fails for readonly mounts.

This change avoids journal writes for faults that are expected to COW.

It might be the case that this could be better handled in
ext4_iomap_begin / ext4_iomap_end (called via iomap_ops inside
dax_iomap_fault). These is some overlap already (e.g. grabbing journal
handles).

Signed-off-by: Randy Dodgen <dodgen@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
2017-08-24 15:26:01 -04:00
zhangyi (F)
95f1fda47c ext4: fix quota inconsistency during orphan cleanup for read-only mounts
Quota does not get enabled for read-only mounts if filesystem
has quota feature, so that quotas cannot updated during orphan
cleanup, which will lead to quota inconsistency.

This patch turn on quotas during orphan cleanup for this case,
make sure quotas can be updated correctly.

Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org # 3.18+
2017-08-24 15:21:50 -04:00
zhangyi (F)
b0a5a9589d ext4: fix incorrect quotaoff if the quota feature is enabled
Current ext4 quota should always "usage enabled" if the
quota feautre is enabled. But in ext4_orphan_cleanup(), it
turn quotas off directly (used for the older journaled
quota), so we cannot turn it on again via "quotaon" unless
umount and remount ext4.

Simple reproduce:

  mkfs.ext4 -O project,quota /dev/vdb1
  mount -o prjquota /dev/vdb1 /mnt
  chattr -p 123 /mnt
  chattr +P /mnt
  touch /mnt/aa /mnt/bb
  exec 100<>/mnt/aa
  rm -f /mnt/aa
  sync
  echo c > /proc/sysrq-trigger

  #reboot and mount
  mount -o prjquota /dev/vdb1 /mnt
  #query status
  quotaon -Ppv /dev/vdb1
  #output
  quotaon: Cannot find mountpoint for device /dev/vdb1
  quotaon: No correct mountpoint specified.

This patch add check for journaled quotas to avoid incorrect
quotaoff when ext4 has quota feautre.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org # 3.18
2017-08-24 15:19:39 -04:00
Damien Guibouret
918dc9d0ab ext4: remove useless test and assignment in strtohash functions
On transformation of str to hash, computed value is initialised before
first byte modulo 4. But it is already initialised before entering loop
and after processing last byte modulo 4. So the corresponding test and
initialisation could be removed.

Signed-off-by: Damien Guibouret <damien.guibouret@partition-saving.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-24 15:11:34 -04:00
Tahsin Erdogan
a6d0567604 ext4: backward compatibility support for Lustre ea_inode implementation
Original Lustre ea_inode feature did not have ref counts on xattr inodes
because there was always one parent that referenced it. New
implementation expects ref count to be initialized which is not true for
Lustre case. Handle this by detecting Lustre created xattr inode and set
its ref count to 1.

The quota handling of xattr inodes have also changed with deduplication
support. New implementation manually manages quotas to support sharing
across multiple users. A consequence is that, a referencing inode
incorporates the blocks of xattr inode into its own i_block field.

We need to know how a xattr inode was created so that we can reverse the
block charges during reference removal. This is handled by introducing a
EXT4_STATE_LUSTRE_EA_INODE flag. The flag is set on a xattr inode if
inode appears to have been created by Lustre. During xattr inode reference
removal, the manual quota uncharge is skipped if the flag is set.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-24 14:25:02 -04:00
Christoph Hellwig
eaa093d2c0 ext4: remove timebomb in ext4_decode_extra_time()
Changing behavior based on the version code is a timebomb waiting to
happen, and not easily bisectable.  Drop it and leave any removal
to explicit developer action. (And I don't think file system
should _ever_ remove backwards compatibility that has no explicit
flag, but I'll leave that to the ext4 folks).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Biggers <ebiggers@google.com>
2017-08-24 13:59:24 -04:00
Markus Elfring
d695a1bea3 ext4: use sizeof(*ptr)
Replace the specification of data structures by pointer dereferences
as the parameter for the operator "sizeof" to make the corresponding size
determination a bit safer according to the Linux coding style convention.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2017-08-24 13:50:24 -04:00
Darrick J. Wong
1bd8d6cd3e ext4: in ext4_seek_{hole,data}, return -ENXIO for negative offsets
In the ext4 implementations of SEEK_HOLE and SEEK_DATA, make sure we
return -ENXIO for negative offsets instead of banging around inside
the extent code and returning -EFSCORRUPTED.

Reported-by: Mateusz S <muttdini@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org # 4.6
2017-08-24 13:22:06 -04:00
Wang Shilong
901ed070df ext4: reduce lock contention in __ext4_new_inode
While running number of creating file threads concurrently,
we found heavy lock contention on group spinlock:

FUNC                           TOTAL_TIME(us)       COUNT        AVG(us)
ext4_create                    1707443399           1440000      1185.72
_raw_spin_lock                 1317641501           180899929    7.28
jbd2__journal_start            287821030            1453950      197.96
jbd2_journal_get_write_access  33441470             73077185     0.46
ext4_add_nondir                29435963             1440000      20.44
ext4_add_entry                 26015166             1440049      18.07
ext4_dx_add_entry              25729337             1432814      17.96
ext4_mark_inode_dirty          12302433             5774407      2.13

most of cpu time blames to _raw_spin_lock, here is some testing
numbers with/without patch.

Test environment:
Server : SuperMicro Sever (2 x E5-2690 v3@2.60GHz, 128GB 2133MHz
         DDR4 Memory, 8GbFC)
Storage : 2 x RAID1 (DDN SFA7700X, 4 x Toshiba PX02SMU020 200GB
          Read Intensive SSD)

format command:
        mkfs.ext4 -J size=4096

test command:
        mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
                -r -i 1 -v -p 10 -u #first run to load inode

        mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
                -r -i 3 -v -p 10 -u

Kernel version: 4.13.0-rc3

Test  1,440,000 files with 48 directories by 48 processes:

Without patch:

File Creation   File removal
79,033          289,569 ops/per second
81,463          285,359
79,875          288,475

With patch:
File Creation   File removal
810669		301694
812805		302711
813965		297670

Creation performance is improved more than 10X with large
journal size. The main problem here is we test bitmap
and do some check and journal operations which could be
slept, then we test and set with lock hold, this could
be racy, and make 'inode' steal by other process.

However, after first try, we could confirm handle has
been started and inode bitmap journaled too, then
we could find and set bit with lock hold directly, this
will mostly gurateee success with second try.

Tested-by: Shuichi Ihara <sihara@ddn.com>
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-08-24 12:56:35 -04:00
Wang Shilong
2fe435d8b0 ext4: cleanup goto next group
avoid duplicated codes, also we need goto
next group in case we found reserved inode.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-08-24 11:58:18 -04:00
Jan Kara
4f9d956d19 ext4: do not unnecessarily allocate buffer in recently_deleted()
In recently_deleted() function we want to check whether inode is still
cached in buffer cache. Use sb_find_get_block() for that instead of
sb_getblk() to avoid unnecessary allocation of bdev page and buffer
heads.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-24 11:52:21 -04:00
Omar Sandoval
58efbc9f54 Btrfs: fix blk_status_t/errno confusion
This fixes several instances of blk_status_t and bare errno ints being
mixed up, some of which are real bugs.

In the normal case, 0 matches BLK_STS_OK, so we don't observe any
effects of the missing conversion, but in case of errors or passes
through the repair/retry paths, the errors get mixed up.

The changes were identified using 'sparse', we don't have reports of the
buggy behaviour.

Fixes: 4e4cbee93d ("block: switch bios to blk_status_t")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-24 17:19:02 +02:00
Linus Torvalds
143c97cc65 Revert "pty: fix the cached path of the pty slave file descriptor in the master"
This reverts commit c8c03f1858.

It turns out that while fixing the ptmx file descriptor to have the
correct 'struct path' to the associated slave pty is a really good
thing, it breaks some user space tools for a very annoying reason.

The problem is that /dev/ptmx and its associated slave pty (/dev/pts/X)
are on different mounts.  That was what caused us to have the wrong path
in the first place (we would mix up the vfsmount of the 'ptmx' node,
with the dentry of the pty slave node), but it also means that now while
we use the right vfsmount, having the pty master open also keeps the pts
mount busy.

And it turn sout that that makes 'pbuilder' very unhappy, as noted by
Stefan Lippers-Hollmann:

 "This patch introduces a regression for me when using pbuilder
  0.228.7[2] (a helper to build Debian packages in a chroot and to
  create and update its chroots) when trying to umount /dev/ptmx (inside
  the chroot) on Debian/ unstable (full log and pbuilder configuration
  file[3] attached).

  [...]
  Setting up build-essential (12.3) ...
  Processing triggers for libc-bin (2.24-15) ...
  I: unmounting dev/ptmx filesystem
  W: Could not unmount dev/ptmx: umount: /var/cache/pbuilder/build/1340/dev/ptmx: target is busy
          (In some cases useful info about processes that
           use the device is found by lsof(8) or fuser(1).)"

apparently pbuilder tries to unmount the /dev/pts filesystem while still
holding at least one master node open, which is arguably not very nice,
but we don't break user space even when fixing other bugs.

So this commit has to be reverted.

I'll try to figure out a way to avoid caching the path to the slave pty
in the master pty.  The only thing that actually wants that slave pty
path is the "TIOCGPTPEER" ioctl, and I think we could just recreate the
path at that time.

Reported-by: Stefan Lippers-Hollmann <s.l-h@gmx.de>
Cc: Eric W Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian.brauner@canonical.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-23 18:16:11 -07:00
Christoph Hellwig
74d46992e0 block: replace bi_bdev with a gendisk pointer and partitions index
This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:55 -06:00
Christoph Hellwig
c2ee070fb0 block: cache the partition index in struct block_device
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:53 -06:00
Christoph Hellwig
f8f84b2dfd btrfs: index check-integrity state hash by a dev_t
We won't have the struct block_device available in the bio soon, so switch
to the numerical dev_t instead of the block_device pointer for looking up
the check-integrity state.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:47 -06:00
Ronnie Sahlberg
d3edede29f cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup()
Add checking for the path component length and verify it is <= the maximum
that the server advertizes via FileFsAttributeInformation.

With this patch cifs.ko will now return ENAMETOOLONG instead of ENOENT
when users to access an overlong path.

To test this, try to cd into a (non-existing) directory on a CIFS share
that has a too long name:
cd /mnt/aaaaaaaaaaaaaaa...

and it now should show a good error message from the shell:
bash: cd: /mnt/aaaaaaaaaaaaaaaa...aaaaaa: File name too long

rh bz 1153996

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Cc: <stable@vger.kernel.org>
2017-08-23 13:34:52 -05:00
Sachin Prabhu
42bec214d8 cifs: Fix df output for users with quota limits
The df for a SMB2 share triggers a GetInfo call for
FS_FULL_SIZE_INFORMATION. The values returned are used to populate
struct statfs.

The problem is that none of the information returned by the call
contains the total blocks available on the filesystem. Instead we use
the blocks available to the user ie. quota limitation when filling out
statfs.f_blocks. The information returned does contain Actual free units
on the filesystem and is used to populate statfs.f_bfree. For users with
quota enabled, it can lead to situations where the total free space
reported is more than the total blocks on the system ending up with df
reports like the following

 # df -h /mnt/a
Filesystem         Size  Used Avail Use% Mounted on
//192.168.22.10/a  2.5G -2.3G  2.5G    - /mnt/a

To fix this problem, we instead populate both statfs.f_bfree with the
same value as statfs.f_bavail ie. CallerAvailableAllocationUnits. This
is similar to what is done already in the code for cifs and df now
reports the quota information for the user used to mount the share.

 # df --si /mnt/a
Filesystem         Size  Used Avail Use% Mounted on
//192.168.22.10/a  2.7G  101M  2.6G   4% /mnt/a

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Pierguido Lambri <plambri@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Cc: <stable@vger.kernel.org>
2017-08-23 13:33:21 -05:00
Markus Elfring
def12ec59d isofs: Delete an unnecessary variable initialisation in isofs_read_inode()
The local variable "bh" will be set to an appropriate pointer a bit later.
Thus omit the explicit initialisation at the beginning.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-23 18:54:03 +02:00
Markus Elfring
e96e8a1dc1 isofs: Adjust four checks for null pointers
The script “checkpatch.pl” pointed information out like the following.

Comparison to NULL could be written !...

Thus fix the affected source code places.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-23 18:53:06 +02:00
Linus Torvalds
98b9f8a454 Fix a clang build regression and an potential xattr corruption bug.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlmdAHoACgkQ8vlZVpUN
 gaNVsgf/SRn6HaOpX7BdrtkXqjV8VvLZsDmsZPkhchdmTxMpIFJNf16/sg0hqdyJ
 wcTx3y+BkBSjBXLtqK+hslVyg4pUjSBWWZyZ9Dtyi5+B92CJJJBdaHIpcdvd3Ek1
 J/HPQjqcPXL43Cg5SQ0/KgVMhCze9I4bEbNm2evC18bC15hZAVP0FK1hT3FNpyIB
 fhOu9FZdnzlcBlnLdfTqgIEPaHzc6zcJnqpSbkT0InjiJf5cxDionhoaBzUh9Jzg
 bKvkFRDTDWDrBcYStuHwgpELmVVYJGbwjzMVOAcmeCiSJqNbU1/Ym5t3e3rflKmi
 6YEyDhK43iZGiR4/QUffrCxEIzfqrA==
 =dOeQ
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix a clang build regression and an potential xattr corruption bug"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: add missing xattr hash update
  ext4: fix clang build regression
2017-08-22 21:30:52 -07:00
Carlos Maiolino
2d32311cf1 xfs: stop searching for free slots in an inode chunk when there are none
In a filesystem without finobt, the Space manager selects an AG to alloc a new
inode, where xfs_dialloc_ag_inobt() will search the AG for the free slot chunk.

When the new inode is in the same AG as its parent, the btree will be searched
starting on the parent's record, and then retried from the top if no slot is
available beyond the parent's record.

To exit this loop though, xfs_dialloc_ag_inobt() relies on the fact that the
btree must have a free slot available, once its callers relied on the
agi->freecount when deciding how/where to allocate this new inode.

In the case when the agi->freecount is corrupted, showing available inodes in an
AG, when in fact there is none, this becomes an infinite loop.

Add a way to stop the loop when a free slot is not found in the btree, making
the function to fall into the whole AG scan which will then, be able to detect
the corruption and shut the filesystem down.

As pointed by Brian, this might impact performance, giving the fact we
don't reset the search distance anymore when we reach the end of the
tree, giving it fewer tries before falling back to the whole AG search, but
it will only affect searches that start within 10 records to the end of the tree.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 09:22:24 -07:00
Brian Foster
e67d3d4246 xfs: add log recovery tracepoint for head/tail
Torn write detection and tail overwrite detection can shift the log
head and tail respectively in the event of CRC mismatch or
corruption errors. Add a high-level log recovery tracepoint to dump
the final log head/tail and make those values easily attainable in
debug/diagnostic situations.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 09:22:24 -07:00
Brian Foster
a4c9b34d6a xfs: handle -EFSCORRUPTED during head/tail verification
Torn write and tail overwrite detection both trigger only on
-EFSBADCRC errors. While this is the most likely failure scenario
for each condition, -EFSCORRUPTED is still possible in certain cases
depending on what ends up on disk when a torn write or partial tail
overwrite occurs. For example, an invalid log record h_len can lead
to an -EFSCORRUPTED error when running the log recovery CRC pass.

Therefore, update log head and tail verification to trigger the
associated head/tail fixups in the event of -EFSCORRUPTED errors
along with -EFSBADCRC. Also, -EFSCORRUPTED can currently be returned
from xlog_do_recovery_pass() before rhead_blk is initialized if the
first record encountered happens to be corrupted. This leads to an
incorrect 'first_bad' return value. Initialize rhead_blk earlier in
the function to address that problem as well.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 09:22:24 -07:00
Brian Foster
7f4d01f36a xfs: add log item pinning error injection tag
Add an error injection tag to force log items in the AIL to the
pinned state. This option can be used by test infrastructure to
induce head behind tail conditions. Specifically, this is intended
to be used by xfstests to reproduce log recovery problems after
failed/corrupted log writes overwrite the last good tail LSN in the
log.

When enabled, AIL push attempts see log items in the AIL in the
pinned state. This stalls metadata writeback and thus prevents the
current tail of the log from moving forward. When disabled,
subsequent AIL pushes observe the log items in their appropriate
state and filesystem operation continues as normal.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 09:22:24 -07:00
Brian Foster
4a4f66eac4 xfs: fix log recovery corruption error due to tail overwrite
If we consider the case where the tail (T) of the log is pinned long
enough for the head (H) to push and block behind the tail, we can
end up blocked in the following state without enough free space (f)
in the log to satisfy a transaction reservation:

	0	phys. log	N
	[-------HffT---H'--T'---]

The last good record in the log (before H) refers to T. The tail
eventually pushes forward (T') leaving more free space in the log
for writes to H. At this point, suppose space frees up in the log
for the maximum of 8 in-core log buffers to start flushing out to
the log. If this pushes the head from H to H', these next writes
overwrite the previous tail T. This is safe because the items logged
from T to T' have been written back and removed from the AIL.

If the next log writes (H -> H') happen to fail and result in
partial records in the log, the filesystem shuts down having
overwritten T with invalid data. Log recovery correctly locates H on
the subsequent mount, but H still refers to the now corrupted tail
T. This results in log corruption errors and recovery failure.

Since the tail overwrite results from otherwise correct runtime
behavior, it is up to log recovery to try and deal with this
situation. Update log recovery tail verification to run a CRC pass
from the first record past the tail to the head. This facilitates
error detection at T and moves the recovery tail to the first good
record past H' (similar to truncating the head on torn write
detection). If corruption is detected beyond the range possibly
affected by the max number of iclogs, the log is legitimately
corrupted and log recovery failure is expected.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 09:22:23 -07:00
Brian Foster
5297ac1f6d xfs: always verify the log tail during recovery
Log tail verification currently only occurs when torn writes are
detected at the head of the log. This was introduced because a
change in the head block due to torn writes can lead to a change in
the tail block (each log record header references the current tail)
and the tail block should be verified before log recovery proceeds.

Tail corruption is possible outside of torn write scenarios,
however. For example, partial log writes can be detected and cleared
during the initial head/tail block discovery process. If the partial
write coincides with a tail overwrite, the log tail is corrupted and
recovery fails.

To facilitate correct handling of log tail overwites, update log
recovery to always perform tail verification. This is necessary to
detect potential tail overwrite conditions when torn writes may not
have occurred. This changes normal (i.e., no torn writes) recovery
behavior slightly to detect and return CRC related errors near the
tail before actual recovery starts.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 09:22:23 -07:00
Brian Foster
284f1c2c9b xfs: fix recovery failure when log record header wraps log end
The high-level log recovery algorithm consists of two loops that
walk the physical log and process log records from the tail to the
head. The first loop handles the case where the tail is beyond the
head and processes records up to the end of the physical log. The
subsequent loop processes records from the beginning of the physical
log to the head.

Because log records can wrap around the end of the physical log, the
first loop mentioned above must handle this case appropriately.
Records are processed from in-core buffers, which means that this
algorithm must split the reads of such records into two partial
I/Os: 1.) from the beginning of the record to the end of the log and
2.) from the beginning of the log to the end of the record. This is
further complicated by the fact that the log record header and log
record data are read into independent buffers.

The current handling of each buffer correctly splits the reads when
either the header or data starts before the end of the log and wraps
around the end. The data read does not correctly handle the case
where the prior header read wrapped or ends on the physical log end
boundary. blk_no is incremented to or beyond the log end after the
header read to point to the record data, but the split data read
logic triggers, attempts to read from an invalid log block and
ultimately causes log recovery to fail. This can be reproduced
fairly reliably via xfstests tests generic/047 and generic/388 with
large iclog sizes (256k) and small (10M) logs.

If the record header read has pushed beyond the end of the physical
log, the subsequent data read is actually contiguous. Update the
data read logic to detect the case where blk_no has wrapped, mod it
against the log size to read from the correct address and issue one
contiguous read for the log data buffer. The log record is processed
as normal from the buffer(s), the loop exits after the current
iteration and the subsequent loop picks up with the first new record
after the start of the log.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 09:22:23 -07:00
Carlos Maiolino
d3a304b629 xfs: Properly retry failed inode items in case of error during buffer writeback
When a buffer has been failed during writeback, the inode items into it
are kept flush locked, and are never resubmitted due the flush lock, so,
if any buffer fails to be written, the items in AIL are never written to
disk and never unlocked.

This causes unmount operation to hang due these items flush locked in AIL,
but this also causes the items in AIL to never be written back, even when
the IO device comes back to normal.

I've been testing this patch with a DM-thin device, creating a
filesystem larger than the real device.

When writing enough data to fill the DM-thin device, XFS receives ENOSPC
errors from the device, and keep spinning on xfsaild (when 'retry
forever' configuration is set).

At this point, the filesystem can not be unmounted because of the flush locked
items in AIL, but worse, the items in AIL are never retried at all
(once xfs_inode_item_push() will skip the items that are flush locked),
even if the underlying DM-thin device is expanded to the proper size.

This patch fixes both cases, retrying any item that has been failed
previously, using the infra-structure provided by the previous patch.

Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 09:22:23 -07:00
Carlos Maiolino
0b80ae6ed1 xfs: Add infrastructure needed for error propagation during buffer IO failure
With the current code, XFS never re-submit a failed buffer for IO,
because the failed item in the buffer is kept in the flush locked state
forever.

To be able to resubmit an log item for IO, we need a way to mark an item
as failed, if, for any reason the buffer which the item belonged to
failed during writeback.

Add a new log item callback to be used after an IO completion failure
and make the needed clean ups.

Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 09:22:23 -07:00
Eric Sandeen
6f4a1eefdd xfs: toggle readonly state around xfs_log_mount_finish
When we do log recovery on a readonly mount, unlinked inode
processing does not happen due to the readonly checks in
xfs_inactive(), which are trying to prevent any I/O on a
readonly mount.

This is misguided - we do I/O on readonly mounts all the time,
for consistency; for example, log recovery.  So do the same
RDONLY flag twiddling around xfs_log_mount_finish() as we
do around xfs_log_mount(), for the same reason.

This all cries out for a big rework but for now this is a
simple fix to an obvious problem.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 09:22:23 -07:00
Eric Sandeen
757a69ef6c xfs: write unmount record for ro mounts
There are dueling comments in the xfs code about intent
for log writes when unmounting a readonly filesystem.

In xfs_mountfs, we see the intent:

/*
 * Now the log is fully replayed, we can transition to full read-only
 * mode for read-only mounts. This will sync all the metadata and clean
 * the log so that the recovery we just performed does not have to be
 * replayed again on the next mount.
 */

and it calls xfs_quiesce_attr(), but by the time we get to
xfs_log_unmount_write(), it returns early for a RDONLY mount:

 * Don't write out unmount record on read-only mounts.

Because of this, sequential ro mounts of a filesystem with
a dirty log will replay the log each time, which seems odd.

Fix this by writing an unmount record even for RO mounts, as long
as norecovery wasn't specified (don't write a clean log record
if a dirty log may still be there!) and the log device is
writable.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 09:22:23 -07:00
David Sterba
db95c876c5 btrfs: submit superblock io with REQ_META and REQ_PRIO
The superblock is also metadata of the filesystem so the relevant IO
should be tagged as such. We also tag it as high priority, as it's the
last block committed for metadata from a given transaction. Any delays
would effectively block the whole transaction, also blocking any other
operation holding the device_list_mutex.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-22 13:22:05 +02:00
David S. Miller
e2a7c34fb2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-08-21 17:06:42 -07:00
Chao Yu
969d1b180d f2fs: introduce discard_granularity sysfs entry
Commit d618ebaf0a ("f2fs: enable small discard by default") enables
f2fs to issue 4K size discard in real-time discard mode. However, issuing
smaller discard may cost more lifetime but releasing less free space in
flash device. Since f2fs has ability of separating hot/cold data and
garbage collection, we can expect that small-sized invalid region would
expand soon with OPU, deletion or garbage collection on valid datas, so
it's better to delay or skip issuing smaller size discards, it could help
to reduce overmuch consumption of IO bandwidth and lifetime of flash
storage.

This patch makes f2fs selectng 64K size as its default minimal
granularity, and issue discard with the size which is not smaller than
minimal granularity. Also it exposes discard granularity as sysfs entry
for configuration in different scenario.

Jaegeuk Kim:
 We must issue all the accumulated discard commands when fstrim is called.
 So, I've added pend_list_tag[] to indicate whether we should issue the
 commands or not. If tag sets P_ACTIVE or P_TRIM, we have to issue them.
 P_TRIM is set once at a time, given fstrim trigger.
 In addition, issue_discard_thread is calling too much due to the number of
 discard commands remaining in the pending list. I added a timer to control
 it likewise gc_thread.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21 15:55:07 -07:00
Yunlong Song
f24b150a63 f2fs: remove unused function overprovision_sections
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21 15:55:07 -07:00
Jaegeuk Kim
125c9fb1cc f2fs: check hot_data for roll-forward recovery
We need to check HOT_DATA to truncate any previous data block when doing
roll-forward recovery.

Cc: <stable@vger.kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21 15:55:06 -07:00
Chao Yu
c56f16dab0 f2fs: add tracepoint for f2fs_gc
This patch adds tracepoint for f2fs_gc.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21 15:55:05 -07:00
Chao Yu
7f2b4e8ea5 f2fs: retry to revoke atomic commit in -ENOMEM case
During atomic committing, if we encounter -ENOMEM in revoke path, it's
better to give a chance to retry revoking.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21 15:55:04 -07:00
Jaegeuk Kim
afd2b4da40 f2fs: let fill_super handle roll-forward errors
If we set CP_ERROR_FLAG in roll-forward error, f2fs is no longer to proceed
any IOs due to f2fs_cp_error(). But, for example, if some stale data is involved
on roll-forward process, we're able to get -ENOENT, getting fs stuck.
If we get any error, let fill_super set SBI_NEED_FSCK and try to recover back
to stable point.

Cc: <stable@vger.kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21 15:55:03 -07:00
Qiuyang Sun
f2220c7f15 f2fs: merge equivalent flags F2FS_GET_BLOCK_[READ|DIO]
Currently, the two flags F2FS_GET_BLOCK_[READ|DIO] are totally equivalent
and can be used interchangably in all scenarios they are involved in.
Neither of the flags is referenced in f2fs_map_blocks(), making them both
the default case. To remove the ambiguity, this patch merges both flags
into F2FS_GET_BLOCK_DEFAULT, and introduces an enum for all distinct flags.

Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21 15:55:02 -07:00
Chao Yu
4b2414d04e f2fs: support journalled quota
This patch supports to enable f2fs to accept quota information through
mount option:
- {usr,grp,prj}jquota=<quota file path>
- jqfmt=<quota type>

Then, in ->mount flow, we can recover quota file during log replaying,
by this, journelled quota can be supported.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: Fix wrong return values.]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-21 15:54:48 -07:00
Nikolay Borisov
dc59215d4f btrfs: remove unnecessary memory barrier in btrfs_direct_IO
Commit 38851cc19a ("Btrfs: implement unlocked dio write") implemented
unlocked dio write, allowing multiple dio writers to write to
non-overlapping, and non-eof-extending regions. In doing so it also
introduced a broken memory barrier. It is broken due to 2 things:

1. Memory barriers _MUST_ always be paired, this is clearly not the case
   here

2. Checkpatch actually produces a warning if a memory barrier is
   introduced that doesn't have a comment explaining how it's being
   paired.

Specifically for inode::i_dio_count that's wrapped inside
inode_dio_begin, there is no explicit barrier semantics attached, so
removing is fine as the atomic is used in common the waiter/wakeup
pattern.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhance changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 18:49:21 +02:00
Nikolay Borisov
b5d9071c4f btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent
Currently this function is always called with the object id of the root
key of the chunk_tree, which is always BTRFS_CHUNK_TREE_OBJECTID. So
let's subsume it straight into the function itself. No functional
change.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 18:30:30 +02:00
Nikolay Borisov
0ca00afb2b btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent
THe function is always called with chunk_objectid set to
BTRFS_FIRST_CHUNK_TREE_OBJECTID. Let's collapse the parameter in the
function itself. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 18:30:16 +02:00
Jeff Mahoney
1cd5447eb6 btrfs: pass fs_info to btrfs_del_root instead of tree_root
btrfs_del_roots always uses the tree_root.  Let's pass fs_info instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:49:54 +02:00
Liu Bo
64ecdb647d Btrfs: add one more sanity check for shared ref type
Every shared ref has a parent tree block, which can be get from
btrfs_extent_inline_ref_offset().  And the tree block must be aligned
to the nodesize, so we'd know this inline ref is not valid if this
block's bytenr is not aligned to the nodesize, in which case, most
likely the ref type has been misused.

This adds the above mentioned check and also updates
print_extent_item() called by btrfs_print_leaf() to point out the
invalid ref while printing the tree structure.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:43 +02:00
Liu Bo
cdccee993f Btrfs: remove BUG_ON in __add_tree_block
The BUG_ON() can be triggered when the caller is processing an invalid
extent inline ref, e.g.

a shared data ref is offered instead of an extent data ref, such that
it tries to find a non-existent tree block and then btrfs_search_slot
returns 1 for no such item.

This replaces the BUG_ON() with a WARN() followed by calling
btrfs_print_leaf() to show more details about what's going on and
returning -EINVAL to upper callers.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:43 +02:00
Liu Bo
b14c55a191 Btrfs: remove BUG() in add_data_reference
Now that we have a helper to report invalid value of extent inline ref
type, we need to quit gracefully instead of throwing out a kernel panic.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:43 +02:00
Liu Bo
07638ea598 Btrfs: remove BUG() in print_extent_item
btrfs_print_leaf() is used in btrfs_get_extent_inline_ref_type, so
here we really want to print the invalid value of ref type instead of
causing a kernel panic.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:43 +02:00
Liu Bo
4335958de2 Btrfs: remove BUG() in btrfs_extent_inline_ref_size
Now that btrfs_get_extent_inline_ref_type() can report if type is a
valid one and all callers can gracefully deal with that, we don't need
to crash here.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:43 +02:00
Liu Bo
3de28d579e Btrfs: convert to use btrfs_get_extent_inline_ref_type
Since we have a helper which can do sanity check, this converts all
btrfs_extent_inline_ref_type to it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:43 +02:00
Liu Bo
167ce953ca Btrfs: add a helper to retrive extent inline ref type
An invalid value of extent inline ref type may be read from a
malicious image which may force btrfs to crash.

This adds a helper which does sanity check for the ref type, so we can
know if it's sane, return he type, otherwise return an error.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minimal tweak const types, causing warnings due to other cleanup patches ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
David Sterba
af1cbe0a66 btrfs: scrub: simplify scrub worker initialization
Minor simplification, merge calls to one.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
David Sterba
1d1bf92d9d btrfs: scrub: clean up division in scrub_find_csum
Use proper helpers for 64bit division.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
David Sterba
7736b0a431 btrfs: scrub: clean up division in __scrub_mark_bitmap
Use proper helpers for 64bit division and then cast to narrower type.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
David Sterba
2073c4c2e5 btrfs: scrub: use bool for flush_all_writes
flush_all_writes is an atomic but does not use the semantics at all,
it's just on/off indicator, we can use bool.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Ernesto A. Fernández
d7d8249665 btrfs: preserve i_mode if __btrfs_set_acl() fails
When changing a file's acl mask, btrfs_set_acl() will first set the
group bits of i_mode to the value of the mask, and only then set the
actual extended attribute representing the new acl.

If the second part fails (due to lack of space, for example) and the
file had no acl attribute to begin with, the system will from now on
assume that the mask permission bits are actual group permission bits,
potentially granting access to the wrong users.

Prevent this by restoring the original mode bits if __btrfs_set_acl
fails.

Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Nikolay Borisov
408fbf19ad btrfs: Remove extraneous chunk_objectid variable
BTRFS_FIRST_CHUNK_TREE_OBJECTIS id the only objectid being used in the
chunk_tree. So remove a variable which is always set to that value and collapse
its usage in callees which are passed this variable. No functional changes

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Nikolay Borisov
0174484d61 btrfs: Remove chunk_objectid argument from btrfs_make_block_group
btrfs_make_block_group is always called with chunk_objectid set to
BTRFS_FIRST_CHUNK_TREE_OBJECTID. There's no reason why this behavior will
change anytime soon, so let's remove the argument and decrease the cognitive
load when reading the code path. No functional change

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Matthias Kaehlcke
0dde10bed2 btrfs: Remove extra parentheses from condition in copy_items()
There is no need for the extra pair of parentheses, remove it. This
fixes the following warning when building with clang:

fs/btrfs/tree-log.c:3694:10: warning: equality comparison with extraneous
  parentheses [-Wparentheses-equality]
                if ((i == (nr - 1)))
                     ~~^~~~~~~~~~~

Also remove the unnecessary parentheses around the substraction.

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Nikolay Borisov
0ce1dd2a4a btrfs: Remove redundant setting of uuid in btrfs_block_header
btrfs_alloc_dev_extent currently unconditionally sets the uuid in the
leaf block header the function is working with. This is unnecessary
since this operation is peformed by the core btree handling code
(splitting a node, allocating a new btree block etc). So let's remove
it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Hans van Kranenburg
583b723151 btrfs: Do not use data_alloc_cluster in ssd mode
This patch provides a band aid to improve the 'out of the box'
behaviour of btrfs for disks that are detected as being an ssd.  In a
general purpose mixed workload scenario, the current ssd mode causes
overallocation of available raw disk space for data, while leaving
behind increasing amounts of unused fragmented free space. This
situation leads to early ENOSPC problems which are harming user
experience and adoption of btrfs as a general purpose filesystem.

This patch modifies the data extent allocation behaviour of the ssd mode
to make it behave identical to nossd mode.  The metadata behaviour and
additional ssd_spread option stay untouched so far.

Recommendations for future development are to reconsider the current
oversimplified nossd / ssd distinction and the broken detection
mechanism based on the rotational attribute in sysfs and provide
experienced users with a more flexible way to choose allocator behaviour
for data and metadata, optimized for certain use cases, while keeping
sane 'out of the box' default settings.  The internals of the current
btrfs code have more potential than what currently gets exposed to the
user to choose from.

    The SSD story...

    In the first year of btrfs development, around early 2008, btrfs
gained a mount option which enables specific functionality for
filesystems on solid state devices. The first occurance of this
functionality is in commit e18e4809, labeled "Add mount -o ssd, which
includes optimizations for seek free storage".

The effect on allocating free space for doing (data) writes is to
'cluster' writes together, writing them out in contiguous space, as
opposed to a 'tetris' way of putting all separate writes into any free
space fragment that fits (which is what the -o nossd behaviour does).

A somewhat simplified explanation of what happens is that, when for
example, the 'cluster' size is set to 2MiB, when we do some writes, the
data allocator will search for a free space block that is 2MiB big, and
put the writes in there. The ssd mode itself might allow a 2MiB cluster
to be composed of multiple free space extents with some existing data in
between, while the additional ssd_spread mount option kills off this
option and requires fully free space.

The idea behind this is (commit 536ac8ae): "The [...] clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
block. So, effectively this means applying a "locality based algorithm"
and trying to outsmart the actual ssd.

Since then, various changes have been made to the involved code, but the
basic idea is still present, and gets activated whenever the ssd mount
option is active. This also happens by default, when the rotational flag
as seen at /sys/block/<device>/queue/rotational is set to 0.

    However, there's a number of problems with this approach.

    First, what the optimization is trying to do is outsmart the ssd by
assuming there is a relation between the physical address space of the
block device as seen by btrfs and the actual physical storage of the
ssd, and then adjusting data placement. However, since the introduction
of the Flash Translation Layer (FTL) which is a part of the internal
controller of an ssd, these attempts are futile. The use of good quality
FTL in consumer ssd products might have been limited in 2008, but this
situation has changed drastically soon after that time. Today, even the
flash memory in your automatic cat feeding machine or your grandma's
wheelchair has a full featured one.

Second, the behaviour as described above results in the filesystem being
filled up with badly fragmented free space extents because of relatively
small pieces of space that are freed up by deletes, but not selected
again as part of a 'cluster'. Since the algorithm prefers allocating a
new chunk over going back to tetris mode, the end result is a filesystem
in which all raw space is allocated, but which is composed of
underutilized chunks with a 'shotgun blast' pattern of fragmented free
space. Usually, the next problematic thing that happens is the
filesystem wanting to allocate new space for metadata, which causes the
filesystem to fail in spectacular ways.

Third, the default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with spreading out writes over
the full address space and ignoring freed up space leads to worst case
behaviour in providing information to the ssd itself, since it will
never learn that all the free space left behind is actually free.  There
are two ways to let an ssd know previously written data does not have to
be preserved, which are sending explicit signals using discard or
fstrim, or by simply overwriting the space with new data.  The worst
case behaviour is the btrfs ssd_spread mount option in combination with
not having discard enabled. It has a side effect of minimizing the reuse
of free space previously written in.

Fourth, the rotational flag in /sys/ does not reliably indicate if the
device is a locally attached ssd. For example, iSCSI or NBD displays as
non-rotational, while a loop device on an ssd shows up as rotational.

The combination of the second and third problem effectively means that
despite all the good intentions, the btrfs ssd mode reliably causes the
ssd hardware and the filesystem structures and performance to be choked
to death. The clickbait version of the title of this story would have
been "Btrfs ssd optimizations considered harmful for ssds".

The current nossd 'tetris' mode (even still without discard) allows a
pattern of overwriting much more previously used space, causing many
more implicit discards to happen because of the overwrite information
the ssd gets. The actual location in the physical address space, as seen
from the point of view of btrfs is irrelevant, because the actual writes
to the low level flash are reordered anyway thanks to the FTL.

    Changes made in the code

1. Make ssd mode data allocation identical to tetris mode, like nossd.
2. Adjust and clean up filesystem mount messages so that we can easily
identify if a kernel has this patch applied or not, when providing
support to end users. Also, make better use of the *_and_info helpers to
only trigger messages on actual state changes.

    Backporting notes

Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
* First apply commit 951e7966 "btrfs: drop the nossd flag when
  remounting with -o ssd", or fixup the differences manually.
* The rest of the conflicts are because of the fs_info refactoring. So,
  for example, instead of using fs_info, it's root->fs_info in
  extent-tree.c

Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Lu Fengqi
43a0111103 btrfs: use btrfsic_submit_bio instead of submit_bio in write_dev_flush
Although this bio has no data attached, it will reach this condition
(bio->bi_opf & REQ_PREFLUSH) and then update the flush_gen of dev_state
in __btrfsic_submit_bio. So we should still submit it through integrity
checker. Otherwise, the integrity checker will throw the following warning
when I mount a newly created btrfs filesystem.

[10264.755497] btrfs: attempt to write superblock which references block M @29523968 (sdb1/1111654400/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)!
[10264.755498] btrfs: attempt to write superblock which references block M @29523968 (sdb1/37912576/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)!

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Filipe Manana
72610b1b40 Btrfs: incremental send, fix emission of invalid clone operations
When doing an incremental send it's possible that the computed send stream
contains clone operations that will fail on the receiver if the receiver
has compression enabled and the clone operations target a sector sized
extent that starts at a zero file offset, is not compressed on the source
filesystem but ends up being compressed and inlined at the destination
filesystem.

Example scenario:

  $ mkfs.btrfs -f /dev/sdb
  $ mount -o compress /dev/sdb /mnt

  # By doing a direct IO write, the data is not compressed.
  $ xfs_io -f -d -c "pwrite -S 0xab 0 4K" /mnt/foobar
  $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1

  $ xfs_io -c "reflink /mnt/foobar 0 8K 4K" /mnt/foobar
  $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2

  $ btrfs send -f /tmp/1.snap /mnt/mysnap1
  $ btrfs send -f /tmp/2.snap -p /mnt/mysnap1 /mnt/mysnap2
  $ umount /mnt

  $ mkfs.btrfs -f /dev/sdc
  $ mount -o compress /dev/sdc /mnt
  $ btrfs receive -f /tmp/1.snap /mnt
  $ btrfs receive -f /tmp/2.snap /mnt
  ERROR: failed to clone extents to foobar
  Operation not supported

The same could be achieved by mounting the source filesystem without
compression and doing a buffered IO write instead of a direct IO one,
and mounting the destination filesystem with compression enabled.

So fix this by issuing regular write operations in the send stream
instead of clone operations when the source offset is zero and the
range has a length matching the sector size.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:42 +02:00
Liu Bo
f716abd55d Btrfs: fix out of bounds array access while reading extent buffer
There is a corner case that slips through the checkers in functions
reading extent buffer, ie.

if (start < eb->len) and (start + len > eb->len),
then

a) map_private_extent_buffer() returns immediately because
it's thinking the range spans across two pages,

b) and the checkers in read_extent_buffer(), WARN_ON(start > eb->len)
and WARN_ON(start + len > eb->start + eb->len), both are OK in this
corner case, but it'd actually try to access the eb->pages out of
bounds because of (start + len > eb->len).

The case is found by switching extent inline ref type from shared data
ref to non-shared data ref, which is a kind of metadata corruption.

It'd use the wrong helper to access the eb,
eg. btrfs_extent_data_ref_root(eb, ref) is used but the %ref passing
here is "struct btrfs_shared_data_ref".  And if the extent item
happens to be the first item in the eb, then offset/length will get
over eb->len which ends up an invalid memory access.

This is adding proper checks in order to avoid invalid memory access,
ie. 'general protection fault', before it's too late.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21 17:47:14 +02:00
Markus Elfring
8898662268 isofs: Delete an error message for a failed memory allocation in isofs_read_inode()
Omit an extra message for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-21 16:05:48 +02:00
Markus Elfring
434aafb572 quota_v2: Delete an error message for a failed memory allocation in v2_read_file_info()
Omit an extra message for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-21 16:02:59 +02:00
Trond Myklebust
7af7a5963c Merge branch 'bugfixes' 2017-08-20 13:04:12 -04:00
Chuck Lever
53a75f22e7 NFS: Fix NFSv2 security settings
For a while now any NFSv2 mount where sec= is specified uses
AUTH_NULL. If sec= is not specified, the mount uses AUTH_UNIX.
Commit e68fd7c807 ("mount: use sec= that was specified on the
command line") attempted to address a very similar problem with
NFSv3, and should have fixed this too, but it has a bug.

The MNTv1 MNT procedure does not return a list of security flavors,
so our client makes up a list containing just AUTH_NULL. This should
enable nfs_verify_authflavors() to assign the sec= specified flavor,
but instead, it incorrectly sets it to AUTH_NULL.

I expect this would also be a problem for any NFSv3 server whose
MNTv3 MNT procedure returned a security flavor list containing only
AUTH_NULL.

Fixes: e68fd7c807 ("mount: use sec= that was specified on ... ")
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=310
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-20 12:43:34 -04:00
NeilBrown
b79e87e070 NFSv4.1: don't use machine credentials for CLOSE when using 'sec=sys'
An NFSv4.1 client might close a file after the user who opened it has
logged off.  In this case the user's credentials may no longer be
valid, if they are e.g. kerberos credentials that have expired.

NFSv4.1 has a mechanism to allow the client to use machine credentials
to close a file.  However due to a short-coming in the RFC, a CLOSE
with those credentials may not be possible if the file in question
isn't exported to the same security flavor - the required PUTFH must
be rejected when this is the case.

Specifically if a server and client support kerberos in general and
have used it to form a machine credential, but the file is only
exported to "sec=sys", a PUTFH with the machine credentials will fail,
so CLOSE is not possible.

As RPC_AUTH_UNIX (used by sec=sys) credentials can never expire, there
is no value in using the machine credential in place of them.
So in that case, just use the users credentials for CLOSE etc, as you would
in NFSv4.0

Signed-off-by: Neil Brown <neilb@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-20 12:43:17 -04:00
Linus Torvalds
7f680d7ec3 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
 "Another pile of small fixes and updates for x86:

   - Plug a hole in the SMAP implementation which misses to clear AC on
     NMI entry

   - Fix the norandmaps/ADDR_NO_RANDOMIZE logic so the command line
     parameter works correctly again

   - Use the proper accessor in the startup64 code for next_early_pgt to
     prevent accessing of invalid addresses and faulting in the early
     boot code.

   - Prevent CPU hotplug lock recursion in the MTRR code

   - Unbreak CPU0 hotplugging

   - Rename overly long CPUID bits which got introduced in this cycle

   - Two commits which mark data 'const' and restrict the scope of data
     and functions to file scope by making them 'static'"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Constify attribute_group structures
  x86/boot/64/clang: Use fixup_pointer() to access 'next_early_pgt'
  x86/elf: Remove the unnecessary ADDR_NO_RANDOMIZE checks
  x86: Fix norandmaps/ADDR_NO_RANDOMIZE
  x86/mtrr: Prevent CPU hotplug lock recursion
  x86: Mark various structures and functions as 'static'
  x86/cpufeature, kvm/svm: Rename (shorten) the new "virtualized VMSAVE/VMLOAD" CPUID flag
  x86/smpboot: Unbreak CPU0 hotplug
  x86/asm/64: Clear AC on NMI entries
2017-08-20 09:36:52 -07:00
Trond Myklebust
3bde7afdab NFS: Remove unused parameter gfp_flags from nfs_pageio_init()
Now that the mirror allocation has been moved, the parameter can go.
Also remove the redundant symbol export.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-20 11:35:33 -04:00
Trond Myklebust
14abcb0bf5 NFSv4: Fix up mirror allocation
There are a number of callers of nfs_pageio_complete() that want to
continue using the nfs_pageio_descriptor without needing to call
nfs_pageio_init() again. Examples include nfs_pageio_resend() and
nfs_pageio_cond_complete().

The problem is that nfs_pageio_complete() also calls
nfs_pageio_cleanup_mirroring(), which frees up the array of mirrors.
This can lead to writeback errors, in the next call to
nfs_pageio_setup_mirroring().

Fix by simply moving the allocation of the mirrors to
nfs_pageio_setup_mirroring().

Link: https://bugzilla.kernel.org/show_bug.cgi?id=196709
Reported-by: JianhongYin <yin-jianhong@163.com>
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-20 10:36:35 -04:00
Linus Torvalds
cc28fcdc01 Changes since last time:
- Don't leak resources when mount fails
 - Don't accidentally clobber variables when looking for free inodes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJZlfFsAAoJEPh/dxk0SrTrTmQP/1Yga+FXQ1vjsyi0SyPRupwd
 6beHGDEyLSmYaZKqye8v/nJlNVT8nmJofM20Hyu04f41K4oShQrzrI7jOOscOaYY
 jGEpgbx9fpLPD7AupgDvEDcrZyzZD/j3XxoSsOEGe5D6m3t2X0B4RtHz3jtj2s3e
 wkaBTE7GpzwrhC+9L+3AAtlpNlwkbjcCz0Wfrqlo8DjvRHTlutbYF51fthLJACtz
 U5XgNlxrjQlxGxn4IRHEqxmxWKz2iF4aQHGIX8OEGyt8J3YEO2t3K+nSalWduiBc
 mynExqVFIdGddNWoW4au6IKkPEahytsPVAiyt1TQMNvgkOMCO6DfUz+WmyQbd483
 2r/xUbMdP78RQsUDXdrIEcTiHs/GEfQmIxUongf/0au3r2wmpQfbqzQuBxhuVbzW
 1tQQsDKrO3r+GeEEoBPehtWVF/QPlQvlpT6pfft69kcgp5ukPDvOyOoM0ZEbKy72
 zBWEs5O/kHUOBBXXdV2cqazplq3LyLuBMok1y+gUXXOyXfEd2w9LPqmoK3RmqSQ2
 FnZc2A6tjko1NDLrSkq/uYRXIGi7ZAfxzqhP0L6XLUnu+kjN/A2Xb6pdfB9Wngl2
 8nLVbBL/d28lMVPLJ5M3yxoVcQbIfcNqNA5QmWVCmPUqEwgMQFCsbBdYMKILI0ok
 B76xb0VyZBP5l9QJ514S
 =vJe/
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.13-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "A handful more bug fixes for you today.

  Changes since last time:

   - Don't leak resources when mount fails

   - Don't accidentally clobber variables when looking for free inodes"

* tag 'xfs-4.13-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: don't leak quotacheck dquots when cow recovery
  xfs: clear MS_ACTIVE after finishing log recovery
  iomap: fix integer truncation issues in the zeroing and dirtying helpers
  xfs: fix inobt inode allocation search optimization
2017-08-18 14:25:50 -07:00
Trond Myklebust
b7561e5186 Merge branch 'writeback' 2017-08-18 14:51:10 -04:00
Nikolay Borisov
c59efa7eb2 btrfs: Fix -EOVERFLOW handling in btrfs_ioctl_tree_search_v2
The buffer passed to btrfs_ioctl_tree_search* functions have to be at least
sizeof(struct btrfs_ioctl_search_header). If this is not the case then the
ioctl should return -EOVERFLOW and set the uarg->buf_size to the minimum
required size. Currently btrfs_ioctl_tree_search_v2 would return an -EOVERFLOW
error with ->buf_size being set to the value passed by user space. Fix this by
removing the size check and relying on search_ioctl, which already includes it
and correctly sets buf_size.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Nikolay Borisov
e6961cac73 btrfs: Move skip checksum check from btrfs_submit_direct to __btrfs_submit_dio_bio
Currently the code checks whether we should do data checksumming in
btrfs_submit_direct and the boolean result of this check is passed to
btrfs_submit_direct_hook, in turn passing it to __btrfs_submit_dio_bio which
actually consumes it. The last function actually has all the necessary context
to figure out whether to skip the check or not, so let's move the check closer
to where it's being consumed. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Filipe Manana
6399fb5a0b Btrfs: fix assertion failure during fsync in no-holes mode
When logging an inode in full mode that has an inline compressed extent
that represents a range with a size matching the sector size (currently
the same as the page size), has a trailing hole and the no-holes feature
is enabled, we end up failing an assertion leading to a trace like the
following:

[141812.031528] assertion failed: len == i_size, file: fs/btrfs/tree-log.c, line: 4453
[141812.033069] ------------[ cut here ]------------
[141812.034330] kernel BUG at fs/btrfs/ctree.h:3452!
[141812.035137] invalid opcode: 0000 [#1] PREEMPT SMP
[141812.035932] Modules linked in: btrfs dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_flakey dm_mod dax ppdev evdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 tpm_tis psmouse crypto_simd parport_pc sg pcspkr tpm_tis_core cryptd parport serio_raw glue_helper tpm i2c_piix4 i2c_core button sunrpc loop autofs4 ext4 crc16 jbd2 mbcache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod ata_generic virtio_scsi ata_piix floppy crc32c_intel libata scsi_mod virtio_pci virtio_ring e1000 virtio [last unloaded: btrfs]
[141812.036790] CPU: 3 PID: 845 Comm: fdm-stress Tainted: G    B   W       4.12.3-btrfs-next-52+ #1
[141812.036790] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[141812.036790] task: ffff8801e6694180 task.stack: ffffc90009004000
[141812.036790] RIP: 0010:assfail.constprop.18+0x1c/0x1e [btrfs]
[141812.036790] RSP: 0018:ffffc90009007bc0 EFLAGS: 00010282
[141812.036790] RAX: 0000000000000046 RBX: ffff88017512c008 RCX: 0000000000000001
[141812.036790] RDX: ffff88023fd95201 RSI: ffffffff8182264c RDI: 00000000ffffffff
[141812.036790] RBP: ffffc90009007bc0 R08: 0000000000000001 R09: 0000000000000001
[141812.036790] R10: 0000000000001000 R11: ffffffff82f5a0c9 R12: ffff88014e5947e8
[141812.036790] R13: 00000000000b4000 R14: ffff8801b234d008 R15: 0000000000000000
[141812.036790] FS:  00007fdba6ffd700(0000) GS:ffff88023fd80000(0000) knlGS:0000000000000000
[141812.036790] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[141812.036790] CR2: 00007fdb9c000010 CR3: 000000016efa2000 CR4: 00000000001406e0
[141812.036790] Call Trace:
[141812.036790]  btrfs_log_inode+0x9f0/0xd3d [btrfs]
[141812.036790]  ? __mutex_lock+0x120/0x3ce
[141812.036790]  btrfs_log_inode_parent+0x224/0x685 [btrfs]
[141812.036790]  ? lock_acquire+0x16b/0x1af
[141812.036790]  btrfs_log_dentry_safe+0x60/0x7b [btrfs]
[141812.036790]  btrfs_sync_file+0x32e/0x3f8 [btrfs]
[141812.036790]  vfs_fsync_range+0x8a/0x9d
[141812.036790]  vfs_fsync+0x1c/0x1e
[141812.036790]  do_fsync+0x31/0x4a
[141812.036790]  SyS_fdatasync+0x13/0x17
[141812.036790]  entry_SYSCALL_64_fastpath+0x18/0xad
[141812.036790] RIP: 0033:0x7fdbac41a47d
[141812.036790] RSP: 002b:00007fdba6ffce30 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[141812.036790] RAX: ffffffffffffffda RBX: ffffffff81092c9f RCX: 00007fdbac41a47d
[141812.036790] RDX: 0000004cf0160a40 RSI: 0000000000000000 RDI: 0000000000000006
[141812.036790] RBP: ffffc90009007f98 R08: 0000000000000000 R09: 0000000000000010
[141812.036790] R10: 00000000000002e8 R11: 0000000000000293 R12: ffffffff8110cd90
[141812.036790] R13: ffffc90009007f78 R14: 0000000000000000 R15: 0000000000000000
[141812.036790]  ? time_hardirqs_off+0x9/0x14
[141812.036790]  ? trace_hardirqs_off_caller+0x1f/0xa3
[141812.036790] Code: c7 d6 61 6b a0 48 89 e5 e8 ba ef a8 e0 0f 0b 55 89 f1 48 c7 c2 6d 65 6b a0 48 89 fe 48 c7 c7 81 65 6b a0 48 89 e5 e8 9c ef a8 e0 <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89
[141812.036790] RIP: assfail.constprop.18+0x1c/0x1e [btrfs] RSP: ffffc90009007bc0
[141812.084448] ---[ end trace 44e472684c7a32cc ]---

Which happens because the code that logs a trailing hole when the no-holes
feature is enabled, did not consider that a compressed inline extent can
represent a range with a size matching the sector size, in which case
expanding the inode's i_size, through a truncate operation, won't lead
to padding with zeroes the page that represents the inline extent, and
therefore the inline extent remains after the truncation.

Fix this by adapting the assertion to accept inline extents representing
data with a sector size length if, and only if, the inline extents are
compressed.

A sample and trivial reproducer (for systems with a 4K page size) for this
issue:

  mkfs.btrfs -O no-holes -f /dev/sdc
  mount -o compress /dev/sdc /mnt
  xfs_io -f -c "pwrite -S 0xab 0 4K" /mnt/foobar
  sync
  xfs_io -c "truncate 32K" /mnt/foobar
  xfs_io -c "fsync" /mnt/foobar

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Filipe Manana
4a4b964f42 Btrfs: avoid unnecessarily locking inode when clearing a range
If the range being cleared was not marked for defrag and we are not
about to clear the range from the defrag status, we don't need to
lock and unlock the inode.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Colin Ian King
938e1c77f8 btrfs: remove redundant check on ret being non-zero
The error return variable ret is initialized to zero and then is
checked to see if it is non-zero in the if-block that follows it.
It is therefore impossible for ret to be non-zero after the if-block
hence the check is redundant and can be removed.

Detected by CoverityScan, CID#1021040 ("Logically dead code")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Nikolay Borisov
2d77ab3cfb btrfs: expose internal free space tree routine only if sanity tests are enabled
The internal free space tree management routines are always exposed for
testing purposes. Make them dependent on SANITY_TESTS being on so that
they are exposed only when they really have to.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Nikolay Borisov
db7c942ce8 btrfs: Remove unused sectorsize variable from struct map_lookup
This variable was added in 1abe9b8a13 ("Btrfs: add initial tracepointi
support for btrfs"), yet it never really got used, only assigned to. So
let's remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Nikolay Borisov
92ac58ec99 btrfs: Remove never-reached WARN_ON
We have a WARN_ON(!var) inside an if branch which is executed (among
others) only when var is true.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Anand Jain
dc2f29212a btrfs: remove unused BTRFS_COMPRESS_LAST
We aren't using this define, so removing it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Anand Jain
44880fdc91 btrfs: use appropriate define for the fsid
Though BTRFS_FSID_SIZE and BTRFS_UUID_SIZE are of the same size, we
should use the matching constant for the fsid buffer.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:29 +02:00
Josef Bacik
42e9cc46fb btrfs: increase ctx->pos for delayed dir index
Our dir_context->pos is supposed to hold the next position we're
supposed to look.  If we successfully insert a delayed dir index we
could end up with a duplicate entry because we don't increase ctx->pos
after doing the dir_emit.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18 16:36:20 +02:00
Kees Cook
c71b02e4d2 Revert "pstore: Honor dmesg_restrict sysctl on dmesg dumps"
This reverts commit 68c4a4f8ab, with
various conflict clean-ups.

The capability check required too much privilege compared to simple DAC
controls. A system builder was forced to have crash handler processes
run with CAP_SYSLOG which would give it the ability to read (and wipe)
the _current_ dmesg, which is much more access than being given access
only to the historical log stored in pstorefs.

With the prior commit to make the root directory 0750, the files are
protected by default but a system builder can now opt to give access
to a specific group (via chgrp on the pstorefs root directory) without
being forced to also give away CAP_SYSLOG.

Suggested-by: Nick Kralevich <nnk@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
2017-08-17 16:29:19 -07:00
Kees Cook
d7caa33687 pstore: Make default pstorefs root dir perms 0750
Currently only DMESG and CONSOLE record types are protected, and it isn't
obvious that they are using a capability check. Instead switch to explicit
root directory mode of 0750 to keep files private by default. This will
allow the removal of the capability check, which was non-obvious and
forces a process to have possibly too much privilege when simple post-boot
chgrp for readers would be possible without it.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
2017-08-17 16:28:37 -07:00
Jan Kara
7b9ca4c61b quota: Reduce contention on dq_data_lock
dq_data_lock is currently used to protect all modifications of quota
accounting information, consistency of quota accounting on the inode,
and dquot pointers from inode. As a result contention on the lock can be
pretty heavy.

Reduce the contention on the lock by protecting quota accounting
information by a new dquot->dq_dqb_lock and consistency of quota
accounting with inode usage by inode->i_lock.

This change reduces time to create 500000 files on ext4 on ramdisk by 50
different processes in separate directories by 6% when user quota is
turned on. When those 50 processes belong to 50 different users, the
improvement is about 9%.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 22:07:59 +02:00
Jan Kara
f4a8116a4c fs: Provide __inode_get_bytes()
Provide helper __inode_get_bytes() which assumes i_lock is already
acquired. Quota code will need this to be able to use i_lock to protect
consistency of quota accounting information and inode usage.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 22:06:03 +02:00
Jan Kara
3ab167d2ba quota: Inline dquot_[re]claim_reserved_space() into callsite
dquot_claim_reserved_space() and dquot_reclaim_reserved_space() have
only a single callsite. Inline them there.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 22:03:03 +02:00
Jan Kara
a478e522e3 quota: Inline inode_{incr,decr}_space() into callsites
inode_incr_space() and inode_decr_space() have only two callsites.
Inline them there as that will make locking changes simpler.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 22:01:14 +02:00
Jan Kara
0ed60de34a quota: Inline functions into their callsites
inode_add_rsv_space() and inode_sub_rsv_space() had only one callsite.
Inline them there directly. inode_claim_rsv_space() and
inode_reclaim_rsv_space() had two callsites so inline them there as
well. This will simplify further locking changes.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 22:00:59 +02:00
Jan Kara
91389240a2 ext4: Disable dirty list tracking of dquots when journalling quotas
When journalling quotas, we writeback all dquots immediately after
changing them as part of current transation. Thus there's no need to
write anything in dquot_writeback_dquots() and so we can avoid updating
list of dirty dquots to reduce dq_list_lock contention.

This change reduces time to create 500000 files on ext4 on ramdisk by 50
different processes in separate directories by 15% when user quota is
turned on.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 22:00:54 +02:00
Jan Kara
834057bf84 quota: Allow disabling tracking of dirty dquots in a list
Filesystems that are journalling quotas generally don't need tracking of
dirty dquots in a list since forcing a transaction commit flushes all
quotas anyway. Allow filesystem to say it doesn't want dquots to be
tracked as it reduces contention on the dq_list_lock.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 22:00:45 +02:00
Jan Kara
503330f382 quota: Remove dq_wait_unused from dquot
Currently every dquot carries a wait_queue_head_t used only when we are
turning quotas off to wait for last users to drop dquot references.
Since such rare case is not performance sensitive in any means, just use
a global waitqueue for this and save space in struct dquot. Also convert
the logic to use wait_event() instead of open-coding it.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 22:00:40 +02:00
Jan Kara
1e0b7cb062 quota: Move locking into clear_dquot_dirty()
Move locking of dq_list_lock into clear_dquot_dirty(). It makes the
function more self-contained and will simplify our life later.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 22:00:24 +02:00
Jan Kara
4580b30ea8 quota: Do not dirty bad dquots
Currently we mark dirty even dquots that are not active (i.e.,
initialization or reading failed for them). Thus later we have to check
whether dirty dquot is really active and just clear the dirty bit if
not. Avoid this complication by just never marking non-active dquot as
dirty.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 22:00:07 +02:00
Jan Kara
15512377bd quota: Fix possible corruption of dqi_flags
dqi_flags modifications are protected by dq_data_lock. However the
modifications in vfs_load_quota_inode() and in mark_info_dirty() were
not which could lead to corruption of dqi_flags. Since modifications to
dqi_flags are rare, this is hard to observe in practice but in theory it
could happen. Fix the problem by always using dq_data_lock for
protection.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 22:00:04 +02:00
Darrick J. Wong
77aff8c764 xfs: don't leak quotacheck dquots when cow recovery
If we fail a mount on account of cow recovery errors, it's possible that
a previous quotacheck left some dquots in memory.  The bailout clause of
xfs_mountfs forgets to purge these, and so we leak them.  Fix that.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-08-17 12:40:33 -07:00
Darrick J. Wong
8204f8ddaa xfs: clear MS_ACTIVE after finishing log recovery
Way back when we established inode block-map redo log items, it was
discovered that we needed to prevent the VFS from evicting inodes during
log recovery because any given inode might be have bmap redo items to
replay even if the inode has no link count and is ultimately deleted,
and any eviction of an unlinked inode causes the inode to be truncated
and freed too early.

To make this possible, we set MS_ACTIVE so that inodes would not be torn
down immediately upon release.  Unfortunately, this also results in the
quota inodes not being released at all if a later part of the mount
process should fail, because we never reclaim the inodes.  So, set
MS_ACTIVE right before we do the last part of log recovery and clear it
immediately after we finish the log recovery so that everything
will be torn down properly if we abort the mount.

Fixes: 17c12bcd30 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-08-17 12:40:33 -07:00
Jan Kara
f98bbe37ae quota: Propagate ->quota_read errors from v2_read_file_info()
Currently we return -EIO on any error (or short read) from
->quota_read() while reading quota info. Propagate the error code
instead.

Suggested-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 19:20:28 +02:00
Jan Kara
cb8d01b4f6 quota: Fix error codes in v2_read_file_info()
v2_read_file_info() returned -1 instead of proper error codes on error.
Luckily this is not easily visible from userspace as we have called
->check_quota_file shortly before and thus already verified the quota
file is sane. Still set the error codes to proper values.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 19:18:11 +02:00
Jan Kara
42fdb8583d quota: Push dqio_sem down to ->read_file_info()
Push down acquisition of dqio_sem into ->read_file_info() callback. This
is for consistency with other operations and it also allows us to get
rid of an ugliness in OCFS2.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 19:16:24 +02:00
Jan Kara
9a8ae30e73 quota: Push dqio_sem down to ->write_file_info()
Push down acquisition of dqio_sem into ->write_file_info() callback.
Mostly for consistency with other operations.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 19:11:23 +02:00
Jan Kara
f14618c682 quota: Push dqio_sem down to ->get_next_id()
Push down acquisition of dqio_sem into ->get_next_id() callback. Mostly
for consistency with other operations.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 19:05:16 +02:00
Jan Kara
b9a1a7f4b6 quota: Push dqio_sem down to ->release_dqblk()
Push down acquisition of dqio_sem into ->release_dqblk() callback. It
will allow quota formats to decide whether they need it or not.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 19:04:01 +02:00
Jan Kara
f0c5bae5cc quota: Remove locking for writing to the old quota format
The old quota quota format has fixed offset in quota file based on ID so
there's no locking needed against concurrent modifications of the file
(locking against concurrent IO on the same dquot is still provided by
dq_lock).

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 19:03:44 +02:00
Jan Kara
d2faa41516 quota: Do not acquire dqio_sem for dquot overwrites in v2 format
When dquot has space already allocated in a quota file, we just
overwrite that place when writing dquot. So we don't need any protection
against other modifications of quota file as these keep dquot in place.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 19:03:16 +02:00
Jan Kara
8fc32c2b0d quota: Push dqio_sem down to ->write_dqblk()
Push down acquisition of dqio_sem into ->write_dqblk() callback. It will
allow quota formats to decide whether they need it or not.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 19:03:09 +02:00
Jan Kara
47cdc11dee quota: Remove locking for reading from the old quota format
The old quota format has fixed offset in quota file based on ID so
there's no locking needed against concurrent modifications of the file
(locking against concurrent IO on the same dquot is still provided by
dq_lock).

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 19:01:16 +02:00
Jan Kara
e342e38df9 quota: Push dqio_sem down to ->read_dqblk()
Push down acquisition of dqio_sem into ->read_dqblk() callback. It will
allow quota formats to decide whether they need it or not.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 19:01:00 +02:00
Jan Kara
5e8cb9b624 quota: Protect dquot writeout with dq_lock
Currently dquot writeout is only protected by dqio_sem held for writing.
As we transition to a finer grained locking we will use dquot->dq_lock
instead. So acquire it in dquot_commit() and move dqio_sem just around
->commit_dqblk() call as it is still needed to serialize quota file
changes.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 19:00:50 +02:00
Jan Kara
d6ab366102 quota: Acquire dqio_sem for reading in vfs_load_quota_inode()
vfs_load_quota_inode() needs dqio_sem only for reading. In fact dqio_sem
is not needed there at all since the function can be called only during
quota on when quota file cannot be modified but let's leave the
protection there since it is logical and the path is in no way
performance critical.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 18:59:04 +02:00
Jan Kara
0cff9151d3 quota: Acquire dqio_sem for reading in dquot_get_next_id()
dquot_get_next_id() needs dqio_sem only for reading to protect against
racing with modification of quota file structure.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 18:58:44 +02:00
Jan Kara
62676838cb quota: Do more fine-grained locking in dquot_acquire()
We need dqio_sem held just for reading when calling ->read_dqblk() in
dquot_acquire(). Also dqio_sem is not needed when setting DQ_READ_B and
DQ_ACTIVE_B as concurrent reads and dquot activations are serialized by
dq_lock. So acquire and release dqio_sem closer to the place where it is
needed. This reduces lock hold time and will make locking changes
easier.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 18:56:07 +02:00
Jan Kara
bc8230ee8e quota: Convert dqio_mutex to rwsem
Convert dqio_mutex to rwsem and call it dqio_sem. No functional changes
yet.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 18:52:48 +02:00
Linus Torvalds
99f781b1bf Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota fix from Jan Kara:
 "A fix of a check for quota limit"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: correct space limit check
2017-08-17 09:26:10 -07:00
Linus Torvalds
c8c03f1858 pty: fix the cached path of the pty slave file descriptor in the master
Christian Brauner reported that if you use the TIOCGPTPEER ioctl() to
get a slave pty file descriptor, the resulting file descriptor doesn't
look right in /proc/<pid>/fd/<fd>.  In particular, he wanted to use
readlink() on /proc/self/fd/<fd> to get the pathname of the slave pty
(basically implementing "ptsname{_r}()").

The reason for that was that we had generated the wrong 'struct path'
when we create the pty in ptmx_open().

In particular, the dentry was correct, but the vfsmount pointed to the
mount of the ptmx node. That _can_ be correct - in case you use
"/dev/pts/ptmx" to open the master - but usually is not.  The normal
case is to use /dev/ptmx, which then looks up the pts/ directory, and
then the vfsmount of the ptmx node is obviously the /dev directory, not
the /dev/pts/ directory.

We actually did have the right vfsmount available, but in the wrong
place (it gets looked up in 'devpts_acquire()' when we get a reference
to the pts filesystem), and so ptmx_open() used the wrong mnt pointer.

The end result of this confusion was that the pty worked fine, but when
if you did TIOCGPTPEER to get the slave side of the pty, end end result
would also work, but have that dodgy 'struct path'.

And then when doing "d_path()" on to get the pathname, the vfsmount
would not match the root of the pts directory, and d_path() would return
an empty pathname thinking that the entry had escaped a bind mount into
another mount.

This fixes the problem by making devpts_acquire() return the vfsmount
for the pts filesystem, allowing ptmx_open() to trivially just use the
right mount for the pts dentry, and create the proper 'struct path'.

Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-17 09:10:48 -07:00
Oleg Nesterov
01578e3616 x86/elf: Remove the unnecessary ADDR_NO_RANDOMIZE checks
The ADDR_NO_RANDOMIZE checks in stack_maxrandom_size() and
randomize_stack_top() are not required.

PF_RANDOMIZE is set by load_elf_binary() only if ADDR_NO_RANDOMIZE is not
set, no need to re-check after that.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: stable@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20170815154011.GB1076@redhat.com
2017-08-16 20:32:02 +02:00
Markus Elfring
b5f5245491 fs-udf: Delete an error message for a failed memory allocation in two functions
Omit an extra message for a memory allocation failure in these functions.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-16 16:43:23 +02:00
Markus Elfring
033c9da008 fs-udf: Improve six size determinations
Replace the specification of data structures by variable references
as the parameter for the operator "sizeof" to make the corresponding size
determination a bit safer according to the Linux coding style convention.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-16 16:42:03 +02:00
Markus Elfring
ba2eb866a8 fs-udf: Adjust two checks for null pointers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The script “checkpatch.pl” pointed information out like the following.

Comparison to NULL could be written !…

Thus fix affected source code places.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-16 16:38:54 +02:00
Josef Bacik
23b5ec7494 btrfs: fix readdir deadlock with pagefault
Readdir does dir_emit while under the btree lock.  dir_emit can trigger
the page fault which means we can deadlock.  Fix this by allocating a
buffer on opening a directory and copying the readdir into this buffer
and doing dir_emit from outside of the tree lock.

Thread A
readdir  <holding tree lock>
  dir_emit
    <page fault>
      down_read(mmap_sem)

Thread B
mmap write
  down_write(mmap_sem)
    page_mkwrite
      wait_ordered_extents

Process C
finish_ordered_extent
  insert_reserved_file_extent
   try to lock leaf <hang>

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copy the deadlock scenario to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
Nikolay Borisov
8d8aafeea2 btrfs: Simplify math in should_alloc chunk
Currently should_alloc_chunk uses ->total_bytes - ->bytes_readonly to
signify the total amount of bytes in this space info. However, given
Jeff's patch which adds bytes_pinned and bytes_may_use to the calculation
of num_allocated it becomes a lot more clear to just eliminate num_bytes
altogether and add the bytes_readonly to the amount of used space. That
way we don't change the results of the following statements. In the
process also start using btrfs_space_info_used.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
Jeff Mahoney
f44d2287d2 btrfs: account for pinned bytes in should_alloc_chunk
In a heavy write scenario, we can end up with a large number of pinned bytes.
This can translate into (very) premature ENOSPC because pinned bytes
must be accounted for when allowing a reservation but aren't accounted for
when deciding whether to create a new chunk.

This patch adds the accounting to should_alloc_chunk so that we can
create the chunk.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
David Sterba
a7164fa4e0 btrfs: prepare for extensions in compression options
This is a minimal patch intended to be backported to older kernels.
We're going to extend the string specifying the compression method and
this would fail on kernels before that change (the string is compared
exactly).

Relax the string matching only to the prefix, ie. ignoring anything that
goes after "zlib" or "lzo", regardless of th format extension we decide
to use. This applies to the mount options and properties.

That way, patched old kernels could be booted on systems already
utilizing the new compression spec.

Applicable since commit 63541927c8, v3.14.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
David Sterba
1e20d1c45f btrfs: allow defrag compress to override NOCOMPRESS attribute
Currently, the BTRFS_INODE_NOCOMPRESS will prevent any compression on a
given file, except when the mount is force-compress. As users have
reported on IRC, this will also prevent compression when requested by
defrag (btrfs fi defrag -c file).

The nocompress flag is set automatically by filesystem when the ratios
are bad and the user would have to manually drop the bit in order to
make defrag -c work. This is not good from the usability perspective.

This patch will raise priority for the defrag -c over nocompress, ie.
any file with NOCOMPRESS bit set will get defragmented. The bit will
remain untouched.

Alternate option was to also drop the nocompress bit and keep the
decision logic as is, but I think this is not the right solution.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
David Sterba
1e2ef46d89 btrfs: defrag: cleanup checking for compression status
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
David Sterba
eec63c65dc btrfs: separate defrag and property compression
Add new value for compression to distinguish between defrag and
property. Previously, a single variable was used and this caused clashes
when the per-file 'compression' was set and a defrag -c was called.

The property-compression is loaded when the file is open, defrag will
overwrite the same variable and reset to 0 (ie. NONE) at when the file
defragmentaion is finished. That's considered a usability bug.

Now we won't touch the property value, use the defrag-compression. The
precedence of defrag is higher than for property (and whole-filesystem).

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
David Sterba
b52aa8c93e btrfs: rename variable holding per-inode compression type
This is preparatory for separating inode compression requested by defrag
and set via properties. This will fix a usability bug when defrag will
reset compression type to NONE. If the file has compression set via
property, it will not apply anymore (until next mount or reset through
command line).

We're going to fix that by adding another variable just for the defrag
call and won't touch the property. The defrag will have higher priority
when deciding whether to compress the data.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:05 +02:00
Timofey Titovets
c2fcdcdf36 Btrfs: add skeleton code for compression heuristic
Add skeleton code for compresison heuristics. Now it iterates over all
the pages, but in the end always says "yes, compress please", ie it does
not change the current behaviour.

In the future we're going to add various heuristics to analyze the data.
This patch can be used as a baseline for measuring if the effectivness
and performance.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhanced changelog, modified comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
131ce4367a btrfs: account that we're waiting for IO in scrub_submit_raid56_bio_wait
Correctly account for IO when waiting for a submitted bio in scrub. This
only for the accounting purposes and should not change other behaviour.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
9c17f6cda1 btrfs: account that we're waiting for DIO read
Correctly account for IO when waiting for a submitted DIO read, the case
when we're retrying.  This only for the accounting purposes and should
not change other behaviour.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
4958aa6821 btrfs: drop chunk locks at the end of close_ctree
The pinned chunks might be left over so we clean them but at this point
of close_ctree, there's noone to race with, the locking can be removed.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
d3c0bab563 btrfs: remove trivial wrapper btrfs_force_ra
It's a simple call page_cache_sync_readahead, same arguments in the same
order.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
35dc313046 btrfs: drop ancient page flag mappings
There's no PageFsMisc. Added by patch 4881ee5a2e in 2008, the flag is
not present in current kernels.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
ea14b57fd1 btrfs: fix spelling of snapshotting
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
Nikolay Borisov
e38ae7a086 btrfs: Make flush_space return void
The return value of flush_space was used to have significance in the
early days when the code was first introduced and before the ticketed
enospc rework. Since the latter got introduced the return value lost any
significance whatsoever to its callers. So let's remove it. While at it
also remove the unused ticket variable in
btrfs_async_reclaim_metadata_space. It was used in the initial version
of the ticketed ENOSPC work, however Wang Xiaoguang detected a problem
with this and fixed it in ce129655c9 ("btrfs: introduce tickets_id to
determine whether asynchronous metadata reclaim work makes progress").

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
Nikolay Borisov
3558d4f88e btrfs: Deprecate userspace transaction ioctls
Userspace transactions were introduced in commit 6bf13c0cc8 ("Btrfs:
transaction ioctls") to provide semantics that Ceph's object store
required. However, things have changed significantly since then, to the
point where btrfs is no longer suitable as a backend for ceph and in
fact it's actively advised against such usages. Considering this, there
doesn't seem to be a widespread, legit use case of userspace
transaction. They also clutter the file->private pointer.

So to end the agony let's nuke the userspace transaction ioctls. As a
first step let's give time for people to voice their objection by just
WARN()ining when the userspace transaction is used.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ move the warning past perm checks, keep the has-been-printed state;
  we're ok with just one warning over all filesystems ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
9f6d251033 btrfs: use named constant for bdev blocksize
Superblock is read and written using buffer heads, we need to set the
bdev blocksize. The magic constant has been hardcoded in several places,
so replace it with a named constant.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
abbb3b8ebf btrfs: split write_dev_supers to two functions
There are two independent parts, one that writes the superblocks and
another that waits for completion. No functional changes, but cleanups,
reformatting and comment updates.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
35c70103a5 btrfs: refactor find_device helper
Polish the helper:
* drop underscores, no special meaning here
* pass fs_devices, as this is what the API implements
* drop noinline, no apparent reason for such simple helper
* constify uuid
* add comment

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
2dfeca9bfb btrfs: merge alloc_device helpers
There are two helpers called in chain from one location, we can merge the
functionaliy.

Originally, alloc_fs_devices could fill the device uuid randomly if we
we didn't give the uuid buffer. This happens for seed devices but the
fsid is generated in btrfs_prepare_sprout, so we can remove it.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
4b81ba48c6 btrfs: merge REQ_OP and REQ_ flags to one parameter in submit_extent_page
The function submit_extent_page has 15(!) parameters right now, op and
op_flags are effectively one value stored to bio::bi_opf, no need to
pass them separately. So it's 14 parameters now.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
f1c77c55cd btrfs: cleanup types storing REQ_*
Unify types of local variables and parameters that store various
REQ_* values to unsigned int.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:04 +02:00
David Sterba
abe60ba45c btrfs: get fs_info from eb in btrfs_print_tree, remove argument
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
David Sterba
a4f78750ef btrfs: get fs_info from eb in btrfs_print_leaf, remove argument
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
David Sterba
f1b8a1e8c0 btrfs: simplify btrfs_dev_replace_kthread
This function prints an informative message and then continues
dev-replace. The message contains a progress percentage which is read
from the status. The status is allocated dynamically, about 2600 bytes,
just to read the single value. That's an overkill. We'll use the new
helper and drop the allocation.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
David Sterba
74b595fe67 btrfs: factor reading progress out of btrfs_dev_replace_status
We'll want to read the percentage value from dev_replace elsewhere, move
the logic to a separate helper.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
David Sterba
0a52d10808 btrfs: defrag: make readahead state allocation failure non-fatal
All sorts of readahead errors are not considered fatal. We can continue
defragmentation without it, with some potential slow down, which will
last only for the current inode.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
David Sterba
63e727ecd2 btrfs: use GFP_KERNEL in btrfs_defrag_file
We can safely use GFP_KERNEL, the function is called from two contexts:

- ioctl handler, called directly, no locks taken
- cleaner thread, running all queued defrag work, outside of any locks

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
David Sterba
3ec8362111 btrfs: use GFP_KERNEL in mount and remount
We don't need to restrict the allocation flags in btrfs_mount or
_remount. No big filesystem locks are held (possibly s_umount but that
does no count here).

Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
Nikolay Borisov
e3f3ad1268 btrfs: Remove never reached error handling code in __add_reloc_root
One of the error handling paths in __add_reloc_root contains btrfs_panic()
followed by some other code. As the name implies what it does is print
some error message and call BUG, naturally what follow afterwards is not
invoked. So remove this extra code.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
Nikolay Borisov
e4ff5fb5dc btrfs: Remove unused parameters from volume.c functions
This also adjusts the respective callers in other files. Those were
found with -Wunused-parameter.

btrfs_full_stripe_len's mapping_tree - introduced by 53b381b3ab
("Btrfs: RAID5 and RAID6") but it was never really used even in that
commit

btrfs_is_parity_mirror's mirror_num - same as above

chunk_drange_filter's chunk_offset - introduced by 94e60d5a5c ("Btrfs:
devid subset filter") and never used.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
Nikolay Borisov
110840bb62 btrfs: Remove unused variables
clear_super - usage was removed in commit cea67ab92d ("btrfs: clean
the old superblocks before freeing the device") but that change forgot
to remove the actual variable.

max_key - commit 6174d3cb43 ("Btrfs: remove unused max_key arg from
btrfs_search_forward") removed the max_key parameter but it forgot to
remove references from callers.

stripe_len - this one was added by e06cd3dd7c ("Btrfs: add validadtion
checks for chunk loading") but even then it wasn't used.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
Nikolay Borisov
500ceed807 btrfs: Remove find_raid56_stripe_len
find_raid56_stripe_len statically returns SZ_64K which equals BTRFS_STRIPE_LEN.
It's sole caller is __btrfs_alloc_chunk and it assigns the return value to ai
variable which is already set to BTRFS_STRIPE_LEN. So remove the function
invocation altogether and remove the function itself. Also remove the variable
since it's only aliasing BTRFS_STRIPE_LEN and use the define directly. Use
the occassion to simplify the rounding down of stripe_size now that the value
we want it to align is a power of 2.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
Nikolay Borisov
47f08b9699 btrfs: Use explicit round_down macro in btrfs resize ioctl handler
No functional changes, just make the code more self-explanatory.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:03 +02:00
Anand Jain
19aee8dea3 btrfs: btrfs_inherit_iflags() can be static
btrfs_new_inode() is the only consumer move it to inode.c,
from ioctl.c.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Nick Terrell
26b28dce50 btrfs: Keep one more workspace around
find_workspace() allocates up to num_online_cpus() + 1 workspaces.
free_workspace() will only keep num_online_cpus() workspaces. When
(de)compressing we will allocate num_online_cpus() + 1 workspaces, then
free one, and repeat. Instead, we can just keep num_online_cpus() + 1
workspaces around, and never have to allocate/free another workspace in the
common case.

I tested on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. I mounted a
BtrFS partition with -o compress-force={lzo,zlib,zstd} and logged whenever
a workspace was allocated of freed. Then I copied vmlinux (527 MB) to the
partition. Before the patch, during the copy it would allocate and free 5-6
workspaces. After, it only allocated the initial 3. This held true for lzo,
zlib, and zstd. The time it took to execute cp vmlinux /mnt/btrfs && sync
dropped from 1.70s to 1.44s with lzo compression, and from 2.04s to 1.80s
for zstd compression.

Signed-off-by: Nick Terrell <terrelln@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
David Sterba
913e153572 btrfs: drop newlines from strings when using btrfs_* helpers
The helpers append "\n" so we can keep the actual strings shorter. The
extra newline will print an empty line.  Some messages have been
slightly modified to be more consistent with the rest (lowercase first
letter).

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Nikolay Borisov
b6e6bca51e btrfs: qgroups: Fix BUG_ON condition in tree level check
The current code was erroneously checking for
root_level > BTRFS_MAX_LEVEL. If we had a root_level of 8 then the check
won't trigger and we could potentially hit a buffer overflow. The
correct check should be root_level >= BTRFS_MAX_LEVEL .

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Qu Wenruo
c550245148 btrfs: Enhance message when a device is missing during mount
For a missing device, btrfs will just refuse to mount with almost
meaningless kernel message like:

 BTRFS info (device vdb6): disk space caching is enabled
 BTRFS info (device vdb6): has skinny extents
 BTRFS error (device vdb6): failed to read the system array: -5
 BTRFS error (device vdb6): open_ctree failed

This patch will print a new message about the missing device:

 BTRFS info (device vdb6): disk space caching is enabled
 BTRFS info (device vdb6): has skinny extents
 BTRFS warning (device vdb6): devid 2 uuid 80470722-cad2-4b90-b7c3-fee294552f1b is missing
 BTRFS error (device vdb6): failed to read the system array: -5
 BTRFS error (device vdb6): open_ctree failed

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Qu Wenruo
bc3cce2378 btrfs: Cleanup num_tolerated_disk_barrier_failures
As we use per-chunk degradable check, the global
num_tolerated_disk_barrier_failures is of no use.

We can now remove it.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Qu Wenruo
d10b82fe29 btrfs: Allow barrier_all_devices to do chunk level device check
The last user of num_tolerated_disk_barrier_failures is
barrier_all_devices().
But it can be easily changed to the new per-chunk degradable check
framework.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Qu Wenruo
b382cfe889 btrfs: Do chunk level check for degraded remount
Just the same for mount time check, use btrfs_check_rw_degradable() to
check if we are OK to be remounted rw.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Qu Wenruo
4330e183c9 btrfs: Do chunk level check for degraded rw mount
Now use the btrfs_check_rw_degradable() to check if we can mount in the
degraded mode.

With this patch, we can mount in the following case:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdc
 # mount /dev/sdb /mnt/btrfs -o degraded
 As the single data chunk is only on sdb, so it's OK to mount as
 degraded, as missing one device is OK for RAID1.

But still fail in the following case as expected:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdb
 # mount /dev/sdc /mnt/btrfs -o degraded
 As the data chunk is only in sdb, so it's not OK to mount it as
 degraded.

Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reported-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Qu Wenruo
21634a19f6 btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.

It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.

Btrfs currently uses num_tolerated_disk_barrier_failures to do global
check for tolerated missing device.

Although the one-size-fit-all solution is quite safe, it's too strict
if data and metadata has different duplication level.

For example, if one use Single data and RAID1 metadata for 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.

But in fact, some times all single chunks may be in the existing
device and in that case, we should allow it to be rw degraded mounted.

Such case can be easily reproduced using the following script:
 # mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
 # wipefs -f /dev/sdc
 # mount /dev/sdb -o degraded,rw

If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only in sdb, so in fact it should allow degraded mount.

This patchset will introduce a new per-chunk degradable check for
btrfs, allow above case to succeed, and it's quite small anyway.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
  solved ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Liu Bo
0d1e0bead6 Btrfs: report errors when checksum is not found
When btrfs fails the checksum check, it'll fill the whole page with
"1".

However, if %csum_expected is 0 (which means there is no checksum), then
for some unknown reason, we just pretend that the read is correct, so
userspace would be confused about the dilemma that read is successful but
getting a page with all content being "1".

This can happen due to a bug in btrfs-convert.

This fixes it by always returning errors if checksum doesn't match.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Nikolay Borisov
69f03f137a btrfs: Prevent possible ERR_PTR() dereference
In btrfs_full_stripe_len/btrfs_is_parity_mirror we have similar code which
gets the chunk map for a particular range via get_chunk_map. However,
get_chunk_map can return an ERR_PTR value and while the 2 callers do catch
this with a WARN_ON they then proceed to indiscriminately dereference the
extent map. This of course leads to a crash. Fix the offenders by making the
dereference conditional on IS_ERR.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Nikolay Borisov
1174cade81 btrfs: Remove redundant checks from btrfs_alloc_data_chunk_ondemand
Many commits ago the data space_info in alloc_data_chunk_ondemand used to be
acquired from the inode. At that point commit
33b4d47f5e ("Btrfs: deal with NULL space info") got introduced to deal with
spurios cases where the space info could be null, following a rebalance.
Nowadays, however, the space info is referenced directly from the btrfs_fs_info
struct which is initialised at filesystem mount time. This makes the null
checks redundant, so remove them.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:02 +02:00
Nikolay Borisov
7bdd6277e0 btrfs: Remove redundant argument of flush_space
All callers of flush_space pass the same number for orig/num_bytes
arguments. Let's remove one of the numbers and also modify the trace
point to show only a single number - bytes requested.

Seems that last point where the two parameters were treated differently
is before the ticketed enospc rework.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
Aleksa Sarai
6c6b5a39c4 btrfs: resume qgroup rescan on rw remount
Several distributions mount the "proper root" as ro during initrd and
then remount it as rw before pivot_root(2). Thus, if a rescan had been
aborted by a previous shutdown, the rescan would never be resumed.

This issue would manifest itself as several btrfs ioctl(2)s causing the
entire machine to hang when btrfs_qgroup_wait_for_completion was hit
(due to the fs_info->qgroup_rescan_running flag being set but the rescan
itself not being resumed). Notably, Docker's btrfs storage driver makes
regular use of BTRFS_QUOTA_CTL_DISABLE and BTRFS_IOC_QUOTA_RESCAN_WAIT
(causing this problem to be manifested on boot for some machines).

Cc: <stable@vger.kernel.org> # v3.11+
Cc: Jeff Mahoney <jeffm@suse.com>
Fixes: b382a324b6 ("Btrfs: fix qgroup rescan resume on mount")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
Edmund Nadolski
01747e92a9 btrfs: clean up extraneous computations in add_delayed_refs
Repeating the same computation in multiple places is not
necessary.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
Edmund Nadolski
3ec4d3238a btrfs: allow backref search checks for shared extents
When called with a struct share_check, find_parent_nodes()
will detect a shared extent and immediately return with
BACKREF_SHARED_FOUND.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
Edmund Nadolski
9dd14fd696 btrfs: add cond_resched() calls when resolving backrefs
Since backref resolution is CPU-intensive, the cond_resched calls
should help alleviate soft lockup occurences.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
Jeff Mahoney
00142756e1 btrfs: backref, add tracepoints for prelim_ref insertion and merging
This patch adds a tracepoint event for prelim_ref insertion and
merging.  For each, the ref being inserted or merged and the count
of tree nodes is issued.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
Jeff Mahoney
6c336b212b btrfs: add a node counter to each of the rbtrees
This patch adds counters to each of the rbtrees so that we can tell
how large they are growing for a given workload.  These counters
will be exported by tracepoints in the next patch.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:12:01 +02:00
Edmund Nadolski
86d5f99442 btrfs: convert prelimary reference tracking to use rbtrees
It's been known for a while that the use of multiple lists
that are periodically merged was an algorithmic problem within
btrfs.  There are several workloads that don't complete in any
reasonable amount of time (e.g. btrfs/130) and others that cause
soft lockups.

The solution is to use a set of rbtrees that do insertion merging
for both indirect and direct refs, with the former converting
refs into the latter.  The result is a btrfs/130 workload that
used to take several hours now takes about half of that. This
runtime still isn't acceptable and a future patch will address that
by moving the rbtrees higher in the stack so the lookups can be
shared across multiple calls to find_parent_nodes.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 16:11:55 +02:00
Colin Ian King
65f2b26339 reiserfs: fix spelling mistake: "tranasction" -> "transaction"
Trivial fix to spelling mistake in reiserfs_warning message

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-16 15:59:47 +02:00
Edmund Nadolski
f6954245d9 btrfs: remove ref_tree implementation from backref.c
Commit afce772e87 ("btrfs: fix check_shared for fiemap ioctl") added
the ref_tree code in backref.c to reduce backref searching for
shared extents under the FIEMAP ioctl. This code will not be
compatible with the upcoming rbtree changes for improved backref
searching, so this patch removes the ref_tree code.  The rbtree
changes will provide the equivalent functionality for FIEMAP.

The above commit also introduced transaction semantics around calls to
btrfs_check_shared() in order to accurately account for delayed refs.
This functionality needs to be retained, so a complete revert of the
above commit is not desirable. This patch therefore removes the
ref_tree portion of the commit as above, however it does not remove
the transaction portion.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
Edmund Nadolski
bb739cf08e btrfs: btrfs_check_shared should manage its own transaction
Commit afce772e87 ("btrfs: fix check_shared for fiemap ioctl") added
transaction semantics around calls to btrfs_check_shared() in order to
provide accurate accounting of delayed refs. The transaction management
should be done inside btrfs_check_shared(), so that callers do not need
to manage transactions individually.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
Jeff Mahoney
e0c476b128 btrfs: backref, cleanup __ namespace abuse
We typically use __ to indicate a helper routine that shouldn't be
called directly without understanding the proper context required
to do so.  We use static functions to indicate that a function is
private to a particular C file.  The backref code uses static
function and __ prefixes on nearly everything, which makes the code
difficult to read and establishes a pattern for future code that
shouldn't be followed.  This patch drops all the unnecessary prefixes.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
Jeff Mahoney
4dae077a83 btrfs: backref, add unode_aux_to_inode_list helper
Replacing the double cast and ternary conditional with a helper makes
the code easier on the eyes.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
Jeff Mahoney
73980becae btrfs: backref, constify some arguments
This constifies a few buffers used in the backref code.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
Jeff Mahoney
9a35b63728 btrfs: constify tracepoint arguments
Tracepoint arguments are all read-only.  If we mark the arguments
as const, we're able to keep or convert those arguments to const
where appropriate.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
Jeff Mahoney
1cbb1f454e btrfs: struct-funcs, constify readers
We have reader helpers for most of the on-disk structures that use
an extent_buffer and pointer as offset into the buffer that are
read-only.  We should mark them as const and, in turn, allow consumers
of these interfaces to mark the buffers const as well.

No impact on code, but serves as documentation that a buffer is intended
not to be modified.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
Nikolay Borisov
23d1f73788 btrfs: remove unused sectorsize member
The sectorsize member of btrfs_block_group_cache is unused. So remove it, this
reduces the number of holes in the struct.

With patch:
/* size: 856, cachelines: 14, members: 40 */
/* sum members: 837, holes: 4, sum holes: 19 */
/* bit holes: 1, sum bit holes: 29 bits */
/* last cacheline: 24 bytes */

Without patch:
/* size: 864, cachelines: 14, members: 41 */
/* sum members: 841, holes: 5, sum holes: 23 */
/* bit holes: 1, sum bit holes: 29 bits */
/* last cacheline: 32 bytes */

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:53 +02:00
Nikolay Borisov
f148ef4d3a btrfs: Be explicit about usage of min()
__btrfs_alloc_chunk contains code which boils down to:

    ndevs = min(ndevs, devs_max)

It's conditional upon devs_max not being 0. However, it cannot really be 0
since it's always set to either BTRFS_MAX_DEVS_SYS_CHUNK or
BTRFS_MAX_DEVS(fs_info->chunk_root). So eliminate the condition check and use
min explicitly. This has no functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:52 +02:00
Nikolay Borisov
e5600fd6fc btrfs: Use explicit round_down call rather than open-coding it
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:52 +02:00
Nikolay Borisov
ebcc9301ea btrfs: convert while loop to list_for_each_entry
No functional changes, just make the loop a bit more readable

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16 14:19:52 +02:00
David S. Miller
463910e2df Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-08-15 20:23:23 -07:00
Chao Yu
b8c502b81e f2fs: fix potential overflow when adjusting GC cycle
While comparing signed and unsigned variables, compiler will converts the
signed value to unsigned one, due to this reason, {in,de}crease_sleep_time
may return overflowed result.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-15 10:40:14 -07:00
Chao Yu
9a20d391cd f2fs: avoid unneeded sync on quota file
We only need to sync quota file with appointed quota type instead of all
types in f2fs_quota_{on,off}.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-15 10:40:13 -07:00
Jaegeuk Kim
d9872a698c f2fs: introduce gc_urgent mode for background GC
This patch adds a sysfs entry to control urgent mode for background GC.
If this is set, background GC thread conducts GC with gc_urgent_sleep_time
all the time.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-15 10:40:12 -07:00
Jaegeuk Kim
3537581a72 f2fs: use IPU for cold files
We expect cold files write data sequentially, but sometimes some of small data
can be updated, which incurs fragmentation.
Let's avoid that.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-15 10:40:11 -07:00
Yunlong Song
008396e1b0 f2fs: fix the size value in __check_sit_bitmap
The current size value is not correct and will miss bitmap check.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-15 10:40:10 -07:00
Thomas Tai
cc1dfa8b75 gfs2: fix slab corruption during mounting and umounting gfs file system
When using cman-3.0.12.1 and gfs2-utils-3.0.12.1, mounting and
unmounting GFS2 file system would cause kernel to hang. The slab
allocator suggests that it is likely a double free memory corruption.
The issue is traced back to v3.9-rc6 where a patch is submitted to
use kzalloc() for storing a bitmap instead of using a local variable.
The intention is to allocate memory during mount and to free memory
during unmount. The original patch misses a code path which has
already freed the memory and caused memory corruption. This patch sets
the memory pointer to NULL after the memory is freed, so that double
free memory corruption will not happen.

gdlm_mount()
  '-- set_recover_size() which use kzalloc()
  '-- if dlm does not support ops callbacks then
          '--- free_recover_size() which use kfree()

gldm_unmount()
  '-- free_recover_size() which use kfree()

Previous patch which introduced the double free issue is
commit 57c7310b8e ("GFS2: use kmalloc for lvb bitmap")

Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
2017-08-15 11:54:09 -05:00
Nick Terrell
5c1aab1dd5 btrfs: Add zstd support
Add zstd compression and decompression support to BtrFS. zstd at its
fastest level compresses almost as well as zlib, while offering much
faster compression and decompression, approaching lzo speeds.

I benchmarked btrfs with zstd compression against no compression, lzo
compression, and zlib compression. I benchmarked two scenarios. Copying
a set of files to btrfs, and then reading the files. Copying a tarball
to btrfs, extracting it to btrfs, and then reading the extracted files.
After every operation, I call `sync` and include the sync time.
Between every pair of operations I unmount and remount the filesystem
to avoid caching. The benchmark files can be found in the upstream
zstd source repository under
`contrib/linux-kernel/{btrfs-benchmark.sh,btrfs-extract-benchmark.sh}`
[1] [2].

I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD.

The first compression benchmark is copying 10 copies of the unzipped
Silesia corpus [3] into a BtrFS filesystem mounted with
`-o compress-force=Method`. The decompression benchmark times how long
it takes to `tar` all 10 copies into `/dev/null`. The compression ratio is
measured by comparing the output of `df` and `du`. See the benchmark file
[1] for details. I benchmarked multiple zstd compression levels, although
the patch uses zstd level 1.

| Method  | Ratio | Compression MB/s | Decompression speed |
|---------|-------|------------------|---------------------|
| None    |  0.99 |              504 |                 686 |
| lzo     |  1.66 |              398 |                 442 |
| zlib    |  2.58 |               65 |                 241 |
| zstd 1  |  2.57 |              260 |                 383 |
| zstd 3  |  2.71 |              174 |                 408 |
| zstd 6  |  2.87 |               70 |                 398 |
| zstd 9  |  2.92 |               43 |                 406 |
| zstd 12 |  2.93 |               21 |                 408 |
| zstd 15 |  3.01 |               11 |                 354 |

The next benchmark first copies `linux-4.11.6.tar` [4] to btrfs. Then it
measures the compression ratio, extracts the tar, and deletes the tar.
Then it measures the compression ratio again, and `tar`s the extracted
files into `/dev/null`. See the benchmark file [2] for details.

| Method | Tar Ratio | Extract Ratio | Copy (s) | Extract (s)| Read (s) |
|--------|-----------|---------------|----------|------------|----------|
| None   |      0.97 |          0.78 |    0.981 |      5.501 |    8.807 |
| lzo    |      2.06 |          1.38 |    1.631 |      8.458 |    8.585 |
| zlib   |      3.40 |          1.86 |    7.750 |     21.544 |   11.744 |
| zstd 1 |      3.57 |          1.85 |    2.579 |     11.479 |    9.389 |

[1] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-benchmark.sh
[2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-extract-benchmark.sh
[3] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
[4] https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.11.6.tar.xz

zstd source repository: https://github.com/facebook/zstd

Signed-off-by: Nick Terrell <terrelln@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-08-15 09:02:09 -07:00
Trond Myklebust
2ce209c42c NFS: Wait for requests that are locked on the commit list
If a request is on the commit list, but is locked, we will currently skip
it, which can lead to livelocking when the commit count doesn't reduce
to zero.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:48 -04:00
Trond Myklebust
8205b9ce03 NFSv4/pnfs: Replace pnfs_put_lseg_locked() with pnfs_put_lseg()
Now that we no longer hold the inode->i_lock when manipulating the
commit lists, it is safe to call pnfs_put_lseg() again.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:48 -04:00
Trond Myklebust
4b9bb25b36 NFS: Switch to using mapping->private_lock for page writeback lookups.
Switch from using the inode->i_lock for this to avoid contention with
other metadata manipulation.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:48 -04:00
Trond Myklebust
5cb953d4b1 NFS: Use an atomic_long_t to count the number of commits
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:48 -04:00
Trond Myklebust
a6b6d5b85a NFS: Use an atomic_long_t to count the number of requests
Rather than forcing us to take the inode->i_lock just in order to bump
the number.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
e824f99ada NFSv4: Use a mutex to protect the per-inode commit lists
The commit lists can get very large, so using the inode->i_lock can
end up affecting general metadata performance.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
b30d2f04c3 NFS: Refactor nfs_page_find_head_request()
Split out the 2 cases so that we can treat the locking differently.
The issue is that the locking in the pageswapcache cache is highly
linked to the commit list locking.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
bd37d6fce1 NFSv4: Convert nfs_lock_and_join_requests() to use nfs_page_find_head_request()
Hide the locking from nfs_lock_and_join_requests() so that we can
separate out the requirements for swapcache pages.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
7e8a30f8b4 NFS: Fix up nfs_page_group_covers_page()
Fix up the test in nfs_page_group_covers_page(). The simplest implementation
is to check that we have a set of intersecting or contiguous subrequests
that connect page offset 0 to nfs_page_length(req->wb_page).

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
1344b7ea17 NFS: Remove unused parameter from nfs_page_group_lock()
nfs_page_group_lock() is now always called with the 'nonblock'
parameter set to 'false'.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
dee83046e7 NFS: Remove unuse function nfs_page_group_lock_wait()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
902a4c0046 NFS: Remove nfs_page_group_clear_bits()
At this point, we only expect ever to potentially see PG_REMOVE and
PG_TEARDOWN being set on the subrequests.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
5b2b5187fa NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race cases
Since nfs_page_group_destroy() does not take any locks on the requests
to be freed, we need to ensure that we don't inadvertently free the
request in nfs_destroy_unlinked_subrequests() while the last reference
is being released elsewhere.

Do this by:

1) Taking a reference to the request unless it is already being freed
2) Checking (under the page group lock) if PG_TEARDOWN is already set before
   freeing an unreferenced request in nfs_destroy_unlinked_subrequests()

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
74a6d4b5ae NFS: Further optimise nfs_lock_and_join_requests()
When locking the entire group in order to remove subrequests,
the locks are always taken in order, and with the page group
lock being taken after the page head is locked. The intention
is that:

1) The lock on the group head guarantees that requests may not
   be removed from the group (although new entries could be appended
   if we're not holding the group lock).
2) It is safe to drop and retake the page group lock while iterating
   through the list, in particular when waiting for a subrequest lock.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
b5bab9bf91 NFS: Reduce inode->i_lock contention in nfs_lock_and_join_requests()
We should no longer need the inode->i_lock, now that we've
straightened out the request locking. The locking schema is now:

1) Lock page head request
2) Lock the page group
3) Lock the subrequests one by one

Note that there is a subtle race with nfs_inode_remove_request() due
to the fact that the latter does not lock the page head, when removing
it from the struct page. Only the last subrequest is locked, hence
we need to re-check that the PagePrivate(page) is still set after
we've locked all the subrequests.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
7e6cca6caf NFS: Remove page group limit in nfs_flush_incompatible()
nfs_try_to_update_request() should be able to cope now.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
f6032f216f NFS: Teach nfs_try_to_update_request() to deal with request page_groups
Simplify the code, and avoid some flushes to disk.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
b66aaa8dfe NFS: Fix the inode request accounting when pages have subrequests
Both nfs_destroy_unlinked_subrequests() and nfs_lock_and_join_requests()
manipulate the inode flags adjusting the NFS_I(inode)->nrequests.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
31a01f093e NFS: Don't unlock writebacks before declaring PG_WB_END
We don't want nfs_lock_and_join_requests() to start fiddling with
the request before the call to nfs_page_group_sync_on_bit().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:46 -04:00
Trond Myklebust
e14bebf6de NFS: Don't check request offset and size without holding a lock
Request offsets and sizes are not guaranteed to be stable unless you
are holding the request locked.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:46 -04:00
Trond Myklebust
a0e265bc78 NFS: Fix an ABBA issue in nfs_lock_and_join_requests()
All other callers of nfs_page_group_lock() appear to already hold the
page lock on the head page, so doing it in the opposite order here
is inefficient, although not deadlock prone since we roll back all
locks on contention.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:46 -04:00
Trond Myklebust
7cb9cd9aa2 NFS: Fix a reference and lock leak in nfs_lock_and_join_requests()
Yes, this is a situation that should never happen (hence the WARN_ON)
but we should still ensure that we free up the locks and references to
the faulty pages.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:46 -04:00
Trond Myklebust
08fead2ae5 NFS: Ensure we always dereference the page head last
This fixes a race with nfs_page_group_sync_on_bit() whereby the
call to wake_up_bit() in nfs_page_group_unlock() could occur after
the page header had been freed.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:46 -04:00
Trond Myklebust
1403390d83 NFS: Reduce lock contention in nfs_try_to_update_request()
Micro-optimisation to move the lockless check into the for(;;) loop.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:46 -04:00
Trond Myklebust
82749dd4ef NFS: Reduce lock contention in nfs_page_find_head_request()
Add a lockless check for whether or not the page might be carrying
an existing writeback before we grab the inode->i_lock.

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:46 -04:00
Trond Myklebust
6d17d653c9 NFS: Simplify page writeback
We don't expect the page header lock to ever be held across I/O, so
it should always be safe to wait for it, even if we're doing nonblocking
writebacks.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:46 -04:00
Trond Myklebust
55cfcd1211 Merge branch 'open_state' 2017-08-15 11:54:13 -04:00
Greg Kroah-Hartman
d985524680 Merge 4.13-rc5 into char-misc-next
We want the firmware, and other changes, in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-14 13:29:31 -07:00
Tahsin Erdogan
32aaf19420 ext4: add missing xattr hash update
When updating an extended attribute, if the padded value sizes are the
same, a shortcut is taken to avoid the bulk of the work. This was fine
until the xattr hash update was moved inside ext4_xattr_set_entry().
With that change, the hash update got missed in the shortcut case.

Thanks to ZhangYi (yizhang089@gmail.com) for root causing the problem.

Fixes: daf8328172 ("ext4: eliminate xattr entry e_hash recalculation for removes")

Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-14 08:30:06 -04:00
Theodore Ts'o
b80b32b6d5 ext4: fix clang build regression
Arnd Bergmann <arnd@arndb.de>

As Stefan pointed out, I misremembered what clang can do specifically,
and it turns out that the variable-length array at the end of the
structure did not work (a flexible array would have worked here
but not solved the problem):

fs/ext4/mballoc.c:2303:17: error: fields must have a constant size:
'variable length array in structure' extension will never be supported
                ext4_grpblk_t counters[blocksize_bits + 2];

This reverts part of my previous patch, using a fixed-size array
again, but keeping the check for the array overflow.

Fixes: 2df2c3402f ("ext4: fix warning about stack corruption")
Reported-by: Stefan Agner <stefan@agner.ch>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-14 08:29:18 -04:00
Trond Myklebust
75e8c48b9e NFSv4: Use the nfs4_state being recovered in _nfs4_opendata_to_nfs4_state()
If we're recovering a nfs4_state, then we should try to use that instead
of looking up a new stateid. Only do that if the inodes match, though.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-13 20:36:15 -04:00
Trond Myklebust
4e2fcac773 NFSv4: Use correct inode in _nfs4_opendata_to_nfs4_state()
When doing open by filehandle we don't really want to lookup a new inode,
but rather update the one we've got. Add a helper which does this for us.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-13 20:36:15 -04:00
Boris Brezillon
d4092d76a4 mtd: nand: Rename nand.h into rawnand.h
We are planning to share more code between different NAND based
devices (SPI NAND, OneNAND and raw NANDs), but before doing that
we need to move the existing include/linux/mtd/nand.h file into
include/linux/mtd/rawnand.h so we can later create a nand.h header
containing all common structure and function prototypes.

Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Signed-off-by: Peter Pan <peterpandong@micron.com>
Acked-by: Vladimir Zapolskiy <vz@mleia.com>
Acked-by: Alexander Sverdlin <alexander.sverdlin@gmail.com>
Acked-by: Wenyou Yang <wenyou.yang@microchip.com>
Acked-by: Krzysztof Kozlowski <krzk@kernel.org>
Acked-by: Han Xu <han.xu@nxp.com>
Acked-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Acked-by: Shawn Guo <shawnguo@kernel.org>
Acked-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Acked-by: Neil Armstrong <narmstrong@baylibre.com>
Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-By: Harvey Hunt <harveyhuntnexus@gmail.com>
Acked-by: Tony Lindgren <tony@atomide.com>
Acked-by: Krzysztof Halasa <khalasa@piap.pl>
2017-08-13 10:11:49 +02:00
Christoph Hellwig
e28ae8e428 iomap: fix integer truncation issues in the zeroing and dirtying helpers
Fix the min_t calls in the zeroing and dirtying helpers to perform the
comparisms on 64-bit types, which prevents them from incorrectly
being truncated, and larger zeroing operations being stuck in a never
ending loop.

Special thanks to Markus Stockhausen for spotting the bug.

Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-11 16:56:33 -07:00
Omar Sandoval
c44245b3d5 xfs: fix inobt inode allocation search optimization
When we try to allocate a free inode by searching the inobt, we try to
find the inode nearest the parent inode by searching chunks both left
and right of the chunk containing the parent. As an optimization, we
cache the leftmost and rightmost records that we previously searched; if
we do another allocation with the same parent inode, we'll pick up the
search where it last left off.

There's a bug in the case where we found a free inode to the left of the
parent's chunk: we need to update the cached left and right records, but
because we already reassigned the right record to point to the left, we
end up assigning the left record to both the cached left and right
records.

This isn't a correctness problem strictly, but it can result in the next
allocation rechecking chunks unnecessarily or allocating inodes further
away from the parent than it needs to. Fix it by swapping the record
pointer after we update the cached left and right records.

Fixes: bd16956599 ("xfs: speed up free inode search")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-11 16:56:33 -07:00
Linus Torvalds
216e4a1def Some more NFS client bugfixes for 4.13
Stable fix:
 - Fix leaking nfs4_ff_ds_version array
 
 Other fixes:
 - Improve TEST_STATEID OLD_STATEID handling to prevent recovery loop
 - Require 64-bit sector_t for pNFS blocklayout to prevent 32-bit compile
 errors
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlmOFIIACgkQ18tUv7Cl
 QOsaUQ/9E7lAP6yYp8HfjIBayN1gcme0ZeGzmWVdP8R9isvqTE0MjrwoNxk7h61H
 La/qUcymE32bMX8qYlDs0mw+yhiTcR/UoP5lS/4FCSUZoQsE6BWXoh+O9QlqEcuE
 mFbA9SV52Pf5Mdc/bTNKyh7jgCjeqzlu2sRo5LUM+N7G/M2a5RPfJVGVNYpOmVs/
 ay30B5tHG/K3eeXECLjFTw3HeMorsS2coTaxtX6RghqPoVF6OFZarMUt69IX3zgg
 jBjokz7YfaPSeOEIOapGGRRARHRBAaPE8TvAtRd45R2pMk+Lr12cFWLjT72wRCCM
 nXrTpJc+q8feje9YpT5yoKtgRnW6etxKM8dtyYrXG1NO+dfZHNIe2Z1ARplhzhV3
 Rt8lBV0N0b7kHZfyMJjYINhAbUxvS8UghRpljuHm4+f1lkoV6cVhKoaat/7MQDwZ
 I55M2Edl+A6wPQA7hpFuIT++PVN6GDK7D1rZTKaDBfZ3OCTOQLx0g1kZwHYs/lmk
 gvvtkj82RmbIPoG1rbxHTJFoQdVrpVCYAWr4rbgqNvUrZCjxTRmwRmyMpC/M1cXI
 noyZ/F+VdVLa0mADKMUmiQJ6QkoHjRIAIqlJbLRRl2VFlWHfu7hUiXk7hqt5ocQW
 cpxwird0Fur8cbEKVriRcwNpqGBrDDO7bv1lyQkwEOeHWZ6Fv9o=
 =1/Ms
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.13-5' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "A few more NFS client bugfixes from me for rc5.

  Dros has a stable fix for flexfiles to prevent leaking the
  nfs4_ff_ds_version arrays when freeing a layout, Trond fixed a
  potential recovery loop situation with the TEST_STATEID operation, and
  Christoph fixed up the pNFS blocklayout Kconfig options to prevent
  unsafe use with kernels that don't have large block device support.
  Summary:

  Stable fix:
   - fix leaking nfs4_ff_ds_version array

  Other fixes:
   - improve TEST_STATEID OLD_STATEID handling to prevent recovery loop

   - require 64-bit sector_t for pNFS blocklayout to prevent 32-bit
     compile errors"

* tag 'nfs-for-4.13-5' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  pnfs/blocklayout: require 64-bit sector_t
  NFSv4: Ignore NFS4ERR_OLD_STATEID in nfs41_check_open_stateid()
  nfs/flexfiles: fix leak of nfs4_ff_ds_version arrays
2017-08-11 13:54:09 -07:00
Linus Torvalds
2bfc37cdef Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
 "Fix a few bugs in fuse"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: set mapping error in writepage_locked when it fails
  fuse: Dont call set_page_dirty_lock() for ITER_BVEC pages for async_dio
  fuse: initialize the flock flag in fuse_file on allocation
2017-08-11 11:20:48 -07:00
Christoph Hellwig
8a9d6e964d pnfs/blocklayout: require 64-bit sector_t
The blocklayout code does not compile cleanly for a 32-bit sector_t,
and also has no reliable checks for devices sizes, which makes it
unsafe to use with a kernel that doesn't support large block devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 5c83746a0c ("pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-11 14:10:13 -04:00
Ingo Molnar
040cca3ab2 Merge branch 'linus' into locking/core, to resolve conflicts
Conflicts:
	include/linux/mm_types.h
	mm/huge_memory.c

I removed the smp_mb__before_spinlock() like the following commit does:

  8b1b436dd1 ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")

and fixed up the affected commits.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-11 13:51:59 +02:00
Jeff Layton
9183976ef1 fuse: set mapping error in writepage_locked when it fails
This ensures that we see errors on fsync when writeback fails.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-08-11 11:38:26 +02:00
Mike Rapoport
e86b298beb userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage
When the process exit races with outstanding mcopy_atomic, it would be
better to return ESRCH error.  When such race occurs the process and
it's mm are going away and returning "no such process" to the uffd
monitor seems better fit than ENOSPC.

Link: http://lkml.kernel.org/r/1502111545-32305-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10 15:54:07 -07:00
Minchan Kim
b3a81d0841 mm: fix KSM data corruption
Nadav reported KSM can corrupt the user data by the TLB batching
race[1].  That means data user written can be lost.

Quote from Nadav Amit:
 "For this race we need 4 CPUs:

  CPU0: Caches a writable and dirty PTE entry, and uses the stale value
  for write later.

  CPU1: Runs madvise_free on the range that includes the PTE. It would
  clear the dirty-bit. It batches TLB flushes.

  CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty.
  We care about the fact that it clears the PTE write-bit, and of
  course, batches TLB flushes.

  CPU3: Runs KSM. Our purpose is to pass the following test in
  write_protect_page():

	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))

  Since it will avoid TLB flush. And we want to do it while the PTE is
  stale. Later, and before replacing the page, we would be able to
  change the page.

  Note that all the operations the CPU1-3 perform canhappen in parallel
  since they only acquire mmap_sem for read.

  We start with two identical pages. Everything below regards the same
  page/PTE.

  CPU0        CPU1        CPU2        CPU3
  ----        ----        ----        ----
  Write the same
  value on page

  [cache PTE as
   dirty in TLB]

              MADV_FREE
              pte_mkclean()

                          4 > clear_refs
                          pte_wrprotect()

                                      write_protect_page()
                                      [ success, no flush ]

                                      pages_indentical()
                                      [ ok ]

  Write to page
  different value

  [Ok, using stale
   PTE]

                                      replace_page()

  Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late.
  CPU0 already wrote on the page, but KSM ignored this write, and it got
  lost"

In above scenario, MADV_FREE is fixed by changing TLB batching API
including [set|clear]_tlb_flush_pending.  Remained thing is soft-dirty
part.

This patch changes soft-dirty uses TLB batching API instead of
flush_tlb_mm and KSM checks pending TLB flush by using
mm_tlb_flush_pending so that it will flush TLB to avoid data lost if
there are other parallel threads pending TLB flush.

[1] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com

Link: http://lkml.kernel.org/r/20170802000818.4760-8-namit@vmware.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Reported-by: Nadav Amit <namit@vmware.com>
Tested-by: Nadav Amit <namit@vmware.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10 15:54:07 -07:00
Johannes Weiner
d507e2ebd2 mm: fix global NR_SLAB_.*CLAIMABLE counter reads
As Tetsuo points out:
 "Commit 385386cff4 ("mm: vmstat: move slab statistics from zone to
  node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
  0kB"

In addition to /proc/meminfo, this problem also affects the slab
counters OOM/allocation failure info dumps, can cause early -ENOMEM from
overcommit protection, and miscalculate image size requirements during
suspend-to-disk.

This is because the patch in question switched the slab counters from
the zone level to the node level, but forgot to update the global
accessor functions to read the aggregate node data instead of the
aggregate zone data.

Use global_node_page_state() to access the global slab counters.

Fixes: 385386cff4 ("mm: vmstat: move slab statistics from zone to node counters")
Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Stefan Agner <stefan@agner.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10 15:54:06 -07:00
Abhi Das
b066a4eebd gfs2: forcibly flush ail to relieve memory pressure
On systems with low memory, it is possible for gfs2 to infinitely
loop in balance_dirty_pages() under heavy IO (creating sparse files).

balance_dirty_pages() attempts to write out the dirty pages via
gfs2_writepages() but none are found because these dirty pages are
being used by the journaling code in the ail. Normally, the journal
has an upper threshold which when hit triggers an automatic flush
of the ail. But this threshold can be higher than the number of
allowable dirty pages and result in the ail never being flushed.

This patch forces an ail flush when gfs2_writepages() fails to write
anything. This is a good indication that the ail might be holding
some dirty pages.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-10 10:51:03 -05:00
Andreas Gruenbacher
a91323e255 gfs2: Clean up waiting on glocks
The prepare_to_wait_on_glock and finish_wait_on_glock functions introduced in
commit 56a365be "gfs2: gfs2_glock_get: Wait on freeing glocks" are
better removed, resulting in cleaner code.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-10 10:51:02 -05:00
Andreas Gruenbacher
6a1c8f6dcf gfs2: Defer deleting inodes under memory pressure
When under memory pressure and an inode's link count has dropped to
zero, defer deleting the inode to the delete workqueue.  This avoids
calling into DLM under memory pressure, which can deadlock.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-10 10:49:13 -05:00
Andreas Gruenbacher
71c1b21368 gfs2: gfs2_evict_inode: Put glocks asynchronously
gfs2_evict_inode is called to free inodes under memory pressure.  The
function calls into DLM when an inode's last cluster-wide reference goes
away (remote unlink) and to release the glock and associated DLM lock
before finally destroying the inode.  However, if DLM is blocked on
memory to become available, calling into DLM again will deadlock.

Avoid that by decoupling releasing glocks from destroying inodes in that
case: with gfs2_glock_queue_put, glocks will be dequeued asynchronously
in work queue context, when the associated inodes have likely already
been destroyed.

With this change, inodes can end up being unlinked, remote-unlink can be
triggered, and then the inode can be reallocated before all
remote-unlink callbacks are processed.  To detect that, revalidate the
link count in gfs2_evict_inode to make sure we're not deleting an
allocated, referenced inode.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-10 10:45:21 -05:00
Andreas Gruenbacher
eebd2e813f gfs2: Get rid of gfs2_set_nlink
Remove gfs2_set_nlink which prevents the link count of an inode from
becoming non-zero once it has reached zero.  The next commit reduces the
amount of waiting on glocks when an inode is evicted from memory.  With
that, an inode can become reallocated before all the remote-unlink
callbacks from a previous delete are processed, which causes the link
count to change from zero to non-zero.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-10 10:42:11 -05:00
Andreas Gruenbacher
0515480ad4 gfs2: gfs2_glock_get: Wait on freeing glocks
Keep glocks in their hash table until they are freed instead of removing
them when their last reference is dropped.  This allows to wait for any
previous instances of a glock to go away in gfs2_glock_get before
creating a new glocks.

Special thanks to Andy Price for finding and fixing a problem which also
required us to delete the rcu_read_unlock from the error case in function
gfs2_glock_get.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-10 10:39:31 -05:00
Peter Zijlstra
a9668cd6ee locking: Remove smp_mb__before_spinlock()
Now that there are no users of smp_mb__before_spinlock() left, remove
it entirely.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:29:03 +02:00
Peter Zijlstra
ff7a5fb0f1 overlayfs, locking: Remove smp_mb__before_spinlock() usage
While we could replace the smp_mb__before_spinlock() with the new
smp_mb__after_spinlock(), the normal pattern is to use
smp_store_release() to publish an object that is used for
lockless_dereference() -- and mirrors the regular rcu_assign_pointer()
/ rcu_dereference() patterns.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:29:02 +02:00
Aleksa Sarai
74dc3384fc sched/debug: Use task_pid_nr_ns in /proc/$pid/sched
It appears as though the addition of the PID namespace did not update
the output code for /proc/*/sched, which resulted in it providing PIDs
that were not self-consistent with the /proc mount. This additionally
made it trivial to detect whether a process was inside &init_pid_ns from
userspace, making container detection trivial:

   https://github.com/jessfraz/amicontained

This leads to situations such as:

  % unshare -pmf
  % mount -t proc proc /proc
  % head -n1 /proc/1/sched
  head (10047, #threads: 1)

Fix this by just using task_pid_nr_ns for the output of /proc/*/sched.
All of the other uses of task_pid_nr in kernel/sched/debug.c are from a
sysctl context and thus don't need to be namespaced.

Signed-off-by: Aleksa Sarai <asarai@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jess Frazelle <acidburn@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: cyphar@cyphar.com
Link: http://lkml.kernel.org/r/20170806044141.5093-1-asarai@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:18:19 +02:00
Chao Yu
b0af6d491a f2fs: add app/fs io stat
This patch enables inner app/fs io stats and introduces below virtual fs
nodes for exposing stats info:
/sys/fs/f2fs/<dev>/iostat_enable
/proc/fs/f2fs/<dev>/iostat_info

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix wrong stat assignment]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-09 21:43:58 -07:00
Yunlong Song
35ee82ca13 f2fs: do not change the valid_block value if cur_valid_map was wrongly set or cleared
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-09 17:45:23 -07:00
Yunlong Song
6415fedc57 f2fs: update cur_valid_map_mir together with cur_valid_map
When cur_valid_map passes the f2fs_test_and_set(,clear)_bit test,
cur_valid_map_mir update is skipped unlikely, so fix it. The fix
now changes the mirror check together with cur_valid_map all the
time.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: Fix unused variable and add unlikely for corner condition.]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-09 17:45:21 -07:00
David S. Miller
3118e6e19d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.

The TCP conflict was a case of local variables in a function
being removed from both net and net-next.

In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:28:45 -07:00
Trond Myklebust
c0ca0e5934 NFSv4: Ignore NFS4ERR_OLD_STATEID in nfs41_check_open_stateid()
If the call to TEST_STATEID returns NFS4ERR_OLD_STATEID, then it just
means we raced with other calls to OPEN.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-09 13:36:56 -04:00
Andreas Gruenbacher
61b91cfdc6 gfs2: Fix trivial typos
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-09 09:36:39 -05:00
Bob Peterson
b2fb7dab7f GFS2: Delete debugfs files only after we evict the glocks
This patch moves the call to gfs2_delete_debugfs_file so that it
comes after the glock hash table has been cleared. This way we
can query the debugfs files if umount hangs.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-09 09:36:39 -05:00
Bob Peterson
645ebd49f0 GFS2: Don't waste time locking lru_lock for non-lru glocks
Before this patch, glock_dq would call gfs2_glock_remove_from_lru.
For glocks that are never put on the LRU, such as the transaction
glock, this just takes the spin_lock, determines there's nothing to
be done because the list is empty, then unlocks again. This was
causing unnecessary lock contention on the lru_lock spin_lock.
This patch adds a check for GLOF_LRU in the glops before taking
the spin_lock.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-09 09:36:39 -05:00
Bob Peterson
2d821a8b71 GFS2: Don't bother trying to add rgrps to the lru list
This patch removes a call to gfs2_glock_add_to_lru from function
gfs2_clear_rgrpd. The call is just a waste of time because as soon
as it adds it to the lru_list, the call to gfs2_glock_put takes it
back off again.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-08-09 09:36:38 -05:00
Bob Peterson
240c6235df GFS2: Clear gl_object when deleting an inode in gfs2_delete_inode
This patch adds some calls to clear gl_object in function
gfs2_delete_inode. Since we are deleting the inode, and the glock
typically outlives the inode in core, we must clear gl_object
so subsequent use of the glock (e.g. for a new inode in its place)
will not have the old pointer sitting there. In error cases we
need to tidy up after ourselves. In non-error cases, we need to
clear gl_object before we set the block free in the bitmap so
residules aren't left for potential inode creators.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2017-08-09 09:36:38 -05:00
Bob Peterson
9c1b28081f GFS2: Clear gl_object if gfs2_create_inode fails
If function gfs2_create_inode fails after the inode has been
created (for example, if the inode_refresh fails for some reason)
the function was setting gl_object but never clearing it again.
The glocks are left pointing to a freed inode. This patch adds
the calls to clear gl_object in the appropriate error paths.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2017-08-09 09:36:26 -05:00
Weston Andros Adamson
1feb26162b nfs/flexfiles: fix leak of nfs4_ff_ds_version arrays
The client was freeing the nfs4_ff_layout_ds, but not the contained
nfs4_ff_ds_version array.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-08 17:18:10 -04:00
Linus Torvalds
1742c0f055 Changes since last update:
- Fix memory leak when issuing discard
 - Fix propagation of the dax inode flag
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJZhNzbAAoJEPh/dxk0SrTrhMQP/jskrkmob2pHDV/C3jEkLI5g
 2tcM9iS1AF3eWjdJtyIsTyejqaJONwLKjKC/pFA+zJtmv4hbC1DnVFy+3F1iU3Ws
 /BC4PzOnhdZrzbY0fjvg4M9sJOOfEPJbUm0eQyYRlUW3s+uRBhylz0/soa6JTA4G
 ZbxW9EhToJrHmT7T8oXXU9HVFLvJzhXdu+hbIGOiraTMcDkkEBGoW4Zz4dcRvjMU
 TZEt6WlBISKrCaGbtb38ChoMv97LGOQLbDM9oy4evvfnuJUQJJT/ayUZH6nvC/3d
 e9Lko4mPNLmTwfVh7hR4b8nC2TwAPPEcrvQcrKfgDolnzNJJU7en1TLJxJbWKEnM
 dvxixDp18E4lzSjVCC9pfCCY3esGLNKtmT5m9aCyNRl7oIdAhHxbIZABUquSurTn
 ii9Ulz+sRWZjY/X4/y+2tyEHLgGaJhDyHqz3I+1iBA2FBn2Wic/cZLvy/2ngmDWX
 rsVEj0ll8i9CLFGFgs6gjfe9dkmwVN+KA2VzgFuNFuNQlUyFZSq6Eqv7aKbgEDjM
 NzeKhkG2RMEBuHVZLHdeoJ2xNSD5Cuo6laJauevqFQ901rSAMqkUu6OjKHJQPKpt
 YMSgHVcnOJ0LaUcqNjJ+j1XlI7HLByu76s3uilvBnISUlLoRoUXwRBwi/BfCv0M0
 MMgB+DAg66T4wPQfTh1y
 =4UKT
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.13-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "I have a couple more bug fixes for you today:

   - fix memory leak when issuing discard

   - fix propagation of the dax inode flag"

* tag 'xfs-4.13-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: Fix per-inode DAX flag inheritance
  xfs: Fix leak of discard bio
2017-08-07 18:16:22 -07:00
David S. Miller
fde6af4729 mlx5-shared-2017-08-07
This series includes some mlx5 updates for both net-next and rdma trees.
 
 From Saeed,
 Core driver updates to allow selectively building the driver with
 or without some large driver components, such as
 	- E-Switch (Ethernet SRIOV support).
 	- Multi-Physical Function Switch (MPFs) support.
 For that we split E-Switch and MPFs functionalities into separate files.
 
 From Erez,
 Delay mlx5_core events when mlx5 interfaces, namely mlx5_ib, registration
 is taking place and until it completes.
 
 From Rabie,
 Increase the maximum supported flow counters.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJZiDoAAAoJEEg/ir3gV/o+594H/RH5kRwC719s/5YQFJXvGsVC
 fjtj3UUJPLrWB8XBh7a4PRcxXPIHaFKJuY3MU7KHFIeZQFklJcit3njjpxDlUINo
 F5S1LHBSYBkeMD/ksWBA8OLCBprNGN6WQ2tuFfAjZlQQ44zqv8LJmegoDtW9bGRy
 aGAkjUmALEblQsq81y0BQwN2/8DA8HAywrs8L2dkH1LHwijoIeYMZFOtKugv1FbB
 ABSKxcU7D/NYw6rsVdZG59fHFQ+eKOspDFqBZrUzfQ+zUU2hFFo96ovfXBfIqYCV
 7BtJuKXu2LeGPzFLsuw4h1131iqFT1iSMy9fEhf/4OwaL/KPP/+Umy8vP/XfM+U=
 =wCpd
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-shared-2017-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Saeed Mahameed says:

====================
mlx5-shared-2017-08-07

This series includes some mlx5 updates for both net-next and rdma trees.

From Saeed,
Core driver updates to allow selectively building the driver with
or without some large driver components, such as
	- E-Switch (Ethernet SRIOV support).
	- Multi-Physical Function Switch (MPFs) support.
For that we split E-Switch and MPFs functionalities into separate files.

From Erez,
Delay mlx5_core events when mlx5 interfaces, namely mlx5_ib, registration
is taking place and until it completes.

From Rabie,
Increase the maximum supported flow counters.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 10:42:09 -07:00
Guoqing Jiang
1c24285372 dlm: use sock_create_lite inside tcp_accept_from_sock
With commit 0ffdaf5b41 ("net/sock: add WARN_ON(parent->sk)
in sock_graft()"), a calltrace happened as follows:

[  457.018340] WARNING: CPU: 0 PID: 15623 at ./include/net/sock.h:1703 inet_accept+0x135/0x140
...
[  457.018381] RIP: 0010:inet_accept+0x135/0x140
[  457.018381] RSP: 0018:ffffc90001727d18 EFLAGS: 00010286
[  457.018383] RAX: 0000000000000001 RBX: ffff880012413000 RCX: 0000000000000001
[  457.018384] RDX: 000000000000018a RSI: 00000000fffffe01 RDI: ffffffff8156fae8
[  457.018384] RBP: ffffc90001727d38 R08: 0000000000000000 R09: 0000000000004305
[  457.018385] R10: 0000000000000001 R11: 0000000000004304 R12: ffff880035ae7a00
[  457.018386] R13: ffff88001282af10 R14: ffff880034e4e200 R15: 0000000000000000
[  457.018387] FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[  457.018388] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  457.018389] CR2: 00007fdec22f9000 CR3: 0000000002b5a000 CR4: 00000000000006f0
[  457.018395] Call Trace:
[  457.018402]  tcp_accept_from_sock.part.8+0x12d/0x449 [dlm]
[  457.018405]  ? vprintk_emit+0x248/0x2d0
[  457.018409]  tcp_accept_from_sock+0x3f/0x50 [dlm]
[  457.018413]  process_recv_sockets+0x3b/0x50 [dlm]
[  457.018415]  process_one_work+0x138/0x370
[  457.018417]  worker_thread+0x4d/0x3b0
[  457.018419]  kthread+0x109/0x140
[  457.018421]  ? rescuer_thread+0x320/0x320
[  457.018422]  ? kthread_park+0x60/0x60
[  457.018424]  ret_from_fork+0x25/0x30

Since newsocket created by sock_create_kern sets it's
sock by the path:

	sock_create_kern -> __sock_creat
			 ->pf->create => inet_create
			 -> sock_init_data

Then WARN_ON is triggered by "con->sock->ops->accept =>
inet_accept -> sock_graft", it also means newsock->sk
is leaked since sock_graft will replace it with a new
sk.

To resolve the issue, we need to use sock_create_lite
instead of sock_create_kern, like commit 0933a578cd
("rds: tcp: use sock_create_lite() to create the accept
socket") did.

Reported-by: Zhilong Liu <zlliu@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2017-08-07 11:23:09 -05:00
Edwin Török
55acdd926f dlm: avoid double-free on error path in dlm_device_{register,unregister}
Can be reproduced when running dlm_controld (tested on 4.4.x, 4.12.4):
 # seq 1 100 | xargs -P0 -n1 dlm_tool join
 # seq 1 100 | xargs -P0 -n1 dlm_tool leave

misc_register fails due to duplicate sysfs entry, which causes
dlm_device_register to free ls->ls_device.name.
In dlm_device_deregister the name was freed again, causing memory
corruption.

According to the comment in dlm_device_deregister the name should've been
set to NULL when registration fails,
so this patch does that.

sysfs: cannot create duplicate filename '/dev/char/10:1'
------------[ cut here ]------------
warning: cpu: 1 pid: 4450 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x56/0x70
modules linked in: msr rfcomm dlm ccm bnep dm_crypt uvcvideo
videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev
btusb media btrtl btbcm btintel bluetooth ecdh_generic intel_rapl
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
snd_hda_codec_hdmi irqbypass crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel thinkpad_acpi pcbc nvram snd_seq_midi
snd_seq_midi_event aesni_intel snd_hda_codec_realtek snd_hda_codec_generic
snd_rawmidi aes_x86_64 crypto_simd glue_helper snd_hda_intel snd_hda_codec
cryptd intel_cstate arc4 snd_hda_core snd_seq snd_seq_device snd_hwdep
iwldvm intel_rapl_perf mac80211 joydev input_leds iwlwifi serio_raw
cfg80211 snd_pcm shpchp snd_timer snd mac_hid mei_me lpc_ich mei soundcore
sunrpc parport_pc ppdev lp parport autofs4 i915 psmouse
 e1000e ahci libahci i2c_algo_bit sdhci_pci ptp drm_kms_helper sdhci
pps_core syscopyarea sysfillrect sysimgblt fb_sys_fops drm wmi video
cpu: 1 pid: 4450 comm: dlm_test.exe not tainted 4.12.4-041204-generic
hardware name: lenovo 232425u/232425u, bios g2et82ww (2.02 ) 09/11/2012
task: ffff96b0cbabe140 task.stack: ffffb199027d0000
rip: 0010:sysfs_warn_dup+0x56/0x70
rsp: 0018:ffffb199027d3c58 eflags: 00010282
rax: 0000000000000038 rbx: ffff96b0e2c49158 rcx: 0000000000000006
rdx: 0000000000000000 rsi: 0000000000000086 rdi: ffff96b15e24dcc0
rbp: ffffb199027d3c70 r08: 0000000000000001 r09: 0000000000000721
r10: ffffb199027d3c00 r11: 0000000000000721 r12: ffffb199027d3cd1
r13: ffff96b1592088f0 r14: 0000000000000001 r15: ffffffffffffffef
fs:  00007f78069c0700(0000) gs:ffff96b15e240000(0000)
knlgs:0000000000000000
cs:  0010 ds: 0000 es: 0000 cr0: 0000000080050033
cr2: 000000178625ed28 cr3: 0000000091d3e000 cr4: 00000000001406e0
call trace:
 sysfs_do_create_link_sd.isra.2+0x9e/0xb0
 sysfs_create_link+0x25/0x40
 device_add+0x5a9/0x640
 device_create_groups_vargs+0xe0/0xf0
 device_create_with_groups+0x3f/0x60
 ? snprintf+0x45/0x70
 misc_register+0x140/0x180
 device_write+0x6a8/0x790 [dlm]
 __vfs_write+0x37/0x160
 ? apparmor_file_permission+0x1a/0x20
 ? security_file_permission+0x3b/0xc0
 vfs_write+0xb5/0x1a0
 sys_write+0x55/0xc0
 ? sys_fcntl+0x5d/0xb0
 entry_syscall_64_fastpath+0x1e/0xa9
rip: 0033:0x7f78083454bd
rsp: 002b:00007f78069bbd30 eflags: 00000293 orig_rax: 0000000000000001
rax: ffffffffffffffda rbx: 0000000000000006 rcx: 00007f78083454bd
rdx: 000000000000009c rsi: 00007f78069bee00 rdi: 0000000000000005
rbp: 00007f77f8000a20 r08: 000000000000fcf0 r09: 0000000000000032
r10: 0000000000000024 r11: 0000000000000293 r12: 00007f78069bde00
r13: 00007f78069bee00 r14: 000000000000000a r15: 00007f78069bbd70
code: 85 c0 48 89 c3 74 12 b9 00 10 00 00 48 89 c2 31 f6 4c 89 ef e8 2c c8
ff ff 4c 89 e2 48 89 de 48 c7 c7 b0 8e 0c a8 e8 41 e8 ed ff <0f> ff 48 89
df e8 00 d5 f4 ff 5b 41 5c 41 5d 5d c3 66 0f 1f 84
---[ end trace 40412246357cc9e0 ]---

dlm: 59f24629-ae39-44e2-9030-397ebc2eda26: leaving the lockspace group...
bug: unable to handle kernel null pointer dereference at 0000000000000001
ip: [<ffffffff811a3b4a>] kmem_cache_alloc+0x7a/0x140
pgd 0
oops: 0000 [#1] smp
modules linked in: dlm 8021q garp mrp stp llc openvswitch nf_defrag_ipv6
nf_conntrack libcrc32c iptable_filter dm_multipath crc32_pclmul dm_mod
aesni_intel psmouse aes_x86_64 sg ablk_helper cryptd lrw gf128mul
glue_helper i2c_piix4 nls_utf8 tpm_tis tpm isofs nfsd auth_rpcgss
oid_registry nfs_acl lockd grace sunrpc xen_wdt ip_tables x_tables autofs4
hid_generic usbhid hid sr_mod cdrom sd_mod ata_generic pata_acpi 8139too
serio_raw ata_piix 8139cp mii uhci_hcd ehci_pci ehci_hcd libata
scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod ipv6
cpu: 0 pid: 394 comm: systemd-udevd tainted: g w 4.4.0+0 #1
hardware name: xen hvm domu, bios 4.7.2-2.2 05/11/2017
task: ffff880002410000 ti: ffff88000243c000 task.ti: ffff88000243c000
rip: e030:[<ffffffff811a3b4a>] [<ffffffff811a3b4a>]
kmem_cache_alloc+0x7a/0x140
rsp: e02b:ffff88000243fd90 eflags: 00010202
rax: 0000000000000000 rbx: ffff8800029864d0 rcx: 000000000007b36c
rdx: 000000000007b36b rsi: 00000000024000c0 rdi: ffff880036801c00
rbp: ffff88000243fdc0 r08: 0000000000018880 r09: 0000000000000054
r10: 000000000000004a r11: ffff880034ace6c0 r12: 00000000024000c0
r13: ffff880036801c00 r14: 0000000000000001 r15: ffffffff8118dcc2
fs: 00007f0ab77548c0(0000) gs:ffff880036e00000(0000) knlgs:0000000000000000
cs: e033 ds: 0000 es: 0000 cr0: 0000000080050033
cr2: 0000000000000001 cr3: 000000000332d000 cr4: 0000000000040660
stack:
ffffffff8118dc90 ffff8800029864d0 0000000000000000 ffff88003430b0b0
ffff880034b78320 ffff88003430b0b0 ffff88000243fdf8 ffffffff8118dcc2
ffff8800349c6700 ffff8800029864d0 000000000000000b 00007f0ab7754b90
call trace:
[<ffffffff8118dc90>] ? anon_vma_fork+0x60/0x140
[<ffffffff8118dcc2>] anon_vma_fork+0x92/0x140
[<ffffffff8107033e>] copy_process+0xcae/0x1a80
[<ffffffff8107128b>] _do_fork+0x8b/0x2d0
[<ffffffff81071579>] sys_clone+0x19/0x20
[<ffffffff815a30ae>] entry_syscall_64_fastpath+0x12/0x71
] code: f6 75 1c 4c 89 fa 44 89 e6 4c 89 ef e8 a7 e4 00 00 41 f7 c4 00 80
00 00 49 89 c6 74 47 eb 32 49 63 45 20 48 8d 4a 01 4d 8b 45 00 <49> 8b 1c
06 4c 89 f0 65 49 0f c7 08 0f 94 c0 84 c0 74 ac 49 63
rip [<ffffffff811a3b4a>] kmem_cache_alloc+0x7a/0x140
rsp <ffff88000243fd90>
cr2: 0000000000000001
--[ end trace 70cb9fd1b164a0e8 ]--

CC: stable@vger.kernel.org
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2017-08-07 11:23:09 -05:00
Bhumika Goyal
417f7c59ed dlm: constify kset_uevent_ops structure
Declare kset_uevent_ops structure as const as it is only passed as an
argument to the function kset_create_and_add. This argument is of type
const, so declare the structure as const.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2017-08-07 11:23:09 -05:00
Zhu Lingshan
3b0e761ba8 dlm: print log message when cluster name is not set
Print a message when a cluster name is not specified by
the caller.  In this case the cluster name configured
for the dlm is used without any validation that it is
the cluster expected by the application.

Signed-off-by: Zhu Lingshan <lszhu@suse.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2017-08-07 11:23:09 -05:00
Markus Elfring
2ab93ae138 dlm: Delete an unnecessary variable initialisation in dlm_ls_start()
The local variable "rv" is reassigned by a statement at the beginning.
Thus omit the explicit initialisation.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: David Teigland <teigland@redhat.com>
2017-08-07 11:23:09 -05:00
Markus Elfring
d12ad1a964 dlm: Improve a size determination in two functions
Replace the specification of two data structures by pointer dereferences
as the parameter for the operator "sizeof" to make the corresponding size
determination a bit safer according to the Linux coding style convention.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: David Teigland <teigland@redhat.com>
2017-08-07 11:23:09 -05:00
Markus Elfring
2f48e06102 dlm: Use kcalloc() in two functions
* Multiplications for the size determination of memory allocations
  indicated that array data structures should be processed.
  Thus reuse the corresponding function "kcalloc".

  This issue was detected by using the Coccinelle software.

* Replace the specification of data structures by pointer dereferences
  to make the corresponding size determinations a bit safer according to
  the Linux coding style convention.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: David Teigland <teigland@redhat.com>
2017-08-07 11:23:09 -05:00
Markus Elfring
790854becc dlm: Use kmalloc_array() in make_member_array()
* A multiplication for the size determination of a memory allocation
  indicated that an array data structure should be processed.
  Thus use the corresponding function "kmalloc_array".

  This issue was detected by using the Coccinelle software.

* Replace the specification of a data type by a pointer dereference
  to make the corresponding size determination a bit safer according to
  the Linux coding style convention.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: David Teigland <teigland@redhat.com>
2017-08-07 11:23:09 -05:00
Markus Elfring
0d37eca752 dlm: Delete an error message for a failed memory allocation in dlm_recover_waiters_pre()
Omit an extra message for a memory allocation failure in this function.

Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: David Teigland <teigland@redhat.com>
2017-08-07 11:23:09 -05:00
Markus Elfring
102e67d4e3 dlm: Improve a size determination in dlm_recover_waiters_pre()
Replace the specification of a data structure by a pointer dereference
as the parameter for the operator "sizeof" to make the corresponding size
determination a bit safer according to the Linux coding style convention.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: David Teigland <teigland@redhat.com>
2017-08-07 11:23:09 -05:00
Markus Elfring
fbb1008151 dlm: Use kcalloc() in dlm_scan_waiters()
A multiplication for the size determination of a memory allocation
indicated that an array data structure should be processed.
Thus use the corresponding function "kcalloc".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: David Teigland <teigland@redhat.com>
2017-08-07 11:23:09 -05:00
Markus Elfring
2c257e96df dlm: Improve a size determination in table_seq_start()
Replace the specification of a data structure by a pointer dereference
as the parameter for the operator "sizeof" to make the corresponding size
determination a bit safer according to the Linux coding style convention.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: David Teigland <teigland@redhat.com>
2017-08-07 11:23:09 -05:00
Markus Elfring
41922ce831 dlm: Add spaces for better code readability
The script "checkpatch.pl" pointed information out like the following.

CHECK: spaces preferred around that '+' (ctx:VxV)

Thus fix the affected source code places.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: David Teigland <teigland@redhat.com>
2017-08-07 11:23:09 -05:00
Markus Elfring
653996ca8d dlm: Replace six seq_puts() calls by seq_putc()
Six single characters (line breaks) should be put into a sequence.
Thus use the corresponding function "seq_putc".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: David Teigland <teigland@redhat.com>
2017-08-07 11:23:09 -05:00
Gang He
8e1743748b dlm: Make dismatch error message more clear
This change will try to make this error message more clear,
since the upper applications (e.g. ocfs2) invoke dlm_new_lockspace
to create a new lockspace with passing a cluster name. Sometimes,
dlm_new_lockspace return failure while two cluster names dismatch,
the user is a little confused since this line error message is not
enough obvious.

Signed-off-by: Gang He <ghe@suse.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2017-08-07 11:23:09 -05:00
Vlad Tsyrklevich
8286d6b14c dlm: Fix kernel memory disclosure
Clear the 'unused' field and the uninitialized padding in 'lksb' to
avoid leaking memory to userland in copy_result_to_user().

Signed-off-by: Vlad Tsyrklevich <vlad@tsyrklevich.net>
Signed-off-by: David Teigland <teigland@redhat.com>
2017-08-07 11:23:09 -05:00
zhangyi (F)
41e327b586 quota: correct space limit check
Currently we compare total space (curspace + rsvspace)
with space limit in quota-tools when setting grace time
and also in check_bdq(), but we missing rsvspace in
somewhere else, correct them. This patch also fix incorrect
zero dqb_btime and grace time updating failure when we use
rsvspace(e.g. ext4 dalloc feature).

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-07 16:51:28 +02:00
Trond Myklebust
bfab281721 NFSv4: Cleanup setting of the migration flags.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-07 09:32:55 -04:00
Trond Myklebust
937e3133cd NFSv4.1: Ensure we clear the SP4_MACH_CRED flags in nfs4_sp4_select_mode()
If the server changes, so that it no longer supports SP4_MACH_CRED, or
that it doesn't support the same set of SP4_MACH_CRED functionality,
then we want to ensure that we clear the unsupported flags.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-07 09:32:55 -04:00
Trond Myklebust
9c760d1fd5 NFSv4: Refactor _nfs4_proc_exchange_id()
Tease apart the functionality in nfs4_exchange_id_done() so that
it is easier to debug exchange id vs trunking issues by moving
all the processing out of nfs4_exchange_id_done() and into the
callers.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-07 09:32:55 -04:00
Linus Torvalds
ed66da1104 A large number of ext4 bug fixes and cleanups for v4.13
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlmHbBAACgkQ8vlZVpUN
 gaMu3gf+LpI5bI1XA3R8KbXB2snnz6wM7OzArfqvreX+m+xP1CK6nVpAIgpkZqfw
 QkQ1xPJk7Q25vex/pPcsgLO0Vxf0i4vpydK+fYnf30S4WvGQVq6OHZWFFv2zM2YB
 7TWxjG+KryM7j6JSXdUiSTKP3nX84TW/IMIWuZMR1nuOa8N5M4yD3uc+3EBTjSbq
 P/dxfmkp2hQKnlZVBWqCjJDhtxwUYTF4iZ/pbSVeGbgHCh1674ml+airb4K9ltNU
 0vR0JChD12YJaafjaAyIrqqKwDGvnN+H5wyhCodEV9w8jthbcU04Jfmi1auB9UxT
 y7/sgbV64W2o5hBwxY3RXjZkVLpDsw==
 =Mtr7
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "A large number of ext4 bug fixes and cleanups for v4.13"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix copy paste error in ext4_swap_extents()
  ext4: fix overflow caused by missing cast in ext4_resize_fs()
  ext4, project: expand inode extra size if possible
  ext4: cleanup ext4_expand_extra_isize_ea()
  ext4: restructure ext4_expand_extra_isize
  ext4: fix forgetten xattr lock protection in ext4_expand_extra_isize
  ext4: make xattr inode reads faster
  ext4: inplace xattr block update fails to deduplicate blocks
  ext4: remove unused mode parameter
  ext4: fix warning about stack corruption
  ext4: fix dir_nlink behaviour
  ext4: silence array overflow warning
  ext4: fix SEEK_HOLE/SEEK_DATA for blocksize < pagesize
  ext4: release discard bio after sending discard commands
  ext4: convert swap_inode_data() over to use swap() on most of the fields
  ext4: error should be cleared if ea_inode isn't added to the cache
  ext4: Don't clear SGID when inheriting ACLs
  ext4: preserve i_mode if __ext4_set_acl() fails
  ext4: remove unused metadata accounting variables
  ext4: correct comment references to ext4_ext_direct_IO()
2017-08-06 12:31:17 -07:00
Maninder Singh
4e56201321 ext4: fix copy paste error in ext4_swap_extents()
This bug was found by a static code checker tool for copy paste
problems.

Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Signed-off-by: Vaneet Narang <v.narang@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-06 01:33:07 -04:00
Jerry Lee
aec51758ce ext4: fix overflow caused by missing cast in ext4_resize_fs()
On a 32-bit platform, the value of n_blcoks_count may be wrong during
the file system is resized to size larger than 2^32 blocks.  This may
caused the superblock being corrupted with zero blocks count.

Fixes: 1c6bd7173d
Signed-off-by: Jerry Lee <jerrylee@qnap.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org # 3.7+
2017-08-06 01:18:31 -04:00
Miao Xie
c03b45b853 ext4, project: expand inode extra size if possible
When upgrading from old format, try to set project id
to old file first time, it will return EOVERFLOW, but if
that file is dirtied(touch etc), changing project id will
be allowed, this might be confusing for users, we could
try to expand @i_extra_isize here too.

Reported-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-06 01:00:49 -04:00
Miao Xie
b640b2c51b ext4: cleanup ext4_expand_extra_isize_ea()
Clean up some goto statement, make ext4_expand_extra_isize_ea() clearer.

Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
2017-08-06 00:55:48 -04:00
Miao Xie
cf0a5e818f ext4: restructure ext4_expand_extra_isize
Current ext4_expand_extra_isize just tries to expand extra isize, if
someone is holding xattr lock or some check fails, it will give up.
So rename its name to ext4_try_to_expand_extra_isize.

Besides that, we clean up unnecessary check and move some relative checks
into it.

Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
2017-08-06 00:40:01 -04:00
Miao Xie
3b10fdc6d8 ext4: fix forgetten xattr lock protection in ext4_expand_extra_isize
We should avoid the contention between the i_extra_isize update and
the inline data insertion, so move the xattr trylock in front of
i_extra_isize update.

Signed-off-by: Miao Xie <miaoxie@huawei.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
2017-08-06 00:27:38 -04:00
Tahsin Erdogan
9699d4f91d ext4: make xattr inode reads faster
ext4_xattr_inode_read() currently reads each block sequentially while
waiting for io operation to complete before moving on to the next
block. This prevents request merging in block layer.

Add a ext4_bread_batch() function that starts reads for all blocks
then optionally waits for them to complete. A similar logic is used
in ext4_find_entry(), so update that code to use the new function.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-06 00:07:01 -04:00
Tahsin Erdogan
ec00022030 ext4: inplace xattr block update fails to deduplicate blocks
When an xattr block has a single reference, block is updated inplace
and it is reinserted to the cache. Later, a cache lookup is performed
to see whether an existing block has the same contents. This cache
lookup will most of the time return the just inserted entry so
deduplication is not achieved.

Running the following test script will produce two xattr blocks which
can be observed in "File ACL: " line of debugfs output:

  mke2fs -b 1024 -I 128 -F -O extent /dev/sdb 1G
  mount /dev/sdb /mnt/sdb

  touch /mnt/sdb/{x,y}

  setfattr -n user.1 -v aaa /mnt/sdb/x
  setfattr -n user.2 -v bbb /mnt/sdb/x

  setfattr -n user.1 -v aaa /mnt/sdb/y
  setfattr -n user.2 -v bbb /mnt/sdb/y

  debugfs -R 'stat x' /dev/sdb | cat
  debugfs -R 'stat y' /dev/sdb | cat

This patch defers the reinsertion to the cache so that we can locate
other blocks with the same contents.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2017-08-05 22:41:42 -04:00
Tahsin Erdogan
77a2e84d51 ext4: remove unused mode parameter
ext4_alloc_file_blocks() does not use its mode parameter. Remove it.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-05 22:15:45 -04:00
Arnd Bergmann
2df2c3402f ext4: fix warning about stack corruption
After commit 62d1034f53e3 ("fortify: use WARN instead of BUG for now"),
we get a warning about possible stack overflow from a memcpy that
was not strictly bounded to the size of the local variable:

    inlined from 'ext4_mb_seq_groups_show' at fs/ext4/mballoc.c:2322:2:
include/linux/string.h:309:9: error: '__builtin_memcpy': writing between 161 and 1116 bytes into a region of size 160 overflows the destination [-Werror=stringop-overflow=]

We actually had a bug here that would have been found by the warning,
but it was already fixed last year in commit 30a9d7afe7 ("ext4: fix
stack memory corruption with 64k block size").

This replaces the fixed-length structure on the stack with a variable-length
structure, using the correct upper bound that tells the compiler that
everything is really fine here. I also change the loop count to check
for the same upper bound for consistency, but the existing code is
already correct here.

Note that while clang won't allow certain kinds of variable-length arrays
in structures, this particular instance is fine, as the array is at the
end of the structure, and the size is strictly bounded.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-05 21:57:46 -04:00
Andreas Dilger
c741489206 ext4: fix dir_nlink behaviour
The dir_nlink feature has been enabled by default for new ext4
filesystems since e2fsprogs-1.41 in 2008, and was automatically
enabled by the kernel for older ext4 filesystems since the
dir_nlink feature was added with ext4 in kernel 2.6.28+ when
the subdirectory count exceeded EXT4_LINK_MAX-1.

Automatically adding the file system features such as dir_nlink is
generally frowned upon, since it could cause the file system to not be
mountable on older kernel, thus preventing the administrator from
rolling back to an older kernel if necessary.

In this case, the administrator might also want to disable the feature
because glibc's fts_read() function does not correctly optimize
directory traversal for directories that use st_nlinks field of 1 to
indicate that the number of links in the directory are not tracked by
the file system, and could fail to traverse the full directory
hierarchy.  Fortunately, in the past ten years very few users have
complained about incomplete file system traversal by glibc's
fts_read().

This commit also changes ext4_inc_count() to allow i_nlinks to reach
the full EXT4_LINK_MAX links on the parent directory (including "."
and "..") before changing i_links_count to be 1.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=196405
Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-05 19:47:34 -04:00
Dan Carpenter
381cebfe72 ext4: silence array overflow warning
I get a static checker warning:

    fs/ext4/ext4.h:3091 ext4_set_de_type()
    error: buffer overflow 'ext4_type_by_mode' 15 <= 15

It seems unlikely that we would hit this read overflow in real life, but
it's also simple enough to make the array 16 bytes instead of 15.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-05 19:00:31 -04:00
Jan Kara
fcf5ea1099 ext4: fix SEEK_HOLE/SEEK_DATA for blocksize < pagesize
ext4_find_unwritten_pgoff() does not properly handle a situation when
starting index is in the middle of a page and blocksize < pagesize. The
following command shows the bug on filesystem with 1k blocksize:

  xfs_io -f -c "falloc 0 4k" \
            -c "pwrite 1k 1k" \
            -c "pwrite 3k 1k" \
            -c "seek -a -r 0" foo

In this example, neither lseek(fd, 1024, SEEK_HOLE) nor lseek(fd, 2048,
SEEK_DATA) will return the correct result.

Fix the problem by neglecting buffers in a page before starting offset.

Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
CC: stable@vger.kernel.org # 3.8+
2017-08-05 17:43:24 -04:00
Daeho Jeong
e45105772d ext4: release discard bio after sending discard commands
We've changed the discard command handling into parallel manner.
But, in this change, I forgot decreasing the usage count of the bio
which was used to send discard request. I'm sorry about that.

Fixes: a015434480 ("ext4: send parallel discards on commit completions")
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-08-05 13:11:57 -04:00
Lukas Czerner
56bdf855e6 xfs: Fix per-inode DAX flag inheritance
According to the commit that implemented per-inode DAX flag:
commit 58f88ca2df ("xfs: introduce per-inode DAX enablement")
the flag is supposed to act as "inherit flag".

Currently this only works in the situations where parent directory
already has a flag in di_flags set, otherwise inheritance does not
work. This is because setting the XFS_DIFLAG2_DAX flag is done in a
wrong branch designated for di_flags, not di_flags2.

Fix this by moving the code to branch designated for setting di_flags2,
which does test for flags in di_flags2.

Fixes: 58f88ca2df ("xfs: introduce per-inode DAX enablement")
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-04 13:43:36 -07:00
Jan Kara
ea7bd56fa3 xfs: Fix leak of discard bio
The bio describing discard operation is allocated by
__blkdev_issue_discard() which returns us a reference to it. That
reference is never released and thus we leak this bio. Drop the bio
reference once it completes in xlog_discard_endio().

CC: stable@vger.kernel.org
Fixes: 4560e78f40
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-04 13:43:36 -07:00
Jaegeuk Kim
a36c106dff f2fs: use printk_ratelimited for f2fs_msg
This patch reduces contention of printks.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-03 19:09:52 -07:00
Jaegeuk Kim
bf9e697ecd f2fs: expose features to sysfs entry
This patch exposes what features are supported by current f2fs build to sysfs
entry via:

/sys/fs/f2fs/features/
/sys/fs/f2fs/dev/features

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-03 19:09:51 -07:00
Chao Yu
704956ecf5 f2fs: support inode checksum
This patch adds to support inode checksum in f2fs.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix verification flow]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-03 19:09:26 -07:00
Jaegeuk Kim
4f31d26b0c f2fs: return wrong error number on f2fs_quota_write
This must return size, not error number.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-03 19:05:05 -07:00
Yunlong Song
401db79f61 f2fs: provide f2fs_balance_fs to __write_node_page
Let node writeback also do f2fs_balance_fs to ensure there are always enough free
segments.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-08-03 19:05:04 -07:00
Linus Torvalds
995d03ae26 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "15 fixes"

[ This does not merge the "fortify: use WARN instead of BUG for now"
  patch, which needs a bit of extra work to build cleanly with all
  configurations. Arnd is on it.   - Linus ]

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  ocfs2: don't clear SGID when inheriting ACLs
  mm: allow page_cache_get_speculative in interrupt context
  userfaultfd: non-cooperative: flush event_wqh at release time
  ipc: add missing container_of()s for randstruct
  cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
  userfaultfd_zeropage: return -ENOSPC in case mm has gone
  mm: take memory hotplug lock within numa_zonelist_order_handler()
  mm/page_io.c: fix oops during block io poll in swapin path
  zram: do not free pool->size_class
  kthread: fix documentation build warning
  kasan: avoid -Wmaybe-uninitialized warning
  userfaultfd: non-cooperative: notify about unmap of destination during mremap
  mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries
  pid: kill pidhash_size in pidhash_init()
  mm/hugetlb.c: __get_user_pages ignores certain follow_hugetlb_page errors
2017-08-03 14:58:13 -07:00
Jeff Layton
6d4b512413 ecryptfs: convert to file_write_and_wait in ->fsync
This change is mainly for documentation/completeness, as ecryptfs never
calls mapping_set_error, and so will never return a previous writeback
error.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-08-03 14:20:22 -04:00
Ashish Samant
61c12b49e1 fuse: Dont call set_page_dirty_lock() for ITER_BVEC pages for async_dio
Commit 8fba54aebb ("fuse: direct-io: don't dirty ITER_BVEC pages") fixes
the ITER_BVEC page deadlock for direct io in fuse by checking in
fuse_direct_io(), whether the page is a bvec page or not, before locking
it.  However, this check is missed when the "async_dio" mount option is
enabled.  In this case, set_page_dirty_lock() is called from the req->end
callback in request_end(), when the fuse thread is returning from userspace
to respond to the read request.  This will cause the same deadlock because
the bvec condition is not checked in this path.

Here is the stack of the deadlocked thread, while returning from userspace:

[13706.656686] INFO: task glusterfs:3006 blocked for more than 120 seconds.
[13706.657808] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[13706.658788] glusterfs       D ffffffff816c80f0     0  3006      1
0x00000080
[13706.658797]  ffff8800d6713a58 0000000000000086 ffff8800d9ad7000
ffff8800d9ad5400
[13706.658799]  ffff88011ffd5cc0 ffff8800d6710008 ffff88011fd176c0
7fffffffffffffff
[13706.658801]  0000000000000002 ffffffff816c80f0 ffff8800d6713a78
ffffffff816c790e
[13706.658803] Call Trace:
[13706.658809]  [<ffffffff816c80f0>] ? bit_wait_io_timeout+0x80/0x80
[13706.658811]  [<ffffffff816c790e>] schedule+0x3e/0x90
[13706.658813]  [<ffffffff816ca7e5>] schedule_timeout+0x1b5/0x210
[13706.658816]  [<ffffffff81073ffb>] ? gup_pud_range+0x1db/0x1f0
[13706.658817]  [<ffffffff810668fe>] ? kvm_clock_read+0x1e/0x20
[13706.658819]  [<ffffffff81066909>] ? kvm_clock_get_cycles+0x9/0x10
[13706.658822]  [<ffffffff810f5792>] ? ktime_get+0x52/0xc0
[13706.658824]  [<ffffffff816c6f04>] io_schedule_timeout+0xa4/0x110
[13706.658826]  [<ffffffff816c8126>] bit_wait_io+0x36/0x50
[13706.658828]  [<ffffffff816c7d06>] __wait_on_bit_lock+0x76/0xb0
[13706.658831]  [<ffffffffa0545636>] ? lock_request+0x46/0x70 [fuse]
[13706.658834]  [<ffffffff8118800a>] __lock_page+0xaa/0xb0
[13706.658836]  [<ffffffff810c8500>] ? wake_atomic_t_function+0x40/0x40
[13706.658838]  [<ffffffff81194d08>] set_page_dirty_lock+0x58/0x60
[13706.658841]  [<ffffffffa054d968>] fuse_release_user_pages+0x58/0x70 [fuse]
[13706.658844]  [<ffffffffa0551430>] ? fuse_aio_complete+0x190/0x190 [fuse]
[13706.658847]  [<ffffffffa0551459>] fuse_aio_complete_req+0x29/0x90 [fuse]
[13706.658849]  [<ffffffffa05471e9>] request_end+0xd9/0x190 [fuse]
[13706.658852]  [<ffffffffa0549126>] fuse_dev_do_write+0x336/0x490 [fuse]
[13706.658854]  [<ffffffffa054963e>] fuse_dev_write+0x6e/0xa0 [fuse]
[13706.658857]  [<ffffffff812a9ef3>] ? security_file_permission+0x23/0x90
[13706.658859]  [<ffffffff81205300>] do_iter_readv_writev+0x60/0x90
[13706.658862]  [<ffffffffa05495d0>] ? fuse_dev_splice_write+0x350/0x350
[fuse]
[13706.658863]  [<ffffffff812062a1>] do_readv_writev+0x171/0x1f0
[13706.658866]  [<ffffffff810b3d00>] ? try_to_wake_up+0x210/0x210
[13706.658868]  [<ffffffff81206361>] vfs_writev+0x41/0x50
[13706.658870]  [<ffffffff81206496>] SyS_writev+0x56/0xf0
[13706.658872]  [<ffffffff810257a1>] ? syscall_trace_leave+0xf1/0x160
[13706.658874]  [<ffffffff816cbb2e>] system_call_fastpath+0x12/0x71

Fix this by making should_dirty a fuse_io_priv parameter that can be
checked in fuse_aio_complete_req().

Reported-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-08-03 17:55:58 +02:00
Linus Torvalds
19ec50a438 Two more NFS client bugfixes for 4.13
Stable fix:
 - Fix EXCHANGE_ID corrupt verifier issue
 
 Other fix:
 - Fix double frees in nfs4_test_session_trunk()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlmCNJAACgkQ18tUv7Cl
 QOvlKhAAjjKMtdwYDV7B3SJvDfa5KNH9nNaMjKoD/OzWavDL7jRVRUWayeJiy4sf
 wJAMsG280ztag1Mr/OnClESZThWNrHTnKjq/P8kuefJz+pPvWP+AaQ/8QSM/rulM
 Y7qfh+FZb2t7lK80VAZTXEi4bvQzlkIV2285a34VIUm3VTpwl3HvltkBMLFztDSb
 sdaQh+N48FOYdG9RMncxpzFfRvUILzP9GFMg6WAFAcbr1wSsT9MfLqd+ANgOidE8
 X2Ch9l0jN51m1dFGmMUZ5HMmuhUh2m50SXhmUOTpS7fj+k11OWsV2IIWHKPk/LjI
 5E0aUPDU2TOE9noLSfkguAyxvqAGeAHPDIE7QM9vTGgktzXWGzh++ODeSzkIDp+5
 iZ+cYLkme1lOg1nEqj1/SrIZXHP7ZCvK8iqNgoqT8rIp6zE1tLPnNAbRDmnwC/VH
 PrfINATV8llN/3vFojWJfD1enDe3+dALPINLVQOjwjafq6d5/hzRdCGqWpCDjmlU
 esDoCkoWGUjadIr6EZGtDAzSZgafsUx6d6QnGtkIUsz1d0FIXH6holB5EKaQpOLf
 dVb9iO5R2EcVP+WvB3KjA5y6MCNvoVqebxMvLBPCYsu3fI8fQ0sgSjgSJXi0fRzg
 G7DUsKcBGVKarDMXTUReL2G6lN7h0f5EsM1WnA9KF0TV9aah/ZE=
 =Ddq9
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.13-4' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "Two fixes from Trond this time, now that he's back from his vacation.
  The first is a stable fix for the EXCHANGE_ID issue on the mailing
  list, and the other fixes a double-free situation that he found at the
  same time.

  Stable fix:
   - Fix EXCHANGE_ID corrupt verifier issue

  Other fix:
   - Fix double frees in nfs4_test_session_trunk()"

* tag 'nfs-for-4.13-4' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4: Fix double frees in nfs4_test_session_trunk()
  NFSv4: Fix EXCHANGE_ID corrupt verifier issue
2017-08-02 20:56:44 -07:00
Jan Kara
19ec8e4858 ocfs2: don't clear SGID when inheriting ACLs
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0').  However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.

Fix the problem by moving posix_acl_update_mode() out of ocfs2_set_acl()
into ocfs2_iop_set_acl().  That way the function will not be called when
inheriting ACLs which is what we want as it prevents SGID bit clearing
and the mode has been properly set by posix_acl_create() anyway.  Also
posix_acl_chmod() that is calling ocfs2_set_acl() takes care of updating
mode itself.

Fixes: 073931017b ("posix_acl: Clear SGID bit when setting file permissions")
Link: http://lkml.kernel.org/r/20170801141252.19675-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02 17:16:13 -07:00
Mike Rapoport
5a18b64e3f userfaultfd: non-cooperative: flush event_wqh at release time
There may still be threads waiting on event_wqh at the time the
userfault file descriptor is closed.  Flush the events wait-queue to
prevent waiting threads from hanging.

Link: http://lkml.kernel.org/r/1501398127-30419-1-git-send-email-rppt@linux.vnet.ibm.com
Fixes: 9cd75c3cd4 ("userfaultfd: non-cooperative: add ability to report
non-PF events from uffd descriptor")
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02 17:16:13 -07:00
Mike Rapoport
9d95aa4bad userfaultfd_zeropage: return -ENOSPC in case mm has gone
In the non-cooperative userfaultfd case, the process exit may race with
outstanding mcopy_atomic called by the uffd monitor.  Returning -ENOSPC
instead of -EINVAL when mm is already gone will allow uffd monitor to
distinguish this case from other error conditions.

Unfortunately I overlooked userfaultfd_zeropage when updating
userfaultd_copy().

Link: http://lkml.kernel.org/r/1501136819-21857-1-git-send-email-rppt@linux.vnet.ibm.com
Fixes: 96333187ab ("userfaultfd_copy: return -ENOSPC in case mm has gone")
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02 17:16:12 -07:00
Trond Myklebust
d9cb73300a NFSv4: Fix double frees in nfs4_test_session_trunk()
rpc_clnt_add_xprt() expects the callback function to be synchronous, and
expects to release the transport and switch references itself.

Fixes: 04fa2c6bb5 ("NFS pnfs data server multipath session trunking")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-02 09:45:55 -04:00
J. Bruce Fields
0020939f20 nfsd4: move some nfsd4 op definitions to xdr4.h
I want code in nfs4xdr.c to have access to this stuff.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-01 17:36:35 -04:00
Trond Myklebust
fd40559c86 NFSv4: Fix EXCHANGE_ID corrupt verifier issue
The verifier is allocated on the stack, but the EXCHANGE_ID RPC call was
changed to be asynchronous by commit 8d89bd70bc. If we interrrupt
the call to rpc_wait_for_completion_task(), we can therefore end up
transmitting random stack contents in lieu of the verifier.

Fixes: 8d89bd70bc ("NFS setup async exchange_id")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-01 16:28:55 -04:00
Kees Cook
fe8993b3a0 exec: Consolidate pdeath_signal clearing
Instead of an additional secureexec check for pdeath_signal, just move it
up into the initial secureexec test. Neither perf nor arch code touches
pdeath_signal, so the relocation shouldn't change anything.

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
2017-08-01 12:03:14 -07:00
Kees Cook
64701dee41 exec: Use sane stack rlimit under secureexec
For a secureexec, before memory layout selection has happened, reset the
stack rlimit to something sane to avoid the caller having control over
the resulting layouts.

$ ulimit -s
8192
$ ulimit -s unlimited
$ /bin/sh -c 'ulimit -s'
unlimited
$ sudo /bin/sh -c 'ulimit -s'
8192

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
2017-08-01 12:03:14 -07:00
Kees Cook
473d89639d exec: Consolidate dumpability logic
Since it's already valid to set dumpability in the early part of
setup_new_exec(), we can consolidate the logic into a single place.
The BINPRM_FLAGS_ENFORCE_NONDUMP is set during would_dump() calls
before setup_new_exec(), so its test is safe to move as well.

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
2017-08-01 12:03:13 -07:00
Kees Cook
a70423dfbc exec: Use secureexec for clearing pdeath_signal
Like dumpability, clearing pdeath_signal happens both in setup_new_exec()
and later in commit_creds(). The test in setup_new_exec() is different
from all other privilege comparisons, though: it is checking the new cred
(bprm) uid vs the old cred (current) euid. This appears to be a bug,
introduced by commit a6f76f23d2 ("CRED: Make execve() take advantage of
copy-on-write credentials"):

-       if (bprm->e_uid != current_euid() ||
-           bprm->e_gid != current_egid()) {
-               set_dumpable(current->mm, suid_dumpable);
+       if (bprm->cred->uid != current_euid() ||
+           bprm->cred->gid != current_egid()) {

It was bprm euid vs current euid (and egids), but the effective got
dropped. Nothing in the exec flow changes bprm->cred->uid (nor gid).
The call traces are:

	prepare_bprm_creds()
	    prepare_exec_creds()
	        prepare_creds()
	            memcpy(new_creds, old_creds, ...)
	            security_prepare_creds() (unimplemented by commoncap)
	...
	prepare_binprm()
	    bprm_fill_uid()
	        resets euid/egid to current euid/egid
	        sets euid/egid on bprm based on set*id file bits
	    security_bprm_set_creds()
		cap_bprm_set_creds()
		        handle all caps-based manipulations

so this test is effectively a test of current_uid() vs current_euid(),
which is wrong, just like the prior dumpability tests were wrong.

The commit log says "Clear pdeath_signal and set dumpable on
certain circumstances that may not be covered by commit_creds()." This
may be meaning the earlier old euid vs new euid (and egid) test that
got changed.

Luckily, as with dumpability, this is all masked by commit_creds()
which performs old/new euid and egid tests and clears pdeath_signal.

And again, like dumpability, we should include LSM secureexec logic for
pdeath_signal clearing. For example, Smack goes out of its way to clear
pdeath_signal when it finds a secureexec condition.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
2017-08-01 12:03:11 -07:00
Kees Cook
e37fdb785a exec: Use secureexec for setting dumpability
The examination of "current" to decide dumpability is wrong. This was a
check of and euid/uid (or egid/gid) mismatch in the existing process,
not the newly created one. This appears to stretch back into even the
"history.git" tree. Luckily, dumpability is later set in commit_creds().
In earlier kernel versions before creds existed, similar checks also
existed late in the exec flow, covering up the mistake as far back as I
could find.

Note that because the commit_creds() check examines differences of euid,
uid, egid, gid, and capabilities between the old and new creds, it would
look like the setup_new_exec() dumpability test could be entirely removed.
However, the secureexec test may cover a different set of tests (specific
to the LSMs) than what commit_creds() checks for. So, fix this test to
use secureexec (the removed euid tests are redundant to the commoncap
secureexec checks now).

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
2017-08-01 12:03:10 -07:00
Kees Cook
2af6228026 LSM: drop bprm_secureexec hook
This removes the bprm_secureexec hook since the logic has been folded into
the bprm_set_creds hook for all LSMs now.

Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: John Johansen <john.johansen@canonical.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
2017-08-01 12:03:10 -07:00
Kees Cook
46d98eb4e1 commoncap: Refactor to remove bprm_secureexec hook
The commoncap implementation of the bprm_secureexec hook is the only LSM
that depends on the final call to its bprm_set_creds hook (since it may
be called for multiple files, it ignores bprm->called_set_creds). As a
result, it cannot safely _clear_ bprm->secureexec since other LSMs may
have set it.  Instead, remove the bprm_secureexec hook by introducing a
new flag to bprm specific to commoncap: cap_elevated. This is similar to
cap_effective, but that is used for a specific subset of elevated
privileges, and exists solely to track state from bprm_set_creds to
bprm_secureexec. As such, it will be removed in the next patch.

Here, set the new bprm->cap_elevated flag when setuid/setgid has happened
from bprm_fill_uid() or fscapabilities have been prepared. This temporarily
moves the bprm_secureexec hook to a static inline. The helper will be
removed in the next patch; this makes the step easier to review and bisect,
since this does not introduce any changes to inputs nor outputs to the
"elevated privileges" calculation.

The new flag is merged with the bprm->secureexec flag in setup_new_exec()
since this marks the end of any further prepare_binprm() calls.

Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: James Morris <james.l.morris@oracle.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
2017-08-01 12:03:08 -07:00
Kees Cook
c425e189ff binfmt: Introduce secureexec flag
The bprm_secureexec hook can be moved earlier. Right now, it is called
during create_elf_tables(), via load_binary(), via search_binary_handler(),
via exec_binprm(). Nearly all (see exception below) state used by
bprm_secureexec is created during the bprm_set_creds hook, called from
prepare_binprm().

For all LSMs (except commoncaps described next), only the first execution
of bprm_set_creds takes any effect (they all check bprm->called_set_creds
which prepare_binprm() sets after the first call to the bprm_set_creds
hook).  However, all these LSMs also only do anything with bprm_secureexec
when they detected a secure state during their first run of bprm_set_creds.
Therefore, it is functionally identical to move the detection into
bprm_set_creds, since the results from secureexec here only need to be
based on the first call to the LSM's bprm_set_creds hook.

The single exception is that the commoncaps secureexec hook also examines
euid/uid and egid/gid differences which are controlled by bprm_fill_uid(),
via prepare_binprm(), which can be called multiple times (e.g.
binfmt_script, binfmt_misc), and may clear the euid/egid for the final
load (i.e. the script interpreter). However, while commoncaps specifically
ignores bprm->cred_prepared, and runs its bprm_set_creds hook each time
prepare_binprm() may get called, it needs to base the secureexec decision
on the final call to bprm_set_creds. As a result, it will need special
handling.

To begin this refactoring, this adds the secureexec flag to the bprm
struct, and calls the secureexec hook during setup_new_exec(). This is
safe since all the cred work is finished (and past the point of no return).
This explicit call will be removed in later patches once the hook has been
removed.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: John Johansen <john.johansen@canonical.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
2017-08-01 12:03:05 -07:00
Kees Cook
a9208e42ba exec: Correct comments about "point of no return"
In commit 221af7f87b ("Split 'flush_old_exec' into two functions"),
the comment about the point of no return should have stayed in
flush_old_exec() since it refers to "bprm->mm = NULL;" line, but prior
changes in commits c89681ed7d ("remove steal_locks()"), and
fd8328be87 ("sanitize handling of shared descriptor tables in failing
execve()") made it look like it meant the current->sas_ss_sp line instead.

The comment was referring to the fact that once bprm->mm is NULL, all
failures from a binfmt load_binary hook (e.g. load_elf_binary), will
get SEGV raised against current. Move this comment and expand the
explanation a bit, putting it above the assignment this time, and add
details about the true nature of "point of no return" being the call
to flush_old_exec() itself.

This also removes an erroneous commet about when credentials are being
installed. That has its own dedicated function, install_exec_creds(),
which carries a similar (and correct) comment, so remove the bogus comment
where installation is not actually happening.

Cc: David Howells <dhowells@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
2017-08-01 12:03:04 -07:00
Kees Cook
ddb4a1442d exec: Rename bprm->cred_prepared to called_set_creds
The cred_prepared bprm flag has a misleading name. It has nothing to do
with the bprm_prepare_cred hook, and actually tracks if bprm_set_creds has
been called. Rename this flag and improve its comment.

Cc: David Howells <dhowells@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: John Johansen <john.johansen@canonical.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
2017-08-01 12:02:48 -07:00
David S. Miller
29fda25a2d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two minor conflicts in virtio_net driver (bug fix overlapping addition
of a helper) and MAINTAINERS (new driver edit overlapping revamp of
PHY entry).

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 10:07:50 -07:00
Jeff Layton
3b49c9a1e9 fs: convert a pile of fsync routines to errseq_t based reporting
This patch converts most of the in-kernel filesystems that do writeback
out of the pagecache to report errors using the errseq_t-based
infrastructure that was recently added. This allows them to report
errors once for each open file description.

Most filesystems have a fairly straightforward fsync operation. They
call filemap_write_and_wait_range to write back all of the data and
wait on it, and then (sometimes) sync out the metadata.

For those filesystems this is a straightforward conversion from calling
filemap_write_and_wait_range in their fsync operation to calling
file_write_and_wait_range.

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-08-01 08:39:29 -04:00
Jeff Layton
d07a6ac7b6 gfs2: convert to errseq_t based writeback error reporting for fsync
Also, fix a place where a writeback error might get dropped in the
gfs2_is_jdata case.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-08-01 08:39:29 -04:00
Jeff Layton
6454568d96 fs: convert sync_file_range to use errseq_t based error-tracking
sync_file_range doesn't call down into the filesystem directly at all.
It only kicks off writeback of pagecache pages and optionally waits
on the result.

Convert sync_file_range to use errseq_t based error tracking, under the
assumption that most users will prefer this behavior when errors occur.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-08-01 08:39:29 -04:00
Chao Yu
ddc34e328d f2fs: introduce f2fs_statfs_project
This patch introduces f2fs_statfs_project, it enables to show usage
status of directory tree which is limited with project quota.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:36 -07:00
Chao Yu
2c1d030569 f2fs: support F2FS_IOC_FS{GET,SET}XATTR
This patch adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR ioctl interface
support for f2fs. The interface is kept consistent with the one
of ext4/xfs.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:35 -07:00
Jaegeuk Kim
b6a245eb34 f2fs: don't need to wait for node writes for atomic write
We have a node chain to serialize node block writes, so if any IOs for
node block writes are reordered, we'll get broken node chain. IOWs,
roll-forward recovery will see all or none node blocks given fsync
mark.

E.g.,
Node chain consists of:
 N1 -> N2 -> N3 -> NFSYNC -> N1' -> N2' -> N'FSYNC

Reordered to:
1) N1 -> N2 -> N3 -> N2' -> NFSYNC -> N'FSYNC -> power-cut
2) N1 -> N2 -> N3 -> N1' -> NFSYNC -> power-cut
3) N1 -> N2 -> NFSYNC -> N1' -> N'FSYNC -> N3 -> power-cut
4) N1 -> NFSYNC -> N1' -> N2' -> N'FSYNC -> N3 -> power-cut

Roll-forward recovery can proceed to:
1) N1 -> N2 -> N3 -> NFSYNC -> X
2) N1 -> N2 -> N3 -> NFSYNC -> N1' -> X
3) N1 -> N2 -> N3 -> FSYNC -> N1' -> X
4) N1 -> X

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:34 -07:00
Jaegeuk Kim
dc6b205510 f2fs: avoid naming confusion of sysfs init
This patch changes the function names of sysfs init to follow ext4.

f2fs_init_sysfs <-> f2fs_register_sysfs
f2fs_exit_sysfs <-> f2fs_unregister_sysfs

Suggested-by: Chao Yu <yuchao0@huawei.com>
Reivewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:33 -07:00
Chao Yu
5c57132eaf f2fs: support project quota
This patch adds to support plain project quota.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:32 -07:00
Chao Yu
a6d3a479ae f2fs: record quota during dot{,dot} recovery
In ->lookup(), we will have a try to recover dot or dotdot for
corrupted directory, once disk quota is on, if it allocates new
block during dotdot recovery, we need to record disk quota info
for the allocation, so this patch fixes this issue by adding
missing dquot_initialize() in __recover_dot_dentries.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:31 -07:00
Chao Yu
7a2af766af f2fs: enhance on-disk inode structure scalability
This patch add new flag F2FS_EXTRA_ATTR storing in inode.i_inline
to indicate that on-disk structure of current inode is extended.

In order to extend, we changed the inode structure a bit:

Original one:

struct f2fs_inode {
	...
	struct f2fs_extent i_ext;
	__le32 i_addr[DEF_ADDRS_PER_INODE];
	__le32 i_nid[DEF_NIDS_PER_INODE];
}

Extended one:

struct f2fs_inode {
        ...
        struct f2fs_extent i_ext;
	union {
		struct {
			__le16 i_extra_isize;
			__le16 i_padding;
			__le32 i_extra_end[0];
		};
		__le32 i_addr[DEF_ADDRS_PER_INODE];
	};
        __le32 i_nid[DEF_NIDS_PER_INODE];
}

Once F2FS_EXTRA_ATTR is set, we will steal four bytes in the head of
i_addr field for storing i_extra_isize and i_padding. with i_extra_isize,
we can calculate actual size of reserved space in i_addr, available
attribute fields included in total extra attribute fields for current
inode can be described as below:

  +--------------------+
  | .i_mode            |
  | ...                |
  | .i_ext             |
  +--------------------+
  | .i_extra_isize     |-----+
  | .i_padding         |     |
  | .i_prjid           |     |
  | .i_atime_extra     |     |
  | .i_ctime_extra     |     |
  | .i_mtime_extra     |<----+
  | .i_inode_cs        |<----- store blkaddr/inline from here
  | .i_xattr_cs        |
  | ...                |
  +--------------------+
  |                    |
  |    block address   |
  |                    |
  +--------------------+
  | .i_nid             |
  +--------------------+
  |   node_footer      |
  | (nid, ino, offset) |
  +--------------------+

Hence, with this patch, we would enhance scalability of f2fs inode for
storing more newly added attribute.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:30 -07:00
Chao Yu
f247037120 f2fs: make max inline size changeable
This patch tries to make below macros calculating max inline size,
inline dentry field size considerring reserving size-changeable
space:
- MAX_INLINE_DATA
- NR_INLINE_DENTRY
- INLINE_DENTRY_BITMAP_SIZE
- INLINE_RESERVED_SIZE

Then, when inline_{data,dentry} options is enabled, it allows us to
reserve inline space with different size flexibly for adding newly
introduced inode attribute.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:29 -07:00
Jaegeuk Kim
e65ef20781 f2fs: add ioctl to expose current features
This patch adds an ioctl to provide feature information to user.
For exapmle, SQLite can use this ioctl to detect whether f2fs support atomic
write or not.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:48:28 -07:00
Chao Yu
dc6febb6bc f2fs: make background threads of f2fs being aware of freezing
When ->freeze_fs is called from lvm for doing snapshot, it needs to
make sure there will be no more changes in filesystem's data, however,
previously, background threads like GC thread wasn't aware of freezing,
so in environment with active background threads, data of snapshot
becomes unstable.

This patch fixes this issue by adding sb_{start,end}_intwrite in
below background threads:
- GC thread
- flush thread
- discard thread

Note that, don't use sb_start_intwrite() in gc_thread_func() due to:

generic/241 reports below bug:

 ======================================================
 WARNING: possible circular locking dependency detected
 4.13.0-rc1+ #32 Tainted: G           O
 ------------------------------------------------------
 f2fs_gc-250:0/22186 is trying to acquire lock:
  (&sbi->gc_mutex){+.+...}, at: [<f8fa7f0b>] f2fs_sync_fs+0x7b/0x1b0 [f2fs]

 but task is already holding lock:
  (sb_internal#2){++++.-}, at: [<f8fb5609>] gc_thread_func+0x159/0x4a0 [f2fs]

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #2 (sb_internal#2){++++.-}:
        __lock_acquire+0x405/0x7b0
        lock_acquire+0xae/0x220
        __sb_start_write+0x11d/0x1f0
        f2fs_evict_inode+0x2d6/0x4e0 [f2fs]
        evict+0xa8/0x170
        iput+0x1fb/0x2c0
        f2fs_sync_inode_meta+0x3f/0xf0 [f2fs]
        write_checkpoint+0x1b1/0x750 [f2fs]
        f2fs_sync_fs+0x85/0x1b0 [f2fs]
        f2fs_do_sync_file.isra.24+0x137/0xa30 [f2fs]
        f2fs_sync_file+0x34/0x40 [f2fs]
        vfs_fsync_range+0x4a/0xa0
        do_fsync+0x3c/0x60
        SyS_fdatasync+0x15/0x20
        do_fast_syscall_32+0xa1/0x1b0
        entry_SYSENTER_32+0x4c/0x7b

 -> #1 (&sbi->cp_mutex){+.+...}:
        __lock_acquire+0x405/0x7b0
        lock_acquire+0xae/0x220
        __mutex_lock+0x4f/0x830
        mutex_lock_nested+0x25/0x30
        write_checkpoint+0x2f/0x750 [f2fs]
        f2fs_sync_fs+0x85/0x1b0 [f2fs]
        sync_filesystem+0x67/0x80
        generic_shutdown_super+0x27/0x100
        kill_block_super+0x22/0x50
        kill_f2fs_super+0x3a/0x40 [f2fs]
        deactivate_locked_super+0x3d/0x70
        deactivate_super+0x40/0x60
        cleanup_mnt+0x39/0x70
        __cleanup_mnt+0x10/0x20
        task_work_run+0x69/0x80
        exit_to_usermode_loop+0x57/0x92
        do_fast_syscall_32+0x18c/0x1b0
        entry_SYSENTER_32+0x4c/0x7b

 -> #0 (&sbi->gc_mutex){+.+...}:
        validate_chain.isra.36+0xc50/0xdb0
        __lock_acquire+0x405/0x7b0
        lock_acquire+0xae/0x220
        __mutex_lock+0x4f/0x830
        mutex_lock_nested+0x25/0x30
        f2fs_sync_fs+0x7b/0x1b0 [f2fs]
        f2fs_balance_fs_bg+0xb9/0x200 [f2fs]
        gc_thread_func+0x302/0x4a0 [f2fs]
        kthread+0xe9/0x120
        ret_from_fork+0x19/0x24

 other info that might help us debug this:

 Chain exists of:
   &sbi->gc_mutex --> &sbi->cp_mutex --> sb_internal#2

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(sb_internal#2);
                                lock(&sbi->cp_mutex);
                                lock(sb_internal#2);
   lock(&sbi->gc_mutex);

  *** DEADLOCK ***

 1 lock held by f2fs_gc-250:0/22186:
  #0:  (sb_internal#2){++++.-}, at: [<f8fb5609>] gc_thread_func+0x159/0x4a0 [f2fs]

 stack backtrace:
 CPU: 2 PID: 22186 Comm: f2fs_gc-250:0 Tainted: G           O    4.13.0-rc1+ #32
 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
 Call Trace:
  dump_stack+0x5f/0x92
  print_circular_bug+0x1b3/0x1bd
  validate_chain.isra.36+0xc50/0xdb0
  ? __this_cpu_preempt_check+0xf/0x20
  __lock_acquire+0x405/0x7b0
  lock_acquire+0xae/0x220
  ? f2fs_sync_fs+0x7b/0x1b0 [f2fs]
  __mutex_lock+0x4f/0x830
  ? f2fs_sync_fs+0x7b/0x1b0 [f2fs]
  mutex_lock_nested+0x25/0x30
  ? f2fs_sync_fs+0x7b/0x1b0 [f2fs]
  f2fs_sync_fs+0x7b/0x1b0 [f2fs]
  f2fs_balance_fs_bg+0xb9/0x200 [f2fs]
  gc_thread_func+0x302/0x4a0 [f2fs]
  ? preempt_schedule_common+0x2f/0x4d
  ? f2fs_gc+0x540/0x540 [f2fs]
  kthread+0xe9/0x120
  ? f2fs_gc+0x540/0x540 [f2fs]
  ? kthread_create_on_node+0x30/0x30
  ret_from_fork+0x19/0x24

The deadlock occurs in below condition:
GC Thread			Thread B
- sb_start_intwrite
				- f2fs_sync_file
				 - f2fs_sync_fs
				  - mutex_lock(&sbi->gc_mutex)
				   - write_checkpoint
				    - block_operations
				     - f2fs_sync_inode_meta
				      - iput
				       - sb_start_intwrite
 - mutex_lock(&sbi->gc_mutex)

Fix this by altering sb_start_intwrite to sb_start_write_trylock.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-31 16:47:27 -07:00
Jeff Layton
7e51fe1dd1 fuse: convert to errseq_t based error tracking for fsync
Change to file_write_and_wait_range and
file_check_and_advance_wb_err

Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-31 19:12:25 -04:00
Jeff Layton
9c5d58fb9e ext4: convert swap_inode_data() over to use swap() on most of the fields
For some odd reason, it forces a byte-by-byte copy of each field. A
plain old swap() on most of these fields would be more efficient. We
do need to retain the memswap of i_data however as that field is an array.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-31 00:55:34 -04:00
Emoly Liu
191eac3300 ext4: error should be cleared if ea_inode isn't added to the cache
For Lustre, if ea_inode fails in hash validation but passes parent
inode and generation checks, it won't be added to the cache as well
as the error "-EFSCORRUPTED" should be cleared, otherwise it will
cause "Structure needs cleaning" when running getfattr command.

Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-9723

Cc: stable@vger.kernel.org
Fixes: dec214d00e
Signed-off-by: Emoly Liu <emoly.liu@intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: tahsin@google.com
2017-07-31 00:40:22 -04:00
Jan Kara
a3bb2d5587 ext4: Don't clear SGID when inheriting ACLs
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.

Fix the problem by moving posix_acl_update_mode() out of
__ext4_set_acl() into ext4_set_acl(). That way the function will not be
called when inheriting ACLs which is what we want as it prevents SGID
bit clearing and the mode has been properly set by posix_acl_create()
anyway.

Fixes: 073931017b
CC: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2017-07-30 23:33:01 -04:00
Ernesto A. Fernández
397e434176 ext4: preserve i_mode if __ext4_set_acl() fails
When changing a file's acl mask, __ext4_set_acl() will first set the group
bits of i_mode to the value of the mask, and only then set the actual
extended attribute representing the new acl.

If the second part fails (due to lack of space, for example) and the file
had no acl attribute to begin with, the system will from now on assume
that the mask permission bits are actual group permission bits, potentially
granting access to the wrong users.

Prevent this by only changing the inode mode after the acl has been set.

Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-30 22:43:41 -04:00
Eric Whitney
a627b0a7c1 ext4: remove unused metadata accounting variables
Two variables in ext4_inode_info, i_reserved_meta_blocks and
i_allocated_meta_blocks, are unused.  Removing them saves a little
memory per in-memory inode and cleans up clutter in several tracepoints.
Adjust tracepoint output from ext4_alloc_da_blocks() for consistency
and fix a typo and whitespace near these changes.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-30 22:30:11 -04:00
Eric Whitney
1e21196c8e ext4: correct comment references to ext4_ext_direct_IO()
Commit 914f82a32d "ext4: refactor direct IO code" deleted
ext4_ext_direct_IO(), but references to that function remain in
comments.  Update them to refer to ext4_direct_IO_write().

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-30 22:26:40 -04:00
Shaohua Li
69fd5c3917 blktrace: add an option to allow displaying cgroup path
By default we output cgroup id in blktrace. This adds an option to
display cgroup path. Since get cgroup path is a relativly heavy
operation, we don't enable it by default.

with the option enabled, blktrace will output something like this:
dd-1353  [007] d..2   293.015252:   8,0   /test/level  D   R 24 + 8 [dd]

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-29 09:00:03 -06:00
Shaohua Li
aa81882534 kernfs: add exportfs operations
Now we have the facilities to implement exportfs operations. The idea is
cgroup can export the fhandle info to userspace, then userspace uses
fhandle to find the cgroup name. Another example is userspace can get
fhandle for a cgroup and BPF uses the fhandle to filter info for the
cgroup.

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-29 09:00:03 -06:00
Shaohua Li
c53cd490b1 kernfs: introduce kernfs_node_id
inode number and generation can identify a kernfs node. We are going to
export the identification by exportfs operations, so put ino and
generation into a separate structure. It's convenient when later patches
use the identification.

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-29 09:00:03 -06:00
Shaohua Li
319ba91d35 kernfs: don't set dentry->d_fsdata
When working on adding exportfs operations in kernfs, I found it's hard
to initialize dentry->d_fsdata in the exportfs operations. Looks there
is no way to do it without race condition. Look at the kernfs code
closely, there is no point to set dentry->d_fsdata. inode->i_private
already points to kernfs_node, and we can get inode from a dentry. So
this patch just delete the d_fsdata usage.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-29 09:00:03 -06:00
Shaohua Li
ba16b2846a kernfs: add an API to get kernfs node from inode number
Add an API to get kernfs node from inode number. We will need this to
implement exportfs operations.

This API will be used in blktrace too later, so it should be as fast as
possible. To make the API lock free, kernfs node is freed in RCU
context. And we depend on kernfs_node count/ino number to filter out
stale kernfs nodes.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-29 09:00:03 -06:00
Shaohua Li
4a3ef68aca kernfs: implement i_generation
Set i_generation for kernfs inode. This is required to implement
exportfs operations. The generation is 32-bit, so it's possible the
generation wraps up and we find stale files. To reduce the posssibility,
we don't reuse inode numer immediately. When the inode number allocation
wraps, we increase generation number. In this way generation/inode
number consist of a 64-bit number which is unlikely duplicated. This
does make the idr tree more sparse and waste some memory. Since idr
manages 32-bit keys, idr uses a 6-level radix tree, each level covers 6
bits of the key. In a 100k inode kernfs, the worst case will have around
300k radix tree node. Each node is 576bytes, so the tree will use about
~150M memory. Sounds not too bad, if this really is a problem, we should
find better data structure.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-29 09:00:03 -06:00
Shaohua Li
7d35079f82 kernfs: use idr instead of ida to manage inode number
kernfs uses ida to manage inode number. The problem is we can't get
kernfs_node from inode number with ida. Switching to use idr, next patch
will add an API to get kernfs_node from inode number.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-29 09:00:03 -06:00
Jaegeuk Kim
7a10f0177e f2fs: don't give partially written atomic data from process crash
This patch resolves the below scenario.

== Process 1 ==     == Process 2 ==
open(w)             open(rw)
begin
write(new_#1)
process_crash
  f_op->flush
  locks_remove_posix
  f_op>release
                    read (new_#1)

In order to avoid corrupted database caused by new_#1, we must do roll-back
at process_crash time. In order to check that, this patch keeps task which
triggers transaction begin, and does roll-back in f_op->flush before removing
file locks.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-28 17:49:01 -07:00
Jaegeuk Kim
640cc18982 f2fs: give a try to do atomic write in -ENOMEM case
It'd be better to retry writing atomic pages when we get -ENOMEM.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-28 17:49:00 -07:00
Ernesto A. Fernández
14af20fcb1 f2fs: preserve i_mode if __f2fs_set_acl() fails
When changing a file's acl mask, __f2fs_set_acl() will first set the
group bits of i_mode to the value of the mask, and only then set the
actual extended attribute representing the new acl.

If the second part fails (due to lack of space, for example) and the
file had no acl attribute to begin with, the system will from now on
assume that the mask permission bits are actual group permission bits,
potentially granting access to the wrong users.

Prevent this by only changing the inode mode after the acl has been set.

Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-28 17:48:54 -07:00
Linus Torvalds
286ba844c5 More NFS client bugfixes for 4.13
Stable fixes:
 - Fix a race where CB_NOTIFY_LOCK fails to wake a waiter
 - Invalidate file size when taking a lock to prevent corruption
 
 Other fixes:
 - Don't excessively generate tiny writes with fallocate
 - Use the raw NFS access mask in nfs4_opendata_access()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAll7l0oACgkQ18tUv7Cl
 QOuL3w/+M5I5xKKrMOjg2cfzdFAn+syTmXYK0HFrxp76CiaNBcQFK5kG1ebkdrTM
 EDXWznamWRTCIbg0U7/3X/763yWUyZM8RSC0nJXyt7FNZitg1Hsvw/OawaM2Z8q4
 TQQPqelEAhAG7zbgyCCe+SuAoAOTq7HpX2wru8gK6POBOP6gmNEJtchBzqrlsq8d
 bSH8E7BLhbRIwC3htsPfTW0NZtqpp7u/wKeLtt/ZGIuM/+78iOa1wMwCESVJfd47
 2DpDmS0LxVtAjs8lCjMBKyrypNEvh+evgdbxeiXG/T6ykzWhBy96OSOm7ooIjqOr
 pkptxrKOBGb9/8rMnxjCKRIfjgVz77GfB2jD+/ILzRP1E0xipHXWCc5XqzBP999l
 zqUVDMPH3zrq8lxmC9FgoY1PJAcRrZ/aEIjozwkVcksTDYx+GJPYMR6Wks9/1cT0
 4jLTRsBgckj9b3FcjsCiyavHBweDChCEgzx5CLpEqH1KKCfT6MnLTb/WoJL/s1R8
 MLb0MC5PMpLP4OvCRR+mCg+dJD2nXF/Cz2E9r3SSlhZiDsNWmdBXk2A5XoAFzY5l
 pQeqkogBdiINu7p/G3n7837ThRUGV+04C9D9WDI7IF/dktOyYYO/4DNDVYEiFqKL
 9v8Hc4EyGwR2dY5iEKaSuNEk8zTxL1ZGv1H8WTCSDmNRQ+64Q3s=
 =OHnE
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.13-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "More NFS client bugfixes for 4.13.

  Most of these fix locking bugs that Ben and Neil noticed, but I also
  have a patch to fix one more access bug that was reported after last
  week.

  Stable fixes:
   - Fix a race where CB_NOTIFY_LOCK fails to wake a waiter
   - Invalidate file size when taking a lock to prevent corruption

  Other fixes:
   - Don't excessively generate tiny writes with fallocate
   - Use the raw NFS access mask in nfs4_opendata_access()"

* tag 'nfs-for-4.13-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4.1: Fix a race where CB_NOTIFY_LOCK fails to wake a waiter
  NFS: Optimize fallocate by refreshing mapping when needed.
  NFS: invalidate file size when taking a lock.
  NFS: Use raw NFS access mask in nfs4_opendata_access()
2017-07-28 14:44:56 -07:00
Linus Torvalds
19993e7378 Changes since last update:
- Fix firstfsb variables that we left uninitialized, which could lead to
   locking problems.
 - Check for NULL metadata buffer pointers before using them.
 - Don't allow btree cursor manipulation if the btree block is corrupt.
   Better to just shut down.
 - Fix infinite loop problems in quotacheck.
 - Fix buffer overrun when validating directory blocks.
 - Fix deadlock problem in bunmapi.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJZeXhTAAoJEPh/dxk0SrTrUK8P/RvwLgEflTEUQjjNoakhOuWV
 yYlXJz2ArWbG/w8vtW4rb6gnExJ6OmJ4EUZWe78g0oTpo9vmS7GuHE0HBXS8AiNe
 9Y87GBIQLRi6BOlY9wfiKcLyA3u/buLSAkFhjulA+ARRIS2G3pW/PkzOfl1tIJhl
 rpL/xJ8TAcNz5LLu/znwebtnIMbaplMdV80b4dOHoNvYC0mYaOFTRiyXANqdCnKx
 C4tYyKkkQHYDjyXjOwJt8I8CUvcbMrOVQd1E1px+n2L9O81dUP04PhF8N0vPZl/Q
 ueP83KRqCAm89HMc2P/P0bkBZmbFUtgtMA67oOUxx66crDWEExGRhqZ1+/UgJsAg
 t5yFg3+QwgaXhAXcZZrvGGMT0b3L6ew5//dhY8XcMq2xKpKrxls/RtQrw3Lux+qx
 lHhGIAyd15LBKHWARwGXC315gOMfLnUuWhG63pOygL4PrVvOY22Axj5YdRJD5J6E
 Z4oRzqhQngeLrbfbj73DQFcGxdeEUodB+Pz8uTQ+6pfy5JU3dMzdI16ekX1bgZV3
 qFFMRR77a4RpAWYy27LYeaa8NTAEQEahKdRWXofjgKjfsvgnxe+cqhoSdkExAX0c
 MM0DtXMo2dMjpsajNCo971jPK89a07dY+6Y9COwSMV2vD8Ml43v2F9nS9kcQlUAP
 H6kdr19vd5p/BBH1P+lu
 =TKiL
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.13-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:

 - fix firstfsb variables that we left uninitialized, which could lead
   to locking problems.

 - check for NULL metadata buffer pointers before using them.

 - don't allow btree cursor manipulation if the btree block is corrupt.
   Better to just shut down.

 - fix infinite loop problems in quotacheck.

 - fix buffer overrun when validating directory blocks.

 - fix deadlock problem in bunmapi.

* tag 'xfs-4.13-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix multi-AG deadlock in xfs_bunmapi
  xfs: check that dir block entries don't off the end of the buffer
  xfs: fix quotacheck dquot id overflow infinite loop
  xfs: check _alloc_read_agf buffer pointer before using
  xfs: set firstfsb to NULLFSBLOCK before feeding it to _bmapi_write
  xfs: check _btree_check_block value
2017-07-28 14:29:48 -07:00
Benjamin Coddington
b7dbcc0e43 NFSv4.1: Fix a race where CB_NOTIFY_LOCK fails to wake a waiter
nfs4_retry_setlk() sets the task's state to TASK_INTERRUPTIBLE within the
same region protected by the wait_queue's lock after checking for a
notification from CB_NOTIFY_LOCK callback.  However, after releasing that
lock, a wakeup for that task may race in before the call to
freezable_schedule_timeout_interruptible() and set TASK_WAKING, then
freezable_schedule_timeout_interruptible() will set the state back to
TASK_INTERRUPTIBLE before the task will sleep.  The result is that the task
will sleep for the entire duration of the timeout.

Since we've already set TASK_INTERRUPTIBLE in the locked section, just use
freezable_schedule_timout() instead.

Fixes: a1d617d8f1 ("nfs: allow blocking locks to be awoken by lock callbacks")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-28 15:35:30 -04:00
Linus Torvalds
0a2a1330d2 Merge branch 'for-4.13-part3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "Fixes addressing problems reported by users, and there's one more
  regression fix"

* 'for-4.13-part3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: round down size diff when shrinking/growing device
  Btrfs: fix early ENOSPC due to delalloc
  btrfs: fix lockup in find_free_extent with read-only block groups
  Btrfs: fix dir item validation when replaying xattr deletes
2017-07-28 12:26:59 -07:00
Miklos Szeredi
4edb83bb10 ovl: constant d_ino for non-merge dirs
Impure directories are ones which contain objects with origins (i.e. those
that have been copied up).  These are relevant to readdir operation only
because of the d_ino field, no other transformation is necessary.  Also a
directory can become impure between two getdents(2) calls.

This patch creates a cache for impure directories.  Unlike the cache for
merged directories, this one only contains entries with origin and is not
refcounted but has a its lifetime tied to that of the dentry.

Similarly to the merged cache, the impure cache is invalidated based on a
version number.  This version number is incremented when an entry with
origin is added or removed from the directory.

If the cache is empty, then the impure xattr is removed from the directory.

This patch also fixes up handling of d_ino for the ".." entry if the parent
directory is merged.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-27 21:54:06 +02:00
Amir Goldstein
b5efccbe0a ovl: constant d_ino across copy up
When all layers are on the same fs, and iterating a directory which may
contain copy up entries, call vfs_getattr() on the overlay entries to make
sure that d_ino will be consistent with st_ino from stat(2).

There is an overhead of lookup per upper entry in readdir.

The overhead is minimal if the iterated entries are already in dcache.  It
is also quite useful for the common case of 'ls -l' that readdir() pre
populates the dcache with the listed entries, making the following stat()
calls faster.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-27 21:54:06 +02:00
Miklos Szeredi
31e8ccea3c ovl: fix readdir error value
actor's return value is taken as a bool (filled/not filled) so we need to
return the error in the context.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-27 21:54:06 +02:00
Miklos Szeredi
6787341a0f ovl: check snprintf return
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-27 21:54:05 +02:00
NeilBrown
6ba80d4348 NFS: Optimize fallocate by refreshing mapping when needed.
posix_fallocate() will allocate space in an NFS file by considering
the last byte of every 4K block.  If it is before EOF, it will read
the byte and if it is zero, a zero is written out.  If it is after EOF,
the zero is unconditionally written.

For the blocks beyond EOF, if NFS believes its cache is valid, it will
expand these writes to write full pages, and then will merge the pages.
This results if (typically) 1MB writes.  If NFS believes its cache is
not valid (particularly if NFS_INO_INVALID_DATA or
NFS_INO_REVAL_PAGECACHE are set - see nfs_write_pageuptodate()), it will
send the individual 1-byte writes. This results in (typically) 256 times
as many RPC requests, and can be substantially slower.

Currently nfs_revalidate_mapping() is only used when reading a file or
mmapping a file, as these are times when the content needs to be
up-to-date.  Writes don't generally need the cache to be up-to-date, but
writes beyond EOF can benefit, particularly in the posix_fallocate()
case.

So this patch calls nfs_revalidate_mapping() when writing beyond EOF -
i.e. when there is a gap between the end of the file and the start of
the write.  If the cache is thought to be out of date (as happens after
taking a file lock), this will cause a GETATTR, and the two flags
mentioned above will be cleared.  With this, posix_fallocate() on a
newly locked file does not generate excessive tiny writes.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-27 11:22:42 -04:00
NeilBrown
442ce0499c NFS: invalidate file size when taking a lock.
Prior to commit ca0daa277a ("NFS: Cache aggressively when file is open
for writing"), NFS would revalidate, or invalidate, the file size when
taking a lock.  Since that commit it only invalidates the file content.

If the file size is changed on the server while wait for the lock, the
client will have an incorrect understanding of the file size and could
corrupt data.  This particularly happens when writing beyond the
(supposed) end of file and can be easily be demonstrated with
posix_fallocate().

If an application opens an empty file, waits for a write lock, and then
calls posix_fallocate(), glibc will determine that the underlying
filesystem doesn't support fallocate (assuming version 4.1 or earlier)
and will write out a '0' byte at the end of each 4K page in the region
being fallocated that is after the end of the file.
NFS will (usually) detect that these writes are beyond EOF and will
expand them to cover the whole page, and then will merge the pages.
Consequently, NFS will write out large blocks of zeroes beyond where it
thought EOF was.  If EOF had moved, the pre-existing part of the file
will be over-written.  Locking should have protected against this,
but it doesn't.

This patch restores the use of nfs_zap_caches() which invalidated the
cached attributes.  When posix_fallocate() asks for the file size, the
request will go to the server and get a correct answer.

cc: stable@vger.kernel.org (v4.8+)
Fixes: ca0daa277a ("NFS: Cache aggressively when file is open for writing")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-27 11:22:42 -04:00
Yunlei He
8790568255 f2fs: alloc new nids for xattr block in recovery
recovery file A:			recovery file B:
	-get_dnode_of_data
		-alloc_nid
					-recover_xattr_data
						-set_node_addr(sbi, &ni, NEW_ADDR, false);
							--->bug_on for nid has been used by file A

In recovery process, new allocated node blocks may "reuse" xattr block
nids, this patch alloc new nids for xattr blocks in recovery process to
avoid this problem.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-26 19:34:30 -07:00
Chao Yu
76a9dd85d4 f2fs: spread struct f2fs_dentry_ptr for inline path
Use f2fs_dentry_ptr structure to indicate inline dentry structure as
much as possible, so we can wrap inline dentry with size-fixed fields
to the one with size-changeable fields. With this change, we can
handle size-changeable inline dentry more easily.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-26 19:34:30 -07:00
Yunlei He
5f4ce6abc2 f2fs: remove unused input parameter
This patch remove unused input parameter in function
new_node_page.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Yong Sheng <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-26 19:34:30 -07:00
Anna Schumaker
1e6f209515 NFS: Use raw NFS access mask in nfs4_opendata_access()
Commit bd8b244174 ("NFS: Store the raw NFS access mask in the inode's
access cache") changed how the access results are stored after an
access() call.  An NFS v4 OPEN might have access bits returned with the
opendata, so we should use the NFS4_ACCESS values when determining the
return value in nfs4_opendata_access().

Fixes: bd8b244174 ("NFS: Store the raw NFS access mask in the inode's
access cache")
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Tested-by: Takashi Iwai <tiwai@suse.de>
2017-07-26 16:53:57 -04:00
Christoph Hellwig
5b094d6dac xfs: fix multi-AG deadlock in xfs_bunmapi
Just like in the allocator we must avoid touching multiple AGs out of
order when freeing blocks, as freeing still locks the AGF and can cause
the same AB-BA deadlocks as in the allocation path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-26 08:20:03 -07:00
Linus Torvalds
25f6a53799 JFS fixes for 4.13
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAll3YIwACgkQNqiEXrVA
 jGQE9w//b/MyL9MtvGcKI7u3V7RrqbodzoL42KxV98TI3y7rpmUcmRsqDT045ufD
 S+AOqnIhvH2XbsF7jvZ0UROUtErBgh+pIRqCctVqQM+GKE7p/KR0rY/4eMPlbqQL
 Q2L0ZbokHU4mgvo7SqSJkYFRd0PPBaaw4aJaf8gg1g0pCb29jcer4ycKKHCjz0Wz
 YS2/v7NVWysehihiz/JF6ga5/n6VZeXq3fa48Mmt4TDzKt4IaqgLtAEbPa+eddwW
 z/t4YIoiQ4JUYPnLNmG1Kd8lACfeqKkr2WIh3143ipof4Tj3ZaNc315wVQUl58LP
 6ZUg12tLDVH9OOPdI9VsVLostRbyKH3yUazULBXWIQQt6ZWcFiKFbturMlZevSJB
 OuGBsfpDDvkd+2n2iNcwane/+Ouw4LChzUvmp9/MuIRxinRVdXAVRjaUb2M54xSW
 qR/Yfw8qyky9tac1c1Md/bVo7oMEw0Xliv3HxmGKHNPLMOLzwfVTAw0gGXSrGFn8
 veUKCl1J1+9NWoNExDkyUjsmD1CdkhQk1gpmU70KWQlCgKCtBhWiX6rEy0l8pw1t
 G5UmFpgqG8g61ODuW0dbnSdImmS8mcbY4I1lSvQFb3qVtCeBLz9q7OKSCuavbs4M
 egtaUNVMJ22dVv12loj2OOyVdneUXbUB0WVga3kfEq7QRCtSjF4=
 =fVR9
 -----END PGP SIGNATURE-----

Merge tag 'jfs-4.13' of git://github.com/kleikamp/linux-shaggy

Pull JFS fixes from David Kleikamp.

* tag 'jfs-4.13' of git://github.com/kleikamp/linux-shaggy:
  jfs: preserve i_mode if __jfs_set_acl() fails
  jfs: Don't clear SGID when inheriting ACLs
  jfs: atomically read inode size
2017-07-25 08:51:57 -07:00
Darrick J. Wong
6215894e11 xfs: check that dir block entries don't off the end of the buffer
When we're checking the entries in a directory buffer, make sure that
the entry length doesn't push us off the end of the buffer.  Found via
xfs/388 writing ones to the length fields.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-07-25 08:36:35 -07:00
Eric W. Biederman
cc731525f2 signal: Remove kernel interal si_code magic
struct siginfo is a union and the kernel since 2.4 has been hiding a union
tag in the high 16bits of si_code using the values:
__SI_KILL
__SI_TIMER
__SI_POLL
__SI_FAULT
__SI_CHLD
__SI_RT
__SI_MESGQ
__SI_SYS

While this looks plausible on the surface, in practice this situation has
not worked well.

- Injected positive signals are not copied to user space properly
  unless they have these magic high bits set.

- Injected positive signals are not reported properly by signalfd
  unless they have these magic high bits set.

- These kernel internal values leaked to userspace via ptrace_peek_siginfo

- It was possible to inject these kernel internal values and cause the
  the kernel to misbehave.

- Kernel developers got confused and expected these kernel internal values
  in userspace in kernel self tests.

- Kernel developers got confused and set si_code to __SI_FAULT which
  is SI_USER in userspace which causes userspace to think an ordinary user
  sent the signal and that it was not kernel generated.

- The values make it impossible to reorganize the code to transform
  siginfo_copy_to_user into a plain copy_to_user.  As si_code must
  be massaged before being passed to userspace.

So remove these kernel internal si codes and make the kernel code simpler
and more maintainable.

To replace these kernel internal magic si_codes introduce the helper
function siginfo_layout, that takes a signal number and an si_code and
computes which union member of siginfo is being used.  Have
siginfo_layout return an enumeration so that gcc will have enough
information to warn if a switch statement does not handle all of union
members.

A couple of architectures have a messed up ABI that defines signal
specific duplications of SI_USER which causes more special cases in
siginfo_layout than I would like.  The good news is only problem
architectures pay the cost.

Update all of the code that used the previous magic __SI_ values to
use the new SIL_ values and to call siginfo_layout to get those
values.  Escept where not all of the cases are handled remove the
defaults in the switch statements so that if a new case is missed in
the future the lack will show up at compile time.

Modify the code that copies siginfo si_code to userspace to just copy
the value and not cast si_code to a short first.  The high bits are no
longer used to hold a magic union member.

Fixup the siginfo header files to stop including the __SI_ values in
their constants and for the headers that were missing it to properly
update the number of si_codes for each signal type.

The fixes to copy_siginfo_from_user32 implementations has the
interesting property that several of them perviously should never have
worked as the __SI_ values they depended up where kernel internal.
With that dependency gone those implementations should work much
better.

The idea of not passing the __SI_ values out to userspace and then
not reinserting them has been tested with criu and criu worked without
changes.

Ref: 2.4.0-test1
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-07-24 14:30:28 -05:00
Eric W. Biederman
d08477aa97 fcntl: Don't use ambiguous SIG_POLL si_codes
We have a weird and problematic intersection of features that when
they all come together result in ambiguous siginfo values, that
we can not support properly.

- Supporting fcntl(F_SETSIG,...) with arbitrary valid signals.

- Using positive values for POLL_IN, POLL_OUT, POLL_MSG, ..., etc
  that imply they are signal specific si_codes and using the
  aforementioned arbitrary signal to deliver them.

- Supporting injection of arbitrary siginfo values for debugging and
  checkpoint/restore.

The result is that just looking at siginfo si_codes of 1 to 6 are
ambigious.  It could either be a signal specific si_code or it could
be a generic si_code.

For most of the kernel this is a non-issue but for sending signals
with siginfo it is impossible to play back the kernel signals and
get the same result.

Strictly speaking when the si_code was changed from SI_SIGIO to
POLL_IN and friends between 2.2 and 2.4 this functionality was not
ambiguous, as only real time signals were supported.  Before 2.4 was
released the kernel began supporting siginfo with non realtime signals
so they could give details of why the signal was sent.

The result is that if F_SETSIG is set to one of the signals with signal
specific si_codes then user space can not know why the signal was sent.

I grepped through a bunch of userspace programs using debian code
search to get a feel for how often people choose a signal that results
in an ambiguous si_code.  I only found one program doing so and it was
using SIGCHLD to test the F_SETSIG functionality, and did not appear
to be a real world usage.

Therefore the ambiguity does not appears to be a real world problem in
practice.  Remove the ambiguity while introducing the smallest chance
of breakage by changing the si_code to SI_SIGIO when signals with
signal specific si_codes are targeted.

Fixes: v2.3.40 -- Added support for queueing non-rt signals
Fixes: v2.3.21 -- Changed the si_code from SI_SIGIO
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-07-24 14:29:23 -05:00
Brian Foster
cfaf2d0343 xfs: fix quotacheck dquot id overflow infinite loop
If a dquot has an id of U32_MAX, the next lookup index increment
overflows the uint32_t back to 0. This starts the lookup sequence
over from the beginning, repeats indefinitely and results in a
livelock.

Update xfs_qm_dquot_walk() to explicitly check for the lookup
overflow and exit the loop.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-24 08:33:25 -07:00
Nikolay Borisov
0e4324a4c3 btrfs: round down size diff when shrinking/growing device
Further testing showed that the fix introduced in 7dfb8be11b ("btrfs:
Round down values which are written for total_bytes_size") was
insufficient and it could still lead to discrepancies between the
total_bytes in the super block and the device total bytes. So this patch
also ensures that the difference between old/new sizes when
shrinking/growing is also rounded down. This ensure that we won't be
subtracting/adding a non-sectorsize multiples to the superblock/device
total sizees.

Fixes: 7dfb8be11b ("btrfs: Round down values which are written for total_bytes_size")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-24 16:05:00 +02:00
Omar Sandoval
17024ad0a0 Btrfs: fix early ENOSPC due to delalloc
If a lot of metadata is reserved for outstanding delayed allocations, we
rely on shrink_delalloc() to reclaim metadata space in order to fulfill
reservation tickets. However, shrink_delalloc() has a shortcut where if
it determines that space can be overcommitted, it will stop early. This
made sense before the ticketed enospc system, but now it means that
shrink_delalloc() will often not reclaim enough space to fulfill any
tickets, leading to an early ENOSPC. (Reservation tickets don't care
about being able to overcommit, they need every byte accounted for.)

Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims
all of the metadata it is supposed to. This fixes early ENOSPCs we were
seeing when doing a btrfs receive to populate a new filesystem, as well
as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs.

Fixes: 957780eb27 ("Btrfs: introduce ticketed enospc infrastructure")
Tested-by: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
Cc: stable@vger.kernel.org
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-24 16:04:26 +02:00
Jeff Mahoney
144439376b btrfs: fix lockup in find_free_extent with read-only block groups
If we have a block group that is all of the following:
1) uncached in memory
2) is read-only
3) has a disk cache state that indicates we need to recreate the cache

AND the file system has enough free space fragmentation such that the
request for an extent of a given size can't be honored;

AND have a single CPU core;

AND it's the block group with the highest starting offset such that
there are no opportunities (like reading from disk) for the loop to
yield the CPU;

We can end up with a lockup.

The root cause is simple.  Once we're in the position that we've read in
all of the other block groups directly and none of those block groups
can honor the request, there are no more opportunities to sleep.  We end
up trying to start a caching thread which never gets run if we only have
one core.  This *should* present as a hung task waiting on the caching
thread to make some progress, but it doesn't.  Instead, it degrades into
a busy loop because of the placement of the read-only check.

During the first pass through the loop, block_group->cached will be set
to BTRFS_CACHE_STARTED and have_caching_bg will be set.  Then we hit the
read-only check and short circuit the loop.  We're not yet in
LOOP_CACHING_WAIT, so we skip that loop back before going through the
loop again for other raid groups.

Then we move to LOOP_CACHING_WAIT state.

During the this pass through the loop, ->cached will still be
BTRFS_CACHE_STARTED, which means it's not cached, so we'll enter
cache_block_group, do a lot of nothing, and return, and also set
have_caching_bg again.  Then we hit the read-only check and short circuit
the loop.  The same thing happens as before except now we DO trigger
the LOOP_CACHING_WAIT && have_caching_bg check and loop back up to the
top.  We do this forever.

There are two fixes in this patch since they address the same underlying
bug.

The first is to add a cond_resched to the end of the loop to ensure
that the caching thread always has an opportunity to run.  This will
fix the soft lockup issue, but find_free_extent will still loop doing
nothing until the thread has completed.

The second is to move the read-only check to the top of the loop.  We're
never going to return an allocation within a read-only block group so
we may as well skip it early.  The check for ->cached == BTRFS_CACHE_ERROR
would cause the same problem except that BTRFS_CACHE_ERROR is considered
a "done" state and we won't re-set have_caching_bg again.

Many thanks to Stephan Kulow <coolo@suse.de> for his excellent help in
the testing process.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-24 16:04:02 +02:00
Greg Kroah-Hartman
24a81a2c25 Merge 4.13-rc2 into char-misc-next
We want the char/misc driver fixes in here as well to handle future
changes.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-23 19:58:30 -07:00
Linus Torvalds
505d5c1119 NFS client bugfixes for 4.13
Stable bugfixes:
 - Fix error reporting regression
 
 Bugfixes:
 - Fix setting filelayout ds address race
 - Fix subtle access bug when using ACLs
 - Fix setting mnt3_counts array size
 - Fix a couple of pNFS commit races
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAllyVYYACgkQ18tUv7Cl
 QOtTwBAA8ek0Ba5wIfwlQe4MvIX1t6v7q7Otrwdombhuw1a6nW410hFwu2SG8kGd
 fQWr1xnqRsMKwu83UuImtlYYvMa271TZgXxPNWx7KJpi/zocnigJRg5sVpkcRqza
 AjjPc245pHCooQoaTAlm5WrO9zDm0s7lRGuTB1cvPRsElnFVWT0jwhTASIDOU1Zy
 9K/hElAnXp/dZv/ydjSePTPPsVQPJWLbJPlSm6vAIQPyeXWUeCgAym0yf2FVvtsd
 AQeozaq9xq6tofVPZhsYWeBKswjTHs3FxL8vhDOF9gF3QMPm43mwsfrKVidWo4vW
 0UJnRZRCFgG0WIxxhA7l1Z9MovAsXlbWmFufgCa4Ev/bC5WuUT4ZEkjBGJw2vXD4
 0/lxkhD41PBhCl/LIod9OT6iJ8koifl50JUC4N67D2illFy9a7Btzx3EPxfDz9uG
 6jEek9x6B1xf7AC4HhxByN/E8gKX08N4Q/afxTFuAwrzKRKqI4Me3qbCyU86bp+T
 wiAxgVPVnmb/VBVLU68i7titdsA6U8ZO12FFqu9QOr9wHMXxa6108h2Nia9jVTdk
 EZhanXa6tJThQQ/QZicuR5hTtoM4BuikaaJSHZ4ODbgrAjsMAyqy4qsf4tf3LLAo
 tEDu5sJDmHJhhdtqzuz+OtDn5iJ2Nga6a4/fMwt9YU2ZQBm7X/8=
 =ZcQv
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.13-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "Stable bugfix:
   - Fix error reporting regression

  Bugfixes:
   - Fix setting filelayout ds address race
   - Fix subtle access bug when using ACLs
   - Fix setting mnt3_counts array size
   - Fix a couple of pNFS commit races"

* tag 'nfs-for-4.13-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS/filelayout: Fix racy setting of fl->dsaddr in filelayout_check_deviceid()
  NFS: Be more careful about mapping file permissions
  NFS: Store the raw NFS access mask in the inode's access cache
  NFSv3: Convert nfs3_proc_access() to use nfs_access_set_mask()
  NFS: Refactor NFS access to kernel access mask calculation
  net/sunrpc/xprt_sock: fix regression in connection error reporting.
  nfs: count correct array for mnt3_counts array size
  Revert commit 722f0b8911 ("pNFS: Don't send COMMITs to the DSes if...")
  pNFS/flexfiles: Handle expired layout segments in ff_layout_initiate_commit()
  NFS: Fix another COMMIT race in pNFS
  NFS: Fix a COMMIT race in pNFS
  mount: copy the port field into the cloned nfs_server structure.
  NFS: Don't run wake_up_bit() when nobody is waiting...
  nfs: add export operations
2017-07-21 16:26:01 -07:00
Linus Torvalds
99313414dd Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "This fixes a crash with SELinux and several other old and new bugs"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: check for bad and whiteout index on lookup
  ovl: do not cleanup directory and whiteout index entries
  ovl: fix xattr get and set with selinux
  ovl: remove unneeded check for IS_ERR()
  ovl: fix origin verification of index dir
  ovl: mark parent impure on ovl_link()
  ovl: fix random return value on mount
2017-07-21 16:24:22 -07:00
Trond Myklebust
1ebf980127 NFS/filelayout: Fix racy setting of fl->dsaddr in filelayout_check_deviceid()
We must set fl->dsaddr once, and once only, even if there are multiple
processes calling filelayout_check_deviceid() for the same layout
segment.

Reported-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-21 14:08:45 -04:00
Benjamin Coddington
3953704fde locks: restore a warn for leaked locks on close
When locks.c moved to using file_lock_context, the check for any locks that
were not released was moved from the __fput() to destroy_inode() path in
commit 8634b51f6c ("locks: convert lease handling to file_lock_context").
This warning has been quite useful for catching bugs, particularly in NFS
where lock handling still sees some churn.

Let's bring back the warning for leaked locks on __fput, as this warning is
much more likely to be seen and reported by users.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-21 13:57:31 -04:00
Trond Myklebust
ecbb903c56 NFS: Be more careful about mapping file permissions
When mapping a directory, we want the MAY_WRITE permissions to reflect
whether or not we have permission to modify, add and delete the directory
entries. MAY_EXEC must map to lookup permissions.

On the other hand, for files, we want MAY_WRITE to reflect a permission
to modify and extend the file.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-21 11:51:19 -04:00
Trond Myklebust
bd8b244174 NFS: Store the raw NFS access mask in the inode's access cache
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-21 11:51:19 -04:00
Trond Myklebust
eda3e20847 NFSv3: Convert nfs3_proc_access() to use nfs_access_set_mask()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-21 11:51:19 -04:00
Trond Myklebust
15d4b73ac2 NFS: Refactor NFS access to kernel access mask calculation
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-21 11:51:19 -04:00
Bob Peterson
4d7c18c7df GFS2: Set gl_object in inode lookup only after block type check
Before this patch, the inode glock's gl_object was set after a
reference was acquired, but before the block type was verified.
In cases where the block was unlinked, then freed and reused on
another node, a residule delete callback (delete_work) would try
to look up the inode, eventually failing the block check, but
only after it overwrites gl_object with a pointer to the wrong
inode. This patch moves the assignment of gl_object after the
block check so it won't be improperly overwritten.

Likewise, at the end of the function, gfs2_inode_lookup was
clearing gl_object after it unlocked the glock, which meant
another process might free the glock in the meantime. This
patch guards against that case.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2017-07-21 08:20:42 -05:00
Bob Peterson
df3d87bde1 GFS2: Introduce helper for clearing gl_object
This patch introduces a new helper function in glock.h that
clears gl_object, with an added integrity check. An additional
integrity check has been added to glock_set_object, plus comments.
This is step 1 in a series to ensure gl_object integrity.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2017-07-21 08:20:05 -05:00
Eryu Guan
ecc7b435d2 nfs: count correct array for mnt3_counts array size
Array size of mnt3_counts should be the size of array
mnt3_procedures, not mnt_procedures, though they're same in size
right now. Found this by code inspection.

Fixes: 1c5876ddbd ("sunrpc: move p_count out of struct rpc_procinfo")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-21 08:49:57 -04:00
Coly Li
e477b24b50 gfs2: add flag REQ_PRIO for metadata I/O
When gfs2 does metadata I/O, only REQ_META is used as a metadata hint of
the bio. But flag REQ_META is just a hint for block trace, not for block
layer code to handle a bio as metadata request.

For some of metadata I/Os of gfs2, A REQ_PRIO flag on the metadata bio
would be very informative to block layer code. For example, if bcache is
used as a I/O cache for gfs2, it will be possible for bcache code to get
the hint and cache the pre-fetched metadata blocks on cache device. This
behavior may be helpful to improve metadata I/O performance if the
following requests hit the cache.

Here are the locations in gfs2 code where a REQ_PRIO flag should be added,
- All places where REQ_READAHEAD is used, gfs2 code uses this flag for
  metadata read ahead.
- In gfs2_meta_rq() where the first metadata block is read in.
- In gfs2_write_buf_to_page(), read in quota metadata blocks to have them
  up to date.
These metadata blocks are probably to be accessed again in future, adding
a REQ_PRIO flag may have bcache to keep such metadata in fast cache
device. For system without a cache layer, REQ_PRIO can still provide hint
to block layer to handle metadata requests more properly.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-07-21 07:48:22 -05:00
Wang Xibo
e7cb550d79 GFS2: fix code parameter error in inode_go_lock
In inode_go_lock() function, the parameter order of list_add() is error.
According to the define of list_add(), the first parameter is new entry
and the second is the list head, so ip->i_trunc_list should be the
first parameter and the sdp->sd_trunc_list should be second.

Signed-off-by: Wang Xibo<wang.xibo@zte.com.cn>
Signed-off-by: Xiao Likun<xiao.likun@zte.com.cn>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-07-21 07:40:59 -05:00
David Howells
ddc6c70f07 rxrpc: Move the packet.h include file into net/rxrpc/
Move the protocol description header file into net/rxrpc/ and rename it to
protocol.h.  It's no longer necessary to expose it as packets are no longer
exposed to kernel services (such as AFS) that use the facility.

The abort codes are transferred to the UAPI header instead as we pass these
back to userspace and also to kernel services.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-21 11:00:20 +01:00
Darrick J. Wong
10479e2dea xfs: check _alloc_read_agf buffer pointer before using
In some circumstances, _alloc_read_agf can return an error code of zero
but also a null AGF buffer pointer.  Check for this and jump out.

Fixes-coverity-id: 1415250
Fixes-coverity-id: 1415320
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-07-20 14:42:33 -07:00
Darrick J. Wong
4c1a67bd36 xfs: set firstfsb to NULLFSBLOCK before feeding it to _bmapi_write
We must initialize the firstfsb parameter to _bmapi_write so that it
doesn't incorrectly treat stack garbage as a restriction on which AGs
it can search for free space.

Fixes-coverity-id: 1402025
Fixes-coverity-id: 1415167
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-07-20 14:42:33 -07:00
Darrick J. Wong
1e86eabe73 xfs: check _btree_check_block value
Check the _btree_check_block return value for the firstrec and lastrec
functions, since we have the ability to signal that the repositioning
did not succeed.

Fixes-coverity-id: 114067
Fixes-coverity-id: 114068
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-07-20 14:42:32 -07:00
Linus Torvalds
791f2df39b Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull misc filesystem fixes from Jan Kara:
 "Several ACL related fixes for ext2, reiserfs, and hfsplus.

  And also one minor isofs cleanup"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  hfsplus: Don't clear SGID when inheriting ACLs
  isofs: Fix off-by-one in 'session' mount option parsing
  reiserfs: preserve i_mode if __reiserfs_set_acl() fails
  ext2: preserve i_mode if ext2_set_acl() fails
  ext2: Don't clear SGID when inheriting ACLs
  reiserfs: Don't clear SGID when inheriting ACLs
2017-07-20 10:41:12 -07:00
Linus Torvalds
465b0dbb38 for-f2fs-v4.13-rc2
We've filed some bug fixes:
 - missing f2fs case in terms of stale SGID big, introduced by Jan
 - build error for seq_file.h
 - avoid cpu lockup
 - wrong inode_unlock in error case
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAllwICsACgkQQBSofoJI
 UNItLQ/8CrPqw7pOSoH72n79/d5Md7tKe5TNN2qZbjCVGj7qs2opOnGM8hhFtUTe
 nFzK84evSpIQlgdRJFJU82E55U0coa3ySHgCQSUnHOobTtNsdmwq7p21/xT5LV3s
 211zGYDgqtdp5/5ONHeD1ckF0QR9S9nWPuIRt9ef3bp2c7CfDrk+LLMrwSMeUlZo
 /uk5j32QPdME9ittqZ1bEZPl2FgwgmI4NFjyjGiHDK/ZYGhspHfa7FHjL8PW69UG
 pquiwlqHTg+i9wSc9byYALnJEs1XN6oW8E5TxO5zGqvfa77tQQb+qGHG9kYGDu64
 JMpAXort5ZKNatkLLMXOoojLWutthv70f1IQK3eGUHhiWmsYrWZHjzrDh8hkcgh7
 JMwGbYHrQlsAdk6B1r4MM8GW/telLufM3jTp7Fhpn1fLomWSE28JPtql9Ci5kIKX
 XxUF0y2HbC4ZI5LlY2umRzAfULaEFWEG/8X+wqTl3oE5Jv7Jthd69rpdjJvcQnPx
 iIz7J6BJopjAUoTUlXdSnWkP7VPkDOtDpAiu7cj16U39XSnIW/ceC+qLeP1J2R2c
 +hTg2pfYvh4eJGnNdxv4kZOxFFhjaEBReBPPgYOyCr7IPTtA+sucXO/zqWN6RH95
 tu8+Efl60eQbCt2Gh+JlBR7hXNsgk56ksZ8XaYhBM4VRIWZFc/0=
 =nVpP
 -----END PGP SIGNATURE-----

Merge tag 'for-f2fs-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs fixes from Jaegeuk Kim:
 "We've filed some bug fixes:

   - missing f2fs case in terms of stale SGID bit, introduced by Jan

   - build error for seq_file.h

   - avoid cpu lockup

   - wrong inode_unlock in error case"

* tag 'for-f2fs-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: avoid cpu lockup
  f2fs: include seq_file.h for sysfs.c
  f2fs: Don't clear SGID when inheriting ACLs
  f2fs: remove extra inode_unlock() in error path
2017-07-20 10:30:16 -07:00
Amir Goldstein
0e082555ce ovl: check for bad and whiteout index on lookup
Index should always be of the same file type as origin, except for
the case of a whiteout index.  A whiteout index should only exist
if all lower aliases have been unlinked, which means that finding
a lower origin on lookup whose index is a whiteout should be treated
as a lookup error.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-20 11:08:21 +02:00
Amir Goldstein
61b674710c ovl: do not cleanup directory and whiteout index entries
Directory index entries are going to be used for looking up
redirected upper dirs by lower dir fh when decoding an overlay
file handle of a merge dir.

Whiteout index entries are going to be used as an indication that
an exported overlay file handle should be treated as stale (i.e.
after unlink of the overlay inode).

We don't know the verification rules for directory and whiteout
index entries, because they have not been implemented yet, so fail
to mount overlay rw if those entries are found to avoid corrupting
an index that was created by a newer kernel.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-20 11:08:21 +02:00
Miklos Szeredi
1d88f18373 ovl: fix xattr get and set with selinux
inode_doinit_with_dentry() in SELinux wants to read the upper inode's xattr
to get security label, and ovl_xattr_get() calls ovl_dentry_real(), which
depends on dentry->d_inode, but d_inode is null and not initialized yet at
this point resulting in an Oops.

Fix by getting the upperdentry info from the inode directly in this case.

Reported-by: Eryu Guan <eguan@redhat.com>
Fixes: 09d8b58673 ("ovl: move __upperdentry to ovl_inode")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-20 11:08:21 +02:00
Trond Myklebust
213297369c Revert commit 722f0b8911 ("pNFS: Don't send COMMITs to the DSes if...")
Doing the test without taking any locks is racy, and so really it makes
more sense to do it in the flexfiles code (which is the only case that
cares).

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-19 15:28:21 -04:00
Trond Myklebust
4b75053e9b pNFS/flexfiles: Handle expired layout segments in ff_layout_initiate_commit()
If the layout has expired due to a fencing event, then we should not
attempt to commit to the DS.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-19 15:28:21 -04:00
Trond Myklebust
4118188645 NFS: Fix another COMMIT race in pNFS
We must make sure that cinfo->ds->ncommitting is in sync with the
commit list, since it is checked as part of pnfs_commit_list().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-19 15:28:21 -04:00
Trond Myklebust
e39928f942 NFS: Fix a COMMIT race in pNFS
We must make sure that cinfo->ds->nwritten is in sync with the
commit list, since it is checked as part of pnfs_scan_commit_lists().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-19 15:28:21 -04:00
Steve Dickson
89a6814d9b mount: copy the port field into the cloned nfs_server structure.
Doing this copy eliminates the "port=0" entry in
the /proc/mounts entries

Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=69241

Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-19 15:28:21 -04:00
Filipe Manana
e33bf72361 Btrfs: fix dir item validation when replaying xattr deletes
We were passing an incorrect slot number to the function that validates
directory items when we are replaying xattr deletes from a log tree. The
correct slot is stored at variable 'i' and not at 'path->slots[0]', so
the call to the validation function was only correct for the first
iteration of the loop, when 'i == path->slots[0]'.
After this fix, the fstest generic/066 passes again.

Fixes: 8ee8c2d62d ("btrfs: Verify dir_item in replay_xattr_deletes")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-19 20:38:16 +02:00
Andreas Gruenbacher
98e5a91a61 gfs2: Fixup to "Get rid of flush_delayed_work in gfs2_evict_inode"
When commit 4fd1a57952 moved the call to flush_delayed_work from
gfs2_evict_inode to gfs2_inode_lookup to avoid calling into DLM during
evict, a similar call should have been added to gfs2_create_inode:
that's another code path in which glocks of previous inodes may be
reused.

The flush of the iopen glock work queue added by 4fd1a57952, on the
other hand, is unnecessary and can be removed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-07-19 11:10:19 -05:00
Jan Kara
914cea93dd gfs2: Don't clear SGID when inheriting ACLs
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.

Fix the problem by moving posix_acl_update_mode() out of
__gfs2_set_acl() into gfs2_set_acl(). That way the function will not be
called when inheriting ACLs which is what we want as it prevents SGID
bit clearing and the mode has been properly set by posix_acl_create()
anyway.

Fixes: 073931017b
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-07-19 10:58:54 -05:00
Linus Torvalds
e06fdaf40a Now that IPC and other changes have landed, enable manual markings for
randstruct plugin, including the task_struct.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJZbRgGAAoJEIly9N/cbcAmk2AQAIL60aQ+9RIcFAXriFhnd7Z2
 x9Jqi9JNc8NgPFXx8GhE4J4eTZ5PwcjgXBpNRWY/laBkRyoBHn24ku09YxrJjmHz
 ZSUsP+/iO9lVeEfbmU9Tnk50afkfwx6bHXBwkiVGQWHtybNVUqA19JbqkHeg8ubx
 myKLGeUv5PPCodRIcBDD0+HaAANcsqtgbDpgmWU8s+IXWwvWCE2p7PuBw7v3HHgH
 qzlPDHYQCRDw+LWsSqPaHj+9mbRO18P/ydMoZHGH4Hl3YYNtty8ZbxnraI3A7zBL
 6mLUVcZ+/l88DqHc5I05T8MmLU1yl2VRxi8/jpMAkg9wkvZ5iNAtlEKIWU6eqsvk
 vaImNOkViLKlWKF+oUD1YdG16d8Segrc6m4MGdI021tb+LoGuUbkY7Tl4ee+3dl/
 9FM+jPv95HjJnyfRNGidh2TKTa9KJkh6DYM9aUnktMFy3ca1h/LuszOiN0LTDiHt
 k5xoFURk98XslJJyXM8FPwXCXiRivrXMZbg5ixNoS4aYSBLv7Cn1M6cPnSOs7UPh
 FqdNPXLRZ+vabSxvEg5+41Ioe0SHqACQIfaSsV5BfF2rrRRdaAxK4h7DBcI6owV2
 7ziBN1nBBq2onYGbARN6ApyCqLcchsKtQfiZ0iFsvW7ZawnkVOOObDTCgPl3tdkr
 403YXzphQVzJtpT5eRV6
 =ngAW
 -----END PGP SIGNATURE-----

Merge tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull structure randomization updates from Kees Cook:
 "Now that IPC and other changes have landed, enable manual markings for
  randstruct plugin, including the task_struct.

  This is the rest of what was staged in -next for the gcc-plugins, and
  comes in three patches, largest first:

   - mark "easy" structs with __randomize_layout

   - mark task_struct with an optional anonymous struct to isolate the
     __randomize_layout section

   - mark structs to opt _out_ of automated marking (which will come
     later)

  And, FWIW, this continues to pass allmodconfig (normal and patched to
  enable gcc-plugins) builds of x86_64, i386, arm64, arm, powerpc, and
  s390 for me"

* tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  randstruct: opt-out externally exposed function pointer structs
  task_struct: Allow randomized layout
  randstruct: Mark various structs for randomization
2017-07-19 08:55:18 -07:00
Linus Torvalds
a90c6ac2b5 A number of small fixes for -rc1 Luminous changes plus a readdir race
fix, marked for stable.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJZb3AkAAoJEEp/3jgCEfOLMAUH/RRRxbY4KL/PUhDXVPf+a+Pf
 groC365undvuCmHCkT1ufrlrh56KE0XUvEKgXJp+r84WS4SC6lxaebD6QvzVtyMM
 KPVnbpCNfKw5KtLB1upMteYY6MGfTk4VTPCav69aNGPrvUxJQB8obvWenPi0rWk/
 knALvlJZbSiZeUDK3Id9cjntTGkClYuUHYJQ1JaZeieB/Xwnr+ZvV4on8ul7gkGX
 B6zdqaM43ZomSl/rJrV/G/MOMNV5uVjBNJmVpfH7KkZQGipW7O+8aDwFaMFAAN7r
 4TQcLf+d3SDjcjVspikCMYr0r0VnbL8hLPGkd7Cus/3jei9GWQHGaQqbZZmcKl8=
 =TPyV
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.13-rc2' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A number of small fixes for -rc1 Luminous changes plus a readdir race
  fix, marked for stable"

* tag 'ceph-for-4.13-rc2' of git://github.com/ceph/ceph-client:
  libceph: potential NULL dereference in ceph_msg_data_create()
  ceph: fix race in concurrent readdir
  libceph: don't call encode_request_finish() on MOSDBackoff messages
  libceph: use alloc_pg_mapping() in __decode_pg_upmap_items()
  libceph: set -EINVAL in one place in crush_decode()
  libceph: NULL deref on osdmap_apply_incremental() error path
  libceph: fix old style declaration warnings
2017-07-19 08:49:46 -07:00
Ernesto A. Fernández
f070e5ac9b jfs: preserve i_mode if __jfs_set_acl() fails
When changing a file's acl mask, __jfs_set_acl() will first set the group
bits of i_mode to the value of the mask, and only then set the actual
extended attribute representing the new acl.

If the second part fails (due to lack of space, for example) and the file
had no acl attribute to begin with, the system will from now on assume
that the mask permission bits are actual group permission bits, potentially
granting access to the wrong users.

Prevent this by only changing the inode mode after the acl has been set.

Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2017-07-18 14:28:06 -05:00
Jan Kara
9bcf66c72d jfs: Don't clear SGID when inheriting ACLs
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.

Fix the problem by moving posix_acl_update_mode() out of
__jfs_set_acl() into jfs_set_acl(). That way the function will not be
called when inheriting ACLs which is what we want as it prevents SGID
bit clearing and the mode has been properly set by posix_acl_create()
anyway.

Fixes: 073931017b
CC: stable@vger.kernel.org
CC: jfs-discussion@lists.sourceforge.net
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2017-07-18 14:28:03 -05:00
Linus Torvalds
15b0a8d1d4 One fix for a problem introduced in the most recent merge window and
found by Dave Jones and KASAN.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZbhjwAAoJECebzXlCjuG+kx4P/3iBtL7+dBIcz3Lj6oetbOM0
 6myybTmC6ZWLN9cmNN6GIFk82TUYvo4EQRAOt7GcSdF6X5GtMM0LNMMaquAxM/gc
 ZwCeQ0K33PKL3qIlaOF1DYT/DN3uwVdRl+VfueMzQjOAfm7kMtQaa1xjaGJtqT13
 Vssn5S4LWIo9SaXFp9/kgnPZF9kIl1v8Rfwowhn8g90KNChcKnMSi4ia0jrMPK/w
 VLuOr8VLPlUa3GEKmcwMQqf+0WPryvVtahHIblNVTHyFeYP/j5u7TCp3T5GoechR
 /Rk2MPzPMtNk850QkrwEegYuJIRsMvrhSMaV/FOZCNKi728haP9ruOSQryMRKUBt
 BJMJrE+0hclJ2KPPbjp1cp2P1fn+dZbt+V22VGRD1ZTgKQKj7+DgSfLxUgA6/X4L
 4QjfmO5N6suLfl8XaXyncliQxFoyWWV4s+FKQDjk5/dZ0/PZt1zBZ3cBo2Bxnjgj
 1yfeVM8zyrnIAJv04U59fQnXaaqgkk0NmOllpTN4exbyHqq7U4wz9gpjFulbBn5X
 p6ebs6RKdRolfalpTyaRvdDcg8zmSwyeoHpzc3FHr7AGhX0edOXR8b+vWhalFB7T
 B1Jnoy/qesGjafB7BXgLtaZQO6wMoQv3JdTDrxDlRIXJQfULeR0Zw5kosNzgaBKY
 B1TSX5Tiit9W7+DxAfz+
 =QSoH
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.13-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd fix from Bruce Fields:
 "One fix for a problem introduced in the most recent merge window and
  found by Dave Jones and KASAN"

* tag 'nfsd-4.13-1' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix a memory scribble in the callback channel
2017-07-18 11:11:13 -07:00
Jan Kara
84969465dd hfsplus: Don't clear SGID when inheriting ACLs
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.

Fix the problem by creating __hfsplus_set_posix_acl() function that does
not call posix_acl_update_mode() and use it when inheriting ACLs. That
prevents SGID bit clearing and the mode has been properly set by
posix_acl_create() anyway.

Fixes: 073931017b
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2017-07-18 18:23:39 +02:00
Jan Kara
34363c057b isofs: Fix off-by-one in 'session' mount option parsing
According to ECMA-130 standard maximum valid track number is 99. Since
'session' mount option starts indexing at 0 (and we add 1 to the passed
number), we should refuse value 99. Also the condition in
isofs_get_last_session() unnecessarily repeats the check - remove it.

Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-07-18 12:33:16 +02:00
Ernesto A. Fernández
fcea8aed91 reiserfs: preserve i_mode if __reiserfs_set_acl() fails
When changing a file's acl mask, reiserfs_set_acl() will first set the
group bits of i_mode to the value of the mask, and only then set the
actual extended attribute representing the new acl.

If the second part fails (due to lack of space, for example) and the
file had no acl attribute to begin with, the system will from now on
assume that the mask permission bits are actual group permission bits,
potentially granting access to the wrong users.

Prevent this by only changing the inode mode after the acl has been set.

Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-07-18 11:24:08 +02:00
Ernesto A. Fernández
fe26569eb9 ext2: preserve i_mode if ext2_set_acl() fails
When changing a file's acl mask, ext2_set_acl() will first set the group
bits of i_mode to the value of the mask, and only then set the actual
extended attribute representing the new acl.

If the second part fails (due to lack of space, for example) and the file
had no acl attribute to begin with, the system will from now on assume
that the mask permission bits are actual group permission bits, potentially
granting access to the wrong users.

Prevent this by only changing the inode mode after the acl has been set.

[JK: Rebased on top of "ext2: Don't clear SGID when inheriting ACLs"]
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-07-18 11:23:56 +02:00
Jaegeuk Kim
4db08d016c f2fs: avoid cpu lockup
Before retrying to flush data or dentry pages, we need to release cpu in order
to prevent watchdog.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-17 19:23:18 -07:00
Jaegeuk Kim
299aa41a45 f2fs: include seq_file.h for sysfs.c
This patch includes seq_file.h to avoid compile error.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-17 19:23:12 -07:00
Andreas Gruenbacher
283c9a97be gfs2: Lock holder cleanup (fixup)
Function gfs2_holder_initialized should be used in do_flock as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-07-17 13:39:15 -05:00
Trond Myklebust
eff7936877 nfsd: Fix a memory scribble in the callback channel
The offset of the entry in struct rpc_version has to match the version
number.

Reported-by: Dave Jones <davej@codemonkey.org.uk>
Fixes: 1c5876ddbd ("sunrpc: move p_count out of struct rpc_procinfo")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-17 13:15:06 -04:00
Logan Gunthorpe
133d55cdb2 block: order /proc/devices by major number
Presently, the order of the block devices listed in /proc/devices is not
entirely sequential. If a block device has a major number greater than
BLKDEV_MAJOR_HASH_SIZE (255), it will be ordered as if its major were
module 255. For example, 511 appears after 1.

This patch cleans that up and prints each major number in the correct
order, regardless of where they are stored in the hash table.

In order to do this, we introduce BLKDEV_MAJOR_MAX as an artificial
limit (chosen to be 512). It will then print all devices in major
order number from 0 to the maximum.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-17 15:42:20 +02:00
Bob Peterson
61eaadcd52 GFS2: Prevent double brelse in gfs2_meta_indirect_buffer
Before this patch, problems reading in indirect buffers would send
an IO error back to the caller, and release the buffer_head with
brelse() in function gfs2_meta_indirect_buffer, however, it would
still return the address of the buffer_head it released. After the
error was discovered, function gfs2_block_map would call function
release_metapath to free all buffers. That checked:
if (mp->mp_bh[i] == NULL) but since the value was set after the
error, it was non-zero, so brelse was called a second time. This
resulted in the following error:

kernel: WARNING: at fs/buffer.c:1224 __brelse+0x3a/0x40() (Tainted: G        W  -- ------------   )
kernel: Hardware name: RHEV Hypervisor
kernel: VFS: brelse: Trying to free free buffer

This patch changes gfs2_meta_indirect_buffer so it only sets
the buffer_head pointer in cases where it isn't released.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2017-07-17 08:39:48 -05:00
Logan Gunthorpe
8a932f73e5 char_dev: order /proc/devices by major number
Presently, the order of the char devices listed in /proc/devices is not
entirely sequential. If a char device has a major number greater than
CHRDEV_MAJOR_HASH_SIZE (255), it will be ordered as if its major were
module 255. For example, 511 appears after 1.

This patch cleans that up and prints each major number in the correct
order, regardless of where they are stored in the hash table.

In order to do this, we introduce CHRDEV_MAJOR_MAX as an artificial
limit (chosen to be 511). It will then print all devices in major
order number from 0 to the maximum.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-17 15:28:50 +02:00
Logan Gunthorpe
a5d31a3f81 char_dev: extend dynamic allocation of majors into a higher range
We've run into problems with running out of dynamicly assign char
device majors particullarly on automated test systems with
all-yes-configs. Roughly 40 dynamic assignments can be made with such
kernels at this time while space is reserved for only 20.

Currently, the kernel only prints a warning when dynamic allocation
overflows the reserved region. And when this happens drivers that have
fixed assignments can randomly fail depending on the order of
initialization of other drivers. Thus, adding a new char device can cause
unexpected failures in completely unrelated parts of the kernel.

This patch solves the problem by extending dynamic major number
allocations down from 511 once the 234-254 region fills up. Fixed
majors already exist above 255 so the infrastructure to support
high number majors is already in place. The patch reserves an
additional 128 major numbers which should hopefully last us a while.

Kernels that don't require more than 20 dynamic majors assigned (which
is pretty typical) should not be affected by this change.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Link: https://lkml.org/lkml/2017/6/4/107
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-17 15:09:17 +02:00
Yan, Zheng
84583cfb97 ceph: fix race in concurrent readdir
For a large directory, program needs to issue multiple readdir
syscalls to get all dentries. When there are multiple programs
read the directory concurrently. Following sequence of events
can happen.

 - program calls readdir with pos = 2. ceph sends readdir request
   to mds. The reply contains N1 entries. ceph adds these N1 entries
   to readdir cache.
 - program calls readdir with pos = N1+2. The readdir is satisfied
   by the readdir cache, N2 entries are returned. (Other program
   calls readdir in the middle, which fills the cache)
 - program calls readdir with pos = N1+N2+2. ceph sends readdir
   request to mds. The reply contains N3 entries and it reaches
   directory end. ceph adds these N3 entries to the readdir cache
   and marks directory complete.

The second readdir call does not update fi->readdir_cache_idx.
ceph add the last N3 entries to wrong places.

Cc: stable@vger.kernel.org # v4.3+
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-17 14:54:59 +02:00
Jan Kara
a992f2d38e ext2: Don't clear SGID when inheriting ACLs
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.

Fix the problem by creating __ext2_set_acl() function that does not call
posix_acl_update_mode() and use it when inheriting ACLs. That prevents
SGID bit clearing and the mode has been properly set by
posix_acl_create() anyway.

Fixes: 073931017b
CC: stable@vger.kernel.org
CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2017-07-17 10:15:31 +02:00
Jan Kara
6883cd7f68 reiserfs: Don't clear SGID when inheriting ACLs
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.

Fix the problem by moving posix_acl_update_mode() out of
__reiserfs_set_acl() into reiserfs_set_acl(). That way the function will
not be called when inheriting ACLs which is what we want as it prevents
SGID bit clearing and the mode has been properly set by
posix_acl_create() anyway.

Fixes: 073931017b
CC: stable@vger.kernel.org
CC: reiserfs-devel@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2017-07-17 10:14:50 +02:00
David Howells
e462ec50cb VFS: Differentiate mount flags (MS_*) from internal superblock flags
Differentiate the MS_* flags passed to mount(2) from the internal flags set
in the super_block's s_flags.  s_flags are now called SB_*, with the names
and the values for the moment mirroring the MS_* flags that they're
equivalent to.

In this patch, just the headers are altered and some kernel code where
blind automated conversion isn't necessarily correct.

Note that this shows up some interesting issues:

 (1) Some MS_* flags get translated to MNT_* flags (such as MS_NODEV ->
     MNT_NODEV) without passing this on to the filesystem, but some
     filesystems set such flags anyway.

 (2) The ->remount_fs() methods of some filesystems adjust the *flags
     argument by setting MS_* flags in it, such as MS_NOATIME - but these
     flags are then scrubbed by do_remount_sb() (only the occupants of
     MS_RMT_MASK are permitted: MS_RDONLY, MS_SYNCHRONOUS, MS_MANDLOCK,
     MS_I_VERSION and MS_LAZYTIME)

I'm not sure what's the best way to solve all these cases.

Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-17 08:45:35 +01:00
David Howells
bc98a42c1f VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
Firstly by applying the following with coccinelle's spatch:

	@@ expression SB; @@
	-SB->s_flags & MS_RDONLY
	+sb_rdonly(SB)

to effect the conversion to sb_rdonly(sb), then by applying:

	@@ expression A, SB; @@
	(
	-(!sb_rdonly(SB)) && A
	+!sb_rdonly(SB) && A
	|
	-A != (sb_rdonly(SB))
	+A != sb_rdonly(SB)
	|
	-A == (sb_rdonly(SB))
	+A == sb_rdonly(SB)
	|
	-!(sb_rdonly(SB))
	+!sb_rdonly(SB)
	|
	-A && (sb_rdonly(SB))
	+A && sb_rdonly(SB)
	|
	-A || (sb_rdonly(SB))
	+A || sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) != A
	+sb_rdonly(SB) != A
	|
	-(sb_rdonly(SB)) == A
	+sb_rdonly(SB) == A
	|
	-(sb_rdonly(SB)) && A
	+sb_rdonly(SB) && A
	|
	-(sb_rdonly(SB)) || A
	+sb_rdonly(SB) || A
	)

	@@ expression A, B, SB; @@
	(
	-(sb_rdonly(SB)) ? 1 : 0
	+sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) ? A : B
	+sb_rdonly(SB) ? A : B
	)

to remove left over excess bracketage and finally by applying:

	@@ expression A, SB; @@
	(
	-(A & MS_RDONLY) != sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
	|
	-(A & MS_RDONLY) == sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
	)

to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-17 08:45:34 +01:00
Geert Uytterhoeven
a86054236d binfmt_flat: Use %u to format u32
Several variables had their types changed from unsigned long to u32, but
the printk()-style format to print them wasn't updated, leading to:

    fs/binfmt_flat.c: In function ‘load_flat_file’:
    fs/binfmt_flat.c:577: warning: format ‘%ld’ expects type ‘long int’, but argument 3 has type ‘u32’

Fixes: 468138d785 ("binfmt_flat: flat_{get,put}_addr_from_rp() should be able to fail")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-16 09:24:05 -07:00
Benjamin Coddington
9d5b86ac13 fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks
Since commit c69899a17c "NFSv4: Update of VFS byte range lock must be
atomic with the stateid update", NFSv4 has been inserting locks in rpciod
worker context.  The result is that the file_lock's fl_nspid is the
kworker's pid instead of the original userspace pid.

The fl_nspid is only used to represent the namespaced virtual pid number
when displaying locks or returning from F_GETLK.  There's no reason to set
it for every inserted lock, since we can usually just look it up from
fl_pid.  So, instead of looking up and holding struct pid for every lock,
let's just look up the virtual pid number from fl_pid when it is needed.
That means we can remove fl_nspid entirely.

The translaton and presentation of fl_pid should handle the following four
cases:

1 - F_GETLK on a remote file with a remote lock:
    In this case, the filesystem should determine the l_pid to return here.
    Filesystems should indicate that the fl_pid represents a non-local pid
    value that should not be translated by returning an fl_pid <= 0.

2 - F_GETLK on a local file with a remote lock:
    This should be the l_pid of the lock manager process, and translated.

3 - F_GETLK on a remote file with a local lock, and
4 - F_GETLK on a local file with a local lock:
    These should be the translated l_pid of the local locking process.

Fuse was already doing the correct thing by translating the pid into the
caller's namespace.  With this change we must update fuse to translate
to init's pid namespace, so that the locks API can then translate from
init's pid namespace into the pid namespace of the caller.

With this change, the locks API will expect that if a filesystem returns
a remote pid as opposed to a local pid for F_GETLK, that remote pid will
be <= 0.  This signifies that the pid is remote, and the locks API will
forego translating that pid into the pid namespace of the local calling
process.

Finally, we convert remote filesystems to present remote pids using
negative numbers. Have lustre, 9p, ceph, cifs, and dlm negate the remote
pid returned for F_GETLK lock requests.

Since local pids will never be larger than PID_MAX_LIMIT (which is
currently defined as <= 4 million), but pid_t is an unsigned int, we
should have plenty of room to represent remote pids with negative
numbers if we assume that remote pid numbers are similarly limited.

If this is not the case, then we run the risk of having a remote pid
returned for which there is also a corresponding local pid.  This is a
problem we have now, but this patch should reduce the chances of that
occurring, while also returning those remote pid numbers, for whatever
that may be worth.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-16 10:28:22 -04:00
Benjamin Coddington
52306e882f fs/locks: Use allocation rather than the stack in fcntl_getlk()
Struct file_lock is fairly large, so let's save some space on the stack by
using an allocation for struct file_lock in fcntl_getlk(), just as we do
for fcntl_setlk().

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-16 10:28:21 -04:00
Jaegeuk Kim
c925dc162f f2fs: Don't clear SGID when inheriting ACLs
This patch copies commit b7f8a09f80:
"btrfs: Don't clear SGID when inheriting ACLs" written by Jan.

Fixes: 073931017b
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-15 21:10:23 -07:00
Luis Henriques
5ffff2854a f2fs: remove extra inode_unlock() in error path
This commit removes an extra inode_unlock() that is being done in function
f2fs_ioc_setflags error path.  While there, get rid of a useless 'out'
label as well.

Fixes: 0abd675e97 ("f2fs: support plain user/group quota")
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-15 21:10:23 -07:00
Linus Torvalds
52f6c588c7 Add wait_for_random_bytes() and get_random_*_wait() functions so that
callers can more safely get random bytes if they can block until the
 CRNG is initialized.
 
 Also print a warning if get_random_*() is called before the CRNG is
 initialized.  By default, only one single-line warning will be printed
 per boot.  If CONFIG_WARN_ALL_UNSEEDED_RANDOM is defined, then a
 warning will be printed for each function which tries to get random
 bytes before the CRNG is initialized.  This can get spammy for certain
 architecture types, so it is not enabled by default.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAllqXNUACgkQ8vlZVpUN
 gaPtAgf/aUbXZuWYsDQzslHsbzEWi+qz4QgL885/w4L00pEImTTp91Q06SDxWhtB
 KPvGnZHS3IofxBh2DC+6AwN6dPMoWDCfYhhO6po3FSz0DiPRIQCTuvOb8fhKY1X7
 rTdDq2xtDxPGxJ25bMJtlrgzH2XlXPpVyPUeoc9uh87zUK5aesXpUn9kBniRexoz
 ume+M/cDzPKkwNQpbLq8vzhNjoWMVv0FeW2akVvrjkkWko8nZLZ0R/kIyKQlRPdG
 LZDXcz0oTHpDS6+ufEo292ZuWm2IGer2YtwHsKyCAsyEWsUqBz2yurtkSj3mAVyC
 hHafyS+5WNaGdgBmg0zJxxwn5qxxLg==
 =ua7p
 -----END PGP SIGNATURE-----

Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random

Pull random updates from Ted Ts'o:
 "Add wait_for_random_bytes() and get_random_*_wait() functions so that
  callers can more safely get random bytes if they can block until the
  CRNG is initialized.

  Also print a warning if get_random_*() is called before the CRNG is
  initialized. By default, only one single-line warning will be printed
  per boot. If CONFIG_WARN_ALL_UNSEEDED_RANDOM is defined, then a
  warning will be printed for each function which tries to get random
  bytes before the CRNG is initialized. This can get spammy for certain
  architecture types, so it is not enabled by default"

* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
  random: reorder READ_ONCE() in get_random_uXX
  random: suppress spammy warnings about unseeded randomness
  random: warn when kernel uses unseeded randomness
  net/route: use get_random_int for random counter
  net/neighbor: use get_random_u32 for 32-bit hash random
  rhashtable: use get_random_u32 for hash_rnd
  ceph: ensure RNG is seeded before using
  iscsi: ensure RNG is seeded before use
  cifs: use get_random_u32 for 32-bit lock random
  random: add get_random_{bytes,u32,u64,int,long,once}_wait family
  random: add wait_for_random_bytes() API
2017-07-15 12:44:02 -07:00
Linus Torvalds
78dcf73421 Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ->s_options removal from Al Viro:
 "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
  gets moved to explicit ->show_options(), killing ->s_options off +
  some cosmetic bits around fs/namespace.c and friends. Basically, the
  stuff needed to work with fsmount series with minimum of conflicts
  with other work.

  It's not strictly required for this merge window, but it would reduce
  the PITA during the coming cycle, so it would be nice to have those
  bits and pieces out of the way"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  isofs: Fix isofs_show_options()
  VFS: Kill off s_options and helpers
  orangefs: Implement show_options
  9p: Implement show_options
  isofs: Implement show_options
  afs: Implement show_options
  affs: Implement show_options
  befs: Implement show_options
  spufs: Implement show_options
  bpf: Implement show_options
  ramfs: Implement show_options
  pstore: Implement show_options
  omfs: Implement show_options
  hugetlbfs: Implement show_options
  VFS: Don't use save/replace_mount_options if not using generic_show_options
  VFS: Provide empty name qstr
  VFS: Make get_filesystem() return the affected filesystem
  VFS: Clean up whitespace in fs/namespace.c and fs/super.c
  Provide a function to create a NUL-terminated string from unterminated data
2017-07-15 12:00:42 -07:00
Linus Torvalds
89cbec71fe Merge branch 'work.uaccess-unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uacess-unaligned removal from Al Viro:
 "That stuff had just one user, and an exotic one, at that - binfmt_flat
  on arm and m68k"

* 'work.uaccess-unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kill {__,}{get,put}_user_unaligned()
  binfmt_flat: flat_{get,put}_addr_from_rp() should be able to fail
2017-07-15 11:17:52 -07:00
Linus Torvalds
966859b9f7 This pull request contains updates for UBIFS:
- Updates and fixes for the file encryption mode
 - Minor improvements
 - Random fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJZahyrAAoJEGb5WYXrGLvBK9EP/1ZstpLyQyLrIHMpQCaRG64/
 19L383c5EZQubmUfg7nQFvjYWg5TSJ0Gca1GrYXn58o0LC4ncv5Wh06FlYH8nH0E
 LWDmAjJjXyuKiOiFTw2xT1q42uBzPwApsC/wa4eLSiC1j/yQxkzPH8WL4hGJ5V4p
 R43OX/2zq6yXW79VRK9atMgbp44L+6yGdiZAbUCL84QNptiJPfqWiEyiAAX097b/
 HPcIzf4VLDlwvs3xoRlaDDHh7qqEznC4EgHodwcUGd4TX3mr/J6C5ggsKT8Pp7yL
 NaVeeGdKv0ftOcSHFf5bcX4ygWrVoAV4gEOeTMUtFSBSd3iObhyLZsbUl40lTj8E
 jQiC+NdsxTwu1b2kjrlkEIwKag6gi0xBIQS+IMka4XsZ8OzPEvt0j71wFPSOUE5w
 TF+9sx32foUGGPNwHVGwgihIL8cpiybUl4feZ2qPKMOQDqukxIzS8Nw7EoKLYM+p
 khAJAL01tjLWlTaoLUZMUK/1nkMlQKNlY713ejccyEcVTxQ4SmcoZ8JF9IGqxgK7
 uLD/JkJhIlAUAEMhHiWmXYRSuaq+Mmeg55chmmSbA5bIuak18XBxWTaCBxL45ZCO
 sxTP+9lsBo47PVPGQd6XbmhUmIyWbA47HvykLs2lvOk7k4P0LGwnaGqBK2pMG4J2
 n1X8D8qFB6wMPt7+OExg
 =kZyI
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.13-rc1' of git://git.infradead.org/linux-ubifs

Pull UBIFS updates from Richard Weinberger:

 - Updates and fixes for the file encryption mode

 - Minor improvements

 - Random fixes

* tag 'upstream-4.13-rc1' of git://git.infradead.org/linux-ubifs:
  ubifs: Set double hash cookie also for RENAME_EXCHANGE
  ubifs: Massage assert in ubifs_xattr_set() wrt. init_xattrs
  ubifs: Don't leak kernel memory to the MTD
  ubifs: Change gfp flags in page allocation for bulk read
  ubifs: Fix oops when remounting with no_bulk_read.
  ubifs: Fail commit if TNC is obviously inconsistent
  ubifs: allow userspace to map mounts to volumes
  ubifs: Wire-up statx() support
  ubifs: Remove dead code from ubifs_get_link()
  ubifs: Massage debug prints wrt. fscrypt
  ubifs: Add assert to dent_key_init()
  ubifs: Fix unlink code wrt. double hash lookups
  ubifs: Fix data node size for truncating uncompressed nodes
  ubifs: Don't encrypt special files on creation
  ubifs: Fix memory leak in RENAME_WHITEOUT error path in do_rename
  ubifs: Fix inode data budget in ubifs_mknod
  ubifs: Correctly evict xattr inodes
  ubifs: Unexport ubifs_inode_slab
  ubifs: don't bother checking for encryption key in ->mmap()
  ubifs: require key for truncate(2) of encrypted file
2017-07-15 10:46:14 -07:00
Linus Torvalds
a80099a152 Changes since last update:
- Add some locking assertions for the _ilock helpers.
 - Revert the XFS_QMOPT_NOLOCK patch; after discussion with hch the
   online fsck patch that would have needed it has been redesigned and
   no longer needs it.
 - Fix behavioral regression of SEEK_HOLE/DATA with negative offsets to match
   4.12-era XFS behavior.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJZaTLhAAoJEPh/dxk0SrTrQp4P/1/hkrXuBmd6wthGUfdEHgFV
 AnStDnsJSn51DNCr3rAgavJLmQku+MtgYmNz9TkYne1XyIVTI+2hln7PUyV+u+4J
 jcA749hdeLaKC7uz8l3ZP0yZus1hZTG2swY7G4/HhpYZtJy4EkbxvnQADk24Qi9y
 zMU6bygPFi0cVCurL2wgs1NWhi+TaRtLUKdxloxQ1MqjvtoApl2GyRLAKafYforK
 XS6GwjCBYGs9LXkN6WlkGcR/JCiUDhBvYq5cQGQ7dNg0wBYe+z4saslYLXBhUBtv
 94KlKCCPN/hTofgUN+Io5g9AefMlKEUucOz6f55mfmd0fPcEJvjFdjBL0VN3tQUG
 nK8eJf+BEBCOpxAE5pUV00o3C3TovbtM8Eo3gD70ZTV50TRnKEFcmx/OLQ3n2ebD
 r7wLYNIFC6hm5Eb6sM3aTlPAj/Lq/fiTMF/r37tJ37qsdJ4um8jtttsIW+Sc3DQ/
 xKqdBJxbNzLf7Ku0ZgL9dt1ex93EpIanmHK6vMNljTDljgFbH5IzNxy8v77r79hV
 f4GlqR7UR8HUeUVTDGmV0z42oU9AEHKXPAWY9wNRGuO5y/xuj8XfbnDgBcld+scy
 DWdBuCo/m7QnkFvKKyLo+4r5eL9rC2Bm6hxw/dEtqwhEeAFqRUjItEzX+e//3Cuq
 p15JlTaxgHXTt1xKK7AE
 =M5uj
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.13-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull XFS fixes from Darrick Wong:
 "Largely debugging and regression fixes.

   - Add some locking assertions for the _ilock helpers.

   - Revert the XFS_QMOPT_NOLOCK patch; after discussion with hch the
     online fsck patch that would have needed it has been redesigned and
     no longer needs it.

   - Fix behavioral regression of SEEK_HOLE/DATA with negative offsets
     to match 4.12-era XFS behavior"

* tag 'xfs-4.13-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  vfs: in iomap seek_{hole,data}, return -ENXIO for negative offsets
  Revert "xfs: grab dquots without taking the ilock"
  xfs: assert locking precondition in xfs_readlink_bmap_ilocked
  xfs: assert locking precondіtion in xfs_attr_list_int_ilocked
  xfs: fixup xfs_attr_get_ilocked
2017-07-14 22:57:32 -07:00
Linus Torvalds
bc243704fb Merge branch 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "We've identified and fixed a silent corruption (introduced by code in
  the first pull), a fixup after the blk_status_t merge and two fixes to
  incremental send that Filipe has been hunting for some time"

* 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix unexpected return value of bio_readpage_error
  btrfs: btrfs_create_repair_bio never fails, skip error handling
  btrfs: cloned bios must not be iterated by bio_for_each_segment_all
  Btrfs: fix write corruption due to bio cloning on raid5/6
  Btrfs: incremental send, fix invalid memory access
  Btrfs: incremental send, fix invalid path for link commands
2017-07-14 22:55:52 -07:00
Linus Torvalds
867eacd7fb Merge branch 'akpm' (patches from Andrew)
Merge even more updates from Andrew Morton:

 - a few leftovers

 - fault-injector rework

 - add a module loader test driver

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  kmod: throttle kmod thread limit
  kmod: add test driver to stress test the module loader
  MAINTAINERS: give kmod some maintainer love
  xtensa: use generic fb.h
  fault-inject: add /proc/<pid>/fail-nth
  fault-inject: simplify access check for fail-nth
  fault-inject: make fail-nth read/write interface symmetric
  fault-inject: parse as natural 1-based value for fail-nth write interface
  fault-inject: automatically detect the number base for fail-nth write interface
  kernel/watchdog.c: use better pr_fmt prefix
  MAINTAINERS: move the befs tree to kernel.org
  lib/atomic64_test.c: add a test that atomic64_inc_not_zero() returns an int
  mm: fix overflow check in expand_upwards()
2017-07-14 21:57:25 -07:00
Akinobu Mita
168c42bc56 fault-inject: add /proc/<pid>/fail-nth
fail-nth interface is only created in /proc/self/task/<current-tid>/.
This change also adds it in /proc/<pid>/.

This makes shell based tool a bit simpler.

	$ bash -c "builtin echo 100 > /proc/self/fail-nth && exec ls /"

Link: http://lkml.kernel.org/r/1491490561-10485-6-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14 15:05:13 -07:00
Akinobu Mita
1203c8e6fb fault-inject: simplify access check for fail-nth
The fail-nth file is created with 0666 and the access is permitted if
and only if the task is current.

This file is owned by the currnet user.  So we can create it with 0644
and allow the owner to write it.  This enables to watch the status of
task->fail_nth from another processes.

[akinobu.mita@gmail.com: don't convert unsigned type value as signed int]
  Link: http://lkml.kernel.org/r/1492444483-9239-1-git-send-email-akinobu.mita@gmail.com
[akinobu.mita@gmail.com: avoid unwanted data race to task->fail_nth]
  Link: http://lkml.kernel.org/r/1499962492-8931-1-git-send-email-akinobu.mita@gmail.com
Link: http://lkml.kernel.org/r/1491490561-10485-5-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14 15:05:13 -07:00
Akinobu Mita
bfc740938d fault-inject: make fail-nth read/write interface symmetric
The read interface for fail-nth looks a bit odd.  Read from this file
returns "NYYYY..." or "YYYYY..." (this makes me surprise when cat this
file).  Because there is no EOF condition.  The first character
indicates current->fail_nth is zero or not, and then current->fail_nth
is reset to zero.

Just returning task->fail_nth value is more natural to understand.

Link: http://lkml.kernel.org/r/1491490561-10485-4-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14 15:05:13 -07:00
Akinobu Mita
9049f2f6e7 fault-inject: parse as natural 1-based value for fail-nth write interface
The value written to fail-nth file is parsed as 0-based.  Parsing as
one-based is more natural to understand and it enables to cancel the
previous setup by simply writing '0'.

This change also converts task->fail_nth from signed to unsigned int.

Link: http://lkml.kernel.org/r/1491490561-10485-3-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14 15:05:13 -07:00
Akinobu Mita
ecaad81ca0 fault-inject: automatically detect the number base for fail-nth write interface
Automatically detect the number base to use when writing to fail-nth
file instead of always parsing as a decimal number.

Link: http://lkml.kernel.org/r/1491490561-10485-2-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14 15:05:13 -07:00
Richard Weinberger
a6664433d3 ubifs: Set double hash cookie also for RENAME_EXCHANGE
We developed RENAME_EXCHANGE and UBIFS_FLG_DOUBLE_HASH more or less in
parallel and this case was forgotten. :-(

Cc: stable@vger.kernel.org
Fixes: d63d61c169 ("ubifs: Implement UBIFS_FLG_DOUBLE_HASH")
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:50:57 +02:00
Xiaolei Li
d8db5b1ca9 ubifs: Massage assert in ubifs_xattr_set() wrt. init_xattrs
The inode is not locked in init_xattrs when creating a new inode.

Without this patch, there will occurs assert when booting or creating
a new file, if the kernel config CONFIG_SECURITY_SMACK is enabled.

Log likes:

UBIFS assert failed in ubifs_xattr_set at 298 (pid 1156)
CPU: 1 PID: 1156 Comm: ldconfig Tainted: G S 4.12.0-rc1-207440-g1e70b02 #2
Hardware name: MediaTek MT2712 evaluation board (DT)
Call trace:
[<ffff000008088538>] dump_backtrace+0x0/0x238
[<ffff000008088834>] show_stack+0x14/0x20
[<ffff0000083d98d4>] dump_stack+0x9c/0xc0
[<ffff00000835d524>] ubifs_xattr_set+0x374/0x5e0
[<ffff00000835d7ec>] init_xattrs+0x5c/0xb8
[<ffff000008385788>] security_inode_init_security+0x110/0x190
[<ffff00000835e058>] ubifs_init_security+0x30/0x68
[<ffff00000833ada0>] ubifs_mkdir+0x100/0x200
[<ffff00000820669c>] vfs_mkdir+0x11c/0x1b8
[<ffff00000820b73c>] SyS_mkdirat+0x74/0xd0
[<ffff000008082f8c>] __sys_trace_return+0x0/0x4

Signed-off-by: Xiaolei Li <xiaolei.li@mediatek.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:50:54 +02:00
Richard Weinberger
4acadda74f ubifs: Don't leak kernel memory to the MTD
When UBIFS prepares data structures which will be written to the MTD it
ensues that their lengths are multiple of 8. Since it uses kmalloc() the
padded bytes are left uninitialized and we leak a few bytes of kernel
memory to the MTD.
To make sure that all bytes are initialized, let's switch to kzalloc().
Kzalloc() is fine in this case because the buffers are not huge and in
the IO path the performance bottleneck is anyway the MTD.

Cc: stable@vger.kernel.org
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:50:52 +02:00
Hyunchul Lee
480a1a6a3e ubifs: Change gfp flags in page allocation for bulk read
In low memory situations, page allocations for bulk read
can kill applications for reclaiming memory, and print an
failure message when allocations are failed.
Because bulk read is just an optimization, we don't have
to do these and can stop page allocations.

Though this siutation happens rarely, add __GFP_NORETRY
to prevent from excessive memory reclaim and killing
applications, and __GFP_WARN to suppress this failure
message.

For this, Use readahead_gfp_mask for gfp flags when
allocating pages.

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:50:50 +02:00
karam.lee
07d41c3cf2 ubifs: Fix oops when remounting with no_bulk_read.
When remounting with the no_bulk_read option,
there is a problem accessing the "bulk_read buffer(bu.buf)"
which has already been freed.

If the bulk_read option is enabled,
ubifs_tnc_bulk_read uses the pre-allocated bu.buf.

While bu.buf is being used by ubifs_tnc_bulk_read,
remounting with no_bulk_read frees bu.buf.

So I added code to check the use of "bu.buf" to avoid this situation.

------
I tested as follows(kernel v3.18) :

Use the script to repeat "no_bulk_read <-> bulk_read"
	remount.sh
	#!/bin/sh
	while true do;
		mount -o remount,no_bulk_read ${MOUNT_POINT};
		sleep 1;
		mount -o remount,bulk_read ${MOUNT_POINT};
		sleep 1;
	done

Perform read operation
	cat ${MOUNT_POINT}/* > /dev/null

The problem is reproduced immediately.

[  234.256845][kernel.0]Internal error: Oops: 17 [#1] PREEMPT ARM
[  234.258557][kernel.0]CPU: 0 PID: 2752 Comm: cat Tainted: G        W  O   3.18.31+ #51
[  234.259531][kernel.0]task: cbff8580 ti: cbd66000 task.ti: cbd66000
[  234.260306][kernel.0]PC is at validate_data_node+0x10/0x264
[  234.260994][kernel.0]LR is at ubifs_tnc_bulk_read+0x388/0x3ec
[  234.261712][kernel.0]pc : [<c01d98fc>]    lr : [<c01dc300>]    psr: 80000013
[  234.261712][kernel.0]sp : cbd67ba0  ip : 00000001  fp : 00000000
[  234.263337][kernel.0]r10: cd3e0260  r9 : c0df2008  r8 : 00000000
[  234.264087][kernel.0]r7 : cd3e0000  r6 : 00000000  r5 : cd3e0278  r4 : cd3e0000
[  234.264999][kernel.0]r3 : 00000003  r2 : cd3e0280  r1 : 00000000  r0 : cd3e0000
[  234.265910][kernel.0]Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[  234.266896][kernel.0]Control: 10c53c7d  Table: 8c40c059  DAC: 00000015
[  234.267711][kernel.0]Process cat (pid: 2752, stack limit = 0xcbd66400)
[  234.268525][kernel.0]Stack: (0xcbd67ba0 to 0xcbd68000)
[  234.269169][kernel.0]7ba0: cd7c3940 c03d8650 0001bfe0 00002ab2 00000000 cbd67c5c cbd67c58 0001bfe0
[  234.270287][kernel.0]7bc0: cd3e0000 00002ab2 0001bfe0 00000014 cbd66000 cd3e0260 00000000 c01d6660
[  234.271403][kernel.0]7be0: 00002ab2 00000000 c82a5800 ffffffff cd3e0298 cd3e0278 00000000 cd3e0000
[  234.272520][kernel.0]7c00: 00000000 00000000 cd3e0260 c01dc300 00002ab2 00000000 60000013 d663affa
[  234.273639][kernel.0]7c20: cd3e01f0 cd3e01f0 60000013 c09397ec 00000000 cd3e0278 00002ab2 00000000
[  234.274755][kernel.0]7c40: cd3e0000 c01dbf48 00000014 00000003 00000160 00000015 00000004 d663affa
[  234.275874][kernel.0]7c60: ccdaa978 cd3e0278 cd3e0000 cf32a5f4 ccdaa820 00000044 cbd66000 cd3e0260
[  234.276992][kernel.0]7c80: 00000003 c01cec84 ccdaa8dc cbd67cc4 cbd67ec0 00000010 ccdaa978 00000000
[  234.278108][kernel.0]7ca0: 0000015e ccdaa8dc 00000000 00000000 cf32a5d0 00000000 0000015f ccdaa8dc
[  234.279228][kernel.0]7cc0: 00000000 c8488300 0009e5a4 0000000e cbd66000 0000015e cf32a5f4 c0113c04
[  234.280346][kernel.0]7ce0: 0000009f 0000003c c00098c4 ffffffff 00001000 00000000 000000ad 00000010
[  234.281463][kernel.0]7d00: 00000038 cd68f580 00000150 c8488360 00000000 cbd67d30 cbd67d70 0000000e
[  234.282579][kernel.0]7d20: 00000010 00000000 c0951874 c0112a9c cf379b60 cf379b84 cf379890 cf3798b4
[  234.283699][kernel.0]7d40: cf379578 cf37959c cf379380 cf3793a4 cf3790b0 cf3790d4 cf378fd8 cf378ffc
[  234.284814][kernel.0]7d60: cf378f48 cf378f6c cf32a5f4 cf32a5d0 00000000 00001000 00000018 00000000
[  234.285932][kernel.0]7d80: 00001000 c0050da4 00000000 00001000 cec04c00 00000000 00001000 c0e11328
[  234.287049][kernel.0]7da0: 00000000 00001000 cbd66000 00000000 00001000 c0012a60 00000000 00001000
[  234.288166][kernel.0]7dc0: cbd67dd4 00000000 00001000 80000013 00000000 00001000 cd68f580 00000000
[  234.289285][kernel.0]7de0: 00001000 c915d600 00000000 00001000 cbd67e48 00000000 00001000 00000018
[  234.290402][kernel.0]7e00: 00000000 00001000 00000000 00000000 00001000 c915d768 c915d768 c0113550
[  234.291522][kernel.0]7e20: cd68f580 cbd67e48 cd68f580 cb6713c0 00010000 000ac5a4 00000000 001fc5a4
[  234.292637][kernel.0]7e40: 00000000 c8488300 cbd67ec0 00eb0000 cd68f580 c0113ee4 00000000 cbd67ec0
[  234.293754][kernel.0]7e60: cd68f580 c8488300 cbd67ec0 00eb0000 cd68f580 00150000 c8488300 00eb0000
[  234.294874][kernel.0]7e80: 00010000 c0112fd0 00000000 cbd67ec0 cd68f580 00150000 00000000 cd68f580
[  234.295991][kernel.0]7ea0: cbd67ef0 c011308c 00000000 00000002 cd768850 00010000 00000000 c01133fc
[  234.297110][kernel.0]7ec0: 00150000 00000000 cbd67f50 00000000 00000000 cb6713c0 01000000 cbd67f48
[  234.298226][kernel.0]7ee0: cbd67f50 c8488300 00000000 c0113204 00010000 01000000 00000000 cb6713c0
[  234.299342][kernel.0]7f00: 00150000 00000000 cbd67f50 00000000 00000000 00000000 00000000 00000000
[  234.300462][kernel.0]7f20: cbd67f50 01000000 01000000 cb6713c0 c8488300 c00ebba8 01000000 00000000
[  234.301577][kernel.0]7f40: c8488300 cb6713c0 00000000 00000000 00000000 00000000 ccdaa820 00000000
[  234.302697][kernel.0]7f60: 00000000 01000000 00000003 00000001 cbd66000 00000000 00000001 c00ec678
[  234.303813][kernel.0]7f80: 00000000 00000200 00000000 01000000 01000000 00000000 00000000 000000ef
[  234.304933][kernel.0]7fa0: c000e904 c000e780 01000000 00000000 00000001 00000003 00000000 01000000
[  234.306049][kernel.0]7fc0: 01000000 00000000 00000000 000000ef 00000001 00000003 01000000 00000001
[  234.307165][kernel.0]7fe0: 00000000 beafb78c 0000ad08 00128d1c 60000010 00000001 00000000 00000000
[  234.308292][kernel.0][<c01d98fc>] (validate_data_node) from [<c01dc300>] (ubifs_tnc_bulk_read+0x388/0x3ec)
[  234.309493][kernel.0][<c01dc300>] (ubifs_tnc_bulk_read) from [<c01cec84>] (ubifs_readpage+0x1dc/0x46c)
[  234.310656][kernel.0][<c01cec84>] (ubifs_readpage) from [<c0113c04>] (__generic_file_splice_read+0x29c/0x4cc)
[  234.311890][kernel.0][<c0113c04>] (__generic_file_splice_read) from [<c0113ee4>] (generic_file_splice_read+0xb0/0xf4)
[  234.313214][kernel.0][<c0113ee4>] (generic_file_splice_read) from [<c0112fd0>] (do_splice_to+0x68/0x7c)
[  234.314386][kernel.0][<c0112fd0>] (do_splice_to) from [<c011308c>] (splice_direct_to_actor+0xa8/0x190)
[  234.315544][kernel.0][<c011308c>] (splice_direct_to_actor) from [<c0113204>] (do_splice_direct+0x90/0xb8)
[  234.316741][kernel.0][<c0113204>] (do_splice_direct) from [<c00ebba8>] (do_sendfile+0x17c/0x2b8)
[  234.317838][kernel.0][<c00ebba8>] (do_sendfile) from [<c00ec678>] (SyS_sendfile64+0xc4/0xcc)
[  234.318890][kernel.0][<c00ec678>] (SyS_sendfile64) from [<c000e780>] (ret_fast_syscall+0x0/0x38)
[  234.319983][kernel.0]Code: e92d47f0 e24dd050 e59f9228 e1a04000 (e5d18014)

Signed-off-by: karam.lee <karam.lee@lge.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:50:40 +02:00
Richard Weinberger
df71b09145 ubifs: Fail commit if TNC is obviously inconsistent
A reference to LEB 0 or with length 0 in the TNC
is never correct and could be caused by a memory corruption.
Don't write such a bad index node to the MTD.
Instead fail the commit which will turn UBIFS into read-only mode.

This is less painful than having the bad reference on the MTD
from where UBFIS has no chance to recover.

Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:49:07 +02:00
Rabin Vincent
319c104274 ubifs: allow userspace to map mounts to volumes
There currently appears to be no way for userspace to find out the
underlying volume number for a mounted ubifs file system, since ubifs
uses anonymous block devices.  The volume name is present in
/proc/mounts but UBI volumes can be renamed after the volume has been
mounted.

To remedy this, show the UBI number and UBI volume number as part of the
options visible under /proc/mounts.

Also, accept and ignore the ubi= vol= options if they are used mounting
(patch from Richard Weinberger).

 # mount -t ubifs ubi:baz x
 # mount
 ubi:baz on /root/x type ubifs (rw,relatime,ubi=0,vol=2)
 # ubirename /dev/ubi0 baz bazz
 # mount
 ubi:baz on /root/x type ubifs (rw,relatime,ubi=0,vol=2)
 # ubinfo -d 0 -n 2
 Volume ID:   2 (on ubi0)
 Type:        dynamic
 Alignment:   1
 Size:        67 LEBs (1063424 bytes, 1.0 MiB)
 State:       OK
 Name:        bazz
 Character device major/minor: 254:3

Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:49:07 +02:00
Richard Weinberger
a02a6eba99 ubifs: Wire-up statx() support
statx() can report what flags a file has, expose flags that UBIFS
supports. Especially STATX_ATTR_COMPRESSED and STATX_ATTR_ENCRYPTED
can be interesting for userspace.

Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:49:07 +02:00
Richard Weinberger
d2eb85226f ubifs: Remove dead code from ubifs_get_link()
We check the length already, no need to check later
again for an empty string.

Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:49:07 +02:00
Richard Weinberger
35ee314c84 ubifs: Massage debug prints wrt. fscrypt
If file names are encrypted we can no longer print them.
That's why we have to change these prints or remove them completely.

Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:49:07 +02:00
Richard Weinberger
8b2900c017 ubifs: Add assert to dent_key_init()
...to make sure that we don't use it for double hashed lookups
instead of dent_key_init_hash().

Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:49:06 +02:00
Richard Weinberger
781f675e2d ubifs: Fix unlink code wrt. double hash lookups
When removing an encrypted file with a long name and without having
the key we have to be able to locate and remove the directory entry
via a double hash. This corner case was simply forgotten.

Fixes: 528e3d178f ("ubifs: Add full hash lookup support")
Reported-by: David Oberhollenzer <david.oberhollenzer@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:49:06 +02:00
David Oberhollenzer
59a74990f8 ubifs: Fix data node size for truncating uncompressed nodes
Currently, the function truncate_data_node only updates the
destination data node size if compression is used. For
uncompressed nodes, the old length is incorrectly retained.

This patch makes sure that the length is correctly set when
compression is disabled.

Fixes: 7799953b34 ("ubifs: Implement encrypt/decrypt for all IO")
Signed-off-by: David Oberhollenzer <david.oberhollenzer@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:49:06 +02:00
David Gstir
f34e87f58d ubifs: Don't encrypt special files on creation
When a new inode is created, we check if the containing folder has a encryption
policy set and inherit that. This should however only be done for regular
files, links and subdirectories. Not for sockes fifos etc.

Fixes: d475a50745 ("ubifs: Add skeleton for fscrypto")
Cc: stable@vger.kernel.org
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:49:05 +02:00
Hyunchul Lee
bb50c63244 ubifs: Fix memory leak in RENAME_WHITEOUT error path in do_rename
in RENAME_WHITEOUT error path, fscrypt_name should be freed.

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:49:05 +02:00
Hyunchul Lee
4d35ca4f77 ubifs: Fix inode data budget in ubifs_mknod
Assign inode data budget to budget request correctly.

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:49:05 +02:00
Richard Weinberger
272eda8298 ubifs: Correctly evict xattr inodes
UBIFS handles extended attributes just like files, as consequence of
that, they also have inodes.
Therefore UBIFS does all the inode machinery also for xattrs. Since new
inodes have i_nlink of 1, a file or xattr inode will be evicted
if i_nlink goes down to 0 after an unlink. UBIFS assumes this model also
for xattrs, which is not correct.
One can create a file "foo" with xattr "user.test". By reading
"user.test" an inode will be created, and by deleting "user.test" it
will get evicted later. The assumption breaks if the file "foo", which
hosts the xattrs, will be removed. VFS nor UBIFS does not remove each
xattr via ubifs_xattr_remove(), it just removes the host inode from
the TNC and all underlying xattr nodes too and the inode will remain
in the cache and wastes memory.

To solve this problem, remove xattr inodes from the VFS inode cache in
ubifs_xattr_remove() to make sure that they get evicted.

Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Cc: <stable@vger.kernel.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:49:04 +02:00
Richard Weinberger
e996bfd428 ubifs: Unexport ubifs_inode_slab
This SLAB is only being used in super.c, there is no need to expose
it into the global namespace.

Signed-off-by: Richard Weinberger <richard@nod.at>
2017-07-14 22:48:43 +02:00
Linus Torvalds
d3c329c741 befs fixes for 4.13-rc1
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJZaIEXAAoJEGu/nxmHO1GNRt8IAKatADenX+PZeE0npeilA0k2
 GNmqpN1VAypXxOV2Ud0L5T0x9aRasMuiaQTxWRJHvMfhYycdlJOe61ZMjF0mBTp2
 14mw0HpdiFzFrH3HoCTo1nwBfdkI4G9wGjE6+/ernp1jbmnVC8jWRd2AurbBGdFQ
 JMg1oFu13SxFGcJibarXoDfXAe0d4DZrMsXBTudWKyhsqhbkpXiYSdYT9HbyFhUy
 +/a7G/+PneDW6FvxR36D4vN5JNcwuW5NpvzDWXNpmj/c8Rr6Twycx9YJqiaRgQ/6
 OUmIQseKMaJleysiCJo3ahWcn+LdOWOmX7cIpgA6L2yn3wGzMQklVAqPwxpy49Q=
 =csgd
 -----END PGP SIGNATURE-----

Merge tag 'befs-v4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/luisbg/linux-befs

Pull single befs fix from Luis de Bethencourt:
 "Very little activity in the befs file system this time since I'm busy
  settling into a new job"

* tag 'befs-v4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/luisbg/linux-befs:
  befs: add kernel-doc formatting for befs_bt_read_super()
2017-07-14 12:31:09 -07:00
Liu Bo
c3cfb65630 Btrfs: fix unexpected return value of bio_readpage_error
With blk_status_t conversion (that are now present in master),
bio_readpage_error() may return 1 as now ->submit_bio_hook() may not set
%ret if it runs without problems.

This fixes that unexpected return value by changing
btrfs_check_repairable() to return a bool instead of updating %ret, and
patch is applicable to both codebases with and without blk_status_t.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-14 20:42:37 +02:00
David Sterba
e8f5b395d5 btrfs: btrfs_create_repair_bio never fails, skip error handling
As the function uses the non-failing bio allocation, we can remove error
handling from the callers as well.

Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-14 20:42:08 +02:00
David Sterba
c09abff87f btrfs: cloned bios must not be iterated by bio_for_each_segment_all
We've started using cloned bios more in 4.13, there are some specifics
regarding the iteration.  Filipe found [1] that the raid56 iterated a
cloned bio using bio_for_each_segment_all, which is incorrect. The
cloned bios have wrong bi_vcnt and this could lead to silent
corruptions.  This patch adds assertions to all remaining
bio_for_each_segment_all cases.

[1] https://patchwork.kernel.org/patch/9838535/

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-14 20:39:31 +02:00
Darrick J. Wong
d6ab17f261 vfs: in iomap seek_{hole,data}, return -ENXIO for negative offsets
In the iomap implementations of SEEK_HOLE and SEEK_DATA, make sure we
return -ENXIO for negative offsets.

Inspired-by: Mateusz S <muttdini@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-13 14:55:05 -07:00
Christoph Hellwig
0891f9971a Revert "xfs: grab dquots without taking the ilock"
This reverts commit 50e0bdbe9f.

The new XFS_QMOPT_NOLOCK isn't used at all, and conditional locking based
on a flag is always the wrong thing to do - we should be having helpers
that can be called without the lock instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-13 14:55:05 -07:00
Christoph Hellwig
29db2500f6 xfs: assert locking precondition in xfs_readlink_bmap_ilocked
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-13 14:55:05 -07:00
Christoph Hellwig
5af7777e11 xfs: assert locking precondіtion in xfs_attr_list_int_ilocked
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-13 14:55:05 -07:00
Christoph Hellwig
cf69f8248c xfs: fixup xfs_attr_get_ilocked
The comment mentioned the wrong lock.  Also add an ASSERT to assert
this locking precondition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-13 14:55:05 -07:00
Linus Torvalds
b86faee6d1 NFS client updates for Linux 4.13
Stable bugfixes:
 - Fix -EACCESS on commit to DS handling
 - Fix initialization of nfs_page_array->npages
 - Only invalidate dentries that are actually invalid
 
 Features:
 - Enable NFSoRDMA transparent state migration
 - Add support for lookup-by-filehandle
 - Add support for nfs re-exporting
 
 Other bugfixes and cleanups:
 - Christoph cleaned up the way we declare NFS operations
 - Clean up various internal structures
 - Various cleanups to commits
 - Various improvements to error handling
 - Set the dt_type of . and .. entries in NFS v4
 - Make slot allocation more reliable
 - Fix fscache stat printing
 - Fix uninitialized variable warnings
 - Fix potential list overrun in nfs_atomic_open()
 - Fix a race in NFSoRDMA RPC reply handler
 - Fix return size for nfs42_proc_copy()
 - Fix against MAC forgery timing attacks
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlln4jEACgkQ18tUv7Cl
 QOv2ZxAAwbQN9Dtx4rOZmPe0Xszua23sNN0ja891PodkCjIiZrRelZhLIBAf1rfP
 uSR+jTD8EsBHGt3bzTXg2DHz+o8cGDZuH+uuZX+wRWJPQcKA2pC7zElqnse8nmn5
 4Z1UUdzf42vE4NZ/G1ucqpEiAmOqGJ3s7pCRLLXPvOSSQXqOhiomNDAcGxX05FIv
 Ly4Kr6RIfg/O4oNOZBuuL/tZHodeyOj1vbyjt/4bDQ5MEXlUQfcjJZEsz/2EcNh6
 rAgbquxr1pGCD072pPBwYNH2vLGbgNN41KDDMGI0clp+8p6EhV6BOlgcEoGtZM86
 c0yro2oBOB2vPCv9nGr6JgTOHPKG6ksJ7vWVXrtQEjBGP82AbFfAawLgqZ6Ae8dP
 Sqpx55j4xdm4nyNglCuhq5PlPAogARq/eibR+RbY973Lhzr5bZb3XqlairCkNNEv
 4RbTlxbWjhgrKJ56jVf+KpUDJAVG5viKMD7YDx/bOfLtvPwALbozD7ONrunz5v43
 PgQEvWvVtnQAKp27pqHemTsLFhU6M6eGUEctRnAfB/0ogWZh1X8QXgulpDlqG3kb
 g12kr5hfA0pSfcB0aGXVzJNnHKfW3IY3WBWtxq4xaMY22YkHtuB+78+9/yk3jCAi
 dvimjT2Ko9fE9MnltJ/hC5BU+T+xUxg+1vfwWnKMvMH8SIqjyu4=
 =OpLj
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.13-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Stable bugfixes:
   - Fix -EACCESS on commit to DS handling
   - Fix initialization of nfs_page_array->npages
   - Only invalidate dentries that are actually invalid

  Features:
   - Enable NFSoRDMA transparent state migration
   - Add support for lookup-by-filehandle
   - Add support for nfs re-exporting

  Other bugfixes and cleanups:
   - Christoph cleaned up the way we declare NFS operations
   - Clean up various internal structures
   - Various cleanups to commits
   - Various improvements to error handling
   - Set the dt_type of . and .. entries in NFS v4
   - Make slot allocation more reliable
   - Fix fscache stat printing
   - Fix uninitialized variable warnings
   - Fix potential list overrun in nfs_atomic_open()
   - Fix a race in NFSoRDMA RPC reply handler
   - Fix return size for nfs42_proc_copy()
   - Fix against MAC forgery timing attacks"

* tag 'nfs-for-4.13-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (68 commits)
  NFS: Don't run wake_up_bit() when nobody is waiting...
  nfs: add export operations
  nfs4: add NFSv4 LOOKUPP handlers
  nfs: add a nfs_ilookup helper
  nfs: replace d_add with d_splice_alias in atomic_open
  sunrpc: use constant time memory comparison for mac
  NFSv4.2 fix size storage for nfs42_proc_copy
  xprtrdma: Fix documenting comments in frwr_ops.c
  xprtrdma: Replace PAGE_MASK with offset_in_page()
  xprtrdma: FMR does not need list_del_init()
  xprtrdma: Demote "connect" log messages
  NFSv4.1: Use seqid returned by EXCHANGE_ID after state migration
  NFSv4.1: Handle EXCHGID4_FLAG_CONFIRMED_R during NFSv4.1 migration
  xprtrdma: Don't defer MR recovery if ro_map fails
  xprtrdma: Fix FRWR invalidation error recovery
  xprtrdma: Fix client lock-up after application signal fires
  xprtrdma: Rename rpcrdma_req::rl_free
  xprtrdma: Pass only the list of registered MRs to ro_unmap_sync
  xprtrdma: Pre-mark remotely invalidated MRs
  xprtrdma: On invalidation failure, remove MWs from rl_registered
  ...
2017-07-13 14:35:37 -07:00
Trond Myklebust
b4f937cffa NFS: Don't run wake_up_bit() when nobody is waiting...
"perf lock" shows fairly heavy contention for the bit waitqueue locks
when doing an I/O heavy workload.
Use a bit to tell whether or not there has been contention for a lock
so that we can optimise away the bit waitqueue options in those cases.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 17:12:07 -04:00
Peng Tao
20fa190272 nfs: add export operations
This support for opening files on NFS by file handle, both through the
open_by_handle syscall, and for re-exporting NFS (for example using a
different version).  The support is very basic for now, as each open by
handle will have to do an NFSv4 open operation on the wire.  In the
future this will hopefully be mitigated by an open file cache, as well
as various optimizations in NFS for this specific case.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
[hch: incorporated various changes, resplit the patches, new changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 17:12:04 -04:00
Trond Myklebust
301bfa4830 NFS: Don't run wake_up_bit() when nobody is waiting...
"perf lock" shows fairly heavy contention for the bit waitqueue locks
when doing an I/O heavy workload.
Use a bit to tell whether or not there has been contention for a lock
so that we can optimise away the bit waitqueue options in those cases.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:57:18 -04:00
Linus Torvalds
6240300597 Chuck's RDMA update overhauls the "call receive" side of the
RPC-over-RDMA transport to use the new rdma_rw API.
 
 Christoph cleaned the way nfs operations are declared, removing a bunch
 of function-pointer casts and declaring the operation vectors as const.
 
 Christoph's changes touch both client and server, and both client and
 server pulls this time around should be based on the same commits from
 Christoph.
 
 (Note: Anna and I initially didn't coordinate this well and we realized
 our pull requests were going to leave you with Christoph's 33 patches
 duplicated between our two trees.  We decided a last-minute rebase was
 the lesser of two evils, so her pull request will show that last-minute
 rebase.  Yell if that was the wrong choice, and we'll know better for
 next time....)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZZ80JAAoJECebzXlCjuG+PiMP/jmw4IbzY4qt/X8aldVTMPZ8
 TkEXuZSrc7FbmroqAR0XN/qJjzENKUcrnlYm7HKVe6iItTZUvJuVThtHQVGzZUZD
 wP2VRzgkky59aDs9cphfTPGKPKL1MtoC3qQdFmKd/8ZhBDHIq89A2pQJwl7PI4rA
 IHzvLmZtTKL+xWoypqZQxepONhEY2ZPrffGWL+5OVF/dPmWfJ6m/M6jRTb7zV/YD
 PZyRqWQ8UY/HwZTwRrxZDCCxUsmRUPZz195iFjM8wvBl7auWNetC22gyyITlvfzf
 1m0zJqw3qn09+v2xnAWs/ZVxypg6rsEiIcL2mf0JC/tQh+iIzabc4e/TwDEWqSq+
 ocQrvXJuZCjsrMqg4oaIuDFogaZCsGR5wxDAEyfYDS/8fMdiKq8xJzT7v31/2U37
 Bsr1hvgAmD4eZWaTrJg11V5RnTzDgns+EtNfISR8t4/k+wehDfyzav8A+j72sqvR
 JT+7iUEd0QcBwo+MCC7AOnLLsIX45QUjZKKrvZNAC1fmr8RyAF1zo5HHO+NNjLuP
 J2PUG2GbNxsQkm/JAFKDvyklLpEXZc6uyYAcEefirxYbh1x0GfuetzqtH58DtrQL
 /1e80MRG9Qgq5S8PvYyvp1bIQPDRaQ188chEvzZy+3QeNXydq2LzDh0bjlM+4A9I
 DZhP2pNGLh0ImaPtX0q+
 =mR/a
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.13' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Chuck's RDMA update overhauls the "call receive" side of the
  RPC-over-RDMA transport to use the new rdma_rw API.

  Christoph cleaned the way nfs operations are declared, removing a
  bunch of function-pointer casts and declaring the operation vectors as
  const.

  Christoph's changes touch both client and server, and both client and
  server pulls this time around should be based on the same commits from
  Christoph"

* tag 'nfsd-4.13' of git://linux-nfs.org/~bfields/linux: (53 commits)
  svcrdma: fix an incorrect check on -E2BIG and -EINVAL
  nfsd4: factor ctime into change attribute
  svcrdma: Remove svc_rdma_chunk_ctxt::cc_dir field
  svcrdma: use offset_in_page() macro
  svcrdma: Clean up after converting svc_rdma_recvfrom to rdma_rw API
  svcrdma: Clean-up svc_rdma_unmap_dma
  svcrdma: Remove frmr cache
  svcrdma: Remove unused Read completion handlers
  svcrdma: Properly compute .len and .buflen for received RPC Calls
  svcrdma: Use generic RDMA R/W API in RPC Call path
  svcrdma: Add recvfrom helpers to svc_rdma_rw.c
  sunrpc: Allocate up to RPCSVC_MAXPAGES per svc_rqst
  svcrdma: Don't account for Receive queue "starvation"
  svcrdma: Improve Reply chunk sanity checking
  svcrdma: Improve Write chunk sanity checking
  svcrdma: Improve Read chunk sanity checking
  svcrdma: Remove svc_rdma_marshal.c
  svcrdma: Avoid Send Queue overflow
  svcrdma: Squelch disconnection messages
  sunrpc: Disable splice for krb5i
  ...
2017-07-13 13:56:24 -07:00
Linus Torvalds
19c6e12c07 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2, udf, reiserfs fixes from Jan Kara:
 "Several ext2, udf, and reiserfs fixes"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext2: Fix memory leak when truncate races ext2_get_blocks
  reiserfs: fix race in prealloc discard
  reiserfs: don't preallocate blocks for extended attributes
  udf: Convert udf_disk_stamp_to_time() to use mktime64()
  udf: Use time64_to_tm for timestamp conversion
  udf: Fix deadlock between writeback and udf_setsize()
  udf: Use i_size_read() in udf_adinicb_writepage()
  udf: Fix races with i_size changes during readpage
  udf: Remove unused UDF_DEFAULT_BLOCKSIZE
2017-07-13 13:53:54 -07:00
Amir Goldstein
a59f97ff66 ovl: remove unneeded check for IS_ERR()
ovl_workdir_create() returns a valid index dentry or NULL.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-13 22:06:46 +02:00
Amir Goldstein
961af647fc ovl: fix origin verification of index dir
Commit 54fb347e83 ("ovl: verify index dir matches upper dir")
introduced a new ovl_fh flag OVL_FH_FLAG_PATH_UPPER to indicate
an upper file handle, but forgot to add the flag to the mask of
valid flags, so index dir origin verification always discards
existing origin and stores a new one.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-13 22:06:46 +02:00
Amir Goldstein
ea3dad18dc ovl: mark parent impure on ovl_link()
When linking a file with copy up origin into a new parent, mark the
new parent dir "impure".

Fixes: ee1d6d37b6 ("ovl: mark upper dir with type origin entries "impure"")
Cc: <stable@vger.kernel.org> # v4.12
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-13 22:06:45 +02:00
Amir Goldstein
8fc646b443 ovl: fix random return value on mount
On failure to prepare_creds(), mount fails with a random
return value, as err was last set to an integer cast of
a valid lower mnt pointer or set to 0 if inodes index feature
is enabled.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 3fe6e52f06 ("ovl: override creds with the ones from ...")
Cc: <stable@vger.kernel.org> # v4.7
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-07-13 22:06:45 +02:00
Jeff Layton
5b5faaf6df nfs4: add NFSv4 LOOKUPP handlers
This will be needed in order to implement the get_parent export op
for nfsd.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:15 -04:00
Peng Tao
00422483ad nfs: add export operations
This support for opening files on NFS by file handle, both through the
open_by_handle syscall, and for re-exporting NFS (for example using a
different version).  The support is very basic for now, as each open by
handle will have to do an NFSv4 open operation on the wire.  In the
future this will hopefully be mitigated by an open file cache, as well
as various optimizations in NFS for this specific case.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
[hch: incorporated various changes, resplit the patches, new changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 16:00:15 -04:00
Peng Tao
f174ff7a0a nfs: add a nfs_ilookup helper
This helper will allow to find an existing NFS inode by the file handle
and fattr.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:15 -04:00
Peng Tao
774d9513a3 nfs: replace d_add with d_splice_alias in atomic_open
It's a trival change but follows knfsd export document that asks
for d_splice_alias during lookup.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:14 -04:00
Olga Kornievskaia
1ee48bdd22 NFSv4.2 fix size storage for nfs42_proc_copy
Return size of COPY is u64 but it was assigned to an "int" status.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:14 -04:00
Chuck Lever
838edb9497 NFSv4.1: Use seqid returned by EXCHANGE_ID after state migration
Transparent State Migration copies a client's lease state from the
server where a filesystem used to reside to the server where it now
resides. When an NFSv4.1 client first contacts that destination
server, it uses EXCHANGE_ID to detect trunking relationships.

The lease that was copied there is returned to that client, but the
destination server sets EXCHGID4_FLAG_CONFIRMED_R when replying to
the client. This is because the lease was confirmed on the source
server (before it was copied).

When CONFIRMED_R is set, the client throws away the sequence ID
returned by the server. During a Transparent State Migration, however
there's no other way for the client to know what sequence ID to use
with a lease that's been migrated.

Therefore, the client must save and use the contrived slot sequence
value returned by the destination server even when CONFIRMED_R is
set.

Note that some servers always return a seqid of 1 after a migration.

Reported-by: Xuan Qi <xuan.qi@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Xuan Qi <xuan.qi@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:12 -04:00
Chuck Lever
8dcbec6d20 NFSv4.1: Handle EXCHGID4_FLAG_CONFIRMED_R during NFSv4.1 migration
Transparent State Migration copies a client's lease state from the
server where a filesystem used to reside to the server where it now
resides. When an NFSv4.1 client first contacts that destination
server, it uses EXCHANGE_ID to detect trunking relationships.

The lease that was copied there is returned to that client, but the
destination server sets EXCHGID4_FLAG_CONFIRMED_R when replying to
the client. This is because the lease was confirmed on the source
server (before it was copied).

Normally, when CONFIRMED_R is set, a client purges the lease and
creates a new one. However, that throws away the entire benefit of
Transparent State Migration.

Therefore, the client must not purge that lease when it is possible
that Transparent State Migration has occurred.

Reported-by: Xuan Qi <xuan.qi@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Xuan Qi <xuan.qi@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:12 -04:00
NeilBrown
26fde4dfcb NFS: check for nfs_refresh_inode() errors in nfs_fhget()
If an NFS server returns a filehandle that we have previously
seen, and reports a different type, then nfs_refresh_inode()
will log a warning and return an error.

nfs_fhget() does not check for this error and may return an
inode with a different type than the one that the server
reported.

This is likely to cause confusion, and is one way that
->open_context() could return a directory inode as discussed
in the previous patch.

So if nfs_refresh_inode() returns and error, return that error
from nfs_fhget() to avoid the confusion propagating.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:09 -04:00
NeilBrown
eaa2b82c3b NFS: guard against confused server in nfs_atomic_open()
A confused server could return a filehandle for an
NFSv4 OPEN request, which it previously returned for a directory.
So the inode returned by  ->open_context() in nfs_atomic_open()
could conceivably be a directory inode.

This has particular implications for the call to
nfs_file_set_open_context() in nfs_finish_open().
If that is called on a directory inode, then the nfs_open_context
that gets stored in the filp->private_data will be linked to
nfs_inode->open_files.

When the directory is closed, nfs_closedir() will (ultimately)
free the ->private_data, but not unlink it from nfs_inode->open_files
(because it doesn't expect an nfs_open_context there).

Subsequently the memory could get used for something else and eventually
if the ->open_files list is walked, the walker will fall off the end and
crash.

So: change nfs_finish_open() to only call nfs_file_set_open_context()
for regular-file inodes.

This failure mode has been seen in a production setting (unknown NFS
server implementation).  The kernel was v3.0 and the specific sequence
seen would not affect more recent kernels, but I think a risk is still
present, and caution is wise.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:08 -04:00
NeilBrown
cc89684c9a NFS: only invalidate dentrys that are clearly invalid.
Since commit bafc9b754f ("vfs: More precise tests in d_invalidate")
in v3.18, a return of '0' from ->d_revalidate() will cause the dentry
to be invalidated even if it has filesystems mounted on or it or on a
descendant.  The mounted filesystem is unmounted.

This means we need to be careful not to return 0 unless the directory
referred to truly is invalid.  So -ESTALE or -ENOENT should invalidate
the directory.  Other errors such a -EPERM or -ERESTARTSYS should be
returned from ->d_revalidate() so they are propagated to the caller.

A particular problem can be demonstrated by:

1/ mount an NFS filesystem using NFSv3 on /mnt
2/ mount any other filesystem on /mnt/foo
3/ ls /mnt/foo
4/ turn off network, or otherwise make the server unable to respond
5/ ls /mnt/foo &
6/ cat /proc/$!/stack # note that nfs_lookup_revalidate is in the call stack
7/ kill -9 $! # this results in -ERESTARTSYS being returned
8/ observe that /mnt/foo has been unmounted.

This patch changes nfs_lookup_revalidate() to only treat
  -ESTALE from nfs_lookup_verify_inode() and
  -ESTALE or -ENOENT from ->lookup()
as indicating an invalid inode.  Other errors are returned.

Also nfs_check_inode_attributes() is changed to return -ESTALE rather
than -EIO.  This is consistent with the error returned in similar
circumstances from nfs_update_inode().

As this bug allows any user to unmount a filesystem mounted on an NFS
filesystem, this fix is suitable for stable kernels.

Fixes: bafc9b754f ("vfs: More precise tests in d_invalidate")
Cc: stable@vger.kernel.org (v3.18+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:08 -04:00
Olga Kornievskaia
22368ff11d PNFS for stateid errors retry against MDS first
Upon receiving a stateid error such as BAD_STATEID, the client
should retry the operation against the MDS before deciding to
do stateid recovery.

Previously, the code would initiate state recovery and it could
lead to a race in a state manager that could chose an incorrect
recovery method which would lead to the EIO failure for the
application.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:08 -04:00
Olga Kornievskaia
a0bc01e0f1 PNFS fix EACCESS on commit to DS handling
Commit fabbbee0eb "PNFS fix fallback to MDS if got error on
commit to DS" moved the pnfs_set_lo_fail() to unhandled errors
which was not correct and lead to a kernel oops on umount.

Instead, fix the original EACCESS on commit to DS error by
getting the new layout and re-doing the IO.

Fixes: fabbbee0eb ("PNFS fix fallback to MDS if got error on commit to DS")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Cc: stable@vger.kernel.org # v4.12
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:59:57 -04:00
Dan Carpenter
4cd1ec95bd NFS: silence a uninitialized variable warning
Static checkers have gotten clever enough to complain that "id_long" is
uninitialized on the failure path.  It's harmless, but simple to fix.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:28 -04:00
Tuo Chen Peng
ce85bd2921 nfs: Fix fscache stat printing in nfs_show_stats()
nfs_show_stats() was incorrectly reading statistics for bytes when printing that
for fsc. It caused files like /proc/self/mountstats to report incorrect fsc
statistics for NFS mounts.

Signed-off-by: Tuo Chen Peng <tpeng@nvidia.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:27 -04:00
Benjamin Coddington
2eb3aea7d9 NFS: Fix initialization of nfs_page_array->npages
Commit 8ef9b0b9e1 open-coded nfs_pgarray_set(), and left out the
initialization of the nfs_page_array's npages.  This mistake didn't show up
until testing with block layouts, and there shows that all pNFS reads
return -EIO.

Fixes: 8ef9b0b9e1 ("NFS: move nfs_pgarray_set() to open code")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org # 4.12
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:06 -04:00
Trond Myklebust
1a4edf0f46 NFS: Fix commit policy for non-blocking calls to nfs_write_inode()
Now that the writes will schedule a commit on their own, we don't
need nfs_write_inode() to schedule one if there are outstanding
writes, and we're being called in non-blocking mode.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:05 -04:00
Trond Myklebust
919e3bd9a8 NFS: Ensure we commit after writeback is complete
If the page cache is being flushed, then we want to ensure that we
do start a commit once the pages are done being flushed.
If we just wait until all I/O is done to that file, we can end up
livelocking until the balance_dirty_pages() mechanism puts its
foot down and forces I/O to stop.
So instead we do more or less the same thing that O_DIRECT does,
and set up a counter to tell us when the flush is done,

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:05 -04:00
Trond Myklebust
b5973a8c1c NFS: Remove unused fields in the page I/O structures
Remove the 'layout_private' fields that were only used by the pNFS OSD
layout driver.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:05 -04:00
Benjamin Coddington
818a8dbe83 NFS: nfs_rename() - revalidate directories on -ERESTARTSYS
An interrupted rename will leave the old dentry behind if the rename
succeeds.  Fix this by forcing a lookup the next time through
->d_revalidate.

A previous attempt at solving this problem took the approach to complete
the work of the rename asynchronously, however that approach was wrong
since it would allow the d_move() to occur after the directory's i_mutex
had been dropped by the original process.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:04 -04:00
Benjamin Coddington
a7a3b1e971 NFS: convert flags to bool
NFS uses some int, and unsigned int :1, and bool as flags in structs and
args.  Assert the preference for uniformly replacing these with the bool
type.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:04 -04:00
Anna Schumaker
18fe6a23e3 NFS: Set FATTR4_WORD0_TYPE for . and .. entries
The current code worked okay for getdents(), but getdents64() expects
the d_type field to get filled out properly in the stat structure.
Setting this field fixes xfstests generic/401.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:03 -04:00
Christoph Hellwig
800222f80f nfsd4: const-ify nfsd4_ops
nfsd4_ops contains function pointers, and marking it as constant avoids
it being able to be used as an attach vector for code injections.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:03 -04:00
Christoph Hellwig
aa8217d5dc sunrpc: mark all struct svc_version instances as const
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:58:03 -04:00
Christoph Hellwig
b9c744c19c sunrpc: mark all struct svc_procinfo instances as const
struct svc_procinfo contains function pointers, and marking it as
constant avoids it being able to be used as an attach vector for
code injections.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:02 -04:00
Christoph Hellwig
0becc1181c sunrpc: move pc_count out of struct svc_procinfo
pc_count is the only writeable memeber of struct svc_procinfo, which is
a good candidate to be const-ified as it contains function pointers.

This patch moves it into out out struct svc_procinfo, and into a
separate writable array that is pointed to by struct svc_version.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:02 -04:00
Christoph Hellwig
72edc37a2c nfsd4: properly type op_func callbacks
Pass union nfsd4_op_u to the op_func callbacks instead of using unsafe
function pointer casts.

It also adds two missing structures to struct nfsd4_op.u to facilitate
this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:02 -04:00
Christoph Hellwig
62bbf8bbb2 nfsd4: remove nfsd4op_rsize
Except for a lot of unnecessary casts this typedef only has one user,
so remove the casts and expand it in struct nfsd4_operation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:01 -04:00
Christoph Hellwig
c2a1102aa2 nfsd4: properly type op_get_currentstateid callbacks
Pass union nfsd4_op_u to the op_set_currentstateid callbacks instead of
using unsafe function pointer casts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:01 -04:00
Christoph Hellwig
6c9600a71d nfsd4: properly type op_set_currentstateid callbacks
Given the args union in struct nfsd4_op a name, and pass it to the
op_set_currentstateid callbacks instead of using unsafe function
pointer casts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:01 -04:00
Christoph Hellwig
d16d186721 sunrpc: properly type pc_encode callbacks
Drop the resp argument as it can trivially be derived from the rqstp
argument.  With that all functions now have the same prototype, and we
can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:58:00 -04:00
Christoph Hellwig
cc6acc20a6 sunrpc: properly type pc_decode callbacks
Drop the argp argument as it can trivially be derived from the rqstp
argument.  With that all functions now have the same prototype, and we
can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:00 -04:00
Christoph Hellwig
1150ded804 sunrpc: properly type pc_release callbacks
Drop the p and resp arguments as they are always NULL or can trivially
be derived from the rqstp argument.  With that all functions now have the
same prototype, and we can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:57:59 -04:00
Christoph Hellwig
1c8a5409f3 sunrpc: properly type pc_func callbacks
Drop the argp and resp arguments as they can trivially be derived from
the rqstp argument.  With that all functions now have the same prototype,
and we can remove the unsafe casting to svc_procfunc as well as the
svc_procfunc typedef itself.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:57:59 -04:00
Christoph Hellwig
36ba89c2a4 nfsd: remove the unused PROC() macro in nfs3proc.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:57:58 -04:00
Christoph Hellwig
ec7e8caec9 nfsd: use named initializers in PROC()
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:57:58 -04:00
Christoph Hellwig
39d43f75c6 nfsd4: const-ify nfs_cb_version4
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:57:58 -04:00
Christoph Hellwig
511e936bf2 sunrpc: mark all struct rpc_procinfo instances as const
struct rpc_procinfo contains function pointers, and marking it as
constant avoids it being able to be used as an attach vector for
code injections.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:57 -04:00
Christoph Hellwig
9ae7d8ff29 nfs: use ARRAY_SIZE() in the nfsacl_version3 declaration
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:57:57 -04:00
Christoph Hellwig
c551858a88 sunrpc: move p_count out of struct rpc_procinfo
p_count is the only writeable memeber of struct rpc_procinfo, which is
a good candidate to be const-ified as it contains function pointers.

This patch moves it into out out struct rpc_procinfo, and into a
separate writable array that is pointed to by struct rpc_version and
indexed by p_statidx.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:57:57 -04:00
Christoph Hellwig
e91ff8e3c4 lockd: fix some weird indentation
Remove double indentation of a few struct rpc_version and
struct rpc_program instance.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:56 -04:00
Christoph Hellwig
947c6e431d nfs: don't cast callback decode/proc/encode routines
Instead declare all functions with the proper methods signature.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:56 -04:00
Christoph Hellwig
fc016483eb nfs: fix decoder callback prototypes
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:56 -04:00
Christoph Hellwig
04000564c1 lockd: fix decoder callback prototypes
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:55 -04:00
Christoph Hellwig
5362a4ec83 nfsd: fix decoder callback prototypes
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-07-13 15:57:55 -04:00
Christoph Hellwig
843efb7d7d nfsd: fix encoder callback prototypes
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-07-13 15:57:54 -04:00
Christoph Hellwig
fcc85819ee nfs: fix encoder callback prototypes
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:53 -04:00
Christoph Hellwig
d16073389b lockd: fix encoder callback prototypes
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:53 -04:00
Linus Torvalds
ad51271afc Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

- various misc things

- kexec updates

- sysctl core updates

- scripts/gdb udpates

- checkpoint-restart updates

- ipc updates

- kernel/watchdog updates

- Kees's "rough equivalent to the glibc _FORTIFY_SOURCE=1 feature"

- "stackprotector: ascii armor the stack canary"

- more MM bits

- checkpatch updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits)
  writeback: rework wb_[dec|inc]_stat family of functions
  ARM: samsung: usb-ohci: move inline before return type
  video: fbdev: omap: move inline before return type
  video: fbdev: intelfb: move inline before return type
  USB: serial: safe_serial: move __inline__ before return type
  drivers: tty: serial: move inline before return type
  drivers: s390: move static and inline before return type
  x86/efi: move asmlinkage before return type
  sh: move inline before return type
  MIPS: SMP: move asmlinkage before return type
  m68k: coldfire: move inline before return type
  ia64: sn: pci: move inline before type
  ia64: move inline before return type
  FRV: tlbflush: move asmlinkage before return type
  CRIS: gpio: move inline before return type
  ARM: HP Jornada 7XX: move inline before return type
  ARM: KVM: move asmlinkage before type
  checkpatch: improve the STORAGE_CLASS test
  mm, migration: do not trigger OOM killer when migrating memory
  drm/i915: use __GFP_RETRY_MAYFAIL
  ...
2017-07-13 12:38:49 -07:00
Filipe Manana
6592e58c6b Btrfs: fix write corruption due to bio cloning on raid5/6
The recent changes to make bio cloning faster (added in the 4.13 merge
window) by using the bio_clone_fast() API introduced a regression on
raid5/6 modes, because cloned bios have an invalid bi_vcnt field
(therefore it can not be used) and the raid5/6 code uses the
bio_for_each_segment_all() API to iterate the segments of a bio, and this
API uses a bio's bi_vcnt field.

The issue is very simple to trigger by doing for example a direct IO write
against a raid5 or raid6 filesystem and then attempting to read what we
wrote before:

  $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf
  $ mount /dev/sdc /mnt
  $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar
  $ od -t x1 /mnt/foobar
  od: /mnt/foobar: read error: Input/output error

For that example, the following is also reported in dmesg/syslog:

  [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed
  [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2
  [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3
  [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1

A scrub will also fail correcting bad copies, mentioning the following in
dmesg/syslog:

  [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed
  [18276.129617] BTRFS warning (device sdf): checksum error at logical 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 65536, length 4096, links $
  [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed
  [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  [18276.206059] BTRFS warning (device sdf): checksum error at logical 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 196608, length 4096, links$
  [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  [18276.306552] BTRFS warning (device sdf): checksum error at logical 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 262144, length 4096, links$
  [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  [18276.394316] BTRFS warning (device sdf): checksum error at logical 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 458752, length 4096, links$
  [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  [18276.434127] BTRFS warning (device sdf): checksum error at logical 2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset 589824, length 4096, links$
  [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  [18276.500504] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186477568 on dev /dev/sdd
  [18276.538400] BTRFS warning (device sdf): checksum error at logical 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset 200704, length 4096, links$
  [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
  [18276.542012] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186481664 on dev /dev/sdd
  [18276.585030] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186346496 on dev /dev/sde
  [18276.598306] BTRFS warning (device sdf): checksum error at logical 2186412032 on dev /dev/sde, sector 2116736, root 5, inode 257, offset 131072, length 4096, links$
  [18276.598310] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
  [18276.598582] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186350592 on dev /dev/sde
  [18276.603455] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
  [18276.638362] BTRFS warning (device sdf): checksum error at logical 2186354688 on dev /dev/sde, sector 2116624, root 5, inode 257, offset 73728, length 4096, links $
  [18276.640445] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
  [18276.645942] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186354688 on dev /dev/sde
  [18276.657204] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186412032 on dev /dev/sde
  [18276.660563] BTRFS warning (device sdf): checksum error at logical 2186416128 on dev /dev/sde, sector 2116744, root 5, inode 257, offset 135168, length 4096, links$
  [18276.664609] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
  [18276.664609] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186358784 on dev /dev/sde

So fix this by using the bio_for_each_segment() API and setting before
the bio's bi_iter field to the value of the corresponding btrfs bio
container's saved iterator if we are processing a cloned bio in the
raid5/6 code (the same code processes both cloned and non-cloned bios).

This incorrect iteration of cloned bios was also causing some occasional
BUG_ONs when running fstest btrfs/064, which have a trace like the
following:

  [ 6674.416156] ------------[ cut here ]------------
  [ 6674.416157] kernel BUG at fs/btrfs/raid56.c:1897!
  [ 6674.416159] invalid opcode: 0000 [#1] PREEMPT SMP
  [ 6674.416160] Modules linked in: dm_flakey dm_mod dax ppdev tpm_tis parport_pc tpm_tis_core evdev tpm psmouse sg i2c_piix4 pcspkr parport i2c_core serio_raw button s
  [ 6674.416184] CPU: 3 PID: 19236 Comm: kworker/u32:10 Not tainted 4.12.0-rc6-btrfs-next-44+ #1
  [ 6674.416185] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
  [ 6674.416210] Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
  [ 6674.416211] task: ffff880147f6c740 task.stack: ffffc90001fb8000
  [ 6674.416229] RIP: 0010:__raid_recover_end_io+0x1ac/0x370 [btrfs]
  [ 6674.416230] RSP: 0018:ffffc90001fbbb90 EFLAGS: 00010217
  [ 6674.416231] RAX: ffff8801ff4b4f00 RBX: 0000000000000002 RCX: 0000000000000001
  [ 6674.416232] RDX: ffff880099b045d8 RSI: ffffffff81a5f6e0 RDI: 0000000000000004
  [ 6674.416232] RBP: ffffc90001fbbbc8 R08: 0000000000000001 R09: 0000000000000001
  [ 6674.416233] R10: ffffc90001fbbac8 R11: 0000000000001000 R12: 0000000000000002
  [ 6674.416234] R13: ffff880099b045c0 R14: 0000000000000004 R15: ffff88012bff2000
  [ 6674.416235] FS:  0000000000000000(0000) GS:ffff88023f2c0000(0000) knlGS:0000000000000000
  [ 6674.416235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 6674.416236] CR2: 00007f28cf282000 CR3: 00000001000c6000 CR4: 00000000000006e0
  [ 6674.416239] Call Trace:
  [ 6674.416259]  __raid56_parity_recover+0xfc/0x16e [btrfs]
  [ 6674.416276]  raid56_parity_recover+0x157/0x16b [btrfs]
  [ 6674.416293]  btrfs_map_bio+0xe0/0x259 [btrfs]
  [ 6674.416310]  btrfs_submit_bio_hook+0xbf/0x147 [btrfs]
  [ 6674.416327]  end_bio_extent_readpage+0x27b/0x4a0 [btrfs]
  [ 6674.416331]  bio_endio+0x17d/0x1b3
  [ 6674.416346]  end_workqueue_fn+0x3c/0x3f [btrfs]
  [ 6674.416362]  btrfs_scrubparity_helper+0x1aa/0x3b8 [btrfs]
  [ 6674.416379]  btrfs_endio_helper+0xe/0x10 [btrfs]
  [ 6674.416381]  process_one_work+0x276/0x4b6
  [ 6674.416384]  worker_thread+0x1ac/0x266
  [ 6674.416386]  ? rescuer_thread+0x278/0x278
  [ 6674.416387]  kthread+0x106/0x10e
  [ 6674.416389]  ? __list_del_entry+0x22/0x22
  [ 6674.416391]  ret_from_fork+0x27/0x40
  [ 6674.416395] Code: 44 89 e2 be 00 10 00 00 ff 15 b0 ab ef ff eb 72 4d 89 e8 89 d9 44 89 e2 be 00 10 00 00 ff 15 a3 ab ef ff eb 5d 41 83 fc ff 74 02 <0f> 0b 49 63 97
  [ 6674.416432] RIP: __raid_recover_end_io+0x1ac/0x370 [btrfs] RSP: ffffc90001fbbb90
  [ 6674.416434] ---[ end trace 74d56ebe7489dd6a ]---

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-07-13 19:26:01 +01:00
David Howells
fdb254db21 isofs: Fix isofs_show_options()
The isofs patch needs a small fix to handle a signed/unsigned comparison that
the compiler didn't flag - thanks to Dan for catching it.

It should be noted, however, the session number handing appears to be incorrect
between where it is parsed and where it is used.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-13 12:30:43 -04:00
Ernesto A. Fernández
4d9bcaddac ext2: Fix memory leak when truncate races ext2_get_blocks
Buffer heads referencing indirect blocks may not be released if the file
is truncated at the right time. This happens because ext2_get_branch()
returns NULL when it finds the whole chain of indirect blocks already
set, and when truncate alters the chain this value of NULL is
treated as the address of the last head to be released. Handle this in the
same way as it's done after the got_it label.

Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-07-13 13:45:08 +02:00
Linus Torvalds
4ca6df1348 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull sysctl fix from Eric Biederman:
 "A rather embarassing and hard to hit bug was merged into 4.11-rc1.

  Andrei Vagin tracked this bug now and after some staring at the code
  I came up with a fix"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Fix proc_sys_prune_dcache to hold a sb reference
2017-07-12 19:43:20 -07:00
Nikolay Borisov
3e8f399da4 writeback: rework wb_[dec|inc]_stat family of functions
Currently the writeback statistics code uses a percpu counters to hold
various statistics.  Furthermore we have 2 families of functions - those
which disable local irq and those which doesn't and whose names begin
with double underscore.  However, they both end up calling
__add_wb_stats which in turn calls percpu_counter_add_batch which is
already irq-safe.

Exploiting this fact allows to eliminated the __wb_* functions since
they don't add any further protection than we already have.
Furthermore, refactor the wb_* function to call __add_wb_stat directly
without the irq-disabling dance.  This will likely result in better
runtime of code which deals with modifying the stat counters.

While at it also document why percpu_counter_add_batch is in fact
preempt and irq-safe since at least 3 people got confused.

Link: http://lkml.kernel.org/r/1498029937-27293-1-git-send-email-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:05 -07:00
Michal Hocko
91c63ecda7 xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
so it relied on the default page allocator behavior for the given set of
flags.  This means that small allocations actually never failed.

Now that we have __GFP_RETRY_MAYFAIL flag which works independently on
the allocation request size we can map KM_MAYFAIL to it.  The allocator
will try as hard as it can to fulfill the request but fails eventually
if the progress cannot be made.  It does so without triggering the OOM
killer which can be seen as an improvement because KM_MAYFAIL users
should be able to deal with allocation failures.

Link: http://lkml.kernel.org/r/20170623085345.11304-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alex Belits <alex.belits@cavium.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: David Daney <david.daney@cavium.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: NeilBrown <neilb@suse.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:03 -07:00
Dmitry Vyukov
e41d58185f fault-inject: support systematic fault injection
Add /proc/self/task/<current-tid>/fail-nth file that allows failing
0-th, 1-st, 2-nd and so on calls systematically.
Excerpt from the added documentation:

 "Write to this file of integer N makes N-th call in the current task
  fail (N is 0-based). Read from this file returns a single char 'Y' or
  'N' that says if the fault setup with a previous write to this file
  was injected or not, and disables the fault if it wasn't yet injected.
  Note that this file enables all types of faults (slab, futex, etc).
  This setting takes precedence over all other generic settings like
  probability, interval, times, etc. But per-capability settings (e.g.
  fail_futex/ignore-private) take precedence over it. This feature is
  intended for systematic testing of faults in a single system call. See
  an example below"

Why add a new setting:
1. Existing settings are global rather than per-task.
   So parallel testing is not possible.
2. attr->interval is close but it depends on attr->count
   which is non reset to 0, so interval does not work as expected.
3. Trying to model this with existing settings requires manipulations
   of all of probability, interval, times, space, task-filter and
   unexposed count and per-task make-it-fail files.
4. Existing settings are per-failure-type, and the set of failure
   types is potentially expanding.
5. make-it-fail can't be changed by unprivileged user and aggressive
   stress testing better be done from an unprivileged user.
   Similarly, this would require opening the debugfs files to the
   unprivileged user, as he would need to reopen at least times file
   (not possible to pre-open before dropping privs).

The proposed interface solves all of the above (see the example).

We want to integrate this into syzkaller fuzzer.  A prototype has found
10 bugs in kernel in first day of usage:

  https://groups.google.com/forum/#!searchin/syzkaller/%22FAULT_INJECTION%22%7Csort:relevance

I've made the current interface work with all types of our sandboxes.
For setuid the secret sauce was prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) to
make /proc entries non-root owned.  So I am fine with the current
version of the code.

[akpm@linux-foundation.org: fix build]
Link: http://lkml.kernel.org/r/20170328130128.101773-1-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:01 -07:00
Cyrill Gorcunov
92ef6da3d0 kcmp: fs/epoll: wrap kcmp code with CONFIG_CHECKPOINT_RESTORE
kcmp syscall is build iif CONFIG_CHECKPOINT_RESTORE is selected, so wrap
appropriate helpers in epoll code with the config to build it
conditionally.

Link: http://lkml.kernel.org/r/20170513083456.GG1881@uranus.lan
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reported-by: Andrew Morton <akpm@linuxfoundation.org>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:01 -07:00
Cyrill Gorcunov
0791e3644e kcmp: add KCMP_EPOLL_TFD mode to compare epoll target files
With current epoll architecture target files are addressed with
file_struct and file descriptor number, where the last is not unique.
Moreover files can be transferred from another process via unix socket,
added into queue and closed then so we won't find this descriptor in the
task fdinfo list.

Thus to checkpoint and restore such processes CRIU needs to find out
where exactly the target file is present to add it into epoll queue.
For this sake one can use kcmp call where some particular target file
from the queue is compared with arbitrary file passed as an argument.

Because epoll target files can have same file descriptor number but
different file_struct a caller should explicitly specify the offset
within.

To test if some particular file is matching entry inside epoll one have
to

 - fill kcmp_epoll_slot structure with epoll file descriptor,
   target file number and target file offset (in case if only
   one target is present then it should be 0)

 - call kcmp as kcmp(pid1, pid2, KCMP_EPOLL_TFD, fd, &kcmp_epoll_slot)
    - the kernel fetch file pointer matching file descriptor @fd of pid1
    - lookups for file struct in epoll queue of pid2 and returns traditional
      0,1,2 result for sorting purpose

Link: http://lkml.kernel.org/r/20170424154423.511592110@gmail.com
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrey Vagin <avagin@openvz.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:01 -07:00
Cyrill Gorcunov
77493f04b7 procfs: fdinfo: extend information about epoll target files
Since it is possbile to have same number in tfd field (say file added,
closed, then nother file dup'ed to same number and added back) it is
imposible to distinguish such target files solely by their numbers.

Strictly speaking regular applications don't need to recognize these
targets at all but for checkpoint/restore sake we need to collect
targets to be able to push them back on restore stage in a proper order.

Thus lets add file position, inode and device number where this target
lays.  This three fields can be used as a primary key for sorting, and
together with kcmp help CRIU can find out an exact file target (from the
whole set of processes being checkpointed).

Link: http://lkml.kernel.org/r/20170424154423.436491881@gmail.com
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:01 -07:00
Davidlohr Bueso
59224ac1cf fs/Kconfig: kill CONFIG_PERCPU_RWSEM some more
As of commit bf3eac84c4 ("percpu-rwsem: kill CONFIG_PERCPU_RWSEM") we
unconditionally build pcpu-rwsems.  Remove a leftover in for
FILE_LOCKING.

Link: http://lkml.kernel.org/r/20170518180115.2794-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:00 -07:00
Rakesh Pandit
5f9f48f5b3 bfs: fix sanity checks for empty files
Mount fails if file system image has empty files because of sanity check
while reading superblock.  For empty files disk offset to end of file
(i_eoffset) is cpu_to_le32(-1).  Sanity check comparison, which compares
disk offset with file system size isn't valid for this value and hence
is ignored with this patch.

Steps to reproduce:

  $  dd if=/dev/zero of=bfs-image count=204800
  $  mkfs.bfs bfs-image
  $  mkdir bfs-mount-point
  $  sudo mount -t bfs -o loop bfs-image bfs-mount-point/
  $  cd bfs-mount-point/
  $  sudo touch a
  $  cd ..
  $  sudo umount bfs-mount-point/
  $  sudo mount -t bfs -o loop bfs-image bfs-mount-point/
  mount: /dev/loop0: can't read superblock

  $  dmesg
  [25526.689580] BFS-fs: bfs_fill_super(): Inode 0x00000003 corrupted

Tigran said:
 "If you had created the filesystem with the proper mkfs under SCO
  UnixWare 7 you (probably) wouldn't encounter this issue. But since
  commercial Unix-es are now part of history and the only proper way is
  the Linux mkfs.bfs utility, your patch is fine"

Link: http://lkml.kernel.org/r/20170505201625.GA3097@hercules.tuxera.com
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Acked-by: Tigran Aivazian <aivazian.tigran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:00 -07:00
Luis R. Rodriguez
61d9b56a89 sysctl: add unsigned int range support
To keep parity with regular int interfaces provide the an unsigned int
proc_douintvec_minmax() which allows you to specify a range of allowed
valid numbers.

Adding proc_douintvec_minmax_sysadmin() is easy but we can wait for an
actual user for that.

Link: http://lkml.kernel.org/r/20170519033554.18592-6-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:00 -07:00
Luis R. Rodriguez
4f2fec00af sysctl: simplify unsigned int support
Commit e7d316a02f ("sysctl: handle error writing UINT_MAX to u32
fields") added proc_douintvec() to start help adding support for
unsigned int, this however was only half the work needed.  Two fixes
have come in since then for the following issues:

  o Printing the values shows a negative value, this happens since
    do_proc_dointvec() and this uses proc_put_long()

This was fixed by commit 5380e5644a ("sysctl: don't print negative
flag for proc_douintvec").

  o We can easily wrap around the int values: UINT_MAX is 4294967295, if
    we echo in 4294967295 + 1 we end up with 0, using 4294967295 + 2 we
    end up with 1.
  o We echo negative values in and they are accepted

This was fixed by commit 425fffd886 ("sysctl: report EINVAL if value
is larger than UINT_MAX for proc_douintvec").

It still also failed to be added to sysctl_check_table()...  instead of
adding it with the current implementation just provide a proper and
simplified unsigned int support without any array unsigned int support
with no negative support at all.

Historically sysctl proc helpers have supported arrays, due to the
complexity this adds though we've taken a step back to evaluate array
users to determine if its worth upkeeping for unsigned int.  An
evaluation using Coccinelle has been done to perform a grammatical
search to ask ourselves:

  o How many sysctl proc_dointvec() (int) users exist which likely
    should be moved over to proc_douintvec() (unsigned int) ?
	Answer: about 8
	- Of these how many are array users ?
		Answer: Probably only 1
  o How many sysctl array users exist ?
	Answer: about 12

This last question gives us an idea just how popular arrays: they are not.
Array support should probably just be kept for strings.

The identified uint ports are:

  drivers/infiniband/core/ucma.c - max_backlog
  drivers/infiniband/core/iwcm.c - default_backlog
  net/core/sysctl_net_core.c - rps_sock_flow_sysctl()
  net/netfilter/nf_conntrack_timestamp.c - nf_conntrack_timestamp -- bool
  net/netfilter/nf_conntrack_acct.c nf_conntrack_acct -- bool
  net/netfilter/nf_conntrack_ecache.c - nf_conntrack_events -- bool
  net/netfilter/nf_conntrack_helper.c - nf_conntrack_helper -- bool
  net/phonet/sysctl.c proc_local_port_range()

The only possible array users is proc_local_port_range() but it does not
seem worth it to add array support just for this given the range support
works just as well.  Unsigned int support should be desirable more for
when you *need* more than INT_MAX or using int min/max support then does
not suffice for your ranges.

If you forget and by mistake happen to register an unsigned int proc
entry with an array, the driver will fail and you will get something as
follows:

sysctl table check failed: debug/test_sysctl//uint_0002 array now allowed
CPU: 2 PID: 1342 Comm: modprobe Tainted: G        W   E <etc>
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS <etc>
Call Trace:
 dump_stack+0x63/0x81
 __register_sysctl_table+0x350/0x650
 ? kmem_cache_alloc_trace+0x107/0x240
 __register_sysctl_paths+0x1b3/0x1e0
 ? 0xffffffffc005f000
 register_sysctl_table+0x1f/0x30
 test_sysctl_init+0x10/0x1000 [test_sysctl]
 do_one_initcall+0x52/0x1a0
 ? kmem_cache_alloc_trace+0x107/0x240
 do_init_module+0x5f/0x200
 load_module+0x1867/0x1bd0
 ? __symbol_put+0x60/0x60
 SYSC_finit_module+0xdf/0x110
 SyS_finit_module+0xe/0x10
 entry_SYSCALL_64_fastpath+0x1e/0xad
RIP: 0033:0x7f042b22d119
<etc>

Fixes: e7d316a02f ("sysctl: handle error writing UINT_MAX to u32 fields")
Link: http://lkml.kernel.org/r/20170519033554.18592-5-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Liping Zhang <zlpnobody@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:00 -07:00
Luis R. Rodriguez
89c5b53b16 sysctl: fix lax sysctl_check_table() sanity check
Patch series "sysctl: few fixes", v5.

I've been working on making kmod more deterministic, and as I did that I
couldn't help but notice a few issues with sysctl.  My end goal was just
to fix unsigned int support, which back then was completely broken.
Liping Zhang has sent up small atomic fixes, however it still missed yet
one more fix and Alexey Dobriyan had also suggested to just drop array
support given its complexity.

I have inspected array support using Coccinelle and indeed its not that
popular, so if in fact we can avoid it for new interfaces, I agree its
best.

I did develop a sysctl stress driver but will hold that off for another
series.

This patch (of 5):

Commit 7c60c48f58 ("sysctl: Improve the sysctl sanity checks")
improved sanity checks considerbly, however the enhancements on
sysctl_check_table() meant adding a functional change so that only the
last table entry's sanity error is propagated.  It also changed the way
errors were propagated so that each new check reset the err value, this
means only last sanity check computed is used for an error.  This has
been in the kernel since v3.4 days.

Fix this by carrying on errors from previous checks and iterations as we
traverse the table and ensuring we keep any error from previous checks.
We keep iterating on the table even if an error is found so we can
complain for all errors found in one shot.  This works as -EINVAL is
always returned on error anyway, and the check for error is any non-zero
value.

Fixes: 7c60c48f58 ("sysctl: Improve the sysctl sanity checks")
Link: http://lkml.kernel.org/r/20170519033554.18592-2-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:00 -07:00
J. Bruce Fields
630458e730 nfsd4: factor ctime into change attribute
Factoring ctime into the nfsv4 change attribute gives us better
properties than just i_version alone.

Eventually we'll likely also expose this (as opposed to raw i_version)
to userspace, at which point we'll want to move it to a common helper,
called from either userspace or individual filesystems.  For now, nfsd
is the only user.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:55:00 -04:00
Linus Torvalds
6b1c776d3e Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
 "This work from Amir introduces the inodes index feature, which
  provides:

   - hardlinks are not broken on copy up

   - infrastructure for overlayfs NFS export

  This also fixes constant st_ino for samefs case for lower hardlinks"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (33 commits)
  ovl: mark parent impure and restore timestamp on ovl_link_up()
  ovl: document copying layers restrictions with inodes index
  ovl: cleanup orphan index entries
  ovl: persistent overlay inode nlink for indexed inodes
  ovl: implement index dir copy up
  ovl: move copy up lock out
  ovl: rearrange copy up
  ovl: add flag for upper in ovl_entry
  ovl: use struct copy_up_ctx as function argument
  ovl: base tmpfile in workdir too
  ovl: factor out ovl_copy_up_inode() helper
  ovl: extract helper to get temp file in copy up
  ovl: defer upper dir lock to tempfile link
  ovl: hash overlay non-dir inodes by copy up origin
  ovl: cleanup bad and stale index entries on mount
  ovl: lookup index entry for copy up origin
  ovl: verify index dir matches upper dir
  ovl: verify upper root dir matches lower root dir
  ovl: introduce the inodes index dir feature
  ovl: generalize ovl_create_workdir()
  ...
2017-07-12 09:28:55 -07:00
Linus Torvalds
908b852df1 Upgrade default dialect to more secure SMB3 from older cifs dialect
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQGcBAABAgAGBQJZZEkpAAoJEIosvXAHck9RHxML/2fAdq8fzWMACInmycbJuAuS
 o/mjXyli5EtwVwcTCE5Z4mW303ch1CY964hKKT/egeOe4WrtUS6a904UCDTkre48
 KAJBCoi52jqTT6ruTC4EoSlMZi5V8q6/O91fwTVZGBRzEsAz/oCb0uwg1dfyFgu7
 g5W+ppYmQcDTImPR9r3BuYJj56pYWj77vlrRwfN5pAko5OocZXL71JPtBWqYuXoi
 jxicWnHc1RPdCIgaLanqQtTOvPub8f19a5cAz3/IAR6AEo0ySzS45CQGKag+Da86
 JVuXiAQ8SQJUoFvEWQ8XdAMu/U+9Vn6UenB8k2MlrOXNh406X3Rdv0cF0UzSdE93
 E+6xJ0S47pno/3eOgKPs1kDuy5edqgqxTicpurvzzjtAHDJtJGhYxSYxHK9i8R2S
 iNmnkuqBjQf9bprabZG7yze38nTyf0vlO9FviYZnMAy7Pxwpd9ADNHhooDbPaZtG
 qbquIAr0s0XZQVHCM/1jCvvXxjdX+ENpTZ4z79u0mA==
 =8UiP
 -----END PGP SIGNATURE-----

Merge tag 'smb3-security-fixes-for-4.13' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes and sane default from Steve French:
 "Upgrade default dialect to more secure SMB3 from older cifs dialect"

* tag 'smb3-security-fixes-for-4.13' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Clean up unused variables in smb2pdu.c
  [SMB3] Improve security, move default dialect to SMB3 from old CIFS
  [SMB3] Remove ifdef since SMB3 (and later) now STRONGLY preferred
  CIFS: Reconnect expired SMB sessions
  CIFS: Display SMB2 error codes in the hex format
  cifs: Use smb 2 - 3 and cifsacl mount options setacl function
  cifs: prototype declaration and definition to set acl for smb 2 - 3 and cifsacl mount options
2017-07-11 14:04:48 -07:00
Linus Torvalds
3bf7878f0f The main item here is support for v12.y.z ("Luminous") clusters:
RESEND_ON_SPLIT, RADOS_BACKOFF, OSDMAP_PG_UPMAP and CRUSH_CHOOSE_ARGS
 feature bits, and various other changes in the RADOS client protocol.
 On top of that we have a new fsc mount option to allow supplying
 fscache uniquifier (similar to NFS) and the usual pile of filesystem
 fixes from Zheng.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJZZQT+AAoJEEp/3jgCEfOLSsMH/i8ZdSzp7ocX00oLMlIxzFEk
 5BUXZ086mEPAE4fjJFPO7+qYk6y26MzAhJL+bj8r5E0GvBEpQkoAoSQZ19Mj5ApC
 nZnllzQ2C8kYvM4hp4Z2pLrF/OYACj/WJJgbTxubBET1zRq1iPj4EgbzBEraPvma
 K76W9ILKNUjIoSDlNR5qvykXXfvi2dxRpi/8nvfMCOcjlw/7orjXVLa05fKmmOoX
 OvpOjicWOrc8NlacGK+j1j1aaKlmLvZb9Ff+45hfC/L5PPQblM0dypFCVfq3MFFq
 nUxKgTCAQDPrndzCdURCtdovjFKbskRGKmhnd0EZkdDCcnUmg6nLxqta6g2Dbs0=
 =ioKM
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.13-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The main item here is support for v12.y.z ("Luminous") clusters:
  RESEND_ON_SPLIT, RADOS_BACKOFF, OSDMAP_PG_UPMAP and CRUSH_CHOOSE_ARGS
  feature bits, and various other changes in the RADOS client protocol.

  On top of that we have a new fsc mount option to allow supplying
  fscache uniquifier (similar to NFS) and the usual pile of filesystem
  fixes from Zheng"

* tag 'ceph-for-4.13-rc1' of git://github.com/ceph/ceph-client: (44 commits)
  libceph: advertise support for NEW_OSDOP_ENCODING and SERVER_LUMINOUS
  libceph: osd_state is 32 bits wide in luminous
  crush: remove an obsolete comment
  crush: crush_init_workspace starts with struct crush_work
  libceph, crush: per-pool crush_choose_arg_map for crush_do_rule()
  crush: implement weight and id overrides for straw2
  libceph: apply_upmap()
  libceph: compute actual pgid in ceph_pg_to_up_acting_osds()
  libceph: pg_upmap[_items] infrastructure
  libceph: ceph_decode_skip_* helpers
  libceph: kill __{insert,lookup,remove}_pg_mapping()
  libceph: introduce and switch to decode_pg_mapping()
  libceph: don't pass pgid by value
  libceph: respect RADOS_BACKOFF backoffs
  libceph: make DEFINE_RB_* helpers more general
  libceph: avoid unnecessary pi lookups in calc_target()
  libceph: use target pi for calc_target() calculations
  libceph: always populate t->target_{oid,oloc} in calc_target()
  libceph: make sure need_resend targets reflect latest map
  libceph: delete from need_resend_linger before check_linger_pool_dne()
  ...
2017-07-11 12:12:28 -07:00
Eric W. Biederman
2fd1d2c4ce proc: Fix proc_sys_prune_dcache to hold a sb reference
Andrei Vagin writes:
FYI: This bug has been reproduced on 4.11.7
> BUG: Dentry ffff895a3dd01240{i=4e7c09a,n=lo}  still in use (1) [unmount of proc proc]
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 13588 at fs/dcache.c:1445 umount_check+0x6e/0x80
> CPU: 1 PID: 13588 Comm: kworker/1:1 Not tainted 4.11.7-200.fc25.x86_64 #1
> Hardware name: CompuLab sbc-flt1/fitlet, BIOS SBCFLT_0.08.04 06/27/2015
> Workqueue: events proc_cleanup_work
> Call Trace:
>  dump_stack+0x63/0x86
>  __warn+0xcb/0xf0
>  warn_slowpath_null+0x1d/0x20
>  umount_check+0x6e/0x80
>  d_walk+0xc6/0x270
>  ? dentry_free+0x80/0x80
>  do_one_tree+0x26/0x40
>  shrink_dcache_for_umount+0x2d/0x90
>  generic_shutdown_super+0x1f/0xf0
>  kill_anon_super+0x12/0x20
>  proc_kill_sb+0x40/0x50
>  deactivate_locked_super+0x43/0x70
>  deactivate_super+0x5a/0x60
>  cleanup_mnt+0x3f/0x90
>  mntput_no_expire+0x13b/0x190
>  kern_unmount+0x3e/0x50
>  pid_ns_release_proc+0x15/0x20
>  proc_cleanup_work+0x15/0x20
>  process_one_work+0x197/0x450
>  worker_thread+0x4e/0x4a0
>  kthread+0x109/0x140
>  ? process_one_work+0x450/0x450
>  ? kthread_park+0x90/0x90
>  ret_from_fork+0x2c/0x40
> ---[ end trace e1c109611e5d0b41 ]---
> VFS: Busy inodes after unmount of proc. Self-destruct in 5 seconds.  Have a nice day...
> BUG: unable to handle kernel NULL pointer dereference at           (null)
> IP: _raw_spin_lock+0xc/0x30
> PGD 0

Fix this by taking a reference to the super block in proc_sys_prune_dcache.

The superblock reference is the core of the fix however the sysctl_inodes
list is converted to a hlist so that hlist_del_init_rcu may be used.  This
allows proc_sys_prune_dache to remove inodes the sysctl_inodes list, while
not causing problems for proc_sys_evict_inode when if it later choses to
remove the inode from the sysctl_inodes list.  Removing inodes from the
sysctl_inodes list allows proc_sys_prune_dcache to have a progress
guarantee, while still being able to drop all locks.  The fact that
head->unregistering is set in start_unregistering ensures that no more
inodes will be added to the the sysctl_inodes list.

Previously the code did a dance where it delayed calling iput until the
next entry in the list was being considered to ensure the inode remained on
the sysctl_inodes list until the next entry was walked to.  The structure
of the loop in this patch does not need that so is much easier to
understand and maintain.

Cc: stable@vger.kernel.org
Reported-by: Andrei Vagin <avagin@gmail.com>
Tested-by: Andrei Vagin <avagin@openvz.org>
Fixes: ace0c791e6 ("proc/sysctl: Don't grab i_lock under sysctl_lock.")
Fixes: d6cffbbe9a ("proc/sysctl: prune stale dentries during unregistering")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-07-11 11:01:24 -05:00
David Howells
1d278a8790 VFS: Kill off s_options and helpers
Kill off s_options, save/replace_mount_options() and generic_show_options()
as all filesystems now implement ->show_options() for themselves.  This
should make it easier to implement a context-based mount where the mount
options can be passed individually over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:09:21 -04:00
David Howells
4dfdb71307 orangefs: Implement show_options
Implement the show_options superblock op for orangefs as part of a bid to
rid of s_options and generic_show_options() to make it easier to implement
a context-based mount where the mount options can be passed individually
over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Mike Marshall <hubcap@omnibond.com>
cc: pvfs2-developers@beowulf-underground.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:09:21 -04:00
David Howells
c4fac91004 9p: Implement show_options
Implement the show_options superblock op for 9p as part of a bid to get
rid of s_options and generic_show_options() to make it easier to implement
a context-based mount where the mount options can be passed individually
over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Ron Minnich <rminnich@sandia.gov>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: v9fs-developer@lists.sourceforge.net
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:08:58 -04:00
David Howells
86a1da6d30 isofs: Implement show_options
Implement the show_options superblock op for omfs as part of a bid to get
rid of s_options and generic_show_options() to make it easier to implement
a context-based mount where the mount options can be passed individually
over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:06:18 -04:00
David Howells
677018a6ce afs: Implement show_options
Implement the show_options superblock op for afs as part of a bid to get
rid of s_options and generic_show_options() to make it easier to implement
a context-based mount where the mount options can be passed individually
over a file descriptor.

Also implement the show_devname op to display the correct device name and thus
avoid the need to display the cell= and volume= options.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:06:18 -04:00
David Howells
26a7655e6a affs: Implement show_options
Implement the show_options superblock op for affs as part of a bid to get
rid of s_options and generic_show_options() to make it easier to implement
a context-based mount where the mount options can be passed individually
over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:06:17 -04:00
David Howells
3ab7947ac3 befs: Implement show_options
Implement the show_options superblock op for befs as part of a bid to get
rid of s_options and generic_show_options() to make it easier to implement
a context-based mount where the mount options can be passed individually
over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Luis de Bethencourt <luisbg@osg.samsung.com>
cc: Salah Triki <salah.triki@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:06:17 -04:00
Linus Torvalds
9967468c0a Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - most of the rest of MM

 - KASAN updates

 - lib/ updates

 - checkpatch updates

 - some binfmt_elf changes

 - various misc bits

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (115 commits)
  kernel/exit.c: avoid undefined behaviour when calling wait4()
  kernel/signal.c: avoid undefined behaviour in kill_something_info
  binfmt_elf: safely increment argv pointers
  s390: reduce ELF_ET_DYN_BASE
  powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB
  arm64: move ELF_ET_DYN_BASE to 4GB / 4MB
  arm: move ELF_ET_DYN_BASE to 4MB
  binfmt_elf: use ELF_ET_DYN_BASE only for PIE
  fs, epoll: short circuit fetching events if thread has been killed
  checkpatch: improve multi-line alignment test
  checkpatch: improve macro reuse test
  checkpatch: change format of --color argument to --color[=WHEN]
  checkpatch: silence perl 5.26.0 unescaped left brace warnings
  checkpatch: improve tests for multiple line function definitions
  checkpatch: remove false warning for commit reference
  checkpatch: fix stepping through statements with $stat and ctx_statement_block
  checkpatch: [HLP]LIST_HEAD is also declaration
  checkpatch: warn when a MAINTAINERS entry isn't [A-Z]:\t
  checkpatch: improve the unnecessary OOM message test
  lib/bsearch.c: micro-optimize pivot position calculation
  ...
2017-07-10 16:58:42 -07:00
Kees Cook
67c6777a5d binfmt_elf: safely increment argv pointers
When building the argv/envp pointers, the envp is needlessly
pre-incremented instead of just continuing after the argv pointers are
finished.  In some (likely impossible) race where the strings could be
changed from userspace between copy_strings() and here, it might be
possible to confuse the envp position.  Instead, just use sp like
everything else.

Link: http://lkml.kernel.org/r/20170622173838.GA43308@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Qualys Security Advisory <qsa@qualys.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:36 -07:00
Kees Cook
eab09532d4 binfmt_elf: use ELF_ET_DYN_BASE only for PIE
The ELF_ET_DYN_BASE position was originally intended to keep loaders
away from ET_EXEC binaries.  (For example, running "/lib/ld-linux.so.2
/bin/cat" might cause the subsequent load of /bin/cat into where the
loader had been loaded.)

With the advent of PIE (ET_DYN binaries with an INTERP Program Header),
ELF_ET_DYN_BASE continued to be used since the kernel was only looking
at ET_DYN.  However, since ELF_ET_DYN_BASE is traditionally set at the
top 1/3rd of the TASK_SIZE, a substantial portion of the address space
is unused.

For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs are
loaded above the mmap region.  This means they can be made to collide
(CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with
pathological stack regions.

Lowering ELF_ET_DYN_BASE solves both by moving programs below the mmap
region in all cases, and will now additionally avoid programs falling
back to the mmap region by enforcing MAP_FIXED for program loads (i.e.
if it would have collided with the stack, now it will fail to load
instead of falling back to the mmap region).

To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP)
are loaded into the mmap region, leaving space available for either an
ET_EXEC binary with a fixed location or PIE being loaded into mmap by
the loader.  Only PIE programs are loaded offset from ELF_ET_DYN_BASE,
which means architectures can now safely lower their values without risk
of loaders colliding with their subsequently loaded programs.

For 64-bit, ELF_ET_DYN_BASE is best set to 4GB to allow runtimes to use
the entire 32-bit address space for 32-bit pointers.

Thanks to PaX Team, Daniel Micay, and Rik van Riel for inspiration and
suggestions on how to implement this solution.

Fixes: d1fd836dcf ("mm: split ET_DYN ASLR from mmap ASLR")
Link: http://lkml.kernel.org/r/20170621173201.GA114489@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Qualys Security Advisory <qsa@qualys.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pratyush Anand <panand@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:36 -07:00
David Rientjes
c257a340ed fs, epoll: short circuit fetching events if thread has been killed
We've encountered zombies that are waiting for a thread to exit that are
looping in ep_poll() almost endlessly although there is a pending
SIGKILL as a result of a group exit.

This happens because we always find ep_events_available() and fetch more
events and never are able to check for signal_pending() that would break
from the loop and return -EINTR.

Special case fatal signals and break immediately to guarantee that we
loop to fetch more events and delay making a timely exit.

It would also be possible to simply move the check for signal_pending()
higher than checking for ep_events_available(), but there have been no
reports of delayed signal handling other than SIGKILL preventing zombies
from exiting that would be fixed by this.

It fixes an issue for us where we have witnessed zombies sticking around
for at least O(minutes), but considering the code has been like this
forever and nobody else has complained that I have found, I would simply
queue it up for 4.12.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705031722350.76784@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.cz>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:36 -07:00
Heiner Kallweit
cde1b69389 fs/proc/generic.c: switch to ida_simple_get/remove
The code can be much simplified by switching to ida_simple_get/remove.

Link: http://lkml.kernel.org/r/8d1cc9f7-5115-c9dc-028e-c0770b6bfe1f@gmail.com
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:34 -07:00
Sahitya Tummala
b17c070fb6 fs/dcache.c: fix spin lockup issue on nlru->lock
__list_lru_walk_one() acquires nlru spin lock (nlru->lock) for longer
duration if there are more number of items in the lru list.  As per the
current code, it can hold the spin lock for upto maximum UINT_MAX
entries at a time.  So if there are more number of items in the lru
list, then "BUG: spinlock lockup suspected" is observed in the below
path:

  spin_bug+0x90
  do_raw_spin_lock+0xfc
  _raw_spin_lock+0x28
  list_lru_add+0x28
  dput+0x1c8
  path_put+0x20
  terminate_walk+0x3c
  path_lookupat+0x100
  filename_lookup+0x6c
  user_path_at_empty+0x54
  SyS_faccessat+0xd0
  el0_svc_naked+0x24

This nlru->lock is acquired by another CPU in this path -

  d_lru_shrink_move+0x34
  dentry_lru_isolate_shrink+0x48
  __list_lru_walk_one.isra.10+0x94
  list_lru_walk_node+0x40
  shrink_dcache_sb+0x60
  do_remount_sb+0xbc
  do_emergency_remount+0xb0
  process_one_work+0x228
  worker_thread+0x2e0
  kthread+0xf4
  ret_from_fork+0x10

Fix this lockup by reducing the number of entries to be shrinked from
the lru list to 1024 at once.  Also, add cond_resched() before
processing the lru list again.

Link: http://marc.info/?t=149722864900001&r=1&w=2
Link: http://lkml.kernel.org/r/1498707575-2472-1-git-send-email-stummala@codeaurora.org
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Alexander Polakov <apolyakov@beget.ru>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:33 -07:00
Vasily Averin
8c03cc85a0 fs/proc/task_mmu.c: remove obsolete comment in show_map_vma()
After commit 1be7107fbe ("mm: larger stack guard gap, between vmas")
we do not hide stack guard page in /proc/<pid>/maps

Link: http://lkml.kernel.org/r/211f3c2a-f7ef-7c13-82bf-46fd426f6e1b@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:32 -07:00
Naoya Horiguchi
78bb920344 mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error
Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to
keep the error hugepage away from the system, which is OK but not good
enough because the hugepage still has a refcount and unpoison doesn't
work on the error hugepage (PageHWPoison flags are cleared but pages are
still leaked.) And there's "wasting health subpages" issue too.  This
patch reworks on me_huge_page() to solve these issues.

For hugetlb file, recently we have truncating code so let's use it in
hugetlbfs specific ->error_remove_page().

For anonymous hugepage, it's helpful to dissolve the error page after
freeing it into free hugepage list.  Migration entry and PageHWPoison in
the head page prevent the access to it.

TODO: dissolve_free_huge_page() can fail but we don't considered it yet.
It's not critical (and at least no worse that now) because in such case
the error hugepage just stays in free hugepage list without being
dissolved.  By virtue of PageHWPoison in head page, it's never allocated
to processes.

[akpm@linux-foundation.org: fix unused var warnings]
Fixes: 23a003bfd2 ("mm/madvise: pass return code of memory_failure() to userspace")
Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Eric Biggers
241f01fbed fs/buffer.c: make bh_lru_install() more efficient
To install a buffer_head into the cpu's LRU queue, bh_lru_install()
would construct a new copy of the queue and then memcpy it over the real
queue.  But it's easily possible to do the update in-place, which is
faster and simpler.  Some work can also be skipped if the buffer_head
was already in the queue.

As a microbenchmark I timed how long it takes to run sb_getblk()
10,000,000 times alternating between BH_LRU_SIZE + 1 blocks.
Effectively, this benchmarks looking up buffer_heads that are in the
page cache but not in the LRU:

	Before this patch: 1.758s
	After this patch: 1.653s

This patch also removes about 350 bytes of compiled code (on x86_64),
partly due to removal of the memcpy() which was being inlined+unrolled.

Link: http://lkml.kernel.org/r/20161229193445.1913-1-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Linus Torvalds
5cdd4c0468 for-f2fs-4.13
In this round, we've added new features such as disk quota and statx, and
 modified internal bio management flow to merge more IOs depending on block
 types. We've also made internal threads freezeable for Android battery life.
 In addition to them, there are some patches to avoid lock contention as well
 as a couple of deadlock conditions.
 
 = Enhancement
 - support usrquota, grpquota, and statx
 - manage DATA/NODE typed bios separately to serialize more IOs
 - modify f2fs_lock_op/wio_mutex to avoid lock contention
 - prevent lock contention in migratepage
 
 = Bug fix
 - miss to load written inode flag
 - fix worst case victim selection in GC
 - freezeable GC and discard threads for Android battery life
 - sanitize f2fs metadata to deal with security hole
 - clean up sysfs-related code and docs
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAllj6fMACgkQQBSofoJI
 UNJ6Ng/+PqdGV/b6KroYIXI/scFx/1t87/0W+rY9tyLr1jX7nIHn9KLPjeDdvdlk
 5vEeZ/dGfW8wSI+ESzscvKberG2QlOPwJRyTB4jWR+bLatwzg7YjEblz+RX4/wfJ
 jKjnR7M//gRdhHdqA0xXrqguAjPbcEDK2RiVbhioMjWbZ/77j0IjcRokjMYdEf0m
 cJc2oMXFtlo+DJ1h9/8BmwQPTI9FfVdgbkPFTTJzV0ydQnBdxcAigrzwYZhPOVv0
 n2M1dKOiQewB4OADMuepZLFqJheItlgG9wlvEjGq7zTd5epHXRIqhM6h9GikQVb9
 YKAkajlKfWcwEXaEcVXtsMHC9x69Yf8xxOSQ1VrhypSUNbaynC9LDsErJx6yrF3P
 XC5baiqXsd/btg7tfrHJjk3gI+ck97d6TrTfUVR91X+1Tpkz7cyB226WxFKbyOG3
 EYCFVMbrIN2CaHHt1xWIT2zCfX5w9ycp8kFjY6jPi0OOZrKXpFw+1AwwTu9kn4xJ
 iuUc8pmc0/FyPqokmLef4Qp/RRM83+f+nzW/y//lkEf3nMn6qlHzNI1RAxXnBvGV
 DMXzuJDcJcHGcSDr7mWyKkm6gYcak/E4DdQLQqJ6VCt6KCdCEXP/XDlig5ey5ODY
 uGEr1QhXIpiYAON45HUi3gmytB3J3ZdzzpsG1PEco4+hjSuFhyE=
 =N4GZ
 -----END PGP SIGNATURE-----

Merge tag 'for-f2fs-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've added new features such as disk quota and statx,
  and modified internal bio management flow to merge more IOs depending
  on block types. We've also made internal threads freezeable for
  Android battery life. In addition to them, there are some patches to
  avoid lock contention as well as a couple of deadlock conditions.

  Enhancements:
   - support usrquota, grpquota, and statx
   - manage DATA/NODE typed bios separately to serialize more IOs
   - modify f2fs_lock_op/wio_mutex to avoid lock contention
   - prevent lock contention in migratepage

  Bug fixes:
   - fix missing load of written inode flag
   - fix worst case victim selection in GC
   - freezeable GC and discard threads for Android battery life
   - sanitize f2fs metadata to deal with security hole
   - clean up sysfs-related code and docs"

* tag 'for-f2fs-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (59 commits)
  f2fs: support plain user/group quota
  f2fs: avoid deadlock caused by lock order of page and lock_op
  f2fs: use spin_{,un}lock_irq{save,restore}
  f2fs: relax migratepage for atomic written page
  f2fs: don't count inode block in in-memory inode.i_blocks
  Revert "f2fs: fix to clean previous mount option when remount_fs"
  f2fs: do not set LOST_PINO for renamed dir
  f2fs: do not set LOST_PINO for newly created dir
  f2fs: skip ->writepages for {mete,node}_inode during recovery
  f2fs: introduce __check_sit_bitmap
  f2fs: stop gc/discard thread in prior during umount
  f2fs: introduce reserved_blocks in sysfs
  f2fs: avoid redundant f2fs_flush after remount
  f2fs: report # of free inodes more precisely
  f2fs: add ioctl to do gc with target block address
  f2fs: don't need to check encrypted inode for partial truncation
  f2fs: measure inode.i_blocks as generic filesystem
  f2fs: set CP_TRIMMED_FLAG correctly
  f2fs: require key for truncate(2) of encrypted file
  f2fs: move sysfs code from super.c to fs/f2fs/sysfs.c
  ...
2017-07-10 14:29:45 -07:00
Linus Torvalds
7cee9384cb Fix up over-eager 'wait_queue_t' renaming
Commit ac6424b981 ("sched/wait: Rename wait_queue_t =>
wait_queue_entry_t") had scripted the renaming incorrectly, and didn't
actually check that the 'wait_queue_t' was a full token.

As a result, it also triggered on 'wait_queue_token', and renamed that
to 'wait_queue_entry_token' entry in the autofs4 packet structure
definition too.  That was entirely incorrect, and not intended.

The end result built fine when building just the kernel - because
everything had been renamed consistently there - but caused problems in
user space because the "struct autofs_packet_missing" type is exported
as part of the uapi.

This scripts it all back again:

    git grep -lw wait_queue_entry_token |
        xargs sed -i 's/wait_queue_entry_token/wait_queue_token/g'

and checks the end result.

Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Fixes: ac6424b981 ("sched/wait: Rename wait_queue_t => wait_queue_entry_t")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 11:40:19 -07:00
Linus Torvalds
642338ba33 Changes for 4.13:
- Avoid quotacheck deadlocks
 - Fix transaction overflows when bunmapping fragmented files
 - Refactor directory readahead
 - Allow admin to configure if ASSERT is fatal
 - Improve transaction usage detail logging during overflows
 - Minor cleanups
 - Don't leak log items when the log shuts down
 - Remove double-underscore typedefs
 - Various preparation for online scrubbing
 - Introduce new error injection configuration sysfs knobs
 - Refactor dq_get_next to use extent map directly
 - Fix problems with iterating the page cache for unwritten data
 - Implement SEEK_{HOLE,DATA} via iomap
 - Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA
 - Don't use MAXPATHLEN to check on-disk symlink target lengths
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJZYDw4AAoJEPh/dxk0SrTr2IMP/3JLeygIDtKBBVRPvlCmEXQC
 j8w1C/ntn46zZKQ8l14fAFV4HV2d+KJWf8+yDuPuGdMXJfPeKZf95otYhnSx/9Th
 MvCH7Nzg63yjEGqXpBkfIVr/GT0KTx28lxiqNViChr7XiXWookgf3SSLINO+vU4J
 L2jgLqieJfijiHTBs4qGCQPDwSXVoSOi5XCCQWDYQrXz6DI5UEJc70U53WkH4tRu
 RctOgp1lralwEO0PhfomD3m/Gk94taE/4ZpX/j/5Y4tvH/yh5aY3/KTCLm6+mYT3
 rgMpmg5hmm+UiCTNoTnQ5RxzGZWCfI1I9FZ3HqDsbhmFtaWh32ti0dEEDYsF8Opj
 ARnTty3cRx41LH9dULrVWdwW105AHgwEz8/OZlG0JOca9qzj9GKERMg/hpHINAKN
 TrBlkweg86LWZDy23udZJ/v35svNqSFsqL1yV8j5dXyBi+Yi2CGfU27zbBwnj4Jk
 047l+OuRbBnEOUULqJTEVBY3euoclwl/yQrW2m409s7vPGkGQBLuFCsDKQdnvJ/A
 D7frZqH8XypwnhFOkKybUnBkn4P7vZ2sEuCIZMsrH5k/ys8XyEkaBaOurjvMBOKA
 vLIMD6RXDWrFbOoovfK/stEM6/UFoQkgMhBe7vB9EXk1AjM8NYyWZgp5BkHtytC7
 qa6GRjtGefhc67hbwXJd
 =/GZI
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull XFS updates from Darrick Wong:
 "Here are some changes for you for 4.13. For the most part it's fixes
  for bugs and deadlock problems, and preparation for online fsck in
  some future merge window.

   - Avoid quotacheck deadlocks

   - Fix transaction overflows when bunmapping fragmented files

   - Refactor directory readahead

   - Allow admin to configure if ASSERT is fatal

   - Improve transaction usage detail logging during overflows

   - Minor cleanups

   - Don't leak log items when the log shuts down

   - Remove double-underscore typedefs

   - Various preparation for online scrubbing

   - Introduce new error injection configuration sysfs knobs

   - Refactor dq_get_next to use extent map directly

   - Fix problems with iterating the page cache for unwritten data

   - Implement SEEK_{HOLE,DATA} via iomap

   - Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA

   - Don't use MAXPATHLEN to check on-disk symlink target lengths"

* tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
  xfs: don't crash on unexpected holes in dir/attr btrees
  xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN
  xfs: fix contiguous dquot chunk iteration livelock
  xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA
  vfs: Add iomap_seek_hole and iomap_seek_data helpers
  vfs: Add page_cache_seek_hole_data helper
  xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk
  xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent
  xfs: Check for m_errortag initialization in xfs_errortag_test
  xfs: grab dquots without taking the ilock
  xfs: fix semicolon.cocci warnings
  xfs: Don't clear SGID when inheriting ACLs
  xfs: free cowblocks and retry on buffered write ENOSPC
  xfs: replace log_badcrc_factor knob with error injection tag
  xfs: convert drop_writes to use the errortag mechanism
  xfs: remove unneeded parameter from XFS_TEST_ERROR
  xfs: expose errortag knobs via sysfs
  xfs: make errortag a per-mountpoint structure
  xfs: free uncommitted transactions during log recovery
  xfs: don't allow bmap on rt files
  ...
2017-07-10 10:51:53 -07:00
Linus Torvalds
6618a24ab2 Merge branch 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
 "This fixes a user-visible bug introduced by the nowait-aio patches
  merged in this cycle"

* 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: nowait aio: Correct assignment of pos
2017-07-10 10:27:48 -07:00
Goldwyn Rodrigues
ff0fa73247 btrfs: nowait aio: Correct assignment of pos
Assigning pos for usage early messes up in append mode, where the pos is
re-assigned in generic_write_checks(). Assign pos later to get the
correct position to write from iocb->ki_pos.

Since check_can_nocow also uses the value of pos, we shift
generic_write_checks() before check_can_nocow(). Checks with IOCB_DIRECT
are present in generic_write_checks(), so checking for IOCB_NOWAIT is
enough.

Also, put locking sequence in the fast path.

This fixes a user visible bug, as reported:

"apparently breaks several shell related features on my system.
In zsh history stopped working, because no new entries are added
anymore.
I fist noticed the issue when I tried to build mplayer. It uses a shell
script to generate a help_mp.h file:
[...]

Here is a simple testcase:

 % echo "foo" >> test
 % echo "foo" >> test
 % cat test
 foo
 %
"

Fixes: edf064e7c6 ("btrfs: nowait aio support")
CC: Jens Axboe <axboe@kernel.dk>
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Link: https://lkml.kernel.org/r/20170704042306.GA274@x4
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-10 15:29:44 +02:00
Christos Gkekas
68a6afa7fa cifs: Clean up unused variables in smb2pdu.c
There are multiple unused variables struct TCP_Server_Info *server
defined in many methods in smb2pdu.c. They should be removed and related
logic simplified.

Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-07-09 23:38:00 -05:00
David Howells
d3e3b7eac8 afs: Add metadata xattrs
Add xattrs to allow the user to get/set metadata in lieu of having pioctl()
available.  The following xattrs are now available:

 - "afs.cell"

   The name of the cell in which the vnode's volume resides.

 - "afs.fid"

   The volume ID, vnode ID and vnode uniquifier of the file as three hex
   numbers separated by colons.

 - "afs.volume"

   The name of the volume in which the vnode resides.

For example:

	# getfattr -d -m ".*" /mnt/scratch
	getfattr: Removing leading '/' from absolute path names
	# file: mnt/scratch
	afs.cell="mycell.myorg.org"
	afs.fid="10000b:1:1"
	afs.volume="scratch"

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-09 14:40:12 -07:00
Marc Dionne
fd2498211a afs: Ignore AFS_ACE_READ and AFS_ACE_WRITE for directories
The AFS_ACE_READ and AFS_ACE_WRITE permission bits should not
be used to make access decisions for the directory itself.  They
are meant to control access for the objects contained in that
directory.

Reading a directory is allowed if the AFS_ACE_LOOKUP bit is set.
This would cause an incorrect access denied error for a directory
with AFS_ACE_LOOKUP but not AFS_ACE_READ.

The AFS_ACE_WRITE bit does not allow operations that modify the
directory.  For a directory with AFS_ACE_WRITE but neither
AFS_ACE_INSERT nor AFS_ACE_DELETE, this would result in trying
operations that would ultimately be denied by the server.

Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-09 14:40:12 -07:00
Linus Torvalds
bc2c6421cb The first major feature for ext4 this merge window is the largedir
feature, which allows ext4 directories to support over 2 billion
 directory entries (assuming ~64 byte file names; in practice, users
 will run into practical performance limits first.)  This feature was
 originally written by the Lustre team, and credit goes to Artem
 Blagodarenko from Seagate for getting this feature upstream.
 
 The second major major feature allows ext4 to support extended
 attribute values up to 64k.  This feature was also originally from
 Lustre, and has been enhanced by Tahsin Erdogan from Google with a
 deduplication feature so that if multiple files have the same xattr
 value (for example, Windows ACL's stored by Samba), only one copy will
 be stored on disk for encoding and caching efficiency.
 
 We also have the usual set of bug fixes, cleanups, and optimizations.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAllhl5AACgkQ8vlZVpUN
 gaOiNQf+L23sT9KIQmFwQP38vkBVw67Eo7gBfevmk7oqQLiRppT5mmLzW8EWEDxR
 PVaDQXvSZi18wSCAAcCd1ZqeIZk0P6tst0ufnIT60tGlZdUlwSLyrqvV/30axR2g
 6kcnv90ZszrQNx5U8q8bMzNrs1KtyPHFCRzavFsBX11WezNSpWnH2in/uxO+t9Jy
 F2zlrLUrE2m9AVMH48Dh6LbeaB6pqgr4k3jq1jG4Iqb2h9xgU8OKhs8gL07YS+Qi
 5A7s8GIvYQSoZUO9DOOie2f1zhpO0KrhXchyZTJukVQH7TsmFxoSh0vhXnP1Bohu
 CNLV6dzetDT0VfmPr1WhVe7lhZeeVw==
 =FFkF
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "The first major feature for ext4 this merge window is the largedir
  feature, which allows ext4 directories to support over 2 billion
  directory entries (assuming ~64 byte file names; in practice, users
  will run into practical performance limits first.) This feature was
  originally written by the Lustre team, and credit goes to Artem
  Blagodarenko from Seagate for getting this feature upstream.

  The second major major feature allows ext4 to support extended
  attribute values up to 64k. This feature was also originally from
  Lustre, and has been enhanced by Tahsin Erdogan from Google with a
  deduplication feature so that if multiple files have the same xattr
  value (for example, Windows ACL's stored by Samba), only one copy will
  be stored on disk for encoding and caching efficiency.

  We also have the usual set of bug fixes, cleanups, and optimizations"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits)
  ext4: fix spelling mistake: "prellocated" -> "preallocated"
  ext4: fix __ext4_new_inode() journal credits calculation
  ext4: skip ext4_init_security() and encryption on ea_inodes
  fs: generic_block_bmap(): initialize all of the fields in the temp bh
  ext4: change fast symlink test to not rely on i_blocks
  ext4: require key for truncate(2) of encrypted file
  ext4: don't bother checking for encryption key in ->mmap()
  ext4: check return value of kstrtoull correctly in reserved_clusters_store
  ext4: fix off-by-one fsmap error on 1k block filesystems
  ext4: return EFSBADCRC if a bad checksum error is found in ext4_find_entry()
  ext4: return EIO on read error in ext4_find_entry
  ext4: forbid encrypting root directory
  ext4: send parallel discards on commit completions
  ext4: avoid unnecessary stalls in ext4_evict_inode()
  ext4: add nombcache mount option
  ext4: strong binding of xattr inode references
  ext4: eliminate xattr entry e_hash recalculation for removes
  ext4: reserve space for xattr entries/names
  quota: add get_inode_usage callback to transfer multi-inode charges
  ext4: xattr inode deduplication
  ...
2017-07-09 09:31:22 -07:00
Linus Torvalds
58f587cb0b Add support for 128-bit AES and some cleanups to fscrypt
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAllhktgACgkQ8vlZVpUN
 gaOQIQf+KM2s46sxxEl0/hjdBXR4OxTmSS2/0900NPyg7JHKlL8PdYslOyvMiKjo
 wEi+YPwwQgbHtxhI1VINfV/q12MZHwvmFOfD9NzjrISwfmfsKj0dBgZDAfBH82sK
 12wKgUxA8xJ4P+Xdvnz2PokRcFCsh1YUr5IUQkP3JR2RZOxNFUj42QwPJ2yWzqxO
 MsnepMjIHsxvXZi0E7sPjRaoFsh3DDeLmNl8sX6INodC7hxJ1LotYKqJhA4stQpB
 ezXY2tabwg3gaOWvWH7THyHhGntbZVDga3iRrKdNLahXN8OBdHktmG75ubiN6tEg
 x80pqQLgr41yIQuJVOuyeh5jLYZrww==
 =i4r9
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt

Pull fscrypt updates from Ted Ts'o:
 "Add support for 128-bit AES and some cleanups to fscrypt"

* tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
  fscrypt: make ->dummy_context() return bool
  fscrypt: add support for AES-128-CBC
  fscrypt: inline fscrypt_free_filename()
2017-07-09 09:03:31 -07:00
Tommy Nguyen
799ce1dbb9 befs: add kernel-doc formatting for befs_bt_read_super()
fs/befs/TODO mentions some comments needing conversion to Kernel-Doc
formatting. This patch changes the comment describing befs_bt_read_super().

Signed-off-by: Tommy Nguyen <remyabel@gmail.com>
Signed-off-by: Luis de Bethencourt <luisbg@kernel.org>
2017-07-09 10:42:50 +01:00
Chao Yu
0abd675e97 f2fs: support plain user/group quota
This patch adds to support plain user/group quota.

Change Note by Jaegeuk Kim.

- Use f2fs page cache for quota files in order to consider garbage collection.
  so, quota files are not tolerable for sudden power-cuts, so user needs to do
  quotacheck.

- setattr() calls dquot_transfer which will transfer inode->i_blocks.
  We can't reclaim that during f2fs_evict_inode(). So, we need to count
  node blocks as well in order to match i_blocks with dquot's space.

  Note that, Chao wrote a patch to count inode->i_blocks without inode block.
  (f2fs: don't count inode block in in-memory inode.i_blocks)

- in f2fs_remount, we need to make RW in prior to dquot_resume.

- handle fault_injection case during f2fs_quota_off_umount

- TODO: Project quota

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-08 23:12:27 -07:00
Steve French
eef914a9eb [SMB3] Improve security, move default dialect to SMB3 from old CIFS
Due to recent publicity about security vulnerabilities in the
much older CIFS dialect, move the default dialect to the
widely accepted (and quite secure) SMB3.0 dialect from the
old default of the CIFS dialect.

We do not want to be encouraging use of less secure dialects,
and both Microsoft and CERT now strongly recommend not using the
older CIFS dialect (SMB Security Best Practices
"recommends disabling SMBv1").

SMB3 is both secure and widely available: in Windows 8 and later,
Samba and Macs.

Users can still choose to explicitly mount with the less secure
dialect (for old servers) by choosing "vers=1.0" on the cifs
mount

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-07-08 18:59:42 -05:00
Steve French
2a38e12053 [SMB3] Remove ifdef since SMB3 (and later) now STRONGLY preferred
Remove the CONFIG_CIFS_SMB2 ifdef and Kconfig option since they
must always be on now.

For various security reasons, SMB3 and later are STRONGLY preferred
over CIFS and older dialects, and SMB3 (and later) will now be
the default dialects so we do not want to allow them to be
ifdeffed out.

In the longer term, we may be able to make older CIFS support
disableable in Kconfig with a new set of #ifdef, but we always
want SMB3 and later support enabled.

Signed-off-by: Steven French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-07-08 18:57:07 -05:00
Pavel Shilovsky
511c54a2f6 CIFS: Reconnect expired SMB sessions
According to the MS-SMB2 spec (3.2.5.1.6) once the client receives
STATUS_NETWORK_SESSION_EXPIRED error code from a server it should
reconnect the current SMB session. Currently the client doesn't do
that. This can result in subsequent client requests failing by
the server. The patch adds an additional logic to the demultiplex
thread to identify expired sessions and reconnect them.

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-07-08 17:23:10 -05:00
Pavel Shilovsky
4395d484b9 CIFS: Display SMB2 error codes in the hex format
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-07-08 17:23:10 -05:00
Shirish Pargaonkar
366ed846df cifs: Use smb 2 - 3 and cifsacl mount options setacl function
Added set acl function. Very similar to set cifs acl function for smb1.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-07-08 17:23:10 -05:00
Shirish Pargaonkar
dac953401c cifs: prototype declaration and definition to set acl for smb 2 - 3 and cifsacl mount options
Modified current set info function to accommodate multiple info types and
additional information.

Added cifs acl specific function to invoke set info functionality.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-07-08 15:08:38 -05:00
Linus Torvalds
b8d4c1f9f4 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc filesystem updates from Al Viro:
 "Assorted normal VFS / filesystems stuff..."

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  dentry name snapshots
  Make statfs properly return read-only state after emergency remount
  fs/dcache: init in_lookup_hashtable
  minix: Deinline get_block, save 2691 bytes
  fs: Reorder inode_owner_or_capable() to avoid needless
  fs: warn in case userspace lied about modprobe return
2017-07-08 10:50:54 -07:00
Linus Torvalds
46ace66b3b Merge branch 'work.__copy_in_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull __copy_in_user removal from Al Viro:
 "There used to be 6 places in the entire tree calling __copy_in_user(),
  all of them bogus.

  Four got killed off in work.drm branch, this takes care of the
  remaining ones and kills the definition of that sucker"

* 'work.__copy_in_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kill __copy_in_user()
  sanitize do_i2c_smbus_ioctl()
2017-07-08 10:15:02 -07:00
Linus Torvalds
cee37d83e6 Merge branch 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull read/write fix from Al Viro:
 "file_start_write()/file_end_write() got mixed into vfs_iter_write() by
  accident; that's a deadlock for all existing callers - they already do
  that, some - quite a bit outside.

  Easily fixed, fortunately"

* 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  move file_{start,end}_write() out of do_iter_write()
2017-07-07 21:48:15 -07:00
Kees Cook
da029c11e6 exec: Limit arg stack to at most 75% of _STK_LIM
To avoid pathological stack usage or the need to special-case setuid
execs, just limit all arg stack usage to at most 75% of _STK_LIM (6MB).

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-07 20:05:08 -07:00
Linus Torvalds
088737f44b Writeback error handling fixes (pile #2)
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZXhmCAAoJEAAOaEEZVoIVpRkP/1qlYn3pq6d5Kuz84pejOmlL
 5jbkS/cOmeTxeUU4+B1xG8Lx7bAk8PfSXQOADbSJGiZd0ug95tJxplFYIGJzR/tG
 aNMHeu/BVKKhUKORGuKR9rJKtwC839L/qao+yPBo5U3mU4L73rFWX8fxFuhSJ8HR
 hvkgBu3Hx6GY59CzxJ8iJzj+B+uPSFrNweAk0+0UeWkBgTzEdiGqaXBX4cHIkq/5
 hMoCG+xnmwHKbCBsQ5js+YJT+HedZ4lvfjOqGxgElUyjJ7Bkt/IFYOp8TUiu193T
 tA4UinDjN8A7FImmIBIftrECmrAC9HIGhGZroYkMKbb8ReDR2ikE5FhKEpuAGU3a
 BXBgX2mPQuArvZWM7qeJCkxV9QJ0u/8Ykbyzo30iPrICyrzbEvIubeB/mDA034+Z
 Z0/z8C3v7826F3zP/NyaQEojUgRq30McMOIS8GMnx15HJwRsRKlzjfy9Wm4tWhl0
 t3nH1jMqAZ7068s6rfh/oCwdgGOwr5o4hW/bnlITzxbjWQUOnZIe7KBxIezZJ2rv
 OcIwd5qE8PNtpagGj5oUbnjGOTkERAgsMfvPk5tjUNt28/qUlVs2V0aeo47dlcsh
 oYr8WMOIzw98Rl7Bo70mplLrqLD6nGl0LfXOyUlT4STgLWW4ksmLVuJjWIUxcO/0
 yKWjj9wfYRQ0vSUqhsI5
 =3Z93
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull Writeback error handling updates from Jeff Layton:
 "This pile represents the bulk of the writeback error handling fixes
  that I have for this cycle. Some of the earlier patches in this pile
  may look trivial but they are prerequisites for later patches in the
  series.

  The aim of this set is to improve how we track and report writeback
  errors to userland. Most applications that care about data integrity
  will periodically call fsync/fdatasync/msync to ensure that their
  writes have made it to the backing store.

  For a very long time, we have tracked writeback errors using two flags
  in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
  writeback error occurs (via mapping_set_error) and are cleared as a
  side-effect of filemap_check_errors (as you noted yesterday). This
  model really sucks for userland.

  Only the first task to call fsync (or msync or fdatasync) will see the
  error. Any subsequent task calling fsync on a file will get back 0
  (unless another writeback error occurs in the interim). If I have
  several tasks writing to a file and calling fsync to ensure that their
  writes got stored, then I need to have them coordinate with one
  another. That's difficult enough, but in a world of containerized
  setups that coordination may even not be possible.

  But wait...it gets worse!

  The calls to filemap_check_errors can be buried pretty far down in the
  call stack, and there are internal callers of filemap_write_and_wait
  and the like that also end up clearing those errors. Many of those
  callers ignore the error return from that function or return it to
  userland at nonsensical times (e.g. truncate() or stat()). If I get
  back -EIO on a truncate, there is no reason to think that it was
  because some previous writeback failed, and a subsequent fsync() will
  (incorrectly) return 0.

  This pile aims to do three things:

   1) ensure that when a writeback error occurs that that error will be
      reported to userland on a subsequent fsync/fdatasync/msync call,
      regardless of what internal callers are doing

   2) report writeback errors on all file descriptions that were open at
      the time that the error occurred. This is a user-visible change,
      but I think most applications are written to assume this behavior
      anyway. Those that aren't are unlikely to be hurt by it.

   3) document what filesystems should do when there is a writeback
      error. Today, there is very little consistency between them, and a
      lot of cargo-cult copying. We need to make it very clear what
      filesystems should do in this situation.

  To achieve this, the set adds a new data type (errseq_t) and then
  builds new writeback error tracking infrastructure around that. Once
  all of that is in place, we change the filesystems to use the new
  infrastructure for reporting wb errors to userland.

  Note that this is just the initial foray into cleaning up this mess.
  There is a lot of work remaining here:

   1) convert the rest of the filesystems in a similar fashion. Once the
      initial set is in, then I think most other fs' will be fairly
      simple to convert. Hopefully most of those can in via individual
      filesystem trees.

   2) convert internal waiters on writeback to use errseq_t for
      detecting errors instead of relying on the AS_* flags. I have some
      draft patches for this for ext4, but they are not quite ready for
      prime time yet.

  This was a discussion topic this year at LSF/MM too. If you're
  interested in the gory details, LWN has some good articles about this:

      https://lwn.net/Articles/718734/
      https://lwn.net/Articles/724307/"

* tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  btrfs: minimal conversion to errseq_t writeback error reporting on fsync
  xfs: minimal conversion to errseq_t writeback error reporting
  ext4: use errseq_t based error handling for reporting data writeback errors
  fs: convert __generic_file_fsync to use errseq_t based reporting
  block: convert to errseq_t based writeback error tracking
  dax: set errors in mapping when writeback fails
  Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
  mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
  fs: new infrastructure for writeback error handling and reporting
  lib: add errseq_t type and infrastructure for handling it
  mm: don't TestClearPageError in __filemap_fdatawait_range
  mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
  jbd2: don't clear and reset errors after waiting on writeback
  buffer: set errors in mapping at the time that the error occurs
  fs: check for writeback errors after syncing out buffers in generic_file_fsync
  buffer: use mapping_set_error instead of setting the flag
  mm: fix mapping_set_error call in me_pagecache_dirty
2017-07-07 19:38:17 -07:00
Darrick J. Wong
cd87d86792 xfs: don't crash on unexpected holes in dir/attr btrees
In quite a few places we call xfs_da_read_buf with a mappedbno that we
don't control, then assume that the function passes back either an error
code or a buffer pointer.  Unfortunately, if mappedbno == -2 and bno
maps to a hole, we get a return code of zero and a NULL buffer, which
means that we crash if we actually try to use that buffer pointer.  This
happens immediately when we set the buffer type for transaction context.

Therefore, check that we have no error code and a non-NULL bp before
trying to use bp.  This patch is a follow-up to an incomplete fix in
96a3aefb8f ("xfs: don't crash if reading a directory results in an
unexpected hole").

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-07 18:55:17 -07:00
Linus Torvalds
33198c165b Writeback error handling fixes (pile #1)
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZXXLdAAoJEAAOaEEZVoIVtBIP/2BMtyDB5IVaxUuYc9LiFxCZ
 Y6W4aYEBgPhrct6epV3pnV+SXuzov9F5QZWe1P+lB3e30JHvPhO52OUIT7gSbFbv
 kKCh+p7Q1vLqaKxONPQpJI5LjlB6e6GIekrI4woA2RWVw+6cUyP0oQTVhsSsgnj/
 /GMo2pAqlhR3vnn9cWG93vl+xnrtmckpwFe0g5Jhdp/cVQBrqwxG+1W9rEsJf0nx
 RN29E7+CyxI3x2KkVdmgsMQkpkM2ooopn//1QDmS3M2sbCrJrLSTRG8LBEcs8fi8
 pQZcgW6uHXDH2I0hews1vhJRA38TeXoQfj9OZoFGQcVpbP3ZnjASKioRoQiSsHyQ
 QRDxUw6C45tjWT0HZ1GaCDMuTMs0z2/zF/E7TaOX6zB2LS/NuIluoVAMkYVyXY3a
 L39flIddnDaga1ojL+tQK5hhSl9C66++/FsFa2FZ0hLkeXA5WDLhRy0ODW3NaYg8
 89pPJDfiocEiI7ULht2Bkk88zFe+K07bQRQ5eoFtSOAxOnWGJCbxn8G8dFZZDHnO
 XZe3gscbR3DCMJ+agb4V/YOyqCHAJMA/lcnP9v7P+QnrEXSV5yrblk1Gx442xMhv
 tANcCUI3nb/b2Ma3DW3iZS/iYmhmy/baBSV3n65K9NqtkkIbnqSXxk+5RJd5eKsS
 8Y5nyu+6mlcOOxBMkmRo
 =jRrj
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull Writeback error handling fixes from Jeff Layton:
 "The main rationale for all of these changes is to tighten up writeback
  error reporting to userland. There are many ways now that writeback
  errors can be lost, such that fsync/fdatasync/msync return 0 when
  writeback actually failed.

  This pile contains a small set of cleanups and writeback error
  handling fixes that I was able to break off from the main pile (#2).

  Two of the patches in this pile are trivial. The exceptions are the
  patch to fix up error handling in write_one_page, and the patch to
  make JFS pay attention to write_one_page errors"

* tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fs: remove call_fsync helper function
  mm: clean up error handling in write_one_page
  JFS: do not ignore return code from write_one_page()
  mm: drop "wait" parameter from write_one_page()
2017-07-07 18:39:15 -07:00
Linus Torvalds
3ea4fcc5fe first set of cifs fixes for 4.13, bug fixes and improved POSIX character mapping
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQGcBAABAgAGBQJZYBCAAAoJEIosvXAHck9R2tMMAKSzD0O4cQn+TVEDpw9ZkwSc
 g7xKjG57q/wc0ZuOk75FR9CPiWBgyno/3W7hsTfm/PhyqPHw2dOaCkgcUbahNamH
 RlxY3TzU8yAUucxDdFcxL47whdj1NJwDynyjtHshEL3eKx2bd5STG8cw4UQrSLOU
 CrG0XlqthPrgNEJavYP6bvrdfiuYeAnYykDdHwJiFLGVGgBhzqRVgjrTJkFClIlf
 qQEal4+H2Hz5rM89I0xQLZINh4UflPtX8lIngohyHnpT5sFxal4k2BnibjFzG8Vg
 r/zc5ePjEBvkIlpF5pAS2P6LO9295qOAltK5Wiq+7DYVS/KUDguQDMaXCTWb/zI6
 rAi7TFCNlDOOmySj0vsU2AqubI9ZZ2iCeA4eaPRrSVda3ZAG2hNYm+M1P91trS6d
 fqNJTsaPMYF4Je3lC7fe4PQIzBcq+H9gQk4gC0mAgyJ1rQmjSlm+yBop4ZdzepIA
 g+q7HgOA8QL/+a7D/1AfPTp9dcT5z0V5vz60+KKWXQ==
 =XSvA
 -----END PGP SIGNATURE-----

Merge tag 'cifs-bug-fixes-for-4.13' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "First set of CIFS/SMB3 fixes for the merge window. Also improves POSIX
  character mapping for SMB3"

* tag 'cifs-bug-fixes-for-4.13' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: fix circular locking dependency
  cifs: set oparms.create_options rather than or'ing in CREATE_OPEN_BACKUP_INTENT
  cifs: Do not modify mid entry after submitting I/O in cifs_call_async
  CIFS: add SFM mapping for 0x01-0x1F
  cifs: hide unused functions
  cifs: Use smb 2 - 3 and cifsacl mount options getacl functions
  cifs: prototype declaration and definition for smb 2 - 3 and cifsacl mount options
  CIFS: add CONFIG_CIFS_DEBUG_KEYS to dump encryption keys
  cifs: set mapping error when page writeback fails in writepage or launder_pages
  SMB3: Enable encryption for SMB3.1.1
2017-07-07 17:44:36 -07:00
Linus Torvalds
1a86fc754f This is a high-priority addendum patch for a regression
introduced by commit 88ffbf3e03. Some code was reverted that
 should not have been. This patch from Andreas Gruenbacher adds
 it back in.
 
 gfs2: Fix glock rhashtable rcu bug
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJZX9rXAAoJENeLYdPf93o7aoIIAIAGxnkPj+uHoaHkOKWWJ28g
 mjaRhPdMxFEdK7U06jJdqI5+nPZjZ9/W4J894ERBkAa685LjOGagkd4oIBHBJ99M
 kD1xIrkNIZC5ThuGbBjnYVq57EoFpsgLgvsLwZ+nYRYCFcphT+IO8hj/fLylayWq
 4LSNGMD8TrbTInVhtEGgFV1M8XI6RY2rQJm0wsYcdcAU+VEVtz25RGR0DMnmnh3m
 t4G1tmUByWfXEr4DCuEqYp9qbJb4sPG/JSoQQCiU/KrWYsm/6nJhXN2rlKNieimL
 /qasyP4LmFEAm5QuTdB0DNSkNTO7ZXm3C1dFwA5i+iNwY7efAeOVIRMSd1Yy4dE=
 =vcj3
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-4.13.fixes.addendum' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull GFS2 fix from Bob Peterson:
 "Sorry for the additional merge request, but Andreas discovered this
  problem soon after you processed our last gfs2 merge.

  This fixes a regression introduced by a patch we did in mid-2015
  (commit 88ffbf3e03: "GFS2: Use resizable hash table for glocks"), so
  best to get it fixed. Some code was reverted that should not have
  been.

  The patch from Andreas Gruenbacher just re-adds code that had been
  there originally"

* tag 'gfs2-4.13.fixes.addendum' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix glock rhashtable rcu bug
2017-07-07 17:38:17 -07:00
Al Viro
49d31c2f38 dentry name snapshots
take_dentry_name_snapshot() takes a safe snapshot of dentry name;
if the name is a short one, it gets copied into caller-supplied
structure, otherwise an extra reference to external name is grabbed
(those are never modified).  In either case the pointer to stable
string is stored into the same structure.

dentry must be held by the caller of take_dentry_name_snapshot(),
but may be freely dropped afterwards - the snapshot will stay
until destroyed by release_dentry_name_snapshot().

Intended use:
	struct name_snapshot s;

	take_dentry_name_snapshot(&s, dentry);
	...
	access s.name
	...
	release_dentry_name_snapshot(&s);

Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
to pass down with event.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-07 20:09:10 -04:00
Linus Torvalds
b59eea554f vfs: fix flock compat thinko
Michael Ellerman reported that commit 8c6657cb50 ("Switch flock
copyin/copyout primitives to copy_{from,to}_user()") broke his
networking on a bunch of PPC machines (64-bit kernel, 32-bit userspace).

The reason is a brown-paper bug by that commit, which had the arguments
to "copy_flock_fields()" in the wrong order, breaking the compat
handling for file locking.  Apparently very few people run 32-bit user
space on x86 any more, so the PPC people got the honor of noticing this
"feature".

Michael also sent a minimal diff that just changed the order of the
arguments in that macro.

This is not that minimal diff.

This not only changes the order of the arguments in the macro, it also
changes them to be pointers (to be consistent with all the other uses of
those pointers), and makes the functions that do all of this also have
the proper "const" attribution on the source pointers in order to make
issues like that (using the source as a destination) be really obvious.

Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-07 13:48:18 -07:00
Andreas Gruenbacher
961ae1d83d gfs2: Fix glock rhashtable rcu bug
Before commit 88ffbf3e03 "GFS2: Use resizable hash table for glocks",
glocks were freed via call_rcu to allow reading the glock hashtable
locklessly using rcu.  This was then changed to free glocks immediately,
which made reading the glock hashtable unsafe.  Bring back the original
code for freeing glocks via call_rcu.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Cc: stable@vger.kernel.org # 4.3+
2017-07-07 13:22:05 -05:00
Jaegeuk Kim
d29460e5cf f2fs: avoid deadlock caused by lock order of page and lock_op
- punch_hole
 - fill_zero
  - f2fs_lock_op
  - get_new_data_page
   - lock_page

- f2fs_write_data_pages
 - lock_page
 - do_write_data_page
  - f2fs_lock_op

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 11:07:59 -07:00
Chao Yu
d1aa245354 f2fs: use spin_{,un}lock_irq{save,restore}
generic/361 reports below warning, this is because: once, there is
someone entering into critical region of sbi.cp_lock, if write_end_io.
f2fs_stop_checkpoint is invoked from an triggered IRQ, we will encounter
deadlock.

So this patch changes to use spin_{,un}lock_irq{save,restore} to create
critical region without IRQ enabled to avoid potential deadlock.

 irq event stamp: 83391573
 loop: Write error at byte offset 438729728, length 1024.
 hardirqs last  enabled at (83391573): [<c1809752>] restore_all+0xf/0x65
 hardirqs last disabled at (83391572): [<c1809eac>] reschedule_interrupt+0x30/0x3c
 loop: Write error at byte offset 438860288, length 1536.
 softirqs last  enabled at (83389244): [<c180cc4e>] __do_softirq+0x1ae/0x476
 softirqs last disabled at (83389237): [<c101ca7c>] do_softirq_own_stack+0x2c/0x40
 loop: Write error at byte offset 438990848, length 2048.
 ================================
 WARNING: inconsistent lock state
 4.12.0-rc2+ #30 Tainted: G           O
 --------------------------------
 inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
 xfs_io/7959 [HC1[1]:SC0[0]:HE0:SE1] takes:
  (&(&sbi->cp_lock)->rlock){?.+...}, at: [<f96f96cc>] f2fs_stop_checkpoint+0x1c/0x50 [f2fs]
 {HARDIRQ-ON-W} state was registered at:
   __lock_acquire+0x527/0x7b0
   lock_acquire+0xae/0x220
   _raw_spin_lock+0x42/0x50
   do_checkpoint+0x165/0x9e0 [f2fs]
   write_checkpoint+0x33f/0x740 [f2fs]
   __f2fs_sync_fs+0x92/0x1f0 [f2fs]
   f2fs_sync_fs+0x12/0x20 [f2fs]
   sync_filesystem+0x67/0x80
   generic_shutdown_super+0x27/0x100
   kill_block_super+0x22/0x50
   kill_f2fs_super+0x3a/0x40 [f2fs]
   deactivate_locked_super+0x3d/0x70
   deactivate_super+0x40/0x60
   cleanup_mnt+0x39/0x70
   __cleanup_mnt+0x10/0x20
   task_work_run+0x69/0x80
   exit_to_usermode_loop+0x57/0x85
   do_fast_syscall_32+0x18c/0x1b0
   entry_SYSENTER_32+0x4c/0x7b
 irq event stamp: 1957420
 hardirqs last  enabled at (1957419): [<c1808f37>] _raw_spin_unlock_irq+0x27/0x50
 hardirqs last disabled at (1957420): [<c1809f9c>] call_function_single_interrupt+0x30/0x3c
 softirqs last  enabled at (1953784): [<c180cc4e>] __do_softirq+0x1ae/0x476
 softirqs last disabled at (1953773): [<c101ca7c>] do_softirq_own_stack+0x2c/0x40

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&(&sbi->cp_lock)->rlock);
   <Interrupt>
     lock(&(&sbi->cp_lock)->rlock);

  *** DEADLOCK ***

 2 locks held by xfs_io/7959:
  #0:  (sb_writers#13){.+.+.+}, at: [<c11fd7ca>] vfs_write+0x16a/0x190
  #1:  (&sb->s_type->i_mutex_key#16){+.+.+.}, at: [<f96e33f5>] f2fs_file_write_iter+0x25/0x140 [f2fs]

 stack backtrace:
 CPU: 2 PID: 7959 Comm: xfs_io Tainted: G           O    4.12.0-rc2+ #30
 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
 Call Trace:
  dump_stack+0x5f/0x92
  print_usage_bug+0x1d3/0x1dd
  ? check_usage_backwards+0xe0/0xe0
  mark_lock+0x23d/0x280
  __lock_acquire+0x699/0x7b0
  ? __this_cpu_preempt_check+0xf/0x20
  ? trace_hardirqs_off_caller+0x91/0xe0
  lock_acquire+0xae/0x220
  ? f2fs_stop_checkpoint+0x1c/0x50 [f2fs]
  _raw_spin_lock+0x42/0x50
  ? f2fs_stop_checkpoint+0x1c/0x50 [f2fs]
  f2fs_stop_checkpoint+0x1c/0x50 [f2fs]
  f2fs_write_end_io+0x147/0x150 [f2fs]
  bio_endio+0x7a/0x1e0
  blk_update_request+0xad/0x410
  blk_mq_end_request+0x16/0x60
  lo_complete_rq+0x3c/0x70
  __blk_mq_complete_request_remote+0x11/0x20
  flush_smp_call_function_queue+0x6d/0x120
  ? debug_smp_processor_id+0x12/0x20
  generic_smp_call_function_single_interrupt+0x12/0x30
  smp_call_function_single_interrupt+0x25/0x40
  call_function_single_interrupt+0x37/0x3c
 EIP: _raw_spin_unlock_irq+0x2d/0x50
 EFLAGS: 00000296 CPU: 2
 EAX: 00000001 EBX: d2ccc51c ECX: 00000001 EDX: c1aacebd
 ESI: 00000000 EDI: 00000000 EBP: c96c9d1c ESP: c96c9d18
  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
  ? inherit_task_group.isra.98.part.99+0x6b/0xb0
  __add_to_page_cache_locked+0x1d4/0x290
  add_to_page_cache_lru+0x38/0xb0
  pagecache_get_page+0x8e/0x200
  f2fs_write_begin+0x96/0xf00 [f2fs]
  ? trace_hardirqs_on_caller+0xdd/0x1c0
  ? current_time+0x17/0x50
  ? trace_hardirqs_on+0xb/0x10
  generic_perform_write+0xa9/0x170
  __generic_file_write_iter+0x1a2/0x1f0
  ? f2fs_preallocate_blocks+0x137/0x160 [f2fs]
  f2fs_file_write_iter+0x6e/0x140 [f2fs]
  ? __lock_acquire+0x429/0x7b0
  __vfs_write+0xc1/0x140
  vfs_write+0x9b/0x190
  SyS_pwrite64+0x63/0xa0
  do_fast_syscall_32+0xa1/0x1b0
  entry_SYSENTER_32+0x4c/0x7b
 EIP: 0xb7786c61
 EFLAGS: 00000293 CPU: 2
 EAX: ffffffda EBX: 00000003 ECX: 08416000 EDX: 00001000
 ESI: 18b24000 EDI: 00000000 EBP: 00000003 ESP: bf9b36b0
  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b

Fixes: aaec2b1d18 ("f2fs: introduce cp_lock to protect updating of ckpt_flags")
Cc: stable@vger.kernel.org
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:49 -07:00
Jaegeuk Kim
ff1048e7df f2fs: relax migratepage for atomic written page
In order to avoid lock contention for atomic written pages, we'd better give
EBUSY in f2fs_migrate_page when mode is asynchronous. We expect it will be
released soon as transaction commits.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:48 -07:00
Chao Yu
000519f278 f2fs: don't count inode block in in-memory inode.i_blocks
Previously, we count all inode consumed blocks including inode block,
xattr block, index block, data block into i_blocks, for other generic
filesystems, they won't count inode block into i_blocks, so for
userspace applications or quota system, they may detect incorrect block
count according to i_blocks value in inode.

This patch changes to count all blocks into inode.i_blocks excluding
inode block, for on-disk i_blocks, we keep counting inode block for
backward compatibility.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:47 -07:00
Chao Yu
6ac851ba89 Revert "f2fs: fix to clean previous mount option when remount_fs"
Don't clear old mount option before parse new option during ->remount_fs
like other generic filesystems.

This reverts commit 26666c8a43.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:46 -07:00
Sheng Yong
b855bf0e16 f2fs: do not set LOST_PINO for renamed dir
After renaming a directory, fsck could detect unmatched pino. The scenario
can be reproduced as the following:

	$ mkdir /bar/subbar /foo
	$ rename /bar/subbar /foo

Then fsck will report:
[ASSERT] (__chk_dots_dentries:1182)  --> Bad inode number[0x3] for '..', parent parent ino is [0x4]

Rename sets LOST_PINO for old_inode. However, the flag cannot be cleared,
since dir is written back with CP. So, let's get rid of LOST_PINO for a
renamed dir and fix the pino directly at the end of rename.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:45 -07:00
Sheng Yong
d58dfb7505 f2fs: do not set LOST_PINO for newly created dir
Since directories will be written back with checkpoint and fsync a
directory will always write CP, there is no need to set LOST_PINO
after creating a directory.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:45 -07:00
Chao Yu
0771fcc71c f2fs: skip ->writepages for {mete,node}_inode during recovery
Skip ->writepages in prior to ->writepage for {meta,node}_inode during
recovery, hence unneeded loop in ->writepages can be avoided.

Moreover, check SBI_POR_DOING earlier while writebacking pages.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:44 -07:00
Chao Yu
6915ea9d8d f2fs: introduce __check_sit_bitmap
After we introduce discard thread, discard command can be issued
concurrently with data allocating, this patch adds new function to
heck sit bitmap to ensure that userdata was invalid in which on-going
discard command covered.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:43 -07:00
Chao Yu
cce1325247 f2fs: stop gc/discard thread in prior during umount
This patch resolves kernel panic for xfstests/081, caused by recent f2fs_bug_on

 f2fs: add f2fs_bug_on in __remove_discard_cmd

For fixing, we will stop gc/discard thread in prior in ->kill_sb in order to
avoid referring and releasing race among them.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:42 -07:00
Chao Yu
daeb433e42 f2fs: introduce reserved_blocks in sysfs
In this patch, we add a new sysfs interface, with it, we can control
number of reserved blocks in system which could not be used by user,
it enable f2fs to let user to configure for adjusting over-provision
ratio dynamically instead of changing it by mkfs.

So we can expect it will help to reserve more free space for relieving
GC in both filesystem and flash device.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:41 -07:00
Yunlong Song
d871cd046f f2fs: avoid redundant f2fs_flush after remount
create_flush_cmd_control will create redundant issue_flush_thread after each
remount with flush_merge option.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:41 -07:00
Jaegeuk Kim
0cc091d0c8 f2fs: report # of free inodes more precisely
If the partition is small, we don't need to report total # of inodes including
hidden free nodes.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-07 10:34:40 -07:00
Linus Torvalds
b6ffe9ba46 libnvdimm for 4.13
* Introduce the _flushcache() family of memory copy helpers and use them
   for persistent memory write operations on x86. The _flushcache()
   semantic indicates that the cache is either bypassed for the copy
   operation (movnt) or any lines dirtied by the copy operation are
   written back (clwb, clflushopt, or clflush).
 
 * Extend dax_operations with ->copy_from_iter() and ->flush()
   operations. These operations and other infrastructure updates allow
   all persistent memory specific dax functionality to be pushed into
   libnvdimm and the pmem driver directly. It also allows dax-specific
   sysfs attributes to be linked to a host device, for example:
       /sys/block/pmem0/dax/write_cache
 
 * Add support for the new NVDIMM platform/firmware mechanisms introduced
   in ACPI 6.2 and UEFI 2.7. This support includes the v1.2 namespace
   label format, extensions to the address-range-scrub command set, new
   error injection commands, and a new BTT (block-translation-table)
   layout. These updates support inter-OS and pre-OS compatibility.
 
 * Fix a longstanding memory corruption bug in nfit_test.
 
 * Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2)
   capable.
 
 * Miscellaneous fixes and small updates across libnvdimm and the nfit
   driver.
 
 Acknowledgements that came after the branch was pushed:
 
 commit 6aa734a2f3 "libnvdimm, region, pmem: fix 'badblocks'
   sysfs_get_dirent() reference lifetime"
 Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZXsUtAAoJEB7SkWpmfYgCOXcP/06bncqTEvtgrOF2b7O8w+8e
 mTySD51RUn6UpkFd37SMRch+rmbojuqj465TAE7XIXgyLgIOJixKaTlHYUoEnP3X
 rC4Q/g5mN0nittMDwL+vQaa1lQWd2kbjOlrqCgnLHVEEJpHmiQussunjvir4G1U7
 5ROooP8W+qMK5y5XPLJAg/gyGhYkjpRSlDg3Eo5meZZ0IdURbI7+WCLKrPcQUERT
 WmDc9gLhJdSQVxBV/0m2gdAER4ADmFjcrlm8kjXRBhdlUmEFjM0zpvlHJutHTkks
 rNZWCmCJs0Sas+DmRKszFmvVFHRHqUVA3dWK4P6PJEX+tl7BwlPcxpbfacHTG2EZ
 btArFc584DZ+EIrim1cXXRvLFlxnKOFBtBeteFs7l2kZjEcN6S4I5OZgTyeDpe/i
 2WDpHWLQWibkcIzH9y1EuMBkYnQjTJl1pecHzJoTaC+jAQ+opLiY7EecjLmCmQS6
 MBYUeQZNufLGfT5b8KXfpKeiXhpFkYrAGp+ErfoH/6RKy2zqTdagN1yVhos2y+a7
 JJu/Weetpn8qv+KTGUShO8TGyWv3wU46YkG2rKWl0FL1+C+6LMMw1/L0A97lwVlg
 BpypVVyaNu1D22ifZ8O5wbqPIYghoZ5akA0CiduhX19cpl5rTeTd8EvLjvcYhZEZ
 pMHuMAqIcIyLhIe2/sRF
 =xKQB
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "libnvdimm updates for the latest ACPI and UEFI specifications. This
  pull request also includes new 'struct dax_operations' enabling to
  undo the abuse of copy_user_nocache() for copy operations to pmem.

  The dax work originally missed 4.12 to address concerns raised by Al.

  Summary:

   - Introduce the _flushcache() family of memory copy helpers and use
     them for persistent memory write operations on x86. The
     _flushcache() semantic indicates that the cache is either bypassed
     for the copy operation (movnt) or any lines dirtied by the copy
     operation are written back (clwb, clflushopt, or clflush).

   - Extend dax_operations with ->copy_from_iter() and ->flush()
     operations. These operations and other infrastructure updates allow
     all persistent memory specific dax functionality to be pushed into
     libnvdimm and the pmem driver directly. It also allows dax-specific
     sysfs attributes to be linked to a host device, for example:
     /sys/block/pmem0/dax/write_cache

   - Add support for the new NVDIMM platform/firmware mechanisms
     introduced in ACPI 6.2 and UEFI 2.7. This support includes the v1.2
     namespace label format, extensions to the address-range-scrub
     command set, new error injection commands, and a new BTT
     (block-translation-table) layout. These updates support inter-OS
     and pre-OS compatibility.

   - Fix a longstanding memory corruption bug in nfit_test.

   - Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2)
     capable.

   - Miscellaneous fixes and small updates across libnvdimm and the nfit
     driver.

  Acknowledgements that came after the branch was pushed: commit
  6aa734a2f3 ("libnvdimm, region, pmem: fix 'badblocks'
  sysfs_get_dirent() reference lifetime") was reviewed by Toshi Kani
  <toshi.kani@hpe.com>"

* tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (42 commits)
  libnvdimm, namespace: record 'lbasize' for pmem namespaces
  acpi/nfit: Issue Start ARS to retrieve existing records
  libnvdimm: New ACPI 6.2 DSM functions
  acpi, nfit: Show bus_dsm_mask in sysfs
  libnvdimm, acpi, nfit: Add bus level dsm mask for pass thru.
  acpi, nfit: Enable DSM pass thru for root functions.
  libnvdimm: passthru functions clear to send
  libnvdimm, btt: convert some info messages to warn/err
  libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime
  libnvdimm: fix the clear-error check in nsio_rw_bytes
  libnvdimm, btt: fix btt_rw_page not returning errors
  acpi, nfit: quiet invalid block-aperture-region warnings
  libnvdimm, btt: BTT updates for UEFI 2.7 format
  acpi, nfit: constify *_attribute_group
  libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region
  libnvdimm, pmem, dax: export a cache control attribute
  dax: convert to bitmask for flags
  dax: remove default copy_from_iter fallback
  libnvdimm, nfit: enable support for volatile ranges
  libnvdimm, pmem: fix persistence warning
  ...
2017-07-07 09:44:06 -07:00
Darrick J. Wong
6eb0b8df9f xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN
XFS has a maximum symlink target length of 1024 bytes; this is a
holdover from the Irix days.  Unfortunately, the constant establishing
this is 'MAXPATHLEN' and is /not/ the same as the Linux MAXPATHLEN,
which is 4096.

The kernel enforces its 1024 byte MAXPATHLEN on symlink targets, but
xfsprogs picks up the (Linux) system 4096 byte MAXPATHLEN, which means
that xfs_repair doesn't complain about oversized symlinks.

Since this is an on-disk format constraint, put the define in the XFS
namespace and move everything over to use the new name.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-07-07 08:37:26 -07:00
Yan, Zheng
481f001ffa ceph: update ceph_dentry_info::lease_session when necessary
Current code does not update ceph_dentry_info::lease_session once
it is set. If auth mds of corresponding dentry changes, dentry lease
keeps in an invalid state.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:14 +02:00
Yan, Zheng
1d8f83604c ceph: new mount option that specifies fscache uniquifier
Current ceph uses FSID as primary index key of fscache data. This
allows ceph to retain cached data across remount. But this causes
problem (kernel opps, fscache does not support sharing data) when
a filesystem get mounted several times (with fscache enabled, with
different mount options).

The fix is adding a new mount option, which specifies uniquifier
for fscache.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:14 +02:00
Yan, Zheng
4b9f2042fd ceph: avoid accessing freeing inode in ceph_check_delayed_caps()
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:13 +02:00
Yan, Zheng
62a65f36d0 ceph: avoid invalid memory dereference in the middle of umount
extra_mon_dispatch() and debugfs' foo_show functions dereference
fsc->mdsc. we should clean up fsc->client->extra_mon_dispatch
and debugfs before destroying fsc->mds.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:13 +02:00
Yan, Zheng
1684dd03e9 ceph: getattr before read on ceph.* xattrs
Previously we were returning values for quota, layout
xattrs without any kind of update -- the user just got
whatever happened to be in our cache.

Clearly this extra round trip has a cost, but reads of
these xattrs are fairly rare, happening on admin
intervention rather than in normal operation.

Link: http://tracker.ceph.com/issues/17939
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:13 +02:00
Yan, Zheng
92e57e6287 ceph: don't re-send interrupted flock request
Don't re-send interrupted flock request in cases of mds failover
and receiving request forward. Because corresponding 'lock intr'
request may have been finished, it won't get re-sent.

Link: http://tracker.ceph.com/issues/20170
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:13 +02:00
Yan, Zheng
439868812a ceph: cleanup writepage_nounlock()
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:13 +02:00
Yan, Zheng
fa71fefb30 ceph: redirty page when writepage_nounlock() skips unwritable page
Ceph needs to flush dirty page in the order in which in which snap
context they belong to. Dirty pages belong to older snap context
should be flushed earlier. if writepage_nounlock() can not flush a
page, it should redirty the page.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:13 +02:00
Yan, Zheng
f2b0c45f09 ceph: remove useless page->mapping check in writepage_nounlock()
Callers of writepage_nounlock() have already ensured non-null
page->mapping.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:13 +02:00
Yan, Zheng
efb0ca765a ceph: update the 'approaching max_size' code
The old 'approaching max_size' code expects MDS set max_size to
'2 * reported_size'. This is no longer true. The new code reports
file size when half of previous max_size increment has been used.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:12 +02:00
Yan, Zheng
84eea8c790 ceph: re-request max size after importing caps
The 'wanted max size' could be sent to inode's old auth mds, re-send
it to inode's new auth mds if necessary. Otherwise write syscall may
hang.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-07 17:25:12 +02:00
Linus Torvalds
9f45efb928 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - a few hotfixes

 - various misc updates

 - ocfs2 updates

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (108 commits)
  mm, memory_hotplug: move movable_node to the hotplug proper
  mm, memory_hotplug: drop CONFIG_MOVABLE_NODE
  mm, memory_hotplug: drop artificial restriction on online/offline
  mm: memcontrol: account slab stats per lruvec
  mm: memcontrol: per-lruvec stats infrastructure
  mm: memcontrol: use generic mod_memcg_page_state for kmem pages
  mm: memcontrol: use the node-native slab memory counters
  mm: vmstat: move slab statistics from zone to node counters
  mm/zswap.c: delete an error message for a failed memory allocation in zswap_dstmem_prepare()
  mm/zswap.c: improve a size determination in zswap_frontswap_init()
  mm/zswap.c: delete an error message for a failed memory allocation in zswap_pool_create()
  mm/swapfile.c: sort swap entries before free
  mm/oom_kill: count global and memory cgroup oom kills
  mm: per-cgroup memory reclaim stats
  mm: kmemleak: treat vm_struct as alternative reference to vmalloc'ed objects
  mm: kmemleak: factor object reference updating out of scan_block()
  mm: kmemleak: slightly reduce the size of some structures on 64-bit architectures
  mm, mempolicy: don't check cpuset seqlock where it doesn't matter
  mm, cpuset: always use seqlock when changing task's nodemask
  mm, mempolicy: simplify rebinding mempolicies when updating cpusets
  ...
2017-07-06 22:27:08 -07:00
Linus Torvalds
c856863988 Merge branch 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc compat stuff updates from Al Viro:
 "This part is basically untangling various compat stuff. Compat
  syscalls moved to their native counterparts, getting rid of quite a
  bit of double-copying and/or set_fs() uses. A lot of field-by-field
  copyin/copyout killed off.

   - kernel/compat.c is much closer to containing just the
     copyin/copyout of compat structs. Not all compat syscalls are gone
     from it yet, but it's getting there.

   - ipc/compat_mq.c killed off completely.

   - block/compat_ioctl.c cleaned up; floppy compat ioctls moved to
     drivers/block/floppy.c where they belong. Yes, there are several
     drivers that implement some of the same ioctls. Some are m68k and
     one is 32bit-only pmac. drivers/block/floppy.c is the only one in
     that bunch that can be built on biarch"

* 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  mqueue: move compat syscalls to native ones
  usbdevfs: get rid of field-by-field copyin
  compat_hdio_ioctl: get rid of set_fs()
  take floppy compat ioctls to sodding floppy.c
  ipmi: get rid of field-by-field __get_user()
  ipmi: get COMPAT_IPMICTL_RECEIVE_MSG in sync with the native one
  rt_sigtimedwait(): move compat to native
  select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()
  put_compat_rusage(): switch to copy_to_user()
  sigpending(): move compat to native
  getrlimit()/setrlimit(): move compat to native
  times(2): move compat to native
  compat_{get,put}_bitmap(): use unsafe_{get,put}_user()
  fb_get_fscreeninfo(): don't bother with do_fb_ioctl()
  do_sigaltstack(): lift copying to/from userland into callers
  take compat_sys_old_getrlimit() to native syscall
  trim __ARCH_WANT_SYS_OLD_GETRLIMIT
2017-07-06 20:57:13 -07:00
Roman Gushchin
2262185c5b mm: per-cgroup memory reclaim stats
Track the following reclaim counters for every memory cgroup: PGREFILL,
PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

These values are exposed using the memory.stats interface of cgroup v2.

The meaning of each value is the same as for global counters, available
using /proc/vmstat.

Also, for consistency, rename mem_cgroup_count_vm_event() to
count_memcg_event_mm().

Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:35 -07:00
Punit Agrawal
7868a2087e mm/hugetlb: add size parameter to huge_pte_offset()
A poisoned or migrated hugepage is stored as a swap entry in the page
tables.  On architectures that support hugepages consisting of
contiguous page table entries (such as on arm64) this leads to ambiguity
in determining the page table entry to return in huge_pte_offset() when
a poisoned entry is encountered.

Let's remove the ambiguity by adding a size parameter to convey
additional information about the requested address.  Also fixup the
definition/usage of huge_pte_offset() throughout the tree.

Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.com
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: James Hogan <james.hogan@imgtec.com> (odd fixer:METAG ARCHITECTURE)
Cc: Ralf Baechle <ralf@linux-mips.org> (supporter:MIPS)
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:34 -07:00
Pavel Tatashin
3d375d7859 mm: update callers to use HASH_ZERO flag
Update dcache, inode, pid, mountpoint, and mount hash tables to use
HASH_ZERO, and remove initialization after allocations.  In case of
places where HASH_EARLY was used such as in __pv_init_lock_hash the
zeroed hash table was already assumed, because memblock zeroes the
memory.

CPU: SPARC M6, Memory: 7T
Before fix:
  Dentry cache hash table entries: 1073741824
  Inode-cache hash table entries: 536870912
  Mount-cache hash table entries: 16777216
  Mountpoint-cache hash table entries: 16777216
  ftrace: allocating 20414 entries in 40 pages
  Total time: 11.798s

After fix:
  Dentry cache hash table entries: 1073741824
  Inode-cache hash table entries: 536870912
  Mount-cache hash table entries: 16777216
  Mountpoint-cache hash table entries: 16777216
  ftrace: allocating 20414 entries in 40 pages
  Total time: 3.198s

CPU: Intel Xeon E5-2630, Memory: 2.2T:
Before fix:
  Dentry cache hash table entries: 536870912
  Inode-cache hash table entries: 268435456
  Mount-cache hash table entries: 8388608
  Mountpoint-cache hash table entries: 8388608
  CPU: Physical Processor ID: 0
  Total time: 3.245s

After fix:
  Dentry cache hash table entries: 536870912
  Inode-cache hash table entries: 268435456
  Mount-cache hash table entries: 8388608
  Mountpoint-cache hash table entries: 8388608
  CPU: Physical Processor ID: 0
  Total time: 3.244s

Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:33 -07:00
Mike Rapoport
f93ae36462 fs/userfaultfd.c: drop dead code
Calculation of start end end in __wake_userfault function are not used
and can be removed.

Link: http://lkml.kernel.org/r/1494930917-3134-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:33 -07:00
Michal Hocko
c823bd9244 fs/file.c: replace alloc_fdmem() with kvmalloc() alternative
There is no real reason to duplicate kvmalloc* helpers so drop
alloc_fdmem and replace it with the appropriate library function.

Link: http://lkml.kernel.org/r/20170531155145.17111-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:30 -07:00
Arvind Yadav
b74271e40e ocfs2: constify attribute_group structures
attribute_groups are not supposed to change at runtime.  All functions
working with attribute_groups provided by <linux/sysfs.h> work with
const attribute_group.  So mark the non-const structs as const.

File size before:
   text	   data	    bss	    dec	    hex	filename
   4402	   1088	     38	   5528	   1598	fs/ocfs2/stackglue.o

File size After adding 'const':
   text	   data	    bss	    dec	    hex	filename
   4442	   1024	     38	   5504	   1580	fs/ocfs2/stackglue.o

Link: http://lkml.kernel.org/r/cab4e59b4918db3ed2ec77073a4cb310c4429ef5.1498808026.git.arvind.yadav.cs@gmail.com
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:30 -07:00
piaojun
25b1c72e15 ocfs2: free 'dummy_sc' in sc_fop_release() to prevent memory leak
'sd->dbg_sock' is malloced in sc_common_open(), but not freed at the end
of sc_fop_release().

Link: http://lkml.kernel.org/r/594FB0A4.2050105@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:30 -07:00
Fabian Frederick
62aa81d7c4 ocfs2: use magic.h
Filesystems generally use SUPER_MAGIC values from magic.h instead of a
local definition.

Link: http://lkml.kernel.org/r/20170521154217.27917-1-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:30 -07:00
Gang He
8c4d5a4387 ocfs2: fix a static checker warning
Fix a static code checker warning:

  fs/ocfs2/inode.c:179 ocfs2_iget() warn: passing zero to 'ERR_PTR'

Fixes: d56a8f32e4 ("ocfs2: check/fix inode block for online file check")
Link: http://lkml.kernel.org/r/1495516634-1952-1-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Reviewed-by: Eric Ren <zren@suse.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:30 -07:00
Filipe Manana
24e52b11e0 Btrfs: incremental send, fix invalid memory access
When doing an incremental send, while processing an extent that changed
between the parent and send snapshots and that extent was an inline extent
in the parent snapshot, it's possible to access a memory region beyond
the end of leaf if the inline extent is very small and it is the first
item in a leaf.

An example scenario is described below.

The send snapshot has the following leaf:

 leaf 33865728 items 33 free space 773 generation 46 owner 5
 fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
 chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
        (...)
        item 14 key (335 EXTENT_DATA 0) itemoff 3052 itemsize 53
                generation 36 type 1 (regular)
                extent data disk byte 12791808 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
        item 15 key (335 EXTENT_DATA 8192) itemoff 2999 itemsize 53
                generation 36 type 1 (regular)
                extent data disk byte 138170368 nr 225280
                extent data offset 0 nr 225280 ram 225280
                extent compression 0 (none)
        (...)

And the parent snapshot has the following leaf:

 leaf 31272960 items 17 free space 17 generation 31 owner 5
 fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
 chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
        item 0 key (335 EXTENT_DATA 0) itemoff 3951 itemsize 44
                generation 31 type 0 (inline)
                inline extent data size 23 ram_bytes 613 compression 1 (zlib)
        (...)

When computing the send stream, it is detected that the extent of inode
335, at file offset 0, and at fs/btrfs/send.c:is_extent_unchanged() we
grab the leaf from the parent snapshot and access the inline extent item.
However, before jumping to the 'out' label, we access the 'offset' and
'disk_bytenr' fields of the extent item, which should not be done for
inline extents since the inlined data starts at the offset of the
'disk_bytenr' field and can be very small. For example accessing the
'offset' field of the file extent item results in the following trace:

[  599.705368] general protection fault: 0000 [#1] PREEMPT SMP
[  599.706296] Modules linked in: btrfs psmouse i2c_piix4 ppdev acpi_cpufreq serio_raw parport_pc i2c_core evdev tpm_tis tpm_tis_core sg pcspkr parport tpm button su$
[  599.709340] CPU: 7 PID: 5283 Comm: btrfs Not tainted 4.10.0-rc8-btrfs-next-46+ #1
[  599.709340] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  599.709340] task: ffff88023eedd040 task.stack: ffffc90006658000
[  599.709340] RIP: 0010:read_extent_buffer+0xdb/0xf4 [btrfs]
[  599.709340] RSP: 0018:ffffc9000665ba00 EFLAGS: 00010286
[  599.709340] RAX: db73880000000000 RBX: 0000000000000000 RCX: 0000000000000001
[  599.709340] RDX: ffffc9000665ba60 RSI: db73880000000000 RDI: ffffc9000665ba5f
[  599.709340] RBP: ffffc9000665ba30 R08: 0000000000000001 R09: ffff88020dc5e098
[  599.709340] R10: 0000000000001000 R11: 0000160000000000 R12: 6db6db6db6db6db7
[  599.709340] R13: ffff880000000000 R14: 0000000000000000 R15: ffff88020dc5e088
[  599.709340] FS:  00007f519555a8c0(0000) GS:ffff88023f3c0000(0000) knlGS:0000000000000000
[  599.709340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  599.709340] CR2: 00007f1411afd000 CR3: 0000000235f8e000 CR4: 00000000000006e0
[  599.709340] Call Trace:
[  599.709340]  btrfs_get_token_64+0x93/0xce [btrfs]
[  599.709340]  ? printk+0x48/0x50
[  599.709340]  btrfs_get_64+0xb/0xd [btrfs]
[  599.709340]  process_extent+0x3a1/0x1106 [btrfs]
[  599.709340]  ? btree_read_extent_buffer_pages+0x5/0xef [btrfs]
[  599.709340]  changed_cb+0xb03/0xb3d [btrfs]
[  599.709340]  ? btrfs_get_token_32+0x7a/0xcc [btrfs]
[  599.709340]  btrfs_compare_trees+0x432/0x53d [btrfs]
[  599.709340]  ? process_extent+0x1106/0x1106 [btrfs]
[  599.709340]  btrfs_ioctl_send+0x960/0xe26 [btrfs]
[  599.709340]  btrfs_ioctl+0x181b/0x1fed [btrfs]
[  599.709340]  ? trace_hardirqs_on_caller+0x150/0x1ac
[  599.709340]  vfs_ioctl+0x21/0x38
[  599.709340]  ? vfs_ioctl+0x21/0x38
[  599.709340]  do_vfs_ioctl+0x611/0x645
[  599.709340]  ? rcu_read_unlock+0x5b/0x5d
[  599.709340]  ? __fget+0x6d/0x79
[  599.709340]  SyS_ioctl+0x57/0x7b
[  599.709340]  entry_SYSCALL_64_fastpath+0x18/0xad
[  599.709340] RIP: 0033:0x7f51945eec47
[  599.709340] RSP: 002b:00007ffc21c13e98 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[  599.709340] RAX: ffffffffffffffda RBX: ffffffff81096459 RCX: 00007f51945eec47
[  599.709340] RDX: 00007ffc21c13f20 RSI: 0000000040489426 RDI: 0000000000000004
[  599.709340] RBP: ffffc9000665bf98 R08: 00007f519450d700 R09: 00007f519450d700
[  599.709340] R10: 00007f519450d9d0 R11: 0000000000000202 R12: 0000000000000046
[  599.709340] R13: ffffc9000665bf78 R14: 0000000000000000 R15: 00007f5195574040
[  599.709340]  ? trace_hardirqs_off_caller+0x43/0xb1
[  599.709340] Code: 29 f0 49 39 d8 4c 0f 47 c3 49 03 81 58 01 00 00 44 89 c1 4c 01 c2 4c 29 c3 48 c1 f8 03 49 0f af c4 48 c1 e0 0c 4c 01 e8 48 01 c6 <f3> a4 31 f6 4$
[  599.709340] RIP: read_extent_buffer+0xdb/0xf4 [btrfs] RSP: ffffc9000665ba00
[  599.762057] ---[ end trace fe00d7af61b9f49e ]---

This is because the 'offset' field starts at an offset of 37 bytes
(offsetof(struct btrfs_file_extent_item, offset)), has a length of 8
bytes and therefore attemping to read it causes a 1 byte access beyond
the end of the leaf, as the first item's content in a leaf is located
at the tail of the leaf, the item size is 44 bytes and the offset of
that field plus its length (37 + 8 = 45) goes beyond the item's size
by 1 byte.

So fix this by accessing the 'offset' and 'disk_bytenr' fields after
jumping to the 'out' label if we are processing an inline extent. We
move the reading operation of the 'disk_bytenr' field too because we
have the same problem as for the 'offset' field explained above when
the inline data is less then 8 bytes. The access to the 'generation'
field is also moved but just for the sake of grouping access to all
the fields.

Fixes: e1cbfd7bf6 ("Btrfs: send, fix file hole not being preserved due to inline extent")
Cc: <stable@vger.kernel.org>  # v4.12+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-07-06 23:02:30 +01:00
Filipe Manana
f59627810e Btrfs: incremental send, fix invalid path for link commands
In some scenarios an incremental send stream can contain link commands
with an invalid target path. Such scenarios happen after moving some
directory inode A, renaming a regular file inode B into the old name of
inode A and finally creating a new hard link for inode B at directory
inode A.

Consider the following example scenario where this issue happens.

Parent snapshot:

  .                                                      (ino 256)
  |
  |--- dir1/                                             (ino 257)
  |      |--- dir2/                                      (ino 258)
  |             |--- dir3/                               (ino 259)
  |                   |--- file1                         (ino 261)
  |                   |--- dir4/                         (ino 262)
  |
  |--- dir5/                                             (ino 260)

Send snapshot:

  .                                                      (ino 256)
  |
  |--- dir1/                                             (ino 257)
         |--- dir2/                                      (ino 258)
         |      |--- dir3/                               (ino 259)
         |            |--- dir4                          (ino 261)
         |
         |--- dir6/                                      (ino 263)
                |--- dir44/                              (ino 262)
                       |--- file11                       (ino 261)
                       |--- dir55/                       (ino 260)

When attempting to apply the corresponding incremental send stream, a
link command contains an invalid target path which makes the receiver
fail. The following is the verbose output of the btrfs receive command:

  receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...)
  utimes
  utimes dir1
  utimes dir1/dir2/dir3
  utimes
  rename dir1/dir2/dir3/dir4 -> o262-7-0
  link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1
  link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1
  ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory

The following steps happen during the computation of the incremental send
stream the lead to this issue:

1) When processing inode 261, we orphanize inode 262 due to a name/location
   collision with one of the new hard links for inode 261 (created in the
   second step below).

2) We create one of the 2 new hard links for inode 261, the one whose
   location is at "dir1/dir2/dir3/dir4".

3) We then attempt to create the other new hard link for inode 261, which
   has inode 262 as its parent directory. Because the path for this new
   hard link was computed before we started processing the new references
   (hard links), it reflects the old name/location of inode 262, that is,
   it does not account for the orphanization step that happened when
   we started processing the new references for inode 261, whence it is
   no longer valid, causing the receiver to fail.

So fix this issue by recomputing the full path of new references if we
ended up orphanizing other inodes which are directories.

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-07-06 23:02:18 +01:00
Colin Ian King
ff95015648 ext4: fix spelling mistake: "prellocated" -> "preallocated"
Trivial fix to spelling mistake in mb_debug debug message

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-07-06 15:28:45 -04:00
Linus Torvalds
a4c20b9a57 Merge branch 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:
 "These are the percpu changes for the v4.13-rc1 merge window. There are
  a couple visibility related changes - tracepoints and allocator stats
  through debugfs, along with __ro_after_init markings and a cosmetic
  rename in percpu_counter.

  Please note that the simple O(#elements_in_the_chunk) area allocator
  used by percpu allocator is again showing scalability issues,
  primarily with bpf allocating and freeing large number of counters.
  Dennis is working on the replacement allocator and the percpu
  allocator will be seeing increased churns in the coming cycles"

* 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: fix static checker warnings in pcpu_destroy_chunk
  percpu: fix early calls for spinlock in pcpu_stats
  percpu: resolve err may not be initialized in pcpu_alloc
  percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
  percpu: add tracepoint support for percpu memory
  percpu: expose statistics about percpu memory via debugfs
  percpu: migrate percpu data structures to internal header
  percpu: add missing lockdep_assert_held to func pcpu_free_area
  mark most percpu globals as __ro_after_init
2017-07-06 08:59:41 -07:00
Al Viro
62473a2d6f move file_{start,end}_write() out of do_iter_write()
... and do *not* grab it in vfs_write_iter().

Fixes: "fs: implement vfs_iter_read using do_iter_read"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-06 09:15:47 -04:00