Commit Graph

45880 Commits

Author SHA1 Message Date
Miklos Szeredi
5201dc449e ovl: use cached acl on underlying layer
Instead of calling ->get_acl() directly, use get_acl() to get the cached
value.

We will have the acl cached on the underlying inode anyway, because we do
permission checking on the both the overlay and the underlying fs.

So, since we already have double caching, this improves performance without
any cost.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-01 11:11:59 +02:00
Miklos Szeredi
eea2fb4851 ovl: proper cleanup of workdir
When mounting overlayfs it needs a clean "work" directory under the
supplied workdir.

Previously the mount code removed this directory if it already existed and
created a new one.  If the removal failed (e.g. directory was not empty)
then it fell back to a read-only mount not using the workdir.

While this has never been reported, it is possible to get a non-empty
"work" dir from a previous mount of overlayfs in case of crash in the
middle of an operation using the work directory.

In this case the left over state should be discarded and the overlay
filesystem will be consistent, guaranteed by the atomicity of operations on
moving to/from the workdir to the upper layer.

This patch implements cleaning out any files left in workdir.  It is
implemented using real recursion for simplicity, but the depth is limited
to 2, because the worst case is that of a directory containing whiteouts
under "work".

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
2016-09-01 11:11:59 +02:00
Miklos Szeredi
c11b9fdd6a ovl: remove posix_acl_default from workdir
Clear out posix acl xattrs on workdir and also reset the mode after
creation so that an inherited sgid bit is cleared.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
2016-09-01 11:11:59 +02:00
Miklos Szeredi
38b256973e ovl: handle umask and posix_acl_default correctly on creation
Setting MS_POSIXACL in sb->s_flags has the side effect of passing mode to
create functions without masking against umask.

Another problem when creating over a whiteout is that the default posix acl
is not inherited from the parent dir (because the real parent dir at the
time of creation is the work directory).

Fix these problems by:

 a) If upper fs does not have MS_POSIXACL, then mask mode with umask.

 b) If creating over a whiteout, call posix_acl_create() to get the
 inherited acls.  After creation (but before moving to the final
 destination) set these acls on the created file.  posix_acl_create() also
 updates the file creation mode as appropriate.

Fixes: 39a25b2b37 ("ovl: define ->get_acl() for overlay inodes")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-01 11:11:59 +02:00
Mateusz Guzik
cd81a9170e mm: introduce get_task_exe_file
For more convenient access if one has a pointer to the task.

As a minor nit take advantage of the fact that only task lock + rcu are
needed to safely grab ->exe_file. This saves mm refcount dance.

Use the helper in proc_exe_link.

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Cc: <stable@vger.kernel.org> # 4.3.x
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-08-31 16:11:20 -04:00
Linus Torvalds
9f834ec18d binfmt_elf: switch to new creds when switching to new mm
We used to delay switching to the new credentials until after we had
mapped the executable (and possible elf interpreter).  That was kind of
odd to begin with, since the new executable will actually then _run_
with the new creds, but whatever.

The bigger problem was that we also want to make sure that we turn off
prof events and tracing before we start mapping the new executable
state.  So while this is a cleanup, it's also a fix for a possible
information leak.

Reported-by: Robert Święcki <robert@swiecki.net>
Tested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-31 09:13:56 -07:00
Konstantin Khlebnikov
17d0774f80 sysfs: correctly handle read offset on PREALLOC attrs
Attributes declared with __ATTR_PREALLOC use sysfs_kf_read() which returns
zero bytes for non-zero offset. This breaks script checkarray in mdadm tool
in debian where /bin/sh is 'dash' because its builtin 'read' reads only one
byte at a time. Script gets 'i' instead of 'idle' when reads current action
from /sys/block/$dev/md/sync_action and as a result does nothing.

This patch adds trivial implementation of partial read: generate whole
string and move required part into buffer head.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 4ef67a8c95 ("sysfs/kernfs: make read requests on pre-alloc files use the buffer.")
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=787950
Cc: Stable <stable@vger.kernel.org> # v3.19+
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-31 15:14:44 +02:00
Tejun Heo
df6a58c5c5 kernfs: don't depend on d_find_any_alias() when generating notifications
kernfs_notify_workfn() sends out file modified events for the
scheduled kernfs_nodes.  Because the modifications aren't from
userland, it doesn't have the matching file struct at hand and can't
use fsnotify_modify().  Instead, it looked up the inode and then used
d_find_any_alias() to find the dentry and used fsnotify_parent() and
fsnotify() directly to generate notifications.

The assumption was that the relevant dentries would have been pinned
if there are listeners, which isn't true as inotify doesn't pin
dentries at all and watching the parent doesn't pin the child dentries
even for dnotify.  This led to, for example, inotify watchers not
getting notifications if the system is under memory pressure and the
matching dentries got reclaimed.  It can also be triggered through
/proc/sys/vm/drop_caches or a remount attempt which involves shrinking
dcache.

fsnotify_parent() only uses the dentry to access the parent inode,
which kernfs can do easily.  Update kernfs_notify_workfn() so that it
uses fsnotify() directly for both the parent and target inodes without
going through d_find_any_alias().  While at it, supply the target file
name to fsnotify() from kernfs_node->name.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Evgeny Vereshchagin <evvers@ya.ru>
Fixes: d911d98748 ("kernfs: make kernfs_notify() trigger inotify events too")
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Eric Paris <eparis@parisplace.org>
Cc: stable@vger.kernel.org # v3.16+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-31 14:48:52 +02:00
Linus Torvalds
0cf21c6609 NFS client bugfixes for 4.8
Highlights include:
 
 Stable patches:
 - Fix a refcount leak in nfs_callback_up_net
 - Fix an Oopsable condition when the flexfile pNFS driver connection to
   the DS fails
 - Fix an Oopsable condition in NFSv4.1 server callback races
 - Ensure pNFS clients stop doing I/O to the DS if their lease has expired,
   as required by the NFSv4.1 protocol
 
 Bugfixes:
 - Fix potential looping in the NFSv4.x migration code
 - Patch series to close callback races for OPEN, LAYOUTGET and LAYOUTRETURN
 - Silence WARN_ON when NFSv4.1 over RDMA is in use
 - Fix a LAYOUTCOMMIT race in the pNFS/blocks client
 - Fix pNFS timeout issues when the DS fails
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJXxbnyAAoJEGcL54qWCgDykWoP/jqgBBR/cSaOtx+5m39wlf0P
 pTdQkgcpWnhBS90tKZtC6zfJ2DFVt8sUNVn9+mVzT4Q7TgEcAmENQ//s0igxHLbl
 bkXPvULydvD05Db8m1xmq2snj72tWbpg3CaA7nfx6yiP63k237QxhyNZVkmEQDur
 ynU8dPzmxRaSTQdVgatdS0zqx8sF47OFnXVxkV0ssBKORGsWj3yKDcs293NZNFAM
 Ztkih5oW1mm+BtWUQVNrjRnfZFG+PxAxWv090JM6wABDRbDHwSaKmwmI0kWRKXoH
 DHrj4i/Wzws65Fg5AyVPSRkF8YvHSVsLnw/FlwKKZFsrWjU6WtLdLSzgzwQ47x98
 tQk/YGgNyiiD1cAcw+l0d3Ct1SO4AptNuisdJK0cn3iCdsbh6Y0eW6yRRtQY6jQI
 8qOyMTT8fp9ooEQK+nMNOhJVVlsG0hbvWAt/uiiBdPhjAfVB0UFRuua/vNKUO7yv
 hJkDY9i7EkMXKACf5BCpBuvYdU7rwqp43K9x34029A5vFTKOhJZS4hnAIocDd/WF
 Hw7yqHdpkvI5RgFbBV5tmfZPyS65k8AzzTtT1QHKlH0qEtN2iMaXsXM9EzK5bKfW
 85Cc6yzRk7NzDZKmZFs/T8zCYdzet48sCY7wVyOQjL0aIkIDNNcZhex+C1GuD1dp
 Ld0H5f9eZdwv/OAqJ8tm
 =U+XK
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.8-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - Fix a refcount leak in nfs_callback_up_net
   - Fix an Oopsable condition when the flexfile pNFS driver connection
     to the DS fails
   - Fix an Oopsable condition in NFSv4.1 server callback races
   - Ensure pNFS clients stop doing I/O to the DS if their lease has
     expired, as required by the NFSv4.1 protocol

  Bugfixes:
   - Fix potential looping in the NFSv4.x migration code
   - Patch series to close callback races for OPEN, LAYOUTGET and
     LAYOUTRETURN
   - Silence WARN_ON when NFSv4.1 over RDMA is in use
   - Fix a LAYOUTCOMMIT race in the pNFS/blocks client
   - Fix pNFS timeout issues when the DS fails"

* tag 'nfs-for-4.8-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4.x: Fix a refcount leak in nfs_callback_up_net
  NFS4: Avoid migration loops
  pNFS/flexfiles: Fix an Oopsable condition when connection to the DS fails
  NFSv4.1: Remove obsolete and incorrrect assignment in nfs4_callback_sequence
  NFSv4.1: Close callback races for OPEN, LAYOUTGET and LAYOUTRETURN
  NFSv4.1: Defer bumping the slot sequence number until we free the slot
  NFSv4.1: Delay callback processing when there are referring triples
  NFSv4.1: Fix Oopsable condition in server callback races
  SUNRPC: Silence WARN_ON when NFSv4.1 over RDMA is in use
  pnfs/blocklayout: update last_write_offset atomically with extents
  pNFS: The client must not do I/O to the DS if it's lease has expired
  pNFS: Handle NFS4ERR_OLD_STATEID correctly in LAYOUTSTAT calls
  pNFS/flexfiles: Set reasonable default retrans values for the data channel
  NFS: Allow the mount option retrans=0
  pNFS/flexfiles: Fix layoutstat periodic reporting
2016-08-30 11:14:02 -07:00
Trond Myklebust
98b0f80c23 NFSv4.x: Fix a refcount leak in nfs_callback_up_net
On error, the callers expect us to return without bumping
nn->cb_users[].

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v3.7+
2016-08-30 09:26:57 -04:00
Benjamin Coddington
52442f9b11 NFS4: Avoid migration loops
If a server returns itself as a location while migrating, the client may
end up getting stuck attempting to migrate twice to the same server.  Catch
this by checking if the nfs_client found is the same as the existing
client.  For the other two callers to nfs4_set_client, the nfs_client will
always be ERR_PTR(-EINVAL).

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-30 09:26:32 -04:00
Darrick J. Wong
ea78d80866 xfs: track log done items directly in the deferred pending work item
Christoph reports slab corruption when a deferred refcount update
aborts during _defer_finish().  The cause of this was broken log item
state tracking in xfs_defer_pending -- upon an abort,
_defer_trans_abort() will call abort_intent on all intent items,
including the ones that have already had a done item attached.

This is incorrect because each intent item has 2 refcount: the first
is released when the intent item is committed to the log; and the
second is released when the _done_ item is committed to the log, or
by the intent creator if there is no done item.  In other words, once
we log the done item, responsibility for releasing the intent item's
second refcount is transferred to the done item and /must not/ be
performed by anything else.

The dfp_committed flag should have been tracking whether or not we had
a done item so that _defer_trans_abort could decide if it needs to
abort the intent item, but due to a thinko this was not the case.  Rip
it out and track the done item directly so that we do the right thing
w.r.t. intent item freeing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-30 13:51:39 +10:00
Linus Torvalds
b8927721ae Fix bugs that could cause kernel deadlocks or file system corruption
while moving xattrs to expand the extended inode.  Also add some
 sanity checks to the block group descriptors to make sure we don't end
 up overwriting the superblock.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXw7i2AAoJEPL5WVaVDYGj96gH/A8rNgx7BoqPx3kanVEamblT
 tM0X9JcEGmKHN4enRts2b78EWbR0/U0SOP92+fg9SSq2MDJ0/kdaKLWmbUwx8jUi
 B7HMEqCprlCdigK7wwt3xF+6edyZRhtzlWy3bhxJ40f0KT5CuriSQbxogr931uKl
 hUKW2h5JtUqHtINzTt4oWjVm8xwrScxuYHYAcpw0G42ZzfO6xQOzQdowcx4m3cE9
 PrtTbU5MwW8/wgsdLiClScQq30MK/GCbHh5heyRt1BcNo9+MDsZDOgdavh9StfnW
 Bl1N6zwRtRBJNcpKWfTfwU4NTIvStCTyA8BJgKgE95YIHDsstJVl4MO7ot25qbM=
 =pXe+
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix bugs that could cause kernel deadlocks or file system corruption
  while moving xattrs to expand the extended inode.

  Also add some sanity checks to the block group descriptors to make
  sure we don't end up overwriting the superblock"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: avoid deadlock when expanding inode size
  ext4: properly align shifted xattrs when expanding inodes
  ext4: fix xattr shifting when expanding inodes part 2
  ext4: fix xattr shifting when expanding inodes
  ext4: validate that metadata blocks do not overlap superblock
  ext4: reserve xattr index for the Hurd
2016-08-29 12:37:11 -07:00
Trond Myklebust
3dc147359e pNFS/flexfiles: Fix an Oopsable condition when connection to the DS fails
If the attempt to connect to a DS fails inside ff_layout_pg_init_read or
ff_layout_pg_init_write, then we currently end up clearing the layout
segment carried by the struct nfs_pageio_descriptor, causing an Oops
when we later call into ff_layout_read_pagelist/ff_layout_write_pagelist.

The fix is to ensure we return the layout and then retry.

Fixes: 446ca21953 ("pNFS/flexfiles: When initing reads or writes, we...")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-29 15:21:16 -04:00
Christoph Hellwig
17de0a9ff3 iomap: don't set FIEMAP_EXTENT_MERGED for extent based filesystems
Filesystems like XFS that use extents should not set the
FIEMAP_EXTENT_MERGED flag in the fiemap extent structures.  To allow
for both behaviors for the upcoming gfs2 usage split the iomap
type field into type and flags, and only set FIEMAP_EXTENT_MERGED if
the IOMAP_F_MERGED flag is set.  The flags field will also come in
handy for future features such as shared extents on reflink-enabled
file systems.

Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-29 11:33:58 +10:00
Trond Myklebust
d138027a82 NFSv4.1: Remove obsolete and incorrrect assignment in nfs4_callback_sequence
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-28 14:23:27 -04:00
Trond Myklebust
2e80dbe7ac NFSv4.1: Close callback races for OPEN, LAYOUTGET and LAYOUTRETURN
Defer freeing the slot until after we have processed the results from
OPEN and LAYOUTGET. This means that the server can rely on the
mechanism in RFC5661 Section 2.10.6.3 to ensure that replies to an
OPEN or LAYOUTGET/RETURN RPC call don't race with the callbacks that
apply to them.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-28 14:23:27 -04:00
Trond Myklebust
07e8dcbda7 NFSv4.1: Defer bumping the slot sequence number until we free the slot
For operations like OPEN or LAYOUTGET, which return recallable state
(i.e. delegations and layouts) we want to enable the mechanism for
resolving recall races in RFC5661 Section 2.10.6.3.
To do so, we will want to defer bumping the slot's sequence number until
we have finished processing the RPC results.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-28 14:23:26 -04:00
Trond Myklebust
045d2a6d07 NFSv4.1: Delay callback processing when there are referring triples
If CB_SEQUENCE tells us that the processing of this request depends on
the completion of one or more referring triples (see RFC 5661 Section
2.10.6.3), delay the callback processing until after the RPC requests
being referred to have completed.
If we end up delaying for more than 1/2 second, then fall back to
returning NFS4ERR_DELAY in reply to the callback.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-28 14:23:26 -04:00
Trond Myklebust
e09c978aae NFSv4.1: Fix Oopsable condition in server callback races
The slot table hasn't been an array since v3.7. Ensure that we
use nfs4_lookup_slot() to access the slot correctly.

Fixes: 87dda67e73 ("NFSv4.1: Allow SEQUENCE to resize the slot table...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v3.8+
2016-08-28 14:23:22 -04:00
Linus Torvalds
5e608a0270 Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "11 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: silently skip readahead for DAX inodes
  dax: fix device-dax region base
  fs/seq_file: fix out-of-bounds read
  mm: memcontrol: avoid unused function warning
  mm: clarify COMPACTION Kconfig text
  treewide: replace config_enabled() with IS_ENABLED() (2nd round)
  printk: fix parsing of "brl=" option
  soft_dirty: fix soft_dirty during THP split
  sysctl: handle error writing UINT_MAX to u32 fields
  get_maintainer: quiet noisy implicit -f vcs_file_exists checking
  byteswap: don't use __builtin_bswap*() with sparse
2016-08-26 23:12:12 -07:00
Linus Torvalds
28687b935e Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We've queued up a few different fixes in here.  These range from
  enospc corners to fsync and quota fixes, and a few targeted at error
  handling for corrupt metadata/fuzzing"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix lockdep warning on deadlock against an inode's log mutex
  Btrfs: detect corruption when non-root leaf has zero item
  Btrfs: check btree node's nritems
  btrfs: don't create or leak aliased root while cleaning up orphans
  Btrfs: fix em leak in find_first_block_group
  btrfs: do not background blkdev_put()
  Btrfs: clarify do_chunk_alloc()'s return value
  btrfs: fix fsfreeze hang caused by delayed iputs deal
  btrfs: update btrfs_space_info's bytes_may_use timely
  btrfs: divide btrfs_update_reserved_bytes() into two functions
  btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()
  btrfs: qgroup: Fix qgroup incorrectness caused by log replay
  btrfs: relocation: Fix leaking qgroups numbers on data extents
  btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent()
  btrfs: waiting on qgroup rescan should not always be interruptible
  btrfs: properly track when rescan worker is running
  btrfs: flush_space: treat return value of do_chunk_alloc properly
  Btrfs: add ASSERT for block group's memory leak
  btrfs: backref: Fix soft lockup in __merge_refs function
  Btrfs: fix memory leak of reloc_root
2016-08-26 20:22:01 -07:00
Linus Torvalds
370f601729 dlm fixes for 4.8
This fixes a bug introduced by recent debugfs cleanup.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXwJgQAAoJEDgbc8f8gGmqTQcP/1XKsslqYcg9e4xcx3ZAyT3l
 HTzRbygNmIzgIsLxDk4AvlvfrUOMFj/rJwBH/gvM68wD5cUHaTrdTN9riOWaJLFh
 J+EgkMYmKAoYvk3wyvAKbeYACOAB8BjTOLLN7zdEEDCVBMG4A+zq7B54xg3J15bU
 o60XLNnA34m4YPCh+LpGODckek++lKnsNzI/x0H7EQoMMU9Rm7WgVk+gictmnZlT
 Ms8zfE8dy1UPuGUyYN5YGGXoCasNN6FQc3MVLbTYCmw8qPwIa2hdMYjm8er329gL
 bvqp350ElogABbTGrgzN/cmrKJt6k3Y2i2ECs4G7aYBXkFhWJKXIdhPnu5ajiiRG
 DUwnPSqCgFXSDKU/X1Ev3Ro1IgdqZJx18PFgljW2PCPTDx79jCaMJjHgEtK+Q5mu
 VyeEiyXwhRPaFU4Sfc2Tul75ylI0SashufTRHSo80qfobCnhnByYTyOb8/MuCAsM
 v8fcgbSaHBktpiZIMOn9ZOcsaXQ/wkciqr5JKqnVO69F/m2dbz5SX6ySew0y+DSA
 6ZpU9H6VIXKzsd1NCLsUTgyJE5L649nE9T0CzbzBUWYj1EzC+lk/DLu+gzxVuj3M
 T0SDmU0d441qECOsxtyUgkBUOfqKoHQis5WZyU++cXxV9vapBR+s+NFAJjc3MmT+
 iiKm1Qg6nD5BQr8EM8i6
 =9igI
 -----END PGP SIGNATURE-----

Merge tag 'dlm-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm fix from David Teigland:
 "This fixes a bug introduced by recent debugfs cleanup"

* tag 'dlm-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: fix malfunction of dlm_tool caused by debugfs changes
2016-08-26 20:18:49 -07:00
Linus Torvalds
fd1ae51452 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Here's a set of block fixes for the current 4.8-rc release.  This
  contains:

   - a fix for a secure erase regression, from Adrian.

   - a fix for an mmc use-after-free bug regression, also from Adrian.

   - potential zero pointer deference in bdev freezing, from Andrey.

   - a race fix for blk_set_queue_dying() from Bart.

   - a set of xen blkfront fixes from Bob Liu.

   - three small fixes for bcache, from Eric and Kent.

   - a fix for a potential invalid NVMe state transition, from Gabriel.

   - blk-mq CPU offline fix, preventing us from issuing and completing a
     request on the wrong queue.  From me.

   - revert two previous floppy changes, since they caused a user
     visibile regression.  A better fix is in the works.

   - ensure that we don't send down bios that have more than 256
     elements in them.  Fixes a crash with bcache, for example.  From
     Ming.

   - a fix for deferencing an error pointer with cgroup writeback.
     Fixes a regression.  From Vegard"

* 'for-linus' of git://git.kernel.dk/linux-block:
  mmc: fix use-after-free of struct request
  Revert "floppy: refactor open() flags handling"
  Revert "floppy: fix open(O_ACCMODE) for ioctl-only open"
  fs/block_dev: fix potential NULL ptr deref in freeze_bdev()
  blk-mq: improve warning for running a queue on the wrong CPU
  blk-mq: don't overwrite rq->mq_ctx
  block: make sure a big bio is split into at most 256 bvecs
  nvme: Fix nvme_get/set_features() with a NULL result pointer
  bdev: fix NULL pointer dereference
  xen-blkfront: free resources if xlvbd_alloc_gendisk fails
  xen-blkfront: introduce blkif_set_queue_limits()
  xen-blkfront: fix places not updated after introducing 64KB page granularity
  bcache: pr_err: more meaningful error message when nr_stripes is invalid
  bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two.
  bcache: register_bcache(): call blkdev_put() when cache_alloc() fails
  block: Fix race triggered by blk_set_queue_dying()
  block: Fix secure erase
  nvme: Prevent controller state invalid transition
2016-08-26 18:50:07 -07:00
Vegard Nossum
088bf2ff5d fs/seq_file: fix out-of-bounds read
seq_read() is a nasty piece of work, not to mention buggy.

It has (I think) an old bug which allows unprivileged userspace to read
beyond the end of m->buf.

I was getting these:

    BUG: KASAN: slab-out-of-bounds in seq_read+0xcd2/0x1480 at addr ffff880116889880
    Read of size 2713 by task trinity-c2/1329
    CPU: 2 PID: 1329 Comm: trinity-c2 Not tainted 4.8.0-rc1+ #96
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
    Call Trace:
      kasan_object_err+0x1c/0x80
      kasan_report_error+0x2cb/0x7e0
      kasan_report+0x4e/0x80
      check_memory_region+0x13e/0x1a0
      kasan_check_read+0x11/0x20
      seq_read+0xcd2/0x1480
      proc_reg_read+0x10b/0x260
      do_loop_readv_writev.part.5+0x140/0x2c0
      do_readv_writev+0x589/0x860
      vfs_readv+0x7b/0xd0
      do_readv+0xd8/0x2c0
      SyS_readv+0xb/0x10
      do_syscall_64+0x1b3/0x4b0
      entry_SYSCALL64_slow_path+0x25/0x25
    Object at ffff880116889100, in cache kmalloc-4096 size: 4096
    Allocated:
    PID = 1329
      save_stack_trace+0x26/0x80
      save_stack+0x46/0xd0
      kasan_kmalloc+0xad/0xe0
      __kmalloc+0x1aa/0x4a0
      seq_buf_alloc+0x35/0x40
      seq_read+0x7d8/0x1480
      proc_reg_read+0x10b/0x260
      do_loop_readv_writev.part.5+0x140/0x2c0
      do_readv_writev+0x589/0x860
      vfs_readv+0x7b/0xd0
      do_readv+0xd8/0x2c0
      SyS_readv+0xb/0x10
      do_syscall_64+0x1b3/0x4b0
      return_from_SYSCALL_64+0x0/0x6a
    Freed:
    PID = 0
    (stack is not available)
    Memory state around the buggy address:
     ffff88011688a000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
     ffff88011688a080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    >ffff88011688a100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
		       ^
     ffff88011688a180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
     ffff88011688a200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ==================================================================
    Disabling lock debugging due to kernel taint

This seems to be the same thing that Dave Jones was seeing here:

  https://lkml.org/lkml/2016/8/12/334

There are multiple issues here:

  1) If we enter the function with a non-empty buffer, there is an attempt
     to flush it. But it was not clearing m->from after doing so, which
     means that if we try to do this flush twice in a row without any call
     to traverse() in between, we are going to be reading from the wrong
     place -- the splat above, fixed by this patch.

  2) If there's a short write to userspace because of page faults, the
     buffer may already contain multiple lines (i.e. pos has advanced by
     more than 1), but we don't save the progress that was made so the
     next call will output what we've already returned previously. Since
     that is a much less serious issue (and I have a headache after
     staring at seq_read() for the past 8 hours), I'll leave that for now.

Link: http://lkml.kernel.org/r/1471447270-32093-1-git-send-email-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-26 17:39:35 -07:00
Eric Ren
079d37df33 dlm: fix malfunction of dlm_tool caused by debugfs changes
With the current kernel, `dlm_tool lockdebug` fails as below:

"dlm_tool lockdebug ED0BD86DCE724393918A1AE8FDBF1EE3
can't open /sys/kernel/debug/dlm/ED0BD86DCE724393918A1AE8FDBF1EE3:
Operation not permitted"

This is because table_open() depends on file->f_op to tell which
seq_file ops should be passed down. But, the original file ops in
file->f_op is replaced by "debugfs_full_proxy_file_operations" with
commit 49d200deaa ("debugfs: prevent access to removed files'
private data").

Currently, I can think up 2 solutions: 1st, replace
debugfs_create_file() with debugfs_create_file_unsafe();
2nd, make different table_open#() accordingly. The 1st one
is neat, but I don't thoroughly understand its risk. Maybe
someone has a better one.

Signed-off-by: Eric Ren <zren@suse.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2016-08-26 13:22:14 -05:00
Brian Foster
800b2694f8 xfs: prevent dropping ioend completions during buftarg wait
xfs_wait_buftarg() waits for all pending I/O, drains the ioend
completion workqueue and walks the LRU until all buffers in the cache
have been released. This is traditionally an unmount operation` but the
mechanism is also reused during filesystem freeze.

xfs_wait_buftarg() invokes drain_workqueue() as part of the quiesce,
which is intended more for a shutdown sequence in that it indicates to
the queue that new operations are not expected once the drain has begun.
New work jobs after this point result in a WARN_ON_ONCE() and are
otherwise dropped.

With filesystem freeze, however, read operations are allowed and can
proceed during or after the workqueue drain. If such a read occurs
during the drain sequence, the workqueue infrastructure complains about
the queued ioend completion work item and drops it on the floor. As a
result, the buffer remains on the LRU and the freeze never completes.

Despite the fact that the overall buffer cache cleanup is not necessary
during freeze, fix up this operation such that it is safe to invoke
during non-unmount quiesce operations. Replace the drain_workqueue()
call with flush_workqueue(), which runs a similar serialization on
pending workqueue jobs without causing new jobs to be dropped. This is
safe for unmount as unmount independently locks out new operations by
the time xfs_wait_buftarg() is invoked.

cc: <stable@vger.kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-26 16:01:59 +10:00
Dave Chinner
f3d7ebdeb2 xfs: fix superblock inprogress check
From inspection, the superblock sb_inprogress check is done in the
verifier and triggered only for the primary superblock via a
"bp->b_bn == XFS_SB_DADDR" check.

Unfortunately, the primary superblock is an uncached buffer, and
hence it is configured by xfs_buf_read_uncached() with:

	bp->b_bn = XFS_BUF_DADDR_NULL;  /* always null for uncached buffers */

And so this check never triggers. Fix it.

cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-26 16:01:30 +10:00
Darrick J. Wong
5b5c2dbd3c xfs: simple btree query range should look right if LE lookup fails
If the initial LOOKUP_LE in the simple query range fails to find
anything, we should attempt to increment the btree cursor to see
if there actually /are/ records for what we're trying to find.
Without this patch, a bnobt range query of (0, $agsize) returns
no results because the leftmost record never has a startblock
of zero.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-26 16:00:10 +10:00
Darrick J. Wong
722278997b xfs: fix some key handling problems in _btree_simple_query_range
We only need the record's high key for the first record that we look
at; for all records, we /definitely/ need the regular record key.
Therefore, fix how the simple range query function gets its keys.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-26 15:59:50 +10:00
Darrick J. Wong
da1f039d69 xfs: don't log the entire end of the AGF
When we're logging the last non-spare field in the AGF, we don't
need to log the spare fields, so plumb in a new AGF logging flag
to help us avoid that.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-26 15:59:31 +10:00
Darrick J. Wong
738f57c16a xfs: disallow mounting of realtime + rmap filesystems
Since the kernel doesn't currently support the realtime rmapbt,
don't allow such filesystems to be mounted.  Support will appear
in a future release.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-26 15:59:19 +10:00
Darrick J. Wong
ed150e1a5c xfs: don't perform lookups on zero-height btrees
If the caller passes in a cursor to a zero-height btree (which is
impossible), we never set block to anything but NULL, which causes the
later dereference of it to crash.  Instead, just return -EFSCORRUPTED.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-26 15:58:40 +10:00
Andrey Ryabinin
5bb53c0fb8 fs/block_dev: fix potential NULL ptr deref in freeze_bdev()
Calling freeze_bdev() twice on the same block device without mounted
filesystem get_super() will return NULL, which will lead to NULL-ptr
dereference later in drop_super().

Check get_super() result to fix that.

Note, that this is a purely theoretical issue. We have only 3
freeze_bdev() callers. 2 of them are in filesystem code and used on a
device with mounted fs. The third one in lock_fs() has protection in
upper-layer code against freezing block device the second time without
thawing it first.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-25 08:38:26 -06:00
Filipe Manana
28a235931b Btrfs: fix lockdep warning on deadlock against an inode's log mutex
Commit 44f714dae5 ("Btrfs: improve performance on fsync against new
inode after rename/unlink"), which landed in 4.8-rc2, introduced a
possibility for a deadlock due to double locking of an inode's log mutex
by the same task, which lockdep reports with:

[23045.433975] =============================================
[23045.434748] [ INFO: possible recursive locking detected ]
[23045.435426] 4.7.0-rc6-btrfs-next-34+ #1 Not tainted
[23045.436044] ---------------------------------------------
[23045.436044] xfs_io/3688 is trying to acquire lock:
[23045.436044]  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]
               but task is already holding lock:
[23045.436044]  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]
               other info that might help us debug this:
[23045.436044]  Possible unsafe locking scenario:

[23045.436044]        CPU0
[23045.436044]        ----
[23045.436044]   lock(&ei->log_mutex);
[23045.436044]   lock(&ei->log_mutex);
[23045.436044]
                *** DEADLOCK ***

[23045.436044]  May be due to missing lock nesting notation

[23045.436044] 3 locks held by xfs_io/3688:
[23045.436044]  #0:  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffffa035f2ae>] btrfs_sync_file+0x14e/0x425 [btrfs]
[23045.436044]  #1:  (sb_internal#2){.+.+.+}, at: [<ffffffff8118446b>] __sb_start_write+0x5f/0xb0
[23045.436044]  #2:  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]
               stack backtrace:
[23045.436044] CPU: 4 PID: 3688 Comm: xfs_io Not tainted 4.7.0-rc6-btrfs-next-34+ #1
[23045.436044] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[23045.436044]  0000000000000000 ffff88022f5f7860 ffffffff8127074d ffffffff82a54b70
[23045.436044]  ffffffff82a54b70 ffff88022f5f7920 ffffffff81092897 ffff880228015d68
[23045.436044]  0000000000000000 ffffffff82a54b70 ffffffff829c3f00 ffff880228015d68
[23045.436044] Call Trace:
[23045.436044]  [<ffffffff8127074d>] dump_stack+0x67/0x90
[23045.436044]  [<ffffffff81092897>] __lock_acquire+0xcbb/0xe4e
[23045.436044]  [<ffffffff8109155f>] ? mark_lock+0x24/0x201
[23045.436044]  [<ffffffff8109179a>] ? mark_held_locks+0x5e/0x74
[23045.436044]  [<ffffffff81092de0>] lock_acquire+0x12f/0x1c3
[23045.436044]  [<ffffffff81092de0>] ? lock_acquire+0x12f/0x1c3
[23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]  [<ffffffff814a51a4>] mutex_lock_nested+0x77/0x3a7
[23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]  [<ffffffffa039705e>] ? btrfs_release_delayed_node+0xb/0xd [btrfs]
[23045.436044]  [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
[23045.436044]  [<ffffffff810a0ed1>] ? vprintk_emit+0x453/0x465
[23045.436044]  [<ffffffffa0385a61>] btrfs_log_inode+0x66e/0xc95 [btrfs]
[23045.436044]  [<ffffffffa03c084d>] log_new_dir_dentries+0x26c/0x359 [btrfs]
[23045.436044]  [<ffffffffa03865aa>] btrfs_log_inode_parent+0x4a6/0x628 [btrfs]
[23045.436044]  [<ffffffffa0387552>] btrfs_log_dentry_safe+0x5a/0x75 [btrfs]
[23045.436044]  [<ffffffffa035f464>] btrfs_sync_file+0x304/0x425 [btrfs]
[23045.436044]  [<ffffffff811acaf4>] vfs_fsync_range+0x8c/0x9e
[23045.436044]  [<ffffffff811acb22>] vfs_fsync+0x1c/0x1e
[23045.436044]  [<ffffffff811acc79>] do_fsync+0x31/0x4a
[23045.436044]  [<ffffffff811ace99>] SyS_fsync+0x10/0x14
[23045.436044]  [<ffffffff814a88e5>] entry_SYSCALL_64_fastpath+0x18/0xa8
[23045.436044]  [<ffffffff8108f039>] ? trace_hardirqs_off_caller+0x3f/0xaa

An example reproducer for this is:

   $ mkfs.btrfs -f /dev/sdb
   $ mount /dev/sdb /mnt
   $ mkdir /mnt/dir
   $ touch /mnt/dir/foo
   $ sync
   $ mv /mnt/dir/foo /mnt/dir/bar
   $ touch /mnt/dir/foo
   $ xfs_io -c "fsync" /mnt/dir/bar

This is because while logging the inode of file bar we end up logging its
parent directory (since its inode has an unlink_trans field matching the
current transaction id due to the rename operation), which in turn logs
the inodes for all its new dentries, so that the new inode for the new
file named foo gets logged which in turn triggered another logging attempt
for the inode we are fsync'ing, since that inode had an old name that
corresponds to the name of the new inode.

So fix this by ensuring that when logging the inode for a new dentry that
has a name matching an old name of some other inode, we don't log again
the original inode that we are fsync'ing.

Fixes: 44f714dae5 ("Btrfs: improve performance on fsync against new inode after rename/unlink")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:32 -07:00
Liu Bo
1ba98d086f Btrfs: detect corruption when non-root leaf has zero item
Right now we treat leaf which has zero item as a valid one
because we could have an empty tree, that is, a root that is
also a leaf without any item, however, in the same case but
when the leaf is not a root, we can end up with hitting the
BUG_ON(1) in btrfs_extend_item() called by
setup_inline_extent_backref().

This makes us check the situation as a corruption if leaf is
not its own root.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:31 -07:00
Liu Bo
053ab70f06 Btrfs: check btree node's nritems
When btree node (level = 1) has nritems which equals to zero,
we can end up with panic due to insert_ptr()'s

BUG_ON(slot > nritems);

where slot is 1 and nritems is 0, as copy_for_split() calls
insert_ptr(.., path->slots[1] + 1, ...);

A invalid value results in the whole mess, this adds the check
for btree's node nritems so that we stop reading block when
when something is wrong.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:30 -07:00
Jeff Mahoney
35bbb97fc8 btrfs: don't create or leak aliased root while cleaning up orphans
commit 909c3a22da (Btrfs: fix loading of orphan roots leading to BUG_ON)
avoids the BUG_ON but can add an aliased root to the dead_roots list or
leak the root.

Since we've already been loading roots into the radix tree, we should
use it before looking the root up on disk.

Cc: <stable@vger.kernel.org> # 4.5
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:29 -07:00
Josef Bacik
187ee58c62 Btrfs: fix em leak in find_first_block_group
We need to call free_extent_map() on the em we look up.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:29 -07:00
Anand Jain
1423881941 btrfs: do not background blkdev_put()
At the end of unmount/dev-delete, if the device exclusive open is not
actually closed, then there might be a race with another program in
the userland who is trying to open the device in exclusive mode and
it may fail for eg:
      unmount /btrfs; fsck /dev/x
      btrfs dev del /dev/x /btrfs; fsck /dev/x
so here background blkdev_put() is not a choice

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:28 -07:00
Liu Bo
28b737f6ed Btrfs: clarify do_chunk_alloc()'s return value
Function start_transaction() can return ERR_PTR(1) when flush is
BTRFS_RESERVE_FLUSH_LIMIT, so the call graph is

start_transaction (return ERR_PTR(1))
  -> btrfs_block_rsv_add (return 1)
     -> reserve_metadata_bytes (return 1)
        -> flush_space (return 1)
           -> do_chunk_alloc  (return 1)

With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the
flush_state of ALLOC_CHUNK and it successfully allocates a new
chunk, then instead of trying to reserve space again,
reserve_metadata_bytes returns 1 immediately.

Eventually the callers who call start_transaction() usually just
do the IS_ERR() check which ERR_PTR(1) can pass, then it'll get
a panic when dereferencing a pointer which is ERR_PTR(1).

The following patch fixes the above problem.
"btrfs: flush_space: treat return value of do_chunk_alloc properly"
https://patchwork.kernel.org/patch/7778651/

This add comments to clarify do_chunk_alloc()'s return value.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:27 -07:00
Wang Xiaoguang
9e7cc91a6d btrfs: fix fsfreeze hang caused by delayed iputs deal
When running fstests generic/068, sometimes we got below deadlock:
  xfs_io          D ffff8800331dbb20     0  6697   6693 0x00000080
  ffff8800331dbb20 ffff88007acfc140 ffff880034d895c0 ffff8800331dc000
  ffff880032d243e8 fffffffeffffffff ffff880032d24400 0000000000000001
  ffff8800331dbb38 ffffffff816a9045 ffff880034d895c0 ffff8800331dbba8
  Call Trace:
  [<ffffffff816a9045>] schedule+0x35/0x80
  [<ffffffff816abab2>] rwsem_down_read_failed+0xf2/0x140
  [<ffffffff8118f5e1>] ? __filemap_fdatawrite_range+0xd1/0x100
  [<ffffffff8134f978>] call_rwsem_down_read_failed+0x18/0x30
  [<ffffffffa06631fc>] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs]
  [<ffffffff810d32b5>] percpu_down_read+0x35/0x50
  [<ffffffff81217dfc>] __sb_start_write+0x2c/0x40
  [<ffffffffa067f5d5>] start_transaction+0x2a5/0x4d0 [btrfs]
  [<ffffffffa067f857>] btrfs_join_transaction+0x17/0x20 [btrfs]
  [<ffffffffa068ba34>] btrfs_evict_inode+0x3c4/0x5d0 [btrfs]
  [<ffffffff81230a1a>] evict+0xba/0x1a0
  [<ffffffff812316b6>] iput+0x196/0x200
  [<ffffffffa06851d0>] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs]
  [<ffffffffa067f1d8>] btrfs_commit_transaction+0x928/0xa80 [btrfs]
  [<ffffffffa0646df0>] btrfs_freeze+0x30/0x40 [btrfs]
  [<ffffffff81218040>] freeze_super+0xf0/0x190
  [<ffffffff81229275>] do_vfs_ioctl+0x4a5/0x5c0
  [<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70
  [<ffffffff810038cf>] ? syscall_trace_enter_phase1+0x11f/0x140
  [<ffffffff81229409>] SyS_ioctl+0x79/0x90
  [<ffffffff81003c12>] do_syscall_64+0x62/0x110
  [<ffffffff816acbe1>] entry_SYSCALL64_slow_path+0x25/0x25

>From this warning, freeze_super() already holds SB_FREEZE_FS, but
btrfs_freeze() will call btrfs_commit_transaction() again, if
btrfs_commit_transaction() finds that it has delayed iputs to handle,
it'll start_transaction(), which will try to get SB_FREEZE_FS lock
again, then deadlock occurs.

The root cause is that in btrfs, sync_filesystem(sb) does not make
sure all metadata is updated. There still maybe some codes adding
delayed iputs, see below sample race window:

         CPU1                                  |         CPU2
|-> freeze_super()                             |
    |-> sync_filesystem(sb);                   |
    |                                          |-> cleaner_kthread()
    |                                          |   |-> btrfs_delete_unused_bgs()
    |                                          |       |-> btrfs_remove_chunk()
    |                                          |           |-> btrfs_remove_block_group()
    |                                          |               |-> btrfs_add_delayed_iput()
    |                                          |
    |-> sb->s_writers.frozen = SB_FREEZE_FS;   |
    |-> sb_wait_write(sb, SB_FREEZE_FS);       |
    |   acquire SB_FREEZE_FS lock.             |
    |                                          |
    |-> btrfs_freeze()                         |
        |-> btrfs_commit_transaction()         |
            |-> btrfs_run_delayed_iputs()      |
            |   will handle delayed iputs,     |
            |   that means start_transaction() |
            |   will be called, which will try |
            |   to get SB_FREEZE_FS lock.      |

To fix this issue, introduce a "int fs_frozen" to record internally whether
fs has been frozen. If fs has been frozen, we can not handle delayed iputs.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment to btrfs_freeze ]
Signed-off-by: David Sterba <dsterba@suse.com>

Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:26 -07:00
Wang Xiaoguang
18513091af btrfs: update btrfs_space_info's bytes_may_use timely
This patch can fix some false ENOSPC errors, below test script can
reproduce one false ENOSPC error:
	#!/bin/bash
	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
	dev=$(losetup --show -f fs.img)
	mkfs.btrfs -f -M $dev
	mkdir /tmp/mntpoint
	mount $dev /tmp/mntpoint
	cd /tmp/mntpoint
	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile

Above script will fail for ENOSPC reason, but indeed fs still has free
space to satisfy this request. Please see call graph:
btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
|   bytes_may_use += 64M
|-> btrfs_prealloc_file_range()
    |-> btrfs_reserve_extent()
        |-> btrfs_add_reserved_bytes()
        |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
        |   change bytes_may_use, and bytes_reserved += 64M. Now
        |   bytes_may_use + bytes_reserved == 128M, which is greater
        |   than btrfs_space_info's total_bytes, false enospc occurs.
        |   Note, the bytes_may_use decrease operation will be done in
        |   end of btrfs_fallocate(), which is too late.

Here is another simple case for buffered write:
                    CPU 1              |              CPU 2
                                       |
|-> cow_file_range()                   |-> __btrfs_buffered_write()
    |-> btrfs_reserve_extent()         |   |
    |                                  |   |
    |                                  |   |
    |    .....                         |   |-> btrfs_check_data_free_space()
    |                                  |
    |                                  |
    |-> extent_clear_unlock_delalloc() |

In CPU 1, btrfs_reserve_extent()->find_free_extent()->
btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease
operation will be delayed to be done in extent_clear_unlock_delalloc().
Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's
btrfs_check_data_free_space() tries to reserve 100MB data space.
If
	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
		data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
		data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
btrfs_check_data_free_space() will try to allcate new data chunk or call
btrfs_start_delalloc_roots(), or commit current transaction in order to
reserve some free space, obviously a lot of work. But indeed it's not
necessary as long as decreasing bytes_may_use timely, we still have
free space, decreasing 128M from bytes_may_use.

To fix this issue, this patch chooses to update bytes_may_use for both
data and metadata in btrfs_add_reserved_bytes(). For compress path, real
extent length may not be equal to file content length, so introduce a
ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
btrfs_add_reserved_bytes(), it's becasue bytes_may_use is increased by
file content length. Then compress path can update bytes_may_use
correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
and RESERVE_FREE.

As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In
run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as
PREALLOC, we also need to update bytes_may_use, but can not pass
EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so
here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook()
to update btrfs_space_info's bytes_may_use.

Meanwhile __btrfs_prealloc_file_range() will call
btrfs_free_reserved_data_space() internally for both sucessful and failed
path, btrfs_prealloc_file_range()'s callers does not need to call
btrfs_free_reserved_data_space() any more.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:26 -07:00
Wang Xiaoguang
4824f1f412 btrfs: divide btrfs_update_reserved_bytes() into two functions
This patch divides btrfs_update_reserved_bytes() into
btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes(), and
next patch will extend btrfs_add_reserved_bytes()to fix some
false ENOSPC error, please see later patch for detailed info.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:25 -07:00
Wang Xiaoguang
dcb40c196f btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()
In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses
wrong file offset for reloc_inode, it uses cluster->start and cluster->end,
which indeed are extent's bytenr. The correct value should be
cluster->[start|end] minus block group's start bytenr.

start bytenr   cluster->start
|              |     extent      |   extent   | ...| extent |
|----------------------------------------------------------------|
|                block group reloc_inode                         |

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:24 -07:00
Qu Wenruo
df2c95f33e btrfs: qgroup: Fix qgroup incorrectness caused by log replay
When doing log replay at mount time(after power loss), qgroup will leak
numbers of replayed data extents.

The cause is almost the same of balance.
So fix it by manually informing qgroup for owner changed extents.

The bug can be detected by btrfs/119 test case.

Cc: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:23 -07:00
Qu Wenruo
62b99540a1 btrfs: relocation: Fix leaking qgroups numbers on data extents
This patch fixes a REGRESSION introduced in 4.2, caused by the big quota
rework.

When balancing data extents, qgroup will leak all its numbers for
relocated data extents.

The relocation is done in the following steps for data extents:
1) Create data reloc tree and inode
2) Copy all data extents to data reloc tree
   And commit transaction
3) Create tree reloc tree(special snapshot) for any related subvolumes
4) Replace file extent in tree reloc tree with new extents in data reloc
   tree
   And commit transaction
5) Merge tree reloc tree with original fs, by swapping tree blocks

For 1)~4), since tree reloc tree and data reloc tree doesn't count to
qgroup, everything is OK.

But for 5), the swapping of tree blocks will only info qgroup to track
metadata extents.

If metadata extents contain file extents, qgroup number for file extents
will get lost, leading to corrupted qgroup accounting.

The fix is, before commit transaction of step 5), manually info qgroup to
track all file extents in data reloc tree.
Since at commit transaction time, the tree swapping is done, and qgroup
will account these data extents correctly.

Cc: Mark Fasheh <mfasheh@suse.de>
Reported-by: Mark Fasheh <mfasheh@suse.de>
Reported-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:22 -07:00
Qu Wenruo
cb93b52cc0 btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent()
Refactor btrfs_qgroup_insert_dirty_extent() function, to two functions:
1. btrfs_qgroup_insert_dirty_extent_nolock()
   Almost the same with original code.
   For delayed_ref usage, which has delayed refs locked.

   Change the return value type to int, since caller never needs the
   pointer, but only needs to know if they need to free the allocated
   memory.

2. btrfs_qgroup_insert_dirty_extent()
   The more encapsulated version.

   Will do the delayed_refs lock, memory allocation, quota enabled check
   and other things.

The original design is to keep exported functions to minimal, but since
more btrfs hacks exposed, like replacing path in balance, we need to
record dirty extents manually, so we have to add such functions.

Also, add comment for both functions, to info developers how to keep
qgroup correct when doing hacks.

Cc: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:21 -07:00
Jeff Mahoney
d06f23d6a9 btrfs: waiting on qgroup rescan should not always be interruptible
We wait on qgroup rescan completion in three places: file system
shutdown, the quota disable ioctl, and the rescan wait ioctl.  If the
user sends a signal while we're waiting, we continue happily along.  This
is expected behavior for the rescan wait ioctl.  It's racy in the shutdown
path but mostly works due to other unrelated synchronization points.
In the quota disable path, it Oopses the kernel pretty much immediately.

Cc: <stable@vger.kernel.org> # v4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:20 -07:00
Jeff Mahoney
d2c609b834 btrfs: properly track when rescan worker is running
The qgroup_flags field is overloaded such that it reflects the on-disk
status of qgroups and the runtime state.  The BTRFS_QGROUP_STATUS_FLAG_RESCAN
flag is used to indicate that a rescan operation is in progress, but if
the file system is unmounted while a rescan is running, the rescan
operation is paused.  If the file system is then mounted read-only,
the flag will still be present but the rescan operation will not have
been resumed.  When we go to umount, btrfs_qgroup_wait_for_completion
will see the flag and interpret it to mean that the rescan worker is
still running and will wait for a completion that will never come.

This patch uses a separate flag to indicate when the worker is
running.  The locking and state surrounding the qgroup rescan worker
needs a lot of attention beyond this patch but this is enough to
avoid a hung umount.

Cc: <stable@vger.kernel.org> # v4.4+
Signed-off-by; Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:19 -07:00
Alex Lyakas
eecba891d3 btrfs: flush_space: treat return value of do_chunk_alloc properly
do_chunk_alloc returns 1 when it succeeds to allocate a new chunk.
But flush_space will not convert this to 0, and will also return 1.
As a result, reserve_metadata_bytes will think that flush_space failed,
and may potentially return this value "1" to the caller (depends how
reserve_metadata_bytes was called). The caller will also treat this as an error.
For example, btrfs_block_rsv_refill does:

int ret = -ENOSPC;
...
ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
if (!ret) {
        block_rsv_add_bytes(block_rsv, num_bytes, 0);
        return 0;
}

return ret;

So it will return -ENOSPC.

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:18 -07:00
Liu Bo
f3bca8028b Btrfs: add ASSERT for block group's memory leak
This adds several ASSERT()' s to report memory leak of block group cache.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:17 -07:00
Qu Wenruo
d8422ba334 btrfs: backref: Fix soft lockup in __merge_refs function
When over 1000 file extents refers to one extent, find_parent_nodes()
will be obviously slow, due to the O(n^2)~O(n^3) loops inside
__merge_refs().

The following ftrace shows the cubic growth of execution time:

256 refs
 5) + 91.768 us   |  __add_keyed_refs.isra.12 [btrfs]();
 5)   1.447 us    |  __add_missing_keys.isra.13 [btrfs]();
 5) ! 114.544 us  |  __merge_refs [btrfs]();
 5) ! 136.399 us  |  __merge_refs [btrfs]();

512 refs
 6) ! 279.859 us  |  __add_keyed_refs.isra.12 [btrfs]();
 6)   3.164 us    |  __add_missing_keys.isra.13 [btrfs]();
 6) ! 442.498 us  |  __merge_refs [btrfs]();
 6) # 2091.073 us |  __merge_refs [btrfs]();

and 1024 refs
 7) ! 368.683 us  |  __add_keyed_refs.isra.12 [btrfs]();
 7)   4.810 us    |  __add_missing_keys.isra.13 [btrfs]();
 7) # 2043.428 us |  __merge_refs [btrfs]();
 7) * 18964.23 us |  __merge_refs [btrfs]();

And sort them into the following char:
(Unit: us)
------------------------------------------------------------------------
 Trace function        | 256 ref        | 512 refs      | 1024 refs    |
------------------------------------------------------------------------
 __add_keyed_refs      | 91             | 249           | 368          |
 __add_missing_keys    | 1              | 3             | 4            |
 __merge_refs 1st call | 114            | 442           | 2043         |
 __merge_refs 2nd call | 136            | 2091          | 18964        |
------------------------------------------------------------------------

We can see the that __add_keyed_refs() grows almost in linear behavior.
And __add_missing_keys() in this case doesn't change much or takes much
time.

While for the 1st __merge_refs() it's square growth
for the 2nd __merge_refs() call it's cubic growth.

It's no doubt that merge_refs() will take a long long time to execute if
the number of refs continues its grows.

So add a cond_resced() into the loop of __merge_refs().

Although this will solve the problem of soft lockup, we need to use the
new rb_tree based structure introduced by Lu Fengqi to really solve the
long execution time.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:16 -07:00
Liu Bo
1c1ea4f781 Btrfs: fix memory leak of reloc_root
When some critical errors occur and FS would be flipped into RO,
if we have an on-going balance, we can end up with a memory leak
of root->reloc_root since btrfs_drop_snapshots() bails out
without freeing reloc_root at the very early start.

However, we're not able to free reloc_root in btrfs_drop_snapshots()
because its caller, merge_reloc_roots(), still needs to access it to
cleanup reloc_root's rbtree.

This makes us free reloc_root when we're going to free fs/file roots.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:15 -07:00
Linus Torvalds
94ef71a99a This pull requests contains fixes for two issues in UBI and UBIFS:
1. Wrong UBIFS assertion.
 2. A UBIFS xattr regression.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJXvflzAAoJEEtJtSqsAOnWs8QP/jEgGY5QcuvdIA+ymFFFeZ8o
 /8YwzbLula+M5T7trMSaRmT5AW5iwY/Xu/VVpVipKMVkAP7079jLjJljOckwriut
 FY60BoQ8VcxwFPRn5xMDJ6KdDMAzVFX10j/+h71VrlE6Ej4nu8XVYGYiRdnjTYiF
 JdxvuWgIDmycRT6bH69c5ZSNpMuOPpCydX0CbWEFk9P07BKL2+9inpPBGRJxy8y8
 abT4ByCJmZWjruzjeBrR4o9A2hrDTrlHPH2RzQgJXCDKntM8AjsCCCReHbhrCKLo
 QmZh+8l8N8HN4GQczcrbTSL0EZn3IsbAS6Ut03NOPcSb+kWjaH5Hcr2kEUEFA7R4
 myKfFe6/BorgHX4QTqNiX7r0y63YepcIFAIUnfzv4wya8p+IGAunouZ+D6e+3BPy
 ICUv3oGqDZkI/fqc5h3cU9RLF7fOvdAtqO+M/lInwFPbqv3jJoVxuC4oD7PEh/eY
 n7VNeVyYr+8JZ5MKBU3zYHRNyHDYME+wTpNGu/4fJR1Rsym523L7hka3zQDkDqs3
 xqFoln1BcRKT/kMTKubK5dLAWaRv68RuMZTPRDeSBrY5vY7jYTYg/42eXnmcOzeP
 vi8L3p8o2CSLHTjeX+Q0lHAw/Ppy/FFowUEx03huj0C+5LNtY6RvIvBtC3YABR8U
 nC92PnhRhl9gUvkDuii9
 =8pdh
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.8-rc4' of git://git.infradead.org/linux-ubifs

Pull UBIFS fixes from Richard Weinberger:
 "This pull requests contains fixes for two issues in UBI and UBIFS:

   - wrong UBIFS assertion.
   - a UBIFS xattr regression"

* tag 'upstream-4.8-rc4' of git://git.infradead.org/linux-ubifs:
  ubifs: Fix xattr generic handler usage
  ubifs: Fix assertion in layout_in_gaps()
2016-08-24 15:54:41 -04:00
Miklos Szeredi
8fba54aebb fuse: direct-io: don't dirty ITER_BVEC pages
When reading from a loop device backed by a fuse file it deadlocks on
lock_page().

This is because the page is already locked by the read() operation done on
the loop device.  In this case we don't want to either lock the page or
dirty it.

So do what fs/direct-io.c does: only dirty the page for ITER_IOVEC vectors.

Reported-by: Sheng Yang <sheng@yasker.org>
Fixes: aa4d86163e ("block: loop: switch to VFS ITER_BVEC")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.1+
Reviewed-by: Sheng Yang <sheng@yasker.org>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Tested-by: Sheng Yang <sheng@yasker.org>
Tested-by: Ashish Samant <ashish.samant@oracle.com>
2016-08-24 18:17:04 +02:00
Linus Torvalds
b059152245 Bug/regression fix
- fsmark regression
 - i_size race condition
 - wrong conditions in f2fs_move_file_range
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXvNycAAoJEEAUqH6CSFDS/UUP/A2s8O1Gxn8w7WuEwKqR9lCj
 d42luOM5DxPSeUgV0m76cLINLz13ae8o7Ywdsx8JeSHbR/03nmidVWK0F5ayMqXN
 Oc1ce40LBQyjaNgOI/yo/a6t5Rs4jZpWOBchXn3Qsd/bRbb07tEUv2/h6fkbP5P3
 LeU1oA2QBZkWWPWRvwFEHtJRN8UfC8GMrQP9ZO4wLH6N2HnFOgvUjwj8I8y2KHzP
 3DpZYHUP2SaI9DEJif10C9prORbBNdEoZd9G4wuVVBC2g7+/4deiQWbbk9z3TfQM
 o0n0GoBqlqISGaO+cH2VIr9smxY5FASLNYW6T/BadmnD7sskdNTFSbLZruabYH4n
 pTQLAI3GF0l6/t8qBCoo/LhJu3IQM6a6KeMw0cbEvu25U8UwXh9Md4Q4V4jiWr/5
 2GqSayQG8G78rIWVpvpxabx6Ab5XjT1dJMx/CHovoFUywQXti1X+NuSgmcOpJTHR
 GQkR4bi3z8wr8yM6XFpRBmYrJEZhu2E6i6Yz9MjOhgw1fdzrm1F/L4NObdjAREha
 yKZ2Bk9KCZrJyUMPH7/TB+1EdsOra89+gpUOU5ea7W0XkZQGWNLpFGQ/OFtEtM4g
 RfO+IY40mskeJ7i0wsQNbxLRc1oy1IQJzPVDw9zBtNA2QVvSr+lK+IIwKEK/IK78
 ke6IIQqJy0tlHlMai6rh
 =+niz
 -----END PGP SIGNATURE-----

Merge tag 'for-f2fs-v4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs fixes from Jaegeuk Kim:
 - fsmark regression
 - i_size race condition
 - wrong conditions in f2fs_move_file_range

* tag 'for-f2fs-v4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: avoid potential deadlock in f2fs_move_file_range
  f2fs: allow copying file range only in between regular files
  Revert "f2fs: move i_size_write in f2fs_write_end"
  Revert "f2fs: use percpu_rw_semaphore"
2016-08-23 20:24:27 -04:00
Richard Weinberger
17ce1eb0b6 ubifs: Fix xattr generic handler usage
UBIFS uses full names to work with xattrs, therefore we have to use
xattr_full_name() to obtain the xattr prefix as string.

Cc: <stable@vger.kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Fixes: 2b88fc21ca ("ubifs: Switch to generic xattr handlers")
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Tested-by: Dongsheng Yang <dongsheng081251@gmail.com>
2016-08-23 23:02:52 +02:00
Vincent Stehlé
c0082e985f ubifs: Fix assertion in layout_in_gaps()
An assertion in layout_in_gaps() verifies that the gap_lebs pointer is
below the maximum bound. When computing this maximum bound the idx_lebs
count is multiplied by sizeof(int), while C pointers arithmetic does take
into account the size of the pointed elements implicitly already. Remove
the multiplication to fix the assertion.

Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Cc: <stable@vger.kernel.org>
Signed-off-by: Vincent Stehlé <vincent.stehle@intel.com>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-08-23 23:02:40 +02:00
Benjamin Coddington
41963c10c4 pnfs/blocklayout: update last_write_offset atomically with extents
Block/SCSI layout write completion may add committable extents to the
extent tree before updating the layout's last-written byte under the inode
lock.  If a sync happens before this value is updated, then
prepare_layoutcommit may find and encode these extents which would produce
a LAYOUTCOMMIT request whose encoded extents are larger than the request's
loca_length.

Fix this by using a last-written byte value that is updated atomically with
the extent tree so that commitable extents always match.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-23 11:41:38 -04:00
Trond Myklebust
b88fa69eaa pNFS: The client must not do I/O to the DS if it's lease has expired
Ensure that the client conforms to the normative behaviour described in
RFC5661 Section 12.7.2: "If a client believes its lease has expired,
it MUST NOT send I/O to the storage device until it has validated its
lease."

So ensure that we wait for the lease to be validated before using
the layout.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v3.20+
2016-08-23 11:27:01 -04:00
Vegard Nossum
e9e5e3fae8 bdev: fix NULL pointer dereference
I got this:

    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    Dumping ftrace buffer:
       (ftrace buffer empty)
    CPU: 0 PID: 5505 Comm: syz-executor Not tainted 4.8.0-rc2+ #161
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
    task: ffff880113415940 task.stack: ffff880118350000
    RIP: 0010:[<ffffffff8172cb32>]  [<ffffffff8172cb32>] bd_mount+0x52/0xa0
    RSP: 0018:ffff880118357ca0  EFLAGS: 00010207
    RAX: dffffc0000000000 RBX: ffffffffffffffff RCX: ffffc90000bb6000
    RDX: 0000000000000018 RSI: ffffffff846d6b20 RDI: 00000000000000c7
    RBP: ffff880118357cb0 R08: ffff880115967c68 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801188211e8
    R13: ffffffff847baa20 R14: ffff8801139cb000 R15: 0000000000000080
    FS:  00007fa3ff6c0700(0000) GS:ffff88011aa00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fc1d8cc7e78 CR3: 0000000109f20000 CR4: 00000000000006f0
    DR0: 000000000000001e DR1: 000000000000001e DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
    Stack:
     ffff880112cfd6c0 ffff8801188211e8 ffff880118357cf0 ffffffff8167f207
     ffffffff816d7a1e ffff880112a413c0 ffffffff847baa20 ffff8801188211e8
     0000000000000080 ffff880112cfd6c0 ffff880118357d38 ffffffff816dce0a
    Call Trace:
     [<ffffffff8167f207>] mount_fs+0x97/0x2e0
     [<ffffffff816d7a1e>] ? alloc_vfsmnt+0x55e/0x760
     [<ffffffff816dce0a>] vfs_kern_mount+0x7a/0x300
     [<ffffffff83c3247c>] ? _raw_read_unlock+0x2c/0x50
     [<ffffffff816dfc87>] do_mount+0x3d7/0x2730
     [<ffffffff81235fd4>] ? trace_do_page_fault+0x1f4/0x3a0
     [<ffffffff816df8b0>] ? copy_mount_string+0x40/0x40
     [<ffffffff8161ea81>] ? memset+0x31/0x40
     [<ffffffff816df73e>] ? copy_mount_options+0x1ee/0x320
     [<ffffffff816e2a02>] SyS_mount+0xb2/0x120
     [<ffffffff816e2950>] ? copy_mnt_ns+0x970/0x970
     [<ffffffff81005524>] do_syscall_64+0x1c4/0x4e0
     [<ffffffff83c3282a>] entry_SYSCALL64_slow_path+0x25/0x25
    Code: 83 e8 63 1b fc ff 48 85 c0 48 89 c3 74 4c e8 56 35 d1 ff 48 8d bb c8 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 36 4c 8b a3 c8 00 00 00 48 b8 00 00 00 00 00 fc
    RIP  [<ffffffff8172cb32>] bd_mount+0x52/0xa0
     RSP <ffff880118357ca0>
    ---[ end trace 13690ad962168b98 ]---

mount_pseudo() returns ERR_PTR(), not NULL, on error.

Fixes: 3684aa7099 ("block-dev: enable writeback cgroup support")
Cc: Shaohua Li <shli@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: stable@vger.kernel.org
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-22 08:06:15 -06:00
Trond Myklebust
9a0fe86745 pNFS: Handle NFS4ERR_OLD_STATEID correctly in LAYOUTSTAT calls
We normally want to update the stateid and then retry,

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-19 16:27:31 -04:00
Linus Torvalds
a8414fa360 xfs, iomap: update for 4.8-rc3
Changes in this update
 - regression fixes for XFS changes introduce in 4.8-rc1
 	- buffer IO accounting assert failure
 	- ENOSPC block accounting reservation issue
 	- DAX IO path page cache invalidation fix
 	- rmapbt on-disk block count in agf
 	- correct classification of rmap block type when updating AGFL.
 	- iomap support for attribute fork mapping
 - regression fixes for iomap infrastructure in 4.8-rc1
 	- fiemap: honor FIEMAP_FLAG_SYNC
 	- fiemap: implement FIEMAP_FLAG_XATTR support to fix XFS regression
 	- make mark_page_accessed and pagefault_disable usage consistent with
 	  other IO paths
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXtpPqAAoJEK3oKUf0dfoduBkQAKxy6ETxDUd5OqlFdc0NlZYL
 cZYefWzaX/X+eO/SCgw9zB/HE9o0/zyCF/OcGF0Inb1uySPkERIfV5qPGmmHvqdm
 86bfV3PRkRYsoXI289ci5y64hwFFbev65ZAm6pEbFCbkAYCPBZg6w4Cg+80dG9Cp
 o8o82oUW6WINVfySpyqsrO5Uje15Bz/Dx/tD9gkhQdWWaOHQhi8C9tf0uJQgLKWN
 MC/SXDoNafDIDs5rxJB2n8Nu66i09OgoP3wk1ID8GYwHOBi1QqSGWoLZFpLdgMoi
 GJnAAzl6yolq7exZuk/1LD/Vsu4mvtuK/5hA+pRf0KBLim+BLnv7tVYCRM9BHD7m
 s2ddgk2ZW9MNlYNB4K1GQf2WfjnNb7qrzupswEBtBXArgsA6v9TXjHz2sizF04MO
 EYcHbhAl48usuBiibLqDbo9w2bsAOE4BhLReU4SUSPD1/C6Ujicx32hj/IPKbUxV
 tIiAr6zf9PThvCI+flaFN6ztWTi1rZN1NPC/boxoUYh3FvEMmJZ6WFYB9SfvlCF/
 dd8ybry8wwIswdVA7R6GqWTWOEvY70QOsDZFLiJ/nivEKLpnifTGVAy6mBSIBiqG
 HsVktXj25454j+hv+2YPkmI1Th8E59ABZvX0oam+ZdtpjDnLefkrdaTKtARpzW3L
 xNKHeXjpODATzlVaWKfC
 =h8VB
 -----END PGP SIGNATURE-----

Merge tag 'xfs-iomap-for-linus-4.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs and iomap fixes from Dave Chinner:
 "Changes in this update:

  Regression fixes for XFS changes introduce in 4.8-rc1:
   - buffer IO accounting assert failure
   - ENOSPC block accounting reservation issue
   - DAX IO path page cache invalidation fix
   - rmapbt on-disk block count in agf
   - correct classification of rmap block type when updating AGFL.
   - iomap support for attribute fork mapping

  Regression fixes for iomap infrastructure in 4.8-rc1:
   - fiemap: honor FIEMAP_FLAG_SYNC
   - fiemap: implement FIEMAP_FLAG_XATTR support to fix XFS regression
   - make mark_page_accessed and pagefault_disable usage consistent with
     other IO paths"

* tag 'xfs-iomap-for-linus-4.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: remove OWN_AG rmap when allocating a block from the AGFL
  xfs: (re-)implement FIEMAP_FLAG_XATTR
  xfs: simplify xfs_file_iomap_begin
  iomap: mark ->iomap_end as optional
  iomap: prepare iomap_fiemap for attribute mappings
  iomap: fiemap should honor the FIEMAP_FLAG_SYNC flag
  iomap: remove superflous pagefault_disable from iomap_write_actor
  iomap: remove superflous mark_page_accessed from iomap_write_actor
  xfs: store rmapbt block count in the AGF
  xfs: don't invalidate whole file on DAX read/write
  xfs: fix bogus space reservation in xfs_iomap_write_allocate
  xfs: don't assert fail on non-async buffers on ioacct decrement
2016-08-19 09:06:41 -07:00
Chao Yu
20a3d61d46 f2fs: avoid potential deadlock in f2fs_move_file_range
Thread A			Thread B
- inode_lock fileA
				- inode_lock fileB
				 - inode_lock fileA
 - inode_lock fileB

We may encounter above potential deadlock during moving file range in
concurrent scenario. This patch fixes the issue by using inode_trylock
instead.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-08-19 11:15:08 +09:00
Chao Yu
fe8494bfc8 f2fs: allow copying file range only in between regular files
Only if two input files are regular files, we allow copying data in
range of them, otherwise, deny it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-08-19 11:15:08 +09:00
Chao Yu
3024c9a1fe Revert "f2fs: move i_size_write in f2fs_write_end"
This reverts commit a2ee0a3003.

When testing with generic/032 of xfstest suit, failure message will be
reported as below:

generic/032 8s ... [failed, exit status 1] - output mismatch (see results/generic/032.out.bad)
    --- tests/generic/032.out	2015-01-11 16:52:27.643681072 +0800
    +++ results/generic/032.out.bad	2016-08-06 13:44:43.861330500 +0800
    @@ -1,5 +1,5 @@
     QA output created by 032
    -100 iterations
    -0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
    -*
    -0100000
    +1: [768..775]: unwritten
    +Unwritten extents found!
    ...
    (Run 'diff -u tests/generic/032.out results/generic/032.out.bad'  to see the entire diff)
Ran: generic/032
Failures: generic/032
Failed 1 of 1 tests

In write_end(), we should update i_size of inode before unlock page,
otherwise, we will lose newly updated data in following race condition.

Thread A			Thread B
- write_end
 - unlock page
				- writepages
				 - lock_page
				  - writepage
				  if page is out-of-range of file size,
				  we will skip writting the page.
 - update i_size

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-08-19 11:15:08 +09:00
Jaegeuk Kim
b873b798af Revert "f2fs: use percpu_rw_semaphore"
LKP reported -36.3% regression of fsmark.files_per_sec due to this patch.
I've confirmed that fxmark [1] has also slight regression for DWAL.

[1] https://github.com/sslab-gatech/fxmark

This reverts commit ec795418c4.
2016-08-19 11:15:08 +09:00
Linus Torvalds
184ca82348 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Buffers powersave frame test is reversed in cfg80211, fix from Felix
    Fietkau.

 2) Remove bogus WARN_ON in openvswitch, from Jarno Rajahalme.

 3) Fix some tg3 ethtool logic bugs, and one that would cause no
    interrupts to be generated when rx-coalescing is set to 0.  From
    Satish Baddipadige and Siva Reddy Kallam.

 4) QLCNIC mailbox corruption and napi budget handling fix from Manish
    Chopra.

 5) Fix fib_trie logic when walking the trie during /proc/net/route
    output than can access a stale node pointer.  From David Forster.

 6) Several sctp_diag fixes from Phil Sutter.

 7) PAUSE frame handling fixes in mlxsw driver from Ido Schimmel.

 8) Checksum fixup fixes in bpf from Daniel Borkmann.

 9) Memork leaks in nfnetlink, from Liping Zhang.

10) Use after free in rxrpc, from David Howells.

11) Use after free in new skb_array code of macvtap driver, from Jason
    Wang.

12) Calipso resource leak, from Colin Ian King.

13) mediatek bug fixes (missing stats sync init, etc.) from Sean Wang.

14) Fix bpf non-linear packet write helpers, from Daniel Borkmann.

15) Fix lockdep splats in macsec, from Sabrina Dubroca.

16) hv_netvsc bug fixes from Vitaly Kuznetsov, mostly to do with VF
    handling.

17) Various tc-action bug fixes, from CONG Wang.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
  net_sched: allow flushing tc police actions
  net_sched: unify the init logic for act_police
  net_sched: convert tcf_exts from list to pointer array
  net_sched: move tc offload macros to pkt_cls.h
  net_sched: fix a typo in tc_for_each_action()
  net_sched: remove an unnecessary list_del()
  net_sched: remove the leftover cleanup_a()
  mlxsw: spectrum: Allow packets to be trapped from any PG
  mlxsw: spectrum: Unmap 802.1Q FID before destroying it
  mlxsw: spectrum: Add missing rollbacks in error path
  mlxsw: reg: Fix missing op field fill-up
  mlxsw: spectrum: Trap loop-backed packets
  mlxsw: spectrum: Add missing packet traps
  mlxsw: spectrum: Mark port as active before registering it
  mlxsw: spectrum: Create PVID vPort before registering netdevice
  mlxsw: spectrum: Remove redundant errors from the code
  mlxsw: spectrum: Don't return upon error in removal path
  i40e: check for and deal with non-contiguous TCs
  ixgbe: Re-enable ability to toggle VLAN filtering
  ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths
  ...
2016-08-17 17:26:58 -07:00
Dave Chinner
32438cf9d5 Merge branch 'iomap-fixes-4.8-rc3' into for-next 2016-08-17 11:13:37 +10:00
Darrick J. Wong
a03f1a6633 xfs: remove OWN_AG rmap when allocating a block from the AGFL
When we're really tight on space, xfs_alloc_ag_vextent_small() can
allocate a block from the AGFL and give it to the caller.  Since the
caller is never the AGFL-fixing method, we must remove the OWN_AG
reverse mapping because it will clash with whatever rmap the caller
wants to set up.  This bug was discovered by running generic/299
repeatedly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17 11:12:57 +10:00
Christoph Hellwig
1d4795e7bd xfs: (re-)implement FIEMAP_FLAG_XATTR
Use a special read-only iomap_ops implementation to support fiemap on
the attr fork.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17 08:45:30 +10:00
Christoph Hellwig
b95a21271b xfs: simplify xfs_file_iomap_begin
We'll never get nimap == 0 for a successful return from xfs_bmapi_read,
so don't try to handle it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17 08:44:52 +10:00
Christoph Hellwig
f20ac7ab17 iomap: mark ->iomap_end as optional
No need to implement it for read-only mappings.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17 08:42:34 +10:00
Dave Chinner
ac2dc058bc iomap: prepare iomap_fiemap for attribute mappings
By bassing through an -ENOENT, similar to the old XFS implementation of
FIEMAP_FLAG_XATTR.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17 08:41:34 +10:00
Dave Chinner
8896b8f609 iomap: fiemap should honor the FIEMAP_FLAG_SYNC flag
The flag is checked as supported, but then we do an unconditional
sync of the file, regardless of whether the flag is set or not. Make
the sync conditional on having the FIEMAP_FLAG_SYNC flag set.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17 08:41:10 +10:00
Christoph Hellwig
274c887494 iomap: remove superflous pagefault_disable from iomap_write_actor
iov_iter_copy_from_user_atomic disables page faults internally, no need to
do it around the call.  This also brings the iomap code in line with
the original filemap version.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17 08:40:18 +10:00
Christoph Hellwig
97dd8c9ee6 iomap: remove superflous mark_page_accessed from iomap_write_actor
This catches up with commit  2457ae ("mm: non-atomically mark page
accessed during page cache allocation where possible"), which
moved the initial access marking into the pagecache allocator.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17 08:39:47 +10:00
Darrick J. Wong
f32866fdc9 xfs: store rmapbt block count in the AGF
Track the number of blocks used for the rmapbt in the AGF.  When we
get to the AG reservation code we need this counter to quickly
make our reservation during mount.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17 08:31:49 +10:00
Dave Chinner
8b2180b3bf xfs: don't invalidate whole file on DAX read/write
When we do DAX IO, we try to invalidate the entire page cache held
on the file. This is incorrect as it will trash the entire mapping
tree that now tracks dirty state in exceptional entries in the radix
tree slots.

What we are trying to do is remove cached pages (e.g from reads
into holes) that sit in the radix tree over the range we are about
to write to. Hence we should just limit the invalidation to the
range we are about to overwrite.

Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17 08:31:33 +10:00
Christoph Hellwig
0af32fb468 xfs: fix bogus space reservation in xfs_iomap_write_allocate
The space reservations was without an explaination in commit

    "Add error reporting calls in error paths that return EFSCORRUPTED"

back in 2003.  There is no reason to reserve disk blocks in the
transaction when allocating blocks for delalloc space as we already
reserved the space when creating the delalloc extent.

With this fix we stop running out of the reserved pool in
generic/229, which has happened for long time with small blocksize
file systems, and has increased in severity with the new buffered
write path.

[ dchinner: we still need to pass the block reservation into
  xfs_bmapi_write() to ensure we don't deadlock during AG selection.
  See commit dbd5c8c ("xfs: pass total block res. as total
  xfs_bmapi_write() parameter") for more details on why this is
  necessary. ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17 08:30:28 +10:00
Brian Foster
4dd3fd7197 xfs: don't assert fail on non-async buffers on ioacct decrement
The buffer I/O accounting mechanism tracks async buffers under I/O.  As
an optimization, the buffer I/O count is incremented only once on the
first async I/O for a given hold cycle of a buffer and decremented once
the buffer is released to the LRU (or freed).

xfs_buf_ioacct_dec() has an ASSERT() check for an XBF_ASYNC buffer, but
we have one or two corner cases where a buffer can be submitted for I/O
multiple times via different methods in a single hold cycle. If an async
I/O occurs first, the I/O count is incremented. If a sync I/O occurs
before the hold count drops, XBF_ASYNC is cleared by the time the I/O
count is decremented.

Remove the async assert check from xfs_buf_ioacct_dec() as this is a
perfectly valid scenario. For the purposes of I/O accounting, we really
only care about the buffer async state at I/O submission time.

Discovered-and-analyzed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17 08:30:28 +10:00
Trond Myklebust
15d03055cf pNFS/flexfiles: Set reasonable default retrans values for the data channel
Prior to this patch, the retrans value was set at 5, meaning that we
could see a maximum retransmission timeout value of more than 6 minutes.
That's a tad high for NFSv3 where the protocol does allow the server to
drop requests at any time.

Since this is a data channel, let's just set retrans to 0, and the default
timeout to 60s. The user can continue to adjust these defaults using the
dataserver_retrans and dataserver_timeo module parameters.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-16 11:16:19 -04:00
Trond Myklebust
a956beda19 NFS: Allow the mount option retrans=0
We should allow retrans=0 as just meaning that every timeout is a major
timeout, and that there is no increment in the timeout value.

For instance, this means that we would allow TCP users to specify a
flat timeout value of 60s, by specifying "timeo=600,retrans=0" in their
mount option string.

Siged-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-16 11:00:06 -04:00
Trond Myklebust
1c8d477a77 pNFS/flexfiles: Fix layoutstat periodic reporting
Putting the periodicity timer in the mirror instances is causing
non-scalable reporting behaviour and missed reporting intervals.
When you recall layouts and/or implement client side mirroring, it
leads to consecutive reports with only a few ms between RPC calls.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: d0379a5d06 ("pNFS/flexfiles: Support server-supplied...")
2016-08-14 23:01:10 -04:00
Linus Torvalds
a1e210331b Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:

 - an NVMe fix from Gabriel, fixing a suspend/resume issue on some
   setups

 - addition of a few missing entries in the block queue sysfs
   documentation, from Joe

 - a fix for a sparse shadow warning for the bvec iterator, from
   Johannes

 - a writeback deadlock involving raid issuing barriers, and not
   flushing the plug when we wakeup the flusher threads.  From
   Konstantin

 - a set of patches for the NVMe target/loop/rdma code, from Roland and
   Sagi

* 'for-linus' of git://git.kernel.dk/linux-block:
  bvec: avoid variable shadowing warning
  doc: update block/queue-sysfs.txt entries
  nvme: Suspend all queues before deletion
  mm, writeback: flush plugged IO in wakeup_flusher_threads()
  nvme-rdma: Remove unused includes
  nvme-rdma: start async event handler after reconnecting to a controller
  nvmet: Fix controller serial number inconsistency
  nvmet-rdma: Don't use the inline buffer in order to avoid allocation for small reads
  nvmet-rdma: Correctly handle RDMA device hot removal
  nvme-rdma: Make sure to shutdown the controller if we can
  nvme-loop: Remove duplicate call to nvme_remove_namespaces
  nvme-rdma: Free the I/O tags when we delete the controller
  nvme-rdma: Remove duplicate call to nvme_remove_namespaces
  nvme-rdma: Fix device removal handling
  nvme-rdma: Queue ns scanning after a sucessful reconnection
  nvme-rdma: Don't leak uninitialized memory in connect request private data
2016-08-13 09:56:45 -07:00
Linus Torvalds
b112324c2b Fixes for the dentry refcounting leak I introduced in 4.8-rc1, and for
races in the LOCK code which appear to go back to the big nfsd state
 lock removal from 3.17.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXrkCqAAoJECebzXlCjuG+vJQP/RiBMW04XXV3Pe61S7URqVIr
 USTpsXJApxGQOhU6XrEMYWGz9Ya70aMpS9MmKqXtYtlIQN6gVSmC+t8YNPIPs2oF
 1n9a9tQhX/hI1Ipe8vjWmarLH31GOhQqbAd6RdUwHWGrMyWeajBKCms9UZ1bdG42
 dXBvlH7A8aoFJUY9GXerf2b2hyz34KFJmNxSx5e70XjF3Wq4HaQCCTKU8RFJmDxd
 PVYTz/0CR3bbtRKJkDHs6jRo1Qr9PZJXVxiRhvG113XrbmVZcBqTZ4Ee/2vwjRvr
 obxzQGMO7cb/GT6Iqly1tkdMfp/miVS/gPRXXcLQRJXNDfZwoRZF/2LMgABiDn62
 WXxgd6uqnexb3AAuCSpIW1HTgWLX+YekVYHdlZBs+YsTY2Q/jDNsy3yYDzA257yT
 HaHh0oyWmiiJQ+SgOc/KI1ony3aRtF+WclKsr2vQtmC/DmRkOBXpeFTcPN9K51aa
 BhNGVnC/YgZojgJsECZ+9VaYcDZ13UGFIoSPOA0zRWZRbDLw5TIt0mgnes63s3mp
 9pTGlEe0hhjG21JNbU3vxXrWcXY5o5fXaU3oCWlQJmk/dNehbBkY4x3fj/a1S2sK
 nxd8mrUruYcXEYvdjYGef38zUcJmhY/26wq0DGczm3jcxHfCDv4/ydhHa6TTuI8A
 HUQjymCrr3sWrWHmYATR
 =d1R8
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.8-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "Fixes for the dentry refcounting leak I introduced in 4.8-rc1, and for
  races in the LOCK code which appear to go back to the big nfsd state
  lock removal from 3.17"

* tag 'nfsd-4.8-1' of git://linux-nfs.org/~bfields/linux:
  nfsd: don't return an unhashed lock stateid after taking mutex
  nfsd: Fix race between FREE_STATEID and LOCK
  nfsd: fix dentry refcounting on create
2016-08-12 16:28:41 -07:00
Jeff Layton
dd257933fa nfsd: don't return an unhashed lock stateid after taking mutex
nfsd4_lock will take the st_mutex before working with the stateid it
gets, but between the time when we drop the cl_lock and take the mutex,
the stateid could become unhashed (a'la FREE_STATEID). If that happens
the lock stateid returned to the client will be forgotten.

Fix this by first moving the st_mutex acquisition into
lookup_or_create_lock_state. Then, have it check to see if the lock
stateid is still hashed after taking the mutex. If it's not, then put
the stateid and try the find/create again.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Cc: stable@vger.kernel.org # feb9dad5 nfsd: Always lock state exclusively.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-08-12 16:10:25 -04:00
Linus Torvalds
9909170065 NFS client bugfixes for Linux 4.8
Highlights include:
 
 - Stable patch from Olga to fix RPCSEC_GSS upcalls when the same user needs
   multiple different security services (e.g. krb5i and krb5p).
 - Stable patch to fix a regression introduced by the use of SO_REUSEPORT,
   and that prevented the use of multiple different NFS versions to the
   same server.
 - TCP socket reconnection timer fixes.
 - Patch from Neil to disable the use of IPv6 temporary addresses.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXrh03AAoJEGcL54qWCgDyp4EQALwZpmYCxWJE5xSHW95Fs124
 HYM8g4LznOfs3/ohInb1ja2FaQqUy0XEk3pSjNKfyYgjuwB4qJSOpnAqoIKxJFGB
 h4582leYZOZYMMCGslS2I4zcElBYO1WjnKNyb7MpZjCHmN0AdFfIcOXd2K7eL9hM
 /poImcs5KfMGIEJqmKqMUxmJ3RjxpK3LySQAes/Y5odOiHC4SGJdGUmSeuPGTbQd
 YjFWVHRFU6kVAzPd2Jl46Sgy6SpDaVz82HodXCSY+8lklmIkbIsVqJs0VWo3WkfL
 r5WLQ3PzZvloQ7o/E9tZGiB/LEi7roa51hYsG4sleN6Kap5vwyWg0QIKjqyJdFxB
 JmFanlCMfae3zNz4cusvgu1okvMnNqO4uRXJIAKfk64k775N9ebY7TXAZUK4/UbY
 4nxCHcxygamP/k/8HYFpc4964tMaimIs9JUdojad5a3dzffwXcgEC/0HPUih9R+i
 DO/cbVtWeDkmQPLrUqFfOAbmQdyAjELrv48d5BVIst49uuCULU2LlDlVLiAvaZvq
 s2YNmr7lkHowvgaH4ShL89wuyyD14Xu5/f49oFBFNKEQay9YthQ8s3XmdZBG7Zl0
 oyA1XJjWEq3p8nvPGIqFD26w75ppUbAWLTHsyoU0YfEYrZJrF9jPxowI7WlHgfVo
 Io79x1sbgTrckjG+osAf
 =UHph
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

   - Stable patch from Olga to fix RPCSEC_GSS upcalls when the same user
     needs multiple different security services (e.g.  krb5i and krb5p).

   - Stable patch to fix a regression introduced by the use of
     SO_REUSEPORT, and that prevented the use of multiple different NFS
     versions to the same server.

   - TCP socket reconnection timer fixes.

   - Patch from Neil to disable the use of IPv6 temporary addresses"

* tag 'nfs-for-4.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: Cap the transport reconnection timer at 1/2 lease period
  NFSv4: Cleanup the setting of the nfs4 lease period
  SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout
  SUNRPC: Fix reconnection timeouts
  NFSv4.2: LAYOUTSTATS may return NFS4ERR_ADMIN/DELEG_REVOKED
  SUNRPC: disable the use of IPv6 temporary addresses.
  SUNRPC: allow for upcalls for same uid but different gss service
  SUNRPC: Fix up socket autodisconnect
  SUNRPC: Handle EADDRNOTAVAIL on connection failures
2016-08-12 12:32:24 -07:00
Linus Torvalds
4b9eaf33d8 Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "7 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm/memory_hotplug.c: initialize per_cpu_nodestats for hotadded pgdats
  mm, oom: fix uninitialized ret in task_will_free_mem()
  kasan: remove the unnecessary WARN_ONCE from quarantine.c
  mm: memcontrol: fix memcg id ref counter on swap charge move
  mm: memcontrol: fix swap counter leak on swapout from offline cgroup
  proc, meminfo: use correct helpers for calculating LRU sizes in meminfo
  mm/hugetlb: fix incorrect hugepages count during mem hotplug
2016-08-11 16:58:24 -07:00
Mel Gorman
2f95ff90b9 proc, meminfo: use correct helpers for calculating LRU sizes in meminfo
meminfo_proc_show() and si_mem_available() are using the wrong helpers
for calculating the size of the LRUs.  The user-visible impact is that
there appears to be an abnormally high number of unevictable pages.

Link: http://lkml.kernel.org/r/20160805105805.GR2799@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-11 16:58:13 -07:00
Linus Torvalds
3b3ce01a57 A patch for a NULL dereference bug introduced in 4.8-rc1 and a handful
of static checker fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXrHadAAoJEEp/3jgCEfOLY18H/0c13lLrwfOD2GWdtZ4Hxt8A
 JmLJtplRxnRd1ZpeXPsIXFhQVs0L8COK1diq51rV7xBYzlYzwQ4y3aRapi2YX9Lq
 5Ap8Cl91eVwvTETDp7uS7pFwPju7pnLgHEBstNG56H8sD9drjgIPanhdwDeg04iG
 3hl9NLHPwdMfBQhKMh8y6/ggBX6ErtIZIPY07zUlRvm9YiEb+aTyUHQF6K4BMWO7
 DZSrRJFfjgMk3Unc/KvKtir93PTA8J2sJxKsLKY5y79dFX/ulO724fMmIhUr6iB9
 serReW0WEfv7y3f4wiR87HuKwEkRadeq9Xzqe5TTByIbryJG+DaBAoCzedWMaWE=
 =09j2
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.8-rc2' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A patch for a NULL dereference bug introduced in 4.8-rc1 and a handful
  of static checker fixes"

* tag 'ceph-for-4.8-rc2' of https://github.com/ceph/ceph-client:
  ceph: initialize pathbase in the !dentry case in encode_caps_cb()
  rbd: nuke the 32-bit pool id check
  rbd: destroy header_oloc in rbd_dev_release()
  ceph: fix null pointer dereference in ceph_flush_snaps()
  libceph: using kfree_rcu() to simplify the code
  libceph: make cancel_generic_request() static
  libceph: fix return value check in alloc_msg_with_page_vector()
2016-08-11 13:53:34 -07:00
Chuck Lever
42691398be nfsd: Fix race between FREE_STATEID and LOCK
When running LTP's nfslock01 test, the Linux client can send a LOCK
and a FREE_STATEID request at the same time. The outcome is:

Frame 324    R OPEN stateid [2,O]

Frame 115004 C LOCK lockowner_is_new stateid [2,O] offset 672000 len 64
Frame 115008 R LOCK stateid [1,L]
Frame 115012 C WRITE stateid [0,L] offset 672000 len 64
Frame 115016 R WRITE NFS4_OK
Frame 115019 C LOCKU stateid [1,L] offset 672000 len 64
Frame 115022 R LOCKU NFS4_OK
Frame 115025 C FREE_STATEID stateid [2,L]
Frame 115026 C LOCK lockowner_is_new stateid [2,O] offset 672128 len 64
Frame 115029 R FREE_STATEID NFS4_OK
Frame 115030 R LOCK stateid [3,L]
Frame 115034 C WRITE stateid [0,L] offset 672128 len 64
Frame 115038 R WRITE NFS4ERR_BAD_STATEID

In other words, the server returns stateid L in a successful LOCK
reply, but it has already released it. Subsequent uses of stateid L
fail.

To address this, protect the generation check in nfsd4_free_stateid
with the st_mutex. This should guarantee that only one of two
outcomes occurs: either LOCK returns a fresh valid stateid, or
FREE_STATEID returns NFS4ERR_LOCKS_HELD.

Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Fix-suggested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-08-11 15:08:39 -04:00
Jan Kara
2e81a4eeed ext4: avoid deadlock when expanding inode size
When we need to move xattrs into external xattr block, we call
ext4_xattr_block_set() from ext4_expand_extra_isize_ea(). That may end
up calling ext4_mark_inode_dirty() again which will recurse back into
the inode expansion code leading to deadlocks.

Protect from recursion using EXT4_STATE_NO_EXPAND inode flag and move
its management into ext4_expand_extra_isize_ea() since its manipulation
is safe there (due to xattr_sem) from possible races with
ext4_xattr_set_handle() which plays with it as well.

CC: stable@vger.kernel.org   # 4.4.x
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-11 12:38:55 -04:00
Jan Kara
443a8c41cd ext4: properly align shifted xattrs when expanding inodes
We did not count with the padding of xattr value when computing desired
shift of xattrs in the inode when expanding i_extra_isize. As a result
we could create unaligned start of inline xattrs. Account for alignment
properly.

CC: stable@vger.kernel.org  # 4.4.x-
Signed-off-by: Jan Kara <jack@suse.cz>
2016-08-11 12:00:01 -04:00
Jan Kara
418c12d08d ext4: fix xattr shifting when expanding inodes part 2
When multiple xattrs need to be moved out of inode, we did not properly
recompute total size of xattr headers in the inode and the new header
position. Thus when moving the second and further xattr we asked
ext4_xattr_shift_entries() to move too much and from the wrong place,
resulting in possible xattr value corruption or general memory
corruption.

CC: stable@vger.kernel.org  # 4.4.x
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-11 11:58:32 -04:00
Jan Kara
d0141191a2 ext4: fix xattr shifting when expanding inodes
The code in ext4_expand_extra_isize_ea() treated new_extra_isize
argument sometimes as the desired target i_extra_isize and sometimes as
the amount by which we need to grow current i_extra_isize. These happen
to coincide when i_extra_isize is 0 which used to be the common case and
so nobody noticed this until recently when we added i_projid to the
inode and so i_extra_isize now needs to grow from 28 to 32 bytes.

The result of these bugs was that we sometimes unnecessarily decided to
move xattrs out of inode even if there was enough space and we often
ended up corrupting in-inode xattrs because arguments to
ext4_xattr_shift_entries() were just wrong. This could demonstrate
itself as BUG_ON in ext4_xattr_shift_entries() triggering.

Fix the problem by introducing new isize_diff variable and use it where
appropriate.

CC: stable@vger.kernel.org   # 4.4.x
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-11 11:50:30 -04:00
Josef Bacik
502aa0a5be nfsd: fix dentry refcounting on create
b44061d0b9 introduced a dentry ref counting bug.  Previously we were
grabbing one ref to dchild in nfsd_create(), but with the creation of
nfsd_create_locked() we have a ref for dchild from the lookup in
nfsd_create(), and then another ref in nfsd_create_locked().  The ref
from the lookup in nfsd_create() is never dropped and results in
dentries still in use at unmount.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Fixes: b44061d0b9 "nfsd: reorganize nfsd_create"
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-08-11 11:42:08 -04:00
Linus Torvalds
9512c47ec2 Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Some fixes for btrfs send/recv and fsync from Filipe and Robbie Ko.

  Bonus points to Filipe for already having xfstests in place for many
  of these"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: remove unused function btrfs_add_delayed_qgroup_reserve()
  Btrfs: improve performance on fsync against new inode after rename/unlink
  Btrfs: be more precise on errors when getting an inode from disk
  Btrfs: send, don't bug on inconsistent snapshots
  Btrfs: send, avoid incorrect leaf accesses when sending utimes operations
  Btrfs: send, fix invalid leaf accesses due to incorrect utimes operations
  Btrfs: send, fix warning due to late freeing of orphan_dir_info structures
  Btrfs: incremental send, fix premature rmdir operations
  Btrfs: incremental send, fix invalid paths for rename operations
  Btrfs: send, add missing error check for calls to path_loop()
  Btrfs: send, fix failure to move directories with the same name around
  Btrfs: add missing check for writeback errors on fsync
2016-08-10 11:16:03 -07:00
Konstantin Khlebnikov
51350ea0d7 mm, writeback: flush plugged IO in wakeup_flusher_threads()
I've found funny live-lock between raid10 barriers during resync and
memory controller hard limits. Inside mpage_readpages() task holds on to
its plug bio which blocks the barrier in raid10. Its memory cgroup have
no free memory thus the task goes into reclaimer but all reclaimable
pages are dirty and cannot be written because raid10 is rebuilding and
stuck on the barrier.

Common flush of such IO in schedule() never happens, because the caller
doesn't go to sleep.

Lock is 'live' because changing memory limit or killing tasks which
holds that stuck bio unblock whole progress.

That was what happened in 3.18.x but I see no difference in upstream
logic.  Theoretically this might happen even without memory cgroup.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-09 19:58:06 -06:00
Vladimir Davydov
c4159a75b6 mm: memcontrol: only mark charged pages with PageKmemcg
To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
which sets page->_mapcount to -512.  Currently, we set/clear PageKmemcg
in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated
with __GFP_ACCOUNT, including those that aren't actually charged to any
cgroup, i.e. allocated from the root cgroup context.  To avoid overhead
in case cgroups are not used, we only do that if memcg_kmem_enabled() is
true.  The latter is set iff there are kmem-enabled memory cgroups
(online or offline).  The root cgroup is not considered kmem-enabled.

As a result, if a page is allocated with __GFP_ACCOUNT for the root
cgroup when there are kmem-enabled memory cgroups and is freed after all
kmem-enabled memory cgroups were removed, e.g.

  # no memory cgroups has been created yet, create one
  mkdir /sys/fs/cgroup/memory/test
  # run something allocating pages with __GFP_ACCOUNT, e.g.
  # a program using pipe
  dmesg | tail
  # remove the memory cgroup
  rmdir /sys/fs/cgroup/memory/test

we'll get bad page state bug complaining about page->_mapcount != -1:

  BUG: Bad page state in process swapper/0  pfn:1fd945c
  page:ffffea007f651700 count:0 mapcount:-511 mapping:          (null) index:0x0
  flags: 0x1000000000000000()

To avoid that, let's mark with PageKmemcg only those pages that are
actually charged to and hence pin a non-root memory cgroup.

Fixes: 4949148ad4 ("mm: charge/uncharge kmemcg from generic page allocator paths")
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-09 10:14:10 -07:00
Ilya Dryomov
4eacd4cb3a ceph: initialize pathbase in the !dentry case in encode_caps_cb()
pathbase is the base inode; set it to 0 if we've got no path.

Coverity-id: 146348
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2016-08-09 17:26:56 +02:00
Yan, Zheng
e4d2b16a44 ceph: fix null pointer dereference in ceph_flush_snaps()
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-08-08 21:41:43 +02:00
Miklos Szeredi
0956254a2d ovl: don't copy up opaqueness
When a copy up of a directory occurs which has the opaque xattr set, the
xattr remains in the upper directory. The immediate behavior with overlayfs
is that the upper directory is not treated as opaque, however after a
remount the opaque flag is used and upper directory is treated as opaque.
This causes files created in the lower layer to be hidden when using
multiple lower directories.

Fix by not copying up the opaque flag.

To reproduce:

 ----8<---------8<---------8<---------8<---------8<---------8<----
mkdir -p l/d/s u v w mnt
mount -t overlay overlay -olowerdir=l,upperdir=u,workdir=w mnt
rm -rf mnt/d/
mkdir -p mnt/d/n
umount mnt
mount -t overlay overlay -olowerdir=u:l,upperdir=v,workdir=w mnt
touch mnt/d/foo
umount mnt
mount -t overlay overlay -olowerdir=u:l,upperdir=v,workdir=w mnt
ls mnt/d
 ----8<---------8<---------8<---------8<---------8<---------8<----
 
output should be:  "foo  n"

Reported-by: Derek McGowan <dmcg@drizz.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=151291
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
2016-08-08 15:08:49 +02:00
Jens Axboe
1eff9d322a block: rename bio bi_rw to bi_opf
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Jens Axboe
c11f0c0b5b block/mm: make bdev_ops->rw_page() take a bool for read/write
Commit abf545484d changed it from an 'rw' flags type to the
newer ops based interface, but now we're effectively leaking
some bdev internals to the rest of the kernel. Since we only
care about whether it's a read or a write at that level, just
pass in a bool 'is_write' parameter instead.

Then we can also move op_is_write() and friends back under
CONFIG_BLOCK protection.

Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Linus Torvalds
e9d488c311 binfmt_misc for-linus on 20160727
First off, the intention of this pull is to declare that I'll be the
 binfmt_misc maintainer (mainly on the grounds of you touched it last,
 it's yours).  There's no MAINTAINERS entry, but get_maintainers.pl
 will now finger me.
 
 The update itself is to allow architecture emulation containers to
 function such that the emulation binary can be housed outside the
 container itself.  The container and fs parts both have acks from
 relevant experts.
 
 The change is user visible. To use the new feature you have to add an
 F option to your binfmt_misc configuration.  However, the existing
 tools, like systemd-binfmt work with this without modification.
 
 Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJXmW5WAAoJEAVr7HOZEZN4K1QQAKgx5MPkoTU3QKKgzaMBBnWH
 pSMdoN8BhVSwENE/YJGMEyLaRa0zmrHVtFcnH2CHQE/GoXNnaej9l3LtBIwJ9K2P
 nrv4Rlhla5BxjhDkg8IWf3iG7iKDDHGZoyuVPx4dwxHFK1yCNH4SDeHaJCKK5qsC
 aLltMJMRnjsgJvBUC01dCUlp8srkWywHcyk9M9ic/Fr5vJ6JzdUr6/Md29eHmAXe
 NgCGwkVgSDiKfnTGZjIMsAtpwPsJ6RqBWQTcTdM/mkIpqwrMiVuaVOHqu2cmMU2i
 j4cQE6rQpy3sedDKZbHBQMOfYJNT4QYgYGuvyIWce9EPkIpOWHzQ7kYPJ/A/jZCE
 lN37TeyodbUDCnyuKk1YOrTBjJ0qdtc4FXJ1aq5s92GkgDs+LtxMdGzKDf3yUGiU
 W0TsE/wVy4rmEaeiyut33661ud4vivP4WklWK1Y+bklQcIcKQKKWnOCnDFDR5vuz
 CbL5ykVcJb3F28YhGYHvGLeXl0YcR3SwngWnnPCDPtBCeSirohuKb1SEe21C/RaB
 rm9S27d+LcKCXJyCqKh8BGsqroZ0iSZQI0Lbdqt+BCuuBw2rQhGStDeccDDUp9jg
 MOwpQwabjEseK0n75+hZ2SFS5Q+TQ6pccMlUJIDiBKWmRly8NpKlSKKWvBX8obIe
 0Gq6hgX1IwQnXI1O8QMC
 =6OjN
 -----END PGP SIGNATURE-----

Merge tag 'binfmt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_misc

Pull binfmt_misc update from James Bottomley:
 "This update is to allow architecture emulation containers to function
  such that the emulation binary can be housed outside the container
  itself.  The container and fs parts both have acks from relevant
  experts.

  To use the new feature you have to add an F option to your binfmt_misc
  configuration"

From the docs:
 "The usual behaviour of binfmt_misc is to spawn the binary lazily when
  the misc format file is invoked.  However, this doesn't work very well
  in the face of mount namespaces and changeroots, so the F mode opens
  the binary as soon as the emulation is installed and uses the opened
  image to spawn the emulator, meaning it is always available once
  installed, regardless of how the environment changes"

* tag 'binfmt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_misc:
  binfmt_misc: add F option description to documentation
  binfmt_misc: add persistent opened binary handler for containers
  fs: add filp_clone_open API
2016-08-07 10:13:14 -04:00
Eryu Guan
337684a174 fs: return EPERM on immutable inode
In most cases, EPERM is returned on immutable inode, and there're only a
few places returning EACCES. I noticed this when running LTP on
overlayfs, setxattr03 failed due to unexpected EACCES on immutable
inode.

So converting all EACCES to EPERM on immutable inode.

Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-07 10:03:31 -04:00
Linus Torvalds
fe64f3283f Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Assorted cleanups and fixes.

  In the "trivial API change" department - ->d_compare() losing 'parent'
  argument"

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  cachefiles: Fix race between inactivating and culling a cache object
  9p: use clone_fid()
  9p: fix braino introduced in "9p: new helper - v9fs_parent_fid()"
  vfs: make dentry_needs_remove_privs() internal
  vfs: remove file_needs_remove_privs()
  vfs: fix deadlock in file_remove_privs() on overlayfs
  get rid of 'parent' argument of ->d_compare()
  cifs, msdos, vfat, hfs+: don't bother with parent in ->d_compare()
  affs ->d_compare(): don't bother with ->d_inode
  fold _d_rehash() and __d_rehash() together
  fold dentry_rcuwalk_invalidate() into its only remaining caller
2016-08-07 10:01:14 -04:00
Linus Torvalds
0cbbc422d5 xfs: reverse block mapping support for 4.8-rc1
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXpRdVAAoJEK3oKUf0dfod2tkP/24f1Znl9OQEPHoSZty9nXF0
 dSjOzE2lHbR4xjjuYjbn1siFnIX0A5nPPqleBYmt3gatiO+24vE1BiNWjM6Y/y7r
 3KHENRqmfSj26ha6wl/TUNaKnuFooBcQ0BaHI1IExFROitOSvZgPJPSrk29AH/Er
 OVJkaoi3N3o9mrfUpF9/M55Yi/DhQiPBYxkqcXvaqcakbL91EIj5TLZ72MJqgfje
 d6og33zxb21EDx9eIJEA0cWX4MLO2UQqFAuiJLzk2RkSAm6vRjbRJyYGG9jv81tP
 9ZX1gAw47v0qk3nPVyAgbi862ukYCYzmr1g2b4S2b0UKLXxQb8Fw8D2mRbFXl2wg
 wq0nKLg9jwsd8Yo7k8qOrUI9nl/E9Ytmj8t92Y49XvPjtsVFZREoCw3ojyjmlyZA
 9BywL5BzMHF6SsXe6LBGJpoebrxCnq5176FREBnpmH7UHM0BcWa4YSekQShwg3DW
 PFlBOxk5saz4Ktr5V3YUY+G6XgZ/AXWKlDox5+dESLIOgG0hyzbiVbPNSTQgDrnR
 m9yUJPef1NQj2JWSZbqKn7FSZDO6/IT2aeokn1KuoaDJww5HC80juyB1VThmpZnl
 QJGN6nmsYDVCLYjbT6scAzyGMYw9ZVhTM7eEk3kqAtCBf/nEyqJM+H0HYUDjfg9B
 cG5cRtZNDDkc30lFezJX
 =nXKv
 -----END PGP SIGNATURE-----

Merge tag 'xfs-rmap-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull more xfs updates from Dave Chinner:
 "This is the second part of the XFS updates for this merge cycle, and
  contains the new reverse block mapping feature for XFS.

  Reverse mapping allows us to track the owner of a specific block on
  disk precisely.  It is implemented as a set of btrees (one per
  allocation group) that track the owners of allocated extents.
  Effectively it is a "used space tree" that is updated when we allocate
  or free extents.  i.e. it is coherent with the free space btrees we
  already maintain and never overlaps with them.

  This reverse mapping infrastructure is the building block of several
  upcoming features - reflink, copy-on-write data, dedupe, online
  metadata and data scrubbing, highly accurate bad sector/data loss
  reporting to users, and significantly improved reconstruction of
  damaged and corrupted filesystems.  There's a lot of new stuff coming
  along in the next couple of cycles,a nd it all builds in the rmap
  infrastructure.

  As such, it's a huge chunk of new code with new on-disk format
  features and internal infrastructure.  It warns at mount time as an
  experimental feature and that it may eat data (as we do with all new
  on-disk features until they stabilise).  We have not released
  userspace suport for it yet - userspace support currently requires
  download from Darrick's xfsprogs repo and build from source, so the
  access to this feature is really developer/tester only at this point.
  Initial userspace support will be released at the same time kernel
  with this code in it is released.

  The new rmap enabled code regresses 3 xfstests - all are ENOSPC
  related corner cases, one of which Darrick posted a fix for a few
  hours ago.  The other two are fixed by infrastructure that is part of
  the upcoming reflink patchset.  This new ENOSPC infrastructure
  requires a on-disk format tweak required to keep mount times in
  check - we need to keep an on-disk count of allocated rmapbt blocks so
  we don't have to scan the entire btrees at mount time to count them.

  This is currently being tested and will be part of the fixes sent in
  the next week or two so users will not be exposed to this change"

* tag 'xfs-rmap-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (52 commits)
  xfs: move (and rename) the deferred bmap-free tracepoints
  xfs: collapse single use static functions
  xfs: remove unnecessary parentheses from log redo item recovery functions
  xfs: remove the extents array from the rmap update done log item
  xfs: in btree_lshift, only allocate temporary cursor when needed
  xfs: remove unnecesary lshift/rshift key initialization
  xfs: remove the get*keys and update_keys btree ops pointers
  xfs: enable the rmap btree functionality
  xfs: don't update rmapbt when fixing agfl
  xfs: disable XFS_IOC_SWAPEXT when rmap btree is enabled
  xfs: add rmap btree block detection to log recovery
  xfs: add rmap btree geometry feature flag
  xfs: propagate bmap updates to rmapbt
  xfs: enable the xfs_defer mechanism to process rmaps to update
  xfs: log rmap intent items
  xfs: create rmap update intent log items
  xfs: add rmap btree insert and delete helpers
  xfs: convert unwritten status of reverse mappings
  xfs: remove an extent from the rmap btree
  xfs: add an extent to the rmap btree
  ...
2016-08-06 09:50:36 -04:00
Linus Torvalds
835c92d43b Merge branch 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull qstr constification updates from Al Viro:
 "Fairly self-contained bunch - surprising lot of places passes struct
  qstr * as an argument when const struct qstr * would suffice; it
  complicates analysis for no good reason.

  I'd prefer to feed that separately from the assorted fixes (those are
  in #for-linus and with somewhat trickier topology)"

* 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  qstr: constify instances in adfs
  qstr: constify instances in lustre
  qstr: constify instances in f2fs
  qstr: constify instances in ext2
  qstr: constify instances in vfat
  qstr: constify instances in procfs
  qstr: constify instances in fuse
  qstr constify instances in fs/dcache.c
  qstr: constify instances in nfs
  qstr: constify instances in ocfs2
  qstr: constify instances in autofs4
  qstr: constify instances in hfs
  qstr: constify instances in hfsplus
  qstr: constify instances in logfs
  qstr: constify dentry_init_security
2016-08-06 09:49:02 -04:00
David Howells
372ee16386 rxrpc: Fix races between skb free, ACK generation and replying
Inside the kafs filesystem it is possible to occasionally have a call
processed and terminated before we've had a chance to check whether we need
to clean up the rx queue for that call because afs_send_simple_reply() ends
the call when it is done, but this is done in a workqueue item that might
happen to run to completion before afs_deliver_to_call() completes.

Further, it is possible for rxrpc_kernel_send_data() to be called to send a
reply before the last request-phase data skb is released.  The rxrpc skb
destructor is where the ACK processing is done and the call state is
advanced upon release of the last skb.  ACK generation is also deferred to
a work item because it's possible that the skb destructor is not called in
a context where kernel_sendmsg() can be invoked.

To this end, the following changes are made:

 (1) kernel_rxrpc_data_consumed() is added.  This should be called whenever
     an skb is emptied so as to crank the ACK and call states.  This does
     not release the skb, however.  kernel_rxrpc_free_skb() must now be
     called to achieve that.  These together replace
     rxrpc_kernel_data_delivered().

 (2) kernel_rxrpc_data_consumed() is wrapped by afs_data_consumed().

     This makes afs_deliver_to_call() easier to work as the skb can simply
     be discarded unconditionally here without trying to work out what the
     return value of the ->deliver() function means.

     The ->deliver() functions can, via afs_data_complete(),
     afs_transfer_reply() and afs_extract_data() mark that an skb has been
     consumed (thereby cranking the state) without the need to
     conditionally free the skb to make sure the state is correct on an
     incoming call for when the call processor tries to send the reply.

 (3) rxrpc_recvmsg() now has to call kernel_rxrpc_data_consumed() when it
     has finished with a packet and MSG_PEEK isn't set.

 (4) rxrpc_packet_destructor() no longer calls rxrpc_hard_ACK_data().

     Because of this, we no longer need to clear the destructor and put the
     call before we free the skb in cases where we don't want the ACK/call
     state to be cranked.

 (5) The ->deliver() call-type callbacks are made to return -EAGAIN rather
     than 0 if they expect more data (afs_extract_data() returns -EAGAIN to
     the delivery function already), and the caller is now responsible for
     producing an abort if that was the last packet.

 (6) There are many bits of unmarshalling code where:

 		ret = afs_extract_data(call, skb, last, ...);
		switch (ret) {
		case 0:		break;
		case -EAGAIN:	return 0;
		default:	return ret;
		}

     is to be found.  As -EAGAIN can now be passed back to the caller, we
     now just return if ret < 0:

 		ret = afs_extract_data(call, skb, last, ...);
		if (ret < 0)
			return ret;

 (7) Checks for trailing data and empty final data packets has been
     consolidated as afs_data_complete().  So:

		if (skb->len > 0)
			return -EBADMSG;
		if (!last)
			return 0;

     becomes:

		ret = afs_data_complete(call, skb, last);
		if (ret < 0)
			return ret;

 (8) afs_transfer_reply() now checks the amount of data it has against the
     amount of data desired and the amount of data in the skb and returns
     an error to induce an abort if we don't get exactly what we want.

Without these changes, the following oops can occasionally be observed,
particularly if some printks are inserted into the delivery path:

general protection fault: 0000 [#1] SMP
Modules linked in: kafs(E) af_rxrpc(E) [last unloaded: af_rxrpc]
CPU: 0 PID: 1305 Comm: kworker/u8:3 Tainted: G            E   4.7.0-fsdevel+ #1303
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Workqueue: kafsd afs_async_workfn [kafs]
task: ffff88040be041c0 ti: ffff88040c070000 task.ti: ffff88040c070000
RIP: 0010:[<ffffffff8108fd3c>]  [<ffffffff8108fd3c>] __lock_acquire+0xcf/0x15a1
RSP: 0018:ffff88040c073bc0  EFLAGS: 00010002
RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000000000 RCX: ffff88040d29a710
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88040d29a710
RBP: ffff88040c073c70 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff88040be041c0 R15: ffffffff814c928f
FS:  0000000000000000(0000) GS:ffff88041fa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa4595f4750 CR3: 0000000001c14000 CR4: 00000000001406f0
Stack:
 0000000000000006 000000000be04930 0000000000000000 ffff880400000000
 ffff880400000000 ffffffff8108f847 ffff88040be041c0 ffffffff81050446
 ffff8803fc08a920 ffff8803fc08a958 ffff88040be041c0 ffff88040c073c38
Call Trace:
 [<ffffffff8108f847>] ? mark_held_locks+0x5e/0x74
 [<ffffffff81050446>] ? __local_bh_enable_ip+0x9b/0xa1
 [<ffffffff8108f9ca>] ? trace_hardirqs_on_caller+0x16d/0x189
 [<ffffffff810915f4>] lock_acquire+0x122/0x1b6
 [<ffffffff810915f4>] ? lock_acquire+0x122/0x1b6
 [<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
 [<ffffffff81609dbf>] _raw_spin_lock_irqsave+0x35/0x49
 [<ffffffff814c928f>] ? skb_dequeue+0x18/0x61
 [<ffffffff814c928f>] skb_dequeue+0x18/0x61
 [<ffffffffa009aa92>] afs_deliver_to_call+0x344/0x39d [kafs]
 [<ffffffffa009ab37>] afs_process_async_call+0x4c/0xd5 [kafs]
 [<ffffffffa0099e9c>] afs_async_workfn+0xe/0x10 [kafs]
 [<ffffffff81063a3a>] process_one_work+0x29d/0x57c
 [<ffffffff81064ac2>] worker_thread+0x24a/0x385
 [<ffffffff81064878>] ? rescuer_thread+0x2d0/0x2d0
 [<ffffffff810696f5>] kthread+0xf3/0xfb
 [<ffffffff8160a6ff>] ret_from_fork+0x1f/0x40
 [<ffffffff81069602>] ? kthread_create_on_node+0x1cf/0x1cf

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-06 00:08:40 -04:00
Linus Torvalds
a02040d8d5 Fixes for pstore ramoops driver to catch bad kfree() and to use better DT
bindings.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJXpNnrAAoJEIly9N/cbcAm05QP/i9YTFAKHUYGdqi9NIldxb6/
 WiBfF4IZbrgoEB+25gAa2EEkkZPdT/MYK+Nbd4VhUxcqFM2S8gauTsHXW6x1fCfh
 36I4ul2UO3KM70/YrubAPnNjx1d1SI5mh4yRSHgAHguzFn9RE6vusSYPVVsgazpb
 yo+ZT+SwkWcv94i3Ro0sxgPog1kiN74unRaMd23Jt+FX1+Bdu5GzfruL0GtSRzVP
 3XKCQC+8E1A2pZDXSTARdVjN9feN5fNsapt37zK2urBStzy2rLNDXPVy3c/yfoE3
 6spXY+0gBlgcRr/N3AXF7UYRoR7M5zn7/t30GDSk0AGsxkoVGxcb886Z3ilO5/Y8
 4f+gB/Mjbsx3vw92EKglTdnUopH+l65GVdcKLiAqav4DqOaQsD+WRz1HrHky1bmy
 ngkeLCROiJWu9zh29aEyo9pejQQA+fcxea58WnqanWmhoNtZLbrZ4NoB2r1ltmi3
 uOAXMMh2ahB53Lx39Ft4/0VUnSPihkQ4MNSfLK6knJzK6JB3cfoI2KfPkB+TX06n
 /wg0SEgJsJ3542p1qP03539y68Q+6tli/b8bwqhfzZucK1SnWi744RTg8yo+FkX/
 QDPwXG/9HcWbRyIb2UNvzPkn5uhQzWjdRdorRKecf9oaqPNs0Il+wZ+dQW9T25ln
 VfcOSmIp0ks1UPk/D83C
 =tjDa
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore fixes from Kees Cook:
 "Fixes for pstore ramoops driver to catch bad kfree() and to use better
  DT bindings"

* tag 'pstore-v4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  ramoops: use persistent_ram_free() instead of kfree() for freeing prz
  ramoops: use DT reserved-memory bindings
2016-08-05 23:52:52 -04:00
Linus Torvalds
fff648da96 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Here's the second round of block updates for this merge window.

  It's a mix of fixes for changes that went in previously in this round,
  and fixes in general.  This pull request contains:

   - Fixes for loop from Christoph

   - A bdi vs gendisk lifetime fix from Dan, worth two cookies.

   - A blk-mq timeout fix, when on frozen queues.  From Gabriel.

   - Writeback fix from Jan, ensuring that __writeback_single_inode()
     does the right thing.

   - Fix for bio->bi_rw usage in f2fs from me.

   - Error path deadlock fix in blk-mq sysfs registration from me.

   - Floppy O_ACCMODE fix from Jiri.

   - Fix to the new bio op methods from Mike.

     One more followup will be coming here, ensuring that we don't
     propagate the block types outside of block.  That, and a rename of
     bio->bi_rw is coming right after -rc1 is cut.

   - Various little fixes"

* 'for-linus' of git://git.kernel.dk/linux-block:
  mm/block: convert rw_page users to bio op use
  loop: make do_req_filebacked more robust
  loop: don't try to use AIO for discards
  blk-mq: fix deadlock in blk_mq_register_disk() error path
  Include: blkdev: Removed duplicate 'struct request;' declaration.
  Fixup direct bi_rw modifiers
  block: fix bdi vs gendisk lifetime mismatch
  blk-mq: Allow timeouts to run while queue is freezing
  nbd: fix race in ioctl
  block: fix use-after-free in seq file
  f2fs: drop bio->bi_rw manual assignment
  block: add missing group association in bio-cloning functions
  blkcg: kill unused field nr_undestroyed_grps
  writeback: Write dirty times for WB_SYNC_ALL writeback
  floppy: fix open(O_ACCMODE) for ioctl-only open
2016-08-05 23:31:51 -04:00
Trond Myklebust
8d480326c3 NFSv4: Cap the transport reconnection timer at 1/2 lease period
We don't want to miss a lease period renewal due to the TCP connection
failing to reconnect in a timely fashion. To ensure this doesn't happen,
cap the reconnection timer so that we retry the connection attempt
at least every 1/2 lease period.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-05 19:22:22 -04:00
Trond Myklebust
fb10fb67ad NFSv4: Cleanup the setting of the nfs4 lease period
Make a helper function nfs4_set_lease_period() and have
nfs41_setup_state_renewal() and nfs4_do_fsinfo() use it.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-05 19:13:08 -04:00
Chris Mason
1083881654 Merge branch 'integration-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.8 2016-08-05 12:25:05 -07:00
Hiraku Toyooka
e976e56423 ramoops: use persistent_ram_free() instead of kfree() for freeing prz
persistent_ram_zone(=prz) structures are allocated by persistent_ram_new(),
which includes vmap() or ioremap(). But they are currently freed by
kfree(). This uses persistent_ram_free() for correct this asymmetry usage.

Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.kw@hitachi.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Seiji Aguchi <seiji.aguchi.tr@hitachi.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-08-05 11:21:46 -07:00
Kees Cook
529182e204 ramoops: use DT reserved-memory bindings
Instead of a ramoops-specific node, use a child node of /reserved-memory.
This requires that of_platform_device_create() be explicitly called
for the node, though, since "/reserved-memory" does not have its own
"compatible" property.

Suggested-by: Rob Herring <robh@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rob Herring <robh@kernel.org>
2016-08-05 11:21:36 -07:00
Trond Myklebust
206b3bb574 NFSv4.2: LAYOUTSTATS may return NFS4ERR_ADMIN/DELEG_REVOKED
We should handle those errors in the same way we handle the other
stateid errors: by invalidating the faulty layout stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-08-05 12:18:10 -04:00
Linus Torvalds
a71e36045e Highlights:
Trond made a change to the server's tcp logic that allows a fast
 	client to better take advantage of high bandwidth networks, but
 	may increase the risk that a single client could starve other
 	clients; a new sunrpc.svc_rpc_per_connection_limit parameter
 	should help mitigate this in the (hopefully unlikely) event this
 	becomes a problem in practice.
 
 	Tom Haynes added a minimal flex-layout pnfs server, which is of
 	no use in production for now--don't build it unless you're doing
 	client testing or further server development.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXo7HNAAoJECebzXlCjuG+zqUP/RxO5jZjBhNI8/ayGdDW/Jnq
 s0Fu6B+aNRV3GnugmIeI4tWNGnPyERNzFtjLKlnwaasz/oW4qBLqGbNUWC5xKARS
 erODs0hM/1aCYWwNBEc5qXP2u23HrWVuQ+B5fg42ACyliKFGq5faDRmf6XGU/1kB
 8unXGWPAiLiNZD/bWP91fYhThlLgpfHBFZ7M3G2IqmzWZTSELPzwp1bpRWt7yWQQ
 z1oYtXToycbwz3yPVk3cXtaoqpjDUVZf2Guqgqi1BwEyEtYOSaYo1VHNsKDf4OId
 QXQh64AqIK4uszpvtNhvsEaAECN7IiB+N4n2laFiQVmAf8Hfl3AnV/gKeD4lKmTj
 TY6knnjZO/X88wn80MB7JR1H1WXvvzNIHwNR95qfub/lVKX+C+0AORRtYhi5F9ec
 ixNs/z1ImLpYxAjiP/T5anD5xcX2S+LcSv7kRjhEufqNFtRAIqBZO9ZWbCdXAAyE
 tcH9Cru4jeIlFO/y6O61EVrn9FFj2+0uu+7urefNRQ2Y9pmKeculJrLF6WO8WHms
 4IzXMmjZK+358RVdX2Ji5Hw6rBDvfgP+LjB8Jn8CeIiNRONEjT+2/AYQcfk61aLb
 INUbk6G6Vfd8iMO4aaRI9tmW+vKCOZa0IbnrNE1oHKp/AKBDr25i5YPSCsnl3r4Q
 iR7rRe9FIkfqBpbfjVFv
 =mo54
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.8' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Highlights:

   - Trond made a change to the server's tcp logic that allows a fast
     client to better take advantage of high bandwidth networks, but may
     increase the risk that a single client could starve other clients;
     a new sunrpc.svc_rpc_per_connection_limit parameter should help
     mitigate this in the (hopefully unlikely) event this becomes a
     problem in practice.

   - Tom Haynes added a minimal flex-layout pnfs server, which is of no
     use in production for now--don't build it unless you're doing
     client testing or further server development"

* tag 'nfsd-4.8' of git://linux-nfs.org/~bfields/linux: (32 commits)
  nfsd: remove some dead code in nfsd_create_locked()
  nfsd: drop unnecessary MAY_EXEC check from create
  nfsd: clean up bad-type check in nfsd_create_locked
  nfsd: remove unnecessary positive-dentry check
  nfsd: reorganize nfsd_create
  nfsd: check d_can_lookup in fh_verify of directories
  nfsd: remove redundant zero-length check from create
  nfsd: Make creates return EEXIST instead of EACCES
  SUNRPC: Detect immediate closure of accepted sockets
  SUNRPC: accept() may return sockets that are still in SYN_RECV
  nfsd: allow nfsd to advertise multiple layout types
  nfsd: Close race between nfsd4_release_lockowner and nfsd4_lock
  nfsd/blocklayout: Make sure calculate signature/designator length aligned
  xfs: abstract block export operations from nfsd layouts
  SUNRPC: Remove unused callback xpo_adjust_wspace()
  SUNRPC: Change TCP socket space reservation
  SUNRPC: Add a server side per-connection limit
  SUNRPC: Micro optimisation for svc_data_ready
  SUNRPC: Call the default socket callbacks instead of open coding
  SUNRPC: lock the socket while detaching it
  ...
2016-08-04 19:59:06 -04:00
Linus Torvalds
d58b0d980f Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull more btrfs updates from Chris Mason:
 "This is part two of my btrfs pull, which is some cleanups and a batch
  of fixes.

  Most of the code here is from Jeff Mahoney, making the pointers we
  pass around internally more consistent and less confusing overall.  I
  noticed a small problem right before I sent this out yesterday, so I
  fixed it up and re-tested overnight"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (40 commits)
  Btrfs: fix __MAX_CSUM_ITEMS
  btrfs: btrfs_abort_transaction, drop root parameter
  btrfs: add btrfs_trans_handle->fs_info pointer
  btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction
  btrfs: convert nodesize macros to static inlines
  btrfs: introduce BTRFS_MAX_ITEM_SIZE
  btrfs: cleanup, remove prototype for btrfs_find_root_ref
  btrfs: copy_to_sk drop unused root parameter
  btrfs: simpilify btrfs_subvol_inherit_props
  btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root
  btrfs: tests, require fs_info for root
  btrfs: tests, move initialization into tests/
  btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
  btrfs: prefix fsid to all trace events
  btrfs: plumb fs_info into btrfs_work
  btrfs: remove obsolete part of comment in statfs
  btrfs: hide test-only member under ifdef
  btrfs: Ratelimit "no csum found" info message
  btrfs: Add ratelimit to btrfs printing
  Btrfs: fix unexpected balance crash due to BUG_ON
  ...
2016-08-04 19:56:16 -04:00
Linus Torvalds
3a303258ef This pull request contains mostly cleanups and minor
improvements of UBI and UBIFS.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJXoiyMAAoJEEtJtSqsAOnWoXoP/1Q192UXeI18eezK//Y1kgv/
 Q3gFoqtOWBnw9kcY9aTHdAtPJcgsjRzCMPVbd1TBEe071xWCyKziyGalNUFKLKOR
 IZxym3uf65jhXkcch7ZtoUdMH7XcGOavPg8X47RWs5u72uTiIt6t/RRUwM1zDeaW
 YZx3FnCGwyzPygrogTbVfH132o1pzO587wrxFeaZQ30sWCLqQOk3qVyROgz2J9zm
 00TjNQEvUgfhBf2PiUvX0S5Lan/AX1aB3iEGg05fIDDsZqui698DRDx+isFEJEHf
 NWBHDBnhOObwKgutDfCk1gsfIKxzxBCxlLQG/ZaCwG4XKke8ylRc1wNffJbKrIIQ
 AYywLol3n3/WR4VvPK+4/TX/s4UOZOvSZYiaVJiSmxOCUNydNtwIewNp+aVghV/u
 qMfWsWRIPy7OXOdm3fTxzRsFtUxZaqglQ/dK24i1d8kktM0rkb1mgfKq9P0uctWq
 0ejnNHQmJyuGKYvemjBtTXUFmFktelolDOfsAl10MbYZ+OwPOYpI9FbGY/POYWuT
 Gpn/x/r2lGtP94kGYxBzSX8xTCC4SEFaMjE2sRvhWoxA8YgIydTDhz9SxCO1wz8E
 a7nPnRQ0iZfo5JW0MkLZim+YDNyBjY5ASeBXXdJH/uXlCaFjmDCDCLz5/e08DuM3
 lmmkepYwimHJIClr6d+0
 =hOxy
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.8-rc1' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS updates from Richard Weinberger:
 "This contains mostly cleanups and minor improvements of UBI and UBIFS"

* tag 'upstream-4.8-rc1' of git://git.infradead.org/linux-ubifs:
  ubi: Use bitmaps in Fastmap self-check code
  ubi: Be more paranoid while seaching for the most recent Fastmap
  ubi: Check whether the Fastmap anchor matches the super block
  ubi: Rework Fastmap attach base code
  ubi: Fix whitespace issue in count_fastmap_pebs()
  ubi: Introduce vol_ignored()
  ubi: Fix scan_fast() comment
  ubifs: switch_gc_head: Remove redondant sync of wbuf
  ubi: Make volume resize power cut aware
  ubi: Fix early logging
  ubi: gluebi: Fix double refcounting
  ubifs: Silence early error messages if MS_SILENT is set
  ubi: Fix race condition between ubi device creation and udev
  ubifs: Update comment for ubifs_errc
  ubi: Only read necessary size when reading the VID header
  ubifs: Make xattr structures static
  ubifs: Silence error output if MS_SILENT is set
2016-08-04 19:51:49 -04:00
Linus Torvalds
9e0243db61 Merge branch 'for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
Pull UML updates from Richard Weinberger:
 "Beside of various fixes this also contains patches to enable features
  such was Kcov, kmemleak and TRACE_IRQFLAGS_SUPPORT on UML"

* 'for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
  hostfs: Freeing an ERR_PTR in hostfs_fill_sb_common()
  um: Support kcov
  um: Enable TRACE_IRQFLAGS_SUPPORT
  um: Use asm-generic/irqflags.h
  um: Fix possible deadlock in sig_handler_common()
  um: Select HAVE_DEBUG_KMEMLEAK
  um: Setup physical memory in setup_arch()
  um: Eliminate null test after alloc_bootmem
2016-08-04 19:37:59 -04:00
Linus Torvalds
8e7106a607 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
Pull m68knommu updates from Greg Ungerer:
 "This series is all about Nicolas flat format support for MMU systems.

  Traditional m68k no-MMU flat format binaries can now be run on m68k
  MMU enabled systems too.  The series includes some nice cleanups of
  the binfmt_flat code and converts it to using proper user space
  accessor functions.

  With all this in place you can boot and run a complete no-MMU flat
  format based user space on an MMU enabled system"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
  m68k: enable binfmt_flat on systems with an MMU
  binfmt_flat: allow compressed flat binary format to work on MMU systems
  binfmt_flat: add MMU-specific support
  binfmt_flat: update libraries' data segment pointer with userspace accessors
  binfmt_flat: use clear_user() rather than memset() to clear .bss
  binfmt_flat: use proper user space accessors with old relocs code
  binfmt_flat: use proper user space accessors with relocs processing code
  binfmt_flat: clean up create_flat_tables() and stack accesses
  binfmt_flat: use generic transfer_args_to_stack()
  elf_fdpic_transfer_args_to_stack(): make it generic
  binfmt_flat: prevent kernel dammage from corrupted executable headers
  binfmt_flat: convert printk invocations to their modern form
  binfmt_flat: assorted cleanups
  m68k: use same start_thread() on MMU and no-MMU
  m68k: fix file path comment
  m68k: fix bFLT executable running on MMU enabled systems
2016-08-04 18:04:44 -04:00
Dan Carpenter
2b11885921 nfsd: remove some dead code in nfsd_create_locked()
We changed this around in f135af1041f ('nfsd: reorganize nfsd_create')
so "dchild" can't be an error pointer any more.  Also, dchild can't be
NULL here (and dput would already handle this even if it was).

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-08-04 17:11:53 -04:00
J. Bruce Fields
fa08139d5e nfsd: drop unnecessary MAY_EXEC check from create
We need an fh_verify to make sure we at least have a dentry, but actual
permission checks happen later.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-08-04 17:11:52 -04:00
J. Bruce Fields
7142327449 nfsd: clean up bad-type check in nfsd_create_locked
Minor cleanup, no change in behavior.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-08-04 17:11:51 -04:00
J. Bruce Fields
d03d9fe476 nfsd: remove unnecessary positive-dentry check
vfs_{create,mkdir,mknod} each begin with a call to may_create(), which
returns EEXIST if the object already exists.

This check is therefore unnecessary.

(In the NFSv2 case, nfsd_proc_create also has such a check.  Contrary to
RFC 1094, our code seems to believe that a CREATE of an existing file
should succeed.  I'm leaving that behavior alone.)

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-08-04 17:11:50 -04:00
J. Bruce Fields
b44061d0b9 nfsd: reorganize nfsd_create
There's some odd logic in nfsd_create() that allows it to be called with
the parent directory either locked or unlocked.  The only already-locked
caller is NFSv2's nfsd_proc_create().  It's less confusing to split out
the unlocked case into a separate function which the NFSv2 code can call
directly.

Also fix some comments while we're here.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-08-04 17:11:49 -04:00
J. Bruce Fields
e75b23f9e3 nfsd: check d_can_lookup in fh_verify of directories
Create and other nfsd ops generally assume we can call lookup_one_len on
inodes with S_IFDIR set.  Al says that this assumption isn't true in
general, though it should be for the filesystem objects nfsd sees.

Add a check just to make sure our assumption isn't violated.

Remove a couple checks for i_op->lookup in create code.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-08-04 17:11:48 -04:00
J. Bruce Fields
12391d0723 nfsd: remove redundant zero-length check from create
lookup_one_len already has this check.

The only effect of this patch is to return access instead of perm in the
0-length-filename case.  I actually prefer nfserr_perm (or _inval?), but
I doubt anyone cares.

The isdotent check seems redundant too, but I worry that some client
might actually care about that strange nfserr_exist error.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-08-04 17:11:47 -04:00
Oleg Drokin
7eed34f18d nfsd: Make creates return EEXIST instead of EACCES
When doing a create (mkdir/mknod) on a name, it's worth
checking the name exists first before returning EACCES in case
the directory is not writeable by the user.
This makes return values on the client more consistent
regardless of whenever the entry there is cached in the local
cache or not.
Another positive side effect is certain programs only expect
EEXIST in that case even despite POSIX allowing any valid
error to be returned.

Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-08-04 17:11:46 -04:00
Mike Christie
abf545484d mm/block: convert rw_page users to bio op use
The rw_page users were not converted to use bio/req ops. As a result
bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
be sent down as reads.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixes: 4e1b2d52a8 ("block, fs, drivers: remove REQ_OP compat defs and related code")

Modified by me to:

1) Drop op_flags passing into ->rw_page(), as we don't use it.
2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:25:33 -06:00
Shaun Tancheff
b571bc606e Fixup direct bi_rw modifiers
bi_rw should be using bio_set_op_attrs to set bi_rw.

Signed-off-by: Shaun Tancheff <shaun@tancheff.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:19:16 -06:00
Jens Axboe
1aee6b9a7d f2fs: drop bio->bi_rw manual assignment
Merge 4fc29c1aa3 included this extra line, but it's not needed (or
useful) since we'll bio_set_op_attrs() right after to properly set
the op and flags for the bio.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:19:16 -06:00
Paolo Valente
20bd723ec6 block: add missing group association in bio-cloning functions
When a bio is cloned, the newly created bio must be associated with
the same blkcg as the original bio (if BLK_CGROUP is enabled). If
this operation is not performed, then the new bio is not associated
with any group, and the group of the current task is returned when
the group of the bio is requested.

Depending on the cloning frequency, this may cause a large
percentage of the bios belonging to a given group to be treated
as if belonging to other groups (in most cases as if belonging to
the root group). The expected group isolation may thereby be broken.

This commit adds the missing association in bio-cloning functions.

Fixes: da2f0f74cf ("Btrfs: add support for blkio controllers")
Cc: stable@vger.kernel.org # v4.3+

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Nikolay Borisov <kernel@kyup.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:19:16 -06:00
Jan Kara
dc5ff2b1d6 writeback: Write dirty times for WB_SYNC_ALL writeback
Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and
queue_io() so that inodes which have only dirty timestamps are properly
written on fsync(2) and sync(2). However there are other call sites -
most notably going through write_inode_now() - which expect inode to be
clean after WB_SYNC_ALL writeback. This is not currently true as we do
not clear I_DIRTY_TIME in __writeback_single_inode() even for
WB_SYNC_ALL writeback in all the cases. This then resulted in the
following oops because bdev_write_inode() did not clean the inode and
writeback code later stumbled over a dirty inode with detached wb.

  general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
  Modules linked in:
  CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Workqueue: writeback wb_workfn (flush-11:0)
  task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
  RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>]
  locked_inode_to_wb_and_lock_list+0xa2/0x750
  RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
  RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
  RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
  R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
  R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
  FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
  DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
  Stack:
   ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
   ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
   ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
  Call Trace:
   [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
   [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
   [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
   [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
   [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
   [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
   [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
   [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
   [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
   [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
  Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
  00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
  80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
  RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
  RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
  fs/fs-writeback.c:281
   RSP <ffff88006cdaf7d0>
  ---[ end trace 986a4d314dcb2694 ]---

Fix the problem by making sure __writeback_single_inode() writes inode
only with dirty times in WB_SYNC_ALL mode.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:19:16 -06:00
Ross Zwisler
99a01cdf9d block: remove BLK_DEV_DAX config option
The functionality for block device DAX was already removed with commit
acc93d30d7 ("Revert "block: enable dax for raw block devices"")

However, we still had a config option hanging around that was always
disabled because it depended on CONFIG_BROKEN.  This config option was
introduced in commit 03cdadb040 ("block: disable block device DAX by
default")

This change reverts that commit, removing the dead config option.

Link: http://lkml.kernel.org/r/20160729182314.6368-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-04 08:50:07 -04:00
Dan Carpenter
8a545f1851 hostfs: Freeing an ERR_PTR in hostfs_fill_sb_common()
We can't pass error pointers to kfree() or it causes an oops.

Fixes: 52b209f7b8 ('get rid of hostfs_read_inode()')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-08-04 00:18:10 +02:00
Chris Mason
42049bf60d Btrfs: fix __MAX_CSUM_ITEMS
Jeff Mahoney's cleanup commit (14a1e067b4) wasn't correct for csums on
machines where the pagesize >= metadata blocksize.

This just reverts the relevant hunks to bring the old math back.

Signed-off-by: Chris Mason <clm@fb.com>
2016-08-03 14:08:37 -07:00
David Howells
db20a8925b cachefiles: Fix race between inactivating and culling a cache object
There's a race between cachefiles_mark_object_inactive() and
cachefiles_cull():

 (1) cachefiles_cull() can't delete a backing file until the cache object
     is marked inactive, but as soon as that's the case it's fair game.

 (2) cachefiles_mark_object_inactive() marks the object as being inactive
     and *only then* reads the i_blocks on the backing inode - but
     cachefiles_cull() might've managed to delete it by this point.

Fix this by making sure cachefiles_mark_object_inactive() gets any data it
needs from the backing inode before deactivating the object.

Without this, the following oops may occur:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
IP: [<ffffffffa06c5cc1>] cachefiles_mark_object_inactive+0x61/0xb0 [cachefiles]
...
CPU: 11 PID: 527 Comm: kworker/u64:4 Tainted: G          I    ------------   3.10.0-470.el7.x86_64 #1
Hardware name: Hewlett-Packard HP Z600 Workstation/0B54h, BIOS 786G4 v03.19 03/11/2011
Workqueue: fscache_object fscache_object_work_func [fscache]
task: ffff880035edaf10 ti: ffff8800b77c0000 task.ti: ffff8800b77c0000
RIP: 0010:[<ffffffffa06c5cc1>] cachefiles_mark_object_inactive+0x61/0xb0 [cachefiles]
RSP: 0018:ffff8800b77c3d70  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8800bf6cc400 RCX: 0000000000000034
RDX: 0000000000000000 RSI: ffff880090ffc710 RDI: ffff8800bf761ef8
RBP: ffff8800b77c3d88 R08: 2000000000000000 R09: 0090ffc710000000
R10: ff51005d2ff1c400 R11: 0000000000000000 R12: ffff880090ffc600
R13: ffff8800bf6cc520 R14: ffff8800bf6cc400 R15: ffff8800bf6cc498
FS:  0000000000000000(0000) GS:ffff8800bb8c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000098 CR3: 00000000019ba000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff880090ffc600 ffff8800bf6cc400 ffff8800867df140 ffff8800b77c3db0
 ffffffffa06c48cb ffff880090ffc600 ffff880090ffc180 ffff880090ffc658
 ffff8800b77c3df0 ffffffffa085d846 ffff8800a96b8150 ffff880090ffc600
Call Trace:
 [<ffffffffa06c48cb>] cachefiles_drop_object+0x6b/0xf0 [cachefiles]
 [<ffffffffa085d846>] fscache_drop_object+0xd6/0x1e0 [fscache]
 [<ffffffffa085d615>] fscache_object_work_func+0xa5/0x200 [fscache]
 [<ffffffff810a605b>] process_one_work+0x17b/0x470
 [<ffffffff810a6e96>] worker_thread+0x126/0x410
 [<ffffffff810a6d70>] ? rescuer_thread+0x460/0x460
 [<ffffffff810ae64f>] kthread+0xcf/0xe0
 [<ffffffff810ae580>] ? kthread_create_on_node+0x140/0x140
 [<ffffffff81695418>] ret_from_fork+0x58/0x90
 [<ffffffff810ae580>] ? kthread_create_on_node+0x140/0x140

The oopsing code shows:

	callq  0xffffffff810af6a0 <wake_up_bit>
	mov    0xf8(%r12),%rax
	mov    0x30(%rax),%rax
	mov    0x98(%rax),%rax   <---- oops here
	lock add %rax,0x130(%rbx)

where this is:

	d_backing_inode(object->dentry)->i_blocks

Fixes: a5b3a80b89 (CacheFiles: Provide read-and-reset release counters for cachefilesd)
Reported-by: Jianhong Yin <jiyin@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-08-03 13:33:26 -04:00
Al Viro
8ecfb75216 Merge branch 'for-viro' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into for-linus 2016-08-03 13:31:51 -04:00
Geert Uytterhoeven
4b2e0162e4 fs/proc: Add compiler check for -Wno-override-init to support gcc < 4.2
With gcc < 4.2 (e.g. 4.1.2):

      CC      fs/proc/task_mmu.o
    cc1: error: unrecognized command line option "-Wno-override-init"

To fix this, only enable the compiler option when it is actually
supported by the compiler.

Fixes: ca52953f5f ("fs/proc/task_mmu.c: suppress compilation warnings with W=1")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-03 12:45:23 -04:00
Al Viro
7d50a29fe4 9p: use clone_fid()
in a bunch of places it cleans the things up

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-08-03 11:12:12 -04:00
Al Viro
797fc16d8f 9p: fix braino introduced in "9p: new helper - v9fs_parent_fid()"
In v9fs_vfs_rename() we need to clone the parents' fids, not just
find them.

Spotted-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-08-03 11:02:48 -04:00
Miklos Szeredi
f0fce87c36 vfs: make dentry_needs_remove_privs() internal
Only used by the vfs.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-08-03 13:57:57 +02:00
Miklos Szeredi
c1892c3776 vfs: fix deadlock in file_remove_privs() on overlayfs
file_remove_privs() is called with inode lock on file_inode(), which
proceeds to calling notify_change() on file->f_path.dentry.  Which triggers
the WARN_ON_ONCE(!inode_is_locked(inode)) in addition to deadlocking later
when ovl_setattr tries to lock the underlying inode again.

Fix this mess by not mixing the layers, but doing everything on underlying
dentry/inode.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 07a2daab49 ("ovl: Copy up underlying inode's ->i_mode to overlay inode")
Cc: <stable@vger.kernel.org>
2016-08-03 13:57:56 +02:00
Filipe Manana
e657149933 Btrfs: remove unused function btrfs_add_delayed_qgroup_reserve()
No longer used as of commit 5846a3c268 ("btrfs: qgroup: Fix a race in
delayed_ref which leads to abort trans").

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-08-03 11:02:51 +01:00
Darrick J. Wong
3481b68285 xfs: move (and rename) the deferred bmap-free tracepoints
Rename the deferred bmap-free to extent_free and make them only
trigger when we're really running deferred ops.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03 12:31:07 +10:00