Commit Graph

50070 Commits

Author SHA1 Message Date
Christoph Hellwig
36ffc6c1c0 block_dev: propagate bio_iov_iter_get_pages error in __blkdev_direct_IO
Once we move the block layer to its own status code we'll still want to
propagate the bio_iov_iter_get_pages error, so restructure __blkdev_direct_IO
to take ret into account when returning the errno.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig
d5245d7674 fs: simplify dio_bio_complete
Only read bio->bi_error once in the common path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig
4055351cdb fs: remove the unused error argument to dio_end_io()
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig
f729b66fca gfs2: remove the unused sd_log_error field
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Aleksa Sarai
5f0f187fd0 tty: add compat_ioctl callbacks
In order to avoid future divergence between fs/compat_ioctl.c and
drivers/tty/pty.c, define .compat_ioctl callbacks for the relevant
tty_operations structs. Since both pty_unix98_ioctl() and
pty_bsd_ioctl() are compatible between 32-bit and 64-bit userspace no
special translation is required.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-09 11:27:20 +02:00
Brian Foster
95989c46d2 xfs: fix spurious spin_is_locked() assert failures on non-smp kernels
The 0-day kernel test robot reports assertion failures on
!CONFIG_SMP kernels due to failed spin_is_locked() checks. As it
turns out, spin_is_locked() is hardcoded to return zero on
!CONFIG_SMP kernels and so this function cannot be relied on to
verify spinlock state in this configuration.

To avoid this problem, replace the associated asserts with lockdep
variants that do the right thing regardless of kernel configuration.
Drop the one assert that checks for an unlocked lock as there is no
suitable lockdep variant for that case. This moves the spinlock
checks from XFS debug code to lockdep, but generally provides the
same level of protection.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-08 08:23:07 -07:00
David Miller
d41519a69b crypto: Work around deallocated stack frame reference gcc bug on sparc.
On sparc, if we have an alloca() like situation, as is the case with
SHASH_DESC_ON_STACK(), we can end up referencing deallocated stack
memory.  The result can be that the value is clobbered if a trap
or interrupt arrives at just the right instruction.

It only occurs if the function ends up returning a value from that
alloca() area and that value can be placed into the return value
register using a single instruction.

For example, in lib/libcrc32c.c:crc32c() we end up with a return
sequence like:

        return  %i7+8
         lduw   [%o5+16], %o0   ! MEM[(u32 *)__shash_desc.1_10 + 16B],

%o5 holds the base of the on-stack area allocated for the shash
descriptor.  But the return released the stack frame and the
register window.

So if an interrupt arrives between 'return' and 'lduw', then
the value read at %o5+16 can be corrupted.

Add a data compiler barrier to work around this problem.  This is
exactly what the gcc fix will end up doing as well, and it absolutely
should not change the code generated for other cpus (unless gcc
on them has the same bug :-)
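
The kind of data compiler barrier described above can be sketched in userspace C. This is only an illustration under assumptions: it relies on the GCC/Clang extended inline asm extension, and the names data_barrier() and sum_scratch() are hypothetical, not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Empty asm statement that takes the pointer as an input and clobbers
 * "memory". The compiler has to assume the pointed-to data may be read
 * or written here, so it cannot move accesses to that data across this
 * point. No machine instruction is emitted. (GCC/Clang extension.)
 */
#define data_barrier(ptr) __asm__ __volatile__("" : : "r"(ptr) : "memory")

/* Hypothetical function whose return value comes from an on-stack area. */
static uint32_t sum_scratch(const uint8_t *data, size_t len)
{
	uint32_t scratch[4] = {0};
	uint32_t result;
	size_t i;

	for (i = 0; i < len; i++)
		scratch[i % 4] += data[i];

	result = scratch[0] + scratch[1] + scratch[2] + scratch[3];

	/* Keep the reads of scratch[] ahead of this point in the generated code. */
	data_barrier(scratch);
	return result;
}
```

The barrier changes nothing about the computed value; it only constrains where the compiler may place the loads relative to the function's return sequence.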

With crucial insight from Eric Sandeen.

Cc: <stable@vger.kernel.org>
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-06-08 17:36:03 +08:00
David Howells
e754eba685 rxrpc: Provide a cmsg to specify the amount of Tx data for a call
Provide a control message that can be specified on the first sendmsg() of a
client call or the first sendmsg() of a service response to indicate the
total length of the data to be transmitted for that call.

Currently, because the length of the payload of an encrypted DATA packet is
encrypted in front of the data, the packet cannot be encrypted until we
know how much data it will hold.

By specifying the length at the beginning of the transmit phase, each DATA
packet length can be set before we start loading data from userspace (where
several sendmsg() calls may contribute to a particular packet).

An error will be returned if too little or too much data is presented in
the Tx phase.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-06-07 17:15:46 +01:00
Benjamin Coddington
501e7a4689 NFSv4.2: Don't send mode again in post-EXCLUSIVE4_1 SETATTR with umask
Now that we have umask support, we shouldn't re-send the mode in a SETATTR
following an exclusive CREATE, or we risk having the same problem fixed in
commit 5334c5bdac ("NFS: Send attributes in OPEN request for
NFS4_CREATE_EXCLUSIVE4_1"), which is that files with S_ISGID will have that
bit stripped away.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: dff25ddb48 ("nfs: add support for the umask attribute")
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-06-05 12:23:15 -04:00
Christoph Hellwig
01633fd254 overlayfs: use uuid_t instead of uuid_be
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-06-05 16:59:13 +02:00
Christoph Hellwig
85787090a2 fs: switch ->s_uuid to uuid_t
For some file systems we still memcpy into it, but in various places this
already allows us to use the proper uuid helpers.  More to come..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> (Changes to IMA/EVM)
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-06-05 16:59:12 +02:00
Amir Goldstein
d905fdaaa7 xfs: use the common helper uuid_is_null()
Use the common helper uuid_is_null() and remove the xfs specific
helper uuid_is_nil().

The common helper does not check for a NULL pointer value as the
xfs helper did, but xfs code never calls the helper with a pointer
that can be NULL.
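
As a rough userspace sketch of a null-UUID check that, like the common helper described above, does not test for a NULL pointer: the type struct example_uuid and the function example_uuid_is_null() are hypothetical names for illustration, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte UUID type for illustration; not the kernel's uuid_t. */
struct example_uuid {
	uint8_t b[16];
};

/*
 * Returns true when all 16 bytes are zero. The caller must pass a valid
 * pointer: there is deliberately no NULL check here.
 */
static bool example_uuid_is_null(const struct example_uuid *u)
{
	static const struct example_uuid null_uuid; /* zero-initialized */

	return memcmp(u, &null_uuid, sizeof(*u)) == 0;
}
```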

Conform comments and warning strings to use the term 'null uuid'
instead of 'nil uuid', because this is the terminology used by
lib/uuid.c and its users. It is also the terminology used in
userspace by libuuid and xfsprogs.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
[hch: remove now unused uuid.[ch]]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-06-05 16:59:08 +02:00
Christoph Hellwig
cb0ba6cc22 xfs: remove uuid_getnodeuniq and xfs_uu_t
Opencode uuid_getnodeuniq in the only caller, and directly decode
the uuid_t representation instead of using a structure cast for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-05 16:59:07 +02:00
Christoph Hellwig
df33767d9f uuid: hoist helpers uuid_equal() and uuid_copy() from xfs
These helpers are used to compare and copy two uuid_t type objects.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
[hch: also provide the respective guid_ versions]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-06-05 16:59:04 +02:00
Christoph Hellwig
f9727a17db uuid: rename uuid types
Our "little endian" UUID really is a Wintel GUID, so rename it and its
helpers accordingly (guid_t).  The big endian UUID is the only true one, so
give it the name uuid_t.  The uuid_le and uuid_be names are retained for
now, but will hopefully go away soon.  The exception to that are the _cmp
helpers that will be replaced by better primitives ASAP and thus don't
get the new names.

Also the _to_bin helpers are named to match the better named uuid_parse
routine in userspace.

Also remove the existing typedef in XFS that's now been superseded by
the generic type name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[andy: also update the UUID_LE/UUID_BE macros including fallout]
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-05 16:58:59 +02:00
Christoph Hellwig
12ce5f8c5c nfsd: namespace-prefix uuid_parse
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-05 16:56:38 +02:00
Christoph Hellwig
b1f359f980 xfs: use uuid_be to implement the uuid_t type
Use the generic Linux definition to implement our UUID type, this will
allow using more generic infrastructure in the future.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-05 16:56:36 +02:00
Amir Goldstein
dfd7487e99 xfs: use uuid_copy() helper to abstract uuid_t
uuid_t definition is about to change.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-05 16:56:35 +02:00
Christoph Hellwig
41bb26f8db uuid,afs: move struct uuid_v1 back into afs
This essentially is a partial revert of commit ff548773
("afs: Move UUID struct to linux/uuid.h") and moves struct uuid_v1 back into
fs/afs as struct afs_uuid.  It however keeps it as big endian structure
so that we can use the normal uuid generation helpers when casting to/from
struct afs_uuid.

The V1 uuid interpretation in struct form isn't really useful to the
rest of the kernel, and not really compatible with it either, so move it
back to AFS instead of polluting the global uuid.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Howells <dhowells@redhat.com>
2017-06-05 16:56:34 +02:00
Richard Narron
239e250e4a fs/ufs: Set UFS default maximum bytes per file
This fixes a problem with reading files larger than 2GB from a UFS-2
file system:

    https://bugzilla.kernel.org/show_bug.cgi?id=195721

The incorrect UFS s_maxbytes limit became a problem as of commit
c2a9737f45 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
which started using s_maxbytes to avoid a page index overflow in
do_generic_file_read().

That caused files to be truncated on UFS-2 file systems because the
default maximum file size is 2GB (MAX_NON_LFS) and UFS didn't update it.

Here I simply increase the default to a common value used by other file
systems.

Signed-off-by: Richard Narron <comet.berkeley@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Will B <will.brokenbourgh2877@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: <stable@vger.kernel.org> # v4.9 and backports of c2a9737f45
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-04 16:33:54 -07:00
Linus Torvalds
125f42b0e2 NFS client bugfixes for Linux 4.12
Bugfixes include:
 
 - Fix a typo in commit e092693443 that breaks copy offload
 - Fix the connect error propagation in xs_tcp_setup_socket()
 - Fix a lock leak in nfs40_walk_client_list
 - Verify that pNFS requests lie within the offset range of the layout segment.

Merge tag 'nfs-for-4.12-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Bugfixes include:

   - Fix a typo in commit e092693443 ("NFS append COMMIT after
     synchronous COPY") that breaks copy offload

   - Fix the connect error propagation in xs_tcp_setup_socket()

   - Fix a lock leak in nfs40_walk_client_list

   - Verify that pNFS requests lie within the offset range of the layout
     segment"

* tag 'nfs-for-4.12-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: Mark unnecessarily extern functions as static
  SUNRPC: ensure correct error is reported by xs_tcp_setup_socket()
  NFSv4.0: Fix a lock leak in nfs40_walk_client_list
  pnfs: Fix the check for requests in range of layout segment
  xprtrdma: Delete an error message for a failed memory allocation in xprt_rdma_bc_setup()
  pNFS/flexfiles: missing error code in ff_layout_alloc_lseg()
  NFS fix COMMIT after COPY
2017-06-04 11:56:53 -07:00
Al Viro
ae2a9762d6 compat statfs: switch to copy_to_user()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-04 13:51:34 -04:00
Jan Kara
4f253e1eb6 nfs: Mark unnecessarily extern functions as static
nfs_initialise_sb() and nfs_clone_super() are declared as extern even
though they are used only in fs/nfs/super.c. Mark them as static.

Also remove explicit 'inline' directive from nfs_initialise_sb() and
leave it up to the compiler to decide whether inlining is worth it.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-06-03 16:06:38 -04:00
Linus Torvalds
f219764920 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "15 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  scripts/gdb: make lx-dmesg command work (reliably)
  mm: consider memblock reservations for deferred memory initialization sizing
  mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified
  mlock: fix mlock count can not decrease in race condition
  mm/migrate: fix refcount handling when !hugepage_migration_supported()
  dax: fix race between colliding PMD & PTE entries
  mm: avoid spurious 'bad pmd' warning messages
  mm/page_alloc.c: make sure OOM victim can try allocations with no watermarks once
  pcmcia: remove left-over %Z format
  slub/memcg: cure the brainless abuse of sysfs attributes
  initramfs: fix disabling of initramfs (and its compression)
  mm: clarify why we want kmalloc before falling backto vmallock
  frv: declare jiffies to be located in the .data section
  include/linux/gfp.h: fix ___GFP_NOLOCKDEP value
  ksm: prevent crash after write_protect_page fails
2017-06-02 15:49:46 -07:00
Ross Zwisler
e2093926a0 dax: fix race between colliding PMD & PTE entries
We currently have two related PMD vs PTE races in the DAX code.  These
can both be easily triggered by having two threads reading and writing
simultaneously to the same private mapping, with the key being that
private mapping reads can be handled with PMDs but private mapping
writes are always handled with PTEs so that we can COW.

Here is the first race:

  CPU 0					CPU 1

  (private mapping write)
  __handle_mm_fault()
    create_huge_pmd() - FALLBACK
    handle_pte_fault()
      passes check for pmd_devmap()

					(private mapping read)
					__handle_mm_fault()
					  create_huge_pmd()
					    dax_iomap_pmd_fault() inserts PMD

      dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
      			  installed in our page tables at this spot.

Here's the second race:

  CPU 0					CPU 1

  (private mapping read)
  __handle_mm_fault()
    passes check for pmd_none()
    create_huge_pmd()
      dax_iomap_pmd_fault() inserts PMD

  (private mapping write)
  __handle_mm_fault()
    create_huge_pmd() - FALLBACK
					(private mapping read)
					__handle_mm_fault()
					  passes check for pmd_none()
					  create_huge_pmd()

    handle_pte_fault()
      dax_iomap_pte_fault() inserts PTE
					    dax_iomap_pmd_fault() inserts PMD,
					       but we already have a PTE at
					       this spot.

The core of the issue is that while there is isolation between faults to
the same range in the DAX fault handlers via our DAX entry locking,
there is no isolation between faults in the code in mm/memory.c.  This
means for instance that this code in __handle_mm_fault() can run:

	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
		ret = create_huge_pmd(&vmf);

But by the time we actually get to run the fault handler called by
create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
fault has installed a normal PMD here as a parent.  This is the cause of
the 2nd race.  The first race is similar - there is the following check
in handle_pte_fault():

	} else {
		/* See comment in pte_alloc_one_map() */
		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
			return 0;

So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
will bail and retry the fault.  This is correct, but there is nothing
preventing the PMD from being installed after this check but before we
actually get to the DAX PTE fault handlers.

In my testing these races result in the following types of errors:

  BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
  BUG: non-zero nr_ptes on freeing mm: 15

Fix this issue by having the DAX fault handlers verify that it is safe
to continue their fault after they have taken an entry lock to block
other racing faults.

[ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
  Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Pawel Lebioda <pawel.lebioda@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Xiong Zhou <xzhou@redhat.com>
Cc: Eryu Guan <eguan@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-02 15:07:37 -07:00
Linus Torvalds
e6e6d07436 Changes since last update:
- Fix an unmount hang due to a race in io buffer accounting.

Merge tag 'xfs-4.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull XFS fix from Darrick Wong:
 "I've one more bugfix for you for 4.12-rc4: Fix an unmount hang due to
  a race in io buffer accounting"

* tag 'xfs-4.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: use ->b_state to fix buffer I/O accounting release race
2017-06-02 12:29:03 -07:00
Linus Torvalds
3b1e342be2 Revert patch accidentally included in the merge window pull request, and
fix a crash that was likely a result of buggy client behavior.

Merge tag 'nfsd-4.12-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "Revert patch accidentally included in the merge window pull request,
  and fix a crash that was likely a result of buggy client behavior"

* tag 'nfsd-4.12-1' of git://linux-nfs.org/~bfields/linux:
  nfsd4: fix null dereference on replay
  nfsd: Revert "nfsd: check for oversized NFSv2/v3 arguments"
2017-06-01 16:24:48 -07:00
Linus Torvalds
2f48641cfc Use designated initializers for mtk-vcodec, powerplay, amdgpu, and sgi-xp.
Use ERR_CAST() to avoid cross-structure cast in ocfs2, ntfs, and NFS.

Merge tag 'gcc-plugins-v4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull gcc-plugin prepwork from Kees Cook:
 "Use designated initializers for mtk-vcodec, powerplay, amdgpu, and
  sgi-xp. Use ERR_CAST() to avoid cross-structure cast in ocfs2, ntfs,
  and NFS.

  Christoph Hellwig recommended that I send these fixes now, rather than
  waiting for the v4.13 merge window. These are all initializer and cast
  fixes needed for the future randstruct plugin that haven't been picked
  up by the respective maintainers"

* tag 'gcc-plugins-v4.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  mtk-vcodec: Use designated initializers
  drm/amd/powerplay: Use designated initializers
  drm/amdgpu: Use designated initializers
  sgi-xp: Use designated initializers
  ocfs2: Use ERR_CAST() to avoid cross-structure cast
  ntfs: Use ERR_CAST() to avoid cross-structure cast
  NFS: Use ERR_CAST() to avoid cross-structure cast
2017-06-01 16:17:42 -07:00
Bart Van Assche
30181faae3 nfsd: Check queue type before submitting a SCSI request
Since using scsi_req() is only allowed against request queues for
which struct scsi_request is the first member of their private
request data, refuse to submit SCSI commands against a queue for
which this is not the case.

References: commit 82ed4db499 ("block: split scsi_request out of struct request")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Omar Sandoval <osandov@fb.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-01 13:10:46 -06:00
Linus Torvalds
0bb230399f Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull Reiserfs and GFS2 fixes from Jan Kara:
 "Fixes to GFS2 & Reiserfs for the fallout of the recent WRITE_FUA
  cleanup from Christoph.

  Fixes for other filesystems were already merged by respective
  maintainers."

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  reiserfs: Make flush bios explicitely sync
  gfs2: Make flush bios explicitely sync
2017-06-01 10:45:27 -07:00
Christoph Hellwig
94073ad77f fs/locks: don't mess with the address limit in compat_fcntl64
Instead write a proper compat syscall that calls common helpers.

[ jlayton: fix pointer dereferencing in fixup_compat_flock ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-06-01 11:29:07 -04:00
Jeff Mahoney
a9b3311ef3 btrfs: fix race with relocation recovery and fs_root setup
If we have to recover relocation during mount, we'll ultimately have to
evict the orphan inode.  That goes through the reservation dance, where
priority_reclaim_metadata_space and flush_space expect fs_info->fs_root
to be valid.  That's the next thing to be set up during mount, so we
crash, almost always in flush_space trying to join the transaction
but priority_reclaim_metadata_space is possible as well.  This call
path has been problematic in the past WRT whether ->fs_root is valid
yet.  Commit 957780eb27 (Btrfs: introduce ticketed enospc
infrastructure) added new users that are called in the direct path
instead of the async path that had already been worked around.

The thing is that we don't actually need the fs_root, specifically, for
anything.  We either use it to determine whether the root is the
chunk_root for use in choosing an allocation profile or as a root to pass
to btrfs_join_transaction before immediately committing it.  Anything that
isn't the chunk root works in the former case and any root works in
the latter.

A simple fix is to use a root we know will always be there: the
extent_root.

Cc: <stable@vger.kernel.org> # v4.8+
Fixes: 957780eb27 (Btrfs: introduce ticketed enospc infrastructure)
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-01 16:56:55 +02:00
Jeff Mahoney
896533a7da btrfs: fix memory leak in update_space_info failure path
If we fail to add the space_info kobject, we'll leak the memory
for the percpu counter.

Fixes: 6ab0a2029c (btrfs: publish allocation data in sysfs)
Cc: <stable@vger.kernel.org> # v3.14+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-01 16:56:31 +02:00
David Sterba
cc2b702c52 btrfs: use correct types for page indices in btrfs_page_exists_in_range
Variables start_idx and end_idx are supposed to hold a page index
derived from the file offsets. The int type is not the right one though,
offsets larger than 1 << 44 will get silently trimmed off the high bits.
(1 << 44 is 16TiB)

What can go wrong, if start is below the boundary and end gets trimmed:
- if there's a page after start, we'll find it (radix_tree_gang_lookup_slot)
- the final check "if (page->index <= end_idx)" will unexpectedly fail

The function will return false, ie. "there's no page in the range",
although there is at least one.
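
The arithmetic behind this failure can be sketched in standalone C. The sketch assumes 4 KiB pages, and every name in it is hypothetical. uint32_t stands in for the too-narrow index type so that the loss of the high bits is well defined in the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Page index kept at full width. */
static uint64_t page_index(uint64_t offset)
{
	return offset >> EXAMPLE_PAGE_SHIFT;
}

/* Page index squeezed into 32 bits: the high bits are silently dropped. */
static uint32_t page_index_trimmed(uint64_t offset)
{
	return (uint32_t)(offset >> EXAMPLE_PAGE_SHIFT);
}

/* Final range check done against a full-width end index. */
static bool in_range(uint64_t found_index, uint64_t end_offset)
{
	return found_index <= page_index(end_offset);
}

/* The same check done against a trimmed end index. */
static bool in_range_trimmed(uint64_t found_index, uint64_t end_offset)
{
	return found_index <= page_index_trimmed(end_offset);
}
```

With a range that starts one page below 1 << 44 and ends one page above it, the page found at the start has index 2^32 - 1, while the trimmed end index is 1, so the trimmed check reports that no page is in the range.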

btrfs_page_exists_in_range is used to prevent races in:

* in hole punching, where we make sure there are not pages in the
  truncated range, otherwise we'll wait for them to finish and redo
  truncation, but we're going to replace the pages with holes anyway so
  the only problem is the intermediate state

* lock_extent_direct: we want to make sure there are no pages before we
  lock and start DIO, to prevent stale data reads

For practical occurrence of the bug, there are several constraints.  The
file must be quite large, the affected range must cross the 16TiB
boundary and the internal state of the file pages and pending operations
must match.  Also, we must not have started any ordered data in the
range, otherwise we don't even reach the buggy function check.

DIO locking tries hard in several places to avoid deadlocks with
buffered IO and avoids waiting for ranges. The worst consequence seems
to be stale data read.

CC: Liu Bo <bo.li.liu@oracle.com>
CC: stable@vger.kernel.org	# 3.16+
Fixes: fc4adbff82 ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking")
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-01 16:56:17 +02:00
Kees Cook
d3762358a7 pstore: Fix format string to use %u for record id
The format string for record->id (u64) was using %lld instead of %llu.
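
A small userspace sketch of why the conversion matters for a 64-bit unsigned id: the function names are hypothetical, and the <inttypes.h> macros are used here in place of the kernel's own format specifiers. The signed variant converts the value explicitly, which on common two's-complement platforms yields a negative number for ids above INT64_MAX.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Formats the id with an unsigned conversion, so large ids stay positive. */
static int format_id(char *buf, size_t len, uint64_t id)
{
	return snprintf(buf, len, "%" PRIu64, id);
}

/*
 * Formats the same bits through a signed conversion. The cast is
 * implementation-defined for values above INT64_MAX; on common
 * platforms such ids print as negative numbers.
 */
static int format_id_signed(char *buf, size_t len, uint64_t id)
{
	return snprintf(buf, len, "%" PRId64, (int64_t)id);
}
```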

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-05-31 10:13:45 -07:00
Kees Cook
c7f3c595f6 pstore: Populate pstore record->time field
The current time will be initially available in the record->time field
for all pstore_read() and pstore_write() calls. Backends can either
update the field during read(), or use the field during write() instead
of fetching time themselves.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-05-31 10:13:44 -07:00
Kees Cook
e581ca813a pstore: Create common record initializer
In preparation for setting timestamps in the pstore core, create a common
initializer routine, instead of using static initializers.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-05-31 10:13:44 -07:00
Kees Cook
656de42e83 pstore: Avoid potential infinite loop
If a backend does not correctly iterate through its records, pstore will
get stuck loading entries. Detect this with a large record count, and
announce if we ever hit the limit. This will make future backend reading
bugs less annoying to debug. Additionally adjust the error about
pstore_mkfile() failing.
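
The general idea of bounding the loading loop can be sketched as follows. This is a simplified illustration, not the pstore code: the callback type, the function names, and the way the limit is reported are all assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical backend callback: returns > 0 for a record, <= 0 when done. */
typedef long (*example_read_fn)(void *ctx);

/*
 * Load records until the backend reports the end or max_records is
 * reached. Returns the number of records loaded; *hit_limit is set when
 * the cap stopped the loop, so the caller can announce it.
 */
static unsigned int example_load_records(example_read_fn read, void *ctx,
					 unsigned int max_records,
					 bool *hit_limit)
{
	unsigned int loaded = 0;

	*hit_limit = false;
	while (read(ctx) > 0) {
		if (++loaded >= max_records) {
			*hit_limit = true;
			break;
		}
	}
	return loaded;
}

/* A misbehaving backend that never signals the end of its records. */
static long example_stuck_backend(void *ctx)
{
	(void)ctx;
	return 1;
}

/* A well-behaved backend: ctx points to the number of records left. */
static long example_counting_backend(void *ctx)
{
	unsigned int *remaining = ctx;

	if (*remaining == 0)
		return 0;
	(*remaining)--;
	return 1;
}
```

In this sketch a backend holding exactly max_records records also sets the flag, which is acceptable when the cap is far above any realistic record count.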

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-05-31 10:13:42 -07:00
Douglas Anderson
f6525b96dd pstore: Fix leaked pstore_record in pstore_get_backend_records()
When the "if (record->size <= 0)" test is true in
pstore_get_backend_records() it's pretty clear that nobody holds a
reference to the allocated pstore_record, yet we don't free it.

Let's free it.

Fixes: 2a2b0acf76 ("pstore: Allocate records on heap instead of stack")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
2017-05-31 10:10:09 -07:00
Ankit Kumar
4a16d1cb24 pstore: Don't warn if data is uncompressed and type is not PSTORE_TYPE_DMESG
Commit 9abdcccc3d ("pstore: Extract common arguments into structure")
moved record decompression into a function. decompress_record() gets
called without checking the type and the compressed flag, so a warning
is reported if the data is uncompressed. Records of type
PSTORE_TYPE_PPC_OPAL and PSTORE_TYPE_PPC_COMMON don't contain compressed
data, and the warning gets printed as part of dmesg.

Partial dmesg log:
[   35.848914] pstore: ignored compressed record type 6
[   35.848927] pstore: ignored compressed record type 8

The above warning should not be printed, as it is known that the data won't
be compressed for the above types, and this is a valid condition.

This patch returns if the data is not compressed and prints a warning only
if the data is compressed and the type is not PSTORE_TYPE_DMESG.
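
The resulting decision can be sketched as a small pure function. The enum
names below are hypothetical stand-ins for the record types mentioned above
and do not reproduce the kernel's values; decompress_action() is likewise
an illustrative name, not the real function.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the pstore record types named above. */
enum rec_type { REC_DMESG, REC_PPC_OPAL, REC_PPC_COMMON };

/* What the caller should do with a record. */
enum rec_action { ACT_SKIP, ACT_WARN, ACT_DECOMPRESS };

static enum rec_action decompress_action(enum rec_type type, bool compressed)
{
    if (!compressed)
        return ACT_SKIP;        /* uncompressed data is valid: no warning */
    if (type != REC_DMESG)
        return ACT_WARN;        /* compressed, but not a dmesg record */
    return ACT_DECOMPRESS;      /* compressed dmesg record */
}
```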

Reported-by: Anton Blanchard <anton@au1.ibm.com>
Signed-off-by: Ankit Kumar <ankit@linux.vnet.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Fixes: 9abdcccc3d ("pstore: Extract common arguments into structure")
Cc: stable@vger.kernel.org
2017-05-31 10:09:32 -07:00
Linus Torvalds
d602fb6844 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "Fix regressions:

   - missing CONFIG_EXPORTFS dependency

   - failure if upper fs doesn't support xattr

   - bad error cleanup

  This also adds the concept of "impure" directories complementing the
  "origin" marking introduced in -rc1. Together they enable getting
  consistent st_ino and d_ino for directory listings.

  And there's a bug fix and a cleanup as well"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: filter trusted xattr for non-admin
  ovl: mark upper merge dir with type origin entries "impure"
  ovl: mark upper dir with type origin entries "impure"
  ovl: remove unused arg from ovl_lookup_temp()
  ovl: handle rename when upper doesn't support xattr
  ovl: don't fail copy-up if upper doesn't support xattr
  ovl: check on mount time if upper fs supports setting xattr
  ovl: fix creds leak in copy up error path
  ovl: select EXPORTFS
2017-05-31 08:29:02 -07:00
Brian Foster
63db7c815b xfs: use ->b_state to fix buffer I/O accounting release race
We've had user reports of unmount hangs in xfs_wait_buftarg() that
analysis shows is due to btp->bt_io_count == -1. bt_io_count
represents the count of in-flight asynchronous buffers and thus
should always be >= 0. xfs_wait_buftarg() waits for this value to
stabilize to zero in order to ensure that all untracked (with
respect to the lru) buffers have completed I/O processing before
unmount proceeds to tear down in-core data structures.

The value of -1 implies an I/O accounting decrement race. Indeed,
the fact that xfs_buf_ioacct_dec() is called from xfs_buf_rele()
(where the buffer lock is no longer held) means that bp->b_flags can
be updated from an unsafe context. While a user-level reproducer is
currently not available, some intrusive hacks to run racing buffer
lookups/ioacct/releases from multiple threads were used to
successfully manufacture this problem.

Existing callers do not expect to acquire the buffer lock from
xfs_buf_rele(). Therefore, we can not safely update ->b_flags from
this context. It turns out that we already have separate buffer
state bits and associated serialization for dealing with buffer LRU
state in the form of ->b_state and ->b_lock. Therefore, replace the
_XBF_IN_FLIGHT flag with a ->b_state variant, update the I/O
accounting wrappers appropriately and make sure they are used with
the correct locking. This ensures that buffer in-flight state can be
modified at buffer release time without racing with modifications
from a buffer lock holder.

Fixes: 9c7504aa72 ("xfs: track and serialize in-flight async buffers against unmount")
Cc: <stable@vger.kernel.org> # v4.8+
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Libor Pechacek <lpechacek@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-31 08:22:52 -07:00
Linus Torvalds
f511c0b17b "Yes, people use FOLL_FORCE ;)"
This effectively reverts commit 8ee74a91ac ("proc: try to remove use
of FOLL_FORCE entirely")

It turns out that people do depend on FOLL_FORCE for the /proc/<pid>/mem
case, and we're talking not just debuggers. Talking to the affected people, the use-cases are:

Keno Fischer:
 "We used these semantics as a hardening mechanism in the julia JIT. By
  opening /proc/self/mem and using these semantics, we could avoid
  needing RWX pages, or a dual mapping approach. We do have fallbacks to
  these other methods (though getting EIO here actually causes an assert
  in released versions - we'll updated that to make sure to take the
  fall back in that case).

  Nevertheless the /proc/self/mem approach was our favored approach
  because it a) Required an attacker to be able to execute syscalls
  which is a taller order than getting memory write and b) didn't double
  the virtual address space requirements (as a dual mapping approach
  would).

  I think in general this feature is very useful for anybody who needs
  to precisely control the execution of some other process. Various
  debuggers (gdb/lldb/rr) certainly fall into that category, but there's
  another class of such processes (wine, various emulators) which may
  want to do that kind of thing.

  Now, I suspect most of these will have the other process under ptrace
  control, so maybe allowing (same_mm || ptraced) would be ok, but at
  least for the sandbox/remote-jit use case, it would be perfectly
  reasonable to not have the jit server be a ptracer"

Robert O'Callahan:
 "We write to readonly code and data mappings via /proc/.../mem in lots
  of different situations, particularly when we're adjusting program
  state during replay to match the recorded execution.

  Like Julia, we can add workarounds, but they could be expensive."

so not only do people use FOLL_FORCE for both reads and writes, but they
use it for both the local mm and remote mm.

With these comments in mind, we likely also cannot add the "are we
actively ptracing" check either, so this keeps the new code organization
and does not do a real revert that would add back the original comment
about "Maybe we should limit FOLL_FORCE to actual ptrace users?"

Reported-by: Keno Fischer <keno@juliacomputing.com>
Reported-by: Robert O'Callahan <robert@ocallahan.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-30 12:38:59 -07:00
Jan Kara
67a7d5f561 ext4: fix fdatasync(2) after extent manipulation operations
Currently, extent manipulation operations such as hole punch, range
zeroing, or extent shifting do not record the fact that file data has
changed and thus that fdatasync(2) has work to do. As a result, if we crash
e.g. after a punch hole and fdatasync, the user can still possibly see the
punched out data after journal replay. Test generic/392 fails due to
these problems.

Fix the problem by properly marking that file data has changed in these
operations.

CC: stable@vger.kernel.org
Fixes: a4bb6b64e3
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-29 13:24:55 -04:00
Miklos Szeredi
a082c6f680 ovl: filter trusted xattr for non-admin
Filesystems filter out extended attributes in the "trusted." domain for
unprivileged callers.

Overlay calls the underlying filesystem's method with elevated privs, so we
need to do the filtering in overlayfs too.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-29 15:15:27 +02:00
Amir Goldstein
f3a1568582 ovl: mark upper merge dir with type origin entries "impure"
An upper dir is marked "impure" to let ovl_iterate() know that this
directory may contain non pure upper entries whose d_ino may need to be
read from the origin inode.

We already mark a non-merge dir "impure" when moving a non-pure child
entry inside it, to let ovl_iterate() know not to iterate the non-merge
dir directly.

Mark also a merge dir "impure" when moving a non-pure child entry inside
it and when copying up a child entry inside it.

This can be used to optimize ovl_iterate() to perform a "pure merge" of
upper and lower directories, merging the content of the directories,
without having to read d_ino from origin inodes.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-29 11:48:00 +02:00
Kees Cook
7585d12f65 ocfs2: Use ERR_CAST() to avoid cross-structure cast
When trying to propagate an error result, the error return path attempts
to retain the error, but does this with an open cast across very different
types, which the upcoming structure layout randomization plugin flags as
being potentially dangerous in the face of randomization. This is a false
positive, but what this code actually wants to do is use ERR_CAST() to
retain the error value.
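
For readers unfamiliar with the idiom, here is a userspace sketch of the
error-pointer helpers. The names ERR_PTR(), PTR_ERR(), IS_ERR() and
ERR_CAST() mirror the kernel's, but these definitions are simplified
re-implementations for illustration; struct dir_entry, struct inode_like,
find_entry() and lookup() are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified userspace versions of the kernel helpers. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
/* Carries an error value to another pointer type via void *. */
static inline void *ERR_CAST(const void *ptr) { return (void *)ptr; }

/* Two unrelated (hypothetical) structure types. */
struct dir_entry { int dummy; };
struct inode_like { int dummy; };

static struct dir_entry *find_entry(int fail)
{
    static struct dir_entry entry;

    return fail ? ERR_PTR(-EIO) : &entry;
}

static struct inode_like *lookup(int fail)
{
    static struct inode_like inode;
    struct dir_entry *de = find_entry(fail);

    if (IS_ERR(de))
        return ERR_CAST(de);    /* not: (struct inode_like *)de */
    return &inode;
}
```

The open cast and ERR_CAST() produce the same value here; the helper only
makes it explicit that an error code, not a structure, is being passed on.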

Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-05-28 10:11:49 -07:00
Kees Cook
fee2aa7538 ntfs: Use ERR_CAST() to avoid cross-structure cast
When trying to propagate an error result, the error return path attempts
to retain the error, but does this with an open cast across very different
types, which the upcoming structure layout randomization plugin flags as
being potentially dangerous in the face of randomization. This is a false
positive, but what this code actually wants to do is use ERR_CAST() to
retain the error value.

Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-05-28 10:11:48 -07:00
Kees Cook
fe3b81b446 NFS: Use ERR_CAST() to avoid cross-structure cast
When the call to nfs_devname() fails, the error path attempts to retain
the error via the mnt variable, but this requires a cast across very
different types (char * to struct vfsmount *), which the upcoming
structure layout randomization plugin flags as being potentially
dangerous in the face of randomization. This is a false positive, but
what this code actually wants to do is retain the error value, so this
patch explicitly sets it, instead of using what seems to be an
unexpected cast.

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-05-28 10:11:47 -07:00
Al Viro
4d7edbc34c nfsd_readlink(): switch to vfs_get_link()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-27 16:11:23 -04:00
Christoph Hellwig
a75d30c772 fs/locks: pass kernel struct flock to fcntl_getlk/setlk
This will make it easier to implement a sane compat fcntl syscall.

[ jlayton: fix undeclared identifiers in 32-bit fcntl64 syscall handler ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-05-27 06:07:19 -04:00
Mauro Carvalho Chehab
80b79dd0e2 fs: locks: Fix some troubles at kernel-doc comments
There are a few syntax violations that cause outputs of
a few comments to not be properly parsed in ReST format.

No functional changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-05-27 06:07:18 -04:00
Jan Kara
a056bdaae7 ext4: fix data corruption for mmap writes
mpage_submit_page() can race with another process growing i_size and
writing data via mmap to the written-back page. As mpage_submit_page()
samples i_size too early, it may happen that ext4_bio_write_page()
zeroes out too large a tail of the page and thus corrupts user data.

Fix the problem by sampling i_size only after the page has been
write-protected in page tables by clear_page_dirty_for_io() call.

Reported-by: Michael Zimmer <michael@swarm64.com>
CC: stable@vger.kernel.org
Fixes: cb20d51883
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-26 17:45:45 -04:00
Jan Kara
4f8caa60a5 ext4: fix data corruption with EXT4_GET_BLOCKS_ZERO
When ext4_map_blocks() is called with EXT4_GET_BLOCKS_ZERO to zero out
allocated blocks and these blocks are actually converted from an unwritten
extent, the following race can happen:

CPU0					CPU1

page fault				page fault
...					...
ext4_map_blocks()
  ext4_ext_map_blocks()
    ext4_ext_handle_unwritten_extents()
      ext4_ext_convert_to_initialized()
	- zero out converted extent
	ext4_zeroout_es()
	  - inserts extent as initialized in status tree

					ext4_map_blocks()
					  ext4_es_lookup_extent()
					    - finds initialized extent
					write data
  ext4_issue_zeroout()
    - zeroes out new extent overwriting data

This problem can be reproduced by generic/340 for the fallocated case
for the last block in the file.

Fix the problem by avoiding zeroing out the area we are mapping with
ext4_map_blocks() in ext4_ext_convert_to_initialized(). It is pointless
to zero out this area in the first place as the caller asked us to
convert the area to initialized because he is just going to write data
there before the transaction finishes. To achieve this we delete the
special case of zeroing out full extent as that will be handled by the
cases below zeroing only the part of the extent that needs it. We also
instruct ext4_split_extent() that the middle of extent being split
contains data so that ext4_split_extent_at() cannot zero out full extent
in case of ENOSPC.

CC: stable@vger.kernel.org
Fixes: 12735f8819
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-26 17:40:52 -04:00
Linus Torvalds
cdbe020678 Changed since last update:
- Fix indlen block reservation accounting bug when splitting delalloc extent
 - Fix warnings about unused variables that appeared in -rc1.
 - Don't spew errors when bmapping a local format directory
 - Fix an off-by-one error in a delalloc eof assertion
 - Make fsmap only return inode information for CAP_SYS_ADMIN
 - Fix a potential mount time deadlock recovering cow extents
 - Fix unaligned memory access in _btree_visit_blocks
 - Fix various SEEK_HOLE/SEEK_DATA bugs

Merge tag 'xfs-4.12-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull XFS fixes from Darrick Wong:
 "A few miscellaneous bug fixes & cleanups:

   - Fix indlen block reservation accounting bug when splitting delalloc
     extent

   - Fix warnings about unused variables that appeared in -rc1.

   - Don't spew errors when bmapping a local format directory

   - Fix an off-by-one error in a delalloc eof assertion

   - Make fsmap only return inode information for CAP_SYS_ADMIN

   - Fix a potential mount time deadlock recovering cow extents

   - Fix unaligned memory access in _btree_visit_blocks

   - Fix various SEEK_HOLE/SEEK_DATA bugs"

* tag 'xfs-4.12-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: Move handling of missing page into one place in xfs_find_get_desired_pgoff()
  xfs: Fix off-by-in in loop termination in xfs_find_get_desired_pgoff()
  xfs: Fix missed holes in SEEK_HOLE implementation
  xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()
  xfs: fix unaligned access in xfs_btree_visit_blocks
  xfs: avoid mount-time deadlock in CoW extent recovery
  xfs: only return detailed fsmap info if the caller has CAP_SYS_ADMIN
  xfs: bad assertion for delalloc an extent that start at i_size
  xfs: fix warnings about unused stack variables
  xfs: BMAPX shouldn't barf on inline-format directories
  xfs: fix indlen accounting error on partial delalloc conversion
2017-05-26 12:13:08 -07:00
Al Viro
8d1a81a852 sanitize do_i2c_smbus_ioctl()
no need to mess with __copy_in_user()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-25 17:52:59 -04:00
Jan Kara
a54fba8f5a xfs: Move handling of missing page into one place in xfs_find_get_desired_pgoff()
Currently several places in xfs_find_get_desired_pgoff() handle the case
of a missing page. Make them all handled in one place after the loop has
terminated.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25 09:42:25 -07:00
Jan Kara
d7fd24257a xfs: Fix off-by-in in loop termination in xfs_find_get_desired_pgoff()
There is an off-by-one error in loop termination conditions in
xfs_find_get_desired_pgoff() since 'end' may index a page beyond end of
desired range if 'endoff' is page aligned. It doesn't have any visible
effects but still it is good to fix it.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25 09:42:25 -07:00
Jan Kara
5375023ae1 xfs: Fix missed holes in SEEK_HOLE implementation
The XFS SEEK_HOLE implementation could miss a hole in an unwritten extent,
as can be seen with the following command:

xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k"
       -c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (49.312 MiB/sec and 12623.9856 ops/sec)
wrote 8192/8192 bytes at offset 131072
8 KiB, 2 ops; 0.0000 sec (70.383 MiB/sec and 18018.0180 ops/sec)
Whence	Result
HOLE	139264

Here we can see that the hole at offset 56k was just ignored by the
SEEK_HOLE implementation. The bug is in xfs_find_get_desired_pgoff(), which
does not properly detect the case when pages are not contiguous.

Fix the problem by properly detecting when a found page has a larger offset
than expected.
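
The detection can be sketched over a plain array of page indices. This is a
simplified userspace model, not the XFS code: first_missing_page() is a
hypothetical helper, and the indices below assume 4 KiB pages for the xfs_io
example above (56k written gives pages 0..13, 8k at 128k gives pages 32..33).

```c
#include <stddef.h>
#include <stdint.h>

/* Given the ascending indices of the pages that were found, return the
 * first index in [start, end] that is missing, or end + 1 if every page in
 * the range is present. */
static uint64_t first_missing_page(const uint64_t *found, size_t n,
                                   uint64_t start, uint64_t end)
{
    uint64_t expected = start;

    for (size_t i = 0; i < n && expected <= end; i++) {
        if (found[i] < expected)
            continue;           /* page lies before the range */
        if (found[i] > expected)
            return expected;    /* found page is past the expected one */
        expected++;
    }
    return expected;
}
```

With the pages from the example, the first missing index is 14, i.e. byte
offset 14 * 4096 = 57344 (56k), rather than the 139264 reported before the
fix.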

CC: stable@vger.kernel.org
Fixes: d126d43f63
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25 09:42:25 -07:00
Eryu Guan
8affebe16d xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()
xfs_find_get_desired_pgoff() is used to search for offset of hole or
data in page range [index, end] (both inclusive), and the max number
of pages to search should be at least one, if end == index.
Otherwise the only page is missed and no hole or data is found,
which is not correct.

When block size is smaller than page size, this can be demonstrated
by preallocating a file with size smaller than page size and writing
data to the last block. E.g. run this xfs_io command on a 1k block
size XFS on x86_64 host.

  # xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \
  	    -c "seek -d 0" /mnt/xfs/testfile
  wrote 1024/1024 bytes at offset 2048
  1 KiB, 1 ops; 0.0000 sec (33.675 MiB/sec and 34482.7586 ops/sec)
  Whence  Result
  DATA    EOF

Data at offset 2k was missed, and lseek(2) returned ENXIO.

This was uncovered by generic/285 subtests 07 and 08 on a ppc64 host,
where the page size is 64k, because a recent change to generic/285
reduced the preallocated file size to less than 64k.
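
The counting error can be sketched as follows. This is a userspace
illustration under assumptions: BATCH_MAX is a hypothetical per-lookup cap,
and the two helpers are illustrative names rather than the functions touched
by the patch.

```c
#include <stdint.h>

/* Hypothetical cap on pages requested per lookup (assumed value). */
#define BATCH_MAX 14

/* Treats end - index as the page count, so a one-page range asks for 0. */
static uint64_t pages_wanted_buggy(uint64_t index, uint64_t end)
{
    uint64_t n = end - index;

    return n < BATCH_MAX ? n : BATCH_MAX;
}

/* [index, end] is inclusive, so the range holds end - index + 1 pages. */
static uint64_t pages_wanted_fixed(uint64_t index, uint64_t end)
{
    uint64_t n = end - index;   /* one less than the pages in the range */

    if (n > BATCH_MAX - 1)
        n = BATCH_MAX - 1;
    return n + 1;               /* always at least one page */
}
```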

Cc: stable@vger.kernel.org # v3.7+
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25 09:42:25 -07:00
Eric Sandeen
a4d768e702 xfs: fix unaligned access in xfs_btree_visit_blocks
This structure copy was throwing unaligned access warnings on sparc64:

Kernel unaligned access at TPC[1043c088] xfs_btree_visit_blocks+0x88/0xe0 [xfs]

xfs_btree_copy_ptrs does a memcpy, which avoids it.
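
As a general illustration of why a memcpy avoids the problem (this is not
the XFS code, and load_u64_unaligned() is a hypothetical helper): copying
bytes into an aligned local never dereferences a misaligned pointer.

```c
#include <stdint.h>
#include <string.h>

/* Read a 64-bit value from a position that may not be 8-byte aligned.
 * memcpy copies byte-wise as far as the language is concerned, so no
 * aligned load is assumed on the source address. */
static uint64_t load_u64_unaligned(const unsigned char *p)
{
    uint64_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}
```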

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25 09:42:25 -07:00
Tahsin Erdogan
b8cb5a545c ext4: fix quota charging for shared xattr blocks
ext4_xattr_block_set() calls dquot_alloc_block() to charge for an xattr
block when new references are made. However if dquot_initialize() hasn't
been called on an inode, request for charging is effectively ignored
because ext4_inode_info->i_dquot is not initialized yet.

Add dquot_initialize() to call paths that lead to ext4_xattr_block_set().

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:24:07 -04:00
Eric Biggers
c41d342b39 ext4: remove redundant check for encrypted file on dio write path
Currently we don't allow direct I/O on encrypted regular files, so in
such cases we return 0 early in ext4_direct_IO().  There was also an
additional BUG_ON() check in ext4_direct_IO_write(), but it can never be
hit because of the earlier check for the exact same condition in
ext4_direct_IO().  There was also no matching check on the read path,
which made the write path specific check seem very ad-hoc.

Just remove the unnecessary BUG_ON().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: David Gstir <david@sigma-star.at>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:20:31 -04:00
Eric Biggers
d6b975504e ext4: remove unused d_name argument from ext4_search_dir() et al.
Now that we are passing a struct ext4_filename, we do not need to pass
around the original struct qstr too.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:10:49 -04:00
Eric Biggers
e5465795ca ext4: fix off-by-one error when writing back pages before dio read
The 'lend' argument of filemap_write_and_wait_range() is inclusive, so
we need to subtract 1 from pos + count.

Note that 'count' is guaranteed to be nonzero since
ext4_file_read_iter() returns early when given a 0 count.
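
The arithmetic, as a tiny userspace sketch (range_last_byte() is a
hypothetical helper, not part of the patch):

```c
#include <stdint.h>

/* Inclusive offset of the last byte in a range that starts at pos and
 * covers count bytes. count must be nonzero. */
static int64_t range_last_byte(int64_t pos, int64_t count)
{
    return pos + count - 1;
}
```

For a 4096-byte read at offset 0 the last byte is 4095; passing 4096 as an
inclusive end would also cover the first byte of the following range.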

Fixes: 16c5468859 ("ext4: Allow parallel DIO reads")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:05:29 -04:00
Eryu Guan
624327f879 ext4: fix off-by-one on max nr_pages in ext4_find_unwritten_pgoff()
ext4_find_unwritten_pgoff() is used to search for offset of hole or
data in page range [index, end] (both inclusive), and the max number
of pages to search should be at least one, if end == index.
Otherwise the only page is missed and no hole or data is found,
which is not correct.

When block size is smaller than page size, this can be demonstrated
by preallocating a file with size smaller than page size and writing
data to the last block. E.g. run this xfs_io command on a 1k block
size ext4 on x86_64 host.

  # xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \
  	    -c "seek -d 0" /mnt/ext4/testfile
  wrote 1024/1024 bytes at offset 2048
  1 KiB, 1 ops; 0.0000 sec (42.459 MiB/sec and 43478.2609 ops/sec)
  Whence  Result
  DATA    EOF

Data at offset 2k was missed, and lseek(2) returned ENXIO.

This was uncovered by generic/285 subtests 07 and 08 on a ppc64 host,
where the page size is 64k, because a recent change to generic/285
reduced the preallocated file size to less than 64k.

Signed-off-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:02:20 -04:00
Luis Henriques
42c99fc4c7 ceph: check that the new inode size is within limits in ceph_fallocate()
Currently the ceph client doesn't respect the rlimit in fallocate.  This
means that a user can allocate a file with size > RLIMIT_FSIZE.  This
patch adds the call to inode_newsize_ok() to verify filesystem limits and
ulimits.  This should make ceph successfully run xfstest generic/228.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-24 18:10:54 +02:00
Trond Myklebust
b49c15f97c NFSv4.0: Fix a lock leak in nfs40_walk_client_list
Xiaolong Ye's kernel test robot detected the following Oops:
[  299.158991] BUG: scheduling while atomic: mount.nfs/9387/0x00000002
[  299.169587] 2 locks held by mount.nfs/9387:
[  299.176165]  #0:  (nfs_clid_init_mutex){......}, at: [<ffffffff8130cc92>] nfs4_discover_server_trunking+0x47/0x1fc
[  299.201802]  #1:  (&(&nn->nfs_client_lock)->rlock){......}, at: [<ffffffff813125fa>] nfs40_walk_client_list+0x2e9/0x338
[  299.221979] CPU: 0 PID: 9387 Comm: mount.nfs Not tainted 4.11.0-rc7-00021-g14d1bbb #45
[  299.235584] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[  299.251176] Call Trace:
[  299.255192]  dump_stack+0x61/0x7e
[  299.260416]  __schedule_bug+0x65/0x74
[  299.266208]  __schedule+0x5d/0x87c
[  299.271883]  schedule+0x89/0x9a
[  299.276937]  schedule_timeout+0x232/0x289
[  299.283223]  ? detach_if_pending+0x10b/0x10b
[  299.289935]  schedule_timeout_uninterruptible+0x2a/0x2c
[  299.298266]  ? put_rpccred+0x3e/0x115
[  299.304327]  ? schedule_timeout_uninterruptible+0x2a/0x2c
[  299.312851]  msleep+0x1e/0x22
[  299.317612]  nfs4_discover_server_trunking+0x102/0x1fc
[  299.325644]  nfs4_init_client+0x13f/0x194

It looks as if we recently added a spin_lock() leak to
nfs40_walk_client_list() when cleaning up the code.

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Fixes: 14d1bbb0ca ("NFS: Create a common nfs4_match_client() function")
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-24 08:05:16 -04:00
Benjamin Coddington
08cb5b0f05 pnfs: Fix the check for requests in range of layout segment
It's possible and acceptable for NFS to attempt to add requests beyond the
range of the current pgio->pg_lseg, a case which should be caught and
limited by the pg_test operation.  However, the current handling of this
case replaces pgio->pg_lseg with a new layout segment (after a WARN) within
that pg_test operation.  That will cause all the previously added requests
to be submitted with this new layout segment, which may not be valid for
those requests.

Fix this problem by only returning zero for the number of bytes to coalesce
from pg_test for this case which allows any previously added requests to
complete on the current layout segment.  The check for requests starting
out of range of the layout segment moves to pg_init, so that the
replacement of pgio->pg_lseg will be done when the next request is added.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-24 07:55:02 -04:00
Dan Carpenter
662f9a105b pNFS/flexfiles: missing error code in ff_layout_alloc_lseg()
If xdr_inline_decode() fails then we end up returning ERR_PTR(0).  The
caller treats NULL returns as -ENOMEM so it doesn't really hurt runtime,
but obviously we intended to set an error code here.
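
A userspace sketch of the pitfall follows. ERR_PTR(), PTR_ERR() and IS_ERR()
are simplified re-implementations of the kernel helpers; decode_ok() and the
two seg_alloc_*() functions are hypothetical, and -EIO is chosen only for
illustration, not taken from the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified userspace versions of the kernel helpers. */
static void *ERR_PTR(long error) { return (void *)error; }
static long PTR_ERR(const void *ptr) { return (long)ptr; }
static int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical decoder check: fails when fewer than need bytes remain. */
static int decode_ok(size_t have, size_t need) { return have >= need; }

static void *seg_alloc_buggy(size_t have)
{
    static int seg;
    long rc = 0;                /* never set on the failure path */

    if (!decode_ok(have, 4))
        return ERR_PTR(rc);     /* ERR_PTR(0) is just NULL */
    return &seg;
}

static void *seg_alloc_fixed(size_t have)
{
    static int seg;
    long rc = -EIO;             /* illustrative error code (assumption) */

    if (!decode_ok(have, 4))
        return ERR_PTR(rc);
    return &seg;
}
```

Note that IS_ERR() is false for the NULL returned by the buggy variant, so
only a caller that also checks for NULL notices the failure.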

Fixes: d67ae825a5 ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-24 07:52:54 -04:00
Olga Kornievskaia
6d3b5d8d8d NFS fix COMMIT after COPY
Fix a typo in the commit e092693443
"NFS append COMMIT after synchronous COPY"

Reported-by: Eryu Guan <eguan@redhat.com>
Fixes: e092693443 ("NFS append COMMIT after synchronous COPY")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Tested-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-24 07:52:48 -04:00
Jan Kara
d8747d642e reiserfs: Make flush bios explicitely sync
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions.  generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report a volatile
write cache, and thus the write effectively becomes asynchronous, which can
lead to performance regressions.

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.
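
The effect can be modelled with plain bit flags. This is a simplified
sketch: the F_* values are hypothetical and do not match the kernel's REQ_*
bits, and both helpers are illustrative rather than block layer functions.

```c
/* Hypothetical flag bits standing in for REQ_SYNC, REQ_FUA, REQ_PREFLUSH. */
#define F_SYNC     (1u << 0)
#define F_FUA      (1u << 1)
#define F_PREFLUSH (1u << 2)

/* Models a device without a volatile write cache: the flush-related
 * flags are dropped before the request is issued. */
static unsigned strip_flush_flags(unsigned flags)
{
    return flags & ~(F_FUA | F_PREFLUSH);
}

/* A request counts as synchronous if any of the three flags is set. */
static int treated_as_sync(unsigned flags)
{
    return (flags & (F_SYNC | F_FUA | F_PREFLUSH)) != 0;
}
```

A request carrying only the flush flags loses its synchronous treatment once
they are stripped; one that also carries the explicit sync flag keeps it.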

Fixes: b685d3d65a
CC: reiserfs-devel@vger.kernel.org
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2017-05-24 13:35:20 +02:00
Jan Kara
0f0b9b63e1 gfs2: Make flush bios explicitely sync
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions.  generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report a volatile
write cache, and thus the write effectively becomes asynchronous, which can
lead to performance regressions.

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.

Fixes: b685d3d65a
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: cluster-devel@redhat.com
CC: stable@vger.kernel.org
Acked-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-05-24 13:35:20 +02:00
Eric Biggers
aaebdee8b8 f2fs: don't bother checking for encryption key in ->write_iter()
Since only an open file can be written to, and we only allow open()ing
an encrypted file when its key is available, there is no need to check
for the key again before permitting each ->write_iter().

This code was also broken in that it wouldn't actually have failed if
the key was in fact unavailable.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:11:08 -07:00
Eric Biggers
b82a6ea6ec f2fs: don't bother checking for encryption key in ->mmap()
Since only an open file can be mmap'ed, and we only allow open()ing an
encrypted file when its key is available, there is no need to check for
the key again before permitting each mmap().

This f2fs copy of this code was also broken in that it wouldn't actually
have failed if the key was in fact unavailable.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:10:36 -07:00
Chao Yu
6afae6336a f2fs: wait discard IO completion without cmd_lock held
Wait for discard IO completion outside cmd_lock to avoid the long latency
of holding cmd_lock in IO-busy scenarios.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:10:03 -07:00
Chao Yu
e31b982157 f2fs: wake up all waiters in f2fs_submit_discard_endio
There could be more than one waiter waiting for discard IO completion, so
we need to use complete_all() instead of complete() in
f2fs_submit_discard_endio to avoid a hung task.
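
The difference can be shown with a single-threaded counter model. This is a
deliberately simplified sketch and not the kernel's completion
implementation: struct completion_model and the model_*() functions are
hypothetical, and real waiters would sleep instead of polling.

```c
#include <limits.h>

/* Simplified model: done counts how many waiters may still proceed. */
struct completion_model {
    unsigned int done;
};

/* Lets exactly one waiter through. */
static void model_complete(struct completion_model *c)
{
    if (c->done != UINT_MAX)
        c->done++;
}

/* Lets every current and future waiter through. */
static void model_complete_all(struct completion_model *c)
{
    c->done = UINT_MAX;
}

/* Returns 1 if the waiter may proceed, 0 if it would keep waiting. */
static int model_try_wait(struct completion_model *c)
{
    if (c->done == 0)
        return 0;
    if (c->done != UINT_MAX)
        c->done--;              /* consume one completion */
    return 1;
}
```

In this model a second waiter stays blocked after a single
model_complete(), which is the hang described above.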

Fixes: ec9895add2 ("f2fs: don't hold cmd_lock during waiting discard command")
Cc: <stable@vger.kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:09:54 -07:00
Chao Yu
04dfc23006 f2fs: show more info if fail to issue discard
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:09:45 -07:00
Chao Yu
fb830fc5cf f2fs: introduce io_list for serialize data/node IOs
Serialize data/node IOs by using a fifo list instead of a mutex lock. This
helps to enhance the concurrency of f2fs while keeping LFS IO semantics.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:09:03 -07:00
Chao Yu
e41e6d75e5 f2fs: split wio_mutex
Split wio_mutex to adjust different temperature bio cache.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:23 -07:00
Yunlei He
963932a93c f2fs: combine huge num of discard rb tree consistence checks
Came across a hung task caused by a huge number of rb tree traversals
while adding discard addrs in cp. This patch combines these consistency
checks and moves them to the discard thread.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:19 -07:00
Yunlei He
dad48e7312 f2fs: fix a bug caused by NULL extent tree
Thread A:					Thread B:

-f2fs_remount
    -sbi->mount_opt.opt = 0;
						<--- -f2fs_iget
						         -do_read_inode
							     -f2fs_init_extent_tree
							         -F2FS_I(inode)->extent_tree is NULL
        -default_options && parse_options
	    -remount return
						<---  -f2fs_map_blocks
						          -f2fs_lookup_extent_tree
                                                              -f2fs_bug_on(sbi, !et);

The same problem with f2fs_new_inode.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:18 -07:00
Jaegeuk Kim
1d7be27082 f2fs: try to freeze in gc and discard threads
This allows the gc and discard threads to be frozen.

Cc: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:18 -07:00
Yunlei He
b7b7c4cf1c f2fs: add a new function get_ssr_cost
This patch adds a new function, get_ssr_cost, to select
SSR segments more accurately.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:17 -07:00
Hou Pengyang
bd80a4b981 f2fs: declare load_free_nid_bitmap static
Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:16 -07:00
Jaegeuk Kim
cc15620bc8 f2fs: avoid f2fs_lock_op for IPU writes
Currently, if we do get_dnode_of_data before f2fs_lock_op, there may be a deadlock
as follows, where process A would be in an infinite loop and B will NOT be woken up.

Process A(cp):            Process B:
f2fs_lock_all(sbi)
                        get_dnode_of_data <---- lock dn.node_page
flush_nodes             f2fs_lock_op

So, this patch adds f2fs_trylock_op to avoid f2fs_lock_op done by IPU.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:07:15 -07:00
Jaegeuk Kim
a912b54d3a f2fs: split bio cache
Split DATA/NODE type bio cache according to different temperature,
so write IOs with the same temperature can be merged in corresponding
bio cache as much as possible, otherwise, different temperature write
IOs submitting into one bio cache will always cause split of bio.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:39 -07:00
Jaegeuk Kim
81377bd628 f2fs: use fio instead of multiple parameters
This patch just passes fio instead of multiple parameters.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:38 -07:00
Jaegeuk Kim
b9109b0e49 f2fs: remove unnecessary read cases in merged IO flow
Merged IO flow doesn't need to care about read IOs.

f2fs_submit_merged_bio -> f2fs_submit_merged_write
f2fs_submit_merged_bios -> f2fs_submit_merged_writes
f2fs_submit_merged_bio_cond -> f2fs_submit_merged_write_cond

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:37 -07:00
Jaegeuk Kim
1919ffc0d7 f2fs: use f2fs_submit_page_bio for ra_meta_pages
This patch avoids using f2fs_submit_merged_bio for read, which was the only
read case.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:36 -07:00
Weichao Guo
e5dbd9563e f2fs: make sure f2fs_gc returns consistent errno
By default, f2fs_gc returns -EINVAL in general error cases, e.g., no victim
was selected. However, the default errno may be overwritten in two cases:
gc_more and BG_GC -> FG_GC. We should return consistent errno in such cases.

Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:35 -07:00
Chao Yu
1c6d8ee4b8 f2fs: support statx
The latest kernel already supports the new statx() syscall, added in commit
a528d35e8b ("statx: Add a system call to make enhanced file info available").
With this interface we can show more file info to the user, including file
creation time and some attribute flags.

This patch tries to support this functionality.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:34 -07:00
Jaegeuk Kim
93607124c5 f2fs: load inode's flag from disk
This patch fixes inode flags not being loaded from disk, reported by Tom.

[tom@localhost ~]$ sudo mount /dev/loop0 /mnt/
[tom@localhost ~]$ sudo chown tom:tom /mnt/
[tom@localhost ~]$ touch /mnt/testfile
[tom@localhost ~]$ sudo chattr +i /mnt/testfile
[tom@localhost ~]$ echo test > /mnt/testfile
bash: /mnt/testfile: Operation not permitted
[tom@localhost ~]$ rm /mnt/testfile
rm: cannot remove '/mnt/testfile': Operation not permitted
[tom@localhost ~]$ sudo umount /mnt/
[tom@localhost ~]$ sudo mount /dev/loop0 /mnt/
[tom@localhost ~]$ lsattr /mnt/testfile
----i-------------- /mnt/testfile
[tom@localhost ~]$ echo test > /mnt/testfile
[tom@localhost ~]$ rm /mnt/testfile
[tom@localhost ~]$ sudo umount /mnt/

Cc: stable@vger.kernel.org
Reported-by: Tom Yan <tom.ty89@outlook.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-23 21:05:31 -07:00
J. Bruce Fields
9a307403d3 nfsd4: fix null dereference on replay
If we receive a compound such that:

	- the sessionid, slot, and sequence number in the SEQUENCE op
	  match a cached successful reply with N ops, and
	- the Nth operation of the compound is a PUTFH, PUTPUBFH,
	  PUTROOTFH, or RESTOREFH,

then nfsd4_sequence will return 0 and set cstate->status to
nfserr_replay_cache.  The current filehandle will not be set.  This will
cause us to call check_nfsd_access with first argument NULL.

To nfsd4_compound it looks like we just successfully executed an
operation that set a filehandle, but the current filehandle is not set.

Fix this by moving the nfserr_replay_cache earlier.  There was never any
reason to have it after the encode_op label, since the only case where
we hit that is when opdesc->op_func sets it.

Note that there are two ways we could hit this case:

	- a client is resending a previously sent compound that ended
	  with one of the four PUTFH-like operations, or
	- a client is sending a *new* compound that (incorrectly) shares
	  sessionid, slot, and sequence number with a previously sent
	  compound, and the length of the previously sent compound
	  happens to match the position of a PUTFH-like operation in the
	  new compound.

The second is obviously incorrect client behavior.  The first is also
very strange--the only purpose of a PUTFH-like operation is to set the
current filehandle to be used by the following operation, so there's no
point in having it as the last in a compound.

So it's likely this requires a buggy or malicious client to reproduce.

Reported-by: Scott Mayhew <smayhew@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-05-23 14:20:58 -04:00
Eric W. Biederman
296990deb3 mnt: Make propagate_umount less slow for overlapping mount propagation trees
Andrei Vagin pointed out that the time to execute propagate_umount can go
non-linear (and take a ludicrous amount of time) when the mount
propagation trees of the mounts to be unmounted by a lazy unmount
overlap.

Make the walk of the mount propagation trees nearly linear by
remembering which mounts have already been visited, allowing
subsequent walks to detect when walking a mount propagation tree or a
subtree of a mount propagation tree would be duplicate work and to skip
them entirely.

Walk the list of mounts whose propagation trees need to be traversed
from the mount highest in the mount tree to mounts lower in the mount
tree so that odds are higher that the code will walk the largest trees
first, allowing later tree walks to be skipped entirely.

Add cleanup_umount_visitation to remove the code's memory of which
mounts have been visited.

Add the functions last_slave and skip_propagation_subtree to allow
skipping appropriate parts of the mount propagation tree without
needing to change the logic of the rest of the code.
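
The effect of remembering visited mounts can be sketched with a small Python
model. This is only an illustration under simplifying assumptions: mounts are
plain integers, the propagation trees are a dict of children, and walk() and
total_work() are hypothetical helpers, not the kernel functions named above.

```python
def walk(children, root, visited=None):
    """Walk the tree under root and return how many nodes were processed."""
    work = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if visited is not None:
            if node in visited:
                continue  # this subtree was handled by an earlier walk
            visited.add(node)
        work += 1
        stack.extend(children.get(node, ()))
    return work


def total_work(children, roots, remember):
    # One shared visited set across all walks when remember is True.
    visited = set() if remember else None
    return sum(walk(children, r, visited) for r in roots)


# A chain 0 -> 1 -> ... -> 9 where every node is also a starting point,
# so the trees walked from each root overlap heavily.
chain = {i: [i + 1] for i in range(9)}
roots = list(range(10))

print(total_work(chain, roots, remember=False))  # quadratic in chain length
print(total_work(chain, roots, remember=True))   # each node processed once
```

Starting from the root highest in the tree means the first walk covers the
largest tree and every later walk is skipped at its first node.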

A script to generate overlapping mount propagation trees:

$ cat run.sh
set -e
mount -t tmpfs zdtm /mnt
mkdir -p /mnt/1 /mnt/2
mount -t tmpfs zdtm /mnt/1
mount --make-shared /mnt/1
mkdir /mnt/1/1

iteration=10
if [ -n "$1" ] ; then
	iteration=$1
fi

for i in $(seq $iteration); do
	mount --bind /mnt/1/1 /mnt/1/1
done

mount --rbind /mnt/1 /mnt/2

TIMEFORMAT='%Rs'
nr=$(( ( 2 ** ( $iteration + 1 ) ) + 1 ))
echo -n "umount -l /mnt/1 -> $nr        "
time umount -l /mnt/1

nr=$(cat /proc/self/mountinfo | grep zdtm | wc -l )
time umount -l /mnt/2

$ for i in $(seq 9 19); do echo $i; unshare -Urm bash ./run.sh $i; done

Here are the performance numbers with and without the patch:

     mhash |  8192   |  8192  | 1048576 | 1048576
    mounts | before  | after  |  before | after
    ------------------------------------------------
      1025 |  0.040s | 0.016s |  0.038s | 0.019s
      2049 |  0.094s | 0.017s |  0.080s | 0.018s
      4097 |  0.243s | 0.019s |  0.206s | 0.023s
      8193 |  1.202s | 0.028s |  1.562s | 0.032s
     16385 |  9.635s | 0.036s |  9.952s | 0.041s
     32769 | 60.928s | 0.063s | 44.321s | 0.064s
     65537 |         | 0.097s |         | 0.097s
    131073 |         | 0.233s |         | 0.176s
    262145 |         | 0.653s |         | 0.344s
    524289 |         | 2.305s |         | 0.735s
   1048577 |         | 7.107s |         | 2.603s

Andrei Vagin reports fixing the performance problem is part of the
work to fix CVE-2016-6213.
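
For reference, the mount counts in the first column of the table follow from
the nr=... arithmetic in the script when it is run for iterations 9 through
19. A short Python check (expected_mounts is a name invented for this sketch):

```python
def expected_mounts(iteration):
    # Same arithmetic as the nr=$(( ( 2 ** ( $iteration + 1 ) ) + 1 )) line.
    return 2 ** (iteration + 1) + 1


# The driver loop runs "seq 9 19", i.e. iterations 9..19 inclusive.
counts = [expected_mounts(i) for i in range(9, 20)]
for count in counts:
    print(count)
```

Each extra iteration roughly doubles the number of mounts, which is why the
"before" columns grow so quickly.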

Cc: stable@vger.kernel.org
Fixes: a05964f391 ("[PATCH] shared mounts handling: umount")
Reported-by: Andrei Vagin <avagin@openvz.org>
Reviewed-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-05-23 08:41:17 -05:00
Eric W. Biederman
99b19d1647 mnt: In propagate_umount handle visiting mounts in any order
While investigating some poor umount performance I realized that in
the case of overlapping mount trees where some of the mounts are locked
the code has been failing to unmount all of the mounts it should
have been unmounting.

This failure to unmount all of the necessary
mounts can be reproduced with:

$ cat locked_mounts_test.sh

mount -t tmpfs test-base /mnt
mount --make-shared /mnt
mkdir -p /mnt/b

mount -t tmpfs test1 /mnt/b
mount --make-shared /mnt/b
mkdir -p /mnt/b/10

mount -t tmpfs test2 /mnt/b/10
mount --make-shared /mnt/b/10
mkdir -p /mnt/b/10/20

mount --rbind /mnt/b /mnt/b/10/20

unshare -Urm --propagation unchanged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
sleep 1
umount -l /mnt/b
wait %%

$ unshare -Urm ./locked_mounts_test.sh

This failure is corrected by removing the prepass that marks mounts
that may be umounted.

A first pass is added that umounts mounts if possible. If a mount cannot
be umounted, the pass marks it when it could be unmounted were it not
locked, and adds it to a list of umount possibilities.  This first pass
reconsiders the mount's parent if it is on the list of umount possibilities,
ensuring that information about umountability passes from child to parent.

A second pass then walks through all mounts that are umounted and processes
their children, unmounting them or marking them for reparenting.

A last pass cleans up the state on the mounts that could not be umounted
and if applicable reparents them to their first parent that remained
mounted.

While a bit longer than the old code, this code is much more robust
as it allows information to flow up from the leaves and down
from the trunk, making the order in which mounts are encountered
in the umount propagation tree irrelevant.

Cc: stable@vger.kernel.org
Fixes: 0c56fe3142 ("mnt: Don't propagate unmounts to locked mounts")
Reviewed-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-05-23 08:41:16 -05:00
Eric W. Biederman
570487d3fa mnt: In umount propagation reparent in a separate pass
It was observed that in some pathological cases the current code
does not unmount everything it should.  After investigation it
was determined that the issue is that mnt_change_mountpoint can
change which mounts are available to be unmounted during mount
propagation, which is wrong.

The trivial reproducer is:
$ cat ./pathological.sh

mount -t tmpfs test-base /mnt
cd /mnt
mkdir 1 2 1/1
mount --bind 1 1
mount --make-shared 1
mount --bind 1 2
mount --bind 1/1 1/1
mount --bind 1/1 1/1
echo
grep test-base /proc/self/mountinfo
umount 1/1
echo
grep test-base /proc/self/mountinfo

$ unshare -Urm ./pathological.sh

The expected output looks like:
46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

The output without the fix looks like:
46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

That last mount in the output was in the propagation tree to be unmounted but
was missed because mnt_change_mountpoint changed its parent before the walk
through the mount propagation tree observed it.

Cc: stable@vger.kernel.org
Fixes: 1064f874ab ("mnt: Tuck mounts under others instead of creating shadow/side mounts.")
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-05-23 08:40:32 -05:00
Konstantin Khlebnikov
887a973061 ext4: keep existing extra fields when inode expands
ext4_expand_extra_isize() should clear only space between old and new
size.

Fixes: 6dd4ee7cab # v2.6.23
Cc: stable@vger.kernel.org
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21 22:36:23 -04:00
Konstantin Khlebnikov
9651e6b2e2 ext4: handle the rest of ext4_mb_load_buddy() ENOMEM errors
I've got another report about breaking ext4 by ENOMEM error returned from
ext4_mb_load_buddy() caused by memory shortage in memory cgroup.
This time inside ext4_discard_preallocations().

This patch replaces ext4_error() with ext4_warning() where errors returned
from ext4_mb_load_buddy() are not fatal and handled by caller:
* ext4_mb_discard_group_preallocations() - called before generating ENOSPC,
  we'll try to discard other group or return ENOSPC into user-space.
* ext4_trim_all_free() - just stop trimming and return ENOMEM from ioctl.

Some callers cannot handle errors, thus __GFP_NOFAIL is used for them:
* ext4_discard_preallocations()
* ext4_mb_discard_lg_preallocations()

Fixes: adb7ef600c ("ext4: use __GFP_NOFAIL in ext4_free_blocks()")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21 22:35:23 -04:00
Jan Kara
3f1d5bad3f ext4: fix off-by-one in loop termination in ext4_find_unwritten_pgoff()
There is an off-by-one error in loop termination conditions in
ext4_find_unwritten_pgoff() since 'end' may index a page beyond end of
desired range if 'endoff' is page aligned. It doesn't have any visible
effects but still it is good to fix it.
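
The aligned case can be illustrated with a small Python calculation. This is
a sketch of the arithmetic only, assuming 4 KiB pages and an exclusive
'endoff'; the helper names are made up and this is not the actual patch.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def last_page_naive(endoff):
    # Treats endoff as if it were the last byte of the range.
    return endoff >> PAGE_SHIFT


def last_page_inclusive(endoff):
    # endoff is exclusive, so the last byte of the range is endoff - 1.
    return (endoff - 1) >> PAGE_SHIFT


aligned = 2 << PAGE_SHIFT  # range ends exactly on a page boundary
print(last_page_naive(aligned), last_page_inclusive(aligned))

unaligned = aligned + 1    # range ends one byte into the next page
print(last_page_naive(unaligned), last_page_inclusive(unaligned))
```

Only the page-aligned case differs: the naive form indexes one page past the
desired range, while both forms agree when 'endoff' is not aligned.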

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21 22:34:23 -04:00
Jan Kara
7d95eddf31 ext4: fix SEEK_HOLE
Currently, SEEK_HOLE implementation in ext4 may both return that there's
a hole at some offset although that offset already has data and skip
some holes during a search for the next hole. The first problem is
demonstrated by:

xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (2.054 GiB/sec and 538461.5385 ops/sec)
Whence	Result
HOLE	0

Where we can see that SEEK_HOLE wrongly returned offset 0 as containing
a hole although we have written data there. The second problem can be
demonstrated by:

xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k"
       -c "seek -h 0" file

wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (1.978 GiB/sec and 518518.5185 ops/sec)
wrote 8192/8192 bytes at offset 131072
8 KiB, 2 ops; 0.0000 sec (2 GiB/sec and 500000.0000 ops/sec)
Whence	Result
HOLE	139264

Where we can see that hole at offsets 56k..128k has been ignored by the
SEEK_HOLE call.

The underlying problem is in the ext4_find_unwritten_pgoff() which is
just buggy. In some cases it fails to update returned offset when it
finds a hole (when no pages are found or when the first found page has
higher index than expected), in some cases conditions for detecting hole
are just missing (we fail to detect a situation where indices of
returned pages are not contiguous).

Fix ext4_find_unwritten_pgoff() to properly detect non-contiguous page
indices and also handle all cases where we got fewer pages than expected
in one place and handle it properly there.
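
The contiguity check can be sketched in Python. This is a simplified model,
not the ext4 code: it assumes 4 KiB pages, represents the written pages as a
set of page indices, and first_hole() is a hypothetical helper.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def first_hole(data_pages, start_page, end_page):
    """Return the byte offset of the first hole in [start_page, end_page),
    or None if every page in the range holds data."""
    expected = start_page
    for index in sorted(p for p in data_pages if start_page <= p < end_page):
        if index != expected:
            break  # gap before this page: the hole starts at `expected`
        expected = index + 1
    if expected < end_page:
        return expected * PAGE_SIZE
    return None


# falloc 0 256k -> 64 pages; pwrite 0 56k -> pages 0..13
only_head = set(range(14))
print(first_hole(only_head, 0, 64))

# additionally pwrite 128k 8k -> pages 32 and 33
head_and_middle = only_head | {32, 33}
print(first_hole(head_and_middle, 0, 64))
```

Under these assumptions both layouts from the reproducers report the hole at
57344 (56k), rather than 0 or 139264.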

CC: stable@vger.kernel.org
Fixes: c8c0df241c
CC: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21 22:33:23 -04:00
Tahsin Erdogan
b4709067ac jbd2: preserve original nofs flag during journal restart
When a transaction starts, start_this_handle() saves current
PF_MEMALLOC_NOFS value so that it can be restored at journal stop time.
Journal restart is a special case that calls start_this_handle() without
stopping the transaction. start_this_handle() isn't aware that the
original value is already stored so it overwrites it with current value.

For instance, a call sequence like below leaves PF_MEMALLOC_NOFS flag set
at the end:

  jbd2_journal_start()
  jbd2__journal_restart()
  jbd2_journal_stop()

Make jbd2__journal_restart() restore the original value before calling
start_this_handle().
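
The save/restore problem can be modelled in a few lines of Python. This is a
toy sketch: Task.nofs stands in for PF_MEMALLOC_NOFS, Handle.saved for the
value stored at start time, and the functions only mimic the ordering of the
calls listed above; none of this is the jbd2 implementation.

```python
class Task:
    def __init__(self):
        self.nofs = False  # stands in for PF_MEMALLOC_NOFS on current


class Handle:
    def __init__(self):
        self.saved = False  # value remembered for journal stop time


def start_this_handle(task, handle):
    # Save the current flag, then enter the NOFS scope.
    handle.saved = task.nofs
    task.nofs = True


def journal_stop(task, handle):
    # Leave the NOFS scope by restoring what was saved.
    task.nofs = handle.saved


def journal_restart(task, handle, fixed):
    if fixed:
        task.nofs = handle.saved  # restore the original value first
    start_this_handle(task, handle)


def run(fixed):
    task, handle = Task(), Handle()
    start_this_handle(task, handle)       # jbd2_journal_start()
    journal_restart(task, handle, fixed)  # jbd2__journal_restart()
    journal_stop(task, handle)            # jbd2_journal_stop()
    return task.nofs


print(run(fixed=False))  # flag is still set after stop
print(run(fixed=True))   # flag is back to its original value
```

Without the restore, the restart saves the already-set flag, so the stop
"restores" it to set.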

Fixes: 81378da64d ("jbd2: mark the transaction context with the scope GFP_NOFS context")
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-21 22:32:23 -04:00
Jan Kara
964edf66bf ext4: clear lockdep subtype for quota files on quota off
Quota files have special ranking of i_data_sem lock. We inform lockdep
about it when turning on quotas however when turning quotas off, we
don't clear the lockdep subclass from i_data_sem lock and thus when the
inode gets later reused for a normal file or directory, lockdep gets
confused and complains about possible deadlocks. Fix the problem by
resetting lockdep subclass of i_data_sem on quota off.

Cc: stable@vger.kernel.org
Fixes: daf647d2dd
Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21 22:31:23 -04:00
Linus Torvalds
894e21642d Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A small collection of fixes that should go into this cycle.

   - a pull request from Christoph for NVMe, which ended up being
     manually applied to avoid pulling in newer bits in master. Mostly
     fibre channel fixes from James, but also a few fixes from Jon and
     Vijay

   - a pull request from Konrad, with just a single fix for xen-blkback
     from Gustavo.

   - a fuseblk bdi fix from Jan, fixing a regression in this series with
     the dynamic backing devices.

   - a blktrace fix from Shaohua, replacing sscanf() with kstrtoull().

   - a request leak fix for drbd from Lars, fixing a regression in the
     last series with the kref changes. This will go to stable as well"

* 'for-linus' of git://git.kernel.dk/linux-block:
  nvmet: release the sq ref on rdma read errors
  nvmet-fc: remove target cpu scheduling flag
  nvme-fc: stop queues on error detection
  nvme-fc: require target or discovery role for fc-nvme targets
  nvme-fc: correct port role bits
  nvme: unmap CMB and remove sysfs file in reset path
  blktrace: fix integer parse
  fuseblk: Fix warning in super_setup_bdi_name()
  block: xen-blkback: add null check to avoid null pointer dereference
  drbd: fix request leak introduced by locking/atomic, kref: Kill kref_sub()
2017-05-20 16:12:30 -07:00
Linus Torvalds
8c3fc1643d Merge branch 'libnvdimm-for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
 "A couple of compile fixes.

  With the removal of the ->direct_access() method from
  block_device_operations in favor of a new dax_device + dax_operations
  we broke two configurations.

  The CONFIG_BLOCK=n case is fixed by compiling out the block+dax
  helpers in the dax core. Configurations with FS_DAX=n EXT4=y / XFS=y
  and DAX=m fail due to the helpers the builtin filesystem needs being
  in a module, so we stub out the helpers in the FS_DAX=n case."

* 'libnvdimm-for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax, xfs, ext4: compile out iomap-dax paths in the FS_DAX=n case
  dax: fix false CONFIG_BLOCK dependency
2017-05-19 17:35:34 -07:00
Darrick J. Wong
3ecb3ac7b9 xfs: avoid mount-time deadlock in CoW extent recovery
If a malicious user corrupts the refcount btree to cause a cycle between
different levels of the tree, the next mount attempt will deadlock in
the CoW recovery routine while grabbing buffer locks.  We can use the
ability to re-grab a buffer that was previously locked to a transaction to
avoid deadlocks, so do that here.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-05-19 08:12:49 -07:00
Amir Goldstein
ee1d6d37b6 ovl: mark upper dir with type origin entries "impure"
When moving a merge dir or non-dir with copy up origin into a non-merge
upper dir (a.k.a pure upper dir), we are marking the target parent dir
"impure". ovl_iterate() iterates pure upper dirs directly, because there is
no need to filter out whiteouts and merge dir content with lower dir. But
for the case of an "impure" upper dir, ovl_iterate() will not be able to
iterate the real upper dir directly, because it will need to lookup the
origin inode and use it to fill d_ino.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-19 09:33:49 +02:00
Miklos Szeredi
3d27573ce3 ovl: remove unused arg from ovl_lookup_temp()
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-19 09:33:49 +02:00
Amir Goldstein
21a2287811 ovl: handle rename when upper doesn't support xattr
On failure to set opaque/redirect xattr on rename, skip setting xattr and
return -EXDEV.

On failure to set opaque xattr when creating a new directory, -EIO is
returned instead of -EOPNOTSUPP.

Any failure to set those xattr will be recorded in super block and
then setting any xattr on upper won't be attempted again.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-19 09:33:49 +02:00
Jonathan Corbet
6312811be2 Merge remote-tracking branch 'mauro-exp/docbook3' into death-to-docbook
Mauro says:

This patch series converts the remaining DocBooks to ReST.

The first version was originally sent as 3 patch series:

   [PATCH 00/36] Convert DocBook documents to ReST
   [PATCH 0/5] Convert more books to ReST
   [PATCH 00/13] Get rid of DocBook

The lsm book was added as if it were a text file under
Documentation. The plan is to merge it with another file
under Documentation/security, after both this series and
a security Documentation patch series get merged.

It also adjusts some Sphinx-pedantic errors/warnings on
some kernel-doc markups.

I also added some patches here to add PDF output for all
existing ReST books.
2017-05-18 11:03:08 -06:00
Miklos Szeredi
6266d465bd ovl: don't fail copy-up if upper doesn't support xattr
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-18 16:11:24 +02:00
Amir Goldstein
82b749b2c6 ovl: check on mount time if upper fs supports setting xattr
xattr are needed by overlayfs for setting opaque dir, redirect dir
and copy up origin.

Check at mount time by trying to set the overlay.opaque xattr on the
workdir and if that fails issue a warning message.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-18 16:11:24 +02:00
Amir Goldstein
8137ae26d2 ovl: fix creds leak in copy up error path
Fixes: 42f269b925 ("ovl: rearrange code in ovl_copy_up_locked()")
Cc: <stable@vger.kernel.org> # v4.11
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-18 16:11:24 +02:00
Jan Kara
69c8ebf832 fuseblk: Fix warning in super_setup_bdi_name()
Commit 5f7f7543f5 "fuse: Convert to separately allocated bdi" didn't
properly handle fuseblk filesystem. When fuse_bdi_init() is called for
that filesystem type, sb->s_bdi is already initialized (by
set_bdev_super()) to point to block device's bdi and consequently
super_setup_bdi_name() complains about this fact when resetting bdi to
the private one.

Fix the problem by properly dropping bdi reference in fuse_bdi_init()
before creating a private bdi in super_setup_bdi_name().

Fixes: 5f7f7543f5 ("fuse: Convert to separately allocated bdi")
Reported-by: Rakesh Pandit <rakesh@tuxera.com>
Tested-by: Rakesh Pandit <rakesh@tuxera.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-17 08:10:57 -06:00
Jin Qian
15d3042a93 f2fs: sanity check checkpoint segno and blkoff
Make sure segno and blkoff read from raw image are valid.

Cc: stable@vger.kernel.org
Signed-off-by: Jin Qian <jinqian@google.com>
[Jaegeuk Kim: adjust minor coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-16 13:29:39 -07:00
J. Bruce Fields
9512a16b0e nfsd: Revert "nfsd: check for oversized NFSv2/v3 arguments"
This reverts commit 51f5677777 "nfsd: check for oversized NFSv2/v3
arguments", which breaks support for NFSv3 ACLs.

That patch was actually an earlier draft of a fix for the problem that
was eventually fixed by e6838a29ec "nfsd: check for oversized NFSv2/v3
arguments".  But somehow I accidentally left this earlier draft in the
branch that was part of my 4.12 pull request.

Reported-by: Eryu Guan <eguan@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-05-16 16:16:30 -04:00
Darrick J. Wong
ea9a46e1c4 xfs: only return detailed fsmap info if the caller has CAP_SYS_ADMIN
There were a number of handwaving complaints that one could "possibly"
use inode numbers and extent maps to fingerprint a filesystem hosting
multiple containers and somehow use the information to guess at the
contents of other containers and attack them.  Despite the total lack of
any demonstration that this is actually possible, it's easier to
restrict access now and broaden it later, so use the rmapbt fsmap
backends only if the caller has CAP_SYS_ADMIN.  Unprivileged users will
just have to make do with only getting the free space and static
metadata placement information.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2017-05-16 12:26:16 -07:00
Zorro Lang
892d2a5f70 xfs: bad assertion for delalloc an extent that start at i_size
By running fsstress for long enough on RHEL-7, I hit an
assertion failure (harder to reproduce on linux-4.11, but the problem
is still there):

  XFS: Assertion failed: (iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c

The assertion is in the xfs_getbmap() function:

  if (map[i].br_startblock == DELAYSTARTBLOCK &&
-->   map[i].br_startoff <= XFS_B_TO_FSB(mp, XFS_ISIZE(ip)))
          ASSERT((iflags & BMV_IF_DELALLOC) != 0);

When map[i].br_startoff == XFS_B_TO_FSB(mp, XFS_ISIZE(ip)), the
startoff is just at EOF. But we only need to check delalloc
extents that are within EOF, not those that start at EOF.
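
The boundary case can be checked with a small Python sketch. It assumes a
4096-byte filesystem block and that XFS_B_TO_FSB rounds a byte count up to
blocks; b_to_fsb() and within_eof() are illustrative names, and the strict
comparison is an assumption about the shape of the fix.

```python
def b_to_fsb(nbytes, blocksize=4096):
    # Round a byte count up to whole filesystem blocks.
    return (nbytes + blocksize - 1) // blocksize


def within_eof(startoff, isize, inclusive):
    # inclusive=True mirrors the "<=" comparison quoted above;
    # inclusive=False is the strict "<" variant.
    eof = b_to_fsb(isize)
    return startoff <= eof if inclusive else startoff < eof


isize = 8192   # file size in bytes, i.e. EOF is at block 2
startoff = 2   # a delalloc extent that starts exactly at EOF

print(within_eof(startoff, isize, inclusive=True))   # extent is checked
print(within_eof(startoff, isize, inclusive=False))  # extent is skipped
```

An extent that starts one block earlier is covered by both comparisons, so
only the extent starting exactly at EOF is affected.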

Signed-off-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-16 09:24:36 -07:00
Darrick J. Wong
6e747506dd xfs: fix warnings about unused stack variables
Reduce stack usage and get rid of compiler warnings by eliminating
unused variables.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2017-05-16 09:24:36 -07:00
Darrick J. Wong
6eadbf4c8b xfs: BMAPX shouldn't barf on inline-format directories
When we're fulfilling a BMAPX request, jump out early if the data fork
is in local format.  This prevents us from hitting a debugging check in
bmapi_read and barfing errors back to userspace.  The on-disk extent
count check later isn't sufficient for IF_DELALLOC mode because da
extents are in memory and not on disk.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-05-16 09:24:36 -07:00
Brian Foster
0daaecacb8 xfs: fix indlen accounting error on partial delalloc conversion
The delalloc -> real block conversion path uses an incorrect
calculation in the case where the middle part of a delalloc extent
is being converted. This is documented as a rare situation because
XFS generally attempts to maximize contiguity by converting as much
of a delalloc extent as possible.

If this situation does occur, the indlen reservation for the two new
delalloc extents left behind by the conversion of the middle range
is calculated and compared with the original reservation. If more
blocks are required, the delta is allocated from the global block
pool. This delta value can be characterized as the difference
between the new total requirement (temp + temp2) and the currently
available reservation minus those blocks that have already been
allocated (startblockval(PREV.br_startblock) - allocated).

The problem is that the current code does not account for previously
allocated blocks correctly. It subtracts the current allocation
count from the (new - old) delta rather than the old indlen
reservation. This means that more indlen blocks than have been
allocated end up stashed in the remaining extents and free space
accounting is broken as a result.

Fix up the calculation to subtract the allocated block count from
the original extent indlen and thus correctly allocate the
reservation delta based on the difference between the new total
requirement and the unused blocks from the original reservation.
Also remove a bogus assert that contradicts the fact that the new
indlen reservation can be larger than the original indlen
reservation.
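
The arithmetic described above can be sketched in a few lines. This is a
userspace model, not the XFS code: the function names and the sample
numbers are illustrative assumptions, and only temp, temp2 and the idea of
an "allocated" count come from the message.

```python
def buggy_delta(temp, temp2, old_indlen, allocated):
    """Old behaviour as described: 'allocated' is subtracted from (new - old)."""
    new_total = temp + temp2
    return max((new_total - old_indlen) - allocated, 0)


def fixed_delta(temp, temp2, old_indlen, allocated):
    """Fixed behaviour: 'allocated' is first subtracted from the old reservation."""
    new_total = temp + temp2
    unused = old_indlen - allocated  # blocks still available from the original
    return max(new_total - unused, 0)


# Hypothetical numbers: a 10-block reservation of which 3 blocks were used,
# and two leftover delalloc extents needing 4 and 5 blocks.
print(buggy_delta(4, 5, 10, 3), fixed_delta(4, 5, 10, 3))
```

With these numbers the old formula takes nothing from the global pool even
though only 7 unused blocks back a 9-block requirement, which is the kind
of over-stashing the message describes.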

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-16 09:24:35 -07:00
Colin Ian King
bff5baf8aa btrfs: fix incorrect error return ret being passed to mapping_set_error
The setting of return code ret should be based on the error code
passed into function end_extent_writepage and not on ret. Thanks
to Liu Bo for spotting this mistake in the original fix I submitted.

Detected by CoverityScan, CID#1414312 ("Logically dead code")

Fixes: 5dca6eea91 ("Btrfs: mark mapping with error flag to report errors to userspace")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-05-16 15:42:10 +02:00
Jan Kara
8d91012528 btrfs: Make flush bios explicitly sync
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed the REQ_SYNC flag from the WRITE_{FUA|PREFLUSH|...}
definitions.  However, generic_make_request_checks() strips the REQ_FUA
and REQ_PREFLUSH flags from a bio when the storage doesn't report a
volatile write cache, so the write effectively becomes asynchronous,
which can lead to performance regressions.

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.
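
A small model of the flag handling may make the regression easier to see.
The bit values and helper names below are assumptions made for
illustration; they are not the kernel's definitions.

```python
# Hypothetical bit positions, chosen only for this sketch.
REQ_SYNC = 1 << 0
REQ_FUA = 1 << 1
REQ_PREFLUSH = 1 << 2


def is_sync(flags):
    """A bio counts as synchronous if any of the three flags is set."""
    return bool(flags & (REQ_SYNC | REQ_FUA | REQ_PREFLUSH))


def strip_cache_flags(flags, has_volatile_cache):
    """Model of the stripping step: without a volatile write cache,
    FUA and PREFLUSH are removed from the bio."""
    if has_volatile_cache:
        return flags
    return flags & ~(REQ_FUA | REQ_PREFLUSH)


flush_only = REQ_FUA | REQ_PREFLUSH
flush_marked = flush_only | REQ_SYNC

# Without REQ_SYNC the bio stops looking synchronous once it is stripped.
print(is_sync(strip_cache_flags(flush_only, has_volatile_cache=False)))
print(is_sync(strip_cache_flags(flush_marked, has_volatile_cache=False)))
```

Marking the bio with REQ_SYNC up front keeps it synchronous regardless of
what the device reports.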

CC: David Sterba <dsterba@suse.com>
CC: linux-btrfs@vger.kernel.org
Fixes: b685d3d65a
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-05-16 15:42:01 +02:00
Qu Wenruo
4751832da9 btrfs: fiemap: Cache and merge fiemap extent before submit it to user
[BUG]
Cycle mounting btrfs can cause fiemap to return a different result.
Like:
 # mount /dev/vdb5 /mnt/btrfs
 # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        25088..25215       128   0x1
 # umount /mnt/btrfs
 # mount /dev/vdb5 /mnt/btrfs
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..31]:         25088..25119        32   0x0
   1: [32..63]:        25120..25151        32   0x0
   2: [64..95]:        25152..25183        32   0x0
   3: [96..127]:       25184..25215        32   0x1
But after the above fiemap call, we get the correct merged result if we
call fiemap again.
 # xfs_io -c "fiemap -v" /mnt/btrfs/file
 /mnt/test/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        25088..25215       128   0x1

[REASON]
Btrfs tries to merge extent maps when inserting a new extent map.

btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
   |- get_extent_skip_holes(start=0 len=64k)
   |  |- btrfs_get_extent_fiemap(start=0 len=64k)
   |     |- btrfs_get_extent(start=0 len=64k)
   |        |  Found on-disk (ino, EXTENT_DATA, 0)
   |        |- add_extent_mapping()
   |        |- Return (em->start=0, len=16k)
   |
   |- fiemap_fill_next_extent(logic=0 phys=X len=16k)
   |
   |- get_extent_skip_holes(start=0 len=64k)
   |  |- btrfs_get_extent_fiemap(start=0 len=64k)
   |     |- btrfs_get_extent(start=16k len=48k)
   |        |  Found on-disk (ino, EXTENT_DATA, 16k)
   |        |- add_extent_mapping()
   |        |  |- try_merge_map()
   |        |     Merge with previous em start=0 len=16k
   |        |     resulting em start=0 len=32k
   |        |- Return (em->start=0, len=32K)    << Merged result
   |- Strip off the unrelated range (0~16K) of the returned em
   |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
      ^^^ Causing split fiemap extent.

And since the em is already merged in add_extent_mapping(), the next
fiemap() call returns the merged result.

[FIX]
Here we introduce a new structure, fiemap_cache, which records the
previous fiemap extent.

We always try to merge the current fiemap extent with the cached one
before calling fiemap_fill_next_extent().
Only when the current fiemap extent cannot be merged with the cached
one do we call fiemap_fill_next_extent() to submit the cached one.

With this method, we can merge all fiemap extents.

This could also be done in fs/ioctl.c; however, the problem is that if
fieinfo->fi_extents_max == 0, we have no space to cache the previous
fiemap extent.
So I chose to merge it in btrfs.
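
The cache-and-merge idea can be sketched outside the kernel as follows.
The class name, the emit callback (standing in for
fiemap_fill_next_extent()) and the exact merge condition are assumptions
for illustration, including the choice to ignore the "last extent" flag
(0x1 in the xfs_io output above) when comparing flags.

```python
FIEMAP_EXTENT_LAST = 0x1  # flag value as shown in the xfs_io output above


class FiemapCache:
    """Hold back one extent and merge contiguous successors into it."""

    def __init__(self, emit):
        self.emit = emit    # called with (offset, phys, length, flags)
        self.cached = None

    def add(self, offset, phys, length, flags):
        if self.cached is not None:
            c_off, c_phys, c_len, c_flags = self.cached
            contiguous = (c_off + c_len == offset and c_phys + c_len == phys)
            same_flags = ((c_flags & ~FIEMAP_EXTENT_LAST)
                          == (flags & ~FIEMAP_EXTENT_LAST))
            if contiguous and same_flags:
                # Grow the cached extent; carry over the "last" marker.
                merged_flags = c_flags | (flags & FIEMAP_EXTENT_LAST)
                self.cached = (c_off, c_phys, c_len + length, merged_flags)
                return
            self.emit(*self.cached)  # cannot merge: submit the cached one
        self.cached = (offset, phys, length, flags)

    def flush(self):
        if self.cached is not None:
            self.emit(*self.cached)
            self.cached = None


# The four split extents from the example, in 512-byte units.
out = []
cache = FiemapCache(lambda *extent: out.append(extent))
for i in range(4):
    last = FIEMAP_EXTENT_LAST if i == 3 else 0
    cache.add(32 * i, 25088 + 32 * i, 32, last)
cache.flush()
print(out)
```

Holding back exactly one extent is what requires storage of its own, which
is the fi_extents_max == 0 concern raised above.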

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-05-16 15:41:53 +02:00
Mauro Carvalho Chehab
e1511a840a fs: fix the location of the kernel-api book
The kernel-api book is now part of the core-api. Update its
location.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-05-16 08:44:23 -03:00
Mauro Carvalho Chehab
e1b4fc7add fs: update location of filesystems documentation
The filesystem documentation was moved from DocBook to
Documentation/filesystems/. Update it at the sources.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-05-16 08:44:22 -03:00
Mauro Carvalho Chehab
df1b560a4a fs: jbd2: escape a string with special chars on a kernel-doc
kernel-doc will try to interpret a foo() string, except if
properly escaped.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-05-16 08:44:11 -03:00
Mauro Carvalho Chehab
f16df9f765 fs: eventfd: fix indentation on kernel-doc
Sphinx requires explicit tags in order to use a list of possible
values, otherwise it produces this error:

	./fs/eventfd.c:219: WARNING: Option list ends without a blank line; unexpected unindent.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-05-16 08:44:10 -03:00
Mauro Carvalho Chehab
0117d4272b fs: add blank lines to some kernel-doc comments
Sphinx gets confused when it finds indentation without a
good reason for it and without a preceding blank line:

	./fs/mpage.c:347: ERROR: Unexpected indentation.
	./fs/namei.c:4303: ERROR: Unexpected indentation.
	./fs/fs-writeback.c:2060: ERROR: Unexpected indentation.

No functional changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-05-16 08:44:10 -03:00
Mauro Carvalho Chehab
91e4775d0f fs: jbd2: make jbd2_journal_start() kernel-doc parseable
The kernel-doc script expects a function's documentation to
be just before the function; otherwise it will be ignored.

So, move the kernel-doc markup to the right place.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-05-16 08:44:09 -03:00
Linus Torvalds
1319a2856d Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "A set of minor cifs fixes"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  [CIFS] Minor cleanup of xattr query function
  fs: cifs: transport: Use time_after for time comparison
  SMB2: Fix share type handling
  cifs: cifsacl: Use a temporary ops variable to reduce code length
  Don't delay freeing mids when blocked on slow socket write of request
  CIFS: silence lockdep splat in cifs_relock_file()
2017-05-15 15:27:02 -07:00
Christoph Hellwig
bb2a8b0cd1 nfsd4: const-ify nfsd4_ops
nfsd4_ops contains function pointers, and marking it as constant avoids
it being able to be used as an attack vector for code injections.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:32 +02:00
Christoph Hellwig
e9679189e3 sunrpc: mark all struct svc_version instances as const
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:31 +02:00
Christoph Hellwig
860bda29b9 sunrpc: mark all struct svc_procinfo instances as const
struct svc_procinfo contains function pointers, and marking it as
constant avoids it being able to be used as an attack vector for
code injections.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:31 +02:00
Christoph Hellwig
7fd38af9ca sunrpc: move pc_count out of struct svc_procinfo
pc_count is the only writeable member of struct svc_procinfo, which is
a good candidate to be const-ified as it contains function pointers.

This patch moves it out of struct svc_procinfo, and into a
separate writable array that is pointed to by struct svc_version.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:30 +02:00
Christoph Hellwig
eb69853da9 nfsd4: properly type op_func callbacks
Pass union nfsd4_op_u to the op_func callbacks instead of using unsafe
function pointer casts.

It also adds two missing structures to struct nfsd4_op.u to facilitate
this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:29 +02:00
Christoph Hellwig
1c1226385b nfsd4: remove nfsd4op_rsize
Except for a lot of unnecessary casts this typedef only has one user,
so remove the casts and expand it in struct nfsd4_operation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:28 +02:00
Christoph Hellwig
57832e7bd8 nfsd4: properly type op_get_currentstateid callbacks
Pass union nfsd4_op_u to the op_get_currentstateid callbacks instead of
using unsafe function pointer casts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:27 +02:00
Christoph Hellwig
b60e985980 nfsd4: properly type op_set_currentstateid callbacks
Give the args union in struct nfsd4_op a name, and pass it to the
op_set_currentstateid callbacks instead of using unsafe function
pointer casts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:27 +02:00
Christoph Hellwig
63f8de3795 sunrpc: properly type pc_encode callbacks
Drop the resp argument as it can trivially be derived from the rqstp
argument.  With that all functions now have the same prototype, and we
can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:25 +02:00
Christoph Hellwig
026fec7e7c sunrpc: properly type pc_decode callbacks
Drop the argp argument as it can trivially be derived from the rqstp
argument.  With that all functions now have the same prototype, and we
can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:24 +02:00
Christoph Hellwig
8537488b5a sunrpc: properly type pc_release callbacks
Drop the p and resp arguments as they are always NULL or can trivially
be derived from the rqstp argument.  With that all functions now have the
same prototype, and we can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:23 +02:00
Christoph Hellwig
a6beb73272 sunrpc: properly type pc_func callbacks
Drop the argp and resp arguments as they can trivially be derived from
the rqstp argument.  With that all functions now have the same prototype,
and we can remove the unsafe casting to svc_procfunc as well as the
svc_procfunc typedef itself.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:23 +02:00
Christoph Hellwig
9482c9c15c nfsd: remove the unused PROC() macro in nfs3proc.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:22 +02:00
Christoph Hellwig
f7235b6bc5 nfsd: use named initializers in PROC()
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:21 +02:00
Christoph Hellwig
02be49f6b7 nfsd4: const-ify nfs_cb_version4
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:20 +02:00
Christoph Hellwig
499b498810 sunrpc: mark all struct rpc_procinfo instances as const
struct rpc_procinfo contains function pointers, and marking it as
constant avoids it being able to be used as an attack vector for
code injections.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:20 +02:00
Christoph Hellwig
f700c72dd2 nfs: use ARRAY_SIZE() in the nfsacl_version3 declaration
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:19 +02:00
Christoph Hellwig
1c5876ddbd sunrpc: move p_count out of struct rpc_procinfo
p_count is the only writeable member of struct rpc_procinfo, which is
a good candidate to be const-ified as it contains function pointers.

This patch moves it out of struct rpc_procinfo, and into a
separate writable array that is pointed to by struct rpc_version and
indexed by p_statidx.
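
As a rough analogy for the layout change, the sketch below keeps the
per-procedure description immutable and stores the call counters in a
separate, writable list indexed by a stat index. It is a Python model
with made-up names (ProcInfo, Version, call), not the sunrpc structures.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ProcInfo:
    """Immutable description of a procedure, including its handler."""
    name: str
    handler: Callable[[], str]
    statidx: int  # index into the separate counter list


@dataclass
class Version:
    procs: Tuple[ProcInfo, ...]  # read-only table of procedures
    counts: List[int]            # the only state that changes per call


def call(version, index):
    proc = version.procs[index]
    version.counts[proc.statidx] += 1  # bookkeeping lives outside ProcInfo
    return proc.handler()


version = Version(procs=(ProcInfo("null", lambda: "ok", 0),), counts=[0])
call(version, 0)
call(version, 0)
print(version.counts)
```

Because the counter no longer lives next to the handler, the table that
holds function pointers never needs to be written after initialisation.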

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-15 17:42:18 +02:00
Christoph Hellwig
cdfa31e93f lockd: fix some weird indentation
Remove double indentation of a few struct rpc_version and
struct rpc_program instances.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:17 +02:00
Christoph Hellwig
f4dac4ade5 nfs: don't cast callback decode/proc/encode routines
Instead declare all functions with the proper method signatures.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:16 +02:00
Christoph Hellwig
18d9cff400 nfs: fix decoder callback prototypes
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:16 +02:00
Christoph Hellwig
1fa2339123 lockd: fix decoder callback prototypes
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:15 +02:00
Christoph Hellwig
d39916c487 nfsd: fix decoder callback prototypes
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-05-15 17:42:14 +02:00
Christoph Hellwig
1502c81b44 nfsd: fix encoder callback prototypes
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-05-15 17:42:10 +02:00
Christoph Hellwig
0096d39b96 nfs: fix encoder callback prototypes
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:09 +02:00
Christoph Hellwig
bf96391e7b lockd: fix encoder callback prototypes
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-15 17:42:09 +02:00
Arnd Bergmann
72d42504bd ovl: select EXPORTFS
We get a link error when EXPORTFS is not enabled:

ERROR: "exportfs_encode_fh" [fs/overlayfs/overlay.ko] undefined!
ERROR: "exportfs_decode_fh" [fs/overlayfs/overlay.ko] undefined!

This adds a Kconfig 'select' statement for overlayfs, the same way that
it is done for the other users of exportfs.

Fixes: 3a1e819b4e ("ovl: store file handle of lower inode on copy up")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-15 10:53:07 +02:00
Dan Williams
f5705aa8cf dax, xfs, ext4: compile out iomap-dax paths in the FS_DAX=n case
Tetsuo reports:

  fs/built-in.o: In function `xfs_file_iomap_end':
  xfs_iomap.c:(.text+0xe0ef9): undefined reference to `put_dax'
  fs/built-in.o: In function `xfs_file_iomap_begin':
  xfs_iomap.c:(.text+0xe1a7f): undefined reference to `dax_get_by_host'
  make: *** [vmlinux] Error 1
  $ grep DAX .config
  CONFIG_DAX=m
  # CONFIG_DEV_DAX is not set
  # CONFIG_FS_DAX is not set

When FS_DAX=n we can/must throw away the dax code in filesystems.
Implement 'fs_' versions of dax_get_by_host() and put_dax() that are
nops in the FS_DAX=n case.

Cc: <linux-xfs@vger.kernel.org>
Cc: <linux-ext4@vger.kernel.org>
Cc: Jan Kara <jack@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Fixes: ef51042472 ("block, dax: move 'select DAX' from BLOCK to FS_DAX")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-13 17:52:16 -07:00
Linus Torvalds
b53c4d5eb7 This pull request contains updates for both UBI and UBIFS:
- New config option CONFIG_UBIFS_FS_SECURITY
 - Minor improvements
 - Random fixes

Merge tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS updates from Richard Weinberger:

 - new config option CONFIG_UBIFS_FS_SECURITY

 - minor improvements

 - random fixes

* tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs:
  ubi: Add debugfs file for tracking PEB state
  ubifs: Fix a typo in comment of ioctl2ubifs & ubifs2ioctl
  ubifs: Remove unnecessary assignment
  ubifs: Fix cut and paste error on sb type comparisons
  ubi: fastmap: Fix slab corruption
  ubifs: Add CONFIG_UBIFS_FS_SECURITY to disable/enable security labels
  ubi: Make mtd parameter readable
  ubi: Fix section mismatch
2017-05-13 10:23:12 -07:00
Linus Torvalds
1251704a63 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "15 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm, docs: update memory.stat description with workingset* entries
  mm: vmscan: scan until it finds eligible pages
  mm, thp: copying user pages must schedule on collapse
  dax: fix PMD data corruption when fault races with write
  dax: fix data corruption when fault races with write
  ext4: return to starting transaction in ext4_dax_huge_fault()
  mm: fix data corruption due to stale mmap reads
  dax: prevent invalidation of mapped DAX entries
  Tigran has moved
  mm, vmalloc: fix vmalloc users tracking properly
  mm/khugepaged: add missed tracepoint for collapse_huge_page_swapin
  gcov: support GCC 7.1
  mm, vmstat: Remove spurious WARN() during zoneinfo print
  time: delete current_fs_time()
  hwpoison, memcg: forcibly uncharge LRU pages
2017-05-13 09:49:35 -07:00
Steve French
67b4c889cc [CIFS] Minor cleanup of xattr query function
Some minor cleanup of cifs query xattr functions (will also make
SMB3 xattr implementation cleaner as well).

Signed-off-by: Steve French <steve.french@primarydata.com>
2017-05-12 20:59:10 -05:00
Karim Eshapa
4328fea77c fs: cifs: transport: Use time_after for time comparison
Use the time_after kernel macro, which has a safety check,
for time comparison.
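
One reason helpers in the style of time_after are preferred over a plain
">" is that tick counters wrap around. The sketch below models a 32-bit
counter and compares via the signed difference; the width and the helper
names are assumptions for illustration, not the kernel macro itself.

```python
BITS = 32  # assumed counter width for this sketch
MASK = (1 << BITS) - 1


def to_signed(value):
    """Interpret the low BITS bits of value as a two's-complement number."""
    value &= MASK
    return value - (1 << BITS) if value & (1 << (BITS - 1)) else value


def time_after(a, b):
    """True if tick a is later than tick b, even across a wraparound."""
    return to_signed(b - a) < 0


before_wrap = 0xFFFFFFF0  # shortly before the counter wraps
after_wrap = 0x00000005   # shortly after it wrapped

print(after_wrap > before_wrap)             # plain comparison
print(time_after(after_wrap, before_wrap))  # wrap-tolerant comparison
```

The signed-difference form stays correct as long as the two ticks are less
than half the counter range apart.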

Signed-off-by: Karim Eshapa <karim.eshapa@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-12 19:56:44 -05:00
Christophe JAILLET
cd1230070a SMB2: Fix share type handling
In fs/cifs/smb2pdu.h, we have:
#define SMB2_SHARE_TYPE_DISK    0x01
#define SMB2_SHARE_TYPE_PIPE    0x02
#define SMB2_SHARE_TYPE_PRINT   0x03

Knowing that, with the current code, the SMB2_SHARE_TYPE_PRINT case can
never trigger and a printer share would be interpreted as a disk share.

So, test the ShareType value for equality instead.
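
The effect is easy to check with the three values quoted above. The sketch
assumes the old code tested the share type with a bitwise AND in the order
DISK, PIPE, PRINT; the function names are made up for this example.

```python
SMB2_SHARE_TYPE_DISK = 0x01
SMB2_SHARE_TYPE_PIPE = 0x02
SMB2_SHARE_TYPE_PRINT = 0x03


def classify_bitwise(share_type):
    """Assumed old behaviour: bit tests, so 0x03 already matches DISK."""
    if share_type & SMB2_SHARE_TYPE_DISK:
        return "disk"
    if share_type & SMB2_SHARE_TYPE_PIPE:
        return "pipe"
    if share_type & SMB2_SHARE_TYPE_PRINT:
        return "print"
    return "unknown"


def classify_equal(share_type):
    """Behaviour after the fix: compare the value for equality."""
    if share_type == SMB2_SHARE_TYPE_DISK:
        return "disk"
    if share_type == SMB2_SHARE_TYPE_PIPE:
        return "pipe"
    if share_type == SMB2_SHARE_TYPE_PRINT:
        return "print"
    return "unknown"


for value in (0x01, 0x02, 0x03):
    print(hex(value), classify_bitwise(value), classify_equal(value))
```

Since 0x03 has both low bits set, the bitwise version can never reach its
"print" branch for any of the three defined values.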

Fixes: faaf946a7d ("CIFS: Add tree connect/disconnect capability for SMB2")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-12 19:55:56 -05:00
Joe Perches via samba-technical
ecdcf622eb cifs: cifsacl: Use a temporary ops variable to reduce code length
Create an ops variable to store tcon->ses->server->ops and cache
indirections and reduce code size a trivial bit.

$ size fs/cifs/cifsacl.o*
   text	   data	    bss	    dec	    hex	filename
   5338	    136	      8	   5482	   156a	fs/cifs/cifsacl.o.new
   5371	    136	      8	   5515	   158b	fs/cifs/cifsacl.o.old

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-12 19:45:18 -05:00
Ross Zwisler
876f29460c dax: fix PMD data corruption when fault races with write
This is based on a patch from Jan Kara that fixed the equivalent race in
the DAX PTE fault path.

Currently DAX PMD read fault can race with write(2) in the following
way:

CPU1 - write(2)                 CPU2 - read fault
                                dax_iomap_pmd_fault()
                                  ->iomap_begin() - sees hole

dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate

                                  grab_mapping_entry()
				  - we add huge zero page to the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault.  That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6 ("dax: Call ->iomap_begin without entry lock during dax fault")
Link: http://lkml.kernel.org/r/20170510172700.18991-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:16 -07:00
Jan Kara
13e451fdc1 dax: fix data corruption when fault races with write
Currently DAX read fault can race with write(2) in the following way:

CPU1 - write(2)			CPU2 - read fault
				dax_iomap_pte_fault()
				  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
				  grab_mapping_entry()
				  - we add zero page in the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault.  That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).
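
The two orderings can be replayed deterministically with a toy model. This
is only a sequencing sketch with invented names (DaxFile and the helper
functions); it does not use real locks or kernel data structures, and the
"locked" variant simply encodes the assumption that the invalidation waits
until the fault has finished.

```python
class DaxFile:
    """Toy state: whether blocks exist and what the page tables map."""

    def __init__(self):
        self.blocks_allocated = False
        self.mapped = None  # None, "zero page" or "data"

    def is_stale(self):
        return self.blocks_allocated and self.mapped == "zero page"


def fault_lookup(f):
    """Fault path block lookup: returns True if it sees a hole."""
    return not f.blocks_allocated


def fault_map(f, saw_hole):
    """Fault path: insert the entry and map it into the page tables."""
    f.mapped = "zero page" if saw_hole else "data"


def write_allocate(f):
    f.blocks_allocated = True


def write_invalidate(f):
    """Write path: drop whatever is currently mapped for the range."""
    f.mapped = None


def unlocked_interleaving():
    f = DaxFile()
    saw_hole = fault_lookup(f)  # CPU2: sees a hole
    write_allocate(f)           # CPU1: allocates blocks
    write_invalidate(f)         # CPU1: nothing to invalidate yet
    fault_map(f, saw_hole)      # CPU2: maps the zero page anyway
    return f


def locked_interleaving():
    f = DaxFile()
    saw_hole = fault_lookup(f)  # CPU2: entry already locked, sees a hole
    write_allocate(f)           # CPU1: allocates blocks
    fault_map(f, saw_hole)      # CPU2: finishes the fault, unlocks
    write_invalidate(f)         # CPU1: was blocked, now unmaps the zero page
    return f


print(unlocked_interleaving().is_stale(), locked_interleaving().is_stale())
```

In the second ordering the next fault would look up blocks again and map
the written data.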

Fixes: 9f141d6ef6
Link: http://lkml.kernel.org/r/20170510085419.27601-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:16 -07:00
Jan Kara
fb26a1cbed ext4: return to starting transaction in ext4_dax_huge_fault()
DAX will return to locking exceptional entry before mapping blocks for a
page fault to fix possible races with concurrent writes.  To avoid lock
inversion between exceptional entry lock and transaction start, start
the transaction already in ext4_dax_huge_fault().

Fixes: 9f141d6ef6
Link: http://lkml.kernel.org/r/20170510085419.27601-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:16 -07:00
Jan Kara
cd656375f9 mm: fix data corruption due to stale mmap reads
Currently, we don't invalidate page tables during invalidate_inode_pages2()
for DAX.  That could result in e.g. 2MiB zero page being mapped into
page tables while there were already underlying blocks allocated and
thus data seen through mmap were different from data seen by read(2).
The following sequence reproduces the problem:

 - open an mmap over a 2MiB hole

 - read from a 2MiB hole, faulting in a 2MiB zero page

 - write to the hole with write(3p). The write succeeds but we
   incorrectly leave the 2MiB zero page mapping intact.

 - via the mmap, read the data that was just written. Since the zero
   page mapping is still intact we read back zeroes instead of the new
   data.

Fix the problem by unconditionally calling invalidate_inode_pages2_range()
in dax_iomap_actor() for new block allocations and by properly
invalidating page tables in invalidate_inode_pages2_range() for DAX
mappings.
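
The reproduction sequence above can be approximated from userspace as
follows. This is a sketch, not the t_mmap_stale test: the function names
and the temporary file location are assumptions, and on an ordinary
(non-DAX) filesystem it only demonstrates the expected, coherent result.

```python
import mmap
import os
import tempfile

HOLE_SIZE = 2 * 1024 * 1024  # 2MiB, matching the description above


def mmap_then_write(path):
    """Map a 2MiB hole, read it, write through the fd, read the map again."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, HOLE_SIZE)  # sparse file: one large hole
        with mmap.mmap(fd, HOLE_SIZE, access=mmap.ACCESS_READ) as view:
            before = view[:4]          # read fault on the hole
            os.pwrite(fd, b"data", 0)  # write into the hole
            after = view[:4]           # read back through the mapping
        return before, after
    finally:
        os.close(fd)


def run_demo():
    with tempfile.TemporaryDirectory() as directory:
        return mmap_then_write(os.path.join(directory, "hole"))


before, after = run_demo()
print(before, after)
```

A stale mapping of the kind described would show up as the second read
still returning zeroes instead of the bytes just written.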

Fixes: c6dcf52c23
Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:15 -07:00
Ross Zwisler
4636e70bb0 dax: prevent invalidation of mapped DAX entries
Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
v4.

This series fixes data corruption that can happen for DAX mounts when
page faults race with write(2) and as a result page tables get out of
sync with block mappings in the filesystem and thus data seen through
mmap is different from data seen through read(2).

The series passes testing with t_mmap_stale test program from Ross and
also other mmap related tests on DAX filesystem.

This patch (of 4):

dax_invalidate_mapping_entry() currently removes DAX exceptional entries
only if they are clean and unlocked.  This is done via:

  invalidate_mapping_pages()
    invalidate_exceptional_entry()
      dax_invalidate_mapping_entry()

However, for page cache pages removed in invalidate_mapping_pages()
there is an additional criterion, which is that the page must not be
mapped.  This is noted in the comments above invalidate_mapping_pages()
and is checked in invalidate_inode_page().

For DAX entries this means that we can end up in a situation where a
DAX exceptional entry, either a huge zero page or a regular DAX entry,
could end up mapped but without an associated radix tree entry.  This is
inconsistent with the rest of the DAX code and with what happens in the
page cache case.

We aren't able to unmap the DAX exceptional entry because according to
its comments invalidate_mapping_pages() isn't allowed to block, and
unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.

Since we essentially never have unmapped DAX entries to evict from the
radix tree, just remove dax_invalidate_mapping_entry().

Fixes: c6dcf52c23 ("mm: Invalidate DAX radix tree entries only if appropriate")
Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.cz
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>    [4.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:15 -07:00
Andrew Morton
cea582247a Tigran has moved
Cc: Tigran Aivazian <aivazian.tigran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:15 -07:00
Linus Torvalds
0fcc3ab23d Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
 "Incremental fixes and a small feature addition on top of the main
  libnvdimm 4.12 pull request:

   - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
     The size regression is fixed by moving all dax helpers into the
     dax-core and only specifying "select DAX" for FS_DAX and
     dax-capable drivers. He also asked for clarification of the
     NR_DEV_DAX config option which, on closer look, does not need to be
     a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
     for good measure.

   - Ben's attention to detail on -stable patch submissions caught a
     case where the recent fixes to arch_copy_from_iter_pmem() missed a
     condition where we strand dirty data in the cache. This is tagged
     for -stable and will also be included in the rework of the pmem api
     to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

   - Vishal adds a feature that missed the initial pull due to pending
     review feedback. It allows the kernel to clear media errors when
     initializing a BTT (atomic sector update driver) instance on a pmem
     namespace.

   - Ross noticed that the dax_device + dax_operations conversion broke
     __dax_zero_page_range(). The nvdimm unit tests fail to check this
     path, but xfstests immediately trips over it. No excuse for missing
     this before submitting the 4.12 pull request.

  These all pass the nvdimm unit tests and an xfstests spot check. The
  set has received a build success notification from the kbuild robot"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  filesystem-dax: fix broken __dax_zero_page_range() conversion
  libnvdimm, btt: ensure that initializing metadata clears poison
  libnvdimm: add an atomic vs process context flag to rw_bytes
  x86, pmem: Fix cache flushing for iovec write < 8 bytes
  device-dax: kill NR_DEV_DAX
  block, dax: move "select DAX" from BLOCK to FS_DAX
  device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX
2017-05-12 15:43:10 -07:00
Linus Torvalds
050453295f Merge branch 'work.sane_pwd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Making sure that something like a referral point won't end up as pwd
  or root.

  The main part is the last commit (fixing mntns_install()); that one
  fixes a hard-to-hit race. The fchdir() commit is making fchdir(2) a
  bit more robust - it should be impossible to get opened files (even
  O_PATH ones) for referral points in the first place, so the existing
  checks are OK, but checking the same thing as in chdir(2) is just as
  cheap.

  The path_init() commit removes a redundant check that shouldn't have
  been there in the first place"

* 'work.sane_pwd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make sure that mntns_install() doesn't end up with referral for root
  path_init(): don't bother with checking MAY_EXEC for LOOKUP_ROOT
  make sure that fchdir() won't accept referral points, etc.
2017-05-12 11:39:59 -07:00
Linus Torvalds
9786e34e0a MTD updates for 4.12-rc1:
NAND, from Boris:
 """
  - some minor fixes/improvements on existing drivers (fsmc, gpio, ifc,
    davinci, brcmnand, omap)
  - a huge cleanup/rework of the denali driver accompanied with core
    fixes/improvements to simplify the driver code
  - a complete rewrite of the atmel driver to support new DT bindings
    and make future evolution easier
  - the addition of per-vendor detection/initialization steps to avoid
    extending the nand_ids table with more extended-id entries
 """
 
 SPI NOR, from Cyrille:
 """
 - fixes in the hisi SPI controller driver.
 - fixes in the intel SPI controller driver.
 - fixes in the Mediatek SPI controller driver.
 - fixes to some SPI flash memories not supporting the Chip Erase command.
 - add support for some new memory parts (Winbond, Macronix, Micron, ESMT).
 - add new driver for the STM32 QSPI controller.
 """
 
 And a few fixes for Gemini and Versatile platforms on physmap-of

Merge tag 'for-linus-20170510' of git://git.infradead.org/linux-mtd

Pull MTD updates from Brian Norris:
 "NAND, from Boris:
   - some minor fixes/improvements on existing drivers (fsmc, gpio, ifc,
     davinci, brcmnand, omap)
   - a huge cleanup/rework of the denali driver accompanied with core
     fixes/improvements to simplify the driver code
   - a complete rewrite of the atmel driver to support new DT bindings
     and make future evolution easier
   - the addition of per-vendor detection/initialization steps to avoid
     extending the nand_ids table with more extended-id entries

  SPI NOR, from Cyrille:
   - fixes in the hisi, intel and Mediatek SPI controller drivers
   - fixes to some SPI flash memories not supporting the Chip Erase
     command.
   - add support for some new memory parts (Winbond, Macronix, Micron,
     ESMT).
   - add new driver for the STM32 QSPI controller

  And a few fixes for Gemini and Versatile platforms on physmap-of"

* tag 'for-linus-20170510' of git://git.infradead.org/linux-mtd: (100 commits)
  MAINTAINERS: Update NAND subsystem git repositories
  mtd: nand: gpio: update binding
  mtd: nand: add ooblayout for old hamming layout
  mtd: oxnas_nand: Allocating more than necessary in probe()
  dt-bindings: mtd: Document the STM32 QSPI bindings
  mtd: mtk-nor: set controller's address width according to nor flash
  mtd: spi-nor: add driver for STM32 quad spi flash controller
  mtd: nand: brcmnand: Check flash #WP pin status before nand erase/program
  mtd: nand: davinci: add comment on NAND subpage write status on keystone
  mtd: nand: omap2: Fix partition creation via cmdline mtdparts
  mtd: nand: NULL terminate a of_device_id table
  mtd: nand: Fix a couple error codes
  mtd: nand: allow drivers to request minimum alignment for passed buffer
  mtd: nand: allocate aligned buffers if NAND_OWN_BUFFERS is unset
  mtd: nand: denali: allow to override revision number
  mtd: nand: denali_dt: use pdev instead of ofdev for platform_device
  mtd: nand: denali_dt: remove dma-mask DT property
  mtd: nand: denali: support 64bit capable DMA engine
  mtd: nand: denali_dt: enable HW_ECC_FIXUP for Altera SOCFPGA variant
  mtd: nand: denali: support HW_ECC_FIXUP capability
  ...
2017-05-11 10:44:22 -07:00
Dan Williams
e84b83b9ee filesystem-dax: fix broken __dax_zero_page_range() conversion
The conversion of __dax_zero_page_range() to 'struct dax_operations'
caused it to frequently fail. The mistake was treating the @size
parameter as a dax mapping length rather than just a length of the
clear_pmem() operation. The dax mapping length is assumed to be hard
coded as PAGE_SIZE.

Without this fix any page unaligned zeroing request will trigger a
-EINVAL return from bdev_dax_pgoff().
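
The failure mode can be illustrated with a small userspace model. This is
only a sketch under stated assumptions: model_pgoff_check(),
model_zero_range_broken(), model_zero_range_fixed() and MODEL_PAGE_SIZE are
hypothetical names, and the model assumes the check rejects any offset or
mapping length that is not a multiple of the page size. It is not the
kernel code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this model only; the kernel uses PAGE_SIZE. */
#define MODEL_PAGE_SIZE 4096u

/*
 * Hypothetical stand-in for an alignment check on a dax mapping:
 * both the byte offset and the mapping length must be page aligned.
 */
static int model_pgoff_check(uint64_t phys_off, size_t map_len)
{
	if (phys_off % MODEL_PAGE_SIZE || map_len % MODEL_PAGE_SIZE)
		return -EINVAL;
	return 0;
}

/* Broken conversion: the length to clear is passed as the mapping length. */
static int model_zero_range_broken(uint64_t phys_off, size_t clear_len)
{
	return model_pgoff_check(phys_off, clear_len);
}

/* Fixed conversion: the mapping length is always exactly one page. */
static int model_zero_range_fixed(uint64_t phys_off, size_t clear_len)
{
	(void)clear_len; /* only relevant to the clear itself, not the check */
	return model_pgoff_check(phys_off, MODEL_PAGE_SIZE);
}
```

In this model a request to zero 100 bytes fails in the broken variant and
succeeds in the fixed one, while a full-page request passes in both.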

Cc: Jan Kara <jack@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: cccbce6715 ("filesystem-dax: convert to dax_direct_access()")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-10 21:46:55 -07:00
Linus Torvalds
291b38a756 Annotation of module parameters that specify device settings

Merge tag 'hwparam-20170420' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull hw lockdown support from David Howells:
 "Annotation of module parameters that configure hardware resources
  including ioports, iomem addresses, irq lines and dma channels.

  This allows a future patch to prohibit the use of such module
  parameters to prevent that hardware from being abused to gain access
  to the running kernel image as part of locking the kernel down under
  UEFI secure boot conditions.

  Annotations are made by changing:

        module_param(n, t, p)
        module_param_named(n, v, t, p)
        module_param_array(n, t, m, p)

  to:

        module_param_hw(n, t, hwtype, p)
        module_param_hw_named(n, v, t, hwtype, p)
        module_param_hw_array(n, t, hwtype, m, p)

  where the module parameter refers to a hardware setting

  hwtype specifies the type of the resource being configured. This can
  be one of:

        ioport          Module parameter configures an I/O port
        iomem           Module parameter configures an I/O mem address
        ioport_or_iomem Module parameter could be either (runtime set)
        irq             Module parameter configures an IRQ line
        dma             Module parameter configures a DMA channel
        dma_addr        Module parameter configures a DMA buffer address
        other           Module parameter configures some other value

  Note that the hwtype is compile checked, but not currently stored (the
  lockdown code probably won't require it). It is, however, there for
  future use.

  A bonus is that the hwtype can also be used for grepping.

  The intention is for the kernel to ignore or reject attempts to set
  annotated module parameters if lockdown is enabled. This applies to
  options passed on the boot command line, passed to insmod/modprobe or
  direct twiddling in /sys/module/ parameter files.

  The module initialisation then needs to handle the parameter not being
  set, by (1) giving an error, (2) probing for a value or (3) using a
  reasonable default.

  What I can't do is just reject a module out of hand because it may
  take a hardware setting in the module parameters. Some important
  modules, some ipmi stuff for instance, both probe for hardware and
  allow hardware to be manually specified; if the driver aborts with
  any error, you don't get any ipmi hardware.

  Further, trying to do this entirely in the module initialisation code
  doesn't protect against sysfs twiddling.

  [!] Note that in and of itself, this series of patches should have no
      effect on the size of the kernel or code execution - that is
      left to a patch in the next series to effect. It does mark
      annotated kernel parameters with a KERNEL_PARAM_FL_HWPARAM flag in
      an already existing field"
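
The hwtype annotation described above can be sketched as a userspace model.
Everything named model_* or MODEL_* below is hypothetical, including the
flag's bit value; the sketch only assumes that the hwtype is turned into an
enum constant by token pasting (so a misspelled hwtype fails to compile) and
that a later lockdown policy refuses to set flagged parameters. The pull
message itself says this series does not yet enforce anything.

```c
#include <stdbool.h>

/* Hypothetical model of the hwtype names listed in the pull message. */
enum model_hwparam_type {
	model_hwparam_ioport,
	model_hwparam_iomem,
	model_hwparam_ioport_or_iomem,
	model_hwparam_irq,
	model_hwparam_dma,
	model_hwparam_dma_addr,
	model_hwparam_other,
};

/* Stand-in for the hardware-parameter flag; the bit value is made up. */
#define MODEL_PARAM_FL_HWPARAM 0x2u

struct model_param {
	const char *name;
	int *value;
	unsigned int flags;
};

/*
 * Token pasting turns the hwtype argument into an enum constant, so an
 * unknown hwtype is a compile error. The constant is only inspected by
 * sizeof and then dropped: checked at compile time, but not stored.
 */
#define MODEL_PARAM_HW(var, hwtype) \
	{ #var, &(var), \
	  (sizeof(model_hwparam_##hwtype) ? MODEL_PARAM_FL_HWPARAM : 0u) }

/* Model of the intended policy: reject annotated parameters under lockdown. */
static bool model_param_set(const struct model_param *p, int new_value,
			    bool locked_down)
{
	if (locked_down && (p->flags & MODEL_PARAM_FL_HWPARAM))
		return false;
	*p->value = new_value;
	return true;
}

/* One annotated and one ordinary parameter for illustration. */
static int io_base = 0x378;
static int debug_level;
static struct model_param io_param = MODEL_PARAM_HW(io_base, ioport);
static struct model_param debug_param = { "debug_level", &debug_level, 0 };
```

With lockdown modelled as enabled, setting io_param is refused and io_base
keeps its default, which is why module initialisation has to cope with the
parameter not being set; debug_param stays settable in both modes.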

* tag 'hwparam-20170420' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (38 commits)
  Annotate hardware config module parameters in sound/pci/
  Annotate hardware config module parameters in sound/oss/
  Annotate hardware config module parameters in sound/isa/
  Annotate hardware config module parameters in sound/drivers/
  Annotate hardware config module parameters in fs/pstore/
  Annotate hardware config module parameters in drivers/watchdog/
  Annotate hardware config module parameters in drivers/video/
  Annotate hardware config module parameters in drivers/tty/
  Annotate hardware config module parameters in drivers/staging/vme/
  Annotate hardware config module parameters in drivers/staging/speakup/
  Annotate hardware config module parameters in drivers/staging/media/
  Annotate hardware config module parameters in drivers/scsi/
  Annotate hardware config module parameters in drivers/pcmcia/
  Annotate hardware config module parameters in drivers/pci/hotplug/
  Annotate hardware config module parameters in drivers/parport/
  Annotate hardware config module parameters in drivers/net/wireless/
  Annotate hardware config module parameters in drivers/net/wan/
  Annotate hardware config module parameters in drivers/net/irda/
  Annotate hardware config module parameters in drivers/net/hamradio/
  Annotate hardware config module parameters in drivers/net/ethernet/
  ...
2017-05-10 19:13:03 -07:00
Linus Torvalds
c70422f760 Another RDMA update from Chuck Lever, and a bunch of miscellaneous
bugfixes.

Merge tag 'nfsd-4.12' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Another RDMA update from Chuck Lever, and a bunch of miscellaneous
  bugfixes"

* tag 'nfsd-4.12' of git://linux-nfs.org/~bfields/linux: (26 commits)
  nfsd: Fix up the "supattr_exclcreat" attributes
  nfsd: encoders mustn't use unitialized values in error cases
  nfsd: fix undefined behavior in nfsd4_layout_verify
  lockd: fix lockd shutdown race
  NFSv4: Fix callback server shutdown
  SUNRPC: Refactor svc_set_num_threads()
  NFSv4.x/callback: Create the callback service through svc_create_pooled
  lockd: remove redundant check on block
  svcrdma: Clean out old XDR encoders
  svcrdma: Remove the req_map cache
  svcrdma: Remove unused RDMA Write completion handler
  svcrdma: Reduce size of sge array in struct svc_rdma_op_ctxt
  svcrdma: Clean up RPC-over-RDMA backchannel reply processing
  svcrdma: Report Write/Reply chunk overruns
  svcrdma: Clean up RDMA_ERROR path
  svcrdma: Use rdma_rw API in RPC reply path
  svcrdma: Introduce local rdma_rw API helpers
  svcrdma: Clean up svc_rdma_get_inv_rkey()
  svcrdma: Add helper to save pages under I/O
  svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT
  ...
2017-05-10 13:29:23 -07:00
Linus Torvalds
73ccb023a2 NFS client updates for Linux 4.12
Highlights include:
 
 Stable bugfixes:
 - Fix use after free in write error path
 - Use GFP_NOIO for two allocations in writeback
 - Fix a hang in OPEN related to server reboot
 - Check the result of nfs4_pnfs_ds_connect
 - Fix an rcu lock leak
 
 Features:
 - Removal of the unmaintained and unused OSD pNFS layout
 - Cleanup and removal of lots of unnecessary dprintk()s
 - Cleanup and removal of some memory failure paths now that
   GFP_NOFS is guaranteed to never fail.
 - Remove the v3-only data server limitation on pNFS/flexfiles
 
 Bugfixes:
 - RPC/RDMA connection handling bugfixes
 - Copy offload: fixes to ensure the copied data is COMMITed to disk.
 - Readdir: switch back to using the ->iterate VFS interface
 - File locking fixes from Ben Coddington
 - Various use-after-free and deadlock issues in pNFS
 - Write path bugfixes

Merge tag 'nfs-for-4.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable bugfixes:
   - Fix use after free in write error path
   - Use GFP_NOIO for two allocations in writeback
   - Fix a hang in OPEN related to server reboot
   - Check the result of nfs4_pnfs_ds_connect
   - Fix an rcu lock leak

  Features:
   - Removal of the unmaintained and unused OSD pNFS layout
   - Cleanup and removal of lots of unnecessary dprintk()s
   - Cleanup and removal of some memory failure paths now that GFP_NOFS
     is guaranteed to never fail.
   - Remove the v3-only data server limitation on pNFS/flexfiles

  Bugfixes:
   - RPC/RDMA connection handling bugfixes
   - Copy offload: fixes to ensure the copied data is COMMITed to disk.
   - Readdir: switch back to using the ->iterate VFS interface
   - File locking fixes from Ben Coddington
   - Various use-after-free and deadlock issues in pNFS
   - Write path bugfixes"

* tag 'nfs-for-4.12-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (89 commits)
  pNFS/flexfiles: Always attempt to call layoutstats when flexfiles is enabled
  NFSv4.1: Work around a Linux server bug...
  NFS append COMMIT after synchronous COPY
  NFSv4: Fix exclusive create attributes encoding
  NFSv4: Fix an rcu lock leak
  nfs: use kmap/kunmap directly
  NFS: always treat the invocation of nfs_getattr as cache hit when noac is on
  Fix nfs_client refcounting if kmalloc fails in nfs4_proc_exchange_id and nfs4_proc_async_renew
  NFSv4.1: RECLAIM_COMPLETE must handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION
  pNFS: Fix NULL dereference in pnfs_generic_alloc_ds_commits
  pNFS: Fix a typo in pnfs_generic_alloc_ds_commits
  pNFS: Fix a deadlock when coalescing writes and returning the layout
  pNFS: Don't clear the layout return info if there are segments to return
  pNFS: Ensure we commit the layout if it has been invalidated
  pNFS: Don't send COMMITs to the DSes if the server invalidated our layout
  pNFS/flexfiles: Fix up the ff_layout_write_pagelist failure path
  pNFS: Ensure we check layout validity before marking it for return
  NFS4.1 handle interrupted slot reuse from ERR_DELAY
  NFSv4: check return value of xdr_inline_decode
  nfs/filelayout: fix NULL pointer dereference in fl_pnfs_update_layout()
  ...
2017-05-10 13:03:38 -07:00
Trond Myklebust
b26b78cb72 nfsd: Fix up the "supattr_exclcreat" attributes
If an NFSv4 client asks us for the supattr_exclcreat, then we must
not return attributes that are unsupported by this minor version.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: 75976de655 ("NFSD: Return word2 bitmask if setting security..,")
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-05-10 14:30:10 -04:00
J. Bruce Fields
f961e3f2ac nfsd: encoders mustn't use unitialized values in error cases
In error cases, lgp->lg_layout_type may be out of bounds; so we
shouldn't be using it until after the check of nfserr.

This was seen to crash nfsd threads when the server receives a LAYOUTGET
request with a large layout type.

GETDEVICEINFO has the same problem.
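
The ordering issue can be sketched in userspace as follows. This is an
illustrative model, not the nfsd encoder: model_layout_ops,
model_encode_layoutget() and MODEL_LAYOUT_TYPE_MAX are hypothetical names,
and the model assumes that on the success path the layout type has already
been validated against the table size.

```c
/* Assumed table size for this model only. */
#define MODEL_LAYOUT_TYPE_MAX 4u

typedef int (*model_encode_fn)(void);

static int model_encode_ok(void)
{
	return 0;
}

/* Hypothetical per-layout-type ops table. */
static const model_encode_fn model_layout_ops[MODEL_LAYOUT_TYPE_MAX] = {
	model_encode_ok, model_encode_ok, model_encode_ok, model_encode_ok,
};

/*
 * Check the operation status first. On error the layout type was never
 * validated, so it must not be used as an index into the ops table.
 */
static int model_encode_layoutget(int nfserr, unsigned int layout_type)
{
	if (nfserr)
		return nfserr;
	return model_layout_ops[layout_type]();
}
```

Here an error status is returned before the table is touched, so an
out-of-range layout type in a failed request is harmless.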

Reported-by: Ari Kauppi <Ari.Kauppi@synopsys.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-05-10 14:25:19 -04:00
Linus Torvalds
de4d195308 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes are:

   - Debloat RCU headers

   - Parallelize SRCU callback handling (plus overlapping patches)

   - Improve the performance of Tree SRCU on a CPU-hotplug stress test

   - Documentation updates

   - Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
  rcu: Open-code the rcu_cblist_n_lazy_cbs() function
  rcu: Open-code the rcu_cblist_n_cbs() function
  rcu: Open-code the rcu_cblist_empty() function
  rcu: Separately compile large rcu_segcblist functions
  srcu: Debloat the <linux/rcu_segcblist.h> header
  srcu: Adjust default auto-expediting holdoff
  srcu: Specify auto-expedite holdoff time
  srcu: Expedite first synchronize_srcu() when idle
  srcu: Expedited grace periods with reduced memory contention
  srcu: Make rcutorture writer stalls print SRCU GP state
  srcu: Exact tracking of srcu_data structures containing callbacks
  srcu: Make SRCU be built by default
  srcu: Fix Kconfig botch when SRCU not selected
  rcu: Make non-preemptive schedule be Tasks RCU quiescent state
  srcu: Expedite srcu_schedule_cbs_snp() callback invocation
  srcu: Parallelize callback handling
  kvm: Move srcu_struct fields to end of struct kvm
  rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
  rcu: Use true/false in assignment to bool
  rcu: Use bool value directly
  ...
2017-05-10 10:30:46 -07:00
Linus Torvalds
b948abf53a Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs update from Miklos Szeredi:
 "The biggest part of this is making st_dev/st_ino on the overlay behave
  like a normal filesystem (i.e. st_ino doesn't change on copy up,
  st_dev is the same for all files and directories). Currently this only
  works if all layers are on the same filesystem, but future work will
  move the general case towards more sane behavior.

  There are also miscellaneous fixes, including fixes to handling
  append-only files. There's a small change in the VFS, but that only
  has an effect on overlayfs, since otherwise file->f_path.dentry->inode
  and file_inode(file) are always the same"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: update documentation w.r.t. constant inode numbers
  ovl: persistent inode numbers for upper hardlinks
  ovl: merge getattr for dir and nondir
  ovl: constant st_ino/st_dev across copy up
  ovl: persistent inode number for directories
  ovl: set the ORIGIN type flag
  ovl: lookup non-dir copy-up-origin by file handle
  ovl: use an auxiliary var for overlay root entry
  ovl: store file handle of lower inode on copy up
  ovl: check if all layers are on the same fs
  ovl: do not set overlay.opaque on non-dir create
  ovl: check IS_APPEND() on real upper inode
  vfs: ftruncate check IS_APPEND() on real upper inode
  ovl: Use designated initializers
  ovl: lockdep annotate of nested stacked overlayfs inode lock
2017-05-10 09:03:48 -07:00
Linus Torvalds
a2e5ad45a9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
 "Support for pid namespaces from Seth and refcount_t work from Elena"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: Add support for pid namespaces
  fuse: convert fuse_conn.count from atomic_t to refcount_t
  fuse: convert fuse_req.count from atomic_t to refcount_t
  fuse: convert fuse_file.count from atomic_t to refcount_t
2017-05-10 08:45:30 -07:00
Linus Torvalds
26c5eaa132 The two main items are support for disabling automatic rbd exclusive
lock transfers from myself and the long awaited -ENOSPC handling series
 from Jeff.  The former will allow rbd users to take advantage of
 exclusive lock's built-in blacklist/break-lock functionality while
 staying in control of who owns the lock.  With the latter in place, we
 will abort filesystem writes on -ENOSPC instead of having them block
 indefinitely.
 
 Beyond that we've got the usual pile of filesystem fixes from Zheng,
 some refcount_t conversion patches from Elena and a patch for an
 ancient open() flags handling bug from Alexander.

Merge tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The two main items are support for disabling automatic rbd exclusive
  lock transfers from myself and the long awaited -ENOSPC handling
  series from Jeff.

  The former will allow rbd users to take advantage of exclusive lock's
  built-in blacklist/break-lock functionality while staying in control
  of who owns the lock. With the latter in place, we will abort
  filesystem writes on -ENOSPC instead of having them block
  indefinitely.

  Beyond that we've got the usual pile of filesystem fixes from Zheng,
  some refcount_t conversion patches from Elena and a patch for an
  ancient open() flags handling bug from Alexander"

* tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-client: (31 commits)
  ceph: fix memory leak in __ceph_setxattr()
  ceph: fix file open flags on ppc64
  ceph: choose readdir frag based on previous readdir reply
  rbd: exclusive map option
  rbd: return ResponseMessage result from rbd_handle_request_lock()
  rbd: kill rbd_is_lock_supported()
  rbd: support updating the lock cookie without releasing the lock
  rbd: store lock cookie
  rbd: ignore unlock errors
  rbd: fix error handling around rbd_init_disk()
  rbd: move rbd_unregister_watch() call into rbd_dev_image_release()
  rbd: move rbd_dev_destroy() call out of rbd_dev_image_release()
  ceph: when seeing write errors on an inode, switch to sync writes
  Revert "ceph: SetPageError() for writeback pages if writepages fails"
  ceph: handle epoch barriers in cap messages
  libceph: add an epoch_barrier field to struct ceph_osd_client
  libceph: abort already submitted but abortable requests when map or pool goes full
  libceph: allow requests to return immediately on full conditions if caller wishes
  libceph: remove req->r_replay_version
  ceph: make seeky readdir more efficient
  ...
2017-05-10 08:42:33 -07:00
Linus Torvalds
1176032cb1 Merge branch 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This has fixes and cleanups Dave Sterba collected for the merge
  window.

  The biggest functional fixes are between btrfs raid5/6 and scrub, and
  raid5/6 and device replacement. Some of our pending qgroup fixes are
  included as well while I bash on the rest in testing.

  We also have the usual set of cleanups, including one that makes
  __btrfs_map_block() much more maintainable, and conversions from
  atomic_t to refcount_t"

* 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (71 commits)
  btrfs: fix the gfp_mask for the reada_zones radix tree
  Btrfs: fix reported number of inode blocks
  Btrfs: send, fix file hole not being preserved due to inline extent
  Btrfs: fix extent map leak during fallocate error path
  Btrfs: fix incorrect space accounting after failure to insert inline extent
  Btrfs: fix invalid attempt to free reserved space on failure to cow range
  btrfs: Handle delalloc error correctly to avoid ordered extent hang
  btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
  btrfs: check if the device is flush capable
  btrfs: delete unused member nobarriers
  btrfs: scrub: Fix RAID56 recovery race condition
  btrfs: scrub: Introduce full stripe lock for RAID56
  btrfs: Use ktime_get_real_ts for root ctime
  Btrfs: handle only applicable errors returned by btrfs_get_extent
  btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option
  btrfs: use q which is already obtained from bdev_get_queue
  Btrfs: switch to div64_u64 if with a u64 divisor
  Btrfs: update scrub_parity to use u64 stripe_len
  Btrfs: enable repair during read for raid56 profile
  btrfs: use clear_page where appropriate
  ...
2017-05-10 08:33:17 -07:00
Steve French
de1892b887 Don't delay freeing mids when blocked on slow socket write of request
When processing responses, and in particular when freeing mids (DeleteMidQEntry),
which is very important since it also frees the associated buffers (cifs_buf_release),
we can block for a long time if writes to the socket are slow due to low memory or
networking issues.

We can block in send (smb request) waiting for memory, and be blocked in processing
responses (which could free memory if we let it), since they both grab the
server->srv_mutex.

In practice, in the DeleteMidQEntry case there is no reason we need to
grab the srv_mutex, so remove it around DeleteMidQEntry; this allows
us to free memory faster.

Signed-off-by: Steve French <steve.french@primarydata.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-05-09 20:37:32 -05:00
Rabin Vincent
560d388950 CIFS: silence lockdep splat in cifs_relock_file()
cifs_relock_file() can perform a down_write() on the inode's lock_sem even
though it was already performed in cifs_strict_readv().  Lockdep complains
about this.  AFAICS, there is no problem here, and lockdep just needs to be
told that this nesting is OK.

 =============================================
 [ INFO: possible recursive locking detected ]
 4.11.0+ #20 Not tainted
 ---------------------------------------------
 cat/701 is trying to acquire lock:
  (&cifsi->lock_sem){++++.+}, at: cifs_reopen_file+0x7a7/0xc00

 but task is already holding lock:
  (&cifsi->lock_sem){++++.+}, at: cifs_strict_readv+0x177/0x310

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&cifsi->lock_sem);
   lock(&cifsi->lock_sem);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 1 lock held by cat/701:
  #0:  (&cifsi->lock_sem){++++.+}, at: cifs_strict_readv+0x177/0x310

 stack backtrace:
 CPU: 0 PID: 701 Comm: cat Not tainted 4.11.0+ #20
 Call Trace:
  dump_stack+0x85/0xc2
  __lock_acquire+0x17dd/0x2260
  ? trace_hardirqs_on_thunk+0x1a/0x1c
  ? preempt_schedule_irq+0x6b/0x80
  lock_acquire+0xcc/0x260
  ? lock_acquire+0xcc/0x260
  ? cifs_reopen_file+0x7a7/0xc00
  down_read+0x2d/0x70
  ? cifs_reopen_file+0x7a7/0xc00
  cifs_reopen_file+0x7a7/0xc00
  ? printk+0x43/0x4b
  cifs_readpage_worker+0x327/0x8a0
  cifs_readpage+0x8c/0x2a0
  generic_file_read_iter+0x692/0xd00
  cifs_strict_readv+0x29f/0x310
  generic_file_splice_read+0x11c/0x1c0
  do_splice_to+0xa5/0xc0
  splice_direct_to_actor+0xfa/0x350
  ? generic_pipe_buf_nosteal+0x10/0x10
  do_splice_direct+0xb5/0xe0
  do_sendfile+0x278/0x3a0
  SyS_sendfile64+0xc4/0xe0
  entry_SYSCALL_64_fastpath+0x1f/0xbe

Signed-off-by: Rabin Vincent <rabinv@axis.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-09 20:36:02 -05:00
Ari Kauppi
b550a32e60 nfsd: fix undefined behavior in nfsd4_layout_verify
UBSAN: Undefined behaviour in fs/nfsd/nfs4proc.c:1262:34
  shift exponent 128 is too large for 32-bit type 'int'

Depending on compiler+architecture, this may cause the check for
layout_type to succeed for overly large values (which seems to be the
case with amd64). The large value will be later used in de-referencing
nfsd4_layout_ops for function pointers.
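
A minimal sketch of the kind of bounds check that avoids the undefined
shift is shown below. The names model_layout_type_supported() and
MODEL_LAYOUT_TYPE_MAX are hypothetical, and the bound of 32 is an
assumption chosen for this model; any bound no larger than the width of
the shifted type keeps the shift defined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bound for this model; must not exceed the 32-bit shift width. */
#define MODEL_LAYOUT_TYPE_MAX 32u

/*
 * Test whether layout_type is set in a 32-bit mask of supported types.
 * The range check comes first: shifting a 32-bit value by 32 or more
 * bits is undefined behaviour in C.
 */
static bool model_layout_type_supported(uint32_t supported_mask,
					uint32_t layout_type)
{
	if (layout_type >= MODEL_LAYOUT_TYPE_MAX)
		return false;
	return (supported_mask & (UINT32_C(1) << layout_type)) != 0;
}
```

With the range check in place, a layout type of 128 is rejected regardless
of the mask contents, instead of depending on what the compiler and
architecture do with an oversized shift.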

Reported-by: Jani Tuovila <tuovila@synopsys.com>
Signed-off-by: Ari Kauppi <ari@synopsys.com>
[colin.king@canonical.com: use LAYOUT_TYPE_MAX instead of 32]
Cc: stable@vger.kernel.org
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-05-09 17:09:18 -04:00
Trond Myklebust
76b2a30338 pNFS/flexfiles: Always attempt to call layoutstats when flexfiles is enabled
Layoutstats is always desirable when using the flexfiles driver, so
we should enable it if that driver is being loaded. It is safe to do
so, because even when the mount specifies NFSv4.1, we will turn it
off if the server tells us it is unsupported.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-09 16:02:57 -04:00
Trond Myklebust
f4b23de3dd NFSv4.1: Work around a Linux server bug...
It turns out the Linux server has a bug in its implementation of
supattr_exclcreat; it returns the set of all attributes, whether
or not they are supported by minor version 1.
In order to avoid a regression, we therefore apply the supported_attrs
as a mask on top of whatever the server sent us.
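
The masking step amounts to a word-wise AND of two attribute bitmaps. The
sketch below is a userspace illustration only: model_mask_exclcreat() and
MODEL_BITMASK_WORDS are hypothetical names, and the three-word bitmap
layout is an assumption of this model.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed number of 32-bit words in an attribute bitmap. */
#define MODEL_BITMASK_WORDS 3

/*
 * Keep only those exclusive-create attributes that also appear in the
 * set of supported attributes, whatever the server claimed.
 */
static void model_mask_exclcreat(uint32_t exclcreat[MODEL_BITMASK_WORDS],
				 const uint32_t supported[MODEL_BITMASK_WORDS])
{
	for (size_t i = 0; i < MODEL_BITMASK_WORDS; i++)
		exclcreat[i] &= supported[i];
}
```

A server that reports every bit set therefore ends up with exactly the
supported set after masking, which is the regression-avoiding behaviour the
commit message describes.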

Reported-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-09 15:52:15 -04:00
Linus Torvalds
11fbf53d66 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted bits and pieces from various people. No common topic in this
  pile, sorry"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/affs: add rename exchange
  fs/affs: add rename2 to prepare multiple methods
  Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()
  fs: don't set *REFERENCED on single use objects
  fs: compat: Remove warning from COMPATIBLE_IOCTL
  remove pointless extern of atime_need_update_rcu()
  fs: completely ignore unknown open flags
  fs: add a VALID_OPEN_FLAGS
  fs: remove _submit_bh()
  fs: constify tree_descr arrays passed to simple_fill_super()
  fs: drop duplicate header percpu-rwsem.h
  fs/affs: bugfix: Write files greater than page size on OFS
  fs/affs: bugfix: enable writes on OFS disks
  fs/affs: remove node generation check
  fs/affs: import amigaffs.h
  fs/affs: bugfix: make symbolic links work again
2017-05-09 09:12:53 -07:00
Linus Torvalds
8ee74a91ac proc: try to remove use of FOLL_FORCE entirely
We fixed the bugs in it, but it's still an ugly interface, so let's see
if anybody actually depends on it.  It's entirely possible that nothing
actually requires the whole "punch through read-only mappings"
semantics.

For example, gdb definitely uses the /proc/<pid>/mem interface, but it
looks like it mainly does it for regular reads of the target (that don't
need FOLL_FORCE), and judging by the gdb source code it seems to fall back
on the traditional ptrace(PTRACE_POKEDATA) interface if it needs to.

If this breaks something, I do have a (more complex) version that only
enables FOLL_FORCE when somebody has PTRACE_ATTACH'ed to the target,
like the comment here used to say ("Maybe we should limit FOLL_FORCE to
actual ptrace users?").

Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-09 08:45:16 -07:00
Linus Torvalds
bf5f89463f Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - the rest of MM

 - various misc things

 - procfs updates

 - lib/ updates

 - checkpatch updates

 - kdump/kexec updates

 - add kvmalloc helpers, use them

 - time helper updates for Y2038 issues. We're almost ready to remove
   current_fs_time() but that awaits a btrfs merge.

 - add tracepoints to DAX

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits)
  drivers/staging/ccree/ssi_hash.c: fix build with gcc-4.4.4
  selftests/vm: add a test for virtual address range mapping
  dax: add tracepoint to dax_insert_mapping()
  dax: add tracepoint to dax_writeback_one()
  dax: add tracepoints to dax_writeback_mapping_range()
  dax: add tracepoints to dax_load_hole()
  dax: add tracepoints to dax_pfn_mkwrite()
  dax: add tracepoints to dax_iomap_pte_fault()
  mtd: nand: nandsim: convert to memalloc_noreclaim_*()
  treewide: convert PF_MEMALLOC manipulations to new helpers
  mm: introduce memalloc_noreclaim_{save,restore}
  mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
  mm/huge_memory.c: deposit a pgtable for DAX PMD faults when required
  mm/huge_memory.c: use zap_deposited_table() more
  time: delete CURRENT_TIME_SEC and CURRENT_TIME
  gfs2: replace CURRENT_TIME with current_time
  apparmorfs: replace CURRENT_TIME with current_time()
  lustre: replace CURRENT_TIME macro
  fs: ubifs: replace CURRENT_TIME_SEC with current_time
  fs: ufs: use ktime_get_real_ts64() for birthtime
  ...
2017-05-08 18:17:56 -07:00
Ross Zwisler
b444073458 dax: add tracepoint to dax_insert_mapping()
Add a tracepoint to dax_insert_mapping(), following the same logging
conventions as the rest of DAX.  This tracepoint, along with the one in
dax_load_hole(), lets us know how a DAX PTE fault was serviced.

Here is an example DAX fault that inserts a PTE mapping:

  small-1126  [007] ....
   145.451604: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220

  small-1126  [007] ....
   145.452317: dax_insert_mapping: dev 259:0 ino 0x1003 shared write address 0x10420000 radix_entry 0x100006

  small-1126  [007] ....
   145.452399: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220 MAJOR|NOPAGE

Link: http://lkml.kernel.org/r/20170221195116.13278-7-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:16 -07:00
Ross Zwisler
f9bc3a0753 dax: add tracepoint to dax_writeback_one()
Add a tracepoint to dax_writeback_one(), following the same logging
conventions as the rest of DAX.

Here is an example range writeback which ends up flushing one PMD and
one PTE:

  test-1265  [003] ....
   496.615250: dax_writeback_range: dev 259:0 ino 0x1003 pgoff 0x0-0x7ffffffffffff

  test-1265  [003] ....
   496.616263: dax_writeback_one: dev 259:0 ino 0x1003 pgoff 0x0 pglen 0x200

  test-1265  [003] ....
   496.616270: dax_writeback_one: dev 259:0 ino 0x1003 pgoff 0x305 pglen 0x1

  test-1265  [003] ....
   496.616272: dax_writeback_range_done: dev 259:0 ino 0x1003 pgoff 0x0-0x7ffffffffffff
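
The pglen field in the dax_writeback_one lines above tells the two
flushes apart.  The sketch below decodes it, assuming 4 KiB base pages
and a 2 MiB PMD (so 0x200 pages per PMD); neither size is stated in this
message, and classify() is a hypothetical helper, not a kernel or
tracing API.

```python
import re

PAGE_SIZE = 4096   # assumption: 4 KiB base pages
PMD_PAGES = 0x200  # assumption: 2 MiB PMD / 4 KiB pages = 512 pages

LINE_RE = re.compile(
    r"dax_writeback_one: .* pgoff (0x[0-9a-f]+) pglen (0x[0-9a-f]+)"
)


def classify(line: str):
    """Return (pgoff, bytes flushed, kind) for a dax_writeback_one line.

    Returns None for any other trace line.
    """
    m = LINE_RE.search(line)
    if m is None:
        return None
    pgoff = int(m.group(1), 16)
    pglen = int(m.group(2), 16)
    if pglen == PMD_PAGES:
        kind = "PMD"
    elif pglen == 1:
        kind = "PTE"
    else:
        kind = "other"
    return pgoff, pglen * PAGE_SIZE, kind
```

Under those assumptions the first dax_writeback_one line is the PMD
flush (2 MiB at pgoff 0x0) and the second is the PTE flush (one page at
pgoff 0x305).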

[akpm@linux-foundation.org: struct blk_dax_ctl has disappeared]
Link: http://lkml.kernel.org/r/20170221195116.13278-6-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:16 -07:00
Ross Zwisler
d14a3f48a1 dax: add tracepoints to dax_writeback_mapping_range()
Add tracepoints to dax_writeback_mapping_range(), following the same
logging conventions as the rest of DAX.

Here is an example writeback call:

  msync-1085  [006] ....
   200.902565: dax_writeback_range: dev 259:0 ino 0x1003 pgoff 0x200-0x2ff

  msync-1085  [006] ....
   200.902579: dax_writeback_range_done: dev 259:0 ino 0x1003 pgoff 0x200-0x2ff

[ross.zwisler@linux.intel.com: fix regression in dax_writeback_mapping_range()]
  Link: http://lkml.kernel.org/r/20170314215358.31451-1-ross.zwisler@linux.intel.com
Link: http://lkml.kernel.org/r/20170221195116.13278-5-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:16 -07:00
Ross Zwisler
678c9fd043 dax: add tracepoints to dax_load_hole()
Add tracepoints to dax_load_hole(), following the same logging conventions
as the rest of DAX.

Here is the logging generated by a PTE read from a hole:

  read-1075  [002] ....
    62.362108: dax_pte_fault: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280

  read-1075  [002] ....
    62.362140: dax_load_hole: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE

  read-1075  [002] ....
    62.362141: dax_pte_fault_done: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE

Link: http://lkml.kernel.org/r/20170221195116.13278-4-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:16 -07:00
Ross Zwisler
c3ff68d7d1 dax: add tracepoints to dax_pfn_mkwrite()
Add tracepoints to dax_pfn_mkwrite(), following the same logging
conventions as the rest of DAX.

Here is an example PTE fault followed by a pfn_mkwrite:

  small_aligned-1094  [002] ....
   374.084998: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200

  small_aligned-1094  [002] ....
   374.085145: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200 MAJOR|NOPAGE

  small_aligned-1094  [002] ....
   374.085165: dax_pfn_mkwrite: dev 259:0 ino 0x1003 shared WRITE|MKWRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200 NOPAGE

Link: http://lkml.kernel.org/r/20170221195116.13278-3-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Ross Zwisler
a9c42b33ed dax: add tracepoints to dax_iomap_pte_fault()
Patch series "second round of tracepoints for DAX".

This second round of DAX tracepoint patches adds tracing to the PTE
fault path (dax_iomap_pte_fault(), dax_pfn_mkwrite(), dax_load_hole(),
dax_insert_mapping()) and to the writeback path
(dax_writeback_mapping_range(), dax_writeback_one()).

The purpose of this tracing is to give us a high level view of what DAX
is doing, whether faults are being serviced by PMDs or PTEs, and by real
storage or by zero pages covering holes.

I do have some patches nearly ready which also add tracing to
grab_mapping_entry() and dax_insert_mapping_entry().  These are more
targeted at logging how we are interacting with the radix tree, how we
use empty entries for locking, whether we "downgrade" huge zero pages to
4k PTE sized allocations, etc.  In the end it seemed to me that this
might be too detailed to have as constantly present tracepoints, but if
anyone sees value in having tracepoints like this in the DAX code
permanently (Jan?), please let me know and I'll add those last two
patches.

All these tracepoints were done to be consistent with the style of the
XFS tracepoints and with the existing DAX PMD tracepoints.

This patch (of 6):

Add tracepoints to dax_iomap_pte_fault(), following the same logging
conventions as the rest of DAX.

Here is an example fault that initially tries to be serviced by the PMD
fault handler but which falls back to PTEs because the VMA isn't large
enough to hold a PMD:

  small-1086  [005] ....
   71.140014: xfs_filemap_huge_fault: dev 259:0 ino 0x1003

  small-1086  [005] ....
    71.140027: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x1400

  small-1086  [005] ....
    71.140028: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x1400 FALLBACK

  small-1086  [005] ....
    71.140035: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220

  small-1086  [005] ....
    71.140396: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220 MAJOR|NOPAGE
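
The FALLBACK result can be checked by hand from the numbers in the
trace.  A minimal sketch, assuming a 2 MiB PMD as on x86-64 (the message
does not state the size) and a hypothetical helper name:

```python
PMD_SIZE = 2 * 1024 * 1024  # assumption: 2 MiB PMD, as on x86-64


def pmd_fits_in_vma(address: int, vm_start: int, vm_end: int) -> bool:
    """True if the PMD-aligned range around address lies inside the VMA."""
    pmd_start = address & ~(PMD_SIZE - 1)
    pmd_end = pmd_start + PMD_SIZE
    return vm_start <= pmd_start and pmd_end <= vm_end


# Values taken from the dax_pmd_fault line above.
fits = pmd_fits_in_vma(0x10420000, 0x10200000, 0x10500000)
```

Here the PMD-aligned range would be 0x10400000-0x10600000, which runs
past vm_end 0x10500000, so fits is False and the fault is retried with
PTEs.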

Link: http://lkml.kernel.org/r/20170221195116.13278-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Stephen Rothwell
b32c8c7648 gfs2: replace CURRENT_TIME with current_time
Link: http://lkml.kernel.org/r/20170420161852.0492bc3f@canb.auug.org.au
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Deepa Dinamani
607a11ad94 fs: ubifs: replace CURRENT_TIME_SEC with current_time
CURRENT_TIME_SEC is not y2038 safe.  current_time() will be transitioned
to use 64 bit time along with vfs in a separate patch.  There is no plan
to transition CURRENT_TIME_SEC to use y2038 safe time interfaces.
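
For readers unfamiliar with the y2038 problem, the limit can be worked
out directly.  This sketch assumes a signed 32-bit seconds counter, as
time_t is on 32-bit systems; wrap_int32() is an illustrative helper, not
a kernel function.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Reinterpret value as a signed 32-bit integer (two's complement)."""
    return (value + 2**31) % 2**32 - 2**31


# Last second a signed 32-bit counter can represent.
last_ok = EPOCH + timedelta(seconds=INT32_MAX)
# One second later the counter wraps to a large negative value.
wrapped = wrap_int32(INT32_MAX + 1)
after_wrap = EPOCH + timedelta(seconds=wrapped)
```

last_ok falls in January 2038, while after_wrap lands back in 1901,
which is why 64-bit time interfaces are needed.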

current_time() returns timestamps according to the granularities set in
the inode's super_block.  The granularity check to call
current_fs_time() or CURRENT_TIME_SEC is not required.

Use current_time() directly to update inode timestamp.  Use
timespec_trunc during file system creation, before the first inode is
created.

Link: http://lkml.kernel.org/r/1491613030-11599-9-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Deepa Dinamani
a88e99e976 fs: ufs: use ktime_get_real_ts64() for birthtime
CURRENT_TIME is not y2038 safe.  Replace it with ktime_get_real_ts64().
Inode time formats are already 64 bit long and accommodate time64_t.

Link: http://lkml.kernel.org/r/1491613030-11599-6-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Deepa Dinamani
1134e09100 fs: ceph: replace CURRENT_TIME with ktime_get_real_ts()
CURRENT_TIME is not y2038 safe.  The macro will be deleted and all the
references to it will be replaced by ktime_get_* apis.

struct timespec is also not y2038 safe.  Retain timespec for timestamp
representation here as ceph uses it internally everywhere.  These
references will be changed to use struct timespec64 in a separate patch.

The current_fs_time() api is being changed to use vfs struct inode* as
an argument instead of struct super_block*.

Set the new mds client request r_stamp field using ktime_get_real_ts()
instead of using current_fs_time().

Also, since r_stamp is used as mtime on the server, use timespec_trunc()
to truncate the timestamp, using the right granularity from the
superblock.
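
Truncating a timestamp to a granularity can be sketched as follows.
This is modeled on my understanding of timespec_trunc() and should be
read as an assumption rather than the kernel source; the
(tv_sec, tv_nsec) tuple stands in for struct timespec.

```python
NSEC_PER_SEC = 1_000_000_000


def timespec_trunc(tv_sec: int, tv_nsec: int, gran: int):
    """Round the nanosecond part down to a multiple of gran nanoseconds."""
    if gran == 1:
        # Nanosecond granularity: nothing to truncate.
        return tv_sec, tv_nsec
    if gran == NSEC_PER_SEC:
        # One-second granularity: drop the sub-second part entirely.
        return tv_sec, 0
    if 1 < gran < NSEC_PER_SEC:
        return tv_sec, tv_nsec - (tv_nsec % gran)
    raise ValueError(f"illegal granularity {gran}")
```

For example, a granularity of 1000 keeps microseconds and discards the
remaining nanoseconds.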

This api will be transitioned to be y2038 safe along with vfs.

Link: http://lkml.kernel.org/r/1491613030-11599-5-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
M:	Ilya Dryomov <idryomov@gmail.com>
M:	"Yan, Zheng" <zyan@redhat.com>
M:	Sage Weil <sage@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Deepa Dinamani
e37fea58f7 fs: cifs: replace CURRENT_TIME by other appropriate apis
CURRENT_TIME macro is not y2038 safe on 32 bit systems.

The patch replaces all the uses of CURRENT_TIME by current_time() for
filesystem times, and ktime_get_* functions for authentication
timestamps and timezone calculations.

This is also in preparation for the patch that transitions vfs
timestamps to use 64 bit time and hence make them y2038 safe.

CURRENT_TIME macro will be deleted before merging the aforementioned
change.

The inode timestamps read from the server are assumed to have correct
granularity and range.

The patch also assumes that the difference between server and client
times lies in the range INT_MIN..INT_MAX.  This is valid because this is
the difference between current times between server and client, and the
largest timezone difference is in the range of one day.

All cifs timestamps currently use timespec representation internally.
Authentication and timezone timestamps can also be transitioned into
using timespec64 when all other timestamps for cifs are transitioned to
use timespec64.

Link: http://lkml.kernel.org/r/1491613030-11599-4-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Deepa Dinamani
48fbfe50f1 fs: f2fs: use ktime_get_real_seconds for sit_info times
CURRENT_TIME_SEC is not y2038 safe.

Replace use of CURRENT_TIME_SEC with ktime_get_real_seconds in segment
timestamps used by GC algorithm including the segment mtime timestamps.

Link: http://lkml.kernel.org/r/1491613030-11599-2-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:15 -07:00
Tetsuo Handa
c718a97514 fs: remove set but not checked AOP_FLAG_UNINTERRUPTIBLE flag
Commit afddba49d1 ("fs: introduce write_begin, write_end, and
perform_write aops") introduced AOP_FLAG_UNINTERRUPTIBLE flag which was
checked in pagecache_write_begin(), but that check was removed by
4e02ed4b4a ("fs: remove prepare_write/commit_write").

Between these two commits, commit d9414774dc ("cifs: Convert cifs to
new aops.") added a check in cifs_write_begin(), but that check was soon
removed by commit a98ee8c1c7 ("[CIFS] fix regression in
cifs_write_begin/cifs_write_end").

Therefore, AOP_FLAG_UNINTERRUPTIBLE flag is checked nowhere.  Let's
remove this flag.  This patch makes no functional changes.

Link: http://lkml.kernel.org/r/1489294781-53494-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:14 -07:00
Masahiro Yamada
6e7c2b4dd3 scripts/spelling.txt: add "intialise(d)" pattern and fix typo instances
Fix typos and add the following to the scripts/spelling.txt:

  intialisation||initialisation
  intialised||initialised
  intialise||initialise
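
The entries above use the file's "typo||correction" layout.  A small
sketch of how such a table could be loaded and applied; load_patterns()
and fix_typos() are hypothetical helpers written for this example, not
part of the kernel scripts.

```python
import re

SPELLING = """\
intialisation||initialisation
intialised||initialised
intialise||initialise
"""


def load_patterns(text: str) -> dict:
    """Parse 'typo||correction' lines, skipping blanks and # comments."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        typo, fix = line.split("||", 1)
        table[typo] = fix
    return table


def fix_typos(text: str, table: dict) -> str:
    """Replace whole-word occurrences of each known typo."""
    if not table:
        return text
    # Longest first, so "intialised" is tried before "intialise".
    words = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)
```

Matching on word boundaries keeps the shorter pattern from rewriting
part of a longer word.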

This commit does not intend to change the British spelling itself.

Link: http://lkml.kernel.org/r/1481573103-11329-18-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Michal Hocko
19809c2da2 mm, vmalloc: use __GFP_HIGHMEM implicitly
__vmalloc* allows users to provide gfp flags for the underlying
allocation.  This API is quite popular:

  $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
  77

The only problem is that many people are not aware that they really want
to give __GFP_HIGHMEM along with other flags because there is really no
reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
which are mapped to the kernel vmalloc space.  About half of users don't
use this flag, though.  This signals that we make the API unnecessarily
complex.

This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
be mapped to the vmalloc space.  Current users which add __GFP_HIGHMEM
are simplified and drop the flag.

Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Cristopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Michal Hocko
752ade68cb treewide: use kv[mz]alloc* rather than opencoded variants
There are many code paths opencoding kvmalloc.  Let's use the helper
instead.  The main difference to kvmalloc is that those users are
usually not considering all the aspects of the memory allocator.  E.g.
allocation requests <= 32kB (with 4kB pages) are basically never failing
and invoke OOM killer to satisfy the allocation.  This sounds too
disruptive for something that has a reasonable fallback - the vmalloc.
On the other hand those requests might fall back to vmalloc even when the
memory allocator would succeed after several more reclaim/compaction
attempts previously.  There is no guarantee something like that happens
though.
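
The 32kB figure follows from the page allocator's notion of a "costly"
order.  A worked sketch, assuming 4kB pages and
PAGE_ALLOC_COSTLY_ORDER == 3 (the kernel constant as I understand it);
the Python helpers only mirror the arithmetic.

```python
PAGE_SIZE = 4096             # assumption: 4kB pages, as in the text
PAGE_ALLOC_COSTLY_ORDER = 3  # assumption: orders above this are "costly"


def get_order(size: int) -> int:
    """Smallest order such that (PAGE_SIZE << order) >= size."""
    pages = -(-size // PAGE_SIZE)  # ceiling division
    return max(pages - 1, 0).bit_length()


def is_costly(size: int) -> bool:
    """True if the request needs an order above the costly threshold."""
    return get_order(size) > PAGE_ALLOC_COSTLY_ORDER
```

A 32kB request is order 3 (8 pages) and therefore not costly; one byte
more needs order 4 and crosses the threshold.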

This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.

Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdimm
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Santosh Raspatur <santosh@chelsio.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Michal Hocko
81be3dee96 fs/xattr.c: zero out memory copied to userspace in getxattr
getxattr uses vmalloc to allocate memory if kzalloc fails.  This is
filled by vfs_getxattr and then copied to the userspace.  vmalloc,
however, doesn't zero out the memory so if the specific implementation
of the xattr handler is sloppy we can theoretically expose kernel
memory.  There is no real sign this is really the case but let's make
sure this will not happen and use vzalloc instead.

Fixes: 779302e678 ("fs/xattr.c:getxattr(): improve handling of allocation failures")
Link: http://lkml.kernel.org/r/20170306103327.2766-1-mhocko@kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>	[3.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:12 -07:00
Michal Hocko
a7c3e901a4 mm: introduce kv[mz]alloc helpers
Patch series "kvmalloc", v5.

There are many open-coded instances of kmalloc with a vmalloc fallback in
the tree.  Most of them are not careful enough or simply do not care about
the underlying semantics of the kmalloc/page allocator, which means that
a) some vmalloc fallbacks are basically unreachable because the kmalloc
part will keep retrying until it succeeds and b) the page allocator can
invoke really disruptive steps like the OOM killer to move forward,
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.

As can be seen, implementing kvmalloc requires quite intimate
knowledge of the page allocator and the memory reclaim internals, which
strongly suggests that a helper should be implemented in the memory
subsystem proper.

Most callers I could find have been converted to use the helper
instead.  This is patch 6.  There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well, and Eric Dumazet
was not opposed [2] to converting them.

[1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com

This patch (of 9):

Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code.  Yet we do not have any common helper
for that and so users have invented their own helpers.  Some of them are
really creative when doing so.  Let's just add kv[mz]alloc and make sure
it is implemented properly.  This implementation makes sure not to create
large memory pressure for > PAGE_SIZE requests (__GFP_NORETRY) and also
not to warn about allocation failures.  This also rules out the OOM
killer, as vmalloc is a more appropriate fallback than a disruptive
user-visible action.

This patch also changes some existing users and removes helpers which
are specific for them.  In some cases this is not possible (e.g.
ext4_kvmalloc, libcfs_kvzalloc) because those seem to be broken and
require GFP_NO{FS,IO} context which is not vmalloc compatible in general
(note that the page table allocation is GFP_KERNEL).  Those need to be
fixed separately.

While we are at it, document the unsupported gfp mask for
__vmalloc{_node}, because there seems to be a lot of confusion out there.
kvmalloc_node will warn about flags that are incompatible with GFP_KERNEL
(i.e. not a superset of it) to catch new abusers.  Existing ones would have
to die slowly.

[sfr@canb.auug.org.au: f2fs fixup]
  Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>	[ext4 part]
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:12 -07:00
Kirill Tkhai
eaa0d190bf pidns: expose task pid_ns_for_children to userspace
pid_ns_for_children set by a task is known only to the task itself, and
it's impossible to identify it from outside.

It's a big problem for checkpoint/restore software like CRIU, because it
can't correctly handle tasks that do setns(CLONE_NEWPID) in the process of
their work.

This patch solves the problem by exposing pid_ns_for_children in the ns
directory in the standard way, under the name "pid_for_children":

  ~# ls /proc/5531/ns -l | grep pid
  lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid -> pid:[4026531836]
  lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid_for_children -> pid:[4026532286]

Link: http://lkml.kernel.org/r/149201123914.6007.2187327078064239572.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:12 -07:00
Kirill Tkhai
25b14e92af ns: allow ns_entries to have custom symlink content
Patch series "Expose task pid_ns_for_children to userspace".

pid_ns_for_children set by a task is known only to the task itself, and
it's impossible to identify it from outside.

It's a big problem for checkpoint/restore software like CRIU, because it
can't correctly handle tasks that do setns(CLONE_NEWPID) in the process of
their work.  If they have a custom pid_ns_for_children before dump, they
must have the same ns after restore.  Otherwise, the restored task ends up
in an environment it does not expect.

This patchset solves the problem.  It exposes pid_ns_for_children in the
ns directory in the standard way, under the name "pid_for_children":

  ~# ls /proc/5531/ns -l | grep pid
  lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid -> pid:[4026531836]
  lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid_for_children -> pid:[4026532286]

This patch (of 2):

Make it possible for the link content prefix yyy to differ from the link
name xxx:

  $ readlink /proc/[pid]/ns/xxx
  yyy:[4026531838]
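
Userspace that reads these links should therefore take the namespace
type from the link content rather than from the link name.  A minimal
sketch; parse_ns_link() and read_ns() are hypothetical helpers, and
read_ns() only works on a Linux system with /proc mounted.

```python
import os
import re

NS_LINK_RE = re.compile(r"^([a-z_]+):\[(\d+)\]$")


def parse_ns_link(target: str):
    """Split 'prefix:[inode]' into (prefix, inode number)."""
    m = NS_LINK_RE.match(target)
    if m is None:
        raise ValueError(f"unexpected ns link content: {target!r}")
    return m.group(1), int(m.group(2))


def read_ns(pid: int, name: str):
    """Read /proc/<pid>/ns/<name> and parse its link content."""
    return parse_ns_link(os.readlink(f"/proc/{pid}/ns/{name}"))
```

With the content shown elsewhere in this series, the link named
pid_for_children parses to the prefix "pid", so the prefix and the link
name differ.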

This will be used in the next patch.

Link: http://lkml.kernel.org/r/149201120318.6007.7362655181033883000.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:12 -07:00
Kees Cook
7fe6a42e87 reiserfs: use designated initializers
Prepare to mark sensitive kernel structures for randomization by making
sure they're using designated initializers.  These were identified
during allyesconfig builds of x86, arm, and arm64, with most initializer
fixes extracted from grsecurity.

Link: http://lkml.kernel.org/r/20170329210419.GA40066@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:11 -07:00
Tobin C. Harding
f245e1c17a fs/proc/inode.c: remove cast from memory allocation
Coccinelle emits this warning:

  WARNING: casting value returned by memory allocation function to (struct proc_inode *) is useless.

Remove unnecessary cast.

Link: http://lkml.kernel.org/r/1487745720-16967-1-git-send-email-me@tobin.cc
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:10 -07:00
Olga Kornievskaia
e092693443 NFS append COMMIT after synchronous COPY
Instead of messing with the commit path which has been causing issues,
add a COMMIT op after the COPY and ask for stable copies in the first
place.

It saves a round trip, since after the COPY, the client sends a COMMIT
anyway.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-08 19:01:06 -04:00
J. Bruce Fields
efda760fe9 lockd: fix lockd shutdown race
As reported by David Jeffery: "a signal was sent to lockd while lockd
was shutting down from a request to stop nfs.  The signal causes lockd
to call restart_grace() which puts the lockd_net structure on the grace
list.  If this signal is received at the wrong time, it will occur after
lockd_down_net() has called locks_end_grace() but before
lockd_down_net() stops the lockd thread.  This leads to lockd putting
the lockd_net structure back on the grace list, then exiting without
anything removing it from the list."

So, perform the final locks_end_grace() from the lockd thread; this
ensures it's serialized with respect to restart_grace().

Reported-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-05-08 18:06:18 -04:00
Linus Torvalds
70ef8f0d37 for-f2fs-4.12
In this round, we've focused on enhancing performance with regard to block
 allocation, GC, and discard/in-place-update IO controls. There are a bunch
 of clean-ups as well as minor bug fixes.
 
 = Enhancement
 - disable heap-based allocation by default
 - issue small-sized discard commands by default
 - change the policy of data hotness for logging
 - distinguish IOs in terms of size and wbc type
 - start SSR earlier to avoid foreground GC
 - enhance data structures managing discard commands
 - enhance in-place update flow
 - add some more fault injection routines
 - secure one more xattr entry
 
 = Bug fix
 - calculate victim cost for GC correctly
 - remain correct victim segment number for GC
 - race condition in nid allocator and initializer
 - stale pointer produced by atomic_writes
 - fix missing REQ_SYNC for flush commands
 - handle missing errors in more corner cases

Merge tag 'for-f2fs-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've focused on enhancing performance with regard to
  block allocation, GC, and discard/in-place-update IO controls. There
  are a bunch of clean-ups as well as minor bug fixes.

  Enhancements:
   - disable heap-based allocation by default
   - issue small-sized discard commands by default
   - change the policy of data hotness for logging
   - distinguish IOs in terms of size and wbc type
   - start SSR earlier to avoid foreground GC
   - enhance data structures managing discard commands
   - enhance in-place update flow
   - add some more fault injection routines
   - secure one more xattr entry

  Bug fixes:
   - calculate victim cost for GC correctly
   - remain correct victim segment number for GC
   - race condition in nid allocator and initializer
   - stale pointer produced by atomic_writes
   - fix missing REQ_SYNC for flush commands
   - handle missing errors in more corner cases"

* tag 'for-f2fs-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (111 commits)
  f2fs: fix a mount fail for wrong next_scan_nid
  f2fs: enhance scalability of trace macro
  f2fs: relocate inode_{,un}lock in F2FS_IOC_SETFLAGS
  f2fs: Make flush bios explicitely sync
  f2fs: show available_nids in f2fs/status
  f2fs: flush dirty nats periodically
  f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard
  f2fs: allow cpc->reason to indicate more than one reason
  f2fs: release cp and dnode lock before IPU
  f2fs: shrink size of struct discard_cmd
  f2fs: don't hold cmd_lock during waiting discard command
  f2fs: nullify fio->encrypted_page for each writes
  f2fs: sanity check segment count
  f2fs: introduce valid_ipu_blkaddr to clean up
  f2fs: lookup extent cache first under IPU scenario
  f2fs: reconstruct code to write a data page
  f2fs: introduce __wait_discard_cmd
  f2fs: introduce __issue_discard_cmd
  f2fs: enable small discard by default
  f2fs: delay awaking discard thread
  ...
2017-05-08 12:24:17 -07:00
Rock Lee
798868c021 ubifs: Fix a typo in comment of ioctl2ubifs & ubifs2ioctl
Change 'convert' to 'converts'
Change 'UBIFS' to 'UBIFS inode flags'

Signed-off-by: Rock Lee <rockdotlee@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-05-08 20:48:55 +02:00
Stefan Agner
2a068daf57 ubifs: Remove unnecessary assignment
Assigning a value of a variable to itself is not useful.

Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-05-08 20:48:47 +02:00
Colin Ian King
6a258f7d0f ubifs: Fix cut and paste error on sb type comparisons
The check for the bad node type of sb->type is checking sa->type
and not sb->type. This looks like a cut and paste error. Fix this.

Detected by PVS-Studio, warning: V581

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-05-08 20:48:41 +02:00
Hyunchul Lee
8326c1eec2 ubifs: Add CONFIG_UBIFS_FS_SECURITY to disable/enable security labels
On every write syscall, the security label is looked up to determine
whether the file's privileges should be changed.
If no LSM (Linux Security Module) is in use, this lookup is useless.

So introduce CONFIG_UBIFS_FS_SECURITY to disable security labels. Its
default value is "y".

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-05-08 20:48:23 +02:00
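A build-time switch like this is commonly wired up in C by compiling the real function only when the option is set and supplying an inline stub otherwise. The sketch below shows that general pattern under assumptions: `DEMO_FS_SECURITY` stands in for the real Kconfig symbol, and `demo_init_security()` is a hypothetical function, not the UBIFS one.

```c
#include <assert.h>

/* Hypothetical switch; define DEMO_FS_SECURITY at build time to enable it. */
#ifdef DEMO_FS_SECURITY
static int demo_lookups;

/* Pretend label lookup; counts how often the costly path runs. */
static int demo_init_security(const char *name)
{
	assert(name != NULL);
	demo_lookups++;
	return 0;
}
#else
/* Feature compiled out: the call collapses to a constant. */
static inline int demo_init_security(const char *name)
{
	(void)name;
	return 0;
}
#endif
```

Callers invoke `demo_init_security()` unconditionally, so the write path needs no `#ifdef` of its own and pays nothing when the option is off.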
Linus Torvalds
677375cef8 Only bug fixes and cleanups for this merge window.

Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt

Pull fscrypt updates from Ted Ts'o:
 "Only bug fixes and cleanups for this merge window"

* tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
  fscrypt: correct collision claim for digested names
  MAINTAINERS: fscrypt: update mailing list, patchwork, and git
  ext4: clean up ext4_match() and callers
  f2fs: switch to using fscrypt_match_name()
  ext4: switch to using fscrypt_match_name()
  fscrypt: introduce helper function for filename matching
  fscrypt: avoid collisions when presenting long encrypted filenames
  f2fs: check entire encrypted bigname when finding a dentry
  ubifs: check for consistent encryption contexts in ubifs_lookup()
  f2fs: sync f2fs_lookup() with ext4_lookup()
  ext4: remove "nokey" check from ext4_lookup()
  fscrypt: fix context consistency check when key(s) unavailable
  fscrypt: Remove __packed from fscrypt_policy
  fscrypt: Move key structure and constants to uapi
  fscrypt: remove fscrypt_symlink_data_len()
  fscrypt: remove unnecessary checks for NULL operations
2017-05-08 11:40:34 -07:00
Linus Torvalds
dd727dad37 Add GETFSMAP support; some performance improvements for very large
file systems and for random write workloads into a preallocated file;
 bug fixes and cleanups.

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:

 - add GETFSMAP support

 - some performance improvements for very large file systems and for
   random write workloads into a preallocated file

 - bug fixes and cleanups.

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: cleanup write flags handling from jbd2_write_superblock()
  ext4: mark superblock writes synchronous for nobarrier mounts
  ext4: inherit encryption xattr before other xattrs
  ext4: replace BUG_ON with WARN_ONCE in ext4_end_bio()
  ext4: avoid unnecessary transaction stalls during writeback
  ext4: preload block group descriptors
  ext4: make ext4_shutdown() static
  ext4: support GETFSMAP ioctls
  vfs: add common GETFSMAP ioctl definitions
  ext4: evict inline data when writing to memory map
  ext4: remove ext4_xattr_check_entry()
  ext4: rename ext4_xattr_check_names() to ext4_xattr_check_entries()
  ext4: merge ext4_xattr_list() into ext4_listxattr()
  ext4: constify static data that is never modified
  ext4: trim return value and 'dir' argument from ext4_insert_dentry()
  jbd2: fix dbench4 performance regression for 'nobarrier' mounts
  jbd2: Fix lockdep splat with generic/270 test
  mm: retry writepages() on ENOMEM when doing an data integrity writeback
2017-05-08 11:30:05 -07:00
Dan Williams
ef51042472 block, dax: move "select DAX" from BLOCK to FS_DAX
For configurations that do not enable DAX filesystems or drivers, do not
require the DAX core to be built.

Given that the 'direct_access' method has been removed from
'block_device_operations', we can also go ahead and move the
block-related dax helper functions from fs/block_dev.c to
drivers/dax/super.c. This keeps dax details out of the block layer and
lets the DAX core be built as a module in the FS_DAX=n case.

Filesystems need to include dax.h to call bdev_dax_supported().

Cc: linux-xfs@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-08 10:55:27 -07:00
Trond Myklebust
28cf22d0ba NFSv4: Fix exclusive create attributes encoding
When using NFS4_CREATE_EXCLUSIVE4_1 mode, the client will overestimate the
amount of space that it needs for the attributes because it does so
before checking whether or not the server supports a given attribute.

Fix by checking the attribute mask earlier.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-08 09:40:59 -04:00
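The ordering issue described above can be illustrated with a small sketch: if the space estimate is computed before the requested attributes are filtered against the server's supported set, it counts attributes that will never be encoded. Everything here is hypothetical (the `DEMO_ATTR_*` bits, their sizes, and both functions); it is not the NFSv4 client code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ATTR_MODE  (1u << 0)
#define DEMO_ATTR_OWNER (1u << 1)
#define DEMO_ATTR_MTIME (1u << 2)
#define DEMO_ATTR_ALL   (DEMO_ATTR_MODE | DEMO_ATTR_OWNER | DEMO_ATTR_MTIME)

/* Bytes each hypothetical attribute occupies when encoded. */
static size_t demo_attr_len(uint32_t attr)
{
	switch (attr) {
	case DEMO_ATTR_MODE:
		return 4;
	case DEMO_ATTR_OWNER:
		return 16;
	case DEMO_ATTR_MTIME:
		return 12;
	default:
		return 0;
	}
}

/* Reserve space only for attributes the server advertises. */
static size_t demo_reserve_len(uint32_t wanted, uint32_t server_mask)
{
	uint32_t mask;
	size_t len = 0;

	assert((wanted & ~DEMO_ATTR_ALL) == 0);
	mask = wanted & server_mask;	/* filter first ... */
	for (uint32_t bit = 1; bit <= DEMO_ATTR_MTIME; bit <<= 1)
		if (mask & bit)
			len += demo_attr_len(bit);	/* ... then size */
	return len;
}
```

When the server lacks the owner attribute, the estimate shrinks by that attribute's size instead of reserving space that is never used.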
Trond Myklebust
2e84611b3f NFSv4: Fix an rcu lock leak
The intention in the original patch was to release the lock when
we put the inode, however something got screwed up.

Reported-by: Jason Yan <yanaijie@huawei.com>
Fixes: 7b410d9ce4 ("pNFS: Delay getting the layout header in..")
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-08 09:27:59 -04:00
Linus Torvalds
fe7a719b30 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "Various fixes for stable for CIFS/SMB3 especially for better
  interoperability for SMB3 to Macs.

  It also includes Pavel's improvements to SMB3 async i/o support
  (which is much faster now)"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: add misssing SFM mapping for doublequote
  SMB3: Work around mount failure when using SMB3 dialect to Macs
  cifs: fix CIFS_IOC_GET_MNT_INFO oops
  CIFS: fix mapping of SFM_SPACE and SFM_PERIOD
  CIFS: fix oplock break deadlocks
  cifs: fix CIFS_ENUMERATE_SNAPSHOTS oops
  cifs: fix leak in FSCTL_ENUM_SNAPS response handling
  Set unicode flag on cifs echo request to avoid Mac error
  CIFS: Add asynchronous write support through kernel AIO
  CIFS: Add asynchronous read support through kernel AIO
  CIFS: Add asynchronous context to support kernel AIO
  cifs: fix IPv6 link local, with scope id, address parsing
  cifs: small underflow in cnvrtDosUnixTm()
2017-05-06 11:51:46 -07:00
Linus Torvalds
d484467c86 Changes for 4.12:
- various code cleanups
 - introduce GETFSMAP ioctl
 - various refactoring
 - avoid dio reads past eof
 - fix memory corruption and other errors with fragmented directory blocks
 - fix accidental userspace memory corruptions
 - publish fs uuid in superblock
 - make fstrim terminatable
 - fix race between quotaoff and in-core inode creation
 - Avoid use-after-free when finishing up w/ buffer heads
 - Reserve enough space to handle bmap tree resizing during cow remap

Merge tag 'xfs-4.12-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "Here are the XFS changes for 4.12. The big new feature for this
  release is the new space mapping ioctl that we've been discussing
  since LSF2016, but other than that most of the patches are larger bug
  fixes, memory corruption prevention, and other cleanups.

  Summary:
   - various code cleanups
   - introduce GETFSMAP ioctl
   - various refactoring
   - avoid dio reads past eof
   - fix memory corruption and other errors with fragmented directory blocks
   - fix accidental userspace memory corruptions
   - publish fs uuid in superblock
   - make fstrim terminatable
   - fix race between quotaoff and in-core inode creation
   - avoid use-after-free when finishing up w/ buffer heads
   - reserve enough space to handle bmap tree resizing during cow remap"

* tag 'xfs-4.12-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (53 commits)
  xfs: fix use-after-free in xfs_finish_page_writeback
  xfs: reserve enough blocks to handle btree splits when remapping
  xfs: wait on new inodes during quotaoff dquot release
  xfs: update ag iterator to support wait on new inodes
  xfs: support ability to wait on new inodes
  xfs: publish UUID in struct super_block
  xfs: Allow user to kill fstrim process
  xfs: better log intent item refcount checking
  xfs: fix up quotacheck buffer list error handling
  xfs: remove xfs_trans_ail_delete_bulk
  xfs: don't use bool values in trace buffers
  xfs: fix getfsmap userspace memory corruption while setting OF_LAST
  xfs: fix __user annotations for xfs_ioc_getfsmap
  xfs: corruption needs to respect endianess too!
  xfs: use NULL instead of 0 to initialize a pointer in xfs_ioc_getfsmap
  xfs: use NULL instead of 0 to initialize a pointer in xfs_getfsmap
  xfs: simplify validation of the unwritten extent bit
  xfs: remove unused values from xfs_exntst_t
  xfs: remove the unused XFS_MAXLINK_1 define
  xfs: more do_div cleanups
  ...
2017-05-06 11:46:16 -07:00
Linus Torvalds
044f1daaaa Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes and updates from Jens Axboe:
 "Some fixes and followup features/changes that should go in, in this
  merge window. This contains:

   - Two fixes for lightnvm from Javier, fixing problems in the new code
     merged previously in this merge window.

   - A fix from Jan for the backing device changes, fixing an issue in
     NFS that causes a failure to mount on certain setups.

   - A change from Christoph, cleaning up the blk-mq init and exit
     request paths.

   - Remove elevator_change(), which is now unused. From Bart.

   - A fix for queue operation invocation on a dead queue, from Bart.

   - A series fixing up mtip32xx for blk-mq scheduling, removing a
     bandaid we previously had in place for this. From me.

   - A regression fix for this series, fixing a case where we wait on
     workqueue flushing from an invalid (non-blocking) context. From me.

   - A fix/optimization from Ming, ensuring that we don't both quiesce
     and freeze a queue at the same time.

   - A fix from Peter on lock ordering for CPU hotplug. Not a real
     problem right now, but will be once the CPU hotplug rework goes in.

   - A series from Omar, cleaning up our blk-mq debugfs support, and
     adding support for exporting info from schedulers in debugfs as
     well. This is really useful in debugging stalls or livelocks."

* 'for-linus' of git://git.kernel.dk/linux-block: (28 commits)
  mq-deadline: add debugfs attributes
  kyber: add debugfs attributes
  blk-mq-debugfs: allow schedulers to register debugfs attributes
  blk-mq: untangle debugfs and sysfs
  blk-mq: move debugfs declarations to a separate header file
  blk-mq: Do not invoke queue operations on a dead queue
  blk-mq-debugfs: get rid of a bunch of boilerplate
  blk-mq-debugfs: rename hw queue directories from <n> to hctx<n>
  blk-mq-debugfs: don't open code strstrip()
  blk-mq-debugfs: error on long write to queue "state" file
  blk-mq-debugfs: clean up flag definitions
  blk-mq-debugfs: separate flags with |
  nfs: Fix bdi handling for cloned superblocks
  block/mq: Cure cpu hotplug lock inversion
  lightnvm: fix bad back free on error path
  lightnvm: create cmd before allocating request
  blk-mq: don't use sync workqueue flushing from drivers
  mtip32xx: convert internal commands to regular block infrastructure
  mtip32xx: cleanup internal tag assumptions
  block: don't call blk_mq_quiesce_queue() after queue is frozen
  ...
2017-05-06 11:25:08 -07:00
Linus Torvalds
53ef7d0e20 libnvdimm for 4.12
* Region media error reporting: A libnvdimm region device is the parent
 to one or more namespaces. To date, media errors have been reported via
 the "badblocks" attribute attached to pmem block devices for namespaces
 in "raw" or "memory" mode. Given that namespaces can be in "device-dax"
 or "btt-sector" mode this new interface reports media errors
 generically, i.e. independent of namespace modes or state. This
 subsequently allows userspace tooling to craft "ACPI 6.1 Section
 9.20.7.6 Function Index 4 - Clear Uncorrectable Error" requests and
 submit them via the ioctl path for NVDIMM root bus devices.
 
 * Introduce 'struct dax_device' and 'struct dax_operations': Prompted by
 a request from Linus and feedback from Christoph this allows for dax
 capable drivers to publish their own custom dax operations. This fixes
 the broken assumption that all dax operations are related to a
 persistent memory device, and makes it easier for other architectures
 and platforms to add customized persistent memory support.
 
 * 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
 available for storage appliance applications to manually trigger memory
 controllers to drain write-pending buffers that would otherwise be
 flushed automatically by the platform ADR (asynchronous-DRAM-refresh)
 mechanism at a power loss event. Support for "locked" DIMMs is included
 to prevent namespaces from surfacing when the namespace label data area
 is locked. Finally, fixes for various reported deadlocks and crashes,
 also tagged for -stable.
 
 * ACPI / nfit driver updates: General updates of the nfit driver to add
 DSM command overrides, ACPI 6.1 health state flags support, DSM payload
 debug available by default, and various fixes.
 
 Acknowledgements that came after the branch was pushed:
 
 commit 565851c972 "device-dax: fix sysfs attribute deadlock"
 Tested-by: Yi Zhang <yizhan@redhat.com>
 
 commit 23f4984483 "libnvdimm: rework region badblocks clearing"
 Tested-by: Toshi Kani <toshi.kani@hpe.com>

Merge tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "The bulk of this has been in multiple -next releases. There were a few
  late breaking fixes and small features that got added in the last
  couple days, but the whole set has received a build success
  notification from the kbuild robot.

  Change summary:

   - Region media error reporting: A libnvdimm region device is the
     parent to one or more namespaces. To date, media errors have been
     reported via the "badblocks" attribute attached to pmem block
     devices for namespaces in "raw" or "memory" mode. Given that
     namespaces can be in "device-dax" or "btt-sector" mode this new
     interface reports media errors generically, i.e. independent of
     namespace modes or state.

     This subsequently allows userspace tooling to craft "ACPI 6.1
     Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
     requests and submit them via the ioctl path for NVDIMM root bus
     devices.

   - Introduce 'struct dax_device' and 'struct dax_operations': Prompted
     by a request from Linus and feedback from Christoph this allows for
     dax capable drivers to publish their own custom dax operations.
     This fixes the broken assumption that all dax operations are
     related to a persistent memory device, and makes it easier for
     other architectures and platforms to add customized persistent
     memory support.

   - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
     available for storage appliance applications to manually trigger
     memory controllers to drain write-pending buffers that would
     otherwise be flushed automatically by the platform ADR
     (asynchronous-DRAM-refresh) mechanism at a power loss event.
     Support for "locked" DIMMs is included to prevent namespaces from
     surfacing when the namespace label data area is locked. Finally,
     fixes for various reported deadlocks and crashes, also tagged for
     -stable.

   - ACPI / nfit driver updates: General updates of the nfit driver to
     add DSM command overrides, ACPI 6.1 health state flags support, DSM
     payload debug available by default, and various fixes.

  Acknowledgements that came after the branch was pushed:

   - commit 565851c972 "device-dax: fix sysfs attribute deadlock":
     Tested-by: Yi Zhang <yizhan@redhat.com>

   - commit 23f4984483 "libnvdimm: rework region badblocks clearing"
     Tested-by: Toshi Kani <toshi.kani@hpe.com>"

* tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
  libnvdimm, pfn: fix 'npfns' vs section alignment
  libnvdimm: handle locked label storage areas
  libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
  brd: fix uninitialized use of brd->dax_dev
  block, dax: use correct format string in bdev_dax_supported
  device-dax: fix sysfs attribute deadlock
  libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
  libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
  libnvdimm: rework region badblocks clearing
  acpi, nfit: kill ACPI_NFIT_DEBUG
  libnvdimm: fix clear length of nvdimm_forget_poison()
  libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
  libnvdimm, region: sysfs trigger for nvdimm_flush()
  libnvdimm: fix phys_addr for nvdimm_clear_poison
  x86, dax, pmem: remove indirection around memcpy_from_pmem()
  block: remove block_device_operations ->direct_access()
  block, dax: convert bdev_dax_supported() to dax_direct_access()
  filesystem-dax: convert to dax_direct_access()
  Revert "block: use DAX for partition table reads"
  ext2, ext4, xfs: retrieve dax_device for iomap operations
  ...
2017-05-05 18:49:20 -07:00
Linus Torvalds
1a5fb64fee We've got ten GFS2 patches for this merge window.
1. Andreas Gruenbacher wrote a patch to replace the deprecated
    call to rhashtable_walk_init with rhashtable_walk_enter.
 2. Andreas also wrote a patch to eliminate redundant code in
    two of our debugfs sequence files.
 3. Andreas also cleaned up the rhashtable key ugliness Linus
    pointed out during this cycle, following Linus's suggestions.
 4. Andreas also wrote a patch to take advantage of his new
    function rhashtable_lookup_get_insert_fast. This makes glock
    lookup faster and more bullet-proof.
 5. Andreas also wrote a patch to revert a patch in the evict
    path that caused occasional deadlocks, and is no longer
    needed.
 6. Andrew Price wrote a patch to re-enable fallocate for the
    rindex system file to enable gfs2_grow to grow properly on
    secondary file system grow operations.
 7. I wrote a patch to initialize an inode number field to make
    certain kernel trace points more understandable.
 8. I also wrote a patch that makes GFS2 file system "withdraw"
    work more like it should by ignoring operations after a
    withdraw that would formerly cause a BUG() and kernel panic.
 9. I also reworked the entire truncate/delete algorithm,
    scrapping the old recursive algorithm in favor of a new
    non-recursive algorithm. This was done for performance:
    This way, GFS2 no longer needs to lock multiple resource
    groups while doing truncates and deletes of files that cross
    multiple resource group boundaries, allowing for better
    parallelism. It also solves a problem whereby deleting large
    files would request a large chunk of kernel memory, which
    resulted in a get_page_from_freelist warning.
 10. Due to a regression found during testing, I added a new
     patch to correct "GFS2: Prevent BUG from occurring when
     normal Withdraws occur".

Merge tag 'gfs2-4.12.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull GFS2 updates from Bob Peterson:
 "We've got ten GFS2 patches for this merge window.

   - Andreas Gruenbacher wrote a patch to replace the deprecated call to
     rhashtable_walk_init with rhashtable_walk_enter.

   - Andreas also wrote a patch to eliminate redundant code in two of
     our debugfs sequence files.

   - Andreas also cleaned up the rhashtable key ugliness Linus pointed
     out during this cycle, following Linus's suggestions.

   - Andreas also wrote a patch to take advantage of his new function
     rhashtable_lookup_get_insert_fast. This makes glock lookup faster
     and more bullet-proof.

   - Andreas also wrote a patch to revert a patch in the evict path that
     caused occasional deadlocks, and is no longer needed.

   - Andrew Price wrote a patch to re-enable fallocate for the rindex
     system file to enable gfs2_grow to grow properly on secondary file
     system grow operations.

   - I wrote a patch to initialize an inode number field to make certain
     kernel trace points more understandable.

   - I also wrote a patch that makes GFS2 file system "withdraw" work
     more like it should by ignoring operations after a withdraw that
     would formerly cause a BUG() and kernel panic.

   - I also reworked the entire truncate/delete algorithm, scrapping the
     old recursive algorithm in favor of a new non-recursive algorithm.
     This was done for performance: This way, GFS2 no longer needs to
     lock multiple resource groups while doing truncates and deletes of
     files that cross multiple resource group boundaries, allowing for
     better parallelism. It also solves a problem whereby deleting large
     files would request a large chunk of kernel memory, which resulted
     in a get_page_from_freelist warning.

   - Due to a regression found during testing, I added a new patch to
     correct 'GFS2: Prevent BUG from occurring when normal Withdraws
     occur'."

* tag 'gfs2-4.12.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  GFS2: Allow glocks to be unlocked after withdraw
  GFS2: Non-recursive delete
  gfs2: Re-enable fallocate for the rindex
  Revert "GFS2: Wait for iopen glock dequeues"
  gfs2: Switch to rhashtable_lookup_get_insert_fast
  GFS2: Temporarily zero i_no_addr when creating a dinode
  gfs2: Don't pack struct lm_lockname
  gfs2: Deduplicate gfs2_{glocks,glstats}_open
  gfs2: Replace rhashtable_walk_init with rhashtable_walk_enter
  GFS2: Prevent BUG from occurring when normal Withdraws occur
2017-05-05 13:40:20 -07:00
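The move from a recursive to a non-recursive delete mentioned in the GFS2 summary can be illustrated generically: recursion is replaced by an explicit work list, so stack usage stays constant regardless of tree height. This sketch is an assumption-laden illustration only. `struct demo_node`, `demo_build()` and `demo_delete()` are hypothetical and do not reflect GFS2's on-disk metadata tree or its resource group locking.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical in-memory tree node; not GFS2's metadata layout. */
struct demo_node {
	struct demo_node *child[2];
	struct demo_node *next;	/* links nodes waiting to be freed */
};

/* Build a complete binary tree with the given number of levels. */
static struct demo_node *demo_build(int levels)
{
	struct demo_node *n;

	if (levels == 0)
		return NULL;
	n = calloc(1, sizeof(*n));
	assert(n != NULL);
	n->child[0] = demo_build(levels - 1);
	n->child[1] = demo_build(levels - 1);
	return n;
}

/* Free the whole tree without recursion; returns the node count freed. */
static size_t demo_delete(struct demo_node *root)
{
	struct demo_node *pending = root;	/* explicit work list */
	size_t freed = 0;

	while (pending) {
		struct demo_node *n = pending;

		pending = n->next;
		/* Queue the children before releasing their parent. */
		for (int i = 0; i < 2; i++) {
			if (n->child[i]) {
				n->child[i]->next = pending;
				pending = n->child[i];
			}
		}
		free(n);
		freed++;
	}
	return freed;
}
```

Because the pending list is threaded through the nodes themselves, the loop needs no auxiliary allocation, which is the property that matters when deleting very large structures.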
Linus Torvalds
aeced66196 Some cleanups:
remove unused get_fsid_from_ino
   fix bounds check for listxattr
   clean up oversize xattr validation
   do not set getattr_time on orangefs_lookup
   return from orangefs_devreq_read quickly if possible
   do not wait for timeout if umounting
   handle zero size write in debugfs
 
 Bug fixes:
 
   do not check possibly stale size on truncate
   ensure the userspace component is unmounted if mount fails
   total reimplementation of dir.c
 
 New feature:
 
   implement statx
 
 The new implementation of dir.c is kind of a big deal, all new
 code. It was posted to fs-devel during the previous rc period. We
 didn't get much review or feedback from there, but it has been reviewed
 very heavily here, so much so that we have two entire versions of the
 reimplementation. Not only does the new implementation fix some
 xfstests, but it passes all the new tests we made here that involve
 seeking and rewinding and giant directories and long file names.
 The new dir code has three patches itself:
 
   skip forward to the next directory entry if seek is short
   invalidate stored directory on seek
   count directory pieces correctly

Merge tag 'for-linus-4.12-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs updates from Mike Marshall:
 "Orangefs cleanups, fixes and statx support.

  Some cleanups:

   - remove unused get_fsid_from_ino
   - fix bounds check for listxattr
   - clean up oversize xattr validation
   - do not set getattr_time on orangefs_lookup
   - return from orangefs_devreq_read quickly if possible
   - do not wait for timeout if umounting
   - handle zero size write in debugfs

  Bug fixes:

   - do not check possibly stale size on truncate
   - ensure the userspace component is unmounted if mount fails
   - total reimplementation of dir.c

  New feature:

   - implement statx

  The new implementation of dir.c is kind of a big deal, all new code.
  It has been posted to fs-devel during the previous rc period, we
  didn't get much review or feedback from there, but it has been
  reviewed very heavily here, so much so that we have two entire
  versions of the reimplementation.

  Not only does the new implementation fix some xfstests, but it passes
  all the new tests we made here that involve seeking and rewinding and
  giant directories and long file names. The new dir code has three
  patches itself:

   - skip forward to the next directory entry if seek is short
   - invalidate stored directory on seek
   - count directory pieces correctly"

* tag 'for-linus-4.12-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: count directory pieces correctly
  orangefs: invalidate stored directory on seek
  orangefs: skip forward to the next directory entry if seek is short
  orangefs: handle zero size write in debugfs
  orangefs: do not wait for timeout if umounting
  orangefs: return from orangefs_devreq_read quickly if possible
  orangefs: ensure the userspace component is unmounted if mount fails
  orangefs: do not check possibly stale size on truncate
  orangefs: implement statx
  orangefs: remove ORANGEFS_READDIR macros
  orangefs: support very large directories
  orangefs: support llseek on directories
  orangefs: rewrite readdir to fix several bugs
  orangefs: do not set getattr_time on orangefs_lookup
  orangefs: clean up oversize xattr validation
  orangefs: fix bounds check for listxattr
  orangefs: remove unused get_fsid_from_ino
2017-05-05 13:36:10 -07:00
Linus Torvalds
414975eb76 befs fixes for 4.12-rc1

Merge tag 'befs-v4.12-rc1' of git://github.com/luisbg/linux-befs

Pull befs fix from Luis de Bethencourt:
 "One fix from Fabian Frederick making the nfs client still work after a
  cache drop"

* tag 'befs-v4.12-rc1' of git://github.com/luisbg/linux-befs:
  befs: make export work with cold dcache
2017-05-05 13:33:38 -07:00
Fabian Frederick
6b4657667b fs/affs: add rename exchange
Process RENAME_EXCHANGE in affs_rename2() adding static
affs_xrename() based on affs_rename().

We remove the headers from their respective directories, then
assign each bh to the other inode's directory entries to swap them.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-05 15:24:52 -04:00
Fabian Frederick
c6184028a7 fs/affs: add rename2 to prepare multiple methods
Currently AFFS only supports RENAME_NOREPLACE.
This patch isolates that method to a static function to
prepare RENAME_EXCHANGE addition.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-05 15:24:52 -04:00
Bob Peterson
ed17545d01 GFS2: Allow glocks to be unlocked after withdraw
This patch fixes a regression introduced by patch 0d1c7ae9d8.

The intent of the patch was to stop promoting glocks after a
file system is withdrawn due to a variety of errors, because doing
so results in a BUG(). (You should be able to unmount after a
withdraw rather than having the kernel panic.)

Unfortunately, it also stopped demotions, so glocks could not be
unlocked after withdraw, which means the unmount would hang.

This patch allows function do_xmote to demote locks to an
unlocked state after a withdraw, but not promote them.
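
The rule described above can be written as a small predicate. This is only a
Python model of the described behavior, not the do_xmote() code; the state
names and the function may_transition() are hypothetical.

```python
# Hypothetical lock states for illustration only.
UNLOCKED, SHARED, EXCLUSIVE = 0, 1, 2


def may_transition(withdrawn, target):
    """After a withdraw, only a demotion to UNLOCKED is still allowed."""
    if withdrawn and target != UNLOCKED:
        return False  # promotions are refused once the fs is withdrawn
    return True


print(may_transition(True, UNLOCKED), may_transition(True, SHARED))
```

With this shape, an unmount after withdraw can still drop every lock, while no
new lock can be acquired.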

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-05-05 14:19:28 -05:00
Eryu Guan
161f55efba xfs: fix use-after-free in xfs_finish_page_writeback
Commit 28b783e47a ("xfs: bufferhead chains are invalid after
end_page_writeback") fixed one use-after-free issue by
pre-calculating the loop conditionals before calling bh->b_end_io()
in the end_io processing loop, but it assigned 'next' pointer before
checking end offset boundary & breaking the loop, at which point the
bh might be freed already, and caused use-after-free.

This is caught by KASAN when running fstests generic/127 on sub-page
block size XFS.

[ 2517.244502] run fstests generic/127 at 2017-04-27 07:30:50
[ 2747.868840] ==================================================================
[ 2747.876949] BUG: KASAN: use-after-free in xfs_destroy_ioend+0x3d3/0x4e0 [xfs] at addr ffff8801395ae698
...
[ 2747.918245] Call Trace:
[ 2747.920975]  dump_stack+0x63/0x84
[ 2747.924673]  kasan_object_err+0x21/0x70
[ 2747.928950]  kasan_report+0x271/0x530
[ 2747.933064]  ? xfs_destroy_ioend+0x3d3/0x4e0 [xfs]
[ 2747.938409]  ? end_page_writeback+0xce/0x110
[ 2747.943171]  __asan_report_load8_noabort+0x19/0x20
[ 2747.948545]  xfs_destroy_ioend+0x3d3/0x4e0 [xfs]
[ 2747.953724]  xfs_end_io+0x1af/0x2b0 [xfs]
[ 2747.958197]  process_one_work+0x5ff/0x1000
[ 2747.962766]  worker_thread+0xe4/0x10e0
[ 2747.966946]  kthread+0x2d3/0x3d0
[ 2747.970546]  ? process_one_work+0x1000/0x1000
[ 2747.975405]  ? kthread_create_on_node+0xc0/0xc0
[ 2747.980457]  ? syscall_return_slowpath+0xe6/0x140
[ 2747.985706]  ? do_page_fault+0x30/0x80
[ 2747.989887]  ret_from_fork+0x2c/0x40
[ 2747.993874] Object at ffff8801395ae690, in cache buffer_head size: 104
[ 2748.001155] Allocated:
[ 2748.003782] PID = 8327
[ 2748.006411]  save_stack_trace+0x1b/0x20
[ 2748.010688]  save_stack+0x46/0xd0
[ 2748.014383]  kasan_kmalloc+0xad/0xe0
[ 2748.018370]  kasan_slab_alloc+0x12/0x20
[ 2748.022648]  kmem_cache_alloc+0xb8/0x1b0
[ 2748.027024]  alloc_buffer_head+0x22/0xc0
[ 2748.031399]  alloc_page_buffers+0xd1/0x250
[ 2748.035968]  create_empty_buffers+0x30/0x410
[ 2748.040730]  create_page_buffers+0x120/0x1b0
[ 2748.045493]  __block_write_begin_int+0x17a/0x1800
[ 2748.050740]  iomap_write_begin+0x100/0x2f0
[ 2748.055308]  iomap_zero_range_actor+0x253/0x5c0
[ 2748.060362]  iomap_apply+0x157/0x270
[ 2748.064347]  iomap_zero_range+0x5a/0x80
[ 2748.068624]  iomap_truncate_page+0x6b/0xa0
[ 2748.073227]  xfs_setattr_size+0x1f7/0xa10 [xfs]
[ 2748.078312]  xfs_vn_setattr_size+0x68/0x140 [xfs]
[ 2748.083589]  xfs_file_fallocate+0x4ac/0x820 [xfs]
[ 2748.088838]  vfs_fallocate+0x2cf/0x780
[ 2748.093021]  SyS_fallocate+0x48/0x80
[ 2748.097006]  do_syscall_64+0x18a/0x430
[ 2748.101186]  return_from_SYSCALL_64+0x0/0x6a
[ 2748.105948] Freed:
[ 2748.108189] PID = 8327
[ 2748.110816]  save_stack_trace+0x1b/0x20
[ 2748.115093]  save_stack+0x46/0xd0
[ 2748.118788]  kasan_slab_free+0x73/0xc0
[ 2748.122969]  kmem_cache_free+0x7a/0x200
[ 2748.127247]  free_buffer_head+0x41/0x80
[ 2748.131524]  try_to_free_buffers+0x178/0x250
[ 2748.136316]  xfs_vm_releasepage+0x2e9/0x3d0 [xfs]
[ 2748.141563]  try_to_release_page+0x100/0x180
[ 2748.146325]  invalidate_inode_pages2_range+0x7da/0xcf0
[ 2748.152087]  xfs_shift_file_space+0x37d/0x6e0 [xfs]
[ 2748.157557]  xfs_collapse_file_space+0x49/0x120 [xfs]
[ 2748.163223]  xfs_file_fallocate+0x2a7/0x820 [xfs]
[ 2748.168462]  vfs_fallocate+0x2cf/0x780
[ 2748.172642]  SyS_fallocate+0x48/0x80
[ 2748.176629]  do_syscall_64+0x18a/0x430
[ 2748.180810]  return_from_SYSCALL_64+0x0/0x6a

Fix it by checking the offset against end and breaking out first, and
dereferencing bh only if there are still bufferheads to process.
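
The ordering matters because completing the last buffer can release the whole
chain. The following Python toy model (not the XFS code; Page, Buf and
finish_range() are hypothetical names) sketches a loop that checks the boundary
before it touches the next buffer.

```python
class Page:
    """Toy page: a circular chain of buffers plus a count of pending completions."""

    def __init__(self, nbufs, pending):
        self.bufs = [Buf(self) for _ in range(nbufs)]
        for i, buf in enumerate(self.bufs):
            buf.next = self.bufs[(i + 1) % nbufs]
        self.pending = pending


class Buf:
    """Toy buffer head; reading `next` after release models a use-after-free."""

    def __init__(self, page):
        self.page = page
        self.next = None
        self.freed = False

    def read_next(self):
        if self.freed:
            raise RuntimeError("use-after-free: buffer already released")
        return self.next

    def end_io(self):
        # Completing the last pending buffer lets the whole chain be released.
        self.page.pending -= 1
        if self.page.pending == 0:
            for buf in self.page.bufs:
                buf.freed = True


def finish_range(page, bsize, start, end):
    """Complete buffers whose offset lies in [start, end]; return the count."""
    head = bh = page.bufs[0]
    off = done = 0
    while True:
        if off > end:            # boundary check first: bh may be freed by now
            break
        nxt = bh.read_next()     # dereference only while work remains
        if off >= start:
            bh.end_io()
            done += 1
        off += bsize
        bh = nxt
        if bh is head:
            break
    return done
```

If read_next() were called before the boundary check, the third iteration in a
four-buffer page with two buffers under writeback would touch a released buffer.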

Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-05 12:16:48 -07:00
Linus Torvalds
e579dde654 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "This is a set of small fixes that were mostly stumbled over during
  more significant development. This proc fix and the fix to
  posix-timers are the most significant of the lot.

  There is a lot of good development going on but unfortunately it
  didn't quite make the merge window"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Fix unbalanced hard link numbers
  signal: Make kill_proc_info static
  rlimit: Properly call security_task_setrlimit
  signal: Remove unused definition of sig_user_definied
  ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET
  ipc: Remove unused declaration of recompute_msgmni
  posix-timers: Correct sanity check in posix_cpu_nsleep
  sysctl: Remove dead register_sysctl_root
2017-05-05 11:08:43 -07:00
Fabian Frederick
0795bf8357 nfs: use kmap/kunmap directly
This patch removes useless nfs_readdir_get_array() and
nfs_readdir_release_array() as suggested by Trond Myklebust

nfs_readdir() calls nfs_revalidate_mapping() before
readdir_search_pagecache() , nfs_do_filldir(), uncached_readdir()
so mapping should be correct.

Since kmap() can't fail, all subsequent error checks were removed
as well as unused labels.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-05 13:01:33 -04:00
Hou Tao
59b86d85a7 NFS: always treat the invocation of nfs_getattr as cache hit when noac is on
When using 'ls -l' to display a large directory with the noac option set,
nfs_need_revalidate_inode() will always be true in nfs_getattr() for NFSv3,
and the nfs_entry cache of the directory will be flushed. The flush leads
to a full reread of the directory entries from the server.

To prevent the unnecessary RPCs, we need to check whether or not the
noac option is used, and always report the invocation of nfs_getattr()
as a cache hit instead of a cache miss when it is on.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-05 13:01:32 -04:00
Dave Wysochanski
5c737cb299 Fix nfs_client refcounting if kmalloc fails in nfs4_proc_exchange_id and nfs4_proc_async_renew
If memory allocation fails for the callback data, we need to put the nfs_client
or we end up with an elevated refcount.
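
A minimal Python sketch of the balancing rule, assuming a simple counter in
place of the kernel's reference count; Client, start_async_call() and the alloc
callback are hypothetical stand-ins for illustration.

```python
class Client:
    """Toy object with a reference count."""

    def __init__(self):
        self.refcount = 1

    def get(self):
        self.refcount += 1

    def put(self):
        self.refcount -= 1


def start_async_call(client, alloc):
    """Take a reference for the callback data; drop it again if allocation fails."""
    client.get()
    data = alloc()
    if data is None:
        client.put()  # undo the get so the count is not left elevated
        return None
    data["client"] = client
    return data
```

On the failure path the count returns to its previous value; on success the
reference stays with the callback data until the call completes.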

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-05 13:01:32 -04:00
Trond Myklebust
0048fdd066 NFSv4.1: RECLAIM_COMPLETE must handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION
If the server returns NFS4ERR_CONN_NOT_BOUND_TO_SESSION because we
are trunking, then RECLAIM_COMPLETE must handle that by calling
nfs4_schedule_session_recovery() and then retrying.

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
2017-05-05 12:01:50 -04:00
Björn Jacke
85435d7a15 CIFS: add missing SFM mapping for doublequote
SFM is mapping doublequote to 0xF020

Without this patch, creating files with a doublequote in the name fails
against Windows/Mac servers.
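
As an illustration only, the mapping can be modeled in Python as a character
substitution. The value 0xF020 comes from the message above; the helper names
to_sfm() and from_sfm() are hypothetical, and the other reserved characters are
left out.

```python
SFM_DOUBLEQUOTE = 0xF020  # code point named in the commit message


def to_sfm(name):
    """Replace the reserved doublequote before sending a name to the server."""
    return name.replace('"', chr(SFM_DOUBLEQUOTE))


def from_sfm(name):
    """Map the private-use code point back to a doublequote."""
    return name.replace(chr(SFM_DOUBLEQUOTE), '"')
```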

Signed-off-by: Bjoern Jacke <bjacke@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: stable <stable@vger.kernel.org>
2017-05-05 08:33:44 -05:00
Fabian Frederick
dcfd9b215b befs: make export work with cold dcache
based on commit b3b42c0dea
("fs/affs: make export work with cold dcache")

This adds get_parent function so that nfs client can still work after
cache drop (Tested on NFS v4 with echo 3 > /proc/sys/vm/drop_caches)

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2017-05-05 11:35:35 +01:00
Amir Goldstein
5b6c9053fb ovl: persistent inode numbers for upper hardlinks
An upper type non directory dentry that is a copy up target
should have a reference to its lower copy up origin.

There are three ways for an upper type dentry to be instantiated:
1. A lower type dentry that is being copied up
2. An entry that is found in upper dir by ovl_lookup()
3. A negative dentry is hardlinked to an upper type dentry

In the first case, the lower reference is set before copy up.
In the second case, the lower reference is found by ovl_lookup().
In the last case of hardlinked upper dentry, it is not easy to
update the lower reference of the negative dentry.  Instead,
drop the newly hardlinked negative dentry from dcache and let
the next access call ovl_lookup() to find its lower reference.

This makes sure that the inode number reported by stat(2) after
the hardlink is created is the same inode number that will be
reported by stat(2) after mount cycle, which is the inode number
of the lower copy up origin of the hardlink source.

NOTE that this does not fix breaking of lower hardlinks on copy
up, but only fixes the case of lower nlink == 1, whose upper copy
up inode is hardlinked in upper dir.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Miklos Szeredi
5b712091a3 ovl: merge getattr for dir and nondir
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Amir Goldstein
72b608f085 ovl: constant st_ino/st_dev across copy up
When all layers are on the same underlying filesystem, let stat(2) return
st_dev/st_ino values of the copy up origin inode if it is known.

This results in constant st_ino/st_dev representation of files in an
overlay mount before and after copy up.

When the underlying filesystem supports NFS exportfs, the result is also
persistent st_ino/st_dev representation before and after mount cycle.

Lower hardlinks are broken on copy up to different upper files, so we
cannot use the lower origin st_ino for those different files, even for the
same fs case.

When all overlay layers are on the same fs, use overlay st_dev for non-dirs
to get the correct result from du -x.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Amir Goldstein
b7a807dc20 ovl: persistent inode number for directories
stat(2) on overlay directories reports the overlay temp inode
number, which is constant across copy up, but is not persistent.

When all layers are on the same fs, report the copy up origin inode
number for directories.

This inode number is persistent, unique across the overlay mount and
constant across copy up.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Amir Goldstein
595485033d ovl: set the ORIGIN type flag
For directory entries, non zero oe->numlower implies OVL_TYPE_MERGE.
Define a new type flag OVL_TYPE_ORIGIN to indicate that an entry holds a
reference to its lower copy up origin.

For directory entries ORIGIN := MERGE && UPPER. For non-dir entries ORIGIN
means that a lower type dentry has been recently copied up or that we were
able to find the copy up origin from overlay.origin xattr.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Amir Goldstein
a9d019573e ovl: lookup non-dir copy-up-origin by file handle
If overlay.origin xattr is found on a non-dir upper inode try to get lower
dentry by calling exportfs_decode_fh().

On failure to lookup by file handle to lower layer, do not lookup the copy
up origin by name, because the lower found by name could be another file in
case the upper file was renamed.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Amir Goldstein
c22205d058 ovl: use an auxiliary var for overlay root entry
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Amir Goldstein
3a1e819b4e ovl: store file handle of lower inode on copy up
Sometimes it is interesting to know if an upper file is pure upper or a
copy up target, and if it is a copy up target, it may be interesting to
find the copy up origin.

This will be used to preserve lower inode numbers across copy up.

Store the lower inode file handle in upper inode extended attribute
overlay.origin on copy up to use it later for these cases.  Store the lower
filesystem uuid alongside the file handle, so we can validate that we are
looking for the origin file in the original fs.

If the lower fs does not support NFS export ops, store a zero sized xattr so
we can always use the overlay.origin xattr to distinguish between a copy up
and a pure upper inode.
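
The three cases this gives can be sketched as follows. This is a Python model
of the decision only; classify_upper() is a hypothetical helper, and the xattr
value is passed in as bytes, or None when the attribute is absent.

```python
def classify_upper(origin_xattr):
    """Classify an upper inode from its overlay.origin value (None = no xattr)."""
    if origin_xattr is None:
        return "pure upper"
    if len(origin_xattr) == 0:
        return "copy up, no file handle stored"
    return "copy up, file handle stored"
```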

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:58 +02:00
Amir Goldstein
7bcd74b98d ovl: check if all layers are on the same fs
Some features can only work when all layers are on the same fs.  Test this
condition during mount time, so features can check it later.

Add helper ovl_same_sb() to return the common super block in case all
layers are on the same fs.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-05-05 11:38:57 +02:00
Linus Torvalds
af82455f7d char/misc patches for 4.12-rc1
Here is the big set of new char/misc drivers and features for
 4.12-rc1.
 
 There's lots of new drivers added this time around, new firmware drivers
 from Google, more auxdisplay drivers, extcon drivers, fpga drivers, and
 a bunch of other driver updates.  Nothing major, except if you happen to
 have the hardware for these drivers, and then you will be happy :)
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge tag 'char-misc-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver updates from Greg KH:
 "Here is the big set of new char/misc drivers and features for
  4.12-rc1.

  There's lots of new drivers added this time around, new firmware
  drivers from Google, more auxdisplay drivers, extcon drivers, fpga
  drivers, and a bunch of other driver updates. Nothing major, except if
  you happen to have the hardware for these drivers, and then you will
  be happy :)

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (136 commits)
  firmware: google memconsole: Fix return value check in platform_memconsole_init()
  firmware: Google VPD: Fix return value check in vpd_platform_init()
  goldfish_pipe: fix build warning about using too much stack.
  goldfish_pipe: An implementation of more parallel pipe
  fpga fr br: update supported version numbers
  fpga: region: release FPGA region reference in error path
  fpga altera-hps2fpga: disable/unprepare clock on error in alt_fpga_bridge_probe()
  mei: drop the TODO from samples
  firmware: Google VPD sysfs driver
  firmware: Google VPD: import lib_vpd source files
  misc: lkdtm: Add volatile to intentional NULL pointer reference
  eeprom: idt_89hpesx: Add OF device ID table
  misc: ds1682: Add OF device ID table
  misc: tsl2550: Add OF device ID table
  w1: Remove unneeded use of assert() and remove w1_log.h
  w1: Use kernel common min() implementation
  uio_mf624: Align memory regions to page size and set correct offsets
  uio_mf624: Refactor memory info initialization
  uio: Allow handling of non page-aligned memory regions
  hangcheck-timer: Fix typo in comment
  ...
2017-05-04 19:15:35 -07:00
Chris Mason
9bcaaea741 btrfs: fix the gfp_mask for the reada_zones radix tree
Commits cc8385b59e and 7ef70b4d99 added preallocation for the
reada radix trees and also switched them over to GFP_KERNEL for the
default gfp mask.

Since we're doing radix tree insertions under spinlocks, we need
to make sure the mask doesn't allow sleeping.  This fix keeps
the radix preallocation but switches back to the original gfp_mask.

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-05-04 16:56:11 -07:00
Martin Brandenburg
2f713b5c7d orangefs: count directory pieces correctly
A large directory full of differently sized file names triggered this.
Most directories, even very large directories with shorter names, would
be lucky enough to fit in one server response.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-05-04 14:38:24 -04:00
Martin Brandenburg
942835d68f orangefs: invalidate stored directory on seek
If an application seeks to a position before the point which has been
read, it must want updates which have been made to the directory.  So
delete the copy stored in the kernel so it will be fetched again.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-05-04 14:38:15 -04:00
Martin Brandenburg
bf15ba7c1f orangefs: skip forward to the next directory entry if seek is short
If userspace seeks to a position in the stream which is not correct, it
would have returned EIO because the data in the buffer at that offset
would be incorrect.  This and the userspace daemon returning a corrupt
directory are indistinguishable.

Now if the data does not look right, skip forward to the next chunk and
try again.  The motivation is that if the directory changes, an
application may seek to a position that was valid and no longer is valid.

It is not yet possible for a directory to change.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-05-04 14:38:10 -04:00
Eric Biggers
d9b9f8d5a8 ext4: clean up ext4_match() and callers
When ext4 encryption was originally merged, we were encrypting the
user-specified filename in ext4_match(), introducing a lot of additional
complexity into ext4_match() and its callers.  This has since been
changed to encrypt the filename earlier, so we can remove the gunk
that's no longer needed.  This more or less reverts ext4_search_dir()
and ext4_find_dest_de() to the way they were in the v4.0 kernel.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:40 -04:00
Eric Biggers
1f73d49177 f2fs: switch to using fscrypt_match_name()
Switch f2fs directory searches to use the fscrypt_match_name() helper
function.  There should be no functional change.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:39 -04:00
Eric Biggers
067d1023b6 ext4: switch to using fscrypt_match_name()
Switch ext4 directory searches to use the fscrypt_match_name() helper
function.  There should be no functional change.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:38 -04:00
Eric Biggers
17159420a6 fscrypt: introduce helper function for filename matching
Introduce a helper function fscrypt_match_name() which tests whether a
fscrypt_name matches a directory entry.  Also clean up the magic numbers
and document things properly.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:37 -04:00
Eric Biggers
6b06cdee81 fscrypt: avoid collisions when presenting long encrypted filenames
When accessing an encrypted directory without the key, userspace must
operate on filenames derived from the ciphertext names, which contain
arbitrary bytes.  Since we must support filenames as long as NAME_MAX,
we can't always just base64-encode the ciphertext, since that may make
it too long.  Currently, this is solved by presenting long names in an
abbreviated form containing any needed filesystem-specific hashes (e.g.
to identify a directory block), then the last 16 bytes of ciphertext.
This needs to be sufficient to identify the actual name on lookup.

However, there is a bug.  It seems to have been assumed that due to the
use of a CBC (cipher block chaining)-based encryption mode, the last
16 bytes (i.e. the AES block size) of ciphertext would depend on the
full plaintext, preventing collisions.  However, we actually use CBC
with ciphertext stealing (CTS), which handles the last two blocks
specially, causing them to appear "flipped".  Thus, it's actually the
second-to-last block which depends on the full plaintext.

This caused long filenames that differ only near the end of their
plaintexts to, when observed without the key, point to the wrong inode
and be undeletable.  For example, with ext4:

    # echo pass | e4crypt add_key -p 16 edir/
    # seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
    # find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
    100000
    # sync
    # echo 3 > /proc/sys/vm/drop_caches
    # keyctl new_session
    # find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
    2004
    # rm -rf edir/
    rm: cannot remove 'edir/_A7nNFi3rhkEQlJ6P,hdzluhODKOeWx5V': Structure needs cleaning
    ...

To fix this, when presenting long encrypted filenames, encode the
second-to-last block of ciphertext rather than the last 16 bytes.
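
One way to locate that block is sketched below in Python. It assumes 16-byte
blocks and that the ciphertext may end in a partial final block; digest_block()
is a hypothetical helper for illustration, not the kernel implementation.

```python
BLOCK = 16  # AES block size in bytes, as stated above


def digest_block(ciphertext):
    """Return the second-to-last block of a ciphertext whose last block may be partial."""
    if len(ciphertext) < 2 * BLOCK:
        raise ValueError("need at least two blocks")
    # Round (length - one block - 1) down to a block boundary.
    start = ((len(ciphertext) - BLOCK - 1) // BLOCK) * BLOCK
    return ciphertext[start:start + BLOCK]
```

For a 40-byte ciphertext this selects bytes 16..31, and for a 32-byte one it
selects bytes 0..15.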

Although it would be nice to solve this without depending on a specific
encryption mode, that would mean doing a cryptographic hash like SHA-256
which would be much less efficient.  This way is sufficient for now, and
it's still compatible with encryption modes like HEH which are strong
pseudorandom permutations.  Also, changing the presented names is still
allowed at any time because they are only provided to allow applications
to do things like delete encrypted directories.  They're not designed to
be used to persistently identify files --- which would be hard to do
anyway, given that they're encrypted after all.

For ease of backports, this patch only makes the minimal fix to both
ext4 and f2fs.  It leaves ubifs as-is, since ubifs doesn't compare the
ciphertext block yet.  Follow-on patches will clean things up properly
and make the filesystems use a shared helper function.

Fixes: 5de0b4d0cd ("ext4 crypto: simplify and speed up filename encryption")
Reported-by: Gwendal Grignou <gwendal@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:36 -04:00
Jaegeuk Kim
6332cd32c8 f2fs: check entire encrypted bigname when finding a dentry
If the user has no key for an encrypted dir, fscrypt presents digested dentries.
Previously, when looking up a dentry, f2fs only compared its hash value with the
first 4 bytes of the digested dentry, which did not fully handle hash collisions.
This patch checks the entire dentry bytes, as ext4 does.

Eric reported how to reproduce this issue by:

 # seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
 # find edir -type f | xargs stat -c %i | sort | uniq | wc -l
100000
 # sync
 # echo 3 > /proc/sys/vm/drop_caches
 # keyctl new_session
 # find edir -type f | xargs stat -c %i | sort | uniq | wc -l
99999

Cc: <stable@vger.kernel.org>
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
(fixed f2fs_dentry_hash() to work even when the hash is 0)
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:35 -04:00
Eric Biggers
413d5a9edb ubifs: check for consistent encryption contexts in ubifs_lookup()
As ext4 and f2fs do, ubifs should check for consistent encryption
contexts during ->lookup() in an encrypted directory.  This protects
certain users of filesystem encryption against certain types of offline
attacks.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:35 -04:00
Eric Biggers
faac7fd97e f2fs: sync f2fs_lookup() with ext4_lookup()
As for ext4, now that fscrypt_has_permitted_context() correctly handles
the case where we have the key for the parent directory but not the
child, f2fs_lookup() no longer has to work around it.  Also add the same
warning message that ext4 uses.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:34 -04:00
Eric Biggers
8c68084bff ext4: remove "nokey" check from ext4_lookup()
Now that fscrypt_has_permitted_context() correctly handles the case
where we have the key for the parent directory but not the child, we
don't need to try to work around this in ext4_lookup().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:44:33 -04:00
Eric Biggers
272f98f684 fscrypt: fix context consistency check when key(s) unavailable
To mitigate some types of offline attacks, filesystem encryption is
designed to enforce that all files in an encrypted directory tree use
the same encryption policy (i.e. the same encryption context excluding
the nonce).  However, the fscrypt_has_permitted_context() function which
enforces this relies on comparing struct fscrypt_info's, which are only
available when we have the encryption keys.  This can cause two
incorrect behaviors:

1. If we have the parent directory's key but not the child's key, or
   vice versa, then fscrypt_has_permitted_context() returned false,
   causing applications to see EPERM or ENOKEY.  This is incorrect if
   the encryption contexts are in fact consistent.  Although we'd
   normally have either both keys or neither key in that case since the
   master_key_descriptors would be the same, this is not guaranteed
   because keys can be added or removed from keyrings at any time.

2. If we have neither the parent's key nor the child's key, then
   fscrypt_has_permitted_context() returned true, causing applications
   to see no error (or else an error for some other reason).  This is
   incorrect if the encryption contexts are in fact inconsistent, since
   in that case we should deny access.

To fix this, retrieve and compare the fscrypt_contexts if we are unable
to set up both fscrypt_infos.
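
The comparison "same encryption context excluding the nonce" can be modeled in
Python as below. The Context class and its field names are illustrative
assumptions, not the kernel's struct layout, and same_policy() is a
hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    # Field names are illustrative, not the kernel's struct layout.
    contents_mode: int
    filenames_mode: int
    flags: int
    master_key_descriptor: bytes
    nonce: bytes  # unique per inode, so it must not take part in the comparison


def same_policy(parent, child):
    """True if both contexts describe the same policy, ignoring the nonce."""
    return (
        parent.contents_mode == child.contents_mode
        and parent.filenames_mode == child.filenames_mode
        and parent.flags == child.flags
        and parent.master_key_descriptor == child.master_key_descriptor
    )
```

Because only stored context fields are compared, the check gives the same
answer whether or not the keys are present.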

While this slightly hurts performance when accessing an encrypted
directory tree without the key, this isn't a case we really need to be
optimizing for; access *with* the key is much more important.
Furthermore, the performance hit is barely noticeable given that we are
already retrieving the fscrypt_context and doing two keyring searches in
fscrypt_get_encryption_info().  If we ever actually wanted to optimize
this case we might start by caching the fscrypt_contexts.

Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:43:17 -04:00
Jan Kara
17f423b516 jbd2: cleanup write flags handling from jbd2_write_superblock()
Currently jbd2_write_superblock() silently adds REQ_SYNC to flags with
which journal superblock is written. Make this explicit by making flags
passed down to jbd2_write_superblock() contain REQ_SYNC.

CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 11:01:31 -04:00
Jan Kara
00473374b7 ext4: mark superblock writes synchronous for nobarrier mounts
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_FUA implementation.
generic_make_request_checks() however strips REQ_FUA flag from a bio
when the storage doesn't report volatile write cache and thus write
effectively becomes asynchronous which can lead to performance
regressions. This affects superblock writes for ext4. Fix the problem
by marking superblock writes always as synchronous.
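
A small user-space model of the regression described above. The flag
values and demo_* names are hypothetical (the kernel's REQ_* encodings
differ); the sketch only shows why a write that relies on FUA alone stops
being synchronous once FUA is stripped, while an explicitly marked write
does not.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values, not the kernel's. */
#define DEMO_REQ_SYNC	(1u << 0)
#define DEMO_REQ_FUA	(1u << 1)

/* Models the submission checks dropping FUA without a volatile write cache. */
static inline unsigned int demo_submit_checks(unsigned int flags,
					      bool has_volatile_cache)
{
	if (!has_volatile_cache)
		flags &= ~DEMO_REQ_FUA;
	return flags;
}

/* A write counts as synchronous if either hint is still set. */
static inline bool demo_is_sync(unsigned int flags)
{
	return (flags & (DEMO_REQ_SYNC | DEMO_REQ_FUA)) != 0;
}

/* Usage: FUA-only degrades to async; FUA plus SYNC stays synchronous. */
static inline void demo_sync_selfcheck(void)
{
	assert(!demo_is_sync(demo_submit_checks(DEMO_REQ_FUA, false)));
	assert(demo_is_sync(demo_submit_checks(DEMO_REQ_FUA | DEMO_REQ_SYNC,
					       false)));
}
```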

Fixes: b685d3d65a
CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 10:58:03 -04:00
Jan Kara
9052c7cf49 nfs: Fix bdi handling for cloned superblocks
In commit 0d3b12584972 "nfs: Convert to separately allocated bdi" I have
wrongly cloned bdi reference in nfs_clone_super(). Further inspection
has shown that originally the code was actually allocating a new bdi (in
->clone_server callback) which was later registered in
nfs_fs_mount_common() and used for sb->s_bdi in nfs_initialise_sb().
This could later result in bdi for the original superblock not getting
unregistered when that superblock got shutdown (as the cloned sb still
held bdi reference) and later when a new superblock was created under
the same anonymous device number, a clash in sysfs has happened on bdi
registration:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 10284 at /linux-next/fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x74
sysfs: cannot create duplicate filename '/devices/virtual/bdi/0:32'
Modules linked in: axp20x_usb_power gpio_axp209 nvmem_sunxi_sid sun4i_dma sun4i_ss virt_dma
CPU: 1 PID: 10284 Comm: mount.nfs Not tainted 4.11.0-rc4+ #14
Hardware name: Allwinner sun7i (A20) Family
[<c010f19c>] (unwind_backtrace) from [<c010bc74>] (show_stack+0x10/0x14)
[<c010bc74>] (show_stack) from [<c03c6e24>] (dump_stack+0x78/0x8c)
[<c03c6e24>] (dump_stack) from [<c0122200>] (__warn+0xe8/0x100)
[<c0122200>] (__warn) from [<c0122250>] (warn_slowpath_fmt+0x38/0x48)
[<c0122250>] (warn_slowpath_fmt) from [<c02ac178>] (sysfs_warn_dup+0x64/0x74)
[<c02ac178>] (sysfs_warn_dup) from [<c02ac254>] (sysfs_create_dir_ns+0x84/0x94)
[<c02ac254>] (sysfs_create_dir_ns) from [<c03c8b8c>] (kobject_add_internal+0x9c/0x2ec)
[<c03c8b8c>] (kobject_add_internal) from [<c03c8e24>] (kobject_add+0x48/0x98)
[<c03c8e24>] (kobject_add) from [<c048d75c>] (device_add+0xe4/0x5a0)
[<c048d75c>] (device_add) from [<c048ddb4>] (device_create_groups_vargs+0xac/0xbc)
[<c048ddb4>] (device_create_groups_vargs) from [<c048dde4>] (device_create_vargs+0x20/0x28)
[<c048dde4>] (device_create_vargs) from [<c02075c8>] (bdi_register_va+0x44/0xfc)
[<c02075c8>] (bdi_register_va) from [<c023d378>] (super_setup_bdi_name+0x48/0xa4)
[<c023d378>] (super_setup_bdi_name) from [<c0312ef4>] (nfs_fill_super+0x1a4/0x204)
[<c0312ef4>] (nfs_fill_super) from [<c03133f0>] (nfs_fs_mount_common+0x140/0x1e8)
[<c03133f0>] (nfs_fs_mount_common) from [<c03335cc>] (nfs4_remote_mount+0x50/0x58)
[<c03335cc>] (nfs4_remote_mount) from [<c023ef98>] (mount_fs+0x14/0xa4)
[<c023ef98>] (mount_fs) from [<c025cba0>] (vfs_kern_mount+0x54/0x128)
[<c025cba0>] (vfs_kern_mount) from [<c033352c>] (nfs_do_root_mount+0x80/0xa0)
[<c033352c>] (nfs_do_root_mount) from [<c0333818>] (nfs4_try_mount+0x28/0x3c)
[<c0333818>] (nfs4_try_mount) from [<c0313874>] (nfs_fs_mount+0x2cc/0x8c4)
[<c0313874>] (nfs_fs_mount) from [<c023ef98>] (mount_fs+0x14/0xa4)
[<c023ef98>] (mount_fs) from [<c025cba0>] (vfs_kern_mount+0x54/0x128)
[<c025cba0>] (vfs_kern_mount) from [<c02600f0>] (do_mount+0x158/0xc7c)
[<c02600f0>] (do_mount) from [<c0260f98>] (SyS_mount+0x8c/0xb4)
[<c0260f98>] (SyS_mount) from [<c0107840>] (ret_fast_syscall+0x0/0x3c)

Fix the problem by always creating new bdi for a superblock as we used
to do.

Reported-and-tested-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Fixes: 0d3b12584972ce5781179ad3f15cca3cdb5cae05
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-05-04 07:57:46 -06:00
Luis Henriques
eeca958dce ceph: fix memory leak in __ceph_setxattr()
The ceph_inode_xattr needs to be released when removing an xattr.  Easily
reproducible running the 'generic/020' test from xfstests or simply by
doing:

  attr -s attr0 -V 0 /mnt/test && attr -r attr0 /mnt/test

While there, also fix the error path.

Here's the kmemleak splat:

unreferenced object 0xffff88001f86fbc0 (size 64):
  comm "attr", pid 244, jiffies 4294904246 (age 98.464s)
  hex dump (first 32 bytes):
    40 fa 86 1f 00 88 ff ff 80 32 38 1f 00 88 ff ff  @........28.....
    00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  ................
  backtrace:
    [<ffffffff81560199>] kmemleak_alloc+0x49/0xa0
    [<ffffffff810f3e5b>] kmem_cache_alloc+0x9b/0xf0
    [<ffffffff812b157e>] __ceph_setxattr+0x17e/0x820
    [<ffffffff812b1c57>] ceph_set_xattr_handler+0x37/0x40
    [<ffffffff8111fb4b>] __vfs_removexattr+0x4b/0x60
    [<ffffffff8111fd37>] vfs_removexattr+0x77/0xd0
    [<ffffffff8111fdd1>] removexattr+0x41/0x60
    [<ffffffff8111fe65>] path_removexattr+0x75/0xa0
    [<ffffffff81120aeb>] SyS_lremovexattr+0xb/0x10
    [<ffffffff81564b20>] entry_SYSCALL_64_fastpath+0x13/0x94
    [<ffffffffffffffff>] 0xffffffffffffffff

Cc: stable@vger.kernel.org
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:24 +02:00
Alexander Graf
f775ff7d89 ceph: fix file open flags on ppc64
The file open flags (O_foo) are platform specific and should never go
out to an interface that is not local to the system.

Unfortunately these flags have leaked out onto the wire in the cephfs
implementation. That led to bogus flags getting transmitted on ppc64.

This patch converts the kernel view of flags to the ceph view of file
open flags.
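
A minimal sketch of such a conversion, assuming a fixed set of wire
values. The WIRE_O_* constants and demo_wire_open_flags() are
hypothetical names chosen for illustration; the real protocol constants
may differ. Only the idea matters: host O_* bits are never sent as-is.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>

/* Hypothetical platform-independent wire values. */
#define WIRE_O_RDONLY	0x0u
#define WIRE_O_WRONLY	0x1u
#define WIRE_O_RDWR	0x2u
#define WIRE_O_CREAT	0x40u
#define WIRE_O_EXCL	0x80u
#define WIRE_O_TRUNC	0x200u

/* Translate host open flags bit by bit into the wire representation. */
static inline uint32_t demo_wire_open_flags(int flags)
{
	uint32_t wire;

	switch (flags & O_ACCMODE) {
	case O_WRONLY:
		wire = WIRE_O_WRONLY;
		break;
	case O_RDWR:
		wire = WIRE_O_RDWR;
		break;
	default:
		wire = WIRE_O_RDONLY;
		break;
	}
	if (flags & O_CREAT)
		wire |= WIRE_O_CREAT;
	if (flags & O_EXCL)
		wire |= WIRE_O_EXCL;
	if (flags & O_TRUNC)
		wire |= WIRE_O_TRUNC;
	return wire;
}

/* Usage: the result depends only on which flags are set, not on host values. */
static inline void demo_wire_selfcheck(void)
{
	assert(demo_wire_open_flags(O_WRONLY | O_CREAT) ==
	       (WIRE_O_WRONLY | WIRE_O_CREAT));
	assert(demo_wire_open_flags(O_RDONLY) == WIRE_O_RDONLY);
}
```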

Fixes: 124e68e74 ("ceph: file operations")
Signed-off-by: Alexander Graf <agraf@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:24 +02:00
Yan, Zheng
b50c2de51e ceph: choose readdir frag based on previous readdir reply
The dirfragtree is lazily updated, so it is not always accurate. An
infinite loop happens in the following circumstance:

- the client sends a request to read frag A
- frag A has been fragmented into frags B and C, so the mds fills the
  reply with the contents of frag B
- the client wants to read the next frag, C, but
  ceph_choose_frag(frag value of C) returns frag A

The fix is to use the previous readdir reply to calculate the next
readdir frag when possible.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:24 +02:00
Jeff Layton
26544c623e ceph: when seeing write errors on an inode, switch to sync writes
Currently, we don't have a real feedback mechanism in place for when we
start seeing buffered writeback errors. If writeback is failing, there
is nothing that prevents an application from continuing to dirty pages
that aren't being cleaned.

In the event that we're seeing write errors of any sort occur on an
inode, have the callback set a flag to force further writes to be
synchronous. When the next write succeeds, clear the flag to allow
buffered writeback to continue.

Since this is just a hint to the write submission mechanism, we only
take the i_ceph_lock when a lockless check shows that the flag needs to
be changed.
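
The "check without the lock, lock only to change" pattern described
above can be modelled like this. The struct, its fields and the demo_*
names are hypothetical, and a counter stands in for actually taking the
inode lock; this is a single-threaded illustration, not the ceph code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-inode state; lock_acquisitions only counts lock use. */
struct demo_inode {
	bool force_sync;
	unsigned int lock_acquisitions;
};

/* Record a write result; take the "lock" only when the flag must change. */
static inline void demo_note_write_result(struct demo_inode *inode, int err)
{
	bool failed = err != 0;

	if (inode->force_sync == failed)
		return;			/* lockless check: nothing to update */
	inode->lock_acquisitions++;	/* stands in for taking the inode lock */
	inode->force_sync = failed;
}

/* Usage: repeated errors cost one lock round trip, success clears the hint. */
static inline void demo_write_hint_selfcheck(void)
{
	struct demo_inode inode = { false, 0 };

	demo_note_write_result(&inode, -5);
	demo_note_write_result(&inode, -5);
	assert(inode.force_sync && inode.lock_acquisitions == 1);
	demo_note_write_result(&inode, 0);
	assert(!inode.force_sync && inode.lock_acquisitions == 2);
}
```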

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:22 +02:00
Jeff Layton
6fc1fe5e4c Revert "ceph: SetPageError() for writeback pages if writepages fails"
This reverts commit b109eec6f4.

If I'm filling up a filesystem with this sort of command:

    $ dd if=/dev/urandom of=/mnt/cephfs/fillfile bs=2M oflag=sync

...then I'll eventually get back EIO on a write. Further calls
will give us ENOSPC.

I'm not sure what prompted this change, but I don't think it's what we
want to do. If writepages failed, we will have already set the mapping
error appropriately, and that's what gets reported by fsync() or
close().

__filemap_fdatawait_range however, does this:

	wait_on_page_writeback(page);
	if (TestClearPageError(page))
		ret = -EIO;

...and that -EIO ends up trumping the mapping's error if one exists.

When writepages fails, we only want to set the error in the mapping,
and not flag the individual pages.
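
A simplified user-space model of the masking effect described above.
The types and demo_* names are hypothetical; the sketch assumes only
what the message states, namely that a per-page error yields -EIO and
that this result takes precedence over the error stored in the mapping.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct demo_page { bool error; };
struct demo_mapping { int error; };	/* 0 or a negative errno */

/* Wait-side check: a flagged page wins over the mapping's own error. */
static inline int demo_fdatawait(struct demo_page *page,
				 struct demo_mapping *mapping)
{
	int ret = 0;
	int mapping_err = mapping->error;

	mapping->error = 0;		/* the error is reported once */
	if (page->error) {
		page->error = false;
		ret = -EIO;
	}
	if (!ret)
		ret = mapping_err;
	return ret;
}

/* Usage: flagging the page hides -ENOSPC; leaving it clean reports it. */
static inline void demo_fdatawait_selfcheck(void)
{
	struct demo_page flagged = { true }, clean = { false };
	struct demo_mapping m1 = { -ENOSPC }, m2 = { -ENOSPC };

	assert(demo_fdatawait(&flagged, &m1) == -EIO);
	assert(demo_fdatawait(&clean, &m2) == -ENOSPC);
}
```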

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:22 +02:00
Jeff Layton
92475f05bd ceph: handle epoch barriers in cap messages
Have the client store and update the osdc epoch_barrier when a cap
message comes in with one.

When sending cap messages, send the epoch barrier as well. This allows
clients to inform servers that their released caps may not be used until
a particular OSD map epoch.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:21 +02:00
Jeff Layton
a1f4020aab libceph: allow requests to return immediately on full conditions if caller wishes
Usually, when the osd map is flagged as full or the pool is at quota,
write requests just hang. This is not what we want for cephfs, where
it would be better to simply report -ENOSPC back to userland instead
of stalling.

If the caller knows that it will want an immediate error return instead
of blocking on a full or at-quota error condition then allow it to set a
flag to request that behavior.

Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller),
and on any other write request from ceph.ko.

A later patch will deal with requests that were submitted before the new
map showing the full condition came in.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:21 +02:00
Yan, Zheng
79162547b7 ceph: make seeky readdir more efficient
The current cephfs client uses a string to indicate the start position
of readdir. The string is the last entry of the previous readdir reply.
This approach does not work for seeky readdir because we cannot
easily convert the new position to a string. For seeky readdir, the
mds needs to return dentries from the beginning, and the client keeps
retrying if the reply does not contain the dentry it wants.

In the current version of ceph, the mds sorts CDentry in its cache in
hash order. The client also uses the dentry hash to compose the dir
position. For seeky readdir, if the client passes the hash part of the
dir position to the mds, the mds can avoid replying with useless
dentries.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00
Yan, Zheng
2827528da0 ceph: close stopped mds' session
If a mds has stopped, close its session and clean up its session
requests/caps. The process is similar to handling SESSION_CLOSE
initiated by mds.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00
Yan, Zheng
0a07fc8cd0 ceph: fix potential use-after-free
__unregister_session() frees the session if it drops the last
reference. We should grab an extra reference if we want to use the
session after __unregister_session().

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00
Yan, Zheng
76201b6354 ceph: allow connecting to mds whose rank >= mdsmap::m_max_mds
mdsmap::m_max_mds is the expected count of active mds. It is not the
max rank of active mds. A user can decrease mdsmap::m_max_mds, but doing
so does not stop any mds whose rank >= mdsmap::m_max_mds.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:20 +02:00
Yan, Zheng
8242c9f35a ceph: fix wrong check in ceph_renew_caps()
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:19 +02:00
Elena Reshetova
0e1a5ee657 libceph: convert ceph_pagelist.refcnt from atomic_t to refcount_t
The refcount_t type and its corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.
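
Why a dedicated counter type helps can be shown with a single-threaded,
non-atomic model. The demo_* names are hypothetical and this is not the
kernel implementation; it only illustrates a counter that sticks at its
maximum instead of wrapping to zero, so an overflow cannot trigger a
premature free.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

struct demo_refcount { unsigned int refs; };

/* Increment, but stay at UINT_MAX rather than wrapping around to 0. */
static inline void demo_refcount_inc(struct demo_refcount *r)
{
	if (r->refs != UINT_MAX)
		r->refs++;
}

/* Returns true when the last reference is dropped; saturated never frees. */
static inline bool demo_refcount_dec_and_test(struct demo_refcount *r)
{
	if (r->refs == UINT_MAX)
		return false;
	assert(r->refs > 0);
	return --r->refs == 0;
}

/* Usage: an overflowing counter saturates and the object is leaked, not freed. */
static inline void demo_refcount_selfcheck(void)
{
	struct demo_refcount big = { UINT_MAX - 1 };
	struct demo_refcount one = { 1 };

	demo_refcount_inc(&big);
	demo_refcount_inc(&big);
	assert(big.refs == UINT_MAX);
	assert(!demo_refcount_dec_and_test(&big));
	assert(demo_refcount_dec_and_test(&one));
}
```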

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:19 +02:00
Elena Reshetova
805692d0e0 ceph: convert ceph_cap_snap.nref from atomic_t to refcount_t
The refcount_t type and its corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Elena Reshetova
3997c01d26 ceph: convert ceph_mds_session.s_ref from atomic_t to refcount_t
The refcount_t type and its corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Ilya Dryomov
74da4a0f57 libceph, ceph: always advertise all supported features
No reason to hide CephFS-specific features in the rbd case.  Recent
feature bits mix RADOS and CephFS-specific stuff together anyway.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-05-04 09:19:18 +02:00
Steve French
7db0a6efdc SMB3: Work around mount failure when using SMB3 dialect to Macs
Macs send the maximum buffer size in response on ioctl to validate
negotiate security information, which causes us to fail the mount
as the response buffer is larger than the expected response.

Changed ioctl response processing to allow for padding of validate
negotiate ioctl response and limit the maximum response size to
maximum buffer size.

Signed-off-by: Steve French <steve.french@primarydata.com>
CC: Stable <stable@vger.kernel.org>
2017-05-03 21:23:48 -05:00
Yunlei He
e9cdd30770 f2fs: fix a mount fail for wrong next_scan_nid
-write_checkpoint
   -do_checkpoint
      -next_free_nid    <--- something wrong with next free nid

-f2fs_fill_super
   -build_node_manager
      -build_free_nids
          -get_current_nat_page
             -__get_meta_page   <--- attempt to access beyond end of device

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 19:00:30 -07:00
Linus Torvalds
dd23f273d9 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - a few misc things

 - most of MM

 - KASAN updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
  kasan: separate report parts by empty lines
  kasan: improve double-free report format
  kasan: print page description after stacks
  kasan: improve slab object description
  kasan: change report header
  kasan: simplify address description logic
  kasan: change allocation and freeing stack traces headers
  kasan: unify report headers
  kasan: introduce helper functions for determining bug type
  mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page
  mm: hwpoison: call shake_page() unconditionally
  mm/swapfile.c: fix swap space leak in error path of swap_free_entries()
  mm/gup.c: fix access_ok() argument type
  mm/truncate: avoid pointless cleancache_invalidate_inode() calls.
  mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty
  fs/block_dev: always invalidate cleancache in invalidate_bdev()
  fs: fix data invalidation in the cleancache during direct IO
  zram: reduce load operation in page_same_filled
  zram: use zram_free_page instead of open-coded
  zram: introduce zram data accessor
  ...
2017-05-03 17:55:59 -07:00
David Disseldorp
d8a6e505d6 cifs: fix CIFS_IOC_GET_MNT_INFO oops
An open directory may have a NULL private_data pointer prior to readdir.

Fixes: 0de1f4c6f6 ("Add way to query server fs info for smb3")
Cc: stable@vger.kernel.org
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-03 19:32:35 -05:00
Björn Jacke
b704e70b7c CIFS: fix mapping of SFM_SPACE and SFM_PERIOD
- trailing space maps to 0xF028
- trailing period maps to 0xF029

This fix corrects the mapping of file names which have a trailing character
that would otherwise be illegal (period or space) but is allowed by POSIX.
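
The two mappings listed above can be sketched as a per-character
helper. The code points 0xF028 and 0xF029 come from this message; the
function name and its signature are hypothetical, and the sketch treats
the name as plain bytes for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SFM_SPACE	0xF028u	/* trailing space */
#define DEMO_SFM_PERIOD	0xF029u	/* trailing period */

/* Remap only the final character of the name; all others pass through. */
static inline uint16_t demo_map_char(const char *name, size_t len, size_t i)
{
	unsigned char c;

	assert(i < len);
	c = (unsigned char)name[i];
	if (i == len - 1) {
		if (c == ' ')
			return DEMO_SFM_SPACE;
		if (c == '.')
			return DEMO_SFM_PERIOD;
	}
	return c;
}

/* Usage: "a." remaps its last character, "a b" keeps its inner space. */
static inline void demo_sfm_selfcheck(void)
{
	assert(demo_map_char("a.", 2, 1) == DEMO_SFM_PERIOD);
	assert(demo_map_char("a ", 2, 1) == DEMO_SFM_SPACE);
	assert(demo_map_char("a b", 3, 1) == ' ');
}
```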

Signed-off-by: Bjoern Jacke <bjacke@samba.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-03 19:31:33 -05:00
Andrey Ryabinin
a5f6a6a9c7 fs/block_dev: always invalidate cleancache in invalidate_bdev()
invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0
which doesn't make any sense.

Make sure that invalidate_bdev() always calls cleancache_invalidate_inode()
regardless of mapping->nrpages value.

Fixes: c515e1fd36 ("mm/fs: add hooks to support cleancache")
Link: http://lkml.kernel.org/r/20170424164135.22350-3-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:12 -07:00
Andrey Ryabinin
55635ba76e fs: fix data invalidation in the cleancache during direct IO
Patch series "Properly invalidate data in the cleancache", v2.

We've noticed that after direct IO write, buffered read sometimes gets
stale data which is coming from the cleancache.  The reason for this is
that some direct write hooks call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero, so we may not invalidate
data in the cleancache.

Another odd thing is that we check only for ->nrpages and don't check
for ->nrexceptional, but invalidate_inode_pages2[_range] also
invalidates exceptional entries as well.  So we invalidate exceptional
entries only if ->nrpages != 0? This doesn't feel right.

 - Patch 1 fixes direct IO writes by removing ->nrpages check.
 - Patch 2 fixes similar case in invalidate_bdev().
     Note: I only fixed conditional cleancache_invalidate_inode() here.
       Do we also need to add ->nrexceptional check into invalidate_bdev()?

 - Patches 3-4: some optimizations.

This patch (of 4):

Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero.  This can't be right,
because invalidate_inode_pages2[_range]() also invalidates data in the
cleancache via cleancache_invalidate_inode() call.  So if page cache is
empty but there is some data in the cleancache, buffered read after
direct IO write would get stale data from the cleancache.

Also it doesn't feel right to check only for ->nrpages because
invalidate_inode_pages2[_range] invalidates exceptional entries as well.

Fix this by calling invalidate_inode_pages2[_range]() regardless of
nrpages state.
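
The stale-read scenario can be reproduced with a toy model. Everything
here is hypothetical (a boolean stands in for "the cleancache holds data
for this inode"); the sketch only contrasts the conditional call with
the unconditional one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mapping: page-cache page count plus a cleancache marker. */
struct demo_mapping {
	unsigned long nrpages;
	bool cleancache_has_data;
};

/* Stand-in for invalidation: drops page cache and cleancache contents. */
static inline void demo_invalidate(struct demo_mapping *m)
{
	m->nrpages = 0;
	m->cleancache_has_data = false;
}

/* Old behaviour: skipped entirely when the page cache is empty. */
static inline void demo_dio_write_old(struct demo_mapping *m)
{
	if (m->nrpages)
		demo_invalidate(m);
}

/* Fixed behaviour: invalidate regardless of nrpages. */
static inline void demo_dio_write_fixed(struct demo_mapping *m)
{
	demo_invalidate(m);
}

/* Usage: with an empty page cache, only the fixed path clears the cleancache. */
static inline void demo_cleancache_selfcheck(void)
{
	struct demo_mapping m = { 0, true };

	demo_dio_write_old(&m);
	assert(m.cleancache_has_data);		/* stale data survives */
	demo_dio_write_fixed(&m);
	assert(!m.cleancache_has_data);
}
```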

Note: nfs, cifs and 9p don't need a similar fix because they never call
cleancache_get_page() (neither directly nor via mpage_readpage[s]()), so
they are not affected by this bug.

Fixes: c515e1fd36 ("mm/fs: add hooks to support cleancache")
Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:12 -07:00
Michal Hocko
eb52da3f48 jbd2: make the whole kjournald2 kthread NOFS safe
kjournald2 is central to the transaction commit processing.  As such any
potential allocation from this kernel thread has to be GFP_NOFS.  Make
sure to mark the whole kernel thread GFP_NOFS by the memalloc_nofs_save.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170306131408.9828-8-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:09 -07:00
Michal Hocko
81378da64d jbd2: mark the transaction context with the scope GFP_NOFS context
Now that we have the memalloc_nofs_{save,restore} API we can mark the whole
transaction context as implicitly GFP_NOFS.  All allocations will
automatically inherit GFP_NOFS this way.  This means that we do not have
to mark any of those requests with GFP_NOFS and moreover all the
ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded
GFP_KERNEL allocations deep inside the vmalloc will be NOFS now.

[akpm@linux-foundation.org: tweak comments]
Link: http://lkml.kernel.org/r/20170306131408.9828-7-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:09 -07:00
Michal Hocko
9ba1fb2c60 xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
kmem_zalloc_large and _xfs_buf_map_pages use the memalloc_noio_{save,restore}
API to prevent reclaim recursion into the fs because vmalloc can
invoke unconditional GFP_KERNEL allocations and these functions might be
called from the NOFS contexts.  The memalloc_noio_save will enforce
GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
unnecessary.  Let's use memalloc_nofs_{save,restore} instead as it
should provide exactly what we need here - implicit GFP_NOFS context.

Link: http://lkml.kernel.org/r/20170306131408.9828-6-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:09 -07:00
Michal Hocko
7dea19f9ee mm: introduce memalloc_nofs_{save,restore} API
GFP_NOFS context is used for the following 5 reasons currently:

 - to prevent deadlocks when the lock held by the allocation
   context would be needed during the memory reclaim

 - to prevent stack overflows during the reclaim because the
   allocation is performed from a deep context already

 - to prevent lockups when the allocation context depends on other
   reclaimers to make forward progress indirectly

 - just in case because this would be safe from the fs POV

 - silence lockdep false positives

Unfortunately overuse of this allocation context brings some problems to
the MM.  Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily.  We would like to get rid of
those as much as possible.  One way to do that is to use the flag in
scopes rather than isolated cases.  Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.

Not only is this easier to understand and maintain because there are
far fewer problematic contexts than specific allocation requests, this
also helps code paths where FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.

Introduce memalloc_nofs_{save,restore} API to control the scope of
GFP_NOFS allocation context.  This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO.  The PF_MEMALLOC_NOFS flag already exists and it is
just an alias for PF_FSTRANS which has been xfs specific until recently.
There are no PF_FSTRANS users anymore so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO.  memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.  Xfs code paths preserve
their semantic.  kmem_flags_convert() doesn't need to evaluate the flag
anymore.
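
The scope semantics described above can be modelled in user space as
follows. All demo_* names and bit values are hypothetical, and the task
is passed explicitly instead of using a per-task "current" pointer; the
sketch shows the save/restore nesting and the implicit masking of the
allocation flags, not the kernel source.

```c
#include <assert.h>

/* Illustrative bit values, not the kernel's. */
#define DEMO_PF_MEMALLOC_NOIO	(1u << 0)
#define DEMO_PF_MEMALLOC_NOFS	(1u << 1)
#define DEMO_GFP_IO		(1u << 0)
#define DEMO_GFP_FS		(1u << 1)

struct demo_task { unsigned int flags; };

/* Enter a NOFS scope; the return value remembers the previous state. */
static inline unsigned int demo_nofs_save(struct demo_task *t)
{
	unsigned int old = t->flags & DEMO_PF_MEMALLOC_NOFS;

	t->flags |= DEMO_PF_MEMALLOC_NOFS;
	return old;
}

/* Leave the scope by putting the previous state back. */
static inline void demo_nofs_restore(struct demo_task *t, unsigned int old)
{
	t->flags = (t->flags & ~DEMO_PF_MEMALLOC_NOFS) | old;
}

/* Mask the request according to the scopes the task is currently in. */
static inline unsigned int demo_gfp_context(const struct demo_task *t,
					    unsigned int gfp)
{
	if (t->flags & DEMO_PF_MEMALLOC_NOIO)
		gfp &= ~(DEMO_GFP_IO | DEMO_GFP_FS);	/* NOIO is the stricter scope */
	else if (t->flags & DEMO_PF_MEMALLOC_NOFS)
		gfp &= ~DEMO_GFP_FS;
	return gfp;
}

/* Usage: nested scopes stay NOFS until the outermost restore. */
static inline void demo_nofs_selfcheck(void)
{
	struct demo_task t = { 0 };
	unsigned int outer = demo_nofs_save(&t);
	unsigned int inner = demo_nofs_save(&t);

	demo_nofs_restore(&t, inner);
	assert(demo_gfp_context(&t, DEMO_GFP_IO | DEMO_GFP_FS) == DEMO_GFP_IO);
	demo_nofs_restore(&t, outer);
	assert(demo_gfp_context(&t, DEMO_GFP_IO | DEMO_GFP_FS) ==
	       (DEMO_GFP_IO | DEMO_GFP_FS));
}
```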

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp.  ~__GFP_FS)
usage as much as possible and only use properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

[akpm@linux-foundation.org: fix comment typo, reflow comment]
Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:09 -07:00
Michal Hocko
9070733b4e xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
some time ago.  We would like to make this concept more generic and use
it for other filesystems as well.  Let's start by giving the flag a more
generic name PF_MEMALLOC_NOFS which is in line with an existing
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts.  Replace all PF_FSTRANS usage from the xfs code in the first
step before we introduce a full API for it as xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Link: http://lkml.kernel.org/r/20170306131408.9828-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:09 -07:00
Shaohua Li
cf8496ea80 proc: show MADV_FREE pages info in smaps
Show MADV_FREE pages info of each vma in smaps.  The interface is for
diagnostic or monitoring purposes; userspace could use it to understand
what happens in the application.  Since userspace could dirty MADV_FREE
pages without notice from kernel, this interface is the only place we
can get accurate accounting info about MADV_FREE pages.

[mhocko@kernel.org: update Documentation/filesystems/proc.txt]
Link: http://lkml.kernel.org/r/89efde633559de1ec07444f2ef0f4963a97a2ce8.1487965799.git.shli@fb.com
Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:08 -07:00
Geliang Tang
d47736fafe fs/ocfs2/cluster: use offset_in_page() macro
Use offset_in_page() macro instead of open-coding.
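
For readers unfamiliar with the helper, a user-space sketch of what such
a macro computes. The 4096-byte page size and the DEMO_* names are
assumptions for illustration; the kernel derives these values from the
architecture.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE	4096ul			/* assumed page size */
#define DEMO_PAGE_MASK	(~(DEMO_PAGE_SIZE - 1))

/* Byte offset of an address within its page. */
#define demo_offset_in_page(p) \
	((unsigned long)(uintptr_t)(p) & ~DEMO_PAGE_MASK)

/* Usage: equivalent to the open-coded "addr & (PAGE_SIZE - 1)". */
static inline void demo_offset_selfcheck(void)
{
	uintptr_t addr = 0x12345;

	assert(demo_offset_in_page(addr) == 0x345);
	assert(demo_offset_in_page(addr) == (addr & (DEMO_PAGE_SIZE - 1)));
	assert(demo_offset_in_page((uintptr_t)0x2000) == 0);
}
```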

Link: http://lkml.kernel.org/r/4dbc77ccaaed98b183cf4dba58a4fa325fd65048.1492758503.git.geliangtang@gmail.com
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:07 -07:00
Junxiao Bi
33496c3c3d ocfs2: o2hb: revert hb threshold to keep compatible
Configfs is the interface for ocfs2-tools to pass configuration to the kernel and
$configfs_dir/cluster/$clustername/heartbeat/dead_threshold is the one
used to configure heartbeat dead threshold.  Kernel has a default value
of it but user can set O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb
to override it.

Commit 45b997737a ("ocfs2/cluster: use per-attribute show and store
methods") changed heartbeat dead threshold name while ocfs2-tools did
not, so ocfs2-tools won't set this configurable and the default value is
always used.  So revert it.

Fixes: 45b997737a ("ocfs2/cluster: use per-attribute show and store methods")
Link: http://lkml.kernel.org/r/1490665245-15374-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:07 -07:00
Geliang Tang
667b8a37f3 fs/ocfs2/cluster: use setup_timer
Use setup_timer() instead of init_timer() to simplify the code.

Link: http://lkml.kernel.org/r/5e75bf07beb91e092d5aa36c36769949a480456a.1489060564.git.geliangtang@gmail.com
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:07 -07:00
Chao Yu
a72d4b97bb f2fs: relocate inode_{,un}lock in F2FS_IOC_SETFLAGS
This patch expands the region covered by inode->i_rwsem so that the flags
are set atomically.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 14:30:19 -07:00
Jan Kara
3adc5fcb7e f2fs: Make flush bios explicitely sync
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions.  generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions.

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.

Fixes: b685d3d65a
Cc: stable@vger.kernel.org # 4.9+
CC: Jaegeuk Kim <jaegeuk@kernel.org>
CC: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 14:30:18 -07:00
Darrick J. Wong
fe0be23e68 xfs: reserve enough blocks to handle btree splits when remapping
In xfs_reflink_end_cow, we erroneously reserve only enough blocks to
handle adding 1 extent.  This is problematic if we fragment free space,
have to do CoW, and then have to perform multiple bmap btree expansions.
Furthermore, the BUI recovery routine doesn't reserve /any/ blocks to
handle btree splits, so log recovery fails after our first error causes
the filesystem to go down.

Therefore, refactor the transaction block reservation macros until we
have a macro that works for our deferred (re)mapping activities, and fix
both problems by using that macro.

With 1k blocks we can hit this fairly often in g/187 if the scratch fs
is big enough.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-05-03 13:21:40 -07:00
Linus Torvalds
a3719f34fd Merge branch 'generic' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota, reiserfs, udf and ext2 updates from Jan Kara:
 "The branch contains changes to quota code so that it does not modify
  persistent flags in inode->i_flags (it was the only place in kernel
  doing that) and handle it inside filesystem's quotaon/off handlers
  instead.

  The branch also contains two UDF cleanups, a couple of reiserfs fixes
  and one fix for ext2 quota locking"

* 'generic' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext4: Improve comments in ext4_quota_{on|off}()
  udf: use kmap_atomic for memcpy copying
  udf: use octal for permissions
  quota: Remove dquot_quotactl_ops
  reiserfs: Remove i_attrs_to_sd_attrs()
  reiserfs: Remove useless setting of i_flags
  jfs: Remove jfs_get_inode_flags()
  ext2: Remove ext2_get_inode_flags()
  ext4: Remove ext4_get_inode_flags()
  quota: Stop setting IMMUTABLE and NOATIME flags on quota files
  jfs: Set flags on quota files directly
  ext2: Set flags on quota files directly
  reiserfs: Set flags on quota files directly
  ext4: Set flags on quota files directly
  reiserfs: Protect dquot_writeback_dquots() by s_umount semaphore
  reiserfs: Make cancel_old_flush() reliable
  ext2: Call dquot_writeback_dquots() with s_umount held
  reiserfs: avoid a -Wmaybe-uninitialized warning
2017-05-03 11:35:47 -07:00
Linus Torvalds
5133cd7518 Merge branch 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
 "The branch contains mainly a rework of fsnotify infrastructure fixing
  a shortcoming where we waited for a response to fanotify permission
  events with the SRCU read lock held, so the kernel stalled when the
  process consuming events was slow to respond.

  It also contains several cleanups of unnecessary indirections in
  fsnotify framework and a bugfix from Amir fixing leakage of kernel
  internal errno to userspace"

* 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (37 commits)
  fanotify: don't expose EOPENSTALE to userspace
  fsnotify: remove a stray unlock
  fsnotify: Move ->free_mark callback to fsnotify_ops
  fsnotify: Add group pointer in fsnotify_init_mark()
  fsnotify: Drop inode_mark.c
  fsnotify: Remove fsnotify_find_{inode|vfsmount}_mark()
  fsnotify: Remove fsnotify_detach_group_marks()
  fsnotify: Rename fsnotify_clear_marks_by_group_flags()
  fsnotify: Inline fsnotify_clear_{inode|vfsmount}_mark_group()
  fsnotify: Remove fsnotify_recalc_{inode|vfsmount}_mask()
  fsnotify: Remove fsnotify_set_mark_{,ignored_}mask_locked()
  fanotify: Release SRCU lock when waiting for userspace response
  fsnotify: Pass fsnotify_iter_info into handle_event handler
  fsnotify: Provide framework for dropping SRCU lock in ->handle_event
  fsnotify: Remove special handling of mark destruction on group shutdown
  fsnotify: Detach mark from object list when last reference is dropped
  fsnotify: Move queueing of mark for destruction into fsnotify_put_mark()
  inotify: Do not drop mark reference under idr_lock
  fsnotify: Free fsnotify_mark_connector when there is no mark attached
  fsnotify: Lock object list with connector lock
  ...
2017-05-03 11:05:15 -07:00
Jaegeuk Kim
5b0ef73c9d f2fs: show available_nids in f2fs/status
This patch adds an entry in f2fs/status to show # of available nids.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:57 -07:00
Jaegeuk Kim
1c0f4bf5c3 f2fs: flush dirty nats periodically
This patch flushes dirty nats in order to acquire available nids by writing
a checkpoint. Otherwise, we may have no chance to get freed nids.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:56 -07:00
Chao Yu
1f43e2ad7b f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard
Introduce CP_TRIMMED_FLAG to indicate that all invalid blocks were trimmed
before umount. When we mount an image that contains the flag, we don't
record invalid blocks as undiscarded ones, so when fstrim is triggered
we can avoid issuing redundant discard commands.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:56 -07:00
Chao Yu
c473f1a965 f2fs: allow cpc->reason to indicate more than one reason
Change to use different bits of cpc->reason to indicate different status,
so cpc->reason can indicate more than one reason.
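
A minimal sketch of the idea, not the actual f2fs code: the flag names, their values and the struct below are hypothetical, chosen only to show how one integer can carry several reasons at once when each reason owns a separate bit.

```c
#include <stdbool.h>

/* Hypothetical reason flags: each reason gets its own bit. */
enum cp_reason_sketch {
	SKETCH_CP_UMOUNT  = 1u << 0,
	SKETCH_CP_SYNC    = 1u << 1,
	SKETCH_CP_DISCARD = 1u << 2,
};

/* Hypothetical stand-in for the control structure carrying the reason. */
struct cp_control_sketch {
	unsigned int reason;
};

/* True if the given reason bit is set, regardless of any other bits. */
static bool cp_has_reason(const struct cp_control_sketch *cpc,
			  unsigned int flag)
{
	return (cpc->reason & flag) != 0;
}
```

With this layout a caller can combine reasons with bitwise OR (for example umount plus discard) and test each one independently, which a plain enumerated value cannot express.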

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:55 -07:00
Hou Pengyang
279d6df20c f2fs: release cp and dnode lock before IPU
We don't need to rewrite the page under cp_rwsem and dnode locks.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:54 -07:00
Fred Isaman
c296cfe26b pNFS: Fix NULL dereference in pnfs_generic_alloc_ds_commits
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-03 12:29:41 -04:00
Linus Torvalds
0302e28dee Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "Highlights:

  IMA:
   - provide ">" and "<" operators for fowner/uid/euid rules

  KEYS:
   - add a system blacklist keyring

   - add KEYCTL_RESTRICT_KEYRING, exposes keyring link restriction
     functionality to userland via keyctl()

  LSM:
   - harden LSM API with __ro_after_init

   - add prlimit security hook, implement for SELinux

   - revive security_task_alloc hook

  TPM:
   - implement contextual TPM command 'spaces'"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (98 commits)
  tpm: Fix reference count to main device
  tpm_tis: convert to using locality callbacks
  tpm: fix handling of the TPM 2.0 event logs
  tpm_crb: remove a cruft constant
  keys: select CONFIG_CRYPTO when selecting DH / KDF
  apparmor: Make path_max parameter readonly
  apparmor: fix parameters so that the permission test is bypassed at boot
  apparmor: fix invalid reference to index variable of iterator line 836
  apparmor: use SHASH_DESC_ON_STACK
  security/apparmor/lsm.c: set debug messages
  apparmor: fix boolreturn.cocci warnings
  Smack: Use GFP_KERNEL for smk_netlbl_mls().
  smack: fix double free in smack_parse_opts_str()
  KEYS: add SP800-56A KDF support for DH
  KEYS: Keyring asymmetric key restrict method with chaining
  KEYS: Restrict asymmetric key linkage using a specific keychain
  KEYS: Add a lookup_restriction function for the asymmetric key type
  KEYS: Add KEYCTL_RESTRICT_KEYRING
  KEYS: Consistent ordering for __key_link_begin and restrict check
  KEYS: Add an optional lookup_restriction hook to key_type
  ...
2017-05-03 08:50:52 -07:00
Josef Bacik
563f40019d fs: don't set *REFERENCED on single use objects
By default we set DCACHE_REFERENCED and I_REFERENCED on any dentry or
inode we create.  This is problematic as this means that it takes two
trips through the LRU for any of these objects to be reclaimed,
regardless of their actual lifetime.  With enough pressure from these
caches we can easily evict our working set from page cache with single
use objects.  So instead only set *REFERENCED if we've already been
added to the LRU list.  This means that we've been touched since the
first time we were accessed, and so more likely to need to hang out in
cache.

To illustrate this issue I wrote the following scripts

https://github.com/josefbacik/debug-scripts/tree/master/cache-pressure

on my test box.  It is a single socket 4 core CPU with 16gib of RAM and
I tested on an Intel 2tib NVME drive.  The cache-pressure.sh script
creates a new file system and creates 2 6.5gib files in order to take up
13gib of the 16gib of ram with pagecache.  Then it runs a test program
that reads these 2 files in a loop, and keeps track of how often it has
to read bytes for each loop.  On an ideal system with no pressure we
should have to read 0 bytes indefinitely.  The second thing this script
does is start a fs_mark job that creates a ton of 0 length files,
putting pressure on the system with slab only allocations.  On exit the
script prints out how many bytes were read by the read-file program.
The results are as follows

Without patch:
/mnt/btrfs-test/reads/file1: total read during loops 27262988288
/mnt/btrfs-test/reads/file2: total read during loops 27262976000

With patch:
/mnt/btrfs-test/reads/file2: total read during loops 18640457728
/mnt/btrfs-test/reads/file1: total read during loops 9565376512

This patch results in a roughly 50% reduction (about 48% by the totals
above) in the amount of pages evicted from our working set.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-03 11:47:05 -04:00
Rabin Vincent
3998e6b87d CIFS: fix oplock break deadlocks
When the final cifsFileInfo_put() is called from cifsiod and an oplock
break work is queued, lockdep complains loudly:

 =============================================
 [ INFO: possible recursive locking detected ]
 4.11.0+ #21 Not tainted
 ---------------------------------------------
 kworker/0:2/78 is trying to acquire lock:
  ("cifsiod"){++++.+}, at: flush_work+0x215/0x350

 but task is already holding lock:
  ("cifsiod"){++++.+}, at: process_one_work+0x255/0x8e0

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock("cifsiod");
   lock("cifsiod");

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 2 locks held by kworker/0:2/78:
  #0:  ("cifsiod"){++++.+}, at: process_one_work+0x255/0x8e0
  #1:  ((&wdata->work)){+.+...}, at: process_one_work+0x255/0x8e0

 stack backtrace:
 CPU: 0 PID: 78 Comm: kworker/0:2 Not tainted 4.11.0+ #21
 Workqueue: cifsiod cifs_writev_complete
 Call Trace:
  dump_stack+0x85/0xc2
  __lock_acquire+0x17dd/0x2260
  ? match_held_lock+0x20/0x2b0
  ? trace_hardirqs_off_caller+0x86/0x130
  ? mark_lock+0xa6/0x920
  lock_acquire+0xcc/0x260
  ? lock_acquire+0xcc/0x260
  ? flush_work+0x215/0x350
  flush_work+0x236/0x350
  ? flush_work+0x215/0x350
  ? destroy_worker+0x170/0x170
  __cancel_work_timer+0x17d/0x210
  ? ___preempt_schedule+0x16/0x18
  cancel_work_sync+0x10/0x20
  cifsFileInfo_put+0x338/0x7f0
  cifs_writedata_release+0x2a/0x40
  ? cifs_writedata_release+0x2a/0x40
  cifs_writev_complete+0x29d/0x850
  ? preempt_count_sub+0x18/0xd0
  process_one_work+0x304/0x8e0
  worker_thread+0x9b/0x6a0
  kthread+0x1b2/0x200
  ? process_one_work+0x8e0/0x8e0
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x31/0x40

This is a real warning.  Since the oplock is queued on the same
workqueue this can deadlock if there is only one worker thread active
for the workqueue (which will be the case during memory pressure when
the rescuer thread is handling it).

Furthermore, there is at least one other kind of hang possible due to
the oplock break handling if there is only one worker.  (This can be
reproduced without introducing memory pressure by passing 1 for
the max_active parameter of cifsiod.) cifs_oplock_break() can wait
indefinitely in the filemap_fdatawait() while the cifs_writev_complete()
work is blocked:

 sysrq: SysRq : Show Blocked State
   task                        PC stack   pid father
 kworker/0:1     D    0    16      2 0x00000000
 Workqueue: cifsiod cifs_oplock_break
 Call Trace:
  __schedule+0x562/0xf40
  ? mark_held_locks+0x4a/0xb0
  schedule+0x57/0xe0
  io_schedule+0x21/0x50
  wait_on_page_bit+0x143/0x190
  ? add_to_page_cache_lru+0x150/0x150
  __filemap_fdatawait_range+0x134/0x190
  ? do_writepages+0x51/0x70
  filemap_fdatawait_range+0x14/0x30
  filemap_fdatawait+0x3b/0x40
  cifs_oplock_break+0x651/0x710
  ? preempt_count_sub+0x18/0xd0
  process_one_work+0x304/0x8e0
  worker_thread+0x9b/0x6a0
  kthread+0x1b2/0x200
  ? process_one_work+0x8e0/0x8e0
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x31/0x40
 dd              D    0   683    171 0x00000000
 Call Trace:
  __schedule+0x562/0xf40
  ? mark_held_locks+0x29/0xb0
  schedule+0x57/0xe0
  io_schedule+0x21/0x50
  wait_on_page_bit+0x143/0x190
  ? add_to_page_cache_lru+0x150/0x150
  __filemap_fdatawait_range+0x134/0x190
  ? do_writepages+0x51/0x70
  filemap_fdatawait_range+0x14/0x30
  filemap_fdatawait+0x3b/0x40
  filemap_write_and_wait+0x4e/0x70
  cifs_flush+0x6a/0xb0
  filp_close+0x52/0xa0
  __close_fd+0xdc/0x150
  SyS_close+0x33/0x60
  entry_SYSCALL_64_fastpath+0x1f/0xbe

 Showing all locks held in the system:
 2 locks held by kworker/0:1/16:
  #0:  ("cifsiod"){.+.+.+}, at: process_one_work+0x255/0x8e0
  #1:  ((&cfile->oplock_break)){+.+.+.}, at: process_one_work+0x255/0x8e0

 Showing busy workqueues and worker pools:
 workqueue cifsiod: flags=0xc
   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
     in-flight: 16:cifs_oplock_break
     delayed: cifs_writev_complete, cifs_echo_request
 pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 750 3

Fix these problems by creating a new workqueue (with a rescuer) for
the oplock break work.

Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
2017-05-03 10:10:10 -05:00
David Disseldorp
6026685de3 cifs: fix CIFS_ENUMERATE_SNAPSHOTS oops
As with 618763958b, an open directory may have a NULL private_data
pointer prior to readdir. CIFS_ENUMERATE_SNAPSHOTS must check for this
before dereference.

Fixes: 834170c859 ("Enable previous version support")
Signed-off-by: David Disseldorp <ddiss@suse.de>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-03 09:59:20 -05:00
David Disseldorp
0e5c795592 cifs: fix leak in FSCTL_ENUM_SNAPS response handling
The server may respond with success, and an output buffer less than
sizeof(struct smb_snapshot_array) in length. Do not leak the output
buffer in this case.

Fixes: 834170c859 ("Enable previous version support")
Signed-off-by: David Disseldorp <ddiss@suse.de>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-03 09:54:12 -05:00
Chao Yu
9a744b92da f2fs: shrink size of struct discard_cmd
In order to shrink size of struct discard_cmd, change variable type of
@state in struct discard_cmd from int to unsigned char.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:51 -07:00
Chao Yu
ec9895add2 f2fs: don't hold cmd_lock during waiting discard command
Previously, we waited for the end io of a discard command while holding
cmd_lock, which could lead to long latency and hurt concurrency.

So, in this patch, we take a reference on the discard entry to prevent
it from being released by another thread, so that we can avoid holding
the global cmd_lock while waiting for the discard to finish.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:50 -07:00
Jaegeuk Kim
4d97807813 f2fs: nullify fio->encrypted_page for each writes
This makes sure each write request has a nullified encrypted_page pointer.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:49 -07:00
Jin Qian
b9dd46188e f2fs: sanity check segment count
F2FS uses 4 bytes to represent a block address. As a result, the supported
disk size is 16 TB, which equals 16 * 1024 * 1024 / 2 segments.
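
The arithmetic behind that limit can be checked with a short sketch. It assumes 4 KiB blocks and 2 MiB segments, the usual F2FS geometry, which the commit message does not state explicitly; the macro and function names are made up for illustration.

```c
#include <stdint.h>

/* Assumptions (not stated in the commit message): 4 KiB blocks and
 * 2 MiB segments. */
#define SKETCH_BLOCK_SIZE    4096ULL
#define SKETCH_SEGMENT_SIZE  (2ULL * 1024 * 1024)

/* Largest addressable size in bytes with a 32-bit block address. */
static uint64_t sketch_max_bytes(void)
{
	return (1ULL << 32) * SKETCH_BLOCK_SIZE;
}

/* Number of segments that fit in that size. */
static uint64_t sketch_max_segments(void)
{
	return sketch_max_bytes() / SKETCH_SEGMENT_SIZE;
}
```

Under these assumptions, 2^32 blocks of 4 KiB give 2^44 bytes (16 TiB), and dividing by 2 MiB yields 2^23 segments, the same value as 16 * 1024 * 1024 / 2.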

Signed-off-by: Jin Qian <jinqian@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:48 -07:00
Jaegeuk Kim
a817737e87 f2fs: introduce valid_ipu_blkaddr to clean up
This patch introduces valid_ipu_blkaddr to clean up checking block address for
inplace-update.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:48 -07:00
Hou Pengyang
e959c8f543 f2fs: lookup extent cache first under IPU scenario
If a page is cold, NOT atomically written and need_ipu now, there is
a high probability that IPU should be adopted. For IPU, we try to
check the extent tree to get the block index first, instead of reading
the dnode page, which may lead to a useless dnode IO, since there is no
need to update the dnode index for IPU.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:47 -07:00
Hou Pengyang
7eab0c0df8 f2fs: reconstruct code to write a data page
This patch introduces encrypt_one_page which encrypts one data page before
submit_bio, and changes the use of need_inplace_update.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:46 -07:00
Chao Yu
63a94fa1d7 f2fs: introduce __wait_discard_cmd
Just cleanup, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:45 -07:00
Chao Yu
bd5b07383a f2fs: introduce __issue_discard_cmd
Just cleanup, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:44 -07:00
Linus Torvalds
89c9fea3c8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  tty: fix comment for __tty_alloc_driver()
  init/main: properly align the multi-line comment
  init/main: Fix double "the" in comment
  Fix dead URLs to ftp.kernel.org
  drivers: Clean up duplicated email address
  treewide: Fix typo in xml/driver-api/basics.xml
  tools/testing/selftests/powerpc: remove redundant CFLAGS in Makefile: "-Wall -O2 -Wall" -> "-O2 -Wall"
  selftests/timers: Spelling s/privledges/privileges/
  HID: picoLCD: Spelling s/REPORT_WRTIE_MEMORY/REPORT_WRITE_MEMORY/
  net: phy: dp83848: Fix Typo
  UBI: Fix typos
  Documentation: ftrace.txt: Correct nice value of 120 priority
  net: fec: Fix typo in error msg and comment
  treewide: Fix typos in printk
2017-05-02 19:09:35 -07:00
Linus Torvalds
76f1948a79 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatch updates from Jiri Kosina:

 - a per-task consistency model is being added for architectures that
   support reliable stack dumping (extending this, currently rather
   trivial set, is currently in the works).

   This extends the nature of the types of patches that can be applied
   by live patching infrastructure. The code stems from the design
   proposal made [1] back in November 2014. It's a hybrid of SUSE's
   kGraft and RH's kpatch, combining advantages of both: it uses
   kGraft's per-task consistency and syscall barrier switching combined
   with kpatch's stack trace switching. There are also a number of
   fallback options which make it quite flexible.

   Most of the heavy lifting done by Josh Poimboeuf with help from
   Miroslav Benes and Petr Mladek

   [1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz

 - module load time patch optimization from Zhou Chengming

 - a few assorted small fixes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  livepatch: add missing printk newlines
  livepatch: Cancel transition a safe way for immediate patches
  livepatch: Reduce the time of finding module symbols
  livepatch: make klp_mutex proper part of API
  livepatch: allow removal of a disabled patch
  livepatch: add /proc/<pid>/patch_state
  livepatch: change to a per-task consistency model
  livepatch: store function sizes
  livepatch: use kstrtobool() in enabled_store()
  livepatch: move patching functions into patch.c
  livepatch: remove unnecessary object loaded check
  livepatch: separate enabled and patched states
  livepatch/s390: add TIF_PATCH_PENDING thread flag
  livepatch/s390: reorganize TIF thread flag bits
  livepatch/powerpc: add TIF_PATCH_PENDING thread flag
  livepatch/x86: add TIF_PATCH_PENDING thread flag
  livepatch: create temporary klp_update_patch_state() stub
  x86/entry: define _TIF_ALLWORK_MASK flags explicitly
  stacktrace/x86: add function for detecting reliable stack traces
2017-05-02 18:24:16 -07:00
Linus Torvalds
8d65b08deb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Here are some highlights from the 2065 networking commits that
  happened this development cycle:

   1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)

   2) Add a generic XDP driver, so that anyone can test XDP even if they
      lack a networking device whose driver has explicit XDP support
      (me).

   3) Sparc64 now has an eBPF JIT too (me)

   4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
      Starovoitov)

   5) Make netfilter network namespace teardown less expensive (Florian
      Westphal)

   6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)

   7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)

   8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)

   9) Multiqueue support in stmmac driver (Joao Pinto)

  10) Remove TCP timewait recycling, it never really could possibly work
      well in the real world and timestamp randomization really zaps any
      hint of usability this feature had (Soheil Hassas Yeganeh)

  11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
      Aleksandrov)

  12) Add socket busy poll support to epoll (Sridhar Samudrala)

  13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
      and several others)

  14) IPSEC hw offload infrastructure (Steffen Klassert)"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
  tipc: refactor function tipc_sk_recv_stream()
  tipc: refactor function tipc_sk_recvmsg()
  net: thunderx: Optimize page recycling for XDP
  net: thunderx: Support for XDP header adjustment
  net: thunderx: Add support for XDP_TX
  net: thunderx: Add support for XDP_DROP
  net: thunderx: Add basic XDP support
  net: thunderx: Cleanup receive buffer allocation
  net: thunderx: Optimize CQE_TX handling
  net: thunderx: Optimize RBDR descriptor handling
  net: thunderx: Support for page recycling
  ipx: call ipxitf_put() in ioctl error path
  net: sched: add helpers to handle extended actions
  qed*: Fix issues in the ptp filter config implementation.
  qede: Fix concurrency issue in PTP Tx path processing.
  stmmac: Add support for SIMATIC IOT2000 platform
  net: hns: fix ethtool_get_strings overflow in hns driver
  tcp: fix wraparound issue in tcp_lp
  bpf, arm64: fix jit branch offset related to ldimm64
  bpf, arm64: implement jiting of BPF_XADD
  ...
2017-05-02 16:40:27 -07:00
Steve French
26c9cb668c Set unicode flag on cifs echo request to avoid Mac error
Mac requires the unicode flag to be set for cifs, even for the smb
echo request (which doesn't have strings).

Without this, Mac rejects the periodic echo requests (when mounting
with cifs) that we use to check if the server is down.

Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
2017-05-02 14:57:34 -05:00
Pavel Shilovsky
c610c4b619 CIFS: Add asynchronous write support through kernel AIO
This patch adds support to process write calls passed by io_submit()
asynchronously. It is based on the previously introduced async context,
which allows i/o responses to be processed in a separate thread and
returns to the caller immediately for asynchronous calls.

This improves the write performance of single-threaded applications
as the i/o queue depth increases.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-02 14:57:34 -05:00
Pavel Shilovsky
6685c5e2d1 CIFS: Add asynchronous read support through kernel AIO
This patch adds support to process read calls passed by io_submit()
asynchronously. It is based on the previously introduced async context,
which allows i/o responses to be processed in a separate thread and
returns to the caller immediately for asynchronous calls.

This improves the read performance of single-threaded applications
as the i/o queue depth increases.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-02 14:57:34 -05:00
Pavel Shilovsky
ccf7f4088a CIFS: Add asynchronous context to support kernel AIO
Currently the code doesn't recognize asynchronous calls passed
by io_submit() and processes all calls synchronously. This is not
what kernel AIO expects. This patch introduces a new async context
that keeps track of all issued i/o requests and moves the response
collecting procedure to a separate thread. This allows returning
to the caller immediately for async calls and calling iocb->ki_complete()
once all requests are completed. For sync calls the current thread
simply waits until all requests are completed.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-02 14:57:34 -05:00
Daniel N Pettersson
29bb3158cf cifs: fix IPv6 link local, with scope id, address parsing
When the IP address is taken from the UNC, use only the address part
of the UNC. Otherwise everything after the percent sign in an IPv6
link-local address is interpreted as a scope id, including the slash and
share name. A scope id is expected to be an integer, and any trailing
characters make the conversion to integer fail.
Example of a mount command that fails:
mount -i -t cifs //fe80::6a05:caff:fe3e:8ffc%2/test /mnt/t -o sec=none
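
One way to picture the fix is a helper that isolates the host portion of the UNC before any address parsing happens. This is only a sketch under that assumption: sketch_unc_host() is a hypothetical function, not the one used in fs/cifs, and it accepts either slashes or backslashes as separators.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy only the host part of a UNC such as "//host/share" into dst.
 * Returns the host length, or -1 if the UNC is malformed or dst is too
 * small. Hypothetical helper for illustration only.
 */
static int sketch_unc_host(const char *unc, char *dst, size_t dst_len)
{
	size_t len;

	/* A UNC starts with two separators (slashes or backslashes). */
	if (strlen(unc) < 3 || strspn(unc, "/\\") < 2)
		return -1;
	unc += 2;

	/* The host ends at the next separator or at the end of the string. */
	len = strcspn(unc, "/\\");
	if (len == 0 || len >= dst_len)
		return -1;

	memcpy(dst, unc, len);
	dst[len] = '\0';
	return (int)len;
}
```

For the failing mount command above, the helper would hand "fe80::6a05:caff:fe3e:8ffc%2" to the address parser, so the scope id is just "2" and no longer includes "/test".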

Signed-off-by: Daniel N Pettersson <danielnp@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-02 14:57:34 -05:00
Dan Carpenter
564277ecee cifs: small underflow in cnvrtDosUnixTm()
January is month 1.  There is no zero-th month.  If someone passes a
zero month then we read from one element before the start of the
total_days_of_prev_months[] array.

We may as well be strict about days too.
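
The underflow can be illustrated with a small stand-alone sketch. The table and function names are hypothetical, and clamping out-of-range months is just one possible policy assumed here; the commit message does not say exactly how the real code handles them.

```c
/* Cumulative days before each month in a non-leap year (index 0 = January). */
static const int sketch_days_before_month[12] = {
	0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334
};

/*
 * Days elapsed in the year before the given 1-based month.
 * Out-of-range months are clamped into 1..12 so that the index
 * (month - 1) never leaves the array.
 */
static int sketch_days_before(int month)
{
	if (month < 1)
		month = 1;
	if (month > 12)
		month = 12;
	return sketch_days_before_month[month - 1];
}
```

Without the lower clamp, a month of 0 would index element -1, which is the read before the start of the array that the commit describes.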

Fixes: 1bd5bbcb65 ("[CIFS] Legacy time handling for Win9x and OS/2 part 1")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-02 14:57:34 -05:00
Linus Torvalds
204f144c9f Merge branch 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fs/compat.c cleanups from Al Viro:
 "More moving of compat syscalls from fs/compat.c to fs/*.c where the
  native counterparts live.

  And death to compat_sys_getdents64() - the only architecture that used
  to need it was ia64, and _that_ has lost biarch support quite a few
  years ago"

* 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/compat.c: trim unused includes
  move compat_rw_copy_check_uvector() over to fs/read_write.c
  fhandle: move compat syscalls from compat.c
  open: move compat syscalls from compat.c
  stat: move compat syscalls from compat.c
  fcntl: move compat syscalls from compat.c
  readdir: move compat syscalls from compat.c
  statfs: move compat syscalls from compat.c
  utimes: move compat syscalls from compat.c
  move compat select-related syscalls to fs/select.c
  Remove compat_sys_getdents64()
2017-05-02 11:54:26 -07:00
Linus Torvalds
da7b66ffb2 Merge branch 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull splice updates from Al Viro:
 "These actually missed the last cycle; the branch itself is from last
  December"

* 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make nr_pages calculation in default_file_splice_read() a bit less ugly
  splice/tee/vmsplice: validate flags
  splice_pipe_desc: kill ->flags
  remove spd_release_page()
2017-05-02 11:38:06 -07:00
Linus Torvalds
5b13475a5e Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
 "Cleanups that sat in -next + -stable fodder that has just missed 4.11.

  There's more iov_iter work in my local tree, but I'd prefer to push
  the stuff that had been in -next first"

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  iov_iter: don't revert iov buffer if csum error
  generic_file_read_iter(): make use of iov_iter_revert()
  generic_file_direct_write(): make use of iov_iter_revert()
  orangefs: use iov_iter_revert()
  sctp: switch to copy_from_iter_full()
  net/9p: switch to copy_from_iter_full()
  switch memcpy_from_msg() to copy_from_iter_full()
  rds: make use of iov_iter_revert()
2017-05-02 11:18:50 -07:00
Linus Torvalds
6fd4e7f774 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "Three cifs/smb3 fixes - including two for stable"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: don't check for failure from mempool_alloc()
  Do not return number of bytes written for ioctl CIFS_IOC_COPYCHUNK_FILE
  Fix match_prepath()
2017-05-02 11:16:29 -07:00
Linus Torvalds
2575be8ad3 Merge tag 'pstore-v4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore updates from Kees Cook:
 "This has a large internal refactoring along with several smaller
  fixes.

   - constify compression structures; Bhumika Goyal

   - restore powerpc dumping; Ankit Kumar

   - fix more bugs in the rarely exercised module unloading logic

   - reorganize filesystem locking to fix problems noticed by lockdep

   - refactor internal pstore APIs to make development and review
     easier:
      - improve error reporting
      - add kernel-doc structure and function comments
      - avoid insane argument passing by using a common record
        structure"

* tag 'pstore-v4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (23 commits)
  pstore: Solve lockdep warning by moving inode locks
  pstore: Fix flags to enable dumps on powerpc
  pstore: Remove unused vmalloc.h in pmsg
  pstore: simplify write_user_compat()
  pstore: Remove write_buf() callback
  pstore: Replace arguments for write_buf_user() API
  pstore: Replace arguments for write_buf() API
  pstore: Replace arguments for erase() API
  pstore: Do not duplicate record metadata
  pstore: Allocate records on heap instead of stack
  pstore: Pass record contents instead of copying
  pstore: Always allocate buffer for decompression
  pstore: Replace arguments for write() API
  pstore: Replace arguments for read() API
  pstore: Switch pstore_mkfile to pass record
  pstore: Move record decompression to function
  pstore: Extract common arguments into structure
  pstore: Add kernel-doc for struct pstore_info
  pstore: Improve register_pstore() error reporting
  pstore: Avoid race in module unloading
  ...
2017-05-02 10:35:45 -07:00
Trond Myklebust
5f0114832a pNFS: Fix a typo in pnfs_generic_alloc_ds_commits
If the layout segment is invalid, we want to just resend the remaining
writes.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-02 12:35:34 -04:00
Trond Myklebust
61f454e30c pNFS: Fix a deadlock when coalescing writes and returning the layout
Consider the following deadlock:

Process P1	Process P2		Process P3
==========	==========		==========
					lock_page(page)

		lseg = pnfs_update_layout(inode)

lo = NFS_I(inode)->layout
pnfs_error_mark_layout_for_return(lo)

		lock_page(page)

					lseg = pnfs_update_layout(inode)

In this scenario,
- P1 has declared the layout to be in error, but P2 holds a reference to
  a layout segment on that inode, so the layoutreturn is deferred.
- P2 is waiting for a page lock held by P3.
- P3 is asking for a new layout segment, but is blocked waiting
  for the layoutreturn.

The fix is to ensure that pnfs_error_mark_layout_for_return() does
not set the NFS_LAYOUT_RETURN flag, which blocks P3. Instead, we allow
the latter to call LAYOUTGET so that it can make progress and unblock
P2.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-02 12:35:33 -04:00
Trond Myklebust
5466d21411 pNFS: Don't clear the layout return info if there are segments to return
In pnfs_clear_layoutreturn_info, ensure that we don't clear the layout
return info if there are new segments queued for return due to, for
instance, a race between a LAYOUTRETURN and a failed I/O attempt.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-02 12:35:33 -04:00
Eric Biggers
aa1dca3bd9 ext4: inherit encryption xattr before other xattrs
When using both encryption and SELinux (or another feature that requires
an xattr per file) on a filesystem with 256-byte inodes, each file's
xattrs usually spill into an external xattr block.  Currently, the
xattrs are inherited in the order ACL, security, then encryption.
Therefore, if spillage occurs, the encryption xattr will always end up
in the external block.  This is not ideal because the encryption xattrs
contain a nonce, so they will always be unique and will prevent the
external xattr blocks from being deduplicated.

To improve the situation, change the inheritance order to encryption,
ACL, then security.  This gives the encryption xattr a better chance to
be stored in-inode, allowing the other xattr(s) to be deduplicated.

Note that it may be better for userspace to format the filesystem with
512-byte inodes in this case.  However, it's not the default.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-02 00:49:54 -04:00
Linus Torvalds
6dc2cce932 Merge branch 'x86-process-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/process updates from Ingo Molnar:
 "The main change in this cycle was to add the ARCH_[GET|SET]_CPUID
  prctl() ABI extension to control the availability of the CPUID
  instruction, analogously to the existing PR_GET|SET_TSC ABI that
  controls RDTSC.

  Motivation: the 'rr' user-space record-and-replay execution debugger
  would like to trap and emulate the CPUID instruction - which
  instruction is normally unprivileged.

  Trapping CPUID is possible on IvyBridge and later Intel CPUs - expose
  this hardware capability"

* 'x86-process-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/syscalls/32: Ignore arch_prctl for other architectures
  um/arch_prctl: Fix fallout from x86 arch_prctl() rework
  x86/arch_prctl: Add ARCH_[GET|SET]_CPUID
  x86/cpufeature: Detect CPUID faulting support
  x86/syscalls/32: Wire up arch_prctl on x86-32
  x86/arch_prctl: Add do_arch_prctl_common()
  x86/arch_prctl/64: Rename do_arch_prctl() to do_arch_prctl_64()
  x86/arch_prctl/64: Use SYSCALL_DEFINE2 to define sys_arch_prctl()
  x86/arch_prctl: Rename 'code' argument to 'option'
  x86/msr: Rename MISC_FEATURE_ENABLES to MISC_FEATURES_ENABLES
  x86/process: Optimize TIF_NOTSC switch
  x86/process: Correct and optimize TIF_BLOCKSTEP switch
  x86/process: Optimize TIF checks in __switch_to_xtra()
2017-05-01 19:57:58 -07:00
Linus Torvalds
3527d3e951 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - another round of rq-clock handling debugging, robustization and
     fixes

   - PELT accounting improvements

   - CPU hotplug related ->cpus_allowed affinity handling fixes all
     around the tree

   - ... plus misc fixes, cleanups and updates"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
  sched/x86: Update reschedule warning text
  crypto: N2 - Replace racy task affinity logic
  cpufreq/sparc-us2e: Replace racy task affinity logic
  cpufreq/sparc-us3: Replace racy task affinity logic
  cpufreq/sh: Replace racy task affinity logic
  cpufreq/ia64: Replace racy task affinity logic
  ACPI/processor: Replace racy task affinity logic
  ACPI/processor: Fix error handling in __acpi_processor_start()
  sparc/sysfs: Replace racy task affinity logic
  powerpc/smp: Replace open coded task affinity logic
  ia64/sn/hwperf: Replace racy task affinity logic
  ia64/salinfo: Replace racy task affinity logic
  workqueue: Provide work_on_cpu_safe()
  ia64/topology: Remove cpus_allowed manipulation
  sched/fair: Move the PELT constants into a generated header
  sched/fair: Increase PELT accuracy for small tasks
  sched/fair: Fix comments
  sched/Documentation: Add 'sched-pelt' tool
  sched/fair: Fix corner case in __accumulate_sum()
  sched/core: Remove 'task' parameter and rename tsk_restore_flags() to current_restore_flags()
  ...
2017-05-01 19:12:53 -07:00
Linus Torvalds
5db6db0d40 Merge branch 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess unification updates from Al Viro:
 "This is the uaccess unification pile. It's _not_ the end of uaccess
  work, but the next batch of that will go into the next cycle. This one
  mostly takes copy_from_user() and friends out of arch/* and gets the
  zero-padding behaviour in sync for all architectures.

  Dealing with the nocache/writethrough mess is for the next cycle;
  fortunately, that's x86-only. Same for cleanups in iov_iter.c (I am
  sold on access_ok() in there, BTW; just not in this pile), same for
  reducing __copy_... callsites, strn*... stuff, etc. - there will be a
  pile about as large as this one in the next merge window.

  This one sat in -next for weeks. -3KLoC"
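
The zero-padding behaviour mentioned above can be modelled in userspace.
This is only a sketch under stated assumptions: demo_copy_from_user() and
its 'readable' parameter are hypothetical and stand in for a copy that
faults part-way through; it is not the kernel implementation.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace model, not kernel code: only the first
 * 'readable' bytes of 'src' can be read; anything past that "faults".
 * Returns the number of bytes that could NOT be copied and zero-fills
 * the corresponding tail of 'dst', so the caller never sees stale data.
 */
static size_t demo_copy_from_user(void *dst, const void *src,
				  size_t n, size_t readable)
{
	size_t done = n < readable ? n : readable;

	memcpy(dst, src, done);
	memset((char *)dst + done, 0, n - done);	/* zero the tail */
	return n - done;
}
```

A caller treats a nonzero return as a fault; the destination buffer is
fully initialised either way.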

* 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (96 commits)
  HAVE_ARCH_HARDENED_USERCOPY is unconditional now
  CONFIG_ARCH_HAS_RAW_COPY_USER is unconditional now
  m32r: switch to RAW_COPY_USER
  hexagon: switch to RAW_COPY_USER
  microblaze: switch to RAW_COPY_USER
  get rid of padding, switch to RAW_COPY_USER
  ia64: get rid of copy_in_user()
  ia64: sanitize __access_ok()
  ia64: get rid of 'segment' argument of __do_{get,put}_user()
  ia64: get rid of 'segment' argument of __{get,put}_user_check()
  ia64: add extable.h
  powerpc: get rid of zeroing, switch to RAW_COPY_USER
  esas2r: don't open-code memdup_user()
  alpha: fix stack smashing in old_adjtimex(2)
  don't open-code kernel_setsockopt()
  mips: switch to RAW_COPY_USER
  mips: get rid of tail-zeroing in primitives
  mips: make copy_from_user() zero tail explicitly
  mips: clean and reorder the forest of macros...
  mips: consolidate __invoke_... wrappers
  ...
2017-05-01 14:41:04 -07:00
Arnd Bergmann
67fd389735 block, dax: use correct format string in bdev_dax_supported
The new message has an incorrect format string, causing a warning in some
configurations:

fs/block_dev.c: In function 'bdev_dax_supported':
fs/block_dev.c:779:5: error: format '%d' expects argument of type 'int', but argument 2 has type 'long int' [-Werror=format=]
     "error: dax access failed (%d)", len);

This changes it to use the correct %ld instead of %d.
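
As a minimal userspace sketch of the same rule (format_dax_error() is a
hypothetical helper, not the function touched by this patch), a 'long'
argument needs the %ld conversion:

```c
#include <stdio.h>

/*
 * Hypothetical helper, not the kernel's code: format an error message
 * that carries a 'long' value. The conversion must be %ld; passing a
 * long to %d is what -Wformat complains about.
 */
static int format_dax_error(char *buf, size_t size, long len)
{
	return snprintf(buf, size, "error: dax access failed (%ld)", len);
}
```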

Fixes: 2093f2e9df ("block, dax: convert bdev_dax_supported() to dax_direct_access()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-01 13:16:29 -07:00
Linus Torvalds
694752922b Merge branch 'for-4.12/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:

 - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
   was initially a fork of CFQ, but subsequently changed to implement
   fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
   to be used on desktop type single drives, providing good fairness.
   From Paolo.

 - Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
   using a scalable token based algorithm that throttles IO based on
   live completion IO stats, similarly to blk-wbt. From Omar.

 - A series from Jan, moving users to separately allocated backing
   devices. This continues the work of separating backing device life
   times, solving various problems with hot removal.

 - A series of updates for lightnvm, mostly from Javier. Includes a
   'pblk' target that exposes an open channel SSD as a physical block
   device.

 - A series of fixes and improvements for nbd from Josef.

 - A series from Omar, removing queue sharing between devices on mostly
   legacy drivers. This helps us clean up other bits, if we know that a
   queue only has a single device backing. This has been overdue for
   more than a decade.

 - Fixes for the blk-stats, and improvements to unify the stats and user
   windows. This both improves blk-wbt, and enables other users to
   register a need to receive IO stats for a device. From Omar.

 - blk-throttle improvements from Shaohua. This provides a scalable
   framework for implementing scalable prioritization - particularly for
   blk-mq, but applicable to any type of block device. The interface is
   marked experimental for now.

 - Bucketized IO stats for IO polling from Stephen Bates. This improves
   efficiency of polled workloads in the presence of mixed block size
   IO.

 - A few fixes for opal, from Scott.

 - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
   From a variety of folks, mostly Sagi and James Smart.

 - A series from Bart, improving our exposed info and capabilities from
   the blk-mq debugfs support.

 - A series from Christoph, cleaning up how we handle WRITE_ZEROES.

 - A series from Christoph, cleaning up the block layer handling of how
   we track errors in a request. On top of being a nice cleanup, it also
   shrinks the size of struct request a bit.

 - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
   never used by platforms, and the latter has outlived its usefulness.

 - Various little bug fixes and cleanups from a wide variety of folks.

* 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
  block: hide badblocks attribute by default
  blk-mq: unify hctx delay_work and run_work
  block: add kblock_mod_delayed_work_on()
  blk-mq: unify hctx delayed_run_work and run_work
  nbd: fix use after free on module unload
  MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
  blk-mq-sched: alloate reserved tags out of normal pool
  mtip32xx: use runtime tag to initialize command header
  scsi: Implement blk_mq_ops.show_rq()
  blk-mq: Add blk_mq_ops.show_rq()
  blk-mq: Show operation, cmd_flags and rq_flags names
  blk-mq: Make blk_flags_show() callers append a newline character
  blk-mq: Move the "state" debugfs attribute one level down
  blk-mq: Unregister debugfs attributes earlier
  blk-mq: Only unregister hctxs for which registration succeeded
  blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
  blk-mq: Let blk_mq_debugfs_register() look up the queue name
  blk-mq: Register <dev>/queue/mq after having registered <dev>/queue
  ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
  ide-pm: always pass 0 error to __blk_end_request_all
  ...
2017-05-01 10:39:57 -07:00
Theodore Ts'o
72d622b422 ext4: replace BUG_ON with WARN_ONCE in ext4_end_bio()
Add fallback code and a WARN_ONCE() call instead of a BUG_ON() in
the ext4_end_bio() function.
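
The general pattern (warn once, then take a fallback path instead of
halting) might look like the following userspace sketch. All names here
are hypothetical, and unlike the kernel's WARN_ONCE() this version keeps
a single flag for all callers rather than one per call site.

```c
#include <stddef.h>
#include <stdio.h>

static int demo_warn_count;	/* counts emitted warnings, for the example */

/* Returns nonzero if the condition was hit; prints only the first time. */
static int demo_warn_once(int cond, const char *msg)
{
	static int warned;

	if (!cond)
		return 0;
	if (!warned) {
		warned = 1;
		demo_warn_count++;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
	return 1;
}

/* Completion handler: on a missing context, warn and bail out. */
static int demo_end_io(const void *ctx)
{
	if (demo_warn_once(ctx == NULL, "completion without context"))
		return -1;	/* fallback path: skip processing */
	return 0;
}
```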

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 20:08:05 -04:00
Jan Kara
dddbd6ac8f ext4: avoid unnecessary transaction stalls during writeback
Currently ext4_writepages() submits all pages with a transaction started.
When no page needs block allocation or extent conversion, we can submit
all dirty pages in the inode while holding a single transaction handle,
and when the device is congested this can take a significant amount of
time. Thus ext4_writepages() can block transaction commits for extended
periods of time.

Take for example a simple benchmark simulating a PostgreSQL database
(pgioperf in mmtest). The benchmark runs 16 processes doing random reads
from a huge file, one process doing random writes to the huge file, and
one process doing sequential writes to a small file and frequently
running fsync. With an unpatched kernel, transaction commits take on
average ~18s with standard deviation of ~41s, top 5 commit times are:

274.466639s, 126.467347s, 86.992429s, 34.351563s, 31.517653s.

After this patch transaction commits take on average 0.1s with standard
deviation of 0.15s, top 5 commit times are:

0.563792s, 0.519980s, 0.509841s, 0.471700s, 0.469899s

[ Modified so we use an explicit do_map flag instead of relying on
  io_end not being allocated, since io_end->inode is needed for I/O
  error handling. -- tytso ]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 18:29:10 -04:00
Joe Richey
9c8268def6 fscrypt: Move key structure and constants to uapi
This commit exposes the necessary constants and structures for a
userspace program to pass filesystem encryption keys into the keyring.
The fscrypt_key structure was already part of the kernel ABI; this
change just makes it so programs no longer have to redeclare these
structures (like e4crypt in e2fsprogs currently does).

Note that we do not expose the other FS_*_KEY_SIZE constants as they are
not necessary. Only XTS is supported for contents_encryption_mode, so
currently FS_MAX_KEY_SIZE bytes of key material must always be passed to
the kernel.

This commit also removes __packed from fscrypt_key as it does not
contain any implicit padding and does not refer to an on-disk structure.
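
The padding claim can be checked at compile time. The sketch below uses
a hypothetical struct demo_key (u32, byte array, u32) and an assumed
array size of 64; it illustrates the reasoning and is not a copy of the
uapi header.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_KEY_SIZE 64	/* assumed value, for illustration only */

/* Hypothetical key structure: u32, byte array, u32. */
struct demo_key {
	uint32_t mode;
	uint8_t  raw[DEMO_MAX_KEY_SIZE];
	uint32_t size;
};

/* No implicit padding: the size equals the sum of the member sizes. */
_Static_assert(sizeof(struct demo_key) ==
	       sizeof(uint32_t) + DEMO_MAX_KEY_SIZE + sizeof(uint32_t),
	       "demo_key has implicit padding");

/* The trailing member starts right after the byte array. */
_Static_assert(offsetof(struct demo_key, size) == 4 + DEMO_MAX_KEY_SIZE,
	       "unexpected gap before 'size'");
```

Because the byte array length is a multiple of four, the trailing u32 is
naturally aligned and a packed attribute would not change the layout.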

Signed-off-by: Joe Richey <joerichey@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 01:26:34 -04:00
Eric Biggers
cd39e4bac1 fscrypt: remove unnecessary checks for NULL operations
The functions in fs/crypto/*.c are only called by filesystems configured
with encryption support.  Since the ->get_context(), ->set_context(),
and ->empty_dir() operations are always provided in that case (and must
be, otherwise there would be no way to get/set encryption policies, or
in the case of ->get_context() even access encrypted files at all),
there is no need to check for these operations being NULL and we can
remove these unneeded checks.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 01:26:34 -04:00
Andrew Perepechko
85c8f176a6 ext4: preload block group descriptors
With the meta_bg option enabled, block group descriptor reads are not
sequential and require optimization.

Signed-off-by: Andrew Perepechko <andrew.perepechko@seagate.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 00:46:35 -04:00
Eric Biggers
1a20a63084 ext4: make ext4_shutdown() static
Make the ext4_shutdown() function static, as suggested by running sparse
('make C=2 fs/ext4/').  This was the only such warning in fs/ext4/.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 00:40:44 -04:00
Darrick J. Wong
0c9ec4beec ext4: support GETFSMAP ioctls
Support the GETFSMAP ioctls so that we can use the xfs free space
management tools to probe ext4 as well.  Note that this is a partial
implementation -- we only report fixed-location metadata and free space;
everything else is reported as "unknown".

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 00:36:53 -04:00
Eric Biggers
7b4cc9787f ext4: evict inline data when writing to memory map
Currently the case of writing via mmap to a file with inline data is not
handled.  This is maybe a rare case since it requires a writable memory
map of a very small file, but it is trivial to trigger on an
inline_data filesystem, and it causes the
'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in
ext4_writepages() to be hit:

    mkfs.ext4 -O inline_data /dev/vdb
    mount /dev/vdb /mnt
    xfs_io -f /mnt/file \
	-c 'pwrite 0 1' \
	-c 'mmap -w 0 1m' \
	-c 'mwrite 0 1' \
	-c 'fsync'

	kernel BUG at fs/ext4/inode.c:2723!
	invalid opcode: 0000 [#1] SMP
	CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
	task: ffff88003d3a8040 task.stack: ffffc90000300000
	RIP: 0010:ext4_writepages+0xc89/0xf8a
	RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283
	RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc
	RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246
	RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5
	R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698
	R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff
	FS:  00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0
	Call Trace:
	 ? _raw_spin_unlock+0x27/0x2a
	 ? kvm_clock_read+0x1e/0x20
	 do_writepages+0x23/0x2c
	 ? do_writepages+0x23/0x2c
	 __filemap_fdatawrite_range+0x80/0x87
	 filemap_write_and_wait_range+0x67/0x8c
	 ext4_sync_file+0x20e/0x472
	 vfs_fsync_range+0x8e/0x9f
	 ? syscall_trace_enter+0x25b/0x2d0
	 vfs_fsync+0x1c/0x1e
	 do_fsync+0x31/0x4a
	 SyS_fsync+0x10/0x14
	 do_syscall_64+0x69/0x131
	 entry_SYSCALL64_slow_path+0x25/0x25

We could try to be smart and keep the inline data in this case, or at
least support delayed allocation when allocating the block, but these
solutions would be more complicated and don't seem worthwhile given how
rare this case seems to be.  So just fix the bug by calling
ext4_convert_inline_data() when we're asked to make a page writable, so
that any inline data gets evicted, with the block allocated immediately.

Reported-by: Nick Alcock <nick.alcock@oracle.com>
Cc: stable@vger.kernel.org
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 00:10:50 -04:00
Eric Biggers
6ba644b9fd ext4: remove ext4_xattr_check_entry()
ext4_xattr_check_entry() was redundant with validation of the full xattr
entries list in ext4_xattr_check_entries(), which all callers also did.
ext4_xattr_check_entry() also didn't actually do correct validation;
specifically, it never checked that the value doesn't overlap the xattr
names, nor did it account for padding when checking whether the xattr
value overflows the available space.  So remove it to eliminate any
potential confusion.
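
To make the two missing checks concrete, here is a simplified userspace
model. Everything in it is an assumption for illustration: the layout
(name table at the start of a block, values after it), the 4-byte
padding, and the helper demo_value_ok() are hypothetical and do not
reproduce the ext4 on-disk format or its validation code.

```c
#include <stddef.h>

#define DEMO_PAD 4u	/* assumed value alignment, for illustration */

/*
 * Hypothetical, simplified model: a block of 'block_size' bytes holds
 * the entry/name table in [0, names_end) and values placed after it.
 * Returns 1 if the value at 'offs' with 'size' bytes is acceptable.
 */
static int demo_value_ok(size_t block_size, size_t names_end,
			 size_t offs, size_t size)
{
	size_t remaining, pad;

	if (names_end > block_size || offs > block_size)
		return 0;
	if (offs < names_end)		/* value overlaps the names */
		return 0;
	remaining = block_size - offs;
	if (size > remaining)		/* unpadded value overflows */
		return 0;
	pad = (DEMO_PAD - size % DEMO_PAD) % DEMO_PAD;
	return pad <= remaining - size;	/* padded value must fit too */
}
```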

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 00:01:02 -04:00
Eric Biggers
2c4f992337 ext4: rename ext4_xattr_check_names() to ext4_xattr_check_entries()
ext4_xattr_check_names() actually validates both the xattr names and
values, not just the names.  So rename it to ext4_xattr_check_entries()
to avoid confusion.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 23:56:52 -04:00
Eric Biggers
ba7ea1d8f4 ext4: merge ext4_xattr_list() into ext4_listxattr()
There's no difference between ext4_xattr_list() and ext4_listxattr(), so
merge them together and just have ext4_listxattr().  Some years ago they
took different arguments, but that's no longer the case.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 23:53:17 -04:00
Eric Biggers
d600618673 ext4: constify static data that is never modified
Constify static data in ext4 that is never (intentionally) modified so
that it is placed in .rodata and benefits from memory protection.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 23:47:50 -04:00
Eric Biggers
1bc0af600b ext4: trim return value and 'dir' argument from ext4_insert_dentry()
In the initial implementation of ext4 encryption, the filename was
encrypted in ext4_insert_dentry(), which could fail and also required
access to the 'dir' inode.  Since then ext4 filename encryption has been
changed to encrypt the filename earlier, so we can revert the additions
to ext4_insert_dentry().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 23:27:26 -04:00
Jan Kara
5052b069ac jbd2: fix dbench4 performance regression for 'nobarrier' mounts
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed the REQ_SYNC flag from the WRITE_FUA implementation.
Since JBD2 strips the REQ_FUA and REQ_FLUSH flags from submitted IO when
the filesystem is mounted with the nobarrier mount option, journal
superblock writes ended up being async writes after that commit, which
caused a heavy performance regression for the dbench4 benchmark with a
high number of processes. In my test setup with an HP RAID array with
non-volatile write cache and 32 GB RAM, dbench4 runs with 8 processes
regressed by ~25%.

Fix the problem by making sure journal superblock writes are always
treated as synchronous since they generally block progress of the
journalling machinery and thus the whole filesystem.

Fixes: b685d3d65a
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 21:07:30 -04:00
Jan Kara
c52c47e4b4 jbd2: Fix lockdep splat with generic/270 test
I've hit a lockdep splat with generic/270 test complaining that:

3216.fsstress.b/3533 is trying to acquire lock:
 (jbd2_handle){++++..}, at: [<ffffffff813152e0>] jbd2_log_wait_commit+0x0/0x150

but task is already holding lock:
 (jbd2_handle){++++..}, at: [<ffffffff8130bd3b>] start_this_handle+0x35b/0x850

The underlying problem is that jbd2_journal_force_commit_nested()
(called from ext4_should_retry_alloc()) may get called while a
transaction handle is started. In such case it takes care to not wait
for commit of the running transaction (which would deadlock) but only
for a commit of a transaction that is already committing (which is safe
as that doesn't wait for any filesystem locks).

In fact there are also other callers of jbd2_log_wait_commit() that take
care to pass tid of a transaction that is already committing and for
those cases, the lockdep instrumentation is too restrictive and leads
to false positive reports. Fix the problem by calling
jbd2_might_wait_for_commit() from jbd2_log_wait_commit() only if the
transaction isn't already committing.

Fixes: 1eaa566d36
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 20:12:16 -04:00
Mark Charlebois
9280cdd6fe fs: compat: Remove warning from COMPATIBLE_IOCTL
cmd in COMPATIBLE_IOCTL is always a u32, so cast it so there isn't a
warning about an overflow in XFORM.

From: Mark Charlebois <charlebm@gmail.com>
Signed-off-by: Mark Charlebois <charlebm@gmail.com>
Signed-off-by: Behan Webster <behanw@converseincode.com>
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-29 17:47:19 -04:00
Al Viro
0b33540f9d remove pointless extern of atime_need_update_rcu()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-29 17:42:25 -04:00
Trond Myklebust
1f18b82c34 pNFS: Ensure we commit the layout if it has been invalidated
If the layout is being invalidated on the server, then we must
invoke nfs_commit_inode() to ensure any commits to the DS get
cleared out.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-29 11:29:30 -04:00
Trond Myklebust
722f0b8911 pNFS: Don't send COMMITs to the DSes if the server invalidated our layout
If the layout was invalidated, then assume we should requeue all the
pending writes for the DS in question.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-29 11:29:24 -04:00
Trond Myklebust
37f8aa16da pNFS/flexfiles: Fix up the ff_layout_write_pagelist failure path
If the attempt to write through pNFS fails, we need to use the same
failure semantics as for the read path: If the FF_FLAGS_NO_IO_THRU_MDS
flag is set or we have sufficient valid DSes, then we must retry through
pNFS.

Fixes: d67ae825a5 ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-29 00:02:37 -04:00
Takashi Iwai
d66bb1607e proc: Fix unbalanced hard link numbers
proc_create_mount_point() forgot to increase the parent's nlink, and
it resulted in unbalanced hard link numbers, e.g. /proc/fs shows one
less than expected.

Fixes: eb6d38d542 ("proc: Allow creating permanently empty directories...")
Cc: stable@vger.kernel.org
Reported-by: Tristan Ye <tristan.ye@suse.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-04-28 21:05:26 -05:00
Linus Torvalds
28b2013587 Merge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "We have one more fix for btrfs.

  This gets rid of a new WARN_ON from rc1 that ended up making more
  noise than we really want. The larger fix for the underflow got
  delayed a bit and it's better for now to put it under
  CONFIG_BTRFS_DEBUG"

* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: qgroup: move noisy underflow warning to debugging build
2017-04-28 10:13:17 -07:00
Trond Myklebust
bdebfccd0e pNFS: Ensure we check layout validity before marking it for return
pnfs_error_mark_layout_for_return needs to check that the layout is
valid before calling pnfs_set_plh_return_info().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-28 13:07:01 -04:00
Olga Kornievskaia
88bd4f8629 NFS4.1 handle interrupted slot reuse from ERR_DELAY
If the RPC slot was interrupted and server replied to the next
operation on the "reused" slot with ERR_DELAY, don't clear out
the "interrupted" flag until we properly recover.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-28 13:07:00 -04:00
Pan Bian
4edabfd7d0 NFSv4: check return value of xdr_inline_decode
Function xdr_inline_decode() returns a NULL pointer if the input
buffer is not long enough to decode nbytes of data.
However, in function decode_op_map(), the return value of
xdr_inline_decode() is not validated before it is used. This patch adds
a check to the return value of xdr_inline_decode().
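
A userspace sketch of the same defensive pattern follows. The stream
type and both helpers are hypothetical stand-ins (they are not the
SUNRPC xdr_stream API); the point is only that the caller tests for
NULL before dereferencing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cursor over a receive buffer (not the SUNRPC xdr_stream). */
struct demo_stream {
	const uint8_t *p;
	const uint8_t *end;
};

/* Return a pointer to the next nbytes, or NULL if fewer remain. */
static const uint8_t *demo_inline_decode(struct demo_stream *s, size_t nbytes)
{
	const uint8_t *q = s->p;

	if ((size_t)(s->end - s->p) < nbytes)
		return NULL;
	s->p += nbytes;
	return q;
}

/* Decode one big-endian 32-bit word; 0 on success, -1 on short buffer. */
static int demo_decode_u32(struct demo_stream *s, uint32_t *out)
{
	const uint8_t *q = demo_inline_decode(s, 4);

	if (q == NULL)	/* analogous to the check described above */
		return -1;
	*out = ((uint32_t)q[0] << 24) | ((uint32_t)q[1] << 16) |
	       ((uint32_t)q[2] << 8) | (uint32_t)q[3];
	return 0;
}
```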

Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-28 13:06:59 -04:00
Artem Savkov
209aa23083 nfs/filelayout: fix NULL pointer dereference in fl_pnfs_update_layout()
Calling pnfs_put_lseg() on an IS_ERR pointer results in a NULL pointer
dereference like the one below. At the same time, the check of the return
value of filelayout_check_deviceid() sets lseg to an error, but does not
free it before that.

[ 3000.636161] BUG: unable to handle kernel NULL pointer dereference at 000000000000003c
[ 3000.636970] IP: pnfs_put_lseg+0x29/0x100 [nfsv4]
[ 3000.637420] PGD 4f23b067
[ 3000.637421] PUD 4a0f4067
[ 3000.637679] PMD 0
[ 3000.637937]
[ 3000.638287] Oops: 0000 [#1] SMP
[ 3000.638591] Modules linked in: nfs_layout_nfsv41_files nfsv3 nfnetlink_queue nfnetlink_log nfnetlink bluetooth rfkill rpcsec_gss_krb5 nfsv4 nfs fscache binfmt_misc arc4 md4 nls_utf8 cifs ccm dns_resolver rpcrdma ib_isert iscsi_target_mod ib_iser rdma_cm iw_cm libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib ib_ucm ib_uverbs ib_umad ib_cm ib_core nls_koi8_u nls_cp932 ts_kmp nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr virtio_balloon ppdev virtio_rng parport_pc i2c_piix4 parport acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c ata_generic pata_acpi virtio_blk virtio_net cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel ata_piix ttm libata drm serio_raw
[ 3000.645245]  i2c_core virtio_pci virtio_ring virtio floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: xt_u32]
[ 3000.646360] CPU: 1 PID: 26402 Comm: date Not tainted 4.11.0-rc7.1.el7.test.x86_64 #1
[ 3000.647092] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 3000.647638] task: ffff8800415ada00 task.stack: ffffc90000ff0000
[ 3000.648207] RIP: 0010:pnfs_put_lseg+0x29/0x100 [nfsv4]
[ 3000.648696] RSP: 0018:ffffc90000ff39b8 EFLAGS: 00010246
[ 3000.649193] RAX: 0000000000000000 RBX: fffffffffffffff4 RCX: 00000000000d43be
[ 3000.649859] RDX: 00000000000d43bd RSI: 0000000000000000 RDI: fffffffffffffff4
[ 3000.650530] RBP: ffffc90000ff39d8 R08: 000000000001e320 R09: ffffffffa05c35ce
[ 3000.651203] R10: ffff88007fd1e320 R11: ffffea0001283d80 R12: 0000000001400040
[ 3000.651875] R13: ffff88004f77d9f0 R14: ffffc90000ff3cd8 R15: ffff8800417ade00
[ 3000.652546] FS:  00007fac4d5cd740(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[ 3000.653304] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3000.653849] CR2: 000000000000003c CR3: 000000004f080000 CR4: 00000000000406e0
[ 3000.654527] Call Trace:
[ 3000.654771]  fl_pnfs_update_layout.constprop.20+0x10c/0x150 [nfs_layout_nfsv41_files]
[ 3000.655505]  filelayout_pg_init_write+0x21d/0x270 [nfs_layout_nfsv41_files]
[ 3000.656195]  __nfs_pageio_add_request+0x11c/0x490 [nfs]
[ 3000.656698]  nfs_pageio_add_request+0xac/0x260 [nfs]
[ 3000.657180]  nfs_do_writepage+0x109/0x2e0 [nfs]
[ 3000.657616]  nfs_writepages_callback+0x16/0x30 [nfs]
[ 3000.658096]  write_cache_pages+0x26f/0x510
[ 3000.658495]  ? nfs_do_writepage+0x2e0/0x2e0 [nfs]
[ 3000.658946]  ? _raw_spin_unlock_bh+0x1e/0x20
[ 3000.659357]  ? wb_wakeup_delayed+0x5f/0x70
[ 3000.659748]  ? __mark_inode_dirty+0x2eb/0x360
[ 3000.660170]  nfs_writepages+0x84/0xd0 [nfs]
[ 3000.660575]  ? nfs_updatepage+0x571/0xb70 [nfs]
[ 3000.661012]  do_writepages+0x1e/0x30
[ 3000.661358]  __filemap_fdatawrite_range+0xc6/0x100
[ 3000.661819]  filemap_write_and_wait_range+0x41/0x90
[ 3000.662292]  nfs_file_fsync+0x34/0x1f0 [nfs]
[ 3000.662704]  vfs_fsync_range+0x3d/0xb0
[ 3000.663065]  vfs_fsync+0x1c/0x20
[ 3000.663385]  nfs4_file_flush+0x57/0x80 [nfsv4]
[ 3000.663813]  filp_close+0x2f/0x70
[ 3000.664132]  __close_fd+0x9a/0xc0
[ 3000.664453]  SyS_close+0x23/0x50
[ 3000.664785]  do_syscall_64+0x67/0x180
[ 3000.665162]  entry_SYSCALL64_slow_path+0x25/0x25
[ 3000.665600] RIP: 0033:0x7fac4d0e1e90
[ 3000.665946] RSP: 002b:00007ffd54e90c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[ 3000.666679] RAX: ffffffffffffffda RBX: 00007fac4d3b5400 RCX: 00007fac4d0e1e90
[ 3000.667349] RDX: 0000000000000000 RSI: 00007fac4d5d9000 RDI: 0000000000000001
[ 3000.668031] RBP: 0000000000000000 R08: 00007fac4d3b6a00 R09: 00007fac4d5cd740
[ 3000.668709] R10: 00007ffd54e909e0 R11: 0000000000000246 R12: 0000000000000000
[ 3000.669385] R13: 00007fac4d3b5e80 R14: 0000000000000000 R15: 0000000000000000
[ 3000.670061] Code: 00 00 66 66 66 66 90 55 48 85 ff 48 89 e5 41 56 41 55 41 54 53 48 89 fb 0f 84 97 00 00 00 f6 05 16 8f bc ff 10 0f 85 a6 00 00 00 <4c> 8b 63 48 48 8d 7b 38 49 8b 84 24 90 00 00 00 4c 8d a8 88 00
[ 3000.671831] RIP: pnfs_put_lseg+0x29/0x100 [nfsv4] RSP: ffffc90000ff39b8
[ 3000.672462] CR2: 000000000000003c

Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-28 13:06:59 -04:00
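The register dump in the pnfs_put_lseg oops above can be checked with a little arithmetic. The sketch below reinterprets RBX as a signed 64-bit value and adds the 0x48 displacement from the faulting instruction bytes (`4c 8b 63 48`, a load from `0x48(%rbx)`). Reading the result as a dereference of an error-valued pointer is an inference from the numbers, not something the trace states. The helper name is hypothetical.

```python
import errno

MASK64 = (1 << 64) - 1


def to_signed64(value):
    """Reinterpret an unsigned 64-bit register value as a signed integer."""
    value &= MASK64
    return value - (1 << 64) if value & (1 << 63) else value


rbx = 0xfffffffffffffff4   # RBX (and RDI) from the register dump
disp = 0x48                # displacement of the faulting load
cr2 = 0x3c                 # faulting address reported in CR2

signed_rbx = to_signed64(rbx)          # small negative number, errno-sized
fault_addr = (rbx + disp) & MASK64     # address the load would touch
matches_enomem = (-signed_rbx == errno.ENOMEM)
```

The computed address wraps around to the CR2 value, which is consistent with the function having been handed a negative errno in place of a valid pointer.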
Brian Foster
e20c8a517f xfs: wait on new inodes during quotaoff dquot release
The quotaoff operation has a race with inode allocation that results
in a livelock. An inode allocation that occurs before the quota
status flags are updated acquires the appropriate dquots for the
inode via xfs_qm_vop_dqalloc(). It then inserts the XFS_INEW inode
into the perag radix tree, sometime later attaches the dquots to the
inode and finally clears the XFS_INEW flag. Quotaoff expects to
release the dquots from all inodes in the filesystem via
xfs_qm_dqrele_all_inodes(). This invokes the AG inode iterator,
which skips inodes in the XFS_INEW state because they are not fully
constructed. If the scan occurs after dquots have been attached to
an inode, but before XFS_INEW is cleared, the newly allocated inode
will continue to hold a reference to the applicable dquots. When
quotaoff invokes xfs_qm_dqpurge_all(), the reference counts of those
dquot(s) remain elevated and the dqpurge scan spins indefinitely.

To address this problem, update the xfs_qm_dqrele_all_inodes() scan
to wait on inodes marked in the XFS_INEW state. We wait on the
inodes explicitly rather than skip and retry to avoid continuous
retry loops due to a parallel inode allocation workload. Since
quotaoff updates the quota state flags and uses a synchronous
transaction before the dqrele scan, and dquots are attached to
inodes after radix tree insertion iff quota is enabled, one INEW
waiting pass through the AG guarantees that the scan has processed
all inodes that could possibly hold dquot references.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-28 08:11:08 -07:00
Brian Foster
ae2c4ac2dd xfs: update ag iterator to support wait on new inodes
The AG inode iterator currently skips new inodes as such inodes are
inserted into the inode radix tree before they are fully
constructed. Certain contexts require the ability to wait on the
construction of new inodes, however. The fs-wide dquot release from
the quotaoff sequence is an example of this.

Update the AG inode iterator to support the ability to wait on
inodes flagged with XFS_INEW upon request. Create a new
xfs_inode_ag_iterator_flags() interface and support a set of
iteration flags to modify the iteration behavior. When the
XFS_AGITER_INEW_WAIT flag is set, include XFS_INEW flags in the
radix tree inode lookup and wait on them before the callback is
executed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-28 08:11:08 -07:00
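The iterator behaviour described in ae2c4ac2dd can be modelled in user space. This is a minimal sketch, not the XFS code: `Inode`, `iterate_inodes` and `AGITER_INEW_WAIT` are hypothetical stand-ins, and a `threading.Event` plays the role of the XFS_INEW bit and its wakeup.

```python
import threading

AGITER_INEW_WAIT = 0x1   # hypothetical stand-in for XFS_AGITER_INEW_WAIT


class Inode:
    def __init__(self, ino, new=False):
        self.ino = ino
        self._constructed = threading.Event()
        if not new:
            self._constructed.set()

    @property
    def is_new(self):
        return not self._constructed.is_set()

    def finish_construction(self):
        # Models clearing the "new" flag and waking any waiters.
        self._constructed.set()

    def wait_until_constructed(self):
        self._constructed.wait()


def iterate_inodes(inodes, callback, flags=0):
    """Run callback on each inode; skip or wait on new inodes per flags."""
    visited = []
    for inode in inodes:
        if inode.is_new:
            if not flags & AGITER_INEW_WAIT:
                continue                      # default: skip new inodes
            inode.wait_until_constructed()    # on request: wait for them
        callback(inode)
        visited.append(inode.ino)
    return visited


inodes = [Inode(1), Inode(2, new=True), Inode(3)]
skipped = iterate_inodes(inodes, lambda inode: None)
# Another thread finishes constructing inode 2 shortly afterwards.
threading.Timer(0.05, inodes[1].finish_construction).start()
waited = iterate_inodes(inodes, lambda inode: None, AGITER_INEW_WAIT)
```

Without the flag the under-construction inode is passed over. With it, the walk blocks until construction completes, so one pass sees every inode.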
Brian Foster
756baca27f xfs: support ability to wait on new inodes
Inodes that are inserted into the perag tree but still under
construction are flagged with the XFS_INEW bit. Most contexts either
skip such inodes when they are encountered or have the ability to
handle them.

The runtime quotaoff sequence introduces a context that must wait
for construction of such inodes to correctly ensure that all dquots
in the fs are released. In anticipation of this, support the ability
to wait on new inodes. Wake the appropriate bit when XFS_INEW is
cleared.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-28 08:11:08 -07:00
Amir Goldstein
8f720d9f89 xfs: publish UUID in struct super_block
Copy the uuid of the filesystem to struct super_block s_uuid field,
as several other filesystems already do.  Copy regardless of the nouuid
mount option, because other filesystems also do not guarantee uniqueness
of the s_uuid field in super_block struct.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-28 08:10:53 -07:00
NeilBrown
a6f74e80f2 cifs: don't check for failure from mempool_alloc()
mempool_alloc() cannot fail if the gfp flags allow it to
sleep, and GFP_FS allows for sleeping.

So these tests of the return value from mempool_alloc()
are not needed.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-04-28 07:56:33 -05:00
Sachin Prabhu
7d0c234fd2 Do not return number of bytes written for ioctl CIFS_IOC_COPYCHUNK_FILE
commit 620d8745b3 ("Introduce cifs_copy_file_range()") changes the
behaviour of the cifs ioctl call CIFS_IOC_COPYCHUNK_FILE. In case of
successful writes, it now returns the number of bytes written. This
return value is treated as an error by the xfstest cifs/001. Depending
on the errno set at that time, this may or may not result in the test
failing.

The patch fixes this by setting the return value to 0 in case of
successful writes.

Fixes: commit 620d8745b3 ("Introduce cifs_copy_file_range()")
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <smfrench@gmail.com>
2017-04-28 07:56:33 -05:00
Sachin Prabhu
cd8c42968e Fix match_prepath()
Incorrect return value for shares not using the prefix path means that
we will never match superblocks for these shares.

Fixes: commit c1d8b24d18 ("Compare prepaths when comparing superblocks")
Cc: stable@vger.kernel.org
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-04-28 07:54:54 -05:00
Kees Cook
3a7d2fd16c pstore: Solve lockdep warning by moving inode locks
Lockdep complains about a possible deadlock between mount and unlink
(which is technically impossible), but fixing this improves possible
future multiple-backend support, and keeps locking in the right order.

The lockdep warning could be triggered by unlinking a file in the
pstore filesystem:

  -> #1 (&sb->s_type->i_mutex_key#14){++++++}:
         lock_acquire+0xc9/0x220
         down_write+0x3f/0x70
         pstore_mkfile+0x1f4/0x460
         pstore_get_records+0x17a/0x320
         pstore_fill_super+0xa4/0xc0
         mount_single+0x89/0xb0
         pstore_mount+0x13/0x20
         mount_fs+0xf/0x90
         vfs_kern_mount+0x66/0x170
         do_mount+0x190/0xd50
         SyS_mount+0x90/0xd0
         entry_SYSCALL_64_fastpath+0x1c/0xb1

  -> #0 (&psinfo->read_mutex){+.+.+.}:
         __lock_acquire+0x1ac0/0x1bb0
         lock_acquire+0xc9/0x220
         __mutex_lock+0x6e/0x990
         mutex_lock_nested+0x16/0x20
         pstore_unlink+0x3f/0xa0
         vfs_unlink+0xb5/0x190
         do_unlinkat+0x24c/0x2a0
         SyS_unlinkat+0x16/0x30
         entry_SYSCALL_64_fastpath+0x1c/0xb1

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&sb->s_type->i_mutex_key#14);
                                lock(&psinfo->read_mutex);
                                lock(&sb->s_type->i_mutex_key#14);
   lock(&psinfo->read_mutex);

Reported-by: Marta Lofstedt <marta.lofstedt@intel.com>
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
2017-04-27 20:35:34 -07:00
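The lockdep report in 3a7d2fd16c comes down to two code paths taking the same pair of locks in opposite orders. A minimal sketch of that check follows, assuming a toy lock-order graph; `LockOrderGraph` is a hypothetical name and this is far simpler than lockdep itself.

```python
from collections import defaultdict


class LockOrderGraph:
    def __init__(self):
        self.edges = defaultdict(set)

    def record(self, held, acquired):
        """Record that `acquired` was taken while `held` was already held."""
        self.edges[held].add(acquired)

    def has_inversion(self):
        """Return True if the recorded orderings contain a cycle."""
        WHITE, GREY, BLACK = 0, 1, 2
        state = defaultdict(int)

        def visit(node):
            state[node] = GREY
            for nxt in self.edges.get(node, ()):
                if state[nxt] == GREY:
                    return True
                if state[nxt] == WHITE and visit(nxt):
                    return True
            state[node] = BLACK
            return False

        return any(state[n] == WHITE and visit(n) for n in list(self.edges))


graph = LockOrderGraph()
# Mount path: the inode lock is taken in pstore_mkfile() under read_mutex.
graph.record("read_mutex", "i_mutex")
no_cycle_yet = graph.has_inversion()
# Unlink path: read_mutex is taken in pstore_unlink() under the inode lock.
graph.record("i_mutex", "read_mutex")
inverted = graph.has_inversion()
```

Either ordering is fine on its own. The warning only appears once both have been observed.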
Trond Myklebust
ed6473ddc7 NFSv4: Fix callback server shutdown
We want to use kthread_stop() in order to ensure the threads are
shut down before we tear down the nfs_callback_info in nfs_callback_down.

Tested-and-reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Fixes: bb6aeba736 ("NFSv4.x: Switch to using svc_set_num_threads()...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-27 18:00:16 -04:00
Kinglong Mee
df807fffaa NFSv4.x/callback: Create the callback service through svc_create_pooled
As the comments for svc_set_num_threads() said,
" Destroying threads relies on the service threads filling in
rqstp->rq_task, which only the nfs ones do.  Assumes the serv
has been created using svc_create_pooled()."

If the service is created through svc_create(), svc_destroy() still
calls svc_pool_map_put() even though the pool map isn't used.
As a result, the reference to the pool map is dropped, and the next
user of the pool map gets a zero npools.

[  137.992130] divide error: 0000 [#1] SMP
[  137.992148] Modules linked in: nfsd(E) nfsv4 nfs fscache fuse tun bridge stp llc ip_set nfnetlink vmw_vsock_vmci_transport vsock snd_seq_midi snd_seq_midi_event vmw_balloon coretemp crct10dif_pclmul crc32_pclmul ppdev ghash_clmulni_intel intel_rapl_perf joydev snd_ens1371 gameport snd_ac97_codec ac97_bus snd_seq snd_pcm snd_rawmidi snd_timer snd_seq_device snd soundcore parport_pc parport nfit acpi_cpufreq tpm_tis tpm_tis_core tpm vmw_vmci i2c_piix4 shpchp auth_rpcgss nfs_acl lockd(E) grace sunrpc(E) xfs libcrc32c vmwgfx drm_kms_helper ttm crc32c_intel drm e1000 mptspi scsi_transport_spi serio_raw mptscsih mptbase ata_generic pata_acpi [last unloaded: nfsd]
[  137.992336] CPU: 0 PID: 4514 Comm: rpc.nfsd Tainted: G            E   4.11.0-rc8+ #536
[  137.992777] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  137.993757] task: ffff955984101d00 task.stack: ffff9873c2604000
[  137.994231] RIP: 0010:svc_pool_for_cpu+0x2b/0x80 [sunrpc]
[  137.994768] RSP: 0018:ffff9873c2607c18 EFLAGS: 00010246
[  137.995227] RAX: 0000000000000000 RBX: ffff95598376f000 RCX: 0000000000000002
[  137.995673] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9559944aec00
[  137.996156] RBP: ffff9873c2607c18 R08: ffff9559944aec28 R09: 0000000000000000
[  137.996609] R10: 0000000001080002 R11: 0000000000000000 R12: ffff95598376f010
[  137.997063] R13: ffff95598376f018 R14: ffff9559944aec28 R15: ffff9559944aec00
[  137.997584] FS:  00007f755529eb40(0000) GS:ffff9559bb600000(0000) knlGS:0000000000000000
[  137.998048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  137.998548] CR2: 000055f3aecd9660 CR3: 0000000084290000 CR4: 00000000001406f0
[  137.999052] Call Trace:
[  137.999517]  svc_xprt_do_enqueue+0xef/0x260 [sunrpc]
[  138.000028]  svc_xprt_received+0x47/0x90 [sunrpc]
[  138.000487]  svc_add_new_perm_xprt+0x76/0x90 [sunrpc]
[  138.000981]  svc_addsock+0x14b/0x200 [sunrpc]
[  138.001424]  ? recalc_sigpending+0x1b/0x50
[  138.001860]  ? __getnstimeofday64+0x41/0xd0
[  138.002346]  ? do_gettimeofday+0x29/0x90
[  138.002779]  write_ports+0x255/0x2c0 [nfsd]
[  138.003202]  ? _copy_from_user+0x4e/0x80
[  138.003676]  ? write_recoverydir+0x100/0x100 [nfsd]
[  138.004098]  nfsctl_transaction_write+0x48/0x80 [nfsd]
[  138.004544]  __vfs_write+0x37/0x160
[  138.004982]  ? selinux_file_permission+0xd7/0x110
[  138.005401]  ? security_file_permission+0x3b/0xc0
[  138.005865]  vfs_write+0xb5/0x1a0
[  138.006267]  SyS_write+0x55/0xc0
[  138.006654]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[  138.007071] RIP: 0033:0x7f7554b9dc30
[  138.007437] RSP: 002b:00007ffc9f92c788 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  138.007807] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f7554b9dc30
[  138.008168] RDX: 0000000000000002 RSI: 00005640cd536640 RDI: 0000000000000003
[  138.008573] RBP: 00007ffc9f92c780 R08: 0000000000000001 R09: 0000000000000002
[  138.008918] R10: 0000000000000064 R11: 0000000000000246 R12: 0000000000000004
[  138.009254] R13: 00005640cdbf77a0 R14: 00005640cdbf7720 R15: 00007ffc9f92c238
[  138.009610] Code: 0f 1f 44 00 00 48 8b 87 98 00 00 00 55 48 89 e5 48 83 78 08 00 74 10 8b 05 07 42 02 00 83 f8 01 74 40 83 f8 02 74 19 31 c0 31 d2 <f7> b7 88 00 00 00 5d 89 d0 48 c1 e0 07 48 03 87 90 00 00 00 c3
[  138.010664] RIP: svc_pool_for_cpu+0x2b/0x80 [sunrpc] RSP: ffff9873c2607c18
[  138.011061] ---[ end trace b3468224cafa7d11 ]---

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-27 17:59:00 -04:00
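The divide error in df807fffaa follows from an unbalanced reference drop. The sketch below is a simplified, hypothetical model of a reference-counted pool map, not the sunrpc implementation. It shows how a put without a matching get can leave a later user dividing by a zero pool count.

```python
class PoolMap:
    """Simplified, reference-counted pool map (a model only)."""

    def __init__(self):
        self.count = 0
        self.npools = 0

    def get(self, npools):
        if self.count == 0:
            self.npools = npools
        self.count += 1

    def put(self):
        self.count -= 1
        if self.count == 0:
            self.npools = 0   # last reference gone: map is torn down

    def pool_for_cpu(self, cpu):
        return cpu % self.npools


pool_map = PoolMap()
pool_map.get(npools=2)               # pooled service takes its reference
before = pool_map.pool_for_cpu(3)
pool_map.put()                       # put from a service that never did get()
try:
    pool_map.pool_for_cpu(3)
    outcome = "ok"
except ZeroDivisionError:
    outcome = "divide error"
```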
Geliang Tang
3509d048c8 pstore: Remove unused vmalloc.h in pmsg
Since the vmalloc code has been removed from write_pmsg() in the commit
"5bf6d1b pstore/pmsg: drop bounce buffer", remove the unused header
vmalloc.h.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-04-27 14:48:59 -07:00
Chris Mason
bce19f9d23 Merge branch 'for-chris-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.12 2017-04-27 14:13:09 -07:00
Linus Torvalds
8b5d11e4b0 Thanks to Ari Kauppi and Tuomas Haanpää at Synopsis for spotting bugs in
our NFSv2/v3 xdr code that could crash the server or leak memory.

Merge tag 'nfsd-4.11-3' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "Thanks to Ari Kauppi and Tuomas Haanpää at Synopsis for spotting bugs
  in our NFSv2/v3 xdr code that could crash the server or leak memory"

* tag 'nfsd-4.11-3' of git://linux-nfs.org/~bfields/linux:
  nfsd: stricter decoding of write-like NFSv2/v3 ops
  nfsd4: minor NFSv2/v3 write decoding cleanup
  nfsd: check for oversized NFSv2/v3 arguments
2017-04-27 13:39:19 -07:00
Linus Torvalds
19ac447420 A fix for a kernel stack overflow bug in ceph setattr code, marked for
stable.

Merge tag 'ceph-for-4.11-rc9' of git://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "A fix for a kernel stack overflow bug in ceph setattr code, marked for
  stable"

* tag 'ceph-for-4.11-rc9' of git://github.com/ceph/ceph-client:
  ceph: fix recursion between ceph_set_acl() and __ceph_setattr()
2017-04-27 11:38:05 -07:00
Linus Torvalds
f56fc7bdaa Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:

 - fix orangefs handling of faults on write() - I'd missed that one back
   when orangefs was going through review.

 - readdir counterpart of "9p: cope with bogus responses from server in
   p9_client_{read,write}" - server might be lying or broken, and we'd
   better not overrun the kmalloc'ed buffer we are copying the results
   into.

 - NFS O_DIRECT read/write can leave iov_iter advanced by too much;
   that's what had been causing iov_iter_pipe() warnings davej had been
   seeing.

 - statx_timestamp.tv_nsec type fix (s32 -> u32). That one really should
   go in before 4.11.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  uapi: change the type of struct statx_timestamp.tv_nsec to unsigned
  fix nfs O_DIRECT advancing iov_iter too much
  p9_client_readdir() fix
  orangefs_bufmap_copy_from_iovec(): fix EFAULT handling
2017-04-27 11:09:37 -07:00
Lukas Czerner
3c3781951c xfs: Allow user to kill fstrim process
fstrim can take a really long time on a big, slow device or on a file
system with a lot of allocation groups. Currently there is no way for the
user to cancel the operation. This patch makes it possible for the user
to kill the fstrim process by adding a check for fatal_signal_pending() in
xfs_trim_extents().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-27 10:45:34 -07:00
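The idea in 3c3781951c is a long-running loop that polls for a fatal signal between units of work. A minimal sketch under that assumption follows. `trim_groups` and its callback parameters are hypothetical, and the granularity of the real check inside xfs_trim_extents() is not modelled.

```python
def trim_groups(group_count, trim_one, fatal_signal_pending):
    """Trim each group in turn, stopping early if a fatal signal is pending.

    Returns (groups_trimmed, interrupted).
    """
    trimmed = 0
    for group in range(group_count):
        if fatal_signal_pending():
            return trimmed, True
        trim_one(group)
        trimmed += 1
    return trimmed, False


calls = []


def pending():
    # Pretend the user kills the process once three groups are done.
    return len(calls) >= 3


result = trim_groups(10, calls.append, pending)
```

Polling before each unit of work bounds how long the user waits after sending the signal to the duration of a single unit.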
Michael Kerrisk (man-pages)
59372bbf3a statx: correct error handling of NULL pathname
The change in commit 1e2f82d1e9 ("statx: Kill fd-with-NULL-path
support in favour of AT_EMPTY_PATH") to error on a NULL pathname to
statx() is inconsistent.

It results in the error EINVAL for a NULL pathname.  Other system calls
with similar APIs (fchownat(), fstatat(), linkat()), return EFAULT.

The solution is simply to remove the EINVAL check.  As I already pointed
out in [1], user_path_at*() and filename_lookup() will handle the NULL
pathname as per the other APIs, to correctly produce the error EFAULT.

[1] https://lkml.org/lkml/2017/4/26/561

Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-27 10:45:09 -07:00
Christoph Hellwig
629e014bb8 fs: completely ignore unknown open flags
Currently we just stash anything we got into file->f_flags, and then
report it in fcntl(F_GETFD).  This patch just clears out all unknown
flags so that we don't pass them to the fs or report them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-27 05:13:04 -04:00
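Clearing unknown flags, as 629e014bb8 describes, amounts to masking the caller's value with the set of known flags. The sketch below assumes a user-space mask built from the flag constants that Python's `os` module exposes. `VALID_FLAGS`, `sanitize_open_flags` and `first_unknown_bit` are hypothetical names, and the mask is not the kernel's definition.

```python
import os

_KNOWN_NAMES = (
    "O_RDONLY", "O_WRONLY", "O_RDWR", "O_CREAT", "O_EXCL", "O_NOCTTY",
    "O_TRUNC", "O_APPEND", "O_NONBLOCK", "O_DSYNC", "O_SYNC",
    "O_DIRECTORY", "O_NOFOLLOW", "O_CLOEXEC",
)

# Build the mask from whichever of these constants the platform defines.
VALID_FLAGS = 0
for _name in _KNOWN_NAMES:
    VALID_FLAGS |= getattr(os, _name, 0)


def sanitize_open_flags(flags):
    """Drop every bit that is not part of the known-flags mask."""
    return flags & VALID_FLAGS


def first_unknown_bit():
    """Return the lowest single bit that the mask does not cover."""
    bit = 1
    while bit & VALID_FLAGS:
        bit <<= 1
    return bit


unknown_bit = first_unknown_bit()
wanted = os.O_WRONLY | os.O_CREAT
cleaned = sanitize_open_flags(wanted | unknown_bit)
```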
Christoph Hellwig
80f18379a7 fs: add a VALID_OPEN_FLAGS
Add a central define for all valid open flags, and use it in the uniqueness
check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-27 05:13:04 -04:00
Eric Biggers
020c2833db fs: remove _submit_bh()
_submit_bh() allowed submitting a buffer_head for I/O using custom
bio_flags.  It used to be used by jbd to set BIO_SNAP_STABLE, introduced
by commit 7136851117 ("mm: make snapshotting pages for stable writes a
per-bio operation").  However, the code and flag have since been removed
and no _submit_bh() users remain.

These days, bio_flags are mostly used internally by the block layer to
track the state of bio's.  As such, it doesn't really make sense for
filesystems to use them instead of op_flags when wanting special
behavior for block requests.

Therefore, remove _submit_bh() and trim the bio_flags argument from
submit_bh_wbc().

Cc: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:06 -04:00
Eric Biggers
cda37124f4 fs: constify tree_descr arrays passed to simple_fill_super()
simple_fill_super() is passed an array of tree_descr structures which
describe the files to create in the filesystem's root directory.  Since
these arrays are never modified intentionally, they should be 'const' so
that they are placed in .rodata and benefit from memory protection.
This patch updates the function signature and all users, and also
constifies tree_descr.name.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:06 -04:00
Fabian Frederick
a80f2d2224 fs/affs: bugfix: Write files greater than page size on OFS
The previous AFFS patch fixed OFS write operations but unveiled
another bug: files larger than 4KB are created with the wrong
size, resulting in errors like the following:

dd if=/dev/zero of=file bs=4097 count=1
cp file /mnt/affs/
cp: error writing '/mnt/affs/file': Bad address

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:06 -04:00
Fabian Frederick
077e073e8f fs/affs: bugfix: enable writes on OFS disks
We unconditionally called affs_bread_ino() with create 0, resulting in
"error (device ...): get_block(): strange block request 0"
when trying to write to an AFFS OFS-format disk.

This patch adds a create parameter to that function:
0 for affs_readpage_ofs()
1 for affs_write_begin_ofs()

Bug was found here:
https://bugzilla.kernel.org/show_bug.cgi?id=114961

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:06 -04:00
Fabian Frederick
a9d6cfb70f fs/affs: remove node generation check
Node generation has to be stored on disk.
AFAICS we won't be able to manage it on AFFS.
This patch removes the relevant check in affs_nfs_get_inode().

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:05 -04:00
Fabian Frederick
d2d58e0e0d fs/affs: import amigaffs.h
Having that file in the global include/linux is not needed.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:05 -04:00
Fabian Frederick
f1bf90724d fs/affs: bugfix: make symbolic links work again
AFFS symbolic links have been broken since kernel 2.6.29.

The problem was bisected to the following:

commit ebd09abbd9 ("vfs: ensure page symlinks are NUL-terminated")
commit 035146851c ("vfs: introduce helper function to safely
NUL-terminate symlinks")

AFFS wasn't setting the inode size when reading a symbolic link from disk
or creating a new one. The result was a zero allocation in the pagecache.

ln -s file symlink

ls -lrt

file
symlink ->

This patch adds inode isize information on inode get and symbolic link
addition.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26 23:54:05 -04:00
David S. Miller
b1513c3531 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 22:39:08 -04:00
David Howells
1e2f82d1e9 statx: Kill fd-with-NULL-path support in favour of AT_EMPTY_PATH
With the new statx() syscall, the following both allow the attributes of
the file attached to a file descriptor to be retrieved:

	statx(dfd, NULL, 0, ...);

and:

	statx(dfd, "", AT_EMPTY_PATH, ...);

Change the code to reject the first option, though this means copying
the path and engaging pathwalk for the fstat() equivalent.  dfd can be a
non-directory provided path is "".

[ The timing of this isn't wonderful, but applying this now before we
  have statx() in any released kernel, before anybody starts using the
  NULL special case.    - Linus ]

Fixes: a528d35e8b ("statx: Add a system call to make enhanced file info available")
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Sandeen <sandeen@sandeen.net>
cc: fstests@vger.kernel.org
cc: linux-api@vger.kernel.org
cc: linux-man@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-26 15:05:47 -07:00
Dan Carpenter
907bfcd8d8 orangefs: handle zero size write in debugfs
If we write zero bytes to this debugfs file, then it will cause an
underflow when we do copy_from_user(buf, ubuf, count - 1).  Debugfs can
normally only be written to by root so the impact of this is low.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:01 -04:00
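The underflow described in 907bfcd8d8 is ordinary unsigned wraparound. A minimal sketch of the arithmetic follows, assuming a 64-bit size_t. `size_t_sub` and `copy_length` are hypothetical helpers, not kernel functions.

```python
SIZE_MAX = (1 << 64) - 1   # assumes a 64-bit size_t


def size_t_sub(a, b):
    """Subtract as an unsigned 64-bit type would, wrapping on underflow."""
    return (a - b) & SIZE_MAX


def copy_length(count):
    """Length handed to the copy for a write of `count` bytes: count - 1."""
    return size_t_sub(count, 1)


normal = copy_length(5)     # one byte short of the write size
wrapped = copy_length(0)    # zero-byte write wraps to the maximum value
```

A zero-byte write therefore asks for the largest possible copy, which is why the zero case has to be handled before the subtraction.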
Martin Brandenburg
b5a9d61eeb orangefs: do not wait for timeout if umounting
When the computer is turned off, all the processes are killed and then
all the filesystems are umounted.  OrangeFS should not wait for the
userspace daemon to come back in that case.

This only works for plain umount(2).  To actually take advantage of this
interactively, `umount -f' is needed; otherwise umount will issue a
statfs first, which will wait for the userspace daemon to come back.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:01 -04:00
Martin Brandenburg
b7a57ccab8 orangefs: return from orangefs_devreq_read quickly if possible
It is not necessary to take the lock and search through the request list
if the list is empty.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
9d286b0d82 orangefs: ensure the userspace component is unmounted if mount fails
If the mount is aborted after userspace has been asked to mount,
userspace must be told to unmount.

Ordinarily orangefs_kill_sb does the unmount.  However it cannot be
called if the superblock has not been set up.  This is a very narrow
window.

The NULL fs_id is not unmounted.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
53950ef541 orangefs: do not check possibly stale size on truncate
Let the server figure this out because our size might be out of date or
not present.

The bug was that

	xfs_io -f -t -c "pread -v 0 100" /mnt/foo
	echo "Test" > /mnt/foo
	xfs_io -f -t -c "pread -v 0 100" /mnt/foo

fails because the second truncate did not happen if nothing had
requested the size after the write in echo.  Thus i_size was zero (not
present) and orangefs_setattr thought i_size was zero and there was
nothing to do.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
68a24a6cc4 orangefs: implement statx
Fortunately OrangeFS has had a getattr request mask for a long time.

The server basically has two difficulty levels for attributes.  Fetching
any attribute except size requires communicating with the metadata
server for that handle.  Since all the attributes are right there, it
makes sense to return them all.  Fetching the size requires
communicating with every I/O server (that the file is distributed
across).  Therefore if asked for anything except size, get everything
except size, and if asked for size, get everything.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
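The two-level rule in 68a24a6cc4 (anything except size fetches everything except size; size fetches everything) can be written as a small mask function. The attribute bits below are hypothetical placeholders, not the OrangeFS or statx constants.

```python
# Hypothetical attribute bits for illustration only.
ATTR_MODE = 0x1
ATTR_OWNER = 0x2
ATTR_TIMES = 0x4
ATTR_SIZE = 0x8

ALL_EXCEPT_SIZE = ATTR_MODE | ATTR_OWNER | ATTR_TIMES
EVERYTHING = ALL_EXCEPT_SIZE | ATTR_SIZE


def server_request_mask(requested):
    """Map what the caller asked for onto what is fetched from the server."""
    if requested & ATTR_SIZE:
        return EVERYTHING        # size needs every I/O server anyway
    return ALL_EXCEPT_SIZE       # metadata server returns the rest at once


only_mode = server_request_mask(ATTR_MODE)
with_size = server_request_mask(ATTR_MODE | ATTR_SIZE)
```

Only two distinct requests ever reach the server, which matches the two cost levels the commit message describes.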
Martin Brandenburg
7b796ae370 orangefs: remove ORANGEFS_READDIR macros
They are clones of the ORANGEFS_ITERATE macros in use elsewhere.  Delete
ORANGEFS_ITERATE_NEXT which is a hack previously used by readdir.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
480e3e532e orangefs: support very large directories
This works by maintaining a linked list of pages which the directory
has been read into rather than one giant fixed-size buffer.

This replaces code which limits the total directory size to the total
amount that could be returned in one server request.  Since filenames
are usually considerably shorter than the maximum, the old code could
usually handle several server requests before running out of space.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
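The data structure described in 480e3e532e, a linked list of parts in place of one fixed-size buffer, can be sketched as follows. This is an assumption-laden model: `Part` and `ChunkedDirectory` are hypothetical names, and capacity is counted in entries here, whereas the commit speaks of pages.

```python
class Part:
    """One fixed-capacity piece of the directory, linked to the next."""

    def __init__(self):
        self.entries = []
        self.next = None


class ChunkedDirectory:
    def __init__(self, part_capacity):
        self.part_capacity = part_capacity
        self.head = None
        self.tail = None
        self.part_count = 0

    def append(self, name):
        # Start a new part when there is none yet or the last one is full.
        if self.tail is None or len(self.tail.entries) >= self.part_capacity:
            part = Part()
            if self.tail is None:
                self.head = part
            else:
                self.tail.next = part
            self.tail = part
            self.part_count += 1
        self.tail.entries.append(name)

    def __iter__(self):
        part = self.head
        while part is not None:
            yield from part.entries
            part = part.next


names = [f"file{i}" for i in range(10)]
directory = ChunkedDirectory(part_capacity=4)
for name in names:
    directory.append(name)
```

Because parts are added on demand, the directory size is limited by available memory and no longer by the size of a single buffer.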
Martin Brandenburg
72f66b8329 orangefs: support llseek on directories
This and the previous commit fix xfstests generic/257.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
382f4581e6 orangefs: rewrite readdir to fix several bugs
In the past, readdir assumed that the user buffer would be large enough
that all entries from the server would fit.  If this was not true,
entries would be skipped.

Since it works now, request 512 entries rather than 96 per server
operation.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
17930b252c orangefs: do not set getattr_time on orangefs_lookup
Since orangefs_lookup calls orangefs_iget which calls
orangefs_inode_getattr, getattr_time will get set.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
e675c5ec51 orangefs: clean up oversize xattr validation
Also don't check flags as this has been validated by the VFS already.

Fix an off-by-one error in the max size checking.

Stop logging just because userspace wants to write attributes which do
not fit.

This and the previous commit fix xfstests generic/020.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
a956af337b orangefs: fix bounds check for listxattr
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Martin Brandenburg
418ce3eb66 orangefs: remove unused get_fsid_from_ino
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-04-26 14:33:00 -04:00
Trond Myklebust
c373fff7bd NFSv4: Don't special case "launder"
If the client receives a fatal server error from nfs_pageio_add_request(),
then we should always truncate the page on which the error occurred.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-26 13:03:04 -04:00
Trond Myklebust
54551d85ad NFS: Add a few more fatal I/O errors to nfs_error_is_fatal()
EACCES, EDQUOT, EFBIG and ESTALE are all fatal errors as far as NFS
I/O is concerned. They need to be reported back to the application.
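
As a rough illustration of this classification, here is a userspace
Python sketch (not the kernel's C code). Only the four errno values
named above come from this commit message; the helper name
is_fatal_io_error and the negative-errno calling convention are
assumptions made for the example, and any codes the kernel helper
already treated as fatal are left out.

```python
import errno

# Error codes this commit message names as fatal for NFS I/O.
# The kernel helper may treat further codes as fatal; they are
# deliberately omitted because the message does not list them.
NEWLY_FATAL = frozenset({errno.EACCES, errno.EDQUOT, errno.EFBIG, errno.ESTALE})


def is_fatal_io_error(err: int) -> bool:
    """Return True if a kernel-style negative errno is in the fatal set."""
    return -err in NEWLY_FATAL
```

A caller in this model would stop retrying and report the error to the
application whenever is_fatal_io_error() returns True.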

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-26 13:03:04 -04:00
Al Viro
eea86b637a Merge branches 'uaccess.alpha', 'uaccess.arc', 'uaccess.arm', 'uaccess.arm64', 'uaccess.avr32', 'uaccess.bfin', 'uaccess.c6x', 'uaccess.cris', 'uaccess.frv', 'uaccess.h8300', 'uaccess.hexagon', 'uaccess.ia64', 'uaccess.m32r', 'uaccess.m68k', 'uaccess.metag', 'uaccess.microblaze', 'uaccess.mips', 'uaccess.mn10300', 'uaccess.nios2', 'uaccess.openrisc', 'uaccess.parisc', 'uaccess.powerpc', 'uaccess.s390', 'uaccess.score', 'uaccess.sh', 'uaccess.sparc', 'uaccess.tile', 'uaccess.um', 'uaccess.unicore32', 'uaccess.x86' and 'uaccess.xtensa' into work.uaccess
2017-04-26 12:06:59 -04:00
Filipe Manana
a7e3b975a0 Btrfs: fix reported number of inode blocks
Currently when there are buffered writes that were not yet flushed and
they fall within allocated ranges of the file (that is, not in holes or
beyond eof assuming there are no prealloc extents beyond eof), btrfs
simply reports an incorrect number of used blocks through the stat(2)
system call (or any of its variants), regardless of mount options or
inode flags (compress, compress-force, nodatacow). This is because the
number of blocks used that is reported is based on the current number
of bytes in the vfs inode plus the number of delalloc bytes in the btrfs
inode. The latter covers bytes that fall both within allocated regions
of the file and within holes.

Example scenarios where the number of reported blocks is wrong while the
buffered writes are not flushed:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt/sdc

  $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec)

  $ sync

  $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec)

  # The following should have reported 64K...
  $ du -h /mnt/sdc/foo1
  128K	/mnt/sdc/foo1

  $ sync

  # After flushing the buffered write, it now reports the correct value.
  $ du -h /mnt/sdc/foo1
  64K	/mnt/sdc/foo1

  $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec)

  $ sync

  $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2
  wrote 65536/65536 bytes at offset 65536
  64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)

  # The following should have reported 128K...
  $ du -h /mnt/sdc/foo2
  192K	/mnt/sdc/foo2

  $ sync

  # After flushing the buffered write, it now reports the correct value.
  $ du -h /mnt/sdc/foo2
  128K	/mnt/sdc/foo2

So the number of used file blocks is simply incorrect, unlike in other
filesystems such as ext4 and xfs for example, but only while the buffered
writes are not flushed.

Fix this by tracking the number of delalloc bytes that fall within holes
and beyond eof of a file, and use this new counter instead when reporting
the number of used blocks for an inode.
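
The difference between the two counters can be sketched with a toy
userspace model in Python. This is not btrfs code: the class ToyInode,
its methods and the 4096-byte block size are assumptions made for the
example, and offsets are expected to be block aligned.

```python
BLOCK = 4096  # assumed block size for the model


class ToyInode:
    """Toy model of the accounting described above (not btrfs code)."""

    def __init__(self):
        self.allocated = set()  # block indexes backed by extents or prealloc
        self.delalloc = set()   # dirty blocks not yet written back
        self.inode_bytes = 0    # bytes accounted in the vfs inode

    def falloc(self, offset, length):
        self.allocated.update(range(offset // BLOCK, (offset + length) // BLOCK))
        self.inode_bytes = len(self.allocated) * BLOCK

    def buffered_write(self, offset, length):
        self.delalloc.update(range(offset // BLOCK, (offset + length) // BLOCK))

    def sync(self):
        self.allocated |= self.delalloc
        self.delalloc.clear()
        self.inode_bytes = len(self.allocated) * BLOCK

    def used_old(self):
        # Old behaviour: every delalloc byte is added, even over allocated ranges.
        return self.inode_bytes + len(self.delalloc) * BLOCK

    def used_new(self):
        # New behaviour: only delalloc bytes in holes or beyond eof are added.
        new_delalloc = self.delalloc - self.allocated
        return self.inode_bytes + len(new_delalloc) * BLOCK
```

Replaying the foo1 scenario (write 64K, sync, rewrite the same 64K) in
this model gives 128K from used_old() and 64K from used_new(), which
lines up with the du output shown above.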

A separate problem is that the delalloc bytes counter
is reset when writeback starts (by clearing the EXTENT_DELALLOC flag from
the respective range in the inode's iotree) and the vfs inode's bytes
counter is only incremented when writeback finishes (through
insert_reserved_file_extent()). Therefore while writeback is ongoing we
simply report a wrong number of blocks used by an inode if the write
operation covers a range previously unallocated. While this change does
not fix this problem, it does minimize it a lot by shortening that time
window, as the new delalloc bytes counter (new_delalloc_bytes) is only
decremented when writeback finishes right before updating the vfs inode's
bytes counter. Fully fixing this second problem is not trivial and will
be addressed later by a different patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-04-26 16:27:26 +01:00
Filipe Manana
e1cbfd7bf6 Btrfs: send, fix file hole not being preserved due to inline extent
Normally we don't have inline extents followed by regular extents, but
there's currently at least one harmless case where this happens. For
example, when the page size is 4Kb and compression is enabled:

  $ mkfs.btrfs -f /dev/sdb
  $ mount -o compress /dev/sdb /mnt
  $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar
  $ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar

In this case we get a compressed inline extent, representing 4Kb of
data, followed by a hole extent and then a regular data extent. The
inline extent was not expanded/converted to a regular extent exactly
because it represents 4Kb of data. This does not cause any apparent
problem (such as the issue solved by commit e1699d2d7b
("btrfs: add missing memset while reading compressed inline extents"))
except trigger an unexpected case in the incremental send code path
that makes us issue an operation to write a hole when it's not needed,
resulting in more writes at the receiver and wasting space at the
receiver.

So teach the incremental send code to deal with this particular case.

The issue can be currently triggered by running fstests btrfs/137 with
compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-04-26 16:27:25 +01:00
Filipe Manana
be2d253cc9 Btrfs: fix extent map leak during fallocate error path
If the call to btrfs_qgroup_reserve_data() failed, we were leaking an
extent map structure. The failure can happen either due to an -ENOMEM
condition or, when quotas are enabled, due to -EDQUOT for example.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
2017-04-26 16:27:24 +01:00
Filipe Manana
1c81ba237b Btrfs: fix incorrect space accounting after failure to insert inline extent
When using compression, if we fail to insert an inline extent we
incorrectly end up attempting to free the reserved data space twice,
once through extent_clear_unlock_delalloc(), because we pass it the
flag EXTENT_DO_ACCOUNTING, and once through a direct call to
btrfs_free_reserved_data_space_noquota(). This results in a trace
like the following:

[  834.576240] ------------[ cut here ]------------
[  834.576825] WARNING: CPU: 2 PID: 486 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  834.579501] Modules linked in: btrfs crc32c_generic xor raid6_pq ppdev i2c_piix4 acpi_cpufreq psmouse tpm_tis parport_pc pcspkr serio_raw tpm_tis_core sg parport evdev i2c_core tpm button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
[  834.592116] CPU: 2 PID: 486 Comm: kworker/u32:4 Not tainted 4.10.0-rc8-btrfs-next-37+ #2
[  834.593316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  834.595273] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
[  834.596103] Call Trace:
[  834.596103]  dump_stack+0x67/0x90
[  834.596103]  __warn+0xc2/0xdd
[  834.596103]  warn_slowpath_null+0x1d/0x1f
[  834.596103]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  834.596103]  compress_file_range.constprop.42+0x2fa/0x3fc [btrfs]
[  834.596103]  ? submit_compressed_extents+0x3a7/0x3a7 [btrfs]
[  834.596103]  async_cow_start+0x32/0x4d [btrfs]
[  834.596103]  btrfs_scrubparity_helper+0x187/0x3e7 [btrfs]
[  834.596103]  btrfs_delalloc_helper+0xe/0x10 [btrfs]
[  834.596103]  process_one_work+0x273/0x4e4
[  834.596103]  worker_thread+0x1eb/0x2ca
[  834.596103]  ? rescuer_thread+0x2b6/0x2b6
[  834.596103]  kthread+0x100/0x108
[  834.596103]  ? __list_del_entry+0x22/0x22
[  834.596103]  ret_from_fork+0x2e/0x40
[  834.611656] ---[ end trace 719902fe6bdef08f ]---

So fix this by not calling btrfs_free_reserved_data_space_noquota()
directly if an error happened.
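
The double free can be pictured with a small userspace Python model.
It is only a sketch: DataSpaceInfo, fail_inline_insert and the boolean
return value are hypothetical stand-ins for the kernel's space
accounting, not its actual interfaces.

```python
class DataSpaceInfo:
    """Toy stand-in for a reserved-space counter."""

    def __init__(self):
        self.reserved = 0

    def reserve(self, nbytes):
        self.reserved += nbytes

    def free(self, nbytes):
        """Release nbytes; return False if more is freed than is reserved."""
        if nbytes > self.reserved:
            # Rough analogue of the WARNING in the trace above.
            self.reserved = 0
            return False
        self.reserved -= nbytes
        return True


def fail_inline_insert(space, nbytes, free_directly_too):
    """Model the error path and return how many frees underflowed."""
    space.reserve(nbytes)
    bad = 0
    # First free: done by the cleanup that is passed the accounting flag.
    bad += not space.free(nbytes)
    if free_directly_too:
        # Second free: the redundant direct call this commit stops making.
        bad += not space.free(nbytes)
    return bad
```

With free_directly_too=True the model reports one underflowing free;
with False it reports none and the counter ends at zero.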

Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-04-26 16:27:23 +01:00
Filipe Manana
a315e68f6e Btrfs: fix invalid attempt to free reserved space on failure to cow range
When attempting to COW a file range (we are starting writeback and doing
COW), if we manage to reserve an extent for the range we will write into
but fail after reserving it and before creating the respective ordered
extent, we end up in an error path where we attempt to decrement the
data space's bytes_may_use counter after we already did it while
reserving the extent, leading to a warning/trace like the following:

[  847.621524] ------------[ cut here ]------------
[  847.625441] WARNING: CPU: 5 PID: 4905 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  847.633704] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq i2c_piix4 ppdev psmouse tpm_tis serio_raw pcspkr parport_pc tpm_tis_core i2c_core sg
[  847.644616] CPU: 5 PID: 4905 Comm: xfs_io Not tainted 4.10.0-rc8-btrfs-next-37+ #2
[  847.648601] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[  847.648601] Call Trace:
[  847.648601]  dump_stack+0x67/0x90
[  847.648601]  __warn+0xc2/0xdd
[  847.648601]  warn_slowpath_null+0x1d/0x1f
[  847.648601]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
[  847.648601]  btrfs_clear_bit_hook+0x140/0x258 [btrfs]
[  847.648601]  clear_state_bit+0x87/0x128 [btrfs]
[  847.648601]  __clear_extent_bit+0x222/0x2b7 [btrfs]
[  847.648601]  clear_extent_bit+0x17/0x19 [btrfs]
[  847.648601]  extent_clear_unlock_delalloc+0x3b/0x6b [btrfs]
[  847.648601]  cow_file_range.isra.39+0x387/0x39a [btrfs]
[  847.648601]  run_delalloc_nocow+0x4d7/0x70e [btrfs]
[  847.648601]  ? arch_local_irq_save+0x9/0xc
[  847.648601]  run_delalloc_range+0xa7/0x2b5 [btrfs]
[  847.648601]  writepage_delalloc.isra.31+0xb9/0x15c [btrfs]
[  847.648601]  __extent_writepage+0x249/0x2e8 [btrfs]
[  847.648601]  extent_write_cache_pages.constprop.33+0x28b/0x36c [btrfs]
[  847.648601]  ? arch_local_irq_save+0x9/0xc
[  847.648601]  ? mark_lock+0x24/0x201
[  847.648601]  extent_writepages+0x4b/0x5c [btrfs]
[  847.648601]  ? btrfs_writepage_start_hook+0xed/0xed [btrfs]
[  847.648601]  btrfs_writepages+0x28/0x2a [btrfs]
[  847.648601]  do_writepages+0x23/0x2c
[  847.648601]  __filemap_fdatawrite_range+0x5a/0x61
[  847.648601]  filemap_fdatawrite_range+0x13/0x15
[  847.648601]  btrfs_fdatawrite_range+0x20/0x46 [btrfs]
[  847.648601]  start_ordered_ops+0x19/0x23 [btrfs]
[  847.648601]  btrfs_sync_file+0x136/0x42c [btrfs]
[  847.648601]  vfs_fsync_range+0x8c/0x9e
[  847.648601]  vfs_fsync+0x1c/0x1e
[  847.648601]  do_fsync+0x31/0x4a
[  847.648601]  SyS_fsync+0x10/0x14
[  847.648601]  entry_SYSCALL_64_fastpath+0x18/0xad
[  847.648601] RIP: 0033:0x7f5b05200800
[  847.648601] RSP: 002b:00007ffe204f71c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[  847.648601] RAX: ffffffffffffffda RBX: ffffffff8109637b RCX: 00007f5b05200800
[  847.648601] RDX: 00000000008bd0a0 RSI: 00000000008bd2e0 RDI: 0000000000000003
[  847.648601] RBP: ffffc90001d67f98 R08: 000000000000ffff R09: 000000000000001f
[  847.648601] R10: 00000000000001f6 R11: 0000000000000246 R12: 0000000000000046
[  847.648601] R13: ffffc90001d67f78 R14: 00007f5b054be740 R15: 00007f5b054be740
[  847.648601]  ? trace_hardirqs_off_caller+0x3f/0xaa
[  847.685787] ---[ end trace 2a4a3e15382508e8 ]---

So fix this by not attempting to decrement the data space info's
bytes_may_use counter if we already reserved the extent and an error
happened before creating the ordered extent. We are already correctly
freeing the reserved extent if an error happens, so there's no additional
measure needed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-04-26 16:27:22 +01:00
Qu Wenruo
524272607e btrfs: Handle delalloc error correctly to avoid ordered extent hang
[BUG]
If run_delalloc_range() returns an error and some ordered extents have
already been created, btrfs will hang with the following backtrace:

Call Trace:
 __schedule+0x2d4/0xae0
 schedule+0x3d/0x90
 btrfs_start_ordered_extent+0x160/0x200 [btrfs]
 ? wake_atomic_t_function+0x60/0x60
 btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
 btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
 btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
 process_one_work+0x2af/0x720
 ? process_one_work+0x22b/0x720
 worker_thread+0x4b/0x4f0
 kthread+0x10f/0x150
 ? process_one_work+0x720/0x720
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x2e/0x40

[CAUSE]

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|                       |<---------- cleanup range --------->|
 ||
 \_=> First page handled by end_extent_writepage() in __extent_writepage()

The problem is caused by the error handler of run_delalloc_range(), which
doesn't handle any created ordered extents, leaving them waiting on
btrfs_finish_ordered_io() to finish.

However, after run_delalloc_range() returns an error, __extent_writepage()
won't submit a bio, so btrfs_writepage_end_io_hook() won't be triggered
except for the first page, and btrfs_finish_ordered_io() won't be
triggered for created ordered extents either.

So OE 2~n will hang forever, and if OE 1 is larger than one page, it
will also hang.

[FIX]
Introduce the btrfs_cleanup_ordered_extents() function to clean up created
ordered extents and finish them manually.

The function is based on the existing
btrfs_endio_direct_write_update_ordered() function, modified to
act just like btrfs_writepage_end_io_hook() but on a specified range
rather than one page.

After fix, delalloc error will be handled like:

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<>|<--------  ----------->|<------ old error handler --------->|
 ||          ||
 ||          \_=> Cleaned up by cleanup_ordered_extents()
 \_=> First page handled by end_extent_writepage() in __extent_writepage()
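
The two charts can be restated as a toy Python model. It is a sketch
under assumptions: ordered extents are plain (start, length) tuples,
the page size is taken to be 4096, and unfinished_extents is a made-up
helper, not a kernel function.

```python
PAGE = 4096  # assumed page size


def unfinished_extents(ordered_extents, cleanup_created):
    """Return the ordered extents that would never complete.

    ordered_extents: (start, length) tuples created before the error,
    the first one starting at the locked page.
    cleanup_created: whether the part past the first page is finished
    manually, as the new helper is described to do.
    """
    if not ordered_extents:
        return []
    first_page = ordered_extents[0][0]
    stuck = []
    for start, length in ordered_extents:
        pending = length
        # The first page is always ended by the writepage error path.
        lo = max(start, first_page)
        hi = min(start + length, first_page + PAGE)
        if hi > lo:
            pending -= hi - lo
        if cleanup_created:
            # Everything after the first page is finished by hand.
            lo = max(start, first_page + PAGE)
            hi = start + length
            if hi > lo:
                pending -= hi - lo
        if pending:
            stuck.append((start, length))
    return stuck
```

Without the cleanup, every extent after the first page stays pending in
this model, including OE 1 when it spans more than one page; with the
cleanup nothing is left waiting.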

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2017-04-26 16:27:21 +01:00
Qu Wenruo
4dbd80fb91 btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
[BUG]
When btrfs_reloc_clone_csum() reports an error, it can underflow metadata
and lead to a kernel assertion on outstanding extents in
run_delalloc_nocow() and cow_file_range().

 BTRFS info (device vdb5): relocating block group 12582912 flags data
 BTRFS info (device vdb5): found 1 extents
 assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858

Currently, due to another bug blocking ordered extents, the bug is only
reproducible under a certain block group layout and with error injection.

a) Create one data block group with one 4K extent in it.
   This avoids the bug that hangs btrfs due to an ordered extent which
   never finishes
b) Make btrfs_reloc_clone_csum() always fail
c) Relocate that block group

[CAUSE]
run_delalloc_nocow() and cow_file_range() handle errors from
btrfs_reloc_clone_csum() wrongly:

(The ascii chart shows a more generic case of this bug than the
one mentioned above)

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
                    |<----------- cleanup range --------------->|
|<-----------  ----------->|
             \/
 btrfs_finish_ordered_io() range

So the error handler, which calls extent_clear_unlock_delalloc() with the
EXTENT_DELALLOC and EXTENT_DO_ACCOUNTING bits, and btrfs_finish_ordered_io()
will both cover OE n and free its metadata, causing metadata underflow.

[FIX]
The fix is to ensure that after calling btrfs_add_ordered_extent(), we only
call the error handler after increasing the iteration offset, so that
the cleanup range won't cover any created ordered extent.

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<-----------  ----------->|<---------- cleanup range --------->|
             \/
 btrfs_finish_ordered_io() range
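
The ordering issue can be sketched in userspace Python. This is a toy
model only: run_range, overlaps, the fixed extent length and the
fail_at parameter are invented for illustration and do not mirror the
kernel functions' signatures.

```python
def run_range(start, end, extent_len, fail_at, advance_before_cleanup):
    """Walk [start, end) creating ordered extents; fail after the fail_at-th.

    Returns (created_extents, cleanup_range), with extents and the
    cleanup range given as half-open (begin, end) tuples.
    """
    created = []
    cur = start
    while cur < end:
        length = min(extent_len, end - cur)
        created.append((cur, cur + length))  # ordered extent now exists
        failed = len(created) == fail_at     # e.g. the csum clone fails here
        if failed and not advance_before_cleanup:
            return created, (cur, end)       # old: cleanup also covers OE n
        cur += length
        if failed:
            return created, (cur, end)       # fixed: cleanup starts after OE n
    return created, None


def overlaps(created, cleanup):
    """True if the cleanup range touches any created ordered extent."""
    return cleanup is not None and any(
        s < cleanup[1] and cleanup[0] < e for s, e in created
    )
```

In this model the old ordering yields a cleanup range that overlaps the
last created extent, so its metadata would be released twice; advancing
the offset first removes the overlap.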

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-04-26 16:27:21 +01:00
Amir Goldstein
4a99f3c83d ovl: do not set overlay.opaque on non-dir create
The optimization for opaque dir create was wrongly being applied
also to non-dir create.

Fixes: 97c684cc91 ("ovl: create directories inside merged parent opaque")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.10
2017-04-26 14:33:44 +02:00
Colin Ian King
e56efe9322 lockd: remove redundant check on block
A null check followed by a return is being performed already, so block
is always non-null at the second check on block, hence we can remove
this redundant null-check (Detected by PVS-Studio).  Also re-work
comment to clean up a checkpatch warning.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:56 -04:00
NeilBrown
99bbf6ecc6 NFS: don't try to cross a mountpount when there isn't one there.
Consider the sequence of commands:
 mkdir -p /import/nfs /import/bind /import/etc
 mount --bind / /import/bind
 mount --make-private /import/bind
 mount --bind /import/etc /import/bind/etc

 exportfs -o rw,no_root_squash,crossmnt,async,no_subtree_check localhost:/
 mount -o vers=4 localhost:/ /import/nfs
 ls -l /import/nfs/etc

You would not expect this to report a stale file handle.
Yet it does.

The manipulations under /import/bind cause the dentry for
/etc to get the DCACHE_MOUNTED flag set, even though nothing
is mounted on /etc.  This causes nfsd to call
nfsd_cross_mnt() even though there is no mountpoint.  So an
upcall to mountd for "/etc" is performed.

The 'crossmnt' flag on the export of / causes mountd to
report that /etc is exported as it is a descendant of /.  It
assumes the kernel wouldn't ask about something that wasn't
a mountpoint.  The filehandle returned identifies the
filesystem and the inode number of /etc.

When this filehandle is presented to rpc.mountd, via
"nfsd.fh", the inode cannot be found associated with any
name in /etc/exports, or with any mountpoint listed by
getmntent().  So rpc.mountd says the filehandle doesn't
exist. Hence ESTALE.

This is fixed by teaching nfsd not to trust DCACHE_MOUNTED
too much.  It is just a hint, not a guarantee.
Change nfsd_mountpoint() to return '1' for a certain mountpoint,
'2' for a possible mountpoint, and 0 otherwise.

Then change nfsd_cross_mnt() to check if follow_down()
actually found a mountpoint and, if not, to avoid performing
a lookup if the location is not known to certainly require
an export-point.
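
The three-way answer and the resulting decision can be sketched in
userspace Python. The constants mirror the return values named above,
but classify, should_lookup_export and their boolean inputs are
hypothetical; in particular, what makes a mountpoint "certain" is not
spelled out in this message and is reduced to a single flag here.

```python
NOT_MOUNTPOINT = 0
CERTAIN_MOUNTPOINT = 1
POSSIBLE_MOUNTPOINT = 2


def classify(known_mountpoint: bool, dcache_mounted_hint: bool) -> int:
    """Three-way answer described above; both inputs are hypothetical."""
    if known_mountpoint:
        return CERTAIN_MOUNTPOINT
    if dcache_mounted_hint:
        # The hint alone is not a guarantee, so only "possible".
        return POSSIBLE_MOUNTPOINT
    return NOT_MOUNTPOINT


def should_lookup_export(kind: int, follow_down_moved: bool) -> bool:
    """Decide whether to ask for an export when crossing.

    If follow_down() did not land on a different mount, only a certain
    mountpoint still warrants the export lookup.
    """
    if kind == NOT_MOUNTPOINT:
        return False
    if follow_down_moved:
        return True
    return kind == CERTAIN_MOUNTPOINT
```

For the /etc case above (hint set, nothing actually mounted) the model
returns POSSIBLE_MOUNTPOINT and then skips the lookup, so no upcall
would be made.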

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:54 -04:00
NeilBrown
2f10fdcb6a nfsd4: remove pointless strdup_if_nonnull
kstrdup() already checks for NULL.

(Brought to our attention by Jason Yan noticing (from sparse output)
that it should have been declared static.)

Signed-off-by: NeilBrown <neilb@suse.com>
Reported-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:54 -04:00
J. Bruce Fields
51f5677777 nfsd: check for oversized NFSv2/v3 arguments
A client can append random data to the end of an NFSv2 or NFSv3 RPC call
without our complaining; we'll just stop parsing at the end of the
expected data and ignore the rest.

Encoded arguments and replies are stored together in an array of pages,
and if a call is too large it could leave inadequate space for the
reply.  This is normally OK because NFS RPCs typically have either
short arguments and long replies (like READ) or long arguments and short
replies (like WRITE).  But a client that sends an incorrectly long call
can violate those assumptions.  This was observed to cause crashes.

So, insist that the argument not be any longer than we expect.

Also, several operations increment rq_next_page in the decode routine
before checking the argument size, which can leave rq_next_page pointing
well past the end of the page array, causing trouble later in
svc_free_pages.

As followup we may also want to rewrite the encoding routines to check
more carefully that they aren't running off the end of the page array.

Reported-by: Tuomas Haanpää <thaan@synopsys.com>
Reported-by: Ari Kauppi <ari@synopsys.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 17:25:53 -04:00
Chao Yu
d618ebaf0a f2fs: enable small discard by default
This patch enables 4K-granularity small discards by default
when realtime discard is on. In seriously fragmented space,
small discards can then be issued in time, so that invalid filesystem
data does not needlessly occupy storage space and the performance of
the flash storage can recover.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-25 14:18:45 -07:00
Chao Yu
34e159da41 f2fs: delay awaking discard thread
It's better to delay waking the discard thread while queuing discard
commands in checkpoint; this gives more chances to merge big and small
discards.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-25 14:18:44 -07:00
Yunlei He
66a82d1fc7 f2fs: seperate read nat page from nat_tree_lock
This patch separates nat page read I/O from nat_tree_lock.

-lock_page
	-get_node_info()
		-current_nat_addr

			......           	->       write_checkpoint

			-get_meta_page

Because we lock the node page, we can make sure no other threads
modify this nid concurrently. So we just obtain current_nat_addr
under nat_tree_lock; node info is always the same in both nat packs.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-25 14:16:39 -07:00
Sheng Yong
d3bb910c15 f2fs: fix multiple f2fs_add_link() having same name for inline dentry
Commit 88c5c13a50 (f2fs: fix multiple f2fs_add_link() calls having
same name) does not cover the scenario where inline dentry is enabled.
In that case, F2FS_I(dir)->task will be NULL, and __f2fs_add_link will
look up dentries one more time.

This patch fixes it by moving the assignment of current task to an upper
level to cover both normal and inline dentry.

Cc: <stable@vger.kernel.org>
Fixes: 88c5c13a50 (f2fs: fix multiple f2fs_add_link() calls having same name)
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-25 14:16:31 -07:00
J. Bruce Fields
13bf9fbff0 nfsd: stricter decoding of write-like NFSv2/v3 ops
The NFSv2/v3 code does not systematically check whether we decode past
the end of the buffer.  This generally appears to be harmless, but there
are a few places where we do arithmetic on the pointers involved and
don't account for the possibility that a length could be negative.  Add
checks to catch these.
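
The kind of check meant here can be sketched in userspace Python. This
is an illustration, not the nfsd decoder: decode_opaque_len is a made-up
helper, the length word is read as a signed 32-bit value to mimic a C
int, and XDR padding is ignored.

```python
import struct


def decode_opaque_len(buf: bytes, offset: int) -> int:
    """Read a 4-byte big-endian length and validate it against buf.

    Raises ValueError instead of decoding past the end of the buffer.
    """
    if offset < 0 or offset + 4 > len(buf):
        raise ValueError("length word lies outside the buffer")
    # Interpret as signed to mimic a C int holding the wire value.
    (length,) = struct.unpack_from(">i", buf, offset)
    if length < 0:
        raise ValueError("negative length")
    if length > len(buf) - (offset + 4):
        raise ValueError("length runs past the end of the buffer")
    return length
```

The negative-length test matters because pointer arithmetic with a
negative value would otherwise move backwards and slip past a check
that only compares against the end of the buffer.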

Reported-by: Tuomas Haanpää <thaan@synopsys.com>
Reported-by: Ari Kauppi <ari@synopsys.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 16:36:23 -04:00
J. Bruce Fields
db44bac41b nfsd4: minor NFSv2/v3 write decoding cleanup
Use a couple shortcuts that will simplify a following bugfix.

Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 16:36:16 -04:00
J. Bruce Fields
e6838a29ec nfsd: check for oversized NFSv2/v3 arguments
A client can append random data to the end of an NFSv2 or NFSv3 RPC call
without our complaining; we'll just stop parsing at the end of the
expected data and ignore the rest.

Encoded arguments and replies are stored together in an array of pages,
and if a call is too large it could leave inadequate space for the
reply.  This is normally OK because NFS RPCs typically have either
short arguments and long replies (like READ) or long arguments and short
replies (like WRITE).  But a client that sends an incorrectly long call
can violate those assumptions.  This was observed to cause crashes.

Also, several operations increment rq_next_page in the decode routine
before checking the argument size, which can leave rq_next_page pointing
well past the end of the page array, causing trouble later in
svc_free_pages.

So, following a suggestion from Neil Brown, add a central check to
enforce our expectation that no NFSv2/v3 call has both a large call and
a large reply.
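
The shape of such a central check can be sketched in userspace Python.
It is a simplified model: request_too_big, its parameters and the use
of 4096 bytes as the threshold for "large" are assumptions, not the
server's actual code.

```python
PAGE_SIZE = 4096  # assumed threshold for "large"


def request_too_big(arg_len: int, max_reply_len: int) -> bool:
    """Reject a call only when both the arguments and the reply may be large."""
    large_call = arg_len > PAGE_SIZE
    large_reply = max_reply_len > PAGE_SIZE
    return large_call and large_reply
```

A READ-like call (short arguments, long reply) and a WRITE-like call
(long arguments, short reply) both pass in this model; only a call that
is long where a long reply is also expected gets rejected.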

As followup we may also want to rewrite the encoding routines to check
more carefully that they aren't running off the end of the page array.

We may also consider rejecting calls that have any extra garbage
appended.  That would be safer, and within our rights by spec, but given
the age of our server and the NFS protocol, and the fact that we've
never enforced this before, we may need to balance that against the
possibility of breaking some oddball client.

Reported-by: Tuomas Haanpää <thaan@synopsys.com>
Reported-by: Ari Kauppi <ari@synopsys.com>
Cc: stable@vger.kernel.org
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25 16:34:37 -04:00
Trond Myklebust
bb3393d5b3 NFSv3: nfs3_nlm_alloc_call should be declared static
Fix compiler warnings.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-25 16:25:06 -04:00
Dan Williams
d4b29fd78e block: remove block_device_operations ->direct_access()
Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_device, remove the block
device direct_access enabling.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25 13:20:46 -07:00
Dan Williams
2093f2e9df block, dax: convert bdev_dax_supported() to dax_direct_access()
Kill off the final user of bdev_direct_access() and struct blk_dax_ctl.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25 13:20:46 -07:00
Dan Williams
cccbce6715 filesystem-dax: convert to dax_direct_access()
Now that a dax_device is plumbed through all dax-capable drivers we can
switch from block_device_operations to dax_operations for invoking
->direct_access.

This also lets us kill off some usages of struct blk_dax_ctl on the way
to its eventual removal.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25 13:20:46 -07:00
Dan Williams
a41fe02b6b Revert "block: use DAX for partition table reads"
commit d1a5f2b4d8 ("block: use DAX for partition table reads") was
part of a stalled effort to allow dax mappings of block devices. Since
then the device-dax mechanism has filled the role of dax-mapping static
device ranges.

Now that we are moving ->direct_access() from a block_device operation
to a dax_inode operation we would need block devices to map and carry
their own dax_inode reference.

Unless / until we decide to revive dax mapping of raw block devices
through the dax_inode scheme, there is no need to carry
read_dax_sector(). Its removal in turn allows for the removal of
bdev_direct_access() and should have been included in commit
2237570168 ("block_dev: remove DAX leftovers").

Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25 13:20:46 -07:00
Dan Williams
fa5d932c32 ext2, ext4, xfs: retrieve dax_device for iomap operations
In preparation for converting fs/dax.c to use dax_direct_access()
instead of bdev_direct_access(), add the plumbing to retrieve the
dax_device associated with a given block_device.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-25 13:20:46 -07:00
Trond Myklebust
a6598813a4 NFS: Don't write back further requests if there is a pending write error
If the server has already returned a fatal write error that the user
has not yet received on this file, then don't write back the other pages.
Instead, act as if they have been sent, and have returned with the same
error.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-25 15:42:34 -04:00
Trond Myklebust
6aeafd05ec pNFS: Fix use after free issues in pnfs_do_read()
The assumption should be that if the caller returns PNFS_ATTEMPTED, then hdr
has been consumed, and so we should not be testing hdr->task.tk_status.
If the caller returns PNFS_TRY_AGAIN, then we need to recoalesce and
free hdr.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-25 15:42:34 -04:00
Yan, Zheng
8179a101eb ceph: fix recursion between ceph_set_acl() and __ceph_setattr()
ceph_set_acl() calls __ceph_setattr() if the setacl operation needs
to modify inode's i_mode. __ceph_setattr() updates inode's i_mode,
then calls posix_acl_chmod().

The problem is that __ceph_setattr() calls posix_acl_chmod() before
sending the setattr request. The get_acl() call in posix_acl_chmod()
can trigger a getxattr request. The reply of the getxattr request
can restore inode's i_mode to its old value. The set_acl() call in
posix_acl_chmod() sees old value of inode's i_mode, so it calls
__ceph_setattr() again.

Cc: stable@vger.kernel.org # needs backporting for < 4.9
Link: http://tracker.ceph.com/issues/19688
Reported-by: Jerry Lee <leisurelysw24@gmail.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-04-25 21:08:26 +02:00
Darrick J. Wong
c4cf1acdb1 xfs: better log intent item refcount checking
Use ASSERTs on the log intent item refcounts so that we fail noisily if
anyone tries to double-free the item.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-04-25 09:40:42 -07:00
Brian Foster
20e8a06378 xfs: fix up quotacheck buffer list error handling
The quotacheck error handling of the delwri buffer list assumes the
resident buffers are locked and doesn't clear the _XBF_DELWRI_Q flag
on the buffers that are dequeued. This can lead to assert failures
on buffer release and possibly other locking problems.

Move this code to a delwri queue cancel helper function to
encapsulate the logic required to properly release buffers from a
delwri queue. Update the helper to clear the delwri queue flag and
call it from quotacheck.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:42 -07:00
Christoph Hellwig
27af1bbf52 xfs: remove xfs_trans_ail_delete_bulk
xfs_iflush_done uses an on-stack variable length array to pass the log
items to be deleted to xfs_trans_ail_delete_bulk.  On-stack VLAs are a
nasty gcc extension that can lead to unbounded stack allocations, but
fortunately we can easily avoid them by simply open coding
xfs_trans_ail_delete_bulk in xfs_iflush_done, which is the only caller
of it except for the single-item xfs_trans_ail_delete.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:42 -07:00
Christoph Hellwig
3f88a15ae0 xfs: don't use bool values in trace buffers
Using bool values produces sparse warnings of this form:

fs/xfs/./xfs_trace.h:2252:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
fs/xfs/./xfs_trace.h:2252:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
fs/xfs/./xfs_trace.h:2278:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
fs/xfs/./xfs_trace.h:2278:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)
fs/xfs/./xfs_trace.h:2307:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1)

Just use a char instead to fix those up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:42 -07:00
Darrick J. Wong
12e4a381c5 xfs: fix getfsmap userspace memory corruption while setting OF_LAST
At the end of a getfsmap call, we will set FMR_OF_LAST in the last
struct fsmap that was handed in by userspace if we've truly run out of
space mapping records (as opposed to simply running out of space in the
user array).  Unfortunately, fmh_entries is the wrong check for whether
or not we've filled out anything in the user array because the ioctl
provides that fmh_count==0 sets fmh_entries without filling out the user
array.  Therefore we end up writing things into user memory areas that we
weren't given, and kaboom.

Since Christoph amended the getfsmap structure to track the number of
fsmap entries we've actually filled out, use that as part of deciding if
we have to set the OF_LAST flag.
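
As an illustration only (not the patch itself), the decision can be sketched
in userspace C.  All names below (sketch_fsmap, SKETCH_OF_LAST,
sketch_set_of_last) are hypothetical stand-ins; the point is that the flag
is written only when at least one record was actually filled in:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_OF_LAST 0x1u	/* hypothetical stand-in for FMR_OF_LAST */

struct sketch_fsmap {
	uint32_t flags;
};

/*
 * Flag the last record that was actually copied out, but only when the
 * walk truly ran out of mapping records and at least one record was
 * written.  Returns true if the flag was set.
 */
static bool sketch_set_of_last(struct sketch_fsmap *recs, size_t cap,
			       size_t filled, bool out_of_records)
{
	assert(filled <= cap);
	if (!out_of_records || filled == 0)
		return false;	/* nothing of ours to touch in the array */
	recs[filled - 1].flags |= SKETCH_OF_LAST;
	return true;
}
```

With filled == 0 (the count-only query case) the array is never written,
which is the property the fix is after.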

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-04-25 09:40:42 -07:00
Christoph Hellwig
9d17e14cc0 xfs: fix __user annotations for xfs_ioc_getfsmap
By passing the whole fsmap_head structure and an index we can get the
user pointer annotations right for the embedded variable sized array
in struct fsmap_head.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: change idx to unsigned int]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:42 -07:00
Christoph Hellwig
e2a641922a xfs: corruption needs to respect endianess too!
At least if we want to be able to recognize the pattern.  Add a missing
byte swap to the corruption injection case in xlog_sync.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:42 -07:00
Christoph Hellwig
ef2b67ecf8 xfs: use NULL instead of 0 to initialize a pointer in xfs_ioc_getfsmap
Found by sparse.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:41 -07:00
Christoph Hellwig
fad5656b22 xfs: use NULL instead of 0 to initialize a pointer in xfs_getfsmap
Found by sparse.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:41 -07:00
Christoph Hellwig
0c1d9e4a61 xfs: simplify validation of the unwritten extent bit
XFS only supports the unwritten extent bit in the data fork, and only if
the file system has a version 5 superblock or the unwritten extent
feature bit.

We currently have two routines that validate the invariant:
xfs_check_nostate_extents which returns -EFSCORRUPTED when it's not met,
and xfs_validate_extent that triggers an assert in debug builds.

Both of them iterate over all extents of an inode fork when called,
which isn't very efficient.

This patch instead adds a new helper that verifies the invariant one
extent at a time, and calls it from the places where we iterate over
all extents to convert them from or to the in-memory format.  The
callers then return -EFSCORRUPTED when reading invalid extents from
disk, or trigger an assert when writing them to disk.
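
A minimal sketch of such a per-extent check, assuming only the invariant
stated above.  The names (sketch_fork, sketch_extent_state_ok) are made up
for illustration and are not the helper the patch adds:

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_fork { SKETCH_DATA_FORK, SKETCH_ATTR_FORK };

/*
 * An extent may carry the unwritten bit only in the data fork, and only
 * when the filesystem supports unwritten extents (v5 superblock or the
 * feature bit).  Extents without the bit are always acceptable.
 */
static bool sketch_extent_state_ok(enum sketch_fork fork,
				   bool fs_supports_unwritten,
				   bool unwritten)
{
	assert(fork == SKETCH_DATA_FORK || fork == SKETCH_ATTR_FORK);
	if (!unwritten)
		return true;
	return fork == SKETCH_DATA_FORK && fs_supports_unwritten;
}
```

A reader of on-disk extents would turn a false result into -EFSCORRUPTED,
while a writer would assert on it, as described above.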

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:41 -07:00
Christoph Hellwig
37f7f9bbf3 xfs: remove unused values from xfs_exntst_t
We only ever use the normal and unwritten states.  And the actual
ondisk format (which this enum is not, despite being in xfs_format.h) only
has space for the unwritten bit anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:41 -07:00
Christoph Hellwig
895e9bfc9e xfs: remove the unused XFS_MAXLINK_1 define
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:41 -07:00
Eric Sandeen
4f1adf3373 xfs: more do_div cleanups
On some architectures do_div does the pointer compare
trick to make sure that we've sent it an unsigned 64-bit
number.  (Why unsigned?  I don't know.)

Fix up the few places that squawk about this; in
xfs_bmap_wants_extents() we just used a bare int64_t so change
that to unsigned.

In xfs_adjust_extent_unmap_boundaries() all we wanted was the
mod, and we have an xfs-specific function to handle that w/o
side effects, which includes proper casting for do_div.

In xfs_daddr_to_ag[b]no, we were using the wrong type anyway;
XFS_BB_TO_FSBT returns a block in the filesystem, so use
xfs_rfsblock_t not xfs_daddr_t, and gain the unsignedness
from that type as a bonus.
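
For readers unfamiliar with do_div: it divides a 64-bit unsigned value in
place by a 32-bit divisor and returns the remainder.  The following is a
userspace model of that contract, not the kernel macro; sketch_do_div and
sketch_mod are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the do_div contract: *n becomes the quotient, the return
 * value is the remainder.  The dividend must be an unsigned 64-bit value.
 */
static uint32_t sketch_do_div(uint64_t *n, uint32_t base)
{
	uint32_t rem;

	assert(base != 0);
	rem = (uint32_t)(*n % base);
	*n /= base;
	return rem;
}

/* Remainder only: operates on a copy, so the caller's value is untouched. */
static uint32_t sketch_mod(uint64_t n, uint32_t base)
{
	return sketch_do_div(&n, base);
}
```

The second helper mirrors the "all we wanted was the mod" case: wrapping
the in-place divide avoids clobbering the caller's dividend.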

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:41 -07:00
Eric Sandeen
90115407c5 xfs: remove use of do_div with 32-bit dividend in quota
The kbuild test robot caught this; in debug code we have another
caller of do_div with a 32-bit dividend (j) which is caught now
that we are using the kernel-supplied do_div.

None of the values used here are 64-bit; just use simple division.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:41 -07:00
Hou Tao
42bf9dba40 xfs: remove the trailing newline used in the fmt parameter of TP_printk
The trailing newlines will lead to extra newlines in the trace file
which looks like the following output, so remove them.
>kworker/4:1H-1508  [004] .... 47879.101608: xfs_discard_extent: dev 8:0
>
>kworker/u16:2-238  [004] .... 47879.101725: xfs_extent_busy_clear: dev 8:0

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix the getfsmap tracepoints too]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:40 -07:00
Brian Foster
cb52ee334a xfs: prevent multi-fsb dir readahead from reading random blocks
Directory block readahead uses a complex iteration mechanism to map
between high-level directory blocks and underlying physical extents.
This mechanism attempts to traverse the higher-level dir blocks in a
manner that handles multi-fsb directory blocks and simultaneously
maintains a reference to the corresponding physical blocks.

This logic doesn't handle certain (discontiguous) physical extent
layouts correctly with multi-fsb directory blocks. For example,
consider the case of a 4k FSB filesystem with a 2 FSB (8k) directory
block size and a directory with the following extent layout:

 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
   0: [0..7]:          88..95            0 (88..95)             8
   1: [8..15]:         80..87            0 (80..87)             8
   2: [16..39]:        168..191          0 (168..191)          24
   3: [40..63]:        5242952..5242975  1 (72..95)            24

Directory block 0 spans physical extents 0 and 1, dirblk 1 lies
entirely within extent 2 and dirblk 2 spans extents 2 and 3. Because
extent 2 is larger than the directory block size, the readahead code
erroneously assumes the block is contiguous and issues a readahead
based on the physical mapping of the first fsb of the dirblk. This
results in read verifier failure and a spurious corruption or crc
failure, depending on the filesystem format.

Further, the subsequent readahead code responsible for walking
through the physical table doesn't correctly advance the physical
block reference for dirblk 2. Instead of advancing two physical
filesystem blocks, the first iteration of the loop advances 1 block
(correctly), but the subsequent iteration advances 2 more physical
blocks because the next physical extent (extent 3, above) happens to
cover more than dirblk 2. At this point, the higher-level directory
block walking is completely off the rails of the actual physical
layout of the directory for the respective mapping table.

Update the contiguous dirblock logic to consider the current offset
in the physical extent to avoid issuing directory readahead to
unrelated blocks. Also, update the mapping table advancing code to
consider the current offset within the current dirblock to avoid
advancing the mapping reference too far beyond the dirblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:40 -07:00
Eric Sandeen
023cc840b4 xfs: handle array index overrun in xfs_dir2_leaf_readbuf()
Carlos had a case where "find" seemed to start spinning
forever and never return.

This was on a filesystem with non-default multi-fsb (8k)
directory blocks, and a fragmented directory with extents
like this:

0:[0,133646,2,0]
1:[2,195888,1,0]
2:[3,195890,1,0]
3:[4,195892,1,0]
4:[5,195894,1,0]
5:[6,195896,1,0]
6:[7,195898,1,0]
7:[8,195900,1,0]
8:[9,195902,1,0]
9:[10,195908,1,0]
10:[11,195910,1,0]
11:[12,195912,1,0]
12:[13,195914,1,0]
...

i.e. the first extent is a contiguous 2-fsb dir block, but
after that it is fragmented into 1 block extents.

At the top of the readdir path, we allocate a mapping array
which (for this filesystem geometry) can hold 10 extents; see
the assignment to map_info->map_size.  During readdir, we are
therefore able to map extents 0 through 9 above into the array
for readahead purposes.  If we count by 2, we see that the last
mapped index (9) is the first block of a 2-fsb directory block.

At the end of xfs_dir2_leaf_readbuf() we have 2 loops to fill
more readahead; the outer loop assumes one full dir block is
processed each loop iteration, and an inner loop that ensures
that this is so by advancing to the next extent until a full
directory block is mapped.

The problem is that this inner loop may step past the last
extent in the mapping array as it tries to reach the end of
the directory block.  This will read garbage for the extent
length, and as a result the loop control variable 'j' may
become corrupted and never fail the loop conditional.

The number of valid mappings we have in our array is stored
in map->map_valid, so stop this inner loop based on that limit.
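
The shape of the bounded inner loop can be sketched as follows.  This is a
simplified userspace model with hypothetical names (sketch_extent,
sketch_cover), not the code in xfs_dir2_leaf_readbuf():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_extent {
	uint32_t blockcount;	/* length of this mapping in fs blocks */
};

/*
 * Walk mappings starting at *idx until 'need' blocks are covered.  The
 * second loop condition is the point of the fix: never index past the
 * number of valid entries.  Returns the number of blocks covered, which
 * is less than 'need' if the valid mappings ran out first.
 */
static uint32_t sketch_cover(const struct sketch_extent *map, size_t valid,
			     size_t *idx, uint32_t need)
{
	uint32_t got = 0;

	assert(idx != NULL);
	while (got < need && *idx < valid) {
		got += map[*idx].blockcount;
		(*idx)++;
	}
	return got;
}
```

Without the bound on the index, the loop would read whatever lies after the
last valid mapping, which matches the garbage-length symptom described
above.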

There is an ASSERT at the top of the outer loop for this
same condition, but we never made it out of the inner loop,
so the ASSERT never fired.

Huge appreciation for Carlos for debugging and isolating
the problem.

Debugged-and-analyzed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:40 -07:00
Chandan Rajendra
a008c31c7e iomap_dio_rw: Prevent reading file data beyond iomap_dio->i_size
On a ppc64 machine, executing overlayfs/019 with xfs as the lower and
upper filesystem causes the following call trace:

WARNING: CPU: 2 PID: 8034 at /root/repos/linux/fs/iomap.c:765 .iomap_dio_actor+0xcc/0x420
Modules linked in:
CPU: 2 PID: 8034 Comm: fsstress Tainted: G             L  4.11.0-rc5-next-20170405 #100
task: c000000631314880 task.stack: c0000003915d4000
NIP: c00000000035a72c LR: c00000000035a6f4 CTR: c00000000035a660
REGS: c0000003915d7570 TRAP: 0700   Tainted: G             L   (4.11.0-rc5-next-20170405)
MSR: 800000000282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI>
  CR: 24004284  XER: 00000000
CFAR: c0000000006f7190 SOFTE: 1
GPR00: c00000000035a6f4 c0000003915d77f0 c0000000015a3f00 000000007c22f600
GPR04: 000000000022d000 0000000000002600 c0000003b2d56360 c0000003915d7960
GPR08: c0000003915d7cd0 0000000000000002 0000000000002600 c000000000521cc0
GPR12: 0000000024004284 c00000000fd80a00 000000004b04ae64 ffffffffffffffff
GPR16: 000000001000ca70 0000000000000000 c0000003b2d56380 c00000000153d2b8
GPR20: 0000000000000010 c0000003bc87bac8 0000000000223000 000000000022f5ff
GPR24: c0000003b2d56360 000000000000000c 0000000000002600 000000000022d000
GPR28: 0000000000000000 c0000003915d7960 c0000003b2d56360 00000000000001ff
NIP [c00000000035a72c] .iomap_dio_actor+0xcc/0x420
LR [c00000000035a6f4] .iomap_dio_actor+0x94/0x420
Call Trace:
[c0000003915d77f0] [c00000000035a6f4] .iomap_dio_actor+0x94/0x420 (unreliable)
[c0000003915d78f0] [c00000000035b9f4] .iomap_apply+0xf4/0x1f0
[c0000003915d79d0] [c00000000035c320] .iomap_dio_rw+0x230/0x420
[c0000003915d7ae0] [c000000000512a14] .xfs_file_dio_aio_read+0x84/0x160
[c0000003915d7b80] [c000000000512d24] .xfs_file_read_iter+0x104/0x130
[c0000003915d7c10] [c0000000002d6234] .__vfs_read+0x114/0x1a0
[c0000003915d7cf0] [c0000000002d7a8c] .vfs_read+0xac/0x1a0
[c0000003915d7d90] [c0000000002d96b8] .SyS_read+0x58/0x100
[c0000003915d7e30] [c00000000000b8e0] system_call+0x38/0xfc
Instruction dump:
78630020 7f831b78 7ffc07b4 7c7ce039 40820360 a13d0018 2f890003 419e0288
2f890004 419e00a0 2f890001 419e02a8 <0fe00000> 3b80fffb 38210100 7f83e378

The above problem can also be recreated on a regular xfs filesystem
using the command,

$ fsstress -d /mnt -l 1000 -n 1000 -p 1000

The reason for the call trace is,
1. When 'reserving' blocks for delayed allocation, XFS reserves more
   blocks (i.e. past file's current EOF) than required. This is done
   because XFS assumes that userspace might write more data and hence
   'reserving' more blocks might lead to the file's new data being
   stored contiguously on disk.
2. The in-memory 'struct xfs_bmbt_irec' mapping the file's last extent would
   then cover the prealloc-ed EOF blocks in addition to the regular blocks.
3. When flushing the dirty blocks to disk, we only flush data till the
   file's EOF. But before writing out the dirty data, we allocate blocks
   on the disk for holding the file's new data. This allocation includes
   the blocks that are part of the 'prealloc EOF blocks'.
4. Later, when the last reference to the inode is being closed, XFS frees the
   unused 'prealloc EOF blocks' in xfs_inactive().

In step 3 above, when allocating space on disk for the delayed allocation
range, the space allocator might sometimes allocate fewer blocks than
required. If such an allocation ends right at the current EOF of the
file, we will not be able to clear the "delayed allocation" flag for the
'prealloc EOF blocks', since we won't have dirty buffer heads associated
with that range of the file.

In such a situation if a Direct I/O read operation is performed on file
range [X, Y] (where X < EOF and Y > EOF), we flush dirty data in the
range [X, Y] and invalidate page cache for that range (Refer to
iomap_dio_rw()). Later for performing the Direct I/O read, XFS obtains
the extent items (which are still cached in memory) for the file
range. When doing so we are not supposed to get an extent item with
IOMAP_DELALLOC flag set, since the previous "flush" operation should
have converted any delayed allocation data in the range [X, Y]. Hence we
end up hitting a WARN_ON_ONCE(1) statement in iomap_dio_actor().

This commit fixes the bug by preventing the read operation from going
beyond iomap_dio->i_size.
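
The idea of not reading past i_size can be illustrated with a small
clamping helper.  This is a sketch of the principle only; the function
name is hypothetical and the actual patch may stop the read differently
(for example by ending the mapping loop once the position reaches i_size):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Limit a read of 'len' bytes at offset 'pos' so that it never extends
 * beyond 'i_size'.  Returns the number of bytes that may be read.
 */
static uint64_t sketch_clamp_read_len(uint64_t pos, uint64_t len,
				      uint64_t i_size)
{
	assert(len <= UINT64_MAX - pos);	/* pos + len must not wrap */
	if (pos >= i_size)
		return 0;		/* entirely past EOF */
	if (len > i_size - pos)
		return i_size - pos;	/* trim the tail past EOF */
	return len;
}
```

For a read of [X, Y] with X < EOF < Y, only the part up to EOF is issued,
so the lookup never reaches the preallocated blocks past EOF.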

Reported-by: Santhosh G <santhog4@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:40 -07:00
Christoph Hellwig
7590632a33 xfs: remove bmap block allocation retries
Now that reflink operations don't set the firstblock value we don't
need the workarounds for non-NULL firstblock values without a prior
allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:40 -07:00
Christoph Hellwig
bf8eadbacb xfs: remove xfs_bmap_remap_alloc
The main thing that xfs_bmap_remap_alloc does is fixing the AGFL, similar
to what we do in the space allocator.  But the reflink code doesn't touch
the allocation btree unlike the normal space allocator, so we couldn't
care less about the state of the AGFL.

So remove xfs_bmap_remap_alloc and just handle the di_nblocks update in
the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:40 -07:00
Christoph Hellwig
6ebd5a4413 xfs: introduce xfs_bmapi_remap
Add a new helper to be used for reflink extent list additions instead of
funneling them through xfs_bmapi_write and overloading the firstblock
member in struct xfs_bmalloca and struct xfs_alloc_args.

With some small changes to xfs_bmap_remap_alloc this also means we do
not need a xfs_bmalloca structure for this case at all.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:39 -07:00
Christoph Hellwig
6d04558f9f xfs: pass individual arguments to xfs_bmap_add_extent_hole_real
For the reflink case we'd much rather pass the required arguments than
faking up a struct xfs_bmalloca.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:39 -07:00
Christoph Hellwig
39e07daa46 xfs: remove attr fork handling in xfs_bmap_finish_one
We never do COW operations for the attr fork, so don't pretend we handle
them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:39 -07:00
Christoph Hellwig
52813fb13f xfs: fix integer truncation in xfs_bmap_remap_alloc
bno should be a xfs_fsblock_t, which is 64 bits wide, instead of a
xfs_agblock_t, which truncates the value to 32 bits.
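
The effect of the narrower type can be shown in plain C.  The typedefs
below are hypothetical stand-ins that only mirror the widths mentioned
above (64-bit filesystem block number, 32-bit AG block number):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sketch_fsblock_t;	/* 64-bit filesystem block number */
typedef uint32_t sketch_agblock_t;	/* 32-bit AG-relative block number */

/* Storing a 64-bit block number in the 32-bit type drops the high bits. */
static sketch_agblock_t sketch_narrow(sketch_fsblock_t bno)
{
	return (sketch_agblock_t)bno;
}

/* True if the value survives the narrowing unchanged. */
static int sketch_fits(sketch_fsblock_t bno)
{
	return bno <= UINT32_MAX;
}
```

Any block number at or above 2^32 silently turns into a different, smaller
block number, which is why the variable has to use the wide type.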

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-25 09:40:39 -07:00
Trond Myklebust
b3230e80a6 pNFS: Ensure we check layout segment validity in the pg_init() callback
If we have a layout segment cached in pgio->pg_lseg, we should check it
for validity before reusing it in a new RPC request. Otherwise, if we
recoalesce, we can end up looping forever.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-25 10:56:19 -04:00
Amir Goldstein
4ff33aafd3 fanotify: don't expose EOPENSTALE to userspace
When delivering an event to userspace for a file on an NFS share,
if the file is deleted on server side before user reads the event,
user will not get the event.

If the event queue contained several events, the stale event is
quietly dropped and read() returns to user with events read so far
in the buffer.

If the event queue contains a single stale event or if the stale
event is a permission event, read() returns to user with the kernel
internal error code 518 (EOPENSTALE), which is not a POSIX error code.

Check the internal return value -EOPENSTALE in fanotify_read(), just
the same as it is checked in path_openat() and drop the event in the
cases that it is not already dropped.

This is a reproducer from Marko Rauhamaa:

Just take the example program listed under "man fanotify" ("fantest")
and follow these steps:

    ==============================================================
    NFS Server    NFS Client(1)     NFS Client(2)
    ==============================================================
    # echo foo >/nfsshare/bar.txt
                  # cat /nfsshare/bar.txt
                  foo
                                    # ./fantest /nfsshare
                                    Press enter key to terminate.
                                    Listening for events.
    # rm -f /nfsshare/bar.txt
                  # cat /nfsshare/bar.txt
                                    read: Unknown error 518
                  cat: /nfsshare/bar.txt: Operation not permitted
    ==============================================================

where NFS Client (1) and (2) are two terminal sessions on a single NFS
Client machine.

Reported-by: Marko Rauhamaa <marko.rauhamaa@f-secure.com>
Tested-by: Marko Rauhamaa <marko.rauhamaa@f-secure.com>
Cc: <linux-api@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-25 15:48:06 +02:00
Hou Pengyang
4086d3f61b f2fs: skip encrypted inode in ASYNC IPU policy
Async requests may be throttled in the block layer, so a page under async write may
stay in WRITE_BACK state for a long time.

For an encrypted inode, we need to wait on page writeback regardless of whether the
device supports BDI_CAP_STABLE_WRITES. This may result in a longer page writeback wait
for async encrypted inode pages.

This patch skips IPU for encrypted inode's updating write.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 13:13:24 -07:00
Jaegeuk Kim
a788189305 f2fs: fix out-of free segments
This patch also reverts d0db7703ac ("f2fs: do SSR in higher priority").

This patch fixes running out of free segments when many small files are created by
1) mkfs -s 1 2G
2) mount
3) untar
 - produce 60000 small files in a burst
4) sync
 - flush node pages
 - flush imeta

Here, when we do f2fs_balance_fs, we missed the number of imeta blocks, and so
skipped the has_not_enough_free_secs check.

Another test is done by
1) mkfs -s 12 2G
2) mount
3) untar
 - produce 60000 small files in a burst
4) sync
 - flush node pages
 - flush imeta

In this case, this patch also fixes wrong block allocation under large section
size.

Reported-by: William Brana <wbrana@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 13:13:23 -07:00
Arnd Bergmann
d66450e773 f2fs: improve definition of statistic macros
With a recent addition of f2fs_lookup_extent_tree(), we get a warning about
the use of empty macros:

fs/f2fs/extent_cache.c: In function 'f2fs_lookup_extent_tree':
fs/f2fs/extent_cache.c:358:32: error: suggest braces around empty body in an 'else' statement [-Werror=empty-body]
   stat_inc_rbtree_node_hit(sbi);

A good way to avoid the warning and make the code more robust is to define
all no-op macros as 'do { } while (0)'.
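
A small standalone example of why the 'do { } while (0)' form is more
robust.  The macro and function names are hypothetical, not the f2fs
statistic macros:

```c
#include <assert.h>

/*
 * An empty definition such as
 *	#define SKETCH_STAT_INC(x)
 * would leave a bare ';' as the body of the 'else' below, which is what
 * -Wempty-body warns about.  This form expands to one real statement.
 */
#define SKETCH_STAT_INC(x) do { } while (0)

static int sketch_lookup(int hit)
{
	int result = 0;

	if (hit)
		result = 1;
	else
		SKETCH_STAT_INC(result);	/* a complete, empty statement */
	return result;
}
```

The macro still requires a trailing semicolon at the call site, so it
behaves like a function call in every statement context.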

Fixes: 54c2258cd6 ("f2fs: extract rb-tree operation infrastructure")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 13:13:22 -07:00
Jaegeuk Kim
d579324998 f2fs: assign allocation hint for warm/cold data
This patch assigns the slower device region to the warm/cold data area more eagerly.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 13:06:53 -07:00
Jaegeuk Kim
d07efb5077 f2fs: fix _IOW usage
This patch fixes wrong _IOW usage.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 12:55:45 -07:00
Jaegeuk Kim
e066b83c9b f2fs: add ioctl to flush data from faster device to cold area
This patch adds an ioctl to flush data in faster device to cold area. User can
give device number and number of segments to move. It doesn't move it if there
is only one device.

The parameter looks like:

struct f2fs_flush_device {
	u32 dev_num;		/* device number to flush */
	u32 segments;		/* # of segments to flush */
};
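
For illustration, a structure like this is typically bound to an ioctl
command number with _IOW.  The magic value and command number below are
placeholders chosen for the sketch, and the sketch_ names are hypothetical;
they are not taken from this patch:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

struct sketch_flush_device {
	uint32_t dev_num;	/* device number to flush */
	uint32_t segments;	/* # of segments to flush */
};

#define SKETCH_IOCTL_MAGIC	0xf5	/* placeholder magic */
#define SKETCH_IOC_FLUSH_DEVICE \
	_IOW(SKETCH_IOCTL_MAGIC, 10, struct sketch_flush_device)

/* Command number built from the full argument structure. */
static unsigned long sketch_flush_cmd(void)
{
	return SKETCH_IOC_FLUSH_DEVICE;
}

/* Same magic and number, but a different argument type. */
static unsigned long sketch_other_cmd(void)
{
	return _IOW(SKETCH_IOCTL_MAGIC, 10, uint32_t);
}
```

The type passed to _IOW is encoded (by size) into the command number, so
the two helpers return different values even though the magic and number
match.  Userspace would pass a pointer to the structure as the ioctl
argument.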

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24 12:55:41 -07:00
Jan Kara
61a929870d ext4: Improve comments in ext4_quota_{on|off}()
Improve comments in ext4_quota_{on|off}() to explain that returning
success despite ext4_journal_start() failing is deliberate.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-24 16:49:16 +02:00
Dan Carpenter
f4edce1afd fsnotify: remove a stray unlock
We recently shifted this code around, so we're no longer holding the
lock on this path.

Fixes: 755b5bc681 ("fsnotify: Remove indirection from mark list addition")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-24 16:41:28 +02:00
Fabian Frederick
5c26eac43a udf: use kmap_atomic for memcpy copying
Use temporary mapping for memory copying operations.

To avoid any sleeping problem,

mark_inode_dirty(inode) was moved after kunmap() in
udf_adinicb_readpage()

down_write(&iinfo->i_data_sem) set before kmap_atomic()
in udf_expand_file_adinicb()

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-24 16:28:02 +02:00
Fabian Frederick
6ff6b2b329 udf: use octal for permissions
According to commit f90774e1fd ("checkpatch: look for symbolic
permissions and suggest octal instead")

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-24 16:27:52 +02:00
Linus Torvalds
9ea33c44fb This pull request contains fixes for issues in both UBI and UBIFS:
- More O_TMPFILE fallout
 - RENAME_WHITEOUT regression due to a mis-merge
 - Memory leak in ubifs_mknod()
 - Power-cut problem in UBI's update volume feature

Merge tag 'upstream-4.11-rc7' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS fixes from Richard Weinberger:
 "This contains fixes for issues in both UBI and UBIFS:

   - more O_TMPFILE fallout

   - RENAME_WHITEOUT regression due to a mis-merge

   - memory leak in ubifs_mknod()

   - power-cut problem in UBI's update volume feature"

* tag 'upstream-4.11-rc7' of git://git.infradead.org/linux-ubifs:
  ubifs: Fix O_TMPFILE corner case in ubifs_link()
  ubifs: Fix RENAME_WHITEOUT support
  ubifs: Fix debug messages for an invalid filename in ubifs_dump_inode
  ubifs: Fix debug messages for an invalid filename in ubifs_dump_node
  ubifs: Remove filename from debug messages in ubifs_readdir
  ubifs: Fix memory leak in error path in ubifs_mknod
  ubi/upd: Always flush after prepared for an update
2017-04-23 16:49:16 -07:00
Ingo Molnar
58d30c36d4 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Documentation updates.

 - Miscellaneous fixes.

 - Parallelize SRCU callback handling (plus overlapping patches).

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-23 11:12:44 +02:00
David S. Miller
fb796707d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Both conflicts were simple overlapping changes.

In the kaweth case, Eric Dumazet's skb_cow() bug fix overlapped the
conversion of the driver in net-next to use in-netdev stats.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21 20:23:53 -07:00
Linus Torvalds
94836ecf1e Fix a 4.11 regression that triggers a BUG() on an attempt to use an
unsupported NFSv4 compound op.

Merge tag 'nfsd-4.11-2' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfix from Bruce Fields:
 "Fix a 4.11 regression that triggers a BUG() on an attempt to use an
  unsupported NFSv4 compound op"

* tag 'nfsd-4.11-2' of git://linux-nfs.org/~bfields/linux:
  nfsd: fix oops on unsupported operation
2017-04-21 16:37:48 -07:00
Ilya Dryomov
19b7ccf865 block: get rid of blk_integrity_revalidate()
Commit 25520d55cd ("block: Inline blk_integrity in struct gendisk")
introduced blk_integrity_revalidate(), which seems to assume ownership
of the stable pages flag and unilaterally clears it if no blk_integrity
profile is registered:

    if (bi->profile)
            disk->queue->backing_dev_info->capabilities |=
                    BDI_CAP_STABLE_WRITES;
    else
            disk->queue->backing_dev_info->capabilities &=
                    ~BDI_CAP_STABLE_WRITES;

It's called from revalidate_disk() and rescan_partitions(), making it
impossible to enable stable pages for drivers that support partitions
and don't use blk_integrity: while the call in revalidate_disk() can be
trivially worked around (see zram, which doesn't support partitions and
hence gets away with zram_revalidate_disk()), rescan_partitions() can
be triggered from userspace at any time.  This breaks rbd, where the
ceph messenger is responsible for generating/verifying CRCs.

Since blk_integrity_{un,}register() "must" be used for (un)registering
the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
setting there.  This way drivers that call blk_integrity_register() and
use integrity infrastructure won't interfere with drivers that don't
but still want stable pages.
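
The ownership change can be modelled in a few lines of userspace C.  The
structure, flag value and function names are hypothetical; the sketch only
shows that revalidation no longer touches a flag that a driver set for
itself:

```c
#include <assert.h>

#define SKETCH_CAP_STABLE_WRITES 0x1u	/* stand-in for BDI_CAP_STABLE_WRITES */

struct sketch_bdi {
	unsigned int capabilities;
};

/* Registering an integrity profile turns stable pages on ... */
static void sketch_integrity_register(struct sketch_bdi *bdi)
{
	assert(bdi != 0);
	bdi->capabilities |= SKETCH_CAP_STABLE_WRITES;
}

/* ... and unregistering it turns them off again. */
static void sketch_integrity_unregister(struct sketch_bdi *bdi)
{
	assert(bdi != 0);
	bdi->capabilities &= ~SKETCH_CAP_STABLE_WRITES;
}

/* Revalidation / partition rescan: leaves the flag alone. */
static void sketch_revalidate(struct sketch_bdi *bdi)
{
	(void)bdi;
}
```

A driver without an integrity profile (rbd in the description above) can
set the flag directly and keep it across any number of rescans.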

Fixes: 25520d55cd ("block: Inline blk_integrity in struct gendisk")
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.4+, needs backporting
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-21 14:17:27 -06:00
Al Viro
4f757f3cbf make sure that mntns_install() doesn't end up with referral for root
new flag: LOOKUP_DOWN.  If the starting point is overmounted, cross
into whatever's mounted on top, triggering referrals et al.

Use that instead of follow_down_one() loop in mntns_install(), handle
errors properly.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-21 14:05:36 -04:00
Al Viro
93893862fb path_init(): don't bother with checking MAY_EXEC for LOOKUP_ROOT
we'll hit that check in link_path_walk() anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-21 14:05:35 -04:00
Al Viro
159b095628 make sure that fchdir() won't accept referral points, etc.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-21 14:05:35 -04:00
Al Viro
c63ed807d1 orangefs: use iov_iter_revert()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-21 13:57:32 -04:00
Benjamin Coddington
f30cb757f6 NFS: Always wait for I/O completion before unlock
NFS attempts to wait for read and write completion before unlocking in
order to ensure that the data returned was protected by the lock.  When
this waiting is interrupted by a signal, the unlock may be skipped, and
messages similar to the following are seen in the kernel ring buffer:

[20.167876] Leaked locks on dev=0x0:0x2b ino=0x8dd4c3:
[20.168286] POSIX: fl_owner=ffff880078b06940 fl_flags=0x1 fl_type=0x0 fl_pid=20183
[20.168727] POSIX: fl_owner=ffff880078b06680 fl_flags=0x1 fl_type=0x0 fl_pid=20185

For NFSv3, the missing unlock will cause the server to refuse conflicting
locks indefinitely.  For NFSv4, the leftover lock will be removed by the
server after the lease timeout.

This patch fixes this issue by skipping the usual wait in
nfs_iocounter_wait if the FL_CLOSE flag is set when signaled.  Instead, the
wait happens in the unlock RPC task on the NFS UOC rpc_waitqueue.

For NFSv3, use lockd's new nlmclnt_operations along with
nfs_async_iocounter_wait to defer NLM's unlock task until the lock
context's iocounter reaches zero.

For NFSv4, call nfs_async_iocounter_wait() directly from unlock's
current rpc_call_prepare.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-21 10:45:01 -04:00
Benjamin Coddington
b1ece737f4 lockd: Introduce nlmclnt_operations
NFS would enjoy the ability to modify the behavior of the NLM client's
unlock RPC task in order to delay the transmission of the unlock until IO
that was submitted under that lock has completed.  This ability can ensure
that the NLM client will always complete the transmission of an unlock even
if the waiting caller has been interrupted by a fatal signal.

For this purpose, a pointer to a struct nlmclnt_operations can be assigned
in a nfs_module's nfs_rpc_ops that will install those nlmclnt_operations on
the nlm_host.  The struct nlmclnt_operations defines three callback
operations that will be used in a following patch:

nlmclnt_alloc_call - used to call back after a successful allocation of
	a struct nlm_rqst in nlmclnt_proc().

nlmclnt_unlock_prepare - used to call back during NLM unlock's
	rpc_call_prepare.  The NLM client defers calling rpc_call_start()
	until this callback returns false.

nlmclnt_release_call - used to call back when the NLM client's struct
	nlm_rqst is freed.
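
As a rough userspace illustration of the three callbacks listed above, the
sketch below models a callback table and a caller that defers the unlock
while the prepare callback reports pending work.  All names (demo_nlm_ops,
demo_lock_ctx, demo_try_start_unlock) and the signatures are assumptions
made for illustration; they are not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-lock state; not a kernel type. */
struct demo_lock_ctx {
	int refs;        /* references held by in-flight requests */
	int io_pending;  /* I/O submitted under the lock, not yet complete */
};

/* Hypothetical callback table modelled on the description above. */
struct demo_nlm_ops {
	void (*alloc_call)(void *data);      /* after a request is allocated */
	bool (*unlock_prepare)(void *data);  /* true: keep deferring the unlock */
	void (*release_call)(void *data);    /* when the request is freed */
};

static void demo_alloc_call(void *data)
{
	((struct demo_lock_ctx *)data)->refs++;
}

static bool demo_unlock_prepare(void *data)
{
	return ((struct demo_lock_ctx *)data)->io_pending > 0;
}

static void demo_release_call(void *data)
{
	((struct demo_lock_ctx *)data)->refs--;
}

static const struct demo_nlm_ops demo_ops = {
	.alloc_call     = demo_alloc_call,
	.unlock_prepare = demo_unlock_prepare,
	.release_call   = demo_release_call,
};

/* Returns true if the unlock may be transmitted now, false if deferred. */
static bool demo_try_start_unlock(const struct demo_nlm_ops *ops, void *data)
{
	if (ops && ops->unlock_prepare && ops->unlock_prepare(data))
		return false;  /* I/O under the lock is still pending */
	return true;
}
```

In this model the unlock is held back only while the prepare callback
returns true, mirroring the "defers ... until this callback returns false"
wording above.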

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-21 10:45:01 -04:00
Benjamin Coddington
7d6ddf88c4 NFS: Add an iocounter wait function for async RPC tasks
By sleeping on a new NFS Unlock-On-Close waitqueue, rpc tasks may wait for
a lock context's iocounter to reach zero.  The rpc waitqueue is only woken
when the open_context has the NFS_CONTEXT_UNLOCK flag set in order to
mitigate spurious wake-ups for any iocounter reaching zero.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-21 10:45:01 -04:00
Benjamin Coddington
50f2112cf7 locks: Set FL_CLOSE when removing flock locks on close()
Set FL_CLOSE in fl_flags as in locks_remove_posix() when clearing locks.
NFS will check for this flag to ensure an unlock is sent in a following
patch.

Fuse handles flock and posix locks differently for FL_CLOSE, and so
requires a fixup to retain the existing behavior for flock.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-21 10:45:01 -04:00
Benjamin Coddington
e12937279c NFS: Move the flock open mode check into nfs_flock()
We only need to check lock exclusive/shared types against open mode when
flock() is used on NFS, so move it into the flock-specific path instead of
checking it for all locks.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-21 10:45:00 -04:00
Benjamin Coddington
12a16d15b6 NFS4: remove a redundant lock range check
flock64_to_posix_lock() is already doing this check.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-21 10:45:00 -04:00
Trond Myklebust
675e508f53 pNFS: unexport nfs4_pnfs_v3_ds_connect_unload
It is not used outside the NFSv4 module.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 16:58:50 -04:00
Trond Myklebust
b94196888f pNFS: Unexport pnfs_put_lseg_locked and _pnfs_return_layout
They are not used outside the NFSv4 module.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 16:53:58 -04:00
Trond Myklebust
73504740df pNFS: Remove unused layout driver callbacks
encode_layoutreturn and encode_layoutcommit are now unused. Let's
remove them.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 16:48:14 -04:00
Christoph Hellwig
6d22323b2e nfs: remove the objlayout driver
The objlayout code has been in the tree, but it's been unmaintained and
no server product for it actually ever shipped.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 15:51:23 -04:00
Dan Williams
b0686260fe dax: introduce dax_direct_access()
Replace bdev_direct_access() with dax_direct_access() that uses
dax_device and dax_operations instead of a block_device and
block_device_operations for dax. Once all consumers of the old api have
been converted bdev_direct_access() will be deleted.

Given that block device partitioning decisions can cause dax page
alignment constraints to be violated this also introduces the
bdev_dax_pgoff() helper. It handles calculating a logical pgoff relative
to the dax_device and also checks for page alignment.
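
The offset calculation and alignment check described above can be sketched
in userspace roughly as follows.  This is a minimal illustration, assuming
512-byte sectors and 4096-byte pages; demo_dax_pgoff() and its parameters
are hypothetical names, not the helper introduced by this patch.

```c
#include <stdint.h>
#include <stddef.h>

#define DEMO_SECTOR_SIZE 512u   /* assumed sector size */
#define DEMO_PAGE_SIZE   4096u  /* assumed page size */

/*
 * Hypothetical helper: translate a sector offset within a partition into a
 * page offset relative to the whole device, rejecting unaligned requests.
 * Returns 0 on success, -1 if the byte offset or size is not page aligned.
 */
static int demo_dax_pgoff(uint64_t part_start_sect, uint64_t sector,
			  size_t size, uint64_t *pgoff)
{
	uint64_t byte_off = (part_start_sect + sector) * DEMO_SECTOR_SIZE;

	if (byte_off % DEMO_PAGE_SIZE || size % DEMO_PAGE_SIZE)
		return -1;
	if (pgoff)
		*pgoff = byte_off / DEMO_PAGE_SIZE;
	return 0;
}
```

A partition that starts on a sector that is not a multiple of eight (with
the sizes assumed here) fails the check, which is the partitioning hazard
the paragraph above refers to.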

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-20 11:57:52 -07:00
Dan Williams
d8f07aee3f block: kill bdev_dax_capable()
This is leftover dead code that has since been replaced by
bdev_dax_supported().

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-04-20 11:57:52 -07:00
Trond Myklebust
260f32adb8 pNFS/flexfiles: Check the result of nfs4_pnfs_ds_connect
The check in nfs4_ff_layout_prepare_ds() seems to be missing.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: a33e4b036d ("pNFS: return status from nfs4_pnfs_ds_connect")
Cc: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org # v4.11
2017-04-20 14:37:37 -04:00
Christoph Hellwig
17d5363b83 scsi: introduce a result field in struct scsi_request
This passes on the scsi_cmnd result field to users of passthrough
requests.  Currently we abuse req->errors for this purpose, but that
field will go away in its current form.

Note that the old IDE code abuses the errors field in very creative
ways and stores all kinds of different values in it.  I didn't dare
to touch this magic, so the abuses are brought forward 1:1.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:16:10 -06:00
Christoph Hellwig
b7819b9259 block: remove the blk_execute_rq return value
The function only returns -EIO if rq->errors is non-zero, which is not
very useful and lets a large number of callers ignore the return value.

Just let the callers figure out their error themselves.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:16:10 -06:00
Trond Myklebust
56e0d71ef1 NFSv4: Fix a hang in OPEN related to server reboot
If the server fails to return the attributes as part of an OPEN
reply, and then reboots, we can end up hanging. The reason is that
the client attempts to send a GETATTR in order to pick up the
missing OPEN call, but fails to release the slot first, causing
reboot recovery to deadlock.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: 2e80dbe7ac ("NFSv4.1: Close callback races for OPEN, LAYOUTGET...")
Cc: stable@vger.kernel.org # v4.8+
2017-04-20 14:12:57 -04:00
Jan Kara
7c4cc30024 bdi: Drop 'parent' argument from bdi_register[_va]()
Drop 'parent' argument of bdi_register() and bdi_register_va().  It is
always NULL.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
c1844d536d fs: Remove SB_I_DYNBDI flag
Now that all bdi structures filesystems use are properly refcounted, we
can remove the SB_I_DYNBDI flag.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
99edd4580b ubifs: Convert to separately allocated bdi
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Richard Weinberger <richard@nod.at>
CC: Artem Bityutskiy <dedekind1@gmail.com>
CC: Adrian Hunter <adrian.hunter@intel.com>
CC: linux-mtd@lists.infradead.org
Acked-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
0db10944a7 nfs: Convert to separately allocated bdi
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Anna Schumaker <anna.schumaker@netapp.com>
CC: linux-nfs@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
a0349ec00f ncpfs: Convert to separately allocated bdi
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Petr Vandrovec <petr@vandrovec.name>
Acked-by: Petr Vandrovec <petr@vandrovec.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
0546c537b1 nilfs2: Convert to properly refcounting bdi
Similarly to set_bdev_super() NILFS2 just used block device reference to
bdi. Convert it to properly getting bdi reference. The reference will
get automatically dropped on superblock destruction.

CC: linux-nilfs@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
95fe66de9f gfs2: Convert to properly refcounting bdi
Similarly to set_bdev_super() GFS2 just used block device reference to
bdi. Convert it to properly getting bdi reference. The reference will
get automatically dropped on superblock destruction.

CC: Steven Whitehouse <swhiteho@redhat.com>
CC: Bob Peterson <rpeterso@redhat.com>
CC: cluster-devel@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
7fbbe972c3 fuse: Get rid of bdi_initialized
It is not needed anymore since bdi is initialized whenever superblock
exists.

CC: Miklos Szeredi <miklos@szeredi.hu>
CC: linux-fsdevel@vger.kernel.org
Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
5f7f7543f5 fuse: Convert to separately allocated bdi
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Miklos Szeredi <miklos@szeredi.hu>
CC: linux-fsdevel@vger.kernel.org
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
c7f014771b exofs: Convert to separately allocated bdi
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Boaz Harrosh <ooo@electrozaur.com>
CC: Benny Halevy <bhalevy@primarydata.com>
Acked-by: Boaz Harrosh <ooo@electrozaur.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
a5695a7908 coda: Convert to separately allocated bdi
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Jan Harkes <jaharkes@cs.cmu.edu>
CC: coda@cs.cmu.edu
CC: codalist@coda.cs.cmu.edu
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
edd3ba94c4 afs: Convert to separately allocated bdi
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: David Howells <dhowells@redhat.com>
CC: linux-afs@lists.infradead.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
e836818bd9 ecryptfs: Convert to separately allocated bdi
Allocate struct backing_dev_info separately instead of embedding it
inside the superblock. This unifies handling of bdi among users.

CC: Tyler Hicks <tyhicks@canonical.com>
CC: ecryptfs@vger.kernel.org
Acked-by: Tyler Hicks <tyhicks@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
851ea08609 cifs: Convert to separately allocated bdi
Allocate struct backing_dev_info separately instead of embedding it
inside superblock. This unifies handling of bdi among users.

CC: Steve French <sfrench@samba.org>
CC: linux-cifs@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
09dc9fc24b ceph: Convert to separately allocated bdi
Allocate struct backing_dev_info separately instead of embedding it
inside client structure. This unifies handling of bdi among users.

CC: Ilya Dryomov <idryomov@gmail.com>
CC: "Yan, Zheng" <zyan@redhat.com>
CC: Sage Weil <sage@redhat.com>
CC: ceph-devel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
9e11ceee23 btrfs: Convert to separately allocated bdi
Allocate struct backing_dev_info separately instead of embedding it
inside superblock. This unifies handling of bdi among users.

CC: Chris Mason <clm@fb.com>
CC: Josef Bacik <jbacik@fb.com>
CC: David Sterba <dsterba@suse.com>
CC: linux-btrfs@vger.kernel.org
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
71304feba3 9p: Convert to separately allocated bdi
Allocate struct backing_dev_info separately instead of embedding it
inside session. This unifies handling of bdi among users.

CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Ron Minnich <rminnich@sandia.gov>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: v9fs-developer@lists.sourceforge.net
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
13eec2363e fs: Get proper reference for s_bdi
So far we just relied on block device to hold a bdi reference for us
while the filesystem is mounted. While that works perfectly fine, it is
a bit awkward that we have a pointer to a refcounted structure in the
superblock without proper reference. So make s_bdi hold a proper
reference to block device's BDI. No filesystem using mount_bdev()
actually changes s_bdi so this is safe and will make bdev filesystems
work the same way as filesystems needing to set up their private bdi.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Jan Kara
fca39346a5 fs: Provide infrastructure for dynamic BDIs in filesystems
Provide helper functions for setting up dynamically allocated
backing_dev_info structures for filesystems and cleaning them up on
superblock destruction.

CC: linux-mtd@lists.infradead.org
CC: linux-nfs@vger.kernel.org
CC: Petr Vandrovec <petr@vandrovec.name>
CC: linux-nilfs@vger.kernel.org
CC: cluster-devel@redhat.com
CC: osd-dev@open-osd.org
CC: codalist@coda.cs.cmu.edu
CC: linux-afs@lists.infradead.org
CC: ecryptfs@vger.kernel.org
CC: linux-cifs@vger.kernel.org
CC: ceph-devel@vger.kernel.org
CC: linux-btrfs@vger.kernel.org
CC: v9fs-developer@lists.sourceforge.net
CC: lustre-devel@lists.lustre.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:09:55 -06:00
Benjamin Coddington
fbe77c30e9 NFS: move rw_mode to nfs_pageio_header
Let's try to have it in a cacheline in nfs4_proc_pgio_rpc_prepare().

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 14:00:41 -04:00
Benjamin Coddington
8ef9b0b9e1 NFS: move nfs_pgarray_set() to open code
Since commit 00bfa30abe ("NFS: Create a common pgio_alloc and
pgio_release function"), nfs_pgarray_set() has only a single caller.  Let's
open code it.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:55:16 -04:00
Benjamin Coddington
ae97aa524e NFS: Use GFP_NOIO for two allocations in writeback
Prevent a deadlock that can occur if we wait on allocations
that try to write back our pages.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: 00bfa30abe ("NFS: Create a common pgio_alloc and pgio_release...")
Cc: stable@vger.kernel.org # 3.16+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:55:10 -04:00
Fred Isaman
1f84ccdf37 NFS: Fix use after free in write error path
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Fixes: 0bcbf039f6 ("nfs: handle request add failure properly")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:51:52 -04:00
Benjamin Coddington
43b7d964ed NFS: Fix missing pg_cleanup after nfs_pageio_cond_complete()
Commit a7d42ddb30 ("nfs: add mirroring
support to pgio layer") moved pg_cleanup out of the path when there was
non-sequential I/O that needed to be flushed.  The result is that for
layouts that have more than one layout segment per file, the pg_lseg is not
cleared, so we can end up hitting the WARN_ON_ONCE(req_start >= seg_end) in
pnfs_generic_pg_test since the pg_lseg will be pointing to that
previously-flushed layout segment.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: a7d42ddb30 ("nfs: add mirroring support to pgio layer")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:49:22 -04:00
NeilBrown
518662e0fc NFS: fix usage of mempools.
When passed GFP flags that allow sleeping (such as
GFP_NOIO), mempool_alloc() will never return NULL, it will
wait until memory is available.

This means that we don't need to handle failure, but that we
do need to ensure one thread doesn't call mempool_alloc()
twice on the one pool without queuing or freeing the first
allocation.  If multiple threads did this during times of
high memory pressure, the pool could be exhausted and a
deadlock could result.

pnfs_generic_alloc_ds_commits() attempts to allocate from
the nfs_commit_mempool while already holding an allocation
from that pool.  This is not safe.  So change
nfs_commitdata_alloc() to take a flag that indicates whether
failure is acceptable.

In pnfs_generic_alloc_ds_commits(), accept failure and
handle it as we currently do.  Elsewhere, do not accept
failure, and do not handle it.

Even when failure is acceptable, we want to succeed if
possible.  That means both
 - using an entry from the pool if there is one
 - waiting for direct reclaim if there isn't.

We call mempool_alloc(GFP_NOWAIT) to achieve the first, then
kmem_cache_alloc(GFP_NOIO|__GFP_NORETRY) to achieve the
second.  Each of these can fail, but together they do the
best they can without blocking indefinitely.

The objects returned by kmem_cache_alloc() will still be freed
by mempool_free().  This is safe as mempool_alloc() uses
exactly the same function to allocate objects (since the mempool
was created with mempool_create_slab_pool()).  The objects returned
by mempool_alloc() and kmem_cache_alloc() are indistinguishable
so mempool_free() will handle both identically, either adding to the
pool or calling kmem_cache_free().

Also, don't test for failure when allocating from
nfs_wdata_mempool.
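
The two-stage "may fail" allocation and the shared free path described above
can be modelled in userspace roughly as below.  This is a simplified sketch
under stated assumptions: the reserve is a small fixed array, malloc() stands
in for the slab allocator, and the demo_* names are hypothetical.  The
internal ordering of the real mempool_alloc() is not modelled.

```c
#include <stdlib.h>

#define DEMO_POOL_MIN 2

/* Hypothetical, simplified model of a mempool: a small reserve of objects. */
struct demo_pool {
	void *reserve[DEMO_POOL_MIN];
	int nr;            /* objects currently held in the reserve */
	size_t obj_size;
};

/* Frees whatever is left in the reserve. */
static void demo_pool_destroy(struct demo_pool *p)
{
	while (p->nr > 0)
		free(p->reserve[--p->nr]);
}

/* Fills the reserve; returns 0 on success, -1 on allocation failure. */
static int demo_pool_init(struct demo_pool *p, size_t obj_size)
{
	p->nr = 0;
	p->obj_size = obj_size;
	while (p->nr < DEMO_POOL_MIN) {
		void *obj = malloc(obj_size);

		if (!obj) {
			demo_pool_destroy(p);
			return -1;
		}
		p->reserve[p->nr++] = obj;
	}
	return 0;
}

/* Stage 1: never waits; takes a reserved object if one is available. */
static void *demo_pool_alloc_nowait(struct demo_pool *p)
{
	return p->nr > 0 ? p->reserve[--p->nr] : NULL;
}

/*
 * Both stages: reserve first, then a single attempt at the backing
 * allocator.  May return NULL, so the caller must tolerate failure.
 */
static void *demo_alloc_may_fail(struct demo_pool *p)
{
	void *obj = demo_pool_alloc_nowait(p);

	return obj ? obj : malloc(p->obj_size);
}

/*
 * Objects from either stage are indistinguishable: refill the reserve if
 * it is below its minimum, otherwise hand the object back to the allocator.
 */
static void demo_pool_free(struct demo_pool *p, void *obj)
{
	if (p->nr < DEMO_POOL_MIN)
		p->reserve[p->nr++] = obj;
	else
		free(obj);
}
```

Because both stages produce objects of the same size from the same backing
allocator, a single free routine can serve both, which is the property the
paragraph above relies on.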

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:44:05 -04:00
Anna Schumaker
f6148713b2 NFS: Clean up nfs4_proc_get_lease_time()
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:35 -04:00
Anna Schumaker
e917f0d1ce NFS: Clean up _nfs4_proc_exchange_id()
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:35 -04:00
Anna Schumaker
c7ae763903 NFS: Clean up nfs4_proc_bind_one_conn_to_session()
Returning errors directly even lets us remove the goto.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:35 -04:00
Anna Schumaker
3183783bbb NFS: Remove extra dprintk()s from nfs4namespace.c
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:35 -04:00
Anna Schumaker
539fd1d1f4 NFS: Clean up nfs4_get_rootfh()
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:35 -04:00
Anna Schumaker
4fe6b366d9 NFS: Remove extra dprintk()s from nfs4client.c
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:35 -04:00
Anna Schumaker
1073d9b49a NFS: Clean up nfs4_init_server()
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:34 -04:00
Anna Schumaker
2dc42c0d60 NFS: Clean up nfs4_set_client()
If we cut out the dprintk()s, then we can return error codes directly
and cut out the goto.
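
The shape of this kind of cleanup can be illustrated with a small
stand-alone example.  The functions below are hypothetical and are not taken
from nfs4_set_client(); they only show how a single exit label that existed
for a trace message can give way to direct returns.

```c
#include <errno.h>

/* Hypothetical step standing in for a real setup call. */
static int demo_step(int should_fail)
{
	return should_fail ? -EINVAL : 0;
}

/* Before: a single exit label kept so a debug message could be printed. */
static int demo_setup_with_goto(int fail_first, int fail_second)
{
	int err;

	err = demo_step(fail_first);
	if (err)
		goto out;
	err = demo_step(fail_second);
out:
	/* a dprintk()-style trace of err would have lived here */
	return err;
}

/* After: with no trace to emit, each error is returned where it occurs. */
static int demo_setup_direct(int fail_first, int fail_second)
{
	int err = demo_step(fail_first);

	if (err)
		return err;
	return demo_step(fail_second);
}
```

Both variants return the same value for every input; only the control flow
differs.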

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:34 -04:00
Anna Schumaker
8da0f93438 NFS: Clean up nfs4_check_server_scope()
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:34 -04:00
Anna Schumaker
ddfa0d4860 NFS: Clean up nfs4_check_serverowner_major_id()
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:34 -04:00
Anna Schumaker
14d1bbb0ca NFS: Create a common nfs4_match_client() function
This puts all the common code in a single place for the
walk_client_list() functions.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:34 -04:00
Anna Schumaker
5b6d3ff605 NFS: Clean up nfs4_check_serverowner_minor_id()
Once again, we can remove the function and compare integer values
directly.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:34 -04:00
Anna Schumaker
f251fd9e71 NFS: Clean up nfs4_match_clientids()
If we cut out the dprintk()s, then we don't even need this to be a
separate function.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:34 -04:00
Anna Schumaker
5be1810a8d NFS: Clean up nfs42_layoutstat_done()
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:34 -04:00
Anna Schumaker
e36d48e9e2 NFS: Remove extra dprintk()s from namespace.c
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:34 -04:00
Anna Schumaker
fe4f844d49 NFS: Clean up nfs_direct_commit_complete()
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:33 -04:00
Anna Schumaker
beeb533801 NFS: Remove nfs_direct_readpage_release()
Just remove the function and have the caller use nfs_release_request()
instead.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:33 -04:00
Anna Schumaker
4cbb976821 NFS: Clean up extra dprintk()s in client.c
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:33 -04:00
Anna Schumaker
2844b6aecf NFS: Clean up nfs_init_client()
We always call nfs_mark_client_ready() even if nfs_create_rpc_client()
returns an error, so we can rearrange nfs_init_client() to mark the
client ready from a single place.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:33 -04:00
Anna Schumaker
36718a669e NFS: Remove extra dprintk()s from callback_xdr.c
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:33 -04:00
Anna Schumaker
3d0bfaa60d NFS: Clean up encode_cb_sequence_res()
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:33 -04:00
Anna Schumaker
535ece2b8e NFS: Clean up decode_notify_lock_args()
Let's cut out the goto and return any errors immediately.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:33 -04:00
Anna Schumaker
1796549ad4 NFS: Clean up decode_cb_sequence_args()
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:33 -04:00
Anna Schumaker
c79d56d214 NFS: Clean up decode_layoutrecall_args()
Additionally, this change lets us cut out the goto by returning errors
immediately.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:33 -04:00
Anna Schumaker
135a4ea0d9 NFS: Clean up decode_recall_args()
Removing the dprintk() lets us simplify the function by returning status
codes directly, rather than using a goto.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:32 -04:00
Anna Schumaker
56938bb77a NFS: Clean up decode_getattr_args()
Removing the dprintk() lets us return the status value directly, rather
than jumping to a label if an error occurs.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:32 -04:00
Anna Schumaker
be55f1bca7 NFS: Remove extra dprintk()s from callback_proc.c
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:32 -04:00
Anna Schumaker
5694a4f848 NFS: Clean up nfs4_callback_layoutrecall()
In addition to removing the dprintk(), this patch also initializes "res"
to the default return value instead of doing this through an else
condition.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:32 -04:00
Anna Schumaker
1a916ce049 NFS: Clean up do_callback_layoutrecall()
Removing the dprintk()s lets us simplify the function by removing the
else condition entirely and returning the status of
initiate_{file,bulk}_draining() directly.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:39:32 -04:00
Tigran Mkrtchyan
a7878ca140 nfs: flexfilelayout: remove v3-only data server limitation
Flexfilelayout supports data servers which talk NFS v3 and v4.{0,1,2}.
However, this code path is disabled and only v3 servers are accepted.
This change removes this limitation.

Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:35:24 -04:00
Benjamin Coddington
b044f64513 NFS: switch back to ->iterate()
NFS has some optimizations for readdir to choose between using READDIR or
READDIRPLUS based on workload, and which NFS operation to use is determined
by subsequent interactions with lookup, d_revalidate, and getattr.

Concurrent use of nfs_readdir() via ->iterate_shared() can cause those
optimizations to repeatedly invalidate the pagecache used to store
directory entries during readdir(), which causes some very bad performance
for directories with many entries (more than about 10000).

There are a couple of ways to fix this in NFS, but no fix would be as simple
as going back to ->iterate() to serialize nfs_readdir(), and neither fix I
tested performed as well as going back to ->iterate().

The first required taking the directory's i_lock for each entry, with the
result of terrible contention.

The second way adds another flag to the nfs_inode, and so keeps the
optimizations working for large directories.  The difference from using
->iterate() here is that much more memory is consumed for a given workload
without any performance gain.

The workings of nfs_readdir() are such that concurrent users are serialized
within read_cache_page() waiting to retrieve pages of entries from the
server.  By serializing this work in iterate_dir() instead, contention for
cache pages is reduced.  Waiting processes can have an uncontended pass at
the entirety of the directory's pagecache once previous processes have
completed filling it.

v2 - Keep the bits needed for parallel lookup

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-20 13:33:09 -04:00
Amir Goldstein
b0990fbbbd ovl: check IS_APPEND() on real upper inode
For overlay file open, check IS_APPEND() on the real upper inode
inside d_real(), because the overlay inode does not have the
S_APPEND flag and IS_APPEND() can only be checked at open time.

Note that because overlayfs does not copy up the chattr inode flags
(i.e. S_APPEND, S_IMMUTABLE), the IS_APPEND() check is only relevant
for upper inodes that were set with chattr +a and not to lower
inodes that had chattr +a before copy up.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-04-20 16:37:26 +02:00
Amir Goldstein
78757af651 vfs: ftruncate check IS_APPEND() on real upper inode
ftruncate of an overlayfs inode was checking IS_APPEND() on the
overlay inode, but the overlay inode does not have the S_APPEND flag.

Check IS_APPEND() on real upper inode instead.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-04-20 16:37:26 +02:00
Kees Cook
33006cdf9c ovl: Use designated initializers
Prepare to mark sensitive kernel structures for randomization by making
sure they're using designated initializers. These were identified during
allyesconfig builds of x86, arm, and arm64, with most initializer fixes
extracted from grsecurity.

For these cases, use { }, which will be zero-filled, instead of
undesignated NULLs.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-04-20 16:37:26 +02:00
David S. Miller
7b9f6da175 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
A function in kernel/bpf/syscall.c which got a bug fix in 'net'
was moved to kernel/bpf/verifier.c in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-20 10:35:33 -04:00
David Howells
b90fe0c4e0 Annotate hardware config module parameters in fs/pstore/
When the kernel is running in secure boot mode, we lock down the kernel to
prevent userspace from modifying the running kernel image.  Whilst this
includes prohibiting access to things like /dev/mem, it must also prevent
access by means of configuring driver modules in such a way as to cause a
device to access or modify the kernel image.

To this end, annotate module_param* statements that refer to hardware
configuration and indicate for future reference what type of parameter they
specify.  The parameter parser in the core sees this information and can
skip such parameters with an error message if the kernel is locked down.
The module initialisation then runs as normal, but just sees whatever the
default values for those parameters are.
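
The skip-on-lockdown behaviour described above can be modelled in a few
lines of user-space Python. This is only a sketch of the idea, not the
kernel's parameter parser: the function name parse_params, the tuple layout
and the parameter names used in the usage note are all hypothetical.

```python
# Illustrative model only; names and data layout are hypothetical.
def parse_params(declared, given, locked_down):
    """Apply user-supplied module parameters.

    declared: name -> (default_value, is_hw_config)
    given:    name -> value supplied by the user
    Returns (values, skipped), where skipped lists the hardware
    configuration parameters that were refused because of lockdown.
    """
    # Every parameter starts out at its default.
    values = {name: default for name, (default, _) in declared.items()}
    skipped = []
    for name, value in given.items():
        _, is_hw_config = declared[name]
        if locked_down and is_hw_config:
            skipped.append(name)  # the real parser would log an error here
            continue
        values[name] = value
    return values, skipped
```

With a declared set such as {"mem_address": (0, True), "record_size":
(4096, False)}, a locked-down kernel would keep mem_address at its default
while still honouring record_size, and module initialisation would then
proceed with those values.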

Note that we do still need to do the module initialisation because some
drivers have viable defaults set in case parameters aren't specified and
some drivers support automatic configuration (e.g. PNP or PCI) in addition
to manually coded parameters.

This patch annotates drivers in fs/pstore/.

Suggested-by: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
cc: Anton Vorontsov <anton@enomsg.org>
cc: Colin Cross <ccross@android.com>
cc: Tony Luck <tony.luck@intel.com>
2017-04-20 12:02:32 +01:00
Linus Torvalds
4988f7a40f Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fix from Steve French:
 "One more cifs fix for stable"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Do not send echoes before Negotiate is complete
2017-04-19 17:12:46 -07:00
Cong Wang
073c516ff7 nsfs: mark dentry with DCACHE_RCUACCESS
Andrey reported a use-after-free in __ns_get_path():

  spin_lock include/linux/spinlock.h:299 [inline]
  lockref_get_not_dead+0x19/0x80 lib/lockref.c:179
  __ns_get_path+0x197/0x860 fs/nsfs.c:66
  open_related_ns+0xda/0x200 fs/nsfs.c:143
  sock_ioctl+0x39d/0x440 net/socket.c:1001
  vfs_ioctl fs/ioctl.c:45 [inline]
  do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:685
  SYSC_ioctl fs/ioctl.c:700 [inline]
  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691

We are under rcu read lock protection at that point:

        rcu_read_lock();
        d = atomic_long_read(&ns->stashed);
        if (!d)
                goto slow;
        dentry = (struct dentry *)d;
        if (!lockref_get_not_dead(&dentry->d_lockref))
                goto slow;
        rcu_read_unlock();

but don't use a proper RCU API on the free path, therefore a parallel
__d_free() could free it at the same time.  We need to mark the stashed
dentry with DCACHE_RCUACCESS so that __d_free() will be called after all
readers leave RCU.

Fixes: e149ed2b80 ("take the targets of /proc/*/ns/* symlinks to separate fs")
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-19 15:56:24 -07:00
Colin Ian King
8918821f37 jffs2: fix spelling mistake: "requestied" -> "requested"
trivial fix to spelling mistake in JFFS2_ERROR message

Signed-off-by: Colin Ian King <colin.king@canonical.com>
[Brian: also fix 'an' -> 'a']
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
2017-04-19 11:35:55 -07:00
Hou Pengyang
04485987f0 f2fs: introduce async IPU policy
This patch introduces an ASYNC IPU policy.

In scenarios with a large number of async updates (e.g. log writing in Android),
the disk would be seriously fragmented, and gc would be triggered more frequently.

This patch uses IPU to rewrite async update writes, since async IO is
NOT sensitive to io latency.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
2017-04-19 11:00:46 -07:00
Chao Yu
d84d1cbdec f2fs: add undiscard blocks stat
This patch adds accounting of undiscarded blocks.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
2017-04-19 11:00:45 -07:00
Chao Yu
001c584cca f2fs: unlock cp_rwsem early for IPU writes
For IPU writes, there won't be any updates in dnode page since we
will reuse old block address instead of allocating new one, so we
don't need to lock cp_rwsem during IPU IO submitting.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
2017-04-19 11:00:44 -07:00
Chao Yu
df0f6b44dd f2fs: introduce __check_rb_tree_consistence
Introduce __check_rb_tree_consistence to check the consistency of the
rb-tree based discard cache at runtime.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-19 11:00:44 -07:00
Chao Yu
0243a5f9da f2fs: trace __submit_discard_cmd
Add an event class f2fs_discard for introducing f2fs_queue_discard, then
use f2fs_{queue,issue}_discard to trace __{queue,submit}_discard_cmd.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-19 11:00:43 -07:00
Chao Yu
ba48a33ef6 f2fs: in prior to issue big discard
Issue big discards with priority over randomly sized ones, which we
expect will help to:
- quickly recycle unused large space in the flash storage device.
- give a chance to
  a) wait to merge small discards into a bigger one, or
  b) avoid issuing discards for space that has been reallocated by SSR.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-19 11:00:42 -07:00
Chao Yu
46f84c2c05 f2fs: clean up discard_cmd_control structure
Avoid long variable name in discard_cmd_control structure, no logic
change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-19 11:00:41 -07:00
Chao Yu
004b686218 f2fs: use rb-tree to track pending discard commands
Introduce rb-tree based discard cache infrastructure to speed up lookup and
merge operation of discard entry.
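
The lookup-and-merge idea can be sketched in user-space Python. This is an
illustrative model, not the f2fs code: a sorted list searched with bisect
stands in for the rb-tree, and the function name insert_discard and the
(start, length) tuples are assumptions made for the example.

```python
import bisect


# Illustrative model only: a sorted list stands in for the rb-tree.
def insert_discard(ranges, start, length):
    """Insert the range [start, start + length) into `ranges`.

    `ranges` is a sorted list of non-overlapping (start, length) tuples.
    A new range that touches its neighbour on either side is merged with
    it, so adjacent small discards collapse into one bigger entry.
    """
    # Ordered lookup of the insertion point (what the tree makes cheap).
    i = bisect.bisect_left(ranges, (start, 0))

    # Merge with the previous entry if it ends exactly where we start.
    if i > 0 and ranges[i - 1][0] + ranges[i - 1][1] == start:
        i -= 1
        start, length = ranges[i][0], ranges[i][1] + length
        del ranges[i]

    # Merge with the next entry if we end exactly where it starts.
    if i < len(ranges) and start + length == ranges[i][0]:
        length += ranges[i][1]
        del ranges[i]

    ranges.insert(i, (start, length))
    return ranges
```

Inserting (10, 5), then (20, 5), then (15, 5) leaves a single entry
(10, 15), because the last range bridges the gap between the first two.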

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: initialize dc to avoid build warning]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-19 11:00:40 -07:00
Bob Peterson
d552a2b9b3 GFS2: Non-recursive delete
Implement truncate/delete as a non-recursive algorithm. The older
algorithm was implemented with recursion to strip off each layer
at a time (going by height, starting with the maximum height).
This version tries to do the same thing but without recursion,
and without needing to allocate new structures or lists in memory.

For example, say you want to truncate a very large file to 1 byte,
and its end-of-file metapath is: 0.505.463.428. The starting
metapath would be 0.0.0.0. Since it's a truncate to non-zero, it
needs to preserve that byte, and all metadata pointing to it.
So it would start at 0.0.0.0, look up all its metadata buffers,
then free all data blocks pointed to at the highest level.
After that buffer is "swept", it moves on to 0.0.0.1, then
0.0.0.2, etc., reading in buffers and sweeping them clean.
When it gets to the end of the 0.0.0 metadata buffer (for 4K
blocks the last valid one is 0.0.0.508), it backs up to the
previous height and starts working on 0.0.1.0, then 0.0.1.1,
and so forth. After it reaches the end and sweeps 0.0.1.508,
it continues with 0.0.2.0, and so on. When that height is
exhausted, and it reaches 0.0.508.508 it backs up another level,
to 0.1.0.0, then 0.1.0.1, through 0.1.0.508. So it has to keep
marching backwards and forwards through the metadata until it's
all swept clean. Once it has all the data blocks freed, it
lowers the strip height, and begins the process all over again,
but with one less height. This time it sweeps 0.0.0 through
0.505.463. When that's clean, it lowers the strip height again
and works to free 0.505. Eventually it strips the lowest height, 0.
For a delete or truncate to 0, all metadata for all heights of
0.0.0.0 would be freed. For a truncate to 1 byte, 0.0.0.0 would
be preserved.

This isn't much different from normal integer incrementing,
where an integer gets incremented from 0000 (0.0.0.0) to 3021
(3.0.2.1). So 0000 gets incremented to 0001, 0002, up to 0009,
then on to 0010, 0011 up to 0099, then 0100 and so forth. It's
just that each "digit" goes from 0 to 508 (for a total of 509
pointers) rather than from 0 to 9.

Note that the dinode will only have 483 pointers due to the
dinode structure itself.

Also note: this is just an example. These numbers (509 and 483)
are based on a standard 4K block size. Smaller block sizes will
yield smaller numbers of indirect pointers accordingly.
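
The odometer analogy can be sketched in user-space Python. This is only an
illustrative model, not the GFS2 code: the names next_metapath and ptrs_at
and the list representation are assumptions, and the radices 483 (dinode,
height 0) and 509 (indirect blocks) are the 4K-block figures quoted above.

```python
# Illustrative model only; not taken from fs/gfs2.
DINODE_PTRS = 483     # pointers in the dinode (height 0), 4K blocks
INDIRECT_PTRS = 509   # pointers in each indirect block, 4K blocks


def ptrs_at(height):
    """Number of pointers at a given height (hypothetical helper)."""
    return DINODE_PTRS if height == 0 else INDIRECT_PTRS


def next_metapath(mp):
    """Return the metapath that follows `mp`, or None when exhausted.

    Works like an odometer: bump the last index and carry into the
    previous height whenever an index runs past its last valid pointer.
    """
    mp = list(mp)
    for height in range(len(mp) - 1, -1, -1):
        mp[height] += 1
        if mp[height] < ptrs_at(height):
            return mp
        mp[height] = 0  # wrapped: carry into the previous height
    return None
```

For instance, next_metapath([0, 0, 0, 508]) yields [0, 0, 1, 0], matching
the step from 0.0.0.508 to 0.0.1.0 in the walk-through. The model ignores
zero pointers and the lowering of the strip height, which the real state
machine has to handle.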

The truncation process is accomplished with the help of two
major functions and a few helper functions.

Functions do_strip and recursive_scan are obsolete, so removed.

New function sweep_bh_for_rgrps cleans a buffer_head pointed to
by the given metapath and height. By cleaning, I mean it frees
all blocks starting at the offset passed in metapath. It starts
at the first block in the buffer pointed to by the metapath and
identifies its resource group (rgrp). From there it frees all
subsequent block pointers that lie within that rgrp. If it's
already inside a transaction, it stays within it as long as it
can. In other words, it doesn't close a transaction until it knows
it's freed what it can from the resource group. In this way,
multiple buffers may be cleaned in a single transaction, as long
as those blocks in the buffer all lie within the same rgrp.

If it's not in a transaction, it starts one. If the buffer_head
has references to blocks within multiple rgrps, it frees all the
blocks inside the first rgrp it finds, then closes the
transaction. Then it repeats the cycle: identifies the next
unfreed block, uses it to find its rgrp, then starts a new
transaction for that set. It repeats this process
until the buffer_head contains no more references to any blocks
past the given metapath.

Function trunc_dealloc has been reworked into a finite state
automaton. It has basically 3 active states:
DEALLOC_MP_FULL, DEALLOC_MP_LOWER, and DEALLOC_FILL_MP:

The DEALLOC_MP_FULL state implies the metapath has a full set
of buffers out to the "shrink height", and therefore, it can
call function sweep_bh_for_rgrps to free the blocks within the
highest height of the metapath. If it's just swept the lowest
level (or an error has occurred) the state machine is ended.
Otherwise it proceeds to the DEALLOC_MP_LOWER state.

The DEALLOC_MP_LOWER state implies we are finished with a given
buffer_head, which may now be released, and therefore we are
then missing some buffer information from the metapath. So we
need to find more buffers to read in. In most cases, this is
just a matter of releasing the buffer_head and moving to the
next pointer from the previous height, so it may be read in and
swept as well. If it can't find another non-null pointer to
process, it checks whether it's reached the end of a height
and needs to lower the strip height, or whether it still needs
to move forward through the previous height's metadata. In this
state, all zero-pointers are skipped. From this state, it can
only loop around (once more backing up another height) or,
once a valid metapath is found (one that has non-zero
pointers), proceed to state DEALLOC_FILL_MP.

The DEALLOC_FILL_MP state implies that we have a metapath
but not all its buffers are read in. So we must proceed to read
in buffer_heads until the metapath has a valid buffer for every
height. If the previous state backed us up 3 heights, we may
need to read in a buffer, increment the height, then repeat the
process until buffers have been read in for all required heights.
If it's successful reading a buffer, and it's at the highest
height we need, it proceeds back to the DEALLOC_MP_FULL state.
If it's unable to fill in a buffer, (encounters a hole, etc.)
it tries to find another non-zero block pointer. If they're all
zero, it lowers the height and returns to the DEALLOC_MP_LOWER
state. If it finds a good non-null pointer, it loops around and
reads it in, while keeping the metapath in lock-step with the
pointers it examines.

The state machine runs until the truncation request is
satisfied. Then any transactions are ended, the quota and
statfs data are updated, and the function is complete.

Helper function metaptr1 was introduced to be an easy way to
determine the start of a buffer_head's indirect pointers.

Helper function lookup_mp_height was introduced to find a
metapath index and read in the buffer that corresponds to it.
In this way, function lookup_metapath becomes a simple loop to
call it for every height.

Helper function fillup_metapath is similar to lookup_metapath
except it can do partial lookups. If the state machine
backed up multiple levels (like 2999 wrapping to 3000) it
needs to find out the next starting point and start issuing
metadata reads at that point.

Helper function hptrs is a shortcut to determine how many
pointers should be expected in a buffer. Height 0 is the dinode
which has fewer pointers than the others.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-04-19 08:25:43 -04:00
Jan Kara
139c279fb9 quota: Remove dquot_quotactl_ops
Nobody uses them anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-19 14:21:23 +02:00
Jan Kara
a480b5bebd reiserfs: Remove i_attrs_to_sd_attrs()
Now that all places setting inode->i_flags that should be reflected in
on-disk flags are gone, we can remove i_attrs_to_sd_attrs() call.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-19 14:21:23 +02:00
Jan Kara
a73415a8a5 reiserfs: Remove useless setting of i_flags
reiserfs_new_inode() clears IMMUTABLE and APPEND flags from a symlink's
i_flags; however, a few lines below in sd_attrs_to_i_attrs() we will
happily overwrite i_flags with whatever we inherited from the directory.
Since this behavior has been there for ages, just remove the useless
setting of i_flags.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-19 14:21:23 +02:00
Jan Kara
7ba4a2e8b8 jfs: Remove jfs_get_inode_flags()
Now that all places setting inode->i_flags that should be reflected in
on-disk flags are gone, we can remove jfs_get_inode_flags() call.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-19 14:21:23 +02:00
Jan Kara
420768d319 ext2: Remove ext2_get_inode_flags()
Now that all places setting inode->i_flags that should be reflected in
on-disk flags are gone, we can remove ext2_get_inode_flags() call.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-19 14:21:23 +02:00
Jan Kara
38eae95ddc ext4: Remove ext4_get_inode_flags()
Now that all places setting inode->i_flags that should be reflected in
on-disk flags are gone, we can remove ext4_get_inode_flags() call.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-19 14:21:23 +02:00
Jan Kara
aad6cde9ad quota: Stop setting IMMUTABLE and NOATIME flags on quota files
Currently we set IMMUTABLE and NOATIME flags on quota files to stop
userspace from messing with them. Now that all filesystems set these
flags in their quota_on handlers, we can stop setting the flags in
generic quota code. This will allow filesystems to stop copying i_flags
to their on-disk flags on various occasions.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-19 14:21:23 +02:00
Jan Kara
12fd086d39 jfs: Set flags on quota files directly
Currently immutable and noatime flags on quota files are set by quota
code which requires us to copy inode->i_flags to our on disk version
of quota flags in GETFLAGS ioctl and copy_to_dinode(). Move to
setting / clearing these on-disk flags directly to save that copying.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-19 14:21:23 +02:00
Jan Kara
161f3b7447 ext2: Set flags on quota files directly
Currently immutable and noatime flags on quota files are set by quota
code which requires us to copy inode->i_flags to our on disk version of
quota flags in GETFLAGS ioctl and __ext2_write_inode().  Move to setting
/ clearing these on-disk flags directly to save that copying.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-19 14:21:23 +02:00
Jan Kara
33eb928a9e reiserfs: Set flags on quota files directly
Currently immutable and noatime flags on quota files are set by quota
code which requires us to copy inode->i_flags to our on disk version of
quota flags in GETFLAGS ioctl and when writing stat item. Move to
setting / clearing these on-disk flags directly to save that copying.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-19 14:21:23 +02:00
Jan Kara
957153fce8 ext4: Set flags on quota files directly
Currently immutable and noatime flags on quota files are set by quota
code which requires us to copy inode->i_flags to our on disk version of
quota flags in GETFLAGS ioctl and ext4_do_update_inode(). Move to
setting / clearing these on-disk flags directly to save that copying.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-19 14:21:23 +02:00
David Sterba
338bd52f3c btrfs: qgroup: move noisy underflow warning to debugging build
The WARN_ON and warning from report_reserved_underflow can become very
noisy and are visible unconditionally, although they are mainly for
debugging. The patch "btrfs: Add WARN_ON for qgroup reserved underflow"
(18dc22c19b) went to 4.11-rc1 and the plan
was to get the fix as well, but this hasn't happened.

CC: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-19 12:40:49 +02:00
James Morris
fa5b5b26e2 Merge branch 'stable-4.12' of git://git.infradead.org/users/pcmoore/selinux into next
2017-04-19 08:30:08 +10:00
Richard Weinberger
32fe905c17 ubifs: Fix O_TMPFILE corner case in ubifs_link()
It is perfectly fine to link a tmpfile back using linkat().
Since tmpfiles are created with a link count of 0, they appear
on the orphan list; upon re-linking, the inode has to be removed
from the orphan list again.

Ralph faced a filesystem corruption in combination with overlayfs
due to this bug.

Cc: <stable@vger.kernel.org>
Cc: Ralph Sennhauser <ralph.sennhauser@gmail.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Reported-by: Ralph Sennhauser <ralph.sennhauser@gmail.com>
Tested-by: Ralph Sennhauser <ralph.sennhauser@gmail.com>
Reported-by: Amir Goldstein <amir73il@gmail.com>
Fixes: 474b93704f ("ubifs: Implement O_TMPFILE")
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-04-18 23:18:02 +02:00
Jaegeuk Kim
d40d30c5aa f2fs: avoid dirty node pages in check_only recovery
In the check_only mode, we should not make any dirty node pages. Otherwise,
we can get this panic:

F2FS-fs (nvme0n1p1): Need to recover fsync data
------------[ cut here ]------------
kernel BUG at fs/f2fs/node.c:2204!
CPU: 7 PID: 19923 Comm: mount Tainted: G           OE   4.9.8 #2
RIP: 0010:[<ffffffffc0979c0b>]  [<ffffffffc0979c0b>] flush_nat_entries+0x43b/0x7d0 [f2fs]
Call Trace:
 [<ffffffffc096ddaa>] ? __f2fs_submit_merged_bio+0x5a/0xd0 [f2fs]
 [<ffffffffc096ddaa>] ? __f2fs_submit_merged_bio+0x5a/0xd0 [f2fs]
 [<ffffffffc096dddb>] ? __f2fs_submit_merged_bio+0x8b/0xd0 [f2fs]
 [<ffffffff860e450f>] ? up_write+0x1f/0x40
 [<ffffffffc096dddb>] ? __f2fs_submit_merged_bio+0x8b/0xd0 [f2fs]
 [<ffffffffc0969f04>] write_checkpoint+0x2f4/0xf20 [f2fs]
 [<ffffffff860e938d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffffc0960bc9>] ? f2fs_sync_fs+0x79/0x190 [f2fs]
 [<ffffffffc0960bc9>] ? f2fs_sync_fs+0x79/0x190 [f2fs]
 [<ffffffffc0960bd5>] f2fs_sync_fs+0x85/0x190 [f2fs]
 [<ffffffffc097b6de>] f2fs_balance_fs_bg+0x7e/0x1c0 [f2fs]
 [<ffffffffc0977b64>] f2fs_write_node_pages+0x34/0x350 [f2fs]
 [<ffffffff860e5f42>] ? __lock_is_held+0x52/0x70
 [<ffffffff861d9b31>] do_writepages+0x21/0x30
 [<ffffffff86298ce1>] __writeback_single_inode+0x61/0x760
 [<ffffffff86909127>] ? _raw_spin_unlock+0x27/0x40
 [<ffffffff8629a735>] writeback_single_inode+0xd5/0x190
 [<ffffffff8629a889>] write_inode_now+0x99/0xc0
 [<ffffffff86283876>] iput+0x1f6/0x2c0
 [<ffffffffc0964b52>] f2fs_fill_super+0xc32/0x10c0 [f2fs]
 [<ffffffff86266462>] mount_bdev+0x182/0x1b0
 [<ffffffffc0963f20>] ? f2fs_commit_super+0x100/0x100 [f2fs]
 [<ffffffffc0960da5>] f2fs_mount+0x15/0x20 [f2fs]
 [<ffffffff86266e08>] mount_fs+0x38/0x170
 [<ffffffff86288bab>] vfs_kern_mount+0x6b/0x160
 [<ffffffff8628bcfe>] do_mount+0x1be/0xd60

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-18 13:37:49 -07:00
Paul E. McKenney
5f0d5a3ae7 mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU
A group of Linux kernel hackers reported chasing a bug that resulted
from their assumption that SLAB_DESTROY_BY_RCU provided an existence
guarantee, that is, that no block from such a slab would be reallocated
during an RCU read-side critical section.  Of course, that is not the
case.  Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
slab of blocks.

However, there is a phrase for this, namely "type safety".  This commit
therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
to avoid future instances of this sort of confusion.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
[ paulmck: Add comments mentioning the old name, as requested by Eric
  Dumazet, in order to help people familiar with the old name find
  the new one. ]
Acked-by: David Rientjes <rientjes@google.com>
2017-04-18 11:42:36 -07:00
Seth Forshee
0b6e9ea041 fuse: Add support for pid namespaces
When the userspace process servicing fuse requests is running in
a pid namespace then pids passed via the fuse fd are not being
translated into that process' namespace. Translation is necessary
for the pid to be useful to that process.

Since no use case currently exists for changing namespaces all
translations can be done relative to the pid namespace in use
when fuse_conn_init() is called. For fuse this translates to
mount time, and for cuse this is when /dev/cuse is opened. IO for
this connection from another namespace will return errors.

Requests from processes whose pid cannot be translated into the
target namespace will have a value of 0 for in.h.pid.

File locking changes based on previous work done by Eric
Biederman.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-04-18 16:58:38 +02:00
Elena Reshetova
095fc40ace fuse: convert fuse_conn.count from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.
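
The difference between a wrapping counter and a saturating one can be
shown with a small user-space Python model. This is a deliberate
simplification, not the kernel API: the function names are stand-ins, and
the real refcount_t helpers also emit warnings and cover more cases than
the single increment shown here.

```python
# Simplified model; not the kernel implementation.
UINT_MAX = 0xFFFFFFFF


def atomic_inc(value):
    """Plain 32-bit counter: wraps around to 0 on overflow."""
    return (value + 1) & UINT_MAX


def refcount_inc(value):
    """Saturating counter: once at the maximum, it stays there."""
    return value if value == UINT_MAX else value + 1


print(atomic_inc(UINT_MAX))                # 0
print(refcount_inc(UINT_MAX) == UINT_MAX)  # True
```

A counter that wraps to 0 looks unreferenced while users still hold
references, so a later put could free the object under them. Saturating
leaks the object instead, which is the safer failure.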

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-04-18 16:58:37 +02:00
Elena Reshetova
ec99f6d31f fuse: convert fuse_req.count from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-04-18 16:58:37 +02:00
Elena Reshetova
4e8c2eb543 fuse: convert fuse_file.count from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-04-18 16:58:37 +02:00
Anand Jain
c2a9c7ab47 btrfs: check if the device is flush capable
The block layer call chain from submit_bio will check if the write cache
is enabled for the given queue before submitting the flush. This patch
adds code to fail fast if it's not.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated changelog to reflect current code state, blkdev_issue_flush is
  not used yet ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 16:13:27 +02:00
Anand Jain
13e88e1560 btrfs: delete unused member nobarriers
The last consumer of nobarriers was removed by commit [1], and sync
won't fail with EOPNOTSUPP anymore. Thus, when the write cache is write
through, it now just returns success without actually sending such a
request to the block device/lun.

[1]
commit b25de9d6da
block: remove BIO_EOPNOTSUPP

And, as the device/lun write cache state may change dynamically, saving
such state won't help either. So delete the member nobarriers.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 16:12:07 +02:00
Qu Wenruo
28d70e237d btrfs: scrub: Fix RAID56 recovery race condition
When scrubbing a RAID5 which has recoverable data corruption (only one
data stripe is corrupted), sometimes scrub will report more csum errors
than expected. Sometimes even an unrecoverable error will be reported.

The problem can be easily reproduced by the following steps:
1) Create a btrfs with RAID5 data profile with 3 devs
2) Mount it with nospace_cache or space_cache=v2
   To avoid extra data space usage.
3) Create a 128K file and sync the fs, unmount it
   Now the 128K file lies at the beginning of the data chunk
4) Locate the physical bytenr of data chunk on dev3
   Dev3 is the 1st data stripe.
5) Corrupt the first 64K of the data chunk stripe on dev3
6) Mount the fs and scrub it

The correct csum error count should be 16 (assuming x86_64).
A larger csum error count can be reported with a 1/3 chance,
and an unrecoverable error can also be reported with a 1/10 chance.
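
The expected count of 16 follows from simple arithmetic, assuming (this is
not stated above) that one csum covers one 4 KiB sector, which is the usual
sector size on x86_64:

```python
corrupted_bytes = 64 * 1024   # first 64K of the stripe on dev3 (step 5)
sector_size = 4 * 1024        # assumed: 4 KiB sectors, one csum per sector

expected_csum_errors = corrupted_bytes // sector_size
print(expected_csum_errors)   # 16
```

Any count above this means the same corrupted sectors were accounted more
than once.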

The root cause of the problem is that the RAID5/6 recovery code has a race
condition, due to the fact that a full scrub is initiated per device.

For other mirror based profiles, by contrast, the mirrors are independent
of each other, so a race won't cause any big problem.

For example:
        Corrupted       |       Correct          |      Correct        |
|   Scrub dev3 (D1)     |    Scrub dev2 (D2)     |    Scrub dev1(P)    |
------------------------------------------------------------------------
Read out D1             |Read out D2             |Read full stripe     |
Check csum              |Check csum              |Check parity         |
Csum mismatch           |Csum match, continue    |Parity mismatch      |
handle_errored_block    |                        |handle_errored_block |
 Read out full stripe   |                        | Read out full stripe|
 D1 csum error(err++)   |                        | D1 csum error(err++)|
 Recover D1             |                        | Recover D1          |

So D1's csum error is accounted twice, just because
handle_errored_block() doesn't have enough protection, and a race can happen.

In an even worse case, for example when D1's recovery code is re-writing
D1/D2/P while P's recovery code is just reading out the full stripe, we
can cause an unrecoverable error.

This patch uses the previously introduced lock_full_stripe() and
unlock_full_stripe() to protect the whole scrub_handle_errored_block()
function for RAID56 recovery, so that neither extra csum errors nor
unrecoverable errors are reported.

Reported-by: Goffredo Baroncelli <kreijack@libero.it>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:27 +02:00
Qu Wenruo
0966a7b130 btrfs: scrub: Introduce full stripe lock for RAID56
Unlike mirror based profiles, RAID5/6 recovery needs to read out the
whole full stripe.

And if we don't do proper protection, it can easily cause a race condition.

Introduce 2 new functions for RAID5/6: lock_full_stripe() and
unlock_full_stripe().
They maintain an rb_tree of mutexes for full stripes, so scrub callers can
use them to lock a full stripe to avoid races.
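
The locking scheme can be modelled in user-space Python. This is a sketch
under stated assumptions, not the btrfs implementation: a dict stands in
for the rb_tree, the class name FullStripeLocks and its reference counting
are hypothetical, and full stripes are assumed to start at multiples of the
full stripe length.

```python
import threading


# Illustrative model only: a dict stands in for the rb_tree of mutexes.
class FullStripeLocks:
    def __init__(self, full_stripe_len):
        self.full_stripe_len = full_stripe_len
        self._guard = threading.Lock()   # protects the dict itself
        self._locks = {}                 # full stripe start -> [mutex, refs]

    def _start(self, bytenr):
        # Assumed: full stripes are aligned to the full stripe length.
        return bytenr - bytenr % self.full_stripe_len

    def lock_full_stripe(self, bytenr):
        start = self._start(bytenr)
        with self._guard:
            entry = self._locks.setdefault(start, [threading.Lock(), 0])
            entry[1] += 1                # count holders and waiters
        entry[0].acquire()               # block until the stripe is free
        return start

    def unlock_full_stripe(self, bytenr):
        start = self._start(bytenr)
        with self._guard:
            entry = self._locks[start]
            entry[1] -= 1
            if entry[1] == 0:            # nobody else is interested
                del self._locks[start]
        entry[0].release()
```

Two scrub threads that hit different blocks of the same full stripe map to
the same mutex, so their recovery work is serialized, while threads working
on other full stripes are unaffected.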

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:27 +02:00
Deepa Dinamani
fa7aede2ab btrfs: Use ktime_get_real_ts for root ctime
btrfs_root_item maintains the ctime for root updates.  This is not part
of vfs_inode.

Since current_time() uses struct inode* as an argument as Linus
suggested, this cannot be used to update root times unless we modify
the signature to use inode.

Since btrfs uses nanosecond time granularity, it can also use
ktime_get_real_ts directly to obtain timestamp for the root. It is
necessary to use the timespec time api here because the same
btrfs_set_stack_timespec_*() apis are used for vfs inode times as well.
These can be transitioned to using timespec64 when btrfs internally
changes to use timespec64 as well.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:27 +02:00
Dan Carpenter
9986277e0e Btrfs: handle only applicable errors returned by btrfs_get_extent
btrfs_get_extent() never returns NULL pointers, so this code introduces
a static checker warning.

The btrfs_get_extent() is a bit complex, but trust me that it doesn't
return NULLs and also if it did we would trigger the BUG_ON(!em) before
the last return statement.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
[ updated subject ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:27 +02:00
Qu Wenruo
82bafb38c2 btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option
[BUG]
The easiest way to reproduce the bug is:
------
 # mkfs.btrfs -f $dev -n 16K
 # mount $dev $mnt -o inode_cache
 # btrfs quota enable $mnt
 # btrfs quota rescan -w $mnt
 # btrfs qgroup show $mnt
qgroupid         rfer         excl
--------         ----         ----
0/5          32.00KiB     32.00KiB
             ^^ Twice the correct value
------

The fstests btrfs qgroup test group can easily detect this with the
inode_cache mount option, although some of the failures are false alerts
because old test cases compare against fixed golden output.
New test cases use "btrfs check" to detect qgroup mismatches.

[CAUSE]
The inode_cache mount option makes commit_fs_roots() call
btrfs_save_ino_cache() to update fs/subvol trees, which generates new
delayed refs.

However, we call btrfs_qgroup_prepare_account_extents() too early, before
commit_fs_roots().
This makes the "old_roots" for newly generated extents always NULL.
For the extent-freeing case, this leaves both new_roots and old_roots
empty, while the correct old_roots should not be empty.
This causes qgroup numbers not to be decreased correctly.

[FIX]
Modify the timing of calling btrfs_qgroup_prepare_account_extents() to
just before btrfs_qgroup_account_extents(), and add needed delayed_refs
handler.
This lets qgroup handle the inode_cache mount option correctly.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Anand Jain
e884f4f06e btrfs: use q which is already obtained from bdev_get_queue
We have already assigned q from bdev_get_queue() so use it.
Also rearrange the code for readability.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Liu Bo
42c61ab676 Btrfs: switch to div64_u64 if with a u64 divisor
This is fixing code pieces where we use div_u64 when passing a u64 divisor.
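
For illustration only, a minimal userspace sketch of why the helper
choice matters. The two functions below are stand-ins written for this
note, not the kernel helpers themselves; they only mirror the parameter
widths (a 32-bit divisor for div_u64, a 64-bit divisor for div64_u64).
A u64 divisor passed to the 32-bit variant is silently truncated.

```c
#include <stdint.h>

/* Stand-in mirroring div_u64(): the divisor parameter is only 32 bits. */
static uint64_t div_u64_sketch(uint64_t dividend, uint32_t divisor)
{
	return dividend / divisor;
}

/* Stand-in mirroring div64_u64(): both operands are 64 bits wide. */
static uint64_t div64_u64_sketch(uint64_t dividend, uint64_t divisor)
{
	return dividend / divisor;
}
```

With a divisor of (1ULL << 32) + 2, the 32-bit variant ends up dividing
by 2, while the 64-bit variant divides by the intended value.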

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Liu Bo
972d721939 Btrfs: update scrub_parity to use u64 stripe_len
Commit 3d8da67817 ("Btrfs: fix divide error upon chunk's stripe_len")
changed stripe_len in struct map_lookup to u64, but didn't update
stripe_len in struct scrub_parity.

This updates the type and switches to div64_u64_rem to match u64 divisor.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Liu Bo
c725328c55 Btrfs: enable repair during read for raid56 profile
Now that scrub can fix data errors with the help of parity for the raid56
profile, repair during read can do so as well.

Although the mirror num has a different meaning in the raid56 scenario, i.e.
0 or 1: read data directly
> 1:    recover with parity,
it fits into how we repair bad blocks during read.

The trick is to use BTRFS_MAP_READ instead of BTRFS_MAP_WRITE to get the
device and position on it.

Cc: David Sterba <dsterba@suse.cz>
Tested-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
David Sterba
619a974292 btrfs: use clear_page where appropriate
There's a helper to clear a whole page, with arch-specific optimized
code. The replaced cases do not seem to be in performance-critical code,
but we still might get some percent gain.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Qu Wenruo
e501bfe323 btrfs: Prevent scrub recheck from racing with dev replace
scrub_setup_recheck_block() calls btrfs_map_sblock() and then accesses
bbio without protection of bio_counter.

This can lead to use-after-free if racing with dev replace cancel.

Fix it by increasing bio_counter before calling btrfs_map_sblock() and
decreasing the bio_counter when corresponding recover is finished.

Cc: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Qu Wenruo
ae6529c35b btrfs: Wait for in-flight bios before freeing target device for raid56
When raid56 dev-replace is cancelled by running scrub, we will free
target device without waiting for in-flight bios, causing the following
NULL pointer dereference or general protection fault.

 BUG: unable to handle kernel NULL pointer dereference at 00000000000005e0
 IP: generic_make_request_checks+0x4d/0x610
 CPU: 1 PID: 11676 Comm: kworker/u4:14 Tainted: G  O    4.11.0-rc2 #72
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
 Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
 task: ffff88002875b4c0 task.stack: ffffc90001334000
 RIP: 0010:generic_make_request_checks+0x4d/0x610
 Call Trace:
  ? generic_make_request+0xc7/0x360
  generic_make_request+0x24/0x360
  ? generic_make_request+0xc7/0x360
  submit_bio+0x64/0x120
  ? page_in_rbio+0x4d/0x80 [btrfs]
  ? rbio_orig_end_io+0x80/0x80 [btrfs]
  finish_rmw+0x3f4/0x540 [btrfs]
  validate_rbio_for_rmw+0x36/0x40 [btrfs]
  raid_rmw_end_io+0x7a/0x90 [btrfs]
  bio_endio+0x56/0x60
  end_workqueue_fn+0x3c/0x40 [btrfs]
  btrfs_scrubparity_helper+0xef/0x620 [btrfs]
  btrfs_endio_raid56_helper+0xe/0x10 [btrfs]
  process_one_work+0x2af/0x720
  ? process_one_work+0x22b/0x720
  worker_thread+0x4b/0x4f0
  kthread+0x10f/0x150
  ? process_one_work+0x720/0x720
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x2e/0x40
 RIP: generic_make_request_checks+0x4d/0x610 RSP: ffffc90001337bb8

In btrfs_dev_replace_finishing(), we call
btrfs_rm_dev_replace_blocked() to wait for bios before destroying the
target device when scrub finishes normally.

However, when dev-replace is aborted, either due to an error or because it
is cancelled by scrub, we don't wait for bios. This can lead to
use-after-free if there are bios holding the target device.

Furthermore, for raid56 scrub, at least 2 places are calling
btrfs_map_sblock() without protection of bio_counter, leading to the
problem.

This patch fixes the problem:
1) Wait for bio_counter before freeing target device when canceling
   replace
2) When calling btrfs_map_sblock() for raid56, use bio_counter to
   protect the call.

Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Qu Wenruo
9a33944bdf btrfs: scrub: Don't append on-disk pages for raid56 scrub
In the following situation, scrub will calculate wrong parity to
overwrite the correct one:

RAID5 full stripe:

Before
|     Dev 1      |     Dev  2     |     Dev 3     |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--------------------------------------------------- 0
| 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
--------------------------------------------------- 4K
|     0xcdcd     |     0xcdcd     |     0x0000    |
...
|     0xcdcd     |     0xcdcd     |     0x0000    |
--------------------------------------------------- 64K

After scrubbing dev3 only:

|     Dev 1      |     Dev  2     |     Dev 3     |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--------------------------------------------------- 0
| 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
--------------------------------------------------- 4K
|     0xcdcd     |     0xcdcd     |     0x0000    |
...
|     0xcdcd     |     0xcdcd     |     0x0000    |
--------------------------------------------------- 64K

The reason is that after raid56 read rebuild, rbio->stripe_pages are all
correctly recovered (0xcd for data stripes).

However, when we check and repair parity in
scrub_parity_check_and_repair(), we append pages in the sparity->spages
list to rbio->bio_pages[], which contain old on-disk data.

And when we submit parity data to disk, we calculate parity using
rbio->bio_pages[] first; if a page is not found in rbio->bio_pages[], we
fall back to rbio->stripe_pages[].

This patch fixes it by not appending pages from sparity->spages,
so finish_parity_scrub() will use rbio->stripe_pages[], which is correct.
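
As a rough illustration of the table above, assuming a two-data-stripe
RAID5 layout where parity is the byte-wise XOR of the data stripes.
raid5_parity() is a hypothetical helper written for this note, not a
function from the btrfs source.

```c
#include <stddef.h>
#include <stdint.h>

/* Parity for a two-data-stripe RAID5 layout: byte-wise XOR of the data. */
static void raid5_parity(const uint8_t *d1, const uint8_t *d2,
			 uint8_t *parity, size_t len)
{
	for (size_t i = 0; i < len; i++)
		parity[i] = d1[i] ^ d2[i];
}
```

Feeding the stale on-disk 0x00 bytes for data stripe 1 together with 0xcd
for data stripe 2 yields parity 0xcd, the bad value shown for Dev 3.
Feeding the recovered 0xcd bytes for both data stripes yields 0x00, which
matches the original parity.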

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Qu Wenruo
d51ea5dd22 btrfs: qgroup: Re-arrange tracepoint timing to co-operate with reserved space tracepoint
Newly introduced qgroup reserved space trace points are normally nested
inside several common qgroup operations.

However, some other trace points are not well placed to co-operate with
them, causing confusing output.

This patch re-arranges the trace_btrfs_qgroup_release_data() and
trace_btrfs_qgroup_free_delayed_ref() trace points so they are triggered
before the reserved space ones.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Qu Wenruo
3159fe7bae btrfs: qgroup: Add trace point for qgroup reserved space
Introduce the following trace points:
qgroup_update_reserve
qgroup_meta_reserve

These trace points are handy to trace qgroup reserve space related
problems.

Also export btrfs_qgroup structure, as now we directly pass btrfs_qgroup
structure to trace points, so that structure needs to be exported.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
David Sterba
825ad4c964 btrfs: drop redundant parameters from btrfs_map_sblock
All callers pass 0 for mirror_num and 1 for need_raid_map.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
David Sterba
bcc8e07f9e btrfs: sink GFP flags parameter to tree_mod_log_insert_root
All (1) callers pass the same value.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
David Sterba
176ef8f5e6 btrfs: sink GFP flags parameter to tree_mod_log_insert_move
All (1) callers pass the same value.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Liu Bo
abad60c601 Btrfs: fix wrong failed mirror_num of read-repair on raid56
In raid56 scenario, after trying parity recovery, we didn't set
mirror_num for btrfs_bio with failed mirror_num, hence
end_bio_extent_readpage() will report a random mirror_num in dmesg
log.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Liu Bo
1bcd7aa17f Btrfs: set scrub page's io_error if failing to submit io
Scrub repairs data by the unit called scrub_block, which may contain
several pages.  Scrub always tries to look up a good copy of a whole
block, but if there's no such copy, it tries to do repair page by page.

If we don't set page's io_error when checking this bad copy, in the last
step, we may skip this page when repairing bad copy from good copy.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
David Sterba
171938e528 btrfs: track exclusive filesystem operation in flags
There are several operations, usually started from ioctls, that cannot
run concurrently. The status is tracked in
mutually_exclusive_operation_running as an atomic_t. We can easily track
the status as one of the per-filesystem flag bits with the same
synchronization guarantees.

The conversion replaces:

* atomic_xchg(..., 1)    ->   test_and_set_bit(FLAG, ...)
* atomic_set(..., 0)     ->   clear_bit(FLAG, ...)
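
The conversion can be sketched in userspace with C11 atomics. This is
an analogy only: atomic_fetch_or() and atomic_fetch_and() stand in for
test_and_set_bit() and clear_bit(), and EXCL_OP_BIT, excl_op_try_start()
and excl_op_finish() are hypothetical names, not taken from the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define EXCL_OP_BIT 0UL	/* hypothetical bit number within the flags word */

/*
 * Analogous to test_and_set_bit(): atomically set the bit and look at
 * its previous value. Returns true if the caller won the exclusive slot.
 */
static bool excl_op_try_start(atomic_ulong *flags)
{
	unsigned long mask = 1UL << EXCL_OP_BIT;

	return (atomic_fetch_or(flags, mask) & mask) == 0;
}

/* Analogous to clear_bit(): atomically release the exclusive slot. */
static void excl_op_finish(atomic_ulong *flags)
{
	atomic_fetch_and(flags, ~(1UL << EXCL_OP_BIT));
}
```

A second excl_op_try_start() on the same word fails until
excl_op_finish() has run, which is the behaviour the atomic_xchg()
version provided with a dedicated atomic_t.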

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Goldwyn Rodrigues
48a89bc4f2 btrfs: qgroups: Retry after commit on getting EDQUOT
We are facing the same problem with EDQUOT which was experienced with
ENOSPC. Not sure if we require a full ticketing system such as ENOSPC, but
here is a quick fix, which may be too big a hammer.

Quotas are reserved during the start of an operation, incrementing
qg->reserved. However, it is written to disk in a commit_transaction
which could take as long as commit_interval. In the meantime there
could be deletions which are not accounted for because deletions are
accounted for only at commit time (free_refroot). So, when we get
EDQUOT, flush the data to disk and try again.
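
A much simplified model of the retry, for illustration only. The struct
and all function names below are hypothetical; the real accounting is
more involved. pending_free models deletions that are only subtracted
from the reservation at commit time, and is assumed never to exceed
reserved.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified qgroup state. */
struct qgroup_sketch {
	uint64_t limit;		/* quota limit in bytes */
	uint64_t reserved;	/* bytes currently reserved */
	uint64_t pending_free;	/* deletions not accounted until commit */
};

static int try_reserve(struct qgroup_sketch *qg, uint64_t bytes)
{
	if (qg->reserved + bytes > qg->limit)
		return -EDQUOT;
	qg->reserved += bytes;
	return 0;
}

/* Stand-in for a transaction commit: account the pending deletions. */
static void commit_sketch(struct qgroup_sketch *qg)
{
	qg->reserved -= qg->pending_free;
	qg->pending_free = 0;
}

/* On -EDQUOT, commit once and retry; a second failure is final. */
static int reserve_with_retry(struct qgroup_sketch *qg, uint64_t bytes)
{
	int ret = try_reserve(qg, bytes);

	if (ret == -EDQUOT) {
		commit_sketch(qg);
		ret = try_reserve(qg, bytes);
	}
	return ret;
}
```

The retry happens only once, so a reservation that genuinely exceeds
the limit still fails with -EDQUOT after the commit.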

This fixes fstests btrfs/139.

Here is a sample script which shows this issue.

DEVICE=/dev/vdb
MOUNTPOINT=/mnt
TESTVOL=$MOUNTPOINT/tmp
QUOTA=5
PROG=btrfs
DD_BS="4k"
DD_COUNT="256"
RUN_TIMES=5000

mkfs.btrfs -f $DEVICE
mount -o commit=240 $DEVICE $MOUNTPOINT
$PROG subvolume create $TESTVOL
$PROG quota enable $TESTVOL
$PROG qgroup limit ${QUOTA}G $TESTVOL

typeset -i DD_RUN_GOOD
typeset -i QUOTA

function _check_cmd() {
        # Numeric test; [[ ... > ... ]] would compare the values as strings.
        if (( $? != 0 )); then
                echo -n "$(date) E: Running previous command"
                echo ${*}
                echo "Without sync"
                $PROG qgroup show -pcreFf ${TESTVOL}
                echo "With sync"
                $PROG qgroup show -pcreFf --sync ${TESTVOL}
                exit 1
        fi
}

while true; do
  DD_RUN_GOOD=$RUN_TIMES

  while (( ${DD_RUN_GOOD} != 0 )); do
        dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT}
        _check_cmd "dd if=/dev/zero of=${TESTVOL}/quotatest${DD_RUN_GOOD} bs=${DD_BS} count=${DD_COUNT}"
        DD_RUN_GOOD=$(( DD_RUN_GOOD - 1 ))
  done

  $PROG qgroup show -pcref $TESTVOL
  echo "----------- Cleanup ---------- "
  rm $TESTVOL/quotatest*

done

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Edmund Nadolski
de47c9d3ff btrfs: replace hardcoded value with SEQ_LAST macro
Define the SEQ_LAST macro to replace (u64)-1 in places where said
value triggers a special-case ref search behavior.
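
A small sketch of the idea, assuming the macro simply names the
(u64)-1 sentinel as the description says. The u64 typedef is a userspace
stand-in and wants_special_search() is a hypothetical helper, not a
function from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;	/* userspace stand-in for the kernel type */

/* Named sentinel replacing a bare (u64)-1 at the call sites. */
#define SEQ_LAST ((u64)-1)

/* Hypothetical check: does this sequence number request the special case? */
static bool wants_special_search(u64 time_seq)
{
	return time_seq == SEQ_LAST;
}
```

The value is unchanged; only the intent at each comparison becomes
visible.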

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Edmund Nadolski
f58d88b336 btrfs: provide enumeration for __merge_refs mode argument
Replace hardcoded numeric values for __merge_refs 'mode' argument
with descriptive constants.

Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Reviewed-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
David Sterba
f486135eba btrfs: remove unused qgroup members from btrfs_trans_handle
The members have been effectively unused since "Btrfs: rework qgroup
accounting" (fcebe4562d), there's no substitute for
assert_qgroups_uptodate so it's removed as well.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
David Sterba
994a5d2bc7 btrfs: remove local blocksize variable in reada_find_extent
The name is misleading and the local variable serves no purpose.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
David Sterba
5721b8ad26 btrfs: remove redundant parameter from reada_start_machine_dev
We can read fs_info from dev.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
David Sterba
0ceaf28213 btrfs: remove redundant parameter from reada_find_zone
We can read fs_info from dev.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
David Sterba
d48d71aa99 btrfs: remove redundant parameter from btree_readahead_hook
We can read fs_info from eb.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
David Sterba
7ef70b4d99 btrfs: preallocate radix tree node for global readahead tree
We can preallocate the node so insertion does not have to do that under
the lock. The GFP flags for the global radix tree are initialized to
 GFP_NOFS & ~__GFP_DIRECT_RECLAIM
but we can use GFP_KERNEL, because readahead is optional and not on any
critical writeout path.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
David Sterba
cc8385b59e btrfs: preallocate radix tree node for readahead
We can preallocate the node so insertion does not have to do that under
the lock. The GFP flags for the per-device radix tree are initialized to
 GFP_NOFS & ~__GFP_DIRECT_RECLAIM
but we can use GFP_KERNEL, same as an allocation above anyway, but also
because readahead is optional and not on any critical writeout path.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Goldwyn Rodrigues
4d339d0106 btrfs: No need to check !(flags & MS_RDONLY) twice
Code cleanup.
The code block is for !(*flags & MS_RDONLY). We don't need
to check it again.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Liu Bo
1a79c1f246 Btrfs: update comments in cache_save_setup
We also don't bother to flush the free space cache when the free space
tree is in use.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Liu Bo
539b50d2f6 Btrfs: convert BUG_ON to WARN_ON
These two BUG_ON()s would never be true, ensured by callers' logic.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Liu Bo
2b19a1fef7 Btrfs: helper for ops that requires full stripe
This adds a helper to show directly whether ops require full stripe.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Liu Bo
6fad823f49 Btrfs: do not add extra mirror when dev_replace target dev is not available
With this, we can avoid allocating memory for dev replace copies if the
target dev is not available.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:25 +02:00
Liu Bo
73c0f22825 Btrfs: handle operations for device replace separately
Since this part is mostly independent, this moves it to a separate
function.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Liu Bo
5ab56090b8 Btrfs: introduce a function to get extra mirror from replace
As the part of getting extra mirror in __btrfs_map_block is
independent, this puts it into a separate function.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Liu Bo
0b3d4cd371 Btrfs: separate DISCARD from __btrfs_map_block
Since DISCARD is not as important as an operation like write, we don't
copy it to target device during replace, and it makes __btrfs_map_block
less complex.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Liu Bo
592d92eeab Btrfs: create a helper for getting chunk map
We have similar code here and there, this merges them into a helper.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Liu Bo
09ed2f165c Btrfs: add file item tracepoints
While debugging truncate problems, I found that these tracepoints could
help us quickly know what went wrong.

Two sets of tracepoints are created to track regular/prealloc file items
and inline file items respectively. Inline gets a separate set because an
inline file item cares about far fewer fields than a regular one.

This adds four tracepoints:
- btrfs_get_extent_show_fi_regular
- btrfs_get_extent_show_fi_inline
- btrfs_truncate_show_fi_regular
- btrfs_truncate_show_fi_inline

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ formatting adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Elena Reshetova
dec95574f4 btrfs: convert btrfs_raid_bio.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.
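
To illustrate the overflow concern, here is a userspace sketch of a
saturating counter built on C11 atomics. It is not the kernel's
refcount_t implementation; refcount_sketch and both helpers are
hypothetical and model only the saturate-instead-of-wrap behaviour.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical saturating reference counter. */
struct refcount_sketch {
	atomic_uint refs;
};

/* Increment, but stick at UINT_MAX instead of wrapping around to 0. */
static void refcount_sketch_inc(struct refcount_sketch *r)
{
	unsigned int old = atomic_load(&r->refs);

	do {
		if (old == UINT_MAX)
			return;	/* saturated: leave the counter pinned */
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));
}

/* Decrement; returns true only when the last reference was dropped. */
static bool refcount_sketch_dec_and_test(struct refcount_sketch *r)
{
	unsigned int old = atomic_load(&r->refs);

	do {
		if (old == UINT_MAX || old == 0)
			return false;	/* saturated or already zero */
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old - 1));

	return old == 1;
}
```

A plain atomic counter would wrap to 0 on overflow, and a later
decrement could then free an object that is still in use. A saturated
counter leaks the object instead.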

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Elena Reshetova
99f4cdb16f btrfs: convert scrub_ctx.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Elena Reshetova
78a764504d btrfs: convert scrub_parity.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Elena Reshetova
186debd6ed btrfs: convert scrub_block.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Elena Reshetova
6f615018b3 btrfs: convert scrub_recover.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Elena Reshetova
a50299ae7c btrfs: convert compressed_bio.pending_bios from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Elena Reshetova
b7ac31b7b2 btrfs: convert extent_state.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
0700cea7c8 btrfs: convert btrfs_root.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
089e77e10d btrfs: convert btrfs_delayed_item.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
6de5f18e7b btrfs: convert btrfs_delayed_node.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
6df8cdf5bd btrfs: convert btrfs_delayed_ref_node.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
1e4f4714d5 btrfs: convert btrfs_caching_control.count from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
e76edab7f0 btrfs: convert btrfs_ordered_extent.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
490b54d6fb btrfs: convert extent_map.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
9b64f57ddf btrfs: convert btrfs_transaction.use_count from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Elena Reshetova
140475ae4a btrfs: convert btrfs_bio.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Liu Bo
f95fda8751 Btrfs: remove ASSERT in btrfs_truncate_inode_items
After 76b42abbf7 ("Btrfs: fix data loss after truncate when using the
no-holes feature"), for either NO_HOLES or inline extents, we set
last_size to newsize to avoid data loss after a remount or after the inode
is evicted and read again. Thus, we don't need this check anymore.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Adam Borowski
1450612797 btrfs: fix a bogus warning when converting only data or metadata
If your filesystem has, e.g., data:raid0 metadata:raid1, and you run "btrfs
balance -dconvert=raid1", the meta.target field will be uninitialized.
That's otherwise ok, as it's unused except for this warning.

Thus, let's use the existing set of raid levels for the comparison.

As a side effect, non-convert balances will now nag about data>metadata.

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:23 +02:00
Sachin Prabhu
62a6cfddcc cifs: Do not send echoes before Negotiate is complete
commit 4fcd1813e6 ("Fix reconnect to not defer smb3 session reconnect
long after socket reconnect") added support for Negotiate requests to
be initiated by echo calls.

To avoid delays in calling echo after a reconnect, I added the patch
introduced by the commit b8c600120f ("Call echo service immediately
after socket reconnect").

This has however caused a regression with cifs shares which do not have
support for echo calls to trigger Negotiate requests. On connections
which need to call Negotiation, the echo calls trigger an error which
triggers a reconnect which in turn triggers another echo call. This
results in a loop which is only broken when an operation is performed on
the cifs share. For an idle share, it can DOS a server.

The patch uses the smb_operation can_echo() for cifs so that it is
called only if the connection has already been set up.

kernel bz: 194531

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Tested-by: Jonathan Liu <net147@gmail.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-04-17 15:44:23 -05:00
Al Viro
85128b2be6 fix nfs O_DIRECT advancing iov_iter too much
It leaves the iterator advanced by the amount of IO it has requested
instead of the amount actually transferred.  Among other things,
that confuses the hell out of generic_file_splice_read().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-17 14:23:20 -04:00
Al Viro
890559e34e orangefs_bufmap_copy_from_iovec(): fix EFAULT handling
short copy here should mean instant EFAULT, not "move to the
next page and hope it fails there, this time with nothing
copied"

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-17 14:23:20 -04:00
Al Viro
801b25f104 fs/compat.c: trim unused includes
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-17 12:52:27 -04:00
Al Viro
f502985564 move compat_rw_copy_check_uvector() over to fs/read_write.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-17 12:52:26 -04:00
Al Viro
2b8910264a fhandle: move compat syscalls from compat.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-17 12:52:26 -04:00
Al Viro
e35d49f637 open: move compat syscalls from compat.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-17 12:52:25 -04:00
Al Viro
ac565de31c stat: move compat syscalls from compat.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-17 12:52:25 -04:00
Al Viro
80f0cce6aa fcntl: move compat syscalls from compat.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-17 12:52:24 -04:00
Al Viro
0460b2a28b readdir: move compat syscalls from compat.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-17 12:52:24 -04:00
Al Viro
4ada54ee7a statfs: move compat syscalls from compat.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-17 12:52:23 -04:00
Al Viro
1a060ba3c8 utimes: move compat syscalls from compat.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-17 12:52:23 -04:00
Al Viro
e99ca56ce0 move compat select-related syscalls to fs/select.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-17 12:52:22 -04:00
Al Viro
2611dc1939 Remove compat_sys_getdents64()
Unlike normal compat syscall variants, it is needed only for
biarch architectures that have different alignment requirements for
u64 in 32bit and 64bit ABI *and* have __put_user() that won't handle
a store of 64bit value at 32bit-aligned address.  We used to have one
such (ia64), but its biarch support has been gone since 2010 (after
being broken in 2008, which went unnoticed since nobody had been using
it).

It had escaped removal at the same time only because back in 2004
a patch that switched several syscalls on amd64 from private wrappers to
generic compat ones had switched to use of compat_sys_getdents64(), which
hadn't needed (or used) a compat wrapper on amd64.

Let's bury it - it's at least 7 years overdue.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-17 12:52:22 -04:00
Eric W. Biederman
b54807fa52 sysctl: Remove dead register_sysctl_root
The function no longer does anything.  There is only a single caller of
register_sysctl_root when semantically there should be two.  Remove
this function so that if someone decides this functionality is needed
again it will be obvious all of the callers of setup_sysctl_set need
to be audited and modified appropriately.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-04-16 23:42:49 -05:00
David S. Miller
6b6cbc1471 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were simply overlapping changes.  In the net/ipv4/route.c
case the code had simply moved around a little bit and the same fix
was made in both 'net' and 'net-next'.

In the net/sched/sch_generic.c case a fix in 'net' happened at
the same time that a new argument was added to qdisc_hash_add().

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-15 21:16:30 -04:00
Martin Brandenburg
1ec1688c53 orangefs: free superblock when mount fails
Otherwise lockdep says:

[ 1337.483798] ================================================
[ 1337.483999] [ BUG: lock held when returning to user space! ]
[ 1337.484252] 4.11.0-rc6 #19 Not tainted
[ 1337.484423] ------------------------------------------------
[ 1337.484626] mount/14766 is leaving the kernel with locks still held!
[ 1337.484841] 1 lock held by mount/14766:
[ 1337.485017]  #0:  (&type->s_umount_key#33/1){+.+.+.}, at: [<ffffffff8124171f>] sget_userns+0x2af/0x520

Caught by xfstests generic/413 which tried to mount with the unsupported
mount option dax.  Then xfstests generic/422 ran sync which deadlocks.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Acked-by: Mike Marshall <hubcap@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-15 09:39:31 -07:00
Linus Torvalds
c0eb027e5a vfs: don't do RCU lookup of empty pathnames
Normal pathname lookup doesn't allow empty pathnames, but using
AT_EMPTY_PATH (with name_to_handle_at() or fstatat(), for example) you
can trigger an empty pathname lookup.

And not only is the RCU lookup in that case entirely unnecessary
(because we'll obviously immediately finalize the end result), it is
actively wrong.

Why? An empty path is a special case that will return the original
'dirfd' dentry - and that dentry may not actually be RCU-free'd,
resulting in a potential use-after-free if we were to initialize the
path lazily under the RCU read lock and depend on complete_walk()
finalizing the dentry.

Found by syzkaller and KASAN.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-15 09:34:52 -07:00
Linus Torvalds
4b31ac485d Merge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Dave Sterba collected a few more fixes for the last rc.

  These aren't marked for stable, but I'm putting them in with a batch
  we're testing/sending by hand for this release"

* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix potential use-after-free for cloned bio
  Btrfs: fix segmentation fault when doing dio read
  Btrfs: fix invalid dereference in btrfs_retry_endio
  btrfs: drop the nossd flag when remounting with -o ssd
2017-04-14 16:53:45 -07:00
Linus Torvalds
5466f4dfce Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull more CIFS fixes from Steve French:
 "As promised, here is the remaining set of cifs/smb3 fixes for stable
  (and a fix for one regression) now that they have had additional
  review and testing"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: Fix SMB3 mount without specifying a security mechanism
  CIFS: store results of cifs_reopen_file to avoid infinite wait
  CIFS: remove bad_network_name flag
  CIFS: reconnect thread reschedule itself
  CIFS: handle guest access errors to Windows shares
  CIFS: Fix null pointer deref during read resp processing
2017-04-14 16:51:29 -07:00
Mike Kravetz
045c7a3f53 hugetlbfs: fix offset overflow in hugetlbfs mmap
If mmap() maps a file, it can be passed an offset into the file at which
the mapping is to start.  Offset could be a negative value when
represented as a loff_t.  The offset plus length will be used to update
the file size (i_size) which is also a loff_t.

Validate the value of offset and offset + length to make sure they do
not overflow and appear as negative.
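
A minimal userspace sketch of this kind of validation, assuming int64_t
stands in for loff_t. The helper name range_is_valid() is hypothetical and
this is not the code the patch adds to hugetlbfs:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: int64_t stands in for the kernel's loff_t. */
static bool range_is_valid(int64_t offset, uint64_t len)
{
	/* An offset that reads as negative is rejected outright. */
	if (offset < 0)
		return false;
	/* The length itself must fit in the signed range. */
	if (len > (uint64_t)INT64_MAX)
		return false;
	/* offset + len must not wrap past INT64_MAX; test it without adding. */
	if ((int64_t)len > INT64_MAX - offset)
		return false;
	return true;
}
```

The last comparison is written as a subtraction so that the check itself
cannot overflow, which a direct "offset + len < 0" test could.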

Found by syzkaller with commit ff8c0c53c4 ("mm/hugetlb.c: don't call
region_abort if region_chg fails") applied.  Prior to this commit, the
overflow would still occur but we would luckily return ENOMEM.

To reproduce:

   mmap(0, 0x2000, 0, 0x40021, 0xffffffffffffffffULL, 0x8000000000000000ULL);

Resulted in,

  kernel BUG at mm/hugetlb.c:742!
  Call Trace:
   hugetlbfs_evict_inode+0x80/0xa0
   evict+0x24a/0x620
   iput+0x48f/0x8c0
   dentry_unlink_inode+0x31f/0x4d0
   __dentry_kill+0x292/0x5e0
   dput+0x730/0x830
   __fput+0x438/0x720
   ____fput+0x1a/0x20
   task_work_run+0xfe/0x180
   exit_to_usermode_loop+0x133/0x150
   syscall_return_slowpath+0x184/0x1c0
   entry_SYSCALL_64_fastpath+0xab/0xad

Fixes: ff8c0c53c4 ("mm/hugetlb.c: don't call region_abort if region_chg fails")
Link: http://lkml.kernel.org/r/1491951118-30678-1-git-send-email-mike.kravetz@oracle.com
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-13 18:24:21 -07:00
Kirill A. Shutemov
5b7abeae3a thp: fix MADV_DONTNEED vs clear soft dirty race
Yet another instance of the same race.

Fix is identical to change_huge_pmd().

See "thp: fix MADV_DONTNEED vs.  numa balancing race" for more details.

Link: http://lkml.kernel.org/r/20170302151034.27829-5-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-13 18:24:21 -07:00
Olga Kornievskaia
05b7278d51 nfsd: fix oops on unsupported operation
I'm hitting the BUG in nfsd4_max_reply() at fs/nfsd/nfs4proc.c:2495 when
client sends an operation the server doesn't support.

In nfsd4_max_reply() it checks for NULL rsize_bop but a non-supported
operation wouldn't have that set.

Cc: Kinglong Mee <kinglongmee@gmail.com>
Fixes: 2282cd2c05 "NFSD: Get response size before operation..."
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-13 11:18:56 -04:00
Pavel Shilovsky
67dbea2ce6 CIFS: Fix SMB3 mount without specifying a security mechanism
Commit ef65aaede2 ("smb2: Enforce sec= mount option") changed the
behavior of a mount command to enforce a specified security mechanism
during mounting. On the other hand, according to the spec, if an SMB3
server doesn't respond with a security context, it implies that it
supports NTLMSSP. The current code doesn't take this into account and
fails a mount for such servers if no security mechanism is specified.
Fix this by indicating that a server supports NTLMSSP if a security
context isn't returned during the negotiate phase. This allows the code
to use NTLMSSP by default for SMB3 mounts.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-04-13 10:03:26 -05:00
Dan Williams
bfca9acf1a Merge branch 'for-4.11/libnvdimm' into for-4.12/dax
2017-04-12 21:59:01 -07:00
Jaegeuk Kim
d29fd17218 f2fs: fix not to set fsync/dentry mark
Otherwise, we can see stale fsync/dentry mark given by previous calls, resulting
in giving up roll-forward recovery due to wrong dentry mark.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-12 12:57:09 -07:00
Jaegeuk Kim
6c3acd9757 f2fs: allocate hot_data for atomic writes
We'd better allocate atomic writes to hot_data zone.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-12 12:57:08 -07:00
Jaegeuk Kim
3097388354 f2fs: give time to flush dirty pages for checkpoint
If all the threads are waiting for checkpoint, we have no chance to flush
required dirty pages.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-12 12:57:07 -07:00
Jaegeuk Kim
9bb02c3627 f2fs: fix fs corruption due to zero inode page
This patch fixes the following scenario.

- f2fs_create/f2fs_mkdir             - write_checkpoint
 - f2fs_mark_inode_dirty_sync         - block_operations
                                       - f2fs_lock_all
                                       - f2fs_sync_inode_meta
                                        - f2fs_unlock_all
                                        - sync_inode_metadata
 - f2fs_lock_op
                                         - f2fs_write_inode
                                          - update_inode_page
                                           - get_node_page
                                             return -ENOENT
 - new_inode_page
  - fill_node_footer
 - f2fs_mark_inode_dirty_sync
 - ...
 - f2fs_unlock_op
                                          - f2fs_inode_synced
                                       - f2fs_lock_all
                                       - do_checkpoint

In this checkpoint, we can get an inode page that contains only zeros
apart from a valid node footer.

Cc: <stable@vger.kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-12 12:57:06 -07:00
Chao Yu
a54455f5ee f2fs: shrink blk plug region
Don't use blk plug covering area where there won't be any IOs being issued.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-12 12:57:05 -07:00
Brian Foster
3b4683c294 xfs: drop iolock from reclaim context to appease lockdep
Lockdep complains about use of the iolock in inode reclaim context
because it doesn't understand that reclaim has the last reference to
the inode, and thus an iolock->reclaim->iolock deadlock is not
possible.

The iolock is technically not necessary in xfs_inactive() and was
only added to appease an assert in xfs_free_eofblocks(), which can
be called from other non-reclaim contexts. Therefore, just kill the
assert and drop the use of the iolock from reclaim context to quiet
lockdep.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-12 08:43:23 -07:00
Eric Sandeen
5146d0b762 xfs: remove custom do_div implementations
Long ago, all this gunk was added with a lament about problems
with gcc's do_div, and a fun recommendation in the changelog:

 egcs-2.91.66 is the recommended compiler version for building XFS.

All this special stuff was needed to work around an old gcc bug,
apparently, and it's been there ever since.

There should be no need for this anymore, so remove it.

Remove the special 32-bit xfs_do_mod as well; just let the
kernel's do_div() handle all this.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-12 08:42:51 -07:00
Eric Sandeen
d956f813b6 xfs: simplify xfs_calc_dquots_per_chunk
ndquots is a 32-bit value, and we don't care
about the remainder; there is no reason to use do_div
here, it seems to be the result of a decade+ historical
accident.

Worse, the do_div implementation in userspace breaks
when fed a 32-bit dividend, so we commented it out there
in any case.

Change to simple division, and then we can change
userspace to match, and mandate a 64-bit dividend in
the do_div() in userspace as well.
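
A rough userspace illustration of the difference, assuming a function-style
stand-in for do_div(). The real kernel do_div() is a macro that takes a
64-bit lvalue; do_div_sketch() and the two per_chunk_*() helpers below are
hypothetical names, not XFS code:

```c
#include <stdint.h>

/* Stand-in for do_div(): divides *n in place and returns the remainder. */
static uint32_t do_div_sketch(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/* 32-bit dividend, remainder not needed: plain division is enough. */
static uint32_t per_chunk_simple(uint32_t total, uint32_t chunk)
{
	return total / chunk;
}

/* Going through a do_div-style helper needs a 64-bit temporary. */
static uint32_t per_chunk_via_do_div(uint32_t total, uint32_t chunk)
{
	uint64_t n = total;

	(void)do_div_sketch(&n, chunk);	/* remainder is discarded */
	return (uint32_t)n;
}
```

Both helpers return the same quotient; the second only adds a widening
step and an unused remainder.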

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
2017-04-12 08:42:51 -07:00
Chao Yu
54c2258cd6 f2fs: extract rb-tree operation infrastructure
The rb-tree lookup/update functions are tightly coupled to the extent
cache code, which makes these basic functions very hard to reuse. This
patch extracts a common rb-tree operation infrastructure for later reuse.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-11 15:13:52 -07:00
Jaegeuk Kim
8fd5a37efa f2fs: avoid frequent checkpoint during f2fs_gc
Now we're doing SSR more aggressively than ever before, so once we reach
the reserved_segment, f2fs_balance_fs will call f2fs_gc, which triggers a
checkpoint every time. We must avoid that.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-11 15:12:39 -07:00
Liu Bo
a967efb30b Btrfs: fix potential use-after-free for cloned bio
KASAN reports that there is a use-after-free case of bio in btrfs_map_bio.

If we need to submit IOs to several disks at a time, the original bio
would get cloned and mapped to the destination disk, but we really should
use the original bio instead of a cloned bio to do the sanity check
because cloned bios are likely to be freed by their endio.

Reported-by: Diego <diegocg@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-11 18:49:56 +02:00
Liu Bo
97bf5a5589 Btrfs: fix segmentation fault when doing dio read
Commit 2dabb32484 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
introduced this bug while iterating over bio pages in the dio read endio
hook, and it could end up causing a segmentation fault in the dio reading
task.

So the reason is 'if (nr_sectors--)', and it makes the code assume that
there is one more block in the same page, so page offset is increased and
the bio which is created to repair the bad block then has an incorrect
bvec.bv_offset, and a later access of the page content would throw a
segmentation fault.
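
The off-by-one can be seen in a small standalone sketch. The two helpers
below are hypothetical and only count how often the "another block follows
in this page" branch is taken; they are not the btrfs endio code:

```c
/* Post-decrement tests the OLD value, so the branch is taken even when
 * the block just handled was the last one. */
static int advances_buggy(int nr_sectors)
{
	int advances = 0;

	while (nr_sectors > 0) {
		if (nr_sectors--)
			advances++;	/* would bump the page offset */
	}
	return advances;
}

/* Decrement first, then test what is actually left. */
static int advances_checked(int nr_sectors)
{
	int advances = 0;

	while (nr_sectors > 0) {
		nr_sectors--;
		if (nr_sectors)
			advances++;
	}
	return advances;
}
```

With a single block, the first variant advances once where the second
does not advance at all, which matches the stray offset increase
described above.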

This also adds ASSERT to check page offset against page size.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-11 18:49:29 +02:00
Liu Bo
2e949b0a55 Btrfs: fix invalid dereference in btrfs_retry_endio
When doing directIO repair, we have this oops:

[ 1458.532816] general protection fault: 0000 [#1] SMP
...
[ 1458.536291] Workqueue: btrfs-endio-repair btrfs_endio_repair_helper [btrfs]
[ 1458.536893] task: ffff88082a42d100 task.stack: ffffc90002b3c000
[ 1458.537499] RIP: 0010:btrfs_retry_endio+0x7e/0x1a0 [btrfs]
...
[ 1458.543261] Call Trace:
[ 1458.543958]  ? rcu_read_lock_sched_held+0xc4/0xd0
[ 1458.544374]  bio_endio+0xed/0x100
[ 1458.544750]  end_workqueue_fn+0x3c/0x40 [btrfs]
[ 1458.545257]  normal_work_helper+0x9f/0x900 [btrfs]
[ 1458.545762]  btrfs_endio_repair_helper+0x12/0x20 [btrfs]
[ 1458.546224]  process_one_work+0x34d/0xb70
[ 1458.546570]  ? process_one_work+0x29e/0xb70
[ 1458.546938]  worker_thread+0x1cf/0x960
[ 1458.547263]  ? process_one_work+0xb70/0xb70
[ 1458.547624]  kthread+0x17d/0x180
[ 1458.547909]  ? kthread_create_on_node+0x70/0x70
[ 1458.548300]  ret_from_fork+0x31/0x40

It turns out that btrfs_retry_endio is trying to get inode from a directIO
page.

This fixes the problem by using the saved inode pointer, done->inode.
btrfs_retry_endio_nocsum has the same problem, and it's fixed as well.

Also clean up unused @start (which is too trivial for a separate patch).

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-11 18:49:08 +02:00
Adam Borowski
951e796639 btrfs: drop the nossd flag when remounting with -o ssd
The opposite case was already handled right in the very next switch entry.
And also when turning on nossd, drop ssd_spread.

Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-11 18:48:59 +02:00
NeilBrown
717a94b5fc sched/core: Remove 'task' parameter and rename tsk_restore_flags() to current_restore_flags()
It is not safe for one thread to modify the ->flags
of another thread as there is no locking that can protect
the update.

So tsk_restore_flags(), which takes a task pointer and modifies
the flags, is an invitation to do the wrong thing.

All current users pass "current" as the task, so no developers have
accepted that invitation.  It would be best to ensure it remains
that way.

So rename tsk_restore_flags() to current_restore_flags() and don't
pass in a task_struct pointer.  Always operate on current->flags.
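
A userspace sketch of the resulting interface shape, assuming a global
stand-in for "current". The struct, the PF_SKETCH_* bits and the function
name are all hypothetical; only the mask-and-restore idea is taken from
the description above:

```c
/* Hypothetical stand-ins for a task and for "current". */
struct task_sketch {
	unsigned long flags;
};

static struct task_sketch current_task_sketch;

#define PF_SKETCH_A 0x1UL
#define PF_SKETCH_B 0x2UL

/*
 * Restore only the bits selected by mask to the values they had in
 * orig_flags. There is no task argument, so a caller cannot point this
 * at another thread's flags by mistake.
 */
static void current_restore_flags_sketch(unsigned long orig_flags,
					 unsigned long mask)
{
	current_task_sketch.flags &= ~mask;
	current_task_sketch.flags |= orig_flags & mask;
}
```

A caller would save the flags word, set a bit for the duration of some
operation, and then restore just that bit from the saved copy.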

Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-11 09:06:32 +02:00
Germano Percossi
1fa839b498 CIFS: store results of cifs_reopen_file to avoid infinite wait
This fixes Continuous Availability when errors during
file reopen are encountered.

cifs_user_readv and cifs_user_writev would wait forever if the
results of cifs_reopen_file are not stored for later inspection.

In fact, results are checked and, in case of errors, a chain
of function calls leading to reads and writes to be scheduled in
a separate thread is skipped.
These threads will wake up the corresponding waiters once reads
and writes are done.

However, given the return value is not stored, when rc is checked
for errors a previous one (always zero) is inspected instead.
This leads to pending reads/writes added to the list, making
cifs_user_readv and cifs_user_writev wait forever.
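
The pattern can be reduced to a few lines. This is a hedged sketch with
hypothetical helpers (reopen_sketch(), io_unstored(), io_stored()), not
the CIFS code itself:

```c
#include <errno.h>

/* Hypothetical stand-in for a reopen that may fail. */
static int reopen_sketch(int should_fail)
{
	return should_fail ? -EIO : 0;
}

/* Problem pattern: the result is dropped, so rc still holds its old 0. */
static int io_unstored(int should_fail)
{
	int rc = 0;

	reopen_sketch(should_fail);
	if (rc)
		return rc;	/* never taken */
	return 0;		/* caller proceeds as if the reopen worked */
}

/* Storing the result lets the error check see the real outcome. */
static int io_stored(int should_fail)
{
	int rc;

	rc = reopen_sketch(should_fail);
	if (rc)
		return rc;
	return 0;
}
```

In the first variant the failure is invisible to the caller, which is the
situation that leaves the waiters described above without anyone to wake
them.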

Signed-off-by: Germano Percossi <germano.percossi@citrix.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-04-10 23:36:39 -05:00
Germano Percossi
a0918f1ce6 CIFS: remove bad_network_name flag
STATUS_BAD_NETWORK_NAME can be received during node failover,
causing the flag to be set and making the reconnect thread
always unsuccessful, thereafter.

Once the only place where it is set is removed, the remaining
bits are rendered moot.

Removing it does not prevent "mount" from failing when a non
existent share is passed.

What happens when the share really ceases to exist while the
share is mounted is undefined now as much as it was before.

Signed-off-by: Germano Percossi <germano.percossi@citrix.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-04-10 23:36:39 -05:00
Germano Percossi
18ea43113f CIFS: reconnect thread reschedule itself
In case of error, smb2_reconnect_server reschedules itself
with a delay, to avoid being too aggressive.

Signed-off-by: Germano Percossi <germano.percossi@citrix.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-04-10 23:36:39 -05:00
Mark Syms
40920c2bb1 CIFS: handle guest access errors to Windows shares
Commit 1a967d6c9b ("correctly to
anonymous authentication for the NTLM(v2) authentication") introduces
a regression in handling errors related to attempting a guest
connection to a Windows share which requires authentication. This
should result in a permission denied error but actually causes the
kernel module to enter a never-ending loop trying to follow a DFS
referral which doesn't exist.

The root cause is that the failure now occurs later in the process,
during tree connect rather than at session setup, and all errors in
tree connect are interpreted as needing to follow the DFS paths, which
is not correct in this case. So, check the returned error against
EACCES and fail if that is the error returned.

Feedback from Aurelien:

  PS> net user guest /activate:no
  PS> mkdir C:\guestshare
  PS> icacls C:\guestshare /grant 'Everyone:(OI)(CI)F'
  PS> new-smbshare -name guestshare -path C:\guestshare -fullaccess Everyone

  I've tested v3.10, v4.4, master, master+your patch using default options
  (empty or no user "NU") and user=abc (U).

  NT_LOGON_FAILURE in session setup: LF
  This is what you seem to have in 3.10.

  NT_ACCESS_DENIED in tree connect to the share: AD
  This is what you get before your infinite loop.

               |   NU       U
  --------------------------------
  3.10         |   LF       LF
  4.4          |   LF       LF
  master       |   AD       LF
  master+patch |   AD       LF

  No infinite DFS loop :(
  All these issues result in mount failing very fast with permission denied.

  I guess it could be from either the Windows version or the share/folder
  ACL. A deeper analysis of the packets might reveal more.

  In any case I did not notice any issues on a basic DFS setup with
  the patch so I don't think it introduced any regressions, which is
  probably all that matters. It still bothers me a little I couldn't hit
  the bug.

  I've included kernel output w/ debugging output and network capture of
  my tests if anyone wants to have a look at it. (master+patch = ml-guestfix).

Signed-off-by: Mark Syms <mark.syms@citrix.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Tested-by: Aurelien Aptel <aaptel@suse.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-04-10 23:36:38 -05:00
Pavel Shilovsky
350be257ea CIFS: Fix null pointer deref during read resp processing
Currently during receiving a read response mid->resp_buf can be
NULL when it is being passed to cifs_discard_remaining_data() from
cifs_readv_discard(). Fix it by always passing server->smallbuf
instead and initializing mid->resp_buf at the end of read response
processing.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-04-10 23:36:38 -05:00
Jaegeuk Kim
4ddb1a4d4d f2fs: clean up some macros in terms of GET_SEGNO
This patch cleans several macros by introducing:
- BLKS_PER_SEC
- GET_SEC_FROM_SEG
- GET_SEG_FROM_SEC
- GET_ZONE_FROM_SEC
- GET_ZONE_FROM_SEG

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:13 -07:00
Jaegeuk Kim
302bd34882 f2fs: clean up get_valid_blocks with consistent parameter
This patch cleans up get_valid_blocks, which has no functional change.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:12 -07:00
Jaegeuk Kim
63fcf8e8d6 f2fs: use segment number for get_valid_blocks
This patch fixes the callers to pass a segment number to get_valid_blocks.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:11 -07:00
Tomohiro Kusumi
68afcf2d38 f2fs: guard macro variables with braces
Add braces around variables used within macros where it makes sense
to do so. Many of the macros in f2fs already do this. What this commit
doesn't do is anything that changes line# as a result of adding braces,
which usually affects the binary via __LINE__.

Confirmed no diff in fs/f2fs/f2fs.ko before/after this commit on x86_64,
to make sure this has no functional change as well as there's been no
unexpected side effect due to callers' arithmetic within the existing
code.
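
For readers unfamiliar with the hazard being guarded against, here is a
small sketch. It assumes "braces" means parentheses around macro
arguments; the macro and function names are hypothetical, not the f2fs
macros:

```c
/* Unguarded: arguments are pasted in as-is, so operator precedence of
 * the caller's expression leaks into the macro body. */
#define BLKS_UNGUARDED(segs, per_seg)	(segs * per_seg)

/* Guarded: each argument is evaluated as a unit. */
#define BLKS_GUARDED(segs, per_seg)	((segs) * (per_seg))

static int unguarded_demo(void)
{
	return BLKS_UNGUARDED(1 + 2, 4);	/* expands to (1 + 2 * 4) */
}

static int guarded_demo(void)
{
	return BLKS_GUARDED(1 + 2, 4);		/* expands to ((1 + 2) * (4)) */
}
```

If every existing caller already passes simple expressions, adding the
guards changes nothing in the generated code, which is consistent with
the identical-binary check mentioned above.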

Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:10 -07:00
Tomohiro Kusumi
771a9a7177 f2fs: fix comment on f2fs_flush_merged_bios() after 86531d6b
Callers are to unlock the page on failure after 86531d6b.

Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:10 -07:00
Chao Yu
fa64a0036c f2fs: prevent waiter encountering incorrect discard states
In f2fs_submit_discard_endio, we will wake up waiter before setting
discard command states, so waiter may use incorrect states. Change
the order between complete() and states setting to fix this issue.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:09 -07:00
Chao Yu
d431413f00 f2fs: introduce f2fs_wait_discard_bios
Split f2fs_wait_discard_bios from f2fs_wait_discard_bio, just for cleanup,
no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:08 -07:00
Chao Yu
22d375dd9c f2fs: split discard_cmd_list
Split discard_cmd_list to discard_{pend,wait}_list, so while sending/waiting
discard command, we can avoid traversing unneeded entries in original list.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:07 -07:00
Jaegeuk Kim
c6f82fe90d Revert "f2fs: put allocate_segment after refresh_sit_entry"
This reverts commit 3436c4bdb3.

This causes a leak in registering dirty segments. I reproduced the issue
with a modified postmark that injects a lot of file create/delete/update
operations and finally triggers a huge number of SSR allocations.

Cc: <stable@vger.kernel.org> # v4.10+
[Jaegeuk Kim: Change missing incorrect comment]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-10 19:48:06 -07:00
Jan Kara
054c636e5c fsnotify: Move ->free_mark callback to fsnotify_ops
Pointer to ->free_mark callback unnecessarily occupies one long in each
fsnotify_mark although they are the same for all marks from one
notification group. Move the callback pointer to fsnotify_ops.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara
7b12932340 fsnotify: Add group pointer in fsnotify_init_mark()
Currently we initialize mark->group only in fsnotify_add_mark_lock().
However we will need to access fsnotify_ops of corresponding group from
fsnotify_put_mark() so we need mark->group initialized earlier. Do that
in fsnotify_init_mark() which has a consequence that once
fsnotify_init_mark() is called on a mark, the mark has to be destroyed
by fsnotify_put_mark().

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara
ebb3b47e37 fsnotify: Drop inode_mark.c
inode_mark.c now contains only a single function. Move it to
fs/notify/fsnotify.c and remove inode_mark.c.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara
b1362edfe1 fsnotify: Remove fsnotify_find_{inode|vfsmount}_mark()
These are very thin wrappers, just remove them. Drop
fs/notify/vfsmount_mark.c as it is empty now.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara
2e37c6ca8d fsnotify: Remove fsnotify_detach_group_marks()
The function is already mostly contained in what
fsnotify_clear_marks_by_group() does. Just update that function to not
select marks when all of them should be destroyed and remove
fsnotify_detach_group_marks().

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara
18f2e0d3a4 fsnotify: Rename fsnotify_clear_marks_by_group_flags()
The _flags() suffix in the function name was more confusing than
explaining so just remove it. Also rename the argument from 'flags' to
'type' to better explain what the function expects.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara
416bcdbcbb fsnotify: Inline fsnotify_clear_{inode|vfsmount}_mark_group()
Inline these helpers as they are very thin. We still keep them as we
don't want to expose details about how list type is determined.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara
8920d2734d fsnotify: Remove fsnotify_recalc_{inode|vfsmount}_mask()
These helpers are just very thin wrappers now. Remove them.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara
66d2b81bcb fsnotify: Remove fsnotify_set_mark_{,ignored_}mask_locked()
These helpers are now only a simple assignment and just obfuscate
what is going on. Remove them.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara
05f0e38724 fanotify: Release SRCU lock when waiting for userspace response
When userspace task processing fanotify permission events screws up and
does not respond, fsnotify_mark_srcu SRCU is held indefinitely which
causes further hangs in the whole notification subsystem. Although we
cannot easily solve the problem of operations blocked waiting for
response from userspace, we can at least somewhat localize the damage by
dropping SRCU lock before waiting for userspace response and reacquiring
it when userspace responds.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara
9385a84d7e fsnotify: Pass fsnotify_iter_info into handle_event handler
Pass fsnotify_iter_info into ->handle_event() handler so that it can
release and reacquire SRCU lock via fsnotify_prepare_user_wait() and
fsnotify_finish_user_wait() functions.  These functions also make sure
current marks are appropriately pinned so that iteration protected by
srcu in fsnotify() stays safe.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara
abc77577a6 fsnotify: Provide framework for dropping SRCU lock in ->handle_event
fanotify wants to drop fsnotify_mark_srcu lock when waiting for response
from userspace so that the whole notification subsystem is not blocked
during that time. This patch provides a framework for safely getting
mark reference for a mark found in the object list which pins the mark
in that list. We can then drop fsnotify_mark_srcu, wait for userspace
response and then safely continue iteration of the object list once we
reacquire fsnotify_mark_srcu.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara
f09b04a03e fsnotify: Remove special handling of mark destruction on group shutdown
Currently we queue all marks for destruction on group shutdown and then
destroy them from fsnotify_destroy_group() instead of from a worker thread,
which is the usual path. However, the worker can already be processing some
list of marks to destroy, so this does not make 100% sure all marks are
really destroyed by the time the group is shut down. This isn't a big
problem as each mark holds a group reference and thus the group stays
partially alive until all marks are really freed, but there's no point in
complicating our lives - just wait for the delayed work to be finished instead.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara
6b3f05d24d fsnotify: Detach mark from object list when last reference is dropped
Instead of removing mark from object list from fsnotify_detach_mark(),
remove the mark when last reference to the mark is dropped. This will
allow fanotify to wait for userspace response to event without having to
hold onto fsnotify_mark_srcu.

To avoid pinning inodes by elevated refcount (and thus e.g. delaying
file deletion) while someone holds mark reference, we detach connector
from the object also from fsnotify_destroy_marks() and not only after
removing the last mark from the list, as was the case until now.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:36 +02:00
Jan Kara
11375145a7 fsnotify: Move queueing of mark for destruction into fsnotify_put_mark()
Currently we queue mark into a list of marks for destruction in
__fsnotify_free_mark() and keep the last mark reference dangling. After the
worker waits for SRCU period, it drops the last reference to the mark
which frees it. This scheme has the disadvantage that if we hold
reference to a mark and drop and reacquire SRCU lock, the mark can get
freed immediately which is slightly inconvenient and we will need to
avoid this in the future.

Move to a scheme where queueing of mark into a list of marks for
destruction happens when the last reference to the mark is dropped. Also
drop reference to the mark held by group list already when mark is
removed from that list instead of dropping it only from the destruction
worker.
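
As a rough userspace illustration of the scheme described above, the sketch
below queues an object for destruction only when its last reference is
dropped. The struct, the function names and the plain linked list are all
hypothetical; the kernel code uses its own refcount helpers, a locked list
and a delayed worker that also waits for an SRCU grace period.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a notification mark. */
struct mark {
	atomic_int refcnt;		/* number of holders of this mark */
	bool queued_for_destroy;	/* set once handed to the destroyer */
	struct mark *next_destroy;	/* link in the destruction queue */
};

/* Hypothetical destruction queue (not thread-safe in this sketch). */
static struct mark *destroy_list;

static void mark_init(struct mark *m)
{
	atomic_init(&m->refcnt, 1);	/* the creator holds one reference */
	m->queued_for_destroy = false;
	m->next_destroy = NULL;
}

static void mark_get(struct mark *m)
{
	/* Taking a reference is only valid while one is already held. */
	assert(atomic_load(&m->refcnt) > 0);
	atomic_fetch_add(&m->refcnt, 1);
}

/* Queue the mark for destruction only when the last reference goes away. */
static void mark_put(struct mark *m)
{
	if (atomic_fetch_sub(&m->refcnt, 1) == 1) {
		m->queued_for_destroy = true;
		m->next_destroy = destroy_list;
		destroy_list = m;
	}
}
```

With this shape, a caller that holds its own reference can drop and retake
other locks without the mark being queued underneath it.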

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara
e725376058 inotify: Do not drop mark reference under idr_lock
Dropping mark reference can result in mark being freed. Although it
should not happen in inotify_remove_from_idr() since caller should hold
another reference, just don't risk lock up just after WARN_ON
unnecessarily. Also fold do_inotify_remove_from_idr() into the single
callsite as that function really is just two lines of real code.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara
08991e83b7 fsnotify: Free fsnotify_mark_connector when there is no mark attached
Currently we free fsnotify_mark_connector structure only when inode /
vfsmount is getting freed. This can however impose noticeable memory
overhead when marks get attached to inodes only temporarily. So free the
connector structure once the last mark is detached from the object.
Since notification infrastructure can be working with the connector
under the protection of fsnotify_mark_srcu, we have to be careful and
free the fsnotify_mark_connector only after SRCU period passes.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara
04662cab59 fsnotify: Lock object list with connector lock
So far list of marks attached to an object (inode / vfsmount) was
protected by i_lock or mnt_root->d_lock. This dictates that the list
must be empty before the object can be destroyed although the list is
now anchored in the fsnotify_mark_connector structure. Protect the list
by a spinlock in the fsnotify_mark_connector structure to decouple
lifetime of a list of marks from a lifetime of the object. This also
simplifies the code quite a bit since we don't have to differentiate
between inode and vfsmount lists in quite a few places anymore.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara
2629718dd2 fsnotify: Remove useless list deletion and comment
After removing all the indirection it is clear that

hlist_del_init_rcu(&mark->obj_list);

in fsnotify_destroy_marks() is not needed as the mark gets removed from
the list shortly afterwards in fsnotify_destroy_mark() ->
fsnotify_detach_mark() -> fsnotify_detach_from_object(). Also there is
no problem with mark being visible on object list while we call
fsnotify_destroy_mark() as parallel destruction of marks from several
places is properly handled (as mentioned in the comment in
fsnotify_destroy_marks()). So just remove the list removal and also the
stale comment.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara
73cd3c33ab fsnotify: Avoid double locking in fsnotify_detach_from_object()
We lock object list lock in fsnotify_detach_from_object() twice - once
to detach mark and second time to recalculate mask. That is unnecessary
and later it will become problematic as we will free the connector as
soon as there is no mark in it. So move recalculation of fsnotify mask
into the same critical section that is detaching mark.

This also removes recalculation of child dentry flags from
fsnotify_detach_from_object(). That is however fine. Those marks will
get recalculated once some event happens on a child.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara
8212a6097a fsnotify: Remove indirection from fsnotify_detach_mark()
fsnotify_detach_mark() calls fsnotify_destroy_inode_mark() or
fsnotify_destroy_vfsmount_mark() to remove mark from object list. These
two functions are however very similar and differ only in the lock they
use to protect the object list of marks. Simplify the code by removing
the indirection and removing mark from the object list in a common
function.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara
a03e2e4f07 fsnotify: Determine lock in fsnotify_destroy_marks()
Instead of passing spinlock into fsnotify_destroy_marks() determine it
directly in that function from the connector type. This will reduce code
churn when changing lock protecting list of marks.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara
f06fd98759 fsnotify: Move locking into fsnotify_find_mark()
Move locking of a mark list into fsnotify_find_mark(). This reduces code
churn in the following patch changing lock protecting the list.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara
a242677bb1 fsnotify: Move locking into fsnotify_recalc_mask()
Move locking of locks protecting a list of marks into
fsnotify_recalc_mask(). This reduces code churn in the following patch
which changes the lock protecting the list of marks.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara
0810b4f9f2 fsnotify: Move fsnotify_destroy_marks()
Move fsnotify_destroy_marks() to be later in the fs/notify/mark.c. It
will need some functions that are declared after its current
declaration. No functional change.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara
755b5bc681 fsnotify: Remove indirection from mark list addition
Adding notification mark to object list is currently done through
fsnotify_add_{inode|vfsmount}_mark() helpers from
fsnotify_add_mark_locked() which call fsnotify_add_mark_list(). Remove
this unnecessary indirection to simplify the code.

Pushing all the locking to fsnotify_add_mark_list() also allows us to
allocate the connector structure with GFP_KERNEL mode.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara
e911d8af87 fsnotify: Make fsnotify_mark_connector hold inode reference
Currently inode reference is held by fsnotify marks. Change the rules so
that inode reference is held by fsnotify_mark_connector structure
whenever the list is non-empty. This simplifies the code and is more
logical.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara
86ffe245c4 fsnotify: Move object pointer to fsnotify_mark_connector
Move pointer to inode / vfsmount from mark itself to the
fsnotify_mark_connector structure. This is another step on the path
towards decoupling inode / vfsmount lifetime from notification mark
lifetime.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara
9dd813c15b fsnotify: Move mark list head from object into dedicated structure
Currently notification marks are attached to object (inode or vfsmnt) by
a hlist_head in the object. The list is also protected by a spinlock in
the object. So while there is any mark attached to the list of marks,
the object must be pinned in memory (and thus e.g. last iput() deleting
inode cannot happen). Also for list iteration in fsnotify() to work, we
must hold fsnotify_mark_srcu lock so that mark itself and
mark->obj_list.next cannot get freed. Thus we are required to wait for
response to fanotify events from userspace process with
fsnotify_mark_srcu lock held. That causes issues when userspace process
is buggy and does not reply to some event - basically the whole
notification subsystem gets eventually stuck.

So to be able to drop fsnotify_mark_srcu lock while waiting for
response, we have to pin the mark in memory and make sure it stays in
the object list (as removing the mark waiting for response could lead to
lost notification events for groups later in the list). However we don't
want inode reclaim to block on such mark as that would lead to system
just locking up elsewhere.

This commit is the first in the series that paves way towards solving
these conflicting lifetime needs. Instead of anchoring the list of marks
directly in the object, we anchor it in a dedicated structure
(fsnotify_mark_connector) and just point to that structure from the
object. The following commits will also add spinlock protecting the list
and object pointer to the structure.
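
The indirection described above can be pictured with the following
userspace sketch. All type and function names here are hypothetical and
heavily simplified; they are not the kernel's definitions, and locking is
left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types; not the kernel's definitions. */
struct mark {
	struct mark *next;		/* next mark on the same object */
};

struct mark_connector {
	struct mark *marks;		/* head of the list of marks */
};

struct object {				/* stands in for inode / vfsmount */
	struct mark_connector *conn;	/* NULL while no mark is attached */
};

/*
 * Attach a mark to an object. The caller passes spare connector storage,
 * which is used only if the object has no connector yet.
 */
static void attach_mark(struct object *obj, struct mark_connector *spare,
			struct mark *m)
{
	assert(obj && spare && m);
	if (!obj->conn) {
		spare->marks = NULL;
		obj->conn = spare;
	}
	m->next = obj->conn->marks;
	obj->conn->marks = m;
}

/* Walk the list through the connector rather than through the object. */
static size_t count_marks(const struct object *obj)
{
	size_t n = 0;

	if (!obj->conn)
		return 0;
	for (const struct mark *m = obj->conn->marks; m; m = m->next)
		n++;
	return n;
}
```

Because the object only stores a pointer, the list head can in principle
outlive the object, which is the decoupling the series is working towards.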

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:34 +02:00
Jan Kara
c1f33073ac fsnotify: Update comments
Add a comment that lifetime of a notification mark is protected by SRCU
and remove a comment about clearing of marks attached to the inode. It
is stale and a more up-to-date version is at fsnotify_destroy_marks() which
is the function handling this case.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:34 +02:00
Linus Torvalds
84ced7fd06 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "This is a set of CIFS/SMB3 fixes for stable.

  There is another set of four SMB3 reconnect fixes for stable in
  progress but they are still being reviewed/tested, so didn't want to
  wait any longer to send these five below"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  Reset TreeId to zero on SMB2 TREE_CONNECT
  CIFS: Fix build failure with smb2
  Introduce cifs_copy_file_range()
  SMB3: Rename clone_range to copychunk_range
  Handle mismatched open calls
2017-04-09 09:10:02 -07:00
Linus Torvalds
5b50be743f Driver core fixes for 4.11-rc6
Here are 3 small fixes for 4.11-rc6.  One resolves a reported issue with
 sysfs files that NeilBrown found, one is a documentation fix for the
 stable kernel rules, and the last is a small MAINTAINERS file update for
 kernfs.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge tag 'driver-core-4.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Here are 3 small fixes for 4.11-rc6.

  One resolves a reported issue with sysfs files that NeilBrown found,
  one is a documentation fix for the stable kernel rules, and the last
  is a small MAINTAINERS file update for kernfs"

* tag 'driver-core-4.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  MAINTAINERS: separate out kernfs maintainership
  sysfs: be careful of error returns from ops->show()
  Documentation: stable-kernel-rules: fix stable-tag format
2017-04-09 09:03:51 -07:00
Linus Torvalds
2a610b8aa8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS fixes from Al Viro:
 "statx followup fixes and a fix for stack-smashing on alpha"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  alpha: fix stack smashing in old_adjtimex(2)
  statx: Include a mask for stx_attributes in struct statx
  statx: Reserve the top bit of the mask for future struct expansion
  xfs: report crtime and attribute flags to statx
  ext4: Add statx support
  statx: optimize copy of struct statx to userspace
  statx: remove incorrect part of vfs_statx() comment
  statx: reject unknown flags when using NULL path
  Documentation/filesystems: fix documentation for ->getattr()
2017-04-09 08:26:21 -07:00
Christoph Hellwig
34045129b1 block_dev: use blkdev_issue_zeroout for hole punches
This gets us support for non-discard efficient write of zeroes (e.g. NVMe)
and prepares for removing the discard_zeroes_data flag.

Also remove a pointless discard support check, which is done in
blkdev_issue_discard already.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08 11:25:38 -06:00
Christoph Hellwig
ee472d835c block: add a flags argument to (__)blkdev_issue_zeroout
Turn the existing discard flag into a new BLKDEV_ZERO_UNMAP flag with
similar semantics, but without referring to discard.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08 11:25:38 -06:00
NeilBrown
c8a139d001 sysfs: be careful of error returns from ops->show()
ops->show() can return a negative error code.
Commit 65da3484d9 ("sysfs: correctly handle short reads on PREALLOC attrs.")
(in v4.4) caused this to be stored in an unsigned 'size_t' variable, so errors
would look like large numbers.
As a result, if an error is returned, sysfs_kf_read() will return the
value of 'count', typically 4096.

Commit 17d0774f80 ("sysfs: correctly handle read offset on PREALLOC attrs")
(in v4.8) extended this error to use the unsigned large 'len' as a size for
memmove().
Consequently, if ->show returns an error, then the first read() on the
sysfs file will return 4096 and could return uninitialized memory to
user-space.
If the application performs a subsequent read, this will trigger a memmove()
with an extremely large count, and is likely to crash the machine in bizarre ways.
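
The signedness problem described above can be reproduced in plain
userspace C. The two helpers below are hypothetical and only model the
length handling; they are not the sysfs code. The first stores a possibly
negative result in a size_t, the second keeps it signed and returns the
error before the value is used as a length.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/* Buggy pattern: a possibly negative result is stored in a size_t. */
static ssize_t read_len_buggy(ssize_t show_ret, size_t count)
{
	size_t len = (size_t)show_ret;	/* a negative errno becomes huge */

	assert(count > 0);		/* sketch assumes a non-empty buffer */
	return (ssize_t)(len < count ? len : count);
}

/* Safer pattern: keep the result signed and bail out on errors first. */
static ssize_t read_len_checked(ssize_t show_ret, size_t count)
{
	ssize_t len = show_ret;

	assert(count > 0);		/* sketch assumes a non-empty buffer */
	if (len < 0)
		return len;		/* propagate the error unchanged */
	return (size_t)len < count ? len : (ssize_t)count;
}
```

Passing -22 (the value of -EINVAL on Linux) with a count of 4096 makes the
buggy helper report 4096 bytes, while the checked helper returns -22.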

This bug can currently only be triggered by reading from an md
sysfs attribute declared with __ATTR_PREALLOC() during the
brief period between when mddev_put() deletes an mddev from
the ->all_mddevs list, and when mddev_delayed_delete() - which is
scheduled on a workqueue - completes.
Before this, an error won't be returned by ->show().
After this, ->show() won't be called.

I can reproduce it reliably only by putting a delay like
	usleep_range(500000,700000);
early in mddev_delayed_delete(). Then after creating an
md device md0 run
  echo clear > /sys/block/md0/md/array_state; cat /sys/block/md0/md/array_state

The bug can be triggered without the usleep.

Fixes: 65da3484d9 ("sysfs: correctly handle short reads on PREALLOC attrs.")
Fixes: 17d0774f80 ("sysfs: correctly handle read offset on PREALLOC attrs")
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-08 17:33:32 +02:00
Linus Torvalds
56c2997965 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "10 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: move pcp and lru-pcp draining into single wq
  mailmap: update Yakir Yang email address
  mm, swap_cgroup: reschedule when needed in swap_cgroup_swapoff()
  dax: fix radix tree insertion race
  mm, thp: fix setting of defer+madvise thp defrag mode
  ptrace: fix PTRACE_LISTEN race corrupting task->state
  vmlinux.lds: add missing VMLINUX_SYMBOL macros
  mm/page_alloc.c: fix print order in show_free_areas()
  userfaultfd: report actual registered features in fdinfo
  mm: fix page_vma_mapped_walk() for ksm pages
2017-04-08 01:35:32 -07:00
Ross Zwisler
e11f8b7b6c dax: fix radix tree insertion race
While running generic/340 in my test setup I hit the following race.  It
can happen with kernels that support FS DAX PMDs, so v4.10 thru
v4.11-rc5.

Thread 1				Thread 2
--------				--------
dax_iomap_pmd_fault()
  grab_mapping_entry()
    spin_lock_irq()
    get_unlocked_mapping_entry()
    'entry' is NULL, can't call lock_slot()
    spin_unlock_irq()
    radix_tree_preload()
					dax_iomap_pmd_fault()
					  grab_mapping_entry()
					    spin_lock_irq()
					    get_unlocked_mapping_entry()
					    ...
					    lock_slot()
					    spin_unlock_irq()
					  dax_pmd_insert_mapping()
					    <inserts a PMD mapping>
    spin_lock_irq()
    __radix_tree_insert() fails with -EEXIST
    <fall back to 4k fault, and die horribly
     when inserting a 4k entry where a PMD exists>

The issue is that we have to drop mapping->tree_lock while calling
radix_tree_preload(), but since we didn't have a radix tree entry to
lock (unlike in the pmd_downgrade case) we have no protection against
Thread 2 coming along and inserting a PMD at the same index.  For 4k
entries we handled this with a special-case response to -EEXIST coming
from the __radix_tree_insert(), but this doesn't save us for PMDs
because the -EEXIST case can also mean that we collided with a 4k entry
in the radix tree at a different index, but one that is covered by our
PMD range.

So, correctly handle both the 4k and 2M collision cases by explicitly
re-checking the radix tree for an entry at our index once we reacquire
mapping->tree_lock.
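
The general pattern behind this fix (drop the lock, do work that may
sleep, retake the lock, then look again before inserting) can be sketched
in userspace as follows. This is an analogue only: the table, the
atomic_flag spinlock and every function name are hypothetical, and the
kernel code works on a radix tree with mapping->tree_lock instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define NSLOTS 8

/* Hypothetical stand-in for a mapping: a few slots guarded by one lock. */
struct table {
	atomic_flag lock;
	void *slot[NSLOTS];
};

static void table_init(struct table *t)
{
	atomic_flag_clear(&t->lock);
	for (size_t i = 0; i < NSLOTS; i++)
		t->slot[i] = NULL;
}

static void table_lock(struct table *t)
{
	while (atomic_flag_test_and_set_explicit(&t->lock, memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

static void table_unlock(struct table *t)
{
	atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

/*
 * Return the entry at idx, creating it if needed. The lock is dropped
 * for the allocation, so the slot is looked up again after relocking
 * instead of assuming it is still empty.
 */
static void *grab_entry(struct table *t, size_t idx)
{
	void *entry, *fresh;

	assert(idx < NSLOTS);
	table_lock(t);
	entry = t->slot[idx];
	table_unlock(t);
	if (entry)
		return entry;

	fresh = malloc(1);	/* stands in for work done without the lock */
	if (!fresh)
		return NULL;

	table_lock(t);
	entry = t->slot[idx];	/* re-check: another thread may have won */
	if (!entry)
		entry = t->slot[idx] = fresh;
	table_unlock(t);

	if (entry != fresh)
		free(fresh);	/* lost the race; use the existing entry */
	return entry;
}
```

The explicit second lookup replaces any reliance on an error code from the
insertion step to detect that another thread got there first.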

This patch has made it through a clean xfstests run with the current
v4.11-rc5 based linux/master, and it also ran generic/340 500 times in a
loop.  It used to fail within the first 10 iterations.

Link: http://lkml.kernel.org/r/20170406212944.2866-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: <stable@vger.kernel.org>    [4.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-08 00:47:49 -07:00
Mike Rapoport
045098e944 userfaultfd: report actual registered features in fdinfo
fdinfo for userfault file descriptor reports UFFD_API_FEATURES.  Up
until recently, the UFFD_API_FEATURES was defined as 0, therefore
corresponding field in fdinfo always contained zero.  Now, with
introduction of several additional features, UFFD_API_FEATURES is no
longer 0 and it seems better to report actual features requested for the
userfaultfd object described by the fdinfo.

First, the applications that were using userfault will still see zero at
the features field in fdinfo.  Next, reporting actual features rather
than available features, gives clear indication of what userfault
features are used by an application.

Link: http://lkml.kernel.org/r/1491140181-22121-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-08 00:47:48 -07:00
Martin Brandenburg
cefdc26e86 orangefs: move features validation to fix filesystem hang
Without this fix (and another to the userspace component itself
described later), the kernel will be unable to process any OrangeFS
requests after the userspace component is restarted (due to a crash or
at the administrator's behest).

The bug here is that inside orangefs_remount, the orangefs_request_mutex
is locked.  When the userspace component restarts while the filesystem
is mounted, it sends an ORANGEFS_DEV_REMOUNT_ALL ioctl to the device,
which causes the kernel to send it a few requests aimed at synchronizing
the state between the two.  While this is happening the
orangefs_request_mutex is locked to prevent any other requests going
through.
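
The locking shape behind this hang can be modelled in userspace with a
non-recursive lock. The sketch below is illustrative only: the lock, the
function names and the use of -EDEADLK to stand for "would block forever"
are assumptions, and the reordering shown is just one general way to avoid
self-deadlock, not a description of how the patch restructures the code.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical non-recursive lock standing in for a request mutex. */
static atomic_flag request_lock = ATOMIC_FLAG_INIT;

/* Try to take the lock; -EDEADLK models "would block forever". */
static int request_trylock(void)
{
	return atomic_flag_test_and_set(&request_lock) ? -EDEADLK : 0;
}

static void request_unlock(void)
{
	atomic_flag_clear(&request_lock);
}

/* A request path that needs the lock itself. */
static int service_request(void)
{
	int ret = request_trylock();

	if (ret)
		return ret;
	/* ... send the request and wait for the reply ... */
	request_unlock();
	return 0;
}

/* Problem shape: the request is issued while the same lock is held. */
static int remount_nested(void)
{
	int ret = request_trylock();

	assert(ret == 0);
	ret = service_request();	/* fails: we already hold the lock */
	request_unlock();
	return ret;
}

/* One way out: finish the locked section, then issue the request. */
static int remount_then_request(void)
{
	int ret = request_trylock();

	assert(ret == 0);
	/* ... work that really needs the lock ... */
	request_unlock();
	return service_request();
}
```

Here remount_nested() returns -EDEADLK, which corresponds to the recursive
locking report quoted below, while remount_then_request() returns 0.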

This is only half of the bugfix.  The other half is in the userspace
component which outright ignores(!) requests made before it considers
the filesystem remounted, which is after the ioctl returns.  Of course
the ioctl doesn't return until after the userspace component responds to
the request it ignores.  The userspace component has been changed to
allow ORANGEFS_VFS_OP_FEATURES regardless of the mount status.

Mike Marshall says:
 "I've tested this patch against the fixed userspace part. This patch is
  real important, I hope it can make it into 4.11...

  Here's what happens when the userspace daemon is restarted, without
  the patch:

    =============================================
    [ INFO: possible recursive locking detected ]
    [   4.10.0-00007-ge98bdb3 #1 Not tainted    ]
    ---------------------------------------------
    pvfs2-client-co/29032 is trying to acquire lock:
     (orangefs_request_mutex){+.+.+.}, at: service_operation+0x3c7/0x7b0 [orangefs]
                  but task is already holding lock:
     (orangefs_request_mutex){+.+.+.}, at: dispatch_ioctl_command+0x1bf/0x330 [orangefs]

    CPU: 0 PID: 29032 Comm: pvfs2-client-co Not tainted 4.10.0-00007-ge98bdb3 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
    Call Trace:
     __lock_acquire+0x7eb/0x1290
     lock_acquire+0xe8/0x1d0
     mutex_lock_killable_nested+0x6f/0x6e0
     service_operation+0x3c7/0x7b0 [orangefs]
     orangefs_remount+0xea/0x150 [orangefs]
     dispatch_ioctl_command+0x227/0x330 [orangefs]
     orangefs_devreq_ioctl+0x29/0x70 [orangefs]
     do_vfs_ioctl+0xa3/0x6e0
     SyS_ioctl+0x79/0x90"

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Acked-by: Mike Marshall <hubcap@omnibond.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-07 13:41:22 -07:00
Jens Axboe
65f619d253 Merge branch 'for-linus' into for-4.12/block
We've added a considerable amount of fixes for stalls and issues
with the blk-mq scheduling in the 4.11 series since forking
off the for-4.12/block branch. We need to do improvements on
top of that for 4.12, so pull in the previous fixes to make
our lives easier going forward.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 12:45:20 -06:00
Liping Zhang
1680a3868f sysctl: add sanity check for proc_douintvec
Commit e7d316a02f ("sysctl: handle error writing UINT_MAX to u32
fields") introduced the proc_douintvec helper function, but it forgot to
add the related sanity check when doing register_sysctl_table.  So add
it now.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-07 09:46:44 -07:00
Jan-Marek Glogowski
806a28efe9 Reset TreeId to zero on SMB2 TREE_CONNECT
Currently the cifs module breaks the CIFS specs on reconnect as
described in http://msdn.microsoft.com/en-us/library/cc246529.aspx:

"TreeId (4 bytes): Uniquely identifies the tree connect for the
command. This MUST be 0 for the SMB2 TREE_CONNECT Request."

Signed-off-by: Jan-Marek Glogowski <glogow@fbihome.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Tested-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
2017-04-07 08:04:41 -05:00
Tobias Regnery
4fa8e504e5 CIFS: Fix build failure with smb2
I saw the following build error during a randconfig build:

fs/cifs/smb2ops.c: In function 'smb2_new_lease_key':
fs/cifs/smb2ops.c:1104:2: error: implicit declaration of function 'generate_random_uuid' [-Werror=implicit-function-declaration]

Explicitly include the right header to fix this issue.

Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-04-07 08:04:41 -05:00
Sachin Prabhu
620d8745b3 Introduce cifs_copy_file_range()
The earlier changes to copy range for cifs unintentionally disabled the more
common form of server side copy.

The patch introduces the file_operations helper cifs_copy_file_range()
which is used by the syscall copy_file_range. The new file operations
helper allows us to perform server side copies for SMB2.0 and 2.1
servers as well as SMB 3.0+ servers which do not support the ioctl
FSCTL_DUPLICATE_EXTENTS_TO_FILE.

The new helper uses the ioctl FSCTL_SRV_COPYCHUNK_WRITE to perform
server side copies. The helper is called by vfs_copy_file_range() only
once an attempt to clone the file using the ioctl
FSCTL_DUPLICATE_EXTENTS_TO_FILE has failed.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable  <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-04-07 08:04:41 -05:00
Sachin Prabhu
312bbc5946 SMB3: Rename clone_range to copychunk_range
Server side copy is one of the most important mechanisms smb2/smb3
supports and it was unintentionally disabled for most use cases.

Renaming calls to reflect the underlying smb2 ioctl called. This is
similar to the name duplicate_extents used for a similar ioctl which is
also used to duplicate files by reusing fs blocks. The name change is to
avoid confusion.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-04-07 08:04:40 -05:00
Sachin Prabhu
38bd49064a Handle mismatched open calls
A signal can interrupt a SendReceive call, which results in incoming
responses to the call being ignored. This is a problem for calls such as
open, where the successful response is ignored. This leaves an open
file resource on the server.

The patch looks into responses which were cancelled after being sent and,
in case of a successful open, closes the open fids.

For this patch, the check is only done in SendReceive2().

RH-bz: 1403319

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Cc: Stable <stable@vger.kernel.org>
2017-04-07 08:04:40 -05:00
Darrick J. Wong
84358536dc xfs: actually report xattr extents via iomap
Apparently FIEMAP for xattrs has been broken since we switched to
the iomap backend because of an incorrect check for xattr presence.
Also fix the broken locking.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-04-06 16:00:39 -07:00
Christoph Hellwig
254133f5d0 xfs: fold __xfs_trans_roll into xfs_trans_roll
No one cares about the low-level helper anymore.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-06 16:00:11 -07:00
Linus Torvalds
269c930e66 Changes since last update:
- Rework the inline directory verifier to avoid crashes on disk corruption
 - Don't change file size when punching holes w/ KEEP_SIZE
 - Close a kernel memory exposure bug

Merge tag 'xfs-4.11-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull XFS fixes from Darrick Wong:
 "Here are three more fixes for 4.11.

  The first one reworks the inline directory verifier to check the
  working copy of the directory metadata and to avoid triggering a
  periodic crash in xfs/348. The second patch fixes a regression in hole
  punching at EOF that corrupts files; and the third patch closes a
  kernel memory disclosure bug.

  Summary:

   - rework the inline directory verifier to avoid crashes on disk
     corruption

   - don't change file size when punching holes w/ KEEP_SIZE

   - close a kernel memory exposure bug"

* tag 'xfs-4.11-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix kernel memory exposure problems
  xfs: Honor FALLOC_FL_KEEP_SIZE when punching ends of files
  xfs: rework the inline directory verifiers
2017-04-06 14:42:05 -07:00
David S. Miller
ec1af27ea8 RxRPC rewrite

Merge tag 'rxrpc-rewrite-20170406' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Miscellany

Here's a set of patches that make some minor changes to AF_RXRPC:

 (1) Store error codes in struct rxrpc_call::error as negative codes and
     only convert to positive in recvmsg() to avoid confusion inside the
     kernel.

 (2) Note the result of trying to abort a call (this fails if the call is
     already 'completed').

 (3) Don't abort on temporary errors whilst processing challenge and
     response packets, but rather drop the packet and wait for
     retransmission.

And also adds some more tracing:

 (4) Protocol errors.

 (5) Received abort packets.

 (6) Changes in the Rx window size due to ACK packet information.

 (7) Client call initiation (to allow the rxrpc_call struct pointer, the
     wire call ID and the user ID/afs_call pointer to be cross-referenced).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 14:22:46 -07:00
David S. Miller
6f14f443d3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mostly simple cases of overlapping changes (adding code nearby,
a function whose name changes, for example).

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 08:24:51 -07:00
David Howells
3a92789af0 rxrpc: Use negative error codes in rxrpc_call struct
Use negative error codes in struct rxrpc_call::error because that's what
the kernel normally deals with and to make the code consistent.  We only
turn them positive when transcribing into a cmsg for userspace recvmsg.
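
The convention described above can be sketched in plain C. This is only an
illustration: struct demo_call and both helpers are hypothetical names and
are not the kernel's struct rxrpc_call or its recvmsg path.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the call record; not the kernel structure. */
struct demo_call {
	int error;	/* 0 on success, otherwise a negative errno value */
};

/* Record a failure kernel-style: the stored value is never positive. */
static void demo_call_set_error(struct demo_call *call, int err)
{
	assert(err <= 0);
	call->error = err;
}

/* Flip the sign only at the boundary where the value leaves for userspace. */
static int demo_call_error_for_user(const struct demo_call *call)
{
	return -call->error;
}
```

Keeping a single sign convention inside the code means every internal
comparison can be written against -EFOO, and only one place has to negate.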

Signed-off-by: David Howells <dhowells@redhat.com>
2017-04-06 10:11:56 +01:00
Al Viro
e73a67f7cd don't open-code kernel_setsockopt()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-06 02:09:23 -04:00
Tomohiro Kusumi
64c24ecb3c f2fs: split make_dentry_ptr() into block and inline versions
Since callers statically know which type to use, make_dentry_ptr()
can simply be split into two inline functions. This way, the code
has less inlined code, fewer arguments, and no cast.

Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:08 -07:00
Jaegeuk Kim
d1b3e72d54 f2fs: submit bio of in-place-update pages
This patch tries to split in-place-update bios from sequential bios.

Suggested-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:07 -07:00
Kaixu Xia
fc2e2875d5 f2fs: remove the redundant variable definition
The variable 'i' has been defined before, so here we can
use it directly.

Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:07 -07:00
Jaegeuk Kim
687de7f101 f2fs: avoid IO split due to mixed WB_SYNC_ALL and WB_SYNC_NONE
If two threads try to flush dirty pages in different inodes respectively,
f2fs_write_data_pages() will produce WRITE and WRITE_SYNC one at a time,
resulting in a lot of separate 4KB IOs.

So, this patch gives higher priority to WB_SYNC_ALL IOs and gathers write
IOs with a big WRITE_SYNC'ed bio.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:06 -07:00
Jaegeuk Kim
ef095d19e8 f2fs: write small sized IO to hot log
It would be better to handle small and large IOs separately in order to
get more consecutive big writes.

The default threshold is set to 64KB, but is configurable via
sysfs/min_hot_blocks.
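
A rough sketch of the threshold decision follows. The names, the 4KB block
size used to turn 64KB into a block count, and the strict "less than"
comparison are assumptions made for illustration; the real f2fs code may
differ on each point.

```c
/* Hypothetical temperature classes for the target log. */
enum demo_temp { DEMO_HOT, DEMO_WARM };

#define DEMO_BLOCK_SIZE		4096u	/* assumed block size */
#define DEMO_MIN_HOT_BLOCKS	(64u * 1024u / DEMO_BLOCK_SIZE)	/* 64KB in blocks */

/* Small writes go to the hot log, anything at or above the threshold does not. */
static enum demo_temp demo_pick_log(unsigned int nr_blocks,
				    unsigned int min_hot_blocks)
{
	return nr_blocks < min_hot_blocks ? DEMO_HOT : DEMO_WARM;
}
```

Passing the threshold as a parameter mirrors the fact that it is tunable at
run time rather than fixed at build time.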

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:05 -07:00
Chao Yu
a7eeb82385 f2fs: use bitmap in discard_entry
This patch changes struct discard_entry to use a bitmap instead of an
extent to indicate the discard ranges in one segment. For fragmented
space, this implementation reduces the memory footprint.
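
To illustrate why a bitmap helps with fragmented space, here is a small
userspace sketch. The structure, the helper names and the 512 blocks per
segment are assumptions for the example, not the actual f2fs definitions.

```c
#include <limits.h>
#include <stddef.h>

#define DEMO_BLOCKS_PER_SEG	512	/* assumed segment size in blocks */
#define DEMO_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_MAP_LONGS \
	((DEMO_BLOCKS_PER_SEG + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

/* Hypothetical entry: one bit per block, fixed size however fragmented. */
struct demo_discard_entry {
	unsigned long map[DEMO_MAP_LONGS];
};

/* Mark one block of the segment as discardable. */
static void demo_mark(struct demo_discard_entry *de, unsigned int blk)
{
	de->map[blk / DEMO_BITS_PER_LONG] |= 1UL << (blk % DEMO_BITS_PER_LONG);
}

/* Return 1 if the block is marked, 0 otherwise. */
static int demo_test(const struct demo_discard_entry *de, unsigned int blk)
{
	return (int)((de->map[blk / DEMO_BITS_PER_LONG] >>
		      (blk % DEMO_BITS_PER_LONG)) & 1UL);
}

/* Count contiguous runs: the number of extents a list would have needed. */
static unsigned int demo_count_runs(const struct demo_discard_entry *de)
{
	unsigned int runs = 0, blk;
	int prev = 0;

	for (blk = 0; blk < DEMO_BLOCKS_PER_SEG; blk++) {
		int cur = demo_test(de, blk);

		if (cur && !prev)
			runs++;
		prev = cur;
	}
	return runs;
}
```

With these assumed sizes the map always occupies 64 bytes per segment,
whereas a per-extent representation grows with the number of runs.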

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:04 -07:00
Chao Yu
f099405fc8 f2fs: clean up destroy_discard_cmd_control
Remove an unneeded parameter and simplify the flow in
destroy_discard_cmd_control.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:04 -07:00
Chao Yu
5f32366a29 f2fs: count discard command entry
Count discard command entries and show the number in debugfs. Also
add the cost of the discard command cache to the total consumed
memory footprint.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:03 -07:00
Chao Yu
8b8dd65f72 f2fs: show issued flush/discard count
Show the historical counts of flush commands and discard commands.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-05 11:05:02 -07:00
Andrew Price
d4d7fc12b6 gfs2: Re-enable fallocate for the rindex
Commit 86066914ed "gfs2: Don't support fallocate on jdata files"
removed the ability of gfs2_grow to reserve space at the end of the
rindex, which could prevent a second gfs2_grow from succeeding if the
fs is full. Allow fallocate to work on the rindex once again.

Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2017-04-05 11:45:26 -04:00
Jan Kara
1e0e653f11 reiserfs: Protect dquot_writeback_dquots() by s_umount semaphore
dquot_writeback_dquots() expects s_umount semaphore to be held to
protect it from other concurrent quota operations. reiserfs_sync_fs()
can call dquot_writeback_dquots() without holding s_umount semaphore
when called from flush_old_commits().

Fix the problem by grabbing s_umount in flush_old_commits(). However we
have to be careful and use only trylock since reiserfs_cancel_old_sync()
can be waiting for flush_old_commits() to complete while holding
s_umount semaphore. Possible postponing of sync work is not a big deal
though as that is only an opportunistic flush.
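
The trylock idea can be shown with a tiny userspace model. Everything here
is hypothetical: a C11 atomic_flag stands in for the s_umount semaphore, and
the function names only echo the ones in the message above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical superblock model guarded by a single exclusive lock. */
struct demo_sb {
	atomic_flag umount_held;	/* set while someone holds the lock */
	int flushes_done;
};

/* Try to take the lock without waiting; true means we now hold it. */
static bool demo_trylock(struct demo_sb *sb)
{
	return !atomic_flag_test_and_set(&sb->umount_held);
}

static void demo_unlock(struct demo_sb *sb)
{
	atomic_flag_clear(&sb->umount_held);
}

/*
 * Opportunistic flush: if the lock is busy, skip this round instead of
 * blocking, because the holder may itself be waiting for us to finish.
 */
static bool demo_flush_old_commits(struct demo_sb *sb)
{
	if (!demo_trylock(sb))
		return false;
	sb->flushes_done++;	/* stands in for the real writeback */
	demo_unlock(sb);
	return true;
}
```

A skipped round only delays the flush, which is acceptable for work that is
purely opportunistic.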

Fixes: 9d1ccbe70e
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-05 14:24:55 +02:00
Jan Kara
71b0576bdb reiserfs: Make cancel_old_flush() reliable
Currently canceling of delayed work that flushes old data using
cancel_old_flush() does not prevent work from being requeued. Thus,
in theory, new work can be queued after cancel_old_flush() from
reiserfs_freeze() has run. This will become a larger problem once
flush_old_commits() can requeue the work itself.

Fix the problem by recording in sbi->work_queue that flushing work is
canceled and should not be requeued.
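
A minimal sketch of the "remember that we were cancelled" pattern is given
below. The structure and its two flags are hypothetical and the locking a
real implementation would need is left out on purpose.

```c
#include <stdbool.h>

/* Hypothetical per-superblock state; a real version would be lock-protected. */
struct demo_sbi {
	bool work_queued;	/* a delayed flush is currently pending */
	bool work_cancelled;	/* set by cancel, blocks any later requeue */
};

/* Queue the flush unless a cancel has been recorded; true if queued. */
static bool demo_queue_flush(struct demo_sbi *sbi)
{
	if (sbi->work_cancelled)
		return false;
	sbi->work_queued = true;
	return true;
}

/* Cancel pending work and make the cancellation stick. */
static void demo_cancel_flush(struct demo_sbi *sbi)
{
	sbi->work_cancelled = true;
	sbi->work_queued = false;
}
```

Without the recorded state, a queue request arriving just after the cancel
would silently bring the work back.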

Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-05 14:24:16 +02:00
Jan Kara
6554766150 ext2: Call dquot_writeback_dquots() with s_umount held
ext2_sync_fs() could be called without s_umount semaphore held when
called through ext2_write_super() from __ext2_write_inode(). This
function then calls dquot_writeback_dquots() which relies on s_umount to
be held for protection against other quota operations.

In fact __ext2_write_inode() does not need all the functionality
ext2_write_super() provides. It is enough to just write the superblock.
So use ext2_sync_super() instead.

Fixes: 9d1ccbe70e
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-05 14:23:45 +02:00
Darrick J. Wong
4c934c7dd6 xfs: report realtime space information via the rtbitmap
Use the realtime bitmap to return free space information via getfsmap.
Eventually this will be superseded by the realtime rmapbt code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-04-03 15:18:18 -07:00
Darrick J. Wong
a1cae7283d xfs: have getfsmap fall back to the freesp btrees when rmap is not present
If the reverse-mapping btree isn't available, fall back to the
free space btrees to provide partial reverse mapping information.
The online scrub tool can make use of even partial information to
speed up the data block scan.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-04-03 15:18:18 -07:00
Darrick J. Wong
e89c041338 xfs: implement the GETFSMAP ioctl
Introduce a new ioctl that uses the reverse mapping btree to return
information about the physical layout of the filesystem.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-04-03 15:18:17 -07:00
Darrick J. Wong
fb3c3de2f6 xfs: add a couple of queries to iterate free extents in the rtbitmap
Add _query_range and _query_all functions to the realtime bitmap
allocator.  These two functions are similar in usage to the btree
functions with the same name and will be used for getfsmap and scrub.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-04-03 15:18:17 -07:00
Darrick J. Wong
e9a2599a24 xfs: create a function to query all records in a btree
Create a helper function that will query all records in a btree.
This will be used by the online repair functions to examine every
record in a btree to rebuild a second btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-04-03 15:18:17 -07:00
Darrick J. Wong
2d520bfaa2 xfs: provide a query_range function for freespace btrees
Implement a query_range function for the bnobt and cntbt.  This will
be used for getfsmap fallback if there is no rmapbt and by the online
scrub and repair code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-04-03 15:18:17 -07:00
Darrick J. Wong
08438b1e38 xfs: plumb in needed functions for range querying of the freespace btrees
Plumb in the pieces (init_high_key, diff_two_keys) necessary to call
query_range on the free space btrees.  Remove the debugging asserts
so that we can make queries starting from block 0.

While we're at it, merge the redundant "if (btnum ==" hunks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-04-03 15:18:17 -07:00
Darrick J. Wong
be6324c00c xfs: fix over-copying of getbmap parameters from userspace
In xfs_ioc_getbmap, we should only copy the fields of struct getbmap
from userspace, or else we end up copying random stack contents into the
kernel.  struct getbmap is a strict subset of getbmapx, so a partial
structure copy should work fine.
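
The partial copy can be sketched in userspace as follows. The two structures
are simplified, hypothetical layouts (not the real getbmap and getbmapx),
and memcpy stands in for the copy from userspace.

```c
#include <string.h>

/* Hypothetical layouts: the short form is a strict prefix of the long one. */
struct demo_bmap  { long offset; long block; long length; };
struct demo_bmapx { long offset; long block; long length; int oflags; int pad; };

/*
 * Fill the extended structure from a caller buffer. When the caller only
 * supplied the short form, copy just that many bytes; copying the full
 * extended size would read past the end of the caller's object.
 */
static void demo_copy_in(struct demo_bmapx *dst, const void *src,
			 int is_extended)
{
	size_t len = is_extended ? sizeof(struct demo_bmapx)
				 : sizeof(struct demo_bmap);

	memset(dst, 0, sizeof(*dst));	/* fields not copied stay zero */
	memcpy(dst, src, len);
}
```

Zeroing the destination first keeps the extra fields well defined when only
the short prefix is copied.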

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-04-03 15:18:16 -07:00
Nikolay Borisov
422e5b53ed xfs: Remove obsolete declaration of xfs_buf_get_empty
This function was removed in the 3.12 era at the latest. There is no
need to keep its declaration in the header, so nuke it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-03 15:18:16 -07:00
Eric Sandeen
bc593eebfd xfs: fix up inode validation failure message
"xfs_iread: validation failed for inode 96 failed"

One "failed" seems like enough.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-03 15:18:16 -07:00
Christoph Hellwig
63fbb4c18d xfs: remove the ISUNWRITTEN macro
Opencoding the trivial checks makes it much easier to read (and grep..).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-03 15:18:16 -07:00
Christoph Hellwig
9c4f29d391 xfs: factor out a xfs_bmap_is_real_extent helper
This checks for all the non-normal extent types, including handling both
encodings of delayed allocations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-03 15:18:16 -07:00
Brian Foster
696a562072 xfs: use dedicated log worker wq to avoid deadlock with cil wq
The log covering background task used to be part of the xfssyncd
workqueue. That workqueue was removed as of commit 5889608df ("xfs:
syncd workqueue is no more") and the associated work item scheduled
to the xfs-log wq. The latter is used for log buffer I/O completion.

Since xfs_log_worker() can invoke a log flush, a deadlock is
possible between the xfs-log and xfs-cil workqueues. Consider the
following codepath from xfs_log_worker():

xfs_log_worker()
  xfs_log_force()
    _xfs_log_force()
      xlog_cil_force()
        xlog_cil_force_lsn()
          xlog_cil_push_now()
            flush_work()

The above is in xfs-log wq context and blocked waiting on the
completion of an xfs-cil work item. Concurrently, the cil push in
progress can end up blocked here:

xlog_cil_push_work()
  xlog_cil_push()
    xlog_write()
      xlog_state_get_iclog_space()
        xlog_wait(&log->l_flush_wait, ...)

The above is in xfs-cil context waiting on log buffer I/O
completion, which executes in xfs-log wq context. In this scenario
both workqueues are deadlocked waiting on each other.

Add a new workqueue specifically for the high level log covering and
ail pushing worker, as was the case prior to commit 5889608df.

Diagnosed-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-04-03 15:18:15 -07:00