Commit Graph

57206 Commits

Author SHA1 Message Date
Eric W. Biederman
9c8e0a1b68 mount: Prevent MNT_DETACH from disconnecting locked mounts
Timothy Baldwin <timbaldwin@fastmail.co.uk> wrote:
> As per mount_namespaces(7) unprivileged users should not be able to look under mount points:
>
>   Mounts that come as a single unit from more privileged mount are locked
>   together and may not be separated in a less privileged mount namespace.
>
> However they can:
>
> 1. Create a mount namespace.
> 2. In the mount namespace open a file descriptor to the parent of a mount point.
> 3. Destroy the mount namespace.
> 4. Use the file descriptor to look under the mount point.
>
> I have reproduced this with Linux 4.16.18 and Linux 4.18-rc8.
>
> The setup:
>
> $ sudo sysctl kernel.unprivileged_userns_clone=1
> kernel.unprivileged_userns_clone = 1
> $ mkdir -p A/B/Secret
> $ sudo mount -t tmpfs hide A/B
>
>
> "Secret" is indeed hidden as expected:
>
> $ ls -lR A
> A:
> total 0
> drwxrwxrwt 2 root root 40 Feb 12 21:08 B
>
> A/B:
> total 0
>
>
> The attack revealing "Secret":
>
> $ unshare -Umr sh -c "exec unshare -m ls -lR /proc/self/fd/4/ 4<A"
> /proc/self/fd/4/:
> total 0
> drwxr-xr-x 3 root root 60 Feb 12 21:08 B
>
> /proc/self/fd/4/B:
> total 0
> drwxr-xr-x 2 root root 40 Feb 12 21:08 Secret
>
> /proc/self/fd/4/B/Secret:
> total 0

I tracked this down to put_mnt_ns running passing UMOUNT_SYNC and
disconnecting all of the mounts in a mount namespace.  Fix this by
factoring drop_mounts out of drop_collected_mounts and passing
0 instead of UMOUNT_SYNC.

There are two possible behavior differences that result from this.
- No longer setting UMOUNT_SYNC will no longer set MNT_SYNC_UMOUNT on
  the vfsmounts being unmounted.  This effects the lazy rcu walk by
  kicking the walk out of rcu mode and forcing it to be a non-lazy
  walk.
- No longer disconnecting locked mounts will keep some mounts around
  longer as they stay because the are locked to other mounts.

There are only two users of drop_collected mounts: audit_tree.c and
put_mnt_ns.

In audit_tree.c the mounts are private and there are no rcu lazy walks
only calls to iterate_mounts. So the changes should have no effect
except for a small timing effect as the connected mounts are disconnected.

In put_mnt_ns there may be references from process outside the mount
namespace to the mounts.  So the mounts remaining connected will
be the bug fix that is needed.  That rcu walks are allowed to continue
appears not to be a problem especially as the rcu walk change was about
an implementation detail not about semantics.

Cc: stable@vger.kernel.org
Fixes: 5ff9d8a65c ("vfs: Lock in place mounts from more privileged users")
Reported-by: Timothy Baldwin <timbaldwin@fastmail.co.uk>
Tested-by: Timothy Baldwin <timbaldwin@fastmail.co.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-11-08 01:05:32 -06:00
Eric W. Biederman
df7342b240 mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts
Jonathan Calmels from NVIDIA reported that he's able to bypass the
mount visibility security check in place in the Linux kernel by using
a combination of the unbindable property along with the private mount
propagation option to allow a unprivileged user to see a path which
was purposefully hidden by the root user.

Reproducer:
  # Hide a path to all users using a tmpfs
  root@castiana:~# mount -t tmpfs tmpfs /sys/devices/
  root@castiana:~#

  # As an unprivileged user, unshare user namespace and mount namespace
  stgraber@castiana:~$ unshare -U -m -r

  # Confirm the path is still not accessible
  root@castiana:~# ls /sys/devices/

  # Make /sys recursively unbindable and private
  root@castiana:~# mount --make-runbindable /sys
  root@castiana:~# mount --make-private /sys

  # Recursively bind-mount the rest of /sys over to /mnnt
  root@castiana:~# mount --rbind /sys/ /mnt

  # Access our hidden /sys/device as an unprivileged user
  root@castiana:~# ls /mnt/devices/
  breakpoint cpu cstate_core cstate_pkg i915 intel_pt isa kprobe
  LNXSYSTM:00 msr pci0000:00 platform pnp0 power software system
  tracepoint uncore_arb uncore_cbox_0 uncore_cbox_1 uprobe virtual

Solve this by teaching copy_tree to fail if a mount turns out to be
both unbindable and locked.

Cc: stable@vger.kernel.org
Fixes: 5ff9d8a65c ("vfs: Lock in place mounts from more privileged users")
Reported-by: Jonathan Calmels <jcalmels@nvidia.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-11-08 00:30:30 -06:00
Eric W. Biederman
25d202ed82 mount: Retest MNT_LOCKED in do_umount
It was recently pointed out that the one instance of testing MNT_LOCKED
outside of the namespace_sem is in ksys_umount.

Fix that by adding a test inside of do_umount with namespace_sem and
the mount_lock held.  As it helps to fail fails the existing test is
maintained with an additional comment pointing out that it may be racy
because the locks are not held.

Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Fixes: 5ff9d8a65c ("vfs: Lock in place mounts from more privileged users")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-11-08 00:14:21 -06:00
Vasily Averin
de59fae004 ext4: fix buffer leak in __ext4_read_dirblock() on error path
Fixes: dc6982ff4d ("ext4: refactor code to read directory blocks ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 3.9
2018-11-07 22:36:23 -05:00
Tycho Andersen
9de30f3f7f dlm: don't leak kernel pointer to userspace
In copy_result_to_user(), we first create a struct dlm_lock_result, which
contains a struct dlm_lksb, the last member of which is a pointer to the
lvb. Unfortunately, we copy the entire struct dlm_lksb to the result
struct, which is then copied to userspace at the end of the function,
leaking the contents of sb_lvbptr, which is a valid kernel pointer in some
cases (indeed, later in the same function the data it points to is copied
to userspace).

It is an error to leak kernel pointers to userspace, as it undermines KASLR
protections (see e.g. 65eea8edc3 ("floppy: Do not copy a kernel pointer to
user memory in FDGETPRM ioctl") for another example of this).

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: David Teigland <teigland@redhat.com>
2018-11-07 16:12:45 -06:00
Tycho Andersen
3f0806d259 dlm: don't allow zero length names
kobject doesn't like zero length object names, so let's test for that.

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: David Teigland <teigland@redhat.com>
2018-11-07 16:10:45 -06:00
Tycho Andersen
d968b4e240 dlm: fix invalid free
dlm_config_nodes() does not allocate nodes on failure, so we should not
free() nodes when it fails.

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: David Teigland <teigland@redhat.com>
2018-11-07 15:50:34 -06:00
Jens Axboe
d1e36282b0 block: add REQ_HIPRI and inherit it from IOCB_HIPRI
We use IOCB_HIPRI to poll for IO in the caller instead of scheduling.
This information is not available for (or after) IO submission. The
driver may make different queue choices based on the type of IO, so
make the fact that we will poll for this IO known to the lower layers
as well.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-07 13:45:00 -07:00
Omar Sandoval
d6fd0ae25c Btrfs: fix missing delayed iputs on unmount
There's a race between close_ctree() and cleaner_kthread().
close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
sees it set, but this is racy; the cleaner might have already checked
the bit and could be cleaning stuff. In particular, if it deletes unused
block groups, it will create delayed iputs for the free space cache
inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
longer running delayed iputs after a commit. Therefore, if the cleaner
creates more delayed iputs after delayed iputs are run in
btrfs_commit_super(), we will leak inodes on unmount and get a busy
inode crash from the VFS.

Fix it by parking the cleaner before we actually close anything. Then,
any remaining delayed iputs will always be handled in
btrfs_commit_super(). This also ensures that the commit in close_ctree()
is really the last commit, so we can get rid of the commit in
cleaner_kthread().

The fstest/generic/475 followed by 476 can trigger a crash that
manifests as a slab corruption caused by accessing the freed kthread
structure by a wake up function. Sample trace:

[ 5657.077612] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
[ 5657.079432] PGD 1c57a067 P4D 1c57a067 PUD da10067 PMD 0
[ 5657.080661] Oops: 0000 [#1] PREEMPT SMP
[ 5657.081592] CPU: 1 PID: 5157 Comm: fsstress Tainted: G        W         4.19.0-rc8-default+ #323
[ 5657.083703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
[ 5657.086577] RIP: 0010:shrink_page_list+0x2f9/0xe90
[ 5657.091937] RSP: 0018:ffffb5c745c8f728 EFLAGS: 00010287
[ 5657.092953] RAX: 0000000000000074 RBX: ffffb5c745c8f830 RCX: 0000000000000000
[ 5657.094590] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a8747fdf3d0
[ 5657.095987] RBP: ffffb5c745c8f9e0 R08: 0000000000000000 R09: 0000000000000000
[ 5657.097159] R10: ffff9a8747fdf5e8 R11: 0000000000000000 R12: ffffb5c745c8f788
[ 5657.098513] R13: ffff9a877f6ff2c0 R14: ffff9a877f6ff2c8 R15: dead000000000200
[ 5657.099689] FS:  00007f948d853b80(0000) GS:ffff9a877d600000(0000) knlGS:0000000000000000
[ 5657.101032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5657.101953] CR2: 00000000000000cc CR3: 00000000684bd000 CR4: 00000000000006e0
[ 5657.103159] Call Trace:
[ 5657.103776]  shrink_inactive_list+0x194/0x410
[ 5657.104671]  shrink_node_memcg.constprop.84+0x39a/0x6a0
[ 5657.105750]  shrink_node+0x62/0x1c0
[ 5657.106529]  try_to_free_pages+0x1a4/0x500
[ 5657.107408]  __alloc_pages_slowpath+0x2c9/0xb20
[ 5657.108418]  __alloc_pages_nodemask+0x268/0x2b0
[ 5657.109348]  kmalloc_large_node+0x37/0x90
[ 5657.110205]  __kmalloc_node+0x236/0x310
[ 5657.111014]  kvmalloc_node+0x3e/0x70

Fixes: 30928e9baa ("btrfs: don't run delayed_iputs in commit")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add trace ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-07 20:17:45 +01:00
Vasily Averin
53692ec074 ext4: fix buffer leak in ext4_expand_extra_isize_ea() on error path
Fixes: de05ca8526 ("ext4: move call to ext4_error() into ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.17
2018-11-07 11:14:35 -05:00
Vasily Averin
6bdc9977fc ext4: fix buffer leak in ext4_xattr_move_to_block() on error path
Fixes: 3f2571c1f9 ("ext4: factor out xattr moving")
Fixes: 6dd4ee7cab ("ext4: Expand extra_inodes space per ...")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 2.6.23
2018-11-07 11:10:21 -05:00
Vasily Averin
45ae932d24 ext4: release bs.bh before re-using in ext4_xattr_block_find()
bs.bh was taken in previous ext4_xattr_block_find() call,
it should be released before re-using

Fixes: 7e01c8e542 ("ext3/4: fix uninitialized bs in ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 2.6.26
2018-11-07 11:07:01 -05:00
Vasily Averin
ecaaf40847 ext4: fix buffer leak in ext4_xattr_get_block() on error path
Fixes: dec214d00e ("ext4: xattr inode deduplication")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.13
2018-11-07 11:01:33 -05:00
Vasily Averin
af18e35bfd ext4: fix possible leak of s_journal_flag_rwsem in error path
Fixes: c8585c6fca ("ext4: fix races between changing inode journal ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.7
2018-11-07 10:56:28 -05:00
Theodore Ts'o
9e463084cd ext4: fix possible leak of sbi->s_group_desc_leak in error path
Fixes: bfe0a5f47a ("ext4: add more mount time checks of the superblock")
Reported-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.18
2018-11-07 10:32:53 -05:00
Vasily Averin
1bfc204dc0 ext4: remove unneeded brelse call in ext4_xattr_inode_update_ref()
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-11-06 17:45:02 -05:00
Theodore Ts'o
4f32c38b46 ext4: avoid possible double brelse() in add_new_gdb() on error path
Fixes: b40971426a ("ext4: add error checking to calls to ...")
Reported-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 2.6.38
2018-11-06 17:18:17 -05:00
Vasily Averin
feaf264ce7 ext4: avoid buffer leak in ext4_orphan_add() after prior errors
Fixes: d745a8c20c ("ext4: reduce contention on s_orphan_lock")
Fixes: 6e3617e579 ("ext4: Handle non empty on-disk orphan link")
Cc: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 2.6.34
2018-11-06 17:01:36 -05:00
Vasily Averin
a6758309a0 ext4: avoid buffer leak on shutdown in ext4_mark_iloc_dirty()
ext4_mark_iloc_dirty() callers expect that it releases iloc->bh
even if it returns an error.

Fixes: 0db1ff222d ("ext4: add shutdown bit and check for it")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.11
2018-11-06 16:49:50 -05:00
Vasily Averin
db6aee6240 ext4: fix possible inode leak in the retry loop of ext4_resize_fs()
Fixes: 1c6bd7173d ("ext4: convert file system to meta_bg if needed ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 3.7
2018-11-06 16:20:40 -05:00
Vasily Averin
f348e2241f ext4: fix missing cleanup if ext4_alloc_flex_bg_array() fails while resizing
Fixes: 117fff10d7 ("ext4: grow the s_flex_groups array as needed ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 3.7
2018-11-06 16:16:01 -05:00
Dave Chinner
837514f7a4 xfs: fix overflow in xfs_attr3_leaf_verify
generic/070 on 64k block size filesystems is failing with a verifier
corruption on writeback or an attribute leaf block:

[   94.973083] XFS (pmem0): Metadata corruption detected at xfs_attr3_leaf_verify+0x246/0x260, xfs_attr3_leaf block 0x811480
[   94.975623] XFS (pmem0): Unmount and run xfs_repair
[   94.976720] XFS (pmem0): First 128 bytes of corrupted metadata buffer:
[   94.978270] 000000004b2e7b45: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
[   94.980268] 000000006b1db90b: 00 00 00 00 00 81 14 80 00 00 00 00 00 00 00 00  ................
[   94.982251] 00000000433f2407: 22 7b 5c 82 2d 5c 47 4c bb 31 1c 37 fa a9 ce d6  "{\.-\GL.1.7....
[   94.984157] 0000000010dc7dfb: 00 00 00 00 00 81 04 8a 00 0a 18 e8 dd 94 01 00  ................
[   94.986215] 00000000d5a19229: 00 a0 dc f4 fe 98 01 68 f0 d8 07 e0 00 00 00 00  .......h........
[   94.988171] 00000000521df36c: 0c 2d 32 e2 fe 20 01 00 0c 2d 58 65 fe 0c 01 00  .-2.. ...-Xe....
[   94.990162] 000000008477ae06: 0c 2d 5b 66 fe 8c 01 00 0c 2d 71 35 fe 7c 01 00  .-[f.....-q5.|..
[   94.992139] 00000000a4a6bca6: 0c 2d 72 37 fc d4 01 00 0c 2d d8 b8 f0 90 01 00  .-r7.....-......
[   94.994789] XFS (pmem0): xfs_do_force_shutdown(0x8) called from line 1453 of file fs/xfs/xfs_buf.c. Return address = ffffffff815365f3

This is failing this check:

                end = ichdr.freemap[i].base + ichdr.freemap[i].size;
                if (end < ichdr.freemap[i].base)
>>>>>                   return __this_address;
                if (end > mp->m_attr_geo->blksize)
                        return __this_address;

And from the buffer output above, the freemap array is:

	freemap[0].base = 0x00a0
	freemap[0].size = 0xdcf4	end = 0xdd94
	freemap[1].base = 0xfe98
	freemap[1].size = 0x0168	end = 0x10000
	freemap[2].base = 0xf0d8
	freemap[2].size = 0x07e0	end = 0xf8b8

These all look valid - the block size is 0x10000 and so from the
last check in the above verifier fragment we know that the end
of freemap[1] is valid. The problem is that end is declared as:

	uint16_t	end;

And (uint16_t)0x10000 = 0. So we have a verifier bug here, not a
corruption. Fix the verifier to use uint32_t types for the check and
hence avoid the overflow.

Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=201577
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-06 07:50:50 -08:00
Darrick J. Wong
bdec055bb9 xfs: print buffer offsets when dumping corrupt buffers
Use DUMP_PREFIX_OFFSET when printing hex dumps of corrupt buffers
because modern Linux now prints a 32-bit hash of our 64-bit pointer when
using DUMP_PREFIX_ADDRESS:

00000000b4bb4297: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
00000005ec77e26: 00 00 00 00 02 d0 5a 00 00 00 00 00 00 00 00 00  ......Z.........
000000015938018: 21 98 e8 b4 fd de 4c 07 bc ea 3c e5 ae b4 7c 48  !.....L...<...|H

This is totally worthless for a sequential dump since we probably only
care about tracking the buffer offsets and afaik there's no way to
recover the actual pointer from the hashed value.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-11-06 07:50:50 -08:00
Christophe JAILLET
132bf67237 xfs: Fix error code in 'xfs_ioc_getbmap()'
In this function, once 'buf' has been allocated, we unconditionally
return 0.
However, 'error' is set to some error codes in several error handling
paths.
Before commit 232b51948b ("xfs: simplify the xfs_getbmap interface")
this was not an issue because all error paths were returning directly,
but now that some cleanup at the end may be needed, we must propagate the
error code.

Fixes: 232b51948b ("xfs: simplify the xfs_getbmap interface")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-11-06 07:50:50 -08:00
Filipe Manana
ac765f83f1 Btrfs: fix data corruption due to cloning of eof block
We currently allow cloning a range from a file which includes the last
block of the file even if the file's size is not aligned to the block
size. This is fine and useful when the destination file has the same size,
but when it does not and the range ends somewhere in the middle of the
destination file, it leads to corruption because the bytes between the EOF
and the end of the block have undefined data (when there is support for
discard/trimming they have a value of 0x00).

Example:

 $ mkfs.btrfs -f /dev/sdb
 $ mount /dev/sdb /mnt

 $ export foo_size=$((256 * 1024 + 100))
 $ xfs_io -f -c "pwrite -S 0x3c 0 $foo_size" /mnt/foo
 $ xfs_io -f -c "pwrite -S 0xb5 0 1M" /mnt/bar

 $ xfs_io -c "reflink /mnt/foo 0 512K $foo_size" /mnt/bar

 $ od -A d -t x1 /mnt/bar
 0000000 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
 *
 0524288 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c
 *
 0786528 3c 3c 3c 3c 00 00 00 00 00 00 00 00 00 00 00 00
 0786544 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 *
 0790528 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
 *
 1048576

The bytes in the range from 786532 (512Kb + 256Kb + 100 bytes) to 790527
(512Kb + 256Kb + 4Kb - 1) got corrupted, having now a value of 0x00 instead
of 0xb5.

This is similar to the problem we had for deduplication that got recently
fixed by commit de02b9f6bb ("Btrfs: fix data corruption when
deduplicating between different files").

Fix this by not allowing such operations to be performed and return the
errno -EINVAL to user space. This is what XFS is doing as well at the VFS
level. This change however now makes us return -EINVAL instead of
-EOPNOTSUPP for cases where the source range maps to an inline extent and
the destination range's end is smaller then the destination file's size,
since the detection of inline extents is done during the actual process of
dropping file extent items (at __btrfs_drop_extents()). Returning the
-EINVAL error is done early on and solely based on the input parameters
(offsets and length) and destination file's size. This makes us consistent
with XFS and anyone else supporting cloning since this case is now checked
at a higher level in the VFS and is where the -EINVAL will be returned
from starting with kernel 4.20 (the VFS changed was introduced in 4.20-rc1
by commit 07d19dc9fb ("vfs: avoid problematic remapping requests into
partial EOF block"). So this change is more geared towards stable kernels,
as it's unlikely the new VFS checks get removed intentionally.

A test case for fstests follows soon, as well as an update to filter
existing tests that expect -EOPNOTSUPP to accept -EINVAL as well.

CC: <stable@vger.kernel.org> # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-06 16:42:41 +01:00
Filipe Manana
11023d3f5f Btrfs: fix infinite loop on inode eviction after deduplication of eof block
If we attempt to deduplicate the last block of a file A into the middle of
a file B, and file A's size is not a multiple of the block size, we end
rounding the deduplication length to 0 bytes, to avoid the data corruption
issue fixed by commit de02b9f6bb ("Btrfs: fix data corruption when
deduplicating between different files"). However a length of zero will
cause the insertion of an extent state with a start value greater (by 1)
then the end value, leading to a corrupt extent state that will trigger a
warning and cause chaos such as an infinite loop during inode eviction.
Example trace:

 [96049.833585] ------------[ cut here ]------------
 [96049.833714] WARNING: CPU: 0 PID: 24448 at fs/btrfs/extent_io.c:436 insert_state+0x101/0x120 [btrfs]
 [96049.833767] CPU: 0 PID: 24448 Comm: xfs_io Not tainted 4.19.0-rc7-btrfs-next-39 #1
 [96049.833768] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
 [96049.833780] RIP: 0010:insert_state+0x101/0x120 [btrfs]
 [96049.833783] RSP: 0018:ffffafd2c3707af0 EFLAGS: 00010282
 [96049.833785] RAX: 0000000000000000 RBX: 000000000004dfff RCX: 0000000000000006
 [96049.833786] RDX: 0000000000000007 RSI: ffff99045c143230 RDI: ffff99047b2168a0
 [96049.833787] RBP: ffff990457851cd0 R08: 0000000000000001 R09: 0000000000000000
 [96049.833787] R10: ffffafd2c3707ab8 R11: 0000000000000000 R12: ffff9903b93b12c8
 [96049.833788] R13: 000000000004e000 R14: ffffafd2c3707b80 R15: ffffafd2c3707b78
 [96049.833790] FS:  00007f5c14e7d700(0000) GS:ffff99047b200000(0000) knlGS:0000000000000000
 [96049.833791] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [96049.833792] CR2: 00007f5c146abff8 CR3: 0000000115f4c004 CR4: 00000000003606f0
 [96049.833795] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [96049.833796] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [96049.833796] Call Trace:
 [96049.833809]  __set_extent_bit+0x46c/0x6a0 [btrfs]
 [96049.833823]  lock_extent_bits+0x6b/0x210 [btrfs]
 [96049.833831]  ? _raw_spin_unlock+0x24/0x30
 [96049.833841]  ? test_range_bit+0xdf/0x130 [btrfs]
 [96049.833853]  lock_extent_range+0x8e/0x150 [btrfs]
 [96049.833864]  btrfs_double_extent_lock+0x78/0xb0 [btrfs]
 [96049.833875]  btrfs_extent_same_range+0x14e/0x550 [btrfs]
 [96049.833885]  ? rcu_read_lock_sched_held+0x3f/0x70
 [96049.833890]  ? __kmalloc_node+0x2b0/0x2f0
 [96049.833899]  ? btrfs_dedupe_file_range+0x19a/0x280 [btrfs]
 [96049.833909]  btrfs_dedupe_file_range+0x270/0x280 [btrfs]
 [96049.833916]  vfs_dedupe_file_range_one+0xd9/0xe0
 [96049.833919]  vfs_dedupe_file_range+0x131/0x1b0
 [96049.833924]  do_vfs_ioctl+0x272/0x6e0
 [96049.833927]  ? __fget+0x113/0x200
 [96049.833931]  ksys_ioctl+0x70/0x80
 [96049.833933]  __x64_sys_ioctl+0x16/0x20
 [96049.833937]  do_syscall_64+0x60/0x1b0
 [96049.833939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 [96049.833941] RIP: 0033:0x7f5c1478ddd7
 [96049.833943] RSP: 002b:00007ffe15b196a8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
 [96049.833945] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5c1478ddd7
 [96049.833946] RDX: 00005625ece322d0 RSI: 00000000c0189436 RDI: 0000000000000004
 [96049.833947] RBP: 0000000000000000 R08: 00007f5c14a46f48 R09: 0000000000000040
 [96049.833948] R10: 0000000000000541 R11: 0000000000000202 R12: 0000000000000000
 [96049.833949] R13: 0000000000000000 R14: 0000000000000004 R15: 00005625ece322d0
 [96049.833954] irq event stamp: 6196
 [96049.833956] hardirqs last  enabled at (6195): [<ffffffff91b00663>] console_unlock+0x503/0x640
 [96049.833958] hardirqs last disabled at (6196): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
 [96049.833959] softirqs last  enabled at (6114): [<ffffffff92600370>] __do_softirq+0x370/0x421
 [96049.833964] softirqs last disabled at (6095): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
 [96049.833965] ---[ end trace db7b05f01b7fa10c ]---
 [96049.935816] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910
 [96049.935822] irq event stamp: 6584
 [96049.935823] hardirqs last  enabled at (6583): [<ffffffff91b00663>] console_unlock+0x503/0x640
 [96049.935825] hardirqs last disabled at (6584): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
 [96049.935827] softirqs last  enabled at (6328): [<ffffffff92600370>] __do_softirq+0x370/0x421
 [96049.935828] softirqs last disabled at (6313): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
 [96049.935829] ---[ end trace db7b05f01b7fa123 ]---
 [96049.935840] ------------[ cut here ]------------
 [96049.936065] WARNING: CPU: 1 PID: 24463 at fs/btrfs/extent_io.c:436 insert_state+0x101/0x120 [btrfs]
 [96049.936107] CPU: 1 PID: 24463 Comm: umount Tainted: G        W         4.19.0-rc7-btrfs-next-39 #1
 [96049.936108] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
 [96049.936117] RIP: 0010:insert_state+0x101/0x120 [btrfs]
 [96049.936119] RSP: 0018:ffffafd2c3637bc0 EFLAGS: 00010282
 [96049.936120] RAX: 0000000000000000 RBX: 000000000004dfff RCX: 0000000000000006
 [96049.936121] RDX: 0000000000000007 RSI: ffff990445cf88e0 RDI: ffff99047b2968a0
 [96049.936122] RBP: ffff990457851cd0 R08: 0000000000000001 R09: 0000000000000000
 [96049.936123] R10: ffffafd2c3637b88 R11: 0000000000000000 R12: ffff9904574301e8
 [96049.936124] R13: 000000000004e000 R14: ffffafd2c3637c50 R15: ffffafd2c3637c48
 [96049.936125] FS:  00007fe4b87e72c0(0000) GS:ffff99047b280000(0000) knlGS:0000000000000000
 [96049.936126] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [96049.936128] CR2: 00005562e52618d8 CR3: 00000001151c8005 CR4: 00000000003606e0
 [96049.936129] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [96049.936131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [96049.936131] Call Trace:
 [96049.936141]  __set_extent_bit+0x46c/0x6a0 [btrfs]
 [96049.936154]  lock_extent_bits+0x6b/0x210 [btrfs]
 [96049.936167]  btrfs_evict_inode+0x1e1/0x5a0 [btrfs]
 [96049.936172]  evict+0xbf/0x1c0
 [96049.936174]  dispose_list+0x51/0x80
 [96049.936176]  evict_inodes+0x193/0x1c0
 [96049.936180]  generic_shutdown_super+0x3f/0x110
 [96049.936182]  kill_anon_super+0xe/0x30
 [96049.936189]  btrfs_kill_super+0x13/0x100 [btrfs]
 [96049.936191]  deactivate_locked_super+0x3a/0x70
 [96049.936193]  cleanup_mnt+0x3b/0x80
 [96049.936195]  task_work_run+0x93/0xc0
 [96049.936198]  exit_to_usermode_loop+0xfa/0x100
 [96049.936201]  do_syscall_64+0x17f/0x1b0
 [96049.936202]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 [96049.936204] RIP: 0033:0x7fe4b80cfb37
 [96049.936206] RSP: 002b:00007ffff092b688 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 [96049.936207] RAX: 0000000000000000 RBX: 00005562e5259060 RCX: 00007fe4b80cfb37
 [96049.936208] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005562e525faa0
 [96049.936209] RBP: 00005562e525faa0 R08: 00005562e525f770 R09: 0000000000000015
 [96049.936210] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fe4b85d1e64
 [96049.936211] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910
 [96049.936211] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910
 [96049.936216] irq event stamp: 6616
 [96049.936219] hardirqs last  enabled at (6615): [<ffffffff91b00663>] console_unlock+0x503/0x640
 [96049.936219] hardirqs last disabled at (6616): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
 [96049.936222] softirqs last  enabled at (6328): [<ffffffff92600370>] __do_softirq+0x370/0x421
 [96049.936222] softirqs last disabled at (6313): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
 [96049.936223] ---[ end trace db7b05f01b7fa124 ]---

The second stack trace, from inode eviction, is repeated forever due to
the infinite loop during eviction.

This is the same type of problem fixed way back in 2015 by commit
113e828386 ("Btrfs: fix inode eviction infinite loop after extent_same
ioctl") and commit ccccf3d672 ("Btrfs: fix inode eviction infinite loop
after cloning into it").

So fix this by returning immediately if the deduplication range length
gets rounded down to 0 bytes, as there is nothing that needs to be done in
such case.

Example reproducer:

 $ mkfs.btrfs -f /dev/sdb
 $ mount /dev/sdb /mnt

 $ xfs_io -f -c "pwrite -S 0xe6 0 100" /mnt/foo
 $ xfs_io -f -c "pwrite -S 0xe6 0 1M" /mnt/bar

 # Unmount the filesystem and mount it again so that we start without any
 # extent state records when we ask for the deduplication.
 $ umount /mnt
 $ mount /dev/sdb /mnt

 $ xfs_io -c "dedupe /mnt/foo 0 500K 100" /mnt/bar

 # This unmount triggers the infinite loop.
 $ umount /mnt

A test case for fstests will follow soon.

Fixes: de02b9f6bb ("Btrfs: fix data corruption when deduplicating between different files")
CC: <stable@vger.kernel.org> # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-06 16:42:37 +01:00
Filipe Manana
4222ea7100 Btrfs: fix deadlock on tree root leaf when finding free extent
When we are writing out a free space cache, during the transaction commit
phase, we can end up in a deadlock which results in a stack trace like the
following:

 schedule+0x28/0x80
 btrfs_tree_read_lock+0x8e/0x120 [btrfs]
 ? finish_wait+0x80/0x80
 btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
 btrfs_search_slot+0xf6/0x9f0 [btrfs]
 ? evict_refill_and_join+0xd0/0xd0 [btrfs]
 ? inode_insert5+0x119/0x190
 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
 ? kmem_cache_alloc+0x166/0x1d0
 btrfs_iget+0x113/0x690 [btrfs]
 __lookup_free_space_inode+0xd8/0x150 [btrfs]
 lookup_free_space_inode+0x5b/0xb0 [btrfs]
 load_free_space_cache+0x7c/0x170 [btrfs]
 ? cache_block_group+0x72/0x3b0 [btrfs]
 cache_block_group+0x1b3/0x3b0 [btrfs]
 ? finish_wait+0x80/0x80
 find_free_extent+0x799/0x1010 [btrfs]
 btrfs_reserve_extent+0x9b/0x180 [btrfs]
 btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
 __btrfs_cow_block+0x11d/0x500 [btrfs]
 btrfs_cow_block+0xdc/0x180 [btrfs]
 btrfs_search_slot+0x3bd/0x9f0 [btrfs]
 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
 ? kmem_cache_alloc+0x166/0x1d0
 btrfs_update_inode_item+0x46/0x100 [btrfs]
 cache_save_setup+0xe4/0x3a0 [btrfs]
 btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
 btrfs_commit_transaction+0xcb/0x8b0 [btrfs]

At cache_save_setup() we need to update the inode item of a block group's
cache which is located in the tree root (fs_info->tree_root), which means
that it may result in COWing a leaf from that tree. If that happens we
need to find a free metadata extent and while looking for one, if we find
a block group which was not cached yet we attempt to load its cache by
calling cache_block_group(). However this function will try to load the
inode of the free space cache, which requires finding the matching inode
item in the tree root - if that inode item is located in the same leaf as
the inode item of the space cache we are updating at cache_save_setup(),
we end up in a deadlock, since we try to obtain a read lock on the same
extent buffer that we previously write locked.

So fix this by using the tree root's commit root when searching for a
block group's free space cache inode item when we are attempting to load
a free space cache. This is safe since block groups once loaded stay in
memory forever, as well as their caches, so after they are first loaded
we will never need to read their inode items again. For new block groups,
once they are created they get their ->cached field set to
BTRFS_CACHE_FINISHED meaning we will not need to read their inode item.

Reported-by: Andrew Nelson <andrew.s.nelson@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAPTELenq9x5KOWuQ+fa7h1r3nsJG8vyiTH8+ifjURc_duHh2Wg@mail.gmail.com/
Fixes: 9d66e233c7 ("Btrfs: load free space cache if it exists")
Tested-by: Andrew Nelson <andrew.s.nelson@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-06 16:42:32 +01:00
Arnd Bergmann
7e17916b35 btrfs: avoid link error with CONFIG_NO_AUTO_INLINE
Note: this patch fixes a problem in a feature outside of btrfs ("kernel
hacking: add a config option to disable compiler auto-inlining") and is
applied ahead of time due to cross-subsystem dependencies.

On 32-bit ARM with gcc-8, I see a link error with the addition of the
CONFIG_NO_AUTO_INLINE option:

fs/btrfs/super.o: In function `btrfs_statfs':
super.c:(.text+0x67b8): undefined reference to `__aeabi_uldivmod'
super.c:(.text+0x67fc): undefined reference to `__aeabi_uldivmod'
super.c:(.text+0x6858): undefined reference to `__aeabi_uldivmod'
super.c:(.text+0x6920): undefined reference to `__aeabi_uldivmod'
super.c:(.text+0x693c): undefined reference to `__aeabi_uldivmod'
fs/btrfs/super.o:super.c:(.text+0x6958): more undefined references to `__aeabi_uldivmod' follow

So far this is the only file that shows the behavior, so I'd propose
to just work around it by marking the functions as 'static inline'
that normally get inlined here.

The reference to __aeabi_uldivmod comes from a div_u64() which has an
optimization for a constant division that uses a straight '/' operator
when the result should be known to the compiler. My interpretation is
that as we turn off inlining, gcc still expects the result to be constant
but fails to use that constant value.

Link: https://lkml.kernel.org/r/20181103153941.1881966-1-arnd@arndb.de
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
[ add the note ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-06 16:42:08 +01:00
Shaokun Zhang
761333f2f5 btrfs: tree-checker: Fix misleading group system information
block_group_err shows the group system as a decimal value with a '0x'
prefix, which is somewhat misleading.

Fix it to print hexadecimal, as was intended.

Fixes: fce466eab7 ("btrfs: tree-checker: Verify block_group_item")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-06 16:41:53 +01:00
Filipe Manana
008c6753f7 Btrfs: fix missing data checksums after a ranged fsync (msync)
Recently we got a massive simplification for fsync, where for the fast
path we no longer log new extents while their respective ordered extents
are still running.

However that simplification introduced a subtle regression for the case
where we use a ranged fsync (msync). Consider the following example:

               CPU 0                                    CPU 1

                                            mmap write to range [2Mb, 4Mb[
  mmap write to range [512Kb, 1Mb[
  msync range [512K, 1Mb[
    --> triggers fast fsync
        (BTRFS_INODE_NEEDS_FULL_SYNC
         not set)
    --> creates extent map A for this
        range and adds it to list of
        modified extents
    --> starts ordered extent A for
        this range
    --> waits for it to complete

                                            writeback triggered for range
                                            [2Mb, 4Mb[
                                              --> create extent map B and
                                                  adds it to the list of
                                                  modified extents
                                              --> creates ordered extent B

    --> start looking for and logging
        modified extents
    --> logs extent maps A and B
    --> finds checksums for extent A
        in the csum tree, but not for
        extent B
  fsync (msync) finishes

                                              --> ordered extent B
                                                  finishes and its
                                                  checksums are added
                                                  to the csum tree

                                <power cut>

After replaying the log, we have the extent covering the range [2Mb, 4Mb[
but do not have the data checksum items covering that file range.

This happens because at the very beginning of an fsync (btrfs_sync_file())
we start and wait for IO in the given range [512Kb, 1Mb[ and therefore
wait for any ordered extents in that range to complete before we start
logging the extents. However if right before we start logging the extent
in our range [512Kb, 1Mb[, writeback is started for any other dirty range,
such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync
or msync (btrfs_sync_file() starts writeback before acquiring the inode's
lock), an ordered extent is created for that other range and a new extent
map is created to represent that range and added to the inode's list of
modified extents.

That means that we will see that other extent in that list when collecting
extents for logging (done at btrfs_log_changed_extents()) and log the
extent before the respective ordered extent finishes - namely before the
checksum items are added to the checksums tree, which is where
log_extent_csums() looks for the checksums, therefore making us log an
extent without logging its checksums. Before that massive simplification
of fsync, this wasn't a problem because besides looking for checkums in
the checksums tree, we also looked for them in any ordered extent still
running.

The consequence of data checksums missing for a file range is that users
attempting to read the affected file range will get -EIO errors and dmesg
reports the following:

 [10188.358136] BTRFS info (device sdc): no csum found for inode 297 start 57344
 [10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off 57344 csum 0x98f94189 expected csum 0x00000000 mirror 1

So fix this by skipping extents outside of our logging range at
btrfs_log_changed_extents() and leaving them on the list of modified
extents so that any subsequent ranged fsync may collect them if needed.
Also, if we find a hole extent outside of the range still log it, just
to prevent having gaps between extent items after replaying the log,
otherwise fsck will complain when we are not using the NO_HOLES feature
(fstest btrfs/056 triggers such case).

Fixes: e7175a6927 ("btrfs: remove the wait ordered logic in the log_one_extent path")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-06 16:41:40 +01:00
Lu Fengqi
fcd5e74288 btrfs: fix pinned underflow after transaction aborted
When running generic/475, we may get the following warning in dmesg:

[ 6902.102154] WARNING: CPU: 3 PID: 18013 at fs/btrfs/extent-tree.c:9776 btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
[ 6902.109160] CPU: 3 PID: 18013 Comm: umount Tainted: G        W  O      4.19.0-rc8+ #8
[ 6902.110971] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 6902.112857] RIP: 0010:btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
[ 6902.118921] RSP: 0018:ffffc9000459bdb0 EFLAGS: 00010286
[ 6902.120315] RAX: ffff880175050bb0 RBX: ffff8801124a8000 RCX: 0000000000170007
[ 6902.121969] RDX: 0000000000000002 RSI: 0000000000170007 RDI: ffffffff8125fb74
[ 6902.123716] RBP: ffff880175055d10 R08: 0000000000000000 R09: 0000000000000000
[ 6902.125417] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880175055d88
[ 6902.127129] R13: ffff880175050bb0 R14: 0000000000000000 R15: dead000000000100
[ 6902.129060] FS:  00007f4507223780(0000) GS:ffff88017ba00000(0000) knlGS:0000000000000000
[ 6902.130996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6902.132558] CR2: 00005623599cac78 CR3: 000000014b700001 CR4: 00000000003606e0
[ 6902.134270] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6902.135981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6902.137836] Call Trace:
[ 6902.138939]  close_ctree+0x171/0x330 [btrfs]
[ 6902.140181]  ? kthread_stop+0x146/0x1f0
[ 6902.141277]  generic_shutdown_super+0x6c/0x100
[ 6902.142517]  kill_anon_super+0x14/0x30
[ 6902.143554]  btrfs_kill_super+0x13/0x100 [btrfs]
[ 6902.144790]  deactivate_locked_super+0x2f/0x70
[ 6902.146014]  cleanup_mnt+0x3b/0x70
[ 6902.147020]  task_work_run+0x9e/0xd0
[ 6902.148036]  do_syscall_64+0x470/0x600
[ 6902.149142]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 6902.150375]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 6902.151640] RIP: 0033:0x7f45077a6a7b
[ 6902.157324] RSP: 002b:00007ffd589f3e68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 6902.159187] RAX: 0000000000000000 RBX: 000055e8eec732b0 RCX: 00007f45077a6a7b
[ 6902.160834] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055e8eec73490
[ 6902.162526] RBP: 0000000000000000 R08: 000055e8eec734b0 R09: 00007ffd589f26c0
[ 6902.164141] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e8eec73490
[ 6902.165815] R13: 00007f4507ac61a4 R14: 0000000000000000 R15: 00007ffd589f40d8
[ 6902.167553] irq event stamp: 0
[ 6902.168998] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
[ 6902.170731] hardirqs last disabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
[ 6902.172773] softirqs last  enabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
[ 6902.174671] softirqs last disabled at (0): [<0000000000000000>]           (null)
[ 6902.176407] ---[ end trace 463138c2986b275c ]---
[ 6902.177636] BTRFS info (device dm-3): space_info 4 has 273465344 free, is not full
[ 6902.179453] BTRFS info (device dm-3): space_info total=276824064, used=4685824, pinned=18446744073708158976, reserved=0, may_use=0, readonly=65536

In the above line there's "pinned=18446744073708158976" which is an
unsigned u64 value of -1392640, an obvious underflow.

When transaction_kthread is running cleanup_transaction(), another
fsstress is running btrfs_commit_transaction(). The
btrfs_finish_extent_commit() may get the same range as
btrfs_destroy_pinned_extent() got, which causes the pinned underflow.

Fixes: d4b450cd4b ("Btrfs: fix race between transaction commit and empty block group removal")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-06 16:41:34 +01:00
Robbie Ko
506481b20e Btrfs: fix cur_offset in the error case for nocow
When the cow_file_range fails, the related resources are unlocked
according to the range [start..end), so the unlock cannot be repeated in
run_delalloc_nocow.

In some cases (e.g. cur_offset <= end && cow_start != -1), cur_offset is
not updated correctly, so move the cur_offset update before
cow_file_range.

  kernel BUG at mm/page-writeback.c:2663!
  Internal error: Oops - BUG: 0 [#1] SMP
  CPU: 3 PID: 31525 Comm: kworker/u8:7 Tainted: P O
  Hardware name: Realtek_RTD1296 (DT)
  Workqueue: writeback wb_workfn (flush-btrfs-1)
  task: ffffffc076db3380 ti: ffffffc02e9ac000 task.ti: ffffffc02e9ac000
  PC is at clear_page_dirty_for_io+0x1bc/0x1e8
  LR is at clear_page_dirty_for_io+0x14/0x1e8
  pc : [<ffffffc00033c91c>] lr : [<ffffffc00033c774>] pstate: 40000145
  sp : ffffffc02e9af4f0
  Process kworker/u8:7 (pid: 31525, stack limit = 0xffffffc02e9ac020)
  Call trace:
  [<ffffffc00033c91c>] clear_page_dirty_for_io+0x1bc/0x1e8
  [<ffffffbffc514674>] extent_clear_unlock_delalloc+0x1e4/0x210 [btrfs]
  [<ffffffbffc4fb168>] run_delalloc_nocow+0x3b8/0x948 [btrfs]
  [<ffffffbffc4fb948>] run_delalloc_range+0x250/0x3a8 [btrfs]
  [<ffffffbffc514c0c>] writepage_delalloc.isra.21+0xbc/0x1d8 [btrfs]
  [<ffffffbffc516048>] __extent_writepage+0xe8/0x248 [btrfs]
  [<ffffffbffc51630c>] extent_write_cache_pages.isra.17+0x164/0x378 [btrfs]
  [<ffffffbffc5185a8>] extent_writepages+0x48/0x68 [btrfs]
  [<ffffffbffc4f5828>] btrfs_writepages+0x20/0x30 [btrfs]
  [<ffffffc00033d758>] do_writepages+0x30/0x88
  [<ffffffc0003ba0f4>] __writeback_single_inode+0x34/0x198
  [<ffffffc0003ba6c4>] writeback_sb_inodes+0x184/0x3c0
  [<ffffffc0003ba96c>] __writeback_inodes_wb+0x6c/0xc0
  [<ffffffc0003bac20>] wb_writeback+0x1b8/0x1c0
  [<ffffffc0003bb0f0>] wb_workfn+0x150/0x250
  [<ffffffc0002b0014>] process_one_work+0x1dc/0x388
  [<ffffffc0002b02f0>] worker_thread+0x130/0x500
  [<ffffffc0002b6344>] kthread+0x10c/0x110
  [<ffffffc000284590>] ret_from_fork+0x10/0x40
  Code: d503201f a9025bb5 a90363b7 f90023b9 (d4210000)

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-11-06 16:41:04 +01:00
Matthew Wilcox
fe2b51145c nilfs2: Use xa_erase_irq
This code simply opencoded xa_erase_irq().

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-11-05 14:57:05 -05:00
Linus Torvalds
42bd06e93d This pull request contains updates for UBIFS:
- Full filesystem authentication feature,
   UBIFS is now able to have the whole filesystem structure
   authenticated plus user data encrypted and authenticated.
 - Minor cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAlvaF2IWHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wUb/D/0Z/jN80LtxoIlQzmfoBnVSnaXv
 BDvdDFHTwV+zu4XCvUyJzBnwzNjDxNK2XD5hAgiqCoTk5sr4KUi5+zfft5XMW40w
 T1m5mQNhjwmcI/J/5m2gSHbOSB8Hkc0HIybknS+5ZJDa1OZUkxejLcmpK5Wk+bxp
 Ak1cOn5GIJKRQMrUudhySkQaBe0DnNmHSACePSb5AYGlnRy6eJ26ANR2mU7PFg1V
 NBVbOQjMrYIV9qq9m+vtTNsLXidcaRf474fg7lshodmDBISy9g83Oq8FaPzYTJVJ
 rkvdsRzCrXeApSH2LJ8Gb1AvIAlvJa2Va+anXh8NrSBySfzTKrIPtIONkpF7zxOC
 8naZcRNvTqWcMfaTKGK+SGWUqGlHxdGOo5NkkKrn0jsO6HJ8kYAXKFGx65MsiCLv
 xPlKc543ZLSscw3JJqLXVoXr2hmwhUHMJwwaPngFmdgm88bog62feUgFpYOU/1dj
 1s2+q3jSUqfuS4oInjAmeX/Yq9dss/6dMo73ikbekIGRtijUfCMBWFyINdE0oWPu
 ZUdOOifYrozIG7wWEo6ZzCI1PIyPvYfKcXVMWimPmu9Xi5AnbCDMQmPYVF5YMM0R
 jexN9gVyFQQjz940reFJi0EkIJjwCycWLWft6P6cLDc/rRUUP4ibNYv3JL8WvhHn
 Eb9V6InXhcyuX4eopA==
 =lq2m
 -----END PGP SIGNATURE-----

Merge tag 'tags/upstream-4.20-rc1' of git://git.infradead.org/linux-ubifs

Pull UBIFS updates from Richard Weinberger:

 - Full filesystem authentication feature, UBIFS is now able to have the
   whole filesystem structure authenticated plus user data encrypted and
   authenticated.

 - Minor cleanups

* tag 'tags/upstream-4.20-rc1' of git://git.infradead.org/linux-ubifs: (26 commits)
  ubifs: Remove unneeded semicolon
  Documentation: ubifs: Add authentication whitepaper
  ubifs: Enable authentication support
  ubifs: Do not update inode size in-place in authenticated mode
  ubifs: Add hashes and HMACs to default filesystem
  ubifs: authentication: Authenticate super block node
  ubifs: Create hash for default LPT
  ubfis: authentication: Authenticate master node
  ubifs: authentication: Authenticate LPT
  ubifs: Authenticate replayed journal
  ubifs: Add auth nodes to garbage collector journal head
  ubifs: Add authentication nodes to journal
  ubifs: authentication: Add hashes to index nodes
  ubifs: Add hashes to the tree node cache
  ubifs: Create functions to embed a HMAC in a node
  ubifs: Add helper functions for authentication support
  ubifs: Add separate functions to init/crc a node
  ubifs: Format changes for authentication support
  ubifs: Store read superblock node
  ubifs: Drop write_node
  ...
2018-11-04 14:46:04 -08:00
Linus Torvalds
4710e78940 NFS client bugfixes for Linux 4.20
Highlights include:
 
 Bugfixes:
 - Fix build issues on architectures that don't provide 64-bit cmpxchg
 
 Cleanups:
 - Fix a spelling mistake
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJb3vl/AAoJEA4mA3inWBJc5J0P/1zjDSsf/H4/Pa3aktfgwMds
 Z1clRgBJrqBRodF78ARcNI7OfZroHFYJHQVq+E0HwXbzFj4/YZGfXkKhRYSgCZyT
 uZKCNY42DirHuWR852ukQhdmskD/lWVlI4LIiwOpDpTD7v/GX5hFXpbTkHgKswDP
 G+euxbovzu7IgJP6Ww0XfGCGgBq2H8r0AitF9uSpgVmJOTjpRisodJZy94xvy0e8
 HVo6BxtBVle6N43qymO4cdssgLdAgyL+2NAhb36PL7xEthPMZvUWaPDswjro4Iir
 wAhIYmqcOXD/D8U8DcvkATkcaN9adVpmkznp+aqVE423XQy62k+J7+2d8uWbjBig
 FfdiYTxnL5RZgdSl/1JknHCxI1eEIhqiR1R0bqj50+aHR/QI4lZ7SsHQVV4y1gJL
 b96igefbzLBYKp9UN4fNHsjADvtZS5vCzjm2ep/aESP7gWB/v/UmNmMHe3y7nNnt
 mxd++0O4N6WFEf7GQljbfOtnZZGqmONw3QJV01EHqcVvn65mUkzbGq0CX9+GN17v
 sk4ThqSjHpfyla6Ih+6E9efdWOMTH/Kg+fb9ZXkcwxmde0Wl/dfQCw7iTZTGHifv
 /rmGHHvrM2uNLgWt6eE/MJ2Jb0Aq78eOAtt2zGN+tSJTThOBK20vNAK79CFIhrfj
 lKcjOb0hM+xJAt7Y9MpT
 =O9mS
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Bugfix:
   - Fix build issues on architectures that don't provide 64-bit cmpxchg

  Cleanups:
   - Fix a spelling mistake"

* tag 'nfs-for-4.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: fix spelling mistake, EACCESS -> EACCES
  SUNRPC: Use atomic(64)_t for seq_send(64)
2018-11-04 08:20:09 -08:00
Vasily Averin
ea0abbb648 ext4: add missing brelse() update_backups()'s error path
Fixes: ac27a0ec11 ("ext4: initial copy of files from ext3")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 2.6.19
2018-11-03 17:11:19 -04:00
Vasily Averin
61a9c11e5e ext4: add missing brelse() add_new_gdb_meta_bg()'s error path
Fixes: 01f795f9e0 ("ext4: add online resizing support for meta_bg ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 3.7
2018-11-03 16:50:08 -04:00
Vasily Averin
cea5794122 ext4: add missing brelse() in set_flexbg_block_bitmap()'s error path
Fixes: 33afdcc540 ("ext4: add a function which sets up group blocks ...")
Cc: stable@kernel.org # 3.3
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-11-03 16:22:10 -04:00
Vasily Averin
9e4028935c ext4: avoid potential extra brelse in setup_new_flex_group_blocks()
Currently bh is set to NULL only during first iteration of for cycle,
then this pointer is not cleared after end of using.
Therefore rollback after errors can lead to extra brelse(bh) call,
decrements bh counter and later trigger an unexpected warning in __brelse()

Patch moves brelse() calls in body of cycle to exclude requirement of
brelse() call in rollback.

Fixes: 33afdcc540 ("ext4: add a function which sets up group blocks ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 3.3+
2018-11-03 16:13:17 -04:00
Linus Torvalds
169447287b 3 small fixes, one for stable, one debugging improvmemt, improvements to cifs directio and some minor cleanup
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlvdEHkACgkQiiy9cAdy
 T1ERqgwAoaYkwZhee/73MxC5N2JnBpRjP8Wwu6jo7q89DY35zI8NvLfy48VXNmnk
 8Sp+zPWXxFgFLxBIGIOEwvfrT2kl84lFHprAAkZRKQcz5ZdeqIgoKYWw46Ezglxn
 pyBpGFKNVKVSb0ZbAJgswB0KTCSqW2Hx6mf/lGxx1c2ao1kyXrvN4J2v9bK2UCbV
 KC9xAgkkteypiTsB/afWQMAybDtran1TdOS7rm4aNT697KJ4jOPSQ7Bhh4CxT+UC
 xSFtobCc/v42Si3oq0DNmFNQOdfj40Tg3RjHRHBOmcqN2b+kmGa/4mCUbVsFL08G
 8gpeAHpRDnVwBAQsa6y5LuojOQ0szJzkyM9SZHTyCG3sQRGX3b6L+CYafobzrxnk
 81Z0PtnDNvaoqG/6th/cQX/vz9LoiO5KG2dF1c9z+mK4m4D7JObuxaqcsrpb0uZl
 fWn8AU900QOP/MrciWhj3II+pyYOV3iqKp0cpUgrP1a2P/Qxu3FkXGc4he/lWfKB
 vSXGXf88
 =Ilu6
 -----END PGP SIGNATURE-----

Merge tag '4.20-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes and updates from Steve French:
 "Three small fixes (one Kerberos related, one for stable, and another
  fixes an oops in xfstest 377), two helpful debugging improvements,
  three patches for cifs directio and some minor cleanup"

* tag '4.20-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix signed/unsigned mismatch on aio_read patch
  cifs: don't dereference smb_file_target before null check
  CIFS: Add direct I/O functions to file_operations
  CIFS: Add support for direct I/O write
  CIFS: Add support for direct I/O read
  smb3: missing defines and structs for reparse point handling
  smb3: allow more detailed protocol info on open files for debugging
  smb3: on kerberos mount if server doesn't specify auth type use krb5
  smb3: add trace point for tree connection
  cifs: fix spelling mistake, EACCESS -> EACCES
  cifs: fix return value for cifs_listxattr
2018-11-03 10:45:55 -07:00
Tetsuo Handa
9f2df09a33 bfs: add sanity check at bfs_fill_super()
syzbot is reporting too large memory allocation at bfs_fill_super() [1].
Since file system image is corrupted such that bfs_sb->s_start == 0,
bfs_fill_super() is trying to allocate 8MB of continuous memory. Fix
this by adding a sanity check on bfs_sb->s_start, __GFP_NOWARN and
printf().

[1] https://syzkaller.appspot.com/bug?id=16a87c236b951351374a84c8a32f40edbc034e96

Link: http://lkml.kernel.org/r/1525862104-3407-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+71c6b5d68e91149fc8a4@syzkaller.appspotmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tigran Aivazian <aivazian.tigran@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:38 -07:00
Larry Chen
6194ae4242 ocfs2: fix clusters leak in ocfs2_defrag_extent()
ocfs2_defrag_extent() might leak allocated clusters.  When the file
system has insufficient space, the number of claimed clusters might be
less than the caller wants.  If that happens, the original code might
directly commit the transaction without returning clusters.

This patch is based on code in ocfs2_add_clusters_in_btree().

[akpm@linux-foundation.org: include localalloc.h, reduce scope of data_ac]
Link: http://lkml.kernel.org/r/20180904041621.16874-3-lchen@suse.com
Signed-off-by: Larry Chen <lchen@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Arnd Bergmann
3a3d1e5104 ocfs2: dlmglue: clean up timestamp handling
The handling of timestamps outside of the 1970..2038 range in the dlm
glue is rather inconsistent: on 32-bit architectures, this has always
wrapped around to negative timestamps in the 1902..1969 range, while on
64-bit kernels all timestamps are interpreted as positive 34 bit numbers
in the 1970..2514 year range.

Now that the VFS code handles 64-bit timestamps on all architectures, we
can make the behavior more consistent here, and return the same result
that we had on 64-bit already, making the file system y2038 safe in the
process.  Outside of dlmglue, it already uses 64-bit on-disk timestamps
anway, so that part is fine.

For consistency, I'm changing ocfs2_pack_timespec() to clamp anything
outside of the supported range to the minimum and maximum values.  This
avoids a possible ambiguity of values before 1970 in particular, which
used to be interpreted as times at the end of the 2514 range previously.

Link: http://lkml.kernel.org/r/20180619155826.4106487-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Changwei Ge
cf76c78595 ocfs2: don't put and assigning null to bh allocated outside
ocfs2_read_blocks() and ocfs2_read_blocks_sync() are both used to read
several blocks from disk.  Currently, the input argument *bhs* can be
NULL or NOT.  It depends on the caller's behavior.  If the function
fails in reading blocks from disk, the corresponding bh will be assigned
to NULL and put.

Obviously, above process for non-NULL input bh is not appropriate.
Because the caller doesn't even know its bhs are put and re-assigned.

If buffer head is managed by caller, ocfs2_read_blocks and
ocfs2_read_blocks_sync() should not evaluate it to NULL.  It will cause
caller accessing illegal memory, thus crash.

Link: http://lkml.kernel.org/r/HK2PR06MB045285E0F4FBB561F9F2F9B3D5680@HK2PR06MB0452.apcprd06.prod.outlook.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Guozhonghua <guozhonghua@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Changwei Ge
29aa30167a ocfs2: fix a misuse a of brelse after failing ocfs2_check_dir_entry
Somehow, file system metadata was corrupted, which causes
ocfs2_check_dir_entry() to fail in function ocfs2_dir_foreach_blk_el().

According to the original design intention, if above happens we should
skip the problematic block and continue to retrieve dir entry.  But
there is obviouse misuse of brelse around related code.

After failure of ocfs2_check_dir_entry(), current code just moves to
next position and uses the problematic buffer head again and again
during which the problematic buffer head is released for multiple times.
I suppose, this a serious issue which is long-lived in ocfs2.  This may
cause other file systems which is also used in a the same host insane.

So we should also consider about bakcporting this patch into linux
-stable.

Link: http://lkml.kernel.org/r/HK2PR06MB045211675B43EED794E597B6D56E0@HK2PR06MB0452.apcprd06.prod.outlook.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Suggested-by: Changkuo Shi <shi.changkuo@h3c.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Changwei Ge
9e98578775 ocfs2: don't use iocb when EIOCBQUEUED returns
When -EIOCBQUEUED returns, it means that aio_complete() will be called
from dio_complete(), which is an asynchronous progress against
write_iter.  Generally, IO is a very slow progress than executing
instruction, but we still can't take the risk to access a freed iocb.

And we do face a BUG crash issue.  Using the crash tool, iocb is
obviously freed already.

  crash> struct -x kiocb ffff881a350f5900
  struct kiocb {
    ki_filp = 0xffff881a350f5a80,
    ki_pos = 0x0,
    ki_complete = 0x0,
    private = 0x0,
    ki_flags = 0x0
  }

And the backtrace shows:
  ocfs2_file_write_iter+0xcaa/0xd00 [ocfs2]
  aio_run_iocb+0x229/0x2f0
  do_io_submit+0x291/0x540
  SyS_io_submit+0x10/0x20
  system_call_fastpath+0x16/0x75

Link: http://lkml.kernel.org/r/1523361653-14439-1-git-send-email-ge.changwei@h3c.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Guozhonghua
21158ca85b ocfs2: without quota support, avoid calling quota recovery
During one dead node's recovery by other node, quota recovery work will
be queued.  We should avoid calling quota when it is not supported, so
check the quota flags.

Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA401071AC9FB@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: guozhonghua <guozhonghua@h3c.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Gang He
a634644751 ocfs2: remove ocfs2_is_o2cb_active()
Remove ocfs2_is_o2cb_active().  We have similar functions to identify
which cluster stack is being used via osb->osb_cluster_stack.

Secondly, the current implementation of ocfs2_is_o2cb_active() is not
totally safe.  Based on the design of stackglue, we need to get
ocfs2_stack_lock before using ocfs2_stack related data structures, and
that active_stack pointer can be NULL in the case of mount failure.

Link: http://lkml.kernel.org/r/1495441079-11708-1-git-send-email-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Reviewed-by: Eric Ren <zren@suse.com>
Acked-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-03 10:09:37 -07:00
Steve French
b98e26df07 cifs: fix signed/unsigned mismatch on aio_read patch
The patch "CIFS: Add support for direct I/O read" had
a signed/unsigned mismatch (ssize_t vs. size_t) in the
return from one function.  Similar trivial change
in aio_write

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
2018-11-02 14:09:42 -05:00
Colin Ian King
8c6c9bed87 cifs: don't dereference smb_file_target before null check
There is a null check on dst_file->private data which suggests
it can be potentially null. However, before this check, pointer
smb_file_target is derived from dst_file->private and dereferenced
in the call to tlink_tcon, hence there is a potential null pointer
deference.

Fix this by assigning smb_file_target and target_tcon after the
null pointer sanity checks.

Detected by CoverityScan, CID#1475302 ("Dereference before null check")

Fixes: 04b38d6012 ("vfs: pull btrfs clone API to vfs layer")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-11-02 14:09:42 -05:00
Long Li
be4eb68846 CIFS: Add direct I/O functions to file_operations
With direct read/write functions implemented, add them to file_operations.

Dircet I/O is used under two conditions:
1. When mounting with "cache=none", CIFS uses direct I/O for all user file
data transfer.
2. When opening a file with O_DIRECT, CIFS uses direct I/O for all data
transfer on this file.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-11-02 14:09:42 -05:00
Long Li
8c5f9c1ab7 CIFS: Add support for direct I/O write
With direct I/O write, user supplied buffers are pinned to the memory and data
are transferred directly from user buffers to the transport layer.

Change in v3: add support for kernel AIO

Change in v4:
Refactor common write code to __cifs_writev for direct and non-direct I/O.
Retry on direct I/O failure.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-11-02 14:09:42 -05:00
Long Li
6e6e2b86c2 CIFS: Add support for direct I/O read
With direct I/O read, we transfer the data directly from transport layer to
the user data buffer.

Change in v3: add support for kernel AIO

Change in v4:
Refactor common read code to __cifs_readv for direct and non-direct I/O.
Retry on direct I/O failure.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-11-02 14:09:41 -05:00
Steve French
0df444a00f smb3: missing defines and structs for reparse point handling
We were missing some structs from MS-FSCC relating to
reparse point handling.  Add them to protocol defines
in smb2pdu.h

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-11-02 14:09:41 -05:00
Steve French
dfe33f9abc smb3: allow more detailed protocol info on open files for debugging
In order to debug complex problems it is often helpful to
have detailed information on the client and server view
of the open file information.  Add the ability for root to
view the list of smb3 open files and dump the persistent
handle and other info so that it can be more easily
correlated with server logs.

Sample output from "cat /proc/fs/cifs/open_files"

 # Version:1
 # Format:
 # <tree id> <persistent fid> <flags> <count> <pid> <uid> <filename> <mid>
 0x5 0x800000378 0x8000 1 7704 0 some-file 0x14
 0xcb903c0c 0x84412e67 0x8000 1 7754 1001 rofile 0x1a6d
 0xcb903c0c 0x9526b767 0x8000 1 7720 1000 file 0x1a5b
 0xcb903c0c 0x9ce41a21 0x8000 1 7715 0 smallfile 0xd67

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-11-02 14:09:41 -05:00
Steve French
926674de67 smb3: on kerberos mount if server doesn't specify auth type use krb5
Some servers (e.g. Azure) do not include a spnego blob in the SMB3
negotiate protocol response, so on kerberos mounts ("sec=krb5")
we can fail, as we expected the server to list its supported
auth types (OIDs in the spnego blob in the negprot response).
Change this so that on krb5 mounts we default to trying krb5 if the
server doesn't list its supported protocol mechanisms.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
2018-11-02 14:09:41 -05:00
Steve French
f8af49dd17 smb3: add trace point for tree connection
In debugging certain scenarios, especially reconnect cases,
it can be helpful to have a dynamic trace point for the
result of tree connect.  See sample output below
from a reconnect event. The new event is 'smb3_tcon'

            TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
               | |       |   ||||       |         |
           cifsd-6071  [001] ....  2659.897923: smb3_reconnect: server=localhost current_mid=0xa
     kworker/1:1-71    [001] ....  2666.026342: smb3_cmd_done: 	sid=0x0 tid=0x0 cmd=0 mid=0
     kworker/1:1-71    [001] ....  2666.026576: smb3_cmd_err: 	sid=0xc49e1787 tid=0x0 cmd=1 mid=1 status=0xc0000016 rc=-5
     kworker/1:1-71    [001] ....  2666.031677: smb3_cmd_done: 	sid=0xc49e1787 tid=0x0 cmd=1 mid=2
     kworker/1:1-71    [001] ....  2666.031921: smb3_cmd_done: 	sid=0xc49e1787 tid=0x6e78f05f cmd=3 mid=3
     kworker/1:1-71    [001] ....  2666.031923: smb3_tcon: xid=0 sid=0xc49e1787 tid=0x0 unc_name=\\localhost\test rc=0
     kworker/1:1-71    [001] ....  2666.032097: smb3_cmd_done: 	sid=0xc49e1787 tid=0x6e78f05f cmd=11 mid=4
     kworker/1:1-71    [001] ....  2666.032265: smb3_cmd_done: 	sid=0xc49e1787 tid=0x7912332f cmd=3 mid=5
     kworker/1:1-71    [001] ....  2666.032266: smb3_tcon: xid=0 sid=0xc49e1787 tid=0x0 unc_name=\\localhost\IPC$ rc=0
     kworker/1:1-71    [001] ....  2666.032386: smb3_cmd_done: 	sid=0xc49e1787 tid=0x7912332f cmd=11 mid=6

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-11-02 14:09:41 -05:00
Colin Ian King
413d610081 cifs: fix spelling mistake, EACCESS -> EACCES
Trivial fix to a spelling mistake of the error access name EACCESS,
rename to EACCES

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-11-02 14:09:41 -05:00
Ronnie Sahlberg
0c5d6cb664 cifs: fix return value for cifs_listxattr
If the application buffer was too small to fit all the names
we would still count the number of bytes and return this for
listxattr. This would then trigger a BUG in usercopy.c

Fix the computation of the size so that we return -ERANGE
correctly when the buffer is too small.

This fixes the kernel BUG for xfstest generic/377

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-11-02 14:09:41 -05:00
Linus Torvalds
5f21585384 for-linus-20181102
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvchGgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpj/1D/4kEQx4ncnFoZk8QshHV1L++rH3BbcLjQDd
 Wbh9ZSIQdI/gHTzS6bE7x3YfcbpMWPMO3+jFawdfRiFTEjlF8vQ+mnJ+Btb3z4D6
 mGEeFGVhHExlp2a0x/Ma8YWVNlMB7BE8Tq73bZEVMY+9lbpmDW/vp7Sfa87LBDKQ
 ZmY+My+VdHN7qLtQ7t3W/HtpbU+kcXMMd3ICjK4i+ofXy6mynk4+oQ2jwyXc5L86
 UCJCsTsSRr3CgbnkW/uprHo0XHk8i7O/4C3oR+x4pAIxCCa9g+vmw0EO9fvi/2iQ
 qe8jKdm7Y09xu/TiPBa7iz45tdh0cNMJKo3OezmSF9Np+r69KL5C/U4GRPKN3Iwm
 keoqn14ScABkYMSe4ys1AdEgKD6bNUaW3r/lJxTH2oUR23mjnCLp7c4WD/G+MlbB
 CzoakQyCHTZmDFLr2Kc8bkjmpil2T2UFfmLIDAu30LWIYeSGpiIO/V+g1foJMF2f
 06ERltNvgX1BJjoh4NSWySLEf1ZtkUU60NeATRol6gwhnIyLrHsgfm6OEhqlW/7x
 Xc1BWyzX7K6c3Dskk/u5aSRyXOyRC9KkMt3/2XexeDNHkte9yMH0IgSvopPBuER8
 +iPvPjNp7ychTKZB3zpSnlqGgePTjbufIEBtO3OyUmDZKjUqxahtxkQfmPhoclu+
 XdR4ArcqNg==
 =0zM4
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20181102' of git://git.kernel.dk/linux-block

Pull block layer fixes from Jens Axboe:
 "The biggest part of this pull request is the revert of the blkcg
  cleanup series. It had one fix earlier for a stacked device issue, but
  another one was reported. Rather than play whack-a-mole with this,
  revert the entire series and try again for the next kernel release.

  Apart from that, only small fixes/changes.

  Summary:

   - Indentation fixup for mtip32xx (Colin Ian King)

   - The blkcg cleanup series revert (Dennis Zhou)

   - Two NVMe fixes. One fixing a regression in the nvme request
     initialization in this merge window, causing nvme-fc to not work.
     The other is a suspend/resume p2p resource issue (James, Keith)

   - Fix sg discard merge, allowing us to merge in cases where we didn't
     before (Jianchao Wang)

   - Call rq_qos_exit() after the queue is frozen, preventing a hang
     (Ming)

   - Fix brd queue setup, fixing an oops if we fail setting up all
     devices (Ming)"

* tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
  nvme-pci: fix conflicting p2p resource adds
  nvme-fc: fix request private initialization
  blkcg: revert blkcg cleanups series
  block: brd: associate with queue until adding disk
  block: call rq_qos_exit() after queue is frozen
  mtip32xx: clean an indentation issue, remove extraneous tabs
  block: fix the DISCARD request merge
2018-11-02 11:25:48 -07:00
Linus Torvalds
c2aa1a444c vfs: rework data cloning infrastructure
Rework the vfs_clone_file_range and vfs_dedupe_file_range infrastructure to use
 a common .remap_file_range method and supply generic bounds and sanity checking
 functions that are shared with the data write path. The current VFS
 infrastructure has problems with rlimit, LFS file sizes, file time stamps,
 maximum filesystem file sizes, stripping setuid bits, etc and so they are
 addressed in these commits.
 
 We also introduce the ability for the ->remap_file_range methods to return short
 clones so that clones for vfs_copy_file_range() don't get rejected if the entire
 range can't be cloned. It also allows filesystems to sliently skip deduplication
 of partial EOF blocks if they are not capable of doing so without requiring
 errors to be thrown to userspace.
 
 All existing filesystems are converted to user the new .remap_file_range method,
 and both XFS and ocfs2 are modified to make use of the new generic checking
 infrastructure.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJb29gEAAoJEK3oKUf0dfodpOAQAL2VbHjvKXEwNMDTKscSRMmZ
 Z0xXo3gamFKQ+VGOqy2g2lmAYQs9SAnTuCGTJ7zIAp7u+q8gzUy5FzKAwLS4Id6L
 8siaY6nzlicfO04d0MdXnWz0f3xykChgzfdQfVUlUi7WrDioBUECLPmx4a+USsp1
 DQGjLOZfoOAmn2rijdnH9RTEaHqg+8mcTaLN9TRav4gGqrWxldFKXw2y6ouFC7uo
 /hxTRNXR9VI+EdbDelwBNXl9nU9gQA0WLOvRKwgUrtv6bSJohTPsmXt7EbBtNcVR
 cl3zDNc1sLD1bLaRLEUAszI/33wXaaQgom1iB51obIcHHef+JxRNG/j6rUMfzxZI
 VaauGv5EIvtaKN0LTAqVVLQ8t2MQFYfOr8TykmO+1UFog204aKRANdVMHDSjxD/0
 dTGKJGcq+HnKQ+JHDbTdvuXEL8sUUl1FiLjOQbZPw63XmuddLKFUA2TOjXn6htbU
 1h1MG5d9KjGLpabp2BQheczD08NuSmcrOBNt7IoeI3+nxr3HpMwprfB9TyaERy9X
 iEgyVXmjjc9bLLRW7A2wm77aW64NvPs51wKMnvuNgNwnCewrGS6cB8WVj2zbQjH1
 h3f3nku44s9ctNPSBzb/sJLnpqmZQ5t0oSmrMSN+5+En6rNTacoJCzxHRJBA7z/h
 Z+C6y1GTZw0euY6Zjiwu
 =CE/A
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull vfs dedup fixes from Dave Chinner:
 "This reworks the vfs data cloning infrastructure.

  We discovered many issues with these interfaces late in the 4.19 cycle
  - the worst of them (data corruption, setuid stripping) were fixed for
  XFS in 4.19-rc8, but a larger rework of the infrastructure fixing all
  the problems was needed. That rework is the contents of this pull
  request.

  Rework the vfs_clone_file_range and vfs_dedupe_file_range
  infrastructure to use a common .remap_file_range method and supply
  generic bounds and sanity checking functions that are shared with the
  data write path. The current VFS infrastructure has problems with
  rlimit, LFS file sizes, file time stamps, maximum filesystem file
  sizes, stripping setuid bits, etc and so they are addressed in these
  commits.

  We also introduce the ability for the ->remap_file_range methods to
  return short clones so that clones for vfs_copy_file_range() don't get
  rejected if the entire range can't be cloned. It also allows
  filesystems to sliently skip deduplication of partial EOF blocks if
  they are not capable of doing so without requiring errors to be thrown
  to userspace.

  Existing filesystems are converted to user the new remap_file_range
  method, and both XFS and ocfs2 are modified to make use of the new
  generic checking infrastructure"

* tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits)
  xfs: remove [cm]time update from reflink calls
  xfs: remove xfs_reflink_remap_range
  xfs: remove redundant remap partial EOF block checks
  xfs: support returning partial reflink results
  xfs: clean up xfs_reflink_remap_blocks call site
  xfs: fix pagecache truncation prior to reflink
  ocfs2: remove ocfs2_reflink_remap_range
  ocfs2: support partial clone range and dedupe range
  ocfs2: fix pagecache truncation prior to reflink
  ocfs2: truncate page cache for clone destination file before remapping
  vfs: clean up generic_remap_file_range_prep return value
  vfs: hide file range comparison function
  vfs: enable remap callers that can handle short operations
  vfs: plumb remap flags through the vfs dedupe functions
  vfs: plumb remap flags through the vfs clone functions
  vfs: make remap_file_range functions take and return bytes completed
  vfs: remap helper should update destination inode metadata
  vfs: pass remap flags to generic_remap_checks
  vfs: pass remap flags to generic_remap_file_range_prep
  vfs: combine the clone and dedupe into a single remap_file_range
  ...
2018-11-02 09:33:08 -07:00
Linus Torvalds
8adcc59974 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "No common topic, really - a handful of assorted stuff; the least
  trivial bits are Mark's dedupe patches"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/exofs: only use true/false for asignment of bool type variable
  fs/exofs: fix potential memory leak in mount option parsing
  Delete invalid assignment statements in do_sendfile
  iomap: remove duplicated include from iomap.c
  vfs: dedupe should return EPERM if permission is not granted
  vfs: allow dedupe of user owned read-only files
  ntfs: don't open-code ERR_CAST
  ext4: don't open-code ERR_CAST
2018-11-01 20:19:49 -07:00
Linus Torvalds
9931a07d51 Merge branch 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull AFS updates from Al Viro:
 "AFS series, with some iov_iter bits included"

* 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
  missing bits of "iov_iter: Separate type from direction and use accessor functions"
  afs: Probe multiple fileservers simultaneously
  afs: Fix callback handling
  afs: Eliminate the address pointer from the address list cursor
  afs: Allow dumping of server cursor on operation failure
  afs: Implement YFS support in the fs client
  afs: Expand data structure fields to support YFS
  afs: Get the target vnode in afs_rmdir() and get a callback on it
  afs: Calc callback expiry in op reply delivery
  afs: Fix FS.FetchStatus delivery from updating wrong vnode
  afs: Implement the YFS cache manager service
  afs: Remove callback details from afs_callback_break struct
  afs: Commit the status on a new file/dir/symlink
  afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
  afs: Don't invoke the server to read data beyond EOF
  afs: Add a couple of tracepoints to log I/O errors
  afs: Handle EIO from delivery function
  afs: Fix TTL on VL server and address lists
  afs: Implement VL server rotation
  afs: Improve FS server rotation error handling
  ...
2018-11-01 19:58:52 -07:00
Dennis Zhou
b5f2954d30 blkcg: revert blkcg cleanups series
This reverts a series committed earlier due to null pointer exception
bug report in [1]. It seems there are edge case interactions that I did
not consider and will need some time to understand what causes the
adverse interactions.

The original series can be found in [2] with a follow up series in [3].

[1] https://www.spinics.net/lists/cgroups/msg20719.html
[2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
[3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

This reverts the following commits:
d459d853c2, b2c3fa5467, 101246ec02, b3b9f24f5f, e2b0989954,
f0fcb3ec89, c839e7a03f, bdc2491708, 74b7c02a9b, 5bf9a1f3b4,
a7b39b4e96, 07b05bcc32, 49f4c2dc2b, 27e6fa996c

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-01 19:59:53 -06:00
Linus Torvalds
e468f5c06b The Compiler Attributes series
This is an effort to disentangle the include/linux/compiler*.h headers
 and bring them up to date.
 
 The main idea behind the series is to use feature checking macros
 (i.e. __has_attribute) instead of compiler version checks (e.g. GCC_VERSION),
 which are compiler-agnostic (so they can be shared, reducing the size
 of compiler-specific headers) and version-agnostic.
 
 Other related improvements have been performed in the headers as well,
 which on top of the use of __has_attribute it has amounted to a significant
 simplification of these headers (e.g. GCC_VERSION is now only guarding
 a few non-attribute macros).
 
 This series should also help the efforts to support compiling the kernel
 with clang and icc. A fair amount of documentation and comments have also
 been added, clarified or removed; and the headers are now more readable,
 which should help kernel developers in general.
 
 The series was triggered due to the move to gcc >= 4.6. In turn, this series
 has also triggered Sparse to gain the ability to recognize __has_attribute
 on its own.
 
 Finally, the __nonstring variable attribute series has been also applied
 on top; plus two related patches from Nick Desaulniers for unreachable()
 that came a bit afterwards.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPjU5OPd5QIZ9jqqOGXyLc2htIW0FAlvNpywACgkQGXyLc2ht
 IW1aiQ/+P8SJOa3GkiH37/nrIbk/wgMNytbs+gxE5YPaU1DP74Mn1prJ4XhQQic9
 /mt8GnitZwzEHWdsGEUk+ZQwnIa7ZEAmpecbAF206AMRbNxa14T5YwBx4bqWFjZp
 sP4zPTHt3JCKL8TM+z26o152UbF2kc4WSxHjEjSFaqEnR2E5D0MwFeGPzc8fgWmS
 pNyn3CidzB0TS1UF008YXhiJO6HIhFNPyhPawlhwbbdsdlhZ4u0JmwfqP4EvjRFM
 kyzdQ9CDe+AgTTD9Y8HhtoUClaa7SJzFWNzpKIJMWt8jpKWYZQ/+WtwKg2cf+v3M
 uwktcs3RI1dYrjcITLz4VJ0oVaRFnyGgXvMP4yqWQx429hqnd09WXhMioXQ1htoI
 H0vpPIAPsK+dqVA9sP3JzMq4h6+dE7P364lkbThbVpYAGKZ52qaLt9ixT1mw1Q9f
 a683ji6o02IVOGUNZ/3KAb5MqdhewNEDdZILZYRfm4AL1Em3WW9QVtIosHPviLgc
 16VjA02wKdxIcg+1LZMTNhfybztnSCf7SuQurpH1zEqFDGzrXwB7nYFplEY7DrrD
 cqhOA1fMQa++oQR+D40QDoY2ybqPOyvJG7z17pvtt+6jXep4yy2a3Bxf+ClK0nto
 5yT7v9ikXJr84FOkk7OvktLlAWvcykvAdfvDepBZhpqhuX82tHY=
 =Y8WB
 -----END PGP SIGNATURE-----

Merge tag 'compiler-attributes-for-linus-4.20-rc1' of https://github.com/ojeda/linux

Pull compiler attribute updates from Miguel Ojeda:
 "This is an effort to disentangle the include/linux/compiler*.h headers
  and bring them up to date.

  The main idea behind the series is to use feature checking macros
  (i.e. __has_attribute) instead of compiler version checks (e.g.
  GCC_VERSION), which are compiler-agnostic (so they can be shared,
  reducing the size of compiler-specific headers) and version-agnostic.

  Other related improvements have been performed in the headers as well,
  which on top of the use of __has_attribute it has amounted to a
  significant simplification of these headers (e.g. GCC_VERSION is now
  only guarding a few non-attribute macros).

  This series should also help the efforts to support compiling the
  kernel with clang and icc. A fair amount of documentation and comments
  have also been added, clarified or removed; and the headers are now
  more readable, which should help kernel developers in general.

  The series was triggered due to the move to gcc >= 4.6. In turn, this
  series has also triggered Sparse to gain the ability to recognize
  __has_attribute on its own.

  Finally, the __nonstring variable attribute series has been also
  applied on top; plus two related patches from Nick Desaulniers for
  unreachable() that came a bit afterwards"

* tag 'compiler-attributes-for-linus-4.20-rc1' of https://github.com/ojeda/linux:
  compiler-gcc: remove comment about gcc 4.5 from unreachable()
  compiler.h: update definition of unreachable()
  Compiler Attributes: ext4: remove local __nonstring definition
  Compiler Attributes: auxdisplay: panel: use __nonstring
  Compiler Attributes: enable -Wstringop-truncation on W=1 (gcc >= 8)
  Compiler Attributes: add support for __nonstring (gcc >= 8)
  Compiler Attributes: add MAINTAINERS entry
  Compiler Attributes: add Doc/process/programming-language.rst
  Compiler Attributes: remove uses of __attribute__ from compiler.h
  Compiler Attributes: KENTRY used twice the "used" attribute
  Compiler Attributes: use feature checks instead of version checks
  Compiler Attributes: add missing SPDX ID in compiler_types.h
  Compiler Attributes: remove unneeded sparse (__CHECKER__) tests
  Compiler Attributes: homogenize __must_be_array
  Compiler Attributes: remove unneeded tests
  Compiler Attributes: always use the extra-underscores syntax
  Compiler Attributes: remove unused attributes
2018-11-01 18:34:46 -07:00
Al Viro
78a63f1235 Merge tag 'nfs-for-4.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
backmerge to do fixup of iov_iter_kvec() conflict
2018-11-01 18:17:23 -04:00
Linus Torvalds
7260935d71 overlayfs update for 4.20
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW9tlWwAKCRDh3BK/laaZ
 PEszAQDnwCMuh4WwhmS4d4X9/rLwzxBqFcHbBUMxvUSpD2LSvQD/fbV3EVkcTGUc
 DVfV2e9Zdy2vq36fcR3EMa1oUGNzpgc=
 =oscW
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs updates from Miklos Szeredi:
 "A mix of fixes and cleanups"

* tag 'ovl-update-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: automatically enable redirect_dir on metacopy=on
  ovl: check whiteout in ovl_create_over_whiteout()
  ovl: using posix_acl_xattr_size() to get size instead of posix_acl_to_xattr()
  ovl: abstract ovl_inode lock with a helper
  ovl: remove the 'locked' argument of ovl_nlink_{start,end}
  ovl: relax requirement for non null uuid of lower fs
  ovl: fold copy-up helpers into callers
  ovl: untangle copy up call chain
  ovl: relax permission checking on underlying layers
  ovl: fix recursive oi->lock in ovl_link()
  vfs: fix FIGETBSZ ioctl on an overlayfs file
  ovl: clean up error handling in ovl_get_tmpfile()
  ovl: fix error handling in ovl_verify_set_fh()
2018-11-01 14:48:48 -07:00
Miklos Szeredi
d47748e5ae ovl: automatically enable redirect_dir on metacopy=on
Current behavior is to automatically disable metacopy if redirect_dir is
not enabled and proceed with the mount.

If "metacopy=on" mount option was given, then this behavior can confuse the
user: no mount failure, yet metacopy is disabled.

This patch makes metacopy=on imply redirect_dir=on.

The converse is also true: turning off full redirect with redirect_dir=
{off|follow|nofollow} will disable metacopy.

If both metacopy=on and redirect_dir={off|follow|nofollow} is specified,
then mount will fail, since there's no way to correctly resolve the
conflict.

Reported-by: Daniel Walsh <dwalsh@redhat.com>
Fixes: d5791044d2 ("ovl: Provide a mount option metacopy=on/off...")
Cc: <stable@vger.kernel.org> # v4.19
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-11-01 21:31:39 +01:00
Linus Torvalds
2d6bb6adb7 New gcc plugin: stackleak
- Introduces the stackleak gcc plugin ported from grsecurity by Alexander
   Popov, with x86 and arm64 support.
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlvQvn4WHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJpSfD/sErFreuPT1beSw994Lr9Zx4k9v
 ERsuXxWBENaJOJXbOOHMfVEcEeG/1uhPSp7hlw/dpHfh0anATTrcYqm8RNKbfK+k
 o06+JK14OJfpm5Ghq/7OizhdNLCMT8wMU3XZtWfy65VSJGjEFx8Y48vMeQtpWtUK
 ylSzi9JV6j2iUBF9oibtiT53+yqsqAtX80X1G7HRCgv9kxuKMhZr+Q5oGV6+ViyQ
 Azj8mNn06iRnhHKd17WxDJr0GjSibzz4weS/9XgP3t3EcNWJo1EgBlD2KV3tOfP5
 nzmqfqTqrcjxs/tyjdh6vVCSlYucNtyCQGn63qyShQYSg6mZwclR2fY8YSTw6PWw
 GfYWFOWru9z+qyQmwFkQ9bSQS2R+JIT0oBCj9VmtF9XmPCy7K2neJsQclzSPBiCW
 wPgXVQS4IA4684O5CmDOVMwmDpGvhdBNUR6cqSzGLxQOHY1csyXubMNUsqU3g9xk
 Ob4pEy/xrrIw4WpwHcLHSEW5gV1/OLhsT0fGRJJiC947L3cN5s9EZp7FLbIS0zlk
 qzaXUcLmn6AgcfkYwg5cI3RMLaN2V0eDCMVTWZJ1wbrmUV9chAaOnTPTjNqLOTht
 v3b1TTxXG4iCpMmOFf59F8pqgAwbBDlfyNSbySZ/Pq5QH69udz3Z9pIUlYQnSJHk
 u6q++2ReDpJXF81rBw==
 =Ks6B
 -----END PGP SIGNATURE-----

Merge tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull stackleak gcc plugin from Kees Cook:
 "Please pull this new GCC plugin, stackleak, for v4.20-rc1. This plugin
  was ported from grsecurity by Alexander Popov. It provides efficient
  stack content poisoning at syscall exit. This creates a defense
  against at least two classes of flaws:

   - Uninitialized stack usage. (We continue to work on improving the
     compiler to do this in other ways: e.g. unconditional zero init was
     proposed to GCC and Clang, and more plugin work has started too).

   - Stack content exposure. By greatly reducing the lifetime of valid
     stack contents, exposures via either direct read bugs or unknown
     cache side-channels become much more difficult to exploit. This
     complements the existing buddy and heap poisoning options, but
     provides the coverage for stacks.

  The x86 hooks are included in this series (which have been reviewed by
  Ingo, Dave Hansen, and Thomas Gleixner). The arm64 hooks have already
  been merged through the arm64 tree (written by Laura Abbott and
  reviewed by Mark Rutland and Will Deacon).

  With VLAs having been removed this release, there is no need for
  alloca() protection, so it has been removed from the plugin"

* tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  arm64: Drop unneeded stackleak_check_alloca()
  stackleak: Allow runtime disabling of kernel stack erasing
  doc: self-protection: Add information about STACKLEAK feature
  fs/proc: Show STACKLEAK metrics in the /proc file system
  lkdtm: Add a test for STACKLEAK
  gcc-plugins: Add STACKLEAK plugin for tracking the kernel stack
  x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls
2018-11-01 11:46:27 -07:00
Colin Ian King
d3787af289 NFS: fix spelling mistake, EACCESS -> EACCES
Trivial fix to a spelling mistake of the error access name EACCESS,
rename to EACCES

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-11-01 14:07:06 -04:00
Linus Torvalds
9b5cf826ef fuse update for 4.20
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW9m8rQAKCRDh3BK/laaZ
 POTeAP9DthScqnVxrRiyvORwffjTLOCijY4yAatgxTU5MO6TQgD/eeO62Exq5Cij
 4uXCSNIzPVPKiimunVKYoDM8KmcNtAQ=
 =z92F
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:
 "As well as the usual bug fixes, this adds the following new features:

   - cached readdir and readlink

   - max I/O size increased from 128k to 1M

   - improved performance and scalability of request queues

   - copy_file_range support

  The only non-fuse bits are trivial cleanups of macros in
  <linux/bitops.h>"

* tag 'fuse-update-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (31 commits)
  fuse: enable caching of symlinks
  fuse: only invalidate atime in direct read
  fuse: don't need GETATTR after every READ
  fuse: allow fine grained attr cache invaldation
  bitops: protect variables in bit_clear_unless() macro
  bitops: protect variables in set_mask_bits() macro
  fuse: realloc page array
  fuse: add max_pages to init_out
  fuse: allocate page array more efficiently
  fuse: reduce size of struct fuse_inode
  fuse: use iversion for readdir cache verification
  fuse: use mtime for readdir cache verification
  fuse: add readdir cache version
  fuse: allow using readdir cache
  fuse: allow caching readdir
  fuse: extract fuse_emit() helper
  fuse: add FOPEN_CACHE_DIR
  fuse: split out readdir.c
  fuse: Use hash table to link processing request
  fuse: kill req->intr_unique
  ...
2018-10-31 14:50:02 -07:00
Linus Torvalds
31990f0f53 The highlights are:
- a series that fixes some old memory allocation issues in libceph
   (myself).  We no longer allocate memory in places where allocation
   failures cannot be handled and BUG when the allocation fails.
 
 - support for copy_file_range() syscall (Luis Henriques).  If size and
   alignment conditions are met, it leverages RADOS copy-from operation.
   Otherwise, a local copy is performed.
 
 - a patch that reduces memory requirement of ceph_sync_read() from the
   size of the entire read to the size of one object (Zheng Yan).
 
 - fallocate() syscall is now restricted to FALLOC_FL_PUNCH_HOLE (Luis
   Henriques)
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlvZ6AcTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi8H+B/9V/QB1BX5Q2DvkS3mcLNI2NphrppaD
 VBuviwoIzaBm1paCrx40J/pCtsK1Fybl5dBAh1W0SDxEGR8JUA8GJw+oemtOS6pZ
 DwjOF9S7uhzf5M3nQ9SvAbIudBISMZQRi22Y8fWs3k+yaECIz1J/pe7RiKo/GBAB
 NnlbrZ1AYSB02chchVCSmWTApeIRp9JXnaM9xLMJWGVLL/vONjt3ltJ/w9haGYz8
 FPFLPFeWobWqFElnOUomxU8Cv84DgPtH8si0UAn16jveractpFJWO4X6LDs/ZYDk
 /MccfsB3EK9BCJdLJMoI0/lXxE33z3/MehmJDs9xGSX/N4N7UTF8Ve1b
 =U91e
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.20-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The highlights are:

   - a series that fixes some old memory allocation issues in libceph
     (myself). We no longer allocate memory in places where allocation
     failures cannot be handled and BUG when the allocation fails.

   - support for copy_file_range() syscall (Luis Henriques). If size and
     alignment conditions are met, it leverages RADOS copy-from
     operation. Otherwise, a local copy is performed.

   - a patch that reduces memory requirement of ceph_sync_read() from
     the size of the entire read to the size of one object (Zheng Yan).

   - fallocate() syscall is now restricted to FALLOC_FL_PUNCH_HOLE (Luis
     Henriques)"

* tag 'ceph-for-4.20-rc1' of git://github.com/ceph/ceph-client: (25 commits)
  ceph: new mount option to disable usage of copy-from op
  ceph: support copy_file_range file operation
  libceph: support the RADOS copy-from operation
  ceph: add non-blocking parameter to ceph_try_get_caps()
  libceph: check reply num_data_items in setup_request_data()
  libceph: preallocate message data items
  libceph, rbd, ceph: move ceph_osdc_alloc_messages() calls
  libceph: introduce alloc_watch_request()
  libceph: assign cookies in linger_submit()
  libceph: enable fallback to ceph_msg_new() in ceph_msgpool_get()
  ceph: num_ops is off by one in ceph_aio_retry_work()
  libceph: no need to call osd_req_opcode_valid() in osd_req_encode_op()
  ceph: set timeout conditionally in __cap_delay_requeue
  libceph: don't consume a ref on pagelist in ceph_msg_data_add_pagelist()
  libceph: introduce ceph_pagelist_alloc()
  libceph: osd_req_op_cls_init() doesn't need to take opcode
  libceph: bump CEPH_MSG_MAX_DATA_LEN
  ceph: only allow punch hole mode in fallocate
  ceph: refactor ceph_sync_read()
  ceph: check if LOOKUPNAME request was aborted when filling trace
  ...
2018-10-31 14:42:31 -07:00
Linus Torvalds
59fc453b21 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - the rest of MM

 - lib/bitmap updates

 - hfs updates

 - fatfs updates

 - various other misc things

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
  mm/gup.c: fix __get_user_pages_fast() comment
  mm: Fix warning in insert_pfn()
  memory-hotplug.rst: add some details about locking internals
  powerpc/powernv: hold device_hotplug_lock when calling memtrace_offline_pages()
  powerpc/powernv: hold device_hotplug_lock when calling device_online()
  mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
  mm/memory_hotplug: make add_memory() take the device_hotplug_lock
  mm/memory_hotplug: make remove_memory() take the device_hotplug_lock
  mm/memblock.c: warn if zero alignment was requested
  memblock: stop using implicit alignment to SMP_CACHE_BYTES
  docs/boot-time-mm: remove bootmem documentation
  mm: remove include/linux/bootmem.h
  memblock: replace BOOTMEM_ALLOC_* with MEMBLOCK variants
  mm: remove nobootmem
  memblock: rename __free_pages_bootmem to memblock_free_pages
  memblock: rename free_all_bootmem to memblock_free_all
  memblock: replace free_bootmem_late with memblock_free_late
  memblock: replace free_bootmem{_node} with memblock_free
  mm: nobootmem: remove bootmem allocation APIs
  memblock: replace alloc_bootmem with memblock_alloc
  ...
2018-10-31 09:25:15 -07:00
Mike Rapoport
57c8a661d9 mm: remove include/linux/bootmem.h
Move remaining definitions and declarations from include/linux/bootmem.h
into include/linux/memblock.h and remove the redundant header.

The includes were replaced with the semantic patch below and then
semi-automated removal of duplicated '#include <linux/memblock.h>

@@
@@
- #include <linux/bootmem.h>
+ #include <linux/memblock.h>

[sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
  Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
[sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
  Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
[sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
  Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport
aca52c3983 mm: remove CONFIG_HAVE_MEMBLOCK
All architecures use memblock for early memory management. There is no need
for the CONFIG_HAVE_MEMBLOCK configuration option.

[rppt@linux.vnet.ibm.com: of/fdt: fixup #ifdefs]
  Link: http://lkml.kernel.org/r/20180919103457.GA20545@rapoport-lnx
[rppt@linux.vnet.ibm.com: csky: fixups after bootmem removal]
  Link: http://lkml.kernel.org/r/20180926112744.GC4628@rapoport-lnx
[rppt@linux.vnet.ibm.com: remove stale #else and the code it protects]
  Link: http://lkml.kernel.org/r/1538067825-24835-1-git-send-email-rppt@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1536927045-23536-4-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:15 -07:00
Frank Sorenson
22ea4eba63 fat: truncate inode timestamp updates in setattr
setattr_copy can't truncate timestamps correctly for
msdos/vfat, so truncate and copy them ourselves.

Link: http://lkml.kernel.org/r/a2b4701b1125573fafaeaae6802050ca86d6f8cc.1538363961.git.sorenson@redhat.com
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:14 -07:00
Frank Sorenson
cd83f6b194 fat: change timestamp updates to use fat_truncate_time
Convert the inode timestamp updates to use fat_truncate_time.

Link: http://lkml.kernel.org/r/2663d3083c4dd62f00b64612c8eaf5542bb05a4c.1538363961.git.sorenson@redhat.com
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:14 -07:00
Frank Sorenson
6bb885ecd7 fat: add functions to update and truncate timestamps appropriately
Add the fat-specific inode_operation ->update_time() and
fat_truncate_time() function to truncate the inode timestamps from 1
nanosecond to the appropriate granularity.

Link: http://lkml.kernel.org/r/38af1ba3c3cf0d7381ce7b63077ef8af75901532.1538363961.git.sorenson@redhat.com
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:14 -07:00
Frank Sorenson
d9f4d94261 fat: create a function to calculate the timezone offest
Patch series "fat: timestamp updates", v5.

fat/msdos timestamps are stored on-disk with several different
granularities, some of them lower resolution than timespec64_trunc() can
provide.  In addition, they are only truncated as they are written to
disk, so the timestamps in-memory for new or modified files/directories
may be different from the same timestamps after a remount, as the
now-truncated times are re-read from the on-disk format.

These patches allow finer granularity for the timestamps where possible
and add fat-specific ->update_time inode operation and fat_truncate_time
functions to truncate each timestamp correctly, giving consistent times
across remounts.

This patch (of 4):

Move the calculation of the number of seconds in the timezone offset to a
common function.

Link: http://lkml.kernel.org/r/3671ff8cff5eeedbb85ebda5e4de0728920db4f6.1538363961.git.sorenson@redhat.com
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:14 -07:00
Mihir Mehta
eceb8902be fat: expand a slightly out-of-date comment
The file namei.c seems to have been renamed to namei_msdos.c, so I decided
to update the comment with the correct name, and expand it a bit to tell
the reader what to look for.

Link: http://lkml.kernel.org/r/20180928194947.23932-1-mihir@cs.utexas.edu
Signed-off-by: Mihir Mehta <mihir@cs.utexas.edu>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:14 -07:00
Masahiro Yamada
21bfc8309c reiserfs: remove workaround code for GCC 3.x
cafa0010cd ("Raise the minimum required gcc version to 4.6") bumped the
minimum GCC version to 4.6 for all architectures.

The workaround code in fs/reiserfs/Makefile is obsolete now.

Link: http://lkml.kernel.org/r/1535337230-13222-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:14 -07:00
Jann Horn
b10298d56c reiserfs: propagate errors from fill_with_dentries() properly
fill_with_dentries() failed to propagate errors up to
reiserfs_for_each_xattr() properly.  Plumb them through.

Note that reiserfs_for_each_xattr() is only used by
reiserfs_delete_xattrs() and reiserfs_chown_xattrs().  The result of
reiserfs_delete_xattrs() is discarded anyway, the only difference there is
whether a warning is printed to dmesg.  The result of
reiserfs_chown_xattrs() does matter because it can block chowning of the
file to which the xattrs belong; but either way, the resulting state can
have misaligned ownership, so my patch doesn't improve things greatly.

Credit for making me look at this code goes to Al Viro, who pointed out
that the ->actor calling convention is suboptimal and should be changed.

Link: http://lkml.kernel.org/r/20180802163335.83312-1-jannh@google.com
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:14 -07:00
Colin Ian King
6c9a3f843a fs/hfs/extent.c: fix array out of bounds read of array extent
Currently extent and index i are both being incremented causing an array
out of bounds read on extent[i].  Fix this by removing the extraneous
increment of extent.

Ernesto said:

: This is only triggered when deleting a file with a resource fork.  I
: may be wrong because the documentation isn't clear, but I don't think
: you can create those under linux.  So I guess nobody was testing them.
:
: > A disk space leak, perhaps?
:
: That's what it looks like in general.  hfs_free_extents() won't do
: anything if the block count doesn't add up, and the error will be
: ignored.  Now, if the block count randomly does add up, we could see
: some corruption.

Detected by CoverityScan, CID#711541 ("Out of bounds read")

Link: http://lkml.kernel.org/r/20180831140538.31566-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Ernesto A. Fernndez <ernesto.mnd.fernandez@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:13 -07:00
Ernesto A. Fernández
8cd3cb5061 hfs: update timestamp on truncate()
The vfs takes care of updating mtime on ftruncate(), but on truncate() it
must be done by the module.

Link: http://lkml.kernel.org/r/e1611eda2985b672ed2d8677350b4ad8c2d07e8a.1539316825.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:13 -07:00
Ernesto A. Fernández
dc8844aada hfsplus: update timestamps on truncate()
The vfs takes care of updating ctime and mtime on ftruncate(), but on
truncate() it must be done by the module.

This patch can be tested with xfstests generic/313.

Link: http://lkml.kernel.org/r/9beb0913eea37288599e8e1b7cec8768fb52d1b8.1539316825.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:13 -07:00
Ernesto A. Fernández
1267a07be5 hfs: fix return value of hfs_get_block()
Direct writes to empty inodes fail with EIO.  The generic direct-io code
is in part to blame (a patch has been submitted as "direct-io: allow
direct writes to empty inodes"), but hfs is worse affected than the other
filesystems because the fallback to buffered I/O doesn't happen.

The problem is the return value of hfs_get_block() when called with
!create.  Change it to be more consistent with the other modules.

Link: http://lkml.kernel.org/r/4538ab8c35ea37338490525f0f24cbc37227528c.1539195310.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:13 -07:00
Ernesto A. Fernández
839c3a6a5e hfsplus: fix return value of hfsplus_get_block()
Direct writes to empty inodes fail with EIO.  The generic direct-io code
is in part to blame (a patch has been submitted as "direct-io: allow
direct writes to empty inodes"), but hfsplus is worse affected than the
other filesystems because the fallback to buffered I/O doesn't happen.

The problem is the return value of hfsplus_get_block() when called with
!create.  Change it to be more consistent with the other modules.

Link: http://lkml.kernel.org/r/2cd1301404ec7cf1e39c8f11a01a4302f1460ad6.1539195310.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:13 -07:00
Ernesto A. Fernández
54640c7502 hfs: prevent btree data loss on ENOSPC
Inserting a new record in a btree may require splitting several of its
nodes.  If we hit ENOSPC halfway through, the new nodes will be left
orphaned and their records will be lost.  This could mean lost inodes or
extents.

Henceforth, check the available disk space before making any changes.
This still leaves the potential problem of corruption on ENOMEM.

There is no need to reserve space before deleting a catalog record, as we
do for hfsplus.  This difference is because hfs index nodes have fixed
length keys.

Link: http://lkml.kernel.org/r/ab5fc8a7d5ffccfd5f27b1cf2cb4ceb6c110da74.1536269131.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:13 -07:00
Ernesto A. Fernández
d92915c35b hfsplus: prevent btree data loss on ENOSPC
Inserting or deleting a record in a btree may require splitting several of
its nodes.  If we hit ENOSPC halfway through, the new nodes will be left
orphaned and their records will be lost.  This could mean lost inodes,
extents or xattrs.

Henceforth, check the available disk space before making any changes.
This still leaves the potential problem of corruption on ENOMEM.

The patch can be tested with xfstests generic/027.

Link: http://lkml.kernel.org/r/4596eef22fbda137b4ffa0272d92f0da15364421.1536269129.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:13 -07:00
Ernesto A. Fernández
ef75bcc576 hfs: fix BUG on bnode parent update
hfs_brec_update_parent() may hit BUG_ON() if the first record of both a
leaf node and its parent are changed, and if this forces the parent to
be split.  It is not possible for this to happen on a valid hfs
filesystem because the index nodes have fixed length keys.

For reasons I ignore, the hfs module does have support for a number of
hfsplus features.  A corrupt btree header may report variable length
keys and trigger this BUG, so it's better to fix it.

Link: http://lkml.kernel.org/r/cf9b02d57f806217a2b1bf5db8c3e39730d8f603.1535682463.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:13 -07:00
Ernesto A. Fernández
d057c03667 hfs: prevent btree data loss on root split
This bug is triggered whenever hfs_brec_update_parent() needs to split
the root node.  The height of the btree is not increased, which leaves
the new node orphaned and its records lost.  It is not possible for this
to happen on a valid hfs filesystem because the index nodes have fixed
length keys.

For reasons I ignore, the hfs module does have support for a number of
hfsplus features.  A corrupt btree header may report variable length
keys and trigger this bug, so it's better to fix it.

Link: http://lkml.kernel.org/r/9750b1415685c4adca10766895f6d5ef12babdb0.1535682463.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:13 -07:00
Ernesto A. Fernández
19a9d0f1ac hfsplus: fix BUG on bnode parent update
Creating, renaming or deleting a file may hit BUG_ON() if the first
record of both a leaf node and its parent are changed, and if this
forces the parent to be split.  This bug is triggered by xfstests
generic/027, somewhat rarely; here is a more reliable reproducer:

  truncate -s 50M fs.iso
  mkfs.hfsplus fs.iso
  mount fs.iso /mnt
  i=1000
  while [ $i -le 2400 ]; do
    touch /mnt/$i &>/dev/null
    ((++i))
  done
  i=2400
  while [ $i -ge 1000 ]; do
    mv /mnt/$i /mnt/$(perl -e "print $i x61") &>/dev/null
    ((--i))
  done

The issue is that a newly created bnode is being put twice.  Reset
new_node to NULL in hfs_brec_update_parent() before reaching goto again.

Link: http://lkml.kernel.org/r/5ee1db09b60373a15890f6a7c835d00e76bf601d.1535682461.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:13 -07:00
Ernesto A. Fernández
0a3021d4f5 hfsplus: prevent btree data loss on root split
Creating, renaming or deleting a file may cause catalog corruption and
data loss.  This bug is randomly triggered by xfstests generic/027, but
here is a faster reproducer:

  truncate -s 50M fs.iso
  mkfs.hfsplus fs.iso
  mount fs.iso /mnt
  i=100
  while [ $i -le 150 ]; do
    touch /mnt/$i &>/dev/null
    ((++i))
  done
  i=100
  while [ $i -le 150 ]; do
    mv /mnt/$i /mnt/$(perl -e "print $i x82") &>/dev/null
    ((++i))
  done
  umount /mnt
  fsck.hfsplus -n fs.iso

The bug is triggered whenever hfs_brec_update_parent() needs to split the
root node.  The height of the btree is not increased, which leaves the new
node orphaned and its records lost.

Link: http://lkml.kernel.org/r/26d882184fc43043a810114258f45277752186c7.1535682461.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:13 -07:00
Souptick Joarder
b5c212374c fs/proc/vmcore.c: Convert to use vmf_error()
This code can be replaced with vmf_error() inline function.

Link: http://lkml.kernel.org/r/20180918145945.GA11392@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:12 -07:00
Miklos Szeredi
5e12758086 ovl: check whiteout in ovl_create_over_whiteout()
Kaixuxia repors that it's possible to crash overlayfs by removing the
whiteout on the upper layer before creating a directory over it.  This is a
reproducer:

 mkdir lower upper work merge
 touch lower/file
 mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merge
 rm merge/file
 ls -al merge/file
 rm upper/file
 ls -al merge/
 mkdir merge/file

Before commencing with a vfs_rename(..., RENAME_EXCHANGE) verify that the
lookup of "upper" is positive and is a whiteout, and return ESTALE
otherwise.

Reported by: kaixuxia <xiakaixu1987@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: e9be9d5e76 ("overlay filesystem")
Cc: <stable@vger.kernel.org> # v3.18
2018-10-31 12:15:23 +01:00
Linus Torvalds
310c7585e8 Olga added support for the NFSv4.2 asynchronous copy protocol. We
already supported COPY, by copying a limited amount of data and then
 returning a short result, letting the client resend.  The asynchronous
 protocol should offer better performance at the expense of some
 complexity.
 
 The other highlight is Trond's work to convert the duplicate reply cache
 to a red-black tree, and to move it and some other server caches to RCU.
 (Previously these have meant taking global spinlocks on every RPC.)
 
 Otherwise, some RDMA work and miscellaneous bugfixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJb2KWzAAoJECebzXlCjuG+gcQP/3DldB86CFxgSFx0t+h+s+TV
 CdYJDPyLyRkEMiD+4dCPPuhueve+j5BPHVsDbn98FTWrEn131NMIs6uhU/VGTtAU
 6a8f/ExtZ5U7s39MJCzlk2ozVElBc3QPp7p3p9NKn0Wi0PXbVgjuIqR5o2vwa8Si
 KOVdLm6ylfav/HTH8DO6zFPJRsTgTwcJOivXXshjpglMKAcw8AuqSsGgBrDeGpgU
 u91Vi0EM1vt96+CA6a01mTgC/sFX7EqGvxUUHOrKWf5cIjnpT3FDvouYPxi+GH8Z
 SIDlaMQyXF5m4m6MhELNTP4v97XAHyPJtvLkEe5lggTyABPiA2heo9e8onysWkzV
 1v8OZHCVFa1UL34mDlnFxbFCYVr7FFKMGjTBR/ntinobPfAbWRCO1Hdd+bBGPDD4
 byf7ctDVp7KQ2bSatIdlYavikuGDHWFDZHzPHlqkD3gpIZSNvhe26sV3NZqIFlXO
 cMUega2Y5mXmULauHhxAcNGtDK7dF5hHoMWKJy0DNxiyDiDLylwDOIfwt1De3Q7V
 ycd/wUytUS2LkAhyS2mvoDK6eXTBAeQwzmXAqveh6rewwO83HC/t9mtKBBDomvKG
 xRpRPmmbj9ijbwkilEBmijjR47wrihmEVIFahznEerZ+//QOfVVOB0MNtzIyU9/k
 CnP1ZNvOs3LR1pxxwFa8
 =TTo0
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.20' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Olga added support for the NFSv4.2 asynchronous copy protocol. We
  already supported COPY, by copying a limited amount of data and then
  returning a short result, letting the client resend. The asynchronous
  protocol should offer better performance at the expense of some
  complexity.

  The other highlight is Trond's work to convert the duplicate reply
  cache to a red-black tree, and to move it and some other server caches
  to RCU. (Previously these have meant taking global spinlocks on every
  RPC)

  Otherwise, some RDMA work and miscellaneous bugfixes"

* tag 'nfsd-4.20' of git://linux-nfs.org/~bfields/linux: (30 commits)
  lockd: fix access beyond unterminated strings in prints
  nfsd: Fix an Oops in free_session()
  nfsd: correctly decrement odstate refcount in error path
  svcrdma: Increase the default connection credit limit
  svcrdma: Remove try_module_get from backchannel
  svcrdma: Remove ->release_rqst call in bc reply handler
  svcrdma: Reduce max_send_sges
  nfsd: fix fall-through annotations
  knfsd: Improve lookup performance in the duplicate reply cache using an rbtree
  knfsd: Further simplify the cache lookup
  knfsd: Simplify NFS duplicate replay cache
  knfsd: Remove dead code from nfsd_cache_lookup
  SUNRPC: Simplify TCP receive code
  SUNRPC: Replace the cache_detail->hash_lock with a regular spinlock
  SUNRPC: Remove non-RCU protected lookup
  NFS: Fix up a typo in nfs_dns_ent_put
  NFS: Lockless DNS lookups
  knfsd: Lockless lookup of NFSv4 identities.
  SUNRPC: Lockless server RPCSEC_GSS context lookup
  knfsd: Allow lockless lookups of the exports
  ...
2018-10-30 13:03:29 -07:00
Linus Torvalds
9b190ecca1 Make the Cramfs code more robust against filesystem corruptions,
plus trivial indentation fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJb2KX1AAoJEH9KYoIL9GO3lfcH/R1eClPzhBbOj/MZWgxoB5VN
 lsOFl2XYc8bhThxOygqKcLpa2De5Q6ebrvLFgQ43erO7MaXzI4mswM5+azIlLXx3
 iVuE1NIYze5g92yvb4mLeDHGVid4EjGoG1tiGRxuU18j02Nze1B7t22tBzcYUCyi
 buJfx0A37aMepd/+cy3Qp4G03hgaNMama1220AR0S0kkORIBZFzKQOAKN6r8DGa/
 05QhmtJJQsLJJxyLDv6lKmy0Ef42COeDICpYUlQ1LvoxJJBAblDBzlkYl7ulORwV
 f147xPV+v/jlE8CktOtN31S8x+XRvbbqm9sKLB0XKnA9vz89WAl1BzoZ/7FZf/Y=
 =aGIT
 -----END PGP SIGNATURE-----

Merge tag 'cramfs_fixes' of git://git.linaro.org/people/nicolas.pitre/linux

Pull cramfs fixes from Nicolas Pitre:
 "Make the Cramfs code more robust against filesystem corruptions, plus
  trivial indentation fixes"

* tag 'cramfs_fixes' of git://git.linaro.org/people/nicolas.pitre/linux:
  Cramfs: trivial whitespace fixes
  Cramfs: fix abad comparison when wrap-arounds occur
2018-10-30 12:46:25 -07:00
Nicolas Pitre
56ce68bcee Cramfs: trivial whitespace fixes
Signed-off-by: Nicolas Pitre <nico@linaro.org>
2018-10-30 14:24:19 -04:00
Nicolas Pitre
672ca9dd13 Cramfs: fix abad comparison when wrap-arounds occur
It is possible for corrupted filesystem images to produce very large
block offsets that may wrap when a length is added, and wrongly pass
the buffer size test.

Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Cc: stable@vger.kernel.org
2018-10-30 14:24:19 -04:00
Linus Torvalds
85b5d4bcab for-4.20-part2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlvYVlMACgkQxWXV+ddt
 WDv9xxAAmN+R9y+wOKjPkDoM7jr8hRR12YnTC8R4X8oD8QTnSXWOrmfO2prYpe7d
 RyUxpuhqY+q+qvCxkp+BREa86a0zswhn/Z6HfLbHn4CaEhtchkMKR/gFOiYeL2B1
 ZIJtqgnqOGP3N1oxfn3Zr586W3ECUJq+4EUD/1OWCxZHvn1DWWd7L3VL0884hAhE
 kDVWhMdBm0nX1SOet/8haI0N98NLdyltsGdz80ooi65qR52YE4u2IoqXEg2z0AEM
 EApA6vQeOIIuZaRznIl2xFiIMbQCoMRb2sQgwIPmWoXrfboJUHyHfFrKRv5gGUHg
 DXjOXTvVdu9EEqm+1HughwZL/KRkr+OcXHHWwP+v51zsiyfbic+fegpM6a+Z0NjD
 LCo5D1NSLulhpZHr14F3qM27+LYHEC4xxXrrzRoVq4DCoSq7xgj3ip49uXe1F4Rw
 AyLeJGGOp8aqvPiD0BfgMVi4+YhWJUd/ob9Ldn9z+2y0XGQ2FDM58iCt+49+YIQi
 e2ywGaHt3aXghPAo/mvnckfZMLNZ7DJPwA7K6ayJ3N23dqGW2CORkKrGy7xVGoZn
 2AjIN1pSRLlknQJZsa6Yp1mPxnrBQfutTVxxUfKOtmEzydxMVS0g92+Lu/JRb4pu
 F/tpq/lC7dpTvP08EWw0sLjIhLeqMKzbXk38pSfUm39yDgQ10e8=
 =CiDs
 -----END PGP SIGNATURE-----

Merge tag 'for-4.20-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull more btrfs updates from David Sterba:
 "This contains a few minor updates and fixes that were under testing or
  arrived shortly after the merge window freeze, mostly stable material"

* tag 'for-4.20-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix use-after-free when dumping free space
  Btrfs: fix use-after-free during inode eviction
  btrfs: move the dio_sem higher up the callchain
  btrfs: don't run delayed_iputs in commit
  btrfs: fix insert_reserved error handling
  btrfs: only free reserved extent if we didn't insert it
  btrfs: don't use ctl->free_space for max_extent_size
  btrfs: set max_extent_size properly
  btrfs: reset max_extent_size properly
  MAINTAINERS: update my email address for btrfs
  btrfs: delayed-ref: extract find_first_ref_head from find_ref_head
  Btrfs: fix deadlock when writing out free space caches
  Btrfs: fix assertion on fsync of regular file when using no-holes feature
  Btrfs: fix null pointer dereference on compressed write path error
2018-10-30 08:27:13 -07:00
Darrick J. Wong
bf4a1fcf0b xfs: remove [cm]time update from reflink calls
Now that the vfs remap helper dirties the inode [cm]time for us, xfs no
longer needs to do that on its own.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:47:48 +11:00
Darrick J. Wong
3fc9f5e409 xfs: remove xfs_reflink_remap_range
Since xfs_file_remap_range is a thin wrapper, move the contents of
xfs_reflink_remap_range into the shell.  This cuts down on the vfs
calls being made from internal xfs code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:47:26 +11:00
Darrick J. Wong
7a6ccf004e xfs: remove redundant remap partial EOF block checks
Now that we've moved the partial EOF block checks to the VFS helpers, we
can remove the redundant functionality from XFS.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:47:16 +11:00
Darrick J. Wong
3f68c1f562 xfs: support returning partial reflink results
Back when the XFS reflink code only supported clone_file_range, we were
only able to return zero or negative error codes to userspace.  However,
now that copy_file_range (which returns bytes copied) can use XFS'
clone_file_range, we have the opportunity to return partial results.
For example, if userspace sends a 1GB clone request and we run out of
space halfway through, we at least can tell userspace that we completed
512M of that request like a regular write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:47:06 +11:00
Darrick J. Wong
9f04aaffdd xfs: clean up xfs_reflink_remap_blocks call site
Move the offset <-> blocks unit conversions into
xfs_reflink_remap_blocks to make the call site less ugly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:46:50 +11:00
Darrick J. Wong
4918ef4ea0 xfs: fix pagecache truncation prior to reflink
Prior to remapping blocks, it is necessary to remove pages from the
destination file's page cache.  Unfortunately, the truncation is not
aggressive enough -- if page size > block size, we'll end up zeroing
subpage blocks instead of removing them.  So, round the start offset
down and the end offset up to page boundaries.  We already wrote all
the dirty data so the larger range shouldn't be a problem.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:46:33 +11:00
Darrick J. Wong
65f098e91f ocfs2: remove ocfs2_reflink_remap_range
Since ocfs2_remap_file_range is a thin shell around
ocfs2_remap_remap_range, move everything from the latter into the
former.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:45:48 +11:00
Darrick J. Wong
900611a1bd ocfs2: support partial clone range and dedupe range
Change the ocfs2 remap code to allow for returning partial results.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:44:45 +11:00
Darrick J. Wong
a8a94302c9 ocfs2: fix pagecache truncation prior to reflink
Prior to remapping blocks, it is necessary to remove pages from the
destination file's page cache.  Unfortunately, the truncation is not
aggressive enough -- if page size > block size, we'll end up zeroing
subpage blocks instead of removing them.  So, round the start offset
down and the end offset up to page boundaries.  We already wrote all
the dirty data so the larger range should be fine.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:43:16 +11:00
Darrick J. Wong
2587b1f1fa ocfs2: truncate page cache for clone destination file before remapping
When cloning blocks into another file, truncate the page cache before we
start remapping blocks so that concurrent reads wait for us to finish.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:42:56 +11:00
Darrick J. Wong
8c5c836bd6 vfs: clean up generic_remap_file_range_prep return value
Since the remap prep function can update the length of the remap
request, we can change this function to return the usual return status
instead of the odd behavior it has now.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:42:24 +11:00
Darrick J. Wong
c32e5f3995 vfs: hide file range comparison function
There are no callers of vfs_dedupe_file_range_compare, so we might as
well make it a static helper and remove the export.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:42:17 +11:00
Darrick J. Wong
eca3654e3c vfs: enable remap callers that can handle short operations
Plumb in a remap flag that enables the filesystem remap handler to
shorten remapping requests for callers that can handle it.  Now
copy_file_range can report partial success (in case we run up against
alignment problems, resource limits, etc.).

We also enable CAN_SHORTEN for fideduperange to maintain existing
userspace-visible behavior where xfs/btrfs shorten the dedupe range to
avoid stale post-eof data exposure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:42:10 +11:00
Darrick J. Wong
df36583619 vfs: plumb remap flags through the vfs dedupe functions
Plumb a remap_flags argument through the vfs_dedupe_file_range_one
functions so that dedupe can take advantage of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:42:03 +11:00
Darrick J. Wong
452ce65951 vfs: plumb remap flags through the vfs clone functions
Plumb a remap_flags argument through the {do,vfs}_clone_file_range
functions so that clone can take advantage of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:41:56 +11:00
Darrick J. Wong
42ec3d4c02 vfs: make remap_file_range functions take and return bytes completed
Change the remap_file_range functions to take a number of bytes to
operate upon and return the number of bytes they operated on.  This is a
requirement for allowing fs implementations to return short clone/dedupe
results to the user, which will enable us to obey resource limits in a
graceful manner.

A subsequent patch will enable copy_file_range to signal to the
->clone_file_range implementation that it can handle a short length,
which will be returned in the function's return value.  For now the
short return is not implemented anywhere so the behavior won't change --
either copy_file_range manages to clone the entire range or it tries an
alternative.

Neither clone ioctl can take advantage of this, alas.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:41:49 +11:00
Darrick J. Wong
8dde90bca6 vfs: remap helper should update destination inode metadata
Extend generic_remap_file_range_prep to handle inode metadata updates
when remapping into a file.  If the operation can possibly alter the
file contents, we must update the ctime and mtime and remove security
privileges, just like we do for regular file writes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:41:41 +11:00
Darrick J. Wong
3d28193e1d vfs: pass remap flags to generic_remap_checks
Pass the same remap flags to generic_remap_checks for consistency.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:41:34 +11:00
Darrick J. Wong
a91ae49bba vfs: pass remap flags to generic_remap_file_range_prep
Plumb the remap flags through the filesystem from the vfs function
dispatcher all the way to the prep function to prepare for behavior
changes in subsequent patches.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:41:28 +11:00
Darrick J. Wong
2e5dfc99f2 vfs: combine the clone and dedupe into a single remap_file_range
Combine the clone_file_range and dedupe_file_range operations into a
single remap_file_range file operation dispatch since they're
fundamentally the same operation.  The differences between the two can
be made in the prep functions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:41:21 +11:00
Darrick J. Wong
6095028b45 vfs: rename clone_verify_area to remap_verify_area
Since we use clone_verify_area for both clone and dedupe range checks,
rename the function to make it clear that it's for both.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:41:14 +11:00
Darrick J. Wong
a83ab01a62 vfs: rename vfs_clone_file_prep to be more descriptive
The vfs_clone_file_prep is a generic function to be called by filesystem
implementations only.  Rename the prefix to generic_ and make it more
clear that it applies to remap operations, not just clones.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:41:08 +11:00
Darrick J. Wong
9aae20500d vfs: skip zero-length dedupe requests
Don't bother calling the filesystem for a zero-length dedupe request;
we can return zero and exit.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:41:01 +11:00
Darrick J. Wong
07d19dc9fb vfs: avoid problematic remapping requests into partial EOF block
A deduplication data corruption is exposed in XFS and btrfs. It is
caused by extending the block match range to include the partial EOF
block, but then allowing unknown data beyond EOF to be considered a
"match" to data in the destination file because the comparison is only
made to the end of the source file. This corrupts the destination file
when the source extent is shared with it.

The VFS remapping prep functions  only support whole block dedupe, but
we still need to appear to support whole file dedupe correctly.  Hence
if the dedupe request includes the last block of the souce file, don't
include it in the actual dedupe operation. If the rest of the range
dedupes successfully, then reject the entire request.  A subsequent
patch will enable us to shorten dedupe requests correctly.

When reflinking sub-file ranges, a data corruption can occur when the
source file range includes a partial EOF block. This shares the unknown
data beyond EOF into the second file at a position inside EOF, exposing
stale data in the second file.

If the reflink request includes the last block of the souce file, only
proceed with the reflink operation if it lands at or past the
destination file's current EOF. If it lands within the destination file
EOF, reject the entire request with -EINVAL and make the caller go the
hard way.  A subsequent patch will enable us to shorten reflink requests
correctly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:40:55 +11:00
Darrick J. Wong
2c5773f102 vfs: exit early from zero length remap operations
If a remap caller asks us to remap to the source file's EOF and the
source file length leaves us with a zero byte request, exit early.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:40:39 +11:00
Darrick J. Wong
1383a7ed67 vfs: check file ranges before cloning files
Move the file range checks from vfs_clone_file_prep into a separate
generic_remap_checks function so that all the checks are collected in a
central location.  This forms the basis for adding more checks from
generic_write_checks that will make cloning's input checking more
consistent with write input checking.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:40:31 +11:00
Darrick J. Wong
5b49f64db2 vfs: vfs_clone_file_prep_inodes should return EINVAL for a clone from beyond EOF
vfs_clone_file_prep_inodes cannot return 0 if it is asked to remap from
a zero byte file because that's what btrfs does.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-30 10:40:22 +11:00
Linus Torvalds
134bf98c55 media updates for v4.20-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJb12wyAAoJEAhfPr2O5OEVQG4P/3QlXjec6qlhbo6UPs54E2sC
 bVdZfp3mobo8NmLRt791Yh9cc0rN45Tlf2BT8XEmCyI6+NB++obU/j0LW5XT7sp7
 oE8IgeRraVFWH/Xl9lTgP15Cs6v43eyvP12xgRWBmr+TYugLHDVTheGBvU/COb3d
 yaykUULezuOMLA3HsPbz5EJOmU5rZ/Wa1w1sAiNJY/cRohfVb3kO4593enwUTMSx
 yHJ+AVjl/Dn3RV4yLwoybpxPH6XIb3KoLg/6Fx8bOlKy1sg0mcWpzQ1CvMUNpXTF
 kdwTw3ri1bfYnjChZewuKoJU8Wcw0Gt7pkqAhULN1ieo84MNA3bNor56pdRPaOZW
 KxzlXZRS6xgYW8bzZ51N0Ku6fwSt3AWRE7TeKcrHF84Yb8vOtPS15sp3qc+9o9rb
 EDV/lJLcz4bbi3W28di5WMFaN7LHxCHnRV7GvrcNQm6Im62CBFZHiI7jKjMv3tXp
 Taes0utMPGfWuY6fv4LmuBzFG4nGB6/H4RiVvL1cLkjnx/FJtWGH+1uOcKDraKeI
 ENBrK0VYrNH7nCDGNehiamStcVK+27tS+xsuqoZkGz6RA8vAxYBXTIZULXA98BPA
 f6NC32ZNJaruxh4qh5tUy+LKPGzXs0sWa9kfgKmFfaOndFLMjGTXHpAT5AYJMbNe
 iqKi/4aXD4aKAWTA7PPg
 =Cc6D
 -----END PGP SIGNATURE-----

Merge tag 'media/v4.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media updates from Mauro Carvalho Chehab:

 - new dvb frontend driver: lnbh29

 - new sensor drivers: imx319 and imx 355

 - some old soc_camera driver renames to avoid conflict with new
   drivers

 - new i.MX Pixel Pipeline (PXP) mem-to-mem platform driver

 - a new V4L2 frontend for the FWHT codec

 - several other improvements, bug fixes, code cleanups, etc

* tag 'media/v4.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (289 commits)
  media: rename soc_camera I2C drivers
  media: cec: forgot to cancel delayed work
  media: vivid: Support 480p for webcam capture
  media: v4l2-tpg: fix kernel oops when enabling HFLIP and OSD
  media: vivid: Add 16-bit bayer to format list
  media: v4l2-tpg-core: Add 16-bit bayer
  media: pvrusb2: replace `printk` with `pr_*`
  media: venus: vdec: fix decoded data size
  media: cx231xx: fix potential sign-extension overflow on large shift
  media: dt-bindings: media: rcar_vin: add device tree support for r8a7744
  media: isif: fix a NULL pointer dereference bug
  media: exynos4-is: make const array config_ids static
  media: cx23885: make const array addr_list static
  media: ivtv: make const array addr_list static
  media: bttv-input: make const array addr_list static
  media: cx18: Don't check for address of video_dev
  media: dw9807-vcm: Fix probe error handling
  media: dw9714: Remove useless error message
  media: dw9714: Fix error handling in probe function
  media: cec: name for RC passthrough device does not need 'RC for'
  ...
2018-10-29 14:29:58 -07:00
Amir Goldstein
93f38b6fae lockd: fix access beyond unterminated strings in prints
printk format used %*s instead of %.*s, so hostname_len does not limit
the number of bytes accessed from hostname.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-10-29 16:58:04 -04:00
Andrew Elble
bd8d725078 nfsd: correctly decrement odstate refcount in error path
alloc_init_deleg() both allocates an nfs4_delegation, and
bumps the refcount on odstate. So after this point, we need to
put_clnt_odstate() and nfs4_put_stid() to not leave the odstate
refcount inappropriately bumped.

Signed-off-by: Andrew Elble <aweits@rit.edu>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-10-29 16:58:04 -04:00
Gustavo A. R. Silva
0ac203cb1f nfsd: fix fall-through annotations
Replace "fallthru" with a proper "fall through" annotation.

Also, add an annotation were it is expected to fall through.

These fixes are part of the ongoing efforts to enabling
-Wimplicit-fallthrough

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-10-29 16:58:04 -04:00
Trond Myklebust
736c6625de knfsd: Improve lookup performance in the duplicate reply cache using an rbtree
Use an rbtree to ensure the lookup/insert of an entry in a DRC bucket is
O(log(N)).

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-10-29 16:58:04 -04:00
Trond Myklebust
ed00c2f652 knfsd: Further simplify the cache lookup
Order the structure so that the key can be compared using memcmp().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-10-29 16:58:04 -04:00
Trond Myklebust
76ecec2119 knfsd: Simplify NFS duplicate replay cache
Simplify the duplicate replay cache by initialising the preallocated
cache entry, so that we can use it as a key for the cache lookup.

Note that the 99.999% case we want to optimise for is still the one
where the lookup fails, and we have to add this entry to the cache,
so preinitialising should not cause a performance penalty.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-10-29 16:58:04 -04:00
Trond Myklebust
3e87da5145 knfsd: Remove dead code from nfsd_cache_lookup
The preallocated cache entry is always set to type RC_NOCACHE, and that
type isn't changed until we later call nfsd_cache_update().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-10-29 16:58:04 -04:00
Trond Myklebust
a6482733bc NFS: Fix up a typo in nfs_dns_ent_put
call_rcu() needs to take a first argument of type (struct rcu_head *).

Fixes: fd497f1e40d9 ("NFS: Lockless DNS lookups")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-10-29 16:58:04 -04:00
Trond Myklebust
437f914513 NFS: Lockless DNS lookups
Enable RCU protected lookup in the legacy DNS resolver.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-10-29 16:58:04 -04:00
Trond Myklebust
9d5afd9491 knfsd: Lockless lookup of NFSv4 identities.
Enable RCU protected lookups of the NFSv4 idmap.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-10-29 16:58:04 -04:00
Trond Myklebust
9ceddd9da1 knfsd: Allow lockless lookups of the exports
Convert structs svc_expkey and svc_export to allow RCU protected lookups.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-10-29 16:58:04 -04:00
Linus Torvalds
e64433d587 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlvW43UACgkQnJ2qBz9k
 QNnqtQgA2uzRlz6U7UayQl9JiYKd2XbuojmAE+irdSL5+4OzpqkOsRfLzGSKAfvs
 ekv1eVv+4+PS90FUNbvwmX/OzZ9wi3e5d3/qfnJ7l2ZsMfKuc9aW/9I8EXPXkpAB
 O7NgoTtOZMXTJXMhseMmha2JfpbQZZ566NzCDfhOfPKqylbjTEM58pcY382VGRDX
 Iv1DNwrzPw7PaOOYO3P/vWLeb4GGpMkdG61eoTBMi6SKb/5QMc6MS+WqYzmdLZWE
 aP4tK8VhC0L47i0myXzWOHMrjQysq+E24CuQ6zG2O4bFRZj1fT+hiST9SwyUim2+
 Ne8P5gnHJiBSZgYtKBoETNI0jORzCA==
 =9DMi
 -----END PGP SIGNATURE-----

Merge tag 'filesystems_for_v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull ext2 and udf updates from Jan Kara:
 "Small ext2 cleanups and a couple of udf fixes"

* tag 'filesystems_for_v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext2: remove redundant building macro check
  udf: Drop pack pragma from udf_sb.h
  udf: Drop freed bitmap / table support
  udf: Fix crash during mount
  udf: Prevent write-unsupported filesystem to be remounted read-write
  ext2: cache NULL when both default_acl and acl are NULL
  udf: remove unused variables group_start and nr_groups
2018-10-29 10:23:36 -07:00
Linus Torvalds
79257514f5 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlvWyDMACgkQnJ2qBz9k
 QNnifgf+PXybPXX3KxtRUmK4u2zX2JMTwzuE0wmLxM6I08tf7rzLrBIbOY7iXka/
 nzW6IK+KnA5HtPTEUbxqNBAvWpUAvPLZ/v20d0t/QTMJcz8yfhpvM9O2mjQAGMH8
 EBmjjEhZaso8uOIAPhUg9um1QdQoYWa329fsoQuHor9kjKmDg+3RmtdH0jbRzQ6B
 RNAY1WNFbm+7MH7Fu3AB/jLqqkwZhoPcu7TwXP6m+va6xAvzEYUOQQB9rPEIaY2Z
 +q0B9LhwFIAnWPCI7dxw3CBTndoR2u1vkpnGw5FFhJgnMG4L1QMPoCCYPIZEIXg/
 VuGZQ0/mayCtO+JWw+VDJF3jQFrHxA==
 =J6tx
 -----END PGP SIGNATURE-----

Merge tag 'for_v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:
 "Amir's patches to implement superblock fanotify watches, Xiaoming's
  patch to enable reporting of thread IDs in fanotify events instead of
  TGIDs (sadly the patch got mis-attributed to Amir and I've noticed
  only now), and a fix of possible oops on umount caused by fsnotify
  infrastructure"

* tag 'for_v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: Fix busy inodes during unmount
  fs: group frequently accessed fields of struct super_block together
  fanotify: support reporting thread id instead of process id
  fanotify: add BUILD_BUG_ON() to count the bits of fanotify constants
  fsnotify: convert runtime BUG_ON() to BUILD_BUG_ON()
  fanotify: deprecate uapi FAN_ALL_* constants
  fanotify: simplify handling of FAN_ONDIR
  fsnotify: generalize handling of extra event flags
  fanotify: fix collision of internal and uapi mark flags
  fanotify: store fanotify_init() flags in group's fanotify_data
  fanotify: add API to attach/detach super block mark
  fsnotify: send path type events to group with super block marks
  fsnotify: add super block object type
2018-10-29 09:19:53 -07:00
Linus Torvalds
7da4221b53 Pull request for inclusion in 4.20
* Finish removing the custom 9p request cache mechanism
  * Embed part of the fcall in the request to have better slab
 performance (msize usually is power of two aligned)
  * syzkaller fixes:
   - add a refcount to 9p requests to avoid use after free
   - a few double free issues
  * A few coverity fixes
  * Some old patches that were in the bugzilla:
   - do not trust pdu content for size header
   - mount option for lock retry interval
 
 ----------------------------------------------------------------
 Dan Carpenter (1):
       9p: potential NULL dereference
 
 Dinu-Razvan Chis-Serban (1):
       9p locks: add mount option for lock retry interval
 
 Dominique Martinet (12):
       9p/xen: fix check for xenbus_read error in front_probe
       v9fs_dir_readdir: fix double-free on p9stat_read error
       9p: clear dangling pointers in p9stat_free
       9p: embed fcall in req to round down buffer allocs
       9p: add a per-client fcall kmem_cache
       9p/rdma: do not disconnect on down_interruptible EAGAIN
       9p: acl: fix uninitialized iattr access
       9p/rdma: remove useless check in cm_event_handler
       9p: p9dirent_read: check network-provided name length
       9p locks: fix glock.client_id leak in do_lock
       9p/trans_fd: abort p9_read_work if req status changed
       9p/trans_fd: put worker reqs on destroy
 
 Gertjan Halkes (1):
       9p: do not trust pdu content for stat item size
 
 Gustavo A. R. Silva (1):
       9p: fix spelling mistake in fall-through annotation
 
 Matthew Wilcox (2):
       9p: Use a slab for allocating requests
       9p: Remove p9_idpool
 
 Tomas Bortoli (3):
       9p: rename p9_free_req() function
       9p: Add refcount to p9_req_t
       9p: Rename req to rreq in trans_fd
 
  fs/9p/acl.c             |   2 +-
  fs/9p/v9fs.c            |  21 +++++
  fs/9p/v9fs.h            |   1 +
  fs/9p/vfs_dir.c         |  19 +---
  fs/9p/vfs_file.c        |  24 +++++-
  include/net/9p/9p.h     |  12 +--
  include/net/9p/client.h |  71 ++++++---------
  net/9p/Makefile         |   1 -
  net/9p/client.c         | 551 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------
  net/9p/mod.c            |   9 +-
  net/9p/protocol.c       |  20 ++++-
  net/9p/trans_fd.c       |  64 +++++++++-----
  net/9p/trans_rdma.c     |  37 ++++----
  net/9p/trans_virtio.c   |  44 +++++++---
  net/9p/trans_xen.c      |  17 ++--
  net/9p/util.c           | 140 ------------------------------
  16 files changed, 482 insertions(+), 551 deletions(-)
  delete mode 100644 net/9p/util.c
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAlvWYE0ACgkQq06b7GqY
 5nBBDxAAkz4rGRsnRpx9D1Bt+81eARTqsWcoPM138zEbwiJehe5aSF8x/CGyirDW
 +PoES/efho50sDfaUL9eBMgyy95UMN7LepMTcjNE1tywA+tRsH3dOL4757+lBwZc
 FGsHvfse4b5ZzmMxSo2L5KbzpNiUEXCJGEvNlBSbeWcSSUtCfOShMVx9IC6r7JhX
 pbsoczQ6W+384V4WgBxBZZf99EryxSeYJ9KMU+9b13ndzKICJqYeaEdItt28uYUC
 d05iXDEDQZD3B8R/1Y08ktMqmHLXP9kZdJ/w8HhCMRAxWKUD1pfDwJ45gIwhCb5R
 KxdxZ2tPvc8ihGngNJ4VaJoQQFTVzDgbyWXjwAFYkFUuPFgmuAgx3qApOhUFHCdM
 8vgJd2u0attNwfOM3VsJApLQu09wmRw1LcmJ9re5OJ/YLiAUF3vw2oTpGhGlKRRI
 bMBI9rwXHzeoVI7ank5mh1NMVRpMbzL9A8uqux85gcpXL36V3XmZtKRDe1AAa5jB
 JR6dciAUABm9m+Ilt1Pl/faSjmysUOb1wZ0XU/cvVy5EBMUjbHv3x0TuDmrVwDas
 puUXEM8soVf9e/NUYqTOozwFIle6KFcyPmae/OopATLgkGES4cjDy9cLh5SgnERM
 vMcN/Z+DQO98zM/fPRy2B4qklflgTKI6VI/gSLRCWGpn49Iwhhg=
 =g1OF
 -----END PGP SIGNATURE-----

Merge tag '9p-for-4.20' of git://github.com/martinetd/linux

Pull 9p updates from Dominique Martinet:
 "Highlights this time around are the end of Matthew's work to remove
  the custom 9p request cache and use a slab directly for requests, with
  some extra patches on my end to not degrade performance, but it's a
  very good cleanup.

  Tomas and I fixed a few more syzkaller bugs (refcount is the big one),
  and I had a go at the coverity bugs and at some of the bugzilla
  reports we had open for a while.

  I'm a bit disappointed that I couldn't get much reviews for a few of
  my own patches, but the big ones got some and it's all been soaking in
  linux-next for quite a while so I think it should be OK.

  Summary:

   - Finish removing the custom 9p request cache mechanism

   - Embed part of the fcall in the request to have better slab
     performance (msize usually is power of two aligned)

   - syzkaller fixes:
      * add a refcount to 9p requests to avoid use after free
      * a few double free issues

   - A few coverity fixes

   - Some old patches that were in the bugzilla:
      * do not trust pdu content for size header
      * mount option for lock retry interval"

* tag '9p-for-4.20' of git://github.com/martinetd/linux: (21 commits)
  9p/trans_fd: put worker reqs on destroy
  9p/trans_fd: abort p9_read_work if req status changed
  9p: potential NULL dereference
  9p locks: fix glock.client_id leak in do_lock
  9p: p9dirent_read: check network-provided name length
  9p/rdma: remove useless check in cm_event_handler
  9p: acl: fix uninitialized iattr access
  9p locks: add mount option for lock retry interval
  9p: do not trust pdu content for stat item size
  9p: Rename req to rreq in trans_fd
  9p: fix spelling mistake in fall-through annotation
  9p/rdma: do not disconnect on down_interruptible EAGAIN
  9p: Add refcount to p9_req_t
  9p: rename p9_free_req() function
  9p: add a per-client fcall kmem_cache
  9p: embed fcall in req to round down buffer allocs
  9p: Remove p9_idpool
  9p: Use a slab for allocating requests
  9p: clear dangling pointers in p9stat_free
  v9fs_dir_readdir: fix double-free on p9stat_read error
  ...
2018-10-29 09:09:47 -07:00
Linus Torvalds
dad4f140ed Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax
Pull XArray conversion from Matthew Wilcox:
 "The XArray provides an improved interface to the radix tree data
  structure, providing locking as part of the API, specifying GFP flags
  at allocation time, eliminating preloading, less re-walking the tree,
  more efficient iterations and not exposing RCU-protected pointers to
  its users.

  This patch set

   1. Introduces the XArray implementation

   2. Converts the pagecache to use it

   3. Converts memremap to use it

  The page cache is the most complex and important user of the radix
  tree, so converting it was most important. Converting the memremap
  code removes the only other user of the multiorder code, which allows
  us to remove the radix tree code that supported it.

  I have 40+ followup patches to convert many other users of the radix
  tree over to the XArray, but I'd like to get this part in first. The
  other conversions haven't been in linux-next and aren't suitable for
  applying yet, but you can see them in the xarray-conv branch if you're
  interested"

* 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
  radix tree: Remove multiorder support
  radix tree test: Convert multiorder tests to XArray
  radix tree tests: Convert item_delete_rcu to XArray
  radix tree tests: Convert item_kill_tree to XArray
  radix tree tests: Move item_insert_order
  radix tree test suite: Remove multiorder benchmarking
  radix tree test suite: Remove __item_insert
  memremap: Convert to XArray
  xarray: Add range store functionality
  xarray: Move multiorder_check to in-kernel tests
  xarray: Move multiorder_shrink to kernel tests
  xarray: Move multiorder account test in-kernel
  radix tree test suite: Convert iteration test to XArray
  radix tree test suite: Convert tag_tagged_items to XArray
  radix tree: Remove radix_tree_clear_tags
  radix tree: Remove radix_tree_maybe_preload_order
  radix tree: Remove split/join code
  radix tree: Remove radix_tree_update_node_t
  page cache: Finish XArray conversion
  dax: Convert page fault handlers to XArray
  ...
2018-10-28 11:35:40 -07:00
Linus Torvalds
345671ea0f Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc things

 - ocfs2 updates

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits)
  hugetlbfs: dirty pages as they are added to pagecache
  mm: export add_swap_extent()
  mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
  tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE
  mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
  mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
  mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
  mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
  mm/gup: cache dev_pagemap while pinning pages
  Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
  mm: return zero_resv_unavail optimization
  mm: zero remaining unavailable struct pages
  tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option
  tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option
  tools/testing/selftests/vm/gup_benchmark.c: allow user specified file
  tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage
  mm/gup_benchmark.c: add additional pinning methods
  mm/gup_benchmark.c: time put_page()
  mm: don't raise MEMCG_OOM event due to failed high-order allocation
  mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
  ...
2018-10-26 19:33:41 -07:00
Johannes Weiner
4b85afbdac mm: zero-seek shrinkers
The page cache and most shrinkable slab caches hold data that has been
read from disk, but there are some caches that only cache CPU work, such
as the dentry and inode caches of procfs and sysfs, as well as the subset
of radix tree nodes that track non-resident page cache.

Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for
the shrinker's seeks setting tells the reclaim algorithm that for every
two page cache pages scanned it should scan one slab object.

This is a bogus setting.  A virtual inode that required no IO to create is
not twice as valuable as a page cache page; shadow cache entries with
eviction distances beyond the size of memory aren't either.

In most cases, the behavior in practice is still fine.  Such virtual
caches don't tend to grow and assert themselves aggressively, and usually
get picked up before they cause problems.  But there are scenarios where
that's not true.

Our database workloads suffer from two of those.  For one, their file
workingset is several times bigger than available memory, which has the
kernel aggressively create shadow page cache entries for the non-resident
parts of it.  The workingset code does tell the VM that most of these are
expendable, but the VM ends up balancing them 2:1 to cache pages as per
the seeks setting.  This is a huge waste of memory.

These workloads also deal with tens of thousands of open files and use
/proc for introspection, which ends up growing the proc_inode_cache to
absurdly large sizes - again at the cost of valuable cache space, which
isn't a reasonable trade-off, given that proc inodes can be re-created
without involving the disk.

This patch implements a "zero-seek" setting for shrinkers that results in
a target ratio of 0:1 between their objects and IO-backed caches.  This
allows such virtual caches to grow when memory is available (they do
cache/avoid CPU work after all), but effectively disables them as soon as
IO-backed objects are under pressure.

It then switches the shrinkers for procfs and sysfs metadata, as well as
excess page cache shadow nodes, to the new zero-seek setting.

Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Domas Mituzas <dmituzas@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:33 -07:00
Johannes Weiner
8508cf3ffa sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD
There are several definitions of those functions/macros in places that
mess with fixed-point load averages.  Provide an official version.

[akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Vlastimil Babka
61f94e18de mm, proc: add KReclaimable to /proc/meminfo
The vmstat NR_KERNEL_MISC_RECLAIMABLE counter is for kernel non-slab
allocations that can be reclaimed via shrinker.  In /proc/meminfo, we can
show the sum of all reclaimable kernel allocations (including slab) as
"KReclaimable".  Add the same counter also to per-node meminfo under /sys

With this counter, users will have more complete information about kernel
memory usage.  Non-slab reclaimable pages (currently just the ION
allocator) will not be missing from /proc/meminfo, making users wonder
where part of their memory went.  More precisely, they already appear in
MemAvailable, but without the new counter, it's not obvious why the value
in MemAvailable doesn't fully correspond with the sum of other counters
participating in it.

Link: http://lkml.kernel.org/r/20180731090649.16028-6-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Vlastimil Babka
2e03b4bc4a dcache: allocate external names from reclaimable kmalloc caches
We can use the newly introduced kmalloc-reclaimable-X caches, to allocate
external names in dcache, which will take care of the proper accounting
automatically, and also improve anti-fragmentation page grouping.

This effectively reverts commit f1782c9bc5 ("dcache: account external
names as indirectly reclaimable memory") and instead passes
__GFP_RECLAIMABLE to kmalloc().  The accounting thus moves from
NR_INDIRECTLY_RECLAIMABLE_BYTES to NR_SLAB_RECLAIMABLE, which is also
considered in MemAvailable calculation and overcommit decisions.

Link: http://lkml.kernel.org/r/20180731090649.16028-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Nicolas Pitre
7f2764cfbd cramfs: convert to use vmf_insert_mixed
cramfs is the only remaining user of vm_insert_mixed() and should be
converted to vmf_insert_mixed().

Based on a previous patch from Matthew Wilcox.

Link: http://lkml.kernel.org/r/nycvar.YSQ.7.76.1808290945450.10215@knanqh.ubzr
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>a
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Souptick Joarder
5780a02fd1 fs/iomap.c: change return type to vm_fault_t
Change iomap_page_mkwrite() return type to vm_fault_t.

see commit 1c8f422059 ("mm: change return type to vm_fault_t") for
reference.

Link: http://lkml.kernel.org/r/20180827172050.GA18673@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:18 -07:00
YueHaibing
867632d6a6 ocfs2: remove set but not used variable 'rb'
Fixes gcc '-Wunused-but-set-variable' warning:

fs/ocfs2/refcounttree.c: In function 'ocfs2_create_reflink_node':
fs/ocfs2/refcounttree.c:4138:31: warning:
 variable 'rb' set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/1536198443-113047-1-git-send-email-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:18 -07:00
Jia-Ju Bai
999865764f fs/ocfs2/dlm/dlmdebug.c: fix a sleep-in-atomic-context bug in dlm_print_one_mle()
The kernel module may sleep with holding a spinlock.

The function call paths (from bottom to top) in Linux-4.16 are:

[FUNC] get_zeroed_page(GFP_NOFS)
fs/ocfs2/dlm/dlmdebug.c, 332: get_zeroed_page in dlm_print_one_mle
fs/ocfs2/dlm/dlmmaster.c, 240: dlm_print_one_mle in __dlm_put_mle
fs/ocfs2/dlm/dlmmaster.c, 255: __dlm_put_mle in dlm_put_mle
fs/ocfs2/dlm/dlmmaster.c, 254: spin_lock in dlm_put_ml

[FUNC] get_zeroed_page(GFP_NOFS)
fs/ocfs2/dlm/dlmdebug.c, 332: get_zeroed_page in dlm_print_one_mle
fs/ocfs2/dlm/dlmmaster.c, 240: dlm_print_one_mle in __dlm_put_mle
fs/ocfs2/dlm/dlmmaster.c, 222: __dlm_put_mle in dlm_put_mle_inuse
fs/ocfs2/dlm/dlmmaster.c, 219: spin_lock in dlm_put_mle_inuse

To fix this bug, GFP_NOFS is replaced with GFP_ATOMIC.

This bug is found by my static analysis tool DSAC.

Link: http://lkml.kernel.org/r/20180901112528.27025-1-baijiaju1990@gmail.com
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:18 -07:00
Ding Xiang
0ae1c2dbdc ocfs2: remove unneeded null check
Null check for kfree is unnecessary, so remove it.

Link: http://lkml.kernel.org/r/1535704514-26559-1-git-send-email-dingxiang@cmss.chinamobile.com
Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:18 -07:00
Colin Ian King
2de24cb742 ocfs2: remove unused pointer 'eb'
Pointer 'eb' is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'eb' set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/20180828141907.10826-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:18 -07:00
Nathan Chancellor
32c1b90dcd ocfs2/dlm: remove unnecessary parentheses
Clang warns when more than one set of parentheses is used for a
single conditional statement:

fs/ocfs2/dlm/dlmthread.c:534:18: warning: equality comparison with extraneous
      parentheses [-Wparentheses-equality]
        if ((res->owner == dlm->node_num)) {
             ~~~~~~~~~~~^~~~~~~~~~~~~~~~
fs/ocfs2/dlm/dlmthread.c:534:18: note: remove extraneous parentheses around the
      comparison to silence this warning
        if ((res->owner == dlm->node_num)) {
            ~           ^               ~

Link: http://lkml.kernel.org/r/20180924181929.6853-1-natechancellor@gmail.com
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:18 -07:00
Christoph Hellwig
ae62c16e10 userfaultfd: disable irqs when taking the waitqueue lock
userfaultfd contains howe-grown locking of the waitqueue lock, and does
not disable interrupts.  This relies on the fact that no one else takes it
from interrupt context and violates an invariat of the normal waitqueue
locking scheme.  With aio poll it is easy to trigger other locks that
disable interrupts (or are called from interrupt context).

Link: http://lkml.kernel.org/r/20181018154101.18750-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>	[4.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:18 -07:00
Vlastimil Babka
fa76da461b mm: /proc/pid/smaps_rollup: fix NULL pointer deref in smaps_pte_range()
Leonardo reports an apparent regression in 4.19-rc7:

 BUG: unable to handle kernel NULL pointer dereference at 00000000000000f0
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP PTI
 CPU: 3 PID: 6032 Comm: python Not tainted 4.19.0-041900rc7-lowlatency #201810071631
 Hardware name: LENOVO 80UG/Toronto 4A2, BIOS 0XCN45WW 08/09/2018
 RIP: 0010:smaps_pte_range+0x32d/0x540
 Code: 80 00 00 00 00 74 a9 48 89 de 41 f6 40 52 40 0f 85 04 02 00 00 49 2b 30 48 c1 ee 0c 49 03 b0 98 00 00 00 49 8b 80 a0 00 00 00 <48> 8b b8 f0 00 00 00 e8 b7 ef ec ff 48 85 c0 0f 84 71 ff ff ff a8
 RSP: 0018:ffffb0cbc484fb88 EFLAGS: 00010202
 RAX: 0000000000000000 RBX: 0000560ddb9e9000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000560ddb9e9 RDI: 0000000000000001
 RBP: ffffb0cbc484fbc0 R08: ffff94a5a227a578 R09: ffff94a5a227a578
 R10: 0000000000000000 R11: 0000560ddbbe7000 R12: ffffe903098ba728
 R13: ffffb0cbc484fc78 R14: ffffb0cbc484fcf8 R15: ffff94a5a2e9cf48
 FS:  00007f6dfb683740(0000) GS:ffff94a5aaf80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00000000000000f0 CR3: 000000011c118001 CR4: 00000000003606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  __walk_page_range+0x3c2/0x6f0
  walk_page_vma+0x42/0x60
  smap_gather_stats+0x79/0xe0
  ? gather_pte_stats+0x320/0x320
  ? gather_hugetlb_stats+0x70/0x70
  show_smaps_rollup+0xcd/0x1c0
  seq_read+0x157/0x400
  __vfs_read+0x3a/0x180
  ? security_file_permission+0x93/0xc0
  ? security_file_permission+0x93/0xc0
  vfs_read+0x8f/0x140
  ksys_read+0x55/0xc0
  __x64_sys_read+0x1a/0x20
  do_syscall_64+0x5a/0x110
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Decoded code matched to local compilation+disassembly points to
smaps_pte_entry():

        } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
                                                        && pte_none(*pte))) {
                page = find_get_entry(vma->vm_file->f_mapping,
                                                linear_page_index(vma, addr));

Here, vma->vm_file is NULL.  mss->check_shmem_swap should be false in that
case, however for smaps_rollup, smap_gather_stats() can set the flag true
for one vma and leave it true for subsequent vma's where it should be
false.

To fix, reset the check_shmem_swap flag to false.  There's also related
bug which sets mss->swap to shmem_swapped, which in the context of
smaps_rollup overwrites any value accumulated from previous vma's.  Fix
that as well.

Note that the report suggests a regression between 4.17.19 and 4.19-rc7,
which makes the 4.19 series ending with commit 258f669e7e ("mm:
/proc/pid/smaps_rollup: convert to single value seq_file") suspicious.
But the mss was reused for rollup since 493b0e9d94 ("mm: add
/proc/pid/smaps_rollup") so let's play it safe with the stable backport.

Link: http://lkml.kernel.org/r/555fbd1f-4ac9-0b58-dcd4-5dc4380ff7ca@suse.cz
Link: https://bugzilla.kernel.org/show_bug.cgi?id=201377
Fixes: 493b0e9d94 ("mm: add /proc/pid/smaps_rollup")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Leonardo Soares Müller <leozinho29_eu@hotmail.com>
Tested-by: Leonardo Soares Müller <leozinho29_eu@hotmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:18 -07:00
Chengguang Xu
14fa085640 ovl: using posix_acl_xattr_size() to get size instead of posix_acl_to_xattr()
There is no functional change but it seems better to get size by calling
posix_acl_xattr_size() instead of calling posix_acl_to_xattr() with
NULL buffer argument. Additionally, remove unnecessary assignments.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-26 23:34:40 +02:00
Amir Goldstein
1e92e3072c ovl: abstract ovl_inode lock with a helper
The abstraction improves code readabilty (to some).

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-26 23:34:40 +02:00
Amir Goldstein
0e32992f7f ovl: remove the 'locked' argument of ovl_nlink_{start,end}
It just makes the interface strange without adding any significant value.
The only case where locked is false and return value is 0 is in
ovl_rename() when new is negative, so handle that case explicitly in
ovl_rename().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-26 23:34:40 +02:00
Amir Goldstein
9df085f3c9 ovl: relax requirement for non null uuid of lower fs
We use uuid to associate an overlay lower file handle with a lower layer,
so we can accept lower fs with null uuid as long as all lower layers with
null uuid are on the same fs.

This change allows enabling index and nfs_export features for the setup of
single lower fs of type squashfs - squashfs supports file handles, but has
a null uuid. This change also allows enabling index and nfs_export features
for nested overlayfs, where the lower overlay has nfs_export enabled.

Enabling the index feature with single lower squashfs fixes the
unionmount-testsuite test:
  ./run --ov --squashfs --verify

As a by-product, if, like the lower squashfs, upper fs also uses the
generic export_encode_fh() implementation to export 32bit inode file
handles (e.g. ext4), then the xino_auto config/module/mount option will
enable unique overlay inode numbers.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-26 23:34:40 +02:00
Miklos Szeredi
6b52243f63 ovl: fold copy-up helpers into callers
Now that the workdir and tmpfile copy up modes have been untagled, the
functions become simple enough that the helpers can be folded into the
callers.

Add new helpers where there is any duplication remaining: preparing creds
for creating the object.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-26 23:34:39 +02:00
Amir Goldstein
b10cdcdc20 ovl: untangle copy up call chain
In an attempt to dedup ~100 LOC, we ended up creating a tangled call chain,
whose branches merge and diverge in several points according to the
immutable c->tmpfile copy up mode.

This call chain was hard to analyse for locking correctness because the
locking requirements for the c->tmpfile flow were very different from the
locking requirements for the !c->tmpfile flow (i.e. directory vs.  regulare
file copy up).

Split the copy up helpers of the c->tmpfile flow from those of the
!c->tmpfile (i.e. workdir) flow and remove the c->tmpfile mode from copy up
context.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-26 23:34:39 +02:00
Miklos Szeredi
007ea44892 ovl: relax permission checking on underlying layers
Make permission checking more consistent:

 - special files don't need any access check on underling fs

 - exec permission check doesn't need to be performed on underlying fs

Reported-by: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-26 23:34:39 +02:00
Amir Goldstein
6cd078702f ovl: fix recursive oi->lock in ovl_link()
linking a non-copied-up file into a non-copied-up parent results in a
nested call to mutex_lock_interruptible(&oi->lock). Fix this by copying up
target parent before ovl_nlink_start(), same as done in ovl_rename().

~/unionmount-testsuite$ ./run --ov -s
~/unionmount-testsuite$ ln /mnt/a/foo100 /mnt/a/dir100/

 WARNING: possible recursive locking detected
 --------------------------------------------
 ln/1545 is trying to acquire lock:
 00000000bcce7c4c (&ovl_i_lock_key[depth]){+.+.}, at:
     ovl_copy_up_start+0x28/0x7d
 but task is already holding lock:
 0000000026d73d5b (&ovl_i_lock_key[depth]){+.+.}, at:
     ovl_nlink_start+0x3c/0xc1

[SzM: this seems to be a false positive, but doing the copy-up first is
harmless and removes the lockdep splat]

Reported-by: syzbot+3ef5c0d1a5cb0b21e6be@syzkaller.appspotmail.com
Fixes: 5f8415d6b8 ("ovl: persistent overlay inode nlink for...")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-26 23:34:39 +02:00
Amir Goldstein
8f97d1e991 vfs: fix FIGETBSZ ioctl on an overlayfs file
Some anon_bdev filesystems (e.g. overlayfs, ceph) don't have s_blocksize
set. Returning zero from FIGETBSZ ioctl results in a Floating point
exception from the e2fsprogs utility filefrag, which divides the size of
the file with the value returned by FIGETBSZ.

Fix the interface by returning -EINVAL for these filesystems.

Fixes: d1d04ef857 ("ovl: stack file ops")
Cc: <stable@vger.kernel.org> # v4.19
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-26 23:34:39 +02:00
Miklos Szeredi
1f244dc521 ovl: clean up error handling in ovl_get_tmpfile()
If security_inode_copy_up() fails, it should not set new_creds, so no need
for the cleanup (which would've Oops-ed anyway, due to old_creds being
NULL).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-26 23:34:39 +02:00
Amir Goldstein
babf4770be ovl: fix error handling in ovl_verify_set_fh()
We hit a BUG on kfree of an ERR_PTR()...

Reported-by: syzbot+ff03fe05c717b82502d0@syzkaller.appspotmail.com
Fixes: 8b88a2e640 ("ovl: verify upper root dir matches lower root dir")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-26 23:34:39 +02:00
Linus Torvalds
c7a2c49ea6 NFS client updates for Linux 4.20
Highlights include:
 
 Stable fixes:
 - Fix the NFSv4.1 r/wsize sanity checking
 - Reset the RPC/RDMA credit grant properly after a disconnect
 - Fix a missed page unlock after pg_doio()
 
 Features and optimisations:
 - Overhaul of the RPC client socket code to eliminate a locking bottleneck
   and reduce the latency when transmitting lots of requests in parallel.
 - Allow parallelisation of the RPCSEC_GSS encoding of an RPC request.
 - Convert the RPC client socket receive code to use iovec_iter() for
   improved efficiency.
 - Convert several NFS and RPC lookup operations to use RCU instead of
   taking global locks.
 - Avoid the need for BH-safe locks in the RPC/RDMA back channel.
 
 Bugfixes and cleanups:
 - Fix lock recovery during NFSv4 delegation recalls
 - Fix the NFSv4 + NFSv4.1 "lookup revalidate + open file" case.
 - Fixes for the RPC connection metrics
 - Various RPC client layer cleanups to consolidate stream based sockets
 - RPC/RDMA connection cleanups
 - Simplify the RPC/RDMA cleanup after memory operation failures
 - Clean ups for NFS v4.2 copy completion and NFSv4 open state reclaim.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJb0zW8AAoJEA4mA3inWBJcmccP/0hkeNFk2y4tErit1lq4TYDs
 sMkFv0rjhBkxWbZFmGJfAulbQ5cu+GwTBqqmhm67rE+2C+vevrE4JRfDFmcEGpio
 lE/2uJdqu1UlIOiovyjk0jMetUuf2LTS82vloPP/z5mmvgQ4S1NSajUGuPbjQR2S
 AtTj0XGI5e1nm8PZDftbomcxD5HUYaITQEDCyrm8a7xX8OZ5ySXakzdgXuNM5TgI
 MPjcpOFvIARwF4MhovYFZtSInB5XiZYSiTAB03deVgy38JDsSPeQgwUVWjErrq/K
 V/6kOg8EYd0uNFmUCwKX/ecbvAlnbfqAMX+YcL0ZrbVk0pBqxVvoGVXK8ex8Wbm1
 eL9tyYK81Sc7TliXr2+R22CHDcMTTMImFLix5Gp6mk2Fd5TpMydV9c9S7NBCHYB4
 rgcM9brgutFF6N8zqdBpa1FVH3cBE1A428/90kp4XU/kdQlxIvYBLBCylI25POEL
 7oqhcJxljFLWXZdhmH7t3WV0RWOzITZHEp9foL8p6yAPzOSWPF98OlQU+FmLj3Y4
 EZ61qLXIRxYpLf1aZh7GNKms5ZzOhKiZgw43UL3pl4xKhk2i9061IUKGSEHgIklk
 BX34dmCALDlapt+Ggcm1uIe9BLCc4KADfixqNfr91dSOycFM2RajsSZCPrP9Gx8G
 t8rYl8x+lLZ5ZxLkdTUP
 =Fn8z
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable fixes:
   - Fix the NFSv4.1 r/wsize sanity checking
   - Reset the RPC/RDMA credit grant properly after a disconnect
   - Fix a missed page unlock after pg_doio()

  Features and optimisations:
   - Overhaul of the RPC client socket code to eliminate a locking
     bottleneck and reduce the latency when transmitting lots of
     requests in parallel.
   - Allow parallelisation of the RPCSEC_GSS encoding of an RPC request.
   - Convert the RPC client socket receive code to use iovec_iter() for
     improved efficiency.
   - Convert several NFS and RPC lookup operations to use RCU instead of
     taking global locks.
   - Avoid the need for BH-safe locks in the RPC/RDMA back channel.

  Bugfixes and cleanups:
   - Fix lock recovery during NFSv4 delegation recalls
   - Fix the NFSv4 + NFSv4.1 "lookup revalidate + open file" case.
   - Fixes for the RPC connection metrics
   - Various RPC client layer cleanups to consolidate stream based
     sockets
   - RPC/RDMA connection cleanups
   - Simplify the RPC/RDMA cleanup after memory operation failures
   - Clean ups for NFS v4.2 copy completion and NFSv4 open state
     reclaim"

* tag 'nfs-for-4.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (97 commits)
  SUNRPC: Convert the auth cred cache to use refcount_t
  SUNRPC: Convert auth creds to use refcount_t
  SUNRPC: Simplify lookup code
  SUNRPC: Clean up the AUTH cache code
  NFS: change sign of nfs_fh length
  sunrpc: safely reallow resvport min/max inversion
  nfs: remove redundant call to nfs_context_set_write_error()
  nfs: Fix a missed page unlock after pg_doio()
  SUNRPC: Fix a compile warning for cmpxchg64()
  NFSv4.x: fix lock recovery during delegation recall
  SUNRPC: use cmpxchg64() in gss_seq_send64_fetch_and_inc()
  xprtrdma: Squelch a sparse warning
  xprtrdma: Clean up xprt_rdma_disconnect_inject
  xprtrdma: Add documenting comments
  xprtrdma: Report when there were zero posted Receives
  xprtrdma: Move rb_flags initialization
  xprtrdma: Don't disable BH's in backchannel server
  xprtrdma: Remove memory address of "ep" from an error message
  xprtrdma: Rename rpcrdma_qp_async_error_upcall
  xprtrdma: Simplify RPC wake-ups on connect
  ...
2018-10-26 13:05:26 -07:00
Linus Torvalds
033078a9af 3 smb3 fixes for stable, patches for improved debugging and perf gathering, and much improved performance for most metadata operations (expanded use of compounding)
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlvR8lcACgkQiiy9cAdy
 T1FNtgv/fpnMnf/4JPE40NgJ6CUcv4xsJ3bDzmezB5ZUgoNigtVeUMSBa8qCEBcg
 cdC243TOpwNGaWQ1yzRN4kyGq1cAE9B1xal4n7+xlii+ZpWXwrkOAiF27UTAIGTR
 ck3IfeS529QoQt9ReI4v+pWYKZOnlbWgF7iBflg0Snsz/JvICQ05wRA9VaXBJIz8
 Pwb3SDPCrON1KRJzJVDjC6AaYhZqu2VLbSV9fOhZ5WVcHb/t0EUqsFvgMzhk2+tv
 Rh+9zNzcQWyYI8KtYQmWMMoSk7F8OGlARWXW0ROfOoQwC70zg35F+tGUahlWsIYD
 19TLJy28g5Gqh0DZoPmtpNUdu1NCfy+vQcqaSNnAaQreMlqmV6ODxjvz3DeGL9lK
 Teo0V9dwWOZNFneFTpVsrWL4KQEZfDPDt1L6e3GOL5t6QLOZa5IuPVs8A9txqFCD
 kTAIQstESmXOrl+HpP64LcovV4BaD05st+fo7Cec16UDJjEqxCmHUSIYw3kFnCny
 4UAITp4V
 =q4Qs
 -----END PGP SIGNATURE-----

Merge tag '4.20-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs updates from Steve French:
 "Three smb3 fixes for stable, patches for improved debugging and perf
  gathering, and much improved performance for most metadata operations
  (expanded use of compounding)"

* tag '4.20-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: (46 commits)
  cifs: update internal module version number for cifs.ko to 2.14
  smb3: add debug for unexpected mid cancellation
  cifs: allow calling SMB2_xxx_free(NULL)
  smb3 - clean up debug output displaying network interfaces
  smb3: show number of current open files in /proc/fs/cifs/Stats
  cifs: add support for ioctl on directories
  cifs: fallback to older infolevels on findfirst queryinfo retry
  smb3: do not attempt cifs operation in smb3 query info error path
  smb3: send backup intent on compounded query info
  cifs: track writepages in vfs operation counters
  smb2: fix uninitialized variable bug in smb2_ioctl_query_info
  cifs: add IOCTL for QUERY_INFO passthrough to userspace
  cifs: minor clarification in comments
  CIFS: Print message when attempting a mount
  CIFS: Adds information-level logging function
  cifs: OFD locks do not conflict with eachothers
  CIFS: SMBD: Do not call ib_dereg_mr on invalidated memory registration
  CIFS: pass page offsets on SMB1 read/write
  fs/cifs: fix uninitialised variable warnings
  smb3: add tracepoint for sending lease break responses to server
  ...
2018-10-26 13:02:38 -07:00
Linus Torvalds
26873acacb Driver core patches for 4.20-rc1
Driver core patches for 4.20-rc1
 
 Here is a small number of driver core patches for 4.20-rc1.
 
 Not much happened here this merge window, only a very tiny number of
 patches that do:
 	- add BUS_ATTR_WO() for use by drivers
 	- component error path fixes
 	- kernfs range check fix
 	- other tiny error path fixes and const changes
 
 All of these have been in linux-next with no reported issues for a
 while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCW9Lhtw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykHTgCguaJ3SgRefuC/WijjqboTC/SikCoAnRVTUxfU
 v8BisSN22kR3jmxwsXud
 =/IvY
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is a small number of driver core patches for 4.20-rc1.

  Not much happened here this merge window, only a very tiny number of
  patches that do:

   - add BUS_ATTR_WO() for use by drivers

   - component error path fixes

   - kernfs range check fix

   - other tiny error path fixes and const changes

  All of these have been in linux-next with no reported issues for a
  while"

* tag 'driver-core-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  devres: provide devm_kstrdup_const()
  mm: move is_kernel_rodata() to asm-generic/sections.h
  devres: constify p in devm_kfree()
  driver core: add BUS_ATTR_WO() macro
  kernfs: Fix range checks in kernfs_get_target_path
  component: fix loop condition to call unbind() if bind() fails
  drivers/base/devtmpfs.c: don't pretend path is const in delete_path
  kernfs: update comment about kernfs_path() return value
2018-10-26 08:42:25 -07:00
Linus Torvalds
62606c224d Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Remove VLA usage
   - Add cryptostat user-space interface
   - Add notifier for new crypto algorithms

  Algorithms:
   - Add OFB mode
   - Remove speck

  Drivers:
   - Remove x86/sha*-mb as they are buggy
   - Remove pcbc(aes) from x86/aesni
   - Improve performance of arm/ghash-ce by up to 85%
   - Implement CTS-CBC in arm64/aes-blk, faster by up to 50%
   - Remove PMULL based arm64/crc32 driver
   - Use PMULL in arm64/crct10dif
   - Add aes-ctr support in s5p-sss
   - Add caam/qi2 driver

  Others:
   - Pick better transform if one becomes available in crc-t10dif"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (124 commits)
  crypto: chelsio - Update ntx queue received from cxgb4
  crypto: ccree - avoid implicit enum conversion
  crypto: caam - add SPDX license identifier to all files
  crypto: caam/qi - simplify CGR allocation, freeing
  crypto: mxs-dcp - make symbols 'sha1_null_hash' and 'sha256_null_hash' static
  crypto: arm64/aes-blk - ensure XTS mask is always loaded
  crypto: testmgr - fix sizeof() on COMP_BUF_SIZE
  crypto: chtls - remove set but not used variable 'csk'
  crypto: axis - fix platform_no_drv_owner.cocci warnings
  crypto: x86/aes-ni - fix build error following fpu template removal
  crypto: arm64/aes - fix handling sub-block CTS-CBC inputs
  crypto: caam/qi2 - avoid double export
  crypto: mxs-dcp - Fix AES issues
  crypto: mxs-dcp - Fix SHA null hashes and output length
  crypto: mxs-dcp - Implement sha import/export
  crypto: aegis/generic - fix for big endian systems
  crypto: morus/generic - fix for big endian systems
  crypto: lrw - fix rebase error after out of bounds fix
  crypto: cavium/nitrox - use pci_alloc_irq_vectors() while enabling MSI-X.
  crypto: cavium/nitrox - NITROX command queue changes.
  ...
2018-10-25 16:43:35 -07:00
Linus Torvalds
57ce66d39f Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull integrity updates from James Morris:
 "From Mimi: This contains a couple of bug fixes, including one for a
  recent problem with calculating file hashes on overlayfs, and some
  code cleanup"

* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  MAINTAINERS: add Jarkko as maintainer for trusted keys
  ima: open a new file instance if no read permissions
  ima: fix showing large 'violations' or 'runtime_measurements_count'
  security/integrity: remove unnecessary 'init_keyring' variable
  security/integrity: constify some read-only data
  vfs: require i_size <= SIZE_MAX in kernel_read_file()
2018-10-25 13:22:23 -07:00
Linus Torvalds
4ba9628fe5 Merge branch 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more ->lookup() cleanups from Al Viro:
 "Some ->lookup() instances are still overcomplicating the life
  for themselves, open-coding the stuff that would be handled by
  d_splice_alias() just fine.

  Simplify a couple of such cases caught this cycle and document
  d_splice_alias() intended use"

* 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Document d_splice_alias() calling conventions for ->lookup() users.
  simplify btrfs_lookup()
  clean erofs_lookup()
2018-10-25 12:55:31 -07:00
Linus Torvalds
ba7d4f36a2 Merge branch 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull compat_ioctl fixes from Al Viro:
 "A bunch of compat_ioctl fixes, mostly in bluetooth.

  Hopefully, most of fs/compat_ioctl.c will get killed off over the next
  few cycles; between this, tty series already merged and Arnd's work
  this cycle ought to take a good chunk out of the damn thing..."

* 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  hidp: fix compat_ioctl
  hidp: constify hidp_connection_add()
  cmtp: fix compat_ioctl
  bnep: fix compat_ioctl
  compat_ioctl: trim the pointless includes
2018-10-25 12:48:22 -07:00
Linus Torvalds
4dcb9239da Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timekeeping updates from Thomas Gleixner:
 "The timers and timekeeping departement provides:

   - Another large y2038 update with further preparations for providing
     the y2038 safe timespecs closer to the syscalls.

   - An overhaul of the SHCMT clocksource driver

   - SPDX license identifier updates

   - Small cleanups and fixes all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
  tick/sched : Remove redundant cpu_online() check
  clocksource/drivers/dw_apb: Add reset control
  clocksource: Remove obsolete CLOCKSOURCE_OF_DECLARE
  clocksource/drivers: Unify the names to timer-* format
  clocksource/drivers/sh_cmt: Add R-Car gen3 support
  dt-bindings: timer: renesas: cmt: document R-Car gen3 support
  clocksource/drivers/sh_cmt: Properly line-wrap sh_cmt_of_table[] initializer
  clocksource/drivers/sh_cmt: Fix clocksource width for 32-bit machines
  clocksource/drivers/sh_cmt: Fixup for 64-bit machines
  clocksource/drivers/sh_tmu: Convert to SPDX identifiers
  clocksource/drivers/sh_mtu2: Convert to SPDX identifiers
  clocksource/drivers/sh_cmt: Convert to SPDX identifiers
  clocksource/drivers/renesas-ostm: Convert to SPDX identifiers
  clocksource: Convert to using %pOFn instead of device_node.name
  tick/broadcast: Remove redundant check
  RISC-V: Request newstat syscalls
  y2038: signal: Change rt_sigtimedwait to use __kernel_timespec
  y2038: socket: Change recvmmsg to use __kernel_timespec
  y2038: sched: Change sched_rr_get_interval to use __kernel_timespec
  y2038: utimes: Rework #ifdef guards for compat syscalls
  ...
2018-10-25 11:14:36 -07:00
Jan Kara
721fb6fbfd fsnotify: Fix busy inodes during unmount
Detaching of mark connector from fsnotify_put_mark() can race with
unmounting of the filesystem like:

  CPU1				CPU2
fsnotify_put_mark()
  spin_lock(&conn->lock);
  ...
  inode = fsnotify_detach_connector_from_object(conn)
  spin_unlock(&conn->lock);
				generic_shutdown_super()
				  fsnotify_unmount_inodes()
				    sees connector detached for inode
				      -> nothing to do
				  evict_inode()
				    barfs on pending inode reference
  iput(inode);

Resulting in "Busy inodes after unmount" message and possible kernel
oops. Make fsnotify_unmount_inodes() properly wait for outstanding inode
references from detached connectors.

Note that the accounting of outstanding inode references in the
superblock can cause some cacheline contention on the counter. OTOH it
happens only during deletion of the last notification mark from an inode
(or during unlinking of watched inode) and that is not too bad. I have
measured time to create & delete inotify watch 100000 times from 64
processes in parallel (each process having its own inotify group and its
own file on a shared superblock) on a 64 CPU machine. Average and
standard deviation of 15 runs look like:

	Avg		Stddev
Vanilla	9.817400	0.276165
Fixed	9.710467	0.228294

So there's no statistically significant difference.

Fixes: 6b3f05d24d ("fsnotify: Detach mark from object list when last reference is dropped")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-25 15:49:19 +02:00
Linus Torvalds
5993692f09 Further restructure ext4 documentation; fix up ext4's delayed
allocation for bigalloc file systems; fix up some syzbot-detected
 races in EXT4_IOC_MOVE_EXT, EXT4_IOC_SWAP_BOOT, and ext4_remount; and
 a few other miscellaneous bugs and optimizations.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlvQYEcACgkQ8vlZVpUN
 gaOPYAgAh0BF7mTRnHAp/qkR5ZhDi3ecb3TpNlnpfzoDqQhPYETFisc18DD4HwTj
 wctwzSdYxYodeuPIK+R2bBzUy3FuSwtlER9cdr1ilcrUYPZHbir1rPPfTNb/oDGx
 WNcd/aulLjuU1eKDODowqMOF2HDchiJHqJqMBa+LfCHck1x/bt2uqdjNA5A1p5AV
 lp07DoXT54q5rWJDaXpbxTShWKhzHlRKbB9PKEvMHgPNl9sn5oRReRMKAW+WkT91
 e3mfy/GhzhugdWxYUg2oAn3dbqYkkAjW96WnBhCQHioW9ASphjl7yBi1LWh2aPA4
 haGxe5W3En8q678ZVtTVNJOyvbW81Q==
 =VgdS
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:

 - further restructure ext4 documentation

 - fix up ext4's delayed allocation for bigalloc file systems

 - fix up some syzbot-detected races in EXT4_IOC_MOVE_EXT,
   EXT4_IOC_SWAP_BOOT, and ext4_remount

 - ... and a few other miscellaneous bugs and optimizations.

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
  ext4: fix use-after-free race in ext4_remount()'s error path
  ext4: cache NULL when both default_acl and acl are NULL
  docs: promote the ext4 data structures book to top level
  docs: move ext4 administrative docs to admin-guide/
  jbd2: fix use after free in jbd2_log_do_checkpoint()
  ext4: propagate error from dquot_initialize() in EXT4_IOC_FSSETXATTR
  ext4: fix setattr project check in fssetxattr ioctl
  docs: make ext4 readme tables readable
  docs: fix ext4 documentation table formatting problems
  docs: generate a separate ext4 pdf file from the documentation
  ext4: convert fault handler to use vm_fault_t type
  ext4: initialize retries variable in ext4_da_write_inline_data_begin()
  ext4: fix EXT4_IOC_SWAP_BOOT
  ext4: fix build error when DX_DEBUG is defined
  ext4: fix argument checking in EXT4_IOC_MOVE_EXT
  ext4: fix reserved cluster accounting at page invalidation time
  ext4: adjust reserved cluster count when removing extents
  ext4: reduce reserved cluster count by number of allocated clusters
  ext4: fix reserved cluster accounting at delayed write time
  ext4: add new pending reservation mechanism
  ...
2018-10-24 17:42:24 +01:00
Linus Torvalds
d6edff78fe f2fs-for-4.20-rc1
In this round, we've added 1) superblock checksum feature, 2) implemented new
 mount option which we can disable/enable checkpoint to provide atomic updates of
 entire filesystem, 3) refactored quota operations to enhance its consistency
 along with checkpoint, 4) fixed subtle IO hang conditions and roll-forward
 recovery flow to resurrect any fsync'ed inode metadata.
 
 Enhancement:
  - add checksum to keep superblock contents more safe
  - add checkpoint=disable/enable to support A/B update of entire filesystem
  - use plug for readahead IO in readdir
  - add more IO counts to avoid block layer hacks
 
 Bug fix:
  - prevent data corruption issue for hardware encryption
  - fix IO hang issues when GC is heavily triggered
  - add missing up_read in __write_node_page
  - recover inode metadata during roll-forward recovery flow
  - fix null pointer dereference issue in wrongly configured discard map
 
 There are some more sanity checks and minor bug fixes as well.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlvPzt4ACgkQQBSofoJI
 UNJLfg/8Ch9TWfbeUEH+6ioJ4pdURHmb/gFOQRSGX8Nu+HkuPJQD8pqlheK7n5g7
 Pw3K6NLXHmL2d8xSNFbBwYmUSAeoXTc0J4nqX0sUJ6m7SKsuQ45Qe3A90faKEoAA
 ce7flWKVI+aJcGurBe99GOM69ptfyjb1w/8UGB0pcXUDq4oaRv5a1UtAAm92WF7H
 4/7jYD3ub5aeSynwe16wWR4B5aXJT0l0FcZYicI6IRY1mOjMtXuQt72AY+ffSzBt
 yQ6qb8OEl3xpfQZHHH00ZfvarkTBzXJGZwquiPX/CPzVcee8cOqPp+XZqN7CXBEr
 9ItezxYiUxOkKCl12Al8DynHZa6o2kEnWxgd49WkL/cNdInnvf5MD0kdfCV3KfQa
 CAR0UVe2yTg5mGLemtTSWveLdHfI7+LhDmURuXmoTUa9GWldw0413qqVVypcsizv
 QOAS86hSicrVK+bDnCA70i8Xxw7YEnAyrfCcgihU84NZSi7nTPUYj4xtMd9SzRnK
 JO8gA79D7lcWaxUS4r9I+JBDwWcfMQZRPS7PFbvoGWilIwsEaocCPYNgtjCTsAsK
 1fePqiF/265Q4lapmEhEjhuQSNH2xfJQZ4ux1OU+eS3OTDjbEAFBeVPZImh7Mo7F
 dkpXQwfcqAXPzOM4QAJ6hFX40D8SWMAlId6XGiIfJlrFmEUAxBk=
 =VDw4
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've added 1) superblock checksum feature, 2)
  implemented new mount option which we can disable/enable checkpoint to
  provide atomic updates of entire filesystem, 3) refactored quota
  operations to enhance its consistency along with checkpoint, 4) fixed
  subtle IO hang conditions and roll-forward recovery flow to resurrect
  any fsync'ed inode metadata.

  Enhancements:
   - add checksum to keep superblock contents more safe
   - add checkpoint=disable/enable to support A/B update of entire filesystem
   - use plug for readahead IO in readdir
   - add more IO counts to avoid block layer hacks

  Bug fixes:
   - prevent data corruption issue for hardware encryption
   - fix IO hang issues when GC is heavily triggered
   - add missing up_read in __write_node_page
   - recover inode metadata during roll-forward recovery flow
   - fix null pointer dereference issue in wrongly configured discard map

  There are some more sanity checks and minor bug fixes as well"

* tag 'f2fs-for-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (62 commits)
  f2fs: fix to keep project quota consistent
  f2fs: guarantee journalled quota data by checkpoint
  f2fs: cleanup dirty pages if recover failed
  f2fs: fix data corruption issue with hardware encryption
  f2fs: fix to recover inode->i_flags of inode block during POR
  f2fs: spread f2fs_set_inode_flags()
  f2fs: fix to spread clear_cold_data()
  Revert "f2fs: fix to clear PG_checked flag in set_page_dirty()"
  f2fs: account read IOs and use IO counts for is_idle
  f2fs: fix to account IO correctly for cgroup writeback
  f2fs: fix to account IO correctly
  f2fs: remove request_list check in is_idle()
  f2fs: allow to mount, if quota is failed
  f2fs: update REQ_TIME in f2fs_cross_rename()
  f2fs: do not update REQ_TIME in case of error conditions
  f2fs: remove unneeded disable_nat_bits()
  f2fs: remove unused sbi->trigger_ssr_threshold
  f2fs: shrink sbi->sb_lock coverage in set_file_temperature()
  f2fs: use rb_*_cached friends
  f2fs: fix to recover cold bit of inode block during POR
  ...
2018-10-24 17:39:36 +01:00
Linus Torvalds
fe0142df64 xfs: Changes for 4.20
- only support filesystems with unwritten extents
 - add definition for statfs XFS magic number
 - remove unused parameters around reflink code
 - more debug for dangling delalloc extents
 - cancel COW extents on extent swap targets
 - fix quota stats output and clean up the code
 - refactor some of the attribute code in preparation for parent pointers
 - fix several buffer handling bugs
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJbz7CiAAoJEK3oKUf0dfodz3YP/jRBYAMall0h5sJZt0FHdIb1
 y8bvDz3YaN3T9cFsSKC+2eRiDnxUwvX23Fm3rd3XhMZb0JkLwxX2Y4Tm+BMLeOtx
 JcoFvO6xVNFugYgLHvxNh0jGfLAPRYzd1KOkoGBiDWl3GYijrXCU9wdXukHmYq9k
 /JIfXjVabgmlLhOo1SnaXtqBOo140FjYAlL4h1sGBYxzeoRk/ltsXBnEzjwOJajY
 EYKDjBMoGtxSh38EoXPHHP0Pf89miTE+B3Y7wkR+sURG/cptt6WN1MFCOmEOAPLH
 RYpaZrFYLLTS4xvaDo0UMLuMTv72tabqEtjQ1Rj4bSPZaFb7QZVscpNkoFjuK2V7
 iVJUE3bqlCAGoHdnK22a6Mq1c7inOEG/GGDv+V+xfdOZFlWtaUNyjSmDtpmwxR5D
 0W8eSYaEYfvhQJ7I9066SWof8EtUdc8cc4P+hshego1DWzDF1TSpBEwwy2V+WUsC
 l4NJyLwjUPjfuD/MSUly9N7bIEzgLM5sh+aSBGNY87ODnWbRTzoBbNjl2NOCWv58
 2zT57WLlT7mBzQPE6yNpUGwrpubNEC5z+LzcQRfBedzx/Wh8XBnV/8Z6ETft3sO+
 v63i5e11ejZBUSR1TGf8dIzQonBroJo7Zwk9ghxNs+KIXmG/8vQHN9EuC9X4IUm7
 RD5X0oxgxuFBzrD3G9kD
 =twNy
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.20-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pul xfs updates from Dave Chinner:
 "There's not a huge amount of change in this cycle - Darrick has been
  out of action for a couple of months (hence me sending the last few
  pull requests), so we decided a quiet cycle mainly focussed on bug
  fixes was a good idea. Darrick will take the helm again at the end of
  this merge window.

  FYI, I may be sending another update later in the cycle - there's a
  pending rework of the clone/dedupe_file_range code that fixes numerous
  bugs that is spread amongst the VFS, XFS and ocfs2 code. It has been
  reviewed and tested, Al and I just need to work out the details of the
  merge, so it may come from him rather than me.

  Summary:

   - only support filesystems with unwritten extents

   - add definition for statfs XFS magic number

   - remove unused parameters around reflink code

   - more debug for dangling delalloc extents

   - cancel COW extents on extent swap targets

   - fix quota stats output and clean up the code

   - refactor some of the attribute code in preparation for parent
     pointers

   - fix several buffer handling bugs"

* tag 'xfs-4.20-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (21 commits)
  xfs: cancel COW blocks before swapext
  xfs: clear ail delwri queued bufs on unmount of shutdown fs
  xfs: use offsetof() in place of offset macros for __xfsstats
  xfs: Fix xqmstats offsets in /proc/fs/xfs/xqmstat
  xfs: fix use-after-free race in xfs_buf_rele
  xfs: Add attibute remove and helper functions
  xfs: Add attibute set and helper functions
  xfs: Add helper function xfs_attr_try_sf_addname
  xfs: Move fs/xfs/xfs_attr.h to fs/xfs/libxfs/xfs_attr.h
  xfs: issue log message on user force shutdown
  xfs: fix buffer state management in xrep_findroot_block
  xfs: always assign buffer verifiers when one is provided
  xfs: xrep_findroot_block should reject root blocks with siblings
  xfs: add a define for statfs magic to uapi
  xfs: print dangling delalloc extents
  xfs: fix fork selection in xfs_find_trim_cow_extent
  xfs: remove the unused trimmed argument from xfs_reflink_trim_around_shared
  xfs: remove the unused shared argument to xfs_reflink_reserve_cow
  xfs: handle zeroing in xfs_file_iomap_begin_delay
  xfs: remove suport for filesystems without unwritten extent flag
  ...
2018-10-24 17:36:12 +01:00
Linus Torvalds
bfd93a87ea We've got 18 patches for this merge window, none of which are very major.
1. Andreas Gruenbacher contributed several patches to clean up the gfs2
    block allocator to prepare for future performance enhancements.
 2. Andy Price contributed a patch to fix a use-after-free problem.
 3. I contributed some patches that fix gfs2's broken rgrplvb mount option.
 4. I contributed some cleanup patches and error message improvements.
 5. Steve Whitehouse and Abhi Das sent a patch to enable getlabel support.
 6. Tim Smith contributed a patch to flush the glock delete workqueue at exit.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJbz3QlAAoJENeLYdPf93o7+VUH/0XUfbyNQsU6fyLP8NrZq05z
 qNsVN3Hm+tPc0/V0C75lSp9ej7B8ogMl0RPysniziRTEWIDK6oGB/JUGvIH0O/z1
 vZ/sofZEDXthV3YjiI8RXLcaJLsOavSXnGwHbNKohM2PdRObVkZbaUL+xWlL9X3q
 yHgP5AHCIrpVzz5l4sLO6N0Npnl0aNRTBxPIyDTaBBmitXkvtqkCbkw185jzkDDs
 fMvZ6I+3UbUxp99InFTHeUXvTr1EbvfPrhZzmppuV1N4LLSa1eRaWmsKTEDPdnsy
 uhsh9ittv8EJXN2dpmZIOdGmDEK07kFoZrsbM5F78sOH/LbUyJ5YfBN02lDaaAI=
 =b7zk
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-4.20.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Bob Peterson:
 "We've got 18 patches for this merge window, none of which are very
  major:

   - clean up the gfs2 block allocator to prepare for future performance
     enhancements (Andreas Gruenbacher)

   - fix a use-after-free problem (Andy Price)

   - patches that fix gfs2's broken rgrplvb mount option (me)

   - cleanup patches and error message improvements (me)

   - enable getlabel support (Steve Whitehouse and Abhi Das)

   - flush the glock delete workqueue at exit (Tim Smith)"

* tag 'gfs2-4.20.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix minor typo: couln't versus couldn't.
  gfs2: write revokes should traverse sd_ail1_list in reverse
  gfs2: Pass resource group to rgblk_free
  gfs2: Remove unnecessary gfs2_rlist_alloc parameter
  gfs2: Fix marking bitmaps non-full
  gfs2: Fix some minor typos
  gfs2: Rename bitmap.bi_{len => bytes}
  gfs2: Remove unused RGRP_RSRV_MINBYTES definition
  gfs2: Move rs_{sizehint, rgd_gh} fields into the inode
  gfs2: Clean up out-of-bounds check in gfs2_rbm_from_block
  gfs2: Always check the result of gfs2_rbm_from_block
  gfs2: getlabel support
  GFS2: Flush the GFS2 delete workqueue before stopping the kernel threads
  gfs2: Don't leave s_fs_info pointing to freed memory in init_sbd
  gfs2: Use fs_* functions instead of pr_* function where we can
  gfs2: slow the deluge of io error messages
  gfs2: Don't set GFS2_RDF_UPTODATE when the lvb is updated
  gfs2: improve debug information when lvb mismatches are found
2018-10-24 17:30:39 +01:00
Linus Torvalds
e1cbbf4067 orangefs: fixes and a cleanup
fixes:
  + fix superfluous service_operation return code check in orangefs_lookup
  + fix some error code paths that missed kmem_cache_free
  + don't let orangefs_iget return NULL
  + don't let orangefs_new_inode return NULL
  + cache NULL when both default_acl and acl are NULL
 
 cleanup:
  + rate limit the client not running info message
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbzzyfAAoJEM9EDqnrzg2+gfQP/1HC41/2Xg+8NFzxASN4Cd6f
 6MJGGG3yfvq2PO+APQFG4SQNtFQO+8CDB3rIehCBQxKhHcjanr5ejzpoALaFcgEu
 VC1Pw4cZcKixxJyRf9LZChI2uzfFTVn3pna5y1ABUZA7r+LTvg/oMXkJH9BS03Xk
 r0onRKd0/nsde4dhnNlFizQuSmddzvObeZdAgqL4QWzh2J1Zs7ehtqcGu8JTDK2G
 nVm+tXyqllQmigk/blhE6lydxrQQt0w95+DlUf6x+PmH5pp7MKqi5kzelbo1ND/9
 BfSK4AvCdJhRjQNValop9Pafu55sj5RZEoG/GOSCvU3bdghoQi1mtuSVCEUblvR6
 EOJWd31y1Shk+XtVOFUNwo1jk/FhZOkGZNBY4xYjMmTUA3np4rpB2HMEkt4Sy/EO
 cOnS4CoB4Wc/36ZAGAfthNhMH66igMBdA7acDx91DeCFzBPn7SmstVDDVj6rdcGr
 MfyzvQaYooqHdWF3PzK97EsQW7ZXk8YPpUCbrF6+cxYfZN4DT4qJzKImXRHMLTiV
 qbUVtgYyzqa6caBHpVphq3evA0//vk4qK+fuIFZs/+cFZPBNtJNhye9q87DftcMx
 b8SiSNjBL4Q+/DGD6kvLwuX53PfxVAgllbk0Ncql2aSvCTvOirfpYbvM55RDapuJ
 dOuvR1rhzsP93t+dwRP4
 =iVPX
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.20-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs updates from Mike Marshall:
 "Fixes and a cleanup.

  Fixes:
   - fix superfluous service_operation return code check in
     orangefs_lookup
   - fix some error code paths that missed kmem_cache_free
   - don't let orangefs_iget return NULL
   - don't let orangefs_new_inode return NULL
   - cache NULL when both default_acl and acl are NULL

 Cleanup:
   - rate limit the client not running info message"

* tag 'for-linus-4.20-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: no need to check for service_operation returns > 0
  orangefs: some error code paths missed kmem_cache_free
  orangefs: don't let orangefs_iget return NULL.
  orangefs: don't let orangefs_new_inode return NULL
  orangefs: rate limit the client not running info message
  orangefs: cache NULL when both default_acl and acl are NULL
2018-10-24 17:28:03 +01:00
Linus Torvalds
6b609e3b00 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro.

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  gfs2_meta: ->mount() can get NULL dev_name
  ecryptfs_rename(): verify that lower dentries are still OK after lock_rename()
  cachefiles: fix the race between cachefiles_bury_object() and rmdir(2)
2018-10-24 17:24:04 +01:00
Linus Torvalds
deba28b12b Just a few small fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAlvPJVQACgkQNqiEXrVA
 jGQgUQ//QEMIcY3LZoHSUGfcI2Il4oBwLBnE1604FFNuhbBPiceVXcLIgPMeAXDt
 fMFaujgM/KEa+uj1tWdCD7kGiqlJo6SWzHVaF1+jwD+p2TZda0QxqAEa98lC07wb
 QrS+VGF0EFxYWmuJryB25KhhIDQjy0ePK78rXieNGcb9MaKsXxPeT2zgRFMv550v
 +XnbPzf2vpuNc5xh0l8ceNRETfa/SJDupWVqt7s7P3UrARiFPsVNSBqiRnHRLhS6
 HKQcuhVZHgphS5b3f8F75Opie2r/TZxFs7AaPHc3/W95bDVSHjXzojzn8/NVdN0e
 aLDEhhlqgx0fAC46TanxqQmnTYyD67LdMLQokU99ia60WdPdVhq3FvX9jX6EwRgj
 pQg1CE3RD3QCUqvMm0h/a/JmlegecrwFHeoE77DFj151s5/ceuT7yYrWHWUonfVs
 FwbuX1bWWd5zOKLifpf6tkXtTitdacUAxxBXHIFgaHL5HNS4vNtf9uTn3QKO2FHN
 ZVzAbsbLyShGRIdP0Z9Jf98m6HIFwKx4qQEDHgMbM9h+oVROiJIada1OWYxAK4yH
 nx9B8+mya0ZU4ozfDCL1JTdGMKD/rfjonDve1XCO5AVn8L+lxvU/SUbFZQr/UjeI
 Iw/xSOtFq/1Ma/K//gGNSFMhukDxtOD3PP8lJNSVXet4/wkag30=
 =hDy2
 -----END PGP SIGNATURE-----

Merge tag 'jfs-for-4.20' of git://github.com/kleikamp/linux-shaggy

Pull jfs updates from David Kleikamp:
 "Just a few small fixes"

* tag 'jfs-for-4.20' of git://github.com/kleikamp/linux-shaggy:
  jfs: remove redundant dquot_initialize() in jfs_evict_inode()
  jfs: remove quota option from ignore list
  jfs: cache NULL when both default_acl and acl are NULL
2018-10-24 17:22:16 +01:00
Linus Torvalds
318b067a5d for-4.20-part1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlvN/4kACgkQxWXV+ddt
 WDtA0Q//fIKXjjxDLa/EM4fz63urfczoxq/nVRVXveoTtAjT+QOTIY0iRblLhKQg
 ehy86ygYmwfnBjCzZRTIjtvFJtt93kWH3VtqGAOMcvA/sMGP68CDn8STyIaCsw2o
 UN0qEIYcQ5wj5aJlJhlWZgx9bvjK3xkMuNO/uPWz0M0OL+KTk6O265FjIot0M6Pa
 XsihJCn9ZrWL9peevGDUXdoPuXQKaU8aW47qe+989Ya0oPD11Knpn75J+t3U07/h
 sty6pFIQMCBSKOjXnCKhysv3wSyewzRyMznVPz6EhRfzn7od5GLmvG7cANiWcV6K
 jep5Hd7cJq/BeIwpTSAxh+ygxbce8EGIm9NyUPAPXDPtfgMv1Zf2QLEPZhi729xk
 OpSF+eOGuRdSCur7ng629LqOANjb+D1939QKrzIwO5SC3xSZ5Ht258IWcJ+FlDNz
 Cfxk8b3rWhsS9qSSmiq3kdQ3ECEOzFBYx7d8m2FZ2z3mViA1VdIROHhX97VCAcVu
 X1dq5kUycyioGak86Ce/cexpmkwcwf12ypCZWz7tL+pfgDKcAKFHjA0vNLzYjZEy
 mxHZijjJXrg8vWTqMCqgDhIDQVzmaOjFo7upXc+5MsL8sbC71V38QvTZ3F6jcnAa
 kmiE8lDRQmnM38/5U1EEjROHpBaRLbpSfZvQcfSOGNkaycb2cFI=
 =g5+b
 -----END PGP SIGNATURE-----

Merge tag 'for-4.20-part1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "This is the first batch with fixes and some nice performance
  improvements.

  Preliminary results show eg. more files/sec in fsmark, better perf on
  multi-threaded workloads (filebench, dbench), fewer context switches
  and overall better memory allocation characteristics (multiple
  benchmarks).

  Apart from general performance, there's an improvement for qgroups +
  balance workload that's been troubling our users.

  Note for stable: there are 20+ patches tagged for stable, out of 90.
  Not all of them apply cleanly on all stable versions but the conflicts
  are mostly due to simple cleanups and resolving should be obvious. The
  fixes are otherwise independent.

  Performance improvements:

   - transition between blocking and spinning modes of path is gone,
     which originally resulted to more unnecessary wakeups and updates
     to the path locks, the effects are measurable and improve latency
     and scalability

   - qgroups: first batch of changes that should speedup balancing with
     qgroups on, skip quota accounting on unchanged subtrees, overall
     gain is about 30+% in runtime

   - use rb-tree with cached first node for several structures, small
     improvement to avoid pointer chasing

  Fixes:

   - trim
      - fix: some blockgroups could have been missed if their logical
        address was past the total filesystem size (ie. after a lot of
        balancing)
      - better error reporting, after processing blockgroups and whole
        device
      - fix: continue trimming block groups after an error is
        encountered
      - check for trim support of the device earlier and avoid some
        unnecessary work
      - less interaction with transaction commit that improves latency
        on slower storage (eg. image files over NFS)

   - fsync
      - fix warning when replaying log after fsync of a O_TMPFILE
      - fix wrong dentries after fsync of file that got its parent
        replaced

   - qgroups: fix rescan that might misc some dirty groups

   - don't clean dirty pages during buffered writes, this could lead to
     lost updates in some corner cases

   - some block groups could have been delayed in creation, if the
     allocation triggered another one

   - error handling improvements

  Cleanups:

   - removed unused struct members and variables

   - function return type cleanups

   - delayed refs code refactoring

   - protect against deadlock that could be caused by crafted image that
     tries to allocate from a tree that's locked already"

* tag 'for-4.20-part1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (93 commits)
  btrfs: switch return_bigger to bool in find_ref_head
  btrfs: remove fs_info from btrfs_should_throttle_delayed_refs
  btrfs: remove fs_info from btrfs_check_space_for_delayed_refs
  btrfs: delayed-ref: pass delayed_refs directly to btrfs_delayed_ref_lock
  btrfs: delayed-ref: pass delayed_refs directly to btrfs_select_ref_head
  btrfs: qgroup: move the qgroup->members check out from (!qgroup)'s else branch
  btrfs: relocation: Remove redundant tree level check
  btrfs: relocation: Cleanup while loop using rbtree_postorder_for_each_entry_safe
  btrfs: qgroup: Avoid calling qgroup functions if qgroup is not enabled
  Btrfs: fix wrong dentries after fsync of file that got its parent replaced
  Btrfs: fix warning when replaying log after fsync of a tmpfile
  btrfs: drop min_size from evict_refill_and_join
  btrfs: assert on non-empty delayed iputs
  btrfs: make sure we create all new block groups
  btrfs: reset max_extent_size on clear in a bitmap
  btrfs: protect space cache inode alloc with GFP_NOFS
  btrfs: release metadata before running delayed refs
  Btrfs: kill btrfs_clear_path_blocking
  btrfs: dev-replace: remove pointless assert in write unlock
  btrfs: dev-replace: move replace members out of fs_info
  ...
2018-10-24 17:15:26 +01:00
Linus Torvalds
44adbac8f7 Merge branch 'work.tty-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull tty ioctl updates from Al Viro:
 "This is the compat_ioctl work related to tty ioctls.

  Quite a bit of dead code taken out, all tty-related stuff gone from
  fs/compat_ioctl.c. A bunch of compat bugs fixed - some still remain,
  but all more or less generic tty-related ioctls should be covered
  (remaining issues are in things like driver-private ioctls in a pcmcia
  serial card driver not getting properly handled in 32bit processes on
  64bit host, etc)"

* 'work.tty-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (53 commits)
  kill TIOCSERGSTRUCT
  change semantics of ldisc ->compat_ioctl()
  kill TIOCSER[SG]WILD
  synclink_gt(): fix compat_ioctl()
  pty: fix compat ioctls
  compat_ioctl - kill keyboard ioctl handling
  gigaset: add ->compat_ioctl()
  vt_compat_ioctl(): clean up, use compat_ptr() properly
  gigaset: don't try to printk userland buffer contents
  dgnc: don't bother with (empty) stub for TCXONC
  dgnc: leave TIOC[GS]SOFTCAR to ldisc
  remove fallback to drivers for TIOCGICOUNT
  dgnc: break-related ioctls won't reach ->ioctl()
  kill the rest of tty COMPAT_IOCTL() entries
  dgnc: TIOCM... won't reach ->ioctl()
  isdn_tty: TCSBRK{,P} won't reach ->ioctl()
  kill capinc_tty_ioctl()
  take compat TIOC[SG]SERIAL treatment into tty_compat_ioctl()
  synclink: reduce pointless checks in ->ioctl()
  complete ->[sg]et_serial() switchover
  ...
2018-10-24 14:43:41 +01:00
Linus Torvalds
08ffb584d9 pstore improvements:
- refactor init to happen as early as possible again (Joel Fernandes)
 - improve resource reservation names
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlvN3UwWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJkiZD/0Xx72AvLGBOBMmnTm1cP+p8A6k
 wLG4ThW5Hg7ArQ5RSsADFr2jidIFFyq6I7k0U5oj4E/hS9chbNQjvbzXCaNbkl5O
 TYy7usATrjLcR6ivGFKM1eTuN9rFb7zaWKkh08ORf5+aP/yS0yezdLSbGqHiJyas
 MJ/HvFRPeN6tqd6qyDme7WkOrdGyGWSs3VV44izvBqo4Ub7JFRmjegJOhyEh0TRf
 jobpkuEw0EzTiVqDyIBtqJdhZRiWzScS5gwNi0L6QOlsnnRoAVEYGKhBMEhLCtBx
 nUDZdaC0FhsjRXdqbt08ylQ8bRU6xKWLvKrQ4xdbDwFC4oI8H+ZVg0YUfhp3juH8
 wlvo1MoHJJryDQCTrqvW4KY8Hkz3uF5vE8KoEo6wX2+o9mRw+H/ArCL1pMQ15eIH
 3yPESbkSW/SOOehFcFp2IosqE2XrflzJLQ1IRgoe/E7rO99Kpp9INZZMT0jNtoHx
 2E/u6DpCPrQk+5ko+we/jfu4P2SoctpLSnN87O5mI9SD7fjpBOle1y0vo/gUEYsL
 0mB165FdP7Qjqc+vqDT3VxyY/44ZEZI0kJYyE7k0nLkEijSagLyI750qpyB4DN95
 Y10sPrDFICyhC7N+uOTGG/Ey4mIdpp6tiWsPbF9TLewdsM3EfvkzmYPSWUYaEDp3
 MCZ2680KUHdMHPidBA==
 =fe5o
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore updates from Kees Cook:
 "pstore improvements:

   - refactor init to happen as early as possible again (Joel Fernandes)

   - improve resource reservation names"

* tag 'pstore-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore/ram: Clarify resource reservation labels
  pstore: Refactor compression initialization
  pstore: Allocate compression during late_initcall()
  pstore: Centralize init/exit routines
2018-10-24 14:42:02 +01:00
Steve French
38f876bb2d cifs: update internal module version number for cifs.ko to 2.14
Update version reported in "modinfo cifs"

Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-24 07:22:02 -05:00
Steve French
43de1db364 smb3: add debug for unexpected mid cancellation
We have hit this intermittently, increase the verbosity of
warning message on unexpected mid cancellation.

Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-24 07:22:02 -05:00
Ronnie Sahlberg
32a1fb36f6 cifs: allow calling SMB2_xxx_free(NULL)
Change these free functions to allow passing NULL as the argument and
treat it as a no-op just like free(NULL) would.
Or, if rqst->rq_iov is NULL.

The second scenario could happen for smb2_queryfs() if the call
to SMB2_query_info_init() fails and we go to qfs_exit to clean up
and free all resources.
In that case we have not yet assigned rqst[2].rq_iov and thus
the rq_iov dereference in SMB2_close_free() will cause a NULL pointer
dereference.

Fixes:  1eb9fb5204 ("cifs: create SMB2_open_init()/SMB2_open_free() helpers")

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
2018-10-24 07:21:41 -05:00
Linus Torvalds
ba9f6f8954 Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull siginfo updates from Eric Biederman:
 "I have been slowly sorting out siginfo and this is the culmination of
  that work.

  The primary result is in several ways the signal infrastructure has
  been made less error prone. The code has been updated so that manually
  specifying SEND_SIG_FORCED is never necessary. The conversion to the
  new siginfo sending functions is now complete, which makes it
  difficult to send a signal without filling in the proper siginfo
  fields.

  At the tail end of the patchset comes the optimization of decreasing
  the size of struct siginfo in the kernel from 128 bytes to about 48
  bytes on 64bit. The fundamental observation that enables this is by
  definition none of the known ways to use struct siginfo uses the extra
  bytes.

  This comes at the cost of a small user space observable difference.
  For the rare case of siginfo being injected into the kernel only what
  can be copied into kernel_siginfo is delivered to the destination, the
  rest of the bytes are set to 0. For cases where the signal and the
  si_code are known this is safe, because we know those bytes are not
  used. For cases where the signal and si_code combination is unknown
  the bits that won't fit into struct kernel_siginfo are tested to
  verify they are zero, and the send fails if they are not.

  I made an extensive search through userspace code and I could not find
  anything that would break because of the above change. If it turns out
  I did break something it will take just the revert of a single change
  to restore kernel_siginfo to the same size as userspace siginfo.

  Testing did reveal dependencies on preferring the signo passed to
  sigqueueinfo over si->signo, so bit the bullet and added the
  complexity necessary to handle that case.

  Testing also revealed bad things can happen if a negative signal
  number is passed into the system calls. Something no sane application
  will do but something a malicious program or a fuzzer might do. So I
  have fixed the code that performs the bounds checks to ensure negative
  signal numbers are handled"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
  signal: Guard against negative signal numbers in copy_siginfo_from_user32
  signal: Guard against negative signal numbers in copy_siginfo_from_user
  signal: In sigqueueinfo prefer sig not si_signo
  signal: Use a smaller struct siginfo in the kernel
  signal: Distinguish between kernel_siginfo and siginfo
  signal: Introduce copy_siginfo_from_user and use it's return value
  signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
  signal: Fail sigqueueinfo if si_signo != sig
  signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
  signal/unicore32: Use force_sig_fault where appropriate
  signal/unicore32: Generate siginfo in ucs32_notify_die
  signal/unicore32: Use send_sig_fault where appropriate
  signal/arc: Use force_sig_fault where appropriate
  signal/arc: Push siginfo generation into unhandled_exception
  signal/ia64: Use force_sig_fault where appropriate
  signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
  signal/ia64: Use the generic force_sigsegv in setup_frame
  signal/arm/kvm: Use send_sig_mceerr
  signal/arm: Use send_sig_fault where appropriate
  signal/arm: Use force_sig_fault where appropriate
  ...
2018-10-24 11:22:39 +01:00
Linus Torvalds
50b825d7e8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Add VF IPSEC offload support in ixgbe, from Shannon Nelson.

 2) Add zero-copy AF_XDP support to i40e, from Björn Töpel.

 3) All in-tree drivers are converted to {g,s}et_link_ksettings() so we
    can get rid of the {g,s}et_settings ethtool callbacks, from Michal
    Kubecek.

 4) Add software timestamping to veth driver, from Michael Walle.

 5) More work to make packet classifiers and actions lockless, from Vlad
    Buslov.

 6) Support sticky FDB entries in bridge, from Nikolay Aleksandrov.

 7) Add ipv6 version of IP_MULTICAST_ALL sockopt, from Andre Naujoks.

 8) Support batching of XDP buffers in vhost_net, from Jason Wang.

 9) Add flow dissector BPF hook, from Petar Penkov.

10) i40e vf --> generic iavf conversion, from Jesse Brandeburg.

11) Add NLA_REJECT netlink attribute policy type, to signal when users
    provide attributes in situations which don't make sense. From
    Johannes Berg.

12) Switch TCP and fair-queue scheduler over to earliest departure time
    model. From Eric Dumazet.

13) Improve guest receive performance by doing rx busy polling in tx
    path of vhost networking driver, from Tonghao Zhang.

14) Add per-cgroup local storage to bpf

15) Add reference tracking to BPF, from Joe Stringer. The verifier can
    now make sure that references taken to objects are properly released
    by the program.

16) Support in-place encryption in TLS, from Vakul Garg.

17) Add new taprio packet scheduler, from Vinicius Costa Gomes.

18) Lots of selftests additions, too numerous to mention one by one here
    but all of which are very much appreciated.

19) Support offloading of eBPF programs containing BPF to BPF calls in
    nfp driver, frm Quentin Monnet.

20) Move dpaa2_ptp driver out of staging, from Yangbo Lu.

21) Lots of u32 classifier cleanups and simplifications, from Al Viro.

22) Add new strict versions of netlink message parsers, and enable them
    for some situations. From David Ahern.

23) Evict neighbour entries on carrier down, also from David Ahern.

24) Support BPF sk_msg verdict programs with kTLS, from Daniel Borkmann
    and John Fastabend.

25) Add support for filtering route dumps, from David Ahern.

26) New igc Intel driver for 2.5G parts, from Sasha Neftin et al.

27) Allow vxlan enslavement to bridges in mlxsw driver, from Ido
    Schimmel.

28) Add queue and stack map types to eBPF, from Mauricio Vasquez B.

29) Add back byte-queue-limit support to r8169, with all the bug fixes
    in other areas of the driver it works now! From Florian Westphal and
    Heiner Kallweit.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2147 commits)
  tcp: add tcp_reset_xmit_timer() helper
  qed: Fix static checker warning
  Revert "be2net: remove desc field from be_eq_obj"
  Revert "net: simplify sock_poll_wait"
  net: socionext: Reset tx queue in ndo_stop
  net: socionext: Add dummy PHY register read in phy_write()
  net: socionext: Stop PHY before resetting netsec
  net: stmmac: Set OWN bit for jumbo frames
  arm64: dts: stratix10: Support Ethernet Jumbo frame
  tls: Add maintainers
  net: ethernet: ti: cpsw: unsync mcast entries while switch promisc mode
  octeontx2-af: Support for NIXLF's UCAST/PROMISC/ALLMULTI modes
  octeontx2-af: Support for setting MAC address
  octeontx2-af: Support for changing RSS algorithm
  octeontx2-af: NIX Rx flowkey configuration for RSS
  octeontx2-af: Install ucast and bcast pkt forwarding rules
  octeontx2-af: Add LMAC channel info to NIXLF_ALLOC response
  octeontx2-af: NPC MCAM and LDATA extract minimal configuration
  octeontx2-af: Enable packet length and csum validation
  octeontx2-af: Support for VTAG strip and capture
  ...
2018-10-24 06:47:44 +01:00
Steve French
35a9080723 smb3 - clean up debug output displaying network interfaces
Make the output of /proc/fs/cifs/DebugData a little easier to
read by cleaning up the listing of network interfaces removing
a wasted line break.

Here is a comparison of the network interface information
that from be viewed at the end of output from

     "cat /proc/fs/cifs/DebugData"

Before:

	Server interfaces: 8
	0)
		Speed: 10000000000 bps
		Capabilities: rss
		IPv6: fe80:0000:0000:0000:2cf5:407e:84b0:21dd
	1)
		Speed: 1000000000 bps
		Capabilities:
		IPv6: fe80:0000:0000:0000:61cd:6147:3d0c:f484

vs. after:

	Server interfaces: 11
	0)	Speed: 10000000000 bps
		Capabilities: rss
		IPv6: fe80:0000:0000:0000:2cf5:407e:84b0:21dd
	1)	Speed: 2000000000 bps
		Capabilities:
		IPv6: fe80:0000:0000:0000:3d76:2d05:dcf8:ed10

Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:06 -05:00
Steve French
fae8044c03 smb3: show number of current open files in /proc/fs/cifs/Stats
To allow better debugging (for example applications with
handle leaks, or complex reconnect scenarios) display the
number of open files (on the client) and number of open
server file handles for each tcon in /proc/fs/cifs/Stats.
Note that open files on server is one larger than local
due to handle caching (in this case of the root of
the share).  In this example there are two local
open files, and three (two file and one directory handle)
open on the server.

Sample output:

$ cat /proc/fs/cifs/Stats
Resources in use
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Operations (MIDs): 0

0 session 0 share reconnects
Total vfs operations: 36 maximum at one time: 2

1) \\localhost\test
SMBs: 69
Bytes read: 27  Bytes written: 0
Open files: 2 total (local), 3 open on server
TreeConnects: 1 total 0 failed
TreeDisconnects: 0 total 0 failed
Creates: 19 total 0 failed
Closes: 16 total 0 failed
...

Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:06 -05:00
Ronnie Sahlberg
8d8b26e584 cifs: add support for ioctl on directories
We do not call cifs_open_file() for directories and thus we do not have a
pSMBFile we can extract the FIDs from.

Solve this by instead always using a compounded open/query/close for
the passthrough ioctl.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:05 -05:00
Steve French
3b7960cace cifs: fallback to older infolevels on findfirst queryinfo retry
In cases where queryinfo fails, we have cases in cifs (vers=1.0)
where with backupuid mounts we retry the query info with findfirst.
This doesn't work to some NetApp servers which don't support
WindowsXP (and later) infolevel 261 (SMB_FIND_FILE_ID_FULL_DIR_INFO)
so in this case use other info levels (in this case it will usually
be level 257, SMB_FIND_FILE_DIRECTORY_INFO).

(Also fixes some indentation)

See kernel bugzilla 201435

Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:05 -05:00
Steve French
1e77a8c204 smb3: do not attempt cifs operation in smb3 query info error path
If backupuid mount option is sent, we can incorrectly retry
(on access denied on query info) with a cifs (FindFirst) operation
on an smb3 mount which causes the server to force the session close.

We set backup intent on open so no need for this fallback.

See kernel bugzilla 201435

Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-10-23 21:16:05 -05:00
Steve French
61351d6d54 smb3: send backup intent on compounded query info
When mounting with backupuid set, we should be setting
CREATE_OPEN_BACKUP_INTENT flag on compounded opens as well,
especially the case of compounded smb2_query_path_info.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-10-23 21:16:05 -05:00
Steve French
0cb012d1a0 cifs: track writepages in vfs operation counters
writepages and readpages operations did not call get/free_xid
so the statistics for file copy could get confusing with "vfs operations"
not increasing.  Add get_xid and free_xid to cifs readpages and
writepages functions.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-10-23 21:16:05 -05:00
Gustavo A. R. Silva
f70556c8ca smb2: fix uninitialized variable bug in smb2_ioctl_query_info
There is a potential execution path in which variable *resp_buftype*
is passed as an argument to function free_rsp_buf(), in which it is
used in a comparison without being properly initialized previously.

Fix this by initializing variable *resp_buftype* to CIFS_NO_BUFFER
in order to avoid unpredictable or unintended results.

Addresses-Coverity-ID: 1473971 ("Uninitialized scalar variable")
Fixes: c5d25bdb2967 ("cifs: add IOCTL for QUERY_INFO passthrough to userspace")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-10-23 21:16:05 -05:00
Ronnie Sahlberg
f5b05d622a cifs: add IOCTL for QUERY_INFO passthrough to userspace
This allows userspace tools to query the raw info levels for cifs files
and process the response in userspace.
In particular this is useful for many of those data where there is no
corresponding native data structure in linux.
For example querying the security descriptor for a file and extract the
SIDs.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:05 -05:00
Steve French
8c1beb9801 cifs: minor clarification in comments
Clarify meaning (in comments) meaning of various
options for debug messages in cifs.ko. Also fixed
trivial formatting/style issue with previous patch.

Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:05 -05:00
Rodrigo Freire
f80eaedd6c CIFS: Print message when attempting a mount
Currently, no messages are printed when mounting a CIFS filesystem and
no debug configuration is enabled.

However, a CIFS mount information is valuable when troubleshooting
and/or forensic analyzing a system and finding out if was a CIFS
endpoint mount attempted.

Other filesystems such as XFS, EXT* does issue a printk() when mounting
their filesystems.

A terse log message is printed only if cifsFYI is not enabled. Otherwise,
the default full debug message is printed.

In order to not clutter and classify correctly the event messages, these
are logged as KERN_INFO level.

Sample mount operations:

[root@corinthians ~]# mount -o user=administrator //172.25.250.18/c$ /mnt
(non-existent system)

[root@corinthians ~]# mount -o user=administrator //172.25.250.19/c$ /mnt
(Valid system)

Kernel message log for the mount operations:

[  450.464543] CIFS: Attempting to mount //172.25.250.18/c$
[  456.478186] CIFS VFS: Error connecting to socket. Aborting operation.
[  456.478381] CIFS VFS: cifs_mount failed w/return code = -113
[  467.688866] CIFS: Attempting to mount //172.25.250.19/c$

Signed-off-by: Rodrigo Freire <rfreire@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:05 -05:00
Rodrigo Freire
9a0efeccfa CIFS: Adds information-level logging function
Currently, CIFS lacks a internal logging function that prints out data
when CIFS_DEBUG=n. When CIFS_DEBUG=y, the only message level for CIFS
events are KERN_ERR or KERN_DEBUG.

This patch creates cifs_info(), which is useful for printing
non-critical event messges, at either CIFS_DEBUG state.

Signed-off-by: Rodrigo Freire <rfreire@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:05 -05:00
Ronnie Sahlberg
9645759ce6 cifs: OFD locks do not conflict with eachothers
RHBZ 1484130

Update cifs_find_fid_lock_conflict() to recognize that
ODF locks do not conflict with eachother.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:05 -05:00
Long Li
ff526d8605 CIFS: SMBD: Do not call ib_dereg_mr on invalidated memory registration
It is not necessary to deregister a memory registration after it has been
successfully invalidated.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:05 -05:00
Long Li
6d3adb23be CIFS: pass page offsets on SMB1 read/write
When issuing SMB1 read/write, pass the page offset to transport.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:05 -05:00
Garry McNulty
ef2298a06d fs/cifs: fix uninitialised variable warnings
In some error conditions, resp_buftype can be passed uninitialised to
free_rsp_buf(), potentially resulting in a spurious debug message.
If resp_buftype randomly had the value 1 (CIFS_SMALL_BUFFER) then this
would log a debug message.
The rsp pointer is initialised to NULL so there is no other side-effect.

Detected by CoverityScan, CID 1438585 ("Uninitialized scalar variable")
Detected by CoverityScan, CID 1438667 ("Uninitialized scalar variable")
Detected by CoverityScan, CID 1438764 ("Uninitialized scalar variable")

Signed-off-by: Garry McNulty <garrmcnu@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-10-23 21:16:05 -05:00
Steve French
179e44d49c smb3: add tracepoint for sending lease break responses to server
Be able to log a ftrace message on success and/or failure of
sending a lease break response to the server.

Example output:

           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
             | |       |   ||||       |         |
     kworker/1:1-5681  [001] .... 11123.530457: smb3_lease_done: sid=0x291e3e0f tid=0x8ba43071 lease_key=0x1852ca0d3ecd9b55847750a86716fde lease_state=0x0

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-10-23 21:16:05 -05:00
Steve French
9b9c5bea0b cifs: do not return atime less than mtime
In network file system it is fairly easy for server and client
atime vs. mtime to get confused (and atime updated less frequently)
which we noticed broke some apps which expect atime >= mtime

Also ignore relatime mount option (rather than error on it) since
relatime is basically what some network server fs are doing
(relatime).

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-10-23 21:16:05 -05:00
Steve French
3d621230b8 smb3: update default requested iosize to 4MB from 1MB for recent dialects
Modern servers often support 8MB as maximum i/o size, and we see some
performance benefits (my testing showed 1 to 13% on write paths,
and 1 to 3% on read paths for increasing the default to 4MB). If server
doesn't support larger i/o size, during negotiate protocol it is already
set correctly to the server's maximum if lower than 4MB.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-10-23 21:16:05 -05:00
Steve French
6e4d3bbe92 smb3: Add debug message later in smb2/smb3 reconnect path
As we reset credits later in the reconnect path, useful
to have optional (cifsFYI) debug message.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-10-23 21:16:05 -05:00
Aurelien Aptel
8393072bab CIFS: make 'nodfs' mount opt a superblock flag
tcon->Flags is only used by SMB1 code and changing it is not permanent
(you lose the setting on tcon reconnect).

* Move the setting to superblock flags (per mount-points).
* Make automount callback exit early when flag present
* Make dfs resolving happening in mount syscall exit early if flag present

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-10-23 21:16:05 -05:00
Steve French
9e1a37dad4 smb3: track the instance of each session for debugging
Each time we reconnect to the same server, bump an instance
counter (and display in /proc/fs/cifs/DebugData) to make it
easier to debug.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-10-23 21:16:04 -05:00
Steve French
37e6a70576 smb3: minor missing defines relating to reparse points
Previously reserved dpen response field changed in smb3

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-10-23 21:16:04 -05:00
Steve French
00778e2294 smb3: add way to control slow response threshold for logging and stats
/proc/fs/cifs/Stats when CONFIG_CIFS_STATS2 is enabled logs 'slow'
responses, but depending on the server you are debugging a
one second timeout may be too fast, so allow setting it to
a larger number of seconds via new module parameter

/sys/module/cifs/parameters/slow_rsp_threshold

or via modprobe:

slow_rsp_threshold:Amount of time (in seconds) to wait before
logging that a response is delayed.
Default: 1 (if set to 0 disables msg). (uint)

Recommended values are 0 (disabled) to 32767 (9 hours) with
the default remaining as 1 second.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-10-23 21:16:04 -05:00
Steve French
1c3a13a38a cifs: minor updates to module description for cifs.ko
note smb3 (and common more modern servers) in the module description

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-10-23 21:16:04 -05:00
Steve French
5a519bead4 cifs: protect against server returning invalid file system block size
For a network file system we generally prefer large i/o, but
if the server returns invalid file system block/sector sizes
in cifs (vers=1.0) QFSInfo then set block size to a default
of a reasonable minimum (4K).

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-10-23 21:16:04 -05:00
Steve French
2c887635cd smb3: allow stats which track session and share reconnects to be reset
Currently, "echo 0 > /proc/fs/cifs/Stats" resets all of the stats
except the session and share reconnect counts.  Fix it to
reset those as well.

CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-10-23 21:16:04 -05:00
Steve French
4d5bdf2869 SMB3: Backup intent flag missing from compounded ops
When "backup intent" is requested on the mount (e.g. backupuid or
backupgid mount options), the corresponding flag was missing from
some of the new compounding operations as well (now that
open_query_close is gone).

Related to kernel bugzilla #200953

Reported-and-tested-by: <whh@rubrik.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-10-23 21:16:04 -05:00
Ronnie Sahlberg
14e562ada2 cifs: create a define for the max number of iov we need for a SMB2 set_info
So we don't overflow the io vector arrays accidentally

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:04 -05:00
Ronnie Sahlberg
bb435512ce cifs: change SMB2_OP_RENAME and SMB2_OP_HARDLINK to use compounding
Get rid of smb2_open_op_close() as all operations are now migrated
to smb2_compound_op().

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:04 -05:00
Ronnie Sahlberg
3764cbd179 cifs: remove the is_falloc argument to SMB2_set_eof
We never pass is_falloc==true here anyway and if we ever need to support
is_falloc in the future, SMB2_set_eof is such a trivial wrapper around
send_set_info() that we can/should just create a differently named wrapper
for that new functionality.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:04 -05:00
Ronnie Sahlberg
dcbf910357 cifs: change SMB2_OP_SET_INFO to use compounding
Cuts number of network roundtrips significantly for some common syscalls

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:04 -05:00
Ronnie Sahlberg
f7bfe04bf0 cifs: change SMB2_OP_SET_EOF to use compounding
This changes SMB2_OP_SET_EOF to use compounding in some situations.
This is part of the path based API to truncate a file.
Most of the time this will however not be invoked for SMB2 since
cifs_set_file_size() will as far as I can tell almost always just
open the file synchronously and switch to the handle based truncate
code path, thus bypassing the compounding we add here.

Rewriting cifs_set_file_size() and make that whole pile of code more
compounding friendly, and also easier to read and understand, is a
different project though and not for this patch.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:04 -05:00
Ronnie Sahlberg
c2e0fe3f5a cifs: make rmdir() use compounding
This and previous patches drop the number of roundtrips we need for rmdir()
from 6 to 2.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:04 -05:00
Ronnie Sahlberg
ba8ca11685 cifs: create helpers for SMB2_set_info_init/free()
so that we can use these later for compounded set-info calls.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:04 -05:00
Ronnie Sahlberg
47dd9597df cifs: change unlink to use a compound
This,and previous patches, drops the number of roundtrips from five to two
for unlink()

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:04 -05:00
Ronnie Sahlberg
f733e3936d cifs: change mkdir to use a compound
This with the previous patch changes mkdir() from needing 6 roundtrips to
just 3.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:04 -05:00
Ronnie Sahlberg
c5a5f38f07 cifs: add a smb2_compound_op and change QUERY_INFO to use it
This turns most open/query-info/close patterns in cifs.ko
to become compounds.

This changes stat from using 3 roundtrips to just a single one.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:04 -05:00
Ronnie Sahlberg
cb5c2e6394 cifs: fix a credits leak for compund commands
When processing the mids for compounds we would only add credits based on
the last successful mid in the compound which would leak credits and
eventually triggering a re-connect.

Fix this by splitting the mid processing part into two loops instead of one
where the first loop just waits for all mids and then counts how many
credits we were granted for the whole compound.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:04 -05:00
Steve French
b340a4d4aa smb3: add tracepoint to catch cases where credit refund of failed op overlaps reconnect
Add tracepoint to catch potential cases where a pending operation overlapping a
reconnect could fail and incorrectly refund its credits causing the client
to think it has more credits available than the server thinks it does.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-10-23 21:16:03 -05:00
YueHaibing
ce7fb50f92 cifs: remove set but not used variable 'cifs_sb'
Fixes gcc '-Wunused-but-set-variable' warning:

fs/cifs/ioctl.c: In function 'cifs_ioctl':
fs/cifs/ioctl.c:164:23: warning:
 variable 'cifs_sb' set but not used [-Wunused-but-set-variable]

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:03 -05:00
YueHaibing
d034feeb44 cifs: Use kmemdup rather than duplicating its implementation in smb311_posix_mkdir()
Use kmemdup rather than duplicating its implementation

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-10-23 21:16:03 -05:00
Steve French
d42c8a87d1 smb3: do not display confusing message on mount to Azure servers
Some servers (e.g. Azure) return "STATUS_NOT_IMPLEMENTED" rather
than "STATUS_INVALID_DEVICE_REQUEST" on query network interface
info at mount.  This shouldn't cause us to log a warning message
automatically.  Don't log this unless noisier cifsFYI is enabled.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-10-23 21:16:03 -05:00
David Howells
3bf0fb6f33 afs: Probe multiple fileservers simultaneously
Send probes to all the unprobed fileservers in a fileserver list on all
addresses simultaneously in an attempt to find out the fastest route whilst
not getting stuck for 20s on any server or address that we don't get a
reply from.

This alleviates the problem whereby attempting to access a new server can
take a long time because the rotation algorithm ends up rotating through
all servers and addresses until it finds one that responds.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:09 +01:00
David Howells
18ac61853c afs: Fix callback handling
In some circumstances, the callback interest pointer is NULL, so in such a
case we can't dereference it when checking to see if the callback is
broken.  This causes an oops in some circumstances.

Fix this by replacing the function that worked out the aggregate break
counter with one that actually does the comparison, and then make that
return true (ie. broken) if there is no callback interest as yet (ie. the
pointer is NULL).

Fixes: 68251f0a68 ("afs: Fix whole-volume callback handling")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:09 +01:00
David Howells
2feeaf8433 afs: Eliminate the address pointer from the address list cursor
Eliminate the address pointer from the address list cursor as it's
redundant (ac->addrs[ac->index] can be used to find the same address) and
address lists must be replaced rather than being rearranged, so is of
limited value.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:09 +01:00
David Howells
744bcd713a afs: Allow dumping of server cursor on operation failure
Provide an option to allow the file or volume location server cursor to be
dumped if the rotation routine falls off the end without managing to
contact a server.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:09 +01:00
David Howells
30062bd13e afs: Implement YFS support in the fs client
Implement support for talking to YFS-variant fileservers in the cache
manager and the filesystem client.  These implement upgraded services on
the same port as their AFS services.

YFS fileservers provide expanded capabilities over AFS.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:08 +01:00
David Howells
d4936803a9 afs: Expand data structure fields to support YFS
Expand fields in various data structures to support the expanded
information that YFS is capable of returning.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:08 +01:00
David Howells
f58db83fd3 afs: Get the target vnode in afs_rmdir() and get a callback on it
Get the target vnode in afs_rmdir() and validate it before we attempt the
deletion, The vnode pointer will be passed through to the delivery function
in a later patch so that the delivery function can mark it deleted.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:08 +01:00
David Howells
12d8e95a91 afs: Calc callback expiry in op reply delivery
Calculate the callback expiration time at the point of operation reply
delivery, using the reply time queried from AF_RXRPC on that call as a
base.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:08 +01:00
David Howells
36bb5f490a afs: Fix FS.FetchStatus delivery from updating wrong vnode
The FS.FetchStatus reply delivery function was updating inode of the
directory in which a lookup had been done with the status of the looked up
file.  This corrupts some of the directory state.

Fixes: 5cf9dd55a0 ("afs: Prospectively look up extra files when doing a single lookup")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:08 +01:00
David Howells
35dbfba311 afs: Implement the YFS cache manager service
Implement the YFS cache manager service which gives extra capabilities on
top of AFS.  This is done by listening for an additional service on the
same port and indicating that anyone requesting an upgrade should be
upgraded to the YFS port.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:08 +01:00
David Howells
06aeb29714 afs: Remove callback details from afs_callback_break struct
Remove unnecessary details of a broken callback, such as version, expiry
and type, from the afs_callback_break struct as they're not actually used
and make the list take more memory.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:08 +01:00
David Howells
0067191201 afs: Commit the status on a new file/dir/symlink
Call the function to commit the status on a new file, dir or symlink so
that the access rights for the caller's key are cached for that object.

Without this, the next access to the file will cause a FetchStatus
operation to be emitted to retrieve the access rights.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:08 +01:00
David Howells
3b6492df41 afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
Increase the sizes of the volume ID to 64 bits and the vnode ID (inode
number equivalent) to 96 bits to allow the support of YFS.

This requires the iget comparator to check the vnode->fid rather than i_ino
and i_generation as i_ino is not sufficiently capacious.  It also requires
this data to be placed into the vnode cache key for fscache.

For the moment, just discard the top 32 bits of the vnode ID when returning
it though stat.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:08 +01:00
David Howells
2a0b4f64c9 afs: Don't invoke the server to read data beyond EOF
When writing a new page, clear space in the page rather than attempting to
load it from the server if the space is beyond the EOF.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:08 +01:00
David Howells
f51375cd9e afs: Add a couple of tracepoints to log I/O errors
Add a couple of tracepoints to log the production of I/O errors within the AFS
filesystem.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:07 +01:00
David Howells
4ac15ea536 afs: Handle EIO from delivery function
Fix afs_deliver_to_call() to handle -EIO being returned by the operation
delivery function, indicating that the call found itself in the wrong
state, by printing an error and aborting the call.

Currently, an assertion failure will occur.  This can happen, say, if the
delivery function falls off the end without calling afs_extract_data() with
the want_more parameter set to false to collect the end of the Rx phase of
a call.

The assertion failure looks like:

	AFS: Assertion failed
	4 == 7 is false
	0x4 == 0x7 is false
	------------[ cut here ]------------
	kernel BUG at fs/afs/rxrpc.c:462!

and is matched in the trace buffer by a line like:

kworker/7:3-3226 [007] ...1 85158.030203: afs_io_error: c=0003be0c r=-5 CM_REPLY

Fixes: 98bf40cd99 ("afs: Protect call->state changes against signals")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:07 +01:00
David Howells
ded2f4c58a afs: Fix TTL on VL server and address lists
Currently the TTL on VL server and address lists isn't set in all
circumstances and may be set to poor choices in others, since the TTL is
derived from the SRV/AFSDB DNS record if and when available.

Fix the TTL by limiting the range to a minimum and maximum from the current
time.  At some point these can be made into sysctl knobs.  Further, use the
TTL we obtained from the upcall to set the expiry on negative results too;
in future a mechanism can be added to force reloading of such data.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:07 +01:00
David Howells
0a5143f2f8 afs: Implement VL server rotation
Track VL servers as independent entities rather than lumping all their
addresses together into one set and implement server-level rotation by:

 (1) Add the concept of a VL server list, where each server has its own
     separate address list.  This code is similar to the FS server list.

 (2) Use the DNS resolver to retrieve a set of servers and their associated
     addresses, ports, preference and weight ratings.

 (3) In the case of a legacy DNS resolver or an address list given directly
     through /proc/net/afs/cells, create a list containing just a dummy
     server record and attach all the addresses to that.

 (4) Implement a simple rotation policy, for the moment ignoring the
     priorities and weights assigned to the servers.

 (5) Show the address list through /proc/net/afs/<cell>/vlservers.  This
     also displays the source and status of the data as indicated by the
     upcall.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:07 +01:00
David Howells
e7f680f45b afs: Improve FS server rotation error handling
Improve the error handling in FS server rotation by:

 (1) Cache the latest useful error value for the fs operation as a whole in
     struct afs_fs_cursor separately from the error cached in the
     afs_addr_cursor struct.  The one in the address cursor gets clobbered
     occasionally.  Copy over the error to the fs operation only when it's
     something we'd be interested in passing to userspace.

 (2) Make it so that EDESTADDRREQ is the default that is seen only if no
     addresses are available to be accessed.

 (3) When calling utility functions, such as checking a volume status or
     probing a fileserver, don't let a successful result clobber the cached
     error in the cursor; instead, stash the result in a temporary variable
     until it has been assessed.

 (4) Don't return ETIMEDOUT or ETIME if a better error, such as
     ENETUNREACH, is already cached.

 (5) On leaving the rotation loop, turn any remote abort code into a more
     useful error than ECONNABORTED.

Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:07 +01:00
David Howells
12bdcf333f afs: Set up the iov_iter before calling afs_extract_data()
afs_extract_data sets up a temporary iov_iter and passes it to AF_RXRPC
each time it is called to describe the remaining buffer to be filled.

Instead:

 (1) Put an iterator in the afs_call struct.

 (2) Set the iterator for each marshalling stage to load data into the
     appropriate places.  A number of convenience functions are provided to
     this end (eg. afs_extract_to_buf()).

     This iterator is then passed to afs_extract_data().

 (3) Use the new ITER_DISCARD iterator to discard any excess data provided
     by FetchData.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:07 +01:00
David Howells
160cb9574b afs: Better tracing of protocol errors
Include the site of detection of AFS protocol errors in trace lines to
better be able to determine what went wrong.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:07 +01:00
David Howells
aa563d7bca iov_iter: Separate type from direction and use accessor functions
In the iov_iter struct, separate the iterator type from the iterator
direction and use accessor functions to access them in most places.

Convert a bunch of places to use switch-statements to access them rather
then chains of bitwise-AND statements.  This makes it easier to add further
iterator types.  Also, this can be more efficient as to implement a switch
of small contiguous integers, the compiler can use ~50% fewer compare
instructions than it has to use bitwise-and instructions.

Further, cease passing the iterator type into the iterator setup function.
The iterator function can set that itself.  Only the direction is required.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:41:07 +01:00
David Howells
00e2370744 iov_iter: Use accessor function
Use accessor functions to access an iterator's type and direction.  This
allows for the possibility of using some other method of determining the
type of iterator than if-chains with bitwise-AND conditions.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24 00:40:44 +01:00
Frank Sorenson
86bbd7422a NFS: change sign of nfs_fh length
The filehandle has a length which is defined as a 32-bit
"unsigned integer".  Change sign of the length appropriately.

Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-10-23 12:22:21 -04:00
Linus Torvalds
99792e0cea Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
 "Lots of changes in this cycle:

   - Lots of CPA (change page attribute) optimizations and related
     cleanups (Thomas Gleixner, Peter Zijstra)

   - Make lazy TLB mode even lazier (Rik van Riel)

   - Fault handler cleanups and improvements (Dave Hansen)

   - kdump, vmcore: Enable kdumping encrypted memory with AMD SME
     enabled (Lianbo Jiang)

   - Clean up VM layout documentation (Baoquan He, Ingo Molnar)

   - ... plus misc other fixes and enhancements"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  x86/stackprotector: Remove the call to boot_init_stack_canary() from cpu_startup_entry()
  x86/mm: Kill stray kernel fault handling comment
  x86/mm: Do not warn about PCI BIOS W+X mappings
  resource: Clean it up a bit
  resource: Fix find_next_iomem_res() iteration issue
  resource: Include resource end in walk_*() interfaces
  x86/kexec: Correct KEXEC_BACKUP_SRC_END off-by-one error
  x86/mm: Remove spurious fault pkey check
  x86/mm/vsyscall: Consider vsyscall page part of user address space
  x86/mm: Add vsyscall address helper
  x86/mm: Fix exception table comments
  x86/mm: Add clarifying comments for user addr space
  x86/mm: Break out user address space handling
  x86/mm: Break out kernel address space handling
  x86/mm: Clarify hardware vs. software "error_code"
  x86/mm/tlb: Make lazy TLB mode lazier
  x86/mm/tlb: Add freed_tables element to flush_tlb_info
  x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range
  smp,cpumask: introduce on_each_cpu_cond_mask
  smp: use __cpumask_set_cpu in on_each_cpu_cond
  ...
2018-10-23 17:05:28 +01:00
Linus Torvalds
0200fbdd43 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking and misc x86 updates from Ingo Molnar:
 "Lots of changes in this cycle - in part because locking/core attracted
  a number of related x86 low level work which was easier to handle in a
  single tree:

   - Linux Kernel Memory Consistency Model updates (Alan Stern, Paul E.
     McKenney, Andrea Parri)

   - lockdep scalability improvements and micro-optimizations (Waiman
     Long)

   - rwsem improvements (Waiman Long)

   - spinlock micro-optimization (Matthew Wilcox)

   - qspinlocks: Provide a liveness guarantee (more fairness) on x86.
     (Peter Zijlstra)

   - Add support for relative references in jump tables on arm64, x86
     and s390 to optimize jump labels (Ard Biesheuvel, Heiko Carstens)

   - Be a lot less permissive on weird (kernel address) uaccess faults
     on x86: BUG() when uaccess helpers fault on kernel addresses (Jann
     Horn)

   - macrofy x86 asm statements to un-confuse the GCC inliner. (Nadav
     Amit)

   - ... and a handful of other smaller changes as well"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
  locking/lockdep: Make global debug_locks* variables read-mostly
  locking/lockdep: Fix debug_locks off performance problem
  locking/pvqspinlock: Extend node size when pvqspinlock is configured
  locking/qspinlock_stat: Count instances of nested lock slowpaths
  locking/qspinlock, x86: Provide liveness guarantee
  x86/asm: 'Simplify' GEN_*_RMWcc() macros
  locking/qspinlock: Rework some comments
  locking/qspinlock: Re-order code
  locking/lockdep: Remove duplicated 'lock_class_ops' percpu array
  x86/defconfig: Enable CONFIG_USB_XHCI_HCD=y
  futex: Replace spin_is_locked() with lockdep
  locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=y
  x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs
  x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs
  x86/extable: Macrofy inline assembly code to work around GCC inlining bugs
  x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops
  x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs
  x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs
  x86/refcount: Work around GCC inlining bug
  x86/objtool: Use asm macros to work around GCC inlining bugs
  ...
2018-10-23 13:08:53 +01:00
Ding Xiang
84db119f5a ubifs: Remove unneeded semicolon
delete redundant semicolon

Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:49:02 +02:00
Sascha Hauer
d8a22773a1 ubifs: Enable authentication support
With the preparations all being done this patch now enables authentication
support for UBIFS. Authentication is enabled when the newly introduced
auth_key and auth_hash_name mount options are passed. auth_key provides
the key which is used for authentication whereas auth_hash_name provides
the hashing algorithm used for this FS. Passing these options make
authentication mandatory and only UBIFS images that can be authenticated
with the given key are allowed.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:49:01 +02:00
Sascha Hauer
1e76592f2c ubifs: Do not update inode size in-place in authenticated mode
In authenticated mode we cannot fixup the inode sizes in-place
during recovery as this would invalidate the hashes and HMACs
we stored for this inode.

Instead, we just write the updated inodes to the journal. We can
only do this after ubifs_rcvry_gc_commit() is done though, so for
authenticated mode call ubifs_recover_size() after
ubifs_rcvry_gc_commit() and not vice versa as normally done.

Calling ubifs_recover_size() after ubifs_rcvry_gc_commit() has the
drawback that after a commit the size fixup information is gone, so
when a powercut happens while recovering from another powercut
we may lose some data written right before the first powercut.
This is why we only do this in authenticated mode and leave the
behaviour for unauthenticated mode untouched.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:57 +02:00
Sascha Hauer
104115a3eb ubifs: Add hashes and HMACs to default filesystem
This patch calculates the necessary hashes and HMACs for the default
filesystem so that the dynamically created default fs can be
authenticated.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:57 +02:00
Sascha Hauer
e158e02ff7 ubifs: authentication: Authenticate super block node
This adds a HMAC covering the super block node and adds the logic that
decides if a filesystem shall be mounted unauthenticated or
authenticated.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:57 +02:00
Sascha Hauer
b5b1f08369 ubifs: Create hash for default LPT
During creation of the default filesystem on an empty flash the default
LPT is created. With this patch a hash over the default LPT is
calculated which can be added to the default filesystems master node.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:56 +02:00
Sascha Hauer
625700ccb5 ubfis: authentication: Authenticate master node
The master node contains hashes over the root index node and the LPT.
This patch adds a HMAC to authenticate the master node itself.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:52 +02:00
Sascha Hauer
a1dc58140f ubifs: authentication: Authenticate LPT
The LPT needs to be authenticated aswell. Since the LPT is only written
during commit it is enough to authenticate the whole LPT with a single
hash which is stored in the master node. Only the leaf nodes (pnodes)
are hashed which makes the implementation much simpler than it would be
to hash the complete LPT.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:47 +02:00
Sascha Hauer
da8ef65f95 ubifs: Authenticate replayed journal
Make sure that during replay all buds can be authenticated. To do
this we calculate the hash chain until we find an authentication
node and check the HMAC in that node against the current status
of the hash chain.

After a power cut it can happen that some nodes have been written, but
not yet the authentication node for them. These nodes have to be
discarded during replay.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:40 +02:00
Sascha Hauer
6f06d96fdf ubifs: Add auth nodes to garbage collector journal head
To be able to authenticate the garbage collector journal head add
authentication nodes to the buds the garbage collector creates.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:40 +02:00
Sascha Hauer
6a98bc4614 ubifs: Add authentication nodes to journal
Nodes that are written to flash can only be authenticated through the
index after the next commit. When a journal replay is necessary the
nodes are not yet referenced by the index and thus can't be
authenticated.

This patch overcomes this situation by creating a hash over all nodes
beginning from the commit start node over the reference node(s) and
the buds themselves. From
time to time we insert authentication nodes. Authentication nodes
contain a HMAC from the current hash state, so that they can be
used to authenticate a journal replay up to the point where the
authentication node is. The hash is continued afterwards
so that theoretically we would only have to check the HMAC of
the last authentication node we find.

Overall we get this picture:

,,,,,,,,
,......,...........................................
,. CS  ,               hash1.----.           hash2.----.
,.  |  ,                    .    |hmac            .    |hmac
,.  v  ,                    .    v                .    v
,.REF#0,-> bud -> bud -> bud.-> auth -> bud -> bud.-> auth ...
,..|...,...........................................
,  |   ,
,  |   ,,,,,,,,,,,,,,,
.  |            hash3,----.
,  |                 ,    |hmac
,  v                 ,    v
, REF#1 -> bud -> bud,-> auth ...
,,,|,,,,,,,,,,,,,,,,,,
   v
  REF#2 -> ...
   |
   V
  ...

Note how hash3 covers CS, REF#0 and REF#1 so that it is not possible to
exchange or skip any reference nodes. Unlike the picture suggests the
auth nodes themselves are not hashed.

With this it is possible for an offline attacker to cut each journal
head or to drop the last reference node(s), but not to skip any journal
heads or to reorder any operations.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:39 +02:00
Sascha Hauer
16a26b20d2 ubifs: authentication: Add hashes to index nodes
With this patch the hashes over the index nodes stored in the tree node
cache are written to flash and are checked when read back from flash.
The hash of the root index node is stored in the master node.

During journal replay the hashes are regenerated from the read nodes
and stored in the tree node cache. This means the nodes must previously
be authenticated by other means. This is done in a later patch.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:39 +02:00
Sascha Hauer
823838a486 ubifs: Add hashes to the tree node cache
As part of the UBIFS authentication support every branch in the index
gets a hash covering the referenced node. To make that happen the tree
node cache needs hashes over the nodes. This patch adds a hash argument
to ubifs_tnc_add() and ubifs_tnc_add_nm(). The hashes are calculated
from the callers of these functions which actually prepare the nodes.
With this patch all the leaf nodes of the index tree get hashes, but
currently nothing is done with these hashes, this is left for a later
patch.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:39 +02:00
Sascha Hauer
a384b47e49 ubifs: Create functions to embed a HMAC in a node
With authentication support some nodes (master node, super block node)
get a HMAC embedded into them. This patch adds functions to prepare and
write such a node.
The difficulty is that besides the HMAC the nodes also have a CRC which
must stay valid. This means we first have to initialize all fields in
the node, then calculate the HMAC (not covering the CRC) and finally
calculate the CRC.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:37 +02:00
Sascha Hauer
49525e5eec ubifs: Add helper functions for authentication support
This patch adds the various helper functions needed for authentication
support. We need functions to hash nodes, to embed HMACs into a node and
to compare hashes and HMACs. Most functions first check if this
filesystem is authenticated and bail out early if not, which makes the
functions safe to be called with disabled authentication.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:33 +02:00
Sascha Hauer
dead97266f ubifs: Add separate functions to init/crc a node
When adding authentication support we will embed a HMAC into some
nodes. To prepare these nodes we have to first initialize the nodes,
then add a HMAC and finally add a CRC. To accomplish this add separate
ubifs_init_node/ubifs_crc_node functions.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:29 +02:00
Sascha Hauer
5125cfdff1 ubifs: Format changes for authentication support
This patch adds the changes to the on disk format needed for
authentication support. We'll add:

* a HMAC covering super block node
* a HMAC covering the master node
* a hash over the root index node to the master node
* a hash over the LPT to the master node
* a flag to the filesystem flag indicating the filesystem is
  authenticated
* an authentication node necessary to authenticate the nodes written
  to the journal heads while they are written.
* a HMAC of a well known message to the super block node to be able
  to check if the correct key is provided

And finally, not visible in this patch, nevertheless explained here:

* hashes over the referenced child nodes in each branch of a index node

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:29 +02:00
Sascha Hauer
fd6150051b ubifs: Store read superblock node
The superblock node is read/modified/written several times throughout
the UBIFS code. Instead of reading it from the device each time just
keep a copy in memory and write back the modified copy when necessary.
This patch helps for authentication support, here we not only have to
read the superblock node, but also have to authenticate it, which
is easier if we do it once during initialization.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:29 +02:00
Sascha Hauer
83407437bb ubifs: Drop write_node
write_node() is used only once and can easily be replaced with calls
to ubifs_prepare_node()/write_head() which makes the code a bit shorter.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:24 +02:00
Sascha Hauer
e635cf8c3b ubifs: Implement ubifs_lpt_lookup using ubifs_pnode_lookup
ubifs_lpt_lookup() starts by looking up the nth pnode in the LPT. We
already have this functionality in ubifs_pnode_lookup(). Use this
function rather than open coding its functionality.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:21 +02:00
Sascha Hauer
0e26b6e255 ubifs: Export pnode_lookup as ubifs_pnode_lookup
ubifs_lpt_lookup could be implemented using pnode_lookup. To make that
possible move pnode_lookup from lpt.c to lpt_commit.c. Rename it to
ubifs_pnode_lookup since it's now exported.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:17 +02:00
Sascha Hauer
22ceaa8c68 ubifs: Pass ubifs_zbranch to read_znode()
read_znode() takes len, lnum and offs arguments which the caller all
extracts from the same struct ubifs_zbranch *. When adding authentication
support we would have to add a pointer to a hash to the arguments which
is also part of struct ubifs_zbranch. Pass the ubifs_zbranch * instead
so that we do not have to add another argument.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:14 +02:00
Sascha Hauer
545bc8f6b0 ubifs: Pass ubifs_zbranch to try_read_node()
try_read_node() takes len, lnum and offs arguments which the caller all
extracts from the same struct ubifs_zbranch *. When adding authentication
support we would have to add a pointer to a hash to the arguments which
is also part of struct ubifs_zbranch. Pass the ubifs_zbranch * instead
so that we do not have to add another argument.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:10 +02:00
Sascha Hauer
c4de6d7e43 ubifs: Refactor create_default_filesystem()
create_default_filesystem() allocates memory for a node, writes that
node and frees the memory directly afterwards. With this patch we
allocate memory for all nodes at the beginning of the function and
free the memory at the end. This makes it easier to implement
authentication support since with authentication support we'll need
the contents of some nodes when creating other nodes.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-10-23 13:48:06 +02:00
Chao Yu
7813081969 f2fs: fix to keep project quota consistent
This patch does below changes to keep consistence of project quota data
in sudden power-cut case:
- update inode.i_projid and project quota atomically under lock_op() in
f2fs_ioc_setproject()
- recover inode.i_projid and project quota in recover_inode()

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:48 -07:00
Chao Yu
af033b2aa8 f2fs: guarantee journalled quota data by checkpoint
For journalled quota mode, let checkpoint to flush dquot dirty data
and quota file data to guarntee persistence of all quota sysfile in
last checkpoint, by this way, we can avoid corrupting quota sysfile
when encountering SPO.

The implementation is as below:

1. add a global state SBI_QUOTA_NEED_FLUSH to indicate that there is
cached dquot metadata changes in quota subsystem, and later checkpoint
should:
 a) flush dquot metadata into quota file.
 b) flush quota file to storage to keep file usage be consistent.

2. add a global state SBI_QUOTA_NEED_REPAIR to indicate that quota
operation failed due to -EIO or -ENOSPC, so later,
 a) checkpoint will skip syncing dquot metadata.
 b) CP_QUOTA_NEED_FSCK_FLAG will be set in last cp pack to give a
    hint for fsck repairing.

3. add a global state SBI_QUOTA_SKIP_FLUSH, in checkpoint, if quota
data updating is very heavy, it may cause hungtask in block_operation().
To avoid this, if our retry time exceed threshold, let's just skip
flushing and retry in next checkpoint().

Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: avoid warnings and set fsck flag]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:47 -07:00
Sheng Yong
26b5a07919 f2fs: cleanup dirty pages if recover failed
During recover, we will try to create new dentries for inodes with
dentry_mark. But if the parent is missing (e.g. killed by fsck),
recover will break. But those recovered dirty pages are not cleanup.
This will hit f2fs_bug_on:

[   53.519566] F2FS-fs (loop0): Found nat_bits in checkpoint
[   53.539354] F2FS-fs (loop0): recover_inode: ino = 5, name = file, inline = 3
[   53.539402] F2FS-fs (loop0): recover_dentry: ino = 5, name = file, dir = 0, err = -2
[   53.545760] F2FS-fs (loop0): Cannot recover all fsync data errno=-2
[   53.546105] F2FS-fs (loop0): access invalid blkaddr:4294967295
[   53.546171] WARNING: CPU: 1 PID: 1798 at fs/f2fs/checkpoint.c:163 f2fs_is_valid_blkaddr+0x26c/0x320
[   53.546174] Modules linked in:
[   53.546183] CPU: 1 PID: 1798 Comm: mount Not tainted 4.19.0-rc2+ #1
[   53.546186] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[   53.546191] RIP: 0010:f2fs_is_valid_blkaddr+0x26c/0x320
[   53.546195] Code: 85 bb 00 00 00 48 89 df 88 44 24 07 e8 ad a8 db ff 48 8b 3b 44 89 e1 48 c7 c2 40 03 72 a9 48 c7 c6 e0 01 72 a9 e8 84 3c ff ff <0f> 0b 0f b6 44 24 07 e9 8a 00 00 00 48 8d bf 38 01 00 00 e8 7c a8
[   53.546201] RSP: 0018:ffff88006c067768 EFLAGS: 00010282
[   53.546208] RAX: 0000000000000000 RBX: ffff880068844200 RCX: ffffffffa83e1a33
[   53.546211] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff88006d51e590
[   53.546215] RBP: 0000000000000005 R08: ffffed000daa3cb3 R09: ffffed000daa3cb3
[   53.546218] R10: 0000000000000001 R11: ffffed000daa3cb2 R12: 00000000ffffffff
[   53.546221] R13: ffff88006a1f8000 R14: 0000000000000200 R15: 0000000000000009
[   53.546226] FS:  00007fb2f3646840(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
[   53.546229] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   53.546234] CR2: 00007f0fd77f0008 CR3: 00000000687e6002 CR4: 00000000000206e0
[   53.546237] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   53.546240] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   53.546242] Call Trace:
[   53.546248]  f2fs_submit_page_bio+0x95/0x740
[   53.546253]  read_node_page+0x161/0x1e0
[   53.546271]  ? truncate_node+0x650/0x650
[   53.546283]  ? add_to_page_cache_lru+0x12c/0x170
[   53.546288]  ? pagecache_get_page+0x262/0x2d0
[   53.546292]  __get_node_page+0x200/0x660
[   53.546302]  f2fs_update_inode_page+0x4a/0x160
[   53.546306]  f2fs_write_inode+0x86/0xb0
[   53.546317]  __writeback_single_inode+0x49c/0x620
[   53.546322]  writeback_single_inode+0xe4/0x1e0
[   53.546326]  sync_inode_metadata+0x93/0xd0
[   53.546330]  ? sync_inode+0x10/0x10
[   53.546342]  ? do_raw_spin_unlock+0xed/0x100
[   53.546347]  f2fs_sync_inode_meta+0xe0/0x130
[   53.546351]  f2fs_fill_super+0x287d/0x2d10
[   53.546367]  ? vsnprintf+0x742/0x7a0
[   53.546372]  ? f2fs_commit_super+0x180/0x180
[   53.546379]  ? up_write+0x20/0x40
[   53.546385]  ? set_blocksize+0x5f/0x140
[   53.546391]  ? f2fs_commit_super+0x180/0x180
[   53.546402]  mount_bdev+0x181/0x200
[   53.546406]  mount_fs+0x94/0x180
[   53.546411]  vfs_kern_mount+0x6c/0x1e0
[   53.546415]  do_mount+0xe5e/0x1510
[   53.546420]  ? fs_reclaim_release+0x9/0x30
[   53.546424]  ? copy_mount_string+0x20/0x20
[   53.546428]  ? fs_reclaim_acquire+0xd/0x30
[   53.546435]  ? __might_sleep+0x2c/0xc0
[   53.546440]  ? ___might_sleep+0x53/0x170
[   53.546453]  ? __might_fault+0x4c/0x60
[   53.546468]  ? _copy_from_user+0x95/0xa0
[   53.546474]  ? memdup_user+0x39/0x60
[   53.546478]  ksys_mount+0x88/0xb0
[   53.546482]  __x64_sys_mount+0x5d/0x70
[   53.546495]  do_syscall_64+0x65/0x130
[   53.546503]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   53.547639] ---[ end trace b804d1ea2fec893e ]---

So if recover fails, we need to drop all recovered data.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:47 -07:00
Sahitya Tummala
1e78e8bd9d f2fs: fix data corruption issue with hardware encryption
Direct IO can be used in case of hardware encryption. The following
scenario results into data corruption issue in this path -

Thread A -                          Thread B-
-> write file#1 in direct IO
                                    -> GC gets kicked in
                                    -> GC submitted bio on meta mapping
				       for file#1, but pending completion
-> write file#1 again with new data
   in direct IO
                                    -> GC bio gets completed now
                                    -> GC writes old data to the new
                                       location and thus file#1 is
				       corrupted.

Fix this by submitting and waiting for pending io on meta mapping
for direct IO case in f2fs_map_blocks().

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:47 -07:00
Chao Yu
0c093b590e f2fs: fix to recover inode->i_flags of inode block during POR
Testcase to reproduce this bug:
1. mkfs.f2fs /dev/sdd
2. mount -t f2fs /dev/sdd /mnt/f2fs
3. touch /mnt/f2fs/file
4. sync
5. chattr +a /mnt/f2fs/file
6. xfs_io -a /mnt/f2fs/file -c "fsync"
7. godown /mnt/f2fs
8. umount /mnt/f2fs
9. mount -t f2fs /dev/sdd /mnt/f2fs
10. xfs_io /mnt/f2fs/file

There is no error when opening this file w/o O_APPEND, but actually,
we expect the correct result should be:

/mnt/f2fs/file: Operation not permitted

The root cause is, in recover_inode(), we recover inode->i_flags more
than F2FS_I(inode)->i_flags, so fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:47 -07:00
Chao Yu
9149a5eb60 f2fs: spread f2fs_set_inode_flags()
This patch changes codes as below:
- use f2fs_set_inode_flags() to update i_flags atomically to avoid
potential race.
- synchronize F2FS_I(inode)->i_flags to inode->i_flags in
f2fs_new_inode().
- use f2fs_set_inode_flags() to simply codes in f2fs_quota_{on,off}.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:47 -07:00
Chao Yu
2baf078185 f2fs: fix to spread clear_cold_data()
We need to drop PG_checked flag on page as well when we clear PG_uptodate
flag, in order to avoid treating the page as GCing one later.

Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:46 -07:00
Jaegeuk Kim
164a63fa6b Revert "f2fs: fix to clear PG_checked flag in set_page_dirty()"
This reverts commit 66110abc4c.

If we clear the cold data flag out of the writeback flow, we can miscount
-1 by end_io, which incurs a deadlock caused by all I/Os being blocked during
heavy GC.

Balancing F2FS Async:
 - IO (CP:    1, Data:   -1, Flush: (   0    0    1), Discard: (   ...

GC thread:                              IRQ
- move_data_page()
 - set_page_dirty()
  - clear_cold_data()
                                        - f2fs_write_end_io()
                                         - type = WB_DATA_TYPE(page);
                                           here, we get wrong type
                                         - dec_page_count(sbi, type);
 - f2fs_wait_on_page_writeback()

Cc: <stable@vger.kernel.org>
Reported-and-Tested-by: Park Ju Hyung <qkrwngud825@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:46 -07:00
Jaegeuk Kim
5f9abab42b f2fs: account read IOs and use IO counts for is_idle
This patch adds issued read IO counts which is under block layer.

Chao modified a bit, since:

Below race can cause reversed reference on F2FS_RD_DATA, there is
the same issue in f2fs_submit_page_bio(), fix them by relocate
__submit_bio() and inc_page_count.

Thread A			Thread B
- f2fs_write_begin
 - f2fs_submit_page_read
 - __submit_bio
				- f2fs_read_end_io
				 - __read_end_io
				 - dec_page_count(, F2FS_RD_DATA)
 - inc_page_count(, F2FS_RD_DATA)

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:46 -07:00
Chao Yu
78efac537d f2fs: fix to account IO correctly for cgroup writeback
Now, we have supported cgroup writeback, it depends on correctly IO
account of specified filesystem.

But in commit d1b3e72d54 ("f2fs: submit bio of in-place-update pages"),
we split write paths from f2fs_submit_page_mbio() to two:
- f2fs_submit_page_bio() for IPU path
- f2fs_submit_page_bio() for OPU path

But still we account write IO only in f2fs_submit_page_mbio(), result in
incorrect IO account, fix it by adding missing IO account in IPU path.

Fixes: d1b3e72d54 ("f2fs: submit bio of in-place-update pages")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:54:39 -07:00
Chao Yu
4c58ed0768 f2fs: fix to account IO correctly
Below race can cause reversed reference on dirty count, fix it by
relocating __submit_bio() and inc_page_count().

Thread A				Thread B
- f2fs_inplace_write_data
 - f2fs_submit_page_bio
  - __submit_bio
					- f2fs_write_end_io
					 - dec_page_count
  - inc_page_count

Cc: <stable@vger.kernel.org>
Fixes: d1b3e72d54 ("f2fs: submit bio of in-place-update pages")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-22 17:53:48 -07:00
Linus Torvalds
a36cf68651 SPI NOR changes:
Core changes:
   * Support non-uniform erase size
   * Support controllers with limited TX fifo size
 
  Driver changes:
   * m25p80: Re-issue a WREN command after each write access
   * cadence: Pass a proper dir value to dma_[un]map_single()
   * fsl-qspi: Check fsl_qspi_get_seqid() return val make sure 4B
     addressing opcodes are properly handled
   * intel-spi: Add a new PCI entry for Ice Lake
 
 NAND changes:
  Raw NAND core changes:
  - Two batchs of cleanups of the NAND API, including:
    * Deprecating a lot of interfaces (now replaced by ->exec_op()).
    * Moving code in separate drivers (JEDEC, ONFI), in private files
      (internals), in platform drivers, etc.
    * Functions/structures reordering.
    * Exclusive use of the nand_chip structure instead of the MTD one
      all across the subsystem.
  - Addition of the nand_wait_readrdy/rdy_op() helpers.
 
  Raw NAND controllers drivers changes:
  - Various coccinelle patches.
  - Marvell:
    * Use regmap_update_bits() for syscon access.
    * More documentation.
    * BCH failure path rework.
    * More layouts to be supported.
    * IRQ handler complete() condition fixed.
  - Fsl_ifc:
    * SRAM initialization fixed for newer controller versions.
  - Denali:
    * Fix licenses mismatch and use a SPDX tag.
    * Set SPARE_AREA_SKIP_BYTES register to 8 if unset.
  - Qualcomm:
    * Do not include dma-direct.h.
  - Docg4:
    * Removed.
  - Ams-delta:
    * Use of a GPIO lookup table
    * Internal machinery changes.
 
  Raw NAND chip drivers changes:
  - Toshiba:
    * Add support for Toshiba memory BENAND
    * Pass a single nand_chip object to the status helper.
  - ESMT:
    * New driver to retrieve the ECC requirements from the 5th ID byte.
 
 MTD changes:
  * physmap cleanups/fixe
  * gpio-addr-flash cleanups/fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJQBAABCgA6FiEEKmCqpbOU668PNA69Ze02AX4ItwAFAlvNmJocHGJvcmlzLmJy
 ZXppbGxvbkBib290bGluLmNvbQAKCRBl7TYBfgi3AKr+D/4pH0gYSGDUstZsqUHW
 gZsPRU4jmOA+OCRbSxE03bOZOYR0UPgdoUYLfAKhZ5qxc7wHd3b47wykP0kUEviu
 lC8QhSs4DUA+EuMVPDVS4FlXRT0e3dMV7jh/IK6nInshD2YkaovyCqa6GvgwFEcM
 7BCizxRhtHV8+fo7GVQWuMLb9ZfbEvz42D0sYOu6hIsF1SnRDvHOvfdDUFEXpJoF
 a2mC9ove65ChEzc2iZ/r260iZ4aoJYb9XJRJKWxmYeITZSmLNmcrUyGqAaNQ7NRc
 AIuPXASbeHGjrIuEfXpKYW07AE5MV1nJFSk3v4u5FjgoohOoobPp7npk+Lz/qwe3
 y8/uW9qOQ/iEOsRnkvMGNu4Yjhw41L1a+J5wVvUvzmwHy1xMCrRHYB/8gXoZzemR
 A7XmCPwjAFVWv1WeKV2cs5MLW9yZq9QNMtGLlNs5OgFR6CccLk67/tHIkYLsJ/l9
 IDXFhd/YKhTeF361u6Iimmgb427TwM71P9N+OMpv4nk46DxurpxkgGle3nslzKNU
 DOFNFlMGiPIx3h4X96AKER6u7cxiOXnLGq+XeHa/y8tIrziy3jg03YoTh0RSwKxV
 J3EXwh1sFLaeebwWEgpE3Ag1LOxpRCqJ2ED71SPzS/DR/938HBVsEoPqUEHNf1in
 jwsUB3cRzNwf+XX68DRCVT5GFA==
 =7w4w
 -----END PGP SIGNATURE-----

Merge tag 'mtd/for-4.20' of git://git.infradead.org/linux-mtd

Pull mtd updates from Boris Brezillon:
 "SPI NOR core changes:
   - Support non-uniform erase size
   - Support controllers with limited TX fifo size

 Driver changes:
   - m25p80: Re-issue a WREN command after each write access
   - cadence: Pass a proper dir value to dma_[un]map_single()
   - fsl-qspi: Check fsl_qspi_get_seqid() return val make sure 4B
     addressing opcodes are properly handled
   - intel-spi: Add a new PCI entry for Ice Lake

 Raw NAND core changes:
   - Two batchs of cleanups of the NAND API, including:
      * Deprecating a lot of interfaces (now replaced by ->exec_op()).
      * Moving code in separate drivers (JEDEC, ONFI), in private files
        (internals), in platform drivers, etc.
      * Functions/structures reordering.
      * Exclusive use of the nand_chip structure instead of the MTD one
        all across the subsystem.
   - Addition of the nand_wait_readrdy/rdy_op() helpers.

 Raw NAND controllers drivers changes:
   - Various coccinelle patches.
   - Marvell:
      * Use regmap_update_bits() for syscon access.
      * More documentation.
      * BCH failure path rework.
      * More layouts to be supported.
      * IRQ handler complete() condition fixed.
   - Fsl_ifc:
      * SRAM initialization fixed for newer controller versions.
   - Denali:
      * Fix licenses mismatch and use a SPDX tag.
      * Set SPARE_AREA_SKIP_BYTES register to 8 if unset.
   - Qualcomm:
      * Do not include dma-direct.h.
   - Docg4:
      * Removed.
   - Ams-delta:
      * Use of a GPIO lookup table
      * Internal machinery changes.

 Raw NAND chip drivers changes:
   - Toshiba:
      * Add support for Toshiba memory BENAND
      * Pass a single nand_chip object to the status helper.
   - ESMT:
      * New driver to retrieve the ECC requirements from the 5th ID
        byte.

  MTD changes:
   - physmap cleanups/fixe
   - gpio-addr-flash cleanups/fixes"

* tag 'mtd/for-4.20' of git://git.infradead.org/linux-mtd: (93 commits)
  jffs2: free jffs2_sb_info through jffs2_kill_sb()
  mtd: spi-nor: fsl-quadspi: fix read error for flash size larger than 16MB
  mtd: spi-nor: intel-spi: Add support for Intel Ice Lake SPI serial flash
  mtd: maps: gpio-addr-flash: Convert to gpiod
  mtd: maps: gpio-addr-flash: Replace array with an integer
  mtd: maps: gpio-addr-flash: Use order instead of size
  mtd: spi-nor: fsl-quadspi: Don't let -EINVAL on the bus
  mtd: devices: m25p80: Make sure WRITE_EN is issued before each write
  mtd: spi-nor: Support controllers with limited TX FIFO size
  mtd: spi-nor: cadence-quadspi: Use proper enum for dma_[un]map_single
  mtd: spi-nor: parse SFDP Sector Map Parameter Table
  mtd: spi-nor: add support to non-uniform SFDP SPI NOR flash memories
  mtd: rawnand: marvell: fix the IRQ handler complete() condition
  mtd: rawnand: denali: set SPARE_AREA_SKIP_BYTES register to 8 if unset
  mtd: rawnand: r852: fix spelling mistake "card_registred" -> "card_registered"
  mtd: rawnand: toshiba: Pass a single nand_chip object to the status helper
  mtd: maps: gpio-addr-flash: Use devm_* functions
  mtd: maps: gpio-addr-flash: Fix ioremapped size
  mtd: maps: gpio-addr-flash: Replace custom printk
  mtd: physmap_of: Release resources on error
  ...
2018-10-23 01:09:22 +01:00
Filipe Manana
9084cb6a24 Btrfs: fix use-after-free when dumping free space
We were iterating a block group's free space cache rbtree without locking
first the lock that protects it (the free_space_ctl->free_space_offset
rbtree is protected by the free_space_ctl->tree_lock spinlock).

KASAN reported an use-after-free problem when iterating such a rbtree due
to a concurrent rbtree delete:

[ 9520.359168] ==================================================================
[ 9520.359656] BUG: KASAN: use-after-free in rb_next+0x13/0x90
[ 9520.359949] Read of size 8 at addr ffff8800b7ada500 by task btrfs-transacti/1721
[ 9520.360357]
[ 9520.360530] CPU: 4 PID: 1721 Comm: btrfs-transacti Tainted: G             L    4.19.0-rc8-nbor #555
[ 9520.360990] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 9520.362682] Call Trace:
[ 9520.362887]  dump_stack+0xa4/0xf5
[ 9520.363146]  print_address_description+0x78/0x280
[ 9520.363412]  kasan_report+0x263/0x390
[ 9520.363650]  ? rb_next+0x13/0x90
[ 9520.363873]  __asan_load8+0x54/0x90
[ 9520.364102]  rb_next+0x13/0x90
[ 9520.364380]  btrfs_dump_free_space+0x146/0x160 [btrfs]
[ 9520.364697]  dump_space_info+0x2cd/0x310 [btrfs]
[ 9520.364997]  btrfs_reserve_extent+0x1ee/0x1f0 [btrfs]
[ 9520.365310]  __btrfs_prealloc_file_range+0x1cc/0x620 [btrfs]
[ 9520.365646]  ? btrfs_update_time+0x180/0x180 [btrfs]
[ 9520.365923]  ? _raw_spin_unlock+0x27/0x40
[ 9520.366204]  ? btrfs_alloc_data_chunk_ondemand+0x2c0/0x5c0 [btrfs]
[ 9520.366549]  btrfs_prealloc_file_range_trans+0x23/0x30 [btrfs]
[ 9520.366880]  cache_save_setup+0x42e/0x580 [btrfs]
[ 9520.367220]  ? btrfs_check_data_free_space+0xd0/0xd0 [btrfs]
[ 9520.367518]  ? lock_downgrade+0x2f0/0x2f0
[ 9520.367799]  ? btrfs_write_dirty_block_groups+0x11f/0x6e0 [btrfs]
[ 9520.368104]  ? kasan_check_read+0x11/0x20
[ 9520.368349]  ? do_raw_spin_unlock+0xa8/0x140
[ 9520.368638]  btrfs_write_dirty_block_groups+0x2af/0x6e0 [btrfs]
[ 9520.368978]  ? btrfs_start_dirty_block_groups+0x870/0x870 [btrfs]
[ 9520.369282]  ? do_raw_spin_unlock+0xa8/0x140
[ 9520.369534]  ? _raw_spin_unlock+0x27/0x40
[ 9520.369811]  ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
[ 9520.370137]  commit_cowonly_roots+0x4b9/0x610 [btrfs]
[ 9520.370560]  ? commit_fs_roots+0x350/0x350 [btrfs]
[ 9520.370926]  ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
[ 9520.371285]  btrfs_commit_transaction+0x5e5/0x10e0 [btrfs]
[ 9520.371612]  ? btrfs_apply_pending_changes+0x90/0x90 [btrfs]
[ 9520.371943]  ? start_transaction+0x168/0x6c0 [btrfs]
[ 9520.372257]  transaction_kthread+0x21c/0x240 [btrfs]
[ 9520.372537]  kthread+0x1d2/0x1f0
[ 9520.372793]  ? btrfs_cleanup_transaction+0xb50/0xb50 [btrfs]
[ 9520.373090]  ? kthread_park+0xb0/0xb0
[ 9520.373329]  ret_from_fork+0x3a/0x50
[ 9520.373567]
[ 9520.373738] Allocated by task 1804:
[ 9520.373974]  kasan_kmalloc+0xff/0x180
[ 9520.374208]  kasan_slab_alloc+0x11/0x20
[ 9520.374447]  kmem_cache_alloc+0xfc/0x2d0
[ 9520.374731]  __btrfs_add_free_space+0x40/0x580 [btrfs]
[ 9520.375044]  unpin_extent_range+0x4f7/0x7a0 [btrfs]
[ 9520.375383]  btrfs_finish_extent_commit+0x15f/0x4d0 [btrfs]
[ 9520.375707]  btrfs_commit_transaction+0xb06/0x10e0 [btrfs]
[ 9520.376027]  btrfs_alloc_data_chunk_ondemand+0x237/0x5c0 [btrfs]
[ 9520.376365]  btrfs_check_data_free_space+0x81/0xd0 [btrfs]
[ 9520.376689]  btrfs_delalloc_reserve_space+0x25/0x80 [btrfs]
[ 9520.377018]  btrfs_direct_IO+0x42e/0x6d0 [btrfs]
[ 9520.377284]  generic_file_direct_write+0x11e/0x220
[ 9520.377587]  btrfs_file_write_iter+0x472/0xac0 [btrfs]
[ 9520.377875]  aio_write+0x25c/0x360
[ 9520.378106]  io_submit_one+0xaa0/0xdc0
[ 9520.378343]  __se_sys_io_submit+0xfa/0x2f0
[ 9520.378589]  __x64_sys_io_submit+0x43/0x50
[ 9520.378840]  do_syscall_64+0x7d/0x240
[ 9520.379081]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 9520.379387]
[ 9520.379557] Freed by task 1802:
[ 9520.379782]  __kasan_slab_free+0x173/0x260
[ 9520.380028]  kasan_slab_free+0xe/0x10
[ 9520.380262]  kmem_cache_free+0xc1/0x2c0
[ 9520.380544]  btrfs_find_space_for_alloc+0x4cd/0x4e0 [btrfs]
[ 9520.380866]  find_free_extent+0xa99/0x17e0 [btrfs]
[ 9520.381166]  btrfs_reserve_extent+0xd5/0x1f0 [btrfs]
[ 9520.381474]  btrfs_get_blocks_direct+0x60b/0xbd0 [btrfs]
[ 9520.381761]  __blockdev_direct_IO+0x10ee/0x58a1
[ 9520.382059]  btrfs_direct_IO+0x25a/0x6d0 [btrfs]
[ 9520.382321]  generic_file_direct_write+0x11e/0x220
[ 9520.382623]  btrfs_file_write_iter+0x472/0xac0 [btrfs]
[ 9520.382904]  aio_write+0x25c/0x360
[ 9520.383172]  io_submit_one+0xaa0/0xdc0
[ 9520.383416]  __se_sys_io_submit+0xfa/0x2f0
[ 9520.383678]  __x64_sys_io_submit+0x43/0x50
[ 9520.383927]  do_syscall_64+0x7d/0x240
[ 9520.384165]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 9520.384439]
[ 9520.384610] The buggy address belongs to the object at ffff8800b7ada500
                which belongs to the cache btrfs_free_space of size 72
[ 9520.385175] The buggy address is located 0 bytes inside of
                72-byte region [ffff8800b7ada500, ffff8800b7ada548)
[ 9520.385691] The buggy address belongs to the page:
[ 9520.385957] page:ffffea0002deb680 count:1 mapcount:0 mapping:ffff880108a1d700 index:0x0 compound_mapcount: 0
[ 9520.388030] flags: 0x8100(slab|head)
[ 9520.388281] raw: 0000000000008100 ffffea0002deb608 ffffea0002728808 ffff880108a1d700
[ 9520.388722] raw: 0000000000000000 0000000000130013 00000001ffffffff 0000000000000000
[ 9520.389169] page dumped because: kasan: bad access detected
[ 9520.389473]
[ 9520.389658] Memory state around the buggy address:
[ 9520.389943]  ffff8800b7ada400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 9520.390368]  ffff8800b7ada480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 9520.390796] >ffff8800b7ada500: fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc
[ 9520.391223]                    ^
[ 9520.391461]  ffff8800b7ada580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 9520.391885]  ffff8800b7ada600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 9520.392313] ==================================================================
[ 9520.392772] BTRFS critical (device vdc): entry offset 2258497536, bytes 131072, bitmap no
[ 9520.393247] BUG: unable to handle kernel NULL pointer dereference at 0000000000000011
[ 9520.393705] PGD 800000010dbab067 P4D 800000010dbab067 PUD 107551067 PMD 0
[ 9520.394059] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[ 9520.394378] CPU: 4 PID: 1721 Comm: btrfs-transacti Tainted: G    B        L    4.19.0-rc8-nbor #555
[ 9520.394858] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 9520.395350] RIP: 0010:rb_next+0x3c/0x90
[ 9520.396461] RSP: 0018:ffff8801074ff780 EFLAGS: 00010292
[ 9520.396762] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81b5ac4c
[ 9520.397115] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000011
[ 9520.397468] RBP: ffff8801074ff7a0 R08: ffffed0021d64ccc R09: ffffed0021d64ccc
[ 9520.397821] R10: 0000000000000001 R11: ffffed0021d64ccb R12: ffff8800b91e0000
[ 9520.398188] R13: ffff8800a3ceba48 R14: ffff8800b627bf80 R15: 0000000000020000
[ 9520.398555] FS:  0000000000000000(0000) GS:ffff88010eb00000(0000) knlGS:0000000000000000
[ 9520.399007] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9520.399335] CR2: 0000000000000011 CR3: 0000000106b52000 CR4: 00000000000006a0
[ 9520.399679] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9520.400023] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9520.400400] Call Trace:
[ 9520.400648]  btrfs_dump_free_space+0x146/0x160 [btrfs]
[ 9520.400974]  dump_space_info+0x2cd/0x310 [btrfs]
[ 9520.401287]  btrfs_reserve_extent+0x1ee/0x1f0 [btrfs]
[ 9520.401609]  __btrfs_prealloc_file_range+0x1cc/0x620 [btrfs]
[ 9520.401952]  ? btrfs_update_time+0x180/0x180 [btrfs]
[ 9520.402232]  ? _raw_spin_unlock+0x27/0x40
[ 9520.402522]  ? btrfs_alloc_data_chunk_ondemand+0x2c0/0x5c0 [btrfs]
[ 9520.402882]  btrfs_prealloc_file_range_trans+0x23/0x30 [btrfs]
[ 9520.403261]  cache_save_setup+0x42e/0x580 [btrfs]
[ 9520.403570]  ? btrfs_check_data_free_space+0xd0/0xd0 [btrfs]
[ 9520.403871]  ? lock_downgrade+0x2f0/0x2f0
[ 9520.404161]  ? btrfs_write_dirty_block_groups+0x11f/0x6e0 [btrfs]
[ 9520.404481]  ? kasan_check_read+0x11/0x20
[ 9520.404732]  ? do_raw_spin_unlock+0xa8/0x140
[ 9520.405026]  btrfs_write_dirty_block_groups+0x2af/0x6e0 [btrfs]
[ 9520.405375]  ? btrfs_start_dirty_block_groups+0x870/0x870 [btrfs]
[ 9520.405694]  ? do_raw_spin_unlock+0xa8/0x140
[ 9520.405958]  ? _raw_spin_unlock+0x27/0x40
[ 9520.406243]  ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
[ 9520.406574]  commit_cowonly_roots+0x4b9/0x610 [btrfs]
[ 9520.406899]  ? commit_fs_roots+0x350/0x350 [btrfs]
[ 9520.407253]  ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs]
[ 9520.407589]  btrfs_commit_transaction+0x5e5/0x10e0 [btrfs]
[ 9520.407925]  ? btrfs_apply_pending_changes+0x90/0x90 [btrfs]
[ 9520.408262]  ? start_transaction+0x168/0x6c0 [btrfs]
[ 9520.408582]  transaction_kthread+0x21c/0x240 [btrfs]
[ 9520.408870]  kthread+0x1d2/0x1f0
[ 9520.409138]  ? btrfs_cleanup_transaction+0xb50/0xb50 [btrfs]
[ 9520.409440]  ? kthread_park+0xb0/0xb0
[ 9520.409682]  ret_from_fork+0x3a/0x50
[ 9520.410508] Dumping ftrace buffer:
[ 9520.410764]    (ftrace buffer empty)
[ 9520.411007] CR2: 0000000000000011
[ 9520.411297] ---[ end trace 01a0863445cf360a ]---
[ 9520.411568] RIP: 0010:rb_next+0x3c/0x90
[ 9520.412644] RSP: 0018:ffff8801074ff780 EFLAGS: 00010292
[ 9520.412932] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81b5ac4c
[ 9520.413274] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000011
[ 9520.413616] RBP: ffff8801074ff7a0 R08: ffffed0021d64ccc R09: ffffed0021d64ccc
[ 9520.414007] R10: 0000000000000001 R11: ffffed0021d64ccb R12: ffff8800b91e0000
[ 9520.414349] R13: ffff8800a3ceba48 R14: ffff8800b627bf80 R15: 0000000000020000
[ 9520.416074] FS:  0000000000000000(0000) GS:ffff88010eb00000(0000) knlGS:0000000000000000
[ 9520.416536] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9520.416848] CR2: 0000000000000011 CR3: 0000000106b52000 CR4: 00000000000006a0
[ 9520.418477] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9520.418846] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9520.419204] Kernel panic - not syncing: Fatal exception
[ 9520.419666] Dumping ftrace buffer:
[ 9520.419930]    (ftrace buffer empty)
[ 9520.420168] Kernel Offset: disabled
[ 9520.420406] ---[ end Kernel panic - not syncing: Fatal exception ]---

Fix this by acquiring the respective lock before iterating the rbtree.

Reported-by: Nikolay Borisov <nborisov@suse.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-22 20:31:22 +02:00
Linus Torvalds
6ab9e09238 for-4.20/block-20181021
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvNQKgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgps+8D/9Iy6YIeoPwN10gYsqIh0P2fS3wKzL3kiww
 3vFsWO78PzgLxUlNmB7teLtNFc/R5mi8becZmAdvs9za5YFZk56o3Ifv1x+e+z00
 VY1/gxhiJD6suLeJ6lECnERGDaiWOZVRMo2TE17vxYGW6GGaa0Ts6PUUXmpla1u5
 WKctgt0Qv9WVNyiIdLdeHqzKJwsSSwNTt8fK7eFhy3x8e0CwJr+GtXckbbW3LFkY
 lug0npsTli3EmEPMovZhd25SjZmTk5GTM+ADZQ7Tnv5KXoDWB9jn6TcCSAi3G+5d
 5WUVwfnDyYJiH8qvlg5tRJ690muIy3xMOmpr7QBQ0YnR/LQ3EW+1CVfqD+qimgLH
 TXzlREXQpBP3YlxSDS5nddz4o5z84GZmC9B/43ujPaZKIQ6eBXYdkmQH7tPtSugm
 C6VGomR5tHotjxIiAsexh/5hAus+wW8bObKGTPTyINT0ub3XNclwCKLh26CgI9ie
 WvbS9g3j/KPvu/7s6weZpgD+cks0YdWe/XdXXxiHwsGI9h3J2aJna5RQt1rKWDm5
 wGCgbc/B8eSwiWx+GXlqdB9/Dy/bGXOnSTDnKpEVl1f5zNjeLwUKXbjvkMefWs4m
 jEIcquuDETORY+ZYEfa5YbmS4Lhskr0kzMVTVkZ++81tAWpSCU9Xh3IHrR8TNpt+
 J0oh0FHBDg==
 =LRTT
 -----END PGP SIGNATURE-----

Merge tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:
 "This is the main pull request for block changes for 4.20. This
  contains:

   - Series enabling runtime PM for blk-mq (Bart).

   - Two pull requests from Christoph for NVMe, with items such as;
      - Better AEN tracking
      - Multipath improvements
      - RDMA fixes
      - Rework of FC for target removal
      - Fixes for issues identified by static checkers
      - Fabric cleanups, as prep for TCP transport
      - Various cleanups and bug fixes

   - Block merging cleanups (Christoph)

   - Conversion of drivers to generic DMA mapping API (Christoph)

   - Series fixing ref count issues with blkcg (Dennis)

   - Series improving BFQ heuristics (Paolo, et al)

   - Series improving heuristics for the Kyber IO scheduler (Omar)

   - Removal of dangerous bio_rewind_iter() API (Ming)

   - Apply single queue IPI redirection logic to blk-mq (Ming)

   - Set of fixes and improvements for bcache (Coly et al)

   - Series closing a hotplug race with sysfs group attributes (Hannes)

   - Set of patches for lightnvm:
      - pblk trace support (Hans)
      - SPDX license header update (Javier)
      - Tons of refactoring patches to cleanly abstract the 1.2 and 2.0
        specs behind a common core interface. (Javier, Matias)
      - Enable pblk to use a common interface to retrieve chunk metadata
        (Matias)
      - Bug fixes (Various)

   - Set of fixes and updates to the blk IO latency target (Josef)

   - blk-mq queue number updates fixes (Jianchao)

   - Convert a bunch of drivers from the old legacy IO interface to
     blk-mq. This will conclude with the removal of the legacy IO
     interface itself in 4.21, with the rest of the drivers (me, Omar)

   - Removal of the DAC960 driver. The SCSI tree will introduce two
     replacement drivers for this (Hannes)"

* tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits)
  block: setup bounce bio_sets properly
  blkcg: reassociate bios when make_request() is called recursively
  blkcg: fix edge case for blk_get_rl() under memory pressure
  nvme-fabrics: move controller options matching to fabrics
  nvme-rdma: always have a valid trsvcid
  mtip32xx: fully switch to the generic DMA API
  rsxx: switch to the generic DMA API
  umem: switch to the generic DMA API
  sx8: switch to the generic DMA API
  sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code
  skd: switch to the generic DMA API
  ubd: remove use of blk_rq_map_sg
  nvme-pci: remove duplicate check
  drivers/block: Remove DAC960 driver
  nvme-pci: fix hot removal during error handling
  nvmet-fcloop: suppress a compiler warning
  nvme-core: make implicit seed truncation explicit
  nvmet-fc: fix kernel-doc headers
  nvme-fc: rework the request initialization code
  nvme-fc: introduce struct nvme_fcp_op_w_sgl
  ...
2018-10-22 17:46:08 +01:00
Kees Cook
1227daa43b pstore/ram: Clarify resource reservation labels
When ramoops reserved a memory region in the kernel, it had an unhelpful
label of "persistent_memory". When reading /proc/iomem, it would be
repeated many times, did not hint that it was ramoops in particular,
and didn't clarify very much about what each was used for:

400000000-407ffffff : Persistent Memory (legacy)
  400000000-400000fff : persistent_memory
  400001000-400001fff : persistent_memory
...
  4000ff000-4000fffff : persistent_memory

Instead, this adds meaningful labels for how the various regions are
being used:

400000000-407ffffff : Persistent Memory (legacy)
  400000000-400000fff : ramoops:dump(0/252)
  400001000-400001fff : ramoops:dump(1/252)
...
  4000fc000-4000fcfff : ramoops:dump(252/252)
  4000fd000-4000fdfff : ramoops:console
  4000fe000-4000fe3ff : ramoops:ftrace(0/3)
  4000fe400-4000fe7ff : ramoops:ftrace(1/3)
  4000fe800-4000febff : ramoops:ftrace(2/3)
  4000fec00-4000fefff : ramoops:ftrace(3/3)
  4000ff000-4000fffff : ramoops:pmsg

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
2018-10-22 07:11:58 -07:00
Kees Cook
95047b0519 pstore: Refactor compression initialization
This refactors compression initialization slightly to better handle
getting potentially called twice (via early pstore_register() calls
and later pstore_init()) and improves the comments and reporting to be
more verbose.

Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
2018-10-22 07:11:58 -07:00
Joel Fernandes (Google)
416031653e pstore: Allocate compression during late_initcall()
ramoops's call of pstore_register() was recently moved to run during
late_initcall() because the crypto backend may not have been ready during
postcore_initcall(). This meant early-boot crash dumps were not getting
caught by pstore any more.

Instead, lets allow calls to pstore_register() earlier, and once crypto
is ready we can initialize the compression.

Reported-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Fixes: cb3bee0369 ("pstore: Use crypto compress API")
[kees: trivial rebase]
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
2018-10-22 07:11:58 -07:00
Kees Cook
cb095afd44 pstore: Centralize init/exit routines
In preparation for having additional actions during init/exit, this moves
the init/exit into platform.c, centralizing the logic to make call outs
to the fs init/exit.

Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
2018-10-22 07:11:58 -07:00
Luis Henriques
ea4cdc548e ceph: new mount option to disable usage of copy-from op
Add a new mount option 'nocopyfrom' that will prevent the usage of the
RADOS 'copy-from' operation in cephfs.  This could be useful, for example,
for an administrator to temporarily mitigate any possible bugs in the
'copy-from' implementation.

Currently, only copy_file_range uses this RADOS operation.  Setting this
mount option will result in this syscall reverting to the default VFS
implementation, i.e. to perform the copies locally instead of doing remote
object copies.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:24 +02:00
Luis Henriques
503f82a993 ceph: support copy_file_range file operation
This commit implements support for the copy_file_range syscall in cephfs.
It is implemented using the RADOS 'copy-from' operation, which allows to
do a remote object copy, without the need to download/upload data from/to
the OSDs.

Some manual copy may however be required if the source/destination file
offsets aren't object aligned or if the copy length is smaller than the
object size.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:24 +02:00
Luis Henriques
2ee9dd958d ceph: add non-blocking parameter to ceph_try_get_caps()
ceph_try_get_caps currently calls try_get_cap_refs with the nonblock
parameter always set to 'true'.  This change adds a new parameter that
allows to set it's value.  This will be useful for a follow-up patch that
will need to get two sets of capabilities for two different inodes without
risking a deadlock.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:23 +02:00
Ilya Dryomov
0d9c1ab3be libceph: preallocate message data items
Currently message data items are allocated with ceph_msg_data_create()
in setup_request_data() inside send_request().  send_request() has never
been allowed to fail, so each allocation is followed by a BUG_ON:

  data = ceph_msg_data_create(...);
  BUG_ON(!data);

It's been this way since support for multiple message data items was
added in commit 6644ed7b7e ("libceph: make message data be a pointer")
in 3.10.

There is no reason to delay the allocation of message data items until
the last possible moment and we certainly don't need a linked list of
them as they are only ever appended to the end and never erased.  Make
ceph_msg_new2() take max_data_items and adapt the rest of the code.

Reported-by: Jerry Lee <leisurelysw24@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:22 +02:00
Ilya Dryomov
26f887e0a3 libceph, rbd, ceph: move ceph_osdc_alloc_messages() calls
The current requirement is that ceph_osdc_alloc_messages() should be
called after oid and oloc are known.  In preparation for preallocating
message data items, move ceph_osdc_alloc_messages() further down, so
that it is called when OSD op codes are known.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:22 +02:00
Ilya Dryomov
61d2f85504 ceph: num_ops is off by one in ceph_aio_retry_work()
Two OSD op slots are allocated, but only one is ever used.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:21 +02:00
Xuehan Xu
6680288441 ceph: set timeout conditionally in __cap_delay_requeue
__cap_delay_requeue could be invoked through ceph_check_caps when there
exists caps that needs to be sent and are delayed by "i_hold_caps_min"
or "i_hold_caps_max". If __cap_delay_requeue sets timeout unconditionally,
there could be a chance that some "wanted" caps can not be release for a
long since their timeouts are reset every time they get delayed.

Fixes: http://tracker.ceph.com/issues/36369
Signed-off-by: Xuehan Xu <xuxuehan@360.cn>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:21 +02:00
Ilya Dryomov
894868330a libceph: don't consume a ref on pagelist in ceph_msg_data_add_pagelist()
Because send_mds_reconnect() wants to send a message with a pagelist
and pass the ownership to the messenger, ceph_msg_data_add_pagelist()
consumes a ref which is then put in ceph_msg_data_destroy().  This
makes managing pagelists in the OSD client (where they are wrapped in
ceph_osd_data) unnecessarily hard because the handoff only happens in
ceph_osdc_start_request() instead of when the pagelist is passed to
ceph_osd_data_pagelist_init().  I counted several memory leaks on
various error paths.

Fix up ceph_msg_data_add_pagelist() and carry a pagelist ref in
ceph_osd_data.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:21 +02:00
Ilya Dryomov
33165d4723 libceph: introduce ceph_pagelist_alloc()
struct ceph_pagelist cannot be embedded into anything else because it
has its own refcount.  Merge allocation and initialization together.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:21 +02:00
Luis Henriques
bddff633ab ceph: only allow punch hole mode in fallocate
Current implementation of cephfs fallocate isn't correct as it doesn't
really reserve the space in the cluster, which means that a subsequent
call to a write may actually fail due to lack of space.  In fact, it is
currently possible to fallocate an amount space that is larger than the
free space in the cluster.  It has behaved this way since the initial
commit ad7a60de88 ("ceph: punch hole support").

Since there's no easy solution to fix this at the moment, this patch
simply removes support for all fallocate operations but
FALLOC_FL_PUNCH_HOLE (which implies FALLOC_FL_KEEP_SIZE).

Link: https://tracker.ceph.com/issues/36317
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Yan, Zheng
fce7a9744b ceph: refactor ceph_sync_read()
Avoid allocating memory for the entire user request: striped_read()
does a synchronous OSD request per object, so it doesn't need more than
object size worth of pages at a time.

[ Preserve the comment, changelog. ]

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Yan, Zheng
74c9e6bf4c ceph: check if LOOKUPNAME request was aborted when filling trace
d_lookup()/d_alloc() require parent inode locked. Parent inode is
not locked if request is aborted.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Yan, Zheng
c58f450bd6 ceph: fix dentry leak in ceph_readdir_prepopulate
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Yan, Zheng
efe328230d Revert "ceph: fix dentry leak in splice_dentry()"
This reverts commit 8b8f53af1e.

splice_dentry() is used by three places. For two places, req->r_dentry
is passed to splice_dentry(). In the case of error, req->r_dentry does
not get updated. So splice_dentry() should not drop reference.

Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Chengguang Xu
5da207993e ceph: check snap first in ceph_set_acl()
Do the snap check first in ceph_set_acl(), so we can avoid
unnecessary operations when the inode has snap.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:20 +02:00
Chengguang Xu
3167893ae6 ceph: reset cap hold timeout only for requeued inode
__cap_delay_requeue() only requeue inode which does not
have CEPH_I_FLUSH flag, so avoid reset cap hold timeout
for that inode.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-10-22 10:28:19 +02:00
Matthew Wilcox
a283348629 page cache: Finish XArray conversion
With no more radix tree API users left, we can drop the GFP flags
and use xa_init() instead of INIT_RADIX_TREE().

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:44 -04:00
Matthew Wilcox
b15cd80068 dax: Convert page fault handlers to XArray
This is the last part of DAX to be converted to the XArray so
remove all the old helper functions.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:44 -04:00
Matthew Wilcox
9f32d22130 dax: Convert dax_lock_mapping_entry to XArray
Instead of always retrying when we slept, only retry if the page has
moved.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:44 -04:00
Matthew Wilcox
9fc747f68d dax: Convert dax writeback to XArray
Use XArray iteration instead of a pagevec.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:43 -04:00
Matthew Wilcox
07f2d89cc2 dax: Convert __dax_invalidate_entry to XArray
Avoids walking the radix tree multiple times looking for tags.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:43 -04:00
Matthew Wilcox
084a899008 dax: Convert dax_layout_busy_page to XArray
Instead of using a pagevec, just use the XArray iterators.  Add a
conditional rescheduling point which probably should have been there in
the original.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:43 -04:00
Matthew Wilcox
cfc93c6c6c dax: Convert dax_insert_pfn_mkwrite to XArray
Add some XArray-based helper functions to replace the radix tree based
metaphors currently in use.  The biggest change is that converted code
doesn't see its own lock bit; get_unlocked_entry() always returns an
entry with the lock bit clear.  So we don't have to mess around loading
the current entry and clearing the lock bit; we can just store the
unlocked entry that we already have.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:43 -04:00
Matthew Wilcox
ec4907ff69 dax: Hash on XArray instead of mapping
Since the XArray is embedded in the struct address_space, its address
contains exactly as much entropy as the address of the mapping.  This
patch is purely preparatory for later patches which will simplify the
wait/wake interfaces.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:43 -04:00
Matthew Wilcox
a77d19f46a dax: Rename some functions
Remove mentions of 'radix' and 'radix tree'.  Simplify some names by
dropping the word 'mapping'.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:42 -04:00
Matthew Wilcox
5ec2d99de7 f2fs: Convert to XArray
This is a straightforward conversion.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:42 -04:00
Matthew Wilcox
f611ff6375 nilfs2: Convert to XArray
This is close to a 1:1 replacement of radix tree APIs with their XArray
equivalents.  It would be possible to optimise nilfs_copy_back_pages(),
but that doesn't seem to be in the performance path.  Also, I think
it has a pre-existing bug, and I've added a note to that effect in the
source code.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:42 -04:00
Matthew Wilcox
04edf02cdd fs: Convert writeback to XArray
A couple of short loops.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:42 -04:00
Matthew Wilcox
ec82e1c1c8 fs: Convert buffer to XArray
Mostly comment fixes, but one use of __xa_set_mark.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:41 -04:00
Matthew Wilcox
0a943c65e7 btrfs: Convert page cache to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: David Sterba <dsterba@suse.com>
2018-10-21 10:46:41 -04:00
Matthew Wilcox
10bbd23585 pagevec: Use xa_mark_t
Removes sparse warnings.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:39 -04:00
Matthew Wilcox
0d3f929666 page cache: Convert hole search to XArray
The page cache offers the ability to search for a miss in the previous or
next N locations.  Rather than teach the XArray about the page cache's
definition of a miss, use xas_prev() and xas_next() to search the page
array.  This should be more efficient as it does not have to start the
lookup from the top for each index.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:33 -04:00
David S. Miller
2e2d6f0342 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
net/sched/cls_api.c has overlapping changes to a call to
nlmsg_parse(), one (from 'net') added rtm_tca_policy instead of NULL
to the 5th argument, and another (from 'net-next') added cb->extack
instead of NULL to the 6th argument.

net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to
code which moved (to mr_table_dump)) in 'net-next'.  Thanks to David
Ahern for the heads up.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-19 11:03:06 -07:00
Bob Peterson
8e31582a9a gfs2: Fix minor typo: couln't versus couldn't.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-10-19 11:34:04 -05:00
Filipe Manana
421f0922a2 Btrfs: fix use-after-free during inode eviction
At inode.c:evict_inode_truncate_pages(), when we iterate over the
inode's extent states, we access an extent state record's "state" field
after we unlocked the inode's io tree lock. This can lead to a
use-after-free issue because after we unlock the io tree that extent
state record might have been freed due to being merged into another
adjacent extent state record (a previous inflight bio for a read
operation finished in the meanwhile which unlocked a range in the io
tree and cause a merge of extent state records, as explained in the
comment before the while loop added in commit 6ca0709756 ("Btrfs: fix
hang during inode eviction due to concurrent readahead")).

Fix this by keeping a copy of the extent state's flags in a local
variable and using it after unlocking the io tree.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201189
Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-19 12:25:33 +02:00
Josef Bacik
c495144bc6 btrfs: move the dio_sem higher up the callchain
We're getting a lockdep splat because we take the dio_sem under the
log_mutex.  What we really need is to protect fsync() from logging an
extent map for an extent we never waited on higher up, so just guard the
whole thing with dio_sem.

======================================================
WARNING: possible circular locking dependency detected
4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411 Not tainted
------------------------------------------------------
aio-dio-invalid/30928 is trying to acquire lock:
0000000092621cfd (&mm->mmap_sem){++++}, at: get_user_pages_unlocked+0x5a/0x1e0

but task is already holding lock:
00000000cefe6b35 (&ei->dio_sem){++++}, at: btrfs_direct_IO+0x3be/0x400

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #5 (&ei->dio_sem){++++}:
       lock_acquire+0xbd/0x220
       down_write+0x51/0xb0
       btrfs_log_changed_extents+0x80/0xa40
       btrfs_log_inode+0xbaf/0x1000
       btrfs_log_inode_parent+0x26f/0xa80
       btrfs_log_dentry_safe+0x50/0x70
       btrfs_sync_file+0x357/0x540
       do_fsync+0x38/0x60
       __ia32_sys_fdatasync+0x12/0x20
       do_fast_syscall_32+0x9a/0x2f0
       entry_SYSENTER_compat+0x84/0x96

-> #4 (&ei->log_mutex){+.+.}:
       lock_acquire+0xbd/0x220
       __mutex_lock+0x86/0xa10
       btrfs_record_unlink_dir+0x2a/0xa0
       btrfs_unlink+0x5a/0xc0
       vfs_unlink+0xb1/0x1a0
       do_unlinkat+0x264/0x2b0
       do_fast_syscall_32+0x9a/0x2f0
       entry_SYSENTER_compat+0x84/0x96

-> #3 (sb_internal#2){.+.+}:
       lock_acquire+0xbd/0x220
       __sb_start_write+0x14d/0x230
       start_transaction+0x3e6/0x590
       btrfs_evict_inode+0x475/0x640
       evict+0xbf/0x1b0
       btrfs_run_delayed_iputs+0x6c/0x90
       cleaner_kthread+0x124/0x1a0
       kthread+0x106/0x140
       ret_from_fork+0x3a/0x50

-> #2 (&fs_info->cleaner_delayed_iput_mutex){+.+.}:
       lock_acquire+0xbd/0x220
       __mutex_lock+0x86/0xa10
       btrfs_alloc_data_chunk_ondemand+0x197/0x530
       btrfs_check_data_free_space+0x4c/0x90
       btrfs_delalloc_reserve_space+0x20/0x60
       btrfs_page_mkwrite+0x87/0x520
       do_page_mkwrite+0x31/0xa0
       __handle_mm_fault+0x799/0xb00
       handle_mm_fault+0x7c/0xe0
       __do_page_fault+0x1d3/0x4a0
       async_page_fault+0x1e/0x30

-> #1 (sb_pagefaults){.+.+}:
       lock_acquire+0xbd/0x220
       __sb_start_write+0x14d/0x230
       btrfs_page_mkwrite+0x6a/0x520
       do_page_mkwrite+0x31/0xa0
       __handle_mm_fault+0x799/0xb00
       handle_mm_fault+0x7c/0xe0
       __do_page_fault+0x1d3/0x4a0
       async_page_fault+0x1e/0x30

-> #0 (&mm->mmap_sem){++++}:
       __lock_acquire+0x42e/0x7a0
       lock_acquire+0xbd/0x220
       down_read+0x48/0xb0
       get_user_pages_unlocked+0x5a/0x1e0
       get_user_pages_fast+0xa4/0x150
       iov_iter_get_pages+0xc3/0x340
       do_direct_IO+0xf93/0x1d70
       __blockdev_direct_IO+0x32d/0x1c20
       btrfs_direct_IO+0x227/0x400
       generic_file_direct_write+0xcf/0x180
       btrfs_file_write_iter+0x308/0x58c
       aio_write+0xf8/0x1d0
       io_submit_one+0x3a9/0x620
       __ia32_compat_sys_io_submit+0xb2/0x270
       do_int80_syscall_32+0x5b/0x1a0
       entry_INT80_compat+0x88/0xa0

other info that might help us debug this:

Chain exists of:
  &mm->mmap_sem --> &ei->log_mutex --> &ei->dio_sem

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&ei->dio_sem);
                               lock(&ei->log_mutex);
                               lock(&ei->dio_sem);
  lock(&mm->mmap_sem);

 *** DEADLOCK ***

1 lock held by aio-dio-invalid/30928:
 #0: 00000000cefe6b35 (&ei->dio_sem){++++}, at: btrfs_direct_IO+0x3be/0x400

stack backtrace:
CPU: 0 PID: 30928 Comm: aio-dio-invalid Not tainted 4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
Call Trace:
 dump_stack+0x7c/0xbb
 print_circular_bug.isra.37+0x297/0x2a4
 check_prev_add.constprop.45+0x781/0x7a0
 ? __lock_acquire+0x42e/0x7a0
 validate_chain.isra.41+0x7f0/0xb00
 __lock_acquire+0x42e/0x7a0
 lock_acquire+0xbd/0x220
 ? get_user_pages_unlocked+0x5a/0x1e0
 down_read+0x48/0xb0
 ? get_user_pages_unlocked+0x5a/0x1e0
 get_user_pages_unlocked+0x5a/0x1e0
 get_user_pages_fast+0xa4/0x150
 iov_iter_get_pages+0xc3/0x340
 do_direct_IO+0xf93/0x1d70
 ? __alloc_workqueue_key+0x358/0x490
 ? __blockdev_direct_IO+0x14b/0x1c20
 __blockdev_direct_IO+0x32d/0x1c20
 ? btrfs_run_delalloc_work+0x40/0x40
 ? can_nocow_extent+0x490/0x490
 ? kvm_clock_read+0x1f/0x30
 ? can_nocow_extent+0x490/0x490
 ? btrfs_run_delalloc_work+0x40/0x40
 btrfs_direct_IO+0x227/0x400
 ? btrfs_run_delalloc_work+0x40/0x40
 generic_file_direct_write+0xcf/0x180
 btrfs_file_write_iter+0x308/0x58c
 aio_write+0xf8/0x1d0
 ? kvm_clock_read+0x1f/0x30
 ? __might_fault+0x3e/0x90
 io_submit_one+0x3a9/0x620
 ? io_submit_one+0xe5/0x620
 __ia32_compat_sys_io_submit+0xb2/0x270
 do_int80_syscall_32+0x5b/0x1a0
 entry_INT80_compat+0x88/0xa0

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-19 12:20:04 +02:00
Josef Bacik
30928e9baa btrfs: don't run delayed_iputs in commit
This could result in a really bad case where we do something like

evict
  evict_refill_and_join
    btrfs_commit_transaction
      btrfs_run_delayed_iputs
        evict
          evict_refill_and_join
            btrfs_commit_transaction
... forever

We have plenty of other places where we run delayed iputs that are much
safer, let those do the work.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-19 12:20:03 +02:00
Josef Bacik
80ee54bfe8 btrfs: fix insert_reserved error handling
We were not handling the reserved byte accounting properly for data
references.  Metadata was fine, if it errored out the error paths would
free the bytes_reserved count and pin the extent, but it even missed one
of the error cases.  So instead move this handling up into
run_one_delayed_ref so we are sure that both cases are properly cleaned
up in case of a transaction abort.

CC: stable@vger.kernel.org # 4.18+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-19 12:20:03 +02:00
Josef Bacik
49940bdd57 btrfs: only free reserved extent if we didn't insert it
When we insert the file extent once the ordered extent completes we free
the reserved extent reservation as it'll have been migrated to the
bytes_used counter.  However if we error out after this step we'll still
clear the reserved extent reservation, resulting in a negative
accounting of the reserved bytes for the block group and space info.
Fix this by only doing the free if we didn't successfully insert a file
extent for this extent.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-19 12:20:03 +02:00
Josef Bacik
fb5c39d7a8 btrfs: don't use ctl->free_space for max_extent_size
max_extent_size is supposed to be the largest contiguous range for the
space info, and ctl->free_space is the total free space in the block
group.  We need to keep track of these separately and _only_ use the
max_free_space if we don't have a max_extent_size, as that means our
original request was too large to search any of the block groups for and
therefore wouldn't have a max_extent_size set.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-19 12:20:03 +02:00
Josef Bacik
ad22cf6ea4 btrfs: set max_extent_size properly
We can't use entry->bytes if our entry is a bitmap entry, we need to use
entry->max_extent_size in that case.  Fix up all the logic to make this
consistent.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-19 12:20:03 +02:00
Josef Bacik
21a94f7acf btrfs: reset max_extent_size properly
If we use up our block group before allocating a new one we'll easily
get a max_extent_size that's set really really low, which will result in
a lot of fragmentation.  We need to make sure we're resetting the
max_extent_size when we add a new chunk or add new space.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-19 12:20:03 +02:00
Boris Brezillon
042c1a5a60 NAND core changes:
- Two batchs of cleanups of the NAND API, including:
   * Deprecating a lot of interfaces (now replaced by ->exec_op()).
   * Moving code in separate drivers (JEDEC, ONFI), in private files
     (internals), in platform drivers, etc.
   * Functions/structures reordering.
   * Exclusive use of the nand_chip structure instead of the MTD one
     all across the subsystem.
 - Addition of the nand_wait_readrdy/rdy_op() helpers.
 
 Raw NAND controllers drivers changes:
 - Various coccinelle patches.
 - Marvell:
   * Use regmap_update_bits() for syscon access.
   * More documentation.
   * BCH failure path rework.
   * More layouts to be supported.
   * IRQ handler complete() condition fixed.
 - Fsl_ifc:
   * SRAM initialization fixed for newer controller versions.
 - Denali:
   * Fix licenses mismatch and use a SPDX tag.
   * Set SPARE_AREA_SKIP_BYTES register to 8 if unset.
 - Qualcomm:
   * Do not include dma-direct.h.
 - Docg4:
   * Removed.
 - Ams-delta:
   * Use of a GPIO lookup table
   * Internal machinery changes.
 
 Raw NAND chip drivers changes:
 - Toshiba:
   * Add support for Toshiba memory BENAND
   * Pass a single nand_chip object to the status helper.
 - ESMT:
   * New driver to retrieve the ECC requirements from the 5th ID byte.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEE9HuaYnbmDhq/XIDIJWrqGEe9VoQFAlvDdvgACgkQJWrqGEe9
 VoQ0zQf/QfXYjLDbC/1FXckEkEeVZPRbXQMf+ptdTo0inJefy9LMOvhRKmuABCxn
 wwPa99aqQ+aHESnU/s2d8FUrhNRUf35lmRIAP512lJk79tBfgo73l5WSc6UTg6hO
 w3KbigOvTi7wCKQYGDFOBZuoFFCvpNsIEtWFuodCbYG1GmeAaL7L2fhbri9JbcE2
 zPqmCv0w5BUrPR7i/taI5gD+KDtRdmCEyfb7wQoJy2XZVO0SKqGsBhDhYWqwCpck
 eP2UzQ4cFT8X1S0/AfUpbowc6cftdqIwW9VRY4inx3ALQ+sKxrkJ+0E3gL9lnZ+F
 p/is0gTdvDmnOjXfwGcL7GXGy0nN8Q==
 =EKv1
 -----END PGP SIGNATURE-----

Merge tag 'nand/for-4.20' of git://git.infradead.org/linux-mtd into mtd/next

NAND core changes:
- Two batchs of cleanups of the NAND API, including:
  * Deprecating a lot of interfaces (now replaced by ->exec_op()).
  * Moving code in separate drivers (JEDEC, ONFI), in private files
    (internals), in platform drivers, etc.
  * Functions/structures reordering.
  * Exclusive use of the nand_chip structure instead of the MTD one
    all across the subsystem.
- Addition of the nand_wait_readrdy/rdy_op() helpers.

Raw NAND controllers drivers changes:
- Various coccinelle patches.
- Marvell:
  * Use regmap_update_bits() for syscon access.
  * More documentation.
  * BCH failure path rework.
  * More layouts to be supported.
  * IRQ handler complete() condition fixed.
- Fsl_ifc:
  * SRAM initialization fixed for newer controller versions.
- Denali:
  * Fix licenses mismatch and use a SPDX tag.
  * Set SPARE_AREA_SKIP_BYTES register to 8 if unset.
- Qualcomm:
  * Do not include dma-direct.h.
- Docg4:
  * Removed.
- Ams-delta:
  * Use of a GPIO lookup table
  * Internal machinery changes.

Raw NAND chip drivers changes:
- Toshiba:
  * Add support for Toshiba memory BENAND
  * Pass a single nand_chip object to the status helper.
- ESMT:
  * New driver to retrieve the ECC requirements from the 5th ID byte.
2018-10-19 09:20:09 +02:00
Benjamin Coddington
fc187514d8 nfs: remove redundant call to nfs_context_set_write_error()
We don't need to call this in the direct, read, or pnfs resend paths and
the only other caller is the write path in nfs_page_async_flush() which
already checks and sets the pg_error on the context.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-10-18 17:20:57 -04:00
Benjamin Coddington
fdbd1a2e4a nfs: Fix a missed page unlock after pg_doio()
We must check pg_error and call error_cleanup after any call to pg_doio.
Currently, we are skipping the unlock of a page if we encounter an error in
nfs_pageio_complete() before handing off the work to the RPC layer.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-10-18 17:20:57 -04:00
Mike Marshall
22fc9db296 orangefs: no need to check for service_operation returns > 0
service_operation returns > 0 is undefined.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-10-18 14:05:46 -04:00
Mike Marshall
34e6148a2c orangefs: some error code paths missed kmem_cache_free
If a slab cache object is allocated, it needs to be freed eventually,
certainly before anyone unloads the module that allocated it.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-10-18 13:58:25 -04:00
Mike Marshall
b5d72cdc53 orangefs: don't let orangefs_iget return NULL.
Suggested by Dan Carpenter.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-10-18 13:52:23 -04:00
Mike Marshall
56249998b2 orangefs: don't let orangefs_new_inode return NULL
Suggested by Dan Carpenter

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-10-18 13:47:16 -04:00
Eric Sandeen
fa520c47ea fscache: Fix out of bound read in long cookie keys
fscache_set_key() can incur an out-of-bounds read, reported by KASAN:

 BUG: KASAN: slab-out-of-bounds in fscache_alloc_cookie+0x5b3/0x680 [fscache]
 Read of size 4 at addr ffff88084ff056d4 by task mount.nfs/32615

and also reported by syzbot at https://lkml.org/lkml/2018/7/8/236

  BUG: KASAN: slab-out-of-bounds in fscache_set_key fs/fscache/cookie.c:120 [inline]
  BUG: KASAN: slab-out-of-bounds in fscache_alloc_cookie+0x7a9/0x880 fs/fscache/cookie.c:171
  Read of size 4 at addr ffff8801d3cc8bb4 by task syz-executor907/4466

This happens for any index_key_len which is not divisible by 4 and is
larger than the size of the inline key, because the code allocates exactly
index_key_len for the key buffer, but the hashing loop is stepping through
it 4 bytes (u32) at a time in the buf[] array.

Fix this by calculating how many u32 buffers we'll need by using
DIV_ROUND_UP, and then using kcalloc() to allocate a precleared allocation
buffer to hold the index_key, then using that same count as the hashing
index limit.

Fixes: ec0328e46d ("fscache: Maintain a catalogue of allocated cookies")
Reported-by: syzbot+a95b989b2dde8e806af8@syzkaller.appspotmail.com
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-18 11:32:21 +02:00
David Howells
1ff22883b0 fscache: Fix incomplete initialisation of inline key space
The inline key in struct rxrpc_cookie is insufficiently initialized,
zeroing only 3 of the 4 slots, therefore an index_key_len between 13 and 15
bytes will end up hashing uninitialized memory because the memcpy only
partially fills the last buf[] element.

Fix this by clearing fscache_cookie objects on allocation rather than using
the slab constructor to initialise them.  We're going to pretty much fill
in the entire struct anyway, so bringing it into our dcache writably
shouldn't incur much overhead.

This removes the need to do clearance in fscache_set_key() (where we aren't
doing it correctly anyway).

Also, we don't need to set cookie->key_len in fscache_set_key() as we
already did it in the only caller, so remove that.

Fixes: ec0328e46d ("fscache: Maintain a catalogue of allocated cookies")
Reported-by: syzbot+a95b989b2dde8e806af8@syzkaller.appspotmail.com
Reported-by: Eric Sandeen <sandeen@redhat.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-18 11:32:21 +02:00
Al Viro
169b803397 cachefiles: fix the race between cachefiles_bury_object() and rmdir(2)
the victim might've been rmdir'ed just before the lock_rename();
unlike the normal callers, we do not look the source up after the
parents are locked - we know it beforehand and just recheck that it's
still the child of what used to be its parent.  Unfortunately,
the check is too weak - we don't spot a dead directory since its
->d_parent is unchanged, dentry is positive, etc.  So we sail all
the way to ->rename(), with hosting filesystems _not_ expecting
to be asked renaming an rmdir'ed subdirectory.

The fix is easy, fortunately - the lock on parent is sufficient for
making IS_DEADDIR() on child safe.

Cc: stable@vger.kernel.org
Fixes: 9ae326a690 (CacheFiles: A cache that backs onto a mounted filesystem)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-18 11:32:21 +02:00
Christoph Hellwig
96987eea53 xfs: cancel COW blocks before swapext
We need to make sure we have no outstanding COW blocks before we swap
extents, as there is nothing preventing us from having preallocated COW
delalloc on either inode that swapext is called on.  That case can
easily be reproduced by running generic/324 in always_cow mode:

[  620.760572] XFS: Assertion failed: tip->i_delayed_blks == 0, file: fs/xfs/xfs_bmap_util.c, line: 1669
[  620.761608] ------------[ cut here ]------------
[  620.762171] kernel BUG at fs/xfs/xfs_message.c:102!
[  620.762732] invalid opcode: 0000 [#1] SMP PTI
[  620.763272] CPU: 0 PID: 24153 Comm: xfs_fsr Tainted: G        W         4.19.0-rc1+ #4182
[  620.764203] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
[  620.765202] RIP: 0010:assfail+0x20/0x28
[  620.765646] Code: 31 ff e8 83 fc ff ff 0f 0b c3 48 89 f1 41 89 d0 48 c7 c6 48 ca 8d 82 48 89 fa 38
[  620.767758] RSP: 0018:ffffc9000898bc10 EFLAGS: 00010202
[  620.768359] RAX: 0000000000000000 RBX: ffff88012f14ba40 RCX: 0000000000000000
[  620.769174] RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff828560d9
[  620.769982] RBP: ffff88012f14b300 R08: 0000000000000000 R09: 0000000000000000
[  620.770788] R10: 000000000000000a R11: f000000000000000 R12: ffffc9000898bc98
[  620.771638] R13: ffffc9000898bc9c R14: ffff880130b5e2b8 R15: ffff88012a1fa2a8
[  620.772504] FS:  00007fdc36e0fbc0(0000) GS:ffff88013ba00000(0000) knlGS:0000000000000000
[  620.773475] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  620.774168] CR2: 00007fdc3604d000 CR3: 0000000132afc000 CR4: 00000000000006f0
[  620.774978] Call Trace:
[  620.775274]  xfs_swap_extent_forks+0x2a0/0x2e0
[  620.775792]  xfs_swap_extents+0x38b/0xab0
[  620.776256]  xfs_ioc_swapext+0x121/0x140
[  620.776709]  xfs_file_ioctl+0x328/0xc90
[  620.777154]  ? rcu_read_lock_sched_held+0x50/0x60
[  620.777694]  ? xfs_iunlock+0x233/0x260
[  620.778127]  ? xfs_setattr_nonsize+0x3be/0x6a0
[  620.778647]  do_vfs_ioctl+0x9d/0x680
[  620.779071]  ? ksys_fchown+0x47/0x80
[  620.779552]  ksys_ioctl+0x35/0x70
[  620.780040]  __x64_sys_ioctl+0x11/0x20
[  620.780530]  do_syscall_64+0x4b/0x190
[  620.780927]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  620.781467] RIP: 0033:0x7fdc364d0f07
[  620.781900] Code: b3 66 90 48 8b 05 81 5f 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 28
[  620.784044] RSP: 002b:00007ffe2a766038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  620.784896] RAX: ffffffffffffffda RBX: 0000000000000025 RCX: 00007fdc364d0f07
[  620.785667] RDX: 0000560296ca2fc0 RSI: 00000000c0c0586d RDI: 0000000000000005
[  620.786398] RBP: 0000000000000025 R08: 0000000000001200 R09: 0000000000000000
[  620.787283] R10: 0000000000000432 R11: 0000000000000246 R12: 0000000000000005
[  620.788051] R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000006
[  620.788927] Modules linked in:
[  620.789340] ---[ end trace 9503b7417ffdbdb0 ]---
[  620.790065] RIP: 0010:assfail+0x20/0x28
[  620.790642] Code: 31 ff e8 83 fc ff ff 0f 0b c3 48 89 f1 41 89 d0 48 c7 c6 48 ca 8d 82 48 89 fa 38
[  620.793038] RSP: 0018:ffffc9000898bc10 EFLAGS: 00010202
[  620.793609] RAX: 0000000000000000 RBX: ffff88012f14ba40 RCX: 0000000000000000
[  620.794317] RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff828560d9
[  620.795025] RBP: ffff88012f14b300 R08: 0000000000000000 R09: 0000000000000000
[  620.795778] R10: 000000000000000a R11: f000000000000000 R12: ffffc9000898bc98
[  620.796675] R13: ffffc9000898bc9c R14: ffff880130b5e2b8 R15: ffff88012a1fa2a8
[  620.797782] FS:  00007fdc36e0fbc0(0000) GS:ffff88013ba00000(0000) knlGS:0000000000000000
[  620.798908] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  620.799594] CR2: 00007fdc3604d000 CR3: 0000000132afc000 CR4: 00000000000006f0
[  620.800424] Kernel panic - not syncing: Fatal exception
[  620.801191] Kernel Offset: disabled
[  620.801597] ---[ end Kernel panic - not syncing: Fatal exception ]---

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:21:55 +11:00
Brian Foster
efc3289cf8 xfs: clear ail delwri queued bufs on unmount of shutdown fs
In the typical unmount case, the AIL is forced out by the unmount
sequence before the xfsaild task is stopped. Since AIL items are
removed on writeback completion, this means that the AIL
->ail_buf_list delwri queue has been drained. This is not always
true in the shutdown case, however.

It's possible for buffers to sit on a delwri queue for a period of
time across submission attempts if said items are locked or have
been relogged and pinned since first added to the queue. If the
attempt to log such an item results in a log I/O error, the error
processing can shutdown the fs, remove the item from the AIL, stale
the buffer (dropping the LRU reference) and clear its delwri queue
state. The latter bit means the buffer will be released from a
delwri queue on the next submission attempt, but this might never
occur if the filesystem has shutdown and the AIL is empty.

This means that such buffers are held indefinitely by the AIL delwri
queue across destruction of the AIL. Aside from being a memory leak,
these buffers can also hold references to in-core perag structures.
The latter problem manifests as a generic/475 failure, reproducing
the following asserts at unmount time:

  XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0,
	file: fs/xfs/xfs_mount.c, line: 151
  XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0,
	file: fs/xfs/xfs_mount.c, line: 132

To prevent this problem, clear the AIL delwri queue as a final step
before xfsaild() exit. The !empty state should never occur in the
normal case, so add an assert to catch unexpected problems going
forward.

[dgc: add comment explaining need for xfs_buf_delwri_cancel() after
 calling xfs_buf_delwri_submit_nowait().]

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:21:49 +11:00
Carlos Maiolino
26ca39015e xfs: use offsetof() in place of offset macros for __xfsstats
Most offset macro mess is used in xfs_stats_format() only, and we can
simply get the right offsets using offsetof(), instead of several macros
to mark the offsets inside __xfsstats structure.

Replace all XFSSTAT_END_* macros by a single helper macro to get the
right offset into __xfsstats, and use this helper in xfs_stats_format()
directly.

The quota stats code, still looks a bit cleaner when using XFSSTAT_*
macros, so, this patch also defines XFSSTAT_START_XQMSTAT and
XFSSTAT_END_XQMSTAT locally to that code. This also should prevent
offset mistakes when updates are done into __xfsstats.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:21:39 +11:00
Carlos Maiolino
41657e5507 xfs: Fix xqmstats offsets in /proc/fs/xfs/xqmstat
The addition of FIBT, RMAP and REFCOUNT changed the offsets into
__xfssats structure.

This caused xqmstat_proc_show() to display garbage data via
/proc/fs/xfs/xqmstat, once it relies on the offsets marked via macros.

Fix it.

Fixes: 00f4e4f9 xfs: add rmap btree stats infrastructure
Fixes: aafc3c24 xfs: support the XFS_BTNUM_FINOBT free inode btree type
Fixes: 46eeb521 xfs: introduce refcount btree definitions
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:21:34 +11:00
Dave Chinner
37fd167824 xfs: fix use-after-free race in xfs_buf_rele
When looking at a 4.18 based KASAN use after free report, I noticed
that racing xfs_buf_rele() may race on dropping the last reference
to the buffer and taking the buffer lock. This was the symptom
displayed by the KASAN report, but the actual issue that was
reported had already been fixed in 4.19-rc1 by commit e339dd8d8b
("xfs: use sync buffer I/O for sync delwri queue submission").

Despite this, I think there is still an issue with xfs_buf_rele()
in this code:

        release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
        spin_lock(&bp->b_lock);
        if (!release) {
.....

If two threads race on the b_lock after both dropping a reference
and one getting dropping the last reference so release = true, we
end up with:

CPU 0				CPU 1
atomic_dec_and_lock()
				atomic_dec_and_lock()
				spin_lock(&bp->b_lock)
spin_lock(&bp->b_lock)
<spins>
				<release = true bp->b_lru_ref = 0>
				<remove from lists>
				freebuf = true
				spin_unlock(&bp->b_lock)
				xfs_buf_free(bp)
<gets lock, reading and writing freed memory>
<accesses freed memory>
spin_unlock(&bp->b_lock) <reads/writes freed memory>

IOWs, we can't safely take bp->b_lock after dropping the hold
reference because the buffer may go away at any time after we
drop that reference. However, this can be fixed simply by taking the
bp->b_lock before we drop the reference.

It is safe to nest the pag_buf_lock inside bp->b_lock as the
pag_buf_lock is only used to serialise against lookup in
xfs_buf_find() and no other locks are held over or under the
pag_buf_lock there. Make this clear by documenting the buffer lock
orders at the top of the file.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:21:29 +11:00
Allison Henderson
068f985a9e xfs: Add attibute remove and helper functions
This patch adds xfs_attr_remove_args. These sub-routines remove
the attributes specified in @args. We will use this later for setting
parent pointers as a deferred attribute operation.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:21:23 +11:00
Allison Henderson
2f3cd80919 xfs: Add attibute set and helper functions
This patch adds xfs_attr_set_args and xfs_bmap_set_attrforkoff.
These sub-routines set the attributes specified in @args.
We will use this later for setting parent pointers as a deferred
attribute operation.

[dgc: remove attr fork init code from xfs_attr_set_args().]
[dgc: xfs_attr_try_sf_addname() NULLs args.trans after commit.]
[dgc: correct sf add error handling.]

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:21:16 +11:00
Allison Henderson
4c74a56b9d xfs: Add helper function xfs_attr_try_sf_addname
This patch adds a subroutine xfs_attr_try_sf_addname
used by xfs_attr_set.  This subrotine will attempt to
add the attribute name specified in args in shortform,
as well and perform error handling previously done in
xfs_attr_set.

This patch helps to pre-simplify xfs_attr_set for reviewing
purposes and reduce indentation.  New function will be added
in the next patch.

[dgc: moved commit to helper function, too.]

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:20:50 +11:00
Allison Henderson
e2421f0b5f xfs: Move fs/xfs/xfs_attr.h to fs/xfs/libxfs/xfs_attr.h
This patch moves fs/xfs/xfs_attr.h to fs/xfs/libxfs/xfs_attr.h
since xfs_attr.c is in libxfs.  We will need these later in
xfsprogs.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:20:45 +11:00
Dave Chinner
56668a5cc4 xfs: issue log message on user force shutdown
The kernel only issues a log message that it's been shut down when
the filesystem triggers a shutdown itself. Hence there is no trace
in the log when a shutdown is triggered manually from userspace.
This can make it hard to see sequence of events in the log when
things go wrong, so make sure we always log a message when a
shutdown is run.

While there, clean up the logic flow so we don't have to continually
check if the shutdown trigger was user initiated before logging
shutdown messages.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:20:39 +11:00
Darrick J. Wong
38b6238eb6 xfs: fix buffer state management in xrep_findroot_block
We don't handle buffer state properly in online repair's findroot
routine.  If a buffer already has b_ops set, we don't ever want to touch
that, and we don't want to call the read verifiers on a buffer that
could be dirty (CRCs are only recomputed during log checkpoints).

Therefore, be more careful about what we do with a buffer -- if someone
else already attached ops that are not the ones for this btree type,
just ignore the buffer.  We only attach our btree type's buf ops if it
matches the magic/uuid and structure checks.

We also modify xfs_buf_read_map to allow callers to set buffer ops on a
DONE buffer with NULL ops so that repair doesn't leave behind buffers
which won't have buffers attached to them.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:20:35 +11:00
Darrick J. Wong
1aff5696f3 xfs: always assign buffer verifiers when one is provided
If a caller supplies buffer ops when trying to read a buffer and the
buffer doesn't already have buf ops assigned, ensure that the ops are
assigned to the buffer and the verifier is run on that buffer.

Note that current XFS code is careful to assign buffer ops after a
xfs_{trans_,}buf_read call in which ops were not supplied.  However, we
should apply ops defensively in case there is ever a coding mistake; and
an upcoming repair patch will need to be able to read a buffer without
assigning buf ops.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:20:30 +11:00
Darrick J. Wong
1002ff45ef xfs: xrep_findroot_block should reject root blocks with siblings
In xrep_findroot_block, if we find a candidate root block with sibling
pointers or sibling blocks on the same tree level, we should not return
that block as a tree root because root blocks cannot have siblings.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:20:26 +11:00
Adam Borowski
dddde68b8f xfs: add a define for statfs magic to uapi
Needed by userspace programs that call fstatfs().

It'd be natural to publish XFS_SB_MAGIC in uapi, but while these two
have identical values, they have different semantic meaning: one is
an enum cookie meant for statfs, the other a signature of the
on-disk format.

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:20:19 +11:00
Christoph Hellwig
4831822ff1 xfs: print dangling delalloc extents
Instead of just asserting that we have no delalloc space dangling
in an inode that gets freed print the actual offenders for debug
mode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:20:11 +11:00
Christoph Hellwig
032dc923b2 xfs: fix fork selection in xfs_find_trim_cow_extent
We should want to write directly into the data fork for blocks that don't
have an extent in the COW fork covering them yet.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:19:58 +11:00
Christoph Hellwig
d392bc81bb xfs: remove the unused trimmed argument from xfs_reflink_trim_around_shared
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:19:48 +11:00
Christoph Hellwig
fc439464e3 xfs: remove the unused shared argument to xfs_reflink_reserve_cow
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:19:37 +11:00
Christoph Hellwig
0365c5d6c3 xfs: handle zeroing in xfs_file_iomap_begin_delay
We only need to allocate blocks for zeroing for reflink inodes,
and for we currently have a special case for reflink files in
the otherwise direct I/O path that I'd like to get rid of.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:19:26 +11:00
Christoph Hellwig
daa79baefc xfs: remove suport for filesystems without unwritten extent flag
The option to enable unwritten extents was made default in 2003,
removed from mkfs in 2007, and cannot be disabled in v5.  We also
rely on it for a lot of common functionality, so filesystems without
it will run a completely untested and buggy code path.  Enabling the
support also is a simple bit flip using xfs_db, so legacy file
systems can still be brought forward.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:18:58 +11:00
Christoph Hellwig
97e5a6e6dc xfs: remove XFS_IO_INVALID
The invalid state isn't any different from a hole, so merge the two
states.  Use the more descriptive hole name, but keep it as the first
value of the enum to catch uninitialized fields.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-18 17:17:50 +11:00
Chengguang Xu
3642b29a63 fs/exofs: only use true/false for asignment of bool type variable
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-18 02:04:59 -04:00
Chengguang Xu
515f1867ad fs/exofs: fix potential memory leak in mount option parsing
There are some cases can cause memory leak when parsing
option 'osdname'.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-18 02:04:59 -04:00
nixiaoming
55338ac2a9 Delete invalid assignment statements in do_sendfile
Assigning value -EINVAL to "retval" here, but that stored value is
overwritten before it can be used.

retval = -EINVAL;
....
retval = rw_verify_area(WRITE, out.file, &out_pos, count);

value_overwrite: Overwriting previous write to "retval" with value
from rw_verify_area

delete invalid assignment statements

Signed-off-by: n00202754 <nixiaoming@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-18 01:42:49 -04:00
Yue Haibing
d65b1f2029 iomap: remove duplicated include from iomap.c
Remove duplicated include.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-18 01:18:55 -04:00
Mark Fasheh
85c95f208f vfs: dedupe should return EPERM if permission is not granted
Right now we return EINVAL if a process does not have permission to dedupe a
file. This was an oversight on my part. EPERM gives a true description of
the nature of our error, and EINVAL is already used for the case that the
filesystem does not support dedupe.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-17 21:15:39 -04:00
Mark Fasheh
5de4480ae7 vfs: allow dedupe of user owned read-only files
The permission check in vfs_dedupe_file_range_one() is too coarse - We only
allow dedupe of the destination file if the user is root, or they have the
file open for write.

This effectively limits a non-root user from deduping their own read-only
files. In addition, the write file descriptor that the user is forced to
hold open can prevent execution of files. As file data during a dedupe
does not change, the behavior is unexpected and this has caused a number of
issue reports. For an example, see:

https://github.com/markfasheh/duperemove/issues/129

So change the check so we allow dedupe on the target if:

- the root or admin is asking for it
- the process has write access
- the owner of the file is asking for the dedupe
- the process could get write access

That way users can open read-only and still get dedupe.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-17 21:15:39 -04:00
Lu Fengqi
0a9df0df17 btrfs: delayed-ref: extract find_first_ref_head from find_ref_head
The find_ref_head shouldn't return the first entry even if no exact match
is found. So move the hidden behavior to higher level.

Besides, remove the useless local variables in the btrfs_select_ref_head.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-17 19:21:00 +02:00
Filipe Manana
5ce555578e Btrfs: fix deadlock when writing out free space caches
When writing out a block group free space cache we can end deadlocking
with ourselves on an extent buffer lock resulting in a warning like the
following:

  [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs]
  [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G
    W I      4.16.8 #1
  [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs]
  [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246
  [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001
  [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20
  [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000
  [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630
  [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000
  [245043.404780] FS:  0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
  [245043.406050] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0
  [245043.408670] Call Trace:
  [245043.409977]  btrfs_search_slot+0x761/0xa60 [btrfs]
  [245043.411278]  btrfs_insert_empty_items+0x62/0xb0 [btrfs]
  [245043.412572]  btrfs_insert_item+0x5b/0xc0 [btrfs]
  [245043.413922]  btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs]
  [245043.415216]  do_chunk_alloc+0x1e5/0x2a0 [btrfs]
  [245043.416487]  find_free_extent+0xcd0/0xf60 [btrfs]
  [245043.417813]  btrfs_reserve_extent+0x96/0x1e0 [btrfs]
  [245043.419105]  btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs]
  [245043.420378]  __btrfs_cow_block+0x127/0x550 [btrfs]
  [245043.421652]  btrfs_cow_block+0xee/0x190 [btrfs]
  [245043.422979]  btrfs_search_slot+0x227/0xa60 [btrfs]
  [245043.424279]  ? btrfs_update_inode_item+0x59/0x100 [btrfs]
  [245043.425538]  ? iput+0x72/0x1e0
  [245043.426798]  write_one_cache_group.isra.49+0x20/0x90 [btrfs]
  [245043.428131]  btrfs_start_dirty_block_groups+0x102/0x420 [btrfs]
  [245043.429419]  btrfs_commit_transaction+0x11b/0x880 [btrfs]
  [245043.430712]  ? start_transaction+0x8e/0x410 [btrfs]
  [245043.432006]  transaction_kthread+0x184/0x1a0 [btrfs]
  [245043.433341]  kthread+0xf0/0x130
  [245043.434628]  ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs]
  [245043.435928]  ? kthread_create_worker_on_cpu+0x40/0x40
  [245043.437236]  ret_from_fork+0x1f/0x30
  [245043.441054] ---[ end trace 15abaa2aaf36827f ]---

This is because at write_one_cache_group() when we are COWing a leaf from
the extent tree we end up allocating a new block group (chunk) and,
because we have hit a threshold on the number of bytes reserved for system
chunks, we attempt to finalize the creation of new block groups from the
current transaction, by calling btrfs_create_pending_block_groups().
However here we also need to modify the extent tree in order to insert
a block group item, and if the location for this new block group item
happens to be in the same leaf that we were COWing earlier, we deadlock
since btrfs_search_slot() tries to write lock the extent buffer that we
locked before at write_one_cache_group().

We have already hit similar cases in the past and commit d9a0540a79
("Btrfs: fix deadlock when finalizing block group creation") fixed some
of those cases by delaying the creation of pending block groups at the
known specific spots that could lead to a deadlock. This change reworks
that commit to be more generic so that we don't have to add similar logic
to every possible path that can lead to a deadlock. This is done by
making __btrfs_cow_block() disallowing the creation of new block groups
(setting the transaction's can_flush_pending_bgs to false) before it
attempts to allocate a new extent buffer for either the extent, chunk or
device trees, since those are the trees that pending block creation
modifies. Once the new extent buffer is allocated, it allows creation of
pending block groups to happen again.

This change depends on a recent patch from Josef which is not yet in
Linus' tree, named "btrfs: make sure we create all new block groups" in
order to avoid occasional warnings at btrfs_trans_release_chunk_metadata().

Fixes: d9a0540a79 ("Btrfs: fix deadlock when finalizing block group creation")
CC: stable@vger.kernel.org # 4.4+
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753
Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/
Reported-by: E V <eliventer@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-17 17:46:24 +02:00
Filipe Manana
7ed586d0a8 Btrfs: fix assertion on fsync of regular file when using no-holes feature
When using the NO_HOLES feature and logging a regular file, we were
expecting that if we find an inline extent, that either its size in RAM
(uncompressed and unenconded) matches the size of the file or if it does
not, that it matches the sector size and it represents compressed data.
This assertion does not cover a case where the length of the inline extent
is smaller than the sector size and also smaller the file's size, such
case is possible through fallocate. Example:

  $ mkfs.btrfs -f -O no-holes /dev/sdb
  $ mount /dev/sdb /mnt

  $ xfs_io -f -c "pwrite -S 0xb60 0 21" /mnt/foobar
  $ xfs_io -c "falloc 40 40" /mnt/foobar
  $ xfs_io -c "fsync" /mnt/foobar

In the above example we trigger the assertion because the inline extent's
length is 21 bytes while the file size is 80 bytes. The fallocate() call
merely updated the file's size and did not touch the existing inline
extent, as expected.

So fix this by adjusting the assertion so that an inline extent length
smaller than the file size is valid if the file size is smaller than the
filesystem's sector size.

A test case for fstests follows soon.

Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Fixes: a89ca6f24f ("Btrfs: fix fsync after truncate when no_holes feature is enabled")
CC: stable@vger.kernel.org # 4.14+
Link: https://lore.kernel.org/linux-btrfs/CAE5jQCfRSBC7n4pUTFJcmHh109=gwyT9mFkCOL+NKfzswmR=_Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-17 17:44:53 +02:00
Filipe Manana
3527a018c0 Btrfs: fix null pointer dereference on compressed write path error
At inode.c:compress_file_range(), under the "free_pages_out" label, we can
end up dereferencing the "pages" pointer when it has a NULL value. This
case happens when "start" has a value of 0 and we fail to allocate memory
for the "pages" pointer. When that happens we jump to the "cont" label and
then enter the "if (start == 0)" branch where we immediately call the
cow_file_range_inline() function. If that function returns 0 (success
creating an inline extent) or an error (like -ENOMEM for example) we jump
to the "free_pages_out" label and then access "pages[i]" leading to a NULL
pointer dereference, since "nr_pages" has a value greater than zero at
that point.

Fix this by setting "nr_pages" to 0 when we fail to allocate memory for
the "pages" pointer.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119
Fixes: 771ed689d2 ("Btrfs: Optimize compressed writeback and reads")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-17 17:42:46 +02:00
Jens Axboe
b93f654d73 f2fs: remove request_list check in is_idle()
This doesn't work on stacked devices, and it doesn't work on
blk-mq devices. The request_list is only used on legacy, which
we don't have much of anymore, and soon won't have any of.

Kill the check.

Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-16 19:09:39 -07:00
Jaegeuk Kim
730746ce88 f2fs: allow to mount, if quota is failed
Since we can use the filesystem without quotas till next boot.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-16 09:37:00 -07:00
Sahitya Tummala
6390398ec7 f2fs: update REQ_TIME in f2fs_cross_rename()
Update REQ_TIME in the missing path - f2fs_cross_rename().

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
[Jaegeuk Kim: add it in f2fs_rename()]
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-16 09:37:00 -07:00
Sahitya Tummala
c75f2feb80 f2fs: do not update REQ_TIME in case of error conditions
The REQ_TIME should be updated only in case of success cases
as followed at all other places in the file system.

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-16 09:37:00 -07:00
Chao Yu
3b30eb19dc f2fs: remove unneeded disable_nat_bits()
Commit 7735730d39 ("f2fs: fix to propagate error from __get_meta_page()")
added disable_nat_bits() in error path of __get_nat_bitmaps(), but it's
unneeded, beause we will fail mount, we won't have chance to change nid
usage status w/o nat full/empty bitmaps.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-16 09:36:59 -07:00
Chao Yu
850971b23f f2fs: remove unused sbi->trigger_ssr_threshold
Commit a2a12b679f ("f2fs: export SSR allocation threshold") introduced
two threshold .min_ssr_sections and .trigger_ssr_threshold, but only
.min_ssr_sections is used, so just remove redundant one for cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-16 09:36:59 -07:00
Chao Yu
ed15ba1415 f2fs: shrink sbi->sb_lock coverage in set_file_temperature()
file_set_{cold,hot} doesn't need holding sbi->sb_lock, so moving them
out of the lock.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-16 09:36:59 -07:00
Chao Yu
4dada3fd70 f2fs: use rb_*_cached friends
As rbtree supports caching leftmost node natively, update f2fs codes
to use rb_*_cached helpers to speed up leftmost node visiting.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-16 09:36:59 -07:00
Chao Yu
ef2a007134 f2fs: fix to recover cold bit of inode block during POR
Testcase to reproduce this bug:
1. mkfs.f2fs /dev/sdd
2. mount -t f2fs /dev/sdd /mnt/f2fs
3. touch /mnt/f2fs/file
4. sync
5. chattr +A /mnt/f2fs/file
6. xfs_io -f /mnt/f2fs/file -c "fsync"
7. godown /mnt/f2fs
8. umount /mnt/f2fs
9. mount -t f2fs /dev/sdd /mnt/f2fs
10. chattr -A /mnt/f2fs/file
11. xfs_io -f /mnt/f2fs/file -c "fsync"
12. umount /mnt/f2fs
13. mount -t f2fs /dev/sdd /mnt/f2fs
14. lsattr /mnt/f2fs/file

-----------------N- /mnt/f2fs/file

But actually, we expect the corrct result is:

-------A---------N- /mnt/f2fs/file

The reason is in step 9) we missed to recover cold bit flag in inode
block, so later, in fsync, we will skip write inode block due to below
condition check, result in lossing data in another SPOR.

f2fs_fsync_node_pages()
	if (!IS_DNODE(page) || !is_cold_node(page))
		continue;

Note that, I guess that some non-dir inode has already lost cold bit
during POR, so in order to reenable recovery for those inode, let's
try to recover cold bit in f2fs_iget() to save more fsynced data.

Fixes: c56675750d ("f2fs: remove unneeded set_cold_node()")
Cc: <stable@vger.kernel.org> 4.17+
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-16 09:36:59 -07:00
Chao Yu
48018b4cfd f2fs: submit cached bio to avoid endless PageWriteback
When migrating encrypted block from background GC thread, we only add
them into f2fs inner bio cache, but forget to submit the cached bio, it
may cause potential deadlock when we are waiting page writebacked, fix
it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-16 09:36:59 -07:00
Daniel Rosenberg
4354994f09 f2fs: checkpoint disabling
Note that, it requires "f2fs: return correct errno in f2fs_gc".

This adds a lightweight non-persistent snapshotting scheme to f2fs.

To use, mount with the option checkpoint=disable, and to return to
normal operation, remount with checkpoint=enable. If the filesystem
is shut down before remounting with checkpoint=enable, it will revert
back to its apparent state when it was first mounted with
checkpoint=disable. This is useful for situations where you wish to be
able to roll back the state of the disk in case of some critical
failure.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
[Jaegeuk Kim: use SB_RDONLY instead of MS_RDONLY]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-16 09:36:39 -07:00
Hou Tao
92e2921f7e jffs2: free jffs2_sb_info through jffs2_kill_sb()
When an invalid mount option is passed to jffs2, jffs2_parse_options()
will fail and jffs2_sb_info will be freed, but then jffs2_sb_info will
be used (use-after-free) and freeed (double-free) in jffs2_kill_sb().

Fix it by removing the buggy invocation of kfree() when getting invalid
mount options.

Fixes: 92abc475d8 ("jffs2: implement mount option parsing and compression overriding")
Cc: stable@kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
2018-10-16 10:34:28 +02:00
Bob Peterson
c9e58fb2aa gfs2: write revokes should traverse sd_ail1_list in reverse
All the other functions that deal with the sd_ail_list run the list
from the tail back to the head, iow, in reverse. We should do the
same while writing revokes, otherwise we might miss removing entries
properly from the list when we hit the limit of how many revokes we
can write at one time (based on block size, which determines how
many block pointers will fit in the revoke block).

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-10-15 12:17:30 -05:00
Lu Fengqi
d9352794da btrfs: switch return_bigger to bool in find_ref_head
Using bool is more suitable than int here, and add the comment about the
return_bigger.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:41 +02:00
Lu Fengqi
7c8616278b btrfs: remove fs_info from btrfs_should_throttle_delayed_refs
The avg_delayed_ref_runtime can be referenced from the transaction
handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:41 +02:00
Lu Fengqi
af9b8a0e20 btrfs: remove fs_info from btrfs_check_space_for_delayed_refs
It can be referenced from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:41 +02:00
Lu Fengqi
9e920a6f03 btrfs: delayed-ref: pass delayed_refs directly to btrfs_delayed_ref_lock
Since trans is only used for referring to delayed_refs, there is no need
to pass it instead of delayed_refs to btrfs_delayed_ref_lock().

No functional change.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:41 +02:00
Lu Fengqi
5637c74b01 btrfs: delayed-ref: pass delayed_refs directly to btrfs_select_ref_head
Since trans is only used for referring to delayed_refs, there is no need
to pass it instead of delayed_refs to btrfs_select_ref_head().  No
functional change.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:40 +02:00
Lu Fengqi
b90e22ba48 btrfs: qgroup: move the qgroup->members check out from (!qgroup)'s else branch
There is no reason to put this check in (!qgroup)'s else branch because
if qgroup is null, it will goto out directly. So move it out to reduce
indentation level.  No functional change.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:40 +02:00
Qu Wenruo
06bbf67244 btrfs: relocation: Remove redundant tree level check
Commit 581c176041 ("btrfs: Validate child tree block's level and first
key") has made tree block level check mandatory.

So if tree block level doesn't match, we won't get a valid extent
buffer.  The extra WARN_ON() check can be removed completely.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:40 +02:00
Qu Wenruo
98ff7b94e4 btrfs: relocation: Cleanup while loop using rbtree_postorder_for_each_entry_safe
And add one line comment explaining what we're doing for each loop.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:40 +02:00
Qu Wenruo
3628b4ca64 btrfs: qgroup: Avoid calling qgroup functions if qgroup is not enabled
Some qgroup trace events like btrfs_qgroup_release_data() and
btrfs_qgroup_free_delayed_ref() can still be triggered even if qgroup is
not enabled.

This is caused by the lack of qgroup status check before calling some
qgroup functions.  Thankfully the functions can handle quota disabled
case well and just do nothing for qgroup disabled case.

This patch will do earlier check before triggering related trace events.

And for enabled <-> disabled race case:

1) For enabled->disabled case
   Disable will wipe out all qgroups data including reservation and
   excl/rfer. Even if we leak some reservation or numbers, it will
   still be cleared, so nothing will go wrong.

2) For disabled -> enabled case
   Current btrfs_qgroup_release_data() will use extent_io tree to ensure
   we won't underflow reservation. And for delayed_ref we use
   head->qgroup_reserved to record the reserved space, so in that case
   head->qgroup_reserved should be 0 and we won't underflow.

CC: stable@vger.kernel.org # 4.14+
Reported-by: Chris Murphy <lists@colorremedies.com>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtQau7DtuUUeycCkZ36qjbKuxNzsgqJ7+sJ6W0dK_NLE3w@mail.gmail.com/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:40 +02:00
Filipe Manana
0f375eed92 Btrfs: fix wrong dentries after fsync of file that got its parent replaced
In a scenario like the following:

  mkdir /mnt/A               # inode 258
  mkdir /mnt/B               # inode 259
  touch /mnt/B/bar           # inode 260

  sync

  mv /mnt/B/bar /mnt/A/bar
  mv -T /mnt/A /mnt/B
  fsync /mnt/B/bar

  <power fail>

After replaying the log we end up with file bar having 2 hard links, both
with the name 'bar' and one in the directory with inode number 258 and the
other in the directory with inode number 259. Also, we end up with the
directory inode 259 still existing and with the directory inode 258 still
named as 'A', instead of 'B'. In this scenario, file 'bar' should only
have one hard link, located at directory inode 258, the directory inode
259 should not exist anymore and the name for directory inode 258 should
be 'B'.

This incorrect behaviour happens because when attempting to log the old
parents of an inode, we skip any parents that no longer exist. Fix this
by forcing a full commit if an old parent no longer exists.

A test case for fstests follows soon.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:39 +02:00
Filipe Manana
f2d72f42d5 Btrfs: fix warning when replaying log after fsync of a tmpfile
When replaying a log which contains a tmpfile (which necessarily has a
link count of 0) we end up calling inc_nlink(), at
fs/btrfs/tree-log.c:replay_one_buffer(), which produces a warning like
the following:

  [195191.943673] WARNING: CPU: 0 PID: 6924 at fs/inode.c:342 inc_nlink+0x33/0x40
  [195191.943723] CPU: 0 PID: 6924 Comm: mount Not tainted 4.19.0-rc6-btrfs-next-38 #1
  [195191.943724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
  [195191.943726] RIP: 0010:inc_nlink+0x33/0x40
  [195191.943728] RSP: 0018:ffffb96e425e3870 EFLAGS: 00010246
  [195191.943730] RAX: 0000000000000000 RBX: ffff8c0d1e6af4f0 RCX: 0000000000000006
  [195191.943731] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8c0d1e6af4f0
  [195191.943731] RBP: 0000000000000097 R08: 0000000000000001 R09: 0000000000000000
  [195191.943732] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb96e425e3a60
  [195191.943733] R13: ffff8c0d10cff0c8 R14: ffff8c0d0d515348 R15: ffff8c0d78a1b3f8
  [195191.943735] FS:  00007f570ee24480(0000) GS:ffff8c0dfb200000(0000) knlGS:0000000000000000
  [195191.943736] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [195191.943737] CR2: 00005593286277c8 CR3: 00000000bb8f2006 CR4: 00000000003606f0
  [195191.943739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [195191.943740] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [195191.943741] Call Trace:
  [195191.943778]  replay_one_buffer+0x797/0x7d0 [btrfs]
  [195191.943802]  walk_up_log_tree+0x1c1/0x250 [btrfs]
  [195191.943809]  ? rcu_read_lock_sched_held+0x3f/0x70
  [195191.943825]  walk_log_tree+0xae/0x1d0 [btrfs]
  [195191.943840]  btrfs_recover_log_trees+0x1d7/0x4d0 [btrfs]
  [195191.943856]  ? replay_dir_deletes+0x280/0x280 [btrfs]
  [195191.943870]  open_ctree+0x1c3b/0x22a0 [btrfs]
  [195191.943887]  btrfs_mount_root+0x6b4/0x800 [btrfs]
  [195191.943894]  ? rcu_read_lock_sched_held+0x3f/0x70
  [195191.943899]  ? pcpu_alloc+0x55b/0x7c0
  [195191.943906]  ? mount_fs+0x3b/0x140
  [195191.943908]  mount_fs+0x3b/0x140
  [195191.943912]  ? __init_waitqueue_head+0x36/0x50
  [195191.943916]  vfs_kern_mount+0x62/0x160
  [195191.943927]  btrfs_mount+0x134/0x890 [btrfs]
  [195191.943936]  ? rcu_read_lock_sched_held+0x3f/0x70
  [195191.943938]  ? pcpu_alloc+0x55b/0x7c0
  [195191.943943]  ? mount_fs+0x3b/0x140
  [195191.943952]  ? btrfs_remount+0x570/0x570 [btrfs]
  [195191.943954]  mount_fs+0x3b/0x140
  [195191.943956]  ? __init_waitqueue_head+0x36/0x50
  [195191.943960]  vfs_kern_mount+0x62/0x160
  [195191.943963]  do_mount+0x1f9/0xd40
  [195191.943967]  ? memdup_user+0x4b/0x70
  [195191.943971]  ksys_mount+0x7e/0xd0
  [195191.943974]  __x64_sys_mount+0x21/0x30
  [195191.943977]  do_syscall_64+0x60/0x1b0
  [195191.943980]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [195191.943983] RIP: 0033:0x7f570e4e524a
  [195191.943986] RSP: 002b:00007ffd83589478 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
  [195191.943989] RAX: ffffffffffffffda RBX: 0000563f335b2060 RCX: 00007f570e4e524a
  [195191.943990] RDX: 0000563f335b2240 RSI: 0000563f335b2280 RDI: 0000563f335b2260
  [195191.943992] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020
  [195191.943993] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000563f335b2260
  [195191.943994] R13: 0000563f335b2240 R14: 0000000000000000 R15: 00000000ffffffff
  [195191.944002] irq event stamp: 8688
  [195191.944010] hardirqs last  enabled at (8687): [<ffffffff9cb004c3>] console_unlock+0x503/0x640
  [195191.944012] hardirqs last disabled at (8688): [<ffffffff9ca037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
  [195191.944018] softirqs last  enabled at (8638): [<ffffffff9cc0a5d1>] __set_page_dirty_nobuffers+0x101/0x150
  [195191.944020] softirqs last disabled at (8634): [<ffffffff9cc26bbe>] wb_wakeup_delayed+0x2e/0x60
  [195191.944022] ---[ end trace 5d6e873a9a0b811a ]---

This happens because the inode does not have the flag I_LINKABLE set,
which is a runtime only flag, not meant to be persisted, set when the
inode is created through open(2) if the flag O_EXCL is not passed to it.
Except for the warning, there are no other consequences (like corruptions
or metadata inconsistencies).

Since it's pointless to replay a tmpfile as it would be deleted in a
later phase of the log replay procedure (it has a link count of 0), fix
this by not logging tmpfiles and if a tmpfile is found in a log (created
by a kernel without this change), skip the replay of the inode.

A test case for fstests follows soon.

Fixes: 471d557afe ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
CC: stable@vger.kernel.org # 4.18+
Reported-by: Martin Steigerwald <martin@lichtvoll.de>
Link: https://lore.kernel.org/linux-btrfs/3666619.NTnn27ZJZE@merkaba/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:39 +02:00
Josef Bacik
ad80cf50c3 btrfs: drop min_size from evict_refill_and_join
We don't need it, rsv->size is set once and never changes throughout
its lifetime, so just use that for the reserve size.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:39 +02:00
Josef Bacik
e187831e18 btrfs: assert on non-empty delayed iputs
I ran into an issue where there was some reference being held on an
inode that I couldn't track.  This assert wasn't triggered, but it at
least rules out we're doing something stupid.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:39 +02:00
Josef Bacik
545e3366db btrfs: make sure we create all new block groups
Allocating new chunks modifies both the extent and chunk tree, which can
trigger new chunk allocations.  So instead of doing list_for_each_safe,
just do while (!list_empty()) so we make sure we don't exit with other
pending bg's still on our list.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:39 +02:00
Josef Bacik
553cceb496 btrfs: reset max_extent_size on clear in a bitmap
We need to clear the max_extent_size when we clear bits from a bitmap
since it could have been from the range that contains the
max_extent_size.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:39 +02:00
Josef Bacik
84de76a2fb btrfs: protect space cache inode alloc with GFP_NOFS
If we're allocating a new space cache inode it's likely going to be
under a transaction handle, so we need to use memalloc_nofs_save() in
order to avoid deadlocks, and more importantly lockdep messages that
make xfstests fail.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:38 +02:00
Josef Bacik
f45c752b65 btrfs: release metadata before running delayed refs
We want to release the unused reservation we have since it refills the
delayed refs reserve, which will make everything go smoother when
running the delayed refs if we're short on our reservation.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:38 +02:00
Liu Bo
5239834016 Btrfs: kill btrfs_clear_path_blocking
Btrfs's btree locking has two modes, spinning mode and blocking mode,
while searching btree, locking is always acquired in spinning mode and
then converted to blocking mode if necessary, and in some hot paths we may
switch the locking back to spinning mode by btrfs_clear_path_blocking().

When acquiring locks, both of reader and writer need to wait for blocking
readers and writers to complete before doing read_lock()/write_lock().

The problem is that btrfs_clear_path_blocking() needs to switch nodes
in the path to blocking mode at first (by btrfs_set_path_blocking) to
make lockdep happy before doing its actual clearing blocking job.

When switching to blocking mode from spinning mode, it consists of

step 1) bumping up blocking readers counter and
step 2) read_unlock()/write_unlock(),

this has caused serious ping-pong effect if there're a great amount of
concurrent readers/writers, as waiters will be woken up and go to
sleep immediately.

1) Killing this kind of ping-pong results in a big improvement in my 1600k
files creation script,

MNT=/mnt/btrfs
mkfs.btrfs -f /dev/sdf
mount /dev/def $MNT
time fsmark  -D  10000  -S0  -n  100000  -s  0  -L  1 -l /tmp/fs_log.txt \
        -d  $MNT/0  -d  $MNT/1 \
        -d  $MNT/2  -d  $MNT/3 \
        -d  $MNT/4  -d  $MNT/5 \
        -d  $MNT/6  -d  $MNT/7 \
        -d  $MNT/8  -d  $MNT/9 \
        -d  $MNT/10  -d  $MNT/11 \
        -d  $MNT/12  -d  $MNT/13 \
        -d  $MNT/14  -d  $MNT/15

w/o patch:
real    2m27.307s
user    0m12.839s
sys     13m42.831s

w/ patch:
real    1m2.273s
user    0m15.802s
sys     8m16.495s

1.1) latency histogram from funclatency[1]

Overall with the patch, there're ~50% less write lock acquisition and
the 95% max latency that write lock takes also reduces to ~100ms from
>500ms.

--------------------------------------------
w/o patch:
--------------------------------------------
Function = btrfs_tree_lock
     msecs               : count     distribution
         0 -> 1          : 2385222  |****************************************|
         2 -> 3          : 37147    |                                        |
         4 -> 7          : 20452    |                                        |
         8 -> 15         : 13131    |                                        |
        16 -> 31         : 3877     |                                        |
        32 -> 63         : 3900     |                                        |
        64 -> 127        : 2612     |                                        |
       128 -> 255        : 974      |                                        |
       256 -> 511        : 165      |                                        |
       512 -> 1023       : 13       |                                        |

Function = btrfs_tree_read_lock
     msecs               : count     distribution
         0 -> 1          : 6743860  |****************************************|
         2 -> 3          : 2146     |                                        |
         4 -> 7          : 190      |                                        |
         8 -> 15         : 38       |                                        |
        16 -> 31         : 4        |                                        |

--------------------------------------------
w/ patch:
--------------------------------------------
Function = btrfs_tree_lock
     msecs               : count     distribution
         0 -> 1          : 1318454  |****************************************|
         2 -> 3          : 6800     |                                        |
         4 -> 7          : 3664     |                                        |
         8 -> 15         : 2145     |                                        |
        16 -> 31         : 809      |                                        |
        32 -> 63         : 219      |                                        |
        64 -> 127        : 10       |                                        |

Function = btrfs_tree_read_lock
     msecs               : count     distribution
         0 -> 1          : 6854317  |****************************************|
         2 -> 3          : 2383     |                                        |
         4 -> 7          : 601      |                                        |
         8 -> 15         : 92       |                                        |

2) dbench also proves the improvement,
dbench -t 120 -D /mnt/btrfs 16

w/o patch:
Throughput 158.363 MB/sec

w/ patch:
Throughput 449.52 MB/sec

3) xfstests didn't show any additional failures.

One thing to note is that callers may set path->leave_spinning to have
all nodes in the path stay in spinning mode, which means callers are
ready to not sleep before releasing the path, but it won't cause
problems if they don't want to sleep in blocking mode.

[1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:38 +02:00
David Sterba
9b142115ed btrfs: dev-replace: remove pointless assert in write unlock
The value of blocking_readers is increased only when the lock is taken
for read, no way we can fail the condition with the write lock.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:38 +02:00
David Sterba
7f8d236ae1 btrfs: dev-replace: move replace members out of fs_info
The replace_wait and bio_counter were mistakenly added to fs_info in
commit c404e0dc2c ("Btrfs: fix use-after-free in the finishing
procedure of the device replace"), but they logically belong to
fs_info::dev_replace. Besides, bio_counter is a very generic name and is
confusing in bare fs_info context.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:38 +02:00
David Sterba
aa144bfeaa btrfs: dev-replace: avoid useless lock on error handling path
The exit sequence in btrfs_dev_replace_start does not allow to simply
add a label to the right place so the error handling after starting
transaction failure jumps there. Currently there's a lock that pairs
with the unlock in the section, which is unnecessary and only raises
questions.  Add a variable to track the locking status and avoid the
extra locking.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:38 +02:00
David Sterba
9f6cbcbb09 btrfs: open code btrfs_after_dev_replace_commit
Too trivial, the purpose can be simply documented in a comment.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:37 +02:00
David Sterba
e37abe9725 btrfs: open code btrfs_dev_replace_stats_inc
The wrapper is too trivial, open coding does not make it less readable.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:37 +02:00
David Sterba
7fb2eced10 btrfs: open code btrfs_dev_replace_clear_lock_blocking
There's a single caller and the function name does not say it's actually
taking the lock, so open coding makes it more explicit.

For now, btrfs_dev_replace_read_lock is used instead of read_lock so
it's paired with the unlocking wrapper in the same block.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:37 +02:00
David Sterba
3280f87457 btrfs: remove btrfs_dev_replace::read_locks
This member seems to be copied from the extent_buffer locking scheme and
is at least used to assert that the read lock/unlock is properly nested.
In some way. While the _inc/_dec are called inside the read lock
section, the asserts are both inside and outside, so the ordering is not
guaranteed and we can see read/inc/dec ordered in any way
(theoretically).

A missing call of btrfs_dev_replace_clear_lock_blocking could cause
unexpected read_locks count, so this at least looks like a valid
assertion, but this will become unnecessary with later updates.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:37 +02:00
Qu Wenruo
f556faa46e btrfs: tree-checker: Check level for leaves and nodes
Although we have tree level check at tree read runtime, it's completely
based on its parent level.
We still need to do accurate level check to avoid invalid tree blocks
sneak into kernel space.

The check itself is simple, for leaf its level should always be 0.
For nodes its level should be in range [1, BTRFS_MAX_LEVEL - 1].

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:37 +02:00
Qu Wenruo
3d0174f78e btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group
For qgroup_trace_extent_swap(), if we find one leaf that needs to be
traced, we will also iterate all file extents and trace them.

This is OK if we're relocating data block groups, but if we're
relocating metadata block groups, balance code itself has ensured that
both subtree of file tree and reloc tree contain the same contents.

That's to say, if we're relocating metadata block groups, all file
extents in reloc and file tree should match, thus no need to trace them.
This should reduce the total number of dirty extents processed in metadata
block group balance.

[[Benchmark]] (with all previous enhancement)
Hardware:
	VM 4G vRAM, 8 vCPUs,
	disk is using 'unsafe' cache mode,
	backing device is SAMSUNG 850 evo SSD.
	Host has 16G ram.

Mkfs parameter:
	--nodesize 4K (To bump up tree size)

Initial subvolume contents:
	4G data copied from /usr and /lib.
	(With enough regular small files)

Snapshots:
	16 snapshots of the original subvolume.
	each snapshot has 3 random files modified.

balance parameter:
	-m

So the content should be pretty similar to a real world root fs layout.

                     | v4.19-rc1    | w/ patchset    | diff (*)
---------------------------------------------------------------
relocated extents    | 22929        | 22851          | -0.3%
qgroup dirty extents | 227757       | 140886         | -38.1%
time (sys)           | 65.253s      | 37.464s        | -42.6%
time (real)          | 74.032s      | 44.722s        | -39.6%

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:36 +02:00
Qu Wenruo
2cd86d309b btrfs: qgroup: Don't trace subtree if we're dropping reloc tree
Reloc tree doesn't contribute to qgroup numbers, as we have accounted
them at balance time (see replace_path()).

Skipping the unneeded subtree tracing should reduce the overhead.

[[Benchmark]]
Hardware:
	VM 4G vRAM, 8 vCPUs,
	disk is using 'unsafe' cache mode,
	backing device is SAMSUNG 850 evo SSD.
	Host has 16G ram.

Mkfs parameter:
	--nodesize 4K (To bump up tree size)

Initial subvolume contents:
	4G data copied from /usr and /lib.
	(With enough regular small files)

Snapshots:
	16 snapshots of the original subvolume.
	each snapshot has 3 random files modified.

balance parameter:
	-m

So the content should be pretty similar to a real world root fs layout.

                     | v4.19-rc1    | w/ patchset    | diff (*)
---------------------------------------------------------------
relocated extents    | 22929        | 22900          | -0.1%
qgroup dirty extents | 227757       | 167139         | -26.6%
time (sys)           | 65.253s      | 50.123s        | -23.2%
time (real)          | 74.032s      | 52.551s        | -29.0%

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:36 +02:00
Qu Wenruo
5f527822be btrfs: qgroup: Use generation-aware subtree swap to mark dirty extents
Before this patch, with quota enabled during balance, we need to mark
the whole subtree dirty for quota.

E.g.
OO = Old tree blocks (from file tree)
NN = New tree blocks (from reloc tree)

        File tree (src)		          Reloc tree (dst)
            OO (a)                              NN (a)
           /  \                                /  \
     (b) OO    OO (c)                    (b) NN    NN (c)
        /  \  /  \                          /  \  /  \
       OO  OO OO OO (d)                    OO  OO OO NN (d)

For old balance + quota case, quota will mark the whole src and dst tree
dirty, including all the 3 old tree blocks in reloc tree.

It's doable for small file tree or new tree blocks are all located at
lower level.

But for large file tree or new tree blocks are all located at higher
level, this will lead to mark the whole tree dirty, and be unbelievably
slow.

This patch will change how we handle such balance with quota enabled
case.

Now we will search from (b) and (c) for any new tree blocks whose
generation is equal to @last_snapshot, and only mark them dirty.

In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d).
(NN(a) will be traced when COW happens for nodeptr modification).  And
also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when COW
happens for nodeptr modification.)

For above case, we could skip 3 tree blocks, but for larger tree, we can
skip tons of unmodified tree blocks, and hugely speed up balance.

This patch will introduce a new function,
btrfs_qgroup_trace_subtree_swap(), which will do the following main
work:

1) Read out real root eb
   And setup basic dst_path for later calls
2) Call qgroup_trace_new_subtree_blocks()
   To trace all new tree blocks in reloc tree and their counter
   parts in the file tree.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:36 +02:00
Qu Wenruo
ea49f3e73c btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree
Introduce new function, qgroup_trace_new_subtree_blocks(), to iterate
all new tree blocks in a reloc tree.
So that qgroup could skip unrelated tree blocks during balance, which
should hugely speedup balance speed when quota is enabled.

The function qgroup_trace_new_subtree_blocks() itself only cares about
new tree blocks in reloc tree.

All its main works are:

1) Read out tree blocks according to parent pointers

2) Do recursive depth-first search
   Will call the same function on all its children tree blocks, with
   search level set to current level -1.
   And will also skip all children whose generation is smaller than
   @last_snapshot.

3) Call qgroup_trace_extent_swap() to trace tree blocks

So although we have parameter list related to source file tree, it's not
used at all, but only passed to qgroup_trace_extent_swap().
Thus despite the tree read code, the core should be pretty short and all
about recursive depth-first search.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:36 +02:00
Qu Wenruo
25982561db btrfs: qgroup: Introduce function to trace two swaped extents
Introduce a new function, qgroup_trace_extent_swap(), which will be used
later for balance qgroup speedup.

The basis idea of balance is swapping tree blocks between reloc tree and
the real file tree.

The swap will happen in highest tree block, but there may be a lot of
tree blocks involved.

For example:
 OO = Old tree blocks
 NN = New tree blocks allocated during balance

          File tree (257)                  Reloc tree for 257
L2              OO                                NN
              /    \                            /    \
L1          OO      OO (a)                    OO      NN (a)
           / \     / \                       / \     / \
L0       OO   OO OO   OO                   OO   OO NN   NN
                 (b)  (c)                          (b)  (c)

When calling qgroup_trace_extent_swap(), we will pass:
@src_eb = OO(a)
@dst_path = [ nodes[1] = NN(a), nodes[0] = NN(c) ]
@dst_level = 0
@root_level = 1

In that case, qgroup_trace_extent_swap() will search from OO(a) to
reach OO(c), then mark both OO(c) and NN(c) as qgroup dirty.

The main work of qgroup_trace_extent_swap() can be split into 3 parts:

1) Tree search from @src_eb
   It should acts as a simplified btrfs_search_slot().
   The key for search can be extracted from @dst_path->nodes[dst_level]
   (first key).

2) Mark the final tree blocks in @src_path and @dst_path qgroup dirty
   NOTE: In above case, OO(a) and NN(a) won't be marked qgroup dirty.
   They should be marked during preivous (@dst_level = 1) iteration.

3) Mark file extents in leaves dirty
   We don't have good way to pick out new file extents only.
   So we still follow the old method by scanning all file extents in
   the leave.

This function can free us from keeping two pathes, thus later we only need
to care about how to iterate all new tree blocks in reloc tree.

Signed-off-by: Qu Wenruo <wqu@suse.com>
[ copy changelog to function comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:36 +02:00
Qu Wenruo
c337e7b02f btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted
Number of qgroup dirty extents is directly linked to the performance
overhead, so add a new trace event, trace_qgroup_num_dirty_extents(), to
record how many dirty extents is processed in
btrfs_qgroup_account_extents().

This will be pretty handy to analyze later balance performance
improvement.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:36 +02:00
Qu Wenruo
fa6ac71524 btrfs: relocation: Add basic extent backref related comments for build_backref_tree
fs/btrfs/relocation.c:build_backref_tree() is some code from 2009 era,
although it works pretty fine, it's not that easy to understand.
Especially combined with the complex btrfs backref format.

This patch adds some basic comment for the backref build part of the
code, making it less hard to read, at least for backref searching part.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:35 +02:00
Omar Sandoval
4779cc0424 Btrfs: get rid of btrfs_symlink_aops
The only aops we define for symlinks are identical to the aops for
regular files. This has been the case since symlink support was added in
commit 2b8d99a723 ("Btrfs: symlinks and hard links"). As far as I can
tell, there wasn't a good reason to have separate aops then, and there
isn't now, so let's just do what most other filesystems do and reuse the
same structure.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:35 +02:00
Chris Mason
7703bdd8d2 Btrfs: don't clean dirty pages during buffered writes
During buffered writes, we follow this basic series of steps:

again:
	lock all the pages
	wait for writeback on all the pages
	Take the extent range lock
	wait for ordered extents on the whole range
	clean all the pages

	if (copy_from_user_in_atomic() hits a fault) {
		drop our locks
		goto again;
	}

	dirty all the pages
	release all the locks

The extra waiting, cleaning and locking are there to make sure we don't
modify pages in flight to the drive, after they've been crc'd.

If some of the pages in the range were already dirty when the write
began, and we need to goto again, we create a window where a dirty page
has been cleaned and unlocked.  It may be reclaimed before we're able to
lock it again, which means we'll read the old contents off the drive and
lose any modifications that had been pending writeback.

We don't actually need to clean the pages.  All of the other locking in
place makes sure we don't start IO on the pages, so we can just leave
them dirty for the duration of the write.

Fixes: 73d59314e6 (the original btrfs merge)
CC: stable@vger.kernel.org # v4.4+
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:35 +02:00
David Sterba
818255feec btrfs: use common helper instead of open coding a bit test
The helper does the same math and we take care about the special case
when flags is 0 too.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:35 +02:00
Nikolay Borisov
0110a4c434 btrfs: refactor __btrfs_run_delayed_refs loop
Refactor the delayed refs loop by using the newly introduced
btrfs_run_delayed_refs_for_head function. This greatly simplifies
__btrfs_run_delayed_refs and makes it more obvious what is happening.

We now have 1 loop which iterates the existing delayed_heads and then
each selected ref head is processed by the new helper. All existing
semantics of the code are preserved so no functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:35 +02:00
Nikolay Borisov
e726138676 btrfs: Factor out loop processing all refs of a head
This patch introduces a new helper encompassing the implicit inner loop
in __btrfs_run_delayed_refs which processes all the refs for a given
head. The code is mostly copy/paste, the only difference is that if we
detect a newer reference then -EAGAIN is returned so that callers can
react correctly.

Also, at the end of the loop the head is relocked and
btrfs_merge_delayed_refs is run again to retain the pre-refactoring
semantics.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:35 +02:00
Nikolay Borisov
b1cdbcb53a btrfs: Factor out ref head locking code in __btrfs_run_delayed_refs
This is in preparation to refactor the giant loop in
__btrfs_run_delayed_refs. As a first step define a new function
which implements acquiring a reference to a btrfs_delayed_refs_head and
use it. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:34 +02:00
David Sterba
b2fa11547b btrfs: tests: polish ifdefs around testing helper
Avoid the inline ifdefs and use two sections for self-tests enabled and
disabled.

Though there could be no ifdef and unconditional test_bit of
BTRFS_FS_STATE_DUMMY_FS_INFO, the static inline can help to optimize out
any code that would depend on conditions using btrfs_is_testing.

As this is only for the testing code, drop unlikely().

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:34 +02:00
David Sterba
a654666a34 btrfs: tests: group declarations of self-test helpers
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:34 +02:00
David Sterba
57ec5fb478 btrfs: tests: move testing members of struct btrfs_root to the end
The data used only for tests are better placed at the end of the
structure so that they don't change the structure layout. All new
members of btrfs_root should be placed before.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:34 +02:00
David Sterba
9c36396c2a btrfs: tests: add separate stub for find_lock_delalloc_range
The helper find_lock_delalloc_range is now conditionally built static,
dpending on whether the self-tests are enabled or not. There's a macro
that is supposed to hide the export, used only once. To discourage
further use, drop it an add a public wrapper for the helper needed by
tests.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:34 +02:00
Liu Bo
ecf160b424 Btrfs: preftree: use rb_first_cached
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).

While resolving indirect refs and missing refs, it always looks for the
first rb entry in a while loop, it's helpful to use rb_first_cached
instead.

For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:33 +02:00
Liu Bo
07e1ce096d Btrfs: extent_map: use rb_first_cached
rb_first_cached() trades an extra pointer "leftmost" for doing the
same job as rb_first() but in O(1).

As evict_inode_truncate_pages() removes all extent mapping by always
looking for the first rb entry, it's helpful to use rb_first_cached
instead.

For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:33 +02:00
Liu Bo
03a1d4c891 Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root
rb_first_cached() trades an extra pointer "leftmost" for doing the same job as
rb_first() but in O(1).

Functions manipulating delayed_item need to get the first entry, this converts
it to use rb_first_cached().

For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:33 +02:00
Liu Bo
e3d0396563 Btrfs: delayed-refs: use rb_first_cached for ref_tree
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).

Functions manipulating href->ref_tree need to get the first entry, this
converts href->ref_tree to use rb_first_cached().

For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:33 +02:00
Liu Bo
5c9d028b3b Btrfs: delayed-refs: use rb_first_cached for href_root
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).

Functions manipulating href_root need to get the first entry, this
converts href_root to use rb_first_cached().

This patch is first in the sequenct of similar updates to other rbtrees
and this is analysis of the expected behaviour and improvements.

There's a common pattern:

while (node = rb_first) {
        entry = rb_entry(node)
        next = rb_next(node)
        rb_erase(node)
        cleanup(entry)
}

rb_first needs to traverse the tree up to logN depth, rb_erase can
completely reshuffle the tree. With the caching we'll skip the traversal
in rb_first.  That's a cached memory access vs looped pointer
dereference trade-off that IMHO has a clear winner.

Measurements show there's not much difference in a sample tree with
10000 nodes: 4.5s / rb_first and 4.8s / rb_first_cached. Real effects of
caching and pointer chasing are unpredictable though.

Further optimzations can be done to avoid the expensive rb_erase step.
In some cases it's ok to process the nodes in any order, so the tree can
be traversed in post-order, not rebalancing the children nodes and just
calling free. Care must be taken regarding the next node.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog from mail discussions ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:33 +02:00
Josef Bacik
3aa7c7a31c btrfs: wait on caching when putting the bg cache
While testing my backport I noticed there was a panic if I ran
generic/416 generic/417 generic/418 all in a row.  This just happened to
uncover a race where we had outstanding IO after we destroy all of our
workqueues, and then we'd go to queue the endio work on those free'd
workqueues.

This is because we aren't waiting for the caching threads to be done
before freeing everything up, so to fix this make sure we wait on any
outstanding caching that's being done before we free up the block group,
so we're sure to be done with all IO by the time we get to
btrfs_stop_all_workers().  This fixes the panic I was seeing
consistently in testing.

------------[ cut here ]------------
kernel BUG at fs/btrfs/volumes.c:6112!
SMP PTI
Modules linked in:
CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 4.16.0-02155-g3553e54a578d-dirty #875
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
Workqueue: btrfs-cache btrfs_cache_helper
RIP: 0010:btrfs_map_bio+0x346/0x370
RSP: 0000:ffffc900061e79d0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff880071542e00 RCX: 0000000000533000
RDX: ffff88006bb74380 RSI: 0000000000000008 RDI: ffff880078160000
RBP: 0000000000000001 R08: ffff8800781cd200 R09: 0000000000503000
R10: ffff88006cd21200 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8800781cd200 R15: ffff880071542e00
FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000817ffc4 CR3: 0000000078314000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 btree_submit_bio_hook+0x8a/0xd0
 submit_one_bio+0x5d/0x80
 read_extent_buffer_pages+0x18a/0x320
 btree_read_extent_buffer_pages+0xbc/0x200
 ? alloc_extent_buffer+0x359/0x3e0
 read_tree_block+0x3d/0x60
 read_block_for_search.isra.30+0x1a5/0x360
 btrfs_search_slot+0x41b/0xa10
 btrfs_next_old_leaf+0x212/0x470
 caching_thread+0x323/0x490
 normal_work_helper+0xc5/0x310
 process_one_work+0x141/0x340
 worker_thread+0x44/0x3c0
 kthread+0xf8/0x130
 ? process_one_work+0x340/0x340
 ? kthread_bind+0x10/0x10
 ret_from_fork+0x35/0x40
RIP: btrfs_map_bio+0x346/0x370 RSP: ffffc900061e79d0
---[ end trace 827eb13e50846033 ]---
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:33 +02:00
Jeff Mahoney
fee7acc361 btrfs: keep trim from interfering with transaction commits
Commit 499f377f49 (btrfs: iterate over unused chunk space in FITRIM)
fixed free space trimming, but introduced latency when it was running.
This is due to it pinning the transaction using both a incremented
refcount and holding the commit root sem for the duration of a single
trim operation.

This was to ensure safety but it's unnecessary.  We already hold the the
chunk mutex so we know that the chunk we're using can't be allocated
while we're trimming it.

In order to check against chunks allocated already in this transaction,
we need to check the pending chunks list.  To to that safely without
joining the transaction (or attaching than then having to commit it) we
need to ensure that the dev root's commit root doesn't change underneath
us and the pending chunk lists stays around until we're done with it.

We can ensure the former by holding the commit root sem and the latter
by pinning the transaction.  We do this now, but the critical section
covers the trim operation itself and we don't need to do that.

This patch moves the pinning and unpinning logic into helpers and unpins
the transaction after performing the search and check for pending
chunks.

Limiting the critical section of the transaction pinning improves the
latency substantially on slower storage (e.g. image files over NFS).

Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:32 +02:00
Jeff Mahoney
0be88e367f btrfs: don't attempt to trim devices that don't support it
We check whether any device the file system is using supports discard in
the ioctl call, but then we attempt to trim free extents on every device
regardless of whether discard is supported.  Due to the way we mask off
EOPNOTSUPP, we can end up issuing the trim operations on each free range
on devices that don't support it, just wasting time.

Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:32 +02:00
Jeff Mahoney
d4e329de5e btrfs: iterate all devices during trim, instead of fs_devices::alloc_list
btrfs_trim_fs iterates over the fs_devices->alloc_list while holding the
device_list_mutex.  The problem is that ->alloc_list is protected by the
chunk mutex.  We don't want to hold the chunk mutex over the trim of the
entire file system.  Fortunately, the ->dev_list list is protected by
the dev_list mutex and while it will give us all devices, including
read-only devices, we already just skip the read-only devices.  Then we
can continue to take and release the chunk mutex while scanning each
device.

Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:32 +02:00
Qu Wenruo
6ba9fc8e62 btrfs: Ensure btrfs_trim_fs can trim the whole filesystem
[BUG]
fstrim on some btrfs only trims the unallocated space, not trimming any
space in existing block groups.

[CAUSE]
Before fstrim_range passed to btrfs_trim_fs(), it gets truncated to
range [0, super->total_bytes).  So later btrfs_trim_fs() will only be
able to trim block groups in range [0, super->total_bytes).

While for btrfs, any bytenr aligned to sectorsize is valid, since btrfs
uses its logical address space, there is nothing limiting the location
where we put block groups.

For filesystem with frequent balance, it's quite easy to relocate all
block groups and bytenr of block groups will start beyond
super->total_bytes.

In that case, btrfs will not trim existing block groups.

[FIX]
Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
can get the unmodified range, which is normally set to [0, U64_MAX].

Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: f4c697e640 ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl")
CC: <stable@vger.kernel.org> # v4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:32 +02:00
Qu Wenruo
93bba24d4b btrfs: Enhance btrfs_trim_fs function to handle error better
Function btrfs_trim_fs() doesn't handle errors in a consistent way. If
error happens when trimming existing block groups, it will skip the
remaining blocks and continue to trim unallocated space for each device.

The return value will only reflect the final error from device trimming.

This patch will fix such behavior by:

1) Recording the last error from block group or device trimming
   The return value will also reflect the last error during trimming.
   Make developer more aware of the problem.

2) Continuing trimming if possible
   If we failed to trim one block group or device, we could still try
   the next block group or device.

3) Report number of failures during block group and device trimming
   It would be less noisy, but still gives user a brief summary of
   what's going wrong.

Such behavior can avoid confusion for cases like failure to trim the
first block group and then only unallocated space is trimmed.

Reported-by: Chris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add bg_ret and dev_ret to the messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:32 +02:00
Jeff Mahoney
5c06147128 btrfs: fix error handling in btrfs_dev_replace_start
When we fail to start a transaction in btrfs_dev_replace_start, we leave
dev_replace->replace_start set to STARTED but clear ->srcdev and
->tgtdev.  Later, that can result in an Oops in
btrfs_dev_replace_progress when having state set to STARTED or SUSPENDED
implies that ->srcdev is valid.

Also fix error handling when the state is already STARTED or SUSPENDED
while starting.  That, too, will clear ->srcdev and ->tgtdev even though
it doesn't own them.  This should be an impossible case to hit since we
should be protected by the BTRFS_FS_EXCL_OP bit being set.  Let's add an
ASSERT there while we're at it.

Fixes: e93c89c1aa (Btrfs: add new sources for device replace code)
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:31 +02:00
zhong jiang
c1766dd782 btrfs: change remove_extent_mapping to return void
remove_extent_mapping uses the variable "ret" for return value, but it
is not modified after initialzation. Further, I find that any of the
callers do not handle the return value and the callees are only simple
functions so the return values does not need to be passed.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:31 +02:00
Nikolay Borisov
315bed43fe btrfs: handle error of get_old_root
In btrfs_search_old_slot get_old_root is always used with the assumption
it cannot fail. However, this is not true in rare circumstance it can
fail and return null. This will lead to null point dereference when the
header is read. Fix this by checking the return value and properly
handling NULL by setting ret to -EIO and returning gracefully.

Coverity-id: 1087503
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:31 +02:00
Nikolay Borisov
28bee48982 btrfs: Remove logically dead code from btrfs_orphan_cleanup
In btrfs_orphan_cleanup the final 'if (ret) goto out' cannot ever be
executed. This is due to the last assignment to 'ret' depending on the
return value of btrfs_iget. If an error other than -ENOENT is returned
then the loop is prematurely terminated by 'goto out'.  On the other
hand, if the error value is ENOENT then a subsequent if branch is
executed that always re-assigns 'ret' and in case it's an error just
terminates the loop. No functional changes.

Coverity-id: 1437392
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:31 +02:00
Liu Bo
4183c52ce8 Btrfs: remove wait_ordered_range in btrfs_evict_inode
When we delete an inode,

btrfs_evict_inode() {
    truncate_inode_pages_final()
        truncate_inode_pages_range()
            lock_page()
            truncate_cleanup_page()
                 btrfs_invalidatepage()
                      wait_on_page_writeback
                           btrfs_lookup_ordered_range()
                 cancel_dirty_page()
           unlock_page()
     ...
     btrfs_wait_ordered_range()
     ...

As VFS has called ->invalidatepage() to get all ordered extents done (if
there are any) and truncated all page cache pages (no dirty pages to
writeback after this step), wait_ordered_range() is just a noop.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:31 +02:00
Liu Bo
abb57ef3ff Btrfs: skip set_page_dirty if eb pages are already dirty
As long as @eb is marked with EXTENT_BUFFER_DIRTY, all of its pages
are dirty, so no need to set pages dirty again.

Ftrace showed that the loop took 10us on my dev box, so removing this
can save us at least 10us if eb is already dirty and otherwise avoid a
potentially expensive calls to set_page_dirty.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:31 +02:00
Liu Bo
51995c399b Btrfs: assert page dirty bit on extent buffer pages
Just in case that someone breaks the rule that pages are dirty as long
as eb is dirty. The next patch will dirty the pages conditionally.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:30 +02:00
Liu Bo
98e6b1eb40 Btrfs: remove unnecessary level check in balance_level
In the callchain:

btrfs_search_slot()
   if (level != 0)
      setup_nodes_for_search()
          balance_level()

It is just impossible to have level=0 in balance_level, we can drop the
check.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:30 +02:00
Liu Bo
3cf5068f3d Btrfs: unify error handling of btrfs_lookup_dir_item
Unify the error handling of directory item lookups using IS_ERR_OR_NULL.
No functional changes.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:30 +02:00
Liu Bo
3b2fd80160 Btrfs: use args in the correct order for kcalloc in btrfsic_read_block
kcalloc is defined as:

  kcalloc(size_t n, size_t size, gfp_t flags)

Although this won't cause problems in practice, btrfsic_read_block()
uses kcalloc with n and size in the opposite order.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:30 +02:00
Nikolay Borisov
a27a94c2b0 btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly
Instead of returning an error value and using one of the parameters for
returning the actual object we are interested in just refactor the
function to directly return btrfs_device *. Also bubble up the error
handling for the special BTRFS_ERROR_DEV_MISSING_NOT_FOUND value into
btrfs_rm_device. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:30 +02:00
Nikolay Borisov
6c05040702 btrfs: Make btrfs_find_device_missing_or_by_path return directly a device
This function returns a numeric error value and additionally the
device found in one of its input parameters. Simplify this by making
the function directly return a pointer to btrfs_device. Additionally
adjust the caller to handle the case when we want to remove the
'missing' device and ENOENT is returned to return the expected
positive error value, parsed by progs. Finally, unexport the function
since it's not called outside of volume.c. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:29 +02:00
Nikolay Borisov
b444ad46b2 btrfs: Make btrfs_find_device_by_path return struct btrfs_device
Currently this function returns an error code as well as uses one of
its arguments as a return value for struct btrfs_device. Change the
function so that it returns btrfs_device directly and use the usual
"encode error in pointer" mechanics if something goes wrong. No
functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:29 +02:00
Jeff Mahoney
374b0e2d6b btrfs: fix error handling in free_log_tree
When we hit an I/O error in free_log_tree->walk_log_tree during file system
shutdown we can crash due to there not being a valid transaction handle.

Use btrfs_handle_fs_error when there's no transaction handle to use.

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
  IP: free_log_tree+0xd2/0x140 [btrfs]
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
  Modules linked in: <modules>
  CPU: 2 PID: 23544 Comm: umount Tainted: G        W        4.12.14-kvmsmall #9 SLE15 (unreleased)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
  task: ffff96bfd3478880 task.stack: ffffa7cf40d78000
  RIP: 0010:free_log_tree+0xd2/0x140 [btrfs]
  RSP: 0018:ffffa7cf40d7bd10 EFLAGS: 00010282
  RAX: 00000000fffffffb RBX: 00000000fffffffb RCX: 0000000000000002
  RDX: 0000000000000000 RSI: ffff96c02f07d4c8 RDI: 0000000000000282
  RBP: ffff96c013cf1000 R08: ffff96c02f07d4c8 R09: ffff96c02f07d4d0
  R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
  R13: ffff96c005e800c0 R14: ffffa7cf40d7bdb8 R15: 0000000000000000
  FS:  00007f17856bcfc0(0000) GS:ffff96c03f600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000060 CR3: 0000000045ed6002 CR4: 00000000003606e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   ? wait_for_writer+0xb0/0xb0 [btrfs]
   btrfs_free_log+0x17/0x30 [btrfs]
   btrfs_drop_and_free_fs_root+0x9a/0xe0 [btrfs]
   btrfs_free_fs_roots+0xc0/0x130 [btrfs]
   ? wait_for_completion+0xf2/0x100
   close_ctree+0xea/0x2e0 [btrfs]
   ? kthread_stop+0x161/0x260
   generic_shutdown_super+0x6c/0x120
   kill_anon_super+0xe/0x20
   btrfs_kill_super+0x13/0x100 [btrfs]
   deactivate_locked_super+0x3f/0x70
   cleanup_mnt+0x3b/0x70
   task_work_run+0x78/0x90
   exit_to_usermode_loop+0x77/0xa6
   do_syscall_64+0x1c5/0x1e0
   entry_SYSCALL_64_after_hwframe+0x42/0xb7
  RIP: 0033:0x7f1784f90827
  RSP: 002b:00007ffdeeb03118 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 0000556a60c62970 RCX: 00007f1784f90827
  RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556a60c62b50
  RBP: 0000000000000000 R08: 0000000000000005 R09: 00000000ffffffff
  R10: 0000556a60c63900 R11: 0000000000000246 R12: 0000556a60c62b50
  R13: 00007f17854a81c4 R14: 0000000000000000 R15: 0000000000000000
  RIP: free_log_tree+0xd2/0x140 [btrfs] RSP: ffffa7cf40d7bd10
  CR2: 0000000000000060

Fixes: 681ae50917 ("Btrfs: cleanup reserved space when freeing tree log on error")
CC: <stable@vger.kernel.org> # v3.13
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:29 +02:00
Misono Tomohiro
380fd06640 btrfs: remove redundant variable from btrfs_cross_ref_exist
Since commit d7df2c796d ("Btrfs attach delayed ref updates to delayed
ref heads"), check_delayed_ref() won't return -ENOENT.

In btrfs_cross_ref_exist(), two variables 'ret' and 'ret2' are
originally used to handle -ENOENT error case.  Since the code is not
needed anymore, let's just remove 'ret2'.

Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:29 +02:00
Liu Bo
e49aabd973 Btrfs: set leave_spinning in btrfs_get_extent
Unless it's going to read inline extents from btree leaf to page,
btrfs_get_extent won't sleep during the period of holding path lock.

This sets leave_spinning at first and sets path to blocking mode right
before reading inline extent if that's the case.  The benefit is that a
path in spinning mode typically has lower impact (faster) on waiters
rather than that in the blocking mode.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:29 +02:00
Liu Bo
de2c6615dc Btrfs: fix alignment in declaration and prototype of btrfs_get_extent
This fixes btrfs_get_extent to be consistent with our existing
declaration style.

Note: For the record, indentation styles that are accepted are both,
aligning under the opening ( and tab or double tab indentation on the
next line.  Preferrably not spliting the type or long expressions in the
argument lists.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:29 +02:00
Colin Ian King
29c5e5d496 btrfs: remove unused pointer 'tree' in btrfs_submit_compressed_read
Pointer 'tree' is being assigned but is never used hence it is redundant
and can be removed. This is a leftover from cleanup patch
00032d38ea ("btrfs: drop extent_io_ops::merge_bio_hook
callback").

Cleans up clang warning:

  warning: variable 'tree' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:28 +02:00
Colin Ian King
d005dbeca0 btrfs: remove unused pointer inode in relink_file_extents
Pointer inode is being assigned but is never used hence it is redundant
and can be removed. It's been unused since the introduction in
38c227d87c ("Btrfs: snapshot-aware defrag").

Cleans up clang warning:

  variable ‘inode’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:28 +02:00
Su Yue
28c4a3e21a btrfs: defrag: use btrfs_mod_outstanding_extents in cluster_pages_for_defrag
Since commit 8b62f87bad ("Btrfs: rework outstanding_extents"),
manual operations of outstanding_extent in btrfs_inode are replaced by
btrfs_mod_outstanding_extents().
The one in cluster_pages_for_defrag seems to be lost, so replace it
of btrfs_mod_outstanding_extents().

Fixes: 8b62f87bad ("Btrfs: rework outstanding_extents")
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:28 +02:00
Liu Bo
6aadd9eb74 Btrfs: remove confusing tracepoint in btrfs_add_reserved_bytes
Here we're not releasing any space, but transferring bytes from
->bytes_may_use to ->bytes_reserved. The last change to the code in
commit 18513091af ("btrfs: update btrfs_space_info's
bytes_may_use timely") removed a conditional tracepoint and the logic
changed too but the tracepiont remained.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:28 +02:00
Liu Bo
c64142807f btrfs: free path at an earlier point in btrfs_get_extent
trace_btrfs_get_extent() has nothing to do with path, place
btrfs_free_path ahead so that we can unlock path on error.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:28 +02:00
Liu Bo
9688e9a99e Btrfs: use next_state in find_first_extent_bit
As next_state() is already defined to get the next state, use it in
find_first_extent_bit. No functional changes.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:28 +02:00
Qu Wenruo
b72c3aba09 btrfs: locking: Add extra check in btrfs_init_new_buffer() to avoid deadlock
[BUG]
For certain crafted image, whose csum root leaf has missing backref, if
we try to trigger write with data csum, it could cause deadlock with the
following kernel WARN_ON():

  WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
  CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
  Workqueue: btrfs-endio-write btrfs_endio_write_helper
  RIP: 0010:btrfs_tree_lock+0x3e2/0x400
  Call Trace:
   btrfs_alloc_tree_block+0x39f/0x770
   __btrfs_cow_block+0x285/0x9e0
   btrfs_cow_block+0x191/0x2e0
   btrfs_search_slot+0x492/0x1160
   btrfs_lookup_csum+0xec/0x280
   btrfs_csum_file_blocks+0x2be/0xa60
   add_pending_csums+0xaf/0xf0
   btrfs_finish_ordered_io+0x74b/0xc90
   finish_ordered_fn+0x15/0x20
   normal_work_helper+0xf6/0x500
   btrfs_endio_write_helper+0x12/0x20
   process_one_work+0x302/0x770
   worker_thread+0x81/0x6d0
   kthread+0x180/0x1d0
   ret_from_fork+0x35/0x40

[CAUSE]
That crafted image has missing backref for csum tree root leaf.  And
when we try to allocate new tree block, since there is no
EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot
and use it.

The extent tree of the image looks like:

  Normal image                      |       This fuzzed image
  ----------------------------------+--------------------------------
  BG 29360128                       | BG 29360128
   One empty slot                   |  One empty slot
  29364224: backref to UUID tree    | 29364224: backref to UUID tree
   Two empty slots                  |  Two empty slots
  29376512: backref to CSUM tree    |  One empty slot (bad type) <<<
  29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree
  ...                               | ...

Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs try to
alloc tree block, it's an valid slot for btrfs.

And for finish_ordered_write, when we need to insert csum, we try to CoW
csum tree root.

By accident, empty slots at bytenr BG_OFFSET, BG_OFFSET + 8K,
BG_OFFSET + 12K is already used by tree block COW for other trees, the
next empty slot is BG_OFFSET + 16K, which should be the backref for CSUM
tree.

But due to the bad type, btrfs can recognize it and still consider it as
an empty slot, and will try to use it for csum tree CoW.

Then in the following call trace, we will try to lock the new tree
block, which turns out to be the old csum tree root which is already
locked:

btrfs_search_slot() called on csum tree root, which is at 29376512
|- btrfs_cow_block()
   |- btrfs_set_lock_block()
   |  |- Now locks tree block 29376512 (old csum tree root)
   |- __btrfs_cow_block()
      |- btrfs_alloc_tree_block()
         |- btrfs_reserve_extent()
            | Now it returns tree block 29376512, which extent tree
            | shows its empty slot, but it's already hold by csum tree
            |- btrfs_init_new_buffer()
               |- btrfs_tree_lock()
                  | Triggers WARN_ON(eb->lock_owner == current->pid)
                  |- wait_event()
                     Wait lock owner to release the lock, but it's
                     locked by ourself, so it will deadlock

[FIX]
This patch will do the lock_owner and current->pid check at
btrfs_init_new_buffer().
So above deadlock can be avoided.

Since such problem can only happen in crafted image, we will still
trigger kernel warning for later aborted transaction, but with a little
more meaningful warning message.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
Reported-by: Xu Wen <wen.xu@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:27 +02:00
Qu Wenruo
65c6e82bec btrfs: Handle owner mismatch gracefully when walking up tree
[BUG]
When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
when trying to recover balance:

  kernel BUG at fs/btrfs/extent-tree.c:8956!
  invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
  RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
  RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
  Call Trace:
   walk_up_tree+0x172/0x1f0 [btrfs]
   btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
   merge_reloc_roots+0xe1/0x1d0 [btrfs]
   btrfs_recover_relocation+0x3ea/0x420 [btrfs]
   open_ctree+0x1af3/0x1dd0 [btrfs]
   btrfs_mount_root+0x66b/0x740 [btrfs]
   mount_fs+0x3b/0x16a
   vfs_kern_mount.part.9+0x54/0x140
   btrfs_mount+0x16d/0x890 [btrfs]
   mount_fs+0x3b/0x16a
   vfs_kern_mount.part.9+0x54/0x140
   do_mount+0x1fd/0xda0
   ksys_mount+0xba/0xd0
   __x64_sys_mount+0x21/0x30
   do_syscall_64+0x60/0x210
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

[CAUSE]
Extent tree corruption.  In this particular case, reloc tree root's
owner is DATA_RELOC_TREE (should be TREE_RELOC), thus its backref is
corrupted and we failed the owner check in walk_up_tree().

[FIX]
It's pretty hard to take care of every extent tree corruption, but at
least we can remove such BUG_ON() and exit more gracefully.

And since in this particular image, DATA_RELOC_TREE and TREE_RELOC share
the same root (which is obviously invalid), we needs to make
__del_reloc_root() more robust to detect such invalid sharing to avoid
possible NULL dereference as root->node can be NULL in this case.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
Reported-by: Xu Wen <wen.xu@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:27 +02:00
zhong jiang
45128b08f7 btrfs: change btrfs_pin_log_trans to return void
btrfs_pin_log_trans defines the variable "ret" for return value, but it
is not modified after initialization. Further, I find that none of the
callers do handles the return value, so it is safe to drop the unneeded
"ret" and make it return void.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:27 +02:00
zhong jiang
556f3ca88e btrfs: change btrfs_free_reserved_bytes to return void
btrfs_free_reserved_bytes uses the variable "ret" for return value,
but it is not modified after initialzation. Further, I find that any of
the callers do not handle the return value, so it is safe to drop the
unneeded "ret" and return void. There are no callees that would need the
function to handle or pass the value either.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:27 +02:00
Liu Bo
bee6ec822a Btrfs: remove always true if branch in btrfs_get_extent
@path is always NULL when it comes to the if branch.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:27 +02:00
Qu Wenruo
9c7b0c2e8d btrfs: qgroup: Dirty all qgroups before rescan
[BUG]
In the following case, rescan won't zero out the number of qgroup 1/0:

  $ mkfs.btrfs -fq $DEV
  $ mount $DEV /mnt

  $ btrfs quota enable /mnt
  $ btrfs qgroup create 1/0 /mnt
  $ btrfs sub create /mnt/sub
  $ btrfs qgroup assign 0/257 1/0 /mnt

  $ dd if=/dev/urandom of=/mnt/sub/file bs=1k count=1000
  $ btrfs sub snap /mnt/sub /mnt/snap
  $ btrfs quota rescan -w /mnt
  $ btrfs qgroup show -pcre /mnt
  qgroupid         rfer         excl     max_rfer     max_excl parent  child
  --------         ----         ----     --------     -------- ------  -----
  0/5          16.00KiB     16.00KiB         none         none ---     ---
  0/257      1016.00KiB     16.00KiB         none         none 1/0     ---
  0/258      1016.00KiB     16.00KiB         none         none ---     ---
  1/0        1016.00KiB     16.00KiB         none         none ---     0/257

So far so good, but:

  $ btrfs qgroup remove 0/257 1/0 /mnt
  WARNING: quotas may be inconsistent, rescan needed
  $ btrfs quota rescan -w /mnt
  $ btrfs qgroup show -pcre  /mnt
  qgoupid         rfer         excl     max_rfer     max_excl parent  child
  --------         ----         ----     --------     -------- ------  -----
  0/5          16.00KiB     16.00KiB         none         none ---     ---
  0/257      1016.00KiB     16.00KiB         none         none ---     ---
  0/258      1016.00KiB     16.00KiB         none         none ---     ---
  1/0        1016.00KiB     16.00KiB         none         none ---     ---
	     ^^^^^^^^^^     ^^^^^^^^ not cleared

[CAUSE]
Before rescan we call qgroup_rescan_zero_tracking() to zero out all
qgroups' accounting numbers.

However we don't mark all qgroups dirty, but rely on rescan to do so.

If we have any high level qgroup without children, it won't be marked
dirty during rescan, since we cannot reach that qgroup.

This will cause QGROUP_INFO items of childless qgroups never get updated
in the quota tree, thus their numbers will stay the same in "btrfs
qgroup show" output.

[FIX]
Just mark all qgroups dirty in qgroup_rescan_zero_tracking(), so even if
we have childless qgroups, their QGROUP_INFO items will still get
updated during rescan.

Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Tested-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:26 +02:00
Omar Sandoval
3293428096 Btrfs: clean up scrub is_dev_replace parameter
struct scrub_ctx has an ->is_dev_replace member, so there's no point in
passing around is_dev_replace where sctx is available.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:26 +02:00
Anand Jain
1da739678e btrfs: add helper to obtain number of devices with ongoing dev-replace
When the replace is running the fs_devices::num_devices also includes
the replaced device, however in some operations like device delete and
balance it needs the actual num_devices without the repalced devices.
The function btrfs_num_devices() just provides that.

And here is a scenario how balance and repalce items could co-exist:

Consider balance is started and paused, now start the replace followed
by a unmount or system power-cycle. During following mount, the
open_ctree() first restarts the balance so it must check for the device
replace otherwise our num_devices calculation will be wrong.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:26 +02:00
Anand Jain
16220c467a btrfs: add assertions where number of devices could go below 0
In preparation to add helper function to deduce the num_devices with
replace running, use assert instead of BUG_ON or WARN_ON. The number of
devices would not normally drop to 0 due to other checks so the assert
is sufficient.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, adjust the assert condition ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:26 +02:00
zhong jiang
f8b00e0f06 btrfs: remove unneeded NULL checks before kfree
Kfree has taken the NULL pointer into account. So remove the check
before kfree.

The issue is detected with the help of Coccinelle.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:26 +02:00
Liu Bo
4b6f8e9695 Btrfs: do not unnecessarily pass write_lock_level when processing leaf
As we're going to return right after the call, it's not necessary to get
update the new write_lock_level from unlock_up.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:26 +02:00
Misono Tomohiro
4fd786e6c3 btrfs: Remove 'objectid' member from struct btrfs_root
There are two members in struct btrfs_root which indicate root's
objectid: objectid and root_key.objectid.

They are both set to the same value in __setup_root():

  static void __setup_root(struct btrfs_root *root,
                           struct btrfs_fs_info *fs_info,
                           u64 objectid)
  {
    ...
    root->objectid = objectid;
    ...
    root->root_key.objectid = objecitd;
    ...
  }

and not changed to other value after initialization.

grep in btrfs directory shows both are used in many places:
  $ grep -rI "root->root_key.objectid" | wc -l
  133
  $ grep -rI "root->objectid" | wc -l
  55
 (4.17, inc. some noise)

It is confusing to have two similar variable names and it seems
that there is no rule about which should be used in a certain case.

Since ->root_key itself is needed for tree reloc tree, let's remove
'objecitd' member and unify code to use ->root_key.objectid in all places.

Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:25 +02:00
Lu Fengqi
5a2cb25ab9 btrfs: remove a useless return statement in btrfs_block_rsv_add
Since ret must be 0 here, don't have to return.  No functional change
and code readability is not hurt.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:25 +02:00
Lu Fengqi
684572df94 btrfs: Remove root parameter from btrfs_insert_dir_item
All callers pass the root tree of dir, we can push that down to the
function itself.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:25 +02:00
Lu Fengqi
3a58417486 btrfs: switch update_size to bool in btrfs_block_rsv_migrate and btrfs_rsv_add_bytes
Using true and false here is closer to the expected semantic than using
0 and 1.  No functional change.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:25 +02:00
Lu Fengqi
a7176f74fa btrfs: simplify the send_in_progress check in btrfs_delete_subvolume
Only when send_in_progress, we have to do something different such as
btrfs_warn() and return -EPERM. Therefore, we could check
send_in_progress first and process error handling, after the
root_item_lock has been got.

Just for better readability. No functional change.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:25 +02:00
Dan Schatzberg
5571f1e654 fuse: enable caching of symlinks
FUSE file reads are cached in the page cache, but symlink reads are
not. This patch enables FUSE READLINK operations to be cached which
can improve performance of some FUSE workloads.

In particular, I'm working on a FUSE filesystem for access to source
code and discovered that about a 10% improvement to build times is
achieved with this patch (there are a lot of symlinks in the source
tree).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-15 15:43:07 +02:00
Miklos Szeredi
9a2eb24d1a fuse: only invalidate atime in direct read
After sending a synchronous READ request from __fuse_direct_read() we only
need to invalidate atime; none of the other attributes should be changed by
a read().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-15 15:43:06 +02:00
Miklos Szeredi
802dc0497b fuse: don't need GETATTR after every READ
If 'auto_inval_data' mode is active, then fuse_file_read_iter() will call
fuse_update_attributes(), which will check the attribute validity and send
a GETATTR request if some of the attributes are no longer valid.  The page
cache is then invalidated if the size or mtime have changed.

Then, if a READ request was sent and reply received (which is the case if
the data wasn't cached yet, or if the file is opened for O_DIRECT), the
atime attribute is invalidated.

This will result in the next read() also triggering a GETATTR, ...

This can be fixed by only sending GETATTR if the mode or size are invalid,
we don't need to do a refresh if only atime is invalid.

More generally, none of the callers of fuse_update_attributes() need an
up-to-date atime value, so for now just remove STATX_ATIME from the request
mask when attributes are updated for internal use.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-15 15:43:06 +02:00
Miklos Szeredi
2f1e81965f fuse: allow fine grained attr cache invaldation
This patch adds the infrastructure for more fine grained attribute
invalidation.  Currently only 'atime' is invalidated separately.

The use of this infrastructure is extended to the statx(2) interface, which
for now means that if only 'atime' is invalid and STATX_ATIME is not
specified in the mask argument, then no GETATTR request will be generated.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-15 15:43:06 +02:00
David Howells
f0a7d1883d afs: Fix clearance of reply
The recent patch to fix the afs_server struct leak didn't actually fix the
bug, but rather fixed some of the symptoms.  The problem is that an
asynchronous call that holds a resource pointed to by call->reply[0] will
find the pointer cleared in the call destructor, thereby preventing the
resource from being cleaned up.

In the case of the server record leak, the afs_fs_get_capabilities()
function in devel code sets up a call with reply[0] pointing at the server
record that should be altered when the result is obtained, but this was
being cleared before the destructor was called, so the put in the
destructor does nothing and the record is leaked.

Commit f014ffb025 removed the additional ref obtained by
afs_install_server(), but the removal of this ref is actually used by the
garbage collector to mark a server record as being defunct after the record
has expired through lack of use.

The offending clearance of call->reply[0] upon completion in
afs_process_async_call() has been there from the origin of the code, but
none of the asynchronous calls actually use that pointer currently, so it
should be safe to remove (note that synchronous calls don't involve this
function).

Fix this by the following means:

 (1) Revert commit f014ffb025.

 (2) Remove the clearance of reply[0] from afs_process_async_call().

Without this, afs_manage_servers() will suffer an assertion failure if it
sees a server record that didn't get used because the usage count is not 1.

Fixes: f014ffb025 ("afs: Fix afs_server struct leak")
Fixes: 08e0e7c82e ("[AF_RXRPC]: Make the in-kernel AFS filesystem use AF_RXRPC.")
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-15 15:31:47 +02:00
Greg Kroah-Hartman
3a27203102 libnvdimm/dax 4.19-rc8
* Fix a livelock in dax_layout_busy_page() present since v4.18. The
   lockup triggers when truncating an actively mapped huge page out of a
   mapping pinned for direct-I/O.
 
 * Fix mprotect() clobbers of _PAGE_DEVMAP. Broken since v4.5 mprotect()
   clears this flag that is needed to communicate the liveness of device
   pages to the get_user_pages() path.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbwiZhAAoJEB7SkWpmfYgCYFoQAL8ED6c1bfGUPRsWSrTRChU0
 ungVZ/Vf1+2ERd3ivUXPQzahNtqH5EWvEVp0aboVpyJUoVllrztInVS2hxaGJE+e
 w7WnzaXh36MY0kvLpK+Ny1Cxk7qg2rXnmzOAPRVdSUoSvh0TXOn5HFX1i/OdI7WK
 wgJwXraCoyKP9aTItw7oHQy9S36bi1RJVUakOAoEpEx4Vn+fwFxLNIt34G5CRJ+k
 iflicM7CPngxlFzwfoiX9v3DhV7toexk1A4LAzzwypG0Aiqd5tW2FG1lwLMPncNk
 8FezBm9VjkMwzv6hj7nD9UfU2lbh3GqqGDW0cPX1DPSgDxr/4pOLtKcbYWHh6yas
 NtCXk37q90ey3GtD2wYBRkBNly6UWvHJ0d3srtO6ZSl1VN6JQu8rhVhQ6KnON24B
 NcWlEVf2brqf0uaW4byCVbdVfIDp96/qgEvCo1pq3olXwCdDyOBJjYxaBcnu5JDV
 YsItMCJ49AxS/qoCt3vam7vC5TGhfYHL5xJPaF06cdjYvgfqOIV3VQT1ujBx4cvh
 MBFRBKDc6oDiJFgkrdYqHwJfn5fCQVS180Oy5S0AFGsVAzsJalKBZBLx2f2RQn8c
 r+kczvvPjpczEeEqzaqsxTgjowo/75Q8PRXc2PbwQzNxfkHuKf+xxQpnUg0mN6Hf
 w8zPSaCcCs2Wo21Kd/ua
 =VXnU
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-fixes-4.19-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Dan writes:
  "libnvdimm/dax 4.19-rc8

   * Fix a livelock in dax_layout_busy_page() present since v4.18. The
     lockup triggers when truncating an actively mapped huge page out of
     a mapping pinned for direct-I/O.

   * Fix mprotect() clobbers of _PAGE_DEVMAP. Broken since v4.5
     mprotect() clears this flag that is needed to communicate the
     liveness of device pages to the get_user_pages() path."

* tag 'libnvdimm-fixes-4.19-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  mm: Preserve _PAGE_DEVMAP across mprotect() calls
  filesystem-dax: Fix dax_layout_busy_page() livelock
2018-10-14 08:34:31 +02:00
Richard Weinberger
f8ccb14fd6 ubifs: Fix WARN_ON logic in exit path
ubifs_assert() is not WARN_ON(), so we have to invert
the checks.
Randy faced this warning with UBIFS being a module, since
most users use UBIFS as builtin because UBIFS is the rootfs
nobody noticed so far. :-(
Including me.

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Fixes: 54169ddd38 ("ubifs: Turn two ubifs_assert() into a WARN_ON()")
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-13 11:05:02 +02:00
Greg Kroah-Hartman
79fc170b1f Merge branch 'akpm'
Fixes from Andrew:

* akpm:
  fs/fat/fatent.c: add cond_resched() to fat_count_free_clusters()
  mm/thp: fix call to mmu_notifier in set_pmd_migration_entry() v2
  mm/mmap.c: don't clobber partially overlapping VMA with MAP_FIXED_NOREPLACE
  ocfs2: fix a GCC warning
2018-10-13 09:31:19 +02:00
Khazhismel Kumykov
ac081c3be3 fs/fat/fatent.c: add cond_resched() to fat_count_free_clusters()
On non-preempt kernels this loop can take a long time (more than 50 ticks)
processing through entries.

Link: http://lkml.kernel.org/r/20181010172623.57033-1-khazhy@google.com
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-13 09:31:03 +02:00
zhong jiang
1cff514a51 ocfs2: fix a GCC warning
Fix the following compile warning:

fs/ocfs2/dlmglue.c:99:30: warning: ‘lockdep_keys’ defined but not used [-Wunused-variable]
 static struct lock_class_key lockdep_keys[OCFS2_NUM_LOCK_TYPES];

Link: http://lkml.kernel.org/r/1536938148-32110-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-13 09:31:02 +02:00
Greg Kroah-Hartman
ed66c252d9 gfs2 4.19 fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbwLzVAAoJENW/n+sDE2U6ZQUP/3mcvCqDjU+YBC/RCuA0Hvxd
 4rtzRqY5jTt4HZCTJ9Qj8aARZ4vN/LjKQuQU8IYhzJQcXzeyDi1V2zH4xGmEDomZ
 o9LqcVUUJO1bS/aykLrbnSEEfjU+zeA3z60Jbj3OHkpL6H9u6q+AwqCZnsIcKBnI
 Z8lgUda6oC3L60xzxpdr6BoYLIMqucuJ0YMyVo6YsKT1PEoP30EQhAoqnnKVfxx7
 +l66OUK4Qw3z45auen9MIOXjn+y7BC0RjFANetsBlAS29PE5X9rXWQ5xnv+9uYCh
 DDXPiMjdkLm2Xv8d5U80QhVuO8vxkgn3lnLvE99cmMmQrU/6pS91q+azgc/98RDN
 ZYCReIihyPbug3+uRziYcBHnI+4c8LrM/Za0TsHDG6gA6ddiiNHvdnJE9Nbk85Le
 dS+jQwA5eiYtGc6styoyk52v6huGTI3oPbffyYTTbOLnT6bSaEv+k0zWWCgTqMv3
 sb0imETfsdd1O/7MxR2kAVtQPdSzhD5GMIbkmlRb8nwP+QXb+FBVtQFPsZaPZklR
 qF//eFXGcoWfM1XerOYgl6VVolvntSc+ivycSEsDGHNSU5zUy2JYNTxmkypBB6pb
 wslVyIvjW2bcnUhjGr9PDSg+yQjaAk+5EyiJmxtJaqCiSfjgTpI4YwNrkfQo5csI
 H/Sq1WeTLjo2KeRb7LpU
 =gGo4
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-4.19.fixes3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Andreas writes:
  "gfs2 4.19 fixes

   Fix iomap buffered write support for journaled files"

* tag 'gfs2-4.19.fixes3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix iomap buffered write support for journaled files (2)
2018-10-13 09:06:27 +02:00
Al Viro
82a6857bf9 compat_ioctl - kill keyboard ioctl handling
all of those are provided only by vt and s390 tty3270; both have
proper ->compat_ioctl()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-13 00:50:51 -04:00
Al Viro
969ec01e99 gigaset: add ->compat_ioctl()
... and get rid of COMPAT_IOCTL() for its private ioctls

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-13 00:50:50 -04:00
Al Viro
6bbf265892 kill the rest of tty COMPAT_IOCTL() entries
TIOCLINUX is handled by ->compat_ioctl() in the only place that has
native ->ioctl() recognizing it, TIOC{START,STOP} are simply useless
these days - unrecognized compat ioctl won't spew into syslog
anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-13 00:50:46 -04:00
Al Viro
7765435030 take compat TIOC[SG]SERIAL treatment into tty_compat_ioctl()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-13 00:50:44 -04:00
David S. Miller
d864991b22 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were easy to resolve using immediate context mostly,
except the cls_u32.c one where I simply too the entire HEAD
chunk.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-12 21:38:46 -07:00
Al Viro
3df629d873 gfs2_meta: ->mount() can get NULL dev_name
get in sync with mount_bdev() handling of the same

Reported-by: syzbot+c54f8e94e6bba03b04e9@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-13 00:19:13 -04:00
Al Viro
995f608e7a ntfs: don't open-code ERR_CAST
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-12 22:46:50 -04:00
David Howells
f014ffb025 afs: Fix afs_server struct leak
Fix a leak of afs_server structs.  The routine that installs them in the
various lookup lists and trees gets a ref on leaving the function, whether
it added the server or a server already exists.  It shouldn't increment
the refcount if it added the server.

The effect of this that "rmmod kafs" will hang waiting for the leaked
server to become unused.

Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-12 17:36:40 +02:00
Andreas Gruenbacher
fee5150c48 gfs2: Fix iomap buffered write support for journaled files (2)
It turns out that the fix in commit 6636c3cc56 is bad; the assertion
that the iomap code no longer creates buffer heads is incorrect for
filesystems that set the IOMAP_F_BUFFER_HEAD flag.

Instead, what's happening is that gfs2_iomap_begin_write treats all
files that have the jdata flag set as journaled files, which is
incorrect as long as those files are inline ("stuffed").  We're handling
stuffed files directly via the page cache, which is why we ended up with
pages without buffer heads in gfs2_page_add_databufs.

Fix this by handling stuffed journaled files correctly in
gfs2_iomap_begin_write.

This reverts commit 6636c3cc5690c11631e6366cf9a28fb99c8b25bb.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-10-12 17:14:42 +02:00
Theodore Ts'o
33458eaba4 ext4: fix use-after-free race in ext4_remount()'s error path
It's possible for ext4_show_quota_options() to try reading
s_qf_names[i] while it is being modified by ext4_remount() --- most
notably, in ext4_remount's error path when the original values of the
quota file name gets restored.

Reported-by: syzbot+a2872d6feea6918008a9@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 3.2+
2018-10-12 09:28:09 -04:00
Andreas Gruenbacher
0ddeded4ae gfs2: Pass resource group to rgblk_free
Function rgblk_free can only deal with one resource group at a time, so
pass that resource group is as a parameter.  Several of the callers
already have the resource group at hand, so we only need additional
lookup code in a few places.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:33:07 -05:00
Bob Peterson
c3abc29e54 gfs2: Remove unnecessary gfs2_rlist_alloc parameter
The state parameter of gfs2_rlist_alloc is set to LM_ST_EXCLUSIVE in all
calls, so remove it and hardcode that state in gfs2_rlist_alloc instead.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:32:28 -05:00
Andreas Gruenbacher
ec23df2b0c gfs2: Fix marking bitmaps non-full
Reservations in gfs can span multiple gfs2_bitmaps (but they won't span
multiple resource groups).  When removing a reservation, we want to
clear the GBF_FULL flags of all involved gfs2_bitmaps, not just that of
the first bitmap.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:31:55 -05:00
Andreas Gruenbacher
243fea4df9 gfs2: Fix some minor typos
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:31:21 -05:00
Andreas Gruenbacher
281b4952d1 gfs2: Rename bitmap.bi_{len => bytes}
This field indicates the size of the bitmap in bytes, similar to how the
bi_blocks field indicates the size of the bitmap in blocks.

In count_unlinked, replace an instance of bi_bytes * GFS2_NBBY by
bi_blocks.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:30:43 -05:00
Andreas Gruenbacher
ad89945818 gfs2: Remove unused RGRP_RSRV_MINBYTES definition
This definition is only used to define RGRP_RSRV_MINBLKS, with no
benefit over defining RGRP_RSRV_MINBLKS directly.

In addition, instead of forcing RGRP_RSRV_MINBLKS to be of type u32,
cast it to that type where that type is required.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:29:59 -05:00
Andreas Gruenbacher
21f09c4395 gfs2: Move rs_{sizehint, rgd_gh} fields into the inode
Move the rs_sizehint and rs_rgd_gh fields from struct gfs2_blkreserv
into the inode: they are more closely related to the inode than to a
particular reservation.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:29:14 -05:00
Andreas Gruenbacher
3548fce164 gfs2: Clean up out-of-bounds check in gfs2_rbm_from_block
We already have a function that checks if a block is within a resource
group, so use that in gfs2_rbm_from_block as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:28:39 -05:00
Andreas Gruenbacher
f654683dae gfs2: Always check the result of gfs2_rbm_from_block
When gfs2_rbm_from_block fails, the rbm it returns is undefined, so we
always want to make sure gfs2_rbm_from_block has succeeded before
looking at the rbm.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:18:25 -05:00
David Howells
6b3944e42e afs: Fix cell proc list
Access to the list of cells by /proc/net/afs/cells has a couple of
problems:

 (1) It should be checking against SEQ_START_TOKEN for the keying the
     header line.

 (2) It's only holding the RCU read lock, so it can't just walk over the
     list without following the proper RCU methods.

Fix these by using an hlist instead of an ordinary list and using the
appropriate accessor functions to follow it with RCU.

Since the code that adds a cell to the list must also necessarily change,
sort the list on insertion whilst we're at it.

Fixes: 989782dcdc ("afs: Overhaul cell database management")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-12 13:18:57 +02:00
Greg Kroah-Hartman
4718dcad7d xfs: fixes for 4.19-rc7
Update for 4.19-rc7 to fix numerous file clone and deduplication issues.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJbvo1jAAoJEK3oKUf0dfodurwP/3OGvJo11XjE5Wuh5tbcSOUm
 duJdJvN+b0gx8Lb1IL2lHaBEt/Pw3PKi+KPx6d+pd47aflCYNMiU7I1kzejxvq9I
 TBVdhxffAmNBU02VU/qmhucMnKfpVr/UX19f0sHZcfXbZkpIuHrSpI25MNSYqbEs
 bRoMDkgbuu377hCJu6tnBAUm38z4iZdfJaGPVmJmuVMn8JJE8KA3Une/daxg0/HY
 zbCMWP3SGJwqx7SyyeurQSkRS/8GG3LgbnYOz0FlaHfBd4JVJscHOZU08nrYmMAN
 9SOowvahaPkDcyO8c6gRkyE6NIbOsb79718g+HTm4eqzyChnh+jnFTzitcTMuK2c
 vjblXZSDKnN/P4UaEEzIAnQ1Eew4WhLrKuBr+3fCr2bKTGfj0iiX36CjEgk1k+Df
 t1DV/Hj6me6crZpWO7GQZVLnDsiJBzaOsIgpYnB3vcJyof/cgm+5bfGFedDZFdpV
 XTh0oBN8B7oQztd4MvcfQOGXSktrMtq+6RomO8kv77mVtHFnjnroKHq/wIft2nyI
 rs7u7DFmCLyB9Lbm8aoeMPo4+0zxTp9385HSJ+XznHcjGXxkxIIGhhdU3omKxANa
 j3UQB1+VRC4lHSluUXOW7vH0j1tX3gaS3z/yJ+gfcAdizUmdp0h0rMx8c9VR8SEJ
 xv/fpbrcKHyBwuNvKtBm
 =yj1P
 -----END PGP SIGNATURE-----

Merge tag 'xfs-fixes-for-4.19-rc7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Dave writes:
  "xfs: fixes for 4.19-rc7

   Update for 4.19-rc7 to fix numerous file clone and deduplication issues."

* tag 'xfs-fixes-for-4.19-rc7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix data corruption w/ unaligned reflink ranges
  xfs: fix data corruption w/ unaligned dedupe ranges
  xfs: update ctime and remove suid before cloning files
  xfs: zero posteof blocks when cloning above eof
  xfs: refactor clonerange preparation into a separate helper
2018-10-11 07:17:42 +02:00
Al Viro
e884bce1d9 ext4: don't open-code ERR_CAST
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-10 16:41:40 -04:00
Al Viro
3837d208d8 simplify btrfs_lookup()
d_splice_alias() is fine with ERR_PTR(-E...) for inode

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-10 16:38:27 -04:00
Eric Biggers
691115c351 vfs: require i_size <= SIZE_MAX in kernel_read_file()
On 32-bit systems, the buffer allocated by kernel_read_file() is too
small if the file size is > SIZE_MAX, due to truncation to size_t.

Fortunately, since the 'count' argument to kernel_read() is also
truncated to size_t, only the allocated space is filled; then, -EIO is
returned since 'pos != i_size' after the read loop.

But this is not obvious and seems incidental.  We should be more
explicit about this case.  So, fail early if i_size > SIZE_MAX.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2018-10-10 12:56:14 -04:00
Colin Ian King
2978d87347 orangefs: rate limit the client not running info message
Currently accessing various /sys/fs/orangefs files will spam the
kernel log with the following info message when the client is not
running:

[  491.489284] sysfs_service_op_show: Client not running :-5:

Rate limit this info message to make it less spammy.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-10-10 09:54:29 -04:00
Chengguang Xu
052d12766b orangefs: cache NULL when both default_acl and acl are NULL
default_acl and acl of newly created inode will be initiated
as ACL_NOT_CACHED in vfs function inode_init_always() and later
will be updated by calling xxx_init_acl() in specific filesystems.
Howerver, when default_acl and acl are NULL then they keep the value
of ACL_NOT_CACHED, this patch tries to cache NULL for acl/default_acl
in this case.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-10-10 09:54:25 -04:00
Al Viro
74dd7c97ea ecryptfs_rename(): verify that lower dentries are still OK after lock_rename()
We get lower layer dentries, find their parents, do lock_rename() and
proceed to vfs_rename().  However, we do not check that dentries still
have the same parents and are not unlinked.  Need to check that...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-09 23:33:17 -04:00
Andreas Gruenbacher
dc480feb45 gfs2: Fix iomap buffered write support for journaled files
Commit 64bc06bb32 broke buffered writes to journaled files (chattr
+j): we'll try to journal the buffer heads of the page being written to
in gfs2_iomap_journaled_page_done.  However, the iomap code no longer
creates buffer heads, so we'll BUG() in gfs2_page_add_databufs.  Fix
that by creating buffer heads ourself when needed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-10-09 18:20:13 +02:00
Steve Whitehouse
6ddc5c3ddf gfs2: getlabel support
Add support for the GETFSLABEL ioctl in gfs2.
I tested this patch and it works as expected.

Signed-off-by: Steve Whitehouse <swhiteho@redhat.com>
Tested-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-10-09 07:06:16 -05:00
Tim Smith
1eb8d73879 GFS2: Flush the GFS2 delete workqueue before stopping the kernel threads
Flushing the workqueue can cause operations to happen which might
call gfs2_log_reserve(), or get stuck waiting for locks taken by such
operations.  gfs2_log_reserve() can io_schedule(). If this happens, it
will never wake because the only thing which can wake it is gfs2_logd()
which was already stopped.

This causes umount of a gfs2 filesystem to wedge permanently if, for
example, the umount immediately follows a large delete operation.

When this occured, the following stack trace was obtained from the
umount command

[<ffffffff81087968>] flush_workqueue+0x1c8/0x520
[<ffffffffa0666e29>] gfs2_make_fs_ro+0x69/0x160 [gfs2]
[<ffffffffa0667279>] gfs2_put_super+0xa9/0x1c0 [gfs2]
[<ffffffff811b7edf>] generic_shutdown_super+0x6f/0x100
[<ffffffff811b7ff7>] kill_block_super+0x27/0x70
[<ffffffffa0656a71>] gfs2_kill_sb+0x71/0x80 [gfs2]
[<ffffffff811b792b>] deactivate_locked_super+0x3b/0x70
[<ffffffff811b79b9>] deactivate_super+0x59/0x60
[<ffffffff811d2998>] cleanup_mnt+0x58/0x80
[<ffffffff811d2a12>] __cleanup_mnt+0x12/0x20
[<ffffffff8108c87d>] task_work_run+0x7d/0xa0
[<ffffffff8106d7d9>] exit_to_usermode_loop+0x73/0x98
[<ffffffff81003961>] syscall_return_slowpath+0x41/0x50
[<ffffffff815a594c>] int_ret_from_sys_call+0x25/0x8f
[<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Tim Smith <tim.smith@citrix.com>
Signed-off-by: Mark Syms <mark.syms@citrix.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-10-09 07:05:36 -05:00
Borislav Petkov
cf089611f4 proc/vmcore: Fix i386 build error of missing copy_oldmem_page_encrypted()
Lianbo reported a build error with a particular 32-bit config, see Link
below for details.

Provide a weak copy_oldmem_page_encrypted() function which architectures
can override, in the same manner other functionality in that file is
supplied.

Reported-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: x86@kernel.org
Link: http://lkml.kernel.org/r/710b9d95-2f70-eadf-c4a1-c3dc80ee4ebb@redhat.com
2018-10-09 11:57:28 +02:00
Dan Williams
d7782145e1 filesystem-dax: Fix dax_layout_busy_page() livelock
In the presence of multi-order entries the typical
pagevec_lookup_entries() pattern may loop forever:

	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
				min(end - index, (pgoff_t)PAGEVEC_SIZE),
				indices)) {
		...
		for (i = 0; i < pagevec_count(&pvec); i++) {
			index = indices[i];
			...
		}
		index++; /* BUG */
	}

The loop updates 'index' for each index found and then increments to the
next possible page to continue the lookup. However, if the last entry in
the pagevec is multi-order then the next possible page index is more
than 1 page away. Fix this locally for the filesystem-dax case by
checking for dax-multi-order entries. Going forward new users of
multi-order entries need to be similarly careful, or we need a generic
way to report the page increment in the radix iterator.

Fixes: 5fac7408d8 ("mm, fs, dax: handle layout changes to pinned dax...")
Cc: <stable@vger.kernel.org>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-10-08 11:38:44 -07:00
Andrew Price
4c62bd9cea gfs2: Don't leave s_fs_info pointing to freed memory in init_sbd
When alloc_percpu() fails, sdp gets freed but sb->s_fs_info still points
to the same address. Move the assignment after that error check so that
s_fs_info can only point to a valid sdp or NULL, which is checked for
later in the error path, in gfs2_kill_super().

Reported-by: syzbot+dcb8b3587445007f5808@syzkaller.appspotmail.com
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-10-08 07:52:43 -05:00
Amir Goldstein
d0a6a87e40 fanotify: support reporting thread id instead of process id
In order to identify which thread triggered the event in a
multi-threaded program, add the FAN_REPORT_TID flag in fanotify_init to
opt-in for reporting the event creator's thread id information.

Signed-off-by: nixiaoming <nixiaoming@huawei.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-08 13:48:45 +02:00
Chengguang Xu
6fd941784b ext4: cache NULL when both default_acl and acl are NULL
default_acl and acl of newly created inode will be initiated as
ACL_NOT_CACHED in vfs function inode_init_always() and later will be
updated by calling xxx_init_acl() in specific filesystems.  However,
when default_acl and acl are NULL then they keep the value of
ACL_NOT_CACHED.  This patch changes the code to cache NULL for acl /
default_acl in this case to save unnecessary ACL lookup attempt.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-10-06 22:40:34 -04:00
David S. Miller
72438f8cef Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-10-06 14:43:42 -07:00
Lianbo Jiang
992b649a3f kdump, proc/vmcore: Enable kdumping encrypted memory with SME enabled
In the kdump kernel, the memory of the first kernel needs to be dumped
into the vmcore file.

If SME is enabled in the first kernel, the old memory has to be remapped
with the memory encryption mask in order to access it properly.

Split copy_oldmem_page() functionality to handle encrypted memory
properly.

 [ bp: Heavily massage everything. ]

Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: kexec@lists.infradead.org
Cc: tglx@linutronix.de
Cc: mingo@redhat.com
Cc: hpa@zytor.com
Cc: akpm@linux-foundation.org
Cc: dan.j.williams@intel.com
Cc: bhelgaas@google.com
Cc: baiyaowei@cmss.chinamobile.com
Cc: tiwai@suse.de
Cc: brijesh.singh@amd.com
Cc: dyoung@redhat.com
Cc: bhe@redhat.com
Cc: jroedel@suse.de
Link: https://lkml.kernel.org/r/be7b47f9-6be6-e0d1-2c2a-9125bc74b818@redhat.com
2018-10-06 12:09:26 +02:00
Dave Chinner
b39989009b xfs: fix data corruption w/ unaligned reflink ranges
When reflinking sub-file ranges, a data corruption can occur when
the source file range includes a partial EOF block. This shares the
unknown data beyond EOF into the second file at a position inside
EOF, exposing stale data in the second file.

XFS only supports whole block sharing, but we still need to
support whole file reflink correctly.  Hence if the reflink
request includes the last block of the souce file, only proceed with
the reflink operation if it lands at or past the destination file's
current EOF. If it lands within the destination file EOF, reject the
entire request with -EINVAL and make the caller go the hard way.

This avoids the data corruption vector, but also avoids disruption
of returning EINVAL to userspace for the common case of whole file
cloning.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-06 11:44:39 +10:00
Dave Chinner
dceeb47b0e xfs: fix data corruption w/ unaligned dedupe ranges
A deduplication data corruption is Exposed by fstests generic/505 on
XFS. It is caused by extending the block match range to include the
partial EOF block, but then allowing unknown data beyond EOF to be
considered a "match" to data in the destination file because the
comparison is only made to the end of the source file. This corrupts
the destination file when the source extent is shared with it.

XFS only supports whole block dedupe, but we still need to appear to
support whole file dedupe correctly.  Hence if the dedupe request
includes the last block of the souce file, don't include it in the
actual XFS dedupe operation. If the rest of the range dedupes
successfully, then report the partial last block as deduped, too, so
that userspace sees it as a successful dedupe rather than return
EINVAL because we can't dedupe unaligned blocks.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-06 11:44:19 +10:00
Ashish Samant
cbe355f57c ocfs2: fix locking for res->tracking and dlm->tracking_list
In dlm_init_lockres() we access and modify res->tracking and
dlm->tracking_list without holding dlm->track_lock.  This can cause list
corruptions and can end up in kernel panic.

Fix this by locking res->tracking and dlm->tracking_list with
dlm->track_lock instead of dlm->spinlock.

Link: http://lkml.kernel.org/r/1529951192-4686-1-git-send-email-ashish.samant@oracle.com
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Acked-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:05 -07:00
Jann Horn
f8a00cef17 proc: restrict kernel stack dumps to root
Currently, you can use /proc/self/task/*/stack to cause a stack walk on
a task you control while it is running on another CPU.  That means that
the stack can change under the stack walker.  The stack walker does
have guards against going completely off the rails and into random
kernel memory, but it can interpret random data from your kernel stack
as instruction pointers and stack pointers.  This can cause exposure of
kernel stack contents to userspace.

Restrict the ability to inspect kernel stacks of arbitrary tasks to root
in order to prevent a local attacker from exploiting racy stack unwinding
to leak kernel task stack contents.  See the added comment for a longer
rationale.

There don't seem to be any users of this userspace API that can't
gracefully bail out if reading from the file fails.  Therefore, I believe
that this change is unlikely to break things.  In the case that this patch
does end up needing a revert, the next-best solution might be to fake a
single-entry stack based on wchan.

Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com
Fixes: 2ec220e27f ("proc: add /proc/*/stack")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:05 -07:00
Larry Chen
69eb7765b9 ocfs2: fix crash in ocfs2_duplicate_clusters_by_page()
ocfs2_duplicate_clusters_by_page() may crash if one of the extent's pages
is dirty.  When a page has not been written back, it is still in dirty
state.  If ocfs2_duplicate_clusters_by_page() is called against the dirty
page, the crash happens.

To fix this bug, we can just unlock the page and wait until the page until
its not dirty.

The following is the backtrace:

kernel BUG at /root/code/ocfs2/refcounttree.c:2961!
[exception RIP: ocfs2_duplicate_clusters_by_page+822]
__ocfs2_move_extent+0x80/0x450 [ocfs2]
? __ocfs2_claim_clusters+0x130/0x250 [ocfs2]
ocfs2_defrag_extent+0x5b8/0x5e0 [ocfs2]
__ocfs2_move_extents_range+0x2a4/0x470 [ocfs2]
ocfs2_move_extents+0x180/0x3b0 [ocfs2]
? ocfs2_wait_for_recovery+0x13/0x70 [ocfs2]
ocfs2_ioctl_move_extents+0x133/0x2d0 [ocfs2]
ocfs2_ioctl+0x253/0x640 [ocfs2]
do_vfs_ioctl+0x90/0x5f0
SyS_ioctl+0x74/0x80
do_syscall_64+0x74/0x140
entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Once we find the page is dirty, we do not wait until it's clean, rather we
use write_one_page() to write it back

Link: http://lkml.kernel.org/r/20180829074740.9438-1-lchen@suse.com
[lchen@suse.com: update comments]
  Link: http://lkml.kernel.org/r/20180830075041.14879-1-lchen@suse.com
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Larry Chen <lchen@suse.com>
Acked-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:04 -07:00
Jan Kara
ccd3c4373e jbd2: fix use after free in jbd2_log_do_checkpoint()
The code cleaning transaction's lists of checkpoint buffers has a bug
where it increases bh refcount only after releasing
journal->j_list_lock. Thus the following race is possible:

CPU0					CPU1
jbd2_log_do_checkpoint()
					jbd2_journal_try_to_free_buffers()
					  __journal_try_to_free_buffer(bh)
  ...
  while (transaction->t_checkpoint_io_list)
  ...
    if (buffer_locked(bh)) {

<-- IO completes now, buffer gets unlocked -->

      spin_unlock(&journal->j_list_lock);
					    spin_lock(&journal->j_list_lock);
					    __jbd2_journal_remove_checkpoint(jh);
					    spin_unlock(&journal->j_list_lock);
					  try_to_free_buffers(page);
      get_bh(bh) <-- accesses freed bh

Fix the problem by grabbing bh reference before unlocking
journal->j_list_lock.

Fixes: dc6e8d669c ("jbd2: don't call get_bh() before calling __jbd2_journal_remove_checkpoint()")
Fixes: be1158cc61 ("jbd2: fold __process_buffer() into jbd2_log_do_checkpoint()")
Reported-by: syzbot+7f4a27091759e2fe7453@syzkaller.appspotmail.com
CC: stable@vger.kernel.org
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-05 18:44:40 -04:00
Al Viro
b07581d2d5 cachefiles: fix the race between cachefiles_bury_object() and rmdir(2)
the victim might've been rmdir'ed just before the lock_rename();
unlike the normal callers, we do not look the source up after the
parents are locked - we know it beforehand and just recheck that it's
still the child of what used to be its parent.  Unfortunately,
the check is too weak - we don't spot a dead directory since its
->d_parent is unchanged, dentry is positive, etc.  So we sail all
the way to ->rename(), with hosting filesystems _not_ expecting
to be asked renaming an rmdir'ed subdirectory.

The fix is easy, fortunately - the lock on parent is sufficient for
making IS_DEADDIR() on child safe.

Cc: stable@vger.kernel.org
Fixes: 9ae326a690 (CacheFiles: A cache that backs onto a mounted filesystem)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-05 14:37:28 -04:00
Bob Peterson
e54c78a27f gfs2: Use fs_* functions instead of pr_* function where we can
Before this patch, various errors and messages were reported using
the pr_* functions: pr_err, pr_warn, pr_info, etc., but that does
not tell you which gfs2 mount had the problem, which is often vital
to debugging. This patch changes the calls from pr_* to fs_* in
most of the messages so that the file system id is printed along
with the message.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-10-05 11:16:54 -05:00
Bob Peterson
b524abcc01 gfs2: slow the deluge of io error messages
When an io error is hit, it calls gfs2_io_error_bh_i for every
journal buffer it can't write. Since we changed gfs2_io_error_bh_i
recently to withdraw later in the cycle, it sends a flood of
errors to the console. This patch checks for the file system already
being withdrawn, and if so, doesn't send more messages. It doesn't
stop the flood of messages, but it slows it down and keeps it more
reasonable.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-10-05 10:51:11 -05:00
Greg Kroah-Hartman
087f759a41 four small SMB3 fixes: one for stable, the others to address a more recent regression
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlu2jLMACgkQiiy9cAdy
 T1GLMQwAiUaUTtR0MN2ws+IHNnwFX5YnWFa1Cr/6Tgz9WBo5TrrvQcBGhTIUQok2
 7w2+I0rXPO7P7Tq2odWn66BT1/l0o2U10aB4OHYlS7zqfCAp6GuwKGtATivRqjsg
 ENn2mmBRuTmz+XtGmlVklmQpHjn4+MTR2GSLA57oo1fqX5YBeSxcq2UEciCGttRQ
 nAvn923NrPSbGRvL/eDDWGXas8gMATXnrNlN8st2AsxnoFC6CrzUuRM0s/IFgp6r
 Orh0jIlovlT2Q6LdND7n7xXnEVoyUTETG89IjRVwcBJ9LtXrJXpuWo7/sOr11cMy
 sGeUkILNlmkIdLtIlUaZcGaclq/yfeKdMxziBT8tWP6wXIH4LrGCzvZ9jDtrzygU
 EWL2nLsqMJNh0+wsYSZM/m9VFC1ReeB1AgIC2EHu/xzqWoTeL40i2OhKi/732FEX
 H+FlTZUoorhFoI96AMJMrqXjXyqZUZSQ1b5iJ7/jq888Vdey+m7HiMUtxv0+5yzC
 +lQksh8X
 =4PHD
 -----END PGP SIGNATURE-----

Merge tag '4.19-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Steve writes:
  "SMB3 fixes

   four small SMB3 fixes: one for stable, the others to address a more
   recent regression"

* tag '4.19-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: fix lease break problem introduced by compounding
  cifs: only wake the thread for the very last PDU in a compound
  cifs: add a warning if we try to to dequeue a deleted mid
  smb2: fix missing files in root share directory listing
2018-10-05 08:27:47 -07:00
Olga Kornievskaia
44f411c353 NFSv4.x: fix lock recovery during delegation recall
Running "./nfstest_delegation --runtest recall26" uncovers that
client doesn't recover the lock when we have an appending open,
where the initial open got a write delegation.

Instead of checking for the passed in open context against
the file lock's open context. Check that the state is the same.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-10-05 09:34:36 -04:00
Darrick J. Wong
7debbf015f xfs: update ctime and remove suid before cloning files
Before cloning into a file, update the ctime and remove sensitive
attributes like suid, just like we'd do for a regular file write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-05 19:05:41 +10:00
Darrick J. Wong
410fdc72b0 xfs: zero posteof blocks when cloning above eof
When we're reflinking between two files and the destination file range
is well beyond the destination file's EOF marker, zero any posteof
speculative preallocations in the destination file so that we don't
expose stale disk contents.  The previous strategy of trying to clear
the preallocations does not work if the destination file has the
PREALLOC flag set.

Uncovered by shared/010.

Reported-by: Zorro Lang <zlang@redhat.com>
Bugzilla-id: https://bugzilla.kernel.org/show_bug.cgi?id=201259
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-05 19:04:27 +10:00
Darrick J. Wong
0d41e1d28c xfs: refactor clonerange preparation into a separate helper
Refactor all the reflink preparation steps into a separate helper
that we'll use to land all the upcoming fixes for insufficient input
checks.

This rework also moves the invalidation of the destination range to
the prep function so that it is done before the range is remapped.
This ensures that nobody can access the data in range being remapped
until the remap is complete.

[dgc: fix xfs_reflink_remap_prep() return value and caller check to
handle vfs_clone_file_prep_inodes() returning 0 to mean "nothing to
do". ]

[dgc: make sure length changed by vfs_clone_file_prep_inodes() gets
propagated back to XFS code that does the remapping. ]

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-05 19:04:22 +10:00
Greg Kroah-Hartman
010bd965f9 overlayfs fixes for 4.19-rc7
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW7YOCQAKCRDh3BK/laaZ
 PAmGAQC1/3+mkfHXqyDx+zv4f7A348ZULW26EWFWwMwlEljWowD+PlRFBgUu5v5c
 194mCApZU7hl6o0u07aijjz6m81e6QE=
 =dM89
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Miklos writes:
  "overlayfs fixes for 4.19-rc7

   This update fixes a couple of regressions in the stacked file update
   added in this cycle, as well as some older bugs uncovered by
   syzkaller.

   There's also one trivial naming change that touches other parts of
   the fs subsystem."

* tag 'ovl-fixes-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix format of setxattr debug
  ovl: fix access beyond unterminated strings
  ovl: make symbol 'ovl_aops' static
  vfs: swap names of {do,vfs}_clone_file_range()
  ovl: fix freeze protection bypass in ovl_clone_file_range()
  ovl: fix freeze protection bypass in ovl_write_iter()
  ovl: fix memory leak on unlink of indexed file
2018-10-04 13:24:38 -07:00
Greg Kroah-Hartman
1b0350c355 XFS fixes for 4.19-rc6
Accumlated regression and bug fixes for 4.19-rc6, including:
 
 o make iomap correctly mark dirty pages for sub-page block sizes
 o fix regression in handling extent-to-btree format conversion errors
 o fix torn log wrap detection for new logs
 o various corrupt inode detection fixes
 o various delalloc state fixes
 o cleanup all the missed transaction cancel cases missed from changes merged
   in 4.19-rc1
 o fix lockdep false positive on transaction allocation
 o fix locking and reference counting on buffer log items
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJbtVyHAAoJEK3oKUf0dfodc8YP/1hT+TdXZDBVPo9kXQwmrnra
 8P1J8IEuGj851PwYQobUhifdDh4eQ+PpcYEIfqTSGhtjTg7cqyQOvTo4uUKHLn50
 8mFXQb6JrAFYnjGjorOCde+lpoQrtj3oKASscs7RaL/W1RNffY74UGdiE0yOJnJs
 9IC7TpJT4ZzAgYBgC61whfYWmszLmRuKlVZWgXrM8hkMqliHdfrXeZk7MXTeiflz
 Q5+9eVnCCgTtC0TWgVAkFMvlZs7UtNMWIIn6zt+HQLB7Vms6q4ArpIT+fF45njwG
 94tyTcFT0AcIJ8OIi5i85fOXcDGjTFFf554Yf80PlWGlk3SbXPHFQao/ItDA4S1Z
 eCl2eurUOXlNXKuyzWoV5KjOBiiBicbOwl6eX0K996cGehhEdFOvBihvAGjJy5rm
 c4plTTy6bhYKZWr3JLuaPDlNWDMr/P/aUkiq9tFXyaMmVKNeyxd6UKzIokl5uUfi
 Ycik4Ik8zRgpFEUx5Clvb2W5qH9pqpR0WcFhrcj0mDnJa10TpBWesA0g2F+hM0Jc
 0mSuGxfQeJk7tuVcvsBpPNzgVEPKabvnYbOlGK7HvCSOPRkg0Fhmj+iJK721ccVl
 Ltiz0Y485eAGLKn9kOWKr47A5GAAFpuK29OrD2a8XizHPZ3Sr2CG5X1jI0gtMb8n
 BiBAxO6dojRQATDuTI6w
 =5ylA
 -----END PGP SIGNATURE-----

Merge tag 'xfs-fixes-for-4.19-rc6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Dave writes:
  "XFS fixes for 4.19-rc6

   Accumlated regression and bug fixes for 4.19-rc6, including:

   o make iomap correctly mark dirty pages for sub-page block sizes
   o fix regression in handling extent-to-btree format conversion errors
   o fix torn log wrap detection for new logs
   o various corrupt inode detection fixes
   o various delalloc state fixes
   o cleanup all the missed transaction cancel cases missed from changes merged
     in 4.19-rc1
   o fix lockdep false positive on transaction allocation
   o fix locking and reference counting on buffer log items"

* tag 'xfs-fixes-for-4.19-rc6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix error handling in xfs_bmap_extents_to_btree
  iomap: set page dirty after partial delalloc on mkwrite
  xfs: remove invalid log recovery first/last cycle check
  xfs: validate inode di_forkoff
  xfs: skip delalloc COW blocks in xfs_reflink_end_cow
  xfs: don't treat unknown di_flags2 as corruption in scrub
  xfs: remove duplicated include from alloc.c
  xfs: don't bring in extents in xfs_bmap_punch_delalloc_range
  xfs: fix transaction leak in xfs_reflink_allocate_cow()
  xfs: avoid lockdep false positives in xfs_trans_alloc
  xfs: refactor xfs_buf_log_item reference count handling
  xfs: clean up xfs_trans_brelse()
  xfs: don't unlock invalidated buf on aborted tx commit
  xfs: remove last of unnecessary xfs_defer_cancel() callers
  xfs: don't crash the vfs on a garbage inline symlink
2018-10-04 09:17:38 -07:00
Miklos Szeredi
1a8f8d2a44 ovl: fix format of setxattr debug
Format has a typo: it was meant to be "%.*s", not "%*s".  But at some point
callers grew nonprintable values as well, so use "%*pE" instead with a
maximized length.

Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 3a1e819b4e ("ovl: store file handle of lower inode on copy up")
Cc: <stable@vger.kernel.org> # v4.12
2018-10-04 14:49:10 +02:00
Amir Goldstein
601350ff58 ovl: fix access beyond unterminated strings
KASAN detected slab-out-of-bounds access in printk from overlayfs,
because string format used %*s instead of %.*s.

> BUG: KASAN: slab-out-of-bounds in string+0x298/0x2d0 lib/vsprintf.c:604
> Read of size 1 at addr ffff8801c36c66ba by task syz-executor2/27811
>
> CPU: 0 PID: 27811 Comm: syz-executor2 Not tainted 4.19.0-rc5+ #36
...
>  printk+0xa7/0xcf kernel/printk/printk.c:1996
>  ovl_lookup_index.cold.15+0xe8/0x1f8 fs/overlayfs/namei.c:689

Reported-by: syzbot+376cea2b0ef340db3dd4@syzkaller.appspotmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 359f392ca5 ("ovl: lookup index entry for copy up origin")
Cc: <stable@vger.kernel.org> # v4.13
2018-10-04 14:49:10 +02:00
Amir Goldstein
bdd5a46fe3 fanotify: add BUILD_BUG_ON() to count the bits of fanotify constants
Also define the FANOTIFY_EVENT_FLAGS consisting of the extra flags
FAN_ONDIR and FAN_ON_CHILD.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-04 13:30:03 +02:00
Amir Goldstein
a39f7ec417 fsnotify: convert runtime BUG_ON() to BUILD_BUG_ON()
The BUG_ON() statements to verify number of bits in ALL_FSNOTIFY_BITS
and ALL_INOTIFY_BITS are converted to build time check of the constant.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-04 13:28:50 +02:00
Amir Goldstein
23c9deeb32 fanotify: deprecate uapi FAN_ALL_* constants
We do not want to add new bits to the FAN_ALL_* uapi constants
because they have been exposed to userspace.  If there are programs
out there using these constants, those programs could break if
re-compiled with modified FAN_ALL_* constants and run on an old kernel.

We deprecate the uapi constants FAN_ALL_* and define new FANOTIFY_*
constants for internal use to replace them. New feature bits will be
added only to the new constants.

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-04 13:28:38 +02:00
Amir Goldstein
a72fd224e3 fanotify: simplify handling of FAN_ONDIR
fanotify mark add/remove code jumps through hoops to avoid setting the
FS_ISDIR in the commulative object mask.

That was just papering over a bug in fsnotify() handling of the FS_ISDIR
extra flag. This bug is now fixed, so all the hoops can be removed along
with the unneeded internal flag FAN_MARK_ONDIR.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-04 13:28:29 +02:00
Amir Goldstein
007d1e8395 fsnotify: generalize handling of extra event flags
FS_EVENT_ON_CHILD gets a special treatment in fsnotify() because it is
not a flag specifying an event type, but rather an extra flags that may
be reported along with another event and control the handling of the
event by the backend.

FS_ISDIR is also an "extra flag" and not an "event type" and therefore
desrves the same treatment. With inotify/dnotify backends it was never
possible to set FS_ISDIR in mark masks, so it did not matter.
With fanotify backend, mark adding code jumps through hoops to avoid
setting the FS_ISDIR in the commulative object mask.

Separate the constant ALL_FSNOTIFY_EVENTS to ALL_FSNOTIFY_FLAGS and
ALL_FSNOTIFY_EVENTS, so the latter can be used to test for specific
event types.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-04 13:28:02 +02:00
David Howells
46894a1359 rxrpc: Use IPv4 addresses throught the IPv6
AF_RXRPC opens an IPv6 socket through which to send and receive network
packets, both IPv6 and IPv4.  It currently turns AF_INET addresses into
AF_INET-as-AF_INET6 addresses based on an assumption that this was
necessary; on further inspection of the code, however, it turns out that
the IPv6 code just farms packets aimed at AF_INET addresses out to the IPv4
code.

Fix AF_RXRPC to use AF_INET addresses directly when given them.

Fixes: 7b674e390e ("rxrpc: Fix IPv6 support")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-04 09:32:28 +01:00
David Howells
66be646bd9 afs: Sort address lists so that they are in logical ascending order
Sort address lists so that they are in logical ascending order rather than
being partially in ascending order of the BE representations of those
values.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-04 09:32:28 +01:00
David Howells
4c19bbdc7f afs: Always build address lists using the helper functions
Make the address list string parser use the helper functions for adding
addresses to an address list so that they end up appropriately sorted.
This will better handles overruns and make them easier to compare.

It also reduces the number of places that addresses are handled, making it
easier to fix the handling.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-04 09:32:27 +01:00
David Howells
68eb64c3d2 afs: Do better max capacity handling on address lists
Note the maximum allocated capacity in an afs_addr_list struct and discard
addresses that would exceed it in afs_merge_fs_addr{4,6}().

Also, since the current maximum capacity is less than 255, reduce the
relevant members to bytes.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-04 09:32:27 +01:00
Ingo Molnar
c0554d2d3d Merge branch 'linus' into x86/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-04 08:23:03 +02:00
Wang Shilong
182a79e0c1 ext4: propagate error from dquot_initialize() in EXT4_IOC_FSSETXATTR
We return most failure of dquota_initialize() except
inode evict, this could make a bit sense, for example
we allow file removal even quota files are broken?

But it dosen't make sense to allow setting project
if quota files etc are broken.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-10-03 12:19:21 -04:00
Eric W. Biederman
ae7795bc61 signal: Distinguish between kernel_siginfo and siginfo
Linus recently observed that if we did not worry about the padding
member in struct siginfo it is only about 48 bytes, and 48 bytes is
much nicer than 128 bytes for allocating on the stack and copying
around in the kernel.

The obvious thing of only adding the padding when userspace is
including siginfo.h won't work as there are sigframe definitions in
the kernel that embed struct siginfo.

So split siginfo in two; kernel_siginfo and siginfo.  Keeping the
traditional name for the userspace definition.  While the version that
is used internally to the kernel and ultimately will not be padded to
128 bytes is called kernel_siginfo.

The definition of struct kernel_siginfo I have put in include/signal_types.h

A set of buildtime checks has been added to verify the two structures have
the same field offsets.

To make it easy to verify the change kernel_siginfo retains the same
size as siginfo.  The reduction in size comes in a following change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-10-03 16:47:43 +02:00
Wang Shilong
dc7ac6c4ca ext4: fix setattr project check in fssetxattr ioctl
Currently, project quota could be changed by fssetxattr
ioctl, and existed permission check inode_owner_or_capable()
is obviously not enough, just think that common users could
change project id of file, that could make users to
break project quota easily.

This patch try to follow same regular of xfs project
quota:

"Project Quota ID state is only allowed to change from
within the init namespace. Enforce that restriction only
if we are trying to change the quota ID state.
Everything else is allowed in user namespaces."

Besides that, check and set project id'state should
be an atomic operation, protect whole operation with
inode lock, ext4_ioctl_setproject() is only used for
ioctl EXT4_IOC_FSSETXATTR, we have held mnt_want_write_file()
before ext4_ioctl_setflags(), and ext4_ioctl_setproject()
is called after ext4_ioctl_setflags(), we could share
codes, so remove it inside ext4_ioctl_setproject().

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
2018-10-03 10:33:32 -04:00
Greg Kroah-Hartman
73dec82d8d Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Al writes:
  "xattrs regression fix from Andreas; sat in -next for quite a while."

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  sysfs: Do not return POSIX ACL xattrs via listxattr
2018-10-03 04:21:23 -07:00
Souptick Joarder
401b25aa1a ext4: convert fault handler to use vm_fault_t type
Return type of ext4_page_mkwrite and ext4_filemap_fault are
changed to use vm_fault_t type.

With this patch all the callers of block_page_mkwrite_return()
are changed to handle vm_fault_t. So converting the return type
of block_page_mkwrite_return() to vm_fault_t.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
2018-10-02 22:20:50 -04:00
Lukas Czerner
625ef8a3ac ext4: initialize retries variable in ext4_da_write_inline_data_begin()
Variable retries is not initialized in ext4_da_write_inline_data_begin()
which can lead to nondeterministic number of retries in case we hit
ENOSPC. Initialize retries to zero as we do everywhere else.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fixes: bc0ca9df3b ("ext4: retry allocation when inline->extent conversion failed")
Cc: stable@kernel.org
2018-10-02 21:18:45 -04:00
Jaegeuk Kim
fb7d70db30 f2fs: clear PageError on the read path
When running fault injection test, I hit somewhat wrong behavior in f2fs_gc ->
gc_data_segment():

0. fault injection generated some PageError'ed pages

1. gc_data_segment
 -> f2fs_get_read_data_page(REQ_RAHEAD)

2. move_data_page
 -> f2fs_get_lock_data_page()
  -> f2f_get_read_data_page()
   -> f2fs_submit_page_read()
    -> submit_bio(READ)
  -> return EIO due to PageError
  -> fail to move data

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-02 17:33:10 -07:00
Steve French
7af929d6d0 smb3: fix lease break problem introduced by compounding
Fixes problem (discovered by Aurelien) introduced by recent commit:
commit b24df3e30c
("cifs: update receive_encrypted_standard to handle compounded responses")

which broke the ability to respond to some lease breaks
(lease breaks being ignored is a problem since can block
server response for duration of the lease break timeout).

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-10-02 18:54:09 -05:00
Ronnie Sahlberg
4e34feb5e9 cifs: only wake the thread for the very last PDU in a compound
For compounded PDUs we whould only wake the waiting thread for the
very last PDU of the compound.
We do this so that we are guaranteed that the demultiplex_thread will
not process or access any of those MIDs any more once the send/recv
thread starts processing.

Else there is a race where at the end of the send/recv processing we
will try to delete all the mids of the compound. If the multiplex
thread still has other mids to process at this point for this compound
this can lead to an oops.

Needed to fix recent commit:
commit 730928c8f4
("cifs: update smb2_queryfs() to use compounding")

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-10-02 18:53:57 -05:00
Ronnie Sahlberg
ddf83afb9f cifs: add a warning if we try to to dequeue a deleted mid
cifs_delete_mid() is called once we are finished handling a mid and we
expect no more work done on this mid.

Needed to fix recent commit:
commit 730928c8f4
("cifs: update smb2_queryfs() to use compounding")

Add a warning if someone tries to dequeue a mid that has already been
flagged to be deleted.
Also change list_del() to list_del_init() so that if we have similar bugs
resurface in the future we will not oops.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-10-02 18:12:31 -05:00
Aurelien Aptel
0595751f26 smb2: fix missing files in root share directory listing
When mounting a Windows share that is the root of a drive (eg. C$)
the server does not return . and .. directory entries. This results in
the smb2 code path erroneously skipping the 2 first entries.

Pseudo-code of the readdir() code path:

cifs_readdir(struct file, struct dir_context)
    initiate_cifs_search            <-- if no reponse cached yet
        server->ops->query_dir_first

    dir_emit_dots
        dir_emit                    <-- adds "." and ".." if we're at pos=0

    find_cifs_entry
        initiate_cifs_search        <-- if pos < start of current response
                                         (restart search)
        server->ops->query_dir_next <-- if pos > end of current response
                                         (fetch next search res)

    for(...)                        <-- loops over cur response entries
                                          starting at pos
        cifs_filldir                <-- skip . and .., emit entry
            cifs_fill_dirent
            dir_emit
	pos++

A) dir_emit_dots() always adds . & ..
   and sets the current dir pos to 2 (0 and 1 are done).

Therefore we always want the index_to_find to be 2 regardless of if
the response has . and ..

B) smb1 code initializes index_of_last_entry with a +2 offset

  in cifssmb.c CIFSFindFirst():
		psrch_inf->index_of_last_entry = 2 /* skip . and .. */ +
			psrch_inf->entries_in_buffer;

Later in find_cifs_entry() we want to find the next dir entry at pos=2
as a result of (A)

	first_entry_in_buffer = cfile->srch_inf.index_of_last_entry -
					cfile->srch_inf.entries_in_buffer;

This var is the dir pos that the first entry in the buffer will
have therefore it must be 2 in the first call.

If we don't offset index_of_last_entry by 2 (like in (B)),
first_entry_in_buffer=0 but we were instructed to get pos=2 so this
code in find_cifs_entry() skips the 2 first which is ok for non-root
shares, as it skips . and .. from the response but is not ok for root
shares where the 2 first are actual files

		pos_in_buf = index_to_find - first_entry_in_buffer;
                // pos_in_buf=2
		// we skip 2 first response entries :(
		for (i = 0; (i < (pos_in_buf)) && (cur_ent != NULL); i++) {
			/* go entry by entry figuring out which is first */
			cur_ent = nxt_dir_entry(cur_ent, end_of_smb,
						cfile->srch_inf.info_level);
		}

C) cifs_filldir() skips . and .. so we can safely ignore them for now.

Sample program:

int main(int argc, char **argv)
{
	const char *path = argc >= 2 ? argv[1] : ".";
	DIR *dh;
	struct dirent *de;

	printf("listing path <%s>\n", path);
	dh = opendir(path);
	if (!dh) {
		printf("opendir error %d\n", errno);
		return 1;
	}

	while (1) {
		de = readdir(dh);
		if (!de) {
			if (errno) {
				printf("readdir error %d\n", errno);
				return 1;
			}
			printf("end of listing\n");
			break;
		}
		printf("off=%lu <%s>\n", de->d_off, de->d_name);
	}

	return 0;
}

Before the fix with SMB1 on root shares:

<.>            off=1
<..>           off=2
<$Recycle.Bin> off=3
<bootmgr>      off=4

and on non-root shares:

<.>    off=1
<..>   off=4  <-- after adding .., the offsets jumps to +2 because
<2536> off=5       we skipped . and .. from response buffer (C)
<411>  off=6       but still incremented pos
<file> off=7
<fsx>  off=8

Therefore the fix for smb2 is to mimic smb1 behaviour and offset the
index_of_last_entry by 2.

Test results comparing smb1 and smb2 before/after the fix on root
share, non-root shares and on large directories (ie. multi-response
dir listing):

PRE FIX
=======
pre-1-root VS pre-2-root:
        ERR pre-2-root is missing [bootmgr, $Recycle.Bin]
pre-1-nonroot VS pre-2-nonroot:
        OK~ same files, same order, different offsets
pre-1-nonroot-large VS pre-2-nonroot-large:
        OK~ same files, same order, different offsets

POST FIX
========
post-1-root VS post-2-root:
        OK same files, same order, same offsets
post-1-nonroot VS post-2-nonroot:
        OK same files, same order, same offsets
post-1-nonroot-large VS post-2-nonroot-large:
        OK same files, same order, same offsets

REGRESSION?
===========
pre-1-root VS post-1-root:
        OK same files, same order, same offsets
pre-1-nonroot VS post-1-nonroot:
        OK same files, same order, same offsets

BugLink: https://bugzilla.samba.org/show_bug.cgi?id=13107
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Paulo Alcantara <palcantara@suse.deR>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2018-10-02 18:06:21 -05:00
Theodore Ts'o
18aded1749 ext4: fix EXT4_IOC_SWAP_BOOT
The code EXT4_IOC_SWAP_BOOT ioctl hasn't been updated in a while, and
it's a bit broken with respect to more modern ext4 kernels, especially
metadata checksums.

Other problems fixed with this commit:

* Don't allow installing a DAX, swap file, or an encrypted file as a
  boot loader.

* Respect the immutable and append-only flags.

* Wait until any DIO operations are finished *before* calling
  truncate_inode_pages().

* Don't swap inode->i_flags, since these flags have nothing to do with
  the inode blocks --- and it will give the IMA/audit code heartburn
  when the inode is evicted.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Reported-by: syzbot+e81ccd4744c6c4f71354@syzkaller.appspotmail.com
2018-10-02 18:21:19 -04:00
Gabriel Krisman Bertazi
799578ab16 ext4: fix build error when DX_DEBUG is defined
Enabling DX_DEBUG triggers the build error below.  info is an attribute
of  the dxroot structure.

linux/fs/ext4/namei.c:2264:12: error: ‘info’
undeclared (first use in this function); did you mean ‘insl’?
	   	  info->indirect_levels));

Fixes: e08ac99fa2 ("ext4: add largedir feature")
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2018-10-02 12:43:51 -04:00
Theodore Ts'o
f18b2b83a7 ext4: fix argument checking in EXT4_IOC_MOVE_EXT
If the starting block number of either the source or destination file
exceeds the EOF, EXT4_IOC_MOVE_EXT should return EINVAL.

Also fixed the helper function mext_check_coverage() so that if the
logical block is beyond EOF, make it return immediately, instead of
looping until the block number wraps all the away around.  This takes
long enough that if there are multiple threads trying to do pound on
an the same inode doing non-sensical things, it can end up triggering
the kernel's soft lockup detector.

Reported-by: syzbot+c61979f6f2cba5cb3c06@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-10-02 01:34:44 -04:00
Greg Kroah-Hartman
ef0f2584c2 Fixes for v4.19-rc7
- Fix failure-path memory leak in ramoops_init (nixiaoming)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAluxB3cWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJidvEACZlqemGsSVQ4elXWCW9EPqyVSn
 lbPg5ONurG/51J6313Ankgn7PrOI7WRd3iAXUWYByoc0DXn/jgz0i/B1+MtSnH4g
 fMeZ3DMnwmuC4h8/50/xmxbjKj2+vW7tX+978wWbYnvYNC+UXf8CN4J4MwBFmNR3
 DGH+oVCx2MITbIYQ3u5FIUgRJl0sD15GtxHg/l5Ff78dtUJVvlnD6ZAa9/rDBCPu
 DD0IqjJ5BqTmGw98L2tG0I2SDSrC8TGAYdQlZK/k7vHUqCWP6QspCpQQy3x6bh+W
 QRL6gCNEPZIGB+uAVbueCo80zRrx1NltbkbO4n9zn9ItYxpvOXJwGT4peYYXwabO
 nn+cWwRAJGITp9BFKIk/V8p05rpRhw8oeOBHIzwylb3C6bqK3ijnkk9mNSmUnLPG
 FtzRs/cYEMHi7n7+aygm1lNHn98PAWGLmomXLyxIUtsSH1jvFWG+jxLiRih/Oah5
 qjSGw2r681vLKsj2tQV4hpFvWaV83t2cO1QOldWSSkVHKJO9+pglEUGz9gfCvBRz
 bCvA0aPNGmMGTO0faQMB/TG2pMtOK94UU6kBIALxlEeHrtdpoOaQN6g++R7sY3/y
 SWo8BeGlNMSAtkWtd6Y7xxIF5N4PDDc6iMyjHGM/KCptANVu03BWlfOn32tel9jR
 17wiJi6SjqQjHVp7jw==
 =bzfz
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.19-rc7' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Kees writes:
  "Pstore fixes for v4.19-rc7

   - Fix failure-path memory leak in ramoops_init (nixiaoming)"

* tag 'pstore-v4.19-rc7' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore/ram: Fix failure-path memory leak in ramoops_init
2018-10-01 17:22:36 -07:00
Eric Whitney
f456767d33 ext4: fix reserved cluster accounting at page invalidation time
Add new code to count canceled pending cluster reservations on bigalloc
file systems and to reduce the cluster reservation count on all file
systems using delayed allocation.  This replaces old code in
ext4_da_page_release_reservations that was incorrect.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:33:24 -04:00
Eric Whitney
9fe671496b ext4: adjust reserved cluster count when removing extents
Modify ext4_ext_remove_space() and the code it calls to correct the
reserved cluster count for pending reservations (delayed allocated
clusters shared with allocated blocks) when a block range is removed
from the extent tree.  Pending reservations may be found for the clusters
at the ends of written or unwritten extents when a block range is removed.
If a physical cluster at the end of an extent is freed, it's necessary
to increment the reserved cluster count to maintain correct accounting
if the corresponding logical cluster is shared with at least one
delayed and unwritten extent as found in the extents status tree.

Add a new function, ext4_rereserve_cluster(), to reapply a reservation
on a delayed allocated cluster sharing blocks with a freed allocated
cluster.  To avoid ENOSPC on reservation, a flag is applied to
ext4_free_blocks() to briefly defer updating the freeclusters counter
when an allocated cluster is freed.  This prevents another thread
from allocating the freed block before the reservation can be reapplied.

Redefine the partial cluster object as a struct to carry more state
information and to clarify the code using it.

Adjust the conditional code structure in ext4_ext_remove_space to
reduce the indentation level in the main body of the code to improve
readability.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:25:08 -04:00
Eric Whitney
b6bf9171ef ext4: reduce reserved cluster count by number of allocated clusters
Ext4 does not always reduce the reserved cluster count by the number
of clusters allocated when mapping a delayed extent.  It sometimes
adds back one or more clusters after allocation if delalloc blocks
adjacent to the range allocated by ext4_ext_map_blocks() share the
clusters newly allocated for that range.  However, this overcounts
the number of clusters needed to satisfy future mapping requests
(holding one or more reservations for clusters that have already been
allocated) and premature ENOSPC and quota failures, etc., result.

Ext4 also does not reduce the reserved cluster count when allocating
clusters for non-delayed allocated writes that have previously been
reserved for delayed writes.  This also results in overcounts.

To make it possible to handle reserved cluster accounting for
fallocated regions in the same manner as used for other non-delayed
writes, do the reserved cluster accounting for them at the time of
allocation.  In the current code, this is only done later when a
delayed extent sharing the fallocated region is finally mapped.

Address comment correcting handling of unsigned long long constant
from Jan Kara's review of RFC version of this patch.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:24:08 -04:00
Eric Whitney
0b02f4c0d6 ext4: fix reserved cluster accounting at delayed write time
The code in ext4_da_map_blocks sometimes reserves space for more
delayed allocated clusters than it should, resulting in premature
ENOSPC, exceeded quota, and inaccurate free space reporting.

Fix this by checking for written and unwritten blocks shared in the
same cluster with the newly delayed allocated block.  A cluster
reservation should not be made for a cluster for which physical space
has already been allocated.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:19:37 -04:00
Eric Whitney
1dc0aa46e7 ext4: add new pending reservation mechanism
Add new pending reservation mechanism to help manage reserved cluster
accounting.  Its primary function is to avoid the need to read extents
from the disk when invalidating pages as a result of a truncate, punch
hole, or collapse range operation.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:17:41 -04:00
Eric Whitney
ad431025ae ext4: generalize extents status tree search functions
Ext4 contains a few functions that are used to search for delayed
extents or blocks in the extents status tree.  Rather than duplicate
code to add new functions to search for extents with different status
values, such as written or a combination of delayed and unwritten,
generalize the existing code to search for caller-specified extents
status values.  Also, move this code into extents_status.c where it
is better associated with the data structures it operates upon, and
where it can be more readily used to implement new extents status tree
functions that might want a broader scope for i_es_lock.

Three missing static specifiers in RFC version of patch reported and
fixed by Fengguang Wu <fengguang.wu@intel.com>.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:10:39 -04:00
Jens Axboe
c0aac682fa This is the 4.19-rc6 release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAluw4MIACgkQONu9yGCS
 aT7+8xAAiYnc4khUsxeInm3z44WPfRX1+UF51frTNSY5C8Nn5nvRSnTUNLuKkkrz
 8RbwCL6UYyJxF9I/oZdHPsPOD4IxXkQY55tBjz7ZbSBIFEwYM6RJMm8mAGlXY7wq
 VyWA5MhlpGHM9DjrguB4DMRipnrSc06CVAnC+ZyKLjzblzU1Wdf2dYu+AW9pUVXP
 j4r74lFED5djPY1xfqfzEwmYRCeEGYGx7zMqT3GrrF5uFPqj1H6O5klEsAhIZvdl
 IWnJTU2coC8R/Sd17g4lHWPIeQNnMUGIUbu+PhIrZ/lDwFxlocg4BvarPXEdzgYi
 gdZzKBfovpEsSu5RCQsKWG4IGQxY7I1p70IOP9eqEFHZy77qT1YcHVAWrK1Y/bJd
 UA08gUOSzRnhKkNR3+PsaMflUOl9WkpyHECZu394cyRGMutSS50aWkavJPJ/o1Qi
 D/oGqZLLcKFyuNcchG+Met1TzY3LvYEDgSburqwqeUZWtAsGs8kmiiq7qvmXx4zV
 IcgM8ERqJ8mbfhfsXQU7hwydIrPJ3JdIq19RnM5ajbv2Q4C/qJCyAKkQoacrlKR4
 aiow/qvyNrP80rpXfPJB8/8PiWeDtAnnGhM+xySZNlw3t8GR6NYpUkIzf5TdkSb3
 C8KuKg6FY9QAS62fv+5KK3LB/wbQanxaPNruQFGe5K1iDQ5Fvzw=
 =dMl4
 -----END PGP SIGNATURE-----

Merge tag 'v4.19-rc6' into for-4.20/block

Merge -rc6 in, for two reasons:

1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so
   they aren't in the 4.20 branch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

* tag 'v4.19-rc6': (780 commits)
  Linux 4.19-rc6
  MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
  cpufreq: qcom-kryo: Fix section annotations
  perf/core: Add sanity check to deal with pinned event failure
  xen/blkfront: correct purging of persistent grants
  Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
  selftests/powerpc: Fix Makefiles for headers_install change
  blk-mq: I/O and timer unplugs are inverted in blktrace
  dax: Fix deadlock in dax_lock_mapping_entry()
  x86/boot: Fix kexec booting failure in the SEV bit detection code
  bcache: add separate workqueue for journal_write to avoid deadlock
  drm/amd/display: Fix Edid emulation for linux
  drm/amd/display: Fix Vega10 lightup on S3 resume
  drm/amdgpu: Fix vce work queue was not cancelled when suspend
  Revert "drm/panel: Add device_link from panel device to DRM device"
  xen/blkfront: When purging persistent grants, keep them in the buffer
  clocksource/drivers/timer-atmel-pit: Properly handle error cases
  block: fix deadline elevator drain for zoned block devices
  ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
  drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-01 08:58:57 -06:00
Miklos Szeredi
e52a825048 fuse: realloc page array
Writeback caching currently allocates requests with the maximum number of
possible pages, while the actual number of pages per request depends on a
couple of factors that cannot be determined when the request is allocated
(whether page is already under writeback, whether page is contiguous with
previous pages already added to a request).

This patch allows such requests to start with no page allocation (all pages
inline) and grow the page array on demand.

If the max_pages tunable remains the default value, then this will mean
just one allocation that is the same size as before.  If the tunable is
larger, then this adds at most 3 additional memory allocations (which is
generously compensated by the improved performance from the larger
request).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:06 +02:00
Constantine Shulyupin
5da784cce4 fuse: add max_pages to init_out
Replace FUSE_MAX_PAGES_PER_REQ with the configurable parameter max_pages to
improve performance.

Old RFC with detailed description of the problem and many fixes by Mitsuo
Hayasaka (mitsuo.hayasaka.hu@hitachi.com):
 - https://lkml.org/lkml/2012/7/5/136

We've encountered performance degradation and fixed it on a big and complex
virtual environment.

Environment to reproduce degradation and improvement:

1. Add lag to user mode FUSE
Add nanosleep(&(struct timespec){ 0, 1000 }, NULL); to xmp_write_buf in
passthrough_fh.c

2. patch UM fuse with configurable max_pages parameter. The patch will be
provided latter.

3. run test script and perform test on tmpfs
fuse_test()
{

       cd /tmp
       mkdir -p fusemnt
       passthrough_fh -o max_pages=$1 /tmp/fusemnt
       grep fuse /proc/self/mounts
       dd conv=fdatasync oflag=dsync if=/dev/zero of=fusemnt/tmp/tmp \
		count=1K bs=1M 2>&1 | grep -v records
       rm fusemnt/tmp/tmp
       killall passthrough_fh
}

Test results:

passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
	rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0
1073741824 bytes (1.1 GB) copied, 1.73867 s, 618 MB/s

passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
	rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_pages=256 0 0
1073741824 bytes (1.1 GB) copied, 1.15643 s, 928 MB/s

Obviously with bigger lag the difference between 'before' and 'after'
will be more significant.

Mitsuo Hayasaka, in 2012 (https://lkml.org/lkml/2012/7/5/136),
observed improvement from 400-550 to 520-740.

Signed-off-by: Constantine Shulyupin <const@MakeLinux.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:06 +02:00
Miklos Szeredi
8a7aa286ab fuse: allocate page array more efficiently
When allocating page array for a request the array for the page pointers
and the array for page descriptors are allocated by two separate kmalloc()
calls.  Merge these into one allocation.

Also instead of initializing the request and the page arrays with memset(),
use the zeroing allocation variants.

Reserved requests never carry pages (page array size is zero). Make that
explicit by initializing the page array pointers to NULL and make sure the
assumption remains true by adding a WARN_ON().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:05 +02:00
Miklos Szeredi
ab2257e994 fuse: reduce size of struct fuse_inode
Do this by grouping fields used for cached writes and putting them into a
union with fileds used for cached readdir (with obviously no overlap, since
we don't have hybrid objects).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:05 +02:00
Miklos Szeredi
261aaba72f fuse: use iversion for readdir cache verification
Use the internal iversion counter to make sure modifications of the
directory through this filesystem are not missed by the mtime check (due to
mtime granularity).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:05 +02:00
Miklos Szeredi
7118883b44 fuse: use mtime for readdir cache verification
Store the modification time of the directory in the cache, obtained before
starting to fill the cache.

When reading the cache, verify that the directory hasn't changed, by
checking if current modification time is the same as the one stored in the
cache.

This only needs to be done when the current file position is at the
beginning of the directory, as mandated by POSIX.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:04 +02:00
Miklos Szeredi
3494927e09 fuse: add readdir cache version
Allow the cache to be invalidated when page(s) have gone missing.  In this
case increment the version of the cache and reset to an empty state.

Add a version number to the directory stream in struct fuse_file as well,
indicating the version of the cache it's supposed to be reading.  If the
cache version doesn't match the stream's version, then reset the stream to
the beginning of the cache.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:04 +02:00
Miklos Szeredi
5d7bc7e868 fuse: allow using readdir cache
The cache is only used if it's completed, not while it's still being
filled; this constraint could be lifted later, if it turns out to be
useful.

Introduce state in struct fuse_file that indicates the position within the
cache.  After a seek, reset the position to the beginning of the cache and
search the cache for the current position.  If the current position is not
found in the cache, then fall back to uncached readdir.

It can also happen that page(s) disappear from the cache, in which case we
must also fall back to uncached readdir.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:04 +02:00
Miklos Szeredi
69e3455115 fuse: allow caching readdir
This patch just adds the cache filling functions, which are invoked if
FOPEN_CACHE_DIR flag is set in the OPENDIR reply.

Cache reading and cache invalidation are added by subsequent patches.

The directory cache uses the page cache.  Directory entries are packed into
a page in the same format as in the READDIR reply.  A page only contains
whole entries, the space at the end of the page is cleared.  The page is
locked while being modified.

Multiple parallel readdirs on the same directory can fill the cache; the
only constraint is that continuity must be maintained (d_off of last entry
points to position of current entry).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:04 +02:00
Chao Yu
f847c699cf f2fs: allow out-place-update for direct IO in LFS mode
Normally, DIO uses in-pllace-update, but in LFS mode, f2fs doesn't
allow triggering any in-place-update writes, so we fallback direct
write to buffered write, result in bad performance of large size
write.

This patch adds to support triggering out-place-update for direct IO
to enhance its performance.

Note that it needs to exclude direct read IO during direct write,
since new data writing to new block address will no be valid until
write finished.

storage: zram

time xfs_io -f -d /mnt/f2fs/file -c "pwrite 0 1073741824" -c "fsync"

Before:
real	0m13.061s
user	0m0.327s
sys	0m12.486s

After:
real	0m6.448s
user	0m0.228s
sys	0m6.212s

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-30 18:42:50 -07:00
Chao Yu
39a8695824 f2fs: refactor ->page_mkwrite() flow
Thread A				Thread B
- f2fs_vm_page_mkwrite
					- f2fs_setattr
					 - down_write(i_mmap_sem)
					 - truncate_setsize
					 - f2fs_truncate
					 - up_write(i_mmap_sem)
 - f2fs_reserve_block
 reserve NEW_ADDR
 - skip dirty page due to truncation

1. we don't need to rserve new block address for a truncated page.
2. dn.data_blkaddr is used out of node page lock coverage.

Refactor ->page_mkwrite() flow to fix above issues:
- use __do_map_lock() to avoid racing checkpoint()
- lock data page in prior to dnode page
- cover f2fs_reserve_block with i_mmap_sem lock
- wait page writeback before zeroing page

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-30 18:41:22 -07:00
Chao Yu
bab475c541 Revert: "f2fs: check last page index in cached bio to decide submission"
There is one case that we can leave bio in f2fs, result in hanging
page writeback waiter.

Thread A				Thread B
- f2fs_write_cache_pages
 - f2fs_submit_page_write
 page #0 cached in bio #0 of cold log
 - f2fs_submit_page_write
 page #1 cached in bio #1 of warm log
					- f2fs_write_cache_pages
					 - f2fs_submit_page_write
					 bio is full, submit bio #1 contain page #1
 - f2fs_submit_merged_write_cond(, page #1)
 fail to submit bio #0 due to page #1 is not in any cached bios.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-30 18:39:54 -07:00
Junling Zheng
d440c52d31 f2fs: support superblock checksum
Now we support crc32 checksum for superblock.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Junling Zheng <zhengjunling@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-30 18:34:18 -07:00
Chao Yu
274bd9ba39 f2fs: add to account skip count of background GC
This patch adds to account skip count of background GC, and show stat
info via 'status' debugfs entry.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-30 18:33:34 -07:00
Chao Yu
b63e7be590 f2fs: add to account meta IO
This patch supports to account meta IO, it enables to show write IO
from f2fs more comprehensively via 'status' debugfs entry.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-30 18:33:21 -07:00
Jaegeuk Kim
095680f24f f2fs: keep lazytime on remount
This patch fixes losing lazytime when remounting f2fs.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-30 16:50:46 -07:00
Dave Chinner
e55ec4ddbe xfs: fix error handling in xfs_bmap_extents_to_btree
Commit 01239d77b9 ("xfs: fix a null pointer dereference in
xfs_bmap_extents_to_btree") attempted to fix a null pointer
dreference when a fuzzing corruption of some kind was found.
This fix was flawed, resulting in assert failures like:

XFS: Assertion failed: ifp->if_broot == NULL, file: fs/xfs/libxfs/xfs_bmap.c, line: 715
.....
Call Trace:
  xfs_bmap_extents_to_btree+0x6b9/0x7b0
  __xfs_bunmapi+0xae7/0xf00
  ? xfs_log_reserve+0x1c8/0x290
  xfs_reflink_remap_extent+0x20b/0x620
  xfs_reflink_remap_blocks+0x7e/0x290
  xfs_reflink_remap_range+0x311/0x530
  vfs_dedupe_file_range_one+0xd7/0xe0
  vfs_dedupe_file_range+0x15b/0x1a0
  do_vfs_ioctl+0x267/0x6c0

The problem is that the error handling code now asserts that the
inode fork is not in btree format before the error handling code
undoes the modifications that put the fork back in extent format.
Fix this by moving the assert back to after the xfs_iroot_realloc()
call that returns the fork to extent format, and clean up the jump
labels to be meaningful.

Also, returning ENOSPC when xfs_btree_get_bufl() fails to
instantiate the buffer that was allocated (the actual fix in the
commit mentioned above) is incorrect. This is a fatal error - only
an invalid block address or a filesystem shutdown can result in
failing to get a buffer here.

Hence change this to EFSCORRUPTED so that the higher layer knows
this was a corruption related failure and should not treat it as an
ENOSPC error.  This should result in a shutdown (via cancelling a
dirty transaction) which is necessary as we do not attempt to clean
up the (invalid) block that we have already allocated.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-01 08:11:07 +10:00
Trond Myklebust
c7944ebb9c NFSv4: Fix lookup revalidate of regular files
If we're revalidating an existing dentry in order to open a file, we need
to ensure that we check the directory has not changed before we optimise
away the lookup.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:18 -04:00
Trond Myklebust
5ceb9d7fda NFS: Refactor nfs_lookup_revalidate()
Refactor the code in nfs_lookup_revalidate() as a stepping stone towards
optimising and fixing nfs4_lookup_revalidate().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:18 -04:00
Trond Myklebust
be189f7e7f NFS: Fix dentry revalidation on NFSv4 lookup
We need to ensure that inode and dentry revalidation occurs correctly
on reopen of a file that is already open. Currently, we can end up
not revalidating either in the case of NFSv4.0, due to the 'cached open'
path.
Let's fix that by ensuring that we only do cached open for the special
cases of open recovery and delegation return.

Reported-by: Stan Hu <stanhu@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:18 -04:00
Trond Myklebust
1c6c4b740d NFS: Remove private spinlock in struct nfs_pgio_header
Now that each struct nfs_pgio_header corresponds to one RPC call, we
only have one writer to the struct nfs_pgio_header.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
8d8928d879 NFSv3: Improve NFSv3 performance when server returns no post-op attributes
When the server fails to return post-op attributes, the client's
attempt to place read data directly in the page cache fails, and
so we have to do an extra copy in order to realign the data with
page borders.
This patch attempts to detect servers that don't return post-op
attributes on read (e.g. for pNFS) and adjusts the placement
calculation accordingly.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Anna Schumaker
80f4236886 NFSv4: Split out NFS v4.2 copy completion functions
The convention in the rest of the code is to have a separate function
for anything that might be ifdef-ed out.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Anna Schumaker
000d3f9566 NFS: Reduce indentation of nfs4_recovery_handle_error()
This is to match kernel coding style for switch statements.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Anna Schumaker
35a61606a6 NFS: Reduce indentation of the switch statement in nfs4_reclaim_open_state()
Most places in the kernel tend to line up cases with the switch to
reduce indentation, so move this over to match that style.
Additionally, I handle the (status >= 0) case in the switch so that we
only "goto restart" from a single place after error handling.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Anna Schumaker
cb7a8384dc NFS: Split out the body of nfs4_reclaim_open_state()
Moving all of this into a new function removes the need for cramped
indentation, making the code overall easier to look at.   I also take
this chance to switch copy recovery over to using
nfs4_stateid_match_other()

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Tigran Mkrtchyan
10ec57e4c5 nfs4: flex_file: ignore synthetic uid/gid for tightly coupled DSes
for tightly coupled DSes client must ignore provided synthetic uid and
gid as stated in draft-ietf-nfsv4-flex-files-19#section-5.1.

Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
943cff67b8 NFSv4.1: Fix the r/wsize checking
The intention of nfs4_session_set_rwsize() was to cap the r/wsize to the
buffer sizes negotiated by the CREATE_SESSION. The initial code had a
bug whereby we would not check the values negotiated by nfs_probe_fsinfo()
(the assumption being that CREATE_SESSION will always negotiate buffer values
that are sane w.r.t. the server's preferred r/wsizes) but would only check
values set by the user in the 'mount' command.

The code was changed in 4.11 to _always_ set the r/wsize, meaning that we
now never use the server preferred r/wsizes. This is the regression that
this patch fixes.
Also rename the function to nfs4_session_limit_rwsize() in order to avoid
future confusion.

Fixes: 033853325f (NFSv4.1 respect server's max size in CREATE_SESSION")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
ace9fad43a NFSv4: Convert struct nfs4_state to use refcount_t
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
9ae075fdd1 NFSv4: Convert open state lookup to use RCU
Further reduce contention on the inode->i_lock.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
0de43976fb NFS: Convert lookups of the open context to RCU
Reduce contention on the inode->i_lock by ensuring that we use RCU
when looking up the NFS open context.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
6ba0c4e5bb NFS: Simplify internal check for whether file is open for write
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
1db97eaa0b NFS: Convert lookups of the lock context to RCU
Speed up lookups of an existing lock context by avoiding the inode->i_lock,
and using RCU instead.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Trond Myklebust
28ced9a84c pNFS: Don't allocate more pages than we need to fit a layoutget response
For the 'files' and 'flexfiles' layout types, we do not expect the reply
to be any larger than 4k. The block and scsi layout types are a little more
greedy, so we keep allocating the maximum response size for now.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Trond Myklebust
a2791d3a2c pNFS: Don't zero out the array in nfs4_alloc_pages()
We don't need a zeroed out array, since it is immediately being filled.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Trond Myklebust
431f6eb357 SUNRPC: Add a label for RPC calls that require allocation on receive
If the RPC call relies on the receive call allocating pages as buffers,
then let's label it so that we
a) Don't leak memory by allocating pages for requests that do not expect
   this behaviour
b) Can optimise for the common case where calls do not require allocation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Miguel Ojeda
f0604f6303 Compiler Attributes: ext4: remove local __nonstring definition
Commit 072ebb3bff ("ext4: add nonstring annotations to ext4.h")
introduced a local definition of __nonstring to suppress some false
positives in gcc 8's -Wstringop-truncation.

Since now we support __nonstring for everyone, remove it.

Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # on top of v4.19-rc5, clang 7
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
2018-09-30 20:14:04 +02:00
Kees Cook
bac6f6cda2 pstore/ram: Fix failure-path memory leak in ramoops_init
As reported by nixiaoming, with some minor clarifications:

1) memory leak in ramoops_register_dummy():
   dummy_data = kzalloc(sizeof(*dummy_data), GFP_KERNEL);
   but no kfree() if platform_device_register_data() fails.

2) memory leak in ramoops_init():
   Missing platform_device_unregister(dummy) and kfree(dummy_data)
   if platform_driver_register(&ramoops_driver) fails.

I've clarified the purpose of ramoops_register_dummy(), and added a
common cleanup routine for all three failure paths to call.

Reported-by: nixiaoming <nixiaoming@huawei.com>
Cc: stable@vger.kernel.org
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-09-30 10:15:41 -07:00
Greg Kroah-Hartman
9ba6873e16 filesystem-dax for 4.19-rc6
Fix a deadlock in the new for 4.19 dax_lock_mapping_entry() routine.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbsDBrAAoJEB7SkWpmfYgCi54QAIz8yFZI5+5+amG/L/F9mGe4
 sagcSPsk67EzzTDhnKASTlRmpm0+LWzQckY7o/fDRoM0VQVKjXVDke4VTDnHFg7W
 JfZMN24dg6Pcbq3CxuSXMOiWd8vXSnLL2Myin+fQ/kY1rxnIz2ZYNWxCQsLvdPiC
 VKJAbpYlcG41HZPPnRkMaRBxf2INUraSgyHFoehbgvlwLD7YUOzPh9strauutK5M
 xljv2d/yjfaW4U6DhQhUSo+sDYRLGDkbqQw6ZoVqbODA0IXdY6ytiCujLLD9xODg
 lDKF68jCX/+lFIURm8BRpX9iqHvfILC5el61a4bTxjJ6XUf+Ok5vgkeZFDfQKziC
 rLqm09NTQ5Xu0MJ8Ql+5cqAFqBMA7Uy1zF6l8DnGFCtMV/S0H/TgdXWLzHjRXQvE
 18ekLqTcRk5UmPXJYJ829ln0TKTd3zyuVgwuLuGAeO97m431y3K2Q74ncPahgE9+
 W0nduPFTmMikohcKah2P3mQWGtUAYWodQsEs+Y9gJPyoDic+fmjo+mI0xg7CeFL4
 kpfug45i8hdbnlHrHOJ6bz7fRq7CvaaRaI3gOvFfuN2TJVY8Qfs/8JD4HN8F7u+r
 zDPVnvkutaYV1uOOBU4nDzPJ+naVGlpOj1/tsMU4ikj3LbfkfW+gxsr6XGZYPU81
 qYjEfXm60ritFoAA5dVV
 =3l8a
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-fixes2-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Dan writes:
  "filesystem-dax for 4.19-rc6

   Fix a deadlock in the new for 4.19 dax_lock_mapping_entry() routine."

* tag 'libnvdimm-fixes2-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax: Fix deadlock in dax_lock_mapping_entry()
2018-09-30 06:19:38 -07:00
Matthew Wilcox
3159f943aa xarray: Replace exceptional entries
Introduce xarray value entries and tagged pointers to replace radix
tree exceptional entries.  This is a slight change in encoding to allow
the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
value entry).  It is also a change in emphasis; exceptional entries are
intimidating and different.  As the comment explains, you can choose
to store values or pointers in the xarray and they are both first-class
citizens.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2018-09-29 22:47:49 -04:00
Matthew Wilcox
3d0186bb06 Update email address
Redirect some older email addresses that are in the git logs.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-09-29 22:47:48 -04:00
Brian Foster
561295a325 iomap: set page dirty after partial delalloc on mkwrite
The iomap page fault mechanism currently dirties the associated page
after the full block range of the page has been allocated. This
leaves the page susceptible to delayed allocations without ever
being set dirty on sub-page block sized filesystems.

For example, consider a page fault on a page with one preexisting
real (non-delalloc) block allocated in the middle of the page. The
first iomap_apply() iteration performs delayed allocation on the
range up to the preexisting block, the next iteration finds the
preexisting block, and the last iteration attempts to perform
delayed allocation on the range after the prexisting block to the
end of the page. If the first allocation succeeds and the final
allocation fails with -ENOSPC, iomap_apply() returns the error and
iomap_page_mkwrite() fails to dirty the page having already
performed partial delayed allocation. This eventually results in the
page being invalidated without ever converting the delayed
allocation to real blocks.

This problem is reliably reproduced by generic/083 on XFS on ppc64
systems (64k page size, 4k block size). It results in leaked
delalloc blocks on inode reclaim, which triggers an assert failure
in xfs_fs_destroy_inode() and filesystem accounting inconsistency.

Move the set_page_dirty() call from iomap_page_mkwrite() to the
actor callback, similar to how the buffer head implementation works.
The actor callback is called iff ->iomap_begin() returns success, so
ensures the page is dirtied as soon as possible after an allocation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:51:01 +10:00
Brian Foster
ec2ed0b5e9 xfs: remove invalid log recovery first/last cycle check
One of the first steps of log recovery is to check for the special
case of a zeroed log. If the first cycle in the log is zero or the
tail portion of the log is zeroed, the head is set to the first
instance of cycle 0. xlog_find_zeroed() includes a sanity check that
enforces that the first cycle in the log must be 1 if the last cycle
is 0. While this is true in most cases, the check is not totally
valid because it doesn't consider the case where the filesystem
crashed after a partial/out of order log buffer completion that
wraps around the end of the physical log.

For example, consider a filesystem that has completed most of the
first cycle of the log, reaches the end of the physical log and
splits the next single log buffer write into two in order to wrap
around the end of the log. If these I/Os are reordered, the second
(wrapped) I/O completes and the first happens to fail, the log is
left in a state where the last cycle of the log is 0 and the first
cycle is 2. This causes the xlog_find_zeroed() sanity check to fail
and prevents the filesystem from mounting. This situation has been
reproduced on particular systems via repeated runs of generic/475.

This is an expected state that log recovery already knows how to
deal with, however. Since the log is still partially zeroed, the
head is detected correctly and points to a valid tail. The
subsequent stale block detection clears blocks beyond the head up to
the tail (within a maximum range), with the express purpose of
clearing such out of order writes. As expected, this removes the out
of order cycle 2 blocks at the physical start of the log.

In other words, the only thing that prevents a clean mount and
recovery of the filesystem in this scenario is the specific (last ==
0 && first != 1) sanity check in xlog_find_zeroed(). Since the log
head/tail are now independently validated via cycle, log record and
CRC checks, this highly specific first cycle check is of dubious
value. Remove it and rely on the higher level validation to
determine whether log content is sane and recoverable.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:50:41 +10:00
Eric Sandeen
339e1a3fcd xfs: validate inode di_forkoff
Verify the inode di_forkoff, lifted from xfs_repair's
process_check_inode_forkoff().

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:50:13 +10:00
Christoph Hellwig
f5f3f959b7 xfs: skip delalloc COW blocks in xfs_reflink_end_cow
The iomap direct I/O code issues a single ->end_io call for the whole
I/O request, and if some of the extents cowered needed a COW operation
it will call xfs_reflink_end_cow over the whole range.

When we do AIO writes we drop the iolock after doing the initial setup,
but before the I/O completion.  Between dropping the lock and completing
the I/O we can have a racing buffered write create new delalloc COW fork
extents in the region covered by the outstanding direct I/O write, and
thus see delalloc COW fork extents in xfs_reflink_end_cow.  As
concurrent writes are fundamentally racy and no guarantees are given we
can simply skip those.

This can be easily reproduced with xfstests generic/208 in always_cow
mode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:49:58 +10:00
Eric Sandeen
f369a13cea xfs: don't treat unknown di_flags2 as corruption in scrub
xchk_inode_flags2() currently treats any di_flags2 values that the
running kernel doesn't recognize as corruption, and calls
xchk_ino_set_corrupt() if they are set.  However, it's entirely possible
that these flags were set in some newer kernel and are quite valid,
but ignored in this kernel.

(Validators don't care one bit about unknown di_flags2.)

Call xchk_ino_set_warning instead, because this may or may not actually
indicate a problem.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:49:00 +10:00
YueHaibing
2863c2ebc4 xfs: remove duplicated include from alloc.c
Remove duplicated include xfs_alloc.h

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:48:21 +10:00
Christoph Hellwig
0065b54119 xfs: don't bring in extents in xfs_bmap_punch_delalloc_range
This function is only used to punch out delayed allocations on I/O
failure, which means we need to have read the extents earlier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:47:46 +10:00
Dave Chinner
df30707791 xfs: fix transaction leak in xfs_reflink_allocate_cow()
When xfs_reflink_allocate_cow() allocates a transaction, it drops
the ILOCK to perform the operation. This Introduces a race condition
where another thread modifying the file can perform the COW
allocation operation underneath us. This result in the retry loop
finding an allocated block and jumping straight to the conversion
code. It does not, however, cancel the transaction it holds and so
this gets leaked. This results in a lockdep warning:

================================================
WARNING: lock held when returning to user space!
4.18.5 #1 Not tainted
------------------------------------------------
worker/6123 is leaving the kernel with locks still held!
1 lock held by worker/6123:
 #0: 000000009eab4f1b (sb_internal#2){.+.+}, at: xfs_trans_alloc+0x17c/0x220

And eventually the filesystem deadlocks because it runs out of log
space that is reserved by the leaked transaction and never gets
released.

The logic flow in xfs_reflink_allocate_cow() is a convoluted mess of
gotos - it's no surprise that it has bug where the flow through
several goto jumps then fails to clean up context from a non-obvious
logic path. CLean up the logic flow and make sure every path does
the right thing.

Reported-by: Alexander Y. Fomichev <git.user@gmail.com>
Tested-by: Alexander Y. Fomichev <git.user@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200981
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: slight refactor]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:47:15 +10:00
Dave Chinner
8683edb775 xfs: avoid lockdep false positives in xfs_trans_alloc
We've had a few reports of lockdep tripping over memory reclaim
context vs filesystem freeze "deadlocks". They all have looked
to be false positives on analysis, but it seems that they are
being tripped because we take freeze references before we run
a GFP_KERNEL allocation for the struct xfs_trans.

We can avoid this false positive vector just by re-ordering the
operations in xfs_trans_alloc(). That is. we need allocate the
structure before we take the freeze reference and enter the GFP_NOFS
allocation context that follows the xfs_trans around. This prevents
lockdep from seeing the GFP_KERNEL allocation inside the transaction
context, and that prevents it from triggering the freeze level vs
alloc context vs reclaim warnings.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-09-29 13:46:21 +10:00
Brian Foster
95808459b1 xfs: refactor xfs_buf_log_item reference count handling
The xfs_buf_log_item structure has a reference counter with slightly
tricky semantics. In the common case, a buffer is logged and
committed in a transaction, committed to the on-disk log (added to
the AIL) and then finally written back and removed from the AIL. The
bli refcount covers two potentially overlapping timeframes:

 1. the bli is held in an active transaction
 2. the bli is pinned by the log

The caveat to this approach is that the reference counter does not
purely dictate the lifetime of the bli. IOW, when a dirty buffer is
physically logged and unpinned, the bli refcount may go to zero as
the log item is inserted into the AIL. Only once the buffer is
written back can the bli finally be freed.

The above semantics means that it is not enough for the various
refcount decrementing contexts to release the bli on decrement to
zero. xfs_trans_brelse(), transaction commit (->iop_unlock()) and
unpin (->iop_unpin()) must all drop the associated reference and
make additional checks to determine if the current context is
responsible for freeing the item.

For example, if a transaction holds but does not dirty a particular
bli, the commit may drop the refcount to zero. If the bli itself is
clean, it is also not AIL resident and must be freed at this time.
The same is true for xfs_trans_brelse(). If the transaction dirties
a bli and then aborts or an unpin results in an abort due to a log
I/O error, the last reference count holder is expected to explicitly
remove the item from the AIL and release it (since an abort means
filesystem shutdown and metadata writeback will never occur).

This leads to fairly complex checks being replicated in a few
different places. Since ->iop_unlock() and xfs_trans_brelse() are
nearly identical, refactor the logic into a common helper that
implements and documents the semantics in one place. This patch does
not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:45:26 +10:00
Brian Foster
23420d05e6 xfs: clean up xfs_trans_brelse()
xfs_trans_brelse() is a bit of a historical mess, similar to
xfs_buf_item_unlock(). It is unnecessarily verbose, has snippets of
commented out code, inconsistency with regard to stale items, etc.

Clean up xfs_trans_brelse() to use similar logic and flow as
xfs_buf_item_unlock() with regard to bli reference count handling.
This patch makes no functional changes, but facilitates further
refactoring of the common bli reference count handling code.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:45:02 +10:00
Brian Foster
d9183105ca xfs: don't unlock invalidated buf on aborted tx commit
xfstests generic/388,475 occasionally reproduce assertion failures
in xfs_buf_item_unpin() when the final bli reference is dropped on
an invalidated buffer and the buffer is not locked as it is expected
to be. Invalidated buffers should remain locked on transaction
commit until the final unpin, at which point the buffer is removed
from the AIL and the bli is freed since stale buffers are not
written back.

The assert failures are associated with filesystem shutdown,
typically due to log I/O errors injected by the test. The
problematic situation can occur if the shutdown happens to cause a
race between an active transaction that has invalidated a particular
buffer and an I/O error on a log buffer that contains the bli
associated with the same (now stale) buffer.

Both transaction and log contexts acquire a bli reference. If the
transaction has already invalidated the buffer by the time the I/O
error occurs and ends up aborting due to shutdown, the transaction
and log hold the last two references to a stale bli. If the
transaction cancel occurs first, it treats the buffer as non-stale
due to the aborted state: the bli reference is dropped and the
buffer is released/unlocked. The log buffer I/O error handling
eventually calls into xfs_buf_item_unpin(), drops the final
reference to the bli and treats it as stale. The buffer wasn't left
locked by xfs_buf_item_unlock(), however, so the assert fails and
the buffer is double unlocked. The latter problem is mitigated by
the fact that the fs is shutdown and no further damage is possible.

->iop_unlock() of an invalidated buffer should behave consistently
with respect to the bli refcount, regardless of aborted state. If
the refcount remains elevated on commit, we know the bli is awaiting
an unpin (since it can't be in another transaction) and will be
handled appropriately on log buffer completion. If the final bli
reference of an invalidated buffer is dropped in ->iop_unlock(), we
can assume the transaction has aborted because invalidation implies
a dirty transaction. In the non-abort case, the log would have
acquired a bli reference in ->iop_pin() and prevented bli release at
->iop_unlock() time. In the abort case the item must be freed and
buffer unlocked because it wasn't pinned by the log.

Rework xfs_buf_item_unlock() to simplify the currently circuitous
and duplicate logic and leave invalidated buffers locked based on
bli refcount, regardless of aborted state. This ensures that a
pinned, stale buffer is always found locked when eventually
unpinned.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:44:40 +10:00
Brian Foster
d5a2e2893d xfs: remove last of unnecessary xfs_defer_cancel() callers
Now that deferred operations are completely managed via
transactions, it's no longer necessary to cancel the dfops in error
paths that already cancel the associated transaction. There are a
few such calls lingering throughout the codebase.

Remove all remaining unnecessary calls to xfs_defer_cancel(). This
leaves xfs_defer_cancel() calls in two places. The first is the call
in the transaction cancel path itself, which facilitates this patch.
The second is made via the xfs_defer_finish() error path to provide
consistent error semantics with transaction commit. For example,
xfs_trans_commit() expects an xfs_defer_finish() failure to clean up
the dfops structure before it returns.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:41:58 +10:00
Darrick J. Wong
ae29478766 xfs: don't crash the vfs on a garbage inline symlink
The VFS routine that calls ->get_link blindly copies whatever's returned
into the user's buffer.  If we return a NULL pointer, the vfs will
crash on the null pointer.  Therefore, return -EFSCORRUPTED instead of
blowing up the kernel.

[dgc: clean up with hch's suggestions]

Reported-by: wen.xu@gatech.edu
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:40:40 +10:00
Jaegeuk Kim
89d13c3850 f2fs: fix missing up_read
This patch fixes missing up_read call.

Fixes: c9b60788fc ("f2fs: fix to do sanity check with block address in main area")
Cc: <stable@vger.kernel.org> # 4.19+
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-28 19:11:38 -07:00
Jaegeuk Kim
61f7725aa1 f2fs: return correct errno in f2fs_gc
This fixes overriding error number in f2fs_gc.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-28 10:40:46 -07:00
Jaegeuk Kim
edc55aaf0d f2fs: avoid f2fs_bug_on if f2fs_get_meta_page_nofail got EIO
This patch avoids BUG_ON when f2fs_get_meta_page_nofail got EIO during
xfstests/generic/475.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-28 10:39:58 -07:00
Miklos Szeredi
18172b10b6 fuse: extract fuse_emit() helper
Prepare for cache filling by introducing a helper for emitting a single
directory entry.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:23 +02:00
Miklos Szeredi
d123d8e183 fuse: split out readdir.c
Directory reading code is about to grow larger, so split it out from dir.c
into a new source file.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:23 +02:00
Kirill Tkhai
be2ff42c5d fuse: Use hash table to link processing request
We noticed the performance bottleneck in FUSE running our Virtuozzo storage
over rdma. On some types of workload we observe 20% of times spent in
request_find() in profiler.  This function is iterating over long requests
list, and it scales bad.

The patch introduces hash table to reduce the number of iterations, we do
in this function. Hash generating algorithm is taken from hash_add()
function, while 256 lines table is used to store pending requests.  This
fixes problem and improves the performance.

Reported-by: Alexey Kuznetsov <kuznet@virtuozzo.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:23 +02:00
Kirill Tkhai
3a5358d1a1 fuse: kill req->intr_unique
This field is not needed after the previous patch, since we can easily
convert request ID to interrupt request ID and vice versa.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:23 +02:00
Kirill Tkhai
c59fd85e4f fuse: change interrupt requests allocation algorithm
Using of two unconnected IDs req->in.h.unique and req->intr_unique does not
allow to link requests to a hash table. We need can't use none of them as a
key to calculate hash.

This patch changes the algorithm of allocation of IDs for a request. Plain
requests obtain even ID, while interrupt requests are encoded in the low
bit. So, in next patches we will be able to use the rest of ID bits to
calculate hash, and the hash will be the same for plain and interrupt
requests.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:23 +02:00
Kirill Tkhai
63825b4e1d fuse: do not take fc->lock in fuse_request_send_background()
Currently, we take fc->lock there only to check for fc->connected.
But this flag is changed only on connection abort, which is very
rare operation.

So allow checking fc->connected under just fc->bg_lock and use this lock
(as well as fc->lock) when resetting fc->connected.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:23 +02:00
Kirill Tkhai
ae2dffa394 fuse: introduce fc->bg_lock
To reduce contention of fc->lock, this patch introduces bg_lock for
protection of fields related to background queue. These are:
max_background, congestion_threshold, num_background, active_background,
bg_queue and blocked.

This allows next patch to make async reads not requiring fc->lock, so async
reads and writes will have better performance executed in parallel.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:22 +02:00
Kirill Tkhai
2b30a53314 fuse: add locking to max_background and congestion_threshold changes
Functions sequences like request_end()->flush_bg_queue() require that
max_background and congestion_threshold are constant during their
execution. Otherwise, checks like

	if (fc->num_background == fc->max_background)

made in different time may behave not like expected.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:22 +02:00
Kirill Tkhai
2a23f2b8ad fuse: use READ_ONCE on congestion_threshold and max_background
Since they are of unsigned int type, it's allowed to read them
unlocked during reporting to userspace. Let's underline this fact
with READ_ONCE() macroses.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:22 +02:00
Kirill Tkhai
e287179afe fuse: use list_first_entry() in flush_bg_queue()
This cleanup patch makes the function to use the primitive
instead of direct dereferencing.

Also, move fiq dereferencing out of cycle, since it's
always constant.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:22 +02:00
Niels de Vos
88bc7d5097 fuse: add support for copy_file_range()
There are several FUSE filesystems that can implement server-side copy
or other efficient copy/duplication/clone methods. The copy_file_range()
syscall is the standard interface that users have access to while not
depending on external libraries that bypass FUSE.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:22 +02:00
Miklos Szeredi
908a572b80 fuse: fix blocked_waitq wakeup
Using waitqueue_active() is racy.  Make sure we issue a wake_up()
unconditionally after storing into fc->blocked.  After that it's okay to
optimize with waitqueue_active() since the first wake up provides the
necessary barrier for all waiters, not the just the woken one.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 3c18ef8117 ("fuse: optimize wake_up")
Cc: <stable@vger.kernel.org> # v3.10
2018-09-28 16:43:22 +02:00
Miklos Szeredi
4c316f2f3f fuse: set FR_SENT while locked
Otherwise fuse_dev_do_write() could come in and finish off the request, and
the set_bit(FR_SENT, ...) could trigger the WARN_ON(test_bit(FR_SENT, ...))
in request_end().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reported-by: syzbot+ef054c4d3f64cd7f7cec@syzkaller.appspotmai
Fixes: 46c34a348b ("fuse: no fc->lock for pqueue parts")
Cc: <stable@vger.kernel.org> # v4.2
2018-09-28 16:43:22 +02:00
Kirill Tkhai
d2d2d4fb1f fuse: Fix use-after-free in fuse_dev_do_write()
After we found req in request_find() and released the lock,
everything may happen with the req in parallel:

cpu0                              cpu1
fuse_dev_do_write()               fuse_dev_do_write()
  req = request_find(fpq, ...)    ...
  spin_unlock(&fpq->lock)         ...
  ...                             req = request_find(fpq, oh.unique)
  ...                             spin_unlock(&fpq->lock)
  queue_interrupt(&fc->iq, req);   ...
  ...                              ...
  ...                              ...
  request_end(fc, req);
    fuse_put_request(fc, req);
  ...                              queue_interrupt(&fc->iq, req);


Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 46c34a348b ("fuse: no fc->lock for pqueue parts")
Cc: <stable@vger.kernel.org> # v4.2
2018-09-28 16:43:21 +02:00
Kirill Tkhai
bc78abbd55 fuse: Fix use-after-free in fuse_dev_do_read()
We may pick freed req in this way:

[cpu0]                                  [cpu1]
fuse_dev_do_read()                      fuse_dev_do_write()
   list_move_tail(&req->list, ...);     ...
   spin_unlock(&fpq->lock);             ...
   ...                                  request_end(fc, req);
   ...                                    fuse_put_request(fc, req);
   if (test_bit(FR_INTERRUPTED, ...))
         queue_interrupt(fiq, req);

Fix that by keeping req alive until we finish all manipulations.

Reported-by: syzbot+4e975615ca01f2277bdd@syzkaller.appspotmail.com
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 46c34a348b ("fuse: no fc->lock for pqueue parts")
Cc: <stable@vger.kernel.org> # v4.2
2018-09-28 16:43:21 +02:00
Greg Kroah-Hartman
c127e59bee \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlusxsAACgkQnJ2qBz9k
 QNnfhAgAlvqmbnw9LFG+p0DbByFshuf7r/ygruOGKgWZ3/z2ES6Zg4eUpvgQGK04
 3+c2TWgnAgXRPOLwIQkxndXTCzR8hxueIfpnVTzNildQ2R3PNpVM8Q+DPmVrgfYo
 OOFB/bXig1EKtLmlXdNSHOSlqRc7G+xmG1zx04to1ecYjr8ISp9BvnedfNe4iNkD
 PHoaZsb1Lic8/rLpabXFrYe9wI4udesE3H4uQWckfAW6wL4w7+x9FFEFgoSHmkSL
 AxCDGZnvZAMUqECyYQ806UrihXmXxFbGVKY9LUWOzcn9TyvpiyLLlCsGn9Dpc3vA
 67o2CVIDR+PyqapHY3HAF+jJuuV15Q==
 =vI4P
 -----END PGP SIGNATURE-----

Merge tag 'for_v4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Jan writes:
  "an ext2 patch fixing fsync(2) for DAX mounts."

* tag 'for_v4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext2, dax: set ext2_dax_aops for dax files
2018-09-27 21:16:24 +02:00
Jan Kara
f52afc93cd dax: Fix deadlock in dax_lock_mapping_entry()
When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
fail to unlock mapping->i_pages spinlock and thus immediately deadlock
against itself when retrying to grab the entry lock again. Fix the
problem by unlocking mapping->i_pages before retrying.

Fixes: c2a7d2a115 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
Reported-by: Barret Rhoden <brho@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-09-27 10:56:15 -07:00
Amir Goldstein
96a71f21ef fanotify: store fanotify_init() flags in group's fanotify_data
This averts the need to re-generate flags in fanotify_show_fdinfo()
and sets the scene for addition of more upcoming flags without growing
new members to the fanotify_data struct.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-27 15:29:00 +02:00
Chao Yu
4a1728cad6 f2fs: mark inode dirty explicitly in recover_inode()
Mark inode dirty explicitly in the end of recover_inode() to make sure
that all recoverable fields can be persisted later.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:45:33 -07:00
Chao Yu
5cd1f387a1 f2fs: fix to recover inode's crtime during POR
Testcase to reproduce this bug:
1. mkfs.f2fs -O extra_attr -O inode_crtime /dev/sdd
2. mount -t f2fs /dev/sdd /mnt/f2fs
3. touch /mnt/f2fs/file
4. xfs_io -f /mnt/f2fs/file -c "fsync"
5. godown /mnt/f2fs
6. umount /mnt/f2fs
7. mount -t f2fs /dev/sdd /mnt/f2fs
8. xfs_io -f /mnt/f2fs/file -c "statx -r"

stat.btime.tv_sec = 0
stat.btime.tv_nsec = 0

This patch fixes to recover inode creation time fields during
mount.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:45:33 -07:00
Chao Yu
7de36cf3e4 f2fs: fix to recover inode's i_gc_failures during POR
inode.i_gc_failures is used to indicate that skip count of migrating
on blocks of inode, we should guarantee it can be recovered in sudden
power-off case.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:45:33 -07:00
Chao Yu
19c73a691c f2fs: fix to recover inode's i_flags during POR
Testcase to reproduce this bug:
1. mkfs.f2fs /dev/sdd
2. mount -t f2fs /dev/sdd /mnt/f2fs
3. touch /mnt/f2fs/file
4. sync
5. chattr +A /mnt/f2fs/file
6. xfs_io -f /mnt/f2fs/file -c "fsync"
7. godown /mnt/f2fs
8. umount /mnt/f2fs
9. mount -t f2fs /dev/sdd /mnt/f2fs
10. lsattr /mnt/f2fs/file

-----------------N- /mnt/f2fs/file

But actually, we expect the corrct result is:

-------A---------N- /mnt/f2fs/file

The reason is we didn't recover inode.i_flags field during mount,
fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:45:33 -07:00
Chao Yu
f4474aa6e5 f2fs: fix to recover inode's project id during POR
Testcase to reproduce this bug:
1. mkfs.f2fs -O extra_attr -O project_quota /dev/sdd
2. mount -t f2fs /dev/sdd /mnt/f2fs
3. touch /mnt/f2fs/file
4. sync
5. chattr -p 1 /mnt/f2fs/file
6. xfs_io -f /mnt/f2fs/file -c "fsync"
7. godown /mnt/f2fs
8. umount /mnt/f2fs
9. mount -t f2fs /dev/sdd /mnt/f2fs
10. lsattr -p /mnt/f2fs/file

    0 -----------------N- /mnt/f2fs/file

But actually, we expect the correct result is:

    1 -----------------N- /mnt/f2fs/file

The reason is we didn't recover inode.i_projid field during mount,
fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:45:33 -07:00
Jaegeuk Kim
0a4daae5ff f2fs: update i_size after DIO completion
This is related to
ee70daaba8 ("xfs: update i_size after unwritten conversion in dio completion")

If we update i_size during dio_write, dio_read can read out stale data, which
breaks xfstests/465.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:45:34 -07:00
Jaegeuk Kim
d83d0f5ba8 f2fs: report ENOENT correctly in f2fs_rename
This fixes wrong error report in f2fs_rename.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:38:11 -07:00
Chengguang Xu
c6b1867b1d f2fs: fix remount problem of option io_bits
Currently we show mount option "io_bits=%u" as "io_size=%uKB",
it will cause option parsing problem(unrecognized mount option)
in remount.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-25 18:44:56 -07:00
YueHaibing
7d20b6a272 nfsd: remove set but not used variable 'dirp'
Fixes gcc '-Wunused-but-set-variable' warning:

fs/nfsd/vfs.c: In function 'nfsd_create':
fs/nfsd/vfs.c:1279:16: warning:
 variable 'dirp' set but not used [-Wunused-but-set-variable]

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-09-25 20:35:13 -04:00
Olga Kornievskaia
e0639dc580 NFSD introduce async copy feature
Upon receiving a request for async copy, create a new kthread.  If we
get asynchronous request, make sure to copy the needed arguments/state
from the stack before starting the copy. Then start the thread and reply
back to the client indicating copy is asynchronous.

nfsd_copy_file_range() will copy in a loop over the total number of
bytes is needed to copy. In case a failure happens in the middle, we
ignore the error and return how much we copied so far. Once done
creating a workitem for the callback workqueue and send CB_OFFLOAD with
the results.

The lifetime of the copy stateid is bound to the vfs copy. This way we
don't need to keep the nfsd_net structure for the callback.  We could
keep it around longer so that an OFFLOAD_STATUS that came late would
still get results, but clients should be able to deal without that.

We handle OFFLOAD_CANCEL by sending a signal to the copy thread and
calling kthread_stop.

A client should cancel any ongoing copies before calling DESTROY_CLIENT;
if not, we return a CLIENT_BUSY error.

If the client is destroyed for some other reason (lease expiration, or
server shutdown), we must clean up any ongoing copies ourselves.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
[colin.king@canonical.com: fix leak in error case]
[bfields@fieldses.org: remove signalling, merge patches]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-09-25 20:34:54 -04:00
Olga Kornievskaia
885e2bf3ea NFSD OFFLOAD_CANCEL xdr
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-09-25 20:34:54 -04:00
Olga Kornievskaia
6308bc98e8 NFSD OFFLOAD_STATUS xdr
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-09-25 20:34:54 -04:00
Olga Kornievskaia
9eb190fca8 NFSD CB_OFFLOAD xdr
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-09-25 20:34:54 -04:00
Greg Kroah-Hartman
a38523185b libnvdimm/dax for 4.19-rc6
* (2) fixes for the dax error handling updates that were merged for
 v4.19-rc1. My mails to Al have been bouncing recently, so I do not have
 his ack but the uaccess change is of the trivial / obviously correct
 variety. The address_space_operations fixes a regression.
 
 * A filesystem-dax fix to correct the zero page lookup to be compatible
  with non-x86 (mips and s390) architectures.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbqoecAAoJEB7SkWpmfYgCaPYP/1Pf2ADt0pOskSk0ixM06UI9
 1lR2g2/ICMc505+oB+wUv9TkZOh9jcIS9o8GfLhgNvP7AU4woRvudeyr4NUc0mdw
 rtHRA6TIimbXa3+O2qMg4gqUjXRxj6urQp5oeQi8mQ/vefZv1aisEw/Klae8klVC
 HGoMFii19WGXXyPM2vNUb2L+JGZt1p/nl/Z8ydPavn1XkIIGb7c+MivDiaemjjgT
 487TmFULgLVhTCtQXlhkH7UCcCQ3+l3yxKaS1/g2hFpWE4LncBIvq8XBPwf5RQSL
 H/5rH/sd30XR2L0NMERxr0ENvCJf2iIf4bIqckODN4L9ojRE8zmBZMsSeRKmHufm
 3ZfTBLHjPUwwKWy7PlKSsFk2J8KjErsqRlfZQSMSJShpEgL1jCjYtuTEtupaegbU
 v8TwzsNDgJ1B6JuKJ7lh7hOF7vUQ/i65xG8SFACvNoiih8RGSW3llra442k2hmFu
 IEMXa9S4tvqHfXUb0J/6hLLi+xoV+KsYPWRiCuovy7t6EfAWUnNuGCldjfsQtZZv
 npHS7F7lkWlSCneDbE4cMdkkwjBKjAw0sjIWDrPVVCoITVe1j9+bEwE9reX1VOS+
 L+PB/WgcVH72MeiQnmPPTcUyEmNgCbku7NEhwrMHJngxuo9HmXE+BN8jdFDYZFfL
 uWDV25XOxviOC9xBosw4
 =mAaH
 -----END PGP SIGNATURE-----

erge tag 'libnvdimm-fixes-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Dan writes:
  "libnvdimm/dax for 4.19-rc6

  * (2) fixes for the dax error handling updates that were merged for
  v4.19-rc1. My mails to Al have been bouncing recently, so I do not have
  his ack but the uaccess change is of the trivial / obviously correct
  variety. The address_space_operations fixes a regression.

  * A filesystem-dax fix to correct the zero page lookup to be compatible
   with non-x86 (mips and s390) architectures."

* tag 'libnvdimm-fixes-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  device-dax: Add missing address_space_operations
  uaccess: Fix is_source param for check_copy_size() in copy_to_iter_mcsafe()
  filesystem-dax: Fix use of zero page
2018-09-25 21:37:41 +02:00
Wei Yongjun
69383c5913 ovl: make symbol 'ovl_aops' static
Fixes the following sparse warning:

fs/overlayfs/inode.c:507:39: warning:
 symbol 'ovl_aops' was not declared. Should it be static?

Fixes: 5b910bd615 ("ovl: fix GPF in swapfile_activate of file from overlayfs over xfs")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-25 20:41:23 +02:00
Chengguang Xu
2aad26fa0a ext2: remove redundant building macro check
If macro CONFIG_QUOTA is not enabled then mount option flag
of usrquota/grpquota will not be set, so we can remove some
building macro check safely in ext2_shwo_options().
Additionally, I think it's better to define EXT2_MOUNT_DAX
regardless macro CONFIG_FS_DAX is enabled just like acl/xattr.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-24 21:34:15 +02:00
Amir Goldstein
a725356b66 vfs: swap names of {do,vfs}_clone_file_range()
Commit 031a072a0b ("vfs: call vfs_clone_file_range() under freeze
protection") created a wrapper do_clone_file_range() around
vfs_clone_file_range() moving the freeze protection to former, so
overlayfs could call the latter.

The more common vfs practice is to call do_xxx helpers from vfs_xxx
helpers, where freeze protecction is taken in the vfs_xxx helper, so
this anomality could be a source of confusion.

It seems that commit 8ede205541 ("ovl: add reflink/copyfile/dedup
support") may have fallen a victim to this confusion -
ovl_clone_file_range() calls the vfs_clone_file_range() helper in the
hope of getting freeze protection on upper fs, but in fact results in
overlayfs allowing to bypass upper fs freeze protection.

Swap the names of the two helpers to conform to common vfs practice
and call the correct helpers from overlayfs and nfsd.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-24 10:54:01 +02:00
Amir Goldstein
d9d150ae50 ovl: fix freeze protection bypass in ovl_clone_file_range()
Tested by doing clone on overlayfs while upper xfs+reflink is frozen:

  xfs_io -f /ovl/y
                             fsfreeze -f /xfs
  xfs_io> reflink /ovl/x

Before the fix xfs_io enters xfs_reflink_remap_range() and blocks
in xfs_trans_alloc(). After the fix, xfs_io blocks outside xfs code
in ovl_clone_file_range().

Fixes: 8ede205541 ("ovl: add reflink/copyfile/dedup support")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-24 10:54:01 +02:00
Amir Goldstein
898cc19d8a ovl: fix freeze protection bypass in ovl_write_iter()
Tested by re-writing to an open overlayfs file while upper ext4 is frozen:

  xfs_io -f /ovl/x
  xfs_io> pwrite 0 4096
                             fsfreeze -f /ext4
  xfs_io> pwrite 0 4096

  WARNING: CPU: 0 PID: 1492 at fs/ext4/ext4_jbd2.c:53 \
           ext4_journal_check_start+0x48/0x82

After the fix, the second write blocks in ovl_write_iter() and avoids
hitting WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE) in
ext4_journal_check_start().

Fixes: 2a92e07edc ("ovl: add ovl_write_iter()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-24 10:54:01 +02:00
Amir Goldstein
63e1325280 ovl: fix memory leak on unlink of indexed file
The memory leak was detected by kmemleak when running xfstests
overlay/051,053

Fixes: caf70cb2ba ("ovl: cleanup orphan index entries")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-24 10:54:01 +02:00
Dennis Zhou (Facebook)
bdc2491708 blkcg: associate writeback bios with a blkg
One of the goals of this series is to remove a separate reference to
the css of the bio. This can and should be accessed via bio_blkcg. In
this patch, the wbc_init_bio call is changed such that it must be called
after a queue has been associated with the bio.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:11 -06:00
Greg Kroah-Hartman
0eba8697bc This pull request contains fixes for UBIFS:
- A wrong UBIFS assertion in mount code
 - Fix for a NULL pointer deref in mount code
 - Revert of a bad fix for xattrs
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAlukuf8WHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wRnyD/40jc4PA1pP3dJcXqdqwSeN+vJZ
 YWu307rE+giWWg6vRKa8zu3z614579UC7EE/I9miSrPNPYCi9VTLNzJGNMvn8aSf
 8seMrbhW0kz29T4YZ5fESr6x3hMsSdmIWZkPobruul8oBGmNcUm3Ag3MBM4XbQTm
 5AcWUdXVQUXPROK/Xm04hNa5pJ6wkSu5CxA+mf9YM2ZYYPPdCblxxq3NsDX99vDw
 gZup7JndSzpk+IpZMQIEKUP83vh5YgCU06YPjXi5YIfXl+sQnpH0fyrWm3+1C0Yz
 4T80rd3fm+7+obXdAI29mhcoZ09fGayozfqjU/fV3edW43+pUvRWced13vuLqz9Q
 JannCFeN+v/nB6OPlj4JGkgNunZ5M3r6ZZIQBbY9RJuZgG0/FMbpYb62RY1z/jg9
 YA7e3J/R02i7tHnPbcIz6ngKE0c42VKxTdATaybXVunx5ZPip7txj9OeJbfIYUrr
 CbLMdiqIJUAAheHMUrTiF3jDNwEiMhBZA4dAGDvLmpLuyfMlN8Lwje9BdRd5VNKp
 zwEGUQkDWLxniFYLtmbceFSa4IQT6z70XJn1pqdDvgjxfp8trEqhtp9GMxA2u6TA
 adcug5gUzRWMQ2QuyoKIv6tkjmnII5N2KzhZH9hwBReRJUlKY9WQ9jowcGrPXhbw
 F2prIwBxp8drneOeKg==
 =CGLq
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.19-rc4' of git://git.infradead.org/linux-ubifs

Richard writes:
  "This pull request contains fixes for UBIFS:
   - A wrong UBIFS assertion in mount code
   - Fix for a NULL pointer deref in mount code
   - Revert of a bad fix for xattrs"

* tag 'upstream-4.19-rc4' of git://git.infradead.org/linux-ubifs:
  Revert "ubifs: xattr: Don't operate on deleted inodes"
  ubifs: drop false positive assertion
  ubifs: Check for name being NULL while mounting
2018-09-21 15:29:44 +02:00
Chao Yu
dc4cd1257c f2fs: fix to recover inode's uid/gid during POR
Step to reproduce this bug:
1. logon as root
2. mount -t f2fs /dev/sdd /mnt;
3. touch /mnt/file;
4. chown system /mnt/file; chgrp system /mnt/file;
5. xfs_io -f /mnt/file -c "fsync";
6. godown /mnt;
7. umount /mnt;
8. mount -t f2fs /dev/sdd /mnt;

After step 8) we will expect file's uid/gid are all system, but during
recovery, these two fields were not been recovered, fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-20 14:50:25 -07:00
Jaegeuk Kim
f84262b086 f2fs: avoid infinite loop in f2fs_alloc_nid
If we have an error in f2fs_build_free_nids, we're able to fall into a loop
to find free nids.

Suggested-by: Chao Yu <chao@kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-20 14:49:19 -07:00
Junxiao Bi
234b69e3e0 ocfs2: fix ocfs2 read block panic
While reading block, it is possible that io error return due to underlying
storage issue, in this case, BH_NeedsValidate was left in the buffer head.
Then when reading the very block next time, if it was already linked into
journal, that will trigger the following panic.

[203748.702517] kernel BUG at fs/ocfs2/buffer_head_io.c:342!
[203748.702533] invalid opcode: 0000 [#1] SMP
[203748.702561] Modules linked in: ocfs2 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sunrpc dm_switch dm_queue_length dm_multipath bonding be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i libcxgbi iw_cxgb3 cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf iTCO_wdt iTCO_vendor_support dcdbas ipmi_ssif i2c_core ipmi_si ipmi_msghandler acpi_pad pcspkr sb_edac edac_core lpc_ich mfd_core shpchp sg tg3 ptp pps_core ext4 jbd2 mbcache2 sr_mod cdrom sd_mod ahci libahci megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod
[203748.703024] CPU: 7 PID: 38369 Comm: touch Not tainted 4.1.12-124.18.6.el6uek.x86_64 #2
[203748.703045] Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.2 01/28/2015
[203748.703067] task: ffff880768139c00 ti: ffff88006ff48000 task.ti: ffff88006ff48000
[203748.703088] RIP: 0010:[<ffffffffa05e9f09>]  [<ffffffffa05e9f09>] ocfs2_read_blocks+0x669/0x7f0 [ocfs2]
[203748.703130] RSP: 0018:ffff88006ff4b818  EFLAGS: 00010206
[203748.703389] RAX: 0000000008620029 RBX: ffff88006ff4b910 RCX: 0000000000000000
[203748.703885] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000023079fe
[203748.704382] RBP: ffff88006ff4b8d8 R08: 0000000000000000 R09: ffff8807578c25b0
[203748.704877] R10: 000000000f637376 R11: 000000003030322e R12: 0000000000000000
[203748.705373] R13: ffff88006ff4b910 R14: ffff880732fe38f0 R15: 0000000000000000
[203748.705871] FS:  00007f401992c700(0000) GS:ffff880bfebc0000(0000) knlGS:0000000000000000
[203748.706370] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[203748.706627] CR2: 00007f4019252440 CR3: 00000000a621e000 CR4: 0000000000060670
[203748.707124] Stack:
[203748.707371]  ffff88006ff4b828 ffffffffa0609f52 ffff88006ff4b838 0000000000000001
[203748.707885]  0000000000000000 0000000000000000 ffff880bf67c3800 ffffffffa05eca00
[203748.708399]  00000000023079ff ffffffff81c58b80 0000000000000000 0000000000000000
[203748.708915] Call Trace:
[203748.709175]  [<ffffffffa0609f52>] ? ocfs2_inode_cache_io_unlock+0x12/0x20 [ocfs2]
[203748.709680]  [<ffffffffa05eca00>] ? ocfs2_empty_dir_filldir+0x80/0x80 [ocfs2]
[203748.710185]  [<ffffffffa05ec0cb>] ocfs2_read_dir_block_direct+0x3b/0x200 [ocfs2]
[203748.710691]  [<ffffffffa05f0fbf>] ocfs2_prepare_dx_dir_for_insert.isra.57+0x19f/0xf60 [ocfs2]
[203748.711204]  [<ffffffffa065660f>] ? ocfs2_metadata_cache_io_unlock+0x1f/0x30 [ocfs2]
[203748.711716]  [<ffffffffa05f4f3a>] ocfs2_prepare_dir_for_insert+0x13a/0x890 [ocfs2]
[203748.712227]  [<ffffffffa05f442e>] ? ocfs2_check_dir_for_entry+0x8e/0x140 [ocfs2]
[203748.712737]  [<ffffffffa061b2f2>] ocfs2_mknod+0x4b2/0x1370 [ocfs2]
[203748.713003]  [<ffffffffa061c385>] ocfs2_create+0x65/0x170 [ocfs2]
[203748.713263]  [<ffffffff8121714b>] vfs_create+0xdb/0x150
[203748.713518]  [<ffffffff8121b225>] do_last+0x815/0x1210
[203748.713772]  [<ffffffff812192e9>] ? path_init+0xb9/0x450
[203748.714123]  [<ffffffff8121bca0>] path_openat+0x80/0x600
[203748.714378]  [<ffffffff811bcd45>] ? handle_pte_fault+0xd15/0x1620
[203748.714634]  [<ffffffff8121d7ba>] do_filp_open+0x3a/0xb0
[203748.714888]  [<ffffffff8122a767>] ? __alloc_fd+0xa7/0x130
[203748.715143]  [<ffffffff81209ffc>] do_sys_open+0x12c/0x220
[203748.715403]  [<ffffffff81026ddb>] ? syscall_trace_enter_phase1+0x11b/0x180
[203748.715668]  [<ffffffff816f0c9f>] ? system_call_after_swapgs+0xe9/0x190
[203748.715928]  [<ffffffff8120a10e>] SyS_open+0x1e/0x20
[203748.716184]  [<ffffffff816f0d5e>] system_call_fastpath+0x18/0xd7
[203748.716440] Code: 00 00 48 8b 7b 08 48 83 c3 10 45 89 f8 44 89 e1 44 89 f2 4c 89 ee e8 07 06 11 e1 48 8b 03 48 85 c0 75 df 8b 5d c8 e9 4d fa ff ff <0f> 0b 48 8b 7d a0 e8 dc c6 06 00 48 b8 00 00 00 00 00 00 00 10
[203748.717505] RIP  [<ffffffffa05e9f09>] ocfs2_read_blocks+0x669/0x7f0 [ocfs2]
[203748.717775]  RSP <ffff88006ff4b818>

Joesph ever reported a similar panic.
Link: https://oss.oracle.com/pipermail/ocfs2-devel/2013-May/008931.html

Link: http://lkml.kernel.org/r/20180912063207.29484-1-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <ge.changwei@h3c.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-20 22:01:12 +02:00
Dominique Martinet
a1b3d2f217 fs/proc/kcore.c: fix invalid memory access in multi-page read optimization
The 'm' kcore_list item could point to kclist_head, and it is incorrect to
look at m->addr / m->size in this case.

There is no choice but to run through the list of entries for every
address if we did not find any entry in the previous iteration

Reset 'm' to NULL in that case at Omar Sandoval's suggestion.

[akpm@linux-foundation.org: add comment]
Link: http://lkml.kernel.org/r/1536100702-28706-1-git-send-email-asmadeus@codewreck.org
Fixes: bf991c2231 ("proc/kcore: optimize multiple page reads")
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Omar Sandoval <osandov@osandov.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-20 22:01:11 +02:00
Richard Weinberger
f061c1cc40 Revert "ubifs: xattr: Don't operate on deleted inodes"
This reverts commit 11a6fc3dc7.
UBIFS wants to assert that xattr operations are only issued on files
with positive link count. The said patch made this operations return
-ENOENT for unlinked files such that the asserts will no longer trigger.
This was wrong since xattr operations are perfectly fine on unlinked
files.
Instead the assertions need to be fixed/removed.

Cc: <stable@vger.kernel.org>
Fixes: 11a6fc3dc7 ("ubifs: xattr: Don't operate on deleted inodes")
Reported-by: Koen Vandeputte <koen.vandeputte@ncentric.com>
Tested-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-09-20 21:37:41 +02:00
Sascha Hauer
d3bdc016c5 ubifs: drop false positive assertion
The following sequence triggers

	ubifs_assert(c, c->lst.taken_empty_lebs > 0);

at the end of ubifs_remount_fs():

mount -t ubifs /dev/ubi0_0 /mnt
echo 1 > /sys/kernel/debug/ubifs/ubi0_0/ro_error
umount /mnt
mount -t ubifs -o ro /dev/ubix_y /mnt
mount -o remount,ro /mnt

The resulting

UBIFS assert failed in ubifs_remount_fs at 1878 (pid 161)

is a false positive. In the case above c->lst.taken_empty_lebs has
never been changed from its initial zero value. This will only happen
when the deferred recovery is done.

Fix this by doing the assertion only when recovery has been done
already.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-09-20 21:37:07 +02:00
Richard Weinberger
37f31b6ca4 ubifs: Check for name being NULL while mounting
The requested device name can be NULL or an empty string.
Check for that and refuse to continue. UBIFS has to do this manually
since we cannot use mount_bdev(), which checks for this condition.

Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Reported-by: syzbot+38bd0f7865e5c6379280@syzkaller.appspotmail.com
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-09-20 21:37:07 +02:00
Chao Yu
1390643d1d jfs: remove redundant dquot_initialize() in jfs_evict_inode()
We don't need to call dquot_initialize() twice in jfs_evict_inode(),
remove one of them for cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2018-09-20 09:28:49 -05:00
Sahitya Tummala
a7d10cf3e4 f2fs: add new idle interval timing for discard and gc paths
This helps to control the frequency of submission of discard and
GC requests independently, based on the need.

Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-19 15:55:14 -07:00
Toshi Kani
9e796c9db9 ext2, dax: set ext2_dax_aops for dax files
Sync syscall to DAX file needs to flush processor cache, but it
currently does not flush to existing DAX files.  This is because
'ext2_da_aops' is set to address_space_operations of existing DAX
files, instead of 'ext2_dax_aops', since S_DAX flag is set after
ext2_set_aops() in the open path.

Similar to ext4, change ext2_iget() to initialize i_flags before
ext2_set_aops().

Fixes: fb094c9074 ("ext2, dax: introduce ext2_dax_aops")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-19 15:03:04 +02:00
Andreas Gruenbacher
ffc4c92227 sysfs: Do not return POSIX ACL xattrs via listxattr
Commit 786534b92f introduced a regression that caused listxattr to
return the POSIX ACL attribute names even though sysfs doesn't support
POSIX ACLs.  This happens because simple_xattr_list checks for NULL
i_acl / i_default_acl, but inode_init_always initializes those fields
to ACL_NOT_CACHED ((void *)-1).  For example:
    $ getfattr -m- -d /sys
    /sys: system.posix_acl_access: Operation not supported
    /sys: system.posix_acl_default: Operation not supported
Fix this in simple_xattr_list by checking if the filesystem supports POSIX ACLs.

Fixes: 786534b92f ("tmpfs: listxattr should include POSIX ACL xattrs")
Reported-by:  Marc Aurèle La France <tsi@tuyoix.net>
Tested-by: Marc Aurèle La France <tsi@tuyoix.net>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-18 07:30:48 -04:00
Greg Kroah-Hartman
ad3273d5f1 Various ext4 bug fixes; primarily making ext4 more robust against
maliciously crafted file systems, and some DAX fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlufGncACgkQ8vlZVpUN
 gaPwuQf9FKp9yRvjBkjtnH3+s4Ps8do9r067+90y1k2DJMxKoaBUhGSW2MJJ04j+
 5F6Ndp/TZHw+LfPnzsqlrAAoP3CG5+kacfJ7xeVKR0umvACm6rLMsCUct7/rFoSl
 PgzCALFIJvQ9+9shuO9qrgmjJrfrlTVUgR9Mu3WUNEvMFbMjk3FMI8gi5kjjWemE
 G9TDYH2lMH2sL0cWF51I2gOyNXOXrihxe+vP7j6i/rUkV+YLpKZhE1ss3Sfn6pR2
 p/KjnXdupLJpgYLJne9kMrq2r8xYmDfA0S+Dec7nkox5FUOFUHssl3+q8C7cDwO9
 zl6VyVFwybjFRJ/Y59wox6eqVPlIWw==
 =1P1w
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Ted writes:
	Various ext4 bug fixes; primarily making ext4 more robust against
	maliciously crafted file systems, and some DAX fixes.

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4, dax: set ext4_dax_aops for dax files
  ext4, dax: add ext4_bmap to ext4_dax_aops
  ext4: don't mark mmp buffer head dirty
  ext4: show test_dummy_encryption mount option in /proc/mounts
  ext4: close race between direct IO and ext4_break_layouts()
  ext4: fix online resizing for bigalloc file systems with a 1k block size
  ext4: fix online resize's handling of a too-small final block group
  ext4: recalucate superblock checksum after updating free blocks/inodes
  ext4: avoid arithemetic overflow that can trigger a BUG
  ext4: avoid divide by zero fault when deleting corrupted inline directories
  ext4: check to make sure the rename(2)'s destination is not freed
  ext4: add nonstring annotations to ext4.h
2018-09-17 09:13:47 +02:00
Bernd Edlinger
a75e78f21f kernfs: Fix range checks in kernfs_get_target_path
The terminating NUL byte is only there because the buffer is
allocated with kzalloc(PAGE_SIZE, GFP_KERNEL), but since the
range-check is off-by-one, and PAGE_SIZE==PATH_MAX, the
returned string may not be zero-terminated if it is exactly
PATH_MAX characters long.  Furthermore also the initial loop
may theoretically exceed PATH_MAX and cause a fault.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-16 22:38:57 +02:00
Toshi Kani
cce6c9f7e6 ext4, dax: set ext4_dax_aops for dax files
Sync syscall to DAX file needs to flush processor cache, but it
currently does not flush to existing DAX files.  This is because
'ext4_da_aops' is set to address_space_operations of existing DAX
files, instead of 'ext4_dax_aops', since S_DAX flag is set after
ext4_set_aops() in the open path.

  New file
  --------
  lookup_open
    ext4_create
      __ext4_new_inode
        ext4_set_inode_flags   // Set S_DAX flag
      ext4_set_aops            // Set aops to ext4_dax_aops

  Existing file
  -------------
  lookup_open
    ext4_lookup
      ext4_iget
        ext4_set_aops          // Set aops to ext4_da_aops
        ext4_set_inode_flags   // Set S_DAX flag

Change ext4_iget() to initialize i_flags before ext4_set_aops().

Fixes: 5f0663bb4a ("ext4, dax: introduce ext4_dax_aops")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2018-09-15 21:37:59 -04:00
Toshi Kani
94dbb63117 ext4, dax: add ext4_bmap to ext4_dax_aops
Ext4 mount path calls .bmap to the journal inode. This currently
works for the DAX mount case because ext4_iget() always set
'ext4_da_aops' to any regular files.

In preparation to fix ext4_iget() to set 'ext4_dax_aops' for ext4
DAX files, add ext4_bmap() to 'ext4_dax_aops', since bmap works for
DAX inodes.

Fixes: 5f0663bb4a ("ext4, dax: introduce ext4_dax_aops")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2018-09-15 21:23:41 -04:00
Li Dongyang
fe18d64989 ext4: don't mark mmp buffer head dirty
Marking mmp bh dirty before writing it will make writeback
pick up mmp block later and submit a write, we don't want the
duplicate write as kmmpd thread should have full control of
reading and writing the mmp block.
Another reason is we will also have random I/O error on
the writeback request when blk integrity is enabled, because
kmmpd could modify the content of the mmp block(e.g. setting
new seq and time) while the mmp block is under I/O requested
by writeback.

Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2018-09-15 17:11:25 -04:00
Eric Biggers
338affb548 ext4: show test_dummy_encryption mount option in /proc/mounts
When in effect, add "test_dummy_encryption" to _ext4_show_options() so
that it is shown in /proc/mounts and other relevant procfs files.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-09-15 14:28:26 -04:00
Linus Torvalds
3a5af36b6d fixes for four CIFS/SMB3 potential pointer overflow issues, one minor build fix, and a build warning cleanup
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlucKDwACgkQiiy9cAdy
 T1EhEgwAqgVXTujce2UVPtaFY/MaGmaIwAimh+aYOCAADxLYJHkjtRzHd5PQgf+L
 n55R2hFcMeelWxOMEb/aRmxIKLk8fmJYVWClM2+S7Z79M3GHexDbMS8+oDZnzCTB
 EknvaTbi+vOt4HGABkbJ/jiQCgonmeobon30gLWaYa3XGeYc7ZV5gR+EXL9xSdvh
 +I+x3rSDpm8fQ5njkB7RKgfB+ha4NQqZ6dXlYQzcb0vMO3/lhQ56Ypgn95Jlu5UW
 pcLxUFE1do+JeGvIU1it2SCRJ5499g180Rxucl7X1xFBQ44Qss9QOeWkFTZ8784V
 PIYVZMTUqO0Km4H22qXD8lIY5GjDuAXLYM3AddFkJpbKaw6g++ZsUXSfoA5zRIDn
 10edaPK/hDIQeFaV+ySTN5g/Qh3YFnmY4kDL3t3CRZDe4+DTW/+cmrF3sGkhZDQt
 +nDo0JxJsjNnJW6loB5Lb76lygvsng01owSsYSAChjhYwBvCDgp9/D85pDQ/oYkl
 HKD9tiF3
 =YNSx
 -----END PGP SIGNATURE-----

Merge tag '4.19-rc3-smb3-cifs' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Fixes for four CIFS/SMB3 potential pointer overflow issues, one minor
  build fix, and a build warning cleanup"

* tag '4.19-rc3-smb3-cifs' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: read overflow in is_valid_oplock_break()
  cifs: integer overflow in in SMB2_ioctl()
  CIFS: fix wrapping bugs in num_entries()
  cifs: prevent integer overflow in nxt_dir_entry()
  fs/cifs: require sha512
  fs/cifs: suppress a string overflow warning
2018-09-14 19:33:42 -10:00
Linus Torvalds
589109df31 NFS client bugfixes for Linux 4.19
Stable bugfixes:
 - v4.17+: Fix a tracepoint Oops in initiate_file_draining()
 - v4.17+: Fix a tracepoint Oops in initiate_file_draining()
 - v4.11+: Fix an infinite loop on I/O
 
 Other fixes:
 - Return errors if a waiting layoutget is killed
 - Don't open code clearing of delegation state
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlucGhMACgkQ18tUv7Cl
 QOsMCA/9ET0bbzus25DWcnbT10bpEdtW4p6dR/2ztGqUGe0FVyCyVT70jBKFbnAL
 cO1pqElKLj7TMPhTsxK63dDxXGELXCqDtmsHELaD8jf0h6270KAPariJBQ5+N0ud
 5U5CXswW/zbQekE9GMTDQtAAGBzfht33PavFt2+5oVYTAQ5K6Pwvq2qMoifQxMlk
 wjtVjypz0QjBy5bHBO6XGxX58JIc23EwA0/KDS4cU3vkwDXmEZcVYIUdqJF4gtCz
 JdJdnT4b9ebtbdHENx8rkot3L1VSx6JfW9pPMvxLjxn8IG1rj7zQXtc7kpnoF8RY
 WVGWuf4rn7Zo7YXf11SXNebMYgrljx5/0KcmUtSgSmCCqVUmY1e/a69e0fhkKfxn
 /W/+fYIdC1wG0JXtrCN8eJbGropYj3B8Ln5TBV4LN91hFMI8Lx/4D1lKLoK7RNLJ
 3Q6VmKIhKVMHgK6eAivWyN7X/WzIvAj37a2ix2xSjENUIP+l3ePAekc4f5XGMe8O
 wx6wxgvVSomVrsM565XGcjw+LXUyzlXowS6JhR+Zn6fYmbDkPk0ZMC5HBFidakkB
 YxO7aWjBvyYinOrBMWODeWatt4q50bI6YtQpxxWyDIdRXWbvQDjBILveTh6/sRGQ
 KbA3r6XC/DSMri4Mo/r+92fbBEYV2otnkkYraz04VgBVYYgiJts=
 =C83+
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.19-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "These are a handful of fixes for problems that Trond found. Patch #1
  and #3 have the same name, a second issue was found after applying the
  first patch.

  Stable bugfixes:
   - v4.17+: Fix tracepoint Oops in initiate_file_draining()
   - v4.11+: Fix an infinite loop on I/O

  Other fixes:
   - Return errors if a waiting layoutget is killed
   - Don't open code clearing of delegation state"

* tag 'nfs-for-4.19-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Don't open code clearing of delegation state
  NFSv4.1 fix infinite loop on I/O.
  NFSv4: Fix a tracepoint Oops in initiate_file_draining()
  pNFS: Ensure we return the error if someone kills a waiting layoutget
  NFSv4: Fix a tracepoint Oops in initiate_file_draining()
2018-09-14 19:25:28 -10:00
Trond Myklebust
9f0c5124f4 NFS: Don't open code clearing of delegation state
Add a helper for the case when the nfs4 open state has been set to use
a delegation stateid, and we want to revert to using the open stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-09-14 16:24:27 -04:00
Trond Myklebust
994b15b983 NFSv4.1 fix infinite loop on I/O.
The previous fix broke recovery of delegated stateids because it assumes
that if we did not mark the delegation as suspect, then the delegation has
effectively been revoked, and so it removes that delegation irrespectively
of whether or not it is valid and still in use. While this is "mostly
harmless" for ordinary I/O, we've seen pNFS fail with LAYOUTGET spinning
in an infinite loop while complaining that we're using an invalid stateid
(in this case the all-zero stateid).

What we rather want to do here is ensure that the delegation is always
correctly marked as needing testing when that is the case. So we want
to close the loophole offered by nfs4_schedule_stateid_recovery(),
which marks the state as needing to be reclaimed, but not the
delegation that may be backing it.

Fixes: 0e3d3e5df0 ("NFSv4.1 fix infinite loop on IO BAD_STATEID error")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-09-14 16:24:11 -04:00
Trond Myklebust
2edaead69e NFSv4: Fix a tracepoint Oops in initiate_file_draining()
Now that the value of 'ino' can be NULL or an ERR_PTR(), we need to
change the test in the tracepoint.

Fixes: ce5624f7e6 ("NFSv4: Return NFS4ERR_DELAY when a layout fails...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.17+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-09-14 16:24:08 -04:00
Trond Myklebust
d03360aaf5 pNFS: Ensure we return the error if someone kills a waiting layoutget
If someone interrupts a wait on one or more outstanding layoutgets in
pnfs_update_layout() then return the ERESTARTSYS/EINTR error.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-09-14 16:24:08 -04:00
Trond Myklebust
2a534a7473 NFSv4: Fix a tracepoint Oops in initiate_file_draining()
Now that the value of 'ino' can be NULL or an ERR_PTR(), we need to
change the test in the tracepoint.

Fixes: ce5624f7e6 ("NFSv4: Return NFS4ERR_DELAY when a layout fails...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.17+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-09-14 16:23:16 -04:00
Al Viro
e21120383f move compat handling of tty ioctls to tty_compat_ioctl()
ioctls that are
	* callable only via tty_ioctl()
	* not driver-specific
	* not demand data structure conversions
	* either always need passing arg as is or always demand compat_ptr()
get intercepted in tty_compat_ioctl() from the very beginning and
redirecter to tty_ioctl().  As the result, their entries in fs/compat_ioctl.c
(some of those had been missing, BTW) got removed, as well as
n_tty_compat_ioctl_helper() (now it's never called with any cmd it would accept).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-14 11:12:17 -04:00
Linus Torvalds
48751b562b overlayfs fixes for 4.19-rc4
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW5qpOgAKCRDh3BK/laaZ
 PDCQAQCIKLg0aLeWOkfUO76mBjlp5srKgJfrqpFoyuozD6l2fQEAl/W2x9NOduV+
 PK4sCYMT8SpI0hMrbv9P4zZ683kmaA8=
 =RnZU
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs fixes from Miklos Szeredi:
 "This fixes a regression in the recent file stacking update, reported
  and fixed by Amir Goldstein. The fix is fairly trivial, but involves
  adding a fadvise() f_op and the associated churn in the vfs. As
  discussed on -fsdevel, there are other possible uses for this method,
  than allowing proper stacking for overlays.

  And there's one other fix for a syzkaller detected oops"

* tag 'ovl-fixes-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix oopses in ovl_fill_super() failure paths
  ovl: add ovl_fadvise()
  vfs: implement readahead(2) using POSIX_FADV_WILLNEED
  vfs: add the fadvise() file operation
  Documentation/filesystems: update documentation of file_operations
  ovl: fix GPF in swapfile_activate of file from overlayfs over xfs
  ovl: respect FIEMAP_FLAG_SYNC flag
2018-09-13 19:21:40 -10:00
Linus Torvalds
145ea6f10d pstore fixes for v4.19-rc4:
- Handle page-vs-byte offset handling between iomap and vmap (Bin Yang)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAluakCIWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJnq1D/9a9PFwxBflN3Wi9G2SMJuCF8y1
 vb+auPyMtk2gpnDnE/GbumwLfB/ZW/1JSBNncCmSnFOjeK3M0m4e0gz0XU2AQaT6
 BTNFZlVe649a6zct59nEQRILG1aPP5UdKv4eE2mJ+XOFEi3DSTMgtsmFhftrngGZ
 Gp8XYaNgf6JIL25fUtfgZd5YiwbR+ZdLZdcd+qlIliJF80rup/QftdgRsPJk0kUS
 b8yE6sWNnP1NDzlUWcAWXy231lewbbUsdYAQ11RLThl8Glogz3hlxLKIbCuKFymd
 IGwSyShnZFczO6qN6Rcvx5o8PIgbTH+1ntQc3s3PNuXOpOmOk7F2YspkAEqLY3//
 4Gy+jphSmg12JAzPy7B1m0UT8nCE0S9qfi7tnQ0Dti0rvu+vS1dyqFlqbTgN9Hb3
 Wfi8u/rNvUT5XqB0afq5v7U6On7mjHfW6PwcwqZJZtEblERqv+DOgRUPjWkack6E
 x64IA0UzlEX2AlWHiBxGcwd55kIvk+xI94VVy2K8QcbWlGeM5Ij7WDMABTkmJn5L
 3I8EwoIpg9qtoabKOywOQuj6m9Dovvo0Dv8juUtHARBUJZC0fEpakTdPZEQoiSBs
 7Vox1UTzY+QqJch1ayEvjmqJNkPsAor7zqfVX4Ub85tuECAEs+MtQDbZCZUTXR7I
 WN5N1QBDpuwC8tupCg==
 =8krP
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore fix from Kees Cook:
 "This fixes a 6 year old pstore bug that everyone just got lucky in
  avoiding, likely due only using page-aligned persistent ram regions:

   - Handle page-vs-byte offset handling between iomap and vmap (Bin Yang)"

* tag 'pstore-v4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore: Fix incorrect persistent ram buffer mapping
2018-09-13 18:52:41 -10:00
Bin Yang
831b624df1 pstore: Fix incorrect persistent ram buffer mapping
persistent_ram_vmap() returns the page start vaddr.
persistent_ram_iomap() supports non-page-aligned mapping.

persistent_ram_buffer_map() always adds offset-in-page to the vaddr
returned from these two functions, which causes incorrect mapping of
non-page-aligned persistent ram buffer.

By default ftrace_size is 4096 and max_ftrace_cnt is nr_cpu_ids. Without
this patch, the zone_sz in ramoops_init_przs() is 4096/nr_cpu_ids which
might not be page aligned. If the offset-in-page > 2048, the vaddr will be
in next page. If the next page is not mapped, it will cause kernel panic:

[    0.074231] BUG: unable to handle kernel paging request at ffffa19e0081b000
...
[    0.075000] RIP: 0010:persistent_ram_new+0x1f8/0x39f
...
[    0.075000] Call Trace:
[    0.075000]  ramoops_init_przs.part.10.constprop.15+0x105/0x260
[    0.075000]  ramoops_probe+0x232/0x3a0
[    0.075000]  platform_drv_probe+0x3e/0xa0
[    0.075000]  driver_probe_device+0x2cd/0x400
[    0.075000]  __driver_attach+0xe4/0x110
[    0.075000]  ? driver_probe_device+0x400/0x400
[    0.075000]  bus_for_each_dev+0x70/0xa0
[    0.075000]  driver_attach+0x1e/0x20
[    0.075000]  bus_add_driver+0x159/0x230
[    0.075000]  ? do_early_param+0x95/0x95
[    0.075000]  driver_register+0x70/0xc0
[    0.075000]  ? init_pstore_fs+0x4d/0x4d
[    0.075000]  __platform_driver_register+0x36/0x40
[    0.075000]  ramoops_init+0x12f/0x131
[    0.075000]  do_one_initcall+0x4d/0x12c
[    0.075000]  ? do_early_param+0x95/0x95
[    0.075000]  kernel_init_freeable+0x19b/0x222
[    0.075000]  ? rest_init+0xbb/0xbb
[    0.075000]  kernel_init+0xe/0xfc
[    0.075000]  ret_from_fork+0x3a/0x50

Signed-off-by: Bin Yang <bin.yang@intel.com>
[kees: add comments describing the mapping differences, updated commit log]
Fixes: 24c3d2f342 ("staging: android: persistent_ram: Make it possible to use memory outside of bootmem")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-09-13 09:14:57 -07:00
Dan Carpenter
097f5863b1 cifs: read overflow in is_valid_oplock_break()
We need to verify that the "data_offset" is within bounds.

Reported-by: Dr Silvio Cesare of InfoSect <silvio.cesare@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-09-12 17:13:34 -05:00
Chao Yu
6f5c2ed0a2 f2fs: split IO error injection according to RW
This patch adds to support injecting error for write IO, this can simulate
IO error like fail_make_request or dm_flakey does.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-12 13:07:32 -07:00
Chao Yu
7c1a000d46 f2fs: add SPDX license identifiers
Remove the verbose license text from f2fs files and replace them with
SPDX tags.  This does not change the license of any of the code.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-12 13:07:10 -07:00
Chengguang Xu
4cb037ec3f f2fs: surround fault_injection related option parsing using CONFIG_F2FS_FAULT_INJECTION
It's a little bit strange when fault_injection related
options fail with -EINVAL which were already disabled
from config, so surround all fault_injection related option
parsing code using CONFIG_F2FS_FAULT_INJECTION. Meanwhile,
slightly change warning message to keep consistency with
option POSIX_ACL and FS_XATTR.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-12 13:05:33 -07:00
Arnd Bergmann
04b72322e8 media: dvb: move compat handlers into drivers
The VIDEO_STILLPICTURE is only implemented by one driver, while
VIDEO_GET_EVENT has two users in tree. In both cases, it is fairly
easy to handle the compat ioctls in the native handler rather
than relying on translation in fs/compat_ioctls.

In effect, this means that now the drivers implement both structure
layouts in both native and compat mode, but I don't see anything
wrong with that.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-09-12 11:00:51 -04:00
Arnd Bergmann
8a24280b11 media: dvb: move most compat_ioctl handling into drivers
Most DVB audio and video ioctl commands are completely compatible,
and are implemented by just two drivers: ttpci and ivtv. In both
cases, we can use the same ioctl handler for both native and
compat ioctl handling, and remove the entries from the global
lookup table.

In case of ttpci, this directly hooks into the file_operations
structure, and for ivtv, we have to set the compat_ioctl32
method in v4l2_file_operations. For all I can tell, setting it
to video_ioctl2 will still do the right thing for all commands.

Note that for the VIDEO_STILLPICTURE and VIDEO_GET_EVENT commands,
a translation handler in fs/compat_ioctl.c is still used. This
works because the command numbers are different on 32-bit
systems.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-09-12 11:00:51 -04:00
Arnd Bergmann
e6c8320648 media: cec: move compat_ioctl handling to cec-api.c
All the CEC ioctls are compatible, and they are only implemented
in one driver, so we can simply let this driver handle them
natively.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-09-12 11:00:51 -04:00
Arnd Bergmann
b5d3206112 media: dvb: dmxdev: move compat_ioctl handling to dmxdev.c
All dmx ioctls are compatible, and they are only implemented
in one file, so we can replace the list of commands in
fs/compat_ioctl.c with a single line in dmxdev.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-09-12 11:00:51 -04:00
Arnd Bergmann
1ccbeeb888 media: dvb: fix compat ioctl translation
The VIDEO_GET_EVENT and VIDEO_STILLPICTURE was added back in 2005 but
it never worked because the command number is wrong.

Using the right command number means we have a better chance of them
actually doing the right thing, though clearly nobody has ever tried
it successfully.

I noticed these while auditing the remaining users of compat_time_t
for y2038 bugs. This one is fine in that regard, it just never did
anything.

Fixes: 6e87abd0b8 ("[DVB]: Add compat ioctl handling.")

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-09-12 11:00:51 -04:00
Dan Carpenter
2d204ee9d6 cifs: integer overflow in in SMB2_ioctl()
The "le32_to_cpu(rsp->OutputOffset) + *plen" addition can overflow and
wrap around to a smaller value which looks like it would lead to an
information leak.

Fixes: 4a72dafa19 ("SMB2 FSCTL and IOCTL worker function")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
2018-09-12 09:32:07 -05:00
Dan Carpenter
56446f218a CIFS: fix wrapping bugs in num_entries()
The problem is that "entryptr + next_offset" and "entryptr + len + size"
can wrap.  I ended up changing the type of "entryptr" because it makes
the math easier when we don't have to do so much casting.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2018-09-12 09:32:07 -05:00
Dan Carpenter
8ad8aa3535 cifs: prevent integer overflow in nxt_dir_entry()
The "old_entry + le32_to_cpu(pDirInfo->NextEntryOffset)" can wrap
around so I have added a check for integer overflow.

Reported-by: Dr Silvio Cesare of InfoSect <silvio.cesare@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2018-09-12 09:27:57 -05:00
Matthew Wilcox
b90ca5cc32 filesystem-dax: Fix use of zero page
Use my_zero_pfn instead of ZERO_PAGE(), and pass the vaddr to it instead
of zero so it works on MIPS and s390 who reference the vaddr to select a
zero page.

Cc: <stable@vger.kernel.org>
Fixes: 91d25ba8a6 ("dax: use common 4k zero page for dax mmap reads")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-09-11 21:27:44 -07:00
Wang Shilong
c8e927579e f2fs: fix setattr project check upon fssetxattr ioctl
Currently, project quota could be changed by fssetxattr
ioctl, and existed permission check inode_owner_or_capable()
is obviously not enough, just think that common users could
change project id of file, that could make users to
break project quota easily.

This patch try to follow same regular of xfs project
quota:

"Project Quota ID state is only allowed to change from
within the init namespace. Enforce that restriction only
if we are trying to change the quota ID state.
Everything else is allowed in user namespaces."

Besides that, check and set project id'state should
be an atomic operation, protect whole operation with
inode lock.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Zhikang Zhang
b430f72636 f2fs: avoid sleeping under spin_lock
In the call trace below, we might sleep in function dput().

So in order to avoid sleeping under spin_lock, we remove f2fs_mark_inode_dirty_sync
from __try_update_largest_extent && __drop_largest_extent.

BUG: sleeping function called from invalid context at fs/dcache.c:796
Call trace:
	dump_backtrace+0x0/0x3f4
	show_stack+0x24/0x30
	dump_stack+0xe0/0x138
	___might_sleep+0x2a8/0x2c8
	__might_sleep+0x78/0x10c
	dput+0x7c/0x750
	block_dump___mark_inode_dirty+0x120/0x17c
	__mark_inode_dirty+0x344/0x11f0
	f2fs_mark_inode_dirty_sync+0x40/0x50
	__insert_extent_tree+0x2e0/0x2f4
	f2fs_update_extent_tree_range+0xcf4/0xde8
	f2fs_update_extent_cache+0x114/0x12c
	f2fs_update_data_blkaddr+0x40/0x50
	write_data_page+0x150/0x314
	do_write_data_page+0x648/0x2318
	__write_data_page+0xdb4/0x1640
	f2fs_write_cache_pages+0x768/0xafc
	__f2fs_write_data_pages+0x590/0x1218
	f2fs_write_data_pages+0x64/0x74
	do_writepages+0x74/0xe4
	__writeback_single_inode+0xdc/0x15f0
	writeback_sb_inodes+0x574/0xc98
	__writeback_inodes_wb+0x190/0x204
	wb_writeback+0x730/0xf14
	wb_check_old_data_flush+0x1bc/0x1c8
	wb_workfn+0x554/0xf74
	process_one_work+0x440/0x118c
	worker_thread+0xac/0x974
	kthread+0x1a0/0x1c8
	ret_from_fork+0x10/0x1c

Signed-off-by: Zhikang Zhang <zhangzhikang1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Chao Yu
e1293bdfa0 f2fs: plug readahead IO in readdir()
Add a plug to merge readahead IO in readdir(), expecting it can
reduce bio count before submitting to block layer.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Chao Yu
042be0f849 f2fs: fix to do sanity check with current segment number
https://bugzilla.kernel.org/show_bug.cgi?id=200219

Reproduction way:
- mount image
- run poc code
- umount image

F2FS-fs (loop1): Bitmap was wrongly set, blk:15364
------------[ cut here ]------------
kernel BUG at /home/yuchao/git/devf2fs/segment.c:2061!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 2 PID: 17686 Comm: umount Tainted: G        W  O      4.18.0-rc2+ #39
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
EIP: update_sit_entry+0x459/0x4e0 [f2fs]
Code: e8 1c b5 fd ff 0f 0b 0f 0b 8b 45 e4 c7 44 24 08 9c 7a 6c f8 c7 44 24 04 bc 4a 6c f8 89 44 24 0c 8b 06 89 04 24 e8 f7 b4 fd ff <0f> 0b 8b 45 e4 0f b6 d2 89 54 24 10 c7 44 24 08 60 7a 6c f8 c7 44
EAX: 00000032 EBX: 000000f8 ECX: 00000002 EDX: 00000001
ESI: d7177000 EDI: f520fe68 EBP: d6477c6c ESP: d6477c34
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010282
CR0: 80050033 CR2: b7fbe000 CR3: 2a99b3c0 CR4: 000406f0
Call Trace:
 f2fs_allocate_data_block+0x124/0x580 [f2fs]
 do_write_page+0x78/0x150 [f2fs]
 f2fs_do_write_node_page+0x25/0xa0 [f2fs]
 __write_node_page+0x2bf/0x550 [f2fs]
 f2fs_sync_node_pages+0x60e/0x6d0 [f2fs]
 ? sync_inode_metadata+0x2f/0x40
 ? f2fs_write_checkpoint+0x28f/0x7d0 [f2fs]
 ? up_write+0x1e/0x80
 f2fs_write_checkpoint+0x2a9/0x7d0 [f2fs]
 ? mark_held_locks+0x5d/0x80
 ? _raw_spin_unlock_irq+0x27/0x50
 kill_f2fs_super+0x68/0x90 [f2fs]
 deactivate_locked_super+0x3d/0x70
 deactivate_super+0x40/0x60
 cleanup_mnt+0x39/0x70
 __cleanup_mnt+0x10/0x20
 task_work_run+0x81/0xa0
 exit_to_usermode_loop+0x59/0xa7
 do_fast_syscall_32+0x1f5/0x22c
 entry_SYSENTER_32+0x53/0x86
EIP: 0xb7f95c51
Code: c1 1e f7 ff ff 89 e5 8b 55 08 85 d2 8b 81 64 cd ff ff 74 02 89 02 5d c3 8b 0c 24 c3 8b 1c 24 c3 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
EAX: 00000000 EBX: 0871ab90 ECX: bfb2cd00 EDX: 00000000
ESI: 00000000 EDI: 0871ab90 EBP: 0871ab90 ESP: bfb2cd7c
DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
Modules linked in: f2fs(O) crc32_generic bnep rfcomm bluetooth ecdh_generic snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq pcbc joydev aesni_intel snd_seq_device aes_i586 snd_timer crypto_simd snd cryptd soundcore mac_hid serio_raw video i2c_piix4 parport_pc ppdev lp parport hid_generic psmouse usbhid hid e1000 [last unloaded: f2fs]
---[ end trace d423f83982cfcdc5 ]---

The reason is, different log headers using the same segment, once
one log's next block address is used by another log, it will cause
panic as above.

Main area: 24 segs, 24 secs 24 zones
  - COLD  data: 0, 0, 0
  - WARM  data: 1, 1, 1
  - HOT   data: 20, 20, 20
  - Dir   dnode: 22, 22, 22
  - File   dnode: 22, 22, 22
  - Indir nodes: 21, 21, 21

So this patch adds sanity check to detect such condition to avoid
this issue.

Signed-off-by: Chao Yu <yuchao0@huawei.com>

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Chao Yu
4a70e25544 f2fs: fix memory leak of percpu counter in fill_super()
In fill_super -> init_percpu_info, we should destroy percpu counter
in error path, otherwise memory allcoated for percpu counter will
leak.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Chao Yu
0b2103e886 f2fs: fix memory leak of write_io in fill_super()
It needs to release memory allocated for sbi->write_io in error path,
otherwise, it will cause memory leak.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Al Viro
77021f8bab presence of RS485 ioctls has been unconditional since 2014
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-11 19:46:07 -04:00
Chengguang Xu
313ed62a3d f2fs: cache NULL when both default_acl and acl are NULL
default_acl and acl of newly created inode will be initiated
as ACL_NOT_CACHED in vfs function inode_init_always() and later
will be updated by calling xxx_init_acl() in specific filesystems.
Howerver, when default_acl and acl are NULL then they keep the value
of ACL_NOT_CACHED, this patch tries to cache NULL for acl/default_acl
in this case.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 13:16:03 -07:00
Chao Yu
1378752b99 f2fs: fix to flush all dirty inodes recovered in readonly fs
generic/417 reported as blow:

------------[ cut here ]------------
kernel BUG at /home/yuchao/git/devf2fs/inode.c:695!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 21697 Comm: umount Tainted: G        W  O      4.18.0-rc2+ #39
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
EIP: f2fs_evict_inode+0x556/0x580 [f2fs]
Call Trace:
 ? _raw_spin_unlock+0x2c/0x50
 evict+0xa8/0x170
 dispose_list+0x34/0x40
 evict_inodes+0x118/0x120
 generic_shutdown_super+0x41/0x100
 ? rcu_read_lock_sched_held+0x97/0xa0
 kill_block_super+0x22/0x50
 kill_f2fs_super+0x6f/0x80 [f2fs]
 deactivate_locked_super+0x3d/0x70
 deactivate_super+0x40/0x60
 cleanup_mnt+0x39/0x70
 __cleanup_mnt+0x10/0x20
 task_work_run+0x81/0xa0
 exit_to_usermode_loop+0x59/0xa7
 do_fast_syscall_32+0x1f5/0x22c
 entry_SYSENTER_32+0x53/0x86
EIP: f2fs_evict_inode+0x556/0x580 [f2fs]

It can simply reproduced with scripts:

Enable quota feature during mkfs.

Testcase1:
1. mkfs.f2fs /dev/zram0
2. mount -t f2fs /dev/zram0 /mnt/f2fs
3. xfs_io -f /mnt/f2fs/file -c "pwrite 0 4k" -c "fsync"
4. godown /mnt/f2fs
5. umount /mnt/f2fs
6. mount -t f2fs -o ro /dev/zram0 /mnt/f2fs
7. umount /mnt/f2fs

Testcase2:
1. mkfs.f2fs /dev/zram0
2. mount -t f2fs /dev/zram0 /mnt/f2fs
3. touch /mnt/f2fs/file
4. create process[pid = x] do:
	a) open /mnt/f2fs/file;
	b) unlink /mnt/f2fs/file
5. godown -f /mnt/f2fs
6. kill process[pid = x]
7. umount /mnt/f2fs
8. mount -t f2fs -o ro /dev/zram0 /mnt/f2fs
9. umount /mnt/f2fs

The reason is: during recovery, i_{c,m}time of inode will be updated, then
the inode can be set dirty w/o being tracked in sbi->inode_list[DIRTY_META]
global list, so later write_checkpoint will not flush such dirty inode into
node page.

Once umount is called, sync_filesystem() in generic_shutdown_super() will
skip syncng dirty inodes due to sb_rdonly check, leaving dirty inodes
there.

To solve this issue, during umount, add remove SB_RDONLY flag in
sb->s_flags, to make sure sync_filesystem() will not be skipped.

Signed-off-by: Chao Yu <yuchao0@huawei.com>

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 13:16:03 -07:00
Yunlei He
cda9cc595f f2fs: report error if quota off error during umount
Now, we depend on fsck to ensure quota file data is ok,
so we scan whole partition if checkpoint without umount
flag. It's same for quota off error case, which may make
quota file data inconsistent.

generic/019 reports below error:

 __quota_error: 1160 callbacks suppressed
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 VFS: Busy inodes after unmount of zram1. Self-destruct in 5 seconds.  Have a nice day...

If we failed in below path due to fail to write dquot block, we will miss
to release quota inode, fix it.

- f2fs_put_super
 - f2fs_quota_off_umount
  - f2fs_quota_off
   - f2fs_quota_sync   <-- failed
   - dquot_quota_off   <-- missed to call

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 13:16:03 -07:00
Eric W. Biederman
961366a019 signal: Remove the siginfo paramater from kernel_dqueue_signal
None of the callers use the it so remove it.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-11 21:19:14 +02:00
Ross Zwisler
b1f382178d ext4: close race between direct IO and ext4_break_layouts()
If the refcount of a page is lowered between the time that it is returned
by dax_busy_page() and when the refcount is again checked in
ext4_break_layouts() => ___wait_var_event(), the waiting function
ext4_wait_dax_page() will never be called.  This means that
ext4_break_layouts() will still have 'retry' set to false, so we'll stop
looping and never check the refcount of other pages in this inode.

Instead, always continue looping as long as dax_layout_busy_page() gives us
a page which it found with an elevated refcount.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-09-11 13:31:16 -04:00
Chengguang Xu
02645bcdfc jfs: remove quota option from ignore list
We treat quota option as usrquota, so remove quota option from ignore
list.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2018-09-10 15:36:40 -05:00
Al Viro
702ec3072a hidp: fix compat_ioctl
1) no point putting it into fs/compat_ioctl.c when you handle it in
your ->compat_ioctl() anyway.
2) HIDPCONNADD is *not* COMPATIBLE_IOCTL() stuff at all - it does
layout massage (pointer-chasing there)
3) use compat_ptr()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-10 12:41:07 -04:00
Al Viro
89c0c24b4f cmtp: fix compat_ioctl
Use compat_ptr().  And don't mess with fs/compat_ioctl.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-10 12:41:06 -04:00
Al Viro
cc04f6e242 bnep: fix compat_ioctl
use compat_ptr() properly and don't bother with fs/compat_ioctl.c -
it's all handled in ->compat_ioctl() anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-10 12:41:06 -04:00
Al Viro
0976d4e1dc compat_ioctl: trim the pointless includes
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-10 12:40:57 -04:00
Miklos Szeredi
8c25741aaa ovl: fix oopses in ovl_fill_super() failure paths
ovl_free_fs() dereferences ofs->workbasedir and ofs->upper_mnt in cases when
those might not have been initialized yet.

Fix the initialization order for these fields.

Reported-by: syzbot+c75f181dc8429d2eb887@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc:  <stable@vger.kernel.org> # v4.15
Fixes: 95e6d4177c ("ovl: grab reference to workbasedir early")
Fixes: a9075cdb46 ("ovl: factor out ovl_free_fs() helper")
2018-09-10 12:55:49 +02:00
Stefan Metzmacher
5890184d2b fs/cifs: require sha512
This got lost in commit 0fdfef9aa7,
which removed CONFIG_CIFS_SMB311.

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Fixes: 0fdfef9aa7 ("smb3: simplify code by removing CONFIG_CIFS_SMB311")
CC: Stable <stable@vger.kernel.org>
CC: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-09-09 00:04:27 -05:00
Stephen Rothwell
bcfb84a996 fs/cifs: suppress a string overflow warning
A powerpc build of cifs with gcc v8.2.0 produces this warning:

fs/cifs/cifssmb.c: In function ‘CIFSSMBNegotiate’:
fs/cifs/cifssmb.c:605:3: warning: ‘strncpy’ writing 16 bytes into a region of size 1 overflows the destination [-Wstringop-overflow=]
   strncpy(pSMB->DialectsArray+count, protocols[i].name, 16);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since we are already doing a strlen() on the source, change the strncpy
to a memcpy().

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-09-09 00:02:39 -05:00
David Howells
ecfe951f0c afs: Fix cell specification to permit an empty address list
Fix the cell specification mechanism to allow cells to be pre-created
without having to specify at least one address (the addresses will be
upcalled for).

This allows the cell information preload service to avoid the need to issue
loads of DNS lookups during boot to get the addresses for each cell (500+
lookups for the 'standard' cell list[*]).  The lookups can be done later as
each cell is accessed through the filesystem.

Also remove the print statement that prints a line every time a new cell is
added.

[*] There are 144 cells in the list.  Each cell is first looked up for an
    SRV record, and if that fails, for an AFSDB record.  These get a list
    of server names, each of which then has to be looked up to get the
    addresses for that server.  E.g.:

	dig srv _afs3-vlserver._udp.grand.central.org

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-07 16:39:44 -07:00
Jaegeuk Kim
5ce805869c f2fs: submit bio after shutdown
Sometimes, some merged IOs could get a chance to be submitted, resulting in
system hang in shutdown test. This issues IOs all the time after shutdown.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-07 14:26:43 -07:00
Linus Torvalds
a12ed06ba2 Two rbd patches to complete support for images within namespaces that
went into -rc1 and a use-after-free fix.
 
 The rbd changes have been sitting in a branch for quite a while but
 couldn't be included into the -rc1 pull request because of a pending
 wire protocol backwards compatibility fixup that only got committed
 early this week.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAluSrJYTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi/N8B/4sZzRCJMCejvU/yRq91NlaPDrxbVHh
 nfICZ/8Fsy/fmvK8NWNyHcCIWx+nWrbCvCJMj0fxWMhk/1t75yC+TdyCJnyuhsQU
 V/CPTs9BTdwrSUiTB83/n/ukGL6mpESk0CQ1er/l1EO6FnNOXvgzHDnCqUQZLdzU
 1aRcx5JQWWo/QlCmzt2KWENhfQRMvLAtf04F5cUuR+JTrMjwWia6MAuRGuOhVQkW
 XIlFNakBKab89Vod1pmA7BrG/+sHXCpVGX6sjAp9vQUWO3WWKBRnNtVwo9dPSHah
 hBR8IzOkihw7HfTlINWVpiR69nTfM80PQHXJkFSp36E6Sfq8EShRpFIZ
 =pga5
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.19-rc3' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Two rbd patches to complete support for images within namespaces that
  went into -rc1 and a use-after-free fix.

  The rbd changes have been sitting in a branch for quite a while but
  couldn't be included into the -rc1 pull request because of a pending
  wire protocol backwards compatibility fixup that only got committed
  early this week"

* tag 'ceph-for-4.19-rc3' of https://github.com/ceph/ceph-client:
  rbd: support cloning across namespaces
  rbd: factor out get_parent_info()
  ceph: avoid a use-after-free in ceph_destroy_options()
2018-09-07 10:57:59 -07:00
Linus Torvalds
d042a240a8 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAluSNfQACgkQnJ2qBz9k
 QNloSAf/RpsqUnmQvJKK7hQUVNMCQP/Kf3KND5iN5RfMbhU9r7tzERkNvqhdA6QZ
 uoPi8dEecI+ihY5F8ddyw1Chaou4MToWKdNz4ojwJXVrN6bb+pq+xj0hTvT5FjFh
 iM1JXHtSEk6W+CnXPE5CycrZppIHxJfJxeaWg7av5Zyc4nkTesxtG8PycMBxROW8
 detUcJt15VGBswi19udztf7XY/lwDwUQ9LwC0W5B+o8pKIwuN3ENMVVOeAriAyoy
 hXTpPA8twBhM7i8D/1eppDCkYLTr08bquNsDpn8kUEf2RxcxiFJuDLOeXiH3sQRq
 BZmf/QIIRA8R+SPeFiuxY/795FDC6Q==
 =CWu1
 -----END PGP SIGNATURE-----

Merge tag 'for_v4.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify fix from Jan Kara:
 "A small fsnotify fix from Amir"

* tag 'for_v4.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: fix ignore mask logic in fsnotify()
2018-09-07 10:54:46 -07:00
Dominique Martinet
b4dc44b3ca 9p locks: fix glock.client_id leak in do_lock
the 9p client code overwrites our glock.client_id pointing to a static
buffer by an allocated string holding the network provided value which
we do not care about; free and reset the value as appropriate.

This is almost identical to the leak in v9fs_file_getlock() fixed by
Al Viro in commit ce85dd58ad ("9p: we are leaking glock.client_id
in v9fs_file_getlock()"), which was returned as an error by a coverity
false positive -- while we are here attempt to make the code slightly
more robust to future change of the net/9p/client code and hopefully
more clear to coverity that there is no problem.

Link: http://lkml.kernel.org/r/1536339057-21974-5-git-send-email-asmadeus@codewreck.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:52:35 +09:00
Dominique Martinet
e02a53d92e 9p: acl: fix uninitialized iattr access
iattr is passed to v9fs_vfs_setattr_dotl which does send various
values from iattr over the wire, even if it tells the server to
only look at iattr.ia_valid fields this could leak some stack data.

Link: http://lkml.kernel.org/r/1536339057-21974-2-git-send-email-asmadeus@codewreck.org
Addresses-Coverity-ID: 1195601 ("Uninitalized scalar variable")
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:51:50 +09:00
Dinu-Razvan Chis-Serban
5e172f75e5 9p locks: add mount option for lock retry interval
The default P9_LOCK_TIMEOUT can be too long for some users exporting
a local file system to a guest VM (30s), make this configurable at
mount time.

Link: http://lkml.kernel.org/r/1536295827-3181-1-git-send-email-asmadeus@codewreck.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195727
Signed-off-by: Dinu-Razvan Chis-Serban <justcsdr@gmail.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:40:06 +09:00
Gertjan Halkes
2803cf4379 9p: do not trust pdu content for stat item size
v9fs_dir_readdir() could deadloop if a struct was sent with a size set
to -2

Link: http://lkml.kernel.org/r/1536134432-11997-1-git-send-email-asmadeus@codewreck.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=88021
Signed-off-by: Gertjan Halkes <gertjan@google.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:40:06 +09:00
Gustavo A. R. Silva
426d5a0f97 9p: fix spelling mistake in fall-through annotation
Replace "fallthough" with a proper "fall through" annotation.

This fix is part of the ongoing efforts to enabling -Wimplicit-fallthrough

Link: http://lkml.kernel.org/r/20180903193806.GA11258@embeddedor.com
Addresses-Coverity-ID: 402012 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:39:47 +09:00
Jan Kara
1abefb0274 udf: Drop pack pragma from udf_sb.h
Drop pack pragma. The header file defines only in-memory structures.

Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-07 10:32:23 +02:00
Jan Kara
694538b5d7 udf: Drop freed bitmap / table support
We don't support Free Space Table and Free Space Bitmap as specified by
UDF standard for writing as we don't support erasing blocks before
overwriting them. Just drop the handling of these structures as
partition descriptor checking code already makes sure such filesystems
can be mounted only read-only.

Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-07 10:32:22 +02:00
Jan Kara
b085fbe2ef udf: Fix crash during mount
Fix a crash during an attempt to mount a filesystem that has both
Unallocated Space Table and Unallocated Space Bitmap. Such filesystem
actually violates the UDF standard so we just have to properly detect
such situation and refuse to mount such filesystem read-write. When we
are at it, verify also other constraints on the allocation information
mandated by the standard.

Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-07 10:32:22 +02:00
Jan Kara
a9ad01bc75 udf: Prevent write-unsupported filesystem to be remounted read-write
There are certain filesystem features which we support for reading but
not for writing. We properly refuse to mount such filesystems read-write
however for some features (such as read-only partitions), we don't check
for these features when remounting the filesystem from read-only to
read-write. Thus such filesystems could be remounted read-write leading
to strange behavior (most likely crashes).

Fix the problem by marking in superblock whether the filesystem has some
features that are supported in read-only mode and check this flag during
remount.

Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-07 10:32:22 +02:00
Linus Torvalds
c6ff25ce35 four small SMB3 fixes, three for stable, and one minor debug clarification
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAluRb84ACgkQiiy9cAdy
 T1GBuAv/ZkXC5cxNpE84dRLii4ey+Y0u5ip3VIyzxwburDDxQh99zWTs8FdyYe1f
 5CYD4PKcNajVceNUg1EdfkNC4ss21I2HcxujquCo8gZJqrt5lHvZXELJ31d8ovXG
 Y2MRl5+KZKLx1sBgzsGJf3aZOneObp7EEAnL/bjeziX7caD6uO/F52MXcMgWrpoM
 krvWSMzS5iY+jRltvPLhTzUmfbaPoS86FRNIHHOiA8AgQLvx3CT4lL+kJOHv1bHX
 haZc9zKy1iUU+yK05vnNLOHVlPeZDa8j/Q8lcEfTrDVa0J/je4DkI9HsB1X54vz0
 65JluAf4G4vaSYU/hnLaWt4PZ7owjTr7fJlu0c8TE6aPqAZEYYoVhjbrdXg7OI4j
 9nUsoEY1hyBH2X5qStvpb9GiZbhR19bhcvnvOAgZ7p4VnyKfexe7o2QHTCPPAxaP
 uYf8OeHjmnYF3MQI0s6rA0TPdZxc77acgrZQWUoOMPN/bwkyP1bSfA1aURJAsx/k
 OodjAog1
 =WRXo
 -----END PGP SIGNATURE-----

Merge tag '4.19-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Four small SMB3 fixes, three for stable, and one minor debug
  clarification"

* tag '4.19-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: connect to servername instead of IP for IPC$ share
  smb3: check for and properly advertise directory lease support
  smb3: minor debugging clarifications in rfc1001 len processing
  SMB3: Backup intent flag missing for directory opens with backupuid mounts
  fs/cifs: don't translate SFM_SLASH (U+F026) to backslash
2018-09-06 15:39:11 -07:00
Chengguang Xu
e8d4ceeb34 jfs: cache NULL when both default_acl and acl are NULL
default_acl and acl of newly created inode will be initiated
as ACL_NOT_CACHED in vfs function inode_init_always() and later
will be updated by calling xxx_init_acl() in specific filesystems.
Howerver, when default_acl and acl are NULL then they keep the value
of ACL_NOT_CACHED, this patch tries to cache NULL for acl/default_acl
in this case.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2018-09-06 11:56:26 -05:00
Linus Torvalds
5404525b98 for-4.19-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAluRLa8ACgkQxWXV+ddt
 WDvc+BAAqxTMVngZ60WfktXzsS56OB6fu/R3DORgYcSZ0BCD4zTwoDlCjLhrCK6E
 cmC+BMj+AspDQYiYESwGyFcN10sK0X7w7fa3wypTc4GNWxpkRm0Z6zT/kCvLUhdI
 NlkMqAfsZ9N6iIXcR0qOxI7G55e3mpXPZGdFTk5rmDTv/9TqU0TMp9s8Zw5scn6R
 ctdE+iE0lpRfNjF8ZDH1BtYIV4g2X81sZF/fkGz621HQfMTCjjPHFdlz+jlirBaf
 BrYR4w4zjVuMKd3ZC5FHffVchbkvt29h6fAr4sEpJTwFJwd8pjI7GuPYWDQ918NB
 TGX6EUP6usQqDK2zD405jCS6MbMshJm3uh5kmEpeNgK/tKJTln8Sbef/Xs93yIn2
 +k9BMKOIcUHHBiv6PgCaZomcWCpii2S2u6vncqCnNuI4wK1RN3gHJc5YPhJArlrB
 NUFJiTCQE6LWYOP2Hw+rggcrtBxli0bX7Mqp5FYFVdh5KBvolJE1o3B/JS8qpqRF
 u0dPwbLHtTpTpXM5EfmM8a45S+DxuxTDBh3vdoAOM9LN/ivpeqqnFbHrIGmrTMjo
 pQJ8aTrCwYMEMNu6oCV1cniFrOYRZ439hYjg524MjVXYCRyxhzAdVmVTEBaLjWCW
 9GlGqEC7YZY2wLi5lPEGqxsIaVVELpettJB9KbBKmYB47VFWEf0=
 =fu93
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix for improper fsync after hardlink

 - fix for a corruption during file deduplication

 - use after free fixes

 - RCU warning fix

 - fix for buffered write to nodatacow file

* tag 'for-4.19-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: Fix suspicious RCU usage warning in btrfs_debug_in_rcu
  btrfs: use after free in btrfs_quota_enable
  btrfs: btrfs_shrink_device should call commit transaction at the end
  btrfs: fix qgroup_free wrong num_bytes in btrfs_subvolume_reserve_metadata
  Btrfs: fix data corruption when deduplicating between different files
  Btrfs: sync log after logging new name
  Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space
2018-09-06 09:04:45 -07:00
Ilya Dryomov
8aaff15168 ceph: avoid a use-after-free in ceph_destroy_options()
syzbot reported a use-after-free in ceph_destroy_options(), called from
ceph_mount().  The problem was that create_fs_client() consumed the opt
pointer on some errors, but not on all of them.  Make sure it always
consumes both libceph and ceph options.

Reported-by: syzbot+8ab6f1042021b4eed062@syzkaller.appspotmail.com
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
2018-09-06 16:18:04 +02:00
Jaegeuk Kim
0ded69f632 f2fs: avoid wrong decrypted data from disk
1. Create a file in an encrypted directory
2. Do GC & drop caches
3. Read stale data before its bio for metapage was not issued yet

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:41:07 -07:00
Chao Yu
22d7ea1364 Revert "f2fs: use printk_ratelimited for f2fs_msg"
Don't limit printing log, so that we will not miss any key messages.

This reverts commit a36c106dff.

In addition, we use printk_ratelimited to avoid too many log prints.
- error injection
- discard submission failure

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:41:06 -07:00
Sahitya Tummala
abde73c718 f2fs: fix unnecessary periodic wakeup of discard thread when dev is busy
When dev is busy, discard thread wake up timeout can be aligned with the
exact time that it needs to wait for dev to come out of busy. This helps
to avoid unnecessary periodic wakeups and thus save some power.

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:40:32 -07:00
Chao Yu
7d20c8abb2 f2fs: fix to avoid NULL pointer dereference on se->discard_map
https://bugzilla.kernel.org/show_bug.cgi?id=200951

These is a NULL pointer dereference issue reported in bugzilla:

Hi,
in the setup there is a SATA SSD connected to a SATA-to-USB bridge.

The disc is "Samsung SSD 850 PRO 256G" which supports TRIM.
There are four partitions:
 sda1: FAT  /boot
 sda2: F2FS /
 sda3: F2FS /home
 sda4: F2FS

The bridge is ASMT1153e which uses the "uas" driver.
There is no TRIM pass-through, so, when mounting it reports:
 mounting with "discard" option, but the device does not support discard

The USB host is USB3.0 and UASP capable. It is the one on RK3399.

Given this everything works fine, except there is no TRIM support.

In order to enable TRIM a new UDEV rule is added [1]:
 /etc/udev/rules.d/10-sata-bridge-trim.rules:
 ACTION=="add|change", ATTRS{idVendor}=="174c", ATTRS{idProduct}=="55aa", SUBSYSTEM=="scsi_disk", ATTR{provisioning_mode}="unmap"
After reboot any F2FS write hangs forever and dmesg reports:
 Unable to handle kernel NULL pointer dereference

Also tested on a x86_64 system: works fine even with TRIM enabled.
 same disc
 same bridge
 different usb host controller
 different cpu architecture
 not root filesystem

Regards,
  Vicenç.

[1] Post #5 in https://bbs.archlinux.org/viewtopic.php?id=236280

 Unable to handle kernel NULL pointer dereference at virtual address 000000000000003e
 Mem abort info:
   ESR = 0x96000004
   Exception class = DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
 Data abort info:
   ISV = 0, ISS = 0x00000004
   CM = 0, WnR = 0
 user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000626e3122
 [000000000000003e] pgd=0000000000000000
 Internal error: Oops: 96000004 [#1] SMP
 Modules linked in: overlay snd_soc_hdmi_codec rc_cec dw_hdmi_i2s_audio dw_hdmi_cec snd_soc_simple_card snd_soc_simple_card_utils snd_soc_rockchip_i2s rockchip_rga snd_soc_rockchip_pcm rockchipdrm videobuf2_dma_sg v4l2_mem2mem rtc_rk808 videobuf2_memops analogix_dp videobuf2_v4l2 videobuf2_common dw_hdmi dw_wdt cec rc_core videodev drm_kms_helper media drm rockchip_thermal rockchip_saradc realtek drm_panel_orientation_quirks syscopyarea sysfillrect sysimgblt fb_sys_fops dwmac_rk stmmac_platform stmmac pwm_bl squashfs loop crypto_user gpio_keys hid_kensington
 CPU: 5 PID: 957 Comm: nvim Not tainted 4.19.0-rc1-1-ARCH #1
 Hardware name: Sapphire-RK3399 Board (DT)
 pstate: 00000005 (nzcv daif -PAN -UAO)
 pc : update_sit_entry+0x304/0x4b0
 lr : update_sit_entry+0x108/0x4b0
 sp : ffff00000ca13bd0
 x29: ffff00000ca13bd0 x28: 000000000000003e
 x27: 0000000000000020 x26: 0000000000080000
 x25: 0000000000000048 x24: ffff8000ebb85cf8
 x23: 0000000000000253 x22: 00000000ffffffff
 x21: 00000000000535f2 x20: 00000000ffffffdf
 x19: ffff8000eb9e6800 x18: ffff8000eb9e6be8
 x17: 0000000007ce6926 x16: 000000001c83ffa8
 x15: 0000000000000000 x14: ffff8000f602df90
 x13: 0000000000000006 x12: 0000000000000040
 x11: 0000000000000228 x10: 0000000000000000
 x9 : 0000000000000000 x8 : 0000000000000000
 x7 : 00000000000535f2 x6 : ffff8000ebff3440
 x5 : ffff8000ebff3440 x4 : ffff8000ebe3a6c8
 x3 : 00000000ffffffff x2 : 0000000000000020
 x1 : 0000000000000000 x0 : ffff8000eb9e5800
 Process nvim (pid: 957, stack limit = 0x0000000063a78320)
 Call trace:
  update_sit_entry+0x304/0x4b0
  f2fs_invalidate_blocks+0x98/0x140
  truncate_node+0x90/0x400
  f2fs_remove_inode_page+0xe8/0x340
  f2fs_evict_inode+0x2b0/0x408
  evict+0xe0/0x1e0
  iput+0x160/0x260
  do_unlinkat+0x214/0x298
  __arm64_sys_unlinkat+0x3c/0x68
  el0_svc_handler+0x94/0x118
  el0_svc+0x8/0xc
 Code: f9400800 b9488400 36080140 f9400f01 (387c4820)
 ---[ end trace a0f21a307118c477 ]---

The reason is it is possible to enable discard flag on block queue via
UDEV, but during mount, f2fs will initialize se->discard_map only if
this flag is set, once the flag is set after mount, f2fs may dereference
NULL pointer on se->discard_map.

So this patch does below changes to fix this issue:
- initialize and update se->discard_map all the time.
- don't clear DISCARD option if device has no QUEUE_FLAG_DISCARD flag
during mount.
- don't issue small discard on zoned block device.
- introduce some functions to enhance the readability.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Tested-by: Vicente Bergas <vicencb@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:40:31 -07:00
Chengguang Xu
1618e6e297 f2fs: add additional sanity check in f2fs_acl_from_disk()
Add additinal sanity check for irregular case(e.g. corruption).
If size of extended attribution is smaller than size of acl header,
then return -EINVAL.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:40:31 -07:00
Ryusuke Konishi
ae98043f5f nilfs2: convert to SPDX license tags
Remove the verbose license text from NILFS2 files and replace them with
SPDX tags.  This does not change the license of any of the code.

Link: http://lkml.kernel.org/r/1535624528-5982-1-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-04 16:45:02 -07:00
Alexander Popov
c8d126275a fs/proc: Show STACKLEAK metrics in the /proc file system
Introduce CONFIG_STACKLEAK_METRICS providing STACKLEAK information about
tasks via the /proc file system. In particular, /proc/<pid>/stack_depth
shows the maximum kernel stack consumption for the current and previous
syscalls. Although this information is not precise, it can be useful for
estimating the STACKLEAK performance impact for your workloads.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-09-04 10:35:48 -07:00
Jason A. Donenfeld
578bdaabd0 crypto: speck - remove Speck
These are unused, undesired, and have never actually been used by
anybody. The original authors of this code have changed their mind about
its inclusion. While originally proposed for disk encryption on low-end
devices, the idea was discarded [1] in favor of something else before
that could really get going. Therefore, this patch removes Speck.

[1] https://marc.info/?l=linux-crypto-vger&m=153359499015659

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Eric Biggers <ebiggers@google.com>
Cc: stable@vger.kernel.org
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-09-04 11:35:03 +08:00
Theodore Ts'o
5f8c10936f ext4: fix online resizing for bigalloc file systems with a 1k block size
An online resize of a file system with the bigalloc feature enabled
and a 1k block size would be refused since ext4_resize_begin() did not
understand s_first_data_block is 0 for all bigalloc file systems, even
when the block size is 1k.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-09-03 22:25:01 -04:00
Theodore Ts'o
f0a459dec5 ext4: fix online resize's handling of a too-small final block group
Avoid growing the file system to an extent so that the last block
group is too small to hold all of the metadata that must be stored in
the block group.

This problem can be triggered with the following reproducer:

umount /mnt
mke2fs -F -m0 -b 4096 -t ext4 -O resize_inode,^has_journal \
	-E resize=1073741824 /tmp/foo.img 128M
mount /tmp/foo.img /mnt
truncate --size 1708M /tmp/foo.img
resize2fs /dev/loop0 295400
umount /mnt
e2fsck -fy /tmp/foo.img

Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-09-03 22:19:43 -04:00
Amir Goldstein
d54f4fba88 fanotify: add API to attach/detach super block mark
Add another mark type flag FAN_MARK_FILESYSTEM for add/remove/flush
of super block mark type.

A super block watch gets all events on the filesystem, regardless of
the mount from which the mark was added, unless an ignore mask exists
on either the inode or the mount where the event was generated.

Only one of FAN_MARK_MOUNT and FAN_MARK_FILESYSTEM mark type flags
may be provided to fanotify_mark() or no mark type flag for inode mark.

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 15:16:11 +02:00
Amir Goldstein
60f7ed8c7c fsnotify: send path type events to group with super block marks
Send events to group if super block mark mask matches the event
and unless the same group has an ignore mask on the vfsmount or
the inode on which the event occurred.

Soon, fanotify backend is going to support super block marks and
fanotify backend only supports path type events.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 15:14:04 +02:00
Amir Goldstein
1e6cb72399 fsnotify: add super block object type
Add the infrastructure to attach a mark to a super_block struct
and detach all attached marks when super block is destroyed.

This is going to be used by fanotify backend to setup super block
marks.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 15:14:01 +02:00
Jann Horn
9da3f2b740 x86/fault: BUG() when uaccess helpers fault on kernel addresses
There have been multiple kernel vulnerabilities that permitted userspace to
pass completely unchecked pointers through to userspace accessors:

 - the waitid() bug - commit 96ca579a1e ("waitid(): Add missing
   access_ok() checks")
 - the sg/bsg read/write APIs
 - the infiniband read/write APIs

These don't happen all that often, but when they do happen, it is hard to
test for them properly; and it is probably also hard to discover them with
fuzzing. Even when an unmapped kernel address is supplied to such buggy
code, it just returns -EFAULT instead of doing a proper BUG() or at least
WARN().

Try to make such misbehaving code a bit more visible by refusing to do a
fixup in the pagefault handler code when a userspace accessor causes a #PF
on a kernel address and the current context isn't whitelisted.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: dvyukov@google.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20180828201421.157735-7-jannh@google.com
2018-09-03 15:12:09 +02:00
Amir Goldstein
9bdda4e9cf fsnotify: fix ignore mask logic in fsnotify()
Commit 92183a4289 ("fsnotify: fix ignore mask logic in
send_to_group()") acknoledges the use case of ignoring an event on
an inode mark, because of an ignore mask on a mount mark of the same
group (i.e. I want to get all events on this file, except for the events
that came from that mount).

This change depends on correctly merging the inode marks and mount marks
group lists, so that the mount mark ignore mask would be tested in
send_to_group(). Alas, the merging of the lists did not take into
account the case where event in question is not in the mask of any of
the mount marks.

To fix this, completely remove the tests for inode and mount event masks
from the lists merging code.

Fixes: 92183a4289 ("fsnotify: fix ignore mask logic in send_to_group")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 14:57:41 +02:00
Chengguang Xu
59fed3bf8a ext2: cache NULL when both default_acl and acl are NULL
default_acl and acl of newly created inode will be initiated as
ACL_NOT_CACHED in vfs function inode_init_always() and later will be
updated by calling xxx_init_acl() in specific filesystems.  However,
when default_acl and acl are NULL, then they keep the value of
ACL_NOT_CACHED. This patch changes the code to cache NULL for acl /
default_acl in this case to save unnecessary ACL lookup in future.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 11:05:03 +02:00
Colin Ian King
849fe89ce6 udf: remove unused variables group_start and nr_groups
Variables group_start and nr_groups are being assigned but are never used
hence they are redundant and can be removed.

Cleans up clang warning:
variable 'group_start' set but not used [-Wunused-but-set-variable]
variable 'nr_groups' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 11:04:49 +02:00
Amir Goldstein
b833a36603 ovl: add ovl_fadvise()
Implement stacked fadvise to fix syscalls readahead(2) and fadvise64(2)
on an overlayfs file.

Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: d1d04ef857 ("ovl: stack file ops")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-03 09:43:10 +02:00
Thomas Werschlein
395a2076b4 cifs: connect to servername instead of IP for IPC$ share
This patch is required allows access to a Microsoft fileserver failover
cluster behind a 1:1 NAT firewall.

The change also provides stronger context for authentication and share
connection (see MS-SMB2 3.3.5.7 and MS-SRVS 3.1.6.8) as noted by
Tom Talpey, and addresses comments about the buffer size for the UNC
made by Aurélien Aptel.

Signed-off-by: Thomas Werschlein <thomas.werschlein@geo.uzh.ch>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Tom Talpey <ttalpey@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
2018-09-02 23:21:42 -05:00
Steve French
f801568332 smb3: check for and properly advertise directory lease support
Although servers will typically ignore unsupported features,
we should advertise the support for directory leases (as
Windows e.g. does) in the negotiate protocol capabilities we
pass to the server, and should check for the server capability
(CAP_DIRECTORY_LEASING) before sending a lease request for an
open of a directory.  This will prevent us from accidentally
sending directory leases to SMB2.1 or SMB2 server for example.

Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-09-02 23:21:42 -05:00
Steve French
25f2573512 smb3: minor debugging clarifications in rfc1001 len processing
I ran into some cases where server was returning the wrong length
on frames but I couldn't easily match them to the command in the
network trace (or server logs) since I need the command and/or
multiplex id to find the offending SMB2/SMB3 command.  Add these
two fields to the log message. In the case of padding too much
it may not be a problem in all cases but might have correlated
to a network disconnect case in some problems we have been
looking at. In the case of frame too short is even more important.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-09-02 23:21:42 -05:00
Steve French
5e19697b56 SMB3: Backup intent flag missing for directory opens with backupuid mounts
When "backup intent" is requested on the mount (e.g. backupuid or
backupgid mount options), the corresponding flag needs to be set
on opens of directories (and files) but was missing in some
places causing access denied trying to enumerate and backup
servers.

Fixes kernel bugzilla #200953
https://bugzilla.kernel.org/show_bug.cgi?id=200953

Reported-and-tested-by: <whh@rubrik.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-09-02 23:21:42 -05:00
Jon Kuhn
c15e3f19a6 fs/cifs: don't translate SFM_SLASH (U+F026) to backslash
When a Mac client saves an item containing a backslash to a file server
the backslash is represented in the CIFS/SMB protocol as as U+F026.
Before this change, listing a directory containing an item with a
backslash in its name will return that item with the backslash
represented with a true backslash character (U+005C) because
convert_sfm_character mapped U+F026 to U+005C when interpretting the
CIFS/SMB protocol response.  However, attempting to open or stat the
path using a true backslash will result in an error because
convert_to_sfm_char does not map U+005C back to U+F026 causing the
CIFS/SMB request to be made with the backslash represented as U+005C.

This change simply prevents the U+F026 to U+005C conversion from
happenning.  This is analogous to how the code does not do any
translation of UNI_SLASH (U+F000).

Signed-off-by: Jon Kuhn <jkuhn@barracuda.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-09-02 23:21:42 -05:00
Linus Torvalds
501dacbc24 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fixes from Thomas Gleixner:
 "A small set of updates for core code:

   - Prevent tracing in functions which are called from trace patching
     via stop_machine() to prevent executing half patched function trace
     entries.

   - Remove old GCC workarounds

   - Remove pointless includes of notifier.h"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Remove workaround for unreachable warnings from old GCC
  notifier: Remove notifier header file wherever not used
  watchdog: Mark watchdog touch functions as notrace
2018-09-02 09:41:45 -07:00
Theodore Ts'o
4274f516d4 ext4: recalucate superblock checksum after updating free blocks/inodes
When mounting the superblock, ext4_fill_super() calculates the free
blocks and free inodes and stores them in the superblock.  It's not
strictly necessary, since we don't use them any more, but it's nice to
keep them roughly aligned to reality.

Since it's not critical for file system correctness, the code doesn't
call ext4_commit_super().  The problem is that it's in
ext4_commit_super() that we recalculate the superblock checksum.  So
if we're not going to call ext4_commit_super(), we need to call
ext4_superblock_csum_set() to make sure the superblock checksum is
consistent.

Most of the time, this doesn't matter, since we end up calling
ext4_commit_super() very soon thereafter, and definitely by the time
the file system is unmounted.  However, it doesn't work in this
sequence:

mke2fs -Fq -t ext4 /dev/vdc 128M
mount /dev/vdc /vdc
cp xfstests/git-versions /vdc
godown /vdc
umount /vdc
mount /dev/vdc
tune2fs -l /dev/vdc

With this commit, the "tune2fs -l" no longer fails.

Reported-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-09-01 14:42:14 -04:00
Theodore Ts'o
bcd8e91f98 ext4: avoid arithemetic overflow that can trigger a BUG
A maliciously crafted file system can cause an overflow when the
results of a 64-bit calculation is stored into a 32-bit length
parameter.

https://bugzilla.kernel.org/show_bug.cgi?id=200623

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org
2018-09-01 12:45:04 -04:00
Amir Goldstein
5b910bd615 ovl: fix GPF in swapfile_activate of file from overlayfs over xfs
Since overlayfs implements stacked file operations, the underlying
filesystems are not supposed to be exposed to the overlayfs file,
whose f_inode is an overlayfs inode.

Assigning an overlayfs file to swap_file results in an attempt of xfs
code to dereference an xfs_inode struct from an ovl_inode pointer:

 CPU: 0 PID: 2462 Comm: swapon Not tainted
 4.18.0-xfstests-12721-g33e17876ea4e #3402
 RIP: 0010:xfs_find_bdev_for_inode+0x23/0x2f
 Call Trace:
  xfs_iomap_swapfile_activate+0x1f/0x43
  __se_sys_swapon+0xb1a/0xee9

Fix this by not assigning the real inode mapping to f_mapping, which
will cause swapon() to return an error (-EINVAL). Although it makes
sense not to allow setting swpafile on an overlayfs file, some users
may depend on it, so we may need to fix this up in the future.

Keeping f_mapping pointing to overlay inode mapping will cause O_DIRECT
open to fail. Fix this by installing ovl_aops with noop_direct_IO in
overlay inode mapping.

Keeping f_mapping pointing to overlay inode mapping will cause other
a_ops related operations to fail (e.g. readahead()). Those will be
fixed by follow up patches.

Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: f7c72396d0 ("ovl: add O_DIRECT support")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-08-30 17:08:35 +02:00
Amir Goldstein
80d3481081 ovl: respect FIEMAP_FLAG_SYNC flag
Stacked overlayfs fiemap operation broke xfstests that test delayed
allocation (with "_test_generic_punch -d"), because ovl_fiemap()
failed to write dirty pages when requested.

Fixes: 9e142c4102 ("ovl: add ovl_fiemap()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-08-30 17:08:35 +02:00
Mukesh Ojha
13ba17bee1 notifier: Remove notifier header file wherever not used
The conversion of the hotplug notifiers to a state machine left the
notifier.h includes around in some places. Remove them.

Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1535114033-4605-1-git-send-email-mojha@codeaurora.org
2018-08-30 12:56:40 +02:00
Linus Torvalds
f3f106dac0 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAluGfbAACgkQnJ2qBz9k
 QNmkVQgAjv/tCHStwoQ4Hhr6q5wLU9apAW5WC7Or2MyNfqoJsKgyujpikwFMa0xY
 wkVqlITYavvUKl/nRlQ6hpZtAY7hi1Y7GvIYs2ci6QO4YAUuBoH7qPLKqyZYzXx+
 mBaC68885nMMDqIHvsCcLurwTiDer6XQXXPDmpMoO9g4kyVuhm2e0/M7CPaA4Ue9
 WEfBPBROSNdRH7Wtww/MUvtfd+9mezdlpUQZVVO5cdhAVQ8pzW2g2piYtwIpaY7E
 DiJ1zcgaqCnni+b69x4wz8TwB0q8wZJjrPfJgf4X7mIGBZrw80yNRlevfe5SxwdJ
 mkMDvjaD0TEItvpjgTZReXaAjFOWTA==
 =E61M
 -----END PGP SIGNATURE-----

Merge tag 'for_v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull misc fs fixes from Jan Kara:

 - make UDF to properly mount media created by Win7

 - make isofs to properly refuse devices with large physical block size

 - fix a Spectre gadget in quotactl(2)

 - fix a warning in fsnotify code hit by syzkaller

* tag 'for_v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Fix mounting of Win7 created UDF filesystems
  udf: Remove dead code from udf_find_fileset()
  fs/quota: Fix spectre gadget in do_quotactl
  fs/quota: Replace XQM_MAXQUOTAS usage with MAXQUOTAS
  isofs: reject hardware sector size > 2048 bytes
  fsnotify: fix false positive warning on inode delete
2018-08-29 14:56:45 -07:00
Arnd Bergmann
4faea239e5 y2038: utimes: Rework #ifdef guards for compat syscalls
After changing over to 64-bit time_t syscalls, many architectures will
want compat_sys_utimensat() but not respective handlers for utime(),
utimes() and futimesat(). This adds a new __ARCH_WANT_SYS_UTIME32 to
complement __ARCH_WANT_SYS_UTIME. For now, all 64-bit architectures that
support CONFIG_COMPAT set it, but future 64-bit architectures will not
(tile would not have needed it either, but got removed).

As older 32-bit architectures get converted to using CONFIG_64BIT_TIME,
they will have to use __ARCH_WANT_SYS_UTIME32 instead of
__ARCH_WANT_SYS_UTIME. Architectures using the generic syscall ABI don't
need either of them as they never had a utime syscall.

Since the compat_utimbuf structure is now required outside of
CONFIG_COMPAT, I'm moving it into compat_time.h.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
changed from last version:
- renamed __ARCH_WANT_COMPAT_SYS_UTIME to __ARCH_WANT_SYS_UTIME32
2018-08-29 15:42:23 +02:00
Arnd Bergmann
185cfaf764 y2038: Compile utimes()/futimesat() conditionally
There are four generations of utimes() syscalls: utime(), utimes(),
futimesat() and utimensat(), each one being a superset of the previous
one. For y2038 support, we have to add another one, which is the same
as the existing utimensat() but always passes 64-bit times_t based
timespec values.

There are currently 10 architectures that only use utimensat(), two
that use utimes(), futimesat() and utimensat() but not utime(), and 11
architectures that have all four, and those define __ARCH_WANT_SYS_UTIME
in order to get a sys_utime implementation. Since all the new
architectures only want utimensat(), moving all the legacy entry points
into a common __ARCH_WANT_SYS_UTIME guard simplifies the logic. Only alpha
and ia64 grow a tiny bit as they now also get an unused sys_utime(),
but it didn't seem worth the extra complexity of adding yet another
ifdef for those.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-29 15:42:23 +02:00
Arnd Bergmann
a4f7a30046 y2038: Change sys_utimensat() to use __kernel_timespec
When 32-bit architectures get changed to support 64-bit time_t,
utimensat() needs to use the new __kernel_timespec structure as its
argument.

The older utime(), utimes() and futimesat() system calls don't need a
corresponding change as they are no longer used on C libraries that have
64-bit time support.

As we do for the other syscalls that have timespec arguments, we reuse
the 'compat' syscall entry points to implement the traditional four
interfaces, and only leave the new utimensat() as a native handler,
so that the same code gets used on both 32-bit and 64-bit kernels
on each syscall.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-29 15:42:22 +02:00
Arnd Bergmann
caf6f9c8a3 asm-generic: Remove unneeded __ARCH_WANT_SYS_LLSEEK macro
The sys_llseek sytem call is needed on all 32-bit architectures and
none of the 64-bit ones, so we can remove the __ARCH_WANT_SYS_LLSEEK guard
and simplify the include/asm-generic/unistd.h header further.

Since 32-bit tasks can run either natively or in compat mode on 64-bit
architectures, we have to check for both !CONFIG_64BIT and CONFIG_COMPAT.

There are a few 64-bit architectures that also reference sys_llseek
in their 64-bit ABI (e.g. sparc), but I verified that those all
select CONFIG_COMPAT, so the #if check is still correct here. It's
a bit odd to include it in the syscall table though, as it's the
same as sys_lseek() on 64-bit, but with strange calling conventions.

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-29 15:42:21 +02:00
Arnd Bergmann
82b355d161 y2038: Remove newstat family from default syscall set
We have four generations of stat() syscalls:
- the oldstat syscalls that are only used on the older architectures
- the newstat family that is used on all 64-bit architectures but
  lacked support for large files on 32-bit architectures.
- the stat64 family that is used mostly on 32-bit architectures to
  replace newstat
- statx() to replace all of the above, adding 64-bit timestamps among
  other things.

We already compile stat64 only on those architectures that need it,
but newstat is always built, including on those that don't reference
it. This adds a new __ARCH_WANT_NEW_STAT symbol along the lines of
__ARCH_WANT_OLD_STAT and __ARCH_WANT_STAT64 to control compilation of
newstat. All architectures that need it use an explict define, the
others now get a little bit smaller, and future architecture (including
64-bit targets) won't ever see it.

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-29 15:42:20 +02:00
Dominique Martinet
81c99089bc v9fs_dir_readdir: fix double-free on p9stat_read error
p9stat_read will call p9stat_free on error, we should only free the
struct content on success.

There also is no need to "p9stat_init" st as the read function will
zero the whole struct for us anyway, so clean up the code a bit while
we are here.

Link: http://lkml.kernel.org/r/1535410108-20650-1-git-send-email-asmadeus@codewreck.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Reported-by: syzbot+d4252148d198410b864f@syzkaller.appspotmail.com
2018-08-29 13:39:57 +09:00
Bob Peterson
4f36cb36c9 gfs2: Don't set GFS2_RDF_UPTODATE when the lvb is updated
The GFS2_RDF_UPTODATE flag in the rgrp is used to determine when
a rgrp buffer is valid. It's cleared when the glock is invalidated,
signifying that the buffer data is now invalid. But before this
patch, function update_rgrp_lvb was setting the flag when it
determined it had a valid lvb. But that's an invalid assumption:
just because you have a valid lvb doesn't mean you have valid
buffers. After all, another node may have made the lvb valid,
and this node just fetched it from the glock via dlm.

Consider this scenario:
1. The file system is mounted with RGRPLVB option.
2. In gfs2_inplace_reserve it locks the rgrp glock EX, but thanks
   to GL_SKIP, it skips the gfs2_rgrp_bh_get.
3. Since loops == 0 and the allocation target (ap->target) is
   bigger than the largest known chunk of blocks in the rgrp
   (rs->rs_rbm.rgd->rd_extfail_pt) it skips that rgrp and bypasses
   the call to gfs2_rgrp_bh_get there as well.
4. update_rgrp_lvb sees the lvb MAGIC number is valid, so bypasses
   gfs2_rgrp_bh_get, but it still sets sets GFS2_RDF_UPTODATE due
   to this invalid assumption.
5. The next time update_rgrp_lvb is called, it sees the bit is set
   and just returns 0, assuming both the lvb and rgrp are both
   uptodate. But since this is a smaller allocation, or space has
   been freed by another node, thus adjusting the lvb values,
   it decides to use the rgrp for allocations, with invalid rd_free
   due to the fact it was never updated.

This patch changes update_rgrp_lvb so it doesn't set the UPTODATE
flag anymore. That way, it has no choice but to fetch the latest
values.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-08-28 12:51:08 -05:00
Bob Peterson
72244b6bc7 gfs2: improve debug information when lvb mismatches are found
Before this patch, gfs2_rgrp_bh_get would check for lvb mismatches,
but it wouldn't tell you what was actually wrong. This patch adds
more information to help us debug it. It also makes rgrp consistency
checks dump any bad rgrps, and the rgrp dump code dump any lvbs
as well as the rgrp itself.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2018-08-28 12:51:08 -05:00
Theodore Ts'o
4d982e25d0 ext4: avoid divide by zero fault when deleting corrupted inline directories
A specially crafted file system can trick empty_inline_dir() into
reading past the last valid entry in a inline directory, and then run
into the end of xattr marker. This will trigger a divide by zero
fault.  Fix this by using the size of the inline directory instead of
dir->i_size.

Also clean up error reporting in __ext4_check_dir_entry so that the
message is clearer and more understandable --- and avoids the division
by zero trap if the size passed in is zero.  (I'm not sure why we
coded it that way in the first place; printing offset % size is
actually more confusing and less useful.)

https://bugzilla.kernel.org/show_bug.cgi?id=200933

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org
2018-08-27 09:22:45 -04:00
Arnd Bergmann
9afc5eee65 y2038: globally rename compat_time to old_time32
Christoph Hellwig suggested a slightly different path for handling
backwards compatibility with the 32-bit time_t based system calls:

Rather than simply reusing the compat_sys_* entry points on 32-bit
architectures unchanged, we get rid of those entry points and the
compat_time types by renaming them to something that makes more sense
on 32-bit architectures (which don't have a compat mode otherwise),
and then share the entry points under the new name with the 64-bit
architectures that use them for implementing the compatibility.

The following types and interfaces are renamed here, and moved
from linux/compat_time.h to linux/time32.h:

old				new
---				---
compat_time_t			old_time32_t
struct compat_timeval		struct old_timeval32
struct compat_timespec		struct old_timespec32
struct compat_itimerspec	struct old_itimerspec32
ns_to_compat_timeval()		ns_to_old_timeval32()
get_compat_itimerspec64()	get_old_itimerspec32()
put_compat_itimerspec64()	put_old_itimerspec32()
compat_get_timespec64()		get_old_timespec32()
compat_put_timespec64()		put_old_timespec32()

As we already have aliases in place, this patch addresses only the
instances that are relevant to the system call interface in particular,
not those that occur in device drivers and other modules. Those
will get handled separately, while providing the 64-bit version
of the respective interfaces.

I'm not renaming the timex, rusage and itimerval structures, as we are
still debating what the new interface will look like, and whether we
will need a replacement at all.

This also doesn't change the names of the syscall entry points, which can
be done more easily when we actually switch over the 32-bit architectures
to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-27 14:48:48 +02:00
Theodore Ts'o
b50282f324 ext4: check to make sure the rename(2)'s destination is not freed
If the destination of the rename(2) system call exists, the inode's
link count (i_nlinks) must be non-zero.  If it is, the inode can end
up on the orphan list prematurely, leading to all sorts of hilarity,
including a use-after-free.

https://bugzilla.kernel.org/show_bug.cgi?id=200931

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org
2018-08-27 01:47:09 -04:00
Theodore Ts'o
072ebb3bff ext4: add nonstring annotations to ext4.h
This suppresses some false positives in gcc 8's -Wstringop-truncation

Suggested by Miguel Ojeda (hopefully the __nonstring definition will
eventually get accepted in the compiler-gcc.h header file).

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
2018-08-27 01:15:11 -04:00
Linus Torvalds
aba16dc5cf Merge branch 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax
Pull IDA updates from Matthew Wilcox:
 "A better IDA API:

      id = ida_alloc(ida, GFP_xxx);
      ida_free(ida, id);

  rather than the cumbersome ida_simple_get(), ida_simple_remove().

  The new IDA API is similar to ida_simple_get() but better named.  The
  internal restructuring of the IDA code removes the bitmap
  preallocation nonsense.

  I hope the net -200 lines of code is convincing"

* 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits)
  ida: Change ida_get_new_above to return the id
  ida: Remove old API
  test_ida: check_ida_destroy and check_ida_alloc
  test_ida: Convert check_ida_conv to new API
  test_ida: Move ida_check_max
  test_ida: Move ida_check_leaf
  idr-test: Convert ida_check_nomem to new API
  ida: Start new test_ida module
  target/iscsi: Allocate session IDs from an IDA
  iscsi target: fix session creation failure handling
  drm/vmwgfx: Convert to new IDA API
  dmaengine: Convert to new IDA API
  ppc: Convert vas ID allocation to new IDA API
  media: Convert entity ID allocation to new IDA API
  ppc: Convert mmu context allocation to new IDA API
  Convert net_namespace to new IDA API
  cb710: Convert to new IDA API
  rsxx: Convert to new IDA API
  osd: Convert to new IDA API
  sd: Convert to new IDA API
  ...
2018-08-26 11:48:42 -07:00
Linus Torvalds
d207ea8e74 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Thomas Gleixner:
 "Kernel:
   - Improve kallsyms coverage
   - Add x86 entry trampolines to kcore
   - Fix ARM SPE handling
   - Correct PPC event post processing

  Tools:
   - Make the build system more robust
   - Small fixes and enhancements all over the place
   - Update kernel ABI header copies
   - Preparatory work for converting libtraceevnt to a shared library
   - License cleanups"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  tools arch: Update arch/x86/lib/memcpy_64.S copy used in 'perf bench mem memcpy'
  tools arch x86: Update tools's copy of cpufeatures.h
  perf python: Fix pyrf_evlist__read_on_cpu() interface
  perf mmap: Store real cpu number in 'struct perf_mmap'
  perf tools: Remove ext from struct kmod_path
  perf tools: Add gzip_is_compressed function
  perf tools: Add lzma_is_compressed function
  perf tools: Add is_compressed callback to compressions array
  perf tools: Move the temp file processing into decompress_kmodule
  perf tools: Use compression id in decompress_kmodule()
  perf tools: Store compression id into struct dso
  perf tools: Add compression id into 'struct kmod_path'
  perf tools: Make is_supported_compression() static
  perf tools: Make decompress_to_file() function static
  perf tools: Get rid of dso__needs_decompress() call in __open_dso()
  perf tools: Get rid of dso__needs_decompress() call in symbol__disassemble()
  perf tools: Get rid of dso__needs_decompress() call in read_object_code()
  tools lib traceevent: Change to SPDX License format
  perf llvm: Allow passing options to llc in addition to clang
  perf parser: Improve error message for PMU address filters
  ...
2018-08-26 11:25:21 -07:00
Linus Torvalds
2923b27e54 libnvdimm-for-4.19_dax-memory-failure
* memory_failure() gets confused by dev_pagemap backed mappings. The
   recovery code has specific enabling for several possible page states
   that needs new enabling to handle poison in dax mappings. Teach
   memory_failure() about ZONE_DEVICE pages.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9ui8ACgkQYGjFFmlT
 OEpNRw//XGj9s7sezfJFeol4psJlRUd935yii/gmJRgi/yPf2VxxQG9qyM6SMBUc
 75jASfOL6FSsfxHz0kplyWzMDNdrTkNNAD+9rv80FmY7GqWgcas9DaJX7jZ994vI
 5SRO7pfvNZcXlo7IhqZippDw3yxkIU9Ufi0YQKaEUm7GFieptvCZ0p9x3VYfdvwM
 BExrxQe0X1XUF4xErp5P78+WUbKxP47DLcucRDig8Q7dmHELUdyNzo3E1SVoc7m+
 3CmvyTj6XuFQgOZw7ZKun1BJYfx/eD5ZlRJLZbx6wJHRtTXv/Uea8mZ8mJ31ykN9
 F7QVd0Pmlyxys8lcXfK+nvpL09QBE0/PhwWKjmZBoU8AdgP/ZvBXLDL/D6YuMTg6
 T4wwtPNJorfV4lVD06OliFkVI4qbKbmNsfRq43Ns7PCaLueu4U/eMaSwSH99UMaZ
 MGbO140XW2RZsHiU9yTRUmZq73AplePEjxtzR8oHmnjo45nPDPy8mucWPlkT9kXA
 oUFMhgiviK7dOo19H4eaPJGqLmHM93+x5tpYxGqTr0dUOXUadKWxMsTnkID+8Yi7
 /kzQWCFvySz3VhiEHGuWkW08GZT6aCcpkREDomnRh4MEnETlZI8bblcuXYOCLs6c
 nNf1SIMtLdlsl7U1fEX89PNeQQ2y237vEDhFQZftaalPeu/JJV0=
 =Ftop
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm memory-failure update from Dave Jiang:
 "As it stands, memory_failure() gets thoroughly confused by dev_pagemap
  backed mappings. The recovery code has specific enabling for several
  possible page states and needs new enabling to handle poison in dax
  mappings.

  In order to support reliable reverse mapping of user space addresses:

   1/ Add new locking in the memory_failure() rmap path to prevent races
      that would typically be handled by the page lock.

   2/ Since dev_pagemap pages are hidden from the page allocator and the
      "compound page" accounting machinery, add a mechanism to determine
      the size of the mapping that encompasses a given poisoned pfn.

   3/ Given pmem errors can be repaired, change the speculatively
      accessed poison protection, mce_unmap_kpfn(), to be reversible and
      otherwise allow ongoing access from the kernel.

  A side effect of this enabling is that MADV_HWPOISON becomes usable
  for dax mappings, however the primary motivation is to allow the
  system to survive userspace consumption of hardware-poison via dax.
  Specifically the current behavior is:

     mce: Uncorrected hardware memory error in user-access at af34214200
     {1}[Hardware Error]: It has been corrected by h/w and requires no further action
     mce: [Hardware Error]: Machine check events logged
     {1}[Hardware Error]: event severity: corrected
     Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
     [..]
     Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
     mce: Memory error not recovered
     <reboot>

  ...and with these changes:

     Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
     Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
     Memory failure: 0x20cb00: recovery action for dax page: Recovered

  Given all the cross dependencies I propose taking this through
  nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
  folks"

* tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm, pmem: Restore page attributes when clearing errors
  x86/memory_failure: Introduce {set, clear}_mce_nospec()
  x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
  mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  filesystem-dax: Introduce dax_lock_mapping_entry()
  mm, memory_failure: Collect mapping size in collect_procs()
  mm, madvise_inject_error: Let memory_failure() optionally take a page reference
  mm, dev_pagemap: Do not clear ->mapping on final put
  mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
  filesystem-dax: Set page->index
  device-dax: Set page->index
  device-dax: Enable page_mapping()
  device-dax: Convert to vmf_insert_mixed and vm_fault_t
2018-08-25 18:43:59 -07:00
Linus Torvalds
828bf6e904 libnvdimm-for-4.19_misc
Collection of misc libnvdimm patches for 4.19 submission
 * Adding support to read locked nvdimm capacity.
 
 * Change test code to make DSM failure code injection an override.
 
 * Add support for calculate maximum contiguous area for namespace.
 
 * Add support for queueing a short ARS when there is on going ARS for
   nvdimm.
 
 * Allow NULL to be passed in to ->direct_access() for kaddr and
   pfn params.
 
 * Improve smart injection support for nvdimm emulation testing.
 
 * Fix test code that supports for emulating controller temperature.
 
 * Fix hang on error before devm_memremap_pages()
 
 * Fix a bug that causes user memory corruption when data returned
   to user for ars_status.
 
 * Maintainer updates for Ross Zwisler emails and adding Jan Kara to fsdax.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9uUIACgkQYGjFFmlT
 OErL+xAAgWHSGs8w98VtYA9kLDeTYEXutq93wJZQoBu/FMAXuuU3hYmQYnOQU87h
 KKYmfDkeusaih1R3IX7mzlegnnzSfQ6MraNSV76M43noJHbRTunknCPZH6ebp4fo
 b/eljvWlZF/idM+7YcsnoFMnHSRj2pjJGXmKQDlKedHD+KMxpmk6zEl2s5Y0zvPU
 4U7UQLtk3D5IIpLNsLEmxge32BfvNf5IzoSO1aZp7Eqk0+U5Tq3Sq/Tjmd+J0RKt
 6WH5yA6NqXQgBh+ayHsYU8YX62RqnbKQZXqVxD35OH64zJEUefnP1fpt9pmaZ9eL
 43BPMkpM09eLAikO2ET3/3c2k6h3h9ttz1sH8t/hiroCtfmxs3XgskY06hxpKjZV
 EbN+BUmut5Mr+zzYitRr3dbK2aHPVU9IbU7jUw/1Tz23rq3kU5iI7SHHv1b/eWup
 1Cr77Z1M6HB8VBhjnJ+R607sbRrnKQUOV7fGzAaIskyUOTWsEvIgTh/6MRiaj9MD
 5HXIgc/0y9E+G93s7MsUWwzpB7J6E7EGoybST2SKPtqwtDMPsBNeWRjyA9quBCoN
 u1s+e+lWHYutqRW0eisDTTlq3nJwPijSx1nnzhJxw9s1EkCXz3f7KRZhyH1C79Co
 7wjiuvKQ79e/HI/oXvGmTnv5lbLEpWYyJ3U3KIFfoUqugeyhr0k=
 =5p2n
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dave Jiang:
 "Collection of misc libnvdimm patches for 4.19 submission:

   - Adding support to read locked nvdimm capacity.

   - Change test code to make DSM failure code injection an override.

   - Add support for calculate maximum contiguous area for namespace.

   - Add support for queueing a short ARS when there is on going ARS for
     nvdimm.

   - Allow NULL to be passed in to ->direct_access() for kaddr and pfn
     params.

   - Improve smart injection support for nvdimm emulation testing.

   - Fix test code that supports for emulating controller temperature.

   - Fix hang on error before devm_memremap_pages()

   - Fix a bug that causes user memory corruption when data returned to
     user for ars_status.

   - Maintainer updates for Ross Zwisler emails and adding Jan Kara to
     fsdax"

* tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm: fix ars_status output length calculation
  device-dax: avoid hang on error before devm_memremap_pages()
  tools/testing/nvdimm: improve emulation of smart injection
  filesystem-dax: Do not request kaddr and pfn when not required
  md/dm-writecache: Don't request pointer dummy_addr when not required
  dax/super: Do not request a pointer kaddr when not required
  tools/testing/nvdimm: kaddr and pfn can be NULL to ->direct_access()
  s390, dcssblk: kaddr and pfn can be NULL to ->direct_access()
  libnvdimm, pmem: kaddr and pfn can be NULL to ->direct_access()
  acpi/nfit: queue issuing of ars when an uc error notification comes in
  libnvdimm: Export max available extent
  libnvdimm: Use max contiguous area for namespace size
  MAINTAINERS: Add Jan Kara for filesystem DAX
  MAINTAINERS: update Ross Zwisler's email address
  tools/testing/nvdimm: Fix support for emulating controller temperature
  tools/testing/nvdimm: Make DSM failure code injection an override
  acpi, nfit: Prefer _DSM over _LSR for namespace label reads
  libnvdimm: Introduce locked DIMM capacity support
2018-08-25 18:13:10 -07:00
Linus Torvalds
db84abf5f8 This pull request contains a single fix for UBIFS:
- Remove an empty file from UBIFS source
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAluBE84WHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wc73EACHsV8cp3Atkg18yZYQq3t8uOZG
 lZApUWTDr2SpWW88/OJ+RAFKGflKQqSiS1y1HYd2s8is4yYMA/XBto5pWvGqg03P
 GsFGzjMqRztZtNwg7n7JUD0Sq9WLoB1BimuMAG1b8vjTH2uuCxJZVyVi7ZF5W5Oc
 065ebVblFhziI7jddRh/Gpwa2dvx3SlKZa0VImiJd6Mw9AnjO4grU+rOofswXTjX
 AKVjeRvNp0DuxJkxaqO9mH7mug87Owi4co4DBLCJu7pPEq2kvFRHQQtn6QKCdtRk
 gyYBD/ZGkhNYz3wPoQxlHM0bGVAToKtjcT1/T61lrlXKjGJu8S8IHC/raEyCB8yb
 Cz3XxMCH7LXyS3eaov4sZAiogENsoP7i20E2AKwlGil65gx4DVR814MTAi5m0AwN
 Kpv7GHRG9oMEar378Rrv8xptRJhj6nwCxAYuBwmPFlK7d95qMQ9hLjsvYp3TLuzn
 7ihUzT8m/GZG1qrlkkk24Bjm834dCVRKPBJJ4beh3YG8/mj/h6OlQGs0NPRIF6O1
 JknbQqXOz6UZ1UsvgtDK7p8GuHztKA4K+3Qv4BWwIRFisX+gTWKB8IveGPpouUaa
 K8NMOkbw8RsatxzM4DNZBPl2YNjvHuuTGudNqMn3pytsi4Y8y5YKdkpFVgZH1V25
 QmQZbhicHSlfwWR0FQ==
 =QoL7
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.19-rc1-fix' of git://git.infradead.org/linux-ubifs

Pull UBIFS fix from Richard Weinberger:
 "Remove an empty file from UBIFS source"

* tag 'upstream-4.19-rc1-fix' of git://git.infradead.org/linux-ubifs:
  ubifs: Remove empty file.h
2018-08-25 13:27:35 -07:00
Linus Torvalds
04faac10fc three small SMB3 fixes, one for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAluAdzkACgkQiiy9cAdy
 T1GjlAv/SOsNm2sj9Bcq/Z/CpPoFRFoJBFsLeReF78QdbF/+eUFuQJvq0aIK8BmV
 0NRmvlQk9oCGQfWN0pLWeRn7a7xUqMQ7HYKSS1fzW1O+kJGlyA1cFjCbiZe4py6m
 AFSlpraPTtL7isX/ZMOyZ1D7YKMj4Fq5wndcHPSnMQXI2GlaUAip5k/zamXUbMmo
 dFaGDkXc67un6Y/04v18LsKJtOHgbVIAES2OgO0sjqiwp0cnGATsZl/OGzvsTo31
 brstBum/0Ig2Mpr+5IXa4QFoP+naNXDyhv+D69huETwsMSImnjGL6L/GAgUUO5/t
 sDN6bpQdM9wqpckNuFcV9hKbBHQ1nZhzr0gUjRnmN8CJFHOgI2Ndo4J4N9slnWo9
 wEXJV7RN5/VQh3Ozb7m+DpAYo4K4r3Oexxa3MG7+IFY8t89O39PBc0dEQ6O8ztaJ
 NCcubQVpSFxC7fwVjY56IHHofyznS7JLKMAcCe3+ssOHwJt0PZr/iqdBNHC9W4ZE
 7fcNZtct
 =jMLB
 -----END PGP SIGNATURE-----

Merge tag '4.19-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Three small SMB3 fixes, one for stable"

* tag '4.19-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: update internal module version number for cifs.ko to 2.12
  cifs: check kmalloc before use
  cifs: check if SMB2 PDU size has been padded and suppress the warning
  cifs: create a define for how many iovs we need for an SMB2_open()
2018-08-25 13:17:53 -07:00
Colin Ian King
e0fcfe1f1a hpfs: remove unnecessary checks on the value of r when assigning error code
At the point where r is being checked for different values, r is always
going to be equal to 2 as the previous if statements jump to end or end1
if r is not 2.  Hence the assignment to err can be simplified to just
err an assignment without any checks on the value or r.

Detected by CoverityScan, CID#1226737 ("Logically dead code")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-25 12:42:33 -07:00
Linus Torvalds
4def196360 Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace fixes from Eric Biederman:
 "This is a set of four fairly obvious bug fixes:

   - a switch from d_find_alias to d_find_any_alias because the xattr
     code perversely takes a dentry

   - two mutex vs copy_to_user fixes from Jann Horn

   - a fix to use a sanitized size not the size userspace passed in from
     Christian Brauner"

* 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  getxattr: use correct xattr length
  sys: don't hold uts_sem while accessing userspace memory
  userns: move user access out of the mutex
  cap_inode_getsecurity: use d_find_any_alias() instead of d_find_alias()
2018-08-24 09:25:39 -07:00
Misono Tomohiro
b6fdfbff07 btrfs: Fix suspicious RCU usage warning in btrfs_debug_in_rcu
Commit 672d599041 ("btrfs: Use wrapper macro for rcu string to remove
duplicate code") replaces some open coded RCU string handling with macro.

It turns out that btrfs_debug_in_rcu() is used for the first time and
the macro lacks lock/unlock of RCU string for non-debug case (i.e. when
the message is not printed), leading to suspicious RCU usage warning
when CONFIG_PROVE_RCU is on.

Fix this by adding a wrapper to call lock/unlock for the non-debug case
too.

Fixes: 672d599041 ("btrfs: Use wrapper macro for rcu string to remove duplicate code")
Reported-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-24 14:09:43 +02:00
Richard Weinberger
6e5461d774 ubifs: Remove empty file.h
This empty file sneaked into the tree by mistake.
Remove it.

Fixes: 6eb61d587f ("ubifs: Pass struct ubifs_info to ubifs_assert()")
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-24 13:50:07 +02:00
Jan Kara
ee4af50ca9 udf: Fix mounting of Win7 created UDF filesystems
Win7 is creating UDF filesystems with single partition with number 8192.
Current partition descriptor scanning code does not handle this well as
it incorrectly assumes that partition numbers will form mostly contiguous
space of small numbers. This results in unmountable media due to errors
like:

UDF-fs: error (device dm-1): udf_read_tagged: tag version 0x0000 != 0x0002 || 0x0003, block 0
UDF-fs: warning (device dm-1): udf_fill_super: No fileset found

Fix the problem by handling partition descriptors in a way that sparse
partition numbering does not matter.

Reported-and-tested-by: jean-luc malet <jeanluc.malet@gmail.com>
CC: stable@vger.kernel.org
Fixes: 7b78fd02fb
Signed-off-by: Jan Kara <jack@suse.cz>
2018-08-24 11:13:32 +02:00
Jan Kara
82c82ab658 udf: Remove dead code from udf_find_fileset()
Remove dead code and slightly simplify code in udf_find_fileset().

Signed-off-by: Jan Kara <jack@suse.cz>
2018-08-24 11:13:32 +02:00
Linus Torvalds
33e17876ea Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

 - the rest of MM

 - various misc fixes and tweaks

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
  mm: Change return type int to vm_fault_t for fault handlers
  lib/fonts: convert comments to utf-8
  s390: ebcdic: convert comments to UTF-8
  treewide: convert ISO_8859-1 text comments to utf-8
  drivers/gpu/drm/gma500/: change return type to vm_fault_t
  docs/core-api: mm-api: add section about GFP flags
  docs/mm: make GFP flags descriptions usable as kernel-doc
  docs/core-api: split memory management API to a separate file
  docs/core-api: move *{str,mem}dup* to "String Manipulation"
  docs/core-api: kill trailing whitespace in kernel-api.rst
  mm/util: add kernel-doc for kvfree
  mm/util: make strndup_user description a kernel-doc comment
  fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
  treewide: correct "differenciate" and "instanciate" typos
  fs/afs: use new return type vm_fault_t
  drivers/hwtracing/intel_th/msu.c: change return type to vm_fault_t
  mm: soft-offline: close the race against page allocation
  mm: fix race on soft-offlining free huge pages
  namei: allow restricted O_CREAT of FIFOs and regular files
  hfs: prevent crash on exit from failed search
  ...
2018-08-23 19:20:12 -07:00
Souptick Joarder
2b74030354 mm: Change return type int to vm_fault_t for fault handlers
Use new return type vm_fault_t for fault handler.  For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno.  Once all instances are converted, vm_fault_t will become a
distinct type.

Ref-> commit 1c8f422059 ("mm: change return type to vm_fault_t")

The aim is to change the return type of finish_fault() and
handle_mm_fault() to vm_fault_t type.  As part of that clean up return
type of all other recursively called functions have been changed to
vm_fault_t type.

The places from where handle_mm_fault() is getting invoked will be
change to vm_fault_t type but in a separate patch.

vmf_error() is the newly introduce inline function in 4.17-rc6.

[akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:44 -07:00
Arnd Bergmann
a2036a1ef2 fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
Without CONFIG_MMU, we get a build warning:

  fs/proc/vmcore.c:228:12: error: 'vmcoredd_mmap_dumps' defined but not used [-Werror=unused-function]
   static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,

The function is only referenced from an #ifdef'ed caller, so
this uses the same #ifdef around it.

Link: http://lkml.kernel.org/r/20180525213526.2117790-1-arnd@arndb.de
Fixes: 7efe48df8a ("vmcore: append device dumps to vmcore as elf notes")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Ganesh Goudar <ganeshgr@chelsio.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:43 -07:00
Souptick Joarder
0722f18620 fs/afs: use new return type vm_fault_t
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct.  For now, this is just documenting that the
function returns a VM_FAULT value rather than an errno.  Once all
instances are converted, vm_fault_t will become a distinct type.

See 1c8f422059 ("mm: change return type to vm_fault_t") for reference.

Link: http://lkml.kernel.org/r/20180702152017.GA3780@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:43 -07:00
Salvatore Mesoraca
30aba6656f namei: allow restricted O_CREAT of FIFOs and regular files
Disallows open of FIFOs or regular files not owned by the user in world
writable sticky directories, unless the owner is the same as that of the
directory or the file is opened without the O_CREAT flag.  The purpose
is to make data spoofing attacks harder.  This protection can be turned
on and off separately for FIFOs and regular files via sysctl, just like
the symlinks/hardlinks protection.  This patch is based on Openwall's
"HARDEN_FIFO" feature by Solar Designer.

This is a brief list of old vulnerabilities that could have been prevented
by this feature, some of them even allow for privilege escalation:

CVE-2000-1134
CVE-2007-3852
CVE-2008-0525
CVE-2009-0416
CVE-2011-4834
CVE-2015-1838
CVE-2015-7442
CVE-2016-7489

This list is not meant to be complete.  It's difficult to track down all
vulnerabilities of this kind because they were often reported without any
mention of this particular attack vector.  In fact, before
hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
vehicle to exploit them.

[s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
  Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
  Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
[keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
[keescook@chromium.org: adjust commit subjet]
Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Solar Designer <solar@openwall.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:43 -07:00
Ernesto A. Fernández
dc2572791d hfs: prevent crash on exit from failed search
hfs_find_exit() expects fd->bnode to be NULL after a search has failed.
hfs_brec_insert() may instead set it to an error-valued pointer.  Fix
this to prevent a crash.

Link: http://lkml.kernel.org/r/53d9749a029c41b4016c495fc5838c9dba3afc52.1530294815.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:42 -07:00
Ernesto A. Fernandez
aba93a92f4 hfsplus: prevent crash on exit from failed search
hfs_find_exit() expects fd->bnode to be NULL after a search has failed.
hfs_brec_insert() may instead set it to an error-valued pointer.  Fix
this to prevent a crash.

Link: http://lkml.kernel.org/r/803590a35221fbf411b2c141419aea3233a6e990.1530294813.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:42 -07:00
Ernesto A. Fernández
a7ec7a4193 hfsplus: fix NULL dereference in hfsplus_lookup()
An HFS+ filesystem can be mounted read-only without having a metadata
directory, which is needed to support hardlinks.  But if the catalog
data is corrupted, a directory lookup may still find dentries claiming
to be hardlinks.

hfsplus_lookup() does check that ->hidden_dir is not NULL in such a
situation, but mistakenly does so after dereferencing it for the first
time.  Reorder this check to prevent a crash.

This happens when looking up corrupted catalog data (dentry) on a
filesystem with no metadata directory (this could only ever happen on a
read-only mount).  Wen Xu sent the replication steps in detail to the
fsdevel list: https://bugzilla.kernel.org/show_bug.cgi?id=200297

Link: http://lkml.kernel.org/r/20180712215344.q44dyrhymm4ajkao@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:42 -07:00
Linus Torvalds
53a01c9a5f NFS client updates for Linux 4.19
Stable bufixes:
 - v3.17+: Fix an off-by-one in bl_map_stripe()
 - v4.9+: NFSv4 client live hangs after live data migration recovery
 - v4.18+: xprtrdma: Fix disconnect regression
 - v4.14+: Fix locking in pnfs_generic_recover_commit_reqs
 - v4.9+: Fix a sleep in atomic context in nfs4_callback_sequence()
 
 Features:
 - Add support for asynchronous server-side COPY operations
 
 Other bugfixes and cleanups:
 - Optitmizations and fixes involving NFS v4.1 / pNFS layout handling
 - Optimize lseek(fd, SEEK_CUR, 0) on directories to avoid locking
 - Immediately reschedule writeback when the server replies with an error
 - Fix excessive attribute revalidation in nfs_execute_ok()
 - Add error checking to nfs_idmap_prepare_message()
 - Use new vm_fault_t return type
 - Return a delegation when reclaiming one that the server has recalled
 - Referrals should inherit proto setting from parents
 - Make rpc_auth_create_args a const
 - Improvements to rpc_iostats tracking
 - Fix a potential reference leak when there is an error processing a callback
 - Fix rmdir / mkdir / rename nlink accounting
 - Fix updating inode change attribute
 - Fix error handling in nfsn4_sp4_select_mode()
 - Use an appropriate work queue for direct-write completion
 - Don't busy wait if NFSv4 session draining is interrupted
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlt/CYIACgkQ18tUv7Cl
 QOu8gBAA0xQWmgRoG6oIdYUxvgYqhuJmMqC4SU1E6mCJ93xEuUSvEFw51X+84KCt
 r6UPkp/bKiVe3EIinKTplIzuxgggXNG0EQmO46FYNTl7nqpN85ffLsQoWsiD23fp
 j8afqKPFR2zfhHXLKQC7k1oiOpwGqJ+EJWgIW4llE80pSNaErEoEaDqSPds5thMN
 dHEjjLr8ef6cbBux6sSPjwWGNbE82uoSu3MDuV2+e62hpGkgvuEYo1vyE6ujeZW5
 MUsmw+AHZkwro0msTtNBOHcPZAS0q/2UMPzl1tsDeCWNl2mugqZ6szQLSS2AThKq
 Zr6iK9Q5dWjJfrQHcjRMnYJB+SCX1SfPA7ASuU34opwcWPjecbS9Q92BNTByQYwN
 o9ngs2K0mZfqpYESMAmf7Il134cCBrtEp3skGko2KopJcYcE5YUFhdKihi1yQQjU
 UbOOubMpQk8vY9DpDCAwGbICKwUZwGvq27uuUWL20kFVDb1+jvfHwcV4KjRAJo/E
 J9aFtU+qOh4rMPMnYlEVZcAZBGfenlv/DmBl1upRpjzBkteUpUJsAbCmGyAk4616
 3RECasehgsjNCQpFIhv3FpUkWzP5jt0T3gRr1NeY6WKJZwYnHEJr9PtapS+EIsCT
 tB5DvvaJqFtuHFOxzn+KlGaxdSodHF7klOq7NM3AC0cX8AkWqaU=
 =8+9t
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "These patches include adding async support for the v4.2 COPY
  operation. I think Bruce is planning to send the server patches for
  the next release, but I figured we could get the client side out of
  the way now since it's been in my tree for a while. This shouldn't
  cause any problems, since the server will still respond with
  synchronous copies even if the client requests async.

  Features:
   - Add support for asynchronous server-side COPY operations

  Stable bufixes:
   - Fix an off-by-one in bl_map_stripe() (v3.17+)
   - NFSv4 client live hangs after live data migration recovery (v4.9+)
   - xprtrdma: Fix disconnect regression (v4.18+)
   - Fix locking in pnfs_generic_recover_commit_reqs (v4.14+)
   - Fix a sleep in atomic context in nfs4_callback_sequence() (v4.9+)

  Other bugfixes and cleanups:
   - Optimizations and fixes involving NFS v4.1 / pNFS layout handling
   - Optimize lseek(fd, SEEK_CUR, 0) on directories to avoid locking
   - Immediately reschedule writeback when the server replies with an
     error
   - Fix excessive attribute revalidation in nfs_execute_ok()
   - Add error checking to nfs_idmap_prepare_message()
   - Use new vm_fault_t return type
   - Return a delegation when reclaiming one that the server has
     recalled
   - Referrals should inherit proto setting from parents
   - Make rpc_auth_create_args a const
   - Improvements to rpc_iostats tracking
   - Fix a potential reference leak when there is an error processing a
     callback
   - Fix rmdir / mkdir / rename nlink accounting
   - Fix updating inode change attribute
   - Fix error handling in nfsn4_sp4_select_mode()
   - Use an appropriate work queue for direct-write completion
   - Don't busy wait if NFSv4 session draining is interrupted"

* tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (54 commits)
  pNFS: Remove unwanted optimisation of layoutget
  pNFS/flexfiles: ff_layout_pg_init_read should exit on error
  pNFS: Treat RECALLCONFLICT like DELAY...
  pNFS: When updating the stateid in layoutreturn, also update the recall range
  NFSv4: Fix a sleep in atomic context in nfs4_callback_sequence()
  NFSv4: Fix locking in pnfs_generic_recover_commit_reqs
  NFSv4: Fix a typo in nfs4_init_channel_attrs()
  NFSv4: Don't busy wait if NFSv4 session draining is interrupted
  NFS recover from destination server reboot for copies
  NFS add a simple sync nfs4_proc_commit after async COPY
  NFS handle COPY ERR_OFFLOAD_NO_REQS
  NFS send OFFLOAD_CANCEL when COPY killed
  NFS export nfs4_async_handle_error
  NFS handle COPY reply CB_OFFLOAD call race
  NFS add support for asynchronous COPY
  NFS COPY xdr handle async reply
  NFS OFFLOAD_CANCEL xdr
  NFS CB_OFFLOAD xdr
  NFS: Use an appropriate work queue for direct-write completion
  NFSv4: Fix error handling in nfs4_sp4_select_mode()
  ...
2018-08-23 16:03:58 -07:00
Linus Torvalds
9157141c95 A mistake on my part caused me to tag my branch 6 commits too early,
missing Chuck's fixes for the problem with callbacks over GSS from
 multi-homed servers, and a smaller fix from Laura Abbott.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbftA8AAoJECebzXlCjuG+QPMQALieEKkX0YoqRhPz5G+RrWFy
 KgOBFAoiRcjFQD6wMt9FzD6qYEZqSJ+I2b+K5N3BkdyDDQu845iD0wK0zBGhMgLm
 7ith85nphIMbe18+5jPorqAsI9RlfBQjiSGw1MEx5dicLQQzTObHL5q+l5jcWna4
 jWS3yUKv1URpOsR1hIryw74ktSnhuH8n//zmntw8aWrCkq3hnXOZK/agtYxZ7Viv
 V3kiQsiNpL2FPRcHN7ejhLUTnRkkuD2iYKrzP/SpTT/JfdNEUXlMhKkAySogNpus
 nvR9X7hwta8Lgrt7PSB9ibFTXtCupmuICg5mbDWy6nXea2NvpB01QhnTzrlX17Eh
 Yfk/18z95b6Qs1v4m3SI8ESmyc6l5dMZozLudtHzifyCqooWZriEhCR1PlQfQ/FJ
 4cYQ8U/qiMiZIJXL7N2wpSoSaWR5bqU1rXen29Np1WEDkiv4Nf5u2fsCXzv0ZH2C
 ReWpNkbnNxsNiKpp4geBZtlcSEU1pk+1PqE0MagTdBV3iptiUHRSP4jR7qLnc0zT
 J1lCvU7Fodnt9vNSxMpt2Jd6XxQ6xtx7n6aMQAiYFnXDs+hP2hPnJVCScnYW3L6R
 2r1sHRKKeoOzCJ2thw+zu4lOwMm7WPkJPWAYfv90reWkiKoy2vG0S9P7wsNGoJuW
 fuEjB2b9pow1Ffynat6q
 =JnLK
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Chuck Lever fixed a problem with NFSv4.0 callbacks over GSS from
  multi-homed servers.

  The only new feature is a minor bit of protocol (change_attr_type)
  which the client doesn't even use yet.

  Other than that, various bugfixes and cleanup"

* tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linux: (27 commits)
  sunrpc: Add comment defining gssd upcall API keywords
  nfsd: Remove callback_cred
  nfsd: Use correct credential for NFSv4.0 callback with GSS
  sunrpc: Extract target name into svc_cred
  sunrpc: Enable the kernel to specify the hostname part of service principals
  sunrpc: Don't use stack buffer with scatterlist
  rpc: remove unneeded variable 'ret' in rdma_listen_handler
  nfsd: use true and false for boolean values
  nfsd: constify write_op[]
  fs/nfsd: Delete invalid assignment statements in nfsd4_decode_exchange_id
  NFSD: Handle full-length symlinks
  NFSD: Refactor the generic write vector fill helper
  svcrdma: Clean up Read chunk path
  svcrdma: Avoid releasing a page in svc_xprt_release()
  nfsd: Mark expected switch fall-through
  sunrpc: remove redundant variables 'checksumlen','blocksize' and 'data'
  nfsd: fix leaked file lock with nfs exported overlayfs
  nfsd: don't advertise a SCSI layout for an unsupported request_queue
  nfsd: fix corrupted reply to badly ordered compound
  nfsd: clarify check_op_ordering
  ...
2018-08-23 16:00:10 -07:00
Linus Torvalds
6f7948f566 This pull request contains updates for both UBI and UBIFS:
- Year 2038 preparations
 - New UBI feature to skip CRC checks of static volumes
 - A new Kconfig option to disable xattrs in UBIFS
 - Lots of fixes in UBIFS, found by our new test framework
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAlt9zFkWHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7waiuD/oDYzerOLe0R7n2sRT9zjtY8kCx
 LuizRvDYUlmMynI6EVahfyJy2IixcDmXOklGdxJqUkN5igDC/FORWdQjv2X9y56d
 qZ2dlS8aBvI0ZKBG2ew4VP1H67CXtCw8H9fE32SGotPmxKRUQqt2vKqo+vgQfapH
 eSVPrOaoqoRh+/ieumYXsvFdEUWpa66G3tVMFe4znu+kYRBbGzSszxpuq1ukIls2
 P9wewqbWAZpqn+n9A9+RBIv81g+jH87/acfjK2L7/lT9wsFO7BQGKi373dPbnTa5
 9WsjGEd+Gt0kb4Kjh5QegY97bPqWjmaMj1BLqeQVpSbQqpzkiFMf9GW5+h3XqAfO
 hM1zzgONZMxHdZSKH0bWzIRQbvU6v0d9C4J/elfFuH9ke2XscrxjOtZZQbtbGeYj
 tE7FWoZnB8euXubulGAUBKofzWe+gItBe9+iA29EBETNOemrJyHyKjO0Fe9ze5p2
 bfVFvN62kHz4ZCJoinwO/OpXnCuA91xrVocLOOIreb4dkZ/kqP+YZWFf70FcE1o5
 sPAbAUu+hfb2LbpktEdZHHbhoupfCnJokzfboJMX0NWKRtFXJDONjogJYTFUjrpW
 eXS+55+WFHoLWtx9J2IVmcb3cQrj/W/4J83kSg99cUkVjGpil50zmtzhq9bHzsLc
 wazngueP7kW2l9bSSg==
 =gCyp
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.19-rc1' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS updates from Richard Weinberger:

 - Year 2038 preparations

 - New UBI feature to skip CRC checks of static volumes

 - A new Kconfig option to disable xattrs in UBIFS

 - Lots of fixes in UBIFS, found by our new test framework

* tag 'upstream-4.19-rc1' of git://git.infradead.org/linux-ubifs: (21 commits)
  ubifs: Set default assert action to read-only
  ubifs: Allow setting assert action as mount parameter
  ubifs: Rework ubifs_assert()
  ubifs: Pass struct ubifs_info to ubifs_assert()
  ubifs: Turn two ubifs_assert() into a WARN_ON()
  ubi: expose the volume CRC check skip flag
  ubi: provide a way to skip CRC checks
  ubifs: Use kmalloc_array()
  ubifs: Check data node size before truncate
  Revert "UBIFS: Fix potential integer overflow in allocation"
  ubifs: Add comment on c->commit_sem
  ubifs: introduce Kconfig symbol for xattr support
  ubifs: use swap macro in swap_dirty_idx
  ubifs: tnc: use monotonic znode timestamp
  ubifs: use timespec64 for inode timestamps
  ubifs: xattr: Don't operate on deleted inodes
  ubifs: gc: Fix typo
  ubifs: Fix memory leak in lprobs self-check
  ubi: Initialize Fastmap checkmapping correctly
  ubifs: Fix synced_i_size calculation for xattr inodes
  ...
2018-08-23 15:58:04 -07:00
Steve French
7753e38286 cifs: update internal module version number for cifs.ko to 2.12
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-23 15:11:10 -05:00
Nicholas Mc Guire
126c97f4d0 cifs: check kmalloc before use
The kmalloc was not being checked - if it fails issue a warning
and return -ENOMEM to the caller.

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Fixes: b8da344b74 ("cifs: dynamic allocation of ntlmssp blob")
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
cc: Stable <stable@vger.kernel.org>`
2018-08-23 15:10:49 -05:00
Ronnie Sahlberg
e6c47dd0da cifs: check if SMB2 PDU size has been padded and suppress the warning
Some SMB2/3 servers, Win2016 but possibly others too, adds padding
not only between PDUs in a compound but also to the final PDU.
This padding extends the PDU to a multiple of 8 bytes.

Check if the unexpected length looks like this might be the case
and avoid triggering the log messages for :

  "SMB2 server sent bad RFC1001 len %d not %d\n"

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-23 15:10:46 -05:00
Ronnie Sahlberg
4d8dfafc5c cifs: create a define for how many iovs we need for an SMB2_open()
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-23 15:10:40 -05:00
Christian Brauner
82c9a927bc getxattr: use correct xattr length
When running in a container with a user namespace, if you call getxattr
with name = "system.posix_acl_access" and size % 8 != 4, then getxattr
silently skips the user namespace fixup that it normally does resulting in
un-fixed-up data being returned.
This is caused by posix_acl_fix_xattr_to_user() being passed the total
buffer size and not the actual size of the xattr as returned by
vfs_getxattr().
This commit passes the actual length of the xattr as returned by
vfs_getxattr() down.

A reproducer for the issue is:

  touch acl_posix

  setfacl -m user:0:rwx acl_posix

and the compile:

  #define _GNU_SOURCE
  #include <errno.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/types.h>
  #include <unistd.h>
  #include <attr/xattr.h>

  /* Run in user namespace with nsuid 0 mapped to uid != 0 on the host. */
  int main(int argc, void **argv)
  {
          ssize_t ret1, ret2;
          char buf1[128], buf2[132];
          int fret = EXIT_SUCCESS;
          char *file;

          if (argc < 2) {
                  fprintf(stderr,
                          "Please specify a file with "
                          "\"system.posix_acl_access\" permissions set\n");
                  _exit(EXIT_FAILURE);
          }
          file = argv[1];

          ret1 = getxattr(file, "system.posix_acl_access",
                          buf1, sizeof(buf1));
          if (ret1 < 0) {
                  fprintf(stderr, "%s - Failed to retrieve "
                                  "\"system.posix_acl_access\" "
                                  "from \"%s\"\n", strerror(errno), file);
                  _exit(EXIT_FAILURE);
          }

          ret2 = getxattr(file, "system.posix_acl_access",
                          buf2, sizeof(buf2));
          if (ret2 < 0) {
                  fprintf(stderr, "%s - Failed to retrieve "
                                  "\"system.posix_acl_access\" "
                                  "from \"%s\"\n", strerror(errno), file);
                  _exit(EXIT_FAILURE);
          }

          if (ret1 != ret2) {
                  fprintf(stderr, "The value of \"system.posix_acl_"
                                  "access\" for file \"%s\" changed "
                                  "between two successive calls\n", file);
                  _exit(EXIT_FAILURE);
          }

          for (ssize_t i = 0; i < ret2; i++) {
                  if (buf1[i] == buf2[i])
                          continue;

                  fprintf(stderr,
                          "Unexpected different in byte %zd: "
                          "%02x != %02x\n", i, buf1[i], buf2[i]);
                  fret = EXIT_FAILURE;
          }

          if (fret == EXIT_SUCCESS)
                  fprintf(stderr, "Test passed\n");
          else
                  fprintf(stderr, "Test failed\n");

          _exit(fret);
  }
and run:

  ./tester acl_posix

On a non-fixed up kernel this should return something like:

  root@c1:/# ./t
  Unexpected different in byte 16: ffffffa0 != 00
  Unexpected different in byte 17: ffffff86 != 00
  Unexpected different in byte 18: 01 != 00

and on a fixed kernel:

  root@c1:~# ./t
  Test passed

Cc: stable@vger.kernel.org
Fixes: 2f6f0654ab ("userns: Convert vfs posix_acl support to use kuids and kgids")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199945
Reported-by: Colin Watson <cjwatson@ubuntu.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2018-08-23 20:42:57 +02:00
Dan Carpenter
b9b8a41ade btrfs: use after free in btrfs_quota_enable
The issue here is that btrfs_commit_transaction() frees "trans" on both
the error and the success path.  So the problem would be if
btrfs_commit_transaction() succeeds, and then qgroup_rescan_init()
fails.  That means that "ret" is non-zero and "trans" is non-NULL and it
leads to a use after free inside the btrfs_end_transaction() macro.

Fixes: 340f1aa27f ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23 17:37:27 +02:00
Anand Jain
801660b040 btrfs: btrfs_shrink_device should call commit transaction at the end
Test case btrfs/164 reports use-after-free:

[ 6712.084324] general protection fault: 0000 [#1] PREEMPT SMP
..
[ 6712.195423]  btrfs_update_commit_device_size+0x75/0xf0 [btrfs]
[ 6712.201424]  btrfs_commit_transaction+0x57d/0xa90 [btrfs]
[ 6712.206999]  btrfs_rm_device+0x627/0x850 [btrfs]
[ 6712.211800]  btrfs_ioctl+0x2b03/0x3120 [btrfs]

Reason for this is that btrfs_shrink_device adds the resized device to
the fs_devices::resized_devices after it has called the last commit
transaction.

So the list fs_devices::resized_devices is not empty when
btrfs_shrink_device returns.  Now the parent function
btrfs_rm_device calls:

        btrfs_close_bdev(device);
        call_rcu(&device->rcu, free_device_rcu);

and then does the transactio ncommit. It goes through the
fs_devices::resized_devices in btrfs_update_commit_device_size and
leads to use-after-free.

Fix this by making sure btrfs_shrink_device calls the last needed
btrfs_commit_transaction before the return. This is consistent with what
the grow counterpart does and this makes sure the on-disk state is
persistent when the function returns.

Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Tested-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23 17:37:27 +02:00
Lu Fengqi
a5b7f4295e btrfs: fix qgroup_free wrong num_bytes in btrfs_subvolume_reserve_metadata
After btrfs_qgroup_reserve_meta_prealloc(), num_bytes will be assigned
again by btrfs_calc_trans_metadata_size(). Once block_rsv fails, we
can't properly free the num_bytes of the previous qgroup_reserve. Use a
separate variable to store the num_bytes of the qgroup_reserve.

Delete the comment for the qgroup_reserved that does not exist and add a
comment about use_global_rsv.

Fixes: c4c129db5d ("btrfs: drop unused parameter qgroup_reserved")
CC: stable@vger.kernel.org # 4.18+
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23 17:37:26 +02:00
Filipe Manana
de02b9f6bb Btrfs: fix data corruption when deduplicating between different files
If we deduplicate extents between two different files we can end up
corrupting data if the source range ends at the size of the source file,
the source file's size is not aligned to the filesystem's block size
and the destination range does not go past the size of the destination
file size.

Example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ xfs_io -f -c "pwrite -S 0x6b 0 2518890" /mnt/foo
  # The first byte with a value of 0xae starts at an offset (2518890)
  # which is not a multiple of the sector size.
  $ xfs_io -c "pwrite -S 0xae 2518890 102398" /mnt/foo

  # Confirm the file content is full of bytes with values 0x6b and 0xae.
  $ od -t x1 /mnt/foo
  0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
  *
  11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
  11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
  *
  11777540 ae ae ae ae ae ae ae ae
  11777550

  # Create a second file with a length not aligned to the sector size,
  # whose bytes all have the value 0x6b, so that its extent(s) can be
  # deduplicated with the first file.
  $ xfs_io -f -c "pwrite -S 0x6b 0 557771" /mnt/bar

  # Now deduplicate the entire second file into a range of the first file
  # that also has all bytes with the value 0x6b. The destination range's
  # end offset must not be aligned to the sector size and must be less
  # then the offset of the first byte with the value 0xae (byte at offset
  # 2518890).
  $ xfs_io -c "dedupe /mnt/bar 0 1957888 557771" /mnt/foo

  # The bytes in the range starting at offset 2515659 (end of the
  # deduplication range) and ending at offset 2519040 (start offset
  # rounded up to the block size) must all have the value 0xae (and not
  # replaced with 0x00 values). In other words, we should have exactly
  # the same data we had before we asked for deduplication.
  $ od -t x1 /mnt/foo
  0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
  *
  11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
  11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
  *
  11777540 ae ae ae ae ae ae ae ae
  11777550

  # Unmount the filesystem and mount it again. This guarantees any file
  # data in the page cache is dropped.
  $ umount /dev/sdb
  $ mount /dev/sdb /mnt

  $ od -t x1 /mnt/foo
  0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
  *
  11461300 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00
  11461320 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  11470000 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
  *
  11777540 ae ae ae ae ae ae ae ae
  11777550

  # The bytes in range 2515659 to 2519040 have a value of 0x00 and not a
  # value of 0xae, data corruption happened due to the deduplication
  # operation.

So fix this by rounding down, to the sector size, the length used for the
deduplication when the following conditions are met:

  1) Source file's range ends at its i_size;
  2) Source file's i_size is not aligned to the sector size;
  3) Destination range does not cross the i_size of the destination file.

Fixes: e1d227a42e ("btrfs: Handle unaligned length in extent_same")
CC: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23 17:37:26 +02:00
Filipe Manana
d4682ba03e Btrfs: sync log after logging new name
When we add a new name for an inode which was logged in the current
transaction, we update the inode in the log so that its new name and
ancestors are added to the log. However when we do this we do not persist
the log, so the changes remain in memory only, and as a consequence, any
ancestors that were created in the current transaction are updated such
that future calls to btrfs_inode_in_log() return true. This leads to a
subsequent fsync against such new ancestor directories returning
immediately, without persisting the log, therefore after a power failure
the new ancestor directories do not exist, despite fsync being called
against them explicitly.

Example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/A
  $ mkdir /mnt/B
  $ mkdir /mnt/A/C
  $ touch /mnt/B/foo
  $ xfs_io -c "fsync" /mnt/B/foo
  $ ln /mnt/B/foo /mnt/A/C/foo
  $ xfs_io -c "fsync" /mnt/A
  <power failure>

After the power failure, directory "A" does not exist, despite the explicit
fsync on it.

Instead of fixing this by changing the behaviour of the explicit fsync on
directory "A" to persist the log instead of doing nothing, make the logging
of the new file name (which happens when creating a hard link or renaming)
persist the log. This approach not only is simpler, not requiring addition
of new fields to the inode in memory structure, but also gives us the same
behaviour as ext4, xfs and f2fs (possibly other filesystems too).

A test case for fstests follows soon.

Fixes: 12fcfd22fe ("Btrfs: tree logging unlink/rename fixes")
Reported-by: Vijay Chidambaram <vvijay03@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23 17:37:26 +02:00
Chuck Lever
a26dd64f54 nfsd: Remove callback_cred
Clean up: The global callback_cred is no longer used, so it can be
removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-22 18:32:07 -04:00
Chuck Lever
cb25e7b293 nfsd: Use correct credential for NFSv4.0 callback with GSS
I've had trouble when operating a multi-homed Linux NFS server with
Kerberos using NFSv4.0. Lately, I've seen my clients reporting
this (and then hanging):

May  9 11:43:26 manet kernel: NFS: NFSv4 callback contains invalid cred

The client-side commit f11b2a1cfb ("nfs4: copy acceptor name from
context to nfs_client") appears to be related, but I suspect this
problem has been going on for some time before that.

RFC 7530 Section 3.3.3 says:
> For Kerberos V5, nfs/hostname would be a server principal in the
> Kerberos Key Distribution Center database.  This is the same
> principal the client acquired a GSS-API context for when it issued
> the SETCLIENTID operation ...

In other words, an NFSv4.0 client expects that the server will use
the same GSS principal for callback that the client used to
establish its lease. For example, if the client used the service
principal "nfs@server.domain" to establish its lease, the server
is required to use "nfs@server.domain" when performing NFSv4.0
callback operations.

The Linux NFS server currently does not. It uses a common service
principal for all callback connections. Sometimes this works as
expected, and other times -- for example, when the server is
accessible via multiple hostnames -- it won't work at all.

This patch scrapes the target name from the client credential,
and uses that for the NFSv4.0 callback credential. That should
be correct much more often.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-22 18:32:07 -04:00
Chuck Lever
9abdda5dda sunrpc: Extract target name into svc_cred
NFSv4.0 callback needs to know the GSS target name the client used
when it established its lease. That information is available from
the GSS context created by gssproxy. Make it available in each
svc_cred.

Note this will also give us access to the real target service
principal name (which is typically "nfs", but spec does not require
that).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-22 18:32:07 -04:00
Linus Torvalds
fe6f0ed0da f2fs-for-4.19-rc1
In this round, we've tuned f2fs to improve general performance by serializing
 block allocation and enhancing discard flows like fstrim which avoids user IO
 contention. And we've added fsync_mode=nobarrier which gives an option to user
 where it skips issuing cache_flush commands to underlying flash storage. And
 there are many bug fixes related to fuzzed images, revoked atomic writes, quota
 ops, and minor direct IO.
 
 Enhancement:
  - add fsync_mode=nobarrier which bypasses cache_flush command
  - enhance the discarding flow which avoids user IOs and issues in LBA order
  - readahead some encrypted blocks during GC
  - enable in-memory inode checksum to verify the blocks if F2FS_CHECK_FS is set
  - enhance nat_bits behavior
  - set -o discard by default
  - set REQ_RAHEAD to bio in ->readpages
 
 Bug fixes:
  - fix a corner case to corrupt atomic_writes revoking flow
  - revisit i_gc_rwsem to fix race conditions
  - fix some dio behaviors captured by xfstests
  - correct handling errors given by quota-related failures
  - add many sanity check flows to avoid fuzz test failures
  - add more error number propagation to their callers
  - fix several corner cases to continue fault injection w/ shutdown loop
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlt82U4ACgkQQBSofoJI
 UNJTLQ/+PhewnNa5tDfUgWdFnUFz3h9/NcC677l0OplOOUNxA8iSa1xamlKf/nf9
 sB5ey0I7oBF8zQGxfndhHQfi6fpfUcNMr14hm+TS/3+d54xLJmiVShD5fjNSV2vB
 Ur0xoozuQDwYF1e3QKdBQjFqaCf78VheTr3aWxyv22/Sg+PYylZJ2K8rHTB7mGPU
 UG0aRnKrP3FPRjL7Q3m0Vm6b6eZ5uNdNrFfjgn/8yuQQ9V197K8vwSbPAsR5/pOh
 miCQXyM708NgEYJRWkWmC/rDSQdU0/h/mGnJWrBrbceW62QefGOgd2jcVfmthHJa
 ZXpj+BEG5bYpCCxGxF6N+u0e28OKonCwO/uvL8YAd5icN7yXtsKzoF1CCuXxOYf1
 9K5SMylCTSyrs/+LV8CJoT2ya8w0l0w+R/txUYn8UT+4AgqU+chS2kJeXqw9tcHB
 WLFs/rnAyofWCI/8frVBmJY+zA1ZZvTqs/lmVYrtJUkiOcMTq34WICBUAEFKV452
 BM5dcu21bSIkapYispEt4Rr7o4P4HHMQ+N1i2yUZMFCz5T0RyzdybeS5THk2yVzd
 L0kxfYU+zHigNX51ez8+Z7DyDLDBp6jkD0e66x73bUK9TGPH+ZbnAL6gwLikmD3M
 +VxYl5nyW/3bxx1HdfK1Xwd4/wYMNBmdtn5NC50oZ+jpB0h8YCE=
 =zgbL
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've tuned f2fs to improve general performance by
  serializing block allocation and enhancing discard flows like fstrim
  which avoids user IO contention. And we've added fsync_mode=nobarrier
  which gives an option to user where it skips issuing cache_flush
  commands to underlying flash storage. And there are many bug fixes
  related to fuzzed images, revoked atomic writes, quota ops, and minor
  direct IO.

  Enhancements:
   - add fsync_mode=nobarrier which bypasses cache_flush command
   - enhance the discarding flow which avoids user IOs and issues in
     LBA order
   - readahead some encrypted blocks during GC
   - enable in-memory inode checksum to verify the blocks if
     F2FS_CHECK_FS is set
   - enhance nat_bits behavior
   - set -o discard by default
   - set REQ_RAHEAD to bio in ->readpages

  Bug fixes:
   - fix a corner case to corrupt atomic_writes revoking flow
   - revisit i_gc_rwsem to fix race conditions
   - fix some dio behaviors captured by xfstests
   - correct handling errors given by quota-related failures
   - add many sanity check flows to avoid fuzz test failures
   - add more error number propagation to their callers
   - fix several corner cases to continue fault injection w/ shutdown
     loop"

* tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (89 commits)
  f2fs: readahead encrypted block during GC
  f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc
  f2fs: fix performance issue observed with multi-thread sequential read
  f2fs: fix to skip verifying block address for non-regular inode
  f2fs: rework fault injection handling to avoid a warning
  f2fs: support fault_type mount option
  f2fs: fix to return success when trimming meta area
  f2fs: fix use-after-free of dicard command entry
  f2fs: support discard submission error injection
  f2fs: split discard command in prior to block layer
  f2fs: wake up gc thread immediately when gc_urgent is set
  f2fs: fix incorrect range->len in f2fs_trim_fs()
  f2fs: refresh recent accessed nat entry in lru list
  f2fs: fix avoid race between truncate and background GC
  f2fs: avoid race between zero_range and background GC
  f2fs: fix to do sanity check with block address in main area v2
  f2fs: fix to do sanity check with inline flags
  f2fs: fix to reset i_gc_failures correctly
  f2fs: fix invalid memory access
  f2fs: fix to avoid broken of dnode block list
  ...
2018-08-22 13:29:39 -07:00
Miklos Szeredi
6faf05c2b2 ovl: set I_CREATING on inode being created
...otherwise there will be list corruption due to inode_sb_list_add() being
called for inode already on the sb list.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: e950564b97 ("vfs: don't evict uninitialized inode")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 13:15:25 -07:00
Linus Torvalds
cd9b44f907 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - the rest of MM

 - procfs updates

 - various misc things

 - more y2038 fixes

 - get_maintainer updates

 - lib/ updates

 - checkpatch updates

 - various epoll updates

 - autofs updates

 - hfsplus

 - some reiserfs work

 - fatfs updates

 - signal.c cleanups

 - ipc/ updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits)
  ipc/util.c: update return value of ipc_getref from int to bool
  ipc/util.c: further variable name cleanups
  ipc: simplify ipc initialization
  ipc: get rid of ids->tables_initialized hack
  lib/rhashtable: guarantee initial hashtable allocation
  lib/rhashtable: simplify bucket_table_alloc()
  ipc: drop ipc_lock()
  ipc/util.c: correct comment in ipc_obtain_object_check
  ipc: rename ipcctl_pre_down_nolock()
  ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
  ipc: reorganize initialization of kern_ipc_perm.seq
  ipc: compute kern_ipc_perm.id under the ipc lock
  init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
  fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
  adfs: use timespec64 for time conversion
  kernel/sysctl.c: fix typos in comments
  drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
  fork: don't copy inconsistent signal handler state to child
  signal: make get_signal() return bool
  signal: make sigkill_pending() return bool
  ...
2018-08-22 12:34:08 -07:00
Arnd Bergmann
3e811f053a fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
get_seconds() is deprecated in favor of ktime_get_real_seconds(), which
returns a 64-bit timestamp.

In the SYSV file system, the superblock timestamp is only 32 bits wide,
and it is used to check whether a file system is clean, so the best
solution seems to be to force a wraparound and explicitly convert it to an
unsigned 32-bit value.

This is independent of the inode timestamps that are also 32-bit wide on
disk and that come from current_time().

Link: http://lkml.kernel.org/r/20180713145236.3152513-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:51 -07:00
Arnd Bergmann
d9edcbc42c adfs: use timespec64 for time conversion
We just truncate the seconds to 32-bit in one place now, so this can
trivially be converted over to using timespec64 consistently.

Link: http://lkml.kernel.org/r/20180620100133.4035614-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:51 -07:00
Arnd Bergmann
f423420c23 fat: propagate 64-bit inode timestamps
Now that we pass down 64-bit timestamps from VFS, we just need to convert
that correctly into on-disk timestamps.  To make that work correctly, this
changes the last use of time_to_tm() in the kernel to time64_to_tm(),
which also lets use remove that deprecated interfaces.

Similarly, the time_t use in fat_time_fat2unix() truncates the timestamp
on the way in, which can be avoided by using types that are wide enough to
hold the intermediate values during the conversion.

[hirofumi@mail.parknet.co.jp: remove useless temporary variable, needless long long]
Link: http://lkml.kernel.org/r/20180619153646.3637529-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00