Commit Graph

53189 Commits

Author SHA1 Message Date
Al Viro
62d9956cef handle move to LRU in retain_dentry()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:07:43 -04:00
Al Viro
a338579f2f dput(): consolidate the "do we need to retain it?" into an inlined helper
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:07:43 -04:00
Al Viro
8b987a46a1 split the slow part of lock_parent() off
Turn the "trylock failed" part into uninlined __lock_parent().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:07:42 -04:00
Al Viro
65d8eb5a8f now lock_parent() can't run into killed dentry
all remaining callers hold either a reference or ->i_lock

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:07:42 -04:00
Al Viro
3b3f09f48b get rid of trylock loop in locking dentries on shrink list
In case of trylock failure don't re-add to the list - drop the locks
and carefully get them in the right order.  For shrink_dentry_list(),
somebody having grabbed a reference to dentry means that we can
kick it off-list, so if we find dentry being modified under us we
don't need to play silly buggers with retries anyway - off the list
it is.

The locking logics taken out into a helper of its own; lock_parent()
is no longer used for dentries that can be killed under us.

[fix from Eric Biggers folded]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-29 15:07:02 -04:00
Eric Biggers
ce3fd194fc ext4: limit xattr size to INT_MAX
ext4 isn't validating the sizes of xattrs where the value of the xattr
is stored in an external inode.  This is problematic because
->e_value_size is a u32, but ext4_xattr_get() returns an int.  A very
large size is misinterpreted as an error code, which ext4_get_acl()
translates into a bogus ERR_PTR() for which IS_ERR() returns false,
causing a crash.

Fix this by validating that all xattrs are <= INT_MAX bytes.

This issue has been assigned CVE-2018-1095.

https://bugzilla.kernel.org/show_bug.cgi?id=199185
https://bugzilla.redhat.com/show_bug.cgi?id=1560793

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Fixes: e50e5129f3 ("ext4: xattr-in-inode support")
2018-03-29 14:31:42 -04:00
Abhi Das
5e86d9d122 gfs2: time journal recovery steps accurately
This patch spits out the time taken by the various steps in the
journal recover process. Previously, the journal recovery time
didn't account for finding the journal head in the log which takes
up a significant portion of time.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-29 10:41:27 -07:00
Eric Sandeen
dc1baa715b xfs: do not log/recover swapext extent owner changes for deleted inodes
Today if we run xfs_fsr and crash[1], log replay can fail because
the recovery code tries to instantiate the donor inode from
disk to replay the swapext, but it's been deleted and we get
verifier failures when we try to read the inode off disk with
i_mode == 0.

This fixes both sides: We don't log the swapext change if the
inode has been deleted, and we don't try to recover it either.

[1] or if systemd doesn't cleanly unmount root, as it is wont
    to do ...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-29 10:19:15 -07:00
Andreas Gruenbacher
fffb64127a gfs2: Zero out fallocated blocks in fallocate_chunk
Instead of zeroing out fallocated blocks in gfs2_iomap_alloc, zero them
out in fallocate_chunk, much higher up the call stack.  This gets rid of
gfs2's abuse of the IOMAP_ZERO flag as well as the gfs2 specific zeronew
buffer flag.  I can't think of a reason why zeroing out the blocks in
gfs2_iomap_alloc would have any benefits: there is no additional locking
at that level that would add protection to the newly allocated blocks.

While at it, change fallocate over from gs2_block_map to gfs2_iomap_begin.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
2018-03-29 06:50:32 -07:00
Eric Biggers
53fedcc09c f2fs: reserve bits for fs-verity
Reserve an F2FS feature flag and inode flag for fs-verity.  This is an
in-development feature that is planned be discussed at LSF/MM 2018 [1].
It will provide file-based integrity and authenticity for read-only
files.  Most code will be in a filesystem-independent module, with
smaller changes needed to individual filesystems that opt-in to
supporting the feature.  An early prototype supporting F2FS is available
[2].  Reserving the F2FS on-disk bits for fs-verity will prevent users
of the prototype from conflicting with other new F2FS features.

Note that we're reserving the inode flag in f2fs_inode.i_advise, which
isn't really appropriate since it's not a hint or advice.  But
->i_advise is already being used to hold the 'encrypt' flag; and F2FS's
->i_flags uses the generic FS_* values, so it seems ->i_flags can't be
used for an F2FS-specific flag without additional work to remove the
assumption that ->i_flags uses the generic flags namespace.

[1] https://marc.info/?l=linux-fsdevel&m=151690752225644
[2] https://git.kernel.org/pub/scm/linux/kernel/git/mhalcrow/linux.git/log/?h=fs-verity-dev

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-28 12:56:49 -07:00
Greg Kroah-Hartman
b24d0d5b12 Merge 4.16-rc7 into char-misc-next
We want the hyperv fix in here for merging and testing.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28 12:27:35 +02:00
Christoph Hellwig
0e11f6443f fs: move I_DIRTY_INODE to fs.h
And use it in a few more places rather than opencoding the values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-28 01:39:02 -04:00
Christoph Hellwig
f35562549f ubifs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
I_DIRTY_DATASYNC is a strict superset of I_DIRTY_SYNC semantics, as
in mark dirty to be written out by fdatasync as well.  So dirtying
for both flags makes no sense.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-28 01:39:02 -04:00
Christoph Hellwig
2c2acd2d19 ntfs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call
I_DIRTY_DATASYNC is a strict superset of I_DIRTY_SYNC semantics, as
in mark dirty to be written out by fdatasync as well.  So dirtying
for both flags makes no sense.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-28 01:39:02 -04:00
Christoph Hellwig
937d330512 gfs2: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) calls
I_DIRTY_DATASYNC is a strict superset of I_DIRTY_SYNC semantics, as
in mark dirty to be written out by fdatasync as well.  So dirtying
for both flags makes no sense.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-28 01:39:01 -04:00
Christoph Hellwig
cab64df194 fs: fold open_check_o_direct into do_dentry_open
do_dentry_open is where we do the actual open of the file, so this is
where we should do our O_DIRECT sanity check to cover all potential
callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-28 01:39:01 -04:00
Yunlei He
d21b0f238a f2fs: Add a segment type check in inplace write
This patch add a segment type check in IPU, in
case of something wrong with blkadd in dnode.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-27 20:13:54 -07:00
Yunlong Song
2f52052929 f2fs: no need to initialize zero value for GFP_F2FS_ZERO
Since f2fs_inode_info is allocated with flag GFP_F2FS_ZERO, so we do not
need to initialize zero value for its member any more.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-27 20:12:53 -07:00
Chao Yu
780de47cf6 f2fs: don't track new nat entry in nat set
Nat entry set is used only in checkpoint(), and during checkpoint() we
won't flush new nat entry with unallocated address, so we don't need to
add new nat entry into nat set, then nat_entry_set::entry_cnt can
indicate actual entry count we need to flush in checkpoint().

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-27 20:10:29 -07:00
Chao Yu
df033caf51 f2fs: clean up with F2FS_BLK_ALIGN
Clean up F2FS_BYTES_TO_BLK(x + F2FS_BLKSIZE - 1) with F2FS_BLK_ALIGN(x).

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-27 20:09:32 -07:00
David Howells
a25e21f0bc rxrpc, afs: Use debug_ids rather than pointers in traces
In rxrpc and afs, use the debug_ids that are monotonically allocated to
various objects as they're allocated rather than pointers as kernel
pointers are now hashed making them less useful.  Further, the debug ids
aren't reused anywhere nearly as quickly.

In addition, allow kernel services that use rxrpc, such as afs, to take
numbers from the rxrpc counter, assign them to their own call struct and
pass them in to rxrpc for both client and service calls so that the trace
lines for each will have the same ID tag.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-03-27 23:03:00 +01:00
Kirill Tkhai
2f635ceeb2 net: Drop pernet_operations::async
Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 13:18:09 -04:00
Kirill Tkhai
67441c2472 net: Convert nfsd_net_ops
These pernet_operations look similar to rpcsec_gss_net_ops,
they just create and destroy another caches. So, they also
can be async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 13:18:07 -04:00
Masanari Iida
bc8282a730 treewide: Fix typos in printk
This patch fixes spelling typos found in printk.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2018-03-27 09:51:22 +02:00
Theodore Ts'o
7dac4a1726 ext4: add validity checks for bitmap block numbers
An privileged attacker can cause a crash by mounting a crafted ext4
image which triggers a out-of-bounds read in the function
ext4_valid_block_bitmap() in fs/ext4/balloc.c.

This issue has been assigned CVE-2018-1093.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199181
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1560782
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-03-26 23:54:10 -04:00
Kirill Tkhai
dbf7bb4437 net: Convert nfs4blocklayout_net_ops
These pernet_operations create and destroy per-net pipe
and dentry, and they seem safe to be marked as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 13:03:26 -04:00
Kirill Tkhai
436de50094 net: Convert nfs4_dns_resolver_ops
These pernet_operations look similar to rpcsec_gss_net_ops,
they just create and destroy another cache. Also they create
and destroy directory. So, they also look safe to be async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26 13:03:26 -04:00
Brian Foster
72c44e35f0 xfs: clean up xfs_mount allocation and dynamic initializers
Most of the generic data structures embedded in xfs_mount are
dynamically initialized immediately after mp is allocated. A few
fields are left out and initialized during the xfs_mountfs()
sequence, after mp has been attached to the superblock.

To clean this up and help prevent premature access of associated
fields, refactor xfs_mount allocation and all dependent init calls
into a new helper. This self-documents that all low level data
structures (i.e., locks, trees, etc.) should be initialized before
xfs_mount is attached to the superblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-26 08:54:15 -07:00
Arnd Bergmann
a687a53370 treewide: simplify Kconfig dependencies for removed archs
A lot of Kconfig symbols have architecture specific dependencies.
In those cases that depend on architectures we have already removed,
they can be omitted.

Acked-by: Kalle Valo <kvalo@codeaurora.org>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-03-26 15:55:57 +02:00
Liu Bo
4759700a71 Btrfs: dev-replace: make sure target is identical to source when raid56 rebuild fails
In the last step of scrub_handle_error_block, we try to combine good
copies on all possible mirrors, this works fine for raid1 and raid10,
but not for raid56 as it's doing parity rebuild.

If parity rebuild doesn't get back with correct data which matches its
checksum, in case of replace we'd rather write what is stored in the
source device than the data calculuated from parity.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:43 +02:00
Liu Bo
d6a691350b Btrfs: raid56: remove redundant async_missing_raid56
async_missing_raid56() is identical to async_read_rebuild().

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:43 +02:00
Su Yue
005d67127f btrfs: adjust return values of btrfs_inode_by_name
Previously, btrfs_inode_by_name() returned 0 which left caller to check
objectid of location even location if the type was invalid.

Let btrfs_inode_by_name() return -EUCLEAN if a corrupted location of a
dir entry is found.  Removal of label out_err also simplifies the
function.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ drop unlikely ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:43 +02:00
Anand Jain
9b99b11564 btrfs: rename btrfs_close_extra_device to btrfs_free_extra_devids
This function btrfs_close_extra_devices() is about freeing
extra devids which once it may have belonged to this filesystem.
So rename it and add the comment. The _devid suffix is
appropriate as this function won't handle devices which are
outside of the filesytem being mounted.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:42 +02:00
Nikolay Borisov
d02c0e2019 btrfs: Remove root argument from cow_file_range_inline
This argument is always set to the root of the inode, which is also
passed. So let's get a reference inside the function and simplify
the arg list.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:42 +02:00
Liu Bo
895a72be41 Btrfs: send: fix typo in TLV_PUT
According to tlv_put()'s prototype, data and attrlen needs to be
exchanged in the macro, but seems all callers are already aware of
this misorder and are therefore not affected.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:42 +02:00
Nikolay Borisov
e5b84f7a25 btrfs: Remove root argument from btrfs_log_dentry_safe
Now that nothing uses the root arg of btrfs_log_dentry_safe it can be
safely removed. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:42 +02:00
Nikolay Borisov
f882274b2d btrfs: Remove root arg from btrfs_log_inode_parent
btrfs_log_inode_parent is called from 2 places (btrfs_log_dentry_safe
and btrfs_log_new_name) both of which pass inode->root as the root
argument and the inode itself. Remove the redundant root argument and
get a reference to the root directly from the inode, also remove
redundant root != inode->root check from the same function. No
functional change.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:42 +02:00
Nikolay Borisov
448f3a17ac btrfs: Remove redundant comment from btrfs_search_forward
This function always sets keep_locks to 1 and saves the old value of
keep_locks which is restored at the end. So there is no way it can be
called without keep_locks being set. Remove comment imposing redundant
requirement on callers.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:41 +02:00
David Sterba
738c93d42c btrfs: move btrfs_listxattr prototype to xattr.h
There's a proper header for xattr handlers.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:41 +02:00
David Sterba
bcadd7050a btrfs: adjust return type of btrfs_getxattr
The xattr_handler::get prototype returns int, use it. The only ssize_t
exception is the per-inode listxattr handler.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:41 +02:00
David Sterba
ab0d093616 btrfs: drop extern from function declarations
Extern for functions does not make any difference, there are only a few
so let's remove them before it's too late.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:41 +02:00
David Sterba
7852781d94 btrfs: drop underscores from exported xattr functions
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:41 +02:00
Filipe Manana
ffa7c4296e Btrfs: send, do not issue unnecessary truncate operations
When send finishes processing an inode representing a regular file, it
always issues a truncate operation for that file, even if its size did
not change or the last write sets the file size correctly. In the most
common cases, the issued write operations set the file to correct size
(either full or incremental sends) or the file size did not change (for
incremental sends), so the only case where a truncate operation is needed
is when a file size becomes smaller in the send snapshot when compared
to the parent snapshot.

By not issuing unnecessary truncate operations we reduce the stream size
and save time in the receiver. Currently truncating a file to the same
size triggers writeback of its last page (if it's dirty) and waits for it
to complete (only if the file size is not aligned with the filesystem's
sector size). This is being fixed by another patch and is independent of
this change (that patch's title is "Btrfs: skip writeback of last page
when truncating file to same size").

The following script was used to measure time spent by a receiver without
this change applied, with this change applied, and without this change and
with the truncate fix applied (the fix to not make it start and wait for
writeback to complete).

  $ cat test_send.sh
  #!/bin/bash

  SRC_DEV=/dev/sdc
  DST_DEV=/dev/sdd
  SRC_MNT=/mnt/sdc
  DST_MNT=/mnt/sdd

  mkfs.btrfs -f $SRC_DEV >/dev/null
  mkfs.btrfs -f $DST_DEV >/dev/null
  mount $SRC_DEV $SRC_MNT
  mount $DST_DEV $DST_MNT

  echo "Creating source filesystem"
  for ((t = 0; t < 10; t++)); do
      (
          for ((i = 1; i <= 20000; i++)); do
              xfs_io -f -c "pwrite -S 0xab 0 5000" \
                  $SRC_MNT/file_$i > /dev/null
          done
      ) &
     worker_pids[$t]=$!
  done
  wait ${worker_pids[@]}

  echo "Creating and sending snapshot"
  btrfs subvolume snapshot -r $SRC_MNT $SRC_MNT/snap1 >/dev/null
  /usr/bin/time -f "send took %e seconds"    \
         btrfs send -f $SRC_MNT/send_file $SRC_MNT/snap1
  /usr/bin/time -f "receive took %e seconds" \
         btrfs receive -f $SRC_MNT/send_file $DST_MNT

  umount $SRC_MNT
  umount $DST_MNT

The results, which are averages for 5 runs for each case, were the
following:

* Without this change

average receive time was 26.49 seconds
standard deviation of 2.53 seconds

* Without this change and with the truncate fix

average receive time was 12.51 seconds
standard deviation of 0.32 seconds

* With this change and without the truncate fix

average receive time was 10.02 seconds
standard deviation of 1.11 seconds

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:41 +02:00
Filipe Manana
213e8c5520 Btrfs: skip writeback of last page when truncating file to same size
When we truncate a file to the same size and that size is not aligned
with the sector size, we end up triggering writeback (and wait for it to
complete) of the last page. This is unncessary as we can not have delayed
allocation beyond the inode's i_size and the goal of truncating a file
to its own size is to discard prealloc extents (allocated via the
fallocate(2) system call). Besides the unnecessary IO start and wait, it
also breaks the oppurtunity for larger contiguous extents on disk, as
before the last dirty page there might be other dirty pages.

This scenario is probably not very common in general, however it is
common for btrfs receive implementations because currently the send
stream always issues a truncate operation for each processed inode as
the last operation for that inode (this truncate operation is not
always needed and the send implementation will be addressed to avoid
them).

So improve this by not starting and waiting for writeback of the inode's
last page when we are truncating to exactly the same size.

The following script was used to quickly measure the time a receive
operation takes:

 $ cat test_send.sh
 #!/bin/bash

 SRC_DEV=/dev/sdc
 DST_DEV=/dev/sdd
 SRC_MNT=/mnt/sdc
 DST_MNT=/mnt/sdd

 mkfs.btrfs -f $SRC_DEV >/dev/null
 mkfs.btrfs -f $DST_DEV >/dev/null
 mount $SRC_DEV $SRC_MNT
 mount $DST_DEV $DST_MNT

 echo "Creating source filesystem"
 for ((t = 0; t < 10; t++)); do
     (
         for ((i = 1; i <= 20000; i++)); do
             xfs_io -f -c "pwrite -S 0xab 0 5000" \
                $SRC_MNT/file_$i > /dev/null
         done
     ) &
     worker_pids[$t]=$!
 done
 wait ${worker_pids[@]}

 echo "Creating and sending snapshot"
 btrfs subvolume snapshot -r $SRC_MNT $SRC_MNT/snap1 >/dev/null
 /usr/bin/time -f "send took %e seconds"    \
     btrfs send -f $SRC_MNT/send_file $SRC_MNT/snap1
 /usr/bin/time -f "receive took %e seconds" \
     btrfs receive -f $SRC_MNT/send_file $DST_MNT

 umount $SRC_MNT
 umount $DST_MNT

The results for 5 runs were the following:

* Without this change

average receive time was 26.49 seconds
standard deviation of 2.53 seconds

* With this change

average receive time was 12.51 seconds
standard deviation of 0.32 seconds

Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:40 +02:00
Liu Bo
ed5d5f37e6 Btrfs: dev-replace: skip prealloc extents when copy nocow pages
It doens't make sense to process prealloc extents as pages will be
filled with zero when reading prealloc extents.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:40 +02:00
Anand Jain
d612ac59ef btrfs: unify types for metadata_ratio and data_chunk_allocations
We have btrfs_fs_info::data_chunk_allocations and
btrfs_fs_info::metadata_ratio declared as unsigned which would be
unsinged int and kernel style prefers unsigned int over bare unsigned.
So this patch changes them to u32.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:40 +02:00
Nikolay Borisov
de224b7c56 btrfs: Remove redundant memory barriers around dio_private error status
Using any kind of memory barriers around atomic operations which have
a return value is redundant, since those operations themselves are
fully ordered. atomic_t.txt states:

    - RMW operations that have a return value are fully ordered;

    Fully ordered primitives are ordered against everything prior and
    everything subsequent. Therefore a fully ordered primitive is like
    having an smp_mb() before and an smp_mb() after the primitive.

Given this let's replace the extra memory barriers with comments.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:40 +02:00
Anand Jain
16db5758fe btrfs: remove assert in btrfs_init_dev_replace_tgtdev()
In the same function we just ran btrfs_alloc_device() which means the
btrfs_device::resized_list is sure to be empty and we are protected
with the btrfs_fs_info::volume_mutex.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:40 +02:00
David Sterba
e67c718b5b btrfs: add more __cold annotations
The __cold functions are placed to a special section, as they're
expected to be called rarely. This could help i-cache prefetches or help
compiler to decide which branches are more/less likely to be taken
without any other annotations needed.

Though we can't add more __exit annotations, it's still possible to add
__cold (that's also added with __exit). That way the following function
categories are tagged:

- printf wrappers, error messages
- exit helpers

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:39 +02:00
David Sterba
ffc5a3794f btrfs: add (the only possible) __exit annotation
Recently, the __init annotations have been added. There's unfortunatelly
only one case where we can add __exit, because most of the cleanup
helpers are also called from the __init phase.

As the __exit annotated functions get discarded completely for a
built-in code, we'd miss them from the init phase.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:39 +02:00
Anand Jain
ccb0e7d1c1 btrfs: verify subvolid mount parameter
We aren't verifying the parameter passed to the subvolid mount option,
so we won't report and fail the mount if a junk value is specified for
example, -o subvolid=abc.
This patch verifies the subvolid option with match_u64.

Up to now the memparse function accepts the K/M/G/ suffixes, that are
usually meant for size values and do not make sense for a subvolume it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:39 +02:00
Liu Bo
5811375325 Btrfs: fix unexpected cow in run_delalloc_nocow
Fstests generic/475 provides a way to fail metadata reads while
checking if checksum exists for the inode inside run_delalloc_nocow(),
and csum_exist_in_range() interprets error (-EIO) as inode having
checksum and makes its caller enter the cow path.

In case of free space inode, this ends up with a warning in
cow_file_range().

The same problem applies to btrfs_cross_ref_exist() since it may also
read metadata in between.

With this, run_delalloc_nocow() bails out when errors occur at the two
places.

cc: <stable@vger.kernel.org> v2.6.28+
Fixes: 17d217fe97 ("Btrfs: fix nodatasum handling in balancing code")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:39 +02:00
Nikolay Borisov
9678c54388 btrfs: Remove custom crc32c init code
The custom crc32 init code was introduced in
14a958e678 ("Btrfs: fix btrfs boot when compiled as built-in") to
enable using btrfs as a built-in. However, later as pointed out by
60efa5eb2e ("Btrfs: use late_initcall instead of module_init") this
wasn't enough and finally btrfs was switched to late_initcall which
comes after the generic crc32c implementation is initiliased. The
latter commit superseeded the former. Now that we don't have to
maintain our own code let's just remove it and switch to using the
generic implementation.

Despite touching a lot of files the patch is really simple. Here is the gist of
the changes:

1. Select LIBCRC32C rather than the low-level modules.
2. s/btrfs_crc32c/crc32c/g
3. replace hash.h with linux/crc32c.h
4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly.

I've tested this with btrfs being both a module and a built-in and xfstest
doesn't complain.

Does seem to fix the longstanding problem of not automatically selectiong
the crc32c module when btrfs is used. Possibly there is a workaround in
dracut.

The modinfo confirms that now all the module dependencies are there:

before:
depends:        zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate

after:
depends:        libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add more info to changelog from mails ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:39 +02:00
Qu Wenruo
3e72ee8874 btrfs: Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()
Function __get_raid_index() is used to convert block group flags into
raid index, which can be used to get various info directly from
btrfs_raid_array[].

Refactor this function a little:

1) Rename to btrfs_bg_flags_to_raid_index()
   Double underline prefix is normally for internal functions, while the
   function is used by both extent-tree and volumes.

   Although the name is a little longer, but it should explain its usage
   quite well.

2) Move it to volumes.h and make it static inline
   Just several if-else branches, really no need to define it as a normal
   function.

   This also makes later code re-use between kernel and btrfs-progs
   easier.

3) Remove function get_block_group_index()
   Really no need to do such a simple thing as an exported function.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:38 +02:00
Qu Wenruo
2f659546c9 btrfs: tree-checker: Replace root parameter with fs_info
When inspecting the error message with real corruption, the "root=%llu"
always shows "1" (root tree), instead of the correct owner.

The problem is that we are getting @root from page->mapping->host, which
points the same btree inode, so we will always get the same root.

This makes the root owner output meaningless, and harder to port
tree-checker to btrfs-progs.

So get rid of the false and meaningless @root parameter and replace it
with @fs_info.
To get the owner, we can only rely on btrfs_header_owner() now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:38 +02:00
Liu Bo
393da91819 Btrfs: add tracepoint for em's EEXIST case
This is adding a tracepoint 'btrfs_handle_em_exist' to help debug the
subtle bugs around merge_extent_mapping.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:38 +02:00
Nikolay Borisov
5d23515be6 btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable
Currently btrfs_run_qgroups is doing a bit too much. Not only is it
responsible for synchronizing in-memory state of qgroups to disk but
it also contains code to trigger the initial qgroup rescan when
quota is enabled initially. This condition is detected by checking that
BTRFS_FS_QUOTA_ENABLED is not set and BTRFS_FS_QUOTA_ENABLING is set.
Nothing really requires from the code to be structured (and scattered)
the way it is so let's streamline things. First move the quota rescan
code into btrfs_quota_enable, where its invocation is closer to the
use. This also makes the FS_QUOTA_ENABLING flag redundant so let's
remove it as well.

This has been tested with a full xfstest run with qgroups enabled on
the scratch device of every xfstest and no regressions were observed.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:38 +02:00
Gu JinXiang
7ce311d552 btrfs: use reada direction enum instead of constant value in load_free_space_tree
load_free_space_tree calls either function load_free_space_bitmaps or
load_free_space_extents. And either of those two will lead to call
btrfs_next_item.  So in function load_free_space_tree, use READA_FORWARD
to read forward ahead.

This also changes the value from READA_BACK to READA_FORWARD, since
according to the logic, it should reada_for_search forward, not
backward.

Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:37 +02:00
Gu Jinxiang
019599ada7 btrfs: use reada direction enum instead of constant value in populate_free_space_tree
populate_free_space_tree calls function btrfs_search_slot_for_read with
parameter int find_higher = 1, it means that, if no exact match is
found, then use the next higher item.  So in function
populate_free_space_tree, use READA_FORWARD to read forward ahead.

This also changes the value from READA_BACK to READA_FORWARD, since
according to the logic, it should reada_for_search forward, not
backward.

Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:37 +02:00
Nikolay Borisov
c1c3fac2a9 btrfs: Remove btrfs_inode::delayed_iput_count
delayed_iput_count wa supposed to be used to implement, well, delayed
iput. The idea is that we keep accumulating the number of iputs we do
until eventually the inode is deleted. Turns out we never really
switched the delayed_iput_count from 0 to 1, hence all conditional
code relying on the value of that member being different than 0 was
never executed. This, as it turns out, didn't cause any problem due
to the simple fact that the generic inode's i_count member was always
used to count the number of iputs. So let's just remove the unused
member and all unused code. This patch essentially provides no
functional changes. While at it, also add proper documentation for
btrfs_add_delayed_iput

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:37 +02:00
Qu Wenruo
793ff2c88c btrfs: volumes: Cleanup stripe size calculation
Cleanup the following things:
1) open coded SZ_16M round up
2) use min() to replace open-coded size comparison
3) code style

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:37 +02:00
Nikolay Borisov
da07d4ab33 btrfs: Streamline btrfs_delalloc_reserve_metadata initial operations
The behavior of btrfs_delalloc_reserve_metadata depends on whether
the inode we are allocating for is the freespace inode or not. As it
stands if we are the free node we set 'flush' and 'delalloc_lock'
variable to certain values. Subsequently we check the values of those
vars and act accordingly. Instead, simplify things by having 1 if
which checks whether we are the freespace inode or not and do any
specific operation in either branches of that if. This makes the code
a bit easier to understand, as an added bonus it also shrinks the
compiled size:

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-17 (-17)
Function                                     old     new   delta
btrfs_delalloc_reserve_metadata             1876    1859     -17
Total: Before=85966, After=85949, chg -0.02%

No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:37 +02:00
Anand Jain
b1b8e38622 btrfs: insert newly opened device to the end of the list
Add opened device to the tail of dev_alloc_list instead of head, so that
it maintains the same order as dev_list.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:37 +02:00
Anand Jain
f8e10cd3f8 btrfs: keep device list sorted
By maintaining the device list sorted lets us reproduce the problems
related to missing chunk in the degraded mode much more consistent. So
fix this by sorting the devices by devid within the kernel. So that we
know which device is assigned to the struct fs_info::latest_bdev when
all the devices are having and same SB generation.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:36 +02:00
Liu Bo
3d5addafd0 Btrfs: do not check inode's runtime flags under root->orphan_lock
It's not necessary to hold ->orphan_lock when checking inode's runtime
flags.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:36 +02:00
Nikolay Borisov
bc5511d0ed btrfs: Use schedule_timeout_interruptible
Instead of manually fiddling with the state of the task
(RUNNING->INTERRUPTIBLE->RUNNING) again just use schedule_timeout_interruptible
which adjusts the task state as needed. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:36 +02:00
Nikolay Borisov
f9cacae314 btrfs: Move error handling of btrfs_start_dirty_block_groups closer to call site
Even though btrfs_start_dirty_block_groups is fairly in the beginning of
btrfs_commit_transaction outside of the critical section defined by the
transaction states it can only be run by a single comitter. In other
words it defines its own critical section thanks to the
BTRFS_TRANS_DIRTY_BG run flag and ro_block_group_mutex. However, its
error handling is outside of this critical section which is a bit
counter-intuitive. So move the error handling righ after the function is
executed and let the sole runner of dirty block groups handle the return
value. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:36 +02:00
Anand Jain
7ef2d6a722 btrfs: not a disk error if the bio_add_page fails
bio_add_page() can fail for logical reasons as from the bio_add_page()
comments:

/*
 * This will only fail if either bio->bi_vcnt == bio->bi_max_vecs or
 * it's a cloned bio.
 */

Here we have just allocated the bio, so both of those failures can't
occur. So drop the check. We can also drop the error stats for write
error.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:36 +02:00
Qu Wenruo
4117f207d4 btrfs: Add chunk allocation ENOSPC debug message for enospc_debug mount option
Enospc_debug makes extent allocator print more debug messages,
however for chunk allocation, there is no debug message for enospc_debug
at all.

This patch will add message for the following parts of chunk allocator:

1) No rw device at all
   Quite rare, but at least output one message for this case.

2) Not enough space for some device
   This debug message is quite handy for unbalanced disks with stripe
   based profiles (RAID0/10/5/6).

3) Not enough free devices
   This debug message should tell us if current chunk allocator is
   working correctly under minimal device requirements.

Although in most cases, we will hit other ENOSPC before we even hit a
chunk allocator ENOSPC, but such debug info won't help.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:35 +02:00
Anand Jain
566b1760b4 btrfs: use ASSERT to report logical error in cow_file_range()
Use ASSERT to report logical error in cow_file_range(), also move it a
bit closer to when the num_bytes is derived.

The extent start could be (u64)-1 in some cases, the assert should catch
that we do not accidentally pass it to cow_file_range.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:35 +02:00
Anand Jain
3752d22fce btrfs: cow_file_range() num_bytes and disk_num_bytes are same
This patch deletes local variable disk_num_bytes as its value
is same as num_bytes in the function cow_file_range().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:35 +02:00
Anand Jain
2afb9653bf btrfs: remove unused function btrfs_async_submit_limit()
Commit [1] removed the need to use btrfs_async_submit_limit(), so
delete it.

[1]
 commit 736cd52e0c
  Btrfs: remove nr_async_submits and async_submit_draining

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:35 +02:00
Yang Shi
86d750a48a btrfs: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by btrfs at all.

So, remove the unused hardirq.h.

Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:35 +02:00
Nikolay Borisov
9a3daff320 btrfs: Add enospc_debug printing in metadata_reserve_bytes
Currently when enospc_debug mount option is turned on we do not print
any debug info in case metadata reservation failures happen. Fix this
by adding the necessary hook in reserve_metadata_bytes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:35 +02:00
Nikolay Borisov
45ae2c1841 btrfs: Document consistency of transaction->io_bgs list
The reason why io_bgs can be modified without holding any lock is
non-obvious. Document it and reference that documentation from the
respective call sites.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:34 +02:00
Nikolay Borisov
bf6d7d4900 btrfs: Remove invalid null checks from btrfs_cleanup_dirty_bgs
list_first_entry is essentially a wrapper over cotnainer_of. The latter
can never return null even if it's working on inconsistent list since it
will either crash or return some offset in the wrong struct.
Additionally, for the dirty_bgs list the iteration is done under
dirty_bgs_lock which ensures consistency of the list.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:34 +02:00
Anand Jain
8f2ceaa7b4 btrfs: log, when replace, is canceled by the user
For debugging or administration purposes, we would want to know if and
when the user cancels the replace, to complement the existing messages
when dev-replace starts or finishes.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, fold fix for RCU warning from Nikolay ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:34 +02:00
Anand Jain
acf18c56fd btrfs: fix null pointer deref when target device is missing
The replace target device can be missing when mounted with -o degraded,
but we wont allocate a missing btrfs_device to it. So check the device
before accessing.

BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
IP: btrfs_destroy_dev_replace_tgtdev+0x43/0xf0 [btrfs]
Call Trace:
btrfs_dev_replace_cancel+0x15f/0x180 [btrfs]
btrfs_ioctl+0x2216/0x2590 [btrfs]
do_vfs_ioctl+0x625/0x650
SyS_ioctl+0x4e/0x80
do_syscall_64+0x5d/0x160
entry_SYSCALL64_slow_path+0x25/0x25

This patch has been moved in front of patch "btrfs: log, when replace,
is canceled by the user" that could reproduce the crash if the system
reboots inside btrfs_dev_replace_start before the
btrfs_dev_replace_finishing call.

 $ mkfs /dev/sda
 $ mount /dev/sda mnt
 $ btrfs replace start /dev/sda /dev/sdb
 <insert reboot>
 $ mount po degraded /dev/sdb mnt
 <crash>

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ added reproducer description from mail ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:34 +02:00
Anand Jain
eceff22a80 btrfs: add a comment to mark the deprecated mount option
The options alloc_start and subvolrootid are deprecated, comment them in
the tokens list. And leave them as it is. No functional changes.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:34 +02:00
Anand Jain
d374060864 btrfs: manage commit mount option as %u
As the commit mount option is unsigned so manage it as %u for token
verifications, instead of %d.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:33 +02:00
Anand Jain
02453bdeb0 btrfs: manage check_int_print_mask mount option as %u
As check_int_print_mask mount option is unsigned so manage it as %u for
token verifications, instead of %d.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:33 +02:00
Anand Jain
764cb8b43d btrfs: manage metadata_ratio mount option as %u
As metadata_ratio mount option is unsinged so manage it as %u for token
verifications, instead of %d.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:33 +02:00
Anand Jain
f7b885befd btrfs: manage thread_pool mount option as %u
The mount option thread_pool is always unsigned. Manage it that way all
around.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:33 +02:00
Anand Jain
ba020491c8 btrfs: extent_buffer_uptodate() make it static and inline
extent_buffer_uptodate() is a trivial wrapper around test_bit() and
nothing else. So make it static and inline, save on code space and call
indirection.

Before:
   text	   data	    bss	    dec	    hex	filename
1131257	  82898	  18992	1233147	 12d0fb	fs/btrfs/btrfs.ko

After:
   text	   data	    bss	    dec	    hex	filename
1131090	  82898	  18992	1232980	 12d054	fs/btrfs/btrfs.ko

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:33 +02:00
Nikolay Borisov
70458a5819 btrfs: Remove fs_info argument of btrfs_write_and_wait_transaction
We already pass btrfs_trans_handle which contains a reference to the
fs_info so use that. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:33 +02:00
Nikolay Borisov
e9b919b1f7 btrfs: Remove fs_info argument from btrfs_update_commit_device_bytes_used
We already pass the btrfs_transaction which references fs_info so no
need to pass the later as an argument. Also use the opportunity to
shorten transaction->trans. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:32 +02:00
Nikolay Borisov
08d50ca32c btrfs: Remove fs_info argument from create_pending_snapshots/create_pending_snapshot
We already pass the trans handle which has a reference to fs_info to
create_pending_snapshot so we can refer to it directly. Doing this
obviates the need to pass the fs_info to create_pending_snapshots as
well. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:32 +02:00
Nikolay Borisov
16916a88d4 btrfs: Remove fs_info argument from switch_commit_roots
We already have the fs_info from the passed transaction so use it
directly. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:32 +02:00
Nikolay Borisov
97cb39bb91 btrfs: Remove root argument of cleanup_transaction
The only thing the passed root is used for is:
1. get a reference to the fs_info and to
2. call trace_btrfs_transaction_commit.

We can achieve 1) by simply referring to the fs_info from passed trans
object. As far as 2) is concerned cleanup_transaction is called from
only one place and the 'root' argument passed is the one from the trans
handle. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:32 +02:00
Nikolay Borisov
9386d8bc58 btrfs: Don't pass fs_info to commit_cowonly_roots
We already pass a transaction handle which refrences the fs_info so
we can grab it from there. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:32 +02:00
Nikolay Borisov
7e4443d9eb btrfs: Don't pass fs_info to commit_fs_roots
We already pass the transaction handle which has a reference to the
fs_info. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:32 +02:00
Nikolay Borisov
e5c304e651 btrfs: Don't pass fs_info to btrfs_run_delayed_items/_nr
We already pass the transaction which has a reference to the fs_info,
so use that. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:31 +02:00
Nikolay Borisov
b84acab38f btrfs: Don't pass fs_info to __btrfs_run_delayed_items
We already pass the transaction handle, which contains a refrence to
the fs_info so grab it from there. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:31 +02:00
Nikolay Borisov
2121705420 btrfs: Don't pass fs_info arg to btrfs_start_dirty_block_groups
It can be referenced from the passed transaction so no point in passing
it as a function argument. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:31 +02:00
Nikolay Borisov
6c686b359a btrfs: Remove fs_info argument from btrfs_create_pending_block_groups
It can be referenced from the passed transaciton so no point in
passing it as function argument. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:31 +02:00
Nikolay Borisov
dc60c525cf btrfs: Remove fs_info argument from btrfs_trans_release_metadata
All current callers of this function just get a reference to the
trans->fs_info member and pass it as the second argument. Collapse this
into the function itself. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:31 +02:00
Nikolay Borisov
c9b577c01a btrfs: Open code btrfs_write_and_wait_marked_extents
btrfs_write_and_wait_transaction is essentially a wrapper of
btrfs_write_and_wait_marked_extents with the addition of calling
clear_btree_io_tree. Having the code split doesn't really bring any
benefit. Open code the later into the former and add proper
documentation header.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:30 +02:00
Nikolay Borisov
0e34693f7b btrfs: Make btrfs_trans_release_metadata private to transaction.c
This function is only ever used in __btrfs_end_transaction and
btrfs_commit_transaction so there is no need to export it via header.
Let's move it closer to where it's used, make it static and remove it
from the header. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:30 +02:00
Anand Jain
15fc1283f6 btrfs: open code btrfs_init_dev_replace_tgtdev_for_resume()
btrfs_init_dev_replace_tgtdev_for_resume() initializes replace
target device in a few simple steps, so do it at the parent function.
Moreover, there isn't any other caller so just open code it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:30 +02:00
Anand Jain
18e67c73dc btrfs: btrfs_dev_replace_cancel() can return int
Current u64 return from btrfs_dev_replace_cancel() was probably done
to match the btrfs_ioctl_dev_replace_args::result. However as our
actual return value fits in int, and it further gets typecast to u64,
so just return int.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:30 +02:00
Anand Jain
17d202b973 btrfs: rename __btrfs_dev_replace_cancel()
Remove __ which is for the special functions.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:30 +02:00
Anand Jain
97282031a6 btrfs: open code btrfs_dev_replace_cancel()
btrfs_dev_replace_cancel() calls __btrfs_dev_replace_cancel() for the
actual cancel so just open code it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:29 +02:00
Nikolay Borisov
af89e0dc2c btrfs: Don't hardcode the csum size in btrfs_ordered_sum_size
Currently the function uses a hardcoded value for the checksum size of
a sector. This is fine, given that we currently support only a single
algorithm, whose checksum is 4 bytes == sizeof(u32). Despite not
having other algorithms, btrfs' design supports using a different
algorithm whith different space requirements. To future-proof the code
query the size of the currently used algorithm from the in-memory copy
of the super block. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:29 +02:00
Colin Ian King
97dc231e89 Btrfs: extent map selftest: add missing void parameter to btrfs_test_extent_map
Add a missing void parameter to function btrfs_test_extent_map, fixes
sparse warning:

warning: non-ANSI function declaration of function 'btrfs_test_extent_map'

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:29 +02:00
Colin Ian King
7a61f88088 btrfs: remove redundant check on ret and goto
The check for a non-zero ret is redundant as the goto will jump to
the very next statement anyway.  Remove this extraneous code.

Detected by CoverityScan, CID#1463784 ("Identical code for different
branches")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:29 +02:00
Nikolay Borisov
7806c6eb15 btrfs: Remove unused btrfs_start_transaction_lflush function
Commit 0e8c36a9fd ("Btrfs: fix lots of orphan inodes when the space
is not enough") changed the way transaction reservation is made in
btrfs_evict_node and as a result this function became unused. This has
been the status quo for 5 years in which time no one noticed, so I'd
say it's safe to assume it's unlikely it will ever be used again.

Historical note: there were more attempts to remove the function, the
reasoning was missing and only based on some static analysis tool
reports. Other reason for rejection was that there seemed to be
connection to BTRFS_RESERVE_FLUSH_LIMIT and that would need to be
removeed to. This was not correct so removing the function is all we can
do.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ add the note ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:29 +02:00
Howard McLauchlan
b6a535faed btrfs: print error if primary super block write fails
Presently, failing a primary super block write but succeeding in at
least one super block write in general will appear to users as if
nothing important went wrong. However, upon unmounting and re-mounting,
the file system will be in a rolled back state. This was discovered
with a BCC program that uses bpf_override_return() to fail super block
writes.

This patch outputs an error clarifying that the primary super block
write has failed, so users can expect potentially erroneous behaviour.
It also forces wait_dev_supers() to return an error to its caller if
the primary super block write fails.

Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:29 +02:00
Qu Wenruo
062d4d1f40 btrfs: Refactor parameter of BTRFS_MAX_DEVS() from root to fs_info
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:28 +02:00
Liu Bo
af2679e4a7 Btrfs: enhance leak debug checker for extent state and extent buffer
This prints out eb->bflags since it contains some useful information,
e.g. whether eb is dirty.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:28 +02:00
Joe Perches
447a5647c9 treewide: Align function definition open/close braces
Some functions definitions have either the initial open brace and/or
the closing brace outside of column 1.

Move those braces to column 1.

This allows various function analyzers like gnu complexity to work
properly for these modified functions.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Nicolin Chen <nicoleotsuka@gmail.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2018-03-26 11:13:09 +02:00
zhenwei.pi
dcae058a8d ext4: fix comments in ext4_swap_extents()
"mark_unwritten" in comment and "unwritten" in the function arguments
is mismatched.

Signed-off-by: zhenwei.pi <zhenwei.pi@youruncloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-03-26 01:44:03 -04:00
Goldwyn Rodrigues
043d20d159 ext4: use generic_writepages instead of __writepage/write_cache_pages
Code cleanup. Instead of writing an internal static function, use the
available generic_writepages().

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-03-26 01:32:50 -04:00
Dave Chinner
fa4493f0d9 xfs: remove dead inode version setting code
We can only get into the branch if CRCs are enabled, so there's no
need to check inside the branch for CRCs being enabled....

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-23 18:05:09 -07:00
Dave Chinner
ee457001ed xfs: catch inode allocation state mismatch corruption
We recently came across a V4 filesystem causing memory corruption
due to a newly allocated inode being setup twice and being added to
the superblock inode list twice. From code inspection, the only way
this could happen is if a newly allocated inode was not marked as
free on disk (i.e. di_mode wasn't zero).

Running the metadump on an upstream debug kernel fails during inode
allocation like so:

XFS: Assertion failed: ip->i_d.di_nblocks == 0, file: fs/xfs/xfs_inod=
e.c, line: 838
 ------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:114!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 11 PID: 3496 Comm: mkdir Not tainted 4.16.0-rc5-dgc #442
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/0=
1/2014
RIP: 0010:assfail+0x28/0x30
RSP: 0018:ffffc9000236fc80 EFLAGS: 00010202
RAX: 00000000ffffffea RBX: 0000000000004000 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff8227211b
RBP: ffffc9000236fce8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000bec R11: f000000000000000 R12: ffffc9000236fd30
R13: ffff8805c76bab80 R14: ffff8805c77ac800 R15: ffff88083fb12e10
FS:  00007fac8cbff040(0000) GS:ffff88083fd00000(0000) knlGS:0000000000000=
000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fffa6783ff8 CR3: 00000005c6e2b003 CR4: 00000000000606e0
Call Trace:
 xfs_ialloc+0x383/0x570
 xfs_dir_ialloc+0x6a/0x2a0
 xfs_create+0x412/0x670
 xfs_generic_create+0x1f7/0x2c0
 ? capable_wrt_inode_uidgid+0x3f/0x50
 vfs_mkdir+0xfb/0x1b0
 SyS_mkdir+0xcf/0xf0
 do_syscall_64+0x73/0x1a0
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

Extracting the inode number we crashed on from an event trace and
looking at it with xfs_db:

xfs_db> inode 184452204
xfs_db> p
core.magic = 0x494e
core.mode = 0100644
core.version = 2
core.format = 2 (extents)
core.nlinkv2 = 1
core.onlink = 0
.....

Confirms that it is not a free inode on disk. xfs_repair
also trips over this inode:

.....
zero length extent (off = 0, fsbno = 0) in ino 184452204
correcting nextents for inode 184452204
bad attribute fork in inode 184452204, would clear attr fork
bad nblocks 1 for inode 184452204, would reset to 0
bad anextents 1 for inode 184452204, would reset to 0
imap claims in-use inode 184452204 is free, would correct imap
would have cleared inode 184452204
.....
disconnected inode 184452204, would move to lost+found

And so we have a situation where the directory structure and the
inobt thinks the inode is free, but the inode on disk thinks it is
still in use. Where this corruption came from is not possible to
diagnose, but we can detect it and prevent the kernel from oopsing
on lookup. The reproducer now results in:

$ sudo mkdir /mnt/scratch/{0,1,2,3,4,5}{0,1,2,3,4,5}
mkdir: cannot create directory =E2=80=98/mnt/scratch/00=E2=80=99: File ex=
ists
mkdir: cannot create directory =E2=80=98/mnt/scratch/01=E2=80=99: File ex=
ists
mkdir: cannot create directory =E2=80=98/mnt/scratch/03=E2=80=99: Structu=
re needs cleaning
mkdir: cannot create directory =E2=80=98/mnt/scratch/04=E2=80=99: Input/o=
utput error
mkdir: cannot create directory =E2=80=98/mnt/scratch/05=E2=80=99: Input/o=
utput error
....

And this corruption shutdown:

[   54.843517] XFS (loop0): Corruption detected! Free inode 0xafe846c not=
 marked free on disk
[   54.845885] XFS (loop0): Internal error xfs_trans_cancel at line 1023 =
of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x425/0x670
[   54.848994] CPU: 10 PID: 3541 Comm: mkdir Not tainted 4.16.0-rc5-dgc #=
443
[   54.850753] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIO=
S 1.10.2-1 04/01/2014
[   54.852859] Call Trace:
[   54.853531]  dump_stack+0x85/0xc5
[   54.854385]  xfs_trans_cancel+0x197/0x1c0
[   54.855421]  xfs_create+0x425/0x670
[   54.856314]  xfs_generic_create+0x1f7/0x2c0
[   54.857390]  ? capable_wrt_inode_uidgid+0x3f/0x50
[   54.858586]  vfs_mkdir+0xfb/0x1b0
[   54.859458]  SyS_mkdir+0xcf/0xf0
[   54.860254]  do_syscall_64+0x73/0x1a0
[   54.861193]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
[   54.862492] RIP: 0033:0x7fb73bddf547
[   54.863358] RSP: 002b:00007ffdaa553338 EFLAGS: 00000246 ORIG_RAX: 0000=
000000000053
[   54.865133] RAX: ffffffffffffffda RBX: 00007ffdaa55449a RCX: 00007fb73=
bddf547
[   54.866766] RDX: 0000000000000001 RSI: 00000000000001ff RDI: 00007ffda=
a55449a
[   54.868432] RBP: 00007ffdaa55449a R08: 00000000000001ff R09: 00005623a=
8670dd0
[   54.870110] R10: 00007fb73be72d5b R11: 0000000000000246 R12: 000000000=
00001ff
[   54.871752] R13: 00007ffdaa5534b0 R14: 0000000000000000 R15: 00007ffda=
a553500
[   54.873429] XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1=
024 of file fs/xfs/xfs_trans.c.  Return address = ffffffff814cd050
[   54.882790] XFS (loop0): Corruption of in-memory data detected.  Shutt=
ing down filesystem
[   54.884597] XFS (loop0): Please umount the filesystem and rectify the =
problem(s)

Note that this crash is only possible on v4 filesystemsi or v5
filesystems mounted with the ikeep mount option. For all other V5
filesystems, this problem cannot occur because we don't read inodes
we are allocating from disk - we simply overwrite them with the new
inode information.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Tested-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-23 18:05:09 -07:00
Darrick J. Wong
b83e4c3ced xfs: xfs_scrub_iallocbt_xref_rmap_inodes should use xref_set_corrupt
In xfs_scrub_iallocbt_xref_rmap_inodes we're checking inodes against
rmap records, so we should use xfs_scrub_btree_xref_set_corrupt if we
encounter discrepancies here so that we know that it's a cross
referencing error, not necessarily a corruption in the inobt itself.

The userspace xfs_scrub program will try to repair outright corruptions
in the agi/inobt prior to phase 3 so that the inode scan will proceed.
If only a cross-referencing error is noted, the repair program defers
the repair attempt until it can check the other space metadata at least
once.

It is therefore essential that the inobt scrubber can correctly
distinguish between corruptions and "unable to cross-reference something
else with this inobt".  The same reasoning applies to "xfs: record inode
buf errors as a xref error in inobt scrubber".

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:09 -07:00
Darrick J. Wong
5927268f5a xfs: flag inode corruption if parent ptr doesn't get us a real inode
If a directory's parent inode pointer doesn't point to an inode, the
directory should be flagged as corrupt.  Enable IGET_UNTRUSTED here so
that _iget will return -EINVAL if the inobt does not confirm that the
inode is present and allocated and we can flag the directory corruption.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:08 -07:00
Darrick J. Wong
6a96c56505 xfs: don't accept inode buffers with suspicious unlinked chains
When we're verifying inode buffers, sanity-check the unlinked pointer.
We don't want to run the risk of trying to purge something that's
obviously broken.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:08 -07:00
Darrick J. Wong
8bb82bc12a xfs: move inode extent size hint validation to libxfs
Extent size hint validation is used by scrub to decide if there's an
error, and it will be used by repair to decide to remove the hint.
Since these use the same validation functions, move them to libxfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:08 -07:00
Darrick J. Wong
1b44a6aecc xfs: record inode buf errors as a xref error in inobt scrubber
During the inode btree scrubs we try to confirm the freemask bits
against the inode records.  If the inode buffer read fails, this is a
cross-referencing error, not a corruption of the inode btree itself.
Use the xref_process_error call here.  Found via core.version middlebit
fuzz in xfs/415.

The userspace xfs_scrub program will try to repair outright corruptions
in the agi/inobt prior to phase 3 so that the inode scan will proceed.
If only a cross-referencing error is noted, the repair program defers
the repair attempt until it can check the other space metadata at least
once.

It is therefore essential that the inobt scrubber can correctly
distinguish between corruptions and "unable to cross-reference something
else with this inobt".

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:08 -07:00
Darrick J. Wong
7e56d9eaea xfs: remove xfs_buf parameter from inode scrub methods
Now that we no longer do raw inode buffer scrubbing, the bp parameter is
no longer used anywhere we're dealing with an inode, so remove it and
all the useless NULL parameters that go with it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:08 -07:00
Darrick J. Wong
d0018ad889 xfs: inode scrubber shouldn't bother with raw checks
The inode scrubber tries to _iget the inode prior to running checks.
If that _iget call fails with corruption errors that's an automatic
fail, regardless of whether it was the inode buffer read verifier,
the ifork verifier, or the ifork formatter that errored out.

Therefore, get rid of the raw mode scrub code because it's not needed.
Found by trying to fix some test failures in xfs/379 and xfs/415.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:08 -07:00
Darrick J. Wong
5e777b62b0 xfs: bmap scrubber should do rmap xref with bmap for sparse files
When we're scanning an extent mapping inode fork, ensure that every rmap
record for this ifork has a corresponding bmbt record too.  This
(mostly) provides the ability to cross-reference rmap records with bmap
data.  The rmap scrubber cannot do the xref on its own because that
requires taking an ilock with the agf lock held, which violates our
locking order rules (inode, then agf).

Note that we only do this for forks that are in btree format due to the
increased complexity; or forks that should have data but suspiciously
have zero extents because the inode could have just had its iforks
zapped by the inode repair code and now we need to reclaim the old
extents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:07 -07:00
Darrick J. Wong
6edb181053 xfs: refactor inode buffer verifier error logging
When the inode buffer verifier encounters an error, it's much more
helpful to print a buffer from the offending inode instead of just the
start of the inode chunk buffer.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:07 -07:00
Darrick J. Wong
90a58f9571 xfs: refactor inode verifier error logging
Refactor some of the inode verifier failure logging call sites to use
the new xfs_inode_verifier_error method which dumps the offending buffer
as well as the code location of the failed check.  This trims the
output, makes it clearer to the admin that repair must be run, and gives
the developers more details to work from.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:07 -07:00
Darrick J. Wong
30b0984d91 xfs: refactor bmap record validation
Refactor the bmap validator into a more complete helper that looks for
extents that run off the end of the device, overflow into the next AG,
or have invalid flag states.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:07 -07:00
Darrick J. Wong
6915ef35c0 xfs: sanity-check the unused space before trying to use it
In xfs_dir2_data_use_free, we examine on-disk metadata and ASSERT if
it doesn't make sense.  Since a carefully crafted fuzzed image can cause
the kernel to crash after blowing a bunch of assertions, let's move
those checks into a validator function and rig everything up to return
EFSCORRUPTED to userspace.  Found by lastbit fuzzing ltail.bestcount via
xfs/391.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-23 18:05:07 -07:00
Brian Foster
a27ba2607e xfs: detect agfl count corruption and reset agfl
The struct xfs_agfl v5 header was originally introduced with
unexpected padding that caused the AGFL to operate with one less
slot than intended. The header has since been packed, but the fix
left an incompatibility for users who upgrade from an old kernel
with the unpacked header to a newer kernel with the packed header
while the AGFL happens to wrap around the end. The newer kernel
recognizes one extra slot at the physical end of the AGFL that the
previous kernel did not. The new kernel will eventually attempt to
allocate a block from that slot, which contains invalid data, and
cause a crash.

This condition can be detected by comparing the active range of the
AGFL to the count. While this detects a padding mismatch, it can
also trigger false positives for unrelated flcount corruption. Since
we cannot distinguish a size mismatch due to padding from unrelated
corruption, we can't trust the AGFL enough to simply repopulate the
empty slot.

Instead, avoid unnecessarily complex detection logic and and use a
solution that can handle any form of flcount corruption that slips
through read verifiers: distrust the entire AGFL and reset it to an
empty state. Any valid blocks within the AGFL are intentionally
leaked. This requires xfs_repair to rectify (which was already
necessary based on the state the AGFL was found in). The reset
mitigates the side effect of the padding mismatch problem from a
filesystem crash to a free space accounting inconsistency. The
generic approach also means that this patch can be safely backported
to kernels with or without a packed struct xfs_agfl.

Check the AGF for an invalid freelist count on initial read from
disk. If detected, set a flag on the xfs_perag to indicate that a
reset is required before the AGFL can be used. In the first
transaction that attempts to use a flagged AGFL, reset it to empty,
warn the user about the inconsistency and allow the freelist fixup
code to repopulate the AGFL with new blocks. The xfs_perag flag is
cleared to eliminate the need for repeated checks on each block
allocation operation.

This allows kernels that include the packing fix commit 96f859d52b
("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct")
to handle older unpacked AGFL formats without a filesystem crash.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by Dave Chiluk <chiluk+linuxxfs@indeed.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-23 18:05:06 -07:00
Christoph Hellwig
3e4da466bf xfs: unwind the try_again loop in xfs_log_force
Instead split out a __xfs_log_fore_lsn helper that gets called again
with the already_slept flag set to true in case we had to sleep.

This prepares for aio_fsync support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-23 18:05:06 -07:00
Christoph Hellwig
93806299b5 xfs: refactor xfs_log_force_lsn
Use the the smallest possible loop as preable to find the correct iclog
buffer, and then use gotos for unwinding to straighten the code.

Also fix the top of function comment while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-23 18:05:06 -07:00
Andreas Gruenbacher
bb491ce67a gfs2: Check for the end of metadata in punch_hole
When punching a hole or truncating an inode down to a given size, also
check if the truncate point / start of the hole is within the range we
have metadata for.  Otherwise, we can end up freeing blocks that
shouldn't be freed, corrupting the inode, or crashing the machine when
trying to punch a hole into the void.

When growing an inode via truncate, we set the new size but we don't
allocate additional levels of indirect blocks and grow the inode height.
When shrinking that inode again, the new size may still point beyond the
end of the inode's metadata.

Fixes xfstest generic/476.

Debugged-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-23 11:43:02 -07:00
David S. Miller
03fe2debbb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Fun set of conflict resolutions here...

For the mac80211 stuff, these were fortunately just parallel
adds.  Trivially resolved.

In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
function phy_disable_interrupts() earlier in the file, whilst in
'net-next' the phy_error() call from this function was removed.

In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
'rt_table_id' member of rtable collided with a bug fix in 'net' that
added a new struct member "rt_mtu_locked" which needs to be copied
over here.

The mlxsw driver conflict consisted of net-next separating
the span code and definitions into separate files, whilst
a 'net' bug fix made some changes to that moved code.

The mlx5 infiniband conflict resolution was quite non-trivial,
the RDMA tree's merge commit was used as a guide here, and
here are their notes:

====================

    Due to bug fixes found by the syzkaller bot and taken into the for-rc
    branch after development for the 4.17 merge window had already started
    being taken into the for-next branch, there were fairly non-trivial
    merge issues that would need to be resolved between the for-rc branch
    and the for-next branch.  This merge resolves those conflicts and
    provides a unified base upon which ongoing development for 4.17 can
    be based.

    Conflicts:
            drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
            (IB/mlx5: Fix cleanup order on unload) added to for-rc and
            commit b5ca15ad7e (IB/mlx5: Add proper representors support)
            add as part of the devel cycle both needed to modify the
            init/de-init functions used by mlx5.  To support the new
            representors, the new functions added by the cleanup patch
            needed to be made non-static, and the init/de-init list
            added by the representors patch needed to be modified to
            match the init/de-init list changes made by the cleanup
            patch.
    Updates:
            drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
            prototypes added by representors patch to reflect new function
            names as changed by cleanup patch
            drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
            stage list to match new order from cleanup patch
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 11:31:58 -04:00
Mimi Zohar
0834136aea fuse: define the filesystem as untrusted
Files on FUSE can change at any point in time without IMA being able
to detect it.  The file data read for the file signature verification
could be totally different from what is subsequently read, making the
signature verification useless.

FUSE can be mounted by unprivileged users either today with fusermount
installed with setuid, or soon with the upcoming patches to allow FUSE
mounts in a non-init user namespace.

This patch sets the SB_I_IMA_UNVERIFIABLE_SIGNATURE flag and when
appropriate sets the SB_I_UNTRUSTED_MOUNTER flag.

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Dongsu Park <dongsu@kinvolk.io>
Cc: Alban Crequy <alban@kinvolk.io>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-23 06:31:37 -04:00
Linus Torvalds
f36b7534b8 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "13 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm, thp: do not cause memcg oom for thp
  mm/vmscan: wake up flushers for legacy cgroups too
  Revert "mm: page_alloc: skip over regions of invalid pfns where possible"
  mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink()
  mm/thp: do not wait for lock_page() in deferred_split_scan()
  mm/khugepaged.c: convert VM_BUG_ON() to collapse fail
  x86/mm: implement free pmd/pte page interfaces
  mm/vmalloc: add interfaces to free unmapped page table
  h8300: remove extraneous __BIG_ENDIAN definition
  hugetlbfs: check for pgoff value overflow
  lockdep: fix fs_reclaim warning
  MAINTAINERS: update Mark Fasheh's e-mail
  mm/mempolicy.c: avoid use uninitialized preferred_node
2018-03-22 18:48:43 -07:00
Mike Kravetz
63489f8e82 hugetlbfs: check for pgoff value overflow
A vma with vm_pgoff large enough to overflow a loff_t type when
converted to a byte offset can be passed via the remap_file_pages system
call.  The hugetlbfs mmap routine uses the byte offset to calculate
reservations and file size.

A sequence such as:

  mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
  remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);

will result in the following when task exits/file closed,

  kernel BUG at mm/hugetlb.c:749!
  Call Trace:
    hugetlbfs_evict_inode+0x2f/0x40
    evict+0xcb/0x190
    __dentry_kill+0xcb/0x150
    __fput+0x164/0x1e0
    task_work_run+0x84/0xa0
    exit_to_usermode_loop+0x7d/0x80
    do_syscall_64+0x18b/0x190
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

The overflowed pgoff value causes hugetlbfs to try to set up a mapping
with a negative range (end < start) that leaves invalid state which
causes the BUG.

The previous overflow fix to this code was incomplete and did not take
the remap_file_pages system call into account.

[mike.kravetz@oracle.com: v3]
  Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com
[akpm@linux-foundation.org: include mmdebug.h]
[akpm@linux-foundation.org: fix -ve left shift count on sh]
Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com
Fixes: 045c7a3f53 ("hugetlbfs: fix offset overflow in hugetlbfs mmap")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Nic Losby <blurbdust@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22 17:07:01 -07:00
James Morris
5893ed18a2 Linux 4.16-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlqvCPYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGOaAH/171cgZGFEXSONxK
 3O1AAv61wN5K/ISMt6mnelWR6fZg195FarOx0Rnq7Ot8OWuVa8CGcyT4vX4Z7nb9
 SVMQKNMPCVQE4WCDOv6S0njChmRC0BxBoVJtTN9fhywdYgX1KcaTS/drMRHACF5n
 rB9eouMQScfMzKGAW08gp5NvEGJ6W1SLX7La3/u0751dYisdJSP7+vFZNxUrGXEA
 yIPOQjFu0Tfo8GXz/BwC678RZVzVLN0sE6+/vM7zNnoDlsRVkdDIVMo3UiVqm/NK
 B37/TlZz8CYoapoKnRRB5giXnSPDSXtsikbGy3mcy0u5imGe+ZgdjrdYSaLk31cR
 NVZY08k=
 =pu3X
 -----END PGP SIGNATURE-----

Merge tag 'v4.16-rc6' into next-general

Merge to Linux 4.16-rc6 at the request of Jarkko, for his TPM updates.
2018-03-23 08:26:16 +11:00
Linus Torvalds
c4f4d2f917 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Always validate XFRM esn replay attribute, from Florian Westphal.

 2) Fix RCU read lock imbalance in xfrm_get_tos(), from Xin Long.

 3) Don't try to get firmware dump if not loaded in iwlwifi, from Shaul
    Triebitz.

 4) Fix BPF helpers to deal with SCTP GSO SKBs properly, from Daniel
    Axtens.

 5) Fix some interrupt handling issues in e1000e driver, from Benjamin
    Poitier.

 6) Use strlcpy() in several ethtool get_strings methods, from Florian
    Fainelli.

 7) Fix rhlist dup insertion, from Paul Blakey.

 8) Fix SKB leak in netem packet scheduler, from Alexey Kodanev.

 9) Fix driver unload crash when link is up in smsc911x, from Jeremy
    Linton.

10) Purge out invalid socket types in l2tp_tunnel_create(), from Eric
    Dumazet.

11) Need to purge the write queue when TCP connections are aborted,
    otherwise userspace using MSG_ZEROCOPY can't close the fd. From
    Soheil Hassas Yeganeh.

12) Fix double free in error path of team driver, from Arkadi
    Sharshevsky.

13) Filter fixes for hv_netvsc driver, from Stephen Hemminger.

14) Fix non-linear packet access in ipv6 ndisc code, from Lorenzo
    Bianconi.

15) Properly filter out unsupported feature flags in macvlan driver,
    from Shannon Nelson.

16) Don't request loading the diag module for a protocol if the protocol
    itself is not even registered. From Xin Long.

17) If datagram connect fails in ipv6, make sure the socket state is
    consistent afterwards. From Paolo Abeni.

18) Use after free in qed driver, from Dan Carpenter.

19) If received ipv4 PMTU is less than the min pmtu, lock the mtu in the
    entry. From Sabrina Dubroca.

20) Fix sleep in atomic in tg3 driver, from Jonathan Toppins.

21) Fix vlan in vlan untagging in some situations, from Toshiaki Makita.

22) Fix double SKB free in genlmsg_mcast(). From Nicolas Dichtel.

23) Fix NULL derefs in error paths of tcf_*_init(), from Davide Caratti.

24) Unbalanced PM runtime calls in FEC driver, from Florian Fainelli.

25) Memory leak in gemini driver, from Igor Pylypiv.

26) IDR leaks in error paths of tcf_*_init() functions, from Davide
    Caratti.

27) Need to use GFP_ATOMIC in seg6_build_state(), from David Lebrun.

28) Missing dev_put() in error path of macsec_newlink(), from Dan
    Carpenter.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (201 commits)
  macsec: missing dev_put() on error in macsec_newlink()
  net: dsa: Fix functional dsa-loop dependency on FIXED_PHY
  hv_netvsc: common detach logic
  hv_netvsc: change GPAD teardown order on older versions
  hv_netvsc: use RCU to fix concurrent rx and queue changes
  hv_netvsc: disable NAPI before channel close
  net/ipv6: Handle onlink flag with multipath routes
  ppp: avoid loop in xmit recursion detection code
  ipv6: sr: fix NULL pointer dereference when setting encap source address
  ipv6: sr: fix scheduling in RCU when creating seg6 lwtunnel state
  net: aquantia: driver version bump
  net: aquantia: Implement pci shutdown callback
  net: aquantia: Allow live mac address changes
  net: aquantia: Add tx clean budget and valid budget handling logic
  net: aquantia: Change inefficient wait loop on fw data reads
  net: aquantia: Fix a regression with reset on old firmware
  net: aquantia: Fix hardware reset when SPI may rarely hangup
  s390/qeth: on channel error, reject further cmd requests
  s390/qeth: lock read device while queueing next buffer
  s390/qeth: when thread completes, wake up all waiters
  ...
2018-03-22 14:10:29 -07:00
Eric Sandeen
0d9366d67b ext4: don't complain about incorrect features when probing
If mount is auto-probing for filesystem type, it will try various
filesystems in order, with the MS_SILENT flag set.  We get
that flag as the silent arg to ext4_fill_super.

If we're probing (silent==1) then don't complain about feature
incompatibilities that are found if it looks like it's actually
a different valid extN type - failed probes should be silent
in this case.

If the on-disk features are unknown even to ext4, then complain.

Reported-by: Joakim Tjernlund <Joakim.Tjernlund@infinera.com>
Tested-by: Joakim Tjernlund <Joakim.Tjernlund@infinera.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-03-22 11:59:00 -04:00
Nikolay Borisov
1d39834fba ext4: remove EXT4_STATE_DIOREAD_LOCK flag
Commit 16c5468859 ("ext4: Allow parallel DIO reads") reworked the way
locking happens around parallel dio reads. This resulted in obviating
the need for EXT4_STATE_DIOREAD_LOCK flag and accompanying logic.
Currently this amounts to dead code so let's remove it. No functional
changes

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-03-22 11:52:10 -04:00
Jiri Slaby
fe23cb65c2 ext4: fix offset overflow on 32-bit archs in ext4_iomap_begin()
ext4_iomap_begin() has a bug where offset returned in the iomap
structure will be truncated to unsigned long size. On 64-bit
architectures this is fine but on 32-bit architectures obviously not.
Not many places actually use the offset stored in the iomap structure
but one of visible failures is in SEEK_HOLE / SEEK_DATA implementation.
If we create a file like:

dd if=/dev/urandom of=file bs=1k seek=8m count=1

then

lseek64("file", 0x100000000ULL, SEEK_DATA)

wrongly returns 0x100000000 on unfixed kernel while it should return
0x200000000. Avoid the overflow by proper type cast.

Fixes: 545052e9e3 ("ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA")
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org # v4.15
2018-03-22 11:50:26 -04:00
Eryu Guan
45d8ec4d9f ext4: update i_disksize if direct write past ondisk size
Currently in ext4 direct write path, we update i_disksize only when
new eof is greater than i_size, and don't update it even when new
eof is greater than i_disksize but less than i_size. This doesn't
work well with delalloc buffer write, which updates i_size and
i_disksize only when delalloc blocks are resolved (at writeback
time), the i_disksize from direct write can be lost if a previous
buffer write succeeded at write time but failed at writeback time,
then results in corrupted ondisk inode size.

Consider this case, first buffer write 4k data to a new file at
offset 16k with delayed allocation, then direct write 4k data to the
same file at offset 4k before delalloc blocks are resolved, which
doesn't update i_disksize because it writes within i_size(20k), but
the extent tree metadata has been committed in journal. Then
writeback of the delalloc blocks fails (due to device error etc.),
and i_size/i_disksize from buffer write can't be written to disk
(still zero). A subsequent umount/mount cycle recovers journal and
writes extent tree metadata from direct write to disk, but with
i_disksize being zero.

Fix it by updating i_disksize too in direct write path when new eof
is greater than i_disksize but less than i_size, so i_disksize is
always consistent with direct write.

This fixes occasional i_size corruption in fstests generic/475.

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-03-22 11:44:59 -04:00
Eryu Guan
73fdad00b2 ext4: protect i_disksize update by i_data_sem in direct write path
i_disksize update should be protected by i_data_sem, by either taking
the lock explicitly or by using ext4_update_i_disksize() helper. But the
i_disksize updates in ext4_direct_IO_write() are not protected at all,
which may be racing with i_disksize updates in writeback path in
delalloc buffer write path.

This is found by code inspection, and I didn't hit any i_disksize
corruption due to this bug. Thanks to Jan Kara for catching this bug and
suggesting the fix!

Reported-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-03-22 11:41:25 -04:00
Richard Guy Briggs
ea841bafda audit: add refused symlink to audit_names
Audit link denied events for symlinks had duplicate PATH records rather
than just updating the existing PATH record.  Update the symlink's PATH
record with the current dentry and inode information.

See: https://github.com/linux-audit/audit-kernel/issues/21

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-03-21 11:31:03 -04:00
Richard Guy Briggs
94b9d9b7a1 audit: remove path param from link denied function
In commit 45b578fe4c
("audit: link denied should not directly generate PATH record")
the need for the struct path *link parameter was removed.
Remove the now useless struct path argument.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-03-21 11:17:41 -04:00
Linus Torvalds
645102eac1 Just one fix for an occasional panic from Jeff Layton.
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJasXu4AAoJECebzXlCjuG+fNcP/3QYZwLdIYe0pZSIn/9V6sW1
 PwmxhTkBOEkeP4RTArI/4spZk3mrTqv7SC+ACYKbkD910uDC8cY3ElcwKxi82NL7
 9M4gNGnscKJtUye2LhoamZZLOLDIzr302pEv0IIh3v11sAu0rK1K0bqHSfI1ga/q
 r3S1sRtad+UktVH+jJ8MNE4NBfXC/xTqPz6Z+wd30CPmJKaA7Oqle7Wd3479zj0c
 A7l+EtleFo/bctHwY7yp+lVs1Yqr3ShS+qu+YAknBTnGco9JJFJsZM6x33DFMZC4
 EvFPO0gh4PuN/SkgT4zBvBlSXQjuqUWTU98ohqP7ugTtd0dXEIpR4DLLwWScA2KV
 z+YIJIKUh8Ryh8KI+J50Ed9WseVlmACHOBTJjLW7KnSYtey3+mgh5kp6Vcd9IxHc
 Gju7YgDK5FOV2JykaHQ/qCaSyL6ao1firapSrtZK4N27LpJ05FreK6qnLJtHmiwU
 pqVHyfrH4kfho1T0Nr/bw4bnugBKZVXBbBQLKp2twgJhYkUkP+vxV8QMX3h531hB
 jsQV9OIVPkdTjr5L0BmQMJ4GugM1Zr4mIunY6Mw0z5ax1JeBUelWeQccqL1YCSGL
 TZr+0N9yOuzQHX+e+mZ8vFYqfUTRF/Xfkn1+ZV3ZQU/yY8oBkDbQddaLoo1hjLUx
 2dpR3xIhzyDkP59jrK4M
 =oGAf
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.16-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd fix from Bruce Fields:
 "Just one fix for an occasional panic from Jeff Layton"

* tag 'nfsd-4.16-1' of git://linux-nfs.org/~bfields/linux:
  nfsd: remove blocked locks on client teardown
2018-03-20 16:10:26 -07:00
J. Bruce Fields
353601e7d3 nfsd: create a separate lease for each delegation
Currently we only take one vfs-level delegation (lease) for each file,
no matter how many clients hold delegations on that file.

Let's instead keep a one-to-one mapping between NFSv4 delegations and
VFS delegations.  This turns out to be simpler.

There is still a many-to-one mapping of NFS opens to NFS files, and the
delegations on one file are all associated with one struct file.  The
VFS can still distinguish between these delegations since we're setting
fl_owner to the struct nfs4_delegation now, not to the shared file.

I'm replacing at least one complicated function wholesale, which I don't
like to do, but I haven't figured out how to do this more incrementally.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2018-03-20 17:51:14 -04:00
J. Bruce Fields
86d29b10eb nfsd: move sc_file assignment into alloc_init_deleg
Take an easy chance to simplify the caller a little.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2018-03-20 17:51:13 -04:00
J. Bruce Fields
0af6e690f0 nfsd: factor out common delegation-destruction code
Pull some duplicated code into a common helper.

This changes the order in destroy_delegation a little, but it looks to
me like that shouldn't matter.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2018-03-20 17:51:13 -04:00
J. Bruce Fields
68b18f5294 nfsd: make nfs4_get_existing_delegation less confusing
This doesn't "get" anything.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2018-03-20 17:51:12 -04:00
J. Bruce Fields
0c911f5408 nfsd4: dp->dl_stid.sc_file doesn't need locking
The delegation isn't visible to anyone yet.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2018-03-20 17:51:12 -04:00
J. Bruce Fields
653e514e9e nfsd4: set fl_owner to delegation, not file pointer
For now this makes no difference, as for files having delegations,
there's a one-to-one relationship between an nfs4_file and its
nfs4_delegation.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2018-03-20 17:51:11 -04:00
J. Bruce Fields
cba7b3d150 nfsd: simplify nfs4_put_deleg_lease calls
Every single caller gets the file out of the delegation, so let's do
that once in nfs4_put_deleg_lease.

Plus we'll need it there for other reasons.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2018-03-20 17:51:11 -04:00
J. Bruce Fields
b8232d3315 nfsd: simplify put of fi_deleg_file
fi_delegees is basically just a reference count on users of
fi_deleg_file, which is cleared when fi_delegees goes to zero.  The
fi_deleg_file check here is redundant.  Also add an assertion to make
sure we don't have unbalanced puts.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2018-03-20 17:51:10 -04:00
Greg Kroah-Hartman
4958134df5 Merge 4.16-rc6 into tty-next
We want the serial/tty fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-20 11:27:18 +01:00
Peter Zijlstra
e24e960c7f sched/wait, fs/ocfs2: Convert wait_on_atomic_t() usage to the new wait_var_event() API
The old wait_on_atomic_t() is going to get removed, use the more
flexible wait_var_event() API instead.

No change in functionality.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 08:23:22 +01:00
Peter Zijlstra
723c921e7d sched/wait, fs/nfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
The old wait_on_atomic_t() is going to get removed, use the more
flexible wait_var_event() API instead.

No change in functionality.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 08:23:21 +01:00
Peter Zijlstra
dc5d4afbb0 sched/wait, fs/fscache: Convert wait_on_atomic_t() usage to the new wait_var_event() API
The old wait_on_atomic_t() is going to get removed, use the more
flexible wait_var_event() API instead.

No change in functionality.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 08:23:21 +01:00
Peter Zijlstra
4625956a4e sched/wait, fs/btrfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
The old wait_on_atomic_t() is going to get removed, use the more
flexible wait_var_event() API instead.

No change in functionality.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 08:23:20 +01:00
Peter Zijlstra
ab1fbe3247 sched/wait, fs/afs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
The old wait_on_atomic_t() is going to get removed, use the more
flexible wait_var_event() API instead.

No change in functionality.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 08:23:19 +01:00
Grygorii Strashko
2399ac42e7 sysfs: symlink: export sysfs_create_link_nowarn()
The sysfs_create_link_nowarn() is going to be used in phylib framework in
subsequent patch which can be built as module. Hence, export
sysfs_create_link_nowarn() to avoid build errors.

Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Fixes: a399546049 ("net: phy: Relax error checking on sysfs_create_link()")
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-19 21:14:26 -04:00
Jeff Layton
9258a2d5cd nfsd: move nfs4_client allocation to dedicated slabcache
On x86_64, it's 1152 bytes, so we can avoid wasting 896 bytes each.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-03-19 16:38:13 -04:00
J. Bruce Fields
9d7ed1355d nfsd: don't require low ports for gss requests
In a traditional NFS deployment using auth_unix, the clients are trusted
to correctly report the credentials of their logged-in users.  The
server assumes that only root on client machines is allowed to send
requests from low-numbered ports, so it can use the originating port
number to distinguish "real" NFS clients from NFS clients run by
ordinary users, to prevent ordinary users from spoofing credentials.

The originating port number on a gss-authenticated request is less
important.  The authentication ties the request to a user, and we take
it as proof that that user authorized the request.  The low port number
check no longer adds much.

So, don't enforce low port numbers in the auth_gss case.

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-03-19 16:38:13 -04:00
J. Bruce Fields
edcc8452a0 nfsd: remove unsused "cp_consecutive" field
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-03-19 16:38:13 -04:00
Colin Ian King
554faf2819 lockd: make nlm_ntf_refcnt and nlm_ntf_wq static
The variables nlm_ntf_refcnt and nlm_ntf_wq are local to the source and
do not need to be in global scope, so make them static.

Cleans up sparse warnings:
fs/lockd/svc.c:60:10: warning: symbol 'nlm_ntf_refcnt' was not declared.
Should it be static?
fs/lockd/svc.c:61:1: warning: symbol 'nlm_ntf_wq' was not declared.
Should it be static?

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-03-19 16:38:13 -04:00
Jeff Layton
bd2decac5e nfsd4: send the special close_stateid in v4.0 replies as well
We already send it for v4.1, but RFC7530 also notes that the stateid in
the close reply is bogus.

Always send the special close stateid, even in v4.0 responses. No client
should put any meaning on it whatsoever. For now, we continue to
increment the stateid value, though that might not be necessary either.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-03-19 16:38:12 -04:00
Jeff Layton
68ef3bc316 nfsd: remove blocked locks on client teardown
We had some reports of panics in nfsd4_lm_notify, and that showed a
nfs4_lockowner that had outlived its so_client.

Ensure that we walk any leftover lockowners after tearing down all of
the stateids, and remove any blocked locks that they hold.

With this change, we also don't need to walk the nbl_lru on nfsd_net
shutdown, as that will happen naturally when we tear down the clients.

Fixes: 76d348fadf (nfsd: have nfsd4_lock use blocking locks for v4.1+ locks)
Reported-by: Frank Sorenson <fsorenso@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@vger.kernel.org # 4.9
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-03-19 16:37:21 -04:00
Tejun Heo
f729863a8c fs/aio: Use rcu_work instead of explicit rcu and work item
Workqueue now has rcu_work.  Use it instead of open-coding rcu -> work
item bouncing.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-03-19 10:12:03 -07:00
Yunlei He
0833721ec3 f2fs: check blkaddr more accuratly before issue a bio
This patch check blkaddr more accuratly before issue a
write or read bio.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-18 23:33:50 -07:00
Ritesh Harjani
02117b8ae9 f2fs: Set GF_NOFS in read_cache_page_gfp while doing f2fs_quota_read
Quota code itself is serializing the operations by taking mutex_lock.
It seems a below deadlock can happen if GF_NOFS is not used in
f2fs_quota_read

__switch_to+0x88
__schedule+0x5b0
schedule+0x78
schedule_preempt_disabled+0x20
__mutex_lock_slowpath+0xdc   		//mutex owner is itself
mutex_lock+0x2c
dquot_commit+0x30			//mutex_lock(&dqopt->dqio_mutex);
dqput+0xe0
__dquot_drop+0x80
dquot_drop+0x48
f2fs_evict_inode+0x218
evict+0xa8
dispose_list+0x3c
prune_icache_sb+0x58
super_cache_scan+0xf4
do_shrink_slab+0x208
shrink_slab.part.40+0xac
shrink_zone+0x1b0
do_try_to_free_pages+0x25c
try_to_free_pages+0x164
__alloc_pages_nodemask+0x534
do_read_cache_page+0x6c
read_cache_page+0x14
f2fs_quota_read+0xa4
read_blk+0x54
find_tree_dqentry+0xe4
find_tree_dqentry+0xb8
find_tree_dqentry+0xb8
find_tree_dqentry+0xb8
qtree_read_dquot+0x68
v2_read_dquot+0x24
dquot_acquire+0x5c			// mutex_lock(&dqopt->dqio_mutex);
dqget+0x238
__dquot_initialize+0xd4
dquot_initialize+0x10
dquot_file_open+0x34
f2fs_file_open+0x6c
do_dentry_open+0x1e4
vfs_open+0x6c
path_openat+0xa20
do_filp_open+0x4c
do_sys_open+0x178

Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-18 23:31:28 -07:00
Ingo Molnar
793057e1c7 vfs: Replace stray non-ASCII homoglyph characters with their ASCII equivalents
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-19 01:07:42 -04:00
Rasmus Villemoes
1c94984396 vfs: make sure struct filename->iname is word-aligned
I noticed that offsetof(struct filename, iname) is actually 28 on 64
bit platforms, so we always pass an unaligned pointer to
strncpy_from_user. This is mostly a problem for those 64 bit platforms
without HAVE_EFFICIENT_UNALIGNED_ACCESS, but even on x86_64, unaligned
accesses carry a penalty.

A user-space microbenchmark doing nothing but strncpy_from_user from the
same (aligned) source string runs about 5% faster when the destination
is aligned. That number increases to 20% when the string is long
enough (~32 bytes) that we cross a cache line boundary - that's for
example the case for about half the files a "git status" in a kernel
tree ends up stat'ing.

This won't make any real-life workloads 5%, or even 1%, faster, but path
lookup is common enough that cutting even a few cycles should be
worthwhile. So ensure we always pass an aligned destination pointer to
strncpy_from_user. Instead of explicit padding, simply swap the refcnt
and aname members, as suggested by Al Viro.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-19 01:07:04 -04:00
Kees Cook
7bd698b3c0 exec: Set file unwritable before LSM check
The LSM check should happen after the file has been confirmed to be
unchanging. Without this, we could have a race between the Time of Check
(the call to security_kernel_read_file() which could read the file and
make access policy decisions) and the Time of Use (starting with
kernel_read_file()'s reading of the file contents). In theory, file
contents could change between the two.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: James Morris <james.morris@microsoft.com>
2018-03-19 15:49:32 +11:00
Sheng Yong
ff62af200b f2fs: introduce a new mount option test_dummy_encryption
This patch introduces a new mount option `test_dummy_encryption'
to allow fscrypt to create a fake fscrypt context. This is used
by xfstests.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 14:02:57 +09:00
Sheng Yong
b7c409deda f2fs: introduce F2FS_FEATURE_LOST_FOUND feature
This patch introduces a new feature, F2FS_FEATURE_LOST_FOUND, which
is set by mkfs. mkfs creates a directory named lost+found, which saves
unreachable files. If fsck finds a file which has no parent, or its
parent is removed by fsck, the file will be placed under lost+found
directory by fsck.

lost+found directory could not be encrypted. As a result, the root
directory cannot be encrypted too. So if LOST_FOUND feature is enabled,
let's avoid to encrypt root directory.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 14:02:47 +09:00
Qiuyang Sun
da5ce874f8 f2fs: release locks before return in f2fs_ioc_gc_range()
Currently, we will leave the kernel with locks still held when the gc_range
is invalid. This patch fixes the bug.

Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:58:16 +09:00
Jaegeuk Kim
bb1105e479 f2fs: align memory boundary for bitops
For example, in arm64, free_nid_bitmap should be aligned to word size in order
to use bit operations.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:57:39 +09:00
Chao Yu
c56675750d f2fs: remove unneeded set_cold_node()
When setting COLD_BIT_SHIFT flag in node block, we only need to call
set_cold_node() in new_node_page() and recover_inode_page() during
node page initialization. So remove unneeded set_cold_node() in other
places.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:57:33 +09:00
Hyunchul Lee
b91050a80c f2fs: add nowait aio support
This patch adds nowait aio support[1].

Return EAGAIN if any of the following checks fail for direct I/O:
  - i_rwsem is not lockable
  - Blocks are not allocated at the write location

And xfstests generic/471 is passed.

 [1]: 6be96d "Introduce RWF_NOWAIT and FMODE_AIO_NOWAIT"

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:57:32 +09:00
Chao Yu
63189b7859 f2fs: wrap all options with f2fs_sb_info.mount_opt
This patch merges miscellaneous mount options into struct f2fs_mount_info,
After this patch, once we add new mount option, we don't need to worry
about recovery of it in remount_fs(), since we will recover the
f2fs_sb_info.mount_opt including all options.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:57:30 +09:00
Yunlei He
5d7881cadf f2fs: Don't overwrite all types of node to keep node chain
Currently, we enable node SSR by default, and mixed
different types of node segment to do SSR more intensively.
Although reuse warm node is not allowed, warm node chain
will be destroyed by errors introduced by other types
node chain. So we'd better forbid reusing all types
of node to keep warm node chain.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:57:29 +09:00
Junling Zheng
93cf93f17c f2fs: introduce mount option for fsync mode
Commit "0a007b97aad6"(f2fs: recover directory operations by fsync)
fixed xfstest generic/342 case, but it also increased the written
data and caused the performance degradation. In most cases, there's
no need to do so heavy fsync actually.

So we introduce new mount option "fsync_mode={posix,strict}" to
control the policy of fsync. "fsync_mode=posix" is set by default,
and means that f2fs uses a light fsync, which follows POSIX semantics.
And "fsync_mode=strict" means that it's a heavy fsync, which behaves
in line with xfs, ext4 and btrfs, where generic/342 will pass, but
the performance will regress.

Signed-off-by: Junling Zheng <zhengjunling@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:57:28 +09:00
Chao Yu
eea52882d8 f2fs: fix to restore old mount option in ->remount_fs
This patch fixes to restore old mount option once we encounter failure
in ->remount_fs.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:57:27 +09:00
Chao Yu
deeedd71e3 f2fs: wrap sb_rdonly with f2fs_readonly
Use f2fs_readonly to wrap sb_rdonly for cleanup, and spread it in
all places.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:57:26 +09:00
Jaegeuk Kim
162b27aec9 f2fs: avoid selinux denial on CAP_SYS_RESOURCE
This fixes CAP_SYS_RESOURCE denial of selinux when using resgid, since it
seems selinux reports it at the first place, but mostly we don't need to
check this condition first.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-17 13:55:43 +09:00
Linus Torvalds
8f5fd927c3 for-4.16-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlqrzY4ACgkQxWXV+ddt
 WDtV4BAAiiv5XECNB3lIvcoFjst8E7xIJNeKKzFZh8HNm+L94zP00uqrpHkwQSUv
 tk9TGSYgbJJZMhw6I1rwYSKsoTkvP6NVMQZjpkJktWto9NXObKPvJzjKMdB+GYQR
 TWGLBaHsMhruMTOewhdVsxA5od5+LravygHaLmTDQmho8jZSSEe0qqpqlakRI9sQ
 +VZv1PDEI68IDMKMjOT7de7Ssq9c99BpGNXxvi5fy5kijyfgrs/BUfHBNV5r66MS
 y0Yw1Ab+nlLC7cE5Ql24/snDojcLoSl7f5eYt7ib+DAmsJucYQzTfUetFcrYf8IC
 EAmFs5Br6pbopi5M1IOxGZABNbr0qMS+zLvoF52cyka32nrmK/IMWl1s8bW0xfva
 gxkjouzJDOTMq28asSdyu73i9yFyLB+tEHpojECljo+K5ASzvDad34xqVRlxY6vN
 UJm4d76OqTfAyTRo7sQRWfLz/Cb1W+/v0mPotPWHWVyF0bUNsBgX5YKJhma3pU/z
 rHwhI/fqg0NTXBcNUMCosEwS2u5vnjIfiOHkDTl1Dv747S/jM0AaNEvpjDo1cOuk
 lsl4xKoDkQ6RpBCeFxRCW0wk1Nl/Su6RQ217xTtND+aIa4c6G7A9X8QAjj2XNAuK
 CktJJD7wHYSBTu1sQ+u2R/c2QEgkAxABr4NpmWTPCBNlIKEjd4A=
 =bXFe
 -----END PGP SIGNATURE-----

Merge tag 'for-4.16-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "There's an important revert in this pull request that needs to go to
  stable as it causes a corruption on big endian machines.

  The other fix is for FIEMAP incorrectly reporting shared extents
  before a sync and one fix for a crash in raid56.

  So far we got only one report about the BE corruption, the stable
  kernels were out for like a week, so hopefully the scope of the damage
  is low"

* tag 'for-4.16-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Revert "btrfs: use proper endianness accessors for super_copy"
  btrfs: add missing initialization in btrfs_check_shared
  btrfs: Fix NULL pointer exception in find_bio_stripe
2018-03-16 13:37:42 -07:00
David Sterba
093e037ca8 Revert "btrfs: use proper endianness accessors for super_copy"
This reverts commit 3c181c12c4.

The offending patch was merged in 4.16-rc4 and was promptly applied to
stable kernels 4.14.25 and 4.15.8.

The patch causes a corruption in several superblock items on big-endian
machines because of messed up endianity conversions. The damage is
manually repairable. A filesystem cannot be mounted again after it has
been unmounted once.

We do a full revert and not a fixup so stable can pick that patch ASAP.

Fixes: 3c181c12c4 ("btrfs: use proper endianness accessors for super_copy")
Link: https://lkml.kernel.org/r/1521139304@msgid.manchmal.in-ulm.de
CC: stable@vger.kernel.org # 4.14+
Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-16 14:49:44 +01:00
Arnd Bergmann
e05a959f4b procfs: remove CONFIG_HARDWALL dependency
Hardwall is a tile specific feature, and with the removal of the
tile architecture, this has become dead code, so let's remove it.

Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-03-16 10:56:08 +01:00
Linus Torvalds
df09348f78 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:

 - backport-friendly part of lock_parent() race fix

 - a fix for an assumption in the heurisic used by path_connected() that
   is not true on NFS

 - livelock fixes for d_alloc_parallel()

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Teach path_connected to handle nfs filesystems with multiple roots.
  fs: dcache: Use READ_ONCE when accessing i_dir_seq
  fs: dcache: Avoid livelock between d_alloc_parallel and __d_add
  lock_parent() needs to recheck if dentry got __dentry_kill'ed under it
2018-03-15 18:57:14 -07:00
Eric W. Biederman
95dd77580c fs: Teach path_connected to handle nfs filesystems with multiple roots.
On nfsv2 and nfsv3 the nfs server can export subsets of the same
filesystem and report the same filesystem identifier, so that the nfs
client can know they are the same filesystem.  The subsets can be from
disjoint directory trees.  The nfsv2 and nfsv3 filesystems provides no
way to find the common root of all directory trees exported form the
server with the same filesystem identifier.

The practical result is that in struct super s_root for nfs s_root is
not necessarily the root of the filesystem.  The nfs mount code sets
s_root to the root of the first subset of the nfs filesystem that the
kernel mounts.

This effects the dcache invalidation code in generic_shutdown_super
currently called shrunk_dcache_for_umount and that code for years
has gone through an additional list of dentries that might be dentry
trees that need to be freed to accomodate nfs.

When I wrote path_connected I did not realize nfs was so special, and
it's hueristic for avoiding calling is_subdir can fail.

The practical case where this fails is when there is a move of a
directory from the subtree exposed by one nfs mount to the subtree
exposed by another nfs mount.  This move can happen either locally or
remotely.  With the remote case requiring that the move directory be cached
before the move and that after the move someone walks the path
to where the move directory now exists and in so doing causes the
already cached directory to be moved in the dcache through the magic
of d_splice_alias.

If someone whose working directory is in the move directory or a
subdirectory and now starts calling .. from the initial mount of nfs
(where s_root == mnt_root), then path_connected as a heuristic will
not bother with the is_subdir check.  As s_root really is not the root
of the nfs filesystem this heuristic is wrong, and the path may
actually not be connected and path_connected can fail.

The is_subdir function might be cheap enough that we can call it
unconditionally.  Verifying that will take some benchmarking and
the result may not be the same on all kernels this fix needs
to be backported to.  So I am avoiding that for now.

Filesystems with snapshots such as nilfs and btrfs do something
similar.  But as the directory tree of the snapshots are disjoint
from one another and from the main directory tree rename won't move
things between them and this problem will not occur.

Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Fixes: 397d425dc2 ("vfs: Test for and handle paths that are unreachable from their mnt_root")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-15 18:48:38 -04:00
Christoph Hellwig
df79b81b2e xfs: minor cleanup for xfs_reflink_end_cow
Use xfs_iext_prev_extent to skip to the previous extent instead of
opencoding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-15 10:31:38 -07:00
Christoph Hellwig
1d4352de51 xfs: minor cleanup for xfs_get_blocks
Simplify the control flow a bit in preparation for O_ATOMIC-related
changes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-15 10:31:38 -07:00
Christoph Hellwig
f5c54717bf xfs: remove xfs_zero_range
This helper doesn't add any real value over just calling iomap_zero_range
directly, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-15 10:31:38 -07:00
Christoph Hellwig
c7dbe3f2c4 xfs: assert that xfs_reflink_allocate_cow is called with XFS_ILOCK_EXCL
Now that we convert COW preallocations from unwritten to real on every
call this function needs to be called with the ilock held exclusively.

Fortunately we already do that, but update the assert to match.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-15 10:31:38 -07:00
Christoph Hellwig
7d9df3c163 xfs: don't use XFS_BMAPI_ENTRIRE in xfs_get_blocks
There is no reason to get a mapping bigger than what we were asked for.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-15 10:31:38 -07:00
Christoph Hellwig
5bcffe300c xfs: fix the check for COW extents in xfs_swap_extents
i_cnextents does not include delayed allocated extents, so switch
to the inode fork size check that we already use in other places
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-15 10:31:38 -07:00
Boris Brezillon
8f347c4232 mtd: Unconditionally update ->fail_addr and ->addr in part_erase()
->fail_addr and ->addr can be updated no matter the result of
parent->_erase(), we just need to remove the code doing the same thing
in mtd_erase_callback() to avoid adjusting those fields twice.

Note that this can be done because all MTD users have been converted to
not pass an erase_info->callback() and are thus only taking the
->addr_fail and ->addr fields into account after part_erase() has
returned.

While we're at it, get rid of the erase_info->mtd field which was only
needed to let mtd_erase_callback() get the partition device back.

Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
2018-03-15 18:22:26 +01:00
Boris Brezillon
884cfd9023 mtd: Stop assuming mtd_erase() is asynchronous
None of the mtd->_erase() implementations work in an asynchronous manner,
so let's simplify MTD users that call mtd_erase(). All they need to do
is check the value returned by mtd_erase() and assume that != 0 means
failure.

Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
2018-03-15 18:21:07 +01:00
Arnd Bergmann
58eb5b6707 pstore: fix crypto dependencies
The new crypto API use causes some problems with Kconfig dependencies,
including this link error:

fs/pstore/platform.o: In function `pstore_register':
platform.c:(.text+0x248): undefined reference to `crypto_has_alg'
platform.c:(.text+0x2a0): undefined reference to `crypto_alloc_base'
fs/pstore/platform.o: In function `pstore_unregister':
platform.c:(.text+0x498): undefined reference to `crypto_destroy_tfm'
crypto/lz4hc.o: In function `lz4hc_sdecompress':
lz4hc.c:(.text+0x1a): undefined reference to `LZ4_decompress_safe'
crypto/lz4hc.o: In function `lz4hc_decompress_crypto':
lz4hc.c:(.text+0x5a): undefined reference to `LZ4_decompress_safe'
crypto/lz4hc.o: In function `lz4hc_scompress':
lz4hc.c:(.text+0xaa): undefined reference to `LZ4_compress_HC'
crypto/lz4hc.o: In function `lz4hc_mod_init':
lz4hc.c:(.init.text+0xf): undefined reference to `crypto_register_alg'
lz4hc.c:(.init.text+0x1f): undefined reference to `crypto_register_scomp'
lz4hc.c:(.init.text+0x2f): undefined reference to `crypto_unregister_alg'

The problem is that with CONFIG_CRYPTO=m, we must not 'select CRYPTO_LZ4'
from a bool symbol, or call crypto API functions from a built-in
module.

This turns the sub-options into 'tristate' ones so the dependencies
are honored, and makes the pstore itself select the crypto core
if necessary.

Fixes: cb3bee0369 ("pstore: Use crypto compress API")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-03-15 10:06:06 -07:00
Srivatsa S. Bhat
f33ff110ef block, char_dev: Use correct format specifier for unsigned ints
register_blkdev() and __register_chrdev_region() treat the major
number as an unsigned int. So print it the same way to avoid
absurd error statements such as:
"... major requested (-1) is greater than the maximum (511) ..."
(and also fix off-by-one bugs in the error prints).

While at it, also update the comment describing register_blkdev().

Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-15 17:59:24 +01:00
Srivatsa S. Bhat
652d703b21 char_dev: Fix off-by-one bugs in find_dynamic_major()
CHRDEV_MAJOR_DYN_END and CHRDEV_MAJOR_DYN_EXT_END are valid major
numbers. So fix the loop iteration to include them in the search for
free major numbers.

While at it, also remove a redundant if condition ("cd->major != i"),
as it will never be true.

Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-15 17:59:24 +01:00
Andreas Gruenbacher
ee6ed857c8 gfs2: gfs2_iomap_end tracepoint: log block address
In the gfs2_iomap_end tracepoint, log the physical block address, just
as in the gfs2_bmap tracepoint.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-15 07:17:17 -07:00
Edmund Nadolski
18bf591ba9 btrfs: add missing initialization in btrfs_check_shared
This patch addresses an issue that causes fiemap to falsely
report a shared extent.  The test case is as follows:

xfs_io -f -d -c "pwrite -b 16k 0 64k" -c "fiemap -v" /media/scratch/file5
sync
xfs_io  -c "fiemap -v" /media/scratch/file5

which gives the resulting output:

wrote 65536/65536 bytes at offset 0
64 KiB, 4 ops; 0.0000 sec (121.359 MiB/sec and 7766.9903 ops/sec)
/media/scratch/file5:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        24576..24703       128 0x2001
/media/scratch/file5:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        24576..24703       128   0x1

This is because btrfs_check_shared calls find_parent_nodes
repeatedly in a loop, passing a share_check struct to report
the count of shared extent. But btrfs_check_shared does not
re-initialize the count value to zero for subsequent calls
from the loop, resulting in a false share count value. This
is a regressive behavior from 4.13.

With proper re-initialization the test result is as follows:

wrote 65536/65536 bytes at offset 0
64 KiB, 4 ops; 0.0000 sec (110.035 MiB/sec and 7042.2535 ops/sec)
/media/scratch/file5:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        24576..24703       128   0x1
/media/scratch/file5:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..127]:        24576..24703       128   0x1

which corrects the regression.

Fixes: 3ec4d3238a ("btrfs: allow backref search checks for shared extents")
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
[ add text from cover letter to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-14 22:26:46 +01:00
Dmitriy Gorokh
047fdea634 btrfs: Fix NULL pointer exception in find_bio_stripe
On detaching of a disk which is a part of a RAID6 filesystem, the
following kernel OOPS may happen:

[63122.680461] BTRFS error (device sdo): bdev /dev/sdo errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
[63122.719584] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo
[63122.719587] BTRFS error (device sdo): bdev /dev/sdo errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
[63122.803516] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo
[63122.803519] BTRFS error (device sdo): bdev /dev/sdo errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
[63122.863902] BTRFS critical (device sdo): fatal error on device /dev/sdo
[63122.935338] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
[63122.946554] IP: fail_bio_stripe+0x58/0xa0 [btrfs]
[63122.958185] PGD 9ecda067 P4D 9ecda067 PUD b2b37067 PMD 0
[63122.971202] Oops: 0000 [#1] SMP
[63123.006760] CPU: 0 PID: 3979 Comm: kworker/u8:9 Tainted: G W 4.14.2-16-scst34x+ #8
[63123.007091] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[63123.007402] Workqueue: btrfs-worker btrfs_worker_helper [btrfs]
[63123.007595] task: ffff880036ea4040 task.stack: ffffc90006384000
[63123.007796] RIP: 0010:fail_bio_stripe+0x58/0xa0 [btrfs]
[63123.007968] RSP: 0018:ffffc90006387ad8 EFLAGS: 00010287
[63123.008140] RAX: 0000000000000002 RBX: ffff88004beaa0b8 RCX: ffff8800b2bd5690
[63123.008359] RDX: 0000000000000000 RSI: ffff88007bb43500 RDI: ffff88004beaa000
[63123.008621] RBP: ffffc90006387ae8 R08: 0000000099100000 R09: ffff8800b2bd5600
[63123.008840] R10: 0000000000000004 R11: 0000000000010000 R12: ffff88007bb43500
[63123.009059] R13: 00000000fffffffb R14: ffff880036fc5180 R15: 0000000000000004
[63123.009278] FS: 0000000000000000(0000) GS:ffff8800b7000000(0000) knlGS:0000000000000000
[63123.009564] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[63123.009748] CR2: 0000000000000080 CR3: 00000000b0866000 CR4: 00000000000406f0
[63123.009969] Call Trace:
[63123.010085] raid_write_end_io+0x7e/0x80 [btrfs]
[63123.010251] bio_endio+0xa1/0x120
[63123.010378] generic_make_request+0x218/0x270
[63123.010921] submit_bio+0x66/0x130
[63123.011073] finish_rmw+0x3fc/0x5b0 [btrfs]
[63123.011245] full_stripe_write+0x96/0xc0 [btrfs]
[63123.011428] raid56_parity_write+0x117/0x170 [btrfs]
[63123.011604] btrfs_map_bio+0x2ec/0x320 [btrfs]
[63123.011759] ? ___cache_free+0x1c5/0x300
[63123.011909] __btrfs_submit_bio_done+0x26/0x50 [btrfs]
[63123.012087] run_one_async_done+0x9c/0xc0 [btrfs]
[63123.012257] normal_work_helper+0x19e/0x300 [btrfs]
[63123.012429] btrfs_worker_helper+0x12/0x20 [btrfs]
[63123.012656] process_one_work+0x14d/0x350
[63123.012888] worker_thread+0x4d/0x3a0
[63123.013026] ? _raw_spin_unlock_irqrestore+0x15/0x20
[63123.013192] kthread+0x109/0x140
[63123.013315] ? process_scheduled_works+0x40/0x40
[63123.013472] ? kthread_stop+0x110/0x110
[63123.013610] ret_from_fork+0x25/0x30
[63123.014469] RIP: fail_bio_stripe+0x58/0xa0 [btrfs] RSP: ffffc90006387ad8
[63123.014678] CR2: 0000000000000080
[63123.016590] ---[ end trace a295ea7259c17880 ]—

This is reproducible in a cycle, where a series of writes is followed by
SCSI device delete command. The test may take up to few minutes.

Fixes: 74d46992e0 ("block: replace bi_bdev with a gendisk pointer and partitions index")
[ no signed-off-by provided ]
Author: Dmitriy Gorokh <Dmitriy.Gorokh@wdc.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-14 22:26:35 +01:00
Tejun Heo
d0264c01e7 fs/aio: Use RCU accessors for kioctx_table->table[]
While converting ioctx index from a list to a table, db446a08c2
("aio: convert the ioctx list to table lookup v3") missed tagging
kioctx_table->table[] as an array of RCU pointers and using the
appropriate RCU accessors.  This introduces a small window in the
lookup path where init and access may race.

Mark kioctx_table->table[] with __rcu and use the approriate RCU
accessors when using the field.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jann Horn <jannh@google.com>
Fixes: db446a08c2 ("aio: convert the ioctx list to table lookup v3")
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org # v3.12+
2018-03-14 12:10:17 -07:00
Tejun Heo
a6d7cff472 fs/aio: Add explicit RCU grace period when freeing kioctx
While fixing refcounting, e34ecee2ae ("aio: Fix a trinity splat")
incorrectly removed explicit RCU grace period before freeing kioctx.
The intention seems to be depending on the internal RCU grace periods
of percpu_ref; however, percpu_ref uses a different flavor of RCU,
sched-RCU.  This can lead to kioctx being freed while RCU read
protected dereferences are still in progress.

Fix it by updating free_ioctx() to go through call_rcu() explicitly.

v2: Comment added to explain double bouncing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jann Horn <jannh@google.com>
Fixes: e34ecee2ae ("aio: Fix a trinity splat")
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org # v3.13+
2018-03-14 12:10:17 -07:00
Christoph Hellwig
e6b9657056 xfs: refactor xfs_log_force
Streamline the conditionals so that it is more obvious which specific case
form the top of the function comments is being handled.  Use gotos only
for early returns.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:52 -07:00
Christoph Hellwig
656de4ffaf xfs: merge _xfs_log_force_lsn and xfs_log_force_lsn
Switch to a single interface for flushing the log to a specific LSN, which
gives consistent trace point coverage and a less confusing interface.

The was only a single user of the previous xfs_log_force_lsn function,
which now also passes a NULL log_flushed argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:52 -07:00
Christoph Hellwig
60e5bb7844 xfs: merge _xfs_log_force and xfs_log_force
Switch to a single interface for flushing the whole log, which gives
consistent trace point coverage, and removes the unused log_flushed
argument for the previous _xfs_log_force callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:52 -07:00
Christoph Hellwig
2b56c2857f xfs: remove the unused log_flushed variable in xfs_extent_busy_flush
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:52 -07:00
Christoph Hellwig
4f7aa2fd8c xfs: remove an outdated comment for xfs_inode_item_committing
The function now does something, and that something is central to our
inode logging scheme.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:51 -07:00
Christoph Hellwig
2022ab36fe xfs: remove misleading comment text on xfs_inode_item_unlock
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-14 11:12:51 -07:00
Christian Brauner
4e15f760a4 devpts: comment devpts_mntget()
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-14 13:31:23 +01:00
Christian Brauner
a319b01d90 devpts: resolve devpts bind-mounts
Most libcs will still look at /dev/ptmx when opening the master fd of a pty
device. When /dev/ptmx is a bind-mount of /dev/pts/ptmx and the TIOCGPTPEER
ioctl() is used to safely retrieve a file descriptor for the slave side of
the pty based on the master fd, the /proc/self/fd/{0,1,2} symlinks will
point to /. A very simply reproducer for this issue presupposing a libc
that uses TIOCGPTPEER in its openpty() implementation is:

unshare --mount
mount --bind /dev/pts/ptmx /dev/ptmx
chmod 666 /dev/ptmx
script
ls -al /proc/self/fd/0

Having bind-mounts of /dev/pts/ptmx to /dev/ptmx not working correctly is a
regression. In addition, it is also a fairly common scenario in containers
employing user namespaces.

The reason for the current failure is that the kernel tries to verify the
useability of the devpts filesystem without resolving the /dev/ptmx
bind-mount first. This will lead it to detect that the dentry is escaping
its bind-mount. The reason is that while the devpts filesystem mounted at
/dev/pts has the devtmpfs mounted at /dev as its parent mount:

21 -- -- / /dev
-- 21 -- / /dev/pts

devtmpfs and devpts are on different devices

-- -- 0:6  / /dev
-- -- 0:20 / /dev/pts

This has the consequence that the pathname of the parent directory of the
devpts filesystem mount at /dev/pts is /. So if /dev/ptmx is a bind-mount
of /dev/pts/ptmx then the /dev/ptmx bind-mount and the devpts mount at
/dev/pts will end up being located on the same device which is recorded in
the superblock of their vfsmount. This means the parent directory of the
/dev/ptmx bind-mount will be /ptmx:

-- -- ---- /ptmx /dev/ptmx

Without the bind-mount resolution patch the kernel will now perform the
bind-mount escape check directly on /dev/ptmx. The function responsible for
this is devpts_ptmx_path() which calls pts_path() which in turn calls
path_parent_directory(). Based on the above explanation,
path_parent_directory() will yield / as the parent directory for the
/dev/ptmx bind-mount and not the expected /dev. Thus, the kernel detects
that /dev/ptmx is escaping its bind-mount and will set /proc/<pid>/fd/<nr>
to /.

This patch changes the logic to first resolve any bind-mounts. After the
bind-mounts have been resolved (i.e. we have traced it back to the
associated devpts mount) devpts_ptmx_path() can be called. In order to
guarantee correct path generation for the slave file descriptor the kernel
now requires that a pts directory is found in the parent directory of the
ptmx bind-mount. This implies that when doing bind-mounts the ptmx
bind-mount and the devpts mount should have a common parent directory. A
valid example is:

mount -t devpts devpts /dev/pts
mount --bind /dev/pts/ptmx /dev/ptmx

an invalid example is:

mount -t devpts devpts /dev/pts
mount --bind /dev/pts/ptmx /ptmx

This allows us to support:
- calling open on ptmx devices located inside non-standard devpts mounts:
  mount -t devpts devpts /mnt
  master = open("/mnt/ptmx", ...);
  slave = ioctl(master, TIOCGPTPEER, ...);
- calling open on ptmx devices located outside the devpts mount with a
  common ancestor directory:
  mount -t devpts devpts /dev/pts
  mount --bind /dev/pts/ptmx /dev/ptmx
  master = open("/dev/ptmx", ...);
  slave = ioctl(master, TIOCGPTPEER, ...);

while failing on ptmx devices located outside the devpts mount without a
common ancestor directory:
  mount -t devpts devpts /dev/pts
  mount --bind /dev/pts/ptmx /ptmx
  master = open("/ptmx", ...);
  slave = ioctl(master, TIOCGPTPEER, ...);

in which case save path generation cannot be guaranteed.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: Eric Biederman <ebiederm@xmission.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-14 13:31:23 +01:00
Christian Brauner
7d71109df1 devpts: hoist out check for DEVPTS_SUPER_MAGIC
Hoist the check whether we have already found a suitable devpts filesystem
out of devpts_ptmx_path() in preparation for the devpts bind-mount
resolution patch. This is a non-functional change.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-14 13:31:23 +01:00
Chao Yu
b6a06cbbb5 f2fs: support hot file extension
This patch supports to recognize hot file extension in f2fs, so that we
can allocate proper hot segment location for its data, which can lead to
better hot/cold seperation in filesystem.

In addition, we changes a bit on query/add/del operation method for
extension_list sysfs entry as below:

- Query: cat /sys/fs/f2fs/<disk>/extension_list
- Add: echo 'extension' > /sys/fs/f2fs/<disk>/extension_list
- Del: echo '!extension' > /sys/fs/f2fs/<disk>/extension_list
- Add: echo '[h/c]extension' > /sys/fs/f2fs/<disk>/extension_list
- Del: echo '[h/c]!extension' > /sys/fs/f2fs/<disk>/extension_list
- [h] means add/del hot file extension
- [c] means add/del cold file extension

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:57 +09:00
Chao Yu
1dc0f8991d f2fs: fix to avoid race in between atomic write and background GC
Sqlite user			Background GC
				- move_data_block
				  : move page #1
				 - f2fs_is_atomic_file
- f2fs_ioc_start_atomic_write
- f2fs_ioc_commit_atomic_write
 - commit_inmem_pages
   : commit page #1 & set node #2 dirty
				 - f2fs_submit_page_write
				  - f2fs_update_data_blkaddr
				   - set_page_dirty
				     : set node #2 dirty
 - f2fs_do_sync_file
  - fsync_node_pages
   : commit node #1 & node #2, then sudden power-cut

In a race case, we may check FI_ATOMIC_FILE flag before starting atomic
write flow, then we will commit meta data before data with reversed
order, after a sudden pow-cut, database transaction will be inconsistent.

So we'd better to exclude gc/atomic_write to each other by using lock
instead of flag checking.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:55 +09:00
Jaegeuk Kim
b27bc8091c f2fs: do gc in greedy mode for whole range if gc_urgent mode is set
Otherwise, f2fs conducts GC on 8GB range only based on slow cost-benefit.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:54 +09:00
Jaegeuk Kim
dee02f0d62 f2fs: issue discard aggressively in the gc_urgent mode
This patch avoids to skip discard commands when user sets gc_urgent mode.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:53 +09:00
Jaegeuk Kim
782911f491 f2fs: set readdir_ra by default
It gives general readdir improvement.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:52 +09:00
Jaegeuk Kim
84b89e5d94 f2fs: add auto tuning for small devices
If f2fs is running on top of very small devices, it's worth to avoid abusing
free LBAs. In order to achieve that, this patch introduces some parameter
tuning.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:51 +09:00
Jaegeuk Kim
079396270b f2fs: add mount option for segment allocation policy
This patch adds an mount option, "alloc_mode=%s" having two options, "default"
and "reuse".

In "alloc_mode=reuse" case, f2fs starts to allocate segments from 0'th segment
all the time to reassign segments. It'd be useful for small-sized eMMC parts.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:50 +09:00
Jaegeuk Kim
69babac019 f2fs: don't stop GC if GC is contended
Let's do GC as much as possible, while gc_urgent is set.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:49 +09:00
Chao Yu
846ae671ad f2fs: expose extension_list sysfs entry
This patch adds a sysfs entry 'extension_list' to support
query/add/del item in extension list.

Query:
cat /sys/fs/f2fs/<device>/extension_list

Add:
echo 'extension' > /sys/fs/f2fs/<device>/extension_list

Del:
echo '!extension' > /sys/fs/f2fs/<device>/extension_list

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:48 +09:00
Chao Yu
17cd07ae95 f2fs: fix to set KEEP_SIZE bit in f2fs_zero_range
As Jayashree Mohan reported:

A simple workload to reproduce this would be :
1. create foo
2. Write (8K - 16K)  // foo size = 16K now
3. fsync()
4. falloc zero_range , keep_size (4202496 - 4210688) // foo size must be 16K
5. fdatasync()
Crash now

On recovery, we see that the file size is 4210688 and not 16K, which
violates the semantics of keep_size flag. We have a test case to
reproduce this using CrashMonkey on 4.15 kernel. Try this out by
simply running :
 ./c_harness -f /dev/sda -d /dev/cow_ram0 -t f2fs -e 102400  -P -v
 tests/generic_468_zero.so

The root cause is that we miss to set KEEP_SIZE bit correctly in zero_range
when zeroing block cross EOF with FALLOC_FL_KEEP_SIZE, let's fix this
missing case.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:47 +09:00
Chao Yu
d0d3f1b329 f2fs: introduce sb_lock to make encrypt pwsalt update exclusive
f2fs_super_block.encrypt_pw_salt can be udpated and persisted
concurrently, result in getting different pwsalt in separated
threads, so let's introduce sb_lock to exclude concurrent
accessers.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:46 +09:00
Colin Ian King
8fe326cb99 f2fs: remove redundant initialization of pointer 'p'
Pointer p is initialized with a value that is never read and is later
re-assigned a new value, hence the initialization is redundant and can
be removed.

Cleans up clang warning:
fs/f2fs/extent_cache.c:463:19: warning: Value stored to 'p' during
its initialization is never read

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:45 +09:00
Gao Xiang
46706d5917 f2fs: flush cp pack except cp pack 2 page at first
Previously, we attempt to flush the whole cp pack in a single bio,
however, when suddenly powering off at this time, we could get into
an extreme scenario that cp pack 1 page and cp pack 2 page are updated
and latest, but payload or current summaries are still partially
outdated. (see reliable write in the UFS specification)

This patch submits the whole cp pack except cp pack 2 page at first,
and then writes the cp pack 2 page with an extra independent
bio with pre-io barrier.

Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:43 +09:00
Sheng Yong
ccd31cb28f f2fs: clean up f2fs_sb_has_xxx functions
This patch introduces F2FS_FEATURE_FUNCS to clean up the definitions of
different f2fs_sb_has_xxx functions.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:42 +09:00
Tiezhu Yang
3bb09a0e7e f2fs: remove redundant check of page type when submit bio
This patch removes redundant check of page type when submit bio to
make the logic more clear.

Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:41 +09:00
Chao Yu
fb0e72c8b9 f2fs: fix to handle looped node chain during recovery
There is no checksum in node block now, so bit-transition from hardware
can make node_footer.next_blkaddr being corrupted w/o any detection,
result in node chain becoming looped one.

For this condition, during recovery, in order to avoid running into dead
loop, let's detect it and just skip out.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:40 +09:00
Jaegeuk Kim
0f9ec2a8f6 f2fs: handle quota for orphan inodes
This is to detect dquot_initialize errors early from evict_inode
for orphan inodes.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:38 +09:00
Hyunchul Lee
f2e703f9a3 f2fs: support passing down write hints to block layer with F2FS policy
Add 'whint_mode=fs-based' mount option. In this mode, F2FS passes
down write hints with its policy.

* whint_mode=fs-based. F2FS passes down hints with its policy.

User                  F2FS                     Block
----                  ----                     -----
                      META                     WRITE_LIFE_MEDIUM;
                      HOT_NODE                 WRITE_LIFE_NOT_SET
                      WARM_NODE                "
                      COLD_NODE                WRITE_LIFE_NONE
ioctl(COLD)           COLD_DATA                WRITE_LIFE_EXTREME
extension list        "                        "

-- buffered io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_LONG
WRITE_LIFE_NONE       "                        "
WRITE_LIFE_MEDIUM     "                        "
WRITE_LIFE_LONG       "                        "

-- direct io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                        WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM     "                        WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG       "                        WRITE_LIFE_LONG

Many thanks to Chao Yu and Jaegeuk Kim for comments to
implement this patch.

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:36 +09:00
Hyunchul Lee
0cdd319539 f2fs: support passing down write hints given by users to block layer
Add the 'whint_mode' mount option that controls which write
hints are passed down to block layer. There are "off" and
"user-based" mode. The default mode is "off".

1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET.

2) whint_mode=user-based. F2FS tries to pass down hints given
by users.

User                  F2FS                     Block
----                  ----                     -----
                      META                     WRITE_LIFE_NOT_SET
                      HOT_NODE                 "
                      WARM_NODE                "
                      COLD_NODE                "
ioctl(COLD)           COLD_DATA                WRITE_LIFE_EXTREME
extension list        "                        "

-- buffered io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                        "
WRITE_LIFE_MEDIUM     "                        "
WRITE_LIFE_LONG       "                        "

-- direct io
WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                        WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM     "                        WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG       "                        WRITE_LIFE_LONG

Many thanks to Chao Yu and Jaegeuk Kim for comments to
implement this patch.

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: avoid build warning]
[Chao Yu: fix to restore whint_mode in ->remount_fs]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:35 +09:00
Chao Yu
cd36d7a17f f2fs: fix to clear CP_TRIMMED_FLAG
Once CP_TRIMMED_FLAG is set, after a reboot, we will never issue discard
before LBA becomes invalid again, fix it by clearing the flag in
checkpoint without CP_TRIMMED reason.

Fixes: 1f43e2ad7b ("f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:34 +09:00
Chao Yu
199bc3fef2 f2fs: support large nat bitmap
Previously, we will store all nat version bitmap in checkpoint pack block,
so our total node entry number has a limitation which caused total node
number can not exceed (3900 * 8) block * 455 node/block = 14196000. So
that once user wants to create more nodes in large size image, it becomes
a bottleneck, that's unreasonable.

This patch detects the new layout of nat/sit version bitmap in image in
order to enable supporting large nat bitmap.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:33 +09:00
Chao Yu
bf617f7a92 f2fs: fix to check extent cache in f2fs_drop_extent_tree
If noextent_cache mount option is on, we will never initialize extent tree
in inode, but still we're going to access it in f2fs_drop_extent_tree,
result in kernel panic as below:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
 IP: _raw_write_lock+0xc/0x30
 Call Trace:
  ? f2fs_drop_extent_tree+0x41/0x70 [f2fs]
  f2fs_fallocate+0x5a0/0xdd0 [f2fs]
  ? common_file_perm+0x47/0xc0
  ? apparmor_file_permission+0x1a/0x20
  vfs_fallocate+0x15b/0x290
  SyS_fallocate+0x44/0x70
  do_syscall_64+0x6e/0x160
  entry_SYSCALL64_slow_path+0x25/0x25

This patch fixes to check extent cache status before using in
f2fs_drop_extent_tree.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:32 +09:00
Chao Yu
4d817ae099 f2fs: restrict inline_xattr_size configuration
This patch limits to enable inline_xattr_size mount option only if
both extra_attr and flexible_inline_xattr feature is on in current
image.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:31 +09:00
Yunlong Song
b94929d975 f2fs: fix heap mode to reset it back
Commit 7a20b8a61e ("f2fs: allocate node
and hot data in the beginning of partition") introduces another mount
option, heap, to reset it back. But it does not do anything for heap
mode, so fix it.

Cc: stable@vger.kernel.org
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:30 +09:00
Sheng Yong
0964fc1a82 f2fs: fix potential corruption in area before F2FS_SUPER_OFFSET
sb_getblk does not guarantee the buffer head is uptodate. If bh is not
uptodate, the data (may be used as boot code) in area before
F2FS_SUPER_OFFSET may get corrupted when super block is committed.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:29 +09:00
Yunlong Song
bdbc90fa55 f2fs: don't put dentry page in pagecache into highmem
Previous dentry page uses highmem, which will cause panic in platforms
using highmem (such as arm), since the address space of dentry pages
from highmem directly goes into the decryption path via the function
fscrypt_fname_disk_to_usr. But sg_init_one assumes the address is not
from highmem, and then cause panic since it doesn't call kmap_high but
kunmap_high is triggered at the end. To fix this problem in a simple
way, this patch avoids to put dentry page in pagecache into highmem.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-03-13 08:05:03 +09:00
Linus Torvalds
fc6eabbbf8 NFS client bugfixes for Linux 4.16
Hightlights include the following stable fixes:
 
 - NFS: Fix an incorrect type in struct nfs_direct_req
 - pNFS: Prevent the layout header refcount going to zero in pnfs_roc()
 - NFS: Fix unstable write completion
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJaprb9AAoJEGcL54qWCgDysK0P/Rwyc7fhcs4CvxETQT0LhbbB
 P6XDAb5pBeU+ca8ZWzqlpe1uKc0z5ykuTLuxLt2QvSb9+6h+8T1gxOa5SibehnwY
 6UizLDgHMuY2V4wqm4XLHiZX7J0Evdo4rgNCp7qq2/TinBPYKQsqIz34+s9BZDti
 6LG6WiVhOtkhs6s/sRbAn+FfMbSiQn54iQIgFPeO+3zEzaaLzUZqlVHO5yjxVnxh
 6oYGkEPVzP09URUO/HJ1VTUZJnso1/axEcH9qhuJzZ+pANCpuBjSMoFzzE6oI2mD
 dfD8UJZjwqLb7AsFhgSQUrDXzaI0Xg7ccrImtCp4QLlTjja3TlOYMYbK9kDRHC1/
 kT93cSFSIeSGpqLVSdNyiz7pCh2Cps2U0JTbhwE0R1NEskWQPzSwY73jEwiA/bex
 JOZBxtjVZt8TgOKHxPO65rvQwaq3T31KrQSDLqv3JReI0epJeqng0h+pwiZluNAs
 TFjE1aP0l567A3qQnPrHrGZ4IIs8XYobZ+MBxf9x5PaRWnGoPUkhHXMP2j8JCnwa
 ofPBR4Mx5WhnsB7Z5vsYetHtmD99QbJ0+WH9ObilUFt6gUgXoXCrvexIUq+LD3DP
 HIIqfbSorAIlQxtCwEz7m85B7VEKByaCWj8OiFt5VVAreEhiazmsncsyh8R5fa88
 wJDxRe3QmNu7BEJ9NJKi
 =IQGX
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.16-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Hightlights include the following stable fixes:

   - NFS: Fix an incorrect type in struct nfs_direct_req

   - pNFS: Prevent the layout header refcount going to zero in
     pnfs_roc()

   - NFS: Fix unstable write completion"

* tag 'nfs-for-4.16-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: Fix unstable write completion
  pNFS: Prevent the layout header refcount going to zero in pnfs_roc()
  NFS: Fix an incorrect type in struct nfs_direct_req
2018-03-12 10:47:03 -07:00
Nikolay Borisov
ce3077ee80 direct-io: Remove unused DIO_SKIP_DIO_COUNT logic
This flag was added by fe0f07d08e ("direct-io: only inc/deci
inode->i_dio_count for file systems") as means to optimise the atomic
modificaiton of the variable for blockdevices. However with the advent
of 542ff7bf18 ("block: new direct I/O implementation") it became
unused. So let's remove it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-12 10:21:24 -06:00
Nikolay Borisov
c8f4c36f81 direct-io: Remove unused DIO_ASYNC_EXTEND flag
This flag was added by 6039257378 ("direct-io: add flag to allow aio
writes beyond i_size") to support XFS. However, with the rework of
XFS' DIO's path to use iomap in acdda3aae1 ("xfs: use iomap_dio_rw")
it became redundant. So let's remove it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-12 10:21:22 -06:00
Al Viro
c19457f0ae d_delete(): get rid of trylock loop
just grab ->i_lock first; we have a positive dentry, nothing's going
to happen to inode

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-12 11:59:14 -04:00
John Ogness
c1d0c1a2b5 fs/dcache: Move dentry_kill() below lock_parent()
A subsequent patch will modify dentry_kill() to call lock_parent().
Move the dentry_kill() implementation "as is" below lock_parent()
first. This will help simplify the review of the subsequent patch
with dentry_kill() changes.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-12 11:59:13 -04:00
John Ogness
06080d100d fs/dcache: Remove stale comment from dentry_kill()
Commit 0d98439ea3 ("vfs: use lockred "dead" flag to mark unrecoverably
dead dentries") removed the `ref' parameter in dentry_kill() but its
documentation remained. Remove it.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-12 11:59:13 -04:00
Al Viro
0632a9ac7b take write_seqcount_invalidate() into __d_drop()
... and reorder it with making d_unhashed() true.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-03-12 11:59:12 -04:00
Brian Foster
0ab32086d0 xfs: account only rmapbt-used blocks against rmapbt perag res
The rmapbt perag metadata reservation reserves blocks for the
reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
the agfl and perag accounting is updated as blocks are allocated
from the allocation btrees, the reservation actually accounts blocks
as they are allocated to (or freed from) the agfl rather than the
rmapbt itself.

While this works for blocks that are eventually used for the rmapbt,
not all agfl blocks are destined for the rmapbt. Blocks that are
allocated to the agfl (and thus "reserved" for the rmapbt) but then
used by another structure leads to a growing inconsistency over time
between the runtime tracking of rmapbt usage vs. actual rmapbt
usage. Since the runtime tracking thinks all agfl blocks are rmapbt
blocks, it essentially believes that less future reservation is
required to satisfy the rmapbt than what is actually necessary.

The inconsistency is rectified across mount cycles because the perag
reservation is initialized based on the actual rmapbt usage at mount
time. The problem, however, is that the excessive drain of the
reservation at runtime opens a window to allocate blocks for other
purposes that might be required for the rmapbt on a subsequent
mount. This problem can be demonstrated by a simple test that runs
an allocation workload to consume agfl blocks over time and then
observe the difference in the agfl reservation requirement across an
unmount/mount cycle:

  mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
  ...
  ...      : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
  umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
  mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194

As the above tracepoints show, the reservation requirement reduces
from 3194 blocks to 2956 blocks as the workload runs.  Without any
other changes in the filesystem, the same reservation requirement
jumps from 2956 to 3052 blocks over a umount/mount cycle.

To address this divergence, update the RMAPBT reservation to account
blocks used for the rmapbt only rather than all blocks filled into
the agfl. This patch makes several high-level changes toward that
end:

1.) Reintroduce an AGFL reservation type to serve as an accounting
    no-op for blocks allocated to (or freed from) the AGFL.
2.) Invoke RMAPBT usage accounting from the actual rmapbt block
    allocation path rather than the AGFL allocation path.

The first change is required because agfl blocks are considered free
blocks throughout their lifetime. The perag reservation subsystem is
invoked unconditionally by the allocation subsystem, so we need a
way to tell the perag subsystem (via the allocation subsystem) to
not make any accounting changes for blocks filled into the AGFL.

The second change causes the in-core RMAPBT reservation usage
accounting to remain consistent with the on-disk state at all times
and eliminates the risk of leaving the rmapbt reservation
underfilled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:57 -07:00
Brian Foster
2159286335 xfs: rename agfl perag res type to rmapbt
The AGFL perag reservation type accounts all allocations that feed
into (or are released from) the allocation group free list (agfl).
The purpose of the reservation is to support worst case conditions
for the reverse mapping btree (rmapbt). As such, the agfl
reservation usage accounting only considers rmapbt usage when the
in-core counters are initialized at mount time.

This implementation inconsistency leads to divergence of the in-core
and on-disk usage accounting over time. In preparation to resolve
this inconsistency and adjust the AGFL reservation into an rmapbt
specific reservation, rename the AGFL reservation type and
associated accounting fields to something more rmapbt-specific. Also
fix up a couple tracepoints that incorrectly use the AGFL
reservation type to pass the agfl state of the associated extent
where the raw reservation type is expected.

Note that this patch does not change perag reservation behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:57 -07:00
Brian Foster
b3fed43482 xfs: account format bouncing into rmapbt swapext tx reservation
The extent swap mechanism requires a unique implementation for
rmapbt enabled filesystems. Because the rmapbt tracks extent owner
information, extent swap must individually unmap and remap each
extent between the two inodes.

The rmapbt extent swap transaction block reservation currently
accounts for the worst case bmapbt block and rmapbt block
consumption based on the extent count of each inode. There is a
corner case that exists due to the extent swap implementation that
is not covered by this reservation, however.

If one of the associated inodes is just over the max extent count
used for extent format inodes (i.e., the inode is in btree format by
a single extent), the unmap/remap cycle of the extent swap can
bounce the inode between extent and btree format multiple times,
almost as many times as there are extents in the inode (if the
opposing inode happens to have one less, for example). Each back and
forth cycle involves a block free and allocation, which isn't a
problem except for that the initial transaction reservation must
account for the total number of block allocations performed by the
chain of deferred operations. If not, a block reservation overrun
occurs and the filesystem shuts down.

Update the rmapbt extent swap block reservation to check for this
situation and add some block reservation slop to ensure the entire
operation succeeds. We'd never likely require reservation for both
inodes as fsr wouldn't defrag the file in that case, but the
additional reservation is constrained by the data fork size so be
cautious and check for both.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:57 -07:00
Brian Foster
3e78b9a468 xfs: shutdown if block allocation overruns tx reservation
The ->t_blk_res_used field tracks how many blocks have been used in
the current transaction. This should never exceed the block
reservation (->t_blk_res) for a particular transaction. We currently
assert this condition in the transaction block accounting code, but
otherwise take no additional action should this situation occur.

The overrun generally has no effect if space ends up being available
and the associated transaction commits. If the transaction is
duplicated, however, the current block usage is used to determine
the remaining block reservation to be transferred to the new
transaction. If usage exceeds reservation, this calculation
underflows and creates a transaction with an invalid and excessive
reservation. When the second transaction commits, the release of
unused blocks corrupts the in-core free space counters. With lazy
superblock accounting enabled, this inconsistency eventually
trickles to the on-disk superblock and corrupts the filesystem.

Replace the transaction block usage accounting assert with an
explicit overrun check. If the transaction overruns the reservation,
shutdown the filesystem immediately to prevent corruption. Add a new
assert to xfs_trans_dup() to catch any callers that might induce
this invalid state in the future.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:57 -07:00
Matthew Wilcox
57e8095611 xfs: Rename xa_ elements to ail_
This is a simple rename, except that xa_ail becomes ail_head.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:56 -07:00
Dave Chinner
ae23395d88 inode: don't memset the inode address space twice
Noticed when looking at why cycling 600k inodes/s through the inode
cache was taking a total of 8% cpu in memset() during inode
initialisation.  There is no need to zero the inode.i_data structure
twice.

This increases single threaded bulkstat throughput from ~200,000
inodes/s to ~220,000 inodes/s, so we save a substantial amount of
CPU time per inode init by doing this.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:56 -07:00
Dave Chinner
a78ee256c3 xfs: convert XFS_AGFL_SIZE to a helper function
The AGFL size calculation is about to get more complex, so lets turn
the macro into a function first and remove the macro.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick: forward port to newer kernel, simplify the helper]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-11 20:27:56 -07:00
Darrick J. Wong
6231848c3a xfs: check for cow blocks before trying to clear them
There's no point in allocating a transaction and locking the inode in
preparation to clear cow blocks if there actually are any cow fork
extents.  Therefore, move the xfs_reflink_cancel_cow_range hunk to
xfs_inactive and check the cow ifp first.  This makes inode reclamation
run faster.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-11 20:27:56 -07:00
Darrick J. Wong
3f883f5bb1 xfs: convert a few more directory asserts to corruption
Yet another round of playing whack-a-mole with directory code that
asserts on corrupt on-disk metadata when it really should be returning
-EFSCORRUPTED instead of ASSERTing.  Found by a xfs/391 crash while
lastbit fuzzing of ltail.bestcount.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-11 20:27:56 -07:00
Darrick J. Wong
8241f7f983 xfs: don't iunlock the quota ip when quota block
In xfs_qm_dqalloc, we join the locked quota inode to the transaction we
use to allocate blocks.  If the allocation or mapping fails, we're not
allowed to unlock the inode because the transaction code is in charge of
unlocking it for us.  Therefore, remove the iunlock call to avoid
blowing asserts about unbalanced locking + mount hang.

Found by corrupting the AGF and allocating space in the filesystem
(quotacheck) immediately after mount.  The upcoming agfl wrapping fixup
test will trigger this scenario.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-03-11 20:27:56 -07:00
Vratislav Bendel
19957a1816 xfs: Correctly invert xfs_buftarg LRU isolation logic
Due to an inverted logic mistake in xfs_buftarg_isolate()
the xfs_buffers with zero b_lru_ref will take another trip
around LRU, while isolating buffers with non-zero b_lru_ref.

Additionally those isolated buffers end up right back on the LRU
once they are released, because b_lru_ref remains elevated.

Fix that circuitous route by leaving them on the LRU
as originally intended.

Signed-off-by: Vratislav Bendel <vbendel@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:56 -07:00
Dave Chinner
4df0f7f145 xfs: fix transaction allocation deadlock in IO path
xfs_trans_alloc() does GFP_KERNEL allocation, and we can call it
while holding pages locked for writeback in the ->writepages path.
The memory allocation is allowed to wait on pages under writeback,
and so can wait on pages that are tagged as writeback by the
caller.

This affects both pre-IO submission and post-IO submission paths.
Hence xfs_setsize_trans_alloc(), xfs_reflink_end_cow(),
xfs_iomap_write_unwritten() and xfs_reflink_cancel_cow_range().
xfs_iomap_write_unwritten() already does the right thing, but the
others don't. Fix them.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Fixes: 281627df3e ("xfs: log file size updates at I/O completion time")
Fixes: 43caeb187d ("xfs: move mappings from cow fork to data fork after copy-write)"
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Christoph Hellwig
c3b1b13190 xfs: implement the lazytime mount option
Use the VFS dirty inode tracking for lazytime inodes only, and just
log them in ->dirty_inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Christoph Hellwig
0d07e5573f fs: don't clear I_DIRTY_TIME before calling mark_inode_dirty_sync
__mark_inode_dirty already takes care of that, and for the XFS lazytime
implementation we need to know that ->dirty_inode was called because
I_DIRTY_TIME was set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Nikolay Borisov
bcab2ebfa1 xfs: Remove dead code from inode recover function
The memcpy is guarded by a check which is performed a right before we
call xfs_log_dinode_to_disk. At this point we are sure this check will
always be false otherwise we would have errored out. So let's remove
this dead weight.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Carlos Maiolino
e157ebdcb3 Cleanup old XFS_BTREE_* traces
Remove unused legacy btree traces from IRIX era.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Eric Sandeen
4603fa744c xfs: remove unused m_dmevmask from xfs_mount struct
The dmevmask structure member is a dmapi leftover; it's
set here and there but never actually used.  Remove it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Dave Chinner
cb0a8d2302 xfs: fall back to vmalloc when allocation log vector buffers
When using large directory blocks, we regularly see memory
allocations of >64k being made for the shadow log vector buffer.
When we are under memory pressure, kmalloc() may not be able to find
contiguous memory chunks large enough to satisfy these allocations
easily, and if memory is fragmented we can potentially stall here.

TO avoid this problem, switch the log vector buffer allocation to
use kmem_alloc_large(). This will allow failed allocations to fall
back to vmalloc and so remove the dependency on large contiguous
regions of memory being available. This should prevent slowdowns
and potential stalls when memory is low and/or fragmented.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-11 20:27:55 -07:00
Geliang Tang
cb3bee0369 pstore: Use crypto compress API
In the pstore compression part, we use zlib/lzo/lz4/lz4hc/842
compression algorithm API to implement pstore compression backends. But
there are many repeat codes in these implementations. This patch uses
crypto compress API to simplify these codes.

1) rewrite allocate_buf_for_compression, free_buf_for_compression,
pstore_compress, pstore_decompress functions using crypto compress API.
2) drop compress, decompress, allocate, free functions in pstore_zbackend,
and add zbufsize function to get each different compress buffer size.
3) use late_initcall to call ramoops_init later, to make sure the crypto
subsystem has already initialized.
4) use 'unsigned int' type instead of 'size_t' in pstore_compress,
pstore_decompress functions' length arguments.
5) rename 'zlib' to 'deflate' to follow the crypto API's name convention.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
[kees: tweaked error messages on allocation failures and Kconfig help]
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-03-09 14:16:29 -08:00
Linus Torvalds
719ea86151 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
 "This fixes a corner case for NFS exporting (introduced in this cycle)
  as well as fixing miscellaneous bugs"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: update Kconfig texts
  ovl: redirect_dir=nofollow should not follow redirect for opaque lower
  ovl: fix ptr_ret.cocci warnings
  ovl: check ERR_PTR() return value from ovl_lookup_real()
  ovl: check lower ancestry on encode of lower dir file handle
  ovl: hash non-dir by lower inode for fsnotify
2018-03-09 09:46:14 -08:00
Linus Torvalds
2d9b1d69c3 Changes since last update:
- Fix some iomap locking problems
 - Don't allocate cow blocks when we're zeroing file data
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCgAGBQJamePeAAoJEPh/dxk0SrTryPgP/0sh8xcAFUHR0ad7D74nSp3o
 1gjBi6Yl+i+hn5SZ/gkKPa5YFjTMaefcTV8rpZm3uPYbTf75TTscjKDDczFNEqoh
 CDKOeyfOchVSjhRI5ExElBh5ToxDgh+HMzzlmxk+olfWA+BXjJj2J2iLh871v9Ym
 hQl7xOYUDPu1xZz+jLhDLKG2Gd/R97ez2KZM43gKMAUsQ7eUbz6CtV73cFT+tNfP
 jjIe3xp6ohbJYJalMThbNcI2mOOssnCM8BHUDBJib7N4CXucJYTMpDAcGvNoFKul
 +K2Icyip1/6CS4LkWDpP29ayiH+sTbZk8/ipioe27hXXHGqKG84HQOBefRZmo4o0
 CC64SoomuxQX6V7qL6hf0Sz82QQbl+ToSliZ97xXP5o84/OxhggZGqVdykTR2+DB
 4POaJzRPmYJKPO4Q0yT2worB4oepAUrridJnaQE0y9F80xnLTvJxpD53atLCHulj
 Ht9NlvnvqndDYaYjQwvucOmdR9vCos3OjZCaXnYgL+EpbgZUvMeQoxuyBUBTdQeQ
 CNGq3EDssI75gD67sWeyFc+gOST259XofANw3mvlu3KS2/huviORJkDYdtzi2/LL
 gkCN9OW9CmXOaAiylE8dSAu98SAR+c83wmahMeBA2mq5vYrQU4rIDGYZi/Mulne4
 tD7eW5nlvps9Fsvf58fd
 =T9nM
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:

 - Fix some iomap locking problems

 - Don't allocate cow blocks when we're zeroing file data

* tag 'xfs-4.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: don't block on the ilock for RWF_NOWAIT
  xfs: don't start out with the exclusive ilock for direct I/O
  xfs: don't allocate COW blocks for zeroing holes or unwritten extents
2018-03-09 09:37:29 -08:00
Kyle Spiers
a9cee1764b reiserfs: Remove VLA from fs/reiserfs/reiserfs.h
Remove Variable Length Array from fs/reiserfs/reiserfs.h. EMPTY_DIR_SIZE
is used as an array size and as it is using strlen() it need not be
evaluated at compile time. Change it's definition to use sizeof() to
force evaluation of array length at compile time.

Signed-off-by: Kyle Spiers <kyle@spiers.me>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-03-09 13:04:20 +01:00
Trond Myklebust
c4f24df942 NFS: Fix unstable write completion
We do want to respect the FLUSH_SYNC argument to nfs_commit_inode() to
ensure that all outstanding COMMIT requests to the inode in question are
complete. Currently we may exit early from both nfs_commit_inode() and
nfs_write_inode() even if there are COMMIT requests in flight, or unstable
writes on the commit list.

In order to get the right semantics w.r.t. sync_inode(), we don't need
to have nfs_commit_inode() reset the inode dirty flags when called from
nfs_wb_page() and/or nfs_wb_all(). We just need to ensure that
nfs_write_inode() leaves them in the right state if there are outstanding
commits, or stable pages.

Reported-by: Scott Mayhew <smayhew@redhat.com>
Fixes: dc4fd9ab01 ("nfs: don't wait on commit in nfs_commit_inode()...")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-03-08 12:56:32 -05:00
Trond Myklebust
9c6376ebdd pNFS: Prevent the layout header refcount going to zero in pnfs_roc()
Ensure that we hold a reference to the layout header when processing
the pNFS return-on-close so that the refcount value does not inadvertently
go to zero.

Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.10+
Tested-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
2018-03-08 12:56:31 -05:00
Trond Myklebust
d9ee65539d NFS: Fix an incorrect type in struct nfs_direct_req
The start offset needs to be of type loff_t.

Fixed: 5fadeb47dc ("nfs: count DIO good bytes correctly with mirroring")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-03-08 12:56:31 -05:00
Andreas Gruenbacher
d39d18e0ef gfs2: Improve gfs2_block_map comment
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-08 09:26:20 -07:00
Bob Peterson
b9e03f1861 GFS2: Only set PageChecked for jdata pages
Before this patch, GFS2 was setting the PageChecked flag for ordered
write pages. This is unnecessary. The ext3 file system only does it
for jdata, and it's only used in jdata circumstances. It only muddies
the already murky waters of writing pages in the aops.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-08 09:26:20 -07:00
Bob Peterson
9bc980cdb9 GFS2: Make function gfs2_remove_from_ail static
Function gfs2_remove_from_ail is only ever used from log.c, so there
is no reason to declare it extern. This patch removes the extern and
declares it static.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-08 09:26:20 -07:00
Andreas Gruenbacher
83998ccd9b gfs2: Dirty source inode during rename
Mark the source inode dirty during a rename instead of just updating the
underlying buffer head.  Otherwise, fsync may find the inode clean and
will then skip flushing the journal.  A subsequent power failure will
cause the rename to be lost.  This happens in command sequences like:

  xfs_io -f -c 'pwrite 0 4096' -c 'fsync' foo
  mv foo bar
  xfs_io -c 'fsync' bar
  # power failure

Fixes xfstests generic/322, generic/376.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-08 09:26:20 -07:00
Andreas Gruenbacher
174d1232eb gfs2: Fix fallocate chunk size
The chunk size of allocations in __gfs2_fallocate is calculated
incorrectly.  The size can collapse, causing __gfs2_fallocate to
allocate one block at a time, which is very inefficient.  This needs
fixing in two places:

In gfs2_quota_lock_check, always set ap->allowed to UINT_MAX to indicate
that there is no quota limit.  This fixes callers that rely on
ap->allowed to be set even when quotas are off.

In __gfs2_fallocate, reset max_blks to UINT_MAX in each iteration of the
loop to make sure that allocation limits from one resource group won't
spill over into another resource group.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-08 09:26:20 -07:00
Kees Cook
f2531f1976 pstore/ram: Do not use stack VLA for parity workspace
Instead of using a stack VLA for the parity workspace, preallocate a
memory region. The preallocation is done to keep from needing to perform
allocations during crash dump writing, etc. This also fixes a missed
release of librs on free.

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-03-07 12:47:06 -08:00
Kees Cook
fe1d475888 pstore: Select compression at runtime
To allow for easier build test coverage and run-time testing, this allows
multiple compression algorithms to be built into pstore. Still only one
is supported to operate at a time (which can be selected at build time
or at boot time, similar to how LSMs are selected).

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-03-07 12:43:35 -08:00
Andreas Gruenbacher
3b5da96e45 gfs2: Fixes to "Implement iomap for block_map" (2)
It turns out that commit 3229c18c0d6b2 'Fixes to "Implement iomap for
block_map"' introduced another bug in gfs2_iomap_begin that can cause
gfs2_block_map to set bh->b_size of an actual buffer to 0.  This can
lead to arbitrary incorrect behavior including crashes or disk
corruption.  Revert the incorrect part of that commit.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-03-07 11:40:38 -07:00
Miklos Szeredi
36cd95dfa1 ovl: update Kconfig texts
Add some hints about overlayfs kernel config options.

Enabling NFS export by default is especially recommended against, as it
incurs a performance penalty even if the filesystem is not actually
exported.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-03-07 11:47:15 +01:00
Kees Cook
555974068e pstore: Avoid size casts for 842 compression
Instead of casting, make sure we don't end up with giant values and just
perform regular assignments with unsigned int instead of re-cast size_t.

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-03-06 15:15:24 -08:00
Geliang Tang
239b716199 pstore: Add lz4hc and 842 compression support
Currently, pstore has supported three compression algorithms: zlib,
lzo and lz4. This patch added two more compression algorithms: lz4hc
and 842.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
[kees: tweaked Kconfig help text slightly]
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-03-06 15:06:11 -08:00
David S. Miller
0f3e9c97eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All of the conflicts were cases of overlapping changes.

In net/core/devlink.c, we have to make care that the
resouce size_params have become a struct member rather
than a pointer to such an object.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-06 01:20:46 -05:00
Linus Torvalds
af8c081627 for-4.16-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlqbuK8ACgkQxWXV+ddt
 WDsiVQ//eE8Axfw4qWOyHHjozoAIu9kifvOFQIhvKviLiXHJNrD/vBI6YwD1hyD1
 rbbLilMsEl1OD1Sq3AeOUMNSSl5qUFEB+CA8vg9GznFTNRobTkL+p96Zt5xlRDu3
 lsFFV93tED+dK4D/eSGP+xYbknA8hIk/2gWPkd6hpYKyh2QdsPBYDqCnaEXvd79P
 DIP/cAjIfzqQn0FTiZ9wbaES+LPO+NoZgQRC2w9McYQ5CEMc+oAChEmPJRwpPoKy
 YdhuwoULniRNHVnVOIVfi4w9jkSPSz7YIgLeRFli/WGBYGcKeHTMFkMa12KdpuUC
 JUclOogJ5ZMbFV3C0W8XEJ7Vb9ltIevrH8MgfKP/3BScuZzbJZ+n5KkH2Gf7vcpe
 w5cZGOsKDz+35fDCdTmsFpDK9kpGzHq48JlRifOjARbdyqNwVq4emxOeQlO1ygzq
 Y9H5UeMpp/FDAm6g/bV8ezCXzwuwUDV9CwAJBD+WCiZhD2nX85FfIp1kfF2zLcUg
 Y8irqV6A/J/0BFkF7Iu9AuzTxOdo6zMLkcHEKb+sL7yGxZ2248v5gZxDoyV5iNWR
 WY4Y2GaMemZGu6+NyyLFAJzlFyACD5fSy8vT7KD6upCzwmlAtaJ3VULfTrLyi/uM
 SNZAKf3WzKBVe5h52GNGRCi5OLTteH6Yqz5m5/h8r74mBO00IrA=
 =bF96
 -----END PGP SIGNATURE-----

Merge tag 'for-4.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - when NR_CPUS is large, a SRCU structure can significantly inflate
   size of the main filesystem structure that would not be possible to
   allocate by kmalloc, so the kvalloc fallback is used

 - improved error handling

 - fix endiannes when printing some filesystem attributes via sysfs,
   this is could happen when a filesystem is moved between different
   endianity hosts

 - send fixes: the NO_HOLE mode should not send a write operation for a
   file hole

 - fix log replay for for special files followed by file hardlinks

 - fix log replay failure after unlink and link combination

 - fix max chunk size calculation for DUP allocation

* tag 'for-4.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix log replay failure after unlink and link combination
  Btrfs: fix log replay failure after linking special file and fsync
  Btrfs: send, fix issuing write op when processing hole in no data mode
  btrfs: use proper endianness accessors for super_copy
  btrfs: alloc_chunk: fix DUP stripe size handling
  btrfs: Handle btrfs_set_extent_delalloc failure in relocate_file_extent_cluster
  btrfs: handle failure of add_pending_csums
  btrfs: use kvzalloc to allocate btrfs_fs_info
2018-03-04 11:04:27 -08:00
Linus Torvalds
2833419a62 A cap handling fix from Zhi that ensures that metadata writeback isn't
delayed and three error path memory leak fixups from Chengguang.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJamV3qAAoJEEp/3jgCEfOLv6kIAI0RXShR20dz9M3GnL2bDws7
 KFkC/aNwbviXMAih3q7Q7ZiwXv6BI/uUK0YIYG/tWZoNHucYGS8ZOFDikdiIMcJR
 2VP8Yas1/NvnLlh0ut45s2imvbXinkHcjWOLqgJJmJ2cEP9Ar/mbwz//TgfVMdNm
 TSHEm2LPRge9jK3Ab/ZiatExGH6dx5hRJvssSauF+f2iN+q4nKj2aBqle+FKaNWT
 Iv0ONi2f7aXnaXRvum4Y1gnKWXzKSFd8Y2Cr5R3tfTrKBv8nmQz54eRsMnnuxK/7
 uV6Ujch4LUXBI1etXLqTwRdZQIcw3Li5PYxGrO8fl8mQfJau9yv5REw/rbgB5FU=
 =f8gA
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.16-rc4' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A cap handling fix from Zhi that ensures that metadata writeback isn't
  delayed and three error path memory leak fixups from Chengguang"

* tag 'ceph-for-4.16-rc4' of git://github.com/ceph/ceph-client:
  ceph: fix potential memory leak in init_caches()
  ceph: fix dentry leak when failing to init debugfs
  libceph, ceph: avoid memory leak when specifying same option several times
  ceph: flush dirty caps of unlinked inode ASAP
2018-03-02 10:05:10 -08:00
Linus Torvalds
fb6d47a592 for-linus-20180302
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJamXImAAoJEPfTWPspceCmEP4P/3kkm0JIXtbNZFMb1JZtsjwE
 t4OUVEDj4jjRfmZfUVkajPnczM4MSPiXm43PbcOi4NF53mv8k76jyIPhlZREzYzq
 MBknibvpqyiWpbii9tBRrR6FGDR/N51//ya9vdPaYBcBssTg6Aqtt4BE5oPfo011
 PleGROe1jtrBUNBy2dMy4sHb/MvZ0vRuNPxMsD8Agy5UiVeItAelY/lDn1Hw41BY
 O+muE5bw6+yKqB9vGXhV3O4WRh8BofJi1YdzbwbbIzH40ZZK5VTDQc5o19/CFEZ/
 uZ8BStOFEWA0LNuarME5fknWcogiedEtszweyiWBbVZo4VqCsfxPoaRCibY/Wg5F
 a0UNJ4iSzglhfSMoHJlhvlCAMCyubFSeMSdJjrrpIcyBrziJXpcEXcUnWI43yi4P
 FoM8zUni22XnfLWxIdTjVkMRytjtqTLcXOHXdP5N/ESa80jBq3Q76TLmzIKW+kK5
 sAre+hgr52NdgovP/NSxsdvsckAolWNe40JI8wLbwNo+lMHr0ckzOG+sAdz1iPRK
 iVL0CAlby4A94Wcu+OHCwfY7B9lBrMuMfHsesEM6x1cxgAhd3YNfEJ8g2QolCUEV
 KmZizXbV9nnmJfegVC06SgM+D7AR26dwsBG2aoibShuvdxX6jMdUHygyu5DCJdg/
 JS+q71jmxb/r1TWe/62r
 =AMhV
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180302' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A collection of fixes for this series. This is a little larger than
  usual at this time, but that's mainly because I was out on vacation
  last week. Nothing in here is major in any way, it's just two weeks of
  fixes. This contains:

   - NVMe pull from Keith, with a set of fixes from the usual suspects.

   - mq-deadline zone unlock fix from Damien, fixing an issue with the
     SMR zone locking added for 4.16.

   - two bcache fixes sent in by Michael, with changes from Coly and
     Tang.

   - comment typo fix from Eric for blktrace.

   - return-value error handling fix for nbd, from Gustavo.

   - fix a direct-io case where we don't defer to a completion handler,
     making us sleep from IRQ device completion. From Jan.

   - a small series from Jan fixing up holes around handling of bdev
     references.

   - small set of regression fixes from Jiufei, mostly fixing problems
     around the gendisk pointer -> partition index change.

   - regression fix from Ming, fixing a boundary issue with the discard
     page cache invalidation.

   - two-patch series from Ming, fixing both a core blk-mq-sched and
     kyber issue around token freeing on a requeue condition"

* tag 'for-linus-20180302' of git://git.kernel.dk/linux-block: (24 commits)
  block: fix a typo
  block: display the correct diskname for bio
  block: fix the count of PGPGOUT for WRITE_SAME
  mq-deadline: Make sure to always unlock zones
  nvmet: fix PSDT field check in command format
  nvme-multipath: fix sysfs dangerously created links
  nbd: fix return value in error handling path
  bcache: fix kcrashes with fio in RAID5 backend dev
  bcache: correct flash only vols (check all uuids)
  blktrace_api.h: fix comment for struct blk_user_trace_setup
  blockdev: Avoid two active bdev inodes for one device
  genhd: Fix BUG in blkdev_open()
  genhd: Fix use after free in __blkdev_get()
  genhd: Add helper put_disk_and_module()
  genhd: Rename get_disk() to get_disk_and_module()
  genhd: Fix leaked module reference for NVME devices
  direct-io: Fix sleep in atomic due to sync AIO
  nvme-pci: Fix nvme queue cleanup if IRQ setup fails
  block: kyber: fix domain token leak during requeue
  blk-mq: don't call io sched's .requeue_request when requeueing rq to ->dispatch
  ...
2018-03-02 09:35:36 -08:00
Chengguang Xu
785dffe1da udf: fix potential refcnt problem of nls module
When specifiying iocharset multiple times in a mount or once/multiple in
a remount, current option parsing may cause inaccurate refcount of nls
module.  Also, in the failure cleanup of option parsing, the condition
of calling unload_nls is not sufficient.

Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-03-02 14:23:12 +01:00
Chengguang Xu
d090990510 ext2: change return code to -ENOMEM when failing memory allocation
Change return code to -ENOMEM from -EINVAL when failing memory
allocation in fill_super(), meanwhile delete redundant initial
assignment of variable err.

Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-03-02 14:23:11 +01:00
Jan Kara
b72e632c6c udf: Do not mark possibly inconsistent filesystems as closed
If logical volume integrity descriptor contains non-closed integrity
type when mounting the volume, there are high chances that the volume is
not consistent (device was detached before the filesystem was
unmounted). Don't touch integrity type of such volume so that fsck can
recognize it and check such filesystem.

Reported-by: Pali Rohár <pali.rohar@gmail.com>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-03-02 14:22:57 +01:00
Christoph Hellwig
ff3d8b9c4c xfs: don't block on the ilock for RWF_NOWAIT
Fix xfs_file_iomap_begin to trylock the ilock if IOMAP_NOWAIT is passed,
so that we don't block io_submit callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-01 14:12:45 -08:00
Christoph Hellwig
af5b5afe9a xfs: don't start out with the exclusive ilock for direct I/O
There is no reason to take the ilock exclusively at the start of
xfs_file_iomap_begin for direct I/O, given that it will be demoted
just before calling xfs_iomap_write_direct anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-01 14:12:12 -08:00
Christoph Hellwig
172ed391f6 xfs: don't allocate COW blocks for zeroing holes or unwritten extents
The iomap zeroing interface is smart enough to skip zeroing holes or
unwritten extents.  Don't subvert this logic for reflink files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-03-01 14:10:31 -08:00
Chengguang Xu
1c78924957 ceph: fix potential memory leak in init_caches()
There is lack of cache destroy operation for ceph_file_cachep
when failing from fscache register.

Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-03-01 16:39:47 +01:00
Filipe Manana
1f250e929a Btrfs: fix log replay failure after unlink and link combination
If we have a file with 2 (or more) hard links in the same directory,
remove one of the hard links, create a new file (or link an existing file)
in the same directory with the name of the removed hard link, and then
finally fsync the new file, we end up with a log that fails to replay,
causing a mount failure.

Example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/testdir
  $ touch /mnt/testdir/foo
  $ ln /mnt/testdir/foo /mnt/testdir/bar

  $ sync

  $ unlink /mnt/testdir/bar
  $ touch /mnt/testdir/bar
  $ xfs_io -c "fsync" /mnt/testdir/bar

  <power failure>

  $ mount /dev/sdb /mnt
  mount: mount(2) failed: /mnt: No such file or directory

When replaying the log, for that example, we also see the following in
dmesg/syslog:

  [71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
  [71813.674204] ------------[ cut here ]------------
  [71813.675694] BTRFS: Transaction aborted (error -2)
  [71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
  [71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
  [71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G        W        4.15.0-rc9-btrfs-next-56+ #1
  [71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
  [71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
  [71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286
  [71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001
  [71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff
  [71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001
  [71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8
  [71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101
  [71813.679669] FS:  00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
  [71813.679669] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0
  [71813.679669] Call Trace:
  [71813.679669]  btrfs_unlink_inode+0x17/0x41 [btrfs]
  [71813.679669]  drop_one_dir_item+0xfa/0x131 [btrfs]
  [71813.679669]  add_inode_ref+0x71e/0x851 [btrfs]
  [71813.679669]  ? __lock_is_held+0x39/0x71
  [71813.679669]  ? replay_one_buffer+0x53/0x53a [btrfs]
  [71813.679669]  replay_one_buffer+0x4a4/0x53a [btrfs]
  [71813.679669]  ? rcu_read_unlock+0x3a/0x57
  [71813.679669]  ? __lock_is_held+0x39/0x71
  [71813.679669]  walk_up_log_tree+0x101/0x1d2 [btrfs]
  [71813.679669]  walk_log_tree+0xad/0x188 [btrfs]
  [71813.679669]  btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
  [71813.679669]  ? replay_one_extent+0x544/0x544 [btrfs]
  [71813.679669]  open_ctree+0x1cf6/0x2209 [btrfs]
  [71813.679669]  btrfs_mount_root+0x368/0x482 [btrfs]
  [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
  [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
  [71813.679669]  ? mount_fs+0x64/0x10b
  [71813.679669]  mount_fs+0x64/0x10b
  [71813.679669]  vfs_kern_mount+0x68/0xce
  [71813.679669]  btrfs_mount+0x13e/0x772 [btrfs]
  [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
  [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
  [71813.679669]  ? mount_fs+0x64/0x10b
  [71813.679669]  mount_fs+0x64/0x10b
  [71813.679669]  vfs_kern_mount+0x68/0xce
  [71813.679669]  do_mount+0x6e5/0x973
  [71813.679669]  ? memdup_user+0x3e/0x5c
  [71813.679669]  SyS_mount+0x72/0x98
  [71813.679669]  entry_SYSCALL_64_fastpath+0x1e/0x8b
  [71813.679669] RIP: 0033:0x7f7cedf150ba
  [71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206
  [71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
  [71813.679669] ---[ end trace 83bd473fc5b4663b ]---
  [71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
  [71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
  [71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
  [71814.128078] BTRFS error (device dm-0): open_ctree failed

This happens because the log has inode reference items for both inode 258
(the first file we created) and inode 259 (the second file created), and
when processing the reference item for inode 258, we replace the
corresponding item in the subvolume tree (which has two names, "foo" and
"bar") witht he one in the log (which only has one name, "foo") without
removing the corresponding dir index keys from the parent directory.
Later, when processing the inode reference item for inode 259, which has
a name of "bar" associated to it, we notice that dir index entries exist
for that name and for a different inode, so we attempt to unlink that
name, which fails because the inode reference item for inode 258 no longer
has the name "bar" associated to it, making a call to btrfs_unlink_inode()
fail with a -ENOENT error.

Fix this by unlinking all the names in an inode reference item from a
subvolume tree that are not present in the inode reference item found in
the log tree, before overwriting it with the item from the log tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-01 16:18:40 +01:00
Filipe Manana
9a6509c4da Btrfs: fix log replay failure after linking special file and fsync
If in the same transaction we rename a special file (fifo, character/block
device or symbolic link), create a hard link for it having its old name
then sync the log, we will end up with a log that can not be replayed and
at when attempting to replay it, an EEXIST error is returned and mounting
the filesystem fails. Example scenario:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt
  $ mkdir /mnt/testdir
  $ mkfifo /mnt/testdir/foo
  # Make sure everything done so far is durably persisted.
  $ sync

  # Create some unrelated file and fsync it, this is just to create a log
  # tree. The file must be in the same directory as our special file.
  $ touch /mnt/testdir/f1
  $ xfs_io -c "fsync" /mnt/testdir/f1

  # Rename our special file and then create a hard link with its old name.
  $ mv /mnt/testdir/foo /mnt/testdir/bar
  $ ln /mnt/testdir/bar /mnt/testdir/foo

  # Create some other unrelated file and fsync it, this is just to persist
  # the log tree which was modified by the previous rename and link
  # operations. Alternatively we could have modified file f1 and fsync it.
  $ touch /mnt/f2
  $ xfs_io -c "fsync" /mnt/f2

  <power failure>

  $ mount /dev/sdc /mnt
  mount: mount /dev/sdc on /mnt failed: File exists

This happens because when both the log tree and the subvolume's tree have
an entry in the directory "testdir" with the same name, that is, there
is one key (258 INODE_REF 257) in the subvolume tree and another one in
the log tree (where 258 is the inode number of our special file and 257
is the inode for directory "testdir"). Only the data of those two keys
differs, in the subvolume tree the index field for inode reference has
a value of 3 while the log tree it has a value of 5. Because the same key
exists in both trees, but have different index, the log replay fails with
an -EEXIST error when attempting to replay the inode reference from the
log tree.

Fix this by setting the last_unlink_trans field of the inode (our special
file) to the current transaction id when a hard link is created, as this
forces logging the parent directory inode, solving the conflict at log
replay time.

A new generic test case for fstests was also submitted.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-01 16:18:34 +01:00
Filipe Manana
d4dfc0f4d3 Btrfs: send, fix issuing write op when processing hole in no data mode
When doing an incremental send of a filesystem with the no-holes feature
enabled, we end up issuing a write operation when using the no data mode
send flag, instead of issuing an update extent operation. Fix this by
issuing the update extent operation instead.

Trivial reproducer:

  $ mkfs.btrfs -f -O no-holes /dev/sdc
  $ mkfs.btrfs -f /dev/sdd
  $ mount /dev/sdc /mnt/sdc
  $ mount /dev/sdd /mnt/sdd

  $ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar
  $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1

  $ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar
  $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2

  $ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd
  $ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \
       | btrfs receive -vv /mnt/sdd

Before this change the output of the second receive command is:

  receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447...
  utimes
  write foobar, offset 8192, len 8192
  utimes foobar
  BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-...

After this change it is:

  receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
  utimes
  update_extent foobar: offset=8192, len=8192
  utimes foobar
  BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-01 16:18:07 +01:00
Anand Jain
3c181c12c4 btrfs: use proper endianness accessors for super_copy
The fs_info::super_copy is a byte copy of the on-disk structure and all
members must use the accessor macros/functions to obtain the right
value.  This was missing in update_super_roots and in sysfs readers.

Moving between opposite endianness hosts will report bogus numbers in
sysfs, and mount may fail as the root will not be restored correctly. If
the filesystem is always used on a same endian host, this will not be a
problem.

Fix this by using the btrfs_set_super...() functions to set
fs_info::super_copy values, and for the sysfs, use the cached
fs_info::nodesize/sectorsize values.

CC: stable@vger.kernel.org
Fixes: df93589a17 ("btrfs: export more from FS_INFO to sysfs")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-01 16:17:27 +01:00
Hans van Kranenburg
92e222df7b btrfs: alloc_chunk: fix DUP stripe size handling
In case of using DUP, we search for enough unallocated disk space on a
device to hold two stripes.

The devices_info[ndevs-1].max_avail that holds the amount of unallocated
space found is directly assigned to stripe_size, while it's actually
twice the stripe size.

Later on in the code, an unconditional division of stripe_size by
dev_stripes corrects the value, but in the meantime there's a check to
see if the stripe_size does not exceed max_chunk_size. Since during this
check stripe_size is twice the amount as intended, the check will reduce
the stripe_size to max_chunk_size if the actual correct to be used
stripe_size is more than half the amount of max_chunk_size.

The unconditional division later tries to correct stripe_size, but will
actually make sure we can't allocate more than half the max_chunk_size.

Fix this by moving the division by dev_stripes before the max chunk size
check, so it always contains the right value, instead of putting a duct
tape division in further on to get it fixed again.

Since in all other cases than DUP, dev_stripes is 1, this change only
affects DUP.

Other attempts in the past were made to fix this:
* 37db63a400 "Btrfs: fix max chunk size check in chunk allocator" tried
to fix the same problem, but still resulted in part of the code acting
on a wrongly doubled stripe_size value.
* 86db25785a "Btrfs: fix max chunk size on raid5/6" unintentionally
broke this fix again.

The real problem was already introduced with the rest of the code in
73c5de0051.

The user visible result however will be that the max chunk size for DUP
will suddenly double, while it's actually acting according to the limits
in the code again like it was 5 years ago.

Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html
Fixes: 73c5de0051 ("btrfs: quasi-round-robin for chunk allocation")
Fixes: 86db25785a ("Btrfs: fix max chunk size on raid5/6")
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-01 16:16:47 +01:00
Nikolay Borisov
765f3cebff btrfs: Handle btrfs_set_extent_delalloc failure in relocate_file_extent_cluster
Essentially duplicate the error handling from the above block which
handles the !PageUptodate(page) case and additionally clear
EXTENT_BOUNDARY.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-01 16:16:12 +01:00
Nikolay Borisov
ac01f26a27 btrfs: handle failure of add_pending_csums
add_pending_csums was added as part of the new data=ordered
implementation in e6dcd2dc9c ("Btrfs: New data=ordered
implementation"). Even back then it called the btrfs_csum_file_blocks
which can fail but it never bothered handling the failure. In ENOMEM
situation this could lead to the filesystem failing to write the
checksums for a particular extent and not detect this. On read this
could lead to the filesystem erroring out due to crc mismatch. Fix it by
propagating failure from add_pending_csums and handling them.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-01 16:16:00 +01:00