Commit Graph

54142 Commits

Author SHA1 Message Date
Ronnie Sahlberg
1fc6ad2f10 cifs: remove header_preamble_size where it is always 0
Since header_preamble_size is 0 for SMB2+ we can remove it in those
code paths that are only invoked from SMB2.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-01 09:14:30 -05:00
Ronnie Sahlberg
49f466bdbd cifs: remove struct smb2_hdr
struct smb2_hdr is now just a wrapper for smb2_sync_hdr.
We can thus get rid of smb2_hdr completely and access the sync header directly.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-01 09:14:30 -05:00
Mark Syms
d81243c697 CIFS: 511c54a2f6 adds a check for session expiry, status STATUS_NETWORK_SESSION_EXPIRED, however the server can also respond with STATUS_USER_SESSION_DELETED in cases where the session has been idle for some time and the server reaps the session to recover resources.
Handle this additional status in the same way as SESSION_EXPIRED.

Signed-off-by: Mark Syms <mark.syms@citrix.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2018-06-01 09:14:13 -05:00
Ronnie Sahlberg
e4dc31fe9a cifs: change smb2_get_data_area_len to take a smb2_sync_hdr as argument
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-05-31 21:30:51 -05:00
Ronnie Sahlberg
84f0cbfba8 cifs: update smb2_calc_size to use smb2_sync_hdr instead of smb2_hdr
smb2_hdr is just a wrapper around smb2_sync_hdr at this stage
and smb2_hdr is going away.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-05-31 21:30:51 -05:00
Ronnie Sahlberg
0d5a288d25 cifs: remove struct smb2_oplock_break_rsp
The two structures smb2_oplock_breaq_req/rsp are now basically identical.
Replace this with a single definition of a smb2_oplock_break structure.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-05-31 21:30:51 -05:00
Ronnie Sahlberg
977b617040 cifs: remove rfc1002 header from all SMB2 response structures
Separate out all the 4 byte rfc1002 headers so that they are no longer
part of the SMB2 header structures to prepare for future work to add
compounding support.

Update the smb3 transform header processing that we no longer have
a rfc1002 header at the start of this structure.

Update smb2_readv_callback to accommodate that the first iovector in the
response is no the smb2 header and no longer a rfc1002 header.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-05-31 21:30:50 -05:00
Steve French
b2adf22fdf smb3: on reconnect set PreviousSessionId field
The server detects reconnect by the (non-zero) value in PreviousSessionId
of SMB2/SMB3 SessionSetup request, but this behavior regressed due
to commit 166cea4dc3
("SMB2: Separate RawNTLMSSP authentication from SMB2_sess_setup")

CC: Stable <stable@vger.kernel.org>
CC: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-05-31 21:23:07 -05:00
Steve French
ce558b0e17 smb3: Add posix create context for smb3.11 posix mounts
Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-31 21:23:07 -05:00
Linus Torvalds
0512e01345 Changes since last update:
- Clear out i_mapping error state when we're reinitializing inodes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlsPY0wACgkQ+H93GTRK
 tOuuDQ/+IvBngJL9I5py7GF0EXlMuge0nAulEj4d1ZT4tNCPp0Ouu4Jy+za+RapQ
 w604fvI6VPPbtidjLpUkR+ZzVeIAanaUkHY+MXl7DEYnyKC+VO7rPZUQiXe4kCeE
 ExNpL063vj4FND3xO/tXz2cs6Wjk8RuCLPWprLVKPpZ79w+BQwYFlpKGschMhR7w
 EQM+7TIJHff1C2nwETbX5ZcM6yxo6PVUwxEsF7+pubVulMoJZ57m5OnS7RXZY7L7
 33S3du85A/Unby+mlYQTsmWf+1FOfIIf6+r1i13gRorGSZongPSenQdO6h4uKzXc
 3OHXTl783ip2cFhhbbTnDlmly66Q1wcDwUDd88YvP94Wv9K+lWASKJGqDwpT/T3/
 gkmg9pTXezPytTZb+F86nFN91b4NWSdskwN4/Du2ydnSEQVmzwdYLyc1oQn9IWal
 HITBlVApLF33rHgmPJXRT64uKPqsPttu3DR5337waTPKf8po+Xk7CaATIIHx8gTD
 Jj8UfH7b9u7tjk5yXnx7qVCquwsG1E8N3Xi5eqn2dsTVSqia3vjyBoI7esPX5DBO
 ZvbBuU5MMmGr0p7DCEcFbe/otToqdoc0quebuUodKbhUS70+RGDoqwfR+R7Gbprq
 M6+Tfm7S6DIKOsfgde5HBEEAjQtsrNMNdBsBemtL1v3fzI6SyJQ=
 =UtGb
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.17-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fix from Darrick Wong:
 "Clear out i_mapping error state when we're reinitializing inodes.

  This last minute fix prevents writeback error state from persisting
  past the end of the in-core inode lifecycle and causing EIO errors to
  be reported to userspace when no error has occurred.

  This fix for the behavioral regression has been soaking in for-next
  for a while, but various fs developers persuaded me to try to get it
  upstream for 4.17 because the patch that broke things was introduced
  in 4.17-rc4"

* tag 'xfs-4.17-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  fs: clear writeback errors in inode_init_always
2018-05-31 16:23:07 -05:00
Dave Jiang
80660f2025 dax: change bdev_dax_supported() to support boolean returns
The function return values are confusing with the way the function is
named. We expect a true or false return value but it actually returns
0/-errno.  This makes the code very confusing. Changing the return values
to return a bool where if DAX is supported then return true and no DAX
support returns false.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-31 08:58:34 -07:00
Darrick J. Wong
ba23cba9b3 fs: allow per-device dax status checking for filesystems
Change bdev_dax_supported so it takes a bdev parameter.  This enables
multi-device filesystems like xfs to check that a dax device can work for
the particular filesystem.  Once that's in place, actually fix all the
parts of XFS where we need to be able to distinguish between datadev and
rtdev.

This patch fixes the problem where we screw up the dax support checking
in xfs if the datadev and rtdev have different dax capabilities.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[rez: Re-added __bdev_dax_supported() for !CONFIG_FS_DAX cases]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-05-31 08:58:33 -07:00
Tetsuo Handa
543b8f8662 fuse: don't keep dead fuse_conn at fuse_fill_super().
syzbot is reporting use-after-free at fuse_kill_sb_blk() [1].
Since sb->s_fs_info field is not cleared after fc was released by
fuse_conn_put() when initialization failed, fuse_kill_sb_blk() finds
already released fc and tries to hold the lock. Fix this by clearing
sb->s_fs_info field after calling fuse_conn_put().

[1] https://syzkaller.appspot.com/bug?id=a07a680ed0a9290585ca424546860464dd9658db

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+ec3986119086fe4eec97@syzkaller.appspotmail.com>
Fixes: 3b463ae0c6 ("fuse: invalidation reverse calls")
Cc: John Muir <john@jmuir.com>
Cc: Csaba Henk <csaba@gluster.com>
Cc: Anand Avati <avati@redhat.com>
Cc: <stable@vger.kernel.org> # v2.6.31
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 12:26:11 +02:00
Miklos Szeredi
6becdb601b fuse: fix control dir setup and teardown
syzbot is reporting NULL pointer dereference at fuse_ctl_remove_conn() [1].
Since fc->ctl_ndents is incremented by fuse_ctl_add_conn() when new_inode()
failed, fuse_ctl_remove_conn() reaches an inode-less dentry and tries to
clear d_inode(dentry)->i_private field.

Fix by only adding the dentry to the array after being fully set up.

When tearing down the control directory, do d_invalidate() on it to get rid
of any mounts that might have been added.

[1] https://syzkaller.appspot.com/bug?id=f396d863067238959c91c0b7cfc10b163638cac6
Reported-by: syzbot <syzbot+32c236387d66c4516827@syzkaller.appspotmail.com>
Fixes: bafa96541b ("[PATCH] fuse: add control filesystem")
Cc: <stable@vger.kernel.org> # v2.6.18
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 12:26:10 +02:00
Tejun Heo
8a301eb16d fuse: fix congested state leak on aborted connections
If a connection gets aborted while congested, FUSE can leave
nr_wb_congested[] stuck until reboot causing wait_iff_congested() to
wait spuriously which can lead to severe performance degradation.

The leak is caused by gating congestion state clearing with
fc->connected test in request_end().  This was added way back in 2009
by 26c3679101 ("fuse: destroy bdi on umount").  While the commit
description doesn't explain why the test was added, it most likely was
to avoid dereferencing bdi after it got destroyed.

Since then, bdi lifetime rules have changed many times and now we're
always guaranteed to have access to the bdi while the superblock is
alive (fc->sb).

Drop fc->connected conditional to avoid leaking congestion states.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Joshua Miller <joshmiller@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@vger.kernel.org # v2.6.29+
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 12:26:10 +02:00
Eric W. Biederman
4ad769f3c3 fuse: Allow fully unprivileged mounts
Now that the fuse and the vfs work is complete.  Allow the fuse filesystem
to be mounted by the root user in a user namespace.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 12:26:10 +02:00
Eric W. Biederman
e45b2546e2 fuse: Ensure posix acls are translated outside of init_user_ns
Ensure the translation happens by failing to read or write
posix acls when the filesystem has not indicated it supports
posix acls.

This ensures that modern cached posix acl support is available
and used when dealing with posix acls.  This is important
because only that path has the code to convernt the uids and
gids in posix acls into the user namespace of a fuse filesystem.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 12:26:10 +02:00
Tomohiro Misono
23d0b79dfa btrfs: Add unprivileged version of ino_lookup ioctl
Add unprivileged version of ino_lookup ioctl BTRFS_IOC_INO_LOOKUP_USER
to allow normal users to call "btrfs subvolume list/show" etc. in
combination with BTRFS_IOC_GET_SUBVOL_INFO/BTRFS_IOC_GET_SUBVOL_ROOTREF.

This can be used like BTRFS_IOC_INO_LOOKUP but the argument is
different. This is  because it always searches the fs/file tree
correspoinding to the fd with which this ioctl is called and also
returns the name of bottom subvolume.

The main differences from original ino_lookup ioctl are:

  1. Read + Exec permission will be checked using inode_permission()
     during path construction. -EACCES will be returned in case
     of failure.
  2. Path construction will be stopped at the inode number which
     corresponds to the fd with which this ioctl is called. If
     constructed path does not exist under fd's inode, -EACCES
     will be returned.
  3. The name of bottom subvolume is also searched and filled.

Note that the maximum length of path is shorter 256 (BTRFS_VOL_NAME_MAX+1)
bytes than ino_lookup ioctl because of space of subvolume's name.

Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
[ style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-31 11:35:24 +02:00
Tomohiro Misono
42e4b520c8 btrfs: Add unprivileged ioctl which returns subvolume's ROOT_REF
Add unprivileged ioctl BTRFS_IOC_GET_SUBVOL_ROOTREF which returns
ROOT_REF information of the subvolume containing this inode except the
subvolume name (this is because to prevent potential name leak). The
subvolume name will be gained by user version of ino_lookup ioctl
(BTRFS_IOC_INO_LOOKUP_USER) which also performs permission check.

The min id of root ref's subvolume to be searched is specified by
@min_id in struct btrfs_ioctl_get_subvol_rootref_args. After the search
ends, @min_id is set to the last searched root ref's subvolid + 1. Also,
if there are more root refs than BTRFS_MAX_ROOTREF_BUFFER_NUM,
-EOVERFLOW is returned. Therefore the caller can just call this ioctl
again without changing the argument to continue search.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
[ style fixes and struct item renames ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-31 11:35:23 +02:00
Tomohiro Misono
b64ec075bd btrfs: Add unprivileged ioctl which returns subvolume information
Add new unprivileged ioctl BTRFS_IOC_GET_SUBVOL_INFO which returns
the information of subvolume containing this inode.
(i.e. returns the information in ROOT_ITEM and ROOT_BACKREF.)

Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
[ minor style fixes, update struct comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-31 11:35:23 +02:00
Amir Goldstein
01b39dcc95 ovl: use inode_insert5() to hash a newly created inode
Currently, there is a small window where ovl_obtain_alias() can
race with ovl_instantiate() and create two different overlay inodes
with the same underlying real non-dir non-hardlink inode.

The race requires an adversary to guess the file handle of the
yet to be created upper inode and decode the guessed file handle
after ovl_creat_real(), but before ovl_instantiate().
This race does not affect overlay directory inodes, because those
are decoded via ovl_lookup_real() and not with ovl_obtain_alias().

This patch fixes the race, by using inode_insert5() to add a newly
created inode to cache.

If the newly created inode apears to already exist in cache (hashed
by the same real upper inode), we instantiate the dentry with the old
inode and drop the new inode, instead of silently not hashing the new
inode.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:12 +02:00
Vivek Goyal
ac6a52eb65 ovl: Pass argument to ovl_get_inode() in a structure
ovl_get_inode() right now has 5 parameters. Soon this patch series will
add 2 more and suddenly argument list starts looking too long.

Hence pass arguments to ovl_get_inode() in a structure and it looks
little cleaner.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:12 +02:00
Miklos Szeredi
80ea09a002 vfs: factor out inode_insert5()
Split out common helper for race free insertion of an already allocated
inode into the cache.  Use this from iget5_locked() and
insert_inode_locked4().  Make iget5_locked() use new_inode()/iput() instead
of alloc_inode()/destroy_inode() directly.

Also export to modules for use by filesystems which want to preallocate an
inode before file/directory creation.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:11 +02:00
Miklos Szeredi
b148cba403 ovl: clean up copy-up error paths
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:11 +02:00
Miklos Szeredi
dd8ac699ed ovl: return EIO on internal error
EIO better represents an internal error than ENOENT.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:11 +02:00
Al Viro
f73cc77c3a ovl: make ovl_create_real() cope with vfs_mkdir() safely
vfs_mkdir() may succeed and leave the dentry passed to it unhashed and
negative.  ovl_create_real() is the last caller breaking when that
happens.

[amir: split re-factoring of ovl_create_temp() to prep patch
       add comment about unhashed dir after mkdir
       add pr_warn() if mkdir succeeds and lookup fails]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:11 +02:00
Amir Goldstein
137ec526a2 ovl: create helper ovl_create_temp()
Also used ovl_create_temp() in ovl_create_index() instead of calling
ovl_do_mkdir() directly, so now all callers of ovl_do_mkdir() are routed
through ovl_create_real(), which paves the way for Al's fix for non-hashed
result from vfs_mkdir().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:11 +02:00
Miklos Szeredi
95a1c8153a ovl: return dentry from ovl_create_real()
Al Viro suggested to simplify callers of ovl_create_real() by
returning the created dentry (or ERR_PTR) from ovl_create_real().

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:11 +02:00
Amir Goldstein
471ec5dcf4 ovl: struct cattr cleanups
* Rename to ovl_cattr

* Fold ovl_create_real() hardlink argument into struct ovl_cattr

* Create macro OVL_CATTR() to initialize struct ovl_cattr from mode

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:10 +02:00
Amir Goldstein
6cf00764b0 ovl: strip debug argument from ovl_do_ helpers
It did not prove to be useful.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:10 +02:00
Amir Goldstein
a8b9e0ceed ovl: remove WARN_ON() real inode attributes mismatch
Overlayfs should cope with online changes to underlying layer
without crashing the kernel, which is what xfstest overlay/019
checks.

This test may sometimes trigger WARN_ON() in ovl_create_or_link()
when linking an overlay inode that has been changed on underlying
layer.

Remove those WARN_ON() to prevent the stress test from failing.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:10 +02:00
Miklos Szeredi
4280f74a57 ovl: Kconfig documentation fixes
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:10 +02:00
Darrick J. Wong
829bc787c1 fs: clear writeback errors in inode_init_always
In inode_init_always(), we clear the inode mapping flags, which clears
any retained error (AS_EIO, AS_ENOSPC) bits.  Unfortunately, we do not
also clear wb_err, which means that old mapping errors can leak through
to new inodes.

This is crucial for the XFS inode allocation path because we recycle old
in-core inodes and we do not want error state from an old file to leak
into the new file.  This bug was discovered by running generic/036 and
generic/047 in a loop and noticing that the EIOs generated by the
collision of direct and buffered writes in generic/036 would survive the
remount between 036 and 047, and get reported to the fsyncs (on
different files!) in generic/047.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-30 19:43:53 -07:00
Steve French
28d59363ae smb3: add tracepoints for smb2/smb3 open
add two tracepoints for open completion. One for error one for completion (open_done).
Sample output below

            TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
               | |       |   ||||       |         |
            bash-15348 [007] .... 42441.027492: smb3_enter: 	cifs_lookup: xid=45
            bash-15348 [007] .... 42441.028214: smb3_cmd_err: 	sid=0x6173e4ce tid=0xa05150e6 cmd=5 mid=105 status=0xc0000034 rc=-2
            bash-15348 [007] .... 42441.028219: smb3_open_err: xid=45 sid=0x6173e4ce tid=0xa05150e6 cr_opts=0x0 des_access=0x80 rc=-2
            bash-15348 [007] .... 42441.028225: smb3_exit_done: 	cifs_lookup: xid=45
          fop777-24560 [002] .... 42442.627617: smb3_enter: 	cifs_revalidate_dentry_attr: xid=46
          fop777-24560 [003] .... 42442.628301: smb3_cmd_err: 	sid=0x6173e4ce tid=0xa05150e6 cmd=5 mid=106 status=0xc0000034 rc=-2
          fop777-24560 [003] .... 42442.628319: smb3_open_err: xid=46 sid=0x6173e4ce tid=0xa05150e6 cr_opts=0x0 des_access=0x80 rc=-2
          fop777-24560 [003] .... 42442.628335: smb3_enter: 	cifs_atomic_open: xid=47
          fop777-24560 [003] .... 42442.629587: smb3_cmd_done: 	sid=0x6173e4ce tid=0xa05150e6 cmd=5 mid=107
          fop777-24560 [003] .... 42442.629592: smb3_open_done: xid=47 sid=0x6173e4ce tid=0xa05150e6 fid=0xb8a0984d cr_opts=0x40 des_access=0x40000080

Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 21:42:34 -05:00
Steve French
5c5a41be89 cifs: add debug output to show nocase mount option
For smb1 nocase can be specified on mount.  Allow displaying it
in debug data.

Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 17:59:49 -05:00
Steve French
fe048402e8 smb3: add define for id for posix create context and corresponding struct
Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 17:59:46 -05:00
Ronnie Sahlberg
98170fb535 cifs: update smb2_check_message to handle PDUs without a 4 byte length header
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-05-30 17:24:14 -05:00
Kent Overstreet
e292d7bc63 xfs: convert to bioset_init()/mempool_init()
Convert XFS to embedded bio sets.

Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Kent Overstreet
8ac9f7c1fd btrfs: convert to bioset_init()/mempool_init()
Convert btrfs to embedded bio sets.

Acked-by: Chris Mason <clm@fb.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Kent Overstreet
52190f8abe fs: convert block_dev.c to bioset_init()
Convert block DIO code to embedded bio sets.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Steve French
b326614ea2 smb3: allow "posix" mount option to enable new SMB311 protocol extensions
If "posix" (or synonym "unix" for backward compatibility) specified on mount,
and server advertises support for SMB3.11 POSIX negotiate context, then
enable the new posix extensions on the tcon.  This can be viewed by
looking for "posix" in the mount options displayed by /proc/mounts
for that mount (ie if posix extensions allowed by server and the
experimental POSIX extensions also requested on the mount by specifying
"posix" at mount time).

Also add check to warn user if conflicting unix/nounix or posix/noposix specified
on mount.

Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 16:06:18 -05:00
Steve French
fcef0db6d6 smb3: add support for posix negotiate context
Unlike CIFS where UNIX/POSIX extensions had been negotiatable,
SMB3 did not have POSIX extensions yet.  Add the new SMB3.11
POSIX negotiate context to ask the server whether it can
support POSIX (and thus whether we can send the new POSIX open
context).

Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 16:06:18 -05:00
Steve French
f92a720ee9 cifs: allow disabling less secure legacy dialects
To improve security it may be helpful to have additional ways to restrict the
ability to override the default dialects (SMB2.1, SMB3 and SMB3.02) on mount
with old dialects (CIFS/SMB1 and SMB2) since vers=1.0 (CIFS/SMB1) and vers=2.0
are weaker and less secure.

Add a module parameter "disable_legacy_dialects"
(/sys/module/cifs/parameters/disable_legacy_dialects) which can be set to
1 (or equivalently Y) to forbid use of vers=1.0 or vers=2.0 on mount.

Also cleans up a few build warnings about globals for various module parms.

Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 16:06:18 -05:00
Steve French
11911b956f cifs: make minor clarifications to module params for cifs.ko
Note which ones of the module params are cifs dialect only
(N/A for default dialect now that has moved to SMB2.1 or later)

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-05-30 16:06:18 -05:00
Ronnie Sahlberg
6539e7f372 cifs: show the "w" bit for writeable /proc/fs/cifs/* files
RHBZ: 1539612

Lets show the "w" bit for those files have a .write interface to set/enable/...
the feature.

Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-05-30 16:06:18 -05:00
Steve French
49218b4f57 smb3: add module alias for smb3 to cifs.ko
We really don't want to be encouraging people to use the old
(less secure) cifs dialect (SMB1) and it can be confusing for them
with SMB3 (or later) being recommended but the module name is cifs.

Add a module alias for "smb3" to cifs.ko to make this less confusing.

Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 16:06:18 -05:00
Ronnie Sahlberg
25ad1cbd02 cifs: return error on invalid value written to cifsFYI
RHBZ: 1539617

Check that, if it is not a boolean, the value the user tries
to write to /proc/fs/cifs/cifsFYI is valid and return an error
if not.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reported-by: Xiaoli Feng <xifeng@redhat.com>
2018-05-30 16:06:18 -05:00
Ronnie Sahlberg
57c55cd7c7 cifs: invalidate cache when we truncate a file
RHBZ: 1566345

When truncating a file we always do this synchronously to the server.
Thus we need to make sure that the cached inode metadata is
marked as stale so that on next getattr we will refresh the metadata.
In this particular bug we want to ensure that both ctime and mtime
are updated and become visible to the application after a truncate.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reported-by: Xiaoli Feng <xifeng@redhat.com>
2018-05-30 16:06:18 -05:00
Steve French
e0386e449a smb3: print tree id in debugdata in proc to be able to help logging
When loooking at the logs for the new trace-cmd tracepoints for cifs,
it would help to know which tid is for which share (UNC name) so
update /proc/fs/cifs/DebugData to display the tid.
Also display Maximal Access which was missing as well.

Now the entry for typical entry for a tcon (in proc/fs/cifs/) looks
like:

1) \\localhost\test Mounts: 1 DevInfo: 0x20 Attributes: 0x1006f
	PathComponentMax: 255 Status: 1 type: DISK
	Share Capabilities: None Aligned, Partition Aligned,	Share Flags: 0x0
	tid: 0xe0632a55	Optimal sector size: 0x200	Maximal Access: 0x1f01ff

Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 16:06:18 -05:00
Steve French
d683bcd3e5 smb3: add additional ftrace entry points for entry/exit to cifs.ko
Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 16:06:18 -05:00
Steve French
cfe8909164 smb3: fix various xid leaks
Fix a few cases where we were not freeing the xid which led to
active requests being non-zero at unmount time.

Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-05-30 16:06:18 -05:00
Long Li
57a929a66f CIFS: Introduce offset for the 1st page in data transfer structures
When direct I/O is used, the data buffer may not always align to page
boundaries. Introduce a page offset in transport data structures to
describe the location of the buffer within the page.

Also change the function to pass the page offset when sending data to
transport.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-05-30 16:06:12 -05:00
Omar Sandoval
ad7e1a740d Btrfs: clean up error handling in btrfs_truncate()
btrfs_truncate() uses two variables for error handling, ret and err (if
this sounds familiar, it's because btrfs_truncate_inode_items() did
something similar). This is error prone, as was made evident by "Btrfs:
fix error handling in btrfs_truncate()". We only have err because we
don't want to mask an error if we call btrfs_update_inode() and
btrfs_end_transaction(), so let's make that its own scoped return
variable and use ret everywhere else.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 21:27:32 +02:00
Nikolay Borisov
c5794e5178 btrfs: Factor out write portion of btrfs_get_blocks_direct
Now that the read side is extracted into its own function, do the same
to the write side. This leaves btrfs_get_blocks_direct_write with the
sole purpose of handling common locking required. Also flip the
condition in btrfs_get_blocks_direct_write so that the write case
comes first and we check for if (Create) rather than if (!create). This
is purely subjective but I believe makes reading a bit more "linear".
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 19:01:44 +02:00
Nikolay Borisov
1c8d0175df btrfs: Factor out read portion of btrfs_get_blocks_direct
Currently this function handles both the READ and WRITE dio cases. This
is facilitated by a bunch of 'if' statements, a goto short-circuit
statement and a very perverse aliasing of "!created"(READ) case
by setting lockstart = lockend and checking for lockstart < lockend for
detecting the write. Let's simplify this mess by extracting the
READ-only code into a separate __btrfs_get_block_direct_read function.
This is only the first step, the next one will be to factor out the
write side as well. The end goal will be to have the common locking/
unlocking code in btrfs_get_blocks_direct and then it will call either
the read|write subvariants. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 19:01:43 +02:00
Su Yue
9132c4ff6f btrfs: return ENOMEM if path allocation fails in btrfs_cross_ref_exist
The error code does not match the reason of failure and may confuse the
callers.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 17:33:58 +02:00
Kees Cook
1389053e1b btrfs: raid56: Remove VLA usage
In the quest to remove all stack VLA usage from the kernel[1], this
allocates the working buffers during regular init, instead of using stack
space. This refactors the allocation code a bit to make it easier
to review.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 17:15:43 +02:00
Darrick J. Wong
d25522f10c xfs: repair superblocks
If one of the backup superblocks is found to differ seriously from
superblock 0, write out a fresh copy from the in-core sb.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:15 -07:00
Darrick J. Wong
7e85bc6c87 xfs: add helpers to attach quotas to inodes
Add a helper routine to attach quota information to inodes that are
about to undergo repair.  If that fails, we need to schedule a
quotacheck for the next mount but allow the corrupted metadata repair to
continue.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:15 -07:00
Darrick J. Wong
04a2b7b254 xfs: recover AG btree roots from rmap data
Add a helper function to help us recover btree roots from the rmap data.
Callers pass in a list of rmap owner codes, buffer ops, and magic
numbers.  We iterate the rmap records looking for owner matches, and
then read the matching blocks to see if the magic number & uuid match.
If so, we then read-verify the block, and if that passes then we retain
a pointer to the block with the highest level, assuming that by the end
of the call we will have found the root.  This will be used to reset the
AGF/AGI btree root fields during their rebuild procedures.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
12c6510e2f xfs: add helpers to dispose of old btree blocks after a repair
Now that we've plumbed in the ability to construct a list of dead btree
blocks following a repair, add more helpers to dispose of them.  This is
done by examining the rmapbt -- if the btree was the only owner we can
free the block, otherwise it's crosslinked and we can only remove the
rmapbt record.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
64a39d876e xfs: add helpers to collect and sift btree block pointers during repair
Add some helpers to assemble a list of fs block extents.  Generally,
repair functions will iterate the rmapbt to make a list (1) of all
extents owned by the nominal owner of the metadata structure; then they
will iterate all other structures with the same rmap owner to make a
list (2) of active blocks; and finally we have a subtraction function to
subtract all the blocks in (2) from (1), with the result that (1) is now
a list of blocks that were owned by the old btree and must be disposed.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
73d6b42aa4 xfs: add helpers to allocate and initialize fresh btree roots
Add a pair of helper functions to allocate and initialize fresh btree
roots.  The repair functions will use these as part of recreating
corrupted metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
0a9633fa2f xfs: add helpers to deal with transaction allocation and rolling
For repairs, we need to reserve at least as many blocks as we think
we're going to need to rebuild the data structure, and we're going to
need some helpers to roll transactions while maintaining locks on the AG
headers so that other threads cannot wander into the middle of a repair.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
51863d7dd7 xfs: grab the per-ag structure whenever relevant
Grab and hold the per-AG data across a scrub run whenever relevant.
This helps us avoid repeated trips through rcu and the radix tree
in the repair code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:14 -07:00
Su Yue
090a127afa btrfs: return error value if create_io_em failed in cow_file_range
In cow_file_range(), create_io_em() may fail, but its return value is
not recorded.  Then return value may be 0 even it failed which is a
wrong behavior.

Let cow_file_range() return PTR_ERR(em) if create_io_em() failed.

Fixes: 6f9994dbab ("Btrfs: create a helper to create em for IO")
CC: stable@vger.kernel.org # 4.11+
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:51:08 +02:00
Gu JinXiang
6b0cb1f901 btrfs: drop useless member qgroup_reserved of btrfs_pending_snapshot
Since there is no more use of qgroup_reserved member in struct
btrfs_pending_snapshot, remove it.

Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:54 +02:00
Gu JinXiang
c4c129db5d btrfs: drop unused parameter qgroup_reserved
Since commit 7775c8184e ("btrfs: remove unused parameter from
btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used
by caller of function btrfs_subvolume_reserve_metadata.  So remove it.

Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
Ethan Lien
e73e81b6d0 btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
[Problem description and how we fix it]
We should balance dirty metadata pages at the end of
btrfs_finish_ordered_io, since a small, unmergeable random write can
potentially produce dirty metadata which is multiple times larger than
the data itself. For example, a small, unmergeable 4KiB write may
produce:

    16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree

Although we do call balance dirty pages in write side, but in the
buffered write path, most metadata are dirtied only after we reach the
dirty background limit (which by far only counts dirty data pages) and
wakeup the flusher thread. If there are many small, unmergeable random
writes spread in a large btree, we'll find a burst of dirty pages
exceeds the dirty_bytes limit after we wakeup the flusher thread - which
is not what we expect. In our machine, it caused out-of-memory problem
since a page cannot be dropped if it is marked dirty.

Someone may worry about we may sleep in btrfs_btree_balance_dirty_nodelay,
but since we do btrfs_finish_ordered_io in a separate worker, it will not
stop the flusher consuming dirty pages. Also, we use different worker for
metadata writeback endio, sleep in btrfs_finish_ordered_io help us throttle
the size of dirty metadata pages.

[Reproduce steps]
To reproduce the problem, we need to do 4KiB write randomly spread in a
large btree. In our 2GiB RAM machine:

1) Create 4 subvolumes.
2) Run fio on each subvolume:

   [global]
   direct=0
   rw=randwrite
   ioengine=libaio
   bs=4k
   iodepth=16
   numjobs=1
   group_reporting
   size=128G
   runtime=1800
   norandommap
   time_based
   randrepeat=0

3) Take snapshot on each subvolume and repeat fio on existing files.
4) Repeat step (3) until we get large btrees.
   In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
   metadata in each subvolume tree and 12GiB of metadata in extent tree.
5) Stop all fio, take snapshot again, and wait until all delayed work is
   completed.
6) Start all fio. Few seconds later we hit OOM when the flusher starts
   to work.

It can be reproduced even when using nocow write.

Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
Ethan Lien
78d4295b1e btrfs: lift some btrfs_cross_ref_exist checks in nocow path
In nocow path, we check if the extent is snapshotted in
btrfs_cross_ref_exist(). We can do the similar check earlier and avoid
unnecessary search into extent tree.

A fio test on a Intel D-1531, 16GB RAM, SSD RAID-5 machine as follows:

[global]
group_reporting
time_based
thread=1
ioengine=libaio
bs=4k
iodepth=32
size=64G
runtime=180
numjobs=8
rw=randwrite

[file1]
filename=/mnt/nocow/testfile

IOPS result:   unpatched     patched

1 fio round:     46670        46958
snapshot
2 fio round:     51826        54498
3 fio round:     59767        61289

After snapshot, the first fio get about 5% performance gain. As we
continually write to the same file, all writes will resume to nocow mode
and eventually we have no performance gain.

Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
Lu Fengqi
d19577912d btrfs: Remove fs_info argument from btrfs_uuid_tree_rem
This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ rename the function ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
Lu Fengqi
cdb345a877 btrfs: Remove fs_info argument from btrfs_uuid_tree_add
This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:52 +02:00
Liu Bo
f9ddfd0592 Btrfs: remove unused check of skip_locking
The check is superfluous since all callers who set search_for_commit
also have skip_locking set.

ASSERT() is put in place to ensure skip_locking is set by new callers.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:52 +02:00
Liu Bo
d80bb3f905 Btrfs: remove always true check in unlock_up
As unlock_up() is written as

for () {
   if (!path->locks[i])
       break;
   ...
   if (... && path->locks[i]) {
   }
}

Apparently, @path->locks[i] is always true at this 'if'.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:51 +02:00
Liu Bo
662c653bfd Btrfs: grab write lock directly if write_lock_level is the max level
Typically, when acquiring root node's lock, btrfs tries its best to get
read lock and trade for write lock if @write_lock_level implies to do so.

In case of (cow && (p->keep_locks || p->lowest_level)), write_lock_level
is set to BTRFS_MAX_LEVEL, which means we need to acquire root node's
write lock directly.

In this particular case, the dance of acquiring read lock and then trading
for write lock can be saved.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:51 +02:00
Liu Bo
1fc28d8e2e Btrfs: move get root out of btrfs_search_slot to a helper
It's good to have a helper instead of having all get-root details
open-coded.  The new helper locks (if necessary) and sets root node of
the path.

Also invert the checks to make the code flow easier to read.  There is
no functional change in this commit.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:44 +02:00
Liu Bo
e6a1d6fd27 Btrfs: use more straightforward extent_buffer_uptodate check
If parent_transid "0" is passed to btrfs_buffer_uptodate(),
btrfs_buffer_uptodate() is equivalent to extent_buffer_uptodate(), but
extent_buffer_uptodate() is preferred since we don't have to look into
verify_parent_transid().

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:44 +02:00
Liu Bo
ca19b4a699 Btrfs: remove superfluous free_extent_buffer in read_block_for_search
read_block_for_search() can be simplified as:

tmp = find_extent_buffer();
if (tmp)
   return;

...

free_extent_buffer();
read_tree_block();

Apparently, @tmp must be NULL at this point, free_extent_buffer() is not
needed.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:44 +02:00
Lu Fengqi
4ca6168327 btrfs: drop unused space_info parameter from create_space_info
Since commit dc2d3005d2 ("btrfs: remove dead create_space_info
calls"), there is only one caller btrfs_init_space_info. However, it
doesn't need create_space_info to return space_info at all.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:43 +02:00
Liu Bo
ff76a864cc Btrfs: add parent_transid parameter to veirfy_level_key
As verify_level_key() is checked after verify_parent_transid(), i.e.

if (verify_parent_transid())
   ret = -EIO;
else if (verify_level_key())
   ret = -EUCLEAN;

if parent_transid is 0, verify_parent_transid() skips verifying
parent_transid and considers eb as valid, and if verify_level_key()
reports something wrong, we're not going to know if it's caused by
corrupted metadata or non-checkecd eb (e.g. stale eb).

The stale eb can be from an outdated raid1 mirror after a degraded
mount, see eg "btrfs: fix reading stale metadata blocks after degraded
raid1 mounts" (02a3307aa9) for more details.

@parent_transid is able to tell whether the eb's generation has been
verified by the caller.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:43 +02:00
Qu Wenruo
9593bf4967 btrfs: qgroup: show more meaningful qgroup_rescan_init error message
Error message from qgroup_rescan_init() mostly looks like:

  BTRFS info (device nvme0n1p1): qgroup_rescan_init failed with -115

Which is far from meaningful, and sometimes confusing as for above
-EINPROGRESS it's mostly (despite the init race) harmless, but sometimes
it can also indicate problem if the return value is -EINVAL.

Change it to some more meaningful messages like:

  BTRFS info (device nvme0n1p1): qgroup rescan is already in progress

And

  BTRFS err(device nvme0n1p1): qgroup rescan init failed, qgroup is not enabled

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update the messages and level ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:43 +02:00
Omar Sandoval
fd4e994bd1 Btrfs: fix memory and mount leak in btrfs_ioctl_rm_dev_v2()
If we have invalid flags set, when we error out we must drop our writer
counter and free the buffer we allocated for the arguments. This bug is
trivially reproduced with the following program on 4.7+:

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <sys/stat.h>
	#include <sys/types.h>
	#include <linux/btrfs.h>
	#include <linux/btrfs_tree.h>

	int main(int argc, char **argv)
	{
		struct btrfs_ioctl_vol_args_v2 vol_args = {
			.flags = UINT64_MAX,
		};
		int ret;
		int fd;

		if (argc != 2) {
			fprintf(stderr, "usage: %s PATH\n", argv[0]);
			return EXIT_FAILURE;
		}

		fd = open(argv[1], O_WRONLY);
		if (fd == -1) {
			perror("open");
			return EXIT_FAILURE;
		}

		ret = ioctl(fd, BTRFS_IOC_RM_DEV_V2, &vol_args);
		if (ret == -1)
			perror("ioctl");

		close(fd);
		return EXIT_SUCCESS;
	}

When unmounting the filesystem, we'll hit the
WARN_ON(mnt_get_writers(mnt)) in cleanup_mnt() and also may prevent the
filesystem to be remounted read-only as the writer count will stay
lifted.

Fixes: 6b526ed70c ("btrfs: introduce device delete by devid")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:43 +02:00
Qu Wenruo
de885e3ee2 btrfs: lzo: Harden inline lzo compressed extent decompression
For inlined extent, we only have one segment, thus less things to check.
And further more, inlined extent always has the csum in its leaf header,
it's less probable to have corrupted data.

Anyway, still check header and segment header.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:43 +02:00
Qu Wenruo
314bfa473b btrfs: lzo: Add header length check to avoid potential out-of-bounds access
James Harvey reported that some corrupted compressed extent data can
lead to various kernel memory corruption.

Such corrupted extent data belongs to inode with NODATASUM flags, thus
data csum won't help us detecting such bug.

If lucky enough, KASAN could catch it like:

BUG: KASAN: slab-out-of-bounds in lzo_decompress_bio+0x384/0x7a0 [btrfs]
Write of size 4096 at addr ffff8800606cb0f8 by task kworker/u16:0/2338

CPU: 3 PID: 2338 Comm: kworker/u16:0 Tainted: G           O      4.17.0-rc5-custom+ #50
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
Call Trace:
 dump_stack+0xc2/0x16b
 print_address_description+0x6a/0x270
 kasan_report+0x260/0x380
 memcpy+0x34/0x50
 lzo_decompress_bio+0x384/0x7a0 [btrfs]
 end_compressed_bio_read+0x99f/0x10b0 [btrfs]
 bio_endio+0x32e/0x640
 normal_work_helper+0x15a/0xea0 [btrfs]
 process_one_work+0x7e3/0x1470
 worker_thread+0x1b0/0x1170
 kthread+0x2db/0x390
 ret_from_fork+0x22/0x40
...

The offending compressed data has the following info:

Header:			length 32768		(looks completely valid)
Segment 0 Header:	length 3472882419	(obviously out of bounds)

Then when handling segment 0, since it's over the current page, we need
the copy the compressed data to temporary buffer in workspace, then such
large size would trigger out-of-bounds memory access, screwing up the
whole kernel.

Fix it by adding extra checks on header and segment headers to ensure we
won't access out-of-bounds, and even checks the decompressed data won't
be out-of-bounds.

Reported-by: James Harvey <jamespharvey20@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:38 +02:00
Al Viro
1da92779e2 aio: sanitize the limit checking in io_submit(2)
as it is, the logics in native io_submit(2) is "if asked for
more than LONG_MAX/sizeof(pointer) iocbs to submit, don't
bother with more than LONG_MAX/sizeof(pointer)" (i.e.
512M requests on 32bit and 1E requests on 64bit) while
compat io_submit(2) goes with "stop after the first
PAGE_SIZE/sizeof(pointer) iocbs", i.e. 1K or so.  Which is
	* inconsistent
	* *way* too much in native case
	* possibly too little in compat one
and
	* wrong anyway, since the natural point where we
ought to stop bothering is ctx->nr_events

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-29 23:20:17 -04:00
Al Viro
67ba049f94 aio: fold do_io_submit() into callers
get rid of insane "copy array of 32bit pointers into an array of
native ones" glue.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-29 23:19:29 -04:00
Al Viro
95af8496ac aio: shift copyin of iocb into io_submit_one()
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-29 23:18:31 -04:00
Al Viro
d2988bd412 aio_read_events_ring(): make a bit more readable
The logics for 'avail' is
	* not past the tail of cyclic buffer
	* no more than asked
	* not past the end of buffer
	* not past the end of a page

Unobfuscate the last part.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-29 23:18:17 -04:00
Al Viro
9061d14a8a aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way
... so just make them return 0 when caller does not need to destroy iocb

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-29 23:17:40 -04:00
Al Viro
3c96c7f4ca aio: take list removal to (some) callers of aio_complete()
We really want iocb out of io_cancel(2) reach before we start tearing
it down.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-29 23:16:43 -04:00
Linus Torvalds
91fc957a61 Merge tag 'afs-fixes-20180529' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:

 - fix a BUG triggerable from faccessat()

 - fix the mounting of backup volumes

* tag 'afs-fixes-20180529' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix mounting of backup volumes
  afs: Fix directory permissions check
2018-05-29 15:30:16 -05:00
Souptick Joarder
05edd888d1 fs: xfs: Change return type to vm_fault_t
Use new return type vm_fault_t for fault handlers.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-29 10:46:03 -07:00
Darrick J. Wong
2e050e648a xfs: fix inobt magic number check
In commit a6a781a58b ("xfs: have buffer verifier functions
report failing address") the bad magic number return was ported
incorrectly.

Fixes: a6a781a58b
Reported-by: syzbot+08ab33be0178b76851c8@syzkaller.appspotmail.com
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-05-29 10:46:03 -07:00
Darrick J. Wong
aee9a4a555 fs: clear writeback errors in inode_init_always
In inode_init_always(), we clear the inode mapping flags, which clears
any retained error (AS_EIO, AS_ENOSPC) bits.  Unfortunately, we do not
also clear wb_err, which means that old mapping errors can leak through
to new inodes.

This is crucial for the XFS inode allocation path because we recycle old
in-core inodes and we do not want error state from an old file to leak
into the new file.  This bug was discovered by running generic/036 and
generic/047 in a loop and noticing that the EIOs generated by the
collision of direct and buffered writes in generic/036 would survive the
remount between 036 and 047, and get reported to the fsyncs (on
different files!) in generic/047.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-29 10:46:03 -07:00
Chengguang Xu
eb91537575 vfs: delete unnecessary assignment in vfs_listxattr
It seems the first error assignment in if branch is redundant.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-29 13:22:41 -04:00
Qu Wenruo
2a1f7c0cbd btrfs: lzo: document the compressed data format
Although it's not that complex, but such comment could still save
several minutes for newer reader/reviewer instead of inferring that from
the code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor wording updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:13:00 +02:00
Qu Wenruo
d5c1d68fde btrfs: compression: Add linux/sizes.h for compression.h
Since compression.h is using the SZ_* macros, and if some file includes
only compression.h without linux/sizes.h, it will cause compile error.

One example is lzo.c, if it uses BTRFS_MAX_COMPRESSED.  Fix it by adding
linux/sizes.h in compression.h

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:13:00 +02:00
Omar Sandoval
b5c40d598f Btrfs: fix clone vs chattr NODATASUM race
In btrfs_clone_files(), we must check the NODATASUM flag while the
inodes are locked. Otherwise, it's possible that btrfs_ioctl_setflags()
will change the flags after we check and we can end up with a party
checksummed file.

The race window is only a few instructions in size, between the if and
the locks which is:

3834         if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
3835                 return -EISDIR;

where the setflags must be run and toggle the NODATASUM flag (provided
the file size is 0).  The clone will block on the inode lock, segflags
takes the inode lock, changes flags, releases log and clone continues.

Not impossible but still needs a lot of bad luck to hit unintentionally.

Fixes: 0e7b824c4e ("Btrfs: don't make a file partly checksummed through file clone")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:59 +02:00
Gu Jinxiang
b89311efe6 btrfs: propagate failures of __exclude_logged_extent to upper caller
Function btrfs_exclude_logged_extents may call __exclude_logged_extent
which may fail.
Propagate the failures of __exclude_logged_extent to upper caller.

Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:58 +02:00
Nikolay Borisov
d4b20733d2 btrfs: Streamline shared ref check in alloc_reserved_tree_block
Instead of setting "parent" to ref->parent only when dealing with
a shared ref and subsequently performing another check to see
if (parent > 0), check the "node->type" directly and act accordingly.
This makes the code more streamline. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:57 +02:00
Nikolay Borisov
21ebfbe7e0 btrfs: Pass btrfs_delayed_extent_op to alloc_reserved_tree_block
Instead of taking only specific member of this structure, which results
in 2 extra arguments, just take the delayed_extent_op struct and
reference the arguments inside the functions. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:57 +02:00
Nikolay Borisov
4e6bd4e0aa btrfs: Simplify alloc_reserved_tree_block interface
This function currently takes 7 parameters, most of which are proxies
for values from btrfs_delayed_ref_node struct which is not passed. This
patch simplifies the interface of the function by simply passing said
delayed ref node struct to the function. This enables us to:

1. Move locals variables and init code related to them from
   run_delayed_tree_ref which should only be used inside
   alloc_reserved_tree_block, such as skinny_metadata and the btrfs_key,
   representing the extent being inserted. This removes the need for the
   "ins" argument. Instead, it's replaced by a local var with a more
   verbose name - extent_key.

2. Now that we have a reference to the node in alloc_reserved_tree_block
   the delayed_tree_ref struct can be referenced inside the function and
   this enable removing the "ref->level", "parent" and "ref_root"
   arguments.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:53 +02:00
Nikolay Borisov
9dcdbe0144 btrfs: Remove fs_info argument from alloc_reserved_tree_block
This function already takes a transaction handle which contains a
reference to the fs_info. So use this and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:53 +02:00
David Sterba
315b76b462 btrfs: tests: drop newline from test_msg strings
Now that test_err strings do not need the newline, remove them also from
the test_msg.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:52 +02:00
David Sterba
3c7251f2f8 btrfs: tests: add helper for error messages and update them
The test failures are not clearly visible in the system log as they're
printed at INFO level. Add a new helper that is level ERROR. As this
touches almost all strings, I took the opportunity to unify them:

- decapitalize the first letter as there's a prefix and the text
  continues after ":"
- glue strings split to more lines and un-indent so they fit to 80
  columns
- use %llu instead of %Lu
- drop \n from the modified messages (test_msg is left untouched)

Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:51 +02:00
Gang He
da3627c30d dlm: remove O_NONBLOCK flag in sctp_connect_to_sock
We should remove O_NONBLOCK flag when calling sock->ops->connect()
in sctp_connect_to_sock() function.
Why?
1. up to now, sctp socket connect() function ignores the flag argument,
that means O_NONBLOCK flag does not take effect, then we should remove
it to avoid the confusion (but is not urgent).
2. for the future, there will be a patch to fix this problem, then the flag
argument will take effect, the patch has been queued at https://git.kernel.o
rg/pub/scm/linux/kernel/git/davem/net.git/commit/net/sctp?id=644fbdeacf1d3ed
d366e44b8ba214de9d1dd66a9.
But, the O_NONBLOCK flag will make sock->ops->connect() directly return
without any wait time, then the connection will not be established, DLM kernel
module will call sock->ops->connect() again and again, the bad results are,
CPU usage is almost 100%, even trigger soft_lockup problem if the related
configurations are enabled,
DLM kernel module also prints lots of messages like,
[Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592
[Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592
[Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592
[Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592
The upper application (e.g. ocfs2 mount command) is hanged at new_lockspace(),
the whole backtrace is as below,
tb0307-nd2:~ # cat /proc/2935/stack
[<0>] new_lockspace+0x957/0xac0 [dlm]
[<0>] dlm_new_lockspace+0xae/0x140 [dlm]
[<0>] user_cluster_connect+0xc3/0x3a0 [ocfs2_stack_user]
[<0>] ocfs2_cluster_connect+0x144/0x220 [ocfs2_stackglue]
[<0>] ocfs2_dlm_init+0x215/0x440 [ocfs2]
[<0>] ocfs2_fill_super+0xcb0/0x1290 [ocfs2]
[<0>] mount_bdev+0x173/0x1b0
[<0>] mount_fs+0x35/0x150
[<0>] vfs_kern_mount.part.23+0x54/0x100
[<0>] do_mount+0x59a/0xc40
[<0>] SyS_mount+0x80/0xd0
[<0>] do_syscall_64+0x76/0x140
[<0>] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[<0>] 0xffffffffffffffff

So, I think we should remove O_NONBLOCK flag here, since DLM kernel module can
not handle non-block sockect in connect() properly.

Signed-off-by: Gang He <ghe@suse.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2018-05-29 10:48:35 -05:00
Christoph Hellwig
5afb78356c block: don't print a message when the device went away
The information about a size change in this case just creates confusion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Christoph Hellwig
4163a03984 block: unexport check_disk_size_change
Only used in block_dev.c and the partitions code, and it should remain
that way..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Christoph Hellwig
ac060cbaa8 aio: add missing break for the IOCB_CMD_FDSYNC case
Looks like this got lost in a merge.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-28 13:40:50 -04:00
Arnd Bergmann
533d1daea8 IB: Revert "remove redundant INFINIBAND kconfig dependencies"
Several subsystems depend on INFINIBAND_ADDR_TRANS, which in turn depends
on INFINIBAND. However, when with CONFIG_INIFIBAND=m, this leads to a
link error when another driver using it is built-in. The
INFINIBAND_ADDR_TRANS dependency is insufficient here as this is
a 'bool' symbol that does not force anything to be a module in turn.

fs/cifs/smbdirect.o: In function `smbd_disconnect_rdma_work':
smbdirect.c:(.text+0x1e4): undefined reference to `rdma_disconnect'
net/9p/trans_rdma.o: In function `rdma_request':
trans_rdma.c:(.text+0x7bc): undefined reference to `rdma_disconnect'
net/9p/trans_rdma.o: In function `rdma_destroy_trans':
trans_rdma.c:(.text+0x830): undefined reference to `ib_destroy_qp'
trans_rdma.c:(.text+0x858): undefined reference to `ib_dealloc_pd'

Fixes: 9533b292a7 ("IB: remove redundant INFINIBAND kconfig dependencies")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-28 10:40:16 -06:00
Misono Tomohiro
ad1e3d5672 btrfs: use error code returned by btrfs_read_fs_root_no_name in search ioctl
btrfs_read_fs_root_no_name() may return ERR_PTR(-ENOENT) or
ERR_PTR(-ENOMEM) and therefore search_ioctl() and
btrfs_search_path_in_tree() should use PTR_ERR() instead of -ENOENT,
which all other callers of btrfs_read_fs_root_no_name() do.

Drop the error message as it would be confusing, the caller of ioctl
will likely interpret the error code and not look into the syslog.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:24:14 +02:00
Omar Sandoval
37becec95a Btrfs: allow empty subvol= again
I got a report that after upgrading to 4.16, someone's filesystems
weren't mounting:

[   23.845852] BTRFS info (device loop0): unrecognized mount option 'subvol='

Before 4.16, this mounted the default subvolume. It turns out that this
empty "subvol=" is actually an application bug, but it was causing the
application to fail, so it's an ABI break if you squint.

The generic parsing code we use for mount options (match_token())
doesn't match an empty string as "%s". Previously, setup_root_args()
removed the "subvol=" string, but the mount path was cleaned up to not
need that. Add a dummy Opt_subvol_empty to fix this.

The simple workaround is to use / or . for the value of 'subvol=' .

Fixes: 312c89fbca ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()")
CC: stable@vger.kernel.org # 4.16+
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:24:13 +02:00
Anand Jain
b78e2b78a8 btrfs: fix describe_relocation when printing unknown flags
Looks like the original idea was to print the hex of the flags which is
not coded with their flag name. So use the current buf pointer bp
instead of buf.

Reaching the uknown flags should never happen, it's there just in case.

Fixes: ebce0e01b9 ("btrfs: make block group flags in balance printks human-readable")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:24:11 +02:00
David Sterba
bf5091c8d6 btrfs: use kvzalloc for EXTENT_SAME temporary data
The dedupe range is 16 MiB, with 4 KiB pages and 8 byte pointers, the
arrays can be 32KiB large. To avoid allocation failures due to
fragmented memory, use the allocation with fallback to vmalloc.

The arrays are allocated and freed only inside btrfs_extent_same and
reused for all the ranges.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:24:09 +02:00
Timofey Titovets
67b07bd4be Btrfs: reuse cmp workspace in EXTENT_SAME ioctl
We support big dedup requests by splitting range to smaller parts, and
call dedupe logic on each of them.

Instead of repeated allocation and deallocation, allocate once at the
beginning and reuse in the iteration.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:24:07 +02:00
Timofey Titovets
b672876826 Btrfs: dedupe_file_range ioctl: remove 16MiB restriction
Currently btrfs_dedupe_file_range silently restricts the dedupe range to
to 16MiB to limit locking and working memory size and is documented in
manual page as implementation specific.

Let's remove that restriction by iterating over the dedup range in 16MiB
steps.  This is backward compatible and will not change anything for
requests smaller then 16MiB.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:24:04 +02:00
Timofey Titovets
3973909d92 Btrfs: split btrfs_extent_same
Split btrfs_extent_same() to two parts where one is the main EXTENT_SAME
entry and a helper that can be repeatedly called on a range.  This will
be used in following patches.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:24:03 +02:00
Omar Sandoval
399b0bbf5f Btrfs: reserve space for O_TMPFILE orphan item deletion
btrfs_link() calls btrfs_orphan_del() if it's linking an O_TMPFILE but
it doesn't reserve space to do so. Even before the removal of the
orphan_block_rsv it wasn't using it.

Fixes: ef3b9af50b ("Btrfs: implement inode_operations callback tmpfile")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:24:00 +02:00
Omar Sandoval
7efc3e349c Btrfs: renumber BTRFS_INODE_ runtime flags and switch to enums
We got rid of BTRFS_INODE_HAS_ORPHAN_ITEM and
BTRFS_INODE_ORPHAN_META_RESERVED, so we can renumber the flags to make
them consecutive again.

Signed-off-by: Omar Sandoval <osandov@fb.com>
[ switch them enums so we don't have to do that again ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:59 +02:00
Omar Sandoval
a575ceeb13 Btrfs: get rid of unused orphan infrastructure
Now that we don't keep long-standing reservations for orphan items,
root->orphan_block_rsv isn't used. We can git rid of it, along with:

- root->orphan_lock, which was used to protect root->orphan_block_rsv
- root->orphan_inodes, which was used as a refcount for root->orphan_block_rsv
- BTRFS_INODE_ORPHAN_META_RESERVED, which was used to track reservations
  in root->orphan_block_rsv
- btrfs_orphan_commit_root(), which was the last user of any of these
  and does nothing else

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:57 +02:00
Omar Sandoval
27919067f1 Btrfs: fix ENOSPC caused by orphan items reservations
Currently, we keep space reserved for all inode orphan items until the
inode is evicted (i.e., all references to it are dropped). We hit an
issue where an application would keep a bunch of deleted files open (by
design) and thus keep a large amount of space reserved, causing ENOSPC
errors when other operations tried to reserve space. This long-standing
reservation isn't absolutely necessary for a couple of reasons:

- We can almost always make the reservation we need or steal from the
  global reserve for the orphan item
- If we can't, it's not the end of the world if we drop the orphan item
  on the floor and let the next mount clean it up

So, get rid of persistent reservation and just reserve space in
btrfs_evict_inode().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:54 +02:00
Omar Sandoval
4b9d7b59bf Btrfs: refactor btrfs_evict_inode() reserve refill dance
The truncate loop in btrfs_evict_inode() does two things at once:

- It refills the temporary block reserve, potentially stealing from the
  global reserve or committing
- It calls btrfs_truncate_inode_items()

The tangle of continues hides the fact that these two steps are actually
separate. Split the first step out into a separate function both for
clarity and so that we can reuse it in a later patch.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:52 +02:00
Omar Sandoval
c08db7d8d2 Btrfs: don't return ino to ino cache if inode item removal fails
In btrfs_evict_inode(), if btrfs_truncate_inode_items() fails, the inode
item will still be in the tree but we still return the ino to the ino
cache. That will blow up later when someone tries to allocate that ino,
so don't return it to the cache.

Fixes: 581bb05094 ("Btrfs: Cache free inode numbers in memory")
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:51 +02:00
Omar Sandoval
05a5bd7c4d Btrfs: delete dead code in btrfs_orphan_commit_root()
btrfs_orphan_commit_root() tries to delete an orphan item for a
subvolume in the tree root, but we don't actually insert that item in
the first place. See commit 0a0d4415e3 ("Btrfs: delete dead code in
btrfs_orphan_add()"). We can get rid of it.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:49 +02:00
Omar Sandoval
7b40b695b4 Btrfs: get rid of BTRFS_INODE_HAS_ORPHAN_ITEM
Now that we don't add orphan items for truncate, there can't be races on
adding or deleting an orphan item, so this bit is unnecessary.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:48 +02:00
Omar Sandoval
f7e9e8fc79 Btrfs: stop creating orphan items for truncate
Currently, we insert an orphan item during a truncate so that if there's
a crash, we don't leak extents past the on-disk i_size. However, since
commit 7f4f6e0a3f ("Btrfs: only update disk_i_size as we remove
extents"), we keep disk_i_size in sync with the extent items as we
truncate, so orphan cleanup will never have any extents to remove. Don't
bother with the superfluous orphan item.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:46 +02:00
Omar Sandoval
0552210997 Btrfs: don't BUG_ON() in btrfs_truncate_inode_items()
btrfs_free_extent() can fail because of ENOMEM. There's no reason to
panic here, we can just abort the transaction.

Fixes: f4b9aa8d3b ("btrfs_truncate")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:45 +02:00
Omar Sandoval
fd86a3a315 Btrfs: fix error handling in btrfs_truncate_inode_items()
btrfs_truncate_inode_items() uses two variables for error handling, ret
and err. These are not handled consistently, leading to a couple of
bugs.

- Errors from btrfs_del_items() are handled but not propagated to the
  caller
- If btrfs_run_delayed_refs() fails and aborts the transaction, we
  continue running

Just use ret everywhere and simplify things a bit, fixing both of these
issues.

Fixes: 79787eaab4 ("btrfs: replace many BUG_ONs with proper error handling")
Fixes: 1262133b8d ("Btrfs: account for crcs in delayed ref processing")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:44 +02:00
Omar Sandoval
d1342aadbd Btrfs: update stale comments referencing vmtruncate()
Commit a41ad394a0 ("Btrfs: convert to the new truncate sequence")
changed btrfs_setsize() to call truncate_setsize() instead of
vmtruncate() but didn't update the comment above it. truncate_setsize()
never fails (the IS_SWAPFILE() check happens elsewhere), so remove the
comment.

Additionally, the comment above btrfs_page_mkwrite() references
vmtruncate(), but truncate_setsize() does the size write and page
locking now.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:43 +02:00
Nikolay Borisov
c442793e67 btrfs: Remove stale comment about select_delayed_ref
select_delayed_ref really just gets the next delayed ref which has to
be processed - either an add ref or drop ref. We never go back for
anything. So the comment is actually bogus, just remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:42 +02:00
Misono Tomohiro
f902bd3a5e btrfs: sysfs: Add entry which shows if rmdir can work on subvolumes
Deletion of a subvolume by rmdir(2) has become allowed by the
'commit cd2decf640b1 ("btrfs: Allow rmdir(2) to delete an empty
subvolume")'.

It is a kind of new feature and this commits add a sysfs entry

  /sys/fs/btrfs/features/rmdir_subvol

to indicate the availability of the feature so that a user program
(e.g. fstests) can detect it.

Prior to this commit, all entries in /sys/fs/btrfs/features are feature
which depend on feature bits of superblock (i.e. each feature affects
on-disk format) and managed by attribute_group "btrfs_feature_attr_group".
For each fs, entries in /sys/fs/btrfs/UUID/features indicate which
features are enabled (or can be changed online) for the fs.

However, rmdir_subvol feature only depends on kernel module. Therefore
new attribute_group "btrfs_static_feature_attr_group" is introduced and
sysfs_merge_group() is used to share /sys/fs/btrfs/features directory.
Features in "btrfs_static_feature_attr_group" won't be listed in each
/sys/fs/btrfs/UUID/features.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:41 +02:00
Tomohiro Misono
6c52157fa9 btrfs: sysfs: Use enum/define value for feature array definitions
Use existing named values instead of the raw numbers.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:39 +02:00
Anand Jain
6dac13f8e2 btrfs: add prefix "balance:" for log messages
Kernel logs are very important for the forensic investigations of the
issues in general make it easy to use it. This patch adds 'balance:'
prefix so that it can be easily searched.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:38 +02:00
David Sterba
5c57b8b6a4 btrfs: unify naming of flags variables for SETFLAGS and XFLAGS
* The simple 'flags' refer to the btrfs inode
* ... that's in 'binode
* the FS_*_FL variables are 'fsflags'
* the old copies of the variable are prefixed by 'old_'
* Struct inode flags contain 'i_flags'.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:32 +02:00
David Sterba
025f212148 btrfs: add FS_IOC_FSSETXATTR ioctl
The new ioctl is an extension to the FS_IOC_SETFLAGS and adds new
flags and is extensible. Don't get fooled by the XATTR in the name, it
does not have anything in common with the extended attributes,
incidentally also abbreviated as XATTRs.

This patch allows to set the xflags portion of the fsxattr structure,
other items have no meaning and non-zero values will result in
EOPNOTSUPP.

Currently supported xflags:

- APPEND
- IMMUTABLE
- NOATIME
- NODUMP
- SYNC

The structure of btrfs_ioctl_fssetxattr copies btrfs_ioctl_setflags but
is simpler on the flag setting side.

The original patch was written by Chandan Jay Sharma but was incomplete
and no further revision has been sent.

Based-on-patches-by: Chandan Jay Sharma <chandansbg@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:31 +02:00
David Sterba
e4202ac927 btrfs: add FS_IOC_FSGETXATTR ioctl
The new ioctl is an extension to the FS_IOC_GETFLAGS and adds new
flags and is extensible. This patch allows to return the xflags portion
of the fsxattr structure, other items have no meaning for btrfs or can
be added later.

The original patch was written by Chandan Jay Sharma but was incomplete
and no further revision has been sent. Several cleanups were necessary
to avoid confusion with other ioctls, as we have another flavor of
flags.

Based-on-patches-by: Chandan Jay Sharma <chandansbg@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:29 +02:00
David Sterba
19f93b3cd8 btrfs: add helpers for FS_XFLAG_* conversion
Preparatory work for the FS_IOC_FSGETXATTR ioctl, basic conversions and
checking helpers.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:28 +02:00
David Sterba
a157d4fd81 btrfs: rename btrfs_flags_to_ioctl to reflect which flags it touches
Converts btrfs_inode::flags to the FS_*_FL flags.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:27 +02:00
David Sterba
5ba76abfb2 btrfs: rename check_flags to reflect which flags it touches
The FS_*_FL flags cannot be easily identified by a prefix but we still
need to recognize them so the 'fsflags' should be closer to the naming
scheme but again the 'fs' part sounds like it's a filesystem flag. I
don't have a better idea for now.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:25 +02:00
David Sterba
1905a0f7c7 btrfs: rename btrfs_mask_flags to reflect which flags it touches
The FS_*_FL flags cannot be easily identified by a variable name prefix
but we still need to recognize them so the 'fsflags' should be closer to
the naming scheme but again the 'fs' part sounds like it's a filesystem
flag. I don't have a better idea for now.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:24 +02:00
David Sterba
7b6a221e5b btrfs: rename btrfs_update_iflags to reflect which flags it touches
The btrfs inode flag flavour is now simply called 'inode flags' and the
vfs inode are i_flags.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:20 +02:00
Anand Jain
d9a071f008 btrfs: use common variable for fs_devices in btrfs_destroy_dev_replace_tgtdev
Use a local btrfs_fs_devices variable to access the structure.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:18 +02:00
Anand Jain
ab5c2f65de btrfs: drop uuid_mutex in btrfs_destroy_dev_replace_tgtdev
Delete the uuid_mutex lock here as this thread accesses the
btrfs_fs_devices::devices only (counters or called functions do a list
traversal). And the device_list_mutex lock is already taken.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:17 +02:00
Anand Jain
b25e59e2b2 btrfs: drop uuid_mutex in btrfs_dev_replace_finishing
btrfs_dev_replace_finishing updates devices (soruce and target) which
are within the btrfs_fs_devices::devices or withint the cloned seed
devices (btrfs_fs_devices::seed::devices), so we don't need the global
uuid_mutex.

The device replace context is also locked by its own locks.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:16 +02:00
Anand Jain
542c5908ab btrfs: replace uuid_mutex by device_list_mutex in btrfs_open_devices
btrfs_open_devices() is using the uuid_mutex, but as btrfs_open_devices
is just limited to openning all the devices under for given fsid, so we
don't need uuid_mutex.

Instead it should hold the device_list_mutex as it updates the members
of the btrfs_fs_devices and btrfs_device and not the whole fs_devs list.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:15 +02:00
Anand Jain
3dd0f7a364 btrfs: document uuid_mutex uasge in read_chunk_tree
read_chunk_tree() calls read_one_dev(), but for seed device we have
to search the fs_uuids list, so we need the uuid_mutex. Add a comment
comment, so that we can improve this part.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:14 +02:00
Anand Jain
41a52a0f1b btrfs: use existing cur_devices, cleanup btrfs_rm_device
Instead of de-referencing the device->fs_devices use cur_devices
which points to the same fs_devices and does not change.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:13 +02:00
Anand Jain
b6ed73bcb1 btrfs: reduce uuid_mutex critical section while scanning devices
The generic block device lookup or cleanup does not need the uuid mutex,
that's only for the device_list_add.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:12 +02:00
Nikolay Borisov
20a6800402 btrfs: Unexport and rename btrfs_invalidate_inodes
This function is no longer used outside of inode.c so just make it
static. At the same time give a more becoming name, since it's not
really invalidating the inodes but just calling d_prune_alias. Last,
but not least - move the function above the sole caller to avoid
introducing yet-another-pointless forward declaration.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:10 +02:00
David Sterba
093258e6eb btrfs: replace waitqueue_actvie with cond_wake_up
Use the wrappers and reduce the amount of low-level details about the
waitqueue management.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:09 +02:00