Commit Graph

2126 Commits

Author SHA1 Message Date
Vivek Goyal
15bf32398a security: Return xattr name from security_dentry_init_security()
Right now security_dentry_init_security() only supports single security
label and is used by SELinux only. There are two users of this hook,
namely ceph and nfs.

NFS does not care about xattr name. Ceph hardcodes the xattr name to
security.selinux (XATTR_NAME_SELINUX).

I am making changes to fuse/virtiofs to send security label to virtiofsd
and I need to send xattr name as well. I also hardcoded the name of
xattr to security.selinux.

Stephen Smalley suggested that it probably is a good idea to modify
security_dentry_init_security() to also return name of xattr so that
we can avoid this hardcoding in the callers.

This patch adds a new parameter "const char **xattr_name" to
security_dentry_init_security() and LSM puts the name of xattr
too if caller asked for it (xattr_name != NULL).

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: James Morris <jamorris@linux.microsoft.com>
[PM: fixed typos in the commit description]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-10-20 08:17:08 -04:00
Jeff Layton
1bd85aa65d ceph: fix handling of "meta" errors
Currently, we check the wb_err too early for directories, before all of
the unsafe child requests have been waited on. In order to fix that we
need to check the mapping->wb_err later nearer to the end of ceph_fsync.

We also have an overly-complex method for tracking errors after
blocklisting. The errors recorded in cleanup_session_requests go to a
completely separate field in the inode, but we end up reporting them the
same way we would for any other error (in fsync).

There's no real benefit to tracking these errors in two different
places, since the only reporting mechanism for them is in fsync, and
we'd need to advance them both every time.

Given that, we can just remove i_meta_err, and convert the places that
used it to instead just use mapping->wb_err instead. That also fixes
the original problem by ensuring that we do a check_and_advance of the
wb_err at the end of the fsync op.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/52864
Reported-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-10-19 09:36:06 +02:00
Jeff Layton
98d0a6fb73 ceph: skip existing superblocks that are blocklisted or shut down when mounting
Currently when mounting, we may end up finding an existing superblock
that corresponds to a blocklisted MDS client. This means that the new
mount ends up being unusable.

If we've found an existing superblock with a client that is already
blocklisted, and the client is not configured to recover on its own,
fail the match. Ditto if the superblock has been forcibly unmounted.

While we're in here, also rename "other" to the more conventional "fsc".

Cc: stable@vger.kernel.org
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1901499
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-10-19 09:36:06 +02:00
Dan Carpenter
708c87168b ceph: fix off by one bugs in unsafe_request_wait()
The "> max" tests should be ">= max" to prevent an out of bounds access
on the next lines.

Fixes: e1a4541ec0 ("ceph: flush the mdlog before waiting on unsafe reqs")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-21 17:39:20 +02:00
Jeff Layton
90f7d7a0d0 locks: remove LOCK_MAND flock lock support
As best I can tell, the logic for these has been broken for a long time
(at least before the move to git), such that they never conflict with
anything. Also, nothing checks for these flags and prevented opens or
read/write behavior on the files. They don't seem to do anything.

Given that, we can rip these symbols out of the kernel, and just make
flock(2) return 0 when LOCK_MAND is set in order to preserve existing
behavior.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-09-10 16:21:44 -04:00
Linus Torvalds
8a05abd0c9 We have:
- a set of patches to address fsync stalls caused by depending on
   periodic rather than triggered MDS journal flushes in some cases
   (Xiubo Li)
 
 - a fix for mtime effectively not getting updated in case of competing
   writers (Jeff Layton)
 
 - a couple of fixes for inode reference leaks and various WARNs after
   "umount -f" (Xiubo Li)
 
 - a new ceph.auth_mds extended attribute (Jeff Layton)
 
 - a smattering of fixups and cleanups from Jeff, Xiubo and Colin.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmE46mYTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi9UEB/sGT4eqMzkQLzJ2XjpKUvxXaJNVdPvS
 Jmg26KV5wc9Y9v6L7ww/eQjxbTOnda3G2/XG0xiE8dC1vq54Vux/FKiAT+H2/z/9
 onShFK+SARoF4DilKnY0JNCwcGxQ3FjWAgPqPKqAyTAX2wjVxDKFHB0C+7yhhJay
 wyDrRaaHyFc4TwHeiEi8xU7dB55XsvxWGUgnHbcOLyUbbBKddt98FadNZ2t9b76y
 EVwAxgY0RbUUFxOJ9VVjiaNLUP4532iXUn+fehMjRGmDCmjaLNxCrsq6d0p//LJV
 nhVRG+Mv8IfTjqZwFbnWV8xbGwX0lY+g+hn0cdi7urUH3GDa97vmJF3u
 =z6dR
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.15-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:

 - a set of patches to address fsync stalls caused by depending on
   periodic rather than triggered MDS journal flushes in some cases
   (Xiubo Li)

 - a fix for mtime effectively not getting updated in case of competing
   writers (Jeff Layton)

 - a couple of fixes for inode reference leaks and various WARNs after
   "umount -f" (Xiubo Li)

 - a new ceph.auth_mds extended attribute (Jeff Layton)

 - a smattering of fixups and cleanups from Jeff, Xiubo and Colin.

* tag 'ceph-for-5.15-rc1' of git://github.com/ceph/ceph-client:
  ceph: fix dereference of null pointer cf
  ceph: drop the mdsc_get_session/put_session dout messages
  ceph: lockdep annotations for try_nonblocking_invalidate
  ceph: don't WARN if we're forcibly removing the session caps
  ceph: don't WARN if we're force umounting
  ceph: remove the capsnaps when removing caps
  ceph: request Fw caps before updating the mtime in ceph_write_iter
  ceph: reconnect to the export targets on new mdsmaps
  ceph: print more information when we can't find snaprealm
  ceph: add ceph_change_snap_realm() helper
  ceph: remove redundant initializations from mdsc and session
  ceph: cancel delayed work instead of flushing on mdsc teardown
  ceph: add a new vxattr to return auth mds for an inode
  ceph: remove some defunct forward declarations
  ceph: flush the mdlog before waiting on unsafe reqs
  ceph: flush mdlog before umounting
  ceph: make iterate_sessions a global symbol
  ceph: make ceph_create_session_msg a global symbol
  ceph: fix comment about short copies in ceph_write_end
  ceph: fix memory leak on decode error in ceph_handle_caps
2021-09-08 15:50:32 -07:00
Colin Ian King
05a444d3f9 ceph: fix dereference of null pointer cf
Currently in the case where kmem_cache_alloc fails the null pointer
cf is dereferenced when assigning cf->is_capsnap = false. Fix this
by adding a null pointer check and return path.

Cc: stable@vger.kernel.org
Addresses-Coverity: ("Dereference null return")
Fixes: b2f9fa1f3b ("ceph: correctly handle releasing an embedded cap flush")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-03 10:55:51 +02:00
Jeff Layton
9f3589993c ceph: drop the mdsc_get_session/put_session dout messages
These are very chatty, racy, and not terribly useful. Just remove them.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
3eaf5aa1cf ceph: lockdep annotations for try_nonblocking_invalidate
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Xiubo Li
a76d0a9c28 ceph: don't WARN if we're forcibly removing the session caps
For example in the case of a forced umount, we'll remove all the session
caps even if they are dirty. Move the warning to a wrapper function and
make most of the callers use it. Call the core function when removing
caps due to a forced umount.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Xiubo Li
42ad631b4d ceph: don't WARN if we're force umounting
Force umount will try to close the sessions by setting the session
state to _CLOSING. We don't want to WARN in this situation, since it's
expected.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Xiubo Li
a6d37ccdd2 ceph: remove the capsnaps when removing caps
capsnaps will take inode references via ihold when queueing to flush.
When force unmounting, the client will just close the sessions and
may never get a flush reply, causing a leak and inode ref leak.

Fix this by removing the capsnaps for an inode when removing the caps.

URL: https://tracker.ceph.com/issues/52295
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
b11ed50346 ceph: request Fw caps before updating the mtime in ceph_write_iter
The current code will update the mtime and then try to get caps to
handle the write. If we end up having to request caps from the MDS, then
the mtime in the cap grant will clobber the updated mtime and it'll be
lost.

This is most noticable when two clients are alternately writing to the
same file. Fw caps are continually being granted and revoked, and the
mtime ends up stuck because the updated mtimes are always being
overwritten with the old one.

Fix this by changing the order of operations in ceph_write_iter to get
the caps before updating the times. Also, make sure we check the pool
full conditions before even getting any caps or uninlining.

URL: https://tracker.ceph.com/issues/46574
Reported-by: Jozef Kováč <kovac@firma.zoznam.sk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Xiubo Li
d517b3983d ceph: reconnect to the export targets on new mdsmaps
In the case where the export MDS has crashed just after the EImportStart
journal is flushed, a standby MDS takes over for it and when replaying
the EImportStart journal the MDS will wait the client to reconnect. That
may never happen because the client may not have registered or opened
the sessions yet.

When receiving a new map, ensure we reconnect to valid export targets as
well if their sessions don't exist yet.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
692e171597 ceph: print more information when we can't find snaprealm
Print a bit more information when we can't find the realm during
ceph_add_cap. Show both the inode number and the old realm inode
number.

Suggested-by: Sage Weil <sage@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
0ba92e1c5f ceph: add ceph_change_snap_realm() helper
Consolidate some fiddly code for changing an inode's snap_realm
into a new helper function, and change the callers to use it.

While we're in here, nothing uses the i_snap_realm_counter field, so
remove that from the inode.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
c80dc3aee9 ceph: remove redundant initializations from mdsc and session
The ceph_mds_client and ceph_mds_session structures are kzalloc'ed so
there's no need to explicitly initialize either of their fields to 0.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
b4002173b7 ceph: cancel delayed work instead of flushing on mdsc teardown
The first thing metric_delayed_work does is check mdsc->stopping,
and then return immediately if it's set. That's good since we would
have already torn down the metric structures at this point, otherwise,
but there is no locking around mdsc->stopping.

It's possible that the ceph_metric_destroy call could race with the
delayed_work, in which case we could end up with the delayed_work
accessing destroyed percpu variables.

At this point in the mdsc teardown, the "stopping" flag has already been
set, so there's no benefit to flushing the work. Move the work
cancellation in ceph_metric_destroy ahead of the percpu variable
destruction, and eliminate the flush_delayed_work call in
ceph_mdsc_destroy.

Fixes: 18f473b384 ("ceph: periodically send perf metrics to MDSes")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:17 +02:00
Jeff Layton
40e309de4d ceph: add a new vxattr to return auth mds for an inode
Add a new vxattr that shows what MDS is authoritative for an inode (if
we happen to have auth caps). If we don't have an auth cap for the inode
then just return -1.

URL: https://tracker.ceph.com/issues/1276
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Jeff Layton
49f8899e5e ceph: remove some defunct forward declarations
We missed these in the recent fscache rework.

Reported-by: Steve French <smfrench@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Xiubo Li
e1a4541ec0 ceph: flush the mdlog before waiting on unsafe reqs
For the client requests who will have unsafe and safe replies from
MDS daemons, in the MDS side the MDS daemons won't flush the mdlog
(journal log) immediatelly, because they think it's unnecessary.
That's true for most cases but not all, likes the fsync request.
The fsync will wait until all the unsafe replied requests to be
safely replied.

Normally if there have multiple threads or clients are running, the
whole mdlog in MDS daemons could be flushed in time if any request
will trigger the mdlog submit thread. So usually we won't experience
the normal operations will stuck for a long time. But in case there
has only one client with only thread is running, the stuck phenomenon
maybe obvious and the worst case it must wait at most 5 seconds to
wait the mdlog to be flushed by the MDS's tick thread periodically.

This patch will trigger to flush the mdlog in the relevant and auth
MDSes to which the in-flight requests are sent just before waiting
the unsafe requests to finish.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Xiubo Li
d095559ce4 ceph: flush mdlog before umounting
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Xiubo Li
59b312f362 ceph: make iterate_sessions a global symbol
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Xiubo Li
fba97e8025 ceph: make ceph_create_session_msg a global symbol
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Jeff Layton
ce3a8732ae ceph: fix comment about short copies in ceph_write_end
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Jeff Layton
2ad32cf09b ceph: fix memory leak on decode error in ceph_handle_caps
If we hit a decoding error late in the frame, then we might exit the
function without putting the pool_ns string. Ensure that we always put
that reference on the way out of the function.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02 22:49:16 +02:00
Linus Torvalds
815409a12c overlayfs update for 5.15
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYTDKKAAKCRDh3BK/laaZ
 PG9PAQCUF0fdBlCKudwSEt5PV5xemycL9OCAlYCd7d4XbBIe9wEA6sVJL9J+OwV2
 aF0NomiXtJccE+S9+byjVCyqSzQJGQQ=
 =6L2Y
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs update from Miklos Szeredi:

 - Copy up immutable/append/sync/noatime attributes (Amir Goldstein)

 - Improve performance by enabling RCU lookup.

 - Misc fixes and improvements

The reason this touches so many files is that the ->get_acl() method now
gets a "bool rcu" argument.  The ->get_acl() API was updated based on
comments from Al and Linus:

Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguQxpd6Wgc0Jd3ks77zcsAv_bn0q17L3VNnnmPKu11t8A@mail.gmail.com/

* tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: enable RCU'd ->get_acl()
  vfs: add rcu argument to ->get_acl() callback
  ovl: fix BUG_ON() in may_delete() when called from ovl_cleanup()
  ovl: use kvalloc in xattr copy-up
  ovl: update ctime when changing fileattr
  ovl: skip checking lower file's i_writecount on truncate
  ovl: relax lookup error on mismatch origin ftype
  ovl: do not set overlay.opaque for new directories
  ovl: add ovl_allow_offline_changes() helper
  ovl: disable decoding null uuid with redirect_dir
  ovl: consistent behavior for immutable/append-only inodes
  ovl: copy up sync/noatime fileattr flags
  ovl: pass ovl_fs to ovl_check_setxattr()
  fs: add generic helper for filling statx attribute flags
2021-09-02 09:21:27 -07:00
Linus Torvalds
6f01c935d9 File locking changes for v5.15.
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmEo3WcTHGpsYXl0b25A
 a2VybmVsLm9yZwAKCRAADmhBGVaCFbnuD/9Vj3dnXRgZ9LHFuQHp5Vy5yaGfujwP
 kIMN7bJiL0pCZ1zHapyn0cidZuTScDK2qMzGVR9Wrj/uYHEI+T08IEfaq/2CU3pS
 PdygHgqQi+WtJj4dks1OphVdlF0+46B/zy2YY0LE39zFUDO38VUIgb/gTiQ8JHIr
 BaDGtohlXmK8bDXo2XqMunSp7uQoRX92mv+bPjvIpsM60RnhPzhX0HeO/xhXO1PP
 OpswQG3nOTX1alZOXuSKdH7e3zfO+pLnC5NYeo5R4rys5xRiShU19OSAJfiPwf2w
 06OX7uLT7aF5LPQPjOP4MA9dHPWWeIuYnKzUpQZZ+D+zjuaxhF62saJG1Um0yQns
 qqPZgTWhxzkJ16bj94T2UZW0bufFKM5cEfBPFw9ZyzQshX5jIE2V5ecV+F/AdtTj
 EA55m/5N8NZMsNgnEOqn6o8TkWoV1seHKJXqja0+ZY9/Xs8GkmjHLhjNBCwGZIbw
 8S64Szp+yex3m1mRS+L6T0Gudic5WdFxOKQAzvvJcmT6u+ZGBzU0FjNTNNdnu3tp
 f9mANYxT1vNfYplRuNhMCp9oioSRpO2vfDQLCBRtsCy0QlKCum6qb7E/BcWRlwn1
 SOAFmjeXpr+VXGX+Rjp/bzCdHlNOVbR4ODWJ+h5D7QZaQwQ6FafkEq1D2yjPR0OR
 i722ECoM++vZ4w==
 =CfcC
 -----END PGP SIGNATURE-----

Merge tag 'locks-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking updates from Jeff Layton:
 "This starts with a couple of fixes for potential deadlocks in the
  fowner/fasync handling.

  The next patch removes the old mandatory locking code from the kernel
  altogether.

  The last patch cleans up rw_verify_area a bit more after the mandatory
  locking removal"

* tag 'locks-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fs: clean up after mandatory file locking support removal
  fs: remove mandatory file locking support
  fcntl: fix potential deadlock for &fasync_struct.fa_lock
  fcntl: fix potential deadlocks for &fown_struct.lock
2021-08-30 12:38:13 -07:00
Linus Torvalds
aa99f3c2b9 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmEmTZcACgkQnJ2qBz9k
 QNkkmAgArW6XoF1CePds/ZaC9vfg/nk66/zVo0n+J8xXjMWAPxcKbWFfV0uWVixq
 yk4lcLV47a2Mu/B/1oLNd3vrSmhwU+srWqNwOFn1nv+lP/6wJqr8oztRHn/0L9Q3
 ZSRrukSejbQ6AvTL/WzTNnCjjCc2ne3Kyko6W41aU6uyJuzhSM32wbx7qlV6t54Z
 iint9OrB4gM0avLohNafTUq6I+tEGzBMNwpCG/tqCmkcvDcv3rTDVAnPSCTm0Tx2
 hdrYDcY/rLxo93pDBaW1rYA/fohR+mIVye6k2TjkPAL6T1x+rxeT5qnc+YijH5yF
 sFPDhlD+ZsfOLi8stWXLOJ+8+gLODg==
 =pDBR
 -----END PGP SIGNATURE-----

Merge tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fs hole punching vs cache filling race fixes from Jan Kara:
 "Fix races leading to possible data corruption or stale data exposure
  in multiple filesystems when hole punching races with operations such
  as readahead.

  This is the series I was sending for the last merge window but with
  your objection fixed - now filemap_fault() has been modified to take
  invalidate_lock only when we need to create new page in the page cache
  and / or bring it uptodate"

* tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  filesystems/locking: fix Malformed table warning
  cifs: Fix race between hole punch and page fault
  ceph: Fix race between hole punch and page fault
  fuse: Convert to using invalidate_lock
  f2fs: Convert to using invalidate_lock
  zonefs: Convert to using invalidate_lock
  xfs: Convert double locking of MMAPLOCK to use VFS helpers
  xfs: Convert to use invalidate_lock
  xfs: Refactor xfs_isilocked()
  ext2: Convert to using invalidate_lock
  ext4: Convert to use mapping->invalidate_lock
  mm: Add functions to lock invalidate_lock for two mappings
  mm: Protect operations adding pages to page cache with invalidate_lock
  documentation: Sync file_operations members with reality
  mm: Fix comments mentioning i_mutex
2021-08-30 10:24:50 -07:00
Tuo Li
a9e6ffbc5b ceph: fix possible null-pointer dereference in ceph_mdsmap_decode()
kcalloc() is called to allocate memory for m->m_info, and if it fails,
ceph_mdsmap_destroy() behind the label out_err will be called:
  ceph_mdsmap_destroy(m);

In ceph_mdsmap_destroy(), m->m_info is dereferenced through:
  kfree(m->m_info[i].export_targets);

To fix this possible null-pointer dereference, check m->m_info before the
for loop to free m->m_info[i].export_targets.

[ jlayton: fix up whitespace damage
	   only kfree(m->m_info) if it's non-NULL ]

Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Tuo Li <islituo@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-08-25 16:34:11 +02:00
Xiubo Li
b2f9fa1f3b ceph: correctly handle releasing an embedded cap flush
The ceph_cap_flush structures are usually dynamically allocated, but
the ceph_cap_snap has an embedded one.

When force umounting, the client will try to remove all the session
caps. During this, it will free them, but that should not be done
with the ones embedded in a capsnap.

Fix this by adding a new boolean that indicates that the cap flush is
embedded in a capsnap, and skip freeing it if that's set.

At the same time, switch to using list_del_init() when detaching the
i_list and g_list heads.  It's possible for a forced umount to remove
these objects but then handle_cap_flushsnap_ack() races in and does the
list_del_init() again, corrupting memory.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/52283
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-08-25 16:34:11 +02:00
Jeff Layton
f7e33bdbd6 fs: remove mandatory file locking support
We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
off in Fedora and RHEL8. Several other distros have followed suit.

I've heard of one problem in all that time: Someone migrated from an
older distro that supported "-o mand" to one that didn't, and the host
had a fstab entry with "mand" in it which broke on reboot. They didn't
actually _use_ mandatory locking so they just removed the mount option
and moved on.

This patch rips out mandatory locking support wholesale from the kernel,
along with the Kconfig option and the Documentation file. It also
changes the mount code to ignore the "mand" mount option instead of
erroring out, and to throw a big, ugly warning.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-08-23 06:15:36 -04:00
Miklos Szeredi
0cad624662 vfs: add rcu argument to ->get_acl() callback
Add a rcu argument to the ->get_acl() callback to allow
get_cached_acl_rcu() to call the ->get_acl() method in the next patch.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-18 22:08:24 +02:00
Jeff Layton
8434ffe71c ceph: take snap_empty_lock atomically with snaprealm refcount change
There is a race in ceph_put_snap_realm. The change to the nref and the
spinlock acquisition are not done atomically, so you could decrement
nref, and before you take the spinlock, the nref is incremented again.
At that point, you end up putting it on the empty list when it
shouldn't be there. Eventually __cleanup_empty_realms runs and frees
it when it's still in-use.

Fix this by protecting the 1->0 transition with atomic_dec_and_lock,
and just drop the spinlock if we can get the rwsem.

Because these objects can also undergo a 0->1 refcount transition, we
must protect that change as well with the spinlock. Increment locklessly
unless the value is at 0, in which case we take the spinlock, increment
and then take it off the empty list if it did the 0->1 transition.

With these changes, I'm removing the dout() messages from these
functions, as well as in __put_snap_realm. They've always been racy, and
it's better to not print values that may be misleading.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/46419
Reported-by: Mark Nelson <mnelson@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-08-04 19:20:29 +02:00
Luis Henriques
bf2ba43221 ceph: reduce contention in ceph_check_delayed_caps()
Function ceph_check_delayed_caps() is called from the mdsc->delayed_work
workqueue and it can be kept looping for quite some time if caps keep
being added back to the mdsc->cap_delay_list.  This may result in the
watchdog tainting the kernel with the softlockup flag.

This patch breaks this loop if the caps have been recently (i.e. during
the loop execution).  Any new caps added to the list will be handled in
the next run.

Also, allow schedule_delayed() callers to explicitly set the delay value
instead of defaulting to 5s, so we can ensure that it runs soon
afterward if it looks like there is more work.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/46284
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-08-04 19:20:05 +02:00
Luis Henriques
cdb330f4b4 ceph: don't WARN if we're still opening a session to an MDS
If MDSs aren't available while mounting a filesystem, the session state
will transition from SESSION_OPENING to SESSION_CLOSING.  And in that
scenario check_session_state() will be called from delayed_work() and
trigger this WARN.

Avoid this by only WARNing after a session has already been established
(i.e., the s_ttl will be different from 0).

Fixes: 62575e270f ("ceph: check session state after bumping session->s_seq")
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-07-20 17:57:33 +02:00
Jan Kara
057ba5b245 ceph: Fix race between hole punch and page fault
Ceph has a following race between hole punching and page fault:

CPU1                                  CPU2
ceph_fallocate()
  ...
  ceph_zero_pagecache_range()
                                      ceph_filemap_fault()
                                        faults in page in the range being
                                        punched
  ceph_zero_objects()

And now we have a page in punched range with invalid data. Fix the
problem by using mapping->invalidate_lock similarly to other
filesystems. Note that using invalidate_lock also fixes a similar race
wrt ->readpage().

CC: Jeff Layton <jlayton@kernel.org>
CC: ceph-devel@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-07-13 14:29:01 +02:00
Jeff Layton
4c18347238 ceph: take reference to req->r_parent at point of assignment
Currently, we set the r_parent pointer but then don't take a reference
to it until we submit the request. If we end up freeing the req before
that point, then we'll do a iput when we shouldn't.

Instead, take the inode reference in the callers, so that it's always
safe to call ceph_mdsc_put_request on the req, even before submission.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:52 +02:00
Jeff Layton
23c2c76ead ceph: eliminate ceph_async_iput()
Now that we don't need to hold session->s_mutex or the snap_rwsem when
calling ceph_check_caps, we can eliminate ceph_async_iput and just use
normal iput calls.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:52 +02:00
Jeff Layton
7732fe168e ceph: don't take s_mutex in ceph_flush_snaps
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:52 +02:00
Jeff Layton
0449a35222 ceph: don't take s_mutex in try_flush_caps
The s_mutex doesn't protect anything in this codepath.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:52 +02:00
Jeff Layton
6a92b08fda ceph: don't take s_mutex or snap_rwsem in ceph_check_caps
These locks appear to be completely unnecessary. Almost all of this
function is done under the inode->i_ceph_lock, aside from the actual
sending of the message. Don't take either lock in this function.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:52 +02:00
Jeff Layton
52d60f8e18 ceph: eliminate session->s_gen_ttl_lock
Turn s_cap_gen field into an atomic_t, and just rely on the fact that we
hold the s_mutex when changing the s_cap_ttl field.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:52 +02:00
Jeff Layton
7e65624d32 ceph: allow ceph_put_mds_session to take NULL or ERR_PTR
...to simplify some error paths.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:52 +02:00
Jeff Layton
df2c0cb7f8 ceph: clean up locking annotation for ceph_get_snap_realm and __lookup_snap_realm
They both say that the snap_rwsem must be held for write, but I don't
see any real reason for it, and it's not currently always called that
way.

The lookup is just walking the rbtree, so holding it for read should be
fine there. The "get" is bumping the refcount and (possibly) removing
it from the empty list. I see no need to hold the snap_rwsem for write
for that.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:51 +02:00
Jeff Layton
a6862e6708 ceph: add some lockdep assertions around snaprealm handling
Turn some comments into lockdep asserts.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:51 +02:00
Jeff Layton
f3fd3ea6a2 ceph: decoding error in ceph_update_snap_realm should return -EIO
Currently ceph_update_snap_realm returns -EINVAL when it hits a decoding
error, which is the wrong error code. -EINVAL implies that the user gave
us a bogus argument to a syscall or something similar. -EIO is more
descriptive when we hit a decoding error.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:51 +02:00
Xiubo Li
903f4fec78 ceph: add IO size metrics support
This will collect IO's total size and then calculate the average
size, and also will collect the min/max IO sizes.

The debugfs will show the size metrics in bytes and will let the
userspace applications to switch to what they need.

URL: https://tracker.ceph.com/issues/49913
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:51 +02:00
Xiubo Li
fc123d5f50 ceph: update and rename __update_latency helper to __update_stdev
The new __update_stdev() helper will only compute the standard
deviation.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:51 +02:00
Xiubo Li
8ecd34c797 ceph: simplify the metrics struct
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29 00:15:51 +02:00
Jeff Layton
4364c6938d ceph: make ceph_queue_cap_snap static
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-28 23:49:25 +02:00
Wei Yongjun
675d4d8997 ceph: make ceph_netfs_read_ops static
The sparse tool complains as follows:

fs/ceph/addr.c:316:37: warning:
 symbol 'ceph_netfs_read_ops' was not declared. Should it be static?

This symbol is not used outside of addr.c, so mark it static.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-28 23:49:24 +02:00
Jeff Layton
22d41cdcd3 ceph: remove bogus checks and WARN_ONs from ceph_set_page_dirty
The checks for page->mapping are odd, as set_page_dirty is an
address_space operation, and I don't see where it would be called on a
non-pagecache page.

The warning about the page lock also seems bogus.  The comment over
set_page_dirty() says that it can be called without the page lock in
some rare cases. I don't think we want to warn if that's the case.

Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-28 23:49:24 +02:00
Jeff Layton
7a971e2c07 ceph: fix error handling in ceph_atomic_open and ceph_lookup
Commit aa60cfc3f7 broke the error handling in these functions such
that they don't handle non-ENOENT errors from ceph_mdsc_do_request
properly.

Move the checking of -ENOENT out of ceph_handle_snapdir and into the
callers, and if we get a different error, return it immediately.

Fixes: aa60cfc3f7 ("ceph: don't use d_add in ceph_handle_snapdir")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-22 13:53:28 +02:00
Jeff Layton
27171ae6a0 ceph: must hold snap_rwsem when filling inode for async create
...and add a lockdep assertion for it to ceph_fill_inode().

Cc: stable@vger.kernel.org # v5.7+
Fixes: 9a8d03ca2e ("ceph: attempt to do async create when possible")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-22 13:53:28 +02:00
Linus Torvalds
7ac86b3dca Notable items here are a series to take advantage of David Howells'
netfs helper library from Jeff, three new filesystem client metrics
 from Xiubo, ceph.dir.rsnaps vxattr from Yanhu and two auth-related
 fixes from myself, marked for stable.  Interspersed is a smattering
 of assorted fixes and cleanups across the filesystem.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmCT8IITHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzizgqCACYbyY4Yr/2C8fZsn+P9rd97zRTbcC6
 eufTZwnlECLnc89BxJQRk9a2UpDJfC8RMM3/9tmiulc8G4M+ggVbdFQTCzsZox3c
 vLAunGeVyfKIY+16Bv2RNuoO3KeeZm5aB3jXJ5QcUPcXmd4XnHKI1FU2ebC56UJb
 pxxfHpE6fb59r6Ek1e5uUFyta4KDMrvwXozghuAPEgT1GpKeA9zMIGI0CkQbBHlW
 PWHpcahTiT6GWa/d9ud0CnfssiBxVydWyKTz9xppYC6LNdsZUf9tBmYYGRklcjoA
 yAwPSuqxNmg+7uWubEawc0+a/3fXORgp2SF7Rbp1XYE+HpfnMF1J+nIn
 =IO5c
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.13-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "Notable items here are

   - a series to take advantage of David Howells' netfs helper library
     from Jeff

   - three new filesystem client metrics from Xiubo

   - ceph.dir.rsnaps vxattr from Yanhu

   - two auth-related fixes from myself, marked for stable.

  Interspersed is a smattering of assorted fixes and cleanups across the
  filesystem"

* tag 'ceph-for-5.13-rc1' of git://github.com/ceph/ceph-client: (24 commits)
  libceph: allow addrvecs with a single NONE/blank address
  libceph: don't set global_id until we get an auth ticket
  libceph: bump CephXAuthenticate encoding version
  ceph: don't allow access to MDS-private inodes
  ceph: fix up some bare fetches of i_size
  ceph: convert some PAGE_SIZE invocations to thp_size()
  ceph: support getting ceph.dir.rsnaps vxattr
  ceph: drop pinned_page parameter from ceph_get_caps
  ceph: fix inode leak on getattr error in __fh_to_dentry
  ceph: only check pool permissions for regular files
  ceph: send opened files/pinned caps/opened inodes metrics to MDS daemon
  ceph: avoid counting the same request twice or more
  ceph: rename the metric helpers
  ceph: fix kerneldoc copypasta over ceph_start_io_direct
  ceph: use attach/detach_page_private for tracking snap context
  ceph: don't use d_add in ceph_handle_snapdir
  ceph: don't clobber i_snap_caps on non-I_NEW inode
  ceph: fix fall-through warnings for Clang
  ceph: convert ceph_readpages to ceph_readahead
  ceph: convert ceph_write_begin to netfs_write_begin
  ...
2021-05-06 10:27:02 -07:00
Jeff Layton
d4f6b31d72 ceph: don't allow access to MDS-private inodes
The MDS reserves a set of inodes for its own usage, and these should
never be accessible to clients. Add a new helper to vet a proposed
inode number against that range, and complain loudly and refuse to
create or look it up if it's in it.

Also, ensure that the MDS doesn't try to delegate inodes that are in
that range or lower. Print a warning if it does, and don't save the
range in the xarray.

URL: https://tracker.ceph.com/issues/49922
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
2d6795fbb8 ceph: fix up some bare fetches of i_size
We need to use i_size_read(), which properly handles the torn read
case on 32-bit arches.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
8ff2d290c8 ceph: convert some PAGE_SIZE invocations to thp_size()
Start preparing to allow the use of THPs in the pagecache with ceph by
making it use thp_size() in lieu of PAGE_SIZE in the appropriate places.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Yanhu Cao
e7f7295250 ceph: support getting ceph.dir.rsnaps vxattr
Add support for grabbing the rsnaps value out of the inode info in
traces, and exposing that via ceph.dir.rsnaps xattr.

Signed-off-by: Yanhu Cao <gmayyyha@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
e72968e15b ceph: drop pinned_page parameter from ceph_get_caps
All of the existing callers that don't set this to NULL just drop the
page reference at some arbitrary point later in processing. There's no
point in keeping a page reference that we don't use, so just drop the
reference immediately after checking the Uptodate flag.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
1775c7ddac ceph: fix inode leak on getattr error in __fh_to_dentry
Fixes: 878dabb641 ("ceph: don't return -ESTALE if there's still an open file")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
e9b2250156 ceph: only check pool permissions for regular files
There is no need to do a ceph_pool_perm_check() on anything that isn't a
regular file, as the MDS is what handles talking to the OSD in those
cases. Just return 0 if it's not a regular file.

Reported-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Xiubo Li
3d8b6987a2 ceph: send opened files/pinned caps/opened inodes metrics to MDS daemon
For the old ceph version, if it received this metric info, it will just
ignore them.

URL: https://tracker.ceph.com/issues/46866
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Xiubo Li
fbd47ddc5e ceph: avoid counting the same request twice or more
If the request will retry, skip updating the latency metric.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Xiubo Li
8ae99ae2b4 ceph: rename the metric helpers
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
54b026b456 ceph: fix kerneldoc copypasta over ceph_start_io_direct
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
379fc7fad0 ceph: use attach/detach_page_private for tracking snap context
There is some ambiguity around the use of PagePrivate. It's
generally expected in core code that if PagePrivate is set then
you have a reference to it. It's not clear that ceph always
does (and I believe it may not).

Change ceph to use attach/detach_page_private so that we keep a
reference to the page until the snap context is detached.

Link: https://lore.kernel.org/ceph-devel/2503810.1616508988@warthog.procyon.org.uk/T/#mf29e5abbb0ec8035cde0de30778690de7d956f84
Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
aa60cfc3f7 ceph: don't use d_add in ceph_handle_snapdir
It's possible ceph_get_snapdir could end up finding a (disconnected)
inode that already exists in the cache. Change the prototype for
ceph_handle_snapdir to return a dentry pointer and have it use
d_splice_alias so we don't end up with an aliased dentry in the cache.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:23 +02:00
Jeff Layton
d3c51ae1b8 ceph: don't clobber i_snap_caps on non-I_NEW inode
We want the snapdir to mirror the non-snapped directory's attributes for
most things, but i_snap_caps represents the caps granted on the snapshot
directory by the MDS itself. A misbehaving MDS could issue different
caps for the snapdir and we lose them here.

Only reset i_snap_caps when the inode is I_NEW. Also, move the setting
of i_op and i_fop inside the if block since they should never change
anyway.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:22 +02:00
Gustavo A. R. Silva
fcaddb1d85 ceph: fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix a couple
of warnings by explicitly adding a break and a goto statements instead
of just letting the code fall through to the next case.

URL: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:22 +02:00
Jeff Layton
4987005600 ceph: convert ceph_readpages to ceph_readahead
Convert ceph_readpages to ceph_readahead and make it use
netfs_readahead. With this we can rip out a lot of the old
readpage/readpages infrastructure.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:22 +02:00
Jeff Layton
d801327d95 ceph: convert ceph_write_begin to netfs_write_begin
Convert ceph_write_begin to use the netfs_write_begin helper. Most of
the ops we need for it are already in place from the readpage conversion
but we do add a new check_write_begin op since ceph needs to be able to
vet whether there is an incompatible writeback already in flight before
reading in the page.

With this, we can also remove the old ceph_do_readpage helper.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:22 +02:00
Jeff Layton
f0702876e1 ceph: convert ceph_readpage to netfs_readpage
Have the ceph KConfig select NETFS_SUPPORT. Add a new netfs ops
structure and the operations for it. Convert ceph_readpage to use
the new netfs_readpage helper.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:22 +02:00
Jeff Layton
10a7052c78 ceph: fix fscache invalidation
Ensure that we invalidate the fscache whenever we invalidate the
pagecache.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:22 +02:00
Jeff Layton
7c46b31809 ceph: rework PageFsCache handling
With the new fscache API, the PageFsCache bit now indicates that the
page is being written to the cache and shouldn't be modified or released
until it's finished.

Change releasepage and invalidatepage to wait on that bit before
returning.

Also define FSCACHE_USE_NEW_IO_API so that we opt into the new fscache
API.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:21 +02:00
Jeff Layton
e7df4524cd ceph: rip out old fscache readpage handling
With the new netfs read helper functions, we won't need a lot of this
infrastructure as it handles the pagecache pages itself. Rip out the
read handling for now, and much of the old infrastructure that deals in
individual pages.

The cookie handling is mostly unchanged, however.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27 23:52:21 +02:00
Jeff Layton
ed94f87c2b ceph: don't allow type or device number to change on non-I_NEW inodes
Al pointed out that a malicious or broken MDS could change the type or
device number of a given inode number. It may also be possible for the
MDS to reuse an old inode number.

Ensure that we never allow fill_inode to change the type part of the
i_mode or the i_rdev unless I_NEW is set. Throw warnings if the MDS ever
changes these on us mid-stream, and return an error.

Don't set i_rdev directly, and rely on init_special_inode to do it.
Also, fix up error handling in the callers of ceph_get_inode.

In handle_cap_grant, check for and warn if the inode type changes, and
only overwrite the mode if it didn't.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-03-08 10:19:37 -05:00
Jeff Layton
3e10a15ffc ceph: fix up error handling with snapdirs
There are several warts in the snapdir error handling. The -EOPNOTSUPP
return in __snapfh_to_dentry is currently lost, and the call to
ceph_handle_snapdir is not currently checked at all.

Fix all of this up and eliminate a BUG_ON in ceph_get_snapdir. We can
handle that case with a warning and return an error.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-03-08 10:19:36 -05:00
Linus Torvalds
7d6beb71da idmapped-mounts-v5.12
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
 ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
 4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
 =yPaw
 -----END PGP SIGNATURE-----

Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull idmapped mounts from Christian Brauner:
 "This introduces idmapped mounts which has been in the making for some
  time. Simply put, different mounts can expose the same file or
  directory with different ownership. This initial implementation comes
  with ports for fat, ext4 and with Christoph's port for xfs with more
  filesystems being actively worked on by independent people and
  maintainers.

  Idmapping mounts handle a wide range of long standing use-cases. Here
  are just a few:

   - Idmapped mounts make it possible to easily share files between
     multiple users or multiple machines especially in complex
     scenarios. For example, idmapped mounts will be used in the
     implementation of portable home directories in
     systemd-homed.service(8) where they allow users to move their home
     directory to an external storage device and use it on multiple
     computers where they are assigned different uids and gids. This
     effectively makes it possible to assign random uids and gids at
     login time.

   - It is possible to share files from the host with unprivileged
     containers without having to change ownership permanently through
     chown(2).

   - It is possible to idmap a container's rootfs and without having to
     mangle every file. For example, Chromebooks use it to share the
     user's Download folder with their unprivileged containers in their
     Linux subsystem.

   - It is possible to share files between containers with
     non-overlapping idmappings.

   - Filesystem that lack a proper concept of ownership such as fat can
     use idmapped mounts to implement discretionary access (DAC)
     permission checking.

   - They allow users to efficiently changing ownership on a per-mount
     basis without having to (recursively) chown(2) all files. In
     contrast to chown (2) changing ownership of large sets of files is
     instantenous with idmapped mounts. This is especially useful when
     ownership of a whole root filesystem of a virtual machine or
     container is changed. With idmapped mounts a single syscall
     mount_setattr syscall will be sufficient to change the ownership of
     all files.

   - Idmapped mounts always take the current ownership into account as
     idmappings specify what a given uid or gid is supposed to be mapped
     to. This contrasts with the chown(2) syscall which cannot by itself
     take the current ownership of the files it changes into account. It
     simply changes the ownership to the specified uid and gid. This is
     especially problematic when recursively chown(2)ing a large set of
     files which is commong with the aforementioned portable home
     directory and container and vm scenario.

   - Idmapped mounts allow to change ownership locally, restricting it
     to specific mounts, and temporarily as the ownership changes only
     apply as long as the mount exists.

  Several userspace projects have either already put up patches and
  pull-requests for this feature or will do so should you decide to pull
  this:

   - systemd: In a wide variety of scenarios but especially right away
     in their implementation of portable home directories.

         https://systemd.io/HOME_DIRECTORY/

   - container runtimes: containerd, runC, LXD:To share data between
     host and unprivileged containers, unprivileged and privileged
     containers, etc. The pull request for idmapped mounts support in
     containerd, the default Kubernetes runtime is already up for quite
     a while now: https://github.com/containerd/containerd/pull/4734

   - The virtio-fs developers and several users have expressed interest
     in using this feature with virtual machines once virtio-fs is
     ported.

   - ChromeOS: Sharing host-directories with unprivileged containers.

  I've tightly synced with all those projects and all of those listed
  here have also expressed their need/desire for this feature on the
  mailing list. For more info on how people use this there's a bunch of
  talks about this too. Here's just two recent ones:

      https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
      https://fosdem.org/2021/schedule/event/containers_idmap/

  This comes with an extensive xfstests suite covering both ext4 and
  xfs:

      https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

  It covers truncation, creation, opening, xattrs, vfscaps, setid
  execution, setgid inheritance and more both with idmapped and
  non-idmapped mounts. It already helped to discover an unrelated xfs
  setgid inheritance bug which has since been fixed in mainline. It will
  be sent for inclusion with the xfstests project should you decide to
  merge this.

  In order to support per-mount idmappings vfsmounts are marked with
  user namespaces. The idmapping of the user namespace will be used to
  map the ids of vfs objects when they are accessed through that mount.
  By default all vfsmounts are marked with the initial user namespace.
  The initial user namespace is used to indicate that a mount is not
  idmapped. All operations behave as before and this is verified in the
  testsuite.

  Based on prior discussions we want to attach the whole user namespace
  and not just a dedicated idmapping struct. This allows us to reuse all
  the helpers that already exist for dealing with idmappings instead of
  introducing a whole new range of helpers. In addition, if we decide in
  the future that we are confident enough to enable unprivileged users
  to setup idmapped mounts the permission checking can take into account
  whether the caller is privileged in the user namespace the mount is
  currently marked with.

  The user namespace the mount will be marked with can be specified by
  passing a file descriptor refering to the user namespace as an
  argument to the new mount_setattr() syscall together with the new
  MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
  of extensibility.

  The following conditions must be met in order to create an idmapped
  mount:

   - The caller must currently have the CAP_SYS_ADMIN capability in the
     user namespace the underlying filesystem has been mounted in.

   - The underlying filesystem must support idmapped mounts.

   - The mount must not already be idmapped. This also implies that the
     idmapping of a mount cannot be altered once it has been idmapped.

   - The mount must be a detached/anonymous mount, i.e. it must have
     been created by calling open_tree() with the OPEN_TREE_CLONE flag
     and it must not already have been visible in the filesystem.

  The last two points guarantee easier semantics for userspace and the
  kernel and make the implementation significantly simpler.

  By default vfsmounts are marked with the initial user namespace and no
  behavioral or performance changes are observed.

  The manpage with a detailed description can be found here:

      1d7b902e28

  In order to support idmapped mounts, filesystems need to be changed
  and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
  patches to convert individual filesystem are not very large or
  complicated overall as can be seen from the included fat, ext4, and
  xfs ports. Patches for other filesystems are actively worked on and
  will be sent out separately. The xfstestsuite can be used to verify
  that port has been done correctly.

  The mount_setattr() syscall is motivated independent of the idmapped
  mounts patches and it's been around since July 2019. One of the most
  valuable features of the new mount api is the ability to perform
  mounts based on file descriptors only.

  Together with the lookup restrictions available in the openat2()
  RESOLVE_* flag namespace which we added in v5.6 this is the first time
  we are close to hardened and race-free (e.g. symlinks) mounting and
  path resolution.

  While userspace has started porting to the new mount api to mount
  proper filesystems and create new bind-mounts it is currently not
  possible to change mount options of an already existing bind mount in
  the new mount api since the mount_setattr() syscall is missing.

  With the addition of the mount_setattr() syscall we remove this last
  restriction and userspace can now fully port to the new mount api,
  covering every use-case the old mount api could. We also add the
  crucial ability to recursively change mount options for a whole mount
  tree, both removing and adding mount options at the same time. This
  syscall has been requested multiple times by various people and
  projects.

  There is a simple tool available at

      https://github.com/brauner/mount-idmapped

  that allows to create idmapped mounts so people can play with this
  patch series. I'll add support for the regular mount binary should you
  decide to pull this in the following weeks:

  Here's an example to a simple idmapped mount of another user's home
  directory:

	u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

	u1001@f2-vm:/$ ls -al /home/ubuntu/
	total 28
	drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
	drwxr-xr-x 4 root   root   4096 Oct 28 04:00 ..
	-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
	-rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
	-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
	-rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
	-rw-r--r-- 1 ubuntu ubuntu    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ ls -al /mnt/
	total 28
	drwxr-xr-x  2 u1001 u1001 4096 Oct 28 22:07 .
	drwxr-xr-x 29 root  root  4096 Oct 28 22:01 ..
	-rw-------  1 u1001 u1001 3154 Oct 28 22:12 .bash_history
	-rw-r--r--  1 u1001 u1001  220 Feb 25  2020 .bash_logout
	-rw-r--r--  1 u1001 u1001 3771 Feb 25  2020 .bashrc
	-rw-r--r--  1 u1001 u1001  807 Feb 25  2020 .profile
	-rw-r--r--  1 u1001 u1001    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw-------  1 u1001 u1001 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ touch /mnt/my-file

	u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

	u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

	u1001@f2-vm:/$ ls -al /mnt/my-file
	-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

	u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
	-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

	u1001@f2-vm:/$ getfacl /mnt/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: mnt/my-file
	# owner: u1001
	# group: u1001
	user::rw-
	user:u1001:rwx
	group::rw-
	mask::rwx
	other::r--

	u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: home/ubuntu/my-file
	# owner: ubuntu
	# group: ubuntu
	user::rw-
	user:ubuntu:rwx
	group::rw-
	mask::rwx
	other::r--"

* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
  xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
  xfs: support idmapped mounts
  ext4: support idmapped mounts
  fat: handle idmapped mounts
  tests: add mount_setattr() selftests
  fs: introduce MOUNT_ATTR_IDMAP
  fs: add mount_setattr()
  fs: add attr_flags_to_mnt_flags helper
  fs: split out functions to hold writers
  namespace: only take read lock in do_reconfigure_mnt()
  mount: make {lock,unlock}_mount_hash() static
  namespace: take lock_mount_hash() directly when changing flags
  nfs: do not export idmapped mounts
  overlayfs: do not mount on top of idmapped mounts
  ecryptfs: do not mount on top of idmapped mounts
  ima: handle idmapped mounts
  apparmor: handle idmapped mounts
  fs: make helpers idmap mount aware
  exec: handle idmapped mounts
  would_dump: handle idmapped mounts
  ...
2021-02-23 13:39:45 -08:00
Xiubo Li
558b4510f6 ceph: defer flushing the capsnap if the Fb is used
If the Fb cap is used it means the current inode is flushing the
dirty data to OSD, just defer flushing the capsnap.

URL: https://tracker.ceph.com/issues/48640
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-02-16 12:09:52 +01:00
Jeff Layton
a8810cdc00 ceph: allow queueing cap/snap handling after putting cap references
Testing with the fscache overhaul has triggered some lockdep warnings
about circular lock dependencies involving page_mkwrite and the
mmap_lock. It'd be better to do the "real work" without the mmap lock
being held.

Change the skip_checking_caps parameter in __ceph_put_cap_refs to an
enum, and use that to determine whether to queue check_caps, do it
synchronously or not at all. Change ceph_page_mkwrite to do a
ceph_put_cap_refs_async().

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-02-16 12:09:51 +01:00
Jeff Layton
64f28c627a ceph: clean up inode work queueing
Add a generic function for taking an inode reference, setting the I_WORK
bit and queueing i_work. Turn the ceph_queue_* functions into static
inline wrappers that pass in the right bit.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-02-16 12:09:51 +01:00
Jeff Layton
64f36da562 ceph: fix flush_snap logic after putting caps
A primary reason for skipping ceph_check_caps after putting the
references was to avoid the locking in ceph_check_caps during a
reconnect. __ceph_put_cap_refs can still call ceph_flush_snaps in that
case though, and that takes many of the same inconvenient locks.

Fix the logic in __ceph_put_cap_refs to skip flushing snaps when the
skip_checking_caps flag is set.

Fixes: e64f44a884 ("ceph: skip checking caps when session reconnecting and releasing reqs")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-02-16 12:09:51 +01:00
Christian Brauner
549c729771
fs: make helpers idmap mount aware
Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:20 +01:00
Christian Brauner
0d56a4518d
stat: handle idmapped mounts
The generic_fillattr() helper fills in the basic attributes associated
with an inode. Enable it to handle idmapped mounts. If the inode is
accessed through an idmapped mount map it into the mount's user
namespace before we store the uid and gid. If the initial user namespace
is passed nothing changes so non-idmapped mounts will see identical
behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:17 +01:00
Christian Brauner
e65ce2a50c
acl: handle idmapped mounts
The posix acl permission checking helpers determine whether a caller is
privileged over an inode according to the acls associated with the
inode. Add helpers that make it possible to handle acls on idmapped
mounts.

The vfs and the filesystems targeted by this first iteration make use of
posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
translate basic posix access and default permissions such as the
ACL_USER and ACL_GROUP type according to the initial user namespace (or
the superblock's user namespace) to and from the caller's current user
namespace. Adapt these two helpers to handle idmapped mounts whereby we
either map from or into the mount's user namespace depending on in which
direction we're translating.
Similarly, cap_convert_nscap() is used by the vfs to translate user
namespace and non-user namespace aware filesystem capabilities from the
superblock's user namespace to the caller's user namespace. Enable it to
handle idmapped mounts by accounting for the mount's user namespace.

In addition the fileystems targeted in the first iteration of this patch
series make use of the posix_acl_chmod() and, posix_acl_update_mode()
helpers. Both helpers perform permission checks on the target inode. Let
them handle idmapped mounts. These two helpers are called when posix
acls are set by the respective filesystems to handle this case we extend
the ->set() method to take an additional user namespace argument to pass
the mount's user namespace down.

Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:17 +01:00
Christian Brauner
2f221d6f7b
attr: handle idmapped mounts
When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Christian Brauner
47291baa8d
namei: make permission helpers idmapped mount aware
The two helpers inode_permission() and generic_permission() are used by
the vfs to perform basic permission checking by verifying that the
caller is privileged over an inode. In order to handle idmapped mounts
we extend the two helpers with an additional user namespace argument.
On idmapped mounts the two helpers will make sure to map the inode
according to the mount's user namespace and then peform identical
permission checks to inode_permission() and generic_permission(). If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Ilya Dryomov
4972cf605f libceph, ceph: disambiguate ceph_connection_operations handlers
Since a few years, kernel addresses are no longer included in oops
dumps, at least on x86.  All we get is a symbol name with offset and
size.

This is a problem for ceph_connection_operations handlers, especially
con->ops->dispatch().  All three handlers have the same name and there
is little context to disambiguate between e.g. monitor and OSD clients
because almost everything is inlined.  gdb sneakily stops at the first
matching symbol, so one has to resort to nm and addr2line.

Some of these are already prefixed with mon_, osd_ or mds_.  Let's do
the same for all others.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
2021-01-04 17:31:32 +01:00
Ilya Dryomov
60267ba35c ceph: reencode gid_list when reconnecting
On reconnect, cap and dentry releases are dropped and the fields
that follow must be reencoded into the freed space.  Currently these
are timestamp and gid_list, but gid_list isn't reencoded.  This
results in

  failed to decode message of type 24 v4: End of buffer

errors on the MDS.

While at it, make a change to encode gid_list unconditionally,
without regard to what head/which version was used as a result
of checking whether CEPH_FEATURE_FS_BTIME is supported or not.

URL: https://tracker.ceph.com/issues/48618
Fixes: 4f1ddb1ea8 ("ceph: implement updated ceph_mds_request_head structure")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2020-12-28 20:34:32 +01:00
Ilya Dryomov
ce287162d9 libceph, ceph: make use of __ceph_auth_get_authorizer() in msgr1
This shouldn't cause any functional changes.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
cd1a677cad libceph, ceph: implement msgr2.1 protocol (crc and secure modes)
Implement msgr2.1 wire protocol, available since nautilus 14.2.11
and octopus 15.2.5.  msgr2.0 wire protocol is not implemented -- it
has several security, integrity and robustness issues and therefore
considered deprecated.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
a5cbd5fc22 libceph, ceph: get and handle cluster maps with addrvecs
In preparation for msgr2, make the cluster send us maps with addrvecs
including both LEGACY and MSGR2 addrs instead of a single LEGACY addr.
This means advertising support for SERVER_NAUTILUS and also some older
features: SERVER_MIMIC, MONENC and MONNAMES.

MONNAMES and MONENC are actually pre-argonaut, we just never updated
ceph_monmap_decode() for them.  Decoding is unconditional, see commit
23c625ce30 ("libceph: assume argonaut on the server side").

SERVER_MIMIC doesn't bear any meaning for the kernel client.

Since ceph_decode_entity_addrvec() is guarded by encoding version
checks (and in msgr2 case it is guarded implicitly by the fact that
server is speaking msgr2), we assume MSG_ADDR2 for it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
285ea34fc8 libceph, ceph: incorporate nautilus cephx changes
- request service tickets together with auth ticket.  Currently we get
  auth ticket via CEPHX_GET_AUTH_SESSION_KEY op and then request service
  tickets via CEPHX_GET_PRINCIPAL_SESSION_KEY op in a separate message.
  Since nautilus, desired service tickets are shared togther with auth
  ticket in CEPHX_GET_AUTH_SESSION_KEY reply.

- propagate session key and connection secret, if any.  In preparation
  for msgr2, update handle_reply() and verify_authorizer_reply() auth
  ops to propagate session key and connection secret.  Since nautilus,
  if secure mode is negotiated, connection secret is shared either in
  CEPHX_GET_AUTH_SESSION_KEY reply (for mons) or in a final authorizer
  reply (for osds and mdses).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Jeff Layton
4f1ddb1ea8 ceph: implement updated ceph_mds_request_head structure
When we added the btime feature in mainline ceph, we had to extend
struct ceph_mds_request_args so that it could be set. Implement the same
in the kernel client.

Rename ceph_mds_request_head with a _old extension, and a union
ceph_mds_request_args_ext to allow for the extended size of the new
header format.

Add the appropriate code to handle both formats in struct
create_request_message and key the behavior on whether the peer supports
CEPH_FEATURE_FS_BTIME.

The gid_list field in the payload is now populated from the saved
credential. For now, we don't add any support for setting the btime via
setattr, but this does enable us to add that in the future.

[ idryomov: break unnecessarily long lines ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Jeff Layton
396bd62c69 ceph: clean up argument lists to __prepare_send_request and __send_request
We can always get the mdsc from the session, so there's no need to pass
it in as a separate argument. Pass the session to __prepare_send_request
as well, to prepare for later patches that will need to access it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Jeff Layton
7fe0cdeb0f ceph: take a cred reference instead of tracking individual uid/gid
Replace req->r_uid/r_gid with an r_cred pointer and take a reference to
that at the point where we previously would sample the two.  Use that to
populate the uid and gid in the header and release the reference when
the request is freed.

This should enable us to later add support for sending supplementary
group lists in MDS requests.

[ idryomov: break unnecessarily long lines ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Jeff Layton
0f51a98361 ceph: don't reach into request header for readdir info
We already have a pointer to the argument struct in req->r_args. Use that
instead of groveling around in the ceph_mds_request_head.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Xiubo Li
968cd14edc ceph: set osdmap epoch for setxattr
When setting the file/dir layout, it may need data pool info. So
in mds server, it needs to check the osdmap. At present, if mds
doesn't find the data pool specified, it will try to get the latest
osdmap. Now if pass the osd epoch for setxattr, the mds server can
only check this epoch of osdmap.

URL: https://tracker.ceph.com/issues/48504
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00